data-lake
Here are 272 public repositories matching this topic...
lakeFS - Data version control for your data lake | Git for data
- Updated Mar 20, 2025 - Go
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
- Updated Mar 20, 2025 - Python
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
- Updated Mar 20, 2025 - Scala
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
- Updated Jan 1, 2024 - Java
A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.
- Updated Aug 26, 2022 - Python
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
- Updated Mar 9, 2020 - Python
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
- Updated Jan 12, 2023 - Java
Personal Data Engineering Projects
- Updated Feb 8, 2023 - Jupyter Notebook
Data API Framework for AI Agents and Data Apps
- Updated Jul 1, 2024 - TypeScript
Lakekeeper is an Apache-licensed, secure, fast, and easy-to-use Apache Iceberg REST Catalog written in Rust.
- Updated Mar 20, 2025 - Rust
Generic Data Ingestion & Dispersal Library for Hadoop
- Updated Mar 19, 2023 - Java
Enterprise-grade, production-hardened, serverless data lake on AWS
- Updated Mar 18, 2025 - Python
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
- Updated Nov 5, 2020 - Jupyter Notebook
Use SQL to build ELT pipelines on a data lakehouse.
- Updated May 25, 2022 - JavaScript
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
- Updated Mar 5, 2025 - Python
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
- Updated May 7, 2024 - C++
🤖 The semantic engine for LLMs, bringing semantic context to AI agents. 🔥
- Updated Mar 20, 2025 - Java
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
- Updated Mar 18, 2025 - Java