data-lake

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

real-time big-data high-performance data-lake data-integration flink data-synchronization data-pipeline

Updated Jan 1, 2024
Java

san089/goodreads_etl_pipeline

Star 1.5k

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

python airflow spark apache-spark scheduler s3 data-engineering data-lake warehouse redshift data-migration livy etl-framework apache-airflow emr-cluster etl-pipeline etl-job data-engineering-pipeline airflow-dag goodreads-data-pipeline

Updated Mar 9, 2020
Python

lakekeeper/lakekeeper

Star 1.2k

Lakekeeper is an Apache-licensed, secure, fast, and easy-to-use Apache Iceberg REST Catalog written in Rust.

rust catalog data-lake iceberg lakehouse open-lakehouse lakehouse-governance

Updated Feb 20, 2026
Rust

Teradata/kylo

Star 1.1k

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

spark hadoop data-lake teradata nifi kylo

Updated Jan 12, 2023
Java

apache/amoro

Star 1.1k

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

big-data spark data-lake flink management-system iceberg trino hudi paimon lake-house self-optimizing

Updated Feb 20, 2026
Java

alanchn31/Data-Engineering-Projects

Star 992

Personal Data Engineering Projects

postgres airflow spark cassandra mongodb data-warehouse data-engineering data-lake scrapy data-modeling aws-redshift star-schema ingest-data data-engineering-nanodegree

Updated Feb 8, 2023
Jupyter Notebook

pixelsdb/pixels

Star 858

An efficient storage and compute engine for both on-prem and cloud-native data analytics.

database data-warehouse data-lake olap column-store cloud-database

Updated Feb 20, 2026
Java

Canner/vulcan-sql

Star 785

Data API Framework for AI Agents and Data Apps

bigquery typescript sql database ai analytics clickhouse reporting postgresql spreadsheet snowflake data-warehouse data-lake restful-api api-builder ksqldb duckdb ai-agent vulcan-sql vulcansql

Updated Jul 1, 2024
TypeScript

Canner/wren-engine

Star 555

🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥

agent semantic data sql ai mcp data-warehouse data-lake business-intelligence data-analytics data-analysis hacktoberfest semantic-layer llm agentic-ai mcp-server

Updated Feb 16, 2026
Java

uber/marmaray

Star 481

Generic Data Ingestion & Dispersal Library for Hadoop

spark hadoop data-lake avro-schema ingest-data schema-format

Updated Mar 19, 2023
Java

aws-solutions-library-samples/data-lakes-on-aws

Star 478

Enterprise-grade, production-hardened, serverless data lake on AWS

aws framework serverless etl analytics best-practices data-engineering iac data-lake lake-formation

Updated Oct 1, 2025
Python

kaiwaehner/hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference

Star 418

Real-time big data / IoT machine learning (model training and inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka; no additional data store like S3, HDFS or Spark required.

python java kubernetes mqtt cloud kafka mongodb tensorflow terraform gcp grpc data-lake confluent hivemq kafka-connect kafka-streams ksql ksqldb tiered-storage tensorflow-io

Updated Nov 5, 2020
Jupyter Notebook

gigapi/gigapi

Star 377

GigAPI is a Timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐

api golang sql database rest-api s3 data-lake olap parquet query-engine datalake clickhouse-server lakehouse duckdb qryn gigapipe duckdb-api fdap duckdb-server ducklake

Updated Oct 20, 2025
Go

cuebook/cuelake

Star 288

Use SQL to build ELT pipelines on a data lakehouse.

sql apache-spark etl pipelines data-engineering data-lake data-transfer delta data-integration upsert elt data-pipeline datalake data-ingestion spark-sql zeppelin-notebook apache-iceberg lakehouse incremental-updates

Updated May 25, 2022
JavaScript

maxi-k/btrblocks

Star 279

BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)

compression research databases data-lake

Updated Apr 7, 2025
C++

awslabs/amazon-s3-find-and-forget

Star 245

Amazon S3 Find and Forget is a solution for handling data erasure requests against data lakes stored on Amazon S3, for example, requests made pursuant to the European General Data Protection Regulation (GDPR).

aws data privacy big-data s3 data-lake parquet gdpr right-to-be-forgotten amazon-s3 data-erasure ccpa

Updated Jan 24, 2026
Python
