etl-pipeline

Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.

python data-science machine-learning etl pandas orchestration data-engineering data-analysis software-engineering feature-engineering dataframe hacktoberfest dag lineage etl-framework etl-pipeline rag mlops llmops

Updated Dec 6, 2025
Jupyter Notebook

AlexIoannides/pyspark-example-project

Star 2k

Implementing best practices for PySpark ETL jobs and applications.

python data-science spark etl pyspark data-engineering etl-pipeline etl-job

Updated Jan 1, 2023
Python

san089/Udacity-Data-Engineering-Projects

Star 1.8k

A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.

infrastructure aws postgres data airflow cloudformation cassandra cluster aws-s3 aws-sdk data-warehouse data-engineering data-lake aws-ec2 postgresql-database data-modeling cassandra-database etl-pipeline data-engineering-pipeline airflow-operators

Updated Aug 26, 2022
Python

san089/goodreads_etl_pipeline

Star 1.4k

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

python airflow spark apache-spark scheduler s3 data-engineering data-lake warehouse redshift data-migration livy etl-framework apache-airflow emr-cluster etl-pipeline etl-job data-engineering-pipeline airflow-dag goodreads-data-pipeline

Updated Mar 9, 2020
Python

Open-Source-Legal/OpenContracts

Star 1.1k

Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

agent etl unstructured-data etl-pipeline vector-database llm prompt-engineering agentic-ai

Updated Dec 17, 2025
Python

stitchfix/hamilton

Star 860

A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton

python data-science machine-learning etl numpy pandas data-engineering data-platform software-engineering feature-engineering dataframe dag hamiltonian etl-framework hamilton featurization etl-pipeline stitch-fix

Updated Jul 3, 2023
Python

techascent/tech.ml.dataset

Star 734

A high-performance data processing system for Clojure

java machine-learning clojure csv xlsx datascience dataset dataframe etl-pipeline

Updated Dec 17, 2025
Clojure

SorellaLabs/brontes

Star 647

A blazingly fast, general-purpose blockchain analytics engine specialized in systematic MEV detection

rust ethereum evm etl-pipeline mev

Updated Jul 28, 2025
Rust

Pravko-Solutions/FlashLearn

Star 606

Integrate LLMs into any pipeline: fit/predict pattern, JSON-driven flows, and built-in concurrency support.

python ai concurrency ai-agents etl-pipeline llm llm-agent ai-agents-framework agentic-ai-development

Updated Mar 10, 2025
Python

YotpoLtd/metorikku

Star 586

A simplified, lightweight ETL Framework based on Apache Spark

scala sql big-data spark etl distributed-computing etl-framework etl-pipeline

Updated Jan 24, 2024
Scala

DataWithBaraa/sql-data-warehouse-project

Star 463

A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.

data-science sql sql-server etl data-warehouse sql-query datascience data-engineering data-analytics datawarehousing data-analysis sqlserver datawarehouse data-cleaning datalake data-warehousing etl-pipeline etl-job data-lakehouse medallion-architecture

Updated Apr 23, 2025
TSQL

unbody-io/unbody

Star 422

The Supabase of the AI era. A modular, open-source backend for building AI-native software, designed for knowledge rather than static data.

backend chatbot developer-tools knowledge-base data-ingestion etl-pipeline rag data-enhancement vector-database llm ai-native generative-ai agentic-ai supabase-alternative

Updated Jun 5, 2025
TypeScript

airscholar/e2e-data-engineering

Star 298

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

docker big-data cassandra apache-spark data-storage postgresql data-engineering apache-kafka data-processing data-pipeline real-time-analytics containerization apache-zookeeper apache-airflow etl-pipeline