End-to-end ETL pipeline python

LugolBis/data-immo-python


data-immo is a high-performance ETL pipeline built in Rust to efficiently extract, transform, and load real estate transaction data from the DVF+ API into a Lakehouse (Dremio).

The project is designed with a focus on performance, reliability, and scalability, leveraging modern data engineering tools and practices.

🔍 Schema of the pipeline

```mermaid
flowchart LR
    A[**API** DVF+] -->|Extraction| B[**Rust**]
    B -->|Transformation| C[**Rust**]
    C -->|Loading| D[(**DuckDB**)]
    D -->|Saving cleaned data| E{**dbt**}
    E -->|Validation<br>& loading| F[(**Dremio**)]
    subgraph Extraction [Extract]
        A
    end
    subgraph TransformationRust [Transform]
        B
        C
    end
    subgraph LT [Load & Transform]
        D
    end
    subgraph VD [Validate Data quality]
        E
    end
    subgraph LakeHouse [LakeHouse]
        F
    end
    subgraph Docker [Docker Container]
        LakeHouse
    end
    style A fill:#000091,stroke:#ffffff,color:#ffffff,stroke-width:1px
    style B fill:#955a34,stroke:#000000,color:#000000,stroke-width:1px
    style C fill:#955a34,stroke:#000000,color:#000000,stroke-width:1px
    style D fill:#fef242,stroke:#000000,color:#000000,stroke-width:1px
    style E fill:#fc7053,stroke:#000000,color:#000000,stroke-width:1px
    style F fill:#31d3db,stroke:#ffffff,color:#ffffff,stroke-width:1px
    style Docker fill:#099cec
```

🚀 Features

  • Data Extraction

    • Fetches real estate transaction data from the DVF+ API.
    • Handles API rate limiting, retries, and efficient pagination using Rust’s concurrency model.
  • Data Transformation

    • Uses DuckDB to transform optimized Parquet data into structured, queryable formats.
    • Rust is used for additional transformations, data enrichment, and performance-critical operations (I/O, etc.).
  • Data Validation & Loading

    • dbt is used to validate, test, and model the data.
    • The cleaned and validated data is loaded into Dremio, enabling a Lakehouse architecture.
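
The retry and rate-limit handling described above can be sketched as a small exponential-backoff helper. This is a minimal std-only illustration, not the project's actual extraction code; `retry_with_backoff` and its parameters are hypothetical names:

```rust
use std::thread;
use std::time::Duration;

/// Retry `op` up to `max_retries` times with exponential backoff.
/// `op` returns Ok(T) on success or Err(String) on a transient failure
/// (e.g. an HTTP 429 from a rate-limited API).
fn retry_with_backoff<T, F>(mut op: F, max_retries: u32, base_delay_ms: u64) -> Result<T, String>
where
    F: FnMut() -> Result<T, String>,
{
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if attempt < max_retries => {
                // Back off: base_delay * 2^attempt (e.g. 100ms, 200ms, 400ms, ...)
                let delay = base_delay_ms * (1 << attempt);
                eprintln!("attempt {} failed ({}), retrying in {}ms", attempt + 1, e, delay);
                thread::sleep(Duration::from_millis(delay));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulate an endpoint that fails twice before succeeding.
    let mut calls = 0;
    let result = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("HTTP 429".to_string()) } else { Ok(calls) }
        },
        5,
        10,
    );
    assert_eq!(result, Ok(3));
    println!("succeeded after {} calls", calls);
}
```

In the real pipeline the closure would wrap an async HTTP call; the same backoff structure applies per page when paginating.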

🛠️ Tech Stack

  • Rust → Core language for API calls, transformations, and performance optimization.
  • DuckDB → In-process SQL engine for fast transformations of optimized Parquet datasets.
  • dbt → Data modeling, testing, and validation layer.
  • Dremio → Lakehouse platform for analytics and querying.
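
The dbt layer in the stack above typically declares its validation rules in a `schema.yml`. The model and column names below (`stg_transactions`, `transaction_id`, `price_eur`) are hypothetical, shown only to illustrate dbt's built-in `unique` and `not_null` tests:

```yaml
# models/schema.yml — hypothetical model and column names for illustration
version: 2

models:
  - name: stg_transactions
    description: "Cleaned DVF+ real estate transactions"
    columns:
      - name: transaction_id
        tests:
          - unique
          - not_null
      - name: price_eur
        tests:
          - not_null
```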

📂 Pipeline Overview

  1. Extract: Retrieve raw transaction data from DVF+ API.
  2. Stage: Store raw data as Parquet.
  3. Transform: Apply transformations using DuckDB and Rust.
  4. Validate & Model: Use dbt to ensure data quality and prepare final schemas.
  5. Load: Push validated datasets into Dremio for downstream analytics.
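
The five stages above can be sketched end to end as plain functions. Everything here is a stand-in: the `Transaction` type is hypothetical, and each function body only mimics the role of the real API call, Parquet/DuckDB transform, dbt validation, and Dremio load:

```rust
// Hypothetical record type standing in for a DVF+ transaction row.
#[derive(Debug, Clone, PartialEq)]
struct Transaction {
    id: u32,
    price_eur: f64,
}

// 1. Extract: the real pipeline calls the DVF+ API; here we return
//    fixed rows, including one invalid record.
fn extract() -> Vec<Transaction> {
    vec![
        Transaction { id: 1, price_eur: 250_000.0 },
        Transaction { id: 2, price_eur: -1.0 }, // invalid price
        Transaction { id: 3, price_eur: 310_500.0 },
    ]
}

// 2–3. Stage + Transform: drop invalid rows, standing in for the
//      Parquet staging and DuckDB/Rust transformation steps.
fn transform(rows: Vec<Transaction>) -> Vec<Transaction> {
    rows.into_iter().filter(|t| t.price_eur > 0.0).collect()
}

// 4. Validate & Model: the dbt layer's role, sketched as an invariant check.
fn validate(rows: &[Transaction]) -> bool {
    rows.iter().all(|t| t.price_eur > 0.0)
}

// 5. Load: the real pipeline pushes to Dremio; here we just count rows.
fn load(rows: Vec<Transaction>) -> usize {
    rows.len()
}

fn main() {
    let raw = extract();
    let clean = transform(raw);
    assert!(validate(&clean));
    let loaded = load(clean);
    println!("loaded {} validated rows", loaded);
}
```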
