Build Your Own Simple Data Pipeline with Python and Docker

Learn how to develop a simple data pipeline and execute it easily.




Data is the asset that drives our work as data professionals. Without proper data, we cannot perform our tasks, and our business will fail to gain a competitive advantage. Thus, securing suitable data is crucial for any data professional, and data pipelines are the systems designed for this purpose.

Data pipelines are systems designed to move and transform data from one source to another. These systems are part of the overall infrastructure for any business that relies on data, as they help ensure that our data is reliable and ready to use.

Building a data pipeline may sound complex, but a few simple tools are sufficient to create reliable data pipelines with just a few lines of code. In this article, we will explore how to build a straightforward data pipeline using Python and Docker that you can apply in your everyday data work.

Let’s get into it.

 

Building the Data Pipeline

 
Before we build our data pipeline, let’s understand the concept of ETL, which stands for Extract, Transform, and Load. ETL is a process where the data pipeline performs the following actions:

  • Extract data from various sources. 
  • Transform data into a valid format. 
  • Load data into an accessible storage location.

ETL is a standard pattern for data pipelines, so what we build will follow this structure. 
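To make the pattern concrete before we bring in pandas or Docker, here is a minimal, dependency-free sketch of the ETL shape. The records, field names, and values are purely illustrative and are not part of the dataset we use later:

# A minimal, dependency-free sketch of the ETL shape (illustrative only).

def extract():
    # Pretend these records came from a CSV file, an API, or a database.
    return [{"name": " Alice ", "age": 34}, {"name": "Bob", "age": None}]

def transform(records):
    # Keep only complete records and normalize the text fields.
    return [
        {"name": r["name"].strip(), "age": r["age"]}
        for r in records
        if r["age"] is not None
    ]

def load(records):
    # In a real pipeline this would write to a file, database, or warehouse.
    print(f"Loaded {len(records)} clean record(s).")

load(transform(extract()))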

With Python and Docker, we can build a data pipeline around the ETL process with a simple setup. Python is a valuable tool for orchestrating any data flow activity, while Docker is useful for managing the data pipeline application's environment using containers.

Let’s set up our data pipeline with Python and Docker. 

 

Step 1: Preparation

First, we must ensure that we have Python and Docker installed on our system (we will not cover the installation here).
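A quick way to confirm that both tools are available is to check their versions from a terminal (the exact version numbers will differ on your machine):

python --version
docker --version
docker compose version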

For our example, we will use the heart attack dataset from Kaggle as the data source to develop our ETL process.
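Before wiring up the pipeline, it can help to peek at the raw file with pandas. This quick check is optional and assumes you have already downloaded the CSV into the data folder described below:

import pandas as pd

# Optional sanity check on the raw dataset before building the pipeline.
df = pd.read_csv("data/Medicaldataset.csv")
print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # raw column names before cleaning
print(df.head())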

With everything in place, we will prepare the project structure. Overall, the simple data pipeline will have the following skeleton:

simple-data-pipeline/
├── app/
│   └── pipeline.py
├── data/
│   └── Medicaldataset.csv
├── Dockerfile
├── requirements.txt
└── docker-compose.yml

 

There is a main folder called simple-data-pipeline, which contains:

  • An app folder containing the pipeline.py file.
  • A data folder containing the source data (Medicaldataset.csv).
  • The requirements.txt file for environment dependencies.
  • The Dockerfile for the Docker configuration.
  • The docker-compose.yml file to define and run our multi-container Docker application.

We will first fill out the requirements.txt file, which contains the libraries required for our project.

In this case, we will only use the following library:

pandas
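If you also want to run the script outside Docker while developing, you can install the dependency locally (optional; inside the container this happens automatically via the Dockerfile):

pip install -r requirements.txt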

 

In the next section, we will set up the data pipeline using our sample data.

 

Step 2: Set up the Pipeline

We will set up the Python pipeline.py file for the ETL process. In our case, we will use the following code.

import pandas as pd
import os

input_path = os.path.join("/data", "Medicaldataset.csv")
output_path = os.path.join("/data", "CleanedMedicalData.csv")

def extract_data(path):
    df = pd.read_csv(path)
    print("Data Extraction completed.")
    return df

def transform_data(df):
    df_cleaned = df.dropna()
    df_cleaned.columns = [col.strip().lower().replace(" ", "_") for col in df_cleaned.columns]
    print("Data Transformation completed.")
    return df_cleaned

def load_data(df, output_path):
    df.to_csv(output_path, index=False)
    print("Data Loading completed.")

def run_pipeline():
    df_raw = extract_data(input_path)
    df_cleaned = transform_data(df_raw)
    load_data(df_cleaned, output_path)
    print("Data pipeline completed successfully.")

if __name__ == "__main__":
    run_pipeline()

 

The pipeline follows the ETL process: we read the source CSV file (extract), drop rows with missing data and normalize the column names (transform), and write the cleaned data to a new CSV file (load). These steps are wrapped in a single run_pipeline function that executes the entire process.
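If you want to sanity-check the transformation step in isolation before containerizing anything, you can call transform_data on a tiny hand-made DataFrame. This requires pandas installed locally and should be run from inside the app folder; the column names and values below are made up for illustration:

import pandas as pd
from pipeline import transform_data  # run from inside the app/ folder

# A tiny, made-up frame with one missing value and messy column names.
sample = pd.DataFrame({" Age ": [63, 45, None], "Heart Rate": [88, 72, 95]})

cleaned = transform_data(sample)
print(cleaned.columns.tolist())  # ['age', 'heart_rate']
print(len(cleaned))              # 2 rows remain after dropping the missing value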

 

Step 3: Set up the Dockerfile

With the Python pipeline file ready, we will fill in the Dockerfile to set up the configuration for the Docker container using the following code:

FROM python:3.10-slim

WORKDIR /app

COPY ./app /app
COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

CMD ["python", "pipeline.py"]

 

In the code above, we specify that the container will use Python version 3.10 as its environment. Next, we set the container's working directory to /app and copy everything from our local app folder into the container's app directory. We also copy the requirements.txt file and execute the pip installation within the container. Finally, we specify the command to run the Python script when the container starts.
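If you prefer to test the image without Compose, you can build and run the same container directly. The image tag here is arbitrary, and the volume flag mirrors the mount we define in docker-compose.yml below:

docker build -t simple-data-pipeline .
docker run --rm -v "$(pwd)/data:/data" simple-data-pipeline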

With the Dockerfile ready, we will prepare the docker-compose.yml file to manage the overall execution:

version: '3.9'

services:
  data-pipeline:
    build: .
    container_name: simple_pipeline_container
    volumes:
      - ./data:/data

 

The YAML file above, when executed, will build the Docker image from the current directory using the available Dockerfile. We also mount the local data folder to the /data folder inside the container, making the dataset accessible to our script.
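Before running anything, you can optionally ask Compose to validate and print the resolved configuration, which is a quick way to catch indentation mistakes in the YAML:

docker compose config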

 

Executing the Pipeline

 
With all the files ready, we will execute the data pipeline in Docker. Go to the project root folder and run the following command in your command prompt to build the Docker image and execute the pipeline.

docker compose up --build

 

If you run this successfully, you will see an informational log like the following:

 ✔ data-pipeline                           Built                                                 0.0s
 ✔ Network simple_docker_pipeline_default  Created                                               0.4s
 ✔ Container simple_pipeline_container     Created                                               0.4s
Attaching to simple_pipeline_container
simple_pipeline_container  | Data Extraction completed.
simple_pipeline_container  | Data Transformation completed.
simple_pipeline_container  | Data Loading completed.
simple_pipeline_container  | Data pipeline completed successfully.
simple_pipeline_container exited with code 0

 

If everything is executed successfully, you will see a new CleanedMedicalData.csv file in your data folder.
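As a final check, you can load the generated file from the host and confirm that the column names were normalized and no missing values remain. This snippet assumes you run it from the project root with pandas installed locally:

import pandas as pd

# Verify the pipeline output written to the mounted data folder.
cleaned = pd.read_csv("data/CleanedMedicalData.csv")
print(cleaned.columns.tolist())    # cleaned, snake_case column names
print(cleaned.isna().sum().sum())  # should print 0, since rows with NaN were dropped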

Congratulations! You have just created a simple data pipeline with Python and Docker. Try using various data sources and ETL processes to see if you can handle a more complex pipeline.

 

Conclusion

 
Understanding data pipelines is crucial for every data professional, as they are essential for acquiring the right data for their work. In this article, we explored how to build a simple data pipeline using Python and Docker and learned how to execute it.

I hope this has helped!
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing. Cornellius writes on a variety of AI and machine learning topics.





