Learn how to develop a simple data pipeline and execute it easily.
Data is the asset that drives our work as data professionals. Without proper data, we cannot perform our tasks, and our business will fail to gain a competitive advantage. Thus, securing suitable data is crucial for any data professional, and data pipelines are the systems designed for this purpose.
Data pipelines are systems designed to move and transform data from one source to another. These systems are part of the overall infrastructure for any business that relies on data, as they guarantee that our data is reliable and always ready to use.
Building a data pipeline may sound complex, but a few simple tools are sufficient to create reliable data pipelines with just a few lines of code. In this article, we will explore how to build a straightforward data pipeline using Python and Docker that you can apply in your everyday data work.
Let’s get into it.
Before we build our data pipeline, let’s understand the concept of ETL, which stands for Extract, Transform, and Load. ETL is a process where the data pipeline performs the following actions:

- Extract data from various sources.
- Transform the data into a valid format.
- Load the data into an accessible storage location.
ETL is a standard pattern for data pipelines, so what we build will follow this structure.
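To make the pattern concrete, here is a minimal sketch (for illustration only, not the final script) of how the three steps typically map to Python functions; the function names and the pandas-based approach are assumptions for this example, and the actual pipeline we write later follows the same shape.

# Illustrative ETL skeleton only; the real pipeline.py comes later in the article.
import pandas as pd

def extract(source_path):
    # Extract: pull raw data from the source (a CSV file in this sketch).
    return pd.read_csv(source_path)

def transform(df):
    # Transform: clean or reshape the raw data (here, simply drop missing rows).
    return df.dropna()

def load(df, target_path):
    # Load: write the prepared data to its destination.
    df.to_csv(target_path, index=False)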
With Python and Docker, we can build a data pipeline around the ETL process with a simple setup. Python is a valuable tool for orchestrating any data flow activity, while Docker is useful for managing the data pipeline application's environment using containers.
Let’s set up our data pipeline with Python and Docker.
First, we must ensure that we have Python and Docker installed on our system (we will not cover this here).
For our example, we will use the heart attack dataset from Kaggle as the data source to develop our ETL process.
With everything in place, we will prepare the project structure. Overall, the simple data pipeline will have the following skeleton:
simple-data-pipeline/
├── app/
│   └── pipeline.py
├── data/
│   └── Medicaldataset.csv
├── Dockerfile
├── requirements.txt
└── docker-compose.yml
There is a main folder called simple-data-pipeline, which contains:

- An app folder containing the pipeline.py file.
- A data folder containing the source data (Medicaldataset.csv).
- A requirements.txt file for environment dependencies.
- A Dockerfile for the Docker configuration.
- A docker-compose.yml file to define and run our multi-container Docker application.

We will first fill out the requirements.txt file, which contains the libraries required for our project. In this case, we will only use the following library:

pandas
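The file above lists only the package name, so pip will install whichever version it resolves at build time. If you want reproducible builds, you might pin a version you have tested; the exact number below is only an example, not a requirement of this tutorial.

# requirements.txt with a pinned version (the version shown is an example)
pandas==2.2.2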
In the next section, we will set up the data pipeline using our sample data.
We will set up the Python pipeline.py file for the ETL process. In our case, we will use the following code:
import pandas as pd
import os

input_path = os.path.join("/data", "Medicaldataset.csv")
output_path = os.path.join("/data", "CleanedMedicalData.csv")

def extract_data(path):
    df = pd.read_csv(path)
    print("Data Extraction completed.")
    return df

def transform_data(df):
    df_cleaned = df.dropna()
    df_cleaned.columns = [col.strip().lower().replace(" ", "_") for col in df_cleaned.columns]
    print("Data Transformation completed.")
    return df_cleaned

def load_data(df, output_path):
    df.to_csv(output_path, index=False)
    print("Data Loading completed.")

def run_pipeline():
    df_raw = extract_data(input_path)
    df_cleaned = transform_data(df_raw)
    load_data(df_cleaned, output_path)
    print("Data pipeline completed successfully.")

if __name__ == "__main__":
    run_pipeline()
The pipeline follows the ETL process: we extract the data from the CSV file, perform transformations such as dropping missing values and cleaning the column names, and load the cleaned data into a new CSV file. We wrapped these steps into a single run_pipeline function that executes the entire process.
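If you want to sanity-check the transformation logic before containerizing it, you can run it against a small in-memory DataFrame. The snippet below is an optional local check, not part of the pipeline files; it assumes you run it from the app folder so that pipeline.py is importable, and the toy column names are made up.

# Optional local check of transform_data (run from the app folder).
import pandas as pd
from pipeline import transform_data

toy = pd.DataFrame({
    "Heart Rate ": [80, None, 95],   # trailing space and a missing value on purpose
    "Blood Sugar": [110, 140, None],
})

cleaned = transform_data(toy)
print(cleaned.columns.tolist())  # expected: ['heart_rate', 'blood_sugar']
print(len(cleaned))              # expected: 1, since rows with missing values are dropped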
With the Python pipeline file ready, we will fill in the Dockerfile to set up the configuration for the Docker container using the following code:
FROM python:3.10-slim

WORKDIR /app

COPY ./app /app
COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

CMD ["python", "pipeline.py"]
In the code above, we specify that the container will use Python version 3.10 as its environment. Next, we set the container's working directory to /app and copy everything from our local app folder into the container's /app directory. We also copy the requirements.txt file and execute the pip installation within the container. Finally, we specify the command to run the Python script when the container starts.
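If you prefer to try the image without Docker Compose first, the same Dockerfile can be built and run directly with the Docker CLI; note that you then have to mount the data folder yourself, and the image tag simple-data-pipeline below is just an example name.

# Build the image from the Dockerfile in the current directory.
docker build -t simple-data-pipeline .

# Run it, mounting the local data folder to /data inside the container.
docker run --rm -v "$(pwd)/data:/data" simple-data-pipeline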
With the Dockerfile ready, we will prepare the docker-compose.yml file to manage the overall execution:
version: '3.9'

services:
  data-pipeline:
    build: .
    container_name: simple_pipeline_container
    volumes:
      - ./data:/data
The YAML file above, when executed, will build the Docker image from the current directory using the available Dockerfile. We also mount the local data folder to the data folder within the container, making the dataset accessible to our script.
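Compose also makes it easy to pass configuration into the container. As an optional extension that is not part of the original script, you could have pipeline.py read the file names from environment variables (set under the service's environment: key in docker-compose.yml) so the same image can process different files; the variable names below are hypothetical.

# Optional variation of the path setup in pipeline.py (not in the original script).
import os

input_path = os.path.join("/data", os.environ.get("INPUT_FILE", "Medicaldataset.csv"))
output_path = os.path.join("/data", os.environ.get("OUTPUT_FILE", "CleanedMedicalData.csv"))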
With all the files ready, we will execute the data pipeline in Docker. Go to the project root folder and run the following command in your command prompt to build the Docker image and execute the pipeline.
docker compose up --build
If you run this successfully, you will see an informational log like the following:
 ✔ data-pipeline                            Built    0.0s
 ✔ Network simple_docker_pipeline_default   Created  0.4s
 ✔ Container simple_pipeline_container      Created  0.4s
Attaching to simple_pipeline_container
simple_pipeline_container  | Data Extraction completed.
simple_pipeline_container  | Data Transformation completed.
simple_pipeline_container  | Data Loading completed.
simple_pipeline_container  | Data pipeline completed successfully.
simple_pipeline_container exited with code 0
If everything is executed successfully, you will see a new CleanedMedicalData.csv file in your data folder.
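A quick way to confirm the result is to inspect the output with pandas from the project root; this is only a verification step, not part of the pipeline itself.

# Optional check of the pipeline output, run from the project root.
import pandas as pd

cleaned = pd.read_csv("data/CleanedMedicalData.csv")
print(cleaned.shape)             # rows and columns after dropping missing values
print(cleaned.columns.tolist())  # column names should now be lowercase with underscores

When you are done, running docker compose down removes the stopped container and the network that Compose created.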
Congratulations! You have just created a simple data pipeline with Python and Docker. Try using various data sources and ETL processes to see if you can handle a more complex pipeline.
Understanding data pipelines is crucial for every data professional, as they are essential for acquiring the right data for their work. In this article, we explored how to build a simple data pipeline using Python and Docker and learned how to execute it.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.