lhotanok/data-engineering
Homework assignments for MFF UK course NDBI046 - Introduction to Data Engineering
The following software is required:
- Linux / Windows 10 or higher with WSL installed
- Node.js v19.8.1 or higher
- NPM v9.6.2 or higher
- Python v3.8
- pip v20.0.2 or higher
We're generating the following data cubes:
- Care providers (Poskytovatelé zdravotních služeb)
- Population 2021 (Obyvatelé okresy 2021)
To install required project dependencies, run the following commands:
cd data-cubes
npm ci
Scripts for data cube generation can be launched from the data-cubes directory using the command:
npm start
To generate both data cubes and test their integrity constraints, run the following:
npm test
Input files are stored in the data-cubes/input directory. There are 3 source CSV files:
- care-providers-registry.csv (data source for the care providers data cube)
- population-cs-2021.csv (data source for the mean population data cube)
- county-codes.csv (mapping for translating county codes from population-cs-2021.csv into standardized NUTS codes; see the sketch after this list)
The remaining files contain RDF schemas for the generated data cubes.
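The county-codes mapping is essentially a lookup table. As an illustration only (the repository's scripts are written in TypeScript, and the CSV column names below are assumptions, not the actual headers of county-codes.csv), loading it might look like this:

```python
# Hedged sketch: build a lookup from county codes used in population-cs-2021.csv
# to standardized NUTS codes. Column names "county_code" and "nuts_code" are
# hypothetical placeholders.
import csv

def load_county_mapping(path: str = "data-cubes/input/county-codes.csv") -> dict[str, str]:
    with open(path, newline="", encoding="utf-8") as f:
        return {row["county_code"]: row["nuts_code"] for row in csv.DictReader(f)}

# Usage: translate a county code from the population CSV into its NUTS code.
# mapping = load_county_mapping()
# nuts = mapping.get("40924")  # hypothetical source county code
```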
Output files are stored in the data-cubes/output directory. The following files will be generated into the output directory:
- care-providers.ttl (Care providers data cube with metadata)
- population.ttl (Mean population data cube with metadata)
- datasets.ttl (merged data cubes from care-providers.ttl and population.ttl)
The following scripts need to be run to generate both data cubes and to validate their integrity constraints (a sketch of the shape of a generated observation follows this list):
- care-providers.ts
- population.ts
- constraints-validation.ts
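To give an idea of what these scripts produce, here is a minimal sketch of a single RDF Data Cube observation built with Python's rdflib. The IRIs, dimension and measure names are hypothetical; the actual cubes are generated by the TypeScript scripts listed above.

```python
# Hedged sketch of one qb:Observation in the care providers cube.
# All IRIs and property names below are hypothetical placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

QB = Namespace("http://purl.org/linked-data/cube#")
EX = Namespace("https://example.org/ontology#")  # hypothetical ontology namespace

g = Graph()
obs = URIRef("https://example.org/resources/care-providers/observation/CZ0100")
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, URIRef("https://example.org/resources/dataset/care-providers")))
g.add((obs, EX.county, Literal("CZ0100")))  # dimension: county (NUTS code)
g.add((obs, EX.numberOfCareProviders, Literal(42, datatype=XSD.integer)))  # measure
print(g.serialize(format="turtle"))
```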
In the Apache Airflow project (the airflow directory), we're generating the same data cubes as in the previous section:
- Care providers (Poskytovatelé zdravotních služeb)
- Population 2021 (Obyvatelé okresy 2021)
This project needs to be run on a Linux machine or in the WSL environment.
First, open the project's directory:
cd airflow
Then install dependencies for the Node.js scripts from package.json:
npm ci
The next step is configuring a Python virtual environment. You can install Apache Airflow from the exported dependencies in the requirements.txt file or follow the official installation guide.
To set up a Python virtual environment called venv with all required dependencies, run the following commands:
# Create Python venv
python3 -m venv venv
# Activate venv
. venv/bin/activate
# Install Python dependencies from requirements.txt
pip install -r requirements.txt
You'll also need to define a home directory for Apache Airflow:
export AIRFLOW_HOME=~/airflow
To finish Airflow setup, run the following:
# Initialize a database
airflow db init
# Create an administrator
airflow users create --username "admin" --firstname "Harry" --lastname "Potter" --role "Admin" --email "harry.potter@gmail.com"
# Check the existing users
airflow users list
You should also update the generated airflow.cfg file (a sample of the relevant settings follows this list):
- dags_folder needs to point to the dags directory, e.g. /home/kristyna/data-engineering/airflow/dags
- load_examples should be set to False if you want to avoid seeing tons of example DAGs
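For illustration, both settings live in the [core] section of airflow.cfg and might end up looking like this (the path will differ on your machine):

```ini
[core]
dags_folder = /home/kristyna/data-engineering/airflow/dags
load_examples = False
```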
You'll need 2 commands to launch Airflow. Each command needs to be triggered in its own terminal.
airflow scheduler
airflow webserver --port 8080
Now you can visit http://localhost:8080/home in your web browser. If you set load_examples = False in airflow.cfg, you should only see one DAG - data-cubes.
DAGs can be easily triggered using the web interface. You need to provide a JSON configuration with an output_path field. An example of the right configuration can be seen in dag-configuration.json. The generated data cubes will be stored in the directory specified by the output_path parameter. If you forget to pass this parameter, {dags_folder}/output will be used as the default location for output files.
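The actual dag-configuration.json is kept in the repository; as a hedged illustration, a minimal run configuration with the output_path field might look like this (the path is just an example):

```json
{
  "output_path": "/home/kristyna/data-engineering/airflow/output"
}
```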
Input files are gathered in the airflow/dags/input directory. Three CSV files will be downloaded (same as in the Data Cubes project):
- care-providers-registry.csv (data source for the care providers data cube)
- population-cs-2021.csv (data source for the mean population data cube)
- county-codes.csv (mapping for translating county codes from population-cs-2021.csv into standardized NUTS codes)
The remaining files contain RDF schemas for the generated data cubes.
Output files will be stored in the output_path directory, or in {dags_folder}/output by default. The following files will be generated:
- health_care.ttl (Care providers data cube with metadata)
- population.ttl (Mean population data cube with metadata)
- datasets.ttl (merged data cubes from health_care.ttl and population.ttl)
There's exactly one DAG defined in data-cubes.py. It specifies 6 tasks for the Airflow pipeline:
1. health_care_providers_download
2. population_2021_download
3. county_codes_download
4. care_providers_data_cube
5. population_data_cube
6. integrity_constraints_validation
Tasks 1-3 take care of downloading the source CSV files into the input directory. Tasks 4 and 5 are responsible for generating the target data cubes health_care.ttl and population.ttl. The last task merges the output datasets and validates them against a set of integrity constraints.
For a better idea of how the tasks are connected, see the DAG's graph visualization in the Airflow web interface; a hedged sketch of the DAG definition follows.
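The task ids and DAG id below mirror the list above, but the operators, commands, source URLs and dependency layout are assumptions, not the repository's actual data-cubes.py:

```python
# Hedged sketch of a DAG resembling data-cubes.py; commands and URLs are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="data-cubes",
    start_date=datetime(2023, 3, 1),
    schedule=None,  # triggered manually, e.g. from the web UI with a JSON configuration
    catchup=False,
) as dag:
    # Tasks 1-3: download the source CSV files into the input directory.
    downloads = [
        BashOperator(
            task_id=task_id,
            bash_command=f"curl -fsSL -o dags/input/{filename} '<source-url>'",
        )
        for task_id, filename in [
            ("health_care_providers_download", "care-providers-registry.csv"),
            ("population_2021_download", "population-cs-2021.csv"),
            ("county_codes_download", "county-codes.csv"),
        ]
    ]
    # Tasks 4-5: generate the target data cubes (assumed ts-node invocation).
    care_providers = BashOperator(
        task_id="care_providers_data_cube",
        bash_command="npx ts-node dags/scripts/care-providers.ts",
    )
    population = BashOperator(
        task_id="population_data_cube",
        bash_command="npx ts-node dags/scripts/population.ts",
    )
    # Task 6: merge the outputs and validate integrity constraints.
    validation = BashOperator(
        task_id="integrity_constraints_validation",
        bash_command="npx ts-node dags/scripts/constraints-validation.ts",
    )
    downloads >> care_providers
    downloads >> population
    [care_providers, population] >> validation
```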
Scripts operating with CSV and RDF files were adapted from the Data Cubes project. They're stored in airflow/dags/scripts and triggered by data-cubes.py. The most important (entry) scripts are again the following:
- care-providers.ts
- population.ts
- constraints-validation.ts
A provenance document describing the process of generating the datasets from the Data Cubes project can be found in the provenance directory. It is stored as a provenance.trig file in the RDF TriG format. It follows the PROV-O specification and its attached examples.
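As a rough illustration of the kind of statements PROV-O prescribes (the IRIs below are hypothetical, and the real provenance.trig is richer and uses TriG named graphs):

```python
# Hedged sketch of core PROV-O statements similar in spirit to provenance.trig.
from rdflib import Graph, Namespace
from rdflib.namespace import PROV, RDF

EX = Namespace("https://example.org/provenance#")  # hypothetical namespace

g = Graph()
dataset = EX["care-providers-data-cube"]   # prov:Entity - a generated dataset
generation = EX["data-cubes-generation"]   # prov:Activity - the generation run
author = EX["script-author"]               # prov:Agent - whoever ran it

g.add((dataset, RDF.type, PROV.Entity))
g.add((generation, RDF.type, PROV.Activity))
g.add((author, RDF.type, PROV.Agent))
g.add((dataset, PROV.wasGeneratedBy, generation))
g.add((generation, PROV.wasAssociatedWith, author))
print(g.serialize(format="turtle"))
```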
The datasets from the Data Cubes project are extended with a SKOS hierarchy and DCAT-AP metadata in the skos-and-dcat project.
For project installation & usage instructions, refer to the original Data Cubes section. The project structure is the same as in the data-cubes project, except for 2 changes:
Compared to the original Data Cubes project, metadata were removed from the population dataset completely and moved from the care-providers dataset into a separate dataset, care-providers-metadata. The metadata are described in the file skos-and-dcat/input/care-providers-metadata.ttl. The file's content is loaded into an RDF store, normalized and dumped into skos-and-dcat/output/care-providers-metadata.ttl.
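A hedged sketch of that load, normalize and dump step using Python's rdflib (the repository actually implements it with its Node.js tooling; the paths are those from the text):

```python
# Parse the handwritten metadata, then re-serialize it to get a normalized dump.
from rdflib import Graph

g = Graph()
g.parse("skos-and-dcat/input/care-providers-metadata.ttl", format="turtle")
g.serialize(destination="skos-and-dcat/output/care-providers-metadata.ttl", format="turtle")
```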
A SKOS hierarchy was employed for regions and counties in both data cubes - care-providers and population. Regions and counties are defined as separate SKOS concepts (skos:Concept) and they're connected into the hierarchy through broader and narrower relationships (skos:broader, skos:narrower).
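A minimal sketch of that hierarchy for one region/county pair, with hypothetical resource IRIs:

```python
# Hedged sketch: a county concept linked to its region via skos:broader/skos:narrower.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("https://example.org/resources/")  # hypothetical resource namespace

g = Graph()
region = EX["region/CZ010"]    # Prague region (NUTS 3 code)
county = EX["county/CZ0100"]   # Prague county (okres code)

for concept in (region, county):
    g.add((concept, RDF.type, SKOS.Concept))
g.add((county, SKOS.broader, region))
g.add((region, SKOS.narrower, county))
print(g.serialize(format="turtle"))
```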