Homework assignments for MFF UK course NDBI046 - Introduction to Data Engineering


  1. System requirements
  2. Prerequisites
  3. Projects

System requirements

  • Linux / Windows 10 or higher with WSL installed
  • Node.js v19.8.1 or higher
  • NPM v9.6.2 or higher
  • Python v3.8
  • pip v20.0.2 or higher

Prerequisites

Projects

Data Cubes

We're generating the following data cubes:

  • Care providers (Poskytovatelé zdravotních služeb)
  • Population 2021 (Obyvatelé okresy 2021)

Installation instructions

Configuration

To install the required project dependencies, run the following commands:

cd data-cubes
npm ci
Run

Scripts for data cube generation can be launched from the data-cubes directory using the following command:

npm start

To generate both data cubes and test their integrity constraints, run the following:

npm test

Files

Input

Input files are stored in the data-cubes/input directory. There are 3 source CSV files:

  • care-providers-registry.csv (data source for the care providers data cube)
  • population-cs-2021.csv (data source for the mean population data cube)
  • county-codes.csv (mapping for translating county codes from population-cs-2021.csv into standardized NUTS codes; see the sketch below)

The remaining files contain RDF schemas for the generated data cubes.
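The translation itself happens inside the TypeScript scripts and is not reproduced here. Purely as an illustration of the idea (the column names county_code and nuts_code are hypothetical), the lookup amounts to building a dictionary from county-codes.csv and using it to replace the codes found in population-cs-2021.csv:

# Illustrative sketch only: the real translation is done by the TypeScript scripts,
# and the column names "county_code" / "nuts_code" are hypothetical.
import csv

def load_county_mapping(path):
    """Build a lookup table from source county codes to standardized NUTS codes."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row["county_code"]: row["nuts_code"] for row in csv.DictReader(f)}

county_to_nuts = load_county_mapping("data-cubes/input/county-codes.csv")
# A population record's county code would then be normalized as:
# nuts_code = county_to_nuts[population_record["county_code"]]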

Output

Output files are stored in the data-cubes/output directory. The following files will be generated:

  • care-providers.ttl (Care providers data cube with metadata)
  • population.ttl (Mean population data cube with metadata)
  • datasets.ttl (merged data cubes from care-providers.ttl and population.ttl)
Scripts

The following scripts need to be run to generate both data cubes and to validate their integrity constraints:

  • care-providers.ts
  • population.ts
  • constraints-validation.ts
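constraints-validation.ts is the authoritative implementation and is not reproduced here. As an illustration of what such a check looks like: the W3C RDF Data Cube specification phrases its integrity constraints as SPARQL ASK queries that return true when a constraint is violated. A minimal Python/rdflib sketch running IC-1 ("every qb:Observation has exactly one qb:dataSet") against the merged output:

# Illustrative sketch only -- the repository validates its cubes in constraints-validation.ts.
# IC-1 from the W3C RDF Data Cube spec: the ASK query returns True when the constraint is violated.
from rdflib import Graph

IC1_UNIQUE_DATASET = """
PREFIX qb: <http://purl.org/linked-data/cube#>
ASK {
  {
    ?obs a qb:Observation .
    FILTER NOT EXISTS { ?obs qb:dataSet ?dataset . }
  } UNION {
    ?obs a qb:Observation ;
         qb:dataSet ?ds1, ?ds2 .
    FILTER (?ds1 != ?ds2)
  }
}
"""

graph = Graph()
graph.parse("data-cubes/output/datasets.ttl", format="turtle")

violated = graph.query(IC1_UNIQUE_DATASET).askAnswer
print("IC-1 violated" if violated else "IC-1 holds")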

Apache Airflow

We're generating the same data cubes as in the previous section:

  • Care providers (Poskytovatelé zdravotních služeb)
  • Population 2021 (Obyvatelé okresy 2021)

Installation instructions

This project needs to be run on a Linux machine or in the WSL environment.

Configuration

First, open the project's directory:

cd airflow

Then install the dependencies for the Node.js scripts from package.json:

npm ci

The next step is configuring a Python virtual environment. You can install Apache Airflow from the exported dependencies in the requirements.txt file or follow the official installation guide.

To set up a Python virtual environment called venv with all required dependencies, run the following commands:

# Create Python venv
python3 -m venv venv
# Activate venv
. venv/bin/activate
# Install Python dependencies from requirements.txt
pip install -r requirements.txt

You'll also need to define a home directory for Apache Airflow:

export AIRFLOW_HOME=~/airflow

To finish Airflow setup, run the following:

# Initialize a database
airflow db init
# Create an administrator
airflow users create --username "admin" --firstname "Harry" --lastname "Potter" --role "Admin" --email "harry.potter@gmail.com"
# Check the existing users
airflow users list

You should also update the generated airflow.cfg file:

  1. dags_folder needs to point to the dags directory, e.g. /home/kristyna/data-engineering/airflow/dags
  2. load_examples should be set to False if you want to avoid seeing tons of example DAGs
Run

You'll need 2 commands to launch Airflow. Each command needs to be triggered in its own terminal.

  1. airflow scheduler
  2. airflow webserver --port 8080

Now you can visit http://localhost:8080/home in your web browser. If you set load_examples = False in airflow.cfg, you should only see one DAG - data-cubes.

DAGs can easily be triggered from the web interface. You need to provide a JSON configuration with an output_path field. An example of the right configuration can be seen in dag-configuration.json. The generated data cubes will be stored in the directory specified by the output_path parameter. If you omit this parameter, {dags_folder}/output will be used as the default location for output files.
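data-cubes.py itself defines how the parameter is consumed; the snippet below is only a sketch of the usual Airflow mechanism, reading dag_run.conf in a task callable and falling back to {dags_folder}/output (the helper name is made up):

# Hypothetical helper: resolve the output directory from the DAG run's JSON
# configuration, falling back to {dags_folder}/output as described above.
import os

def resolve_output_path(**context):
    conf = context["dag_run"].conf or {}  # JSON entered when triggering the DAG
    default = os.path.join(os.path.dirname(__file__), "output")  # {dags_folder}/output
    return conf.get("output_path", default)

In Airflow 2, a PythonOperator callable declared with **kwargs receives the run context automatically, so context["dag_run"].conf holds exactly the JSON you enter in the web interface.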

Files

Input

Input files are gathered in the airflow/dags/input directory. Three CSV files will be downloaded (the same as in the Data Cubes project):

  • care-providers-registry.csv (data source for the care providers data cube)
  • population-cs-2021.csv (data source for the mean population data cube)
  • county-codes.csv (mapping for translating county codes from population-cs-2021.csv into standardized NUTS codes)

The remaining files contain RDF schemas for the generated data cubes.

Output

Output files will be stored in the output_path directory, or in {dags_folder}/output by default. The following files will be generated:

  • health_care.ttl (Care providers data cube with metadata)
  • population.ttl (Mean population data cube with metadata)
  • datasets.ttl (merged data cubes from health_care.ttl and population.ttl)
Airflow pipeline

There's exactly one DAG defined in data-cubes.py. It specifies 6 tasks for the Airflow pipeline:

  1. health_care_providers_download
  2. population_2021_download
  3. county_codes_download
  4. care_providers_data_cube
  5. population_data_cube
  6. integrity_constraints_validation

Tasks 1-3 take care of downloading the source CSV files into the input directory. Tasks 4 and 5 are responsible for generating the target data cubes health_care.ttl and population.ttl. The last task merges the output datasets and validates them against a set of integrity constraints.

For a better idea of how the tasks are connected, see the DAG's graph visualization:

[Graph of the data-cubes DAG]
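data-cubes.py is the authoritative definition; the sketch below only illustrates how the six tasks could be wired together. Operator choice, schedule and the exact dependency edges are assumptions here - the real edges are the ones shown in the graph above.

# Structural sketch only -- see airflow/dags/data-cubes.py for the real DAG.
# EmptyOperator stands in for the real download / generation / validation tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow 2.3+

with DAG(
    dag_id="data-cubes",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # use schedule_interval=None on Airflow < 2.4
    catchup=False,
) as dag:
    downloads = [
        EmptyOperator(task_id="health_care_providers_download"),
        EmptyOperator(task_id="population_2021_download"),
        EmptyOperator(task_id="county_codes_download"),
    ]
    care_providers = EmptyOperator(task_id="care_providers_data_cube")
    population = EmptyOperator(task_id="population_data_cube")
    validation = EmptyOperator(task_id="integrity_constraints_validation")

    downloads >> care_providers
    downloads >> population
    [care_providers, population] >> validation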

Scripts

Scripts operating on CSV and RDF files were adapted from the Data Cubes project. They're stored in airflow/dags/scripts and triggered by data-cubes.py. The most important (entry) scripts are again the following:

  • care-providers.ts
  • population.ts
  • constraints-validation.ts

Provenance

A provenance document describing the process of generating the datasets from the Data Cubes project can be found in the provenance directory. It is stored as provenance.trig in the RDF TriG format. It follows the PROV-O specification and its attached examples.
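provenance.trig itself is the reference document. As a rough illustration of the PROV-O shapes it involves (hypothetical URIs, and without the TriG named graphs of the real file):

# Illustration of typical PROV-O statements -- the URIs are hypothetical and the
# real provenance.trig additionally uses TriG named graphs, which this sketch omits.
from rdflib import Graph, Namespace
from rdflib.namespace import PROV, RDF

EX = Namespace("https://example.org/provenance/")  # hypothetical namespace

g = Graph()
g.bind("prov", PROV)

dataset = EX["care-providers-datacube"]     # the generated dataset (prov:Entity)
activity = EX["care-providers-generation"]  # the generation run (prov:Activity)
author = EX["lhotanok"]                     # the responsible agent (prov:Agent)

g.add((dataset, RDF.type, PROV.Entity))
g.add((activity, RDF.type, PROV.Activity))
g.add((author, RDF.type, PROV.Agent))
g.add((dataset, PROV.wasGeneratedBy, activity))
g.add((dataset, PROV.wasAttributedTo, author))
g.add((activity, PROV.wasAssociatedWith, author))

print(g.serialize(format="turtle"))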

SKOS & DCAT-AP

The datasets from the Data Cubes project are extended with a SKOS hierarchy and DCAT-AP metadata in the skos-and-dcat project.

For installation & usage instructions, refer to the original Data Cubes section. The project structure is the same as in the data-cubes project, except for 2 changes:

1. Metadata

Compared to the original Data Cubes project, metadata were removed from the population dataset completely, and the metadata of the care-providers dataset were moved into a separate dataset, care-providers-metadata. The metadata are described in the file skos-and-dcat/input/care-providers-metadata.ttl. The file's content is loaded into an RDF store, normalized and dumped into skos-and-dcat/output/care-providers-metadata.ttl.
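The loading and dumping is done by the project's TypeScript code; the following is only a rough Python/rdflib sketch of the same load-and-reserialize step:

# Sketch of the load / normalize / dump step -- the project itself does this in TypeScript.
from rdflib import Graph

store = Graph()
store.parse("skos-and-dcat/input/care-providers-metadata.ttl", format="turtle")

# Re-serializing the parsed graph produces a normalized Turtle dump.
store.serialize(destination="skos-and-dcat/output/care-providers-metadata.ttl", format="turtle")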

2. SKOS hierarchy

A SKOS hierarchy was employed for regions and counties in both data cubes - care-providers and population. Regions and counties are defined as separate SKOS concepts (skos:Concept) and connected into the hierarchy through broader and narrower relationships (skos:broader, skos:narrower).
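Purely as an illustration (hypothetical URIs; the project builds these triples in TypeScript), a region/county pair in such a hierarchy can be expressed like this:

# Illustration of the SKOS region/county hierarchy -- the URIs are hypothetical
# and the project creates these triples in TypeScript, not Python.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("https://example.org/")  # hypothetical base namespace
region = EX["region/CZ010"]             # e.g. Hlavní město Praha (NUTS 3)
county = EX["county/CZ0100"]            # the corresponding county (okres)

g = Graph()
g.bind("skos", SKOS)
for concept in (region, county):
    g.add((concept, RDF.type, SKOS.Concept))  # regions and counties are skos:Concept
g.add((county, SKOS.broader, region))         # county -> broader region
g.add((region, SKOS.narrower, county))        # region -> narrower county

print(g.serialize(format="turtle"))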

