A Python PySpark project with Poetry
nsphung/pyspark-template
People have asked me several times how to set up a good, clean code organization for a Python project with PySpark. I didn't find a fully featured project, so this is my attempt at one. It also has a simple integration with Jupyter Notebook inside the project.
Inspired by:
- https://mungingdata.com/pyspark/chaining-dataframe-transformations/
- https://medium.com/albert-franzi/the-spark-job-pattern-862bc518632a
- https://pawamoy.github.io/copier-poetry/
- https://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure
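The first link describes the chained-DataFrame-transformations pattern: write each step as a function that takes and returns a DataFrame, then chain the steps. A dependency-free sketch of the idea, with plain dicts standing in for Spark rows and illustrative function names (not part of this project):

```python
from functools import reduce
from typing import Callable

Row = dict
Transform = Callable[[list[Row]], list[Row]]


def drop_empty(rows: list[Row]) -> list[Row]:
    # One self-contained step: takes rows, returns new rows
    return [r for r in rows if r["title"]]


def with_upper_title(rows: list[Row]) -> list[Row]:
    return [{**r, "title": r["title"].upper()} for r in rows]


def pipeline(rows: list[Row], *steps: Transform) -> list[Row]:
    # Chain steps left-to-right, like df.transform(a).transform(b) in PySpark
    return reduce(lambda acc, step: step(acc), steps, rows)


data = [{"title": "aspirin"}, {"title": ""}]
print(pipeline(data, drop_empty, with_upper_title))  # [{'title': 'ASPIRIN'}]
```

In PySpark the same shape is achieved with `DataFrame.transform`, keeping each step independently testable.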
All you need is the following already installed:
- Git
- The project was tested with Python 3.10.13 managed by pyenv:
  - Use the `make pyenv` goal to launch the automated install of pyenv
- `JAVA_HOME` environment variable configured with a Java JDK 11
- `SPARK_HOME` environment variable configured with the Spark `spark-3.5.2-bin-hadoop3` package
- `PYSPARK_PYTHON` environment variable configured with `"python3.10"`
- `PYSPARK_DRIVER_PYTHON` environment variable configured with `"python3.10"`
- Install Make to run the `Makefile`
- Why `Python 3.10`? Because `PySpark 3.5.2` doesn't seem to work with `Python 3.11` at the moment (I haven't tried Python 3.12)
- pyenv prerequisites for Ubuntu (check the prerequisites for your OS):

  `sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev`
- pyenv installed and available in PATH (see the pyenv installation with prerequisites)
- Install Python 3.10 with pyenv on Homebrew/Linuxbrew:

  `CONFIGURE_OPTS="--with-openssl=$(brew --prefix openssl)" pyenv install 3.10`
Auto format via IDE: https://github.com/psf/black#pycharmintellij-idea

[Optional] You could set up a pre-commit hook to enforce the Black format before each commit: https://github.com/psf/black#version-control-integration

Or remember to run `black .` to apply Black formatting to all sources before committing. Add integration with Jenkins and it will complain and tests will fail if the Black format is not applied.
Use the option to lint/format with Black and flake8 on editor save in VS Code
Check optional types with Mypy (PEP 484)

Configure Mypy to help annotate/hint types in Python code. It's very useful for IDE support and for catching errors/bugs early.

- Install the mypy plugin for IntelliJ
- Adjust the plugin with the following options:
  `"--follow-imports=silent", "--show-column-numbers", "--ignore-missing-imports", "--disallow-untyped-defs", "--check-untyped-defs"`
- Documentation: Type hints cheat sheet (Python 3)
- Add the same mypy options for VS Code in `Preferences: Open User Settings`
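As a quick illustration of what these options enforce: `--disallow-untyped-defs` rejects any function without annotations, so every function must be fully typed, as in the following sketch (the function and its data are hypothetical examples, not part of the project):

```python
from typing import Optional


def find_drug(name: str, catalog: dict[str, int]) -> Optional[int]:
    """Return the catalog id for a drug name, or None if unknown."""
    # dict.get() returns None when the key is missing, hence Optional[int]
    return catalog.get(name)


print(find_drug("aspirin", {"aspirin": 42}))  # 42
print(find_drug("unknown", {}))  # None
```

With `--check-untyped-defs`, mypy also type-checks the bodies of any functions that remain unannotated instead of skipping them.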
- isort is the default import sorter in PyCharm
- isort with vscode
- Lint/format/sort imports on save with VS Code in `Preferences: Open User Settings`:
  `{ "editor.formatOnSave": true, "python.formatting.provider": "black", "[python]": { "editor.codeActionsOnSave": { "source.organizeImports": true } } }`
- isort configuration for PyCharm: see Set isort and black formatting code in pycharm
- You can use the `make lint` command to check the flake8/mypy rules and automatically apply Black and isort formatting (`black .` and `isort .`) with the previous configuration
- Shows a way to handle an erroneous JSON file such as `data/pubmed.json`
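The project's actual handling of `data/pubmed.json` is not shown here. As one possible approach, assuming (hypothetically) the defect is a trailing comma before a closing bracket, a lenient loader could retry after a naive cleanup:

```python
import json
import re


def load_lenient_json(text: str) -> list:
    """Parse JSON, retrying after stripping trailing commas if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Naive fix: remove trailing commas before ] or }.
        # (This could corrupt string values containing ",]" - fine for a sketch.)
        cleaned = re.sub(r",\s*([\]}])", r"\1", text)
        return json.loads(cleaned)


broken = '[{"id": 1, "title": "A"},]'  # trailing comma makes this invalid JSON
print(load_lenient_json(broken))  # [{'id': 1, 'title': 'A'}]
```

A production pipeline would more likely quarantine unparseable records than patch them with a regex, but the retry-with-cleanup shape is a common pattern for known-bad inputs.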
- Create a poetry env with Python 3.10: `poetry env use 3.10`
- Install pyenv: `make pyenv`
- Install dependencies in the poetry env (virtualenv): `make deps`
- Lint & test: `make build`
- Lint, test & run: `make run`
- Run dev: `make dev`
- Build binary/Python wheel: `make dist`
poetry run drugs_gen --help
    Usage: drugs_gen [OPTIONS]

    Options:
      -d, --drugs TEXT             Path to drugs.csv
      -p, --pubmed TEXT            Path to pubmed.csv
      -c, --clinicals_trials TEXT  Path to clinical_trials.csv
      -o, --output TEXT            Output path to result.json (e.g /path/to/result.json)
      --help                       Show this message and exit.
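The help text above matches the output format of a Click-based entry point; whether the project actually uses Click is an assumption, but a minimal sketch of such a command (with a placeholder body) could look like:

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("-d", "--drugs", type=str, help="Path to drugs.csv")
@click.option("-p", "--pubmed", type=str, help="Path to pubmed.csv")
@click.option("-c", "--clinicals_trials", type=str, help="Path to clinical_trials.csv")
@click.option("-o", "--output", type=str, help="Output path to result.json")
def drugs_gen(drugs, pubmed, clinicals_trials, output):
    """Hypothetical entry point mirroring the options shown above."""
    # The real command would run the Spark job; here we just echo the arguments.
    click.echo(f"drugs={drugs} pubmed={pubmed} trials={clinicals_trials} out={output}")


# Invoke in-process (avoids needing the console entry point for the demo)
result = CliRunner().invoke(drugs_gen, ["--drugs", "drugs.csv"])
print(result.output)
```

The `drugs_gen` console script itself is wired up in `pyproject.toml`, which is how `poetry run drugs_gen` finds it.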
- Use `spark-submit` with the Python wheel file built by the `make dist` command in the `dist` folder.