A Python PySpark Project with Poetry

nsphung/pyspark-template

made-with-python | python-3.10 | Code style: black | Imports: isort | Checked with mypy | made-with-Markdown

People have asked me several times how to set up a good, clean code organization for a Python project with PySpark. I didn't find a fully featured project, so this is my attempt at one. It also includes a simple integration with Jupyter Notebook inside the project.

Table of Contents

Inspiration

Development

Prerequisites

All you need is the following configuration already installed:

  • Git
  • The project was tested with Python 3.10.13 managed by pyenv:
    • Use the make pyenv goal to launch the automated install of pyenv
  • JAVA_HOME environment variable configured with a Java JDK 11
  • SPARK_HOME environment variable configured with the Spark version spark-3.5.2-bin-hadoop3 package
  • PYSPARK_PYTHON environment variable configured with "python3.10"
  • PYSPARK_DRIVER_PYTHON environment variable configured with "python3.10"
  • Install Make to run the Makefile
  • Why Python 3.10: PySpark 3.5.2 doesn't seem to work with Python 3.11 at the moment (I haven't tried Python 3.12)
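
A quick way to confirm that the environment matches the list above is a small check script. This is a minimal sketch and not part of the project: the function name `missing_prerequisites` is hypothetical, and it only checks that the variables are set, not that they point to valid installations.

```python
import os
import sys
from typing import List, Mapping

# Environment variables named in the prerequisites list.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")


def missing_prerequisites(env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_prerequisites(os.environ)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    # The README states the project was tested with Python 3.10.
    if sys.version_info[:2] != (3, 10):
        found = ".".join(str(part) for part in sys.version_info[:3])
        print("Warning: the project was tested with Python 3.10, found " + found)
```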

Pyenv Manual Install [Optional]

  • pyenv prerequisites for Ubuntu. Check the prerequisites for your OS.
    sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
      libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
      libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
  • pyenv installed and available in PATH: see pyenv installation with Prerequisites
  • Install Python 3.10 with pyenv on homebrew/linuxbrew
    CONFIGURE_OPTS="--with-openssl=$(brew --prefix openssl)" pyenv install 3.10

Add format, lint code tools

Autolint/Format code with Black in IDE:

Code style: black

  • Auto format via IDE: https://github.com/psf/black#pycharmintellij-idea

  • [Optional] You could set up a pre-commit hook to enforce Black formatting before each commit: https://github.com/psf/black#version-control-integration

  • Or remember to type black . to apply the Black formatting rules to all sources before committing

  • Add integration with Jenkins so that it complains and tests fail if Black formatting is not applied

  • Add the same mypy options for VS Code in Preferences: Open User Settings

  • Use the option to lint/format with black and flake8 on editor save in vscode

Checked optional type with Mypy (PEP 484)

Checked with mypy

Configure Mypy to help with annotating/hinting types in Python code. It's very useful for the IDE and for catching errors/bugs early.

  • Install the mypy plugin for IntelliJ
  • Adjust the plugin with the following options:
    "--follow-imports=silent","--show-column-numbers","--ignore-missing-imports","--disallow-untyped-defs","--check-untyped-defs"
  • Documentation: Type hints cheat sheet (Python 3)
  • Add the same mypy options for VS Code in Preferences: Open User Settings
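
To illustrate what the `--disallow-untyped-defs` option expects, here is a small sketch of fully annotated functions. The names `normalize_title` and `normalize_titles` are hypothetical and not taken from the project.

```python
from typing import List, Optional


def normalize_title(title: Optional[str]) -> str:
    """Return a lower-cased, whitespace-trimmed title, or "" when it is missing."""
    if title is None:
        return ""
    return title.strip().lower()


def normalize_titles(titles: List[Optional[str]]) -> List[str]:
    """Apply normalize_title to every element of the list."""
    return [normalize_title(t) for t in titles]


# Without the annotations (e.g. `def normalize_title(title):`), mypy run with
# --disallow-untyped-defs reports the function as missing type annotations.
```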

Isort

Imports: isort

{
    "editor.formatOnSave": true,
    "python.formatting.provider": "black",
    "[python]": {
        "editor.codeActionsOnSave": {
            "source.organizeImports": true
        }
    }
}
isort .

Fix

  • Show a way to handle an erroneous JSON file like data/pubmed.json
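
The README does not say what is wrong with the file. Assuming, purely for illustration, that the problem is a trailing comma before a closing bracket or brace, one possible approach is a lenient loader that retries after removing such commas. The function name `loads_lenient` is hypothetical, and this is a sketch under that assumption rather than the project's fix.

```python
import json
import re
from typing import Any

# Matches a comma followed only by whitespace and a closing ] or }.
_TRAILING_COMMA = re.compile(r",(\s*[\]}])")


def loads_lenient(text: str) -> Any:
    """Parse JSON, retrying once with trailing commas removed if strict parsing fails.

    Caveat: the regex is not string-aware, so a literal ",]" or ",}" inside a
    JSON string value would also be altered on the retry.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Assumption: the only defect is a trailing comma; other errors still raise.
        return json.loads(_TRAILING_COMMA.sub(r"\1", text))
```

Valid input takes the strict path unchanged, so the retry only affects files that would otherwise fail to parse.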

Usage Local

  • Create a poetry env with Python 3.10
    poetry env use 3.10
  • Install pyenv: make pyenv
  • Install dependencies in the poetry env (virtualenv): make deps
  • Lint & Test: make build
  • Lint, Test & Run: make run
  • Run dev: make dev
  • Build the binary/Python wheel: make dist

Use with poetry

poetry run drugs_gen --help

Usage: drugs_gen [OPTIONS]

Options:
  -d, --drugs TEXT             Path to drugs.csv
  -p, --pubmed TEXT            Path to pubmed.csv
  -c, --clinicals_trials TEXT  Path to clinical_trials.csv
  -o, --output TEXT            Output path to result.json (e.g
                               /path/to/result.json)
  --help                       Show this message and exit.
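
For readers who want to see how the four options in the help text map onto parsed values, here is a minimal sketch using the standard-library `argparse`. This README does not say which CLI framework the project uses, so `argparse` is a stand-in, and `build_parser` and `parse_args` are hypothetical names.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the same four options as `drugs_gen --help`."""
    parser = argparse.ArgumentParser(prog="drugs_gen")
    parser.add_argument("-d", "--drugs", help="Path to drugs.csv")
    parser.add_argument("-p", "--pubmed", help="Path to pubmed.csv")
    parser.add_argument("-c", "--clinicals_trials", help="Path to clinical_trials.csv")
    parser.add_argument("-o", "--output", help="Output path to result.json")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or sys.argv[1:] when `argv` is None."""
    return build_parser().parse_args(argv)


# Example with placeholder paths (not files from the project):
args = parse_args(["-d", "drugs.csv", "-o", "result.json"])
print(args.drugs, args.output)
```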

Usage in distributed-mode depending on your cluster manager type

  • Use spark-submit with the Python wheel file built by the make dist command in the dist folder.
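
As a rough sketch of what such a submission could look like, the helper below assembles a spark-submit argument list that ships the wheel through the `--py-files` option. The helper name `spark_submit_command`, the wheel file name, the entry script, and the master value are all placeholders rather than names from the project. The right `--master` value depends on your cluster manager.

```python
import shlex
from typing import List


def spark_submit_command(wheel: str, entry_script: str, master: str) -> List[str]:
    """Build a spark-submit argument list that ships the wheel via --py-files."""
    return ["spark-submit", "--master", master, "--py-files", wheel, entry_script]


# All three values below are hypothetical placeholders.
cmd = spark_submit_command("dist/example-0.1.0-py3-none-any.whl", "main.py", "yarn")
print(shlex.join(cmd))  # the list can also be passed directly to subprocess.run
```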
