Data Engineer and Scientist | Open Source Enthusiast
I'm a data engineer and scientist specializing in natural language processing. On GitHub, I'm the author and maintainer of projects like Trafilatura, a popular open-source package for gathering and extracting text data, used by researchers and the AI industry.
- Extracting the main text content from web pages using Python
- A simple multilingual lemmatizer for Python
- A module to extract date information from web pages
- Web scraping with R: Text and metadata extraction
Pinned
- trafilatura: Python and command-line tool to gather text and metadata on the Web (crawling, scraping, extraction; output as CSV, JSON, HTML, MD, TXT, XML)
- rust-primes: Rust code and command-line utility to compute and visualize prime sequences
- German-NLP: Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German