Adrien Barbaresi adbar

Organizations

@deutschestextarchiv

adbar/README.md


⚡  Web  ✍  Blog  ☕  Coffee

I'm a data engineer and scientist specializing in natural language processing. On GitHub, I'm the author and maintainer of projects like Trafilatura, a popular open-source package for gathering and extracting text data, used by researchers and the AI industry.

Most Popular Blog Posts

Open-Source Tech Stack


Pinned

  1. trafilatura (Public)

    Python and command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

    Python · 4.1k stars · 287 forks

  2. htmldate (Public)

    Fast and robust date extraction from web pages, with Python or on the command line

    Python · 124 stars · 26 forks

  3. simplemma (Public)

    Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

    Python · 154 stars · 12 forks

  4. rust-primes (Public)

    Rust code and command-line utility to compute and visualize prime sequences

    Rust

  5. courlan (Public)

    Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

    Python · 135 stars · 9 forks

  6. German-NLP (Public)

    Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

    468 stars · 66 forks
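
To give a rough idea of what "clean, filter and deduplicate URLs" (as in the courlan description above) can involve, here is a minimal standard-library sketch. It does not use or reproduce courlan's API: the function names `normalize_url` and `deduplicate` and the `TRACKING_PARAMS` set are hypothetical and chosen for illustration only.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to drop (an assumption, not courlan's list).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}


def normalize_url(url):
    """Return a canonical form of an http(s) URL, or None if it is not usable."""
    parts = urlsplit(url.strip())
    if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
        return None
    # Keep only non-tracking query parameters, sorted so that order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    # Treat "/page/" and "/page" as the same resource; an empty path becomes "/".
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, and drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def deduplicate(urls):
    """Normalize URLs, discard invalid ones, and keep the first of each duplicate."""
    seen = set()
    result = []
    for url in urls:
        normalized = normalize_url(url)
        if normalized is not None and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result
```

A dedicated library would typically cover many more cases (spam patterns, language hints, sampling); this sketch only shows the normalize-then-deduplicate idea.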

