corpus-builder
Here are 19 public repositories matching this topic...
Python package and command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
- Updated
Mar 17, 2025 - Python
Crawler for linguistic corpora
- Updated
Dec 5, 2023 - Python
Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora
- Updated
Sep 21, 2022 - C
Collector and speech cutter for librivox audiobooks
- Updated
Dec 8, 2022 - C#
- Updated
Feb 26, 2022 - Java
Ebook Corpus - A parser and extractor for electronic books
- Updated
Aug 6, 2019 - Ruby
Katya, or The Liberated Corpus: a text corpus that allows you to request and scrape any web resource
- Updated
Mar 14, 2024 - Go
Article title, authors, date and body extraction dataset.
- Updated
Mar 26, 2024 - HTML
A corpus builder for evaluation of plagiarism detection tools
- Updated
Dec 12, 2016 - PHP
The user interface for the Corpus & Repository of Writing, built in Angular
- Updated
Feb 16, 2025 - TypeScript
Crawl Ask.fm Q&A lists and create a corpus for ML.
- Updated
Dec 15, 2023 - Python
The canonical resources to build the backend for a corpus/repository management framework for Crow, the Corpus and Repository of Writing
- Updated
Feb 16, 2025 - PHP
Automated text preprocessing pipeline for large corpora. Features customizable filters for diacritics, stop words, punctuation, and regex.
- Updated
Jan 24, 2025 - Python
This is a text corpus management system for the German linguistics department of the University of Basel.
- Updated
Apr 15, 2020 - PHP
App and scripts that work with the corpus builder CorpusCook to update a corpus with corrections of wrong predictions
- Updated
Mar 20, 2020 - Python
Polish-language chatbot based on the Transformer architecture, trained on movie subtitles collected by web scraping.
- Updated
Jun 30, 2024 - Jupyter Notebook
Extract text from Vikidia/Wikipedia articles [fr]
- Updated
Jul 20, 2021 - Python
Corpus Development Software for Machine Translation
- Updated
Apr 23, 2024 - JavaScript
Builds Wikipedia corpora in I5 (a TEI-based format)
- Updated
Jun 21, 2022 - Java