corpus-tools
Here are 97 public repositories matching this topic...
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Mar 17, 2025 - Python
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Updated Mar 23, 2025 - Python
A very simple news crawler with a funny name
Updated Mar 25, 2025 - Python
Bitextor generates translation memories from multilingual websites
Updated Nov 11, 2024 - Python
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Updated Feb 11, 2024 - Macaulay2
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Updated Nov 19, 2024 - Python
Python library for handling audio datasets.
Updated Jul 6, 2023 - Python
OpusFilter - Parallel corpus processing toolkit
Updated Mar 26, 2025 - Python
Utilities for Processing the Switchboard Dialogue Act Corpus
Updated Jan 24, 2021 - Python
An advanced, extensible web front-end for the Manatee-open corpus search engine
Updated Mar 19, 2025 - TypeScript
An open source reimplementation of Benny Brodda's BETA in Python
Updated Oct 28, 2019 - Python
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
Updated Aug 11, 2023 - HTML
A set of workflows for corpus building through OCR, post-correction and normalisation
Updated Sep 7, 2022 - Python
A parser for annotated MuseScore 3 files.
Updated Mar 25, 2025 - Python
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
Updated May 22, 2023 - Python
Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
Updated Nov 12, 2024 - Python
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Updated Dec 19, 2023 - PHP
Reading the data from OPIEC - an Open Information Extraction corpus
Updated Jun 12, 2019 - Java
Rezonator: Dynamics of human engagement
Updated Oct 18, 2024 - Yacc
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Updated Jan 24, 2021 - Python