tokenizer


A grammar describes the syntax of a programming language and is often defined in Backus-Naur form (BNF). A lexer (also called a tokenizer) performs lexical analysis, turning text into tokens. A parser takes those tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? The front end of a compiler combines a lexer and a parser built for a specific grammar.
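The split between lexer and parser described above can be sketched in a few lines of Python. This is a minimal illustration, not taken from any repository listed here: the toy grammar, the token kinds, and the names `Token`, `TOKEN_SPEC`, `lex`, and `parse` are all assumptions made for the example.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical toy grammar, written BNF-style:
#   expr ::= term (("+" | "-") term)*
#   term ::= NUMBER | "(" expr ")"
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("OP", r"[+\-]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP", r"\s+"),      # whitespace, dropped by the lexer
    ("MISMATCH", r"."),    # any other character is an error
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(text: str) -> Iterator[Token]:
    """Lexer: turn raw text into a flat stream of tokens."""
    for match in MASTER_RE.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        yield Token(kind, match.group())


def parse(tokens: list[Token]):
    """Parser: check the token sequence against the grammar and build an AST."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def term():
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok.kind == "NUMBER":
            advance()
            return ("num", int(tok.text))
        if tok.kind == "LPAREN":
            advance()
            node = expr()
            closing = peek()
            if closing is None or closing.kind != "RPAREN":
                raise SyntaxError("expected ')'")
            advance()
            return node
        raise SyntaxError(f"unexpected token {tok.text!r}")

    def expr():
        node = term()
        # Left-associative chain of + and - operators.
        while (tok := peek()) is not None and tok.kind == "OP":
            advance()
            node = (tok.text, node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek().text!r}")
    return tree
```

With these definitions, `parse(list(lex("1 + (2 - 3)")))` returns the nested tuple `("+", ("num", 1), ("-", ("num", 2), ("num", 3)))`. The input `"1 +"` lexes without complaint, since every character forms a valid token, but the parser rejects it with a `SyntaxError` because the token sequence does not fit the grammar.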

Here are 1,614 public repositories matching this topic...


theseer/tokenizer

Stars: 5.2k

A small library for converting tokenized PHP source code into XML (and potentially other formats)

php xml tokenizer

Updated Feb 3, 2026
PHP

Chevrotain/chevrotain

Stars: 2.7k

Parser Building Toolkit for JavaScript

javascript open-source grammars typescript parsing parser-library tokenizer lexer

Updated Feb 20, 2026
TypeScript

dqbd/tiktokenizer

Stars: 1.5k

Online playground for OpenAI tokenizers

nextjs tokenizer openai t3-stack chatgpt tiktoken

Updated Apr 24, 2025
TypeScript

roshan-research/hazm

Stars: 1.4k

Persian NLP Toolkit

python nlp natural-language-processing tokenizer embeddings persian text-processing dependency-parser farsi pos-tagging persian-nlp normalization lemmatization

Updated Dec 21, 2025
Python

natasha/natasha

Stars: 1.3k

Solves basic Russian NLP tasks, API for lower level Natasha projects

visualization python nlp syntax morphology tokenizer embeddings russian ner sentence-segmentation

Updated Oct 17, 2024
Python

lovit/soynlp

Stars: 984

A Python library for Korean natural language processing. It provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

nlp tokenizer postagging korean-text-processing korean-nlp word-extraction

Updated May 7, 2025
Python

ikawaha/kagome

Stars: 943

Self-contained Japanese Morphological Analyzer written in pure Go

japanese tokenizer segmentation korean japanese-language nlp-library hacktoberfest pos-tagging morphological-analysis

Updated Jan 29, 2026
Go

no-context/moo

Stars: 874

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

javascript regexp tokenizer lexer

Updated May 16, 2023
JavaScript

wangfenjin/simple

Stars: 773

A SQLite3 FTS5 full-text search extension (tokenizer) that supports Chinese and Pinyin

sqlite tokenizer cpp14 pinyin chinese sqlite3 fts fts5 sqlite3-fts5

Updated Feb 20, 2026
C++

BLKSerene/Wordless

Stars: 748

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

translation tokenizer corpus linguistics tagger literature dependency-parser corpus-linguistics lemmatizer corpus-tools corpus-processing corpus-search corpus-statistics stopword corpus-analysis

Updated Oct 30, 2025
Python

niieani/gpt-tokenizer

Stars: 738

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT models (gpt-5, gpt-o*, gpt-4o, etc.). Port of OpenAI's tiktoken with additional features.

machine-learning encoder decoder tokenizer openai bpe gpt-2 gpt-3 gpt-4 gpt-5 gpt-4o gpt-o1

Updated Feb 10, 2026
TypeScript

risesoft-y9/Data-Labeling

Stars: 695

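Data Labeling is a tool dedicated to processing and annotating text data. Through a streamlined text-annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly while the algorithm steadily reduces the cost and time of manual annotation. The process starts with manual annotation to build a baseline, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects any drift, which substantially improves annotation accuracy and efficiency. Data Labeling depends on the open-source digital base platform for managing personnel and positions.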

java docker elasticsearch tokenizer chinese data-annotations springboot2 vue3 nacos tokenizer-parser data-annotation-tools

Updated Jun 23, 2025
Java

mathewsanders/Mustard

Stars: 686

🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

swift tokenizer substrings

Updated May 14, 2018
Swift

cbaziotis/ekphrasis

Stars: 675

Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and a Twitter corpus of 330 million English tweets).

nlp tokenizer text-processing semeval nlp-library word-segmentation spelling-correction tokenization text-segmentation spell-corrector word-normalization