Natural language detection library for Rust. Try demo online: https://whatlang.org/
greyblake/whatlang-rs
Natural language detection for Rust with focus on simplicity and performance.
- Features
- Get started
- Who uses Whatlang?
- Documentation
- Supported languages
- Feature toggles
- How does it work?
- Running benchmark
- Comparison with alternatives
- Ports and clones
- Donations
- Derivation
- License
- Contributors
- Supports 69 languages
- 100% written in Rust
- Lightweight, fast and simple
- Recognizes not only a language, but also a script (Latin, Cyrillic, etc)
- Provides reliability information
Example:

    use whatlang::{detect, Lang, Script};

    fn main() {
        let text = "Ĉu vi ne volas eklerni Esperanton? Bonvolu! Estas unu de la plej bonaj aferoj!";

        let info = detect(text).unwrap();
        assert_eq!(info.lang(), Lang::Epo);
        assert_eq!(info.script(), Script::Latin);
        assert_eq!(info.confidence(), 1.0);
        assert!(info.is_reliable());
    }

For more details (e.g. how to blacklist some languages) please check the documentation.
Whatlang is used within the following big projects as a direct or indirect dependency for language recognition. You're gonna be in great company using Whatlang:
- Sonic - fast, lightweight and schema-less search backend in Rust.
- Meilisearch - an open-source, easy-to-use, blazingly fast, and hyper-relevant search engine built in Rust.
Feature | Description |
---|---|
enum-map | `Lang` and `Script` implement the `Enum` trait from `enum-map` |
arbitrary | Support `Arbitrary` |
serde | Implements `Serialize` and `Deserialize` for `Lang` and `Script` |
dev | Enables the `whatlang::dev` module, which provides some internal API. It exists for profiling purposes and normal users are discouraged from relying on this API. |
The algorithm is based on trigram language models, which are a particular case of n-grams. To understand the idea, please check the original whitepaper: Cavnar and Trenkle '94, "N-Gram-Based Text Categorization".
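To make the trigram idea more concrete, here is a minimal, self-contained sketch of the rank-based comparison described by Cavnar and Trenkle. It is not Whatlang's actual implementation: the function names `trigram_profile` and `out_of_place_distance`, the tiny reference texts, and the profile limit of 300 are all assumptions made for illustration. Real language profiles are built from much larger corpora.

```rust
use std::collections::HashMap;

/// Build a ranked trigram profile: trigrams ordered from most to least frequent.
/// (Hypothetical helper, not part of the whatlang API.)
fn trigram_profile(text: &str, limit: usize) -> Vec<String> {
    // Lowercase and pad with spaces so word boundaries become part of trigrams.
    let normalized: Vec<char> = format!(" {} ", text.to_lowercase()).chars().collect();

    let mut counts: HashMap<String, usize> = HashMap::new();
    for window in normalized.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }

    let mut ranked: Vec<(String, usize)> = counts.into_iter().collect();
    // Sort by descending count, breaking ties alphabetically for determinism.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.into_iter().take(limit).map(|(t, _)| t).collect()
}

/// "Out-of-place" distance: the sum of rank differences between the sample
/// and a reference profile, with a fixed maximum penalty for trigrams that
/// are missing from the reference. A smaller distance means a closer match.
fn out_of_place_distance(sample: &[String], reference: &[String]) -> usize {
    let max_penalty = reference.len();
    sample
        .iter()
        .enumerate()
        .map(|(rank, trigram)| match reference.iter().position(|t| t == trigram) {
            Some(ref_rank) => rank.abs_diff(ref_rank),
            None => max_penalty,
        })
        .sum()
}

fn main() {
    // Toy reference profiles built from single sentences (illustration only).
    let english = trigram_profile("the quick brown fox jumps over the lazy dog and the cat", 300);
    let spanish = trigram_profile("el rápido zorro marrón salta sobre el perro perezoso y el gato", 300);

    let sample = trigram_profile("the dog and the fox", 300);
    let d_en = out_of_place_distance(&sample, &english);
    let d_es = out_of_place_distance(&sample, &spanish);
    println!("distance to english: {d_en}, distance to spanish: {d_es}");
}
```

The language whose profile gives the smallest distance is taken as the best guess. The linear `position` lookup keeps the sketch short; a map from trigram to rank would be the more efficient choice for real profiles.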
Whether a detection is reported as reliable (`is_reliable`) depends on the following factors:
- How many unique trigrams are in the given text
- How big is the difference between the first and the second (not returned) detected languages? This metric is called `rate` in the code base.
Therefore, it can be presented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola.
For more details, please check the blog article "Introduction to Rust Whatlang Library and Natural Language Identification Algorithms".
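The reliability check described above can be sketched as a comparison between `rate` and a hyperbolic threshold over the number of unique trigrams. This is only an illustration under stated assumptions: the function names, the exact definition of `rate` as a relative gap, and the constants `3.0` and `0.015` are hypothetical and should not be read as the values used in the Whatlang code base.

```rust
/// Relative gap between the best and the second-best language scores.
/// Assumes higher scores are better and `second_score` is positive.
fn rate(best_score: f64, second_score: f64) -> f64 {
    (best_score - second_score) / second_score
}

/// Hypothetical hyperbolic threshold: the fewer unique trigrams the text has,
/// the larger the gap between the top two languages must be.
/// The constants 3.0 and 0.015 are illustrative only.
fn threshold(unique_trigrams: usize) -> f64 {
    3.0 / unique_trigrams as f64 + 0.015
}

/// A point (unique_trigrams, rate) is "Reliable" when it lies above the hyperbola.
fn is_reliable(unique_trigrams: usize, best_score: f64, second_score: f64) -> bool {
    unique_trigrams > 0 && rate(best_score, second_score) > threshold(unique_trigrams)
}

fn main() {
    // Short text: a 20% gap does not clear the high threshold.
    println!("5 trigrams:   {}", is_reliable(5, 1.2, 1.0));
    // Longer text: the same gap clears the much lower threshold.
    println!("100 trigrams: {}", is_reliable(100, 1.2, 1.0));
}
```

The shape matters more than the numbers: a short text needs a large gap between the top two candidates, while a long text can be trusted with a much smaller one.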
- `make bench` - run performance benchmarks
- `make doc` - generate and open doc
- `make test` - run tests
- `make watch` - watch changes and run tests
| | Whatlang | CLD2 | CLD3 |
---|---|---|---|
Implementation language | Rust | C++ | C++ |
Languages | 68 | 83 | 107 |
Algorithm | trigrams | quadgrams | neural network |
Supported Encoding | UTF-8 | UTF-8 | ? |
HTML support | no | yes | ? |
- whatlang-ffi - C bindings
- whatlanggo - whatlang clone for Go language
- whatlang-py - bindings for Python
- whatlang-rb - bindings for Ruby
- whatlangex - bindings for Elixir
You can support the project by donating NEAR tokens.
Our NEAR wallet address is whatlang.near
Whatlang is a derivative work from Franc (JavaScript, MIT) by Titus Wormer.
- greyblake Potapov Sergey - creator, maintainer.
- Dr-Emann Zachary Dremann - optimization and improvements
- BaptisteGelez Baptiste Gelez - improvements
- Vishesh Chopra - designed the logo
- Joel Natividad - support of Tagalog
- ManyTheFish - crazy optimization
- Kerollmops Clément Renault - crazy optimization