dhchenx/rsnltk


Rust-based Natural Language Toolkit using Python Bindings

crates.io/crates/rsnltk


Rust-based Natural Language Toolkit (rsnltk)

A Rust library to support natural language processing with pure Rust implementation and Python bindings

Rust Docs | Crates Home Page | Tests | NER-Kit

Features

The rsnltk library integrates various existing Python-written NLP toolkits for powerful text analysis in Rust-based applications.

Functions

This toolkit is based on the Python-written Stanza and other important NLP crates.

The functions from Stanza and other toolkits that we bind here include:

Tokenize
Sentence Segmentation
Multi-Word Token Expansion
Part-of-Speech & Morphological Features
Named Entity Recognition
Sentiment Analysis
Language Identification
Dependency Tree Analysis

Several existing crates are also included in rsnltk, with simplified APIs for practical use.

Additionally, we can calculate the similarity between words based on WordNet through the semantic-kit PyPI project, installed via pip install semantic-kit.

Installation

Make sure Python 3.6.6+ and pip are installed on your computer. Typing python -V in the terminal should print the version without an error message.
Install our Python-based ner-kit (version >= 0.0.5a2), which binds the Stanza package, via pip install ner-kit==0.0.5a2.
Rust should also be installed on your computer. I use IntelliJ to develop Rust-based applications, where you can write Rust code.
Create a simple Rust application project with a main() function.
Add the rsnltk dependency to the Cargo.toml file, keeping it at the latest version.
After adding the rsnltk dependency to the toml file, install the necessary language models from Stanza by running the following Rust code the first time you use this package.

fn init_rsnltk_and_test() {
    // 1. first install the necessary language models
    // using language codes
    let list_lang = vec!["en", "zh"];
    // e.g. you install two language models,
    // namely, for English and Chinese text analysis.
    download_langs(list_lang);
    // 2. then do test NLP tasks
    let text = "I like Beijing!";
    let lang = "en";
    // 2. Uncomment the below codes for Chinese NER
    // let text = "我喜欢北京、上海和纽约！";
    // let lang = "zh";
    let list_ner = ner(text, lang);
    for ner in list_ner {
        println!("{:?}", ner);
    }
}

Alternatively, you can manually install those language models via the Python-written ner-kit package, which provides more features for using Stanza. Go to: ner-kit

If no error occurs in the above example, then it works. Finally, you can try the following advanced example usage.

Currently, we tested the use of English and Chinese language models; however, other language models should work as well.

Examples with Stanza Bindings

Example 1: Part-of-speech Analysis

fn test_pos() {
    // let text = "我喜欢北京、上海和纽约！";
    // let lang = "zh";
    let text = "I like apple";
    let lang = "en";
    let list_result = pos(text, lang);
    for word in list_result {
        println!("{:?}", word);
    }
}

Example 2: Sentiment Analysis

fn test_sentiment() {
    // let text = "I like Beijing!";
    // let lang = "en";
    let text = "我喜欢北京";
    let lang = "zh";
    let sentiments = sentiment(text, lang);
    for sen in sentiments {
        println!("{:?}", sen);
    }
}

Example 3: Named Entity Recognition

fn test_ner() {
    // 1. for English NER
    let text = "I like Beijing!";
    let lang = "en";
    // 2. Uncomment the below codes for Chinese NER
    // let text = "我喜欢北京、上海和纽约！";
    // let lang = "zh";
    let list_ner = ner(text, lang);
    for ner in list_ner {
        println!("{:?}", ner);
    }
}

Example 4: Tokenize for Multiple Languages

fn test_tokenize() {
    let text = "我喜欢北京、上海和纽约！";
    let lang = "zh";
    let list_result = tokenize(text, lang);
    // each item is one token returned by the tokenizer
    for token in list_result {
        println!("{:?}", token);
    }
}

Example 5: Tokenize Sentence

fn test_tokenize_sentence() {
    let text = "I like apple. Do you like it? No, I am not sure!";
    let lang = "en";
    let list_sentences = tokenize_sentence(text, lang);
    for sentence in list_sentences {
        println!("Sentence: {}", sentence);
    }
}

Example 6: Language Identification

fn test_lang() {
    let list_text = vec!["I like Beijing!", "我喜欢北京！", "Bonjour le monde!"];
    let list_result = lang(list_text);
    for lang in list_result {
        println!("{:?}", lang);
    }
}

Example 7: MWT expand

fn test_mwt_expand() {
    let text = "Nous avons atteint la fin du sentier.";
    let lang = "fr";
    let list_result = mwt_expand(text, lang);
}

Example 8: Estimate the similarity between words in WordNet

You first need to install the semantic-kit PyPI package.

fn test_wordnet_similarity() {
    let s1 = "dog.n.1";
    let s2 = "cat.n.2";
    let sims = wordnet_similarity(s1, s2);
    for sim in sims {
        println!("{:?}", sim);
    }
}

Example 9: Obtain a dependency tree from a text

fn test_dependency_tree() {
    let text = "I like you. Do you like me?";
    let lang = "en";
    let list_results = dependency_tree(text, lang);
    for list_token in list_results {
        for token in list_token {
            println!("{:?}", token);
        }
    }
}

Examples in Pure Rust

Example 1: Word2Vec similarity

fn test_open_wv_bin() {
    let wv_model = wv_get_model("GoogleNews-vectors-negative300.bin");
    let positive = vec!["woman", "king"];
    let negative = vec!["man"];
    println!("analogy: {:?}", wv_analogy(&wv_model, positive, negative, 10));
    println!("cosine: {:?}", wv_cosine(&wv_model, "man", 10));
}
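To illustrate what an analogy query such as "woman" + "king" - "man" computes, here is a minimal sketch of cosine similarity and vector-arithmetic analogy over a tiny hand-made vocabulary. It is not the rsnltk implementation: the function names cosine, analogy and toy_vectors, as well as the 3-dimensional vectors, are hypothetical and chosen only for illustration.

```rust
use std::collections::HashMap;

/// Cosine similarity of two vectors; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Ranks all words (except the query words) by cosine similarity to
/// the target vector: sum(positive) - sum(negative).
fn analogy(
    vectors: &HashMap<String, Vec<f32>>,
    positive: &[&str],
    negative: &[&str],
    top_n: usize,
) -> Vec<(String, f32)> {
    let dim = vectors.values().next().map_or(0, |v| v.len());
    let mut target = vec![0.0f32; dim];
    for w in positive {
        if let Some(v) = vectors.get(*w) {
            for (t, x) in target.iter_mut().zip(v) {
                *t += x;
            }
        }
    }
    for w in negative {
        if let Some(v) = vectors.get(*w) {
            for (t, x) in target.iter_mut().zip(v) {
                *t -= x;
            }
        }
    }
    let mut scored: Vec<(String, f32)> = Vec::new();
    for (word, vec) in vectors {
        let w = word.as_str();
        if positive.contains(&w) || negative.contains(&w) {
            continue; // skip the query words themselves
        }
        scored.push((word.clone(), cosine(&target, vec)));
    }
    // highest similarity first
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(top_n);
    scored
}

/// Hypothetical toy embeddings; a real model has hundreds of dimensions.
fn toy_vectors() -> HashMap<String, Vec<f32>> {
    let rows: [(&str, [f32; 3]); 5] = [
        ("king", [0.9, 0.8, 0.1]),
        ("queen", [0.9, 0.1, 0.8]),
        ("man", [0.1, 0.9, 0.1]),
        ("woman", [0.1, 0.1, 0.9]),
        ("apple", [0.0, 0.1, 0.1]),
    ];
    rows.iter().map(|(w, v)| (w.to_string(), v.to_vec())).collect()
}

fn main() {
    let vectors = toy_vectors();
    let result = analogy(&vectors, &["woman", "king"], &["man"], 2);
    println!("analogy: {:?}", result);
}
```

With these toy vectors the target is closest to "queen", which mirrors the classic word2vec analogy; with a real model the ranking depends entirely on the trained embeddings.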

Example 2: Text summarization

use rsnltk::native::summarizer::*;

fn test_summarize() {
    let text = "Some large txt...";
    let stopwords = &[];
    let summarized_text = summarize(text, stopwords, 5);
    println!("{}", summarized_text);
}
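For readers unfamiliar with extractive summarization, the sketch below shows one simple, generic approach: score each sentence by the average frequency of its non-stopword terms and keep the top-scoring sentences in their original order. This is an assumption-laden illustration, not the algorithm used by rsnltk's summarize; the names summarize_sketch, split_sentences and words are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Splits text into sentences at '.', '!' and '?', keeping the punctuation.
fn split_sentences(text: &str) -> Vec<&str> {
    text.split_inclusive(|c: char| c == '.' || c == '!' || c == '?')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

/// Lowercased alphanumeric words of a sentence.
fn words(sentence: &str) -> Vec<String> {
    sentence
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Keeps the `n` sentences whose words are, on average, most frequent in the text.
fn summarize_sketch(text: &str, stopwords: &[&str], n: usize) -> String {
    let stop: HashSet<&str> = stopwords.iter().copied().collect();
    let sentences = split_sentences(text);

    // Term frequencies over the whole text, ignoring stopwords.
    let mut freq: HashMap<String, usize> = HashMap::new();
    for s in &sentences {
        for w in words(s) {
            if !stop.contains(w.as_str()) {
                *freq.entry(w).or_insert(0) += 1;
            }
        }
    }

    // Sentence score: summed term frequency divided by the sentence's word count.
    let mut scored: Vec<(usize, f64)> = sentences
        .iter()
        .enumerate()
        .map(|(i, s)| {
            let ws = words(s);
            let total: usize = ws.iter().filter_map(|w| freq.get(w)).sum();
            let score = if ws.is_empty() {
                0.0
            } else {
                total as f64 / ws.len() as f64
            };
            (i, score)
        })
        .collect();

    // Best score first; ties broken by earlier position.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    let mut keep: Vec<usize> = scored.into_iter().take(n).map(|(i, _)| i).collect();
    keep.sort_unstable(); // restore original sentence order
    keep.into_iter()
        .map(|i| sentences[i])
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let text = "Rust is fast. Rust is safe and Rust is fun. I had lunch.";
    let stopwords = ["is", "and", "i"];
    println!("{}", summarize_sketch(text, &stopwords, 2));
}
```

Dividing by sentence length keeps long sentences from winning purely by size; other scoring schemes (for example graph-based ranking) are equally plausible choices.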

Example 3: Get token list from English strings

use rsnltk::native::token::get_token_list;

fn test_get_token_list() {
    let s = "Hello, Rust. How are you?";
    let result = get_token_list(s);
    for r in result {
        println!("{}\t{:?}", r.text, r);
    }
}

Example 4: Word segmentation for languages in which no spaces separate terms, e.g. Chinese text.

We implement three word segmentation methods in this version:

Forward Maximum Matching (fmm), which is the baseline method
Backward Maximum Matching (bmm), which is considered better
Bidirectional Maximum Matching (bimm), which has high accuracy but low speed

use rsnltk::native::segmentation::*;

fn test_real_word_segmentation() {
    let dict_path = "30wdict.txt"; // empty if only for tokenizing
    let stop_path = "baidu_stopwords.txt"; // empty when no stop words
    let _sentence = "美国太空总署希望，在深海的探险发现将有助于解开一些外太空的秘密，\
    同时也可以测试前往太阳系其他星球探险所需的一些设备和实验。";
    let meaningful_words = get_segmentation(_sentence, dict_path, stop_path, "bimm");
    // bimm can be changed to fmm or bmm.
    println!("Result: {:?}", meaningful_words);
}
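The three maximum-matching strategies can be sketched in a few lines of dictionary lookup. The code below is a simplified illustration under stated assumptions, not rsnltk's implementation: the dictionary is a small in-memory set (toy_dict is hypothetical), stop words are ignored, and the tie-breaking rule in bimm (fewer tokens, then fewer single-character tokens, otherwise the backward result) is one common heuristic.

```rust
use std::collections::HashSet;

/// Length, in characters, of the longest dictionary entry (at least 1).
fn max_word_len(dict: &HashSet<String>) -> usize {
    dict.iter().map(|w| w.chars().count()).max().unwrap_or(1).max(1)
}

/// Forward Maximum Matching: scan left to right, taking the longest dictionary match.
fn fmm(text: &str, dict: &HashSet<String>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let max_len = max_word_len(dict);
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let mut len = max_len.min(chars.len() - i);
        // Shrink the window until it is a dictionary word or a single character.
        while len > 1 {
            let cand: String = chars[i..i + len].iter().collect();
            if dict.contains(&cand) {
                break;
            }
            len -= 1;
        }
        out.push(chars[i..i + len].iter().collect::<String>());
        i += len;
    }
    out
}

/// Backward Maximum Matching: the same idea, scanning right to left.
fn bmm(text: &str, dict: &HashSet<String>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let max_len = max_word_len(dict);
    let mut out: Vec<String> = Vec::new();
    let mut end = chars.len();
    while end > 0 {
        let mut len = max_len.min(end);
        while len > 1 {
            let cand: String = chars[end - len..end].iter().collect();
            if dict.contains(&cand) {
                break;
            }
            len -= 1;
        }
        out.push(chars[end - len..end].iter().collect::<String>());
        end -= len;
    }
    out.reverse();
    out
}

/// Number of tokens that consist of exactly one character.
fn count_single_chars(tokens: &[String]) -> usize {
    tokens.iter().filter(|t| t.chars().count() == 1).count()
}

/// Bidirectional Maximum Matching: run both scans and pick the more plausible result.
fn bimm(text: &str, dict: &HashSet<String>) -> Vec<String> {
    let f = fmm(text, dict);
    let b = bmm(text, dict);
    if f.len() != b.len() {
        if f.len() < b.len() { f } else { b }
    } else if count_single_chars(&f) < count_single_chars(&b) {
        f
    } else {
        b
    }
}

/// Hypothetical toy dictionary used only for this illustration.
fn toy_dict() -> HashSet<String> {
    ["研究", "研究生", "生命", "的", "起源"]
        .iter()
        .map(|w| w.to_string())
        .collect()
}

fn main() {
    let dict = toy_dict();
    let text = "研究生命的起源";
    println!("fmm:  {:?}", fmm(text, &dict));
    println!("bmm:  {:?}", bmm(text, &dict));
    println!("bimm: {:?}", bimm(text, &dict));
}
```

On this sentence the forward scan greedily takes 研究生 and leaves 命 stranded, while the backward scan yields 研究 / 生命 / 的 / 起源. The bidirectional variant runs both scans, which is why it costs roughly twice as much as either one alone.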

Credits

Thanks to the Stanford NLP Group for their hard work on Stanza.

License

The rsnltk library is provided by Donghua Chen under the MIT License.
