avineshpvs/indic_tagger


Indian Language Tagger and Chunker (Hindi, Telugu, Tamil, Marathi, Punjabi, Kannada, Malayalam, Urdu, Bengali)

License

Apache-2.0 license


Folders and files

.idea
data/test
examples
fasttextmodels/te
lstmcrf
models
polyglot-tokenizer
spacypackages/te_model-0.0.0
tagger
.gitignore
LICENSE
README.md
pipeline.py
requirements.txt
spacy_tagger.py
spacy_tagger_test.py


Indic Tagger (Indian Language Tagger)

This project provides part-of-speech (POS) taggers, chunkers, and named-entity recognition (NER) taggers for Indian languages.

Languages supported: Telugu (te), Hindi (hi), Tamil (ta), Marathi (mr), Punjabi (pa), Kannada (kn), Malayalam (ml), Urdu (ur), Bengali (bn)

If you reuse this software, please use the following citation:

    @inproceedings{PVS:SPSAL2007,
      editor    = {P.V.S., Avinesh and Gali, Karthik},
      title     = {Part of Speech Tagging and Chunking using Conditional Random Fields and Transformation Based Learning},
      booktitle = {Proceedings of the Shallow Parsing for South Asian Languages (SPSAL) Workshop, held at IJCAI-07, Hyderabad, India},
      series    = {{SPSAL} Workshop Proceedings},
      month     = {January},
      year      = {2007},
      pages     = {21--24},
    }

Training Data Statistics and System Performances (F1 macro)

Language	# Words	# Sentences	CRF POS	CRF Chunk	BI-LSTM-CRF POS	BI-LSTM-CRF Chunk
te	347k	30k	93%	96%	92%	92%
hi	350k	16.3k	93%	97%	94%	93%
bn	298.3k	14.6k	84%	95%	85%	88%
pa	152.5k	5.6k	92%	98%	94%	96%
mr	207.9k	8.5k	89%	95%	88%	90%
ur	158.9k	7.6k	90%	96%	92%	89%
ta	337k	14.2k	88%	92%	87%	85%
ml	192k	11.4k	96%	95%	98%	98%
kn	294.3k	16.5k	90%	98%	88%	87%
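
The scores above are macro-averaged F1. As a reminder of what that metric means, the sketch below computes a token-level macro F1: one F1 score per tag, averaged with equal weight per tag. The function name `macro_f1` and the toy tag sequences are illustrative and not taken from this repository; the project's own evaluation may differ (for example, chunk scores are often computed over spans rather than tokens).

```python
from collections import Counter


def macro_f1(gold, pred):
    """Token-level macro F1: unweighted mean of the per-tag F1 scores."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong
            fn[g] += 1  # gold tag g was missed

    scores = []
    for tag in set(gold) | set(pred):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (made-up tags, not real system output).
gold = ["NN", "VM", "NN", "JJ"]
pred = ["NN", "VM", "JJ", "JJ"]
print(round(macro_f1(gold, pred), 3))
```

Because every tag counts equally, a rare tag that is tagged poorly lowers the macro score as much as a frequent one would.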

Training Data Statistics and System Performances (F1 macro) for NER

Language	# Words	# Sentences	CRF NER	BI-LSTM-CRF NER
te	347k	30k	69%	65%
hi	503k	19k	62%	63%
bn	120k	6k	54%	48%
ur	35k	1.5k	65%	56%
or	93k	1.8k	68%	43%
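
One way to read the NER table is as a guide for choosing a value for the `-m` option per language. The sketch below copies the scores from the table and returns the model type with the higher reported F1. The names `NER_F1` and `best_model` are illustrative, and the keys are the language codes used in the table, which may not match the codes `pipeline.py` accepts.

```python
# NER macro-F1 scores (%) copied from the table above: (CRF, BI-LSTM-CRF).
NER_F1 = {
    "te": (69, 65),
    "hi": (62, 63),
    "bn": (54, 48),
    "ur": (65, 56),
    "or": (68, 43),
}


def best_model(language):
    """Return the -m value ('crf' or 'lstm') with the higher reported F1."""
    crf, lstm = NER_F1[language]
    return "crf" if crf >= lstm else "lstm"


for lang in NER_F1:
    print(lang, best_model(lang))
```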

Install using Anaconda

    # Install the Python environment
    conda create -n tagger3.6 anaconda python=3.6
    source activate tagger3.6

    # Install the tokenizer
    cd polyglot-tokenizer
    python setup.py install

    # Install requirements (requirements.txt is in the repository root)
    cd ..
    pip install -r requirements.txt

Run

    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i input_file -o output_file

    -l, --languages     select language (2 letter ISO-639 code)
                        {hi, be, ml, pu, te, ta, ka, mr, ur}
    -t, --tag_type      pos, chunk, parse, ner
    -m, --model_type    crf, hmm, lstm
    -f, --data_format   ssf, txt, conll
    -e, --encoding      utf8, wx   (default: utf8)
    -i, --input_file    <input-file>
    -o, --output_file   <output-file>
    -s, --sent_split    True/False (default: True)

    python pipeline.py --help
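
To call the tagger from another Python script, one option is to assemble the command line documented above and hand it to `subprocess`. This is a minimal sketch, not part of the repository: `build_command` and `run_pipeline` are hypothetical helpers, the output file name is made up, and the default `-e utf` follows the example commands in this README (the option list names the value `utf8`).

```python
import subprocess
import sys


def build_command(process, language, tag_type, model_type, data_format,
                  encoding="utf", input_file=None, output_file=None):
    """Assemble a pipeline.py argument list from the flags shown above."""
    cmd = [
        sys.executable, "pipeline.py",
        "-p", process,      # train or predict
        "-l", language,
        "-t", tag_type,
        "-m", model_type,
        "-f", data_format,
        "-e", encoding,
    ]
    if input_file is not None:
        cmd += ["-i", input_file]
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


def run_pipeline(cmd):
    """Run the command; must be called from the repository root."""
    subprocess.run(cmd, check=True)


command = build_command(
    "predict", "te", "pos", "crf", "txt",
    input_file="data/test/te/test.utf.txt",
    output_file="te_pos_output.txt",  # hypothetical output path
)
print(" ".join(command[1:]))
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.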

Train the POS tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t pos -m crf -e utf -f ssf

    # BI-LSTM-CRF model
    python pipeline.py -p train -t pos -f conll -m lstm -e utf -l te

Predict on text:

    # CRF models
    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i data/test/te/test.utf.txt

    # BI-LSTM-CRF models
    python pipeline.py -p predict -l te -t pos -m lstm -f txt -e utf -i data/test/te/test.utf.txt

    # SpaCy models
    python spacy_tagger_test.py -l te -t pos

Train the NER tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t ner -m crf -e utf -f conll

    # BI-LSTM-CRF model
    python pipeline.py -p train -t ner -f conll -m lstm -e utf -l te
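
Both NER training commands read data in `conll` format. This README does not describe the column layout, so the reader below is only a sketch under a common assumption: one token per line, whitespace-separated columns with the token first and the tag last, and a blank line between sentences. The function name `read_conll`, the sample sentences, and the tag names are all made up for illustration.

```python
def read_conll(lines):
    """Group (token, tag) pairs into sentences separated by blank lines."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumed layout: first column is the token, last column is the tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Toy Telugu sample with hypothetical tags (not taken from the training data).
sample = """రాము B-PER
హైదరాబాద్ B-LOC
వెళ్ళాడు O

సీత B-PER
వచ్చింది O
"""

for sentence in read_conll(sample.splitlines()):
    print(sentence)
```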

Predict NER on text:

    # CRF model
    python pipeline.py -p predict -l hi -t ner -m crf -f txt -e utf -i data/test/hi/test.utf.txt

    # BI-LSTM-CRF model
    python pipeline.py -p predict -l hi -t ner -m lstm -f txt -e utf -i data/test/hi/test.utf.txt
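
NER predictions are usually consumed as entity spans rather than per-token tags. The sketch below groups token-level tags into spans, assuming a BIO-style scheme (`B-X` opens an entity, `I-X` continues it, `O` is outside). This README does not state which tag scheme the models emit, so the scheme, the function name `bio_to_spans`, and the example sentence are assumptions for illustration only.

```python
def bio_to_spans(tokens, tags):
    """Collect (label, text) entity spans from BIO-style tags."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag with the same label extends the open span.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the span that is currently open.
        if label is not None:
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        # B- opens a span; a stray I- is treated leniently as an opener.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, " ".join(tokens[start:])))
    return spans


# Toy Hindi sentence with hypothetical tags (not real system output).
tokens = ["नई", "दिल्ली", "भारत", "की", "राजधानी", "है"]
tags = ["B-LOC", "I-LOC", "B-LOC", "O", "O", "O"]
print(bio_to_spans(tokens, tags))
```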

ToDo List

Telugu, Hindi trained CRF models
Bengali, Punjabi, Marathi, Urdu, Tamil trained CRF models
Bug: UTF-8 error in Malayalam, Kannada trained CRF models
Deep learning (BI-LSTM-CRF)
Analysis and comparison w.r.t. other ML algorithms
Bug: Punjabi & Urdu training files don't have a "|" or other end-of-sentence marker.
NER for Indian Languages
Feature addition to BI-LSTM-CRF models
Active Learning based sampling strategies
