Indian Language Tagger and Chunker (Hindi, Telugu, Tamil, Marathi, Punjabi, Kannada, Malayalam, Urdu, Bengali)


avineshpvs/indic_tagger


This project provides part-of-speech (POS) taggers and chunkers for Indian languages.

Languages supported: Telugu (te), Hindi (hi), Tamil (ta), Marathi (mr), Punjabi (pa), Kannada (kn), Malayalam (ml), Urdu (ur), Bengali (bn)

If you reuse this software, please use the following citation:

    @inproceedings{PVS:SPSAL2007,
      editor    = {P.V.S., Avinesh and Gali, Karthik},
      title     = {Part of Speech Tagging and Chunking using Conditional Random Fields and Transformation Based Learning},
      booktitle = {Proceedings of the Shallow Parsing for South Asian Languages (SPSAL) Workshop, held at IJCAI-07, Hyderabad, India},
      series    = {{SPSAL} Workshop Proceedings},
      month     = {January},
      year      = {2007},
      pages     = {21--24},
    }

Training Data Statistics and System Performances (F1 macro)

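| Language | # Words | # Sents | CRF POS | CRF Chunk | BI-LSTM-CRF POS | BI-LSTM-CRF Chunk |
|----------|---------|---------|---------|-----------|-----------------|-------------------|
| te       | 347k    | 30k     | 93%     | 96%       | 92%             | 92%               |
| hi       | 350k    | 16.3k   | 93%     | 97%       | 94%             | 93%               |
| bn       | 298.3k  | 14.6k   | 84%     | 95%       | 85%             | 88%               |
| pa       | 152.5k  | 5.6k    | 92%     | 98%       | 94%             | 96%               |
| mr       | 207.9k  | 8.5k    | 89%     | 95%       | 88%             | 90%               |
| ur       | 158.9k  | 7.6k    | 90%     | 96%       | 92%             | 89%               |
| ta       | 337k    | 14.2k   | 88%     | 92%       | 87%             | 85%               |
| ml       | 192k    | 11.4k   | 96%     | 95%       | 98%             | 98%               |
| kn       | 294.3k  | 16.5k   | 90%     | 98%       | 88%             | 87%               |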

Training Data Statistics and System Performances (F1 macro) for NER

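| Language | # Words | # Sents | CRF NER | BI-LSTM-CRF NER |
|----------|---------|---------|---------|-----------------|
| te       | 347k    | 30k     | 69%     | 65%             |
| hi       | 503k    | 19k     | 62%     | 63%             |
| bn       | 120k    | 6k      | 54%     | 48%             |
| ur       | 35k     | 1.5k    | 65%     | 56%             |
| or       | 93k     | 1.8k    | 68%     | 43%             |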
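
The scores in the tables are reported as macro F1, meaning the F1 score is computed separately for each tag and then averaged with equal weight per tag. The sketch below shows one way to compute it at the token level. It is an illustration only: the function name `macro_f1` is hypothetical, and this README does not state whether the reported numbers were computed per token or per chunk/entity, so the project's own evaluation may differ.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Token-level macro-averaged F1 over all tags seen in gold or pred.

    Hypothetical helper for illustration; not part of this repository.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for tag g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was missed

    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    # Every tag contributes equally, regardless of how frequent it is.
    return sum(scores) / len(scores) if scores else 0.0


gold_tags = ["NN", "NN", "VM", "JJ"]
pred_tags = ["NN", "VM", "VM", "JJ"]
score = macro_f1(gold_tags, pred_tags)
```

Because every tag counts equally, rare tags with poor accuracy pull the macro score down more than they would affect plain token accuracy.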

Install using Anaconda

    # Install the Python environment
    conda create -n tagger3.6 anaconda python=3.6
    source activate tagger3.6

    # Install the tokenizer
    cd polyglot-tokenizer
    python setup.py install

    # Install requirements
    pip install -r requirements.txt

Run

    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i input_file -o output_file

    -l, --languages       select language (2 letter ISO-639 code)
                          {hi, be, ml, pu, te, ta, ka, mr, ur}
    -t, --tag_type        pos, chunk, parse, ner
    -m, --model_type      crf, hmm, lstm
    -f, --data_format     ssf, txt, conll
    -e, --encoding        utf8, wx   (default: utf8)
    -i, --input_file      <input-file>
    -o, --output_file     <output-file>
    -s, --sent_split      True/False (default: True)

    python pipeline.py --help
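
When calling the tagger from another Python script, it can help to assemble the command line in one place. The following is a minimal sketch that only builds the argument list from the flags shown above; `build_predict_command` is a hypothetical helper, not part of this repository, and the allowed values are copied from the usage text rather than from the source code of `pipeline.py`.

```python
# Allowed values as listed in the usage text above (assumed, not read from pipeline.py).
TAG_TYPES = {"pos", "chunk", "parse", "ner"}
MODEL_TYPES = {"crf", "hmm", "lstm"}
DATA_FORMATS = {"ssf", "txt", "conll"}


def build_predict_command(language, tag_type, model_type, input_file,
                          output_file=None, data_format="txt", encoding="utf"):
    """Return the argv list for a `pipeline.py -p predict` call.

    Hypothetical helper for illustration; it does not run anything itself.
    """
    if tag_type not in TAG_TYPES:
        raise ValueError(f"unknown tag type: {tag_type!r}")
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    if data_format not in DATA_FORMATS:
        raise ValueError(f"unknown data format: {data_format!r}")

    cmd = ["python", "pipeline.py", "-p", "predict",
           "-l", language, "-t", tag_type, "-m", model_type,
           "-f", data_format, "-e", encoding, "-i", input_file]
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


cmd = build_predict_command("te", "pos", "crf", "data/test/te/test.utf.txt")
# To actually run it from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file paths that contain spaces.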

Train the POS tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t pos -m crf -e utf -f ssf

    # BI-LSTM-CRF model
    python pipeline.py -p train -t pos -f conll -m lstm -e utf -l te

Predict on text:

    # CRF models
    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i data/test/te/test.utf.txt

    # BI-LSTM-CRF models
    python pipeline.py -p predict -l te -t pos -m lstm -f txt -e utf -i data/test/te/test.utf.txt

    # SpaCy models
    python spacy_tagger_test.py -l te -t pos

Train the NER tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t ner -m crf -e utf -f conll

    # BI-LSTM-CRF model
    python pipeline.py -p train -t ner -f conll -m lstm -e utf -l te
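
The NER training commands read data in `conll` format. As a rough illustration of what such data usually looks like, the sketch below parses a generic CoNLL-style layout: one token per line, whitespace-separated columns, and a blank line between sentences. The column layout is an assumption (token in the first column, tag in the last); this README does not specify the exact columns the project expects, and `read_conll` is a hypothetical helper rather than the repository's own reader.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumes the token is the first column and the tag is the last column;
    blank lines separate sentences. Hypothetical helper for illustration.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


sample = ["Ram B-PER", "went O", "", "home O"]
sentences = read_conll(sample)
```

To read from a file, pass an open file handle, for example `read_conll(open(path, encoding="utf-8"))`.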

Predict NER on text:

    # CRF model
    python pipeline.py -p predict -l hi -t ner -m crf -f txt -e utf -i data/test/hi/test.utf.txt

    # BI-LSTM-CRF model
    python pipeline.py -p predict -l hi -t ner -m lstm -f txt -e utf -i data/test/hi/test.utf.txt

ToDo List

  • Telugu, Hindi trained CRF models
  • Bengali, Punjabi, Marathi, Urdu, Tamil trained CRF models
  • Bug: UTF-8 error in Malayalam, Kannada trained CRF models
  • Deep learning (BI-LSTM-CRF)
  • Analysis: comparison with other ML algorithms
  • Bug: Punjabi & Urdu training files don't have the "|" end-of-sentence marker.
  • NER for Indian Languages
  • Feature addition to BI-LSTM-CRF models
  • Active Learning based sampling strategies
