Home

Categories: use only categories
Types: use only types
Subtypes: use only subtypes
Filtered: use filtered categories (subset of categories)

André Pires edited this page Aug 16, 2017 · 36 revisions

This wiki documents the development process for my master's thesis, named "Named entity extraction from Portuguese web text".

First, the HAREM dataset was used to perform NER with available tools, namely Stanford CoreNLP, NLTK, OpenNLP and spaCy. Repeated 10-fold cross-validation was used to evaluate all tools; all results are available in this wiki. More information on the HAREM collection is on its page.
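As an illustration of the evaluation protocol, here is a minimal sketch of how repeated 10-fold cross-validation splits could be generated over document indices. It is not the script used in the thesis: the function name `repeated_kfold`, the number of repetitions, the seed and the document count are all assumptions made for the example.

```python
import random


def repeated_kfold(n_docs, k=10, repeats=2, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper: the number of repetitions is an assumption,
    it is not stated in this wiki page.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_docs))
        rng.shuffle(idx)  # a fresh shuffle for every repetition
        folds = [idx[i::k] for i in range(k)]  # k folds of nearly equal size
        for f, test in enumerate(folds):
            # Training set: every document that is not in the held-out fold
            train = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train, test


# Example: 25 hypothetical documents, 2 repetitions of 10 folds
for r, f, train, test in repeated_kfold(25):
    pass  # train each tool on `train`, evaluate it on `test`
```

Each document is held out exactly once per repetition, and the scores from all folds and repetitions are then averaged per tool.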

After evaluating all tools with the baseline configuration, I performed a hyperparameter study for each tool, this time using repeated holdout cross-validation.
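For comparison with k-fold, a repeated holdout split can be sketched as below. This is only an illustration: the helper name `repeated_holdout`, the test fraction, the number of repetitions and the seed are assumptions, since this page does not state the values used in the study.

```python
import random


def repeated_holdout(n_docs, test_fraction=0.3, repeats=4, seed=0):
    """Yield (train_idx, test_idx) pairs from independent random splits.

    Hypothetical helper: test_fraction and repeats are illustrative values.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_docs * test_fraction))
    for _ in range(repeats):
        idx = list(range(n_docs))
        rng.shuffle(idx)  # each repetition draws a new random split
        yield idx[n_test:], idx[:n_test]


# Example: 10 hypothetical documents, 4 random 70/30 splits
for train, test in repeated_holdout(10):
    pass  # train with one hyperparameter setting, score it on `test`
```

Unlike k-fold, the test sets of different repetitions may overlap, but each repetition needs only one training run, which keeps a study over many hyperparameter values cheaper.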

I manually annotated a subset of SIGARRA news, producing a Portuguese corpus of 905 annotated news items. Finally, I trained models with each tool on this dataset. More information on the SIGARRA News Corpus is on its page.

Main repository folders

brat: annotation tool and the annotated SIGARRA news
datasets: the datasets used
scripts:
- extra: some extra scripts not directly used
- evaluation: scripts to compute the evaluation of all tools, using the conlleval script
- filter-harem: scripts to manipulate the HAREM dataset
  - harem-to-opennlp: transform HAREM into the OpenNLP input format
  - harem-to-standoff: transform HAREM into standoff format, used in spaCy
  - harem-to-stanford: transform HAREM into CoNLL format, used in Stanford CoreNLP
  - src: source files for scripts
  - run-scripts: commands to run scripts
- filter-sigarra: scripts to manipulate the SIGARRA dataset
  - sigarra-to-opennlp: transform SIGARRA into the OpenNLP input format
  - sigarra-to-standoff: transform SIGARRA into standoff format, used in spaCy
  - src: source files for scripts
  - run-scripts: commands to run scripts
tools:
- nltk: NLTK-related data and scripts
- open-nlp: OpenNLP-related data and scripts
- spacy: spaCy-related data and scripts
- stanford-ner: Stanford CoreNLP-related data and scripts
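
The evaluation scripts listed above rely on conlleval, which scores NER output at the entity level: a predicted entity counts as correct only when both its boundaries and its type match the gold annotation. The sketch below illustrates that idea on BIO-tagged sequences. It is a simplified approximation, not the conlleval script itself, and the function names and the tag labels (`PER`, `LOC`, `ORG`) are placeholders chosen for the example.

```python
def bio_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if start is not None and not continues:
            spans.add((start, i, etype))  # close the currently open entity
            start, etype = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label  # open a new entity (tolerates a stray I- tag)
    return spans


def prf1(gold_tags, pred_tags):
    """Entity-level precision, recall and F1 using exact span and type matching."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```

In this example the person entity matches exactly, while the second entity has the right boundaries but the wrong type, so it counts against both precision and recall.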