innerNULL/fastTextAnnotation
Library for fast text representation and classification.
[TOC]
- fastText paper
- word2vec Parameter Learning Explained: many of the conventional naming choices (such as hidden layer/hidden state) come from this paper, and it also shows the details of the parameter-update process.
- FASTTEXT.ZIP: Compressing Text Classification Models: shows some details about how the model is compressed and how the token dictionary is pruned.
- Supports multi-label classification, where each sample has one or more labels
- Increases the model's robustness by randomly dropping word-tokens in self-supervised training mode
- TODO: Support dynamically dropping low-importance tokens, which keeps the number of tokens from growing too large
fastText supports this case with a slightly tricky strategy.

The multi-label classification use case is handled by the combination of the strategies hidden in `FastText::supervised`, `SoftmaxLoss::forward` and `Loss::findKBest`; the details can be found in the annotations in the code-reading repository.

Briefly speaking, during training, for each sample that has more than one target label, although its label vector should be in multi-hot encoding form, `FastText::supervised` converts it to one-hot encoding form: it randomly chooses one of the target labels, sets that label's corresponding element in the one-hot label vector to 1, and sets all other elements to zero. Which target label's element is set to 1 is controlled by the `targetIndex` parameter of `SoftmaxLoss::forward`.
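As a concrete illustration, here is a minimal sketch (not fastText's actual code; the function name `pickRandomTarget` and the `labels` vector are hypothetical) of reducing a multi-label sample to a single randomly chosen target index:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Sketch: given all target label ids of one sample, pick one of them uniformly
// at random. The softmax update then treats the sample as if its label vector
// were one-hot, with a 1 only at the chosen label's position.
int32_t pickRandomTarget(const std::vector<int32_t>& labels, std::minstd_rand& rng) {
  std::uniform_int_distribution<size_t> dist(0, labels.size() - 1);
  return labels[dist(rng)];  // plays the role of `targetIndex` described above
}

int main() {
  std::minstd_rand rng(42);
  std::vector<int32_t> labels = {3, 7, 11};            // a multi-label sample
  int32_t target = pickRandomTarget(labels, rng);      // one label chosen per update
  (void)target;                                        // would be passed to the loss's forward pass
}
```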
During inference, the top-k most probable predictions are kept in a heap built from a `std::vector<std::pair<real, int32_t>>` combined with several C++ STL heap algorithms; this is done by `Loss::findKBest`. Note also that `Loss::findKBest` filters out any candidate prediction whose "score" or "weight" is smaller than a certain threshold, which is controlled by the `threshold` parameter of `Loss::findKBest`. In this way each prediction sample can be tagged with at most k potential labels.
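The following is a minimal sketch of that top-k pattern, using plain `float` in place of fastText's `real` typedef; it keeps at most k (score, label-id) pairs whose score passes the threshold, using the STL heap algorithms in the spirit of `Loss::findKBest`:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch: keep the k best (score, label) pairs whose score exceeds `threshold`.
// A min-heap on the score lets us evict the current worst entry in O(log k).
void findKBest(int32_t k, float threshold,
               const std::vector<float>& scores,
               std::vector<std::pair<float, int32_t>>& heap) {
  auto cmp = [](const std::pair<float, int32_t>& a,
                const std::pair<float, int32_t>& b) { return a.first > b.first; };
  for (int32_t i = 0; i < static_cast<int32_t>(scores.size()); i++) {
    if (scores[i] < threshold) continue;                       // filter by threshold
    if (static_cast<int32_t>(heap.size()) == k && scores[i] < heap.front().first) {
      continue;                                                // not better than current worst
    }
    heap.emplace_back(scores[i], i);
    std::push_heap(heap.begin(), heap.end(), cmp);
    if (static_cast<int32_t>(heap.size()) > k) {
      std::pop_heap(heap.begin(), heap.end(), cmp);            // drop the smallest score
      heap.pop_back();
    }
  }
}
```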
In `Dictionary::discard`, if a generated random number (passed in via the `rand` parameter) is larger than a per-token threshold, the word-token is dropped during self-supervised training; introducing this randomness improves the model's robustness.
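A minimal sketch of this kind of randomized discarding, assuming a hypothetical per-token keep-probability table `pdiscard` (the exact field name and discard formula in fastText are not shown here):

```cpp
#include <cstdint>
#include <vector>

// Sketch: drop the token when the uniform random draw exceeds its keep
// probability. In self-supervised training this randomly thins out the
// token stream and makes the model more robust.
bool discard(int32_t id, float rand, const std::vector<float>& pdiscard) {
  return rand > pdiscard[id];  // true means "drop this word-token"
}
```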
Here are some tips about the hard-to-understand pieces of the project, which may make reading the code easier.
This is because the gradient calculation for the parameter matrix mapping the hidden layer to the output layer depends on the loss function type, while the calculation for the parameter matrix mapping input tokens to the hidden layer is the same no matter which loss function you choose. So if we put the computation of the hidden-to-output-layer parameter gradients into `Loss::forward` and the computation of the input-to-hidden-layer parameter gradients into `Model::update`, we get the following advantages in the architecture design (see the sketch after this list):

- We can unify every loss function's interface, and when we need to add a new loss function we just develop a class that satisfies this interface.
- `Loss::forward` generates intermediate results that are helpful for computing the final input-to-hidden-layer parameter gradients, so we can cache these results and improve computational efficiency.
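A minimal sketch of this split, with simplified signatures that are not the library's exact API: every loss implements a common `forward` that returns the loss value and accumulates the gradient with respect to the hidden state, while a single `update` handles the loss-independent input-to-hidden part.

```cpp
#include <vector>

// Sketch of the architectural split described above (simplified signatures,
// not fastText's exact API).
struct Loss {
  virtual ~Loss() = default;
  // Computes the loss for `target` and accumulates the gradient w.r.t. the
  // hidden state into `grad`; this part is loss-specific.
  virtual float forward(const std::vector<float>& hidden,
                        int target,
                        std::vector<float>& grad) = 0;
};

struct Model {
  Loss* loss = nullptr;
  // Loss-independent part: propagate `grad` back to the input token vectors.
  void update(const std::vector<int>& input, int target,
              std::vector<float>& hidden, std::vector<float>& grad) {
    float l = loss->forward(hidden, target, grad);  // loss-specific gradients
    (void)l;
    // ... add `grad` (scaled by the learning rate) to each input token's row;
    // this step is identical no matter which Loss subclass is plugged in.
  }
};
```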
In fastText there are two kinds of tokens, word-tokens and label-tokens; they can be distinguished by the fact that a label-token carries the "__label__" prefix.

Actually, in fastText "words" sometimes means "tokens", which includes both words and labels (the latter carrying the "__label__" prefix). For example, `Dictionary::word2int_` also saves the ids of labels, and `Dictionary::words_` also saves the `entry` objects of labels; word entries and label entries can be distinguished by `entry::type`.
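As a small illustration (a sketch, not fastText's exact code), the convention can be checked with a simple prefix test:

```cpp
#include <string>

// A token is a label-token when it carries the "__label__" prefix;
// otherwise it is an ordinary word-token.
bool isLabelToken(const std::string& token) {
  return token.rfind("__label__", 0) == 0;  // prefix check at position 0
}
```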
How `Dictionary::find`, `Dictionary::add`, `Dictionary::word2int_` and `Dictionary::words_` work together to map raw text tokens to a token's `entry` object
Here are the roles of these related methods and attributes:

- `Dictionary::find`: allocates each token a collision-free id using a remainder-and-shift strategy; I call this id the token-id.
- `Dictionary::add`: during the building of the `Dictionary` object, when a new token is met, it updates `Dictionary::word2int_` and `Dictionary::words_` based on the new token's id, so that next time the token's raw text can be mapped to the token's `entry` object without recalculation.
- `Dictionary::word2int_`: each element's index represents a token's id, and each element's value is that token's corresponding index in the token vocab `Dictionary::words_`; with that index we can look up the token's detail info from `Dictionary::words_`.
- `Dictionary::words_`: this is the token vocab, a `std::vector` whose elements are `entry` objects holding a token's detail info such as its text, its type (word-token or label-token), its char n-grams, its appearance count, etc.
Each token's info is built and put into `Dictionary::words_` during the `Dictionary` building with `Dictionary::add`. This involves several steps (see the sketch after the list):

- Judging whether the current token is a new one; if it is, executing the following steps.
- Getting the token's id from its raw text via `Dictionary::find`.
- Building the new token's `entry` object and pushing it back into the token vocab `Dictionary::words_`.
- Using `Dictionary::word2int_` to record this new token's index in the token vocab `Dictionary::words_`: the element of `Dictionary::word2int_` whose index equals the current token-id is set to the token's corresponding index in `Dictionary::words_`.
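To make these steps concrete, here is a simplified sketch of that build path; the table size, hashing, and field layout are illustrative rather than the library's exact implementation:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Simplified sketch of the Dictionary build path described above.
struct entry {
  std::string word;
  int64_t count = 0;
};

struct Dictionary {
  static constexpr int32_t TABLE_SIZE = 1000003;  // illustrative hash-table size
  std::vector<int32_t> word2int_ = std::vector<int32_t>(TABLE_SIZE, -1);
  std::vector<entry> words_;

  // "remainder and shift": hash modulo table size, then probe forward until we
  // land on an empty slot or on the slot already holding this token.
  int32_t find(const std::string& w) const {
    int32_t id = static_cast<int32_t>(std::hash<std::string>{}(w) % TABLE_SIZE);
    while (word2int_[id] != -1 && words_[word2int_[id]].word != w) {
      id = (id + 1) % TABLE_SIZE;
    }
    return id;  // the token-id
  }

  void add(const std::string& w) {
    int32_t id = find(w);
    if (word2int_[id] == -1) {                                   // a new token
      words_.push_back({w, 1});                                  // build its entry in the token vocab
      word2int_[id] = static_cast<int32_t>(words_.size()) - 1;   // remember its vocab index
    } else {
      words_[word2int_[id]].count++;                             // seen before: just bump the count
    }
  }
};
```

The probing loop in `find` is what the note above calls the remainder-and-shift strategy: the hash remainder picks a starting slot, and collisions shift forward to the next free slot.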
After building the `Dictionary`, when we need to get a raw token's detail info (for example in the training/inference stage), we just need to:

- Map the token's raw text to its token-id.
- Get the token-vocab index by indexing into `Dictionary::word2int_` with the token-id.
- Get the token's `entry` object by indexing into `Dictionary::words_` with the token-vocab index.
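Continuing the hypothetical `Dictionary` sketch above, the lookup path after building reduces to three index hops:

```cpp
// Continuation of the hypothetical Dictionary sketch above: the lookup path
// used at training/inference time, once the dictionary has been built.
const entry& getEntry(const Dictionary& dict, const std::string& w) {
  int32_t id = dict.find(w);          // 1. raw text  -> token-id
  int32_t idx = dict.word2int_[id];   // 2. token-id  -> index in the token vocab
  return dict.words_[idx];            // 3. vocab idx -> the token's entry object
}
```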
Some variables, for example `FastText::tokenCount_` and `FastText::loss_`, are defined wrapped in `std::atomic<>`, since these variables can be written and read by all threads and therefore must be thread-safe. `FastText::tokenCount_` is responsible for counting the global number of word and label tokens processed by all threads; each thread updates `tokenCount_` to `tokenCount_ + 1` after one thread-local sample has been processed by that thread, and this update consists of both a read and a write.
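A minimal sketch of that pattern (thread count and variable names are illustrative): each worker thread bumps a shared `std::atomic` counter after finishing a sample, and the read-modify-write stays consistent without a mutex.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

int main() {
  // Shared across all training threads, in the spirit of FastText::tokenCount_.
  std::atomic<int64_t> tokenCount{0};

  auto worker = [&tokenCount]() {
    for (int i = 0; i < 1000; i++) {
      // ... process one thread-local sample ...
      tokenCount.fetch_add(1);  // atomic read-modify-write: no lock needed
    }
  };

  std::vector<std::thread> threads;
  for (int t = 0; t < 4; t++) threads.emplace_back(worker);
  for (auto& th : threads) th.join();
  // tokenCount.load() == 4000 here, regardless of thread interleaving.
}
```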
- Figure out the meaning of the gradient-normalizing technique used in `Model::update`.
- Think about whether a maximum value should be set for `Model::State::nexamples_`, in case the training data is huge or the model is continuously trained with incremental data.
- Decide whether a minimum learning rate value should be added in `FastText::progressInfo`.