innerNULL/fastTextAnnotation
Library for fast text representation and classification.
[TOC]
- fastText paper
- word2vec Parameter Learning Explained: many of the conventional naming choices (such as hidden layer/hidden state) come from this paper, and it also shows the details of the parameter-update process.
- FASTTEXT.ZIP: Compressing Text Classification Models: shows some details about how the model is compressed and how the token dictionary is pruned.
- Supports multi-label classification, where each sample has one or more labels
- Increases the model's robustness by randomly dropping word-tokens in self-supervised training mode
- TODO: Support dynamically dropping low-importance tokens, which keeps the number of tokens from growing too large
fastText supports this case with a slightly tricky strategy.

The multi-label classification use case is handled by the combination of the strategies hidden in `FastText::supervised`, `SoftmaxLoss::forward` and `Loss::findKBest`; the details can be found in the annotations in the code-reading repository.

Briefly speaking, during training, for each sample that has more than one target label, although its label vector should be in multi-hot encoding form, `FastText::supervised` converts it to one-hot encoding form: it randomly chooses one of the target labels, sets that label's corresponding element in the one-hot label vector to 1, and sets all other elements to zero. Which target label's element is set to 1 is controlled by the `targetIndex` parameter of `SoftmaxLoss::forward`.
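As a concrete illustration, here is a minimal sketch (not fastText's actual code; the function name `pickRandomTarget` and the `labels` vector are hypothetical) of reducing a multi-label sample to a single randomly chosen target index:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Sketch: given all target label ids of one sample, pick one of them uniformly
// at random. The softmax update then treats the sample as if its label vector
// were one-hot, with a 1 only at the chosen label's position.
int32_t pickRandomTarget(const std::vector<int32_t>& labels, std::minstd_rand& rng) {
  std::uniform_int_distribution<size_t> dist(0, labels.size() - 1);
  return labels[dist(rng)];  // plays the role of `targetIndex` described above
}

int main() {
  std::minstd_rand rng(42);
  std::vector<int32_t> labels = {3, 7, 11};            // a multi-label sample
  int32_t target = pickRandomTarget(labels, rng);      // one label chosen per update
  (void)target;                                        // would be passed to the loss's forward pass
}
```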
During inference, the top-k most probable predictions are kept in a heap built from a `std::vector<std::pair<real, int32_t>>` combined with several C++ STL heap algorithms; this is done by `Loss::findKBest`. Note also that `Loss::findKBest` filters out any candidate prediction whose "score" or "weight" is smaller than a certain threshold, which is controlled by the `threshold` parameter of `Loss::findKBest`. In this way each prediction sample can be tagged with at most k potential labels.
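The following is a minimal sketch of that top-k pattern, using plain `float` in place of fastText's `real` typedef; it keeps at most k (score, label-id) pairs whose score passes the threshold, using the STL heap algorithms in the spirit of `Loss::findKBest`:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch: keep the k best (score, label) pairs whose score exceeds `threshold`.
// A min-heap on the score lets us evict the current worst entry in O(log k).
void findKBest(int32_t k, float threshold,
               const std::vector<float>& scores,
               std::vector<std::pair<float, int32_t>>& heap) {
  auto cmp = [](const std::pair<float, int32_t>& a,
                const std::pair<float, int32_t>& b) { return a.first > b.first; };
  for (int32_t i = 0; i < static_cast<int32_t>(scores.size()); i++) {
    if (scores[i] < threshold) continue;                       // filter by threshold
    if (static_cast<int32_t>(heap.size()) == k && scores[i] < heap.front().first) {
      continue;                                                // not better than current worst
    }
    heap.emplace_back(scores[i], i);
    std::push_heap(heap.begin(), heap.end(), cmp);
    if (static_cast<int32_t>(heap.size()) > k) {
      std::pop_heap(heap.begin(), heap.end(), cmp);            // drop the smallest score
      heap.pop_back();
    }
  }
}
```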
In `Dictionary::discard`, if a generated random number (passed in via the `rand` parameter) is larger than a per-token threshold, the word-token is dropped during self-supervised training; introducing this randomness improves the model's robustness.
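A minimal sketch of this kind of randomized discarding, assuming a hypothetical per-token keep-probability table `pdiscard` (the exact field name and discard formula in fastText are not shown here):

```cpp
#include <cstdint>
#include <vector>

// Sketch: drop the token when the uniform random draw exceeds its keep
// probability. In self-supervised training this randomly thins out the
// token stream and makes the model more robust.
bool discard(int32_t id, float rand, const std::vector<float>& pdiscard) {
  return rand > pdiscard[id];  // true means "drop this word-token"
}
```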
Here are some tips about the hard-to-understand pieces of the project, which may make reading the code easier.
This is because the gradient calculation for the parameter matrix mapping the hidden layer to the output layer depends on the loss function type, while the calculation for the parameter matrix mapping input tokens to the hidden layer is the same no matter which loss function you choose. So if we put the computation of the hidden-to-output-layer parameter gradients into `Loss::forward` and the computation of the input-to-hidden-layer parameter gradients into `Model::update`, we get the following advantages in the architecture design (see the sketch after this list):

- We can unify every loss function's interface, and when we need to add a new loss function we just develop a class that satisfies this interface.
- `Loss::forward` generates intermediate results that are helpful for computing the final input-to-hidden-layer parameter gradients, so we can cache these results and improve computational efficiency.
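A minimal sketch of this split, with simplified signatures that are not the library's exact API: every loss implements a common `forward` that returns the loss value and accumulates the gradient with respect to the hidden state, while a single `update` handles the loss-independent input-to-hidden part.

```cpp
#include <vector>

// Sketch of the architectural split described above (simplified signatures,
// not fastText's exact API).
struct Loss {
  virtual ~Loss() = default;
  // Computes the loss for `target` and accumulates the gradient w.r.t. the
  // hidden state into `grad`; this part is loss-specific.
  virtual float forward(const std::vector<float>& hidden,
                        int target,
                        std::vector<float>& grad) = 0;
};

struct Model {
  Loss* loss = nullptr;
  // Loss-independent part: propagate `grad` back to the input token vectors.
  void update(const std::vector<int>& input, int target,
              std::vector<float>& hidden, std::vector<float>& grad) {
    float l = loss->forward(hidden, target, grad);  // loss-specific gradients
    (void)l;
    // ... add `grad` (scaled by the learning rate) to each input token's row;
    // this step is identical no matter which Loss subclass is plugged in.
  }
};
```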
In fastText there are two kinds of tokens, word-tokens and label-tokens; they can be distinguished by the fact that a label-token carries the "__label__" prefix.

Actually, in fastText "words" sometimes means "tokens", which includes both words and labels (the latter carrying the "__label__" prefix). For example, `Dictionary::word2int_` also saves the ids of labels, and `Dictionary::words_` also saves the `entry` objects of labels; word entries and label entries can be distinguished by `entry::type`.
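As a small illustration (a sketch, not fastText's exact code), the convention can be checked with a simple prefix test:

```cpp
#include <string>

// A token is a label-token when it carries the "__label__" prefix;
// otherwise it is an ordinary word-token.
bool isLabelToken(const std::string& token) {
  return token.rfind("__label__", 0) == 0;  // prefix check at position 0
}
```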
How `Dictionary::find`, `Dictionary::add`, `Dictionary::word2int_` and `Dictionary::words_` work together to map raw text tokens to a token's `entry` object
Here are the roles of these related methods and attributes:

- `Dictionary::find`: allocates each token a collision-free id using a remainder-and-shift strategy; I call this id the token-id.
- `Dictionary::add`: during the building of the `Dictionary` object, when a new token is met, it updates `Dictionary::word2int_` and `Dictionary::words_` based on the new token's id, so that next time the token's raw text can be mapped to the token's `entry` object without recalculation.
- `Dictionary::word2int_`: each element's index represents a token's id, and each element's value is that token's corresponding index in the token vocab `Dictionary::words_`; with that index we can look up the token's detail info from `Dictionary::words_`.
- `Dictionary::words_`: this is the token vocab, a `std::vector` whose elements are `entry` objects holding a token's detail info such as its text, its type (word-token or label-token), its char n-grams, its appearance count, etc.
Each token's info is built and put into `Dictionary::words_` during the `Dictionary` building with `Dictionary::add`. This involves several steps (see the sketch after the list):

- Judging whether the current token is a new one; if it is, executing the following steps.
- Getting the token's id from its raw text via `Dictionary::find`.
- Building the new token's `entry` object and pushing it back into the token vocab `Dictionary::words_`.
- Using `Dictionary::word2int_` to record this new token's index in the token vocab `Dictionary::words_`: the element of `Dictionary::word2int_` whose index equals the current token-id is set to the token's corresponding index in `Dictionary::words_`.
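To make these steps concrete, here is a simplified sketch of that build path; the table size, hashing, and field layout are illustrative rather than the library's exact implementation:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Simplified sketch of the Dictionary build path described above.
struct entry {
  std::string word;
  int64_t count = 0;
};

struct Dictionary {
  static constexpr int32_t TABLE_SIZE = 1000003;  // illustrative hash-table size
  std::vector<int32_t> word2int_ = std::vector<int32_t>(TABLE_SIZE, -1);
  std::vector<entry> words_;

  // "remainder and shift": hash modulo table size, then probe forward until we
  // land on an empty slot or on the slot already holding this token.
  int32_t find(const std::string& w) const {
    int32_t id = static_cast<int32_t>(std::hash<std::string>{}(w) % TABLE_SIZE);
    while (word2int_[id] != -1 && words_[word2int_[id]].word != w) {
      id = (id + 1) % TABLE_SIZE;
    }
    return id;  // the token-id
  }

  void add(const std::string& w) {
    int32_t id = find(w);
    if (word2int_[id] == -1) {                                   // a new token
      words_.push_back({w, 1});                                  // build its entry in the token vocab
      word2int_[id] = static_cast<int32_t>(words_.size()) - 1;   // remember its vocab index
    } else {
      words_[word2int_[id]].count++;                             // seen before: just bump the count
    }
  }
};
```

The probing loop in `find` is what the note above calls the remainder-and-shift strategy: the hash remainder picks a starting slot, and collisions shift forward to the next free slot.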
After building the `Dictionary`, when we need to get a raw token's detail info (for example in the training/inference stage), we just need to:

- Map the token's raw text to its token-id.
- Get the token-vocab index by indexing into `Dictionary::word2int_` with the token-id.
- Get the token's `entry` object by indexing into `Dictionary::words_` with the token-vocab index.
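Continuing the hypothetical `Dictionary` sketch above, the lookup path after building reduces to three index hops:

```cpp
// Continuation of the hypothetical Dictionary sketch above: the lookup path
// used at training/inference time, once the dictionary has been built.
const entry& getEntry(const Dictionary& dict, const std::string& w) {
  int32_t id = dict.find(w);          // 1. raw text  -> token-id
  int32_t idx = dict.word2int_[id];   // 2. token-id  -> index in the token vocab
  return dict.words_[idx];            // 3. vocab idx -> the token's entry object
}
```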
Some variables, for example `FastText::tokenCount_` and `FastText::loss_`, are defined wrapped in `std::atomic<>`, since these variables can be written and read by all threads and therefore must be thread-safe. `FastText::tokenCount_` is responsible for counting the global number of word and label tokens processed by all threads; each thread updates `tokenCount_` to `tokenCount_ + 1` after one thread-local sample has been processed by that thread, and this update consists of both a read and a write.
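A minimal sketch of that pattern (thread count and variable names are illustrative): each worker thread bumps a shared `std::atomic` counter after finishing a sample, and the read-modify-write stays consistent without a mutex.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

int main() {
  // Shared across all training threads, in the spirit of FastText::tokenCount_.
  std::atomic<int64_t> tokenCount{0};

  auto worker = [&tokenCount]() {
    for (int i = 0; i < 1000; i++) {
      // ... process one thread-local sample ...
      tokenCount.fetch_add(1);  // atomic read-modify-write: no lock needed
    }
  };

  std::vector<std::thread> threads;
  for (int t = 0; t < 4; t++) threads.emplace_back(worker);
  for (auto& th : threads) th.join();
  // tokenCount.load() == 4000 here, regardless of thread interleaving.
}
```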
- Figure out the meaning of the gradient-normalizing technique used in `Model::update`.
- Think about whether a maximum value should be set for `Model::State::nexamples_`, in case the training data is huge or the model is continuously trained with incremental data.
- Decide whether a minimum learning rate value should be added in `FastText::progressInfo`.