innerNULL/fastTextAnnotation

 
 

[TOC]

Introduction

Best References

  • fastText paper
  • word2vec Parameter Learning Explained: many conventional names (such as hidden layer / hidden state) come from this paper, and it also shows the details of the parameter-updating process.
  • FASTTEXT.ZIP: COMPRESSING TEXT CLASSIFICATION MODELS: shows some details about how to compress the model and how to prune the token dictionary.

Tricks

  • Supports multi-label classification where each sample has one or more labels
  • Increases model robustness by randomly dropping word tokens in self-supervised training mode
  • TODO: support dynamically dropping less important tokens, which keeps the token dictionary from growing too large

Supports multi-label classification where each sample has one or more labels

fastText supports this case with a slightly tricky strategy.

The multi-label classification use case is handled by a combination of strategies hidden in FastText::supervised, SoftmaxLoss::forward and Loss::findKBest; the details can be found in the annotations in the code reading repository.

Briefly speaking, during training, for each sample that has more than one target label, although its label vector should be in multi-hot encoding form, FastText::supervised converts it to one-hot form: one of the target labels is chosen at random, the corresponding element of the one-hot label vector is set to 1, and all other elements are set to zero. Which target label's element is set to 1 is controlled by the targetIndex parameter of SoftmaxLoss::forward.
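For illustration, here is a minimal C++ sketch of the idea of randomly picking a single target label for one update; the names pickTargetIndex, labels and rng are assumptions, not fastText's exact code:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch: choose one target label uniformly at random from the
// sample's label set, which effectively turns a multi-hot label vector into a
// one-hot one for this particular update.
int32_t pickTargetIndex(const std::vector<int32_t>& labels, std::minstd_rand& rng) {
  std::uniform_int_distribution<size_t> dist(0, labels.size() - 1);
  return labels[dist(rng)];  // the other target labels are ignored for this update
}
```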

During inference process, the top-k most possible prediction results will be saved in aheap structure combined bystd::vector< std::pair<real, int32_t> > and several cpp stl heap algorithms, this process is executed byLoss::findKBest. But not that sample,Loss::findKBest will filter all potentials prediction results with "score" or "weight" smaller than certain threshold, this threshold is controled by the parameterthreshold ofLoss::findKBest. In this way we can tagging each prediction sample at most k potential labels.
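A rough, self-contained sketch of this top-k idea (assumed names, not the exact Loss::findKBest implementation): a fixed-size min-heap over std::vector<std::pair<real, int32_t>> maintained with std::push_heap / std::pop_heap, with the score threshold applied before insertion.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using real = float;
using Predictions = std::vector<std::pair<real, int32_t>>;  // (score, label id)

// Comparator that makes std::*_heap maintain a min-heap, so heap.front()
// is always the currently weakest kept prediction.
static bool scoreGreater(const std::pair<real, int32_t>& a,
                         const std::pair<real, int32_t>& b) {
  return a.first > b.first;
}

void findKBestSketch(int32_t k, real threshold,
                     const std::vector<real>& scores, Predictions& heap) {
  for (int32_t i = 0; i < static_cast<int32_t>(scores.size()); i++) {
    if (scores[i] < threshold) continue;  // filter out weak predictions
    if (static_cast<int32_t>(heap.size()) == k && scores[i] < heap.front().first) {
      continue;                           // worse than the weakest kept one
    }
    heap.emplace_back(scores[i], i);
    std::push_heap(heap.begin(), heap.end(), scoreGreater);
    if (static_cast<int32_t>(heap.size()) > k) {
      std::pop_heap(heap.begin(), heap.end(), scoreGreater);  // evict the weakest
      heap.pop_back();
    }
  }
}
```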

Increases model robustness by randomly dropping word tokens in self-supervised training mode

In Dictionary::discard, if the random number passed in via the rand parameter is larger than the token's threshold, that word token is dropped during the self-supervised training process; this injection of randomness improves the model's robustness.
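A sketch of how such frequency-based sub-sampling can be implemented; the formula and the names pdiscard_ and t follow my reading of the code and are assumptions, not guaranteed to match fastText exactly:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

using real = float;

// Frequent tokens get a small keep probability; a token is dropped whenever a
// uniform random draw `rand` exceeds that probability.
struct DiscardSketch {
  std::vector<real> pdiscard_;  // per-token keep probability

  void init(const std::vector<int64_t>& counts, int64_t ntokens, real t) {
    pdiscard_.resize(counts.size());
    for (size_t i = 0; i < counts.size(); i++) {
      real f = static_cast<real>(counts[i]) / static_cast<real>(ntokens);
      pdiscard_[i] = std::sqrt(t / f) + t / f;  // higher frequency -> lower keep probability
    }
  }

  bool discard(int32_t id, real rand) const {
    return rand > pdiscard_[id];  // drop this token for the current training line
  }
};
```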

Tips

Here are some tips about the hard-to-understand pieces of the project, which may make reading the code easier.

The parameter-updating process is split between Model::update and Loss::forward

This is because the gradient calculation for the parameter matrix mapping the hidden layer to the output layer depends on the loss function type, while the calculation for the parameter matrix mapping input tokens to the hidden layer is the same no matter which loss function you choose. So if we put the computation of the hidden-to-output-layer parameter gradients into Loss::forward and the computation of the input-to-hidden-layer parameter gradients into Model::update, we get the following architectural advantages (a rough sketch follows the list):

  • We can unify the interface of every loss function; when we need to add a new loss, we just develop a class that satisfies the interface requirements.
  • Loss::forward generates intermediate results that are useful for computing the final input-to-hidden-layer parameter gradients, so we can cache them and improve computational efficiency.
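A rough architectural sketch of this split, with assumed names and heavily simplified signatures (not fastText's real interfaces):

```cpp
#include <cstdint>
#include <vector>

using Vec = std::vector<float>;

// Everything that depends on the loss type lives behind the loss interface;
// the model only does the loss-independent input-to-hidden bookkeeping.
struct LossSketch {
  // Updates the hidden-to-output weights for `targetIndex` and accumulates the
  // gradient w.r.t. the hidden vector into `gradHidden`; returns the loss value.
  virtual float forward(int32_t targetIndex, const Vec& hidden, Vec& gradHidden, float lr) = 0;
  virtual ~LossSketch() = default;
};

struct ModelSketch {
  std::vector<Vec> inputEmb;   // one embedding row per input token id
  LossSketch* loss = nullptr;  // softmax, hierarchical softmax, negative sampling, ...

  void update(const std::vector<int32_t>& input, int32_t targetIndex, float lr) {
    const size_t dim = inputEmb.front().size();
    Vec hidden(dim, 0.0f), gradHidden(dim, 0.0f);
    for (int32_t id : input)  // hidden = mean of the input token embeddings
      for (size_t j = 0; j < dim; j++) hidden[j] += inputEmb[id][j] / input.size();
    loss->forward(targetIndex, hidden, gradHidden, lr);  // loss-specific part
    for (int32_t id : input)  // spread the hidden-vector gradient back onto the inputs
      for (size_t j = 0; j < dim; j++) inputEmb[id][j] += gradHidden[j] / input.size();
  }
};
```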

Naming of "word" may not only refer to words, but also to labels

In fastText there are two kinds of tokens, word tokens and label tokens; they can be distinguished by the "__label__" prefix carried by label tokens.
Actually, in fastText "words" sometimes means "tokens", which covers both words and labels (tokens carrying the "__label__" prefix). For example, Dictionary::word2int_ also stores the ids of labels, and Dictionary::words_ also stores the entry objects of labels; a word entry and a label entry can be distinguished by entry::type.

How Dictionary::find, Dictionary::add, Dictionary::word2int_ and Dictionary::words_ work together to map a raw text token to its entry object

Here are the roles of the relevant methods and attributes (a sketch of find follows the list):

  • Dictionary::find: allocates each token a collision-free id using a remainder-and-shift strategy (hash modulo the table size, then shift forward on collision); I call this id the token-id.
  • Dictionary::add: while the Dictionary object is being built, whenever a new token is met it updates Dictionary::word2int_ and Dictionary::words_ based on the new token's id, making sure the token's raw text can be mapped to its entry object next time without recalculation.
  • Dictionary::word2int_: each element's index represents a token's id, and each element's value is that token's index in the token vocab Dictionary::words_; with that index we can look up the token's detailed info in Dictionary::words_.
  • Dictionary::words_: this is the token vocab. It is an std::vector, and each element is an entry object holding a token's detailed info such as the token text, the token type (word token or label token), the token's char n-grams, the token's appearance count, etc.
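A sketch of the id-allocation ("remainder and shift") idea behind Dictionary::find, reusing the entry struct sketched above; the hash constants and the MAX_VOCAB_SIZE value are assumptions:

```cpp
#include <cstdint>
#include <string>
#include <vector>

constexpr int32_t MAX_VOCAB_SIZE = 30000000;  // assumed table size

uint32_t fnvHash(const std::string& s) {      // FNV-1a style string hash
  uint32_t h = 2166136261u;
  for (char c : s) { h ^= static_cast<uint8_t>(c); h *= 16777619u; }
  return h;
}

// word2int_: token-id -> index into words_ (or -1 when the slot is empty).
int32_t findSketch(const std::vector<int32_t>& word2int_,
                   const std::vector<entry>& words_, const std::string& w) {
  int32_t h = static_cast<int32_t>(fnvHash(w) % MAX_VOCAB_SIZE);
  while (word2int_[h] != -1 && words_[word2int_[h]].word != w) {
    h = (h + 1) % MAX_VOCAB_SIZE;             // shift forward on collision
  }
  return h;                                   // this is the token-id
}
```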

Each token's info is built and put into Dictionary::words_ while the Dictionary is being built, via Dictionary::add. This involves several steps (see the sketch after the list):

  • Judge whether the current token is a new one; if it is, execute the following steps.
  • Get the token's id from its raw text via Dictionary::find.
  • Build the new token's entry object and push it back into the token vocab Dictionary::words_.
  • Use Dictionary::word2int_ to record the new token's index in the token vocab Dictionary::words_: the element of Dictionary::word2int_ whose index equals the current token-id is set to the token's index in Dictionary::words_.
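A matching sketch of this add path, again with assumed names and building on the findSketch and entry pieces above:

```cpp
void addSketch(std::vector<int32_t>& word2int_, std::vector<entry>& words_,
               const std::string& w, entry_type type) {
  int32_t h = findSketch(word2int_, words_, w);  // token-id from the raw text
  if (word2int_[h] == -1) {                      // first time we see this token
    entry e;
    e.word = w;
    e.count = 1;
    e.type = type;
    words_.push_back(e);                         // append to the token vocab
    word2int_[h] = static_cast<int32_t>(words_.size()) - 1;  // remember where it lives
  } else {
    words_[word2int_[h]].count++;                // known token: just bump the count
  }
}
```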

After the Dictionary is built, when we need a raw token's detailed info (for example in the training/inference stage), we just need to (see the sketch after the list):

  • Map the token's raw text to its token-id.
  • Get the token-vocab index by indexing Dictionary::word2int_ with the token-id.
  • Get the token's entry object by indexing Dictionary::words_ with the token-vocab index.
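Putting it together, the read path after building is just two array lookups; a hedged sketch using the pieces above:

```cpp
const entry& lookupSketch(const std::vector<int32_t>& word2int_,
                          const std::vector<entry>& words_, const std::string& w) {
  int32_t tokenId = findSketch(word2int_, words_, w);  // raw text -> token-id
  int32_t vocabIndex = word2int_[tokenId];             // token-id -> index in words_
  return words_[vocabIndex];                           // entry with the token's details
}
```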

Some thread-shared variables are wrapped in std::atomic<> to guarantee thread safety

Some variables, for example FastText::tokenCount_ and FastText::loss_, are defined wrapped in std::atomic<>, since these variables can be written and read by all threads and therefore must be thread-safe.
FastText::tokenCount_ counts the word and label tokens processed globally by all threads; each thread updates tokenCount_ to tokenCount_ + 1 after it finishes processing one thread-local sample, an operation that involves both reading and writing.
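A minimal, self-contained sketch of this pattern (illustrative only, not the project's actual training loop): many threads incrementing one shared counter, made safe by std::atomic<> instead of a mutex.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<int64_t> tokenCount_{0};  // shared by all training threads

void trainThreadSketch(int64_t samples) {
  for (int64_t i = 0; i < samples; i++) {
    // ... process one thread-local sample ...
    tokenCount_ += 1;  // atomic read-modify-write, safe without a lock
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; i++) threads.emplace_back(trainThreadSketch, 1000);
  for (auto& t : threads) t.join();
  return tokenCount_ == 4000 ? 0 : 1;  // every increment is accounted for
}
```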

TODO

  • Figure out the purpose of the gradient-normalizing technique used in Model::update.
  • Think about whether a maximum value should be set for Model::State::nexamples_, in case the training data is huge or the model is continuously trained with incremental data.
  • Decide whether a minimum learning-rate value should be added in FastText::progressInfo.
