aymara/fastText (public), forked from facebookresearch/fastText

Library for fast text representation and classification.

fastText

fastText is a library for efficient learning of word representations and sentence classification.

Resources

Models

Recent state-of-the-art English word vectors.
Word vectors for 157 languages trained on Wikipedia and Crawl.
Models for language identification and various supervised tasks.

Supplementary data

The preprocessed YFCC100M data used in [2].

FAQ

You can find answers to frequently asked questions on our website.

Cheatsheet

We also provide a cheatsheet full of useful one-liners.

Requirements

We continuously build and test our library, CLI and Python bindings under various Docker images using CircleCI.

Generally, fastText builds on modern Mac OS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support. These include:

(g++-4.7.2 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need a working make. If you want to use cmake, you need at least version 2.8.9.

One of the oldest distributions we successfully built and tested the CLI under is Debian jessie.

For the word-similarity evaluation script you will need:

Python 2.6 or newer
NumPy & SciPy

For the python bindings (see the subdirectory python) you will need:

Python version 2.7 or >=3.4
NumPy & SciPy
pybind11

One of the oldest distributions we successfully built and tested the Python bindings under is Debian jessie.

If these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.

Building fastText

We discuss building the latest stable version of fastText.

Getting the source code

You can find our latest stable release in the usual place.

There is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. You might want to use this if you are a developer or power-user.

Building fastText using make (preferred)

$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make

This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Building fastText using cmake

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

This will create the fasttext binary and also all relevant libraries (shared, static, PIC).

Building fastText for Python

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

For further information and an introduction, see python/README.md.

Example use cases

This library has two main use cases: word representation learning and text classification. These were described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters. The binary file can be used later to compute word vectors or to restart the optimization.
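To make the "character n-grams from 3 to 6 characters" default more concrete, here is a minimal Python sketch of subword extraction as described in [1], where each word is wrapped in the boundary markers < and > before n-grams are taken. The function name char_ngrams and its parameters are illustrative only; this is not fastText's own code, which works on UTF-8 bytes and hashes the n-grams into buckets.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` with lengths minn..maxn.

    The word is first wrapped in '<' and '>' boundary markers, following
    the subword model of [1]. Illustrative sketch: it operates on Python
    characters and does no hashing.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


if __name__ == "__main__":
    # Trigrams only, to keep the output short.
    print(char_ngrams("where", minn=3, maxn=3))
```

With minn=3 and maxn=3, the word "where" yields the trigrams <wh, whe, her, ere and re>. The boundary markers let the model distinguish a prefix or suffix from the same characters in the middle of a word.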

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command:

$ ./fasttext print-word-vectors model.bin < queries.txt

This will output word vectors to the standard output, one vector per line. This can also be used with pipes:

$ cat queries.txt | ./fasttext print-word-vectors model.bin
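The printed vectors can be post-processed with a few lines of Python. The sketch below assumes each output line has the form "word v1 v2 ... vd" separated by whitespace. That layout is an assumption about the output, and the helper names parse_vector_line and cosine, as well as the sample values, are made up for illustration.

```python
import math


def parse_vector_line(line):
    """Split a 'word v1 v2 ...' line into (word, list of floats).

    Assumes whitespace-separated fields; trailing whitespace is ignored.
    """
    parts = line.split()
    return parts[0], [float(x) for x in parts[1:]]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Made-up lines standing in for the output of print-word-vectors.
    sample = ["cat 0.1 0.3 -0.2 ", "dog 0.2 0.25 -0.1 "]
    vectors = dict(parse_vector_line(line) for line in sample)
    print(cosine(vectors["cat"], vectors["dog"]))
```

In practice the lines would be read from the standard output of the command above, for example by piping it into a script that iterates over sys.stdin.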

See the provided scripts for an example. For instance, running:

$ ./word-vector-example.sh

will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:

$ ./fasttext supervised -input train.txt -output model

where train.txt is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words that are prefixed by the string __label__. This will output two files: model.bin and model.vec. Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

$ ./fasttext test model.bin test.txt k

The argument k is optional, and is equal to 1 by default.
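As a rough illustration of the __label__ input format and of what P@k and R@k measure, the following Python sketch separates labels from text and computes both metrics from ranked predictions. It assumes the metrics are pooled over all examples (correct predictions divided by the total number of predictions, respectively gold labels); the function names and the toy data are hypothetical, and this is not the code the test command runs.

```python
LABEL_PREFIX = "__label__"


def split_labels(line, prefix=LABEL_PREFIX):
    """Separate a training line into (labels, text).

    Every whitespace-separated token starting with `prefix` is a label.
    """
    tokens = line.split()
    labels = [t for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix)]
    return labels, " ".join(words)


def precision_recall_at_k(gold, predicted, k=1):
    """Compute (P@k, R@k) pooled over all examples.

    gold:      list of sets of true labels, one set per example
    predicted: list of label lists ranked from most to least likely
    """
    correct = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    labels, text = split_labels("__label__positive __label__food great pizza")
    print(labels, text)
    gold = [{"__label__positive", "__label__food"}, {"__label__negative"}]
    predicted = [["__label__positive", "__label__travel"],
                 ["__label__positive", "__label__negative"]]
    print(precision_recall_at_k(gold, predicted, k=1))
    print(precision_recall_at_k(gold, predicted, k=2))
```

Raising k can only increase recall, since more of the true labels have a chance to appear among the predictions, while precision may drop when the extra predictions are wrong.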

In order to obtain the k most likely labels for a piece of text, use:

$ ./fasttext predict model.bin test.txt k

or use predict-prob to also get the probability for each label:

$ ./fasttext predict-prob model.bin test.txt k

where test.txt contains a piece of text to classify per line. Doing so will print to the standard output the k most likely labels for each line. The argument k is optional, and equal to 1 by default. See classification-example.sh for an example use case. In order to reproduce results from the paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.
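The predictions printed by predict-prob can be turned into structured data with a small parser. The sketch below assumes each output line alternates labels and probabilities, as in "__label__a 0.9 __label__b 0.1". That layout is an assumption, and parse_predict_prob is a hypothetical helper, not part of fastText.

```python
def parse_predict_prob(line, prefix="__label__"):
    """Parse one line of alternating 'label probability' tokens.

    Returns a list of (label_without_prefix, probability) pairs in the
    order they appear on the line.
    """
    tokens = line.split()
    pairs = []
    # Even positions hold labels, odd positions hold probabilities.
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        name = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((name, float(prob)))
    return pairs


if __name__ == "__main__":
    # Made-up line standing in for one line of predict-prob output with k=2.
    print(parse_predict_prob("__label__positive 0.93 __label__negative 0.07"))
```

If training used a different -label prefix, pass the same string as the prefix argument so that it is stripped consistently.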

If you want to compute vector representations of sentences or paragraphs, please use:

$ ./fasttext print-sentence-vectors model.bin < text.txt

This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.

You can also quantize a supervised model to reduce its memory usage with the following command:

$ ./fasttext quantize -output model

This will create a .ftz file with a smaller memory footprint. All the standard functionality, like test or predict, works the same way on the quantized models:

$ ./fasttext test model.ftz test.txt

The quantization procedure follows the steps described in [3]. You can run the script quantization-example.sh for an example.

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
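When training is driven from a script, the arguments listed above can be assembled programmatically. The following Python sketch builds the argument list for the fasttext binary from keyword options. The helper build_command is hypothetical, it assumes the binary lives at ./fasttext, and it does not check that an option name is one the CLI accepts.

```python
import shlex


def build_command(mode, input_path, output_path, binary="./fasttext", **options):
    """Build an argv list such as ['./fasttext', 'supervised', '-input', ...].

    `options` maps argument names without the leading dash (for example
    lr, epoch, wordNgrams) to their values. Names are passed through
    unchecked, so a typo only surfaces when the binary is run.
    """
    argv = [binary, mode, "-input", input_path, "-output", output_path]
    for name, value in options.items():
        argv += [f"-{name}", str(value)]
    return argv


if __name__ == "__main__":
    argv = build_command("supervised", "train.txt", "model",
                         lr=0.5, epoch=25, wordNgrams=2)
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv, check=True). Passing a list rather than a single string avoids shell quoting problems with file paths that contain spaces.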

References

Please cite [1] if using this code for learning word representations or [2] if using for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)

Join the fastText community

Facebook page: https://www.facebook.com/groups/1174547215919768
Google group: https://groups.google.com/forum/#!forum/fasttext-library
Contact: egrave@fb.com, bojanowski@fb.com, ajoulin@fb.com, tmikolov@fb.com

See the CONTRIBUTING file for information about how to help out.

License

fastText is MIT-licensed.
