Code for our paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classification (Wang and Manning, ACL 2012).
Since I still receive a good number of emails about this project 4 years later, I decided to put this code on GitHub and write the instructions better. The code itself is unchanged, in MATLAB, and not that great. Luckily, there are several other implementations in various languages, which are better. For example, I used Grégoire Mesnil's implementation on this CodaLab worksheet and got slightly better results than we originally did.
- Download the data and replace the empty data directory in the repository root: for example, you should have "./data/rt10662/unigram_rts.mat" if this readme has path "./README.md"
- Go to src and run the script master.m to produce the results from the paper (a minimal run sketch follows the results table below)
- Results and details are logged in resultslog.txt and details.txt, respectively
- A table with all the results is printed, like:
|               | AthR  | XGraph | BbCrypt | CR    | IMDB  | MPQA  | RT-2k | RTs   | subj  |
|---------------|-------|--------|---------|-------|-------|-------|-------|-------|-------|
| MNB-bigram    | 85.13 | 91.19  | 99.40   | 79.97 | 86.59 | 86.27 | 85.85 | 79.03 | 93.56 |
| MNB-unigram   | 84.99 | 89.96  | 99.29   | 79.76 | 83.55 | 85.29 | 83.45 | 77.94 | 92.58 |
| SVM-bigram    | 83.73 | 86.17  | 97.68   | 80.85 | 89.16 | 86.72 | 87.40 | 77.72 | 91.74 |
| SVM-unigram   | 82.61 | 85.14  | 98.29   | 79.02 | 86.95 | 86.15 | 86.25 | 76.23 | 90.84 |
| NBSVM-bigram  | 87.66 | 90.68  | 99.50   | 81.75 | 91.22 | 86.32 | 89.45 | 79.38 | 93.18 |
| NBSVM-unigram | 87.94 | 91.19  | 99.70   | 80.45 | 88.29 | 85.25 | 87.80 | 78.05 | 92.40 |
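A minimal sketch of a run (it assumes the data directory has been populated as described above and that MATLAB is started from the repository root; exact log-file locations may differ):

```matlab
% Sketch of a full run from the repository root; assumes ./data is populated.
cd src
master   % runs all experiments and prints the results table above
% summary results are appended to resultslog.txt, per-run details to details.txt
```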
- data (404.4MB) includes all the data
- data_small (108.5MB): data_small = data_all - large_IMDB
- For each data set, there is a corresponding folder data/$DatasetName.
- You can find $FeatureType_$DatasetName.mat in data/$DatasetName, where $FeatureType = "unigram" or "bigram".
- data/$DatasetName/cv_obj.mat determines the standard evaluation for each dataset (how many folds, what the splits are, etc.). These files are generated by the corresponding data processing scripts in src/misc (see the loading sketch below).
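To get a quick look at a dataset, you can load these files in MATLAB. The snippet below is only a sketch; the variable names stored inside the .mat files are not documented here, so it simply lists whatever gets loaded:

```matlab
% Sketch: inspect one dataset's feature file and its CV split definition
% (uses the rt10662 example path from above; run from the repository root).
load('data/rt10662/unigram_rts.mat');
load('data/rt10662/cv_obj.mat');
whos   % list the variables that were loaded
```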
Please consider submitting a pull request or shooting me an email if you use NBSVM in your work!
- https://github.com/mesnilgr/nbsvm, Python implementation by Grégoire Mesnil. It runs on the large IMDB dataset with a single script, and the results are described in their ICLR 2015 paper
- https://github.com/dpressel/nbsvm-xl, Java implementation by Daniel Pressel, using SGD.
- https://github.com/lrei/nbsvm, Python implementation by Luis Rei, multiclass
- https://github.com/tkng/rakai, a Go implementation by tkng, probably incomplete
- http://d.hatena.ne.jp/jetbead/20140916/1410798409, a Perl implementation! Unfortunately I can't read Japanese.
It appears to be used in these Kaggle entries:
- https://github.com/vinhkhuc/kaggle-sentiment-popcorn
- https://github.com/tjflexic/kaggle-word2vec-movie-reviews
- The datasets were collected by others; please cite the original sources if you work with them
- The data structure used keeps the word-order information of each document instead of converting to a bag-of-words vector right away. This resulted in some unnecessary mess for this work, but might make it easier if you want to try a more complex model.
- While many experiments have been run for this task, performance is really all about regularization, and even the simplest model (Naive Bayes) would fit the training set perfectly. As far as I know, there is no good theory for why things even work in this case of non-sparse weights and p >> n.
- It is unclear if any of the complicated deep learning models today are doing significantly more than bag of words on these datasets:
- As far as I know, none of these results are impressively better (usually about 1%)
- Available compute power, engineering competence, and software infrastructure are vastly better for deep learning
- Difference in enthusiasm: no one seems to try very hard to push basic models to the limits of the available compute power / hardware
- Bag-of-words models run in a few seconds or less, and behave predictably on a different test distribution.
- It is very encouraging for me to see others finding this work helpful and implementing it.
- Another example of bag of words going strong in 2015.
For technical details see our paper and our talk.
```
@inproceedings{wang12simple,
  author    = {Wang, Sida I. and Manning, Christopher D.},
  title     = {Baselines and Bigrams: Simple, Good Sentiment and Topic Classification},
  booktitle = {Proceedings of the ACL},
  year      = {2012},
  pages     = {90--94}
}
```
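For a rough sense of what NBSVM does, here is a minimal sketch of the main trick from the paper (log-count-ratio feature weighting followed by an L2-regularized linear classifier, with the weight interpolation). It is illustrative only, not the code in src; the names Xtrain, ytrain, alpha, and beta are assumptions for the example, with Xtrain a binarized bag-of-words matrix and ytrain labels in {+1, -1}:

```matlab
% Illustrative NBSVM sketch (not the actual code in src/).
% Assumes: Xtrain is n-by-d binarized bag-of-words counts, ytrain is n-by-1 in {+1,-1}.
alpha = 1;                                     % add-one smoothing
p = alpha + sum(Xtrain(ytrain ==  1, :), 1);   % smoothed positive-class counts
q = alpha + sum(Xtrain(ytrain == -1, :), 1);   % smoothed negative-class counts
r = log((p / sum(p)) ./ (q / sum(q)));         % log-count ratio

Xscaled = bsxfun(@times, Xtrain, r);           % scale each feature by its ratio
% Train any L2-regularized linear classifier (e.g. LIBLINEAR) on Xscaled, then
% interpolate the learned weights w with their mean magnitude, as in the paper:
%   w_interp = (1 - beta) * mean(abs(w)) + beta * w;   % beta around 0.25
```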
These works compare on the largest dataset of the batch (IMDB), where regularization is perhaps not as important. Our result was 91.22% accuracy.
Quoc V. Le, Tomas Mikolov. Distributed Representations of Sentences and Documents. 2014.
- Reported 92.58%; no released code, and the paper below reports that the results were not reproduced.
Grégoire Mesnil, Tomas Mikolov, Marc'Aurelio Ranzato, Yoshua Bengio. Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews. ICLR 2015
- Their implementation of NBSVM actually did better than ours at 91.87%, and their best number is 92.57% with some ensembling.
Andrew M. Dai, Quoc V. Le. Semi-supervised Sequence Learning. NIPS 2015.
- 92.76% with additional unlabeled data.
Stefan Wager, Sida Wang, and Percy Liang. Dropout Training as Adaptive Regularization. NIPS 2013
- We got 91.98% using unlabeled data with logistic regression and bigrams.
(please submit a pull request if you want something added or changed)
MIT license.