lovit/textmining-tutorial
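These are materials for studying text mining. They include natural language processing and machine learning materials that apply regardless of language, as well as materials specific to Korean text analysis.
- These materials are a work in progress and include slides and Jupyter notebook example code.
- These materials use the soynlp package, a natural language processing library for Korean text analysis. soynlp is also a work in progress.
- Text explaining the slide content is being posted on the blog.
- The hands-on code is in the code repository.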
- Python basic
  - jupyter tutorial
- From text to vector (KoNLPy)
  - n-gram
  - from text to vector using KoNLPy
- Word extraction and tokenization (Korean)
  - word extractor
  - unsupervised tokenizer
  - noun extractor
  - dictionary-based POS tagger
- Document classification
  - Logistic Regression and Lasso regression
  - SVM (linear, RBF)
  - k-nearest neighbors classifier
  - Feed-forward neural network
  - Decision Tree
  - Naive Bayes
- Sequential labeling
  - Conditional Random Field
- Embedding for representation
  - Word2Vec / Doc2Vec
  - GloVe
  - FastText (word embedding using subword)
  - FastText (supervised word embedding)
  - Sparse Coding
  - Nonnegative Matrix Factorization (NMF) for topic modeling
- Embedding for vector visualization
  - MDS, ISOMAP, Locally Linear Embedding, PCA, Kernel PCA
  - t-SNE
  - t-SNE (detailed)
- Keyword / Related words analysis
  - co-occurrence based keyword / related word analysis
- Document clustering
  - k-means is good for document clustering
  - DBSCAN, hierarchical, GMM, BGMM are not appropriate for document clustering
- Finding similar documents (neighbor search)
  - Random Projection
  - Locality Sensitive Hashing
  - Inverted Index
- Graph similarity and ranking (centrality)
  - SimRank & Random Walk with Restart
  - PageRank, HITS, WordRank, TextRank
  - kr-wordrank keyword extraction
- String similarity
  - Levenshtein / Cosine / Jaccard distance
- Convolutional Neural Network (CNN)
  - Introduction of CNN
  - Word-level CNN for sentence classification (Yoon Kim)
  - Character-level CNN (LeCun)
  - BOW-CNN
- Recurrent Neural Network (RNN)
  - Introduction of RNN
  - LSTM, GRU
  - Deep RNN & ELMo
  - Sequence to sequence & seq2seq with attention
  - Skip-thought vector
  - Attention mechanism for sentence classification
  - Hierarchical Attention Network (HAN) for document classification
  - Transformer & BERT
- Applications
  - soyspacing: heuristic Korean space correction
  - CRF-based Korean space correction
  - HMM & CRF-based part-of-speech tagger (morphological analyzer)
  - semantic movie search using IMDB
- TBD
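
As a small illustration of the "String similarity" topic in the list above, the sketch below computes Levenshtein distance and a Jaccard distance in plain Python. It is not taken from the tutorial's notebooks: the function names `levenshtein` and `jaccard_distance`, and the choice of character bigrams as the set representation, are assumptions made for this example.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    # Keep the shorter string on the inner loop so each row stays small.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def jaccard_distance(s1: str, s2: str, n: int = 2) -> float:
    """1 - |A & B| / |A | B| over the sets of character n-grams."""
    a = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    b = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    if not a and not b:
        return 0.0  # two strings with no n-grams are treated as identical
    return 1 - len(a & b) / len(a | b)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))        # 3
    print(levenshtein("데이터마이닝", "데이타마이닝"))  # 1 (one syllable differs)
    print(jaccard_distance("night", "nacht"))
```

Python strings are sequences of code points, so each precomposed Hangul syllable counts as one character here. Comparing at the jamo level instead would require decomposing the syllables first, which this sketch does not do.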
Many kind colleagues review these materials and discuss them with me. I am especially grateful to 태욱, who has devoted a great deal of time and care to helping.