Korean HateSpeech Dataset


We provide the first human-annotated Korean corpus for toxic speech detection, together with a large unlabeled corpus.
The data consists of comments from a Korean entertainment news aggregation platform.

Dataset description

The dataset consists of 3 parts: 1) `labeled`, 2) `unlabeled`, and 3) `news_title`.

1. `labeled`

There are 9,381 human-labeled comments in total. They are split into a training set of 7,896 comments, a validation set of 471 comments, and a test set of 974 comments. (We left the test set labels undisclosed to allow a fair comparison of prediction models. A model can be evaluated via a Kaggle submission, as described later in this document.) Each comment is annotated on two aspects, the existence of social bias and of hate speech, given that hate speech is closely related to bias.

For social bias, we present `gender`, `others`, and `none` bias labels. Considering the context of Korean entertainment news, where public figures encounter stereotypes mostly intertwined with gender, we place more weight on this prevalent bias. We also added a binary label indicating whether a comment contains gender bias or not. For hate speech, we introduce `hate`, `offensive`, and `none` labels.

| comments | contain_gender_bias | bias | hate |
| --- | --- | --- | --- |
| 송중기 시대극은 믿고본다. 첫회 신선하고 좋았다. | False | none | none |
| 지현우 나쁜놈 | False | none | offensive |
| 알바쓰고많이만들면되지 돈욕심없으면골목식당왜나온겨 기댕기게나하고 산에가서팔어라 | False | none | hate |
| 설마 ㅈ 현정 작가 아니지?? | True | gender | hate |

Detailed definitions are described in the guideline.
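The label scheme above can be checked mechanically. The following is a minimal sketch, not part of the dataset tooling: it uses the four example rows from the table as plain dictionaries, and the helper names `validate_row` and `label_counts` are hypothetical. It also assumes that `contain_gender_bias` is true exactly when `bias` is `gender`, which holds for the examples shown but is not stated as a rule in this document.

```python
from collections import Counter

# The four example rows from the table above, as plain dicts.
SAMPLE_ROWS = [
    {"comments": "송중기 시대극은 믿고본다. 첫회 신선하고 좋았다.",
     "contain_gender_bias": False, "bias": "none", "hate": "none"},
    {"comments": "지현우 나쁜놈",
     "contain_gender_bias": False, "bias": "none", "hate": "offensive"},
    {"comments": "알바쓰고많이만들면되지 돈욕심없으면골목식당왜나온겨 기댕기게나하고 산에가서팔어라",
     "contain_gender_bias": False, "bias": "none", "hate": "hate"},
    {"comments": "설마 ㅈ 현정 작가 아니지??",
     "contain_gender_bias": True, "bias": "gender", "hate": "hate"},
]

# Label sets as described in the text.
BIAS_LABELS = {"gender", "others", "none"}
HATE_LABELS = {"hate", "offensive", "none"}


def validate_row(row):
    """Return a list of problems found in one labeled row (empty if none)."""
    problems = []
    if row["bias"] not in BIAS_LABELS:
        problems.append(f"unknown bias label: {row['bias']!r}")
    if row["hate"] not in HATE_LABELS:
        problems.append(f"unknown hate label: {row['hate']!r}")
    # Assumption: the binary flag mirrors bias == "gender".
    if row["contain_gender_bias"] != (row["bias"] == "gender"):
        problems.append("contain_gender_bias disagrees with bias label")
    return problems


def label_counts(rows, column):
    """Count how often each value of `column` occurs in `rows`."""
    return Counter(row[column] for row in rows)


if __name__ == "__main__":
    for row in SAMPLE_ROWS:
        print(validate_row(row))
    print(label_counts(SAMPLE_ROWS, "bias"))
    print(label_counts(SAMPLE_ROWS, "hate"))
```

The same two helpers could be applied to the full training split to inspect class balance before training a model.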

2. `unlabeled`

We additionally provide 2,033,893 unlabeled comments, since labeled data is limited.
This unlabeled dataset can be used in various ways, such as pretraining a language model or semi-supervised learning.
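As one illustration of the semi-supervised use mentioned above, the sketch below shows a simple pseudo-labeling filter: a model trained on the labeled data scores unlabeled comments, and only confident predictions are kept as extra training examples. This is a generic technique, not a method described by the authors. The function `select_pseudo_labels`, the `predict_proba` callable, and the 0.9 threshold are all hypothetical choices made for the example.

```python
def select_pseudo_labels(comments, predict_proba, threshold=0.9):
    """Keep unlabeled comments whose top predicted label is confident enough.

    `predict_proba` is a hypothetical callable that maps a comment string
    to a dict of {label: probability}; any trained classifier could be
    wrapped to provide it.
    """
    selected = []
    for comment in comments:
        probs = predict_proba(comment)
        # Pick the label with the highest predicted probability.
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            selected.append({"comments": comment, "hate": label})
    return selected


if __name__ == "__main__":
    # Toy stand-in for a real model, for demonstration only.
    def toy_model(comment):
        if "나쁜놈" in comment:
            return {"hate": 0.03, "offensive": 0.95, "none": 0.02}
        return {"hate": 0.3, "offensive": 0.3, "none": 0.4}

    print(select_pseudo_labels(["지현우 나쁜놈", "잘봤습니다"], toy_model))
```

A higher threshold keeps fewer but cleaner pseudo-labels; the right value depends on the model and would need to be tuned on the validation set.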

3. `news_title`

We release the news title for each comment. Context is required to fully understand the meaning of the comments.
For entertainment news, both the title and the body of an article can serve as context. However, we provide only the titles of the news articles, due to legal issues.
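One simple way to give a model this context is to prepend the title to the comment. The sketch below assumes records shaped like the ones in this dataset, with `comments` and `news_title` keys. The helper name `with_title_context` and the BERT-style `" [SEP] "` separator are assumptions made for illustration; some titles in the data carry literal surrounding double quotes, which the helper strips.

```python
def with_title_context(record, separator=" [SEP] "):
    """Join a comment with its news title into a single input string.

    The separator is an illustrative choice, not part of the dataset.
    """
    # Some titles are stored with literal ASCII double quotes around them.
    title = record.get("news_title", "").strip().strip('"').strip()
    comment = record["comments"].strip()
    if not title:
        return comment
    return f"{title}{separator}{comment}"


if __name__ == "__main__":
    example = {
        "comments": "끝낼때도 됐지 요즘같은 분위기엔 성드립 잘못쳤다가 난리. 그동안 잘봤습니다",
        "news_title": "[단독] ‘SNL 코리아’ 공식적인 폐지 확정…아름다운 종료",
    }
    print(with_title_context(example))
```

Whether title context actually helps a given model is an empirical question; comparing runs with and without it on the validation set would be a reasonable check.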

Usage

koco is a library that provides easy access to kocohub datasets.

For korean-hate-speech, we can load the datasets with this code:

    >>> import koco

    >>> train_dev = koco.load_dataset('korean-hate-speech', mode='train_dev')
    >>> type(train_dev)
    dict
    >>> train_dev.keys()
    dict_keys(['train', 'dev'])
    >>> train_dev['train'][33]
    {'comments': '2,30대 골빈여자들은 이 기사에 다 모이는건가ㅋㅋㅋㅋ 이래서 여자는 투표권 주면 안된다. 엠넷사전투표나 하고 살아야지 계집들은',
     'contain_gender_bias': True,
     'bias': 'gender',
     'hate': 'hate',
     'news_title': '"“8년째 연애 중”…‘인생술집’ 블락비 유권♥전선혜, 4살차 연상연하 커플"'}

    >>> unlabeled = koco.load_dataset('korean-hate-speech', mode='unlabeled')
    >>> type(unlabeled)
    list
    >>> unlabeled[33]
    {'comments': '이주연님 되게 이쁘시다 오빠 오래가요 잘어울려 주연님 울오빠 잘부탁해요',
     'news_title': '"[단독] 지드래곤♥이주연, 제주도 데이트…2018년 1호 커플 탄생"'}

    >>> test = koco.load_dataset('korean-hate-speech', mode='test')
    >>> type(test)
    list
    >>> test[33]
    {'comments': '끝낼때도 됐지 요즘같은 분위기엔 성드립 잘못쳤다가 난리. 그동안 잘봤습니다',
     'news_title': '[단독] ‘SNL 코리아’ 공식적인 폐지 확정…아름다운 종료'}
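Most training code wants parallel lists of texts and labels rather than a list of dictionaries. The sketch below shows one way to split the records returned by `koco.load_dataset` for a chosen task. The helper `to_texts_and_labels` and the `TARGETS` tuple are hypothetical names; the record keys come from the output shown above. The helper works on plain lists of dicts, so it does not depend on koco itself.

```python
# Label columns present in the labeled records shown above.
TARGETS = ("contain_gender_bias", "bias", "hate")


def to_texts_and_labels(records, target="hate"):
    """Split labeled records into a list of texts and a list of labels."""
    if target not in TARGETS:
        raise ValueError(f"target must be one of {TARGETS}, got {target!r}")
    texts = [record["comments"] for record in records]
    labels = [record[target] for record in records]
    return texts, labels


if __name__ == "__main__":
    # With koco installed, the records would come from:
    #   train_dev = koco.load_dataset('korean-hate-speech', mode='train_dev')
    #   records = train_dev['train']
    records = [
        {"comments": "지현우 나쁜놈", "contain_gender_bias": False,
         "bias": "none", "hate": "offensive", "news_title": ""},
        {"comments": "설마 ㅈ 현정 작가 아니지??", "contain_gender_bias": True,
         "bias": "gender", "hate": "hate", "news_title": ""},
    ]
    texts, labels = to_texts_and_labels(records, target="bias")
    print(texts)
    print(labels)
```

The three targets correspond to the three Kaggle competitions: gender-bias detection, bias detection, and hate speech detection.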

Kaggle competition

We host Kaggle competitions to provide a leaderboard. There are 3 competitions:

- Gender-bias detection: www.kaggle.com/c/korean-gender-bias-detection
- Bias detection: www.kaggle.com/c/korean-bias-detection
- Hate speech detection: www.kaggle.com/c/korean-hate-speech-detection

Feel free to participate 🎉

Annotation Guideline

Contributors

The main contributors of the work are:

*: Equal Contribution

Note that this project is independent research and was not supported by any of the organizations.
Instead, we had an individual sponsor, Hyunjoong Kim, and we sincerely thank Hyunjoong Kim for providing financial support ❤️

References

If you find this dataset useful, feel free to cite our publication, BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection, which was accepted at SocialNLP@ACL 2020:

    @inproceedings{moon-etal-2020-beep,
        title = "{BEEP}! {K}orean Corpus of Online News Comments for Toxic Speech Detection",
        author = "Moon, Jihyung  and
          Cho, Won Ik  and
          Lee, Junbum",
        booktitle = "Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media",
        month = jul,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.socialnlp-1.4",
        pages = "25--31",
        abstract = "Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff{'}s alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.",
    }

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
