kocohub/korean-hate-speech
We provide the first human-annotated Korean corpus for toxic speech detection, together with a large unlabeled corpus.
The data consists of comments from a Korean entertainment news aggregation platform.
The dataset consists of 3 parts: 1) `labeled`, 2) `unlabeled`, and 3) `news_title`.
There are 9,381 human-labeled comments in total. They are split into 7,896 training, 471 validation, and 974 test comments. (We leave the test set labels undisclosed for a fair comparison of prediction models. Models can be evaluated via Kaggle submission, which is described later in this document.) Each comment is annotated on two aspects, the existence of social bias and hate speech, given that hate speech is closely related to bias.
For social bias, we present `gender`, `others`, and `none` bias labels. Considering the context of Korean entertainment news, where public figures encounter stereotypes mostly intertwined with gender, we place more weight on this prevalent bias. We also add a binary label, `contain_gender_bias`, that indicates whether a comment contains gender bias or not.
For hate speech, we introduce `hate`, `offensive`, and `none` labels.
| comments | contain_gender_bias | bias | hate |
|---|---|---|---|
| 송중기 시대극은 믿고본다. 첫회 신선하고 좋았다. | False | none | none |
| 지현우 나쁜놈 | False | none | offensive |
| 알바쓰고많이만들면되지 돈욕심없으면골목식당왜나온겨 기댕기게나하고 산에가서팔어라 | False | none | hate |
| 설마 ㅈ 현정 작가 아니지?? | True | gender | hate |
Detailed definitions are described in the guideline.
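The label scheme above can be turned into a small sanity check. The following is a minimal sketch, not part of the dataset tooling: `validate_record` is a hypothetical helper, and the rule that `contain_gender_bias` is `True` exactly when `bias` is `gender` is an assumption that matches the example rows shown in this README but is not stated by the guideline.

```python
# Label sets as listed in this README.
BIAS_LABELS = {"gender", "others", "none"}
HATE_LABELS = {"hate", "offensive", "none"}


def validate_record(record):
    """Return a list of problems found in one labeled record (empty if none)."""
    problems = []
    if record["bias"] not in BIAS_LABELS:
        problems.append(f"unknown bias label: {record['bias']!r}")
    if record["hate"] not in HATE_LABELS:
        problems.append(f"unknown hate label: {record['hate']!r}")
    # Assumption: the binary flag is True exactly when bias == 'gender'.
    if record["contain_gender_bias"] != (record["bias"] == "gender"):
        problems.append("contain_gender_bias disagrees with bias")
    return problems


# Two of the example rows from the README.
samples = [
    {"comments": "지현우 나쁜놈",
     "contain_gender_bias": False, "bias": "none", "hate": "offensive"},
    {"comments": "설마 ㅈ 현정 작가 아니지??",
     "contain_gender_bias": True, "bias": "gender", "hate": "hate"},
]

for sample in samples:
    print(validate_record(sample))
```

Running a check like this over the training split before training can catch malformed rows early.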
We additionally provide 2,033,893 `unlabeled` comments, since the `labeled` data is limited.
This unlabeled dataset can be used in various ways, such as pretraining a language model or semi-supervised learning.
We release the news title for each comment. Context is required to fully understand the meaning of a comment.
For entertainment news, both the title and the body of an article can serve as context. However, we provide only the article titles due to legal issues.
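One simple way to use the title as context is to join it with the comment into a single input string. This is only an illustrative sketch: `with_context` is a hypothetical helper, and the `" [SEP] "` separator is an assumed placeholder that should be replaced by whatever the chosen model or tokenizer expects.

```python
def with_context(record, separator=" [SEP] "):
    """Join a record's news title and comment into one string.

    The separator is an assumption; choose one that suits your model.
    """
    return f"{record['news_title']}{separator}{record['comments']}"


# Record shaped like the entries that carry both 'comments' and 'news_title'.
record = {
    "comments": "끝낼때도 됐지 요즘같은 분위기엔 성드립 잘못쳤다가 난리. 그동안 잘봤습니다",
    "news_title": "[단독] ‘SNL 코리아’ 공식적인 폐지 확정…아름다운 종료",
}

print(with_context(record))
```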
`koco` is a library for easily accessing `kocohub` datasets.
For `korean-hate-speech`, the datasets can be loaded with the following code:
```python
>>> import koco

>>> train_dev = koco.load_dataset('korean-hate-speech', mode='train_dev')
>>> type(train_dev)
dict
>>> train_dev.keys()
dict_keys(['train', 'dev'])
>>> train_dev['train'][33]
{'comments': '2,30대 골빈여자들은 이 기사에 다 모이는건가ㅋㅋㅋㅋ 이래서 여자는 투표권 주면 안된다. 엠넷사전투표나 하고 살아야지 계집들은',
 'contain_gender_bias': True,
 'bias': 'gender',
 'hate': 'hate',
 'news_title': '"“8년째 연애 중”…‘인생술집’ 블락비 유권♥전선혜, 4살차 연상연하 커플"'}

>>> unlabeled = koco.load_dataset('korean-hate-speech', mode='unlabeled')
>>> type(unlabeled)
list
>>> unlabeled[33]
{'comments': '이주연님 되게 이쁘시다 오빠 오래가요 잘어울려 주연님 울오빠 잘부탁해요',
 'news_title': '"[단독] 지드래곤♥이주연, 제주도 데이트…2018년 1호 커플 탄생"'}

>>> test = koco.load_dataset('korean-hate-speech', mode='test')
>>> type(test)
list
>>> test[33]
{'comments': '끝낼때도 됐지 요즘같은 분위기엔 성드립 잘못쳤다가 난리. 그동안 잘봤습니다',
 'news_title': '[단독] ‘SNL 코리아’ 공식적인 폐지 확정…아름다운 종료'}
```
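Once the records are loaded, a natural first step is to inspect the label distribution. The sketch below is self-contained and does not call `koco`; it uses the four example rows from this README as stand-in data. `label_distribution` is a hypothetical helper, and the same call should work on a list of dicts such as `train_dev['train']`, assuming each record carries the keys shown above.

```python
from collections import Counter


def label_distribution(records, key):
    """Count how often each value of `key` occurs across the records."""
    return Counter(record[key] for record in records)


# Stand-in data: the four example rows from the README table.
records = [
    {"comments": "송중기 시대극은 믿고본다. 첫회 신선하고 좋았다.",
     "contain_gender_bias": False, "bias": "none", "hate": "none"},
    {"comments": "지현우 나쁜놈",
     "contain_gender_bias": False, "bias": "none", "hate": "offensive"},
    {"comments": "알바쓰고많이만들면되지 돈욕심없으면골목식당왜나온겨 기댕기게나하고 산에가서팔어라",
     "contain_gender_bias": False, "bias": "none", "hate": "hate"},
    {"comments": "설마 ㅈ 현정 작가 아니지??",
     "contain_gender_bias": True, "bias": "gender", "hate": "hate"},
]

for key in ("contain_gender_bias", "bias", "hate"):
    print(key, dict(label_distribution(records, key)))
```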
We opened Kaggle competitions to provide a leaderboard system. There are 3 competitions:
- Gender-bias detection: www.kaggle.com/c/korean-gender-bias-detection
- Bias detection: www.kaggle.com/c/korean-bias-detection
- Hate speech detection: www.kaggle.com/c/korean-hate-speech-detection
Feel free to participate 🎉
The main contributors of the work are:
*: Equal Contribution
Note that this project is independent research and was not supported by any of the organizations.
Instead, we had an individual sponsor, Hyunjoong Kim, and we sincerely thank Hyunjoong Kim for providing financial support ❤️
If you find this dataset useful, feel free to cite our publication, BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection, which was accepted at SocialNLP@ACL 2020:
```
@inproceedings{moon-etal-2020-beep,
    title = "{BEEP}! {K}orean Corpus of Online News Comments for Toxic Speech Detection",
    author = "Moon, Jihyung and Cho, Won Ik and Lee, Junbum",
    booktitle = "Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.socialnlp-1.4",
    pages = "25--31",
    abstract = "Toxic comments in online platforms are an unavoidable social issue under the cloak of anonymity. Hate speech detection has been actively done for languages such as English, German, or Italian, where manually labeled corpus has been released. In this work, we first present 9.4K manually labeled entertainment news comments for identifying Korean toxic speech, collected from a widely used online news platform in Korea. The comments are annotated regarding social bias and hate speech since both aspects are correlated. The inter-annotator agreement Krippendorff{'}s alpha score is 0.492 and 0.496, respectively. We provide benchmarks using CharCNN, BiLSTM, and BERT, where BERT achieves the highest score on all tasks. The models generally display better performance on bias identification, since the hate speech detection is a more subjective issue. Additionally, when BERT is trained with bias label for hate speech detection, the prediction score increases, implying that bias and hate are intertwined. We make our dataset publicly available and open competitions with the corpus and benchmarks.",
}
```
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.