There is a shortage of high-quality corpora for South-Slavic languages. Such corpora are useful to computer scientists and researchers in social sciences and humanities alike, focusing on numerous linguistic, content analysis, and natural language processing applications. This paper presents a workflow for mining Wikipedia content and processing it into linguistically-processed corpora, applied on the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedia. We make the resulting seven corpora publicly available. We showcase these corpora by comparing the content of the underlying Wikipedias, our assumption being that the content of the Wikipedias reflects broadly the interests in various topics in these Balkan nations. We perform the content comparison by using topic modelling algorithms and various distribution comparisons. The results show that all Wikipedias are topically rather similar, with all of them covering art, culture, and literature, whereas they contain differences in geography, politics, history and science.
Filip Markoski, Elena Markoska, Nikola Ljubešić, Eftim Zdravevski, and Ljupco Kocarev. 2021. Cultural Topic Modelling over Novel Wikipedia Corpora for South-Slavic Languages. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 910–917, Held Online. INCOMA Ltd.
@inproceedings{markoski-etal-2021-cultural,
    title = "Cultural Topic Modelling over Novel {W}ikipedia Corpora for {S}outh-{S}lavic Languages",
    author = "Markoski, Filip and Markoska, Elena and Ljube{\v{s}}i{\'c}, Nikola and Zdravevski, Eftim and Kocarev, Ljupco",
    editor = "Mitkov, Ruslan and Angelova, Galia",
    booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)",
    month = sep,
    year = "2021",
    address = "Held Online",
    publisher = "INCOMA Ltd.",
    url = "https://aclanthology.org/2021.ranlp-1.104/",
    pages = "910--917",
    abstract = "There is a shortage of high-quality corpora for South-Slavic languages. Such corpora are useful to computer scientists and researchers in social sciences and humanities alike, focusing on numerous linguistic, content analysis, and natural language processing applications. This paper presents a workflow for mining Wikipedia content and processing it into linguistically-processed corpora, applied on the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedia. We make the resulting seven corpora publicly available. We showcase these corpora by comparing the content of the underlying Wikipedias, our assumption being that the content of the Wikipedias reflects broadly the interests in various topics in these Balkan nations. We perform the content comparison by using topic modelling algorithms and various distribution comparisons. The results show that all Wikipedias are topically rather similar, with all of them covering art, culture, and literature, whereas they contain differences in geography, politics, history and science."
}
<?xml version="1.0" encoding="UTF-8"?><modsCollection xmlns="http://www.loc.gov/mods/v3"><mods ID="markoski-etal-2021-cultural"> <titleInfo> <title>Cultural Topic Modelling over Novel Wikipedia Corpora for South-Slavic Languages</title> </titleInfo> <name type="personal"> <namePart type="given">Filip</namePart> <namePart type="family">Markoski</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Elena</namePart> <namePart type="family">Markoska</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Nikola</namePart> <namePart type="family">Ljubešić</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Eftim</namePart> <namePart type="family">Zdravevski</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Ljupco</namePart> <namePart type="family">Kocarev</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2021-09</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <relatedItem type="host"> <titleInfo> <title>Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)</title> </titleInfo> <name type="personal"> <namePart type="given">Ruslan</namePart> <namePart type="family">Mitkov</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Galia</namePart> <namePart type="family">Angelova</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>INCOMA Ltd.</publisher> <place> <placeTerm type="text">Held Online</placeTerm> </place> 
</originInfo> <genre authority="marcgt">conference publication</genre> </relatedItem> <abstract>There is a shortage of high-quality corpora for South-Slavic languages. Such corpora are useful to computer scientists and researchers in social sciences and humanities alike, focusing on numerous linguistic, content analysis, and natural language processing applications. This paper presents a workflow for mining Wikipedia content and processing it into linguistically-processed corpora, applied on the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedia. We make the resulting seven corpora publicly available. We showcase these corpora by comparing the content of the underlying Wikipedias, our assumption being that the content of the Wikipedias reflects broadly the interests in various topics in these Balkan nations. We perform the content comparison by using topic modelling algorithms and various distribution comparisons. The results show that all Wikipedias are topically rather similar, with all of them covering art, culture, and literature, whereas they contain differences in geography, politics, history and science.</abstract> <identifier type="citekey">markoski-etal-2021-cultural</identifier> <location> <url>https://aclanthology.org/2021.ranlp-1.104/</url> </location> <part> <date>2021-09</date> <extent unit="page"> <start>910</start> <end>917</end> </extent> </part></mods></modsCollection>
%0 Conference Proceedings
%T Cultural Topic Modelling over Novel Wikipedia Corpora for South-Slavic Languages
%A Markoski, Filip
%A Markoska, Elena
%A Ljubešić, Nikola
%A Zdravevski, Eftim
%A Kocarev, Ljupco
%Y Mitkov, Ruslan
%Y Angelova, Galia
%S Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
%D 2021
%8 September
%I INCOMA Ltd.
%C Held Online
%F markoski-etal-2021-cultural
%X There is a shortage of high-quality corpora for South-Slavic languages. Such corpora are useful to computer scientists and researchers in social sciences and humanities alike, focusing on numerous linguistic, content analysis, and natural language processing applications. This paper presents a workflow for mining Wikipedia content and processing it into linguistically-processed corpora, applied on the Bosnian, Bulgarian, Croatian, Macedonian, Serbian, Serbo-Croatian and Slovenian Wikipedia. We make the resulting seven corpora publicly available. We showcase these corpora by comparing the content of the underlying Wikipedias, our assumption being that the content of the Wikipedias reflects broadly the interests in various topics in these Balkan nations. We perform the content comparison by using topic modelling algorithms and various distribution comparisons. The results show that all Wikipedias are topically rather similar, with all of them covering art, culture, and literature, whereas they contain differences in geography, politics, history and science.
%U https://aclanthology.org/2021.ranlp-1.104/
%P 910-917
[Cultural Topic Modelling over Novel Wikipedia Corpora for South-Slavic Languages](https://aclanthology.org/2021.ranlp-1.104/) (Markoski et al., RANLP 2021)