A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Ruwiki, a fork of the Russian Wikipedia, is widely believed to be financed and published by people close to the Kremlin. The authors of this paper[1] construct a dataset consisting of 33,664 pairs of articles, taken from over 1.9 million articles on the official WMF Russian Wikipedia and the Ruwiki articles of the same title. To avoid confusion, Ruwiki is generally called "RWFork" in this paper.
The authors do not use the word "propaganda" in the paper, nor do they directly refer to RWFork as "disinformation". But you can take "knowledge manipulation" as used in the title as having the same meaning. Accusations of spreading propaganda have long been made between Russia and Western countries. The situation has only gotten worse since the start of the Russian invasion of Ukraine in February 2022. The Putin government has attempted several times to replace, block, or just undermine the Russian Wikipedia — and they haven't been shy in saying so. See Signpost coverage in May 2020, April 2022, June 2023, January 2024, July 2024, and June 2025.
The stated purpose of RWFork, according to the paper, is that it is "edited to conform to the Russian legislation" — without directly saying that Russian legislation requires the use of propaganda, e.g. writing "special military operation" instead of the "Russian invasion of Ukraine".
The structure of RWFork facilitates a direct comparison of articles on both encyclopedias. This comparison reveals not just the topics required to be modified by Russian legislation, but also which topics are controversial enough that an active ally of the government has in practice made further edits. Both encyclopedias are powered by MediaWiki software. RWFork copied almost all of the over 1.9 million articles from Russian Wikipedia. Over the period studied (2022–2023), 97.33% of the articles were unchanged (identified as "duplicates"). 0.92% of the articles were never copied or were immediately deleted, and are identified as "missing" in the paper. Only 1.75% of the articles were changed - which may be the most surprising result of the paper: 0.96% had changes that affected the article text, and 0.79% had changes that did not affect the text, such as changes to article categorization or references. Though the percentage of changed articles is small, the resulting dataset is still quite large at 33,664 entries. Most variables, such as page views, edit reversion rates, and IP editing rates, are collected from the Russian Wikipedia articles. Because RWFork offers little available data beyond the articles themselves and their editing histories, most comparisons are based solely on Russian Wikipedia data - e.g. if the Russian Wikipedia article has a high number of page views, both articles in the pair are considered frequently viewed. The main exception is the timing of edits (often visualized as a "time-card" on Wikipedia), which is available for both articles in the pair.
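The duplicate/missing/changed classification described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline; the function names and the simple exact-match comparison are assumptions for clarity.

```python
# Rough sketch of the duplicate / missing / changed classification of
# (Russian Wikipedia, RWFork) article pairs described above.
# NOT the authors' code; data structures and comparison are illustrative.

def classify_pair(ruwiki_text, fork_text):
    """Classify one article pair by comparing the two texts."""
    if fork_text is None:
        return "missing"      # never copied to the fork, or deleted there
    if fork_text == ruwiki_text:
        return "duplicate"    # copied and left untouched over 2022-2023
    return "changed"          # the fork's copy diverged from the original

def summarize(pairs):
    """Count each class; `pairs` maps title -> (ruwiki_text, fork_text)."""
    counts = {"duplicate": 0, "missing": 0, "changed": 0}
    for ruwiki_text, fork_text in pairs.values():
        counts[classify_pair(ruwiki_text, fork_text)] += 1
    return counts

pairs = {
    "Aardvark": ("Article text.", "Article text."),                    # verbatim copy
    "Invasion": ("invasion of Ukraine", "special military operation"), # edited on fork
    "Deleted page": ("Some text.", None),                              # absent on fork
}
print(summarize(pairs))  # {'duplicate': 1, 'missing': 1, 'changed': 1}
```

In the paper's data the "changed" bucket is then split further by whether the edits touched the article text itself (0.96%) or only metadata such as categories and references (0.79%), which a real pipeline would detect with a finer-grained diff than the exact match used here.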
This dataset is the major accomplishment of the authors, and is freely available online. It is described in enough detail to answer several important questions. Were the changed articles relevant or controversial (using page views and reversion rates)? When were the articles changed (using time-cards)? Were there patterns in the articles changed (using article geography and subject matter)?
Three figures from the paper give these basic results.
Figure 3a shows that page views for the Russian Wikipedia articles are much higher for the changed articles than for the duplicate and missing articles. Figure 3b shows very similar results for edit counts. Figures 3c (for IP edit rates) and 3d (for revert rates) show smaller differences between the changed articles and the duplicate articles, but overall these results strongly support the hypothesis that changed articles are especially relevant and controversial.
Figure 4 shows the editing time-cards for RWFork (top) and Russian Wikipedia (bottom). The top card shows that RWFork is mostly edited during ordinary Moscow working hours on weekdays, whereas Russian Wikipedia is edited at earlier and later times as well as during the weekend. This strongly suggests that RWFork is edited mostly by professional editors, and Russian Wikipedia mostly by volunteers.
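A time-card of the kind shown in Figure 4 can be computed from revision timestamps alone. The sketch below is a minimal illustration, not the authors' method; the ISO 8601 UTC input format and the fixed Moscow offset are assumptions.

```python
# Minimal sketch of building an edit "time-card" (weekday x hour grid)
# from revision timestamps, as in Figure 4. Not the authors' code;
# the input format (ISO 8601 strings in UTC) is an assumption.
from datetime import datetime, timedelta

MOSCOW_UTC_OFFSET = timedelta(hours=3)  # Moscow is UTC+3 year-round

def time_card(revision_timestamps):
    """Return a 7x24 grid of edit counts: grid[weekday][hour],
    with weekday 0 = Monday, in Moscow local time."""
    grid = [[0] * 24 for _ in range(7)]
    for ts in revision_timestamps:
        local = datetime.fromisoformat(ts) + MOSCOW_UTC_OFFSET
        grid[local.weekday()][local.hour] += 1
    return grid

revs = ["2023-03-06T08:30:00",  # a Monday morning (UTC)
        "2023-03-06T09:15:00",
        "2023-03-11T20:00:00"]  # a Saturday evening (UTC)
card = time_card(revs)
# The two Monday edits land in row 0 (hours 11 and 12 Moscow time),
# the Saturday edit in row 5 (hour 23 Moscow time).
```

Concentration of mass in rows 0–4 during hours 9–18, as in RWFork's card, is what suggests office-hours editing; a spread across evenings and weekends suggests volunteers.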
Figure 5 is a bit more complex. It shows how membership in the article groups (changed, duplicate, and missing) varies with the geography of the article subject. Articles about Ukraine (UA) fall into the changed group much more often than those from elsewhere. Conversely, articles about Russian or U.S. topics fall most commonly into the missing group, which suggests that there are different reasons that country-specific articles end up in different groups.
The authors also offer a "taxonomy of patterns of knowledge manipulation" (Table 4 from the paper), i.e. a classification of the different types of changes made on RWFork to the imported articles. This is more refined data, based on clustering algorithms, and invites further analysis:
There is indeed far more research that this data might be used for. For example, researchers might investigate whether the articles modified on RWFork have also been modified on Polish, Hungarian, or other eastern European language Wikipedias, possibly indicating a Russian interest in spreading propaganda beyond its borders.
A collective of humanities scholars publishes a manifesto and a commentary[2] to renew critical research approaches in Wikimedia research, grounded in critical humanist traditions. The group and the manifesto emerged from last year's Wikihistories symposium,[supp 1] a new research event series in the critical humanist tradition (co-organized by Wikimedia Australia). The manifesto and commentary are a call for the community to focus on the following themes:
In a blog post last week,[supp 2] one of the authors (Heather Ford) characterized the manifesto as a continuation of the Critical Point of View Conference series in 2010/11 (Signpost coverage), and the collective volume developed from it[supp 3].
While there is previous research on the manifesto's topics - in particular the "dispossession of the commons", i.e. the impact of Large Language Models and other reuses by technology companies (cf. below) on the ways Wikimedia projects function as commons - the call seems designed to encourage further inquiries and strengthen the academic community in this area.
In a paper titled "The Realienation of the Commons: Wikidata and the Ethics of 'Free' Data",[3] Zachary McDowell and Matthew Vetter argue that
In many ways, Wikipedia, and its parent company Wikimedia, can be viewed as the standard-bearers of Web 2.0's early promises for a free and open Web. However, the introduction of Wikipedia's sister project Wikidata and its movement away from "share alike" licensing has dramatically shifted the relationship between editors and complicated Wikimedia's ethics as it relates to the digital commons. This article investigates concerns surrounding what we term the "re-alienation of the commons," especially as it relates to Google and other search engine companies' reliance on data emerging from free/libre and open-source (FOSS/FLOSS) Web movements of the late 20th and early 21st centuries. Taking a Marxist approach, this article explores the labor relationship of editors to Wikimedia projects and how this "realienation" threatens this relationship, as well as the future of the community.
In more detail, the authors explain their application ofMarx's theory of alienation to Wikipedia and Wikidata as follows:
[...] Wikipedia editing allowed the average editor to subvert the capitalist status quo. The Wikipedia community was created around this new economic model—CBPP [commons-based peer production], which connected editors with their labor and connected other editors to each other through that labor. Karl Marx [...] defined alienation as "appropriation as estrangement" and stated that "realisation of labour appears as loss of realisation for the workers" [...]. Marx's concept here refers to the relationship between the product of the labor and how it is both used and disconnected from the laborer. This relationship with labor (and the community around it) marks the important distinction that helps illustrate our use of the term "realienation" with regard to Marx's usage of "alienation." [...]
Instead of Wikipedia's CC-BY-SA ("share alike") license (a license that requires derivatives and other uses of the licensed material to retain the same license), Wikidata utilizes a license that has no requirements. This might sound ideal for "freedom," but in reality, Wikidata seems to appropriate that particular FOSS imaginary of sharing while instead delicensing information into data by assigning it a CC0 license—allowing companies to extract, commodify, and otherwise use these data in ways to create systems without requirements to honor the license or reference the works that were utilized.
A problem with the paper's argument here is that its depiction of the CC0 license as contrary to Wikimedian values (and its mocking scare quotes around "freedom") is incompatible with the Wikimedia movement's conception of free licenses itself, as pointed out by several Wikimedians in a discussion with the authors in the "Wikipedia Weekly" Facebook group:
I think this [paper] is bad for the open movement as they try to make a new definition of what "free" is, contrary to Freedom defined [i.e. the definition used in the Wikimedia Foundation's 2007 licensing policy resolution that specifies the admissible content licenses on all Wikimedia projects, not just Wikidata], the Open definition and for example the Free in Free Software Foundation or the open source definition.
One of the authors rejected this criticism as "making a mountain out of a molehill", while the other stated that "the main argument I would emphasize in response is that we need to be more attentive and critical to the outcomes of CCZero licensing".
As per its abstract (quoted above), the paper explores the postulated "re-alienation [...] especially as it relates to Google and other search engine companies' reliance" on data from Wikimedia projects. In the case of Wikipedia, the authors devote ample space to summarizing earlier research about its importance for Google's search engine, and concerns that Google's Knowledge panel feature (introduced in 2012) might have "significantly reduced traffic to Wikipedia as well as average Web users' understanding of where information comes from when sourced from Wikipedia". However, they also acknowledge that "the relationship between Google and Wikipedia had been (somewhat) mutually beneficial" overall.
In contrast though, and rather peculiarly considering their overall claim that Wikidata's CC0 license makes the project more exploitable by "search engine companies", the paper cites no research or other concrete evidence about whether and how much information from Wikidata is being used in Google Search or in its knowledge panels. At one point, the authors even lament that
it is of deep concern that the Wikimedia community and Wikidata volunteers know very little with regard to how third-party consumers use Wikidata.
But McDowell and Vetter don't seem to have considered how they themselves, and the strong claims they make in their paper about the exploitation of Wikidata due to its license choice, might be affected by this lack of knowledge.
Published in the 2024 issue of the International Journal of Communication, the paper also briefly mentions "large language model generative artificial intelligence (AI) applications such as ChatGPT or Google's Bard" as a more recent example of this "realienation". However, it largely focuses on search engines and discusses artificial intelligence mostly in the form of "AI apps such as Google Knowledge Graph [and] VAs [voice assistants] (e.g., Siri, Alexa)", presumably due to its submission date (the ambiguous "11-9-2022") predating the release of ChatGPT on November 30, 2022.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[4]
"Based on the authors' extensive involvement [with e.g. Wikimedia Germany and the Wikimedia Foundation] since the early years, this article examines Wikipedia's journey of over two decades to unravel relevant aspects of sovereignty within its unconventional organizational framework. The concept of digital sovereignty was nascent when Wikipedia emerged in 2001. In its 24-year evolution, Wikimedia's atypical organizational model, shaped by a mix of intent and happenstance, fostered digital independence while unintentionally creating pockets of dependence. Looking at the origins and the foundational principles, this article sheds light on various aspects of dependence, brought about in the areas of content, collaboration, governmental influence, legal framework and funding models."
The authors envision Wikipedia, which at the time of its origin "could have remained a marginal experiment", as a self-determining digital space. However, they conclude that this state is not the result of a deliberately orchestrated hierarchy, but rather an almost-accidental stumbling into independence through a mix of idealism and adaptation.
From the abstract:[5]
"Critical Information Systems (IS) research is sometimes appreciated for the shades of gray it adds to sunny portraits of technology's emancipatory potential. In this article, we revisit a theory about Wikipedia's putative freedom from the authority of corporate media's editors and authors. We present the curious example of Tim Cook's Wikipedia biography and its history of crowd-sourced editorial decisions [... W]hat we found pertained to authoritative discourse – the opposite of "rational discourse" – as well as Jürgen Habermas's concept of dramaturgical action. Our discussion aims to change how critical scholars think about IS's Habermasian theories and emancipatory technology. Our contribution – a critical intervention – is a clear alternative to mainstream IS research's moral prescriptions and mechanistic causes."
Specifically, the paper focuses on talk page debates about whether the article should mention the Apple CEO's homosexuality, where "advocates of privacy" prevailed until Cook himself
[...] wrote an auto-biographical essay about his sexuality, published by Bloomberg Media. [...] Corporate powers determined and disseminated the final word about Cook's sexuality, not Wikipedia's global pool of co-authors and co-editors.
In short, Wikipedia's putatively "rational discourse" (Hansen et al., 2009) did not establish the consensus; corporate media authority, the author (Cook, 2014), and his auto-biography established an orthodox position, which Wikipedia then copied.
How this critique, carried out by means of a "hermeneutic excursion", relates to our own policies on biographies of living people is not specified, as actions taken here are broadly commensurate with what policy recommends for biographies in general. The authors are unclear on this point, but offer the suggestion that the article was tainted by the use of that reference, since Cook's biography was published by a company owned by a billionaire, and he "did not release it through a social media outlet" (although Facebook, Twitter, Instagram, and Truth Social are also owned by billionaires).
(See also earlier coverage of other publications involving Habermas.)
Good roundup/selective dive as usual, HaeB. I saw an early presentation of the realienation research at a conference a couple years ago (and might as well disclose I know the authors) and had an initial pragmatic-defensive reaction: Wikidata can't just switch to a different license -- it doesn't function without CC0, so what's the point? But the more I sat with it, the more I felt like there was a really important point here about alienation, wikis, wiki contributors, and licensing.
Contributors are more and more frequently separated from our work. No amount of reaffirmation of our definition of freedom changes the reality that many people in our community regularly express feelings ranging from annoyance to demotivation because they feel like their labor is exploited.
Back in 2018, for example, Bfpage wrote a Signpost article about the experience of hearing Alexa read something she wrote on Wikipedia, without attribution. The paper focuses on Wikidata, but the objection about Alexa, and one of the chief criticisms here and elsewhere about more recently relevant companies like OpenAI and Google, isn't simply that they use Wikipedia, but that they treat Wikipedia (and everything else) as if they're CC0.
Google and Wikipedia/the rest of the web have had a historically mutually beneficial relationship, but that undeniably began to erode with Knowledge Panels, which have now given way to AI Search.
It seems to me the distance created between contributors and readers, owing to companies treating our work as though it's CC0, regardless of whether it is, does take a toll worth examining. I think there are now several people/groups working to better understand just that, like the WMF's Future Audiences, but "realienation" seems like a natural frame through which to talk about it.
BTW: "research or other concrete evidence about whether and how much information from Wikidata is being used in Google Search or in its knowledge panels" - How much of this is available? My sense is that such information would be difficult to find, and that it is easily obscured for reasons that align with the authors' arguments, but I would be happy to be wrong about that. — Rhododendrites talk 15:47, 18 July 2025 (UTC)
Thanks for your coverage and engagement @HaeB! I won't respond to some of the critiques here - mostly because they seem to be clamoring for evidence from an epistemological standpoint outside of the paper's theoretical purview. There did exist a few studies from computer science at the time of writing, which I can copy below:
Vincent, N., & Hecht, B. (2021). A deeper investigation of the importance of Wikipedia links to search engine results. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1), 1–15. doi:10.1145/3449078
Vincent, N., Johnson, I., & Hecht, B. (2018). Examining Wikipedia with a broader lens: Quantifying the value of Wikipedia's relationships with other large-scale online communities. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, April, 1–13. doi:10.1145/3173574.3174140
Zhang, C. C., Houtti, M., Smith, C. E., Kong, R., & Terveen, L. (2022). Working for the invisible machines or pumping information into an empty void? An exploration of Wikidata contributors’ motivations. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW1), 1–21. doi:10.1145/3512982
But in general, the article is a theoretical commentary - which should become obvious from the use of Marx - and probably not always going to satisfy folks coming from certain disciplinary perspectives (notwithstanding any political prejudice of Marxism).
For folks who are unfamiliar with the long history of debate about licensing, data capture, and exploitation of Wikidata/pedia, here are a few background links worth a read.
Kolbe, A. (2015). Whither wikidata? The Signpost. Retrieved from https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-12-02/Op-ed
Ford, H., & Iliadis, A. (2023). Wikidata as Semantic Infrastructure: Knowledge Representation, Data Labor, and Truth in a More-Than-Technical Project. Social Media + Society, 9(3).https://doi.org/10.1177/20563051231195552 (Original work published 2023)
But perhaps more to the point: I do want to share some of the more recent research Zach and I have done since working on the "Re-alienation" piece (which we started writing in February of 2022). Both of these further take up the "dispossession of the commons" problem we began thinking about in the piece reviewed.
An endangered species: how LLMs threaten Wikipedia’s sustainability
--https://link.springer.com/article/10.1007/s00146-025-02199-9
As a collaboratively edited and open-access knowledge archive, Wikipedia offers a vast dataset for training artificial intelligence (AI) applications and models, enhancing data accessibility and access to information. However, reliance on the crowd-sourced encyclopedia raises ethical issues related to data provenance, knowledge production, curation, and digital labor. Drawing on critical data studies, feminist posthumanism, and recent research at the intersection of Wikimedia and AI, this study employs problem-centered expert interviews to investigate the relationship between Wikipedia and large language models (LLMs). Key findings include the unclear role of Wikipedia in LLM training, ethical issues, and potential solutions for systemic biases and sustainability challenges. By foregrounding these concerns, this study contributes to ongoing discourses on the responsible use of AI in digital knowledge production and information management. Ultimately, this article calls for greater transparency and accountability in how big tech entities use open-access datasets like Wikipedia, advocating for collaborative frameworks prioritizing ethical considerations and equitable representation.
Wikipedia and AI: Access, representation, and advocacy in the age of large language models
Wikipedia, despite its volunteer-driven nature, stands as a trustworthy repository of information, thanks to its transparent and verifiable processes. However, Large Language Models (LLMs) often use Wikipedia as a source without acknowledging it, creating a disconnect between users and Wikipedia’s rich framework. This poses a triple threat to information literacy, Wikipedia’s vitality, and the potential for dynamic, updated information. This article explores the interplay between representation, accessibility, and LLMs on Wikipedia, highlighting the importance of preserving Wikipedia as a space for access, representation, and ultimately advocacy in an increasingly LLM-dominated information landscape. This article contends that, despite being over two decades old, Wikipedia remains vital not only for knowledge accumulation but also as a sanctuary for the future of knowledge representation, championing representation and accessibility in the age of closed-system LLMs.
Hopefully, these articles might get some coverage in future Research Newsletters, but I encourage folks interested in the larger problem of Wikimedia sustainability and exploitation to check out our research. Many thanks to all of the folks, esp. Smallbones, E mln e, Tilman Bayer and JPxG, working on the Research Newsletter!
Matthewvetter (talk) 16:49, 6 August 2025 (UTC)
I'm not surprised that the vast majority of articles have not been changed on the RUfork. I suspect that both will contain articles on aardvarks, the Amazon river, the Angevin empire, the Algarve and athlete's foot, along with a multitude of other topics which are not obviously controversial in a Ukrainian/Russian war. If the RUfork persists, it would be interesting to see how they keep it up to date on those uncontentious subjects. A project that mostly edits in Moscow office hours does not sound like one that has recruited the necessary volunteers to maintain such a fork. If the bulk of RUfork is unmaintained and allowed to go stale, then the project will be unattractive to readers and can only persist on life support. If they set up a scraper feed to update uncontentious articles with the latest edits from Russian Wikipedia, then the project won't appear stale to readers, but would be even less attractive to volunteers. I doubt it will long outlive Mr Putin. If there is a follow-up article on the RUfork, I would like to see this point addressed. ϢereSpielChequers 07:36, 25 July 2025 (UTC)
Worth noting here that some of the authors of this paper have also recently published a thesis and a conference paper about closely related topics (both based on data from the Russian fork):