A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Ruwiki, a fork of the Russian Wikipedia, is widely believed to be financed and published by people close to the Kremlin. The authors of this paper[1] construct a dataset consisting of 33,664 pairs of articles, taken from over 1.9 million articles on the official WMF Russian Wikipedia and the Ruwiki articles of the same title. To avoid confusion, Ruwiki is generally called "RWFork" in this paper.
The authors do not use the word "propaganda" in the paper, nor do they directly refer to RWFork as "disinformation". But you can take "knowledge manipulation" as used in the title as having the same meaning. Accusations of spreading propaganda have long been made between Russia and Western countries. The situation has only gotten worse since the start of the Russian invasion of Ukraine in February 2022. The Putin government has attempted several times to replace, block, or just undermine the Russian Wikipedia — and they haven't been shy in saying so. See Signpost coverage in May 2020, April 2022, June 2023, January 2024, July 2024, and June 2025.
The stated purpose of RWFork, according to the paper, is that it is "edited to conform to the Russian legislation" — without directly saying that Russian legislation requires the use of propaganda, e.g. writing "special military operation" instead of the "Russian invasion of Ukraine".
The structure of RWFork facilitates a direct comparison of articles on both encyclopedias. This comparison reveals not just the topics required to be modified by Russian legislation, but also which topics are controversial enough that an active ally of the government has in practice made further edits. Both encyclopedias are powered by MediaWiki software. RWFork copied almost all of the over 1.9 million articles from Russian Wikipedia. Over the period studied (2022–2023), 97.33% of the articles were unchanged (identified as "duplicates"). 0.92% of the articles were never copied or were immediately deleted, and are identified as "missing" in the paper. Only 1.75% of the articles were changed - which may be the most surprising result of the paper: 0.96% had changes that affected the article text, and 0.79% had changes that did not affect the text, such as changes to article categorization or references. Though the percentage of changed articles is small, the resulting dataset is still quite large at 33,664 entries. Most variables, such as page views, edit reversion rates, and IP editing rates, are collected from the Russian Wikipedia articles. Because RWFork offers little available data beyond the articles themselves and their editing histories, most comparisons are based solely on Russian Wikipedia data - e.g. if the Russian Wikipedia article has a high number of page views, both articles in the pair are considered frequently viewed. The main exception is the timing of edits (often visualized as a "time-card" on Wikipedia), which is available for both articles in the pair.
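The duplicate/missing/changed classification described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline; the function names and the simple exact-match comparison are assumptions for clarity.

```python
# Rough sketch of the duplicate / missing / changed classification of
# (Russian Wikipedia, RWFork) article pairs described above.
# NOT the authors' code; data structures and comparison are illustrative.

def classify_pair(ruwiki_text, fork_text):
    """Classify one article pair by comparing the two texts."""
    if fork_text is None:
        return "missing"      # never copied to the fork, or deleted there
    if fork_text == ruwiki_text:
        return "duplicate"    # copied and left untouched over 2022-2023
    return "changed"          # the fork's copy diverged from the original

def summarize(pairs):
    """Count each class; `pairs` maps title -> (ruwiki_text, fork_text)."""
    counts = {"duplicate": 0, "missing": 0, "changed": 0}
    for ruwiki_text, fork_text in pairs.values():
        counts[classify_pair(ruwiki_text, fork_text)] += 1
    return counts

pairs = {
    "Aardvark": ("Article text.", "Article text."),                    # verbatim copy
    "Invasion": ("invasion of Ukraine", "special military operation"), # edited on fork
    "Deleted page": ("Some text.", None),                              # absent on fork
}
print(summarize(pairs))  # {'duplicate': 1, 'missing': 1, 'changed': 1}
```

In the paper's data the "changed" bucket is then split further by whether the edits touched the article text itself (0.96%) or only metadata such as categories and references (0.79%), which a real pipeline would detect with a finer-grained diff than the exact match used here.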
This dataset is the major accomplishment of the authors, and is freely available online. It is described in enough detail to answer several important questions. Were the changed articles relevant or controversial (using page views and reversion rates)? When were the articles changed (using time-cards)? Were there patterns in the articles changed (using article geography and subject matter)?
Three figures from the paper give these basic results.
Figure 3a shows that page views for the Russian Wikipedia articles are much higher for the changed articles than for the duplicate and missing articles. Figure 3b shows very similar results for edit counts. Figures 3c (for IP edit rates) and 3d (for revert rates) show smaller differences between the changed articles and the duplicate articles, but overall these results strongly support the hypothesis that changed articles are especially relevant and controversial.
Figure 4 shows the editing time-cards for RWFork (top) and Russian Wikipedia (bottom). The top card shows that RWFork is mostly edited during ordinary Moscow working hours on weekdays, whereas Russian Wikipedia is edited at earlier and later times as well as during the weekend. This strongly suggests that RWFork is edited mostly by professional editors, and Russian Wikipedia mostly by volunteers.
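A time-card of the kind shown in Figure 4 can be computed from revision timestamps alone. The sketch below is a minimal illustration, not the authors' method; the ISO 8601 UTC input format and the fixed Moscow offset are assumptions.

```python
# Minimal sketch of building an edit "time-card" (weekday x hour grid)
# from revision timestamps, as in Figure 4. Not the authors' code;
# the input format (ISO 8601 strings in UTC) is an assumption.
from datetime import datetime, timedelta

MOSCOW_UTC_OFFSET = timedelta(hours=3)  # Moscow is UTC+3 year-round

def time_card(revision_timestamps):
    """Return a 7x24 grid of edit counts: grid[weekday][hour],
    with weekday 0 = Monday, in Moscow local time."""
    grid = [[0] * 24 for _ in range(7)]
    for ts in revision_timestamps:
        local = datetime.fromisoformat(ts) + MOSCOW_UTC_OFFSET
        grid[local.weekday()][local.hour] += 1
    return grid

revs = ["2023-03-06T08:30:00",  # a Monday morning (UTC)
        "2023-03-06T09:15:00",
        "2023-03-11T20:00:00"]  # a Saturday evening (UTC)
card = time_card(revs)
# The two Monday edits land in row 0 (hours 11 and 12 Moscow time),
# the Saturday edit in row 5 (hour 23 Moscow time).
```

Concentration of mass in rows 0–4 during hours 9–18, as in RWFork's card, is what suggests office-hours editing; a spread across evenings and weekends suggests volunteers.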
Figure 5 is a bit more complex. It shows how membership in the article groups (changed, duplicate, and missing) varies with the geography of the article subject. Articles about Ukraine (UA) fall into the changed group much more often than those from elsewhere. Conversely, articles about Russian or U.S. topics fall most commonly into the missing group, which suggests that there are different reasons that country-specific articles end up in different groups.
The authors also offer a "taxonomy of patterns of knowledge manipulation" (Table 4 from the paper), i.e. a classification of the different types of changes made on RWFork to the imported articles. This is more refined data, based on clustering algorithms, and invites further analysis:
There is indeed far more research that this data might be used for. For example, researchers might investigate whether the articles modified on RWFork have also been modified on Polish, Hungarian, or other eastern European language Wikipedias, possibly indicating a Russian interest in spreading propaganda beyond its borders.
A collective of humanities scholars publishes a manifesto and a commentary[2] to renew critical research approaches in Wikimedia research, grounded in critical humanist traditions. The group and the manifesto emerged from last year's Wikihistories symposium,[supp 1] a new research event series in the critical humanist tradition (co-organized by Wikimedia Australia). The manifesto and commentary are a call for the community to focus on the following themes:
In a blog post last week,[supp 2] one of the authors (Heather Ford) characterized the manifesto as a continuation of the Critical Point of View Conference series in 2010/11 (Signpost coverage), and the collective volume developed from it[supp 3].
While there is previous research on the manifesto's topics - in particular the "dispossession of the commons", i.e. the impact of Large Language Models and other reuses by technology companies (cf. below) on the ways Wikimedia projects function as commons - the call seems designed to encourage further inquiries and strengthen the academic community in this area.
In a paper titled "The Realienation of the Commons: Wikidata and the Ethics of 'Free' Data",[3] Zachary McDowell and Matthew Vetter argue that
In many ways, Wikipedia, and its parent company Wikimedia, can be viewed as the standard-bearers of Web 2.0's early promises for a free and open Web. However, the introduction of Wikipedia's sister project Wikidata and its movement away from "share alike" licensing has dramatically shifted the relationship between editors and complicated Wikimedia's ethics as it relates to the digital commons. This article investigates concerns surrounding what we term the "re-alienation of the commons," especially as it relates to Google and other search engine companies' reliance on data emerging from free/libre and open-source (FOSS/FLOSS) Web movements of the late 20th and early 21st centuries. Taking a Marxist approach, this article explores the labor relationship of editors to Wikimedia projects and how this "realienation" threatens this relationship, as well as the future of the community.
In more detail, the authors explain their application ofMarx's theory of alienation to Wikipedia and Wikidata as follows:
[...] Wikipedia editing allowed the average editor to subvert the capitalist status quo. The Wikipedia community was created around this new economic model—CBPP [commons-based peer production], which connected editors with their labor and connected other editors to each other through that labor. Karl Marx [...] defined alienation as "appropriation as estrangement" and stated that "realisation of labour appears as loss of realisation for the workers" [...]. Marx's concept here refers to the relationship between the product of the labor and how it is both used and disconnected from the laborer. This relationship with labor (and the community around it) marks the important distinction that helps illustrate our use of the term "realienation" with regard to Marx's usage of "alienation." [...]
Instead of Wikipedia's CC-BY-SA ("share alike") license (a license that requires derivatives and other uses of the licensed material to retain the same license), Wikidata utilizes a license that has no requirements. This might sound ideal for "freedom," but in reality, Wikidata seems to appropriate that particular FOSS imaginary of sharing while instead delicensing information into data by assigning it a CC0 license—allowing companies to extract, commodify, and otherwise use these data in ways to create systems without requirements to honor the license or reference the works that were utilized.
A problem with the paper's argument here is that its depiction of the CC0 license as contrary to Wikimedian values (and its mocking scare quotes around "freedom") is incompatible with the Wikimedia movement's conception of free licenses itself, as pointed out by several Wikimedians in a discussion with the authors in the "Wikipedia Weekly" Facebook group:
I think this [paper] is bad for the open movement as they try to make a new definition of what "free" is, contrary to Freedom defined [i.e. the definition used in the Wikimedia Foundation's 2007 licensing policy resolution that specifies the admissible content licenses on all Wikimedia projects, not just Wikidata], the Open definition and for example the Free in Free Software Foundation or the open source definition.
One of the authors rejected this criticism as "making a mountain out of a molehill", while the other stated that "the main argument I would emphasize in response is that we need to be more attentive and critical to the outcomes of CCZero licensing".
As per its abstract (quoted above), the paper explores the postulated "re-alienation [...] especially as it relates to Google and other search engine companies' reliance" on data from Wikimedia projects. In the case of Wikipedia, the authors devote ample space to summarizing earlier research about its importance for Google's search engine, and concerns that Google's Knowledge panel feature (introduced in 2012) might have "significantly reduced traffic to Wikipedia as well as average Web users' understanding of where information comes from when sourced from Wikipedia". However, they also acknowledge that "the relationship between Google and Wikipedia had been (somewhat) mutually beneficial" overall.
In contrast though, and rather peculiarly considering their overall claim that Wikidata's CC0 license makes the project more exploitable by "search engine companies", the paper cites no research or other concrete evidence about whether and how much information from Wikidata is being used in Google Search or in its knowledge panels. At one point, the authors even lament that
it is of deep concern that the Wikimedia community and Wikidata volunteers know very little with regard to how third-party consumers use Wikidata.
But McDowell and Vetter don't seem to have considered how they themselves, and the strong claims they make in their paper about the exploitation of Wikidata due to its license choice, might be affected by this lack of knowledge.
Published in the 2024 issue of the International Journal of Communication, the paper also briefly mentions "large language model generative artificial intelligence (AI) applications such as ChatGPT or Google's Bard" as a more recent example of this "realienation". However, it largely focuses on search engines and discusses artificial intelligence mostly in the form of "AI apps such as Google Knowledge Graph [and] VAs [voice assistants] (e.g., Siri, Alexa)", presumably due to its submission date (the ambiguous "11-9-2022") predating the release of ChatGPT on November 30, 2022.
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.
From the abstract:[4]
"Based on the authors' extensive involvement [with e.g. Wikimedia Germany and the Wikimedia Foundation] since the early years, this article examines Wikipedia's journey of over two decades to unravel relevant aspects of sovereignty within its unconventional organizational framework. The concept of digital sovereignty was nascent when Wikipedia emerged in 2001. In its 24-year evolution, Wikimedia's atypical organizational model, shaped by a mix of intent and happenstance, fostered digital independence while unintentionally creating pockets of dependence. Looking at the origins and the foundational principles, this article sheds light on various aspects of dependence, brought about in the areas of content, collaboration, governmental influence, legal framework and funding models."
The authors envision Wikipedia, which at the time of its origin "could have remained a marginal experiment", as a self-determining digital space. However, they conclude that this state is not the result of a deliberately orchestrated hierarchy, but rather an almost-accidental stumbling into independence through a mix of idealism and adaptation.
From the abstract:[5]
"Critical Information Systems (IS) research is sometimes appreciated for the shades of gray it adds to sunny portraits of technology's emancipatory potential. In this article, we revisit a theory about Wikipedia's putative freedom from the authority of corporate media's editors and authors. We present the curious example of Tim Cook's Wikipedia biography and its history of crowd-sourced editorial decisions [... W]hat we found pertained to authoritative discourse – the opposite of "rational discourse" – as well as Jürgen Habermas's concept of dramaturgical action. Our discussion aims to change how critical scholars think about IS's Habermasian theories and emancipatory technology. Our contribution – a critical intervention – is a clear alternative to mainstream IS research's moral prescriptions and mechanistic causes."
Specifically, the paper focuses on talk page debates about whether the article should mention the Apple CEO's homosexuality, where "advocates of privacy" prevailed until Cook himself
[...] wrote an auto-biographical essay about his sexuality, published by Bloomberg Media. [...] Corporate powers determined and disseminated the final word about Cook's sexuality, not Wikipedia's global pool of co-authors and co-editors.
In short, Wikipedia's putatively "rational discourse" (Hansen et al., 2009) did not establish the consensus; corporate media authority, the author (Cook, 2014), and his auto-biography established an orthodox position, which Wikipedia then copied.
How this critique, carried out by means of a "hermeneutic excursion", relates to our own policies on biographies of living people is not specified, as actions taken here are broadly commensurate with what policy recommends for biographies in general. The authors are unclear on this point, but offer the suggestion that the article was tainted by the use of that reference, since Cook's biography was published by a company owned by a billionaire, and he "did not release it through a social media outlet" (although Facebook, Twitter, Instagram, and Truth Social are also owned by billionaires).
(See also earlier coverage of other publications involving Habermas.)
Good roundup/selective dive as usual, HaeB. I saw an early presentation of the realienation research at a conference a couple years ago (and might as well disclose I know the authors) and had an initial pragmatic-defensive reaction: Wikidata can't just switch to a different license -- it doesn't function without CC0, so what's the point? But the more I sat with it, the more I felt like there was a really important point here about alienation, wikis, wiki contributors, and licensing.
Contributors are more and more frequently separated from our work. No amount of reaffirmation of our definition of freedom changes the reality that many people in our community regularly express feelings ranging from annoyance to demotivation because they feel like their labor is exploited.
Back in 2018, for example, Bfpage wrote a Signpost article about the experience of hearing Alexa read something she wrote on Wikipedia, without attribution. The paper focuses on Wikidata, but the objection about Alexa, and one of the chief criticisms here and elsewhere about more recently relevant companies like OpenAI and Google, isn't simply that they use Wikipedia, but that they treat Wikipedia (and everything else) as if they're CC0.
Google and Wikipedia/the rest of the web have had a historically mutually beneficial relationship, but that undeniably began to erode with Knowledge Panels, which have now given way to AI Search.
It seems to me the distance created between contributors and readers, owing to companies treating our work as though it's CC0, regardless of whether it is, does take a toll worth examining. I think there are now several people/groups working to better understand just that, like the WMF's Future Audiences, but "realienation" seems like a natural frame through which to talk about it.
BTW: "research or other concrete evidence about whether and how much information from Wikidata is being used in Google Search or in its knowledge panels" - How much of this is available? My sense is that such information would be difficult to find, and that it is easily obscured for reasons that align with the authors' arguments, but I would be happy to be wrong about that. — Rhododendrites talk 15:47, 18 July 2025 (UTC)
Thanks for your coverage and engagement @HaeB! I won't respond to some of the critiques here - mostly because they seem to be clamoring for evidence from an epistemological standpoint outside of the paper's theoretical purview. There did exist a few studies from computer science at the time of writing, which I can copy below:
Vincent, N., & Hecht, B. (2021). A deeper investigation of the importance of Wikipedia links to search engine results. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1), 1–15. doi:10.1145/3449078
Vincent, N., Johnson, I., & Hecht, B. (2018). Examining Wikipedia with a broader lens: Quantifying the value of Wikipedia's relationships with other large-scale online communities. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, April, 1–13. doi:10.1145/3173574.3174140
Zhang, C. C., Houtti, M., Smith, C. E., Kong, R., & Terveen, L. (2022). Working for the invisible machines or pumping information into an empty void? An exploration of Wikidata contributors’ motivations. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW1), 1–21. doi:10.1145/3512982
But in general, the article is a theoretical commentary - which should become obvious from the use of Marx - and probably not always going to satisfy folks coming from certain disciplinary perspectives (notwithstanding any political prejudice of Marxism).
For folks who are unfamiliar with the long history of debate about licensing, data capture, and exploitation of Wikidata/pedia, here are a few background links worth a read.
Kolbe, A. (2015). Whither wikidata? The Signpost. Retrieved from https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-12-02/Op-ed
Ford, H., & Iliadis, A. (2023). Wikidata as Semantic Infrastructure: Knowledge Representation, Data Labor, and Truth in a More-Than-Technical Project. Social Media + Society, 9(3).https://doi.org/10.1177/20563051231195552 (Original work published 2023)
But perhaps more to the point: I do want to share some of the more recent research Zach and I have done since working on the "Re-alienation" piece (which we started writing in February of 2022). Both of these further take up the "dispossession of the commons" problem we began thinking about in the piece reviewed.
An endangered species: how LLMs threaten Wikipedia’s sustainability
--https://link.springer.com/article/10.1007/s00146-025-02199-9
As a collaboratively edited and open-access knowledge archive, Wikipedia offers a vast dataset for training artificial intelligence (AI) applications and models, enhancing data accessibility and access to information. However, reliance on the crowd-sourced encyclopedia raises ethical issues related to data provenance, knowledge production, curation, and digital labor. Drawing on critical data studies, feminist posthumanism, and recent research at the intersection of Wikimedia and AI, this study employs problem-centered expert interviews to investigate the relationship between Wikipedia and large language models (LLMs). Key findings include the unclear role of Wikipedia in LLM training, ethical issues, and potential solutions for systemic biases and sustainability challenges. By foregrounding these concerns, this study contributes to ongoing discourses on the responsible use of AI in digital knowledge production and information management. Ultimately, this article calls for greater transparency and accountability in how big tech entities use open-access datasets like Wikipedia, advocating for collaborative frameworks prioritizing ethical considerations and equitable representation.
Wikipedia and AI: Access, representation, and advocacy in the age of large language models
Wikipedia, despite its volunteer-driven nature, stands as a trustworthy repository of information, thanks to its transparent and verifiable processes. However, Large Language Models (LLMs) often use Wikipedia as a source without acknowledging it, creating a disconnect between users and Wikipedia’s rich framework. This poses a triple threat to information literacy, Wikipedia’s vitality, and the potential for dynamic, updated information. This article explores the interplay between representation, accessibility, and LLMs on Wikipedia, highlighting the importance of preserving Wikipedia as a space for access, representation, and ultimately advocacy in an increasingly LLM-dominated information landscape. This article contends that, despite being over two decades old, Wikipedia remains vital not only for knowledge accumulation but also as a sanctuary for the future of knowledge representation, championing representation and accessibility in the age of closed-system LLMs.
Hopefully, these articles might get some coverage in future Research Newsletters, but I encourage folks interested in the larger problem of Wikimedia sustainability and exploitation to check out our research. Many thanks to all of the folks, esp. Smallbones, E mln e, Tilman Bayer and JPxG, working on the Research Newsletter!
Matthewvetter (talk) 16:49, 6 August 2025 (UTC)
I'm not surprised that the vast majority of articles have not been changed on the RUfork. I suspect that both will contain articles on aardvarks, the Amazon river, the Angevin empire, the Algarve and athlete's foot, along with a multitude of other topics which are not obviously controversial in a Ukrainian/Russian war. If the RUfork persists, it would be interesting to see how they keep it up to date on those uncontentious subjects. A project that mostly edits in Moscow office hours does not sound like one that has recruited the necessary volunteers to maintain such a fork. If the bulk of RUfork is unmaintained and allowed to go stale, then the project will be unattractive to readers and can only persist on life support. If they set up a scraper feed to update uncontentious articles with the latest edits from Russian Wikipedia, then the project won't appear stale to readers, but would be even less attractive to volunteers. I doubt it will long outlive Mr Putin. If there is a follow-up article on the RUfork, I would like to see this point addressed. ϢereSpielChequers 07:36, 25 July 2025 (UTC)
Worth noting here that some of the authors of this paper have also recently published a thesis and a conference paper about closely related topics (both based on data from the Russian fork):