Movatterモバイル変換

Jump to content

Wikimedia Forum

From Meta, a Wikimedia project coordination wiki

Latest comment:2 days ago by SuperGrey in topicProposed mass import of court documents to Chinese Wikisource

Shortcut:
WM:FORUM

TheWikimedia Forum is a central place for questions, announcements and other discussions about theWikimedia Foundation and its projects. (For discussion about the Meta wiki, seeMeta:Babel.)
This is not the place to make technical queries regarding theMediaWiki software; please ask such questions at theMediaWiki support desk; technical questions about Wikimedia wikis, however, can be placed onTech page.

You can reply to a topic by clicking the "[edit]" link beside that section, or you canstart a new discussion.

Wikimedia Meta-Wiki

This box:view ·talk ·edit

SpBot archives all sections tagged with{{Section resolved|1=~~~~}} and sections whose most recent comment is older than 30 days.

Proposed mass import of court documents to Chinese Wikisource

Latest comment:2 days ago23 comments8 people in discussion

A longer explaination is available here. In a nutshell:A user proposed to run a bot to mass import Chinese court documents to Chinese Wikisource. This will potentially results in 85 million new content pages be created in Chinese Wikisource. He is asking other Wikimedia community users and WMF opinions on that. This will also have an potential impact on Wikidata, since there isanother user running a bot (MidleadingBot) to create an item for each pages, which will result in 85 million new items in Wikidata.GZWDer (talk)12:58, 6 December 2025 (UTC)Reply

Interesting. I'm curious if personal identifiable information (PII) is planned to be scrubbed before upload?So9q (talk)04:26, 12 December 2025 (UTC)Reply

The data is from China Judgments Online, which is by China’s court publicity standards, already compliant with China’s privacy laws. That being said, if a court rules that some public documents once published, are no longer suitable and needs further redaction or takedown, we could ask them to file a DMCA complaint, just like what other mirrors of China Judgments Online do (on the internet there are already some mirrors, but most have paywalls; and just as I said in the original proposal, they also takedown documents due to political censorship, which we won’t follow along).SuperGrey (talk)16:49, 16 December 2025 (UTC)Reply

You didn't answer my question. Will the document you intent to upload in Wikisource contain names, addresses and other PII?So9q (talk)16:07, 17 December 2025 (UTC)Reply

These public documents don't have a very specific standard on their redaction levels, but from what I observe: names are occasionally redacted; personal addresses are always removed; government IDs are always removed or redacted; phone numbers are always redacted; other PIIs are most likely redacted. The situation above generally follows theProvisions mentioned below.
I don't plan to manually remove or redact PIIs by myself if they are already present on these public documents, except for the following situation: According to theThe Supreme People's Court Provisions on People's Courts Release of Judgments on the Internet,

Article 8: When releasing judgment documents on the internet, people's courts shall redact the names of the following personnel:
(1) parties and their legally-designated representatives in marriage and family cases, or inheritance disputes;
(2) Victims and their legally-designated representatives, incidental civil action plaintiffs and their legally-designated representatives, witnesses, and expert evaluators in criminal cases;
(3) Minors and their legally-designated representatives;
Article 9: The redaction of names in accordance with Article 8 of these Provisions shall be handled according to the following circumstances:
(1) Retain the surname and replace the given name with "X" [某];
(2) For the names of ethnic minorities, retain the first character and replace the rest with "X" [某];
(3) For the Chinese translations of the names of foreigners and stateless persons, retain the first character and replace the rest with "X" [某]; for the English names of foreigners and stateless persons, retain the first English letter and delete the rest;
Where different names become identical after redaction, differentiate between them by adding Arabic numerals.
Article 10: When releasing judgment documents on the internet, people's courts shall delete the following information:
(1) Natural persons' home addresses, contact information, ID numbers, bank account numbers, health conditions, vehicle license plate numbers, movable or immovable property ownership certificate numbers, and other personal information;
(2) Legal persons' and other organizations' bank account numbers, vehicle license plate numbers, movable or immovable property ownership certificate numbers and other information;
(3) Information involving commercial secrets;
(4) Information involving personal privacy in family disputes, personality rights and interests disputes and other such disputes;
(5) Information involving technical investigation measures;
(6) Other information that people's courts find inappropriate to release.
Where deleting information in accordance with the first paragraph of this Article interferes with correctly understanding the judgment document, use the symbol "×" as a partial substitute.
Article 11: When releasing judgment documents on the internet, people's courts shall retain the following information of parties, legally-designated representative, entrusted representatives, and defenders:
(1) Except where names are redacted in accordance with Article 8 of these Provisions, where the parties and their legally-designated representatives are natural persons, retain their names, dates of birth, sexes, and the districts or counties to which their domiciles belong; where the parties and their legally-designated representatives are legal persons or other organizations, retain their names, domiciles, organization codes, and the names and positions of their legally-designated representatives or principal responsible persons.
(2) Where the entrusted representatives or defenders are lawyers or basic level legal service workers, retain their names, license numbers, and the names of their law firms or basic level legal service organizations; where the entrusted representatives or defenders are other personnel, retain their names, dates of birth, sexes, the districts or counties to which their domiciles belong, and their relationship with the parties.

The above articles all begin with "people's courts shall", so naturally respecting what they have released is good enough. Still, we should open to their complaints in case they want to mend their mistakes if they found some documents unsuitable to release or need further redaction, as is required by the law.

SuperGrey (talk)07:06, 18 December 2025 (UTC)Reply

85 million is not a normal scale. This is not what the website is for and likely would completely bring down wikimedia media files. —TheDJ (talk •contribs)12:15, 15 December 2025 (UTC)Reply

oh wait, I figured these were pdfs, but it seems they are only the OCR'ed content ? That is slightly better I guess. Still 85 million is a lot. It likely requires significant scaling up of the wiki, and moving it to a separate database cluster. And I'm not sure if Wikidata can scale that far as well at this moment. regardless, I think it is up to Wikidata project to determine if they even want that many entries. Even for journals and a few other categories people are already discussing about potentially splitting it off from Wikidata, and the added value of all those entries to them would be near 0. —TheDJ (talk •contribs)12:28, 15 December 2025 (UTC)Reply

+1, I started a discussion in the main Wikidata telegram channel with this as input. My view: Basically Wikidata has serious problems with scaling and the community and team have not been able to communicate well/effectively for years about it and come up with good solutions. (but this might be changing now which would be great!)

Interesting enough there are few Wikibases in or outside of Wikibase.cloud that have >1000 items what I have heard about.

It might be worth studying why Wikibase is so "slow" to take off despite being freely available and taking only minutes to set up.

Maybe new developments likefederated values are needed?So9q (talk)21:03, 15 December 2025 (UTC)Reply

User:Lydia Pintscher (WMDE) has said multiple times over the past few years that Wikidata can't handle imports on that scale, neither technically nor socially. The query service already had to be split in two because of the extremely large (and, for some reason, still ongoing) import of scientific articles.User:ASarabadani (WMF) has said that the size of the database is problematic as well and can't keep growing at the rate it currently is (d:User:ASarabadani (WMF)/Growth of databases of Wikidata).

I already responded in the Wikidata Telegram group (on the 8th of December) to the user proposing the import saying that Wikidata can't really handle that many new items and Lydia agreed with what I wrote. -Nikki (talk)05:19, 16 December 2025 (UTC)Reply

I am in that email thread and I support this project and anyone else who has 100,000,000 things to share.

Step 1 is uploading about 10 examples and doing the data modeling. People who do this typically take 6 months for that process.@SuperGrey: is an experienced Wikimedia editor who is obviously here for the long term.

I think it is an error to dismiss a project because of its size without checking what value it can have to build out other Wikimedia content. While I do greatly doubt that the Wikimedia platform can find a use for 100 million legal documents, there are lots of examples of institutions which use Wikibase instances which exchange data with Wikidata. It could happen that we only want ~500,000 of these documents, but a Wikibase holds 100 million, and the data modeling improves Wikidata processes and our Linked Open Data infrastructure by improving modeling of courts, cities, laws, and subject matter of cases.

If anyone says they want to do a big project, then I always support them doing a pilot with 10 examples. Bluerasberry (talk)16:02, 16 December 2025 (UTC)Reply

+1So9q (talk)16:21, 16 December 2025 (UTC)Reply

Thanks for the compliment and ping. I’ll upload 10 sample documents and write a proposal page here on Meta-Wiki.SuperGrey (talk)16:38, 16 December 2025 (UTC)Reply

what about setting up a separate wikibase in wikibase.cloud and model there?

you can link to commons for example.So9q (talk)15:58, 17 December 2025 (UTC)Reply

I just want to add that d:User:ASarabadani (WMF) was worried about the size of the db tables because we ran Wikidatas Mariadb cluster at the time of the writing of the report on small commodity servers with limited RAM. We try to store everything in RAM to keep the query servers fast. We had one master and about 10 slave read-only clones back then.

Since then 2 things have happened:

Silently WMF/WMDE beefed the servers to match those of the dbserver of OSM. I don't know who made that decision or when, I can just see in Grafana that there is no ongoing issue right now.
Elsewhere I have multiple times proposed a revision of the "keep every edit ever made to Wikdiata in one big history table in MariaDB"-strategy that WMDE has been running for years. I have seen no such revision yet. My proposal is to create a new archive MariaDB cluster with cheaper servers for all history over 2 years old. I think this would greatly reduce the memory requirement on the master mariadb cluster. Unfortunately, I have not heard back from anyone at WMDE about this.

So9q (talk)16:27, 16 December 2025 (UTC)Reply

I encourage WMDE to develop Wikibase further to handle the data the community wants to store in an efficient and scalable way. That's hard to do. But we could start lifting out all the scientific data that the rest of Wikimedia does not need to have in Wikidata because they don't need it for interwikilinks.

I think we need to move from:

Just upload to WD because it might be useful to someone
Let's just keep and improve the large scholarly graph with overall low quality data that no other Wikimedia wiki needs to link to

->

Set up a community wikibase for any new dataset >1000 items that you would perhaps later like to be included in Wikidata
Make sure the modeling is solid and approach the Wikibase community if you think other Wikimedia wikis have a need to link to parts of it.
If the community judge that part of the collection of items fit in Wikidata, then hooray let's import it after approval. You might end up in a split situation where only some items are welcome in Wikidata.

So9q (talk)16:33, 16 December 2025 (UTC)Reply

Just here to confirm everything @Nikki said.Lydia Pintscher (WMDE) (talk)11:04, 17 December 2025 (UTC)Reply

SuperGrey, DMCA is for copyright issues, not for personal rights issues. Also the issue is not PII (a USA concept) but personal data (an EU concept). How well-indexed are these documents? If you search a string they contain, how likely it is to turn up on Google/Bing/Baidu?

There's a risk that we surface personal data which is currently relatively obscure. People mentioned in court rulings in China may be EU residents now, which would entitle them to GDPR rights.

The idea is generally interesting and it's definitely appropriate for Wikisource to host court rulings, though it's perhaps not the best project to provide acomprehensive database of court rulings. I recommend to implement it gradually, so that you can find issues along the way.All projects which tried something like this before have run into issues of insufficient redaction in the official databases (seeJurisWiki.it in Italy,Carl Malamud in the USA), so you must assume that the same will happen here and that you will need to be in contact with the source database to help them fix mistakes. Have you already established some contacts with them?Nemo 16:18, 18 December 2025 (UTC)Reply

Most of them are not indexed by any search engines or available in websites searchable by search engines.GZWDer (talk)14:01, 19 December 2025 (UTC)Reply

Compare Swedish Riksdag documents 250,000 Swedish legal documents fromWikidata:WikiProject Sweden/Swedish Riksdag documents,profile in Scholia. I think this is a great use of Wikidata as the works are categorized by top, have assigned topics, have disambiguated authors, and have other associated data. I would like Wikidata to be stable enough so that any country could have a law document project of this size. Bluerasberry (talk)21:08, 20 December 2025 (UTC)Reply
Thanks! Will take note.SuperGrey (talk)16:27, 12 January 2026 (UTC)Reply

Formal proposal

Hi everyone! Sorry for the late reply. I had family matters to take care of during Christmas.

Based on concerns raised on Meta-Wiki and in the email thread (scale, privacy/personal data, and Wikidata spillover), I drafted a formal WMF-facing proposal here:China Judgments Online Preservation Program. The program is staged and requires WMF review before expanding each stage; it also keeps Wikidata item creation out of scope and includes a dedicated privacy request workflow.

Comments and suggestions are welcome, here and/or atthe proposal's talk page.

Cheers,SuperGrey (talk)16:37, 12 January 2026 (UTC)Reply

Update: I have received formal community approval from Chinese Wikisource on January 28, 2026.SuperGrey (talk)08:51, 8 February 2026 (UTC)Reply

Does anyone know who/which WMF tech team is in charge of the technical limitations of Wikisource?SuperGrey (talk)23:05, 15 February 2026 (UTC)Reply

Restricting AI-generated images of people

Latest comment:10 days ago5 comments5 people in discussion

Commons community is considering to revise its acceptance policy to reject some (most?) AI-generated images of living people and dead people. This could mean that images might be deleted when they are AI-generated, even if they are being shown on sister projects, instead of just sitting on Commons. I notify here because this can affect sister projects outside of Commons. Discussions have been ongoing at least since December 2025.

Links:

whym (talk)10:39, 7 January 2026 (UTC)Reply

An issue of this is that by censoring images of that type, the long-standing pillar policyc:COM:INUSE would start to get exceptions – people can be less and less sure that files uploaded to Commons used in Wikipedia articles or other project sites will remain on Commons instead of getting deleted by arbitrary decisions by the Commons community. This undermines the stability of the project and trust in Commons.

This policy proposal seeks to ban also neutral depictions of real people which are not banned as paintings and also those for long-dead people like ancient pharaohs etc. It's also entirely unnecessary since there already is a policy against files that violate the dignity of real people,c:COM:DIGNITY and people can easily nominate individual problematic files (or many of them at once) for deletion. The question is whether to sacrifice the consistency of two important Commons policies,c:COM:INUSE andc:COM:NOTCENSORED, without any reason in the proposed policy that tries to justify the policy and without any need. This affects all Wikimedia projects.Prototyperspective (talk)14:39, 13 January 2026 (UTC)Reply

I feel all AI content should banned from Wikipedia and related projects without exception.19:58, 7 February 2026 (UTC)Reply
This is up to local communities to decide on this matter.A09|(pogovor)19:59, 7 February 2026 (UTC)Reply
- It simply is not feasible. Even the term "AI content" creates havoc, as it is not clear. The most used grammar checkers are AI-enhanced, so anything corrected by them is "AI content" – did you mean to include that? If a user asks an AI how to edit a page, and then follows the AI's instructions, is that AI content? If a user asks an AI how to improve a page and then does it, is that AI content? And once you've defined what you meant, there's the issue of identifying AI content. More and more people are spending greater and greater amounts of time interacting with AIs. So much so, that they are picking up new AI-like communication habits, such as manner of speech and writing styles, making false positives more likely when analyzing their submissions. AI is developing rapidly, with their output becoming more human-like, so that it gets harder over time to detect AI submissions. At the same time, AI users are becoming more adept at using AI tools, and expect to be able to use them on whatever they are working on. As with using any tool, practice improves results. If you turn those people away, they'll go make their valuable skills available somewhere else. Last but not least, is AI itself: its ability to produce instant quality articles upon request is improving, while its ability to make instant websites of quality is rapidly progressing as well. This threatens Wikipedia's traffic, and in turn may reduce readers and new editor recruiting to a trickle. For Wikipedia to survive the AI intelligence explosion that is happening right now, it will need to embrace AI and leverage it, much more than it is doing now.The Transhumanist (talk)02:56, 8 February 2026 (UTC)Reply

Annual review of the Universal Code of Conduct and Enforcement Guidelines

I am writing to you to let you know the annual review period for the Universal Code of Conduct and Enforcement Guidelines is open now. You can make suggestions for changes through 9 February 2026. This is the first step of several to be taken for the annual review.Read more information and find a conversation to join on the UCoC page on Meta.

TheUniversal Code of Conduct Coordinating Committee (U4C) is a global group dedicated to providing an equitable and consistent implementation of the UCoC. This annual review was planned and implemented by the U4C. For more information and the responsibilities of the U4C,you may review the U4C Charter.

Please share this information with other members in your community wherever else might be appropriate.

-- In cooperation with the U4C,Keegan (WMF) (talk)

21:01, 19 January 2026 (UTC)

Questions to the community in preparation for WMF Annual Plan

Latest comment:26 days ago1 comment1 person in discussion

The Wikimedia Foundation'sProduct and Technology department (the part that develops features and works on code changes) is starting it'sAnnual planning process (APP) where it decides what it will be allocating resources to/prioritizing over the next fiscal year. To inform WMF's decision making on what to prioritize, they are requesting community members to answer a few questions/provide feedback on what should be considered atthis meta wiki page until Feb 10. --Sohom (talk)11:39, 23 January 2026 (UTC)Reply

[Research] Preliminary analysis of AI-assisted translation workflows

Latest comment:23 days ago2 comments2 people in discussion

Note: this was also posted inEN Wikipedia's village pump.

I’m sharing the results of arecent study conducted by theOpen Knowledge Association (OKA), supported byWikimedia CH, on using Large Language Models (LLMs) for article translation. We analyzed 119 articles across 10 language pairs to see how AI output holds up against Mainspace standards.

Selected findings:

LLMs were found to be significantly better than traditional tools at retaining Wikicode and templates, simplifying the "wikification" process.
26% of human edits fixed issues already present in the source article (e.g., dead links), showing that the process improves the original content too.
Human editors modified about 27% of the AI-generated text to reach publication quality.
We found a ~5.6% critical error rate (distortions or omissions). This confirms that "blind" AI publication is not suitable; human oversight is essential.
Claude and ChatGPT led in prose quality, while Gemini showed a risk of omitting text. Grok was the most responsive to structural formatting commands.

Acknowleging limitations: We consider these findings a "first look" rather than a definitive conclusion. The study has several limitations, including:

Subjectivity: Error categorization is inherently dependent on individual editor judgment.
Non-blind testing: Editors knew which models they were using, which likely influenced their prompting strategies.
Sample size: While we processed over 400,000 words, the data for specific model comparisons across all 10 language pairs is insufficient.

Our goal is to provide some data for the community as we collectively figure out the best way to handle these tools.

The full report, including the error taxonomy and raw data logs, is availablehere.7804j (talk)16:21, 24 January 2026 (UTC)Reply

LLMs were found to be significantly better than traditional tools at retaining Wikicode and templates, simplifying the "wikification" process. this is one of the main issues with the current translation system/methods.

I think the conclusion shouldn't be to become more fine with LLM-translations (not that you're suggesting that; you mentioned the~5.6% critical error rate) but instead to improve the translation system (which of course if one doesn't want to waste huge amounts of time and be limited by one's personal vocabulary & skills, involved machine translation).

This is whatW376: Better article translation system is about. There also is an alternative approach mentioned there which would also enable templates to be translated properly etc. If it's too long for people to glance over it: when I translate an article from English WP, I need to spend more exhausting time on fixing all the ref template issues than for actual proofreading and rewording/rewriting.Prototyperspective (talk)18:47, 25 January 2026 (UTC)Reply

"Active" metrics on Wikipedias

Latest comment:21 days ago8 comments3 people in discussion

As seen atenWiki stats, active users are "users who have performed an action in the last 30 days". Perhaps it is also the criterion for all language editions of Wikipedias. In my opinion it seems unrealistic: a user may just make a major edit to an article while at the workplace or school and never returns for another 30 days, which is contrary to the common sense of an "active" user. Another case, some users may make few edits on other language editions even if they don't speak or natively understand those language, because they need to change the file links (for example,my edits at German Wikipedia related to replacing files or changing file names). Just because a user made a single, file change-related edit in a foreign-language Wikipedia doesn't mean he/she is going tk be active there.

Fromen.wiktionary, "active" is defined as "given to action;constantly engaged in action; energetic; diligent; busy." Changing file name instance in a foreign-language Wikipedia isn't a hallmark of being "constantly-engaged" in that Wikipedia. Moreover, someone who only made an edit from their workplace or school (and won't make another edit on wiki within 30 days) isnot a characteristic of being energetic or busy.

I suggestraising the bar of metrics in active users across Wikipedias. My suggestions: either 7 or 10 non-minor actions.

"Users who have performedat least ten (10) non-minor actions in the last 30 days."

"Users who have performedat least seven (7) non-minor actions in the last 30 days."

Exclusion: active user criteria in non-Wikipedias (like WikiCommons, Wikidata and Wiktionaries). Also, this suggestion only concerns human users, not bot users. Furthermore, this only affects the status of Wikipedia users who are "active" or not; the criteria on adminships and other positions in local Wikipedias remain unchanged (the same).JWilz12345(Talk|Contributions)02:10, 25 January 2026 (UTC)Reply

No. In the context of small or new wikis, a user who makes an edit last month is still considered active if there are only dozens of edits a day in the small wiki.Midleading (talk)07:06, 25 January 2026 (UTC)Reply

@Midleading in response to your concern, I'm adding a condition that my suggestion "Users who have performed at least ten (10) non-minor actions in the last 30 days." will only apply to Wikipedias with at least 100,000 total number of users as perList of Wikipedias, starting from Mongolian Wikipedia (101,092 users as of this writing) up to English Wikipedia (51,235,474 users). For all other Wikipedias with less than 100,000 users, the original criterion ("an action in the last 30 days") still applies.JWilz12345(Talk|Contributions)06:58, 27 January 2026 (UTC)Reply

Not every Wikipedia community accept your definition of activeness. Probably, you need to consult your German Wikipedia and English Wikipedia community first to see whether your idea is actually useful. Personally, I don’t see there is a necessity to change it.Midleading (talk)10:41, 27 January 2026 (UTC)Reply

@Midleading is my single edit just to change the file name instance on German Wikipedia count as "active" there, despite that I haveno intention of becoming active there since German isnot my native or near-professional language (that doesn't involve GTranslations)?JWilz12345(Talk|Contributions)14:40, 28 January 2026 (UTC)Reply

Support

Support I agree with what you said and support your proposal for Wikimedia projects with at least 100,000 total number of users. The current numbers aren't useful. It's worth mentioning that sometimes some users take a break for 1-2 months but are otherwise very active but that's a low fraction and easily balanced out with users who are actually inactive but just made a few trivial edits once (in 1 month). Your proposed changes would make the stats far more useful and reasonable albeit they could certainly do well with further improvements, such as allowing users to customize what counts as active or inactive themselves and then see the respective stats.Prototyperspective (talk)13:25, 27 January 2026 (UTC)Reply

@Prototyperspective Note: only Wikipedias. Other projects, like Wikimedia Commons, Wiktionary or Wikidata, would still use the current criterion.JWilz12345(Talk|Contributions)14:38, 28 January 2026 (UTC)Reply

That's in your proposal, yes. I would also support this for Commons, Wikidata, etc because there 10 edits a month is as little or less than the same number on an active Wikipedia like ENWP.Prototyperspective (talk)14:54, 28 January 2026 (UTC)Reply

Testing simpler New Account creation

Latest comment:17 days ago1 comment1 person in discussion

New account creation on Android is significantly better than on other platforms. It is just "username, password, repeat password, email" with a captcha after you enter that. On mobile and desktop web, there are 10-20 unnecessary extra lines of text, images, and headers. One has to scroll down the page to find a button to submit. And there are a number of common ways that your account creation request could be rejected (password too short, password too simple) which don't show up as an inline password-strength visual but instead intrude into pageflow as a red error message after you leave a textarea to go to the next.

At an experiment lab we're discussing testing alternate flows for desktop and mobile, testing on Meta. Please share your thoughts and comments here:User:Sj/Design chats/Create account

— The precedingunsigned commentwas added bySj (talk)11:55, 1 February 2026 (UTC)Reply

Retrieved from "https://meta.wikimedia.org/w/index.php?title=Wikimedia_Forum&oldid=30082289"

[8]ページ先頭

©2009-2026 Movatter.jp