
Wikitech-l, February 2008

wikitech-l@lists.wikimedia.org
  • 100 participants
  • 106 discussions
Database dumps
by Byrial Jensen 17 Apr '25

Until some weeks ago, http://dumps.wikimedia.org/backup-index.html used to show 4 dumps in progress at the same time. That meant that new database dumps normally were available within about 3 weeks for all databases except for enwiki and maybe dewiki, where the dump process took longer due to size.

However, the 4 dump processes at a time became 3 some weeks ago, and after massive failures on June 4, only one dump has been in progress at a time. So at the current speed it will take several months to get through all the dumps.

Is it possible to speed up the process again by running several dump processes at the same time?

Thank you,
Byrial
3 2
EBNF grammar project status?
by Steve Bennett 01 Apr '25

What's the status of the project to create a grammar for Wikitext in EBNF? There are two pages:

http://meta.wikimedia.org/wiki/Wikitext_Metasyntax
http://www.mediawiki.org/wiki/Markup_spec

Nothing seems to have happened since January this year. Also, the comments on the latter page seem to indicate a lack of a clear goal: is this just a fun project, is it to improve the existing parser, or is it to facilitate a new parser? It's obviously a lot of work, so it needs to be of clear benefit.

Brion requested the grammar IIRC (and there's a comment to that effect at http://bugzilla.wikimedia.org/show_bug.cgi?id=7), so I'm wondering what became of it. Is there still a goal of replacing the parser? Or is there some alternative plan?

Steve
26 217
MediaWiki to Latex Converter
by Hugo Vincent 18 Jun '12

Hi everyone,

I recently set up a MediaWiki (http://server.bluewatersys.com/w90n740/) and I need to extract the content from it and convert it into LaTeX syntax for printed documentation. I have googled for a suitable OSS solution but nothing was apparent. I would prefer a script written in Python, but any recommendations would be very welcome.

Do you know of anything suitable?

Kind Regards,
Hugo Vincent,
Bluewater Systems.
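[Editor's note: no ready-made tool is named in the thread. As a rough illustration of the Python route asked about here, a few regex substitutions cover the most common inline wikitext. Everything below (rule list, function name) is a hypothetical sketch, not an existing converter; real wikitext (templates, tables, nesting) needs a proper parser.]

```python
import re

# Illustrative sketch only: a handful of regexes for the most common
# inline wikitext markup. Rule order matters: ''' (bold) must be
# rewritten before '' (italics).
RULES = [
    (re.compile(r"'''(.+?)'''"), r"\\textbf{\1}"),            # bold
    (re.compile(r"''(.+?)''"), r"\\emph{\1}"),                # italics
    (re.compile(r"^== (.+?) ==$", re.M), r"\\section{\1}"),   # level-2 heading
    (re.compile(r"^=== (.+?) ===$", re.M), r"\\subsection{\1}"),
    (re.compile(r"\[\[(?:[^|\]]+\|)?([^\]]+)\]\]"), r"\1"),   # unwrap links
]

def wikitext_to_latex(text):
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text
```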
6 13

13 Oct '09
I've been putting placeholder images on a lot of articles on en:wp, e.g. [[Image:Replace this image male.svg]], which goes to [[Wikipedia:Fromowner]], which asks people to upload an image if they own one.

I know it's inspired people to add free content images to articles in several cases. What I'm interested in is numbers. So what I'd need is a list of edits where one of the SVGs that redirects to [[Wikipedia:Fromowner]] is replaced with an image. (Checking which of those are actually free images can come next.)

Is there a tolerably easy way to get this info from a dump? Any Wikipedia statistics fans who think this'd be easy?

(If the placeholders do work, then it'd also be useful for convincing some wikiprojects to encourage the things. Not that there's ownership of articles on en:wp, of *course* ...)

- d.
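[Editor's note: assuming you iterate over consecutive revision texts from a pages-meta-history dump, the per-edit check could look like the sketch below. The function name and placeholder filenames are illustrative; the second SVG name is a guessed sibling of the one cited above.]

```python
# Hypothetical sketch: detect an edit that removed a placeholder SVG
# while leaving some other image behind.
PLACEHOLDERS = {
    "Replace this image male.svg",
    "Replace this image female.svg",  # assumed sibling placeholder
}

def placeholder_replaced(old_text, new_text):
    """True if an edit removed a placeholder and the new text has an image."""
    removed = any(p in old_text and p not in new_text for p in PLACEHOLDERS)
    added_image = "[[Image:" in new_text
    return removed and added_image
```

Filtering which replacements are actually free images would be a second pass, as the mail itself notes.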
7 11
Case insensitive links (not just titles).
by subscribe@divog.com.ru 23 Jun '08

Hi,

Sorry for my English :) What I need is case insensitive titles. My solution for the problem was to change the collation in MySQL from <utf8_bin> to <utf8_general_ci> in table <page>, for field <page_title>.

But a bigger problem with links persists. In my case, if there is an article <Frank Dreben>, the link [[Frank Dreben]] is treated as a link to an existent article (GoodLink), but the link [[frank dreben]] is treated as a link to a non-existent article, so this link opens editing of the existent article <Frank Dreben>. What can be fixed so that the link [[frank dreben]] is treated as a GoodLink?

I've spent some time in Parser.php, LinkCache.php, Title.php, Linker.php, LinkBatch.php but found nothing useful. The last thing I tried was to do strtoupper on the title every time the link cache array is filled, in LinkCache.php. I also tried to do strtoupper on the title every time data is fetched from the array. I've tried to make titles in the cache case insensitive, but it didn't work out, not sure why - it seems like when links are constructed (parser, title, linker, etc.) only LinkCache methods are used.

Could anybody point a direction to dig in? :)
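[Editor's note: MediaWiki's link cache is PHP; the Python sketch below only illustrates the idea the poster is circling - key the cache on a case-folded form of the title so that [[frank dreben]] and [[Frank Dreben]] hit the same entry. The class and method names are invented for illustration.]

```python
# Illustrative sketch of a case-insensitive link cache: keys are
# case-folded, values keep the canonical (stored) title.
class CaseInsensitiveLinkCache:
    def __init__(self):
        self._good = {}  # folded title -> canonical title

    @staticmethod
    def _fold(title):
        return title.casefold()

    def add_good_link(self, title):
        self._good[self._fold(title)] = title

    def resolve(self, title):
        """Return the canonical title if known, else None (bad link)."""
        return self._good.get(self._fold(title))
```

The folding has to happen on both sides (insertion and lookup), which matches the poster's observation that changing only one of the two did not work.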
7 36
Interface embarrassment rant
by Magnus Manske 24 Apr '08

<rant>

I'm currently working on the Scott Foresman image donation, cutting large scanned images into smaller, manually optimized ones. The category containing the unprocessed images is http://commons.wikimedia.org/wiki/Category:ScottForesman-raw

It's shameful. Honestly. Look at it. We're the world's #9 top website, and this is the best we can do?

Yes, I know that the images are large, both in dimensions (~5000x5000px) and size (5-15MB each). Yes, I know that ImageMagick has problems with such images. But honestly, is there no open source software that can generate a thumbnail from a 15MB PNG without nuking our servers?

In case it's not possible (which I doubt, since I can generate thumbnails with ImageMagick from these on my laptop, one at a time; maybe a slow-running thumbnail generator, at least for "usual" sizes, on a dedicated server?), it's no use cluttering the entire page with broken thumbnails.

Where's the option for a list view? You know, a table with linked title, size, uploader, date, no thumbnails? They're files, so why don't we use things that have proven useful in a file system?

And then, of course: "There are 200 files in this category." That's two lines below the "(next 200)" link. At that point, we know there are more than 200 images, but we forget about that two lines further down?

Yes, I know that some categories are huge, and that it would take too long to get the exact number. But would the exact number for large categories be useful? 500,000 or 500,001 entries, who cares? How many categories are that large anyway? 200 or 582 entries, now /that/ people might care about.

Why not at least try to get a number, set a limit to, say, 5001, and
* give the exact number if it's less than 5001 entries
* say "over 5000 entries" if it returns 5001

Yes, everyone's busy. Yes, there are more pressing issues (SUL, stable versions, you name it). Yes, MediaWiki wasn't developed as a media repository (tell me about it ;-) Yes, "sofixit" myself.

Still, I ask: is this the best we can do?

Magnus

</rant>
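[Editor's note: the bounded-count proposal at the end of the rant can be sketched in a few lines. The function below is illustrative Python, not MediaWiki code: fetch at most limit+1 rows and report either the exact count or "over <limit>".]

```python
# Sketch of the bounded category count: stop as soon as limit+1 rows
# have been seen, so huge categories never require a full COUNT(*).
def bounded_count(rows, limit=5000):
    n = 0
    for _ in rows:
        n += 1
        if n > limit:
            return f"over {limit} entries"
    return f"{n} entries"
```

In SQL terms this corresponds to selecting with LIMIT 5001 and counting the returned rows, exactly as the mail proposes.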
15 56
The most recent enwiki dump seems corrupt (CRC failure when bunzipping). Another person (Nessus) has also noticed this, so it's not just me: http://meta.wikimedia.org/wiki/Talk:Data_dumps#Broken_image_.28enwiki-20080…

Steps to reproduce:

lsb32@cmt:~/enwiki> md5sum enwiki-20080103-pages-meta-current.xml.bz2
9aa19d3a871071f4895431f19d674650  enwiki-20080103-pages-meta-current.xml.bz2
lsb32@cmt:~/enwiki> bzip2 -tvv enwiki-20080103-pages-meta-current.xml.bz2 &> bunzip.log
lsb32@cmt:~/enwiki> tail bunzip.log
    [3490: huff+mtf rt+rld]
    [3491: huff+mtf rt+rld]
    [3492: huff+mtf rt+rld]
    [3493: huff+mtf rt+rld]
    [3494: huff+mtf rt+rld]
    [3495: huff+mtf data integrity (CRC) error in data
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
lsb32@cmt:~/enwiki> bzip2 -V
bzip2, a block-sorting file compressor.  Version 1.0.3, 15-Feb-2005.

Copyright (C) 1996-2005 by Julian Seward.

This program is free software; you can redistribute it and/or modify
it under the terms set out in the LICENSE file, which is included
in the bzip2-1.0 source distribution.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
LICENSE file for more details.

bzip2: I won't write compressed data to a terminal.
bzip2: For help, type: `bzip2 --help'.
lsb32@cmt:~/enwiki>
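[Editor's note: the same integrity check can be done from Python's standard library without shelling out to bzip2. The sketch below streams the file in chunks, as a multi-gigabyte dump should never be held in memory; the function name is illustrative.]

```python
import bz2

# Incremental equivalent of `bzip2 -t`: feed chunks through a
# decompressor and report whether the stream is complete and valid.
def bz2_ok(chunks):
    """Return True if the concatenated chunks form a valid bz2 stream."""
    dec = bz2.BZ2Decompressor()
    try:
        for chunk in chunks:
            dec.decompress(chunk)
    except OSError:  # raised on CRC or format errors
        return False
    return dec.eof   # False if the stream was truncated
```

A truncated download returns False here without raising, since the end-of-stream marker is never reached.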
4 6

08 Apr '08
> Message: 8
> Date: Fri, 12 Oct 2007 17:59:22 +0200
> From: GerardM <gerard.meijssen(a)gmail.com>
> Subject: Re: [Wikitech-l] Primary account for single user login
>
> Hoi,
> This issue has been decided. Seniority is not fair either; there are
> hundreds if not thousands of users that have done no or only a few
> edits and I would not consider it fair when a person with say over
> 10.000 edits should have to defer to these typically inactive users.

1. Yes, it's not fair, but this is a truth about the Wikimedia projects that one has to admit. Imagine if all Wikimedia sites had had single user login since they were first established: the one who registered a username first would own that username on all Wikimedia sites.

2. A person with fewer edits is not necessarily less active than one with more edits. According to http://en.wikipedia.org/wiki/Wikipedia:Edit_count, ``Edit counts do not necessarily reflect the value of a user's contributions to the Wikipedia project.''

Some users may have a lower edit count because:
* they deliberately edit, preview, edit, and preview articles, over and over, before submitting the deliberated version;
* they edit articles in offline storage, over and over, before submitting only the final version.

While some users have a higher edit count because:
* they often submit many changes without previewing first, and have to correct the undeliberated edits, over and over;
* they often submit many minor changes, over and over, rather than accumulating them into fewer edits;
* they do many robot routines by themselves, rather than letting a real robot do those tasks;
* they often take part in edit wars;
* they often take part in arguments on many talk pages.

What if users with a lower edit count try to increase their edit count to take back the status of primary account? What if they decide to change their editing habits to increase the count:
* by submitting many edits without a deliberated preview,
* by splitting accumulated changes into many minor edits and submitting them separately,
* by stopping their robots and doing those robot routines themselves,
* by joining edit wars?

3. According to 2) above, I think a better measurement of activeness is the time between the first edit and the last edit of that username. The formula would look like this:

activeness = last edit time - first edit time

> A choice has been made and as always, there will be people that will
> find an un-justice. There were many discussions and a choice was made.
> It is not good to revisit things continuously, it is good to finish
> things so that there is no point to it any more.
>
> Thanks,
> GerardM
>
> On 10/12/07, Anon Sricharoenchai <anon.hui(a)gmail.com> wrote:
> >
> > According to the conflict resolution process, where the account with
> > most edits is selected as the primary account for that username, this
> > may sound reasonable when the username is owned by the same person
> > on all wikimedia sites.
> >
> > But the problem comes when the same username on those wikimedia
> > sites is owned by different persons and the accounts are actively in
> > use. The active account that registered first (seniority rule) should
> > rather be considered the primary account, since I think the person
> > who registered first should own that username on the unified
> > wikimedia sites.
> >
> > Imagine if the wikimedia sites had been unified ever since they were
> > first established long ago (so that their accounts had never been
> > separated): the person who registered first would own that username
> > on all of the wikimedia sites. A person who came after would be
> > unable to use the registered username, and would have to choose an
> > alternate username. This logic should also apply to the current
> > wikimedia sites, after they have been unified.
> >
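[Editor's note: the "activeness" formula proposed in point 3 is trivial to state in code. The sketch below is illustrative only; how edit timestamps would be obtained is outside its scope.]

```python
from datetime import datetime

# Proposed measure from the mail: activeness is the span between a
# user's first and last edit, not the raw edit count.
def activeness(edit_timestamps):
    """Return the active time span (a timedelta) for a list of edit times."""
    return max(edit_timestamps) - min(edit_timestamps)
```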
8 13
RfC: Wikipedia data displays
by Erik Moeller 05 Mar '08

We're planning to set up 4 data displays in the Wikimedia Foundation office - I'm thinking at least 19" screens, maybe larger. The intent here is not to appear "hip", but to make the office environment more interesting for visitors, such as potential donors. This creates conversation pieces and memorable moments - which is important for cultivating relationships.

I'd like to request your comments on what kinds of displays we could set up. Some initial ideas:

- Real-time recent changes. This should be relatively straightforward using the IRC feeds. Most effort here will go into prettification, I think. What would be a good IRC client to show multiple channels at once?
- Show random articles. Not particularly creative, but should also be fairly easy to do using some scripting. Would be nice to show stuff from projects beyond WP.
- Show articles matching current searches. How difficult would it be to capture search data for this?
- Show the actual search strings. I don't love this one, because Google already does this, but it might be interesting content-wise.
- Show traffic data. What would be interesting displays here? Can we show bandwidth usage in real-time?
- Show images as they are being uploaded. Do we have anything like that already? If not, how hard would it be to implement?
- Data displays of developmental indicators - e.g. Gapminder data on Internet access, literacy, etc. Is there anything like this that we could do with relatively little effort? Any volunteers to put something together?
- Geomapping of access - some visualization of the primary clusters where traffic is coming from, based on sampling. I imagine this could be quite tricky - but might be a cool long-term project for a volunteer?
- Visualization of edit patterns, similar to: http://abeautifulwww.com/2007/05/20/visualizing-the-power-struggle-in-wikip…

Other ideas / comments?

--
Erik Möller
Deputy Director, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
13 19
Tag intersection, crazy idea
by Magnus Manske 03 Mar '08

I just had the following thought: For a tag intersection system,
* we limit queries to two intersections (show all pages with categories A and B)
* we assume on average 5 categories per page (can someone check that?)

then we have 5*4=20 intersections per page.

Now, for each intersection, we calculate MD5("A|B") to get an integer hash, and store that in a new table (page_id INTEGER, intersection_hash INTEGER). That table would be 4 times as long as the categorylinks table.

* Memory usage: Acceptable (?)
* Update: Fast, on page edit only
* Works for non-existing categories

On a query, we look for the (indexed) hash in a subquery, then check those against the actual categorylinks. Looking up an integer in the subquery should be fast enough ;-)

Given the number of categories and INTEGER >4bn, that would make the hash unique for all combinations of 65K categories (if the hash were truly randomly distributed, which it isn't), which should mean that the number of false positives (to be filtered by the main query) should be rather low.

If that's fast enough, we could even expand to three intersections (A, B, and C), querying "A|B", "A|C", and "B|C", and let PHP find the ones common to all three sets.

Summary: Fixing slow MySQL lookup by throwing memory at it...

Feasible? Or pipe dream?

Magnus
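[Editor's note: MediaWiki is PHP and the proposal targets MySQL; the Python sketch below only models the scheme in memory to show the moving parts. Function names and the dict-as-table are illustrative. The hash of an ordered pair "A|B" is the first 4 bytes of MD5, matching the INTEGER column in the proposal, and the final subset check plays the role of the "main query" filtering false positives.]

```python
import hashlib
from itertools import permutations

def intersection_hash(a, b):
    """32-bit integer hash of the ordered pair 'A|B', per the proposal."""
    digest = hashlib.md5(f"{a}|{b}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def index_page(page_id, categories, table):
    # One row per ordered category pair: 5 categories -> 5*4 = 20 rows,
    # as counted in the mail.
    for a, b in permutations(categories, 2):
        table.setdefault(intersection_hash(a, b), set()).add(page_id)

def query(a, b, table, categorylinks):
    # Candidates by hash, then filter false positives against the real
    # category membership (the "check against actual categorylinks" step).
    candidates = table.get(intersection_hash(a, b), set())
    return {p for p in candidates
            if {a, b} <= categorylinks.get(p, set())}
```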
8 27
0 0