Wikimedia-l, May 2014

wikimedia-l@lists.wikimedia.org
  • 181 participants
  • 144 discussions
This paper (first reference) is the result of a class project I was part of almost two years ago for CSCI 5417 Information Retrieval Systems. It builds on a class project I did in CSCI 5832 Natural Language Processing, which I presented at Wikimania '07. The project was very late, as we didn't send the final paper in until the day before New Year's. As far as I recall, this technical report was never really announced, so I thought it would be interesting to look briefly at the results.

The goal of this paper was to break articles down into surface features and latent features and then use those to study the rating system being used, predict article quality and rank results in a search engine. We used the [[random forests]] classifier, which allowed us to analyze the contribution of each feature to performance by looking directly at the weights that were assigned. While the surface analysis was performed on the whole English Wikipedia, the latent analysis was performed on the Simple English Wikipedia (it is more expensive to compute).

= Surface features =

* Readability measures are the single best predictor of quality that I have found, as defined by the Wikipedia Editorial Team (WET). The [[Automated Readability Index]], [[Gunning Fog Index]] and [[Flesch-Kincaid Grade Level]] were the strongest predictors, followed by length of article html, number of paragraphs, [[Flesch Reading Ease]], [[Smog Grading]], number of internal links, [[Laesbarhedsindex Readability Formula]], number of words and number of references. Weakly predictive were number of to be's, number of sentences, [[Coleman-Liau Index]], number of templates, PageRank, number of external links, number of relative links.
Not predictive (overall; see the end of section 2 for the per-rating score breakdown): number of h2 or h3's, number of conjunctions, number of images*, average word length, number of h4's, number of prepositions, number of pronouns, number of interlanguage links, average syllables per word, number of nominalizations, article age (based on page id), proportion of questions, average sentence length.

:* Number of images was actually by far the single strongest predictor of any class, but only for Featured articles. Because it was so good at picking out Featured articles and somewhat good at picking out A and G articles, the classifier was confused in so many cases that the overall contribution of this feature to classification performance is zero.
:* Number of external links is strongly predictive of Featured articles.
:* The B class is highly distinctive. It has a strong "signature," with high predictive value assigned to many features. The Featured class is also very distinctive. F, B and S (Stop/Stub) contain the most information.
:* A is the least distinct class, not being very different from F or G.

= Latent features =

The algorithm used for latent analysis, which is an analysis of the occurrence of words in every document with respect to the link structure of the encyclopedia ("concepts"), is [[Latent Dirichlet Allocation]]. This part of the analysis was done by CS PhD student Praful Mangalath. An example of what can be done with the result of this analysis is that you provide a word (a search query) such as "hippie". You can then look at the weight of every article for the word hippie. You can pick the article with the largest weight, and then look at its link network. You can pick out the articles that this article links to and/or which link to this article that are also weighted strongly for the word hippie, while also contributing maximally to this article's "hippieness".
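
The lookup just described can be sketched roughly as follows. This is a toy illustration and not the code from the paper: the `weights` and `links` dictionaries, the `threshold` value and the function name `related_articles` are all hypothetical, and in the real system the per-article weights would come from LDA rather than from hand-entered numbers. The "contributing maximally" criterion is left out because the message does not define it.

```python
# Toy sketch of the query procedure: take the article with the largest
# weight for a query word, then keep the articles in its link network
# that are also strongly weighted. All data below is made up.

def related_articles(weights, links, threshold=0.1):
    """Return (top_article, strongly weighted link neighbours)."""
    # Article with the largest weight for the query word.
    top = max(weights, key=weights.get)

    # Link network: articles the top article links to, plus articles linking to it.
    outgoing = set(links.get(top, ()))
    incoming = {src for src, targets in links.items() if top in targets}
    neighbours = (outgoing | incoming) - {top}

    # Keep neighbours at or above the (hypothetical) threshold, strongest first.
    strong = [a for a in neighbours if weights.get(a, 0.0) >= threshold]
    strong.sort(key=lambda a: (-weights.get(a, 0.0), a))
    return top, strong


# Hypothetical per-article weights for the word "hippie".
weights = {
    "Hippie": 0.9,
    "Woodstock festival": 0.6,
    "Acid rock": 0.3,
    "South Park": 0.02,
}
# Hypothetical link structure: article -> articles it links to.
links = {
    "Hippie": ["Woodstock festival", "South Park"],
    "Acid rock": ["Hippie"],
    "Woodstock festival": [],
    "South Park": [],
}

top, strong = related_articles(weights, links)
print(top, strong)
```

With the made-up numbers above, the weakly weighted neighbour is filtered out even though the top article links to it, which is the behaviour the description asks for.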

We tried this query in our system (LDA), Google (site:en.wikipedia.org hippie), and the Simple English Wikipedia's Lucene search engine. The breakdown of articles occurring in the top ten search results for this word for those engines is:

* LDA only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]], [[Carl Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]], [[Plastic Ono Band]], [[Rock and Roll]], [[Salvador Allende]], [[Smothers brothers]], [[Stanley Kubrick]].
* Google only: [[Glam Rock]], [[South Park]].
* Simple only: [[African Americans]], [[Charles Manson]], [[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear weapons]], [[Phish]], [[Sexual liberation]], [[Summer of Love]]
* LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a democratic society]], [[Woodstock festival]]
* LDA & Google: [[Psychedelic Pop]]
* Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]]

(See the paper for the articles produced for the keywords philosophy and economics.)

= Discussion / Conclusion =

* The results of the latent analysis are totally up to your perception. But what is interesting is that the LDA features predict the WET ratings of quality just as well as the surface level features. Both feature sets (surface and latent) pull out almost all of the information that the rating system bears.
* The rating system devised by the WET is not distinctive. You can best tell the difference between, grouped together, Featured, A and Good articles vs B articles. Featured, A and Good articles are also quite distinctive (Figure 1). Note that in this study we didn't look at Starts and Stubs, but in an earlier paper we did.
:* This is interesting when compared to this recent entry on the YouTube blog, "Five Stars Dominate Ratings": http://youtube-global.blogspot.com/2009/09/five-stars-dominate-ratings.html… I think a sane, well researched (with actual subjects) rating system is well within the purview of the Usability Initiative.
Helping people find and create good content is what Wikipedia is all about. Having a solid rating system allows you to reorganize the user interface, the Wikipedia namespace, and the main namespace around good content and bad content as needed. If you don't have a solid, information-bearing rating system you don't know what good content really is (really bad content is easy to spot).
:* My Wikimania talk was all about gathering data from people about articles and using that to train machines to automatically pick out good content. You ask people questions along dimensions that make sense to people, and give the machine access to other surface features (such as a statistical measure of readability, or length) and latent features (such as can be derived from document word occurrence and encyclopedia link structure). I referenced page 262 of Zen and the Art of Motorcycle Maintenance to give an example of the kind of qualitative features I would ask people about. It really depends on what features end up bearing information, to be tested in "the lab". Each word is an example dimension of quality: We have "*unity, vividness, authority, economy, sensitivity, clarity, emphasis, flow, suspense, brilliance, precision, proportion, depth and so on.*" You then use surface and latent features to predict these values for all articles. You can also say, when a person rates this article as high on the x scale, they also mean that it has this much of these surface and these latent features.

= References =

- DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in Wikipedia through quality and concept discovery*. Technical Report. PDF <http://grey.colorado.edu/mediawiki/sites/mingus/images/6/68/DeHoustMangalat…>
- Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the feasibility of automatically rating online article quality*. Technical Report. PDF <http://grey.colorado.edu/mediawiki/sites/mingus/images/d/d3/RassbachPincock…>
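
For readers who want to try one of the surface features named in the message above, here is a minimal sketch of the [[Automated Readability Index]]. It uses the standard published formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The message does not say how the authors tokenised articles, so the regular expressions below are an assumption made for illustration, and the function name is hypothetical.

```python
import re


def automated_readability_index(text):
    """Approximate ARI for plain English text.

    Tokenisation here is a simplifying assumption: words are runs of
    letters/digits (optionally joined by apostrophes) and sentences are
    split on '.', '!' and '?'.
    """
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    # ARI counts letters and digits only, not punctuation.
    characters = sum(1 for w in words for ch in w if ch.isalnum())

    return (
        4.71 * characters / len(words)
        + 0.5 * len(words) / len(sentences)
        - 21.43
    )


simple = "The cat sat on the mat. It was happy."
dense = (
    "Notwithstanding considerable methodological heterogeneity, the "
    "investigators demonstrated statistically significant improvements."
)

print(round(automated_readability_index(simple), 2))
print(round(automated_readability_index(dense), 2))
```

Short words in short sentences give a low score, while long words in a long sentence push it up, which is why such a measure can separate polished articles from stubs without looking at content.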
Hoi,

I have asked and received permission to forward to you all this most excellent bit of news.

The LINGUIST List is a most excellent resource for people interested in the field of linguistics. As I mentioned some time ago, they have had a funding drive, and in that funding drive they asked for a certain amount of money in a given number of days; they would then have a project on Wikipedia to learn what needs doing to get better coverage for the field of linguistics. What you will read in this mail is that the whole community of linguists is asked to cooperate. I am really thrilled, as it will also get us more linguists interested in what we do. My hope is that a fraction will be interested in the languages that they care for and help them become more relevant. As a member of the "language prevention committee", I love to get more knowledgeable people involved in our smaller projects. If it means that we get more requests for more projects, we will really feel embarrassed with all the new projects we will have to approve because of the quality of the Incubator content and the quality of the linguistic arguments why we should approve yet another language :)

NB Is this not a really clever way of raising money: give us this much in this time frame and we will then do this as a bonus...

Thanks,
GerardM

---------- Forwarded message ----------
From: LINGUIST Network <linguist(a)linguistlist.org>
Date: Jun 18, 2007 6:53 PM
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
To: LINGUIST(a)listserv.linguistlist.org

LINGUIST List: Vol-18-1831. Mon Jun 18 2007.
ISSN: 1068 - 4875.

Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers

Moderators: Anthony Aristar, Eastern Michigan U <aristar(a)linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry(a)linguistlist.org>

Reviews: Laura Welcher, Rosetta Project <reviews(a)linguistlist.org>

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, and donations from subscribers and publishers.

Editor for this issue: Ann Sawyer <sawyer(a)linguistlist.org>
================================================================

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html

===========================Directory==============================

1)
Date: 18-Jun-2007
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers

-------------------------Message 1 ----------------------------------
Date: Mon, 18 Jun 2007 12:49:35
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers

Dear subscribers,

As you may recall, one of our Fund Drive 2007 campaigns was called the "Wikipedia Update Vote." We asked our viewers to consider earmarking their donations to organize an update project on linguistics entries in the English-language Wikipedia. You can find more background information on this at:
http://linguistlist.org/donation/fund-drive2007/wikipedia/index.cfm.

The speed with which we met our goal, thanks to the interest and generosity of our readers, was a sure sign that the linguistics community was enthusiastic about the idea. Now that summer is upon us, and some of you may have a bit more leisure time, we are hoping that you will be able to help us get started on the Wikipedia project. The LINGUIST List's role in this project is a purely organizational one.
We will:

*Help, with your input, to identify major gaps in the Wikipedia materials or pages that need improvement;
*Compile a list of linguistics pages that Wikipedia editors have identified as "in need of attention from an expert on the subject" or "does not cite any references or sources," etc;
*Send out periodical calls for volunteer contributors on specific topics or articles;
*Provide simple instructions on how to upload your entries into Wikipedia;
*Keep track of our project Wikipedians;
*Keep track of revisions and new entries;
*Work with Wikimedia Foundation to publicize the linguistics community's efforts.

We hope you are as enthusiastic about this effort as we are. Just to help us all get started looking at Wikipedia more critically, and to easily identify an area needing improvement, we suggest that you take a look at the List of Linguists page at:
http://en.wikipedia.org/wiki/List_of_linguists. Many people are not listed there; others need to have more facts and information added. If you would like to participate in this exciting update effort, please respond by sending an email to LINGUIST Editor Hannah Morales at hannah(a)linguistlist.org, suggesting what your role might be or which linguistics entries you feel should be updated or added. Some linguists who saw our campaign on the Internet have already written us with specific suggestions, which we will share with you soon.

This update project will take major time and effort on all our parts. The end result will be a much richer internet resource of information on the breadth and depth of the field of linguistics. Our efforts should also stimulate prospective students to consider studying linguistics and educate a wider public on what we do. Please consider participating.

Sincerely,
Hannah Morales
Editor, Wikipedia Update Project

Linguistic Field(s): Not Applicable

-----------------------------------------------------------
LINGUIST List: Vol-18-1831
A personal note.
by Lila Tretikov 21 Nov '15

Hi all,

This is a personal note to clarify some questions that recently came up, specifically in the context of my role as the incoming ED.

Wil and I are partners in our private lives. We have always both been extremely independent, and we respect that in each other. That said, we have different roles: I am the Executive Director with responsibilities towards the Foundation and the movement, and he is an independent community member with his own voice.

I make my decisions using my own professional judgement in conjunction with input from the community and staff. I don’t consult Wil on these matters, ask him to do anything on my behalf or monitor his engagements with the community. When I speak here, it is in my capacity as ED.

Wil, on the other hand, has a very strong personal interest in the community and a great deal of curiosity about how the Wikimedia projects work. It is very important to him that he remains an independent individual able to speak with his own voice and ask his own questions. He does not take direction from me. He will not work for the WMF or engage with the WMF employees.

I hope this addresses some of the questions and draws a distinction between my role as ED and Wil’s participation as an independent member. If you have any questions for Wil you can reach him directly. If you have any questions for me or the WMF, you can get hold of me by email or on my talk page.

Thanks,
Lila
Re: http://twkozlowski.net/the-pot-and-the-kettle-the-wikimedia-way/

Two questions:

1. Where can I find a response from either the WMF board or WMF funding/finance to the criticisms of a lack of transparency or the apparent failure of the project to deliver value for the donor's money as raised in this blog post?

2. Where can I read an officially recognized report for the outcomes of this project in terms of value for Wikimedia projects? Obviously we do not want to rely on second-hand analysis when reports to the WMF are a requirement for such projects.

Thanks,
Fae
--
faewik(a)gmail.com https://commons.wikimedia.org/wiki/User:Fae

10 Mar '15
We know NSA wants Wikipedia data, as Wikipedia is listed in one of the NSA slides:

https://commons.wikimedia.org/wiki/File:KS8-001.jpg

That slide is about HTTP, and the tech staff are moving the user/reader base to HTTPS.

As we learn more about the NSA programs, we need to consider vectors other than HTTP for the NSA to obtain the data they want. And the userbase needs to be aware of the current risks.

One question from the "Dells are backdored"[sic] thread that is worth separate consideration is:

Are the Wikimedia transit links encrypted, especially for database replication?

MySQL has replication over SSL, so I assume the answer is Yes. If not, is this necessary or useful, and feasible?

However we also need to consider that SSL and other encryption may be useless against NSA/etc, which means replicating non-public data should be avoided wherever possible, as it becomes a single point of failure.

Given how public our system is, we don't have a lot of non-public data, so we might be able to design the architecture so that information isn't replicated, and also ensure it isn't accessed over insecure links. I think the only parts of the dataset that are private & valuable are

* passwords/login cookies,
* checkuser info - IPs and useragents,
* WMF analytics, which includes readers iirc, and
* hidden/deleted edits
* private wikis and mailing lists

Have I missed any?

Are passwords and/or checkuser info replicated?

Is there a data policy on WMF analytics data which prevents it flowing over insecure links, and limits what is collected and ensures destruction of the data within reasonable timeframes? i.e. how about not using cookies to track analytics of readers who are on HTTP instead of HTTPS?

The private wikis can be restricted to https, depending on the value of the data on those wikis in the wrong hands.
The private mailing lists will be harder to secure, and at least the English Wikipedia arbcom list contains a lot of valuable data about contributors.

Regarding hidden/deleted edits, the replication isn't the only source of this data. All edits are also exposed via Recent Changes (https/api/etc) as they occur, and the value of these edits is determined by the fact they are hidden afterwards (e.g. don't appear in dumps). Is there any way to control who is effectively capturing all edits via Recent Changes?

--
John Vandenberg

16 Feb '15
Hi folks,

to increase accountability and create more opportunities for course corrections and resourcing adjustments as necessary, Sue's asked me and Howie Fung to set up a quarterly project evaluation process, starting with our highest priority initiatives. These are, according to Sue's narrowing focus recommendations which were approved by the Board [1]:

- Visual Editor
- Mobile (mobile contributions + Wikipedia Zero)
- Editor Engagement (also known as the E2 and E3 teams)
- Funds Dissemination Committee and expanded grant-making capacity

I'm proposing the following initial schedule:

January:
- Editor Engagement Experiments

February:
- Visual Editor
- Mobile (Contribs + Zero)

March:
- Editor Engagement Features (Echo, Flow projects)
- Funds Dissemination Committee

We’ll try doing this on the same day or adjacent to the monthly metrics meetings [2], since the team(s) will give a presentation on their recent progress, which will help set some context that would otherwise need to be covered in the quarterly review itself. This will also create open opportunities for feedback and questions.

My goal is to do this in a manner where even though the quarterly review meetings themselves are internal, the outcomes are captured as meeting minutes and shared publicly, which is why I'm starting this discussion on a public list as well.
I've created a wiki page here which we can use to discuss the concept further:

https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings/Quarterly_r…

The internal review will, at minimum, include:

Sue Gardner
myself
Howie Fung
Team members and relevant director(s)
Designated minute-taker

So for example, for Visual Editor, the review team would be the Visual Editor / Parsoid teams, Sue, me, Howie, Terry, and a minute-taker.

I imagine the structure of the review roughly as follows, with a duration of about 2 1/2 hours divided into 25-30 minute blocks:

- Brief team intro and recap of team's activities through the quarter, compared with goals
- Drill into goals and targets: Did we achieve what we said we would?
- Review of challenges, blockers and successes
- Discussion of proposed changes (e.g. resourcing, targets) and other action items
- Buffer time, debriefing

Once again, the primary purpose of these reviews is to create improved structures for internal accountability, escalation points in cases where serious changes are necessary, and transparency to the world.

In addition to these priority initiatives, my recommendation would be to conduct quarterly reviews for any activity that requires more than a set amount of resources (people/dollars). These additional reviews may however be conducted in a more lightweight manner and internally to the departments. We’re slowly getting into that habit in engineering.

As we pilot this process, the format of the high priority reviews can help inform and support reviews across the organization.

Feedback and questions are appreciated.

All best,
Erik

[1] https://wikimediafoundation.org/wiki/Vote:Narrowing_Focus
[2] https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings

--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate

23 Sep '14
Hi folks,

I'd be interested in hearing broader community opinions about the extent to which WMF should sponsor non-profits purely to support work that Wikimedia benefits from, even if it's not directed towards a specific goal established in a grant agreement.

This comes up from time to time. One of the few historic precedents I'm aware of is the $5,000 donation that WMF made to FreeNode in 2006 [1]. But there are of course many other organizations/communities that the Wikimedia movement is indebted to.

On the software side, we have Ubuntu Linux (itself highly indebted to Debian) / Apache / MariaDB / PHP / Varnish / ElasticSearch / memcached / Puppet / OpenStack / various libraries and many other dependencies [2], infrastructure tools like ganglia, observium, icinga, etc. Some of these projects have nonprofits that accept and seek sponsorship and support, some don't.

One could easily expand well beyond the software we depend on server-side to client-side open source applications used by our community to create content: stuff like Inkscape, GIMP and LibreOffice (used for diagrams). And there are other communities we depend on, like OpenStreetMap.

So, should we steer clear of this type of sponsorship altogether because it's a slippery slope, or should we try to come up with evaluation criteria to consider it on a case-by-case basis (e.g. is there a trustworthy non-profit that has a track record of accomplishment and is in actual need of financial support)?

I could imagine a process with a fixed "giving back" annual budget and a community nominations/review workflow. It'd be work to create and I don't want to commit to that yet, but I would be interested to hear opinions.

MariaDB specifically invited WMF to become a sponsor, and we're clearly highly dependent on them. But I don't think it makes sense for us to just write checks if there's someone who asks for support and there's a justifiable need.
However, if there's broad agreement that this is something Wikimedia should do more of, then I think it's worth developing more consistent sponsorship criteria.

Thanks,
Erik

[1] https://wikimediafoundation.org/wiki/Resolution:Freenode_Donation
[2] Cf. https://www.mediawiki.org/wiki/Upstream_projects

--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Wikimedia Ukraine's anniversary
by Richard Ames 10 Jul '14

----- Original message -----
Subject: Wikimedia Ukraine's anniversary
From: Levon Azizian <levonazizian(a)bigmir.net>
To: wikimedia-l(a)lists.wikimedia.org
Cc: Board of Wikimedia Ukraine <board(a)wikimedia.in.ua>
Sent: 31.05.2014 18:40

Today, our organization celebrates its anniversary: 5 years from the date of creation.

Exactly 5 years ago, on May 31, 2009, the constituent meeting was held in Kyiv, which approved the bylaws and elected the first Board of the new organization, known as Wikimedia Ukraine.

Our community has gone through a long and difficult path. The birthday of Wikimedia Ukraine is the third remarkable date for our community this year. January 30 was the 10th anniversary of the establishment of Ukrainian Wikipedia, and on May 12 Ukrainian Wikipedia crossed the threshold of 500 000 articles.

We want to thank Wikimedia Foundation Inc. for their help, our neighboring communities for fruitful cooperation with us, and of course our community for their contributions!

Regards,
Levon Azizian
Deputy chair
Wikimedia Ukraine

--
The greatest collection of shared knowledge in history.
Help Wikipedia, participate now: http://wikimedia.org/

09 Jul '14
(CCing wikimedia-l as well, please send any replies to wikitech-l only)

The Wikimedia technical community wants to have another hackathon next year in Europe. Who will organize it?

Interested parties, check https://www.mediawiki.org/wiki/Hackathons

We would like to confirm a host by Wikimania, latest.

The same call goes for India and other locations with a good concentration of Wikimedia contributors and software developers. Come on, step in. We want to increase our geographical diversity of technical contributors.

--
Quim Gil
Engineering Community Manager @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil

24 Jun '14
I emailed mobile-l and wikitech-l about this, now I'm moving this discussion to wikimedia-l. Here's the longer technical thread:

http://lists.wikimedia.org/pipermail/mobile-l/2014-April/006884.html

In summary, to show Wikipedia Zero banners for the correct mobile networks, we are planning once for each cellular-based app session to log two pieces of data in a specialized logfile, deleting log entries older than 90 days.

1. MCC-MNC <http://en.wikipedia.org/wiki/Mobile_country_code> code (format is ###-##), which denotes the mobile operator
2. Exit (gateway/proxy) IP address

* These data points would not be logged alongside the normal web access logs.

This information could be used to estimate rough demand for Wikipedia in potential Wikipedia Zero geos, although remediating the out-of-sync IP addresses on file for existing partners is primary.

Internal review suggests this is in alignment with privacy policy, and we wanted to see if there were other thoughts on this approach here on wikimedia-l.

-Adam
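
To make the proposal above concrete, here is a minimal sketch of a log that holds only the two described fields and drops entries older than 90 days. It is an illustration under stated assumptions, not the Wikimedia implementation: the in-memory list, the field names, the function names and the sample values (a made-up `123-45` code and documentation-range IP addresses) are all hypothetical. The `###-##` check follows the format given in the message.

```python
import re
from datetime import datetime, timedelta, timezone

# Retention window stated in the message.
RETENTION = timedelta(days=90)

# "###-##" as given in the message; other code lengths are not handled here.
MCC_MNC_PATTERN = re.compile(r"\d{3}-\d{2}")


def make_entry(mcc_mnc, exit_ip, now):
    """Build one log entry holding only the two proposed data points."""
    if not MCC_MNC_PATTERN.fullmatch(mcc_mnc):
        raise ValueError(f"unexpected MCC-MNC format: {mcc_mnc!r}")
    return {"logged_at": now, "mcc_mnc": mcc_mnc, "exit_ip": exit_ip}


def prune(entries, now):
    """Return only the entries that are at most 90 days old."""
    cutoff = now - RETENTION
    return [entry for entry in entries if entry["logged_at"] >= cutoff]


now = datetime(2014, 6, 24, tzinfo=timezone.utc)
entries = [
    # Hypothetical entry from 100 days ago: past the retention window.
    make_entry("999-99", "192.0.2.1", now - timedelta(days=100)),
    # Hypothetical entry from 10 days ago: still retained.
    make_entry("123-45", "198.51.100.7", now - timedelta(days=10)),
]

kept = prune(entries, now)
print(len(kept), kept[0]["mcc_mnc"])
```

Keeping the entry schema this small is the point of the design: nothing beyond the operator code, the exit IP and a timestamp is stored, so the file cannot be joined to ordinary access logs by accident.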