
Wikimedia-l, January 2014

wikimedia-l@lists.wikimedia.org
  • 173 participants
  • 107 discussions
This paper (first reference) is the result of a class project I was part of almost two years ago for CSCI 5417 Information Retrieval Systems. It builds on a class project I did in CSCI 5832 Natural Language Processing, which I presented at Wikimania '07. The project was very late, as we didn't send the final paper in until the day before New Year's. As far as I recall, this technical report was never really announced, so I thought it would be interesting to look briefly at the results.

The goal of this paper was to break articles down into surface features and latent features and then use those to study the rating system being used, predict article quality and rank results in a search engine. We used the [[random forests]] classifier, which allowed us to analyze the contribution of each feature to performance by looking directly at the weights that were assigned. While the surface analysis was performed on the whole English Wikipedia, the latent analysis was performed on the Simple English Wikipedia (it is more expensive to compute).

= Surface features =

* Readability measures are the single best predictor of quality that I have found, as defined by the Wikipedia Editorial Team (WET). The [[Automated Readability Index]], [[Gunning Fog Index]] and [[Flesch-Kincaid Grade Level]] were the strongest predictors, followed by length of article html, number of paragraphs, [[Flesch Reading Ease]], [[Smog Grading]], number of internal links, [[Laesbarhedsindex Readability Formula]], number of words and number of references. Weakly predictive were number of to be's, number of sentences, [[Coleman-Liau Index]], number of templates, PageRank, number of external links, number of relative links.
Not predictive (overall - see the end of section 2 for the per-rating score breakdown): number of h2 or h3's, number of conjunctions, number of images*, average word length, number of h4's, number of prepositions, number of pronouns, number of interlanguage links, average syllables per word, number of nominalizations, article age (based on page id), proportion of questions, average sentence length.
:* Number of images was actually by far the single strongest predictor of any class, but only for Featured articles. Because it was so good at picking out Featured articles and somewhat good at picking out A and G articles, the classifier was confused in so many cases that the overall contribution of this feature to classification performance is zero.
:* Number of external links is strongly predictive of Featured articles.
:* The B class is highly distinctive. It has a strong "signature," with high predictive value assigned to many features. The Featured class is also very distinctive. F, B and S (Stop/Stub) contain the most information.
:* A is the least distinct class, not being very different from F or G.

= Latent features =

The algorithm used for latent analysis, which is an analysis of the occurrence of words in every document with respect to the link structure of the encyclopedia ("concepts"), is [[Latent Dirichlet Allocation]]. This part of the analysis was done by CS PhD student Praful Mangalath. An example of what can be done with the result of this analysis is that you provide a word (a search query) such as "hippie". You can then look at the weight of every article for the word hippie. You can pick the article with the largest weight, and then look at its link network. You can pick out the articles that this article links to and/or which link to this article that are also weighted strongly for the word hippie, while also contributing maximally to this article's "hippieness".
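
The query procedure just described can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the weights and the link graph below are invented toy data, the function names are hypothetical, and the step of "contributing maximally" to the top article's weight is not modelled at all. Linked neighbours are simply ranked by their own weight for the query word. It is not the code used for the report.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical per-article word weights (article -> word -> weight).
# In the report these would come from the fitted LDA model; the numbers
# here are invented purely for illustration.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "Hippie": {"hippie": 0.90},
    "Woodstock festival": {"hippie": 0.60},
    "Human Be-in": {"hippie": 0.50},
    "South Park": {"hippie": 0.05},
    "Nuclear weapons": {"hippie": 0.40},
}

# Hypothetical link structure (article -> set of articles it links to).
LINKS: Dict[str, Set[str]] = {
    "Hippie": {"Woodstock festival", "South Park"},
    "Human Be-in": {"Hippie"},
    "Nuclear weapons": set(),
}


def rank_for_word(word: str, weights: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (article, weight) pairs sorted by descending weight for `word`."""
    scored = [(article, w.get(word, 0.0)) for article, w in weights.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def linked_neighbours(article: str, links: Dict[str, Set[str]]) -> Set[str]:
    """Articles that `article` links to, plus articles that link to it."""
    outgoing = set(links.get(article, ()))
    incoming = {src for src, targets in links.items() if article in targets}
    return (outgoing | incoming) - {article}


def related_articles(
    word: str,
    weights: Dict[str, Dict[str, float]],
    links: Dict[str, Set[str]],
    min_weight: float = 0.1,
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Pick the top article for `word`, then keep its strongly weighted neighbours."""
    ranking = rank_for_word(word, weights)
    if not ranking:
        return None, []
    best = ranking[0][0]
    scored = [(a, weights.get(a, {}).get(word, 0.0)) for a in linked_neighbours(best, links)]
    related = [(a, w) for a, w in scored if w >= min_weight]
    related.sort(key=lambda pair: (-pair[1], pair[0]))
    return best, related


if __name__ == "__main__":
    best, related = related_articles("hippie", WEIGHTS, LINKS)
    print(best)
    for title, weight in related:
        print(f"  {title}: {weight:.2f}")
```

In the toy data, "Nuclear weapons" has a fairly high weight for the query word but no link to or from the top article, so it is left out. That is the effect of filtering by link structure rather than by weight alone.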
We tried this query in our system (LDA), Google (site:en.wikipedia.org hippie), and the Simple English Wikipedia's Lucene search engine. The breakdown of articles occurring in the top ten search results for this word for those engines is:

* LDA only: [[Acid rock]], [[Aldeburgh Festival]], [[Anne Murray]], [[Carl Radle]], [[Harry Nilsson]], [[Jack Kerouac]], [[Phil Spector]], [[Plastic Ono Band]], [[Rock and Roll]], [[Salvador Allende]], [[Smothers brothers]], [[Stanley Kubrick]].
* Google only: [[Glam Rock]], [[South Park]].
* Simple only: [[African Americans]], [[Charles Manson]], [[Counterculture]], [[Drug use]], [[Flower Power]], [[Nuclear weapons]], [[Phish]], [[Sexual liberation]], [[Summer of Love]]
* LDA & Google & Simple: [[Hippie]], [[Human Be-in]], [[Students for a democratic society]], [[Woodstock festival]]
* LDA & Google: [[Psychedelic Pop]]
* Google & Simple: [[Lysergic acid diethylamide]], [[Summer of Love]]

(See the paper for the articles produced for the keywords philosophy and economics.)

= Discussion / Conclusion =

* The results of the latent analysis are totally up to your perception. But what is interesting is that the LDA features predict the WET ratings of quality just as well as the surface level features. Both feature sets (surface and latent) pull out almost all of the information that the rating system bears.
* The rating system devised by the WET is not distinctive. You can best tell the difference between, grouped together, Featured, A and Good articles vs B articles. Featured, A and Good articles are also quite distinctive (Figure 1). Note that in this study we didn't look at Starts and Stubs, but in an earlier paper we did.
:* This is interesting when compared to this recent entry on the YouTube blog, "Five Stars Dominate Ratings": http://youtube-global.blogspot.com/2009/09/five-stars-dominate-ratings.html… I think a sane, well researched (with actual subjects) rating system is well within the purview of the Usability Initiative.
Helping people find and create good content is what Wikipedia is all about. Having a solid rating system allows you to reorganize the user interface, the Wikipedia namespace, and the main namespace around good content and bad content as needed. If you don't have a solid, information-bearing rating system you don't know what good content really is (really bad content is easy to spot).
:* My Wikimania talk was all about gathering data from people about articles and using that to train machines to automatically pick out good content. You ask people questions along dimensions that make sense to people, and give the machine access to other surface features (such as a statistical measure of readability, or length) and latent features (such as can be derived from document word occurrence and encyclopedia link structure). I referenced page 262 of Zen and the Art of Motorcycle Maintenance to give an example of the kind of qualitative features I would ask people about. It really depends on what features end up bearing information, to be tested in "the lab". Each word is an example dimension of quality: We have "*unity, vividness, authority, economy, sensitivity, clarity, emphasis, flow, suspense, brilliance, precision, proportion, depth and so on.*" You then use surface and latent features to predict these values for all articles. You can also say, when a person rates this article as high on the x scale, they also mean that it has this much of these surface and these latent features.

= References =

- DeHoust, C., Mangalath, P., Mingus, B. (2008). *Improving search in Wikipedia through quality and concept discovery*. Technical Report. PDF <http://grey.colorado.edu/mediawiki/sites/mingus/images/6/68/DeHoustMangalat…>
- Rassbach, L., Mingus, B., Blackford, T. (2007). *Exploring the feasibility of automatically rating online article quality*. Technical Report. PDF <http://grey.colorado.edu/mediawiki/sites/mingus/images/d/d3/RassbachPincock…>
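
As an illustration of the readability measures listed under Surface features, the [[Automated Readability Index]] is computed from character, word and sentence counts alone. The sketch below assumes the standard published ARI formula and a deliberately naive tokenizer; the function name and the regular expressions are illustrative choices and not the feature extraction code used for the report.

```python
import re


def automated_readability_index(text: str) -> float:
    """Approximate ARI: 4.71 * chars/words + 0.5 * words/sentences - 21.43.

    Words are runs of letters/digits (optionally joined by apostrophes) and
    sentences are split on '.', '!' and '?'. Real text needs a more careful
    tokenizer (abbreviations, decimals, markup), so treat this as a sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Count only letters and digits as characters, ignoring apostrophes.
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    # 18 characters, 6 words, 2 sentences
    print(f"{automated_readability_index(sample):.2f}")  # prints -5.80
```

A score like this would be one column in the feature table handed to the classifier, next to counts such as number of paragraphs or internal links.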

Hoi,

I have asked and received permission to forward to you all this most excellent bit of news.

The linguist list is a most excellent resource for people interested in the field of linguistics. As I mentioned some time ago, they have had a funding drive, and in that funding drive they asked for a certain amount of money in a given amount of days, and they would then have a project on Wikipedia to learn what needs doing to get better coverage for the field of linguistics.

What you will read in this mail is that the total community of linguists is asked to cooperate. I am really thrilled as it will also get us more linguists interested in what we do. My hope is that a fraction will be interested in the languages that they care for and help it become more relevant. As a member of the "language prevention committee", I love to get more knowledgeable people involved in our smaller projects. If it means that we get more requests for more projects we will really feel embarrassed with all the new projects we will have to approve because of the quality of the Incubator content and the quality of the linguistic arguments why we should approve yet another language :)

NB Is this not a really clever way of raising money; give us this much in this time frame and we will then do this as a bonus...

Thanks,
GerardM

---------- Forwarded message ----------
From: LINGUIST Network <linguist(a)linguistlist.org>
Date: Jun 18, 2007 6:53 PM
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers
To: LINGUIST(a)listserv.linguistlist.org

LINGUIST List: Vol-18-1831. Mon Jun 18 2007.
ISSN: 1068 - 4875.
Subject: 18.1831, All: Call for Participation: Wikipedia Volunteers

Moderators: Anthony Aristar, Eastern Michigan U <aristar(a)linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry(a)linguistlist.org>
Reviews: Laura Welcher, Rosetta Project <reviews(a)linguistlist.org>
Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, and donations from subscribers and publishers.

Editor for this issue: Ann Sawyer <sawyer(a)linguistlist.org>
================================================================
To post to LINGUIST, use our convenient web form at http://linguistlist.org/LL/posttolinguist.html

===========================Directory==============================
1)
Date: 18-Jun-2007
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers

-------------------------Message 1 ----------------------------------
Date: Mon, 18 Jun 2007 12:49:35
From: Hannah Morales < hannah(a)linguistlist.org >
Subject: Wikipedia Volunteers

Dear subscribers,

As you may recall, one of our Fund Drive 2007 campaigns was called the "Wikipedia Update Vote." We asked our viewers to consider earmarking their donations to organize an update project on linguistics entries in the English-language Wikipedia. You can find more background information on this at: http://linguistlist.org/donation/fund-drive2007/wikipedia/index.cfm.

The speed with which we met our goal, thanks to the interest and generosity of our readers, was a sure sign that the linguistics community was enthusiastic about the idea. Now that summer is upon us, and some of you may have a bit more leisure time, we are hoping that you will be able to help us get started on the Wikipedia project. The LINGUIST List's role in this project is a purely organizational one.
We will:

*Help, with your input, to identify major gaps in the Wikipedia materials or pages that need improvement;
*Compile a list of linguistics pages that Wikipedia editors have identified as "in need of attention from an expert on the subject" or "does not cite any references or sources," etc;
*Send out periodical calls for volunteer contributors on specific topics or articles;
*Provide simple instructions on how to upload your entries into Wikipedia;
*Keep track of our project Wikipedians;
*Keep track of revisions and new entries;
*Work with Wikimedia Foundation to publicize the linguistics community's efforts.

We hope you are as enthusiastic about this effort as we are. Just to help us all get started looking at Wikipedia more critically, and to easily identify an area needing improvement, we suggest that you take a look at the List of Linguists page at: http://en.wikipedia.org/wiki/List_of_linguists. Many people are not listed there; others need to have more facts and information added. If you would like to participate in this exciting update effort, please respond by sending an email to LINGUIST Editor Hannah Morales at hannah(a)linguistlist.org, suggesting what your role might be or which linguistics entries you feel should be updated or added. Some linguists who saw our campaign on the Internet have already written us with specific suggestions, which we will share with you soon.

This update project will take major time and effort on all our parts. The end result will be a much richer internet resource of information on the breadth and depth of the field of linguistics. Our efforts should also stimulate prospective students to consider studying linguistics and to educate a wider public on what we do. Please consider participating.

Sincerely,
Hannah Morales
Editor, Wikipedia Update Project

Linguistic Field(s): Not Applicable

-----------------------------------------------------------
LINGUIST List: Vol-18-1831

10 Mar '15
We know NSA wants Wikipedia data, as Wikipedia is listed in one of the NSA slides:

https://commons.wikimedia.org/wiki/File:KS8-001.jpg

That slide is about HTTP, and the tech staff are moving the user/reader base to HTTPS.

As we learn more about the NSA programs, we need to consider vectors other than HTTP for the NSA to obtain the data they want. And the userbase needs to be aware of the current risks.

One question from the "Dells are backdored"[sic] thread that is worth separate consideration is:

Are the Wikimedia transit links encrypted, especially for database replication?

MySQL has replication over SSL, so I assume the answer is Yes. If not, is this necessary or useful, and feasible?

However we also need to consider that SSL and other encryption may be useless against NSA/etc, which means replicating non-public data should be avoided wherever possible, as it becomes a single point of failure.

Given how public our system is, we don't have a lot of non-public data, so we might be able to design the architecture so that information isn't replicated, and also ensure it isn't accessed over insecure links. I think the only parts of the dataset that are private & valuable are

* passwords/login cookies,
* checkuser info - IPs and useragents,
* WMF analytics, which includes readers iirc, and
* hidden/deleted edits
* private wikis and mailing lists

Have I missed any?

Are passwords and/or checkuser info replicated?

Is there a data policy on WMF analytics data which prevents it flowing over insecure links, and limits what is collected and ensures destruction of the data within reasonable timeframes? i.e. how about not using cookies to track analytics of readers who are on HTTP instead of HTTPS?

The private wikis can be restricted to https, depending on the value of the data on those wikis in the wrong hands.
The private mailing lists will be harder to secure, and at least the English Wikipedia arbcom list contains a lot of valuable data about contributors.

Regarding hidden/deleted edits, the replication isn't the only source of this data. All edits are also exposed via Recent Changes (https/api/etc) as they occur, and the value of these edits is determined by the fact they are hidden afterwards (e.g. don't appear in dumps). Is there any way to control who is effectively capturing all edits via Recent Changes?

--
John Vandenberg
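
To make concrete how easily edits can be captured via Recent Changes as they occur, here is a minimal Python sketch. It assumes the public MediaWiki API endpoint of the English Wikipedia and its `list=recentchanges` query module; the function names, the User-Agent string and the selected `rcprop` fields are illustrative choices and not part of the original message.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public MediaWiki API endpoint (English Wikipedia used as the example wiki).
API_URL = "https://en.wikipedia.org/w/api.php"


def recent_changes_url(limit: int = 10) -> str:
    """Build a query URL for the most recent changes."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp|user",
        "rclimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def fetch_recent_changes(limit: int = 10) -> list:
    """Fetch recent changes anonymously; needs network access when called."""
    request = Request(
        recent_changes_url(limit),
        # Illustrative User-Agent; a real client should identify itself properly.
        headers={"User-Agent": "recent-changes-sketch/0.1 (example)"},
    )
    with urlopen(request, timeout=10) as response:
        data = json.load(response)
    return data["query"]["recentchanges"]
```

Calling `fetch_recent_changes()` in a loop and storing the results is all an anonymous reader needs in order to keep a copy of edits that are hidden later, which is why the question of controlling such capture is hard.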

16 Feb '15
Hi folks,

to increase accountability and create more opportunities for course corrections and resourcing adjustments as necessary, Sue's asked me and Howie Fung to set up a quarterly project evaluation process, starting with our highest priority initiatives. These are, according to Sue's narrowing focus recommendations which were approved by the Board [1]:

- Visual Editor
- Mobile (mobile contributions + Wikipedia Zero)
- Editor Engagement (also known as the E2 and E3 teams)
- Funds Dissemination Committee and expanded grant-making capacity

I'm proposing the following initial schedule:

January:
- Editor Engagement Experiments

February:
- Visual Editor
- Mobile (Contribs + Zero)

March:
- Editor Engagement Features (Echo, Flow projects)
- Funds Dissemination Committee

We’ll try doing this on the same day or adjacent to the monthly metrics meetings [2], since the team(s) will give a presentation on their recent progress, which will help set some context that would otherwise need to be covered in the quarterly review itself. This will also create open opportunities for feedback and questions.

My goal is to do this in a manner where even though the quarterly review meetings themselves are internal, the outcomes are captured as meeting minutes and shared publicly, which is why I'm starting this discussion on a public list as well.
I've created a wiki page here which we can use to discuss the concept further:

https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings/Quarterly_r…

The internal review will, at minimum, include:

Sue Gardner
myself
Howie Fung
Team members and relevant director(s)
Designated minute-taker

So for example, for Visual Editor, the review team would be the Visual Editor / Parsoid teams, Sue, me, Howie, Terry, and a minute-taker.

I imagine the structure of the review roughly as follows, with a duration of about 2 1/2 hours divided into 25-30 minute blocks:

- Brief team intro and recap of team's activities through the quarter, compared with goals
- Drill into goals and targets: Did we achieve what we said we would?
- Review of challenges, blockers and successes
- Discussion of proposed changes (e.g. resourcing, targets) and other action items
- Buffer time, debriefing

Once again, the primary purpose of these reviews is to create improved structures for internal accountability, escalation points in cases where serious changes are necessary, and transparency to the world.

In addition to these priority initiatives, my recommendation would be to conduct quarterly reviews for any activity that requires more than a set amount of resources (people/dollars). These additional reviews may however be conducted in a more lightweight manner and internally to the departments. We’re slowly getting into that habit in engineering.

As we pilot this process, the format of the high priority reviews can help inform and support reviews across the organization.

Feedback and questions are appreciated.

All best,
Erik

[1] https://wikimediafoundation.org/wiki/Vote:Narrowing_Focus
[2] https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings

--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation

Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate
Visually impaired
by rupert THURNER 30 Mar '14

Hi,

would any of you have some starting points concerning Wikipedia for visually impaired persons, on both computers and mobile devices?

Rupert

18 Mar '14
I am very pleased to announce that Wikimedia NYC and Wikimedia DC are working in collaboration to host the first national Wikimedia conference in the United States!

Here are the details for the conference:

Dates: Friday, May 30, 2014 - Sunday, June 1, 2014
Location: New York Law School (185 West Broadway, New York, NY 10013)
Website: http://wikiconferenceusa.org
Email: wikicon(a)wikimedianyc.org
Registration: http://wikiconusa.eventbrite.org/

For more information, please review our official press release below! We hope you will join us and help us spread the word!

https://commons.wikimedia.org/wiki/File:WikiCon_USA_2014_Press_Release_v1.p…

Thanks,
Richard (User:Pharos)
Wikimedia NYC

I can think of a few reasons why we should accept bitcoin:

* It's consistent with our leadership in internet technology
* Our peers like EFF and Internet Archive accept it
* It's secured using the same kinds of encryption we rely on to maintain user privacy
* It permits donations from countries that do not have Visa/Mastercard services
* It has a fanatically loyal and growing following that is dying to give us money in that currency

Most importantly, current technology would permit us to accept bitcoin without ever *holding* bitcoin. Accepting it through companies like BitPay (https://bitpay.com/) and CoinBase (https://coinbase.com/) is little different from accepting Visa, Mastercard, or Paypal. It's now possible for funds received as bitcoins to be *immediately* converted to USD.

I don't think we should 'make a statement' by accepting bitcoin; I think the currency is simply at the stage where it would be to our benefit to do so.

Jake (Ocaasi)
Funding programs for individuals.
by Jean-Frédéric 04 Mar '14

Hi folks,

While watching the current changes to the Wikimedia France microgrants program being implemented, I was curious to know which Wikimedia entities had similar funding programs for individuals - how they worked, how we could learn from each other.

Since apparently there was no Meta page for that(tm) (yet!) I went ahead and drafted <https://meta.wikimedia.org/wiki/User:Jean-Frédéric/Funding_programs>

I dug my information out of my email archives and FDC proposal forms, so I could totally have missed some programs - please add the ones you know about!

Of course, it would be more useful to have more detailed information on every program. Together with Caroline & Pierre-Selim we threw some ideas on what we thought was interesting to know about the programs, but that's still very alpha - please add more ideas!

Looking forward to your thoughts about this!

Cheers,
--
Jean-Frédéric
Wikimédia France

19 Feb '14
Dear members of the Wikimedia Movement:

In the past weeks, I've taken the decision to resign from my position as Executive Director of Wikimedia Argentina. This decision was presented to the Board of the chapter this past Saturday and it was accepted.

For personal and professional reasons, I've decided to return to my country, Chile, in the following weeks and start a new stage of my life. Two years ago, I was presented with the opportunity of living in Argentina and working for one of my passions. This was a big challenge for me: I had to leave my country and my family and work in a foreign country. These last 18 months have been a unique experience and I've learnt a lot, becoming a better professional and a more mature person. However, at the same time I feel that I need to move on and continue with new projects and challenges. Certainly, this has not been an easy decision for me, because I'm very fond of this work, the people that have participated in our activities and the projects we worked on and are still working on.

I leave the Association in a very different position than when I took office. We have several ongoing projects and we regularized all the delayed paperwork. We became the first organization from a developing country to get an Annual Plan Grant via FDC and have been one of the best-graded chapters in both processes. We are now a reliable and serious organization, continuing a process of professionalization that can improve our programs, making them more efficient and more effective. Clearly, this has not been only due to my work, which is why I thank the Board of the Chapter, which helped me in everything, and María Cruz, who has been a great colleague these months.

My main interest is that Wikimedia Argentina continues to grow, which is why we have decided that my departure will occur at the end of March.
This will allow us to work calmly to ensure the continuity of the ongoing projects of Wikimedia Argentina and the transfer of knowledge once the new Executive Director takes office. In any case, I will continue to participate as a member and Wikimedia volunteer once this period expires. By request of our Board, I will also attend the next Wikimedia Conference, so I will be able to transmit the experiences of Wikimedia Argentina in the last year.

I appreciate the trust placed in me by the Board and all the members of Wikimedia Argentina and their support all these months. I'm sure they will continue the great work we have done lately.

Kind regards,

*Osmar Valdebenito G.*
Director Ejecutivo
A. C. Wikimedia Argentina
VisualEditor office hours in February
by Maggie Dennis 15 Feb '14

Hi, guys.

I just wanted to let you know, so you could mark your calendars if interested, that there are two IRC office hours scheduled to discuss VisualEditor in February.

The first will be held on Saturday February 15 at 1700 UTC and the second will be held on Sunday February 16 at 00:00 UTC. (See https://meta.wikimedia.org/wiki/IRC_office_hours for time conversion links.)

Logs will be posted on meta after each office hour completes. You'll find them, along with logs for older office hours on the topic, at https://meta.wikimedia.org/wiki/Category:VisualEditor_office_hours_logs

Please see https://meta.wikimedia.org/wiki/IRC_office_hours for more information on how to join in.

Thanks!
Maggie

--
Maggie Dennis
Senior Community Advocate
Wikimedia Foundation, Inc.