The Myth of Crowdsourcing at Wikipedia

By greybeard and Kelly Martin

 

According to Wikipedia,

Wikipedia is often cited as a successful example of crowdsourcing,[157] despite objections by co-founder Jimmy Wales to the term.[158]

When writing for the largest possible audience, one has to be careful about the definition of the word “crowdsourcing”.

Wikipedia is a failed example of crowdsourcing, but there are also successful examples. The failure of Wikipedia as a crowdsourcing project is interesting in its own right, but if one is, or is perceived to be, decrying crowdsourcing more generally, one walks into a tarpit of contradictory evidence and conclusions that weaken one’s primary point.

Wikipedia’s model fails for a number of reasons. One we can call “entropy”. No fact on Wikipedia is ever fully established. If we crowdsource, say, a catalog of birds or a map of actual-vs-scheduled train times, the facts are never (or seldom) in dispute. Such projects rely on precise, discrete datapoints submitted by individuals, either volitionally or automatically. The crowdsourcing of earthquake data from people’s phones is considered successful as well: while an individual can “game” that system, the gamed data gets drowned out in the larger datastream and becomes “experimental error”. On Wikipedia, by contrast, no fact is ever final, no page is ever complete, and the data is forever mutable, at the finest granularity. If someone enters that Ludwig van Beethoven was born in 1770, that fact is never locked down, and someone can change it at any time to 1707 or 1907 or 7707. As we know, people may patrol the page, but more sparsely watched pages can sit in erroneous states indefinitely. Entropy prevails.
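To see concretely why a gamed datapoint gets “drowned out”, here is a minimal sketch (Python, with invented readings; not any real project’s pipeline) contrasting aggregation over many independent reports with a wiki’s last-writer-wins model:

```python
# Toy illustration (values invented): robust aggregation of crowdsourced
# sensor readings. A single gamed or erroneous report barely moves the
# aggregate, which is why such projects tolerate bad actors.
from statistics import median

honest_reports = [3.1, 3.0, 3.2, 2.9, 3.1, 3.0, 3.3, 3.1]  # e.g. reported shaking intensity
gamed_report = 99.0                                          # one user "games" the system

all_reports = honest_reports + [gamed_report]

print(f"mean:   {sum(all_reports) / len(all_reports):.2f}")  # dragged upward by the outlier
print(f"median: {median(all_reports):.2f}")                  # still 3.1, essentially unchanged

# On Wikipedia there is no aggregate at all: the last edit wins, so a single
# bad "datapoint" (1707 instead of 1770) is the published value until someone notices.
```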

A second reason Wikipedia’s model fails is also well known here. Wikipedia attracts zealots, partisans, and extremists from all parts of the spectrum, each wishing to see his or her opinion (or “version of the facts”) memorialized in an “encyclopedia”. Thus we get partisans on nationalist topics, others on matters of taste and morality, and even partisans on nominally scientific topics. Fanboys of particular corners of popular culture are a subset here. Sometimes these partisans manage to control a set of topics, sometimes they simply create page after page of trivia concerning their obsession, and sometimes they simply war away indefinitely on established topics. In each of these cases the result is that Wikipedia skews away from the commonly accepted academic, historical, scientific, or cultural consensus on any given topic, and toward the extremes. In some cases (the warred-over pages) you get a kind of soft mush of opposing opinions of the “on the one hand / on the other hand” variety that no scholarly book would accept. Successful crowdsourced projects almost universally have some editorial control at the top. Linus Torvalds controls Linux absolutely. On Wikipedia, “WP:OWNership” prevails.

A third reason Wikipedia’s model fails is the lack of what in computer science we would call a “goal function”. There is no objective measure of the success of an online encyclopedia. As a result, Wikipedia substitutes the measures that are available: completeness becomes the number of pages, despite the lack of any apparent correlation between the two; success becomes page views; engagement becomes the number of editors. Any initiative, change, or environmental factor seen as likely to diminish those measures is swiftly defeated. Thus Wikipedia does not remove problematic biographies, because that would reduce the number of pages and the perception of completeness. It will not prevent anonymous editing, because that would reduce the perception of user engagement. And it won’t institute safety practices like flagged revisions, because that would make the site less mutable and arguably diminish both page views and editors. So conservatism prevails.

None of these failings (entropy, ownership, conservatism, and there are others) is endemic to crowdsourced projects. Nor are they unique to Wikipedia, which is arguably the largest and most visible project that suffers from them.

Finally, the popular and online media tend to conflate “crowdsourced” with “crowdfunded”. They are, of course, completely different things. Once crowdfunders part with their money, their editorial control over the result is finished, except to the degree that they proselytize for the product. They don’t contribute circuit designs or software or artwork, and they certainly don’t war over them. Additionally, somewhat savvy large corporations have adopted PR and marketing campaigns under the rubric of “crowdsourcing” that are basically contests: people enter, giving their free talent (such as it is) for a chance at winning something, whether a modicum of fame or something more tangible. This isn’t really crowdsourcing.

One needn’t look further than the galaxy-identifying systems, the human OCR systems, the protein-folding systems, and similar efforts to find good, useful, and successful examples of crowdsourcing, so one would be unwise either to paint those efforts with the same brush as Wikipedia or to weaken one’s argument by over-generalization.

The lack of an objective measure of quality is a key point and bears remembering. One of us has volunteered, in the past, for CoCoRaHS, a crowdsourced precipitation measurement project for the United States. Pretty much anyone can volunteer and submit observations after completing a very simple training program. The CoCoRaHS coordinators, who are qualified hydrologists (the coordinator for one area is a professor of meteorological sciences at a local university, and the state-level coordinator is also the retired director of the state climate research center), gather the data and validate them by comparing them to one another, to National Weather Service observations, and to radar data to determine whether the reported precipitation is “reasonable”. Someone who reports 4 inches of rain when the neighboring stations all report no rain will get an email from the coordinator asking for an explanation. Someone who reports 11 inches of rain in the middle of a hurricane, though, probably will not, but they will likely get a mention in the monthly newsletter.
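For concreteness, here is a minimal sketch of that neighbor-comparison step (Python; the station IDs, values, and tolerance are invented, and the real CoCoRaHS/NWS quality-control procedures are considerably more sophisticated):

```python
# Toy sketch of neighbor-comparison validation: flag any station whose report
# differs from the median of the other stations by more than a tolerance.
from statistics import median

def flag_suspect_reports(reports: dict[str, float], tolerance_inches: float = 1.0) -> list[str]:
    """Return station IDs whose precipitation report looks unreasonable
    compared with the neighboring stations, for a coordinator to follow up."""
    suspects = []
    for station, value in reports.items():
        neighbors = [v for s, v in reports.items() if s != station]
        if abs(value - median(neighbors)) > tolerance_inches:
            suspects.append(station)
    return suspects

daily_rain_inches = {"ST-01": 0.0, "ST-02": 0.1, "ST-03": 0.0, "ST-04": 4.0, "ST-05": 0.05}
print(flag_suspect_reports(daily_rain_inches))  # ['ST-04'] -> gets an email from the coordinator
```

The particular threshold doesn’t matter; the point is that every incoming observation is checked against independent data before it enters the record, which is exactly the step Wikipedia lacks.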

The algorithms used for this validation process are similar to the ones the NWS uses to validate its own datasets, and there is a solid mathematical basis behind them. Because of this procedural rigor, CoCoRaHS has been incredibly successful in gathering a large corpus of fairly reliable precipitation data for large parts of the US that were previously not well-covered, and that data is used by hundreds of organizations for all sorts of purposes. It’s an example of successful crowdsourcing. It works because it targets people who have an interest in the field, provides training and easy access to resources, makes the process simple and nonconfrontational for volunteers, and provides qualified professional resources to supervise volunteers and validate the quality of volunteer data. Contrast this to what Wikipedia does in these areas, and you’ll see why Wikipedia fails.

The problem isn’t crowdsourcing itself. Rather, it’s how Wikipedia does crowdsourcing.

(This blog post was originally published on February 23, 2015)

Image credits: Flickr/Shane Kelly (ballinascreen.com), Flickr/USDAgov ~ licensed under Creative Commons Attribution 2.0 Generic

12 comments to The Myth of Crowdsourcing at Wikipedia

  • Anthony Cole

    Thanks. Yes, validation is the problem. Just as the crowdsourced meteorology project borrows its validation system from NWS, so Wikipedia could borrow the academic community’s validation system – expert review. But that seems to be ruled out by a higher principle: experts are shit. See https://en.wikipedia.org/wiki/Wikipedia:Randy_in_Boise

  • Eric Corbett

    Reasonable as far as it goes I suppose, and the quality issue does indeed need to be addressed rather than ignored, but Wikipedia is not now and never has been “crowdsourced”. I’d be surprised if the typical article – including featured articles – had even half-a-dozen significant contributors, not much of a crowd. It makes little sense to look at Wikipedia in the large when considering crowdsourcing, as there’s no overall editorial control of the product.

  • Three qualities that one looks for in an online reference work that includes articles about living people are respectable levels of Accuracy, Excellence, and Ethics. Wikipedia comes up short on all three, but especially flunks on Ethics in Mass Media.

  • James Salsman

    These issues are precisely the ones I am trying to address with http://www.mediawiki.org/wiki/Accuracy_review. Currently it’s scaled down to a 2-3 week “senior contributor” project that I hope will work in this year’s GSoC, but check the 2009 strategy proposal linked from the see alsos at the bottom for the original vision.

    Actually, my current vision goes quite a bit further than the 2009 proposal. We should be able to make a system which provides safely blinded, authentically crowdsourced maintenance of existing content to correct mistakes, whether intentional, accidental, or due to the mere passage of time, while at the same time providing a computer-aided instruction system to anyone who chooses to participate. Even further, it may make sense for a spin-off Foundation to fund people to work on such questions for pay. If done correctly, it will not put safe harbor provisions at risk, and could conceivably some day result in a sustainable way to pay people to enrich their own knowledge while maintaining the encyclopedias at the highest possible standard of accuracy.

    Obviously, I’m very excited about this, and have been for the better part of a decade, but the time didn’t seem right until now to try to bring it forward. However, I know that my excitement can cloud my vision, so I am trying to step back by making it someone else’s project. I would be happy to work with any interested volunteers. If some senior bot contributor decides to step forward and make it happen before GSoC, there are plenty of follow-on tasks to enhance it: not just those suggested by the 2009 strategy proposal, but plenty more involving integration with, for example, Wikipedia Zero, new ways of identifying out-of-date, suspect, and confusing content, a stronger redundant blind-voting system to eliminate advocacy bias of all kinds, and integration with modern educational freeware.

    • Anthony Cole

      (GSoC = Google Summer of Code)
      That’s a worthwhile, ambitious project, James. I hope you’re finding the support you need.

    • John lilburn

      All of this is tinkering at the edges.

      facts and statistics which are likely to have become out of date and have not been updated in a given number of years … phrases which are likely unclear

      This does not fix “by 1345, during the reign of Richard II”, nor does it fix “Thomas Rainsborough the Ranter”. It doesn’t fix a [[phrase]] pointing to something inappropriate (like someone from a different time or place), or the million other bits of nonsense that can be found on the site.

      All it does is present a list of pages that might be wrong, and we already have experience of lists of pages that might be wrong on wikipedia and how well that goes. Lest one think that one is an outlier, there is also the Qworty cleanup, which fizzled out after a few days.

      Almost every random page you hit has some template or other on it, some have been there 5 or more years. Nothing happens, the crap remains.

      The problem is that random people add out-of-date stuff that they’ve read in some 19th century book, that their preacher has told them, that they’ve seen in a TV drama, or which they’ve misunderstood from some other source, or garbled in order to avoid a charge of plagiarism. You cannot apply a technical solution to something that isn’t a technical problem. That is how wikipedia got into the mess it is in now.

      • James Salsman

        John, updating a traditional printed encyclopedia involves the same technical problems of using patterns to identify potentially out-of-date, inaccurate, and confusing content, and bringing those passages to the attention of human fact-checkers and proofreaders. A similar technical task is identifying spelling mistakes in new content; while this can be automated, it’s essential that the process be overseen by humans, a requirement I’ve tried to keep paramount.

        Do you think the examples you give are too complex for machine pattern matching with natural language processing? I don’t, but I agree with your sentiment, because there are a whole lot of more subtle examples, like potential innuendo and insinuation, which depend on connotations beyond the capabilities of any contemporary natural language processing system. But someone has to show the people working on such parsers where the boundary of their technology lies, and a general-purpose encyclopedia is far more likely to do that than specialized tasks like playing Jeopardy or suggesting chemotherapy regimens.
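        To make that concrete, here is a rough sketch of the kind of pattern matching being described; the patterns and the cutoff year are purely illustrative, and this is not the actual Accuracy review code:

```python
# Illustrative only: flag sentences that *might* be out of date so a human
# can review them. Nothing is ever corrected automatically.
import re

STALE_CUTOFF_YEAR = 2010
PATTERNS = [
    r"\bas of (19|20)\d{2}\b",            # "As of 2006, the station served ..."
    r"\bis currently\b",                  # present-tense claims that quietly go stale
    r"\b(recently|at present|to date)\b",
]

def flag_possibly_stale(text: str) -> list[str]:
    """Return sentences matching a staleness pattern, queued for human review."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for pattern in PATTERNS:
            if re.search(pattern, sentence, flags=re.IGNORECASE) is None:
                continue
            years = [int(m.group()) for m in re.finditer(r"\b(19|20)\d{2}\b", sentence)]
            if not years or min(years) < STALE_CUTOFF_YEAR:
                flagged.append(sentence)
            break
    return flagged

sample = ("As of 2006, the station served 1.2 million passengers annually. "
          "The line opened in 1904. Work on the extension is currently underway.")
print(flag_possibly_stale(sample))  # flags the first and third sentences, not the 1904 fact
```

        Anything flagged in this way goes to a human reviewer; the machine only prioritizes, it never decides.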

        I keep quarterly statistical tabs on Wikipedia’s short, popular vital articles, and I have not seen any interruption in the rate at which they reliably increase in quality. I’m also very familiar with the backlog, and was instrumental in sorting all of the WP:BACKLOG categories by number of incoming links some years ago, before the migration from the toolserver broke that. In the meantime, most of the people who chose to work on backlog tasks (and the proportion doesn’t waver very much from year to year) were working on articles which were arguably more important to address first. And there should naturally be tags on articles which are subjects of legitimate uncertainty, dispute, or known astroturfing, shouldn’t there?

        If in the next several years accuracy review merges with general computer-aided instruction, and the result is something that people are willing to pay for, there may be more than enough human capacity to find and correct the cleverest of hidden vandalism.

        • John lilburn

          a traditional printed encyclopedia involves the same technical problems of using patterns to identify potentially out-of-date, inaccurate, and confusing content

          Wikipedia is nothing like a “traditional printed encyclopedia”, firstly because the content is altered far more often than with an ‘encyclopedia’. Encyclopedic content rarely changes. The original content is written by a subject expert, not by someone garbling data from a 19th century book. The articles aren’t warred over by tossers. The editorial group directs resources to what needs fixing; it doesn’t just say “this is a bit naff”. As I said, almost every article is templated in some way and has been for years. Simply marking up more of them won’t fix issues which are fundamental to the wikipedia process.

  • TP

    Really great article, gang, and I agree that Wikipedia is unique in the constellation of crowdsourcing, but it is indeed important in said constellation. In this respect, I think you’ll find this body of research particularly useful going forward:

    http://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=1919614

    and this one in particular:

    http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2660638
