Harassment is a major problem on the Internet. A 2014 Pew survey found that 40% of US adult Internet users have personally experienced harassment, and 73% have seen someone else harassed. The most frequent venue for harassment is social networking sites like Twitter and Facebook (66%), with a further 16% occurring at online gaming sites.
Given that Wikipedia is a cross between a social networking site and MMORPG, one would expect the problem to be prevalent here as well. So it was a welcome step that the WMF conducted its first ever survey about harassment on Wikimedia projects a few months ago. Designed and implemented by the WMF Support and Safety (SuSa) team (formerly Community Advocacy), it was conducted over a period of two weeks in November 2015. A report with the findings was published in late January 2016 (Preliminary report — Latest version).
The report is 52 pages long and filled with charts and numbers. What, if anything, do these numbers mean? Do they shed light on the prevalence and nature of the harassment encountered on Wikimedia projects?
This blog post will look at the survey in some detail. It is based on the Wikipediocracy forum thread about the survey and associated discussions on Meta by Wikipediocracy members and members of the WMF SuSa team. We will also compare and contrast the WMF survey to the aforementioned Pew survey.
The first question to ask about a survey is: how were the respondents selected from the pool of all editors? A small, well-chosen sample is better than a big, unrepresentative one. A well-known example is the survey conducted by the publication Literary Digest to predict the result of the 1936 US Presidential election: they surveyed a massive 2.4 million people, but due to sampling bias their prediction was off by a whopping 19 percentage points, and the candidate they picked lost in a landslide. The WMF seems to have prioritized being more “open and inclusive”, with almost no thought given to sample quality. Comment from Patrick Earley: “First, as to the question of why we’re not doing scientific sampling to generate survey participants: when we design surveys at the Foundation, we tend to keep them as open and inclusive as possible. This approach often introduces some error when it comes to studying very specific issues related to small user population. Using a sampling approach that targets very narrow user groups would be better, but the high privacy standards that Wikimedia projects maintain tend to hinder such high-accuracy sampling, and subsequently leave us with a broad intake pool of respondents.”
The WMF survey was an opt-in survey, stating: “We invite you to participate in a survey about online harassment on Wikimedia projects.” As Wikipediocracy member Gregory Kohs pointed out in the associated Meta discussion: “People who are interested in commenting about online harassment on Wikimedia projects will self-select into the survey sample, while people who are not interested in commenting about online harassment on Wikimedia projects will self-select out of the survey sample.”
The Pew survey took extreme care in ensuring its sample was representative of the general population. This is much harder to do with a largely anonymous editing community like the one found on Wikimedia projects, but that is not an excuse to simply ignore the matter.
Another important survey parameter is the recall period: how far back are the respondents asked to remember things? Are they expected to keep a journal or other records? These considerations heavily influence answers to surveys. For instance, the Bureau of Labor Statistics in the US selects the appropriate recall period carefully, after much research and testing out different recall periods for comparison.
In the WMF survey, the respondents were asked (Question 6): “How many times have you experienced incidents like the ones described below while working on any of the Wikimedia projects?” with no time period specified at all. How would a person be expected to remember the number of times they have been called names on Wikipedia? Also, the numbers for one editor who has been editing for 10 months, versus one who has been editing for 10 years, are clearly not comparable.
In response to queries, the SuSa team stated that the wording without a time period specified was deliberate, because they wanted to cover all instances of harassment, including those which occur over longer time frames. They also stated that the numbers are to be understood as ballpark figures.
In contrast, the Pew survey had a much more sensible design: they asked a binary (yes/no) question about whether the respondents have experienced harassment of type X, with follow-up questions on their most recent harassment experience (See ON5-ON7 for initial questions, with follow up questions from ON8 onwards).
Implausible results in the preliminary report
Methodology aside, several of the reported results are implausible. One example in the preliminary report was pointed out by WO members Demonology and Gregory Kohs: 54% of Wikipedia users say they may have been harassed, and of those 61% claimed that they were harassed in the form of “revenge porn.” If this is representative of editors in the Wikimedia community, then fully a third of them have been personally victimized by revenge porn. Needless to say, if this is true, it would be rather big news which would lead to some rather drastic measures, including perhaps shutting down Wikipedia entirely.
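The “fully a third” figure follows directly from multiplying the two reported percentages; a quick sanity check (the percentages are the ones quoted above from the preliminary report):

```python
# Percentages quoted from the preliminary report
harassed = 0.54       # respondents who say they may have been harassed
revenge_porn = 0.61   # of those, the share reporting "revenge porn"

# Implied share of ALL respondents reporting revenge porn
implied = harassed * revenge_porn
print(round(implied, 2))  # 0.33, i.e. about a third
```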
The discussion on the Meta page and elsewhere indicated that, perhaps, many of the respondents did not have a clear and uniform idea of what revenge porn meant.
Leaving aside precision of definitions, let’s look at another category: hacking. 63% of respondents who experienced harassment reported having their accounts hacked. As pointed out by WO member Drijfzand, hacking is not like calling someone names or sending them threats: in a venue where most editors are anonymous, it would take a lot of effort to find private information about editors. Since private information about an editor on Wikipedia is usually known only to trusted CheckUsers or, in the case of privileged users, to the WMF, it could mean that there is widespread incompetence or abuse: again, if those figures were correct.
The raw figures also show a curious pattern: for the categories “revenge porn”, “doxxing”, “threats of violence”, “impersonation” and “hacking”, the counts all fall within 5% of 775.
Technical design faults
What is going on here? A clue is in the section called “Design Fault” on the Meta page. One respondent complained that “The page with sliders on which forms of harassment I have experienced would not advance until they were all non-zero. I selected a distinct number and added a note to say that actually meant zero”. This bug was fixed a couple of days into the survey, but it affected all responses submitted before the fix, which could explain the curious figures observed.
As Drijfzand pointed out, the survey seems to have counted many answers as indicating harassment even when the user entered a value of zero. They demonstrated, using the raw data and the Bhatia–Davis inequality, that the reported results are mathematically impossible if only the non-zero responses were counted.
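The Bhatia–Davis inequality says that a random variable bounded in [m, M] with mean μ can have variance at most (M − μ)(μ − m). A sketch of the style of argument, using hypothetical summary statistics (not the actual report figures), shows how a reported standard deviation can rule out the “non-zero responses only” scenario:

```python
# Hypothetical summary statistics, for illustration only
mean = 2.0       # reported average number of incidents
sd = 12.0        # reported standard deviation
maximum = 100.0  # assumed cap on the slider

def bhatia_davis_max_var(mean, minimum, maximum):
    """Bhatia-Davis upper bound on the variance of a variable in [minimum, maximum]."""
    return (maximum - mean) * (mean - minimum)

# If every counted answer were at least 1 (i.e. only non-zero responses):
print(sd**2 <= bhatia_davis_max_var(mean, 1.0, maximum))  # False: impossible
# If zero answers were counted as well:
print(sd**2 <= bhatia_davis_max_var(mean, 0.0, maximum))  # True: consistent
```

With these numbers, a variance of 144 exceeds the maximum possible variance of 98 for a non-zero-only sample, but fits comfortably once zeros are allowed: exactly the kind of contradiction Drijfzand exhibited in the real data.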
Types of harassment
Harassment can take many forms: the Pew survey breaks them down into “less severe” (name calling, embarrassment, etc.) and “more severe” (stalking, physical threats, etc.). Men tend to experience name calling, embarrassment and physical threats more often, while women tend to experience stalking and sexual harassment more. The appropriate way of dealing with less severe harassment is often to ignore it – the old adage “don’t feed the trolls”. Both the Pew survey (pages 2, 6) and the WMF survey (pages 29, 33) found that ignoring was both a common and effective tactic for dealing with harassment. Ignoring severe harassment, obviously, is much more difficult and dangerous (Pew survey, page 6). The preliminary WMF survey report did not include a breakdown of the data by harassment type and gender.
In response to questions on Meta, the updated report removed the implausible prevalence numbers (non-zero values for “number of times you have experienced harassment of type X”) and used an alternate measure: averaging the numbers. The updated report also includes a breakdown by gender and harassment type.
The alternate measure is problematic for several reasons. As discussed above under recall periods, these numbers are neither reliable nor precise. And since the numbers are really ballpark figures for the frequency of harassment, not its prevalence, it is misleading to simply average them: such a measure is dominated by a few large and imprecise values. It is thus not clear what meaning, if any, can be attached to these numbers. The reported averages do not show any of the gender patterns observed in the Pew survey.
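How easily a few large values swamp an average can be seen with made-up responses (these are illustrative numbers, not survey data):

```python
# Illustrative responses: most report a handful of incidents,
# one respondent gives a very large ballpark figure
responses = [1, 2, 1, 3, 2, 1, 500]

mean = sum(responses) / len(responses)
median = sorted(responses)[len(responses) // 2]

print(round(mean, 1))  # 72.9 -- dominated by the single large value
print(median)          # 2   -- what the typical respondent reported
```

A robust summary like the median (or a trimmed mean) would at least describe the typical respondent; the plain average describes almost nobody.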
Another important factor is age: the Pew survey found that young people tend to experience much more harassment. The WMF survey does not break down the harassment data by age.
What is salvageable from this mess?
Perhaps studying a subset of the raw data, with the buggy responses removed, could give a more accurate and plausible picture. It would also be useful to treat Q6 (number of times one has been harassed) as a binary (yes/no) question; the high standard deviation in the raw data indicates that many respondents read the question as binary anyway. The problems with sampling probably cannot be fixed without conducting a new survey. Qualitative data from the survey could be useful: for instance, a third of the respondents suggested improving Wiki governance to deal with harassment (page 46).
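Recoding Q6 as binary is straightforward; a sketch with invented responses (the numbers are illustrative, not from the raw data):

```python
# Invented Q6-style responses: raw incident counts, where 0 means "never"
responses = [0, 3, 0, 0, 1, 12, 0, 2]

# Treat the question as binary: harassed at least once, yes or no
ever_harassed = [r > 0 for r in responses]
prevalence = sum(ever_harassed) / len(ever_harassed)
print(prevalence)  # 0.5
```

This discards the unreliable magnitudes but keeps the one thing respondents can plausibly recall, which is whether something happened at all, and makes the result directly comparable to the Pew survey’s yes/no figures.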
Looking closely at the “grounds of harassment,” one finds the following: a third of the respondents did not know why they were targeted. Much of the rest involves disputes over content: a quarter listed political disputes as the reason, and a further 10% each listed “content” and “POV/difference of opinion”. Unfortunately, the Wikipedia dispute resolution system is ill-suited for addressing this: a 2010 Emory Law Journal article found that “the Wiki-dispute resolution ignores the content of user disputes, instead focusing on user conduct.”
Finally, though, the best way to think about this survey might be the Wikipedia acronym WP:TNT, sometimes invoked in discussions about whether to keep an article. It means: “Blow it up and start over”.