Wikipedia search logs from 2012

nkurz · Unread post by **nkurz** » Wed Feb 05, 2014 9:20 am

A couple years ago there was discussion here about Wikipedia's intent to release anonymized search logs:
viewtopic.php?t=935&p=18042

It turned out that only a few were made available, and only for a short window in 2012:
viewtopic.php?t=1038&p=20055

Does anyone still have a copy of these, or know where copies might be found? I'm doing some research on search strategies (previous paper at http://arxiv.org/abs/1401.6399) and would find these very useful to analyze as a realistic search set. Online would be easiest, but I'd be happy to send an SASE with blank DVDs if anyone might be able to provide them.

Thanks! (and sorry for signing up and jumping straight to business)

thekohser · Unread post by **thekohser** » Wed Feb 05, 2014 2:34 pm

Eric Barbour says he has it, so he should chime in here soon.

Poetlister · Unread post by **Poetlister** » Wed Feb 05, 2014 7:18 pm

nkurz wrote:sorry for signing up and jumping straight to business

Welcome. It's nice to have people who haven't just turned up to make silly comments.

Pen · Unread post by **Pen** » Thu Feb 06, 2014 4:01 pm

(Update 9/20 17:40 PDT) It appeared that a small percentage of queries contained information unintentionally inserted by users. For example, some users may have pasted unintended information from their clipboards into the search box, causing the information to be displayed in the datasets. This prompted us to withdraw the files.

We are looking into the feasibility of publishing search logs at an aggregated level, but, until further notice, we do not plan on publishing this data in the near future.

Diederik van Liere, Product Manager Analytics

The information was withdrawn from the public because it contains information that shouldn't be public.
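
To make the clipboard problem concrete, here is a minimal sketch of the kind of screening a query log might go through before release. The patterns, the length cutoff, and the function names are all illustrative assumptions, not the WMF's actual process, and a filter like this would still miss plenty of sensitive strings.

```python
import re

# Hypothetical patterns for data that users commonly paste by accident.
# These are illustrative assumptions, not the WMF's actual screening rules.
SUSPICIOUS_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # something shaped like an email address
    re.compile(r"\b(?:\d[ -]?){9,}\b"),       # nine or more digits (phone, card, ID number)
    re.compile(r"https?://\S+"),              # a pasted URL
]
MAX_QUERY_LENGTH = 100  # assumed cutoff; very long strings are likely pasted text


def looks_unintended(query: str) -> bool:
    """Return True if the query resembles accidentally pasted information."""
    if len(query) > MAX_QUERY_LENGTH:
        return True
    return any(pattern.search(query) for pattern in SUSPICIOUS_PATTERNS)


def screen(queries):
    """Keep only the queries that do not trip any of the heuristics."""
    return [q for q in queries if not looks_unintended(q)]
```

For example, `screen(["paris", "call 555-123-4567 tomorrow"])` keeps only `"paris"`. Heuristics like these reduce the obvious leaks but cannot guarantee that what remains is safe to publish, which is consistent with the decision to withdraw the files.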

Pen · Unread post by **Pen** » Thu Feb 06, 2014 4:17 pm

link

On March 17, 2006, a federal judge in San Jose, Calif., ordered Google to partially comply with a subpoena from the Justice Department seeking search-engine records in its defense of the Child Online Protection Act, or COPA. U.S. District Judge James Ware denied the DOJ's request for a sample of one million search terms, but ordered Google to hand over a sample of 50,000 URLs returned as search results.

The DOJ had requested the data from all of the major search-engine providers, including Microsoft, AOL and Yahoo, but Google was the only one to fight it.

Privacy advocates praised the decision for keeping a lid on search terms, which can identify searchers even in the absence of any other information.

In August 2006, AOL researchers posted online 20 million search terms representing three months' worth of queries from some 658,000 users of its web browser. The researchers had stripped out user names and other personally identifying material, but the search terms alone were sufficient to reveal the identities of the searchers in at least some cases (see FAQ: AOL's Search Gaffe and You, http://www.wired.com/politics/security/ ... 6/08/71579). Under criticism, AOL pulled down the data.

Pen · Unread post by **Pen** » Thu Feb 06, 2014 4:35 pm

For readers who would like a dataset to play with, without software or downloading:

http://search-logs.com/

For people who'd want to read a slab of the data, then process and visualize the entropy of the masses, I recommend baraka as a good shortcut.

EricBarbour · Unread post by **EricBarbour** » Thu Feb 06, 2014 8:32 pm

Yes, I have it. If someone wants a copy, they have to send me enough DVD-Rs or a flash drive to hold it, plus an SASE.
1.1 GB compressed (very difficult to decompress, warning), or 5.1 GB uncompressed.

Better yet, I wish someone would put it on a webhost so I don't have to do this.

tarantino · Unread post by **tarantino** » Thu Feb 06, 2014 9:05 pm

EricBarbour wrote:Yes, I have it. If someone wants a copy, they have to send me enough DVD-Rs or a flash drive to hold it, plus an SASE.
1.1 GB compressed (very difficult to decompress, warning), or 5.1 GB uncompressed.

Better yet, I wish someone would put it on a webhost so I don't have to do this.

If you have a Google account, you can use Drive. It accepts files up to 10GB.

nkurz · Unread post by **nkurz** » Thu Feb 06, 2014 11:14 pm

Pen wrote:The information was withdrawn from the public because it contains information that shouldn't be public.

I think there is a balance between the public good of having large data sets available and the potential risk to individuals. My guess would be that the risk to users is much lower than it would be with a general-purpose search engine, but not having seen the data yet, I'm not able to weigh these two in this case.

In the case of the WMF, I'd presume the decision to discontinue release of new data was not based on the actual risk of exploitation of the data, but the (unfortunately legitimate) fear of being sued. I've written to Diederik, but haven't yet heard back. Personally, I think this climate of fear is detrimental, and may mean that useful large datasets of any sort are never again publicly released.

EricBarbour wrote:Better yet, I wish someone would put it on a webhost so I don't have to do this.

We currently use datasets (including AOL) for which we have agreements prohibiting release. If we are able to use the data for our upcoming paper, and if we feel comfortable that the data does not cause undue risk to individuals, and if it appears legal to do so, we hope to be able to clean it up and host it so that our results can be reproduced.

EricBarbour · Unread post by **EricBarbour** » Fri Feb 07, 2014 1:13 am

nkurz wrote:In the case of the WMF, I'd presume the decision to discontinue release of new data was not based on the actual risk of exploitation of the data, but the (unfortunately legitimate) fear of being sued. I've written to Diederik, but haven't yet heard back. Personally, I think this climate of fear is detrimental, and may mean that useful large datasets of any sort are never again publicly released.

You may be right. You're the fifth researcher who has contacted me for a copy -- and I had no intention of acting as an "unofficial mirror".
I would not be a bit surprised to learn that the WMF does not want any outsiders to see it.

nkurz · Unread post by **nkurz** » Fri Feb 07, 2014 6:23 am

EricBarbour wrote:
nkurz wrote:In the case of the WMF, I'd presume the decision to discontinue release of new data was not based on the actual risk of exploitation of the data, but the (unfortunately legitimate) fear of being sued.

I would not be a bit surprised to learn that the WMF does not want any outsiders to see it.

I didn't just mean that we wouldn't get to see Wikipedia search data, but that legal fears currently discourage every large company from ever releasing any large data set again. I find it somewhat tragic that things like the Netflix prize (which resulted in the release of a fabulous data set that spurred lots of research and useful innovation) are never going to happen again.

Pen · Unread post by **Pen** » Fri Feb 07, 2014 2:01 pm

nkurz wrote: I didn't just mean that we wouldn't get to see Wikipedia search data, but that legal fears currently discourage every large company from ever releasing any large data set again. I find it somewhat tragic that things like the Netflix prize (which resulted in the release of a fabulous data set that spurred lots of research and useful innovation) are never going to happen again.

Not at all. It can be difficult to recruit software geniuses if you don't have the skill. Netflix used a good marketing campaign for its recruitment, which gave them success. Once another company wants to run a similar campaign, they can give access to private datasets of any kind if they need to. They just give out confidentiality agreements to non-employees and employees alike.

kdb4 · Unread post by **kdb4** » Mon May 05, 2014 8:05 am

Hello guys,

I have to say I also just signed up because of my interest in those logs.

I am currently doing some research on ways to compute an improved version of PageRank using search logs. Working with those Wikipedia logs is invaluable/priceless for me, as this is the only subset of the web where we can safely say that we have the entire graph (which is quite a point, for PR computation).
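
For readers wondering how search logs could feed into PageRank at all, one common variant is personalized PageRank, where the random surfer's "teleport" jumps are weighted instead of uniform. The sketch below weights them by how often each page was reached from search. This is only an illustration under that assumption, not necessarily the method kdb4 has in mind; the function name and the `search_counts` input are hypothetical.

```python
def search_biased_pagerank(links, search_counts, damping=0.85, iterations=50):
    """Power-iteration PageRank with a teleport vector weighted by search counts.

    links: dict mapping each page to a list of pages it links to.
    search_counts: dict mapping page -> number of searches that led to it
                   (hypothetical input derived from a search log).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})

    # Teleport distribution: proportional to search counts, uniform if none given.
    total = sum(search_counts.get(p, 0) for p in pages)
    if total > 0:
        teleport = {p: search_counts.get(p, 0) / total for p in pages}
    else:
        teleport = {p: 1.0 / len(pages) for p in pages}

    rank = dict(teleport)
    for _ in range(iterations):
        # Every page starts each round with its share of the teleport mass.
        new_rank = {p: (1 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: hand its rank back via the teleport distribution.
                for p in pages:
                    new_rank[p] += damping * rank[page] * teleport[p]
        rank = new_rank
    return rank
```

Calling `search_biased_pagerank({"A": ["B"], "B": ["A"]}, {"A": 1, "B": 3})` ranks B above A even though the link structure is symmetric, because searches favor B. Having the complete link graph matters here: with missing pages or links, the link-following part of the computation is distorted no matter how good the search data is.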

I can create an account on a server for you to upload it, Eric, and I could also find a way to host it so that you are not bothered about it in the future. I could give you the URL without sharing it publicly, and you could pass it on to anyone you want.

I do not intend to distribute them, of course, but this is an invaluable piece of material for research purposes. That could allow important advancement in web search personalization, for instance, but also has other incredible applications. Wikipedia was aware of that when they released it; I am also convinced they withdrew them because of the fear of being sued for any reason, which is a pity imho.

Poetlister · Unread post by **Poetlister** » Mon May 05, 2014 1:45 pm

kdb4. It's nice to know that we do useful things here occasionally!

thekohser · Unread post by **thekohser** » Mon May 05, 2014 3:29 pm

kdb4 wrote:...this is an invaluable piece of material for research purposes. That could allow important advancement in web search personalization, for instance, but also has other incredible applications. Wikipedia was aware of that when they released it; I am also convinced they withdrew them because of the fear of being sued for any reason, which is a pity imho.

I wonder, is your research going to be about boredom?

kdb4 · Unread post by **kdb4** » Mon May 05, 2014 5:29 pm

thekohser wrote:
kdb4 wrote:...this is an invaluable piece of material for research purposes. That could allow important advancement in web search personalization, for instance, but also has other incredible applications. Wikipedia was aware of that when they released it; I am also convinced they withdrew them because of the fear of being sued for any reason, which is a pity imho.
I wonder, is your research going to be about boredom?

Well, this is why you've called it a "sample"

You could as well say, when you look at AOL logs, that researchers are doing research about porn, if you were to only take a sample

Clipperton · Unread post by **Clipperton** » Tue May 06, 2014 1:28 am

Have WO members considered setting up a torrent to 'keep alive' documents and other files of interest?

It would probably have some awkward IP address issues.

kdb4 · Unread post by **kdb4** » Tue May 06, 2014 6:18 am

That's why you use trackers, isn't it?
