Wikipedia search logs from 2012

nkurz
Member
Posts: 3
Joined: Wed Feb 05, 2014 4:19 am

Wikipedia search logs from 2012

Unread post by nkurz » Wed Feb 05, 2014 9:20 am

A couple years ago there was discussion here about Wikipedia's intent to release anonymized search logs:
viewtopic.php?t=935&p=18042

It turned out that only a few were made available, and only for a short window in 2012:
viewtopic.php?t=1038&p=20055

Does anyone still have a copy of these, or know where copies might be found? I'm doing some research on search strategies (previous paper at http://arxiv.org/abs/1401.6399) and would find these very useful to analyze as a realistic search set. Online would be easiest, but I'd be happy to send an SASE with blank DVDs if anyone might be able to provide them.

Thanks! (and sorry for signing up and jumping straight to business)

thekohser
Majordomo
Posts: 13408
Joined: Thu Mar 15, 2012 5:07 pm
Wikipedia User: Thekohser
Wikipedia Review Member: thekohser
Actual Name: Gregory Kohs
Location: United States

Re: Wikipedia search logs from 2012

Unread post by thekohser » Wed Feb 05, 2014 2:34 pm

Eric Barbour says he has it, so he should chime in here soon.
"...making nonsensical connections and culminating in feigned surprise, since 2006..."

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Wikipedia search logs from 2012

Unread post by Poetlister » Wed Feb 05, 2014 7:18 pm

nkurz wrote:sorry for signing up and jumping straight to business
Welcome. It's nice to have people who haven't just turned up to make silly comments.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche


Pen
Critic
Posts: 113
Joined: Tue Dec 10, 2013 10:32 am
Location: waiting for attachment

Re: Wikipedia search logs from 2012

Unread post by Pen » Thu Feb 06, 2014 4:17 pm

link

On March 17, 2006, a federal judge in San Jose, Calif., ordered Google to partially comply with a subpoena from the Justice Department seeking search-engine records in its defense of the Child Online Protection Act, or COPA. U.S. District Judge James Ware denied the DOJ's request for a sample of one million search terms, but ordered Google to hand over a sample of 50,000 URLs returned as search results.

The DOJ had requested the data from all of the major search-engine providers, including Microsoft, AOL and Yahoo, but Google was the only one to fight it.

Privacy advocates praised the decision for keeping a lid on search terms, which can identify searchers even in the absence of any other information.

In August 2006, AOL researchers posted online 20 million search terms representing three months worth of queries from some 658,000 users of its web browser. The researchers had stripped out user names and other personally identifying material, but the search terms alone were sufficient to reveal the identities of the searchers in at least some cases (see FAQ: AOL's Search Gaffe and You, http://www.wired.com/politics/security/ ... 6/08/71579). Under criticism, AOL pulled down the data.

Pen
Critic
Posts: 113
Joined: Tue Dec 10, 2013 10:32 am
Location: waiting for attachment

Re: Wikipedia search logs from 2012

Unread post by Pen » Thu Feb 06, 2014 4:35 pm

For readers who would like a dataset to play with, without software or downloading:

http://search-logs.com/

For people who'd want to read a slab of the data, then process and visualize the entropy of the masses, I recommend baraka as a good shortcut.

EricBarbour
 
Posts: 10891
Joined: Wed Mar 14, 2012 11:32 pm
Location: hell

Re: Wikipedia search logs from 2012

Unread post by EricBarbour » Thu Feb 06, 2014 8:32 pm

Yes, I have it. If someone wants a copy, they have to send me enough DVD-Rs or a flash drive to hold it, plus an SASE.
1.1 GB compressed (very difficult to decompress, warning), or 5.1 GB uncompressed.

Better yet, I wish someone would put it on a webhost so I don't have to do this.

tarantino
Habitué
Posts: 4764
Joined: Thu Mar 15, 2012 7:19 pm

Re: Wikipedia search logs from 2012

Unread post by tarantino » Thu Feb 06, 2014 9:05 pm

EricBarbour wrote:Yes, I have it. If someone wants a copy, they have to send me enough DVD-Rs or a flash drive to hold it, plus an SASE.
1.1 GB compressed (very difficult to decompress, warning), or 5.1 GB uncompressed.

Better yet, I wish someone would put it on a webhost so I don't have to do this.
If you have a Google account, you can use Drive. It accepts files up to 10GB.

nkurz
Member
Posts: 3
Joined: Wed Feb 05, 2014 4:19 am

Re: Wikipedia search logs from 2012

Unread post by nkurz » Thu Feb 06, 2014 11:14 pm

Pen wrote:The information was withdrawn from the public because it contains information that shouldn't be public.
I think there is a balance between the public good of having large data sets available and the potential risk to individuals. My guess would be that the risk to users is much lower than it would be with a general-purpose search engine, but not having seen the data yet, I'm not able to weigh these two in this case.

In the case of the WMF, I'd presume the decision to discontinue release of new data was not based on the actual risk of exploitation of the data, but the (unfortunately legitimate) fear of being sued. I've written to Diederik, but haven't yet heard back. Personally, I think this climate of fear is detrimental, and may mean that useful large datasets of any sort are never again publicly released.
EricBarbour wrote:Better yet, I wish someone would put it on a webhost so I don't have to do this.
We currently use datasets (including AOL) for which we have agreements prohibiting release. If we are able to use the data for our upcoming paper, and if we feel comfortable that the data does not cause undue risk to individuals, and if it appears legal to do so, we hope to be able to clean it up and host it so that our results can be reproduced.

EricBarbour
 
Posts: 10891
Joined: Wed Mar 14, 2012 11:32 pm
Location: hell

Re: Wikipedia search logs from 2012

Unread post by EricBarbour » Fri Feb 07, 2014 1:13 am

nkurz wrote:In the case of the WMF, I'd presume the decision to discontinue release of new data was not based on the actual risk of exploitation of the data, but the (unfortunately legitimate) fear of being sued. I've written to Diederik, but haven't yet heard back. Personally, I think this climate of fear is detrimental, and may mean that useful large datasets of any sort are never again publicly released.
You may be right. You're the fifth researcher who has contacted me for a copy -- and I had no intention of acting as an "unofficial mirror".
I would not be a bit surprised to learn that the WMF does not want any outsiders to see it.

nkurz
Member
Posts: 3
Joined: Wed Feb 05, 2014 4:19 am

Re: Wikipedia search logs from 2012

Unread post by nkurz » Fri Feb 07, 2014 6:23 am

EricBarbour wrote:
nkurz wrote:In the case of the WMF, I'd presume the decision to discontinue release of new data was not based on the actual risk of exploitation of the data, but the (unfortunately legitimate) fear of being sued.

I would not be a bit surprised to learn that the WMF does not want any outsiders to see it.
I didn't just mean that we wouldn't get to see Wikipedia search data, but that legal fears currently discourage every large company from ever releasing any large data set again. I find it somewhat tragic that things like the Netflix prize (which resulted in the release of a fabulous data set that spurred lots of research and useful innovation) are never going to happen again.

Pen
Critic
Posts: 113
Joined: Tue Dec 10, 2013 10:32 am
Location: waiting for attachment

Re: Wikipedia search logs from 2012

Unread post by Pen » Fri Feb 07, 2014 2:01 pm

nkurz wrote: I didn't just mean that we wouldn't get to see Wikipedia search data, but that legal fears currently discourage every large company from ever releasing any large data set again. I find it somewhat tragic that things like the Netflix prize (which resulted in the release of a fabulous data set that spurred lots of research and useful innovation) are never going to happen again.
Not at all. It can be difficult to recruit software geniuses if you don't have the skill. Netflix used a good marketing campaign for its recruitment, which gave them success. Once another company wants to run a similar campaign, they can give access to private information datasets of any kind if they need to. They just give out the confidentiality agreements to the non-employees and employees alike.

kdb4
Member
Posts: 3
Joined: Fri May 02, 2014 2:08 pm

Re: Wikipedia search logs from 2012

Unread post by kdb4 » Mon May 05, 2014 8:05 am

Hello guys,

I have to say I also just signed up because of my interest in those logs.

I am currently doing some research on ways to compute an improved version of PageRank using search logs. Working with those Wikipedia logs is invaluable/priceless for me, as this is the only subset of the web where we can safely say that we have the entire graph (which is quite a point, for PR computation :) ).

I can create an account on a server for you to upload it, Eric, and I could also find a way to host it so that you are not bothered about it again in the future. I could give you the URL without sharing it publicly, and you could then pass the URL on to anyone you like.

I do not intend to distribute them, of course, but this is an invaluable piece of material for research purposes. That could allow important advancement in web search personalization, for instance, but also has other incredible applications. Wikipedia was aware of that when they released it; I am also convinced they withdrew them because of the fear of being sued for any reason, which is a pity imho.

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Wikipedia search logs from 2012

Unread post by Poetlister » Mon May 05, 2014 1:45 pm

:welcome: kdb4. It's nice to know that we do useful things here occasionally!
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche

thekohser
Majordomo
Posts: 13408
Joined: Thu Mar 15, 2012 5:07 pm
Wikipedia User: Thekohser
Wikipedia Review Member: thekohser
Actual Name: Gregory Kohs
Location: United States

Re: Wikipedia search logs from 2012

Unread post by thekohser » Mon May 05, 2014 3:29 pm

kdb4 wrote:...this is an invaluable piece of material for research purposes. That could allow important advancement in web search personalization, for instance, but also has other incredible applications. Wikipedia was aware of that when they released it; I am also convinced they withdrew them because of the fear of being sued for any reason, which is a pity imho.
I wonder, is your research going to be about boredom?
"...making nonsensical connections and culminating in feigned surprise, since 2006..."

kdb4
Member
Posts: 3
Joined: Fri May 02, 2014 2:08 pm

Re: Wikipedia search logs from 2012

Unread post by kdb4 » Mon May 05, 2014 5:29 pm

thekohser wrote:
kdb4 wrote:...this is an invaluable piece of material for research purposes. That could allow important advancement in web search personalization, for instance, but also has other incredible applications. Wikipedia was aware of that when they released it; I am also convinced they withdrew them because of the fear of being sued for any reason, which is a pity imho.
I wonder, is your research going to be about boredom?
Well, this is why you've called it a "sample" :)

You could as well say, when you look at AOL logs, that researchers are doing research about porn, if you were to only take a sample :)

Clipperton
Contributor
Posts: 53
Joined: Sat Nov 23, 2013 9:31 am

Re: Wikipedia search logs from 2012

Unread post by Clipperton » Tue May 06, 2014 1:28 am

Have WO members considered setting up a torrent to 'keep alive' documents and other files of interest?

It would probably have some awkward IP address issues.

kdb4
Member
Posts: 3
Joined: Fri May 02, 2014 2:08 pm

Re: Wikipedia search logs from 2012

Unread post by kdb4 » Tue May 06, 2014 6:18 am

That's why you use trackers, isn't it?
