Wikipedia search logs from 2012
A couple of years ago there was discussion here about Wikipedia's intent to release anonymized search logs:
viewtopic.php?t=935&p=18042
It turned out that only a few were made available, and only for a short window in 2012:
viewtopic.php?t=1038&p=20055
Does anyone still have a copy of these, or know where copies might be found? I'm doing some research on search strategies (previous paper at http://arxiv.org/abs/1401.6399) and would find these very useful to analyze as a realistic search set. Online would be easiest, but I'd be happy to send an SASE with blank DVDs if anyone might be able to provide them.
Thanks! (and sorry for signing up and jumping straight to business)
- thekohser
- Majordomo
- Posts: 13408
- Joined: Thu Mar 15, 2012 5:07 pm
- Wikipedia User: Thekohser
- Wikipedia Review Member: thekohser
- Actual Name: Gregory Kohs
- Location: United States
Re: Wikipedia search logs from 2012
Eric Barbour says he has it, so he should chime in here soon.
"...making nonsensical connections and culminating in feigned surprise, since 2006..."
- Poetlister
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Wikipedia search logs from 2012
nkurz wrote: sorry for signing up and jumping straight to business
Welcome. It's nice to have people who haven't just turned up to make silly comments.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
Re: Wikipedia search logs from 2012
(Update 9/20 17:40 PDT) It appeared that a small percentage of queries contained information unintentionally inserted by users. For example, some users may have pasted unintended information from their clipboards into the search box, causing the information to be displayed in the datasets. This prompted us to withdraw the files.
We are looking into the feasibility of publishing search logs at an aggregated level, but, until further notice, we do not plan on publishing this data in the near future.
Diederik van Liere, Product Manager Analytics
The information was withdrawn from the public because it contains information that shouldn't be public.
Re: Wikipedia search logs from 2012
link
On March 17, 2006, a federal judge in San Jose, Calif., ordered Google to partially comply with a subpoena from the Justice Department seeking search-engine records in its defense of the Child Online Protection Act, or COPA. U.S. District Judge James Ware denied the DOJ's request for a sample of one million search terms, but ordered Google to hand over a sample of 50,000 URLs returned as search results.
The DOJ had requested the data from all of the major search-engine providers, including Microsoft, AOL and Yahoo, but Google was the only one to fight it.
Privacy advocates praised the decision for keeping a lid on search terms, which can identify searchers even in the absence of any other information.
In August 2006, AOL researchers posted online 20 million search terms representing three months worth of queries from some 658,000 users of its web browser. The researchers had stripped out user names and other personally identifying material, but the search terms alone were sufficient to reveal the identities of the searchers in at least some cases (see FAQ: AOL's Search Gaffe and You, http://www.wired.com/politics/security/ ... 6/08/71579). Under criticism, AOL pulled down the data.
Re: Wikipedia search logs from 2012
For readers who would like a dataset to play with, without software or downloading:
http://search-logs.com/
For people who'd want to read a slab of the data, then process and visualize the entropy of the masses, I recommend baraka as a good shortcut.
- EricBarbour
- Posts: 10891
- Joined: Wed Mar 14, 2012 11:32 pm
- Location: hell
Re: Wikipedia search logs from 2012
Yes, I have it. If someone wants a copy, they have to send me enough DVD-Rs or a flash drive to hold it, plus an SASE.
1.1 GB compressed (very difficult to decompress, warning), or 5.1 GB uncompressed.
Better yet, I wish someone would put it on a webhost so I don't have to do this.
Re: Wikipedia search logs from 2012
EricBarbour wrote: Yes, I have it. If someone wants a copy, they have to send me enough DVD-Rs or a flash drive to hold it, plus an SASE. 1.1 GB compressed (very difficult to decompress, warning), or 5.1 GB uncompressed. Better yet, I wish someone would put it on a webhost so I don't have to do this.
If you have a Google account, you can use Drive. It accepts files up to 10 GB.
Re: Wikipedia search logs from 2012
Pen wrote: The information was withdrawn from the public because it contains information that shouldn't be public.
I think there is a balance between the public good of having large data sets available and the potential risk to individuals. My guess would be that the risk to users is much lower than it would be with a general-purpose search engine, but not having seen the data yet, I'm not able to weigh these two in this case.
In the case of the WMF, I'd presume the decision to discontinue release of new data was not based on the actual risk of exploitation of the data, but the (unfortunately legitimate) fear of being sued. I've written to Diederik, but haven't yet heard back. Personally, I think this climate of fear is detrimental, and may mean that useful large datasets of any sort are never again publicly released.
EricBarbour wrote: Better yet, I wish someone would put it on a webhost so I don't have to do this.
We currently use datasets (including AOL) for which we have agreements prohibiting release. If we are able to use the data for our upcoming paper, and if we feel comfortable that the data does not cause undue risk to individuals, and if it appears legal to do so, we hope to be able to clean it up and host it so that our results can be reproduced.
- EricBarbour
- Posts: 10891
- Joined: Wed Mar 14, 2012 11:32 pm
- Location: hell
Re: Wikipedia search logs from 2012
nkurz wrote: In the case of the WMF, I'd presume the decision to discontinue release of new data was not based on the actual risk of exploitation of the data, but the (unfortunately legitimate) fear of being sued. I've written to Diederik, but haven't yet heard back. Personally, I think this climate of fear is detrimental, and may mean that useful large datasets of any sort are never again publicly released.
You may be right. You're the fifth researcher who has contacted me for a copy -- and I had no intention of acting as an "unofficial mirror".
I would not be a bit surprised to learn that the WMF does not want any outsiders to see it.
Re: Wikipedia search logs from 2012
nkurz wrote: In the case of the WMF, I'd presume the decision to discontinue release of new data was not based on the actual risk of exploitation of the data, but the (unfortunately legitimate) fear of being sued.
EricBarbour wrote: I would not be a bit surprised to learn that the WMF does not want any outsiders to see it.
I didn't just mean that we wouldn't get to see Wikipedia search data, but that legal fears currently discourage every large company from ever releasing any large data set again. I find it somewhat tragic that things like the Netflix prize (which resulted in the release of a fabulous data set that spurred lots of research and useful innovation) are never going to happen again.
Re: Wikipedia search logs from 2012
nkurz wrote: I didn't just mean that we wouldn't get to see Wikipedia search data, but that legal fears currently discourage every large company from ever releasing any large data set again. I find it somewhat tragic that things like the Netflix prize (which resulted in the release of a fabulous data set that spurred lots of research and useful innovation) are never going to happen again.
Not at all. It can be difficult to recruit software geniuses if you don't have the skill. Netflix used a good marketing campaign for its recruitment, which gave them success. Once another company wants to run a similar campaign, they can give access to private information datasets of any kind if they need to. They just give out the confidentiality agreements to the non-employees and employees alike.
Re: Wikipedia search logs from 2012
Hello guys,
I have to say I also just signed up because of my interest in those logs.
I am currently doing some research on ways to compute an improved version of PageRank using search logs. Working with those Wikipedia logs would be invaluable for me, as this is the only subset of the web where we can safely say that we have the entire graph (which is quite a point for PR computation).
I can create an account on a server for you to upload it, Eric, and I could also find a way to host it so that you are not bothered about it in the future. I could give you the URL without sharing it publicly, and you could then give the URL to anyone you want.
I do not intend to distribute them, of course, but this is an invaluable piece of material for research purposes. It could allow important advances in web search personalization, for instance, and has other incredible applications. Wikipedia was aware of that when they released it. I am also convinced they withdrew them because of the fear of being sued for any reason, which is a pity imho.
- Poetlister
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Wikipedia search logs from 2012
kdb4. It's nice to know that we do useful things here occasionally!
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
- thekohser
- Majordomo
- Posts: 13408
- Joined: Thu Mar 15, 2012 5:07 pm
- Wikipedia User: Thekohser
- Wikipedia Review Member: thekohser
- Actual Name: Gregory Kohs
- Location: United States
Re: Wikipedia search logs from 2012
kdb4 wrote: ...this is an invaluable piece of material for research purposes. That could allow important advancement in web search personalization, for instance, but also has other incredible applications. Wikipedia was aware of that when they released it, I am also convinced they withdrew them because of the fear of being sued for any reason, which is a pity imho.
I wonder, is your research going to be about boredom?
"...making nonsensical connections and culminating in feigned surprise, since 2006..."
Re: Wikipedia search logs from 2012
kdb4 wrote: ...this is an invaluable piece of material for research purposes. That could allow important advancement in web search personalization, for instance, but also has other incredible applications. Wikipedia was aware of that when they released it, I am also convinced they withdrew them because of the fear of being sued for any reason, which is a pity imho.
thekohser wrote: I wonder, is your research going to be about boredom?
Well, this is why you've called it a "sample".
You could as well say, when you look at AOL logs, that researchers are doing research about porn, if you were to only take a sample.
- Clipperton
- Contributor
- Posts: 53
- Joined: Sat Nov 23, 2013 9:31 am
Re: Wikipedia search logs from 2012
Have WO members considered setting up a torrent to 'keep alive' documents and other files of interest?
It would probably have some awkward IP address issues.
Re: Wikipedia search logs from 2012
That's why you use trackers, isn't it?