Size of the English Wikipedia database in GB
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
-
- Posts: 10891
- Joined: Wed Mar 14, 2012 11:32 pm
- Location: hell
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: Searching for database statistics only shows the most recent being from 2010.
As you can see on the statistics tables, they stopped updating a whole slew of statistics in January 2010, including the total size of all articles.
No one outside the WMF knows why, and WMF people refuse to discuss it.
As you can see on this chart, other language Wikipedias are still updating their size statistics. The only prominent exception: English.
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
Re: Size of the English Wikipedia database in GB
Note that this topic is different from viewtopic.php?f=6&t=3985
This topic is about why the WMF somehow wants to hide the true size of Wikipedia, while the other thread is about the size of Wikipedia (without discussing WMF's censorship).
-
- Habitué
- Posts: 1364
- Joined: Wed Mar 14, 2012 11:21 pm
Re: Size of the English Wikipedia database in GB
An approximation may be determined from these pages: http://dumps.wikimedia.org/enwiki/20140614/
-
- Site Admin
- Posts: 9969
- Joined: Mon Mar 19, 2012 11:10 pm
- Wikipedia Review Member: Somey
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: This topic is about why the WMF somehow wants to hide the true size of Wikipedia...
I just always assumed they were worried that the North Koreans would buy up the entire worldwide supply of USB sticks that are just a bit larger than the true size of Wikipedia, forcing them to pay more money for USB sticks that are the next-size-up in capacity, which they will then attach to big helium balloons and float over the demilitarized zone in order to bring the sum of human knowledge to all those North Koreans - none of whom actually are allowed to own their own computers.
-
- Habitué
- Posts: 4208
- Joined: Thu Mar 15, 2012 8:14 pm
- Wikipedia User: Peter Damian
- Wikipedia Review Member: Peter Damian
- Location: London
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: Searching for database statistics only shows the most recent being from 2010.
EricBarbour wrote: As you can see on the statistics tables, they stopped updating a whole slew of statistics in January 2010. Including total size of all articles. No one outside the WMF knows why, and WMF people refuse to discuss it. As you can see on this chart, other language Wikipedias are still updating their size statistics. The only prominent exception: English.
Don't forget the work we did way back on article size. This is different from database size, as the database includes absolutely every stupid edit. I could dig that up (or Eric could). What we found, when we ordered by size in KB, was a distribution with a massive tail of stubs and short crappy articles. You can easily verify this by pressing the 'random article' button. Once you cut away that rubbish, there are about 250,000 articles (from memory). At the other end of the scale there were these huge massive articles written by SPAs. I wonder what that looks like now. We did this in 2012, from memory.
οὐκ ἀγαθὸν πολυκοιρανίη: εἷς κοίρανος ἔστω
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
Re: Size of the English Wikipedia database in GB
I am curious to know the mean and median size of an article in the English Wikipedia as of 2014 in KB.
-
- Posts: 10891
- Joined: Wed Mar 14, 2012 11:32 pm
- Location: hell
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: I am curious to know the mean and median size of an article in the English Wikipedia as of 2014 in KB.
The average is around 8-9 KB. It has slowly gotten longer over the years, although it seems to have topped out thanks to the heavy use of bots to generate crap stub articles. We only put significant effort into studying articles over 15k bytes, in order to exclude stubs and crap.
http://wikipediocracy.com/wiki/index.ph ... _Wikipedia
-
- Muted
- Posts: 6609
- Joined: Sun Mar 03, 2013 2:36 am
- Wikipedia User: Kumioko; Reguyla
- Nom de plume: Persona non grata
Re: Size of the English Wikipedia database in GB
The database dump I just downloaded a couple weeks ago was something like 10 GB compressed and 45 GB uncompressed for the English Wikipedia with all history. Downloading all history for all projects in all languages is something like 3TB.
-
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Size of the English Wikipedia database in GB
Kumioko wrote: The database dump I just downloaded a couple weeks ago was something like 10 GB compressed and 45 GB uncompressed for the English Wikipedia with all history. Downloading all history for all projects in all languages is something like 3TB.
Keep it. Old dumps are very useful for investigating whether things have been oversighted.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
-
- Muted
- Posts: 6609
- Joined: Sun Mar 03, 2013 2:36 am
- Wikipedia User: Kumioko; Reguyla
- Nom de plume: Persona non grata
Re: Size of the English Wikipedia database in GB
I should clarify that when I say all projects, I am not counting Commons; it is so big it's just nasty.
-
- Retired
- Posts: 4130
- Joined: Thu Nov 01, 2012 1:40 pm
- Wikipedia User: Scott
- Location: London
Re: Size of the English Wikipedia database in GB
Poetlister wrote: Old dumps are very useful for investigating whether things have been oversighted.
You'll get a lot quicker result just asking an admin. I'm always happy to answer any query as to whether an article's history contains hidden revisions. I can't see oversighted material, and wouldn't be able to share it even if I could, but there's no reason why I can't report the state of an article's history. When it comes to RevDel'ed material, I can also explain why without necessarily supplying the content itself, although sometimes it's possible.
My question, to this esteemed Wiki community, is this: Do you think that a Wiki could successfully generate a useful encyclopedia? -- JimboWales
Yes, but in the end it wouldn't be an encyclopedia. It would be a wiki. -- WardCunningham (Jan 2001)
-
- Muted
- Posts: 6609
- Joined: Sun Mar 03, 2013 2:36 am
- Wikipedia User: Kumioko; Reguyla
- Nom de plume: Persona non grata
Re: Size of the English Wikipedia database in GB
Poetlister wrote: Old dumps are very useful for investigating whether things have been oversighted.
Hex wrote: You'll get a lot quicker result just asking an admin. I'm always happy to answer any query as to whether an article's history contains hidden revisions. I can't see oversighted material, and wouldn't be able to share it even if I could, but there's no reason why I can't report the state of an article's history. When it comes to RevDel'ed material, I can also explain why without necessarily supplying the content itself, although sometimes it's possible.
No offense to you, Hex, but I have gotten to the point where I don't want to ask an admin for anything unless I absolutely have to. Besides, most of the revdeled info is available in the data dumps. I can just look it up for myself without having to ask an admin for it and be told I can't be trusted.
-
- Retired
- Posts: 4130
- Joined: Thu Nov 01, 2012 1:40 pm
- Wikipedia User: Scott
- Location: London
Re: Size of the English Wikipedia database in GB
Kumioko wrote: No offense to you, Hex, but I have gotten to the point where I don't want to ask an admin for anything unless I absolutely have to. Besides, most of the revdeled info is available in the data dumps. I can just look it up for myself without having to ask an admin for it and be told I can't be trusted.
No offense taken. Incidentally, RevDel'ed/oversighted revisions aren't included in dumps now, but I couldn't say when that started happening, or whether prior dumps have been reprocessed to remove subsequently suppressed material, which you'd think the nature of the action would necessitate.
Oh, also:
Kumioko wrote: The database dump I just downloaded a couple weeks ago was something like 10 GB compressed and 45 GB uncompressed for the English Wikipedia with all history.
pages-articles.xml is the one you downloaded, which is latest revisions only. For full history you need the pages-meta-history.xml dumps, which come to about 565 GB, compressed. Dump format info here.
My question, to this esteemed Wiki community, is this: Do you think that a Wiki could successfully generate a useful encyclopedia? -- JimboWales
Yes, but in the end it wouldn't be an encyclopedia. It would be a wiki. -- WardCunningham (Jan 2001)
-
- Muted
- Posts: 6609
- Joined: Sun Mar 03, 2013 2:36 am
- Wikipedia User: Kumioko; Reguyla
- Nom de plume: Persona non grata
Re: Size of the English Wikipedia database in GB
You're right, it was only the most recent that I downloaded. I think that 565 GB is the full backup of all wikis though, not just English.
-
- Sonny, I've got a whole theme park full of red delights for you.
- Posts: 31850
- Joined: Thu Mar 29, 2012 8:16 pm
- Wikipedia User: Vigilant
- Wikipedia Review Member: Vigilant
Re: Size of the English Wikipedia database in GB
Kumioko wrote: You're right, it was only the most recent that I downloaded. I think that 565 GB is the full backup of all wikis though, not just English.
You should upload that to Commons.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.
-
- Muted
- Posts: 6609
- Joined: Sun Mar 03, 2013 2:36 am
- Wikipedia User: Kumioko; Reguyla
- Nom de plume: Persona non grata
Re: Size of the English Wikipedia database in GB
They have a 100MB limit. :-(
-
- Sonny, I've got a whole theme park full of red delights for you.
- Posts: 31850
- Joined: Thu Mar 29, 2012 8:16 pm
- Wikipedia User: Vigilant
- Wikipedia Review Member: Vigilant
Re: Size of the English Wikipedia database in GB
Kumioko wrote: They have a 100MB limit. :-(
Sounds like 5650 100MB slices then.
You should do it every day to make sure the historical record is clear.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.
-
- Muted
- Posts: 6609
- Joined: Sun Mar 03, 2013 2:36 am
- Wikipedia User: Kumioko; Reguyla
- Nom de plume: Persona non grata
Re: Size of the English Wikipedia database in GB
Kumioko wrote: They have a 100MB limit. :-(
Vigilant wrote: Sounds like 5650 100MB slices then. You should do it every day to make sure the historical record is clear.
Lol, I would rather do it to Wikipedia since I am already banned there. It would surely get their attention.
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
Re: Size of the English Wikipedia database in GB
The size of the English Wikipedia database on October 15 has been revealed: 11.5GB
Here is the chart:
Here is the source: https://meta.wikimedia.org/wiki/Data_du ... nts#enwiki
-
- Muted
- Posts: 6609
- Joined: Sun Mar 03, 2013 2:36 am
- Wikipedia User: Kumioko; Reguyla
- Nom de plume: Persona non grata
Re: Size of the English Wikipedia database in GB
That seems way too small. That must be articles only, current revisions only and no other namespaces and with no files. Just a guess though.
-
- Sonny, I've got a whole theme park full of red delights for you.
- Posts: 31850
- Joined: Thu Mar 29, 2012 8:16 pm
- Wikipedia User: Vigilant
- Wikipedia Review Member: Vigilant
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: The size of the English Wikipedia database on October 15 has been revealed: 11.5GB. Here is the chart: Here is the source: https://meta.wikimedia.org/wiki/Data_du ... nts#enwiki
I assumed that was the size of the average ARBCOM case.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.
-
- Habitué
- Posts: 4804
- Joined: Thu Mar 15, 2012 7:19 pm
-
- Critic
- Posts: 169
- Joined: Sat Oct 03, 2015 12:33 pm
- Location: Belgium
Re: Size of the English Wikipedia database in GB
Kumioko wrote: That seems way too small. That must be articles only, current revisions only and no other namespaces and with no files. Just a guess though.
It's the size of the .bz2 archive. Probably 45 to 50 GB unzipped.
The total DB (English WP only) with history and all(?) namespaces is supposed to be 10 TB, also available for download: 126 GB in .7z files. I tried one of them, 237 MB that unpacked to 80.5 GB. XML file format, starts with the complete text of every revision of the Anarchy article (2 GB, 3 million lines), followed by Asperger syndrome...
Tweaker in Metropolis
-
- Muted
- Posts: 6609
- Joined: Sun Mar 03, 2013 2:36 am
- Wikipedia User: Kumioko; Reguyla
- Nom de plume: Persona non grata
Re: Size of the English Wikipedia database in GB
Kumioko wrote: That seems way too small. That must be articles only, current revisions only and no other namespaces and with no files. Just a guess though.
Drijfzand wrote: It's the size of the .bz2 archive. Probably 45 to 50 GB unzipped. The total DB (English WP only) with history and all(?) namespaces is supposed to be 10 TB, also available for download: 126 GB in .7z files. I tried one of them, 237 MB that unpacked to 80.5 GB. XML file format, starts with the complete text of every revision of the Anarchy article (2 GB, 3 million lines), followed by Asperger syndrome...
Ah ok, that makes more sense then. Thanks for the clarification.
-
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Size of the English Wikipedia database in GB
Most of the images are presumably on Commons and transcluded into WP articles. Are they included in the file dump?
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
-
- Critic
- Posts: 169
- Joined: Sat Oct 03, 2015 12:33 pm
- Location: Belgium
Re: Size of the English Wikipedia database in GB
Poetlister wrote: Most of the images are presumably on Commons and transcluded into WP articles. Are they included in the file dump?
No, have to be downloaded separately, from mirrors.
Info here
Media (current version only)
Media tarballs per project (except Commons)
Media tarballs per day for Wikimedia Commons
The Wikipedia and User namespaces are not available for download; I don't know why I thought otherwise. Too bad, it would have been fun.
Tweaker in Metropolis
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
Re: Size of the English Wikipedia database in GB
Kumioko wrote: That seems way too small. That must be articles only, current revisions only and no other namespaces and with no files. Just a guess though.
Drijfzand wrote: It's the size of the .bz2 archive. Probably 45 to 50 GB unzipped.
For the current article text of the English Wikipedia, I have estimated the uncompressed size by multiplying the compressed size by 4.
That means the October 2, 2015 compressed dump size of 11.5 GB is approximately 46 GB when uncompressed.
-
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Size of the English Wikipedia database in GB
Poetlister wrote: Most of the images are presumably on Commons and transcluded into WP articles. Are they included in the file dump?
Drijfzand wrote: No, have to be downloaded separately, from mirrors. Info here: Media (current version only); Media tarballs per project (except Commons); Media tarballs per day for Wikimedia Commons. The Wikipedia and User namespaces are not available for download, don't know why I thought otherwise. too bad, would have been fun
That will make a big difference to the size. I suppose they also exclude deleted articles and hidden revisions.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
-
- Blue Meanie
- Posts: 4955
- Joined: Fri Sep 07, 2012 10:33 am
- Wikipedia User: Begoon
- Wikipedia Review Member: Jim
- Location: NSW
Re: Size of the English Wikipedia database in GB
Poetlister wrote: I suppose they also exclude deleted articles and hidden revisions.
One would imagine so. It'd be insanely stupid and incompetent not to.
Oh, wait... We should probably check.
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
Re: Size of the English Wikipedia database in GB
The size of the Wikipedia database is growing, especially with its revision history being filled with bloat.
The Wikipedia database would be much tighter if vandalism edits and reversions were removed from the history.
The average size of a Wikipedia article is growing as well (just based on current article versions only).
-
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: The average size of a Wikipedia article is growing as well (just based on current article versions only).
That may not be true if you include new articles, which will typically be small, in the average. Most articles do tend to grow over time of course, that's the "Wikipedia is constantly improving" fallacy.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
Re: Size of the English Wikipedia database in GB
I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia
The uncompressed size is approximately 2.05 times the compressed size.
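For anyone who wants to compare the two rules of thumb mentioned in this thread (times 4 earlier, times 2.05 here), a minimal Python sketch follows. The function name is made up for illustration, and both ratios are just the estimates quoted by posters, not measured values.

```python
def estimate_uncompressed_gb(compressed_gb, ratio):
    """Rough estimate: uncompressed size = compressed size * assumed ratio."""
    return compressed_gb * ratio


compressed_gb = 11.5  # compressed dump size quoted in this thread

# 4.0 and 2.05 are the two multipliers proposed by posters; neither is verified.
for ratio in (4.0, 2.05):
    estimate = estimate_uncompressed_gb(compressed_gb, ratio)
    print(f"ratio {ratio}: about {estimate:.1f} GB uncompressed")
```

The spread between the two results shows how much the answer depends on what is being counted (raw XML with markup versus readable article text).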
-
- Regular
- Posts: 310
- Joined: Mon Mar 26, 2012 9:43 pm
- Wikipedia User: Collect
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: The average size of a Wikipedia article is growing as well (just based on current article versions only).
Poetlister wrote: That may not be true if you include new articles, which will typically be small, in the average. Most articles do tend to grow over time of course, that's the "Wikipedia is constantly improving" fallacy.
Shorter is often better -- I took a biography which was over 190K in size - whittled it down to a good article at about 15% of its earlier size. Those who think "longer is better" are often in error. On average I would think that almost all articles would be far more useful for readers with about 20 to 25% reductions.
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: The average size of a Wikipedia article is growing as well (just based on current article versions only).
Poetlister wrote: That may not be true if you include new articles, which will typically be small, in the average. Most articles do tend to grow over time of course, that's the "Wikipedia is constantly improving" fallacy.
collect wrote: Shorter is often better -- I took a biography which was over 190K in size - whittled it down to a good article at about 15% of its earlier size. Those who think "longer is better" are often in error. On average I would think that almost all articles would be far more useful for readers with about 20 to 25% reductions.
There is a very good reason why WP:SIZE exists.
-
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Size of the English Wikipedia database in GB
collect wrote: Shorter is often better -- I took a biography which was over 190K in size - whittled it down to a good article at about 15% of its earlier size. Those who think "longer is better" are often in error. On average I would think that almost all articles would be far more useful for readers with about 20 to 25% reductions.
+1. People take "the sum of all human knowledge" to mean that every possible detail must be thrown in; they think that you must have multiple references for key points; they think that waffling away will make things clearer.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
-
- Banned
- Posts: 345
- Joined: Tue Sep 03, 2013 8:34 pm
- Location: United Kingdom
Re: Size of the English Wikipedia database in GB
Poetlister wrote: waffling away will make things clearer.
No rational person would ever think that. But Wikipedians aren't always too rational, of course.
(All proceeds donated to Save the Content Writers.)
-
- Sonny, I've got a whole theme park full of red delights for you.
- Posts: 31850
- Joined: Thu Mar 29, 2012 8:16 pm
- Wikipedia User: Vigilant
- Wikipedia Review Member: Vigilant
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia The uncompressed size is approximately 2.05 times the compressed size.
That seems like a very poor compression for what is primarily text.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.
-
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Size of the English Wikipedia database in GB
Poetlister wrote: waffling away will make things clearer.
Peryglus wrote: No rational person would ever think that. But Wikipedians aren't always too rational, of course.
By George, he's got it!
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
-
- Critic
- Posts: 169
- Joined: Sat Oct 03, 2015 12:33 pm
- Location: Belgium
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia The uncompressed size is approximately 2.05 times the compressed size.
Vigilant wrote: That seems like a very poor compression for what is primarily text.
That page says: "As of July 2015, there were approximately 23 billion characters". Not sure if they include references, probably not categories and certainly not XML tags, templates etc. The zip file will contain those and more (enwiki-201510702-pages-articles.xml.bz2; task description: Recombine articles, templates, media/file descriptions, and primary meta-pages). I tried to extract a different version (enwiki-20150702-pages-articles-multistream.xml.bz2), which is slightly larger, got an error and an unusable file of 38 GB. No way of knowing what the size is supposed to be, the bzip2 format uses only 4 bytes to store the file size (showing incorrect size for files >4GB).
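One way to find out the real uncompressed size without trusting whatever an archive tool displays, and without needing disk space for the extracted file, is to stream the archive and count the bytes. As far as I know, a .bz2 stream does not carry a reliable uncompressed-size field, so counting is the safe route. Below is a minimal Python sketch using the standard `bz2` module; the function name and the file name in the usage note are placeholders, not anything from the dump documentation.

```python
import bz2


def uncompressed_size(path, chunk_size=1 << 20):
    """Decompress a .bz2 file chunk by chunk and return the total byte count.

    Nothing is written to disk; only one chunk is held in memory at a time.
    """
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Usage would look something like `uncompressed_size("some-dump.xml.bz2") / 1e9` for a figure in GB. Expect it to take a long time on a multi-gigabyte dump, since the whole file still has to be decompressed once.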
Tweaker in Metropolis
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia The uncompressed size is approximately 2.05 times the compressed size.
Vigilant wrote: That seems like a very poor compression for what is primarily text.
Drijfzand wrote: That page says: "As of July 2015, there were approximately 23 billion characters". Not sure if they include references, probably not categories and certainly not XML tags, templates etc. The zip file will contain those and more (enwiki-201510702-pages-articles.xml.bz2; task description: Recombine articles, templates, media/file descriptions, and primary meta-pages). I tried to extract a different version (enwiki-20150702-pages-articles-multistream.xml.bz2), which is slightly larger, got an error and an unusable file of 38 GB. No way of knowing what the size is supposed to be, the bzip2 format uses only 4 bytes to store the file size (showing incorrect size for files >4GB).
If that is the case, we will never know the true size of the Wikipedia database useful for size comparisons (as in being uncompressed source text, containing only the current revision of articles excluding the following: those without a wikilink, redirects, and disambig pages).
-
- Critic
- Posts: 169
- Joined: Sat Oct 03, 2015 12:33 pm
- Location: Belgium
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: If that is the case, we will never know the true size of the Wikipedia database useful for size comparisons (as in being uncompressed source text, containing only the current revision of articles excluding the following: those without a wikilink, redirects, and disambig pages).
I assume that figure of 23 billion was correct. Downloading a version (assuming it unpacks correctly) and running an agent counting the characters inside the text tags would give an answer; what should and shouldn't be counted may be a matter of opinion...
I've almost finished downloading last month's full history version, and ran a Python script on the first two files (80 GB each) counting the number of edits and editors (for example: Antiarrhythmic medication: 7 edits, 6 editors, a redirect page; Amphetamine: 5255 and 1898). I'm not going to do the other files before I've decided what info to retrieve; there are 202 .7z files that take a long time to unpack, and I only have space for about 15 to 20 unpacked files. It will take at least 200 hours to unzip and extract the data, and I don't want to do it more than once.
Date, size and editor for each version of each article would suffice to calculate most of the general statistics, like the total size (last versions only) and the number of articles for any date, the number of edits, editors...
Do you (or anyone else) have any suggestions for other useful data? I'd like to have some measure of edit "importance": adding a category is less work than adding a paragraph, but a full comparison of versions would take too much time.
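For anyone wanting to try the same kind of count, here is a minimal Python sketch of a streaming pass over dump XML. It is not the script described above, just one possible approach under stated assumptions: the element names (`page`, `title`, `revision`, `contributor`, `username`, `ip`, `text`) are what I understand the MediaWiki export format to use, revisions are assumed to appear oldest first, and the sample data is invented.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def page_stats(xml_stream):
    """Yield (title, edits, editors, chars_in_last_revision) for each <page>."""
    title, edits, editors, last_len = None, 0, set(), 0
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            edits += 1
            for child in elem.iter():
                child_name = local(child.tag)
                if child_name in ("username", "ip") and child.text:
                    editors.add(child.text)  # named or anonymous editor
                elif child_name == "text":
                    last_len = len(child.text or "")
            elem.clear()  # free the revision text once it has been counted
        elif name == "page":
            yield title, edits, len(editors), last_len
            title, edits, editors, last_len = None, 0, set(), 0
            elem.clear()


# Invented sample in the assumed export layout, for a quick check.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><contributor><username>Alice</username></contributor><text>first</text></revision>
    <revision><contributor><ip>192.0.2.1</ip></contributor><text>first draft</text></revision>
    <revision><contributor><username>Alice</username></contributor><text>second draft!</text></revision>
  </page>
</mediawiki>"""

for row in page_stats(io.BytesIO(SAMPLE)):
    print(row)
```

Because it only keeps running totals per page, memory use should stay small even on an 80 GB file, though the run time will still be dominated by reading and parsing the XML. Summing the last value over all pages gives a "current versions only" character count, with the caveat already raised about what should and shouldn't be counted.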
Tweaker in Metropolis
-
- Gregarious
- Posts: 730
- Joined: Tue Sep 02, 2014 11:46 pm
- Wikipedia User: formerly Konveyor Belt
Re: Size of the English Wikipedia database in GB
Does this include metadata?
I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Always improving...
-
- Sonny, I've got a whole theme park full of red delights for you.
- Posts: 31850
- Joined: Thu Mar 29, 2012 8:16 pm
- Wikipedia User: Vigilant
- Wikipedia Review Member: Vigilant
Re: Size of the English Wikipedia database in GB
Johnny Au wrote: I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia The uncompressed size is approximately 2.05 times the compressed size.
Vigilant wrote: That seems like a very poor compression for what is primarily text.
Drijfzand wrote: That page says: "As of July 2015, there were approximately 23 billion characters". Not sure if they include references, probably not categories and certainly not XML tags, templates etc. The zip file will contain those and more (enwiki-201510702-pages-articles.xml.bz2; task description: Recombine articles, templates, media/file descriptions, and primary meta-pages). I tried to extract a different version (enwiki-20150702-pages-articles-multistream.xml.bz2), which is slightly larger, got an error and an unusable file of 38 GB. No way of knowing what the size is supposed to be, the bzip2 format uses only 4 bytes to store the file size (showing incorrect size for files >4GB).
How quintessentially wikipedian.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.
-
- Critic
- Posts: 169
- Joined: Sat Oct 03, 2015 12:33 pm
- Location: Belgium
Re: Size of the English Wikipedia database in GB
Konveyor Belt wrote: Does this include metadata? I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
23.8 billion was the number of characters in the English WP (don't know if that includes spaces?). It's text only, no metadata. Word count 2.95 billion, or an average of 590 words per article. Would fit on a 32 GB microSD card.
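The figures above can be cross-checked with a few lines of Python. This is only a back-of-the-envelope sketch: the three input numbers are the ones quoted in this post, and the one-byte-per-character assumption is mine (non-ASCII characters take more than one byte in UTF-8, so the real figure would be somewhat higher).

```python
total_chars = 23.8e9       # characters, as quoted in the post
total_words = 2.95e9       # words, as quoted in the post
words_per_article = 590    # average, as quoted in the post

# Number of articles implied by the word count and the per-article average.
implied_articles = total_words / words_per_article

# Average characters per word implied by the two totals.
chars_per_word = total_chars / total_words

# Assumption: roughly one byte per character (plain ASCII-like text).
approx_size_gb = total_chars / 1e9
fits_on_32gb_card = approx_size_gb < 32

print(f"implied articles: {implied_articles:,.0f}")
print(f"chars per word:   {chars_per_word:.2f}")
print(f"approx size:      {approx_size_gb:.1f} GB, fits on 32 GB: {fits_on_32gb_card}")
```

Under those assumptions the text comes out comfortably below 32 GB, which is consistent with the microSD claim.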
Tweaker in Metropolis
-
- Habitué
- Posts: 2620
- Joined: Fri Jan 31, 2014 5:05 pm
- Wikipedia User: Johnny Au
- Actual Name: Johnny Au
- Location: Toronto, Ontario, Canada
Re: Size of the English Wikipedia database in GB
Konveyor Belt wrote: Does this include metadata? I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Drijfzand wrote: 23.8 billion was the number of characters in the English WP (don't know if that includes spaces?). It's text only, no metadata. Word count 2.95 billion, or an average of 590 words per article. Would fit on a 32 GB microSD card.
That seems to fit the assumptions made here: WP:SIV
The average would be 595 words per article as well based on the most recent calculations.
-
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Size of the English Wikipedia database in GB
Konveyor Belt wrote: Does this include metadata? I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Unquestionably it would fit onto a pocket-sized hard drive. They can hold far more GB than that these days. Even memory sticks can hold up to 64GB.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
-
- Sonny, I've got a whole theme park full of red delights for you.
- Posts: 31850
- Joined: Thu Mar 29, 2012 8:16 pm
- Wikipedia User: Vigilant
- Wikipedia Review Member: Vigilant
Re: Size of the English Wikipedia database in GB
Konveyor Belt wrote: Does this include metadata? I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Poetlister wrote: Unquestionably it would fit onto a pocket-sized hard drive. They can hold far more GB than that these days. Even memory sticks can hold up to 64GB.
Well, hell.
I guess this 128GB USB stick I have here must be a lie.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.
-
- Genius
- Posts: 25599
- Joined: Wed Jan 02, 2013 8:15 pm
- Nom de plume: Poetlister
- Location: London, living in a similar way
Re: Size of the English Wikipedia database in GB
Konveyor Belt wrote: Does this include metadata? I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Poetlister wrote: Unquestionably it would fit onto a pocket-sized hard drive. They can hold far more GB than that these days. Even memory sticks can hold up to 64GB.
Vigilant wrote: Well, hell. I guess this 128GB USB stick I have here must be a lie.
Things move so fast on the storage front these days, it's hard to keep up. Anyone remember this?
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche
-
- Regular
- Posts: 310
- Joined: Mon Mar 26, 2012 9:43 pm
- Wikipedia User: Collect
Re: Size of the English Wikipedia database in GB
Konveyor Belt wrote: Does this include metadata? I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Poetlister wrote: Unquestionably it would fit onto a pocket-sized hard drive. They can hold far more GB than that these days. Even memory sticks can hold up to 64GB.
Vigilant wrote: Well, hell. I guess this 128GB USB stick I have here must be a lie.
FWIW - Kingston HyperX has a 1 terabyte flash drive