Size of the English Wikipedia database in GB

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Size of the English Wikipedia database in GB

Unread post by Johnny Au » Mon Jun 16, 2014 2:54 am

How large is the English Wikipedia database in GB?

Searching for database statistics only shows the most recent being from 2010. The figure here (WP:SIV (T-H-L)) is very much pulled out of thin air.

I am curious to know.

EricBarbour
 
Posts: 10891
Joined: Wed Mar 14, 2012 11:32 pm
Location: hell

Re: Size of the English Wikipedia database in GB

Unread post by EricBarbour » Mon Jun 16, 2014 3:03 am

Johnny Au wrote:Searching for database statistics only shows the most recent being from 2010.
As you can see on the statistics tables, they stopped updating a whole slew of statistics in January 2010. Including total size of all articles.
No one outside the WMF knows why, and WMF people refuse to discuss it.

As you can see on this chart, other language Wikipedias are still updating their size statistics. The only prominent exception: English.

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Re: Size of the English Wikipedia database in GB

Unread post by Johnny Au » Mon Jun 16, 2014 3:05 am

Note that this topic is different from viewtopic.php?f=6&t=3985

This topic is about why the WMF somehow wants to hide the true size of Wikipedia, while the other thread is about the size of Wikipedia (without discussing WMF's censorship).

greybeard
Habitué
Posts: 1364
Joined: Wed Mar 14, 2012 11:21 pm

Re: Size of the English Wikipedia database in GB

Unread post by greybeard » Mon Jun 16, 2014 6:20 am

An approximation may be determined from these pages: http://dumps.wikimedia.org/enwiki/20140614/

Midsize Jake
Site Admin
Posts: 9964
Joined: Mon Mar 19, 2012 11:10 pm
Wikipedia Review Member: Somey

Re: Size of the English Wikipedia database in GB

Unread post by Midsize Jake » Mon Jun 16, 2014 6:41 am

Johnny Au wrote:This topic is about why the WMF somehow wants to hide the true size of Wikipedia...
I just always assumed they were worried that the North Koreans would buy up the entire worldwide supply of USB sticks that are just a bit larger than the true size of Wikipedia, forcing them to pay more money for USB sticks that are the next-size-up in capacity, which they will then attach to big helium balloons and float over the demilitarized zone in order to bring the sum of human knowledge to all those North Koreans - none of whom actually are allowed to own their own computers.

Peter Damian
Habitué
Posts: 4208
Joined: Thu Mar 15, 2012 8:14 pm
Wikipedia User: Peter Damian
Wikipedia Review Member: Peter Damian
Location: London

Re: Size of the English Wikipedia database in GB

Unread post by Peter Damian » Mon Jun 16, 2014 12:21 pm

EricBarbour wrote:
Johnny Au wrote:Searching for database statistics only shows the most recent being from 2010.
As you can see on the statistics tables, they stopped updating a whole slew of statistics in January 2010. Including total size of all articles.
No one outside the WMF knows why, and WMF people refuse to discuss it.

As you can see on this chart, other language Wikipedias are still updating their size statistics. The only prominent exception: English.
Don't forget the work we did way back on article size. This is different from database size, as the database includes absolutely every stupid edit. I could dig that up (or Eric could). What we found, when we ordered by size in Kb, was a distribution with a massive tail of stubs and short crappy articles. You can easily verify this by pressing the 'random article' button. Once you cut away that rubbish, there are about 250,000 articles (from memory). At the other end of the scale there were these huge massive articles written by SPAs. I wonder what that looks like now. We did this in 2012, from memory.
οὐκ ἀγαθὸν πολυκοιρανίη: εἷς κοίρανος ἔστω

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Re: Size of the English Wikipedia database in GB

Unread post by Johnny Au » Mon Jun 16, 2014 10:30 pm

I am curious to know the mean and median size of an article in the English Wikipedia as of 2014 in KB.

EricBarbour
 
Posts: 10891
Joined: Wed Mar 14, 2012 11:32 pm
Location: hell

Re: Size of the English Wikipedia database in GB

Unread post by EricBarbour » Mon Jun 16, 2014 10:34 pm

Johnny Au wrote:I am curious to know the mean and median size of an article in the English Wikipedia as of 2014 in KB.
The average is around 8-9 KB. It has slowly gotten longer over the years, although it seems to have topped out thanks to the heavy use of bots to generate crap stub articles. We only put significant effort into studying articles over 15k bytes, in order to exclude stubs and crap.

http://wikipediocracy.com/wiki/index.ph ... _Wikipedia

Kumioko
Muted
Posts: 6609
Joined: Sun Mar 03, 2013 2:36 am
Wikipedia User: Kumioko; Reguyla
Nom de plume: Persona non grata

Re: Size of the English Wikipedia database in GB

Unread post by Kumioko » Tue Jun 17, 2014 3:20 am

The database dump I just downloaded a couple weeks ago was something like 10 GB compressed and 45 GB uncompressed for the English Wikipedia with all history. Downloading all history for all projects in all languages is something like 3TB.

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Size of the English Wikipedia database in GB

Unread post by Poetlister » Tue Jun 17, 2014 8:05 pm

Kumioko wrote:The database dump I just downloaded a couple weeks ago was something like 10 GB compressed and 45 GB uncompressed for the English Wikipedia with all history. Downloading all history for all projects in all languages is something like 3TB.
Keep it. Old dumps are very useful for investigating whether things have been oversighted.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche

Kumioko
Muted
Posts: 6609
Joined: Sun Mar 03, 2013 2:36 am
Wikipedia User: Kumioko; Reguyla
Nom de plume: Persona non grata

Re: Size of the English Wikipedia database in GB

Unread post by Kumioko » Tue Jun 17, 2014 8:23 pm

I should clarify that when I say all projects I didn't count Commons; it is so big it's just nasty.

Hex
Retired
Posts: 4130
Joined: Thu Nov 01, 2012 1:40 pm
Wikipedia User: Scott
Location: London

Re: Size of the English Wikipedia database in GB

Unread post by Hex » Wed Jun 18, 2014 9:20 am

Poetlister wrote:Old dumps are very useful for investigating whether things have been oversighted.
You'll get a lot quicker result just asking an admin. I'm always happy to answer any query as to whether an article's history contains hidden revisions. I can't see oversighted material, and wouldn't be able to share it even if I could, but there's no reason why I can't report the state of an article's history. When it comes to RevDel'ed material, I can also explain why without necessarily supplying the content itself, although sometimes it's possible.
My question, to this esteemed Wiki community, is this: Do you think that a Wiki could successfully generate a useful encyclopedia? -- JimboWales
Yes, but in the end it wouldn't be an encyclopedia. It would be a wiki. -- WardCunningham (Jan 2001)

Kumioko
Muted
Posts: 6609
Joined: Sun Mar 03, 2013 2:36 am
Wikipedia User: Kumioko; Reguyla
Nom de plume: Persona non grata

Re: Size of the English Wikipedia database in GB

Unread post by Kumioko » Wed Jun 18, 2014 10:52 am

Hex wrote:
Poetlister wrote:Old dumps are very useful for investigating whether things have been oversighted.
You'll get a lot quicker result just asking an admin. I'm always happy to answer any query as to whether an article's history contains hidden revisions. I can't see oversighted material, and wouldn't be able to share it even if I could, but there's no reason why I can't report the state of an article's history. When it comes to RevDel'ed material, I can also explain why without necessarily supplying the content itself, although sometimes it's possible.
:No offense to you Hex but I have gotten to the point where I don't want to ask an admin for anything unless I absolutely have to. Besides, most of the revdeled info is available in the data dumps. I can just look it up for myself without having to ask an admin for it and be told I can't be trusted.

Hex
Retired
Posts: 4130
Joined: Thu Nov 01, 2012 1:40 pm
Wikipedia User: Scott
Location: London

Re: Size of the English Wikipedia database in GB

Unread post by Hex » Wed Jun 18, 2014 2:09 pm

Kumioko wrote: :No offense to you Hex but I have gotten to the point where I don't want to ask an admin for anything unless I absolutely have to. Besides, most of the revdeled info is available in the data dumps. I can just look it up for myself without having to ask an admin for it and be told I can't be trusted.
No offense taken. Incidentally, RevDel'ed/oversighted revisions aren't included in dumps now, but I couldn't say when that started happening, or whether prior dumps have been reprocessed to remove subsequently suppressed material, which you'd think the nature of the action would necessitate.

Oh, also:
Kumioko wrote: The database dump I just downloaded a couple weeks ago was something like 10 GB compressed and 45 GB uncompressed for the English Wikipedia with all history.
pages-articles.xml is the one you downloaded, which is latest revisions only. For full history you need the pages-meta-history.xml dumps, which come to about 565 GB - compressed. Dump format info here.
My question, to this esteemed Wiki community, is this: Do you think that a Wiki could successfully generate a useful encyclopedia? -- JimboWales
Yes, but in the end it wouldn't be an encyclopedia. It would be a wiki. -- WardCunningham (Jan 2001)

Kumioko
Muted
Posts: 6609
Joined: Sun Mar 03, 2013 2:36 am
Wikipedia User: Kumioko; Reguyla
Nom de plume: Persona non grata

Re: Size of the English Wikipedia database in GB

Unread post by Kumioko » Wed Jun 18, 2014 3:26 pm

You're right, it was only the most recent that I downloaded. I think that 565GB is the full backup of all wikis though, not just English.

Vigilant
Sonny, I've got a whole theme park full of red delights for you.
Posts: 31804
Joined: Thu Mar 29, 2012 8:16 pm
Wikipedia User: Vigilant
Wikipedia Review Member: Vigilant

Re: Size of the English Wikipedia database in GB

Unread post by Vigilant » Wed Jun 18, 2014 3:47 pm

Kumioko wrote:You're right, it was only the most recent that I downloaded. I think that 565GB is the full backup of all wikis though, not just English.
You should upload that to commons.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.

Kumioko
Muted
Posts: 6609
Joined: Sun Mar 03, 2013 2:36 am
Wikipedia User: Kumioko; Reguyla
Nom de plume: Persona non grata

Re: Size of the English Wikipedia database in GB

Unread post by Kumioko » Wed Jun 18, 2014 4:36 pm

They have a 100MB limit.:-(

Vigilant
Sonny, I've got a whole theme park full of red delights for you.
Posts: 31804
Joined: Thu Mar 29, 2012 8:16 pm
Wikipedia User: Vigilant
Wikipedia Review Member: Vigilant

Re: Size of the English Wikipedia database in GB

Unread post by Vigilant » Wed Jun 18, 2014 5:02 pm

Kumioko wrote:They have a 100MB limit.:-(
Sounds like 5650 100MB slices then.

You should do it every day to make sure the historical record is clear.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.

Kumioko
Muted
Posts: 6609
Joined: Sun Mar 03, 2013 2:36 am
Wikipedia User: Kumioko; Reguyla
Nom de plume: Persona non grata

Re: Size of the English Wikipedia database in GB

Unread post by Kumioko » Wed Jun 18, 2014 7:30 pm

Vigilant wrote:
Kumioko wrote:They have a 100MB limit.:-(
Sounds like 5650 100MB slices then.

You should do it every day to make sure the historical record is clear.
Lol, I would rather do it to Wikipedia since I am already banned there. It would surely get their attention.

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Re: Size of the English Wikipedia database in GB

Unread post by Johnny Au » Tue Oct 27, 2015 1:47 am

The size of the English Wikipedia database on October 15 has been revealed: 11.5GB

Here is the chart:

Image

Here is the source: https://meta.wikimedia.org/wiki/Data_du ... nts#enwiki

Kumioko
Muted
Posts: 6609
Joined: Sun Mar 03, 2013 2:36 am
Wikipedia User: Kumioko; Reguyla
Nom de plume: Persona non grata

Re: Size of the English Wikipedia database in GB

Unread post by Kumioko » Tue Oct 27, 2015 2:47 am

That seems way too small. That must be articles only, current revisions only and no other namespaces and with no files. Just a guess though.

Vigilant
Sonny, I've got a whole theme park full of red delights for you.
Posts: 31804
Joined: Thu Mar 29, 2012 8:16 pm
Wikipedia User: Vigilant
Wikipedia Review Member: Vigilant

Re: Size of the English Wikipedia database in GB

Unread post by Vigilant » Tue Oct 27, 2015 4:05 am

Johnny Au wrote:The size of the English Wikipedia database on October 15 has been revealed: 11.5GB

Here is the chart:

Image

Here is the source: https://meta.wikimedia.org/wiki/Data_du ... nts#enwiki
I assumed that was the size of the average ARBCOM case.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.

tarantino
Habitué
Posts: 4797
Joined: Thu Mar 15, 2012 7:19 pm

Re: Size of the English Wikipedia database in GB

Unread post by tarantino » Tue Oct 27, 2015 4:51 am


Drijfzand
Critic
Posts: 169
Joined: Sat Oct 03, 2015 12:33 pm
Location: Belgium

Re: Size of the English Wikipedia database in GB

Unread post by Drijfzand » Tue Oct 27, 2015 5:49 am

Kumioko wrote:That seems way too small. That must be articles only, current revisions only and no other namespaces and with no files. Just a guess though.
It's the size of the .bz2 archive. Probably 45 to 50 GB unzipped.

The total DB (English WP only) with history and all(?) namespaces is supposed to be 10 TB, also available for download: 126 GB in .7z files. I tried one of them, 237 MB that unpacked to 80.5 GB. XML file format, starts with the complete text of every revision of the Anarchy article (2 GB, 3 million lines), followed by Asperger syndrome...
Tweaker in Metropolis

Kumioko
Muted
Posts: 6609
Joined: Sun Mar 03, 2013 2:36 am
Wikipedia User: Kumioko; Reguyla
Nom de plume: Persona non grata

Re: Size of the English Wikipedia database in GB

Unread post by Kumioko » Tue Oct 27, 2015 11:13 am

Drijfzand wrote:
Kumioko wrote:That seems way too small. That must be articles only, current revisions only and no other namespaces and with no files. Just a guess though.
It's the size of the .bz2 archive. Probably 45 to 50 GB unzipped.

The total DB (English WP only) with history and all(?) namespaces is supposed to be 10 TB, also available for download: 126 GB in .7z files. I tried one of them, 237 MB that unpacked to 80.5 GB. XML file format, starts with the complete text of every revision of the Anarchy article (2 GB, 3 million lines), followed by Asperger syndrome...
Ah ok, that makes more sense then. Thanks for the clarification.

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Size of the English Wikipedia database in GB

Unread post by Poetlister » Tue Oct 27, 2015 7:20 pm

Most of the images are presumably on Commons and transcluded into WP articles. Are they included in the file dump?
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche

Drijfzand
Critic
Posts: 169
Joined: Sat Oct 03, 2015 12:33 pm
Location: Belgium

Re: Size of the English Wikipedia database in GB

Unread post by Drijfzand » Wed Oct 28, 2015 3:48 am

Poetlister wrote:Most of the images are presumably on Commons and transcluded into WP articles. Are they included in the file dump?
No, have to be downloaded separately, from mirrors.
Info here
Media (current version only)
Media tarballs per project (except Commons)
Media tarballs per day for Wikimedia Commons

The Wikipedia and User namespaces are not available for download; I don't know why I thought otherwise. Too bad, it would have been fun.
Tweaker in Metropolis

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Re: Size of the English Wikipedia database in GB

Unread post by Johnny Au » Wed Oct 28, 2015 4:47 am

Drijfzand wrote:
Kumioko wrote:That seems way too small. That must be articles only, current revisions only and no other namespaces and with no files. Just a guess though.
It's the size of the .bz2 archive. Probably 45 to 50 GB unzipped.
For the current article text of the English Wikipedia, I have estimated the uncompressed size by multiplying the compressed size by 4.

It means that the October 2, 2015 compressed dump size of 11.5GB is approximately 46GB when uncompressed.
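
The estimate is plain arithmetic. A minimal Python sketch, where the ratio of 4 is the rule of thumb assumed in this post rather than a measured value, and the function name is only illustrative:

```python
def estimate_uncompressed_gb(compressed_gb: float, ratio: float = 4.0) -> float:
    """Estimate the uncompressed size from the compressed size.

    `ratio` is an assumed compression ratio, not something read from the dump.
    """
    return compressed_gb * ratio


# 11.5 GB compressed with an assumed ratio of 4
print(estimate_uncompressed_gb(11.5))  # 46.0
```

Passing a different `ratio` shows how sensitive the result is to that single assumption.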

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Size of the English Wikipedia database in GB

Unread post by Poetlister » Wed Oct 28, 2015 12:38 pm

Drijfzand wrote:
Poetlister wrote:Most of the images are presumably on Commons and transcluded into WP articles. Are they included in the file dump?
No, have to be downloaded separately, from mirrors.
Info here
Media (current version only)
Media tarballs per project (except Commons)
Media tarballs per day for Wikimedia Commons

The Wikipedia and User namespaces are not available for download; I don't know why I thought otherwise. Too bad, it would have been fun.
That will make a big difference to the size. I suppose they also exclude deleted articles and hidden revisions.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche

Jim
Blue Meanie
Posts: 4955
Joined: Fri Sep 07, 2012 10:33 am
Wikipedia User: Begoon
Wikipedia Review Member: Jim
Location: NSW

Re: Size of the English Wikipedia database in GB

Unread post by Jim » Wed Oct 28, 2015 1:02 pm

Poetlister wrote:I suppose they also exclude deleted articles and hidden revisions.
One would imagine so. It'd be insanely stupid and incompetent not to.
Oh, wait... We should probably check. :XD

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Re: Size of the English Wikipedia database in GB

Unread post by Johnny Au » Wed Oct 28, 2015 8:03 pm

The size of the Wikipedia database is growing, especially with its revision history being filled with bloat.

The Wikipedia database would be much tighter if vandalism edits and reversions were removed from the history.

The average size of a Wikipedia article is growing as well (just based on current article versions only).

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Size of the English Wikipedia database in GB

Unread post by Poetlister » Wed Oct 28, 2015 8:08 pm

Johnny Au wrote:The average size of a Wikipedia article is growing as well (just based on current article versions only).
That may not be true if you include new articles, which will typically be small, in the average. Most articles do tend to grow over time of course, that's the "Wikipedia is constantly improving" fallacy.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Re: Size of the English Wikipedia database in GB

Unread post by Johnny Au » Sun Nov 01, 2015 11:08 pm

I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia

The uncompressed size is approximately 2.05 times the compressed size.

collect
Regular
Posts: 310
Joined: Mon Mar 26, 2012 9:43 pm
Wikipedia User: Collect

Re: Size of the English Wikipedia database in GB

Unread post by collect » Sun Nov 01, 2015 11:58 pm

Poetlister wrote:
Johnny Au wrote:The average size of a Wikipedia article is growing as well (just based on current article versions only).
That may not be true if you include new articles, which will typically be small, in the average. Most articles do tend to grow over time of course, that's the "Wikipedia is constantly improving" fallacy.

Shorter is often better -- I took a biography which was over 190K in size - whittled it down to a good article at about 15% of its earlier size. Those who think "longer is better" are often in error. On average I would think that almost all articles would be far more useful for readers with about 20 to 25% reductions.

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Re: Size of the English Wikipedia database in GB

Unread post by Johnny Au » Mon Nov 02, 2015 1:18 am

collect wrote:
Poetlister wrote:
Johnny Au wrote:The average size of a Wikipedia article is growing as well (just based on current article versions only).
That may not be true if you include new articles, which will typically be small, in the average. Most articles do tend to grow over time of course, that's the "Wikipedia is constantly improving" fallacy.

Shorter is often better -- I took a biography which was over 190K in size - whittled it down to a good article at about 15% of its earlier size. Those who think "longer is better" are often in error. On average I would think that almost all articles would be far more useful for readers with about 20 to 25% reductions.
There is a very good reason why WP:SIZE (T-H-L) exists.

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Size of the English Wikipedia database in GB

Unread post by Poetlister » Mon Nov 02, 2015 12:44 pm

collect wrote:Shorter is often better -- I took a biography which was over 190K in size - whittled it down to a good article at about 15% of its earlier size. Those who think "longer is better" are often in error. On average I would think that almost all articles would be far more useful for readers with about 20 to 25% reductions.
+1. People take "the sum of all human knowledge" to mean that every possible detail must be thrown in; they think that you must have multiple references for key points; they think that waffling away will make things clearer.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche

Peryglus
Banned
Posts: 345
Joined: Tue Sep 03, 2013 8:34 pm
Location: United Kingdom

Re: Size of the English Wikipedia database in GB

Unread post by Peryglus » Mon Nov 02, 2015 5:38 pm

Poetlister wrote:waffling away will make things clearer.
No rational person would ever think that. But Wikipedians aren't always too rational, of course.
(All proceeds donated to Save the Content Writers.)

Vigilant
Sonny, I've got a whole theme park full of red delights for you.
Posts: 31804
Joined: Thu Mar 29, 2012 8:16 pm
Wikipedia User: Vigilant
Wikipedia Review Member: Vigilant

Re: Size of the English Wikipedia database in GB

Unread post by Vigilant » Mon Nov 02, 2015 6:17 pm

Johnny Au wrote:I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia

The uncompressed size is approximately 2.05 times the compressed size.
That seems like a very poor compression for what is primarily text.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Size of the English Wikipedia database in GB

Unread post by Poetlister » Mon Nov 02, 2015 9:39 pm

Peryglus wrote:
Poetlister wrote:waffling away will make things clearer.
No rational person would ever think that. But Wikipedians aren't always too rational, of course.
By George, he's got it!
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche

Drijfzand
Critic
Posts: 169
Joined: Sat Oct 03, 2015 12:33 pm
Location: Belgium

Re: Size of the English Wikipedia database in GB

Unread post by Drijfzand » Mon Nov 02, 2015 11:08 pm

Vigilant wrote:
Johnny Au wrote:I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia

The uncompressed size is approximately 2.05 times the compressed size.
That seems like a very poor compression for what is primarily text.
That page says: "As of July 2015, there were approximately 23 billion characters". Not sure if they include references, probably not categories and certainly not XML tags, templates etc. The zip file will contain those and more (enwiki-201510702-pages-articles.xml.bz2; task description: Recombine articles, templates, media/file descriptions, and primary meta-pages). I tried to extract a different version (enwiki-20150702-pages-articles-multistream.xml.bz2), which is slightly larger, got an error and an unusable file of 38 GB. No way of knowing in advance what the size is supposed to be: the bzip2 format does not record the uncompressed size at all (the 4-byte size field that shows an incorrect size for files >4GB belongs to gzip).
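
One way to find the real uncompressed size without writing the output to disk is to stream the archive through a decompressor and count the bytes. A minimal Python sketch, assuming the standard-library `bz2` module (which also reads multistream files); the function name and the file path in the comment are only illustrative:

```python
import bz2


def uncompressed_size(path: str, chunk_size: int = 1 << 20) -> int:
    """Return how many bytes the .bz2 file at `path` decompresses to.

    The data is read in chunks and discarded, so memory use stays small
    even when the decompressed content is tens of gigabytes.
    """
    total = 0
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


# Example call (hypothetical local path):
# print(uncompressed_size("enwiki-pages-articles.xml.bz2"))
```

This still has to decompress the whole file once, so on a full dump it will take a while, but it needs no free disk space for the output.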
Tweaker in Metropolis

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Re: Size of the English Wikipedia database in GB

Unread post by Johnny Au » Mon Nov 02, 2015 11:25 pm

Drijfzand wrote:
Vigilant wrote:
Johnny Au wrote:I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia

The uncompressed size is approximately 2.05 times the compressed size.
That seems like a very poor compression for what is primarily text.
That page says: "As of July 2015, there were approximately 23 billion characters". Not sure if they include references, probably not categories and certainly not XML tags, templates etc. The zip file will contain those and more (enwiki-201510702-pages-articles.xml.bz2; task description: Recombine articles, templates, media/file descriptions, and primary meta-pages). I tried to extract a different version (enwiki-20150702-pages-articles-multistream.xml.bz2), which is slightly larger, got an error and an unusable file of 38 GB. No way of knowing in advance what the size is supposed to be: the bzip2 format does not record the uncompressed size at all (the 4-byte size field that shows an incorrect size for files >4GB belongs to gzip).
If that is the case, we will never know the true size of the Wikipedia database useful for size comparisons (as in being uncompressed source text, containing only the current revision of articles excluding the following: those without a wikilink, redirects, and disambig pages).

Drijfzand
Critic
Posts: 169
Joined: Sat Oct 03, 2015 12:33 pm
Location: Belgium

Re: Size of the English Wikipedia database in GB

Unread post by Drijfzand » Tue Nov 03, 2015 2:29 am

Johnny Au wrote:If that is the case, we will never know the true size of the Wikipedia database useful for size comparisons (as in being uncompressed source text, containing only the current revision of articles excluding the following: those without a wikilink, redirects, and disambig pages).
I assume that figure of 23 billion was correct. Downloading a version (assuming it unpacks correctly) and running an agent counting the characters inside the text tags would give an answer; what should and shouldn't be counted may be a matter of opinion...
I've almost downloaded last month's full history version, and ran a Python script on the first two files (80 GB each) counting the number of edits and editors (for example: Antiarrhythmic medication: 7 edits, 6 editors, a redirect page; Amphetamine: 5255 and 1898). I'm not going to do the other files before I've decided what info to retrieve; there are 202 .7z files that take a long time to unpack, and I only have space for about 15 to 20 unpacked files. It will take at least 200 hours to unzip and extract the data, so I don't want to do it more than once.
Date, size and editor for each version of each article would suffice to calculate most of the general statistics, like the total size (last versions only) and the number of articles for any date, the number of edits, editors...
Do you (or anyone else) have any suggestions for other useful data? I'd like to have some measure of edit "importance": adding a category is less work than adding a paragraph, but a full comparison of versions would take too much time.
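
For anyone who wants to try the same kind of count, here is a minimal sketch of the idea, not the script actually used for the numbers above. It assumes the unpacked file follows the MediaWiki XML export layout (`page`, `title`, `revision`, `contributor` with `username` or `ip`), and the function names are made up for illustration:

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def count_edits_and_editors(xml_stream) -> dict:
    """Map each page title to (number of revisions, number of distinct editors).

    The file is parsed as a stream and finished elements are cleared,
    so a multi-gigabyte dump does not have to fit in memory.
    """
    stats = {}
    title, edits, editors = None, 0, set()
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            edits += 1
            # A contributor is recorded as <username> or, for anonymous edits, <ip>.
            for child in elem.iter():
                if local(child.tag) in ("username", "ip") and child.text:
                    editors.add(child.text)
            elem.clear()  # drop the revision text once it has been counted
        elif name == "page":
            stats[title] = (edits, len(editors))
            title, edits, editors = None, 0, set()
            elem.clear()
    return stats


# Example call on an unpacked history file (hypothetical path):
# with open("enwiki-pages-meta-history1.xml", "rb") as f:
#     print(count_edits_and_editors(f))
```

Recording the timestamp and text length of each revision in the same loop would give the date, size and editor data mentioned above without a second pass over the files.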
Tweaker in Metropolis

Konveyor Belt
Gregarious
Posts: 722
Joined: Tue Sep 02, 2014 11:46 pm
Wikipedia User: formerly Konveyor Belt

Re: Size of the English Wikipedia database in GB

Unread post by Konveyor Belt » Tue Nov 03, 2015 2:33 am

Does this include metadata?

I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Always improving...

Vigilant
Sonny, I've got a whole theme park full of red delights for you.
Posts: 31804
Joined: Thu Mar 29, 2012 8:16 pm
Wikipedia User: Vigilant
Wikipedia Review Member: Vigilant

Re: Size of the English Wikipedia database in GB

Unread post by Vigilant » Tue Nov 03, 2015 2:40 am

Drijfzand wrote:
Vigilant wrote:
Johnny Au wrote:I have revised the uncompressed size per https://en.wikipedia.org/wiki/Wikipedia ... _Wikipedia

The uncompressed size is approximately 2.05 times the compressed size.
That seems like a very poor compression for what is primarily text.
That page says: "As of July 2015, there were approximately 23 billion characters". Not sure if they include references, probably not categories and certainly not XML tags, templates etc. The zip file will contain those and more (enwiki-201510702-pages-articles.xml.bz2; task description: Recombine articles, templates, media/file descriptions, and primary meta-pages). I tried to extract a different version (enwiki-20150702-pages-articles-multistream.xml.bz2), which is slightly larger, got an error and an unusable file of 38 GB. No way of knowing in advance what the size is supposed to be: the bzip2 format does not record the uncompressed size at all (the 4-byte size field that shows an incorrect size for files >4GB belongs to gzip).
How quintessentially wikipedian.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.

Drijfzand
Critic
Posts: 169
Joined: Sat Oct 03, 2015 12:33 pm
Location: Belgium

Re: Size of the English Wikipedia database in GB

Unread post by Drijfzand » Tue Nov 03, 2015 4:20 am

Konveyor Belt wrote:Does this include metadata?

I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
23.8 billion was the number of characters in the English WP (don't know if that includes spaces?). It's text only, no metadata. Word count 2.95 billion, or an average of 590 words per article. Would fit on a 32 GB microSD card.
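
A quick sanity check of those figures in Python. The one-byte-per-character assumption is mine and is only a lower bound, since non-ASCII characters take more than one byte in UTF-8; the card capacity is taken as 32 decimal gigabytes:

```python
CHARACTERS = 23_800_000_000      # characters quoted for the English WP
WORDS = 2_950_000_000            # quoted word count
WORDS_PER_ARTICLE = 590          # quoted average
CARD_BYTES = 32 * 10**9          # 32 GB card, decimal gigabytes

# Number of articles implied by the word count and the average
articles = WORDS // WORDS_PER_ARTICLE

# Average characters per word implied by the two totals
chars_per_word = CHARACTERS / WORDS

# Assumption: roughly one byte per character (a lower bound for UTF-8)
fits_on_card = CHARACTERS <= CARD_BYTES

print(articles)                  # 5000000
print(round(chars_per_word, 2))  # 8.07
print(fits_on_card)              # True
```

So the quoted numbers imply about five million articles and leave roughly 8 GB of headroom on the card under that assumption.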
Tweaker in Metropolis

Johnny Au
Habitué
Posts: 2620
Joined: Fri Jan 31, 2014 5:05 pm
Wikipedia User: Johnny Au
Actual Name: Johnny Au
Location: Toronto, Ontario, Canada

Re: Size of the English Wikipedia database in GB

Unread post by Johnny Au » Tue Nov 03, 2015 4:33 am

Drijfzand wrote:
Konveyor Belt wrote:Does this include metadata?

I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
23.8 billion was the number of characters in the English WP (don't know if that includes spaces?). It's text only, no metadata. Word count 2.95 billion, or an average of 590 words per article. Would fit on a 32 GB microSD card.
That seems to fit the assumptions made here: WP:SIV (T-H-L)

The average would be 595 words per article as well based on the most recent calculations.

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Size of the English Wikipedia database in GB

Unread post by Poetlister » Tue Nov 03, 2015 12:55 pm

Konveyor Belt wrote:Does this include metadata?

I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Unquestionably it would fit onto a pocket-sized hard drive. They can hold far more GB than that these days. Even memory sticks can hold up to 64GB.
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche

Vigilant
Sonny, I've got a whole theme park full of red delights for you.
Posts: 31804
Joined: Thu Mar 29, 2012 8:16 pm
Wikipedia User: Vigilant
Wikipedia Review Member: Vigilant

Re: Size of the English Wikipedia database in GB

Unread post by Vigilant » Tue Nov 03, 2015 7:00 pm

Poetlister wrote:
Konveyor Belt wrote:Does this include metadata?

I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Unquestionably it would fit onto a pocket-sized hard drive. They can hold far more GB than that these days. Even memory sticks can hold up to 64GB.
Well, hell.
I guess this 128GB USB stick I have here must be a lie.
Hello, John. John, hello. You're the one soul I would come up here to collect myself.

Poetlister
Genius
Posts: 25599
Joined: Wed Jan 02, 2013 8:15 pm
Nom de plume: Poetlister
Location: London, living in a similar way

Re: Size of the English Wikipedia database in GB

Unread post by Poetlister » Wed Nov 04, 2015 3:08 pm

Vigilant wrote:
Poetlister wrote:
Konveyor Belt wrote:Does this include metadata?

I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Unquestionably it would fit onto a pocket-sized hard drive. They can hold far more GB than that these days. Even memory sticks can hold up to 64GB.
Well, hell.
I guess this 128GB USB stick I have here must be a lie.
Things move so fast on the storage front these days, it's hard to keep up. Anyone remember this?
"The higher we soar the smaller we appear to those who cannot fly" - Nietzsche

collect
Regular
Posts: 310
Joined: Mon Mar 26, 2012 9:43 pm
Wikipedia User: Collect

Re: Size of the English Wikipedia database in GB

Unread post by collect » Wed Nov 04, 2015 6:11 pm

Vigilant wrote:
Poetlister wrote:
Konveyor Belt wrote:Does this include metadata?

I'd like to see the full plaintext size of Wikipedia, and if it would be small enough to carry around with you.
Unquestionably it would fit onto a pocket-sized hard drive. They can hold far more GB than that these days. Even memory sticks can hold up to 64GB.
Well, hell.
I guess this 128GB USB stick I have here must be a lie.


FWIW - Kingston HyperX has a 1 terabyte flash drive