Wikipedia:Modelling Wikipedia extended growth

Comparison of the extended growth model with the Gompertz model, the logistic model, and the actual growth

This essay considers issues in projecting the long-term growth of Wikipedia in terms of article count, including growth driven by many types of articles beyond the traditional encyclopedia and major pop-culture topics. The extended-growth model considers factors that will create millions of new articles, far beyond the current 6.9 million articles (live count), to reach perhaps 9 to 10 million articles before deletions offset the creation of new articles.

The growth of Wikipedia, although somewhat reduced, is not slowing as much as was predicted in 2007, nor is it skyrocketing as was predicted in 2005. The extended model predicted that Wikipedia would exceed 3 million articles in mid-August 2009 rather than at year-end; this occurred on 17 August 2009. The model also predicted that the 3.5 millionth article would be added in mid-September 2010, but that milestone was actually reached on 12 December 2010.

Wikipedia size & users
English articles: 6,812,425
Average revisions: 20.09
Articles per day: +6715
Total wiki pages: 60,426,330
Total admins: 860
Total users: 47,251,688
UTC time: 03:22 on 2024-Apr-15

Wikipedia extended growth model

Approximation of Wikipedia long-term growth, projecting a slow decline in the numerous types of new, follow-on articles being added each year.

An alternative possibility for the growth of Wikipedia is a more protracted, long-term decline in new articles: neither the original exponential burst that doubled each year, nor a balanced bell curve that peaked in late 2006. Instead, an extended-growth model should be considered in which the midpoint, or mean size, of 4 or 5 million articles occurs during 2010–2011, and that size then doubles to nearly 9–10 million articles in the long term. The additional millions of articles will be various types of follow-on articles, created after the major articles have mostly become stable.

The psychological motivation for the follow-on articles might be a feeling that Wikipedia needs to answer some basic questions about any notable topic that anyone can think about. That motivation is probably much stronger than refining the existing articles to be a comprehensive treatment of each topic (see below: Psychological motivation).

Graphical model fits overall pattern

The model was developed as a graphical curve to fit the overall pattern of the data. The data does not follow a simple mathematical model, because many batches of new articles are added by bots and short-term groups rather than as "random" additions by the general public. Hence, no simple equation could fit the actual data, which fluctuates wildly in months when bot programs are triggered to load numerous new articles, such as the many protein-sequence articles. There is no simple mathematical "process generator" to simulate new-article growth; a detailed operational model would not be an equation, but rather a logical, procedural computer model.

However, the growth impact of articles from the general public has been much greater than that of the short-term group efforts, so the overall pattern appears to be a roughly linear decline in the growth rate for new articles when averaged over several months, or projected about 3 years ahead, at a time. Perhaps a rough equation would reduce new-article growth by 11% each year, with the understanding that growth slows further each June/July but rises in August, probably tied to school vacations in the Northern Hemisphere. Bear in mind that if a massive bot were triggered to load 700,000 new articles from a "Who's Who in Science", the new-article rate would soar for months and would always appear as an anomaly, an upward bump, in the declining overall curve during the next 20–65 years.
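The rough equation described above can be sketched in a few lines of Python. This is only an illustration, not the essay's actual model: the 11% annual decline comes from the text, while the seasonal multipliers, the 30-day month, the starting rate of 1,000 articles per day, and the way the bot load is spread over seven months are all assumptions chosen for the example.

```python
# Illustrative sketch only: every constant below is an assumption except
# the 11% annual decline, which is the rough figure mentioned in the text.
ANNUAL_DECLINE = 0.11
SEASONAL_FACTOR = {6: 0.95, 7: 0.95, 8: 1.05}  # hypothetical multipliers
DAYS_PER_MONTH = 30  # simplification


def daily_rate(start_rate, start_year, year, month, bot_loads=None):
    """Estimate the average number of new articles per day for one month.

    bot_loads maps (year, month) to the number of articles a bot loads
    in that month; it is added on top of the declining trend.
    """
    trend = start_rate * (1 - ANNUAL_DECLINE) ** (year - start_year)
    rate = trend * SEASONAL_FACTOR.get(month, 1.0)
    if bot_loads:
        rate += bot_loads.get((year, month), 0) / DAYS_PER_MONTH
    return rate


# Hypothetical bot run: 700,000 articles spread over seven months of 2012.
loads = {(2012, month): 100_000 for month in range(1, 8)}
for month in range(1, 13):
    print(2012, month, round(daily_rate(1000, 2010, 2012, month, loads)))
```

In the printed series, the months with the bot load stand far above the underlying trend, which is the kind of upward bump the text describes.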

Continued growth for follow-on articles

The initial base of Wikipedia articles covered the traditional encyclopedia and mainstream pop-culture topics, including historical figures, world events, catalogs of scientific terms, celebrities, entertainment topics, and famous sports figures. After 6 years of expansion, those topics were thought to be saturated, so that the primary growth of Wikipedia would quickly decline and end within 5 years.

However, growth can be expected for several other types of articles:

  • unresolved redlink articles – linked because authors expected notability (someday).
  • spinoffs – sets of sub-articles created when large articles are split.
  • disambiguation pages – whenever 2 or more articles have similar titles, expect a page to separate them.
  • unseen-hand articles – these are the supporting cast & crew, or assistant leaders, as the power behind the throne that made things happen.
  • lost-world articles – these are the long-lost, buried civilizations, failed inventions, secret societies, or forgotten heroes.
  • also-ran articles – these are the contenders, or losing players, just outside the limelight.
  • technical artifacts – such as cars, consumer electronics, electrical parts, scientific instruments, software, and weapons. Thousands (perhaps millions) of new models enter the market each year, and millions from the past were notable (e.g. IBM 1620).
  • chemicals – it is estimated that some 10 million substances like 2,2-dimethylbutane have been described non-trivially in the literature (with some other information besides their mere existence and formula).
  • species – estimates for the number of species range in the millions, all with some nontrivial information published somewhere.
  • stars – there are several star catalogues with millions of stars listed, of which only a fraction are covered on Wikipedia so far.
  • fan-cruft articles – these are detailed or pop-culture topics, such as one-event clothing designs, that get mentioned (briefly) in mainstream news.

Note that even the fan-cruft articles will be notable, because thousands or millions of people might be affected, however briefly, and the topic will be covered by some mainstream media sources.

  • additional articles for new things of established sorts: new books, new films, newly notable performers, newly elected politicians, new major athletes, new scientific discoveries, new major products. Where major prizes are notable, there will be new people in that group every year. This portion can never become saturated, though the growth can become linear.
  • expansion of the English Wikipedia (enWP) into fuller coverage of other cultural areas; for example, coverage is much closer to saturation for UK railway stations than for those in India.

Because of the large array of follow-on articles, there seems to be great potential for creating masses of new articles, beyond the millions of traditional encyclopedia and major pop-culture articles.

Annual growth rate of new articles

The table below shows the increasing article counts for the English Wikipedia:

Date | Article count | Increase during preceding year | % increase during preceding year | Doubling time (in years and days, rounded up) | Average increase per day during preceding year
2002-01-01 | 19,700 | 19,700 | | | 54
2003-01-01 | 96,500 | 76,800 | 390% | 160 days | 210
2004-01-01 | 188,800 | 92,300 | 96% | 377 days | 253
2005-01-01 | 438,500 | 249,700 | 132% | 301 days | 682
2006-01-01 | 895,000 | 456,500 | 104% | 355 days | 1251
2007-01-01 | 1,560,000 | 665,000 | 74% | 342 days | 1822
2008-01-01 | 2,153,000 | 593,000 | 38% | 1 year, 302 days | 1625
2009-01-01 | 2,679,000 | 526,000 | 24% | 2 years, 326 days | 1437
2010-01-01 | 3,144,000 | 465,000 | 17% | 4 years, 29 days | 1274
2011-01-01 | 3,518,000 | 374,000 | 12% | 5 years, 284 days | 1025
2012-01-01 | 3,835,000 | 317,000 | 9% | 7 years, 257 days | 868
2013-01-01 | 4,133,000 | 298,000 | 8% | 8 years, 243 days | 814
2014-01-01 | 4,413,000 | 280,000 | 7% | 9 years, 330 days | 767
2015-01-01 | 4,682,000 | 269,000 | 6% | 11 years, 202 days | 736
2016-01-01 | 5,045,000 | 363,000 | 8% | 8 years, 243 days | 995
2017-01-01 | 5,321,200 | 276,200 | 7% | 9 years, 330 days | 755
2018-01-01 | 5,541,900 | 220,700 | 4.5% | 15 years, 148 days | 605
2019-01-01 | 5,773,600 | 231,700 | 4.2% | 16 years, 310 days | 635
2020-01-01 | 5,989,400 | 215,800 | 3.75% | 20 years, 11 days | 591
2021-01-01 | 6,219,700 | 230,300 | 3.8% | 20 years | 629
2022-01-01 | 6,431,400 | 211,700 | 3.4% | 21 years, 0 days | 580
2023-01-01 | 6,595,468 | 164,068 | 2.6% | | 450
2024-01-01 | 6,764,335 | 168,867 | 2.6% | | 463
2024-04-15 | 6,812,425 | 592,697[a] | | | 494[a]
[a] Calculated live so far, for the partial year only.
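The derived columns of the table can be recomputed from the dates and article counts alone. The sketch below does that for a few rows copied from the table. The function name `growth_rows` is hypothetical, and the doubling time here assumes constant compound growth, ln 2 / ln(1 + r); the table does not state its own method, and its figures seem closer to the simpler ln 2 / r approximation, so the doubling times will not match exactly.

```python
import math
from datetime import date

# A few (date, article count) rows copied from the table above.
COUNTS = [
    (date(2015, 1, 1), 4_682_000),
    (date(2016, 1, 1), 5_045_000),
    (date(2017, 1, 1), 5_321_200),
    (date(2018, 1, 1), 5_541_900),
]


def growth_rows(counts):
    """Return (date, increase, percent, per_day, doubling_years) per row."""
    rows = []
    for (d0, n0), (d1, n1) in zip(counts, counts[1:]):
        increase = n1 - n0
        rate = increase / n0
        per_day = increase / (d1 - d0).days  # handles leap years
        # Assumption: compound growth continuing at this year's rate.
        doubling_years = math.log(2) / math.log(1 + rate)
        rows.append((d1, increase, 100 * rate, per_day, doubling_years))
    return rows


for d, increase, pct, per_day, doubling in growth_rows(COUNTS):
    print(f"{d}  +{increase:,}  {pct:.1f}%  {per_day:.0f}/day  "
          f"doubles in {doubling:.1f} years")
```

The increases and per-day averages agree with the table for these rows. The percentages for the 2017-01-01 and 2018-01-01 rows come out at about 5.5% and 4.1%, rather than the 7% and 4.5% printed in the table, so those entries may deserve rechecking.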

Articles as resolved redlinks

If Wikipedia's growth were nearing an end, then most articles would already have their major redlinks resolved with the intended linked articles. However, many articles still contain 6 or more redlinks. Improbable redlinks are often removed from articles, so the remaining redlinks typically point to notable topics. They include nearby mountain names, wildlife reserves, rivers, bays, towns, key personnel, book/film titles, special varieties, etc. Such topics are easily defended as being notable, so the redlinks are a major influence on creating new notable articles.

Articles as disambiguation pages

A common type of new article is a disambiguation page, which offers a choice of articles related to the same title. Originally, the choice was between items having exactly the same name, such as "John Smith" or "Mary Jones" or "Leonardo". However, variations of a title were added as potential matches, in a manner similar to word-prefix searches. As a result, disambiguation pages began listing organized groups of potential matches for a partial title, carefully grouping people, companies, towns, films (etc.) with a short description of each.

A disambiguation page can be so comprehensive and descriptive that it acts like search-engine results "on steroids": a structured, informative scan that would be a lofty goal for a search engine to attain. Because of the exceptional information distilled by disambiguation pages, they can be valuable additions to Wikipedia, and hence a major source of welcome new pages. In February 2009, Wikipedia had nearly 108,000 disambiguation pages, more than the entire size of Wikipedia back in early 2003. In early 2009, perhaps 1–2% of the daily growth in new articles consisted of disambiguation pages. By 2014 the count of disambiguation pages had grown to over 250,000.

Articles as lost-world topics

The search for knowledge often illuminates the worlds of yesteryear. Archaeologists have excavated for decades at Emperor Qin's Terracotta Army in Xi'an (China), at the fields of Ephesus, the hills of Copan, the ruins at Carchemish, inside many Caribbean shipwrecks, under the volcanic ash at Pompeii, and in ancient temples at Edfu, Abydos or Kom Ombo along the Nile. As new discoveries are pieced together, thousands of ancient topics gain the details to become full articles.

The world of antiques, with furniture and household items, instantly provides many thousands of topics for new articles.

Paleontologists are expanding the fossil record in many areas: as Arctic glaciers melt, fossils are sometimes found on surfaces formerly covered by ice; and even in Africa, where dinosaur remains were once rarely seen, numerous fossils are being discovered.

Many thousands of articles can be expected on lost-world topics.

Articles as unseen-hand issues

Behind, or beneath, the major, popular topics are the "unseen hand" articles. The supporting cast and crew (sometimes a "cast of thousands") eventually become known well enough to fill new articles.

These articles include the people, with their inventions, who sold their novel ideas to Thomas Edison.

Psychological motivation for new articles

Between 2005 and 2010, the English Wikipedia added over 1,000 new articles per day on average. However, the number of articles being refined and polished to meet featured-article status is only a few a day. A ratio of only a few featured articles per thousand new articles suggests that some key psychological factors are involved.

The psychological motivation for creating so many new, follow-on articles might be a feeling that Wikipedia needs to answer some basic questions about almost any notable topic that anyone can imagine. For example, there are over 33,000 English Wikipedia articles about professional footballers (soccer players), and many of those articles are read daily, by someone somewhere. In contrast, for the more traditional field of mathematics, there are perhaps 21,000 total articles. However, new articles are still being added.

By comparison, the process of refining an article to reach featured-article status involves weeks of changes and reviews. In addition, the criteria used to screen articles can become severe: some reviewers even request that the phrasing within an article be made more diverse, by eliminating repetition of ordinary phrases. It is not enough to describe all major aspects of a topic; those articles must also meet certain literary standards. During 2008, over 100 articles lost their featured-article status, as the criteria became perhaps stricter about the quality required for featured level.

As a consequence, the motivation is probably much stronger to create new (brief) articles that provide a general introduction to each topic than to refine or polish existing articles into comprehensive treatments of their topics according to a carefully defined set of high-quality criteria.

Growth as a percentage of prior year

Although the decline in daily growth has been under way for only about 6 years, it is possible that the number of new daily articles falls by about 9% each year, so that each year adds only 91% of the prior year's new-article count. Under that form of model, the total article count would continue to grow even beyond year 2040, before the added articles become offset by daily deleted articles.
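A model of this form is a geometric series, so its consequences can be worked out directly. The sketch below is a minimal illustration under stated assumptions: it starts from the 2008 figures in the growth table (1437 new articles per day, 2,679,000 articles on 2009-01-01), uses 365-day years, and takes a purely hypothetical offset of 50 deleted articles per day, since the essay gives no deletion rate. All function names are invented for the example.

```python
def projected_total(total, daily_rate, factor, years):
    """Total article count after the given number of further years."""
    for _ in range(years):
        daily_rate *= factor          # each year adds `factor` of the last
        total += daily_rate * 365
    return total


def long_run_total(total, daily_rate, factor):
    """Limit of the total as years go to infinity (needs 0 < factor < 1)."""
    if not 0 < factor < 1:
        raise ValueError("factor must be between 0 and 1")
    # Geometric series: factor + factor**2 + ... = factor / (1 - factor)
    return total + 365 * daily_rate * factor / (1 - factor)


def year_when_offset(start_year, daily_rate, factor, daily_offset):
    """First year whose projected daily additions are <= daily_offset."""
    if not 0 < factor < 1:
        raise ValueError("factor must be between 0 and 1")
    year = start_year
    while daily_rate > daily_offset:
        year += 1
        daily_rate *= factor
    return year


# 2008 baseline from the growth table; the offset of 50/day is hypothetical.
print(round(projected_total(2_679_000, 1437, 0.91, 1)))
print(round(long_run_total(2_679_000, 1437, 0.91)))
print(year_when_offset(2008, 1437, 0.91, 50))
```

Under these assumptions the daily additions stay above 50 until 2044, and the series converges to roughly 8 million articles; a different starting year or offset would shift both figures.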

The following table shows the daily new-article count for each year, falling by 9% from 2008 to 2009 and by 17% annually thereafter:

   

2008 – 1437
2009 – 1308
2010 – 1085
2011 – 901
2012 – 748
2013 – 621
2014 – 515

2015 – 428
2016 – 355
2017 – 295
2018 – 244
2019 – 203
2020 – 168
2021 – 140

2022 – 116
2023 – 96
2024 – 80
2025 – 66
2026 – 55
2027 – 46
2028 – 38

2029 – 31
2030 – 26
2031 – 22
2032 – 18
2033 – 15
2034 – 12
2035 – 10

2036 – 9
2037 – 7
2038 – 6
2039 – 5
2040 – 4
2041 – 3
2042 – 3

Beyond 2008, the daily new-article count (declining by 17% each year) is only an approximation; the purpose of the table is to show how article growth could easily continue past year 2040. The actual new-article counts are likely to differ greatly from the table values. In particular, the actual counts could jump much higher if bot programs are someday written to auto-generate stub articles for redlinks, for example by auto-searching for matching source webpages, then auto-generating footnotes and inserting a few key phrases or infobox details (copied from the source webpages) into each stub article.

If the annual decline were to hold at that level, so that each year averaged 83% of the prior year, the projected daily average in year 2035 would be about 10 new articles per day.
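The table values can be reproduced with a short calculation. The sketch below assumes, from inspecting the table, that the 2009 value is 91% of the 2008 value and that every later year is 83% of the year before it; the function name `decline_table` and its parameters are hypothetical. Rounding may differ by one article per day from the table in a few years.

```python
def decline_table(first_year=2008, last_year=2042, first_rate=1437,
                  first_factor=0.91, later_factor=0.83):
    """Map each year to its projected (unrounded) daily new-article count."""
    table = {first_year: float(first_rate)}
    rate = first_rate * first_factor      # one-off 9% step into 2009
    for year in range(first_year + 1, last_year + 1):
        table[year] = rate
        rate *= later_factor              # 17% decline every later year
    return table


projection = decline_table()
for year in (2008, 2009, 2010, 2020, 2035):
    print(year, round(projection[year]))
```

Changing `later_factor` shows how sensitive the long tail is to the assumed rate of decline: a factor closer to 1 keeps the daily count far higher in the 2030s.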

Projections could change radically

Beware that these projections assume that the same types of new articles continue to be created as before. Any drastic change in mass uploads or new-article restrictions could radically alter the rate of new-article creation. For example:

  • If some WikiProject decided to auto-upload new articles, generated as stubs, from a huge database of "Who's who in science", then a massive upsurge would occur for new articles.
  • In contrast, if Wikipedia policies were quickly changed to demand sources, such as requiring 2 independent sources per new stub, then new-article creation could fall to just a few dozen a day.

Because of the widespread impact of mass uploads or new-article restrictions, the actual growth figures could veer widely from the projected levels within only a few weeks.

Also, article count as the sole parameter does not take into account the ongoing work of merging redundant or very small articles into larger, more comprehensive ones, for which a reduction in total article count is a sign of healthy development.

[ This essay is a rapid draft, created with very limited time. ]