Read follow up
- 28 feb: MSN cheating too?
- 7 mar: Yahoo indexes more pages than Google
- 13 mar: Google adjusts its counts
- 23 mar: 5 billion "the" have disappeared overnight
- 25 mar: A snapshot of the update
In previous articles, I pointed out two strange problems with Google counts (here and here). Pages seem to disappear massively:
- If you type Chirac OR Sarkozy, you get half the number of results you get for Chirac alone, which may have a political explanation... but is a weird approach to Boolean logic.
- If you search for "the" in English pages only, you get 1% of the count you get for all languages together. Does this mean that "the" is 99 times more frequent in languages other than English? Of course not.
Where have the missing pages gone? This is the question I am trying to address in this article. A possible scenario is that the real index used by Google is considerably smaller than the counts officially announced. The detailed experiment reported below yields a precise estimate: the real index would be about 60% of the full database, i.e. ca. 5 billion pages. This scenario is of course entirely hypothetical, but it makes it possible to explain both the discrepancy in the English page counts and the strange behaviour of Google's Boolean operators.
Let me say it right away, to save commenters' time: this does not mean that Google is a bad search engine (I actually have it as my browser's home page). For most users, counts are useless, and what... counts for them is whether they find the right results quickly and accurately. Figures are relevant only for experts, but in this case experts have some reason to wonder.
An experiment
In this new experiment I do not use frequent words such as "the", because frequent words are likely to be processed in a special way by any search engine. They are probably on a special stoplist, and their occurrences are not fully indexed. Instead, I used 50 English words drawn randomly from mid-range frequencies in a 1-million-word corpus of English text (accumulated, alive, ancestor, bushes, etc.). I eliminated words for which I knew of obvious homographs in other languages (such as patio).
The figure below plots the counts given by Google for English pages vs the entire Web (the part known to Google, of course) [see complete results
here -- all figures in this study were obtained on February 6th].
The slope of the regression line indicates that the English results represent 56% of the results for the entire Web for the same words. Of course, I may have missed some collisions of homographs across languages, and some of the words are probably cited in non-English pages as well, but these factors should be marginal and, in any case, different for each word. If almost half of the occurrences of these words were located in non-English pages, there should be a considerable amount of dispersion in the plot. Instead, there is a very strong correlation between the two counts, with a coefficient of determination R² equal to 0.96. This high correlation is statistically impossible, and some systematic factor must explain it. One possibility would be extremely poor behaviour of the language detection algorithm used by Google, but this is very unlikely because we would see evidence of it in almost every other result, which is far from being the case: Google's language detection is fairly robust, if not perfect.
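For readers who want to run the same kind of check on their own counts, here is a minimal sketch of how a regression slope and a coefficient of determination can be computed. It assumes an ordinary least-squares fit with an intercept, which may not be the exact fitting method behind the figures quoted here, and the numbers in it are made up for illustration: they are not the measured counts.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 is the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical counts, for illustration only (NOT the data from the experiment).
all_languages = [1_000_000, 2_500_000, 4_000_000, 7_200_000]
english_only = [560_000, 1_400_000, 2_240_000, 4_032_000]

slope, intercept, r2 = fit_line(all_languages, english_only)
print(f"slope = {slope:.2f}, R^2 = {r2:.3f}")
```

In this toy data every word has exactly the same English share, so the points fall on a straight line. If each word had its own share of non-English occurrences, the points would scatter and R² would drop.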
On the other hand, if we look at Yahoo's results for the same word list, we get a much more expected pattern [see complete results
here]:
The correlation is very high too (higher, indeed), but this is normal because the results are almost identical: English results represent 92% of the whole. This figure is in line with our linguistic knowledge.
Results for French are very similar. I built a French word list on the same principle, and ran it through Google and Yahoo. Google gives a 58% share of results located in French pages, and again a high correlation, slightly lower (R² = 0.86), but still incompatible with a large proportion of results outside the pages categorised as French. Individual word behaviour should produce a much more random pattern [see complete results here].
Yahoo behaves just as it did for English. The proportion of results located in French pages is even higher (97%), which is expected, since English, as an international language, tends to be cited in more documents than French.
A possible scenario
Many experts believe (see for example here) that Google's database is composed of (at least) two parts: one that is a full index, and another that contains URLs and other information for pages that Google knows about, but whose content has not been indexed (only the words in their URLs are possibly indexed). I have no means of knowing whether this hypothesis is correct (although Google admitted it publicly until 2002), but it could explain the strange behaviour reported above.
Let's call the two hypothetical parts A and B respectively; together they make up the whole database D (D = A + B).
We can then build a possible scenario. When we query Google with a word X in any language, it looks it up in its index, i.e. the part A, and
extrapolates the count to match the size of the entire database D. However, when we restrict the search to a given language, it does not extrapolate, because pages in part B are not indexed and not categorised in any language. Only the results of A are reported. Of course, it would have been possible to extrapolate the
language proportions from A to the entire database D, and extrapolate anyway, but the Google engineers didn't think of it, or didn't think it was important.
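The scenario can be written down as a toy model. This is only a sketch of the hypothesis, not a description of what Google's code actually does; the function name, its parameters and the numbers are all invented for illustration.

```python
def reported_count(hits_in_index, index_share, language_share=None):
    """Toy model of the hypothesised count reporting.

    hits_in_index  -- pages in part A (the real index) that contain the word
    index_share    -- assumed ratio A / D
    language_share -- fraction of those hits in the requested language,
                      or None for an "any language" query
    """
    if language_share is None:
        # Unrestricted query: extrapolate from A to the whole database D.
        return hits_in_index / index_share
    # Language-restricted query: no extrapolation, only hits in A are reported.
    return hits_in_index * language_share


# Hypothetical values, for illustration only.
any_language = reported_count(600_000, index_share=0.6)
english_only = reported_count(600_000, index_share=0.6, language_share=0.92)

# Under this model the observed ratio is language_share * index_share.
ratio = english_only / any_language
print(f"English / any language = {ratio:.3f}")
```

Under this model the ratio between the two reported counts does not depend on the word, which would produce the kind of tight regression line observed for Google.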
We can compute a fairly good estimate of parts A and B, using my calculations above. According to Yahoo (if we agree to trust it), 92% of the results for my English word list are located in English pages. If we apply the same proportion to Google, this means that the index, i.e. part A, is 0.56 / 0.92 = 60.9% of the size of D. Interestingly enough, if we do the same computation using the French list, we get an estimate of 0.58 / 0.96 = 60.4%. These figures are so close that it would be surprising if this were pure coincidence.
Under the scenario outlined above, the real size of Google's index is therefore ca. 60% of the entire database, and the numbers reported are inflated by about two thirds (1/0.60 - 1 ≈ 0.67).
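The estimate is simple enough to check by hand, but here is the arithmetic as a short sketch. It uses the 56% slope measured for the English list and the 0.58 / 0.96 figures quoted for French; the function name is mine.

```python
def index_share(google_language_ratio, yahoo_language_share):
    """Estimate A / D, assuming the true language share is the one Yahoo reports."""
    return google_language_ratio / yahoo_language_share


english_estimate = index_share(0.56, 0.92)
french_estimate = index_share(0.58, 0.96)

# Inflation of the reported counts if the real index is 60% of the database.
inflation = 1 / 0.60 - 1

print(f"English list: {english_estimate:.1%}")
print(f"French list:  {french_estimate:.1%}")
print(f"Inflation:    {inflation:.1%}")
```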
This is difficult to match to absolute numbers, because nobody knows the exact size of Google's database. In November 2004, Google announced that it was searching 8,058,044,651 web pages. The number has not changed since then on the main page of the engine, but I showed on January 23 that the index had increased by a factor of 1.13 since the announcement (read here). An estimate on February 6th gives a growth of 1.14. This would correspond to a current database size of ca. 9.2 billion pages, i.e. a real index size (part A) of 5.5 billion. However, some observers have noticed that for a short while before the announcement in November, Google reported 10.8 billion results for a query on "the", which would indicate an even larger database, unless it simply means that at some point in time Google had considered an even larger inflation factor. We will probably never know.
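The 9.2 and 5.5 billion figures follow from the announced page count, the 1.14 growth factor and the hypothetical 60% index share. A small sketch of that arithmetic, with constant names of my own choosing:

```python
ANNOUNCED_PAGES = 8_058_044_651  # figure announced by Google in November 2004
GROWTH = 1.14                    # growth factor estimated on February 6th
INDEX_SHARE = 0.60               # hypothetical A / D ratio from the scenario

database_size = ANNOUNCED_PAGES * GROWTH   # estimated size of D
index_size = database_size * INDEX_SHARE   # estimated size of part A

print(f"Database D: {database_size / 1e9:.1f} billion pages")
print(f"Index A:    {index_size / 1e9:.1f} billion pages")
```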
A new light on Googlean logic
The hypothetical scenario above also nicely explains the Googlean logic problem. Remember that X OR Y returns fewer results than X alone (see details). Even weirder, both X OR X and X (AND) X also return fewer results than X itself. I queried Google for X OR X and X (AND) X for each word X in my English list (with the "any language" setting). The results for both queries are almost identical for all words [see complete results here], and very surprisingly, they are almost identical to the number of results for X in the English pages only (coefficient of determination R² > 0.999!).
It is likely that Google does the Boolean computations (union and intersection of lists) on the basis of the real index, i.e. part A. This would explain why X OR X and X (AND) X yield the same results as the search in English pages when X is an (almost exclusively) English word. The same occurs with French words [see complete results here]. This fact probably went unnoticed until now because if you use words that can appear in many languages (homographs such as patio, or proper names such as Chirac or Bush), the pattern is blurred.
In all likelihood, the Google engineers simply forgot to plug in the extrapolation routine at the end of the Boolean module! Therefore, if you want to know the real index count for any word, simply type it twice:
| Query | Count |
|---|---|
| stuttering | 749,000 |
| stuttering stuttering | 452,000 |
The second line is likely to be the real count...
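As a quick sanity check, the ratio between the two counts in the table can be compared with the ca. 60% index share estimated from the language experiment. A minimal sketch, using only the two numbers from the table:

```python
single_word_count = 749_000   # count reported for: stuttering
doubled_word_count = 452_000  # count reported for: stuttering stuttering

ratio = doubled_word_count / single_word_count
print(f"doubled / single = {ratio:.1%}")
```

The ratio comes out at about 60%, in line with the estimate obtained from the English and French word lists.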