
==Search engine limitations – technical notes==
{{shortcut|WP:GOOGLELIMITS}}

Many, probably most, of the publicly available web pages in existence are not indexed. Each search engine captures a different percentage of the total. Nobody can tell exactly what portion is captured.

The estimated size of the [[World Wide Web]] is at least 11.5 billion pages,<ref>{{cite paper |url= http://www.cs.uiowa.edu/~asignori/web-size/ |title=The Indexable Web is more than 11.5 billion pages |first1=Antonio |last1=Gulli |first2=Alessio |last2=Signorini |date=28 August 2005}}</ref> but a much [[Deep web (search indexing)|deeper (and larger) Web]], estimated at over 3 trillion pages, exists within databases whose contents the search engines do not index. These [[dynamic web page]]s are formatted by a Web server when a user requests them and as such cannot be indexed by conventional search engines. The [[United States Patent and Trademark Office]] website is an example; although a search engine can find its main page, one can only search its database of individual patents by entering queries into the site itself.<ref>{{cite paper |first1=Alvin |last1=Moore |first2=Brian H. |last2=Murray |title=Sizing the Internet |publisher=Cyveillance |date=2000}}</ref> <!--include link to Google.com/patents?-->

Google, like all Internet search engines, can only find information that has actually been made available on the Internet. A sizable amount of information is still not on the Internet.

Google, like all major Web search services, follows the [[robots.txt protocol]] and can be [http://www.google.com/support/webmasters/bin/answer.py?answer=40364&query=robots.txt&topic=&type= blocked] by sites that do not wish their content to be indexed or cached by Google. Sites that contain large amounts of copyrighted content (image galleries, subscription newspapers, webcomics, movies, video, help desks), usually involving membership, will block Google and other search engines. Other sites may also block Google because of concerns about the load or bandwidth that crawling places on the server hosting the content.
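
As a rough illustration of how a robots.txt file blocks a crawler, the following Python sketch uses the standard library's <code>urllib.robotparser</code> module. The robots.txt rules, the <code>example.org</code> URLs and the <code>can_fetch</code> helper are hypothetical and chosen only for illustration; real sites publish their own rules, and this is not a description of how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths and rules are illustrative only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /members/

User-agent: *
Disallow: /
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text allows `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Googlebot", "http://example.org/members/page"),
        ("Googlebot", "http://example.org/public/page"),
        ("OtherBot", "http://example.org/public/page"),
    ]:
        print(agent, url, can_fetch(agent, url))
```

Under these assumed rules, a crawler identifying itself as Googlebot is kept out of the members area only, while every other crawler is excluded from the whole site. A page blocked in this way never reaches the index, so no search on that engine can return it.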

Search engines also might not be able to read links or metadata that normally require a browser plugin, [http://www.searchtools.com/info/pdf.html Adobe PDF], or Macromedia Flash, or that appear only where a website is displayed as part of an image. Search engines also cannot listen to podcasts or other audio streams, or even to video that mentions a search term. Similarly, search engines cannot read PDF files consisting of photoscans or look inside compressed (.zip) files.
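
The point about compressed files can be sketched in Python. The example assumes a deliberately naive indexer that only scans the raw bytes it downloads; the function names, the file name <code>notes.txt</code> and the sample text are all hypothetical. Once text is stored in a compressed archive, the search phrase no longer appears in those raw bytes, even though it is recovered intact after decompression.

```python
import io
import zipfile


def term_visible_in_raw_bytes(term: str, data: bytes) -> bool:
    """Naive check a text-only indexer might do: look for the term in raw bytes."""
    return term.encode("utf-8") in data


def build_zip(filename: str, text: str) -> bytes:
    """Create an in-memory, DEFLATE-compressed zip archive holding one text file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(filename, text)
    return buffer.getvalue()


def read_from_zip(data: bytes, filename: str) -> str:
    """Decompress one member of the archive and return it as text."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.read(filename).decode("utf-8")


if __name__ == "__main__":
    # Hypothetical document text, repeated so that compression clearly applies.
    text = "evidence processing " * 50
    archive = build_zip("notes.txt", text)
    print(term_visible_in_raw_bytes("evidence processing", archive))
    print("evidence processing" in read_from_zip(archive, "notes.txt"))
```

An indexer would have to open and decompress the archive, as <code>read_from_zip</code> does, before the phrase became searchable.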

Forums, membership-only and subscription-only sites (since Googlebot does not sign up for site access), and sites that cycle their content are not cached or indexed by any search engine. With more sites moving to AJAX/Web 2.0 designs, this limitation will become more prevalent, because search engines only simulate following the links on a web page, whereas AJAX page setups (like Google Maps) return data dynamically in response to real-time manipulation of JavaScript.
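
The difference between following links and running scripts can be illustrated with a minimal Python sketch. It assumes a crawler that parses the delivered HTML but does not execute JavaScript; the page markup, the link targets and the names <code>LinkCollector</code> and <code>static_links</code> are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the delivered HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html: str) -> List[str]:
    """Return the links a crawler sees when it does not run any scripts."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page: one ordinary link, and one link that exists only after
# a browser has executed the script.
PAGE = """
<html><body>
<a href="/about.html">About</a>
<div id="results"></div>
<script>
  var a = document.createElement("a");
  a.href = "/papers/2002.html";
  document.getElementById("results").appendChild(a);
</script>
</body></html>
"""

if __name__ == "__main__":
    print(static_links(PAGE))
```

Only the ordinary link is found in this sketch. The link that the script would add at viewing time is never seen, so the page behind it is not followed.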

Google has also been the victim of [http://clsc.net/research/google-302-page-hijack.htm redirection exploits] that may cause it to return more results for a specific search term than there are actual content pages.

Google and other popular search engines are also targets of "search result enhancement" by [http://www.google.com/support/webmasters/bin/answer.py?answer=35291 search engine optimizers], so many of the results returned may lead to pages that serve only as advertisements. Sometimes a page contains hundreds of keywords designed specifically to attract search engine users, but in fact serves an advertisement instead of content related to the keyword.

Hit counts reported by Google are only estimates, which in some cases have been shown to necessarily be off by nearly an order of magnitude, especially for hit counts above a few thousand.<ref>Mark Liberman (2009), "[http://languagelog.ldc.upenn.edu/nll/?p=1992 Quotes with and without quotes]", ''[[Language Log]]''.</ref><ref>Liberman, Mark (2005), "[http://158.130.17.5/~myl/languagelog/archives/001837.html Questioning reality]", ''[[Language Log]]''; and other ''Language Log'' posts linked from there.</ref> For words common enough to yield several thousand Google hits, freely available [[text corpus|text corpora]] such as the [[British National Corpus]] (for British English) and the [[Corpus of Contemporary American English]] (for American English) can provide a more accurate estimate of the relative frequencies of two words.
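
A corpus-based comparison amounts to simple arithmetic, sketched below in Python. The counts and the corpus size are invented for illustration and are not taken from any real corpus; the function names are likewise hypothetical.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Occurrences per million words, given a raw count and the corpus size in words."""
    return count * 1_000_000 / corpus_size


def relative_frequency(count_a: int, count_b: int) -> float:
    """How many times more often word A occurs than word B in the same corpus."""
    return count_a / count_b


if __name__ == "__main__":
    # Invented figures: two spellings counted in a corpus assumed to hold
    # 100 million words.
    corpus_size = 100_000_000
    count_a, count_b = 1200, 300
    print(per_million(count_a, corpus_size))
    print(per_million(count_b, corpus_size))
    print(relative_frequency(count_a, count_b))
```

Because both counts come from the same fixed, fully searched body of text, their ratio is meaningful in a way that the ratio of two estimated hit counts may not be.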

=== Example of the limitations ===
The [http://www.summit.nw3c.org Economic Crime Summit] site is a rather Google- and [http://web.archive.org/web/*/www.summit.nw3c.org Internet Archive-unfriendly] site. It is very graphics-heavy, which gives Google little or nothing to index, and many pages are missing from the Internet Archive version. So while you can bring up the [https://web.archive.org/web/20020124024106/http://www.summit.nw3c.org/ 2002 Economic Crime Summit Conference], the overview link that would tell you who presented what does not work. The [https://web.archive.org/web/20040208075832/http://www.summit.nw3c.org/index.html 2004 Economic Crime Summit Conference archive] is even worse, as that was in three places and none of the archived links tells you anything about the papers presented.

The Internet Archive provides proof that some information regarding "Impact of Advances in Computer Technology in Evidence Processing" existed on the Internet.<ref>http://web.archive.org/web/20011212161658/http://www.summit.nw3c.org/Programs_Agenda.htm</ref> Yet today {{em|Google cannot find that information!}} A program that is known to have been part of the 2002 Economic Crime Summit Conference, and that was at one time listed on a website, currently{{when}} cannot be found by Google.