==Crawling the deep web==
A vast number of web pages lie in the [[Deep Web (search indexing)|deep or invisible web]].<ref>Shestakov, Denis (2008). [https://oa.doria.fi/handle/10024/38506 ''Search Interfaces on the Web: Querying and Characterizing''] {{Webarchive|url=https://web.archive.org/web/20140706120641/https://oa.doria.fi/handle/10024/38506 |date=6 July 2014}}. TUCS Doctoral Dissertations 104, University of Turku.</ref> These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find them if no links point to them. Google's [[Sitemaps]] protocol and [[mod oai]]<ref>{{Cite journal |arxiv=cs/0503069 |author1=Michael L Nelson |author2=Herbert Van de Sompel |author3=Xiaoming Liu |author4=Terry L Harrison |author5=Nathan McFarland |title=mod_oai: An Apache Module for Metadata Harvesting |pages=cs/0503069 |date=2005-03-24 |bibcode=2005cs........3069N}}</ref> are intended to allow discovery of these [[Deep web|deep-web]] resources.

Deep web crawling also multiplies the number of web links to be crawled. Some crawlers harvest only URLs that appear in <code><a href="URL"></code> form. In other cases, such as the [[Googlebot]], crawling is performed on all text contained inside the hypertext content, tags, or text.

Strategic approaches may be taken to target deep web content. With a technique called [[screen scraping]], specialized software may be customized to automatically and repeatedly query a given web form with the intention of aggregating the resulting data. Such software can be used to span multiple web forms across multiple websites. Data extracted from the results of one web form submission can be taken and applied as input to another web form, thus establishing continuity across the deep web in a way not possible with traditional web crawlers.<ref>{{cite journal |first1=Denis |last1=Shestakov |first2=Sourav S. |last2=Bhowmick |first3=Ee-Peng |last3=Lim |title=DEQUE: Querying the Deep Web |journal=Data & Knowledge Engineering |volume=52 |issue=3 |pages=273–311 |year=2005 |url=http://www.mendeley.com/download/public/1423991/3893295922/dc0f7d824fd2a8fbbc84f6fdf9e4f337d343987d/dl.pdf |doi=10.1016/s0169-023x(04)00107-7}}</ref>

Pages built on [[AJAX]] are among those causing problems for web crawlers. [[Google search engine|Google]] has proposed a format of AJAX calls that their bot can recognize and index.<ref>{{cite web |url=https://support.google.com/webmasters/bin/answer.py?hl=en&answer=174992 |title=AJAX crawling: Guide for webmasters and developers |access-date=17 March 2013}}</ref>
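The distinction drawn above, between crawlers that harvest only anchor-tag URLs and those such as Googlebot that look for links in all of a document's text, can be illustrated with a minimal sketch in Python. The parser class and regular expression here are illustrative and do not reproduce any particular crawler's implementation:

<syntaxhighlight lang="python">
import re
from html.parser import HTMLParser

class AnchorLinkExtractor(HTMLParser):
    """Collects only URLs that appear in <a href="URL"> form."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A broader approach: scan the raw document for anything that looks
# like an absolute URL, regardless of whether it sits inside an anchor tag.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")

def extract_all_urls(html_text):
    return URL_PATTERN.findall(html_text)

html = '<a href="https://example.com/a">link</a> see also https://example.com/b'
parser = AnchorLinkExtractor()
parser.feed(html)
print(parser.links)            # ['https://example.com/a']
print(extract_all_urls(html))  # ['https://example.com/a', 'https://example.com/b']
</syntaxhighlight>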
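The form-querying technique described above can likewise be sketched in a few lines; the endpoint URL and form field name (<code>q</code>) are hypothetical placeholders, and a real scraper would also parse each result page to extract data and to seed inputs for further forms:

<syntaxhighlight lang="python">
import requests  # third-party HTTP library

# Hypothetical search form endpoint, for illustration only.
SEARCH_FORM_URL = "https://example.com/search"

def query_form(term):
    """Submit one query to the site's search form and return the result page."""
    response = requests.post(SEARCH_FORM_URL, data={"q": term}, timeout=30)
    response.raise_for_status()
    return response.text

def harvest(terms):
    """Repeatedly query the same form with different inputs,
    aggregating the resulting pages by query term."""
    pages = {}
    for term in terms:
        pages[term] = query_form(term)
    return pages

# Data extracted from these results could in turn be applied as input
# to another site's form, chaining submissions across the deep web.
results = harvest(["crawler", "deep web", "indexing"])
</syntaxhighlight>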
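Google's proposal, the since-deprecated AJAX crawling scheme, asked sites to expose stateful <code>#!</code> URLs through an equivalent <code>_escaped_fragment_</code> query parameter that the crawler could fetch as a plain page. A minimal sketch of that URL mapping:

<syntaxhighlight lang="python">
from urllib.parse import quote

def escaped_fragment_url(url):
    """Map a hash-bang AJAX URL to the crawlable form used by the
    deprecated AJAX crawling scheme: the fragment after '#!' is
    re-sent as a URL-encoded '_escaped_fragment_' query parameter."""
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    separator = "&" if "?" in base else "?"
    return base + separator + "_escaped_fragment_=" + quote(fragment, safe="")

print(escaped_fragment_url("https://example.com/page#!state=1"))
# https://example.com/page?_escaped_fragment_=state%3D1
</syntaxhighlight>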