===Politeness policy===
Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. If a single crawler is performing multiple requests per second and/or downloading large files, a server can have a hard time keeping up with requests from multiple crawlers.

As noted by Koster, the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community.<ref>Koster, M. (1995). Robots in the web: threat or treat? ConneXions, 9(4).</ref> The costs of using Web crawlers include:
* network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time;
* server overload, especially if the frequency of accesses to a given server is too high;
* poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle; and
* personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the [[Robots Exclusion Standard|robots exclusion protocol]], also known as the robots.txt protocol, a standard that lets administrators indicate which parts of their Web servers should not be accessed by crawlers.<ref>Koster, M. (1996). [http://www.robotstxt.org/wc/exclusion.html A standard for robot exclusion] {{Webarchive|url=https://web.archive.org/web/20071107021800/http://www.robotstxt.org/wc/exclusion.html |date=7 November 2007 }}.</ref> This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Commercial search engines such as [[Google.com|Google]], [[Ask.com|Ask Jeeves]], [[Bing (search engine)|MSN]] and [[Yahoo! Search]] support an extra "Crawl-delay:" parameter in the [[robots.txt]] file, which administrators can use to indicate the number of seconds a crawler should wait between requests.
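As an illustrative sketch only (not the code of any particular crawler), the following Python fragment uses the standard library's urllib.robotparser module to check the robots exclusion protocol and to honor a "Crawl-delay:" directive before fetching a page. The user-agent name, the URLs and the 10-second fallback interval are placeholder assumptions.

<syntaxhighlight lang="python">
import time
import urllib.robotparser

USER_AGENT = "ExampleBot"  # placeholder user-agent name

# Fetch and parse the site's robots.txt (robots exclusion protocol).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/page.html"

if rp.can_fetch(USER_AGENT, url):
    # Honor a "Crawl-delay:" directive if the site specifies one;
    # otherwise fall back to a conservative default interval.
    delay = rp.crawl_delay(USER_AGENT) or 10
    time.sleep(delay)
    # ... download the page here ...
</syntaxhighlight>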
The first proposed interval between successive page loads was 60 seconds.<ref>Koster, M. (1993). [http://www.robotstxt.org/wc/guidelines.html Guidelines for robots writers] {{Webarchive|url=https://web.archive.org/web/20050422045839/http://www.robotstxt.org/wc/guidelines.html |date=22 April 2005 }}.</ref> However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than two months to download that one Web site in its entirety (100,000 pages × 60 seconds ≈ 69 days), and only a fraction of the resources of that Web server would be used. Cho uses 10 seconds as an interval between accesses,<ref name=cho2003/> and the WIRE crawler uses 15 seconds as the default.<ref name=baeza2002>Baeza-Yates, R. and Castillo, C. (2002). [http://www.chato.cl/papers/baeza02balancing.pdf Balancing volume, quality and freshness in Web crawling]. In Soft Computing Systems – Design, Management and Applications, pages 565–572, Santiago, Chile. IOS Press Amsterdam.</ref> The Mercator Web crawler follows an adaptive politeness policy: if it took ''t'' seconds to download a document from a given server, the crawler waits for 10''t'' seconds before downloading the next page.<ref>{{cite journal |author1=Heydon, Allan |author2=Najork, Marc |title=Mercator: A Scalable, Extensible Web Crawler |date=1999-06-26 |url=http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf |access-date=2009-03-22 |url-status=dead |archive-url=https://web.archive.org/web/20060219085958/http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf |archive-date=19 February 2006}}</ref> Dill ''et al.'' use 1 second.<ref>{{cite journal |last1 = Dill |first1 = S. |last2 = Kumar |first2 = R. |last3 = Mccurley |first3 = K. S. |last4 = Rajagopalan |first4 = S. |last5 = Sivakumar |first5 = D. |last6 = Tomkins |first6 = A. |year = 2002 |title = Self-similarity in the web |url = http://www.mccurley.org/papers/fractal.pdf |journal = ACM Transactions on Internet Technology |volume = 2 |issue = 3 |pages = 205–223 |doi = 10.1145/572326.572328 |s2cid = 6416041 }}</ref>

For those using Web crawlers for research purposes, a more detailed cost-benefit analysis is needed, and ethical considerations should be taken into account when deciding where to crawl and how fast to crawl.<ref>{{Cite journal |author1 = M. Thelwall |author2 = D. Stuart |year = 2006 |url = http://www.scit.wlv.ac.uk/%7Ecm1993/papers/Web_Crawling_Ethics_preprint.doc |title = Web crawling ethics revisited: Cost, privacy and denial of service |volume = 57 |issue = 13 |pages = 1771–1779 |journal = Journal of the American Society for Information Science and Technology |doi = 10.1002/asi.20388 }}</ref> Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes.

Even a crawler that is very polite and takes every safeguard to avoid overloading Web servers will still receive some complaints from Web server administrators. [[Sergey Brin]] and [[Larry Page]] noted in 1998, "... running a crawler which connects to more than half a million servers ... generates a fair amount of e-mail and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen."<ref name=brin1998>{{cite journal |url=http://infolab.stanford.edu/~backrub/google.html |doi=10.1016/s0169-7552(98)00110-x |title=The anatomy of a large-scale hypertextual Web search engine |journal=Computer Networks and ISDN Systems |volume=30 |issue=1–7 |pages=107–117 |year=1998 |last1=Brin |first1=Sergey |last2=Page |first2=Lawrence |s2cid=7587743 }}</ref>
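The adaptive politeness policy described above for the Mercator crawler (wait a fixed multiple of the time the last download took) can be summarised in a short sketch. This is an illustration of the general idea under assumed names, not Mercator's actual implementation; the function, the politeness factor and the URL are hypothetical.

<syntaxhighlight lang="python">
import time
import urllib.request

def polite_fetch(url, politeness_factor=10):
    """Download one page and return its body together with the number of
    seconds to wait before the next request to the same server.

    If the download took t seconds, the suggested wait is
    politeness_factor * t seconds (10*t in the policy described above)."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as response:
        body = response.read()
    elapsed = time.monotonic() - start
    return body, politeness_factor * elapsed

# Example usage (the URL is a placeholder):
# body, wait = polite_fetch("https://example.com/page.html")
# time.sleep(wait)  # pause before requesting the next page from that server
</syntaxhighlight>

In practice such a delay would be tracked per server (per host name), so that slower servers are automatically visited less often.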