Jump to content
Main menu
Main menu
move to sidebar
hide
Navigation
Main page
Recent changes
Random page
Help about MediaWiki
Special pages
Niidae Wiki
Search
Search
Appearance
Create account
Log in
Personal tools
Create account
Log in
Pages for logged out editors
learn more
Contributions
Talk
Editing
Web crawler
(section)
Page
Discussion
English
Read
Edit
View history
Tools
Tools
move to sidebar
hide
Actions
Read
Edit
View history
General
What links here
Related changes
Page information
Appearance
move to sidebar
hide
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
===Re-visit policy=== The Web has a very dynamic nature, and crawling a fraction of the Web can take weeks or months. By the time a Web crawler has finished its crawl, many events could have happened, including creations, updates, and deletions. From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most-used cost functions are freshness and age.<ref>{{Cite conference | publisher = ACM | doi = 10.1145/342009.335391 | isbn = 1-58113-217-4 | pages = 117β128 | author1 = Junghoo Cho | author2 = Hector Garcia-Molina | title = Synchronizing a database to improve freshness | book-title = Proceedings of the 2000 ACM SIGMOD international conference on Management of data | location = Dallas, Texas, United States | access-date = 2009-03-23 | year = 2000 | url = http://www.cs.brown.edu/courses/cs227/2002/cache/Cho.pdf}}</ref> '''Freshness''': This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page ''p'' in the repository at time ''t'' is defined as: : <math> F_p(t) = \begin{cases} 1 & {\rm if}~p~{\rm~is~equal~to~the~local~copy~at~time}~t\\ 0 & {\rm otherwise} \end{cases} </math> '''Age''': This is a measure that indicates how outdated the local copy is. The age of a page ''p'' in the repository, at time ''t'' is defined as: : <math> A_p(t) = \begin{cases} 0 & {\rm if}~p~{\rm~is~not~modified~at~time}~t\\ t - {\rm modification~time~of}~p & {\rm otherwise} \end{cases} </math> [[Edward G. Coffman, Jr.|Coffman]] ''et al.'' worked with a definition of the objective of a Web crawler that is equivalent to freshness, but use a different wording: they propose that a crawler must minimize the fraction of time pages remain outdated. They also noted that the problem of Web crawling can be modeled as a multiple-queue, single-server polling system, on which the Web crawler is the server and the Web sites are the queues. Page modifications are the arrival of the customers, and switch-over times are the interval between page accesses to a single Web site. Under this model, mean waiting time for a customer in the polling system is equivalent to the average age for the Web crawler.<ref name="coffman">{{Cite journal | doi = 10.1002/(SICI)1099-1425(199806)1:1<15::AID-JOS3>3.0.CO;2-K | volume = 1 | issue = 1 | pages = 15β29 | author1 = E. G. Coffman Jr | author2 = Zhen Liu | author3 = Richard R. Weber | title = Optimal robot scheduling for Web search engines | journal = Journal of Scheduling | year = 1998 | citeseerx = 10.1.1.36.6087 }}</ref> The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are outdated, while in the second case, the crawler is concerned with how old the local copies of pages are. [[File:Web Crawling Freshness Age.png|thumb|Evolution of Freshness and Age in a web crawler]] Two simple re-visiting policies were studied by Cho and Garcia-Molina:<ref name=cho2003>{{Cite journal |doi = 10.1145/958942.958945|title = Effective page refresh policies for Web crawlers|journal = ACM Transactions on Database Systems|volume = 28|issue = 4|pages = 390β426|year = 2003|last1 = Cho|first1 = Junghoo|last2 = Garcia-Molina|first2 = Hector|s2cid = 147958}}</ref> * Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change. * Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency. In both cases, the repeated crawling order of pages can be done either in a random or a fixed order. Cho and Garcia-Molina proved the surprising result that, in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl. Intuitively, the reasoning is that, as web crawlers have a limit to how many pages they can crawl in a given time frame, (1) they will allocate too many new crawls to rapidly changing pages at the expense of less frequently updating pages, and (2) the freshness of rapidly changing pages lasts for shorter period than that of less frequently changing pages. In other words, a proportional policy allocates more resources to crawling frequently updating pages, but experiences less overall freshness time from them. To improve freshness, the crawler should penalize the elements that change too often.<ref name="cho2003a">{{Cite journal | doi = 10.1145/857166.857170 | volume = 3 | issue = 3 | pages = 256β290 | author1 = Junghoo Cho | author2 = Hector Garcia-Molina | title = Estimating frequency of change | journal = ACM Transactions on Internet Technology| year = 2003 | citeseerx = 10.1.1.59.5877 | s2cid = 9362566 }}</ref> The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. In both cases, the optimal is closer to the uniform policy than to the proportional policy: as [[Edward G. Coffman, Jr.|Coffman]] ''et al.'' note, "in order to minimize the expected obsolescence time, the accesses to any particular page should be kept as evenly spaced as possible".<ref name="coffman" /> Explicit formulas for the re-visit policy are not attainable in general, but they are obtained numerically, as they depend on the distribution of page changes. Cho and Garcia-Molina show that the exponential distribution is a good fit for describing page changes,<ref name="cho2003a" /> while [[Panos Ipeirotis|Ipeirotis]] ''et al.'' show how to use statistical tools to discover parameters that affect this distribution.<ref>Ipeirotis, P., Ntoulas, A., Cho, J., Gravano, L. (2005) [http://pages.stern.nyu.edu/~panos/publications/icde2005.pdf Modeling and managing content changes in text databases] {{Webarchive|url=https://web.archive.org/web/20050905013119/http://pages.stern.nyu.edu/~panos/publications/icde2005.pdf |date=5 September 2005 }}. In Proceedings of the 21st IEEE International Conference on Data Engineering, pages 606-617, April 2005, Tokyo.</ref> The re-visiting policies considered here regard all pages as homogeneous in terms of quality ("all pages on the Web are worth the same"), something that is not a realistic scenario, so further information about the Web page quality should be included to achieve a better crawling policy.
Summary:
Please note that all contributions to Niidae Wiki may be edited, altered, or removed by other contributors. If you do not want your writing to be edited mercilessly, then do not submit it here.
You are also promising us that you wrote this yourself, or copied it from a public domain or similar free resource (see
Encyclopedia:Copyrights
for details).
Do not submit copyrighted work without permission!
Cancel
Editing help
(opens in new window)
Search
Search
Editing
Web crawler
(section)
Add topic