=====Academic-focused crawler=====
An example of [[focused crawlers]] is the academic crawler, which crawls free-access, academic-related documents; one such crawler is ''citeseerxbot'', the crawler of the [[CiteSeer]]<sup>X</sup> search engine. Other academic search engines include [[Google Scholar]] and [[Microsoft Academic Search]]. Because most academic papers are published as [[PDF]] files, this kind of crawler is particularly interested in crawling [[PDF]], [[PostScript]], and [[Microsoft Word]] files, including their [[Zipped file|zipped]] formats. For this reason, general open-source crawlers, such as [[Heritrix]], must be customized to filter out other [[MIME types]], or a [[middleware]] is used to extract these documents and import them into the focused-crawl database and repository.<ref>{{Cite book | doi=10.1145/2389936.2389949| chapter=Web crawler middleware for search engine digital libraries| title=Proceedings of the twelfth international workshop on Web information and data management - WIDM '12| pages=57| year=2012| last1=Wu| first1=Jian| last2=Teregowda| first2=Pradeep| last3=Khabsa| first3=Madian| last4=Carman| first4=Stephen| last5=Jordan| first5=Douglas| last6=San Pedro Wandelmer| first6=Jose| last7=Lu| first7=Xin| last8=Mitra| first8=Prasenjit| last9=Giles| first9=C. Lee| isbn=9781450317207| s2cid=18513666}}</ref>

Identifying whether these documents are academic is challenging and can add significant overhead to the crawling process, so it is performed as a post-crawling step using [[machine learning]] or [[regular expression]] algorithms. Academic documents are usually obtained from the home pages of faculty and students or from the publication pages of research institutes. Because academic documents make up only a small fraction of all web pages, good seed selection is important in boosting the efficiency of these web crawlers.<ref>{{Cite book | doi=10.1145/2380718.2380762| chapter=The evolution of a crawling strategy for an academic document search engine| title=Proceedings of the 3rd Annual ACM Web Science Conference - WebSci '12| pages=340–343| year=2012| last1=Wu| first1=Jian| last2=Teregowda| first2=Pradeep| last3=Ramírez| first3=Juan Pablo Fernández| last4=Mitra| first4=Prasenjit| last5=Zheng| first5=Shuyi| last6=Giles| first6=C. Lee| isbn=9781450312288| s2cid=16718130}}</ref>

Other academic crawlers may also download plain-text and [[HTML]] files that contain the [[metadata]] of academic papers, such as titles, authors, and abstracts. This increases the overall number of papers collected, but a significant fraction may not provide free PDF downloads.
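The MIME-type filtering described above can be sketched as follows. This is a minimal illustration only, not Heritrix's actual configuration or the cited middleware's API; the type list and the helper name are assumptions chosen for the example.

<syntaxhighlight lang="python">
# Minimal sketch of MIME-type filtering for an academic focused crawl.
# Hypothetical example: the set of types and the helper name are
# illustrative, not taken from Heritrix or the cited middleware.
ACADEMIC_MIME_TYPES = {
    "application/pdf",           # PDF
    "application/postscript",    # PostScript
    "application/msword",        # Microsoft Word (.doc)
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # .docx
    "application/zip",           # zipped copies of the above formats
}

def keep_for_repository(content_type: str) -> bool:
    """Decide whether a fetched resource belongs in the focused-crawl repository."""
    # Content-Type headers often carry parameters, e.g. "application/pdf; charset=binary",
    # so strip everything after the first semicolon before comparing.
    mime = content_type.split(";")[0].strip().lower()
    return mime in ACADEMIC_MIME_TYPES
</syntaxhighlight>

A real deployment would apply such a check inside the crawler's fetch pipeline (or in a middleware layer), routing matching documents into the repository and discarding the rest.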
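The regular-expression side of post-crawl classification can likewise be sketched. The cue words and threshold below are assumptions made for illustration; production systems such as CiteSeer<sup>X</sup> combine rules of this kind with machine-learning classifiers.

<syntaxhighlight lang="python">
import re

# Hypothetical heuristic: count distinct scholarly cue words in the text
# extracted from a crawled document. The cue list and the threshold are
# illustrative only.
ACADEMIC_CUES = re.compile(
    r"\b(abstract|references|bibliography|keywords|doi)\b",
    re.IGNORECASE,
)

def looks_academic(extracted_text: str, min_distinct_cues: int = 3) -> bool:
    """Crude post-crawl test for whether a document is an academic paper."""
    cues = {match.group(1).lower() for match in ACADEMIC_CUES.finditer(extracted_text)}
    return len(cues) >= min_distinct_cues
</syntaxhighlight>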