== List of web crawlers ==
{{further|List of search engine software}}

The following is a list of published crawler architectures for general-purpose crawlers (excluding focused web crawlers), with a brief description that includes the names given to the different components and outstanding features:

=== Historical web crawlers ===
<!-- PLEASE RESPECT ALPHABETICAL ORDER -->
* [https://web.archive.org/web/20020106130042/http://www.wolfbot.com/ WolfBot] was a massively multithreaded crawler built in 2001 by Mani Singh, a civil engineering graduate of the University of California at Davis.
* [[World Wide Web Worm]] was a crawler used to build a simple index of document titles and URLs. The index could be searched by using the <kbd>[[grep]]</kbd> [[Unix]] command.
* Yahoo! Slurp was the name of the [[Yahoo!]] Search crawler until Yahoo! contracted with [[Microsoft]] to use [[Bingbot]] instead.

=== In-house web crawlers ===

<!-- PLEASE RESPECT ALPHABETICAL ORDER -->
* Applebot is [[Apple (company)|Apple]]'s web crawler. It supports [[Siri]] and other products.<ref>{{cite web |title=About Applebot |url=https://support.apple.com/en-us/HT204683 |publisher=Apple Inc |access-date=18 October 2021}}</ref>
* Baiduspider is [[Baidu]]'s web crawler.
* [[Bingbot]] is the name of Microsoft's [[Bing (search engine)|Bing]] web crawler. It replaced ''[[Msnbot]]''.
* DuckDuckBot is [[DuckDuckGo]]'s web crawler.
* [[Googlebot]] is described in some detail, but the reference covers only an early version of its architecture, which was written in C++ and [[Python (programming language)|Python]]. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. A URL server sent lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to a URL server that checked whether each URL had been seen before. If not, the URL was added to the queue of the URL server.
* [[WebCrawler]] was used to build the first publicly available full-text index of a subset of the Web. It was based on [[Libwww|lib-WWW]] to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.
* [[WebFountain]] is a distributed, modular crawler similar to Mercator but written in C++.
* [[Xenon (program)|Xenon]] is a web crawler used by government tax authorities to detect fraud.{{r|Norton-2007-01-25}}{{r|Canada-2017-04-11}}
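
As a rough illustration of the URL-server design described for the early Googlebot above, and of the breadth-first ordering mentioned for WebCrawler, the following Python sketch keeps a first-in, first-out queue of URLs together with a set of URLs already seen. It is not code from either system: the names <code>URLServer</code>, <code>offer</code>, <code>next_url</code>, <code>crawl</code> and <code>TOY_WEB</code> are hypothetical, and an in-memory link table stands in for downloading and parsing real pages.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


class URLServer:
    """Hypothetical frontier: hands out URLs and admits only unseen ones."""

    def __init__(self, seeds: Iterable[str]) -> None:
        self._queue = deque()  # URLs waiting to be fetched, oldest first
        self._seen = set()     # every URL ever admitted to the queue
        for url in seeds:
            self.offer(url)

    def offer(self, url: str) -> bool:
        """Queue the URL if it has not been seen; report whether it was added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_url(self) -> Optional[str]:
        """Return the oldest queued URL (FIFO gives breadth-first order)."""
        return self._queue.popleft() if self._queue else None


def crawl(
    seeds: Iterable[str],
    extract_links: Callable[[str], Iterable[str]],
    limit: int = 100,
) -> List[str]:
    """Visit up to `limit` URLs, passing every extracted link to the URL server."""
    server = URLServer(seeds)
    visited = []
    while len(visited) < limit:
        url = server.next_url()
        if url is None:  # queue exhausted
            break
        visited.append(url)
        for link in extract_links(url):  # assumption: fetch + parse happen here
            server.offer(link)
    return visited


# Toy in-memory link graph standing in for real pages (assumption).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Because the queue is first-in, first-out, pages are visited level by level from the seed, and the seen-set check keeps a URL that is linked from several pages from being queued more than once. A production crawler would also need URL normalization, politeness delays and persistent storage, none of which are shown here.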

=== Commercial web crawlers ===
The following web crawlers are available for a price:
* [[Diffbot]] - programmatic general web crawler, available as an [[API]]
* [[SortSite]] - crawler for analyzing websites, available for [[Microsoft Windows|Windows]] and [[Macintosh operating systems|Mac OS]]
* Swiftbot - [[Swiftype]]'s web crawler, available as [[software as a service]]
* Aleph Search - web crawler allowing massive collection with high scalability

=== Open-source crawlers ===

<!-- PLEASE RESPECT ALPHABETICAL ORDER -->
* [[Apache Nutch]] is a highly extensible and scalable web crawler written in Java and released under the [[Apache License]]. It is based on [[Apache Hadoop]] and can be used with [[Apache Solr]] or [[Elasticsearch]].
* [[Grub (search engine)|Grub]] was an open source distributed search crawler that [[Wikia Search]] used to crawl the web.
* [[Heritrix]] is the [[Internet Archive]]'s archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in [[Java (programming language)|Java]].
* [[Ht-//dig|ht://Dig]] includes a Web crawler in its indexing engine.
* [[HTTrack]] uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in [[C (programming language)|C]] and released under the GPL.
* [[mnoGoSearch]] is a crawler, indexer and search engine written in C and licensed under the GPL (*NIX machines only).
* Norconex Web Crawler is a highly extensible web crawler written in [[Java (programming language)|Java]] and released under the [[Apache License]]. It can be used with many repositories such as [[Apache Solr]], [[Elasticsearch]], [[Azure Cognitive Search|Microsoft Azure Cognitive Search]], [[Amazon CloudSearch]] and more.
* [[Open Search Server]] is search engine and web crawler software released under the GPL.
* [[Scrapy]], an open source web crawler framework written in Python (licensed under [[BSD License|BSD]]).
* [[Seeks]], a free distributed search engine (licensed under [[GNU Affero General Public License|AGPL]]).
* [[StormCrawler]], a collection of resources for building low-latency, scalable web crawlers on [[Storm (event processor)|Apache Storm]] (Apache License).
* [[tkWWW Robot]], a crawler based on the [[tkWWW]] web browser (licensed under GPL).
* [[Wget|GNU Wget]] is a [[Command line interface|command-line]]-operated crawler written in [[C (programming language)|C]] and released under the [[GNU General Public License|GPL]]. It is typically used to mirror Web and FTP sites.
* [[YaCy]], a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).