=== Open-source crawlers ===

<!-- PLEASE RESPECT ALPHABETICAL ORDER -->
* [[Apache Nutch]] is a highly extensible and scalable web crawler written in Java and released under the [[Apache License]]. It is based on [[Apache Hadoop]] and can be used with [[Apache Solr]] or [[Elasticsearch]].
* [[Grub (search engine)|Grub]] was an open-source distributed search crawler that [[Wikia Search]] used to crawl the web.
* [[Heritrix]] is the [[Internet Archive]]'s archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It is written in [[Java (programming language)|Java]].
* [[Ht-//dig|ht://Dig]] includes a Web crawler in its indexing engine.
* [[HTTrack]] uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in [[C (programming language)|C]] and released under the GPL.
* [[mnoGoSearch]] is a crawler, indexer and search engine written in [[C (programming language)|C]] and licensed under the GPL; it runs on *NIX machines only.
* Norconex Web Crawler is a highly extensible web crawler written in [[Java (programming language)|Java]] and released under the [[Apache License]]. It can be used with many repositories such as [[Apache Solr]], [[Elasticsearch]], [[Azure Cognitive Search|Microsoft Azure Cognitive Search]], [[Amazon CloudSearch]] and more.
* [[Open Search Server]] is search engine and web crawler software released under the GPL.
* [[Scrapy]], an open-source web crawler framework written in [[Python (programming language)|Python]] (licensed under [[BSD License|BSD]]).
* [[Seeks]], a free distributed search engine (licensed under [[GNU Affero General Public License|AGPL]]).
* [[StormCrawler]], a collection of resources for building low-latency, scalable web crawlers on [[Storm (event processor)|Apache Storm]] (Apache License).
* [[tkWWW Robot]], a crawler based on the [[tkWWW]] web browser (licensed under GPL).
* [[Wget|GNU Wget]] is a [[Command line interface|command-line]]-operated crawler written in [[C (programming language)|C]] and released under the [[GNU General Public License|GPL]]. It is typically used to mirror Web and FTP sites.
* [[YaCy]], a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).