Norconex HTTP Collector
Norconex HTTP Collector is a free and open-source web crawler (web spider) designed to crawl content on the World Wide Web in order to collect data from the Internet.[1][2][3]
Original author(s) | Pascal Essiembre
---|---
Developer(s) | Norconex Inc.
Stable release | 2.x
Preview release | 3.x
Written in | Java
Operating system | Cross-platform
Type | Web crawler
License | Apache License 2.0
Website | https://opensource.norconex.com/collectors/http/
History
The Norconex HTTP Collector began as a closed-source project developed by Norconex, initially created for enterprise search integrators and developers.[4][5]
It was released as open-source software in 2013[6][7][8][9][10] and is still actively maintained by Norconex, with version 2.x being the current stable release and version 3.x in development.[11]
Features[1][3][10]
- Multi-threaded.
- Supports full and incremental crawls.
- Supports different hit intervals according to different schedules.
- Can crawl millions of pages on a single server of average capacity.
- Extract text out of many file formats (HTML, PDF, Word, etc.)
- Extract metadata associated with documents.
- Supports pages rendered with JavaScript.
- Language detection.
- Many content and metadata manipulation options.
- OCR support on images and PDFs.
- Page screenshots.
- Extract page "featured" image.
- Translation support.
- Dynamic title generation.
- Configurable crawling speed.
- URL normalization.
- Detects modified and deleted documents.
- Supports different frequencies for re-crawling certain pages.
- Supports various web site authentication schemes.
- Supports sitemap.xml (including "lastmod" and "changefreq").
- Supports robot rules.
- Supports canonical URLs.
- Can filter documents based on URL, HTTP headers, content, or metadata.
- Can treat embedded documents as distinct documents.
- Can split a document into multiple documents.
- Can store crawled URLs in different database engines.
- Can re-process or delete URLs no longer linked by other crawled pages.
- Supports different URL extraction strategies for different content types.
- Fires more than 20 crawler event types for custom event listeners.
- Date parsers/formatters to match your source/target repository dates.
- Can create hierarchical fields.
- Supports scripting languages for manipulating documents.
- Reference XML/HTML elements using simple DOM tree navigation.
- Supports external commands to parse or manipulate documents.
- Supports crawling with your favorite browser (using WebDriver).
- Supports "If-Modified-Since" for more efficient crawling.
- Follow URLs from HTML or any other document format.
- Can detect and report broken links.
- Can send crawled content to multiple target repositories at once.
- Many others.
Architecture
Norconex HTTP Collector is built entirely in Java.[5][12] A single Collector installation is responsible for launching one or more crawler threads, each with its own configuration.[13]
Each step of a crawler's life cycle is configurable and overridable. Developers can provide their own interface implementation for most steps undertaken by the crawler. The default implementations cover a vast array of crawling use cases and are built on stable products such as Apache Tika and Apache Derby. The following figure is a high-level representation of a URL's life cycle from the crawler's perspective.
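As an illustration of this extensibility, the sketch below shows what a custom URL normalizer might look like. It assumes the IURLNormalizer extension point of the 2.x API (package com.norconex.collector.http.url); class and method names should be verified against the Javadoc of the installed version.

// Hypothetical example: a crawler plug-in that lower-cases the scheme and
// host portion of every URL before it is queued. Assumes the 2.x
// IURLNormalizer interface (a single normalizeURL(String) method).
package com.example.crawler;

import com.norconex.collector.http.url.IURLNormalizer;

public class LowercaseHostNormalizer implements IURLNormalizer {

    @Override
    public String normalizeURL(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // not an absolute URL; leave untouched
        }
        int pathStart = url.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) {
            pathStart = url.length();
        }
        // Lower-case only "scheme://host[:port]", keep the path/query as-is.
        return url.substring(0, pathStart).toLowerCase()
                + url.substring(pathStart);
    }
}

Such a class would then typically be referenced from the crawler's XML configuration (e.g. a urlNormalizer element pointing to com.example.crawler.LowercaseHostNormalizer) so the crawler uses it in place of the default implementation.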
The Importer and Committer modules are separate Apache-licensed Java libraries distributed with the Collector.
The Importer[11] module is used to parse incoming documents from their raw form (HTML, PDF, Word, etc.) into a set of extracted metadata and plain-text content. In addition, it provides interfaces to manipulate a document's metadata, transform its content, or simply filter documents based on their new format. While the Collector depends heavily on the Importer module, the latter can be used on its own as a general-purpose document parser.
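For instance, a minimal standalone use of the Importer could look like the sketch below. It assumes the 2.x Importer API (Importer, ImporterConfig and an importDocument(File, Properties) convenience method) and uses illustrative metadata field names; exact signatures should be checked against the Importer documentation.

// Hypothetical sketch: parse a local file with the Importer outside of any
// crawl. Assumes the 2.x classes named below; verify against the Javadoc.
import java.io.File;

import com.norconex.commons.lang.map.Properties;
import com.norconex.importer.Importer;
import com.norconex.importer.ImporterConfig;

public class ParseOneFile {
    public static void main(String[] args) throws Exception {
        Importer importer = new Importer(new ImporterConfig());

        // Metadata extracted during parsing is accumulated in this map.
        Properties metadata = new Properties();
        importer.importDocument(new File("report.pdf"), metadata);

        // Field names are illustrative; actual fields depend on the parser.
        System.out.println("Title: " + metadata.getString("title"));
        System.out.println("Type:  " + metadata.getString("document.contentType"));
    }
}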
The Committer module is responsible for directing the parsed data to a target repository of choice. Developers are able to write custom implementations, allowing Norconex HTTP Collector to be used with any search engine or repository. Multiple Committer implementations currently exist,[8][13][14] some of which have been implemented by third-party organizations, such as the Apache Kafka Committer[15] and the Google Cloud Search Committer.[16]
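A custom Committer is typically a small class. The sketch below assumes the ICommitter interface of the 2.x Committer Core library (add, remove and commit operations); the exact method signatures should be verified against the version in use.

// Hypothetical sketch of a custom Committer that merely prints what would be
// sent to a repository. Assumes the 2.x com.norconex.committer.core.ICommitter
// interface; real implementations usually extend one of the provided
// abstract Committer base classes instead.
import java.io.InputStream;

import com.norconex.committer.core.ICommitter;
import com.norconex.commons.lang.map.Properties;

public class ConsoleCommitter implements ICommitter {

    @Override
    public void add(String reference, InputStream content, Properties metadata) {
        // Called for each new or modified document to push to the repository.
        System.out.println("ADD    " + reference);
    }

    @Override
    public void remove(String reference, Properties metadata) {
        // Called for each document detected as deleted at the source.
        System.out.println("REMOVE " + reference);
    }

    @Override
    public void commit() {
        // Called to flush any batched operations at the end of a crawl.
        System.out.println("COMMIT");
    }
}

Such a class would then be referenced from the committer element of the crawler configuration, in the same way the FileSystemCommitter is referenced in the example configuration below.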
Minimum Requirements
While the Norconex HTTP Collector can be configured programmatically, out of the box the crawler is configured through an XML configuration file.
The following is the minimum XML configuration for the current 2.x version. See the documentation for more complex configurations.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!--
Copyright 2010-2017 Norconex Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
to run a crawler.
-->
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile).
           Optionally limit crawling to same protocol/domain/port as
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://opensource.norconex.com/collectors/http/test/minimum</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/minimum</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer>

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./examples-output/minimum/crawledFiles</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>
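The same minimum setup can also be expressed in Java rather than XML. The following sketch assumes the 2.x programmatic API (HttpCollectorConfig, HttpCrawlerConfig and HttpCollector) and is meant only to mirror the configuration above; class and method names may differ slightly between releases and should be checked against the Javadoc.

// Hypothetical programmatic equivalent of the minimum XML configuration.
// Class and method names assume the 2.x API; verify before use.
import com.norconex.collector.http.HttpCollector;
import com.norconex.collector.http.HttpCollectorConfig;
import com.norconex.collector.http.crawler.HttpCrawlerConfig;

public class MinimumCrawl {
    public static void main(String[] args) {
        HttpCrawlerConfig crawlerConfig = new HttpCrawlerConfig();
        crawlerConfig.setId("Norconex Minimum Test Page");
        crawlerConfig.setStartURLs(
                "https://opensource.norconex.com/collectors/http/test/minimum");
        crawlerConfig.setMaxDepth(0); // do not follow links
        // A Committer would normally also be set on the crawler configuration.

        HttpCollectorConfig collectorConfig = new HttpCollectorConfig();
        collectorConfig.setId("Minimum Config HTTP Collector");
        collectorConfig.setCrawlerConfigs(crawlerConfig);

        // Launch the collector; "false" means do not resume a previous run.
        new HttpCollector(collectorConfig).start(false);
    }
}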
References
1. "A powerful web crawler that can also extract and manipulate documents in order to retrieve the required information on the Internet". Softpedia. Retrieved 2021-06-22.
2. "Télécharger Norconex HTTP Collector - 01net.com - Telecharger.com". www.01net.com. Retrieved 2021-06-22.
3. Valcheva, Silvia (2018-02-11). "10 Best Open Source Web Crawlers: Web Data Extraction Software". Blog For Data-Driven Business. Retrieved 2021-06-22.
4. "A New Open-Source Web Crawler". 2016-03-04. Archived from the original on 2016-03-04. Retrieved 2021-05-28.
5. "Norconex HTTP Collector - A Web Crawler in Java". www.findbestopensource.com. Retrieved 2021-06-22.
6. "Norconex Gives Back to Open-Source". Norconex Inc. 2013-06-05. Retrieved 2021-05-28.
7. "Norconex Offers Open Source HTTP Crawler". Beyond Search. 2013-07-16. Retrieved 2021-05-28.
8. "Discover the Open Source Alternative to the Autonomy Crawler". Beyond Search. 2014-02-07. Retrieved 2021-05-28.
9. NT, Baiju (2018-09-12). "Top 50 open source web crawlers for data mining". Big Data Made Simple. Retrieved 2021-05-28.
10. "50 Best Open Source Web Crawlers". ProWebScraper. Retrieved 2021-06-21.
11. Norconex/importer, GitHub, 2021-04-26.
12. "Norconex/collector-http - githubmemory". githubmemory.com. Retrieved 2021-06-22.
13. "Importing Data from the Web with Norconex & Neo4j". Neo4j Graph Database Platform. 2020-02-10. Retrieved 2021-06-23.
14. "SolrEcosystem - SOLR - Apache Software Foundation". cwiki.apache.org. Retrieved 2021-06-22.
15. "jpmantuano/kafka-committer". GitHub. 2019-07-23. Retrieved 2021-05-28.
16. "Deploy a Norconex HTTP Collector Indexer Plugin | Cloud Search". Google Developers. Retrieved 2021-05-28.