Frontera (web crawling)

Frontera
Original author(s)	Alexander Sibiryakov, Javier Casas
Developer(s)	Scrapinghub Ltd., GitHub community
Initial release	November 1, 2014
Stable release	v0.8.1 / April 5, 2019
Written in	Python
Operating system	OS X, Linux
Type	web crawling
License	BSD 3-clause license
Website	github.com/scrapinghub/frontera

Frontera is an open-source web crawling framework that implements a crawl frontier component and provides scalability primitives for web crawler applications.

Overview

Large-scale web crawlers often operate in batch mode, with sequential phases of injection, fetching, parsing, deduplication, and scheduling. This delays updates to the crawl when the web changes. The batch design is primarily motivated by the relatively low random access performance of hard disks compared to sequential access.

Frontera instead relies on key-value storage systems, which use efficient data structures and powerful hardware to allow crawling, parsing, and scheduling of new links to run concurrently. It is an open-source project designed to fit a variety of use cases, with high flexibility and configurability.

Large-scale web crawling is Frontera's primary purpose. It also supports crawls of moderate size on a single machine with a few cores, using the single-process and distributed-spiders run modes.

Features

Frontera is written primarily in Python. Data transport and formats are well abstracted, and the out-of-the-box implementations include support for MessagePack, JSON, Kafka, and ZeroMQ.

Online operation: small request batches, with parsing performed immediately after fetching.
Pluggable backend architecture: low-level storage logic is separated from the crawling policy.
Three run modes: single process, distributed spiders, distributed backend and spiders.
Transparent data flow, allowing for easy integration of custom components.
Message bus abstraction, providing a way to implement custom transport (ZeroMQ and Kafka are available out of the box).
SQLAlchemy and HBase storage backends.
Revisiting logic (only with RDBMS backend).
Optional use of Scrapy for fetching and parsing.
BSD 3-clause license, allowing use in any commercial product.
Python 3 support.
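
The message bus abstraction listed above can be illustrated with a minimal sketch. This is not Frontera's actual API: the names MessageBus, InMemoryBus, publish, and consume, and the topic name "spider_feed", are hypothetical, and the in-memory queue only stands in for a real transport such as Kafka or ZeroMQ. Messages are encoded as JSON, one of the formats mentioned above.

```python
import json
from abc import ABC, abstractmethod
from collections import defaultdict, deque


class MessageBus(ABC):
    """Hypothetical transport interface; not Frontera's real classes."""

    @abstractmethod
    def publish(self, topic: str, message: dict) -> None:
        """Send one message to a topic."""

    @abstractmethod
    def consume(self, topic: str, max_messages: int) -> list:
        """Return up to max_messages messages from a topic, oldest first."""


class InMemoryBus(MessageBus):
    """Stand-in transport keeping JSON-encoded messages in per-topic queues."""

    def __init__(self):
        self._topics = defaultdict(deque)

    def publish(self, topic, message):
        # Encode on the way in, as a network transport would.
        self._topics[topic].append(json.dumps(message))

    def consume(self, topic, max_messages):
        queue = self._topics[topic]
        batch = []
        while queue and len(batch) < max_messages:
            batch.append(json.loads(queue.popleft()))
        return batch


bus = InMemoryBus()
bus.publish("spider_feed", {"url": "http://example.com/", "priority": 0.9})
bus.publish("spider_feed", {"url": "http://example.org/", "priority": 0.5})
first_batch = bus.consume("spider_feed", max_messages=1)
```

Because callers depend only on the abstract interface, a different transport could be swapped in by providing another subclass, which is the general idea behind a pluggable message bus.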

Comparison to other web crawlers

Although Frontera is not a web crawler itself, it requires a streaming crawling architecture rather than a batch crawling approach.^[citation needed]

StormCrawler is another stream-oriented crawler built on top of Apache Storm while using some components from the Apache Nutch ecosystem. Scrapy Cluster was designed by ISTResearch with precise monitoring and management of the queue in mind. These systems provide fetching and/or queueing mechanisms, but no link database or content processing.

Battle testing

At Scrapinghub Ltd., a crawler built primarily with Frontera processes 1,600 requests per second at peak, using Kafka as the message bus and HBase to store link states and the link database. The crawler operates in cycles; each cycle takes 1.5 months and yields 1.7 billion downloaded pages.^[2]

A crawl of the Spanish internet resulted in 46.5 million pages in 1.5 months on an AWS cluster with 2 spider machines.^[3]
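
For a sense of scale, the figures above can be converted into average throughput. The short calculation below is only a back-of-the-envelope sketch: it assumes that "1.5 months" means 45 days and that crawling was continuous, neither of which is stated in the sources. Under those assumptions, the large crawl averages somewhat under 440 pages per second and the Spanish crawl about 12 pages per second across the cluster.

```python
SECONDS_PER_DAY = 24 * 60 * 60
CYCLE_DAYS = 45  # assumption: "1.5 months" is taken as 45 days


def average_rate(pages, days=CYCLE_DAYS):
    """Average pages per second over a continuous crawl of the given length."""
    return pages / (days * SECONDS_PER_DAY)


large_crawl = average_rate(1.7e9)     # 1.7 billion pages per cycle
spanish_crawl = average_rate(46.5e6)  # 46.5 million pages, whole cluster
per_spider = spanish_crawl / 2        # two spider machines

print(f"{large_crawl:.0f} {spanish_crawl:.1f} {per_spider:.1f}")
```

The averages count downloaded pages, so they are not directly comparable with the peak figure, which counts requests.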

History

The initial version of Frontera operated in a single process, as part of a custom scheduler for Scrapy, using an on-disk SQLite database to store link states and queues. It was capable of crawling for several days. However, as the number of links grew, it spent an increasing amount of time on SELECT queries, making the crawl inefficient. This prompted development under DARPA's Memex program and Frontera's inclusion in the program's open-source catalog.^[4]

Subsequent Frontera versions (2015) utilized HBase for storing the link database and queue. The application was distributed into a backend and a fetcher. The backend communicated with HBase using Kafka, and the fetcher read Kafka topics containing URLs to crawl and produced crawl results to another topic consumed by the backend, creating a closed cycle. A first priority queue prototype suitable for web-scale crawling was implemented during this period. The queue generated batches with limitations on the number of hosts and requests per host.
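
The batch generation described above, with limits on the number of hosts and on requests per host, can be sketched as follows. This is an illustration under stated assumptions rather than Frontera's implementation: the function next_batch and its parameters are hypothetical, and an in-memory heap stands in for the HBase-backed queue.

```python
import heapq
from collections import Counter
from urllib.parse import urlparse


def next_batch(queue, max_requests, max_hosts, max_per_host):
    """Pop up to max_requests URLs from a heap of (negated priority, url) pairs.

    Entries skipped because of a host limit are pushed back for later batches.
    """
    batch, skipped = [], []
    per_host = Counter()
    while queue and len(batch) < max_requests:
        entry = heapq.heappop(queue)
        host = urlparse(entry[1]).netloc
        is_new_host = host not in per_host
        too_many_hosts = is_new_host and len(per_host) >= max_hosts
        if too_many_hosts or per_host[host] >= max_per_host:
            skipped.append(entry)
            continue
        per_host[host] += 1
        batch.append(entry[1])
    for entry in skipped:
        heapq.heappush(queue, entry)
    return batch


queue = []
for priority, url in [
    (0.9, "http://a.example/1"),
    (0.8, "http://a.example/2"),
    (0.7, "http://a.example/3"),
    (0.6, "http://b.example/1"),
    (0.5, "http://c.example/1"),
]:
    # heapq is a min-heap, so priorities are negated to pop the highest first.
    heapq.heappush(queue, (-priority, url))

batch = next_batch(queue, max_requests=10, max_hosts=2, max_per_host=2)
```

With these limits the first batch holds two URLs from a.example and one from b.example; the third a.example URL and the c.example URL stay in the queue for a later batch. Capping hosts and requests per host in this way keeps any single batch from concentrating load on one site.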

A significant milestone was the introduction of crawling strategies and strategy workers, along with the message bus abstraction. This made it possible to code custom crawling strategies without dealing with low-level backend queue code. The ability to specify easily which links to schedule, when, and with what priority made Frontera a true crawl frontier framework. Kafka was a heavy requirement for small crawlers; the message bus abstraction addressed this by allowing integration with nearly any messaging system.
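
The separation between crawling policy and queue code can be sketched as below. The class BreadthFirstStrategy and its methods are hypothetical and do not reproduce Frontera's strategy interface; the scheduled list merely stands in for the backend queue. The policy shown, which gives shallower links higher priority up to a depth limit, is an arbitrary example.

```python
from dataclasses import dataclass, field


@dataclass
class BreadthFirstStrategy:
    """Hypothetical strategy: shallower links get higher priority."""

    max_depth: int = 3
    seen: set = field(default_factory=set)
    scheduled: list = field(default_factory=list)  # stands in for the backend queue

    def schedule(self, url, depth):
        # Skip already-seen URLs and anything beyond the depth limit.
        if url in self.seen or depth > self.max_depth:
            return False
        self.seen.add(url)
        priority = 1.0 / (1 + depth)
        self.scheduled.append((url, priority))
        return True

    def links_extracted(self, parent_depth, links):
        # Called after a page is parsed; returns the links that were scheduled.
        return [url for url in links if self.schedule(url, parent_depth + 1)]


strategy = BreadthFirstStrategy(max_depth=1)
strategy.schedule("http://example.com/", depth=0)
accepted = strategy.links_extracted(0, ["http://example.com/a", "http://example.com/"])
rejected = strategy.links_extracted(1, ["http://example.com/b"])
```

Here the strategy only decides what to schedule and at what priority; how the scheduled entries are stored and batched is left to whatever sits behind the queue.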

References

[1] "Releases · Scrapinghub/frontera". github.com. Retrieved 2021-04-07.
[2] Sibiryakov, Alexander (29 Mar 2017). "Frontera: архитектура фреймворка для обхода веба и текущие проблемы" [Frontera: architecture of a web crawling framework and current problems]. Habrahabr (in Russian).
[3] Sibiryakov, Alexander (15 Oct 2015). "frontera-open-source-large-scale-web-crawling-framework". Speakerdeck.
[4] "Open Catalog, Memex (Domain-Specific Search)". Archived from the original on 2018-06-22. Retrieved 2021-11-09.

This article "Frontera (web crawling)" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Frontera (web crawling). Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.
