
Apache Airflow






Apache Airflow is an open-source tool for programmatically authoring, scheduling, and monitoring workflows.[1]

The project was started by Airbnb engineer Maxime Beauchemin in 2014. It joined the Apache Software Foundation's incubation program in March 2016 and officially graduated from incubation in January 2019.[2][3] As of September 2019, the [Airflow project on GitHub](https://github.com/apache/airflow) has 937 active contributors and 7,040 commits.[4]

History

Airflow was originally created by data engineer Maxime Beauchemin to help manage the ever-growing network of data pipelines behind Airbnb's data infrastructure. Beauchemin had spent previous years at Facebook, where he made heavy use of Facebook's Dataswarm tool. While Dataswarm was useful for Facebook's processes, it was not publicly available, so Beauchemin decided to write and open-source Airflow to help other data teams solve their pipeline scheduling problems.[5]

Core concepts

Airflow is designed to be flexible, extensible, and fully code-based. It is written in Python and allows users to interface with any third-party Python API, database, or external system. It can also run tasks written in other languages, and is compatible with Kubernetes, AWS S3, Docker, and more.[6]

DAGs

DAG stands for "Directed Acyclic Graph." Each DAG represents a collection of tasks to be run. Relationships between tasks are shown directly in the Airflow UI.[7] These relationships are architected as DAGs for the following reasons:[8]

  1. Directed: If multiple tasks exist, each must have at least one defined upstream or downstream task.
  2. Acyclic: Tasks are not allowed to depend on themselves, either directly or through a chain of downstream tasks. This avoids creating infinite loops.
  3. Graph: All tasks are laid out in a clear structure with processes occurring at clear points with set relationships to other tasks.
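
The three properties above can be illustrated with a small sketch that uses only Python's standard library rather than Airflow itself. The task names and the `execution_order` helper are hypothetical, chosen for illustration; the sketch only shows how a dependency graph yields a run order and why a cycle makes scheduling impossible.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each key maps a task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(deps):
    """Return one valid run order, or raise CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())


# Directed and acyclic: upstream tasks always come before downstream ones.
order = execution_order(dependencies)
print(order)

# Making "extract" depend on "report" closes a loop, so no valid order exists.
cyclic = dict(dependencies, extract={"report"})
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Because the second graph contains a loop, no task in it could ever be scheduled first, which is why a workflow has to be acyclic.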

Tasks

Tasks represent each node of a defined DAG. They often do some sort of work, such as interfacing with a third-party API, loading data into a database, or pulling files from a file bucket.

Operators

Operators in Airflow determine the actual work that gets done.[9] They define a single task, or one node of a DAG. DAGs make sure that operators get scheduled and run in a certain order, while operators define the work that must be done at each step of the process.
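
The division of labor between DAGs and operators can be sketched in plain Python. This is a toy model and not Airflow's API: `OperatorSketch`, `PythonCallableOperator`, and `run_dag` are hypothetical names, and passing results through a shared dictionary is a simplification made for the sketch.

```python
from graphlib import TopologicalSorter


class OperatorSketch:
    """Hypothetical stand-in for an operator: one unit of work, one DAG node."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()  # task_ids that must finish before this one

    def set_upstream(self, other):
        self.upstream.add(other.task_id)

    def execute(self, context):
        raise NotImplementedError


class PythonCallableOperator(OperatorSketch):
    """Runs a plain Python function when the task executes."""

    def __init__(self, task_id, func):
        super().__init__(task_id)
        self.func = func

    def execute(self, context):
        return self.func(context)


def run_dag(operators):
    """Toy scheduler: decide the order from the graph, then let each operator work."""
    by_id = {op.task_id: op for op in operators}
    graph = {op.task_id: op.upstream for op in operators}
    context = {"results": {}}
    for task_id in TopologicalSorter(graph).static_order():
        context["results"][task_id] = by_id[task_id].execute(context)
    return context["results"]


extract = PythonCallableOperator("extract", lambda ctx: [1, 2, 3])
double = PythonCallableOperator(
    "double", lambda ctx: [x * 2 for x in ctx["results"]["extract"]]
)
total = PythonCallableOperator("total", lambda ctx: sum(ctx["results"]["double"]))
double.set_upstream(extract)
total.set_upstream(double)

# The list order does not matter; the declared dependencies decide the run order.
results = run_dag([total, double, extract])
print(results)
```

In this sketch the graph alone determines when each step runs, while each operator only knows how to do its own piece of work.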

Hooks

Hooks are Airflow's way of interfacing with third-party systems. They allow you to connect to any Python API or database.
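
As a rough illustration of the idea, a hook can be thought of as a small class that hides connection handling behind a few methods. The sketch below is a hypothetical stand-in built on Python's built-in `sqlite3` module and an in-memory database; the class name `SqliteHookSketch` and its methods are assumptions for this example and are not presented as Airflow's own hook classes.

```python
import sqlite3


class SqliteHookSketch:
    """Hypothetical hook: hides connection details behind a small interface."""

    def __init__(self, database=":memory:"):
        self.database = database
        self._conn = None

    def get_conn(self):
        # Open the connection lazily and reuse it for later calls.
        if self._conn is None:
            self._conn = sqlite3.connect(self.database)
        return self._conn

    def run(self, sql, parameters=()):
        conn = self.get_conn()
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, parameters)

    def get_records(self, sql, parameters=()):
        return self.get_conn().execute(sql, parameters).fetchall()


# Code using the hook never touches the connection object directly.
hook = SqliteHookSketch()
hook.run("CREATE TABLE events (name TEXT, count INTEGER)")
hook.run("INSERT INTO events VALUES (?, ?)", ("signup", 3))
rows = hook.get_records("SELECT name, count FROM events")
print(rows)
```

Keeping connection logic in one place means the code that does the work can stay focused on what to do with the external system rather than how to reach it.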

Managed providers

Since Airflow was created, several managed providers have emerged to offer ancillary services around the core open-source project. Astronomer has built a SaaS tool and a [Kubernetes](https://kubernetes.io)-deployable Airflow stack that assists with monitoring, alerting, DevOps, and cluster management.[10] Google Cloud Composer is a managed version of Airflow that runs on GCP and integrates well with other GCP services.[11] Prefect is a stealth-mode service that was started by one of Airflow's core contributors; it appears to be building a full platform that integrates some aspects of Airflow's conceptual design.

References

  1. Thusoo, Ashish (August 20, 2019). "From Spark To Airflow And Presto: Demystifying The Fast-Moving Cloud Data Stack". Forbes. Retrieved September 18, 2019.
  2. "Project — Airflow Documentation". airflow.apache.org.
  3. "Apache Airflow is now a Top-Level Project". SD Times. 2019-01-08. Retrieved 2019-09-18.
  4. "Apache Airflow. Contribute to apache/airflow development by creating an account on GitHub". 19 February 2019 – via GitHub.
  5. "The Origins of Airflow" – via soundcloud.com.
  6. Khudairi, Sally (January 8, 2019). "The Apache Software Foundation Announces Apache Airflow as a Top-Level Project". globenewswire. Retrieved September 18, 2019.
  7. "Concepts — Airflow Documentation". airflow.apache.org.
  8. Astronomer, Inc. "What Exactly is a DAG?". Astronomer.
  9. Feng, Tao (2019-04-04). "Running Apache Airflow At Lyft". Medium. Retrieved 2019-09-18.
  10. Lipp, Cassie (July 13, 2018). "Astronomer is Now the Apache Airflow Company". americaninno. Retrieved September 18, 2019.
  11. "Google launches Cloud Composer, a new workflow automation tool for developers". TechCrunch. Retrieved 2019-09-18.


This article "Apache Airflow" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Apache Airflow. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.
