Orchestrate Data

Most EL(T) pipelines aren’t run just once, but over and over again, to make sure additions and changes in the source eventually make their way to the destination.

To help you realize this, Meltano supports scheduled pipelines that can be orchestrated using Apache Airflow.

When a new pipeline schedule is created using the UI or CLI, a DAG is automatically created in Airflow as well, which represents “a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies”.

Create a Schedule

To regularly schedule your ELT to run, use the “Pipelines” interface in the UI, or the meltano schedule command:

meltano schedule [SCHEDULE_NAME] [EXTRACTOR_NAME] [TARGET_NAME] [INTERVAL]

Example:

meltano schedule carbon__sqlite tap-carbon-intensity target-sqlite @daily

Now that you’ve scheduled your first pipeline, you can load the “Pipelines” page in the UI and see it show up.

Installing Airflow

While you can use Meltano’s CLI or UI to define pipeline schedules, actually executing them is the orchestrator’s responsibility, so let’s install Airflow:

Change directories so that you are inside your Meltano project, and then run the following command to add the default DAG generator to your project and make Airflow available to use via meltano invoke:

meltano add orchestrator airflow

Using an existing Airflow installation

You can also use the Meltano DAG generator with an existing Airflow installation, as long as the MELTANO_PROJECT_ROOT environment variable is set to point at your Meltano project.

In fact, all meltano invoke airflow ... does is populate MELTANO_PROJECT_ROOT, set Airflow’s core.dags_folder setting to $MELTANO_PROJECT_ROOT/orchestrate/dags (where the DAG generator lives by default), and invoke the airflow executable with the provided arguments.
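
The three steps above can be sketched in Python. This is a simplified illustration, not Meltano's actual implementation: the function names `build_airflow_invocation` and `run_airflow` are hypothetical, and the sketch assumes the `airflow` executable is on your `PATH`. It relies on Airflow's convention of reading any setting from an `AIRFLOW__<SECTION>__<KEY>` environment variable, so `core.dags_folder` becomes `AIRFLOW__CORE__DAGS_FOLDER`.

```python
import os
import subprocess


def build_airflow_invocation(project_root, airflow_args, base_env=None):
    """Return (argv, env) roughly matching what `meltano invoke airflow ...` prepares.

    Hypothetical helper for illustration; Meltano's real code also applies
    plugin configuration before invoking Airflow.
    """
    root = os.path.abspath(project_root)
    env = dict(os.environ if base_env is None else base_env)

    # 1. Point the DAG generator at the Meltano project.
    env["MELTANO_PROJECT_ROOT"] = root

    # 2. Set Airflow's core.dags_folder through its environment variable form.
    env["AIRFLOW__CORE__DAGS_FOLDER"] = os.path.join(root, "orchestrate", "dags")

    # 3. Invoke the airflow executable with the provided arguments.
    argv = ["airflow", *airflow_args]
    return argv, env


def run_airflow(project_root, airflow_args):
    """Run Airflow with the prepared environment (assumes `airflow` is on PATH)."""
    argv, env = build_airflow_invocation(project_root, airflow_args)
    return subprocess.run(argv, env=env, check=True)
```

With an existing Airflow installation, exporting the same two environment variables before starting `airflow scheduler` should have a comparable effect.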

You can add the Meltano DAG generator to your project without also installing the Airflow orchestrator plugin by adding the airflow file bundle:

meltano add files airflow

Now, you’ll want to copy the DAG generator into your Airflow installation’s dags_folder, or reconfigure Airflow to look in your project’s orchestrate/dags directory instead.

This setup assumes you’ll use meltano schedule to schedule your meltano elt pipelines, as described above, since the DAG generator iterates over the result of meltano schedule list --format=json and creates DAGs for each. However, you can also create your own Airflow DAGs for any pipeline you fancy by using BashOperator with the meltano elt command, or DockerOperator with a project-specific Docker image.
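
To make the generator's approach more concrete, here is a minimal sketch of turning the schedule list into one DAG description per schedule. It is not the bundled generator. It assumes the JSON output is an array of objects with `name`, `extractor`, `loader`, and `interval` keys, which may differ in your Meltano version. The function `dag_specs_from_schedules`, the `meltano_` DAG ID prefix, and the use of `--job_id` are illustrative choices as well.

```python
import json
import shlex


def dag_specs_from_schedules(schedules_json, project_root):
    """Build one plain-dict DAG description per schedule.

    `schedules_json` is the text printed by
    `meltano schedule list --format=json` (assumed shape: a JSON array of
    objects with "name", "extractor", "loader" and "interval").
    """
    specs = []
    for schedule in json.loads(schedules_json):
        # The shell command a BashOperator task could run for this schedule.
        bash_command = " ".join(
            [
                "cd",
                shlex.quote(project_root),
                "&&",
                "meltano",
                "elt",
                shlex.quote(schedule["extractor"]),
                shlex.quote(schedule["loader"]),
                "--job_id=" + shlex.quote(schedule["name"]),
            ]
        )
        specs.append(
            {
                "dag_id": "meltano_" + schedule["name"],  # hypothetical naming
                "schedule_interval": schedule["interval"],
                "bash_command": bash_command,
            }
        )
    return specs
```

In a real DAG file, each description would be handed to Airflow's `DAG` and `BashOperator` classes. The same `bash_command` string is what you would pass to a `BashOperator` in a hand-written DAG.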

Starting the Airflow scheduler

Now that Airflow is installed and (automatically) configured to look at your project’s Meltano DAG generator, let’s start the scheduler:

meltano invoke airflow scheduler

(Add -D to run the scheduler in the background)

Airflow will now run your pipelines on a schedule as long as the scheduler is running!

Using Airflow directly

You are free to interact with Airflow directly through its own UI. You can start the web server like this:

meltano invoke airflow webserver

(Add -D to run the webserver in the background)

By default, you’ll only see Meltano’s pipeline DAGs here, which are created automatically using the dynamic DAG generator included with every Meltano project, located at orchestrate/dags/meltano.py.

You can use the bundled Airflow with custom DAGs by putting them inside the orchestrate/dags directory, where they’ll be picked up by Airflow automatically. To learn more, check out the Apache Airflow documentation.

Meltano’s use of Airflow will be unaffected by other usage of Airflow as long as orchestrate/dags/meltano.py remains untouched and pipelines are managed through the dedicated interface.

Other things you can do with Airflow

Currently, meltano invoke gives you raw access to the underlying plugin after any configuration hooks have run.

View the ‘meltano’ DAGs:

meltano invoke airflow dags list

Manually trigger a task to run:

meltano invoke airflow tasks run --raw meltano extract_load $(date -I)

Start the Airflow UI in the background (open it in a separate browser tab):

meltano invoke airflow webserver -D

Start the Airflow scheduler, enabling background job processing if you’re not already running Meltano UI:

meltano invoke airflow scheduler -D

Trigger a DAG run:

meltano invoke airflow dags trigger meltano

Airflow is a full-featured orchestrator with many features that are currently outside Meltano’s scope. As we improve this integration, Meltano will expose more of these features to create a seamless experience with this orchestrator. Please refer to the Airflow documentation for more in-depth knowledge about Airflow.

Meltano UI

While Meltano is optimized for usage through the meltano CLI, basic pipeline management functionality is also available in the UI.