In this article, we will show how to automate the creation of a Glue workflow, including its triggers, jobs, and crawlers, using CloudFormation as an Infrastructure as Code (IaC) framework.
AWS Glue Components
First, let's present the main Glue components:
Crawler: Populates the AWS Glue Data Catalog with tables. It retrieves data from one or more data stores using built-in or custom classifiers, then creates or updates one or more tables in the Data Catalog.
Job: Runs the ETL script that connects to the source data, processes it, and writes it out to your data target. It represents the business logic that carries out an ETL task, using Apache Spark with Python or Scala.
Trigger: Starts an ETL job or crawler on a schedule, in response to an event, or on demand.
Workflow: Used to create and visualize complex ETL activities involving multiple crawlers, jobs, and triggers. It also manages the execution and monitoring of all these components.
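As a rough sketch, each of these components maps to a CloudFormation resource type (the logical names below are placeholders, and the required properties are omitted for brevity):

```yaml
Resources:
  MyWorkflow:
    Type: AWS::Glue::Workflow   # orchestrates the components below
  MyCrawler:
    Type: AWS::Glue::Crawler    # populates the Data Catalog
  MyJob:
    Type: AWS::Glue::Job        # runs the Spark ETL script
  MyTrigger:
    Type: AWS::Glue::Trigger    # starts the job or crawler
```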
How To Build Workflows?
To start a workflow, we need either a SCHEDULED or an ON_DEMAND trigger. The rest of the workflow triggers should be CONDITIONAL. CONDITIONAL triggers within workflows can be fired by both jobs and crawlers. This way, CloudFormation knows how to build the Directed Acyclic Graph (DAG) of the different components. On the console, we can visualize the components and the graph of the workflow.
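For illustration, a CONDITIONAL trigger that starts a job once a crawler succeeds could look like this (the logical names are placeholders referring to resources assumed to exist in the same template):

```yaml
ConditionalTrigger:
  Type: AWS::Glue::Trigger
  Properties:
    Name: start-job-after-crawler
    Type: CONDITIONAL
    StartOnCreation: true              # activate the trigger when it is created
    WorkflowName: !Ref MyWorkflow      # attach the trigger to the workflow
    Actions:
      - JobName: !Ref MyJob            # what the trigger starts
    Predicate:
      Conditions:
        - LogicalOperator: EQUALS
          CrawlerName: !Ref MyCrawler  # what the trigger waits for
          CrawlState: SUCCEEDED
```

Note that conditions on crawlers use `CrawlerName`/`CrawlState`, while conditions on jobs use `JobName`/`State`.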
As a workflow runs each component, it records execution progress and status, providing both an overview of the larger task and the details of each step.
To share and manage state throughout a workflow run, we can define default workflow run properties that will be available to all the jobs in the workflow. All parameters used as part of each execution are defined as JSON in the DefaultRunProperties of the CloudFormation template.
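For example, a workflow with default run properties might be declared as follows (the workflow name and property keys/values are placeholders):

```yaml
MyWorkflow:
  Type: AWS::Glue::Workflow
  Properties:
    Name: my-etl-workflow
    DefaultRunProperties:             # available to every job in the run
      environment: dev                # placeholder key/value pairs
      target_bucket: my-output-bucket
```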
We will now show you how to build a workflow that extracts JSON data from one S3 bucket, transforms it to Parquet, and loads it into another S3 bucket.
This workflow has two crawlers and a job that runs when the first crawler ends. The first trigger is a SCHEDULED trigger that fires the first crawler every day at 06:00 UTC. When the first crawler ends successfully, the second trigger launches the Glue job. Finally, when the job succeeds, the last trigger fires and starts the second crawler.
CloudFormation code example:
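The following is a minimal sketch of the template described above. The bucket paths, database names, script location, and resource names are placeholders you would replace with your own, and the IAM role is passed in as a parameter:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Glue workflow that converts JSON to Parquet (placeholder names)

Parameters:
  GlueRoleArn:
    Type: String
    Description: IAM role ARN that Glue assumes for the crawlers and the job

Resources:
  JsonToParquetWorkflow:
    Type: AWS::Glue::Workflow
    Properties:
      Name: json-to-parquet-workflow
      DefaultRunProperties:
        environment: dev

  RawJsonCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: raw-json-crawler
      Role: !Ref GlueRoleArn
      DatabaseName: raw_db
      Targets:
        S3Targets:
          - Path: s3://my-raw-bucket/json/

  ParquetCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: parquet-crawler
      Role: !Ref GlueRoleArn
      DatabaseName: curated_db
      Targets:
        S3Targets:
          - Path: s3://my-curated-bucket/parquet/

  JsonToParquetJob:
    Type: AWS::Glue::Job
    Properties:
      Name: json-to-parquet-job
      Role: !Ref GlueRoleArn
      GlueVersion: "3.0"
      Command:
        Name: glueetl
        ScriptLocation: s3://my-scripts-bucket/json_to_parquet.py

  # SCHEDULED trigger: fires the first crawler every day at 06:00 UTC
  ScheduledTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: daily-start-trigger
      Type: SCHEDULED
      Schedule: cron(0 6 * * ? *)
      StartOnCreation: true
      WorkflowName: !Ref JsonToParquetWorkflow
      Actions:
        - CrawlerName: !Ref RawJsonCrawler

  # CONDITIONAL trigger: starts the job when the first crawler succeeds
  JobTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: start-job-trigger
      Type: CONDITIONAL
      StartOnCreation: true
      WorkflowName: !Ref JsonToParquetWorkflow
      Actions:
        - JobName: !Ref JsonToParquetJob
      Predicate:
        Conditions:
          - LogicalOperator: EQUALS
            CrawlerName: !Ref RawJsonCrawler
            CrawlState: SUCCEEDED

  # CONDITIONAL trigger: starts the second crawler when the job succeeds
  FinalCrawlerTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: start-final-crawler-trigger
      Type: CONDITIONAL
      StartOnCreation: true
      WorkflowName: !Ref JsonToParquetWorkflow
      Actions:
        - CrawlerName: !Ref ParquetCrawler
      Predicate:
        Conditions:
          - LogicalOperator: EQUALS
            JobName: !Ref JsonToParquetJob
            State: SUCCEEDED
```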
You can also find the code in this Gist file:
Note: for ON_DEMAND and CONDITIONAL triggers, you need to ensure that `LogicalOperator: EQUALS` is set in your CloudFormation template.
In this post, we discussed setting up a Glue Workflow to orchestrate data pipelines of varying complexity.
There are many benefits to be gained from using Glue Workflows, such as tracking the progress of each workflow element independently, or of the entire workflow, which makes it easier to troubleshoot your pipelines.