Galaxy Airflow Quickstart

This repository contains a quickstart guide for setting up Galaxy and Airflow using Docker. It provides instructions to clone the repository, set up the necessary environment variables, build the Docker images, and start the services.

Prerequisites

Before getting started, ensure you have the following:

  • Docker installed on your machine
  • Access to a Galaxy instance
  • Cluster connection information (username, password, host name) from the Galaxy UI

Installation

  1. git clone https://github.com/YCat33/galaxy_airflow_quickstart.git
  2. cd galaxy_airflow_quickstart
  3. Navigate to your Galaxy domain.
  4. Use the Clusters page in the Galaxy UI to locate your connection variables.
  5. Make the bash setup script executable:
  chmod +x setup.sh
  6. Run the setup script, passing the connection variables from step 4:
  ./setup.sh '<host>' '<user>' '<password>'

This script performs the following steps:

1. Runs the encode_special_chars script, which converts the connection parameters to the required format (e.g. replacing "@" with "%40").
2. Builds the image from the Dockerfile, which installs the "apache-airflow-providers-trino" package and sets up the Galaxy connection. These variables are used within the docker-compose.yaml file to instantiate a connection to Starburst Galaxy (see line 75 [here](https://github.com/YCat33/galaxy_airflow_quickstart/blob/31b28bbf9237b26cddbab380f416e80384e65cd3/docker-compose.yaml#L75)).
3. Deploys the necessary Docker containers as defined in the docker-compose file.
  7. Navigate to localhost:8080 in your browser and log in using "airflow" as both the username and the password.
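
The encoding step performed by the setup script can be illustrated with a minimal Python sketch. This is not the repository's encode_special_chars script: the function names, the `trino://` URI layout, the default port 443, and the placeholder host and credentials are all assumptions made for illustration.

```python
from urllib.parse import quote


def encode_special_chars(value: str) -> str:
    """Percent-encode reserved characters, e.g. "@" becomes "%40"."""
    # safe="" also encodes "/", which quote() leaves untouched by default.
    return quote(value, safe="")


def build_connection_uri(host: str, user: str, password: str, port: int = 443) -> str:
    """Assemble a hypothetical Airflow-style connection URI from encoded parts."""
    encoded_user = encode_special_chars(user)
    encoded_password = encode_special_chars(password)
    return f"trino://{encoded_user}:{encoded_password}@{host}:{port}"


if __name__ == "__main__":
    # Placeholder values only; substitute the details from the Clusters page.
    print(build_connection_uri("galaxy.example.com", "jane@example.com", "p@ss/word"))
```

Encoding matters because "@", ":" and "/" act as delimiters inside a connection URI, so an unencoded user name or password containing them would be parsed incorrectly.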

Example DAG (starburst-galaxy-example)

  1. Task 1 (select_star) uses the TrinoOperator to execute a SQL select statement. It counts the number of records in the "tpch.tiny.customer" table and stores the result.
  2. Task 2 (print_number) is a PythonOperator that calls the print_command method. It retrieves the return value from Task 1 and prints it to the logs.
  3. Task 3 (data_validation_check) is an SQLColumnCheckOperator that performs a data quality check. It verifies that the number of distinct values in the "custkey" column of the "tpch.tiny.customer" table equals 1500.
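
The three tasks above could be wired together roughly as follows. This is a sketch rather than the repository's actual DAG file. It assumes Airflow 2.4 or later and a version of the Trino provider that still ships TrinoOperator. The connection ID "trino_default", the start date, the `handler=list` argument, and the linear task ordering are also assumptions. The Airflow imports sit inside the hypothetical build_dag function so that the helper pieces can be read and exercised without Airflow installed.

```python
from datetime import datetime

# Check configuration for SQLColumnCheckOperator: the number of distinct
# custkey values must equal 1500.
COLUMN_MAPPING = {"custkey": {"distinct_check": {"equal_to": 1500}}}


def print_command(ti):
    """Pull the value returned by select_star from XCom and print it to the task log."""
    result = ti.xcom_pull(task_ids="select_star")
    print(f"select_star returned: {result}")
    return result


def build_dag(conn_id: str = "trino_default"):
    """Build the example DAG; conn_id is an assumed connection name."""
    # Imported lazily so the helpers above work without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.common.sql.operators.sql import SQLColumnCheckOperator
    from airflow.providers.trino.operators.trino import TrinoOperator

    with DAG(
        dag_id="starburst-galaxy-example",
        start_date=datetime(2023, 1, 1),  # arbitrary date in the past
        schedule=None,  # run only when triggered manually
        catchup=False,
    ) as dag:
        select_star = TrinoOperator(
            task_id="select_star",
            trino_conn_id=conn_id,
            sql="SELECT count(*) FROM tpch.tiny.customer",
            handler=list,  # materialize the rows so they are pushed to XCom
        )
        print_number = PythonOperator(
            task_id="print_number",
            python_callable=print_command,
        )
        data_validation_check = SQLColumnCheckOperator(
            task_id="data_validation_check",
            conn_id=conn_id,
            table="tpch.tiny.customer",
            column_mapping=COLUMN_MAPPING,
        )
        select_star >> print_number >> data_validation_check
    return dag
```

In a real DAG file the result of build_dag() would be assigned to a module-level variable so that the Airflow scheduler can discover it.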

Running the Example DAG

  1. From the Airflow UI home screen, you should see a single DAG titled "starburst-galaxy-example".
  2. Trigger the DAG by clicking the "play" button on the right-hand side of the screen.
  3. View the logs for each task.