This repository has been archived by the owner on Apr 11, 2024. It is now read-only.

Editing pass on README.md #43

Open · wants to merge 2 commits into base `main`
74 changes: 44 additions & 30 deletions README.md
@@ -8,71 +8,87 @@ transfers made easy<br><br>

[![CI](https://github.com/astronomer/apache-airflow-provider-transfers/actions/workflows/ci-uto.yaml/badge.svg)](https://github.com/astronomer/apache-airflow-provider-transfers)

The **Universal Transfer Operator** simplifies how users transfer data from a source to a destination using [Apache Airflow](https://airflow.apache.org/). It offers a consistent agnostic interface, improving the users' experience so they do not need to use explicitly specific providers or operators.
The **UniversalTransferOperator** simplifies how you transfer data from a source to a destination using [Apache Airflow](https://airflow.apache.org/). Its agnostic interface eliminates the need to use specific providers or operators.

At the moment, it supports transferring data between [file locations](https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/src/universal_transfer_operator/constants.py#L26-L32) and [databases](https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/src/universal_transfer_operator/constants.py#L72-L74) (in both directions) and cross-database transfers.
At the moment, it supports transferring data between [file locations](https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/src/universal_transfer_operator/constants.py#L26-L32) and [databases](https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/src/universal_transfer_operator/constants.py#L72-L74), as well as cross-database transfers.

This project is maintained by [Astronomer](https://astronomer.io).

## Installation

```
```sh
pip install apache-airflow-provider-transfers
```


## Example DAGs

Checkout the [example_dags](./example_dags) folder for examples of how the UniversalTransfeOperator can be used.
See the [example_dags](./example_dags) folder for examples of how you can use the UniversalTransferOperator.


## How Universal Transfer Operator Works
## How the UniversalTransferOperator works

![Approach](./docs/images/approach.png)

With Universal Transfer Operator, users can perform data transfers using the following transfer modes:
The purpose of the UniversalTransferOperator is to move data from a source dataset to a destination dataset. Your datasets can be defined as `Files` or as `Tables`.
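
For example, a source or a destination can be declared along these lines — a minimal sketch that assumes the `File` and `Table` dataset classes and the import paths shown below; the example DAGs linked later in this README are the canonical reference:

```python
from universal_transfer_operator.datasets.file.base import File
from universal_transfer_operator.datasets.table import Table

# A file-based dataset: the path scheme (s3://, gs://, ...) selects the file
# location, and conn_id points at the matching Airflow connection.
source_dataset = File(path="s3://<your-bucket>/uto/", conn_id="aws_default")

# A database-backed dataset, addressed by table name and Airflow connection.
destination_dataset = Table(name="uto_example_table", conn_id="snowflake_default")
```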

Instead of using different operators for each of your transfers, the UniversalTransferOperator supports three universal transfer types:

1. Non-native
2. Native
3. Third-party
- Non-native transfers
- Native transfers
- Third-party transfers

### Non-native transfers
**Author:**
Is it too late to rename these transfer types / is this the common term for this type of transfer? This was unintuitive to me, because I considered a worker-driven transfer to be native, but from the perspective that it's native to Airflow. I'd prefer if we called these:

- Worker transfer
- Dataset transfer
- Third-party transfer

**Collaborator:**

@jwitz worker_transfer makes sense

But dataset_transfer doesn't make sense to me. Can you elaborate on this, please? I still feel that native transfers signify natively transferring without involving the worker node. Open to suggestions for better naming for this.

**Collaborator:**

@jwitz I think we can still rename these transfers, as we have not done a major release yet.
IMO,

  1. Worker transfer - Makes sense to me as well
  2. For Native transfer - We can maybe use - direct / peer to peer
  3. Third-party transfer - This is perfect as it is.

Also, maybe we can lose the transfer from names since we already call them transfer modes.
WDYT?

**Collaborator:**

IMO - we should rename them to the below to be more easily understood by users:

1."local"
2."optimized"
3."third-party"

**Author:**

I'm not sure "optimized" will help users understand exactly what's going on. Optimized can mean many things, and some might consider third party to be the "optimized" solution for their use case.

I like "peer-to-peer" or "direct" @sunank200 !

@utkarsharma2 I think we should still keep "Transfer", because it helps us communicate what these modes are in documentation. If you want to remove "transfer" at the code level in terms of how people specify these modes in the operator parameters, I think that's fine.

**Collaborator (utkarsharma2, Mar 30, 2023):**

@jwitz Sure we can keep the transfer in docs, but we can remove it from the code. @sunank200 @phanikumv WDYT?

Also, if the choice is between peer-to-peer and direct, I prefer peer-to-peer since it's a well known concept, and would be obvious to users.


### Non-native transfer
In a non-native transfer, you transfer data from a source to a destination through Airflow workers. Chunking is applied where possible. This method can be suitable for datasets smaller than 2 GB. However, the performance of this method depends on the worker's memory, disk, processor, and network configuration.

Non-native transfers rely on transferring the data through the Airflow worker node. Chunking is applied where possible. This method can be suitable for datasets smaller than 2GB, depending on the source and target. The performance of this method is highly dependent upon the worker's memory, disk, processor and network configuration.
To use this type of transfer, you provide the UniversalTransferOperator with:

Internally, the steps involved are:
- Retrieve the dataset data in chunks from dataset storage to the worker node.
- Send data to the cloud dataset from the worker node.
- A `task_id`.
- A `source_dataset`, defined as a `File` or `Table`.
- A `destination_dataset`, defined as a `File` or `Table`.

When you initiate the transfer, the following happens in Airflow:

- The worker retrieves the dataset in chunks from the data source.
- The worker sends data to the destination dataset.

Following is an example of a non-native transfer between Google Cloud Storage and SQLite:

https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/example_dags/example_universal_transfer_operator.py#L37-L41
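
For reference, here is a condensed sketch of that non-native transfer — the import paths, connection IDs, and bucket path are illustrative assumptions; the linked example DAG is the canonical version:

```python
from datetime import datetime

from airflow import DAG

from universal_transfer_operator.constants import FileType
from universal_transfer_operator.datasets.file.base import File
from universal_transfer_operator.datasets.table import Table
from universal_transfer_operator.universal_transfer_operator import UniversalTransferOperator

with DAG(dag_id="example_non_native_transfer", start_date=datetime(2023, 1, 1), schedule=None):
    # The Airflow worker pulls the CSV files from GCS in chunks and loads them
    # into a SQLite table; no external transfer service is involved.
    transfer_non_native_gs_to_sqlite = UniversalTransferOperator(
        task_id="transfer_non_native_gs_to_sqlite",
        source_dataset=File(
            path="gs://<your-bucket>/uto/",  # hypothetical bucket path
            conn_id="google_cloud_default",
            filetype=FileType.CSV,
        ),
        destination_dataset=Table(name="uto_gs_to_sqlite_table", conn_id="sqlite_default"),
    )
```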

### Native transfers

### Improving bottlenecks by using native transfer
In a native transfer, Airflow relies on the mechanisms and tools offered by your data source and destination to facilitate the transfer. For example, when you use a native transfer to move data from object storage to a Snowflake database, Airflow calls on Snowflake to run the ``COPY INTO`` command. Similarly, when loading data from S3 to BigQuery, the UniversalTransferOperator calls on the GCP Storage Transfer Service to facilitate the data transfer.

An alternative to using the Non-native transfer method is the native method. The native transfers rely on mechanisms and tools offered by the data source or data target providers. In the case of moving from object storage to a Snowflake database, for instance, a native transfer consists in using the built-in ``COPY INTO`` command. When loading data from S3 to BigQuery, the Universal Transfer Operator uses the GCP Storage Transfer Service.
The benefit of native transfers is that they can perform better for larger datasets (over 2 GB) and don't rely on the Airflow worker node hardware configuration. Airflow worker nodes are used only as orchestrators and don't perform any data operations. The speed depends exclusively on the service being used and the bandwidth between the source and destination.

The benefit of native transfers is that they will likely perform better for larger datasets (2 GB) and do not rely on the Airflow worker node hardware configuration. With this approach, the Airflow worker nodes are used as orchestrators and do not perform the transfer. The speed depends exclusively on the service being used and the bandwidth between the source and destination.
When you initiate the transfer, the following happens in Airflow:

Steps:
- Request destination dataset to ingest data from the source dataset.
- Destination dataset requests source dataset for data.
- The worker calls on the destination dataset to ingest data from the source dataset.
- The destination dataset runs the necessary steps to request and ingest data from the source dataset.

> **_NOTE:_**
The Native method implementation is in progress and will be available in future releases.
> **Note**
> The Native method implementation is in progress and will be available in future releases.
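
Once native transfers are available, usage is expected to mirror the other modes. The following is a speculative sketch only — it assumes a `TransferMode.NATIVE` value exists alongside `TransferMode.THIRDPARTY`, and the connection IDs, table name, and path are placeholders:

```python
from universal_transfer_operator.constants import TransferMode
from universal_transfer_operator.datasets.file.base import File
from universal_transfer_operator.datasets.table import Table
from universal_transfer_operator.universal_transfer_operator import UniversalTransferOperator

# Speculative: ask Snowflake to ingest directly from S3 (e.g. via COPY INTO),
# so no data passes through the Airflow worker.
transfer_native_s3_to_snowflake = UniversalTransferOperator(
    task_id="transfer_native_s3_to_snowflake",
    source_dataset=File(path="s3://<your-bucket>/uto/", conn_id="aws_default"),
    destination_dataset=Table(name="uto_s3_to_snowflake_table", conn_id="snowflake_default"),
    transfer_mode=TransferMode.NATIVE,
)
```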

### Third-party transfers

### Transfer using a third-party tool
The Universal Transfer Operator can also offer an interface to generic third-party services that transfer data, similar to Fivetran.
In a third-party transfer, the UniversalTransferOperator calls on a third-party service to facilitate your data transfer, such as Fivetran.

Here is an example of how to use Fivetran for transfers:
To complete a third-party transfer, you provide the UniversalTransferOperator with:

https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/example_dags/example_dag_fivetran.py#L52-L58
- A source dataset, defined as a `Table` or `File`.
- A destination dataset, defined as a `Table` or `File`.
- The parameter `transfer_mode=TransferMode.THIRDPARTY`.
- `transfer_params` for the third-party tool.

When you initiate the transfer, the following happens in Airflow:

- The worker calls on the third-party tool to facilitate the data transfer.

Currently, Fivetran is the only supported third-party tool. See [`fivetran.py`](https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/src/universal_transfer_operator/integrations/fivetran.py) for a complete list of parameters that you can set to determine how Fivetran completes the transfer.

Here is an example of how to use Fivetran for transfers:

https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/example_dags/example_dag_fivetran.py#L52-L58
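
A condensed sketch of the linked Fivetran example follows — the `FiveTranOptions` fields, connection IDs, and connector ID shown here are illustrative assumptions; `fivetran.py` and the example DAG are the authoritative reference:

```python
from universal_transfer_operator.constants import TransferMode
from universal_transfer_operator.datasets.file.base import File
from universal_transfer_operator.datasets.table import Table
from universal_transfer_operator.integrations.fivetran import FiveTranOptions
from universal_transfer_operator.universal_transfer_operator import UniversalTransferOperator

# Third-party transfer: the Airflow worker only orchestrates; Fivetran moves
# the data from S3 into Snowflake using an existing connector.
transfer_fivetran = UniversalTransferOperator(
    task_id="transfer_fivetran_with_connector_id",
    source_dataset=File(path="s3://<your-bucket>/uto/", conn_id="aws_default"),
    destination_dataset=Table(name="fivetran_test_table", conn_id="snowflake_default"),
    transfer_mode=TransferMode.THIRDPARTY,
    transfer_params=FiveTranOptions(conn_id="fivetran_default", connector_id="<your-connector-id>"),
)
```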

## Supported technologies

@@ -84,7 +100,6 @@ https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/exampl

https://github.com/astronomer/apache-airflow-provider-transfers/blob/main/src/universal_transfer_operator/constants.py#L26-L32


## Documentation

The documentation is a work in progress -- we aim to follow the [Diátaxis](https://diataxis.fr/) system.
@@ -93,7 +108,6 @@ The documentation is a work in progress -- we aim to follow the [Diátaxis](http

- **[Getting Started Tutorial](https://apache-airflow-provider-transfers.readthedocs.io/en/latest/getting-started/GETTING_STARTED.html)**: A hands-on introduction to the Universal Transfer Operator


## Changelog

The **Universal Transfer Operator** follows semantic versioning for releases. Check the [changelog](/docs/CHANGELOG.md) for the latest changes.