[draft] 11228 universal file transfer #50846

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 12 additions & 22 deletions docs/integrations/sources/s3.md
@@ -6,8 +6,8 @@ This page contains the setup guide and reference information for the [S3](https:

</HideInUI>

:::info
Please note that using cloud storage may incur egress costs. Egress refers to data that is transferred out of the cloud storage system, such as when you download files or access them from a different location. For detailed information on egress costs, please consult the [Amazon S3 pricing guide](https://aws.amazon.com/s3/pricing/).
:::warning
Using cloud storage may incur egress costs. Egress refers to data that is transferred out of the cloud storage system, such as when you download files or access them from a different location. For detailed information on egress costs, please consult the [AWS S3 pricing guide](https://aws.amazon.com/s3/pricing/).
:::

## Prerequisites
@@ -70,8 +70,8 @@ For more information on managing your access keys, please refer to the
#### Option 2: Using an IAM Role (Most secure)

<!-- env:oss -->

:::note
S3 authentication using an IAM role member is not supported on the OSS platform.
:::

<!-- /env:oss -->

<!-- env:cloud -->
@@ -80,10 +83,10 @@ S3 authentication using an IAM role member is not supported on the OSS platform.
S3 authentication using an IAM role member must be enabled by a member of the Airbyte team. If you'd like to use this feature, please [contact the Sales team](https://airbyte.com/company/talk-to-sales) for more information.
:::


1. In the IAM dashboard, click **Roles**, then **Create role**.

2. Choose the **AWS account** trusted entity type.

3. Set up a trust relationship for the role. This allows the Airbyte instance's AWS account to assume this role. You will also need to specify an external ID, which is a secret key that the trusting service (Airbyte) and the trusted role (the role you're creating) both know. This ID is used to prevent the "confused deputy" problem. The External ID should be your Airbyte workspace ID, which can be found in the URL of your workspace page. Edit the trust relationship policy to include the external ID:
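   For illustration, here is a minimal sketch of such a trust policy. The principal and external ID values are placeholders, not real values: substitute the ARN of the Airbyte instance's AWS principal and your Airbyte workspace ID.

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "<AIRBYTE_INSTANCE_PRINCIPAL_ARN>"
         },
         "Action": "sts:AssumeRole",
         "Condition": {
           "StringEquals": {
             "sts:ExternalId": "<YOUR_AIRBYTE_WORKSPACE_ID>"
           }
         }
       }
     ]
   }
   ```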

@@ -118,8 +121,9 @@ S3 authentication using an IAM role member must be enabled by a member of the Airbyte team.
2. Click Sources and then click + New source.
3. On the Set up the source page, select S3 from the Source type dropdown.
4. Enter a name for the S3 connector.
5. Enter the name of the **Bucket** containing your files to replicate.
6. Add a stream
5. Choose a [delivery method](../../using-airbyte/delivery-methods) for your data.
6. Enter the name of the **Bucket** containing your files to replicate.
7. Add a stream
1. Choose the **File Format**
2. In the **Format** box, use the dropdown menu to select the format of the files you'd like to replicate. The supported formats are **CSV**, **Parquet**, **Avro** and **JSONL**. Toggling the **Optional fields** button within the **Format** box will allow you to enter additional configurations based on the selected format. For a detailed breakdown of these settings, refer to the [File Format section](#file-format-settings) below.
3. Give a **Name** to the stream
@@ -128,7 +132,7 @@ S3 authentication using an IAM role member must be enabled by a member of the Airbyte team.
6. (Optional) If you want to enforce a specific schema, you can enter an **Input schema**. By default, this value is set to `{}`, and the schema will be automatically inferred from the file\(s\) you are replicating. For details on providing a custom schema, refer to the [User Schema section](#user-schema); a brief example also appears after this list.
7. (Optional) Select the **Schemaless** option to skip all validation of the records against a schema. If this option is selected, the schema will be `{"data": "object"}` and all downstream data will be nested in a "data" field. This is a good option if the schema of your records changes frequently.
8. (Optional) Select a **Validation Policy** to tell Airbyte how to handle records that do not match the schema. You may choose to emit the record anyway (fields that aren't present in the schema may not arrive at the destination), skip the record altogether, or wait until the next discovery (which will happen in the next 24 hours).
7. **To authenticate your private bucket**:
8. **To authenticate your private bucket**:
- If using an IAM role, enter the **AWS Role ARN**.
- If using IAM user credentials, fill the **AWS Access Key ID** and **AWS Secret Access Key** fields with the appropriate credentials.
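As an illustration, a sketch of an **Input schema** for a file with three columns might look like the following. The column names here are hypothetical; consult the [User Schema section](#user-schema) for the exact supported types.

```json
{
  "id": "integer",
  "name": "string",
  "created_at": "string"
}
```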

@@ -141,25 +145,11 @@ All other fields are optional and can be left empty. Refer to the [S3 Provider S
3. On the Set up the source page, select S3 from the Source type dropdown.
4. Enter a name for the S3 connector.

#### Copy Raw Files Configuration
#### Delivery Method

<FieldAnchor field="delivery_method.delivery_type">

:::info

The raw file replication feature has the following requirements and limitations:
- **Supported Airbyte Versions:**
- Cloud: All Workspaces
- OSS / Enterprise: `v1.2.0` or later
- **Max File Size:** `1GB` per file
- **Supported Destinations:**
- S3: `v1.4.0` or later

:::

Copy raw files without parsing their contents. Bits are copied into the destination exactly as they appeared in the source. Recommended for use with unstructured text data, non-text and compressed files.

Format options will not be taken into account. Instead, files will be transferred to the file-based destination without parsing underlying data.
Choose a [delivery method](../../using-airbyte/delivery-methods) for your data.

</FieldAnchor>

45 changes: 16 additions & 29 deletions docs/integrations/sources/sftp-bulk.md
@@ -66,13 +66,14 @@ For more information on SSH key pair authentication, please refer to the
2. Click Sources and then click + New source.
3. On the Set up the source page, select SFTP Bulk from the Source type dropdown.
4. Enter a name for the SFTP Bulk connector.
5. Enter the **Host Address**.
6. Enter your **Username**.
7. Enter your authentication credentials for the SFTP server (**Password** or **Private Key**). If you are authenticating with a private key, you can upload the file containing the private key (usually named `rsa_id`) using the Upload file button.
8. In the section titled "The list of streams to sync", enter a **Stream Name**. This will be the name of the stream created in your destination. Add additional streams by clicking "Add".
9. For each stream, use the dropdown menu to select the **File Type** you wish to sync. Depending on the format chosen, you'll see a set of options specific to that file type. You can read more about the specifics of each file type below.
10. (Optional) Provide a **Start Date** using the provided datepicker, or by entering the date in the format `YYYY-MM-DDTHH:mm:ss.SSSSSSZ`. Incremental syncs will only sync files modified/added after this date.
11. (Optional) Specify the **Port**. The default port for SFTP is 22. If your remote server is using a different port, enter it here.
5. Choose a [delivery method](../../using-airbyte/delivery-methods) for your data.
6. Enter the **Host Address**.
7. Enter your **Username**.
8. Enter your authentication credentials for the SFTP server (**Password** or **Private Key**). If you are authenticating with a private key, you can upload the file containing the private key (usually named `rsa_id`) using the Upload file button.
9. In the section titled "The list of streams to sync", enter a **Stream Name**. This will be the name of the stream created in your destination. Add additional streams by clicking "Add".
10. For each stream, use the dropdown menu to select the **File Type** you wish to sync. Depending on the format chosen, you'll see a set of options specific to that file type. You can read more about the specifics of each file type below.
11. (Optional) Provide a **Start Date** using the provided datepicker, or by entering the date in the format `YYYY-MM-DDTHH:mm:ss.SSSSSSZ`. Incremental syncs will only sync files modified/added after this date.
12. (Optional) Specify the **Port**. The default port for SFTP is 22. If your remote server is using a different port, enter it here.
(Optional) Determine the **Folder Path**. This sets the directory to search for files in and defaults to "/". To sync only a specific directory, specify its path on the remote server. For example, given a file structure like the following:
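(The layout below is hypothetical; your directory structure will differ.)

```
/logs/2021/log-20210101.csv
/logs/2021/log-20210102.csv
/reports/summary-2021.csv
```

With this layout, setting the **Folder Path** to `/logs/2021` would limit the sync to just that directory.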

@@ -103,6 +104,14 @@ This pattern will filter for files that match the format `log-YYYYMMDD`, where `
3. On the Set up the source page, select SFTP Bulk from the Source type dropdown.
4. Enter a name for the SFTP Bulk connector.

#### Delivery Method

<FieldAnchor field="delivery_method.delivery_type">

Choose a [delivery method](../../using-airbyte/delivery-methods) for your data.

</FieldAnchor>

#### File-specific Configuration

Depending on your **File Type** selection, you will be presented with a few configuration options specific to that file type.
@@ -113,28 +122,6 @@ For example, assuming your folder path is not set in the connector configuration

If your files are in a folder, include the folder in your glob pattern, like `my_folder/my_prefix_*.csv`.

#### Copy Raw Files Configuration

<FieldAnchor field="delivery_method.delivery_type">

:::info

The raw file replication feature has the following requirements and limitations:
- **Supported Airbyte Versions:**
- Cloud: All Workspaces
- OSS / Enterprise: `v1.2.0` or later
- **Max File Size:** `1GB` per file
- **Supported Destinations:**
- S3: `v1.4.0` or later

:::

Copy raw files without parsing their contents. Bits are copied into the destination exactly as they appeared in the source. Recommended for use with unstructured text data, non-text and compressed files.

Format options will not be taken into account. Instead, files will be transferred to the file-based destination without parsing underlying data.

</FieldAnchor>

## Supported sync modes

The SFTP Bulk source connector supports the following [sync modes](https://docs.airbyte.com/cloud/core-concepts/#connection-sync-modes):
4 changes: 4 additions & 0 deletions docs/using-airbyte/core-concepts/readme.md
@@ -79,6 +79,10 @@ Depending on your destination, you may know this more commonly as the "Dataset",

For more details, see our [Namespace documentation](namespaces.md).

## Delivery Method

You can move data from a source to a destination in one of two ways, depending on whether your data is structured or unstructured. When you replicate records, you extract and load structured records, allowing for blocking and hashing individual fields, typing, and deduping. You can also copy raw files without processing them, which is appropriate for unstructured data. Read more about the difference in [Delivery methods](../delivery-methods).

## Sync Mode

A sync mode governs how Airbyte reads from a source and writes to a destination. Airbyte provides several sync modes depending on what you want to accomplish. The sync modes define how your data will sync and whether duplicates will exist in the destination.
Binary file added docs/using-airbyte/delivery-method-copy-raw.png
Binary file added docs/using-airbyte/delivery-method-replicate.png
58 changes: 58 additions & 0 deletions docs/using-airbyte/delivery-methods.md
@@ -0,0 +1,58 @@
---
products: all
---

# Delivery methods

Airbyte supports two methods for delivering source data to the destination.

- Replicate records
- Copy raw files

This article explains the difference between these methods, when you should use each one, and how to configure this option in Airbyte.

## Replicate records

When you replicate records, you extract and load structured records into your destination of choice. This method allows for blocking and hashing individual fields or files from a structured schema. Data can be flattened, typed, and deduped depending on the destination.

For most connectors, this is the only option you have. It's ideal for working with structured data like databases, spreadsheets, JSON, and APIs.

![Moving individual fields from a source to a destination](delivery-method-replicate.png)

## Copy raw files

When you copy raw files, you copy files without parsing their contents. Bits are copied into the destination exactly as they appeared in the source. In this case, Airbyte is strictly focused on data movement, and pays no attention to structure or processing.

This choice is ideal for unstructured text, non-text data like multimedia, and compressed files. However, it's only available on specific connectors that are designed to handle unstructured data, like those related to blob storage solutions.

![Moving raw files from a source to a destination without regard for their contents or structure](delivery-method-copy-raw.png)

### Supported versions and limitations

#### Supported Airbyte versions

- Cloud: All Workspaces

- Self-Managed Community and Self-Managed Enterprise: `v1.2.0` or later

#### Supported sources {#supported-sources}

- [SFTP bulk](../integrations/sources/sftp-bulk): `v1.5.0` or later

- [S3](../integrations/sources/s3): `v4.10.1` or later

Additional sources may be added later.

#### Supported destinations

- [S3](../integrations/destinations/s3): `v1.4.0` or later

Additional destinations may be added later.

#### Limitations

- Maximum file size: `1GB` per file.

## How to configure the delivery method

You configure the delivery method on the source. See the docs for [supported connectors](#supported-sources), above.
4 changes: 4 additions & 0 deletions docusaurus/sidebars.js
@@ -553,6 +553,10 @@ module.exports = {
type: "doc",
id: "using-airbyte/core-concepts/typing-deduping",
},
{
type: "doc",
id: "using-airbyte/delivery-methods",
},
{
type: "category",
label: "Transformations",
4 changes: 4 additions & 0 deletions docusaurus/src/css/custom.css
@@ -279,4 +279,8 @@ table tr:hover {
}
table thead tr:hover {
background-color: transparent;
}

.markdown li > p {
margin: 0px !important;
}

Comment from the author: "This is not related to anything, but it fixes an oddity where list items in the final docs had variable top and bottom margins depending on how an author spaced things out in their Markdown."