Replies: 1 comment
+1, I would also really like this. I currently have many duplicate sources with different `replication_start_date` values (in Airbyte Cloud) because I need different start dates for different streams. In some cases, I have as many as 3 copies of the same source with different start dates.
Issue Summary:
Note: This issue has come up frequently in the Slack community (recent example), and it seemed like it needed an official home.
While most sources have a `start_date` parameter you can use to limit the data returned, this is problematic when you require full lifetime data for certain streams (e.g. for CRM Contacts or Products you likely need full history, but you may only need a year of Orders). Think Salesforce when you don't need 30 years of Opportunity records, but you need all the customer Accounts and Contacts.

The more data involved, the more problematic this becomes. For example, we have LOTS of high-volume connectors for things like HubSpot and Mailchimp, but some of those accounts have 10+ years of data, so `email_events` (HubSpot) and `email_activity` (Mailchimp) can get insanely noisy (many billions of records, or more).

1. It isn't desirable to use the `start_date` parameter, as this would also limit things like Contacts, Campaigns, etc. (which is problematic for data in other streams that reference those).
2. While with HubSpot we can force the Connection State to filter this stream over the API when it's first created, it isn't ideal and doesn't survive things like refreshes from the UI. Plus this doesn't play nice in Mailchimp, because it's a child stream without `global_substream_cursor` set (and there can be thousands of partitions in the state object).
3. Similar issues exist for manually syncing the full streams, then changing the `start_date` and adding the additional streams: if a reset or resync is needed, the result will be unexpected missing data in the streams that originally had lifetime data.

Right now we're doing #2 with HubSpot over the Config API (since the Airbyte API doesn't support clobbering State), and for Mailchimp doing #3, but this still doesn't play nice with things like Refresh/Clear Data, which will then try to backfill for all time. Not to mention it only works for us because we're managing everything over the API.

Proposed Solution:
Because Airbyte has moved to be increasingly stream-centric, parameters like `start_date` should be able to be overridden at a stream level. While `start_date` is the most pressing case, ideally this could be used to override any user input config (which would allow for more flexibility).

Feature details:
- By default, all streams should use the existing global user input value, if present. This will preserve compatibility with existing connections, as well as the current simple use case of a global `start_date`.
- New UI should be introduced to allow this to be overridden on a per-stream basis; maybe a checkbox to enable the override, then the inputs for the value(s).
- Once configured, I would also recommend some visual indicator in the stream listing that stream-level config overrides are present on a stream.
- This should also be configurable via the API and Terraform by adding a new option to the `StreamConfiguration` model (similar to what was done for `mappers`: a key/value dict of config options to override along with their new values).
- When present, I would suggest this be output as an info-level log item at the start of each stream to help reduce confusion about why a certain value was used (e.g. `Overriding connection-level config with stream-level values: start_date = 2025-01-01`).
- Optionally, it may make sense to allow the connector author to implement controls around what user inputs can be overridden (by adding additional YAML properties, e.g. `allow_stream_level_override: false`).
- This will need to be documented and socialized, but should be backwards compatible since prior versions would just continue to behave as expected (with the former behavior).
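
To make the intended behavior concrete, here is a rough Python sketch of how the override resolution described above could work. It is not Airbyte code: `resolve_stream_config`, `spec_properties`, and the logger name are hypothetical, and `allow_stream_level_override` is only the property suggested in this proposal, not an existing connector option. The sketch assumes overrides arrive as a plain key/value dict per stream.

```python
import logging
from typing import Any, Dict, Mapping

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("stream_config_overrides")


def resolve_stream_config(
    connection_config: Mapping[str, Any],
    stream_overrides: Mapping[str, Any],
    spec_properties: Mapping[str, Mapping[str, Any]],
    stream_name: str,
) -> Dict[str, Any]:
    """Return the effective config for one stream.

    Starts from the connection-level (global) config and applies any
    stream-level overrides on top. With no overrides, the global values
    are returned unchanged, which keeps today's behavior.
    """
    effective = dict(connection_config)
    applied: Dict[str, Any] = {}
    for key, value in stream_overrides.items():
        prop = spec_properties.get(key, {})
        # Proposed (not existing) flag: overrides are allowed unless the
        # connector author explicitly opts out for this input.
        if not prop.get("allow_stream_level_override", True):
            raise ValueError(
                f"{key!r} cannot be overridden for stream {stream_name!r}"
            )
        effective[key] = value
        applied[key] = value
    if applied:
        details = ", ".join(f"{k} = {v}" for k, v in applied.items())
        # Info-level message in the shape suggested in the proposal.
        logger.info(
            "Overriding connection-level config with stream-level values: %s",
            details,
        )
    return effective


# Example: Orders only needs recent data, Contacts keeps the global value.
spec = {
    "start_date": {"type": "string"},
    "api_key": {"type": "string", "allow_stream_level_override": False},
}
connection = {"start_date": "2015-01-01", "api_key": "example"}
orders = resolve_stream_config(connection, {"start_date": "2025-01-01"}, spec, "orders")
contacts = resolve_stream_config(connection, {}, spec, "contacts")
```

Raising an error when a locked input is overridden is just one option; silently ignoring the override or rejecting it at save time in the UI/API would also fit the proposal.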