Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 Source Slack migration to low code #35477

Merged
merged 95 commits into from
Apr 15, 2024

Conversation

midavadim
Copy link
Contributor

@midavadim midavadim commented Feb 21, 2024

What

resolved: https://github.com/airbytehq/airbyte-internal-issues/issues/2920

Migrate slack to low-code cdk.

How

Deleted python code and moved main stream to manifest.yaml.

Channels stream: custom component - join_channels.py used JoinChannel stream in ChannelsRetriever to connect to channel if it is_member is False and user defined this feature in config. Retriever checks if it should be connected JoinChannelStream performs logic for connection and returns joined channel.

Users stream: also used to check connection, so error_handler was extended with needed status codes.

Channel Members stream: custom component for extracting records from response channel_members_extractor.py :
from: ['aa', 'bb'] to: [{'member_id': 'aa'}, {{'member_id': 'bb'}].

Channel messages stream: no custom components, retrieves messages in channel by channel id and provided time period.

Threads stream: is used old python base implementation. For now we can't migrate to low code, because we can't operate state in substream partitioner which prevents incorrect work of incremental sync.

Also add Migrate legacy config to make it compatible with Selective Auth. Legacy config does not have path to be able to select authenticator.

Recommended reading order

  1. airbyte-integrations/connectors/source-slack/source_slack/manifest.yaml
  2. airbyte-integrations/connectors/source-slack/source_slack/components
  3. other fiels.

🚨 User Impact 🚨

Breaking change due to update of state format in sub streams.

Pre-merge Actions

Update date in breaking changes.

@darynaishchenko
Copy link
Collaborator

@irvingpop Thanks again for the feedback.
Only last question about rate limits. Did you face some issues with rate limiting or backoff etc?

Sync speed is an order of magnitude faster! In the old connector, downloading the entire 7-year history for even a single channel took many hours, now it is about 4 minutes for our busiest channel!

For threads stream we do not migrate it to low code for now. But it's planned to migrate once cdk updates with new state feature in slices.

@darynaishchenko
Copy link
Collaborator

@alafanechere Hi, I have run regression test for current changes.
Regression testing results:
failed:

TestDataIntegrity.test_record_count_with_state
Stream channel_messages has 96 fewer records in the target version(0 vs. 96).

TestDataIntegrity.test_all_pks_are_produced_in_target_version_with_state
Missing records on stream channel_messages(2 missing records)

TestDataIntegrity.test_all_records_are_the_same_with_state
channel_messages stream: records in control but not target (truncated)

TestDataIntegrity.test_record_count_without_state
Stream threads has 1712 fewer records in the target version(3 vs. 1715).

TestDataIntegrity.test_all_records_are_the_same_without_state

channel_messages stream: records with primary key in target & control whose values differ (truncated)
threads stream: records with primary key in target & control whose values differ (truncated)

I have tested it locally with connection creds, dev and latest versions returned same amount of records, except channel messages with state as low-code version start from previous cursor partition and returned one record for each partition which is expected behavior.
dev without state

{"type": "LOG", "log": {"level": "INFO", "message": "Read 256 records from channel_messages stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 1715 records from threads stream"}}

latest without state

{"type": "LOG", "log": {"level": "INFO", "message": "Read 256 records from channel_messages stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 1715 records from threads stream"}}

dev with state

{"type": "LOG", "log": {"level": "INFO", "message": "Read 3 records from channel_messages stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 0 records from threads stream"}}

latest with state

{"type": "LOG", "log": {"level": "INFO", "message": "Read 0 records from channel_messages stream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Read 0 records from threads stream"}}

Missing records: Both 2 missed records present in dev local test result for channel messages stream.

"root[144]['data']['attachments'][0]['ts']":
{ "old_type": "str", "new_type": "int", 
"old_value": "1711389750.392079", "new_value": 1710522783 }

ts value in attachments are the same type, regression tool compared different records. This value are not present in stream schema so we should consider changing stream schema to fix it in next slack release.

@irvingpop
Copy link
Contributor

irvingpop commented Apr 5, 2024

Hey @darynaishchenko!

Only last question about rate limits. Did you face some issues with rate limiting or backoff etc?

Yes, unfortunately Slack's API heavily rate limits us all. The rate limiting with the old connector was atrocious (thousands of rate limit backoffs per sync) - but is much much better with the new connector (still couple hundred rate limit messages during the initial sync, but a huge improvement)

For threads stream we do not migrate it to low code for now. But it's planned to migrate once cdk updates with new state feature in slices.

Okay, good to know! Out of curiosity, is the plan to release the new low-code connector first, and then update the threads functionality at a later time? From what I can tell, the threads sync has suffered a big regression.

Feel free also to message me in the Airbyte slack if you need any other details.

Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My testing with the regression test tool leads to the following conclusion:

  • Record count is correct on all streams. This new version should not lead to data loss
  • Record schemas are matching.
  • There's indeed a breaking change on channel_messages stream as the state format has changed.

Here are a couple of additional questions I'd like answer on before giving an additional review:

  • The checkpointing logic has slightly changed. The new version produces ~30% less state message than the current one. Is it expected? Should it be concerning? Do we have sufficient control in the low-code CDK to make the checkpointing logic match?
  • I noticed that the catalog schema is incomplete. For instance channel_messages.attachment do not have all the properties returned by the API. I don't think it is problematic for this PR. But do we plan on improving the catalog schema to match the API response? (you can check the following artifact if you want to introspect the inferred schema on channel_messages: command_execution_artifacts/source-slack/read/dev/stream_schemas/channel_messages.json

Why I'm requesting changes?
I think you should use scopedImpact breaking changes if the only impacted stream is channel_messages.

airbyte-integrations/connectors/source-slack/metadata.yaml Outdated Show resolved Hide resolved
@octavia-squidington-iv octavia-squidington-iv requested a review from a team April 10, 2024 08:14
@octavia-squidington-iv octavia-squidington-iv requested a review from a team April 10, 2024 08:16
@darynaishchenko
Copy link
Collaborator

@alafanechere

The checkpointing logic has slightly changed. The new version produces ~30% less state message than the current one. Is it expected? Should it be concerning? Do we have sufficient control in the low-code CDK to make the checkpointing logic match?

Yes, it'expected. Declarative stream disables checkpointing and we don't have a way to change it from the source implementation.

I noticed that the catalog schema is incomplete. For instance channel_messages.attachment do not have all the properties returned by the API. I don't think it is problematic for this PR. But do we plan on improving the catalog schema to match the API response? (you can check the following artifact if you want to introspect the inferred schema on channel_messages: command_execution_artifacts/source-slack/read/dev/stream_schemas/channel_messages.json

Yes, I also noticed undeclared fields for channel messages stream. I will create a follow up ticket and discuss with my team when we can work on it.

PS: working on requested changes

Copy link
Contributor

@alafanechere alafanechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉🎉🎉🎉
Congrats @darynaishchenko for taking this to the finish line.
Thanks for your patience and multiple iterations.

Copy link
Collaborator

@lazebnyi lazebnyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@irvingpop
Copy link
Contributor

After picking up the latest commits, I was able to successfully extract a much more expected number of thread messages. thank you for all your hard work on this!

image

message:
The source slack connector is being migrated from the Python CDK to our declarative low-code CDK.
Due to changes in the handling of state format for incremental substreams, this migration constitutes a breaking change for the channel_messages stream.
Users will need to retest source configuration, refresh the source schema and reset the channel_messages stream after upgrading.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/retest/reset

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

breakingChanges:
1.0.0:
message:
The source slack connector is being migrated from the Python CDK to our declarative low-code CDK.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/slack/Slack

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

## Upgrading to 1.0.0

We're continuously striving to enhance the quality and reliability of our connectors at Airbyte.
As part of our commitment to delivering exceptional service, we are transitioning source slack from the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/slack/Slack

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

|:--------|:-----------|:---------------------------------------------------------|:------------------------------------------------------------------------------------|
| Version | Date | Pull Request | Subject |
|:--------|:-----------|:---------------------------------------------------------|:-------------------------------------------------------------------------------------|
| 1.0.0 | 2024-04-02 | [35477](https://github.com/airbytehq/airbyte/pull/35477) | Migration to low code |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

low-code CDK

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@octavia-squidington-iv octavia-squidington-iv requested a review from a team April 12, 2024 12:30
|:--------|:-----------|:---------------------------------------------------------|:------------------------------------------------------------------------------------|
| Version | Date | Pull Request | Subject |
|:--------|:-----------|:---------------------------------------------------------|:-------------------------------------------------------------------------------------|
| 1.0.0 | 2024-04-02 | [35477](https://github.com/airbytehq/airbyte/pull/35477) | Migration to low code CDK |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migration to low-code CDK

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, updated

@lazebnyi lazebnyi merged commit 714eea3 into master Apr 15, 2024
30 checks passed
@lazebnyi lazebnyi deleted the midavadim/source-slack-to-low-code branch April 15, 2024 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/slack low-code-migration This connector has been migrated to the low-code CDK
Projects
None yet
Development

Successfully merging this pull request may close these issues.