AirbyteLib: Progress Printer #34588
Conversation
…le flag during install
…e-lib/progress-print
Love this! Left two suggestions, but they are incremental improvements, not blockers.
from rich.markdown import Markdown as RichMarkdown

try:
I really like these small QOL checks, but could we also add an option for this, like pinecone does?
Some environments don't handle this kind of console usage very well, and it's nice to be able to mute it for clean output in a more "production" setting.
I'm definitely on board for this. The refactoring would be significant at this stage though, specifically because the progress messages are currently sent from a few different places in the codebase. While I'm not able to do so in this iteration, I definitely think this is worth doing and will probably get to it down the road.
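For reference, here's a rough sketch of what such an opt-out could look like. The `ProgressPrinter` class, `show_progress` flag, and `AIRBYTE_LIB_SHOW_PROGRESS` env var are hypothetical names used for illustration, not the actual AirbyteLib API:

```python
import os


class ProgressPrinter:
    """Hypothetical sketch of a progress printer with an opt-out flag."""

    def __init__(self, show_progress: bool | None = None) -> None:
        # An explicit argument wins; otherwise fall back to an env var,
        # defaulting to "on" for interactive use.
        if show_progress is None:
            show_progress = os.environ.get("AIRBYTE_LIB_SHOW_PROGRESS", "1") != "0"
        self.show_progress = show_progress

    def log_records_read(self, count: int) -> None:
        if not self.show_progress:
            return  # Muted: skip all console rendering.
        print(f"Records read: {count:,}", end="\r", flush=True)
```

Callers who want clean logs in a production setting could then pass `show_progress=False` once, or set the env var, without touching the rest of the sync code.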
# This is some math to make updates adaptive to the scale of records read.
# We want to update the display more often when the count is low, and less
# often when the count is high.
updated_period = min(
IMHO it would make sense to throttle these updates based on the elapsed time.
The current approach works well in a lot of scenarios, but gets weird for edge cases like:
- Records are read really fast in the beginning, then get slow (e.g. because a second, slower stream is being read). Updates would then happen really rarely, so the user can't follow the slow progress on the second stream.
- Records are read glacially slowly (like one per second). It takes a long time for anything to show up in the first place.
Just updating once a second or so keeps the display lively without any risk of overloading the terminal.
This is good feedback. It's not obvious here, but there's also a time-based throttle inside update_display(). In this layer I'm trying to keep performance as fast as possible, hence the math-based check only. We could probably add a time-elapsed check as well, but this codepath may be called literally thousands of times per second, so I'm wary of changing it further right now. Happy to keep iterating after merging, though.
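To make the two layers concrete, here's a simplified sketch of count-based gating in the hot path combined with a wall-clock throttle at render time. The names and thresholds are illustrative, not the actual AirbyteLib code:

```python
import time

MIN_REFRESH_SECONDS = 0.5  # Illustrative: redraw at most ~twice per second.


class ProgressTracker:
    """Hypothetical sketch combining a count gate with a time throttle."""

    def __init__(self) -> None:
        self.total_records = 0
        self.update_period = 1  # Re-evaluated as counts grow.
        self.last_refresh = 0.0

    def log_records_read(self, new_records: int) -> None:
        self.total_records += new_records
        # Cheap count-based gate in the hot path: only consider rendering
        # every Nth record, with N growing as the total grows.
        if self.total_records % self.update_period != 0:
            return
        self.update_period = min(10_000, max(1, self.total_records // 100))
        self._maybe_refresh()

    def _maybe_refresh(self) -> None:
        # Wall-clock throttle: even when the count gate passes, skip the
        # redraw if we refreshed very recently.
        now = time.monotonic()
        if now - self.last_refresh < MIN_REFRESH_SECONDS:
            return
        self.last_refresh = now
        print(f"Records read: {self.total_records:,}", end="\r", flush=True)
```

The count gate keeps the per-record cost to a modulo check, while the time throttle bounds how often the terminal is actually redrawn.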
…oundError to avoid confusion
…e-lib/progress-print
Note:
A sync operation can take 5 minutes, 30 minutes, or multiple hours. This PR adds a self-updating progress-print when run from IPython.
As much as possible, I've tried to avoid impacting performance.
Sample output below:
This is the runtime of extracting just the 'issues' stream for the repo 'airbytehq/airbyte', using the GitHub source:
What's nice is that I can see at a glance that the write to the SQL table took under 2 minutes while the extraction from the source took 24 minutes, which suggests this workload is throttled by GitHub's API rate limits. Running with 2 or 3 auth tokens to rotate (which I think this source supports) would make an interesting experiment and possible demo content.
It's also nice that the display updates while the sync is running, so I don't need to wonder if the code got stuck. If users have an estimate of how many rows they expect, they can also deduce a rough % progress, even though we can't detect that programmatically.
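For anyone curious how the pieces fit together, here's a simplified sketch of the general pattern (IPython detection plus a rich live display). It's illustrative only, not the exact code in this PR; names, timings, and record counts are made up for the example:

```python
# Simplified illustration of IPython detection plus a `rich` live display.
import time

from rich.live import Live
from rich.markdown import Markdown as RichMarkdown

try:
    # Available only when running under IPython / Jupyter.
    from IPython import get_ipython

    IS_INTERACTIVE = get_ipython() is not None
except ImportError:
    IS_INTERACTIVE = False


def run_with_progress(total_batches: int = 5) -> None:
    if not IS_INTERACTIVE:
        # Plain fallback so non-interactive environments get simple log lines.
        for batch in range(1, total_batches + 1):
            time.sleep(0.2)  # Stand-in for real read/write work.
            print(f"Read {batch * 1_000:,} records")
        return

    with Live(RichMarkdown("## Sync starting"), refresh_per_second=4) as live:
        for batch in range(1, total_batches + 1):
            time.sleep(0.2)  # Stand-in for real read/write work.
            live.update(RichMarkdown(f"## Read {batch * 1_000:,} records"))
```

The fallback branch is what keeps logs clean when the code runs outside an interactive session; the `Live` block is what produces the self-updating output shown above.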