Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make stdout and stderr handle decode errors gracefully #505

Merged

Conversation

0div
Copy link
Contributor

@0div 0div commented Dec 12, 2024

Description

The sbx.commands.run method can generate a stdout with badly encoded UTF-8 bytes, which can cause this kind of errors:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 8191: unexpected end of data

see more here: https://linear.app/e2b/issue/E2B-1291/unicodedecodeerror-in-python-sdk-when-using-curl-command

To address this we update the decode error handler to replace with the unicode replacement char.

Test

 poetry run pytest -s -n 4 -k "tests/sync/sandbox_sync/commands/test_run"

Copy link

linear bot commented Dec 12, 2024

Copy link

changeset-bot bot commented Dec 12, 2024

⚠️ No Changeset found

Latest commit: bff11dc

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@0div 0div requested review from mlejva and jamesmurdza December 12, 2024 01:13
@0div 0div self-assigned this Dec 12, 2024
@0div 0div added the bug Something isn't working label Dec 12, 2024
@ValentaTomas
Copy link
Member

  1. Can this happen in the JS too?
  2. Aren't we missing the change is async?

@0div
Copy link
Contributor Author

0div commented Dec 12, 2024

  1. Can this happen in the JS too?
  2. Aren't we missing the change is async?
  1. @jamesmurdza tested it with js-sdk, it doesn't throw, it uses replacement chars by default.
  2. Adding it to async 👍

@ValentaTomas
Copy link
Member

Do we know why it is happening? Can we improve the encoding in envd or are we doing something improper while encoding?

@ValentaTomas
Copy link
Member

My question is basically—why are there undecodable chars in the stdout/stderr? Is this because of the command output or are we messing something up in the envd?

@0div
Copy link
Contributor Author

0div commented Dec 12, 2024

Do we know why it is happening? Can we improve the encoding in envd or are we doing something improper while encoding?

Essentially when reading data from a network socket, the client might encounter a scenario where a multi-byte UTF-8 character is split across network reads. UTF-8 uses multiple bytes to represent characters with codepoints above 127, and these characters can be interrupted mid-sequence when the buffer is filled.

@0div 0div closed this Dec 12, 2024
@0div 0div reopened this Dec 12, 2024
@ValentaTomas
Copy link
Member

Do we know why it is happening? Can we improve the encoding in envd or are we doing something improper while encoding?

Essentially when reading data from a network socket, the client might encounter a scenario where a multi-byte UTF-8 character is split across network reads. UTF-8 uses multiple bytes to represent characters with codepoints above 127, and these characters can be interrupted mid-sequence when the buffer is filled.

Should this happen here when the messages we are getting from envd are actually structured and contain multiple fields in addition to the stdout/stderr?

@ValentaTomas
Copy link
Member

With that in mind, should we fix this when reading the messages here (

for chunk in http_resp.iter_stream():
) because we might not be reading the whole message?

@0div
Copy link
Contributor Author

0div commented Dec 12, 2024

With that in mind, should we fix this when reading the messages here (

for chunk in http_resp.iter_stream():

) because we might not be reading the whole message?

IIUC you don't get to control where the split happens when reading byte stream from the network, but at the same time im surprised this is not a solved problem given the ubiquity of utf8 encoded byte array streaming

@mlejva
Copy link
Member

mlejva commented Dec 12, 2024

With that in mind, should we fix this when reading the messages here (

for chunk in http_resp.iter_stream():

) because we might not be reading the whole message?

That sounds like a separate PR. This PR looks good to me!

@0div 0div merged commit f02eeac into main Dec 12, 2024
3 checks passed
@0div 0div deleted the unicodedecodeerror-in-python-sdk-when-using-curl-command-e2b-1291 branch December 12, 2024 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants