Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle permission error properly when checking for file #856

Merged
merged 5 commits into from
Jan 27, 2025

Conversation

amritghimire
Copy link
Contributor

@amritghimire amritghimire commented Jan 24, 2025

main > #856 (this) > #857

Currently, we had blanket catch for exception when trying to check the
file using _isfile. As a result, the exception stacktrace was repeated
and catching the exception in script was difficult as we had to capture
different exception. This convert the error to datachain native error
that can be captured safely and proceed accordingly.

This is first step toward handling #600

Currently, we had blanket catch for exception when trying to check the
file using _isfile. As a result, the exception stacktrace was repeated
and catching the exception in script was difficult as we had to capture
different exception. This convert the error to datachain native error
that can be captured safely and proceed accordingly.

This is first step toward handling #600
@amritghimire amritghimire self-assigned this Jan 24, 2025
Copy link

cloudflare-workers-and-pages bot commented Jan 24, 2025

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: af393cd
Status: ✅  Deploy successful!
Preview URL: https://6ea56970.datachain-documentation.pages.dev
Branch Preview URL: https://amrit-handle-permission-erro.datachain-documentation.pages.dev

View logs

@amritghimire amritghimire requested review from ilongin, dreadatour and a team January 24, 2025 03:20
@shcheklein
Copy link
Member

Currently, we had blanket catch for exception when trying to check the
file using _isfile. As a result, the exception stacktrace was repeated
and catching the exception in script was difficult as we had to capture
different exception

I'm not sure I understand this ... could you give an example please?

@amritghimire
Copy link
Contributor Author

Currently, we had blanket catch for exception when trying to check the
file using _isfile. As a result, the exception stacktrace was repeated
and catching the exception in script was difficult as we had to capture
different exception

I'm not sure I understand this ... could you give an example please?

https://github.com/iterative/datachain/pull/856/files#diff-ba026f8084c104850332b7e62ae0741b251f65b67d0271ce5870f4794a1346bfL102

@amritghimire
Copy link
Contributor Author

#856 (files)

is it an example?

Yes, as you see we had

except: 

with supressed lint. That means no matter what the error we were returning false. So, if we did not have the permission, the script proceeded.

Copy link

codecov bot commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.

Project coverage is 87.61%. Comparing base (13e5c13) to head (af393cd).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/datachain/lib/listing.py 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #856      +/-   ##
==========================================
+ Coverage   87.53%   87.61%   +0.07%     
==========================================
  Files         128      128              
  Lines       11369    11382      +13     
  Branches     1538     1540       +2     
==========================================
+ Hits         9952     9972      +20     
+ Misses       1029     1023       -6     
+ Partials      388      387       -1     
Flag Coverage Δ
datachain 87.54% <92.85%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@amritghimire
Copy link
Contributor Author

https://github.com/iterative/datachain/pull/767/files#diff-ba026f8084c104850332b7e62ae0741b251f65b67d0271ce5870f4794a1346bfR102-R103

This code looks problematic but uncaught because we just ignored everything with except. The _isfile doesn't work in windows for glob as a result as seen in https://github.com/iterative/datachain/actions/runs/12948522550/job/36117365039?pr=856

Comment on lines +107 to +113
except FileNotFoundError:
return False
except REMOTE_ERRORS as e:
raise ClientError(
message=str(e),
error_code=getattr(e, "code", None),
) from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to catch other exceptions here? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that should be the burden of outer call or calling function to capture exception that are totally unexpected.

@shcheklein
Copy link
Member

This code looks problematic but uncaught because we just ignored everything with except. The _isfile doesn't work in windows for glob as a result as seen in https://github.com/iterative/datachain/actions/runs/12948522550/job/36117365039?pr=856

yep, but that's why I guess we were just ignoring a wide range of exceptions and the rest of the code is built to handle that path as a non-file

should we just try to catch Remote exceptions that are relevant to credentials, etc. And then do a wide open except:? wdyt?

@@ -90,6 +91,10 @@ def _isfile(client: "Client", path: str) -> bool:
Returns True if uri points to a file
Copy link
Member

@skshetry skshetry Jan 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein, was _isfile() added just for making it work with Google Cloud?

If so, I'd recommend extending GCSFileSystem._is_file (my preference), or add GCSClient._is_file.

Whitelisting exceptions on a fs-agnostic/generic code is not a good thing to do.
get_listing could raise any unknown exceptions that's not FileNotFoundError in _is_file as a ClientError.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I think it is a general code (there are special zero-byte files across all clouds I think) - we need to filter them

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, we stopped doing these checks in dvc. We ignore files that end with slashes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, so that's the point of this call - to always treat it as a directory AFAIR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in some cases (not sure if this works the same across all clients) it returns a file/ even if ask it to check file if zero-byte object exists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this warrants separate discussion if we want to include _isfile or not. Merging the PR for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion is not limited to only _isfile.
I am arguing that _is_file should not raise ClientError, or handle certain set of exceptions. _is_file should raise any exception that is not FileNotFoundError, and get_listing should convert it to ClientError without any filtering. There's no need for whitelisting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_isfile should be very close to fs._is_file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in this case we can consider is_file more like client.is_file that already abstracts out fs - thus it is probably fine to convert exceptions if that is relevant

@@ -90,6 +91,10 @@ def _isfile(client: "Client", path: str) -> bool:
Returns True if uri points to a file
"""
try:
if "://" in path:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed, btw?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because when the s3 or other schema was in uppercase, the process would fail since it didn't match the specified regex. Its just that we swallowed the exception in test cases previously.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you point me to those tests that use uppercase scheme?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def test_ls_sources_scheme_uppercased(cloud_test_catalog, cloud_type, capsys):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still very weird place to do this kind of transformation ... how was it handled by ls downstream? Okay, it's swallowed here means we are losing some optimization - by ls still worked right? where do we handle this usually?

@amritghimire
Copy link
Contributor Author

should we just try to catch Remote exceptions that are relevant to credentials, etc. And then do a wide open except:? wdyt?

A wide open except: would swallow a lot of issues IMO.

@amritghimire amritghimire merged commit b1d4f2a into main Jan 27, 2025
37 checks passed
@amritghimire amritghimire deleted the amrit/handle-permission-error-properly branch January 27, 2025 06:25
dreadatour pushed a commit that referenced this pull request Jan 27, 2025
* Handle permission error properly when checking for file

Currently, we had blanket catch for exception when trying to check the
file using _isfile. As a result, the exception stacktrace was repeated
and catching the exception in script was difficult as we had to capture
different exception. This convert the error to datachain native error
that can be captured safely and proceed accordingly.

This is first step toward handling #600

* Convert scheme to lower

* Handle case for glob in windows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants