-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle permission error properly when checking for file #856
Conversation
Currently, we had blanket catch for exception when trying to check the file using _isfile. As a result, the exception stacktrace was repeated and catching the exception in script was difficult as we had to capture different exception. This convert the error to datachain native error that can be captured safely and proceed accordingly. This is first step toward handling #600
Deploying datachain-documentation with Cloudflare Pages
|
I'm not sure I understand this ... could you give an example please? |
|
Yes, as you see we had
with supressed lint. That means no matter what the error we were returning false. So, if we did not have the permission, the script proceeded. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #856 +/- ##
==========================================
+ Coverage 87.53% 87.61% +0.07%
==========================================
Files 128 128
Lines 11369 11382 +13
Branches 1538 1540 +2
==========================================
+ Hits 9952 9972 +20
+ Misses 1029 1023 -6
+ Partials 388 387 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
This code looks problematic but uncaught because we just ignored everything with except. The _isfile doesn't work in windows for glob as a result as seen in https://github.com/iterative/datachain/actions/runs/12948522550/job/36117365039?pr=856 |
except FileNotFoundError: | ||
return False | ||
except REMOTE_ERRORS as e: | ||
raise ClientError( | ||
message=str(e), | ||
error_code=getattr(e, "code", None), | ||
) from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to catch other exceptions here? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that should be the burden of outer call or calling function to capture exception that are totally unexpected.
yep, but that's why I guess we were just ignoring a wide range of exceptions and the rest of the code is built to handle that path as a non-file should we just try to catch Remote exceptions that are relevant to credentials, etc. And then do a wide open |
@@ -90,6 +91,10 @@ def _isfile(client: "Client", path: str) -> bool: | |||
Returns True if uri points to a file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shcheklein, was _isfile()
added just for making it work with Google Cloud?
If so, I'd recommend extending GCSFileSystem._is_file
(my preference), or add GCSClient._is_file
.
Whitelisting exceptions on a fs-agnostic/generic code is not a good thing to do.
get_listing
could raise any unknown exceptions that's not FileNotFoundError
in _is_file
as a ClientError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, I think it is a general code (there are special zero-byte files across all clouds I think) - we need to filter them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, we stopped doing these checks in dvc. We ignore files that end with slashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, so that's the point of this call - to always treat it as a directory AFAIR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in some cases (not sure if this works the same across all clients) it returns a file/
even if ask it to check file
if zero-byte object exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this warrants separate discussion if we want to include _isfile or not. Merging the PR for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The discussion is not limited to only _isfile
.
I am arguing that _is_file
should not raise ClientError
, or handle certain set of exceptions. _is_file
should raise any exception that is not FileNotFoundError
, and get_listing
should convert it to ClientError
without any filtering. There's no need for whitelisting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_isfile
should be very close to fs._is_file
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in this case we can consider is_file more like client.is_file
that already abstracts out fs
- thus it is probably fine to convert exceptions if that is relevant
@@ -90,6 +91,10 @@ def _isfile(client: "Client", path: str) -> bool: | |||
Returns True if uri points to a file | |||
""" | |||
try: | |||
if "://" in path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed, btw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because when the s3 or other schema was in uppercase, the process would fail since it didn't match the specified regex. Its just that we swallowed the exception in test cases previously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you point me to those tests that use uppercase scheme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datachain/tests/func/test_ls.py
Line 78 in 60aed77
def test_ls_sources_scheme_uppercased(cloud_test_catalog, cloud_type, capsys): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still very weird place to do this kind of transformation ... how was it handled by ls downstream? Okay, it's swallowed here means we are losing some optimization - by ls
still worked right? where do we handle this usually?
A wide open except: would swallow a lot of issues IMO. |
* Handle permission error properly when checking for file Currently, we had blanket catch for exception when trying to check the file using _isfile. As a result, the exception stacktrace was repeated and catching the exception in script was difficult as we had to capture different exception. This convert the error to datachain native error that can be captured safely and proceed accordingly. This is first step toward handling #600 * Convert scheme to lower * Handle case for glob in windows
main > #856 (this) > #857
Currently, we had blanket catch for exception when trying to check the
file using _isfile. As a result, the exception stacktrace was repeated
and catching the exception in script was difficult as we had to capture
different exception. This convert the error to datachain native error
that can be captured safely and proceed accordingly.
This is first step toward handling #600