Add resolve files #313
Conversation
I've attempted to resolve this issue, but my test is still failing. I'd like to understand why the `is_valid` field is not being updated. Could you please help me identify the cause? Thank you.

```python
from unittest.mock import MagicMock, patch

def test_resolve_files(test_session):
    valid_file = File(path="valid.txt", size=10)
    invalid_file = File(path="invalid.txt", size=5)
    dc = DataChain.from_values(file=[valid_file, invalid_file], session=test_session)

    def mock_open(self):
        if self.path == "valid.txt":
            return MagicMock()
        raise OSError("File not found")

    with patch("datachain.lib.file.File.open", mock_open):
        resolved_dc = dc.resolve_files()

    results = list(resolved_dc.collect("file"))
    print("results: ", results)
    assert len(results) == 2
    assert results[0].is_valid
```

The failure:

```
AssertionError: assert None
 +  where None = File(source='', path='valid.txt', size=10, version='', etag='', is_latest=True, last_modified=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=datetime.timezone.utc), location=None, vtype='', is_valid=None).is_valid
/Users/lifei/PycharmProjects/datachain/tests/unit/lib/test_datachain.py:1633: AssertionError
```
Force-pushed from 1a6d2d6 to 49a9a05
@EdwardLi-coder to help you fully understand what is going on underneath, you can set the debug environment variable to see the generated SQL. From the following, it looks like it is not possible to update a field under an existing Feature right now, as the result will not be included in the final output:

```sql
SELECT
  "udf_ZCVabk".sys__id,
  "udf_ZCVabk".sys__rand,
  "udf_ZCVabk".file__source,
  "udf_ZCVabk".file__path,
  "udf_ZCVabk".file__size,
  "udf_ZCVabk".file__version,
  "udf_ZCVabk".file__etag,
  "udf_ZCVabk".file__is_latest,
  "udf_ZCVabk".file__last_modified,
  "udf_ZCVabk".file__location,
  "udf_ZCVabk".file__vtype,
  "udf_ZCVabk".file__is_valid
FROM
  "udf_ZCVabk"
ORDER BY
  "udf_ZCVabk".sys__id LIMIT 100000 OFFSET 0;

SELECT
  "udf_ZCVabk".sys__id,
  "udf_ZCVabk".sys__rand,
  "udf_ZCVabk".file__source,
  "udf_ZCVabk".file__path,
  "udf_ZCVabk".file__size,
  "udf_ZCVabk".file__version,
  "udf_ZCVabk".file__etag,
  "udf_ZCVabk".file__is_latest,
  "udf_ZCVabk".file__last_modified,
  "udf_ZCVabk".file__location,
  "udf_ZCVabk".file__vtype,
  "udf_ZCVabk".file__is_valid
FROM
  "udf_ZCVabk"
ORDER BY
  "udf_ZCVabk".sys__id LIMIT 100000 OFFSET 100000;

INSERT INTO "udf_gCBFP0"
  (sys__id, file__source, file__path, file__size, file__version, file__etag, file__is_latest, file__last_modified, file__location, file__vtype, file__is_valid)
VALUES
  (1, '', 'valid.txt', 10, '', '', 1, '1970-01-01 00:00:00', NULL, '', 1);

INSERT INTO "udf_gCBFP0"
  (sys__id, file__source, file__path, file__size, file__version, file__etag, file__is_latest, file__last_modified, file__location, file__vtype, file__is_valid)
VALUES
  (2, '', 'invalid.txt', 5, '', '', 1, '1970-01-01 00:00:00', NULL, '', 0);

SELECT
  anon_1.file__source,
  anon_1.file__path,
  anon_1.file__size,
  anon_1.file__version,
  anon_1.file__etag,
  anon_1.file__is_latest,
  anon_1.file__last_modified,
  anon_1.file__location,
  anon_1.file__vtype,
  anon_1.file__is_valid
FROM
  (SELECT
    anon_2.sys__id AS sys__id,
    anon_2.sys__rand AS sys__rand,
    anon_2.file__source AS file__source,
    anon_2.file__path AS file__path,
    anon_2.file__size AS file__size,
    anon_2.file__version AS file__version,
    anon_2.file__etag AS file__etag,
    anon_2.file__is_latest AS file__is_latest,
    anon_2.file__last_modified AS file__last_modified,
    anon_2.file__location AS file__location,
    anon_2.file__vtype AS file__vtype,
    anon_2.file__is_valid AS file__is_valid,
    "udf_gCBFP0".file__source AS file__source_1,
    "udf_gCBFP0".file__path AS file__path_1,
    "udf_gCBFP0".file__size AS file__size_1,
    "udf_gCBFP0".file__version AS file__version_1,
    "udf_gCBFP0".file__etag AS file__etag_1,
    "udf_gCBFP0".file__is_latest AS file__is_latest_1,
    "udf_gCBFP0".file__last_modified AS file__last_modified_1,
    "udf_gCBFP0".file__location AS file__location_1,
    "udf_gCBFP0".file__vtype AS file__vtype_1,
    "udf_gCBFP0".file__is_valid AS file__is_valid_1
  FROM
    (SELECT
      "udf_ZCVabk".sys__id AS sys__id,
      "udf_ZCVabk".sys__rand AS sys__rand,
      "udf_ZCVabk".file__source AS file__source,
      "udf_ZCVabk".file__path AS file__path,
      "udf_ZCVabk".file__size AS file__size,
      "udf_ZCVabk".file__version AS file__version,
      "udf_ZCVabk".file__etag AS file__etag,
      "udf_ZCVabk".file__is_latest AS file__is_latest,
      "udf_ZCVabk".file__last_modified AS file__last_modified,
      "udf_ZCVabk".file__location AS file__location,
      "udf_ZCVabk".file__vtype AS file__vtype,
      "udf_ZCVabk".file__is_valid AS file__is_valid
    FROM
      "udf_ZCVabk") AS anon_2
  LEFT OUTER JOIN
    "udf_gCBFP0"
  ON "udf_gCBFP0".sys__id = anon_2.sys__id
  ) AS anon_1
```
OK, thank you very much. I will try using this environment variable.
Force-pushed from 49a9a05 to 313dbd3
Let's review this comment first: #301 (comment) (cc @dmpetrov)
That's a great starting point!
I'd suggest making a few improvements there:
- Implement this as a mapper in datachain.file, not in the core DC class.
- We need to populate all file-specific structures (for all clouds 😅) (size, version, etag, is_latest, last_modified) to make the next command actionable (the next command should be able to pull data).
- Please do not introduce is_valid at this point. We can just keep etag empty/0 as a signal that the file was not resolved.
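Under that convention, downstream code can treat an empty etag as the "unresolved" marker with no extra field. A minimal sketch of the idea (plain dicts stand in for File records; field names mirror the ones above):

```python
# Sketch: an empty etag marks a file that could not be resolved,
# so no separate is_valid field is needed.
records = [
    {"path": "valid.txt", "etag": "8d777f38"},
    {"path": "missing.txt", "etag": ""},  # resolution failed
]

resolved = [r for r in records if r["etag"]]
print([r["path"] for r in resolved])  # -> ['valid.txt']
```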
src/datachain/lib/dc.py (outdated)

```python
                file.is_valid = False
            return file

        return self.map(check_file, output={signal: File})
```
That's a great starting point! I'll provide a summary in the request for changes.
Force-pushed from e61dd89 to 96c2499
@dmpetrov I have followed the requirements and implemented only the resolve method in this PR. If you think we should also implement resolve_uri in this PR, please let me know. Thank you.
Codecov Report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #313   +/-   ##
=======================================
+ Coverage   87.24%   87.27%   +0.02%
=======================================
  Files          92       92
  Lines        9936     9954     +18
  Branches     2047     2048      +1
=======================================
+ Hits         8669     8687     +18
  Misses        913      913
  Partials      354      354
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed from 96c2499 to 2524e37
@EdwardLi-coder sorry for the delay. More details in the discussion: #301
I have updated my code according to your suggestions. After reading the discussion in #301, I noticed that you mentioned possibly needing two functions, so I wanted to ask whether I should implement the resolve_uri function in this PR as well. Could you re-review my code?
Force-pushed from 2524e37 to f951893
Looks great!
Just a couple of small comments inline.
One more action is needed: we need a `def resolve(file: File) -> File: ...` as a shortcut for mappers. It should enable scenarios like:

```python
from datachain.file import resolve

dc = DataChain.from_features(
    my_files=[
        File(source="s3://my_bucket", path="file1.txt"),
        File(source="s3://my_bucket", path="file2.txt"),
    ]
)
dc = dc.map(file=resolve, params="my_files")
content = dc.map(res=lambda file: file.read()).collect()
```
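A minimal, self-contained sketch of such a shortcut. The `File` class here is a simplified stand-in for `datachain.lib.file.File`, and the metadata-fetching step is faked so the example runs on its own:

```python
class File:
    """Simplified stand-in for datachain.lib.file.File."""

    def __init__(self, source: str, path: str, etag: str = ""):
        self.source, self.path, self.etag = source, path, etag

    def resolve(self) -> "File":
        # A real implementation would fetch fresh metadata from storage;
        # here the etag is faked to keep the sketch self-contained.
        return File(self.source, self.path, etag="fake-etag")


def resolve(file: File) -> File:
    # Module-level shortcut, so it can be passed directly to .map()
    return file.resolve()


f = resolve(File("s3://my_bucket", "file1.txt"))
print(f.path, f.etag)  # -> file1.txt fake-etag
```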
Approved. Please feel free to merge once the issues are resolved.
@dmpetrov I have already updated it. Could you please review it again? Thank you.
Force-pushed from c97fae8 to 773784e
This needs to be tested with actual (simulated) clouds, i.e. using the cloud_test_catalog fixture or similar, as in tests.func.test_datachain. The existing tests are so heavily mocked that they basically just duplicate the implementation.

Also (though it makes no difference for using resolve() as a mapper), I think it's a bit strange that it modifies the File object in place. We should treat File objects as immutable as much as possible; otherwise users will quickly end up in situations where the object doesn't match what's in the DB.
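One way to keep the record immutable, sketched with a frozen dataclass as a stand-in for File (in a pydantic model, something like model_copy(update=...) could play the same role; the field values below are fabricated for illustration):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FileRec:
    """Frozen stand-in for File: every update produces a new object."""
    path: str
    etag: str = ""
    size: int = 0


def resolve(f: FileRec) -> FileRec:
    fresh = {"etag": "abc123", "size": 42}  # pretend storage metadata
    return replace(f, **fresh)  # returns a new object; f is untouched


orig = FileRec(path="file1.txt")
new = resolve(orig)
# orig.etag is still "", while new.etag is "abc123"
print(orig.etag, new.etag)
```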
Force-pushed from 4b3fc92 to 0e8ef14
I have updated it according to your suggestions. If you have any other suggestions, I will update it accordingly. Thank you.
Force-pushed from 68c34ad to 3d13924
Thank you! There are still a few issues though (see comments below), and testing can be improved: ideally, test_resolve_file() should check something like `File(source=orig_file.source, path=orig_file.path).resolve() == orig_file` instead of just checking that `orig_file.resolve()` looks reasonable. OTOH, I don't think we need to check the file contents. Also, we should test the case of a non-existing file.
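The suggested round-trip check could be sketched like this, with a local temp directory standing in for the cloud_test_catalog fixture and a frozen dataclass standing in for File (all names here are illustrative, not the library's API):

```python
import os
import tempfile
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FileRec:
    """Frozen stand-in for File."""
    source: str
    path: str
    size: int = 0
    etag: str = ""

    def resolve(self) -> "FileRec":
        full = os.path.join(self.source, self.path)
        try:
            st = os.stat(full)
            return replace(self, size=st.st_size, etag=str(st.st_mtime_ns))
        except OSError:
            return replace(self, size=0, etag="")  # unresolved: empty etag


with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w") as f:
        f.write("hello")
    orig = FileRec(source=d, path="a.txt").resolve()
    # round-trip: a bare (source, path) record resolves to the same thing
    assert FileRec(source=orig.source, path=orig.path).resolve() == orig
    # a non-existing file stays unresolved
    assert FileRec(source=d, path="missing.txt").resolve().etag == ""
    print("ok")
```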
Force-pushed from 2543c6f to c524efc
I updated it.
Force-pushed from 8c6f693 to ef8c47b
Sorry for the late review! This looks good, except for replacing .convert_info() with .info_to_file().
src/datachain/lib/file.py (outdated)

```python
@@ -316,6 +319,72 @@ def get_fs(self):
        """Returns `fsspec` filesystem for the file."""
        return self._catalog.get_client(self.source).fs

    def resolve(self) -> "File":
```
Suggested change:

```diff
-    def resolve(self) -> "File":
+    def resolve(self) -> "Self":
```
src/datachain/lib/file.py (outdated)

```python
        try:
            info = client.fs.info(client.get_full_path(self.path))
            converted_info = client.convert_info(info, self.source)
```
We're actually removing .convert_info() (sorry!); please use client.info_to_file() instead. Also, that way all attributes will be present and have a correct value.
Force-pushed from ef8c47b to da163fb
I updated it.
Force-pushed from da163fb to 765d1a8
Thank you! One last rebase and I think it can be merged.
Force-pushed from 07dfc02 to 3cc64c6
I updated it. Thank you.
LGTM, thanks!
Fix: Need resolve_files() method #301
Closes #301