
Add resolve files #313

Merged
merged 4 commits into from
Sep 12, 2024

Conversation

EdwardLi-coder
Contributor

@EdwardLi-coder EdwardLi-coder commented Aug 17, 2024

Fix: Need resolve_files() method #301

Closes #301

@EdwardLi-coder
Contributor Author

I've attempted to resolve this issue, but my test is still failing. I'd like to understand why the is_valid field is not being successfully updated. Could you please help me identify the cause of this problem? Thank you.

test_resolve_files

from unittest.mock import MagicMock, patch

from datachain.lib.dc import DataChain
from datachain.lib.file import File


def test_resolve_files(test_session):
    valid_file = File(path="valid.txt", size=10)
    invalid_file = File(path="invalid.txt", size=5)

    dc = DataChain.from_values(file=[valid_file, invalid_file], session=test_session)

    def mock_open(self):
        if self.path == "valid.txt":
            return MagicMock()
        raise OSError("File not found")

    with patch("datachain.lib.file.File.open", mock_open):
        resolved_dc = dc.resolve_files()

        results = list(resolved_dc.collect("file"))
        print("results:", results)
        assert len(results) == 2
        assert results[0].is_valid
AssertionError: assert None

where None = File(source='', path='valid.txt', size=10, version='', etag='', is_latest=True, last_modified=datetime.datetime(1970, 1, 1, 0, 0, tzinfo=datetime.timezone.utc), location=None, vtype='', is_valid=None).is_valid
/Users/lifei/PycharmProjects/datachain/tests/unit/lib/test_datachain.py:1633: AssertionError

@mattseddon
Member

mattseddon commented Aug 19, 2024

@EdwardLi-coder to help you fully understand what is going on underneath you can set the DEBUG_SHOW_SQL_QUERIES=1 environment variable to see the SQL generated by DataChain.

From the following, it looks like it is not possible to update a field under an existing Feature right now as the result will not be included in the final output:

SELECT
        "udf_ZCVabk".sys__id,
        "udf_ZCVabk".sys__rand,
        "udf_ZCVabk".file__source,
        "udf_ZCVabk".file__path,
        "udf_ZCVabk".file__size,
        "udf_ZCVabk".file__version,
        "udf_ZCVabk".file__etag,
        "udf_ZCVabk".file__is_latest,
        "udf_ZCVabk".file__last_modified,
        "udf_ZCVabk".file__location,
        "udf_ZCVabk".file__vtype,
        "udf_ZCVabk".file__is_valid  
    FROM
        "udf_ZCVabk" 
    ORDER BY
        "udf_ZCVabk".sys__id  LIMIT 100000 OFFSET 0;
    SELECT
        "udf_ZCVabk".sys__id,
        "udf_ZCVabk".sys__rand,
        "udf_ZCVabk".file__source,
        "udf_ZCVabk".file__path,
        "udf_ZCVabk".file__size,
        "udf_ZCVabk".file__version,
        "udf_ZCVabk".file__etag,
        "udf_ZCVabk".file__is_latest,
        "udf_ZCVabk".file__last_modified,
        "udf_ZCVabk".file__location,
        "udf_ZCVabk".file__vtype,
        "udf_ZCVabk".file__is_valid  
    FROM
        "udf_ZCVabk" 
    ORDER BY
        "udf_ZCVabk".sys__id  LIMIT 100000 OFFSET 100000;
    INSERT 
    INTO
        "udf_gCBFP0"
        (sys__id, file__source, file__path, file__size, file__version, file__etag, file__is_latest, file__last_modified, file__location, file__vtype, file__is_valid) 
    VALUES
        (1, '', 'valid.txt', 10, '', '', 1, '1970-01-01 00:00:00', NULL, '', 1);
    INSERT 
    INTO
        "udf_gCBFP0"
        (sys__id, file__source, file__path, file__size, file__version, file__etag, file__is_latest, file__last_modified, file__location, file__vtype, file__is_valid) 
    VALUES
        (2, '', 'invalid.txt', 5, '', '', 1, '1970-01-01 00:00:00', NULL, '', 0);
    SELECT
        anon_1.file__source,
        anon_1.file__path,
        anon_1.file__size,
        anon_1.file__version,
        anon_1.file__etag,
        anon_1.file__is_latest,
        anon_1.file__last_modified,
        anon_1.file__location,
        anon_1.file__vtype,
        anon_1.file__is_valid  
    FROM
        (SELECT
            anon_2.sys__id AS sys__id,
            anon_2.sys__rand AS sys__rand,
            anon_2.file__source AS file__source,
            anon_2.file__path AS file__path,
            anon_2.file__size AS file__size,
            anon_2.file__version AS file__version,
            anon_2.file__etag AS file__etag,
            anon_2.file__is_latest AS file__is_latest,
            anon_2.file__last_modified AS file__last_modified,
            anon_2.file__location AS file__location,
            anon_2.file__vtype AS file__vtype,
            anon_2.file__is_valid AS file__is_valid,
            "udf_gCBFP0".file__source AS file__source_1,
            "udf_gCBFP0".file__path AS file__path_1,
            "udf_gCBFP0".file__size AS file__size_1,
            "udf_gCBFP0".file__version AS file__version_1,
            "udf_gCBFP0".file__etag AS file__etag_1,
            "udf_gCBFP0".file__is_latest AS file__is_latest_1,
            "udf_gCBFP0".file__last_modified AS file__last_modified_1,
            "udf_gCBFP0".file__location AS file__location_1,
            "udf_gCBFP0".file__vtype AS file__vtype_1,
            "udf_gCBFP0".file__is_valid AS file__is_valid_1  
        FROM
            (SELECT
                "udf_ZCVabk".sys__id AS sys__id,
                "udf_ZCVabk".sys__rand AS sys__rand,
                "udf_ZCVabk".file__source AS file__source,
                "udf_ZCVabk".file__path AS file__path,
                "udf_ZCVabk".file__size AS file__size,
                "udf_ZCVabk".file__version AS file__version,
                "udf_ZCVabk".file__etag AS file__etag,
                "udf_ZCVabk".file__is_latest AS file__is_latest,
                "udf_ZCVabk".file__last_modified AS file__last_modified,
                "udf_ZCVabk".file__location AS file__location,
                "udf_ZCVabk".file__vtype AS file__vtype,
                "udf_ZCVabk".file__is_valid AS file__is_valid  
            FROM
                "udf_ZCVabk") AS anon_2 
        LEFT OUTER JOIN
            "udf_gCBFP0" 
                ON "udf_gCBFP0".sys__id = anon_2.sys__id
            ) AS anon_1

@EdwardLi-coder
Contributor Author

OK, thank you very much. I will try using this environment variable.

Member

@shcheklein shcheklein left a comment

Please let's review this comment first #301 (comment) (cc @dmpetrov )

Member

@dmpetrov dmpetrov left a comment

That's a great starting point!

I'd suggest making a few improvements here:

  • Implement this as a mapper in datachain.file, not in the core DC class
  • We need to populate all file-specific fields (for all clouds 😅) (size, version, etag, is_latest, last_modified) so that the next command is actionable (the next command should be able to pull the data).
  • Please do not introduce is_valid at this point. We can just keep etag empty/0 as a signal that the file was not resolved.

file.is_valid = False
return file

return self.map(check_file, output={signal: File})
Member

That's a great starting point! I'll provide a summary in the request for changes.

@EdwardLi-coder EdwardLi-coder force-pushed the add-resolve-files branch 2 times, most recently from e61dd89 to 96c2499 Compare August 22, 2024 02:50
@EdwardLi-coder
Contributor Author

@dmpetrov I have followed the requirements and implemented only the resolve method in this PR. If you think we should also implement resolve_uri in this PR, please let me know. Thank you.


codecov bot commented Aug 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.27%. Comparing base (4f741e6) to head (3cc64c6).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #313      +/-   ##
==========================================
+ Coverage   87.24%   87.27%   +0.02%     
==========================================
  Files          92       92              
  Lines        9936     9954      +18     
  Branches     2047     2048       +1     
==========================================
+ Hits         8669     8687      +18     
  Misses        913      913              
  Partials      354      354              
Flag Coverage Δ
datachain 87.22% <100.00%> (+0.02%) ⬆️


@dmpetrov
Member

If you think we should also implement resolve_uri in this PR, please let me know. Thank you.

@EdwardLi-coder sorry for the delay.
We only need resolve_file() (or just resolve()) in this PR. However, the functionality was underspecified. The method should return a resolved file record with all required fields. Meanwhile, is_valid can easily be derived from a resolved file record, like dc.mutate(is_valid=Column("file.etag") != "")

More details in the discussion #301
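The etag-as-signal idea above can be sketched with a plain-Python stand-in (the File dataclass and is_valid helper here are simplified hypotheticals for illustration, not DataChain's actual File model or API):

```python
from dataclasses import dataclass


# Simplified stand-in for a resolved file record; DataChain's real File
# model carries many more fields (source, version, is_latest, ...).
@dataclass
class File:
    path: str
    etag: str = ""  # empty etag signals "not resolved"


def is_valid(f: File) -> bool:
    # Validity is derived from the record rather than stored on it,
    # mirroring the suggested mutate() over the "file.etag" column.
    return f.etag != ""


files = [File("valid.txt", etag="abc123"), File("missing.txt")]
print([is_valid(f) for f in files])  # [True, False]
```

The point of the design choice: no extra is_valid column needs to be stored; validity is always consistent with the record itself.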

@EdwardLi-coder
Contributor Author

EdwardLi-coder commented Aug 24, 2024

I have updated my code according to your suggestions. After reading the discussion in #301, I noticed that you mentioned possibly needing two functions, so I wanted to ask whether I should also implement the resolve_uri function in this PR. You may need to re-review my code.
(screenshot: WX20240824-181317@2x)

@EdwardLi-coder
Contributor Author

@dmpetrov I think the resolve_uri you mentioned should be the method that @volkfox suggested to add in #300.

@dmpetrov
Member

that @volkfox suggested to add in #300.

Exactly. It's a separate issue and better to implement it as a separate PR.

Member

@dmpetrov dmpetrov left a comment

Looks great!

Just a couple of small comments are inline.

One more action is needed: we need a def resolve(file: File) -> File: ... as a shortcut for mappers. It should enable scenarios like:

dc = DataChain.from_features(my_files=[File(source="s3://my_bucket", path="file1.txt"), File(source="s3://my_bucket", path="file2.txt")])

from datachain.file import resolve
dc = dc.map(file=resolve, params="my_files")
content = dc.map(res=lambda file: file.read()).collect()

Approved. Please feel free to merge once the issues are resolved.
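A minimal sketch of such a mapper-friendly shortcut, using a frozen dataclass as a stand-in (the File class, the _INDEX lookup table, and the resolve logic here are hypothetical illustrations, not DataChain's implementation):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class File:  # simplified stand-in for datachain.lib.file.File
    source: str
    path: str
    etag: str = ""


# Hypothetical storage index used in place of a real cloud client.
_INDEX = {("s3://my_bucket", "file1.txt"): "etag-1"}


def resolve(file: File) -> File:
    """Module-level shortcut usable directly as a mapper step."""
    # Returns a new record; the input is left unchanged.
    return replace(file, etag=_INDEX.get((file.source, file.path), ""))


resolved = [resolve(File("s3://my_bucket", p)) for p in ("file1.txt", "file2.txt")]
print([f.etag for f in resolved])  # ['etag-1', '']
```

Because the shortcut is a free function taking a File and returning a File, it can be passed to a map step directly, which is the scenario the comment above describes.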

Inline review comments on src/datachain/lib/file.py and tests/unit/lib/test_file.py (resolved).
@EdwardLi-coder
Contributor Author

@dmpetrov I have already updated it. Could you please review it again? Thank you.

@EdwardLi-coder EdwardLi-coder force-pushed the add-resolve-files branch 2 times, most recently from c97fae8 to 773784e Compare August 28, 2024 11:07
Contributor

@rlamy rlamy left a comment

This needs to be tested with actual (simulated) clouds, i.e. using the cloud_test_catalog fixture or something similar, as in tests.func.test_datachain. The existing tests are so heavily mocked that they basically just duplicate the implementation.

Also (though it makes no difference for using resolve() as a mapper), I think it's a bit strange that it modifies the File object in place. We should treat File objects as immutable as much as possible; otherwise users will quickly end up in situations where the object doesn't match what's in the DB.

Inline review comments on src/datachain/lib/file.py (resolved).
@EdwardLi-coder
Contributor Author

I have updated it according to your suggestions. If you have any other suggestions, I will update it accordingly. Thank you.

@EdwardLi-coder EdwardLi-coder force-pushed the add-resolve-files branch 2 times, most recently from 68c34ad to 3d13924 Compare August 29, 2024 15:09
Contributor

@rlamy rlamy left a comment

Thank you! There are still a few issues, though (see comments below), and testing can be improved:
Ideally, test_resolve_file() should check something like File(source=orig_file.source, path=orig_file.path).resolve() == orig_file instead of just checking that orig_file.resolve() looks reasonable. On the other hand, I don't think we need to check the file contents. Also, we should test the case of a non-existing file.
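The suggested round-trip check can be sketched like this (the File dataclass, _STORAGE index, and resolve function are hypothetical stand-ins for the real File model and cloud fixture):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class File:  # hypothetical, simplified File record
    source: str
    path: str
    etag: str = ""
    size: int = 0


# Fake storage metadata standing in for a simulated cloud fixture.
_STORAGE = {("s3://bucket", "a.txt"): {"etag": "etag-1", "size": 10}}


def resolve(f: File) -> File:
    info = _STORAGE.get((f.source, f.path))
    if info is None:
        # Non-existing file: return a record with default (empty) fields.
        return replace(f, etag="", size=0)
    return replace(f, **info)


orig = File("s3://bucket", "a.txt", etag="etag-1", size=10)
# Round trip: resolving a bare (source, path) record reproduces the original.
assert resolve(File(source=orig.source, path=orig.path)) == orig
# Non-existing file case.
assert resolve(File("s3://bucket", "missing.txt")).etag == ""
```

Testing equality of whole records, rather than inspecting individual fields, is what makes this check stronger than a heavily mocked test.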

Inline review comments on src/datachain/lib/file.py and tests/unit/lib/test_file.py (resolved).
@EdwardLi-coder EdwardLi-coder force-pushed the add-resolve-files branch 2 times, most recently from 2543c6f to c524efc Compare September 6, 2024 13:45
@EdwardLi-coder
Contributor Author

I updated it.

@EdwardLi-coder EdwardLi-coder requested a review from rlamy September 6, 2024 13:47
@EdwardLi-coder EdwardLi-coder force-pushed the add-resolve-files branch 5 times, most recently from 8c6f693 to ef8c47b Compare September 11, 2024 11:40
Contributor

@rlamy rlamy left a comment

Sorry for the late review! This looks good, except for replacing .convert_info() with .info_to_file().

@@ -316,6 +319,72 @@ def get_fs(self):
"""Returns `fsspec` filesystem for the file."""
return self._catalog.get_client(self.source).fs

def resolve(self) -> "File":
Contributor

Suggested change:
- def resolve(self) -> "File":
+ def resolve(self) -> "Self":


try:
    info = client.fs.info(client.get_full_path(self.path))
    converted_info = client.convert_info(info, self.source)
Contributor

We're actually removing .convert_info() (sorry!), please use client.info_to_file() instead. Also, that way all attributes will be present and have a correct value.
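The shape of the error handling being discussed can be sketched as follows (the FakeClient, its info/info_to_file methods, and the exception type are assumptions for illustration, not the exact DataChain client API):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class File:  # hypothetical stand-in for the real File model
    source: str
    path: str
    etag: str = ""


class FakeClient:
    """Stand-in for a storage client; info_to_file builds a full record."""

    def __init__(self, index):
        self._index = index

    def info(self, source, path):
        if (source, path) not in self._index:
            raise FileNotFoundError(path)
        return self._index[(source, path)]

    def info_to_file(self, info, f):
        # Populate every field from the storage metadata.
        return replace(f, **info)


def resolve(f: File, client: FakeClient) -> File:
    try:
        return client.info_to_file(client.info(f.source, f.path), f)
    except FileNotFoundError:
        # Unresolvable: return a record with default (empty) fields
        # rather than raising, so mapping over a chain never aborts.
        return replace(f, etag="")


client = FakeClient({("s3://b", "ok.txt"): {"etag": "e1"}})
print(resolve(File("s3://b", "ok.txt"), client).etag)  # e1
```

The advantage of an info_to_file-style helper over ad-hoc conversion is that every attribute of the returned record is guaranteed to be present with a correct value, which is the point made in the comment above.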

@EdwardLi-coder
Contributor Author

I updated it.

Contributor

@rlamy rlamy left a comment

Thank you! One last rebase and I think it can be merged.

Inline review comments on src/datachain/client/s3.py and src/datachain/lib/file.py (resolved).
@EdwardLi-coder
Contributor Author

I updated it. Thank you.

Contributor

@rlamy rlamy left a comment

LGTM, thanks!

@rlamy rlamy dismissed shcheklein’s stale review September 12, 2024 18:25

comment was addressed

@rlamy rlamy merged commit 027162c into iterative:main Sep 12, 2024
31 of 32 checks passed
@EdwardLi-coder EdwardLi-coder deleted the add-resolve-files branch September 12, 2024 22:22

Successfully merging this pull request may close these issues.

Need resolve_files() method
5 participants