Adding schema param to `from_records` #248

ilongin · 2024-08-07T09:48:32Z

Adds an optional `schema` param to `DataChain.from_records()`.
This allows creating datasets with a specific schema; an important use case is creating empty datasets that have no rows but a defined schema. If no rows are provided and there is no explicit schema, an exception is thrown.

A follow-up is to infer the schema from the rows themselves.

cloudflare-workers-and-pages · 2024-08-07T12:49:50Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`daaaf5b`
Status:	✅ Deploy successful!
Preview URL:	https://2ef9d0d1.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-246-schema-in-from-r.datachain-documentation.pages.dev

shcheklein · 2024-08-08T01:18:44Z

src/datachain/lib/dc.py

@@ -1509,12 +1511,31 @@ def from_records(
        session = Session.get(session)
        catalog = session.catalog

+        if not to_insert and not schema:
+            raise ValueError("Schema is required for creating empty dataset")


The message "Schema is required for creating empty dataset" seems wrong or misleading (?) considering the condition. For that message, the condition should be `if not schema`.

Changed the message; please check whether it reads better now.
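
To see why the original message was misleading, it helps to enumerate the four input combinations of the guard. The sketch below is a standalone illustration, not DataChain code: `guard_fires` is a hypothetical helper that only mirrors the `if not to_insert and not schema` condition from the diff, and the sample records and schema are made up.

```python
from itertools import product


def guard_fires(to_insert, schema):
    """Mirror the condition from the diff: True when the ValueError would be raised."""
    return not to_insert and not schema


# Hypothetical sample values standing in for "empty" and "non-empty" arguments.
records_cases = {"no records": [], "some records": [{"name": "a"}]}
schema_cases = {"no schema": None, "schema": {"name": str}}

# Print one line per combination of records and schema.
for (r_label, records), (s_label, schema) in product(
    records_cases.items(), schema_cases.items()
):
    print(f"{r_label:12} | {s_label:9} | raises: {guard_fires(records, schema)}")
```

Only the "no records, no schema" combination raises, so the error is about the two arguments together rather than about the schema alone.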

shcheklein · 2024-08-09T17:15:37Z

src/datachain/lib/dc.py


        Example:
            ```py
-            empty = DataChain.from_records()
            single_record = DataChain.from_records(DataChain.DEFAULT_FILE_RECORD)


okay, but what is the meaning or purpose of this?

Are you referring to the example of creating a dataset from a single record?

yep, I would say this specific record - `DEFAULT_FILE_RECORD` - what is the point of this example?

dmpetrov

Looks good!

A couple of comments are inline, mostly about tests: some tests cover too many parts at once. We should be careful with this.

Also, it feels like DC is becoming a fat class. We should do something about it (not in this PR, of course).

dmpetrov · 2024-08-09T17:55:51Z

tests/unit/lib/test_datachain.py

+
+    # check that columns have actually been created from schema
+    dr = ds_sys.catalog.warehouse.dataset_rows(ds_sys.catalog.get_dataset(ds_name))
+    assert sorted([c.name for c in dr.c]) == sorted(ds.signals_schema.db_signals())


This doesn't belong in this test. It should either be removed or extracted to a separate test if the need is clear.

Extracted to a separate test, as I think it's important to test it.

dmpetrov · 2024-08-09T17:56:33Z

tests/unit/lib/test_datachain.py

+
+def test_from_records_empty_chain_without_schema():
+    with pytest.raises(ValueError):
+        DataChain.from_records([], schema=None)


misc: schema should be None by default

It is, but I wanted to make that explicit here in the test.
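
One way to keep both concerns visible is to check the declared default separately from the error path. This is a minimal sketch under stated assumptions: `from_records` below is a hypothetical stand-in function (not the real `DataChain.from_records`), and `default_of` is a helper invented for illustration that reads the default through the standard `inspect.signature`.

```python
import inspect


def from_records(to_insert, schema=None):
    """Hypothetical stand-in with the same keyword default as discussed above."""
    if not to_insert and not schema:
        raise ValueError("Non empty records to insert or schema must be defined")
    return to_insert


def default_of(func, name):
    """Return the declared default of parameter `name` on `func`."""
    return inspect.signature(func).parameters[name].default


# The default is checked once, on its own.
assert default_of(from_records, "schema") is None

# The error path can then rely on the default instead of passing schema=None.
try:
    from_records([])
except ValueError as exc:
    print(f"raised as expected: {exc}")
```

With the default pinned by its own assertion, the error-path test no longer needs to spell out `schema=None`.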

dmpetrov · 2024-08-09T17:58:03Z

tests/unit/lib/test_signal_schema.py

@@ -2,8 +2,9 @@
 from typing import Optional, Union

 import pytest
+from sqlalchemy import Column


Please use DataChain's `Column`, not SQLAlchemy's. There is a chance we will be replacing the implementation.

I had some strange issues when I used `Column` from DataChain here, so I changed it to use `sqlalchemy.Column` directly.
For some reason, the index field was defined for each column when using DataChain's `Column`, and that caused a failure later on when creating a table with those columns.
I need to investigate why that is (maybe we accidentally broke something when defining our `Column` class, which extends the SQLAlchemy one), but regardless, this is all hidden from the user, i.e. it is not in the API itself, so it should be fine.

shcheklein · 2024-08-14T03:48:59Z

src/datachain/lib/dc.py

            single_record = DataChain.from_records(DataChain.DEFAULT_FILE_RECORD)
            ```
        """
        session = Session.get(session)
        catalog = session.catalog

+        if not to_insert and not schema:
+            raise ValueError("Non empty records to insert or schema must be defined")


can we define both? what takes precedence then?

btw, why can't we do a completely empty schema and no records?

Yeah, I removed this exception. Now, if the records are empty and no schema is set, default columns are created (as in general when no schema is set). In a follow-up issue we should try to infer the schema from the records if an explicit schema is not defined; then, if there are no records, no columns will be created.

About precedence: the schema now takes precedence if it's defined.
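
The precedence rule described here can be sketched in plain Python. This is an illustration only, not the DataChain implementation: `resolve_columns`, its parameters, and the `DEFAULTS` mapping are hypothetical names, and the actual default columns are not specified in this thread.

```python
def resolve_columns(records, schema=None, default_columns=None):
    """Pick column definitions following the behavior described in the thread.

    `records` is accepted only to mirror the call shape; until schema
    inference is added in a follow-up, it does not influence the result.
    """
    if schema:
        # An explicit schema wins, whether or not records are supplied.
        return dict(schema)
    # No schema: fall back to default columns, even for an empty record list.
    return dict(default_columns or {})


# Placeholder defaults, assumed purely for illustration.
DEFAULTS = {"path": str, "size": int}

print(resolve_columns([], schema={"name": str}, default_columns=DEFAULTS))
print(resolve_columns([], default_columns=DEFAULTS))
```

The first call returns only the explicit schema; the second falls back to the placeholder defaults because no schema was given.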

> default columns are created

what are the default columns though? It probably comes back to my question re the DEFAULT_FILE_RECORD - I don't understand its meaning tbh.

shcheklein · 2024-08-14T04:05:45Z

LGTM, please take a look one more time and address some unresolved discussions if they make sense. Thanks @ilongin !

codecov · 2024-08-20T08:29:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.94%. Comparing base (6fdc261) to head (daaaf5b).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #248      +/-   ##
==========================================
+ Coverage   86.91%   86.94%   +0.02%     
==========================================
  Files          90       90              
  Lines        9892     9898       +6     
  Branches     1994     1995       +1     
==========================================
+ Hits         8598     8606       +8     
+ Misses        946      944       -2     
  Partials      348      348

Flag	Coverage Δ
datachain	`86.87% <100.00%> (+0.02%)`	⬆️


adding schema param to from_records

4babd9e

ilongin marked this pull request as draft August 7, 2024 09:48

fixing the logic

e823673

removed print

462a424

ilongin linked an issue Aug 7, 2024 that may be closed by this pull request

Accept schema in DataChain.from_records() #246

Closed

fix test

556318c

ilongin requested review from rlamy, dtulga, skshetry and shcheklein August 7, 2024 13:22

ilongin marked this pull request as ready for review August 7, 2024 13:22

Merge branch 'main' into ilongin/246-schema-in-from-records

e24e715

shcheklein reviewed Aug 7, 2024

ilongin mentioned this pull request Aug 7, 2024

Generator function to list bucket #244

Closed

ilongin added 2 commits August 7, 2024 23:51

Merge branch 'main' into ilongin/246-schema-in-from-records

a600e28

updated docstring

bbcedc5

ilongin requested a review from shcheklein August 7, 2024 21:59

Merge branch 'main' into ilongin/246-schema-in-from-records

1f47445

shcheklein reviewed Aug 8, 2024

shcheklein reviewed Aug 8, 2024

ilongin added 3 commits August 8, 2024 09:33

Merge branch 'main' into ilongin/246-schema-in-from-records

d8a51dd

updated docstring and error message

de5434a

Merge branch 'ilongin/246-schema-in-from-records' of github.com:itera…

1fa957b

…tive/datachain into ilongin/246-schema-in-from-records

ilongin requested review from shcheklein, dmpetrov and mattseddon August 8, 2024 07:37

shcheklein reviewed Aug 9, 2024

dmpetrov approved these changes Aug 9, 2024

ilongin requested a review from shcheklein August 14, 2024 00:27

shcheklein reviewed Aug 14, 2024

shcheklein approved these changes Aug 14, 2024

merging with main

2063565

ilongin force-pushed the ilongin/246-schema-in-from-records branch from f9d31d4 to 2063565 Compare August 20, 2024 08:12

changed behavior of empty records and no explicit schema

daaaf5b

ilongin merged commit 70fe5a1 into main Aug 20, 2024
38 checks passed

ilongin deleted the ilongin/246-schema-in-from-records branch August 20, 2024 08:48
