-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding schema param to from_records
#248
Conversation
Deploying datachain-documentation with Cloudflare Pages
|
src/datachain/lib/dc.py
Outdated
@@ -1509,12 +1511,31 @@ def from_records( | |||
session = Session.get(session) | |||
catalog = session.catalog | |||
|
|||
if not to_insert and not schema: | |||
raise ValueError("Schema is required for creating empty dataset") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
message seems to be wrong or misleading (?) considering the condition Schema is required for creating empty dataset
. Condition should be if not schema
then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed message, please check if now it seems better
…tive/datachain into ilongin/246-schema-in-from-records
|
||
Example: | ||
```py | ||
empty = DataChain.from_records() | ||
single_record = DataChain.from_records(DataChain.DEFAULT_FILE_RECORD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, but is the meaning, purpose of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are referring to that example of creating dataset from single record or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, I would say this specific record - DEFAULT_FILE_RECORD - what is the point of this example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
A couple of comments are inline. Mostly about test - some tests are testing so many parts. We should be careful with this.
Also, it feels like DC is becoming a fat class. We should do something about it (not in this PR of course).
|
||
# check that columns have actually been created from schema | ||
dr = ds_sys.catalog.warehouse.dataset_rows(ds_sys.catalog.get_dataset(ds_name)) | ||
assert sorted([c.name for c in dr.c]) == sorted(ds.signals_schema.db_signals()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't belong in this test. It should either be removed or extracted to a separate test if the need is clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted to separate test as I think it's important to test it
tests/unit/lib/test_datachain.py
Outdated
|
||
def test_from_records_empty_chain_without_schema(): | ||
with pytest.raises(ValueError): | ||
DataChain.from_records([], schema=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
misc: schema should be None by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is but I wanted to make that explicit here in test
@@ -2,8 +2,9 @@ | |||
from typing import Optional, Union | |||
|
|||
import pytest | |||
from sqlalchemy import Column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use DataChain's Column, not sqlite. There is a chance we will be replacing the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some strange issues when I was using Column
from DataChain
here so I change to use sqlalchemy.Column
directly.
For some reason index
field was defined when using DataChain.Column
for each column and that was causing failure later on when creating table with those columns.
I need to investigate why is that (maybe we accidentally broke something when defining our Column
class which extends SQLAlchemy one) but regardless, this is all hidden from user, i.e is not in the API itself so it should be good.
src/datachain/lib/dc.py
Outdated
single_record = DataChain.from_records(DataChain.DEFAULT_FILE_RECORD) | ||
``` | ||
""" | ||
session = Session.get(session) | ||
catalog = session.catalog | ||
|
||
if not to_insert and not schema: | ||
raise ValueError("Non empty records to insert or schema must be defined") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we define both? what takes precedence then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, why can't we do a completely empty schema and no records?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I removed this exception. Now if empty records and no schema is set, default columns are created (as in general when no schema is set) ... in followup issue we should try to infer schema from records, if explicit schema is not defined, and then if there is no records no columns will be created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About precedence, schema now takes precedence if it's defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default columns are created
what are the default columns though? It probably comes back to my question re the DEFAULT_FILE_RECORD
- I don't understand its meaning tbh.
LGTM, please take a look one more time and address some unresolved discussions if they make sense. Thanks @ilongin ! |
f9d31d4
to
2063565
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #248 +/- ##
==========================================
+ Coverage 86.91% 86.94% +0.02%
==========================================
Files 90 90
Lines 9892 9898 +6
Branches 1994 1995 +1
==========================================
+ Hits 8598 8606 +8
+ Misses 946 944 -2
Partials 348 348
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Adding optional
schema
param toDataChain.from_records()
This allows us creating datasets with specific schema where important use case is creating empty datasets with no rows but with defined schema. If no rows are provided and there is no explicit schema, exception is thrown.
Follow up is to infer schema from rows itself.