Adding schema param to from_records #248

Merged
ilongin merged 13 commits into main from ilongin/246-schema-in-from-records on Aug 20, 2024


@ilongin ilongin commented Aug 7, 2024

Adds an optional schema param to DataChain.from_records().
This lets us create datasets with a specific schema. An important use case is creating empty datasets that have no rows but a defined schema. If no rows are provided and there is no explicit schema, an exception is thrown.

A follow-up is to infer the schema from the rows themselves.
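
The rule in this description can be sketched without DataChain itself. This is a minimal stand-in, not the library's implementation: the function name from_records_sketch, its return shape, and the schema format (column name mapped to a Python type) are assumptions made for illustration. Only the error message is taken from the diff in this PR.

```python
from typing import Optional, Union

Record = dict[str, object]


def from_records_sketch(
    to_insert: Optional[Union[Record, list[Record]]] = None,
    schema: Optional[dict[str, type]] = None,
) -> dict:
    """Hypothetical stand-in for the check described in this PR.

    Not DataChain code: it only models "rows or schema must be given".
    """
    # Accept a single record as well as a list of records.
    if isinstance(to_insert, dict):
        to_insert = [to_insert]
    rows = list(to_insert or [])

    # No rows and no explicit schema: nothing to build columns from.
    if not rows and not schema:
        raise ValueError("Schema is required for creating empty dataset")

    # None means "no explicit schema"; what happens then is outside this sketch.
    columns = list(schema) if schema else None
    return {"columns": columns, "rows": rows}


# An empty dataset with a defined schema is accepted.
empty = from_records_sketch([], schema={"name": str, "size": int})
print(empty["columns"])  # ['name', 'size']
```

Calling from_records_sketch([]) with no schema raises ValueError, which is the case the description calls out.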

@ilongin ilongin marked this pull request as draft August 7, 2024 09:48

cloudflare-workers-and-pages bot commented Aug 7, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: daaaf5b
Status: ✅  Deploy successful!
Preview URL: https://2ef9d0d1.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-246-schema-in-from-r.datachain-documentation.pages.dev


@ilongin ilongin linked an issue Aug 7, 2024 that may be closed by this pull request
@ilongin ilongin marked this pull request as ready for review August 7, 2024 13:22
@ilongin ilongin requested a review from shcheklein August 7, 2024 21:59
@@ -1509,12 +1511,31 @@ def from_records(
        session = Session.get(session)
        catalog = session.catalog

        if not to_insert and not schema:
            raise ValueError("Schema is required for creating empty dataset")
Member

The message "Schema is required for creating empty dataset" seems wrong or misleading (?) considering the condition. For that message, the condition should be if not schema.

Contributor Author

Changed the message; please check whether it reads better now.


Example:
```py
empty = DataChain.from_records()
single_record = DataChain.from_records(DataChain.DEFAULT_FILE_RECORD)
Member

okay, but what is the meaning or purpose of this?

Contributor Author

Are you referring to the example of creating a dataset from a single record?

Member

yep, I would say this specific record - DEFAULT_FILE_RECORD - what is the point of this example?

Member

@dmpetrov dmpetrov left a comment

Looks good!

A couple of comments are inline. They are mostly about tests: some tests cover too many parts at once. We should be careful with this.

Also, it feels like DC is becoming a fat class. We should do something about it (not in this PR of course).


# check that columns have actually been created from schema
dr = ds_sys.catalog.warehouse.dataset_rows(ds_sys.catalog.get_dataset(ds_name))
assert sorted([c.name for c in dr.c]) == sorted(ds.signals_schema.db_signals())
Member

This doesn't belong in this test. It should either be removed or extracted to a separate test if the need is clear.

Contributor Author

Extracted to a separate test, as I think it's important to test it.


def test_from_records_empty_chain_without_schema():
    with pytest.raises(ValueError):
        DataChain.from_records([], schema=None)
Member

misc: schema should be None by default

Contributor Author

It is, but I wanted to make that explicit here in the test.

@@ -2,8 +2,9 @@
from typing import Optional, Union

import pytest
from sqlalchemy import Column
Member

Please use DataChain's Column, not SQLAlchemy's. There is a chance we will be replacing the implementation.

Contributor Author

@ilongin ilongin Aug 10, 2024

I had some strange issues when I was using Column from DataChain here, so I changed to using sqlalchemy.Column directly.
For some reason, an index field was defined on each column when using DataChain.Column, and that caused a failure later on when creating a table with those columns.
I need to investigate why that is (maybe we accidentally broke something when defining our Column class, which extends the SQLAlchemy one), but regardless, this is all hidden from the user, i.e. it is not in the API itself, so it should be fine.

@ilongin ilongin requested a review from shcheklein August 14, 2024 00:27
single_record = DataChain.from_records(DataChain.DEFAULT_FILE_RECORD)
```
"""
        session = Session.get(session)
        catalog = session.catalog

        if not to_insert and not schema:
            raise ValueError("Non empty records to insert or schema must be defined")
Member

can we define both? what takes precedence then?

Member

btw, why can't we do a completely empty schema and no records?

Contributor Author

Yeah, I removed this exception. Now, if the records are empty and no schema is set, default columns are created (as is generally the case when no schema is set). In a follow-up issue we should try to infer the schema from the records when no explicit schema is defined; then, if there are no records, no columns will be created.

Contributor Author

About precedence, schema now takes precedence if it's defined
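
The rule settled on in this thread (an explicit schema wins; otherwise default columns are used, even for empty records) can be modelled in a few lines. This is a sketch only, not DataChain's code: resolve_columns is a hypothetical helper, and DEFAULT_COLUMNS holds placeholder names, since the thread does not say what the default columns actually are.

```python
from typing import Optional

# Hypothetical placeholder: the thread does not say what the default columns are.
DEFAULT_COLUMNS = ["default_col_a", "default_col_b"]


def resolve_columns(
    records: Optional[list[dict]] = None,
    schema: Optional[dict[str, type]] = None,
) -> list[str]:
    """Model the precedence rule from this thread; not DataChain's actual code."""
    # An explicit schema wins, whether or not records are given.
    if schema:
        return list(schema)
    # No schema: fall back to the defaults, even for empty records,
    # so the earlier ValueError is no longer raised.
    return list(DEFAULT_COLUMNS)


print(resolve_columns([{"x": 1}], schema={"name": str}))  # ['name']
print(resolve_columns([]) == DEFAULT_COLUMNS)  # True
```

Once schema inference from records lands in the follow-up, the fallback branch would be the part that changes.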

Member

"default columns are created"

what are the default columns though? It probably comes back to my question re the DEFAULT_FILE_RECORD - I don't understand its meaning tbh.

@shcheklein
Member

LGTM, please take a look one more time and address some unresolved discussions if they make sense. Thanks @ilongin !

@ilongin ilongin force-pushed the ilongin/246-schema-in-from-records branch from f9d31d4 to 2063565 Compare August 20, 2024 08:12

codecov bot commented Aug 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.94%. Comparing base (6fdc261) to head (daaaf5b).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #248      +/-   ##
==========================================
+ Coverage   86.91%   86.94%   +0.02%     
==========================================
  Files          90       90              
  Lines        9892     9898       +6     
  Branches     1994     1995       +1     
==========================================
+ Hits         8598     8606       +8     
+ Misses        946      944       -2     
  Partials      348      348              
Flag Coverage Δ
datachain 86.87% <100.00%> (+0.02%) ⬆️


@ilongin ilongin merged commit 70fe5a1 into main Aug 20, 2024
38 checks passed
@ilongin ilongin deleted the ilongin/246-schema-in-from-records branch August 20, 2024 08:48
Successfully merging this pull request may close these issues.

Accept schema in DataChain.from_records()
3 participants