disable execution of last query expression by default #407

Merged: 3 commits, Sep 11, 2024

@skshetry (Member) commented Sep 9, 2024

This PR hides support for executing the last query expression behind a flag, which I plan to remove once Studio is updated.

A few other changes have also been made to the query API:

  1. The envs= keyword argument has been renamed to env=, matching subprocess.Popen(env=).
  2. The python_executable default has changed from None to sys.executable, and the argument no longer accepts None.
  3. The query API no longer returns a QueryResult. Finding the latest dataset version is now the caller's responsibility. Callers are much better placed to do that, since they are the ones who create the job.
  4. With capture_output=True, the query API no longer prints to stdout. The output_hook is now responsible for that.
  5. Exceptions raised by query no longer have output set.
  6. QueryScriptDatasetNotFound has been removed.
  7. script_output and query_script are no longer set on the DatasetVersion. They were not used anyway.

Closes #360.

output: str
if buffer:  # Handle any remaining data in the buffer
    line = buffer.decode("utf-8")
    callback(line)
Member Author

We no longer print to stdout when capture_output=True.
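A minimal sketch of what this means for a caller, assuming a thin subprocess wrapper. `run_with_hook` is a hypothetical stand-in and not datachain's query API. Captured lines go only to the hook, so the caller decides whether to print, log, or store them.

```python
import subprocess
import sys
from typing import Callable


def run_with_hook(script: str, output_hook: Callable[[str], None]) -> int:
    """Run `script` in a child interpreter and pass each output line to the hook.

    Hypothetical helper: nothing is printed here, the hook owns the output.
    """
    with subprocess.Popen(
        [sys.executable, "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the captured stream
    ) as proc:
        assert proc.stdout is not None
        for raw_line in proc.stdout:
            output_hook(raw_line.decode("utf-8"))
    # Leaving the `with` block waits for the child, so returncode is set.
    return proc.returncode


captured: list[str] = []
exit_code = run_with_hook("print('first'); print('second')", captured.append)
```

A caller that wants the old echo-to-terminal behaviour could pass `lambda line: print(line, end="")` as the hook.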

@@ -1805,14 +1772,15 @@ def apply_udf(
     def query(
         self,
         query_script: str,
-        envs: Optional[Mapping[str, str]] = None,
-        python_executable: Optional[str] = None,
+        env: Optional[Mapping[str, str]] = None,
Member Author

Renamed to env. I don't think it should be plural.

-        envs: Optional[Mapping[str, str]] = None,
-        python_executable: Optional[str] = None,
+        env: Optional[Mapping[str, str]] = None,
+        python_executable: str = sys.executable,
Member Author

Changed python_executable to default to sys.executable and not take a None value.
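As a rough illustration of the new signature shape, here is a sketch under the assumption that the function forwards both arguments to subprocess. `run_script` is a hypothetical helper, not the datachain method itself. The interpreter defaults to `sys.executable`, and `env` is passed through in the same way `subprocess.Popen(env=)` accepts it.

```python
import os
import subprocess
import sys
from collections.abc import Mapping
from typing import Optional


def run_script(
    script: str,
    env: Optional[Mapping[str, str]] = None,
    python_executable: str = sys.executable,  # a plain str, None is not accepted
) -> str:
    """Hypothetical wrapper: run `script` with the given interpreter and environment."""
    result = subprocess.run(
        [python_executable, "-c", script],
        env=dict(env) if env is not None else None,  # None means inherit the parent's
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# The caller builds the full environment, adding one variable on top of its own.
out = run_script(
    "import os; print(os.environ['GREETING'])",
    env={**os.environ, "GREETING": "hello"},
)
```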

         save: bool = False,
         capture_output: bool = True,
         output_hook: Callable[[str], None] = noop,
         params: Optional[dict[str, str]] = None,
         job_id: Optional[str] = None,
-    ) -> QueryResult:
+        _execute_last_expression: bool = False,
Member Author

@skshetry skshetry Sep 9, 2024

I plan to get rid of this when we update Studio. (Well, I plan to release Studio with _execute_last_expression=True set first, then break compatibility in the next release).
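For readers unfamiliar with the feature being gated, the sketch below shows one way a script's trailing expression can be detected with the standard `ast` module. It is an assumption-laden illustration: `split_last_expression` is a hypothetical helper, and the PR excerpt does not show how datachain actually wraps the last expression.

```python
import ast
from typing import Optional


def split_last_expression(script: str) -> tuple[str, Optional[str]]:
    """Return (script without its trailing expression, that expression or None).

    Hypothetical helper for illustration only.
    """
    tree = ast.parse(script)
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last = tree.body.pop()  # a bare expression statement at the end
        return ast.unparse(tree), ast.unparse(last.value)
    return ast.unparse(tree), None


body, last_expr = split_last_expression("x = 1\nx + 1")
```

With the flag off, a script is run as written and a trailing expression has no special meaning.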

@@ -42,10 +42,6 @@ def __init__(self, message: str, return_code: int = 0, output: str = ""):
         super().__init__(self.message)


-class QueryScriptDatasetNotFound(QueryScriptRunError):  # noqa: N818
Member Author

No longer used in query. I will create a similar exception on the Studio side.

Comment on lines -1890 to -1902
-        dr = self.update_dataset(
-            dr,
-            script_output=output,
-            query_script=query_script,
-        )
-        self.update_dataset_version_with_warehouse_info(
-            dr,
-            dv.version,
-            script_output=output,
-            query_script=query_script,
-            job_id=job_id,
-            is_job_result=True,
-        )
Member Author

This was not being used anywhere. Job replaces this.

Comment on lines -1872 to -1888
-        def _get_dataset_versions_by_job_id():
-            for dr, dv, job in self.list_datasets_versions():
-                if job and str(job.id) == job_id:
-                    yield dr, dv
-
-        try:
-            dr, dv = max(
-                _get_dataset_versions_by_job_id(), key=lambda x: x[1].created_at
-            )
-        except ValueError as e:
-            if not save:
-                return QueryResult(dataset=None, version=None, output=output)
-
-            raise QueryScriptDatasetNotFound(
-                "No dataset found after running Query script",
-                output=output,
-            ) from e
Member Author

This will have to be done on the caller side, and will eventually be removed when we drop _execute_last_expression support.
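A caller-side version of the removed lookup might look roughly like the sketch below. `VersionRecord` and `latest_version_for_job` are hypothetical names standing in for whatever the caller's own listing returns. The selection logic mirrors the removed code (newest version created by a given job), with `max(..., default=None)` replacing the `ValueError` handling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class VersionRecord:
    """Hypothetical flattened view of a dataset version and the job that created it."""

    dataset_name: str
    version: int
    created_at: datetime
    job_id: Optional[str]


def latest_version_for_job(
    records: Iterable[VersionRecord], job_id: str
) -> Optional[VersionRecord]:
    """Return the most recently created version for `job_id`, or None if there is none."""
    candidates = (r for r in records if r.job_id == job_id)
    return max(candidates, key=lambda r: r.created_at, default=None)


records = [
    VersionRecord("cats", 1, datetime(2024, 9, 9, 10, 0), "job-1"),
    VersionRecord("cats", 2, datetime(2024, 9, 9, 11, 0), "job-1"),
    VersionRecord("dogs", 1, datetime(2024, 9, 9, 12, 0), "job-2"),
]
latest = latest_version_for_job(records, "job-1")
```

Returning `None` leaves it to the caller to decide whether a missing dataset is an error, which is the decision the removed `save` branch used to make inside query.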

codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 79.41176% with 7 lines in your changes missing coverage. Please review.

Project coverage is 87.07%. Comparing base (df24ffa) to head (01fcceb).
Report is 1 commit behind head on main.

Files with missing lines            Patch %   Lines
src/datachain/catalog/catalog.py    79.41%    5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #407      +/-   ##
==========================================
- Coverage   87.32%   87.07%   -0.25%     
==========================================
  Files          92       92              
  Lines        9986     9952      -34     
  Branches     2041     2037       -4     
==========================================
- Hits         8720     8666      -54     
- Misses        911      931      +20     
  Partials      355      355              
Flag         Coverage Δ
datachain    87.02% <79.41%> (-0.25%) ⬇️


cloudflare-workers-and-pages bot commented Sep 10, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 01fcceb
Status: ✅  Deploy successful!
Preview URL: https://cfdd5765.datachain-documentation.pages.dev
Branch Preview URL: https://last-expr.datachain-documentation.pages.dev

                ) from exc
        else:
            query_script_compiled = query_script
            assert not save
Contributor

Reminder to remove this flag as well.

Contributor

@amritghimire amritghimire left a comment

Looks good to me.

Contributor

@dreadatour dreadatour left a comment

Code looks good to me, thank you for removing all this outdated code 🙏

                 ) from exc
-        envs = dict(envs or os.environ)
-        envs.update(
+        env = dict(env or os.environ)
Contributor

@dreadatour dreadatour Sep 10, 2024

Hm, not sure why "or" is used here. Maybe something like this?

Suggested change
-        env = dict(env or os.environ)
+        env = {**os.environ, **(env or {})}

Member Author

How would you provide a way to override envvars of the current process?

This is how subprocess.Popen works, and given that this is a thin wrapper around it, I think it's better to mimic its API.

Also, Studio already provides a copy of all envvars.
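The difference between the two expressions discussed in this thread can be shown with plain dictionaries. In the sketch below, `replace_env` and `merge_env` are hypothetical names, and `base` stands in for `os.environ` so the result is deterministic.

```python
from collections.abc import Mapping
from typing import Optional


def replace_env(env: Optional[Mapping[str, str]], base: Mapping[str, str]) -> dict[str, str]:
    """Merged form: a non-empty `env` replaces the base environment entirely."""
    return dict(env or base)


def merge_env(env: Optional[Mapping[str, str]], base: Mapping[str, str]) -> dict[str, str]:
    """Suggested form: `env` is layered on top of the base environment."""
    return {**base, **(env or {})}


base = {"PATH": "/usr/bin", "SECRET": "token"}
custom = {"PATH": "/opt/bin"}

replaced = replace_env(custom, base)  # SECRET is not passed on
merged = merge_env(custom, base)      # SECRET is always inherited
```

With replace semantics the caller can keep a variable such as `SECRET` away from the child process, which merge semantics cannot express. One detail worth noting: because of the `or`, an empty mapping falls back to the base environment, whereas `subprocess.Popen(env={})` itself would start the child with an empty environment.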

@skshetry skshetry merged commit 48a0de5 into main Sep 11, 2024
38 of 40 checks passed
@skshetry skshetry deleted the last-expr branch September 11, 2024 10:44
def _process_stream(stream: "IO[bytes]", callback: Callable[[str], None]) -> None:
    buffer = b""
    while byt := stream.read(1):  # Read one byte at a time
        buffer += byt
Member

Seeing some errors coming through the test suite of the form: TypeError: can't concat str to bytes.

Examples are here: https://github.com/iterative/datachain/actions/runs/10821775005/job/30024466321?pr=427#step:7:177

Member Author

Fixed tests in #431.

def loop() -> None:
    buffer = b""
    while byt := stream.read(1):  # Read one byte at a time
        buffer += byt.encode("utf-8") if isinstance(byt, str) else byt
Contributor

We may still need this to fix the issue @mattseddon mentioned above

Member Author

Fixed in #431.
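The `TypeError` in this thread comes from concatenating a `str` chunk onto a `bytes` buffer, which can happen when a test substitutes a text stream for the real pipe. The sketch below is a self-contained illustration of the normalisation shown in the diff above, not the fix that landed in #431. The line-splitting rule on newline or carriage return is an assumption, since the excerpt does not show that part.

```python
import io
from typing import IO, Callable, Union


def process_stream(
    stream: Union[IO[bytes], IO[str]], callback: Callable[[str], None]
) -> None:
    """Read one unit at a time and deliver complete lines to `callback`."""
    buffer = b""
    while chunk := stream.read(1):
        # Normalise to bytes so text and binary streams share one code path.
        buffer += chunk.encode("utf-8") if isinstance(chunk, str) else chunk
        if buffer.endswith((b"\n", b"\r")):  # assumed line terminators
            callback(buffer.decode("utf-8"))
            buffer = b""
    if buffer:  # deliver any trailing data without a terminator
        callback(buffer.decode("utf-8"))


from_text: list[str] = []
from_bytes: list[str] = []
process_stream(io.StringIO("héllo\nworld"), from_text.append)
process_stream(io.BytesIO("héllo\nworld".encode("utf-8")), from_bytes.append)
```

Decoding only at line boundaries keeps multi-byte UTF-8 characters intact even though the binary stream is read a byte at a time.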

Successfully merging this pull request may close these issues.

Remove support for executing last expression in the script