disable execution of last query expression by default #407

skshetry · 2024-09-09T05:29:21Z

This PR hides support for executing the last query expression behind a flag, which I plan to remove once Studio is updated.

A few other changes have also been made to the query API:

The envs= keyword argument has been renamed to env, similar to subprocess.Popen(env=).
The python_executable default has changed from None to sys.executable, and it no longer accepts None.
The query API no longer returns a QueryResult. Finding the latest dataset version is now the caller's responsibility. The caller is in a much better position to do that, since it is the one that creates the job.
With capture_output=True, the query API no longer prints to stdout. The output_hook is now responsible for that.
Exceptions raised on query no longer have output set.
QueryScriptDatasetNotFound has been removed.
We no longer set script_output and query_script on the DatasetVersion. These were not used anyway.

Closes #360.

skshetry · 2024-09-09T05:30:19Z

src/datachain/catalog/catalog.py

-    output: str
+    if buffer:  # Handle any remaining data in the buffer
+        line = buffer.decode("utf-8")
+        callback(line)


We'll no longer print to the stdout when capture_output=True.
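To illustrate the change, here is a minimal sketch of how a byte stream can be split into lines and handed to a callback, so that nothing reaches the console unless the hook itself writes there. The name `process_stream` and the newline-based splitting are assumptions for illustration. Only the final buffer flush is taken from the diff above.

```python
import io
from typing import IO, Callable


def process_stream(stream: IO[bytes], callback: Callable[[str], None]) -> None:
    """Forward each line of `stream` to `callback` as decoded text."""
    buffer = b""
    while byt := stream.read(1):  # Read one byte at a time
        buffer += byt
        if byt == b"\n":  # Assumed delimiter: emit on every newline
            callback(buffer.decode("utf-8"))
            buffer = b""
    if buffer:  # Handle any remaining data in the buffer
        callback(buffer.decode("utf-8"))


# Collect the lines instead of printing them.
lines: list[str] = []
process_stream(io.BytesIO(b"first\nsecond"), lines.append)
```

A caller that still wants console output would pass a hook that writes to sys.stdout.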

skshetry · 2024-09-09T05:30:51Z

src/datachain/catalog/catalog.py

@@ -1805,14 +1772,15 @@ def apply_udf(
    def query(
        self,
        query_script: str,
-        envs: Optional[Mapping[str, str]] = None,
-        python_executable: Optional[str] = None,
+        env: Optional[Mapping[str, str]] = None,


Renamed to env. I don't think it should be plural.

skshetry · 2024-09-09T05:31:32Z

src/datachain/catalog/catalog.py

-        envs: Optional[Mapping[str, str]] = None,
-        python_executable: Optional[str] = None,
+        env: Optional[Mapping[str, str]] = None,
+        python_executable: str = sys.executable,


Changed python_executable to default to sys.executable and not take a None value.
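As a rough sketch of what these two parameters mean for a thin wrapper around subprocess, consider the hypothetical helper below. `run_script` is not the `Catalog.query` implementation. It only shows a `python_executable` that defaults to `sys.executable` and an `env` that is passed straight through, where `None` means "inherit the current environment".

```python
import subprocess
import sys
from collections.abc import Mapping
from typing import Optional


def run_script(
    script: str,
    env: Optional[Mapping[str, str]] = None,
    python_executable: str = sys.executable,
) -> str:
    """Run `script` in a child interpreter and return its stdout."""
    result = subprocess.run(
        [python_executable, "-c", script],
        env=env,  # None inherits the parent's environment, as with Popen
        capture_output=True,
        text=True,
        check=True,  # Raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Because the default is a concrete path rather than None, the helper never needs a fallback branch to pick an interpreter.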

skshetry · 2024-09-09T05:31:59Z

src/datachain/catalog/catalog.py

        save: bool = False,
        capture_output: bool = True,
        output_hook: Callable[[str], None] = noop,
        params: Optional[dict[str, str]] = None,
        job_id: Optional[str] = None,
-    ) -> QueryResult:
+        _execute_last_expression: bool = False,


I plan to get rid of this when we update Studio. (Well, I plan to release Studio with _execute_last_expression=True set first, then break compatibility in the next release).
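For context, the "last query expression" is a trailing bare expression at the end of a script. The sketch below shows one way to detect such an expression with the standard ast module. It is an illustration only: `last_expression_source` is a hypothetical helper, and this thread does not show how DataChain itself wraps or executes the last statement.

```python
import ast
from typing import Optional


def last_expression_source(script: str) -> Optional[str]:
    """Return the source of a trailing bare expression, or None if there is none."""
    tree = ast.parse(script)
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        return ast.unparse(tree.body[-1].value)  # ast.unparse needs Python 3.9+
    return None
```

With a flag such as _execute_last_expression left at False, a script would simply run as written and a trailing expression like this would have no special meaning.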

skshetry · 2024-09-09T05:33:31Z

src/datachain/error.py

@@ -42,10 +42,6 @@ def __init__(self, message: str, return_code: int = 0, output: str = ""):
        super().__init__(self.message)


-class QueryScriptDatasetNotFound(QueryScriptRunError):  # noqa: N818


No longer used in query. I will create a similar exception on the Studio side.

skshetry · 2024-09-09T05:33:50Z

src/datachain/catalog/catalog.py

-        dr = self.update_dataset(
-            dr,
-            script_output=output,
-            query_script=query_script,
-        )
-        self.update_dataset_version_with_warehouse_info(
-            dr,
-            dv.version,
-            script_output=output,
-            query_script=query_script,
-            job_id=job_id,
-            is_job_result=True,
-        )


This was not being used anywhere. Job replaces this.

skshetry · 2024-09-09T05:34:36Z

src/datachain/catalog/catalog.py

-        def _get_dataset_versions_by_job_id():
-            for dr, dv, job in self.list_datasets_versions():
-                if job and str(job.id) == job_id:
-                    yield dr, dv
-
-        try:
-            dr, dv = max(
-                _get_dataset_versions_by_job_id(), key=lambda x: x[1].created_at
-            )
-        except ValueError as e:
-            if not save:
-                return QueryResult(dataset=None, version=None, output=output)
-
-            raise QueryScriptDatasetNotFound(
-                "No dataset found after running Query script",
-                output=output,
-            ) from e


This will have to be done on the caller side, and eventually removed when we drop _execute_last_expression support.
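A caller-side version of the removed lookup could look roughly like the sketch below. It assumes the caller can iterate over (dataset, version, job) triples, as the removed code did through list_datasets_versions(). The `Version` and `Job` classes are hypothetical stand-ins, not DataChain types.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Version:  # Hypothetical stand-in for a dataset version
    version: int
    created_at: datetime


@dataclass
class Job:  # Hypothetical stand-in for a job record
    id: str


def latest_version_for_job(
    rows: Iterable[tuple[str, Version, Optional[Job]]], job_id: str
) -> Optional[tuple[str, Version]]:
    """Return the newest (dataset, version) pair created by `job_id`, if any."""
    matches = [(dr, dv) for dr, dv, job in rows if job and str(job.id) == job_id]
    # default=None replaces the try/except ValueError around max() in the old code
    return max(matches, key=lambda x: x[1].created_at, default=None)
```

Returning None leaves it to the caller to decide whether a missing dataset is an error, which is the decision the removed code made based on save.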

codecov · 2024-09-09T05:35:09Z

Codecov Report

Attention: Patch coverage is 79.41176% with 7 lines in your changes missing coverage. Please review.

Project coverage is 87.07%. Comparing base (df24ffa) to head (01fcceb).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datachain/catalog/catalog.py	79.41%	5 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #407      +/-   ##
==========================================
- Coverage   87.32%   87.07%   -0.25%     
==========================================
  Files          92       92              
  Lines        9986     9952      -34     
  Branches     2041     2037       -4     
==========================================
- Hits         8720     8666      -54     
- Misses        911      931      +20     
  Partials      355      355

Flag	Coverage Δ
datachain	`87.02% <79.41%> (-0.25%)`	⬇️


cloudflare-workers-and-pages · 2024-09-10T03:38:21Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`01fcceb`
Status:	✅ Deploy successful!
Preview URL:	https://cfdd5765.datachain-documentation.pages.dev
Branch Preview URL:	https://last-expr.datachain-documentation.pages.dev


amritghimire · 2024-09-10T04:29:28Z

src/datachain/catalog/catalog.py

+                ) from exc
+        else:
+            query_script_compiled = query_script
+            assert not save


Reminder to remove this flag as well.

amritghimire

Looks good to me.

dreadatour

Code looks good to me, thank you for removing all this outdated code 🙏

dreadatour · 2024-09-10T06:43:26Z

src/datachain/catalog/catalog.py

-            ) from exc
-        envs = dict(envs or os.environ)
-        envs.update(
+        env = dict(env or os.environ)


Hm, not sure why or? Maybe something like this?

Suggested change

-        env = dict(env or os.environ)
+        env = {**os.environ, **(env or {})}

How would you provide a way to override envvars of the current process?

This is how subprocess.Popen works, and given this is a thin wrapper around it, I think it's better to mimic its API.

Also, Studio already provides a copy of all envvars.
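The difference between the two expressions in this thread can be seen in a small sketch. The function names `replace_env` and `merge_env` are made up for illustration; the expressions inside them are the ones quoted above.

```python
import os
from collections.abc import Mapping
from typing import Optional


def replace_env(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Expression used in the PR: a non-empty env replaces the inherited one."""
    return dict(env or os.environ)


def merge_env(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Suggested alternative: overlay env on top of the inherited variables."""
    return {**os.environ, **(env or {})}
```

With `replace_env`, the child process sees only the variables the caller passed, which is what allows a caller to drop variables of the current process. With `merge_env`, inherited variables can be overridden but never removed. Note that because of the `or`, an empty mapping also falls back to os.environ in `replace_env`, unlike subprocess.Popen(env={}).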


mattseddon · 2024-09-12T00:40:09Z

src/datachain/catalog/catalog.py

+def _process_stream(stream: "IO[bytes]", callback: Callable[[str], None]) -> None:
+    buffer = b""
+    while byt := stream.read(1):  # Read one byte at a time
+        buffer += byt


Seeing some errors coming through the test suite of the form: TypeError: can't concat str to bytes.

Examples are here: https://github.com/iterative/datachain/actions/runs/10821775005/job/30024466321?pr=427#step:7:177

Fixed tests in #431.

amritghimire · 2024-09-12T01:24:41Z

src/datachain/catalog/catalog.py

-    def loop() -> None:
-        buffer = b""
-        while byt := stream.read(1):  # Read one byte at a time
-            buffer += byt.encode("utf-8") if isinstance(byt, str) else byt


We may still need this to fix the issue @mattseddon mentioned above

Fixed in #431.
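The error reported above is what Python raises when a str chunk is appended to a bytes buffer, for example when a test mock returns a text stream instead of a binary one. The sketch below reproduces both loop variants quoted in the two comments above. The names `read_strict` and `read_tolerant` are hypothetical.

```python
import io
from typing import IO, Union


def read_strict(stream: IO[bytes]) -> bytes:
    """Accumulate bytes; raises TypeError if the stream yields str."""
    buffer = b""
    while byt := stream.read(1):  # Read one unit at a time
        buffer += byt
    return buffer


def read_tolerant(stream: Union[IO[bytes], IO[str]]) -> bytes:
    """Accumulate bytes, encoding any str chunks as UTF-8 first."""
    buffer = b""
    while byt := stream.read(1):
        buffer += byt.encode("utf-8") if isinstance(byt, str) else byt
    return buffer
```

Passing io.StringIO to `read_strict` fails with the reported TypeError, while io.BytesIO works with both variants. Making the mocks return a binary stream, as the title of #431 describes, keeps the strict loop usable.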

skshetry requested review from dreadatour, ilongin and amritghimire September 9, 2024 05:29

skshetry commented Sep 9, 2024


skshetry force-pushed the last-expr branch from ba17036 to a0d9e37 Compare September 10, 2024 03:38

amritghimire reviewed Sep 10, 2024


amritghimire approved these changes Sep 10, 2024


dreadatour approved these changes Sep 10, 2024


skshetry added 3 commits September 11, 2024 14:36

avoid setting script_output and query_script in the dataset and dataset_version

2402d4f

avoid returning latest dataset, let the caller do the work

ec06ad1

disable wrapping last statement by default

01fcceb

skshetry force-pushed the last-expr branch from a0d9e37 to 01fcceb Compare September 11, 2024 08:51

skshetry merged commit 48a0de5 into main Sep 11, 2024
38 of 40 checks passed

skshetry deleted the last-expr branch September 11, 2024 10:44

mattseddon reviewed Sep 12, 2024


amritghimire reviewed Sep 12, 2024


skshetry mentioned this pull request Sep 12, 2024

tests: fix mock for subprocess stdout/stderr to return BytesIO #431

Merged
