Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline script metadata #913

Open
wants to merge 17 commits into
base: main
Choose a base branch
from
Open

Conversation

ilongin
Copy link
Contributor

@ilongin ilongin commented Feb 10, 2025

Adding class called ScriptMeta that knows how to parse script inline metadata (https://packaging.python.org/en/latest/specifications/inline-script-metadata/#inline-script-metadata) .

Data that we collect from it are:

  • python version
  • dependencies
  • number of workers
  • files
  • params

With this, we can even avoid having special run script settings in Studio, as user can do everything in the script codebase and can seamlessly c/p script between local and Studio.

Example of script with inline metadata:

# /// script
# requires-python = ">=3.12"
#
# dependencies = [
#   "pandas < 2.1.0",
#   "numpy == 1.26.4"
# ]
#
# [tools.datachain.workers]
# num_workers = 3
#
# [tools.datachain.files]
# image1 = "s3://ldb-public/image1.jpg"
# file1 = "s3://ldb-public/file.pdf"
#
# [tools.datachain.params]
# min_length_sec = 1
# cache = false
#
# ///
import sys
import pandas as pd
print(f"Python version: {sys.version_info}")
print(f"Pandas version: {pd.__version__}")

Example of parsing:

meta = ScriptMeta.parse(script)
print(meta.python_version)
print(meta.get_param("cache"))

@ilongin ilongin linked an issue Feb 10, 2025 that may be closed by this pull request
Copy link

cloudflare-workers-and-pages bot commented Feb 10, 2025

Deploying datachain-documentation with  Cloudflare Pages  Cloudflare Pages

Latest commit: 2564871
Status: ✅  Deploy successful!
Preview URL: https://51a01d53.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-776-inline-script-me.datachain-documentation.pages.dev

View logs

Copy link

codecov bot commented Feb 10, 2025

Codecov Report

Attention: Patch coverage is 91.07143% with 5 lines in your changes missing coverage. Please review.

Project coverage is 87.69%. Comparing base (491aab4) to head (2564871).

Files with missing lines Patch % Lines
src/datachain/script_meta.py 91.07% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #913      +/-   ##
==========================================
+ Coverage   87.68%   87.69%   +0.01%     
==========================================
  Files         130      131       +1     
  Lines       11714    11770      +56     
  Branches     1594     1597       +3     
==========================================
+ Hits        10271    10322      +51     
- Misses       1043     1047       +4     
- Partials      400      401       +1     
Flag Coverage Δ
datachain 87.62% <91.07%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

def read_inline_meta(script: str) -> Optional[dict]:
"""Converts inline script metadata to dict with all found data"""
regex = (
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this regex was taken from https://packaging.python.org/en/latest/specifications/inline-script-metadata/, but this is just an example and I would like to suggest to update this regexp:

Suggested change
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$\s(?P<content>(^#(| .*)$\s)+)^# \/\/\/[ \t]*$"
  1. Escape the / symbol (it is ok in this case, but in general in regex it should be escaped)
  2. Allow trailing spaces (and tabs) in first and last strings (where /// goes)

Also I am not really sure about using \s as a newline matcher, I know $\s combination will only catches newline, but not space or tab, anyway, [\r\n|\r|\n] looks more strict to me:

Suggested change
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(^#(| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$"

Also it is a good idea to use non-capturing groups if no capturing is needed, as it will make regex much faster. For example, in this case I would suggest to use (?:| .*) instead of (| .*), one more group can be converted to non-capturing and the final regex will looks like this:

Suggested change
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(?:^#(?:| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$"

This regex is more strict in terms of regex specifications, a bit more loose in terms of trailing newlines and faster in terms of performance.

Also we may want to use verbose mode (re.VERBOSE) to make it more readable, let me know if I can help with this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a regular parser (not regexp based) that we could use? complex regexs can be a can worms (not saying that this is the case here, but in general it can be tricky and sometimes leads to very bad performance)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a regular parser (not regexp based) that we could use? complex regexs can be a can worms (not saying that this is the case here, but in general it can be tricky and sometimes leads to very bad performance)

I don't think so, in docs they are suggesting to use regex, but we need to take a look 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dreadatour thanks for the suggestions, I've updated the regex as you mentioned.
@shcheklein I also haven't seen anything else, but I think it shouldn't be a problem as this is just to get the toml content itself from the comment and then parsing is done with toml library afterwards so regex will not change when we add more things to toml.

Copy link
Contributor

@dreadatour dreadatour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Only one comment above about regex 👀

@dreadatour
Copy link
Contributor

Oh, and tests fails because of:

ModuleNotFoundError: No module named 'tomllib'

@ilongin
Copy link
Contributor Author

ilongin commented Feb 13, 2025

Oh, and tests fails because of:

ModuleNotFoundError: No module named 'tomllib'

Yea, tomllib is added in python 3.11 version so this is failing in older versions. I'm now using tomli instead which has the same API and works the same.

@ilongin ilongin requested a review from shcheklein February 13, 2025 08:30


@dataclass
class ScriptMeta:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilongin what is the long term plan for this class?

can we put this information into an existing Query class / table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term plan is to use it for parsing env data from script itself with which we could get rid of that settings UI panel and user could c/p code easier between local and Studio.
I'm not sure what did you mean by putting information into existing Query class. Did you mean DatasetQuery ? I would keep it like this and use it in Catalog when we run the scripts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Sorry, disregard the "put somewhere" part - it's not relevant for now. So, we won't use it atm, right? Should we postpone merging this until we have a clear feature that requires it?

with which we could get rid of that settings UI panel and user could c/p code easier between local and Studio.

so, is the next step - using this info in Studio and / or CLI (is this code even needed to run it in CLI?) ... I'm just not sure if we should be merging code that we won't use

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be needed when / if we decide to replace Studio query settings with it, or make it as optional if it makes sense to have two ways to define settings in the first place.
I think it needs to be in datachain codebase as currently we have method in Catalog to run query script which accepts all those env / settings data - now those arguments will be removed and parsed and read from script itself with this class.

We can postpone merge until we have final decision

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein it started with the idea that we can replace manual Settings by requirements and python. I'd definitely use this in Studio (and probably in CLI by running through uw run).

But it has more features like params and inputs that might be need as a way to parametrize the scripts. I this it's a really good start in this direction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I'm fine with this and I think I understand the feature. Thanks @ilongin @dmpetrov for elaborating. I guess my question was if we need to merge it now or when we have some actual use for this, but it's not a blocker or anything. 100% your call on this.

@ilongin ilongin requested a review from shcheklein February 14, 2025 10:27
@shcheklein
Copy link
Member

Yea, tomllib is added in python 3.11 version so this is failing in older versions. I'm now using tomli instead which has the same API and works the same.

can we avoid new deps at least in Python 3.11+ ? every dependency counts

Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change! Some comments are inline.

super().__init__(message)


@dataclass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilongin Could you please rename it? Meta is not a good keyword.
Can we use something like "ScriptConfig" or "InlineConfig"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScripConfig/InlineConfig is too long. Any idea how to simplify it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that I wanted to be as similar as official python naming for this which is Inline script metadata https://packaging.python.org/en/latest/specifications/inline-script-metadata/#inline-script-metadata ... with this in mind Metadata sounds better than Config ... but I'm not 100% sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to ScriptConfig for now

src/datachain/script_meta.py Show resolved Hide resolved
src/datachain/script_meta.py Show resolved Hide resolved
self.params = params or {}
self.num_workers = num_workers

def get_param(self, name: str) -> Any:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the "get" methods need to support default values like InlineConfig.get_param("bucket", "gcs://datachain-demo/")

Also, let support only str as return values. It's needed for parametrization use cases, especially in CLI. Otherwise, you would need to introduce the types in the spec (I'm pretty sure it's not supported in the spec).

Copy link
Contributor Author

@ilongin ilongin Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmpetrov I'm not sure I completely understand the need for always returning str. Isn't it beneficial if someone can set false as a parameter and get boolean False in the script?. Python lib for parsing TOML does automatic conversion for basic types: https://docs.python.org/3/library/tomllib.html#conversion-table

src/datachain/script_meta.py Outdated Show resolved Hide resolved
src/datachain/script_meta.py Outdated Show resolved Hide resolved


@dataclass
class ScriptMeta:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shcheklein it started with the idea that we can replace manual Settings by requirements and python. I'd definitely use this in Studio (and probably in CLI by running through uw run).

But it has more features like params and inputs that might be need as a way to parametrize the scripts. I this it's a really good start in this direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inline project meta
4 participants