-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inline script metadata #913
base: main
Are you sure you want to change the base?
Conversation
Deploying datachain-documentation with
|
Latest commit: |
2564871
|
Status: | ✅ Deploy successful! |
Preview URL: | https://51a01d53.datachain-documentation.pages.dev |
Branch Preview URL: | https://ilongin-776-inline-script-me.datachain-documentation.pages.dev |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #913 +/- ##
==========================================
+ Coverage 87.68% 87.69% +0.01%
==========================================
Files 130 131 +1
Lines 11714 11770 +56
Branches 1594 1597 +3
==========================================
+ Hits 10271 10322 +51
- Misses 1043 1047 +4
- Partials 400 401 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
src/datachain/script_meta.py
Outdated
def read_inline_meta(script: str) -> Optional[dict]: | ||
"""Converts inline script metadata to dict with all found data""" | ||
regex = ( | ||
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this regex was taken from https://packaging.python.org/en/latest/specifications/inline-script-metadata/, but this is just an example and I would like to suggest to update this regexp:
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$" | |
r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$\s(?P<content>(^#(| .*)$\s)+)^# \/\/\/[ \t]*$" |
- Escape the
/
symbol (it is ok in this case, but in general in regex it should be escaped) - Allow trailing spaces (and tabs) in first and last strings (where
///
goes)
Also I am not really sure about using \s
as a newline matcher, I know $\s
combination will only catches newline, but not space or tab, anyway, [\r\n|\r|\n]
looks more strict to me:
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$" | |
r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(^#(| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$" |
Also it is a good idea to use non-capturing groups if no capturing is needed, as it will make regex much faster. For example, in this case I would suggest to use (?:| .*)
instead of (| .*)
, one more group can be converted to non-capturing and the final regex will looks like this:
r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$" | |
r"(?m)^# \/\/\/ (?P<type>[a-zA-Z0-9-]+)[ \t]*$[\r\n|\r|\n](?P<content>(?:^#(?:| .*)$[\r\n|\r|\n])+)^# \/\/\/[ \t]*$" |
This regex is more strict in terms of regex specifications, a bit more loose in terms of trailing newlines and faster in terms of performance.
Also we may want to use verbose mode (re.VERBOSE
) to make it more readable, let me know if I can help with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a regular parser (not regexp based) that we could use? complex regexs can be a can worms (not saying that this is the case here, but in general it can be tricky and sometimes leads to very bad performance)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a regular parser (not regexp based) that we could use? complex regexs can be a can worms (not saying that this is the case here, but in general it can be tricky and sometimes leads to very bad performance)
I don't think so, in docs they are suggesting to use regex, but we need to take a look 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dreadatour thanks for the suggestions, I've updated the regex as you mentioned.
@shcheklein I also haven't seen anything else, but I think it shouldn't be a problem as this is just to get the toml
content itself from the comment and then parsing is done with toml library afterwards so regex will not change when we add more things to toml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Only one comment above about regex 👀
Oh, and tests fails because of:
|
Yea, |
src/datachain/script_meta.py
Outdated
|
||
|
||
@dataclass | ||
class ScriptMeta: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilongin what is the long term plan for this class?
can we put this information into an existing Query class / table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long term plan is to use it for parsing env data from script itself with which we could get rid of that settings UI panel and user could c/p code easier between local and Studio.
I'm not sure what did you mean by putting information into existing Query class. Did you mean DatasetQuery
? I would keep it like this and use it in Catalog
when we run the scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Sorry, disregard the "put somewhere" part - it's not relevant for now. So, we won't use it atm, right? Should we postpone merging this until we have a clear feature that requires it?
with which we could get rid of that settings UI panel and user could c/p code easier between local and Studio.
so, is the next step - using this info in Studio and / or CLI (is this code even needed to run it in CLI?) ... I'm just not sure if we should be merging code that we won't use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be needed when / if we decide to replace Studio query settings with it, or make it as optional if it makes sense to have two ways to define settings in the first place.
I think it needs to be in datachain codebase as currently we have method in Catalog
to run query script which accepts all those env / settings data - now those arguments will be removed and parsed and read from script itself with this class.
We can postpone merge until we have final decision
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shcheklein it started with the idea that we can replace manual Settings by requirements and python. I'd definitely use this in Studio (and probably in CLI by running through uw run
).
But it has more features like params
and inputs
that might be need as a way to parametrize the scripts. I this it's a really good start in this direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid new deps at least in Python 3.11+ ? every dependency counts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great change! Some comments are inline.
super().__init__(message) | ||
|
||
|
||
@dataclass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ilongin Could you please rename it? Meta is not a good keyword.
Can we use something like "ScriptConfig" or "InlineConfig"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ScripConfig/InlineConfig is too long. Any idea how to simplify it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that I wanted to be as similar as official python naming for this which is Inline script metadata https://packaging.python.org/en/latest/specifications/inline-script-metadata/#inline-script-metadata ... with this in mind Metadata
sounds better than Config
... but I'm not 100% sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to ScriptConfig
for now
src/datachain/script_meta.py
Outdated
self.params = params or {} | ||
self.num_workers = num_workers | ||
|
||
def get_param(self, name: str) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the "get" methods need to support default values like InlineConfig.get_param("bucket", "gcs://datachain-demo/")
Also, let support only str
as return values. It's needed for parametrization use cases, especially in CLI. Otherwise, you would need to introduce the types in the spec (I'm pretty sure it's not supported in the spec).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmpetrov I'm not sure I completely understand the need for always returning str
. Isn't it beneficial if someone can set false
as a parameter and get boolean False
in the script?. Python lib for parsing TOML does automatic conversion for basic types: https://docs.python.org/3/library/tomllib.html#conversion-table
src/datachain/script_meta.py
Outdated
|
||
|
||
@dataclass | ||
class ScriptMeta: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shcheklein it started with the idea that we can replace manual Settings by requirements and python. I'd definitely use this in Studio (and probably in CLI by running through uw run
).
But it has more features like params
and inputs
that might be need as a way to parametrize the scripts. I this it's a really good start in this direction.
Adding class called
ScriptMeta
that knows how to parse script inline metadata (https://packaging.python.org/en/latest/specifications/inline-script-metadata/#inline-script-metadata) .Data that we collect from it are:
With this, we can even avoid having special run script settings in Studio, as user can do everything in the script codebase and can seamlessly c/p script between local and Studio.
Example of script with inline metadata:
Example of parsing: