Parse uncertainties #1615
Conversation
Enable Pint to consume uncertain quantities. Signed-off-by: [email protected]
Signed-off-by: MichaelTiemann <[email protected]>
Related: hgrecco/pint-pandas#140
Signed-off-by: MichaelTiemann <[email protected]>
Signed-off-by: [email protected]
First, I completely agree with the need of parsing measurements in Pint. I have been looking for this for a while. So thanks, and awesome work.

Having said this, I am not sure about the implementation. Why should parsing `8 +/- 4` be different from parsing `8 ** 4`? In both cases, you have a couple of numbers "connected" by a multichar operator. I guess that the difference is that `+/-` is not a valid Python operator. Take a look at this:

```
>>> from pint.compat import tokenizer
>>> for p in tokenizer("8**4"):
...     print(p)
...
TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8**4')
TokenInfo(type=54 (OP), string='**', start=(1, 1), end=(1, 3), line='8**4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 3), end=(1, 4), line='8**4')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 4), end=(1, 5), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
```

So you changed the tokenizer, but in that process it became more difficult to understand, maintain and debug. I wonder if a more sensible approach would not be to add another layer on top of the tokenizer that scans for `+/-`.

To understand my proposal, let me show you the output of the current tokenizer:

```
>>> for p in tokenizer("8 + / - 4"):
...     print(p)
...
TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8 + / - 4')
TokenInfo(type=54 (OP), string='+', start=(1, 2), end=(1, 3), line='8 + / - 4')
TokenInfo(type=54 (OP), string='/', start=(1, 4), end=(1, 5), line='8 + / - 4')
TokenInfo(type=54 (OP), string='-', start=(1, 6), end=(1, 7), line='8 + / - 4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 8), end=(1, 9), line='8 + / - 4')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 9), end=(1, 10), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
```

If we can create a function that transforms the previous list into:

```
>>> for p in tokenizer("8 + / - 4"):
...     print(p)
...
TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8 + / - 4')
TokenInfo(type=<NEW NUMBER> (OP), string='+ / -', start=(1, 2), end=(1, 7), line='8 + / - 4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 8), end=(1, 9), line='8 + / - 4')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 9), end=(1, 10), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
```

(this could also handle the parentheses) then the new tokenizer will just be:

```
def tokenizer(s):
    return uncertainty_layer(old_tokenizer(s))
```

(the names are crap, but just as an example) So it is easy to debug and turn on/off if necessary. If you go this way, I think it will be wise to move the relevant code accordingly.
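Concretely, the proposed layer could look something like the sketch below. The names (`uncertainty_layer`, `old_tokenizer`) are the hypothetical ones from the comment, and Python's stdlib `tokenize` stands in for Pint's internal tokenizer:

```python
import io
import tokenize
from tokenize import TokenInfo, OP


def old_tokenizer(s):
    """Plain stdlib tokenizer, standing in for Pint's default tokenizer."""
    return tokenize.generate_tokens(io.StringIO(s).readline)


def uncertainty_layer(tokens):
    """Merge the three consecutive OP tokens '+', '/', '-' into one '+/-' token."""
    tokens = list(tokens)
    i = 0
    while i < len(tokens):
        if (
            i + 2 < len(tokens)
            and [t.type for t in tokens[i : i + 3]] == [OP, OP, OP]
            and [t.string for t in tokens[i : i + 3]] == ["+", "/", "-"]
        ):
            first, last = tokens[i], tokens[i + 2]
            # One synthetic OP token spanning the original three.
            yield TokenInfo(OP, "+/-", first.start, last.end, first.line)
            i += 3
        else:
            yield tokens[i]
            i += 1


def tokenizer(s):
    return uncertainty_layer(old_tokenizer(s))
```

Because the layer matches token strings rather than raw text, it handles both `8 +/- 4` and the spaced-out `8 + / - 4` identically, and it can be switched off by simply not wrapping `old_tokenizer`.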
I also think we need to clearly define which uncertainties formats we are going to support (and which we are not).
Thank you so much for giving it a look.

On the choice of creating a tighter binding than `**`, consider `x ** 2+/-0 ** (1/2)`. If `+/-` binds the same as `**`, it will fail when there's a string of tokens that all have the same precedence (but different associativity).

I don't disagree with you at all that there are better ways to do this. What I wanted to do first was a proof-of-concept, to see whether the bones in Pint and Pint-Pandas were ready for this sort of thing. And I think they are. My principal concern in modifying the tokenizer was out of respect for the fact that Pint does talk to itself using str() and repr(), and as such we need to be prepared to handle whatever sorts of formats users may throw at us (including the N(S) notation and parenthesized ± as a unicode symbol). Please keep the thoughts coming!
I am not sure that I am proposing the same thing. Consider parsing `8 + / - 4`:

```
TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8 + / - 4')
TokenInfo(type=54 (OP), string='+', start=(1, 2), end=(1, 3), line='8 + / - 4')
TokenInfo(type=54 (OP), string='/', start=(1, 4), end=(1, 5), line='8 + / - 4')
TokenInfo(type=54 (OP), string='-', start=(1, 6), end=(1, 7), line='8 + / - 4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 8), end=(1, 9), line='8 + / - 4')
```

which is transformed to:

```
TokenInfo(type=2 (NUMBER), string='8', start=(1, 0), end=(1, 1), line='8 + / - 4')
TokenInfo(type=<NEW NUMBER> (OP), string='+ / -', start=(1, 2), end=(1, 7), line='8 + / - 4')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 8), end=(1, 9), line='8 + / - 4')
```

which is then converted by `ufloat_fromstr("8 +/- 4")`. This will automatically support every format that `ufloat_fromstr` supports. I think we should limit ourselves to those (at least for the time being).

Consider now parsing `(8 +/- 4) m`:

```
TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='(8 +/- 4) m')
TokenInfo(type=2 (NUMBER), string='8', start=(1, 1), end=(1, 2), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string='+', start=(1, 3), end=(1, 4), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string='/', start=(1, 4), end=(1, 5), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string='-', start=(1, 5), end=(1, 6), line='(8 +/- 4) m')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 7), end=(1, 8), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string=')', start=(1, 8), end=(1, 9), line='(8 +/- 4) m')
TokenInfo(type=1 (NAME), string='m', start=(1, 10), end=(1, 11), line='(8 +/- 4) m')
```

which will then be transformed to:

```
TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='(8 +/- 4) m')
TokenInfo(type=2 (NUMBER), string='8', start=(1, 1), end=(1, 2), line='(8 +/- 4) m')
TokenInfo(type=<NEW NUMBER> (OP), string='+ / -', start=(1, 3), end=(1, 6), line='(8 +/- 4) m')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 7), end=(1, 8), line='(8 +/- 4) m')
TokenInfo(type=54 (OP), string=')', start=(1, 8), end=(1, 9), line='(8 +/- 4) m')
TokenInfo(type=1 (NAME), string='m', start=(1, 10), end=(1, 11), line='(8 +/- 4) m')
```

which is then converted by `ufloat_fromstr("8 +/- 4") * ureg.meter` (I am 'paraphrasing' here).

Please note that Pint supports more numerical types than uncertainties. We need to make this clear in the docs.
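To make the "which formats do we support" question concrete, here is a stdlib-only sketch (not the real `uncertainties.ufloat_fromstr`; the function name and the exact set of accepted notations are illustrative) of parsing a few of the shorthands under discussion into a `(nominal, std_dev)` pair:

```python
import re


def parse_uncertainty(text):
    """Parse a few common uncertainty notations into (nominal, std_dev).

    Supports '8 +/- 4', '8 ± 4', and the concise '1.23(4)' notation,
    where the parenthesized digits apply to the last decimal places.
    """
    text = text.strip()

    # Explicit operator form: '8 +/- 4' or '8 ± 4'.
    m = re.fullmatch(r"(?P<nom>[^(±]+?)\s*(?:\+/-|±)\s*(?P<err>.+)", text)
    if m:
        return float(m.group("nom")), float(m.group("err"))

    # Concise N(S) form: the digits in parentheses scale with the
    # last decimal place of the nominal value.
    m = re.fullmatch(r"(?P<nom>-?\d+(?:\.(?P<frac>\d+))?)\((?P<err>\d+)\)", text)
    if m:
        frac = m.group("frac") or ""
        scale = 10.0 ** -len(frac)
        return float(m.group("nom")), int(m.group("err")) * scale

    raise ValueError(f"unrecognized uncertainty format: {text!r}")
```

The point of the sketch is the same as the comment above: once the token layer has normalized the input, a single well-defined string form can be handed off to one conversion function, and the supported notations are enumerable.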
I have returned from a long trip and am ready to tackle this now. I like the proposed approach (of creating a layer that can be enabled and disabled) and will give it a go. However, this idea, as I understand it, is a no-go, because OP is (for better or worse) number 54. To give this string a new number would mean it's no longer an OP, but something entirely different, which would mean we'd need to special-case its handling as something-other-than-OP. Would it be your preference to insert a new kind of OP (such as NUMBER_CONSTRUCTOR)? Or, given that OP is OP is OP, leave that alone and focus only on implementing a pre-processor as a general function and then parsing and evaluating the results of the preprocessor separately?

But also, before I go too far, I have to point out that a very common format to get back from Pint (when dealing with large quantities) puts the exponent outside the parentheses. This creates the problem of deciding when and how to distribute the factor of a million (e6) so as to end up with the right result. We cannot simply process what's between the parentheses; rather, there's some additional look-ahead and processing necessary. So I'm looking at that.

I should add that I'm a rank novice here, so I will leave it to others to flip the magic switches so that test_umath and friends can, if they want to, test the actual mechanics of uncertainties within Pint. Hope that's OK!
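The exponent-distribution problem described above can be sketched with stdlib regexes alone. This is an illustration, not the PR's actual implementation; the function name is hypothetical, and the example string `(5.01+/-0.07)e+04` comes from a later commit message in this thread:

```python
import re


def split_exponent(text):
    """Split '(5.01+/-0.07)e+04' into (nominal, std_dev), distributing
    any trailing exponent over both the value and its uncertainty."""
    m = re.fullmatch(
        r"\(\s*(?P<nom>[-+0-9.]+)\s*\+/-\s*(?P<err>[0-9.]+)\s*\)"
        r"(?:[eE](?P<exp>[-+]?\d+))?",  # exponent sits OUTSIDE the parens
        text.strip(),
    )
    if m is None:
        raise ValueError(f"cannot parse {text!r}")
    factor = 10.0 ** int(m.group("exp") or 0)
    return float(m.group("nom")) * factor, float(m.group("err")) * factor
```

This shows why processing only what's between the parentheses is not enough: the parser has to look ahead past the closing `)` for an `e±NN` suffix before it can produce the final numbers.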
Based on feedback, tokenize uncertainties on top of default tokenizer, not instead of default tokenizer. Signed-off-by: MichaelTiemann <[email protected]>
The original pint tokenizer is
I'm very open to changing names and/or mechanics of how the tokenizer should be selected. I've not yet started to flesh out some better test cases for test_measurement.py. I'd like to get feedback on these changes first.
pint/pint_eval.py (outdated):

```
from .errors import DefinitionSyntaxError

# For controlling order of operations
_OP_PRIORITY = {
    "±": 4,
```

This line can go away: we change ± to +/- to avoid gratuitous problems with ERRORTOKEN.
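For context on the ERRORTOKEN problem (a stdlib demonstration, not Pint's tokenizer): Python's tokenizer has no token class for `±`, so the Unicode spelling either surfaces as an ERRORTOKEN or, on newer Python versions, fails to tokenize at all, while the ASCII `+/-` splits into three ordinary OP tokens:

```python
import io
import token
import tokenize


def tokenizes_cleanly(source):
    """True when every character of `source` is classified as a real token.

    Depending on the Python version, an unrecognized character such as
    '±' either shows up as an ERRORTOKEN or raises outright; both cases
    count as "not clean" here.
    """
    try:
        toks = list(tokenize.generate_tokens(io.StringIO(source).readline))
    except (tokenize.TokenError, SyntaxError):
        return False
    return all(t.type != token.ERRORTOKEN for t in toks)
```

This is the practical reason for normalizing `±` to `+/-` before tokenizing, rather than teaching the tokenizer a new character.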
Any comments on this PR? I think I made the changes that were requested.
I'm still waiting, and my project is still waiting, to know if this is ready to enter final approach. We'd really like to add uncertainties to our pint-based work, but are loath to ask users to install a completely custom version of pint. If there are other changes you would like to see, I would be happy to receive feedback!
pint/pint_eval.py (outdated):

```
_BINARY_OPERATOR_MAP = {
    "±": _ufloat,
```

The line `"±": _ufloat,` can probably also go away, as above.
```
    if tokinfo.type != tokenlib.ENCODING:
        yield tokinfo


def uncertainty_tokenizer(input_string):
```

Would `uncertainties_tokenizer` be a better name?
Signed-off-by: Michael Tiemann <[email protected]>
Fixes problems parsing currency symbols that also show up when dealing with uncertainties. Signed-off-by: Michael Tiemann <[email protected]>
Handle negative numbers using uncertainty parenthesis notation. Signed-off-by: Michael Tiemann <[email protected]>
Ahem...use walrus operator for side-effect, not truth value. Signed-off-by: Michael Tiemann <[email protected]>
… of the exponent, not just in the parsing of the exponent. i.e., (5.01+/-0.07)e+04 Signed-off-by: Michael Tiemann <[email protected]>
Can somebody please review?
While trying to understand how to fix #1646 I pulled this branch and ran the tests.

Before parse-uncertainties:

After parse-uncertainties:

pint-convert is unfortunately still broken (I rebased, so no surprise):
In parallel with Pint and Pint-Pandas, changes to support uncertainties as an EA Dtype. See hgrecco/pint#1615 and hgrecco/pint-pandas#140 Signed-off-by: Michael Tiemann <[email protected]>
With these changes (and pandas-dev/pandas#53970 and hgrecco/pint#1615) the test suite passes or xpasses everything (no failures or errors). Indeed, the uncertainties code has essentially doubled the scope of the test suite (to test with and without it).

The biggest gotcha is that the EA for complex numbers is not compatible with the EA for uncertainties, due to incompatible hacks: the hack for complex numbers is to use np.nan (which is, technically, a complex number) for na_value across all numeric types. But that doesn't work for uncertainties, because uncertainties doesn't accept np.nan as an uncertain value. The hack for uncertainties is to use pd.NA for na_value. This works for Int64, Float64, and uncertainties, but doesn't work for complex (which cannot tolerate NAType).

Some careful subclassing fills in what doesn't easily work, with fixtures to prevent the improper mixing of complex and uncertainty types in the same python environment. Happy to discuss!

Signed-off-by: Michael Tiemann <[email protected]>
Shall we merge?
Yes. But any idea why the test suite is not running for this PR?
I don't know how the CI/CD works. I certainly haven't tried to do anything to evade it. Does anybody else know?
As far as I can tell, for PRs of new contributors (as far as the commit history is concerned) the maintainers have to manually confirm CI every time there are new commits. This is configurable (we have switched it off in xarray), but I think github's intention is to by default avoid running CI with potentially malicious changes (like cryptocurrency mining) before having a maintainer review those changes.

It is strange, though, that it doesn't tell you that. Maybe try closing / reopening the PR?
Closing PR to reopen it (attempting soft reboot of CI/CD).

Reopening PR to attempt soft reboot of CI/CD.
It looks like there's a problem with a readthedocs key on Pint's side.
No, it's just the configuration. After that is merged, you also wouldn't have to wait on maintainer approval for CI to run on this PR.
Now it seems to be running. If all tests pass, I will merge right away!
CodSpeed Performance Report: merging #1615 will not alter performance.
Tests are failing. There seems to be an error due to a missing babel dependency for a test.
Here's hoping this fixes the CI/CD problem with test_1400. Signed-off-by: Michael Tiemann <[email protected]>
I'm taking a wild guess that because we now have a test case that depends on babel, we need babel as part of `testbase`.
Removing `system_packages: false` as suggested by @keewis Signed-off-by: Michael Tiemann <[email protected]>
Not my day: lint failed for a reason I don't understand/cannot reproduce: reserveCache failed: Cache already exists. Scope: refs/heads/parse-uncertainties, Key: [xxx], Version: 25d34710bdd63bed70a22b8204c3e1d0942dc237e098600e3896b4334c6d784f
It is better that you use the decorator that is provided for this purpose (pint/testsuite/helpers.py, line 125 in 2852f36). Just import it. Do not add babel as a requirement, as it will make all tests slower.
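The kind of decorator being referred to can be sketched with the stdlib alone. This is an illustrative stand-in, not the actual helper in pint/testsuite/helpers.py; the name `requires_package` and its details are hypothetical:

```python
import importlib.util
import unittest


def requires_package(name):
    """Skip a test unless the optional package `name` is importable.

    A stdlib sketch in the spirit of the helper in
    pint/testsuite/helpers.py; the real decorator differs in detail.
    """
    available = importlib.util.find_spec(name) is not None
    return unittest.skipUnless(available, f"requires the '{name}' package")
```

A test decorated with `@requires_package("babel")` is skipped, not failed, on environments without babel, which is exactly why babel never needs to become a hard test dependency.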
Thanks, I'll fix that. There are some new failures I need to fix, which I will.
Fix isnan to use unp.isnan as appropriate for both duck_array_type and objects of UFloat types. Fix a minor typo in pint/facets/__init__.py comment. In test_issue_1400, use decorators to ensure babel library is loaded when needed. pyproject.toml: revert change to testbase; we fixed with decorators instead. Signed-off-by: Michael Tiemann <[email protected]>
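The isnan fix in the commit above is about dispatching by type. Here is a stdlib sketch of the idea; `FakeUFloat` and `smart_isnan` are illustrative names, and the real fix dispatches to `unp.isnan` from the uncertainties package rather than `math.isnan`:

```python
import math


class FakeUFloat:
    """Minimal stand-in for an uncertainties UFloat object."""

    def __init__(self, nominal_value, std_dev):
        self.nominal_value = nominal_value
        self.std_dev = std_dev


def smart_isnan(obj):
    """Duck-typed NaN check in the spirit of the fix above.

    UFloat-like objects are checked via their nominal value; plain
    numbers via math.isnan; anything non-numeric is simply not NaN.
    """
    if hasattr(obj, "nominal_value"):
        return math.isnan(obj.nominal_value)
    try:
        return math.isnan(obj)
    except TypeError:
        return False
```

The design point is that a single `math.isnan` (or `np.isnan`) call cannot serve both plain floats and UFloat objects, so the check has to branch on the duck type first.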
Looks good! Shall we merge, @MichaelTiemannOSC?
Yes, thanks 🙏
Congratulations and thanks for this excellent work!!
Executed `pre-commit run --all-files` with no errors.

I'd like to discuss and review before committing to "completely cover" all automated test units. I have contributed two test cases that clearly show the filed issues working with the changes. I have run the full test suite, and the one unexpected failure (#783) I cannot explain.

I'm also not quite sure how to approach the docs question. Pint documentation tends to be more on the minimal side, and I don't want to call unnecessary attention to things that should "just work".