Dealing with netcdf time:units with unexpected characters #296
Comments
A PR that catches this, and adds more leeway in the formatting (consistent with CF) would be welcome.
@jswhit I am a little confused how the regex following the CF standards at https://github.com/Unidata/cftime/blob/master/src/cftime/_cftime.pyx#L45-L47 isn't catching this: it's an out-of-bounds character. I'm unfamiliar with this library, so I would need some pointers to get started.
Sorry, can't be of much help - as the comment says, this regex was lifted from http://delete.me.uk/2005/03/iso8601.html, but apparently that link no longer exists. I know almost nothing about regexes.
Hi - I've looked at this regex in the (dim and distant) past - happy to take a look here, if that's OK. The quick test below shows the regex passing in the "letter O" case, but has too many
>>> import re
>>> ISO8601_REGEX = re.compile(
...     r"(?P<year>[+-]?[0-9]+)(-(?P<month>[0-9]{1,2})(-(?P<day>[0-9]{1,2})"
...     r"(((?P<separator1>.)(?P<hour>[0-9]{1,2}):(?P<minute>[0-9]{1,2})(:(?P<second>[0-9]{1,2})(\.(?P<fraction>[0-9]+))?)?)?"
...     r"((?P<separator2>.?)(?P<timezone>Z|(([-+])([0-9]{2})((:([0-9]{2}))|([0-9]{2}))?)))?)?)?)?"
... )  # From https://github.com/Unidata/cftime/blob/v1.6.2rel/src/cftime/_cftime.pyx#L45-L48
>>> ISO8601_REGEX.match('2001-01-01').groups() # All numbers
('2001',
'-01-01',
'01',
'-01',
'01',
'',
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None)
>>> ISO8601_REGEX.match('20O1-01-01').groups() # Letter O
('20',
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None,
None)
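One possible reading of the two results above is that `re.match` only anchors at the start of the string, so for `'20O1-01-01'` it matches the leading `'20'` as the year and stops at the letter O. The sketch below makes that visible by showing what the match leaves unconsumed, and compares with `re.fullmatch`. The helper name `leftover` is hypothetical and not part of cftime; whether switching to `fullmatch` is the right fix for the library is an assumption that would need checking against its other accepted formats.

```python
import re

# Pattern copied verbatim from the comment above (cftime v1.6.2rel, _cftime.pyx).
ISO8601_REGEX = re.compile(
    r"(?P<year>[+-]?[0-9]+)(-(?P<month>[0-9]{1,2})(-(?P<day>[0-9]{1,2})"
    r"(((?P<separator1>.)(?P<hour>[0-9]{1,2}):(?P<minute>[0-9]{1,2})(:(?P<second>[0-9]{1,2})(\.(?P<fraction>[0-9]+))?)?)?"
    r"((?P<separator2>.?)(?P<timezone>Z|(([-+])([0-9]{2})((:([0-9]{2}))|([0-9]{2}))?)))?)?)?)?"
)


def leftover(datestring):
    """Hypothetical helper: return the tail that re.match() did not consume."""
    m = ISO8601_REGEX.match(datestring)
    return datestring if m is None else datestring[m.end():]


print(repr(leftover("2001-01-01")))  # '' (everything consumed)
print(repr(leftover("20O1-01-01")))  # 'O1-01-01' (match stopped at the letter O)

# fullmatch() requires the whole string to fit the pattern.
print(ISO8601_REGEX.fullmatch("2001-01-01") is not None)  # True
print(ISO8601_REGEX.fullmatch("20O1-01-01") is None)      # True
```

A non-empty leftover is a cheap signal that the date string contains something the pattern did not expect, and it could be quoted directly in an error message.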
Thanks @davidhassell! I hit a weird duplication issue in a simpler regex I was using: it validated correctly, but then seemed to cycle over things again and again. https://regex101.com/ was really helpful in pointing that out to me.
I've found a delightful edge case that is a little hard to believe. It involves a netcdf time:units that includes a character outside of the [0-9,-] range. If it's not obvious from the below, the issue is that the file has time:units = "days since 20O1-1-1", whereas this should be time:units = "days since 2001-1-1" (so replacing the rogue "O" (oooh) with the numeral "0" (zero)).
The file is a 297MiB file downloadable from here
Below is the example reproducing the error:
I wonder if a regex check would be useful to implement? This problem tripped me up for a while, and it was not at all obvious that an incorrect character (which looks almost identical, depending on fonts) was the root cause. Testing for a datestring that matches the regex r"(?:[0-9][0-9])?[0-9][0-9]-(?:[0-1])?[0-9]-(?:[0-3])?[0-9]" could be a useful test to catch such a fringe case - and point out the issue obviously in the error message. It seems in the CF Conventions docs that there is little leeway in this format, so using "/" or alternative MM-DD-YYYY formats to the standard [YYY]Y-[M]M-[D]D HH:MM:SS.ss [-]0:00
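As a rough sketch of the check suggested above, the snippet below applies the proposed pattern to the date part of a units string and names the offending characters in the error. The function name `check_reference_date`, the use of `fullmatch`, and the way the string is split on " since " are all assumptions made for illustration; this is not how cftime currently parses units. Note that the pattern as written accepts only two- or four-digit years, which is narrower than the [YYY]Y form mentioned above.

```python
import re

# Pattern as proposed in the issue text; applying it with fullmatch() is an assumption.
DATE_REGEX = re.compile(r"(?:[0-9][0-9])?[0-9][0-9]-(?:[0-1])?[0-9]-(?:[0-3])?[0-9]")


def check_reference_date(units):
    """Hypothetical helper: validate the date in a '<unit> since <date>' string."""
    _, sep, reference = units.partition(" since ")
    parts = reference.split()
    if not sep or not parts:
        raise ValueError(f"no reference date found in units {units!r}")
    date = parts[0]  # ignore any time-of-day or time zone that follows
    if DATE_REGEX.fullmatch(date) is None:
        # Collect characters that can never appear in a YYYY-MM-DD date.
        bad = sorted({c for c in date if c not in "0123456789-"})
        raise ValueError(
            f"invalid reference date {date!r} in units {units!r}; "
            f"unexpected characters: {bad}"
        )
    return date


print(check_reference_date("days since 2001-1-1"))  # 2001-1-1

try:
    check_reference_date("days since 20O1-1-1")
except ValueError as exc:
    print(exc)  # the message names the rogue 'O'
```

Listing the unexpected characters explicitly is the point of the exercise: a letter O and a zero can look identical in many fonts, so the error message has to spell out which character was rejected.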
And just because pydata/xarray#7144