Add support for ZZZ in parse_datetime #11312

kevinwilfong · 2024-10-21T19:55:26Z

Summary:
This diff adds support for JODA's ZZZ pattern in Presto's parse_datetime function.

This is used to parse time zone IDs (called "time zone names" in the tz library, but this
means something else in JODA).

I borrowed the algorithm from JODA to ensure it matches Presto Java's behavior. The
idea is to greedily consume the longest substring that matches a known time zone. I
borrowed their algorithm which is to break the set of known time zones into a list of
those without a prefix (without the '/' character) and lists of suffixes for those with
prefixes. This limits the number of strings that need to be compared. I modified it
slightly to pre-sort these lists by size descending, so we don't have to necessarily
compare every string, but can stop early if we find a match.

One other change is I added a get_time_zone_names function to our copy of the tz
library. I tried calling get_tzdb() from DateTimeFormatter directly and accessing its
zones member to get the names, but for some reason after get_tzdb() returns every
time_zone in zones (except the first one) has a string name_ that has nullptr for its
data after get_tzdb() returns. I spent a good amount of time trying to figure out why,
but couldn't figure it out, so I gave up and added this helper method (for whatever
reason everything is fine as long as it's done in the tz.cpp file). If anyone has pointers
as to what's going on I'm happy to investigate further, I'd much rather use the existing
get_tzdb function if I can.

Differential Revision: D64708598

facebook-github-bot · 2024-10-21T19:55:41Z

This pull request was exported from Phabricator. Differential Revision: D64708598

netlify · 2024-10-21T19:55:44Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`62b4516`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/671fc7af131b040008a52b42

Summary: This diff adds support for JODA's ZZZ pattern in Presto's parse_datetime function. This is used to parse time zone IDs (called "time zone names" in the tz library, but this means something else in JODA). I borrowed the algorithm from JODA to ensure it matches Presto Java's behavior. The idea is to greedily consume the longest substring that matches a known time zone. I borrowed their algorithm which is to break the set of known time zones into a list of those without a prefix (without the '/' character) and lists of suffixes for those with prefixes. This limits the number of strings that need to be compared. I modified it slightly to pre-sort these lists by size descending, so we don't have to necessarily compare every string, but can stop early if we find a match. One other change is I added a get_time_zone_names function to our copy of the tz library. I tried calling get_tzdb() from DateTimeFormatter directly and accessing its zones member to get the names, but for some reason after get_tzdb() returns every time_zone in zones (except the first one) has a string name_ that has nullptr for its data after get_tzdb() returns. I spent a good amount of time trying to figure out why, but couldn't figure it out, so I gave up and added this helper method (for whatever reason everything is fine as long as it's done in the tz.cpp file). If anyone has pointers as to what's going on I'm happy to investigate further, I'd much rather use the existing get_tzdb function if I can. Differential Revision: D64708598

facebook-github-bot · 2024-10-21T19:59:08Z

This pull request was exported from Phabricator. Differential Revision: D64708598

Summary: This diff adds support for JODA's ZZZ pattern in Presto's parse_datetime function. This is used to parse time zone IDs (called "time zone names" in the tz library, but this means something else in JODA). I borrowed the algorithm from JODA to ensure it matches Presto Java's behavior. The idea is to greedily consume the longest substring that matches a known time zone. I borrowed their algorithm which is to break the set of known time zones into a list of those without a prefix (without the '/' character) and lists of suffixes for those with prefixes. This limits the number of strings that need to be compared. I modified it slightly to pre-sort these lists by size descending, so we don't have to necessarily compare every string, but can stop early if we find a match. One other change is I added a get_time_zone_names function to our copy of the tz library. I tried calling get_tzdb() from DateTimeFormatter directly and accessing its zones member to get the names, but for some reason after get_tzdb() returns every time_zone in zones (except the first one) has a string name_ that has nullptr for its data after get_tzdb() returns. I spent a good amount of time trying to figure out why, but couldn't figure it out, so I gave up and added this helper method (for whatever reason everything is fine as long as it's done in the tz.cpp file). If anyone has pointers as to what's going on I'm happy to investigate further, I'd much rather use the existing get_tzdb function if I can. Differential Revision: D64708598

facebook-github-bot · 2024-10-21T21:01:48Z

This pull request was exported from Phabricator. Differential Revision: D64708598

bikramSingh91

Looks good, just one small nit

bikramSingh91 · 2024-10-25T21:14:00Z

velox/functions/lib/DateTimeFormatter.cpp

+
+    // Try to find a time zone with a prefix that includes the speratorPoint.
+    if (separatorPoint != end) {
+      std::string prefix(cur, separatorPoint);


just curious, since both params are char*, which string constructor would this map to? is it this?

template< class InputIt > basic_string( InputIt first, InputIt last, const Allocator& alloc = Allocator() );

I think so, they could certainly act like iterators (honestly I just tried it with that in mind and hoped it'd work)

bikramSingh91 · 2024-10-25T21:24:25Z

velox/functions/prestosql/tests/DateTimeFunctionsTest.cpp

+  // Test an invalid time zone with a prefix that does appear.
+  VELOX_ASSERT_THROW(
+      parseDatetime(
+          "2024-02-25+06:00:99 America/XYZ", "yyyy-MM-dd+HH:mm:99 ZZZ"),


can you add a test case where there are extra strings after a valid timezone like:

2024-02-25+06:00:99 America/Los_Angeleszzz

That's what I was trying to cover in the tests in lines 2997-3009

…or#11283) Summary: The format_datetime currently outptus the time zone id if for 3 or fewer 'z' characters in the format string. However, the JODA library, which this is based on, does this for 3 or more 'Z' characters. This diff fixes this, as well as adds support for a single 'Z' (which outputs the same thing as 'ZZ' just without the colon). So 'Z' is fully supported for any number of characters. To be more explicit: https://www.joda.org/joda-time/apidocs/org/joda/time/format/DateTimeFormat.html From the JODA docs: `'Z' outputs offset without a colon (-0800), 'ZZ' outputs the offset with a colon(-08:00), 'ZZZ' or more outputs the zone id(America/Los_Angeles).` And not clearly explained in the docs, but from experimentation: `'z', 'zz', or 'zzz' outputs the abbreviation of the time zone(PDT), and 'zzzz' or more outputs the time zone name(Pacific Daylight Time)` Currently DateTimeFormatter treats 'zzzz' or more like JODA treats 'ZZZ' or more. This diff marks 'zzzz' or more as unsupported (we can implement that in a future change), and moves that logic under 'ZZZ' or more to be consistent. It also implements 'Z' (previously only 'ZZ' was implemented in DateTimeFormatter). Reviewed By: bikramSingh91 Differential Revision: D64500193

Summary: This diff adds support for JODA's ZZZ pattern in Presto's parse_datetime function. This is used to parse time zone IDs (called "time zone names" in the tz library, but this means something else in JODA). I borrowed the algorithm from JODA to ensure it matches Presto Java's behavior. The idea is to greedily consume the longest substring that matches a known time zone. I borrowed their algorithm which is to break the set of known time zones into a list of those without a prefix (without the '/' character) and lists of suffixes for those with prefixes. This limits the number of strings that need to be compared. I modified it slightly to pre-sort these lists by size descending, so we don't have to necessarily compare every string, but can stop early if we find a match. One other change is I added a get_time_zone_names function to our copy of the tz library. I tried calling get_tzdb() from DateTimeFormatter directly and accessing its zones member to get the names, but for some reason after get_tzdb() returns every time_zone in zones (except the first one) has a string name_ that has nullptr for its data after get_tzdb() returns. I spent a good amount of time trying to figure out why, but couldn't figure it out, so I gave up and added this helper method (for whatever reason everything is fine as long as it's done in the tz.cpp file). If anyone has pointers as to what's going on I'm happy to investigate further, I'd much rather use the existing get_tzdb function if I can. Reviewed By: bikramSingh91 Differential Revision: D64708598

facebook-github-bot · 2024-10-28T17:19:57Z

This pull request was exported from Phabricator. Differential Revision: D64708598

Summary: This diff adds support for JODA's ZZZ pattern in Presto's parse_datetime function. This is used to parse time zone IDs (called "time zone names" in the tz library, but this means something else in JODA). I borrowed the algorithm from JODA to ensure it matches Presto Java's behavior. The idea is to greedily consume the longest substring that matches a known time zone. I borrowed their algorithm which is to break the set of known time zones into a list of those without a prefix (without the '/' character) and lists of suffixes for those with prefixes. This limits the number of strings that need to be compared. I modified it slightly to pre-sort these lists by size descending, so we don't have to necessarily compare every string, but can stop early if we find a match. One other change is I added a get_time_zone_names function to our copy of the tz library. I tried calling get_tzdb() from DateTimeFormatter directly and accessing its zones member to get the names, but for some reason after get_tzdb() returns every time_zone in zones (except the first one) has a string name_ that has nullptr for its data after get_tzdb() returns. I spent a good amount of time trying to figure out why, but couldn't figure it out, so I gave up and added this helper method (for whatever reason everything is fine as long as it's done in the tz.cpp file). If anyone has pointers as to what's going on I'm happy to investigate further, I'd much rather use the existing get_tzdb function if I can. Reviewed By: bikramSingh91 Differential Revision: D64708598

facebook-github-bot · 2024-10-29T19:54:50Z

This pull request has been merged in f6dbb3d.

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 21, 2024

facebook-github-bot added the fb-exported label Oct 21, 2024

kevinwilfong force-pushed the export-D64708598 branch from 555988b to 5a93516 Compare October 21, 2024 19:59

kevinwilfong force-pushed the export-D64708598 branch from 5a93516 to 6f6b53e Compare October 21, 2024 21:01

bikramSingh91 approved these changes Oct 25, 2024

View reviewed changes

Kevin Wilfong added 2 commits October 28, 2024 10:19

kevinwilfong force-pushed the export-D64708598 branch from 6f6b53e to 62b4516 Compare October 28, 2024 17:19

facebook-github-bot closed this in f6dbb3d Oct 29, 2024

facebook-github-bot added the Merged label Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for ZZZ in parse_datetime #11312

Add support for ZZZ in parse_datetime #11312

kevinwilfong commented Oct 21, 2024

facebook-github-bot commented Oct 21, 2024

netlify bot commented Oct 21, 2024 •

edited

Loading

facebook-github-bot commented Oct 21, 2024

facebook-github-bot commented Oct 21, 2024

bikramSingh91 left a comment

bikramSingh91 Oct 25, 2024

kevinwilfong Oct 25, 2024

bikramSingh91 Oct 25, 2024

kevinwilfong Oct 25, 2024

facebook-github-bot commented Oct 28, 2024

facebook-github-bot commented Oct 29, 2024

Add support for ZZZ in parse_datetime #11312

Add support for ZZZ in parse_datetime #11312

Conversation

kevinwilfong commented Oct 21, 2024

facebook-github-bot commented Oct 21, 2024

netlify bot commented Oct 21, 2024 • edited Loading

✅ Deploy Preview for meta-velox canceled.

facebook-github-bot commented Oct 21, 2024

facebook-github-bot commented Oct 21, 2024

bikramSingh91 left a comment

Choose a reason for hiding this comment

bikramSingh91 Oct 25, 2024

Choose a reason for hiding this comment

kevinwilfong Oct 25, 2024

Choose a reason for hiding this comment

bikramSingh91 Oct 25, 2024

Choose a reason for hiding this comment

kevinwilfong Oct 25, 2024

Choose a reason for hiding this comment

facebook-github-bot commented Oct 28, 2024

facebook-github-bot commented Oct 29, 2024

netlify bot commented Oct 21, 2024 •

edited

Loading