Skip to content

Commit

Permalink
Add support for ZZZ in parse_datetime (facebookincubator#11312)
Browse files Browse the repository at this point in the history
Summary:

This diff adds support for JODA's ZZZ pattern in Presto's parse_datetime function.

This is used to parse time zone IDs (called "time zone names" in the tz library, but this 
means something else in JODA).

I borrowed the algorithm from JODA to ensure it matches Presto Java's behavior.  The
idea is to greedily consume the longest substring that matches a known time zone.  I
borrowed their algorithm which is to break the set of known time zones into a list of
those without a prefix (without the '/' character) and lists of suffixes for those with
prefixes.  This limits the number of strings that need to be compared.  I modified it
slightly to pre-sort these lists by size descending, so we don't have to necessarily 
compare every string, but can stop early if we find a match.

One other change is I added a get_time_zone_names function to our copy of the tz
library.  I tried calling get_tzdb() from DateTimeFormatter directly and accessing its
zones member to get the names, but for some reason after get_tzdb() returns every
time_zone in zones (except the first one) has a string name_ that has nullptr for its
data after get_tzdb() returns. I spent a good amount of time trying to figure out why,
but couldn't figure it out, so I gave up and added this helper method (for whatever
reason everything is fine as long as it's done in the tz.cpp file).  If anyone has pointers
as to what's going on I'm happy to investigate further, I'd much rather use the existing
get_tzdb function if I can.

Reviewed By: bikramSingh91

Differential Revision: D64708598
  • Loading branch information
Kevin Wilfong authored and facebook-github-bot committed Oct 29, 2024
1 parent 54bf880 commit 5e05545
Show file tree
Hide file tree
Showing 5 changed files with 217 additions and 2 deletions.
30 changes: 30 additions & 0 deletions velox/external/date/patches/0006-add_get_time_zone_names.patch
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
diff --git a/velox/external/date/tz.cpp b/velox/external/date/tz.cpp
--- a/velox/external/date/tz.cpp
+++ b/velox/external/date/tz.cpp
@@ -3538,6 +3538,14 @@
return get_tzdb_list().front();
}

+std::vector<std::string> get_time_zone_names() {
+ std::vector<std::string> result;
+ for (const auto& z : get_tzdb().zones) {
+ result.push_back(z.name());
+ }
+ return result;
+}
+
const time_zone*
#if HAS_STRING_VIEW
tzdb::locate_zone(std::string_view tz_name) const
diff --git a/velox/external/date/tz.h b/velox/external/date/tz.h
--- a/velox/external/date/tz.h
+++ b/velox/external/date/tz.h
@@ -1258,6 +1258,8 @@

DATE_API const tzdb& get_tzdb();

+std::vector<std::string> get_time_zone_names();
+
class tzdb_list
{
std::atomic<tzdb*> head_{nullptr};
8 changes: 8 additions & 0 deletions velox/external/date/tz.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3538,6 +3538,14 @@ get_tzdb()
return get_tzdb_list().front();
}

std::vector<std::string> get_time_zone_names() {
std::vector<std::string> result;
for (const auto& z : get_tzdb().zones) {
result.push_back(z.name());
}
return result;
}

const time_zone*
#if HAS_STRING_VIEW
tzdb::locate_zone(std::string_view tz_name) const
Expand Down
2 changes: 2 additions & 0 deletions velox/external/date/tz.h
Original file line number Diff line number Diff line change
Expand Up @@ -1258,6 +1258,8 @@ operator<<(std::ostream& os, const tzdb& db);

DATE_API const tzdb& get_tzdb();

std::vector<std::string> get_time_zone_names();

class tzdb_list
{
std::atomic<tzdb*> head_{nullptr};
Expand Down
135 changes: 134 additions & 1 deletion velox/functions/lib/DateTimeFormatter.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -351,6 +351,133 @@ int64_t parseTimezone(const char* cur, const char* end, Date& date) {
return -1;
}

// Contains a list of all time zone names in a convenient format for searching.
//
// Time zone names without the '/' character (without a prefix) are stored in
// timeZoneNamesWithoutPrefix ordered by size desc.
//
// Time zone names with the '/' character (with a prefix) are stored in a map
// timeZoneNamePrefixMap from prefix (the string before the first '/') to a
// vector of strings which contains the suffixes (the strings after the first
// '/') ordered by size desc.
struct TimeZoneNameMappings {
std::vector<std::string> timeZoneNamesWithoutPrefix;
std::unordered_map<std::string, std::vector<std::string>>
timeZoneNamePrefixMap;
};

TimeZoneNameMappings getTimeZoneNameMappings() {
// Here we use get_time_zone_names instead of calling get_tzdb and
// constructing the list ourselves because there is some unknown issue with
// the tz library where the time_zone objects after the first one in the tzdb
// will be invalid (contain nullptrs) after the get_tzdb function returns.
const std::vector<std::string> timeZoneNames = date::get_time_zone_names();

TimeZoneNameMappings result;
for (size_t i = 0; i < timeZoneNames.size(); i++) {
const auto& timeZoneName = timeZoneNames[i];
auto separatorPoint = timeZoneName.find('/');

if (separatorPoint == std::string::npos) {
result.timeZoneNamesWithoutPrefix.push_back(timeZoneName);
} else {
std::string prefix = timeZoneName.substr(0, separatorPoint);
std::string suffix = timeZoneName.substr(separatorPoint + 1);

result.timeZoneNamePrefixMap[prefix].push_back(suffix);
}
}

std::sort(
result.timeZoneNamesWithoutPrefix.begin(),
result.timeZoneNamesWithoutPrefix.end(),
[](const std::string& a, const std::string& b) {
return b.size() < a.size();
});

for (auto& [prefix, suffixes] : result.timeZoneNamePrefixMap) {
std::sort(
suffixes.begin(),
suffixes.end(),
[](const std::string& a, const std::string& b) {
return b.size() < a.size();
});
}

return result;
}

int64_t parseTimezoneName(const char* cur, const char* end, Date& date) {
// For time zone names we try to greedily find the longest substring starting
// from cur that is a valid time zone name. To help speed things along we
// treat time zone names as {prefix}/{suffix} (for the first instance of '/')
// and create lists of suffixes per prefix. We order these lists by length of
// the suffix so once we identify the prefix, we can return the first suffix
// we find in the string. We treat time zone names without a prefix (i.e.
// without a '/') separately but similarly.
static const TimeZoneNameMappings timeZoneNameMappings =
getTimeZoneNameMappings();

if (cur < end) {
// Find the first instance of '/' in the remainder of the string
const char* separatorPoint = cur;
while (separatorPoint < end && *separatorPoint != '/') {
++separatorPoint;
}

// Try to find a time zone with a prefix that includes the speratorPoint.
if (separatorPoint != end) {
std::string prefix(cur, separatorPoint);

auto it = timeZoneNameMappings.timeZoneNamePrefixMap.find(prefix);
if (it != timeZoneNameMappings.timeZoneNamePrefixMap.end()) {
// This is greedy, find the longest suffix for the given prefix that
// fits the string. We know the value in the map is already sorted by
// length in decreasing order.
for (const auto& suffixName : it->second) {
if (suffixName.size() <= end - separatorPoint - 1 &&
suffixName ==
std::string_view(separatorPoint + 1, suffixName.size())) {
auto timeZoneNameSize = prefix.size() + 1 + suffixName.size();
date.timezone =
tz::locateZone(std::string_view(cur, timeZoneNameSize), false);

if (!date.timezone) {
return -1;
}

return timeZoneNameSize;
}
}
}
}

// If we found a '/' but didn't find a match in the set of time zones with
// prefixes, try search before the '/' for a time zone without a prefix. If
// we didn't find a '/' then end already equals separatorPoint.
end = separatorPoint;

for (const auto& timeZoneName :
timeZoneNameMappings.timeZoneNamesWithoutPrefix) {
// Again, this is greedy, find the largest time zone name without a prefix
// that fits the string. We know timeZoneNamesWithoutPrefix is already
// sorted by length in decreasing order.
if (timeZoneName.size() <= end - cur &&
timeZoneName == std::string_view(cur, timeZoneName.size())) {
date.timezone = tz::locateZone(timeZoneName, false);

if (!date.timezone) {
return -1;
}

return timeZoneName.size();
}
}
}

return -1;
}

int64_t parseTimezoneOffset(const char* cur, const char* end, Date& date) {
// For timezone offset ids, there are three formats allowed by Joda:
//
Expand Down Expand Up @@ -689,7 +816,13 @@ int32_t parseFromPattern(
bool specifierNext,
DateTimeFormatterType type) {
if (curPattern.specifier == DateTimeFormatSpecifier::TIMEZONE_OFFSET_ID) {
auto size = parseTimezoneOffset(cur, end, date);
int64_t size;
if (curPattern.minRepresentDigits < 3) {
size = parseTimezoneOffset(cur, end, date);
} else {
size = parseTimezoneName(cur, end, date);
}

if (size == -1) {
return -1;
}
Expand Down
44 changes: 43 additions & 1 deletion velox/functions/prestosql/tests/DateTimeFunctionsTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2976,10 +2976,52 @@ TEST_F(DateTimeFunctionsTest, parseDatetime) {
ts, parseDatetime("2024-02-25+06:00:99 UTC", "yyyy-MM-dd+HH:mm:99 ZZZ"));
EXPECT_EQ(
ts, parseDatetime("2024-02-25+06:00:99 UTC", "yyyy-MM-dd+HH:mm:99 ZZZ"));

// Test a time zone with a prefix.
EXPECT_EQ(
TimestampWithTimezone(1708869600000, "America/Los_Angeles"),
parseDatetime(
"2024-02-25+06:00:99 America/Los_Angeles",
"yyyy-MM-dd+HH:mm:99 ZZZ"));
// Test a time zone with a prefix is greedy. Etc/GMT-1 and Etc/GMT-10 are both
// valid time zone names.
EXPECT_EQ(
TimestampWithTimezone(1708804800000, "Etc/GMT-10"),
parseDatetime(
"2024-02-25+06:00:99 Etc/GMT-10", "yyyy-MM-dd+HH:mm:99 ZZZ"));
// Test a time zone without a prefix is greedy. NZ and NZ-CHAT are both
// valid time zone names.
EXPECT_EQ(
TimestampWithTimezone(1708791300000, "NZ-CHAT"),
parseDatetime("2024-02-25+06:00:99 NZ-CHAT", "yyyy-MM-dd+HH:mm:99 ZZZ"));
// Test a time zone with a prefix can handle trailing data.
EXPECT_EQ(
TimestampWithTimezone(1708869600000, "America/Los_Angeles"),
parseDatetime(
"America/Los_Angeles2024-02-25+06:00:99", "ZZZyyyy-MM-dd+HH:mm:99"));
// Test a time zone without a prefix can handle trailing data.
EXPECT_EQ(
TimestampWithTimezone(1708840800000, "GMT"),
parseDatetime("GMT2024-02-25+06:00:99", "ZZZyyyy-MM-dd+HH:mm:99"));
// Test parsing can fall back to checking for time zones without a prefix when
// a '/' is present but not part of the time zone name.
EXPECT_EQ(
TimestampWithTimezone(1708840800000, "GMT"),
parseDatetime("GMT/2024-02-25+06:00:99", "ZZZ/yyyy-MM-dd+HH:mm:99"));

// Test an invalid time zone without a prefix. (zzz should be used to match
// abbreviations)
VELOX_ASSERT_THROW(
parseDatetime("2024-02-25+06:00:99 PST", "yyyy-MM-dd+HH:mm:99 ZZZ"),
"Invalid date format: '2024-02-25+06:00:99 PST'");
// Test an invalid time zone with a prefix that doesn't appear at all.
VELOX_ASSERT_THROW(
parseDatetime("2024-02-25+06:00:99 ABC/XYZ", "yyyy-MM-dd+HH:mm:99 ZZZ"),
"Invalid date format: '2024-02-25+06:00:99 ABC/XYZ'");
// Test an invalid time zone with a prefix that does appear.
VELOX_ASSERT_THROW(
parseDatetime(
"2024-02-25+06:00:99 America/XYZ", "yyyy-MM-dd+HH:mm:99 ZZZ"),
"Invalid date format: '2024-02-25+06:00:99 America/XYZ'");
}

TEST_F(DateTimeFunctionsTest, formatDateTime) {
Expand Down

0 comments on commit 5e05545

Please sign in to comment.