[FEA] TZiF reader to support timezone-aware operations #11592

Closed
shwina opened this issue Aug 24, 2022 · 3 comments
Labels
1 - On Deck To be worked on next cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@shwina
Contributor

shwina commented Aug 24, 2022

Related: #10047, #2477.

It is desirable to be able to do timezone-aware operations in cuDF.

One relatively simple approach is to load the IANA time zone database into cuDF as a table, and use existing algorithms to implement timezone-aware operations. This works quite well (see "Additional Context" below).

Unfortunately, the timezone database is typically not distributed or available in a format that libcudf can consume. It is usually distributed and present on user systems as a collection of TZiF files.

It would be great if we had a way to read TZiF files into cudf/libcudf. The first pass at this doesn't even need to be GPU-accelerated.
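To make the request concrete, here is a minimal, CPU-only Python sketch of reading the version-1 data block of a TZiF file, following the layout described in RFC 8536. It is not the libcudf reader; the names `parse_tzif_v1` and `TZifData` are hypothetical. As a simplifying assumption it ignores the 64-bit block that version 2+ files carry, along with the leap-second and standard/UT indicator records.

```python
import struct
from typing import List, NamedTuple


class TZifData(NamedTuple):
    """Hypothetical container: one row per transition, ready to load as a table."""
    transitions: List[int]  # UTC transition times, seconds since the Unix epoch
    utc_offsets: List[int]  # UTC offset (seconds) in effect from each transition
    is_dst: List[int]       # 1 if that offset is daylight saving time, else 0


def parse_tzif_v1(data: bytes) -> TZifData:
    """Parse only the version-1 (32-bit) block of a TZiF file."""
    if data[:4] != b"TZif":
        raise ValueError("missing 'TZif' magic")
    # Header: magic (4), version (1), reserved (15), then six big-endian uint32 counts.
    isutcnt, isstdcnt, leapcnt, timecnt, typecnt, charcnt = struct.unpack(
        ">6L", data[20:44]
    )
    pos = 44
    # timecnt transition times, each a big-endian signed 32-bit integer.
    transitions = list(struct.unpack(f">{timecnt}l", data[pos:pos + 4 * timecnt]))
    pos += 4 * timecnt
    # timecnt one-byte indices into the local time type records.
    type_indices = data[pos:pos + timecnt]
    pos += timecnt
    # typecnt records of (int32 utoff, uint8 isdst, uint8 abbreviation index).
    ttinfo = [
        struct.unpack(">lBB", data[pos + 6 * i:pos + 6 * i + 6])
        for i in range(typecnt)
    ]
    utc_offsets = [ttinfo[t][0] for t in type_indices]
    is_dst = [ttinfo[t][1] for t in type_indices]
    return TZifData(transitions, utc_offsets, is_dst)
```

Assuming the zoneinfo files live under `/usr/share/zoneinfo` (common on Linux, but not guaranteed), this could be called on the bytes of `/usr/share/zoneinfo/America/New_York`. The three returned lists correspond roughly to the `time_start`, `gmt_offset`, and `dst` columns of a per-zone table.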

Additional context

To prototype this approach, I used a .csv version of the tzdb distributed by the third-party website timezonedb.com. It looks something like this:

             zone_name                    time_start  gmt_offset  dst
0     America/New_York 1883-11-18 16:59:59.000000000      -17762    0
1     America/New_York 1883-11-18 17:00:00.000000000      -18000    0
2     America/New_York 1918-03-31 07:00:00.000000000      -14400    1
3     America/New_York 1918-10-27 06:00:00.000000000      -18000    0
4     America/New_York 1919-03-30 07:00:00.000000000      -14400    1
...                ...                           ...         ...  ...
1156  America/New_York 1913-04-16 06:25:26.290448384      -18000    0
1157  America/New_York 1913-08-20 07:25:26.290448384      -14400    1
1158  America/New_York 1914-04-15 06:25:26.290448384      -18000    0
1159  America/New_York 1914-08-19 07:25:26.290448384      -14400    1
1160  America/New_York 1915-04-14 06:25:26.290448384      -18000    0

With the tzdb loaded into a table:

  • tz_convert, or converting to/from UTC and between timezones, is a binary search (cudf::upper_bound / cudf::lower_bound)
  • tz_localize, or identifying nonexistent/ambiguous timestamps, is a binning problem (cudf::label_bins)
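The two operations above can be sketched on the CPU with plain Python. This is an illustration only, not the linked implementation: the transition table holds made-up values rather than real tzdb entries, and the function names are hypothetical. The conversion uses a binary search as described. For localization, the sketch swaps in a simpler check than binning: it tries each distinct offset and counts how many UTC instants map back to the given wall time (zero means nonexistent, two means ambiguous).

```python
from bisect import bisect_right

# Toy table for a single zone (illustrative values, NOT real tzdb data).
# TRANSITIONS[i] is a UTC time in seconds; OFFSETS[i] applies from then on.
TRANSITIONS = [0, 10_000, 20_000]
OFFSETS = [-18_000, -14_400, -18_000]


def utc_to_local(utc_ts):
    """tz_convert direction: find the last transition <= utc_ts by binary search."""
    i = bisect_right(TRANSITIONS, utc_ts) - 1
    if i < 0:
        raise ValueError("timestamp precedes the first transition in the table")
    return utc_ts + OFFSETS[i]


def local_to_utc(wall_ts):
    """tz_localize direction: return the UTC time, or None if the wall time
    is nonexistent (no valid instant) or ambiguous (two valid instants)."""
    candidates = set()
    for offset in set(OFFSETS):
        utc_ts = wall_ts - offset
        i = bisect_right(TRANSITIONS, utc_ts) - 1
        # Keep the candidate only if this offset is really in effect at utc_ts.
        if i >= 0 and OFFSETS[i] == offset:
            candidates.add(utc_ts)
    return candidates.pop() if len(candidates) == 1 else None
```

Returning `None` here plays the role of `ambiguous="NaT"` and `nonexistent="NaT"`. A columnar version would replace the per-element `bisect_right` call with one vectorized search over the whole column.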

You can see my implementations of these operations here.

Perf comparison with Pandas:

# skipping some details in the code below:
import pandas as pd
import cudf

psr = pd.Series(pd.date_range("1970-01-01", "2022-01-01", freq="1T"))
sr = cudf.from_pandas(psr)

%timeit psr.dt.tz_localize("America/New_York", ambiguous="NaT", nonexistent="NaT")
2.11 s ± 779 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit sr.dt.tz_localize("America/New_York", ambiguous="NaT", nonexistent="NaT")
31.4 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit psr.dt.hour
2.12 s ± 8.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit sr.dt.hour
1.58 ms ± 9.46 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
@shwina shwina added feature request New feature or request Needs Triage Need team to review and classify labels Aug 24, 2022
@shwina shwina added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Aug 24, 2022
@shwina
Contributor Author

shwina commented Aug 26, 2022

After a conversation with @bdice:

As it turns out, we already have a TZiF reader in https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/io/orc/timezone.cpp. It needs a bit of refactoring to return a cudf::table, but nothing major. I'll try to use this with the tz-experiment branch I've been prototyping with and report back here.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball GregoryKimball added cuIO cuIO issue 1 - On Deck To be worked on next and removed inactive-30d labels Jan 6, 2023
@GregoryKimball GregoryKimball moved this to Todo in libcudf Jan 6, 2023
@GregoryKimball GregoryKimball moved this from 23.02 to In Progress in libcudf Jan 6, 2023
@GregoryKimball GregoryKimball removed the status in libcudf Jan 9, 2023
@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Jan 9, 2023
@shwina
Contributor Author

shwina commented Feb 21, 2023

Closing in favor of #12813

@shwina shwina closed this as completed Feb 21, 2023
@GregoryKimball GregoryKimball removed the status in libcudf Apr 3, 2023
@GregoryKimball GregoryKimball removed this from libcudf Jun 28, 2023