[FEA] Add support for timezone-aware data to from_pandas #13611

Closed
charlesbluca opened this issue Jun 23, 2023 · 0 comments · Fixed by #15935
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request Python Affects Python cuDF API.

charlesbluca commented Jun 23, 2023

Is your feature request related to a problem? Please describe.
Support for timezone-aware datetimes is ongoing in #12813, and currently timezone-aware operations like tz_convert and tz_localize are supported.

However, when trying to load in a pandas dataframe with timezone-aware data using from_pandas, we still error with a message that implies that timezone-aware datetimes aren't yet supported:

import cudf
import pandas as pd

df = pd.DataFrame(
    {
        "d": pd.date_range(
            start="2014-08-01 09:00", freq="8H", periods=6, tz="UTC"
        ),
    }
)

cudf.from_pandas(df)
NotImplementedError                       Traceback (most recent call last)
Cell In[1], line 12
      2 import pandas as pd
      4 df = pd.DataFrame(
      5     {
      6         "d": pd.date_range(
   (...)
      9     }
     10 )
---> 12 cudf.from_pandas(df)

File /datasets/charlesb/micromamba/envs/dask-sql-gpuci-py39/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File /datasets/charlesb/micromamba/envs/dask-sql-gpuci-py39/lib/python3.9/site-packages/cudf/core/dataframe.py:7491, in from_pandas(obj, nan_as_null)
   7393 """
   7394 Convert certain Pandas objects into the cudf equivalent.
   7395 
   (...)
   7488 <class 'pandas.core.indexes.multi.MultiIndex'>
   7489 """
   7490 if isinstance(obj, pd.DataFrame):
-> 7491     return DataFrame.from_pandas(obj, nan_as_null=nan_as_null)
   7492 elif isinstance(obj, pd.Series):
   7493     return Series.from_pandas(obj, nan_as_null=nan_as_null)

File /datasets/charlesb/micromamba/envs/dask-sql-gpuci-py39/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File /datasets/charlesb/micromamba/envs/dask-sql-gpuci-py39/lib/python3.9/site-packages/cudf/core/dataframe.py:5119, in DataFrame.from_pandas(cls, dataframe, nan_as_null)
   5115 for col_name, col_value in dataframe.items():
   5116     # necessary because multi-index can return multiple
   5117     # columns for a single key
   5118     if len(col_value.shape) == 1:
-> 5119         data[col_name] = column.as_column(
   5120             col_value.array, nan_as_null=nan_as_null
   5121         )
   5122     else:
   5123         vals = col_value.values.T

File /datasets/charlesb/micromamba/envs/dask-sql-gpuci-py39/lib/python3.9/site-packages/cudf/core/column/column.py:2327, in as_column(arbitrary, nan_as_null, dtype, length)
   2317         if cudf.get_option(
   2318             "default_float_bitwidth"
   2319         ) and infer_dtype(arbitrary) in (
   2320             "floating",
   2321             "mixed-integer-float",
   2322         ):
   2323             pa_type = np_to_pa_dtype(
   2324                 _maybe_convert_to_default_type("float")
   2325             )
-> 2327     data = as_column(
   2328         pa.array(
   2329             arbitrary,
   2330             type=pa_type,
   2331             from_pandas=True
   2332             if nan_as_null is None
   2333             else nan_as_null,
   2334         ),
   2335         dtype=dtype,
   2336         nan_as_null=nan_as_null,
   2337     )
   2338 except (pa.ArrowInvalid, pa.ArrowTypeError, TypeError):
   2339     if is_categorical_dtype(dtype):

File /datasets/charlesb/micromamba/envs/dask-sql-gpuci-py39/lib/python3.9/site-packages/cudf/core/column/column.py:1974, in as_column(arbitrary, nan_as_null, dtype, length)
   1968 if isinstance(arbitrary, pa.lib.HalfFloatArray):
   1969     raise NotImplementedError(
   1970         "Type casting from `float16` to `float32` is not "
   1971         "yet supported in pyarrow, see: "
   1972         "https://issues.apache.org/jira/browse/ARROW-3802"
   1973     )
-> 1974 col = ColumnBase.from_arrow(arbitrary)
   1976 if isinstance(arbitrary, pa.NullArray):
   1977     new_dtype = cudf.dtype(arbitrary.type.to_pandas_dtype())

File /datasets/charlesb/micromamba/envs/dask-sql-gpuci-py39/lib/python3.9/site-packages/cudf/core/column/column.py:340, in ColumnBase.from_arrow(cls, array)
    334 data = pa.table([array], [None])
    336 if (
    337     isinstance(array.type, pa.TimestampType)
    338     and array.type.tz is not None
    339 ):
--> 340     raise NotImplementedError(
    341         "cuDF does not yet support timezone-aware datetimes"
    342     )
    343 if isinstance(array.type, pa.DictionaryType):
    344     indices_table = pa.table(
    345         {
    346             "None": pa.chunked_array(
   (...)
    350         }
    351     )

NotImplementedError: cuDF does not yet support timezone-aware datetimes

Describe the solution you'd like
Either add support for timezone-aware data to from_pandas, or update the error message to point users to workarounds they can use while support is in progress.

Describe alternatives you've considered
It is generally possible to move timezone-aware data from pandas to cuDF by recording the timezone information separately, converting the data to timezone-naive with dt.tz_localize(None), and then restoring the timezone once the data has been read in with from_pandas.
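
A minimal sketch of that workaround on the pandas side. The helper names `strip_timezones` and `restore_timezones` are hypothetical, not part of pandas or cuDF. The sketch also normalizes to UTC before dropping the timezone, which is an addition to the approach described above, so that restoring does not hit ambiguous wall-clock times around DST transitions.

```python
import pandas as pd


def strip_timezones(df):
    """Return a timezone-naive copy of df and a {column: timezone name} map.

    Hypothetical helper. Assumes unique column names and named timezones
    (e.g. "UTC", "America/New_York") whose string form pandas can parse.
    """
    naive = df.copy()
    timezones = {}
    for name in df.columns:
        dtype = df[name].dtype
        if isinstance(dtype, pd.DatetimeTZDtype):
            timezones[name] = str(dtype.tz)
            # Normalize to UTC first, then drop the timezone.
            naive[name] = df[name].dt.tz_convert("UTC").dt.tz_localize(None)
    return naive, timezones


def restore_timezones(df, timezones):
    """Reattach the timezones recorded by strip_timezones."""
    restored = df.copy()
    for name, tz in timezones.items():
        # The naive values are UTC instants, so localize as UTC and convert.
        restored[name] = df[name].dt.tz_localize("UTC").dt.tz_convert(tz)
    return restored
```

With cuDF, the intended flow would be `naive, tzs = strip_timezones(df)`, then `gdf = cudf.from_pandas(naive)`, then `restore_timezones(gdf, tzs)`. That last step assumes cuDF's `dt.tz_localize` and `dt.tz_convert` behave like their pandas counterparts for the timezones involved, which has not been verified here.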

@charlesbluca charlesbluca added feature request New feature or request Needs Triage Need team to review and classify labels Jun 23, 2023
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jul 22, 2023
rapids-bot bot pushed a commit that referenced this issue Jun 10, 2024
closes #13611

(Technically, this does not support pandas objects that have timezone-aware interval types.)

@rjzamora let me know if the test I adapted from your PR in #15929 is adequate

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #15935