
[BUG] TypeError: function is not supported for this dtype: size at getting-started-movielens #1103

Open
qrcodeTH opened this issue Aug 31, 2024 · 2 comments
@qrcodeTH
Bug description

I encountered an issue while running the code from the getting-started-movielens folder in my Kaggle notebook. The "Download and Convert" notebook completed successfully, but I ran into a problem during the "ETL with NVTabular" step.

Steps/Code to reproduce bug

  1. Set up the environment:
     I started by running the code from the "Download and Convert" notebook in my Kaggle notebook.
     I then continued with the code from the "ETL with NVTabular" notebook, which is where the issue arises.

  2. Modify the code:
     In the notebook, I updated the INPUT_DATA_DIR variable to point to the correct path in my Kaggle notebook. For example:

     ```python
     INPUT_DATA_DIR = '/kaggle/working/data'
     ```

     All other code remains unchanged.

  3. Run the notebook:
     I executed the notebook cells in sequence. The error occurs when running the following line; a rough sketch of the workflow code leading up to it is included below for context:

     ```python
     workflow.fit(train_dataset)
     ```

     The error message received is: `TypeError: function is not supported for this dtype: size`.
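
For context, the workflow being fitted is the one defined earlier in the ETL notebook. The following is only a rough sketch of that code, not the exact notebook contents; the ops, column names, file name, and rating threshold are assumptions and may differ from the real notebook:

```python
import os

import nvtabular as nvt

# Rough sketch of the workflow built in the "ETL with NVTabular" notebook
# (ops, column names, and file name are assumed, not copied from the notebook).
INPUT_DATA_DIR = '/kaggle/working/data'

# Categorical columns are encoded with Categorify; its fit step is where
# the TypeError below is raised.
cat_features = ["userId", "movieId"] >> nvt.ops.Categorify()

# The rating column is binarized with a LambdaOp (threshold assumed).
ratings = ["rating"] >> nvt.ops.LambdaOp(lambda col: (col > 3).astype("int8"))

workflow = nvt.Workflow(cat_features + ratings)

train_dataset = nvt.Dataset(os.path.join(INPUT_DATA_DIR, "train.parquet"))
workflow.fit(train_dataset)  # raises: TypeError: function is not supported for this dtype: size
```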

Expected behavior

workflow.fit(train_dataset) should complete and calculate the workflow statistics without raising an error, as it does in the reference notebook. Could you please assist in resolving this issue?

Environment Details

  • Merlin version: 1.12.1
  • NVTabular version: 23.08.00
  • Platform: Kaggle notebook
  • Python version: 3.10.14
  • PyTorch version: 2.4.0, CUDA available: True
  • TensorFlow version: 2.16.1, GPU available: True
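
The traceback below ends inside cuDF, whose version is not listed above. A minimal sketch for printing the versions of the libraries that appear in the traceback (assuming they import cleanly in the Kaggle environment):

```python
# Print the versions of the libraries that show up in the traceback.
import cudf
import dask
import nvtabular

print("nvtabular:", nvtabular.__version__)
print("cudf:", cudf.__version__)
print("dask:", dask.__version__)
```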

Additional context

Full Error

```
TypeError                                 Traceback (most recent call last)
File :1

File /opt/conda/lib/python3.10/site-packages/nvtabular/workflow/workflow.py:213, in Workflow.fit(self, dataset)
199 def fit(self, dataset: Dataset) -> "Workflow":
200 """Calculates statistics for this workflow on the input dataset
201
202 Parameters
(...)
211 This Workflow with statistics calculated on it
212 """
--> 213 self.executor.fit(dataset, self.graph)
214 return self

File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:466, in DaskExecutor.fit(self, dataset, graph, refit)
462 if not current_phase:
463 # this shouldn't happen, but lets not infinite loop just in case
464 raise RuntimeError("failed to find dependency-free StatOperator to fit")
--> 466 self.fit_phase(dataset, current_phase)
468 # Remove all the operators we processed in this phase, and remove
469 # from the dependencies of other ops too
470 for node in current_phase:

File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:532, in DaskExecutor.fit_phase(self, dataset, nodes, strict)
530 stats.append(node.op.fit(node.input_columns, Dataset(ddf)))
531 else:
--> 532 stats.append(node.op.fit(node.input_columns, transformed_ddf))
533 except Exception:
534 LOG.exception("Failed to fit operator %s", node.op)

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:400, in Categorify.fit(self, col_selector, ddf)
391 # Define a rough row-count at which we are likely to
392 # start hitting memory-pressure issues that cannot
393 # be accommodated with smaller partition sizes.
394 # By default, we estimate a "problematic" cardinality
395 # to be one that consumes >12.5% of the total memory.
396 self.cardinality_memory_limit = parse_bytes(
397 self.cardinality_memory_limit or int(device_mem_size(kind="total", cpu=_cpu) * 0.125)
398 )
--> 400 dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
401 return Delayed(key, dsk)

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1551, in _category_stats(ddf, options)
1549 if options.agg_cols == [] and options.agg_list == []:
1550 options.agg_list = ["size"]
-> 1551 return _groupby_to_disk(ddf, _write_uniques, options)
1553 # Otherwise, getting category-statistics
1554 if isinstance(options.agg_cols, str):

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1406, in _groupby_to_disk(ddf, write_func, options)
1402 # Use map_partitions to improve task fusion
1403 grouped = ddf.to_bag(format="frame").map_partitions(
1404 _top_level_groupby, options=options, token="level_1"
1405 )
-> 1406 _grouped_meta = _top_level_groupby(ddf._meta, options=options)
1407 _grouped_meta_col = {}
1409 dsk_split = defaultdict(dict)

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result

File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1017, in _top_level_groupby(df, options, spill)
1015 df_gb = _maybe_flatten_list_column(cat_col_selector.names[0], df_gb)
1016 # NOTE: groupby(..., dropna=False) requires pandas>=1.1.0
-> 1017 gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
1018 gb.columns = [
1019 _make_name((tuple(cat_col_selector.names) + name[1:]), sep=options.name_sep)
1020 if name[0] == cat_col_selector.names[0]
1021 else _make_name((tuple(cat_col_selector.names) + name), sep=options.name_sep)
1022 for name in gb.columns.to_flat_index()
1023 ]
1024 gb.reset_index(inplace=True, drop=False)

File /opt/conda/lib/python3.10/site-packages/cudf/utils/performance_tracking.py:51, in _performance_tracking.<locals>.wrapper(*args, **kwargs)
43 if nvtx.enabled():
44 stack.enter_context(
45 nvtx.annotate(
46 message=func.__qualname__,
(...)
49 )
50 )
---> 51 return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:629, in GroupBy.agg(self, func)
619 orig_dtypes = tuple(c.dtype for c in columns)
621 # Note: When there are no key columns, the below produces
622 # an Index with float64 dtype, while Pandas returns
623 # an Index with int64 dtype.
624 # (GH: 6945)
625 (
626 result_columns,
627 grouped_key_cols,
628 included_aggregations,
--> 629 ) = self._groupby.aggregate(columns, normalized_aggs)
631 result_index = self.grouping.keys._from_columns_like_self(
632 grouped_key_cols,
633 )
635 multilevel = _is_multi_agg(func)

File groupby.pyx:192, in cudf._lib.groupby.GroupBy.aggregate()

TypeError: function is not supported for this dtype: size
```
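
From the traceback, the failure happens in the groupby aggregation that Categorify issues with a "size" aggregation (categorify.py lines 1550 and 1017, ending in cuDF's GroupBy.aggregate). The snippet below is an approximation of that call pattern reconstructed from the traceback, not code taken from NVTabular or the notebook, and whether it reproduces the same TypeError will depend on the installed cuDF version:

```python
import cudf

# Approximation of the call Categorify makes during fit: with no agg columns
# requested it falls back to a "size" aggregation (categorify.py line 1550)
# and runs groupby(..., dropna=False).agg(agg_dict) (line 1017).
# The exact shape of agg_dict here is an assumption.
df = cudf.DataFrame(
    {
        "userId": [1, 1, 2, 2, 3],
        "movieId": [10, 20, 10, 30, 20],
    }
)

agg_dict = {"movieId": ["size"]}
gb = df.groupby(["userId"], dropna=False).agg(agg_dict)
print(gb)
```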

@qrcodeTH added the bug label on Aug 31, 2024
@adidwd

adidwd commented Sep 20, 2024

I have the same issue. Did you find a resolution?
I am using the data and code exactly as in the GitHub links.

@hishambawa

Any update on this?
