Bug description
I ran into an issue while running the code from the getting-started-movielens folder in a Kaggle notebook. The "Download and Convert" notebook completed successfully, but the "ETL with NVTabular" step fails.
Steps/Code to reproduce bug
Set Up Environment
I copied the code from the "Download and Convert" notebook into my Kaggle notebook and ran it successfully.
I then continued with the code from the "ETL with NVTabular" notebook, which is where the issue arises.
Modify the Code:
In the notebook, I updated the INPUT_DATA_DIR variable to point to the correct path in my Kaggle notebook. For example:
```python
INPUT_DATA_DIR = '/kaggle/working/data'
```
All other code remains unchanged.
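As a quick sanity check (my own addition, not part of the notebook), I verified that the parquet files written by the "Download and Convert" step are present under that directory before starting the ETL notebook:

```python
import os

INPUT_DATA_DIR = '/kaggle/working/data'

# List the converted MovieLens files produced by the "Download and Convert" notebook;
# the ETL notebook reads its train/validation parquet files from this directory.
print(sorted(os.listdir(INPUT_DATA_DIR)))
```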
Run the Notebook
I executed the notebook cells in sequence.
The error occurs when running the following line:
```python
workflow.fit(train_dataset)
```
The error message received is: TypeError: function is not supported for this dtype: size.
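For reference, a condensed sketch of what the failing step boils down to (simplified from the notebook: I have left out the genres join and the rating label, and the file/column names below are the ones from my run, so treat them as approximate):

```python
import os
import nvtabular as nvt

INPUT_DATA_DIR = '/kaggle/working/data'

# Categorify the user/item id columns, as in the getting-started example
cat_features = ["userId", "movieId"] >> nvt.ops.Categorify()
workflow = nvt.Workflow(cat_features)

# Train split written by the "Download and Convert" notebook
train_dataset = nvt.Dataset(os.path.join(INPUT_DATA_DIR, "train.parquet"))

# This is the call that raises the TypeError during Categorify's statistics pass
workflow.fit(train_dataset)
```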
Expected behavior
workflow.fit(train_dataset) should complete and compute the Categorify statistics without raising an error, as in the published getting-started notebook. Could you please assist in resolving this issue?
Environment Details
Kaggle notebook, Python 3.10, conda environment at /opt/conda (cuDF, NVTabular, and Merlin installed there; exact versions not captured).
Additional context
Full Error
```
TypeError Traceback (most recent call last)
File :1
File /opt/conda/lib/python3.10/site-packages/nvtabular/workflow/workflow.py:213, in Workflow.fit(self, dataset)
199 def fit(self, dataset: Dataset) -> "Workflow":
200 """Calculates statistics for this workflow on the input dataset
201
202 Parameters
(...)
211 This Workflow with statistics calculated on it
212 """
--> 213 self.executor.fit(dataset, self.graph)
214 return self
File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:466, in DaskExecutor.fit(self, dataset, graph, refit)
462 if not current_phase:
463 # this shouldn't happen, but lets not infinite loop just in case
464 raise RuntimeError("failed to find dependency-free StatOperator to fit")
--> 466 self.fit_phase(dataset, current_phase)
468 # Remove all the operators we processed in this phase, and remove
469 # from the dependencies of other ops too
470 for node in current_phase:
File /opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py:532, in DaskExecutor.fit_phase(self, dataset, nodes, strict)
530 stats.append(node.op.fit(node.input_columns, Dataset(ddf)))
531 else:
--> 532 stats.append(node.op.fit(node.input_columns, transformed_ddf))
533 except Exception:
534 LOG.exception("Failed to fit operator %s", node.op)
File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:400, in Categorify.fit(self, col_selector, ddf)
391 # Define a rough row-count at which we are likely to
392 # start hitting memory-pressure issues that cannot
393 # be accommodated with smaller partition sizes.
394 # By default, we estimate a "problematic" cardinality
395 # to be one that consumes >12.5% of the total memory.
396 self.cardinality_memory_limit = parse_bytes(
397 self.cardinality_memory_limit or int(device_mem_size(kind="total", cpu=_cpu) * 0.125)
398 )
--> 400 dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
401 return Delayed(key, dsk)
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1551, in _category_stats(ddf, options)
1549 if options.agg_cols == [] and options.agg_list == []:
1550 options.agg_list = ["size"]
-> 1551 return _groupby_to_disk(ddf, _write_uniques, options)
1553 # Otherwise, getting category-statistics
1554 if isinstance(options.agg_cols, str):
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1406, in _groupby_to_disk(ddf, write_func, options)
1402 # Use map_partitions to improve task fusion
1403 grouped = ddf.to_bag(format="frame").map_partitions(
1404 _top_level_groupby, options=options, token="level_1"
1405 )
-> 1406 _grouped_meta = _top_level_groupby(ddf._meta, options=options)
1407 _grouped_meta_col = {}
1409 dsk_split = defaultdict(dict)
File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py:1017, in _top_level_groupby(df, options, spill)
1015 df_gb = _maybe_flatten_list_column(cat_col_selector.names[0], df_gb)
1016 # NOTE: groupby(..., dropna=False) requires pandas>=1.1.0
-> 1017 gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
1018 gb.columns = [
1019 _make_name((tuple(cat_col_selector.names) + name[1:]), sep=options.name_sep)
1020 if name[0] == cat_col_selector.names[0]
1021 else _make_name((tuple(cat_col_selector.names) + name), sep=options.name_sep)
1022 for name in gb.columns.to_flat_index()
1023 ]
1024 gb.reset_index(inplace=True, drop=False)
File /opt/conda/lib/python3.10/site-packages/cudf/utils/performance_tracking.py:51, in _performance_tracking.<locals>.wrapper(*args, **kwargs)
43 if nvtx.enabled():
44 stack.enter_context(
45 nvtx.annotate(
46 message=func.__qualname__,
(...)
49 )
50 )
---> 51 return func(*args, **kwargs)
File /opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:629, in GroupBy.agg(self, func)
619 orig_dtypes = tuple(c.dtype for c in columns)
621 # Note: When there are no key columns, the below produces
622 # an Index with float64 dtype, while Pandas returns
623 # an Index with int64 dtype.
624 # (GH: 6945)
625 (
626 result_columns,
627 grouped_key_cols,
628 included_aggregations,
--> 629 ) = self._groupby.aggregate(columns, normalized_aggs)
631 result_index = self.grouping.keys._from_columns_like_self(
632 grouped_key_cols,
633 )
635 multilevel = _is_multi_agg(func)
File groupby.pyx:192, in cudf._lib.groupby.GroupBy.aggregate()
TypeError: function is not supported for this dtype: size
```
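From the traceback, Categorify.fit ends up requesting a "size" aggregation from cuDF's GroupBy.agg (agg_list = ["size"] in _category_stats), and that is the call cuDF rejects. A rough standalone reduction of that internal call, with placeholder data and column names of my own (so it may not match NVTabular's exact agg dict), would be:

```python
import cudf

# Placeholder frame; Categorify groups the selected categorical column(s)
# with dropna=False and asks for a "size" aggregation, roughly like this:
df = cudf.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10]})
print(df.groupby(["userId"], dropna=False).agg({"movieId": ["size"]}))
```

If this snippet fails with the same TypeError on the Kaggle image, my guess is that this points to a version mismatch between the installed NVTabular and cuDF rather than anything in the notebook code itself.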