
nested fields count towards limit of stats calculation of first 32 columns #3172

Open · Labels: bug
mrjsj opened this issue Jan 30, 2025 · 2 comments

@mrjsj (Contributor) commented Jan 30, 2025

Environment

Delta-rs version: 0.24.0

Binding: Python

Environment:
  • Cloud provider: N/A
  • OS: MacOS 15.2

Bug

What happened:
When table statistics are calculated, only the first 32 columns are included. However, if a column has nested fields, each of those fields counts towards the 32-column limit.
Additionally, stats are not calculated at all when the column is a list of structs (though that may be intended?).

What you expected to happen:
A nested column should only count once, so that table stats are calculated for the first 32 columns at the root level.

How to reproduce it:

from deltalake import write_deltalake, DeltaTable
from polars import Schema, List, Struct, Int64, String
import polars as pl

schema = Schema(
    {
        "1": String,
        # 31 nested fields: "2"-"5" are Int64, "6"-"32" are String
        "nested": List(
            Struct(
                {str(i): Int64 for i in range(2, 6)}
                | {str(i): String for i in range(6, 33)}
            )
        ),
        "year": Int64,
        "month": Int64,
        "day": Int64,
    }
)

df = pl.DataFrame(
    {
        "1": ["foo"],
        "nested": [[]],
        "year": [2024],
        "month": [12],
        "day": [1],
    },
    schema=schema,
)

write_deltalake(
    "my_temp_table", df.to_arrow(), mode="overwrite", schema_mode="overwrite"
)


ds = DeltaTable("my_temp_table").to_pyarrow_dataset()

result = pl.scan_pyarrow_dataset(ds).filter(pl.col("year") == 2024).collect()

print(result)

# shape: (0, 5)
# ┌─────┬──────────────────┬──────┬───────┬─────┐
# │ 1   ┆ nested           ┆ year ┆ month ┆ day │
# │ --- ┆ ---              ┆ ---  ┆ ---   ┆ --- │
# │ str ┆ list[struct[31]] ┆ i64  ┆ i64   ┆ i64 │
# ╞═════╪══════════════════╪══════╪═══════╪═════╡
# └─────┴──────────────────┴──────┴───────┴─────┘
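One way to confirm which columns actually received statistics is to read the add actions back via DeltaTable.get_add_actions (with flatten=True the stats are expanded into one column per field; only column "1" shows up with min/max/null-count entries, while year, month and day are absent):

dt = DeltaTable("my_temp_table")
# Flattened record batch of the add actions, including per-column stats
print(dt.get_add_actions(flatten=True))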

The resulting transaction file is shown below. The interesting part is the stats field of the add action: statistics were only written for column "1", not for year, month or day.

{
    "protocol": {
        "minReaderVersion": 1,
        "minWriterVersion": 2
    }
}
{
    "metaData": {
        "id": "3b36e7e3-abd2-4eac-a39a-f6fbfd5f39c6",
        "name": null,
        "description": null,
        "format": {
            "provider": "parquet",
            "options": {}
        },
        "schemaString": "{\"type\":\"struct\",\"fields\":[{\"name\":\"1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"nested\",\"type\":{\"type\":\"array\",\"elementType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"2\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"3\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"4\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"5\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"6\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"7\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"8\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"9\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"10\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"11\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"12\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"13\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"14\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"15\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"16\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"17\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"18\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"19\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"20\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"21\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"22\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"23\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"24\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"25\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"26\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"27\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"28\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"29\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"30\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"31\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"32\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]},\"containsNull\":true},\"nullable\":true,\"metadata\":{}},{\"name\":\"year\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"month\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"day\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}",
        "partitionColumns": [],
        "createdTime": 1738246259519,
        "configuration": {}
    }
}
{
    "add": {
        "path": "part-00001-f9991ba8-46d0-4fa5-bae2-6ff3bfed4c56-c000.snappy.parquet",
        "partitionValues": {},
        "size": 8860,
        "modificationTime": 1738246259565,
        "dataChange": true,
        "stats": "{\"numRecords\":1,\"minValues\":{\"1\":\"foo\"},\"maxValues\":{\"1\":\"foo\"},\"nullCount\":{\"1\":0}}",
        "tags": null,
        "deletionVector": null,
        "baseRowId": null,
        "defaultRowCommitVersion": null,
        "clusteringProvider": null
    }
}
{
    "commitInfo": {
        "timestamp": 1738246259569,
        "operation": "WRITE",
        "operationParameters": {
            "mode": "Overwrite"
        },
        "operationMetrics": {
            "execution_time_ms": 52,
            "num_added_files": 1,
            "num_added_rows": 1,
            "num_partitions": 0,
            "num_removed_files": 0
        },
        "clientVersion": "delta-rs.0.23.1"
    }
}
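A possible workaround until this is fixed, assuming delta-rs honours the delta.dataSkippingNumIndexedCols table property (that property is where the default of 32 comes from in the Delta protocol; whether the delta-rs stats writer reads it here is an assumption), is to raise the limit so the trailing root-level columns fit within the budget again:

write_deltalake(
    "my_temp_table",
    df.to_arrow(),
    mode="overwrite",
    schema_mode="overwrite",
    # hypothetical workaround: budget enough indexed columns to cover
    # the 31 nested leaf fields plus all five root-level columns
    configuration={"delta.dataSkippingNumIndexedCols": "64"},
)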

More details:
Slack conversation:
https://delta-users.slack.com/archives/C013LCAEB98/p1738184140820519

@jayctran commented Feb 5, 2025

I believe nested fields are included in the count of fields used for statistics.

The Databricks liquid clustering docs list the data types that can be used as clustering keys, and those columns must have statistics captured:

https://learn.microsoft.com/en-us/azure/databricks/delta/clustering#choose-clustering-keys

Hence arrays and maps do not capture stats.

There is additional info on choosing stats columns, which can include the nested fields of a struct. I'm not sure if there's any difference with OSS Delta, but I would think not:

https://learn.microsoft.com/en-us/azure/databricks/delta/data-skipping#specify-delta-statistics-columns
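For reference, per those docs the stats columns can be overridden explicitly with the delta.dataSkippingStatsColumns table property, e.g. (my_table is a placeholder):

spark.sql("""
    ALTER TABLE my_table
    SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'year,month,day')
""")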

@mrjsj (Contributor, Author) commented Feb 5, 2025

I have verified how Spark-Delta handles this: it takes the first 32 columns at the root level, even when a nested column is an ArrayType/StructType.

So the problem is not that the nested column counts towards the 32-column limit, but that each of the fields within the nested column counts towards the limit. As a result, in delta-rs, statistics are not calculated for the year, month and day columns.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

# 31 nested fields: "2"-"5" are IntegerType, "6"-"32" are StringType
nested_schema = StructType(
    [StructField(str(i), IntegerType(), True) for i in range(2, 6)]
    + [StructField(str(i), StringType(), True) for i in range(6, 33)]
)

schema = StructType([
    StructField("1", StringType(), True),
    StructField("nested", ArrayType(nested_schema), True),
    StructField("year", IntegerType(), True),
    StructField("month", IntegerType(), True),
    StructField("day", IntegerType(), True)
])

# `spark` is predefined in Databricks notebooks; create a session when running elsewhere
spark = SparkSession.builder.getOrCreate()

data = [("foo", [], 2024, 12, 1)]
df = spark.createDataFrame(data, schema)

df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("my_catalog.default.my_temp_table")

[Image: screenshot of the resulting statistics from the Spark write]
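To make the difference concrete, here is an illustrative sketch (not the actual implementation in either engine) of the two ways of spending the 32-column stats budget on the repro schema above:

# Leaf counts for the repro schema: "nested" contains 31 leaf fields.
leaf_counts = {"1": 1, "nested": 31, "year": 1, "month": 1, "day": 1}

def indexed_root_columns(leaf_counts, limit=32, count_leaves=True):
    """Return the root columns whose stats fit within the budget."""
    indexed, used = [], 0
    for name, leaves in leaf_counts.items():
        cost = leaves if count_leaves else 1
        if used + cost > limit:
            break
        indexed.append(name)
        used += cost
    return indexed

# delta-rs today: the 31 nested leaves exhaust the budget, so year/month/day
# are never indexed (and "nested" itself gets no stats as a list of structs).
print(indexed_root_columns(leaf_counts, count_leaves=True))   # ['1', 'nested']

# Spark-Delta: each root-level column costs 1, so all five are indexed.
print(indexed_root_columns(leaf_counts, count_leaves=False))  # ['1', 'nested', 'year', 'month', 'day']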
