Using PARTITION BY in SQL generates 'Error: External(NotImplemented("it is not yet supported to write to hive partitions with datatype Float64"))' #13602

Closed
ajazam opened this issue Nov 29, 2024 · 4 comments
Labels
bug Something isn't working

Comments

ajazam commented Nov 29, 2024

Describe the bug

I am trying to create a Parquet file with hive partitioning from CSV data and get the following error:

Error: External(NotImplemented("it is not yet supported to write to hive partitions with datatype Float64"))

To Reproduce

main.rs
use std::fs::File;
use std::io::Write;

use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use datafusion::prelude::*;
use tempfile::tempdir;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let dir = tempdir()?;
    let file_path = dir.path().join("example.csv");

    let mut file = File::create(&file_path)?;
    file.write_all(
        r#"dte,ot

2016-07-01 00:00:00,2
2016-07-01 06:45:00,3"#
            .as_bytes(),
    )?;

    let file_path = file_path.to_str().unwrap();

    let ctx = SessionContext::new();
    let csv_df = ctx.read_csv(file_path, CsvReadOptions::default()).await?;
    csv_df.show().await?;

    let schema = Schema::new(vec![
        Field::new("dte", DataType::Timestamp(TimeUnit::Second, None), false),
        Field::new("ot", DataType::UInt16, false),
    ]);

    ctx.register_csv(
        "data",
        file_path,
        CsvReadOptions::new().schema(&schema).has_header(true),
    )
    .await?;

    let df = ctx
        .sql("copy (SELECT dte, ot, EXTRACT(YEAR FROM dte) AS year from data) to './partitioned_output' stored as parquet PARTITIONED BY (year)")
        .await?;
    df.count().await?;

    Ok(())
}

Cargo.toml
[package]
name = "datafusion_csv"
version = "0.1.0"
edition = "2021"

[dependencies]
tokio = { version = "1", features = ["full"] }
datafusion = "43.0.0"
arrow = "53.3.0"
tempfile = "3.14.0"

Expected behavior

I am expecting a folder year=2016 containing a Parquet file.
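With hive-style partitioning, the output should look roughly like this (the Parquet file name is generated, shown here as a placeholder):

partitioned_output/
└── year=2016/
    └── <generated>.parquet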

Additional context

I was originally trying to have folders for month and day, couldn't get the application to work, and then created this simpler example.

ajazam added the bug label on Nov 29, 2024
ajazam commented Nov 29, 2024

I've tried Rust 1.82 and 1.83.

@delamarch3
Contributor

Looks like date_part was updated to return Int32 instead of Float64 in PR #13466, which should fix this issue. As a workaround you could try casting it, e.g. arrow_cast(EXTRACT(..), 'Int64').

@Omega359
Contributor

I didn't implement float64 for hive partitioning because, well, floats in general are not exact values. Best to cast to an int.

ajazam commented Dec 1, 2024

Thanks gents, I got it working. For anybody else who comes up against this issue, I made the following alteration:

let df = ctx.sql("copy (SELECT dte, ot, arrow_cast(EXTRACT(YEAR FROM dte), 'Int32') AS year from data) to './partitioned_output' stored as parquet PARTITIONED BY (year)").await?;
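
Per the Additional context above, the original goal was partitioning by month and day as well, and the same cast should extend to that. A minimal sketch, assuming the same SessionContext ctx and registered "data" table as in main.rs, and that multiple columns in PARTITIONED BY behave the same way:

// Sketch only (assumes the setup from main.rs above): the arrow_cast
// workaround applied to year, month and day partition columns.
let df = ctx
    .sql(
        "copy (SELECT dte, ot, \
            arrow_cast(EXTRACT(YEAR FROM dte), 'Int32') AS year, \
            arrow_cast(EXTRACT(MONTH FROM dte), 'Int32') AS month, \
            arrow_cast(EXTRACT(DAY FROM dte), 'Int32') AS day \
        from data) \
        to './partitioned_output' stored as parquet PARTITIONED BY (year, month, day)",
    )
    .await?;
df.count().await?;

The output would then be nested as year=.../month=.../day=... directories.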

ajazam closed this as completed on Dec 1, 2024