-
As I understand it, it also uses the column's summary statistics, in particular the max value, to determine the kind of numeric data type it will use when creating the table schema definition in PostgreSQL. I do something similar with Datapusher+. @kindly, can you confirm?
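Roughly the kind of thing I have in mind (an illustrative sketch of my own approach, not this library's code; the thresholds are just PostgreSQL's documented integer ranges):

```python
# Illustrative only: choose the narrowest PostgreSQL numeric type that fits,
# based on the summary statistics (min/max) gathered while profiling the CSV.
def guess_numeric_type(col_min: int, col_max: int, has_fraction: bool) -> str:
    if has_fraction:
        return "NUMERIC"  # lossless for decimals
    if -32768 <= col_min and col_max <= 32767:
        return "SMALLINT"
    if -2147483648 <= col_min and col_max <= 2147483647:
        return "INTEGER"
    if -9223372036854775808 <= col_min and col_max <= 9223372036854775807:
        return "BIGINT"
    return "NUMERIC"  # beyond BIGINT range


print(guess_numeric_type(0, 3_000_000_000, False))  # -> BIGINT
```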
-
@mhkeller yes, the approach was to go for the most lenient types. I had an issue with running over the int size on a second load into the same table. An aim of the type-guessing was speed for large datasets, so having the fewest options to check for each field was preferable, and checking the size of floats and ints for every row was not my priority. Float parsing is particularly expensive. I thought the extra disk space for these was fairly negligible compared to any text fields. However, for heavy numerical data this may not be ideal.
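A simplified sketch of that strategy (not the actual implementation): each value gets at most a couple of cheap checks, a column's guess can only ever widen, and no magnitudes are tracked at all.

```python
from datetime import datetime

# Most specific to most lenient; no size tracking, so integers always end up
# BIGINT and anything fractional ends up NUMERIC.
LADDER = ["TIMESTAMP", "BIGINT", "NUMERIC", "TEXT"]

def fits(value: str, pg_type: str) -> bool:
    try:
        if pg_type == "TIMESTAMP":
            datetime.fromisoformat(value)
        elif pg_type == "BIGINT":
            int(value)
        elif pg_type == "NUMERIC":
            float(value)  # the expensive parse, only reached when needed
        return True       # TEXT accepts anything
    except ValueError:
        return False

def guess_column_type(values: list[str]) -> str:
    guess = 0  # start at the most specific rung
    for v in values:
        while not fits(v, LADDER[guess]):
            guess += 1  # widen once, never narrow again
    return LADDER[guess]

print(guess_column_type(["1", "2", "30000000000"]))  # BIGINT
print(guess_column_type(["1", "2.5", "3"]))          # NUMERIC
```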
-
Can you give any insight into the algorithm that determines how you go from CSV types to Postgres types? I was doing some tests with a small CSV and was surprised it went for a BIGINT for a plain integer column. I was also curious about the choice of NUMERIC instead of a more specific float field. I imagine it goes with the most forgiving types for the most compatibility? Something like the example below is what I mean.
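For illustration (made-up file and table names, not my exact test data), a CSV like:

```csv
id,amount
1,19.99
2,250.00
```

produces a schema roughly like:

```sql
CREATE TABLE "example" (
    "id" BIGINT,
    "amount" NUMERIC
);
```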
edit: it looks like a string like `2024-01-01` gets converted to a TIMESTAMP type instead of a DATE, which would also be consistent with that logic.
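For instance (again an illustrative file, not my actual data), a column holding only date strings:

```csv
created
2024-01-01
2024-02-15
```

comes out as roughly:

```sql
CREATE TABLE "dates_example" (
    "created" TIMESTAMP
);
```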