Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle geo types #11

Open
mhkeller opened this issue Nov 10, 2023 · 9 comments
Open

Handle geo types #11

mhkeller opened this issue Nov 10, 2023 · 9 comments

Comments

@mhkeller
Copy link

I'd like to use this to query a PostGIS database that has geometry types. I see there is a GeoParquet format although I haven't looked too deeply into how well these types match one another. Alternatively, serializing geometry types as strings would also be fine for now.

@mhkeller mhkeller changed the title Support geo types Handle geo types Nov 10, 2023
@mhkeller
Copy link
Author

mhkeller commented Nov 10, 2023

Revising this to just Handle instead of Support since I think the two most useful options for my use case don't actually require converting into geotypes.

  • Option 1: A flag to convert all unsupported types (or just geotypes) to strings
  • Option 2: A flag to omit columns of unsupported types

@mhkeller
Copy link
Author

I can close this for now since I found a good workaround. Instead of using select * from my_table, I'm naming each column and for geometry fields, changing the query selection to SUBSTRING(ST_AsText(${column_name})::text, 1, 1000) as ${column_name}, which has the advantage in my use case of not returning the entire geom string.

@exyi
Copy link
Owner

exyi commented Nov 11, 2023

Sorry, I'll reopen the issue :)

I like the proposal to add options for ignoring unknown types, or converting them to text. It has the problem of backwards compatibility - when I would add support for a new type, your existing workflow can break since type of some column have changed. I guess it's not a big deal if we don't make it the default, and add warning to the documentation.

I think it is a good idea to support the GeoParquet extension, although it looks quite complex, so no promises it will happen anytime soon. I already wanted to add support for points, paths and polygons and serialize them as parquet structs, but postgres only tells me the column type is geometry, not geometry(point) :/ GeoParquet can apparently serialize all of these objects into one BYTE_ARRAY field, so it should be possible to implement.

@exyi exyi reopened this Nov 11, 2023
@sparkadvisory
Copy link

sparkadvisory commented May 13, 2024

I just came here to ask for the same :) Thank you for your efforts. It would be nice to have a flag to cast unknown types as string.

@exyi
Copy link
Owner

exyi commented May 14, 2024

@mahmut-spark Great! Are you looking specifically for postgis geography types or other currently unsupported data types? I have recently found a way to extract the exact column type (i.e. geography(point), not just geography), which opens the possibility to serialize geo types as Parquet structs (point as {x: float, y: float} struct, polygons as lists, ...).

What kind of output schema would you prefer? The possibilities I'm aware of are:

Unfortunately, GeoParquet (currently) only supports geography types at the root of the schema. PostGIS definitely allows you to have them in custom types and arrays, meaning that we might have to produce an invalid GeoParquet if the database uses this.

@mhkeller
Copy link
Author

I’m what curious what you found to get the specific geometry type…

@sparkadvisory
Copy link

My use case is Postgis -> Parquet -> DuckDB. I think it will be fine even if the geometry values are cast to text. That being said, I gathered these. Hopefully, it will helpful:

PostGIS definitely allows you to have them in custom types and arrays

I couldn't find any information on custom geometry types. But arrays yes, and I confirmed from pgadmin as well:

image

And geoparquet seems to expect the format as:
https://geoparquet.org/releases/v1.0.0/

Nesting: Geometry columns MUST be at the root of the schema. A geometry MUST NOT be a group field or nested in a group. In practice, this means that when writing to GeoParquet from another format, geometries cannot be contained in complex or nested types such as structs, lists, arrays, or map types.

Given these circumstances, I suppose it makes sense to (if possible) use GeoParquet:

  1. Write compatible geometry types as WKB.
  2. Write incompatible geometry types cast as TEXT.

What do you think?

@exyi
Copy link
Owner

exyi commented May 16, 2024

@mhkeller I’m what curious what you found to get the specific geometry type…

The information is available in the system catalog pg_attribute. This query will list all attributes with the full types: select a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod) FROM pg_catalog.pg_attribute a, you can obviously add filters for specific relations, it also has information about nullability. I believe this is the way to go for other things than geo types — knowing numeric scale and precision is also great, and nullability is nice to have. Currently, all fields are optional in the generated Parquet files; sometimes it doesn't matter, sometimes it's annoying to work with (hello Julia DataFrames).

However, it's bit more complicated to get work with arbitrary queries, but the basic info should be extractable from EXPLAIN (VERBOSE, FORMAT json) {userQuery}.

Since it isn't really needed for WKB/WKT output, I this we could have geo support first. I guess the other main blocker is being able to reliably test against postgres with extensions 😅 , I'll have a look at setting up proper CI

@mahmut-spark ...

Thank you for the detailed info! I'm still not sure about the correct defaults, but I find it important to support both WKB, WKT and maybe the structs in future, so I'll probably just add an option and make it required for now.

@basaran
Copy link

basaran commented May 17, 2024

Hi, this is my personal git. If it will be helpful I can put together a docker compose for you to extend postgis. Please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants