Proposed plan for Geospatial Capabilities in Velox #12094

jagill · 2025-01-15T20:48:46Z

jagill
Jan 15, 2025
Collaborator

TL;DR

A necessary condition to deprecate the Presto Java execution stack and cluster is to add geospatial capabilities to Velox. We call this GeoVelox. The capabilities are difficult and coupled, and there are several significant decisions and steps that we need to take on the way.

Background: Why and why now?

Presto Java has a robust set of geospatial capabilities which are critical for doing geometric, geospatial, or GIS-type analyses. These features constitute the majority of the feature gap between the Java execution stack and Velox. We cannot deprecate the Java execution stack until we implement the features in Velox and still support these customers.

While we have two stacks we:

Increase the development cost of any new feature, since we must develop it in each stack and ensure consistent semantics,
Increase the operational cost, as we have two clusters for each region and these clusters have different operating characteristics and failure modes,
Increase the resource cost, because we need different machines for the two types of clusters.

Unlike most missing capabilities, geospatial features are tightly coupled: strategic decisions about third-party library use, new types, in-memory representation, serialized representation, and more all have to be made before almost any capabilities can be added. More advanced capabilities like Spatial Joins (local and partitioned) require new operators, and these are also tightly coupled with strategic choices.

Goals

The overall goal is to remove geospatial capabilities as a blocker to Java execution stack deprecation, and do it in a way that maintains (or improves) the customer experience. To do that, we use the specific goals of this project. In order of priority, they are:

Capabilities: Add all of the Presto Geospatial capabilities to Velox, with minimal changes in semantics.
Performance: Not regress, and ideally improve, performance compared to Java.
Extensibility: Implement the features in a way that allows further improvements.

Background: Geometries

A Geometry represents a set of points on the Euclidean plane. It is expressed as a finite set of coordinates of the form (x, y) (we ignore z and m coordinates for now). For extended geometries, these coordinates are the vertices of polygons or places where a path changes direction.

The general standard (Simple Feature access) for Geometries defines 7 types, including 3 basic types:

Point: represented by a single coordinate,
LineString: a piece-wise linear string represented by a coordinate array,
Polygon: A polygon possibly with holes, represented by an array of coordinate arrays. Each coordinate array is a ring. The first ring is an external "shell", and the subsequent rings are holes inside the shell.

There are also 3 "multi" types:

MultiPoint: A collection of points, represented by a coordinate array.
MultiLineString: A collection of LineStrings, represented by an array of coordinate arrays.
MultiPolygon: A collection of polygons, represented by an array of array of coordinate arrays.

There is also the GeometryCollection, a collection of any geometries (including GeometryCollection). While some specifications view these as 7 distinct types, for Presto Java we just have the Geometry type that may be any of these: which they are is determined by accessor functions instead of type information. I recommend that we do the same in GeoVelox.

Any of the above can be empty, so there are separate 'Point Empty' and 'MultiPolygon Empty'; these are considered members of their respective types. Note that Empty types are not equivalent to NULL, in the same way that 0 or the empty string are not equivalent to NULL. Empty types are an existent, valid Geometry that contains no points.

Geometries have many basic binary predicates, like intersects(g1: Geometry, g2: Geometry): bool (gory details). They also have binary operations, like intersection(g1: Geometry, g2: Geometry): Geometry.

One important binary operation is equals(g1: Geometry, g2: Geometry): Geometry, which means the two geometries contain the same points in the plane. Since geometries can have different representations with their coordinate arrays, this is an expensive operation. Further, due to issues of numerical stability, it is common to modify the geometry slightly to make the calculation easier, which means that as a practical matter the results may be different than expected. From painful experience, we consider Geometries to not have a natural = operation, so they can’t be used in GROUP BY clauses, IN clauses, etc. Since resolving equals(g1, g2) can take literal minutes, and it may give unexpected false results, overloading g1 = g2 to use equals(g1, g2) is a dangerous footgun.

Hurdle 1: Library choice

With the allocated resources, we will have to rely on third-party libraries to do the actual geometric calculations (e.g., the intersection of two geometries).

For the Java execution stack, we use two libraries: ESRI and JTS. We use two libraries because they have slightly different capabilities, but they also have different definitions of geometry (both at the type level and in-memory representation) and different semantics. An effort to fully deprecate ESRI for JTS ran out of time/resources, but this is a source of technical debt.

For C++, there is one fundamental library: GEOS. ESRI has a C++ library but it is closed-source. GEOS is developed in parallel with JTS, with a goal of feature parity and consistent semantics. However, it does not have the capabilities in ESRI that are missing in JTS, particularly with regards to projections and Geographies (see section below).

GDAL is a geometry library that wraps GEOS and adds capabilities for projections. GDAL has a significantly higher operational burden (getting it installed on a machine takes a day of bespoke work), but has a rich set of projections (see below for more discussion on projections).

One thing is inescapable: We do not have directly equivalent C++ libraries for all of our capabilities. We will have at least slightly different semantics for some of our operations.

For GEOS, we can either target the C API or the C++ API. The C API is stable (including ABI), while the C++ API/ABI is not considered stable. Using the C-API requires manual memory management. From the GEOS site:

The GEOS C++ API is included in the collection of header files installed in
include/geos which is a very large collection. Effectively it includes both
"public" headers that a user might be expected to make use of and "private"
headers that are mostly only used by internal algorithms. Currently, the two
kinds of headers are not marked in any way, nor is there an easy way to
disentangle them. ... headers can be moved, re-named or deleted according to
the implementation needs of the library.

Recommendation: Evaluate and use GEOS. Evaluate the pros/cons of C vs C++ API. Defer decision on GDAL to when we add Geographies. Check that library licenses are compatible with that of Velox.

Hurdle 2: Standards choice

There are two standards for geometric calculation: OGC and ISO SQL3 Spatial. These two are mostly compatible, with some minor different choices.

OGC:

[+] is open and free
[+] targeted by JTS and GEOS
[-] non-SQL
[-] many ambiguities and non-specified behavior

ISO:

[-] requires payment
[-] JTS/GEOS compliance is a non-goal
[+] detailed with fewer ambiguities
[+] SQL-specific

For Presto Java, we target ISO but the OGC-specific choices of JTS leak through.

Recommendation: We adopt ISO SQL3 Spatial as a goal, either working around GEOS/JTS choices or being explicit where we deviate from ISO.

Hurdle 3: In-memory representation

Presto Java uses an ESRI Shapefile format to serialize geometries between workers, with its own parser for Shapefile-to-JTS conversion. This is very expensive: an earlier analysis we did suggested that up to 70% of the CPU cost of geometry computation was in serde from Presto blocks to geometries. In particular, in the common case when users nest functions the geometry is deserialized and reserialized between each function call.

One alternative is to express a geometry as a Row-type, with a type tag and List<x> and List<y> coordinate subfields, and then use views into these to construct the GEOS geometries. If we can make this 0-copy (or O(1)-copy), we can realize significant efficiency gains. The open source community has been active in figuring out these representations, for example GeoArrow; we hope to leverage as much of their work as possible. One risk is the ability to represent GeometryCollections, which themselves may have GeometryCollections: it is not clear how to represent this in a flattened structure.

Recommendation: Verify that the inter-operator and inter-worker serialization formats are transparent to the user. If so, we can start with the lowest-lift version (WKB, Shapefile, etc) then evaluate a more efficient representation. The GEOS C or C++ API might have different possibilities here. If the format is not transparent, we should explore which format we should use.

Hurdle 4: Local Joins

An inner join is, at its heart, a cartesian product of the left and right sides. The default local implementation of this is a nested loop join (NLJ). However, if the join is followed by a filter on a predicate pred(left, right) (either in the ON clause or the WHERE clause), it’s often possible to make the join implementation more efficient than a NLJ.

A common case is an equijoin, when the predicate is of the form pred(left, right) -> left.a == right.b. Then for the local join we can construct a HashMap from the right (build) side, then look up possible right-side rows instead of looping through all right-side rows. Let's examine this in detail in order to build up the case for spatial joins.

index := {right.b: right for right in right_rows}
for left in left_rows:
  for right in index.lookup(left.a):
    if pred(left, right):
      yield (left, right)

Here we explicitly checked pred after looking up the rows, because it could be the case that pred(left, right) -> left.a == right.b and left.c < right.d, so the filtering predicate could be more selective than the equi-condition.

The conditions we need here are:

You can build an index of the build-side rows.
If pred(left, right) == true, then right will be in index.lookup(left).

The index will return a superset of all possible matches. The smaller the superset, the more efficient, but correctness is maintained for any superset. As a fun exercise, you can consider what would be a good index for a predicate like pred(left, right) -> strpos(right, left) == 1.

Spatial Joins are not equijoins. This is because:

Geometries do not have a consistent notion of hash. They cannot be distributed via hash or indexed in a HashMap for a local join.
Geometries do not even have an easy concept of "equal".
Joins are often on operations like intersects(g1, g2) or distance(g1, g2) <= X, not equality.

Thus, we cannot use equijoins. By default, local joins will be nested-loop-joins (NLJ). But there are indices that can work.

We use the same concept of index above, but for predicates like intersects(left.geom, right.geom). The Presto optimizer notices certain join spatial predicates, and optimizes the NLJ to a "Spatial Join". In a local spatial join, the build-side is indexed by an RTree, and probe geometries first retrieve all elements of the RTree whose bounding box intersects the probe's bounding box. Then the predicate is tested. (For Java, we built our own packed RTree which was much more efficient due to better data locality and branch predictions.)

For Velox, this would require a new operator. Perhaps we could abstract the concept of index into something like an IndexedJoinOperator, and be able to plug in HashMaps, RTrees, or other indices as required. It’s important to be performance guided here: for build sides on the order of 100s or possibly even 1000s of geometries, iterating through a flattened array and checking bounding boxes is likely faster than building and querying an RTree.

Recommendation: Spatial joins are for efficiency and not correctness. Many spatial queries do not even use joins. We should build this once we have built the non-join capabilities. We start with simpler indices (flattened lists of bounding boxes) and make more advanced versions as required. We can parallelize these efforts with enough resources.

Hurdle 5: Partitioned Joins

To run a join over multiple workers, the default method that maintains correctness is a broadcast join, where the entire build side is on each join worker. If the build-side does not fit into a single machine's memory, and the predicate allows, we can partition the build side across different machines.

Again, let's examine the equijoin case to understand what we need. If the predicate is pred(left, right) -> left.a == right.b, we know that hash(left.a) == hash(right.b) is true for any pair that matches the predicate. Thus, if we define partitions functions for left and right as part_left(left) -> hash(left.a) and part_right(right) = hash(right.b), then if we distribute each row (probe and build) to the worker based on its partition value, then if a pair matches the predicate they will be on the same worker, and the local join will be correct.

In general if we have a join with a predicate pred(left, right), to distribute the build side across workers requires two functions part_left(left) -> int32 and part_right(right) -> int32 such that pred(left, right) == true implies part_left(left) == part_right(right). We call part_left and part_right partition functions, and their value on a row a partition index. We can assign each row to a worker based on the partition index.

To partition geometries, Presto Java uses a KDB Tree (with K=2), with each cell assigned a partition index. Each geometry (probe or build) queries the KDB tree, and gets an index for each cell it intersects. Thus each geometry may have 0, 1, or many partitions. This is implemented by returning a list of indices and then UNNESTing the list, making one row for each spatial partition index. Geometries are then assigned to workers based on these partition indices. For Velox, we can consider wrapping the build-side array in a DictionaryArray and having the partition index as the dictionary lookup index.

Crucially, any non-Point geometry (either probe or build) may be sent to multiple workers. Thus, different workers may seem the same match. We use a clever heuristic¹ to deduplicate. This works for inner joins.

For outer joins, a worker that does not find a match does not know if other workers have also not found a match, so there's no way to emit the null-match consistently at this stage². For Presto Java, we only implemented this for outer joins when the non-outer side was a point, and thus must only go to one worker.

For Presto Java, supplying the spatial join was a challenge, and the implementation is not very elegant. But it is incredibly effective, allowing extremely large build sides. The API is a bit arcane, and hopefully we can improve on this.

Recommendation: Since Partitioned Joins require the local joins above, this is something that can be designed and implemented after local joins.

Hurdle 6: Projections and Geographies

Geometries and projections are analogous to naive datetimes and timezones, but much harder. A Geometry is a set of abstract x,y coordinates on an infinite plane. This is a valid concept (unlike a naive datetime), but most people are interested in coordinates that are points on Earth. The Earth is not a plane, or even a sphere, or even an ellipsoid. To translate between abstract coordinates and a physical location on earth requires two things: A choice of approximate ellipsoid for the Earth, and a Fundamental Point to anchor the coordinate system. This choice is called a datum, but colloquially we'll call it a projection. We call a Geometry with a projection a Geography.

The further you get from the fundamental point, the greater the error will be between your coordinates and the expected locations on earth. For surveying, country borders, and other applications requiring high precision, local datums are constructed using the local curvature of the earth. However, almost everyone uses the WGS84 datum, which is a "pretty good" approximation all over the earth, with the fundamental point corresponding to the intuitive 0° lat/0° lon. This is the analogue of UTC.

Many spatial calculations are different for geographies compared to geometries. For example, the distance between two points is a great-circle distance, not a Euclidean line. Areas of polygons are also different. These differences are relevant at country scale, and even smaller.

Two things that make projection much harder than timezones are:

Most projections are not valid very far from their fundamental point.
There is no closed-form analytic way to convert between projections.

Thus dealing with projections is very hard, and requires specialized software. Most people, when they use projections at all, use WGS84. Specialized users often use other projections.

Presto Java has a Geography type that uses the WGS84 projection, and functions that convert between Geometries and Geographies (this is a type change with no data change). It has geography specific functions (or implementation of functions), although it never implemented geographic variants of all geometric functions due to prioritization.

Recommendation: We add Geographies after Geometries, and only support WGS84. WGS84 was supported only be ESRI, for which we don't have an analogue, so we will need to research GDAL or other solutions.

Dependency DAG

Library and Standards Decision: No dependencies: can start now.
In-Memory Representation Decision: Depends on 1.
Validation/rollout procedure: Depends on 1.
Easy Bing Tile Functions: Depends on 3.
Geometry functions: Depends on 1 and 2 and 3.
Local Join: Depends on 1 and 2 and 3.
Partitioned Join: Depends on 1 and 2 and 3.
Geography/projection: Depends on 1 and 2 and 3 and 5.

Recommended path forward

The biggest strategic decisions are Library, Standards, and In-Memory Representation. Most work depends on these, so we should work on these decisions first. Next we should focus on an initial set of geometry functions to test and validate these choices, as well as establish a roll-out procedure.

Once these are done, we can parallelize work more effectively. Geometry functions can be added independently of each other, and work can also begin on local joins. After local joins, we should work on Partitioned Joins and Geographies in order of impact to effort.

Each worker checks the top-left of the intersection of the probe and build bounding boxes (which is itself a box). If that point falls within a cell of the KDB tree the worker is responsible for, the worker will yield the match; otherwise it will drop it. ↩
One could shuffle and deduplicate the outer-joined pairs in the following stage if the inner-side row is unique in some way. ↩

aditi-pandit · 2025-01-15T21:25:30Z

aditi-pandit
Jan 15, 2025
Collaborator

@jagill : Thanks for this detailed proposal. There is a lot of excitement about Geo-spatial in multiple groups including IBM and Uber in the community.

We all meet for regular discussions on Thursdays 11 am PST in the Native Worker working group forum https://lists.prestodb.io/g/native-worker-wg. Would you be able to join us next Thursday 1/23 to discuss the plan ?

2 replies

jagill Jan 15, 2025
Collaborator Author

That particular Thursday I have a conflict, but in general I could!

aditi-pandit Jan 15, 2025
Collaborator

Thanks @jagill. I'm open to moving the meeting around for your availability that week. If you slack me some timeslots then I can co-ordinate with the rest of the interested parties. Appreciate it.

pedroerp · 2025-01-16T00:46:30Z

pedroerp
Jan 16, 2025
Collaborator

Cc: @czentgr @majetideepak @jaystarshot @amitkdutta

0 replies

wraymo · 2025-01-21T15:48:34Z

wraymo
Jan 21, 2025

Hi, I have a draft PR to introduce basic spatial support (excluding joins). I’d appreciate it if you could take a look. Thanks!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposed plan for Geospatial Capabilities in Velox #12094

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Proposed plan for Geospatial Capabilities in Velox #12094

jagill Jan 15, 2025 Collaborator

TL;DR

Background: Why and why now?

Goals

Background: Geometries

Hurdle 1: Library choice

Hurdle 2: Standards choice

Hurdle 3: In-memory representation

Hurdle 4: Local Joins

Hurdle 5: Partitioned Joins

Hurdle 6: Projections and Geographies

Dependency DAG

Recommended path forward

Footnotes

Replies: 3 comments · 2 replies

aditi-pandit Jan 15, 2025 Collaborator

jagill Jan 15, 2025 Collaborator Author

aditi-pandit Jan 15, 2025 Collaborator

pedroerp Jan 16, 2025 Collaborator

wraymo Jan 21, 2025

jagill
Jan 15, 2025
Collaborator

Replies: 3 comments 2 replies

aditi-pandit
Jan 15, 2025
Collaborator

jagill Jan 15, 2025
Collaborator Author

aditi-pandit Jan 15, 2025
Collaborator

pedroerp
Jan 16, 2025
Collaborator

wraymo
Jan 21, 2025