Replies: 3 comments 2 replies
-
@jagill : Thanks for this detailed proposal. There is a lot of excitement about Geo-spatial in multiple groups including IBM and Uber in the community. We all meet for regular discussions on Thursdays 11 am PST in the Native Worker working group forum https://lists.prestodb.io/g/native-worker-wg. Would you be able to join us next Thursday 1/23 to discuss the plan ? |
Beta Was this translation helpful? Give feedback.
2 replies
-
Hi, I have a draft PR to introduce basic spatial support (excluding joins). I’d appreciate it if you could take a look. Thanks! |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
TL;DR
A necessary condition to deprecate the Presto Java execution stack and cluster is to add geospatial capabilities to Velox. We call this GeoVelox. The capabilities are difficult and coupled, and there are several significant decisions and steps that we need to take on the way.
Background: Why and why now?
Presto Java has a robust set of geospatial capabilities which are critical for doing geometric, geospatial, or GIS-type analyses. These features constitute the majority of the feature gap between the Java execution stack and Velox. We cannot deprecate the Java execution stack until we implement the features in Velox and still support these customers.
While we have two stacks we:
Unlike most missing capabilities, geospatial features are tightly coupled: strategic decisions about third-party library use, new types, in-memory representation, serialized representation, and more all have to be made before almost any capabilities can be added. More advanced capabilities like Spatial Joins (local and partitioned) require new operators, and these are also tightly coupled with strategic choices.
Goals
The overall goal is to remove geospatial capabilities as a blocker to Java execution stack deprecation, and do it in a way that maintains (or improves) the customer experience. To do that, we use the specific goals of this project. In order of priority, they are:
Background: Geometries
A Geometry represents a set of points on the Euclidean plane. It is expressed as a finite set of coordinates of the form
(x, y)
(we ignorez
andm
coordinates for now). For extended geometries, these coordinates are the vertices of polygons or places where a path changes direction.The general standard (Simple Feature access) for Geometries defines 7 types, including 3 basic types:
There are also 3 "multi" types:
There is also the GeometryCollection, a collection of any geometries (including GeometryCollection). While some specifications view these as 7 distinct types, for Presto Java we just have the Geometry type that may be any of these: which they are is determined by accessor functions instead of type information. I recommend that we do the same in GeoVelox.
Any of the above can be empty, so there are separate 'Point Empty' and 'MultiPolygon Empty'; these are considered members of their respective types. Note that Empty types are not equivalent to NULL, in the same way that 0 or the empty string are not equivalent to NULL. Empty types are an existent, valid Geometry that contains no points.
Geometries have many basic binary predicates, like
intersects(g1: Geometry, g2: Geometry): bool
(gory details). They also have binary operations, likeintersection(g1: Geometry, g2: Geometry): Geometry
.One important binary operation is
equals(g1: Geometry, g2: Geometry): Geometry
, which means the two geometries contain the same points in the plane. Since geometries can have different representations with their coordinate arrays, this is an expensive operation. Further, due to issues of numerical stability, it is common to modify the geometry slightly to make the calculation easier, which means that as a practical matter the results may be different than expected. From painful experience, we consider Geometries to not have a natural=
operation, so they can’t be used in GROUP BY clauses, IN clauses, etc. Since resolvingequals(g1, g2)
can take literal minutes, and it may give unexpected false results, overloadingg1 = g2
to useequals(g1, g2)
is a dangerous footgun.Hurdle 1: Library choice
With the allocated resources, we will have to rely on third-party libraries to do the actual geometric calculations (e.g., the intersection of two geometries).
For the Java execution stack, we use two libraries: ESRI and JTS. We use two libraries because they have slightly different capabilities, but they also have different definitions of geometry (both at the type level and in-memory representation) and different semantics. An effort to fully deprecate ESRI for JTS ran out of time/resources, but this is a source of technical debt.
For C++, there is one fundamental library: GEOS. ESRI has a C++ library but it is closed-source. GEOS is developed in parallel with JTS, with a goal of feature parity and consistent semantics. However, it does not have the capabilities in ESRI that are missing in JTS, particularly with regards to projections and Geographies (see section below).
GDAL is a geometry library that wraps GEOS and adds capabilities for projections. GDAL has a significantly higher operational burden (getting it installed on a machine takes a day of bespoke work), but has a rich set of projections (see below for more discussion on projections).
One thing is inescapable: We do not have directly equivalent C++ libraries for all of our capabilities. We will have at least slightly different semantics for some of our operations.
For GEOS, we can either target the C API or the C++ API. The C API is stable (including ABI), while the C++ API/ABI is not considered stable. Using the C-API requires manual memory management. From the GEOS site:
Recommendation: Evaluate and use GEOS. Evaluate the pros/cons of C vs C++ API. Defer decision on GDAL to when we add Geographies. Check that library licenses are compatible with that of Velox.
Hurdle 2: Standards choice
There are two standards for geometric calculation: OGC and ISO SQL3 Spatial. These two are mostly compatible, with some minor different choices.
OGC:
ISO:
For Presto Java, we target ISO but the OGC-specific choices of JTS leak through.
Recommendation: We adopt ISO SQL3 Spatial as a goal, either working around GEOS/JTS choices or being explicit where we deviate from ISO.
Hurdle 3: In-memory representation
Presto Java uses an ESRI Shapefile format to serialize geometries between workers, with its own parser for Shapefile-to-JTS conversion. This is very expensive: an earlier analysis we did suggested that up to 70% of the CPU cost of geometry computation was in serde from Presto blocks to geometries. In particular, in the common case when users nest functions the geometry is deserialized and reserialized between each function call.
One alternative is to express a geometry as a
Row
-type, with a type tag andList<x>
andList<y>
coordinate subfields, and then use views into these to construct the GEOS geometries. If we can make this 0-copy (or O(1)-copy), we can realize significant efficiency gains. The open source community has been active in figuring out these representations, for example GeoArrow; we hope to leverage as much of their work as possible. One risk is the ability to represent GeometryCollections, which themselves may have GeometryCollections: it is not clear how to represent this in a flattened structure.Recommendation: Verify that the inter-operator and inter-worker serialization formats are transparent to the user. If so, we can start with the lowest-lift version (WKB, Shapefile, etc) then evaluate a more efficient representation. The GEOS C or C++ API might have different possibilities here. If the format is not transparent, we should explore which format we should use.
Hurdle 4: Local Joins
An inner join is, at its heart, a cartesian product of the left and right sides. The default local implementation of this is a nested loop join (NLJ). However, if the join is followed by a filter on a predicate
pred(left, right)
(either in the ON clause or the WHERE clause), it’s often possible to make the join implementation more efficient than a NLJ.A common case is an equijoin, when the predicate is of the form
pred(left, right) -> left.a == right.b
. Then for the local join we can construct a HashMap from the right (build) side, then look up possible right-side rows instead of looping through all right-side rows. Let's examine this in detail in order to build up the case for spatial joins.Here we explicitly checked
pred
after looking up the rows, because it could be the case thatpred(left, right) -> left.a == right.b and left.c < right.d
, so the filtering predicate could be more selective than the equi-condition.The conditions we need here are:
pred(left, right) == true
, then right will be inindex.lookup(left)
.The index will return a superset of all possible matches. The smaller the superset, the more efficient, but correctness is maintained for any superset. As a fun exercise, you can consider what would be a good index for a predicate like
pred(left, right) -> strpos(right, left) == 1
.Spatial Joins are not equijoins. This is because:
intersects(g1, g2)
ordistance(g1, g2) <= X
, not equality.Thus, we cannot use equijoins. By default, local joins will be nested-loop-joins (NLJ). But there are indices that can work.
We use the same concept of index above, but for predicates like
intersects(left.geom, right.geom)
. The Presto optimizer notices certain join spatial predicates, and optimizes the NLJ to a "Spatial Join". In a local spatial join, the build-side is indexed by an RTree, and probe geometries first retrieve all elements of the RTree whose bounding box intersects the probe's bounding box. Then the predicate is tested. (For Java, we built our own packed RTree which was much more efficient due to better data locality and branch predictions.)For Velox, this would require a new operator. Perhaps we could abstract the concept of index into something like an IndexedJoinOperator, and be able to plug in HashMaps, RTrees, or other indices as required. It’s important to be performance guided here: for build sides on the order of 100s or possibly even 1000s of geometries, iterating through a flattened array and checking bounding boxes is likely faster than building and querying an RTree.
Recommendation: Spatial joins are for efficiency and not correctness. Many spatial queries do not even use joins. We should build this once we have built the non-join capabilities. We start with simpler indices (flattened lists of bounding boxes) and make more advanced versions as required. We can parallelize these efforts with enough resources.
Hurdle 5: Partitioned Joins
To run a join over multiple workers, the default method that maintains correctness is a broadcast join, where the entire build side is on each join worker. If the build-side does not fit into a single machine's memory, and the predicate allows, we can partition the build side across different machines.
Again, let's examine the equijoin case to understand what we need. If the predicate is
pred(left, right) -> left.a == right.b
, we know thathash(left.a) == hash(right.b)
is true for any pair that matches the predicate. Thus, if we define partitions functions for left and right aspart_left(left) -> hash(left.a)
andpart_right(right) = hash(right.b)
, then if we distribute each row (probe and build) to the worker based on its partition value, then if a pair matches the predicate they will be on the same worker, and the local join will be correct.In general if we have a join with a predicate
pred(left, right)
, to distribute the build side across workers requires two functionspart_left(left) -> int32
andpart_right(right) -> int32
such thatpred(left, right) == true implies part_left(left) == part_right(right)
. We callpart_left
andpart_right
partition functions, and their value on a row a partition index. We can assign each row to a worker based on the partition index.To partition geometries, Presto Java uses a KDB Tree (with K=2), with each cell assigned a partition index. Each geometry (probe or build) queries the KDB tree, and gets an index for each cell it intersects. Thus each geometry may have 0, 1, or many partitions. This is implemented by returning a list of indices and then UNNESTing the list, making one row for each spatial partition index. Geometries are then assigned to workers based on these partition indices. For Velox, we can consider wrapping the build-side array in a DictionaryArray and having the partition index as the dictionary lookup index.
Crucially, any non-Point geometry (either probe or build) may be sent to multiple workers. Thus, different workers may seem the same match. We use a clever heuristic1 to deduplicate. This works for inner joins.
For outer joins, a worker that does not find a match does not know if other workers have also not found a match, so there's no way to emit the null-match consistently at this stage2. For Presto Java, we only implemented this for outer joins when the non-outer side was a point, and thus must only go to one worker.
For Presto Java, supplying the spatial join was a challenge, and the implementation is not very elegant. But it is incredibly effective, allowing extremely large build sides. The API is a bit arcane, and hopefully we can improve on this.
Recommendation: Since Partitioned Joins require the local joins above, this is something that can be designed and implemented after local joins.
Hurdle 6: Projections and Geographies
Geometries and projections are analogous to naive datetimes and timezones, but much harder. A Geometry is a set of abstract
x,y
coordinates on an infinite plane. This is a valid concept (unlike a naive datetime), but most people are interested in coordinates that are points on Earth. The Earth is not a plane, or even a sphere, or even an ellipsoid. To translate between abstract coordinates and a physical location on earth requires two things: A choice of approximate ellipsoid for the Earth, and a Fundamental Point to anchor the coordinate system. This choice is called a datum, but colloquially we'll call it a projection. We call a Geometry with a projection a Geography.The further you get from the fundamental point, the greater the error will be between your coordinates and the expected locations on earth. For surveying, country borders, and other applications requiring high precision, local datums are constructed using the local curvature of the earth. However, almost everyone uses the WGS84 datum, which is a "pretty good" approximation all over the earth, with the fundamental point corresponding to the intuitive 0° lat/0° lon. This is the analogue of UTC.
Many spatial calculations are different for geographies compared to geometries. For example, the distance between two points is a great-circle distance, not a Euclidean line. Areas of polygons are also different. These differences are relevant at country scale, and even smaller.
Two things that make projection much harder than timezones are:
Thus dealing with projections is very hard, and requires specialized software. Most people, when they use projections at all, use WGS84. Specialized users often use other projections.
Presto Java has a Geography type that uses the WGS84 projection, and functions that convert between Geometries and Geographies (this is a type change with no data change). It has geography specific functions (or implementation of functions), although it never implemented geographic variants of all geometric functions due to prioritization.
Recommendation: We add Geographies after Geometries, and only support WGS84. WGS84 was supported only be ESRI, for which we don't have an analogue, so we will need to research GDAL or other solutions.
Dependency DAG
Recommended path forward
The biggest strategic decisions are Library, Standards, and In-Memory Representation. Most work depends on these, so we should work on these decisions first. Next we should focus on an initial set of geometry functions to test and validate these choices, as well as establish a roll-out procedure.
Once these are done, we can parallelize work more effectively. Geometry functions can be added independently of each other, and work can also begin on local joins. After local joins, we should work on Partitioned Joins and Geographies in order of impact to effort.
Footnotes
Each worker checks the top-left of the intersection of the probe and build bounding boxes (which is itself a box). If that point falls within a cell of the KDB tree the worker is responsible for, the worker will yield the match; otherwise it will drop it. ↩
One could shuffle and deduplicate the outer-joined pairs in the following stage if the inner-side row is unique in some way. ↩
Beta Was this translation helpful? Give feedback.
All reactions