Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Backing CoordinateSequence with custom data #868

Closed
Ipiano opened this issue Apr 12, 2023 · 6 comments
Closed

Discussion: Backing CoordinateSequence with custom data #868

Ipiano opened this issue Apr 12, 2023 · 6 comments

Comments

@Ipiano
Copy link

Ipiano commented Apr 12, 2023

Preamble

Hey GEOS team! It's me again, the guy from #564.

I really appreciated the discussions that were had on that ticket, and that things were cleaned up as a result; however, the end conclusion has put me in a bit of an awkward situation. My team is relying on the functionality to inter-operate between GEOS and other libraries that was nixed after that; so we can't update GEOS past 3.10. I suspect that we could probably just keep using 3.10 for a long time, but I'd like to have the door open for us to update in the future if there's some crucial bugfix in an algorithm we use.

This ticket is to open a discussion about what types of things my team has been using GEOS for, and if we can help influence the C++ API in a direction that we can continue using newer GEOS versions in the future without having to fork the project.

The details I'm presenting here are specific to our use cases, but I suspect they're going to be relevant to discussions about inter-operation with other geometry libraries and/or language bindings, as our concerns are mainly around performance and being able to lend existing data to GEOS to do calculations without having to outright copy the data.

This relates to a number of other tickets that are ongoing; the most important one being #747. It's probably tangentially related to #704; and I personally would love the conclusion to ultimately be part of an official C++ API, as noted in #755.

Below I've outlined

  • What is our use case
  • Possible approaches to resolving this
  • Details of how we implemented things

More Formally: The Request

I would like to discuss the API and implementation of CoordinateSequence, specifically with respect to how it might be expanded to support better low-level memory management that would allow consumers of GEOS to lend their own data types and memory buffers to GEOS and similarly take ownership of memory away from GEOS.

This is incentivized by our specific use cases, but is probably a valuable discussion for the general case of making GEOS flexible for use along-side other geometry libraries and as an underlying implementation for bindings in other languages.

Our Use Case

We've been building a domain-specific geometry and vector-math library for internal use - the goal is to have a base that we can build high level algorithms and rendering implementations on. The main use cases for this library include

  • Geometric data representation and manipulation
  • Vector math calculations
  • Graphical display

Additionally, we care mainly about 2-dimensional data.

Also additionally our target runtime environment is an embedded ARM Linux system with under 1 GB total RAM.

GEOS is phenomenal for planar geometric data manipulation, and we chose to include GEOS in our implementation mainly for those algorithms; we've also found a lot of great uses for some of the data structures like STR-Tree. We made the conscious choice to use the unstable C++ API, and it's largely paid off. (Using TemplateSTRTree was a huge performance win for us)

GEOS is less ideal for our other use cases. The core Coordinate type doesn't have much functionality in terms of vector math, nor is it particularly easy to efficiently use in conjunction with OpenGL for rendering. (At least, this was true in 2020, on version 3.9 - as GEOS moves toward better supporting 2D data, this is changing)

To address our use cases, we also included the GLM library. GLM is great for high-performance vector math operations (which is nice when building things like path-planning algorithms), and it's designed for integration with OpenGL.

The public API of our library is built exclusively using GLM's dvec2, dvec3, and dvec4. This allows us to strictly define dimensionality and lets our consumers take advantage of all the targeted optimization work that GLM provides.

Another major detail here is that we chose to use open polygon representations instead of the closed representation GEOS uses (i.e GEOS requires that closed loops use a duplicate point in the first and last index of the point list; we do not) - the reasons for this are largely so that we can adopt this library internally better and it's easier to work with for some of the algorithms we are implementing.

Proposals

These are the long-term resolutions that I see as possible solutions to our specific problem, ordered by how likely I think they are.

  1. geos::geom::CoordinateSequence becomes smart enough to do the following:
    • Borrow a read-only memory buffer of 2D data
    • Release ownership of a memory buffer of 2D data that it owns
  2. 1, but additionally...
    • In the case that it borrows memory for a loop where the first/last points don't match, quietly paper over that detail so nothing else in GEOS has to care
  3. We fork GEOS and maintain our own CoordinateSequence implementation that supports our needs
    • Depending on how different it is from the official implementation, this might not be an unacceptable amount of overhead for us, but it would be kind of a pain
  4. 1 or 2, coupled with expanding geos::geom::Coordinate to become more of a general-purpose vector math primitive
    • This allows us to just drop GLM, which does simplify the scope of what we're trying to accomplish. We're ok with losing some of the specialized optimization that library provides in exchange for simplifying our library.
  5. We stop using GEOS
    • I'm not a fan of this - it's a lot of work for us and GEOS is awesome
    • Really the only other option for us would be to switch to Boost.Geometry
  6. GEOS becomes more template-y like Boost.Geometry
    • I don't seriously expect this - it's a major undertaking. It's more here
      for completeness
    • This allows us to continue having our own implementation of CoordinateSequence that does the things it does, but without the overhead over virtual calls that prompted removal of the functionality in the first place

Interop Details

The public types of our library are internally backed by a std::vector<glm::dvec2>. When a user wants to perform some operation that requires to make a call to GEOS, we "convert" the underlying data to a custom geos::geom::CoordinateSequence subtype, perform the operation, and then "convert" back.

I put "convert" in quotes because we've taken care to avoid copying data into/out of GEOS types whenever possible - if we were doing data copies of thousands of points each time we had to do an intersection operation, that'd be unacceptable from a performance perspective. Instead, we just lend memory to our CoordinateSequence types and take memory ownership away from them later to avoid copies. We also use our custom CoordinateSequence subtype to convert between the open and closed loop representations.

We have three types inheriting geos::geom::CoordinateSequence:

  • GlmCoordinateSequence : public geos::geom::CoordinateSequence
  • MutableGlmCoordinateSequence : public GlmCoordinateSequence
  • MaxCapacityGlmCoordinateSequence : public GlmCoordinateSequence

We've also got one single GlmCoordinateSequenceFactory : public geos::geom::CoordinateSequenceFactory.

GlmCoordinateSequence

GlmCoordinateSequence can be described as a "borrowing, read-only, GLM-backed CoordinateSequence".

This type behaves similarly to CoordinateSequence does when using GEOSCoordSeq_createFromBuffer_r introduced in pull request #747; however, unlike that use case, it cannot grow as needed - it's strictly read-only. GlmCoordinateSequence holds a pointer to an existing std::vector<glm::dvec2>, and returns the data from within it when requested by casting to geos::geom::Coordinate. I'll talk later about 2D vs. 3D, and memory layouts.

MutableGlmCoordinateSequence

MutableGlmCoordinateSequence can be described as an "owning, read-write, GLM-backed CoordinateSequence".

This type is used any time GEOS deigns to create a new CoordinateSequence (for example, during a union operation when CoordinateSequenceFactory::create is invoked). When MutableGlmCoordinateSequence is created from a std::vector<geos::geom::Coordinate>, we copy all of the data to the std::vector<glm::dvec2> that it owns. Then, like GlmCoordinateSequence, we convert that backing data to geos::geom::Coordinate references as-needed. Later, we can take ownership of that underlying data away from the MutableGlmCoordinateSequence and give it to our public API data types.

MaxCapacityGlmCoordinateSequence

MaxCapacityGlmCoordinateSequence is a special edge-case. It helps us deal with the fact that GEOS uses closed loops and we use open loops, and it prevents accessing un-owned memory because geos::geom::Coordinate is 3D, and our data is 2D. Like GlmCoordinateSequence, it's read-only and doesn't grow.

In many cases, the std::vector<glm::dvec2> owned by a GlmCoordinateSequence has a capacity larger than the actual contents. In these cases, we cheat and access past-the-end memory which is still within the bounds which were allocated by the underlying std::vector. Then when we provide access to GEOS, we represent the data as one larger than the std::vector would report.

When the underlying std::vector does not have a sufficient capacity() to use this trick, we use MaxCapacityGlmCoordinateSequence because in this case, there's nowhere that we can duplicate the 0th point to, and if we were to cast the last point to a geos::geom::Coordinate, then the z value would be un-owned memory. (We could probably get away with just calling reserve() to increase the capacity, but we've tried to avoid having to copy the entire data block as much as possible; and we weren't entirely sure what sort of guarantees consumers might want about iterator invalidation).

MaxCapacityGlmCoordinateSequence solves the issue by containing an extra std::array<glm::dvec2, 3> which can be used to store the final point of the open loop, a duplicate of the first point, and some scratch memory that's safe to access so we can cast a 2D object to a 3D object.

Undefined Behavior and Nasal Demons

So at this point you might be thinking "how in the world is any of this safe??"

Here's some important assumptions:

  • All our data is 2d
  • We only call GEOS functions which do 2d operations (as far as we can tell, this holds, and it's hard to define exactly what something like polygon union mean in 3d)
  • We compile-time checks that the memory layout of geos::geom::Coordinate matches that of glm::dvec2 (at least for the x and y values) and that sizeof(geos::geom::Coordinate) <= sizeof(glm::dvec2)*2

So given these things, it's acceptable to reinterpret_cast<*geos::geom::Coordinate>(&some_glm_dvec2), provided there is another glm::dvec2 allocated immediately after the object being cast. We make the assumption that GEOS doesn't try to write the z value (which, as far as we can tell, it doesn't. If it did, we'd expect it would route through GlmCoordinateSequence::setAt or a similar function, which we assert() doesn't happen)

Given all these assumptions, all our operations end up in one of three situations:

  1. We're passing read-only data to GEOS through a GlmCoordinateSequence where capacity() >= size() + 1 for the underlying borrowed std::vector<glm::dvec2>
  2. We're passing read-only data to GEOS through a MaxCapacityGlmCoordinateSequence where capacity() < size() + 1 for the underlying borrowed std::vector<glm::dvec2>, so we use the extra std::array<glm::dvec2, 3> to hold the final borrowed point and the duplicated initial point.
  3. GEOS created a MutableGlmCoordinateSequence through CoordinateSequenceFactory, and we were able to guarantee that it's got sufficient capacity to act like GlmCoordinateSequence, but it's also mutable so we don't have to worry about const propagation and/or thread-safety as much.

There's still the consideration of "we're using type-punning, which is strictly invoking undefined-behavior". Right now, we only type-pun things through the libGEOS abi barrier when calls are going in/out of geos.so. This severely limits the assumptions the compiler can make, and mostly guarantees we can't run into crazy optimization bugs as a result. That, plus a wide gambit of unit tests, has given us confidence that our implementation functions as intended.

There is a bit of trepidation on my team related to the recent changes in CoordinateSequence, which is effectively doing the same thing (type-punning a list of doubles as the various Coordinate types) - the main concern being that it's entirely within the GEOS shared object, so the optimizer can get a lot more aggressive. We're not sure if there's anything that can be done about this - these types of things are typically necessary in high-performance applications. I think we're willing to say that if our unit tests are passing, and your unit tests are passing, it's OK; but I wanted to raise this point and see if it's something you've explicitly tried to address.

Conclusion

I know, this is a lot; and, honestly, it would have probably been appropriate to bring this up around the time of my original ticket last year. But now felt like the next-best time, especially with ticket #747 going on, and we wanted to bring up what we've encountered and give some proposals that would enable us to continue using such an awesome library.

@hobu
Copy link
Member

hobu commented Apr 12, 2023

if we can help influence the C++ API in a direction that we can continue using newer GEOS versions in the future without having to fork the project.

No one is going to care if you internally fork the project. That's going to be your organization's cost to bear. Given that your organization is not a historically public contributor to the project, the odds of your making a public GEOS fork and sustaining it to any significant point of risk to the major momentum of the project is practically zero. Many folks have wanted to fork GEOS for this or that thing, but the reality of carrying it through as a project that attracts a contributor base is a lot more than a Github fork and some alternative PRs.

I know, this is a lot; and, honestly, it would have probably been appropriate to bring this up around the time of my original ticket last year.

Way too much, honestly. It's a giant wall-o-text that articulates how GEOS doesn't satisfy your organization's ability to take complete and full advantage of it in the smoothest, lowest impedance way possible.

The best way to influence the evolution of GEOS is to contribute code, tests, analysis, and demonstrations. GEOS developers can then react to items on an individual basis with concrete information. Generate social capital for yourself and your organization by contributing, and then you can work to make changes that you desire which are likely at the very least to spread complexity cost or change cost over other peoples' applications and usages.

That said, there are GEOS users who have wanted some of the things you desire. Showing up with a 2000 word bulleted list of things you need to have happen to continue benefiting from GEOS' upgrade path is not the way to make them reality. The way to do it is by providing boiled down examples, explicit proposals as code that do not externally disrupt other applications, and RFCs as requested to implement your proposals.

@dbaston
Copy link
Member

dbaston commented Apr 12, 2023

Proposal (1) sounds like it's covered by #747 which wasn't merged not because it's bad, but because I couldn't demonstrate a compelling advantage. If you are looking for a way to contribute, I'd suggest testing that branch as a starting point.

While most of the GEOS algorithms are fundamentally 2D, most of the GEOS code base assumes it can access XYZ coordinates. If you pack XY coordinates such that what is read as Z is really the X of the next coordinate, you have the possibility not only of garbage values being read, but also a segfault on the last coordinate. But if you improve the GEOS algorithms you use so that they do not assume XYZ access, you should be able to do what you propose safely. To see what can be safely run on a packed XY sequence, just uncomment the "padding" define in CoordinateSequence and run the test suite.

@Ipiano
Copy link
Author

Ipiano commented Apr 12, 2023

Thank you both for your feedback. As is painfully clear, I'm not particularly experienced with trying to contribute to projects. You're right, Wall-O-Text was the wrong approach here, and I realize that now.

My intent wasn't to just drop in and say "it's not working for us, please change it, kthxbye". I believe our use case is not wholly unique, and that it's likely relevant to anyone who may want to use GEOS alongside other similar libraries or under a language binding, even if our specific examples aren't relevant elsewhere.

That may be a poor guess on my part, and I also don't know if that's a use case you (the steering committee) are particularly interested in; I should have just started there - I was too gung-ho with just assuming that you'd care, and that of course you'd want examples of how it's not working right now - I like talking about technical details, so I rambled about technical details.

As Dan pointed out

you have the possibility not only of garbage values being read, but also a segfault on the last coordinate

That's absolutely correct, and both things that we've attempted to work around. If an algorithm is fundamentally 2D, I don't really care if it reads garbage data for the Z value, as long as it doesn't seg fault, unless that's going to cause the 2D algorithm to misbehave, which is something we haven't been experiencing. My question boils down to "what can we do to help make it so we don't have to work around this, but instead GEOS supports things like this?"

I appreciate the suggestion to take a look at the "padding" define and look into improving the algorithms so as not to assume XYZ access. Thank you! I suppose a follow-up question is, in general, is that direction you want GEOS to go? In the direction of 2D algorithms really being just 2D, and not having the option of being 3D if the data passed is appropriate for it?

@dr-jts
Copy link
Contributor

dr-jts commented Apr 13, 2023

is that direction you want GEOS to go? In the direction of 2D algorithms really being just 2D, and not having the option of being 3D if the data passed is appropriate for it?

We want whatever is appropriate and feasible for the algorithm, subject to development cycles. Some algorithms (e.g. buffer) are inherently 2D. Others can preserve Z if it's present (e.g constructive algorithms which utilize subsets of the input points such as Convex and Concave Hull). And there are some which actively try and interpolate new Z values (the overlay algorithms).

That was the original intent of the GEOS/JTS implementation, anyway. That's why much of the code "assumes XYZ access". It usually doesn't do anything with the Z value except carry it along. Hopefully this pattern is still viable (@dbaston ?).

@pramsey
Copy link
Member

pramsey commented Apr 13, 2023

In fact we're getting deeper into higher dimensionality and hoping to passively carry M coordinates along too, where they are. But that doesn't/shouldn't foreclose being able to ride 2D coordinate sequences on 2D buffers and have everything "just work", with appropriate clean-ups.

@Ipiano
Copy link
Author

Ipiano commented Apr 19, 2023

Thank you all for the guidance. I see that #872 has been created, which continues the discussion of "what's needed to make things 2D-safe" - it sounds like that is likely the bulk of the work that would be necessary for us to drive.

@Ipiano Ipiano closed this as completed Apr 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants