-
Notifications
You must be signed in to change notification settings - Fork 363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion: Backing CoordinateSequence with custom data #868
Comments
No one is going to care if you internally fork the project. That's going to be your organization's cost to bear. Given that your organization is not a historically public contributor to the project, the odds of your making a public GEOS fork and sustaining it to any significant point of risk to the major momentum of the project is practically zero. Many folks have wanted to fork GEOS for this or that thing, but the reality of carrying it through as a project that attracts a contributor base is a lot more than a Github fork and some alternative PRs.
Way too much, honestly. It's a giant wall-o-text that articulates how GEOS doesn't satisfy your organization's ability to take complete and full advantage of it in the smoothest, lowest impedance way possible. The best way to influence the evolution of GEOS is to contribute code, tests, analysis, and demonstrations. GEOS developers can then react to items on an individual basis with concrete information. Generate social capital for yourself and your organization by contributing, and then you can work to make changes that you desire which are likely at the very least to spread complexity cost or change cost over other peoples' applications and usages. That said, there are GEOS users who have wanted some of the things you desire. Showing up with a 2000 word bulleted list of things you need to have happen to continue benefiting from GEOS' upgrade path is not the way to make them reality. The way to do it is by providing boiled down examples, explicit proposals as code that do not externally disrupt other applications, and RFCs as requested to implement your proposals. |
Proposal (1) sounds like it's covered by #747 which wasn't merged not because it's bad, but because I couldn't demonstrate a compelling advantage. If you are looking for a way to contribute, I'd suggest testing that branch as a starting point. While most of the GEOS algorithms are fundamentally 2D, most of the GEOS code base assumes it can access XYZ coordinates. If you pack XY coordinates such that what is read as Z is really the X of the next coordinate, you have the possibility not only of garbage values being read, but also a segfault on the last coordinate. But if you improve the GEOS algorithms you use so that they do not assume XYZ access, you should be able to do what you propose safely. To see what can be safely run on a packed XY sequence, just uncomment the "padding" define in |
Thank you both for your feedback. As is painfully clear, I'm not particularly experienced with trying to contribute to projects. You're right, Wall-O-Text was the wrong approach here, and I realize that now. My intent wasn't to just drop in and say "it's not working for us, please change it, kthxbye". I believe our use case is not wholly unique, and that it's likely relevant to anyone who may want to use GEOS alongside other similar libraries or under a language binding, even if our specific examples aren't relevant elsewhere. That may be a poor guess on my part, and I also don't know if that's a use case you (the steering committee) are particularly interested in; I should have just started there - I was too gung-ho with just assuming that you'd care, and that of course you'd want examples of how it's not working right now - I like talking about technical details, so I rambled about technical details. As Dan pointed out
That's absolutely correct, and both things that we've attempted to work around. If an algorithm is fundamentally 2D, I don't really care if it reads garbage data for the Z value, as long as it doesn't seg fault, unless that's going to cause the 2D algorithm to misbehave, which is something we haven't been experiencing. My question boils down to "what can we do to help make it so we don't have to work around this, but instead GEOS supports things like this?" I appreciate the suggestion to take a look at the "padding" define and look into improving the algorithms so as not to assume XYZ access. Thank you! I suppose a follow-up question is, in general, is that direction you want GEOS to go? In the direction of 2D algorithms really being just 2D, and not having the option of being 3D if the data passed is appropriate for it? |
We want whatever is appropriate and feasible for the algorithm, subject to development cycles. Some algorithms (e.g. buffer) are inherently 2D. Others can preserve Z if it's present (e.g constructive algorithms which utilize subsets of the input points such as Convex and Concave Hull). And there are some which actively try and interpolate new Z values (the overlay algorithms). That was the original intent of the GEOS/JTS implementation, anyway. That's why much of the code "assumes XYZ access". It usually doesn't do anything with the Z value except carry it along. Hopefully this pattern is still viable (@dbaston ?). |
In fact we're getting deeper into higher dimensionality and hoping to passively carry M coordinates along too, where they are. But that doesn't/shouldn't foreclose being able to ride 2D coordinate sequences on 2D buffers and have everything "just work", with appropriate clean-ups. |
Thank you all for the guidance. I see that #872 has been created, which continues the discussion of "what's needed to make things 2D-safe" - it sounds like that is likely the bulk of the work that would be necessary for us to drive. |
Preamble
Hey GEOS team! It's me again, the guy from #564.
I really appreciated the discussions that were had on that ticket, and that things were cleaned up as a result; however, the end conclusion has put me in a bit of an awkward situation. My team is relying on the functionality to inter-operate between GEOS and other libraries that was nixed after that; so we can't update GEOS past 3.10. I suspect that we could probably just keep using 3.10 for a long time, but I'd like to have the door open for us to update in the future if there's some crucial bugfix in an algorithm we use.
This ticket is to open a discussion about what types of things my team has been using GEOS for, and if we can help influence the C++ API in a direction that we can continue using newer GEOS versions in the future without having to fork the project.
The details I'm presenting here are specific to our use cases, but I suspect they're going to be relevant to discussions about inter-operation with other geometry libraries and/or language bindings, as our concerns are mainly around performance and being able to lend existing data to GEOS to do calculations without having to outright copy the data.
This relates to a number of other tickets that are ongoing; the most important one being #747. It's probably tangentially related to #704; and I personally would love the conclusion to ultimately be part of an official C++ API, as noted in #755.
Below I've outlined
More Formally: The Request
I would like to discuss the API and implementation of
CoordinateSequence
, specifically with respect to how it might be expanded to support better low-level memory management that would allow consumers of GEOS to lend their own data types and memory buffers to GEOS and similarly take ownership of memory away from GEOS.This is incentivized by our specific use cases, but is probably a valuable discussion for the general case of making GEOS flexible for use along-side other geometry libraries and as an underlying implementation for bindings in other languages.
Our Use Case
We've been building a domain-specific geometry and vector-math library for internal use - the goal is to have a base that we can build high level algorithms and rendering implementations on. The main use cases for this library include
Additionally, we care mainly about 2-dimensional data.
Also additionally our target runtime environment is an embedded ARM Linux system with under 1 GB total RAM.
GEOS is phenomenal for planar geometric data manipulation, and we chose to include GEOS in our implementation mainly for those algorithms; we've also found a lot of great uses for some of the data structures like STR-Tree. We made the conscious choice to use the unstable C++ API, and it's largely paid off. (Using
TemplateSTRTree
was a huge performance win for us)GEOS is less ideal for our other use cases. The core
Coordinate
type doesn't have much functionality in terms of vector math, nor is it particularly easy to efficiently use in conjunction with OpenGL for rendering. (At least, this was true in 2020, on version 3.9 - as GEOS moves toward better supporting 2D data, this is changing)To address our use cases, we also included the GLM library. GLM is great for high-performance vector math operations (which is nice when building things like path-planning algorithms), and it's designed for integration with OpenGL.
The public API of our library is built exclusively using GLM's
dvec2
,dvec3
, anddvec4
. This allows us to strictly define dimensionality and lets our consumers take advantage of all the targeted optimization work that GLM provides.Another major detail here is that we chose to use open polygon representations instead of the closed representation GEOS uses (i.e GEOS requires that closed loops use a duplicate point in the first and last index of the point list; we do not) - the reasons for this are largely so that we can adopt this library internally better and it's easier to work with for some of the algorithms we are implementing.
Proposals
These are the long-term resolutions that I see as possible solutions to our specific problem, ordered by how likely I think they are.
geos::geom::CoordinateSequence
becomes smart enough to do the following:CoordinateSequence
implementation that supports our needsgeos::geom::Coordinate
to become more of a general-purpose vector math primitiveBoost.Geometry
Boost.Geometry
for completeness
CoordinateSequence
that does the things it does, but without the overhead overvirtual
calls that prompted removal of the functionality in the first placeInterop Details
The public types of our library are internally backed by a
std::vector<glm::dvec2>
. When a user wants to perform some operation that requires to make a call to GEOS, we "convert" the underlying data to a customgeos::geom::CoordinateSequence
subtype, perform the operation, and then "convert" back.I put "convert" in quotes because we've taken care to avoid copying data into/out of GEOS types whenever possible - if we were doing data copies of thousands of points each time we had to do an intersection operation, that'd be unacceptable from a performance perspective. Instead, we just lend memory to our
CoordinateSequence
types and take memory ownership away from them later to avoid copies. We also use our customCoordinateSequence
subtype to convert between the open and closed loop representations.We have three types inheriting
geos::geom::CoordinateSequence
:GlmCoordinateSequence : public geos::geom::CoordinateSequence
MutableGlmCoordinateSequence : public GlmCoordinateSequence
MaxCapacityGlmCoordinateSequence : public GlmCoordinateSequence
We've also got one single
GlmCoordinateSequenceFactory : public geos::geom::CoordinateSequenceFactory
.GlmCoordinateSequence
GlmCoordinateSequence
can be described as a "borrowing, read-only, GLM-backedCoordinateSequence
".This type behaves similarly to
CoordinateSequence
does when usingGEOSCoordSeq_createFromBuffer_r
introduced in pull request #747; however, unlike that use case, it cannot grow as needed - it's strictly read-only.GlmCoordinateSequence
holds a pointer to an existingstd::vector<glm::dvec2>
, and returns the data from within it when requested by casting togeos::geom::Coordinate
. I'll talk later about 2D vs. 3D, and memory layouts.MutableGlmCoordinateSequence
MutableGlmCoordinateSequence
can be described as an "owning, read-write, GLM-backedCoordinateSequence
".This type is used any time
GEOS
deigns to create a newCoordinateSequence
(for example, during a union operation whenCoordinateSequenceFactory::create
is invoked). WhenMutableGlmCoordinateSequence
is created from astd::vector<geos::geom::Coordinate>
, we copy all of the data to thestd::vector<glm::dvec2>
that it owns. Then, likeGlmCoordinateSequence
, we convert that backing data togeos::geom::Coordinate
references as-needed. Later, we can take ownership of that underlying data away from theMutableGlmCoordinateSequence
and give it to our public API data types.MaxCapacityGlmCoordinateSequence
MaxCapacityGlmCoordinateSequence
is a special edge-case. It helps us deal with the fact that GEOS uses closed loops and we use open loops, and it prevents accessing un-owned memory becausegeos::geom::Coordinate
is 3D, and our data is 2D. LikeGlmCoordinateSequence
, it's read-only and doesn't grow.In many cases, the
std::vector<glm::dvec2>
owned by aGlmCoordinateSequence
has a capacity larger than the actual contents. In these cases, we cheat and access past-the-end memory which is still within the bounds which were allocated by the underlyingstd::vector
. Then when we provide access to GEOS, we represent the data as one larger than thestd::vector
would report.When the underlying
std::vector
does not have a sufficientcapacity()
to use this trick, we useMaxCapacityGlmCoordinateSequence
because in this case, there's nowhere that we can duplicate the 0th point to, and if we were to cast the last point to ageos::geom::Coordinate
, then thez
value would be un-owned memory. (We could probably get away with just callingreserve()
to increase the capacity, but we've tried to avoid having to copy the entire data block as much as possible; and we weren't entirely sure what sort of guarantees consumers might want about iterator invalidation).MaxCapacityGlmCoordinateSequence
solves the issue by containing an extrastd::array<glm::dvec2, 3>
which can be used to store the final point of the open loop, a duplicate of the first point, and some scratch memory that's safe to access so we can cast a 2D object to a 3D object.Undefined Behavior and Nasal Demons
So at this point you might be thinking "how in the world is any of this safe??"
Here's some important assumptions:
geos::geom::Coordinate
matches that ofglm::dvec2
(at least for the x and y values) and thatsizeof(geos::geom::Coordinate) <= sizeof(glm::dvec2)*2
So given these things, it's acceptable to
reinterpret_cast<*geos::geom::Coordinate>(&some_glm_dvec2)
, provided there is anotherglm::dvec2
allocated immediately after the object being cast. We make the assumption that GEOS doesn't try to write the z value (which, as far as we can tell, it doesn't. If it did, we'd expect it would route throughGlmCoordinateSequence::setAt
or a similar function, which weassert()
doesn't happen)Given all these assumptions, all our operations end up in one of three situations:
GlmCoordinateSequence
wherecapacity() >= size() + 1
for the underlying borrowedstd::vector<glm::dvec2>
MaxCapacityGlmCoordinateSequence
wherecapacity() < size() + 1
for the underlying borrowedstd::vector<glm::dvec2>
, so we use the extrastd::array<glm::dvec2, 3>
to hold the final borrowed point and the duplicated initial point.MutableGlmCoordinateSequence
throughCoordinateSequenceFactory
, and we were able to guarantee that it's got sufficient capacity to act likeGlmCoordinateSequence
, but it's also mutable so we don't have to worry aboutconst
propagation and/or thread-safety as much.There's still the consideration of "we're using type-punning, which is strictly invoking undefined-behavior". Right now, we only type-pun things through the libGEOS abi barrier when calls are going in/out of
geos.so
. This severely limits the assumptions the compiler can make, and mostly guarantees we can't run into crazy optimization bugs as a result. That, plus a wide gambit of unit tests, has given us confidence that our implementation functions as intended.There is a bit of trepidation on my team related to the recent changes in
CoordinateSequence
, which is effectively doing the same thing (type-punning a list ofdouble
s as the variousCoordinate
types) - the main concern being that it's entirely within the GEOS shared object, so the optimizer can get a lot more aggressive. We're not sure if there's anything that can be done about this - these types of things are typically necessary in high-performance applications. I think we're willing to say that if our unit tests are passing, and your unit tests are passing, it's OK; but I wanted to raise this point and see if it's something you've explicitly tried to address.Conclusion
I know, this is a lot; and, honestly, it would have probably been appropriate to bring this up around the time of my original ticket last year. But now felt like the next-best time, especially with ticket #747 going on, and we wanted to bring up what we've encountered and give some proposals that would enable us to continue using such an awesome library.
The text was updated successfully, but these errors were encountered: