-
Notifications
You must be signed in to change notification settings - Fork 7
Non linear running time for ogc:contains precomputation
The precomputation time of the ogc:contains currently grows non-linearly with the size of the data set, which shouldn't be the case. We are investigating the problem.
PBF files downloaded from https://download.geofabrik.de . The data for baden-wuerttemberg is around 4 times larger than that of freiburg-regbez. The data of germany is around 7.5 times larger than that of baden-wuerttemberg and around 30 times larger than that of freiburg-regbez. The time to compute the regular triples indeed grows linearly, as one would expect. But the time to compute the ogc:contains
triples grows super-linearly, and quite strongly so.
Dataset | nodes, ways, rels | regular triples | contains triples | ways-contained-in-area triples |
---|---|---|---|---|
freiburg-regbez | 0.6M, 1.9M, 37K | 0.5min | 0.4min | 0.25min |
baden-wuerttemberg | 2.2M, 8.0M, 118K | 1.5min | 17min | 14min |
germany | 16M, 59M, 742K | 12min | 400min | 300min |
Optionally, osm2rdf
can be made to output "statistic triples" for each way-in-area comparison done during the ogc:contains
precomputation, which can then be queried conveniently using SPARQL. An analysis of these numbers gives a very good picture of what is happening.
Dataset | #ways | #comp | #comp / way | time per comp | total time comp |
---|---|---|---|---|---|
freiburg-regbez | 1.9M | 2.5M | 1.4 | 14µs | 0.6min |
baden-wuerttemberg | 8.0M | 11M | 1.4 | 23µs | 4.3min |
germany | 59M | 83M | 1.4 | 130µs | 180min |
- Statistics #comp, time per comp, total time comp for the table above
- All way-in-area comparisons, one line per comparison
- All way-in-area comparisons, grouped by number of areas compared to for a way
- All way-in-area comparisons, grouped by comparison time in usecs
- All way-in-area comparisons, grouped by comparison time in usecs, rounded down to multiple of 10
- All way-in-area comparisons, grouped by area size (highest total time contributors first)
The number of way-in-area comparisons according to the statistic triples (column "#comp" in the table above) correlates well with the respective number reported in the osm2rdf
log. For freiburg-regbez
, the two numbers are identical, for baden-wuerttemberg
, there is a slight deviation. Note that this number comprises all comparisons between a way w and an area a such that (1) the bounding boxes of w and a overlap (and hence the pair is contained in the result of the R-tree query for areas potentially overlapping with w), and (2) there is no area a' that is contained in a (according to our contains-area DAG) and for which we already know that w is contained a' (and hence also in a).
One way is compared to around 1.4 areas on average, independent of the size of the data. In particular, most ways are compared to a single area. That shows that our approach works as intended. Also, most of the total comparison time is spent on comparisons of ways that are only compared against very few areas (see Query 3).
area size | #areas | #distinct | total time | area | all ways in area |
---|---|---|---|---|---|
81,447 | 440,664 | 1 | 59min | Jutland Peninsula | QLever query |
48,864 | 322,828 | 1 | 25min | Vorpommern | QLever query |
155,244 | 64,150 | 1 | 18min | Deutschland | QLever query |
4 | 204,808 | 187 | 13min | many |
The table above shows the four biggest contributors (when grouping by area size, measured in number of nodes) to the total time for the way-in-area comparisons. The three biggest contributors are single areas: these three use almost 60% of the total time. Only the fourth contributor is "normal" (many comparisons to a variety of small areas, with a moderate contribution to the total running time). See Query 6 above. After row 4 (not shown above, execute the query), there are many more single-area contributors, amounting for around a minute of the total time. These are larger regions like states (Bayern, Nordrhein-Westfalen) or others (Naturpark Bergisches Land, Umweltzone Ruhrgebiet).
Preprocessing time for "ways in areas" on ob
(AMD Ryzen 7, 128 GB, 16 cores), with default settings (0.015 for inner-outer-geoms
). The times are aggregated computing times (so 2 seconds each on 16 cores = 32 seconds aggregated time). The "time per core" in the last column is cimply the "total time" divided by the number of cores (which is 16).
Dataset | #ways | R-tree | intersect | contains | total time | time per core |
---|---|---|---|---|---|---|
freiburg-regbez | 1.9M | 2min | 3min | 0.7min | 7.1min | 0.4min |
baden-wuerttemberg | 8.0M | 4.6h | 1.9h | 0.2h | 7.1h | 0.4h |
germany | 59M | 96.0h | 35.8h | 4.3h | 141h | 8.8h |
We seem to have two independent culprits for the non-linear running time:
-
The average time for an R-tree lookup for a way grows with the number of ways. It should be constant or grow only very slightly. So something is wrong here. Note that we currently use Boost's R-tree in its standard configuration and we don't store IDs of objects in the R-tree, but the objects in their entirety.
-
Our
osm2rdf:contains_area
DAG misses some critical edges, for example, "Baden-Wurttemberg" is not contained in "Germany" although it should be. We wrote a simple program to check whetherboost::geometry::covered_by
is correct. Interestingly, the function shows strange behavior on Patrick's machine (for example, BW is not covered by BW, which is obviously wrong), but it works fine onamur
, where all the results above were produced (except the last table, which was produced onob
).
The current hope is that if we manage to fix both of these problems, the precomputation time for "intersects" and "contains" will be (almost) linear.