-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmark #4
Comments
You're right! Thank you! I'm going to embark on a complete rewrite of rsgeo in the coming weeks. One thing I can do is make sure that all point distance and binary predicates first build an R* tree first. Although that may be a bit unnecessary for when there are not enough geometries to justify it. What do you think? Always build the tree or optionally or have separate functions for it? |
Note that while there aren't "prepared" geometries we can do the work manually on the Rust side by creating the tree and inserting into it. |
suggestion to make the
with an |
Maybe optionally? In BTW1: Will tree building help anything in calculating the distance matrix? I think it is brute calculating the distance between all points? BTW2: I am not expert, but I understand that you are proposing to implement spatial index and not prepared geometry? I understand that these are two separate things? I rely on this post from Edzer: https://r-spatial.org/r/2017/06/22/spatial-index.html |
Re slowness of euclidean distance matrix I think this is a result of two things:
There are two matches per iteration which is quite a lot. I think this can be helped when https://github.com/georust/geo/pull/1029/files/2709d7f841af05551e1d065403a853dbe5fdde70 is merged. Re prepared geometries (it's suprising that there is like, 0 documentation or academic papers easily found and its all source code and probably github reviews), my understanding prepared geometries comes largely from geo-rust community and discussion georust/geo#803 is that you create a spatial index upfront (this is the preparation). Then the index can be used in subsequent calls etc to make narrowing down candidates far faster. |
@Robinlovelace, little offtopic, but I tried Benchmarklibrary("sf")
library("sfheaders")
poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 500000
pts_sf = st_sample(poly_sf, n)
t = bench::mark(
check = FALSE, iterations = 10,
sf = st_coordinates(pts_sf),
sfheaders = sfc_to_df(pts_sf)
)
t[, 1:5]
#> expression min median `itr/sec` mem_alloc
#> 1 sf 8.69ms 11.2ms 45.1 15.3MB
#> 2 sfheaders 145.88ms 152.1ms 6.61 21MB |
Very interesting! I've found sfheaders v. fast for some things but clearly not all. Thanks for all these benchmarks, really helpful if not essential for working out what to use when speed's an issue! |
This should be vastly improved in the new refactor branch. I just pushed a commit to parallelize the distance matrix. It's not complete yet but the calculation portion is there. Just need to move from a list to a matrix. a2ab18a#diff-ba040067b83f26349efd96faedb319914955849e8949c3a157330a1c0d895dae |
The one to compare against is |
It's definitely going to remain faster for unioning. I think there will be a minor speed up compared to your existing benchmark but geos is very good here 😛 |
BTW @Robinlovelace, isn't the |
Correct, although as we're doing a union the results shouldn't differ too much, right (although I did notice several differences...)? Benchmark adaptations incoming! |
A note on the distance matrix calculations. Rfast calculates a distance matrix between a vector and itself. sf can do one or the other. When you calculate a distance matrix where y = x you only need to calculate one of the triangles and the diagonal is always 0. When the other vector is not identical you do not have symmetry and cannot shortcut the calculation. When we pass in a value of y for As of right now Edit: adding bench mark for Rfast and new distance matrix. When comparing apples to apples they are on par with eachother. library("sf")
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
set.seed(1)
poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 5000
pts_sf = st_sample(poly_sf, n)
system.time(st_distance(pts_sf))
#> user system elapsed
#> 0.671 0.094 0.770
system.time(st_distance(pts_sf, rev(pts_sf)))
#> user system elapsed
#> 6.197 0.064 6.278
library(rsgeo)
x <- rsgeo:::from_sfc(pts_sf)
system.time(distance_euclidean_matrix(x, rev(x)))
#> user system elapsed
#> 0.348 0.168 0.321
crds <- coords(x)
c2 <- coords(rev(x))
system.time(Rfast::dista(crds, c2))
#> user system elapsed
#> 0.228 0.034 0.266
bench::mark(
Rfast = Rfast::dista(crds, c2),
rsgeo = distance_euclidean_matrix(x, rev(x)),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 Rfast 263ms 266ms 3.76 382MB 7.51
#> 2 rsgeo 215ms 252ms 3.97 191MB 1.99 |
Two years ago I created the benchmark testset comparing packages for processing vector data (
sf
,terra
,geopandas
,geos
,s2
), and now I'm naturally interested in howrsgeo
looks like (and maybe @Robinlovelace too 😃).rsgeo
is a low-level interface, so the number of operations is limited -- I only tested the calculation of Euclidean distance between points and geometry intersects (it's pity that buffering is not originally implemented). My conclusions:sf
. However, for a larger dataset, the difference is about 30% (or another reason, the results can be outdated). It seems to me that here the best solution is to transform the geometry to a data frame and use C++ functions (such asRfast::Dist()
).rustgeo
doesn't support prepared geometries unlikegeos
, so there are big differences here.Benchmark
The text was updated successfully, but these errors were encountered: