Benchmark #4

kadyb · 2023-07-10T21:08:16Z

Two years ago I created the benchmark testset comparing packages for processing vector data (sf, terra, geopandas, geos, s2), and now I'm naturally interested in how rsgeo looks like (and maybe @Robinlovelace too 😃). rsgeo is a low-level interface, so the number of operations is limited -- I only tested the calculation of Euclidean distance between points and geometry intersects (it's pity that buffering is not originally implemented). My conclusions:

In your comparison in README, the calculation of Euclidean distance is more than 10 times faster than in sf. However, for a larger dataset, the difference is about 30% (or another reason, the results can be outdated). It seems to me that here the best solution is to transform the geometry to a data frame and use C++ functions (such as Rfast::Dist()).
I see that rustgeo doesn't support prepared geometries unlike geos, so there are big differences here.

Benchmark

library("sf")
library("geos")
library("rsgeo")
library("Rfast") # for fast Euclidean distance
set.seed(1)

poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 300000
pts_sf = st_sample(poly_sf, n)

pts_rs = as_rsgeom(pts_sf)
pts_geos = as_geos_geometry(pts_sf)
poly_rs = as_rsgeom(poly_sf)
poly_geos = as_geos_geometry(poly_sf)

### distance between points ----------------------------------------------------
system.time(st_distance(pts_sf[1:5000], which = "Euclidean"))
#>  user  system elapsed
#> 1.817   0.519   2.338
system.time(euclidean_distance_matrix(pts_rs[1:5000], pts_rs[1:5000]))
#>  user  system elapsed
#> 1.526   0.320   1.848
system.time({
  crds = st_coordinates(pts_sf[1:5000])
  Rfast::Dist(crds, method = "euclidean")
})
#>  user  system elapsed
#> 0.589   0.120   0.707

### geometry intersects --------------------------------------------------------
system.time(intersects_sparse(pts_rs, poly_rs))
#>  user  system elapsed
#> 1.415   0.072   1.486

system.time(geos_intersects_matrix(pts_geos, poly_geos))
#>  user  system elapsed
#> 0.105   0.000   0.105

The text was updated successfully, but these errors were encountered:

JosiahParry · 2023-07-10T22:36:58Z

You're right! Thank you! I'm going to embark on a complete rewrite of rsgeo in the coming weeks. One thing I can do is make sure that all point distance and binary predicates first build an R* tree first.

Although that may be a bit unnecessary for when there are not enough geometries to justify it.

What do you think? Always build the tree or optionally or have separate functions for it?

JosiahParry · 2023-07-10T22:57:27Z

Note that while there aren't "prepared" geometries we can do the work manually on the Rust side by creating the tree and inserting into it.

Robinlovelace · 2023-07-11T06:28:29Z

suggestion to make the Rfast one even faster: replace

  crds = st_coordinates(pts_sf[1:5000])

with an sfheaders alternative...

kadyb · 2023-07-11T08:09:19Z

What do you think? Always build the tree or optionally or have separate functions for it?

Maybe optionally? In {sf} there is the prepared argument. In {terra} it works automatically, but you can use a C++ reference to do it manually (like x@ptr$relate_between()).

BTW1: Will tree building help anything in calculating the distance matrix? I think it is brute calculating the distance between all points?

BTW2: I am not expert, but I understand that you are proposing to implement spatial index and not prepared geometry? I understand that these are two separate things? I rely on this post from Edzer: https://r-spatial.org/r/2017/06/22/spatial-index.html

JosiahParry · 2023-07-16T20:26:12Z

Re slowness of euclidean distance matrix I think this is a result of two things:

excessive use of cloning

rsgeo/src/rust/src/distance.rs

Line 27 in 565e7ae

.map(|y| euclidean_distance_impl(x.geom.clone(), &y.geom))
excessive matching

rsgeo/src/rust/src/distance.rs

Line 36 in 565e7ae

match x {

There are two matches per iteration which is quite a lot. I think this can be helped when https://github.com/georust/geo/pull/1029/files/2709d7f841af05551e1d065403a853dbe5fdde70 is merged.

Re prepared geometries (it's suprising that there is like, 0 documentation or academic papers easily found and its all source code and probably github reviews), my understanding prepared geometries comes largely from geo-rust community and discussion georust/geo#803 is that you create a spatial index upfront (this is the preparation). Then the index can be used in subsequent calls etc to make narrowing down candidates far faster.

kadyb · 2023-07-18T20:41:19Z

@Robinlovelace, little offtopic, but I tried {sfheaders} to extract coordinates from geometry and it is slower than {sf}. However, it doesn't really matter because this operation has little overhead. The second thing, maybe {sfheaders} shouldn't return two columns (i.e. sfg_id, point_id) -- they are rather unnecessary for points.

Benchmark

library("sf")
library("sfheaders")

poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)
n = 500000
pts_sf = st_sample(poly_sf, n)

t = bench::mark(
  check = FALSE, iterations = 10,
  sf = st_coordinates(pts_sf),
  sfheaders = sfc_to_df(pts_sf)
)
t[, 1:5]
#>   expression      min   median `itr/sec` mem_alloc
#> 1 sf           8.69ms   11.2ms     45.1     15.3MB
#> 2 sfheaders  145.88ms  152.1ms      6.61      21MB

Robinlovelace · 2023-07-18T20:53:14Z

Very interesting! I've found sfheaders v. fast for some things but clearly not all. Thanks for all these benchmarks, really helpful if not essential for working out what to use when speed's an issue!

JosiahParry · 2023-08-25T02:25:06Z

This should be vastly improved in the new refactor branch. I just pushed a commit to parallelize the distance matrix.

It's not complete yet but the calculation portion is there. Just need to move from a list to a matrix.

a2ab18a#diff-ba040067b83f26349efd96faedb319914955849e8949c3a157330a1c0d895dae

Robinlovelace · 2023-08-25T04:51:39Z

The one to compare against is geos, faster than rsgeo for union in mini benchmark I put together: https://ogh23.robinlovelace.net/tidy#rsgeo

JosiahParry · 2023-08-25T11:20:24Z

It's definitely going to remain faster for unioning. I think there will be a minor speed up compared to your existing benchmark but geos is very good here 😛

kadyb · 2023-08-25T12:52:00Z

BTW @Robinlovelace, isn't the world object CRS WGS84? Then you are comparing the s2 spherical engine in {sf} with planar in {geos} and {rsgeo}. From my observations, in general GEOS is faster than s2. In this case, you can also use the argument is_coverage = TRUE and it will be even faster.

Robinlovelace · 2023-08-25T13:09:31Z

BTW @Robinlovelace, isn't the world object CRS WGS84?

Correct, although as we're doing a union the results shouldn't differ too much, right (although I did notice several differences...)? Benchmark adaptations incoming!

JosiahParry · 2023-08-25T18:28:24Z

A note on the distance matrix calculations. Rfast calculates a distance matrix between a vector and itself. sf can do one or the other. When you calculate a distance matrix where y = x you only need to calculate one of the triangles and the diagonal is always 0. When the other vector is not identical you do not have symmetry and cannot shortcut the calculation. When we pass in a value of y for st_distance() the computation is so much slower. And to compare to Rfast we have to use dista().

As of right now rsgeo always computes the full dense matrix. I can and should add an enhancement to check if x is equal to y and short cut it.

Edit: adding bench mark for Rfast and new distance matrix. When comparing apples to apples they are on par with eachother.

library("sf")
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
set.seed(1)

poly_sf = st_bbox(c(xmin = 0, xmax = 1, ymax = 0, ymin = 1), crs = st_crs(2180))
poly_sf = st_as_sfc(poly_sf)

n = 5000
pts_sf = st_sample(poly_sf, n)

system.time(st_distance(pts_sf))
#>    user  system elapsed 
#>   0.671   0.094   0.770
system.time(st_distance(pts_sf, rev(pts_sf)))
#>    user  system elapsed 
#>   6.197   0.064   6.278

library(rsgeo)

x <- rsgeo:::from_sfc(pts_sf)

system.time(distance_euclidean_matrix(x, rev(x)))
#>    user  system elapsed 
#>   0.348   0.168   0.321

crds <- coords(x)
c2 <- coords(rev(x))

system.time(Rfast::dista(crds, c2))
#>    user  system elapsed 
#>   0.228   0.034   0.266


bench::mark(
  Rfast = Rfast::dista(crds, c2),
  rsgeo = distance_euclidean_matrix(x, rev(x)),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 Rfast         263ms    266ms      3.76     382MB     7.51
#> 2 rsgeo         215ms    252ms      3.97     191MB     1.99

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark #4

Benchmark #4

kadyb commented Jul 10, 2023

JosiahParry commented Jul 10, 2023

JosiahParry commented Jul 10, 2023

Robinlovelace commented Jul 11, 2023

kadyb commented Jul 11, 2023

JosiahParry commented Jul 16, 2023

kadyb commented Jul 18, 2023

Robinlovelace commented Jul 18, 2023

JosiahParry commented Aug 25, 2023 •

edited

Loading

Robinlovelace commented Aug 25, 2023

JosiahParry commented Aug 25, 2023

kadyb commented Aug 25, 2023

Robinlovelace commented Aug 25, 2023

JosiahParry commented Aug 25, 2023 •

edited

Loading

Benchmark #4

Benchmark #4

Comments

kadyb commented Jul 10, 2023

JosiahParry commented Jul 10, 2023

JosiahParry commented Jul 10, 2023

Robinlovelace commented Jul 11, 2023

kadyb commented Jul 11, 2023

JosiahParry commented Jul 16, 2023

kadyb commented Jul 18, 2023

Robinlovelace commented Jul 18, 2023

JosiahParry commented Aug 25, 2023 • edited Loading

Robinlovelace commented Aug 25, 2023

JosiahParry commented Aug 25, 2023

kadyb commented Aug 25, 2023

Robinlovelace commented Aug 25, 2023

JosiahParry commented Aug 25, 2023 • edited Loading

JosiahParry commented Aug 25, 2023 •

edited

Loading

JosiahParry commented Aug 25, 2023 •

edited

Loading