maxdist argument for points_to_od #18
Definitely a good plan, and I think `nngeo` might be better. I'll have to benchmark.
Sounds good @mem48. I've assigned you for now, but I'm happy to take a look. We could ask the package author for his views; Michael and I have communicated on other matters. @michaeldorman, what are your thoughts?
@Robinlovelace I think that could work. If you need all pairwise distances between lon-lat points, perhaps `geodist` would be better. Here is a reproducible example showing that, for the lon-lat points case, results are identical and speed is comparable, though a little faster in `geodist`. If I can help with anything please let me know.

```r
# Sample data
library(sf)
library(magrittr)  # for %>%
n = 1000
x = data.frame(
  lon = (runif(n) * 2 - 1) * 70,
  lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)

# nngeo
library(nngeo)
start = Sys.time()
result1 = st_nn(x, x, k = 1000, returnDist = TRUE, progress = FALSE)
end = Sys.time()
end - start

# geodist
library(geodist)
start = Sys.time()
result2 = geodist(st_coordinates(x), st_coordinates(x), measure = "geodesic")
end = Sys.time()
end - start

# Compare
result1$dist[[1]] %>% head
result2[1, ] %>% sort %>% head
```
Many thanks @michaeldorman, that is very useful. It sounds like `geodist` is the better option here.
What about when you want to set a max distance, in the context of this issue?
Thanks for the clarification. I've used RANN for this, which is why I thought you might implement something similar for lat/long. Out of interest, have you benchmarked projected vs geographic coordinates? I wonder if, when the max distance is low, you might get a big performance gain by using RANN to get an approximate set, and then doing a more accurate second pass.
Also, I believe `nabor` is faster than `RANN`.
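That claim is easy to check on planar points. A hypothetical microbenchmark (not from this thread; the point count and `k` are arbitrary choices) comparing the two packages, both of which wrap kd-tree nearest-neighbour libraries:

```r
library(RANN)   # wraps the ANN kd-tree library
library(nabor)  # wraps libnabo

set.seed(1)
m = matrix(runif(2e5), ncol = 2)  # 100,000 planar points

# Time a k-nearest-neighbour query of the points against themselves
t_rann  = system.time(r1 <- RANN::nn2(m, m, k = 3))
t_nabor = system.time(r2 <- nabor::knn(m, m, k = 3))

t_rann["elapsed"]
t_nabor["elapsed"]

# Both return neighbour index matrices of the same shape
identical(dim(r1$nn.idx), dim(r2$nn.idx))
```

Relative timings will vary by machine and data size, so it is worth running on something close to the real problem.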
The help page for `st_nn` notes that calculations are faster for projected points than for lon-lat points:

```r
library(sf)
library(nngeo)

# Large example - Geo points
n = 1000
x = data.frame(
  lon = (runif(n) * 2 - 1) * 70,
  lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)
start = Sys.time()
result1 = st_nn(x, x, k = 3)
end = Sys.time()
end - start
## Time difference of 2.405991 secs

# Large example - Proj points
n = 1000
x = data.frame(
  lon = (runif(n) * 2 - 1) * 70,
  lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)
x = st_transform(x, 32630)
start = Sys.time()
result = st_nn(x, x, k = 3)
end = Sys.time()
end - start
## Time difference of 0.9784892 secs
```

There is also an option for parallel processing, which can help reduce the time of calculations using geographic coordinates:

```r
# Large example - Geo points - Parallel processing
start = Sys.time()
result2 = st_nn(x, x, k = 3, parallel = 4)
end = Sys.time()
end - start
## Time difference of 1.429008 secs
```

Not sure I understand your idea to use `RANN` here, though.
I think it would be clearer if I explained the specific problem I'm trying to solve. I want to create a travel time matrix for the UK; I have about 200,000 points, so potentially 40 billion origin-destination pairs, which would be a very large data frame. For my use case I'm only really interested in commuting, so I would like a 100 km distance cap, which should reduce the problem to a few tens of millions of pairs: large, but doable.

If I use `geodist` I would have to calculate all 40 billion distances and then subset, so I would probably run out of memory. But if I project and use `RANN`, I can calculate far fewer values and not run out of memory. So, problem solved, as long as I have a good choice of national grid.

But suppose I now wanted to do the same thing globally, with a similarly short maximum distance. I'd have to use geographic coordinates. You could treat the lat/lng coordinates as simple x,y coordinates and pass them to `RANN`, but the distances would be in degrees, not metres, and the accuracy would change around the world. However, I thought you might be able to take the bounding box of the input and do a simple calculation to convert the maximum distance in metres into a worst-case maximum distance in degrees. You could then pass that value to `RANN` and remove perhaps 90% of the possible measurements, then use a more accurate method to do a second filter on a much smaller set of values, preventing the problem from running out of memory.

This method would not work near the poles or the 180-degree line of longitude, but I expect there are lots of real-world problems that would fit those restrictions. It also wouldn't be worth the effort except for the very largest datasets, but I think it would not be too hard to come up with some simple tests to decide whether to implement the method. I tweaked your example datasets; notice how big the time difference gets.
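The two-pass idea above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not an existing function in `nngeo`, `RANN`, or `geodist`; the metres-per-degree constant and the `k` cap on candidates per point are assumptions for the example.

```r
library(RANN)     # fast kd-tree neighbour search (planar)
library(geodist)  # accurate geodesic distances

set.seed(42)
pts = data.frame(
  lon = (runif(1000) * 2 - 1) * 70,
  lat = (runif(1000) * 2 - 1) * 70
)
maxdist_m = 100000  # 100 km cap

# One degree of latitude is roughly 111,320 m; a degree of longitude shrinks
# with cos(latitude), so the fewest metres per degree (the worst case for a
# conservative radius) occurs at the highest absolute latitude in the bbox.
m_per_deg_lon = 111320 * cos(max(abs(pts$lat)) * pi / 180)
radius_deg = maxdist_m / m_per_deg_lon

# First pass: approximate radius search in raw degree space;
# k caps the number of candidate neighbours kept per point (assumption)
nn = nn2(pts, pts, k = 50, searchtype = "radius", radius = radius_deg)

# Second pass: exact geodesic check on the much smaller candidate set
within_cap = lapply(seq_len(nrow(pts)), function(i) {
  cand = nn$nn.idx[i, ]
  cand = cand[cand > 0 & cand != i]  # index 0 = no neighbour within radius
  if (length(cand) == 0) return(integer(0))
  d = geodist(pts[i, , drop = FALSE], pts[cand, , drop = FALSE],
              measure = "geodesic")
  cand[d <= maxdist_m]
})
```

As noted, this breaks near the poles and the antimeridian, and the degree-space radius is deliberately over-generous so that the first pass never discards a true neighbour.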
Great to see benchmarks. How does the second example compare with `geodist`?
I understand, thank you for the clarification! Ultimately, whether it's worthwhile depends on the cost-benefit of writing the function as opposed to waiting for the calculation to finish (and how often you need to do that calculation).
That sounds like a good plan @michaeldorman; accuracy isn't important in the first filter.
Reviving with reference to Robinlovelace/simodels#33. Thanks @mem48 for the nudge on this.
Happy to contribute this if you think it would be useful. The idea is that you have points and want to make an OD dataset, but cap the maximum distance considered, e.g. how we capped the PCT at 30 km.

Would need to add `geodist` as a dependency.
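As a rough sketch of what a capped OD builder could look like (the function name and signature are hypothetical, not the actual `simodels` or `od` API), it computes the full distance matrix and then subsets, so it suits modest point counts; very large inputs would need the two-pass pre-filtering discussed earlier in this thread:

```r
library(sf)
library(geodist)

# Hypothetical helper: all OD pairs within maxdist metres of each other
points_to_od_maxdist = function(points, maxdist) {
  xy = st_coordinates(points)
  d = geodist(xy, measure = "geodesic")      # full n x n distance matrix
  idx = which(d <= maxdist, arr.ind = TRUE)  # pairs under the cap
  idx = idx[idx[, 1] != idx[, 2], , drop = FALSE]  # drop self-pairs
  data.frame(o = idx[, 1], d = idx[, 2], distance = d[idx])
}

# Example usage with a 30 km cap, as in the PCT
set.seed(1)
n = 100
p = st_as_sf(data.frame(lon = runif(n), lat = runif(n)),
             coords = c("lon", "lat"), crs = 4326)
od = points_to_od_maxdist(p, maxdist = 30000)
head(od)
```

The returned data frame uses row indices for origin and destination; a real implementation would presumably carry through point IDs instead.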