
expected data format #3

Closed
e-kotov opened this issue Aug 31, 2024 · 18 comments

Comments

@e-kotov

e-kotov commented Aug 31, 2024

Hi @JohMast ,

I have a question regarding the data format that {flowmapper} expects. I have completed a draft of a new vignette (it can be found here, or, if the branch is already merged and deleted, I guess the vignette will be here) for {spanishoddata} that shows how to use {flowmapper} to visualise data acquired with {spanishoddata}. At some point I ran into a problem with the format that {flowmapper} expects. To quote the vignette:

The data we have right now in od_20210407_total is in a classic long format. That is, we have rows for both the number of trips from A to B and from B to A. {flowmapper} requires the data to be in a different format, where there is only one row for each pair of id_a and id_b, and two columns: one for the flow from id_a to id_b and one for the flow from id_b to id_a.
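For concreteness, here is a minimal base-R sketch of that reshaping. The column names id_a/id_b/flow_ab/flow_ba follow the pattern described above, but check the {flowmapper} documentation for the exact flowdat specification; the example data is made up.

```r
# Long OD data: one row per directed flow (o -> d)
long <- data.frame(
  o = c("A", "B", "A"),
  d = c("B", "A", "C"),
  value = c(10, 4, 7),
  stringsAsFactors = FALSE
)

# Build a canonical (unordered) pair key so that A->B and B->A
# fall into the same group
key <- ifelse(long$o < long$d,
              paste(long$o, long$d),
              paste(long$d, long$o))

# Collapse each unordered pair into one row with two flow columns
wide <- do.call(rbind, lapply(split(long, key), function(p) {
  a <- min(p$o[1], p$d[1])
  b <- max(p$o[1], p$d[1])
  data.frame(
    id_a = a, id_b = b,
    flow_ab = sum(p$value[p$o == a]),  # flow from a to b
    flow_ba = sum(p$value[p$o == b]),  # flow from b to a (0 if absent)
    stringsAsFactors = FALSE
  )
}))
wide
```

A pair with only one observed direction (A -> C above) simply gets 0 in the opposite-direction column.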

So the problem is that, as far as I know (and I have worked with origin-destination data for quite some time; @Robinlovelace may correct me if I am wrong), the format you expect as input for {flowmapper} is not at all standard in the "industry". I would also refer you to the {od} package (https://itsleeds.github.io/od/articles/od.html).

Would it be possible for {flowmapper} to support (and auto-detect) the more standard long format of OD data? Or some other more standardised form of origin-destination data, like od matrices?

@JohMast
Owner

JohMast commented Sep 1, 2024

Hi @e-kotov
Yes, it's not exactly tidy, is it? My goal was to have a readable format that A) shows both directions of bidirectional flows at a glance and B) contains edges and nodes. I think it does that, but it requires some annoying data wrangling from the user if their data is in the long format, so I'd be happy to provide an option to accept long data somehow. I see two options:

A) Helper function that accepts a long table of flows and a table with the nodes and checks for matches (like od_to_sf does) before combining into the format required for add_flowmap. The user would still need to call this function, but it would save a few lines.
B) Add arguments flows and nodes to add_flowmap. If provided, these would be used instead of flow_dat and call the same helper function internally.

I think I prefer (B) because internally, the inputs already get converted to long format for plotting. So in future versions, the unnecessary conversion long -> messy -> long could be bypassed completely.
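A rough sketch of what the helper in option (A) could look like. The name long_to_flowdat and the xa/ya/xb/yb output columns are assumptions for illustration, not flowmapper's actual API; only the match-checking idea comes from the description above.

```r
# Hypothetical helper: join node coordinates onto a long table of flows,
# checking for unmatched ids first (in the spirit of od::od_to_sf).
long_to_flowdat <- function(flows, nodes) {
  unmatched <- setdiff(unique(c(flows$o, flows$d)), nodes$name)
  if (length(unmatched) > 0) {
    stop("Nodes without coordinates: ", paste(unmatched, collapse = ", "))
  }
  # Attach coordinates of the origin...
  out <- merge(flows, nodes, by.x = "o", by.y = "name")
  names(out)[names(out) == "x"] <- "xa"
  names(out)[names(out) == "y"] <- "ya"
  # ...and of the destination
  out <- merge(out, nodes, by.x = "d", by.y = "name")
  names(out)[names(out) == "x"] <- "xb"
  names(out)[names(out) == "y"] <- "yb"
  # Pairing opposite directions into flow_ab/flow_ba rows would follow here
  out
}
```

Option (B) would then just call something like this internally whenever flows and nodes are supplied to add_flowmap().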

Would that be helpful?

@e-kotov
Author

e-kotov commented Sep 1, 2024

@JohMast to me, option B sounds sensible, and it is also in line with how the original inspiration library and package https://github.com/FlowmapBlue/flowmapblue.R expects its input. I think it is nice to mimic the functionality of other packages, as that is what allows an ecosystem of packages to grow that is ultimately very easy for the end user to use, and it also lets your package serve as an easy drop-in replacement for flowmapblue and vice versa, which to me sounds like a win-win. Perhaps @Robinlovelace also has an opinion on that.


Note: sadly, there seems to be no industry standard for this type of data that is universally accepted by packages. E.g. {mapdeck} ( @SymbolixAU ) with add_arc() uses yet another format:

head(flights)
  start_lat start_lon  end_lat    end_lon airline airport1 airport2 cnt
1  32.89595 -97.03720 35.04022 -106.60919      AA      DFW      ABQ 444
2  41.97960 -87.90446 30.19453  -97.66987      AA      ORD      AUS 166
3  32.89595 -97.03720 41.93887  -72.68323      AA      DFW      BDL 162
4  18.43942 -66.00183 41.93887  -72.68323      AA      SJU      BDL  56
5  32.89595 -97.03720 33.56294  -86.75355      AA      DFW      BHM 168
6  25.79325 -80.29056 36.12448  -86.67818      AA      MIA      BNA  56

And add_arc_layer() from {deckgl} ( @crazycapivara ) is similar to {mapdeck}, but again slightly different:

head(deckgl::bart_segments)
# A tibble: 6 × 8
  inbound outbound from_name                           from_lng from_lat to_name                             to_lng to_lat
    <int>    <int> <chr>                                  <dbl>    <dbl> <chr>                                <dbl>  <dbl>
1   72633    74735 19th St. Oakland (19TH)                -122.     37.8 12th St. Oakland City Center (12TH)  -122.   37.8
2   65042    67529 12th St. Oakland City Center (12TH)    -122.     37.8 West Oakland (WOAK)                  -122.   37.8
3       0    23821 Lake Merritt (LAKE)                    -122.     37.8 12th St. Oakland City Center (12TH)  -122.   37.8
4   62964    58788 16th St. Mission (16TH)                -122.     37.8 Civic Center/UN Plaza (CIVC)         -122.   37.8
5   55134    51019 24th St. Mission (24TH)                -122.     37.8 16th St. Mission (16TH)              -122.   37.8
6   67975    70088 MacArthur (MCAR)                       -122.     37.8 19th St. Oakland (19TH)              -122.   37.8

@Robinlovelace's {od} tries to bridge the gap somewhat between the formats, but as we can see, not for all of them. Some manual data manipulation and/or renaming is still required from the user.

@SymbolixAU, @crazycapivara, @Robinlovelace, @JohMast , perhaps it is time to discuss and try to implement a standard?

@JohMast
Owner

JohMast commented Sep 2, 2024

Okay, I implemented the change. You should now be able to pass od (with columns o, d, value) and nodes (with columns name, x, y) to add_flowmap(). I have not tested it thoroughly, so please let me know if that works for your data @e-kotov !

Regarding formats: I agree that it would be neat to have a standard way of doing things. To my understanding, all these formats contain the same information. Some are more efficient regarding storage/memory and some are more readable, though that is probably a matter of taste. However, I suspect that the differences are because they focus on different fields and applications? Analyzing the nodes, modeling flows, exploration...

What might be possible is some standardization of the terminology? I am thinking of things like

x/y versus lon/lat
edges versus flows
origin/destination versus start/end
a_b/b_a versus inbound/outbound

Harmonizing these might make it easier for users to find comparable packages, and jointly use them. If a source for that exists already, I would be happy about a pointer!

@e-kotov
Author

e-kotov commented Sep 2, 2024

Okay, I implemented the change. You should now be able to pass od (with columns o, d, value) and nodes (with columns name, x, y) to add_flowmap(). I have not tested it thoroughly, so please let me know if that works for your data @e-kotov !

@JohMast Great, thank you! I will test today and get back to you!

@e-kotov
Author

e-kotov commented Sep 2, 2024

@JohMast

trying with this small data sample:

od <- structure(list(o = c("01001_AM", "01001_AM", "01001_AM", "01001_AM", 
"01001_AM", "01001_AM", "01001_AM", "01001_AM", "01001_AM", "01001_AM", 
"01001_AM", "01001_AM", "01001_AM", "01001_AM", "01001_AM", "01001_AM", 
"01001_AM", "01001_AM", "01001_AM", "01001_AM"), d = c("01002", 
"0105906", "01063_AM", "19058_AM", "1913005", "20036", "2006903", 
"20075", "20902", "22084_AM", "24212_AM", "28045", "3120102", 
"3120106", "3120107", "31208_AM", "33036", "37073_AM", "39020_AM", 
"39059"), value = c(20.178, 2078.967, 134.44, 11.225, 7.309, 
5.928, 17.053, 46.277, 12.177, 11.906, 6.352, 11.225, 21.94, 
10.537, 17.846, 9.394, 29, 4.768, 138.539, 4.113)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
    o = "01001_AM", .rows = structure(list(1:20), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -1L), .drop = TRUE))

nodes <- structure(list(x = c(478391.311766996, 484914.014734671, 578774.440509757, 
611908.530688029, 608573.323628031, 611017.377114496, 580132.802557328, 
436937.31903137, 407370.337626221, 517286.944855483, 595144.389218619, 
787373.931469436, 477333.501735812, 586892.188214235, 546087.082616995, 
351638.077170337, 525487.254205587, 284534.1296852, 502306.423213228, 
285509.147488101, 618662.856436448), y = c(4797745.16494694, 
4502754.1831806, 4781295.45569597, 4742681.29936187, 4742222.82183755, 
4740866.63365897, 4790187.05267574, 4501721.43675297, 4761496.29470226, 
4758305.73549361, 4801630.20307999, 4700483.60434869, 4498035.54417975, 
4791106.8478728, 4747657.38543724, 4806647.49130073, 4744247.07625825, 
4695004.91047174, 4763643.29102358, 4555167.70779241, 4647074.93331487
), name = c("39020_AM", "1913005", "20075", "3120106", "3120107", 
"3120102", "20902", "28045", "39059", "01063_AM", "20036", "22084_AM", 
"19058_AM", "2006903", "01001_AM", "33036", "0105906", "24212_AM", 
"01002", "37073_AM", "31208_AM")), class = "data.frame", row.names = c(NA, 
-21L))

ggplot() |> add_flowmap(od = od, nodes = nodes)

update: corrected + for |>

When debugging, just before failing, the flows object is:

# A tibble: 40 × 3
# Groups:   id_a [20]
   id_a     group                  flow
   <chr>    <chr>                 <dbl>
 1 01002    01002 - 01001_AM       0   
 2 0105906  0105906 - 01001_AM     0   
 3 01063_AM 01063_AM - 01001_AM    0   
 4 19058_AM 19058_AM - 01001_AM    0   
 5 1913005  1913005 - 01001_AM     0   
 6 20036    20036 - 01001_AM       0   
 7 2006903  2006903 - 01001_AM     0   
 8 20075    20075 - 01001_AM       0   
 9 20902    20902 - 01001_AM       0   
10 22084_AM 22084_AM - 01001_AM    0   
11 24212_AM 24212_AM - 01001_AM    0   
12 28045    28045 - 01001_AM       0   
13 3120102  3120102 - 01001_AM     0   
14 3120106  3120106 - 01001_AM     0   
15 3120107  3120107 - 01001_AM     0   
16 31208_AM 31208_AM - 01001_AM    0   
17 33036    33036 - 01001_AM       0   
18 37073_AM 37073_AM - 01001_AM    0   
19 39020_AM 39020_AM - 01001_AM    0   
20 39059    39059 - 01001_AM       0   
21 01002    01001_AM - 01002      20.2 
22 0105906  01001_AM - 0105906  2079.  
23 01063_AM 01001_AM - 01063_AM  134.  
24 19058_AM 01001_AM - 19058_AM   11.2 
25 1913005  01001_AM - 1913005     7.31
26 20036    01001_AM - 20036       5.93
27 2006903  01001_AM - 2006903    17.1 
28 20075    01001_AM - 20075      46.3 
29 20902    01001_AM - 20902      12.2 
30 22084_AM 01001_AM - 22084_AM   11.9 
31 24212_AM 01001_AM - 24212_AM    6.35
32 28045    01001_AM - 28045      11.2 
33 3120102  01001_AM - 3120102    21.9 
34 3120106  01001_AM - 3120106    10.5 
35 3120107  01001_AM - 3120107    17.8 
36 31208_AM 01001_AM - 31208_AM    9.39
37 33036    01001_AM - 33036      29   
38 37073_AM 01001_AM - 37073_AM    4.77
39 39020_AM 01001_AM - 39020_AM  139.  
40 39059    01001_AM - 39059       4.11

Fails at:

plot_df <-

with:

Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Error in `tidyr::pivot_longer()`:
! Can't combine `id_a` <character> and `ya` <double>.

@Robinlovelace

Happy to provide input here. FYI we have had a similar conversation before: itsleeds/od#20

Would be happy to implement a class system in {od} with checks. Could be an S3 class system, or even an S7 class system, which looks ideal for this: it would force all od objects to have certain features (e.g. character strings as the origin and destination column IDs; attributes like bidirectional, 'has internal', etc. could be useful). However, S7 is still experimental and seems a bit stale: https://github.com/RConsortium/S7. Maybe vctrs checks instead? https://vctrs.r-lib.org/articles/s3-vector.html

For my purposes keeping it super-simple, simply using tibbles and sf objects without type checking or enforcement of rules has been fine but happy to consider a class system.

I do think {od} can be a handy translation layer so happy to add more conversion functions there.

@Robinlovelace

cc @mtennekes, of tmap fame, who instigated the very similar conversation in itsleeds/od#20. Would welcome any updated thoughts from you on this, especially given your experience. My thinking is that it could be good to extend, and to be able to fall back to, data.frame/sf objects.

@JohMast
Owner

JohMast commented Sep 2, 2024

@JohMast

trying with this small data sample:

Fails at:

plot_df <-

with:

Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Adding missing grouping variables: `id_a`
Error in `tidyr::pivot_longer()`:
! Can't combine `id_a` <character> and `ya` <double>.

Thanks for that example👍 I added a check for grouping variables, that should prevent this and other grouping-related issues.

@e-kotov
Author

e-kotov commented Sep 2, 2024

Thanks for that example👍 I added a check for grouping variables, that should prevent this and other grouping-related issues.

@JohMast Thank you! It now passes the tests with larger datasets in the vignette.

One more minor thing: if you could force-convert any factor (or integer/numeric too? I have not tested with numeric ids) zone id variables to character, it would be perfect, as a user might get the ids as factors in the flows table. They can of course convert them to character themselves, but why not make their lives simpler. Up to you; this is just a suggestion.
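For reference, the conversion on the caller's side is a one-liner per column (the o/d column names follow the new add_flowmap() arguments; the data is made up):

```r
# Ids arriving as factors, e.g. from a CSV read with stringsAsFactors = TRUE
od <- data.frame(
  o = factor(c("01001_AM", "01002")),
  d = factor(c("01002", "01001_AM")),
  value = c(20.2, 15.1)
)

# Coerce the zone ids to character before plotting
od$o <- as.character(od$o)
od$d <- as.character(od$d)
```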

Otherwise, I have rewritten the vignette I referred to using the new arguments you created. It is currently in my local branch. I will replace the original vignette, which currently preps the data into the required flowdat format, with the one that uses od and nodes once the {flowmapper} update gets to CRAN.

@JohMast
Owner

JohMast commented Sep 3, 2024

Good point - I see no reason not to force-convert the entire id column in the inputs to character. That's how they are treated internally anyway, and as the user only gets the plot and not the data, it should not matter to them.
I added the conversion and will submit the new version to CRAN tomorrow.

On the broader discussion, I agree with @Robinlovelace; both options sound good to me. I personally appreciate the flexibility of using tibbles, and I think the simplicity is also a benefit for less experienced package developers. If there is an agreed-upon standard, then having a class-based system makes a lot of sense to me. Alternatively, having {od} as a central point for converting (and perhaps describing) different formats would be a cool base for further developments.

@e-kotov
Author

e-kotov commented Sep 3, 2024

Would be happy to implement a class system in {od} with checks. Could be an S3 class system, or even an S7 class system, which looks ideal for this: it would force all od objects to have certain features

I have not developed anything with classes, so I cannot comment on that. Sounds like it would be useful, but also a bit niche, as this data type is not as common as, for example, sf geometries.

For my purposes keeping it super-simple, simply using tibbles and sf objects without type checking or enforcement of rules has been fine but happy to consider a class system.

I think that is absolutely fine. As long as the packages that work with that kind of data expect the same input.

I do think {od} can be a handy translation layer so happy to add more conversion functions there.

That was my thinking too. If different package devs cannot agree on the same format, at least {od} could act as a universal translator.

@e-kotov
Author

e-kotov commented Sep 3, 2024

Good point - I see no reason not to force-convert the entire id column in the inputs to character. That's how they are treated internally anyway, and as the user only gets the plot and not the data, it should not matter to them.
I added the conversion and will submit the new version to CRAN tomorrow.

Great, thank you!

@Robinlovelace

If different package devs cannot agree on the same format, at least {od} could act as a universal translator.

Makes me think of a new title of {od}:

A Universal Translator Between Different 'Origin-Destination' Data Formats

See: itsleeds/od#52

@Robinlovelace

Also: now that you've shown how easy it is to create hexes, od should gain a hex...

@Robinlovelace

Robinlovelace commented Sep 3, 2024

For the time being I think this issue may be 'fixed', great job on the package John, glad to have given it a spin in rOpenSpain/spanishoddata#65

@JohMast
Owner

JohMast commented Sep 4, 2024

Great! Thanks for the positive feedback and the discussion!

@JohMast JohMast closed this as completed Sep 4, 2024
@Robinlovelace

Great to hear, good outcome Johannes (apologies for typo in your name previously, just checked the DESCRIPTION)!

@JohMast
Owner

JohMast commented Sep 4, 2024

Thanks! You got it right, Johannes is just the German version of John 😊
