Skip to content

Commit

Permalink
add --stats-by-route-id-and-date option 📝 [todo:postgraphile?,tests?,…
Browse files Browse the repository at this point in the history
…benchmark]

Supports analyzing a feed by route and/or date and/or day of the week.

part of #33
  • Loading branch information
derhuerst committed Mar 5, 2023
1 parent 24db089 commit 066bb30
Show file tree
Hide file tree
Showing 7 changed files with 145 additions and 0 deletions.
10 changes: 10 additions & 0 deletions cli.js
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,9 @@ const {
'stops-location-index': {
type: 'boolean',
},
'stats-by-route-id-and-date': {
type: 'string',
},
'schema': {
type: 'string',
},
Expand Down Expand Up @@ -72,6 +75,12 @@ Options:
Default if levels.txt has not been provided.
--stops-location-index Create a spatial index on stops.stop_loc for efficient
queries by geolocation.
--stats-by-route-id-and-date Wether to generate a stats_by_route_id_and_date view
letting you analyze all data per routes and/or date:
- none: Don't generate a view.
- view: Fast generation, slow access.
- materialized-view: Slow generation, fast access.
Default: none
--schema The schema to use for the database. Default: public
--postgraphile Tweak generated SQL for PostGraphile usage.
https://www.graphile.org/postgraphile/
Expand Down Expand Up @@ -107,6 +116,7 @@ const opt = {
tripsWithoutShapeId: !!flags['trips-without-shape-id'],
routesWithoutAgencyId: !!flags['routes-without-agency-id'],
stopsLocationIndex: !!flags['stops-location-index'],
statsByRouteIdAndDate: flags['stats-by-route-id-and-date'] || 'none',
schema: flags['schema'] || 'public',
postgraphile: !!flags.postgraphile,
}
Expand Down
51 changes: 51 additions & 0 deletions docs/analysis/feed-by-route-and-date.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# analysing a GTFS dataset by route ID and/or date

Are you trying to answer a question like those below?

- Are there certain dates or days of the week that have sigificantly less arrivals/departures (hereinafter "stop time events")? – This *may* indicate errors in the data, e.g. a faulty `calendar.csv` or `calendar_dates.csv` file.
- Has the number of stop time events decreased, compared to the last dataset version?
- Do specific routes stop running during certain time periods?

`gtfs-via-postgres` optionally provides a **(materialized) view `stats_by_route_id_and_date` to help with such SQL queries. Use the `--stats-by-route-id-and-date` flag to enable it** in the generated SQL:

- If you run `gtfs-to-sql` with `--stats-by-route-id-and-date=view`, `stats_by_route_id_and_date` will be a "regular" non-materialized view. Use this option if you want to import the GTFS data quickly, and if you only query `stats_by_route_id_and_date` rarely or in time-uncritical scenarios.
- If you pass `--stats-by-route-id-and-date=materialized-view`, the `stats_by_route_id_and_date` view will [be materialized](https://www.postgresql.org/docs/14/rules-materializedviews.html). Use this option if you need fast queries, and if you can tolerate significantly longer import times (3m for the 64mb 2023-03-05 SNCB/NMBS GTFS feed, todo for the 540mb 2023-02-27 VBB GTFS feed).

`stats_by_route_id_and_date` has the following columns:

- `date`
- `dow` – day of the week, following the [PostgreSQL notation `0` (Sunday) to `6` (Saturday)](https://www.postgresql.org/docs/14/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT)
- `nr_of_trips` – nr of trips starting on that date
- `nr_of_arrs_deps` – nr of trips taking place on that date
- `is_effective` – wether `nr_of_trips` & `nr_of_arrs_deps` are calculated based on the *effective* date (i.e. the date that the stop time event actually happens on) or *schedule* date (i.e. the date which their `stop_time` rows refer to)

So

- if you want to take a customer-facing perspective on the data (as in "I don't care which if trips are scheduled before midnight, I want to know if they run today"), filter for `is_effective = True` rows;
- If you're interested in the operational/planning perspective (e.g. if you're looking for data errors), filter for `is_effective = False` rows.

## example: nr of effective stop time events of a single route over a week

```sql
-- using VBB's 2023-02-27 GTFS data
SELECT
*,
EXTRACT(dow FROM "date") as dow
FROM stats_by_route_id_and_date stats
WHERE is_effective = True
AND route_id = '17438_900', -- M1 tram line
AND "date" >= '2023-03-19' -- Sunday, dow = 0
AND "date" <= '2023-03-25' -- Saturday, dow = 6
ORDER BY route_id, "date", is_effective DESC
```

```csv
route_id,date,dow,nr_of_trips,nr_of_arrs_deps,is_effective,dow
17438_900,2023-03-19,0,258,5870,t,0
17438_900,2023-03-20,1,345,7831,t,1
17438_900,2023-03-21,2,345,7831,t,2
17438_900,2023-03-22,3,345,7831,t,3
17438_900,2023-03-23,4,345,7831,t,4
17438_900,2023-03-24,5,345,7831,t,5
17438_900,2023-03-25,6,326,9001,t,6
```
8 changes: 8 additions & 0 deletions index.js
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ const convertGtfsToSql = async function* (files, opt = {}) {
routesWithoutAgencyId: false,
stopsWithoutLevelId: !files.some(f => f.name === 'levels'),
stopsLocationIndex: false,
statsByRouteIdAndDate: 'none',
schema: 'public',
postgraphile: false,
...opt,
Expand All @@ -29,6 +30,7 @@ const convertGtfsToSql = async function* (files, opt = {}) {
tripsWithoutShapeId,
requireDependencies,
ignoreUnsupportedFiles,
statsByRouteIdAndDate,
} = opt

if (ignoreUnsupportedFiles) {
Expand Down Expand Up @@ -73,6 +75,12 @@ const convertGtfsToSql = async function* (files, opt = {}) {
'frequencies': {
dep: [...deps.frequencies],
},

...(statsByRouteIdAndDate !== 'none' ? {
'stats_by_route_id_and_date': {
dep: ['stop_times'],
},
} : {}),
}

for (const file of files) {
Expand Down
1 change: 1 addition & 0 deletions lib/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,5 @@ module.exports = {
pathways: require('./pathways'),
levels: require('./levels'),
translations: require('./translations'),
stats_by_route_id_and_date: require('./stats_by_route_id_and_date'),
}
70 changes: 70 additions & 0 deletions lib/stats_by_route_id_and_date.js
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
'use strict'

const afterAll = (opt) => {
let materialized = false
if (opt.statsByRouteIdAndDate === 'materialized-view') {
materialized = true
} else if (opt.statsByRouteIdAndDate !== 'view') {
throw new Error('invalid opt.statsByRouteIdAndDate, must be one of these: none, view, materialized-view.')
}
const createViewCmd = materialized
? `CREATE MATERIALIZED VIEW`
: `CREATE OR REPLACE VIEW`

return `\
${createViewCmd} "${opt.schema}".stats_by_route_id_and_date AS
WITH
arrs_deps_with_svc_date AS NOT MATERIALIZED (
SELECT
route_id, stop_sequence,
"date"::date AS svc_date,
EXTRACT(DOW FROM "date") AS svc_dow
FROM "${opt.schema}".arrivals_departures
),
by_svc_date AS NOT MATERIALIZED (
SELECT DISTINCT ON (route_id, svc_date)
route_id,
svc_date AS "date",
svc_dow AS dow,
count(*) FILTER (WHERE stop_sequence = 0) OVER (PARTITION BY route_id, svc_date) AS nr_of_trips,
count(*) OVER (PARTITION BY route_id, svc_date) AS nr_of_arrs_deps
FROM arrs_deps_with_svc_date
),
arrs_deps_with_effective_date AS NOT MATERIALIZED (
SELECT
route_id, stop_sequence,
coalesce(t_departure, t_arrival)::date AS effective_date,
EXTRACT(DOW FROM coalesce(t_departure, t_arrival)) AS effective_dow
FROM "${opt.schema}".arrivals_departures
),
by_effective_date AS NOT MATERIALIZED (
SELECT DISTINCT ON (route_id, effective_date)
route_id,
effective_date AS "date",
effective_dow AS dow,
count(*) FILTER (WHERE stop_sequence = 0) OVER (PARTITION BY route_id, effective_date) AS nr_of_trips,
count(*) OVER (PARTITION BY route_id, effective_date) AS nr_of_arrs_deps
FROM arrs_deps_with_effective_date
)
SELECT
*,
True AS is_effective
FROM by_effective_date
UNION
SELECT
*,
False AS is_effective
FROM by_svc_date;
${materialized ? `\
CREATE INDEX ON "${opt.schema}".stats_by_route_id_and_date (route_id);
CREATE INDEX ON "${opt.schema}".stats_by_route_id_and_date ("date");
CREATE INDEX ON "${opt.schema}".stats_by_route_id_and_date (route_id, "date", is_effective);
CREATE INDEX ON "${opt.schema}".stats_by_route_id_and_date (route_id, dow, is_effective);
` : ''}
`
}

module.exports = {
afterAll,
}
1 change: 1 addition & 0 deletions package.json
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
"index.js",
"scripts",
"lib",
"docs",
"example.sh",
"LICENSE",
"LICENSE-PROSPERITY.md",
Expand Down
4 changes: 4 additions & 0 deletions readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -253,6 +253,10 @@ env NODE_ENV=development npm exec -- serve-gtfs-via-graphql

**As an example for the GraphQL API, check out the [test query](test/sample-gtfs-feed-postgraphile-test.graphql)** or open GraphiQL via [`localhost:3000/graphiql`](http://localhost:3000/graphiql).

### more guides

The [`docs` directory](docs) contains more instructions on how to use `gtfs-via-postgres`.


## Correctness vs. Speed regarding GTFS Time Values

Expand Down

0 comments on commit 066bb30

Please sign in to comment.