Add support for timeseries of arbitrary granularities #111

Open
gfairchild opened this issue Feb 25, 2016 · 9 comments

@gfairchild (Collaborator)

Related to #110, we should add support for granularities other than hourly. The CDC data are daily. I think some low-hanging fruit would be to allow the user to provide an ISO 8601 duration string and store it in each month's metadata table. Unfortunately, we would need to enforce a rule that users can't use granularities larger than 1 month, but allowing arbitrary granularities between ε and 1 month is a lot more useful than being restricted to hourly.
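
As a rough illustration of the idea (just a sketch, not necessarily how we'd store or parse them; a recent pandas accepts ISO 8601 duration strings in its Timedelta constructor):

>>> import pandas as pd
>>> pd.Timedelta('PT1H')  # hourly, i.e., the current Wikipedia granularity
Timedelta('0 days 01:00:00')
>>> pd.Timedelta('P1D')   # daily, e.g., the CDC data
Timedelta('1 days 00:00:00')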

@reidpr, what do you think about this?

@gfairchild (Collaborator, Author)

This solution would also make modifying the current Wikipedia data fairly simple: instead of having to recreate everything, we'd just need to add a row to each month's metadata table specifying the duration as PT1H.

@reidpr (Owner) commented Feb 25, 2016

A few somewhat well-organized opinions:

Right now, fragment groups don't have any notion of time. The only assumptions are that the filenames are lexically increasing and that adjacent filenames contain adjacent fragments of the vector (i.e., no gaps and no overlap). I'd prefer to:

  • not make the fragment groups time-aware.
  • leave an open door for general fragmented vector (matrix, etc.) storage rather than just time series.
  • leave an open door for non-SQLite storage (e.g., HDF5).

Assembling a full vector by concatenation requires knowing the length of each fragment. All fragments in a group are the same length. Currently, this length is stored in the metadata but AFAICT not really used; rather, it's deduced by assuming the fragments are a month long and that the interval length is one hour. It is desirable to get the length of a fragment group without opening the database. If it's in the filename, that's kind of a hack, but it doesn't even require stat(2)ing the file.
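
To make the assembly arithmetic concrete, here's a minimal sketch under the assumptions above (equal-length fragments, lexical filename order, no gaps or overlap); the arrays and lengths are made up:

>>> import numpy as np
>>> fragments = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
>>> fragment_length = len(fragments[0])  # identical for every fragment in the group
>>> np.concatenate(fragments)            # full vector: adjacent, no gaps, no overlap
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
>>> # Item i of the full vector lives in fragment i // fragment_length,
>>> # at position i % fragment_length within it.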

Once assembled, to convert it to a time series, we need to know (a) the length of each interval and (b) the offset. In general, the intervals might not start and/or end on a fragment boundary (e.g., intervals of weeks). Currently, we assume intervals are one hour and the offset is zero.

Because of the offset, ISO 8601 durations aren't sufficient to specify the intervals.

Here's a proposal.

(1) Fragment groups remain time-ignorant and deal with simple vectors.

(2) Introduce a metadata file that lives alongside the database files. It stores metadata key/value pairs for the whole dataset, in some format that can be read without introducing any new dependencies. A SQLite database seems congruent, but it's also more of a hassle to deal with.

To start, the metadata elements are:

  • hashmod — number of data tables per database. Currently, this is read from an arbitrary fragment group's metadata. Required.
  • interval — Pandas interval code for the dataset. Required; future work will make it optional.

The Pandas codes can take a multiplier, so I believe this makes them fully general:

>>> pd.period_range('2016-02-01', periods=6, freq='90min')
PeriodIndex(['2016-02-01 00:00', '2016-02-01 01:30', '2016-02-01 03:00',
             '2016-02-01 04:30', '2016-02-01 06:00', '2016-02-01 07:30'],
            dtype='int64', freq='90T')

(3) Filenames are of the form {TIMESTAMP}_{LENGTH}.db. The first field is a UTC ISO 8601 datetime or date (if a date, assume a time of 00:00:00); the second is the fragment length (number of items).

The first interval must start at the timestamp. We can then use period_range() to generate the index. For example:

>>> pd.period_range('2016-02-01', periods=6, freq='H')
PeriodIndex(['2016-02-01 00:00', '2016-02-01 01:00', '2016-02-01 02:00',
             '2016-02-01 03:00', '2016-02-01 04:00', '2016-02-01 05:00'],
            dtype='int64', freq='H')
>>> pd.period_range('2016-02-01', periods=6, freq='D')
PeriodIndex(['2016-02-01', '2016-02-02', '2016-02-03', '2016-02-04',
             '2016-02-05', '2016-02-06'],
            dtype='int64', freq='D')
>>> pd.period_range('2016-01-27', periods=6, freq='W-TUE')
PeriodIndex(['2016-01-27/2016-02-02', '2016-02-03/2016-02-09',
             '2016-02-10/2016-02-16', '2016-02-17/2016-02-23',
             '2016-02-24/2016-03-01', '2016-03-02/2016-03-08'],
            dtype='int64', freq='W-TUE')

period_range() will work even if the timestamp doesn't start the first interval, but I think this would be confusing, so we should disallow it, though I'm OK with calling it undefined behavior. For example:

>>> pd.period_range('2016-02-01', periods=6, freq='W-TUE')
PeriodIndex(['2016-01-27/2016-02-02', '2016-02-03/2016-02-09',
             '2016-02-10/2016-02-16', '2016-02-17/2016-02-23',
             '2016-02-24/2016-03-01', '2016-03-02/2016-03-08'],
            dtype='int64', freq='W-TUE')

(Future work will let the tag be general.)

(4) hashmod and length in fragment group metadata are no longer added, and they are ignored if present.

(5) This issue provides a script or other easy means to upgrade the existing Wikipedia data. Values like hashmod can be hard-coded. The upgrade is non-destructive so the change can be reverted.

@gfairchild (Collaborator, Author)

(0) What exactly is this offset you speak of? I don't see the word "offset" used in timeseries.py. Does this refer to a Pandas anchored offset, or does it perhaps refer to a time zone offset?

(1) Sounds reasonable.

(2) How about a metadata.json file? JSON is easily human- and machine-readable, language-agnostic, and requires relatively little infrastructure to deal with (compared to something like SQLite).
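
For concreteness, a minimal sketch of what that file might contain, using the two keys from your proposal (the hashmod value here is made up; "1H" would be the Wikipedia frequency):

{
    "hashmod": 64,
    "interval": "1H"
}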

Also, you use the word "interval" here, but it's not entirely clear what exactly you mean. Are you referring to the value supplied to the freq parameter (e.g., these aliases)?

(3) I think this makes good sense. Referring to my question above in (2), if indeed you do mean that the interval is the freq value, then:

>>> freq = '1D'  # this would come from metadata.json and would be "1H" for all Wikipedia data
>>> f = '2016-02-01T00:00:00+00:00_29.db'  # the "29" would be "696" for the Wikipedia data
>>> date, periods = f[:-3].split('_')
>>> pd.period_range(date, periods=int(periods), freq=freq)
PeriodIndex(['2016-02-01', '2016-02-02', '2016-02-03', '2016-02-04',
             '2016-02-05', '2016-02-06', '2016-02-07', '2016-02-08',
             '2016-02-09', '2016-02-10', '2016-02-11', '2016-02-12',
             '2016-02-13', '2016-02-14', '2016-02-15', '2016-02-16',
             '2016-02-17', '2016-02-18', '2016-02-19', '2016-02-20',
             '2016-02-21', '2016-02-22', '2016-02-23', '2016-02-24',
             '2016-02-25', '2016-02-26', '2016-02-27', '2016-02-28',
             '2016-02-29'],
            dtype='int64', freq='D')

I believe we're on the same page, but I just want to be clear.

(4) Makes sense.

(5) Where should this migration script live? In bin or perhaps misc?

@reidpr (Owner) commented Mar 2, 2016

(0) It's a Pandas anchored offset, yes. It's an offset because saying data are, e.g., weekly doesn't tell you when the periods start and stop, just how long they are. For daily or hourly data, there's a strong convention about when they start and stop.
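
For example, both of these are weekly, but the anchor changes where the period boundaries fall (dates made up; output formatted like the examples above):

>>> pd.period_range('2016-02-01', periods=2, freq='W-SUN')
PeriodIndex(['2016-02-01/2016-02-07', '2016-02-08/2016-02-14'],
            dtype='int64', freq='W-SUN')
>>> pd.period_range('2016-02-01', periods=2, freq='W-TUE')
PeriodIndex(['2016-01-27/2016-02-02', '2016-02-03/2016-02-09'],
            dtype='int64', freq='W-TUE')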

(2) I agree on JSON vs. SQLite in general. I'm leaning towards SQLite, though, because we already have a metadata reader/writer for it, for use in the fragment group files.

(3) What I'd add here is that dates are OK, so all the zeros indicating midnight can be omitted.

(5) I'd say somewhere outside the code; this issue, maybe? I don't believe there are any data sets other than ours that need to be upgraded. It needn't even be a real script; instructions would do. One option is to make the length filename parameter optional: if the dataset frequency is 1H, the length is computed as the number of hours in that month. The upside is that the Wikipedia data could be upgraded by simply creating the metadata file by hand; the downside is that there are more code paths.
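
If we went that route, the computation itself is small; a sketch (the timestamps are UTC, so the count is just the calendar month length in days):

>>> import calendar
>>> year, month = 2016, 2                     # taken from the fragment's timestamp
>>> calendar.monthrange(year, month)[1] * 24  # days in the month * 24 hours
696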

@gfairchild (Collaborator, Author)

Great, thanks Reid. I'll get this done ASAP.

@gfairchild (Collaborator, Author)

Ok, the question I was trying to remember during last Friday's team meeting was related to point (3) of your proposal: how do we know what fragment length to choose?

Right now, the fragment length is the number of hours in each month, which is 744 for months with 31 days. Suppose we're dealing with daily or weekly data. How many fragments should we store in a single DB?

@reidpr (Owner) commented Apr 19, 2016

You can use any fragment length.

I'm sure there's an optimum length, though it's probably not worth our time at this point to figure out what it is.

@gfairchild (Collaborator, Author)

Okie dokie. I'll just pick something arbitrarily. Thanks, @reidpr.

@gfairchild (Collaborator, Author)

Ok, I've completed this (see PR #112). I'm attaching a simple migration script to this post to migrate the v1 Wikipedia data to the new v2 format (GitHub only allows attachments with certain extensions, so I had to zip it up). I've tested this script on a small subset of the Wikipedia data that I copied locally.

wp-migrate.zip

@gfairchild gfairchild assigned reidpr and unassigned gfairchild May 18, 2016
@reidpr reidpr removed their assignment Oct 19, 2018