Add support for timeseries of arbitrary granularities #111

Open
gfairchild opened this issue Feb 25, 2016 · 9 comments

@gfairchild (Collaborator)

Related to #110, we should add support for granularities other than hourly. The CDC data are daily. I think some low-hanging fruit would be to allow the user to provide an ISO 8601 duration string and store it in each month's metadata table. Unfortunately, we would need to enforce a rule that users can't use granularities larger than 1 month, but allowing arbitrary granularities between ε and 1 month is a lot more useful than being restricted to hourly.
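
As a rough illustration of the idea (just a sketch, not necessarily how we'd store or parse them; a recent pandas accepts ISO 8601 duration strings in its Timedelta constructor):

>>> import pandas as pd
>>> pd.Timedelta('PT1H')  # hourly, i.e., the current Wikipedia granularity
Timedelta('0 days 01:00:00')
>>> pd.Timedelta('P1D')   # daily, e.g., the CDC data
Timedelta('1 days 00:00:00')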

@reidpr, what do you think about this?

@gfairchild (Collaborator, Author)

This solution would also make modifying the current Wikipedia data fairly simple: instead of having to recreate everything, we'd just need to add a row to each month's metadata table specifying the duration as PT1H.

@reidpr (Owner) commented Feb 25, 2016

A few somewhat well-organized opinions:

Right now, fragment groups don't have any notion of time. The only assumptions are that the filenames are lexically increasing and that adjacent filenames contain adjacent fragments of the vector (i.e., no gaps and no overlap). I'd prefer to:

  • not make the fragment groups time-aware.
  • leave an open door for general fragmented vector (matrix, etc.) storage rather than just time series.
  • leave an open door for non-SQLite storage (e.g., HDF5).

Assembling a full vector by concatenation requires knowing the length of each fragment. All fragments in a group are the same length. Currently, this length is stored in the metadata but AFAICT not really used; rather, it's deduced by assuming the fragments are a month long and that the interval length is one hour. It is desirable to get the length of a fragment group without opening the database. If it's in the filename, that's kind of a hack, but it doesn't even require stat(2)ing the file.
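
To make the assembly arithmetic concrete, here's a minimal sketch under the assumptions above (equal-length fragments, lexical filename order, no gaps or overlap); the arrays and lengths are made up:

>>> import numpy as np
>>> fragments = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
>>> fragment_length = len(fragments[0])  # identical for every fragment in the group
>>> np.concatenate(fragments)            # full vector: adjacent, no gaps, no overlap
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
>>> # Item i of the full vector lives in fragment i // fragment_length,
>>> # at position i % fragment_length within it.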

Once assembled, to convert it to a time series, we need to know (a) the length of each interval and (b) the offset. In general, the intervals might not start and/or end on a fragment boundary (e.g., intervals of weeks). Currently, we assume intervals are one hour and the offset is zero.

Because of the offset, ISO 8601 durations aren't sufficient to specify the intervals.

Here's a proposal.

(1) Fragment groups remain time-ignorant and deal with simple vectors.

(2) Introduce a metadata file that lives alongside the database files. It stores metadata key/value pairs for the whole dataset, in some format that can be read without introducing any new dependencies. A SQLite database seems congruent, but it's also more of a hassle to deal with.

To start, the metadata elements are:

  • hashmod — number of data tables per database. Currently, this is read from an arbitrary fragment group's metadata. Required.
  • interval — Pandas interval code for the dataset. Required; future work will make it optional.

The Pandas codes can take a multiplier, so I believe this makes them fully general:

>>> pd.period_range('2016-02-01', periods=6, freq='90min')
PeriodIndex(['2016-02-01 00:00', '2016-02-01 01:30', '2016-02-01 03:00',
             '2016-02-01 04:30', '2016-02-01 06:00', '2016-02-01 07:30'],
            dtype='int64', freq='90T')

(3) Filenames are of the form {TIMESTAMP}_{LENGTH}.db. The first field is a UTC ISO 8601 datetime or date (if a date, assume a time of 00:00:00); the second is the fragment length (number of items).

The first interval must start at the timestamp. We can then use period_range() to generate the index. For example:

>>> pd.period_range('2016-02-01', periods=6, freq='H')
PeriodIndex(['2016-02-01 00:00', '2016-02-01 01:00', '2016-02-01 02:00',
             '2016-02-01 03:00', '2016-02-01 04:00', '2016-02-01 05:00'],
            dtype='int64', freq='H')
>>> pd.period_range('2016-02-01', periods=6, freq='D')
PeriodIndex(['2016-02-01', '2016-02-02', '2016-02-03', '2016-02-04',
             '2016-02-05', '2016-02-06'],
            dtype='int64', freq='D')
>>> pd.period_range('2016-01-27', periods=6, freq='W-TUE')
PeriodIndex(['2016-01-27/2016-02-02', '2016-02-03/2016-02-09',
             '2016-02-10/2016-02-16', '2016-02-17/2016-02-23',
             '2016-02-24/2016-03-01', '2016-03-02/2016-03-08'],
            dtype='int64', freq='W-TUE')

period_range() will work even if the timestamp doesn't start the first interval, but I think this would be confusing, so we should disallow it, though I'm OK with calling it undefined behavior. For example:

>>> pd.period_range('2016-02-01', periods=6, freq='W-TUE')
PeriodIndex(['2016-01-27/2016-02-02', '2016-02-03/2016-02-09',
             '2016-02-10/2016-02-16', '2016-02-17/2016-02-23',
             '2016-02-24/2016-03-01', '2016-03-02/2016-03-08'],
            dtype='int64', freq='W-TUE')

(Future work will let the tag be general.)

(4) hashmod and length in fragment group metadata are no longer added, and they are ignored if present.

(5) This issue provides a script or other easy means to upgrade the existing Wikipedia data. Values like hashmod can be hard-coded. The upgrade is non-destructive so the change can be reverted.

@gfairchild (Collaborator, Author)

(0) What exactly is this offset you speak of? I don't see the word "offset" used in timeseries.py. Does this refer to a Pandas anchored offset, or does it perhaps refer to a time zone offset?

(1) Sounds reasonable.

(2) How about a metadata.json file? JSON is easily human- and machine-readable, language-agnostic, and requires relatively little infrastructure to deal with (compared to something like SQLite).
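
For concreteness, a minimal sketch of what that file might contain, using the two keys from your proposal (the hashmod value here is made up; "1H" would be the Wikipedia frequency):

{
    "hashmod": 64,
    "interval": "1H"
}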

Also, you use the word "interval" here, but it's not entirely clear what exactly you mean. Are you referring to the value supplied to the freq parameter (e.g., these aliases)?

(3) I think this makes good sense. Referring to my question above in (2), if indeed you do mean that the interval is the freq value, then:

>>> freq = '1D'  # this would come from metadata.json and would be "1H" for all Wikipedia data
>>> f = '2016-02-01T00:00:00+00:00_29.db'  # the "29" would be "696" for the Wikipedia data
>>> date, periods = f[:-3].split('_')
>>> pd.period_range(date, periods=int(periods), freq=freq)
PeriodIndex(['2016-02-01', '2016-02-02', '2016-02-03', '2016-02-04',
             '2016-02-05', '2016-02-06', '2016-02-07', '2016-02-08',
             '2016-02-09', '2016-02-10', '2016-02-11', '2016-02-12',
             '2016-02-13', '2016-02-14', '2016-02-15', '2016-02-16',
             '2016-02-17', '2016-02-18', '2016-02-19', '2016-02-20',
             '2016-02-21', '2016-02-22', '2016-02-23', '2016-02-24',
             '2016-02-25', '2016-02-26', '2016-02-27', '2016-02-28',
             '2016-02-29'],
            dtype='int64', freq='D')

I believe we're on the same page, but I just want to be clear.

(4) Makes sense.

(5) Where should this migration script live? In bin or perhaps misc?

@reidpr (Owner) commented Mar 2, 2016

(0) It's a Pandas anchored offset, yes. It's an offset because saying data are, e.g., weekly doesn't tell you when the periods start and stop, just how long they are. For daily or hourly data, there's a strong convention about when they start and stop.
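
For example, both of these are weekly, but the anchor changes where the period boundaries fall (dates made up; output formatted like the examples above):

>>> pd.period_range('2016-02-01', periods=2, freq='W-SUN')
PeriodIndex(['2016-02-01/2016-02-07', '2016-02-08/2016-02-14'],
            dtype='int64', freq='W-SUN')
>>> pd.period_range('2016-02-01', periods=2, freq='W-TUE')
PeriodIndex(['2016-01-27/2016-02-02', '2016-02-03/2016-02-09'],
            dtype='int64', freq='W-TUE')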

(2) I agree on JSON vs. SQLite in general. I'm leaning towards SQLite, though, because we already have a metadata reader/writer for it, for use in the fragment group files.

(3) What I'd add here is that dates are OK, so all the zeros indicating midnight can be omitted.

(5) I'd say somewhere outside the code; this issue, maybe? I don't believe there are any data sets other than ours that need to be upgraded. It needn't even be a real script; instructions would do. One option is to make the length filename parameter optional: if the dataset frequency is 1H, the length is computed as the number of hours in that month. The upside is that the Wikipedia data could be upgraded by simply creating the metadata file by hand; the downside is that there are more code paths.
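
If we went that route, the computation itself is small; a sketch (the timestamps are UTC, so the count is just the calendar month length in days):

>>> import calendar
>>> year, month = 2016, 2                     # taken from the fragment's timestamp
>>> calendar.monthrange(year, month)[1] * 24  # days in the month * 24 hours
696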

@gfairchild (Collaborator, Author)

Great, thanks Reid. I'll get this done ASAP.

@gfairchild (Collaborator, Author)

Ok, the question I was trying to remember during last Friday's team meeting was related to point (3) of your proposal: how do we know what fragment length to choose?

Right now, the fragment length is the number of hours in each month, which is 744 for months with 31 days. Suppose we're dealing with daily or weekly data. How many fragments should we store in a single DB?

@reidpr (Owner) commented Apr 19, 2016

You can use any fragment length.

I'm sure there's an optimum length, though it's probably not worth our time at this point to figure out what it is.

@gfairchild (Collaborator, Author)

Okie dokie. I'll just pick something arbitrarily. Thanks, @reidpr.

@gfairchild (Collaborator, Author)

Ok, I've completed this (see PR #112). I'm attaching a simple migration script to this post to migrate the v1 Wikipedia data to the new v2 format (GitHub only allows attachments with certain extensions, so I had to zip it up). I've tested this script on a small subset of the Wikipedia data that I copied locally.

wp-migrate.zip

@gfairchild gfairchild assigned reidpr and unassigned gfairchild May 18, 2016
@reidpr reidpr removed their assignment Oct 19, 2018