Add support for timeseries of arbitrary granularities #111
This solution would also allow fairly simple modification of the current Wikipedia data: instead of having to recreate everything, we'd just need to add a row to each month's `metadata` table.
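A minimal sketch of that migration step, assuming the `metadata` table is a simple key/value store (the schema, key name, and filename here are hypothetical, not the project's actual layout):

```python
import sqlite3

# Hypothetical per-month fragment-group file, with an assumed key/value metadata schema.
conn = sqlite3.connect("enwiki_2020-03.db")
conn.execute("INSERT INTO metadata (key, value) VALUES (?, ?)",
             ("granularity", "P1D"))  # ISO 8601 duration string for daily data
conn.commit()
conn.close()
```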
A few somewhat well organized opinions:

Right now, fragment groups don't have any notion of time. The only assumption is that the filenames are lexically increasing and that adjacent filenames contain adjacent fragments of the vector (i.e., no gaps and no overlap). I'd prefer to:

Assembling a full vector by concatenation requires knowing the length of each fragment. All fragments in a group are the same length. Currently, this length is stored in the metadata but AFAICT not really used; rather, it's deduced by assuming the fragments are a month long and that the interval length is one hour. It is desirable to get the length of a fragment group without opening the database. If it's in the filename, that's kind of a hack, but it doesn't even require

Once assembled, to convert the vector to a time series, we need to know (a) the length of each interval and (b) the offset. In general, the intervals might not start and/or end on a fragment boundary (e.g., intervals of weeks). Currently, we assume intervals are one hour and the offset is zero. Because of the offset, ISO 8601 durations aren't sufficient to specify the intervals.

Here's a proposal.

(1) Fragment groups remain time-ignorant and deal with simple vectors.

(2) Introduce a file. To start, the metadata elements are:

The Pandas codes can take a multiplier, so I believe this makes them fully general:
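A minimal sketch of what such codes look like in pandas; the specific strings here are illustrative, not the proposed metadata values:

```python
from pandas.tseries.frequencies import to_offset

# Offset aliases accept an integer multiplier, so one short string can
# describe hourly, every-6-hours, daily, weekly, or anchored-weekly intervals.
for code in ["1H", "6H", "1D", "7D", "W-MON", "2W-SUN"]:
    print(code, "->", repr(to_offset(code)))
```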
(3) Filenames are of the form The first interval must start at the timestamp. We can then use (Future work will let the tag be general.)

(4)

(5) This issue provides a script or other easy means to upgrade the existing Wikipedia data. Data like
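A rough sketch of how (2) and (3) might fit together, under assumed names and an assumed filename layout; none of this is the actual format:

```python
import numpy as np
import pandas as pd

# Assumed pieces, recoverable without opening the database:
start = pd.Timestamp("2020-03-01")   # timestamp parsed from the filename
length = 31                          # fragment length, also from the filename
freq = "1D"                          # interval code stored in the group metadata

fragment = np.zeros(length)          # stand-in for one assembled fragment
series = pd.Series(fragment,
                   index=pd.date_range(start, periods=length, freq=freq))
print(series.index[0], series.index[-1])
```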
(0) What exactly is this offset you speak of? I don't see the word "offset" used in

(1) Sounds reasonable.

(2) How about a JSON file instead? Also, you use the word "interval" here, but it's not entirely clear what exactly you mean. Are you referring to the value supplied to the

(3) I think this makes good sense. Referring to my question above in (2), if indeed you do mean that the interval is the I believe we're on the same page, but I just want to be clear.

(4) Makes sense.

(5) Where should this migration script live? In
(0) It's a Pandas anchored offset, yes. "Offset" because saying data are e.g. weekly doesn't tell you when the periods start and stop, just how long they are. For e.g. daily or hourly data, there's a strong convention about when they start and stop.

(1) I agree on JSON vs. SQLite in general. I'm leaning towards SQLite, though, because we already have a metadata reader/writer for it, for use in the fragment group files.

(3) What I'd add here is that dates are OK, so all the zeros to indicate midnight can be omitted.

(5) I'd say somewhere outside the code; this issue, maybe? I don't believe there are any data sets that need to be upgraded other than ours. It could even not be a real script, just instructions. One option is to let the length filename parameter be optional, and if the dataset frequency is
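A quick illustrative example of the anchoring point, using made-up dates: the same nominal "weekly" frequency lands on different days depending on the anchor.

```python
import pandas as pd

print(pd.date_range("2020-03-01", periods=3, freq="W-SUN"))
# -> 2020-03-01, 2020-03-08, 2020-03-15 (Sundays)
print(pd.date_range("2020-03-01", periods=3, freq="W-MON"))
# -> 2020-03-02, 2020-03-09, 2020-03-16 (Mondays)
```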
Great, thanks Reid. I'll get this done ASAP.

Ok, the question I was trying to remember during last Friday's team meeting was related to point (3) of your proposal: how do we know what fragment length to choose? Right now, the fragment length is the number of hours in each month, which is 744 for months with 31 days. Suppose we're dealing with daily or weekly data. How many fragments should we store in a single DB?

You can use any fragment length. I'm sure there's an optimum length, though it's probably not worth our time at this point to figure out what it is.

Okie dokie. I'll just pick something arbitrarily. Thanks, @reidpr.
Ok, I've completed this (see PR #112). I'm attaching a simple migration script to this post to migrate the v1 Wikipedia data to the new v2 format (note that GitHub only allows attachments with certain file extensions, so I had to zip the script up to comply). I've tested this script on a small subset of the Wikipedia data that I copied locally.
Related to #110, we should add support for granularities other than hourly. The CDC data are daily. I think some low-hanging fruit would be to allow the user to provide ISO 8601 duration strings and store them in each month's `metadata` table. Unfortunately, we would need to enforce a rule that users can't use granularities larger than 1 month, but using arbitrary granularities between ε and 1 month is a lot more useful than being restricted to hourly. @reidpr, what do you think about this?
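For a concrete sense of what this could look like, a sketch under the assumption that pandas does the parsing and the one-month cap is approximated as 31 days:

```python
import pandas as pd

# ISO 8601 duration strings parse directly to Timedeltas.
for dur in ["PT1H", "P1D", "P7D"]:
    delta = pd.Timedelta(dur)
    # Enforce the "no larger than one month" rule, approximated as 31 days.
    assert delta <= pd.Timedelta(days=31), f"{dur} is too coarse"
    print(dur, "->", delta)
```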