Added count_documents() implementation to builtin_timeseries #935
@@ -81,6 +81,76 @@ def testExtraQueries(self):
        with self.assertRaises(AttributeError):
            list(ts.find_entries(time_query=tq, extra_query_list=[ignored_phones]))

    def testFindEntriesCount(self):
        '''
        Test: Specific keys with other parameters not passed values.
        Input: For each dataset: ["background/location", "background/filtered_location", "analysis/confirmed_trip"]
        - Testing this with sample dataset: "shankari_2015-aug-21", "shankari_2015-aug-27"
        Output: Aug_21: ([738, 508], [0]), Aug_27: ([555, 327], [0])
        - Actual output just returns a single number for count of entries.
        - Validated using grep count of occurrences for keys: 1) "background/location" 2) "background/filtered_location"
        - $ grep -c <key> <dataset>.json
can you please add the

        For Aggregate Timeseries test case:
        - The expected output would be summed-up values for the respective keys from the individual users testing outputs mentioned above.
        - Output: ([1293, 835], [0])
        - For each of the 3 input keys from key_list1:
        - 1293 = 738 (UUID1) + 555 (UUID2)
        - 835 = 508 (UUID1) + 327 (UUID2)
        - 0 = 0 (UUID1) + 0 (UUID2)

        '''

        ts1_aug_21 = esta.TimeSeries.get_time_series(self.testUUID1)
        ts2_aug_27 = esta.TimeSeries.get_time_series(self.testUUID)

        # Test case: Combination of original and analysis timeseries DB keys for Aug-21 dataset
        key_list1=["background/location", "background/filtered_location", "analysis/confirmed_trip"]
        count_ts1 = ts1_aug_21.find_entries_count(key_list=key_list1)
        self.assertEqual(count_ts1, ([738, 508], [0]))
shankari marked this conversation as resolved.

        # Test case: Combination of original and analysis timeseries DB keys for Aug-27 dataset
        key_list1=["background/location", "background/filtered_location", "analysis/confirmed_trip"]
        count_ts2 = ts2_aug_27.find_entries_count(key_list=key_list1)
        self.assertEqual(count_ts2, ([555, 327], [0]))
shankari marked this conversation as resolved.

        # Test case: Only original timeseries DB keys for Aug-27 dataset
        key_list2=["background/location", "background/filtered_location"]
        count_ts3 = ts2_aug_27.find_entries_count(key_list=key_list2)
        self.assertEqual(count_ts3, ([555, 327], []))
And here it's an empty array. Why the inconsistency?

Empty array is returned in case there were no keys pertaining to the respective timeseries database. This is to differentiate from the [0] case where a key might be present in the input but no matching documents found.

That is an interesting design choice, but I am not sure I agree with it. This seems to:
Can you articulate the pros (if any) of this interface?

Thank you for your feedback on the design. Yes, I do agree that it is too tightly based on the current implementation, especially with regard to the two types of timeseries that we currently have. In terms of reducing the complexity, it again comes down to supporting just a single key as per the initial requirement. This would resolve many of the varying output scenarios that the current implementation has, such as the [] vs [0] case.
Motivation for this design:
Pending issue:
Possible resolution:
Also, I am a bit unclear what you mean by this - "Note that with this implementation, we don't even support just adding the two sets of counts together because they may not have the same number of entries."

The design should be driven by what is easy for users to use and generalizable. Not by what is easy to implement.
Motivation for which design? single key or multi-key?
I don't see why this is a problem. What does
If you always returned the same number of entries (e.g. if
Concretely, after this change, what will the

This is why it is good, when you come to a decision point like this, to document the alternatives, think through the pros and cons, and document your decision instead of just refactoring.

Single key initially.
Right, so I have done what
The difference in my implementation of
This is done for each of the two timeseries and their respective query results are returned in

With reference to sample code in original issue here:
Now, the count can be fetched by:

Alright, thank you for the clarification. I do get your point. However, I want to point out that the two lists [x,y] and [0,0] would refer to different timeseries datasets with varying keys.

This is the implementation of
The respective query results are not returned in
I fail to see who would be interested in the implementation detail that we have two timeseries collections internally, or why they would need to know which key is stored while accessing the result.
and if it were location, we would need:
can you see how that is leaking the implementation outside the interface?
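
To make the concern in this thread concrete, here is a minimal sketch of how a caller might have to index into the tuple-of-lists result, compared with a single integer. Everything in it is an assumption for illustration: `count_split`, `count_total`, and the two key sets are hypothetical stand-ins, not the emission API, and the code snippets that were elided from the comments above may have looked different. The per-key counts are the Aug-21 values from the test docstring.

```python
# Hypothetical stand-ins for illustration only; not the emission API.
ORIGINAL_KEYS = {"background/location", "background/filtered_location"}
ANALYSIS_KEYS = {"analysis/confirmed_trip"}

# Per-key counts for the Aug-21 sample, taken from the test docstring.
AUG_21_COUNTS = {
    "background/location": 738,
    "background/filtered_location": 508,
    "analysis/confirmed_trip": 0,
}


def count_split(counts_by_key, key_list):
    """Return (original_counts, analysis_counts), the shape under review."""
    original = [counts_by_key[k] for k in key_list if k in ORIGINAL_KEYS]
    analysis = [counts_by_key[k] for k in key_list if k in ANALYSIS_KEYS]
    return (original, analysis)


def count_total(counts_by_key, key_list):
    """Return one integer, hiding which collection stores each key."""
    return sum(counts_by_key[k] for k in key_list)


# With the split shape, the caller must know where each key is stored
# in order to pick the right element of the tuple.
trips = count_split(AUG_21_COUNTS, ["analysis/confirmed_trip"])[1][0]
locations = count_split(AUG_21_COUNTS, ["background/location"])[0][0]

# With a single integer, the call looks the same for every key.
trips_flat = count_total(AUG_21_COUNTS, ["analysis/confirmed_trip"])
locations_flat = count_total(AUG_21_COUNTS, ["background/location"])
```

In this sketch the index pair (`[1][0]` versus `[0][0]`) changes with the storage location of the key, which is one way to read the "leaking the implementation" objection; the flat variant needs no such knowledge.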

        # Test case: Only analysis timeseries DB keys
        key_list3=["analysis/confirmed_trip"]
        count_ts4 = ts2_aug_27.find_entries_count(key_list=key_list3)
        self.assertEqual(count_ts4, ([], [0]))
expected interface:
Suggested change

        # Test case: Empty key_list which should return total count of all documents in the two DBs
        key_list4=[]
        count_ts5 = ts1_aug_21.find_entries_count(key_list=key_list4)
        self.assertEqual(count_ts5, ([2125], [0]))
Suggested change

        # Test case: Invalid or unmatched key in metadata field
        key_list5=["randomxyz_123test"]
        with self.assertRaises(KeyError) as ke:
            count_ts6 = ts1_aug_21.find_entries_count(key_list=key_list5)
        self.assertEqual(str(ke.exception), "'randomxyz_123test'")

        # Test case: Aggregate timeseries DB User data passed as input
        ts_agg = esta.TimeSeries.get_aggregate_time_series()
        count_ts7 = ts_agg.find_entries_count(key_list=key_list1)
        self.assertEqual(count_ts7, ([1293, 835], [0]))
Suggested change

        # Test case: New User created with no data to check
        self.testEmail = None
        self.testUUID2 = self.testUUID
        etc.createAndFillUUID(self)
        ts_new_user = esta.TimeSeries.get_time_series(self.testUUID)
        count_ts8 = ts_new_user.find_entries_count(key_list=key_list1)
        self.assertEqual(count_ts8, ([0, 0], [0]))

        print("Assert Test for Count Data successful!")


if __name__ == '__main__':
    import emission.tests.common as etc
    etc.configLogging()
I am not opposed to this, but I thought that the plan was to only support one key for now so that we could make progress on other server tasks. Can you clarify why that changed?

The blank key input scenario prompted me to support multiple keys.
In that case, I thought of returning the total count of all documents for the specific user, not filtered by key.
Since this is functionality of the abstract_timeseries class and its subclasses, any call to count documents would go through objects of this class and should not directly involve any timeseries database.
Since we don't know which timeseries database is to be used for counting, my thought was to return the count of documents from both timeseries databases for the user (original and analysis).
I do realize now that this could still have been achieved with a single key instead of key_list, but my approach led to a series of code changes and new test cases.
The extra time spent on modifying the initial requirement could have been used for the other pending tasks.
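
For reference, the point that the blank-key case does not require a key list could be sketched roughly as follows. This is only an illustration under stated assumptions: `count_entries` is a hypothetical name, the two collections are modelled as plain in-memory lists, each entry is assumed to carry its key under `metadata.key`, and the data is made up rather than taken from the sample datasets. Unlike the test in this PR, the sketch returns 0 for an unknown key instead of raising `KeyError`.

```python
def count_entries(original_docs, analysis_docs, key=None):
    """Count a user's entries across both collections.

    key=None means "no key filter", so every document is counted.
    All names here are illustrative, not part of the emission code base.
    """
    all_docs = list(original_docs) + list(analysis_docs)
    if key is None:
        return len(all_docs)
    return sum(1 for doc in all_docs if doc["metadata"]["key"] == key)


# Tiny made-up data set, not the real sample data.
original = [
    {"metadata": {"key": "background/location"}},
    {"metadata": {"key": "background/location"}},
    {"metadata": {"key": "background/filtered_location"}},
]
analysis = [{"metadata": {"key": "analysis/confirmed_trip"}}]

total = count_entries(original, analysis)  # no key: count everything
locations = count_entries(original, analysis, key="background/location")
missing = count_entries(original, analysis, key="randomxyz_123test")
```

The caller always gets a single integer back, whichever collection stores the key and whether or not a key is given.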