
Added count_documents() implementation to builtin_timeseries #935

Merged: 11 commits, Sep 13, 2023
60 changes: 60 additions & 0 deletions emission/storage/timeseries/builtin_timeseries.py
@@ -440,3 +440,63 @@ def update_data(user_id, key, obj_id, data):
        logging.debug("updating entry %s into timeseries" % new_entry)
        edb.save(ts.get_timeseries_db(key), new_entry)

    def find_entries_count(self, key_list = None, time_query = None, geo_query = None, extra_query_list = None):
        """
        Returns the total number of documents for the given key_list, reported separately for each of the two timeseries DBs.

        Input: Key list with keys from both timeseries DBs = [key1, key2, key3, key4, ...]
            Suppose (key1, key2) are orig_tsdb keys and (key3, key4) are analysis_tsdb keys
        Output: Tuple of lists = (orig_tsdb_count, analysis_tsdb_count)
                               = ([count_key1, count_key2, ...], [count_key3, count_key4, ...])
        orig_tsdb_count and analysis_tsdb_count are lists containing the counts of matching documents
        for each key, considered separately for the specific timeseries DB.

        :param key_list: list of metadata keys we are querying for.
        :param time_query: the time range in which to search the stream
        :param geo_query: the query for a geographical area
        :param extra_query_list: any additional queries to filter out data

        For key_list = None, the total count of all documents is returned for each of the matching timeseries DBs.
        """
        logging.debug("builtin_timeseries.find_entries_count() called")

        orig_tsdb = self.timeseries_db
        analysis_tsdb = self.analysis_timeseries_db

        if key_list == [] or key_list is None:
            key_list = None

        # Segregate orig_tsdb and analysis_tsdb keys
        (orig_tsdb_keys, analysis_tsdb_keys) = self._split_key_list(key_list)

        orig_tsdb_counts = self._get_entries_counts_for_timeseries(orig_tsdb, orig_tsdb_keys, time_query, geo_query, extra_query_list)
        analysis_tsdb_counts = self._get_entries_counts_for_timeseries(analysis_tsdb, analysis_tsdb_keys, time_query, geo_query, extra_query_list)

Contributor:
I am not opposed to this, but I thought that the plan was to only support one key for now so that we could make progress on other server tasks. Can you clarify why that changed?

Contributor Author:

The blank-key input scenario prompted me to support multiple keys.
In this case, I thought of returning the total count of all documents for the specific user, not filtered by keys.

Since this is functionality of the abstract_timeseries class and its subclasses, any call to count documents would happen through objects of this class and should not directly involve any timeseries database.

Now, since we don't know which timeseries database is to be used for counting, my thought was to return the counts of documents from both timeseries databases for the user (original and analysis).

I do realize now that this could still have been achieved with a single key instead of a key_list, but a series of code changes happened in my thought process to accommodate the new behavior and test cases.

The extra time spent modifying the initial requirement could have been used for the other pending tasks.

        return (orig_tsdb_counts, analysis_tsdb_counts)


    def _get_entries_counts_for_timeseries(self, tsdb, key_list, time_query, geo_query, extra_query_list):
        tsdb_queries = []
        tsdb_counts = []

        # For each key in the key list, create a separate query
        if key_list is not None:
            for key in key_list:
                tsdb_query = self._get_query([key], time_query, geo_query, extra_query_list)
                tsdb_queries.append(tsdb_query)
            # For each per-key query, fetch the count of matching documents
            for query in tsdb_queries:
                entries_count = tsdb.count_documents(query)
                tsdb_counts.append(entries_count)
        else:
            tsdb_query = self._get_query(key_list, time_query, geo_query, extra_query_list)
            entries_count = tsdb.count_documents(tsdb_query)
            tsdb_counts = [entries_count]

        return tsdb_counts


70 changes: 70 additions & 0 deletions emission/tests/storageTests/TestTimeSeries.py
@@ -81,6 +81,76 @@ def testExtraQueries(self):
        with self.assertRaises(AttributeError):
            list(ts.find_entries(time_query=tq, extra_query_list=[ignored_phones]))

    def testFindEntriesCount(self):
        '''
        Test: Specific keys, with the other parameters left at their default values.
        Input: For each dataset: ["background/location", "background/filtered_location", "analysis/confirmed_trip"]
            - Tested with the sample datasets: "shankari_2015-aug-21", "shankari_2015-aug-27"
        Output: Aug_21: ([738, 508], [0]), Aug_27: ([555, 327], [0])
            - Each element of the actual output is a single number: the count of entries for one key.
            - Validated using grep counts of occurrences for the keys: 1) "background/location" 2) "background/filtered_location"
            - $ grep -c <key> <dataset>.json
Contributor:

can you please add the grep outputs here?


        For the Aggregate Timeseries test case:
            - The expected output is the summed-up values for the respective keys from the individual users' outputs mentioned above.
            - Output: ([1293, 835], [0])
            - For each of the 3 input keys from key_list1:
                - 1293 = 738 (UUID1) + 555 (UUID2)
                - 835 = 508 (UUID1) + 327 (UUID2)
                - 0 = 0 (UUID1) + 0 (UUID2)
        '''

        ts1_aug_21 = esta.TimeSeries.get_time_series(self.testUUID1)
        ts2_aug_27 = esta.TimeSeries.get_time_series(self.testUUID)

        # Test case: Combination of original and analysis timeseries DB keys for Aug-21 dataset
        key_list1 = ["background/location", "background/filtered_location", "analysis/confirmed_trip"]
        count_ts1 = ts1_aug_21.find_entries_count(key_list=key_list1)
        self.assertEqual(count_ts1, ([738, 508], [0]))
shankari marked this conversation as resolved.

        # Test case: Combination of original and analysis timeseries DB keys for Aug-27 dataset
        key_list1 = ["background/location", "background/filtered_location", "analysis/confirmed_trip"]
        count_ts2 = ts2_aug_27.find_entries_count(key_list=key_list1)
        self.assertEqual(count_ts2, ([555, 327], [0]))
shankari marked this conversation as resolved.

        # Test case: Only original timeseries DB keys for Aug-27 dataset
        key_list2 = ["background/location", "background/filtered_location"]
        count_ts3 = ts2_aug_27.find_entries_count(key_list=key_list2)
        self.assertEqual(count_ts3, ([555, 327], []))
Contributor:

And here it's an empty array. Why the inconsistency?

Contributor Author:

An empty array is returned when there were no keys pertaining to the respective timeseries database.
In this specific test case, only background keys are present, hence only their counts are returned.
No analysis keys are present in the input, hence an empty array is returned for that database.

This is to differentiate it from the [0] case, where a key is present in the input but no matching documents are found.
In the [] case, no key for that database was present in the input at all.

Contributor (@shankari, Sep 7, 2023):

That is an interesting design choice, but I am not sure I agree with it. This seems:

  1. Very tied to this specific implementation: e.g. the difference between timeseries and analysis_timeseries. Whenever you are designing an abstract interface, think about what it would look like if the underlying implementation were completely different. What if we stored everything in giant CSV files instead? What if we used a real timeseries database instead? What would the meaning of timeseries and analysis_timeseries be in that case?
  2. Unnecessarily complex to use: again, think of the user of the interface, who is using the timeseries abstraction that we have outlined in chapter 5 of my thesis. They want to know how many entries there are in each stream. Do they want to know where the entries came from? Do they want to know the details of our implementation? Do they want to know the difference between the key being present but there being zero entries and the key being absent? No. We shouldn't make them do that. Note that with this implementation, we don't even support just adding the two sets of counts together because they may not have the same number of entries.

Can you articulate the pros (if any) of this interface?

Contributor Author (@MukuFlash03, Sep 7, 2023):

Thank you for your feedback on the design.

Yes, I do agree that it is too tightly based on the current implementation, especially with regard to the two types of timeseries that we currently have.

In terms of reducing the complexity, it again comes down to supporting just a single key as per the initial requirement. This would resolve many of the varying output scenarios that the current implementation has, such as the [] vs [0] case.

Motivation for this design:
In the original issue's sample code usage here, I saw that the data was fetched first and then count was called. However, I designed this keeping in mind that the user would not have to explicitly state which timeseries they want a count for.
I noticed that the key could be used to identify which timeseries is required and hence went ahead with it.

Pending issue:
Now, in this design scenario, there was the case of blank keys being passed.
How would we decide which timeseries count was to be returned to the user in this case?
Hence, I went ahead with returning the total count of the two available timeseries.
This changed my output from a single value to two separate values, one for each timeseries.

Possible resolution:

  • Stick to a single key and return its count.
  • Need to decide how blank keys are handled AND,
  • Which timeseries data is to be considered if the user wants a count of all documents without specifying a key?

Also, I am a bit unclear what you mean by this - "Note that with this implementation, we don't even support just adding the two sets of counts together because they may not have the same number of entries."

Contributor (@shankari, Sep 7, 2023):

The design should be driven by what is easy for users to use and generalizable, not by what is easy to implement.

> Motivation for this design:

Motivation for which design? Single key or multi-key?

> How would we decide what timeseries count was to be returned to the user for this case?

I don't see why this is a problem. What does find_entries do if the user doesn't specify a key? Do the same thing for counts.

> Also, I am a bit unclear what you mean by this - "Note that with this implementation, we don't even support just adding the two sets of counts together because they may not have the same number of entries."

If you always returned the same number of entries (e.g. if ['background/location', 'background/filtered_location'] returned ([x, y], [0, 0])), then we could do np.array(retVal[0]) + np.array(retVal[1]) to get the total counts. But we cannot do that with your current design.

Concretely, after this change, what will the op-admin-dashboard code that is changed to use this look like?
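The element-wise summing described here can be sketched in a few lines (a minimal sketch; the count values and the equal-length return shape are illustrative assumptions, not the PR's current behavior):

```python
import numpy as np

# Hypothetical return value where both per-collection lists have one
# entry per requested key, padding missing keys with 0
ret_val = ([738, 508], [0, 0])

# Element-wise total across the two internal collections
total_counts = np.array(ret_val[0]) + np.array(ret_val[1])
print(total_counts.tolist())  # [738, 508]
```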

Contributor:

This is why it is good, when you come to a decision point like this, to document the alternatives, think through the pros and cons, and document your decision instead of just refactoring.

Contributor Author (@MukuFlash03, Sep 8, 2023):

> Motivation for which design? single key or multi-key?

Single key, initially.
The approach of fetching timeseries data using keys is valid even in the multi-key case.
I moved to multi-key since I followed how find_entries was handling multiple keys.


> > How would we decide what timeseries count was to be returned to the user for this case?
>
> I don't see why this is a problem. What does find_entries do if the user doesn't specify a key? Do the same things for counts.

Right, so I have done what find_entries is doing - in the case of blank keys, fetch the total count of each specific dataset, which itself is done via a helper function _get_entries_for_timeseries.
Here, _get_entries_for_timeseries (used in find_entries) adds up all counts for keys belonging to a dataset to give a single count.
Ex. ['background/location', 'background/filtered_location'] output would be [x+y]

The difference in my implementation of _get_entries_counts_for_timeseries is that I am returning individual counts for each key separately, as elements of a combined list.
Ex. ['background/location', 'background/filtered_location'] = [x, y]

This is done for each of the two timeseries, and their respective query results are returned in find_entries.
Similarly, I am returning the respective counts in find_entries_count.
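The contrast described here, between the summed count computed inside find_entries and the per-key counts this PR returns, can be illustrated with placeholder values (x and y are illustrative, not real dataset counts):

```python
x, y = 555, 327  # placeholder counts for two keys in one timeseries DB

# find_entries-style helper: counts are summed into a single total per DB
summed = [x + y]

# This PR's find_entries_count: one count per key, per DB
per_key = [x, y]

print(summed)   # [882]
print(per_key)  # [555, 327]
```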

Contributor Author (@MukuFlash03, Sep 8, 2023):


> Concretely, after this change, what will the op-admin-dashboard code that is changed to use this look like?

With reference to the sample code in the original issue here:

    total_trips = edb.get_analysis_timeseries_db().count_documents(arguments include userid and keys... )

Now, the count can be fetched by:

    ts_user = esta.get_time_series(UUID(user_id))
    key_list = ['analysis/confirmed_trip']
    total_trips = ts_user.find_entries_count(key_list=key_list)[1][0]

Contributor Author (@MukuFlash03, Sep 8, 2023):


> > Also, I am a bit unclear what you mean by this - "Note that with this implementation, we don't even support just adding the two sets of counts together because they may not have the same number of entries."
>
> If you always returned the same number of entries (e.g. if ['background/location', 'background/filtered_location'] returned ([x, y], [0, 0])), then we could do np.array(retVal[0]) + np.array(retVal[1]) to get the total counts. But we cannot do that with your current design.

Alright, thank you for the clarification. I do get your point.
This would be a small change to match up the number of entries in the output.

However, I want to point out that the two lists [x, y] and [0, 0] would refer to different timeseries datasets with varying keys.
For instance, if the 2nd list also had non-zero values, say [a, b], corresponding to some analysis_timeseries keys, each element of the 1st list refers to a separate key that will not match the key from the 2nd list.
Would this be appropriate: x (= background/location) + a (= analysis/some_key) = x + a?

Contributor (@shankari, Sep 8, 2023):

> Right, so I have done what find_entries is doing - in case of blank keys, fetch the total count of each specific dataset, which itself is done via a helper function _get_entries_for_timeseries.

This is the implementation of find_entries. The interface of find_entries is that if you give it blank keys, you get a combined list of all entries in the time range. The count is not even exposed outside the implementation. The interface of find_entries_count should be the same as the interface of find_entries, which is that it should return the number of matching entries for that time range.

> This is done for each of the two timeseries and their respective query results are returned in find_entries.

The respective query results are not returned in find_entries. A combined query result is returned.
See itertools.chain(orig_ts_db_result, analysis_ts_db_result).
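The combined-result behavior of find_entries referenced here can be illustrated with a minimal itertools.chain sketch (the placeholder dicts stand in for real timeseries documents):

```python
import itertools

# Placeholder query results from the two internal collections
orig_ts_db_result = [{"metadata.key": "background/location"},
                     {"metadata.key": "background/filtered_location"}]
analysis_ts_db_result = [{"metadata.key": "analysis/confirmed_trip"}]

# find_entries exposes one combined iterable, hiding the two-collection split
entries = list(itertools.chain(orig_ts_db_result, analysis_ts_db_result))
print(len(entries))  # 3
```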

> However, I want to point out that the two lists [x,y] and [0,0] would refer to different timeseries datasets with varying keys.

I fail to see who would be interested in the implementation detail that we have two timeseries collections internally, or why they would need to know which key is stored while accessing the result.

    ts_user = esta.get_time_series(UUID(user_id))
    key_list = ['analysis/confirmed_trip']
    total_trips = ts_user.find_entries_count(key_list=key_list)[1][0]

and if it were location, we would need:

    ts_user = esta.get_time_series(UUID(user_id))
    key_list = ['background/location']
    total_trips = ts_user.find_entries_count(key_list=key_list)[0][0]

Can you see how that is leaking the implementation outside the interface?
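A hedged sketch of the single-number interface being argued for here (find_entries_count_simple is a hypothetical helper, not code from this PR; it collapses the per-collection, per-key counts into one total):

```python
def find_entries_count_simple(orig_counts, analysis_counts):
    # Hypothetical: callers get one number, regardless of which
    # internal collection each key happens to live in
    return sum(orig_counts) + sum(analysis_counts)

# The caller no longer indexes into ([...], [...]) by collection
total_trips = find_entries_count_simple([555, 327], [0])
print(total_trips)  # 882
```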


        # Test case: Only analysis timeseries DB keys
        key_list3 = ["analysis/confirmed_trip"]
        count_ts4 = ts2_aug_27.find_entries_count(key_list=key_list3)
        self.assertEqual(count_ts4, ([], [0]))
Contributor:

Expected interface:

Suggested change:
-        self.assertEqual(count_ts4, ([], [0]))
+        self.assertEqual(count_ts4, 0)


        # Test case: Empty key_list, which should return the total count of all documents in the two DBs
        key_list4 = []
        count_ts5 = ts1_aug_21.find_entries_count(key_list=key_list4)
        self.assertEqual(count_ts5, ([2125], [0]))
Contributor:

Suggested change:
-        self.assertEqual(count_ts5, ([2125], [0]))
+        self.assertEqual(count_ts5, 2125)


        # Test case: Invalid or unmatched key in metadata field
        key_list5 = ["randomxyz_123test"]
        with self.assertRaises(KeyError) as ke:
            count_ts6 = ts1_aug_21.find_entries_count(key_list=key_list5)
        self.assertEqual(str(ke.exception), "'randomxyz_123test'")

        # Test case: Aggregate timeseries DB user data passed as input
        ts_agg = esta.TimeSeries.get_aggregate_time_series()
        count_ts7 = ts_agg.find_entries_count(key_list=key_list1)
        self.assertEqual(count_ts7, ([1293, 835], [0]))
Contributor:

Suggested change:
-        self.assertEqual(count_ts7, ([1293, 835], [0]))
+        self.assertEqual(count_ts7, 2125)


        # Test case: New user created with no data to check
        self.testEmail = None
        self.testUUID2 = self.testUUID
        etc.createAndFillUUID(self)
        ts_new_user = esta.TimeSeries.get_time_series(self.testUUID)
        count_ts8 = ts_new_user.find_entries_count(key_list=key_list1)
        self.assertEqual(count_ts8, ([0, 0], [0]))

        print("Assert Test for Count Data successful!")


if __name__ == '__main__':
    import emission.tests.common as etc
    etc.configLogging()