Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IO-1746] Introduction of Pagination queries and changes to item_ids endpoint #707

Merged
merged 15 commits into from
Nov 6, 2023

Conversation

Nathanjp91
Copy link
Contributor

@Nathanjp91 Nathanjp91 commented Nov 2, 2023

Problem

item ids collection uses an older endpoint, need to swap it over to the newer list_item_ids endpoint, which then needs to filter into changes on the meta objects that implement these. This however necessitates the creation of a solution to pagination on both the Core/Meta levels.

Solution

Introduce some major changes to accomodate Pagination, including

  • Core data object for keeping track of pages, darwin.future.data_objects.page.Page which has defaults that we use on the API
  • PaginatedQuery extension, which sets up pagination and changes relevant dunders to allow for pagination
    • paginated objects are lazy loading, iterable and will remember the current page, only collecting when needed
    • includes some new helper functions like collect_all to force total collection
    • still allows for indexing, PaginatedQuery[76] will detect the required page to send and collect results if not stored
    • len(PaginatedQuery) will collect_all and is something to note behaviour wise, this is a requirement to allow list(PaginatedQuery) to work, otherwise I would have made it Raise on an uncollected object
    • Changes to underlying Query object required to allow mapping, reflected in changes to dunder functions (but not behaviour, which remains the same). Results now stored as a dictionary map of {index: result} indicating the offset location of the item
  • V7ID object now a wrapper for UUID
    • Required to interact with Query objects, despite being overkill on UUID list.
    • This approach will allow for other interaction later, like deletion of objects via IDs and any extensions we want to add
    • Removal of the Generic[MetaBase] requirement and then just using UUID was a solution here, but causes lots of typing issues and would require a major change to how we structure the code, so was opted for a wrapper approach
  • Anything that uses item_ids prior now uses ItemIDQuery, the first PaginatedQuery object
  • Helper functions added to QueryString objects for adding and combining parameters together
  • Dogfooding changes useful for easier instantiation of objects and testing, creating default clients and other necessary objects when needed.

Changelog

New Item_ids behaviour and introduction of PaginatedQuery Objects inheritable for default pagination behaviour.

Copy link

linear bot commented Nov 2, 2023

IO-1746 Change get item ids to use updated endpoint

get_item_ids currently uses an old endpoint. This should be updated

@@ -166,21 +170,6 @@ def headers(self) -> Dict[str, str]:
http_headers["Authorization"] = f"ApiKey {self.config.api_key}"
return http_headers

@overload
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally unsure why I even had this overload, it didn't matter for the typing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly just experimenting with overloads etc. and got left in?

@@ -83,3 +85,6 @@ def __init__(self, value: Dict[str, str]) -> None:

def __str__(self) -> str:
return "?" + "&".join(f"{k}={v}" for k, v in self.value.items())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful for combining parameters in the core functions when required for pagination which I treat seperately

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, nice addition actually. Gotta love a good dunder.

@@ -104,6 +106,12 @@ def _from_kwarg(cls, key: str, value: str) -> QueryFilter:
modifier = None
return QueryFilter(name=key, param=value, modifier=modifier)

def to_dict(self, ignore_modifier: bool = True) -> Dict[str, str]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly so they can be swapped in and out of QueryStrings for Paginated endpoints so that .where() syntax still works, for now ignores modifiers. I think this is not the most elegant solution though as eventually I think we'll want to probably do both server side query filter and client side once collected

@@ -154,17 +162,16 @@ def __init__(
self.meta_params: dict = meta_params or {}
self.client = client
self.filters = filters or []
self.results: Optional[List[T]] = None
self._changed_since_last: bool = True
self.results: dict[int, T] = {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core change that allows for Pagination, but all existing functions have been swapped over to preserve the outputs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, so you went with the int hashtable style approach. 👍🏻

if "dataset_ids" in self.meta_params
else self.meta_params["dataset_id"]
)
params: QueryString = reduce(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How I use the aforementioned QueryString + QueryString syntax, map reduce them with the page object and all the filters

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that python has itertools, but I'd prefer a fluent interface like any sensible language like Rust

Anyway, good stuff!

Copy link
Contributor

@owencjones owencjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments to address before merge but good stuff

@@ -166,21 +170,6 @@ def headers(self) -> Dict[str, str]:
http_headers["Authorization"] = f"ApiKey {self.config.api_key}"
return http_headers

@overload
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly just experimenting with overloads etc. and got left in?

api_client: ClientCore,
team_slug: str,
dataset_id: Union[str, int],
params: QueryString = QueryString({}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably refactor QueryString to assume {} at some point.

@@ -83,3 +85,6 @@ def __init__(self, value: Dict[str, str]) -> None:

def __str__(self) -> str:
return "?" + "&".join(f"{k}={v}" for k, v in self.value.items())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, nice addition actually. Gotta love a good dunder.

@@ -154,17 +162,16 @@ def __init__(
self.meta_params: dict = meta_params or {}
self.client = client
self.filters = filters or []
self.results: Optional[List[T]] = None
self._changed_since_last: bool = True
self.results: dict[int, T] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, so you went with the int hashtable style approach. 👍🏻

self.client, filters=[*self.filters, filter], meta_params=self.meta_params
)
self.filters.append(filter)
return self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this better, it's much clearer and nicer to read for devs, but is the functionality not slightly different? Calling __class__ will be calling the initialiser and returning a newly (m)allocated class, and return self will return the current class as ref.

I assume, basically, that the difference doesn't matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is slightly different yeah, it is modification in place rather than re-allocation, but I think that's a much more sensible approach here as most people won't be using the add dunder functionality except us anyway as internal functions, and it the modification in place saves time and memory.

self.results = list(self._collect())
return len(self.results)
self.results = {**self.results, **self._collect()}
return len(self.results.keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confident you've done this for a reason, but with self.results being of type Dict[int, T], len(self.results) == len(self.results.keys() doesn't it?

I might be missing something here, or maybe it's a STFU to mypy or similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah you're right that's a simpler call and have pushed a fix

@@ -205,12 +212,12 @@ def __next__(self) -> T:

def __getitem__(self, index: int) -> T:
if not self.results:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this now need to also take into account self._changed_since_last or are we only using that for results we have never had, and assuming they haven't changed in DB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.collect() already accounts for it, which is a wrapper that runs the _collect function that has to be written per meta class but does all the _changed_since_last and other logic

def __init__(
self,
client: ClientCore,
filters: List[QueryFilter] | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use this style of typing union whilst still meeting our version needs? It was introduced in 3.10.

I'm happy if other versions will just ignore this though, because frankly it looks a significant amount nicer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can if we use the from future import annotations at the top, which I do anyway for some imports around classes that return themselves.

def __len__(self) -> int:
if not self.completed:
self.collect_all()
return len(self.results.keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same Q as before

if "dataset_ids" in self.meta_params
else self.meta_params["dataset_id"]
)
params: QueryString = reduce(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that python has itertools, but I'd prefer a fluent interface like any sensible language like Rust

Anyway, good stuff!

@Nathanjp91 Nathanjp91 merged commit d102e24 into master Nov 6, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants