Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Config / init refactor for state encapsulation #220

Merged
merged 10 commits into from
Oct 20, 2023
Merged

Conversation

jhamon
Copy link
Collaborator

@jhamon jhamon commented Oct 17, 2023

Problem

Singleton config limits how people can use this SDK and makes testing more difficult. Also, we no longer should fetch projectId and construct data url out of configuration parts, to keep the client agnostic to the format of the data plane url.

Solution

  • State encapsulation using classes
    • We want to encapsulate all config-related state into a new class Config and a subclass PineconeConfig.
    • Config is a vessel for api_key, host, and openapi configuration. It is host-agnostic because it is used by both the control plane (Pinecone) and the data plane (Index)
    • PineconeConfig is a convenience class to set the controller host for Pinecone.
    • New singleton store IndexHostStore means we won't make unnecessary and repetitious calls to describe_index if a user creates multiple Index instances referring to the same index. In the old paradigm, this was accomplished with a piece of global state storing off projectId.
  • Modified class init for Index
    • Index class constructor now requires host url instead of an index name. If you use the helper method on a Pinecone class instance to instantiate the object, it will manage fetching the host url for you. E.g. p = Pinecone(api_key='foo'); index = p.Index('my-index').
# Old way
import pinecone
pinecone.init(api_key='foo')
index = pinecone.Index('my-index')

# New way
from pinecone import Pinecone
client = Pinecone(api_key='foo')
index = client.Index('my-index')

# New way (alternate, for data-plane only use case)
from pinecone import Index
index = Index(api_key='foo', host='https://data-url')
  • Moving things around
    • manage.py got moved and renamed to pinecone/control/pinecone.py. The functions such as create_index, describe_index, etc that previously were top-level package exports relying on global state to access config information have been turned into class methods on a new Pinecone class defined in this file.
    • pinecone/index.py got moved to pinecone/data/index.py. The pinecone/data package is a home for all classes used to access the data plane.
    • Everything grpc related got shoved into pinecone/data/grpc.
    • GRPC mess in one place only. Utils that were grpc-related got pulled out of pinecone/utils.py and moved to pinecone/data/grpc/utils.py.
    • Leftover util junk moved into a deprecated package. Probably will be deleted soon.
  • Breaking down big files into small files
    • The file previously at pinecone/index_grpc.py to split up into many smaller files inside pinecone/data/grpc/ package.
    • Remaining util functions in pinecone/utils.py got broken up into many smaller files in pinecone/utils/
  • Fix existing tests
    • Needed to adjust a lot of import statements and sometimes setup functions to get pre-existing tests running.

Using it

# We're going to use a new Pinecone class instead of interacting directly with the package 
# namespace as in the past
from pinecone import Pinecone

# No more init() method
p = Pinecone(api_key='foo')

# Target an index
index = p.Index('my-index')

# Use data plane methods
index.describe_index_stats()

Mostly working

  • describe_index (call succeeds but need to follow up on source_collection field, would ideally return different object for different capacity_modes)

Somewhat working:

  • create_index (API call succeeds, but client blows up when response is not the expected shape)
  • delete_index (only when timeout=-1, since otherwise it has dependency on list_indexes)

Broken:

  • list_indexes (response not expected shape)

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Test Plan

Automated tests will be expanded in a future PR. A bunch of code moved around to different files or got split from big files into small files and I want to merge this part of the change to unlock parallel work by Austin various follow-ups.

@jhamon jhamon requested a review from austin-denoble October 19, 2023 17:28
@jhamon jhamon changed the title [WIP] Config / init refactor Config / init refactor for state encapsulation Oct 19, 2023
QueryResult = None
"""@private"""
FetchResult = None
"""@private"""
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have context on these, but I can't imagine any scenario where returning a bunch of None objects meaningfully preserves backwards compatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'm not really sure what this was for. I think it's probably fine to pull them out. Otherwise it would just be someone defining behavior themselves for these things anyways.

Comment on lines 9 to 12
try:
from .grpc.index_grpc import *
from .data.grpc.index_grpc import *
except ImportError:
pass # ignore for non-[grpc] installations
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These try import statements led to some pretty confusing debug situations when stuff started failing due to me moving packages from one location to another since the intention is really only to ignore failures caused by people installing without grpc extras. We should revisit this pattern and make sure we're testing stuff both with and without grpc deps loaded.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved these constants out of old utils.py since they were only being used in one place.

The rest of this I pulled out of We need a whole ticket to revisit and overhaul the logging approach in a holistic way.

Copy link
Collaborator Author

@jhamon jhamon Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything in here was pretty much copy/pasta out of the old config.py but wrapped inside a new factory class. I haven't done anything to verify these settings for openapi, but they worked in the past so hopefully they are sufficient for now.


class PineconeConfig(Config):
def __init__(self, api_key: str = None, host: str = None, **kwargs):
host = host or kwargs.get("host") or os.getenv("PINECONE_CONTROLLER_HOST") or DEFAULT_CONTROLLER_HOST
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we want to keep this or not, but in the past there was an env var called PINECONE_CONTROLLER_HOST being used.

But as you can see, the only difference between a Config instance and the PineconeConfig subclass is a little bit of logic here about how to set the host value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure whether the env var should be maintained, but I suppose it's a nice convenience and not too much trouble to keep around.

Comment on lines +5 to +15
class SingletonMeta(type):
_instances = {}

def __call__(cls, *args, **kwargs):
if cls not in cls._instances:
instance = super().__call__(*args, **kwargs)
cls._instances[cls] = instance
return cls._instances[cls]


class IndexHostStore(metaclass=SingletonMeta):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confess, I had help from ChatGPT on this. I still need to add tests to verify the singleton is working as expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure I was using GPT for examples of Python singletons and this is the approach it gave me as well.

@@ -0,0 +1,7 @@
class CollectionDescription(object):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulled this out of legacy manage.py. Not sure if we really want to keep it or not long term.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure of this also. I also wasn't sure why it's doing an anonymous mapping to a dict rather than having a defined set of keys like in IndexDescription(NamedTuple).

Would these classes ultimately be useful to external users?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether it's a dictionary or an extention of NamedTuple, the usual motivation for a presenter object like is just to have greater control over what is shown to the user and discard any unwanted fields that may be emitted by the underlying generated code. I don't see why that's needed in this case though. Probably I will delete it soon but not in this diff.

@@ -0,0 +1,13 @@
from typing import NamedTuple
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulled this out of legacy manage.py. Not sure if we really want to keep it or not long term.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the logic in this file moved from manage.py

db = response.database
host = response.status.host

self.index_host_store.set_host(self.config, name, "https://" + host)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We proactively stick this host url in the singleton store if they happen to call describe_index before doing any index operations. Just to save an additional round trip.

@@ -134,7 +134,7 @@ def configure_index_with_http_info(
:rtype: tuple(IndexMeta, status_code(int), headers(HTTPHeaderDict))
"""

_hosts = ["https://controller.{environment}.pinecone.io"]
_hosts = ["https://api.pinecone.io"]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in this file are generated, derived from changes to the servers field in our openapi spec stored in the other repository.

@@ -30,7 +30,7 @@ class IndexMetaDatabase(BaseModel):
"""

name: StrictStr = Field(...)
dimension: StrictStr = Field(...)
dimension: StrictInt = Field(...)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a generated change, based on an updated openapi spec.

@@ -0,0 +1,192 @@
import logging
Copy link
Collaborator Author

@jhamon jhamon Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved these out of the giant index.grpc.py file. No changes.

@@ -0,0 +1,33 @@
from .retry import RetryConfig
Copy link
Collaborator Author

@jhamon jhamon Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved these out of the giant index.grpc.py file. No changes.

@@ -0,0 +1,34 @@
from grpc._channel import _MultiThreadedRendezvous
Copy link
Collaborator Author

@jhamon jhamon Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved these out of the giant index.grpc.py file. No changes.

from functools import wraps
from importlib.util import find_spec
from typing import NamedTuple, Optional, Dict, Iterable, Union, List, Tuple, Any
from typing import Optional, Dict, Iterable, Union, List, Tuple, Any
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted a bunch of supporting classes and functions into smaller files. Removed dependencies no longer in use by the class defined in this file. Otherwise, no changes.

@@ -0,0 +1,107 @@
import uuid
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved these grpc-related utils out of a general utils.py file so there wouldn't be any try/except import statements needed in here, and also so they are closer to where they are actually being used in the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file got renamed and modified. Now it is tests/unit/test_control.py

@@ -0,0 +1,10 @@
import requests
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from pinecone/utils.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from pinecone/utils.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from pinecone/utils.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from pinecone/utils.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from pinecone/utils.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved some constants closer to where they are actually being used, since they were only used in one spot.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from pinecone/utils.py

@@ -0,0 +1,88 @@
import re
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftovers, hopefully to be deleted soon pending investigation on vector_column_service.proto

Comment on lines +104 to 109
def __enter__(self):
return self

def __exit__(self, exc_type, exc_value, traceback):
self._api_client.close()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods were the only thing I needed to add back as a consequence changing Index to not subclass ApiClient to satisfy one test (using with Index(...) as index syntax). The python language docs show tthese must be implemented in order to use the with syntax.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's very helpful.

Comment on lines +92 to +101
def __init__(self, api_key: str, host: str, pool_threads=1, **kwargs):
api_key = api_key or kwargs.get("api_key")
host = host or kwargs.get('host')
pool_threads = pool_threads or kwargs.get("pool_threads")

self._config = Config(api_key=api_key, host=host, **kwargs)

api_client = ApiClient(configuration=self._config.OPENAPI_CONFIG, pool_threads=pool_threads)
api_client.user_agent = get_user_agent()
self._api_client = api_client
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a substantive change to adopt the new Config class.

Copy link
Contributor

@austin-denoble austin-denoble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for taking this on! Organizationally and architecturally this feels much, much better to me. I think the refactor of the singleton approach and the ability for users to be more flexible when using the client is a big win.

Overall this looks good to go I think. I'd like to do some detailed testing around the changes here but will not have time this evening.

Definitely would like to get this merged to the feature branch and start iterating on it.

@@ -0,0 +1,7 @@
class CollectionDescription(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure of this also. I also wasn't sure why it's doing an anonymous mapping to a dict rather than having a defined set of keys like in IndexDescription(NamedTuple).

Would these classes ultimately be useful to external users?

QueryResult = None
"""@private"""
FetchResult = None
"""@private"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'm not really sure what this was for. I think it's probably fine to pull them out. Otherwise it would just be someone defining behavior themselves for these things anyways.

Comment on lines +44 to +45
openapi_config
or kwargs.pop("openapi_config", None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that openapi_config would be passed in via method arg + also be in kwargs with the same name? I guess this question is also for other areas where we do something similar, like in pinecone/data/index.py in the __init__ and I'm just somewhat unclear on kwargs.

Comment on lines +5 to +15
class SingletonMeta(type):
_instances = {}

def __call__(cls, *args, **kwargs):
if cls not in cls._instances:
instance = super().__call__(*args, **kwargs)
cls._instances[cls] = instance
return cls._instances[cls]


class IndexHostStore(metaclass=SingletonMeta):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure I was using GPT for examples of Python singletons and this is the approach it gave me as well.

self._indexHosts = {}

def _key(self, config: Config, index_name: str) -> str:
return ":".join([config.API_KEY, index_name])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hyper-nit: Using a delimiter + join seems fine here, I suppose you could also use a template string for possibly easier readability, but obviously just imo:

Suggested change
return ":".join([config.API_KEY, index_name])
return f"{config.API_KEY}:{index_name}"

import copy

__all__ = [
"Index",
"FetchResponse",
"ProtobufAny",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guessing this was removed as it doesn't seem necessarily useful being exposed for end-users, but lemme know if otherwise.

Comment on lines +104 to 109
def __enter__(self):
return self

def __exit__(self, exc_type, exc_value, traceback):
self._api_client.close()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's very helpful.


class PineconeConfig(Config):
def __init__(self, api_key: str = None, host: str = None, **kwargs):
host = host or kwargs.get("host") or os.getenv("PINECONE_CONTROLLER_HOST") or DEFAULT_CONTROLLER_HOST
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure whether the env var should be maintained, but I suppose it's a nice convenience and not too much trouble to keep around.

def validate_and_convert_errors(func):
@wraps(func)
def inner_func(*args, **kwargs):
Config.validate() # raises exceptions in case of invalid config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens within the init of the Config itself, right? Not really necessary to have it here.

assert Config.CONTROLLER_HOST == controller_host


def test_resolution_order_kwargs_over_config_file():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we lost the tests around resolution order over a config file, but I'd assume that's fine given we'll need to add new testing around the config singleton and stuff anyways, and the config file setup isn't really documented.

@jhamon jhamon merged commit 2981ab4 into spruce Oct 20, 2023
5 checks passed
@jhamon jhamon deleted the jhamon/config-refactor branch October 20, 2023 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants