
Commit 454504f

Integrate SQLAlchemy for db conn management and introduce new SqlStorage abstraction (#93)

* Update expected VRS IDs for VCF tests

* Update VRS IDs for variation tests

* Added a new SqlStorage implementation as an abstract base class for all RDBMS storage implementations

The base class uses SQLAlchemy for connection management and SQL statement execution because it is the only connection pooling library that works with the Snowflake connector.

The base class includes the background db write capabilities from the Snowflake implementation and concrete SQL statement execution wherever standard SQL suffices. Abstract methods are defined for queries where the SQL or database APIs are not standard.
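
For illustration, the shape of such a base class might look like the sketch below; class, method, and table names here are assumptions, not the actual AnyVar API.

```python
# Hypothetical sketch of an RDBMS storage base class built on SQLAlchemy 1.4;
# names and signatures are illustrative only.
import json
from abc import ABC, abstractmethod

from sqlalchemy import create_engine, text


class SqlStorageSketch(ABC):
    """Shared connection management plus standard-SQL operations."""

    def __init__(self, db_url: str, table_name: str = "vrs_objects"):
        # create_engine() provides pooled connections for any supported
        # dialect, including the one from snowflake-sqlalchemy.
        self.table_name = table_name
        self.engine = create_engine(db_url)

    def __getitem__(self, vrs_id: str) -> dict:
        # Standard SQL shared by all dialects lives in the base class.
        with self.engine.connect() as conn:
            row = conn.execute(
                text(f"SELECT vrs_object FROM {self.table_name} WHERE vrs_id = :vid"),  # noqa: S608
                {"vid": vrs_id},
            ).fetchone()
        if row is None:
            raise KeyError(vrs_id)  # MutableMapping contract
        value = row[0]
        # Depending on the driver, the column may come back as str or dict.
        return json.loads(value) if isinstance(value, str) else value

    @abstractmethod
    def add_one_item(self, conn, name: str, value: dict) -> None:
        """Dialect-specific insert/upsert (e.g. ON CONFLICT vs. MERGE)."""
```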

* Switch to snowflake-sqlalchemy package

* Update the Postgres storage implementation to be a subclass of the new SqlStorage base class

Primarily removed code that is now included in the base class and reorganized the remaining code into the base class API shape.

Because the Snowflake connector only supports SQLAlchemy 1.4, which in turn only supports psycopg2, the batch insert logic had to be modified to use a different API.
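
As a minimal sketch of that approach (the table name and row layout are assumptions for illustration), passing a list of parameter dicts to `Connection.execute()` in SQLAlchemy 1.4 triggers the driver's executemany path:

```python
# Minimal sketch of a batch insert under SQLAlchemy 1.4; the table name
# and row layout are assumptions, not AnyVar's actual code.
import json

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres@localhost:5432/anyvar")

rows = [
    ("ga4gh:VA.example1", {"type": "Allele"}),
    ("ga4gh:VA.example2", {"type": "Allele"}),
]

insert_stmt = text(
    "INSERT INTO vrs_objects (vrs_id, vrs_object) "
    "VALUES (:vid, :vobj) ON CONFLICT DO NOTHING"
)

with engine.begin() as conn:
    # A list of parameter dicts makes SQLAlchemy use executemany
    # instead of one round trip per row.
    conn.execute(
        insert_stmt,
        [{"vid": vid, "vobj": json.dumps(obj)} for vid, obj in rows],
    )
```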

* Update Snowflake storage implementation to be a subclass of new SqlStorage base class

Removed code that is now included in the base class and reorganized the remaining code into the base class API shape.

* Updated unit tests to cover the use of background writes in the Postgres storage implementation

Refactored mocks for SQLAlchemy-based testing into a separate module

* Rename test file and replace unused var names with underscore

* Add storage option to always fully flush on batch context exit

* When storage construction does not complete, the batch_thread and conn_pool are sometimes not created, leading to spurious errors on close(). Check for these attributes before attempting to clean them up.

* Depending on the underlying database, the returned column value can be a string or a dict

* Add batch add mode settings to control what type of SQL statement to use when adding new VRS objects to the database

* Update variation test data to match VRS 2.0 changes

* Comment out response model to return full VRS objects instead of serialized version

* Make get location/variation behave consistently even when the object store does not throw a KeyError on missing key

* Remove code added to make debugging easier

* Update queries to use the specified table name

* Fix bug in detecting column value type on fetch

* Batch add mode only makes sense for Snowflake because in Postgres the vrs_objects table has a primary key and uses "ON CONFLICT" on inserts

* Switch to using question mark bind variables for Snowflake because named parameters were not working
Pick up table name from environment in unit tests
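
A sketch of what a question-mark bind can look like (the table name, PARSE_JSON usage, and connection details are assumptions for illustration):

```python
# Sketch of qmark-style binding against Snowflake via SQLAlchemy 1.4.
# exec_driver_sql() hands the statement and positional parameters
# straight through to the underlying Snowflake connector.
import json

import snowflake.connector
from sqlalchemy import create_engine

# Tell the Snowflake connector to use ? placeholders instead of %(name)s.
snowflake.connector.paramstyle = "qmark"

engine = create_engine(
    "snowflake://sf_username:@sf_account_identifier/sf_db_name/sf_schema_name"
    "?password=sf_password"
)

with engine.connect() as conn:
    # Snowflake needs INSERT ... SELECT (not VALUES) to apply an
    # expression such as PARSE_JSON to a bound parameter.
    conn.exec_driver_sql(
        "INSERT INTO vrs_objects (vrs_id, vrs_object) SELECT ?, PARSE_JSON(?)",
        ("ga4gh:VA.example", json.dumps({"type": "Allele"})),
    )
```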

* Update the batch insert to play nicely with Snowflake quirks

* Update example URL to be SQLAlchemy friendly

* Use super() to invoke __init__()

* Add support for Snowflake private key auth

* Add monkey patch workaround for bug in Snowflake SQLAlchemy

* Update collation in temp loading table

* Storage implementations should be consistent with MutableMapping API and throw KeyError when an item is not found
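
A toy sketch of that contract (not AnyVar code): a backend that returns None on a miss gets normalized to the KeyError that MutableMapping callers expect.

```python
# Toy sketch: adapt a backend that returns None on a miss to the
# MutableMapping contract, where a missing key raises KeyError.
from collections.abc import MutableMapping


class StoreSketch(MutableMapping):
    def __init__(self):
        self._backend = {}

    def __getitem__(self, key):
        value = self._backend.get(key)  # backend yields None on a miss
        if value is None:
            raise KeyError(key)  # surface misses per the mapping contract
        return value

    def __setitem__(self, key, value):
        self._backend[key] = value

    def __delitem__(self, key):
        del self._backend[key]

    def __iter__(self):
        return iter(self._backend)

    def __len__(self):
        return len(self._backend)
```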

* Remove VRS model classes from response objects because the serialization used internally is not correct for API responses

* Corrected path used for missing allele id test

* Get location and get variation should be consistent in behavior when id is not found

* Revert unnecessary change

* Throw KeyError when id is not found

* Add missing argument to _get_connect_args

* Code formatting

* Suppress SQL injection warning, as done elsewhere

* Code formatting

* Add missing SQL injection warning suppressions

* Update README to reflect changes

* Address "Incomplete URL substring sanitization" warning
ehclark authored Apr 29, 2024
1 parent e1981fe commit 454504f
Showing 11 changed files with 1,486 additions and 981 deletions.
83 changes: 53 additions & 30 deletions README.md
@@ -57,52 +57,75 @@ In another terminal:
curl http://localhost:8000/info


### SQL Database Setup
A Postgres or Snowflake database may be used with *AnyVar*. The Postgres database
may be either local or remote. Use the `ANYVAR_STORAGE_URI` environment variable
to define the database connection URL. *AnyVar* uses [SQLAlchemy 1.4](https://docs.sqlalchemy.org/en/14/index.html)
to provide database connection management. The default database connection URL
is `"postgresql://postgres@localhost:5432/anyvar"`.

The database integrations can be modified using the following parameters:
* `ANYVAR_SQL_STORE_BATCH_LIMIT` - in batch mode, limit VRS object upsert batches
to this number; defaults to `100,000`
* `ANYVAR_SQL_STORE_TABLE_NAME` - the name of the table that stores VRS objects;
defaults to `vrs_objects`
* `ANYVAR_SQL_STORE_MAX_PENDING_BATCHES` - the maximum number of pending batches
to allow before blocking; defaults to `50`
* `ANYVAR_SQL_STORE_FLUSH_ON_BATCHCTX_EXIT` - whether or not to flush all pending
database writes when the batch manager exits; defaults to `True`

The Postgres and Snowflake database connectors utilize a background thread
to write VRS objects to the database when operating in batch mode (e.g. annotating
a VCF file). Queries, including statistics queries, run only against the already
committed database state. Therefore, queries issued immediately after a batch
operation may not reflect all pending changes if the
`ANYVAR_SQL_STORE_FLUSH_ON_BATCHCTX_EXIT` parameter is set to `False`.
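
For example, a typical batch-oriented configuration might look like the following
(the values here are illustrative, not recommendations):

```
export ANYVAR_STORAGE_URI="postgresql://postgres@localhost:5432/anyvar"
export ANYVAR_SQL_STORE_BATCH_LIMIT=50000
export ANYVAR_SQL_STORE_TABLE_NAME=vrs_objects
export ANYVAR_SQL_STORE_MAX_PENDING_BATCHES=25
export ANYVAR_SQL_STORE_FLUSH_ON_BATCHCTX_EXIT=True
```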

#### Setting up Postgres
The following instructions are for using a docker-based Postgres instance.

First, run the commands in [README-pg.md](src/anyvar/storage/README-pg.md).
This will create and start a local Postgres docker instance.

Next, run the commands in [postgres_init.sql](src/anyvar/storage/postgres_init.sql).
This will create the `anyvar` user with the appropriate permissions and create the
`anyvar` database.

#### Setting up Snowflake
The Snowflake database and schema must exist prior to starting *AnyVar*. To point
*AnyVar* at Snowflake, specify a Snowflake URI in the `ANYVAR_STORAGE_URI` environment
variable. For example:

```
snowflake://sf_username:@sf_account_identifier/sf_db_name/sf_schema_name?password=sf_password
```
[Snowflake connection parameter reference](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-api)

When running interactively and connecting to a Snowflake account that utilizes
federated authentication or SSO, add the parameter `authenticator=externalbrowser`.
Non-interactive execution in a federated authentication or SSO environment
requires a service account to connect. Connections using an encrypted or unencrypted
private key are also supported by specifying the parameter `private_key=path/to/file.p8`.
The key material may be URL-encoded and inlined in the connection URI,
for example: `private_key=-----BEGIN+PRIVATE+KEY-----%0AMIIEvAIBA...`


Environment variables that can be used to modify Snowflake database integration:
* `ANYVAR_SNOWFLAKE_STORE_PRIVATE_KEY_PASSPHRASE` - the passphrase for an encrypted private key

* `ANYVAR_SNOWFLAKE_BATCH_ADD_MODE` - the SQL statement type to use when adding
  new VRS objects (see the sketch after this list), one of:
    * `merge` (default) - use a MERGE statement. This guarantees that duplicate
      VRS IDs will not be added, but also locks the VRS object table, limiting
      throughput.
    * `insert_notin` - use `INSERT INTO vrs_objects SELECT ... FROM tmp WHERE vrs_id NOT IN (...)`.
      This narrows the chance of duplicates and does not require a table lock.
    * `insert` - use a plain `INSERT INTO`. This maximizes throughput at the cost
      of not checking for duplicates at all.

If you choose to create the VRS objects table in advance, the minimal table specification is as follows:
```sql
CREATE TABLE ... (
vrs_id VARCHAR(500) COLLATE 'utf8',
vrs_object VARIANT
)
```



## Deployment

NOTE: The authoritative and sole source for version tags is the
2 changes: 1 addition & 1 deletion setup.cfg
@@ -9,7 +9,7 @@ install_requires =
uvicorn
ga4gh.vrs[extras]~=2.0.0a5
psycopg[binary]
snowflake-sqlalchemy~=1.5.1

[options.package_data]
* =
3 changes: 1 addition & 2 deletions src/anyvar/anyvar.py
@@ -28,8 +28,7 @@ def create_storage(uri: Optional[str] = None) -> _Storage:
* PostgreSQL
`postgresql://[username]:[password]@[domain]/[database]`
* Snowflake
`snowflake://[user]:@[account]/[database]/[schema]?[param=value]&[param=value]...`
"""
uri = uri or os.environ.get("ANYVAR_STORAGE_URI", DEFAULT_STORAGE_URI)

