
Commit 454504f

Integrate SQLAlchemy for db conn management and introduce new SqlStorage abstraction (#93)

* Update expected VRS IDs for VCF tests

* Update VRS IDs for variation tests

* Added a new SqlStorage implementation as an abstract base class for all RDBMS storage implementations

The base class uses SQLAlchemy for connection management and SQL statement execution because it is the only connection pooling library that works with the Snowflake connector.

The base class includes the background db write capabilities from the Snowflake implementation and concrete SQL statement execution wherever standard SQL suffices. Abstract methods are defined for queries where the SQL or database APIs are not standard.
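
For illustration, the shape of such a base class might look like the sketch below; class, method, and table names here are assumptions, not the actual AnyVar API.

```python
# Hypothetical sketch of an RDBMS storage base class built on SQLAlchemy 1.4;
# names and signatures are illustrative only.
import json
from abc import ABC, abstractmethod

from sqlalchemy import create_engine, text


class SqlStorageSketch(ABC):
    """Shared connection management plus standard-SQL operations."""

    def __init__(self, db_url: str, table_name: str = "vrs_objects"):
        # create_engine() provides pooled connections for any supported
        # dialect, including the one from snowflake-sqlalchemy.
        self.table_name = table_name
        self.engine = create_engine(db_url)

    def __getitem__(self, vrs_id: str) -> dict:
        # Standard SQL shared by all dialects lives in the base class.
        with self.engine.connect() as conn:
            row = conn.execute(
                text(f"SELECT vrs_object FROM {self.table_name} WHERE vrs_id = :vid"),  # noqa: S608
                {"vid": vrs_id},
            ).fetchone()
        if row is None:
            raise KeyError(vrs_id)  # MutableMapping contract
        value = row[0]
        # Depending on the driver, the column may come back as str or dict.
        return json.loads(value) if isinstance(value, str) else value

    @abstractmethod
    def add_one_item(self, conn, name: str, value: dict) -> None:
        """Dialect-specific insert/upsert (e.g. ON CONFLICT vs. MERGE)."""
```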

* Switch to snowflake-sqlalchemy package

* Update the Postgres storage implementation to be a subclass of the new SqlStorage base class

Primarily removed code that is now included in the base class and reorganized the remaining code into the base class API shape.

Because the Snowflake connector only supports SQLAlchemy 1.4, which in turn only supports psycopg2, the batch insert logic had to be modified to use a different API.
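
As a minimal sketch of that approach (the table name and row layout are assumptions for illustration), passing a list of parameter dicts to `Connection.execute()` in SQLAlchemy 1.4 triggers the driver's executemany path:

```python
# Minimal sketch of a batch insert under SQLAlchemy 1.4; the table name
# and row layout are assumptions, not AnyVar's actual code.
import json

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres@localhost:5432/anyvar")

rows = [
    ("ga4gh:VA.example1", {"type": "Allele"}),
    ("ga4gh:VA.example2", {"type": "Allele"}),
]

insert_stmt = text(
    "INSERT INTO vrs_objects (vrs_id, vrs_object) "
    "VALUES (:vid, :vobj) ON CONFLICT DO NOTHING"
)

with engine.begin() as conn:
    # A list of parameter dicts makes SQLAlchemy use executemany
    # instead of one round trip per row.
    conn.execute(
        insert_stmt,
        [{"vid": vid, "vobj": json.dumps(obj)} for vid, obj in rows],
    )
```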

* Update Snowflake storage implementation to be a subclass of new SqlStorage base class

Removed code that is now included in the base class and reorganized the remaining code into the base class API shape.

* Updated unit tests to cover the use of background writes in the Postgres storage implementation

Refactored mocks for SQLAlchemy-based testing into a separate module

* Rename test file and replace unused var names with underscore

* Add storage option to always fully flush on batch context exit

* When storage construction does not complete, the batch_thread and conn_pool are sometimes not created, leading to spurious errors on close(). Check for these attributes before attempting to clean them up.

* Depending on the underlying database, the returned column value can be a string or a dict

* Add batch add mode settings to control what type of SQL statement to use when adding new VRS objects to the database

* Update variation test data to match VRS 2.0 changes

* Comment out response model to return full VRS objects instead of serialized version

* Make get location/variation behave consistently even when the object store does not throw a KeyError on missing key

* Remove code added to make debugging easier

* Update queries to use the specified table name

* Fix bug in detecting column value type on fetch

* Batch add mode only makes sense for Snowflake because in Postgres the vrs_objects table has a primary key and uses "ON CONFLICT" on inserts

* Switch to using question mark bind variables for Snowflake because named parameters were not working
Pick up table name from environment in unit tests
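
A sketch of what a question-mark bind can look like (the table name, PARSE_JSON usage, and connection details are assumptions for illustration):

```python
# Sketch of qmark-style binding against Snowflake via SQLAlchemy 1.4.
# exec_driver_sql() hands the statement and positional parameters
# straight through to the underlying Snowflake connector.
import json

import snowflake.connector
from sqlalchemy import create_engine

# Tell the Snowflake connector to use ? placeholders instead of %(name)s.
snowflake.connector.paramstyle = "qmark"

engine = create_engine(
    "snowflake://sf_username:@sf_account_identifier/sf_db_name/sf_schema_name"
    "?password=sf_password"
)

with engine.connect() as conn:
    # Snowflake needs INSERT ... SELECT (not VALUES) to apply an
    # expression such as PARSE_JSON to a bound parameter.
    conn.exec_driver_sql(
        "INSERT INTO vrs_objects (vrs_id, vrs_object) SELECT ?, PARSE_JSON(?)",
        ("ga4gh:VA.example", json.dumps({"type": "Allele"})),
    )
```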

* Update the batch insert to play nicely with Snowflake quirks

* Update example URL to be SQLAlchemy friendly

* Use super() to invoke __init__()

* Add support for Snowflake private key auth

* Add monkey patch workaround for bug in Snowflake SQLAlchemy

* Update collation in temp loading table

* Storage implementations should be consistent with MutableMapping API and throw KeyError when an item is not found
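
A toy sketch of that contract (not AnyVar code): a backend that returns None on a miss gets normalized to the KeyError that MutableMapping callers expect.

```python
# Toy sketch: adapt a backend that returns None on a miss to the
# MutableMapping contract, where a missing key raises KeyError.
from collections.abc import MutableMapping


class StoreSketch(MutableMapping):
    def __init__(self):
        self._backend = {}

    def __getitem__(self, key):
        value = self._backend.get(key)  # backend yields None on a miss
        if value is None:
            raise KeyError(key)  # surface misses per the mapping contract
        return value

    def __setitem__(self, key, value):
        self._backend[key] = value

    def __delitem__(self, key):
        del self._backend[key]

    def __iter__(self):
        return iter(self._backend)

    def __len__(self):
        return len(self._backend)
```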

* Remove VRS model classes from response objects because the serialization used internally is not correct for API responses

* Corrected path used for missing allele id test

* Get location and get variation should be consistent in behavior when id is not found

* Revert unnecessary change

* Throw KeyError when id is not found

* Add missing argument to _get_connect_args

* Code formatting

* Suppress SQL injection warning, as done elsewhere

* Code formatting

* Add missing SQL injection warning suppressions

* Update README to reflect changes

* Address "Incomplete URL substring sanitization" warning
ehclark authored Apr 29, 2024
1 parent e1981fe commit 454504f
Showing 11 changed files with 1,486 additions and 981 deletions.
83 changes: 53 additions & 30 deletions README.md
@@ -57,52 +57,75 @@ In another terminal:
curl http://localhost:8000/info


### SQL Database Setup
A Postgres or Snowflake database may be used with *AnyVar*. The Postgres database
may be either local or remote. Use the `ANYVAR_STORAGE_URI` environment variable
to define the database connection URL. *AnyVar* uses [SQLAlchemy 1.4](https://docs.sqlalchemy.org/en/14/index.html)
to provide database connection management. The default database connection URL
is `"postgresql://postgres@localhost:5432/anyvar"`.

The database integrations can be modified using the following parameters:
* `ANYVAR_SQL_STORE_BATCH_LIMIT` - in batch mode, limit VRS object upsert batches
to this number; defaults to `100,000`
* `ANYVAR_SQL_STORE_TABLE_NAME` - the name of the table that stores VRS objects;
defaults to `vrs_objects`
* `ANYVAR_SQL_STORE_MAX_PENDING_BATCHES` - the maximum number of pending batches
to allow before blocking; defaults to `50`
* `ANYVAR_SQL_STORE_FLUSH_ON_BATCHCTX_EXIT` - whether or not to flush all pending
database writes when the batch manager exits; defaults to `True`

The Postgres and Snowflake database connectors utilize a background thread
to write VRS objects to the database when operating in batch mode (e.g. annotating
a VCF file). Queries, including statistics queries, run only against the already
committed database state. Therefore, queries issued immediately after a batch
operation may not reflect all pending changes if the
`ANYVAR_SQL_STORE_FLUSH_ON_BATCHCTX_EXIT` parameter is set to `False`.
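
For example, a typical batch-oriented configuration might look like the following
(the values here are illustrative, not recommendations):

```
export ANYVAR_STORAGE_URI="postgresql://postgres@localhost:5432/anyvar"
export ANYVAR_SQL_STORE_BATCH_LIMIT=50000
export ANYVAR_SQL_STORE_TABLE_NAME=vrs_objects
export ANYVAR_SQL_STORE_MAX_PENDING_BATCHES=25
export ANYVAR_SQL_STORE_FLUSH_ON_BATCHCTX_EXIT=True
```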

#### Setting up Postgres
The following instructions are for using a docker-based Postgres instance.

First, run the commands in [README-pg.md](src/anyvar/storage/README-pg.md).
This will create and start a local Postgres docker instance.

Next, run the commands in [postgres_init.sql](src/anyvar/storage/postgres_init.sql).
This will create the `anyvar` user with the appropriate permissions and create the
`anyvar` database.

#### Setting up Snowflake
The Snowflake database and schema must exist prior to starting *AnyVar*. To point
*AnyVar* at Snowflake, specify a Snowflake URI in the `ANYVAR_STORAGE_URI` environment
variable. For example:

```
snowflake://sf_username:@sf_account_identifier/sf_db_name/sf_schema_name?password=sf_password
```
[Snowflake connection parameter reference](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-api)

When running interactively and connecting to a Snowflake account that utilizes
federated authentication or SSO, add the parameter `authenticator=externalbrowser`.
Non-interactive execution in a federated authentication or SSO environment
requires a service account to connect. Connections using an encrypted or unencrypted
private key are also supported by specifying the parameter `private_key=path/to/file.p8`.
The key material may be URL-encoded and inlined in the connection URI,
for example: `private_key=-----BEGIN+PRIVATE+KEY-----%0AMIIEvAIBA...`


Environment variables that can be used to modify Snowflake database integration:
* `ANYVAR_SNOWFLAKE_STORE_PRIVATE_KEY_PASSPHRASE` - the passphrase for an encrypted private key

* `ANYVAR_SNOWFLAKE_BATCH_ADD_MODE` - the SQL statement type to use when adding
  new VRS objects (see the sketch after this list), one of:
    * `merge` (default) - use a MERGE statement. This guarantees that duplicate
      VRS IDs will not be added, but also locks the VRS object table, limiting
      throughput.
    * `insert_notin` - use `INSERT INTO vrs_objects SELECT ... FROM tmp WHERE vrs_id NOT IN (...)`.
      This narrows the chance of duplicates and does not require a table lock.
    * `insert` - use a plain `INSERT INTO`. This maximizes throughput at the cost
      of not checking for duplicates at all.

If you choose to create the VRS objects table in advance, the minimal table specification is as follows:
```sql
CREATE TABLE ... (
vrs_id VARCHAR(500) COLLATE 'utf8',
vrs_object VARIANT
)
```



## Deployment

NOTE: The authoritative and sole source for version tags is the
2 changes: 1 addition & 1 deletion setup.cfg
@@ -9,7 +9,7 @@ install_requires =
uvicorn
ga4gh.vrs[extras]~=2.0.0a5
psycopg[binary]
snowflake-sqlalchemy~=1.5.1

[options.package_data]
* =
3 changes: 1 addition & 2 deletions src/anyvar/anyvar.py
@@ -28,8 +28,7 @@ def create_storage(uri: Optional[str] = None) -> _Storage:
* PostgreSQL
`postgresql://[username]:[password]@[domain]/[database]`
* Snowflake
`snowflake://[user]:@[account]/[database]/[schema]?[param=value]&[param=value]...`
"""
uri = uri or os.environ.get("ANYVAR_STORAGE_URI", DEFAULT_STORAGE_URI)

