
Integrate SQLAlchemy for db conn management and introduce new SqlStorage abstraction #93

Merged 45 commits on Apr 29, 2024

Commits
f61bd01
Update expected VRS IDs for VCF tests
ehclark Mar 1, 2024
3b8df69
Update VRS IDs for variation tests
ehclark Mar 1, 2024
ad53f0d
Added new SqlStorage implementation as an abstract base class for all…
ehclark Mar 7, 2024
b3b7535
Switch to snowflake-sqlalchemy package
ehclark Mar 7, 2024
bf18038
Update the Postgres storage implementation to be a subclass of the ne…
ehclark Mar 7, 2024
b4c002d
Update Snowflake storage implementation to be a subclass of new SqlSt…
ehclark Mar 7, 2024
c279668
Updated unit tests to cover the use of background writes in Postgres …
ehclark Mar 7, 2024
92195de
Rename test file and replace unused var names with underscore
ehclark Mar 7, 2024
c58b144
Add storage option to always fully flush on batch context exit
ehclark Apr 16, 2024
7d15584
When storage construction does not complete, the batch_thread and con…
ehclark Apr 16, 2024
1653906
Depending on the underlying database, the returned column value can b…
ehclark Apr 16, 2024
daefc3c
Add batch add mode settings to control what type of SQL statement to …
ehclark Apr 16, 2024
4300ad6
Update variation test data to match VRS 2.0 changes
ehclark Apr 17, 2024
8971570
Comment out response model to return full VRS objects instead of seri…
ehclark Apr 17, 2024
15514da
Make get location/variation behave consistently even when the object …
ehclark Apr 17, 2024
cf63bf2
Remove code added to make debugging easier
ehclark Apr 17, 2024
4c0878e
Merge branch 'issue-85' into issue-87-db-conn-pool
ehclark Apr 17, 2024
e0a8d71
Update queries to use specified table name
ehclark Apr 17, 2024
1a625b3
Fix bug in detecting column value type on fetch
ehclark Apr 17, 2024
5a62da4
Batch add mode only makes sense for Snowflake because in Postgres the…
ehclark Apr 17, 2024
bf09a47
Switch to using question mark bind variables for Snowflake because na…
ehclark Apr 17, 2024
ab96430
Update to batch insert to play nicely with Snowflake quirks
ehclark Apr 17, 2024
da15cb6
Update example URL to be SQLAlchemy friendly
ehclark Apr 18, 2024
22295be
Use super() to invoke __init__()
ehclark Apr 18, 2024
cb998fa
Add support for Snowflake private key auth
ehclark Apr 18, 2024
7a3e99e
Add monkey patch workaround for bug in Snowflake SQLAlchemy
ehclark Apr 18, 2024
e64455b
Update collation in temp loading table
ehclark Apr 18, 2024
5f9a483
Storage implementations should be consistent with MutableMapping API …
ehclark Apr 18, 2024
a6c7484
Remove VRS model classes from response objects because the serializat…
ehclark Apr 18, 2024
bec6d2f
Corrected path used for missing allele id test
ehclark Apr 18, 2024
2e28e95
Get location and get variation should be consistent in behavior when …
ehclark Apr 18, 2024
003d9d1
Revert unnecessary change
ehclark Apr 18, 2024
5dee8f1
Throw KeyError when id is not found
ehclark Apr 18, 2024
b10af9a
Merge branch 'issue-85' into issue-87-db-conn-pool
ehclark Apr 18, 2024
d818d5c
Add missing argument to _get_connect_args
ehclark Apr 18, 2024
157a46a
Code formatting
ehclark Apr 18, 2024
277e886
Merge branch 'main' into issue-85
ehclark Apr 18, 2024
977e052
Merge branch 'main' into issue-87-db-conn-pool
ehclark Apr 18, 2024
202e752
Suppress SQL injection warning as elsewhere
ehclark Apr 18, 2024
770b8d6
Code formatting
ehclark Apr 18, 2024
54d4102
Adding missing SQL injection warning suppressions
ehclark Apr 18, 2024
5f45287
Update README to reflect changes
ehclark Apr 19, 2024
68701b8
Merge branch 'issue-85' into issue-87-db-conn-pool
ehclark Apr 23, 2024
c46a54b
Address "Incomplete URL substring sanitization" warning
ehclark Apr 23, 2024
66cd446
Merge branch 'main' into issue-87-db-conn-pool
ehclark Apr 26, 2024
83 changes: 53 additions & 30 deletions README.md
@@ -57,52 +57,75 @@ In another terminal:
curl http://localhost:8000/info


### SQL Database Setup
A Postgres or Snowflake database may be used with *AnyVar*. The Postgres database
may be either local or remote. Use the `ANYVAR_STORAGE_URI` environment variable
to define the database connection URL. *AnyVar* uses [SQLAlchemy 1.4](https://docs.sqlalchemy.org/en/14/index.html)
to provide database connection management. The default database connection URL
is `"postgresql://postgres@localhost:5432/anyvar"`.

The database integrations can be modified using the following parameters:
* `ANYVAR_SQL_STORE_BATCH_LIMIT` - in batch mode, limit VRS object upsert batches
to this number; defaults to `100,000`
* `ANYVAR_SQL_STORE_TABLE_NAME` - the name of the table that stores VRS objects;
defaults to `vrs_objects`
* `ANYVAR_SQL_STORE_MAX_PENDING_BATCHES` - the maximum number of pending batches
to allow before blocking; defaults to `50`
* `ANYVAR_SQL_STORE_FLUSH_ON_BATCHCTX_EXIT` - whether to flush all pending
database writes when the batch manager exits; defaults to `True`
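
For example, these settings might be supplied in-process before creating the storage
backend (a minimal sketch; the values are illustrative, and `create_storage` from
`src/anyvar/anyvar.py` falls back to `ANYVAR_STORAGE_URI` when no URI is passed):

```python
import os

from anyvar.anyvar import create_storage

# Point AnyVar at a local Postgres instance (the default URL shown above)
# and tune batch-mode behavior; the values here are illustrative.
os.environ["ANYVAR_STORAGE_URI"] = "postgresql://postgres@localhost:5432/anyvar"
os.environ["ANYVAR_SQL_STORE_BATCH_LIMIT"] = "50000"
os.environ["ANYVAR_SQL_STORE_FLUSH_ON_BATCHCTX_EXIT"] = "True"

# create_storage() reads ANYVAR_STORAGE_URI when no explicit URI is given.
storage = create_storage()
```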

The Postgres and Snowflake database connectors utilize a background thread
to write VRS objects to the database when operating in batch mode (e.g. annotating
a VCF file). Queries and statistics run only against the already committed database
state. Therefore, queries issued immediately after a batch operation may not reflect
all pending changes if the `ANYVAR_SQL_STORE_FLUSH_ON_BATCHCTX_EXIT` parameter is set
to `False`.

#### Setting up Postgres
The following instructions are for using a docker-based Postgres instance.

First, run the commands in [README-pg.md](src/anyvar/storage/README-pg.md).
This will create and start a local Postgres docker instance.

Next, run the commands in [postgres_init.sql](src/anyvar/storage/postgres_init.sql).
This will create the `anyvar` user with the appropriate permissions and create the
`anyvar` database.

#### Setting up Snowflake
The Snowflake database and schema must exist prior to starting *AnyVar*. To point
*AnyVar* at Snowflake, specify a Snowflake URI in the `ANYVAR_STORAGE_URI` environment
variable. For example:


```
snowflake://sf_username:@sf_account_identifier/sf_db_name/sf_schema_name?password=sf_password
```
[Snowflake connection parameter reference](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-api)

When running interactively and connecting to a Snowflake account that utilizes
federated authentication or SSO, add the parameter `authenticator=externalbrowser`.
Non-interactive execution in a federated authentication or SSO environment
requires a service account to connect. Connections using an encrypted or unencrypted
private key are also supported by specifying the parameter `private_key=path/to/file.p8`.
The key material may be URL-encoded and inlined in the connection URI,
for example: `private_key=-----BEGIN+PRIVATE+KEY-----%0AMIIEvAIBA...`
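
As a sketch of that URL-encoding step (the key file path is hypothetical; only
Python's standard library is assumed):

```python
from pathlib import Path
from urllib.parse import quote_plus

# Read the PEM-encoded private key and URL-encode it so it can be inlined
# as the `private_key` query parameter of the Snowflake connection URI
# (spaces become '+' and newlines become '%0A', matching the example above).
key_material = Path("path/to/file.p8").read_text()
encoded_key = quote_plus(key_material)

uri = (
    "snowflake://sf_username:@sf_account_identifier/sf_db_name/sf_schema_name"
    f"?private_key={encoded_key}"
)
```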


Environment variables that can be used to modify Snowflake database integration:
* `ANYVAR_SNOWFLAKE_STORE_BATCH_LIMIT` - in batch mode, limit VRS object upsert batches to this number; defaults to `100,000`
* `ANYVAR_SNOWFLAKE_STORE_TABLE_NAME` - the name of the table that stores VRS objects; defaults to `vrs_objects`
* `ANYVAR_SNOWFLAKE_STORE_MAX_PENDING_BATCHES` - the maximum number of pending batches to allow before blocking; defaults to `50`
* `ANYVAR_SNOWFLAKE_STORE_PRIVATE_KEY_PASSPHRASE` - the passphrase for an encrypted private key

* `ANYVAR_SNOWFLAKE_BATCH_ADD_MODE` - the SQL statement type to use when adding new VRS objects, one of:
  * `merge` (default) - use a MERGE statement. This guarantees that duplicate VRS IDs will
    not be added, but also locks the VRS object table, limiting throughput.
  * `insert_notin` - use `INSERT INTO vrs_objects SELECT FROM tmp WHERE vrs_id NOT IN (...)`.
    This narrows the chance of duplicates and does not require a table lock.
  * `insert` - use a plain `INSERT INTO`. This maximizes throughput at the cost of not checking for
    duplicates at all.

If you choose to create the VRS objects table in advance, the minimal table specification is as follows:
```sql
CREATE TABLE ... (
    vrs_id VARCHAR(500) COLLATE 'utf8',
    vrs_object VARIANT
)
```



## Deployment

NOTE: The authoritative and sole source for version tags is the
2 changes: 1 addition & 1 deletion setup.cfg
@@ -9,7 +9,7 @@ install_requires =
uvicorn
ga4gh.vrs[extras]~=2.0.0a5
psycopg[binary]
snowflake-connector-python~=3.4.1
snowflake-sqlalchemy~=1.5.1

[options.package_data]
* =
3 changes: 1 addition & 2 deletions src/anyvar/anyvar.py
@@ -28,8 +28,7 @@ def create_storage(uri: Optional[str] = None) -> _Storage:
* PostgreSQL
`postgresql://[username]:[password]@[domain]/[database]`
* Snowflake
`snowflake://[account_identifier].snowflakecomputing.com/?[param=value]&[param=value]...`
`snowflake://[account_identifier]/?[param=value]&[param=value]...`
`snowflake://[user]:@[account]/[database]/[schema]?[param=value]&[param=value]...`
"""
uri = uri or os.environ.get("ANYVAR_STORAGE_URI", DEFAULT_STORAGE_URI)
