This is based on the timescaledb image defined in fly-apps/postgres-flex
.
- pg_lakehouse
- timescaledb
- postgres_fdw
- pg_search
- pgvector
- pg_cron
-
Add script to modify postgresql.conf to include extensionsThis proved not to be possible given the way that Fly aggressively manages config files. You can only change settings usingfly postgres config
. - Add script to activate extensions in the database
- Add support for pgxman or trunk for package management --makes installation of extensions easier
- If not already done, build the image and push to Docker Hub via the following commands:
docker build . -t <your-dockerhub-username>/flypg-extended:latest --platform "linux/amd64" -f extended.Dockerfile
- Deploy a new postgres cluster on Fly using the following command:
fly postgres create --image-ref <your-dockerhub-username>/flypg-extended:latest
fly secrets set --app risekit-analytics-db \
APPLICATION_DB_USERNAME=#### \
APPLICATION_DB_PASSWORD=############### \
AWS_ACCESS_KEY_ID=#################### \
AWS_SECRET_ACCESS_KEY=####### \
Note: The APPLICATION_DB_USERNAME
refers to the foreign data wrapper username used by the analytics database to connect to the replica database. It is currently set to mike
, but should probably be set to risekit-analytics
or something similar.
Because Fly aggressively manages the PostgreSQL configuration files, there does not seem to be a way to set this value at build time. Instead, it must be set post-deployment:
fly postgres config update --shared-preload-libraries repmgr,timescaledb,pg_lakehouse,pg_cron
This will set the ENV
variables defined in fly.toml
:
fly deploy
The following script will initialize extensions, foreign data tables, and a materialized view. It will also setup a pg_cron
job to refresh that view:
fly ssh console --pty -C '/usr/local/bin/risekit_db_init.sh' -s
Note that this command has to be run on the primary node of the cluster.
You should now be able to connect to the database:
fly proxy 5433 # connect via wireguard as usual
And run the following queries:
-- Checks if connection to S3 is working
SELECT * from recommendations_dictionaries LIMIT 1;
-- Checks if the materialized view is working
SELECT * FROM recommendations_view LIMIT 1;
-- Checks if the pg_cron job has been created
SELECT * FROM cron.job;
-- Checks if the pg_cron job has been run
-- (should be empty until the first run, which is scheduled on the hour)
SELECT * FROM cron.job_run_details;
This query illustrates how to extract values from the jsonb
dictionary in recommendations_view
:
SELECT
data->>'_surrogate_key_hash' AS surrogate_key_hash,
data->>'job_title_normalized' AS job_title_normalized,
data->>'job_resource_title' AS job_resource_title,
data->>'job_created_at' AS job_created_at,
data->>'job_zip3' AS job_zip3,
data->>'job_expire_at' AS job_expire_at,
data->>'candidate_id' AS candidate_id,
data->>'candidate_zip3' AS candidate_zip3,
data->>'candidate_job_title' AS candidate_job_title,
data->>'candidate_job_title_normalized' AS candidate_job_title_normalized,
data->>'job_resource_link_id' AS job_resource_link_id,
data->>'normalized_candidate_job_title_match_score' AS normalized_candidate_job_title_match_score
FROM recommendations_view
ORDER BY job_created_at DESC
SELECT cron.unschedule('refresh_recommendations_view', '0 * * * *', 'refresh materialized view recommendations_view');
Since an analytics database is likely to be read-heavy, with a smaller number of simultaneous connections, the following configuration could potentially increase performance. It does not reflect potential use of timescaledb
or pg_vector
:
fly postgres config update \
--max-connections 20 # from 300 \
--maintenance-work-mem 512MB # from 65 MB \
--work-mem 13MB # from 4 MB
- https://github.com/citusdata/pg_cron
- https://datawookie.dev/blog/2022/03/scheduling-refresh-materialised-view/
- https://medium.com/full-stack-architecture/postgresql-caching-with-pg-cron-and-materialized-views-3403697eadbf
- https://github.com/erichosick/postgresql-cron-example
- https://www.postgresql.org/docs/current/sql-creatematerializedview.html
- https://www.postgresql.org/docs/current/sql-refreshmaterializedview.html