Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace func.agdc.common_timestamp with hardcoded conversion #595

Open
wants to merge 9 commits into
base: develop
Choose a base branch
from

Conversation

lars-fillmore
Copy link

@lars-fillmore lars-fillmore commented May 8, 2024

Current use of postgresql function agdc.common_timestamp is adding significant time (4x) to query execution, causing postgresql to queue queries and become unresponsive whilst waiting for database to return.
Function currently returns select ($1)::timestamp at time zone 'utc';
Rather than call a function to complete this, I've hardcoded this timstamp cast directly in the query expression.


📚 Documentation preview 📚: https://datacube-explorer--595.org.readthedocs.build/en/595/

@lars-fillmore lars-fillmore self-assigned this May 8, 2024
Copy link

@SpacemanPaul SpacemanPaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a try.

@omad
Copy link
Member

omad commented May 8, 2024

What testing has happened on this?

What's a before and after example query? With runtimes...

@lars-fillmore
Copy link
Author

TLDR:
Total time current: 16568.766 ms
Total time alt query: 4437.692 ms

Test query without change:
SELECT agdc.dataset.id, agdc.dataset.metadata_type_ref, agdc.dataset.dataset_type_ref, agdc.dataset.metadata, agdc.dataset.archived, agdc.dataset.added, agdc.dataset.added_by, array((SELECT selected_dataset_location.uri_scheme || ':' || selected_dataset_location.uri_body AS anon_1 FROM agdc.dataset_location AS selected_dataset_location WHERE selected_dataset_location.dataset_ref = agdc.dataset.id AND selected_dataset_location.archived IS NULL ORDER BY selected_dataset_location.added DESC, selected_dataset_location.id DESC)) AS uris FROM agdc.dataset WHERE agdc.dataset.archived IS NULL AND agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, odc:processing_datetime}')) >= '1987-10-30T00:00:00.689088'::timestamp AND agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, odc:processing_datetime}')) < '1987-10-31T00:00:00.689088'::timestamp AND agdc.dataset.dataset_type_ref = 478 LIMIT 101;
Results:
2024-05-08 04:14:06 UTC:10.95.118.63(55460):odc_admin@odc:[26185]:LOG: duration: 16568.766 ms plan: Query Text: SELECT agdc.dataset.id, agdc.dataset.metadata_type_ref, agdc.dataset.dataset_type_ref, agdc.dataset.metadata, agdc.dataset.archived, agdc.dataset.added, agdc.dataset.added_by, array((SELECT selected_dataset_location.uri_scheme || ':' || selected_dataset_location.uri_body AS anon_1 FROM agdc.dataset_location AS selected_dataset_location WHERE selected_dataset_location.dataset_ref = agdc.dataset.id AND selected_dataset_location.archived IS NULL ORDER BY selected_dataset_location.added DESC, selected_dataset_location.id DESC)) AS uris FROM agdc.dataset WHERE agdc.dataset.archived IS NULL AND agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, odc:processing_datetime}')) >= '1987-10-30T00:00:00.689088'::timestamp AND agdc.common_timestamp((agdc.dataset.metadata #>> '{properties, odc:processing_datetime}')) < '1987-10-31T00:00:00.689088'::timestamp AND agdc.dataset.dataset_type_ref = 478 LIMIT 101; Limit (cost=0.42..89241.11 rows=101 width=1315) (actual time=4359.410..16566.268 rows=26 loops=1) -> Index Scan using dix_ga_ls_wo_3_region_code on dataset (cost=0.42..1198122.93 rows=1356 width=1315) (actual time=4359.409..16566.242 rows=26 loops=1) Filter: ((agdc.common_timestamp((metadata #>> '{properties,odc:processing_datetime}'::text[])) >= '1987-10-30 00:00:00.689088'::timestamp without time zone) AND (agdc.common_timestamp((metadata #>> '{properties,odc:processing_datetime}'::text[])) < '1987-10-31 00:00:00.689088'::timestamp without time zone)) Rows Removed by Filter: 416218 SubPlan 1 -> Sort (cost=8.59..8.60 rows=1 width=44) (actual time=1.226..1.226 rows=1 loops=26) Sort Key: selected_dataset_location.added DESC, selected_dataset_location.id DESC Sort Method: quicksort Memory: 25kB -> Index Scan using ix_agdc_dataset_location_dataset_ref on dataset_location selected_dataset_location (cost=0.56..8.58 rows=1 width=44) (actual time=1.214..1.215 rows=1 loops=26) Index Cond: (dataset_ref = dataset.id) Filter: (archived IS NULL)

Altered query:
SELECT agdc.dataset.id, agdc.dataset.metadata_type_ref, agdc.dataset.dataset_type_ref, agdc.dataset.metadata, agdc.dataset.archived, agdc.dataset.added, agdc.dataset.added_by, array((SELECT selected_dataset_location.uri_scheme || ':' || selected_dataset_location.uri_body AS anon_1 FROM agdc.dataset_location AS selected_dataset_location WHERE selected_dataset_location.dataset_ref = agdc.dataset.id AND selected_dataset_location.archived IS NULL ORDER BY selected_dataset_location.added DESC, selected_dataset_location.id DESC)) AS uris FROM agdc.dataset WHERE agdc.dataset.archived IS NULL AND (agdc.dataset.metadata #>> '{properties, odc:processing_datetime}')::timestamp at time zone 'utc' >= '1987-10-30T00:00:00.689088'::timestamp AND (agdc.dataset.metadata #>> '{properties, odc:processing_datetime}')::timestamp at time zone 'utc' < '1987-10-31T00:00:00.689088'::timestamp AND agdc.dataset.dataset_type_ref = 478 LIMIT 101;

Results:
2024-05-08 04:16:39 UTC:10.95.118.63(55460):odc_admin@odc:[26185]:LOG: duration: 4437.692 ms plan: Query Text: SELECT agdc.dataset.id, agdc.dataset.metadata_type_ref, agdc.dataset.dataset_type_ref, agdc.dataset.metadata, agdc.dataset.archived, agdc.dataset.added, agdc.dataset.added_by, array((SELECT selected_dataset_location.uri_scheme || ':' || selected_dataset_location.uri_body AS anon_1 FROM agdc.dataset_location AS selected_dataset_location WHERE selected_dataset_location.dataset_ref = agdc.dataset.id AND selected_dataset_location.archived IS NULL ORDER BY selected_dataset_location.added DESC, selected_dataset_location.id DESC)) AS uris FROM agdc.dataset WHERE agdc.dataset.archived IS NULL AND (agdc.dataset.metadata #>> '{properties, odc:processing_datetime}')::timestamp at time zone 'utc' >= '1987-10-30T00:00:00.689088'::timestamp AND (agdc.dataset.metadata #>> '{properties, odc:processing_datetime}')::timestamp at time zone 'utc' < '1987-10-31T00:00:00.689088'::timestamp AND agdc.dataset.dataset_type_ref = 478 LIMIT 101; Limit (cost=3336.75..77702.02 rows=101 width=1315) (actual time=1165.025..4436.100 rows=26 loops=1) -> Gather (cost=3336.75..1001745.73 rows=1356 width=1315) (actual time=1165.024..4436.093 rows=26 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Bitmap Heap Scan on dataset (cost=2336.75..988955.31 rows=565 width=1283) (actual time=2251.032..4431.007 rows=9 loops=3) Recheck Cond: ((archived IS NULL) AND (dataset_type_ref = 478)) Filter: (((((metadata #>> '{properties,odc:processing_datetime}'::text[]))::timestamp without time zone AT TIME ZONE 'utc'::text) >= '1987-10-30 00:00:00.689088'::timestamp without time zone) AND ((((metadata #>> '{properties,odc:processing_datetime}'::text[]))::timestamp without time zone AT TIME ZONE 'utc'::text) < '1987-10-31 00:00:00.689088'::timestamp without time zone)) Rows Removed by Filter: 138739 Heap Blocks: exact=8978 -> Bitmap Index Scan on dix_ga_ls_wo_3_region_code1 (cost=0.00..2336.41 rows=271197 width=0) (actual time=25.016..25.016 rows=416244 loops=1) SubPlan 1 -> Sort (cost=8.59..8.60 rows=1 width=44) (actual time=0.012..0.012 rows=1 loops=26) Sort Key: selected_dataset_location.added DESC, selected_dataset_location.id DESC Sort Method: quicksort Memory: 25kB -> Index Scan using ix_agdc_dataset_location_dataset_ref on dataset_location selected_dataset_location (cost=0.56..8.58 rows=1 width=44) (actual time=0.008..0.008 rows=1 loops=26) Index Cond: (dataset_ref = dataset.id) Filter: (archived IS NULL)

@omad
Copy link
Member

omad commented May 9, 2024

This seems like a good idea, when we can get SQLAlchemy to behave with the new query. I was confused about why it made such a difference, so dug deeper by comparing the Old Query Plan and the New Query Plan.

The first obvious difference, is that the updated query is getting partitioned and run in parallel. It turns out that PostgreSQL is conservative with user defined functions, and they must explicitly be marked safe before being run in parallel.

alter function agdc.common_timestamp parallel safe;

After updating the function, we now run in parallel, and it's down to 4s vs 6s. I haven't worked out why yet it's doing a Parallel Index Scan instead of a Parallel Bitmap Heap Scan . I suspect it might be the timestamp getting converted to text and parsed back to a timestamp when it's not necessary, possibly because of the Argument data types on the agdc.common_timestamp, but I'm not sure. It's not to do with the Volatility I don't thin, because immutable is better than stable.

image

vs

image

Refs:

@SpacemanPaul
Copy link

SpacemanPaul commented May 10, 2024

Looks like this might be a SQLAlchemy 1.x issue - at any rate outputting :: casts in postgresql is listed in the changes for 2.0.

So easy to fix in (or with) datacube-core 1.9 but more problematic in 1.8.

@omad omad force-pushed the fix/utc-timezone-conversion branch from 37d7488 to bc0287e Compare June 13, 2024 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants