NERSC vs. NOIRLab DB interoperation #107

sbailey · 2025-01-29T19:30:29Z

Do we have a recipe for one someone could write database query code that could be used interchangeably between NERSC and NOIRLab with minimal modification, ideally with a starting if/then block to identify location and then allow everything else to be the same?

It appears that NOIRLab requires access through the datalab dl.queryClient.query interface specific to the NOIRLab DBs, while the DESI NERSC database supports direct access via psycopg2 or the schema-specific specprod-db . But I don't see an obvious way to write database query code that could work at either location, despite both locations having very similar DESI databases. Or maybe NOIRLab does allow direct psycopg2 connections, it just isn't the documented/recommended way if working entirely within the NOIRLab ecosystem?

Detail: They also use different schema names (e.g. desi_edr vs. fuji) but if that was the only difference I think it could be easily handled at the top of a script.

@weaverba137 @stephjuneau ?

The text was updated successfully, but these errors were encountered:

stephjuneau · 2025-01-29T19:36:30Z

Can you post an example psycopg2 query here? The queryClient understands SQL and ADQL so as long as you can convert to these languages it should be workable.

sbailey · 2025-01-29T20:10:46Z

I suspect that the SQL queries themselves would be identical; it's more of a question about the wrapper code to perform the query and get the results back. e.g. when working at NERSC:

import psycopg2

conn = psycopg2.connect(dbname='desi', user='desi', host='specprod-db.desi.lbl.gov')
specprod = "iron"
cur = conn.cursor()
cur.execute(f"SET search_path TO {specprod}")

q = """
SELECT zpix.targetid,ra,dec,z,flux_g,flux_r,flux_z
FROM zpix
JOIN photometry ON zpix.targetid=photometry.targetid
WHERE survey='main' AND (desi_target & 1) = 1
  AND zwarn=0 AND spectype='GALAXY' AND deltachi2>25
"""

cur.execute(q)
rows = cur.fetchall()

Is there equivalent code that would run the query q and fetch the results where the same code could be used at both NOIRLab and NERSC?

Update: I replaced it with a simpler query, but same point remains.

weaverba137 · 2025-01-29T20:15:25Z

cur.execute(f"SET search_path TO {specprod}")

This is a side-discussion, but this is dangerous because you are potentially excluding the public schema, which contains, e.g. q3c functions.

weaverba137 · 2025-01-29T20:15:58Z

But SET search_path to {specprod}, public; would probably work fine.

stephjuneau · 2025-01-29T23:57:02Z

Hi @sbailey -- so we only have desi_edr in production and our test version of desi_dr1 only includes zpix and only a subset of columns. I think the ideal approach would be to transfer the iron database to desi_dr1 in our test database. It would not be available to non-Data Lab people but I and/or Ben could benchmark some queries in this way. So I guess this would have to be after NERSC is back online.

As far as what the code would look like, here's a version I tried with desi_edr on our production system. Maybe you can already do the same for psycopg2.

Some take-away points are that:

we need to add an index on deltachi2 if that's going to be used in queries on a regular basis
The query modified for SV3 returns 151,379 rows in ~2.2sec here from a notebook

import psycopg2
from dl import queryClient as qc

## Need to set which database location (here or input keyword to a function)
#dbloc = 'nersc'
dbloc = 'dl'

## Choose which specprod
specprod = "fuji"
## Data Lab version of the schema name (will be set to empty string if running at NERSC)
schemaname_dl = "desi_edr."  #needs the dot for combining with table names

if dbloc=='nersc':
    conn = psycopg2.connect(dbname='desi', user='desi', host='specprod-db.desi.lbl.gov')
    cur = conn.cursor()
    cur.execute(f"SET search_path TO {specprod}")
    schemaname_dl = ''

## Query
q = f"""
SELECT DISTINCT zp.targetid,zp.mean_fiber_ra,zp.mean_fiber_dec,zp.z,phot.flux_g,phot.flux_r,phot.flux_z
FROM {schemaname_dl}zpix AS zp
JOIN {schemaname_dl}photometry AS phot ON zp.targetid=phot.targetid
WHERE zp.survey='sv3' AND (zp.sv3_desi_target & 1)=1
      AND zp.zwarn=0 AND zp.spectype='GALAXY' AND zp.spectype='GALAXY' AND zp.deltachi2>25
"""

## Execute query and save output in "rows"
if dbloc=='nersc':
    cur.execute(q)
    rows = cur.fetchall()
elif dbloc=='dl':
    # output will be an Astropy table (can set fmt='pandas' for dataframe)
    rows = qc.query(sql=q, fmt='table')

stephjuneau · 2025-01-30T00:01:31Z

P.S. I was going to add that I don't think DISTINCT is necessary for just zpix and phot but I think you might have changed the query after I copy-pasted the first example? Anyways, running without DISTINCT still takes ~2sec

sbailey · 2025-01-30T18:56:09Z

Thanks for the example code snippets. Does datalab have anything equivalent to SET search_path TO {schemaname},public;? Or since queryClient is taking SQL, would that work as a prefix to the user query to avoid having to modify the query itself? I'm trying to find a recipe that would abstract out the differences between the sites and provide a common interface so that the user wouldn't have to if/else everywhere they do a query. Something along the lines of

def query_datalab(query, releasename):
    from dl import queryClient
    # code to perform query on that release via queryClient, returning a table of results

def query_nersc(query, releasename)
    import psycopg2
    # code to perform query on that release using psycopg2, returning a table of results

def get_query_function(site, releasename):
    if site == 'nersc':
        return partial(query_nersc, releasename=releasename)
    elif site == 'datalab':
        return partial(query_datalab, releasename=releasename)
    else:
        ...

Then user code could be something like

queryfunc = get_query_function(site, 'edr')
results1 = queryfunc(myquery1)
results2 = queryfunc(myquery2)
...

and the helper functions would handle the details of "fuji" vs. "desi_edr", pyscopyg2 vs. queryClient etc. such that the same SQL query would return the same results from either location with the user only having to do an if/then site selection at the very beginning.

If we could come up with something like that, I'd suggest putting it into desiutil or desispec for the tutorials to use. Brainstorming for consideration.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NERSC vs. NOIRLab DB interoperation #107

NERSC vs. NOIRLab DB interoperation #107

sbailey commented Jan 29, 2025

stephjuneau commented Jan 29, 2025

sbailey commented Jan 29, 2025 •

edited

Loading

weaverba137 commented Jan 29, 2025

weaverba137 commented Jan 29, 2025

stephjuneau commented Jan 29, 2025 •

edited

Loading

stephjuneau commented Jan 30, 2025

sbailey commented Jan 30, 2025

NERSC vs. NOIRLab DB interoperation #107

NERSC vs. NOIRLab DB interoperation #107

Comments

sbailey commented Jan 29, 2025

stephjuneau commented Jan 29, 2025

sbailey commented Jan 29, 2025 • edited Loading

weaverba137 commented Jan 29, 2025

weaverba137 commented Jan 29, 2025

stephjuneau commented Jan 29, 2025 • edited Loading

stephjuneau commented Jan 30, 2025

sbailey commented Jan 30, 2025

sbailey commented Jan 29, 2025 •

edited

Loading

stephjuneau commented Jan 29, 2025 •

edited

Loading