Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NERSC vs. NOIRLab DB interoperation #107

Open
sbailey opened this issue Jan 29, 2025 · 7 comments
Open

NERSC vs. NOIRLab DB interoperation #107

sbailey opened this issue Jan 29, 2025 · 7 comments

Comments

@sbailey
Copy link
Collaborator

sbailey commented Jan 29, 2025

Do we have a recipe for one someone could write database query code that could be used interchangeably between NERSC and NOIRLab with minimal modification, ideally with a starting if/then block to identify location and then allow everything else to be the same?

It appears that NOIRLab requires access through the datalab dl.queryClient.query interface specific to the NOIRLab DBs, while the DESI NERSC database supports direct access via psycopg2 or the schema-specific specprod-db . But I don't see an obvious way to write database query code that could work at either location, despite both locations having very similar DESI databases. Or maybe NOIRLab does allow direct psycopg2 connections, it just isn't the documented/recommended way if working entirely within the NOIRLab ecosystem?

Detail: They also use different schema names (e.g. desi_edr vs. fuji) but if that was the only difference I think it could be easily handled at the top of a script.

@weaverba137 @stephjuneau ?

@stephjuneau
Copy link
Contributor

Can you post an example psycopg2 query here? The queryClient understands SQL and ADQL so as long as you can convert to these languages it should be workable.

@sbailey
Copy link
Collaborator Author

sbailey commented Jan 29, 2025

I suspect that the SQL queries themselves would be identical; it's more of a question about the wrapper code to perform the query and get the results back. e.g. when working at NERSC:

import psycopg2

conn = psycopg2.connect(dbname='desi', user='desi', host='specprod-db.desi.lbl.gov')
specprod = "iron"
cur = conn.cursor()
cur.execute(f"SET search_path TO {specprod}")

q = """
SELECT zpix.targetid,ra,dec,z,flux_g,flux_r,flux_z
FROM zpix
JOIN photometry ON zpix.targetid=photometry.targetid
WHERE survey='main' AND (desi_target & 1) = 1
  AND zwarn=0 AND spectype='GALAXY' AND deltachi2>25
"""

cur.execute(q)
rows = cur.fetchall()

Is there equivalent code that would run the query q and fetch the results where the same code could be used at both NOIRLab and NERSC?

Update: I replaced it with a simpler query, but same point remains.

@weaverba137
Copy link
Member

cur.execute(f"SET search_path TO {specprod}")

This is a side-discussion, but this is dangerous because you are potentially excluding the public schema, which contains, e.g. q3c functions.

@weaverba137
Copy link
Member

But SET search_path to {specprod}, public; would probably work fine.

@stephjuneau
Copy link
Contributor

stephjuneau commented Jan 29, 2025

Hi @sbailey -- so we only have desi_edr in production and our test version of desi_dr1 only includes zpix and only a subset of columns. I think the ideal approach would be to transfer the iron database to desi_dr1 in our test database. It would not be available to non-Data Lab people but I and/or Ben could benchmark some queries in this way. So I guess this would have to be after NERSC is back online.

As far as what the code would look like, here's a version I tried with desi_edr on our production system. Maybe you can already do the same for psycopg2.

Some take-away points are that:

  • we need to add an index on deltachi2 if that's going to be used in queries on a regular basis
  • The query modified for SV3 returns 151,379 rows in ~2.2sec here from a notebook
import psycopg2
from dl import queryClient as qc

## Need to set which database location (here or input keyword to a function)
#dbloc = 'nersc'
dbloc = 'dl'

## Choose which specprod
specprod = "fuji"
## Data Lab version of the schema name (will be set to empty string if running at NERSC)
schemaname_dl = "desi_edr."  #needs the dot for combining with table names

if dbloc=='nersc':
    conn = psycopg2.connect(dbname='desi', user='desi', host='specprod-db.desi.lbl.gov')
    cur = conn.cursor()
    cur.execute(f"SET search_path TO {specprod}")
    schemaname_dl = ''

## Query
q = f"""
SELECT DISTINCT zp.targetid,zp.mean_fiber_ra,zp.mean_fiber_dec,zp.z,phot.flux_g,phot.flux_r,phot.flux_z
FROM {schemaname_dl}zpix AS zp
JOIN {schemaname_dl}photometry AS phot ON zp.targetid=phot.targetid
WHERE zp.survey='sv3' AND (zp.sv3_desi_target & 1)=1
      AND zp.zwarn=0 AND zp.spectype='GALAXY' AND zp.spectype='GALAXY' AND zp.deltachi2>25
"""

## Execute query and save output in "rows"
if dbloc=='nersc':
    cur.execute(q)
    rows = cur.fetchall()
elif dbloc=='dl':
    # output will be an Astropy table (can set fmt='pandas' for dataframe)
    rows = qc.query(sql=q, fmt='table')

@stephjuneau
Copy link
Contributor

P.S. I was going to add that I don't think DISTINCT is necessary for just zpix and phot but I think you might have changed the query after I copy-pasted the first example? Anyways, running without DISTINCT still takes ~2sec

@sbailey
Copy link
Collaborator Author

sbailey commented Jan 30, 2025

Thanks for the example code snippets. Does datalab have anything equivalent to SET search_path TO {schemaname},public;? Or since queryClient is taking SQL, would that work as a prefix to the user query to avoid having to modify the query itself? I'm trying to find a recipe that would abstract out the differences between the sites and provide a common interface so that the user wouldn't have to if/else everywhere they do a query. Something along the lines of

def query_datalab(query, releasename):
    from dl import queryClient
    # code to perform query on that release via queryClient, returning a table of results

def query_nersc(query, releasename)
    import psycopg2
    # code to perform query on that release using psycopg2, returning a table of results

def get_query_function(site, releasename):
    if site == 'nersc':
        return partial(query_nersc, releasename=releasename)
    elif site == 'datalab':
        return partial(query_datalab, releasename=releasename)
    else:
        ...

Then user code could be something like

queryfunc = get_query_function(site, 'edr')
results1 = queryfunc(myquery1)
results2 = queryfunc(myquery2)
...

and the helper functions would handle the details of "fuji" vs. "desi_edr", pyscopyg2 vs. queryClient etc. such that the same SQL query would return the same results from either location with the user only having to do an if/then site selection at the very beginning.

If we could come up with something like that, I'd suggest putting it into desiutil or desispec for the tutorials to use. Brainstorming for consideration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants