Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor WFAU image retrieval to reject deprecated images #2809

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from
169 changes: 147 additions & 22 deletions astroquery/ukidss/tests/data/image_results.html
Original file line number Diff line number Diff line change
@@ -1,22 +1,147 @@
<HTML><HEAD>&nbsp;<script language="JavaScript" type="text/javascript">
<!--
window.focus();
//--> </script>
</HEAD><h3>GetImage cut-out results</h3>
<br><b>J2000 coords:</b> RA: 83.6330757 Dec:22.014436
<br><b>Programme:</b> All UKIDSS surveys
<br><b>Filter:</b> all
<br>Processing ...
<br>Connecting to database: UKIDSSDR7PLUS<p>
<table border="1"><tr bgcolor="#FFFFCC"><td><b>Link</b></td><td><b>multiframeID</b></td><td><b>frametype</b></td><td><b>obstype</b></td><td><b>filterid</b></td><td><b>shortname</b></td><td><b>dateObs</b></td><td><b>extNum</b></td>
<tr bgcolor=#FFDDDD>
<td><a href="http://surveys.roe.ac.uk/wsa/cgi-bin/getImage.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01818_sf_st.fit&mfid=1737581&extNo=4&lx=1339&hx=1638&ly=1952&hy=2251&rf=0&flip=1&uniq=556_22_18_31563_1&xpos=150.6&ypos=150.1&band=K&ra=83.6330757&dec=22.014436" target=show>show</a></td>
<td nowrap>1737581</td>
<td nowrap>leavstack</td>
<td nowrap>OBJECT</td>
<td nowrap>5</td>
<td nowrap>K</td>
<td nowrap>2007-10-11 13:12:05.5</td>
<td nowrap>5</td>
</table>
1 rows returned.
<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>WSA ImageList</title>

<script type="text/javascript" src="http://wsa.roe.ac.uk/imSwap.js">
</script>
</head>

<body bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#000080" alink="#FF0000" onLoad="swap(0);" >
<table cellpadding="0" cellspacing="0" border="0"><tr><td><b>WSA ImageList</b></td><td>&nbsp;&nbsp;&nbsp;</td>
<td><img name="imgMain" src="http://wsa.roe.ac.uk/static.gif" border="1"></td></table><p>
Not logged in: links will only be returned for frames that are publicly accessible<p>
Archive Listing<p>Searching...<br>
<b>Survey: </b>UKIDSS Galactic Clusters Survey, GCS<br>
<b>Waveband: </b>K<br>
<b>Minimum RA:</b> 5.551333333333333 hours <b>Maximum RA:</b> 5.599333333333333 hours <br>
<b>Minimum Dec:</b> 21.68116666666667 degrees <b>Maximum Dec:</b> 22.347833333333334 degrees <br>
<script language="JavaScript" type="text/javascript"> <!--
swap(1);
--> </script>
<br><b>Using database:</b> UKIDSSDR11PLUS
<p><table><tr><td align="right"><b>View</b> column link</td><td align="left">shows jpeg images of multiframe in a new window plus links to download file(s)</td>
<tr><td align="right"><b>Img</b> column link</td><td align="left">download the <a href="http://wsa.roe.ac.uk/qa.html#compress">RICE compressed</a> FITS image file. Use <b>View</b> column link to retrieve uncompressed images.</td>
<tr><td align="right"><b>Cat</b> column link</td><td align="left">download the FITS catalogue file.</td>
</table>
<br>begin row 1<br><table border="1"><tr bgcolor="#FFFFCC"><td align="center">View</td><td align="center">Img</td><td align="center">Cat</td>
<td align="middle"><b>multiframeID</b></td>
<td align="middle"><b>frameType</a></b></td>
<td align="middle"><b>obstype</a></b></td>
<td align="middle"><b>raBase</a></b></td>
<td align="middle"><b>decBase</a></b></td>
<td align="middle"><b>shortname</a></b></td>
<td align="middle"><b>exptime</a></b></td>
<td align="middle"><b>dateObs</a></b></td>
<td align="middle"><b>project</a></b></td>
<td align="middle"><b>numDetectors</b></td>
<td align="middle"><b>ukirtRunNo</a></b></td>
</tr>
<tr bgcolor=#FFDDDD>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/display.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01802_sf.fit&cat=NONE&comp=djoser:/disk04/wsa/products/jpgs/20071011_v1/w20071011_01802_sf&noExt=4&MFID=1737551&rID=48" target=display>view</a></td>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/fits_download.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01802_sf.fit&MFID=1737551&rID=48">FITS</a></td>
<td>&nbsp;</td>
<td nowrap align="right">1737551</td>
<td nowrap align="right">leav</td>
<td nowrap align="right">OBJECT</td>
<td nowrap align="right">+5.5777306</td>
<td nowrap align="right">+21.7913333</td>
<td nowrap align="right">K</td>
<td nowrap align="right">+10.000000</td>
<td nowrap align="right">2007-10-11 13:08:30.0</td>
<td nowrap align="right">U/UKIDSS/GCS21</td>
<td nowrap align="right">4</td>
<td nowrap align="right">1802</td>
</tr>
<tr bgcolor=#DDDDDD>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/display.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01802_sf_st.fit&cat=djoser:/disk05/wsa/ingest/fits/20071011_v1/w20071011_01802_sf_st_cat.fits&comp=djoser:/disk04/wsa/products/jpgs/20071011_v1/w20071011_01802_sf_st&noExt=4&MFID=1737553&rID=48" target=display>view</a></td>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/fits_download.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01802_sf_st.fit&MFID=1737553&rID=48">FITS</a></td>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/fits_download.cgi?file=djoser:/disk05/wsa/ingest/fits/20071011_v1/w20071011_01802_sf_st_cat.fits&MFID=1737553&rID=48">FITS</a></td>
<td nowrap align="right">1737553</td>
<td nowrap align="right">leavstack</td>
<td nowrap align="right">OBJECT</td>
<td nowrap align="right">+5.5777306</td>
<td nowrap align="right">+21.7913333</td>
<td nowrap align="right">K</td>
<td nowrap align="right">+10.000000</td>
<td nowrap align="right">2007-10-11 13:08:30.0</td>
<td nowrap align="right">U/UKIDSS/GCS21</td>
<td nowrap align="right">4</td>
<td nowrap align="right">1802</td>
</tr>
<tr bgcolor=#FFDDDD>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/display.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01806_sf.fit&cat=NONE&comp=djoser:/disk04/wsa/products/jpgs/20071011_v1/w20071011_01806_sf&noExt=4&MFID=1737559&rID=48" target=display>view</a></td>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/fits_download.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01806_sf.fit&MFID=1737559&rID=48">FITS</a></td>
<td>&nbsp;</td>
<td nowrap align="right">1737559</td>
<td nowrap align="right">leav</td>
<td nowrap align="right">OBJECT</td>
<td nowrap align="right">+5.5777306</td>
<td nowrap align="right">+21.7913333</td>
<td nowrap align="right">K</td>
<td nowrap align="right">+10.000000</td>
<td nowrap align="right">2007-10-11 13:09:18.6</td>
<td nowrap align="right">U/UKIDSS/GCS21</td>
<td nowrap align="right">4</td>
<td nowrap align="right">1806</td>
</tr>
<tr bgcolor=#DDDDDD>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/display.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01818_sf.fit&cat=NONE&comp=djoser:/disk04/wsa/products/jpgs/20071011_v1/w20071011_01818_sf&noExt=4&MFID=1737579&rID=48" target=display>view</a></td>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/fits_download.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01818_sf.fit&MFID=1737579&rID=48">FITS</a></td>
<td>&nbsp;</td>
<td nowrap align="right">1737579</td>
<td nowrap align="right">leav</td>
<td nowrap align="right">OBJECT</td>
<td nowrap align="right">+5.5935528</td>
<td nowrap align="right">+21.7913333</td>
<td nowrap align="right">K</td>
<td nowrap align="right">+10.000000</td>
<td nowrap align="right">2007-10-11 13:12:05.5</td>
<td nowrap align="right">U/UKIDSS/GCS21</td>
<td nowrap align="right">4</td>
<td nowrap align="right">1818</td>
</tr>
<tr bgcolor=#FFDDDD>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/display.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01818_sf_st.fit&cat=djoser:/disk05/wsa/ingest/fits/20071011_v1/w20071011_01818_sf_st_cat.fits&comp=djoser:/disk04/wsa/products/jpgs/20071011_v1/w20071011_01818_sf_st&noExt=4&MFID=1737581&rID=48" target=display>view</a></td>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/fits_download.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01818_sf_st.fit&MFID=1737581&rID=48">FITS</a></td>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/fits_download.cgi?file=djoser:/disk05/wsa/ingest/fits/20071011_v1/w20071011_01818_sf_st_cat.fits&MFID=1737581&rID=48">FITS</a></td>
<td nowrap align="right">1737581</td>
<td nowrap align="right">leavstack</td>
<td nowrap align="right">OBJECT</td>
<td nowrap align="right">+5.5935528</td>
<td nowrap align="right">+21.7913333</td>
<td nowrap align="right">K</td>
<td nowrap align="right">+10.000000</td>
<td nowrap align="right">2007-10-11 13:12:05.5</td>
<td nowrap align="right">U/UKIDSS/GCS21</td>
<td nowrap align="right">4</td>
<td nowrap align="right">1818</td>
</tr>
<tr bgcolor=#DDDDDD>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/display.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01822_sf.fit&cat=NONE&comp=djoser:/disk04/wsa/products/jpgs/20071011_v1/w20071011_01822_sf&noExt=4&MFID=1737587&rID=48" target=display>view</a></td>
<td><a href="http://wsa.roe.ac.uk/cgi-bin/fits_download.cgi?file=/disk05/wsa/ingest/fits/20071011_v1/w20071011_01822_sf.fit&MFID=1737587&rID=48">FITS</a></td>
<td>&nbsp;</td>
<td nowrap align="right">1737587</td>
<td nowrap align="right">leav</td>
<td nowrap align="right">OBJECT</td>
<td nowrap align="right">+5.5935528</td>
<td nowrap align="right">+21.7913333</td>
<td nowrap align="right">K</td>
<td nowrap align="right">+10.000000</td>
<td nowrap align="right">2007-10-11 13:12:53.8</td>
<td nowrap align="right">U/UKIDSS/GCS21</td>
<td nowrap align="right">4</td>
<td nowrap align="right">1822</td>
</tr>
</table><br> row(s) 1 to 6 displayed.<br>
<pre></pre><a href="javascript:history.go(-1)">Back to form</a> (uses Javascript)<script language="JavaScript" type="text/javascript">
<!--
swap(0)
-->
</script>
</body></html>
14 changes: 14 additions & 0 deletions astroquery/ukidss/tests/test_ukidss_remote.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,3 +58,17 @@ def test_query_region_constraints(self):

assert isinstance(table_constraint, Table)
assert len(table_noconstraint) >= len(table_constraint)

def test_deprecated_image_list(self):
"""
Regression test for Issue 2808
"""
crd = SkyCoord(ra=211.3194905, dec=54.413845, unit=(u.deg, u.deg))
uk = ukidss.core.UkidssClass()
uk.database = 'UHSDR2'
result = uk.get_image_list(crd, waveband='all', ignore_deprecated=True)

# this image is not deprecated (deprecated==0)
assert "http://wsa.roe.ac.uk/cgi-bin/getImage.cgi?file=/disk73/wsa/ingest/fits/20190614_v5/w20190614_00626_st.fit&mfid=11076607&extNo=4&lx=1276&hx=1426&ly=187&hy=337&rf=0&flip=1&uniq=834_579_14_86394_6&xpos=75.9&ypos=75.7&band=K&ra=211.3194905&dec=54.413845" in result # noqa: E501
# this image is deprecated (deprecated==80)
assert "http://wsa.roe.ac.uk/cgi-bin/getImage.cgi?file=/disk53/wsa/ingest/fits/20150129_v5/w20150129_02901_st.fit&mfid=8278383&extNo=4&lx=1274&hx=1425&ly=195&hy=345&rf=0&flip=1&uniq=834_579_14_86394_5&xpos=76.6&ypos=75.9&band=J&ra=211.3194905&dec=54.413845" not in result # noqa: E501
96 changes: 81 additions & 15 deletions astroquery/wfau/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,13 @@
import time
from math import cos, radians
import requests
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup, XMLParsedAsHTMLWarning
from io import BytesIO, StringIO

import astropy.units as u
import astropy.coordinates as coord
import astropy.io.votable as votable
from astropy.io import ascii

from ..query import QueryWithLogin
from ..exceptions import InvalidQueryError, TimeoutError, NoResultsWarning
Expand Down Expand Up @@ -290,10 +291,49 @@ def get_images_async(self, coordinates, *, waveband='all', frame_type='stack',
show_progress=show_progress)
for url in image_urls]

def get_image_list(self, coordinates, *, waveband='all', frame_type='stack',
image_width=1 * u.arcmin, image_height=None,
radius=None, database=None,
programme_id=None, get_query_payload=False):
def get_image_list(self, coordinates, *, radius=None, ignore_deprecated=True, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spell out kwargs, we try to get rid of this way of hiding them so we should not introduce more of these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, ok. That's a lot of duplicated text, which I don't love, but fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but if it's public API, one would need useful docstrings, too. dumping kwarg out there is not exactly useful. OTOH, if it's an option of not getting slightly overlapping public API then certainly I would be in favour of not duplicating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_image_list is more of an internal-pointing method. It is technically usable, and useful, to users, but get_image_table is more useful and I prefer to point people there. I'd actually lean toward making _get_image_list private before duplicating its docstring and keyword arguments. Let me know which you prefer

"""
See `get_image_table` for a full list of options.

This method will return _only_ the URLs requested as a list of URLs.

Parameters
----------
ignore_deprecated : bool
If set (default: True), only images with the ``deprecated`` flag
set to zero will be included

Returns
-------
url_list : list of image urls

"""
image_table = self.get_image_table(coordinates, radius=radius, **kwargs)

if ignore_deprecated and radius is None:
image_urls = image_table[image_table['deprecated'] == 0]['Link']
elif radius is not None:
image_urls = image_table['Img']
else:
image_urls = image_table['Link']

# different links for radius queries and simple ones
if radius is not None:
image_urls = [link for link in image_urls if
('fits_download' in link and '_cat.fits'
not in link and '_two.fit' not in link)]
else:
# Not sure this is necessary any more (as of #2809), but it seems
# harmless and I'm not removing it until I'm sure
image_urls = [link.replace("getImage", "getFImage")
for link in image_urls]

return image_urls

def get_image_table(self, coordinates, *, waveband='all', frame_type='stack',
image_width=1 * u.arcmin, image_height=None,
radius=None, database=None,
programme_id=None, get_query_payload=False):
"""
Function that returns a list of urls from which to download the FITS
images.
Expand Down Expand Up @@ -337,7 +377,9 @@ def get_image_list(self, coordinates, *, waveband='all', frame_type='stack',

Returns
-------
url_list : list of image urls
table : Table
An astropy table containing the metadata table, including URLs, of
the requested files.

"""

Expand Down Expand Up @@ -399,19 +441,43 @@ def get_image_list(self, coordinates, *, waveband='all', frame_type='stack',
return request_payload

response = self._wfau_send_request(query_url, request_payload)
self._penultimate_response = response
response = self._check_page(response.url, "row")
self._last_response = response
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really necessary?

Also, try to avoid overwriting variables to help future debugging, the second response could be response_row, or whatever makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, this was for debugging, I'll remove it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the _last_response addition has historically been very useful in debugging other tools, so I prefer to keep it, but I agree about the variable renaming

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeap, keeping _last_response is fine, I really just meant to spot _penultimate_response ;) and the rename.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough, I could drop that. But when else do you get to use penultimate in the real world?!


return self.parse_imagequery_page(response.text, radius=radius)

def parse_imagequery_page(self, html_in, radius=None):
"""
Parse the image metadata page
"""
ahref = re.compile(r'href="([a-zA-Z0-9_\.&\?=%/:-]+)"')

image_urls = self.extract_urls(response.text)
# different links for radius queries and simple ones
if radius is not None:
image_urls = [link for link in image_urls if
('fits_download' in link and '_cat.fits'
not in link and '_two.fit' not in link)]
else:
image_urls = [link.replace("getImage", "getFImage")
for link in image_urls]
html = "\n".join([
# for radius searches, "FITS" needs to be s/FITS/url/
row.replace(">FITS<", ">{}<".format(ahref.search(row).groups()[0])) if ">FITS<" in row else
row
for row in html_in.split("\n")])
with warnings.catch_warnings():
# this is really html; the xml parser doesn't work
warnings.simplefilter(action="ignore", category=XMLParsedAsHTMLWarning)
soup = BeautifulSoup(html, features='html5')
httb = soup.findAll('table')[2]
firstrow = httb.findAll('tr')[0]
for td in firstrow.findAll('td'):
td.name = 'th'
return ascii.read(str(httb), format='html')

return image_urls
else:
html = "\n".join([
# for ascii.read: th -> header
row.replace("td", "th") if row.startswith("<table border") else
# "show" is the default, but we want the URLs
row.replace(">show<", ">{}<".format(ahref.search(row).groups()[0])) if ">show<" in row else
row
for row in html_in.split("\n")])
return ascii.read(html, format='html')

def extract_urls(self, html_in):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I commented this before in another module: I think methods like this should not be class methods, definitely not public ones, but instead private functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I agree - happy to make this, and get_image_list, both private

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can't make get_image_list private, at least not without a deprecation period.

"""
Expand Down
Loading