Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gaia: change the signature of the method load_data #3014

Merged

Conversation

cosmoJFH
Copy link
Contributor

@cosmoJFH cosmoJFH commented May 29, 2024

The present implementation of the method GaiaClass.load_data makes use of the parameter output_file that defines a user defined zipped file were the data is saved.

We would like to change this parameter by the boolean parameter dump_to_file to save the results in the hard-code file name "datalink_output.zip" that will be saved in the current working directory.

This PR also contains 2 bug fixes in the present implementation:

  1. if the value of the parameter output_file is a simple name, i.e., without a path, the code will collect all the files with extensions '.fits', '.xml', '.csv', or '.ecsv' from the current working directory. This must be changed, since this implies that any file with any of those extensions will be included in zipped file, i.e., independently if they come from the execution of the method load_data, or from previous executions.

  2. If the parameter output_file is used, the built zip file is used to save the files coming from by the datalink service. Then the file is unzipped and the files that it contains are read, so that a list of Tables is built and returned to the user. But the the unzipped files are never removed. We think that this is a bug and it would be good to remove the unzipped files.

New units tests were developed.

cc @esdc-esac-esa-int

jira: GAIAMNGT-1700

@pep8speaks
Copy link

pep8speaks commented May 29, 2024

Hello @cosmoJFH! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-10-16 22:42:55 UTC

@bsipocz bsipocz added this to the v0.4.8 milestone Jun 11, 2024
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to understand the motivation behind removing the ability from the users to define the output name. At the minimum, it should go through a deprecation cycle, but I would rather keep the functionality.

Besides that it looks fine, minus a minor bug about having it but not using the overwrite parameter.

@@ -325,21 +323,31 @@ def load_data(self, ids, *, data_release=None, data_structure='INDIVIDUAL', retr

return files

def build_general_output_filename(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we try to slowly simplify the API whereever we can, so I would suggest to make this private from the beginning.
Or just as well, not have a separate method, but add the 4 lines at the place of usage (as I see it is only used in one place)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have removed the function build_general_output_filename.

@@ -170,7 +169,7 @@ def logout(self, *, verbose=False):

def load_data(self, ids, *, data_release=None, data_structure='INDIVIDUAL', retrieval_type="ALL",
linking_parameter='SOURCE_ID', valid_data=False, band=None, avoid_datatype_check=False,
format="votable", output_file=None, overwrite_output_file=False, verbose=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing a parameter from the public APIs should be done through a deprecation period.

But also, I do wonder why is this necessary, to remove the ability from the users to define a filename/location that better suits them?

So, my suggestion would be to change the default None to datalink_output.zip, and add the new boolean parameter. That way the default will be what you proposed here, but the users still have the ability to select a better location.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a problem, we should not break people's code without noticing them first.

if not os.path.isdir(path):
try:
os.mkdir(path)
except FileExistsError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is the overwrite_output_file parameter, it should be used here isn't it?

@bsipocz
Copy link
Member

bsipocz commented Jun 18, 2024

I would also raise the idea of maybe having a separate method for the download instead of the dump_to_file parameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether the dump_to_file is the best phrase to use.)

Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.

@andamian
Copy link

I would also raise the idea of maybe having a separate method for the download instead of the dump_to_file parameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether the dump_to_file is the best phrase to use.)

Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.

The proposed change is a bit counterintuitive - it's hard to tell what drives it (maybe subsequent processing on the resulting file?). If the user has no control on the name of the file I would expect it to be at least timestamped. Also, current working directory is a bit vague. What is current working directory when a script calls the astroquery library, or a test is run with pytest or astroquery is invoked in a Jupyter notebook. At least make the location of the file available of the file available in the class.

@keflavich
Copy link
Contributor

Sorry I'm a bit late to this, but I agree that removing users' ability to specify the output location does not seem like a good idea.

However, I had a hard time parsing the original intent: is the idea that datalink_output.zip is supposed to be a temporary file that is always removed after the files are extracted? It seems instead what is implemented is the opposite: the datalink_output.zip is kept, but all the files extracted from it are deleted. To me, this seems backward; what's the use case for preserving a generically-named zip file?

@hectorcanovas
Copy link
Contributor

Dear all,

Thanks for your valuable feedback. Let me give you an overview of the way this functionality is currently implemented and the reasons behind our proposal.
The load_data() method retrieves ancillary products from the Gaia Archive that are served via the DataLink protocol. These products are stored inside a (Python) dictionary, whose keys are named according to the values of the following arguments:

  • data_structure : options are: “INDIVIDUAL”, “RAW”, and “COMBINED” – this last one is going to be deprecated)
  • format: options are: VOTable and VOTable_plain (translates to “xml”), FITS, CSV, and ECSV
  • retrieval_type: options are: 'ALL' , 'EPOCH_PHOTOMETRY', 'RVS', 'XP_CONTINUOUS', 'XP_SAMPLED', 'MCMC_GSPPHOT' and 'MCMC_MSC'

Examples of key names are: “EPOCH_PHOTOMETRY-Gaia DR3 2263166706630078848.xml” or “EPOCH_PHOTOMETRY_RAW.xml”.

This file-naming procedure reproduces the behavior of the web interface of the Gaia ESA Archive (which does not allow to specify the names of the files to be downloaded). That said, it is possible to retrieve:

  • One product for one source, or
  • One product for many sources, or
  • All the products available for one source, or
  • All the products available for many sources.

From the web interface, the data is always downloaded as a single compressed (.zip) file that is named as: <user_name><job_id>. or <job_id>. (depending on the request type: registered or anonymous).
To the best of my knowledge, it is not possible to retrieve the <job_id> value to associated to the request launched by the load_data() method, and hence it is not possible to add this attribute to the name of the compressed file retrieved via Astroquery.Gaia. We therefore decided to assign a fixed value to the name of the (compressed) file that stores the data: “datalink_output.zip”. We considered the following reasons to support this idea:

  1. It mimics the behavior of the Gaia ESA Archive web GUI: the name of the retrieved file is set by the Archive.
  2. It avoids problems if users specify a name with or without adding the “.zip” suffix (it is possible to update the method to check this issue and correct the name as necessary, but that requires extra work), and
  3. It is very easy to modify the (fixed) file name in a Python script by e.g., appending a timestamp suffix.

We could consider the possibility of adding a separated method to download the ancillary files, but there are few reasons that make me reluctant to implement it (although I am not totally against this):

  1. The output of the load_data method is not a TAP job, but a Python dictionary whose elements (keys) contain ancillary products retrieved via DataLink.
  2. In DR4 the Gaia Archive will serve images in FITS format. Our tests show that these products can be written in a .fits file with the proposed implementation. However, retrieving them without using the “dump_to_file = True” option results in an incomplete download, as only the tabular HDUs from the fits file are retrieved, and the HDU that contains the image part is missed.

All in all, we considered that the proposed change would benefit the users by simplifying the way that the DataLink products can be stored using Astroquery.Gaia). But if you think that this change can be improved (by e.g. automatically adding a timestamp to the currently fixed file name), please let us know.

Kind regards,

==================================
Dr. Héctor Cánovas Cabrera (ORCID)

Telespazio for ESA - European Space Agency
SCO-09 – Support Archive Scientist for the Gaia mission
European Space Astronomy Centre (ESAC)
Camino Bajo del Castillo s/n
28692 Villanueva de la Cañada (Madrid), Spain

@andamian
Copy link

andamian commented Jul 2, 2024

Thank you for your great description @hectorcanovas. Yes, trying to emulate the Web Interface functionality is a great goal as it will make it easier for user to use both mechanisms to access their data. I'm not familiar with the Gaia Archive, but in general, the Web interface is used for exploration while the library is used for processing.
Even so, the Web browser can prompt for the location where the file to be dowloaded or, in case it goes into the Downloads directory, it adds a number suffix each time the file with the same name is downloaded so versions are not overridden - maybe this is different in Gaia.
The astroquery library is used for processing in which multiple downloads are likely to be required and users will need a way to distinguish between these files.
Again, I'm not familiar with Gaia and please ignore if these general comments do not apply to your case. It is a design decision that people most familiar with the service/user base must take.

@hectorcanovas
Copy link
Contributor

Dear all, following your feedback I have proposed to our team to add a timestamp suffix to the name of the file generated by the load_data method. With this change, the output filename would be:
"datalink_output_<time_stamp>.zip"
where <time_stamp> corresponds to the UTC time associated to the request received at the Gaia Archive DataLink server. The <time_stamp> format should follow the ISO 8601 standard: "yyyymmddThhmmss". Based on our internal records, a 1-second granularity should be enough to avoid duplicated filenames (but this format can be updated as needed). If you agree with this proposal we do our best to prepare a new, dedicated Pull Request right after the summer break.

@cosmoJFH
Copy link
Contributor Author

Changes committed to the branch: now the output file name follows the patter "datalink_output_<time_stamp>.zip" where the time stamp is obtained as

    now = datetime.now(timezone.utc)
    output_file = 'datalink_output_' + now.strftime("%Y%m%dT%H%M%S") + '.zip'

@bsipocz
Copy link
Member

bsipocz commented Sep 30, 2024

Test failures are real, while the RTD docs build failure is not. Could you please rebase? That would take care of both the conflict and RTD.

Copy link

codecov bot commented Oct 1, 2024

Codecov Report

Attention: Patch coverage is 70.96774% with 9 lines in your changes missing coverage. Please review.

Project coverage is 67.49%. Comparing base (6959406) to head (58f8692).
Report is 158 commits behind head on main.

Files with missing lines Patch % Lines
astroquery/gaia/core.py 70.96% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3014      +/-   ##
==========================================
+ Coverage   67.39%   67.49%   +0.10%     
==========================================
  Files         233      233              
  Lines       18405    18413       +8     
==========================================
+ Hits        12404    12428      +24     
+ Misses       6001     5985      -16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cosmoJFH
Copy link
Contributor Author

cosmoJFH commented Oct 1, 2024

Test failures are real, while the RTD docs build failure is not. Could you please rebase? That would take care of both the conflict and RTD.

The tests and conflicts were fixed.

@cosmoJFH
Copy link
Contributor Author

cosmoJFH commented Oct 7, 2024

Hi @bsipocz, what pending tasks do we need to address on our end?

@bsipocz
Copy link
Member

bsipocz commented Oct 7, 2024

Hi @bsipocz, what pending tasks do we need to address on our end?

No, it waits on my todo to a final review, I'll try to get to it this week.

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only remaining point is to add the deprecation on the removed kwargs, so while specifying it will have no effect, it this PR won't break existing user code just raise a warning first.

I'll add this deprecation and then go ahead with the merge.

Thanks!

if not overwrite_output_file and os.path.exists(output_file):
raise ValueError(f"{output_file} file already exists. Please use overwrite_output_file='True' to "
f"overwrite output file.")

path = os.path.dirname(output_file)

log.debug(f"Directory where the data will be saved: {path}")

if path != '':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maybe cleaner as '' will be False here:

Suggested change
if path != '':
if path:

@@ -415,6 +420,8 @@ gaia
epoch photometry service to return all data associated to a given source.
[#2376]

- New retrieval types for datalink (Gaia DR4 release). [#3110]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is at the wrong location, but I'm cleaning up the changelog for release time anyway, so it doesn't matter here.

@bsipocz bsipocz merged commit aa35035 into astropy:main Oct 16, 2024
11 checks passed
@bsipocz
Copy link
Member

bsipocz commented Oct 16, 2024

Note to the other maintainers: I should have cleaned up the commit history here with a rebase, sometimes mistakes happen 🤷‍♀️

@hectorcanovas
Copy link
Contributor

hectorcanovas commented Oct 17, 2024

Dear @bsipocz,

Thank you very much for your help and feedback in implementing this PR. This update will support our ongoing preparations for Gaia DR4.

Regards,

Dr. Héctor Cánovas Cabrera (ORCID)

Telespazio for ESA - European Space Agency
SCO-09 – Support Archive Scientist for the Gaia mission

European Space Astronomy Centre (ESAC)
Camino Bajo del Castillo s/n
28692 Villanueva de la Cañada (Madrid), Spain

@cosmoJFH
Copy link
Contributor Author

I would also raise the idea of maybe having a separate method for the download instead of the dump_to_file parameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether the dump_to_file is the best phrase to use.)

Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.

@bsipocz, could you give us an example in other module for a method for the download?

@keflavich
Copy link
Contributor

https://github.com/astropy/astroquery/blob/main/astroquery/alma/core.py#L896 is a download example from alma

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants