-
-
Notifications
You must be signed in to change notification settings - Fork 409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gaia: change the signature of the method load_data #3014
Gaia: change the signature of the method load_data #3014
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to understand the motivation behind removing the ability from the users to define the output name. At the minimum, it should go through a deprecation cycle, but I would rather keep the functionality.
Besides that it looks fine, minus a minor bug about having it but not using the overwrite parameter.
astroquery/gaia/core.py
Outdated
@@ -325,21 +323,31 @@ def load_data(self, ids, *, data_release=None, data_structure='INDIVIDUAL', retr | |||
|
|||
return files | |||
|
|||
def build_general_output_filename(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we try to slowly simplify the API whereever we can, so I would suggest to make this private from the beginning.
Or just as well, not have a separate method, but add the 4 lines at the place of usage (as I see it is only used in one place)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have removed the function build_general_output_filename.
@@ -170,7 +169,7 @@ def logout(self, *, verbose=False): | |||
|
|||
def load_data(self, ids, *, data_release=None, data_structure='INDIVIDUAL', retrieval_type="ALL", | |||
linking_parameter='SOURCE_ID', valid_data=False, band=None, avoid_datatype_check=False, | |||
format="votable", output_file=None, overwrite_output_file=False, verbose=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing a parameter from the public APIs should be done through a deprecation period.
But also, I do wonder why is this necessary, to remove the ability from the users to define a filename/location that better suits them?
So, my suggestion would be to change the default None to datalink_output.zip
, and add the new boolean parameter. That way the default will be what you proposed here, but the users still have the ability to select a better location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a problem, we should not break people's code without noticing them first.
if not os.path.isdir(path): | ||
try: | ||
os.mkdir(path) | ||
except FileExistsError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is the overwrite_output_file
parameter, it should be used here isn't it?
I would also raise the idea of maybe having a separate method for the download instead of the Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions. |
The proposed change is a bit counterintuitive - it's hard to tell what drives it (maybe subsequent processing on the resulting file?). If the user has no control on the name of the file I would expect it to be at least timestamped. Also, current working directory is a bit vague. What is current working directory when a script calls the |
Sorry I'm a bit late to this, but I agree that removing users' ability to specify the output location does not seem like a good idea. However, I had a hard time parsing the original intent: is the idea that |
Dear all, Thanks for your valuable feedback. Let me give you an overview of the way this functionality is currently implemented and the reasons behind our proposal.
Examples of key names are: “EPOCH_PHOTOMETRY-Gaia DR3 2263166706630078848.xml” or “EPOCH_PHOTOMETRY_RAW.xml”. This file-naming procedure reproduces the behavior of the web interface of the Gaia ESA Archive (which does not allow to specify the names of the files to be downloaded). That said, it is possible to retrieve:
From the web interface, the data is always downloaded as a single compressed (.zip) file that is named as: <user_name><job_id>. or <job_id>. (depending on the request type: registered or anonymous).
We could consider the possibility of adding a separated method to download the ancillary files, but there are few reasons that make me reluctant to implement it (although I am not totally against this):
All in all, we considered that the proposed change would benefit the users by simplifying the way that the DataLink products can be stored using Astroquery.Gaia). But if you think that this change can be improved (by e.g. automatically adding a timestamp to the currently fixed file name), please let us know. Kind regards, ================================== Telespazio for ESA - European Space Agency |
Thank you for your great description @hectorcanovas. Yes, trying to emulate the Web Interface functionality is a great goal as it will make it easier for user to use both mechanisms to access their data. I'm not familiar with the Gaia Archive, but in general, the Web interface is used for exploration while the library is used for processing. |
Dear all, following your feedback I have proposed to our team to add a timestamp suffix to the name of the file generated by the load_data method. With this change, the output filename would be: |
… the format "%Y%m%dT%H%M%S"
Changes committed to the branch: now the output file name follows the patter "datalink_output_<time_stamp>.zip" where the time stamp is obtained as
|
Test failures are real, while the RTD docs build failure is not. Could you please rebase? That would take care of both the conflict and RTD. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3014 +/- ##
==========================================
+ Coverage 67.39% 67.49% +0.10%
==========================================
Files 233 233
Lines 18405 18413 +8
==========================================
+ Hits 12404 12428 +24
+ Misses 6001 5985 -16 ☔ View full report in Codecov by Sentry. |
The tests and conflicts were fixed. |
Hi @bsipocz, what pending tasks do we need to address on our end? |
No, it waits on my todo to a final review, I'll try to get to it this week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only remaining point is to add the deprecation on the removed kwargs, so while specifying it will have no effect, it this PR won't break existing user code just raise a warning first.
I'll add this deprecation and then go ahead with the merge.
Thanks!
if not overwrite_output_file and os.path.exists(output_file): | ||
raise ValueError(f"{output_file} file already exists. Please use overwrite_output_file='True' to " | ||
f"overwrite output file.") | ||
|
||
path = os.path.dirname(output_file) | ||
|
||
log.debug(f"Directory where the data will be saved: {path}") | ||
|
||
if path != '': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This maybe cleaner as ''
will be False here:
if path != '': | |
if path: |
@@ -415,6 +420,8 @@ gaia | |||
epoch photometry service to return all data associated to a given source. | |||
[#2376] | |||
|
|||
- New retrieval types for datalink (Gaia DR4 release). [#3110] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is at the wrong location, but I'm cleaning up the changelog for release time anyway, so it doesn't matter here.
Note to the other maintainers: I should have cleaned up the commit history here with a rebase, sometimes mistakes happen 🤷♀️ |
Dear @bsipocz, Thank you very much for your help and feedback in implementing this PR. This update will support our ongoing preparations for Gaia DR4. Regards, Dr. Héctor Cánovas Cabrera (ORCID) Telespazio for ESA - European Space Agency European Space Astronomy Centre (ESAC) |
@bsipocz, could you give us an example in other module for a method for the download? |
https://github.com/astropy/astroquery/blob/main/astroquery/alma/core.py#L896 is a download example from alma |
The present implementation of the method GaiaClass.load_data makes use of the parameter output_file that defines a user defined zipped file were the data is saved.
We would like to change this parameter by the boolean parameter dump_to_file to save the results in the hard-code file name "datalink_output.zip" that will be saved in the current working directory.
This PR also contains 2 bug fixes in the present implementation:
if the value of the parameter output_file is a simple name, i.e., without a path, the code will collect all the files with extensions '.fits', '.xml', '.csv', or '.ecsv' from the current working directory. This must be changed, since this implies that any file with any of those extensions will be included in zipped file, i.e., independently if they come from the execution of the method load_data, or from previous executions.
If the parameter output_file is used, the built zip file is used to save the files coming from by the datalink service. Then the file is unzipped and the files that it contains are read, so that a list of Tables is built and returned to the user. But the the unzipped files are never removed. We think that this is a bug and it would be good to remove the unzipped files.
New units tests were developed.
cc @esdc-esac-esa-int
jira: GAIAMNGT-1700