Initial implementation of EUMETSAT IASI-NG reader #2879
Conversation
Alright, so I see from the workflows above that there are some failures due to the formatting of the file; I should maybe start with this before moving forward with the remaining code changes 😉.
Hi everyone, we now think that this MR is ready for a review by the Satpy team, so please let me know if you have any feedback on it. Thanks 😉!
ok I started reviewing this, and it looks sensible, but got distracted by all the comments, so I think we need to address this first.
So let’s start by reading this: https://santim0ren0.medium.com/clean-code-comments-f9eac4ada16d
In this PR I can see different types of comments:
- comments on what the next line is: if the name of the variable or function is good enough, they are not needed
- comments that explain what the following lines do: these lines should then be refactored into a function with a clear name
- comments that complement the docstrings: these should be integrated into those docstrings
I’m happy to answer any questions if you need help with this :)
satpy/readers/iasi_ng_l2_nc.py
Outdated
```python
# Prepare a description for this variable:
prefix, var_name = key.rsplit("/", 1)
dims = self.file_content[f"{key}/dimensions"]
dtype = self.file_content[f"{key}/dtype"]

desc = {
    "location": key,
    "prefix": prefix,
    "var_name": var_name,
    "shape": shape,
    "dtype": f"{dtype}",
    "dims": dims,
    "attribs": {},
}
```
that comment here surely means this code bit should be in its own function called e.g. `prepare_description_for_variable`?
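For reference, the suggested extraction could look roughly like this. This is only a sketch: the function name follows the reviewer's suggestion, and passing `file_content` explicitly (instead of `self`) is an assumption made here to keep the snippet standalone.

```python
def prepare_description_for_variable(file_content, key, shape):
    """Build the description dict for one variable (sketch of the suggested refactor)."""
    prefix, var_name = key.rsplit("/", 1)
    dims = file_content[f"{key}/dimensions"]
    dtype = file_content[f"{key}/dtype"]
    return {
        "location": key,
        "prefix": prefix,
        "var_name": var_name,
        "shape": shape,
        "dtype": f"{dtype}",
        "dims": dims,
        "attribs": {},
    }
```

With a clear name like this, the inline comment on the original snippet becomes unnecessary.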
Hi @mraspaud ,
Thank you very much for this initial feedback 😊
I've read the article you mentioned, and even if I don't fully agree with this perspective on code comments, I think I see your point, and will start working on the requested changes asap. I'll let you know when all the fixes are integrated so you can have another look at this PR.
Thanks 🙏!
Hello @mraspaud ,
I have now updated the content of the iasi_ng_l2_nc.py file, removing those excessive comments (just keeping a couple of "dev notes" at the end).
Now, just to be sure before continuing: you would also expect me to do the same for the test_iasi_ng_l2_nc.py file, right :-)?
Also, I'm not sure if you would expect me to "resolve" each of the points reported above myself, or if you prefer to review the changes first and then resolve those points yourself when you judge the result acceptable (please just let me know if you think I should do it myself).
Thanks!
Thanks a lot, looks much better now!
yes it would be great if you could do the same in the test file.
Regarding the resolution of comments, I like it when the original commenter (in this case me) clicks the resolve button.
Okay 👌 copy that: so I'm updating the test file now.
Hi @mraspaud ,
I have now also updated the test_iasi_ng_l2_nc.py file as discussed above. So please let me know when you find some time to give this PR another look, if you notice anything else that we should discuss/change.
Thanks in advance for your time on this 🙏!
Hi @mraspaud ,
I haven't heard back from you concerning this PR for quite some time now, so this is just a "ping message" to check if maybe this just slipped your mind, and whether you could now find some time to allocate to it ;-). Thanks 🙏!
yeah sorry, I had a long holiday. Reviewed now.
Thanks a lot for the hard work on the reader. I have mostly questions inline, but the one I'm a bit concerned about is the ability to work with actual 3d data…
```python
def _collect_groups_info(self, base_name, obj):
    for group_name, group_obj in obj.groups.items():
        full_group_name = base_name + group_name
        self.file_content[full_group_name] = group_obj
        self._collect_attrs(full_group_name, group_obj)
        self.collect_metadata(full_group_name, group_obj)
        self.collect_dimensions(full_group_name, group_obj)
```
Should this be covered by tests?
Hi @mraspaud ,
Thank you very much for your time reviewing this PR! Let's see what we can do about those additional points...
Hmmm, well... the first thing I notice here, after collecting all the latest changes from the satpy main branch, is that my python environment now seems to be broken (on Windows), since I get this error when trying to run the unit tests on this reader:
```
ImportError while loading conftest 'D:\Projects\satpy\satpy\tests\reader_tests\conftest.py'.
reader_tests\conftest.py:29: in <module>
    from xarray import DataTree
E   ImportError: cannot import name 'DataTree' from 'xarray' (D:\Projects\NervProj.pyenvs\eum_resp2_t2_satpy_env\lib\site-packages\xarray\__init__.py)
```
(even after rebuilding this python env from scratch, so I should have the latest version of xarray, etc.)
=> Strange 🤔... (I'm not completely sure about that, but I think the "DataTree" module is a pretty new addition in xarray, right?) Anyway, I'm investigating this further now.
(=> OK xarray datatree error clarified: I was still installing another package called "xarray-datatree" in my env requirements, and because of this the xarray module was stuck on an old version (v2024.7.0). If I remove the "xarray-datatree" package then I get a recent version of xarray and no error for the tests execution).
Alright, so I started with the last point reported below (ie. usage of actual temporary netcdf files instead of the mock framework), and with that change, the code discussed in this point is now covered by the unit tests, as far as I can tell from checking the iasi_ng_l2_nc.py file at https://app.codecov.io/gh/pytroll/satpy/pull/2879
```python
def is_attribute_path(self, var_path):
    """Check if a given path is a root attribute path."""
    return var_path.startswith("/attr")

def is_property_path(self, var_path):
    """Check if a given path is a sub-property path."""
    return var_path.endswith(("/dtype", "/shape", "/dimensions"))

def is_netcdf_group(self, obj):
    """Check if a given object is a netCDF group."""
    return isinstance(obj, netCDF4.Group)
```
Looks like this is trying to revert something that the base netcdf filehandler is building. Does it show that maybe the base netcdf filehandler class is not adapted to this case, and that it would make more sense to go with e.g. an `xarray.open_dataset(…)` call for example?
Hmmmm, well, from my perspective, the implementation we have in this reader makes appropriate use of the base NetCDF4FileHandler class: this reader depends heavily on the "self.file_content" member to analyze the available variables/dimensions/attributes, so I think it really makes sense to use the NetCDF4FileHandler class to access this instead of rebuilding it completely from scratch (which we would have to do if we opened the file using only xarray directly, for instance).
And at the same time, we need a few small changes/extensions compared to the features provided by the base netcdf class, so to me it sounds appropriate to also override the base implementation of "_collect_groups_info" for instance, and provide a few additional helper methods (ie. those listed above).
```python
def convert_data_type(self, data_array, dtype="auto"):
    """Convert the data type if applicable."""
    attribs = data_array.attrs
    cur_dtype = np.dtype(data_array.dtype).name

    if dtype == "auto" and cur_dtype in ["float32", "float64"]:
        dtype = cur_dtype

    to_float = "scale_factor" in attribs or "add_offset" in attribs
    if dtype == "auto":
        dtype = "float64" if to_float else cur_dtype

    if cur_dtype != dtype:
        data_array = data_array.astype(dtype)

    return data_array

def apply_fill_value(self, data_array):
    """Replace fill/invalid values with NaN on a given array."""
    dtype = np.dtype(data_array.dtype).name
    if dtype not in ["float32", "float64"]:
        return data_array

    nan_val = np.nan if dtype == "float64" else np.float32(np.nan)
    attribs = data_array.attrs

    if "valid_min" in attribs:
        vmin = attribs["valid_min"]
        data_array = data_array.where(data_array >= vmin, other=nan_val)

    if "valid_max" in attribs:
        vmax = attribs["valid_max"]
        data_array = data_array.where(data_array <= vmax, other=nan_val)

    if "valid_range" in attribs:
        vrange = attribs["valid_range"]
        data_array = data_array.where(data_array >= vrange[0], other=nan_val)
        data_array = data_array.where(data_array <= vrange[1], other=nan_val)

    missing_val = attribs.get("missing_value", None)
    missing_val = attribs.get("_FillValue", missing_val)

    if missing_val is None:
        return data_array

    return data_array.where(data_array != missing_val, other=nan_val)

def apply_rescaling(self, data_array):
    """Apply the rescaling transform on a given array."""
    attribs = data_array.attrs
    if "scale_factor" in attribs or "add_offset" in attribs:
        scale_factor = attribs.setdefault("scale_factor", 1)
        add_offset = attribs.setdefault("add_offset", 0)

        data_array = (data_array * scale_factor) + add_offset

        for key in ["valid_range", "valid_min", "valid_max"]:
            if key in attribs:
                attribs[key] = attribs[key] * scale_factor + add_offset

        data_array.attrs.update(attribs)

    return data_array
```
the base netcdf file handler has an `auto_maskandscale` parameter, could it be used instead of this?
I have just tried to enable this when creating the reader:

```python
def __init__(self, filename, filename_info, filetype_info, **kwargs):
    """Initialize object."""
    super().__init__(
        filename, filename_info, filetype_info, auto_maskandscale=True, **kwargs
    )
```

and then disabled the calls to the methods listed above. This seems to work fine in most cases indeed, but then the following unit test will fail:
```python
def test_nbr_iterations_dataset(self, twv_scene):
    """Test loading the nbr_iterations dataset."""
    twv_scene.load(["nbr_iterations"])
    dset = twv_scene["nbr_iterations"]
    assert len(dset.dims) == 2
    assert dset.dtype == np.int32
```

```
E   AssertionError: assert dtype('float64') == <class 'numpy.int32'>
```
=> the original datatype for this array is "int32" in the netcdf file, but there is also a "missing_value" attribute for it, and it seems that in this case the auto_maskandscale flag will automatically convert the array to "float64", which we don't want. So I'm not quite sure how we could make proper use of this? (=> unless you have a suggestion on this point, I think we will need to keep the code above to deal with the int32 arrays as expected)
yes, having a "missing_value" or "_FillValue" attribute means that that value should be replaced by NaN, according to xarray and CF iirc. I can see that this variable is inherently an integer, but integers don't support NaNs. I would be fine with having this as float personally, but I understand if you want to have it as int.
I wonder what is the best way then. I mean, the code here does the job, but I would argue it duplicates the code in xarray. Is there a way to deactivate mask_and_scale just for this variable instead? Or alternatively, to call xarray's mask-and-scale function manually for the relevant variables?
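To illustrate the manual alternative being discussed: the following sketch mimics CF mask-and-scale decoding by hand while skipping the int-to-float promotion that a "missing_value" attribute would otherwise trigger. This is an illustration only, not satpy's or xarray's API; the function name and the `keep_integer` flag are assumptions.

```python
import numpy as np
import xarray as xr


def selective_mask_and_scale(data_array, keep_integer=True):
    """Manually apply CF mask-and-scale, optionally keeping integer arrays as-is.

    Sketch only: mirrors what CF decoding does, but leaves pure integer
    variables untouched (integers cannot hold NaN).
    """
    attrs = data_array.attrs
    has_scaling = "scale_factor" in attrs or "add_offset" in attrs
    if keep_integer and np.issubdtype(data_array.dtype, np.integer) and not has_scaling:
        # Skip masking for unscaled integer variables such as nbr_iterations.
        return data_array
    missing = attrs.get("_FillValue", attrs.get("missing_value"))
    if missing is not None:
        # .where() replaces non-matching values with NaN (promoting to float).
        data_array = data_array.where(data_array != missing)
    return data_array * attrs.get("scale_factor", 1) + attrs.get("add_offset", 0)
```

With something like this, the int32 `nbr_iterations` array would keep its dtype while float variables still get their fill values replaced and their scaling applied.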
satpy/readers/iasi_ng_l2_nc.py
Outdated
```python
def convert_to_datetime(self, data_array, ds_info):
    """Convert the data to datetime values."""
    epoch = ds_info["seconds_since_epoch"]

    # Note: converting the time values to ns precision to avoid warnings
    # from pandas+numpy:
    data_array = xr.DataArray(
        data=pd.to_datetime(epoch) + (data_array * 1e9).astype("timedelta64[ns]"),
        dims=data_array.dims,
        attrs=data_array.attrs,
    )

    return data_array
```
Isn't there an xarray utility function to do this? I think it's supposed to support cf time units
I've tried to investigate this a bit further, and there is indeed an `xr.decode_cf(dset)` function, but this only works on a Dataset object, not a DataArray.
I tried a few things but I don't seem to be able to get it to work as desired in this case. I also tried to perform this decoding directly when opening the xarray dataset using the following:

```python
class IASINGL2NCFileHandler(NetCDF4FsspecFileHandler):
    """Reader for IASI-NG L2 products in NetCDF format."""

    def __init__(self, filename, filename_info, filetype_info, **kwargs):
        """Initialize object."""
        super().__init__(
            filename,
            filename_info,
            filetype_info,
            xarray_kwargs={"decode_cf": True},
            **kwargs,
        )
```

But the datetime checking unit tests are still failing with this 😅. Any suggestions on this point from your side?
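One possible workaround for the Dataset-only limitation mentioned above: wrap the DataArray in a throwaway Dataset before calling `xr.decode_cf`. This is only a sketch, not the PR's code; the epoch in the `units` string is a placeholder assumption.

```python
import numpy as np
import xarray as xr


def decode_time_array(data_array, units="seconds since 2020-01-01T00:00:00"):
    """Decode raw seconds into datetime64 via xarray's CF machinery.

    xr.decode_cf() only accepts a Dataset, so we wrap the DataArray in a
    temporary single-variable Dataset first.
    """
    data_array = data_array.copy()
    data_array.attrs["units"] = units  # CF time units string (assumed epoch)
    decoded = xr.decode_cf(xr.Dataset({"time": data_array}))
    return decoded["time"]
```

Whether this matches the precision/warning behavior of the manual `pd.to_datetime` approach in the reader would still need to be checked against the failing tests.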
```python
    raise KeyError(f"Invalid dimension name {dim_name}")
rep_count = self.dimensions_desc[dim_name]

data_array = xr.concat([data_array] * rep_count, dim=data_array.dims[-1])
```
I’m not familiar with the format, can you tell me why replicating data is necessary?
Well, it's not that replicating the data is strictly "necessary". The rationale is rather as follows:
- The reader should make it easy for end users to access the data provided by the instrument in the most convenient/standard/expected format, and it should reduce the need for technical understanding of the file content structure if possible.
- On this instrument, the generated data will be 3D for most datasets, since the dimensions used will be (n_lines, n_for, n_fov).
- But the "onboard_utc" variable will only be 2D, of shape (n_lines, n_for), because the same timestamp is applicable to all the elements on the n_fov dim for a given pair in (n_lines, n_for).
- So to make this dataset match the number of elements in the other datasets, we replicate the source elements "n_fov" times, while still keeping the data array 2D for user convenience (more on this below in fact).
Note: we have in fact a "configuration entry" in the reader yaml file to control whether this transformation should happen or not, and as you can see, this configuration entry is set to True, simply because this seems the most appropriate setting considering the typical usage cases where we have tested this reader at EUMETSAT. And for now, I don't see any reason myself why it could be beneficial for the end user to disable this timestamp broadcasting.
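As a side note, the n_fov replication described above can also be expressed with xarray broadcasting rather than manual concatenation. A minimal sketch, with hypothetical sizes and variable names:

```python
import numpy as np
import xarray as xr

# Hypothetical sizes, for illustration only.
n_lines, n_for, n_fov = 2, 3, 4

utc = xr.DataArray(
    np.arange(n_lines * n_for, dtype="float64").reshape(n_lines, n_for),
    dims=("n_lines", "n_for"),
)
data = xr.DataArray(
    np.zeros((n_lines, n_for, n_fov)),
    dims=("n_lines", "n_for", "n_fov"),
)

# Replicate the 2D timestamps along n_fov without building the list of
# copies by hand, as xr.concat([utc] * n_fov, ...) would:
utc_3d = utc.broadcast_like(data)
```

`broadcast_like` keeps the operation lazy with dask-backed arrays, which may matter for large granules.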
I'm wondering if a 2d coordinate could be a good match here?
Hi @mraspaud ,
Sorry, but I don't really understand what you mean above by "a 2d coordinate". Could you please clarify whether you really think this current implementation should be changed (and how you want it to be in that case) before this pull request can be accepted? Thanks.
From what I understand, the onboard utc is a time coordinate that is defined along n_lines, n_for, so a good candidate for a 2d coordinate. You can assign this 2d array as a coordinate for your data array. That means that when you select something from your array, for example, onboard utc will follow as a coordinate, so you don't need to flesh it out to 3d.
To be clear, I think this approach would be preferable, as I believe it sticks more to the xarray way of doing things.
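The 2d-coordinate approach suggested here can be sketched as follows (sizes and the `onboard_utc` coordinate name are illustrative, not the reader's actual code):

```python
import numpy as np
import xarray as xr

n_lines, n_for, n_fov = 2, 3, 4  # hypothetical sizes

data = xr.DataArray(
    np.zeros((n_lines, n_for, n_fov)),
    dims=("n_lines", "n_for", "n_fov"),
)
utc = xr.DataArray(
    np.arange(n_lines * n_for).reshape(n_lines, n_for),
    dims=("n_lines", "n_for"),
)

# Attach the 2D time array as a coordinate: it then follows any selection
# made on the data array, without being expanded to 3D.
data = data.assign_coords(onboard_utc=utc)
subset = data.isel(n_lines=0, n_for=1)  # onboard_utc comes along as a scalar
```

Selecting along n_fov leaves the coordinate untouched, so no timestamp replication is needed.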
```python
def apply_reshaping(self, data_array):
    """Apply the reshaping transform on a given IASI-NG data array.

    Those arrays may come as 3D arrays, in which case we collapse the
    last 2 dimensions on a single axis (ie. the number of columns or "y").

    In the process, we also rename the first axis to "x".
    """
    if len(data_array.dims) > 2:
        data_array = data_array.stack(y=(data_array.dims[1:]))

    if data_array.dims[0] != "x":
        data_array = data_array.rename({data_array.dims[0]: "x"})

    if data_array.dims[1] != "y":
        data_array = data_array.rename({data_array.dims[1]: "y"})

    return data_array
```
Is 3d data not working as you expect in satpy? I would rather we keep 3d arrays as such and not flatten them if possible.
As mentioned above, the idea here is to use the reader as an intermediate layer between the end user and the data file, to hide some of the "non-critical technical/acquisition details".
One such consideration is that the measurement datasets have a shape of (n_lines, n_for, n_fov) because of different timestamps for the acquisition/processing of different chunks of columns, but in the end the resulting dataset is meant to be used as a (n_lines, n_cols) array (in most cases at least; otherwise, the user could still use the n_for/n_fov values to reshape the arrays if that is ever needed?).
So that's the reason why we transform the data arrays to this shape here, and also rename the dimensions by default.
=> Do you really think this is overstepping the role the reader should have in "formatting" the data? Considering we are already filtering the data content itself with valid min/max, replacing missing values with NaN, or transforming double values into timestamps for instance, I would have considered this reshaping transformation to be at a similar level of "data pre-processing" actually.
what we do in readers is provide the data in an xarray.DataArray form that is consumable by the user for whatever work they want to do. The data filtering is to make sure the data is presented in a way that is consistent with the intent of the data provider. For example, for storage purposes, netcdf data is often packed, ie floats are converted to ints, because it takes less space. That's why we need min/max and missing values. I consider this merely as restoring approximately the data to the state it was before writing.
Now, if the data provider has an array in 3d, I am guessing the intent is for it to stay in 3d, otherwise it would have been transformed to 2d to start with, as it would have taken less space on disk for example.
I'm not familiar with the data, but the general philosophy in satpy is to try not to guess what the user wants and provide the data as is after unpacking. If you think this 2d array is useful to the users, I'm all for having it, but then it's a derived product that can be available (on top of the 3d one), but maybe not straight from the reader, but rather as a composite.
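To illustrate the split being proposed: if the reader returns the data in its native 3D shape, a user (or a derived product) can still recover the flattened 2D view with a short `stack` call. The dimension names and sizes below are hypothetical:

```python
import numpy as np
import xarray as xr

# Data as the reader would return it, in native 3D shape.
data = xr.DataArray(
    np.arange(24).reshape(2, 3, 4),
    dims=("n_lines", "n_for", "n_fov"),
)

# User-side flattening: collapse n_for/n_fov into a single "y" axis and
# rename n_lines to "x", mirroring what apply_reshaping does in the reader.
flat = data.stack(y=("n_for", "n_fov")).rename({"n_lines": "x"})
```

The stacked "y" dimension keeps a MultiIndex of (n_for, n_fov), so the original structure remains recoverable via `unstack`.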
Hi @mraspaud ! So I've just completed another pass on the code for this reader based on your latest review feedback above. I made some changes (mostly changing the test file to use real netcdf files), and for some of the other points I tried my best, but I'm also ending up with some more questions / requests for confirmation myself 😅. So please just let me know when you get another chance to look into this PR ;-), thanks!
Hello @mraspaud, I wanted to follow up on the discussion points raised above. Would you be able to provide feedback on these open items? I'd appreciate any guidance on what really needs to be addressed to move this PR forward. Thank you for your time 👍!
Hello Martin @mraspaud! The new IASI-NG reader will be part of a new version of EUMETSAT's MONALiSA (https://www.eumetsat.int/media/50321) which we are working on in a "new features project". The IASI-NG Satpy reader has now jumped onto the critical path of the project. Currently I am checking if we are still on track and whether we are able to land this aircraft on time 🙂 And if there are any crosswinds during landing 😉 Do you have an idea when you will have time to check the remaining items updated by Emmanuel? Kind regards,
Hi Rudi,
Hello Martin @mraspaud,
This PR introduces the initial implementation of the IASI-NG reader for Satpy.

The changes provided here are self-contained in three files:
- `satpy/readers/iasi_ng_l2_nc.py`: The implementation file for the reader.
- `satpy/etc/readers/iasi_ng_l2_nc.yaml`: The YAML configuration file for the reader.
- `satpy/tests/reader_tests/test_iasi_ng_l2_nc.py`: The unit tests for the reader.

There are still a few points I need to address (this list may be extended progressively):
- … (removing leftover `print(...)` statements).
- … the `AUTHORS.md` file.

The idea here is to start this PR early in the development phase to get early feedback and review notes from you, ensuring we are on the right track for eventual PR acceptance.

If you have any feedback on this PR already, please let me know 😊! In the meantime, I will continue working on the points mentioned above and will provide additional details on the changes as they are made on the feature branch.

Thanks for your help 🙏!