Data order columns first #107
base: main
Conversation
… be a list), to enable multidimensional 'columns' in the future
So this doesn't affect how the data are written, just how we store it internally?
Correct - it's still columns in the .ort file.
I'm actually very skeptical about making this change. Multi-dimensional data will be a special case that requires further discussion for the text format, so I don't really see any advantage in changing the index order. Initially I would have been in favour of this change, but now that we have published version 1.x and other programs have used it, this change will break every program that depends on orsopy. Is there an important reason to force this change?
This is only an important change if we are going to support multidimensional datasets. If we as a group decide that is something we are going to do, then this change is to make sure that the shape of the arrays we're using internally looks like this:

It is not the end of the world if we have the second instead of the first, but it doesn't seem like an easy thing to explain to end users. Telling them to stack their 1-d columns in column order instead of row order is a much easier thing to ask of them.

This is clearly a breaking change and would have to be done very carefully, of course, changing version numbers etc. I think the primary thing for us all to decide is whether we want to support multidimensional datasets going forward. From our early discussions it seemed like this might be important for some use cases, but I can't speak to that directly (I am not one of the ones that would make immediate use of multidimensional datasets) - does anyone else want to weigh in?
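The illustration that followed "looks like this:" is not shown here, so the shapes below are only a hedged reconstruction of the two orderings being compared, assuming numpy arrays; the point is that a column-index-first layout leaves room to append extra dimensions to an individual column:

import numpy as np

n_points = 100

# proposed ordering: (column, point, ...), so a column can grow extra dimensions
data_columns_first = np.zeros((3, n_points))   # e.g. Qz, R, dR stacked column-first
resolution_column = np.zeros((n_points, 2))    # one multidimensional 'column'

# current ordering: (point, column), so extra dimensions would have to
# follow the column index
data_points_first = np.zeros((n_points, 3))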
(From memory, multidimensional arrays (such as those used for resolution information) have to use an appropriate number of labelled columns in the .ort file.) I agree that having

Still, I don't mind biting the bullet for changes in this PR; it's still relatively early days. I would definitely like to store multidimensional data in a file - I was waiting until we had a working HDF file rather than trying to stuff it into .ort. Would they be stored as separate HDF5 datasets rather than trying to keep them in the same HDF5 dataset as Q/R? The point would still come up as to how we would keep it in a
There's a working hdf exporter in the PR next door #97, if you want to try it out and see how the data is structured. What I had envisioned was that data would be defined as one of
Then this is compatible with the .ort writer by using the second form but transposed from the current order.
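The two forms referred to above are not shown here, so this is only an illustrative guess: if the internal layout is a single column-first 2-D array, the existing row-oriented .ort output can be produced by transposing on the way out.

import numpy as np

# assumed internal layout: shape (n_columns, n_points)
data = np.random.default_rng(0).random((3, 50))

# .ort files are written one point per row, so transpose before writing
np.savetxt("example.ort", data.T, header="Qz R dR")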
Sorry I accidentally closed this while trying to type on my phone. Reopened now.
An alternative, if we want to allow multidimensional data, is to just define the data as a literal list of columns/arrays. The text version could then decide how to write based on the dimensionality of the list elements, and the hdf5 version could store each column as one dataset, which can then be of any dimensionality. This would require some different tests in the class, but it would not break code that loads data from .ort files.
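A rough sketch of that list-of-columns idea; the helper name save_columns_hdf5 and the flat one-dataset-per-column layout are assumptions for illustration, not part of orsopy:

import numpy as np
import h5py

# each entry is one column; an individual column may itself be multidimensional
columns = {
    "Qz": np.linspace(0.01, 0.3, 50),
    "R": np.ones(50),
    "dR": np.full(50, 0.01),
    "resolution": np.zeros((50, 2)),   # a 2-D 'column'
}

def save_columns_hdf5(path, columns):
    # one HDF5 dataset per column, so each can have any dimensionality
    with h5py.File(path, "w") as f:
        for name, values in columns.items():
            f.create_dataset(name, data=np.asarray(values))

save_columns_hdf5("example.orb", columns)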
I think I prefer this option. It allows dR to be something along the lines of
It seems like we all agree on the "list of columns" approach.
I wonder if we can make further progress with this PR now? (It would need the merge conflicts resolved, and {load,save}_nexus updated, probably.) It seemed like we settled on a list-of-columns approach? How about: if saving to ORT, concatenate the columns, and raise an error if an individual column is multidimensional; if loading from ORT, leave the code unchanged.
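A sketch of that saving rule; the function name save_ort_data and its signature are hypothetical, not the orsopy API:

import numpy as np

def save_ort_data(path, columns, names):
    # ORT stays strictly tabular, so refuse any column that is not 1-D
    for name, col in zip(names, columns):
        if np.asarray(col).ndim != 1:
            raise ValueError(f"column {name!r} is multidimensional and cannot be written to .ort")
    # concatenate the 1-D columns side by side: (n_points, n_columns) on disk
    np.savetxt(path, np.column_stack([np.asarray(c) for c in columns]), header=" ".join(names))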
That seems like a good solution to me. In addition, I think it would be nice if "save_orso" and "load_orso" were able to handle both ".ort" and ".orb", as it would make it easier for software implementing it to support both file formats. (Either by suffix, or in the case of load by checking the first line?)
For reading in, would a check of the "magic number", to see if the file is an HDF5 file, be sufficient? I would rather not go by filename or extension. Sometimes those are mandated by facility policy and are not under the users' (or instrument scientists') control.
Yes, my idea was to use the extension if provided as either .ort or .orb, and otherwise to evaluate the file content in a straightforward way. I'm not sure what would be the easiest way, but the proposed magic-number approach seems fast and easy, and should give no false positives for the binary case.
The first code section of
I think there is a function in h5py for "is this hdf5", which I would prefer over raising exceptions.
At some point in these functions we will have to raise an error when the format is not correct. So we could either perform this test in load_nexus and then catch the specific ORSO error (as proposed by Andrew), or we could make the check in the two functions separately; for me it wouldn't really matter. In any case, the test for hdf5 should be done first, before doing any validation of further ORSO format compatibility. (As should a test for a text file in the case of the .ort reader.)
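The h5py function mentioned above is h5py.is_hdf5, which checks a file's HDF5 signature bytes. A minimal sketch of the dispatch being discussed, assuming the extension-first rule from the comments above; the dispatcher name and the callable parameters are placeholders standing in for orsopy's actual .ort and NeXus readers:

import os
import h5py

def dispatch_load(path, load_text, load_hdf5):
    # load_text / load_hdf5 stand in for the .ort and NeXus readers
    ext = os.path.splitext(path)[1].lower()
    if ext == ".orb":
        return load_hdf5(path)
    if ext == ".ort":
        return load_text(path)
    # no recognised extension: sniff the content via the HDF5 magic number
    return load_hdf5(path) if h5py.is_hdf5(path) else load_text(path)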
As of Python 3.11, they will allow unpacking variable-length tuples with typing of the members (though specifying the types looks a little awkward): https://peps.python.org/pep-0646/#unpacking-unbounded-tuple-types, which we could use to set these constraints in the Python types, creating a subclass for e.g. QzColumn and requiring that the first element of the columns list is a QzColumn. Maybe we can look at this issue again in a few years, when 3.11 is the oldest Python we support.

# imports assumed for the snippet below; Column and ErrorColumn are assumed
# to be the existing orsopy base classes in orsopy.fileio.base
from dataclasses import dataclass
from typing import Literal, Optional

from orsopy.fileio.base import Column, ErrorColumn

@dataclass
class QzColumn(Column):
    """
    Information about a data column.
    """
    name: Literal["Qz"] = "Qz"
    unit: Optional[Literal["1/nm", "1/angstrom"]] = None
    physical_quantity: Optional[str] = "momentum_transfer"

@dataclass
class QzErrorColumn(ErrorColumn):
    """
    Information about a data column.
    """
    error_of: Literal["Qz"] = "Qz"
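For reference, a sketch of how that PEP 646 constraint could be spelled once Python 3.11 syntax is available; the alias name ColumnSpec and the validate_columns stub are made up for this example and build on the snippet above:

# Python 3.11+ (PEP 646): a tuple whose first element must be a QzColumn,
# followed by any number of other Column entries
ColumnSpec = tuple[QzColumn, *tuple[Column, ...]]

# pre-3.11 spelling needs typing_extensions:
#   ColumnSpec = Tuple[QzColumn, Unpack[Tuple[Column, ...]]]

def validate_columns(columns: ColumnSpec) -> None:
    # a type checker can now flag column tuples that do not start with Qz
    ...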
As we consider support for multidimensional columns (e.g. resolution) it would make the most sense to have the column index first in the data order. Then extra dimensions can be added to the shape.
Under the current ordering, the length of the column is the first index and the column number is the second, so any additional dimensions would have to go after the column index, which is awkward.
Note that multidimensional columns will be most easily supported in a binary format (NeXus).