Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data order columns first #107

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

bmaranville
Copy link
Contributor

As we consider support for multidimensional columns (e.g. resolution) it would make the most sense to have the column index first in the data order. Then extra dimensions can be added to the shape.

Under the current ordering, the length of the column is the first index, and the column number is the second index, and then additional dimensions would have to go after the column index, which is awkward.

Note that multidimensional columns will be most easily supported in a binary format (NeXus).

@bmaranville bmaranville requested review from aglavic, arm61 and andyfaff May 9, 2023 14:20
@arm61
Copy link
Contributor

arm61 commented May 9, 2023

So this doesn't affect how the data are written, just how we store it internally?

@bmaranville
Copy link
Contributor Author

Correct - it's still columns in the .ort file.

@aglavic
Copy link
Collaborator

aglavic commented May 9, 2023

I'm actually very skeptical with making this change. Multi-dimensional data will be some special case that requires some further discussion for the text format, so I don't really see any advantage of changing the index order. Initially I would have been for this change, but now that we published version 1.x and other programs have used it this change will break every program that depends on orsopy.

Is there any important reason to force this change?

@bmaranville
Copy link
Contributor Author

bmaranville commented May 9, 2023

This is only an important change if we are going to support multidimensional datasets. If we as a group decide that is something we are going to do, then this change is to make sure that the shape of the arrays we're using internally looks like this:
(column_number, dataset_dim1, dataset_dim2, dataset_dim3...) and not
(dataset_dim1, column_number, dataset_dim2, dataset_dim3...)

It is not the end of the world if we have the second instead of the first, but it doesn't seem like an easy thing to explain to end-users. Telling them to stack their 1-d columns in column order instead of row order is a much easier thing to ask of them.

This is clearly a breaking change and would have to be done very carefully of course, changing version numbers etc.

I think the primary thing for us all to decide is if we want to support multidimensional datasets going forward. From our early discussions it seemed like this might be important for some use cases, but I can't speak to that directly (I am not one of the ones that would make immediate use of multidimensional datasets - does anyone else want to weigh in?)

@bmaranville bmaranville mentioned this pull request May 9, 2023
@andyfaff
Copy link
Contributor

(From memory multidimensional arrays (such as that used for resolution information) have to use an appropriate number of labelled columns in the .ort file)

I agree that having OrsoDataset.data.shape == (ncols, npnts) makes more sense from a software POV (that's what this PR is proposing, right?). Besides the back compat concerns that @aglavic raises there is also the issue that we generally refer to "columns in a dataset", with e.g. Q/R data being two columns. Won't it be confusing when they turn up as two rows in OrsoDataset.data? Could we refer to them differently?

Still I don't mind biting the bullet for changes in this PR, it's still relatively early days.

I would definitely like to store multidimensional data in a file, I was waiting until we had a working HDF file rather than trying to stuff it into .ort. Would they be stored as separate HDF5 datasets rather than trying to keep it in the same HDF5 dataset as Q/R? The point would still come up as to how we would keep it in a OrsoDataset.

@bmaranville
Copy link
Contributor Author

bmaranville commented May 10, 2023

There's a working hdf exporter in the PR next door #97, if you want to try it out and see how the data is structured. What I had envisioned was that data would be defined as one of

  • a list of arrays, with one array per column, e.g. [Qz<ndarray, shape=(101,>), R <ndarray, shape=(101,), dR <ndarray, shape=(101, 15)]
  • a 2-d array where the first dimension is ncolumns and the second is npoints (which can be treated exactly the same in the code as the first case)

Then this is compatible with the .ort writer by using the second form but transposed from the current order.

@bmaranville bmaranville reopened this May 10, 2023
@bmaranville
Copy link
Contributor Author

Sorry I accidentally closed this while trying to type on my phone. Reopened now.

@aglavic
Copy link
Collaborator

aglavic commented May 10, 2023

An alternative if we want to allow multidimensional data is to just define the data as a literal list of columns/arrays. The text version could then decide how to write based on dimensionality of the list elements and the hdf5 version could store each column as one dataset that then can be any dimensionality.

This would require some different tests in the class, but it would not break code that loads data from .ort files.
Another alternative would be some kind of future flag that switches the order and a deprecation warning with switching at a later stage. I don't know, just brain storming here.

@andyfaff
Copy link
Contributor

andyfaff commented Dec 5, 2023

a list of arrays, with one array per column, e.g. [Qz<ndarray, shape=(101,>), R <ndarray, shape=(101,), dR <ndarray, shape=(101, 15)]

I think I prefer this option. It allows dR to be something along the lines of (101, 31), which is the kind of array size that might be used.

@bmaranville
Copy link
Contributor Author

It seems like we all agree on the "list of columns" approach.

@andyfaff
Copy link
Contributor

I wonder if we can make further progress with this PR now? (would need merge conflicts resolved, and {load,save}_nexus to be updated probably.

It seemed like we settled on a list of columns approach?

How about:
If one sets up OrsoDataset with a numpy array one assumes that the second dimension has the correct number of columns (and store the array unchanged)
If one sets up OrsoDataset with a sequence one assumes that each entry is a column (and store the sequence converted to a tuple).

If one is saving to ORT one concatenates the columns. If an individual column is multidimensional raise an error.
If one is saving to ORB save each column to an individual HDF dataset.

If one is loading from ORT, leave code unchanged.
If one is loading from ORB load with the sequence approach.

@aglavic
Copy link
Collaborator

aglavic commented Feb 26, 2024

Seems like good solutions to me.

In addition I think it would be nice if "save_orso" and "load_orso" would be able to do both ".ort" and ".orb" as it would make it easier for software implementing it to support both file formats. (Either by suffix or in case of load by checking the first line?)

@bmaranville
Copy link
Contributor Author

For reading in, would a check of the "magic number" to see if the file is an HDF5 file be sufficient? I would rather not go by filename or extension. Sometimes those are mandated by facility policy and not under the users' (or instrument scientists') control.

@aglavic
Copy link
Collaborator

aglavic commented Feb 27, 2024

Yes, my idea was to use the extension, if provided as either .ort or .orb, else evaluate file content in a straight forward way. I'm not sure what would be the easiest way, but the proposed magic number approach seems fast and easy and should give no false positives for the binary case.

@andyfaff
Copy link
Contributor

The first code section of load_orso could just be:

try:
    load_nexus(...)
except: ThisIsntAnHDFFileError
    pass

# load ort as usual

@bmaranville
Copy link
Contributor Author

The first code section of load_orso could just be:

try:
    load_nexus(...)
except: ThisIsntAnHDFFileError
    pass

# load ort as usual

I think there is a function in h5py for "is this hdf5", which I would prefer over raising exceptions

@aglavic
Copy link
Collaborator

aglavic commented Feb 27, 2024

At one point in the functions we will have to raise an error when the format is not correct. So we could either perform this test in load_nexus and then catch the specific ORSO error (as proposed by Andrew) or we can make the check at two different functions separately. For me it wouldn't really matter.

In any case, the test for hdf5 should be done first, before doing a validation of further ORSO format compatibility. (As should be a test for text file in case of the .ort reader.)

@bmaranville
Copy link
Contributor Author

bmaranville commented Feb 27, 2024

As of python 3.11, they will allow unpacking variable length tuples with typing of the members (though specifying the types looks a little awkward): https://peps.python.org/pep-0646/#unpacking-unbounded-tuple-types , which we could use to set these constraints in the python types, creating a subclass for e.g. QzColumn and requiring that the first element of Orso.columns is of type QzColumn.

Maybe we can look at this issue again in a few years when 3.11 is the most recently-supported python.
Example QzColumn:

@dataclass
class QzColumn(Column):
    """
    Information about a data column.
    """

    name: Literal["Qz"] = "Qz"
    unit: Optional[Literal["1/nm", "1/angstrom"]] = None
    physical_quantity = "momentum_transfer"

@dataclass
class QzErrorColumn(ErrorColumn):
    """
    Information about a data column.
    """

    error_of: Literal["Qz"] = "Qz"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants