
SpotFindingResults Serialization to Disk (and loading) #1961

Open
wants to merge 12 commits into base: master

Conversation

@ctcisar commented Oct 13, 2021

I have added two simple methods to save SpotFindingResults to disk and reload them. Saving works by calling the save methods of physical_coord_ranges, log, and _results to write each object out, and then writing a JSON file that records the indices and paths of each of these files. The loader opens this JSON and uses each object's loading method to recreate the original SpotFindingResults object.

This can be useful for storing, comparing, and troubleshooting intermediate results. Storing the contents of each xarray-like object as a netCDF keeps the files compatible across most package-version differences, unlike saving the entire SpotFindingResults as a Python pickle.
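The on-disk layout works roughly like the sketch below. The helper names and the index schema here are illustrative assumptions, not the exact code in this PR, and the physical coordinate ranges are treated as an xarray Dataset for simplicity:

```python
# Illustrative sketch of the layout described above: one netCDF per component
# plus a JSON index. Helper names and the index schema are assumptions, not
# the exact code in this PR.
import json
import os

import xarray as xr


def save_spots_sketch(spot_tables, coord_ranges, out_dir):
    """spot_tables maps (round, ch) -> pandas DataFrame; coord_ranges is an xr.Dataset."""
    os.makedirs(out_dir, exist_ok=True)
    index = {"coords": "coords.nc", "spots": []}
    coord_ranges.to_netcdf(os.path.join(out_dir, "coords.nc"))
    for (r, c), df in spot_tables.items():
        fname = f"spots_r{r}_c{c}.nc"
        # DataFrame -> xarray Dataset -> netCDF, one file per (round, ch) key
        df.to_xarray().to_netcdf(os.path.join(out_dir, fname))
        index["spots"].append({"round": r, "ch": c, "path": fname})
    with open(os.path.join(out_dir, "index.json"), "w") as fh:
        json.dump(index, fh)


def load_spots_sketch(out_dir):
    with open(os.path.join(out_dir, "index.json")) as fh:
        index = json.load(fh)
    coord_ranges = xr.open_dataset(os.path.join(out_dir, index["coords"]))
    spot_tables = {
        (entry["round"], entry["ch"]): xr.open_dataset(
            os.path.join(out_dir, entry["path"])
        ).to_dataframe()
        for entry in index["spots"]
    }
    return spot_tables, coord_ranges
```

The JSON index is what lets the loader know which netCDF file belongs to which (round, ch) pair without having to re-derive that from file names.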

@njmei (Collaborator) commented Oct 17, 2021

Would it be possible to not write the *.json file and instead write all the SpotFindingResults data (including log info) into 1 netcdf file?

@berl (Collaborator) commented Oct 17, 2021

@ctcisar thanks for this! Making it easier to view intermediate results is super useful. I think @njmei may have also recently mentioned something like this being missing.
It would also be nice to have some test coverage here: something simple like https://github.com/spacetx/starfish/blob/master/starfish/core/types/test/test_decoded_spots.py that generates some data, saves it, and reloads it using your new methods would be great.
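Roughly the kind of skeleton I mean is below; the `save_spots` / `load_spots` calls are placeholders to be swapped for your new methods, and the dummy data is just for illustration:

```python
# Skeleton of a save/reload roundtrip test; `save_spots` / `load_spots` are
# placeholders where the new SpotFindingResults methods would go.
import numpy as np
import pandas as pd


def test_spot_results_roundtrip(tmp_path):
    rng = np.random.default_rng(0)
    # dummy per-(round, ch) spot tables standing in for SpotFindingResults data
    spot_tables = {
        (r, c): pd.DataFrame(
            {
                "x": rng.uniform(0, 100, size=5),
                "y": rng.uniform(0, 100, size=5),
                "intensity": rng.random(5),
            }
        )
        for r in range(2)
        for c in range(3)
    }

    save_spots(spot_tables, str(tmp_path))   # placeholder for the new save method
    loaded = load_spots(str(tmp_path))       # placeholder for the new load method

    assert set(loaded) == set(spot_tables)
    for key, df in spot_tables.items():
        pd.testing.assert_frame_equal(
            loaded[key], df, check_dtype=False, check_names=False
        )
```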

@ctcisar (Author) commented Oct 18, 2021

@njmei I was thinking about doing something like that initially, but ran into problems saving the SpotAttributes together. Is it possible for a DataArray to contain items that are themselves DataArrays, so that the whole thing can be saved at once?

@berl I wrote something like this using our own data while I was testing; I'll write up and push a proper unit test shortly.

@njmei (Collaborator) commented Oct 18, 2021

@ctcisar Because the SpotAttributes are pandas DataFrames (I think) under the hood, it should be possible to merge them into one DataFrame (with enough data to recreate the individual SpotAttributes later), then convert to xarray and save together with the coordinate data as a netCDF.

If I have some spare time this week, I'll see if I can come up with an implementation that only produces/loads one file (probably netCDF). Starfish is already pretty heavy in terms of all the metadata files it produces; it would be nice to have a solution that doesn't produce even more metadata-like files.
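Roughly what I have in mind, assuming each SpotAttributes can be pulled out as a plain DataFrame (column and function names below are illustrative, not starfish's actual API):

```python
# Sketch of the single-netCDF idea: concatenate the per-(round, ch) spot tables
# into one DataFrame with key columns, convert to xarray, and write one file.
import pandas as pd
import xarray as xr


def merge_spot_tables(spot_tables):
    """spot_tables maps (round, ch) -> pandas DataFrame of spot attributes."""
    frames = []
    for (r, c), df in spot_tables.items():
        tagged = df.copy()
        tagged["round"] = r   # key columns so the tables can be split back out
        tagged["ch"] = c
        frames.append(tagged)
    merged = pd.concat(frames, ignore_index=True)
    # log text and physical coordinate ranges could be attached as attrs here
    return merged.to_xarray()


def split_spot_tables(ds: xr.Dataset):
    df = ds.to_dataframe()
    return {
        (r, c): group.drop(columns=["round", "ch"]).reset_index(drop=True)
        for (r, c), group in df.groupby(["round", "ch"])
    }
```

`merge_spot_tables(tables).to_netcdf("spots.nc")` followed by `split_spot_tables(xr.open_dataset("spots.nc"))` would round-trip the tables, with the log and coordinate ranges riding along as Dataset attributes or extra variables.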

@ctcisar (Author) commented Oct 20, 2021

@berl I added a testing file in the same format as the provided test, hopefully it's up to spec.

In the process of writing this, I realised that alternating calls to Log.encode() and Log.decode() result in each result getting nested in an additional dict, structured as {"log": [data]}. I do not know if this is intended behavior, but I have fixed the loader to accommodate it.
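The loader now just peels off that extra layer when it is present, roughly like this (a simplified sketch of the dict handling, not the exact code):

```python
# Simplified sketch: tolerate the extra {"log": [...]} wrapper that shows up
# after alternating Log.encode()/Log.decode() calls.
def unwrap_log_entry(entry):
    if isinstance(entry, dict) and list(entry.keys()) == ["log"]:
        return entry["log"]
    return entry
```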

@berl (Collaborator) commented Oct 20, 2021

@ctcisar thanks for adding the tests. Sorry about the delays in the CI tests; I started them up and the linter fails now. Can you run it locally, make fixes, and push here again? The CI should run automatically now.

@ctcisar (Author) commented Oct 20, 2021

@berl I was having difficulty getting flake8 to run on my system; it doesn't like the way my Python environment is set up. I addressed all the things mentioned in CI in my last commit.

The error I get when I run make all locally (with the flake8 lines commented out of the Makefile) is the following:

AttributeError: 'DataArray' object has no attribute 'to_numpy'

But I'm not sure where this is coming from, because DataArray does have that method. (Notably, when I tried xarray's own assert_equal, it ran into errors with non-str keys, so I used this comparison to work around it.)
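One guess, not verified against the pinned versions: DataArray.to_numpy() only exists in fairly recent xarray releases, so an environment with an older xarray would raise exactly this AttributeError. Comparing through .values sidesteps the difference, e.g.:

```python
# Version-agnostic comparison: `.values` works on xarray releases that predate
# DataArray.to_numpy(), so this avoids depending on which version is installed.
import numpy as np
import xarray as xr


def data_arrays_equal(a: xr.DataArray, b: xr.DataArray) -> bool:
    return a.dims == b.dims and np.array_equal(a.values, b.values)
```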

@ctcisar (Author) commented Oct 21, 2021

I am at a bit of a loss for the last flake8 error. It says

starfish/core/types/test/test_saving_spots.py:9:1: I100 Import statements are in the wrong order. 'from starfish.core.types import PerImageSliceSpotResults, SpotAttributes, SpotFindingResults' should be before 'from starfish.types import Axes, Coordinates, Features'

The order it describes is already the order they are imported in, though. I have also tried the reverse order, and that didn't make a difference.
