
Use essreduce loaders #114

Merged
merged 37 commits into from
Mar 20, 2024

Conversation

nvaytet
Member

@nvaytet nvaytet commented Mar 8, 2024

Fixes #112

Note: CI is failing because essreduce does not exist as a package yet

EDIT by JL:
Also fixes #106, #109, #110

Note:
We have changed the place where we merge events from multiple files in the workflow.
We used to merge the events early, into a single data array. This meant we had to make sure that sample and detector positions were the same for all runs.
Now we compute Q and wavelength for numerator and denominator for each file, and merge just before normalization.
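To illustrate the reordering described in this note, here is a minimal sketch in plain Python (all names and the toy per-file computation are illustrative, not the actual workflow types): each file is reduced on its own, and the results are merged only just before normalization, so sample/detector positions no longer need to agree across runs.

```python
from functools import reduce

def reduce_single_file(events: list[float]) -> dict:
    """Stand-in for the per-file computation of Q/wavelength terms."""
    return {'numerator': sum(events), 'denominator': float(len(events))}

def merge(a: dict, b: dict) -> dict:
    """Merge two per-file results; done just before normalization."""
    return {k: a[k] + b[k] for k in a}

files = [[1.0, 2.0], [3.0], [4.0, 5.0]]
merged = reduce(merge, (reduce_single_file(f) for f in files))
normalized = merged['numerator'] / merged['denominator']  # 3.0
```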

@jl-wynen
Member

jl-wynen commented Mar 8, 2024

@nvaytet Can you target the PR on the namespace pkg branch? Then the diff only shows the relevant changes. Merging that branch will retarget the PR automatically onto main.

@nvaytet nvaytet changed the base branch from main to namespace-package March 8, 2024 12:07
@nvaytet
Member Author

nvaytet commented Mar 8, 2024

@jl-wynen I forgot to remove the old loaders...

Comment on lines 185 to 196
# Note here we specify the name instead of using
# ess.reduce.nexus.extract_detector_data because we need the name to put the
# events back into the original data group.
key = f'{detector_name}_events'
events = _preprocess_data(
    out[key],
    sample_position=raw_sample['position'],
    source_position=raw_source['position'],
)
if detector_name in DETECTOR_BANK_RESHAPING:
    events = DETECTOR_BANK_RESHAPING[detector_name](events)
out[key] = events
Member

Can you explain your comment, and why we need this? Can't these transformations be applied after extracting the detector data?

Member Author

So you're saying we could insert an additional step after the RawData[Sample] has been extracted, where we would reshape and patch (add necessary coordinates, etc.) the data, and output something called PreparedData that would be just before the MaskedData in the chain?

Member

Yes.
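A minimal sketch of the step agreed on here, in plain Python (the type aliases, the detector name, and the placeholder reshaping dict are illustrative, not the final code): after extraction, a provider turns RawData into PreparedData by reshaping the bank and patching missing coordinates, just before MaskedData in the chain.

```python
from typing import NewType

RawData = NewType('RawData', dict)
PreparedData = NewType('PreparedData', dict)

# Placeholder: maps a detector name to its bank-reshaping function.
DETECTOR_BANK_RESHAPING = {
    'larmor_detector': lambda d: {**d, 'shape': 'reshaped'},
}

def prepare_detector_data(raw: RawData, detector_name: str) -> PreparedData:
    """Reshape the bank and patch missing coordinates after extraction."""
    data = dict(raw)
    if detector_name in DETECTOR_BANK_RESHAPING:
        data = DETECTOR_BANK_RESHAPING[detector_name](data)
    # Patch a coordinate needed later in the chain (placeholder value).
    data.setdefault('sample_position', (0.0, 0.0, 0.0))
    return PreparedData(data)
```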

Comment on lines 205 to 207
out = nexus.load_monitor(file_path=file_path, monitor_name=monitor_name)
key = f'{monitor_name}_events'
out[key] = _preprocess_data(out[key], source_position=raw_source['position'])
Member

Similar question to above, why do we need to preprocess before extracting the monitor data? This makes this brittle, as we still hard-code some naming convention.

Base automatically changed from namespace-package to main March 11, 2024 09:21
@jl-wynen
Copy link
Member

ESSreduce has been released

@jl-wynen jl-wynen force-pushed the use-essreduce-loaders branch from 74b5d6c to 38fcbac Compare March 11, 2024 13:34
@nvaytet nvaytet marked this pull request as draft March 11, 2024 14:16
@nvaytet nvaytet marked this pull request as ready for review March 18, 2024 13:46
@nvaytet nvaytet requested a review from SimonHeybrock March 18, 2024 13:46
"providers = providers + (\n",
" isis.data.transmission_from_background_run,\n",
" isis.data.transmission_from_sample_run,\n",
")\n",
"pipeline = sciline.Pipeline(providers, params=params)\n",
"pipeline.set_param_table(masks)"
"# pipeline.set_param_table(masks)\n",
Member

Commented code?

plopp
pythreejs
sciline>=24.2.0
sciline>=23.9.1
Member

Why the version regression?

Member Author

I don't know, @jl-wynen ?

Member

It happened in this commit: e757525 because I updated dependencies. This replaced the bound because it was set in 825f4ad in base.in instead of pyproject.toml.

Member Author

OK, I fixed it in pyproject.toml.
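For reference, a sketch of what the fix looks like (section layout is illustrative; only the sciline pin is from this thread): keeping the lower bound in pyproject.toml rather than in base.in, so a dependency update cannot silently replace it.

```toml
[project]
dependencies = [
    # Keep version bounds here, not in base.in, so that regenerating
    # the pinned requirements cannot lower them.
    "sciline>=24.2.0",
]
```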

@@ -33,7 +28,7 @@ def apply_component_user_offsets_to_raw_data(
    data: RawData[ScatteringRunType],
    sample_offset: Optional[SampleOffset],
    detector_bank_offset: Optional[DetectorBankOffset],
) -> RawDataWithComponentUserOffsets[ScatteringRunType]:
) -> PatchedData[ScatteringRunType]:
Member

In #59 (comment) you argued that the naming I chose, ConfiguredRawData, was unclear. I would argue that PatchedData is considerably less clear than that, since "patching" has multiple meanings. In particular, we are not patching sections of data together. How about going back to something like ConfiguredData after all?

Member Author

@nvaytet nvaytet Mar 18, 2024

So we are adding missing coordinates, variances on the events, and possibly adding offsets to some of the coordinates (in the case of ISIS data).

Either we find a name that covers all 3 well enough (I don't really get the above from ConfiguredData, but PatchedData is indeed quite general), or we should make them into separate steps?

I don't like making separate steps because it means that all of these steps could be optional, and we'd have to have e.g. a dummy step for applying position offsets in loki, just because it is needed in isis.

I basically wanted a name that would mean "ready for the rest of the reduction workflow", but was never able to find a good name :-(

Member

How about just dropping the Raw prefix? This function would turn RawData into Data. But I still don't see what is so wrong with configured-data.

Member Author

I tried ConfiguredAndCompleteData but I still don't really like it. Now I'm thinking of ConfiguredReducibleData to hint that it has the necessary pieces to proceed with the rest of the reduction.

But I still don't see what is so wrong with configured-data.

I think Configured captures the notion that some configuration was applied by the user, e.g. position offsets, but not so much the fetching of the sample/source positions so that we can later perform conversion to wavelength. That's why I thought we were "patching" different parts of the NeXus file together. But I agree that PatchedData is just as vague.

Regarding your PR: I thought we could be more specific there because the name described only the step where user offsets were applied to the data. Now that different things can happen in this step, we do need to go back to something more general.

Comment on lines 247 to 249
return FinalSummedQ[ScatteringRunType, IofQPart](
    reduce(_merge_events_or_histograms, data.values())
)
Member Author

I tried something like

reducer = sc.reduce(data.values())
out = reducer.bins.concat() if reducer.bins is not None else reducer.sum()

but reducer.bins is never None, even if the elements inside data.values() are dense data.

So I went back to functools.reduce.

Member

@SimonHeybrock SimonHeybrock Mar 19, 2024

Why can't you use it as if data.values()[0].bins is None? Or all(x.bins is None for x in data.values())?

Member

@SimonHeybrock SimonHeybrock Mar 19, 2024

To be clear: Doing it in the way you chose is extremely inefficient if you have many files (please give it a try!). You are basically adding an $N^2$ term to the required memory access (reads+writes+allocations).

Member Author

Ok, I thought about getting the first element in data.values() but I didn't think it was worth it. I was under the impression that functools.reduce and sc.reduce were equivalent, including in terms of performance.

Member

Actually, I have to qualify my statement: When summing, I think the current implementation of sc.reduce is actually at a disadvantage, compared to functools.reduce, but for concatenation it is $N^2$.

Comment on lines 21 to 24
def apply_pixel_masks(
    data: RawData[ScatteringRunType],
    masks: Optional[sciline.Series[PixelMaskFilename, PixelMask]],
    data: TofData[ScatteringRunType],
    masked_ids: Optional[sciline.Series[PixelMaskFilename, MaskedDetectorIDs]],
) -> MaskedData[ScatteringRunType]:
Member

Not totally happy with this change, as it will prevent using manual masks that are constructed, e.g., using a condition of the pixel positions, etc. Can the MaskedDetectorIDs -> PixelMask conversion be done in a separate step?

Member Author

The issue was that it was using the RawData[SampleRun] to get the detector ids, and using it also for the BackgroundRun. This led to an infinite recursion error when trying to set a parameter table for using events from multiple runs, I think because it was trying to get a sciline.Series[Filename[BackgroundRun], RawData[SampleRun]] which it could not materialize.

I could have a PixelMask[ScatteringRunType] as an alternative? I think that would work?

Member Author

it will prevent using manual masks that are constructed, e.g., using a condition of the pixel positions

I think in this case you would have to use a different provider anyway, because the masks are sciline.Series[PixelMaskFilename, MaskedDetectorIDs], i.e. they depend on PixelMaskFilename, so most probably wouldn't be created from a condition?

FYI I'm still running into issues with PixelMask[ScatteringRunType].
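For the position-based case being discussed, a hypothetical sketch in plain Python (the function, pixel data, and threshold are all illustrative; in practice this would be a separate provider producing a boolean mask rather than going through MaskedDetectorIDs):

```python
import math

def mask_beyond_radius(
    positions: list[tuple[float, float]], radius: float
) -> list[bool]:
    """True for pixels farther than `radius` from the beam axis."""
    return [math.hypot(x, y) > radius for x, y in positions]

pixels = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0)]
mask = mask_beyond_radius(pixels, 1.0)  # [False, False, True]
```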

@nvaytet nvaytet merged commit 0371984 into main Mar 20, 2024
3 checks passed
@nvaytet nvaytet deleted the use-essreduce-loaders branch March 20, 2024 13:35