Completely general binning system with fractional binning #54
Conversation
Just a couple of comments about binning - I apologise in advance if this isn't relevant, of use, or is indeed completely wrong! In terms of treating resolution:
I also have a particular concern about including fractions of a voxel:
See some toy code here (warranty not included, may be bugs!)
Prints out
To be clear, I don't know if the above code is actually what you're doing here; my point is that this can go wrong and it's worth checking!
Another aspect that wasn't mentioned by @RichardWaiteSTFC: whenever you divide pixels to do fractional rebinning, the signal in adjacent pixels is no longer independent but becomes correlated. Thus, to be 100% correct, the error propagation needs to take covariance into account. Here's a demo of how it arises:

```python
import numpy as np
import uncertainties as unc
from uncertainties import ufloat
from uncertainties import unumpy as unp

a = ufloat(15, unp.sqrt(15))
b = ufloat(20, unp.sqrt(20))
c = ufloat(25, unp.sqrt(25))

d = 0.5*a + 0.5*b
e = 0.5*b + 0.5*c

print(repr(d))
print(repr(e))
print(np.array(unc.covariance_matrix([d, e])))
```

gives:
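The matrix that comes out can be checked by hand with first-order propagation; here's a quick sketch using only the variances above (just illustrative, nothing PR-specific assumed):

```python
import numpy as np

# variances of the independent raw counts a, b, c
var_a, var_b, var_c = 15.0, 20.0, 25.0

# d = 0.5*a + 0.5*b and e = 0.5*b + 0.5*c
var_d = 0.25 * var_a + 0.25 * var_b   # 8.75
var_e = 0.25 * var_b + 0.25 * var_c   # 11.25
cov_de = 0.5 * 0.5 * var_b            # 5.0 -- d and e share half of b

print(np.array([[var_d, cov_de], [cov_de, var_e]]))
```

The non-zero off-diagonal term (0.25·Var(b)) is exactly the correlation introduced by splitting b between the two output values, and it should match what unc.covariance_matrix([d, e]) prints.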
Another consequence of taking these kinds of correlations into account is that actually all datapoints in a reduced SAS measurement are correlated! As well as this kind of fractional rebinning, the reduction typically uses steps like normalising images/spectra by dividing by a common monitor count. Since this monitor count has an uncertainty, all resultant datapoints become correlated:

```python
a = ufloat(100, unp.sqrt(100))
b = ufloat(201, unp.sqrt(201))
monitor = ufloat(80, unp.sqrt(80))

print(repr(a / monitor))
print(repr(b / monitor))
# again, non-zero off-diagonal terms indicate that a/monitor and b/monitor are correlated
print(np.array(unc.covariance_matrix([a / monitor, b / monitor])))
```

gives:
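The correlation here can also be written down directly with first-order propagation; a quick check (again just illustrative, not code from any reduction):

```python
import numpy as np

# counts and monitor, Poisson statistics (variance = value)
a, var_a = 100.0, 100.0
b, var_b = 201.0, 201.0
m, var_m = 80.0, 80.0

# first-order propagation for a/m and b/m
var_am = var_a / m**2 + (a / m**2) ** 2 * var_m
var_bm = var_b / m**2 + (b / m**2) ** 2 * var_m
# both quotients share the same monitor, hence the covariance term
cov = (a / m**2) * (b / m**2) * var_m

print(np.array([[var_am, cov], [cov, var_bm]]))
```

Relative to the two normalised values, the covariance scales as Var(m)/m², which is why greatly increasing the monitor counts suppresses the correlation, as noted below.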
In many cases the correlation may be sufficiently small to be ignorable, but one should at least know about it. Its effects can be surprisingly large. In the monitor example I gave, one can minimise the correlation by greatly increasing the monitor counts. Given that reduction contains many steps, one has to take good care of how the error propagation is carried out; correlation is something that is conveniently ignored. Heybrock et al. wrote a paper on these kinds of effects.
I only just saw this, Andy. Sure, I get that. I have spoken to Simon and read the paper. I even have a project plan at ISIS to fix some of these things (it won't happen, but hey!). Fact is, we don't have that information.
Correlation is taken into account wherever it is known about, though.
Merged 209ba83 into 85-add-interpolationrebinning
This should not have been closed, weird git stuff!
I've completed the basic work needed to do fractional binning: it can take any kind of input binning and convert it to any kind of output binning. I have only implemented the -1th-order (existing rebinning system) and 0th-order (fractional binning) interpolation so far, but there is room for higher orders in the future; most of the hard work is done.
As apparently there is some controversy over whether this is useful, I thought I would hold off on actually implementing all the new slicers in this way until at least @butlerpd can see the results below. As you can see, the 0th-order ("fractional binning") approach is much more stable, though as things are completely general, it's a bit slower. The main cost of computation for orders >= 0 is working out the mapping between input and output bins; I have implemented various caching optimisations which mean that these calculations only need to be performed once, so applying a slicer to the same input/output mapping will be relatively fast.
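To illustrate the difference in 1D, here is a toy sketch of the two orders (the helper functions are just illustrative, not the code in this PR):

```python
import numpy as np

def rebin_whole(old_edges, values, new_edges):
    """-1th order: each input bin's content goes entirely to the output bin
    that contains the input bin's centre (no splitting)."""
    centres = 0.5 * (old_edges[:-1] + old_edges[1:])
    idx = np.searchsorted(new_edges, centres, side="right") - 1
    out = np.zeros(len(new_edges) - 1)
    for i, v in zip(idx, values):
        if 0 <= i < out.size:
            out[i] += v
    return out

def rebin_fractional(old_edges, values, new_edges):
    """0th order: each input bin's content is shared among output bins in
    proportion to the overlap between the input and output intervals."""
    out = np.zeros(len(new_edges) - 1)
    for j, v in enumerate(values):
        lo, hi = old_edges[j], old_edges[j + 1]
        for k in range(out.size):
            overlap = max(0.0, min(hi, new_edges[k + 1]) - max(lo, new_edges[k]))
            out[k] += v * overlap / (hi - lo)
    return out

# tiny example: three unit input bins summed into two output bins whose
# boundary falls half-way through the middle input bin
old = np.array([0.0, 1.0, 2.0, 3.0])
new = np.array([0.0, 1.5, 3.0])
vals = np.array([10.0, 10.0, 10.0])
print(rebin_whole(old, vals, new))       # [10. 20.] -- the middle bin jumps wholesale
print(rebin_fractional(old, vals, new))  # [15. 15.] -- the middle bin is split 50/50
```

The point the demos below make is the same: whole-bin assignment jumps discontinuously as bin boundaries move, while the fractional version varies smoothly.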
slicer_demo.py contains the code used to generate these plots. Here's a demo of the region sum applied to data sampled at random points:
As you can see, the output of order 0 is pretty smooth compared with order -1.
Data:
Example regions (output bin)
Output for -1 and 0 order, value against bin size
Here's a demo of the region average for data sampled on a grid
As you can see, the output of order 0 is much smoother than order -1. By symmetry, we know that the ideal result would be a constant value.
Data:
Example regions (output bin)
Output for -1 and 0 order, value against bin size (should read average, but I forgot to change the title)