BUG: np.mean(pd.Series) != np.mean(pd.Series.values) #42878

Closed
3 tasks done
sebasv opened this issue Aug 4, 2021 · 14 comments · Fixed by #47716
Labels
Bug · Dependencies (Required and optional dependencies) · Numeric Operations (Arithmetic, Comparison, and Logical operations) · Upstream issue (Issue related to pandas dependency)

Comments


sebasv commented Aug 4, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

a = pd.Series(np.random.normal(scale=0.1, size=(1_000_000,)).astype(np.float32)).pow(2)

assert isinstance(np.mean(a), float)
assert isinstance(np.mean(a.values), np.float32)
assert abs(1 - np.mean(a)/np.mean(a.values)) > 4e-4

Problem description

  1. pd.DataFrame.mean / pd.Series.mean / np.mean(pd.Series) returns a Python float instead of a numpy float. Since np.mean(pd.Series.values) does return a numpy float, I'm assuming for now that this should be fixed in pandas.
  2. If dtype == np.float32, calling mean on a pandas object gives a significantly different result from calling mean on the underlying numpy ndarray.
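The float32 drift in point 2 can be reproduced without pandas or numpy by emulating single-precision rounding with the standard library. This is an illustrative sketch of the mechanism (naive left-to-right accumulation in float32), not the pandas implementation:

```python
import struct

def f32(x):
    # round a Python float to the nearest IEEE-754 single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

n = 1_000_000
val = f32(0.1)

# naive running sum, rounding to float32 after every addition;
# once the accumulator is large relative to each addend, every
# addition loses low-order bits and the errors all have the same sign
acc = 0.0
for _ in range(n):
    acc = f32(acc + val)
naive_mean = f32(acc / n)

print(naive_mean)  # drifts to roughly 0.10096, a ~0.1% relative error
```

This matches the magnitude of the error reported below for the float32 pandas mean.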

Expected Output

The output of np.mean(a) should be the same as np.mean(a.values).

Additional tests

# both b and c ~1e-2
b = a.mean() # the pandas impl of mean
assert isinstance(b, float) # PYTHON float, not numpy float? Ergo implicit f64

c = a.values.mean() # the numpy impl of mean
assert isinstance(c, np.float32) # as expected

print('\nerrors between pandas mean and numpy mean')
print(f'relative error: {abs(1-b/c):.3e}') # ~ 5e-4
print(f'absolute error: {abs(b -c):.3e}') # ~ 5e-6

print(f'relative error after casting: {abs(1-np.float32(b)/c):.3e}') # ~ 5e-4
print(f'absolute error after casting: {abs(np.float32(b) -c):.3e}') # ~ 5e-6

d = a.sum() / len(a) 
assert isinstance(d, np.float64) # expected, because division. Note `sum` returns an np.float32

e = a.values.sum() / len(a)
assert isinstance(e, np.float64) # expected, because division

# these methods are equivalent
assert d==e

# and up to f32 precision equal to the numpy impl
assert d.astype(np.float32) == c

# the cherry on the cake
f = a.astype(np.float64).mean()
assert isinstance(f, float) # still not ideal, should be np.float64

g = a.astype(np.float64).values.mean()
print('\nrelative error between pandas f64 mean and numpy f64 mean')
print(f'relative error numpy f64/pandas f64: {abs(1-g/f):.3e}') # ~ 1e-14 -- 1e-16, not bad but I would have expected equality

print('\nerrors between pandas f64 mean and numpy/pandas f32 mean')
print(f'relative error pandas f32/pandas f64: {abs(1-b/f):.3e}') # ~ 5e-4
print(f'absolute error numpy f32/pandas f64: {abs(1-c/f):.3e}') # ~ 1e-7 -- 1e-9

# finally...
h = np.mean(a)
assert isinstance(h, float)
assert h == b

Output

errors between pandas mean and numpy mean
relative error: 5.210e-04
absolute error: 5.204e-06
relative error after casting: 5.210e-04
absolute error after casting: 5.204e-06

relative error between pandas f64 mean and numpy f64 mean
relative error numpy f64/pandas f64: 1.066e-14

errors between pandas f64 mean and numpy/pandas f32 mean
relative error pandas f32/pandas f64: 5.214e-04
absolute error numpy f32/pandas f64: 2.399e-07

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.51.2

sebasv added the Bug and Needs Triage labels on Aug 4, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 4, 2021
sebasv changed the title from "BUG: [float32 precision] pandas float32 mean is inconsistent with numpy" to "BUG: np.mean(pd.Series) != np.mean(pd.Series.values)" on Aug 4, 2021
sebasv commented Aug 4, 2021

I think I found it. It seems to be loss of numerical precision because of a bad summation routine.

a = pd.Series((np.ones(shape=(1_000_000,))/10).astype(np.float32))

man = np.float32(0)  # manual naive summation
for i in a.values:
    man += i
man /= np.float32(len(a))
assert isinstance(man, np.float32)

kahan = np.float32(0)
kahan_c = np.float32(0)
for i in a.values:
    temp1 = i - kahan_c
    temp2 = kahan + temp1
    kahan_c = (temp2 - kahan) - temp1
    kahan = temp2

kahan /= np.float32(len(a))

print(f'''
{a.mean()}
{man}

{kahan}

{a.sum()/len(a)}
{a.values.mean()}''') 

outputs

0.10095834732055664
0.10095834732055664

0.10000000149011612

0.1000000859375
0.10000008344650269

sebasv commented Aug 4, 2021

As far as I can see from Python call tracing, pd.Series.sum delegates to the C function np.ndarray.sum, while pd.Series.mean delegates to the C function bottleneck.reduce.nanmean. I haven't got C call tracing set up, but I'm working through the code to understand why np.nanmean isn't used.

jbrockmendel (Member) commented

but I'm working through the code to understand why np.nanmean isn't used.

the code for that is in core.nanops; we use bottleneck because it is generally faster


sebasv commented Aug 4, 2021

Yep, bottleneck nansum and nanmean do a naive summation. Wow those floating point errors compound, especially with single precision.

I believe this should be addressed in some way, the precision loss is dramatic and to be honest I was under the impression that compensated summation was the default. I'm filing an issue with bottleneck as well. What would be the preferred action for pandas? I can think of

  1. A warning message
  2. move to numpy.nanmean, at the very least if dtype==np.float32
  3. do nothing/wait for bottleneck's response (not my favorite).
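For context on option 2: numpy's better accuracy here comes from pairwise (divide-and-conquer) summation rather than compensation. Below is a pure-Python sketch of the idea, modeling float32 with struct; it is illustrative only, not numpy's actual blocked implementation:

```python
import struct

def f32(x):
    # round a Python float to the nearest IEEE-754 single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

def naive_sum(xs):
    # left-to-right accumulation: error grows roughly linearly in n
    s = 0.0
    for x in xs:
        s = f32(s + x)
    return s

def pairwise_sum(xs):
    # split-and-recurse keeps the partial sums of similar magnitude,
    # so rounding error grows like O(log n) instead of O(n)
    if len(xs) <= 8:
        return naive_sum(xs)
    mid = len(xs) // 2
    return f32(pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:]))

data = [f32(0.1)] * 1_000_000
print(naive_sum(data) / len(data))     # ~0.10096, large drift
print(pairwise_sum(data) / len(data))  # close to 0.1, near the exact mean
```

Kahan summation (as in the comment above) is more accurate still, but pairwise summation is what numpy ships because it costs essentially nothing extra per element.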


sebasv commented Aug 4, 2021

Existing issue for bottleneck:
pydata/bottleneck#379

mzeitlin11 (Member) commented

There's some other discussion for this precision vs performance tradeoff in #39622.


sebasv commented Aug 5, 2021

I should point out that this error compounds so dramatically that I ran into reproducibility issues with a published paper. The relative error in my data is 0.5%, which completely throws off my metrics.


jreback commented Aug 5, 2021

note that there has been much discussion about this in #25307 and many linked issues


sebasv commented Aug 5, 2021

Thank you for linking; I wasn't aware, and I am a bit surprised these issues were closed without a pandas-side solution. There the question was raised of what pandas should do about a problem in a third-party library. Shouldn't the answer be "do not use that library"? Its routines produce arbitrarily large errors; I don't see how that can be defended. If I choose to use f32 precision on values around 1, I expect answers to be precise to roughly 1e-6, and so, I imagine, do most other pandas users.


jreback commented Aug 5, 2021

@sebasv we have been using bottleneck for years, and float32, while common now, was uncommon a few years back.

So "just don't use the library" is not possible, though you can simply uninstall it, as it's not a required dependency.

That said, I believe we actually don't use bottleneck for sum; it would be easy to add mean to this list.

These choices are never easy: if we simply drop the library and use numpy for this, then people will complain about increased memory usage. And, as always, pandas is all volunteer.


sebasv commented Aug 5, 2021

If I draft up a PR to add mean to the not-bottleneck list, will it be considered for merging?

For future readers with this problem, either uninstalling bottleneck or pd.set_option('compute.use_bottleneck', False) works. If you use conda and conda remove bottleneck wants to uninstall pandas as a dependency, try using pip uninstall bottleneck.
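As a copy-pastable version of the option-based workaround mentioned above (compute.use_bottleneck is the documented pandas option name):

```python
import pandas as pd

# route reductions through pandas' numpy-based nanops instead of bottleneck
pd.set_option('compute.use_bottleneck', False)
```

With this set, Series.mean on float32 data should agree with np.mean on the underlying ndarray up to numpy's own summation accuracy.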


krachyon commented Aug 6, 2021

Just to add my two cents here, as I've hit this issue from astropy: opportunistically choosing an implementation with different behavior based on which library happens to be installed is a pretty unpleasant trap. Especially since bottleneck can even fall back to numpy under certain conditions (see astropy/astropy#11492 for details).
The choice is totally hidden from the user, and most likely they won't even think of looking for an issue like this before having grown a bunch of grey hairs over inexplicably inconsistent results, which, judging by the issues linked here, has happened to at least three people independently now. The principle of least surprise would definitely be to match the behavior of numpy.

To be clear, I see this mostly as an issue with bottleneck, since it advertises itself as a drop-in replacement for numpy routines without an explicit and obvious mention of this discrepancy. So, IMHO, using bottleneck should be an explicit opt-in for users who really need that last bit of performance and actively decide to sacrifice accuracy for it.

mroeschke added the Dependencies, Numeric Operations, and Upstream issue labels and removed the Needs Triage label on Aug 21, 2021

JMBurley commented Jul 13, 2022

@sebasv I think you should make that PR. The logic for not using bottleneck for sum applies identically to mean.

Otherwise, to help with tracking, pandas should keep an eye on bottleneck's mitigation of this issue in pydata/bottleneck#379.


sebasv commented Jul 14, 2022

@JMBurley I'll commit this morning
