BUG: np.mean(pd.Series) != np.mean(pd.Series.values) #42878

Closed
3 tasks done
sebasv opened this issue Aug 4, 2021 · 14 comments · Fixed by #47716
Labels
Bug · Dependencies (Required and optional dependencies) · Numeric Operations (Arithmetic, Comparison, and Logical operations) · Upstream issue (Issue related to pandas dependency)

Comments


sebasv commented Aug 4, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

a = pd.Series(np.random.normal(scale=0.1, size=(1_000_000,)).astype(np.float32)).pow(2)

assert isinstance(np.mean(a), float)
assert isinstance(np.mean(a.values), np.float32)
assert abs(1 - np.mean(a)/np.mean(a.values)) > 4e-4

Problem description

  1. pd.DataFrame.mean / pd.Series.mean / np.mean(pd.Series) returns a Python float instead of a numpy float. Since np.mean(pd.Series.values) does return a numpy float, I'm assuming for now that this should be fixed in pandas.
  2. If dtype == np.float32, calling mean on a pandas object gives a significantly different result from calling mean on the underlying numpy ndarray.
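The float32 drift in point 2 can be reproduced without pandas or numpy by emulating single-precision rounding with the standard library. This is an illustrative sketch of the mechanism (naive left-to-right accumulation in float32), not the pandas implementation:

```python
import struct

def f32(x):
    # round a Python float to the nearest IEEE-754 single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

n = 1_000_000
val = f32(0.1)

# naive running sum, rounding to float32 after every addition;
# once the accumulator is large relative to each addend, every
# addition loses low-order bits and the errors all have the same sign
acc = 0.0
for _ in range(n):
    acc = f32(acc + val)
naive_mean = f32(acc / n)

print(naive_mean)  # drifts to roughly 0.10096, a ~0.1% relative error
```

This matches the magnitude of the error reported below for the float32 pandas mean.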

Expected Output

The output of np.mean(a) should be the same as np.mean(a.values).

Additional tests

# both b and c ~1e-2
b = a.mean() # the pandas impl of mean
assert isinstance(b, float) # PYTHON float, not numpy float? Ergo implicit f64

c = a.values.mean() # the numpy impl of mean
assert isinstance(c, np.float32) # as expected

print('\nerrors between pandas mean and numpy mean')
print(f'relative error: {abs(1-b/c):.3e}') # ~ 5e-4
print(f'absolute error: {abs(b -c):.3e}') # ~ 5e-6

print(f'relative error after casting: {abs(1-np.float32(b)/c):.3e}') # ~ 5e-4
print(f'absolute error after casting: {abs(np.float32(b) -c):.3e}') # ~ 5e-6

d = a.sum() / len(a) 
assert isinstance(d, np.float64) # expected, because division. Note `sum` returns an np.float32

e = a.values.sum() / len(a)
assert isinstance(e, np.float64) # expected, because division

# these methods are equivalent
assert d==e

# and up to f32 precision equal to the numpy impl
assert d.astype(np.float32) == c

# the cherry on the cake
f = a.astype(np.float64).mean()
assert isinstance(f, float) # still not ideal, should be np.float64

g = a.astype(np.float64).values.mean()
print('\nrelative error between pandas f64 mean and numpy f64 mean')
print(f'relative error numpy f64/pandas f64: {abs(1-g/f):.3e}') # ~ 1e-14 -- 1e-16, not bad but I would have expected equality

print('\nerrors between pandas f64 mean and numpy/pandas f32 mean')
print(f'relative error pandas f32/pandas f64: {abs(1-b/f):.3e}') # ~ 5e-4
print(f'absolute error numpy f32/pandas f64: {abs(1-c/f):.3e}') # ~ 1e-7 -- 1e-9

# finally...
h = np.mean(a)
assert isinstance(h, float)
assert h == b

Output

errors between pandas mean and numpy mean
relative error: 5.210e-04
absolute error: 5.204e-06
relative error after casting: 5.210e-04
absolute error after casting: 5.204e-06

relative error between pandas f64 mean and numpy f64 mean
relative error numpy f64/pandas f64: 1.066e-14

errors between pandas f64 mean and numpy/pandas f32 mean
relative error pandas f32/pandas f64: 5.214e-04
absolute error numpy f32/pandas f64: 2.399e-07

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.51.2

sebasv added the Bug and Needs Triage labels on Aug 4, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 4, 2021
sebasv changed the title from "BUG: [float32 precision] pandas float32 mean is inconsistent with numpy" to "BUG: np.mean(pd.Series) != np.mean(pd.Series.values)" on Aug 4, 2021
sebasv commented Aug 4, 2021

I think I found it. It seems to be loss of numerical precision because of a bad summation routine.

a = pd.Series((np.ones(shape=(1_000_000,))/10).astype(np.float32))

man = np.float32(0)  # manual naive summation
for i in a.values:
    man += i
man /= np.float32(len(a))
assert isinstance(man, np.float32)

kahan = np.float32(0)
kahan_c = np.float32(0)
for i in a.values:
    temp1 = i - kahan_c
    temp2 = kahan + temp1
    kahan_c = (temp2 - kahan) - temp1
    kahan = temp2

kahan /= np.float32(len(a))

print(f'''
{a.mean()}
{man}

{kahan}

{a.sum()/len(a)}
{a.values.mean()}''') 

outputs

0.10095834732055664
0.10095834732055664

0.10000000149011612

0.1000000859375
0.10000008344650269

sebasv commented Aug 4, 2021

As far as I can see from Python call tracing, pd.Series.sum delegates to the C function np.ndarray.sum, while pd.Series.mean delegates to the C function bottleneck.reduce.nanmean. I haven't got C call tracing set up, but I'm working through the code to understand why np.nanmean isn't used.

jbrockmendel (Member) commented

but I'm working through the code to understand why np.nanmean isn't used.

the code for that is in core.nanops; we use bottleneck because it is generally faster


sebasv commented Aug 4, 2021

Yep, bottleneck nansum and nanmean do a naive summation. Wow those floating point errors compound, especially with single precision.

I believe this should be addressed in some way, the precision loss is dramatic and to be honest I was under the impression that compensated summation was the default. I'm filing an issue with bottleneck as well. What would be the preferred action for pandas? I can think of

  1. A warning message
  2. move to numpy.nanmean, at the very least if dtype==np.float32
  3. do nothing/wait for bottleneck's response (not my favorite).
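For context on option 2: numpy's better accuracy here comes from pairwise (divide-and-conquer) summation rather than compensation. Below is a pure-Python sketch of the idea, modeling float32 with struct; it is illustrative only, not numpy's actual blocked implementation:

```python
import struct

def f32(x):
    # round a Python float to the nearest IEEE-754 single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

def naive_sum(xs):
    # left-to-right accumulation: error grows roughly linearly in n
    s = 0.0
    for x in xs:
        s = f32(s + x)
    return s

def pairwise_sum(xs):
    # split-and-recurse keeps the partial sums of similar magnitude,
    # so rounding error grows like O(log n) instead of O(n)
    if len(xs) <= 8:
        return naive_sum(xs)
    mid = len(xs) // 2
    return f32(pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:]))

data = [f32(0.1)] * 1_000_000
print(naive_sum(data) / len(data))     # ~0.10096, large drift
print(pairwise_sum(data) / len(data))  # close to 0.1, near the exact mean
```

Kahan summation (as in the comment above) is more accurate still, but pairwise summation is what numpy ships because it costs essentially nothing extra per element.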


sebasv commented Aug 4, 2021

Existing issue for bottleneck:
pydata/bottleneck#379

mzeitlin11 (Member) commented

There's some other discussion for this precision vs performance tradeoff in #39622.


sebasv commented Aug 5, 2021

I should point out that this error compounds so dramatically that I ran into reproducibility issues with a published paper. The relative error in my data is 0.5%, which completely throws off my metrics.


jreback commented Aug 5, 2021

note that there has been much discussion about this in #25307 and many linked issues


sebasv commented Aug 5, 2021

Thank you for linking; I wasn't aware, and I am a bit surprised these issues were closed without a pandas-side solution. There the question was raised of what pandas should do about a problem in a third-party library. Shouldn't the answer be "do not use that library"? Its routines produce arbitrarily large errors; I don't see how that can be defended. If I choose to use f32 precision on values around 1, I expect answers to be precise to roughly 1e-6, and so, I imagine, do most other pandas users.


jreback commented Aug 5, 2021

@sebasv we have been using bottleneck for years, and float32, while common now, was uncommon a few years back.

So "just don't use the library" is not possible, though you can simply uninstall it, as it's not a required dependency.

That said, I believe we actually don't use bottleneck for sum; it would be easy to add mean to this list.

These choices are never easy: if we simply drop the library and use numpy for this, then people will complain about increased memory usage. And, as always, pandas is all volunteer.


sebasv commented Aug 5, 2021

If I draft up a PR to add mean to the not-bottleneck list, will it be considered for merging?

For future readers with this problem, either uninstalling bottleneck or pd.set_option('compute.use_bottleneck', False) works. If you use conda and conda remove bottleneck wants to uninstall pandas as a dependency, try using pip uninstall bottleneck.
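As a copy-pastable version of the option-based workaround mentioned above (compute.use_bottleneck is the documented pandas option name):

```python
import pandas as pd

# route reductions through pandas' numpy-based nanops instead of bottleneck
pd.set_option('compute.use_bottleneck', False)
```

With this set, Series.mean on float32 data should agree with np.mean on the underlying ndarray up to numpy's own summation accuracy.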


krachyon commented Aug 6, 2021

Just to add my two cents here, as I've hit this issue from astropy: opportunistically choosing an implementation with different behavior based on which library happens to be installed is a pretty unpleasant trap. Especially since bottleneck can even fall back to numpy under certain conditions (see astropy/astropy#11492 for details).
The choice is totally hidden from the user, and most likely they won't even think of looking for an issue like this before having grown a bunch of grey hairs over inexplicably inconsistent results, which, judging by the issues linked here, has happened to at least three people independently now. The principle of least surprise would definitely be to match the behavior of numpy.

To be clear, I see this mostly as an issue with bottleneck, since it advertises itself as a drop-in replacement for numpy routines without an explicit and obvious mention of this discrepancy. So, IMHO, using bottleneck should be an explicit opt-in for users who really need that last bit of performance and actively decide to sacrifice accuracy for it.

mroeschke added the Dependencies, Numeric Operations, and Upstream issue labels and removed the Needs Triage label on Aug 21, 2021

JMBurley commented Jul 13, 2022

@sebasv I think you should make that PR. The logic for not using bottleneck for sum applies identically to mean.

Otherwise, to help with tracking, pandas should keep an eye on bottleneck's mitigation of this issue in pydata/bottleneck#379.


sebasv commented Jul 14, 2022

@JMBurley I'll commit this morning
