BUG: np.mean(pd.Series) != np.mean(pd.Series.values) #42878
Comments
I think I found it. It seems to be a loss of numerical precision caused by a naive summation routine:

```python
import numpy as np
import pandas as pd

# one million float32 values equal to 0.1
a = pd.Series((np.ones(shape=(1_000_000,)) / 10).astype(np.float32))

# naive left-to-right summation in float32
man = np.float32(0)
for i in a.values:
    man += i
man /= np.float32(len(a))
assert isinstance(man, np.float32)

# Kahan (compensated) summation in float32
kahan = np.float32(0)
kahan_c = np.float32(0)
for i in a.values:
    temp1 = i - kahan_c
    temp2 = kahan + temp1
    kahan_c = (temp2 - kahan) - temp1
    kahan = temp2
kahan /= np.float32(len(a))

print(f'''
{a.mean()}
{man}
{kahan}
{a.sum() / len(a)}
{a.values.mean()}''')
```

outputs
As far as I can see from Python call tracing,
The code for that is in core.nanops; we use bottleneck because it is generally faster.
Yep, bottleneck.
I believe this should be addressed in some way: the precision loss is dramatic, and to be honest I was under the impression that compensated summation was the default. I'm filing an issue with bottleneck as well. What would be the preferred action for pandas? I can think of
existing issue for bottleneck:
There's some other discussion of this precision-vs-performance tradeoff in #39622.
I should point out that this error compounds so dramatically that I ran into reproducibility issues with a published paper. The relative error in my data is 0.5%, which completely throws off my metrics.
Note that there has been much discussion about this in #25307 and many linked issues.
Thank you for linking, I wasn't aware, and I am a bit surprised these issues got closed without a pandas-side solution. There, the question was raised of what pandas should do about a problem in a third-party library. Shouldn't the answer be "do not use that library"? Their routines produce arbitrarily large errors, and I don't see how that can be defended. If I choose to use float32 precision on values around 1, I expect answers to be precise to ~1e-6, and so, I imagine, do most other pandas users.
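For reference, the ~1e-6 figure quoted above is of the same order as float32 machine epsilon, which can be checked directly:

```python
import numpy as np

# relative spacing of float32 values near 1.0, roughly 1.19e-07
print(np.finfo(np.float32).eps)
```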
@sebasv we have been using bottleneck for years, and float32, while common now, was uncommon a few years back, so "just don't use the library" is not possible - though you can simply uninstall it, as it's not a required dependency. That said, I believe we actually don't use bottleneck for sum; it would be easy to add mean to this list. These choices are never easy - if we simply don't use the library and use numpy for this, then people will complain about increased memory usage.
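For context, the approach being described is a per-function skip list: reductions on the list always fall back to numpy, everything else may be routed to bottleneck when it is installed. The sketch below only illustrates that idea; the names are hypothetical and not the actual pandas.core.nanops internals.

```python
import numpy as np

try:
    import bottleneck as bn
    _HAS_BOTTLENECK = True
except ImportError:
    _HAS_BOTTLENECK = False

# Hypothetical skip list: reductions never handed to bottleneck because of
# precision or NaN-handling concerns. Adding "nanmean" is the kind of change
# proposed in this thread.
_SKIP_BOTTLENECK = {"nansum", "nanmean"}


def reduce_with_fallback(values: np.ndarray, name: str):
    """Route a reduction to bottleneck when allowed, otherwise to numpy."""
    if _HAS_BOTTLENECK and name not in _SKIP_BOTTLENECK:
        return getattr(bn, name)(values)
    return getattr(np, name)(values)
```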
If I draft up a PR to add mean to the not-bottleneck list, will it be considered for merging? For future readers with this problem, either uninstalling bottleneck or
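The suggestion above is cut off in this extract; assuming the alternative being hinted at is pandas' compute.use_bottleneck option, a minimal sketch of that workaround would be:

```python
import numpy as np
import pandas as pd

# Keep bottleneck installed, but tell pandas not to use it for reductions.
pd.set_option("compute.use_bottleneck", False)

a = pd.Series(np.full(1_000_000, 0.1, dtype=np.float32))
print(a.mean())  # now computed without bottleneck
```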
Just to add my two cents here, as I've hit this issue from astropy: I really feel that opportunistically choosing an implementation with different behavior based on which library I happen to have installed is a pretty unpleasant trap. Especially when bottleneck can even fall back to numpy under certain conditions (see astropy/astropy#11492 for details). To be clear, I would see this mostly as an issue with bottleneck, as it advertises itself as a drop-in replacement for numpy routines without explicit and obvious mention of this discrepancy. So, IMHO, the choice of using bottleneck should be an explicit opt-in for users who really need that last bit of performance and actively decide to sacrifice accuracy for it.
@sebasv I think you should make that PR. The logic to not use bottleneck for sum is identical to not using it for mean. Otherwise, to help with tracking, pandas should keep an eye on bottleneck mitigating this issue in pydata/bottleneck#379.
@JMBurley I'll commit this morning |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
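The original copy-pastable sample is not preserved in this extract; a minimal reproduction consistent with the discussion above (run with bottleneck installed) would look roughly like this:

```python
import numpy as np
import pandas as pd

a = pd.Series((np.ones(shape=(1_000_000,)) / 10).astype(np.float32))

# With bottleneck installed, these two can differ noticeably for float32 data.
print(np.mean(a))         # pandas code path (may be routed through bottleneck)
print(np.mean(a.values))  # plain numpy, which uses pairwise summation
```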
Problem description
pd.DataFrame.mean / pd.Series.mean / np.mean(pd.Series) outputs a Python float instead of a numpy float. Since np.mean(pd.Series.values) does return an np float, I'm assuming for now that this should be fixed in pandas.

If the data has dtype==np.float32, then calling mean on a pandas object gives a significantly different result vs calling mean on the underlying numpy ndarray.

Expected Output
The output of np.mean(a) should be the same as np.mean(a.values).

additional tests
output
Output of pd.show_versions()
INSTALLED VERSIONS
commit : c7f7443
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.51.2