
Cannot pickle '_thread.lock' object exception after DataArray transpose and copy operations from netCDF file. #8442

Closed
sharkinsspatial opened this issue Nov 11, 2023 · 22 comments · Fixed by #8571

Comments

@sharkinsspatial

What is your issue?

I hit this issue while using rioxarray with a series of operations similar to those noted in this issue corteva/rioxarray#614. After looking through the rioxarray codebase a bit I was able to reproduce the issue with pure xarray operations.

If the Dataset is opened with the default lock=True settings, transposing a DataArray's coordinates and then copying the DataArray results in a cannot pickle '_thread.lock' object exception.

If the Dataset is opened with lock=False, no error is thrown.

This sample notebook reproduces the error.

This might be user error on my part, but it would be great to have some clarification on why lock=False is necessary here, as my understanding was that this should only be necessary when using parallel write operations.

@sharkinsspatial added the "needs triage" label on Nov 11, 2023

welcome bot commented Nov 11, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@max-sixty added the "needs mcve" label (https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) and removed the "needs triage" label on Dec 5, 2023
@sharkinsspatial
Author

sharkinsspatial commented Dec 6, 2023

Here is a locally reproducible MCVE.

import xarray as xr
import numpy as np

file_path = "test.nc"

ds = xr.Dataset(
    {
        'latitude': np.arange(10),
        'longitude': np.arange(10),
        'precip': (['latitude', 'longitude'], np.arange(100).reshape(10, 10))
    }
)

ds.to_netcdf(file_path, engine="h5netcdf")

ds = xr.open_dataset(file_path, engine="h5netcdf", decode_coords=True, decode_times=True)
da = ds["precip"]
da = da.transpose("longitude", "latitude", missing_dims="ignore")
da = da.copy()

Note that if xr.open_dataset is called with lock=False, the _io.BufferedReader error is not thrown. 👍
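For reference, a minimal sketch of the lock=False variant of the snippet above (lock is forwarded to the backend):

ds = xr.open_dataset(file_path, engine="h5netcdf", lock=False)
da = ds["precip"].transpose("longitude", "latitude", missing_dims="ignore")
da = da.copy()  # no exception with lock=False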

@max-sixty
Collaborator

max-sixty commented Dec 6, 2023

Hmm, I don't get an error there. Can you post your dependencies? (Instructions in the bug report template)

Edit: though it seems to rely on the file being there from #8443...

@sharkinsspatial
Author

sharkinsspatial commented Dec 6, 2023

Apologies, I had not written the netCDF file out in the MCVE 🤦‍♂️; the example is updated now. I was able to reproduce the error in the environment below.

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.0 (default, Apr 14 2021, 14:07:04)
[Clang 12.0.0 (clang-1200.0.32.29)]
python-bits: 64
OS: Darwin
OS-release: 19.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.11.0
pandas: 2.1.3
numpy: 1.26.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.12.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

@dcherian
Contributor

dcherian commented Dec 6, 2023

Mine succeeds too, with libhdf5: 1.14.2; otherwise my versions of xarray, h5netcdf, and h5py match yours.

PS: the code I ran includes the to_netcdf call.

@sharkinsspatial
Author

🤔 I upgraded to libhdf5: 1.14.3 and was still able to reproduce. To isolate any potential h5netcdf problems, I also tried the following with the default netcdf4 engine and hit the same exception.

import xarray as xr
import numpy as np

file_path = "test.nc"

ds = xr.Dataset(
    {
        'latitude': np.arange(10),
        'longitude': np.arange(10),
        'precip': (['latitude', 'longitude'], np.arange(100).reshape(10, 10))
    }
)

ds.to_netcdf(file_path)

ds = xr.open_dataset(file_path, decode_coords=True, decode_times=True)
da = ds["precip"]
da = da.transpose("longitude", "latitude", missing_dims="ignore")
da = da.copy()

I'm going to ask a few colleagues to try to replicate this to see if it's something peculiar to my environment.

@sharkinsspatial
Author

sharkinsspatial commented Dec 7, 2023

A colleague was also able to reproduce the exception with the ☝️ netcdf4 engine code in the following environment.

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.18 (main, Nov 2 2023, 16:51:22)
[Clang 14.0.3 (clang-1403.0.22.14.1)]
python-bits: 64
OS: Darwin
OS-release: 23.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0

xarray: 2023.10.1
pandas: 2.1.3
numpy: 1.26.2
scipy: 1.9.1
netCDF4: 1.6.1
pydap: None
h5netcdf: None
h5py: 3.8.0
Nio: None
zarr: 2.12.0
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.8.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.10.0
cupy: None
pint: None
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 69.0.2
pip: 23.3.1
conda: None
pytest: 6.2.5
mypy: 0.910
IPython: 7.34.0
sphinx: 6.2.1

I'm unsure what the differences in environments could be 🤔 .

@max-sixty
Collaborator

That's a puzzle... Can we reproduce it in a binder / in a test?

@zmoon
Contributor

zmoon commented Dec 13, 2023

No idea if it has the same underlying cause (I'm not transposing, but I am copying), but I do have a situation that used to work but now [1] gives this same cannot pickle '_thread.lock' object error [2]. I'll have to see if I can make it into a minimal example. I tried downgrading some things in my environment, to no avail.

Edit: here's a little example [3] experimenting with joblib.dump to see when the error is raised.

import xarray as xr
from joblib import dump

ds = xr.tutorial.load_dataset("air_temperature").isel(time=slice(4))
ds.to_netcdf("ds.nc", engine="netcdf4")
dump(ds, "ds.joblib")  # 0. Succeeds
ds.close()

# 1. Try to pickle the whole Dataset
ds = xr.open_dataset("ds.nc")
dump(ds, "ds.joblib")  # TypeError: cannot pickle '_thread.lock' object

# 2. Try to pickle a DataArray
ds = xr.open_dataset("ds.nc")
dump(ds.air, "ds.air.joblib")  # TypeError: cannot pickle '_thread.lock' object

# 3. Somehow adding a new variable makes it okay to pickle `ds.air` (and `ds` if `.copy()` applied)
ds = xr.open_dataset("ds.nc")
ds["b"] = xr.zeros_like(ds.air)
dump(ds.air, "ds.air.joblib")  # Succeeds
dump(ds, "ds.joblib")  # But this still fails
dump(ds.copy(), "ds.joblib")  # Succeeds
Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.133.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.1
libnetcdf: 4.9.2

xarray: 2023.12.0
pandas: 1.5.3
numpy: 1.26.2
scipy: 1.11.4
netCDF4: 1.6.4
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.7.3
cartopy: 0.22.0
seaborn: 0.11.0
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: 1.7.1
IPython: 8.18.1
sphinx: 5.3.0

Also tried in an env with HDF5 1.14.3, it didn't help.

Footnotes

  [1] First noticed a month or two ago, I think.

  [2] Based on what happened later in this thread, maybe my old env where it was working had Dask available, for its SerializableLock, unlike this new env where I was getting the error.

  [3] Not super related to my real case, except that my case involves joblib.

@zmoon
Contributor

zmoon commented Dec 13, 2023

I was able to reproduce the error from the OP's example above in a fresh env. Similar to one of my experiments, the error is averted for me if you add a new variable to the Dataset (e.g. ds["asdf"] = xr.zeros_like(ds.precip)) before the transpose line (sketched below).
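Sketched against the OP's MCVE (the variable name asdf is just a placeholder, as noted above):

ds = xr.open_dataset(file_path, decode_coords=True, decode_times=True)
ds["asdf"] = xr.zeros_like(ds.precip)  # adding a variable here was observed to avert the error
da = ds["precip"].transpose("longitude", "latitude", missing_dims="ignore")
da = da.copy()  # no longer raises in this case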

@kmuehlbauer
Contributor

kmuehlbauer commented Dec 13, 2023

@zmoon Thanks for this MCVE! I can't reproduce the error, though. Also the MCVE in #8442 (comment) works nicely (details below).

Does it still fail if environments are created from scratch or on other systems? It looks like Linux itself is not affected, only macOS and WSL?

Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-22-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.12.0
pandas: 2.1.3
numpy: 1.26.2
scipy: 1.11.4
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.11.0
distributed: 2023.11.0
matplotlib: 3.8.2
cartopy: 0.22.0
seaborn: None
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: None
mypy: None
IPython: 8.18.1
sphinx: None



@kmuehlbauer
Contributor

OK, here we go: I've taken dask out of the loop in a fresh env and can now reproduce both MCVEs.

Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.0 | packaged by conda-forge | (main, Oct  3 2023, 08:43:22) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.21-150500.55.19-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2023.12.0
pandas: 2.1.4
numpy: 1.26.2
scipy: None
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: None
mypy: None
IPython: 8.18.1
sphinx: None

@zmoon
Contributor

zmoon commented Dec 13, 2023

@kmuehlbauer I experienced the error on Windows as well as WSL.

I tried a fresh env on Linux and still got the error 🤷

Versions
mamba create -n test-lock python=3.11 xarray pooch netcdf4 h5netcdf joblib
INSTALLED VERSIONS
------------------
commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.27.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2023.12.0
pandas: 2.1.4
numpy: 1.26.2
scipy: None
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.10.0
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

Edit: From the above, the OP also didn't have Dask. After adding dask-core to my env, no more error.

@kmuehlbauer
Contributor

There has been some refactoring lately involving dask and other ChunkManagers. Not sure if this has anything to do with it, but maybe @TomNicholas has more insight here.

@TomNicholas
Member

TomNicholas commented Dec 13, 2023

I don't really see why this should have anything to do with it... I guess it's not impossible that somehow some dask lock argument is now getting lost, but I suggest that if we can now reproduce the error someone should do a git-bisect to find out which commit caused the regression.

EDIT: But you're saying you can reproduce this without dask anyway, @kmuehlbauer?

@kmuehlbauer
Contributor

Yes, thanks @TomNicholas for looking into this. Will try to bisect this.

@kmuehlbauer
Contributor

@zmoon @sharkinsspatial Did this ever work for you? I'm having a hard time finding a working commit; I've checked several versions back to 0.17.0 without success. It would also be good to know the other involved dependencies (hdf5, netcdf-c, netcdf4-python, h5py, pandas) to recreate a working environment.

@zmoon
Contributor

zmoon commented Dec 13, 2023

@kmuehlbauer I don't have that environment anymore, but I suspect I had dask installed in it and that's why it was working.

@kmuehlbauer
Contributor

TL;DR:

The current default of xr.open_dataset (netcdf4/h5netcdf) uses lazy loading, which falls back to threading.Lock as the default locking mechanism if dask is not available. That object cannot be pickled, and after some computations (here .transpose) it cannot be (deep-)copied either. The only ways around this are to explicitly pass lock=False when opening files, or to call .load() or .compute() before pickling or copying.
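A minimal sketch of those workarounds, reusing the test.nc file from the MCVE above:

import pickle
import xarray as xr

# Workaround 1: disable the per-file lock entirely
ds = xr.open_dataset("test.nc", lock=False)
da = ds["precip"].transpose("longitude", "latitude").copy()  # ok

# Workaround 2: load the data into memory before copying/pickling
ds = xr.open_dataset("test.nc")
da = ds["precip"].transpose("longitude", "latitude").load()
da2 = da.copy()    # ok once loaded
pickle.dumps(da)   # ok once loaded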

Inspection:

Using the MCVE given here #8442 (comment) I checked the types of the underlying array and how this behaves with and without transposing (a rough snippet for this check follows the list):

  • cache=True in open_dataset (default)

    • no transpose
      • before copy: <class 'xarray.core.indexing.MemoryCachedArray'>
      • after copy: <class 'xarray.core.indexing.MemoryCachedArray'>
      • trying to pickle raises TypeError: cannot pickle '_thread.lock' object in pickle
    • with transpose
      • before transpose: <class 'xarray.core.indexing.MemoryCachedArray'>
      • after transpose: <class 'xarray.core.indexing.LazilyVectorizedIndexedArray'>
      • trying to copy raises: TypeError: cannot pickle '_thread.lock' object in deepcopy
  • cache=False in open_dataset

    • no transpose
      • before copy: <class 'xarray.core.indexing.CopyOnWriteArray'>
      • after copy: <class 'xarray.core.indexing.CopyOnWriteArray'>
      • trying to pickle raises TypeError: cannot pickle '_thread.lock' object in pickle
    • with transpose
      • before transpose: <class 'xarray.core.indexing.CopyOnWriteArray'>
      • after transpose: <class 'xarray.core.indexing.LazilyVectorizedIndexedArray'>
      • trying to copy raises: TypeError: cannot pickle '_thread.lock' object in deepcopy
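
The type checks above can be reproduced with something along these lines (rough sketch; Variable._data is xarray-internal, not public API):

ds = xr.open_dataset("test.nc", engine="h5netcdf", cache=True)
da = ds["precip"]
print(type(da.variable._data))    # MemoryCachedArray with cache=True
da_t = da.transpose("longitude", "latitude")
print(type(da_t.variable._data))  # LazilyVectorizedIndexedArray
da_t.copy(deep=True)              # raises TypeError: cannot pickle '_thread.lock' object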

When reading with the netcdf4 and h5netcdf backends, the data is wrapped in xarray's lazy classes; see https://docs.xarray.dev/en/stable/user-guide/io.html#netcdf:

Data is always loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation.

and further:

Xarray’s lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intense operations, it is often a good idea to load a Dataset (or DataArray) entirely into memory by invoking the Dataset.load() method.

There is also a mention for Pickle:

https://docs.xarray.dev/en/stable/user-guide/io.html#pickle

When pickling an object opened from a NetCDF file, the pickle file will contain a reference to the file on disk. If you want to store the actual array values, load it into memory first with Dataset.load() or Dataset.compute().

What to do?

The pickle issue might not be the big problem, as the user is advised to load/compute beforehand. But the copy issue should be resolved somehow. Unfortunately I do not have an immediate solution. @pydata/xarray, any ideas?
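For illustration, the core limitation is plain Python and independent of xarray: deepcopy falls back to pickling for objects it does not know how to copy, and a threading.Lock cannot be pickled.

import copy
import threading

lock = threading.Lock()
copy.deepcopy(lock)  # TypeError: cannot pickle '_thread.lock' object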

@max-sixty
Collaborator

(brief message to say thanks a lot @kmuehlbauer for the excellent summary)

@max-sixty removed the "needs mcve" label (https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) on Dec 15, 2023
@shoyer
Member

shoyer commented Dec 16, 2023

I believe the issue is these two default locks for HDF5 and NetCDF-C:

HDF5_LOCK = SerializableLock()

Probably the easiest way to handle this is to fork the code for SerializableLock from dask. It isn't very complicated:
https://github.com/dask/dask/blob/6f2100847e2042d459534294531e8884bef13a99/dask/utils.py#L1160
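
For context, the idea of SerializableLock is to keep the unpicklable threading.Lock in a class-level registry keyed by a token and to pickle only the token; a simplified sketch, not the actual dask code (which keeps the registry in a WeakValueDictionary):

import uuid
from threading import Lock

class SerializableLock:
    # Simplified sketch of dask.utils.SerializableLock.
    _locks = {}  # dask uses a WeakValueDictionary here

    def __init__(self, token=None):
        self.token = token or str(uuid.uuid4())
        # Re-use the same underlying lock for a given token within this process.
        self.lock = SerializableLock._locks.setdefault(self.token, Lock())

    def acquire(self, *args, **kwargs):
        return self.lock.acquire(*args, **kwargs)

    def release(self):
        self.lock.release()

    def __enter__(self):
        self.lock.acquire()

    def __exit__(self, *args):
        self.lock.release()

    # Only the token is pickled; the lock itself is recreated on unpickle.
    def __getstate__(self):
        return self.token

    def __setstate__(self, token):
        self.__init__(token)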

@kmuehlbauer
Contributor

Thanks @shoyer!
