No support for mixed column type #726
The check we use for converting to categorical is whether there are repeated values in a string column. This wouldn't quite trigger in your case. I think the issue here is that you have a mixed type column in …
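For concreteness, a minimal sketch of the kind of heuristic described above (illustrative only, not anndata's actual implementation):

```python
import pandas as pd

def should_convert_to_categorical(col: pd.Series) -> bool:
    """Sketch: convert only string columns that contain repeated values."""
    values = col.dropna()
    if not values.map(lambda v: isinstance(v, str)).all():
        # A mixed-type column has non-string entries, so it fails this check
        # and is never converted -- roughly why it "wouldn't quite trigger".
        return False
    return values.nunique() < len(values)  # repeated values present
```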
Yeah, I'm willing to accept that there isn't a single correct way of doing this, but at the end of the day I'm getting burned pretty frequently by long-running code that fails in the middle because the file can't be written. I'd much prefer some sort of conversion to occur, especially if it's consistent with how the conversion was performed previously. Some code we're running seems to return dtype: object series that have only numeric values (something like the result of …). Would more aggressively converting to categorical, e.g. for mixed types, be another option?
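For reference, an assumed example of the kind of column being described: an object-dtype series that holds only numeric values, which pandas itself can convert back:

```python
import pandas as pd

# An object-dtype series holding only numeric values, as some
# pandas operations on heterogeneous frames can produce.
s = pd.Series([1.0, 2.0, 3.0], dtype=object)
print(s.dtype)                  # object
print(s.infer_objects().dtype)  # float64
```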
I think a "proactively convert all contents to something we can write" method could be reasonable. Falling back to pickling could also be useful for your long running tasks. That way we won't break anything by converting it wrong. But also, do you know why your getting object valued columns? |
Yeah, I think this would be great.
Good suggestion!
Obviously this is the short-term solution, but these issues keep cropping up, so it would be ideal to implement the convert-all-contents-to-something-we-can-write method. If you can give some pointers on the general approach, i.e. where to look, I'm happy to put together a PR.
The more I think about this, the less I like this feature. If you write a file, but reading it back gives you something different, I don't think the writing was very useful. For cases where there's something we don't support, I'd think interoperability gives way to just getting the data stored.

An alternative here: maybe we can skip aggressive coercion, write all we can, and just pickle the elements we can't write? This idea has been floating around for a bit. Comments from some old code:

```python
File ~/anaconda3/envs/scvelo2/lib/python3.8/site-packages/anndata/_io/h5ad.py:148, in write_not_implemented(f, key, value, dataset_kwargs)
    143 @report_write_key_on_error
    144 def write_not_implemented(f, key, value, dataset_kwargs=MappingProxyType({})):
    145     # If it’s not an array, try and make it an array. If that fails, pickle it.
    146     # Maybe rethink that, maybe this should just pickle,
    147     # and have explicit implementations for everything else
--> 148     raise NotImplementedError(
    149         f"Failed to write value for {key}, "
    150         f"since a writer for type {type(value)} has not been implemented yet."
    151     )
```

But if the elements breaking your write methods are like …, I think this would basically look like walking the tree of the object, and calling something like … (see the sketch below).

@Zethson and @Imipenem may have some additional insight here, as they're working with quite heterogeneous medical record data over at theislab/ehrapy.
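A minimal sketch of that walk-and-fallback idea, assuming anndata's write_elem (exposed as anndata.experimental.write_elem in 0.8+); the wrapper itself is hypothetical:

```python
import pickle

import numpy as np
from anndata.experimental import write_elem  # anndata >= 0.8

def write_elem_or_pickle(group, key, value):
    """Try the regular anndata writer; on failure, store a pickled blob.

    The pickle branch is only suitable for ephemeral storage: other tools
    won't understand it, and library upgrades can break unpickling.
    """
    try:
        write_elem(group, key, value)
    except Exception:
        # np.void wraps raw bytes so h5py stores them as an opaque scalar.
        group.create_dataset(key, data=np.void(pickle.dumps(value)))
```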
I'm not entirely against pickling, but it's not a great long-term storage solution, as changes to e.g. pandas could make it impossible to deserialize. I would try as hard as possible to do faithful conversions, and as a final fallback convert to pickle, but with a warning message noting which slots needed that conversion - and allow reading back the h5ad even if some fields fail to unpickle. Would more aggressive conversion to categorical work as well?
Irrespective of other solutions, applying …
Yeah, I would say that you'd have to opt in to the pickling behavior. And the documentation would say "please don't do this for anything other than ephemeral storage, and even then please don't". This could be a "permissive" writing mode.
I have also been wanting to have a permissive reading mode. E.g. if you don't recognize the IO spec for a single element, you can skip it.
I think this is the route ehrapy had been taking, but they have run into issues with it.
We could have an …
I experience the same behaviour, "TypeError: Can't implicitly convert non-string objects to strings", when I'm attempting to write an AnnData which has …
I get the same error after calculating marker genes (Wilcoxon method) and trying to save the AnnData object again.
I have the same error as @Avaptel18
The issue reported by @bio-la is tracked in #679 and is now on track for inclusion in 0.9.0. The specific error reported by @avpatel18 could happen for a large number of reasons, some of which are not things we can handle. It may be resolved for some cases which can now be resolved to strings + null values.
Same issue as @Avaptel18 on writing to hdf5:
It happens if you calculate marker genes with …
Any workaround for this? It looked like there was going to be a fix in anndata 0.9, but that was postponed...
Here's a minimal example of the bug I run into:

```python
import scanpy as sc
import anndata

print("scanpy", sc.__version__)        # 1.9.3
print("anndata", anndata.__version__)  # 0.8.0

adata = sc.datasets.pbmc3k_processed()
sc.tl.rank_genes_groups(adata, groupby='louvain')
sc.tl.filter_rank_genes_groups(adata, min_fold_change=1)  # commenting this line will make the whole thing run without error
adata.write_h5ad('/tmp/pbmc_bug.h5ad')
```
As mentioned previously, …
Deleting 'rank_genes_groups' and 'rank_genes_groups_filtered' resolved the issue for me. Thank you!
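For anyone else hitting this, the workaround might look like the following sketch (assuming, per scanpy's defaults, that the results live in adata.uns):

```python
# Drop the offending result slots from .uns before writing.
# Caveat: the DE results are then missing from the saved file.
for key in ("rank_genes_groups", "rank_genes_groups_filtered"):
    adata.uns.pop(key, None)

adata.write_h5ad("/tmp/pbmc_bug.h5ad")
```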
@kaarg2: That "solves" the problem, but of course the DEGs won't be saved to disk then, which is very inconvenient.

@flying-sheep: As you mentioned, any anndata produced by scanpy and its functions should really be writable to disk; everything else is a bug. We should really fix that! I'd take a stab at the problem myself, but the format that … In fact, the … Looks like the best options at this point would be: …
print("anndata", anndata.__version__) # 0.8.0 that’s not the newest anndata version.
Since we’re going to release 0.10 in a little bit, I won’t bother figuring out when this was fixed. It’s either fixed already or will be later today.
I'm still dealing with this issue. At the very least, is it possible to flag which obs column is the problematic one? Running obs.dtypes shows I have a mix of int/float32/float64/category fields in the pandas object (nothing seems special here), but it is really hard to know which column is causing the error, since the message is so generic. I know it's obs-related (and not uns), because I tried deleting the entire uns layer and it still threw this error; plus, the only lead I have in the error points to an obs key error:
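Until the error message names the column, one way to narrow it down is to probe each .obs column on its own; a rough sketch (this helper is hypothetical, not part of anndata):

```python
import anndata as ad

def find_bad_obs_columns(adata, probe_path="/tmp/_col_check.h5ad"):
    """Write a one-column copy of .obs at a time and collect the failures."""
    bad = []
    for col in adata.obs.columns:
        probe = ad.AnnData(obs=adata.obs[[col]].copy())  # X is optional here
        try:
            probe.write_h5ad(probe_path)
        except Exception as exc:
            bad.append((col, repr(exc)))
    return bad

print(find_bad_obs_columns(adata))
```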
There might be some entries in adata.obs, or the name of some obs column, that contain an illegal character such as "/". You can try changing those characters in your adata.obs. Alternatively, you can use pickle; an example can be found here.
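As an illustration of the pickle route (with the earlier caveat that pickle is only suitable for ephemeral storage, since library upgrades can break unpickling):

```python
import pickle

# Save the whole object, bypassing h5ad type handling entirely.
with open("adata_backup.pkl", "wb") as f:
    pickle.dump(adata, f, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back in the same (or a compatible) environment.
with open("adata_backup.pkl", "rb") as f:
    adata = pickle.load(f)
```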
Also getting the same issue. The suggestions from @redst4r make sense and should be implemented.
OK, this issue was closed since there are multiple specific sub-issues, and any further conversation here is just confusing. To avoid further “me too”s, I’m going to link to the related issues a final time and then lock this thread. Feel free to open new issues if you think something is not covered by the following: …
I'm reopening #558 as I'm still having this problem with the latest anndata and h5py. It seems that anndata needs to be a bit more defensive about dtypes when writing to h5ad using h5py>=3. I know that most non-numeric columns get converted to categoricals in most situations, but I'm hitting these edge cases with some frequency, even with anndata objects that should be large enough to produce categoricals. I suppose some sort of the following logic would solve the issue (a sketch follows below): …

We're mostly running anndata<0.8, so maybe this is fixed there, but I'm also hitting this issue when writing the .uns slot; it would be great to apply the same clean-up/failsafe logic there.

Example with long traceback: …
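A rough sketch of the kind of defensive logic being proposed (illustrative only, not anndata's implementation):

```python
import pandas as pd

def coerce_column_for_h5ad(col: pd.Series) -> pd.Series:
    """Defensively coerce a column into something h5py>=3 can write."""
    if col.dtype != object:
        return col  # numeric, bool, and categorical columns are already fine
    # Object columns that are secretly numeric -> a proper numeric dtype.
    inferred = col.infer_objects()
    if inferred.dtype != object:
        return inferred
    # Everything else (mixed types, strings with NaN, ...) -> string, then categorical.
    return col.astype(str).astype("category")
```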