Replies: 3 comments 4 replies
-
Hi @bennjmo,

It seems you have loaded the scaled data from the H5ad file into Scarf. First, you need to investigate if and where the raw data is stored in your AnnData/H5ad file:

```python
import h5py

handle = h5py.File("<filename>.h5ad", mode="r")
print(handle.keys())
```

You should see the usual groups like 'X', 'obs', 'obsm', 'uns', and 'var' in the output, but you are mainly looking for:

```python
print(handle["layers"].keys())
```

Once you have located the raw count data (a sparse count group will have these three entries: 'data', 'indices', 'indptr'), point the reader at it:

```python
reader = scarf.H5adReader(
    h5ad_fn="<filename>.h5ad",
    matrix_key='layers/counts',  # Override this with the group holding raw counts
    cell_attrs_key='obs',
    cell_ids_key='_index',
    feature_attrs_key='var',
    feature_ids_key='_index',
    feature_name_key='gene_short_name',
    obsm_attrs_key='obsm',
    category_names_key='__categories'
)
```

Hope this helps! Feel free to post here if you have more questions :)

/Parashar
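As background, h5ad files typically store sparse matrices in CSR form as exactly those three arrays ('data', 'indices', 'indptr'). A minimal pure-Python sketch (illustrative only, not Scarf or h5py API) of how a dense row is reconstructed from that triplet:

```python
# How a CSR group's 'data'/'indices'/'indptr' encode this dense matrix:
#   [[0, 5, 0],
#    [3, 0, 2]]
data = [5, 3, 2]      # non-zero values, row by row
indices = [1, 0, 2]   # column index of each value
indptr = [0, 1, 3]    # row i spans data[indptr[i]:indptr[i + 1]]

def row(i, n_cols=3):
    """Reconstruct dense row i from the CSR triplet."""
    out = [0] * n_cols
    for k in range(indptr[i], indptr[i + 1]):
        out[indices[k]] = data[k]
    return out

print(row(0))  # [0, 5, 0]
print(row(1))  # [3, 0, 2]
```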
-
Hi Ben,

Yes, it does seem like the raw counts weren't saved into the file. You may still be able to use Scarf with the normalized count values you have in the H5ad file: read in the H5ad file pointing the reader at the normalized matrix, then continue with the rest of the workflow as usual. But you will have to turn off normalization after loading the datastore, because the data is already normalized. If the gene names or IDs do not load correctly, you will need to look into the H5ad file and check the names within the 'var' group.

/Parashar

PS: Looking at the few rows of data that you have printed out, I'm afraid the data might be log transformed. This can be problematic for the highly variable gene selection step in Scarf. The problem can be avoided if you already have an HVG list that you want to use; in that case, all we need to do is set the feature key to that list.
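On the log-transform point: if the stored values were log1p-transformed (as scanpy's `sc.pp.log1p` does) and nothing else was applied afterwards, `numpy.expm1` inverts the transform. This is only a hedged sketch with made-up values; it does not recover counts if the data was also scaled or normalized per cell:

```python
import numpy as np

# Hypothetical raw counts that were log1p-transformed upstream.
raw = np.array([0.0, 1.0, 4.0, 10.0])
logged = np.log1p(raw)           # what may be stored in the H5ad matrix

# If (and only if) log1p was the last transform applied, expm1 inverts it;
# rounding cleans up floating-point error.
recovered = np.rint(np.expm1(logged))
print(recovered)  # [ 0.  1.  4. 10.]
```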
-
I am still getting errors when trying to write the data out. Yet it still works if I use the other reader call. I don't think I understand what difference that makes.

I think I'm just going to scrap this attempt and see if I can get my hands on the raw data so I can proceed with the typical Scarf workflow. But I really do appreciate your assistance!
-
I have an existing AnnData object of a large dataset (~190k cells) that I would like to process with Scarf (it was previously processed on a cluster, and I would like to see if Scarf would work for us going forward). This is a subsetted version I am working with just to speed things up, but you can see the obs, var, and uns are already quite busy.

What is the best way to take an object like this into Scarf? I am having trouble reverting back to just raw data so that I can go through the Scarf workflow. For example, one attempt gave negative numbers on the RNA_ncounts plot.
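For what it's worth, negative "counts" usually mean the stored matrix was z-scaled per gene (e.g. with scanpy's `sc.pp.scale`). A small numpy sketch with invented values showing why scaled data always contains negatives, and why reverting needs the per-gene mean and standard deviation to have been saved:

```python
import numpy as np

# Hypothetical per-gene counts that were z-scaled upstream.
counts = np.array([0.0, 2.0, 4.0, 10.0])
mu, sd = counts.mean(), counts.std()
scaled = (counts - mu) / sd          # what a scaled matrix would hold

# Any value below the gene's mean becomes negative after scaling.
print(scaled.min() < 0)  # True

# Reverting is only possible if mu and sd were stored alongside the data.
restored = scaled * sd + mu
print(np.allclose(restored, counts))  # True
```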
Thanks!