Feedback on concatenate() #541

grst · 2024-04-08T10:00:28Z

While I in the end was able to concatenate the data the way I like, the user experience wasn't as great as I had hoped, so wanted to drop some feedback. As I'm not that familiar with spatialdata yet, it might be that there are already better solutions -- please let me know if there are.

Starting situation

I have ~20 Visium Cytassist samples from a clinical trial processed with nf-core/spatialtranscriptomics (using the nf-core/spatialvi#67 branch that already uses spatialdata). The pipeline generates a single .zarr folder for each sample.

Desired outcome

I would like to have all samples in a single SpatialData object. The AnnData table should contain the gene expression from all samples.

Pain points

sd.concatenate enforces that the input is a list. Is there a reason this can't accept any Sequence type (e.g. dict_values)?
Usually, I pass a dictionary sample_id -> AnnData to anndata.concat, which nicely makes unique obs_names in combination with concat(..., index_unique="_"). This doesn't work with spatialdata.concatenate, which leaves me with either manipulating the obs_names for each object before concatenation, or ugly obs names with numeric sufficies (e.g. AACTCAACCTTGACCA-1_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0). IMO it would be great to support a dict as input to spatialdata.concatenate, too.

The per-sample SpatialData objects all have the same names for images, shapes and coordinate systems. I currently rename them like this:

sdatas_vis = {}

for _, row in tqdm(samplesheet.iterrows(), total=samplesheet.shape[0]):
   sample = row["sample"]
   tmp_sd = sd.read_zarr(sample_path / sample / "data" / "sdata_processed.zarr")
   tmp_sd.tables["table"].obs = tmp_sd.tables["table"].obs.assign(**row)
   tmp_sd.tables["table"].obs["region"] = sample
   tmp_sd.tables["table"].uns["spatialdata_attrs"]["region"] = sample
   # rename images
   tmp_sd.images[f"{sample}_hires"] = tmp_sd.images["visium_hires_image"]
   tmp_sd.images[f"{sample}_lowres"] = tmp_sd.images["visium_lowres_image"]
   del tmp_sd.images["visium_hires_image"]
   del tmp_sd.images["visium_lowres_image"]
   # rename shapes
   tmp_sd.shapes[f"{sample}"] = tmp_sd.shapes["visium"]
   del tmp_sd.shapes["visium"]

   sdatas_vis[sample] = tmp_sd

which seems a bit cumbersome. I'm wondering if there's a better solution or what's the intended way of handling such cases. It could also be worth adding a process to the nf-core/spatialtranscriptomics pipeline that already does the concatenation step.

The text was updated successfully, but these errors were encountered:

melonora · 2024-04-15T13:12:30Z

I am a bit swamped at the moment, but I will look into implementing your suggestions. As you said it would be worthwhile to handle dicts.

wangjiawen2013 · 2024-08-06T03:00:31Z

I have the same issue !
The per-sample SpatialData objects all have the same names for images, shapes and coordinate systems. So when I concatenate them, an keyerror occurred: KeyError: 'Images must have unique names across the SpatialData objects to concatenate'

wangjiawen2013 · 2024-08-06T03:05:01Z

And it's better to have a way to retrieve (subset) each objects from the concatenated objects.

LucaMarconato · 2024-08-06T11:42:14Z

@wangjiawen2013 does SpatialData.subset() works for your use case or you would improve something?

wangjiawen2013 · 2024-08-07T02:50:06Z

What I mean is how to concatenate multi spatialdata objects and subset each objects from the concatenated objects according to the sample names (each object have a unique name) again. The SpatialData objects from xenium all have the same names for images, shapes and coordinate systems, so I cannot concatenate them because KeyError: Images must have unique names across the SpatialData objects to concatenate.
SpatialData.subset() can only get elements, not objects.
We can concatenate and subset anndata objects well, what i mean is to concatenate and subset spatialdata objects like anndata objects.

LucaMarconato · 2024-10-01T15:50:38Z

Hi, getting back to this today.

What I mean is how to concatenate multi spatialdata objects and subset each objects from the concatenated objects according to the sample names (each object have a unique name) again. The SpatialData objects from xenium all have the same names for images, shapes and coordinate systems, so I cannot concatenate them because KeyError: Images must have unique names across the SpatialData objects to concatenate.
SpatialData.subset() can only get elements, not objects.
We can concatenate and subset anndata objects well, what i mean is to concatenate and subset spatialdata objects like anndata objects.

What I suggest here is, for each sample, to map all its geometry to a coordinate system called "sample_XXX", with XXX being the name/id of the sample. In this way you can easily get the SpatialData object for that sample using sdata.filter_by_coordinate_system(). Also subset() (with filter_table=True, which is the default) should work. You can see an example of both strategies in action in this notebook (3 samples; 1 image and 1 labels per sample; 1 global table).

Please let me know if it works for you.

LucaMarconato · 2024-10-01T15:57:20Z

@grst

The per-sample SpatialData objects all have the same names for images, shapes and coordinate systems. I currently rename them like this:
...
which seems a bit cumbersome. I'm wondering if there's a better solution or what's the intended way of handling such cases. It could also be worth adding a process to the nf-core/spatialtranscriptomics pipeline that already does the concatenation step.

@wangjiawen2013

I have the same issue !
The per-sample SpatialData objects all have the same names for images, shapes and coordinate systems. So when I concatenate them, an keyerror occurred: KeyError: 'Images must have unique names across the SpatialData objects to concatenate'

Currently what you implemented is basically I would do it. An idea would be to wrap that into an official helper function, or have the concatenate() function doing this automatically (contributions are appreciated!). But in the long term our approach would be to allow the user to have nested NGFF hierarchies #398. This would solve the problem with unique names because the new element name would be its relative path to the NGFF store root (sample0/image would be different from sample1/image).

LucaMarconato · 2024-10-01T15:58:16Z

sd.concatenate enforces that the input is a list. Is there a reason this can't accept any Sequence type (e.g. dict_values)?

I'll fix this. I'll support Iterable so that my_dict.values() will work.

LucaMarconato · 2024-10-01T16:29:32Z

Usually, I pass a dictionary sample_id -> AnnData to anndata.concat, which nicely makes unique obs_names in combination with concat(..., index_unique="_"). This doesn't work with spatialdata.concatenate, which leaves me with either manipulating the obs_names for each object before concatenation, or ugly obs names with numeric sufficies (e.g. AACTCAACCTTGACCA-1_0_0_0_0_0_0_0_0_0_0_0_0_0_0_0). IMO it would be great to support a dict as input to spatialdata.concatenate, too.

I will try exploring a solution for this.

But one comment on this. In spatialdata we don't use the .obs names for a series of reasons:

what maps the table rows to some geometries is the pair (region, instance_id), the obs index is not enough, so we don't use it. What we use are the columns named after the region_key and instance_key values.
we needed the obs indices to be integers, but AnnData only supported strings.

So, renaming the obs would come for convenience (or to guarantee that obs are unique), but not be used in any other spatialdata API, rather in downstream calls of anndata/scanpy APIs. So I wonder if we should proceed as follows:

if the user passes a dict, we don't use the dict (we just use the values), and then we pass the dict to ad.concat. This ensures unique names.
if the user doesn't pass a dict we simply call .obs_names_make_unique() in the resulting tables. If this is not done for instance the join_spatialelement_table() API calls fail.

LucaMarconato · 2024-10-02T20:42:10Z

Ok actually I have implemented all the above. @wangjiawen2013 @grst it would be great if you could try this out please 😊

wangjiawen2013 · 2024-10-03T02:02:09Z

Good news! I'll try later.

wangjiawen2013 · 2024-10-17T02:13:30Z

@LucaMarconato
Is this implementation released in the current sptial-data version ?

LucaMarconato · 2024-11-11T14:47:04Z

Yes, it's available in the released version.

Pancreas-Pratik · 2024-11-24T23:44:41Z

Ok actually I have implemented all the above. @wangjiawen2013 @grst it would be great if you could try this out please 😊

Thank you @LucaMarconato. I can confirm these new improvements work. This saved me a bunch of time!

I was unsure for a while on exactly what to do for merging two Xenium datasets together with identical element names. I had to learn some of the SpatialData terminology which really was just learning what the term element was referring to. Thank you for the design documentation which helped with learning the terminology (https://spatialdata.scverse.org/en/latest/design_doc.html#elements).

I also had to look through the Files changed tab on #720 and view the notes you had written about these new changes. The notes explained that a dict had to be used as input within spatialdata.concatenate(), which would let SpatialData know what to use for appending unique sample ID suffixes in order to generate a merged object with unique element names successfully (without the KeyError: Images... @wangjiawen2013 mentioned above #541 (comment)). For maybe someone who was stuck like I was, what was suggested above in the above comment #541 (comment) is what what is now implemented within spatialdata.

For me (using two xenium datasets) this now looks like:

Loading both xenium datasets:

from spatialdata_io import xenium

first_sdata = xenium('./data/first_xenium_dataset/outs/')
second_sdata = xenium('./data/second_xenium_dataset/outs/')

Creating the dicts to be used for appending suffixes to each object's elements to ensure unique element names and prevent KeyError: Images... when using spatialdata.concatenate():

first_second_dict = {
    'first': first_sdata,
    'second': second_sdata
}

and then finally, merge:

import spatialdata as sd
fs_sdata=sd.concatenate(first_second_dict)

wangjiawen2013 · 2024-11-28T07:57:34Z

This feature maybe useful when we compare two samples. If we don't merge the two datasets, we cannot assign the same color to the same cell type easily, and it's impossible to compare cellular niches (i.e., regions with distinct compositions of cell types). For example, If we use separated datasets, we find niche1, niche2 and niche3 in dataset1, and niche1, niche2, nich3 in dataset2, how can we know niche1 in dataset1 is niche1 in dataset2 ? If we use concatenated datasets and find niche1, niche2, niche3 using niches discovery algorithm for the whole datasets, we can compare the niches distribution among samples.
https://monkeybread.readthedocs.io/en/latest/index.html

wangjiawen2013 · 2024-11-28T08:51:26Z

The display of sdata.table can be re-formatted to start a new line for a table

wangjiawen2013 · 2024-11-28T13:01:27Z

Hi, Another question, How to separate or subset each object from the concatenated object ?

Pancreas-Pratik · 2024-11-29T03:09:12Z

@wangjiawen2013 I have not tried this myself, yet, but may need to soon. @LucaMarconato provided links above (#541 (comment)) to the MIBI-TOF tutorial which shows how this could be done (and also a similar method using sdata.filter_by_coordinate_system()).

giovp added enhancement ✨ New feature or request method: concatenate labels Sep 6, 2024

LucaMarconato assigned melonora Sep 30, 2024

LucaMarconato assigned LucaMarconato and unassigned melonora Oct 1, 2024

LucaMarconato mentioned this issue Oct 1, 2024

Improved concatenate() for non-unique element names #720

Merged

LucaMarconato closed this as completed in #720 Oct 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feedback on concatenate() #541

Feedback on concatenate() #541

grst commented Apr 8, 2024

melonora commented Apr 15, 2024

wangjiawen2013 commented Aug 6, 2024

wangjiawen2013 commented Aug 6, 2024

LucaMarconato commented Aug 6, 2024

wangjiawen2013 commented Aug 7, 2024 •

edited

Loading

LucaMarconato commented Oct 1, 2024 •

edited

Loading

LucaMarconato commented Oct 1, 2024

LucaMarconato commented Oct 1, 2024 •

edited

Loading

LucaMarconato commented Oct 1, 2024 •

edited

Loading

LucaMarconato commented Oct 2, 2024

wangjiawen2013 commented Oct 3, 2024

wangjiawen2013 commented Oct 17, 2024 •

edited

Loading

LucaMarconato commented Nov 11, 2024

Pancreas-Pratik commented Nov 24, 2024

wangjiawen2013 commented Nov 28, 2024 •

edited

Loading

wangjiawen2013 commented Nov 28, 2024 •

edited

Loading

wangjiawen2013 commented Nov 28, 2024 via email •

edited

Loading

Pancreas-Pratik commented Nov 29, 2024 •

edited

Loading

Feedback on concatenate() #541

Feedback on concatenate() #541

Comments

grst commented Apr 8, 2024

Starting situation

Desired outcome

Pain points

melonora commented Apr 15, 2024

wangjiawen2013 commented Aug 6, 2024

wangjiawen2013 commented Aug 6, 2024

LucaMarconato commented Aug 6, 2024

wangjiawen2013 commented Aug 7, 2024 • edited Loading

LucaMarconato commented Oct 1, 2024 • edited Loading

LucaMarconato commented Oct 1, 2024

LucaMarconato commented Oct 1, 2024 • edited Loading

LucaMarconato commented Oct 1, 2024 • edited Loading

LucaMarconato commented Oct 2, 2024

wangjiawen2013 commented Oct 3, 2024

wangjiawen2013 commented Oct 17, 2024 • edited Loading

LucaMarconato commented Nov 11, 2024

Pancreas-Pratik commented Nov 24, 2024

wangjiawen2013 commented Nov 28, 2024 • edited Loading

wangjiawen2013 commented Nov 28, 2024 • edited Loading

wangjiawen2013 commented Nov 28, 2024 via email • edited Loading

Pancreas-Pratik commented Nov 29, 2024 • edited Loading

wangjiawen2013 commented Aug 7, 2024 •

edited

Loading

LucaMarconato commented Oct 1, 2024 •

edited

Loading

LucaMarconato commented Oct 1, 2024 •

edited

Loading

LucaMarconato commented Oct 1, 2024 •

edited

Loading

wangjiawen2013 commented Oct 17, 2024 •

edited

Loading

wangjiawen2013 commented Nov 28, 2024 •

edited

Loading

wangjiawen2013 commented Nov 28, 2024 •

edited

Loading

wangjiawen2013 commented Nov 28, 2024 via email •

edited

Loading

Pancreas-Pratik commented Nov 29, 2024 •

edited

Loading