scverse · grst · Sep 8, 2024 · Jan 7, 2024 · Jan 7, 2024 · Jan 9, 2024
diff --git a/.github/workflows/conda.yaml b/.github/workflows/conda.yaml
@@ -22,27 +22,26 @@ jobs:
       matrix:
         include:
           - os: ubuntu-latest
-            python: "3.12"
+            python: "3.11"
 
     env:
       OS: ${{ matrix.os }}
       PYTHON: ${{ matrix.python }}
 
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
 
       - name: Setup Miniconda
-        uses: conda-incubator/setup-miniconda@v2
+        uses: conda-incubator/setup-miniconda@v3
         with:
-          miniforge-variant: Mambaforge
-          miniforge-version: latest
+          mamba-version: "*"
           channels: conda-forge,bioconda
           channel-priority: strict
-          python-version: ${{ matrix.python-version }}
+          python-version: ${{ matrix.python }}
 
       - name: install conda build
         run: |
-          mamba install -y boa conda-verify
+          mamba install -y boa conda-verify python=${{ matrix.python }}
         shell: bash
 
       - name: build and test package

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,20 +8,22 @@ and this project adheres to [Semantic Versioning][].
 [keep a changelog]: https://keepachangelog.com/en/1.0.0/
 [semantic versioning]: https://semver.org/spec/v2.0.0.html
 
-## [Unreleased]
+## v0.18.0
 
 ### Additions
 
 -   Isotypically included B cells are now labelled as `receptor_subtype="IGH+IGK/L"` instead of `ambiguous` in `tl.chain_qc` ([#537](https://github.com/scverse/scirpy/pull/537)).
 -   Added the `normalized_hamming` metric to `pp.ir_dist` that accounts for differences in CDR3 sequence length ([#512](https://github.com/scverse/scirpy/pull/512)).
+-   `tl.define_clonotype_clusters` now has an option to require J genes to match (`same_j_gene=True`) in addition to `same_v_gene`. ([#470](https://github.com/scverse/scirpy/pull/470)).
 
 ### Performance improvements
 
--   The hamming distance was reimplemented with numba, achieving a significant speedup ([#512](https://github.com/scverse/scirpy/pull/512)).
+-   The hamming distance has been reimplemented with numba, achieving a significant speedup ([#512](https://github.com/scverse/scirpy/pull/512)).
+-   Clonotype clustering has been accelerated leveraging sparse matrix operations ([#470](https://github.com/scverse/scirpy/pull/470)).
 
 ### Fixes
 
--   Fix that pl.clonotype_network couldn't use non-standard obsm key ([#545](https://github.com/scverse/scirpy/pull/545)).
+-   Fix that `pl.clonotype_network` couldn't use non-standard obsm key ([#545](https://github.com/scverse/scirpy/pull/545)).
 
 ### Other changes
 
@@ -54,7 +56,7 @@ and this project adheres to [Semantic Versioning][].
 
 ### Fixes
 
--   Fix issue with detecting the number of available CPUs on MacOD ([#518](https://github.com/scverse/scirpy/pull/502))
+-   Fix issue with detecting the number of available CPUs on MacOS ([#518](https://github.com/scverse/scirpy/pull/502))
 
 ## v0.16.1
 

diff --git a/docs/tutorials/tutorial_3k_tcr.ipynb b/docs/tutorials/tutorial_3k_tcr.ipynb
diff --git a/src/scirpy/ir_dist/_clonotype_neighbors.py b/src/scirpy/ir_dist/_clonotype_neighbors.py
diff --git a/src/scirpy/ir_dist/_util.py b/src/scirpy/ir_dist/_util.py
@@ -233,61 +233,90 @@
 
     def lookup(
         self,
-        object_id: int,
-        forward_lookup_table: str,
-        reverse_lookup_table: str | None = None,
-    ) -> coo_matrix | np.ndarray:
-        """Get ids of neighboring objects from a lookup table.
-
-        Performs the following lookup:
+        object_ids: np.ndarray[int],
+        forward_lookup_table_name: str,
+        reverse_lookup_table_name: str | None = None,
+    ) -> sp.csr_matrix:
+        """
+        Creates a distance matrix between objects with the given ids based on a feature distance matrix.
 
-            object_id -> dist_mat -> neighboring features -> neighboring objects.
+        To get the distance between two objects we need to look up the features of the two objects.
+        The distance between those two features is then the distance between the two objects.
 
-        where an object is a clonotype in our case (but could be used for something else)
+        To do so, we first use the `object_ids` together with the `forward_lookup_table` to look up
+        the indices of the objects in the feature `distance_matrix`. Afterwards we pick the according row for each object
+        out of the `distance_matrix` and construct a `rows` matrix (n_object_ids x n_features).
 
-        "nan"s are not looked up via the distance matrix, they return a row of zeros
+        "nan"s (index = -1) are not looked up in the feature `distance_matrix`, they return a row of zeros
         instead.
 
+        Then we use the entries of the `reverse_lookup_table` to construct a `reverse_lookup_matrix` (n_features x n_object_ids).
+        By multiplying the `rows` matrix with the `reverse_lookup_matrix` we get the final `object_distance_matrix` that shows
+        the distances between the objects with the given `object_ids` regarding a certain feature column.
+
+        It might not be obvious at the first sight that the matrix multiplication between `rows` and `reverse_lookup_matrix` gives
+        us the desired result. But this trick allows us to use the built-in sparse matrix multiplication of `scipy.sparse`
+        for enhanced performance.
+
         Parameters
         ----------
-        object_id
-            The row index of the feature_table.
-        forward_lookup_table
+        object_ids
+            The row indices of the feature_table.
+        forward_lookup_table_name
             The unique identifier of a lookup table previously added via
             `add_lookup_table`.
-        reverse_lookup_table
+        reverse_lookup_table_name
             The unique identifier of the lookup table used for the reverse lookup.
             If not provided will use the same lookup table for forward and reverse
             lookup. This is useful to calculate distances across features from
             different columns of the feature table (e.g. primary and secondary VJ chains).
+
+        Returns
+        -------
+        object_distance_matrix
+            A CSR matrix containing the pairwise distances between objects with the
+            given `object_ids` regarding a certain feature column.
         """
-        distance_matrix_name, forward, reverse = self.lookups[forward_lookup_table]
+        distance_matrix_name, forward_lookup_table, reverse_lookup_table = self.lookups[forward_lookup_table_name]
 
-        if reverse_lookup_table is not None:
-            distance_matrix_name_reverse, _, reverse = self.lookups[reverse_lookup_table]
+        if reverse_lookup_table_name is not None:
+            distance_matrix_name_reverse, _, reverse_lookup_table = self.lookups[reverse_lookup_table_name]
             if distance_matrix_name != distance_matrix_name_reverse:
                 raise ValueError("Forward and reverse lookup tablese must be defined " "on the same distance matrices.")
 
         distance_matrix = self.distance_matrices[distance_matrix_name]
-        idx_in_dist_mat = forward[object_id]
-        if idx_in_dist_mat == -1:  # nan
-            return reverse.empty()
-        else:
-            # get distances from the distance matrix...
-            row = distance_matrix[idx_in_dist_mat, :]
-
-            if reverse.is_boolean:
-                assert (
-                    len(row.indices) == 1  # type: ignore
-                ), "Boolean reverse lookup only works for identity distance matrices."
-                return reverse[row.indices[0]]  # type: ignore
-            else:
-                # ... and get column indices directly from sparse row
-                # sum concatenates coo matrices
-                return merge_coo_matrices(
-                    (reverse[i] * multiplier for i, multiplier in zip(row.indices, row.data, strict=False)),  # type: ignore
-                    shape=(1, reverse.size),
-                )
+
+        if np.max(distance_matrix.data) > np.iinfo(np.uint8).max:
+            raise OverflowError(
+                "The data values in the distance scipy.sparse.csr_matrix exceed the maximum value for uint8 (255)"
+            )
+
+        indices_in_dist_mat = forward_lookup_table[object_ids]
+        indptr = np.empty(distance_matrix.indptr.shape[0] + 1, dtype=np.int64)
+        indptr[:-1] = distance_matrix.indptr
+        indptr[-1] = indptr[-2]
+        distance_matrix_extended = sp.csr_matrix(
+            (distance_matrix.data.astype(np.uint8), distance_matrix.indices, indptr),
+            shape=(distance_matrix.shape[0] + 1, distance_matrix.shape[1]),
+        )
+        rows = distance_matrix_extended[indices_in_dist_mat, :]
+
+        reverse_matrix_data = [np.array([], dtype=np.uint8)] * rows.shape[1]
+        reverse_matrix_col = [np.array([], dtype=np.int64)] * rows.shape[1]
+        nnz_array = np.zeros(rows.shape[1], dtype=np.int64)
+
+        for key, value in reverse_lookup_table.lookup.items():
+            reverse_matrix_data[key] = value.data
+            reverse_matrix_col[key] = value.col
+            nnz_array[key] = value.nnz
+
+        data = np.concatenate(reverse_matrix_data)
+        col = np.concatenate(reverse_matrix_col)
+        indptr = np.concatenate([np.array([0], dtype=np.int64), np.cumsum(nnz_array)])
+
+        reverse_matrix = sp.csr_matrix((data, col, indptr), shape=(rows.shape[1], reverse_lookup_table.size))
+        object_distance_matrix = rows * reverse_matrix
+        return object_distance_matrix
 
     def add_distance_matrix(
         self,

diff --git a/src/scirpy/tests/data/clonotypes_test_data/j_gene_test_data.h5ad b/src/scirpy/tests/data/clonotypes_test_data/j_gene_test_data.h5ad
diff --git a/src/scirpy/tests/test_clonotypes.py b/src/scirpy/tests/test_clonotypes.py
@@ -2,6 +2,7 @@
 import sys
 from typing import cast
 
+import anndata as ad
 import numpy as np
 import numpy.testing as npt
 import pandas as pd
@@ -339,3 +340,23 @@ def test_clonotype_convergence(adata_clonotype):
             categories=["convergent", "not convergent"],
         ),
     )
+
+
+def test_j_gene_matching():
+    from . import TESTDATA
+
+    data = ad.read_h5ad(TESTDATA / "clonotypes_test_data/j_gene_test_data.h5ad")
+
+    ir.tl.define_clonotype_clusters(
+        data,
+        sequence="nt",
+        metric="normalized_hamming",
+        receptor_arms="all",
+        dual_ir="any",
+        same_j_gene=True,
+        key_added="test_j_gene",
+    )
+
+    clustering = data.obs["test_j_gene"].tolist()
+    expected = ["0", "0", "0", "0", "0", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2"]
+    assert np.array_equal(clustering, expected)
diff --git a/src/scirpy/tl/_clonotypes.py b/src/scirpy/tl/_clonotypes.py
@@ -197,9 +197,10 @@
     receptor_arms: Literal["VJ", "VDJ", "all", "any"] = "all",
     dual_ir: Literal["primary_only", "all", "any"] = "any",
     same_v_gene: bool = False,
+    same_j_gene: bool = False,
     within_group: Sequence[str] | str | None = "receptor_type",
     key_added: str | None = None,
-    partitions: Literal["connected", "leiden"] = "connected",
+    partitions: Literal["connected", "leiden", "fastgreedy"] = "connected",
     resolution: float = 1,
     n_iterations: int = 5,
     distance_key: str | None = None,
@@ -249,12 +250,19 @@
 
     partitions
         How to find graph partitions that define a clonotype.
-        Possible values are `leiden`, for using the "Leiden" algorithm and
+        Possible values are `leiden`, for using the "Leiden" algorithm,
+        `fastgreedy` for using the "Fastgreedy" algorithm and
         `connected` to find fully connected sub-graphs.
 
-        The difference is that the Leiden algorithm further divides
+        The difference is that the Leiden and Fastgreedy algorithms further divide
         fully connected subgraphs into highly-connected modules.
 
+        "Leiden" finds the community structure of the graph using the
+        Leiden algorithm of Traag, van Eck & Waltman.
+
+        "Fastgreedy" finds the community structure of the graph according to the
+        algorithm of Clauset et al based on the greedy optimization of modularity.
+
     resolution
         `resolution` parameter for the leiden algorithm.
     n_iterations
@@ -289,6 +297,7 @@
         receptor_arms=receptor_arms,  # type: ignore
         dual_ir=dual_ir,  # type: ignore
         same_v_gene=same_v_gene,
+        same_j_gene=same_j_gene,
         match_columns=within_group,
         distance_key=distance_key,
         sequence_key="junction_aa" if sequence == "aa" else "junction",
@@ -304,6 +313,8 @@
             resolution_parameter=resolution,
             n_iterations=n_iterations,
         )
+    elif partitions == "fastgreedy":
+        part = g.community_fastgreedy().as_clustering()
     else:
         part = g.clusters(mode="weak")