Separating `qc` from `parallel qc` and enhancing processing #15

mr-eyes · 2024-10-16T18:55:58Z

snipe qc is very efficient on a single core. It loads the reference signatures (ref,amplicon,variance_sigs) only once and then process all provided samples. On the other hand snipe parallel_qc load all needed resources per sample, which can be fast in large-scale but also introduces memory overhead. This is a Python limitation and will be solved in the future when the Rust or C++ version is implemented.

coderabbitai · 2024-10-16T18:56:05Z

Walkthrough

The changes introduce enhancements to the quality control (QC) processes for genomic samples across multiple files. A new class, MultiSigReferenceQC, is added for sequencing data analysis. The ReferenceQC class is updated with new analytical methods, while the SnipeSig class sees constructor modifications and new statistical properties. The command-line interface is enhanced for better QC operations, including a new parallel processing command. Additionally, the CI workflow is updated to include Python version 3.13.

Changes

File Path	Change Summary
`src/snipe/api/multisig_reference_QC.py`	Added `MultiSigReferenceQC` class for sequencing data quality control with new methods.
`src/snipe/api/reference_QC.py`	Updated `ReferenceQC` class with new methods for data analysis and refined existing methods.
`src/snipe/api/snipe_sig.py`	Modified `SnipeSig` class constructor, added statistical properties, and enhanced logging and validation.
`src/snipe/cli/cli_qc.py`	Enhanced CLI with `validate_sig_input` function and updated `qc` method signature.
`src/snipe/cli/cli_qc_parallel.py`	Introduced parallel processing capabilities and new validation functions.
`src/snipe/cli/main.py`	Added new command `parallel-qc` for executing parallelized QC operations.
`.github/workflows/CI.yaml`	Updated CI workflow to include Python version 3.13 in the testing matrix.

Poem

In the fields of data bright,
QC hops with all its might.
MultiSig leads the way,
With stats to save the day.
Parallel paths we now can trace,
In genomic realms, we find our place! 🐇✨

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 28

🧹 Outside diff range and nitpick comments (5)

src/snipe/cli/main.py (2)
35-36: LGTM: CLI group modifications are correct. Consider enhancing the docstring.

The new parallel-qc command is correctly added to the CLI group, and the docstring has been updated to include information about it. The implementation follows the existing pattern and is consistent with other commands.

Consider enhancing the docstring to provide more details about the trade-offs between qc and parallel-qc. For example:
 Commands:
 - `sketch`: Perform sketching operations on genomic data.
 - `qc`: Execute quality control (QC) on multiple samples against a reference genome.
-    - `parallel-qc`: Parallelized version of the `qc` command (not optimized for memory).
+    - `parallel-qc`: Parallelized version of the `qc` command. Offers improved speed for multiple samples at the cost of increased memory usage.
 - `ops`: Perform various operations on sketches.
This change would provide users with more context about when to choose between qc and parallel-qc.

Also applies to: 42-42

Issue Detected: parallel_qc Function Not Found in cli_qc_parallel.py

The parallel_qc function does not appear to be defined in the snipe.cli.cli_qc_parallel module. This missing definition may lead to errors when attempting to use the parallel-qc command.

File Affected: src/snipe/cli/cli_qc_parallel.py

Missing Function: parallel_qc

Please address the missing implementation of the parallel_qc function to ensure the CLI command is fully operational.

🔗 Analysis chain

Line range hint 20-42: Overall, the changes enhance the CLI with parallel QC capability.

The addition of the parallel-qc command expands the tool's functionality, allowing users to choose between regular and parallel QC processing based on their needs and available resources. The changes are minimal and non-disruptive to the existing code structure.

To ensure the new functionality works as expected, please verify the following:

The parallel_qc function is correctly implemented in the snipe.cli.cli_qc_parallel module.

The parallelization provides a significant performance improvement for multiple samples.

The memory usage trade-off is acceptable for the intended use cases.

You can use the following script to check the implementation of the parallel_qc function:

Additionally, consider adding performance benchmarks comparing qc and parallel-qc for various sample sizes to document the trade-offs.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the implementation of the parallel_qc function

# Test: Check if the parallel_qc function is defined in the correct module
ast-grep --lang python --pattern $'def parallel_qc():
  $$$' src/snipe/cli/cli_qc_parallel.py

# Test: Check for parallelization implementation (e.g., use of concurrent.futures)
rg --type python 'concurrent\.futures' src/snipe/cli/cli_qc_parallel.py
Length of output: 199

Script:
#!/bin/bash
# Description: Verify the implementation of the parallel_qc function

# Test: Check if the parallel_qc function is defined in the correct module
ast-grep --lang python --pattern $'def parallel_qc():
  $$$' src/snipe/cli/cli_qc_parallel.py

# Test: Check for parallelization implementation (e.g., use of concurrent.futures)
rg --type py 'concurrent\.futures' src/snipe/cli/cli_qc_parallel.py
Length of output: 400
src/snipe/cli/cli_qc.py (1)
Line range hint 441-454: Fix reporting of failed samples

In the section where failed samples are reported, there is a reassignment of failed_samples which might cause confusion or errors. Since failed_samples is already a list of sample names, reassigning it with [res['sample'] for res in failed] where failed is an integer (the count of failed samples) will cause an error.

Apply this diff to correct the reporting of failed samples:
     # Report failed samples if any
     if failed:
-        failed_samples = [res['sample'] for res in failed]
         logger.warning(f"The following {len(failed_samples)} sample(s) failed during QC processing:")
         for sample in failed_samples:
             logger.warning(f"- {sample}")
Remove the line reassigning failed_samples to prevent errors.
src/snipe/api/reference_QC.py (2)
Line range hint 321-429: Duplicate definition of _calculate_advanced_stats method

The method _calculate_advanced_stats is defined twice in the ReferenceQC class (lines 321-375 and 377-429). This will cause the first definition to be overwritten by the second, potentially leading to unexpected behavior or loss of functionality. Please consolidate the two method definitions into a single method, ensuring that all necessary functionality is preserved.

Line range hint 673-700: Update method documentation to reflect the use of genomic k-mers in splitting

The split_sig_randomly method has been modified to split the intersection of sample_sig and reference_sig, meaning it now only considers genomic k-mers. However, the method's docstring still indicates that it splits the entire sample signature. Please update the docstring to accurately describe the new behavior so that users understand that only genomic k-mers are considered in the random split.

Apply this diff to update the method's docstring:
 def split_sig_randomly(self, n: int) -> List[SnipeSig]:
     r"""
-    Split the sample signature into `n` random parts based on abundances.
+    Split the genomic portion of the sample signature (k-mers shared with the reference signature) into `n` random parts based on abundances.

-    This method distributes the k-mers of the sample signature into `n` parts using a multinomial distribution
+    This method distributes the genomic k-mers (shared between the sample and reference signatures) into `n` parts using a multinomial distribution
     based on their abundances. Each k-mer's abundance is split across the `n` parts proportionally.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c208b7f and e8e9d26.

📒 Files selected for processing (6)

src/snipe/api/multisig_reference_QC.py (1 hunks)
src/snipe/api/reference_QC.py (1 hunks)
src/snipe/api/snipe_sig.py (8 hunks)
src/snipe/cli/cli_qc.py (7 hunks)
src/snipe/cli/cli_qc_parallel.py (1 hunks)
src/snipe/cli/main.py (2 hunks)

🧰 Additional context used

🪛 Ruff

src/snipe/api/multisig_reference_QC.py

1-1: heapq imported but unused

Remove unused import: heapq

(F401)

4-4: typing.Iterator imported but unused

Remove unused import

(F401)

4-4: typing.Union imported but unused

Remove unused import

(F401)

10-10: os imported but unused

Remove unused import: os

(F401)

11-11: requests imported but unused

Remove unused import: requests

(F401)

12-12: tqdm.tqdm imported but unused

Remove unused import: tqdm.tqdm

(F401)

13-13: urllib.parse.urlparse imported but unused

Remove unused import: urllib.parse.urlparse

(F401)

14-14: Redefinition of unused Optional from line 4

Remove definition: Optional

(F811)

15-15: sourmash imported but unused

Remove unused import: sourmash

(F401)

17-17: concurrent imported but unused

Remove unused import: concurrent

(F401)

src/snipe/api/snipe_sig.py

5-5: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

src/snipe/cli/cli_qc.py

5-5: typing.Any imported but unused

Remove unused import

(F401)

5-5: typing.Dict imported but unused

Remove unused import

(F401)

5-5: typing.Union imported but unused

Remove unused import

(F401)

10-10: concurrent.futures imported but unused

Remove unused import: concurrent.futures

(F401)

src/snipe/cli/cli_qc_parallel.py

5-5: typing.Union imported but unused

Remove unused import: typing.Union

(F401)

13-13: snipe.api.sketch.SnipeSketch imported but unused

Remove unused import: snipe.api.sketch.SnipeSketch

(F401)

🔇 Additional comments (2)

src/snipe/cli/main.py (1)

20-20: LGTM: Import statement for parallel_qc is correctly added.

The import statement for parallel_qc is properly placed and follows the existing import style in the file. It's consistent with the project structure and other CLI-related imports.
src/snipe/cli/cli_qc.py (1)
403-408: Ensure correct parameter names when instantiating MultiSigReferenceQC

Verify that the parameters passed to MultiSigReferenceQC match its constructor's expected arguments. The parameter ychr should likely be ychr_sig or ychr_signature for clarity and consistency.

If the constructor expects ychr_signature, apply this diff to correct the parameter name:
     qc_instance = MultiSigReferenceQC(
             reference_sig=reference_sig,
             amplicon_sig=amplicon_sig,
-            ychr=ychr_sig if ychr_sig else None,
+            ychr_sig=ychr_sig if ychr_sig else None,
             varsigs=vars_snipesigs if vars_snipesigs else None,
             enable_logging=debug
         )
Alternatively, confirm that ychr is the correct parameter name in the MultiSigReferenceQC constructor.

coderabbitai · 2024-10-16T19:00:27Z

src/snipe/cli/cli_qc.py

        - `-o qc_comprehensive.tsv`: Output file for QC results.

        **Expected Output:**

        A TSV file named `qc_comprehensive.tsv` containing comprehensive QC metrics, including advanced analyses, ROI predictions, and data from amplicon and variable signatures for both `sample1.sig` and `sample2.sig`.
    """

+    print(sample)


⚠️ Potential issue

Remove unnecessary debug print(sample) statement

The print(sample) statement appears to be a leftover from debugging. It should be removed to prevent unintended output in the command-line interface.

Apply this diff to remove the debug statement:

-print(sample)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

print(sample)

coderabbitai · 2024-10-16T19:00:27Z

src/snipe/cli/cli_qc.py

@@ -2,133 +2,28 @@
 import sys
 import time
 import logging
-from typing import Optional, Any, List, Dict, Set
+from typing import Optional, Any, List, Dict, Set, Union


🛠️ Refactor suggestion

Remove unused imports from typing module

The imports Any, Dict, and Union from the typing module are not used in the code. Removing them will clean up the import statements.

Apply this diff to remove the unused imports:

-from typing import Optional, Any, List, Dict, Set, Union +from typing import Optional, List, Set

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

from typing import Optional, Any, List, Dict, Set, Union

from typing import Optional, List, Set

🧰 Tools

🪛 Ruff

5-5: typing.Any imported but unused

Remove unused import

(F401)

5-5: typing.Dict imported but unused

Remove unused import

(F401)

5-5: typing.Union imported but unused

Remove unused import

(F401)

coderabbitai · 2024-10-16T19:00:27Z

src/snipe/cli/cli_qc.py

+def validate_sig_input(ctx, param, value: tuple) -> str:
+    supported_extensions = ['.zip', '.sig']
+    for path in value:
+        if not os.path.exists(path):
+            raise click.BadParameter(f"File not found: {path}")
+        if not any(path.lower().endswith(ext) for ext in supported_extensions):
+            raise click.BadParameter(f"Unsupported file format: {path}, supported formats: {', '.join(supported_extensions)}")
+    return value


⚠️ Potential issue

Correct return type annotation in validate_sig_input function

The function validate_sig_input processes a tuple of paths and returns the same tuple. However, the return type is annotated as -> str, which should be corrected to -> tuple to accurately reflect the returned value.

Apply this diff to fix the function signature:

-def validate_sig_input(ctx, param, value: tuple) -> str: +def validate_sig_input(ctx, param, value: tuple) -> tuple:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def validate_sig_input(ctx, param, value: tuple) -> str:

supported_extensions = ['.zip', '.sig']

for path in value:

if not os.path.exists(path):

raise click.BadParameter(f"File not found: {path}")

if not any(path.lower().endswith(ext) for ext in supported_extensions):

raise click.BadParameter(f"Unsupported file format: {path}, supported formats: {', '.join(supported_extensions)}")

return value

def validate_sig_input(ctx, param, value: tuple) -> tuple:

supported_extensions = ['.zip', '.sig']

for path in value:

if not os.path.exists(path):

raise click.BadParameter(f"File not found: {path}")

if not any(path.lower().endswith(ext) for ext in supported_extensions):

raise click.BadParameter(f"Unsupported file format: {path}, supported formats: {', '.join(supported_extensions)}")

return value

coderabbitai · 2024-10-16T19:00:27Z

src/snipe/cli/cli_qc.py

+        sample_sig = SnipeSig(sourmash_sig=sample_path, sig_type=SigType.SAMPLE, enable_logging=debug)
+        try:
+            sample_stats = qc_instance.process_sample(sample_sig=sample_sig,
+                          predict_extra_folds = predict_extra_folds if roi else None,
+                          advanced=advanced)
+            sample_to_stats[sample_sig.name] = sample_stats
+        except Exception as e:
+            failed_samples.append(sample_sig.name)
+            qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}")
+            continue


⚠️ Potential issue

Handle exceptions when processing samples appropriately

When a sample fails to process, its name is added to failed_samples, but later in the code, there may be a mismatch when reporting failed samples. Ensure that failed_samples contains the correct information for reporting, and consider logging the exception traceback for better debugging.

Consider modifying the exception handling block:

except Exception as e: failed_samples.append(sample_sig.name) - qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}") + qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}", exc_info=True) continue

This change adds the exc_info=True parameter to log the full traceback of the exception.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

sample_sig = SnipeSig(sourmash_sig=sample_path, sig_type=SigType.SAMPLE, enable_logging=debug)

try:

sample_stats = qc_instance.process_sample(sample_sig=sample_sig,

predict_extra_folds = predict_extra_folds if roi else None,

advanced=advanced)

sample_to_stats[sample_sig.name] = sample_stats

except Exception as e:

failed_samples.append(sample_sig.name)

qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}")

continue

sample_sig = SnipeSig(sourmash_sig=sample_path, sig_type=SigType.SAMPLE, enable_logging=debug)

try:

sample_stats = qc_instance.process_sample(sample_sig=sample_sig,

predict_extra_folds = predict_extra_folds if roi else None,

advanced=advanced)

sample_to_stats[sample_sig.name] = sample_stats

except Exception as e:

failed_samples.append(sample_sig.name)

qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}", exc_info=True)

continue

coderabbitai · 2024-10-16T19:00:28Z

src/snipe/cli/cli_qc.py

@@ -137,18 +32,17 @@

 @click.command()
 @click.option('--ref', type=click.Path(exists=True), required=True, help='Reference genome signature file (required).')
-@click.option('--sample', type=click.Path(exists=True), multiple=True, help='Sample signature file. Can be provided multiple times.')
+@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.')


⚠️ Potential issue

Simplify @click.option for --sample by adjusting parameters

Using both type=click.Path(exists=True) and callback=validate_sig_input may lead to redundant path existence checks and potential conflicts. Since the validate_sig_input function handles validation, you can remove the exists=True parameter. Additionally, when multiple=True, the default value should be an empty tuple () instead of None.

Apply this diff to adjust the option configuration:

-@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.') +@click.option('--sample', type=click.Path(), callback=validate_sig_input, multiple=True, default=(), help='Sample signature file. Can be provided multiple times.')

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.')

@click.option('--sample', type=click.Path(), callback=validate_sig_input, multiple=True, default=(), help='Sample signature file. Can be provided multiple times.')

coderabbitai · 2024-10-16T19:00:30Z

src/snipe/api/multisig_reference_QC.py

+        if sample_sig._type != SigType.SAMPLE:
+            self.logger.error("Invalid signature type for sample_sig: %s | %s", sample_sig.sigtype, sample_sig._type)
+            raise ValueError(f"sample_sig must be of type {SigType.SAMPLE}, got {sample_sig.sigtype}")
+


⚠️ Potential issue

Use sigtype instead of _type to check sample_sig type

In the condition if sample_sig._type != SigType.SAMPLE:, you're accessing the protected member _type. It's better to use the public attribute sigtype, as done elsewhere in the code, for consistency and to avoid potential issues with access to protected members.

Apply this diff to fix the issue:

-if sample_sig._type != SigType.SAMPLE: +if sample_sig.sigtype != SigType.SAMPLE:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if sample_sig._type != SigType.SAMPLE:

self.logger.error("Invalid signature type for sample_sig: %s | %s", sample_sig.sigtype, sample_sig._type)

raise ValueError(f"sample_sig must be of type {SigType.SAMPLE}, got {sample_sig.sigtype}")

if sample_sig.sigtype != SigType.SAMPLE:

self.logger.error("Invalid signature type for sample_sig: %s | %s", sample_sig.sigtype, sample_sig._type)

raise ValueError(f"sample_sig must be of type {SigType.SAMPLE}, got {sample_sig.sigtype}")

coderabbitai · 2024-10-16T19:00:30Z

src/snipe/api/multisig_reference_QC.py

+        if ychr and not isinstance(ychr, SnipeSig):
+            self.logger.error("Invalid signature type for ychr: %s", ychr.sigtype)
+            raise ValueError(f"ychr must be of type {SigType.SAMPLE}, got {ychr.sigtype}")
+


⚠️ Potential issue

Fix AttributeError when logging invalid ychr parameter

In the condition if ychr and not isinstance(ychr, SnipeSig):, accessing ychr.sigtype will raise an AttributeError if ychr is not an instance of SnipeSig, because ychr won't have a sigtype attribute. Adjust the error logging to avoid accessing ychr.sigtype when ychr is not a SnipeSig instance.

Apply this diff to fix the issue:

if ychr and not isinstance(ychr, SnipeSig): - self.logger.error("Invalid signature type for ychr: %s", ychr.sigtype) - raise ValueError(f"ychr must be of type {SigType.SAMPLE}, got {ychr.sigtype}") + self.logger.error("Invalid type for ychr: %s", type(ychr)) + raise ValueError(f"ychr must be an instance of SnipeSig with sigtype {SigType.SAMPLE}, got {type(ychr)}")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if ychr and not isinstance(ychr, SnipeSig):

self.logger.error("Invalid signature type for ychr: %s", ychr.sigtype)

raise ValueError(f"ychr must be of type {SigType.SAMPLE}, got {ychr.sigtype}")

if ychr and not isinstance(ychr, SnipeSig):

self.logger.error("Invalid type for ychr: %s", type(ychr))

raise ValueError(f"ychr must be an instance of SnipeSig with sigtype {SigType.SAMPLE}, got {type(ychr)}")

coderabbitai · 2024-10-16T19:00:30Z

src/snipe/api/snipe_sig.py

@@ -117,18 +117,19 @@ def __init__(self, *, sourmash_sig: Union[str, sourmash.signature.SourmashSignat
                    self.logger.debug(f"Iterating over signature: {signame}")
                    if signame.endswith("-snipegenome"):
                        sig = sig.to_mutable()
+                        # self.chr_to_sig[sig.name] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME, enable_logging=enable_logging)


🛠️ Refactor suggestion

Remove commented-out code to improve readability.

The lines at 120, 127, and 132 contain code that's been commented out. If this code is no longer needed, consider removing it to enhance code clarity and maintainability.

Apply this diff to remove the commented-out code:

- # self.chr_to_sig[sig.name] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME, enable_logging=enable_logging)

- # sig.name = signame.replace("sex-","")

- # sig.name = signame.replace("autosome-","")

Also applies to: 127-127, 132-132

coderabbitai · 2024-10-16T19:00:30Z

src/snipe/api/snipe_sig.py

 from snipe.api.enums import SigType
-from typing import Any, Dict, Iterator, List, Optional, Union
+from typing import Dict, Iterator, List, Union, Optional


⚠️ Potential issue

Remove unused import Optional.

The Optional class from the typing module is imported but not used anywhere in the code. Removing it will clean up the import statements.

Apply this diff to remove the unused import:

-from typing import Dict, Iterator, List, Union, Optional +from typing import Dict, Iterator, List, Union

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

from typing import Dict, Iterator, List, Union, Optional

from typing import Dict, Iterator, List, Union

🧰 Tools

🪛 Ruff

5-5: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

coderabbitai · 2024-10-16T19:00:31Z

src/snipe/api/snipe_sig.py

+                self.logger.error("Failed to export signatures to zip: %s", e)
                raise Exception(f"Failed to export signatures to zip: {e}") from e


⚠️ Potential issue

Raise a more specific exception instead of a generic Exception.

Raising a generic Exception can make error handling less precise. Consider raising a more specific exception, such as IOError or a custom exception class, to provide clearer context and facilitate better error management.

Apply this diff to raise a more specific exception:

- raise Exception(f"Failed to export signatures to zip: {e}") from e + raise IOError(f"Failed to export signatures to zip: {e}") from e

Alternatively, define and use a custom exception that better reflects the nature of the error.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

self.logger.error("Failed to export signatures to zip: %s", e)

raise Exception(f"Failed to export signatures to zip: {e}") from e

self.logger.error("Failed to export signatures to zip: %s", e)

raise IOError(f"Failed to export signatures to zip: {e}") from e

coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (1)

src/snipe/cli/cli_qc_parallel.py (1)

153-155: Enhance the docstring for the parallel_qc function.

The current docstring for the parallel_qc function is quite brief and doesn't fully explain the parallelization aspect or the function's capabilities.

Consider expanding the docstring to provide more details about the function's behavior, parallelization strategy, and memory usage. Here's a suggested enhancement:

def parallel_qc(ref: str, sample: List[str], samples_from_file: Optional[str],
       amplicon: Optional[str], roi: bool, cores: int, advanced: bool, 
       ychr: Optional[str], debug: bool, output: str, vars: List[str]):
    """
    Perform parallelized quality control (QC) on genomic samples.

    This function implements a parallel version of the `qc` command, processing multiple samples
    concurrently to improve performance. It uses multiprocessing to distribute the workload
    across multiple CPU cores.

    Note: This implementation is not optimized for memory usage and may consume significant
    resources when processing large numbers of samples or using large variable signature files.

    Args:
        ref (str): Path to the reference genome signature file.
        sample (List[str]): List of paths to sample signature files.
        samples_from_file (Optional[str]): Path to a file containing sample paths (one per line).
        amplicon (Optional[str]): Path to the amplicon signature file.
        roi (bool): Flag to calculate ROI for 1, 2, 5, and 9 folds.
        cores (int): Number of CPU cores to use for parallel processing.
        advanced (bool): Flag to include advanced QC metrics.
        ychr (Optional[str]): Path to the Y chromosome signature file.
        debug (bool): Flag to enable debugging and detailed logging.
        output (str): Path to the output TSV file for QC results.
        vars (List[str]): List of paths to variable signature files.

    The function processes samples in parallel, aggregates results, and writes them to the
    specified output TSV file. It includes error handling and logging for each sample processed.
    """

This enhanced docstring provides more context about the function's purpose, its parallelization strategy, and potential memory usage considerations. It also details the arguments and their purposes, which will be helpful for users and maintainers of the code.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e8e9d26 and d884134.

📒 Files selected for processing (2)

.github/workflows/CI.yaml (1 hunks)
src/snipe/cli/cli_qc_parallel.py (1 hunks)

🧰 Additional context used

🪛 Ruff

src/snipe/cli/cli_qc_parallel.py

5-5: typing.Union imported but unused

Remove unused import: typing.Union

(F401)

13-13: snipe.api.sketch.SnipeSketch imported but unused

Remove unused import: snipe.api.sketch.SnipeSketch

(F401)

🔇 Additional comments (6)

.github/workflows/CI.yaml (1)

21-21: Verify compatibility with Python 3.13 and consider potential impacts.

Adding Python 3.13 to the test matrix is a forward-looking change that ensures the project will be tested against the latest Python version. However, there are a few points to consider:

As of October 2024, Python 3.13 is not yet released. This change anticipates its release, which is good for future-proofing but may cause immediate issues.

Adding another Python version will increase the CI pipeline execution time.

The project's dependencies may not yet support Python 3.13, which could lead to test failures.

To ensure this change doesn't cause immediate issues, please run the following checks:

Consider the following suggestions:

If Python 3.13 is not yet available, you might want to add it in a separate PR closer to its release date.

Update the project's pyproject.toml or setup.py to specify compatible Python versions.

Consider using a conditional job that only runs Python 3.13 tests on scheduled runs or specific branches to avoid potential failures in regular PRs.

src/snipe/cli/cli_qc_parallel.py (5)

20-27: LGTM: Input validation function is well-implemented.

The validate_sig_input function correctly checks for file existence and supported extensions. It raises appropriate exceptions for invalid inputs, which is good for error handling.

47-130: LGTM: Well-structured sample processing function with comprehensive error handling.

The process_sample function is well-implemented with detailed logging, proper error handling, and clear logic for processing different signature types and calculating metrics. The use of try-except blocks for ROI calculation is a good practice for handling potential errors.

132-135: LGTM: TSV file validation function is correctly implemented.

The validate_tsv_file function properly checks for the .tsv file extension and raises an appropriate exception for invalid extensions. This ensures that the output file will have the correct format.

150-363: LGTM: Well-implemented parallel QC function with comprehensive error handling and logging.

The parallel_qc function is well-structured and effectively implements parallel processing of samples for QC. It includes thorough error handling, detailed logging, and proper resource management. The use of tqdm for progress tracking is a nice touch for user feedback during long-running operations.

Some notable good practices:

Proper signal handling for graceful shutdown.

Clear separation of sample collection, processing, and result aggregation.

Comprehensive error handling and reporting for failed samples.

Use of concurrent.futures.ProcessPoolExecutor for parallel processing.

Overall, this implementation provides a robust and efficient way to perform QC on multiple samples in parallel.

1-363: LGTM: Well-implemented parallel QC functionality with robust error handling and efficient resource utilization.

This new file, cli_qc_parallel.py, successfully implements a parallel quality control (QC) functionality for the Snipe framework. The code demonstrates several strengths:

Comprehensive input validation and error handling throughout the file.

Effective use of multiprocessing for parallel sample processing, improving performance for large datasets.

Detailed logging that aids in debugging and monitoring the QC process.

Proper resource management, including graceful shutdown handling.

Clear separation of concerns with well-structured functions for different aspects of the QC process.

The suggested improvements (updating docstrings, removing unused imports, and replacing os._exit with sys.exit in specific cases) are minor and do not detract from the overall quality of the implementation.

This parallel QC implementation should significantly improve the performance of the Snipe framework when processing multiple samples, while maintaining robustness and providing clear output for users.

🧰 Tools

🪛 Ruff

5-5: typing.Union imported but unused

Remove unused import: typing.Union

(F401)

13-13: snipe.api.sketch.SnipeSketch imported but unused

Remove unused import: snipe.api.sketch.SnipeSketch

(F401)

coderabbitai · 2024-10-16T19:03:06Z

src/snipe/cli/cli_qc_parallel.py

+def process_sample(sample_path: str, reference_sig: SnipeSig, amplicon_sig: Optional[SnipeSig],
+                  advanced: bool, roi: bool, debug: bool,
+                  ychr_sig: Optional[SnipeSig] = None,
+                  var_sigs: Optional[List[SnipeSig]] = None) -> Dict[str, Any]:
+    """
+    Process a single sample for QC.
+
+    Parameters:
+    - sample_path (str): Path to the sample signature file.
+    - ref_path (str): Path to the reference signature file.
+    - amplicon_path (Optional[str]): Path to the amplicon signature file.
+    - advanced (bool): Flag to include advanced metrics.
+    - roi (bool): Flag to calculate ROI.
+    - debug (bool): Flag to enable debugging.
+    - vars_paths (Optional[Union[List[str], List[SnipeSig]]]): List of paths to variable signature files or SnipeSig objects.
+
+    Returns:
+    - Dict[str, Any]: QC results for the sample.
+    """


⚠️ Potential issue

Update the docstring to match the current function signature.

The docstring for process_sample doesn't accurately reflect the current function parameters. This mismatch could lead to confusion for developers using or maintaining this function.

Please update the docstring to match the current function signature. Here's a suggested update:

def process_sample(sample_path: str, reference_sig: SnipeSig, amplicon_sig: Optional[SnipeSig], advanced: bool, roi: bool, debug: bool, ychr_sig: Optional[SnipeSig] = None, var_sigs: Optional[List[SnipeSig]] = None) -> Dict[str, Any]: """ Process a single sample for QC. Parameters: - sample_path (str): Path to the sample signature file. - reference_sig (SnipeSig): Reference genome signature object. - amplicon_sig (Optional[SnipeSig]): Amplicon signature object. - advanced (bool): Flag to include advanced metrics. - roi (bool): Flag to calculate ROI. - debug (bool): Flag to enable debugging. - ychr_sig (Optional[SnipeSig]): Y chromosome signature object. - var_sigs (Optional[List[SnipeSig]]): List of variable signature objects. Returns: - Dict[str, Any]: QC results for the sample. """

This update will ensure that the docstring accurately reflects the function's parameters and their types.

coderabbitai · 2024-10-16T19:03:06Z

src/snipe/cli/cli_qc_parallel.py

+        logger.warning("KeyboardInterrupt received. Shutting down...")
+        os._exit(1)  # Forcefully terminate the program
+    except Exception as e:
+        logger.error(f"An unexpected error occurred: {e}")
+        os._exit(1)  # Forcefully terminate the program


⚠️ Potential issue

Use sys.exit(1) for graceful termination.

In the error handling sections, os._exit(1) is used, which immediately terminates the program without calling cleanup handlers or flushing buffers. This can lead to resource leaks or incomplete logging.

Replace os._exit(1) with sys.exit(1) in these two instances:

except KeyboardInterrupt: logger.warning("KeyboardInterrupt received. Shutting down...") - os._exit(1) # Forcefully terminate the program + sys.exit(1) # Gracefully terminate the program except Exception as e: logger.error(f"An unexpected error occurred: {e}") - os._exit(1) # Forcefully terminate the program + sys.exit(1) # Gracefully terminate the program

This change will allow for a more graceful shutdown, ensuring that all resources are properly released and logs are flushed.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

logger.warning("KeyboardInterrupt received. Shutting down...")

os._exit(1) # Forcefully terminate the program

except Exception as e:

logger.error(f"An unexpected error occurred: {e}")

os._exit(1) # Forcefully terminate the program

logger.warning("KeyboardInterrupt received. Shutting down...")

sys.exit(1) # Gracefully terminate the program

except Exception as e:

logger.error(f"An unexpected error occurred: {e}")

sys.exit(1) # Gracefully terminate the program

coderabbitai · 2024-10-16T19:03:06Z

src/snipe/cli/cli_qc_parallel.py

+import os
+import sys
+import time
+import logging
+from typing import Optional, Any, List, Dict, Set, Union
+
+import click
+import pandas as pd
+from tqdm import tqdm
+import concurrent.futures
+
+from snipe.api.enums import SigType
+from snipe.api.sketch import SnipeSketch
+from snipe.api.snipe_sig import SnipeSig
+from snipe.api.reference_QC import ReferenceQC
+
+import signal
+


⚠️ Potential issue

Remove unused imports to clean up the code.

There are two unused imports that should be removed to improve code cleanliness:

Apply this diff to remove the unused imports:

-from typing import Optional, Any, List, Dict, Set, Union +from typing import Optional, Any, List, Dict, Set -from snipe.api.sketch import SnipeSketch

This change will remove the unused Union type and SnipeSketch class, which are not used in the current implementation.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

import os

import sys

import time

import logging

from typing import Optional, Any, List, Dict, Set, Union

import click

import pandas as pd

from tqdm import tqdm

import concurrent.futures

from snipe.api.enums import SigType

from snipe.api.sketch import SnipeSketch

from snipe.api.snipe_sig import SnipeSig

from snipe.api.reference_QC import ReferenceQC

import signal

import os

import sys

import time

import logging

from typing import Optional, Any, List, Dict, Set

import click

import pandas as pd

from tqdm import tqdm

import concurrent.futures

from snipe.api.enums import SigType

from snipe.api.snipe_sig import SnipeSig

from snipe.api.reference_QC import ReferenceQC

import signal

🧰 Tools

🪛 Ruff

5-5: typing.Union imported but unused

Remove unused import: typing.Union

(F401)

13-13: snipe.api.sketch.SnipeSketch imported but unused

Remove unused import: snipe.api.sketch.SnipeSketch

(F401)

mr-eyes added 12 commits October 15, 2024 11:39

little enhancement

8ef9f89

little enhancement

60f6139

Alternative to ReferenceQC (load once process many)

6c12660

Minor updates in SnipeSig

53456a7

archiving the 1st version of

f2bbca0

New snipe qc with MultiSigReferenceQC

56ec28e

parallelize the snipe qc

891aa6d

ROI on genomic k-mers only

db1ff9f

ROI on genomic k-mers only

1db0168

enhancing ROI even more

715a96d

separate qc from parallel qc

e8e9d26

ready

59abf1d

support py313

d884134

coderabbitai bot reviewed Oct 16, 2024

View reviewed changes

mr-eyes merged commit 1e11f0f into main Oct 16, 2024
4 of 5 checks passed

mr-eyes deleted the enhance_vars branch October 16, 2024 19:02

coderabbitai bot reviewed Oct 16, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separating `qc` from `parallel qc` and enhancing processing #15

Separating `qc` from `parallel qc` and enhancing processing #15

mr-eyes commented Oct 16, 2024 •

edited

Loading

coderabbitai bot commented Oct 16, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot left a comment

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

coderabbitai bot Oct 16, 2024

	from typing import Optional, Any, List, Dict, Set, Union
	from typing import Optional, List, Set

	@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.')
	@click.option('--sample', type=click.Path(), callback=validate_sig_input, multiple=True, default=(), help='Sample signature file. Can be provided multiple times.')

	from typing import Dict, Iterator, List, Union, Optional
	from typing import Dict, Iterator, List, Union

		self.logger.error("Failed to export signatures to zip: %s", e)
		raise Exception(f"Failed to export signatures to zip: {e}") from e

Separating qc from parallel qc and enhancing processing #15

Separating qc from parallel qc and enhancing processing #15

Conversation

mr-eyes commented Oct 16, 2024 • edited Loading

coderabbitai bot commented Oct 16, 2024 • edited Loading

Walkthrough

Changes

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

coderabbitai bot Oct 16, 2024

Choose a reason for hiding this comment

Separating `qc` from `parallel qc` and enhancing processing #15

Separating `qc` from `parallel qc` and enhancing processing #15

mr-eyes commented Oct 16, 2024 •

edited

Loading

coderabbitai bot commented Oct 16, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)