Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separating qc from parallel qc and enhancing processing #15

Merged
merged 13 commits into from
Oct 16, 2024
Merged

Conversation

mr-eyes
Copy link
Member

@mr-eyes mr-eyes commented Oct 16, 2024

snipe qc is very efficient on a single core. It loads the reference signatures (ref,amplicon,variance_sigs) only once and then process all provided samples. On the other hand snipe parallel_qc load all needed resources per sample, which can be fast in large-scale but also introduces memory overhead. This is a Python limitation and will be solved in the future when the Rust or C++ version is implemented.

Copy link
Contributor

coderabbitai bot commented Oct 16, 2024

Walkthrough

The changes introduce enhancements to the quality control (QC) processes for genomic samples across multiple files. A new class, MultiSigReferenceQC, is added for sequencing data analysis. The ReferenceQC class is updated with new analytical methods, while the SnipeSig class sees constructor modifications and new statistical properties. The command-line interface is enhanced for better QC operations, including a new parallel processing command. Additionally, the CI workflow is updated to include Python version 3.13.

Changes

File Path Change Summary
src/snipe/api/multisig_reference_QC.py Added MultiSigReferenceQC class for sequencing data quality control with new methods.
src/snipe/api/reference_QC.py Updated ReferenceQC class with new methods for data analysis and refined existing methods.
src/snipe/api/snipe_sig.py Modified SnipeSig class constructor, added statistical properties, and enhanced logging and validation.
src/snipe/cli/cli_qc.py Enhanced CLI with validate_sig_input function and updated qc method signature.
src/snipe/cli/cli_qc_parallel.py Introduced parallel processing capabilities and new validation functions.
src/snipe/cli/main.py Added new command parallel-qc for executing parallelized QC operations.
.github/workflows/CI.yaml Updated CI workflow to include Python version 3.13 in the testing matrix.

Poem

In the fields of data bright,
QC hops with all its might.
MultiSig leads the way,
With stats to save the day.
Parallel paths we now can trace,
In genomic realms, we find our place! 🐇✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 28

🧹 Outside diff range and nitpick comments (5)
src/snipe/cli/main.py (2)

35-36: LGTM: CLI group modifications are correct. Consider enhancing the docstring.

The new parallel-qc command is correctly added to the CLI group, and the docstring has been updated to include information about it. The implementation follows the existing pattern and is consistent with other commands.

Consider enhancing the docstring to provide more details about the trade-offs between qc and parallel-qc. For example:

 Commands:
 - `sketch`: Perform sketching operations on genomic data.
 - `qc`: Execute quality control (QC) on multiple samples against a reference genome.
-    - `parallel-qc`: Parallelized version of the `qc` command (not optimized for memory).
+    - `parallel-qc`: Parallelized version of the `qc` command. Offers improved speed for multiple samples at the cost of increased memory usage.
 - `ops`: Perform various operations on sketches.

This change would provide users with more context about when to choose between qc and parallel-qc.

Also applies to: 42-42


Issue Detected: parallel_qc Function Not Found in cli_qc_parallel.py

The parallel_qc function does not appear to be defined in the snipe.cli.cli_qc_parallel module. This missing definition may lead to errors when attempting to use the parallel-qc command.

  • File Affected: src/snipe/cli/cli_qc_parallel.py
  • Missing Function: parallel_qc

Please address the missing implementation of the parallel_qc function to ensure the CLI command is fully operational.

🔗 Analysis chain

Line range hint 20-42: Overall, the changes enhance the CLI with parallel QC capability.

The addition of the parallel-qc command expands the tool's functionality, allowing users to choose between regular and parallel QC processing based on their needs and available resources. The changes are minimal and non-disruptive to the existing code structure.

To ensure the new functionality works as expected, please verify the following:

  1. The parallel_qc function is correctly implemented in the snipe.cli.cli_qc_parallel module.
  2. The parallelization provides a significant performance improvement for multiple samples.
  3. The memory usage trade-off is acceptable for the intended use cases.

You can use the following script to check the implementation of the parallel_qc function:

Additionally, consider adding performance benchmarks comparing qc and parallel-qc for various sample sizes to document the trade-offs.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of the parallel_qc function

# Test: Check if the parallel_qc function is defined in the correct module
ast-grep --lang python --pattern $'def parallel_qc():
  $$$' src/snipe/cli/cli_qc_parallel.py

# Test: Check for parallelization implementation (e.g., use of concurrent.futures)
rg --type python 'concurrent\.futures' src/snipe/cli/cli_qc_parallel.py

Length of output: 199


Script:

#!/bin/bash
# Description: Verify the implementation of the parallel_qc function

# Test: Check if the parallel_qc function is defined in the correct module
ast-grep --lang python --pattern $'def parallel_qc():
  $$$' src/snipe/cli/cli_qc_parallel.py

# Test: Check for parallelization implementation (e.g., use of concurrent.futures)
rg --type py 'concurrent\.futures' src/snipe/cli/cli_qc_parallel.py

Length of output: 400

src/snipe/cli/cli_qc.py (1)

Line range hint 441-454: Fix reporting of failed samples

In the section where failed samples are reported, there is a reassignment of failed_samples which might cause confusion or errors. Since failed_samples is already a list of sample names, reassigning it with [res['sample'] for res in failed] where failed is an integer (the count of failed samples) will cause an error.

Apply this diff to correct the reporting of failed samples:

     # Report failed samples if any
     if failed:
-        failed_samples = [res['sample'] for res in failed]
         logger.warning(f"The following {len(failed_samples)} sample(s) failed during QC processing:")
         for sample in failed_samples:
             logger.warning(f"- {sample}")

Remove the line reassigning failed_samples to prevent errors.

src/snipe/api/reference_QC.py (2)

Line range hint 321-429: Duplicate definition of _calculate_advanced_stats method

The method _calculate_advanced_stats is defined twice in the ReferenceQC class (lines 321-375 and 377-429). This will cause the first definition to be overwritten by the second, potentially leading to unexpected behavior or loss of functionality. Please consolidate the two method definitions into a single method, ensuring that all necessary functionality is preserved.


Line range hint 673-700: Update method documentation to reflect the use of genomic k-mers in splitting

The split_sig_randomly method has been modified to split the intersection of sample_sig and reference_sig, meaning it now only considers genomic k-mers. However, the method's docstring still indicates that it splits the entire sample signature. Please update the docstring to accurately describe the new behavior so that users understand that only genomic k-mers are considered in the random split.

Apply this diff to update the method's docstring:

 def split_sig_randomly(self, n: int) -> List[SnipeSig]:
     r"""
-    Split the sample signature into `n` random parts based on abundances.
+    Split the genomic portion of the sample signature (k-mers shared with the reference signature) into `n` random parts based on abundances.

-    This method distributes the k-mers of the sample signature into `n` parts using a multinomial distribution
+    This method distributes the genomic k-mers (shared between the sample and reference signatures) into `n` parts using a multinomial distribution
     based on their abundances. Each k-mer's abundance is split across the `n` parts proportionally.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c208b7f and e8e9d26.

📒 Files selected for processing (6)
  • src/snipe/api/multisig_reference_QC.py (1 hunks)
  • src/snipe/api/reference_QC.py (1 hunks)
  • src/snipe/api/snipe_sig.py (8 hunks)
  • src/snipe/cli/cli_qc.py (7 hunks)
  • src/snipe/cli/cli_qc_parallel.py (1 hunks)
  • src/snipe/cli/main.py (2 hunks)
🧰 Additional context used
🪛 Ruff
src/snipe/api/multisig_reference_QC.py

1-1: heapq imported but unused

Remove unused import: heapq

(F401)


4-4: typing.Iterator imported but unused

Remove unused import

(F401)


4-4: typing.Union imported but unused

Remove unused import

(F401)


10-10: os imported but unused

Remove unused import: os

(F401)


11-11: requests imported but unused

Remove unused import: requests

(F401)


12-12: tqdm.tqdm imported but unused

Remove unused import: tqdm.tqdm

(F401)


13-13: urllib.parse.urlparse imported but unused

Remove unused import: urllib.parse.urlparse

(F401)


14-14: Redefinition of unused Optional from line 4

Remove definition: Optional

(F811)


15-15: sourmash imported but unused

Remove unused import: sourmash

(F401)


17-17: concurrent imported but unused

Remove unused import: concurrent

(F401)

src/snipe/api/snipe_sig.py

5-5: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

src/snipe/cli/cli_qc.py

5-5: typing.Any imported but unused

Remove unused import

(F401)


5-5: typing.Dict imported but unused

Remove unused import

(F401)


5-5: typing.Union imported but unused

Remove unused import

(F401)


10-10: concurrent.futures imported but unused

Remove unused import: concurrent.futures

(F401)

src/snipe/cli/cli_qc_parallel.py

5-5: typing.Union imported but unused

Remove unused import: typing.Union

(F401)


13-13: snipe.api.sketch.SnipeSketch imported but unused

Remove unused import: snipe.api.sketch.SnipeSketch

(F401)

🔇 Additional comments (2)
src/snipe/cli/main.py (1)

20-20: LGTM: Import statement for parallel_qc is correctly added.

The import statement for parallel_qc is properly placed and follows the existing import style in the file. It's consistent with the project structure and other CLI-related imports.

src/snipe/cli/cli_qc.py (1)

403-408: Ensure correct parameter names when instantiating MultiSigReferenceQC

Verify that the parameters passed to MultiSigReferenceQC match its constructor's expected arguments. The parameter ychr should likely be ychr_sig or ychr_signature for clarity and consistency.

If the constructor expects ychr_signature, apply this diff to correct the parameter name:

     qc_instance = MultiSigReferenceQC(
             reference_sig=reference_sig,
             amplicon_sig=amplicon_sig,
-            ychr=ychr_sig if ychr_sig else None,
+            ychr_sig=ychr_sig if ychr_sig else None,
             varsigs=vars_snipesigs if vars_snipesigs else None,
             enable_logging=debug
         )

Alternatively, confirm that ychr is the correct parameter name in the MultiSigReferenceQC constructor.

- `-o qc_comprehensive.tsv`: Output file for QC results.

**Expected Output:**

A TSV file named `qc_comprehensive.tsv` containing comprehensive QC metrics, including advanced analyses, ROI predictions, and data from amplicon and variable signatures for both `sample1.sig` and `sample2.sig`.
"""

print(sample)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unnecessary debug print(sample) statement

The print(sample) statement appears to be a leftover from debugging. It should be removed to prevent unintended output in the command-line interface.

Apply this diff to remove the debug statement:

-print(sample)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(sample)

@@ -2,133 +2,28 @@
import sys
import time
import logging
from typing import Optional, Any, List, Dict, Set
from typing import Optional, Any, List, Dict, Set, Union
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Remove unused imports from typing module

The imports Any, Dict, and Union from the typing module are not used in the code. Removing them will clean up the import statements.

Apply this diff to remove the unused imports:

-from typing import Optional, Any, List, Dict, Set, Union
+from typing import Optional, List, Set
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from typing import Optional, Any, List, Dict, Set, Union
from typing import Optional, List, Set
🧰 Tools
🪛 Ruff

5-5: typing.Any imported but unused

Remove unused import

(F401)


5-5: typing.Dict imported but unused

Remove unused import

(F401)


5-5: typing.Union imported but unused

Remove unused import

(F401)

Comment on lines +17 to +24
def validate_sig_input(ctx, param, value: tuple) -> str:
supported_extensions = ['.zip', '.sig']
for path in value:
if not os.path.exists(path):
raise click.BadParameter(f"File not found: {path}")
if not any(path.lower().endswith(ext) for ext in supported_extensions):
raise click.BadParameter(f"Unsupported file format: {path}, supported formats: {', '.join(supported_extensions)}")
return value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Correct return type annotation in validate_sig_input function

The function validate_sig_input processes a tuple of paths and returns the same tuple. However, the return type is annotated as -> str, which should be corrected to -> tuple to accurately reflect the returned value.

Apply this diff to fix the function signature:

-def validate_sig_input(ctx, param, value: tuple) -> str:
+def validate_sig_input(ctx, param, value: tuple) -> tuple:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def validate_sig_input(ctx, param, value: tuple) -> str:
supported_extensions = ['.zip', '.sig']
for path in value:
if not os.path.exists(path):
raise click.BadParameter(f"File not found: {path}")
if not any(path.lower().endswith(ext) for ext in supported_extensions):
raise click.BadParameter(f"Unsupported file format: {path}, supported formats: {', '.join(supported_extensions)}")
return value
def validate_sig_input(ctx, param, value: tuple) -> tuple:
supported_extensions = ['.zip', '.sig']
for path in value:
if not os.path.exists(path):
raise click.BadParameter(f"File not found: {path}")
if not any(path.lower().endswith(ext) for ext in supported_extensions):
raise click.BadParameter(f"Unsupported file format: {path}, supported formats: {', '.join(supported_extensions)}")
return value

Comment on lines +413 to +422
sample_sig = SnipeSig(sourmash_sig=sample_path, sig_type=SigType.SAMPLE, enable_logging=debug)
try:
sample_stats = qc_instance.process_sample(sample_sig=sample_sig,
predict_extra_folds = predict_extra_folds if roi else None,
advanced=advanced)
sample_to_stats[sample_sig.name] = sample_stats
except Exception as e:
failed_samples.append(sample_sig.name)
qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}")
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle exceptions when processing samples appropriately

When a sample fails to process, its name is added to failed_samples, but later in the code, there may be a mismatch when reporting failed samples. Ensure that failed_samples contains the correct information for reporting, and consider logging the exception traceback for better debugging.

Consider modifying the exception handling block:

     except Exception as e:
         failed_samples.append(sample_sig.name)
-        qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}")
+        qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}", exc_info=True)
         continue

This change adds the exc_info=True parameter to log the full traceback of the exception.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sample_sig = SnipeSig(sourmash_sig=sample_path, sig_type=SigType.SAMPLE, enable_logging=debug)
try:
sample_stats = qc_instance.process_sample(sample_sig=sample_sig,
predict_extra_folds = predict_extra_folds if roi else None,
advanced=advanced)
sample_to_stats[sample_sig.name] = sample_stats
except Exception as e:
failed_samples.append(sample_sig.name)
qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}")
continue
sample_sig = SnipeSig(sourmash_sig=sample_path, sig_type=SigType.SAMPLE, enable_logging=debug)
try:
sample_stats = qc_instance.process_sample(sample_sig=sample_sig,
predict_extra_folds = predict_extra_folds if roi else None,
advanced=advanced)
sample_to_stats[sample_sig.name] = sample_stats
except Exception as e:
failed_samples.append(sample_sig.name)
qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}", exc_info=True)
continue

@@ -137,18 +32,17 @@

@click.command()
@click.option('--ref', type=click.Path(exists=True), required=True, help='Reference genome signature file (required).')
@click.option('--sample', type=click.Path(exists=True), multiple=True, help='Sample signature file. Can be provided multiple times.')
@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Simplify @click.option for --sample by adjusting parameters

Using both type=click.Path(exists=True) and callback=validate_sig_input may lead to redundant path existence checks and potential conflicts. Since the validate_sig_input function handles validation, you can remove the exists=True parameter. Additionally, when multiple=True, the default value should be an empty tuple () instead of None.

Apply this diff to adjust the option configuration:

-@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.')
+@click.option('--sample', type=click.Path(), callback=validate_sig_input, multiple=True, default=(), help='Sample signature file. Can be provided multiple times.')
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.')
@click.option('--sample', type=click.Path(), callback=validate_sig_input, multiple=True, default=(), help='Sample signature file. Can be provided multiple times.')

Comment on lines +380 to +383
if sample_sig._type != SigType.SAMPLE:
self.logger.error("Invalid signature type for sample_sig: %s | %s", sample_sig.sigtype, sample_sig._type)
raise ValueError(f"sample_sig must be of type {SigType.SAMPLE}, got {sample_sig.sigtype}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use sigtype instead of _type to check sample_sig type

In the condition if sample_sig._type != SigType.SAMPLE:, you're accessing the protected member _type. It's better to use the public attribute sigtype, as done elsewhere in the code, for consistency and to avoid potential issues with access to protected members.

Apply this diff to fix the issue:

-if sample_sig._type != SigType.SAMPLE:
+if sample_sig.sigtype != SigType.SAMPLE:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if sample_sig._type != SigType.SAMPLE:
self.logger.error("Invalid signature type for sample_sig: %s | %s", sample_sig.sigtype, sample_sig._type)
raise ValueError(f"sample_sig must be of type {SigType.SAMPLE}, got {sample_sig.sigtype}")
if sample_sig.sigtype != SigType.SAMPLE:
self.logger.error("Invalid signature type for sample_sig: %s | %s", sample_sig.sigtype, sample_sig._type)
raise ValueError(f"sample_sig must be of type {SigType.SAMPLE}, got {sample_sig.sigtype}")

Comment on lines +292 to +295
if ychr and not isinstance(ychr, SnipeSig):
self.logger.error("Invalid signature type for ychr: %s", ychr.sigtype)
raise ValueError(f"ychr must be of type {SigType.SAMPLE}, got {ychr.sigtype}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix AttributeError when logging invalid ychr parameter

In the condition if ychr and not isinstance(ychr, SnipeSig):, accessing ychr.sigtype will raise an AttributeError if ychr is not an instance of SnipeSig, because ychr won't have a sigtype attribute. Adjust the error logging to avoid accessing ychr.sigtype when ychr is not a SnipeSig instance.

Apply this diff to fix the issue:

if ychr and not isinstance(ychr, SnipeSig):
-    self.logger.error("Invalid signature type for ychr: %s", ychr.sigtype)
-    raise ValueError(f"ychr must be of type {SigType.SAMPLE}, got {ychr.sigtype}")
+    self.logger.error("Invalid type for ychr: %s", type(ychr))
+    raise ValueError(f"ychr must be an instance of SnipeSig with sigtype {SigType.SAMPLE}, got {type(ychr)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if ychr and not isinstance(ychr, SnipeSig):
self.logger.error("Invalid signature type for ychr: %s", ychr.sigtype)
raise ValueError(f"ychr must be of type {SigType.SAMPLE}, got {ychr.sigtype}")
if ychr and not isinstance(ychr, SnipeSig):
self.logger.error("Invalid type for ychr: %s", type(ychr))
raise ValueError(f"ychr must be an instance of SnipeSig with sigtype {SigType.SAMPLE}, got {type(ychr)}")

@@ -117,18 +117,19 @@ def __init__(self, *, sourmash_sig: Union[str, sourmash.signature.SourmashSignat
self.logger.debug(f"Iterating over signature: {signame}")
if signame.endswith("-snipegenome"):
sig = sig.to_mutable()
# self.chr_to_sig[sig.name] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME, enable_logging=enable_logging)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Remove commented-out code to improve readability.

The lines at 120, 127, and 132 contain code that's been commented out. If this code is no longer needed, consider removing it to enhance code clarity and maintainability.

Apply this diff to remove the commented-out code:

- # self.chr_to_sig[sig.name] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME, enable_logging=enable_logging)
- # sig.name = signame.replace("sex-","")
- # sig.name = signame.replace("autosome-","")

Also applies to: 127-127, 132-132

from snipe.api.enums import SigType
from typing import Any, Dict, Iterator, List, Optional, Union
from typing import Dict, Iterator, List, Union, Optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unused import Optional.

The Optional class from the typing module is imported but not used anywhere in the code. Removing it will clean up the import statements.

Apply this diff to remove the unused import:

-from typing import Dict, Iterator, List, Union, Optional
+from typing import Dict, Iterator, List, Union
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from typing import Dict, Iterator, List, Union, Optional
from typing import Dict, Iterator, List, Union
🧰 Tools
🪛 Ruff

5-5: typing.Optional imported but unused

Remove unused import: typing.Optional

(F401)

Comment on lines +530 to 531
self.logger.error("Failed to export signatures to zip: %s", e)
raise Exception(f"Failed to export signatures to zip: {e}") from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Raise a more specific exception instead of a generic Exception.

Raising a generic Exception can make error handling less precise. Consider raising a more specific exception, such as IOError or a custom exception class, to provide clearer context and facilitate better error management.

Apply this diff to raise a more specific exception:

- raise Exception(f"Failed to export signatures to zip: {e}") from e
+ raise IOError(f"Failed to export signatures to zip: {e}") from e

Alternatively, define and use a custom exception that better reflects the nature of the error.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self.logger.error("Failed to export signatures to zip: %s", e)
raise Exception(f"Failed to export signatures to zip: {e}") from e
self.logger.error("Failed to export signatures to zip: %s", e)
raise IOError(f"Failed to export signatures to zip: {e}") from e

@mr-eyes mr-eyes merged commit 1e11f0f into main Oct 16, 2024
4 of 5 checks passed
@mr-eyes mr-eyes deleted the enhance_vars branch October 16, 2024 19:02
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (1)
src/snipe/cli/cli_qc_parallel.py (1)

153-155: Enhance the docstring for the parallel_qc function.

The current docstring for the parallel_qc function is quite brief and doesn't fully explain the parallelization aspect or the function's capabilities.

Consider expanding the docstring to provide more details about the function's behavior, parallelization strategy, and memory usage. Here's a suggested enhancement:

def parallel_qc(ref: str, sample: List[str], samples_from_file: Optional[str],
       amplicon: Optional[str], roi: bool, cores: int, advanced: bool, 
       ychr: Optional[str], debug: bool, output: str, vars: List[str]):
    """
    Perform parallelized quality control (QC) on genomic samples.

    This function implements a parallel version of the `qc` command, processing multiple samples
    concurrently to improve performance. It uses multiprocessing to distribute the workload
    across multiple CPU cores.

    Note: This implementation is not optimized for memory usage and may consume significant
    resources when processing large numbers of samples or using large variable signature files.

    Args:
        ref (str): Path to the reference genome signature file.
        sample (List[str]): List of paths to sample signature files.
        samples_from_file (Optional[str]): Path to a file containing sample paths (one per line).
        amplicon (Optional[str]): Path to the amplicon signature file.
        roi (bool): Flag to calculate ROI for 1, 2, 5, and 9 folds.
        cores (int): Number of CPU cores to use for parallel processing.
        advanced (bool): Flag to include advanced QC metrics.
        ychr (Optional[str]): Path to the Y chromosome signature file.
        debug (bool): Flag to enable debugging and detailed logging.
        output (str): Path to the output TSV file for QC results.
        vars (List[str]): List of paths to variable signature files.

    The function processes samples in parallel, aggregates results, and writes them to the
    specified output TSV file. It includes error handling and logging for each sample processed.
    """

This enhanced docstring provides more context about the function's purpose, its parallelization strategy, and potential memory usage considerations. It also details the arguments and their purposes, which will be helpful for users and maintainers of the code.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e8e9d26 and d884134.

📒 Files selected for processing (2)
  • .github/workflows/CI.yaml (1 hunks)
  • src/snipe/cli/cli_qc_parallel.py (1 hunks)
🧰 Additional context used
🪛 Ruff
src/snipe/cli/cli_qc_parallel.py

5-5: typing.Union imported but unused

Remove unused import: typing.Union

(F401)


13-13: snipe.api.sketch.SnipeSketch imported but unused

Remove unused import: snipe.api.sketch.SnipeSketch

(F401)

🔇 Additional comments (6)
.github/workflows/CI.yaml (1)

21-21: Verify compatibility with Python 3.13 and consider potential impacts.

Adding Python 3.13 to the test matrix is a forward-looking change that ensures the project will be tested against the latest Python version. However, there are a few points to consider:

  1. As of October 2024, Python 3.13 is not yet released. This change anticipates its release, which is good for future-proofing but may cause immediate issues.
  2. Adding another Python version will increase the CI pipeline execution time.
  3. The project's dependencies may not yet support Python 3.13, which could lead to test failures.

To ensure this change doesn't cause immediate issues, please run the following checks:

Consider the following suggestions:

  1. If Python 3.13 is not yet available, you might want to add it in a separate PR closer to its release date.
  2. Update the project's pyproject.toml or setup.py to specify compatible Python versions.
  3. Consider using a conditional job that only runs Python 3.13 tests on scheduled runs or specific branches to avoid potential failures in regular PRs.
src/snipe/cli/cli_qc_parallel.py (5)

20-27: LGTM: Input validation function is well-implemented.

The validate_sig_input function correctly checks for file existence and supported extensions. It raises appropriate exceptions for invalid inputs, which is good for error handling.


47-130: LGTM: Well-structured sample processing function with comprehensive error handling.

The process_sample function is well-implemented with detailed logging, proper error handling, and clear logic for processing different signature types and calculating metrics. The use of try-except blocks for ROI calculation is a good practice for handling potential errors.


132-135: LGTM: TSV file validation function is correctly implemented.

The validate_tsv_file function properly checks for the .tsv file extension and raises an appropriate exception for invalid extensions. This ensures that the output file will have the correct format.


150-363: LGTM: Well-implemented parallel QC function with comprehensive error handling and logging.

The parallel_qc function is well-structured and effectively implements parallel processing of samples for QC. It includes thorough error handling, detailed logging, and proper resource management. The use of tqdm for progress tracking is a nice touch for user feedback during long-running operations.

Some notable good practices:

  1. Proper signal handling for graceful shutdown.
  2. Clear separation of sample collection, processing, and result aggregation.
  3. Comprehensive error handling and reporting for failed samples.
  4. Use of concurrent.futures.ProcessPoolExecutor for parallel processing.

Overall, this implementation provides a robust and efficient way to perform QC on multiple samples in parallel.


1-363: LGTM: Well-implemented parallel QC functionality with robust error handling and efficient resource utilization.

This new file, cli_qc_parallel.py, successfully implements a parallel quality control (QC) functionality for the Snipe framework. The code demonstrates several strengths:

  1. Comprehensive input validation and error handling throughout the file.
  2. Effective use of multiprocessing for parallel sample processing, improving performance for large datasets.
  3. Detailed logging that aids in debugging and monitoring the QC process.
  4. Proper resource management, including graceful shutdown handling.
  5. Clear separation of concerns with well-structured functions for different aspects of the QC process.

The suggested improvements (updating docstrings, removing unused imports, and replacing os._exit with sys.exit in specific cases) are minor and do not detract from the overall quality of the implementation.

This parallel QC implementation should significantly improve the performance of the Snipe framework when processing multiple samples, while maintaining robustness and providing clear output for users.

🧰 Tools
🪛 Ruff

5-5: typing.Union imported but unused

Remove unused import: typing.Union

(F401)


13-13: snipe.api.sketch.SnipeSketch imported but unused

Remove unused import: snipe.api.sketch.SnipeSketch

(F401)

Comment on lines +28 to +46
def process_sample(sample_path: str, reference_sig: SnipeSig, amplicon_sig: Optional[SnipeSig],
advanced: bool, roi: bool, debug: bool,
ychr_sig: Optional[SnipeSig] = None,
var_sigs: Optional[List[SnipeSig]] = None) -> Dict[str, Any]:
"""
Process a single sample for QC.

Parameters:
- sample_path (str): Path to the sample signature file.
- ref_path (str): Path to the reference signature file.
- amplicon_path (Optional[str]): Path to the amplicon signature file.
- advanced (bool): Flag to include advanced metrics.
- roi (bool): Flag to calculate ROI.
- debug (bool): Flag to enable debugging.
- vars_paths (Optional[Union[List[str], List[SnipeSig]]]): List of paths to variable signature files or SnipeSig objects.

Returns:
- Dict[str, Any]: QC results for the sample.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Update the docstring to match the current function signature.

The docstring for process_sample doesn't accurately reflect the current function parameters. This mismatch could lead to confusion for developers using or maintaining this function.

Please update the docstring to match the current function signature. Here's a suggested update:

def process_sample(sample_path: str, reference_sig: SnipeSig, amplicon_sig: Optional[SnipeSig],
                   advanced: bool, roi: bool, debug: bool,
                   ychr_sig: Optional[SnipeSig] = None,
                   var_sigs: Optional[List[SnipeSig]] = None) -> Dict[str, Any]:
    """
    Process a single sample for QC.

    Parameters:
    - sample_path (str): Path to the sample signature file.
    - reference_sig (SnipeSig): Reference genome signature object.
    - amplicon_sig (Optional[SnipeSig]): Amplicon signature object.
    - advanced (bool): Flag to include advanced metrics.
    - roi (bool): Flag to calculate ROI.
    - debug (bool): Flag to enable debugging.
    - ychr_sig (Optional[SnipeSig]): Y chromosome signature object.
    - var_sigs (Optional[List[SnipeSig]]): List of variable signature objects.

    Returns:
    - Dict[str, Any]: QC results for the sample.
    """

This update will ensure that the docstring accurately reflects the function's parameters and their types.

Comment on lines +306 to +310
logger.warning("KeyboardInterrupt received. Shutting down...")
os._exit(1) # Forcefully terminate the program
except Exception as e:
logger.error(f"An unexpected error occurred: {e}")
os._exit(1) # Forcefully terminate the program
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use sys.exit(1) for graceful termination.

In the error handling sections, os._exit(1) is used, which immediately terminates the program without calling cleanup handlers or flushing buffers. This can lead to resource leaks or incomplete logging.

Replace os._exit(1) with sys.exit(1) in these two instances:

except KeyboardInterrupt:
    logger.warning("KeyboardInterrupt received. Shutting down...")
-    os._exit(1)  # Forcefully terminate the program
+    sys.exit(1)  # Gracefully terminate the program
except Exception as e:
    logger.error(f"An unexpected error occurred: {e}")
-    os._exit(1)  # Forcefully terminate the program
+    sys.exit(1)  # Gracefully terminate the program

This change will allow for a more graceful shutdown, ensuring that all resources are properly released and logs are flushed.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
logger.warning("KeyboardInterrupt received. Shutting down...")
os._exit(1) # Forcefully terminate the program
except Exception as e:
logger.error(f"An unexpected error occurred: {e}")
os._exit(1) # Forcefully terminate the program
logger.warning("KeyboardInterrupt received. Shutting down...")
sys.exit(1) # Gracefully terminate the program
except Exception as e:
logger.error(f"An unexpected error occurred: {e}")
sys.exit(1) # Gracefully terminate the program

Comment on lines +1 to +18
import os
import sys
import time
import logging
from typing import Optional, Any, List, Dict, Set, Union

import click
import pandas as pd
from tqdm import tqdm
import concurrent.futures

from snipe.api.enums import SigType
from snipe.api.sketch import SnipeSketch
from snipe.api.snipe_sig import SnipeSig
from snipe.api.reference_QC import ReferenceQC

import signal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unused imports to clean up the code.

There are two unused imports that should be removed to improve code cleanliness:

Apply this diff to remove the unused imports:

-from typing import Optional, Any, List, Dict, Set, Union
+from typing import Optional, Any, List, Dict, Set
-from snipe.api.sketch import SnipeSketch

This change will remove the unused Union type and SnipeSketch class, which are not used in the current implementation.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import os
import sys
import time
import logging
from typing import Optional, Any, List, Dict, Set, Union
import click
import pandas as pd
from tqdm import tqdm
import concurrent.futures
from snipe.api.enums import SigType
from snipe.api.sketch import SnipeSketch
from snipe.api.snipe_sig import SnipeSig
from snipe.api.reference_QC import ReferenceQC
import signal
import os
import sys
import time
import logging
from typing import Optional, Any, List, Dict, Set
import click
import pandas as pd
from tqdm import tqdm
import concurrent.futures
from snipe.api.enums import SigType
from snipe.api.snipe_sig import SnipeSig
from snipe.api.reference_QC import ReferenceQC
import signal
🧰 Tools
🪛 Ruff

5-5: typing.Union imported but unused

Remove unused import: typing.Union

(F401)


13-13: snipe.api.sketch.SnipeSketch imported but unused

Remove unused import: snipe.api.sketch.SnipeSketch

(F401)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant