-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separating qc
from parallel qc
and enhancing processing
#15
Conversation
WalkthroughThe changes introduce enhancements to the quality control (QC) processes for genomic samples across multiple files. A new class, Changes
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 28
🧹 Outside diff range and nitpick comments (5)
src/snipe/cli/main.py (2)
35-36
: LGTM: CLI group modifications are correct. Consider enhancing the docstring.The new
parallel-qc
command is correctly added to the CLI group, and the docstring has been updated to include information about it. The implementation follows the existing pattern and is consistent with other commands.Consider enhancing the docstring to provide more details about the trade-offs between
qc
andparallel-qc
. For example:Commands: - `sketch`: Perform sketching operations on genomic data. - `qc`: Execute quality control (QC) on multiple samples against a reference genome. - - `parallel-qc`: Parallelized version of the `qc` command (not optimized for memory). + - `parallel-qc`: Parallelized version of the `qc` command. Offers improved speed for multiple samples at the cost of increased memory usage. - `ops`: Perform various operations on sketches.This change would provide users with more context about when to choose between
qc
andparallel-qc
.Also applies to: 42-42
Issue Detected:
parallel_qc
Function Not Found incli_qc_parallel.py
The
parallel_qc
function does not appear to be defined in thesnipe.cli.cli_qc_parallel
module. This missing definition may lead to errors when attempting to use theparallel-qc
command.
- File Affected:
src/snipe/cli/cli_qc_parallel.py
- Missing Function:
parallel_qc
Please address the missing implementation of the
parallel_qc
function to ensure the CLI command is fully operational.🔗 Analysis chain
Line range hint
20-42
: Overall, the changes enhance the CLI with parallel QC capability.The addition of the
parallel-qc
command expands the tool's functionality, allowing users to choose between regular and parallel QC processing based on their needs and available resources. The changes are minimal and non-disruptive to the existing code structure.To ensure the new functionality works as expected, please verify the following:
- The
parallel_qc
function is correctly implemented in thesnipe.cli.cli_qc_parallel
module.- The parallelization provides a significant performance improvement for multiple samples.
- The memory usage trade-off is acceptable for the intended use cases.
You can use the following script to check the implementation of the
parallel_qc
function:Additionally, consider adding performance benchmarks comparing
qc
andparallel-qc
for various sample sizes to document the trade-offs.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of the parallel_qc function # Test: Check if the parallel_qc function is defined in the correct module ast-grep --lang python --pattern $'def parallel_qc(): $$$' src/snipe/cli/cli_qc_parallel.py # Test: Check for parallelization implementation (e.g., use of concurrent.futures) rg --type python 'concurrent\.futures' src/snipe/cli/cli_qc_parallel.pyLength of output: 199
Script:
#!/bin/bash # Description: Verify the implementation of the parallel_qc function # Test: Check if the parallel_qc function is defined in the correct module ast-grep --lang python --pattern $'def parallel_qc(): $$$' src/snipe/cli/cli_qc_parallel.py # Test: Check for parallelization implementation (e.g., use of concurrent.futures) rg --type py 'concurrent\.futures' src/snipe/cli/cli_qc_parallel.pyLength of output: 400
src/snipe/cli/cli_qc.py (1)
Line range hint
441-454
: Fix reporting of failed samplesIn the section where failed samples are reported, there is a reassignment of
failed_samples
which might cause confusion or errors. Sincefailed_samples
is already a list of sample names, reassigning it with[res['sample'] for res in failed]
wherefailed
is an integer (the count of failed samples) will cause an error.Apply this diff to correct the reporting of failed samples:
# Report failed samples if any if failed: - failed_samples = [res['sample'] for res in failed] logger.warning(f"The following {len(failed_samples)} sample(s) failed during QC processing:") for sample in failed_samples: logger.warning(f"- {sample}")
Remove the line reassigning
failed_samples
to prevent errors.src/snipe/api/reference_QC.py (2)
Line range hint
321-429
: Duplicate definition of_calculate_advanced_stats
methodThe method
_calculate_advanced_stats
is defined twice in theReferenceQC
class (lines 321-375 and 377-429). This will cause the first definition to be overwritten by the second, potentially leading to unexpected behavior or loss of functionality. Please consolidate the two method definitions into a single method, ensuring that all necessary functionality is preserved.
Line range hint
673-700
: Update method documentation to reflect the use of genomic k-mers in splittingThe
split_sig_randomly
method has been modified to split the intersection ofsample_sig
andreference_sig
, meaning it now only considers genomic k-mers. However, the method's docstring still indicates that it splits the entire sample signature. Please update the docstring to accurately describe the new behavior so that users understand that only genomic k-mers are considered in the random split.Apply this diff to update the method's docstring:
def split_sig_randomly(self, n: int) -> List[SnipeSig]: r""" - Split the sample signature into `n` random parts based on abundances. + Split the genomic portion of the sample signature (k-mers shared with the reference signature) into `n` random parts based on abundances. - This method distributes the k-mers of the sample signature into `n` parts using a multinomial distribution + This method distributes the genomic k-mers (shared between the sample and reference signatures) into `n` parts using a multinomial distribution based on their abundances. Each k-mer's abundance is split across the `n` parts proportionally.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
- src/snipe/api/multisig_reference_QC.py (1 hunks)
- src/snipe/api/reference_QC.py (1 hunks)
- src/snipe/api/snipe_sig.py (8 hunks)
- src/snipe/cli/cli_qc.py (7 hunks)
- src/snipe/cli/cli_qc_parallel.py (1 hunks)
- src/snipe/cli/main.py (2 hunks)
🧰 Additional context used
🪛 Ruff
src/snipe/api/multisig_reference_QC.py
1-1:
heapq
imported but unusedRemove unused import:
heapq
(F401)
4-4:
typing.Iterator
imported but unusedRemove unused import
(F401)
4-4:
typing.Union
imported but unusedRemove unused import
(F401)
10-10:
os
imported but unusedRemove unused import:
os
(F401)
11-11:
requests
imported but unusedRemove unused import:
requests
(F401)
12-12:
tqdm.tqdm
imported but unusedRemove unused import:
tqdm.tqdm
(F401)
13-13:
urllib.parse.urlparse
imported but unusedRemove unused import:
urllib.parse.urlparse
(F401)
14-14: Redefinition of unused
Optional
from line 4Remove definition:
Optional
(F811)
15-15:
sourmash
imported but unusedRemove unused import:
sourmash
(F401)
17-17:
concurrent
imported but unusedRemove unused import:
concurrent
(F401)
src/snipe/api/snipe_sig.py
5-5:
typing.Optional
imported but unusedRemove unused import:
typing.Optional
(F401)
src/snipe/cli/cli_qc.py
5-5:
typing.Any
imported but unusedRemove unused import
(F401)
5-5:
typing.Dict
imported but unusedRemove unused import
(F401)
5-5:
typing.Union
imported but unusedRemove unused import
(F401)
10-10:
concurrent.futures
imported but unusedRemove unused import:
concurrent.futures
(F401)
src/snipe/cli/cli_qc_parallel.py
5-5:
typing.Union
imported but unusedRemove unused import:
typing.Union
(F401)
13-13:
snipe.api.sketch.SnipeSketch
imported but unusedRemove unused import:
snipe.api.sketch.SnipeSketch
(F401)
🔇 Additional comments (2)
src/snipe/cli/main.py (1)
20-20
: LGTM: Import statement for parallel_qc is correctly added.The import statement for
parallel_qc
is properly placed and follows the existing import style in the file. It's consistent with the project structure and other CLI-related imports.src/snipe/cli/cli_qc.py (1)
403-408
: Ensure correct parameter names when instantiatingMultiSigReferenceQC
Verify that the parameters passed to
MultiSigReferenceQC
match its constructor's expected arguments. The parameterychr
should likely beychr_sig
orychr_signature
for clarity and consistency.If the constructor expects
ychr_signature
, apply this diff to correct the parameter name:qc_instance = MultiSigReferenceQC( reference_sig=reference_sig, amplicon_sig=amplicon_sig, - ychr=ychr_sig if ychr_sig else None, + ychr_sig=ychr_sig if ychr_sig else None, varsigs=vars_snipesigs if vars_snipesigs else None, enable_logging=debug )Alternatively, confirm that
ychr
is the correct parameter name in theMultiSigReferenceQC
constructor.
- `-o qc_comprehensive.tsv`: Output file for QC results. | ||
|
||
**Expected Output:** | ||
|
||
A TSV file named `qc_comprehensive.tsv` containing comprehensive QC metrics, including advanced analyses, ROI predictions, and data from amplicon and variable signatures for both `sample1.sig` and `sample2.sig`. | ||
""" | ||
|
||
print(sample) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unnecessary debug print(sample)
statement
The print(sample)
statement appears to be a leftover from debugging. It should be removed to prevent unintended output in the command-line interface.
Apply this diff to remove the debug statement:
-print(sample)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
print(sample) |
@@ -2,133 +2,28 @@ | |||
import sys | |||
import time | |||
import logging | |||
from typing import Optional, Any, List, Dict, Set | |||
from typing import Optional, Any, List, Dict, Set, Union |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Remove unused imports from typing
module
The imports Any
, Dict
, and Union
from the typing
module are not used in the code. Removing them will clean up the import statements.
Apply this diff to remove the unused imports:
-from typing import Optional, Any, List, Dict, Set, Union
+from typing import Optional, List, Set
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from typing import Optional, Any, List, Dict, Set, Union | |
from typing import Optional, List, Set |
🧰 Tools
🪛 Ruff
5-5:
typing.Any
imported but unusedRemove unused import
(F401)
5-5:
typing.Dict
imported but unusedRemove unused import
(F401)
5-5:
typing.Union
imported but unusedRemove unused import
(F401)
def validate_sig_input(ctx, param, value: tuple) -> str: | ||
supported_extensions = ['.zip', '.sig'] | ||
for path in value: | ||
if not os.path.exists(path): | ||
raise click.BadParameter(f"File not found: {path}") | ||
if not any(path.lower().endswith(ext) for ext in supported_extensions): | ||
raise click.BadParameter(f"Unsupported file format: {path}, supported formats: {', '.join(supported_extensions)}") | ||
return value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct return type annotation in validate_sig_input
function
The function validate_sig_input
processes a tuple of paths and returns the same tuple. However, the return type is annotated as -> str
, which should be corrected to -> tuple
to accurately reflect the returned value.
Apply this diff to fix the function signature:
-def validate_sig_input(ctx, param, value: tuple) -> str:
+def validate_sig_input(ctx, param, value: tuple) -> tuple:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def validate_sig_input(ctx, param, value: tuple) -> str: | |
supported_extensions = ['.zip', '.sig'] | |
for path in value: | |
if not os.path.exists(path): | |
raise click.BadParameter(f"File not found: {path}") | |
if not any(path.lower().endswith(ext) for ext in supported_extensions): | |
raise click.BadParameter(f"Unsupported file format: {path}, supported formats: {', '.join(supported_extensions)}") | |
return value | |
def validate_sig_input(ctx, param, value: tuple) -> tuple: | |
supported_extensions = ['.zip', '.sig'] | |
for path in value: | |
if not os.path.exists(path): | |
raise click.BadParameter(f"File not found: {path}") | |
if not any(path.lower().endswith(ext) for ext in supported_extensions): | |
raise click.BadParameter(f"Unsupported file format: {path}, supported formats: {', '.join(supported_extensions)}") | |
return value |
sample_sig = SnipeSig(sourmash_sig=sample_path, sig_type=SigType.SAMPLE, enable_logging=debug) | ||
try: | ||
sample_stats = qc_instance.process_sample(sample_sig=sample_sig, | ||
predict_extra_folds = predict_extra_folds if roi else None, | ||
advanced=advanced) | ||
sample_to_stats[sample_sig.name] = sample_stats | ||
except Exception as e: | ||
failed_samples.append(sample_sig.name) | ||
qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}") | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle exceptions when processing samples appropriately
When a sample fails to process, its name is added to failed_samples
, but later in the code, there may be a mismatch when reporting failed samples. Ensure that failed_samples
contains the correct information for reporting, and consider logging the exception traceback for better debugging.
Consider modifying the exception handling block:
except Exception as e:
failed_samples.append(sample_sig.name)
- qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}")
+ qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}", exc_info=True)
continue
This change adds the exc_info=True
parameter to log the full traceback of the exception.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
sample_sig = SnipeSig(sourmash_sig=sample_path, sig_type=SigType.SAMPLE, enable_logging=debug) | |
try: | |
sample_stats = qc_instance.process_sample(sample_sig=sample_sig, | |
predict_extra_folds = predict_extra_folds if roi else None, | |
advanced=advanced) | |
sample_to_stats[sample_sig.name] = sample_stats | |
except Exception as e: | |
failed_samples.append(sample_sig.name) | |
qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}") | |
continue | |
sample_sig = SnipeSig(sourmash_sig=sample_path, sig_type=SigType.SAMPLE, enable_logging=debug) | |
try: | |
sample_stats = qc_instance.process_sample(sample_sig=sample_sig, | |
predict_extra_folds = predict_extra_folds if roi else None, | |
advanced=advanced) | |
sample_to_stats[sample_sig.name] = sample_stats | |
except Exception as e: | |
failed_samples.append(sample_sig.name) | |
qc_instance.logger.error(f"Failed to process sample {sample_sig.name}: {e}", exc_info=True) | |
continue |
@@ -137,18 +32,17 @@ | |||
|
|||
@click.command() | |||
@click.option('--ref', type=click.Path(exists=True), required=True, help='Reference genome signature file (required).') | |||
@click.option('--sample', type=click.Path(exists=True), multiple=True, help='Sample signature file. Can be provided multiple times.') | |||
@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplify @click.option
for --sample
by adjusting parameters
Using both type=click.Path(exists=True)
and callback=validate_sig_input
may lead to redundant path existence checks and potential conflicts. Since the validate_sig_input
function handles validation, you can remove the exists=True
parameter. Additionally, when multiple=True
, the default value should be an empty tuple ()
instead of None
.
Apply this diff to adjust the option configuration:
-@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.')
+@click.option('--sample', type=click.Path(), callback=validate_sig_input, multiple=True, default=(), help='Sample signature file. Can be provided multiple times.')
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@click.option('--sample', type=click.Path(exists=True), callback=validate_sig_input, multiple=True, default = None, help='Sample signature file. Can be provided multiple times.') | |
@click.option('--sample', type=click.Path(), callback=validate_sig_input, multiple=True, default=(), help='Sample signature file. Can be provided multiple times.') |
if sample_sig._type != SigType.SAMPLE: | ||
self.logger.error("Invalid signature type for sample_sig: %s | %s", sample_sig.sigtype, sample_sig._type) | ||
raise ValueError(f"sample_sig must be of type {SigType.SAMPLE}, got {sample_sig.sigtype}") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use sigtype
instead of _type
to check sample_sig
type
In the condition if sample_sig._type != SigType.SAMPLE:
, you're accessing the protected member _type
. It's better to use the public attribute sigtype
, as done elsewhere in the code, for consistency and to avoid potential issues with access to protected members.
Apply this diff to fix the issue:
-if sample_sig._type != SigType.SAMPLE:
+if sample_sig.sigtype != SigType.SAMPLE:
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if sample_sig._type != SigType.SAMPLE: | |
self.logger.error("Invalid signature type for sample_sig: %s | %s", sample_sig.sigtype, sample_sig._type) | |
raise ValueError(f"sample_sig must be of type {SigType.SAMPLE}, got {sample_sig.sigtype}") | |
if sample_sig.sigtype != SigType.SAMPLE: | |
self.logger.error("Invalid signature type for sample_sig: %s | %s", sample_sig.sigtype, sample_sig._type) | |
raise ValueError(f"sample_sig must be of type {SigType.SAMPLE}, got {sample_sig.sigtype}") |
if ychr and not isinstance(ychr, SnipeSig): | ||
self.logger.error("Invalid signature type for ychr: %s", ychr.sigtype) | ||
raise ValueError(f"ychr must be of type {SigType.SAMPLE}, got {ychr.sigtype}") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix AttributeError
when logging invalid ychr
parameter
In the condition if ychr and not isinstance(ychr, SnipeSig):
, accessing ychr.sigtype
will raise an AttributeError
if ychr
is not an instance of SnipeSig
, because ychr
won't have a sigtype
attribute. Adjust the error logging to avoid accessing ychr.sigtype
when ychr
is not a SnipeSig
instance.
Apply this diff to fix the issue:
if ychr and not isinstance(ychr, SnipeSig):
- self.logger.error("Invalid signature type for ychr: %s", ychr.sigtype)
- raise ValueError(f"ychr must be of type {SigType.SAMPLE}, got {ychr.sigtype}")
+ self.logger.error("Invalid type for ychr: %s", type(ychr))
+ raise ValueError(f"ychr must be an instance of SnipeSig with sigtype {SigType.SAMPLE}, got {type(ychr)}")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if ychr and not isinstance(ychr, SnipeSig): | |
self.logger.error("Invalid signature type for ychr: %s", ychr.sigtype) | |
raise ValueError(f"ychr must be of type {SigType.SAMPLE}, got {ychr.sigtype}") | |
if ychr and not isinstance(ychr, SnipeSig): | |
self.logger.error("Invalid type for ychr: %s", type(ychr)) | |
raise ValueError(f"ychr must be an instance of SnipeSig with sigtype {SigType.SAMPLE}, got {type(ychr)}") | |
src/snipe/api/snipe_sig.py
Outdated
@@ -117,18 +117,19 @@ def __init__(self, *, sourmash_sig: Union[str, sourmash.signature.SourmashSignat | |||
self.logger.debug(f"Iterating over signature: {signame}") | |||
if signame.endswith("-snipegenome"): | |||
sig = sig.to_mutable() | |||
# self.chr_to_sig[sig.name] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME, enable_logging=enable_logging) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Remove commented-out code to improve readability.
The lines at 120, 127, and 132 contain code that's been commented out. If this code is no longer needed, consider removing it to enhance code clarity and maintainability.
Apply this diff to remove the commented-out code:
- # self.chr_to_sig[sig.name] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME, enable_logging=enable_logging)
- # sig.name = signame.replace("sex-","")
- # sig.name = signame.replace("autosome-","")
Also applies to: 127-127, 132-132
src/snipe/api/snipe_sig.py
Outdated
from snipe.api.enums import SigType | ||
from typing import Any, Dict, Iterator, List, Optional, Union | ||
from typing import Dict, Iterator, List, Union, Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused import Optional
.
The Optional
class from the typing
module is imported but not used anywhere in the code. Removing it will clean up the import statements.
Apply this diff to remove the unused import:
-from typing import Dict, Iterator, List, Union, Optional
+from typing import Dict, Iterator, List, Union
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from typing import Dict, Iterator, List, Union, Optional | |
from typing import Dict, Iterator, List, Union |
🧰 Tools
🪛 Ruff
5-5:
typing.Optional
imported but unusedRemove unused import:
typing.Optional
(F401)
src/snipe/api/snipe_sig.py
Outdated
self.logger.error("Failed to export signatures to zip: %s", e) | ||
raise Exception(f"Failed to export signatures to zip: {e}") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise a more specific exception instead of a generic Exception
.
Raising a generic Exception
can make error handling less precise. Consider raising a more specific exception, such as IOError
or a custom exception class, to provide clearer context and facilitate better error management.
Apply this diff to raise a more specific exception:
- raise Exception(f"Failed to export signatures to zip: {e}") from e
+ raise IOError(f"Failed to export signatures to zip: {e}") from e
Alternatively, define and use a custom exception that better reflects the nature of the error.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
self.logger.error("Failed to export signatures to zip: %s", e) | |
raise Exception(f"Failed to export signatures to zip: {e}") from e | |
self.logger.error("Failed to export signatures to zip: %s", e) | |
raise IOError(f"Failed to export signatures to zip: {e}") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (1)
src/snipe/cli/cli_qc_parallel.py (1)
153-155
: Enhance the docstring for theparallel_qc
function.The current docstring for the
parallel_qc
function is quite brief and doesn't fully explain the parallelization aspect or the function's capabilities.Consider expanding the docstring to provide more details about the function's behavior, parallelization strategy, and memory usage. Here's a suggested enhancement:
def parallel_qc(ref: str, sample: List[str], samples_from_file: Optional[str], amplicon: Optional[str], roi: bool, cores: int, advanced: bool, ychr: Optional[str], debug: bool, output: str, vars: List[str]): """ Perform parallelized quality control (QC) on genomic samples. This function implements a parallel version of the `qc` command, processing multiple samples concurrently to improve performance. It uses multiprocessing to distribute the workload across multiple CPU cores. Note: This implementation is not optimized for memory usage and may consume significant resources when processing large numbers of samples or using large variable signature files. Args: ref (str): Path to the reference genome signature file. sample (List[str]): List of paths to sample signature files. samples_from_file (Optional[str]): Path to a file containing sample paths (one per line). amplicon (Optional[str]): Path to the amplicon signature file. roi (bool): Flag to calculate ROI for 1, 2, 5, and 9 folds. cores (int): Number of CPU cores to use for parallel processing. advanced (bool): Flag to include advanced QC metrics. ychr (Optional[str]): Path to the Y chromosome signature file. debug (bool): Flag to enable debugging and detailed logging. output (str): Path to the output TSV file for QC results. vars (List[str]): List of paths to variable signature files. The function processes samples in parallel, aggregates results, and writes them to the specified output TSV file. It includes error handling and logging for each sample processed. """This enhanced docstring provides more context about the function's purpose, its parallelization strategy, and potential memory usage considerations. It also details the arguments and their purposes, which will be helpful for users and maintainers of the code.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- .github/workflows/CI.yaml (1 hunks)
- src/snipe/cli/cli_qc_parallel.py (1 hunks)
🧰 Additional context used
🪛 Ruff
src/snipe/cli/cli_qc_parallel.py
5-5:
typing.Union
imported but unusedRemove unused import:
typing.Union
(F401)
13-13:
snipe.api.sketch.SnipeSketch
imported but unusedRemove unused import:
snipe.api.sketch.SnipeSketch
(F401)
🔇 Additional comments (6)
.github/workflows/CI.yaml (1)
21-21
: Verify compatibility with Python 3.13 and consider potential impacts.Adding Python 3.13 to the test matrix is a forward-looking change that ensures the project will be tested against the latest Python version. However, there are a few points to consider:
- As of October 2024, Python 3.13 is not yet released. This change anticipates its release, which is good for future-proofing but may cause immediate issues.
- Adding another Python version will increase the CI pipeline execution time.
- The project's dependencies may not yet support Python 3.13, which could lead to test failures.
To ensure this change doesn't cause immediate issues, please run the following checks:
Consider the following suggestions:
- If Python 3.13 is not yet available, you might want to add it in a separate PR closer to its release date.
- Update the project's
pyproject.toml
orsetup.py
to specify compatible Python versions.- Consider using a conditional job that only runs Python 3.13 tests on scheduled runs or specific branches to avoid potential failures in regular PRs.
src/snipe/cli/cli_qc_parallel.py (5)
20-27
: LGTM: Input validation function is well-implemented.The
validate_sig_input
function correctly checks for file existence and supported extensions. It raises appropriate exceptions for invalid inputs, which is good for error handling.
47-130
: LGTM: Well-structured sample processing function with comprehensive error handling.The
process_sample
function is well-implemented with detailed logging, proper error handling, and clear logic for processing different signature types and calculating metrics. The use of try-except blocks for ROI calculation is a good practice for handling potential errors.
132-135
: LGTM: TSV file validation function is correctly implemented.The
validate_tsv_file
function properly checks for the .tsv file extension and raises an appropriate exception for invalid extensions. This ensures that the output file will have the correct format.
150-363
: LGTM: Well-implemented parallel QC function with comprehensive error handling and logging.The
parallel_qc
function is well-structured and effectively implements parallel processing of samples for QC. It includes thorough error handling, detailed logging, and proper resource management. The use oftqdm
for progress tracking is a nice touch for user feedback during long-running operations.Some notable good practices:
- Proper signal handling for graceful shutdown.
- Clear separation of sample collection, processing, and result aggregation.
- Comprehensive error handling and reporting for failed samples.
- Use of
concurrent.futures.ProcessPoolExecutor
for parallel processing.Overall, this implementation provides a robust and efficient way to perform QC on multiple samples in parallel.
1-363
: LGTM: Well-implemented parallel QC functionality with robust error handling and efficient resource utilization.This new file,
cli_qc_parallel.py
, successfully implements a parallel quality control (QC) functionality for the Snipe framework. The code demonstrates several strengths:
- Comprehensive input validation and error handling throughout the file.
- Effective use of multiprocessing for parallel sample processing, improving performance for large datasets.
- Detailed logging that aids in debugging and monitoring the QC process.
- Proper resource management, including graceful shutdown handling.
- Clear separation of concerns with well-structured functions for different aspects of the QC process.
The suggested improvements (updating docstrings, removing unused imports, and replacing
os._exit
withsys.exit
in specific cases) are minor and do not detract from the overall quality of the implementation.This parallel QC implementation should significantly improve the performance of the Snipe framework when processing multiple samples, while maintaining robustness and providing clear output for users.
🧰 Tools
🪛 Ruff
5-5:
typing.Union
imported but unusedRemove unused import:
typing.Union
(F401)
13-13:
snipe.api.sketch.SnipeSketch
imported but unusedRemove unused import:
snipe.api.sketch.SnipeSketch
(F401)
def process_sample(sample_path: str, reference_sig: SnipeSig, amplicon_sig: Optional[SnipeSig], | ||
advanced: bool, roi: bool, debug: bool, | ||
ychr_sig: Optional[SnipeSig] = None, | ||
var_sigs: Optional[List[SnipeSig]] = None) -> Dict[str, Any]: | ||
""" | ||
Process a single sample for QC. | ||
|
||
Parameters: | ||
- sample_path (str): Path to the sample signature file. | ||
- ref_path (str): Path to the reference signature file. | ||
- amplicon_path (Optional[str]): Path to the amplicon signature file. | ||
- advanced (bool): Flag to include advanced metrics. | ||
- roi (bool): Flag to calculate ROI. | ||
- debug (bool): Flag to enable debugging. | ||
- vars_paths (Optional[Union[List[str], List[SnipeSig]]]): List of paths to variable signature files or SnipeSig objects. | ||
|
||
Returns: | ||
- Dict[str, Any]: QC results for the sample. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the docstring to match the current function signature.
The docstring for process_sample
doesn't accurately reflect the current function parameters. This mismatch could lead to confusion for developers using or maintaining this function.
Please update the docstring to match the current function signature. Here's a suggested update:
def process_sample(sample_path: str, reference_sig: SnipeSig, amplicon_sig: Optional[SnipeSig],
advanced: bool, roi: bool, debug: bool,
ychr_sig: Optional[SnipeSig] = None,
var_sigs: Optional[List[SnipeSig]] = None) -> Dict[str, Any]:
"""
Process a single sample for QC.
Parameters:
- sample_path (str): Path to the sample signature file.
- reference_sig (SnipeSig): Reference genome signature object.
- amplicon_sig (Optional[SnipeSig]): Amplicon signature object.
- advanced (bool): Flag to include advanced metrics.
- roi (bool): Flag to calculate ROI.
- debug (bool): Flag to enable debugging.
- ychr_sig (Optional[SnipeSig]): Y chromosome signature object.
- var_sigs (Optional[List[SnipeSig]]): List of variable signature objects.
Returns:
- Dict[str, Any]: QC results for the sample.
"""
This update will ensure that the docstring accurately reflects the function's parameters and their types.
logger.warning("KeyboardInterrupt received. Shutting down...") | ||
os._exit(1) # Forcefully terminate the program | ||
except Exception as e: | ||
logger.error(f"An unexpected error occurred: {e}") | ||
os._exit(1) # Forcefully terminate the program |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use sys.exit(1)
for graceful termination.
In the error handling sections, os._exit(1)
is used, which immediately terminates the program without calling cleanup handlers or flushing buffers. This can lead to resource leaks or incomplete logging.
Replace os._exit(1)
with sys.exit(1)
in these two instances:
except KeyboardInterrupt:
logger.warning("KeyboardInterrupt received. Shutting down...")
- os._exit(1) # Forcefully terminate the program
+ sys.exit(1) # Gracefully terminate the program
except Exception as e:
logger.error(f"An unexpected error occurred: {e}")
- os._exit(1) # Forcefully terminate the program
+ sys.exit(1) # Gracefully terminate the program
This change will allow for a more graceful shutdown, ensuring that all resources are properly released and logs are flushed.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
logger.warning("KeyboardInterrupt received. Shutting down...") | |
os._exit(1) # Forcefully terminate the program | |
except Exception as e: | |
logger.error(f"An unexpected error occurred: {e}") | |
os._exit(1) # Forcefully terminate the program | |
logger.warning("KeyboardInterrupt received. Shutting down...") | |
sys.exit(1) # Gracefully terminate the program | |
except Exception as e: | |
logger.error(f"An unexpected error occurred: {e}") | |
sys.exit(1) # Gracefully terminate the program |
import os | ||
import sys | ||
import time | ||
import logging | ||
from typing import Optional, Any, List, Dict, Set, Union | ||
|
||
import click | ||
import pandas as pd | ||
from tqdm import tqdm | ||
import concurrent.futures | ||
|
||
from snipe.api.enums import SigType | ||
from snipe.api.sketch import SnipeSketch | ||
from snipe.api.snipe_sig import SnipeSig | ||
from snipe.api.reference_QC import ReferenceQC | ||
|
||
import signal | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused imports to clean up the code.
There are two unused imports that should be removed to improve code cleanliness:
Apply this diff to remove the unused imports:
-from typing import Optional, Any, List, Dict, Set, Union
+from typing import Optional, Any, List, Dict, Set
-from snipe.api.sketch import SnipeSketch
This change will remove the unused Union
type and SnipeSketch
class, which are not used in the current implementation.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
import os | |
import sys | |
import time | |
import logging | |
from typing import Optional, Any, List, Dict, Set, Union | |
import click | |
import pandas as pd | |
from tqdm import tqdm | |
import concurrent.futures | |
from snipe.api.enums import SigType | |
from snipe.api.sketch import SnipeSketch | |
from snipe.api.snipe_sig import SnipeSig | |
from snipe.api.reference_QC import ReferenceQC | |
import signal | |
import os | |
import sys | |
import time | |
import logging | |
from typing import Optional, Any, List, Dict, Set | |
import click | |
import pandas as pd | |
from tqdm import tqdm | |
import concurrent.futures | |
from snipe.api.enums import SigType | |
from snipe.api.snipe_sig import SnipeSig | |
from snipe.api.reference_QC import ReferenceQC | |
import signal |
🧰 Tools
🪛 Ruff
5-5:
typing.Union
imported but unusedRemove unused import:
typing.Union
(F401)
13-13:
snipe.api.sketch.SnipeSketch
imported but unusedRemove unused import:
snipe.api.sketch.SnipeSketch
(F401)
snipe qc
is very efficient on a single core. It loads the reference signatures (ref,amplicon,variance_sigs) only once and then process all provided samples. On the other handsnipe parallel_qc
load all needed resources per sample, which can be fast in large-scale but also introduces memory overhead. This is a Python limitation and will be solved in the future when the Rust or C++ version is implemented.