-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing the snipe qc
command.
#11
Conversation
WalkthroughThe changes in this pull request involve updates across multiple files, primarily focusing on enhancing configurations, documentation, and functionality within the Snipe project. Key modifications include the addition of new dependencies in Changes
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 10
🧹 Outside diff range and nitpick comments (7)
docs/mkdocs.yml (1)
Line range hint
29-34
: Great additions to enhance documentation capabilities!The new markdown extensions (pymdownx.details, pymdownx.critic, pymdownx.caret, pymdownx.keys, pymdownx.mark, pymdownx.tilde) will provide more flexibility in formatting and styling the documentation. The mkdocstrings configuration to show function signatures is particularly useful for API documentation.
Consider adding a brief comment in the YAML file explaining the purpose of these new extensions to help future maintainers understand their inclusion.
Also applies to: 48-52
src/snipe/api/sketch.py (1)
Line range hint
71-86
: Approve docstring updates with a minor suggestionThe updates to the
process_sequences
method docstring accurately reflect the shift to FracMinHash sketches. The changes are consistent and appropriate.For improved clarity, consider updating the return type annotation in the method signature to match the docstring:
- ) -> sourmash.MinHash: + ) -> sourmash.FracMinHash:This change would make the type annotation consistent with the updated docstring.
tests/api/test_api.py (1)
Enhance Test Coverage for New SnipeSig Features
The
SnipeSig
class has been updated with several new properties and methods, includingtotal_abundance
,mean_abundance
,median_abundance
,get_sample_stats
,export
,export_to_string
,get_unique_signatures
,common_hashes
, andcopy
. Currently, the test suite does not include test cases for these new additions.To maintain comprehensive test coverage and ensure all functionalities are properly validated, please add corresponding test cases for each of the newly introduced properties and methods.
New Properties to Test:
total_abundance
mean_abundance
median_abundance
get_sample_stats
New Methods to Test:
export
export_to_string
get_unique_signatures
common_hashes
copy
🔗 Analysis chain
Line range hint
1-1024
: Review test coverage for new SnipeSig featuresThe test suite appears comprehensive, covering various aspects of the SnipeSig class. However, with the recent changes in the module structure, it's important to ensure that any new functionality added to the SnipeSig class is also adequately tested.
Please review the SnipeSig class implementation and check if there are any new methods or features that might require additional test coverage. You can use the following command to inspect the SnipeSig class:
If you find any new features that are not covered in the current test suite, consider adding appropriate test cases to maintain comprehensive test coverage.
Line range hint
1-1024
: Locate the SnipeSig class definitionThe previous attempt to display
snipe/api/snipe_sig.py
did not find theSnipeSig
class file. It's possible that the class has been moved or renamed. To ensure comprehensive test coverage, please locate the current file where theSnipeSig
class is defined. You can use the following command to search for the class definition:If the
ast-grep
command does not yield results, consider usingrg
(ripgrep) as an alternative to search for the class definition:Please execute the appropriate script to locate the
SnipeSig
class and update the test suite accordingly based on its current implementation and location.
Line range hint
1-1024
: View the Content of the SnipeSig ClassThe
SnipeSig
class is located atsrc/snipe/api/snipe_sig.py
. To ensure that the test suite adequately covers all functionalities, please review the contents of this class and verify if there are any new methods or features that lack corresponding test cases. You can use the following command to view the file:If you identify any new features or modifications, consider adding appropriate test cases to maintain comprehensive test coverage.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Display the content of the SnipeSig class file cat snipe/api/snipe_sig.pyLength of output: 83
Script:
#!/bin/bash # Search for the definition of the SnipeSig class ast-grep --lang python --pattern 'class SnipeSig { $$$ }'Length of output: 504
Script:
#!/bin/bash # Search for the definition of the SnipeSig class using ripgrep rg 'class SnipeSig'Length of output: 64
Script:
#!/bin/bash # Display the content of the SnipeSig class file cat src/snipe/api/snipe_sig.pyLength of output: 58847
src/snipe/cli/main.py (1)
70-74
: Rename variablesamples
for clarityThe variable
samples
holdssample
,ref
, andamplicon
, which are not all samples. Consider renaming it toinputs
orinput_files
to better reflect its content and enhance code readability.Apply this rename:
-def sketch(ctx, sample: Optional[str], ref: Optional[str], amplicon: Optional[str], ...): ... - samples = [sample, ref, amplicon] - provided = [s for s in samples if s] + inputs = [sample, ref, amplicon] + provided = [s for s in inputs if s] if len(provided) != 1: click.echo('Error: Exactly one of --sample, --ref, or --amplicon must be provided.') sys.exit(1)src/snipe/api/snipe_sig.py (2)
4-4
: Remove unused importsAny
andOptional
.The imports
Any
andOptional
from thetyping
module are not used and can be removed to clean up the code.Apply this diff to remove the unused imports:
- from typing import Any, Dict, Iterator, List, Optional, Union + from typing import Dict, Iterator, List, Union🧰 Tools
🪛 Ruff
4-4:
typing.Any
imported but unusedRemove unused import
(F401)
4-4:
typing.Optional
imported but unusedRemove unused import
(F401)
205-205
: Typo in the return type annotation.At line 205, there's a typographical error in the return type annotation in the docstring. The character
ß
should be replaced withB
inSourmashSignature
.Apply this diff to fix the typo:
- sourmash.signature.SourmashßSignature, list of sourmash.signature.SourmashSignature, or None if loading fails. + sourmash.signature.SourmashSignature, list of sourmash.signature.SourmashSignature, or None if loading fails.src/snipe/api/reference_QC.py (1)
386-389
: Remove commented-out code to improve readability.The code between lines 386-389 is commented out. If it is no longer needed, consider removing it to keep the code clean.
Apply this diff to remove the commented code:
- # self.logger.debug("Sample hashes: %s", self.sample_sig.hashes) - # self.logger.debug("Sample abundances: %s", self.sample_sig.abundances) - # self.logger.debug("Reference hashes: %s", self.reference_sig.hashes) - # self.logger.debug("Reference abundances: %s", self.reference_sig.abundances)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (13)
- .pylintrc (1 hunks)
- .vscode/settings.json (1 hunks)
- docs/docs/Sketch.md (1 hunks)
- docs/docs/SnipeSig.md (1 hunks)
- docs/mkdocs.yml (1 hunks)
- pyproject.toml (1 hunks)
- src/snipe/api/init.py (0 hunks)
- src/snipe/api/reference_QC.py (14 hunks)
- src/snipe/api/sketch.py (5 hunks)
- src/snipe/api/snipe_sig.py (1 hunks)
- src/snipe/cli/main.py (6 hunks)
- tests/api/test_api.py (1 hunks)
- tests/api/test_reference_qc.py (2 hunks)
💤 Files with no reviewable changes (1)
- src/snipe/api/init.py
✅ Files skipped from review due to trivial changes (3)
- .pylintrc
- docs/docs/Sketch.md
- docs/docs/SnipeSig.md
🧰 Additional context used
🪛 Biome
.vscode/settings.json
[error] 5-5: expected
,
but instead found// # show line numbers in jupyter notebooks
Remove // # show line numbers in jupyter notebooks
(parse)
🪛 Ruff
src/snipe/api/reference_QC.py
13-13:
cgi
imported but unusedRemove unused import:
cgi
(F401)
15-15: Redefinition of unused
Optional
from line 4Remove definition:
Optional
(F811)
src/snipe/api/sketch.py
14-14:
snipe.api.enums.SigType
imported but unusedRemove unused import:
snipe.api.enums.SigType
(F401)
15-15:
snipe.api.snipe_sig.SnipeSig
imported but unusedRemove unused import:
snipe.api.snipe_sig.SnipeSig
(F401)
src/snipe/api/snipe_sig.py
4-4:
typing.Any
imported but unusedRemove unused import
(F401)
4-4:
typing.Optional
imported but unusedRemove unused import
(F401)
945-945: Loop control variable
it
not used within loop body(B007)
1137-1137: Loop control variable
it
not used within loop body(B007)
tests/api/test_reference_qc.py
1079-1079: Local variable
expected_stats
is assigned to but never usedRemove assignment to unused variable
expected_stats
(F841)
🔇 Additional comments (13)
docs/mkdocs.yml (1)
Line range hint
100-102
: Verify privacy considerations for analytics.The addition of Google Tag Manager for analytics is noted. While this is useful for tracking documentation usage, please ensure that:
- This aligns with the project's privacy policy.
- Users are informed about the usage of analytics, possibly through a cookie banner or privacy notice.
- The analytics implementation complies with relevant data protection regulations (e.g., GDPR if applicable).
To check if there's a privacy policy or notice in place, run:
#!/bin/bash # Search for privacy-related documents fd -t f -e md -e txt "privacy|cookie|gdpr" docsConsider adding a comment in the YAML file explaining the purpose of the analytics configuration and linking to the project's privacy policy.
pyproject.toml (1)
32-35
: New dependencies added: Consider impact on project size and performanceThe addition of pandas, tqdm, lzstring, and rapidfuzz as project dependencies suggests significant enhancements to the project's capabilities:
- pandas: Powerful data manipulation, likely for QC data processing.
- tqdm: Progress tracking, improving UX for long-running operations.
- lzstring: String compression, potentially for optimizing data storage/transfer.
- rapidfuzz: Fuzzy string matching, possibly for enhanced search or comparison features.
These additions align well with the PR objective of implementing the
snipe qc
command. However, consider the following:
- Project size: These dependencies, especially pandas, will significantly increase the project's footprint. Ensure this aligns with your distribution strategy.
- Performance: The added functionality may impact startup time and memory usage. Consider lazy loading if these libraries are not always needed.
- Compatibility: Verify that these new dependencies are compatible with your supported Python versions and other existing dependencies.
To ensure compatibility and identify potential issues, please run the following verification script:
src/snipe/api/sketch.py (3)
Line range hint
182-195
: Approve docstring updates for_sketch_sample
methodThe updates to the
_sketch_sample
method docstring accurately reflect the shift to FracMinHash sketches while maintaining the correct return type ofsourmash.SourmashSignature
. These changes are consistent and appropriate.
Line range hint
467-479
: Approve docstring updates foramplicon_sketching
methodThe updates to the
amplicon_sketching
method docstring accurately reflect the shift to FracMinHash sketches. These changes are consistent with the overall transition in terminology and are appropriate.
Line range hint
1-541
: Overall assessment of changesThe modifications in this file consistently update the terminology from MinHash to FracMinHash across various methods, aligning with the project's direction. The changes primarily affect docstrings and comments, while the core functionality remains intact. This approach ensures backward compatibility while preparing for potential future updates to the underlying implementation.
Key points:
- Terminology updated from MinHash to FracMinHash in class and method descriptions.
- Import statements modified, potentially as part of a module restructuring.
- Method implementations remain unchanged, preserving existing functionality.
To ensure that these changes don't introduce any inconsistencies, please run the following verification script:
This script will help identify any inconsistencies in terminology usage across the project and verify that method signatures align with their docstrings.
As a reminder, please address the unused imports identified earlier:
- Remove
SigType
fromsnipe.api.enums
(line 14)- Remove
SnipeSig
fromsnipe.api.snipe_sig
(line 15)These changes will help maintain a clean and efficient codebase.
🧰 Tools
🪛 Ruff
14-14:
snipe.api.enums.SigType
imported but unusedRemove unused import:
snipe.api.enums.SigType
(F401)
15-15:
snipe.api.snipe_sig.SnipeSig
imported but unusedRemove unused import:
snipe.api.snipe_sig.SnipeSig
(F401)
tests/api/test_api.py (3)
Line range hint
1-1024
: Summary: Minor import change with no apparent issuesThe change in this file is limited to updating the import statement for the SnipeSig class. This modification reflects a restructuring of the module organization, which is a positive change for code maintainability. The rest of the test file remains unchanged, indicating that the core functionality of the SnipeSig class should not be affected.
To ensure the change is fully integrated:
- Verify that no other files in the project still use the old import statement.
- Run the full test suite to confirm that all tests pass after the import change.
- Review the SnipeSig class implementation to check if any new features need additional test coverage.
These steps will help maintain the integrity and comprehensiveness of the test suite while accommodating the new module structure.
10-10
: Update import statement for consistencyThe import statement for
SnipeSig
has been updated to reflect a new module structure. This change improves code organization and maintainability.To ensure this change is consistent across the codebase, let's verify other import statements:
#!/bin/bash # Search for any remaining old import statements rg --type python 'from snipe.api import SnipeSig' # Search for the new import statement to confirm its usage rg --type python 'from snipe.api.snipe_sig import SnipeSig'
Line range hint
1-1024
: Verify test suite execution after import changeWhile the import statement change should not affect the functionality of the tests, it's crucial to ensure that the SnipeSig class behaves as expected after the module restructuring.
Please run the entire test suite to confirm that all tests pass and there are no unintended side effects from the import change. You can use the following command:
tests/api/test_reference_qc.py (5)
922-982
: Well-structured test for basic variable consumptionThe test
test_nonref_consume_from_vars_basic
correctly validates the basic functionality ofnonref_consume_from_vars
, ensuring non-reference k-mers are assigned to variables as pervars_order
.
983-1050
: Effective test handling overlapping variablesThe test
test_nonref_consume_from_vars_overlapping_vars
appropriately checks that overlapping variables are handled correctly, and non-reference k-mers are consumed only once.
1090-1123
: Correctly test forValueError
with mismatchedvars_order
The test
test_nonref_consume_from_vars_vars_order_not_matching_vars
appropriately verifies that aValueError
is raised whenvars_order
contains variables not present invars
, ensuring input validation of the method.
1125-1159
: Accurate test for scenario with no non-reference k-mersThe test
test_nonref_consume_from_vars_no_nonref_kmers
correctly asserts that an empty dictionary is returned when there are no non-reference k-mers, validating the method's behavior in this edge case.
1161-1217
: Comprehensive test for all k-mers being consumedThe test
test_nonref_consume_from_vars_all_kmers_consumed
effectively checks that when all non-reference k-mers are consumed by variables, the method reports accurately, ensuring completeness in the method's functionality.
@@ -82,6 +82,7 @@ nav: | |||
- API: | |||
- SnipeSig: SnipeSig.md | |||
- ReferenceQC: ReferenceQC.md | |||
- Sketch: Sketch.md |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Action Required: Update Documentation Path
The file Sketch.md
was found at docs/docs/Sketch.md
, but the navigation entry references it as Sketch.md
under the docs
directory. Please update the navigation path in mkdocs.yml
to docs/docs/Sketch.md
or move Sketch.md
to docs/Sketch.md
to ensure consistency.
- Option 1: Update
mkdocs.yml
:- Sketch: docs/docs/Sketch.md
- Option 2: Move
Sketch.md
todocs/Sketch.md
🔗 Analysis chain
LGTM! New API documentation added.
The addition of "Sketch: Sketch.md" to the API section of the navigation is consistent with the existing structure and naming conventions.
To ensure the new documentation file exists, run the following command:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify the existence of the Sketch.md file
fd -t f "Sketch.md" docs
Length of output: 49
Script:
#!/bin/bash
# Perform a case-insensitive search for Sketch.md in the docs directory
fd -ti f "sketch\.md" docs
Length of output: 197
Script:
#!/bin/bash
# Perform a case-insensitive search for Sketch.md in the docs directory
fd -t f -i "sketch\.md" docs
Length of output: 50
src/snipe/api/sketch.py
Outdated
@@ -8,17 +8,16 @@ | |||
import threading | |||
import queue | |||
from typing import Any, Dict, List, Optional, Tuple | |||
|
|||
from pyfastx import Fastx as SequenceReader | |||
from pyfastx import Fastx as SequenceReader # pylint: disable=no-name-in-module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approve changes and address unused imports
The updates to import statements and class description look good, aligning with the shift to FracMinHash sketches. However, there are unused imports that should be addressed:
SigType
fromsnipe.api.enums
(line 14)SnipeSig
fromsnipe.api.snipe_sig
(line 15)
Please remove these unused imports to keep the codebase clean:
-from snipe.api.enums import SigType
-from snipe.api.snipe_sig import SnipeSig
Also applies to: 15-15, 20-23
src/snipe/cli/main.py
Outdated
# Configure worker-specific logging | ||
logger = logging.getLogger(f'snipe_qc_worker_{os.path.basename(sample_path)}') | ||
logger.setLevel(logging.DEBUG if debug else logging.INFO) | ||
if not logger.hasHandlers(): | ||
handler = logging.StreamHandler(sys.stdout) | ||
handler.setLevel(logging.DEBUG if debug else logging.INFO) | ||
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') | ||
handler.setFormatter(formatter) | ||
logger.addHandler(handler) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Review logging configuration within multiprocessing
Configuring logging inside each worker process may lead to interleaved or missing log messages due to multiprocessing complexities. Consider setting up a centralized logging configuration before starting the worker processes or using process-safe logging handlers to ensure log messages are captured correctly.
You can utilize a QueueHandler
and QueueListener
to safely handle logs from multiple processes. Here's an outline:
- Set up a logging
Queue
andQueueHandler
in the main process. - Pass the logging
Queue
to worker processes. - Configure worker processes to use the
QueueHandler
. - In the main process, start a
QueueListener
to handle log records.
This approach ensures that all log messages from worker processes are funneled through the main process's logging system and are handled sequentially.
src/snipe/cli/main.py
Outdated
sample = futures[future] | ||
try: | ||
result = future.result() | ||
results.append(result) | ||
except Exception as exc: | ||
logger.error(f"Sample {sample} generated an exception: {exc}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improve exception logging in the QC process
When logging exceptions during sample processing, sample
is a dictionary of arguments, which may not provide clear information. Extract the sample path or name for clearer log messages.
Modify the exception handling to extract and log the sample name:
for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Processing samples"):
args = futures[future]
- sample = futures[future]
try:
result = future.result()
results.append(result)
except Exception as exc:
- logger.error(f"Sample {sample} generated an exception: {exc}")
+ sample_path = args.get("sample_path", "Unknown")
+ sample_name = os.path.splitext(os.path.basename(sample_path))[0]
+ logger.error(f"Sample '{sample_name}' generated an exception: {exc}")
results.append({
- "sample": os.path.splitext(os.path.basename(sample))[0],
- "file_path": sample,
+ "sample": sample_name,
+ "file_path": sample_path,
"QC_Error": str(exc)
})
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
sample = futures[future] | |
try: | |
result = future.result() | |
results.append(result) | |
except Exception as exc: | |
logger.error(f"Sample {sample} generated an exception: {exc}") | |
args = futures[future] | |
try: | |
result = future.result() | |
results.append(result) | |
except Exception as exc: | |
sample_path = args.get("sample_path", "Unknown") | |
sample_name = os.path.splitext(os.path.basename(sample_path))[0] | |
logger.error(f"Sample '{sample_name}' generated an exception: {exc}") | |
results.append({ | |
"sample": sample_name, | |
"file_path": sample_path, | |
"QC_Error": str(exc) | |
}) |
src/snipe/cli/main.py
Outdated
def validate_tsv_file(ctx, param, value: str) -> str: | ||
""" | ||
Ensure that only one of --sample, --ref, or --amplicon is provided. | ||
|
||
Args: | ||
ctx: Click context. | ||
param: Click parameter. | ||
value (Any): The value of the parameter. | ||
|
||
Raises: | ||
click.UsageError: If more than one or none of the mutually exclusive options are provided. | ||
|
||
Returns: | ||
Any: The validated value. | ||
Validate that the output file has a .tsv extension. | ||
""" | ||
sample, ref, amplicon = ctx.params.get('sample'), ctx.params.get('ref'), ctx.params.get('amplicon') | ||
if sum([bool(sample), bool(ref), bool(amplicon)]) > 1: | ||
raise click.UsageError('Only one of --sample, --ref, or --amplicon can be used at a time.') | ||
if not any([bool(sample), bool(ref), bool(amplicon)]): | ||
raise click.UsageError('You must specify one of --sample, --ref, or --amplicon.') | ||
if not value.lower().endswith('.tsv'): | ||
raise click.BadParameter('Output file must have a .tsv extension.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Refactor file validation functions to avoid duplication
The functions validate_zip_file
and validate_tsv_file
perform similar validations for file extensions. Consider creating a single generic function that accepts the expected extension as a parameter to reduce code duplication.
Apply this refactoring:
-def validate_zip_file(ctx, param, value: str) -> str:
- """
- Validate that the output file has a .zip extension.
- """
- if not value.lower().endswith('.zip'):
- raise click.BadParameter('Output file must have a .zip extension.')
- return value
-
-def validate_tsv_file(ctx, param, value: str) -> str:
- """
- Validate that the output file has a .tsv extension.
- """
- if not value.lower().endswith('.tsv'):
- raise click.BadParameter('Output file must have a .tsv extension.')
- return value
+def validate_file_extension(expected_ext: str):
+ def validator(ctx, param, value: str) -> str:
+ """
+ Validate that the output file has the expected extension.
+ """
+ if not value.lower().endswith(f'.{expected_ext}'):
+ raise click.BadParameter(f'Output file must have a .{expected_ext} extension.')
+ return value
+ return validator
Then update the @click.option
decorators to use the new validator:
@click.option('-o', '--output-file', required=True, callback=validate_zip_file, help='Output file with .zip extension.')
+@click.option('-o', '--output-file', required=True, callback=validate_file_extension('zip'), help='Output file with .zip extension.')
-@click.option('-o', '--output', required=True, callback=validate_tsv_file, help='Output TSV file for QC results.')
+@click.option('-o', '--output', required=True, callback=validate_file_extension('tsv'), help='Output TSV file for QC results.')
Committable suggestion was skipped due to low confidence.
tests/api/test_reference_qc.py
Outdated
expected_stats = { | ||
"non-var non-genomic total k-mer abundance": 21, # 6 + 7 + 8 | ||
"non-var non-genomic coverage index": 3 / 3 # 3 k-mers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused variable expected_stats
The variable expected_stats
in test_nonref_consume_from_vars_empty_vars
is assigned but never used, which results in unnecessary code.
Apply this diff to remove the unused variable:
- expected_stats = {
- "non-var non-genomic total k-mer abundance": 21, # 6 + 7 + 8
- "non-var non-genomic coverage index": 3 / 3 # 3 k-mers
- }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
expected_stats = { | |
"non-var non-genomic total k-mer abundance": 21, # 6 + 7 + 8 | |
"non-var non-genomic coverage index": 3 / 3 # 3 k-mers |
🧰 Tools
🪛 Ruff
1079-1079: Local variable
expected_stats
is assigned to but never usedRemove assignment to unused variable
expected_stats
(F841)
src/snipe/api/snipe_sig.py
Outdated
def __radd__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig': | ||
r""" | ||
Implements the right-hand + operator to support sum(). | ||
|
||
Returns: | ||
SnipeSig: Union of self and other. | ||
""" | ||
return self.__radd_sum__(other) | ||
|
||
# Override the __sum__ method | ||
def __radd_sum__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig': | ||
r""" | ||
Internal helper method to support the sum() function. | ||
|
||
Parameters: | ||
other (int or SnipeSig): The other object to add. If other is 0, return self. | ||
|
||
Returns: | ||
SnipeSig: The result of the addition. | ||
""" | ||
if other == 0: | ||
return self | ||
if not isinstance(other, SnipeSig): | ||
raise TypeError(f"Unsupported operand type(s) for +: 'SnipeSig' and '{type(other).__name__}'") | ||
return self.union_sigs(other) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Simplify __radd__
method; remove unnecessary __radd_sum__
helper method.
The __radd_sum__
method is not a standard magic method and is unnecessary. You can implement the logic for supporting the sum()
function directly within the __radd__
method.
Apply this diff to simplify the code:
def __radd__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig':
r"""
Implements the right-hand + operator to support sum().
"""
- return self.__radd_sum__(other)
-
- # Override the __sum__ method
- def __radd_sum__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig':
- r"""
- Internal helper method to support the sum() function.
- """
if other == 0:
return self
if not isinstance(other, SnipeSig):
raise TypeError(f"Unsupported operand type(s) for +: 'SnipeSig' and '{type(other).__name__}'")
return self.union_sigs(other)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def __radd__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig': | |
r""" | |
Implements the right-hand + operator to support sum(). | |
Returns: | |
SnipeSig: Union of self and other. | |
""" | |
return self.__radd_sum__(other) | |
# Override the __sum__ method | |
def __radd_sum__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig': | |
r""" | |
Internal helper method to support the sum() function. | |
Parameters: | |
other (int or SnipeSig): The other object to add. If other is 0, return self. | |
Returns: | |
SnipeSig: The result of the addition. | |
""" | |
if other == 0: | |
return self | |
if not isinstance(other, SnipeSig): | |
raise TypeError(f"Unsupported operand type(s) for +: 'SnipeSig' and '{type(other).__name__}'") | |
return self.union_sigs(other) | |
def __radd__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig': | |
r""" | |
Implements the right-hand + operator to support sum(). | |
Returns: | |
SnipeSig: Union of self and other. | |
""" | |
if other == 0: | |
return self | |
if not isinstance(other, SnipeSig): | |
raise TypeError(f"Unsupported operand type(s) for +: 'SnipeSig' and '{type(other).__name__}'") | |
return self.union_sigs(other) |
src/snipe/api/reference_QC.py
Outdated
import os | ||
import requests | ||
from tqdm import tqdm | ||
import cgi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused import cgi
.
The cgi
module imported on line 13 is not used in the code. Removing it will clean up the imports.
Apply this diff to remove the unused import:
-import cgi
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
import cgi |
🧰 Tools
🪛 Ruff
13-13:
cgi
imported but unusedRemove unused import:
cgi
(F401)
src/snipe/api/reference_QC.py
Outdated
from tqdm import tqdm | ||
import cgi | ||
from urllib.parse import urlparse | ||
from typing import Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove redundant import of Optional
.
Optional
is already imported on line 4; importing it again on line 15 is unnecessary.
Apply this diff to remove the redundant import:
-from typing import Optional
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from typing import Optional |
🧰 Tools
🪛 Ruff
15-15: Redefinition of unused
Optional
from line 4Remove definition:
Optional
(F811)
src/snipe/api/reference_QC.py
Outdated
filename = None | ||
if content_disposition: | ||
self.logger.debug("Parsing Content-Disposition header for filename.") | ||
parts = content_disposition.split(';') | ||
for part in parts: | ||
part = part.strip() | ||
if part.lower().startswith('filename*='): | ||
# Handle RFC 5987 encoding (e.g., filename*=UTF-8''example.txt) | ||
encoded_filename = part.split('=', 1)[1].strip() | ||
if "''" in encoded_filename: | ||
filename = encoded_filename.split("''", 1)[1] | ||
else: | ||
filename = encoded_filename | ||
self.logger.debug(f"Filename extracted from headers (RFC 5987): {filename}") | ||
break | ||
elif part.lower().startswith('filename='): | ||
# Remove 'filename=' and any surrounding quotes | ||
filename = part.split('=', 1)[1].strip(' "') | ||
self.logger.debug(f"Filename extracted from headers: {filename}") | ||
break | ||
|
||
if not filename: | ||
self.logger.debug("Falling back to filename derived from URL path.") | ||
filename = os.path.basename(url_path) | ||
if not filename: | ||
filename = 'downloaded_file' | ||
self.logger.debug(f"Filename derived from URL: {filename}") | ||
|
||
return filename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sanitize extracted filenames to prevent directory traversal vulnerabilities.
The _extract_filename
method extracts filenames from HTTP headers and URL paths. There's a potential security risk if the extracted filename
includes directory traversal characters like ../
. Consider sanitizing the filename
using os.path.basename()
to ensure it does not contain such characters and is safe to use in file paths.
Apply this diff to sanitize the filename:
+ import os
...
if not filename:
self.logger.debug("Falling back to filename derived from URL path.")
filename = os.path.basename(url_path)
if not filename:
filename = 'downloaded_file'
self.logger.debug(f"Filename derived from URL: {filename}")
+ else:
+ # Sanitize the filename to prevent directory traversal
+ filename = os.path.basename(filename)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
filename = None | |
if content_disposition: | |
self.logger.debug("Parsing Content-Disposition header for filename.") | |
parts = content_disposition.split(';') | |
for part in parts: | |
part = part.strip() | |
if part.lower().startswith('filename*='): | |
# Handle RFC 5987 encoding (e.g., filename*=UTF-8''example.txt) | |
encoded_filename = part.split('=', 1)[1].strip() | |
if "''" in encoded_filename: | |
filename = encoded_filename.split("''", 1)[1] | |
else: | |
filename = encoded_filename | |
self.logger.debug(f"Filename extracted from headers (RFC 5987): {filename}") | |
break | |
elif part.lower().startswith('filename='): | |
# Remove 'filename=' and any surrounding quotes | |
filename = part.split('=', 1)[1].strip(' "') | |
self.logger.debug(f"Filename extracted from headers: {filename}") | |
break | |
if not filename: | |
self.logger.debug("Falling back to filename derived from URL path.") | |
filename = os.path.basename(url_path) | |
if not filename: | |
filename = 'downloaded_file' | |
self.logger.debug(f"Filename derived from URL: {filename}") | |
return filename | |
filename = None | |
if content_disposition: | |
self.logger.debug("Parsing Content-Disposition header for filename.") | |
parts = content_disposition.split(';') | |
for part in parts: | |
part = part.strip() | |
if part.lower().startswith('filename*='): | |
# Handle RFC 5987 encoding (e.g., filename*=UTF-8''example.txt) | |
encoded_filename = part.split('=', 1)[1].strip() | |
if "''" in encoded_filename: | |
filename = encoded_filename.split("''", 1)[1] | |
else: | |
filename = encoded_filename | |
self.logger.debug(f"Filename extracted from headers (RFC 5987): {filename}") | |
break | |
elif part.lower().startswith('filename='): | |
# Remove 'filename=' and any surrounding quotes | |
filename = part.split('=', 1)[1].strip(' "') | |
self.logger.debug(f"Filename extracted from headers: {filename}") | |
break | |
if not filename: | |
self.logger.debug("Falling back to filename derived from URL path.") | |
filename = os.path.basename(url_path) | |
if not filename: | |
filename = 'downloaded_file' | |
self.logger.debug(f"Filename derived from URL: {filename}") | |
else: | |
# Sanitize the filename to prevent directory traversal | |
filename = os.path.basename(filename) | |
return filename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (3)
src/snipe/api/reference_QC.py (3)
Line range hint
690-718
: LGTM: Efficient implementation of signature splitting with caching.The changes to
split_sig_randomly
method effectively implement the caching mechanism. The use of deep copies when returning cached results ensures thread-safety. Caching the results after computation is a good practice for improving performance in subsequent calls.A minor suggestion for improvement:
Consider adding a log message when returning cached results to aid in debugging and performance monitoring. You can add this line after the debug message on line 695:
self.logger.debug("Returning cached split signatures for n=%d.", n)
1341-1501
: LGTM: Well-implemented method for analyzing non-reference k-mers.The
nonref_consume_from_vars
method is a valuable addition that provides insights into the distribution of non-reference k-mers across different variables. The implementation is thorough, with good use of logging for debugging purposes.A suggestion for improvement:
Consider adding a check for empty
vars
orvars_order
at the beginning of the method to handle edge cases more gracefully. You can add something like this after the existing checks:if not vars or not vars_order: self.logger.warning("Empty vars or vars_order provided. Skipping non-reference k-mer analysis.") return {}This will prevent potential issues if the method is called with empty inputs.
1546-1711
: LGTM: Well-implemented PreparedQC class with robust download functionality.The
PreparedQC
class is a valuable addition, extending theReferenceQC
class with functionality for working with prepared snipe profiles. The methods for downloading OSF databases and extracting filenames are thorough and handle various scenarios well.A suggestion for improvement:
In the
download_osf_db
method, consider adding a timeout to therequests.get
call to prevent indefinite hanging if the server doesn't respond. You can modify line 1621 as follows:with requests.get(download_url, stream=True, allow_redirects=True, timeout=30) as response:This will ensure that the download attempt times out after 30 seconds (adjust as needed) if there's no response from the server, improving the robustness of the download process.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- src/snipe/api/reference_QC.py (14 hunks)
🧰 Additional context used
🪛 Ruff
src/snipe/api/reference_QC.py
13-13:
cgi
imported but unusedRemove unused import:
cgi
(F401)
15-15: Redefinition of unused
Optional
from line 4Remove definition:
Optional
(F811)
🔇 Additional comments (2)
src/snipe/api/reference_QC.py (2)
257-261
: LGTM: Efficient caching of split signatures.The addition of the
_split_cache
attribute is a good optimization. It will help avoid redundant computations of split signatures, potentially improving performance for repeated operations.
Line range hint
1-1711
: Overall: Well-implemented enhancements to ReferenceQC with good new functionality.The changes to
src/snipe/api/reference_QC.py
significantly enhance the capabilities of theReferenceQC
class and introduce a valuablePreparedQC
subclass. Key improvements include:
- Efficient caching mechanism for split signatures.
- New method for analyzing non-reference k-mers against variable signatures.
- Robust genome signature loading functionality.
- New
PreparedQC
class with download capabilities for prepared profiles.These additions are generally well-implemented, with good use of logging and error handling. The suggested minor improvements in error handling and logging will further enhance the robustness and debuggability of the code.
Great job on these enhancements! They add substantial value to the Snipe project's quality control capabilities.
def load_genome_sig_to_dict(self, *, zip_file_path: str, **kwargs) -> Dict[str, 'SnipeSig']: | ||
""" | ||
Load a genome signature into a dictionary of SnipeSig instances. | ||
|
||
Parameters: | ||
zip_file_path (str): Path to the zip file containing the genome signatures. | ||
**kwargs: Additional keyword arguments to pass to the SnipeSig constructor. | ||
|
||
Returns: | ||
Dict[str, SnipeSig]: A dictionary mapping genome names to SnipeSig instances. | ||
""" | ||
|
||
return self.sex_stats No newline at end of file | ||
genome_chr_name_to_sig = {} | ||
|
||
sourmash_sigs: List[sourmash.signature.SourmashSignature] = sourmash.load_file_as_signatures(zip_file_path) | ||
sex_count = 0 | ||
autosome_count = 0 | ||
genome_count = 0 | ||
for sig in sourmash_sigs: | ||
name = sig.name | ||
if name.endswith("-snipegenome"): | ||
self.logger.debug(f"Loading genome signature: {name}") | ||
restored_name = name.replace("-snipegenome", "") | ||
genome_chr_name_to_sig[restored_name] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME) | ||
genome_count += 1 | ||
elif "sex" in name: | ||
sex_count += 1 | ||
genome_chr_name_to_sig[name.replace('sex-','')] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME) | ||
elif "autosome" in name: | ||
autosome_count += 1 | ||
genome_chr_name_to_sig[name.replace('autosome-','')] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME) | ||
else: | ||
logging.warning(f"Unknown genome signature name: {name}, are you sure you generated this with `snipe sketch --ref`?") | ||
|
||
self.logger.debug("Loaded %d genome signatures and %d sex chrs and %d autosome chrs", genome_count, sex_count, autosome_count) | ||
|
||
if genome_count != 1: | ||
logging.error(f"Expected 1 genome signature, found {genome_count}") | ||
|
||
|
||
return genome_chr_name_to_sig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM: Well-structured method for loading genome signatures.
The load_genome_sig_to_dict
method is a good addition for loading and organizing genome signatures. The differentiation between genome, sex chromosome, and autosome signatures is well-handled.
A suggestion for improvement:
The error handling when the genome count is not 1 could be more robust. Instead of just logging an error, consider raising an exception to prevent further processing with incorrect data. Replace lines 1539-1541 with:
if genome_count != 1:
error_msg = f"Expected 1 genome signature, found {genome_count}"
self.logger.error(error_msg)
raise ValueError(error_msg)
This change will ensure that the method fails explicitly when the expected number of genome signatures is not found, making it easier to catch and handle this error condition in the calling code.
import os | ||
import requests | ||
from tqdm import tqdm | ||
import cgi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused import cgi
.
The cgi
module is imported on line 13 but not used in the code. Removing it will clean up the imports and improve code clarity.
Apply this diff to remove the unused import:
-import cgi
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
import cgi |
🧰 Tools
🪛 Ruff
13-13:
cgi
imported but unusedRemove unused import:
cgi
(F401)
from tqdm import tqdm | ||
import cgi | ||
from urllib.parse import urlparse | ||
from typing import Optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove redundant import of Optional
.
Optional
is already imported on line 4 from typing
. The import on line 15 is redundant and can be removed.
Apply this diff to remove the redundant import:
-from typing import Optional
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from typing import Optional |
🧰 Tools
🪛 Ruff
15-15: Redefinition of unused
Optional
from line 4Remove definition:
Optional
(F811)
No description provided.