Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing the snipe qc command. #11

Merged
merged 7 commits into from
Oct 15, 2024
Merged

Implementing the snipe qc command. #11

merged 7 commits into from
Oct 15, 2024

Conversation

mr-eyes
Copy link
Member

@mr-eyes mr-eyes commented Oct 15, 2024

No description provided.

Copy link
Contributor

coderabbitai bot commented Oct 15, 2024

Walkthrough

The changes in this pull request involve updates across multiple files, primarily focusing on enhancing configurations, documentation, and functionality within the Snipe project. Key modifications include the addition of new dependencies in pyproject.toml, updates to API documentation, and significant enhancements to the ReferenceQC and SnipeSig classes. The CLI has been expanded with new commands for quality control operations, and various import paths have been adjusted to reflect a restructured module organization.

Changes

File Change Summary
.pylintrc Added directive to disable line-too-long warning.
.vscode/settings.json Commented-out configuration options for Jupyter notebooks related to line numbers.
docs/docs/Sketch.md Added new section titled "Python API Documentation" referencing snipe.api.sketch.
docs/docs/SnipeSig.md Updated documentation reference from ::: snipe.api to ::: snipe.api.snipe_sig.
docs/mkdocs.yml Added "Sketch" entry to navigation, confirmed mkdocstrings options, and added Google Tag Manager analytics configuration.
pyproject.toml Added dependencies: pandas, tqdm, lzstring, rapidfuzz.
src/snipe/api/__init__.py Removed SnipeSig class and its associated methods.
src/snipe/api/reference_QC.py Enhanced ReferenceQC class with new metrics, methods, and a new subclass PreparedQC.
src/snipe/api/sketch.py Changed terminology from "MinHash" to "FracMinHash" in methods and documentation.
src/snipe/api/snipe_sig.py Introduced SnipeSig class with various methods for managing Sourmash signatures.
src/snipe/cli/main.py Added quality control command, new validation methods, and enhanced sample processing functionality.
tests/api/test_api.py Updated import path for SnipeSig class.
tests/api/test_reference_qc.py Refactored import for SnipeSig and added new tests for ReferenceQC class methods.

Poem

In the code where rabbits play,
New features hop in bright array.
From sketches to QC, all in sight,
With dependencies added, our future's bright!
So let us code and celebrate,
For every change, we elevate! 🐇✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Outside diff range and nitpick comments (7)
docs/mkdocs.yml (1)

Line range hint 29-34: Great additions to enhance documentation capabilities!

The new markdown extensions (pymdownx.details, pymdownx.critic, pymdownx.caret, pymdownx.keys, pymdownx.mark, pymdownx.tilde) will provide more flexibility in formatting and styling the documentation. The mkdocstrings configuration to show function signatures is particularly useful for API documentation.

Consider adding a brief comment in the YAML file explaining the purpose of these new extensions to help future maintainers understand their inclusion.

Also applies to: 48-52

src/snipe/api/sketch.py (1)

Line range hint 71-86: Approve docstring updates with a minor suggestion

The updates to the process_sequences method docstring accurately reflect the shift to FracMinHash sketches. The changes are consistent and appropriate.

For improved clarity, consider updating the return type annotation in the method signature to match the docstring:

-    ) -> sourmash.MinHash:
+    ) -> sourmash.FracMinHash:

This change would make the type annotation consistent with the updated docstring.

tests/api/test_api.py (1)

Enhance Test Coverage for New SnipeSig Features

The SnipeSig class has been updated with several new properties and methods, including total_abundance, mean_abundance, median_abundance, get_sample_stats, export, export_to_string, get_unique_signatures, common_hashes, and copy. Currently, the test suite does not include test cases for these new additions.

To maintain comprehensive test coverage and ensure all functionalities are properly validated, please add corresponding test cases for each of the newly introduced properties and methods.

  • New Properties to Test:

    • total_abundance
    • mean_abundance
    • median_abundance
    • get_sample_stats
  • New Methods to Test:

    • export
    • export_to_string
    • get_unique_signatures
    • common_hashes
    • copy
🔗 Analysis chain

Line range hint 1-1024: Review test coverage for new SnipeSig features

The test suite appears comprehensive, covering various aspects of the SnipeSig class. However, with the recent changes in the module structure, it's important to ensure that any new functionality added to the SnipeSig class is also adequately tested.

Please review the SnipeSig class implementation and check if there are any new methods or features that might require additional test coverage. You can use the following command to inspect the SnipeSig class:

If you find any new features that are not covered in the current test suite, consider adding appropriate test cases to maintain comprehensive test coverage.


Line range hint 1-1024: Locate the SnipeSig class definition

The previous attempt to display snipe/api/snipe_sig.py did not find the SnipeSig class file. It's possible that the class has been moved or renamed. To ensure comprehensive test coverage, please locate the current file where the SnipeSig class is defined. You can use the following command to search for the class definition:

If the ast-grep command does not yield results, consider using rg (ripgrep) as an alternative to search for the class definition:

Please execute the appropriate script to locate the SnipeSig class and update the test suite accordingly based on its current implementation and location.


Line range hint 1-1024: View the Content of the SnipeSig Class

The SnipeSig class is located at src/snipe/api/snipe_sig.py. To ensure that the test suite adequately covers all functionalities, please review the contents of this class and verify if there are any new methods or features that lack corresponding test cases. You can use the following command to view the file:

If you identify any new features or modifications, consider adding appropriate test cases to maintain comprehensive test coverage.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Display the content of the SnipeSig class file
cat snipe/api/snipe_sig.py

Length of output: 83


Script:

#!/bin/bash
# Search for the definition of the SnipeSig class
ast-grep --lang python --pattern 'class SnipeSig { $$$ }'

Length of output: 504


Script:

#!/bin/bash
# Search for the definition of the SnipeSig class using ripgrep
rg 'class SnipeSig'

Length of output: 64


Script:

#!/bin/bash
# Display the content of the SnipeSig class file
cat src/snipe/api/snipe_sig.py

Length of output: 58847

src/snipe/cli/main.py (1)

70-74: Rename variable samples for clarity

The variable samples holds sample, ref, and amplicon, which are not all samples. Consider renaming it to inputs or input_files to better reflect its content and enhance code readability.

Apply this rename:

-def sketch(ctx, sample: Optional[str], ref: Optional[str], amplicon: Optional[str], ...):
    ...
-    samples = [sample, ref, amplicon]
-    provided = [s for s in samples if s]
+    inputs = [sample, ref, amplicon]
+    provided = [s for s in inputs if s]
     if len(provided) != 1:
         click.echo('Error: Exactly one of --sample, --ref, or --amplicon must be provided.')
         sys.exit(1)
src/snipe/api/snipe_sig.py (2)

4-4: Remove unused imports Any and Optional.

The imports Any and Optional from the typing module are not used and can be removed to clean up the code.

Apply this diff to remove the unused imports:

- from typing import Any, Dict, Iterator, List, Optional, Union
+ from typing import Dict, Iterator, List, Union
🧰 Tools
🪛 Ruff

4-4: typing.Any imported but unused

Remove unused import

(F401)


4-4: typing.Optional imported but unused

Remove unused import

(F401)


205-205: Typo in the return type annotation.

At line 205, there's a typographical error in the return type annotation in the docstring. The character ß should be replaced with B in SourmashSignature.

Apply this diff to fix the typo:

- sourmash.signature.SourmashßSignature, list of sourmash.signature.SourmashSignature, or None if loading fails.
+ sourmash.signature.SourmashSignature, list of sourmash.signature.SourmashSignature, or None if loading fails.
src/snipe/api/reference_QC.py (1)

386-389: Remove commented-out code to improve readability.

The code between lines 386-389 is commented out. If it is no longer needed, consider removing it to keep the code clean.

Apply this diff to remove the commented code:

-    # self.logger.debug("Sample hashes: %s", self.sample_sig.hashes)
-    # self.logger.debug("Sample abundances: %s", self.sample_sig.abundances)
-    # self.logger.debug("Reference hashes: %s", self.reference_sig.hashes)
-    # self.logger.debug("Reference abundances: %s", self.reference_sig.abundances)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 5bd83d6 and 5e4d6dd.

📒 Files selected for processing (13)
  • .pylintrc (1 hunks)
  • .vscode/settings.json (1 hunks)
  • docs/docs/Sketch.md (1 hunks)
  • docs/docs/SnipeSig.md (1 hunks)
  • docs/mkdocs.yml (1 hunks)
  • pyproject.toml (1 hunks)
  • src/snipe/api/init.py (0 hunks)
  • src/snipe/api/reference_QC.py (14 hunks)
  • src/snipe/api/sketch.py (5 hunks)
  • src/snipe/api/snipe_sig.py (1 hunks)
  • src/snipe/cli/main.py (6 hunks)
  • tests/api/test_api.py (1 hunks)
  • tests/api/test_reference_qc.py (2 hunks)
💤 Files with no reviewable changes (1)
  • src/snipe/api/init.py
✅ Files skipped from review due to trivial changes (3)
  • .pylintrc
  • docs/docs/Sketch.md
  • docs/docs/SnipeSig.md
🧰 Additional context used
🪛 Biome
.vscode/settings.json

[error] 5-5: expected , but instead found // # show line numbers in jupyter notebooks

Remove // # show line numbers in jupyter notebooks

(parse)

🪛 Ruff
src/snipe/api/reference_QC.py

13-13: cgi imported but unused

Remove unused import: cgi

(F401)


15-15: Redefinition of unused Optional from line 4

Remove definition: Optional

(F811)

src/snipe/api/sketch.py

14-14: snipe.api.enums.SigType imported but unused

Remove unused import: snipe.api.enums.SigType

(F401)


15-15: snipe.api.snipe_sig.SnipeSig imported but unused

Remove unused import: snipe.api.snipe_sig.SnipeSig

(F401)

src/snipe/api/snipe_sig.py

4-4: typing.Any imported but unused

Remove unused import

(F401)


4-4: typing.Optional imported but unused

Remove unused import

(F401)


945-945: Loop control variable it not used within loop body

(B007)


1137-1137: Loop control variable it not used within loop body

(B007)

tests/api/test_reference_qc.py

1079-1079: Local variable expected_stats is assigned to but never used

Remove assignment to unused variable expected_stats

(F841)

🔇 Additional comments (13)
docs/mkdocs.yml (1)

Line range hint 100-102: Verify privacy considerations for analytics.

The addition of Google Tag Manager for analytics is noted. While this is useful for tracking documentation usage, please ensure that:

  1. This aligns with the project's privacy policy.
  2. Users are informed about the usage of analytics, possibly through a cookie banner or privacy notice.
  3. The analytics implementation complies with relevant data protection regulations (e.g., GDPR if applicable).

To check if there's a privacy policy or notice in place, run:

#!/bin/bash
# Search for privacy-related documents
fd -t f -e md -e txt "privacy|cookie|gdpr" docs

Consider adding a comment in the YAML file explaining the purpose of the analytics configuration and linking to the project's privacy policy.

pyproject.toml (1)

32-35: New dependencies added: Consider impact on project size and performance

The addition of pandas, tqdm, lzstring, and rapidfuzz as project dependencies suggests significant enhancements to the project's capabilities:

  1. pandas: Powerful data manipulation, likely for QC data processing.
  2. tqdm: Progress tracking, improving UX for long-running operations.
  3. lzstring: String compression, potentially for optimizing data storage/transfer.
  4. rapidfuzz: Fuzzy string matching, possibly for enhanced search or comparison features.

These additions align well with the PR objective of implementing the snipe qc command. However, consider the following:

  1. Project size: These dependencies, especially pandas, will significantly increase the project's footprint. Ensure this aligns with your distribution strategy.
  2. Performance: The added functionality may impact startup time and memory usage. Consider lazy loading if these libraries are not always needed.
  3. Compatibility: Verify that these new dependencies are compatible with your supported Python versions and other existing dependencies.

To ensure compatibility and identify potential issues, please run the following verification script:

src/snipe/api/sketch.py (3)

Line range hint 182-195: Approve docstring updates for _sketch_sample method

The updates to the _sketch_sample method docstring accurately reflect the shift to FracMinHash sketches while maintaining the correct return type of sourmash.SourmashSignature. These changes are consistent and appropriate.


Line range hint 467-479: Approve docstring updates for amplicon_sketching method

The updates to the amplicon_sketching method docstring accurately reflect the shift to FracMinHash sketches. These changes are consistent with the overall transition in terminology and are appropriate.


Line range hint 1-541: Overall assessment of changes

The modifications in this file consistently update the terminology from MinHash to FracMinHash across various methods, aligning with the project's direction. The changes primarily affect docstrings and comments, while the core functionality remains intact. This approach ensures backward compatibility while preparing for potential future updates to the underlying implementation.

Key points:

  1. Terminology updated from MinHash to FracMinHash in class and method descriptions.
  2. Import statements modified, potentially as part of a module restructuring.
  3. Method implementations remain unchanged, preserving existing functionality.

To ensure that these changes don't introduce any inconsistencies, please run the following verification script:

This script will help identify any inconsistencies in terminology usage across the project and verify that method signatures align with their docstrings.

As a reminder, please address the unused imports identified earlier:

  • Remove SigType from snipe.api.enums (line 14)
  • Remove SnipeSig from snipe.api.snipe_sig (line 15)

These changes will help maintain a clean and efficient codebase.

🧰 Tools
🪛 Ruff

14-14: snipe.api.enums.SigType imported but unused

Remove unused import: snipe.api.enums.SigType

(F401)


15-15: snipe.api.snipe_sig.SnipeSig imported but unused

Remove unused import: snipe.api.snipe_sig.SnipeSig

(F401)

tests/api/test_api.py (3)

Line range hint 1-1024: Summary: Minor import change with no apparent issues

The change in this file is limited to updating the import statement for the SnipeSig class. This modification reflects a restructuring of the module organization, which is a positive change for code maintainability. The rest of the test file remains unchanged, indicating that the core functionality of the SnipeSig class should not be affected.

To ensure the change is fully integrated:

  1. Verify that no other files in the project still use the old import statement.
  2. Run the full test suite to confirm that all tests pass after the import change.
  3. Review the SnipeSig class implementation to check if any new features need additional test coverage.

These steps will help maintain the integrity and comprehensiveness of the test suite while accommodating the new module structure.


10-10: Update import statement for consistency

The import statement for SnipeSig has been updated to reflect a new module structure. This change improves code organization and maintainability.

To ensure this change is consistent across the codebase, let's verify other import statements:

#!/bin/bash
# Search for any remaining old import statements
rg --type python 'from snipe.api import SnipeSig'

# Search for the new import statement to confirm its usage
rg --type python 'from snipe.api.snipe_sig import SnipeSig'

Line range hint 1-1024: Verify test suite execution after import change

While the import statement change should not affect the functionality of the tests, it's crucial to ensure that the SnipeSig class behaves as expected after the module restructuring.

Please run the entire test suite to confirm that all tests pass and there are no unintended side effects from the import change. You can use the following command:

tests/api/test_reference_qc.py (5)

922-982: Well-structured test for basic variable consumption

The test test_nonref_consume_from_vars_basic correctly validates the basic functionality of nonref_consume_from_vars, ensuring non-reference k-mers are assigned to variables as per vars_order.


983-1050: Effective test handling overlapping variables

The test test_nonref_consume_from_vars_overlapping_vars appropriately checks that overlapping variables are handled correctly, and non-reference k-mers are consumed only once.


1090-1123: Correctly test for ValueError with mismatched vars_order

The test test_nonref_consume_from_vars_vars_order_not_matching_vars appropriately verifies that a ValueError is raised when vars_order contains variables not present in vars, ensuring input validation of the method.


1125-1159: Accurate test for scenario with no non-reference k-mers

The test test_nonref_consume_from_vars_no_nonref_kmers correctly asserts that an empty dictionary is returned when there are no non-reference k-mers, validating the method's behavior in this edge case.


1161-1217: Comprehensive test for all k-mers being consumed

The test test_nonref_consume_from_vars_all_kmers_consumed effectively checks that when all non-reference k-mers are consumed by variables, the method reports accurately, ensuring completeness in the method's functionality.

@@ -82,6 +82,7 @@ nav:
- API:
- SnipeSig: SnipeSig.md
- ReferenceQC: ReferenceQC.md
- Sketch: Sketch.md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Action Required: Update Documentation Path

The file Sketch.md was found at docs/docs/Sketch.md, but the navigation entry references it as Sketch.md under the docs directory. Please update the navigation path in mkdocs.yml to docs/docs/Sketch.md or move Sketch.md to docs/Sketch.md to ensure consistency.

  • Option 1: Update mkdocs.yml:
    - Sketch: docs/docs/Sketch.md
    
  • Option 2: Move Sketch.md to docs/Sketch.md
🔗 Analysis chain

LGTM! New API documentation added.

The addition of "Sketch: Sketch.md" to the API section of the navigation is consistent with the existing structure and naming conventions.

To ensure the new documentation file exists, run the following command:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the existence of the Sketch.md file
fd -t f "Sketch.md" docs

Length of output: 49


Script:

#!/bin/bash
# Perform a case-insensitive search for Sketch.md in the docs directory
fd -ti f "sketch\.md" docs

Length of output: 197


Script:

#!/bin/bash
# Perform a case-insensitive search for Sketch.md in the docs directory
fd -t f -i "sketch\.md" docs

Length of output: 50

@@ -8,17 +8,16 @@
import threading
import queue
from typing import Any, Dict, List, Optional, Tuple

from pyfastx import Fastx as SequenceReader
from pyfastx import Fastx as SequenceReader # pylint: disable=no-name-in-module
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Approve changes and address unused imports

The updates to import statements and class description look good, aligning with the shift to FracMinHash sketches. However, there are unused imports that should be addressed:

  1. SigType from snipe.api.enums (line 14)
  2. SnipeSig from snipe.api.snipe_sig (line 15)

Please remove these unused imports to keep the codebase clean:

-from snipe.api.enums import SigType
-from snipe.api.snipe_sig import SnipeSig

Also applies to: 15-15, 20-23

Comment on lines +231 to +240
# Configure worker-specific logging
logger = logging.getLogger(f'snipe_qc_worker_{os.path.basename(sample_path)}')
logger.setLevel(logging.DEBUG if debug else logging.INFO)
if not logger.hasHandlers():
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG if debug else logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Review logging configuration within multiprocessing

Configuring logging inside each worker process may lead to interleaved or missing log messages due to multiprocessing complexities. Consider setting up a centralized logging configuration before starting the worker processes or using process-safe logging handlers to ensure log messages are captured correctly.

You can utilize a QueueHandler and QueueListener to safely handle logs from multiple processes. Here's an outline:

  1. Set up a logging Queue and QueueHandler in the main process.
  2. Pass the logging Queue to worker processes.
  3. Configure worker processes to use the QueueHandler.
  4. In the main process, start a QueueListener to handle log records.

This approach ensures that all log messages from worker processes are funneled through the main process's logging system and are handled sequentially.

Comment on lines +443 to +448
sample = futures[future]
try:
result = future.result()
results.append(result)
except Exception as exc:
logger.error(f"Sample {sample} generated an exception: {exc}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Improve exception logging in the QC process

When logging exceptions during sample processing, sample is a dictionary of arguments, which may not provide clear information. Extract the sample path or name for clearer log messages.

Modify the exception handling to extract and log the sample name:

 for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Processing samples"):
     args = futures[future]
-    sample = futures[future]
     try:
         result = future.result()
         results.append(result)
     except Exception as exc:
-        logger.error(f"Sample {sample} generated an exception: {exc}")
+        sample_path = args.get("sample_path", "Unknown")
+        sample_name = os.path.splitext(os.path.basename(sample_path))[0]
+        logger.error(f"Sample '{sample_name}' generated an exception: {exc}")
         results.append({
-            "sample": os.path.splitext(os.path.basename(sample))[0],
-            "file_path": sample,
+            "sample": sample_name,
+            "file_path": sample_path,
             "QC_Error": str(exc)
         })
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sample = futures[future]
try:
result = future.result()
results.append(result)
except Exception as exc:
logger.error(f"Sample {sample} generated an exception: {exc}")
args = futures[future]
try:
result = future.result()
results.append(result)
except Exception as exc:
sample_path = args.get("sample_path", "Unknown")
sample_name = os.path.splitext(os.path.basename(sample_path))[0]
logger.error(f"Sample '{sample_name}' generated an exception: {exc}")
results.append({
"sample": sample_name,
"file_path": sample_path,
"QC_Error": str(exc)
})

Comment on lines +28 to +33
def validate_tsv_file(ctx, param, value: str) -> str:
"""
Ensure that only one of --sample, --ref, or --amplicon is provided.

Args:
ctx: Click context.
param: Click parameter.
value (Any): The value of the parameter.

Raises:
click.UsageError: If more than one or none of the mutually exclusive options are provided.

Returns:
Any: The validated value.
Validate that the output file has a .tsv extension.
"""
sample, ref, amplicon = ctx.params.get('sample'), ctx.params.get('ref'), ctx.params.get('amplicon')
if sum([bool(sample), bool(ref), bool(amplicon)]) > 1:
raise click.UsageError('Only one of --sample, --ref, or --amplicon can be used at a time.')
if not any([bool(sample), bool(ref), bool(amplicon)]):
raise click.UsageError('You must specify one of --sample, --ref, or --amplicon.')
if not value.lower().endswith('.tsv'):
raise click.BadParameter('Output file must have a .tsv extension.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Refactor file validation functions to avoid duplication

The functions validate_zip_file and validate_tsv_file perform similar validations for file extensions. Consider creating a single generic function that accepts the expected extension as a parameter to reduce code duplication.

Apply this refactoring:

-def validate_zip_file(ctx, param, value: str) -> str:
-    """
-    Validate that the output file has a .zip extension.
-    """
-    if not value.lower().endswith('.zip'):
-        raise click.BadParameter('Output file must have a .zip extension.')
-    return value
-
-def validate_tsv_file(ctx, param, value: str) -> str:
-    """
-    Validate that the output file has a .tsv extension.
-    """
-    if not value.lower().endswith('.tsv'):
-        raise click.BadParameter('Output file must have a .tsv extension.')
-    return value
+def validate_file_extension(expected_ext: str):
+    def validator(ctx, param, value: str) -> str:
+        """
+        Validate that the output file has the expected extension.
+        """
+        if not value.lower().endswith(f'.{expected_ext}'):
+            raise click.BadParameter(f'Output file must have a .{expected_ext} extension.')
+        return value
+    return validator

Then update the @click.option decorators to use the new validator:

 @click.option('-o', '--output-file', required=True, callback=validate_zip_file, help='Output file with .zip extension.')
+@click.option('-o', '--output-file', required=True, callback=validate_file_extension('zip'), help='Output file with .zip extension.')
 
-@click.option('-o', '--output', required=True, callback=validate_tsv_file, help='Output TSV file for QC results.')
+@click.option('-o', '--output', required=True, callback=validate_file_extension('tsv'), help='Output TSV file for QC results.')

Committable suggestion was skipped due to low confidence.

Comment on lines +1079 to +1081
expected_stats = {
"non-var non-genomic total k-mer abundance": 21, # 6 + 7 + 8
"non-var non-genomic coverage index": 3 / 3 # 3 k-mers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unused variable expected_stats

The variable expected_stats in test_nonref_consume_from_vars_empty_vars is assigned but never used, which results in unnecessary code.

Apply this diff to remove the unused variable:

-            expected_stats = {
-                "non-var non-genomic total k-mer abundance": 21,  # 6 + 7 + 8
-                "non-var non-genomic coverage index": 3 / 3  # 3 k-mers
-            }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
expected_stats = {
"non-var non-genomic total k-mer abundance": 21, # 6 + 7 + 8
"non-var non-genomic coverage index": 3 / 3 # 3 k-mers
🧰 Tools
🪛 Ruff

1079-1079: Local variable expected_stats is assigned to but never used

Remove assignment to unused variable expected_stats

(F841)

Comment on lines +1202 to +1227
def __radd__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig':
r"""
Implements the right-hand + operator to support sum().

Returns:
SnipeSig: Union of self and other.
"""
return self.__radd_sum__(other)

# Override the __sum__ method
def __radd_sum__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig':
r"""
Internal helper method to support the sum() function.

Parameters:
other (int or SnipeSig): The other object to add. If other is 0, return self.

Returns:
SnipeSig: The result of the addition.
"""
if other == 0:
return self
if not isinstance(other, SnipeSig):
raise TypeError(f"Unsupported operand type(s) for +: 'SnipeSig' and '{type(other).__name__}'")
return self.union_sigs(other)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Simplify __radd__ method; remove unnecessary __radd_sum__ helper method.

The __radd_sum__ method is not a standard magic method and is unnecessary. You can implement the logic for supporting the sum() function directly within the __radd__ method.

Apply this diff to simplify the code:

 def __radd__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig':
     r"""
     Implements the right-hand + operator to support sum().
     """
-    return self.__radd_sum__(other)
-
-    # Override the __sum__ method
-    def __radd_sum__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig':
-        r"""
-        Internal helper method to support the sum() function.
-        """
     if other == 0:
         return self
     if not isinstance(other, SnipeSig):
         raise TypeError(f"Unsupported operand type(s) for +: 'SnipeSig' and '{type(other).__name__}'")
     return self.union_sigs(other)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def __radd__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig':
r"""
Implements the right-hand + operator to support sum().
Returns:
SnipeSig: Union of self and other.
"""
return self.__radd_sum__(other)
# Override the __sum__ method
def __radd_sum__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig':
r"""
Internal helper method to support the sum() function.
Parameters:
other (int or SnipeSig): The other object to add. If other is 0, return self.
Returns:
SnipeSig: The result of the addition.
"""
if other == 0:
return self
if not isinstance(other, SnipeSig):
raise TypeError(f"Unsupported operand type(s) for +: 'SnipeSig' and '{type(other).__name__}'")
return self.union_sigs(other)
def __radd__(self, other: Union[int, 'SnipeSig']) -> 'SnipeSig':
r"""
Implements the right-hand + operator to support sum().
Returns:
SnipeSig: Union of self and other.
"""
if other == 0:
return self
if not isinstance(other, SnipeSig):
raise TypeError(f"Unsupported operand type(s) for +: 'SnipeSig' and '{type(other).__name__}'")
return self.union_sigs(other)

import os
import requests
from tqdm import tqdm
import cgi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unused import cgi.

The cgi module imported on line 13 is not used in the code. Removing it will clean up the imports.

Apply this diff to remove the unused import:

-import cgi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import cgi
🧰 Tools
🪛 Ruff

13-13: cgi imported but unused

Remove unused import: cgi

(F401)

from tqdm import tqdm
import cgi
from urllib.parse import urlparse
from typing import Optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove redundant import of Optional.

Optional is already imported on line 4; importing it again on line 15 is unnecessary.

Apply this diff to remove the redundant import:

-from typing import Optional
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from typing import Optional
🧰 Tools
🪛 Ruff

15-15: Redefinition of unused Optional from line 4

Remove definition: Optional

(F811)

Comment on lines +1677 to +1705
filename = None
if content_disposition:
self.logger.debug("Parsing Content-Disposition header for filename.")
parts = content_disposition.split(';')
for part in parts:
part = part.strip()
if part.lower().startswith('filename*='):
# Handle RFC 5987 encoding (e.g., filename*=UTF-8''example.txt)
encoded_filename = part.split('=', 1)[1].strip()
if "''" in encoded_filename:
filename = encoded_filename.split("''", 1)[1]
else:
filename = encoded_filename
self.logger.debug(f"Filename extracted from headers (RFC 5987): {filename}")
break
elif part.lower().startswith('filename='):
# Remove 'filename=' and any surrounding quotes
filename = part.split('=', 1)[1].strip(' "')
self.logger.debug(f"Filename extracted from headers: {filename}")
break

if not filename:
self.logger.debug("Falling back to filename derived from URL path.")
filename = os.path.basename(url_path)
if not filename:
filename = 'downloaded_file'
self.logger.debug(f"Filename derived from URL: {filename}")

return filename
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Sanitize extracted filenames to prevent directory traversal vulnerabilities.

The _extract_filename method extracts filenames from HTTP headers and URL paths. There's a potential security risk if the extracted filename includes directory traversal characters like ../. Consider sanitizing the filename using os.path.basename() to ensure it does not contain such characters and is safe to use in file paths.

Apply this diff to sanitize the filename:

+            import os
             ...
             if not filename:
                 self.logger.debug("Falling back to filename derived from URL path.")
                 filename = os.path.basename(url_path)
                 if not filename:
                     filename = 'downloaded_file'
                 self.logger.debug(f"Filename derived from URL: {filename}")
+            else:
+                # Sanitize the filename to prevent directory traversal
+                filename = os.path.basename(filename)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
filename = None
if content_disposition:
self.logger.debug("Parsing Content-Disposition header for filename.")
parts = content_disposition.split(';')
for part in parts:
part = part.strip()
if part.lower().startswith('filename*='):
# Handle RFC 5987 encoding (e.g., filename*=UTF-8''example.txt)
encoded_filename = part.split('=', 1)[1].strip()
if "''" in encoded_filename:
filename = encoded_filename.split("''", 1)[1]
else:
filename = encoded_filename
self.logger.debug(f"Filename extracted from headers (RFC 5987): {filename}")
break
elif part.lower().startswith('filename='):
# Remove 'filename=' and any surrounding quotes
filename = part.split('=', 1)[1].strip(' "')
self.logger.debug(f"Filename extracted from headers: {filename}")
break
if not filename:
self.logger.debug("Falling back to filename derived from URL path.")
filename = os.path.basename(url_path)
if not filename:
filename = 'downloaded_file'
self.logger.debug(f"Filename derived from URL: {filename}")
return filename
filename = None
if content_disposition:
self.logger.debug("Parsing Content-Disposition header for filename.")
parts = content_disposition.split(';')
for part in parts:
part = part.strip()
if part.lower().startswith('filename*='):
# Handle RFC 5987 encoding (e.g., filename*=UTF-8''example.txt)
encoded_filename = part.split('=', 1)[1].strip()
if "''" in encoded_filename:
filename = encoded_filename.split("''", 1)[1]
else:
filename = encoded_filename
self.logger.debug(f"Filename extracted from headers (RFC 5987): {filename}")
break
elif part.lower().startswith('filename='):
# Remove 'filename=' and any surrounding quotes
filename = part.split('=', 1)[1].strip(' "')
self.logger.debug(f"Filename extracted from headers: {filename}")
break
if not filename:
self.logger.debug("Falling back to filename derived from URL path.")
filename = os.path.basename(url_path)
if not filename:
filename = 'downloaded_file'
self.logger.debug(f"Filename derived from URL: {filename}")
else:
# Sanitize the filename to prevent directory traversal
filename = os.path.basename(filename)
return filename

@mr-eyes mr-eyes merged commit 65101e3 into main Oct 15, 2024
3 of 4 checks passed
@mr-eyes mr-eyes deleted the var branch October 15, 2024 00:26
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)
src/snipe/api/reference_QC.py (3)

Line range hint 690-718: LGTM: Efficient implementation of signature splitting with caching.

The changes to split_sig_randomly method effectively implement the caching mechanism. The use of deep copies when returning cached results ensures thread-safety. Caching the results after computation is a good practice for improving performance in subsequent calls.

A minor suggestion for improvement:

Consider adding a log message when returning cached results to aid in debugging and performance monitoring. You can add this line after the debug message on line 695:

self.logger.debug("Returning cached split signatures for n=%d.", n)

1341-1501: LGTM: Well-implemented method for analyzing non-reference k-mers.

The nonref_consume_from_vars method is a valuable addition that provides insights into the distribution of non-reference k-mers across different variables. The implementation is thorough, with good use of logging for debugging purposes.

A suggestion for improvement:

Consider adding a check for empty vars or vars_order at the beginning of the method to handle edge cases more gracefully. You can add something like this after the existing checks:

if not vars or not vars_order:
    self.logger.warning("Empty vars or vars_order provided. Skipping non-reference k-mer analysis.")
    return {}

This will prevent potential issues if the method is called with empty inputs.


1546-1711: LGTM: Well-implemented PreparedQC class with robust download functionality.

The PreparedQC class is a valuable addition, extending the ReferenceQC class with functionality for working with prepared snipe profiles. The methods for downloading OSF databases and extracting filenames are thorough and handle various scenarios well.

A suggestion for improvement:

In the download_osf_db method, consider adding a timeout to the requests.get call to prevent indefinite hanging if the server doesn't respond. You can modify line 1621 as follows:

with requests.get(download_url, stream=True, allow_redirects=True, timeout=30) as response:

This will ensure that the download attempt times out after 30 seconds (adjust as needed) if there's no response from the server, improving the robustness of the download process.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 5e4d6dd and 2f8b29d.

📒 Files selected for processing (1)
  • src/snipe/api/reference_QC.py (14 hunks)
🧰 Additional context used
🪛 Ruff
src/snipe/api/reference_QC.py

13-13: cgi imported but unused

Remove unused import: cgi

(F401)


15-15: Redefinition of unused Optional from line 4

Remove definition: Optional

(F811)

🔇 Additional comments (2)
src/snipe/api/reference_QC.py (2)

257-261: LGTM: Efficient caching of split signatures.

The addition of the _split_cache attribute is a good optimization. It will help avoid redundant computations of split signatures, potentially improving performance for repeated operations.


Line range hint 1-1711: Overall: Well-implemented enhancements to ReferenceQC with good new functionality.

The changes to src/snipe/api/reference_QC.py significantly enhance the capabilities of the ReferenceQC class and introduce a valuable PreparedQC subclass. Key improvements include:

  1. Efficient caching mechanism for split signatures.
  2. New method for analyzing non-reference k-mers against variable signatures.
  3. Robust genome signature loading functionality.
  4. New PreparedQC class with download capabilities for prepared profiles.

These additions are generally well-implemented, with good use of logging and error handling. The suggested minor improvements in error handling and logging will further enhance the robustness and debuggability of the code.

Great job on these enhancements! They add substantial value to the Snipe project's quality control capabilities.

Comment on lines 1502 to 1543
def load_genome_sig_to_dict(self, *, zip_file_path: str, **kwargs) -> Dict[str, 'SnipeSig']:
"""
Load a genome signature into a dictionary of SnipeSig instances.

Parameters:
zip_file_path (str): Path to the zip file containing the genome signatures.
**kwargs: Additional keyword arguments to pass to the SnipeSig constructor.

Returns:
Dict[str, SnipeSig]: A dictionary mapping genome names to SnipeSig instances.
"""

return self.sex_stats No newline at end of file
genome_chr_name_to_sig = {}

sourmash_sigs: List[sourmash.signature.SourmashSignature] = sourmash.load_file_as_signatures(zip_file_path)
sex_count = 0
autosome_count = 0
genome_count = 0
for sig in sourmash_sigs:
name = sig.name
if name.endswith("-snipegenome"):
self.logger.debug(f"Loading genome signature: {name}")
restored_name = name.replace("-snipegenome", "")
genome_chr_name_to_sig[restored_name] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME)
genome_count += 1
elif "sex" in name:
sex_count += 1
genome_chr_name_to_sig[name.replace('sex-','')] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME)
elif "autosome" in name:
autosome_count += 1
genome_chr_name_to_sig[name.replace('autosome-','')] = SnipeSig(sourmash_sig=sig, sig_type=SigType.GENOME)
else:
logging.warning(f"Unknown genome signature name: {name}, are you sure you generated this with `snipe sketch --ref`?")

self.logger.debug("Loaded %d genome signatures and %d sex chrs and %d autosome chrs", genome_count, sex_count, autosome_count)

if genome_count != 1:
logging.error(f"Expected 1 genome signature, found {genome_count}")


return genome_chr_name_to_sig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

LGTM: Well-structured method for loading genome signatures.

The load_genome_sig_to_dict method is a good addition for loading and organizing genome signatures. The differentiation between genome, sex chromosome, and autosome signatures is well-handled.

A suggestion for improvement:

The error handling when the genome count is not 1 could be more robust. Instead of just logging an error, consider raising an exception to prevent further processing with incorrect data. Replace lines 1539-1541 with:

if genome_count != 1:
    error_msg = f"Expected 1 genome signature, found {genome_count}"
    self.logger.error(error_msg)
    raise ValueError(error_msg)

This change will ensure that the method fails explicitly when the expected number of genome signatures is not found, making it easier to catch and handle this error condition in the calling code.

import os
import requests
from tqdm import tqdm
import cgi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unused import cgi.

The cgi module is imported on line 13 but not used in the code. Removing it will clean up the imports and improve code clarity.

Apply this diff to remove the unused import:

-import cgi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import cgi
🧰 Tools
🪛 Ruff

13-13: cgi imported but unused

Remove unused import: cgi

(F401)

from tqdm import tqdm
import cgi
from urllib.parse import urlparse
from typing import Optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove redundant import of Optional.

Optional is already imported on line 4 from typing. The import on line 15 is redundant and can be removed.

Apply this diff to remove the redundant import:

-from typing import Optional
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from typing import Optional
🧰 Tools
🪛 Ruff

15-15: Redefinition of unused Optional from line 4

Remove definition: Optional

(F811)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant