Reformat to line-length 100
titusz committed Aug 19, 2024
1 parent 74f63fd commit fccbc4b
Showing 14 changed files with 161 additions and 78 deletions.
95 changes: 52 additions & 43 deletions README.md
@@ -5,50 +5,57 @@
[![Downloads](https://pepy.tech/badge/iscc-sct)](https://pepy.tech/project/iscc-sct)

> [!CAUTION]
> **This is a proof of concept.** All releases with version numbers below v1.0.0 may break backward compatibility and
> produce incompatible Semantic Text-Codes. The algorithms of this `iscc-sct` repository are experimental and not part
> of the official [ISO 24138:2024](https://www.iso.org/standard/77899.html) standard.
> **This is a proof of concept.** All releases with version numbers below v1.0.0 may break backward
> compatibility and produce incompatible Semantic Text-Codes. The algorithms of this `iscc-sct`
> repository are experimental and not part of the official
> [ISO 24138:2024](https://www.iso.org/standard/77899.html) standard.
`iscc-sct` is a semantic Text-Code for the [ISCC](https://core.iscc.codes) (*International Standard Content Code*).
Semantic Text-Codes are short identifiers created from text documents that preserve similarity (in Hamming distance)
for semantically similar cross-lingual text inputs.
`iscc-sct` is a semantic Text-Code for the [ISCC](https://core.iscc.codes) (*International Standard
Content Code*). Semantic Text-Codes are short identifiers created from text documents that preserve
similarity (in Hamming distance) for semantically similar cross-lingual text inputs.
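
As a quick illustration (not part of this commit), a Semantic Text-Code can be generated with `gen_text_code_semantic`, the function exercised throughout the diffs below; the sample text is hypothetical:

```python
import iscc_sct as sct

# Generate a 256-bit Semantic Text-Code. The result dict maps onto the
# Metadata model in iscc_sct/models.py and contains the canonical "iscc" code.
result = sct.gen_text_code_semantic("The quick brown fox jumps over the lazy dog.", bits=256)
print(result["iscc"])  # actual value depends on the embedding model
```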

## What is the ISCC

The ISCC is a combination of various similarity-preserving fingerprints and an identifier for digital media content.
The ISCC is a combination of various similarity-preserving fingerprints and an identifier for
digital media content.

ISCCs are generated algorithmically from digital content, just like cryptographic hashes. However, instead of using a
single cryptographic hash function to identify data only, the ISCC uses various algorithms to create a composite
identifier that exhibits similarity-preserving properties (soft hash or Simprint).
ISCCs are generated algorithmically from digital content, just like cryptographic hashes. However,
instead of using a single cryptographic hash function to identify data only, the ISCC uses various
algorithms to create a composite identifier that exhibits similarity-preserving properties (soft
hash or Simprint).

The component-based structure of the ISCC identifies content at multiple levels of abstraction. Each component is
self-describing, modular, and can be used separately or with others to aid in various content identification tasks. The
algorithmic design supports content deduplication, database synchronization, indexing, integrity verification,
timestamping, versioning, data provenance, similarity clustering, anomaly detection, usage tracking, allocation of
royalties, fact-checking and general digital asset management use-cases.
The component-based structure of the ISCC identifies content at multiple levels of abstraction. Each
component is self-describing, modular, and can be used separately or with others to aid in various
content identification tasks. The algorithmic design supports content deduplication, database
synchronization, indexing, integrity verification, timestamping, versioning, data provenance,
similarity clustering, anomaly detection, usage tracking, allocation of royalties, fact-checking and
general digital asset management use-cases.

## What is ISCC Semantic Text-Code?

The ISCC framework already includes a Text-Code based on lexical similarity for near-duplicate matching. The ISCC
Semantic Text-Code is a planned additional ISCC-UNIT focused on capturing a more abstract and broader semantic
similarity. It is engineered to be robust against a wide range of variations and, most remarkably, translations of text
that cannot be matched based on lexical similarity alone.
The ISCC framework already includes a Text-Code based on lexical similarity for near-duplicate
matching. The ISCC Semantic Text-Code is a planned additional ISCC-UNIT focused on capturing a more
abstract and broader semantic similarity. It is engineered to be robust against a wide range of
variations and, most remarkably, translations of text that cannot be matched based on lexical
similarity alone.

### Translation Matching

One of the most interesting aspects of the Semantic Text-Code is its ability to generate **(near)-identical codes for
translations of the same text**. This means that the same content, expressed in different languages, can be identified
and linked, opening up new possibilities for cross-lingual content identification and similarity detection.
One of the most interesting aspects of the Semantic Text-Code is its ability to generate
**(near)-identical codes for translations of the same text**. This means that the same content,
expressed in different languages, can be identified and linked, opening up new possibilities for
cross-lingual content identification and similarity detection.
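
A hedged sketch of what this means in practice (not part of this commit): generate codes for a sentence and a translation of it, then count differing bits. The decoding step assumes the canonical ISCC base32 alphabet (RFC 4648, upper case, unpadded); two codes of the same type and length share an identical header, so XOR over the full decoded bytes measures only body differences.

```python
import base64

import iscc_sct as sct


def hamming(code_a: str, code_b: str) -> int:
    """Count differing bits between two ISCC codes of the same type and length."""

    def decode(code: str) -> bytes:
        body = code.removeprefix("ISCC:")
        return base64.b32decode(body + "=" * (-len(body) % 8))

    return sum(bin(a ^ b).count("1") for a, b in zip(decode(code_a), decode(code_b)))


en = sct.gen_text_code_semantic("The sun rises in the east.", bits=64)["iscc"]
de = sct.gen_text_code_semantic("Die Sonne geht im Osten auf.", bits=64)["iscc"]
print(en, de, hamming(en, de))  # translations should yield a small distance
```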

## Key Features

- **Semantic Similarity**: Utilizes deep learning models to generate codes that reflect the semantic essence of text.
- **Translation Matching**: Creates nearly identical codes for text translations, enabling cross-lingual content
identification.
- **Bit-Length Flexibility**: Supports generating codes of various bit lengths (up to 256 bits), allowing for
adjustable granularity in similarity detection.
- **ISCC Compatible**: Generates codes fully compatible with the ISCC specification, facilitating seamless integration
with existing ISCC-based systems.
- **Semantic Similarity**: Utilizes deep learning models to generate codes that reflect the semantic
essence of text.
- **Translation Matching**: Creates nearly identical codes for text translations, enabling
cross-lingual content identification.
- **Bit-Length Flexibility**: Supports generating codes of various bit lengths (up to 256 bits),
allowing for adjustable granularity in similarity detection.
- **ISCC Compatible**: Generates codes fully compatible with the ISCC specification, facilitating
seamless integration with existing ISCC-based systems.
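
To illustrate the bit-length flexibility listed above (a sketch, not part of this commit; `SctOptions` in `iscc_sct/options.py` constrains `bits` to multiples of 32):

```python
import iscc_sct as sct

text = "An example paragraph for code generation."  # hypothetical sample
for bits in (64, 128, 256):
    # Longer codes retain more of the embedding for finer-grained matching.
    print(bits, sct.gen_text_code_semantic(text, bits=bits)["iscc"])
```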

## Installation

@@ -138,40 +145,42 @@ This process ensures robustness to variations and translations, enabling cross-l

## Development and Contributing

We welcome contributions to enhance the capabilities, efficiency, and compatibility of this proof of concept with the
broader ISCC ecosystem. For development, install the project in development mode using
[Poetry](https://python-poetry.org):
We welcome contributions to enhance the capabilities, efficiency, and compatibility of this proof of
concept with the broader ISCC ecosystem. For development, install the project in development mode
using [Poetry](https://python-poetry.org):

```shell
git clone https://github.com/iscc/iscc-sct.git
cd iscc-sct
poetry install
```

If you have suggestions for improvements or bug fixes, please open an issue or pull request. For major changes, please
open an issue first to discuss your ideas.
If you have suggestions for improvements or bug fixes, please open an issue or pull request. For
major changes, please open an issue first to discuss your ideas.

## Future Work

### Shift Resistant Semantic Chunking

The current chunking strategy tries to maximize chunk sizes (up to 127 tokens) while still splitting at lexically
sensible boundaries with an overlap of up to 48 tokens. See
The current chunking strategy tries to maximize chunk sizes (up to 127 tokens) while still
splitting at lexically sensible boundaries with an overlap of up to 48 tokens. See
[text-splitter](https://github.com/benbrandt/text-splitter).
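
A sketch of inspecting the resulting chunks via the granular options that appear in `iscc_sct/options.py` and `iscc_sct/demo.py` below (not part of this commit; the input file is hypothetical):

```python
import iscc_sct as sct

text = open("document.txt", encoding="utf-8").read()  # hypothetical input
result = sct.gen_text_code_semantic(
    text, bits=64, simprints=True, offsets=True, sizes=True, contents=True
)
meta = sct.Metadata(**result).to_object_format()  # one Feature object per chunk
for feature in meta.features[0].simprints:
    # Adjacent chunks overlap by up to 48 tokens, so offset ranges may overlap.
    print(feature.offset, feature.size, feature.simprint)
```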

Cross-document chunk matching via granular Simprints can likely be improved significantly with a semantically aware and
shift-resistant chunking strategy. Better shift resistance would improve the chances that the boundaries detected for
semantically similar text sequences in different documents are aligned.
Cross-document chunk matching via granular Simprints can likely be improved significantly with a
semantically aware and shift-resistant chunking strategy. Better shift resistance would improve the
chances that the boundaries detected for semantically similar text sequences in different documents
are aligned.

### MRL based Embeddings

A text embedding model trained with [Matryoshka Representation Learning](https://arxiv.org/pdf/2205.13147) may yield
better results with short 64-bit Semantic Text-Codes.
A text embedding model trained with
[Matryoshka Representation Learning](https://arxiv.org/pdf/2205.13147) may yield better results with
short 64-bit Semantic Text-Codes.
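
A hedged sketch of the idea (editor's illustration, not this repository's code): an MRL-trained model front-loads information into the leading dimensions, so a 64-bit code could binarize the first 64 components instead of compressing the full vector. The sign threshold and vector size here are assumptions.

```python
import numpy as np


def mrl_code(embedding: np.ndarray, bits: int = 64) -> bytes:
    # MRL-style truncation: keep the most informative leading dimensions,
    # then binarize by sign into a compact similarity-preserving code.
    prefix = embedding[:bits]
    return np.packbits(prefix > 0).tobytes()


vec = np.random.default_rng(0).normal(size=384)  # stand-in for a real embedding
print(mrl_code(vec).hex())
```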

### Larger Chunk Sizes

A text embedding model with support for a larger `max_token` size (currently 128) may yield higher-order granular
simprints based on larger chunks of text.
A text embedding model with support for a larger `max_token` size (currently 128) may yield
higher-order granular simprints based on larger chunks of text.

## Acknowledgements

12 changes: 9 additions & 3 deletions iscc_sct/cli.py
@@ -8,9 +8,15 @@

def main():
parser = argparse.ArgumentParser(description="Generate Semantic Text-Codes for text files.")
parser.add_argument("path", type=str, help="Path to text files (supports glob patterns).", nargs="?")
parser.add_argument("-b", "--bits", type=int, default=256, help="Bit-Length of Code (default 256)")
parser.add_argument("-g", "--granular", action="store_true", help="Activate granular processing.")
parser.add_argument(
"path", type=str, help="Path to text files (supports glob patterns).", nargs="?"
)
parser.add_argument(
"-b", "--bits", type=int, default=256, help="Bit-Length of Code (default 256)"
)
parser.add_argument(
"-g", "--granular", action="store_true", help="Activate granular processing."
)
parser.add_argument("-d", "--debug", action="store_true", help="Show debugging messages.")
args = parser.parse_args()

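For reference, the reflowed argument definitions above correspond to invocations like the following (assuming the package installs an `sct` console script, as the CLI tests further below suggest):

```shell
sct "docs/*.txt" --bits 128 --granular
sct README.md -b 64 --debug
```
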
8 changes: 6 additions & 2 deletions iscc_sct/code_semantic_text.py
@@ -233,11 +233,15 @@ def model():
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
try:
with sct.timer("ONNXMODEL load time"):
return rt.InferenceSession(sct.MODEL_PATH, sess_options=so, providers=selected_onnx_providers)
return rt.InferenceSession(
sct.MODEL_PATH, sess_options=so, providers=selected_onnx_providers
)
except NoSuchFile: # pragma: no cover
with sct.timer("ONNXMODEL acquisition/load time"):
model_path = sct.get_model()
return rt.InferenceSession(model_path, sess_options=so, providers=selected_onnx_providers)
return rt.InferenceSession(
model_path, sess_options=so, providers=selected_onnx_providers
)


def tokenize_chunks(chunks):
24 changes: 18 additions & 6 deletions iscc_sct/demo.py
@@ -85,7 +85,9 @@ def generate_similarity_bar(similarity):

# Adjust the text position to be centered within the colored bar
text_position = "left: 50%;" if similarity >= 0 else "right: 50%;"
text_alignment = "transform: translateX(-50%);" if similarity >= 0 else "transform: translateX(50%);"
text_alignment = (
"transform: translateX(-50%);" if similarity >= 0 else "transform: translateX(50%);"
)

bar_html = f"""
<h3>Semantic Similarity</h3>
@@ -188,10 +190,14 @@ def update_sample_text(choice, group):
return samples[group][choice]

sample_dropdown_a.change(
lambda choice: update_sample_text(choice, "a"), inputs=[sample_dropdown_a], outputs=[in_text_a]
lambda choice: update_sample_text(choice, "a"),
inputs=[sample_dropdown_a],
outputs=[in_text_a],
)
sample_dropdown_b.change(
lambda choice: update_sample_text(choice, "b"), inputs=[sample_dropdown_b], outputs=[in_text_b]
lambda choice: update_sample_text(choice, "b"),
inputs=[sample_dropdown_b],
outputs=[in_text_b],
)

def process_text(text, nbits, suffix):
@@ -205,7 +211,9 @@ def process_text(text, nbits, suffix):
out_chunks_func: gr.HighlightedText(value=None, elem_id="chunked-text"),
}

result = sct.gen_text_code_semantic(text, bits=nbits, simprints=True, offsets=True, sizes=True, contents=True)
result = sct.gen_text_code_semantic(
text, bits=nbits, simprints=True, offsets=True, sizes=True, contents=True
)
iscc = sct.Metadata(**result).to_object_format()

# Generate chunked text with simprints and overlaps
@@ -275,8 +283,12 @@ def recalculate_iscc(text_a, text_b, nbits):
show_progress="full",
)

out_code_a.change(compare_codes, inputs=[out_code_a, out_code_b, in_iscc_bits], outputs=[out_similarity])
out_code_b.change(compare_codes, inputs=[out_code_a, out_code_b, in_iscc_bits], outputs=[out_similarity])
out_code_a.change(
compare_codes, inputs=[out_code_a, out_code_b, in_iscc_bits], outputs=[out_similarity]
)
out_code_b.change(
compare_codes, inputs=[out_code_a, out_code_b, in_iscc_bits], outputs=[out_similarity]
)

def reset_all():
return (
2 changes: 1 addition & 1 deletion iscc_sct/dev.py
@@ -38,5 +38,5 @@ def format_yml():
default_flow_style=False,
default_style=">",
allow_unicode=True,
line_break="\n"
line_break="\n",
)
24 changes: 18 additions & 6 deletions iscc_sct/models.py
@@ -82,7 +82,9 @@ def __repr__(self):
return self.pretty_repr()

def pretty_repr(self):
return self.model_dump_json(indent=2, exclude_unset=True, exclude_none=True, exclude_defaults=False)
return self.model_dump_json(
indent=2, exclude_unset=True, exclude_none=True, exclude_defaults=False
)


class Feature(PrettyBaseModel):
@@ -132,9 +134,15 @@ def to_index_format(self) -> "Metadata":
new_features.append(new_feature_set)
else:
new_feature_set.simprints = [f.simprint for f in feature_set.simprints]
new_feature_set.offsets = [f.offset for f in feature_set.simprints if f.offset is not None]
new_feature_set.sizes = [f.size for f in feature_set.simprints if f.size is not None]
new_feature_set.contents = [f.content for f in feature_set.simprints if f.content is not None]
new_feature_set.offsets = [
f.offset for f in feature_set.simprints if f.offset is not None
]
new_feature_set.sizes = [
f.size for f in feature_set.simprints if f.size is not None
]
new_feature_set.contents = [
f.content for f in feature_set.simprints if f.content is not None
]
new_features.append(new_feature_set)

return Metadata(iscc=self.iscc, characters=self.characters, features=new_features)
@@ -154,7 +162,9 @@ def get_content(self) -> Optional[str]:
# Convert to object format if in index format
feature_set = self.to_object_format().features[0]

if not all(feature.content and feature.offset is not None for feature in feature_set.simprints):
if not all(
feature.content and feature.offset is not None for feature in feature_set.simprints
):
return None

# Sort features by offset
@@ -191,7 +201,9 @@ def get_overlaps(self) -> List[str]:
# Convert to object format if in index format
feature_set = self.to_object_format().features[0]

if not all(feature.content and feature.offset is not None for feature in feature_set.simprints):
if not all(
feature.content and feature.offset is not None for feature in feature_set.simprints
):
return []

# Sort features by offset
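
The `to_index_format` and `get_content`/`get_overlaps` changes above reflow code that converts between two feature layouts; a sketch of the difference (field names from `iscc_sct/models.py`, values hypothetical):

```python
# Object format: one record per chunk.
object_format = {
    "simprints": [
        {"simprint": "XZjeSfdyVi0", "offset": 0, "size": 89, "content": "First chunk..."},
    ],
}

# Index format: parallel arrays, one list per attribute.
index_format = {
    "simprints": ["XZjeSfdyVi0"],
    "offsets": [0],
    "sizes": [89],
    "contents": ["First chunk..."],
}
```
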
28 changes: 21 additions & 7 deletions iscc_sct/options.py
@@ -29,15 +29,27 @@ class SctOptions(BaseSettings):
multiple_of=32,
)

characters: bool = Field(True, description="ISCC_SCT_CHARACTERS - Include document character count")
embedding: bool = Field(False, description="ISCC_SCT_EMBEDDING - Include global document embedding")
characters: bool = Field(
True, description="ISCC_SCT_CHARACTERS - Include document character count"
)
embedding: bool = Field(
False, description="ISCC_SCT_EMBEDDING - Include global document embedding"
)

precision: int = Field(8, description="ISCC_SCT_PRECISION - Max fractional digits for embeddings (default 8)")
precision: int = Field(
8, description="ISCC_SCT_PRECISION - Max fractional digits for embeddings (default 8)"
)

simprints: bool = Field(False, description="ISCC_SCT_SIMPRINTS - Include granular feature simprints")
offsets: bool = Field(False, description="ISCC_SCT_OFFSETS - Include offsets of granular features")
simprints: bool = Field(
False, description="ISCC_SCT_SIMPRINTS - Include granular feature simprints"
)
offsets: bool = Field(
False, description="ISCC_SCT_OFFSETS - Include offsets of granular features"
)

sizes: bool = Field(False, description="ISCC_SCT_SIZES - Include sizes of granular features (number of chars)")
sizes: bool = Field(
False, description="ISCC_SCT_SIZES - Include sizes of granular features (number of chars)"
)

contents: bool = Field(False, description="ISCC_SCT_CONTENTS - Include granular text chunks")

@@ -52,7 +64,9 @@ class SctOptions(BaseSettings):
description="ISCC_SCT_OVERLAP - Max tokens allowed to overlap between chunks (Default 48)",
)

trim: bool = Field(False, description="ISCC_SCT_TRIM - Trim whitespace from chunks (Default False)")
trim: bool = Field(
False, description="ISCC_SCT_TRIM - Trim whitespace from chunks (Default False)"
)

model_config = SettingsConfigDict(
env_file=".env",
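
Each field description above names an environment variable, so granular output can be toggled without code changes; a sketch, assuming the elided `SettingsConfigDict` sets `env_prefix="ISCC_SCT_"` as the descriptions suggest:

```python
import os

# Hypothetical: enable granular simprints and offsets via the environment.
os.environ["ISCC_SCT_SIMPRINTS"] = "true"
os.environ["ISCC_SCT_OFFSETS"] = "true"

from iscc_sct.options import SctOptions  # module path as in this diff

opts = SctOptions()
print(opts.simprints, opts.offsets)  # expect: True True
```
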
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -90,7 +90,7 @@ omit = ["iscc_sct/dev.py", "tests/", "iscc_sct/demo.py"]

[tool.poe.tasks]
format-code = { cmd = "ruff format", help = "Code style formatting with ruff" }
format-markdown = { cmd = "mdformat --wrap 119 --end-of-line lf README.md", help = "Markdown formatting with mdformat" }
format-markdown = { cmd = "mdformat --wrap 100 --end-of-line lf README.md", help = "Markdown formatting with mdformat" }
format-yml = { script = "iscc_sct.dev:format_yml", help = "Format YML files" }
convert-lf = { script = "iscc_sct.dev:convert_lf", help = "Convert line endings to LF"}
test = { cmd = "pytest --cov=iscc_sct --cov-fail-under=100", help = "Run tests with coverage" }
8 changes: 6 additions & 2 deletions tests/benchmark.py
@@ -32,7 +32,9 @@ def benchmark(folder):
elapsed_time = end_time - start_time
total_time += elapsed_time
file_count += 1
log.info(f"Processed {txt_path.name} in {elapsed_time:.2f} seconds. ISCC: {iscc_meta['iscc']}")
log.info(
f"Processed {txt_path.name} in {elapsed_time:.2f} seconds. ISCC: {iscc_meta['iscc']}"
)

if file_count > 0:
avg_time = total_time / file_count
@@ -45,7 +47,9 @@

def main():
parser = argparse.ArgumentParser(description="Benchmark ISCC Semantic-Code Text generation.")
parser.add_argument("folder", type=str, help="Directory containing text files for benchmarking.")
parser.add_argument(
"folder", type=str, help="Directory containing text files for benchmarking."
)
args = parser.parse_args()

benchmark(args.folder)
4 changes: 3 additions & 1 deletion tests/test_cli.py
@@ -52,7 +52,9 @@ def test_cli_generate_sct(sample_text_file):


def test_cli_generate_sct_granular(sample_text_file):
result = subprocess.run([sct, str(sample_text_file), "--granular"], capture_output=True, text=True)
result = subprocess.run(
[sct, str(sample_text_file), "--granular"], capture_output=True, text=True
)
assert result.returncode == 0
assert "features" in result.stdout

4 changes: 3 additions & 1 deletion tests/test_iscc_sct.py
@@ -178,7 +178,9 @@ def test_embed_tokens():
chunks = ["Hello World", "These are chunks"]
tokens = tokenize_chunks(chunks)
embeddings = embed_tokens(tokens)
assert list(embeddings[0][0][:3]) == pytest.approx([0.05907335, 0.11408358, 0.12727071], rel=1e-2)
assert list(embeddings[0][0][:3]) == pytest.approx(
[0.05907335, 0.11408358, 0.12727071], rel=1e-2
)


def test_embed_chunks():