Reformat to line-length 100
titusz committed Aug 19, 2024
1 parent 74f63fd commit fccbc4b
Showing 14 changed files with 161 additions and 78 deletions.
95 changes: 52 additions & 43 deletions README.md
@@ -5,50 +5,57 @@
[![Downloads](https://pepy.tech/badge/iscc-sct)](https://pepy.tech/project/iscc-sct)

> [!CAUTION]
> **This is a proof of concept.** All releases with version numbers below v1.0.0 may break backward compatibility and
> produce incompatible Semantic Text-Codes. The algorithms of this `iscc-sct` repository are experimental and not part
> of the official [ISO 24138:2024](https://www.iso.org/standard/77899.html) standard.
> **This is a proof of concept.** All releases with version numbers below v1.0.0 may break backward
> compatibility and produce incompatible Semantic Text-Codes. The algorithms of this `iscc-sct`
> repository are experimental and not part of the official
> [ISO 24138:2024](https://www.iso.org/standard/77899.html) standard.
`iscc-sct` is a semantic Text-Code for the [ISCC](https://core.iscc.codes) (*International Standard Content Code*).
Semantic Text-Codes are short identifiers created from text documents that preserve similarity (in Hamming distance)
for semantically similar cross-lingual text inputs.
`iscc-sct` is a semantic Text-Code for the [ISCC](https://core.iscc.codes) (*International Standard
Content Code*). Semantic Text-Codes are short identifiers created from text documents that preserve
similarity (in Hamming distance) for semantically similar cross-lingual text inputs.
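
As a quick illustration (not part of this commit), a Semantic Text-Code can be generated with `gen_text_code_semantic`, the function exercised throughout the diffs below; the sample text is hypothetical:

```python
import iscc_sct as sct

# Generate a 256-bit Semantic Text-Code. The result dict maps onto the
# Metadata model in iscc_sct/models.py and contains the canonical "iscc" code.
result = sct.gen_text_code_semantic("The quick brown fox jumps over the lazy dog.", bits=256)
print(result["iscc"])  # actual value depends on the embedding model
```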

## What is the ISCC

The ISCC is a combination of various similarity-preserving fingerprints and an identifier for digital media content.
The ISCC is a combination of various similarity-preserving fingerprints and an identifier for
digital media content.

ISCCs are generated algorithmically from digital content, just like cryptographic hashes. However, instead of using a
single cryptographic hash function to identify data only, the ISCC uses various algorithms to create a composite
identifier that exhibits similarity-preserving properties (soft hash or Simprint).
ISCCs are generated algorithmically from digital content, just like cryptographic hashes. However,
instead of using a single cryptographic hash function to identify data only, the ISCC uses various
algorithms to create a composite identifier that exhibits similarity-preserving properties (soft
hash or Simprint).

The component-based structure of the ISCC identifies content at multiple levels of abstraction. Each component is
self-describing, modular, and can be used separately or with others to aid in various content identification tasks. The
algorithmic design supports content deduplication, database synchronization, indexing, integrity verification,
timestamping, versioning, data provenance, similarity clustering, anomaly detection, usage tracking, allocation of
royalties, fact-checking and general digital asset management use-cases.
The component-based structure of the ISCC identifies content at multiple levels of abstraction. Each
component is self-describing, modular, and can be used separately or with others to aid in various
content identification tasks. The algorithmic design supports content deduplication, database
synchronization, indexing, integrity verification, timestamping, versioning, data provenance,
similarity clustering, anomaly detection, usage tracking, allocation of royalties, fact-checking and
general digital asset management use-cases.

## What is ISCC Semantic Text-Code?

The ISCC framework already includes a Text-Code based on lexical similarity for near-duplicate matching. The ISCC
Semantic Text-Code is a planned additional ISCC-UNIT focused on capturing a more abstract and broader semantic
similarity. It is engineered to be robust against a wide range of variations and, most remarkably, translations of text
that cannot be matched based on lexical similarity alone.
The ISCC framework already includes a Text-Code based on lexical similarity for near-duplicate
matching. The ISCC Semantic Text-Code is a planned additional ISCC-UNIT focused on capturing a more
abstract and broader semantic similarity. It is engineered to be robust against a wide range of
variations and, most remarkably, translations of text that cannot be matched based on lexical
similarity alone.

### Translation Matching

One of the most interesting aspects of the Semantic Text-Code is its ability to generate **(near)-identical codes for
translations of the same text**. This means that the same content, expressed in different languages, can be identified
and linked, opening up new possibilities for cross-lingual content identification and similarity detection.
One of the most interesting aspects of the Semantic Text-Code is its ability to generate
**(near)-identical codes for translations of the same text**. This means that the same content,
expressed in different languages, can be identified and linked, opening up new possibilities for
cross-lingual content identification and similarity detection.
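
A hedged sketch of what this means in practice (not part of this commit): generate codes for a sentence and a translation of it, then count differing bits. The decoding step assumes the canonical ISCC base32 alphabet (RFC 4648, upper case, unpadded); two codes of the same type and length share an identical header, so XOR over the full decoded bytes measures only body differences.

```python
import base64

import iscc_sct as sct


def hamming(code_a: str, code_b: str) -> int:
    """Count differing bits between two ISCC codes of the same type and length."""

    def decode(code: str) -> bytes:
        body = code.removeprefix("ISCC:")
        return base64.b32decode(body + "=" * (-len(body) % 8))

    return sum(bin(a ^ b).count("1") for a, b in zip(decode(code_a), decode(code_b)))


en = sct.gen_text_code_semantic("The sun rises in the east.", bits=64)["iscc"]
de = sct.gen_text_code_semantic("Die Sonne geht im Osten auf.", bits=64)["iscc"]
print(en, de, hamming(en, de))  # translations should yield a small distance
```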

## Key Features

- **Semantic Similarity**: Utilizes deep learning models to generate codes that reflect the semantic essence of text.
- **Translation Matching**: Creates nearly identical codes for text translations, enabling cross-lingual content
identification.
- **Bit-Length Flexibility**: Supports generating codes of various bit lengths (up to 256 bits), allowing for
adjustable granularity in similarity detection.
- **ISCC Compatible**: Generates codes fully compatible with the ISCC specification, facilitating seamless integration
with existing ISCC-based systems.
- **Semantic Similarity**: Utilizes deep learning models to generate codes that reflect the semantic
essence of text.
- **Translation Matching**: Creates nearly identical codes for text translations, enabling
cross-lingual content identification.
- **Bit-Length Flexibility**: Supports generating codes of various bit lengths (up to 256 bits),
allowing for adjustable granularity in similarity detection.
- **ISCC Compatible**: Generates codes fully compatible with the ISCC specification, facilitating
seamless integration with existing ISCC-based systems.
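
To illustrate the bit-length flexibility listed above (a sketch, not part of this commit; `SctOptions` in `iscc_sct/options.py` constrains `bits` to multiples of 32):

```python
import iscc_sct as sct

text = "An example paragraph for code generation."  # hypothetical sample
for bits in (64, 128, 256):
    # Longer codes retain more of the embedding for finer-grained matching.
    print(bits, sct.gen_text_code_semantic(text, bits=bits)["iscc"])
```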

## Installation

@@ -138,40 +145,42 @@ This process ensures robustness to variations and translations, enabling cross-l

## Development and Contributing

We welcome contributions to enhance the capabilities, efficiency, and compatibility of this proof of concept with the
broader ISCC ecosystem. For development, install the project in development mode using
[Poetry](https://python-poetry.org):
We welcome contributions to enhance the capabilities, efficiency, and compatibility of this proof of
concept with the broader ISCC ecosystem. For development, install the project in development mode
using [Poetry](https://python-poetry.org):

```shell
git clone https://github.com/iscc/iscc-sct.git
cd iscc-sct
poetry install
```

If you have suggestions for improvements or bug fixes, please open an issue or pull request. For major changes, please
open an issue first to discuss your ideas.
If you have suggestions for improvements or bug fixes, please open an issue or pull request. For
major changes, please open an issue first to discuss your ideas.

## Future Work

### Shift Resistant Semantic Chunking

The current chunking strategy tries to maximize chunk sizes (up to 127 tokens) while still splitting at lexically
sensible boundaries with an overlap of up to 48 tokens. See
The current chunking strategy tries to maximize chunk sizes (up to 127 tokens) while still
splitting at lexically sensible boundaries with an overlap of up to 48 tokens. See
[text-splitter](https://github.com/benbrandt/text-splitter).
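
A sketch of inspecting the resulting chunks via the granular options that appear in `iscc_sct/options.py` and `iscc_sct/demo.py` below (not part of this commit; the input file is hypothetical):

```python
import iscc_sct as sct

text = open("document.txt", encoding="utf-8").read()  # hypothetical input
result = sct.gen_text_code_semantic(
    text, bits=64, simprints=True, offsets=True, sizes=True, contents=True
)
meta = sct.Metadata(**result).to_object_format()  # one Feature object per chunk
for feature in meta.features[0].simprints:
    # Adjacent chunks overlap by up to 48 tokens, so offset ranges may overlap.
    print(feature.offset, feature.size, feature.simprint)
```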

Cross-document chunk matching via granular Simprints can likely be improved significantly with a semantically aware and
shift-resistant chunking strategy. Better shift resistance would improve the chances that the boundaries detected for
semantically similar text sequences in different documents are aligned.
Cross-document chunk matching via granular Simprints can likely be improved significantly with a
semantically aware and shift-resistant chunking strategy. Better shift resistance would improve the
chances that the boundaries detected for semantically similar text sequences in different documents
are aligned.

### MRL based Embeddings

A text embedding model trained with [Matryoshka Representation Learning](https://arxiv.org/pdf/2205.13147) may yield
better results with short 64-bit Semantic Text-Codes.
A text embedding model trained with
[Matryoshka Representation Learning](https://arxiv.org/pdf/2205.13147) may yield better results with
short 64-bit Semantic Text-Codes.
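
A hedged sketch of the idea (editor's illustration, not this repository's code): an MRL-trained model front-loads information into the leading dimensions, so a 64-bit code could binarize the first 64 components instead of compressing the full vector. The sign threshold and vector size here are assumptions.

```python
import numpy as np


def mrl_code(embedding: np.ndarray, bits: int = 64) -> bytes:
    # MRL-style truncation: keep the most informative leading dimensions,
    # then binarize by sign into a compact similarity-preserving code.
    prefix = embedding[:bits]
    return np.packbits(prefix > 0).tobytes()


vec = np.random.default_rng(0).normal(size=384)  # stand-in for a real embedding
print(mrl_code(vec).hex())
```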

### Larger Chunk Sizes

A text embedding model with support for a larger `max_token` size (currently 128) may yield higher-order granular
simprints based on larger chunks of text.
A text embedding model with support for a larger `max_token` size (currently 128) may yield
higher-order granular simprints based on larger chunks of text.

## Acknowledgements

12 changes: 9 additions & 3 deletions iscc_sct/cli.py
@@ -8,9 +8,15 @@

def main():
parser = argparse.ArgumentParser(description="Generate Semantic Text-Codes for text files.")
parser.add_argument("path", type=str, help="Path to text files (supports glob patterns).", nargs="?")
parser.add_argument("-b", "--bits", type=int, default=256, help="Bit-Length of Code (default 256)")
parser.add_argument("-g", "--granular", action="store_true", help="Activate granular processing.")
parser.add_argument(
"path", type=str, help="Path to text files (supports glob patterns).", nargs="?"
)
parser.add_argument(
"-b", "--bits", type=int, default=256, help="Bit-Length of Code (default 256)"
)
parser.add_argument(
"-g", "--granular", action="store_true", help="Activate granular processing."
)
parser.add_argument("-d", "--debug", action="store_true", help="Show debugging messages.")
args = parser.parse_args()

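For reference, the reflowed argument definitions above correspond to invocations like the following (assuming the package installs an `sct` console script, as the CLI tests further below suggest):

```shell
sct "docs/*.txt" --bits 128 --granular
sct README.md -b 64 --debug
```
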
8 changes: 6 additions & 2 deletions iscc_sct/code_semantic_text.py
@@ -233,11 +233,15 @@ def model():
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
try:
with sct.timer("ONNXMODEL load time"):
return rt.InferenceSession(sct.MODEL_PATH, sess_options=so, providers=selected_onnx_providers)
return rt.InferenceSession(
sct.MODEL_PATH, sess_options=so, providers=selected_onnx_providers
)
except NoSuchFile: # pragma: no cover
with sct.timer("ONNXMODEL acquisition/load time"):
model_path = sct.get_model()
return rt.InferenceSession(model_path, sess_options=so, providers=selected_onnx_providers)
return rt.InferenceSession(
model_path, sess_options=so, providers=selected_onnx_providers
)


def tokenize_chunks(chunks):
24 changes: 18 additions & 6 deletions iscc_sct/demo.py
@@ -85,7 +85,9 @@ def generate_similarity_bar(similarity):

# Adjust the text position to be centered within the colored bar
text_position = "left: 50%;" if similarity >= 0 else "right: 50%;"
text_alignment = "transform: translateX(-50%);" if similarity >= 0 else "transform: translateX(50%);"
text_alignment = (
"transform: translateX(-50%);" if similarity >= 0 else "transform: translateX(50%);"
)

bar_html = f"""
<h3>Semantic Similarity</h3>
@@ -188,10 +190,14 @@ def update_sample_text(choice, group):
return samples[group][choice]

sample_dropdown_a.change(
lambda choice: update_sample_text(choice, "a"), inputs=[sample_dropdown_a], outputs=[in_text_a]
lambda choice: update_sample_text(choice, "a"),
inputs=[sample_dropdown_a],
outputs=[in_text_a],
)
sample_dropdown_b.change(
lambda choice: update_sample_text(choice, "b"), inputs=[sample_dropdown_b], outputs=[in_text_b]
lambda choice: update_sample_text(choice, "b"),
inputs=[sample_dropdown_b],
outputs=[in_text_b],
)

def process_text(text, nbits, suffix):
@@ -205,7 +211,9 @@ def process_text(text, nbits, suffix):
out_chunks_func: gr.HighlightedText(value=None, elem_id="chunked-text"),
}

result = sct.gen_text_code_semantic(text, bits=nbits, simprints=True, offsets=True, sizes=True, contents=True)
result = sct.gen_text_code_semantic(
text, bits=nbits, simprints=True, offsets=True, sizes=True, contents=True
)
iscc = sct.Metadata(**result).to_object_format()

# Generate chunked text with simprints and overlaps
@@ -275,8 +283,12 @@ def recalculate_iscc(text_a, text_b, nbits):
show_progress="full",
)

out_code_a.change(compare_codes, inputs=[out_code_a, out_code_b, in_iscc_bits], outputs=[out_similarity])
out_code_b.change(compare_codes, inputs=[out_code_a, out_code_b, in_iscc_bits], outputs=[out_similarity])
out_code_a.change(
compare_codes, inputs=[out_code_a, out_code_b, in_iscc_bits], outputs=[out_similarity]
)
out_code_b.change(
compare_codes, inputs=[out_code_a, out_code_b, in_iscc_bits], outputs=[out_similarity]
)

def reset_all():
return (
2 changes: 1 addition & 1 deletion iscc_sct/dev.py
@@ -38,5 +38,5 @@ def format_yml():
default_flow_style=False,
default_style=">",
allow_unicode=True,
line_break="\n"
line_break="\n",
)
24 changes: 18 additions & 6 deletions iscc_sct/models.py
@@ -82,7 +82,9 @@ def __repr__(self):
return self.pretty_repr()

def pretty_repr(self):
return self.model_dump_json(indent=2, exclude_unset=True, exclude_none=True, exclude_defaults=False)
return self.model_dump_json(
indent=2, exclude_unset=True, exclude_none=True, exclude_defaults=False
)


class Feature(PrettyBaseModel):
@@ -132,9 +134,15 @@ def to_index_format(self) -> "Metadata":
new_features.append(new_feature_set)
else:
new_feature_set.simprints = [f.simprint for f in feature_set.simprints]
new_feature_set.offsets = [f.offset for f in feature_set.simprints if f.offset is not None]
new_feature_set.sizes = [f.size for f in feature_set.simprints if f.size is not None]
new_feature_set.contents = [f.content for f in feature_set.simprints if f.content is not None]
new_feature_set.offsets = [
f.offset for f in feature_set.simprints if f.offset is not None
]
new_feature_set.sizes = [
f.size for f in feature_set.simprints if f.size is not None
]
new_feature_set.contents = [
f.content for f in feature_set.simprints if f.content is not None
]
new_features.append(new_feature_set)

return Metadata(iscc=self.iscc, characters=self.characters, features=new_features)
@@ -154,7 +162,9 @@ def get_content(self) -> Optional[str]:
# Convert to object format if in index format
feature_set = self.to_object_format().features[0]

if not all(feature.content and feature.offset is not None for feature in feature_set.simprints):
if not all(
feature.content and feature.offset is not None for feature in feature_set.simprints
):
return None

# Sort features by offset
@@ -191,7 +201,9 @@ def get_overlaps(self) -> List[str]:
# Convert to object format if in index format
feature_set = self.to_object_format().features[0]

if not all(feature.content and feature.offset is not None for feature in feature_set.simprints):
if not all(
feature.content and feature.offset is not None for feature in feature_set.simprints
):
return []

# Sort features by offset
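
The `to_index_format` and `get_content`/`get_overlaps` changes above reflow code that converts between two feature layouts; a sketch of the difference (field names from `iscc_sct/models.py`, values hypothetical):

```python
# Object format: one record per chunk.
object_format = {
    "simprints": [
        {"simprint": "XZjeSfdyVi0", "offset": 0, "size": 89, "content": "First chunk..."},
    ],
}

# Index format: parallel arrays, one list per attribute.
index_format = {
    "simprints": ["XZjeSfdyVi0"],
    "offsets": [0],
    "sizes": [89],
    "contents": ["First chunk..."],
}
```
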
28 changes: 21 additions & 7 deletions iscc_sct/options.py
@@ -29,15 +29,27 @@ class SctOptions(BaseSettings):
multiple_of=32,
)

characters: bool = Field(True, description="ISCC_SCT_CHARACTERS - Include document character count")
embedding: bool = Field(False, description="ISCC_SCT_EMBEDDING - Include global document embedding")
characters: bool = Field(
True, description="ISCC_SCT_CHARACTERS - Include document character count"
)
embedding: bool = Field(
False, description="ISCC_SCT_EMBEDDING - Include global document embedding"
)

precision: int = Field(8, description="ISCC_SCT_PRECISION - Max fractional digits for embeddings (default 8)")
precision: int = Field(
8, description="ISCC_SCT_PRECISION - Max fractional digits for embeddings (default 8)"
)

simprints: bool = Field(False, description="ISCC_SCT_SIMPRINTS - Include granular feature simprints")
offsets: bool = Field(False, description="ISCC_SCT_OFFSETS - Include offsets of granular features")
simprints: bool = Field(
False, description="ISCC_SCT_SIMPRINTS - Include granular feature simprints"
)
offsets: bool = Field(
False, description="ISCC_SCT_OFFSETS - Include offsets of granular features"
)

sizes: bool = Field(False, description="ISCC_SCT_SIZES - Include sizes of granular features (number of chars)")
sizes: bool = Field(
False, description="ISCC_SCT_SIZES - Include sizes of granular features (number of chars)"
)

contents: bool = Field(False, description="ISCC_SCT_CONTENTS - Include granular text chunks")

@@ -52,7 +64,9 @@ class SctOptions(BaseSettings):
description="ISCC_SCT_OVERLAP - Max tokens allowed to overlap between chunks (Default 48)",
)

trim: bool = Field(False, description="ISCC_SCT_TRIM - Trim whitespace from chunks (Default False)")
trim: bool = Field(
False, description="ISCC_SCT_TRIM - Trim whitespace from chunks (Default False)"
)

model_config = SettingsConfigDict(
env_file=".env",
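
Each field description above names an environment variable, so granular output can be toggled without code changes; a sketch, assuming the elided `SettingsConfigDict` sets `env_prefix="ISCC_SCT_"` as the descriptions suggest:

```python
import os

# Hypothetical: enable granular simprints and offsets via the environment.
os.environ["ISCC_SCT_SIMPRINTS"] = "true"
os.environ["ISCC_SCT_OFFSETS"] = "true"

from iscc_sct.options import SctOptions  # module path as in this diff

opts = SctOptions()
print(opts.simprints, opts.offsets)  # expect: True True
```
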
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -90,7 +90,7 @@ omit = ["iscc_sct/dev.py", "tests/", "iscc_sct/demo.py"]

[tool.poe.tasks]
format-code = { cmd = "ruff format", help = "Code style formatting with ruff" }
format-markdown = { cmd = "mdformat --wrap 119 --end-of-line lf README.md", help = "Markdown formatting with mdformat" }
format-markdown = { cmd = "mdformat --wrap 100 --end-of-line lf README.md", help = "Markdown formatting with mdformat" }
format-yml = { script = "iscc_sct.dev:format_yml", help = "Format YML files" }
convert-lf = { script = "iscc_sct.dev:convert_lf", help = "Convert line endings to LF"}
test = { cmd = "pytest --cov=iscc_sct --cov-fail-under=100", help = "Run tests with coverage" }
8 changes: 6 additions & 2 deletions tests/benchmark.py
@@ -32,7 +32,9 @@ def benchmark(folder):
elapsed_time = end_time - start_time
total_time += elapsed_time
file_count += 1
log.info(f"Processed {txt_path.name} in {elapsed_time:.2f} seconds. ISCC: {iscc_meta['iscc']}")
log.info(
f"Processed {txt_path.name} in {elapsed_time:.2f} seconds. ISCC: {iscc_meta['iscc']}"
)

if file_count > 0:
avg_time = total_time / file_count
@@ -45,7 +47,9 @@

def main():
parser = argparse.ArgumentParser(description="Benchmark ISCC Semantic-Code Text generation.")
parser.add_argument("folder", type=str, help="Directory containing text files for benchmarking.")
parser.add_argument(
"folder", type=str, help="Directory containing text files for benchmarking."
)
args = parser.parse_args()

benchmark(args.folder)
4 changes: 3 additions & 1 deletion tests/test_cli.py
@@ -52,7 +52,9 @@ def test_cli_generate_sct(sample_text_file):


def test_cli_generate_sct_granular(sample_text_file):
result = subprocess.run([sct, str(sample_text_file), "--granular"], capture_output=True, text=True)
result = subprocess.run(
[sct, str(sample_text_file), "--granular"], capture_output=True, text=True
)
assert result.returncode == 0
assert "features" in result.stdout

4 changes: 3 additions & 1 deletion tests/test_iscc_sct.py
@@ -178,7 +178,9 @@ def test_embed_tokens():
chunks = ["Hello World", "These are chunks"]
tokens = tokenize_chunks(chunks)
embeddings = embed_tokens(tokens)
assert list(embeddings[0][0][:3]) == pytest.approx([0.05907335, 0.11408358, 0.12727071], rel=1e-2)
assert list(embeddings[0][0][:3]) == pytest.approx(
[0.05907335, 0.11408358, 0.12727071], rel=1e-2
)


def test_embed_chunks():