
StringZilla 🦖

StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅

Python docs
C docs
JavaScript docs
Rust docs

Performance

StringZilla uses a heuristic so simple it's almost stupid... but it works. It matches the first few letters of words with hyper-scalar code to achieve `memcpy` speeds. The implementation fits into a single C99 header file and uses different SIMD flavors, falling back to SWAR on older platforms. So if you're haunted by `open(...).readlines()` and `str().splitlines()` taking forever, this should help 😊
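One way to read the heuristic is as a cheap prefix filter: compare only the first few bytes of the needle at each offset, and run the full comparison only when they match. The sketch below illustrates that idea in plain Python. It is not StringZilla's code; the function name and the 4-byte default are assumptions made for this example, and the real library does the prefix comparison with SIMD or SWAR instructions rather than a Python loop.

```python
def find_with_prefix_filter(haystack: bytes, needle: bytes, prefix_len: int = 4) -> int:
    """Return the first offset of `needle` in `haystack`, or -1 if absent."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    k = min(prefix_len, m)
    # Pack the first k bytes of the needle into one integer,
    # so the filter step is a single integer comparison.
    prefix = int.from_bytes(needle[:k], "little")
    for i in range(n - m + 1):
        # Cheap filter: most offsets are rejected here.
        if int.from_bytes(haystack[i:i + k], "little") != prefix:
            continue
        # Expensive check: runs only when the prefix already matched.
        if haystack[i:i + m] == needle:
            return i
    return -1
```

The result should agree with `bytes.find` for any input; only the order of the work differs.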

Substring Search

| Backend \ Device | IoT | Laptop | Server |
| :--- | ---: | ---: | ---: |
| **Speed Comparison** 🐇 | | | |
| Python `for` loop | 4 MB/s | 14 MB/s | 11 MB/s |
| C++ `for` loop | 520 MB/s | 1.0 GB/s | 900 MB/s |
| C++ `string.find` | 560 MB/s | 1.2 GB/s | 1.3 GB/s |
| Scalar StringZilla | 2 GB/s | 3.3 GB/s | 3.5 GB/s |
| Hyper-Scalar StringZilla | 4.3 GB/s | 12 GB/s | 12.1 GB/s |
| **Efficiency Metrics** 📊 | | | |
| CPU Specs | 8-core ARM, 0.5 W/core | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core |
| Performance/Core | 2.1 - 3.3 GB/s | 11 GB/s | 10.5 GB/s |
| Bytes/Joule | 4.2 GB/J | 2 GB/J | 1.6 GB/J |
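The Bytes/Joule row appears to be the per-core throughput divided by the per-core power draw, since one watt is one joule per second. The snippet below redoes that arithmetic with the figures from the table. It assumes the lower bound of the IoT range (2.1 GB/s) was used; the variable names are illustrative only.

```python
def bytes_per_joule(throughput_gb_per_s: float, watts_per_core: float) -> float:
    """GB/s divided by W (= J/s) gives GB/J."""
    return throughput_gb_per_s / watts_per_core

# (per-core throughput in GB/s, per-core power in W), taken from the table above.
devices = {
    "IoT": (2.1, 0.5),
    "Laptop": (11.0, 5.6),
    "Server": (10.5, 6.3),
}

efficiency = {
    name: round(bytes_per_joule(throughput, watts), 2)
    for name, (throughput, watts) in devices.items()
}
print(efficiency)
```

Under that assumption the IoT and laptop values land on the table's 4.2 GB/J and roughly 2 GB/J, while the server comes out slightly above the listed 1.6 GB/J, so the published figures look rounded.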

Partition & Sort

Coming soon.

Quick Start: Python 🐍

1. Install via pip: `pip install stringzilla`
2. Import classes: `from stringzilla import Str, File, Strs`

Basic Usage

StringZilla offers two mostly interchangeable core classes:

from stringzilla import Str, File

text1 = Str('some-string')
text2 = File('some-file.txt')

Str is designed to replace long Python str strings and wraps our C-level API. File, on the other hand, memory-maps a file from persistent storage without loading a copy of it into RAM. The contents of that file remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case is to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
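For readers unfamiliar with memory mapping, the sketch below shows the underlying idea using only Python's standard `mmap` module. It is not how `File` is implemented; the helper name `find_in_mapped_file` is made up for this example.

```python
import mmap

def find_in_mapped_file(path: str, needle: bytes) -> int:
    """Memory-map `path` read-only and return the offset of `needle`, or -1."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping immutable.
        # The OS pages data in on demand instead of copying the file into RAM.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped.find(needle)
```

Note that `mmap` refuses to map an empty file, so a real pipeline would need to handle that case separately. Several processes that map the same file read-only share the same pages from the OS page cache, which is what makes the split-between-child-processes pattern cheap.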

Basic Operations

Length: len(text) -> int
Indexing: text[42] -> str
Slicing: text[42:46] -> str

Advanced Operations

'substring' in text -> bool
text.contains('substring', start=0, end=9223372036854775807) -> bool
text.find('substring', start=0, end=9223372036854775807) -> int
text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs
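The default `end` value, 9223372036854775807, is 2**63 - 1, meaning "up to the end of the text". The `allowoverlap` flag is easiest to understand from a reference implementation. The plain-Python sketch below shows the semantics we assume for it: after a match, the search resumes either one character later (overlapping) or right after the match (non-overlapping). The function is illustrative and is not part of StringZilla.

```python
def count_substring(text: str, pattern: str, start: int = 0,
                    end: int = 2**63 - 1, allow_overlap: bool = False) -> int:
    """Count occurrences of `pattern` in `text[start:end]`."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    cursor = start
    while True:
        position = text.find(pattern, cursor, end)
        if position == -1:
            return count
        count += 1
        # Overlapping matches may begin one character after the previous one;
        # non-overlapping matches must begin after the previous one ends.
        cursor = position + (1 if allow_overlap else len(pattern))
```

For example, `'aa'` fits into `'aaaa'` twice without overlap and three times with overlap. The non-overlapping mode matches what Python's built-in `str.count` returns.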

Collection-Level Operations

Once split into a Strs object, you can sort, shuffle, and reorganize the slices.

lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)

Need copies?

sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)

Basic list-like operations are also supported:

lines.append('Pythonic string')
lines.extend(shuffled_copy)
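The split between in-place methods (`sort`, `shuffle`) and copying ones (`sorted`, `shuffled`) mirrors Python's own `list.sort()` versus `sorted()`. The sketch below reproduces the same pattern with a plain `list` and the standard `random` module, purely for comparison. The `shuffled` helper is hypothetical, and the element order it produces for a given seed will not match StringZilla's.

```python
import random

def shuffled(items: list, seed: int) -> list:
    """Return a shuffled copy, leaving `items` untouched."""
    copy = list(items)
    # A dedicated Random instance makes the order reproducible per seed
    # without touching the global random state.
    random.Random(seed).shuffle(copy)
    return copy

lines = "delta\nalpha\ncharlie\nbravo".split("\n")

sorted_copy = sorted(lines)            # new list, `lines` keeps its order
shuffled_copy = shuffled(lines, 42)    # new list, same result for the same seed
lines.sort()                           # reorders `lines` in place
```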

Quick Start: C 🛠️

There is an ABI-stable C99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.

#include "stringzilla.h"

// Initialize your haystack and needle
strzl_haystack_t haystack = {your_text, your_text_length};
strzl_needle_t needle = {your_subtext, your_subtext_length, your_anomaly_offset};

// Perform string-level operations
size_t character_count = strzl_naive_count_char(haystack, 'a');
size_t character_position = strzl_naive_find_char(haystack, 'a');
size_t substring_position = strzl_naive_find_substr(haystack, needle);

// Perform collection-level operations
strzl_array_t array = {your_order, your_count, your_get_begin, your_get_length, your_handle};
strzl_sort(&array, &your_config);

Contributing 👾

Future development plans include:

Replace PyBind11 with the CPython C API.
Reverse-order operations in Python #12.
Bindings for JavaScript #25, Java, and Rust.
Faster string sorting algorithm.
Splitting CSV rows into columns.
Splitting with multiple separators at once #29.
UTF-8 validation.
Arm SVE backend.

Here's how to set up your dev environment and run some tests.

Development

CPython:

# Clean up and install
rm -rf build && pip install -e . && pytest scripts/test.py -s -x

# Install without dependencies
pip install -e . --no-index --no-deps

NodeJS:

npm install && node javascript/test.js

Benchmarking

To benchmark a custom file and pattern combination:

python scripts/bench.py --haystack_path "your file" --needle "your pattern"

To benchmark on synthetic data:

python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
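The synthetic setup presumably repeats the haystack pattern until it reaches the requested length. With `"abcd"` as the pattern and `"abce"` as the needle, every fourth offset matches the first three characters and then fails on the last one, and the needle is never found, so the whole haystack gets scanned. The sketch below reproduces that setup with Python's built-in `str.find` as the baseline. It is not the code in `scripts/bench.py`; both helper names are made up for this example.

```python
import time

def build_haystack(pattern: str, length: int) -> str:
    """Repeat `pattern` and cut the result to exactly `length` characters."""
    repeats = length // len(pattern) + 1
    return (pattern * repeats)[:length]

def measure_find_throughput(haystack: str, needle: str) -> float:
    """Return characters scanned per second by a single `str.find` call."""
    started = time.perf_counter()
    haystack.find(needle)
    elapsed = time.perf_counter() - started
    # Guard against a zero reading from the timer on very short inputs.
    return len(haystack) / max(elapsed, 1e-9)

if __name__ == "__main__":
    haystack = build_haystack("abcd", 10_000_000)
    rate = measure_find_throughput(haystack, "abce")
    print(f"{rate / 1e9:.2f} GB/s")  # ASCII text: one character is one byte
```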

Packaging

To validate packaging:

cibuildwheel --platform linux

Compiling C++ Tests

# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test

License 📜

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.

If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack 🤗
