Commit 9dd1f8c

Improve: Reorganizing for readability

ashvardanian committed Jan 28, 2024
1 parent ae7b119 commit 9dd1f8c

Showing 3 changed files with 588 additions and 439 deletions.
152 changes: 146 additions & 6 deletions README.md
@@ -335,6 +335,62 @@ sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length,
sz_sort(&array, &your_config);
```
Unlike LibC:

- all strings are expected to have a length, and are not necessarily null-terminated;
- every operation has a reverse-order counterpart.

That way `sz_find` and `sz_rfind` are similar to `strstr` and `strrstr` in LibC.
Similarly, `sz_find_byte` and `sz_rfind_byte` replace `memchr` and `memrchr`.
The `sz_find_charset` maps to `strspn` and `strcspn`, while `sz_rfind_charset` has no sibling in LibC.
<table>
    <tr>
        <th>LibC Functionality</th>
        <th>StringZilla Equivalents</th>
    </tr>
    <tr>
        <td><code>memchr(haystack, needle, haystack_length)</code>, <code>strchr</code></td>
        <td><code>sz_find_byte(haystack, haystack_length, needle)</code></td>
    </tr>
    <tr>
        <td><code>memrchr(haystack, needle, haystack_length)</code></td>
        <td><code>sz_rfind_byte(haystack, haystack_length, needle)</code></td>
    </tr>
    <tr>
        <td><code>memcmp</code>, <code>strcmp</code></td>
        <td><code>sz_order</code>, <code>sz_equal</code></td>
    </tr>
    <tr>
        <td><code>strlen(haystack)</code></td>
        <td><code>sz_find_byte(haystack, haystack_length, "\0")</code></td>
    </tr>
    <tr>
        <td><code>strcspn(haystack, needles)</code></td>
        <td><code>sz_find_charset(haystack, haystack_length, needles_bitset)</code></td>
    </tr>
    <tr>
        <td><code>strspn(haystack, needles)</code></td>
        <td><code>sz_find_charset(haystack, haystack_length, inverted_needles_bitset)</code></td>
    </tr>
    <tr>
        <td><code>memmem(haystack, haystack_length, needle, needle_length)</code>, <code>strstr</code></td>
        <td><code>sz_find(haystack, haystack_length, needle, needle_length)</code></td>
    </tr>
    <tr>
        <td><code>memcpy(destination, source, destination_length)</code></td>
        <td><code>sz_copy(destination, source, destination_length)</code></td>
    </tr>
    <tr>
        <td><code>memmove(destination, source, destination_length)</code></td>
        <td><code>sz_move(destination, source, destination_length)</code></td>
    </tr>
    <tr>
        <td><code>memset(destination, value, destination_length)</code></td>
        <td><code>sz_fill(destination, destination_length, value)</code></td>
    </tr>
</table>
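
For instance, here is a minimal sketch of a length-delimited byte search with both APIs; it assumes the single-header C library is available as `<stringzilla/stringzilla.h>` and that `sz_find_byte` returns a pointer to the match or `NULL`:

```c
#include <string.h>                  // `memchr`
#include <stringzilla/stringzilla.h> // `sz_find_byte`

int main(void) {
    char const haystack[] = "Hello, world!";
    char const needle = 'w';
    // LibC scans up to `length` bytes and returns a pointer or NULL.
    char const *libc_match = memchr(haystack, needle, 13);
    // StringZilla takes the needle by address, but the semantics are the same.
    sz_cptr_t sz_match = sz_find_byte(haystack, 13, &needle);
    return libc_match == (char const *)sz_match ? 0 : 1;
}
```
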
### Basic Usage with C++ 11 and Newer
There is a stable C++ 11 interface available in the `ashvardanian::stringzilla` namespace.
@@ -765,11 +821,6 @@ __`SZ_USE_MISALIGNED_LOADS`__:
> Going from `char`-like types to `uint64_t`-like ones can significantly accelerate the serial (SWAR) backend.
> So consider enabling it if you are building for some embedded device.
__`SZ_CACHE_LINE_WIDTH`, `SZ_SWAR_THRESHOLD`__:

> The width of the cache line and the "SWAR threshold" are performance-optimization settings.
> They will mostly affect the serial performance.
__`SZ_AVOID_LIBC`__:

> When using the C header-only library one can disable the use of LibC.
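
As a minimal sketch, these compile-time settings are meant to be defined before the header is included; the exact values below are illustrative:

```c
// Illustrative compile-time configuration; define before including the header.
#define SZ_USE_MISALIGNED_LOADS 1 // allow unaligned word loads in the SWAR backend
#define SZ_AVOID_LIBC 1           // don't rely on LibC symbols in the header-only build
#include <stringzilla/stringzilla.h>
```
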
@@ -788,9 +839,98 @@ __`STRINGZILLA_BUILD_SHARED`, `STRINGZILLA_BUILD_TEST`, `STRINGZILLA_BUILD_BENCH`__:
## Algorithms & Design Decisions 📚

StringZilla aims to optimize some of the slowest string operations.
Some popular operations, however, like equality comparisons and relative order checking, almost always complete within the very first bytes of either string.
For such operations vectorization is almost useless, unless huge and very similar strings are involved.
StringZilla implements those operations as well, but don't expect substantial speedups from them.

### Hashing

Hashing is a deeply studied subject with countless implementations.
Choosing the right hashing algorithm for your application can be crucial from both a performance and a security standpoint.
In StringZilla a 64-bit rolling hash function is reused for both string hashes and substring hashes (Rabin-style fingerprints), and is accelerated with SIMD for longer strings.
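
To give the flavor of the approach, here is a minimal Rabin-style rolling hash sketch; the multiplier is illustrative and is not StringZilla's actual mixing constant:

```c
#include <stddef.h>
#include <stdint.h>

// Polynomial rolling hash over bytes, with arithmetic modulo 2^64.
#define MULTIPLIER 257ull

// Hash the first `window` bytes of `text`.
uint64_t hash_window(char const *text, size_t window) {
    uint64_t hash = 0;
    for (size_t i = 0; i != window; ++i) hash = hash * MULTIPLIER + (uint8_t)text[i];
    return hash;
}

// Slide the window right by one byte in constant time: drop the `out` byte,
// append the `in` byte. `power` must be MULTIPLIER^(window - 1) modulo 2^64.
uint64_t roll(uint64_t hash, uint64_t power, char out, char in) {
    return (hash - (uint8_t)out * power) * MULTIPLIER + (uint8_t)in;
}
```

This is what makes substring fingerprinting cheap: each shift costs a couple of multiplications and additions, regardless of the window width.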

#### Why not CRC32?

Cyclic Redundancy Check 32 is one of the most commonly used hash functions in Computer Science.
It has in-hardware support on both x86 and Arm, for 8-bit, 16-bit, 32-bit, and 64-bit words.
The `0x1EDC6F41` polynomial is used in iSCSI, Btrfs, ext4, and the `0x04C11DB7` in SATA, Ethernet, Zlib, PNG.
In the case of Arm, more than one polynomial is supported.
It is, however, somewhat limiting for Big Data use cases, which often have to deal with more than 4 billion strings, making collisions unavoidable: by the birthday bound, a 32-bit hash is expected to collide after only about 80 thousand strings.
Moreover, the existing SIMD approaches are tricky, combining general-purpose computations with specialized instructions, to utilize more silicon in every cycle.
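
For reference, here is a minimal sketch of hardware-accelerated CRC32-C on x86, using the SSE4.2 intrinsics (compile with `-msse4.2`); it is illustrative, not StringZilla code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>    // `memcpy`
#include <nmmintrin.h> // SSE4.2 CRC32 intrinsics

// CRC32-C (the 0x1EDC6F41 polynomial), consuming 8 bytes per instruction,
// then falling back to byte granularity for the tail.
uint32_t crc32c(char const *data, size_t length) {
    uint64_t crc = 0xFFFFFFFFull;
    for (; length >= 8; data += 8, length -= 8) {
        uint64_t word;
        memcpy(&word, data, 8); // unaligned-safe load
        crc = _mm_crc32_u64(crc, word);
    }
    for (; length; ++data, --length) crc = _mm_crc32_u8((uint32_t)crc, (uint8_t)*data);
    return (uint32_t)crc ^ 0xFFFFFFFFu;
}
```
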

Some of the best articles on CRC32:

- [Comprehensive derivation of approaches](https://github.com/komrad36/CRC)
- [Faster computation for 4 KB buffers on x86](https://www.corsix.org/content/fast-crc32c-4k)
- [Comparing different lookup tables](https://create.stephan-brumme.com/crc32)

Some of the best open-source implementations:

- [By Peter Cawley](https://github.com/corsix/fast-crc32)
- [By Stephan Brumme](https://github.com/stbrumme/crc32)

#### Other Modern Alternatives

[MurmurHash](https://github.com/aappleby/smhasher/blob/master/README.md) from 2008 by Austin Appleby is one of the best known non-cryptographic hashes.
It has a very short implementation and is capable of producing 32-bit and 128-bit hashes.
The [CityHash](https://opensource.googleblog.com/2011/04/introducing-cityhash) from 2011 by Google and the [xxHash](https://github.com/Cyan4973/xxHash) improve on that, better leveraging the super-scalar nature of modern CPUs and producing 64-bit and 128-bit hashes.

None of those functions are cryptographic, unlike MD5, SHA, and BLAKE algorithms.
Most cryptographic hashes are based on the Merkle-Damgård construction, and aren't resistant to length-extension attacks.
The current state of the art might be the [BLAKE3](https://github.com/BLAKE3-team/BLAKE3) algorithm.
It's resistant to a broad range of attacks, can process 2 bytes per CPU cycle, and comes with a very optimized official implementation for C and Rust.
It has the same 128-bit security level as BLAKE2, and achieves its performance gains by reducing the number of mixing rounds and processing data in 1 KiB chunks, which is great for longer strings, but may result in poor performance on short ones.

> [!TIP]
> All mentioned libraries have undergone extensive testing and are considered production-ready.
> They can definitely accelerate your application, but so may the downstream mixer.
> For instance, when a hash-table is constructed, the hashes are further shrunk to address table buckets.
> If the mixer loses entropy, the performance gains from the hash function may be lost.
> An example would be the power-of-two modulo, which is a common mixer, but is known to be weak.
> One alternative would be the [fastrange](https://github.com/lemire/fastrange) by Daniel Lemire.
> Another one is the [Fibonacci hash trick](https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/) using the Golden Ratio, also used in StringZilla.
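
A minimal sketch of that trick, using the common 64-bit constant (the odd integer closest to $2^{64}$ divided by the Golden Ratio):

```c
#include <stdint.h>

// Fibonacci hashing: multiply by 2^64 / golden_ratio and keep the top bits.
// Maps a hash to one of 2^log2_buckets buckets while mixing all input bits.
uint64_t fibonacci_bucket(uint64_t hash, unsigned log2_buckets) {
    return (hash * 0x9E3779B97F4A7C15ull) >> (64 - log2_buckets);
}
```
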
### Exact Substring Search

StringZilla uses different exact substring search algorithms for different needle lengths and backends:

- When no SIMD is available, SWAR (SIMD Within A Register) algorithms run on 64-bit words (see the sketch after this list).
- Boyer-Moore-Horspool (BMH) algorithm with the Raita heuristic variation is used for longer needles.
- SIMD algorithms are randomized to look at different parts of the needle.
- Apostolico-Giancarlo algorithm is _considered_ for longer needles, if preprocessing time isn't an issue.
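
To illustrate the SWAR flavor mentioned above, here is a minimal sketch of the classic trick for detecting a given byte anywhere in a 64-bit word:

```c
#include <stdint.h>

// Returns non-zero if any of the eight bytes of `word` equals `needle`.
// XOR zeroes-out the matching bytes; the (x - 0x01...) & ~x & 0x80... dance
// then raises the top bit of every zero byte.
uint64_t has_byte(uint64_t word, uint8_t needle) {
    uint64_t x = word ^ (0x0101010101010101ull * needle);
    return (x - 0x0101010101010101ull) & ~x & 0x8080808080808080ull;
}
```
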

Substring search algorithms are generally divided into: comparison-based, automaton-based, and bit-parallel.
Different families are effective for different alphabet sizes and needle lengths.
The more operations are needed per character, the more effective SIMD becomes.
The longer the needle, the more effective the skip-tables become.

On very short needles, especially 1-4 characters long, brute force with SIMD is the fastest solution.
On mid-length needles, bit-parallel algorithms are effective, as the character masks fit into 32-bit or 64-bit words.
Either way, if the needle is under 64 bytes long, we will still fetch every CPU cache line on haystack traversal.
So the only way to improve performance is to reduce the number of comparisons.

Going beyond that, to long needles, Boyer-Moore (BM) and its variants are often the best choice.
It uses two tables: the good-suffix shift and the bad-character shift.
A common choice is to use the simplified BMH algorithm, which only uses the bad-character shift table, reducing the pre-processing time.
In the C++ Standard Library, the `std::string::find` function uses the BMH algorithm with Raita's heuristic.
We do something similar for longer needles.
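
For comparison, a minimal BMH sketch with just the bad-character table; StringZilla's actual kernels differ, so treat this as illustrative:

```c
#include <stddef.h>
#include <string.h>

// Boyer-Moore-Horspool: align the needle, compare, and skip ahead based
// on the haystack byte under the last needle position.
char const *bmh_find(char const *h, size_t h_len, char const *n, size_t n_len) {
    if (n_len == 0) return h;
    if (h_len < n_len) return NULL;
    size_t skip[256];
    for (size_t i = 0; i != 256; ++i) skip[i] = n_len;
    for (size_t i = 0; i + 1 < n_len; ++i) skip[(unsigned char)n[i]] = n_len - 1 - i;
    for (size_t pos = 0; pos + n_len <= h_len; pos += skip[(unsigned char)h[pos + n_len - 1]])
        if (memcmp(h + pos, n, n_len) == 0) return h + pos;
    return NULL;
}
```
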

All of those, still, have $O(hn)$ worst-case complexity and struggle with repetitive needle patterns.
To guarantee $O(h)$ worst-case time complexity, the Apostolico-Giancarlo (AG) algorithm adds an additional skip-table.
Its preprocessing phase is $O(n+\sigma)$ in time and space.
On traversal, it performs from $h/n$ to $3h/2$ comparisons.
We should consider implementing it if we can:

- accelerate the preprocessing phase of the needle.
- simplify the control-flow of the main loop.
- replace the array of shift values with a circular buffer.

Reading materials:

- [Exact String Matching Algorithms in Java](https://www-igm.univ-mlv.fr/~lecroq/string)
- [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html)


### Levenshtein Edit Distance

73 changes: 73 additions & 0 deletions include/stringzilla/experimental.h
@@ -243,6 +243,79 @@ SZ_INTERNAL sz_cptr_t _sz_find_bounded_last_bitap_upto_64bytes_serial(sz_cptr_t
return NULL;
}

#if SZ_USE_AVX512

SZ_PUBLIC sz_size_t sz_edit_distance_avx512( //
sz_cptr_t const a, sz_size_t const a_length, //
sz_cptr_t const b, sz_size_t const b_length, //
sz_size_t const bound, sz_memory_allocator_t *alloc) {

sz_u512_vec_t a_vec, b_vec, previous_vec, current_vec, permutation_vec;
sz_u512_vec_t cost_deletion_vec, cost_insertion_vec, cost_substitution_vec;
sz_size_t min_distance;

b_vec.zmm = _mm512_maskz_loadu_epi8(_sz_u64_mask_until(b_length), b);
previous_vec.zmm = _mm512_set_epi8(63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, //
47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, //
31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, //
15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

// Shifting bytes across the whole ZMM register is quite complicated, so let's use a permutation for that.
permutation_vec.zmm = _mm512_set_epi8(62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, //
46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, //
30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, //
14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 63);

for (sz_size_t idx_a = 0; idx_a != a_length; ++idx_a) {
min_distance = bound - 1;

a_vec.zmm = _mm512_set1_epi8(a[idx_a]);
// We first start by computing the cost of deletions and substitutions
// for (sz_size_t idx_b = 0; idx_b != b_length; ++idx_b) {
// sz_u8_t cost_deletion = previous_vec.u8s[idx_b + 1] + 1;
// sz_u8_t cost_substitution = previous_vec.u8s[idx_b] + (a[idx_a] != b[idx_b]);
// current_vec.u8s[idx_b + 1] = sz_min_of_two(cost_deletion, cost_substitution);
// }
cost_deletion_vec.zmm = _mm512_add_epi8(previous_vec.zmm, _mm512_set1_epi8(1));
cost_substitution_vec.zmm =
_mm512_mask_set1_epi8(_mm512_setzero_si512(), _mm512_cmpneq_epi8_mask(a_vec.zmm, b_vec.zmm), 0x01);
cost_substitution_vec.zmm = _mm512_add_epi8(previous_vec.zmm, cost_substitution_vec.zmm);
cost_substitution_vec.zmm = _mm512_permutexvar_epi8(permutation_vec.zmm, cost_substitution_vec.zmm);
current_vec.zmm = _mm512_min_epu8(cost_deletion_vec.zmm, cost_substitution_vec.zmm);
current_vec.u8s[0] = idx_a + 1;

// Now we need to compute the inclusive prefix sums using the minimum operator
// In one line:
// current_vec.u8s[idx_b + 1] = sz_min_of_two(current_vec.u8s[idx_b + 1], current_vec.u8s[idx_b] + 1)
//
// Unrolling this:
// current_vec.u8s[0 + 1] = sz_min_of_two(current_vec.u8s[0 + 1], current_vec.u8s[0] + 1)
// current_vec.u8s[1 + 1] = sz_min_of_two(current_vec.u8s[1 + 1], current_vec.u8s[1] + 1)
// current_vec.u8s[2 + 1] = sz_min_of_two(current_vec.u8s[2 + 1], current_vec.u8s[2] + 1)
// current_vec.u8s[3 + 1] = sz_min_of_two(current_vec.u8s[3 + 1], current_vec.u8s[3] + 1)
//
// Alternatively, using a tree-like reduction in log2 steps:
// - 6 cycles of reductions shifting by 1, 2, 4, 8, 16, 32, 64 bytes;
// - with each cycle containing at least one shift, min, add, blend.
//
// Which adds meaningless complexity without any performance gains.
for (sz_size_t idx_b = 0; idx_b != b_length; ++idx_b) {
sz_u8_t cost_insertion = current_vec.u8s[idx_b] + 1;
current_vec.u8s[idx_b + 1] = sz_min_of_two(current_vec.u8s[idx_b + 1], cost_insertion);
}

// Swap previous_distances and current_distances pointers
sz_u512_vec_t temp_vec;
temp_vec.zmm = previous_vec.zmm;
previous_vec.zmm = current_vec.zmm;
current_vec.zmm = temp_vec.zmm;
}

return previous_vec.u8s[b_length] < bound ? previous_vec.u8s[b_length] : bound;
}

#endif // SZ_USE_AVX512

#ifdef __cplusplus
} // extern "C"
#endif
