Merge pull request #187 from pratheekv39/stringalgorithms

#122 Added 5 New Algorithms under String Algorithms
ajay-dhangar · Oct 14, 2024 · e63a37c
2 parents edf9288 + e70e2be
Showing 6 changed files with 545 additions and 0 deletions.
diff --git a/docs/algorithms/string-algorithms/_category_.json b/docs/algorithms/string-algorithms/_category_.json
@@ -0,0 +1,8 @@
+{
+  "label": "String Algorithms",
+  "position": 3,
+  "link": {
+    "type": "generated-index",
+    "description": "Learn about some String Algorithms."
+  }
+}
diff --git a/docs/algorithms/string-algorithms/apostolico-giancarlo-algorithm.md b/docs/algorithms/string-algorithms/apostolico-giancarlo-algorithm.md
@@ -0,0 +1,100 @@
+---
+
+id: apostolico-giancarlo-algo  
+sidebar_position: 1  
+title: Apostolico–Giancarlo Algorithm  
+sidebar_label: Apostolico–Giancarlo Algorithm  
+
+---
+
+### Definition:
+
+The Apostolico–Giancarlo algorithm is a variant of the Boyer-Moore string matching algorithm. At the end of each attempt it records how long a suffix of the pattern matched, and in later attempts it uses that record to jump over text characters already known to match instead of comparing them again.
+
+### Characteristics:
+
+- **Efficient Skipping**:
+  - A skip table remembers, for each text position where an attempt ended, how long a suffix of the pattern matched there. When a later attempt reaches that position, the algorithm consults the table instead of re-comparing those characters.
+
+- **Text Scanning**:
+  - The search window moves over the text from left to right, while the characters inside each window are compared against the pattern from right to left, as in Boyer-Moore.
+
+- **Shift Selection**:
+  - After each attempt the window shifts by the larger of the Boyer-Moore good-suffix and bad-character shifts. Both tables are built from the pattern alone, using its suffix information.
+
+- **Overhead on Small Patterns**:
+  - The skip-table bookkeeping pays off mainly for long or repetitive patterns. For short patterns, simpler algorithms such as Knuth-Morris-Pratt (KMP) may be just as fast.
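+
+The suffix information is commonly kept in a table, here called `suff` (a name assumed for this sketch), where `suff[i]` is the length of the longest substring ending at position `i` that is also a suffix of the pattern. A minimal quadratic-time sketch of how such a table could be built, with `computeSuff` as a hypothetical helper:
+
+```cpp
+#include <string>
+#include <vector>
+
+// Hypothetical helper: suff[i] is the length of the longest common suffix
+// of pattern[0..i] and the whole pattern. Naive O(m^2) construction.
+std::vector<int> computeSuff(const std::string& pattern) {
+    const int m = static_cast<int>(pattern.size());
+    std::vector<int> suff(m, 0);
+    if (m == 0) return suff;
+    suff[m - 1] = m; // The whole pattern is a suffix of itself
+    for (int i = m - 2; i >= 0; --i) {
+        int j = i;
+        // Extend the match leftwards while the characters agree.
+        while (j >= 0 && pattern[j] == pattern[m - 1 - (i - j)]) {
+            --j;
+        }
+        suff[i] = i - j;
+    }
+    return suff;
+}
+```
+
+Under these assumptions, `computeSuff("GCAGAGAG")` yields `1 0 0 2 0 4 0 8`: the substring `GAGAG` ending at index 5 shares its last four characters with the end of the pattern, so `suff[5]` is 4.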
+
+### Time Complexity:
+
+- **Best Case: $O\left(\frac{n}{m}\right)$**  
+  When the last character of each window does not occur in the pattern, a single comparison lets the window shift by the full pattern length, so only about `n / m` comparisons are made.
+
+- **Average Case: $O(n)$**  
+  On typical inputs the number of comparisons is at most linear in the text length, and often well below it because of the long shifts.
+
+- **Worst Case: $O(n)$**  
+  Because matched segments are never compared again, the algorithm performs at most $\frac{3n}{2}$ text character comparisons, where `n` is the text length and `m` is the pattern length. This is the main improvement over plain Boyer-Moore, which can degrade to $O(n \times m)$ when all occurrences are reported.
+
+### Space Complexity:
+
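+- **Space Complexity: $O(m + n)$**  
+  The suffix and shift tables take space linear in the pattern length (plus one entry per alphabet symbol for the bad-character table). A skip table indexed by text position adds space linear in the text length; it can be reduced to `m` entries by keeping only the part that covers the current window.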
+
+### C++ Implementation:
+
+**Iterative Approach**
+```cpp
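+#include <algorithm>
+#include <iostream>
+#include <string>
+#include <vector>
+using namespace std;
+
+const int ALPHABET_SIZE = 256; // Extended ASCII
+
+// suff[i] = length of the longest suffix of pattern[0..i] that is also a suffix of the pattern.
+void computeSuffixes(const string& pattern, vector<int>& suff) {
+    int m = pattern.length();
+    suff[m - 1] = m;
+    for (int i = m - 2; i >= 0; --i) {
+        int j = i;
+        while (j >= 0 && pattern[j] == pattern[m - 1 - (i - j)]) {
+            --j;
+        }
+        suff[i] = i - j;
+    }
+}
+
+// Boyer-Moore good-suffix shifts, derived from the suff table.
+void computeGoodSuffix(const vector<int>& suff, vector<int>& goodSuffix) {
+    int m = suff.size();
+    goodSuffix.assign(m, m);
+    int j = 0;
+    for (int i = m - 1; i >= 0; --i) {
+        if (suff[i] != i + 1) continue; // Only prefixes that are also suffixes
+        for (; j < m - 1 - i; ++j) {
+            if (goodSuffix[j] == m) goodSuffix[j] = m - 1 - i;
+        }
+    }
+    for (int i = 0; i <= m - 2; ++i) {
+        goodSuffix[m - 1 - suff[i]] = m - 1 - i;
+    }
+}
+
+void apostolicoGiancarloSearch(const string& text, const string& pattern) {
+    int n = text.length();
+    int m = pattern.length();
+    if (m == 0 || m > n) return;
+
+    vector<int> suff(m), goodSuffix;
+    computeSuffixes(pattern, suff);
+    computeGoodSuffix(suff, goodSuffix);
+
+    // Bad-character shifts: distance from the last occurrence to the pattern end.
+    vector<int> badChar(ALPHABET_SIZE, m);
+    for (int i = 0; i < m - 1; ++i) badChar[(unsigned char)pattern[i]] = m - 1 - i;
+
+    // skip[t] = length of the pattern suffix matched by the attempt whose window ended at text index t.
+    vector<int> skip(n, 0);
+
+    int pos = 0; // Left end of the current window
+    while (pos <= n - m) {
+        int i = m - 1;
+        while (i >= 0) {
+            int k = skip[pos + i];
+            int s = suff[i];
+            if (k > 0) {
+                // A segment of length k is already known here: jump instead of comparing.
+                if (k > s) {
+                    i = (i + 1 == s) ? -1 : i - s;
+                    break;
+                }
+                i -= k;
+                if (k < s) break;
+            } else if (pattern[i] == text[pos + i]) {
+                --i;
+            } else {
+                break;
+            }
+        }
+
+        int shift;
+        if (i < 0) {
+            cout << "Pattern found at index " << pos << endl;
+            skip[pos + m - 1] = m;
+            shift = goodSuffix[0]; // Shift by the period of the pattern
+        } else {
+            skip[pos + m - 1] = m - 1 - i;
+            shift = max(goodSuffix[i], badChar[(unsigned char)text[pos + i]] - m + 1 + i);
+        }
+        pos += shift;
+    }
+}
+
+int main() {
+    string text = "ABAAABCDABC";
+    string pattern = "ABC";
+    apostolicoGiancarloSearch(text, pattern);
+    return 0;
+}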
+```
+
+### Summary:
+
+The Apostolico–Giancarlo algorithm is a Boyer-Moore variant that combines Boyer-Moore shifts with a record of previously matched segments, so that no part of the text is compared more often than necessary. It offers the clearest advantage for long and repetitive patterns; for short or simple patterns, a simpler algorithm is often the better first choice.
diff --git a/docs/algorithms/string-algorithms/bitap-algorithm.md b/docs/algorithms/string-algorithms/bitap-algorithm.md
@@ -0,0 +1,108 @@
+---
+
+id: bitap-algo  
+sidebar_position: 3  
+title: Bitap Algorithm  
+sidebar_label: Bitap Algorithm  
+
+---
+
+### Definition:
+
+The Bitap algorithm, also known as the **Shift-Or** or **Shift-And** algorithm, is a string matching algorithm that tracks the state of the search in the bits of a machine word. Its basic form performs exact matching. The extension by Wu and Manber also finds occurrences that differ from the pattern by a bounded number of errors, which makes the algorithm well suited to fuzzy searching.
+
+### Characteristics:
+
+- **Bitwise Matching**:
+  - The search state is a bit vector in which bit `i` records whether the first `i + 1` characters of the pattern match the text ending at the current position. Each text character updates all of these bits at once with a shift and a bitwise OR (or AND).
+
+- **Approximate Matching**:
+  - It supports approximate matching, where the pattern may have a certain number of mismatches, insertions, or deletions. This is especially useful in fields like text retrieval or DNA sequence matching.
+
+- **Pattern Masking**:
+  - The pattern is preprocessed into bitmasks, which are then used during the text scan to track how much of the pattern has been matched, including the handling of allowed errors.
+
+- **Linear Search with Errors**:
+  - The algorithm scans the text linearly, and the number of allowed errors (insertions, deletions, substitutions) is parameterized, allowing for flexible search criteria.
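+
+The masking and linear scan described above can be sketched for the exact-matching case (no errors). This is a minimal illustration rather than a reference implementation: `shiftOrSearch` is a hypothetical name, the Shift-Or convention (a 0 bit marks a match) is one of several possible choices, and the pattern is assumed to be between 1 and 64 characters long so that the state fits in one `uint64_t`.
+
+```cpp
+#include <cstdint>
+#include <string>
+#include <vector>
+
+// Exact Shift-Or search: returns the start index of every occurrence.
+// Assumes 1 <= pattern.size() <= 64; otherwise an empty result is returned.
+std::vector<int> shiftOrSearch(const std::string& text, const std::string& pattern) {
+    std::vector<int> matches;
+    const int m = static_cast<int>(pattern.size());
+    if (m == 0 || m > 64) return matches;
+
+    // mask[c] has bit i cleared iff pattern[i] == c (0 means "match").
+    std::vector<std::uint64_t> mask(256, ~0ULL);
+    for (int i = 0; i < m; ++i) {
+        mask[static_cast<unsigned char>(pattern[i])] &= ~(1ULL << i);
+    }
+
+    std::uint64_t state = ~0ULL; // No prefix matched yet
+    const std::uint64_t accept = 1ULL << (m - 1);
+    for (int i = 0; i < static_cast<int>(text.size()); ++i) {
+        // The shift extends every matched prefix by one position; the OR
+        // discards the prefixes that the current character does not continue.
+        state = (state << 1) | mask[static_cast<unsigned char>(text[i])];
+        if ((state & accept) == 0) matches.push_back(i - m + 1);
+    }
+    return matches;
+}
+```
+
+For example, `shiftOrSearch("abracadabra", "abra")` returns the start indices `0` and `7`. Allowing errors adds one such state word per error level.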
+
+### Time Complexity:
+
+- **Exact Matching, Pattern Fits in One Word: $O(n)$**  
+  When the pattern length `m` is at most the machine word size `w`, each text character costs a constant number of shift and bitwise operations, regardless of the text content. Best, average, and worst cases therefore coincide.
+
+- **With `k` Allowed Errors: $O((k + 1) \times n)$**  
+  One state word is kept per error level, so each of the `n` text characters updates `k + 1` words. This stays close to linear when only a few errors are allowed.
+
+- **Long Patterns: $O\left((k + 1) \times n \times \left\lceil \frac{m}{w} \right\rceil\right)$**  
+  When `m` exceeds the word size, each state spans several words and every update costs proportionally more. With large patterns and many allowed errors the running time approaches $O(n \times m)$ or worse.
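+
+The per-character cost can be read off the update rule. One common formulation, written here in the Shift-Or convention where a 0 bit marks a match (an assumption of this sketch), keeps one word $R_d$ per error level $d$ and one mask $M[c]$ per character $c$. For each text character $c$:
+
+$$
+R_0' = (R_0 \ll 1) \mid M[c]
+$$
+
+$$
+R_d' = \big((R_d \ll 1) \mid M[c]\big) \mathbin{\&} (R_{d-1} \ll 1) \mathbin{\&} (R_{d-1}' \ll 1) \mathbin{\&} R_{d-1}, \qquad 1 \le d \le k
+$$
+
+The four terms correspond to a matching character, a substitution, a deletion, and an insertion. Each text character therefore touches `k + 1` words, which is where the factor that grows with the number of allowed errors comes from.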
+
+### Space Complexity:
+
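+- **Space Complexity: $O(\sigma + k)$ words for $m \le w$**  
+  The algorithm stores one bitmask per alphabet symbol ($\sigma$ symbols) and one state word per error level (`k + 1` words). For patterns longer than the word size, each of these grows to $\left\lceil \frac{m}{w} \right\rceil$ words. Memory use is independent of the text length.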
+
+### C++ Implementation:
+
+**Approximate Matching with `k` Allowed Errors**
+```cpp
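+#include <cstdint>
+#include <iostream>
+#include <string>
+#include <vector>
+using namespace std;
+
+#define CHAR_SIZE 256 // Extended ASCII
+
+// Shift-Or convention: a 0 bit means "match".
+// Bit i of patternMask[c] is 0 iff pattern[i] == c.
+void preprocessPattern(const string& pattern, vector<uint64_t>& patternMask) {
+    int m = pattern.size();
+    for (int i = 0; i < CHAR_SIZE; ++i) {
+        patternMask[i] = ~0ULL;
+    }
+    for (int i = 0; i < m; ++i) {
+        patternMask[(unsigned char)pattern[i]] &= ~(1ULL << i);
+    }
+}
+
+void bitapSearch(const string& text, const string& pattern, int maxErrors) {
+    int n = text.size();
+    int m = pattern.size();
+
+    // The state must fit in one 64-bit word, and fewer errors than pattern characters are allowed.
+    if (m == 0 || m > 64 || maxErrors < 0 || maxErrors >= m) return;
+
+    vector<uint64_t> patternMask(CHAR_SIZE);
+    preprocessPattern(pattern, patternMask);
+
+    // R[d] bit i == 0: pattern[0..i] matches a suffix of the text read so far with at most d errors.
+    // Before any text is read, a prefix of length <= d matches by deleting all of its characters.
+    vector<uint64_t> R(maxErrors + 1);
+    for (int d = 0; d <= maxErrors; ++d) {
+        R[d] = ~0ULL << d;
+    }
+
+    const uint64_t accept = 1ULL << (m - 1);
+
+    for (int i = 0; i < n; ++i) {
+        uint64_t charMask = patternMask[(unsigned char)text[i]];
+        uint64_t prevOld = R[0]; // Value of R[d - 1] before this character
+        R[0] = (R[0] << 1) | charMask;
+        for (int d = 1; d <= maxErrors; ++d) {
+            uint64_t old = R[d];
+            R[d] = ((old << 1) | charMask) // Characters match
+                 & (prevOld << 1)          // Substitution
+                 & (R[d - 1] << 1)         // Deletion (pattern character skipped)
+                 & prevOld;                // Insertion (extra text character)
+            prevOld = old;
+        }
+        if ((R[maxErrors] & accept) == 0) {
+            // With errors allowed the start position is ambiguous, so the end position is reported.
+            cout << "Pattern found ending at index " << i << " with at most " << maxErrors << " errors." << endl;
+        }
+    }
+}
+
+int main() {
+    string text = "this is a simple example";
+    string pattern = "example";
+    int maxErrors = 1; // Allow 1 error (insertion, deletion, or substitution)
+
+    bitapSearch(text, pattern, maxErrors);
+
+    return 0;
+}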
+```
+
+### Summary:
+
+The Bitap algorithm is an efficient string matching technique that supports approximate matching, making it well suited to applications that need fuzzy search. Its bitwise operations allow fast text scanning, and its ability to tolerate errors sets it apart from purely exact matching algorithms. Because the cost per text character grows with the pattern length in machine words and with the number of allowed errors, it performs best for short patterns and small error bounds.
diff --git a/docs/algorithms/string-algorithms/bndm-algorithm.md b/docs/algorithms/string-algorithms/bndm-algorithm.md
@@ -0,0 +1,108 @@
+---
+
+id: bndm-algo  
+sidebar_position: 2  
+title: BNDM Algorithm  
+sidebar_label: BNDM Algorithm  
+
+---
+
+### Definition:
+
+The BNDM (Backward Nondeterministic Dawg Matching) algorithm is an efficient string matching algorithm derived from the Backward Dawg Matching (BDM) algorithm. It uses bitwise operations to simulate a nondeterministic automaton, matching the pattern in reverse order while scanning the text.
+
+### Characteristics:
+
+- **Bitwise Automaton Simulation**:
+  - BNDM precomputes one bitmask per character, marking the positions at which that character occurs in the pattern. It uses these masks to simulate a nondeterministic automaton that recognizes every substring of the pattern, so a single AND and shift replaces many individual character comparisons.
+
+- **Reverse Pattern Matching**:
+  - Each text window is read from right to left. As soon as the characters read so far are no longer a substring of the pattern, the window can be shifted past them.
+
+- **Efficient for Short Patterns**:
+  - BNDM is most effective when the pattern fits in one machine word (`m` at most `w` bits). In that range it often outperforms other string matching algorithms such as Boyer-Moore and Knuth-Morris-Pratt.
+
+- **Extension of BDM**:
+  - It replaces the suffix automaton (DAWG) that BDM builds explicitly with a bit-parallel simulation, which is simpler to implement and needs less memory.
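+
+To make the bitmask representation concrete, the sketch below builds the mask table for a pattern. It assumes one common convention in which the pattern is stored reversed, so that it can be consumed while a window is read from right to left. `buildBndmMasks` is a hypothetical helper, and patterns longer than 64 characters are not handled.
+
+```cpp
+#include <cstdint>
+#include <string>
+#include <vector>
+
+// Hypothetical helper: builds the BNDM mask table for a pattern of length 1..64.
+// Bit (m - 1 - i) of masks[c] is set when pattern[i] == c.
+std::vector<std::uint64_t> buildBndmMasks(const std::string& pattern) {
+    std::vector<std::uint64_t> masks(256, 0);
+    const int m = static_cast<int>(pattern.size());
+    if (m > 64) return masks; // Does not fit in one word; left empty in this sketch
+    for (int i = 0; i < m; ++i) {
+        masks[static_cast<unsigned char>(pattern[i])] |= 1ULL << (m - 1 - i);
+    }
+    return masks;
+}
+```
+
+For the pattern `ABC`, the masks for `A`, `B`, and `C` are binary `100`, `010`, and `001`; every other character maps to `0`, so reading such a character empties the automaton state and allows a full-length shift.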
+
+### Time Complexity:
+
+- **Best Case: $O\left(\frac{n}{m}\right)$**  
+  When the first characters read in each window do not form a substring of the pattern, the window shifts by the full pattern length after only one or two reads, so about `n / m` windows are inspected.
+
+- **Average Case: $O\left(\frac{n \log_\sigma m}{m}\right)$**  
+  On average BNDM is sublinear in the text length for an alphabet of size $\sigma$, which makes it highly efficient for typical use cases, especially with short patterns.
+
+- **Worst Case: $O(n \times m)$**  
+  In the worst case, for example a text and pattern that both consist of one repeated character, every window is read completely and shifts by a single position, leading to quadratic complexity, where `n` is the text length and `m` is the pattern length.
+
+  These bounds assume that the pattern fits in one machine word of `w` bits.
+
+### Space Complexity:
+
+- **Space Complexity: $O(\sigma)$ words for $m \le w$**  
+  BNDM stores one bitmask per alphabet symbol ($\sigma$ symbols) plus a constant number of state words. Each mask holds `m` bits, so it occupies a single word as long as the pattern fits in the word size.
+
+### C++ Implementation:
+
+**Iterative Approach**
+```cpp
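+#include <cstdint>
+#include <iostream>
+#include <string>
+#include <vector>
+using namespace std;
+
+#define CHAR_SIZE 256 // Assuming extended ASCII
+
+// Bit (m - 1 - i) of B[c] is set iff pattern[i] == c, so the pattern is stored reversed.
+void preprocessPattern(const string& pattern, vector<uint64_t>& B) {
+    int m = pattern.length();
+    for (int i = 0; i < CHAR_SIZE; ++i) {
+        B[i] = 0;
+    }
+    for (int i = 0; i < m; ++i) {
+        B[(unsigned char)pattern[i]] |= (1ULL << (m - 1 - i));
+    }
+}
+
+void BNDMSearch(const string& text, const string& pattern) {
+    int n = text.length();
+    int m = pattern.length();
+
+    // The automaton state must fit in one 64-bit word.
+    if (m == 0 || m > 64 || m > n) return;
+
+    vector<uint64_t> B(CHAR_SIZE);
+    preprocessPattern(pattern, B);
+
+    const uint64_t allStates = (m == 64) ? ~0ULL : ((1ULL << m) - 1);
+    const uint64_t highBit = 1ULL << (m - 1);
+
+    for (int i = 0; i <= n - m; ) {
+        int j = m;     // Number of window characters not yet read
+        int shift = m; // Safe shift; shortened whenever a pattern prefix is recognized
+        uint64_t D = allStates;
+
+        // Read the window right to left while the characters read form a substring of the pattern.
+        while (D != 0) {
+            D &= B[(unsigned char)text[i + j - 1]];
+            --j;
+            if (D & highBit) {
+                if (j > 0) {
+                    shift = j; // The window suffix starting at i + j is a prefix of the pattern
+                } else {
+                    cout << "Pattern found at index " << i << endl;
+                }
+            }
+            D = (D << 1) & allStates;
+        }
+
+        i += shift;
+    }
+}
+
+int main() {
+    string text = "ABCABCABCD";
+    string pattern = "ABC";
+
+    BNDMSearch(text, pattern);
+
+    return 0;
+}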
+```
+
+### Summary:
+
+The BNDM (Backward Nondeterministic Dawg Matching) algorithm is an efficient string matching technique, especially for patterns short enough to fit in a machine word. It uses bitwise operations and right-to-left window scanning to avoid unnecessary character comparisons, which makes it well suited to short patterns and quick searches. Its strong average-case behavior makes it a solid choice for practical string matching tasks.