feat(string): init ChunkedRollingHash

2025-05-05 01:20:13 +02:00 · 2025-05-05 01:20:13 +02:00 · e5f0d1df58
commit e5f0d1df58
parent b29ae6087d
3 changed files with 282 additions and 1 deletions
--- a/88
+++ b/88
@ -0,0 +1,88 @@
+These are just a couple of brain farts that came up and I'd rather note down.
+There's no clear structure.
+
+RFC 1341 Boundary Matching in a Circular Buffer
+1. Algorithm Considerations
+
+Knuth-Morris-Pratt (KMP) Limitations:
+
+    Useful when patterns have prefix-suffix overlaps for efficient skipping.
+
+    If the failure table consists only of zeros, KMP provides no speed advantage
+    over naive searching.
+
+    Boundary pattern is arbitrary, meaning KMP’s preprocessing may not be
+    beneficial.
+
+Alternatives to KMP:
+
+    Rabin-Karp rolling hash → Uses fast hash comparisons instead of
+    character-by-character matching.
+
+    Boyer-Moore-Horspool → Precomputes skip distances to avoid redundant
+    comparisons, works well for longer patterns.
+
+    Crochemore-Perrin two-way search → used by str.find(), flexible
+    but assumes a linear memory layout so not really applicable for my circular
+    buffer approach
+
+2. Boundary Characteristics
+
+Max length: 70 bytes. Character set: ASCII only. No structure guarantees: The
+boundary is client-defined, so I must be able to handle arbitrary sequences.
+
+3. Algorithm Selection
+
+Rolling Hash → Best for arbitrary short-to-medium patterns in a circular buffer.
+Boyer-Moore → Ideal if the boundary has distinct character distributions to
+optimize skipping.
+
+
+
+
+# Optimized Chunk-Based Rolling Hash Matching
+
+We need to efficiently detect an RFC 1341 multipart boundary inside a circular
+buffer, ensuring minimal overhead while avoiding unnecessary comparisons.
+
+Traditional approaches like Knuth-Morris-Pratt (KMP) don’t provide an advantage
+when the boundary lacks repeated subpatterns. Meanwhile, full rolling hash
+matching scans every byte, which can be wasteful.
+
+Thus, we introduce a chunk-wise hash-based skipping strategy, allowing us to
+skip large sections of the buffer when an early non-match is detected.
+
+## Core Idea
+
+Precompute hashes for evenly sized chunks of the boundary. -> First, match only
+the hash of the first chunk → immediately skip unnecessary buffer sections if no
+match. -> If the first chunk matches, progressively verify subsequent chunks
+until the full boundary is confirmed.  Benefits Over Full Matching
+
+## Benefits Over Full Matching
+
+- Reduces comparisons significantly → eliminates large sections early when
+  non-matches occur.
+- Balances preprocessing cost vs runtime → faster
+  elimination means fewer wasted cycles.
+  Integrates seamlessly into circular buffers → allows skipping intelligently.
+
+
+### Precompute Chunk Hashes
+
+- Divide the pattern into `N` equal-sized chunks (e.g., 7 chunks of 10 bytes
+  for a 70-byte boundary).
+- Compute a rolling hash for each chunk in addition to the full pattern, storing
+  them for quick lookup.
+
+### Sliding Window Search in the Buffer
+
+- Compute the rolling hash for each window of size chunk_size.
+- Compare the first chunk’s hash with the buffer window.
+- If no match, skip boundary_length - chunk_size bytes.
+
+### Progressive Chunk Verification
+
+- If the first chunk matches, verify the next chunk sequentially.
+- Continue matching chunks until the full boundary is confirmed.
+- Perform final character-by-character validation to rule out hash collisions.