feat(string): init ChunkedRollingHash
This commit is contained in:
parent
b29ae6087d
commit
e5f0d1df58
3 changed files with 282 additions and 1 deletions
88
NOTES
Normal file
88
NOTES
Normal file
|
|
@ -0,0 +1,88 @@
|
|||
These are just a couple of brain farts that came up and I'd rather note down.
|
||||
There's no clear structure.
|
||||
|
||||
RFC 1341 Boundary Matching in a Circular Buffer
|
||||
1. Algorithm Considerations
|
||||
|
||||
Knuth-Morris-Pratt (KMP) Limitations:
|
||||
|
||||
Useful when patterns have prefix-suffix overlaps for efficient skipping.
|
||||
|
||||
If the failure table consists only of zeros, KMP provides no speed advantage
|
||||
over naive searching.
|
||||
|
||||
Boundary pattern is arbitrary, meaning KMP’s preprocessing may not be
|
||||
beneficial.
|
||||
|
||||
Alternatives to KMP:
|
||||
|
||||
Rabin-Karp rolling hash → Uses fast hash comparisons instead of
|
||||
character-by-character matching.
|
||||
|
||||
Boyer-Moore-Horspool → Precomputes skip distances to avoid redundant
|
||||
comparisons, works well for longer patterns.
|
||||
|
||||
Crochemore-Perrin two-way search → used by str.find(), flexible
|
||||
but assumes a linear memory layout so not really applicable for my circular
|
||||
buffer approach
|
||||
|
||||
2. Boundary Characteristics
|
||||
|
||||
Max length: 70 bytes. Character set: ASCII only. No structure guarantees: The
|
||||
boundary is client-defined, so I must be able to handle arbitrary sequences.
|
||||
|
||||
3. Algorithm Selection
|
||||
|
||||
Rolling Hash → Best for arbitrary short-to-medium patterns in a circular buffer.
|
||||
Boyer-Moore → Ideal if the boundary has distinct character distributions to
|
||||
optimize skipping.
|
||||
|
||||
|
||||
|
||||
|
||||
# Optimized Chunk-Based Rolling Hash Matching
|
||||
|
||||
We need to efficiently detect an RFC 1341 multipart boundary inside a circular
|
||||
buffer, ensuring minimal overhead while avoiding unnecessary comparisons.
|
||||
|
||||
Traditional approaches like Knuth-Morris-Pratt (KMP) don’t provide an advantage
|
||||
when the boundary lacks repeated subpatterns. Meanwhile, full rolling hash
|
||||
matching scans every byte, which can be wasteful.
|
||||
|
||||
Thus, we introduce a chunk-wise hash-based skipping strategy, allowing us to
|
||||
skip large sections of the buffer when an early non-match is detected.
|
||||
|
||||
## Core Idea
|
||||
|
||||
Precompute hashes for evenly sized chunks of the boundary. -> First, match only
|
||||
the hash of the first chunk → immediately skip unnecessary buffer sections if no
|
||||
match. -> If the first chunk matches, progressively verify subsequent chunks
|
||||
until the full boundary is confirmed. Benefits Over Full Matching
|
||||
|
||||
## Benefits Over Full Matching
|
||||
|
||||
- Reduces comparisons significantly → eliminates large sections early when
|
||||
non-matches occur.
|
||||
- Balances preprocessing cost vs runtime → faster
|
||||
elimination means fewer wasted cycles.
|
||||
Integrates seamlessly into circular buffers → allows skipping intelligently.
|
||||
|
||||
|
||||
### Precompute Chunk Hashes
|
||||
|
||||
- Divide the pattern into `N` equal-sized chunks (e.g., 7 chunks of 10 bytes
|
||||
for a 70-byte boundary).
|
||||
- Compute a rolling hash for each chunk in addition to the full pattern, storing
|
||||
them for quick lookup.
|
||||
|
||||
### Sliding Window Search in the Buffer
|
||||
|
||||
- Compute the rolling hash for each window of size chunk_size.
|
||||
- Compare the first chunk’s hash with the buffer window.
|
||||
- If no match, skip boundary_length - chunk_size bytes.
|
||||
|
||||
### Progressive Chunk Verification
|
||||
|
||||
- If the first chunk matches, verify the next chunk sequentially.
|
||||
- Continue matching chunks until the full boundary is confirmed.
|
||||
- Perform final character-by-character validation to rule out hash collisions.
|
||||
Loading…
Add table
Add a link
Reference in a new issue