88 lines
3.1 KiB
Text
88 lines
3.1 KiB
Text
These are just a couple of brain farts that came up and I'd rather note down.
|
||
There's no clear structure.
|
||
|
||
RFC 1341 Boundary Matching in a Circular Buffer
|
||
1. Algorithm Considerations
|
||
|
||
Knuth-Morris-Pratt (KMP) Limitations:
|
||
|
||
Useful when patterns have prefix-suffix overlaps for efficient skipping.
|
||
|
||
If the failure table consists only of zeros, KMP provides no speed advantage
|
||
over naive searching.
|
||
|
||
Boundary pattern is arbitrary, meaning KMP’s preprocessing may not be
|
||
beneficial.
|
||
|
||
Alternatives to KMP:
|
||
|
||
Rabin-Karp rolling hash → Uses fast hash comparisons instead of
|
||
character-by-character matching.
|
||
|
||
Boyer-Moore-Horspool → Precomputes skip distances to avoid redundant
|
||
comparisons, works well for longer patterns.
|
||
|
||
Crochemore-Perrin two-way search → used by str.find(), flexible
|
||
but assumes a linear memory layout so not really applicable for my circular
|
||
buffer approach
|
||
|
||
2. Boundary Characteristics
|
||
|
||
Max length: 70 bytes. Character set: ASCII only. No structure guarantees: The
|
||
boundary is client-defined, so I must be able to handle arbitrary sequences.
|
||
|
||
3. Algorithm Selection
|
||
|
||
Rolling Hash → Best for arbitrary short-to-medium patterns in a circular buffer.
|
||
Boyer-Moore → Ideal if the boundary has distinct character distributions to
|
||
optimize skipping.
|
||
|
||
|
||
|
||
|
||
# Optimized Chunk-Based Rolling Hash Matching
|
||
|
||
We need to efficiently detect an RFC 1341 multipart boundary inside a circular
|
||
buffer, ensuring minimal overhead while avoiding unnecessary comparisons.
|
||
|
||
Traditional approaches like Knuth-Morris-Pratt (KMP) don’t provide an advantage
|
||
when the boundary lacks repeated subpatterns. Meanwhile, full rolling hash
|
||
matching scans every byte, which can be wasteful.
|
||
|
||
Thus, we introduce a chunk-wise hash-based skipping strategy, allowing us to
|
||
skip large sections of the buffer when an early non-match is detected.
|
||
|
||
## Core Idea
|
||
|
||
Precompute hashes for evenly sized chunks of the boundary. -> First, match only
|
||
the hash of the first chunk → immediately skip unnecessary buffer sections if no
|
||
match. -> If the first chunk matches, progressively verify subsequent chunks
|
||
until the full boundary is confirmed. Benefits Over Full Matching
|
||
|
||
## Benefits Over Full Matching
|
||
|
||
- Reduces comparisons significantly → eliminates large sections early when
|
||
non-matches occur.
|
||
- Balances preprocessing cost vs runtime → faster
|
||
elimination means fewer wasted cycles.
|
||
Integrates seamlessly into circular buffers → allows skipping intelligently.
|
||
|
||
|
||
### Precompute Chunk Hashes
|
||
|
||
- Divide the pattern into `N` equal-sized chunks (e.g., 7 chunks of 10 bytes
|
||
for a 70-byte boundary).
|
||
- Compute a rolling hash for each chunk in addition to the full pattern, storing
|
||
them for quick lookup.
|
||
|
||
### Sliding Window Search in the Buffer
|
||
|
||
- Compute the rolling hash for each window of size chunk_size.
|
||
- Compare the first chunk’s hash with the buffer window.
|
||
- If no match, skip boundary_length - chunk_size bytes.
|
||
|
||
### Progressive Chunk Verification
|
||
|
||
- If the first chunk matches, verify the next chunk sequentially.
|
||
- Continue matching chunks until the full boundary is confirmed.
|
||
- Perform final character-by-character validation to rule out hash collisions.
|