py-utils/NOTES
2025-05-06 16:39:44 +02:00

88 lines
3.1 KiB
Text
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

These are just a couple of brain farts that came up and I'd rather note down.
There's no clear structure.
RFC 1341 Boundary Matching in a Circular Buffer
1. Algorithm Considerations
Knuth-Morris-Pratt (KMP) Limitations:
Useful when patterns have prefix-suffix overlaps for efficient skipping.
If the failure table consists only of zeros, KMP provides no speed advantage
over naive searching.
Boundary pattern is arbitrary, meaning KMPs preprocessing may not be
beneficial.
Alternatives to KMP:
Rabin-Karp rolling hash → Uses fast hash comparisons instead of
character-by-character matching.
Boyer-Moore-Horspool → Precomputes skip distances to avoid redundant
comparisons, works well for longer patterns.
Crochemore-Perrin two-way search → used by str.find(), flexible
but assumes a linear memory layout so not really applicable for my circular
buffer approach
2. Boundary Characteristics
Max length: 70 bytes. Character set: ASCII only. No structure guarantees: The
boundary is client-defined, so I must be able to handle arbitrary sequences.
3. Algorithm Selection
Rolling Hash → Best for arbitrary short-to-medium patterns in a circular buffer.
Boyer-Moore → Ideal if the boundary has distinct character distributions to
optimize skipping.
# Optimized Chunk-Based Rolling Hash Matching
We need to efficiently detect an RFC 1341 multipart boundary inside a circular
buffer, ensuring minimal overhead while avoiding unnecessary comparisons.
Traditional approaches like Knuth-Morris-Pratt (KMP) dont provide an advantage
when the boundary lacks repeated subpatterns. Meanwhile, full rolling hash
matching scans every byte, which can be wasteful.
Thus, we introduce a chunk-wise hash-based skipping strategy, allowing us to
skip large sections of the buffer when an early non-match is detected.
## Core Idea
Precompute hashes for evenly sized chunks of the boundary. -> First, match only
the hash of the first chunk → immediately skip unnecessary buffer sections if no
match. -> If the first chunk matches, progressively verify subsequent chunks
until the full boundary is confirmed. Benefits Over Full Matching
## Benefits Over Full Matching
- Reduces comparisons significantly → eliminates large sections early when
non-matches occur.
- Balances preprocessing cost vs runtime → faster
elimination means fewer wasted cycles.
Integrates seamlessly into circular buffers → allows skipping intelligently.
### Precompute Chunk Hashes
- Divide the pattern into `N` equal-sized chunks (e.g., 7 chunks of 10 bytes
for a 70-byte boundary).
- Compute a rolling hash for each chunk in addition to the full pattern, storing
them for quick lookup.
### Sliding Window Search in the Buffer
- Compute the rolling hash for each window of size chunk_size.
- Compare the first chunks hash with the buffer window.
- If no match, skip boundary_length - chunk_size bytes.
### Progressive Chunk Verification
- If the first chunk matches, verify the next chunk sequentially.
- Continue matching chunks until the full boundary is confirmed.
- Perform final character-by-character validation to rule out hash collisions.