py-utils/NOTES

These are just a couple of brain farts that came up and I'd rather note down.
There's no clear structure.

RFC 1341 Boundary Matching in a Circular Buffer
1. Algorithm Considerations

Knuth-Morris-Pratt (KMP) Limitations:

    Useful when patterns have prefix-suffix overlaps for efficient skipping.

    If the failure table consists only of zeros, KMP provides no speed advantage
    over naive searching.

    Boundary pattern is arbitrary, meaning KMP’s preprocessing may not be
    beneficial.

Alternatives to KMP:

    Rabin-Karp rolling hash → Uses fast hash comparisons instead of
    character-by-character matching.

    Boyer-Moore-Horspool → Precomputes skip distances to avoid redundant
    comparisons, works well for longer patterns.

    Crochemore-Perrin two-way search → used by str.find(), flexible
    but assumes a linear memory layout so not really applicable for my circular
    buffer approach

2. Boundary Characteristics

Max length: 70 bytes. Character set: ASCII only. No structure guarantees: The
boundary is client-defined, so I must be able to handle arbitrary sequences.

3. Algorithm Selection

Rolling Hash → Best for arbitrary short-to-medium patterns in a circular buffer.
Boyer-Moore → Ideal if the boundary has distinct character distributions to
optimize skipping.


# Optimized Chunk-Based Rolling Hash Matching

We need to efficiently detect an RFC 1341 multipart boundary inside a circular
buffer, ensuring minimal overhead while avoiding unnecessary comparisons.

Traditional approaches like Knuth-Morris-Pratt (KMP) don’t provide an advantage
when the boundary lacks repeated subpatterns. Meanwhile, full rolling hash
matching scans every byte, which can be wasteful.

Thus, we introduce a chunk-wise hash-based skipping strategy, allowing us to
skip large sections of the buffer when an early non-match is detected.

## Core Idea

Precompute hashes for evenly sized chunks of the boundary. -> First, match only
the hash of the first chunk → immediately skip unnecessary buffer sections if no
match. -> If the first chunk matches, progressively verify subsequent chunks
until the full boundary is confirmed.  Benefits Over Full Matching

## Benefits Over Full Matching

- Reduces comparisons significantly → eliminates large sections early when
  non-matches occur.
- Balances preprocessing cost vs runtime → faster
  elimination means fewer wasted cycles.
  Integrates seamlessly into circular buffers → allows skipping intelligently.


### Precompute Chunk Hashes

- Divide the pattern into `N` equal-sized chunks (e.g., 7 chunks of 10 bytes
  for a 70-byte boundary).
- Compute a rolling hash for each chunk in addition to the full pattern, storing
  them for quick lookup.

### Sliding Window Search in the Buffer

- Compute the rolling hash for each window of size chunk_size.
- Compare the first chunk’s hash with the buffer window.
- If no match, skip boundary_length - chunk_size bytes.

### Progressive Chunk Verification

- If the first chunk matches, verify the next chunk sequentially.
- Continue matching chunks until the full boundary is confirmed.
- Perform final character-by-character validation to rule out hash collisions.