Parquet-spec Split Block Bloom Filter for probabilistic set membership.
More...
#include <split_block.hpp>
|
| | SplitBlockBloomFilter (size_t num_distinct_values, double fpr=0.01) |
| | Create a bloom filter sized for the expected number of distinct values.
|
| |
| void | insert (uint64_t hash) |
| | Insert a pre-computed xxHash64 value into the filter.
|
| |
| bool | might_contain (uint64_t hash) const |
| | Check if a pre-computed hash value might be present in the filter.
|
| |
| void | insert_value (const std::string &val) |
| | Insert a string value (hashed with xxHash64).
|
| |
| void | insert_value (int32_t val) |
| | Insert a 32-bit integer value (hashed with xxHash64).
|
| |
| void | insert_value (int64_t val) |
| | Insert a 64-bit integer value (hashed with xxHash64).
|
| |
| void | insert_value (float val) |
| | Insert a 32-bit float value (hashed with xxHash64).
|
| |
| void | insert_value (double val) |
| | Insert a 64-bit double value (hashed with xxHash64).
|
| |
| bool | might_contain_value (const std::string &val) const |
| | Check if a string value might be present in the filter.
|
| |
| bool | might_contain_value (int32_t val) const |
| | Check if a 32-bit integer value might be present in the filter.
|
| |
| bool | might_contain_value (int64_t val) const |
| | Check if a 64-bit integer value might be present in the filter.
|
| |
| bool | might_contain_value (float val) const |
| | Check if a 32-bit float value might be present in the filter.
|
| |
| bool | might_contain_value (double val) const |
| | Check if a 64-bit double value might be present in the filter.
|
| |
| const std::vector< uint8_t > & | data () const |
| | Access the raw filter data for serialization into a Parquet file.
|
| |
| size_t | size_bytes () const |
| | Total size of the filter in bytes (always a multiple of 32).
|
| |
| size_t | num_blocks () const |
| | Number of 32-byte blocks in the filter.
|
| |
| void | merge (const SplitBlockBloomFilter &other) |
| | Merge another filter into this one via bitwise OR.
|
| |
| void | reset () |
| | Reset the filter (clear all bits to zero).
|
| |
| double | estimated_fpr (size_t num_insertions) const |
| | Estimate the false-positive rate after a given number of insertions.
|
| |
|
| static constexpr size_t | kBytesPerBlock = 32 |
| | Block size in bytes (32 bytes = 256 bits = 8 x uint32_t words).
|
| |
| static constexpr size_t | kWordsPerBlock = 8 |
| | Number of uint32_t words per block (8).
|
| |
| static constexpr size_t | kMinBytes = kBytesPerBlock |
| | Minimum filter size: one block (32 bytes).
|
| |
| static constexpr size_t | kMaxBytes = 128ULL * 1024 * 1024 |
| | Maximum filter size: 128 MiB (Parquet spec recommendation).
|
| |
Parquet-spec Split Block Bloom Filter for probabilistic set membership.
A space-efficient probabilistic data structure that answers "is this value
possibly in the set?" with a tunable false-positive rate. False negatives never occur: if might_contain() returns false, the value was definitely never inserted.
The filter is organized as an array of 32-byte blocks. Each insertion touches exactly one block, setting 8 bits (one per uint32_t word) derived from the hash via the Parquet salt constants. This block-local design is cache-friendly and allows efficient SIMD implementations.
- Note
- Thread safety: not thread-safe. External synchronization is required if multiple threads insert or query concurrently.
bloom.insert_value(std::string("AAPL"));
bloom.insert_value(42);
assert(bloom.might_contain_value(std::string("AAPL")) == true);
Parquet-spec Split Block Bloom Filter for probabilistic set membership.
- See also
- https://github.com/apache/parquet-format/blob/master/BloomFilter.md
Definition at line 57 of file split_block.hpp.
◆ SplitBlockBloomFilter()
| signet::forge::SplitBlockBloomFilter::SplitBlockBloomFilter |
( |
size_t |
num_distinct_values, |
|
|
double |
fpr = 0.01 |
|
) |
| |
|
inlineexplicit |
Create a bloom filter sized for the expected number of distinct values.
Uses the optimal sizing formula from the Parquet specification: num_bytes = -8 * ndv / ln(1 - fpr^(1/8)) rounded up to the next multiple of 32 bytes, then clamped to [kMinBytes, kMaxBytes].
- Parameters
-
| num_distinct_values | Expected number of distinct values to insert. A value of 0 is treated as 1 (at least one block is allocated). |
| fpr | Desired false-positive rate, must be in the open interval (0, 1). Default is 0.01 (1%). |
- Exceptions
-
| std::invalid_argument | if fpr is not in (0, 1). |
Definition at line 87 of file split_block.hpp.
◆ data()
| const std::vector< uint8_t > & signet::forge::SplitBlockBloomFilter::data |
( |
| ) |
const |
|
inline |
Access the raw filter data for serialization into a Parquet file.
- Returns
- Const reference to the internal byte vector.
Definition at line 249 of file split_block.hpp.
◆ estimated_fpr()
| double signet::forge::SplitBlockBloomFilter::estimated_fpr |
( |
size_t |
num_insertions | ) |
const |
|
inline |
Estimate the false-positive rate after a given number of insertions.
Uses the standard Bloom filter FPR formula adapted for split blocks: fpr ~ (1 - exp(-k * n / (num_blocks * 256)))^k where n = num_insertions, k = 8 (bits per insertion), and each block contains 256 bits.
- Parameters
-
| num_insertions | Number of distinct values inserted so far. |
- Returns
- Estimated false-positive probability in [0.0, 1.0]. Returns 0.0 if
num_insertions is 0.
Definition at line 318 of file split_block.hpp.
◆ from_data()
| static SplitBlockBloomFilter signet::forge::SplitBlockBloomFilter::from_data |
( |
const uint8_t * |
src, |
|
|
size_t |
size |
|
) |
| |
|
inlinestatic |
Reconstruct a filter from previously serialized bytes.
- Parameters
-
| src | Pointer to the serialized filter data. |
| size | Length of the data in bytes. Must be a positive multiple of 32. |
- Returns
- A filter initialized with the provided data.
- Exceptions
-
| std::invalid_argument | if size is 0 or not a multiple of 32. |
Definition at line 265 of file split_block.hpp.
◆ insert()
| void signet::forge::SplitBlockBloomFilter::insert |
( |
uint64_t |
hash | ) |
|
|
inline |
Insert a pre-computed xxHash64 value into the filter.
The upper 32 bits select the block; the lower 32 bits, multiplied by each of the 8 salt constants, determine the bit positions within that block.
- Parameters
-
Definition at line 141 of file split_block.hpp.
◆ insert_value() [1/5]
| void signet::forge::SplitBlockBloomFilter::insert_value |
( |
const std::string & |
val | ) |
|
|
inline |
Insert a string value (hashed with xxHash64).
- Parameters
-
Definition at line 188 of file split_block.hpp.
◆ insert_value() [2/5]
| void signet::forge::SplitBlockBloomFilter::insert_value |
( |
double |
val | ) |
|
|
inline |
Insert a 64-bit double value (hashed with xxHash64).
- Parameters
-
Definition at line 208 of file split_block.hpp.
◆ insert_value() [3/5]
| void signet::forge::SplitBlockBloomFilter::insert_value |
( |
float |
val | ) |
|
|
inline |
Insert a 32-bit float value (hashed with xxHash64).
- Parameters
-
Definition at line 203 of file split_block.hpp.
◆ insert_value() [4/5]
| void signet::forge::SplitBlockBloomFilter::insert_value |
( |
int32_t |
val | ) |
|
|
inline |
Insert a 32-bit integer value (hashed with xxHash64).
- Parameters
-
| val | The int32_t to insert. |
Definition at line 193 of file split_block.hpp.
◆ insert_value() [5/5]
| void signet::forge::SplitBlockBloomFilter::insert_value |
( |
int64_t |
val | ) |
|
|
inline |
Insert a 64-bit integer value (hashed with xxHash64).
- Parameters
-
| val | The int64_t to insert. |
Definition at line 198 of file split_block.hpp.
◆ merge()
Merge another filter into this one via bitwise OR.
After merging, this filter represents the union of both sets. Both filters must have identical byte sizes.
- Parameters
-
| other | The filter to merge into this one. |
- Exceptions
-
| std::invalid_argument | if the filters have different sizes. |
Definition at line 293 of file split_block.hpp.
◆ might_contain()
| bool signet::forge::SplitBlockBloomFilter::might_contain |
( |
uint64_t |
hash | ) |
const |
|
inline |
Check if a pre-computed hash value might be present in the filter.
- Parameters
-
- Returns
false if the value is definitely not in the filter; true if the value is probably in the filter (subject to FPR).
Definition at line 164 of file split_block.hpp.
◆ might_contain_value() [1/5]
| bool signet::forge::SplitBlockBloomFilter::might_contain_value |
( |
const std::string & |
val | ) |
const |
|
inline |
Check if a string value might be present in the filter.
- Parameters
-
- Returns
false if definitely absent; true if probably present.
Definition at line 215 of file split_block.hpp.
◆ might_contain_value() [2/5]
| bool signet::forge::SplitBlockBloomFilter::might_contain_value |
( |
double |
val | ) |
const |
|
inline |
Check if a 64-bit double value might be present in the filter.
- Parameters
-
- Returns
false if definitely absent; true if probably present.
Definition at line 239 of file split_block.hpp.
◆ might_contain_value() [3/5]
| bool signet::forge::SplitBlockBloomFilter::might_contain_value |
( |
float |
val | ) |
const |
|
inline |
Check if a 32-bit float value might be present in the filter.
- Parameters
-
- Returns
false if definitely absent; true if probably present.
Definition at line 233 of file split_block.hpp.
◆ might_contain_value() [4/5]
| bool signet::forge::SplitBlockBloomFilter::might_contain_value |
( |
int32_t |
val | ) |
const |
|
inline |
Check if a 32-bit integer value might be present in the filter.
- Parameters
-
- Returns
false if definitely absent; true if probably present.
Definition at line 221 of file split_block.hpp.
◆ might_contain_value() [5/5]
| bool signet::forge::SplitBlockBloomFilter::might_contain_value |
( |
int64_t |
val | ) |
const |
|
inline |
Check if a 64-bit integer value might be present in the filter.
- Parameters
-
- Returns
false if definitely absent; true if probably present.
Definition at line 227 of file split_block.hpp.
◆ num_blocks()
| size_t signet::forge::SplitBlockBloomFilter::num_blocks |
( |
| ) |
const |
|
inline |
◆ reset()
| void signet::forge::SplitBlockBloomFilter::reset |
( |
| ) |
|
|
inline |
Reset the filter (clear all bits to zero).
Definition at line 304 of file split_block.hpp.
◆ size_bytes()
| size_t signet::forge::SplitBlockBloomFilter::size_bytes |
( |
| ) |
const |
|
inline |
Total size of the filter in bytes (always a multiple of 32).
Definition at line 252 of file split_block.hpp.
◆ with_size()
Create a bloom filter with an explicit byte size.
- Parameters
-
| num_bytes | Filter size in bytes. Must be a positive multiple of 32. Clamped to kMaxBytes if larger. |
- Returns
- A zero-initialized filter of the requested size.
- Exceptions
-
| std::invalid_argument | if num_bytes is 0 or not a multiple of 32. |
Definition at line 119 of file split_block.hpp.
◆ kBytesPerBlock
| constexpr size_t signet::forge::SplitBlockBloomFilter::kBytesPerBlock = 32 |
|
staticconstexpr |
Block size in bytes (32 bytes = 256 bits = 8 x uint32_t words).
Definition at line 60 of file split_block.hpp.
◆ kMaxBytes
| constexpr size_t signet::forge::SplitBlockBloomFilter::kMaxBytes = 128ULL * 1024 * 1024 |
|
staticconstexpr |
Maximum filter size: 128 MiB (Parquet spec recommendation).
Definition at line 69 of file split_block.hpp.
◆ kMinBytes
| constexpr size_t signet::forge::SplitBlockBloomFilter::kMinBytes = kBytesPerBlock |
|
staticconstexpr |
Minimum filter size: one block (32 bytes).
Definition at line 66 of file split_block.hpp.
◆ kWordsPerBlock
| constexpr size_t signet::forge::SplitBlockBloomFilter::kWordsPerBlock = 8 |
|
staticconstexpr |
Number of uint32_t words per block (8).
Definition at line 63 of file split_block.hpp.
The documentation for this class was generated from the following file: