Signet Forge 0.1.0
C++20 Parquet library with AI-native extensions
DEMO
Loading...
Searching...
No Matches
signet::forge::SplitBlockBloomFilter Class Reference

Parquet-spec Split Block Bloom Filter for probabilistic set membership. More...

#include <split_block.hpp>

Public Member Functions

 SplitBlockBloomFilter (size_t num_distinct_values, double fpr=0.01)
 Create a bloom filter sized for the expected number of distinct values.
 
void insert (uint64_t hash)
 Insert a pre-computed xxHash64 value into the filter.
 
bool might_contain (uint64_t hash) const
 Check if a pre-computed hash value might be present in the filter.
 
void insert_value (const std::string &val)
 Insert a string value (hashed with xxHash64).
 
void insert_value (int32_t val)
 Insert a 32-bit integer value (hashed with xxHash64).
 
void insert_value (int64_t val)
 Insert a 64-bit integer value (hashed with xxHash64).
 
void insert_value (float val)
 Insert a 32-bit float value (hashed with xxHash64).
 
void insert_value (double val)
 Insert a 64-bit double value (hashed with xxHash64).
 
bool might_contain_value (const std::string &val) const
 Check if a string value might be present in the filter.
 
bool might_contain_value (int32_t val) const
 Check if a 32-bit integer value might be present in the filter.
 
bool might_contain_value (int64_t val) const
 Check if a 64-bit integer value might be present in the filter.
 
bool might_contain_value (float val) const
 Check if a 32-bit float value might be present in the filter.
 
bool might_contain_value (double val) const
 Check if a 64-bit double value might be present in the filter.
 
const std::vector< uint8_t > & data () const
 Access the raw filter data for serialization into a Parquet file.
 
size_t size_bytes () const
 Total size of the filter in bytes (always a multiple of 32).
 
size_t num_blocks () const
 Number of 32-byte blocks in the filter.
 
void merge (const SplitBlockBloomFilter &other)
 Merge another filter into this one via bitwise OR.
 
void reset ()
 Reset the filter (clear all bits to zero).
 
double estimated_fpr (size_t num_insertions) const
 Estimate the false-positive rate after a given number of insertions.
 

Static Public Member Functions

static SplitBlockBloomFilter with_size (size_t num_bytes)
 Create a bloom filter with an explicit byte size.
 
static SplitBlockBloomFilter from_data (const uint8_t *src, size_t size)
 Reconstruct a filter from previously serialized bytes.
 

Static Public Attributes

static constexpr size_t kBytesPerBlock = 32
 Block size in bytes (32 bytes = 256 bits = 8 x uint32_t words).
 
static constexpr size_t kWordsPerBlock = 8
 Number of uint32_t words per block (8).
 
static constexpr size_t kMinBytes = kBytesPerBlock
 Minimum filter size: one block (32 bytes).
 
static constexpr size_t kMaxBytes = 128ULL * 1024 * 1024
 Maximum filter size: 128 MiB (Parquet spec recommendation).
 

Detailed Description

Parquet-spec Split Block Bloom Filter for probabilistic set membership.

A space-efficient probabilistic data structure that answers "is this value possibly in the set?" with a tunable false-positive rate. False negatives never occur: if might_contain() returns false, the value was definitely never inserted.

The filter is organized as an array of 32-byte blocks. Each insertion touches exactly one block, setting 8 bits (one per uint32_t word) derived from the hash via the Parquet salt constants. This block-local design is cache-friendly and allows efficient SIMD implementations.

Note
Thread safety: not thread-safe. External synchronization is required if multiple threads insert or query concurrently.
// Size for 10,000 distinct values at 1% FPR
SplitBlockBloomFilter bloom(10000, 0.01);
bloom.insert_value(std::string("AAPL"));
bloom.insert_value(42);
assert(bloom.might_contain_value(std::string("AAPL")) == true);
Parquet-spec Split Block Bloom Filter for probabilistic set membership.
See also
https://github.com/apache/parquet-format/blob/master/BloomFilter.md

Definition at line 57 of file split_block.hpp.

Constructor & Destructor Documentation

◆ SplitBlockBloomFilter()

signet::forge::SplitBlockBloomFilter::SplitBlockBloomFilter ( size_t  num_distinct_values,
double  fpr = 0.01 
)
inlineexplicit

Create a bloom filter sized for the expected number of distinct values.

Uses the optimal sizing formula from the Parquet specification: num_bytes = -8 * ndv / ln(1 - fpr^(1/8)) rounded up to the next multiple of 32 bytes, then clamped to [kMinBytes, kMaxBytes].

Parameters
num_distinct_valuesExpected number of distinct values to insert. A value of 0 is treated as 1 (at least one block is allocated).
fprDesired false-positive rate, must be in the open interval (0, 1). Default is 0.01 (1%).
Exceptions
std::invalid_argumentif fpr is not in (0, 1).

Definition at line 87 of file split_block.hpp.

Member Function Documentation

◆ data()

const std::vector< uint8_t > & signet::forge::SplitBlockBloomFilter::data ( ) const
inline

Access the raw filter data for serialization into a Parquet file.

Returns
Const reference to the internal byte vector.

Definition at line 249 of file split_block.hpp.

◆ estimated_fpr()

double signet::forge::SplitBlockBloomFilter::estimated_fpr ( size_t  num_insertions) const
inline

Estimate the false-positive rate after a given number of insertions.

Uses the standard Bloom filter FPR formula adapted for split blocks: fpr ~ (1 - exp(-k * n / (num_blocks * 256)))^k where n = num_insertions, k = 8 (bits per insertion), and each block contains 256 bits.

Parameters
num_insertionsNumber of distinct values inserted so far.
Returns
Estimated false-positive probability in [0.0, 1.0]. Returns 0.0 if num_insertions is 0.

Definition at line 318 of file split_block.hpp.

◆ from_data()

static SplitBlockBloomFilter signet::forge::SplitBlockBloomFilter::from_data ( const uint8_t *  src,
size_t  size 
)
inlinestatic

Reconstruct a filter from previously serialized bytes.

Parameters
srcPointer to the serialized filter data.
sizeLength of the data in bytes. Must be a positive multiple of 32.
Returns
A filter initialized with the provided data.
Exceptions
std::invalid_argumentif size is 0 or not a multiple of 32.

Definition at line 265 of file split_block.hpp.

◆ insert()

void signet::forge::SplitBlockBloomFilter::insert ( uint64_t  hash)
inline

Insert a pre-computed xxHash64 value into the filter.

The upper 32 bits select the block; the lower 32 bits, multiplied by each of the 8 salt constants, determine the bit positions within that block.

Parameters
hashA 64-bit hash (typically from xxhash::hash64()).

Definition at line 141 of file split_block.hpp.

◆ insert_value() [1/5]

void signet::forge::SplitBlockBloomFilter::insert_value ( const std::string &  val)
inline

Insert a string value (hashed with xxHash64).

Parameters
valThe string to insert.

Definition at line 188 of file split_block.hpp.

◆ insert_value() [2/5]

void signet::forge::SplitBlockBloomFilter::insert_value ( double  val)
inline

Insert a 64-bit double value (hashed with xxHash64).

Parameters
valThe double to insert.

Definition at line 208 of file split_block.hpp.

◆ insert_value() [3/5]

void signet::forge::SplitBlockBloomFilter::insert_value ( float  val)
inline

Insert a 32-bit float value (hashed with xxHash64).

Parameters
valThe float to insert.

Definition at line 203 of file split_block.hpp.

◆ insert_value() [4/5]

void signet::forge::SplitBlockBloomFilter::insert_value ( int32_t  val)
inline

Insert a 32-bit integer value (hashed with xxHash64).

Parameters
valThe int32_t to insert.

Definition at line 193 of file split_block.hpp.

◆ insert_value() [5/5]

void signet::forge::SplitBlockBloomFilter::insert_value ( int64_t  val)
inline

Insert a 64-bit integer value (hashed with xxHash64).

Parameters
valThe int64_t to insert.

Definition at line 198 of file split_block.hpp.

◆ merge()

void signet::forge::SplitBlockBloomFilter::merge ( const SplitBlockBloomFilter other)
inline

Merge another filter into this one via bitwise OR.

After merging, this filter represents the union of both sets. Both filters must have identical byte sizes.

Parameters
otherThe filter to merge into this one.
Exceptions
std::invalid_argumentif the filters have different sizes.

Definition at line 293 of file split_block.hpp.

◆ might_contain()

bool signet::forge::SplitBlockBloomFilter::might_contain ( uint64_t  hash) const
inline

Check if a pre-computed hash value might be present in the filter.

Parameters
hashA 64-bit hash (typically from xxhash::hash64()).
Returns
false if the value is definitely not in the filter; true if the value is probably in the filter (subject to FPR).

Definition at line 164 of file split_block.hpp.

◆ might_contain_value() [1/5]

bool signet::forge::SplitBlockBloomFilter::might_contain_value ( const std::string &  val) const
inline

Check if a string value might be present in the filter.

Parameters
valThe string to test.
Returns
false if definitely absent; true if probably present.

Definition at line 215 of file split_block.hpp.

◆ might_contain_value() [2/5]

bool signet::forge::SplitBlockBloomFilter::might_contain_value ( double  val) const
inline

Check if a 64-bit double value might be present in the filter.

Parameters
valThe double to test.
Returns
false if definitely absent; true if probably present.

Definition at line 239 of file split_block.hpp.

◆ might_contain_value() [3/5]

bool signet::forge::SplitBlockBloomFilter::might_contain_value ( float  val) const
inline

Check if a 32-bit float value might be present in the filter.

Parameters
valThe float to test.
Returns
false if definitely absent; true if probably present.

Definition at line 233 of file split_block.hpp.

◆ might_contain_value() [4/5]

bool signet::forge::SplitBlockBloomFilter::might_contain_value ( int32_t  val) const
inline

Check if a 32-bit integer value might be present in the filter.

Parameters
valThe int32_t to test.
Returns
false if definitely absent; true if probably present.

Definition at line 221 of file split_block.hpp.

◆ might_contain_value() [5/5]

bool signet::forge::SplitBlockBloomFilter::might_contain_value ( int64_t  val) const
inline

Check if a 64-bit integer value might be present in the filter.

Parameters
valThe int64_t to test.
Returns
false if definitely absent; true if probably present.

Definition at line 227 of file split_block.hpp.

◆ num_blocks()

size_t signet::forge::SplitBlockBloomFilter::num_blocks ( ) const
inline

Number of 32-byte blocks in the filter.

Definition at line 255 of file split_block.hpp.

◆ reset()

void signet::forge::SplitBlockBloomFilter::reset ( )
inline

Reset the filter (clear all bits to zero).

Definition at line 304 of file split_block.hpp.

◆ size_bytes()

size_t signet::forge::SplitBlockBloomFilter::size_bytes ( ) const
inline

Total size of the filter in bytes (always a multiple of 32).

Definition at line 252 of file split_block.hpp.

◆ with_size()

static SplitBlockBloomFilter signet::forge::SplitBlockBloomFilter::with_size ( size_t  num_bytes)
inlinestatic

Create a bloom filter with an explicit byte size.

Parameters
num_bytesFilter size in bytes. Must be a positive multiple of 32. Clamped to kMaxBytes if larger.
Returns
A zero-initialized filter of the requested size.
Exceptions
std::invalid_argumentif num_bytes is 0 or not a multiple of 32.

Definition at line 119 of file split_block.hpp.

Member Data Documentation

◆ kBytesPerBlock

constexpr size_t signet::forge::SplitBlockBloomFilter::kBytesPerBlock = 32
staticconstexpr

Block size in bytes (32 bytes = 256 bits = 8 x uint32_t words).

Definition at line 60 of file split_block.hpp.

◆ kMaxBytes

constexpr size_t signet::forge::SplitBlockBloomFilter::kMaxBytes = 128ULL * 1024 * 1024
staticconstexpr

Maximum filter size: 128 MiB (Parquet spec recommendation).

Definition at line 69 of file split_block.hpp.

◆ kMinBytes

constexpr size_t signet::forge::SplitBlockBloomFilter::kMinBytes = kBytesPerBlock
staticconstexpr

Minimum filter size: one block (32 bytes).

Definition at line 66 of file split_block.hpp.

◆ kWordsPerBlock

constexpr size_t signet::forge::SplitBlockBloomFilter::kWordsPerBlock = 8
staticconstexpr

Number of uint32_t words per block (8).

Definition at line 63 of file split_block.hpp.


The documentation for this class was generated from the following file: