Signet Forge 0.1.0
C++20 Parquet library with AI-native extensions
signet::forge::DictionaryEncoder< T > Class Template Reference

Dictionary encoder for Parquet PLAIN_DICTIONARY / RLE_DICTIONARY encoding.

#include <dictionary.hpp>

Public Member Functions

 DictionaryEncoder ()=default
 Default-construct an empty dictionary encoder.
 
bool is_full () const
 Check whether the dictionary has reached its maximum capacity.
 
bool put (const T &value)
 Add a value to the encoding stream.
 
void flush ()
 Finalize the encoding.
 
std::vector< uint8_t > dictionary_page () const
 Get the dictionary page as PLAIN-encoded unique values.
 
std::vector< uint8_t > indices_page () const
 Get the data page as RLE_DICTIONARY-encoded indices.
 
size_t dictionary_size () const
 Number of unique values in the dictionary.
 
size_t num_values () const
 Total number of values encoded (including duplicates).
 
int bit_width () const
 Bits per dictionary index (ceil(log2(dictionary_size))).
 
void reset ()
 Reset the encoder, clearing the dictionary, indices, and all internal state.
 
bool is_worthwhile () const
 Heuristic check: is dictionary encoding worthwhile for this data?
 

Static Public Attributes

static constexpr size_t MAX_DICTIONARY_ENTRIES = 1 << 20
 Maximum number of dictionary entries before fallback to PLAIN encoding.
 

Detailed Description

template<typename T>
class signet::forge::DictionaryEncoder< T >

Dictionary encoder for Parquet PLAIN_DICTIONARY / RLE_DICTIONARY encoding.

Builds a dictionary of unique values, assigning each a sequential integer index. The dictionary page is PLAIN-encoded (one entry per unique value), and the data page is an RLE/Bit-Packing Hybrid stream of dictionary indices prefixed by a 1-byte bit_width. Use dictionary_page() and indices_page() after flush() to retrieve the encoded outputs.

Template Parameters
T    The value type (std::string, int32_t, int64_t, float, or double).
See also
DictionaryDecoder, RleEncoder

Definition at line 260 of file dictionary.hpp.
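
The workflow described above (put() for each value, then flush(), then dictionary_page() and indices_page()) might be driven as in the sketch below. The helper try_dictionary_encode and the DictionaryPages struct are hypothetical names introduced for illustration and are not part of Signet Forge. The helper is a template over the encoder type, so it relies only on the member functions documented on this page.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result type (not part of the library).
struct DictionaryPages {
    bool use_dictionary = false;           // false: caller should write PLAIN instead
    std::vector<std::uint8_t> dictionary;  // bytes returned by dictionary_page()
    std::vector<std::uint8_t> indices;     // bytes returned by indices_page()
};

// Hypothetical driver. Encoder is expected to look like DictionaryEncoder<T>:
// it must provide put(), flush(), is_worthwhile(), dictionary_page(), indices_page().
template <typename Encoder, typename T>
DictionaryPages try_dictionary_encode(Encoder& encoder, const std::vector<T>& values) {
    DictionaryPages result;
    for (const T& value : values) {
        // put() is documented to return false once the dictionary is full.
        if (!encoder.put(value)) {
            return result;  // signal fallback to PLAIN
        }
    }
    encoder.flush();  // documented as a no-op, kept for API symmetry
    if (!encoder.is_worthwhile()) {
        return result;  // too many unique values: fall back to PLAIN
    }
    result.use_dictionary = true;
    result.dictionary = encoder.dictionary_page();
    result.indices = encoder.indices_page();
    return result;
}
```

With the real class, a call would presumably look like try_dictionary_encode(enc, column) where enc is a DictionaryEncoder<std::string> and column is a std::vector<std::string>. When use_dictionary comes back false, the caller writes the column with PLAIN encoding instead.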

Constructor & Destructor Documentation

◆ DictionaryEncoder()

template<typename T >
signet::forge::DictionaryEncoder< T >::DictionaryEncoder ( )
default

Default-construct an empty dictionary encoder.

Member Function Documentation

◆ bit_width()

template<typename T >
int signet::forge::DictionaryEncoder< T >::bit_width ( ) const
inline

Bits per dictionary index (ceil(log2(dictionary_size))).

Returns
Bit width for index encoding (0 for single-entry dictionaries).

Definition at line 361 of file dictionary.hpp.
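
As a worked illustration of ceil(log2(dictionary_size)), the sketch below computes the same quantity with integer arithmetic. The function name index_bit_width is hypothetical, and returning 0 for an empty dictionary is an assumption, since this page only states the single-entry case.

```cpp
#include <cstddef>

// Hypothetical helper: number of bits needed to represent indices
// 0 .. dictionary_size - 1, i.e. ceil(log2(dictionary_size)).
int index_bit_width(std::size_t dictionary_size) {
    if (dictionary_size <= 1) {
        return 0;  // single entry: documented as 0; empty: assumed 0
    }
    std::size_t max_index = dictionary_size - 1;
    int width = 0;
    while (max_index > 0) {  // count the bits of the largest index
        ++width;
        max_index >>= 1;
    }
    return width;
}
```

Under this formula a 2-entry dictionary needs 1 bit per index, 3 or 4 entries need 2 bits, and a dictionary at the MAX_DICTIONARY_ENTRIES limit of 1 << 20 needs 20 bits.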

◆ dictionary_page()

template<typename T >
std::vector< uint8_t > signet::forge::DictionaryEncoder< T >::dictionary_page ( ) const
inline

Get the dictionary page as PLAIN-encoded unique values.

Returns the raw bytes of the dictionary page, suitable for writing as a Parquet DICTIONARY_PAGE. Each entry is encoded per its type (BYTE_ARRAY for strings, fixed-width LE for numeric types).

Returns
PLAIN-encoded dictionary page bytes.
See also
indices_page

Definition at line 317 of file dictionary.hpp.
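
For reference, the PLAIN layouts named above can be sketched as follows. In the Parquet format, a BYTE_ARRAY value is a 4-byte little-endian length followed by the raw bytes, and an INT32 is 4 little-endian bytes. The helper names are hypothetical, and whether the library's internals are organized this way is an assumption.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append an INT32 in PLAIN form (4 bytes, little-endian).
void plain_append_int32(std::vector<std::uint8_t>& out, std::int32_t value) {
    const std::uint32_t bits = static_cast<std::uint32_t>(value);
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>(bits >> shift));
    }
}

// Hypothetical helper: append a BYTE_ARRAY in PLAIN form
// (4-byte little-endian length, then the bytes themselves).
void plain_append_byte_array(std::vector<std::uint8_t>& out, const std::string& value) {
    const std::uint32_t length = static_cast<std::uint32_t>(value.size());
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>(length >> shift));
    }
    for (char c : value) {
        out.push_back(static_cast<std::uint8_t>(c));
    }
}
```

A dictionary page is then the concatenation of one such entry per unique value, in index order.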

◆ dictionary_size()

template<typename T >
size_t signet::forge::DictionaryEncoder< T >::dictionary_size ( ) const
inline

Number of unique values in the dictionary.

Returns
Dictionary cardinality.

Definition at line 351 of file dictionary.hpp.

◆ flush()

template<typename T >
void signet::forge::DictionaryEncoder< T >::flush ( )
inline

Finalize the encoding.

Call this after all put() calls and before retrieving dictionary_page() or indices_page().

Note
This is a no-op for DictionaryEncoder (indices are stored incrementally). It exists for API symmetry with other encoders.

Definition at line 304 of file dictionary.hpp.

◆ indices_page()

template<typename T >
std::vector< uint8_t > signet::forge::DictionaryEncoder< T >::indices_page ( ) const
inline

Get the data page as RLE_DICTIONARY-encoded indices.

Returns a byte buffer starting with a 1-byte bit_width followed by the RLE/Bit-Packing Hybrid encoded dictionary indices. This is the format expected by Parquet for RLE_DICTIONARY (encoding type 8) data pages.

Returns
Encoded indices page (1-byte bit_width prefix + RLE payload).
See also
dictionary_page, DictionaryDecoder::decode

Definition at line 333 of file dictionary.hpp.
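
To make the byte layout concrete, the sketch below builds an indices page using only RLE runs. This is a valid but simplified form of the Parquet RLE/Bit-Packing Hybrid: each run is a varint header holding (run_length << 1), followed by the repeated value stored in ceil(bit_width / 8) little-endian bytes. The function names are hypothetical, and the library's own encoder presumably also emits bit-packed runs, which this sketch omits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append an unsigned LEB128 varint, as used for run headers.
void append_varint(std::vector<std::uint8_t>& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(value | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Hypothetical, simplified encoder: 1-byte bit_width prefix, then RLE runs only.
std::vector<std::uint8_t> rle_only_indices_page(const std::vector<std::uint32_t>& indices,
                                                int bit_width) {
    std::vector<std::uint8_t> out;
    out.push_back(static_cast<std::uint8_t>(bit_width));  // 1-byte prefix
    const int value_bytes = (bit_width + 7) / 8;
    std::size_t i = 0;
    while (i < indices.size()) {
        std::size_t j = i;
        while (j < indices.size() && indices[j] == indices[i]) {
            ++j;  // extend the run of equal indices
        }
        // Low bit 0 marks an RLE run; the remaining bits hold the run length.
        append_varint(out, static_cast<std::uint64_t>(j - i) << 1);
        for (int b = 0; b < value_bytes; ++b) {
            out.push_back(static_cast<std::uint8_t>(indices[i] >> (8 * b)));
        }
        i = j;
    }
    return out;
}
```

For indices 0, 0, 0, 1, 1 with a bit width of 1, this produces the bytes 01 06 00 04 01: the prefix, a run of three 0s, and a run of two 1s.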

◆ is_full()

template<typename T >
bool signet::forge::DictionaryEncoder< T >::is_full ( ) const
inline

Check whether the dictionary has reached its maximum capacity.

Prevents DoS via unbounded dictionary growth from high-cardinality input.

Returns
true if the dictionary is full and put() will return false.

Definition at line 272 of file dictionary.hpp.

◆ is_worthwhile()

template<typename T >
bool signet::forge::DictionaryEncoder< T >::is_worthwhile ( ) const
inline

Heuristic check: is dictionary encoding worthwhile for this data?

Returns true when fewer than 40% of values are unique (i.e., there is meaningful repetition). High-cardinality columns (many unique values relative to total rows) should fall back to PLAIN encoding.

Returns
true if dictionary encoding provides good compression.

Definition at line 381 of file dictionary.hpp.
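
The 40% rule can be expressed without floating point, as in the sketch below. The function name is hypothetical, and returning false when no values have been encoded is an assumption, as this page does not describe the empty case.

```cpp
#include <cstddef>

// Hypothetical restatement of the heuristic: true when strictly fewer than
// 40% of the values are unique (unique / total < 0.4).
bool dictionary_is_worthwhile(std::size_t unique_values, std::size_t total_values) {
    if (total_values == 0) {
        return false;  // assumption: nothing encoded means nothing gained
    }
    // unique / total < 4 / 10, cross-multiplied to stay in integers
    return unique_values * 10 < total_values * 4;
}
```

For example, 3 unique values out of 10 passes the check, while 4 out of 10 sits exactly at 40% and does not.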

◆ num_values()

template<typename T >
size_t signet::forge::DictionaryEncoder< T >::num_values ( ) const
inline

Total number of values encoded (including duplicates).

Returns
Total row count fed to put().

Definition at line 356 of file dictionary.hpp.

◆ put()

template<typename T >
bool signet::forge::DictionaryEncoder< T >::put ( const T &  value)
inline

Add a value to the encoding stream.

If value has not been seen before, it is assigned a fresh sequential dictionary index. The corresponding index is appended to the internal indices buffer regardless.

Parameters
value    The value to encode.
Returns
true on success, false if the dictionary is full (DoS prevention — caller should fall back to PLAIN encoding).

Definition at line 283 of file dictionary.hpp.

◆ reset()

template<typename T >
void signet::forge::DictionaryEncoder< T >::reset ( )
inline

Reset the encoder, clearing the dictionary, indices, and all internal state.

After reset, the encoder can be reused for a new encoding session.

Definition at line 368 of file dictionary.hpp.

Member Data Documentation

◆ MAX_DICTIONARY_ENTRIES

template<typename T >
constexpr size_t signet::forge::DictionaryEncoder< T >::MAX_DICTIONARY_ENTRIES = 1 << 20
static constexpr

Maximum number of dictionary entries before fallback to PLAIN encoding.

CWE-400: Uncontrolled Resource Consumption — bounds dictionary memory growth.

Definition at line 267 of file dictionary.hpp.


The documentation for this class was generated from the following file: