Signet Forge 0.1.0
C++20 Parquet library with AI-native extensions
Dictionary encoder for Parquet PLAIN_DICTIONARY / RLE_DICTIONARY encoding.
#include <dictionary.hpp>
Public Member Functions

| Return type | Member | Description |
| --- | --- | --- |
| | DictionaryEncoder ()=default | Default-construct an empty dictionary encoder. |
| bool | is_full () const | Check whether the dictionary has reached its maximum capacity. |
| bool | put (const T &value) | Add a value to the encoding stream. |
| void | flush () | Finalize the encoding. |
| std::vector< uint8_t > | dictionary_page () const | Get the dictionary page as PLAIN-encoded unique values. |
| std::vector< uint8_t > | indices_page () const | Get the data page as RLE_DICTIONARY-encoded indices. |
| size_t | dictionary_size () const | Number of unique values in the dictionary. |
| size_t | num_values () const | Total number of values encoded (including duplicates). |
| int | bit_width () const | Bits per dictionary index (ceil(log2(dictionary_size))). |
| void | reset () | Reset the encoder, clearing the dictionary, indices, and all internal state. |
| bool | is_worthwhile () const | Heuristic check: is dictionary encoding worthwhile for this data? |
Static Public Attributes

| Type | Member | Description |
| --- | --- | --- |
| static constexpr size_t | MAX_DICTIONARY_ENTRIES = 1 << 20 | Maximum number of dictionary entries before fallback to PLAIN encoding. |
Builds a dictionary of unique values, assigning each a sequential integer index. The dictionary page is PLAIN-encoded (one entry per unique value), and the data page is an RLE/Bit-Packing Hybrid stream of dictionary indices prefixed by a 1-byte bit_width. Use dictionary_page() and indices_page() after flush() to retrieve the encoded outputs.
Template Parameters
| T | The value type (std::string, int32_t, int64_t, float, or double). |
Definition at line 260 of file dictionary.hpp.
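The call order described above (put() for each value, then flush(), then the two page getters) can be sketched as a small helper. This is only an illustration: `encode_column` and `DictionaryPages` are hypothetical names, not part of the library, and the helper is templated on the encoder type so it relies solely on the member names listed on this page. Resetting the encoder on fallback and consulting is_worthwhile() after flush() are assumptions about reasonable usage, not documented requirements.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical container for the two encoded outputs.
struct DictionaryPages {
    std::vector<std::uint8_t> dictionary;  // PLAIN-encoded unique values
    std::vector<std::uint8_t> indices;     // bit_width byte + RLE/bit-packed indices
};

// Hypothetical helper: feeds `values` through `enc` in the documented order.
// Returns std::nullopt when the caller should fall back to PLAIN encoding,
// either because put() reported a full dictionary or because
// is_worthwhile() judged the data too high-cardinality.
// `Encoder` is any type exposing the members listed on this page.
template <typename Encoder, typename T>
std::optional<DictionaryPages> encode_column(Encoder& enc,
                                             const std::vector<T>& values) {
    for (const T& v : values) {
        if (!enc.put(v)) {
            enc.reset();  // assumption: discard partial state before falling back
            return std::nullopt;
        }
    }
    enc.flush();  // must follow the last put()
    if (!enc.is_worthwhile()) {
        enc.reset();
        return std::nullopt;
    }
    return DictionaryPages{enc.dictionary_page(), enc.indices_page()};
}
```

With the real class this would presumably be called with a `DictionaryEncoder<T>` instance and a `std::vector<T>`; an empty result signals that the column should be written with PLAIN encoding instead.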
DictionaryEncoder () [default]
Default-construct an empty dictionary encoder.
int bit_width () const [inline]
Bits per dictionary index (ceil(log2(dictionary_size))).
Definition at line 361 of file dictionary.hpp.
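The formula ceil(log2(dictionary_size)) can be computed without floating point by counting the bits needed for the largest index, dictionary_size - 1. The function below is a standalone sketch with a hypothetical name (`index_bit_width`); returning 0 for an empty or single-entry dictionary is an assumption, since this page does not say how the library treats those cases.

```cpp
#include <cstddef>

// Smallest b such that 2^b >= n, i.e. ceil(log2(n)) for n >= 1.
// Assumption: n <= 1 yields 0, since a single entry needs no index bits.
inline int index_bit_width(std::size_t n) {
    if (n <= 1) return 0;
    std::size_t max_index = n - 1;  // largest index that must be representable
    int bits = 0;
    while (max_index != 0) {        // count the significant bits of max_index
        ++bits;
        max_index >>= 1;
    }
    return bits;
}
```

For example, 3 or 4 entries need 2 bits, 5 entries need 3 bits, and a dictionary at the MAX_DICTIONARY_ENTRIES limit of 1 << 20 needs 20 bits.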
std::vector< uint8_t > dictionary_page () const [inline]
Get the dictionary page as PLAIN-encoded unique values.
Returns the raw bytes of the dictionary page, suitable for writing as a Parquet DICTIONARY_PAGE. Each entry is encoded per its type (BYTE_ARRAY for strings, fixed-width LE for numeric types).
Definition at line 317 of file dictionary.hpp.
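To make the byte layout concrete, here is a minimal sketch of Parquet PLAIN encoding for two of the supported value types. In PLAIN encoding a BYTE_ARRAY entry is a 4-byte little-endian length followed by the raw bytes, and an INT32 entry is 4 little-endian bytes. The function names (`append_le32`, `plain_encode`) are hypothetical and this is not the library's implementation.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Append a 32-bit value in little-endian byte order.
inline void append_le32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<std::uint8_t>(v >> shift));
}

// PLAIN BYTE_ARRAY: 4-byte little-endian length, then the raw bytes.
inline std::vector<std::uint8_t> plain_encode(const std::vector<std::string>& dict) {
    std::vector<std::uint8_t> out;
    for (const std::string& s : dict) {
        append_le32(out, static_cast<std::uint32_t>(s.size()));
        out.insert(out.end(), s.begin(), s.end());
    }
    return out;
}

// PLAIN INT32: each value as 4 little-endian bytes, no length prefix.
inline std::vector<std::uint8_t> plain_encode(const std::vector<std::int32_t>& dict) {
    std::vector<std::uint8_t> out;
    for (std::int32_t v : dict)
        append_le32(out, static_cast<std::uint32_t>(v));
    return out;
}
```

Entries appear in dictionary-index order, so the reader can recover index i by decoding the i-th entry of the page.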
size_t dictionary_size () const [inline]
Number of unique values in the dictionary.
Definition at line 351 of file dictionary.hpp.
void flush () [inline]
Finalize the encoding.
Must be called after the last put() call and before dictionary_page() or indices_page() is read.
Definition at line 304 of file dictionary.hpp.
std::vector< uint8_t > indices_page () const [inline]
Get the data page as RLE_DICTIONARY-encoded indices.
Returns a byte buffer starting with a 1-byte bit_width followed by the RLE/Bit-Packing Hybrid encoded dictionary indices. This is the format expected by Parquet for RLE_DICTIONARY (encoding type 8) data pages.
Definition at line 333 of file dictionary.hpp.
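The layout described above can be illustrated with a simplified encoder that emits only RLE runs, which is still a valid RLE/Bit-Packing Hybrid stream. Per the Parquet format, an RLE run is a varint header holding the run length shifted left by one (low bit 0), followed by the repeated value in ceil(bit_width / 8) little-endian bytes. The names `append_varint` and `encode_indices_rle` are hypothetical, and a production encoder would normally also emit bit-packed runs for stretches without repetition.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append an unsigned LEB128 varint (7 bits per byte, low group first).
inline void append_varint(std::vector<std::uint8_t>& out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(v | 0x80));  // low 7 bits + continuation bit
        v >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(v));
}

// Encode indices as <1-byte bit_width><RLE run>...
// Simplification: every stretch of equal indices becomes one RLE run;
// bit-packed runs are never produced.
inline std::vector<std::uint8_t> encode_indices_rle(
        const std::vector<std::uint32_t>& indices, int bit_width) {
    std::vector<std::uint8_t> out;
    out.push_back(static_cast<std::uint8_t>(bit_width));
    const int value_bytes = (bit_width + 7) / 8;  // bit width rounded up to whole bytes
    std::size_t i = 0;
    while (i < indices.size()) {
        std::size_t j = i + 1;
        while (j < indices.size() && indices[j] == indices[i]) ++j;
        // Header: run length shifted left by one; low bit 0 marks an RLE run.
        append_varint(out, static_cast<std::uint64_t>(j - i) << 1);
        for (int b = 0; b < value_bytes; ++b)
            out.push_back(static_cast<std::uint8_t>(indices[i] >> (8 * b)));
        i = j;
    }
    return out;
}
```

For the indices 0, 0, 0, 1, 1, 2 with a bit width of 2, this yields the bytes 2, 6, 0, 4, 1, 2, 2: the bit-width byte, then three (header, value) pairs for runs of length 3, 2, and 1.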
bool is_full () const [inline]
Check whether the dictionary has reached its maximum capacity.
Prevents DoS via unbounded dictionary growth from high-cardinality input.
Returns true if the dictionary is full and put() will return false.
Definition at line 272 of file dictionary.hpp.
bool is_worthwhile () const [inline]
Heuristic check: is dictionary encoding worthwhile for this data?
Returns true when fewer than 40% of values are unique (i.e., there is meaningful repetition). High-cardinality columns (many unique values relative to total rows) should fall back to PLAIN encoding.
Returns true if dictionary encoding provides good compression.
Definition at line 381 of file dictionary.hpp.
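The 40% rule can be expressed as a ratio test between dictionary_size() and num_values(). The sketch below uses integer arithmetic to avoid floating-point rounding; the function name `dictionary_worthwhile` is hypothetical, and returning false for empty input is an assumption, since this page does not describe that case.

```cpp
#include <cstddef>

// True when unique / total < 0.40, i.e. fewer than 40% of values are unique.
// Assumption: an empty input is treated as not worthwhile.
inline bool dictionary_worthwhile(std::size_t unique, std::size_t total) {
    if (total == 0) return false;
    // unique / total < 2 / 5, rearranged to stay in integers.
    return unique * 5 < total * 2;
}
```

For instance, 3 unique values out of 10 qualifies, while 4 out of 10 sits exactly at 40% and does not.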
size_t num_values () const [inline]
Total number of values encoded (including duplicates).
Definition at line 356 of file dictionary.hpp.
bool put (const T &value) [inline]
Add a value to the encoding stream.
If value has not been seen before, it is assigned a fresh sequential dictionary index. The corresponding index is appended to the internal indices buffer regardless.
Parameters
| value | The value to encode. |
Returns true on success, false if the dictionary is full (DoS prevention; the caller should fall back to PLAIN encoding).
Definition at line 283 of file dictionary.hpp.
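The index-assignment behavior described above can be sketched with a hash map from value to index plus a vector of emitted indices. `ToyDictionary` is a hypothetical stand-in for std::string values, not the library's class. One point is an assumption: this sketch still accepts already-seen values once the cap is reached, whereas this page does not state whether the real put() does the same.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in illustrating put(): not the library's implementation.
class ToyDictionary {
public:
    explicit ToyDictionary(std::size_t max_entries) : max_entries_(max_entries) {}

    // Looks up `value`; unseen values get the next sequential index.
    // Returns false, leaving all state unchanged, when a new entry
    // would exceed the cap.
    bool put(const std::string& value) {
        auto it = index_of_.find(value);
        if (it == index_of_.end()) {
            if (entries_.size() >= max_entries_) return false;
            const auto next = static_cast<std::uint32_t>(entries_.size());
            it = index_of_.emplace(value, next).first;
            entries_.push_back(value);
        }
        indices_.push_back(it->second);  // appended for new and repeated values alike
        return true;
    }

    const std::vector<std::string>& entries() const { return entries_; }
    const std::vector<std::uint32_t>& indices() const { return indices_; }

private:
    std::size_t max_entries_;
    std::unordered_map<std::string, std::uint32_t> index_of_;
    std::vector<std::string> entries_;    // dictionary in index order
    std::vector<std::uint32_t> indices_;  // one index per put() that succeeded
};
```

Keeping the entries in a separate vector preserves index order, which is the order a dictionary page needs when it is serialized.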
void reset () [inline]
Reset the encoder, clearing the dictionary, indices, and all internal state.
After reset, the encoder can be reused for a new encoding session.
Definition at line 368 of file dictionary.hpp.
static constexpr size_t MAX_DICTIONARY_ENTRIES = 1 << 20
Maximum number of dictionary entries before fallback to PLAIN encoding.
Mitigates CWE-400 (Uncontrolled Resource Consumption) by bounding dictionary memory growth. The limit of 1 << 20 is 1,048,576 entries.
Definition at line 267 of file dictionary.hpp.