Signet Forge 0.1.0
C++20 Parquet library with AI-native extensions
Parquet file reader with typed column access and full encoding support.
#include <reader.hpp>
Classes | |
| struct | RowGroupInfo |
| Summary metadata for a single row group. | |
Public Member Functions | |
| const Schema & | schema () const |
| Return the file's column schema. | |
| int64_t | num_rows () const |
| Return the total number of rows across all row groups. | |
| int64_t | num_row_groups () const |
| Return the number of row groups in the file. | |
| const std::string & | created_by () const |
| Return the created_by string from the file footer metadata. | |
| const std::vector< thrift::KeyValue > & | key_value_metadata () const |
| Return the file-level key-value metadata pairs. | |
| RowGroupInfo | row_group (size_t index) const |
| Return summary metadata for a specific row group. | |
| FileStats | file_stats () const |
| Compute aggregate statistics for the entire file. | |
| template<typename T > | |
| expected< std::vector< T > > | read_column (size_t row_group_index, size_t column_index) |
| Read a single column from a row group as a typed vector. | |
| expected< std::vector< std::string > > | read_column_as_strings (size_t row_group_index, size_t column_index) |
| Read a column and convert every value to its string representation. | |
| expected< std::vector< std::vector< std::string > > > | read_row_group (size_t row_group_index) |
| Read all columns from a single row group as string vectors. | |
| expected< std::vector< std::vector< std::string > > > | read_all () |
| Read the entire file as a row-major vector of string vectors. | |
| expected< std::vector< std::vector< std::string > > > | read_columns (const std::vector< std::string > &column_names) |
| Read a subset of columns (by name) across all row groups. | |
| const thrift::Statistics * | column_statistics (size_t row_group_index, size_t column_index) const |
| Access Parquet column statistics for a specific column chunk. | |
| expected< SplitBlockBloomFilter > | read_bloom_filter (size_t row_group_index, size_t column_index) const |
| Read the Split Block Bloom Filter for a column chunk, if present. | |
| template<typename T > | |
| bool | bloom_might_contain (size_t row_group_index, size_t column_index, const T &value) const |
| Check whether a value might exist in a column using its bloom filter. | |
| expected< ColumnIndex > | read_column_index (size_t row_group_index, size_t column_index) const |
| Read the ColumnIndex (min/max per page) for a column chunk. | |
| expected< OffsetIndex > | read_offset_index (size_t row_group_index, size_t column_index) const |
| Read the OffsetIndex (page locations) for a column chunk. | |
| bool | has_page_index (size_t row_group_index, size_t column_index) const |
| Check whether a column chunk has both ColumnIndex and OffsetIndex data. | |
| ~ParquetReader ()=default | |
| Destructor. Releases the in-memory file buffer and all decode state. | |
| ParquetReader (ParquetReader &&) noexcept=default | |
| Move constructor. | |
| ParquetReader & | operator= (ParquetReader &&) noexcept=default |
| Move assignment operator. | |
Static Public Member Functions | |
| static expected< ParquetReader > | open (const std::filesystem::path &path) |
| Open and parse a Parquet file, returning a ready-to-query reader. | |
| static void | ensure_default_codecs_registered () |
| Ensure common compression codecs are registered in the global CodecRegistry. | |
Parquet file reader with typed column access and full encoding support.
Opens a Parquet file, verifies PAR1 magic bytes, deserializes the Thrift footer to extract FileMetaData, builds a Schema, and provides typed access to column data via ColumnReader.
Supported encodings: PLAIN, RLE_DICTIONARY, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, and RLE (booleans). Decompression is handled transparently via the CodecRegistry (Snappy, ZSTD, LZ4, Gzip).
When built with SIGNET_ENABLE_COMMERCIAL, encrypted Parquet files (PME with PARE footer magic) are supported via a FileDecryptor.
Instances are created through the open() factory method.
Definition at line 167 of file reader.hpp.
ParquetReader::~ParquetReader () | default |
Destructor. Releases the in-memory file buffer and all decode state.
ParquetReader::ParquetReader (ParquetReader &&) | default noexcept |
Move constructor.
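template<typename T > bool ParquetReader::bloom_might_contain (size_t row_group_index, size_t column_index, const T &value) const | inline |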
Check whether a value might exist in a column using its bloom filter.
If no bloom filter is present for the column chunk, returns true (conservative: the value cannot be ruled out). A return value of false guarantees the value is absent; true may be a false positive.
| T | The value type (must be hashable by xxHash64). |
| row_group_index | Zero-based row group index. |
| column_index | Zero-based column index within the row group. |
| value | The value to probe. |
Returns false if the bloom filter definitively excludes the value; true otherwise (including when no filter is available).
Definition at line 984 of file reader.hpp.
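Because a false result is definitive while true is only a hint, the check lends itself to skipping row groups before any decoding happens. A minimal sketch of such a pruning loop follows. The helper `candidate_row_groups` and the caller-supplied predicate are hypothetical stand-ins for a call to bloom_might_contain() and are not part of this library.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: returns the row groups that cannot be ruled out.
// `might_contain` stands in for a call such as
// reader.bloom_might_contain(rg, column_index, value).
std::vector<std::size_t> candidate_row_groups(
    std::size_t num_row_groups,
    const std::function<bool(std::size_t)>& might_contain) {
    std::vector<std::size_t> candidates;
    for (std::size_t rg = 0; rg < num_row_groups; ++rg) {
        // false is definitive, so the row group is skipped.
        // true may be a false positive, so the row group must still be read.
        if (might_contain(rg)) {
            candidates.push_back(rg);
        }
    }
    return candidates;
}
```

Only the row groups returned here would then need a full column read to confirm whether the value is really present.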
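const thrift::Statistics * ParquetReader::column_statistics (size_t row_group_index, size_t column_index) const | inline |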
Access Parquet column statistics for a specific column chunk.
Returns a pointer to the thrift::Statistics stored in the column chunk's metadata (min, max, null_count, distinct_count, etc.). Returns nullptr if the row group index or column index is out of range, or if the column chunk has no statistics.
| row_group_index | Zero-based row group index. |
| column_index | Zero-based column index within the row group. |
Returns a pointer to the column chunk's thrift::Statistics, or nullptr if unavailable.
Definition at line 899 of file reader.hpp.
const std::string & ParquetReader::created_by () const | inline |
Return the created_by string from the file footer metadata.
This typically identifies the library and version that wrote the file (e.g. "signet-forge 0.1.0"). Returns an empty string if the field was not set by the writer.
Definition at line 386 of file reader.hpp.
static void ParquetReader::ensure_default_codecs_registered () | inline static |
Ensure common compression codecs are registered in the global CodecRegistry.
Called automatically by open(). Registers Snappy (always), and optionally ZSTD, LZ4, and Gzip when their respective SIGNET_HAS_* macros are defined. Safe to call multiple times; only the first call performs registration.
Definition at line 511 of file reader.hpp.
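The "only the first call performs registration" behavior can be achieved in several ways, and the mechanism the library actually uses is not shown on this page. A minimal sketch of one common approach, a function-local static whose initializer runs exactly once, is given here. The names `ensure_registered_once` and `registration_count` are hypothetical.

```cpp
// Hypothetical stand-in for a registry: counts how often registration ran.
inline int& registration_count() {
    static int count = 0;
    return count;
}

// Idempotent registration: the lambda runs during the first call only,
// because a function-local static is initialized exactly once.
inline void ensure_registered_once() {
    static const bool done = [] {
        ++registration_count();  // real code would register codecs here
        return true;
    }();
    (void)done;  // silence unused-variable warnings
}
```

Calling `ensure_registered_once()` repeatedly leaves the counter at one, which mirrors the "safe to call multiple times" guarantee described above.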
FileStats ParquetReader::file_stats () const | inline |
Compute aggregate statistics for the entire file.
Iterates over all row groups and columns to produce a FileStats summary including: file size, total rows, row group count, column count, per-column compressed/uncompressed sizes, null counts, compression ratio, bytes-per-row, and bloom filter/page index presence flags.
Returns a FileStats struct populated from the file's metadata.
Definition at line 443 of file reader.hpp.
bool ParquetReader::has_page_index (size_t row_group_index, size_t column_index) const | inline |
Check whether a column chunk has both ColumnIndex and OffsetIndex data.
| row_group_index | Zero-based row group index. |
| column_index | Zero-based column index within the row group. |
Returns true if both column_index_offset and offset_index_offset are present in the column chunk metadata; false otherwise (including for out-of-range indices).
Definition at line 1090 of file reader.hpp.
const std::vector< thrift::KeyValue > & ParquetReader::key_value_metadata () const | inline |
Return the file-level key-value metadata pairs.
These are arbitrary string key-value pairs stored in the Parquet footer (e.g. Arrow schema, pandas metadata). Returns an empty vector if the writer did not set any key-value metadata.
Definition at line 393 of file reader.hpp.
int64_t ParquetReader::num_row_groups () const | inline |
Return the number of row groups in the file.
Definition at line 377 of file reader.hpp.
int64_t ParquetReader::num_rows () const | inline |
Return the total number of rows across all row groups.
Definition at line 374 of file reader.hpp.
static expected< ParquetReader > ParquetReader::open (const std::filesystem::path &path) | inline static |
Open and parse a Parquet file, returning a ready-to-query reader.
Reads the entire file into memory, validates PAR1/PARE magic bytes, deserializes the Thrift footer into FileMetaData, and constructs the column Schema. Common compression codecs (Snappy, and optionally ZSTD/LZ4/Gzip) are registered automatically on first call.
| path | Filesystem path to the .parquet file. |
encryption parameter for decrypting PME-encrypted files (PARE footer magic).
Returns a ParquetReader on success, or an Error with codes such as IO_ERROR, INVALID_FILE, CORRUPT_FOOTER, ENCRYPTION_ERROR, or LICENSE_ERROR.
Definition at line 189 of file reader.hpp.
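The magic-byte validation described above can be illustrated in isolation. The sketch below assumes, as in the Parquet file layout, that the same 4-byte magic appears at both the start and the end of the file. The `Magic` enum and `detect_magic` function are hypothetical and do not reflect how reader.hpp implements the check.

```cpp
#include <cstddef>
#include <string>

// Hypothetical classification of a file image by its magic bytes.
enum class Magic { None, Par1, Pare };

// Compare the first and last four bytes of an in-memory file image.
// "PAR1" marks a plaintext footer, "PARE" an encrypted footer.
Magic detect_magic(const std::string& bytes) {
    constexpr std::size_t kMagicLen = 4;
    // Too short to hold both a leading and a trailing magic.
    if (bytes.size() < 2 * kMagicLen) return Magic::None;
    const std::string head = bytes.substr(0, kMagicLen);
    const std::string tail = bytes.substr(bytes.size() - kMagicLen);
    if (head != tail) return Magic::None;  // mismatched ends
    if (head == "PAR1") return Magic::Par1;
    if (head == "PARE") return Magic::Pare;
    return Magic::None;
}
```

A reader could map `Magic::None` to an INVALID_FILE-style error and route `Magic::Pare` to a decryption path, although the exact error mapping in this library is not documented here.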
ParquetReader & ParquetReader::operator= (ParquetReader &&) | default noexcept |
Move assignment operator.
expected< std::vector< std::vector< std::string > > > ParquetReader::read_all () | inline |
Read the entire file as a row-major vector of string vectors.
Iterates over all row groups via read_row_group() and transposes the column-major data into rows, where each inner vector has one string element per column.
Returns a vector<vector<string>> on success, or an Error if any row group or column read fails.
For selective access, use read_column<T>() or column projection.
Definition at line 804 of file reader.hpp.
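The transposition step described above, from column-major to row-major, can be sketched on its own. The `Table` alias and `transpose_to_rows` function are illustrative names, not the library's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Table = std::vector<std::vector<std::string>>;

// Transpose column-major data (columns[c][r]) into row-major data
// (rows[r][c]), so each inner vector holds one string per column.
Table transpose_to_rows(const Table& columns) {
    if (columns.empty()) return {};
    const std::size_t num_rows = columns.front().size();
    Table rows(num_rows, std::vector<std::string>(columns.size()));
    for (std::size_t c = 0; c < columns.size(); ++c) {
        // Every column of a row group must have the same length.
        assert(columns[c].size() == num_rows);
        for (std::size_t r = 0; r < num_rows; ++r) {
            rows[r][c] = columns[c][r];
        }
    }
    return rows;
}
```

Since every value is copied into a new string, this layout is convenient for printing or CSV export but costly for large files.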
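expected< SplitBlockBloomFilter > ParquetReader::read_bloom_filter (size_t row_group_index, size_t column_index) const | inline |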
Read the Split Block Bloom Filter for a column chunk, if present.
Locates the bloom filter in the file using the column chunk's bloom_filter_offset, reads the 4-byte LE size header, validates alignment to kBytesPerBlock, and deserializes the filter data.
| row_group_index | Zero-based row group index. |
| column_index | Zero-based column index within the row group. |
Returns a SplitBlockBloomFilter on success, or an Error with INVALID_FILE (no filter), OUT_OF_RANGE, or CORRUPT_PAGE.
Definition at line 925 of file reader.hpp.
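The size-header handling described above can be sketched as follows. The block size of 32 bytes is an assumption taken from the Parquet split block bloom filter layout, so the library's kBytesPerBlock may differ. The function `read_filter_size` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Assumption: 32-byte blocks; the library's kBytesPerBlock may differ.
constexpr std::uint32_t kAssumedBytesPerBlock = 32;

// Decode a 4-byte little-endian size at `offset`, then check that it is a
// non-zero multiple of the block size and that the payload fits in the file.
std::optional<std::uint32_t> read_filter_size(
    const std::vector<std::uint8_t>& file, std::size_t offset) {
    if (offset > file.size() || file.size() - offset < 4) return std::nullopt;
    const std::uint32_t size =
        static_cast<std::uint32_t>(file[offset]) |
        (static_cast<std::uint32_t>(file[offset + 1]) << 8) |
        (static_cast<std::uint32_t>(file[offset + 2]) << 16) |
        (static_cast<std::uint32_t>(file[offset + 3]) << 24);
    if (size == 0 || size % kAssumedBytesPerBlock != 0) return std::nullopt;
    if (file.size() - offset - 4 < size) return std::nullopt;
    return size;
}
```

Assembling the value byte by byte keeps the decoding independent of the host's endianness. In the reader, a failed check would presumably surface as one of the error codes listed above rather than as an empty optional.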
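template<typename T > expected< std::vector< T > > ParquetReader::read_column (size_t row_group_index, size_t column_index) | inline |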
Read a single column from a row group as a typed vector.
Automatically selects the appropriate decoding path based on the column's encoding metadata.
Decompression and (in commercial builds) decryption are applied transparently before decoding.
| T | The C++ value type. Must match the column's physical type: bool, int32_t, int64_t, float, double, or std::string. |
| row_group_index | Zero-based row group index. |
| column_index | Zero-based column index within the schema. |
Returns the decoded values as a std::vector<T> on success, or an Error with codes such as OUT_OF_RANGE, CORRUPT_PAGE, UNSUPPORTED_ENCODING, or UNSUPPORTED_COMPRESSION.
Definition at line 554 of file reader.hpp.
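expected< std::vector< std::string > > ParquetReader::read_column_as_strings (size_t row_group_index, size_t column_index) | inline |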
Read a column and convert every value to its string representation.
Dispatches to read_column<T>() based on the column's physical type, then converts each value using std::to_string() (numeric types), "true"/"false" (booleans), identity (BYTE_ARRAY/string), or hex-encoding (FIXED_LEN_BYTE_ARRAY).
| row_group_index | Zero-based row group index. |
| column_index | Zero-based column index within the schema. |
Returns the values as a vector of strings on success, or an Error on failure.
Definition at line 698 of file reader.hpp.
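Two of the conversions named above, hex encoding for fixed-length byte arrays and the boolean literals, can be sketched with hypothetical helpers. Lowercase hex digits are an assumption, since this page does not state the digit case.

```cpp
#include <string>
#include <string_view>

// Hex-encode raw bytes as two digits per byte.
// Assumption: lowercase digits; the library may use uppercase.
std::string to_hex(std::string_view bytes) {
    static constexpr char kDigits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char b : bytes) {
        out.push_back(kDigits[b >> 4]);    // high nibble
        out.push_back(kDigits[b & 0x0F]);  // low nibble
    }
    return out;
}

// Booleans map to the literals "true" and "false".
std::string bool_to_string(bool v) { return v ? "true" : "false"; }
```

For numeric columns, note that std::to_string() formats floating-point values with six fixed decimals (1.5 becomes "1.500000"), so the string form is not a round-trip representation of a double.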
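expected< ColumnIndex > ParquetReader::read_column_index (size_t row_group_index, size_t column_index) const | inline |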
Read the ColumnIndex (min/max per page) for a column chunk.
Deserializes the Thrift-encoded ColumnIndex structure from the file at the offset recorded in the column chunk metadata. The ColumnIndex enables predicate pushdown by providing per-page min/max boundaries and null page flags.
| row_group_index | Zero-based row group index. |
| column_index | Zero-based column index within the row group. |
Returns a ColumnIndex on success, or an Error with OUT_OF_RANGE, INVALID_FILE (no index), or CORRUPT_PAGE.
Definition at line 1010 of file reader.hpp.
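Per-page min/max boundaries enable predicate pushdown because a page can be skipped whenever its value range cannot overlap the query range. A minimal sketch for an integer column follows. `PageBounds` is a hypothetical stand-in and does not mirror the fields of the library's ColumnIndex.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one page's entry in a column index.
struct PageBounds {
    std::int64_t min;
    std::int64_t max;
    bool null_page;  // true when the page holds only nulls
};

// Return the indices of pages whose [min, max] range overlaps [lo, hi].
std::vector<std::size_t> pages_overlapping(
    const std::vector<PageBounds>& pages, std::int64_t lo, std::int64_t hi) {
    std::vector<std::size_t> keep;
    for (std::size_t i = 0; i < pages.size(); ++i) {
        const PageBounds& p = pages[i];
        if (p.null_page) continue;               // no non-null values to match
        if (p.max < lo || p.min > hi) continue;  // ranges are disjoint
        keep.push_back(i);
    }
    return keep;
}
```

Pages that survive the filter still need to be decoded and checked row by row, because overlapping ranges do not guarantee a match.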
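expected< std::vector< std::vector< std::string > > > ParquetReader::read_columns (const std::vector< std::string > &column_names) | inline |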
Read a subset of columns (by name) across all row groups.
Resolves each column name to its schema index via Schema::find_column(), then reads that column from every row group, concatenating the results into a single vector per projected column. The outer vector is ordered to match column_names.
| column_names | Column names to project. Each must exist in the schema or SCHEMA_MISMATCH is returned. |
Returns a vector<vector<string>> with one entry per requested column, spanning all row groups.
Definition at line 850 of file reader.hpp.
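The projection logic described above (resolve each name to an index, then concatenate that column across row groups) can be sketched on plain in-memory data. The `project` function and the `Column` and `RowGroup` aliases are hypothetical. An empty optional stands in for the SCHEMA_MISMATCH error.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Column = std::vector<std::string>;
using RowGroup = std::vector<Column>;  // one Column per schema column

// Project the `wanted` columns across all row groups, in the order given.
// Returns std::nullopt if a requested name is not in the schema.
std::optional<std::vector<Column>> project(
    const std::vector<std::string>& schema_names,
    const std::vector<RowGroup>& row_groups,
    const std::vector<std::string>& wanted) {
    std::vector<Column> out;
    out.reserve(wanted.size());
    for (const std::string& name : wanted) {
        const auto it = std::find(schema_names.begin(), schema_names.end(), name);
        if (it == schema_names.end()) return std::nullopt;  // unknown column
        const auto idx = static_cast<std::size_t>(it - schema_names.begin());
        Column merged;
        for (const RowGroup& rg : row_groups) {
            // Append this row group's values for the selected column.
            merged.insert(merged.end(), rg[idx].begin(), rg[idx].end());
        }
        out.push_back(std::move(merged));
    }
    return out;
}
```

The result stays column-major, so its outer vector follows the order of the requested names rather than the schema order.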
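expected< OffsetIndex > ParquetReader::read_offset_index (size_t row_group_index, size_t column_index) const | inline |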
Read the OffsetIndex (page locations) for a column chunk.
Deserializes the Thrift-encoded OffsetIndex structure, which maps each data page to its file offset, compressed size, and first row index. Used together with ColumnIndex for page-level predicate pushdown and selective I/O.
| row_group_index | Zero-based row group index. |
| column_index | Zero-based column index within the row group. |
Returns an OffsetIndex on success, or an Error with OUT_OF_RANGE, INVALID_FILE (no index), or CORRUPT_PAGE.
Definition at line 1052 of file reader.hpp.
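Since each page location records the index of its first row, the rows covered by a page follow from the next page's first row index, or from the row group's row count for the last page. A minimal sketch of that calculation follows. `PageLocation` is a hypothetical stand-in whose fields mirror the three values named above, and it is not the library's OffsetIndex type.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one page location entry.
struct PageLocation {
    std::int64_t offset;           // byte offset of the page in the file
    std::int32_t compressed_size;  // page size on disk, in bytes
    std::int64_t first_row_index;  // first row of the page in the row group
};

// Half-open row range [first, last) covered by page `i`.
// Throws std::out_of_range (via at()) if `i` is not a valid page index.
std::pair<std::int64_t, std::int64_t> page_row_range(
    const std::vector<PageLocation>& pages, std::size_t i,
    std::int64_t rows_in_group) {
    const std::int64_t first = pages.at(i).first_row_index;
    const std::int64_t last = (i + 1 < pages.size())
                                  ? pages[i + 1].first_row_index
                                  : rows_in_group;
    return {first, last};
}
```

Combined with the pages selected through per-page min/max bounds, these ranges indicate which byte spans to read and which rows they correspond to.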
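expected< std::vector< std::vector< std::string > > > ParquetReader::read_row_group (size_t row_group_index) | inline |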
Read all columns from a single row group as string vectors.
Calls read_column_as_strings() for every column in the schema, producing one vector<string> per column. The outer vector is indexed by column ordinal.
| row_group_index | Zero-based row group index. |
Returns one vector<string> per column on success, or an Error if any column read fails.
Definition at line 774 of file reader.hpp.
RowGroupInfo ParquetReader::row_group (size_t index) const | inline |
Return summary metadata for a specific row group.
| index | Zero-based row group index. Must be less than num_row_groups(). |
Returns a RowGroupInfo struct with row count, byte size, and index.
| std::out_of_range | if index >= num_row_groups(). |
Definition at line 417 of file reader.hpp.
const Schema & ParquetReader::schema () const | inline |
Return the file's column schema.
Definition at line 371 of file reader.hpp.