Signet Forge 0.1.0
C++20 Parquet library with AI-native extensions
DEMO
signet::forge::ParquetReader Class Reference

Parquet file reader with typed column access and full encoding support.

#include <reader.hpp>

Classes

struct  RowGroupInfo
 Summary metadata for a single row group.
 

Public Member Functions

const Schema & schema () const
 Return the file's column schema.
 
int64_t num_rows () const
 Return the total number of rows across all row groups.
 
int64_t num_row_groups () const
 Return the number of row groups in the file.
 
const std::string & created_by () const
 Return the created_by string from the file footer metadata.
 
const std::vector< thrift::KeyValue > & key_value_metadata () const
 Return the file-level key-value metadata pairs.
 
RowGroupInfo row_group (size_t index) const
 Return summary metadata for a specific row group.
 
FileStats file_stats () const
 Compute aggregate statistics for the entire file.
 
template<typename T >
expected< std::vector< T > > read_column (size_t row_group_index, size_t column_index)
 Read a single column from a row group as a typed vector.
 
expected< std::vector< std::string > > read_column_as_strings (size_t row_group_index, size_t column_index)
 Read a column and convert every value to its string representation.
 
expected< std::vector< std::vector< std::string > > > read_row_group (size_t row_group_index)
 Read all columns from a single row group as string vectors.
 
expected< std::vector< std::vector< std::string > > > read_all ()
 Read the entire file as a row-major vector of string vectors.
 
expected< std::vector< std::vector< std::string > > > read_columns (const std::vector< std::string > &column_names)
 Read a subset of columns (by name) across all row groups.
 
const thrift::Statistics * column_statistics (size_t row_group_index, size_t column_index) const
 Access Parquet column statistics for a specific column chunk.
 
expected< SplitBlockBloomFilter > read_bloom_filter (size_t row_group_index, size_t column_index) const
 Read the Split Block Bloom Filter for a column chunk, if present.
 
template<typename T >
bool bloom_might_contain (size_t row_group_index, size_t column_index, const T &value) const
 Check whether a value might exist in a column using its bloom filter.
 
expected< ColumnIndex > read_column_index (size_t row_group_index, size_t column_index) const
 Read the ColumnIndex (min/max per page) for a column chunk.
 
expected< OffsetIndex > read_offset_index (size_t row_group_index, size_t column_index) const
 Read the OffsetIndex (page locations) for a column chunk.
 
bool has_page_index (size_t row_group_index, size_t column_index) const
 Check whether a column chunk has both ColumnIndex and OffsetIndex data.
 
 ~ParquetReader ()=default
 Destructor. Releases the in-memory file buffer and all decode state.
 
 ParquetReader (ParquetReader &&) noexcept=default
 Move constructor.
 
ParquetReader & operator= (ParquetReader &&) noexcept=default
 Move assignment operator.
 

Static Public Member Functions

static expected< ParquetReader > open (const std::filesystem::path &path)
 Open and parse a Parquet file, returning a ready-to-query reader.
 
static void ensure_default_codecs_registered ()
 Ensure common compression codecs are registered in the global CodecRegistry.
 

Detailed Description

Parquet file reader with typed column access and full encoding support.

Opens a Parquet file, verifies PAR1 magic bytes, deserializes the Thrift footer to extract FileMetaData, builds a Schema, and provides typed access to column data via ColumnReader.

Supported encodings: PLAIN, RLE_DICTIONARY, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, and RLE (booleans). Decompression is handled transparently via the CodecRegistry (Snappy, ZSTD, LZ4, Gzip).

When built with SIGNET_ENABLE_COMMERCIAL, encrypted Parquet files (PME with PARE footer magic) are supported via a FileDecryptor.

Note
Move-only. Default-constructed readers are not valid; use the static open() factory method.
See also
ParquetWriter for the corresponding writer.
Schema, ColumnReader, FileStats

Definition at line 167 of file reader.hpp.

Constructor & Destructor Documentation

◆ ~ParquetReader()

signet::forge::ParquetReader::~ParquetReader ( )
default

Destructor. Releases the in-memory file buffer and all decode state.

◆ ParquetReader()

signet::forge::ParquetReader::ParquetReader ( ParquetReader &&  )
default noexcept

Move constructor.

Member Function Documentation

◆ bloom_might_contain()

template<typename T >
bool signet::forge::ParquetReader::bloom_might_contain ( size_t  row_group_index,
size_t  column_index,
const T &  value 
) const
inline

Check whether a value might exist in a column using its bloom filter.

If no bloom filter is present for the column chunk, returns true (conservative: the value cannot be ruled out). A return value of false guarantees the value is absent; true may be a false positive.
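
The conservative contract above can be illustrated with a minimal sketch. ToyFilter and might_contain are hypothetical stand-ins written for this example; they are not the library's SplitBlockBloomFilter or its API. The toy filter stores an exact set, so unlike a real bloom filter it never produces false positives.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_set>

// Hypothetical stand-in for a per-column-chunk filter. It keeps an exact set,
// so it has no false positives; a real bloom filter may answer true for a
// value that is actually absent.
struct ToyFilter {
    std::unordered_set<std::int64_t> values;
    bool might_contain(std::int64_t v) const { return values.count(v) != 0; }
};

// Mirrors the documented contract: a missing filter cannot rule anything out.
inline bool might_contain(const std::optional<ToyFilter>& filter, std::int64_t v) {
    if (!filter) return true;         // no filter: conservative answer
    return filter->might_contain(v);  // false means "definitely absent"
}
```

A caller would typically skip a row group only when the probe returns false, and fall back to reading the column otherwise.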

Template Parameters
T  The value type (must be hashable by xxHash64).
Parameters
row_group_index  Zero-based row group index.
column_index  Zero-based column index within the row group.
value  The value to probe.
Returns
false if the bloom filter definitively excludes the value; true otherwise (including when no filter is available).
See also
read_bloom_filter()

Definition at line 984 of file reader.hpp.

◆ column_statistics()

const thrift::Statistics * signet::forge::ParquetReader::column_statistics ( size_t  row_group_index,
size_t  column_index 
) const
inline

Access Parquet column statistics for a specific column chunk.

Returns a pointer to the thrift::Statistics stored in the column chunk's metadata (min, max, null_count, distinct_count, etc.). Returns nullptr if the row group index or column index is out of range, or if the column chunk has no statistics.
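
Because the result may be nullptr, callers need a fallback path. The sketch below shows one way to use min/max statistics for chunk skipping. ToyStats and can_skip_chunk are hypothetical; the field layout of thrift::Statistics is not shown in this reference, so the struct here is an assumed simplification with integer bounds.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical simplification of column chunk statistics (assumed layout).
struct ToyStats {
    std::optional<std::int64_t> min;
    std::optional<std::int64_t> max;
};

// Returns true only when the statistics prove that `target` lies outside
// [min, max]. Missing statistics never justify skipping.
inline bool can_skip_chunk(const ToyStats* stats, std::int64_t target) {
    if (stats == nullptr || !stats->min || !stats->max) {
        return false;  // statistics unavailable: the chunk must be read
    }
    return target < *stats->min || target > *stats->max;
}
```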

Parameters
row_group_index  Zero-based row group index.
column_index  Zero-based column index within the row group.
Returns
Pointer to statistics, or nullptr if unavailable.
Note
The returned pointer is valid for the lifetime of this reader.

Definition at line 899 of file reader.hpp.

◆ created_by()

const std::string & signet::forge::ParquetReader::created_by ( ) const
inline

Return the created_by string from the file footer metadata.

This typically identifies the library and version that wrote the file (e.g. "signet-forge 0.1.0"). Returns an empty string if the field was not set by the writer.

Definition at line 386 of file reader.hpp.

◆ ensure_default_codecs_registered()

static void signet::forge::ParquetReader::ensure_default_codecs_registered ( )
inline static

Ensure common compression codecs are registered in the global CodecRegistry.

Called automatically by open(). Registers Snappy (always), and optionally ZSTD, LZ4, and Gzip when their respective SIGNET_HAS_* macros are defined. Safe to call multiple times; only the first call performs registration.
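
One common idiom for "only the first call does the work" is a function-local static, whose initialization runs exactly once and is thread-safe since C++11. The sketch below shows that idiom only; it is not necessarily how the library implements registration, and registration_runs and ensure_registered are hypothetical names.

```cpp
#include <cassert>

// Counts how many times the (hypothetical) registration body actually ran.
inline int& registration_runs() {
    static int runs = 0;
    return runs;
}

// Idempotent setup: the lambda that initializes `done` executes exactly once,
// no matter how many times ensure_registered() is called.
inline void ensure_registered() {
    static const bool done = [] {
        ++registration_runs();  // stand-in for registering codecs
        return true;
    }();
    (void)done;  // silence unused-variable warnings
}
```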

Definition at line 511 of file reader.hpp.

◆ file_stats()

FileStats signet::forge::ParquetReader::file_stats ( ) const
inline

Compute aggregate statistics for the entire file.

Iterates over all row groups and columns to produce a FileStats summary including: file size, total rows, row group count, column count, per-column compressed/uncompressed sizes, null counts, compression ratio, bytes-per-row, and bloom filter/page index presence flags.

Returns
A FileStats struct populated from the file's metadata.
See also
FileStats

Definition at line 443 of file reader.hpp.

◆ has_page_index()

bool signet::forge::ParquetReader::has_page_index ( size_t  row_group_index,
size_t  column_index 
) const
inline

Check whether a column chunk has both ColumnIndex and OffsetIndex data.

Parameters
row_group_index  Zero-based row group index.
column_index  Zero-based column index within the row group.
Returns
true if both column_index_offset and offset_index_offset are present in the column chunk metadata; false otherwise (including for out-of-range indices).
See also
read_column_index(), read_offset_index()

Definition at line 1090 of file reader.hpp.

◆ key_value_metadata()

const std::vector< thrift::KeyValue > & signet::forge::ParquetReader::key_value_metadata ( ) const
inline

Return the file-level key-value metadata pairs.

These are arbitrary string key-value pairs stored in the Parquet footer (e.g. Arrow schema, pandas metadata). Returns an empty vector if the writer did not set any key-value metadata.

Definition at line 393 of file reader.hpp.

◆ num_row_groups()

int64_t signet::forge::ParquetReader::num_row_groups ( ) const
inline

Return the number of row groups in the file.

Definition at line 377 of file reader.hpp.

◆ num_rows()

int64_t signet::forge::ParquetReader::num_rows ( ) const
inline

Return the total number of rows across all row groups.

Definition at line 374 of file reader.hpp.

◆ open()

static expected< ParquetReader > signet::forge::ParquetReader::open ( const std::filesystem::path &  path)
inline static

Open and parse a Parquet file, returning a ready-to-query reader.

Reads the entire file into memory, validates PAR1/PARE magic bytes, deserializes the Thrift footer into FileMetaData, and constructs the column Schema. Common compression codecs (Snappy, and optionally ZSTD/LZ4/Gzip) are registered automatically on first call.
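
The magic-byte check can be sketched on an in-memory buffer. The layout assumed here comes from the Parquet format specification rather than from this reference: a 4-byte magic at the start, and at the end a 4-byte little-endian footer length followed by the same 4-byte magic. locate_footer, Magic, and FooterLocation are hypothetical names, and the library's own validation may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

enum class Magic { Plain, Encrypted };  // "PAR1" vs "PARE"

struct FooterLocation {
    Magic magic;
    std::size_t offset;    // byte offset where the footer starts
    std::uint32_t length;  // footer length in bytes
};

// Validates the leading/trailing magic and locates the footer, or returns
// std::nullopt if the buffer does not look like a Parquet file.
inline std::optional<FooterLocation> locate_footer(const std::vector<std::uint8_t>& file) {
    constexpr std::size_t kMagicLen = 4;
    constexpr std::size_t kTailLen = 8;  // 4-byte length + 4-byte magic
    if (file.size() < kMagicLen + kTailLen) return std::nullopt;

    const std::uint8_t* tail = file.data() + file.size() - kMagicLen;
    const bool plain = std::memcmp(tail, "PAR1", kMagicLen) == 0;
    const bool encrypted = std::memcmp(tail, "PARE", kMagicLen) == 0;
    if (!plain && !encrypted) return std::nullopt;
    // Assumption: the leading magic must match the trailing magic.
    if (std::memcmp(file.data(), tail, kMagicLen) != 0) return std::nullopt;

    // Decode the little-endian footer length that precedes the trailing magic.
    const std::uint8_t* p = tail - 4;
    const std::uint32_t length = static_cast<std::uint32_t>(p[0]) |
                                 (static_cast<std::uint32_t>(p[1]) << 8) |
                                 (static_cast<std::uint32_t>(p[2]) << 16) |
                                 (static_cast<std::uint32_t>(p[3]) << 24);
    if (length > file.size() - kMagicLen - kTailLen) return std::nullopt;

    return FooterLocation{plain ? Magic::Plain : Magic::Encrypted,
                          file.size() - kTailLen - length, length};
}
```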

Parameters
path  Filesystem path to the .parquet file.
Note
Commercial builds accept an optional encryption parameter for decrypting PME-encrypted files (PARE footer magic).
Returns
The constructed ParquetReader on success, or an Error with codes such as IO_ERROR, INVALID_FILE, CORRUPT_FOOTER, ENCRYPTION_ERROR, or LICENSE_ERROR.
Note
The entire file is loaded into memory. For very large files, consider using the memory-mapped reader path instead.
See also
close()

Definition at line 189 of file reader.hpp.

◆ operator=()

ParquetReader & signet::forge::ParquetReader::operator= ( ParquetReader &&  )
default noexcept

Move assignment operator.

◆ read_all()

expected< std::vector< std::vector< std::string > > > signet::forge::ParquetReader::read_all ( )
inline

Read the entire file as a row-major vector of string vectors.

Iterates over all row groups via read_row_group() and transposes the column-major data into rows, where each inner vector has one string element per column.
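
The transposition step can be sketched independently of the reader. StringTable and transpose are hypothetical names for this illustration; the sketch assumes all columns have the same length and truncates to the shortest column if they do not.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using StringTable = std::vector<std::vector<std::string>>;

// Turns column-major data (one vector per column) into row-major data
// (one vector per row). Rows beyond the shortest column are dropped.
inline StringTable transpose(const StringTable& columns) {
    if (columns.empty()) return {};
    std::size_t n_rows = columns.front().size();
    for (const auto& col : columns) n_rows = std::min(n_rows, col.size());

    StringTable rows(n_rows, std::vector<std::string>(columns.size()));
    for (std::size_t c = 0; c < columns.size(); ++c) {
        for (std::size_t r = 0; r < n_rows; ++r) {
            rows[r][c] = columns[c][r];
        }
    }
    return rows;
}
```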

Returns
A row-major vector<vector<string>> on success, or an Error if any row group or column read fails.
Note
For large files this allocates all data as strings in memory. Prefer read_column<T>() or column projection for selective access.
See also
read_row_group(), read_columns()

Definition at line 804 of file reader.hpp.

◆ read_bloom_filter()

expected< SplitBlockBloomFilter > signet::forge::ParquetReader::read_bloom_filter ( size_t  row_group_index,
size_t  column_index 
) const
inline

Read the Split Block Bloom Filter for a column chunk, if present.

Locates the bloom filter in the file using the column chunk's bloom_filter_offset, reads the 4-byte LE size header, validates alignment to kBytesPerBlock, and deserializes the filter data.
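
The header read and alignment check described above might look like the following sketch. The 32-byte block size is an assumption taken from the Parquet specification (split-block filters use 256-bit blocks); the actual value of kBytesPerBlock is not shown in this reference. locate_filter and FilterSpan are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Assumed block size (256 bits); presumed to match the library's kBytesPerBlock.
constexpr std::size_t kAssumedBytesPerBlock = 32;

struct FilterSpan {
    std::size_t data_offset;  // where the filter bytes begin
    std::uint32_t size;       // filter size in bytes
};

// Reads a 4-byte little-endian size at `header_offset`, then checks that the
// size is a non-zero multiple of the block size and fits inside the buffer.
inline std::optional<FilterSpan> locate_filter(const std::vector<std::uint8_t>& file,
                                               std::size_t header_offset) {
    if (header_offset > file.size() || file.size() - header_offset < 4) {
        return std::nullopt;  // header itself is out of range
    }
    const std::uint8_t* p = file.data() + header_offset;
    const std::uint32_t size = static_cast<std::uint32_t>(p[0]) |
                               (static_cast<std::uint32_t>(p[1]) << 8) |
                               (static_cast<std::uint32_t>(p[2]) << 16) |
                               (static_cast<std::uint32_t>(p[3]) << 24);
    if (size == 0 || size % kAssumedBytesPerBlock != 0) return std::nullopt;

    const std::size_t data_offset = header_offset + 4;
    if (size > file.size() - data_offset) return std::nullopt;  // truncated data
    return FilterSpan{data_offset, size};
}
```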

Parameters
row_group_index  Zero-based row group index.
column_index  Zero-based column index within the row group.
Returns
A SplitBlockBloomFilter on success, or an Error with INVALID_FILE (no filter), OUT_OF_RANGE, or CORRUPT_PAGE.
See also
bloom_might_contain(), SplitBlockBloomFilter

Definition at line 925 of file reader.hpp.

◆ read_column()

template<typename T >
expected< std::vector< T > > signet::forge::ParquetReader::read_column ( size_t  row_group_index,
size_t  column_index 
)
inline

Read a single column from a row group as a typed vector.

Automatically selects the appropriate decoding path based on the column's encoding metadata:

  • RLE_DICTIONARY – dictionary decode (string, int32, int64, float, double)
  • DELTA_BINARY_PACKED – delta decode (int32, int64)
  • BYTE_STREAM_SPLIT – BSS decode (float, double)
  • RLE – run-length decode (bool only)
  • PLAIN – raw value decode via ColumnReader (all types)

Decompression and (in commercial builds) decryption are applied transparently before decoding.
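
The encoding/type pairings listed above can be summarized as a small lookup. This is only a restatement of that list in code; Encoding, Physical, and is_supported are hypothetical names and not part of the library's API.

```cpp
#include <cassert>

enum class Encoding { Plain, RleDictionary, DeltaBinaryPacked, ByteStreamSplit, Rle };
enum class Physical { Bool, Int32, Int64, Float, Double, String };

// Returns true if the documented decode path for `e` covers values of type `t`.
inline bool is_supported(Encoding e, Physical t) {
    switch (e) {
        case Encoding::Plain:
            return true;  // all types
        case Encoding::RleDictionary:
            return t != Physical::Bool;  // string, int32, int64, float, double
        case Encoding::DeltaBinaryPacked:
            return t == Physical::Int32 || t == Physical::Int64;
        case Encoding::ByteStreamSplit:
            return t == Physical::Float || t == Physical::Double;
        case Encoding::Rle:
            return t == Physical::Bool;  // bool only
    }
    return false;  // unknown encoding value
}
```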

Template Parameters
T  The C++ value type. Must match the column's physical type: bool, int32_t, int64_t, float, double, or std::string.
Parameters
row_group_index  Zero-based row group index.
column_index  Zero-based column index within the schema.
Returns
A vector of decoded values on success, or an Error with codes such as OUT_OF_RANGE, CORRUPT_PAGE, UNSUPPORTED_ENCODING, or UNSUPPORTED_COMPRESSION.
See also
read_column_as_strings(), read_row_group(), read_all()

Definition at line 554 of file reader.hpp.

◆ read_column_as_strings()

expected< std::vector< std::string > > signet::forge::ParquetReader::read_column_as_strings ( size_t  row_group_index,
size_t  column_index 
)
inline

Read a column and convert every value to its string representation.

Dispatches to read_column<T>() based on the column's physical type, then converts each value using std::to_string() (numeric types), "true"/"false" (booleans), identity (BYTE_ARRAY/string), or hex-encoding (FIXED_LEN_BYTE_ARRAY).
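
The per-type conversion rules can be sketched with a single function template. to_text is a hypothetical helper for illustration; in particular, lowercase hex digits for FIXED_LEN_BYTE_ARRAY are an assumption, since this reference does not state the digit case. Note that std::to_string() formats floating-point values with six decimal places.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>
#include <vector>

// Converts one decoded value to text following the rules described above.
template <typename T>
std::string to_text(const T& v) {
    if constexpr (std::is_same_v<T, bool>) {
        return v ? "true" : "false";
    } else if constexpr (std::is_arithmetic_v<T>) {
        return std::to_string(v);  // int32_t, int64_t, float, double
    } else if constexpr (std::is_same_v<T, std::string>) {
        return v;  // identity for BYTE_ARRAY / string
    } else {
        static_assert(std::is_same_v<T, std::vector<std::uint8_t>>,
                      "unsupported value type");
        static constexpr char digits[] = "0123456789abcdef";  // assumed lowercase
        std::string out;
        out.reserve(v.size() * 2);
        for (std::uint8_t b : v) {
            out.push_back(digits[b >> 4]);
            out.push_back(digits[b & 0x0F]);
        }
        return out;
    }
}
```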

Parameters
row_group_index  Zero-based row group index.
column_index  Zero-based column index within the schema.
Returns
A vector of string-converted values on success, or an Error.
See also
read_column(), read_row_group()

Definition at line 698 of file reader.hpp.

◆ read_column_index()

expected< ColumnIndex > signet::forge::ParquetReader::read_column_index ( size_t  row_group_index,
size_t  column_index 
) const
inline

Read the ColumnIndex (min/max per page) for a column chunk.

Deserializes the Thrift-encoded ColumnIndex structure from the file at the offset recorded in the column chunk metadata. The ColumnIndex enables predicate pushdown by providing per-page min/max boundaries and null page flags.

Parameters
row_group_index  Zero-based row group index.
column_index  Zero-based column index within the row group.
Returns
A ColumnIndex on success, or an Error with OUT_OF_RANGE, INVALID_FILE (no index), or CORRUPT_PAGE.
See also
read_offset_index(), has_page_index()

Definition at line 1010 of file reader.hpp.

◆ read_columns()

expected< std::vector< std::vector< std::string > > > signet::forge::ParquetReader::read_columns ( const std::vector< std::string > &  column_names)
inline

Read a subset of columns (by name) across all row groups.

Resolves each column name to its schema index via Schema::find_column(), then reads that column from every row group, concatenating the results into a single vector per projected column. The outer vector is ordered to match column_names.
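
The resolve-then-concatenate behavior can be sketched on plain containers. ToyFile and project are hypothetical stand-ins, not library types; std::nullopt plays the role of the SCHEMA_MISMATCH error, and each row group is assumed to hold one vector per schema column.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using StringTable = std::vector<std::vector<std::string>>;

// Hypothetical in-memory file: schema names plus column-major row groups.
struct ToyFile {
    std::vector<std::string> column_names;  // schema order
    std::vector<StringTable> row_groups;    // each row group is column-major
};

// Projects `wanted` columns by name, concatenating values across row groups.
// The output order follows `wanted`, not the schema.
inline std::optional<StringTable> project(const ToyFile& file,
                                          const std::vector<std::string>& wanted) {
    StringTable out;
    out.reserve(wanted.size());
    for (const auto& name : wanted) {
        const auto it = std::find(file.column_names.begin(), file.column_names.end(), name);
        if (it == file.column_names.end()) return std::nullopt;  // unknown column
        const auto idx =
            static_cast<std::size_t>(std::distance(file.column_names.begin(), it));

        std::vector<std::string> merged;
        for (const auto& rg : file.row_groups) {
            const auto& col = rg.at(idx);  // throws if a row group lacks the column
            merged.insert(merged.end(), col.begin(), col.end());
        }
        out.push_back(std::move(merged));
    }
    return out;
}
```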

Parameters
column_names  Column names to project. Each must exist in the schema or SCHEMA_MISMATCH is returned.
Returns
A column-major vector<vector<string>> with one entry per requested column, spanning all row groups.
See also
read_column_as_strings(), read_all()

Definition at line 850 of file reader.hpp.

◆ read_offset_index()

expected< OffsetIndex > signet::forge::ParquetReader::read_offset_index ( size_t  row_group_index,
size_t  column_index 
) const
inline

Read the OffsetIndex (page locations) for a column chunk.

Deserializes the Thrift-encoded OffsetIndex structure, which maps each data page to its file offset, compressed size, and first row index. Used together with ColumnIndex for page-level predicate pushdown and selective I/O.
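
Page-level pruning combines the two indexes: per-page bounds decide which pages can match, and page locations say where to read them. The sketch below assumes integer bounds and parallel arrays; PageBounds, PageLocation, and pages_to_read are hypothetical simplifications, not the library's ColumnIndex or OffsetIndex types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page entry of a column index (integer bounds assumed).
struct PageBounds {
    std::int64_t min;
    std::int64_t max;
    bool null_page;  // true if the page holds only nulls
};

// Hypothetical per-page entry of an offset index.
struct PageLocation {
    std::int64_t offset;           // file offset of the page
    std::int32_t compressed_size;  // page size on disk
    std::int64_t first_row;        // index of the page's first row
};

// Returns the locations of pages whose [min, max] range overlaps [lo, hi].
inline std::vector<PageLocation> pages_to_read(const std::vector<PageBounds>& bounds,
                                               const std::vector<PageLocation>& locations,
                                               std::int64_t lo, std::int64_t hi) {
    std::vector<PageLocation> selected;
    const std::size_t n = std::min(bounds.size(), locations.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (bounds[i].null_page) continue;                         // nothing to match
        if (bounds[i].max < lo || bounds[i].min > hi) continue;    // no overlap
        selected.push_back(locations[i]);
    }
    return selected;
}
```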

Parameters
row_group_index  Zero-based row group index.
column_index  Zero-based column index within the row group.
Returns
An OffsetIndex on success, or an Error with OUT_OF_RANGE, INVALID_FILE (no index), or CORRUPT_PAGE.
See also
read_column_index(), has_page_index()

Definition at line 1052 of file reader.hpp.

◆ read_row_group()

expected< std::vector< std::vector< std::string > > > signet::forge::ParquetReader::read_row_group ( size_t  row_group_index)
inline

Read all columns from a single row group as string vectors.

Calls read_column_as_strings() for every column in the schema, producing one vector<string> per column. The outer vector is indexed by column ordinal.

Parameters
row_group_index  Zero-based row group index.
Returns
A column-major vector of string vectors on success, or an Error if any column read fails.
See also
read_column_as_strings(), read_all()

Definition at line 774 of file reader.hpp.

◆ row_group()

RowGroupInfo signet::forge::ParquetReader::row_group ( size_t  index) const
inline

Return summary metadata for a specific row group.

Parameters
index  Zero-based row group index. Must be less than num_row_groups().
Returns
A RowGroupInfo struct with row count, byte size, and index.
Exceptions
std::out_of_range  if index >= num_row_groups().

Definition at line 417 of file reader.hpp.

◆ schema()

const Schema & signet::forge::ParquetReader::schema ( ) const
inline

Return the file's column schema.

Returns
Const reference to the Schema parsed from the Thrift footer.

Definition at line 371 of file reader.hpp.


The documentation for this class was generated from the following file: