Signet Forge 0.1.0
C++20 Parquet library with AI-native extensions
signet::forge::ParquetWriter Class Reference

Streaming Parquet file writer with row-based and column-based APIs.

#include <writer.hpp>

Public Types

using Options = WriterOptions
 Alias for WriterOptions, usable as ParquetWriter::Options.
 

Public Member Functions

expected< void > write_row (const std::vector< std::string > &values)
 Write a single row as a vector of string values.
 
size_t num_columns () const noexcept
 Returns the number of columns in the writer's schema.
 
template<typename T >
expected< void > write_column (size_t col_index, const T *values, size_t count)
 Write a batch of typed values to a single column.
 
expected< void > write_column (size_t col_index, const std::string *values, size_t count)
 Write a batch of string values to a BYTE_ARRAY column.
 
expected< void > flush_row_group ()
 Flush the current row group to disk.
 
expected< WriteStats > close ()
 Close the file and finalize the Parquet footer.
 
 ~ParquetWriter ()
 Destructor.
 
 ParquetWriter (const ParquetWriter &)=delete
 Deleted copy constructor. ParquetWriter is move-only.
 
ParquetWriter & operator= (const ParquetWriter &)=delete
 Deleted copy-assignment operator. ParquetWriter is move-only.
 
 ParquetWriter (ParquetWriter &&other) noexcept
 Move constructor.
 
ParquetWriter & operator= (ParquetWriter &&other) noexcept
 Move-assignment operator.
 
int64_t rows_written () const
 Returns the total number of rows written so far.
 
int64_t row_groups_written () const
 Returns the number of row groups that have been flushed to disk.
 
bool is_open () const
 Returns whether the writer is open and accepting data.
 

Static Public Member Functions

static expected< ParquetWriter > open (const std::filesystem::path &path, const Schema &schema, const Options &options=Options{})
 Open a new Parquet file for writing.
 
static expected< void > csv_to_parquet (const std::filesystem::path &csv_input, const std::filesystem::path &parquet_output, const Options &options=Options{})
 Convert a CSV file to a Parquet file.
 

Detailed Description

Streaming Parquet file writer with row-based and column-based APIs.

ParquetWriter is the primary write-path class in Signet Forge. It produces spec-compliant Apache Parquet files with configurable encoding (PLAIN, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, RLE_DICTIONARY, RLE), compression (Snappy, ZSTD, LZ4, Gzip), optional bloom filters, page indexes, and Parquet Modular Encryption (commercial tier).

Lifecycle:

auto w = ParquetWriter::open(path, schema, options);
w->write_column(0, data, n); // or w->write_row({...})
w->flush_row_group();
auto stats = w->close();

The class is move-only (non-copyable). If the user forgets to call close(), the destructor performs a best-effort close.
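
The ownership rules above can be illustrated without the library. What follows is a minimal sketch, not the Signet Forge implementation: ToyWriter is a hypothetical class, its "file" is a caller-owned std::string, and the literal "PAR1" appended on close merely stands in for the real footer. It shows deleted copy operations, a move that leaves the source closed, and a destructor that closes on a best-effort basis and discards the result.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a move-only writer; NOT the Signet Forge class.
// The "file" is a caller-owned std::string that must outlive the writer.
class ToyWriter {
public:
    explicit ToyWriter(std::string* sink) : sink_(sink) {
        assert(sink_ != nullptr);
        sink_->append("PAR1");  // leading magic, written when the file is opened
    }

    // Copying is disabled: two writers must never own the same file.
    ToyWriter(const ToyWriter&) = delete;
    ToyWriter& operator=(const ToyWriter&) = delete;

    // Move construction takes the sink and leaves `other` closed.
    ToyWriter(ToyWriter&& other) noexcept
        : sink_(std::exchange(other.sink_, nullptr)) {}

    // Move assignment closes the current sink before taking over other's.
    ToyWriter& operator=(ToyWriter&& other) noexcept {
        if (this != &other) {
            close_quietly();
            sink_ = std::exchange(other.sink_, nullptr);
        }
        return *this;
    }

    // Best-effort close: any failure is swallowed, nothing is reported.
    ~ToyWriter() { close_quietly(); }

    bool is_open() const noexcept { return sink_ != nullptr; }

    // Explicit close: returns true if this call finalized the sink,
    // false if the writer was already closed or moved from.
    bool close() {
        if (sink_ == nullptr) return false;
        std::string* s = std::exchange(sink_, nullptr);
        s->append("PAR1");  // stand-in for footer + trailing magic
        return true;
    }

private:
    void close_quietly() noexcept {
        try { close(); } catch (...) {}
    }

    std::string* sink_;
};
```

The explicit close() is the only path that reports anything to the caller, which is why the destructor is best treated as a safety net rather than the normal way to finish a file.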

Note
Thread safety: ParquetWriter is not thread-safe. All calls must be serialized by the caller.
See also
WriterOptions, Schema, WriteStats, ParquetReader

Definition at line 280 of file writer.hpp.

Member Typedef Documentation

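◆ Options

using signet::forge::ParquetWriter::Options = WriterOptions

Alias for WriterOptions, usable as ParquetWriter::Options.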

Constructor & Destructor Documentation

◆ ~ParquetWriter()

signet::forge::ParquetWriter::~ParquetWriter ( )
inline

Destructor.

Performs a best-effort close() if the file is still open.

Any errors during the implicit close are silently discarded. Prefer calling close() explicitly so that errors and WriteStats can be inspected.

Definition at line 1024 of file writer.hpp.

◆ ParquetWriter() [1/2]

signet::forge::ParquetWriter::ParquetWriter ( const ParquetWriter & )
delete

Deleted copy constructor. ParquetWriter is move-only.

◆ ParquetWriter() [2/2]

signet::forge::ParquetWriter::ParquetWriter ( ParquetWriter &&  other)
inline noexcept

Move constructor.

Transfers ownership of the open file and all internal state from other. After the move, other is in a closed, empty state.

Definition at line 1041 of file writer.hpp.

Member Function Documentation

◆ close()

expected< WriteStats > signet::forge::ParquetWriter::close ( )
inline

Close the file and finalize the Parquet footer.

Flushes any remaining row data via flush_row_group(), serializes the Thrift FileMetaData (schema, row group metadata, statistics, custom key-value pairs), writes the footer length as a 4-byte LE integer, and appends the closing PAR1 magic (or PARE for encrypted footers).

After close() returns, the file on disk is a complete, spec-valid Parquet file. Calling close() on an already-closed writer is safe and returns an empty WriteStats.
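
The trailer layout described above (footer bytes, then a 4-byte little-endian length, then the magic) can be sketched as follows. This is an illustration only: append_trailer is a hypothetical helper, the footer is treated as opaque bytes, and Thrift serialization is out of scope.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Signet Forge. Appends
//   [footer bytes][4-byte little-endian footer length][4-byte magic]
// to `file`. The magic is "PARE" for an encrypted footer, "PAR1" otherwise.
inline void append_trailer(std::vector<std::uint8_t>& file,
                           const std::vector<std::uint8_t>& footer,
                           bool encrypted_footer = false) {
    assert(footer.size() <= UINT32_MAX);  // length must fit in 4 bytes
    file.insert(file.end(), footer.begin(), footer.end());

    // Emit the length least-significant byte first (little-endian).
    const auto len = static_cast<std::uint32_t>(footer.size());
    for (int shift = 0; shift < 32; shift += 8) {
        file.push_back(static_cast<std::uint8_t>((len >> shift) & 0xFFu));
    }

    const char* magic = encrypted_footer ? "PARE" : "PAR1";
    file.insert(file.end(), magic, magic + 4);
}
```

A reader locates the footer by reading the last 8 bytes of the file: the magic confirms the file type and the length says how far back the footer starts.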

Returns
WriteStats summarizing file size, row/row-group counts, per-column compression ratios, and encoding details.
Note
The writer must be closed, either explicitly or by the destructor, to produce a valid Parquet file. If neither runs (for example, because the process is terminated first), the footer is never written and the file is truncated and unreadable.
See also
flush_row_group, WriteStats

Definition at line 869 of file writer.hpp.

◆ csv_to_parquet()

static expected< void > signet::forge::ParquetWriter::csv_to_parquet ( const std::filesystem::path &  csv_input,
const std::filesystem::path &  parquet_output,
const Options &  options = Options{} 
)
inline static

Convert a CSV file to a Parquet file.

Reads the entire CSV into memory, auto-detects column types by scanning every value in each column (priority: INT64 > DOUBLE > BOOLEAN > STRING), builds a Schema, writes all rows through a ParquetWriter, and closes the output file.
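
One way to read the detection priority is "pick the first type in the list that every value in the column satisfies". The sketch below follows that reading; it is not the library's code. The names ColType and infer_column_type are hypothetical, the accepted boolean spellings ("true" / "false") are an assumption, and treating an empty column as STRING is a choice made for this example.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <system_error>
#include <vector>

enum class ColType { INT64, DOUBLE, BOOLEAN, STRING };  // hypothetical

// True if the whole string is a base-10 integer that fits in int64_t.
inline bool is_int64(const std::string& s) {
    std::int64_t v = 0;
    const char* end = s.data() + s.size();
    auto [p, ec] = std::from_chars(s.data(), end, v);
    return !s.empty() && ec == std::errc{} && p == end;
}

// True if strtod consumes the whole string. Note that strtod also accepts
// forms such as "inf", "nan" and leading whitespace.
inline bool is_double(const std::string& s) {
    if (s.empty()) return false;
    char* end = nullptr;
    std::strtod(s.c_str(), &end);
    return end == s.c_str() + s.size();
}

// Assumed spellings; the real converter may accept others.
inline bool is_bool(const std::string& s) { return s == "true" || s == "false"; }

// Scans every value and returns the highest-priority type that fits all of them.
inline ColType infer_column_type(const std::vector<std::string>& values) {
    if (values.empty()) return ColType::STRING;
    bool all_int = true, all_dbl = true, all_bool = true;
    for (const auto& v : values) {
        all_int = all_int && is_int64(v);
        all_dbl = all_dbl && is_double(v);
        all_bool = all_bool && is_bool(v);
    }
    if (all_int) return ColType::INT64;
    if (all_dbl) return ColType::DOUBLE;
    if (all_bool) return ColType::BOOLEAN;
    return ColType::STRING;
}
```

Because every value is scanned, a single non-numeric cell anywhere in a column is enough to demote that column to STRING.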

The first line of the CSV is treated as the header (column names). Quoted fields with embedded commas and escaped double-quotes ("") are supported.
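
The quoting rules can be made concrete with a small field splitter. This is a sketch under stated assumptions rather than the parser in writer.hpp: split_csv_line is a hypothetical name, it handles a single record, and it does not deal with newlines embedded in quoted fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper: splits one CSV record into fields.
// Inside a quoted field a comma is literal and "" denotes one double-quote.
inline std::vector<std::string> split_csv_line(std::string_view line) {
    std::vector<std::string> fields;
    std::string cur;
    bool in_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    cur.push_back('"');  // escaped quote
                    ++i;                 // skip the second quote of the pair
                } else {
                    in_quotes = false;   // closing quote
                }
            } else {
                cur.push_back(c);        // commas included
            }
        } else if (c == '"') {
            in_quotes = true;            // opening quote
        } else if (c == ',') {
            fields.push_back(std::move(cur));
            cur.clear();                 // reset the moved-from buffer
        } else {
            cur.push_back(c);
        }
    }
    fields.push_back(std::move(cur));    // last field (possibly empty)
    return fields;
}
```

Applied to the header line, the same routine yields the column names.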

Parameters
csv_input  Path to the input CSV file.
parquet_output  Path for the output Parquet file (created or truncated).
options  Writer options forwarded to ParquetWriter::open().
Returns
expected<void> – error on I/O failure, empty CSV, or any write/close error.
Note
The entire CSV is loaded into memory; very large files may require a streaming approach instead.
See also
ParquetWriter::open

Definition at line 1144 of file writer.hpp.

◆ flush_row_group()

expected< void > signet::forge::ParquetWriter::flush_row_group ( )
inline

Flush the current row group to disk.

Encodes any pending string rows (row-based API), verifies that all columns have the same value count, writes column chunks with the selected encoding and compression, emits bloom filters and page indexes if enabled, and records the row group metadata for the footer.

This method is called automatically by write_row() when the pending row count reaches WriterOptions::row_group_size, and by close() to drain any remaining data. It may also be called explicitly to control row group boundaries.
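
The value-count check performed at flush time can be sketched with a toy structure that tracks only how many values each column has buffered. PendingRowGroup is hypothetical and does no encoding or I/O; in particular, leaving the buffered counts untouched after a mismatch is a choice made for this example, not documented library behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical bookkeeping for one in-progress row group.
struct PendingRowGroup {
    std::vector<std::size_t> counts;   // values buffered per column
    std::size_t flushed_groups = 0;    // completed row groups

    explicit PendingRowGroup(std::size_t num_columns) : counts(num_columns, 0) {}

    // Record that `n` values were appended to column `col`.
    void add(std::size_t col, std::size_t n) {
        assert(col < counts.size());
        counts[col] += n;
    }

    // Returns an error message on mismatch, std::nullopt on success.
    std::optional<std::string> flush() {
        bool any_pending = false;
        for (std::size_t c : counts) any_pending = any_pending || c != 0;
        if (!any_pending) return std::nullopt;  // nothing buffered: no-op

        for (std::size_t c : counts) {
            if (c != counts.front()) {
                return "column value counts differ";
            }
        }
        // Encoding, compression and the actual write would happen here.
        ++flushed_groups;
        counts.assign(counts.size(), 0);
        return std::nullopt;
    }
};
```

With the column-based API this means each column must receive the same number of values between two flushes, in whatever order the columns are written.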

Returns
expected<void> – error on I/O failure, schema mismatch (column value counts differ), or compression/encryption error.
Note
Calling flush_row_group() when no data is pending is a no-op.
See also
close, write_row, write_column

Definition at line 520 of file writer.hpp.

◆ is_open()

bool signet::forge::ParquetWriter::is_open ( ) const
inline

Returns whether the writer is open and accepting data.

Returns
true if the writer is open, false after close() or move.

Definition at line 1118 of file writer.hpp.

◆ num_columns()

size_t signet::forge::ParquetWriter::num_columns ( ) const
inline noexcept

Returns the number of columns in the writer's schema.

Returns
Column count (always >= 1 for a validly-opened writer).

Definition at line 393 of file writer.hpp.

◆ open()

static expected< ParquetWriter > signet::forge::ParquetWriter::open ( const std::filesystem::path &  path,
const Schema &  schema,
const Options &  options = Options{} 
)
inline static

Open a new Parquet file for writing.

Creates (or truncates) the file at path, writes the 4-byte PAR1 magic header, and initializes internal column writers, bloom filters, and page-index builders according to options. Parent directories are created automatically if they do not exist.

Parameters
path  Filesystem path for the output Parquet file.
schema  Column schema describing names, physical types, and logical types.
options  Writer configuration (encoding, compression, bloom filters, encryption, etc.). Defaults to plain, uncompressed output.
Returns
An open ParquetWriter on success, or an Error (IO_ERROR) on failure.
See also
close, WriterOptions

Definition at line 303 of file writer.hpp.

◆ operator=() [1/2]

ParquetWriter & signet::forge::ParquetWriter::operator= ( const ParquetWriter & )
delete

Deleted copy-assignment operator. ParquetWriter is move-only.

◆ operator=() [2/2]

ParquetWriter & signet::forge::ParquetWriter::operator= ( ParquetWriter &&  other)
inline noexcept

Move-assignment operator.

Closes the current file (if open) before transferring ownership from other.

Definition at line 1066 of file writer.hpp.

◆ row_groups_written()

int64_t signet::forge::ParquetWriter::row_groups_written ( ) const
inline

Returns the number of row groups that have been flushed to disk.

Returns
Count of completed row groups (does not include any in-progress row group that has not yet been flushed).

Definition at line 1112 of file writer.hpp.

◆ rows_written()

int64_t signet::forge::ParquetWriter::rows_written ( ) const
inline

Returns the total number of rows written so far.

Includes both rows already flushed to completed row groups and rows buffered in memory awaiting the next flush_row_group() call.

Returns
Total row count (flushed + pending).

Definition at line 1105 of file writer.hpp.

◆ write_column() [1/2]

expected< void > signet::forge::ParquetWriter::write_column ( size_t  col_index,
const std::string *  values,
size_t  count 
)
inline

Write a batch of string values to a BYTE_ARRAY column.

This overload handles variable-length binary / UTF-8 data. Each string is stored with a 4-byte little-endian length prefix in the PLAIN encoding buffer, matching the Parquet BYTE_ARRAY wire format.
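
The length-prefixed layout can be shown with a few lines of code. This is a sketch of the PLAIN BYTE_ARRAY byte layout only; encode_byte_array_plain is a hypothetical free function, not the routine in writer.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encodes `count` strings as
//   [4-byte little-endian length][raw bytes] [4-byte length][raw bytes] ...
inline std::vector<std::uint8_t> encode_byte_array_plain(const std::string* values,
                                                         std::size_t count) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < count; ++i) {
        const std::string& s = values[i];
        assert(s.size() <= UINT32_MAX);  // length must fit in the 4-byte prefix

        // Length prefix, least-significant byte first.
        const auto len = static_cast<std::uint32_t>(s.size());
        for (int shift = 0; shift < 32; shift += 8) {
            out.push_back(static_cast<std::uint8_t>((len >> shift) & 0xFFu));
        }
        // Payload bytes, with no terminator or padding.
        out.insert(out.end(), s.begin(), s.end());
    }
    return out;
}
```

An empty string still occupies 4 bytes (a zero length), so it remains distinguishable from a missing value in the encoded stream.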

Parameters
col_index  Zero-based column index in the schema.
values  Pointer to a contiguous array of count strings.
count  Number of string values to write.
Returns
expected<void> – error if the writer is closed or col_index is out of range.
See also
write_column(size_t, const T*, size_t)

Definition at line 467 of file writer.hpp.

◆ write_column() [2/2]

template<typename T >
expected< void > signet::forge::ParquetWriter::write_column ( size_t  col_index,
const T *  values,
size_t  count 
)
inline

Write a batch of typed values to a single column.

The caller writes each column independently and then calls flush_row_group(). All columns within a row group must receive the same number of values; a mismatch is detected at flush time.

Supported template types map to Parquet physical types:

  • bool -> BOOLEAN
  • int32_t -> INT32
  • int64_t -> INT64
  • float -> FLOAT
  • double -> DOUBLE
  • std::string -> BYTE_ARRAY (use the string overload instead)
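
The mapping above can be expressed as a compile-time function. The sketch below is illustrative: PhysicalType and physical_type_of are hypothetical names chosen for this example and may not match the identifiers used in writer.hpp.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical enum mirroring the Parquet physical types listed above.
enum class PhysicalType { BOOLEAN, INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY };

// Maps a C++ value type to its Parquet physical type at compile time.
// Any type outside the list is rejected with a compile error.
template <typename T>
constexpr PhysicalType physical_type_of() {
    if constexpr (std::is_same_v<T, bool>) {
        return PhysicalType::BOOLEAN;
    } else if constexpr (std::is_same_v<T, std::int32_t>) {
        return PhysicalType::INT32;
    } else if constexpr (std::is_same_v<T, std::int64_t>) {
        return PhysicalType::INT64;
    } else if constexpr (std::is_same_v<T, float>) {
        return PhysicalType::FLOAT;
    } else if constexpr (std::is_same_v<T, double>) {
        return PhysicalType::DOUBLE;
    } else if constexpr (std::is_same_v<T, std::string>) {
        return PhysicalType::BYTE_ARRAY;
    } else {
        static_assert(sizeof(T) == 0, "unsupported column value type");
    }
}
```

A writer could compare the result against the physical type declared in the schema for col_index to catch a mismatched template argument early.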
Template Parameters
T  C++ type matching the column's physical type.
Parameters
col_index  Zero-based column index in the schema.
values  Pointer to a contiguous array of count values.
count  Number of values to write.
Returns
expected<void> – error if the writer is closed or col_index is out of range.
See also
write_column(size_t, const std::string*, size_t), flush_row_group

Definition at line 419 of file writer.hpp.

◆ write_row()

expected< void > signet::forge::ParquetWriter::write_row ( const std::vector< std::string > &  values)
inline

Write a single row as a vector of string values.

Each string is parsed and converted to its column's physical type when the row group is flushed (either automatically when WriterOptions::row_group_size is reached, or explicitly via flush_row_group()). The number of values must exactly match the schema's column count.
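
The deferred conversion can be pictured with a toy buffer that checks only the row width when a row is added and parses the strings later. ToyRowBuffer and take_int64_column are hypothetical, only the INT64 case is shown, and how the real writer reports a value that fails to parse is not specified here.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical row buffer: rows are stored as strings until flush time.
struct ToyRowBuffer {
    std::size_t num_columns;
    std::vector<std::vector<std::string>> pending;

    // Accepts a row only if it has exactly one value per column.
    bool write_row(const std::vector<std::string>& values) {
        if (values.size() != num_columns) return false;
        pending.push_back(values);
        return true;
    }

    // Flush-time conversion of one column to INT64.
    // Returns std::nullopt if any pending value is not a complete integer.
    std::optional<std::vector<std::int64_t>> take_int64_column(std::size_t col) const {
        assert(col < num_columns);
        std::vector<std::int64_t> out;
        out.reserve(pending.size());
        for (const auto& row : pending) {
            std::int64_t v = 0;
            const char* first = row[col].data();
            const char* last = first + row[col].size();
            auto [p, ec] = std::from_chars(first, last, v);
            if (ec != std::errc{} || p != last) return std::nullopt;
            out.push_back(v);
        }
        return out;
    }
};
```

In this model a malformed value surfaces at flush time rather than when the row is added, which matches the point at which the text says parsing takes place.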

Parameters
values  One string per column, in schema order.
Returns
expected<void> – error if the writer is closed or if values.size() does not match the schema.
See also
write_column, flush_row_group

Definition at line 368 of file writer.hpp.


The documentation for this class was generated from the following file: