Skip to Content
This documentation is provided with the HEAT environment and is relevant for this HEAT instance only.
RunnersSystem Utilssystem-bulk-tabular-writer

system-bulk-tabular-writer

Platform reserved (system-utils only). Parent must be node-output-query. For each output ID in the parent payload, reads blob data from HEAT Managed Object Store, ingests rows into an optional HEAT Bulk Analytics DB PostgreSQL instance, and registers heatBulkWriterCatalogV1 JSON describing tables, columns, sample values, and join keys for downstream analytics.

When to use it

Enable when the cluster runs heat-bulk-analytics-postgres (included in standard deploy/k8s database manifests). Core API setup registers the DataSource automatically.

node-output-query → system-bulk-tabular-writer → system-bulk-analytics-query

Shipped reference session template: bulk-analytics-ingest-reference.

Configuration

KeyDefaultPurpose
inputFormatautoauto, csv, or json.
maxRowsPerOutput500000Row guardrail per source output.
failOnFirstErrorfalseWhen false, collect per-output failures and still succeed with partial stats.

There is no maxOutputsPerRun cap in v1.

While ingesting, the node sets LastState to Processing and updates StatusDetails at least every few seconds. Example:

Ingested output 45360 (31/237, 13%; processed 31, skipped 0, failed 0; 31.0k rows; elapsed 8m 42s; ETA ~58m; 3.6 outputs/min; 59.4 rows/s)

Progress includes percent complete, cumulative rows written this run, elapsed time, estimated time remaining (after a short warm-up), output throughput, and row throughput (rows/s).

Analytics database layout

  • One PostgreSQL database per bulk-writer node instance: bulk_ni_{nodeInstanceId}.
  • CSV files with the same header set share a table named csv_{schemaKey} (8-character stable hash). Display hints KB_BIO and KB_VEHICLE are inferred from column prefixes when present.
  • CSV data columns use conservative type inference on first table creation. Non-empty sample values determine Postgres types (BIGINT, DOUBLE PRECISION, BOOLEAN, TIMESTAMPTZ, UUID, or TEXT). Known null sentinels (N/A, empty, null, none, -) are ignored during inference and stored as SQL NULL on insert. Mixed or ambiguous values fall back to TEXT.
  • JSON tabular payloads use json_{schemaKey} from top-level table keys.
  • System ledger table _ingested_outputs tracks ingested output IDs for idempotent re-runs.

heatBulkWriterCatalogV1 output

The writer registers JSON with (among other fields):

FieldPurpose
catalogVersionContract version (heatBulkWriterCatalogV1).
bulkWriterNodeInstanceIdNode instance that owns bulk_ni_{id}.
dataSourceNameHEAT Bulk Analytics DB.
analyticsDatabaseDatabase name for SQL clients.
joinKeysShared columns (heat_session_id, heat_output_id, and similar).
tablesPer-table columns[] with postgresType, portable valueKind (integer, number, boolean, timestamp, uuid, string), nullable, and sampleValue, plus qualifiedFrom for SQL.
tables._ingested_outputs.systemTabletrue for the ledger table.

A system-bulk-analytics-query child node reads this catalog and runs read-only SQL against analyticsDatabase. See system-bulk-analytics-query.

Enable path (operators)

  1. Deploy heat-bulk-analytics-postgres (32Gi PVC, cluster-internal service).
  2. Core API setup automatically registers DataSource metadata HEAT Bulk Analytics DB on startup (same pattern as HEAT Managed Object Store). Requires heat-bulk-analytics-postgres pod to be running before ingest.
  3. Bump system-utils memory if ingesting very large batches (deployment already allows up to 4Gi).

Session delete: RetentionService best-effort DROP DATABASE bulk_ni_{nodeInstanceId} for each bulk-writer node on the session. Failure to reach analytics Postgres does not block session delete.

Limitations

  • Requires analytics Postgres when ingest is expected; the node fails with ProcessingFailed if the DataSource is missing or unreachable.
  • Column types are fixed when a table is first created (CREATE TABLE IF NOT EXISTS). Re-ingest into an existing all-TEXT table keeps TEXT; drop bulk_ni_{id} or recreate the writer database to pick up new inference rules.
  • First production run over ~170k outputs may take hours; monitor analytics PVC disk and system-utils memory. Row inserts use batched multi-row SQL (not one statement per row).
  • Analytics Postgres holds derivative telemetry (including user GUID columns from CSV); treat access and retention per deployment policy. Do not expose via public ingress.