system-bulk-tabular-writer
Platform reserved (system-utils only). Parent must be node-output-query. For each output ID in the parent payload, reads blob data from HEAT Managed Object Store, ingests rows into an optional HEAT Bulk Analytics DB PostgreSQL instance, and registers heatBulkWriterCatalogV1 JSON describing tables, columns, sample values, and join keys for downstream analytics.
When to use it
Enable when the cluster runs heat-bulk-analytics-postgres (included in standard deploy/k8s database manifests). Core API setup registers the DataSource automatically.
node-output-query → system-bulk-tabular-writer → system-bulk-analytics-query
Shipped reference session template: bulk-analytics-ingest-reference.
Configuration
| Key | Default | Purpose |
|---|---|---|
| inputFormat | auto | auto, csv, or json. |
| maxRowsPerOutput | 500000 | Row guardrail per source output. |
| failOnFirstError | false | When false, collect per-output failures and still succeed with partial stats. |
There is no maxOutputsPerRun cap in v1.
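The three keys and their defaults could be checked before a run along these lines. This is a minimal sketch, not the platform's own validation: `BulkWriterConfig` and `parse_config` are hypothetical names, and rejecting unknown keys and non-positive row caps are assumptions made for illustration.

```python
from dataclasses import dataclass

ALLOWED_INPUT_FORMATS = ("auto", "csv", "json")


@dataclass(frozen=True)
class BulkWriterConfig:
    """Hypothetical holder for the documented keys; field names mirror the table."""
    inputFormat: str = "auto"
    maxRowsPerOutput: int = 500_000
    failOnFirstError: bool = False

    def __post_init__(self):
        if self.inputFormat not in ALLOWED_INPUT_FORMATS:
            raise ValueError(f"inputFormat must be one of {ALLOWED_INPUT_FORMATS}")
        # Assumption: a zero or negative row guardrail is treated as invalid.
        if self.maxRowsPerOutput <= 0:
            raise ValueError("maxRowsPerOutput must be positive")


def parse_config(raw: dict) -> BulkWriterConfig:
    """Reject unknown keys (an assumption), then apply the documented defaults."""
    known = {"inputFormat", "maxRowsPerOutput", "failOnFirstError"}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return BulkWriterConfig(**raw)
```

Calling `parse_config({})` yields the defaults from the table; `parse_config({"inputFormat": "csv"})` overrides only the format.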
While ingesting, the node sets LastState to Processing and updates StatusDetails at least every few seconds. Example:
Ingested output 45360 (31/237, 13%; processed 31, skipped 0, failed 0; 31.0k rows; elapsed 8m 42s; ETA ~58m; 3.6 outputs/min; 59.4 rows/s)
Progress includes percent complete, cumulative rows written this run, elapsed time, estimated time remaining (after a short warm-up), output throughput, and row throughput (rows/s).
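The figures in the example status line are consistent with simple ratios of outputs, rows, and elapsed time. The sketch below reproduces them under two assumptions: "31.0k rows" means exactly 31,000 rows, and the ETA is the remaining output count divided by the observed output rate. `progress_stats` is a hypothetical helper, not the node's actual code.

```python
def progress_stats(done: int, total: int, rows: int, elapsed_s: float) -> dict:
    """Derive percent, throughput, and ETA from cumulative counters."""
    if total <= 0 or elapsed_s <= 0:
        raise ValueError("total and elapsed_s must be positive")
    outputs_per_min = done / elapsed_s * 60
    rows_per_s = rows / elapsed_s
    # No ETA until at least one output has completed (the warm-up period).
    eta_min = (total - done) / outputs_per_min if outputs_per_min > 0 else None
    return {
        "percent": round(100 * done / total),
        "outputs_per_min": round(outputs_per_min, 1),
        "rows_per_s": round(rows_per_s, 1),
        "eta_min": round(eta_min) if eta_min is not None else None,
    }


# 31 of 237 outputs, 31,000 rows, 8m 42s (522 s) elapsed
stats = progress_stats(done=31, total=237, rows=31_000, elapsed_s=522)
print(stats)
```

With these inputs the helper gives 13%, 3.6 outputs/min, 59.4 rows/s, and an ETA of about 58 minutes, matching the example line.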
Analytics database layout
- One PostgreSQL database per bulk-writer node instance: bulk_ni_{nodeInstanceId}.
- CSV files with the same header set share a table named csv_{schemaKey} (8-character stable hash). Display hints KB_BIO and KB_VEHICLE are inferred from column prefixes when present.
- CSV data columns use conservative type inference on first table creation. Non-empty sample values determine Postgres types (BIGINT, DOUBLE PRECISION, BOOLEAN, TIMESTAMPTZ, UUID, or TEXT). Known null sentinels (N/A, empty, null, none, -) are ignored during inference and stored as SQL NULL on insert. Mixed or ambiguous values fall back to TEXT.
- JSON tabular payloads use json_{schemaKey} from top-level table keys.
- System ledger table _ingested_outputs tracks ingested output IDs for idempotent re-runs.
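The conservative type inference described above can be illustrated as follows. This is a sketch under stated assumptions, not the writer's actual rules: sentinel matching is case-insensitive, a timestamp must carry a UTC offset to count as TIMESTAMPTZ, and any mix of kinds (including integers with decimals) falls back to TEXT. All function names are hypothetical.

```python
import re
from datetime import datetime

# Documented null sentinels; lower-casing before comparison is an assumption.
NULL_SENTINELS = {"", "n/a", "null", "none", "-"}

_INT_RE = re.compile(r"[+-]?\d+")
_FLOAT_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")
_UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_null(value) -> bool:
    return value is None or value.strip().lower() in NULL_SENTINELS


def classify(value: str) -> str:
    """Map one non-null sample to a candidate Postgres type."""
    v = value.strip()
    if _INT_RE.fullmatch(v):
        # Integers outside the 64-bit range cannot be stored as BIGINT.
        return "BIGINT" if -(2**63) <= int(v) < 2**63 else "TEXT"
    if _FLOAT_RE.fullmatch(v):
        return "DOUBLE PRECISION"
    if v.lower() in ("true", "false"):
        return "BOOLEAN"
    if _UUID_RE.fullmatch(v):
        return "UUID"
    try:
        parsed = datetime.fromisoformat(v)
    except ValueError:
        return "TEXT"
    # Assumption: only offset-aware timestamps qualify as TIMESTAMPTZ.
    return "TIMESTAMPTZ" if parsed.tzinfo is not None else "TEXT"


def infer_column_type(samples) -> str:
    """Return one type only when every non-null sample agrees; else TEXT."""
    kinds = {classify(s) for s in samples if not is_null(s)}
    return kinds.pop() if len(kinds) == 1 else "TEXT"


def to_sql_value(value):
    """Null sentinels become None, which a DB driver sends as SQL NULL."""
    return None if is_null(value) else value
```

For example, `infer_column_type(["1", "2", "N/A"])` gives BIGINT because the sentinel is skipped, while `infer_column_type(["abc", "1"])` gives TEXT.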
heatBulkWriterCatalogV1 output
The writer registers JSON with (among other fields):
| Field | Purpose |
|---|---|
| catalogVersion | Contract version (heatBulkWriterCatalogV1). |
| bulkWriterNodeInstanceId | Node instance that owns bulk_ni_{id}. |
| dataSourceName | HEAT Bulk Analytics DB. |
| analyticsDatabase | Database name for SQL clients. |
| joinKeys | Shared columns (heat_session_id, heat_output_id, and similar). |
| tables | Per-table columns[] with postgresType, portable valueKind (integer, number, boolean, timestamp, uuid, string), nullable, and sampleValue, plus qualifiedFrom for SQL. |
| tables._ingested_outputs.systemTable | true for the ledger table. |
A system-bulk-analytics-query child node reads this catalog and runs read-only SQL against analyticsDatabase. See system-bulk-analytics-query.
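To make the catalog fields concrete, here is a sketch of how a consumer might list user tables and build a read-only query from qualifiedFrom. The exact JSON shape is an assumption: the sample below guesses that tables is an object keyed by table name, that each column carries a name key, and that joinKeys is a list of strings. Every value in `SAMPLE_CATALOG` is invented for illustration, and both helper functions are hypothetical.

```python
# Illustrative catalog only; shape and values are assumptions, not real output.
SAMPLE_CATALOG = {
    "catalogVersion": "heatBulkWriterCatalogV1",
    "bulkWriterNodeInstanceId": 123,
    "dataSourceName": "HEAT Bulk Analytics DB",
    "analyticsDatabase": "bulk_ni_123",
    "joinKeys": ["heat_session_id", "heat_output_id"],
    "tables": {
        "csv_1a2b3c4d": {
            "qualifiedFrom": 'public."csv_1a2b3c4d"',
            "columns": [
                {"name": "heat_output_id", "postgresType": "BIGINT",
                 "valueKind": "integer", "nullable": False, "sampleValue": "45360"},
                {"name": "speed", "postgresType": "DOUBLE PRECISION",
                 "valueKind": "number", "nullable": True, "sampleValue": "12.5"},
            ],
        },
        "_ingested_outputs": {
            "systemTable": True,
            "qualifiedFrom": 'public."_ingested_outputs"',
            "columns": [],
        },
    },
}


def user_tables(catalog: dict) -> dict:
    """Return tables that are not flagged as system tables."""
    return {
        name: table
        for name, table in catalog["tables"].items()
        if not table.get("systemTable", False)
    }


def select_sql(catalog: dict, table_name: str, limit: int = 10) -> str:
    """Build a read-only SELECT over every catalogued column of one table."""
    table = catalog["tables"][table_name]
    # Double any embedded quotes so column names are safe as identifiers.
    cols = ", ".join('"' + c["name"].replace('"', '""') + '"' for c in table["columns"])
    return f'SELECT {cols} FROM {table["qualifiedFrom"]} LIMIT {int(limit)}'


print(list(user_tables(SAMPLE_CATALOG)))
print(select_sql(SAMPLE_CATALOG, "csv_1a2b3c4d"))
```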
Enable path (operators)
- Deploy heat-bulk-analytics-postgres (32Gi PVC, cluster-internal service).
- Core API setup automatically registers DataSource metadata HEAT Bulk Analytics DB on startup (same pattern as HEAT Managed Object Store). Requires heat-bulk-analytics-postgres pod to be running before ingest.
- Bump system-utils memory if ingesting very large batches (deployment already allows up to 4Gi).
Session delete: RetentionService issues a best-effort DROP DATABASE bulk_ni_{nodeInstanceId} for each bulk-writer node on the session. Failure to reach analytics Postgres does not block session delete.
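The best-effort behaviour can be pictured as a loop that logs and continues when a drop fails. This is only a sketch of the pattern, not RetentionService itself: `drop_analytics_databases` and the injected `execute_sql` callable are hypothetical, and the IF EXISTS clause is an assumption added so that repeated calls stay harmless.

```python
import logging

log = logging.getLogger("retention-sketch")


def drop_analytics_databases(node_instance_ids, execute_sql) -> list:
    """Try to drop each writer database; collect failures instead of raising."""
    failed = []
    for node_id in node_instance_ids:
        db_name = f"bulk_ni_{node_id}"
        try:
            # execute_sql is any callable that runs one statement (hypothetical).
            execute_sql(f'DROP DATABASE IF EXISTS "{db_name}"')
        except Exception as exc:
            # Best-effort: record the failure and move on to the next database.
            log.warning("could not drop %s: %s", db_name, exc)
            failed.append(db_name)
    return failed
```

The caller can proceed with the session delete regardless of what the returned list contains.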
Limitations
- Requires analytics Postgres when ingest is expected; the node fails with ProcessingFailed if the DataSource is missing or unreachable.
- Column types are fixed when a table is first created (CREATE TABLE IF NOT EXISTS). Re-ingest into an existing all-TEXT table keeps TEXT; drop bulk_ni_{id} or recreate the writer database to pick up new inference rules.
- First production run over ~170k outputs may take hours; monitor analytics PVC disk and system-utils memory. Row inserts use batched multi-row SQL (not one statement per row).
- Analytics Postgres holds derivative telemetry (including user GUID columns from CSV); treat access and retention per deployment policy. Do not expose via public ingress.
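For readers unfamiliar with batched multi-row inserts, the sketch below shows the general technique: rows are grouped into chunks and each chunk becomes a single INSERT with one VALUES tuple per row. It is not the writer's actual code. `chunked` and `build_insert` are hypothetical helpers, the batch size is arbitrary, and the `%s` placeholders assume a driver that uses the DB-API "format" parameter style (psycopg, for instance).

```python
from itertools import islice


def chunked(rows, size: int):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_insert(table: str, columns: list, batch: list):
    """Return one multi-row INSERT statement and its flattened parameters."""
    col_sql = ", ".join(f'"{c}"' for c in columns)
    row_sql = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values_sql = ", ".join([row_sql] * len(batch))
    params = [value for row in batch for value in row]
    return f'INSERT INTO "{table}" ({col_sql}) VALUES {values_sql}', params


rows = [(1, "a"), (2, "b"), (3, "c")]
for batch in chunked(rows, 2):
    sql, params = build_insert("csv_1a2b3c4d", ["id", "name"], batch)
    print(sql, params)
```

Three rows with a batch size of two produce two statements rather than three, which is where the savings over per-row inserts come from.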
Related
- Bulk analytics guide
- node-output-query
- Data sources (HEAT Bulk Analytics DB)
- System Utils