Bulk analytics workflow
This page walks through the standard bulk analytics graph: discover outputs, ingest into analytics Postgres, query with SQL, and optionally publish a dashboard.
Prerequisites
heat-bulk-analytics-postgresdeployed and reachable from system-utils.- core-api restarted after deploy (registers
HEAT Bulk Analytics DBDataSource). - Upstream ingest sessions that produced tabular outputs (typically from a node named
csv-inputorjson-input). - A system-utils runner linked to
node-output-query,system-bulk-tabular-writer, andsystem-bulk-analytics-query(shipped presets handle this on bootstrap).
Pipeline overview
node-output-query → system-bulk-tabular-writer → system-bulk-analytics-query → (optional) system-arbex-jsEach stage consumes the previous node’s current JSON output (stored in HEAT Managed Object Store like other system-utils artifacts).
Stage 1: Discover outputs (node-output-query)
Purpose: Build a single JSON payload listing every matching NodeOutput.Id from core Postgres.
Parent: none (root of the bulk graph).
Typical configuration (from bulk-analytics-ingest-reference):
{
"sourceNodeName": "csv-input",
"includeStaticBackedSessions": false
}Optional filters narrow the result set before ingest:
| Key | Example use |
|---|---|
sessionTemplateName | Only outputs from sessions of one template |
projectId / projectName | One project |
sessionCreatedFrom / sessionCreatedTo | Date window on session creation |
simulationName | Match Session.SimulationName |
Output shape:
{
"sourceNodeName": "csv-input",
"criteria": { "sourceNodeName": "csv-input" },
"totalCount": 237,
"outputIds": [45330, 45331, 45332]
}There is no pagination in v1. Very large outputIds arrays (for example 170k+) produce a large JSON blob and long downstream runtime. Plan system-utils memory and analytics disk accordingly.
Detail: node-output-query
Stage 2: Bulk ingest (system-bulk-tabular-writer)
Purpose: For each outputId, read the blob from object storage and insert rows into HEAT Bulk Analytics DB.
Parent: must be node-output-query.
Typical configuration:
{
"inputFormat": "auto",
"maxRowsPerOutput": 500000,
"failOnFirstError": false
}| Key | Default | Notes |
|---|---|---|
inputFormat | auto | auto, csv, or json |
maxRowsPerOutput | 500000 | Row cap per source output |
failOnFirstError | false | When false, collect per-output failures and still succeed with partial stats |
What happens during ingest
- Opens (or creates) database
bulk_ni_{writerNodeInstanceId}. - For each output ID:
- Skips if already listed in ledger table
_ingested_outputs(idempotent re-runs). - Parses CSV or JSON and inserts rows.
- CSV files with the same header set share one table
csv_{schemaKey}(8-character hash).
- Skips if already listed in ledger table
- Updates
StatusDetailsevery few seconds, for example:Ingested output 45360 (31/237, 13%; processed 31, skipped 0, failed 0; 31k rows; elapsed 8m 42s; ETA ~58m; 3.6 outputs/min; 59.4 rows/s) - Emits
heatBulkWriterCatalogV1when complete.
Performance: v1 ingests outputs sequentially. Each CSV output inserts rows in batched multi-row INSERTs (up to 1,000 rows or the Postgres parameter limit per round trip, whichever is smaller). Hundreds of outputs × thousands of rows each can still take tens of minutes to hours when object-store fetch or parsing dominates. Reprocess after failure is safe: already-ingested outputs are skipped.
Detail: system-bulk-tabular-writer
Stage 3: SQL query (system-bulk-analytics-query)
Purpose: Run one or more read-only SELECT statements against the writer’s analytics database.
Parent: system-bulk-tabular-writer (recommended), or configure an explicit database target for standalone queries.
When the parent is a bulk writer, the node reads analyticsDatabase from the parent’s heatBulkWriterCatalogV1 automatically.
Single-query example (ingest summary from reference template):
{
"sql": "SELECT COUNT(*) AS total_rows FROM \"_ingested_outputs\"",
"outputTableName": "ingest_summary",
"maxRows": 1
}Multi-query example (from bulk-analytics-arbex-sample):
{
"queries": [
{
"sql": "SELECT COUNT(*) AS total_ingested FROM \"_ingested_outputs\"",
"outputTableName": "ingest_summary",
"maxRows": 1
},
{
"sql": "SELECT table_name, SUM(row_count) AS row_count FROM \"_ingested_outputs\" GROUP BY table_name ORDER BY table_name",
"outputTableName": "table_row_counts",
"maxRows": 100
}
]
}Standalone query (writer ran in an earlier session):
{
"bulkWriterNodeInstanceId": 42,
"tableName": "csv_a1b2c3d4",
"maxRows": 100
}Use table names and qualifiedFrom values from the writer catalog when writing SQL. See Catalog and SQL.
Detail: system-bulk-analytics-query
Single-query configs default to outputFormat: auto, which registers a CSV artifact (text/csv). Use "outputFormat": "json" when a downstream node (for example system-arbex-js) reads the tabular JSON envelope.
Stage 4: Dashboard (optional, system-arbex-js)
Purpose: Convert tabular query JSON into $heat-dataservice channels and a suggested Next layout.
Parent: system-bulk-analytics-query.
The shipped bulk-analytics-arbex-sample template includes an Arbex script that:
- Reads
tables.ingest_summaryandtables.table_row_countsfrom the query output. - Publishes summary value channels and a bar chart of row counts per analytics table.
Detail: system-arbex-js, Next data service guide
To persist a dashboard dimension after Arbex, add heat-system-next-dimension downstream (not included in the shipped sample).
Authoring your own template
Minimal three-node template:
{
"Name": "my-bulk-analytics",
"Nodes": [
{
"Name": "discover-csv",
"NodeTemplateName": "node-output-query",
"DefaultConfiguration": { "sourceNodeName": "csv-input" },
"Parents": []
},
{
"Name": "bulk-ingest",
"NodeTemplateName": "system-bulk-tabular-writer",
"DefaultConfiguration": { "inputFormat": "auto" },
"Parents": ["discover-csv"]
},
{
"Name": "analytics-sql",
"NodeTemplateName": "system-bulk-analytics-query",
"DefaultConfiguration": {
"sql": "SELECT COUNT(*) AS n FROM \"_ingested_outputs\"",
"outputTableName": "summary",
"maxRows": 1
},
"Parents": ["bulk-ingest"]
}
]
}Adjust sourceNodeName, SQL, and optional filters to match your ingest template.
Monitoring progress
| Location | What to watch |
|---|---|
Writer node StatusDetails | Output index, processed/skipped/failed counts |
Writer node LastState | Processing during ingest |
| Analytics Postgres PVC usage | Grows with row volume |
| system-utils pod memory | Large outputIds arrays and CSV parsing |