Skip to Content
This documentation is provided with the HEAT environment and is relevant for this HEAT instance only.

Bulk analytics workflow

This page walks through the standard bulk analytics graph: discover outputs, ingest into analytics Postgres, query with SQL, and optionally publish a dashboard.

Prerequisites

  • heat-bulk-analytics-postgres deployed and reachable from system-utils.
  • core-api restarted after deploy (registers HEAT Bulk Analytics DB DataSource).
  • Upstream ingest sessions that produced tabular outputs (typically from a node named csv-input or json-input).
  • A system-utils runner linked to node-output-query, system-bulk-tabular-writer, and system-bulk-analytics-query (shipped presets handle this on bootstrap).

Pipeline overview

node-output-query → system-bulk-tabular-writer → system-bulk-analytics-query → (optional) system-arbex-js

Each stage consumes the previous node’s current JSON output (stored in HEAT Managed Object Store like other system-utils artifacts).

Stage 1: Discover outputs (node-output-query)

Purpose: Build a single JSON payload listing every matching NodeOutput.Id from core Postgres.

Parent: none (root of the bulk graph).

Typical configuration (from bulk-analytics-ingest-reference):

{ "sourceNodeName": "csv-input", "includeStaticBackedSessions": false }

Optional filters narrow the result set before ingest:

KeyExample use
sessionTemplateNameOnly outputs from sessions of one template
projectId / projectNameOne project
sessionCreatedFrom / sessionCreatedToDate window on session creation
simulationNameMatch Session.SimulationName

Output shape:

{ "sourceNodeName": "csv-input", "criteria": { "sourceNodeName": "csv-input" }, "totalCount": 237, "outputIds": [45330, 45331, 45332] }

There is no pagination in v1. Very large outputIds arrays (for example 170k+) produce a large JSON blob and long downstream runtime. Plan system-utils memory and analytics disk accordingly.

Detail: node-output-query

Stage 2: Bulk ingest (system-bulk-tabular-writer)

Purpose: For each outputId, read the blob from object storage and insert rows into HEAT Bulk Analytics DB.

Parent: must be node-output-query.

Typical configuration:

{ "inputFormat": "auto", "maxRowsPerOutput": 500000, "failOnFirstError": false }
KeyDefaultNotes
inputFormatautoauto, csv, or json
maxRowsPerOutput500000Row cap per source output
failOnFirstErrorfalseWhen false, collect per-output failures and still succeed with partial stats

What happens during ingest

  1. Opens (or creates) database bulk_ni_{writerNodeInstanceId}.
  2. For each output ID:
    • Skips if already listed in ledger table _ingested_outputs (idempotent re-runs).
    • Parses CSV or JSON and inserts rows.
    • CSV files with the same header set share one table csv_{schemaKey} (8-character hash).
  3. Updates StatusDetails every few seconds, for example:
    Ingested output 45360 (31/237, 13%; processed 31, skipped 0, failed 0; 31k rows; elapsed 8m 42s; ETA ~58m; 3.6 outputs/min; 59.4 rows/s)
  4. Emits heatBulkWriterCatalogV1 when complete.

Performance: v1 ingests outputs sequentially. Each CSV output inserts rows in batched multi-row INSERTs (up to 1,000 rows or the Postgres parameter limit per round trip, whichever is smaller). Hundreds of outputs × thousands of rows each can still take tens of minutes to hours when object-store fetch or parsing dominates. Reprocess after failure is safe: already-ingested outputs are skipped.

Detail: system-bulk-tabular-writer

Stage 3: SQL query (system-bulk-analytics-query)

Purpose: Run one or more read-only SELECT statements against the writer’s analytics database.

Parent: system-bulk-tabular-writer (recommended), or configure an explicit database target for standalone queries.

When the parent is a bulk writer, the node reads analyticsDatabase from the parent’s heatBulkWriterCatalogV1 automatically.

Single-query example (ingest summary from reference template):

{ "sql": "SELECT COUNT(*) AS total_rows FROM \"_ingested_outputs\"", "outputTableName": "ingest_summary", "maxRows": 1 }

Multi-query example (from bulk-analytics-arbex-sample):

{ "queries": [ { "sql": "SELECT COUNT(*) AS total_ingested FROM \"_ingested_outputs\"", "outputTableName": "ingest_summary", "maxRows": 1 }, { "sql": "SELECT table_name, SUM(row_count) AS row_count FROM \"_ingested_outputs\" GROUP BY table_name ORDER BY table_name", "outputTableName": "table_row_counts", "maxRows": 100 } ] }

Standalone query (writer ran in an earlier session):

{ "bulkWriterNodeInstanceId": 42, "tableName": "csv_a1b2c3d4", "maxRows": 100 }

Use table names and qualifiedFrom values from the writer catalog when writing SQL. See Catalog and SQL.

Detail: system-bulk-analytics-query

Single-query configs default to outputFormat: auto, which registers a CSV artifact (text/csv). Use "outputFormat": "json" when a downstream node (for example system-arbex-js) reads the tabular JSON envelope.

Stage 4: Dashboard (optional, system-arbex-js)

Purpose: Convert tabular query JSON into $heat-dataservice channels and a suggested Next layout.

Parent: system-bulk-analytics-query.

The shipped bulk-analytics-arbex-sample template includes an Arbex script that:

  • Reads tables.ingest_summary and tables.table_row_counts from the query output.
  • Publishes summary value channels and a bar chart of row counts per analytics table.

Detail: system-arbex-js, Next data service guide

To persist a dashboard dimension after Arbex, add heat-system-next-dimension downstream (not included in the shipped sample).

Authoring your own template

Minimal three-node template:

{ "Name": "my-bulk-analytics", "Nodes": [ { "Name": "discover-csv", "NodeTemplateName": "node-output-query", "DefaultConfiguration": { "sourceNodeName": "csv-input" }, "Parents": [] }, { "Name": "bulk-ingest", "NodeTemplateName": "system-bulk-tabular-writer", "DefaultConfiguration": { "inputFormat": "auto" }, "Parents": ["discover-csv"] }, { "Name": "analytics-sql", "NodeTemplateName": "system-bulk-analytics-query", "DefaultConfiguration": { "sql": "SELECT COUNT(*) AS n FROM \"_ingested_outputs\"", "outputTableName": "summary", "maxRows": 1 }, "Parents": ["bulk-ingest"] } ] }

Adjust sourceNodeName, SQL, and optional filters to match your ingest template.

Monitoring progress

LocationWhat to watch
Writer node StatusDetailsOutput index, processed/skipped/failed counts
Writer node LastStateProcessing during ingest
Analytics Postgres PVC usageGrows with row volume
system-utils pod memoryLarge outputIds arrays and CSV parsing