Bulk analytics workflow

This page walks through the standard bulk analytics graph: discover outputs, ingest into analytics Postgres, query with SQL, and optionally publish a dashboard.

Prerequisites

heat-bulk-analytics-postgres deployed and reachable from system-utils.
core-api restarted after deploy (registers HEAT Bulk Analytics DB DataSource).
Upstream ingest sessions that produced tabular outputs (typically from a node named csv-input or json-input).
A system-utils runner linked to node-output-query, system-bulk-tabular-writer, and system-bulk-analytics-query (shipped presets handle this on bootstrap).

Pipeline overview


node-output-query → system-bulk-tabular-writer → system-bulk-analytics-query → (optional) system-arbex-js

Each stage consumes the previous node’s current JSON output (stored in HEAT Managed Object Store like other system-utils artifacts).

Stage 1: Discover outputs (`node-output-query`)

Purpose: Build a single JSON payload listing every matching NodeOutput.Id from core Postgres.

Parent: none (root of the bulk graph).

Typical configuration (from bulk-analytics-ingest-reference):


{
  "sourceNodeName": "csv-input",
  "includeStaticBackedSessions": false
}

Optional filters narrow the result set before ingest:

Key	Example use
`sessionTemplateName`	Only outputs from sessions of one template
`projectId` / `projectName`	One project
`sessionCreatedFrom` / `sessionCreatedTo`	Date window on session creation
`simulationName`	Match `Session.SimulationName`

Output shape:


{
  "sourceNodeName": "csv-input",
  "criteria": { "sourceNodeName": "csv-input" },
  "totalCount": 237,
  "outputIds": [45330, 45331, 45332]
}

There is no pagination in v1. Very large outputIds arrays (for example 170k+) produce a large JSON blob and long downstream runtime. Plan system-utils memory and analytics disk accordingly.

Detail: node-output-query

Stage 2: Bulk ingest (`system-bulk-tabular-writer`)

Purpose: For each outputId, read the blob from object storage and insert rows into HEAT Bulk Analytics DB.

Parent: must be node-output-query.

Typical configuration:


{
  "inputFormat": "auto",
  "maxRowsPerOutput": 500000,
  "failOnFirstError": false
}

Key	Default	Notes
`inputFormat`	`auto`	`auto`, `csv`, or `json`
`maxRowsPerOutput`	`500000`	Row cap per source output
`failOnFirstError`	`false`	When `false`, collect per-output failures and still succeed with partial stats

What happens during ingest

Opens (or creates) database bulk_ni_{writerNodeInstanceId}.
For each output ID:
- Skips if already listed in ledger table _ingested_outputs (idempotent re-runs).
- Parses CSV or JSON and inserts rows.
- CSV files with the same header set share one table csv_{schemaKey} (8-character hash).

Updates StatusDetails every few seconds, for example:


Ingested output 45360 (31/237, 13%; processed 31, skipped 0, failed 0; 31k rows; elapsed 8m 42s; ETA ~58m; 3.6 outputs/min; 59.4 rows/s)

Emits heatBulkWriterCatalogV1 when complete.

Performance: v1 ingests outputs sequentially. Each CSV output inserts rows in batched multi-row INSERTs (up to 1,000 rows or the Postgres parameter limit per round trip, whichever is smaller). Hundreds of outputs × thousands of rows each can still take tens of minutes to hours when object-store fetch or parsing dominates. Reprocess after failure is safe: already-ingested outputs are skipped.

Detail: system-bulk-tabular-writer

Stage 3: SQL query (`system-bulk-analytics-query`)

Purpose: Run one or more read-only SELECT statements against the writer’s analytics database.

Parent: system-bulk-tabular-writer (recommended), or configure an explicit database target for standalone queries.

When the parent is a bulk writer, the node reads analyticsDatabase from the parent’s heatBulkWriterCatalogV1 automatically.

Single-query example (ingest summary from reference template):


{
  "sql": "SELECT COUNT(*) AS total_rows FROM \"_ingested_outputs\"",
  "outputTableName": "ingest_summary",
  "maxRows": 1
}

Multi-query example (from bulk-analytics-arbex-sample):


{
  "queries": [
    {
      "sql": "SELECT COUNT(*) AS total_ingested FROM \"_ingested_outputs\"",
      "outputTableName": "ingest_summary",
      "maxRows": 1
    },
    {
      "sql": "SELECT table_name, SUM(row_count) AS row_count FROM \"_ingested_outputs\" GROUP BY table_name ORDER BY table_name",
      "outputTableName": "table_row_counts",
      "maxRows": 100
    }
  ]
}

Standalone query (writer ran in an earlier session):


{
  "bulkWriterNodeInstanceId": 42,
  "tableName": "csv_a1b2c3d4",
  "maxRows": 100
}

Use table names and qualifiedFrom values from the writer catalog when writing SQL. See Catalog and SQL.

Detail: system-bulk-analytics-query

Single-query configs default to outputFormat: auto, which registers a CSV artifact (text/csv). Use "outputFormat": "json" when a downstream node (for example system-arbex-js) reads the tabular JSON envelope.

Stage 4: Dashboard (optional, `system-arbex-js`)

Purpose: Convert tabular query JSON into $heat-dataservice channels and a suggested Next layout.

Parent: system-bulk-analytics-query.

The shipped bulk-analytics-arbex-sample template includes an Arbex script that:

Reads tables.ingest_summary and tables.table_row_counts from the query output.
Publishes summary value channels and a bar chart of row counts per analytics table.

Detail: system-arbex-js, Next data service guide

To persist a dashboard dimension after Arbex, add heat-system-next-dimension downstream (not included in the shipped sample).

Authoring your own template

Minimal three-node template:


{
  "Name": "my-bulk-analytics",
  "Nodes": [
    {
      "Name": "discover-csv",
      "NodeTemplateName": "node-output-query",
      "DefaultConfiguration": { "sourceNodeName": "csv-input" },
      "Parents": []
    },
    {
      "Name": "bulk-ingest",
      "NodeTemplateName": "system-bulk-tabular-writer",
      "DefaultConfiguration": { "inputFormat": "auto" },
      "Parents": ["discover-csv"]
    },
    {
      "Name": "analytics-sql",
      "NodeTemplateName": "system-bulk-analytics-query",
      "DefaultConfiguration": {
        "sql": "SELECT COUNT(*) AS n FROM \"_ingested_outputs\"",
        "outputTableName": "summary",
        "maxRows": 1
      },
      "Parents": ["bulk-ingest"]
    }
  ]
}

Adjust sourceNodeName, SQL, and optional filters to match your ingest template.

Monitoring progress

Location	What to watch
Writer node `StatusDetails`	Output index, processed/skipped/failed counts
Writer node `LastState`	`Processing` during ingest
Analytics Postgres PVC usage	Grows with row volume
system-utils pod memory	Large `outputIds` arrays and CSV parsing

Bulk analytics workflow

Prerequisites

Pipeline overview

Stage 1: Discover outputs (node-output-query)

Stage 2: Bulk ingest (system-bulk-tabular-writer)