Bulk analytics catalog and SQL

After system-bulk-tabular-writer completes, it registers heatBulkWriterCatalogV1 JSON. Downstream system-bulk-analytics-query and external tools use this catalog to discover database names, table names, column types, and safe SQL identifiers.

Catalog top-level fields

Field	Purpose
`catalogVersion`	Always `heatBulkWriterCatalogV1`
`bulkWriterNodeInstanceId`	Writer node instance id
`dataSourceName`	`HEAT Bulk Analytics DB`
`analyticsDatabase`	PostgreSQL database name (`bulk_ni_{id}`)
`analyticsHost`	Cluster-internal host (default `heat-bulk-analytics-postgres`)
`joinKeys`	Shared columns for joining tables
`tables`	Map of table name → table metadata
`outputsProcessedThisRun`	Outputs ingested this run
`outputsSkippedAlreadyIngested`	Skipped (ledger hit)
`outputsFailed`	Read or ingest failures
`failedOutputIds`	Output ids that failed

Table layout

Each bulk writer owns one PostgreSQL database: bulk_ni_{nodeInstanceId}.

Table pattern	Contents
`csv_{schemaKey}`	CSV rows sharing the same header set (`schemaKey` is an 8-character hash of sorted column names)
`json_{schemaKey}`	JSON tabular payloads (one JSONB `payload` column plus metadata)
`_ingested_outputs`	System ledger: which output ids were ingested, row counts, schema keys

Every data table includes HEAT metadata columns:

Column	Postgres type	Purpose
`heat_session_id`	`uuid`	Session that owned the source output
`heat_node_output_id`	`integer`	Source `NodeOutput.Id`
`heat_source_session_id`	`uuid`	Same as `heat_session_id` for CSV ingest (join key)

Display hints such as KB_BIO and KB_VEHICLE are inferred from CSV column prefixes when present (catalog displayHint on tables).

Column typing

CSV data columns use conservative type inference when a table is first created. The writer samples non-empty cell values from the first ingested file for that schema:

`valueKind`	Postgres type	Inference rule (simplified)
`integer`	`bigint`	All sampled values parse as integers
`number`	`double precision`	All sampled values parse as floats
`boolean`	`boolean`	All sampled values are `true`/`false`/`yes`/`no`
`timestamp`	`timestamptz`	All sampled values are ISO-8601 timestamps
`uuid`	`uuid`	All sampled values are GUIDs
`string`	`text`	Mixed types, labels, or ambiguity

Null sentinels (N/A, empty, null, none, -) are ignored during inference and stored as SQL NULL on insert.

Important: types are fixed at CREATE TABLE IF NOT EXISTS. Re-ingesting into an existing all-TEXT table does not alter column types. Drop bulk_ni_{id} and re-run the writer to pick up new inference rules.

Column catalog entry

Each column in tables[].columns[] includes:


{
  "name": "KB_VEHICLE_SPEED",
  "postgresType": "double precision",
  "valueKind": "number",
  "nullable": true,
  "sampleValue": 12.5
}

Field	Use for downstream tools
`valueKind`	Portable type for UI binding, chart axes, and API consumers
`postgresType`	Exact Postgres type for SQL generation
`nullable`	Whether SQL `NULL` is expected
`sampleValue`	Quick preview (strings truncated)

Query result JSON from system-bulk-analytics-query returns typed cell values in rows (numbers and booleans as JSON numbers/booleans, not strings).

Join keys

Default catalog joinKeys:


["heat_session_id", "heat_node_output_id", "heat_source_session_id"]

Use these to correlate rows across csv_* tables or to tie analytics rows back to core session metadata.

Example join:


SELECT
  a.heat_session_id,
  a."KB_VEHICLE_SPEED" AS speed,
  b."KB_BIO_HEART_RATE" AS heart_rate
FROM "csv_vehicle_hash" AS a
JOIN "csv_bio_hash" AS b
  ON a.heat_session_id = b.heat_session_id
WHERE a."KB_VEHICLE_SPEED" > 50
LIMIT 1000

Replace table names with values from your catalog tables keys and use each table’s qualifiedFrom for quoting.

SQL patterns (`system-bulk-analytics-query`)

Ingest ledger


SELECT COUNT(*) AS total_ingested FROM "_ingested_outputs"


SELECT table_name, SUM(row_count) AS row_count
FROM "_ingested_outputs"
GROUP BY table_name
ORDER BY table_name

Range filter on typed numeric column

When valueKind is number or integer:


SELECT heat_session_id, "KB_VEHICLE_SPEED"
FROM "csv_a1b2c3d4"
WHERE "KB_VEHICLE_SPEED" > 50
ORDER BY "KB_VEHICLE_SPEED" DESC
LIMIT 500

No CAST is required for correctly inferred types.

Time window

When valueKind is timestamp:


SELECT heat_session_id, "KB_VEHICLE_CAPTURED_DATE_TIME (IS0 8601)"
FROM "csv_a1b2c3d4"
WHERE "KB_VEHICLE_CAPTURED_DATE_TIME (IS0 8601)" >= TIMESTAMPTZ '2026-05-27T00:00:00Z'
  AND "KB_VEHICLE_CAPTURED_DATE_TIME (IS0 8601)" < TIMESTAMPTZ '2026-05-28T00:00:00Z'
LIMIT 10000

Columns that stayed `string`

Mixed labels (N/A, gear names, and similar) remain text. Use CAST or regex guards:


SELECT *
FROM "csv_a1b2c3d4"
WHERE "KB_VEHICLE_PHYS_GEAR" ~ '^[0-9]+$'

Security constraints

The query node accepts SELECT-only SQL (including WITH … SELECT). DDL, DML, COPY, and multiple statements are rejected. Use maxRows (default 10000, cap 100000) to limit result size.

Using the catalog in Arbex

bulk-analytics-arbex-sample reads query output shape:


{
  "tables": {
    "ingest_summary": {
      "columns": ["total_ingested"],
      "rows": [[237]]
    }
  }
}

Arbex scripts can map columns + rows into heat.dataservice value channels and layout builders. See the script embedded in the shipped template and Arbex API reference.