Generated 2026-03-26 17:49 CST · ShifuAI internal engineering memo

What Should Test Results Look Like?

The confusion is not really about whether ShifuAI has tests. It is about form factor: what exactly a maintainer should open, scan, and trust when they want to understand current test health. Coverage, test results, and history are related, but they are not the same thing.

Thesis: Coverage is a supporting metric, not the main test-result artifact. The main artifact should express workflow health across conceptual lanes.
Scope: the top five workflows only (auth, cdc, resume core, jobs cloud, job report), read across the conceptual lanes unit, integration, E2E, and live.

Current State: Why This Feels Hard Today

[Figure: current test outputs, grouped by tool and surface.
Deterministic (Vitest + coverage): coverage-summary.json, coverage HTML, terminal block.
E2E (Playwright report): playwright-report, test-results, failure traces.
Live (terminal only): stdout / stderr, manual interpretation, little retention.
Deploy (release metadata): GitHub Release, commit + deploy, not test state.
Problem: outputs are organized by tool and surface, not by workflow health.]
Today the repo produces many outputs, but they are scattered across tools and execution surfaces. That makes them hard to read as one coherent health artifact.

In the current repo, deterministic runs already emit coverage summaries and HTML, Playwright already emits reports and traces, live runs often stay in the terminal, and deploys create GitHub Releases. None of those are useless. The issue is that they answer different questions and they do not naturally combine into one clear workflow-level readout.

Another way to say it: current outputs are organized by tool surface, while the maintainer question is organized by workflow health.

Foundations: How To Read Test Health

[Figure: the signal stack, top to bottom.
Workflow Health (main artifact): what a maintainer actually wants to understand.
Lane Results (cell layer): Unit · Integration · E2E · Live.
Supporting Metrics (supporting): Coverage · test counts · duration · evidence links.
History / Timeline (later layer).]
The signal stack makes one distinction unavoidable: coverage is not the same thing as a workflow-health artifact. It supports that artifact, but it does not replace it.

Primary Axis

Workflows first

Rows should be things like auth or job report, because that is how maintainers think about product health.

Columns

Conceptual lanes

Use unit, integration, E2E, and live even if current commands combine some of those beneath the surface.

Primary Cell Signal

Health + recency

The first thing a reader should see in a cell is whether that slice is healthy and how recently that evidence was refreshed.

Supporting Signals

Evidence, not overload

Counts, artifact links, and notes matter, but they should support the cell rather than overpower the grid.

Cell field | What it means | Why it matters
status | Pass, Fail, Stale, Untested, N/A | The headline signal. It tells the maintainer how to interpret the rest of the cell.
last_run | Human-readable recency such as 2h ago or 4d ago | A stale pass is weaker than a fresh pass. Recency changes trust.
evidence_count | How many tests or checks back the signal | Helps distinguish one smoke check from deeper evidence.
artifact_type | Coverage summary, Playwright report, terminal log, release note | Lets the reader know what kind of evidence exists behind the cell.
notes | Short qualifier like “smoke only” or “manual run” | Adds nuance without bloating the grid.
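
The cell fields above map naturally onto a small record type. The following is a minimal sketch, not an existing ShifuAI type: the names `LaneCell`, `CellStatus`, `ArtifactType`, and `formatRecency` are hypothetical, and the rule of switching from hours to days at 24 hours is an assumption chosen to match the "2h ago" / "4d ago" examples.

```typescript
// Hypothetical types; field names mirror the cell-field table in this memo.
type CellStatus = "Pass" | "Fail" | "Stale" | "Untested" | "N/A";
type ArtifactType =
  | "coverage summary"
  | "Playwright report"
  | "terminal log"
  | "release note";

interface LaneCell {
  status: CellStatus;          // headline signal
  lastRun?: Date;              // absent when the lane is Untested or N/A
  evidenceCount?: number;      // how many tests or checks back the signal
  artifactType?: ArtifactType; // what kind of evidence sits behind the cell
  notes?: string;              // short qualifier such as "smoke only"
}

// Render last_run as human-readable recency.
// Assumption: hours below one day, whole days after that.
function formatRecency(lastRun: Date, now: Date): string {
  const hours = Math.floor((now.getTime() - lastRun.getTime()) / (60 * 60 * 1000));
  return hours < 24 ? `${hours}h ago` : `${Math.floor(hours / 24)}d ago`;
}

// Illustrative cell only; the values are not real test results.
const exampleCell: LaneCell = {
  status: "Pass",
  lastRun: new Date("2026-03-26T12:30:00Z"),
  evidenceCount: 12,
  artifactType: "coverage summary",
  notes: "smoke only",
};
```

Keeping `status` required and everything else optional reflects the idea that the headline signal always exists, while supporting evidence may not.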

Exact Cell Vocabulary

Pass · Fail · Stale · Untested · N/A

Pass means healthy with current evidence. Fail means the latest evidence is red. Stale means previously green, but not recent enough to trust by default. Untested means that lane is conceptually relevant but currently missing. N/A means the lane is not a useful lens for that workflow.
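
The vocabulary can be read as a small decision procedure. The sketch below is one possible encoding, under stated assumptions: `LaneEvidence` and `deriveStatus` are hypothetical names, and the staleness cutoff of 5 days is an assumption. The memo does not define a cutoff; the illustrative matrix only implies it falls somewhere between 3 and 6 days.

```typescript
type CellStatus = "Pass" | "Fail" | "Stale" | "Untested" | "N/A";

// Hypothetical input shape: what is known about one workflow/lane pair.
interface LaneEvidence {
  applicable: boolean; // false when the lane is not a useful lens for the workflow
  latestRun?: { passed: boolean; finishedAt: Date }; // absent when nothing has run
}

// Assumption: not specified in the memo; tune per lane if needed.
const STALE_AFTER_DAYS = 5;

function deriveStatus(
  evidence: LaneEvidence,
  now: Date,
  staleAfterDays: number = STALE_AFTER_DAYS,
): CellStatus {
  if (!evidence.applicable) return "N/A";
  if (!evidence.latestRun) return "Untested";
  // The latest evidence is red, regardless of how old it is.
  if (!evidence.latestRun.passed) return "Fail";
  const ageDays =
    (now.getTime() - evidence.latestRun.finishedAt.getTime()) / (24 * 60 * 60 * 1000);
  // Previously green, but too old to trust by default.
  return ageDays > staleAfterDays ? "Stale" : "Pass";
}
```

The order of checks matters: applicability is decided before missing evidence, and a red latest run wins over age, so an old failure still reads as Fail rather than Stale.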

The Artifact: Workflow Snapshot Matrix

The first screen answers one question: for each important workflow, what do unit, integration, E2E, and live currently say? Rows are workflows. Columns are conceptual lanes. Each cell shows health + recency. Coverage lives in the footer as a supporting metric.

[Figure: Workflow Health Snapshot, 2026-03-26 14:30 CST · a47993c.
Columns: WORKFLOW, UNIT, INTEGRATION, E2E, LIVE. Rows: auth, cdc, resume core, jobs cloud, job report, with the same cell values as the Workflow Snapshot Matrix in the worked example below.
Footer: COVERAGE (supporting) Stmt 51% · Branch 42% · Fn 52% · Line 53% · scope: backend + lib only.]
Option 1: A single screen that answers "what's healthy, what's stale, what's missing?" organized by how you think about the product.
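
The coverage footer in the snapshot could be derived from the existing coverage-summary.json. The sketch below assumes that file follows the Istanbul json-summary layout (a `total` object with `statements`, `branches`, `functions`, and `lines`, each carrying a numeric `pct`). The function name `coverageFooter` and the rounding to whole percentages are assumptions; the function takes the already-parsed object so that no file path has to be guessed.

```typescript
// Subset of the Istanbul json-summary shape this sketch relies on.
interface Metric {
  pct: number;
}
interface CoverageTotals {
  statements: Metric;
  branches: Metric;
  functions: Metric;
  lines: Metric;
}

// Hypothetical helper: format the supporting coverage line for the snapshot footer.
function coverageFooter(summary: { total: CoverageTotals }, scope: string): string {
  const t = summary.total;
  const pct = (m: Metric): string => `${Math.round(m.pct)}%`;
  return [
    `COVERAGE (supporting) Stmt ${pct(t.statements)}`,
    `Branch ${pct(t.branches)}`,
    `Fn ${pct(t.functions)}`,
    `Line ${pct(t.lines)}`,
    `scope: ${scope}`,
  ].join(" · ");
}
```

Passing `scope` explicitly keeps the footer honest about what the percentages cover, as the "backend + lib only" label does in the snapshot.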

Shifu-Grounded Worked Example

If ShifuAI shipped only one visual test-result artifact in the near term, the strongest v1 is the workflow snapshot matrix below. It is intentionally a single snapshot, not a timeline. The values are illustrative so the structure is easy to read.

Illustrative snapshot

Workflow Snapshot Matrix

The cell shows only status + recency. Evidence counts, artifact links, and notes would live in the drilldown below the matrix, not inside the first-screen grid.

Workflow | Unit | Integration | E2E | Live
auth | Pass, 2h ago | Pass, 2h ago | Untested | Stale, 6d ago
cdc | Pass, 2h ago | Pass, 2h ago | Pass, 3d ago | Untested
resume core | Pass, 2h ago | Pass, 2h ago | Pass, 3d ago | Pass, 1d ago
jobs cloud | Pass, 2h ago | Pass, 2h ago | Stale, 9d ago | Untested
job report | Pass, 2h ago | Pass, 2h ago | Pass, 3d ago | Stale, 8d ago
Legend: Pass = fresh pass · Fail = latest run failed · Stale = old evidence · Untested = missing lane · N/A = not applicable

Footnote: this matrix is conceptual first. Current repo commands may combine some deterministic unit and integration evidence beneath the surface. That is an implementation detail, not a reason to make the human-facing artifact harder to understand.
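
Because the v1 artifact is a static snapshot, the grid itself can be produced by a very small renderer. The following is a sketch under assumptions, not the report generator: `Matrix`, `Cell`, `LANES`, and `renderMatrix` are hypothetical names, and plain-text column alignment is just one possible output format.

```typescript
type Lane = "unit" | "integration" | "e2e" | "live";

// First-screen cell: status plus optional recency, nothing else.
interface Cell {
  status: string;
  age?: string; // e.g. "2h ago"; omitted for Untested or N/A
}

// Rows are workflows, columns are conceptual lanes.
type Matrix = Record<string, Record<Lane, Cell>>;

const LANES: Lane[] = ["unit", "integration", "e2e", "live"];

function renderMatrix(matrix: Matrix): string {
  const header = ["WORKFLOW", ...LANES.map((lane) => lane.toUpperCase())];
  const rows = Object.entries(matrix).map(([workflow, lanes]) => [
    workflow,
    ...LANES.map((lane) => {
      const cell = lanes[lane];
      return cell.age ? `${cell.status} ${cell.age}` : cell.status;
    }),
  ]);
  const table = [header, ...rows];
  // Pad every column to its widest entry so the grid lines up in a terminal.
  const widths = header.map((_, i) => Math.max(...table.map((row) => row[i].length)));
  return table
    .map((row) => row.map((value, i) => value.padEnd(widths[i])).join("  ").trimEnd())
    .join("\n");
}

// Illustrative row taken from the snapshot in this memo.
console.log(
  renderMatrix({
    auth: {
      unit: { status: "Pass", age: "2h ago" },
      integration: { status: "Pass", age: "2h ago" },
      e2e: { status: "Untested" },
      live: { status: "Stale", age: "6d ago" },
    },
  }),
);
```

The renderer only knows about workflows and lanes. Which commands or reports feed each cell stays outside it, which matches the footnote's point that command layout is an implementation detail.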

Workflow | How to read its row | What the matrix tells you immediately
auth | Deterministic evidence is healthy, but browser and live confidence are weaker. | Likely stable for contracts, but not deeply exercised end to end.
cdc | Strong deterministic evidence plus recent browser evidence. | Good daily confidence even without live coverage.
resume core | Balanced row with support across all four lanes. | Most complete workflow-health picture in the snapshot.
jobs cloud | Deterministic evidence is current; deeper lanes are weaker or missing. | Health looks decent, but the risk is hidden in stale or absent higher-realism lanes.
job report | Good deterministic and E2E shape, but live evidence is not recent. | Strong everyday confidence, weaker provider-confidence signal.

Recommendation

Recommended direction

Start with a single-snapshot static report

For ShifuAI, the strongest v1 is a static report organized by workflow rows and conceptual lane columns, with health + recency as the first-screen cell signal. Coverage should stay on the page, but as a supporting global metric rather than the main visual object.

In practice that means the maintainer first scans for red, stale, and untested cells, then inspects supporting evidence, and only then looks at coverage percentages for additional context.
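
That scan order can also be expressed as a simple filter and sort, for example to print an attention list above the grid. This is a sketch only: `FlatCell`, `TRIAGE_ORDER`, and `needsAttention` are hypothetical names, and ranking Fail before Stale before Untested is an assumption taken from the order "red, stale, and untested" in the paragraph above.

```typescript
type CellStatus = "Pass" | "Fail" | "Stale" | "Untested" | "N/A";

// One matrix cell flattened with its coordinates.
interface FlatCell {
  workflow: string;
  lane: string;
  status: CellStatus;
}

// Assumption: red first, then stale, then untested.
const TRIAGE_ORDER: CellStatus[] = ["Fail", "Stale", "Untested"];

// Return only the cells a maintainer should look at first, most urgent first.
function needsAttention(cells: FlatCell[]): FlatCell[] {
  return cells
    .filter((cell) => TRIAGE_ORDER.indexOf(cell.status) >= 0)
    .sort((a, b) => TRIAGE_ORDER.indexOf(a.status) - TRIAGE_ORDER.indexOf(b.status));
}
```

Pass and N/A cells are dropped entirely, so an empty result means the snapshot has nothing that needs immediate attention.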

This keeps the artifact understandable even as the exact commands, reports, and tooling evolve. The underlying details can change over time; the mental model should stay stable.