Generated 2026-03-26 17:49 CST · ShifuAI internal engineering memo

What Should Test Results Look Like?

The confusion is not really about whether ShifuAI has tests. It is about form factor: what exactly a maintainer should open, scan, and trust when they want to understand current test health. Coverage, test results, and history are related, but they are not the same thing.

Thesis: Coverage is a supporting metric, not the main test-result artifact. The main artifact should express workflow health across conceptual lanes.

Scope: the top 5 workflows only (auth, cdc, resume core, jobs cloud, job report), read through the conceptual lanes unit, integration, E2E, and live.

Current State: Why This Feels Hard Today

Today the repo produces many outputs, but they are scattered across tools and execution surfaces. That makes them hard to read as one coherent health artifact.

In the current repo, deterministic runs already emit coverage summaries and HTML, Playwright already emits reports and traces, live runs often stay in the terminal, and deploys create GitHub Releases. None of those are useless. The issue is that they answer different questions and they do not naturally combine into one clear workflow-level readout.

Another way to say it: current outputs are organized by tool surface, while the maintainer question is organized by workflow health.
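A minimal sketch of that regrouping, assuming each existing output can be tagged with the workflow and lane it speaks to. The `Output` record, its field names, and the sample tags are hypothetical; nothing here is taken from the repo.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Output(NamedTuple):
    """One existing test output. The workflow and lane tags are an assumption."""
    tool: str       # the surface that produced it, e.g. "Playwright"
    workflow: str   # e.g. "auth"
    lane: str       # "Unit", "Integration", "E2E", or "Live"
    artifact: str   # e.g. "Playwright report"


def group_by_workflow(outputs: Iterable[Output]) -> Dict[str, Dict[str, List[str]]]:
    """Pivot tool-organized outputs into workflow -> lane -> artifacts."""
    grouped = defaultdict(lambda: defaultdict(list))
    for out in outputs:
        grouped[out.workflow][out.lane].append(out.artifact)
    # Convert to plain dicts so callers do not create keys by accident.
    return {workflow: dict(lanes) for workflow, lanes in grouped.items()}


# Illustrative tags only.
outputs = [
    Output("deterministic run", "auth", "Unit", "Coverage summary"),
    Output("Playwright", "cdc", "E2E", "Playwright report"),
    Output("Playwright", "resume core", "E2E", "Playwright report"),
    Output("live run", "resume core", "Live", "terminal log"),
]
by_workflow = group_by_workflow(outputs)
print(by_workflow["resume core"])
```

The tool name stays on each record, so nothing is lost. It simply stops being the primary grouping key.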

Foundations: How To Read Test Health

The signal stack makes one distinction unavoidable: coverage is not the same thing as a workflow-health artifact. It supports that artifact, but it does not replace it.

Primary axis: workflows first. Rows should be things like auth or job report, because that is how maintainers think about product health.

Columns: conceptual lanes. Use unit, integration, E2E, and live even if current commands combine some of those beneath the surface.

Primary cell signal: health + recency. The first thing a reader should see in a cell is whether that slice is healthy and how recently that evidence was refreshed.

Supporting signals: evidence, not overload. Counts, artifact links, and notes matter, but they should support the cell rather than overpower the grid.

Cell field	What it means	Why it matters
`status`	`Pass`, `Fail`, `Stale`, `Untested`, `N/A`	The headline signal. It tells the maintainer how to interpret the rest of the cell.
`last_run`	Human-readable recency such as `2h ago` or `4d ago`	A stale pass is weaker than a fresh pass. Recency changes trust.
`evidence_count`	How many tests or checks back the signal	Helps distinguish one smoke check from deeper evidence.
`artifact_type`	Coverage summary, Playwright report, terminal log, release note	Lets the reader know what kind of evidence exists behind the cell.
`notes`	Short qualifier like “smoke only” or “manual run”	Adds nuance without bloating the grid.
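One way to hold these fields in code is a small record type. This is a sketch, not an existing ShifuAI type: the class names, defaults, and sample values are assumptions, and only the field names and status values come from the table above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    STALE = "Stale"
    UNTESTED = "Untested"
    NA = "N/A"


@dataclass(frozen=True)
class Cell:
    status: Status
    last_run: Optional[str] = None       # human-readable, e.g. "2h ago"; None if never run
    evidence_count: int = 0              # how many tests or checks back the signal
    artifact_type: Optional[str] = None  # e.g. "Playwright report"
    notes: str = ""                      # short qualifier, e.g. "smoke only"

    def headline(self) -> str:
        """First-screen text: status plus recency, nothing else."""
        if self.last_run is None:
            return self.status.value
        return f"{self.status.value}, {self.last_run}"


# Sample values are made up for illustration.
cell = Cell(Status.PASS, last_run="2h ago", evidence_count=12,
            artifact_type="Coverage summary", notes="smoke only")
print(cell.headline())
```

Keeping `headline()` separate from the other fields mirrors the split between the headline signal and the supporting detail.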

Exact Cell Vocabulary

Pass · Fail · Stale · Untested · N/A

Pass means healthy with current evidence. Fail means the latest evidence is red. Stale means previously green, but not recent enough to trust by default. Untested means that lane is conceptually relevant but currently missing. N/A means the lane is not a useful lens for that workflow.
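These definitions can be read as a small decision rule. The sketch below is one possible reading, not the memo's specification: the 5-day staleness cutoff is a placeholder (the memo does not set one), and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: the memo defines no cutoff; 5 days is a placeholder value.
STALE_AFTER = timedelta(days=5)


def derive_status(applicable: bool,
                  last_passed: Optional[bool],
                  last_run_at: Optional[datetime],
                  now: datetime) -> str:
    """Map raw run facts for one workflow/lane slice to the cell vocabulary."""
    if not applicable:
        return "N/A"        # lane is not a useful lens for this workflow
    if last_passed is None or last_run_at is None:
        return "Untested"   # lane is relevant but has no evidence
    if not last_passed:
        return "Fail"       # latest evidence is red, regardless of age
    if now - last_run_at > STALE_AFTER:
        return "Stale"      # was green, but too old to trust by default
    return "Pass"


def humanize_age(age: timedelta) -> str:
    """Render recency in the '2h ago' / '4d ago' style."""
    hours = int(age.total_seconds() // 3600)
    if hours < 1:
        return "<1h ago"
    return f"{hours}h ago" if hours < 24 else f"{hours // 24}d ago"


now = datetime(2026, 3, 26, 17, 49, tzinfo=timezone.utc)
print(derive_status(True, True, now - timedelta(hours=2), now),
      humanize_age(timedelta(hours=2)))
```

Checking failure before age is a deliberate choice in this sketch: an old red run stays Fail instead of fading into Stale.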

The Artifact: Workflow Snapshot Matrix

The first screen answers one question: for each important workflow, what do unit, integration, E2E, and live currently say? Rows are workflows. Columns are conceptual lanes. Each cell shows health + recency. Coverage lives in the footer as a supporting metric.

Option 1: A single screen that answers "what's healthy, what's stale, what's missing?" organized by how you think about the product.

Shifu-Grounded Worked Example

If ShifuAI shipped only one visual test-result artifact in the near term, the strongest v1 is the workflow snapshot matrix below. It is intentionally a single snapshot, not a timeline. The values are illustrative so the structure is easy to read.

Illustrative snapshot

Workflow Snapshot Matrix

The cell shows only status + recency. Evidence counts, artifact links, and notes would live in the drilldown below the matrix, not inside the first-screen grid.

Workflow	Unit / Integration	E2E	Live
auth	Pass, 2h ago	Untested	Stale, 6d ago
cdc	Pass, 2h ago	Pass, 3d ago	Untested
resume core	Pass, 2h ago	Pass, 3d ago	Pass, 1d ago
jobs cloud	Pass, 2h ago	Stale, 9d ago	Untested
job report	Pass, 2h ago	Pass, 3d ago	Stale, 8d ago

Legend: Pass = fresh pass · Fail = latest run failed · Stale = old evidence · Untested = missing lane · N/A = not applicable

Footnote: this matrix is conceptual first. Current repo commands may combine some deterministic unit and integration evidence beneath the surface. That is an implementation detail, not a reason to make the human-facing artifact harder to understand.
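A plain-text rendering of such a snapshot could look like the sketch below. It assumes the first cell of each row carries the combined deterministic unit and integration evidence described in the footnote. The function name, lane labels, and data layout are hypothetical, and the two sample rows reuse the illustrative values for auth and cdc.

```python
from typing import Dict, List, Optional, Tuple

CellValue = Tuple[str, Optional[str]]  # (status, recency); recency is None if never run


def render_matrix(lanes: List[str], rows: Dict[str, List[CellValue]]) -> str:
    """Render workflows as rows and lanes as columns, showing status + recency only."""
    header = ["Workflow"] + lanes
    table = [header]
    for workflow, cells in rows.items():
        if len(cells) != len(lanes):
            raise ValueError(f"{workflow}: expected {len(lanes)} cells, got {len(cells)}")
        table.append([workflow] + [
            status if recency is None else f"{status}, {recency}"
            for status, recency in cells
        ])
    # Pad every column to its widest entry so the grid lines up in a terminal.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(text.ljust(width) for text, width in zip(row, widths)).rstrip()
        for row in table
    )


LANES = ["Unit/Integration", "E2E", "Live"]  # assumed lane grouping
ROWS = {
    "auth": [("Pass", "2h ago"), ("Untested", None), ("Stale", "6d ago")],
    "cdc": [("Pass", "2h ago"), ("Pass", "3d ago"), ("Untested", None)],
}
print(render_matrix(LANES, ROWS))
```

Because the lane list is a parameter, the same function would work if unit and integration were later split into separate columns.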

Workflow	How to read its row	What the matrix tells you immediately
`auth`	Deterministic evidence is healthy, but browser and live confidence are weaker.	Likely stable for contracts, but not deeply exercised end to end.
`cdc`	Strong deterministic evidence plus recent browser evidence.	Good daily confidence even without live coverage.
`resume core`	Balanced row with support across all four lanes.	Most complete workflow-health picture in the snapshot.
`jobs cloud`	Deterministic evidence is current; deeper lanes are weaker or missing.	Health looks decent, but the risk is hidden in stale or absent higher-realism lanes.
`job report`	Good deterministic and E2E shape, but live evidence is not recent.	Strong everyday confidence, weaker provider-confidence signal.

Recommendation

Recommended direction: start with a single-snapshot static report

For ShifuAI, the strongest v1 is a static report organized by workflow rows and conceptual lane columns, with health + recency as the first-screen cell signal. Coverage should stay on the page, but as a supporting global metric rather than the main visual object.

In practice, that gives the maintainer a fixed reading order:

First: scan for Fail, Stale, and Untested cells.
Second: inspect workflows whose lane profile is obviously weak or uneven.
Third: use coverage as supporting evidence, not as the headline interpretation.
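The first scanning step can be sketched as a small filter over the matrix. It assumes the order Fail, Stale, Untested doubles as a priority order, which the memo does not state explicitly. The function name and the sample data are illustrative.

```python
from typing import Dict, List, Tuple

# Assumption: the scanning order is also the priority order.
PRIORITY = {"Fail": 0, "Stale": 1, "Untested": 2}


def cells_needing_attention(matrix: Dict[str, Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (workflow, lane, status) for every non-healthy cell, worst first."""
    flagged = [
        (workflow, lane, status)
        for workflow, lanes in matrix.items()
        for lane, status in lanes.items()
        if status in PRIORITY
    ]
    # sorted() is stable, so cells with the same status keep their matrix order.
    return sorted(flagged, key=lambda item: PRIORITY[item[2]])


# Illustrative statuses only.
matrix = {
    "auth": {"E2E": "Untested", "Live": "Stale"},
    "resume core": {"E2E": "Pass", "Live": "Pass"},
    "jobs cloud": {"E2E": "Stale", "Live": "Untested"},
}
for workflow, lane, status in cells_needing_attention(matrix):
    print(f"{status:<9} {workflow} / {lane}")
```

Pass and N/A cells are left out on purpose, so the result is exactly the list a maintainer would look at first.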

This keeps the artifact understandable even as the exact commands, reports, and tooling evolve. The underlying details can change over time; the mental model should stay stable.