The confusion is not really about whether ShifuAI has tests. It is about form factor: what exactly a maintainer should open, scan, and trust when they want to understand current test health. Coverage, test results, and history are related, but they are not the same thing.
The workflows in view are auth, cdc, resume core, jobs cloud, and job report, read across the conceptual lanes unit, integration, E2E, and live.
In the current repo, deterministic runs already emit coverage summaries and HTML, Playwright already emits reports and traces, live runs often stay in the terminal, and deploys create GitHub Releases. None of those are useless. The issue is that they answer different questions and they do not naturally combine into one clear workflow-level readout.
Another way to say it: current outputs are organized by tool surface, while the maintainer question is organized by workflow health.
Rows should be things like auth or job report, because that is how maintainers think about product health.
Columns should use unit, integration, E2E, and live, even if current commands combine some of those beneath the surface.
The first thing a reader should see in a cell is whether that slice is healthy and how recently that evidence was refreshed.
Counts, artifact links, and notes matter, but they should support the cell rather than overpower the grid.
| Cell field | What it means | Why it matters |
|---|---|---|
| status | Pass, Fail, Stale, Untested, N/A | The headline signal. It tells the maintainer how to interpret the rest of the cell. |
| last_run | Human-readable recency such as 2h ago or 4d ago | A stale pass is weaker than a fresh pass. Recency changes trust. |
| evidence_count | How many tests or checks back the signal | Helps distinguish one smoke check from deeper evidence. |
| artifact_type | Coverage summary, Playwright report, terminal log, release note | Lets the reader know what kind of evidence exists behind the cell. |
| notes | Short qualifier like “smoke only” or “manual run” | Adds nuance without bloating the grid. |
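As a rough illustration of the cell fields, the sketch below models one cell as a small record and formats `last_run` as a short recency string. The `Cell` class, the `humanize_age` helper, and the example values are hypothetical and exist only for this illustration; nothing here is taken from the ShifuAI repo.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One matrix cell. Field names mirror the table; the class itself is hypothetical."""
    status: str                   # "Pass", "Fail", "Stale", "Untested", or "N/A"
    last_run: Optional[datetime]  # None when the lane has never run
    evidence_count: int = 0       # how many tests or checks back the signal
    artifact_type: str = ""       # e.g. "Coverage summary", "Playwright report"
    notes: str = ""               # short qualifier such as "smoke only"


def humanize_age(last_run: Optional[datetime], now: datetime) -> str:
    """Format recency as a compact string such as '2h ago' or '4d ago'."""
    if last_run is None:
        return "never"
    seconds = max(int((now - last_run).total_seconds()), 0)
    if seconds < 3600:
        return f"{seconds // 60}m ago"
    if seconds < 86400:
        return f"{seconds // 3600}h ago"
    return f"{seconds // 86400}d ago"


# Illustrative values only, with a fixed clock so the example is repeatable.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
cell = Cell(
    status="Pass",
    last_run=now - timedelta(hours=2),
    evidence_count=14,
    artifact_type="Coverage summary",
    notes="smoke only",
)
print(cell.status, humanize_age(cell.last_run, now))  # Pass 2h ago
```

Keeping the record flat makes it easy to show only `status` and the formatted recency in the grid while leaving the remaining fields for a drilldown view.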
Pass means healthy with current evidence. Fail means the latest evidence is red. Stale means previously green, but not recent enough to trust by default. Untested means that lane is conceptually relevant but currently missing. N/A means the lane is not a useful lens for that workflow.
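One way to make these definitions mechanical is sketched below. The three-day freshness window, the `derive_status` function, and its parameters are assumptions chosen for illustration; the text does not say how recent counts as recent enough. The sketch also assumes a red result stays Fail regardless of age, since Stale is described as applying to previously green evidence.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    STALE = "Stale"
    UNTESTED = "Untested"
    NA = "N/A"


# Hypothetical threshold: how old a green result may be before it is treated as Stale.
FRESHNESS_WINDOW = timedelta(days=3)


def derive_status(
    applicable: bool,
    last_passed: Optional[bool],
    last_run: Optional[datetime],
    now: datetime,
) -> Status:
    """Map raw evidence for one workflow/lane pair onto the five cell statuses."""
    if not applicable:
        return Status.NA          # the lane is not a useful lens for this workflow
    if last_passed is None or last_run is None:
        return Status.UNTESTED    # relevant lane, but no evidence exists yet
    if not last_passed:
        return Status.FAIL        # latest evidence is red, whatever its age
    if now - last_run > FRESHNESS_WINDOW:
        return Status.STALE       # previously green, too old to trust by default
    return Status.PASS            # green and recent


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
print(derive_status(True, True, now - timedelta(days=4), now).value)  # Stale
```

Checking applicability first keeps N/A distinct from Untested, which matters because only Untested represents a gap a maintainer might want to close.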
The first screen answers one question: for each important workflow, what do unit, integration, E2E, and live currently say? Rows are workflows. Columns are conceptual lanes. Each cell shows health + recency. Coverage lives in the footer as a supporting metric.
If ShifuAI shipped only one visual test-result artifact in the near term, the strongest v1 is the workflow snapshot matrix below. It is intentionally a single snapshot, not a timeline. The values are illustrative so the structure is easy to read.
The cell shows only status + recency. Evidence counts, artifact links, and notes would live in the drilldown below the matrix, not inside the first-screen grid.
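A minimal sketch of how such a first-screen grid could be generated as a Markdown table follows. The `render_matrix` function, the shape of its input, the choice to show a missing lane as Untested, and every value in the example are hypothetical placeholders, not ShifuAI data or tooling.

```python
from typing import Dict, Optional, Tuple

LANES = ["unit", "integration", "E2E", "live"]

# workflow -> lane -> (status, recency); this shape is an assumption for the sketch
Snapshot = Dict[str, Dict[str, Tuple[str, str]]]


def render_matrix(snapshot: Snapshot, coverage_pct: Optional[float] = None) -> str:
    """Render workflows as rows and lanes as columns, with status + recency per cell."""
    lines = [
        "| Workflow | " + " | ".join(LANES) + " |",
        "|---" * (len(LANES) + 1) + "|",
    ]
    for workflow, lanes in snapshot.items():
        cells = []
        for lane in LANES:
            # A lane with no entry is shown as Untested (a choice made for this sketch).
            status, recency = lanes.get(lane, ("Untested", ""))
            cells.append(f"{status} ({recency})" if recency else status)
        lines.append(f"| {workflow} | " + " | ".join(cells) + " |")
    if coverage_pct is not None:
        # Coverage stays in the footer as a supporting metric.
        lines.append("")
        lines.append(f"Coverage (global): {coverage_pct:.0f}%")
    return "\n".join(lines)


# Placeholder values purely to show the layout.
example: Snapshot = {
    "auth": {
        "unit": ("Pass", "2h ago"),
        "integration": ("Pass", "2h ago"),
        "E2E": ("Stale", "4d ago"),
        "live": ("Untested", ""),
    },
}
print(render_matrix(example, coverage_pct=80))
```

Evidence counts, artifact links, and notes are deliberately absent from the rendered cell so the grid stays scannable.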
Footnote: this matrix is conceptual first. Current repo commands may combine some deterministic unit and integration evidence beneath the surface. That is an implementation detail, not a reason to make the human-facing artifact harder to understand.
| Workflow | How to read its row | What the matrix tells you immediately |
|---|---|---|
| auth | Deterministic evidence is healthy, but browser and live confidence are weaker. | Likely stable for contracts, but not deeply exercised end to end. |
| cdc | Strong deterministic evidence plus recent browser evidence. | Good daily confidence even without live coverage. |
| resume core | Balanced row with support across all four lanes. | Most complete workflow-health picture in the snapshot. |
| jobs cloud | Deterministic evidence is current; deeper lanes are weaker or missing. | Health looks decent, but the risk is hidden in stale or absent higher-realism lanes. |
| job report | Good deterministic and E2E shape, but live evidence is not recent. | Strong everyday confidence, weaker provider-confidence signal. |
For ShifuAI, the strongest v1 is a static report organized by workflow rows and conceptual lane columns, with health + recency as the first-screen cell signal. Coverage should stay on the page, but as a supporting global metric rather than the main visual object.
In practice that means the maintainer first scans for red, stale, and untested cells, then inspects supporting evidence, and only then looks at coverage percentages for additional context.
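The scan order described above can be sketched as a small triage helper that lists Fail cells first, then Stale, then Untested. The `cells_needing_attention` function, the input shape, and the sample values are hypothetical and serve only to illustrate the ordering.

```python
from typing import Dict, List, Tuple

# Lower rank means the maintainer looks at it sooner.
ATTENTION_ORDER = {"Fail": 0, "Stale": 1, "Untested": 2}


def cells_needing_attention(
    snapshot: Dict[str, Dict[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (workflow, lane, status) for red, stale, and untested cells, red first."""
    flagged = [
        (workflow, lane, status)
        for workflow, lanes in snapshot.items()
        for lane, status in lanes.items()
        if status in ATTENTION_ORDER
    ]
    # sorted() is stable, so cells with the same status keep their original order.
    return sorted(flagged, key=lambda item: ATTENTION_ORDER[item[2]])


# Placeholder statuses, not real results.
sample = {
    "auth": {"unit": "Pass", "E2E": "Stale", "live": "Untested"},
    "job report": {"unit": "Pass", "E2E": "Fail", "live": "Stale"},
}
for workflow, lane, status in cells_needing_attention(sample):
    print(f"{status}: {workflow} / {lane}")
```

Pass and N/A cells are left out on purpose, so the list stays short enough to act on before anyone looks at coverage percentages.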
This keeps the artifact understandable even as the exact commands, reports, and tooling evolve. The underlying details can change over time; the mental model should stay stable.