Test Strategies
Pyramid, parallelism, and the cost of flakes.
Test Strategies in CI
Not all tests are the same. Unit tests take milliseconds and run in isolation. End-to-end tests take minutes and require a running application. How you arrange them in the pipeline determines both how fast you get feedback and how much you trust the result.
Analogy
Think of how a car is checked on the factory floor. Every bolt gets a torque-wrench click on the way in — cheap, constant, done thousands of times an hour; that's unit tests. Entire subsystems get rolled onto a test rig: the engine runs for ten minutes, the electronics loop is probed, the brakes are cycled; that's integration, slower but exercising real interactions. Finally, one finished car is driven around the test track by a human — expensive, irreplaceable for catching "the dashboard squeaks over bumps," but you'd never try to inspect every bolt by driving a lap for each one. Flaky tests are a torque wrench that beeps randomly: after the third false alarm, the line worker stops listening for it, and a real loose bolt slips through.
The test pyramid
The test pyramid is a heuristic about where to invest test effort:
```
            ▲
           /E2E\            few, slow, high fidelity
          /─────\
         / Integ \          some, medium speed
        /─────────\
       /   Unit    \        many, fast, low fidelity per test
      /─────────────\
```
The base is widest because unit tests are cheap to write and run. E2E tests are expensive; they exercise the full stack, require infrastructure, and are sensitive to timing. The pyramid says: maximise your unit test count, keep integration tests moderate, and keep E2E tests focused on the highest-value user journeys.
Inverting the pyramid — many E2E, few unit — produces a slow, flaky, and expensive test suite. This is the "ice cream cone" anti-pattern.
Unit tests
A unit test exercises a single function or module with its dependencies mocked or stubbed. It runs in milliseconds, in memory, with no network or filesystem access.
Unit tests give you fast feedback on logic errors. They are bad at catching integration bugs — two modules that each work correctly but fail together. That's what integration tests are for.
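The shape of a unit test can be sketched like this (a hypothetical `total_price` function with a hypothetical `tax_service` dependency; `unittest.mock` stands in for whatever mocking tool your stack uses):

```python
# Hypothetical example: a price calculation with its tax-rate lookup stubbed out.
from unittest.mock import Mock

def total_price(cart, tax_service):
    """Sum item prices and apply the tax rate for the cart's region."""
    subtotal = sum(item["price"] for item in cart["items"])
    return round(subtotal * (1 + tax_service.rate_for(cart["region"])), 2)

# The unit test: the real tax service (a network call in production) is
# replaced by a mock, so the test runs in memory in microseconds.
tax_service = Mock()
tax_service.rate_for.return_value = 0.20

cart = {"region": "UK", "items": [{"price": 10.0}, {"price": 5.0}]}
assert total_price(cart, tax_service) == 18.0
tax_service.rate_for.assert_called_once_with("UK")
```

Note that the test verifies both the result and the interaction (the tax service was queried exactly once, with the right region) without touching the network.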
Integration tests
An integration test exercises two or more real components together. A typical integration test brings up a real database connection, calls the service layer, and asserts on the resulting state — no mocks for the components under test.
Integration tests are slower than unit tests (database round-trips, process startup) but catch a different class of bug. They are usually run after unit tests in the pipeline and may require service containers (Docker, Testcontainers).
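A minimal sketch of that pattern, using an in-process SQLite database as a stand-in for the real database a CI service container would provide (the `UserRepo` class is hypothetical):

```python
# Hypothetical sketch: a repository tested against a real (in-process) SQLite
# database instead of a mock, so the SQL and the schema are actually exercised.
import sqlite3

class UserRepo:
    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)"
        )

    def add(self, email):
        cur = self.conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return cur.lastrowid

    def find(self, email):
        row = self.conn.execute(
            "SELECT id FROM users WHERE email = ?", (email,)
        ).fetchone()
        return row[0] if row else None

# Integration test: no mocks for the components under test. A real connection,
# real SQL, and an assertion on the resulting database state.
conn = sqlite3.connect(":memory:")
repo = UserRepo(conn)
uid = repo.add("a@example.com")
assert repo.find("a@example.com") == uid
assert repo.find("missing@example.com") is None
```

A bug in the `CREATE TABLE` statement or the `INSERT` would pass a fully mocked unit test but fails here, which is exactly the class of defect integration tests exist to catch.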
End-to-end tests
An E2E test drives the full application as a user would — browser automation, real API calls, real storage. Tools: Playwright, Cypress, Selenium.
E2E tests are the most expensive and the most realistic. Run them last in the pipeline, after build and unit/integration checks pass. Keep the suite small and focused: login, critical purchase path, core API surface.
Parallelisation and sharding
Large test suites can be parallelised across multiple runners. Unit tests are embarrassingly parallel — partition by file or test ID. E2E tests can be sharded by spec file.
```yaml
# GitHub Actions matrix — 4 Playwright shards
strategy:
  matrix:
    shard: [1, 2, 3, 4]
steps:
  - run: npx playwright test --shard=${{ matrix.shard }}/4
```
With four shards, a 20-minute E2E suite becomes a 5-minute job on the critical path. The trade-off: four runners working concurrently instead of one, for roughly the same total runner-minutes plus per-shard startup overhead. You pay for concurrency, not for less work.
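The partitioning itself can be sketched in a few lines. Playwright's `--shard` flag handles this internally; the hash-based scheme below is an illustrative stand-in showing the property that matters, namely that every spec file lands in exactly one stable shard:

```python
# Sketch of the sharding idea: deterministically partition spec files across
# N runners so each shard sees a disjoint, stable subset of the suite.
import zlib

def shard_for(spec_file: str, total_shards: int) -> int:
    # CRC32 is stable across runs and machines, unlike Python's built-in hash().
    return zlib.crc32(spec_file.encode()) % total_shards

specs = ["login.spec.ts", "checkout.spec.ts", "search.spec.ts", "profile.spec.ts"]
shards = {n: [s for s in specs if shard_for(s, 4) == n] for n in range(4)}

# Every spec lands in exactly one shard: flattening the shards recovers the suite.
assert sorted(s for group in shards.values() for s in group) == sorted(specs)
```

Hash-based assignment keeps a spec on the same shard between runs, which makes shard-level caches and failure history meaningful; the cost is that shards may be uneven, which is why real tools often balance by recorded test duration instead.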
Flaky tests
A flaky test passes sometimes and fails sometimes without any code change. Common causes: timing assumptions, shared mutable state, network dependencies, randomised data.
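The randomised-data cause is easy to demonstrate, and so is its fix. In this hypothetical example (`dedupe` is an invented function), an unseeded generator makes the test's input, and therefore its verdict, vary from run to run; pinning the seed makes every run identical:

```python
# Hypothetical illustration of the "randomised data" flake: an unseeded
# generator produces different inputs per run, so assertions about the data
# (e.g. "it contains duplicates") only hold sometimes.
import random

def dedupe(items):
    """Remove duplicates, preserving first-seen order."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# Deterministic fix: a dedicated, seeded generator. Every run now sees the
# same 20 draws from 5 values, which guarantees duplicates by pigeonhole.
rng = random.Random(42)
data = rng.choices(range(5), k=20)

assert len(dedupe(data)) < len(data)          # duplicates were removed
assert dedupe(data) == sorted(set(data), key=data.index)  # first-seen order kept
```

The same principle generalises: replace wall-clock waits with explicit conditions, shared state with per-test fixtures, and live network calls with recorded responses.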
Flaky tests are the most corrosive thing that can happen to a test suite. They teach engineers to re-run failed pipelines until they go green. The signal degrades: a genuine failure looks identical to a flake. Soon nobody trusts the tests.
Adding `retry: 2` to a flaky test makes the pipeline green again but does not fix the root cause. The test still fails intermittently; you have just silenced the symptom. Retries mask the signal.
Fix flaky tests immediately. Track them. Delete ones that cannot be fixed and that cover functionality unit tests already cover.
What to run on PR vs. on merge
Not every test needs to run on every PR commit.
| Trigger | Tests |
|---|---|
| PR commit | Unit + integration |
| PR merge check (required) | Unit + integration + E2E on critical paths |
| Merge to main | Full suite + coverage gate |
| Nightly | Full suite + load tests |
Front-loading cheap tests and deferring expensive ones shortens the feedback loop for developers. The coverage gate on main ensures the suite stays complete.
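A sketch of how the table's triggers map onto a GitHub Actions workflow (illustrative fragment; branch names and the cron schedule are assumptions):

```yaml
# Hypothetical trigger block mirroring the table above
on:
  pull_request:            # PR commit → unit + integration
  push:
    branches: [main]       # merge to main → full suite + coverage gate
  schedule:
    - cron: "0 2 * * *"    # nightly → full suite + load tests
```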
Coverage gates
A coverage gate blocks merge if line or branch coverage drops below a threshold. This is the mechanism that ensures new code ships with tests.
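The gating logic itself is trivial, which is part of its appeal. Tools implement it natively (coverage.py, for instance, has `coverage report --fail-under=80`), but the decision reduces to a single comparison; `coverage_gate` below is a hypothetical name for that check:

```python
# Minimal sketch of the gating decision CI coverage tools implement: compare
# measured coverage to a threshold and fail the job if it drops below.

def coverage_gate(covered_lines: int, total_lines: int, threshold_pct: float) -> bool:
    """Return True if coverage meets the threshold (merge may proceed)."""
    if total_lines == 0:
        return True  # nothing to cover, nothing to gate
    return 100.0 * covered_lines / total_lines >= threshold_pct

assert coverage_gate(85, 100, 80.0) is True   # 85% ≥ 80% → merge allowed
assert coverage_gate(79, 100, 80.0) is False  # 79% < 80% → merge blocked
```

In CI the `False` branch becomes a non-zero exit code, which is what actually blocks the merge.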
Set realistic thresholds and enforce them. A 99% threshold, backed by tests that catch legitimate regressions, is more valuable than a 70% threshold that nobody enforces.