Ticket Triage & Structured Troubleshooting

Priority vs severity, SLA clocks, and the triage questions that separate one-line tickets from hours of guesswork.

Tickets are how a company's pain reaches IT. On a given day you might see a prod outage, a phishing report, a password reset, a printer on fire, and the CEO saying "email is broken." Triage is the job of turning that unordered stream into a queue where the right thing gets worked first and nothing slips through the SLA.

Severity vs priority — two different questions

The single biggest trap in triage is treating severity and priority as the same number. They're different questions:

Severity answers how broken is the underlying system? S1 is prod down or data loss. S2 is a major degradation for a group. S3 is one user blocked with a workaround. S4 is cosmetic.
Priority answers how fast must we respond? P0 is "page someone now." P1 is "within the hour." P2 is "today." P3 is "this week." P4 is "when we get around to it."

The same failure can land in different priority columns. If the CEO's email breaks at quarter-close with the board waiting for numbers, that's still technically S3 (one user blocked), but it's P0, because the business impact of doing nothing for even 20 minutes is enormous. If the same auth path breaks for an intern on their second day, the severity is unchanged and the priority is P3.

Priority (first response)	S1 (prod down)	S2 (group degraded)	S3 (user blocked)	S4 (cosmetic)
P0 (now)	Pager event	Exec-impact outage	CEO locked out	Very unusual
P1 (<1h)	Sev1 runbook	Security incident	Phishing triage	—
P2 (<4h)	Rare	Printer floor	VPN for a user	—
P3 (<1d)	—	—	Password reset	Wrong font
P4 (<2d)	—	—	—	Slow laptop

The diagonal is the "expected" region; the off-diagonal cells are the judgement calls.
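To make the separation concrete, here is a minimal sketch that stores severity and priority as two independent fields. The names `Severity`, `Priority`, and `Ticket` are hypothetical and not taken from any particular helpdesk product; the CEO ticket is placed at S3/P0 to match the "CEO locked out" cell in the matrix.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """How broken is the underlying system?"""
    S1 = 1  # prod down or data loss
    S2 = 2  # major degradation for a group
    S3 = 3  # one user blocked, workaround exists
    S4 = 4  # cosmetic


class Priority(IntEnum):
    """How fast must we respond?"""
    P0 = 0  # page someone now
    P1 = 1  # within the hour
    P2 = 2  # today
    P3 = 3  # this week
    P4 = 4  # when we get around to it


@dataclass(frozen=True)
class Ticket:
    summary: str
    severity: Severity  # set from the technical impact
    priority: Priority  # set from the business cost of waiting


ceo = Ticket("CEO locked out of email at quarter-close", Severity.S3, Priority.P0)
intern = Ticket("Intern locked out of email", Severity.S3, Priority.P3)

# Same severity, different urgency: the work queue is ordered by priority only.
queue = sorted([intern, ceo], key=lambda t: t.priority)
```

Keeping the two fields separate means a later re-prioritisation never has to rewrite what was recorded about the technical failure.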

SLA clocks — first-response vs resolution

Every priority comes with an SLA (Service Level Agreement): a contract with the reporter about how quickly they'll hear back and how quickly the issue will be fixed. That gives two separate clocks:

First-response SLA — how quickly someone acknowledges the ticket. This is about being a good citizen; a silent queue is worse than a slow queue.
Resolution SLA — how quickly the issue is actually fixed. Much harder to meet because it depends on the nature of the problem.

SLAs pause when the ticket is blocked on the user (waiting for information, asking them to try something). They resume when the user replies. A ticket that sits in "waiting on customer" for three days isn't an SLA breach — it's a customer who took three days to reply.
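The pause/resume behaviour can be sketched as a clock that only accumulates time while the ticket is with IT. `SlaClock` and its method names are illustrative assumptions; real helpdesk tools implement this through status-driven automation rather than explicit calls.

```python
from datetime import datetime, timedelta
from typing import Optional


class SlaClock:
    """Accumulates elapsed time only while the ticket is not waiting on the user."""

    def __init__(self, started_at: datetime) -> None:
        self._running_since: Optional[datetime] = started_at
        self._elapsed = timedelta(0)

    def pause(self, at: datetime) -> None:
        """Ticket moved to 'waiting on customer': bank the time so far."""
        if self._running_since is not None:
            self._elapsed += at - self._running_since
            self._running_since = None

    def resume(self, at: datetime) -> None:
        """User replied: start counting again from this moment."""
        if self._running_since is None:
            self._running_since = at

    def elapsed(self, now: datetime) -> timedelta:
        """SLA time consumed so far, excluding paused intervals."""
        if self._running_since is None:
            return self._elapsed
        return self._elapsed + (now - self._running_since)


t0 = datetime(2024, 1, 8, 9, 0)
clock = SlaClock(t0)
clock.pause(t0 + timedelta(minutes=20))   # asked the user for more information
clock.resume(t0 + timedelta(days=3))      # user replied three days later
consumed = clock.elapsed(t0 + timedelta(days=3, minutes=10))
print(consumed)  # 0:30:00
```

Only the 20 minutes before the question and the 10 minutes after the reply count; the three-day wait does not.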

Common values in helpdesk systems:

P0 → 15-minute first response
P1 → 1-hour first response
P2 → 4-hour first response, same-business-day resolution
P3 → next-business-day first response
P4 → informational, 48-hour acknowledgement
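As a rough illustration, the first-response values listed above can be held in a lookup table and used to compute a due time. This sketch uses wall-clock hours and approximates "next business day" for P3 as 24 hours, which is an assumption; production systems usually apply a business-hours calendar. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# First-response targets from the list above, in wall-clock time.
# P3 "next business day" is approximated as 24 hours (assumption).
FIRST_RESPONSE_SLA = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
    "P4": timedelta(hours=48),
}


def first_response_due(priority: str, opened_at: datetime) -> datetime:
    """Latest time by which someone should have acknowledged the ticket."""
    return opened_at + FIRST_RESPONSE_SLA[priority]


def is_breached(priority: str, opened_at: datetime, now: datetime) -> bool:
    """True once the first-response window has passed without acknowledgement."""
    return now > first_response_due(priority, opened_at)


opened = datetime(2024, 1, 8, 14, 0)
print(first_response_due("P0", opened))  # 2024-01-08 14:15:00
```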

Intake questions — the triage interview

A triage interview is short and mechanical. Three questions almost always move a ticket forward:

What changed? — new laptop, new app version, password rotation, upstream maintenance. 80% of tickets are caused by something that changed recently.
When did it start? — maps to a deploy, an incident, a scheduled job. Bounds the investigation.
Who is affected? — just you, your team, your floor, the whole company. Determines severity and sometimes priority in one question.

Missing one of these is the single biggest reason for "could you send more detail?" back-and-forth. Ask all three in your first reply.
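One way to make "ask all three in your first reply" mechanical is a small checklist that finds which answers a ticket still lacks and drafts the reply. The field names (`what_changed`, `when_started`, `who_affected`) and the dict-shaped ticket are assumptions for illustration only.

```python
# The three intake questions, keyed by hypothetical ticket field names.
INTAKE_QUESTIONS = {
    "what_changed": "What changed recently (new laptop, app update, password rotation)?",
    "when_started": "When did it start?",
    "who_affected": "Who is affected: just you, your team, your floor, or everyone?",
}


def missing_intake(ticket: dict) -> list:
    """Return the intake fields that are absent or empty on the ticket."""
    return [field for field in INTAKE_QUESTIONS if not ticket.get(field)]


def first_reply(ticket: dict) -> str:
    """Draft a first reply that asks every unanswered intake question at once."""
    missing = missing_intake(ticket)
    if not missing:
        return "Thanks, we have what we need and are investigating."
    lines = ["Thanks for the report. To move quickly, could you tell us:"]
    lines += [f"- {INTAKE_QUESTIONS[field]}" for field in missing]
    return "\n".join(lines)


ticket = {"summary": "email is broken", "when_started": "this morning"}
print(first_reply(ticket))
```

Asking everything that is missing in a single message avoids the repeated "could you send more detail?" round trips.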

Escalation — don't heroically hold a ticket you can't solve

If you've been on a ticket for longer than its first-response SLA without meaningful progress, escalate. Handing off a P1 at 55 minutes because you're stuck is a win for the customer — it's not a confession of failure. The rule of thumb: if the next 30 minutes of your work is as likely to solve it as someone else's, you're the right owner. If not, get help.

Escalation paths are usually written down in the runbook:

Tier 1 (generalist helpdesk) → Tier 2 (specialist IT)
Tier 2 → Vendor support (with a support contract number ready)
Tier 2 → Engineering on-call (for product-side bugs)
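The escalation rule and paths above could be sketched as follows. This takes the first rule literally (time on the ticket exceeds the first-response SLA with no meaningful progress); a team might well choose to trigger earlier. The tier keys and function names are hypothetical.

```python
from datetime import timedelta

# Escalation paths from the runbook list above; the key names are illustrative.
ESCALATION_PATH = {
    "tier1": ["tier2"],
    "tier2": ["vendor_support", "engineering_on_call"],
}


def should_escalate(time_on_ticket: timedelta,
                    first_response_sla: timedelta,
                    meaningful_progress: bool) -> bool:
    """Escalate once time on the ticket exceeds the SLA window with no progress."""
    return time_on_ticket > first_response_sla and not meaningful_progress


def next_tiers(current_tier: str) -> list:
    """Where a ticket can go next; an empty list means no further path is defined."""
    return ESCALATION_PATH.get(current_tier, [])


stuck = should_escalate(timedelta(minutes=70), timedelta(hours=1), False)
if stuck:
    print("Escalate to:", next_tiers("tier1"))
```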

Writing user-visible updates

Updates are not for you — they're for the reporter. Keep them short, factual, and cadenced:

"14:32 — We've reproduced the issue and are investigating the mail gateway. Next update at 14:50 regardless of progress."

Three things done right: concrete state, what we're doing, a commitment to the next update. No speculation. No "should be fixed soon." If the cadence you promised passes without news, post another update saying "still investigating, next update at …". Silence is worse than an honest "we don't know yet."
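A small helper can enforce that shape so every update carries a timestamp, a concrete state, the current action, and the next-update commitment. `status_update` and its parameters are hypothetical, offered only as a sketch.

```python
from datetime import datetime, timedelta


def status_update(now: datetime, state: str, action: str,
                  next_in: timedelta) -> str:
    """Format an update: timestamp, concrete state, current action, next update time."""
    next_at = now + next_in
    return (f"{now:%H:%M} - {state} and {action}. "
            f"Next update at {next_at:%H:%M} regardless of progress.")


message = status_update(
    datetime(2024, 1, 8, 14, 32),
    "We've reproduced the issue",
    "are investigating the mail gateway",
    timedelta(minutes=18),
)
print(message)
```

Because the next-update time is computed rather than typed, the promised cadence is always stated explicitly and can be tracked.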

Real tools you'll see

Jira Service Management (Atlassian) — queue + SLA automations.
Zendesk — more customer-facing, strong macros and triggers.
Freshservice — ITIL-shaped, competitive with Jira SM.
ServiceNow — enterprise incumbent, deep integrations, heavy process.

They all map onto the same abstraction: a queue of tickets, a priority → SLA mapping, a severity field, a status/owner, a comment thread, and automation that pauses/resumes clocks.

Gotchas

Prioritising by squeakiness, not impact. The loudest reporter isn't automatically the highest priority. A quiet P0 (one engineer seeing a weird crash that'll be everywhere in an hour) still outranks a loud P3.
Re-prioritising silently. If you move a ticket from P1 to P3, tell the reporter why. Otherwise they think you're ignoring them.
Missing the SLA pause. A ticket waiting on the user should not accrue SLA time, so make sure your system's "waiting on customer" status actually pauses the clock.
Heroic debugging at the cost of the queue. One ticket is not worth breaching three.

Playground

Pick a ticket from the queue. Choose a priority and severity, name an owner, sketch the first action you'd take. Submit and read the score breakdown. The SLA clock is running — try to answer before it goes red.

Visualizer

The priority / severity matrix plots all ten seed tickets as faint ghost dots in a 4 × 5 grid. As you score them in the playground, your placements are overlaid in full colour so you can see how close you got — and where the expected answer disagrees with the obvious-looking cell.