Most web platforms get to warm up. Traffic builds through the morning, peaks around lunch, tails off. You can autoscale into demand because demand announces itself.
Exam platforms do not get that courtesy. At 9:00am sharp, ten thousand candidates who have been waiting nervously for weeks all click "Begin" within the same sixty seconds. There is no ramp. There is a wall. And it is the single least forgiving moment in the whole system, because the people hitting it are already anxious and the stakes could not be higher.
The synchronised-start problem
A normal load test — gradually increasing virtual users — completely misses this. The exam-day pattern is a near-instantaneous spike to peak, sustained for hours, then a second correlated spike at the end when everyone submits at once. Your capacity has to be sized for the spike, not the average, and the spike can be ten or twenty times your steady state.
Autoscaling rarely saves you here, because scaling takes minutes and the stampede takes seconds. By the time new capacity spins up, the candidates have already met an error page — and an error page during a licensure exam is not a glitch, it is a legal and reputational event.
You cannot autoscale your way out of a spike that arrives faster than a server can boot. You have to be ready before the clock strikes.
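To put rough numbers on the gap between a ramped load test and a synchronised start, here is a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the ramp is taken to last thirty minutes, clicks are spread uniformly at random across the sixty-second window, and new capacity is assumed to need 180 seconds to come online. Substitute your own measurements.

```python
import random
from collections import Counter


def ramp_arrivals(candidates: int, ramp_seconds: int) -> list[int]:
    """Arrival second for each virtual user in a classic linear ramp."""
    return [i * ramp_seconds // candidates for i in range(candidates)]


def synchronised_arrivals(candidates: int, window_seconds: int, seed: int = 0) -> list[int]:
    """Arrival second for each candidate when everyone clicks inside one short window.

    Assumption: clicks are uniformly random within the window.
    """
    rng = random.Random(seed)
    return [rng.randrange(window_seconds) for _ in range(candidates)]


def peak_per_second(arrivals: list[int]) -> int:
    """Largest number of arrivals landing in any single second."""
    return max(Counter(arrivals).values())


def arrived_before(arrivals: list[int], scale_out_seconds: int) -> int:
    """Candidates who have already clicked by the time new capacity is ready."""
    return sum(1 for t in arrivals if t < scale_out_seconds)


if __name__ == "__main__":
    candidates = 10_000
    ramp = ramp_arrivals(candidates, ramp_seconds=1800)         # assumed 30-minute ramp
    spike = synchronised_arrivals(candidates, window_seconds=60)
    scale_out = 180                                             # assumed boot time, seconds

    print("ramp peak per second: ", peak_per_second(ramp))
    print("spike peak per second:", peak_per_second(spike))
    print("clicked before scale-out (ramp): ", arrived_before(ramp, scale_out))
    print("clicked before scale-out (spike):", arrived_before(spike, scale_out))
```

Under these assumptions the ramp never exceeds a handful of arrivals per second, while the synchronised start averages well over a hundred, and every one of the ten thousand candidates has already clicked before the extra capacity exists. A load test built on the first profile says very little about the second.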
Provision for the worst sixty seconds
The discipline that works is unfashionable: pre-provision for peak and accept the idle cost. For high-stakes delivery, headroom is not waste; it is the product. A few hours of over-provisioned capacity on exam day is trivial against the cost of a mass failure that voids an exam sitting.
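The sizing arithmetic is simple enough to write down. The sketch below is illustrative only: the function names, the six requests per "Begin" click, the 200 requests per second each instance can serve, the 1.5x headroom factor, and the hourly price are all hypothetical placeholders, not measurements from any real platform.

```python
import math


def instances_for_peak(candidates: int, window_seconds: float,
                       requests_per_start: float, per_instance_rps: float,
                       headroom: float = 1.5) -> int:
    """Instances needed to absorb the start-of-exam spike with headroom.

    Sizes for the worst window, not the daily average.
    """
    peak_rps = candidates * requests_per_start / window_seconds
    return math.ceil(peak_rps * headroom / per_instance_rps)


def idle_cost(peak_instances: int, steady_instances: int,
              hours: float, hourly_price: float) -> float:
    """Cost of the capacity that sits mostly idle once the spike has passed."""
    return max(0, peak_instances - steady_instances) * hours * hourly_price


if __name__ == "__main__":
    # All inputs below are assumed values for illustration.
    needed = instances_for_peak(candidates=10_000, window_seconds=60,
                                requests_per_start=6, per_instance_rps=200)
    print("instances for the spike:", needed)
    print("idle cost for the sitting:",
          idle_cost(needed, steady_instances=1, hours=4, hourly_price=0.50))
```

With these placeholder inputs the spike needs eight instances where steady state needs one, and holding the extra seven for a four-hour sitting costs 14.00 in whatever currency the hourly price is quoted. The real numbers will differ, but the shape of the comparison is the point: the idle cost is small and known in advance, while the cost of a voided sitting is neither.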
The other half is making the start itself cheaper. Pre-loading exam content to the candidate's device before the clock starts, so the 9am click reads from something already in hand rather than hammering an origin, turns a stampede into a quiet unlock. Design the expensive moment out of the critical second.
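One way to picture the pre-load is as a sealed package that reaches the device ahead of time, with only a small key released at the start. The sketch below assumes that design; the source does not prescribe it. The function names are hypothetical, and the cipher is a deliberately toy construction (a SHAKE-256 keystream with an HMAC check) standing in for a real authenticated cipher such as AES-GCM from a vetted library. Do not use it as written to protect real exam content.

```python
import hashlib
import hmac
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Toy stream cipher for illustration only: XOR with a SHAKE-256 keystream.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def seal_paper(paper: bytes) -> tuple[bytes, dict]:
    """Run well before the exam: returns (key kept server-side, package sent to devices)."""
    key = secrets.token_bytes(32)
    nonce = secrets.token_bytes(16)
    ciphertext = _keystream_xor(key, nonce, paper)
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    return key, {"nonce": nonce, "ciphertext": ciphertext, "tag": tag}


def unlock_paper(package: dict, key: bytes) -> bytes:
    """Run on the device at start time, once the server releases the key."""
    expected = hmac.new(key, package["nonce"] + package["ciphertext"],
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, package["tag"]):
        raise ValueError("pre-loaded package is corrupt or the key is wrong")
    return _keystream_xor(key, package["nonce"], package["ciphertext"])


if __name__ == "__main__":
    paper = b"Question 1: ...\n" * 5000           # placeholder exam content
    key, package = seal_paper(paper)              # done ahead of the exam
    print("bytes pre-loaded to the device:", len(package["ciphertext"]))
    print("bytes released at the start:   ", len(key))
    assert unlock_paper(package, key) == paper
```

The design choice is that the heavy transfer happens whenever the candidate's connection allows, and the start request only has to carry 32 bytes. The origin still sees ten thousand requests in a minute, but each one is tiny.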
Degrade gracefully, fail honestly
Things will still go wrong at the edges, so decide in advance how the system behaves under stress. A candidate's connection should be able to drop and resume without losing their work. The clock should be authoritative and recoverable, not a number living only in a browser tab. When something fails, it should fail in a way that protects the candidate's attempt rather than discarding it — and it should record what happened, because every incident becomes a dispute later.
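A minimal sketch of those three behaviours together, assuming a server-side attempt record: the deadline is derived from a stored start time rather than a browser countdown, saves carry a sequence number so retries after a reconnect are harmless, and every decision is appended to an event log. The class, field, and method names are hypothetical, and the sketch assumes the clock keeps running during a disconnect; whether to pause it or grant time back is a policy decision the source leaves open.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    candidate_id: str
    started_at: float                                # server timestamp, in seconds
    duration_s: float
    answers: dict = field(default_factory=dict)      # question_id -> (seq, answer)
    events: list = field(default_factory=list)       # (timestamp, description)

    def remaining(self, now: float) -> float:
        """Time left, computed from server state so a lost tab cannot lose the clock."""
        return max(0.0, self.started_at + self.duration_s - now)

    def save_answer(self, now: float, question_id: str, answer: str, seq: int) -> bool:
        """Store an answer unless time is up or a newer save already exists."""
        if self.remaining(now) <= 0:
            self.events.append((now, f"rejected late save for {question_id}"))
            return False
        stored = self.answers.get(question_id)
        if stored is not None and stored[0] >= seq:
            # A retry or out-of-order write after a reconnect: keep what we have.
            self.events.append((now, f"ignored stale save for {question_id}"))
            return False
        self.answers[question_id] = (seq, answer)
        self.events.append((now, f"saved {question_id}"))
        return True

    def resume(self, now: float) -> dict:
        """State handed back to a reconnecting client."""
        self.events.append((now, "resumed"))
        return {
            "remaining_s": self.remaining(now),
            "answers": {q: a for q, (_, a) in self.answers.items()},
        }


if __name__ == "__main__":
    attempt = Attempt("candidate-1", started_at=1000.0, duration_s=3600.0)
    attempt.save_answer(1010.0, "q1", "B", seq=1)
    # The connection drops, then the candidate reconnects five minutes in.
    print(attempt.resume(1300.0))
    for when, what in attempt.events:
        print(when, what)
```

The event log is the part that pays for itself later: when a candidate disputes what happened at 9:04, there is a record to consult rather than a recollection.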
This is ordinary site-reliability discipline, and the canonical references still hold up: Google's Site Reliability Engineering book on error budgets and graceful degradation, and the Reliability pillar of the AWS Well-Architected Framework. The exam context just raises the cost of getting it wrong.
Reliability is an integrity feature
It is tempting to file uptime under "operations" and integrity under "security," as if they were separate concerns. They are not. A platform that buckles on exam day forces rushed extensions, manual workarounds, and exceptions, and every exception is a crack a determined cheat can lever open. Reliability and integrity are the same promise viewed from two angles, which is why we treat them together in our global reliability and infrastructure work, under the same delivery layer as OroLink. The exam has to be fair. It also has to actually load.