Software failures: lessons for testers?
Whether you're setting an alarm on your smartphone or ordering a meal through an app, everything runs on code, and sometimes that code goes haywire!
Software bugs can be benign, but some have cost billions, caused major crises and even endangered lives.
For testers, these failures are valuable lessons never to be forgotten.
In this article, we explore some of the most significant software failures in history, and the lessons to be learned from them.
1. Ariane 5 (1996) - a $370 million bug
On June 4, 1996, the Ariane 5 rocket was launched for the very first time from Kourou, French Guiana.
37 seconds after take-off, it deviated from its trajectory, disintegrated in mid-flight and exploded, causing the loss of over $370 million worth of equipment, all due to a software error.
The guidance software reused code from the previous rocket, Ariane 4, even though Ariane 5's flight conditions were radically different.
When a 64-bit floating-point value was converted to a 16-bit signed integer, it overflowed and raised an unhandled exception, causing the navigation system to lose control.
What's more, the error occurred in a software module that was no longer needed after takeoff, but remained active.
This bug could have been detected during realistic simulations or a thorough code audit. In reality, it was a known flaw, but one that had been deemed unlikely.
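To make the failure mode concrete, here is a minimal Python sketch of that kind of conversion. The real code was written in Ada and the numbers below are purely illustrative; the point is that a value which always fitted comfortably on Ariane 4 no longer fits in a signed 16-bit integer on Ariane 5's faster trajectory.

```python
def to_int16(value: float) -> int:
    """Convert a float to a signed 16-bit integer, as the reused Ariane 4 code assumed was always safe."""
    result = int(value)
    if not -32768 <= result <= 32767:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return result

print(to_int16(1234.5))        # within range: fine on Ariane 4's gentler trajectory

try:
    print(to_int16(65000.0))   # illustrative out-of-range value for Ariane 5
except OverflowError as err:
    # On the real flight, the equivalent exception was never caught, so the
    # guidance computer shut down and the launcher veered off course.
    print("Unhandled on Ariane 5:", err)
```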
The lesson for testers:
Never assume that old code is reliable simply because it worked before. Every piece of reused code must be re-evaluated in the context of the new system.
Testing is not just about validating current functionality, but also assessing the relevance and robustness of legacy code, especially in mission-critical environments.
2. The Y2K bug (the year 2000)
At the end of the 1990s, a seemingly simple problem took on global proportions: date management in computer systems.
For decades, to save memory, developers had often coded years as just two digits (e.g. "99" for 1999).
Many feared that computers would interpret the year 2000 as 1900, leading to massive errors in date calculations, banking systems and navigation software, for example.
Contrary to the prevailing panic, January 1, 2000 was not a day of widespread chaos.
There were no large-scale power failures, no grounded planes, no collapse of banking systems. Yet this doesn't mean that the bug was overestimated, or that it had no consequences.
Hundreds of billions of dollars were invested in prevention, mainly by governments, banks, hospitals and insurance companies. A worldwide campaign to update and test computer systems had been under way for several years.
However, a few bugs were still identified:
- Bank cards issued with expiry dates in "00" were rejected.
- Nuclear monitoring equipment in Japan briefly failed.
- Vending machines and ticketing systems misinterpreted the dates.
- Some software programs displayed the year "19100" due to badly handled conversions (see the sketch below).
Nothing catastrophic, but enough to demonstrate that the risk was real.
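That "19100" display is a textbook example of a hidden assumption. In C, struct tm stores the year as an offset from 1900, and code that built the on-screen year by gluing the string "19" in front of that offset worked for a whole century before breaking. Here is a minimal Python sketch of the same mistake; the function names are invented for the illustration.

```python
# C's struct tm stores tm_year as "years since 1900". Code that displayed the
# year by concatenating the literal "19" with that offset worked from 1900 to
# 1999, then produced "19100" in 2000.
def buggy_year_label(tm_year: int) -> str:
    return "19" + str(tm_year)      # the hidden two-digit assumption

def fixed_year_label(tm_year: int) -> str:
    return str(1900 + tm_year)      # arithmetic instead of string concatenation

print(buggy_year_label(99))    # "1999"  -- looks correct for decades
print(buggy_year_label(100))   # "19100" -- the Y2K display bug
print(fixed_year_label(100))   # "2000"
```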
The lesson for testers:
It's crucial to test for time-related edge cases, and to avoid making implicit assumptions in the code ("we'll never get to the year 2000").
It also shows that preventive testing, even if costly and invisible to the end user, can be decisive in avoiding colossal crises.
3. Heathrow Terminal 5 (2008) - logistical chaos caused by software
On March 27, 2008, London Heathrow Airport inaugurated its new Terminal 5, designed to revolutionize the passenger experience.
From day one, tens of thousands of baggage items were lost, flights were cancelled or delayed, and British Airways' image was severely tarnished, all because of a new automated baggage handling system that had not been sufficiently tested in real-life conditions.
The errors stemmed from a combination of software problems, lack of coordination between the various systems (baggage elevators, conveyor belts, scanners, etc.), and poor staff training.
More than 42,000 bags were lost in just a few days, with losses estimated at several tens of millions of euros.
The lesson for testers:
A system may work perfectly in a test environment, but fail in production.
Testing must therefore include complex, multi-system scenarios, and take human behavior into account.
4. Knight Capital (2012) - $440 million lost in 45 minutes
On August 1, 2012, US high-frequency trading company Knight Capital rolled out new trading software on the financial markets.
Less than an hour later, the company lost over $440 million.
An old test feature, supposedly deactivated, was still active on some servers.
The software automatically sent massive and inconsistent buy and sell orders on hundreds of stocks. The system was unable to detect the anomaly, as no rollback or real-time monitoring mechanism was in place.
The company tried to limit the losses, but the damage was done. The bug caused unusual volatility in the market, all but ruined Knight Capital, and the company was bought out a few months later.
All of this stemmed from a deployment error and the absence of post-release validation tests: the rollout had not been verified on every server, and the errors were not reported in time.
The lesson for testers:
Testing must include the deployment phase itself, not just the functionality. It's essential to validate that all environments are consistent, that legacy features are disabled, and that monitoring tools are active.
A configuration error can sometimes have as much impact as a functional bug, and the slightest deviation can cause a disastrous domino effect.
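To make that lesson concrete, here is a hedged Python sketch of the kind of post-deployment check that can catch this class of error: it verifies that every server reports the expected release and that obsolete feature flags are switched off before live traffic is allowed. The server names, version numbers and flag names are invented for the example; Knight Capital's real systems were of course far more complex.

```python
# Hypothetical post-deployment sanity check: confirm that every server in the
# fleet runs the expected release and that legacy feature flags are disabled
# before it handles live orders. All names and versions below are illustrative.
EXPECTED_VERSION = "8.0.1"
FORBIDDEN_FLAGS = {"legacy_test_feature"}   # obsolete code path that must stay off

def validate_server(server: dict) -> list[str]:
    """Return the problems found in one server's reported configuration."""
    problems = []
    if server["version"] != EXPECTED_VERSION:
        problems.append(f"{server['name']}: runs {server['version']}, expected {EXPECTED_VERSION}")
    still_enabled = FORBIDDEN_FLAGS & set(server["enabled_flags"])
    if still_enabled:
        problems.append(f"{server['name']}: legacy flags still enabled: {sorted(still_enabled)}")
    return problems

fleet = [
    {"name": "srv-1", "version": "8.0.1", "enabled_flags": []},
    {"name": "srv-8", "version": "7.9.0", "enabled_flags": ["legacy_test_feature"]},  # missed by the rollout
]

issues = [problem for server in fleet for problem in validate_server(server)]
if issues:
    raise SystemExit("Deployment check failed:\n" + "\n".join(issues))
print("All servers consistent - safe to go live.")
```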
5. Windows 10 Update (2018) - deleting user files
In October 2018, Microsoft launched an update to Windows 10 that was supposed to improve system stability.
A few days after deployment, thousands of users reported a particularly serious bug. The update deleted personal files in the "Documents" folder, with no warning and no possibility of recovery.
This bug had been reported by testers several weeks before the official launch, but the feedback was not acted on in time and no fix was applied.
The problem stemmed from a conflict between the Known Folder Redirection feature and the handling of duplicate folders, an edge case that was known but poorly covered during testing.
Microsoft temporarily suspended the update and issued a patch, but the damage was done, and user confidence took a hit!
The lesson for testers:
It's not enough to test the "normal" cases. Edge cases, custom configurations and user feedback must be an integral part of the test cycle.
A good QA process must also be able to listen and react quickly.
What should today's testers remember?
1. Testing does not stop at "works as expected"
A piece of software may do exactly what it's asked to do, yet still behave poorly in a real-life context. The tester's role is also to imagine what could go wrong.
2. Testing means anticipating the improbable
Just because a case is unlikely doesn't mean it's not worth testing. Potential impact should guide testing priorities as much as frequency.
3. Communication is an anti-bug weapon
Most major bugs are the result of a lack of communication between teams, between developers and testers, or between the company and its users.
4. Automation tools do not replace human intuition
Tools like Mr Suricate make it possible to automate no-code test scenarios at scale. But a tester's curiosity and ability to ask the right questions remain irreplaceable.
5. Document to avoid repeating mistakes
Every bug discovered is a learning opportunity. Documenting causes, impacts and solutions raises the overall level of quality in the company.
Testing also means learning from failure
At Mr Suricate, we see every day just how much the automation of no-code tests enables development teams to anticipate, detect and correct errors more quickly.
As Benjamin Franklin said: "A penny saved is a penny earned." By that logic, QA testing is an essential component of a company's return on investment.
👉 See the article - ROI and test automation: what savings and what revenue can you generate?
If you'd like to calculate your own ROI and measure the impact of automation on your projects, we offer a free estimate.