Introduction
On the 19th of September 2012 at 20.02 UTC,
Sophos issued an update to its detection rules in the form of an "IDE"
that caused many false alerts on customers’ computers and in
their Sophos management consoles. These false alerts also prevented the
updating of our products and some other software products.
Sophos strives to maintain the industry’s highest standards
in product quality, customer support, and protection for our
customers. This event represented an unacceptable lapse in our quality
and release processes which adversely affected many of our customers
and partners.
This document summarizes how this event occurred and what changes
Sophos has already made—or will make in the near
future—to improve our QA, test, and release processes, and to
make sure this kind of incident does not happen again.
Impact on Customers
Affected customers fell into two broad
categories: 1) Those using the current version of Sophos endpoint
software with current default settings for cleanup and Live Protection,
and 2) Those using Sophos endpoint software with Live Protection not
enabled, or with changes to the default cleanup settings.
Category 1
Customers using Sophos endpoint products with
the current Sophos default settings of “Deny access only” and “Enable
Live Protection” were impacted the least—their Sophos installation and
third-party applications were quickly returned to normal operation. The
primary impact to these customers was the additional work required to
remove large numbers of false positive alerts from event logs and to
answer resulting end user questions as to why false alerts were
appearing.
Category 2
In other customer environments, the false
alerts required remedial work on each endpoint to restore Sophos
software and some third-party application components. For many
customers and partners this resulted in significant extra workload to
remediate the problem and, in some cases, impacted other systems and
software (e.g., Java, Adobe Acrobat, Google Chrome, etc.).
For customers without Live Protection enabled, it was necessary to
enable it, and make some changes to their Sophos Update Manager to
resolve the issue. In some cases, it also was necessary for
administrators to visit each affected endpoint.
Customers who had selected the non-default "Move" or "Delete"
actions in their endpoint cleanup policies experienced a more
significant impact because the Sophos endpoint product ceased to
update itself correctly. Similarly many third-party applications were
also impacted. We provided tools to aid in the recovery process, but
recovery still involved running the tool(s) on each affected endpoint
computer.
Background
The Sophos endpoint protection agent includes a
threat detection engine that is used to scan files and other content to
identify malware, suspicious code, known good code, etc. The engine
uses a large set of nearly 200,000 detection rules to determine what is
identified and what should trigger an alert. To make sure that Sophos is
able to identify the latest malware, exploits and legitimate
applications, the detection rules are updated frequently, usually in
the form of an IDE file that contains multiple rule updates. These
rules make calls to various "operators" within the engine, and if they
trigger a detection, can cause various actions to happen (e.g.,
identify items needed for cleanup, send Live Protection data back to
Sophos, trigger an alert to the end user or console, etc.). The Sophos
threat detection engine is designed to be platform independent so that
all forms of malware can be detected on all main operating system
platforms (Windows, Linux, UNIX, Mac).
Sophos typically releases a new version of the endpoint protection
agent and underlying threat detection engine once per month, and
releases IDE rule updates several times per day to protect customers
from the steady stream of new malware that is introduced. In a given
day, SophosLabs identifies as many as 180,000 new viruses or pieces of
malware. In a typical 24-hour period, Sophos releases six IDEs,
although the number may vary based on the threat urgency or the
overall threat landscape. Sophos carries out extensive testing on
each new version of software, on each update to our detection engine,
and on every rule update (IDE) before releasing them to customers.
As with all software and data updates, there is always a risk of
errors, bugs and defects. Given this, Sophos carries out significant
automated testing and multiple steps of analyst review to provide
system-based and human "safety nets" before any update to rules is
released to customers. Each IDE release, including the IDE involved in
this incident, passes through a twelve-phase test procedure. This
procedure involves hundreds of physical and virtual machines
operating in parallel on multiple platforms (different Windows, Mac
and Linux/UNIX versions) and testing against tens of millions of
files and terabytes of data that are designed to characterize the
entire known database of active threats and legitimate applications.
The tests include automated and human code inspection, validation,
large-scale scanning simulations, production environment testing and
peer reviews.
How this event happened, and steps Sophos is taking to prevent it in the future
The initial cause of the incident was a human
error, in which a Sophos analyst incorrectly coded an update to our
detection rules within an IDE file update that caused false positive
triggers in the Sophos endpoint product for Windows.
Even after the analyst made the error, it should still have been
caught by our twelve-phase test procedure. In this
case, a combination of human error in code review, human error
resulting in incorrect interpretation of test results, and a mismatch
in test environments meant that the faulty IDE was allowed to pass
through to release.
The following sections provide more detail on these factors, grouped
into three main areas:
1. Rule modification
2. Test processes
3. Product resiliency
1. Rule Modification
A modification was made on 19th
September 2012 to our detection rules—specifically, to a rule
that was intended to adjust the volume of Live Protection lookups to
Sophos relating to allow-listed software. The modified rule was added
to the next scheduled IDE update: AGEN-XUV.IDE. This modified rule used
an operator within the threat detection engine which is intended to
identify the detection name.
This threat detection engine operator has a known side effect,
where it also reports the detection itself to the Sophos endpoint
agent. This operator has been commonly used by SophosLabs, typically
as part of writing cleanup rules. The behavior of the operator is
well documented, such that analysts are required to use the operator
only when preceded by a separate and different operator that reports
the detection to the endpoint agent anyway, thereby preventing any
adverse side effects.
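The usage constraint described above lends itself to a mechanical check. The following is a minimal sketch, assuming a rule can be represented as an ordered list of operator names. `SET_NAME` and `REPORT` are hypothetical placeholders, not real engine operators, and "preceded by" is assumed here to mean "anywhere earlier in the same rule".

```python
# Hypothetical operator names: the real engine operators are not named in this document.
NAMING_OP = "SET_NAME"    # identifies the detection name (and, as a side effect, reports it)
REPORTING_OP = "REPORT"   # explicitly reports the detection to the endpoint agent


def find_unguarded_naming_ops(operators):
    """Return the indexes of NAMING_OP calls with no earlier REPORTING_OP call."""
    reported = False
    unguarded = []
    for index, op in enumerate(operators):
        if op == REPORTING_OP:
            reported = True
        elif op == NAMING_OP and not reported:
            unguarded.append(index)
    return unguarded
```

A rule such as `["REPORT", "SET_NAME"]` yields no findings, while a rule that calls `SET_NAME` without an earlier `REPORT` is flagged at the index of the offending call.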
To add further complication, this updated rule also used an
operator that is not supported on certain UNIX variants, so the
analyst included checks to ensure the rule only executed on Windows
environments. Since the rule was designed to identify legitimate
Windows software this restriction was deemed acceptable.
As well as the faulty rule, the new IDE also separately contained
protection against some customer-submitted samples and a critical
Microsoft Internet Explorer vulnerability (see
http://www.sophos.com/en-us/threat-center/threat-analyses/viruses-and-spyware/Exp~20124969-A/detailed-analysis.aspx).
Recommended Actions
Sophos has identified and implemented a
fix to the IDE rule and will soon release an updated version of the
threat detection engine as highlighted below.
| STEPS SOPHOS IS TAKING | STATUS |
| Publish fixed IDE rule to all updating systems (javab-jd.ide). | COMPLETE |
| Release new version of the threat detection engine with a fix for the original side effect. | COMPLETE |
| All SophosLabs analysts to review all known and documented defects and side effects and to include checking for such calls in peer reviews. | COMPLETE |
| Verify/enhance automatic identity checks to include triggering alerts on areas of risk in engine operators and currently known good and bad practice. | COMPLETE |
2. Test Processes
Sophos conducts comprehensive testing on every endpoint product
release. This testing covers the threat detection engine, unit testing,
and system testing across every supported platform and including large
scale production environments. In addition, Sophos conducts
comprehensive tests on each IDE before it is released. Five of the
twelve test phases should have detected the faulty IDE. The test
phases are listed in the table below.
| Sophos Test Phases Performed for Every IDE Release and Update | Should Have Prevented Release of the Faulty IDE |
| Analyst unit test | Yes |
| Peer review | Yes |
| Compiler tests and warnings | No |
| Validation of identity | Yes |
| Positive tests (testing comprehensive malware detection) | No |
| Wild list tests (based on samples published by wildlist.org) | No |
| Detection tests (IDE test) | Yes |
| Performance tests | No |
| Memory tests | No |
| Cleanup removal tests | No |
| False positive tests | Yes |
| Production tests | No |
There were failures within each of these five test phases that, in
combination, meant the IDE was passed through the test procedures
without being rejected. We examine each of these five issues below.
Analyst unit test. The analyst used a newer and
un-released version of the threat detection engine where the operator
side effect had already been fixed and resolved within the engine.
The analyst therefore did not see the incorrect detection reports
during development testing. Analysts typically work on the latest
version of the engine when creating rules to ensure that they are
taking advantage of the latest protection capabilities. Any rules
they create are then tested against all previous versions of the
engine still in deployment, as part of our release testing process.
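The gap described above, where a rule behaves differently on an unreleased engine than on the engines customers actually run, can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Engine` class, the engine names, and the `SET_NAME` and `REPORT` operator names are all hypothetical and do not reflect the real engine or its versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    naming_op_reports_detection: bool  # True where the side effect is still present


def reports_detection(engine, rule_ops):
    """Return True if running the toy rule on this engine reports a detection."""
    for op in rule_ops:
        if op == "REPORT":
            return True
        if op == "SET_NAME" and engine.naming_op_reports_detection:
            return True
    return False


def engines_that_disagree(engines, rule_ops, expected_report):
    """Names of engines where the rule's behavior differs from what the analyst expects."""
    return [e.name for e in engines
            if reports_detection(e, rule_ops) != expected_report]


engines = [Engine("released", True), Engine("unreleased", False)]
# A rule that is not meant to report a detection passes on the unreleased
# engine but misbehaves on the released one.
print(engines_that_disagree(engines, ["SET_NAME"], expected_report=False))
```

Running the same expectation against every engine still in deployment, rather than only the newest build, is what surfaces the discrepancy in this model.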
Peer review. As part of our process, all
modifications and additions to any of our identities, including global
routines, undergo a peer review process where a second, independent
analyst reviews the changes. This particular erroneous use of the engine
operator and the implications of the engine defect together with the
Windows check should have been identified, but due to human error they
were not.
Validation of identity. As part of the analyst
workflow, identities are submitted to Sophos’ source control
system that tracks and stores all identities, rules and revisions.
This is known as the identity database (IDB). As part of this
submission process, new rules and identities are validated and errors
and warnings are generated. The validation check is designed to
identify errors in programming code syntax, misuse of operators,
inclusion of known errors, and other simple, easily defined errors.
Because the operator was valid and effective when used correctly, its
misuse was not included in the database of critical errors, and thus
was not flagged as a critical error.
Detection tests (IDE test). The IDE test phase
is intended to verify that all the appropriate files are detected by
our released threat detection engine and rule data, with the release
candidate IDE added. This is a set of 31 tests distributed
over a large number of virtual machine nodes.
On this test pass, several of the individual node machines failed
(not the test itself but the virtual machine), and the results of
these tests therefore included a critical error. Because the IDE
release was deemed urgent (fixing both urgent customer-submitted
samples as well as the critical Microsoft Internet Explorer
vulnerability), and because the analyst believed the virtual machine
node failures were due to instability in the underlying test system
hardware or virtual machine environment rather than the actual test
itself, the critical errors related to the node failures were
overlooked. In the test results, the node failure notification
appeared first, but had the analyst looked carefully at results
further down the error notification screen, he or she would have seen
the error messages that flagged an "Shh" identity. Critically, the
analyst did not further research the cause of the critical error or
re-run the test.
False positive tests. In parallel to the IDE
test, Sophos conducts a false positive test on any IDE release
candidate. The false positive test environment (or "rig") consists of
a very large number of parallel systems. These systems use our most
recently released threat detection engine and rules, with the release
candidate IDE added, to scan more than 10 million "good" files and
terabytes of data. The set of test files is regularly updated and
includes all Microsoft operating system files, many popular
applications (such as Java, Adobe Acrobat, and Google Chrome), a large number of
business applications, and all current and previous releases of
Sophos products.
The threat detection engine is compiled on and supported on
multiple platforms including Windows and many Linux/UNIX variants.
Because the test is designed to be comprehensive and because there is
such a huge data set, the false positive tests are executed on Linux
servers. The vast majority of Sophos rules and identities are
designed to be cross-platform and run identically across multiple
operating systems, including Linux, Windows, Mac OS, and UNIX. The
core purpose of the false positive test was to identify false
positives, not to confirm cross-platform operability of the IDE. This
rule was a rare example of one that was written by the analyst to
operate only in Windows environments. Since the false positive rig
operates only on Linux servers, the tests did not flag the Shh/ false
positives because the rule with the underlying error was specifically
flagged for Windows only.
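The mismatch described above can be expressed as a simple set comparison between the platforms a rule is restricted to and the platforms the test rig runs on. The sketch below is illustrative only: the platform labels and function names are assumptions, not part of any Sophos tooling.

```python
def rule_is_exercised(rule_platforms, rig_platforms):
    """True if the rig runs on at least one platform where the rule executes."""
    return bool(set(rule_platforms) & set(rig_platforms))


def uncovered_platforms(rule_platforms, rig_platforms):
    """Platforms where the rule executes but the rig never tests it."""
    return set(rule_platforms) - set(rig_platforms)


# A Windows-only rule on a Linux-only rig is never executed, so the rig
# cannot observe any false positives it would cause.
print(rule_is_exercised({"windows"}, {"linux"}))    # False
print(uncovered_platforms({"windows"}, {"linux"}))  # {'windows'}
```

A release gate built on this idea would treat a non-empty result from `uncovered_platforms` as a reason to withhold a pass, since a clean false positive run says nothing about platforms the rig did not cover.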
As stated above, if any one of these five issues had not occurred,
then the IDE would have failed one of the test phases and would not
have been released.
Recommended Actions
To make sure this series of events does
not happen again, Sophos has taken or will soon implement the following
actions.
| STEPS SOPHOS IS TAKING | STATUS |
| 1. As policy, re-run all IDE tests in the event of any critical error or system failure. When an IDE needs rebuilding, remove not only the offending identity but also non-urgent items. On a re-spin, include only urgent items, thus reducing risk of a second round of failures. No test can be deemed passed if there are ANY system, tool or test failures. | COMPLETE |
| 2. Implement a false positive test environment that matches platforms appropriately, including Windows and Linux platforms, to allow the false positive test environment to also serve as a platform test environment. | COMPLETE |
| 3. Enhance and extend the false positive test system in terms of scale, platform coverage and resiliency. | Partially Complete: final phase end March 2013 |
| 4. Introduce separate release cycles for urgent identity updates and more general procedure/rule changes. General procedure/rule changes are to be subject to additional test procedures and a longer release cycle. | COMPLETE |
| 5. All general procedure/rule changes are to go through extended additional testing in production environments before being released to customers. | IN PROGRESS |
| 6. Formalize the goals, peer review inspections, and entry and exit criteria of each phase. Document process and requirements differentiating between review and "inspection" and conduct regular training and education of all SophosLabs staff, whether new or veteran employees. | COMPLETE |
| 7. Verify and enhance identity validation checks to include a broader database of coding errors, and to cover rules that, even if they don’t generate failure, may introduce additional risk of triggering engine side effects or defects. | COMPLETE |
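Item 1 above states that no test can be deemed passed if there is any system, tool or test failure. A minimal sketch of such a gate follows, assuming each test node reports whether it completed and which errors it raised. The `NodeResult` structure and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeResult:
    node: str
    completed: bool                                   # False if the virtual machine itself failed
    errors: List[str] = field(default_factory=list)   # errors raised by the tests on this node


def blocking_reasons(results):
    """Collect every node failure and every test error; none may be ignored."""
    reasons = []
    for r in results:
        if not r.completed:
            reasons.append(f"{r.node}: node failed")
        reasons.extend(f"{r.node}: {e}" for e in r.errors)
    return reasons


def run_passed(results):
    """A run passes only if it has results and nothing at all went wrong."""
    return bool(results) and not blocking_reasons(results)
```

Listing infrastructure failures and test errors together means a node failure at the top of the report cannot mask a genuine test error further down, and either kind alone is enough to block the release.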
3. Product Resiliency
The Sophos endpoint product consists of an
agent with a system of automatic updating that can be configured to
update redundantly from central installation directories via the Sophos
Update Manager and/or directly from Sophos via HTTP.
The component of the endpoint agent that performs updating is called
Sophos AutoUpdate. The updating system is completely independent of the
Sophos Remote Management System (RMS), which can be used to send
configuration policy changes from the management console and receive
centralized reporting and event collection. There is a third
independent communication path from each Sophos endpoint agent via live
lookups to Sophos when the "Live Protection" configuration is set to
"Enabled." Together these three systems not only ensure that Sophos can
provide regular, continually updated protection against the latest and
most current malware threats, but also provide redundancy to enable
recovery from problems. Live Protection has been included in all
versions of the Sophos endpoint product since version 9.5, introduced in
June 2010, and is enabled by default.
In this case, for customers with Live Protection enabled, this
independent lookup process marked Sophos Update Manager and Sophos
AutoUpdate as clean to override the false positive in the locally
deployed rules, so that Sophos Update Manager and the endpoint agent
(and any other affected application) could return to normal operation.
For customers without Live Protection enabled, the false positive
continued to prevent Sophos Update Manager and Sophos AutoUpdate from
executing correctly, meaning that those environments could not
download the fixed IDE. Although the Sophos endpoint agent itself
continued to scan and protect endpoint computers, remedial action had
to be taken to allow the agents to continue getting updates from
Sophos. For many affected customers, this meant making configuration
changes to the Update Manager to enable it to continue updating. In
most cases, deploying these configuration changes in conjunction with
a policy change to enable Live Protection, and delivering via the
independent (redundant) RMS system to endpoint agents, would allow
full functionality of both Sophos and third-party applications to be
restored.
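The difference between the two groups of customers can be summarized as a small decision function. This is a simplified model of the behavior described above, not the product's actual logic: the function name, its parameters, and the verdict strings are assumptions made for illustration.

```python
from typing import Callable, Optional


def final_verdict(
    path: str,
    local_detection: Optional[str],
    live_protection_enabled: bool,
    live_lookup: Callable[[str], str],
) -> str:
    """Combine a local rule verdict with an optional live lookup."""
    if local_detection is None:
        return "clean"
    # With Live Protection enabled, a "clean" answer from the independent
    # lookup overrides a detection raised by the locally deployed rules.
    if live_protection_enabled and live_lookup(path) == "clean":
        return "clean"
    return local_detection


def always_clean(path: str) -> str:
    """Stand-in for a live lookup that knows the file is legitimate."""
    return "clean"
```

In this model, `final_verdict("updater.exe", "false-positive", True, always_clean)` returns `"clean"`, whereas the same call with Live Protection disabled returns the local detection, which is why those environments needed manual remediation.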
The vast majority of Sophos customers either had Live Protection
enabled or had selected the default cleanup setting for the Sophos
endpoint product to “deny access” to files suspected of being malware,
rather than the optional settings of “move” or “delete”.
In some cases, especially when customers had non-default policies
that moved or deleted files detected with the false positive,
customers also had to repair the Sophos software on each endpoint
computer to get automatic updating working again. Sophos engineers
quickly developed and introduced new tools to automate much of this
process and made them available on the company’s website.
Sophos is presently investigating a number of changes that could
be made to Sophos endpoint products that would mitigate the immediate
impact of such a defect and make recovery from such an incident more
automated and manageable for customers and partners.
Recommended Actions
The table below includes some of these
product enhancements that Sophos is presently considering.
| STEPS SOPHOS IS TAKING | STATUS |
| Assess the possibility to remove the "Delete" function, and replace with a robust recoverable quarantine functionality. | IN PROGRESS |
| Enhance messaging within the product about Live Protection and encourage existing and future customers to enable it. | IN PROGRESS |
| Improve resilience of Sophos updating processes. | IN PROGRESS |
| Improve the self-protection capabilities of the Sophos endpoint agent, to include updating. | IN PROGRESS |
INCIDENT RESPONSE: Additional Lessons Learned
While customers and partners generally regarded Sophos as forthcoming in our
communication of the incident and highly responsive in helping affected
customers recover, there are three significant areas in which our
response could have been substantially improved.
1. Telecommunications capacity at support call centers
Upon discovering the incident, Sophos transferred all available
technical staff to handling support calls over extended hours in our
major call centers in Abingdon (UK), Boston, Wiesbaden (Germany),
Madrid, Milan, Paris, Sydney, Yokohama, Vancouver, and other cities
around the globe. Call volumes within the first 48 hours of the
incident were running at well over 10 times normal levels. This
created long hold times for customers and in some cases exceeded the
capacity of our phone switches. As a result, customers received busy
signals and did not enter our phone queue. This situation was
exacerbated in our Boston call center by a fault in the phone system.
For a three-day period the inbound phone capacity in Boston was
reduced by 50%. Sophos endeavors to provide the most responsive
customer support in the IT security industry, with customers
typically waiting less than three minutes before reaching a qualified
technical support agent. During this incident, Sophos was not able to
deliver on our target response and wait times for our customers. And
while we are making a host of additional efforts to make sure that
such a disruptive event never happens again, we are also taking steps
to improve our support infrastructure should a large-scale issue of
any kind affect our customers.
Recommended Actions
| STEPS SOPHOS IS TAKING | STATUS |
| Put in place a third-party overflow/call failover supplier to handle sudden peaks in support calls. | COMPLETE IN UK AND NORTH AMERICA |
| Implement an emergency incident response procedure to immediately change workflows for inbound call handling including IVR messages, call flows and call-backs. | IN PROGRESS |
| Work with telco service providers to improve resiliency and redundancy of telco and PBX systems to protect against infrastructure failure during spikes in phone traffic. | IN PROGRESS |
2. Proactive notification system
Sophos announced the false
positive issue on Twitter and added a knowledgebase article within an
hour of the incident occurring. We publicized the incident extensively
via press and social media outlets within 2.5 hours, once the extent of
the impact had become clear. However, it took up to 19 hours for all
Sophos customers to be notified by email.
Recommended Actions
| STEPS SOPHOS IS TAKING | STATUS |
| Change Sophos emergency incident response procedure to be able to initiate customer communications 24/7. | COMPLETE |
| Develop and regularly test automated email communication systems and contact databases to accelerate delivery of emergency email notifications. | COMPLETE |
| Identify alternative non-email based methods of communicating such updates to customers, for example SMS, console alerts and RSS. | IN PROGRESS |
3. Knowledgebase article
Early versions of the Sophos
knowledgebase article (KBA) were difficult to follow for many
customers. Additionally, Sophos’ publishing system experienced
inconsistent results when pushing out improved revisions of the KBA.
This resulted in outdated or (at times) blank page results during the
first 48 hours, when the Sophos website experienced peak load that was
25 times our normal level of page views across our entire website.
Recommended Actions
| STEPS SOPHOS IS TAKING | STATUS |
| Change Sophos emergency incident response procedure to immediately allocate a usability/workflow owner for the main KBAs. | COMPLETE |
| Build knowledgebase article template for emergency incidents including elements such as screen shots, checklists to help partners and customers identify whether or not they have been affected, and if so how serious the impact might be. | COMPLETE |
| Review and change KBA publishing process to improve our ability to publish quickly, including in localized languages, during emergency response situations. | COMPLETE |
Additional Recommendations
This incident has made it clear that Sophos
should proactively provide more prescriptive advice on the most
appropriate policy settings for our customers and partners to use when
deploying and managing our products. While the Sophos default settings
for new installations would have prevented this incident from causing
disruption beyond the alerts, some customers had not enabled Live
Protection, and other customers had chosen to change the default
action for when cleanup is not available to "Move" or "Delete."
We have published knowledgebase article 114345, which provides our
current recommended policy. This includes, among other recommendations,
enabling Live Protection and setting the action for when cleanup is
not available to "Deny access only."
Conclusion
Since Sophos was founded in 1985, our clients and
partners have grown to expect industry-leading product quality and
best-in-class customer support. We must recommit ourselves to making
regular and concrete improvements to our systems, processes, and
organization, including the ones noted in this document, to make sure an
event like this never happens again. In parallel, we must provide an
improved response to any critical security events, regardless of their
source. Nothing is more important to us than earning our
customers’ trust to protect their organizations. Sophos will learn
from this incident and will emerge an even stronger company, delivering
even greater value to our partners and customers in a more reliable and
responsive fashion.