Detailed root cause analysis of the Shh/Updater-B false positive incident

October 4, 2012

Introduction

On the 19th of September 2012 at 20.02 UTC, Sophos issued an update to its detection rules in the form of an "IDE" that caused many false alerts on customers’ computers and in their Sophos management consoles. These false alerts also prevented the updating of our products and some other software products.

Sophos strives to maintain the industry’s highest standards in product quality, customer support, and protection for our customers. This event represented an unacceptable lapse in our quality and release processes which adversely affected many of our customers and partners.

This document summarizes how this event occurred and what changes Sophos has already made—or will make in the near future—to improve our QA, test, and release processes, and to make sure this kind of incident does not happen again.

Impact on Customers

Affected customers fell into two broad categories: 1) Those using the current version of Sophos endpoint software with current default settings for cleanup and Live Protection, and 2) Those using Sophos endpoint software with Live Protection not enabled, or with changes to the default cleanup settings.

Category 1

Customers using Sophos endpoint products with the current Sophos default settings of “Deny access only” and “Enable Live Protection” were impacted the least—their Sophos installation and third-party applications were quickly returned to normal operation. The primary impact to these customers was the additional work required to remove large numbers of false positive alerts from event logs and to answer resulting end user questions as to why false alerts were appearing.

Category 2

In other customer environments, the false alerts required remedial work on each endpoint to restore Sophos software and some third-party application components. For many customers and partners this resulted in significant extra workload to remediate the problem and, in some cases, impacted other systems and software (e.g., Java, Adobe Acrobat, Google Chrome, etc.).

For customers without Live Protection enabled, it was necessary to enable it, and make some changes to their Sophos Update Manager to resolve the issue. In some cases, it also was necessary for administrators to visit each affected endpoint.

Customers who had selected the non-default "Move" or "Delete" actions in their endpoint cleanup policies experienced a more significant impact because the Sophos endpoint product ceased to update itself correctly. Similarly, many third-party applications were also impacted. We provided tools to aid in the recovery process, but recovery still involved running the tool(s) on each affected endpoint computer.

Background

The Sophos endpoint protection agent includes a threat detection engine that is used to scan files and other content to identify malware, suspicious code, known good code, etc. The engine uses a large set of nearly 200,000 detection rules to determine what is identified and what should trigger an alert. To make sure that Sophos is able to identify the latest malware, exploits and legitimate applications, the detection rules are updated frequently, usually in the form of an IDE file that contains multiple rule updates. These rules make calls to various "operators" within the engine, and if they trigger a detection, can cause various actions to happen (e.g., identify items needed for cleanup, send Live Protection data back to Sophos, trigger an alert to the end user or console, etc.). The Sophos threat detection engine is designed to be platform independent so that all forms of malware can be detected on all main operating system platforms (Windows, Linux, UNIX, Mac).

Sophos typically releases a new version of the endpoint protection agent and underlying threat detection engine once per month, and releases IDE rule updates several times per day to protect customers from the steady stream of new malware that is introduced. In a given day, SophosLabs identifies as many as 180,000 new viruses or pieces of malware. In a typical 24-hour period, Sophos releases six IDEs, although the number may vary based on the threat urgency or the overall threat landscape. Sophos carries out extensive testing on each new version of software, on each update to our detection engine, and on every rule update (IDE) before releasing them to customers.

As with all software and data updates, there is always a risk of errors, bugs and defects. Given this, Sophos carries out significant automated testing and multiple steps of analyst review to provide system-based and human "safety nets" before any update to rules is released to customers. Each IDE release, including the IDE involved in this incident, passes through a twelve-phase test procedure. This procedure involves hundreds of physical and virtual machines operating in parallel on multiple platforms (different Windows, Mac and Linux/UNIX versions) and testing against tens of millions of files and terabytes of data that are designed to characterize the entire known database of active threats and legitimate applications. The tests include automated and human code inspection, validation, large-scale scanning simulations, production environment testing and peer reviews.

How this event happened, and steps Sophos is taking to prevent it in the future

The initial cause of the incident was a human error, in which a Sophos analyst incorrectly coded an update to our detection rules within an IDE file update that caused false positive triggers in the Sophos endpoint product for Windows.

Once the analyst made the error, the error still should have been identified, or caught, by our twelve-phase test procedure. In this case, a combination of human error in code review, human error resulting in incorrect interpretation of test results, and a mismatch in test environments meant that the faulty IDE was allowed to pass through to release.

The following sections provide more detail into these factors and are categorized into three main areas:

1. Rule modification
2. Test processes
3. Product resiliency

1. Rule Modification

A modification was made on 19th September 2012 to our detection rules—specifically, to a rule that was intended to adjust the volume of Live Protection lookups to Sophos relating to allow-listed software. The modified rule was added to the next scheduled IDE update: AGEN-XUV.IDE. This modified rule used an operator within the threat detection engine which is intended to identify the detection name.

This threat detection engine operator has a known side effect, where it also reports the detection itself to the Sophos endpoint agent. This operator has been commonly used by SophosLabs, typically as part of writing cleanup rules. The behavior of the operator is well documented, such that analysts are required to use the operator only when preceded by a separate and different operator that reports the detection to the endpoint agent anyway, thereby preventing any adverse side effects.
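The usage convention described above lends itself to an automated check. The following is a minimal sketch in Python, not Sophos' actual validation tooling. The operator names `get_detection_name` and `report_detection`, and the representation of a rule as an ordered list of operator names, are hypothetical stand-ins, since the real operators are not named in this document. The sketch also assumes that "preceded by" means the reporting operator appears anywhere earlier in the same rule.

```python
from typing import List

# Hypothetical operator names; the real engine operators are not named in this report.
NAME_OPERATOR = "get_detection_name"   # side effect: also reports a detection
REPORT_OPERATOR = "report_detection"   # deliberately reports a detection


def find_unguarded_uses(rule_ops: List[str]) -> List[int]:
    """Return the positions where the name operator is used before any
    reporting operator has appeared in the rule (assumed reading of the
    documented safe-usage convention)."""
    problems = []
    reported = False
    for index, op in enumerate(rule_ops):
        if op == REPORT_OPERATOR:
            reported = True
        elif op == NAME_OPERATOR and not reported:
            problems.append(index)
    return problems
```

Under these assumptions, a rule such as `["check_platform", "get_detection_name"]` would be flagged at position 1, while `["report_detection", "get_detection_name"]` would pass.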

To add further complication, this updated rule also used an operator that is not supported on certain UNIX variants, so the analyst included checks to ensure the rule only executed on Windows environments. Since the rule was designed to identify legitimate Windows software this restriction was deemed acceptable.

As well as the faulty rule, the new IDE also separately contained protection against some customer-submitted samples and a critical Microsoft Internet Explorer vulnerability (see http://www.sophos.com/en-us/threat-center/threat-analyses/viruses-and-spyware/Exp~20124969-A/detailed-analysis.aspx).

Recommended Actions

Sophos has identified and implemented a fix to the IDE rule and will soon release an updated version of the threat detection engine as highlighted below.

STEPS SOPHOS IS TAKING | STATUS
Publish fixed IDE rule to all updating systems (javab-jd.ide). | COMPLETE
Release new version of the threat detection engine with a fix for the original side effect. | COMPLETE
All SophosLabs analysts to review all known and documented defects and side effects and to include checking for such calls in peer reviews. | COMPLETE
Verify/enhance automatic identity checks to include triggering alerts on areas of risk in engine operators and currently known good and bad practice. | COMPLETE

2. Test Processes

Sophos conducts comprehensive testing on every endpoint product release. This testing covers the threat detection engine, unit testing, and system testing across every supported platform and including large scale production environments. In addition, Sophos conducts comprehensive tests on each IDE before it is released. Five phases in the twelve-phase testing should have detected the faulty IDE. The test phases are listed in the table below.

Sophos Test Phases Performed for Every IDE Release and Update
  • Analyst unit test
  • Peer review
  • Compiler tests and warning
  • Validation of identity
  • Positive tests (testing comprehensive malware detection)
  • Wild list tests (based on samples published by wildlist.org)
  • Detection tests (IDE test)
  • Performance tests
  • Memory tests
  • Cleanup removal tests
  • False positive tests
  • Production tests

Sophos Test Phases That Should Have Prevented Release of the Faulty IDE
  • Analyst unit test
  • Peer review
  • Validation of identity
  • Detection tests (IDE test)
  • False positive tests

There were failures within each of these five test phases that, in combination, meant the IDE was passed through the test procedures without being rejected. We examine each of these five issues below.

Analyst unit test. The analyst used a newer and un-released version of the threat detection engine where the operator side effect had already been fixed and resolved within the engine. The analyst therefore did not see the incorrect detection reports during development testing. Analysts typically work on the latest version of the engine when creating rules to ensure that they are taking advantage of the latest protection capabilities. Any rules they create are then tested against all previous versions of the engine still in deployment, as part of our release testing process.
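The gap between the analyst's development engine and the engines in deployment can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of Sophos' test harness: the `EngineVersion` type, the version labels, and the one-spurious-report behavior are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EngineVersion:
    name: str                 # hypothetical label, not a real version number
    side_effect_fixed: bool   # True if the name operator no longer reports a detection


def spurious_reports(engine: EngineVersion, rule_is_unguarded: bool) -> int:
    """Toy model: an unguarded name operator yields one spurious report on
    engines that still have the side effect, and none on fixed engines."""
    if rule_is_unguarded and not engine.side_effect_fixed:
        return 1
    return 0


def failing_engines(engines: List[EngineVersion], rule_is_unguarded: bool) -> List[str]:
    """Names of the engine versions on which the rule misbehaves."""
    return [e.name for e in engines if spurious_reports(e, rule_is_unguarded) > 0]


released = EngineVersion("released", side_effect_fixed=False)
development = EngineVersion("unreleased-dev", side_effect_fixed=True)

# Testing only on the development engine hides the problem.
print(failing_engines([development], rule_is_unguarded=True))            # prints []
# Testing on every engine still in deployment exposes it.
print(failing_engines([released, development], rule_is_unguarded=True))  # prints ['released']
```

The point of the sketch is only that a clean result on the newest engine says nothing about older engines, which is why the later release-test phases mattered.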

Peer review. As part of our process, all modifications and additions to any of our identities, including global routines, undergo a peer review process where a second, independent analyst reviews the changes. This particular erroneous use of the engine operator and the implications of the engine defect together with the Windows check should have been identified, but due to human error they were not.

Validation of identity. As part of the analyst workflow, identities are submitted to Sophos’ source control system that tracks and stores all identities, rules and revisions. This is known as the identity database (IDB). As part of this submission process, new rules and identities are validated and errors and warnings are generated. The validation check is designed to identify errors in programming code syntax, misuse of operators, inclusion of known errors, and other simple, easily defined errors. Because the operator in question is valid when used correctly, this particular misuse was not included in the database of critical errors, and thus was not flagged as a critical error.

Detection tests (IDE test). The IDE test phase is intended to verify that all the appropriate files are detected by our released threat detection engine and rule data, with the release candidate IDE added. This is a set of 31 tests distributed over a large number of virtual machine nodes.

On this test pass, several of the individual node machines failed (not the test itself but the virtual machine), and the results of these tests therefore included a critical error. Because the IDE release was deemed urgent (fixing both urgent customer-submitted samples as well as the critical Microsoft Internet Explorer vulnerability), and because the analyst believed the virtual machine node failures were due to instability in the underlying test system hardware or virtual machine environment rather than the actual test itself, the critical errors related to the node failures were overlooked. In the test results, the node failure notification appeared first, but had the analyst looked carefully at results further down the error notification screen, he or she would have seen the error messages that flagged an "Shh" identity. Critically, the analyst did not further research the cause of the critical error or re-run the test.
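A release gate that cannot be satisfied while any node has failed, and that surfaces every error rather than only the first, might look like the following sketch. The `NodeResult` structure, the node names, and the result fields are hypothetical. They are not taken from Sophos' test system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeResult:
    node: str
    completed: bool                                   # False if the VM or node itself failed
    errors: List[str] = field(default_factory=list)   # errors raised by the test on this node


def gate(results: List[NodeResult]) -> Dict[str, object]:
    """Pass only if every node completed and no node reported an error.
    All errors are collected so later entries are not hidden behind the
    first notification on the results screen."""
    node_failures = [r.node for r in results if not r.completed]
    all_errors = [f"{r.node}: {e}" for r in results for e in r.errors]
    return {
        "passed": not node_failures and not all_errors,
        "rerun_required": bool(node_failures),
        "node_failures": node_failures,
        "errors": all_errors,
    }
```

With a gate of this shape, a run containing one crashed node and one node reporting an unexpected identity detection would return `passed` as false and `rerun_required` as true, and both problems would appear in the summary.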

False positive tests. In parallel to the IDE test, Sophos conducts a false positive test on any IDE release candidate. The false positive test environment (or "rig") consists of a very large number of parallel systems. These systems use our most recently released threat detection engine and rules, with the release candidate IDE added, to scan more than 10 million "good" files and terabytes of data. The set of test files is regularly updated and includes all Microsoft operating system files, many popular applications (such as Java, Adobe, and Google Maps), a large number of business applications, and all current and previous releases of Sophos products.

The threat detection engine is compiled on and supported on multiple platforms including Windows and many Linux/UNIX variants. Because the test is designed to be comprehensive and because there is such a huge data set, the false positive tests are executed on Linux servers. The vast majority of Sophos rules and identities are designed to be cross-platform and run identically across multiple operating systems, including Linux, Windows, Mac OS, and UNIX. The core purpose of the false positive test was to identify false positives, not to confirm cross-platform operability of the IDE. This rule was a rare example of one that was written by the analyst to operate only in Windows environments. Since the false positive rig operates only on Linux servers, the tests did not flag the Shh/ false positives because the rule with the underlying error was specifically flagged for Windows only.
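The coverage gap described above can be expressed as a simple set comparison between the platforms a rule is restricted to and the platforms the test rig runs on. The sketch below is illustrative only. The rule names and the platform labels are invented, and the real rule metadata format is not described in this document.

```python
from typing import Dict, List, Set


def uncovered_rules(rule_platforms: Dict[str, Set[str]], rig_platforms: Set[str]) -> List[str]:
    """Return the rules that cannot execute on any platform the rig runs on,
    and which the rig therefore cannot exercise at all."""
    return sorted(
        rule
        for rule, platforms in rule_platforms.items()
        if not (platforms & rig_platforms)
    )


# Hypothetical rule metadata for illustration.
rules = {
    "cross_platform_rule": {"windows", "linux", "mac", "unix"},
    "windows_only_rule": {"windows"},
}

print(uncovered_rules(rules, {"linux"}))             # prints ['windows_only_rule']
print(uncovered_rules(rules, {"linux", "windows"}))  # prints []
```

A check of this kind does not find false positives itself. It only reports which rules a given rig is blind to, so that they can be routed to a rig that matches their platform.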

As stated above, if any one of these five issues had not occurred, then the IDE would have failed one of the test phases and would not have been released.

Recommended Actions

To make sure this series of events does not happen again, Sophos has taken or will soon implement the following actions.

STEPS SOPHOS IS TAKING | STATUS
1. As policy, re-run all IDE tests in the event of any critical error or system failure. When an IDE needs rebuilding, remove not only the offending identity but non-urgent items. On a re-spin, include only urgent items, thus reducing risk of a second round of failures. No test can be deemed passed if there are ANY system, tool or test failures. | COMPLETE
2. Implement a false positive test environment that matches platforms appropriately, including Windows and Linux platforms, to allow the false positive test environment to also serve as a platform test environment. | COMPLETE
3. Enhance and extend the false positive test system in terms of scale, platform coverage and resiliency. | Partially Complete: final phase end March 2013
4. Introduce separate release cycles for urgent identity updates and more general procedure/rule changes. General procedure/rule changes are to be subject to additional test procedures and a longer release cycle. | COMPLETE
5. All general procedure/rule changes are to go through extended additional testing in production environments before being released to customers. | IN PROGRESS
6. Formalize the goals, peer review inspections, and entry and exit criteria of each phase. Document process and requirements differentiating between review and "inspection" and conduct regular training and education of all SophosLabs staff, whether new or veteran employees. | COMPLETE
7. Verify and enhance identity validation checks to include a broader database of coding errors, and to cover rules that, even if they don’t generate failure, may introduce additional risk of triggering engine side effects or defects. | COMPLETE

3. Product Resiliency

The Sophos endpoint product consists of an agent with a system of automatic updating that can be configured to update redundantly from central installation directories via the Sophos Update Manager and/or directly from Sophos via http.

The component of the endpoint agent that performs updating is called Sophos AutoUpdate. The updating system is completely independent of the Sophos Remote Management System (RMS), which can be used to send configuration policy changes from the management console and receive centralized reporting and event collection. There is a third independent communication path from each Sophos endpoint agent via live lookups to Sophos when the "Live Protection" configuration is set to "Enabled." Together these three systems not only ensure that Sophos can provide regular, continually updated protection against the latest malware threats, but also provide redundancy to enable recovery from problems. Live Protection has been included in all versions of the Sophos endpoint product since version 9.5, introduced in June 2010, and is enabled by default.

In this case, for customers with Live Protection enabled, this independent lookup process marked Sophos Update Manager and Sophos AutoUpdate as clean to override the false positive in the locally deployed rules, so that Sophos Update Manager and the endpoint agent (and any other affected application) could return to normal operation.
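The override behavior described above can be summarized as a small decision function. This is a simplified sketch, not the actual Live Protection protocol: the `Verdict` type, the use of a file hash as the lookup key, and the `cloud_lookup` callable are all assumptions made for illustration.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    CLEAN = "clean"
    DETECTED = "detected"


def final_verdict(
    local: Verdict,
    live_protection_enabled: bool,
    cloud_lookup: Callable[[str], Optional[Verdict]],  # hypothetical lookup, None if no answer
    file_hash: str,
) -> Verdict:
    """If the local rules detect a file and Live Protection is enabled, a
    CLEAN answer from the live lookup overrides the local detection.
    In every other case the local verdict stands."""
    if local is Verdict.DETECTED and live_protection_enabled:
        if cloud_lookup(file_hash) is Verdict.CLEAN:
            return Verdict.CLEAN
    return local
```

In this model, an endpoint with Live Protection disabled never consults the lookup, so a faulty local rule keeps producing the detection until the rule itself is replaced.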

For customers without Live Protection enabled, the false positive continued to prevent Sophos Update Manager and Sophos AutoUpdate from executing correctly, meaning that those environments could not download the fixed IDE. Although the Sophos endpoint agent itself continued to scan and protect endpoint computers, remedial action had to be taken to allow the agents to continue getting updates from Sophos. For many affected customers, this meant making configuration changes to the Update Manager to enable it to continue updating. In most cases, deploying these configuration changes in conjunction with a policy change to enable Live Protection, and delivering via the independent (redundant) RMS system to endpoint agents, would allow full functionality of both Sophos and third-party applications to be restored.

The vast majority of Sophos customers either had Live Protection enabled or had selected the default cleanup setting for the Sophos endpoint product to “deny access” to files suspected of being malware, rather than the optional settings of “move” or “delete”.

In some cases, especially when customers had non-default policies that moved or deleted files detected with the false positive, customers also had to repair the Sophos software on each endpoint computer to get automatic updating working again. Sophos engineers quickly developed and introduced new tools to automate much of this process and made them available on the company’s website.

Sophos is presently investigating a number of changes that could be made to Sophos endpoint products that would mitigate the immediate impact of such a defect and make recovery from such an incident more automated and manageable for customers and partners.

Recommended Actions

The table below includes some of these product enhancements that Sophos is presently considering.

STEPS SOPHOS IS TAKING | STATUS
Assess the possibility to remove the "Delete" function, and replace with a robust recoverable quarantine functionality. | IN PROGRESS
Enhance messaging within the product about Live Protection and encourage existing and future customers to enable it. | IN PROGRESS
Improve resilience of Sophos updating processes. | IN PROGRESS
Improve the self-protection capabilities of the Sophos endpoint agent, to include updating. | IN PROGRESS

INCIDENT RESPONSE: Additional Lessons Learned

While customers and partners generally regarded Sophos as forthcoming in our communication of the incident and highly responsive in helping affected customers recover, there are three significant areas in which our response could have been substantially improved.

1. Telecommunications capacity at support call centers

Upon discovering the incident, Sophos transferred all available technical staff to handling support calls over extended hours in our major call centers in Abingdon (UK), Boston, Wiesbaden (Germany), Madrid, Milan, Paris, Sydney, Yokohama, Vancouver, and other cities around the globe. Call volumes within the first 48 hours of the incident were running at well over 10 times normal levels. This created long hold times for customers and in some cases exceeded the capacity of our phone switches. As a result, some customers received busy signals and did not enter our phone queue. This situation was exacerbated in our Boston call center by a fault in the phone system. For a three-day period the inbound phone capacity in Boston was reduced by 50%. Sophos endeavors to provide the most responsive customer support in the IT security industry, with customers typically waiting less than three minutes before reaching a qualified technical support agent. During this incident, Sophos was not able to deliver on our target response and wait times for our customers. And while we are making a host of additional efforts to make sure that such a disruptive event never happens again, we are also taking steps to improve our support infrastructure should a large-scale issue of any kind affect our customers.

Recommended Actions

STEPS SOPHOS IS TAKING | STATUS
Put in place a third-party overflow/call failover supplier to handle sudden peaks in support calls. | COMPLETE IN UK AND NORTH AMERICA
Implement an emergency incident response procedure to immediately change workflows for inbound call handling including IVR messages, call flows and call-backs. | IN PROGRESS
Work with telco service providers to improve resiliency and redundancy of telco and PBX systems to protect against infrastructure failure during spikes in phone traffic. | IN PROGRESS

2. Proactive notification system

Sophos announced the false positive issue on Twitter and added a knowledgebase article within an hour of the incident occurring. We publicized the incident extensively via press and social media outlets within 2.5 hours, once the extent of the impact had become clear. However, it took up to 19 hours for all Sophos customers to be notified by email.

Recommended Actions

STEPS SOPHOS IS TAKING | STATUS
Change Sophos emergency incident response procedure to be able to initiate customer communications 24/7. | COMPLETE
Develop and regularly test automated email communication systems and contact databases to accelerate delivery of emergency email notifications. | COMPLETE
Identify alternative non-email based methods of communicating such updates to customers, for example SMS, console alerts and RSS. | IN PROGRESS

3. Knowledgebase article

Early versions of the Sophos knowledgebase article (KBA) were difficult to follow for many customers. Additionally, Sophos’ publishing system experienced inconsistent results when pushing out improved revisions of the KBA. This resulted in outdated or (at times) blank page results during the first 48 hours, when the Sophos website experienced peak load that was 25 times our normal level of page views across our entire website.

Recommended Actions

STEPS SOPHOS IS TAKING | STATUS
Change Sophos emergency incident response procedure to immediately allocate a usability/workflow owner for the main KBAs. | COMPLETE
Build knowledgebase article template for emergency incidents including elements such as screen shots, checklists to help partners and customers identify whether or not they have been affected, and if so how serious the impact might be. | COMPLETE
Review and change KBA publishing process to improve our ability to publish quickly, including in localized languages, during emergency response situations. | COMPLETE

Additional Recommendations

This incident has made it clear that Sophos should proactively provide more prescriptive advice on the most appropriate policy settings for our customers and partners to use when deploying and managing our products. While the Sophos default settings for new installations would have prevented this incident from causing disruption beyond the alerts, some customers had not enabled Live Protection, and other customers had changed the default action taken when cleanup is not available to "Move" or "Delete."

We have published knowledgebase article 114345, which provides our current recommended policy. This includes, among other recommendations, enabling Live Protection and setting the action taken when cleanup is not available to "Deny access only."

Conclusion

Since Sophos was founded in 1985, our clients and partners have grown to expect industry-leading product quality and best-in-class customer support. We must recommit ourselves to making regular and concrete improvements to our systems, processes, and organization, including the ones noted in this document, to make sure an event like this never happens again. In parallel, we must provide an improved response to any critical security events, regardless of their source. Nothing is more important to us than earning our customers’ trust to protect their organizations. Sophos will learn from this incident and will emerge an even stronger company, delivering even greater value to our partners and customers in a more reliable and responsive fashion.