On Validation, pt III

From the first two articles (here, and here) on this topic arises the obvious question…so what? Not validating findings has worked well for many, to the point that the lack of validation goes unrecognized. After all, who notices that findings were not verified? The peer review process? The manager? The customer? The sheer pervasiveness of training materials and processes that focus solely on single artifacts, in isolation, should tell us that validating findings is not a common practice. That is, if the need for validation is not pervasive in our industry literature, and if no one is asking the question, “…but how do you know?”, then what leads us to assume that validation is part of what we do?


Consider a statement often seen in ransomware investigation/response reports up until about November 2019; that statement was some version of “…no evidence of data exfiltration was observed…”. However, did anyone ask, “…what did you look at?” Was this finding (i.e., “…no evidence of…”) validated by examining data sources that would definitively indicate data exfiltration, such as web server logs or the BITS Client Event Log? Or how about indirect sources, such as unusual processes making outbound network connections? Understanding how findings were validated is not about assigning blame; rather, it’s about truly understanding the efficacy of controls, as well as risk. If a finding such as “…data was not exfiltrated…” is not validated, what happens when we find out later that it was? More importantly, if you don’t understand what was examined, how can you address issues to ensure that these findings can be validated in the future?
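Just to make “…what did you look at?” a bit more concrete, here’s a rough sketch of what checking a single one of those data sources, the BITS Client operational log, might look like. It assumes the python-evtx library and a log exported from the image; the log path and the event IDs of interest are illustrative assumptions, a starting point rather than an exhaustive checklist.

```python
# Sketch: scan an exported BITS Client operational log for transfer-job events,
# using the python-evtx library. The log path and event IDs below are assumptions
# for illustration; adjust them to the export you're actually working from.
import re

from Evtx.Evtx import Evtx

# Assumed path to a log exported from the image under examination
LOG = r"F:\export\Microsoft-Windows-Bits-Client%4Operational.evtx"

# Event IDs commonly associated with BITS transfer jobs (e.g., 3 = job created,
# 59 = transfer started, 60 = transfer stopped); treat this as a starting point.
INTERESTING = {"3", "59", "60"}

with Evtx(LOG) as log:
    for record in log.records():
        xml = record.xml()
        match = re.search(r"<EventID[^>]*>(\d+)</EventID>", xml)
        if match and match.group(1) in INTERESTING:
            # Print the raw event XML so URLs, job names, and timestamps
            # associated with each transfer job can be reviewed.
            print(xml)
```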

When we ask the question, “…how do you know?”, the next question might be, “…what is the cost of validation?” And at the same time, we have to consider, “…what is the cost of not validating findings?”

The Cost of Validation
In the previous blog posts, I presented “case studies”, or examples of things that should be considered in order to validate findings, particularly in the second article. When considering the ‘cost’ of validation, what we’re asking is, why aren’t these steps performed, and what’s preventing the analyst from taking the steps necessary to validate the findings?

For example, why would an analyst see a Run key value and not take the steps to validate that it actually executed, including determining whether that Run key value had been disabled? Or parse the Shell-Core Event Log to see how many times it may have executed? Or parse the Application Event Log to determine whether an attempt to execute the program pointed to by the value resulted in an application crash? In short, why simply state that program execution occurred based on nothing more than observing the Run key value contents?
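As a rough illustration of what that validation might look like, the sketch below cross-checks a user’s Run key entries against the StartupApproved\Run key, using the python-registry library. The hive path is an assumption, and the interpretation of the StartupApproved flag byte (0x02 as enabled, 0x03 as disabled) is commonly reported but worth verifying against your own test data.

```python
# Sketch: list Run key values from an exported NTUSER.DAT and check whether each
# has been disabled via the StartupApproved\Run key, using python-registry.
from Registry import Registry

HIVE = r"F:\export\NTUSER.DAT"  # assumed export from the image under examination

RUN_KEY = "Software\\Microsoft\\Windows\\CurrentVersion\\Run"
APPROVED_KEY = ("Software\\Microsoft\\Windows\\CurrentVersion\\"
                "Explorer\\StartupApproved\\Run")

def key_values(reg, path):
    """Return {value name: data} for a key, or {} if the key is absent."""
    try:
        return {v.name(): v.value() for v in reg.open(path).values()}
    except Registry.RegistryKeyNotFoundException:
        return {}

reg = Registry.Registry(HIVE)
run_entries = key_values(reg, RUN_KEY)
approved = key_values(reg, APPROVED_KEY)

for name, command in run_entries.items():
    flags = approved.get(name)
    # A leading byte of 0x02 is commonly reported as "enabled" and 0x03 as
    # "disabled"; treat this as a lead to validate, not a conclusion.
    state = "no StartupApproved entry"
    if flags:
        state = "enabled" if flags[0] == 0x02 else "possibly disabled"
    print(f"{name} -> {command} [{state}]")
```

From there, the Shell-Core and Application Event Logs can be parsed in much the same fashion as the BITS log sketched earlier, to establish whether, and how often, the entry actually executed.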

Is it because taking those steps is “too expensive” in terms of time or effort, and would negatively impact SLAs, either explicit or self-inflicted? Does it take too long to do so, so much so that the ticket or report would not be issued in what’s considered a “timely” manner?

Could you issue the ticket or report in order to meet SLAs, make every attempt to validate your findings, and then issue an updated ticket when you have the information you need?

The Cost of Not Validating
In our industry, an analyst producing a ticket or report is very often well abstracted from the final effects of that report, from the decisions made and the resources deployed as a result of their findings. What this means is that, whether in an internal/FTE or consulting role, the SOC or DFIR analyst may never know the final disposition of an incident, or how it was impacted by their findings. That analyst will likely never see the meeting where someone decides either to do nothing, or to deploy a significant staff presence over a holiday weekend.

Let’s consider case study #1 again, the PCI case referenced in the first post. Given that it was a PCI case, it’s likely that the bank notified the merchant that they were identified as part of a common point of purchase (CPP) investigation, and required a PCI forensic investigation. The analyst reported their findings, identifying the “window of compromise” as four years, rather than the three weeks it should have been. Many merchants have an idea of the number of transactions they send to the card brands on a regular basis; for smaller merchants, that may be per month, and for larger merchants, per week. They also have a sense of the “rhythm” of credit card transactions; some merchants have more transactions during the week and fewer on the weekends. The point is that when the PCI Council decides on a fine, it takes the “window of compromise” into account.

During another incident in the financial sector, a false positive was not validated, and was reported as a true positive. This led to the domain controller being isolated, which ultimately triggered a regulatory investigation.

Consider this…what happens when you tell a customer, “OMGZ!! You have this APT Umpty-Fratz malware running as a Windows service on your domain controller!!”, only to later find out that every time the endpoint was restarted, the service failed to start (based on “Service Control Manager/7000” events, Windows Error Reporting events, application crashes, etc.)? The first message to go out sounds really, REALLY bad, but the validated finding says, “yes, you were compromised, and yes, you do need a DFIR investigation to determine the root cause, but for the moment, it doesn’t appear that the persistence mechanism worked.”
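As a sketch of that last validation step, the following again assumes python-evtx, an exported System event log, and a hypothetical service name; it simply surfaces “Service Control Manager” event ID 7000 (“the service failed to start”) records that reference the service.

```python
# Sketch: check whether a suspicious service ever failed to start by scanning an
# exported System event log for Service Control Manager event ID 7000 records.
import re

from Evtx.Evtx import Evtx

LOG = r"F:\export\System.evtx"   # assumed export from the image
SERVICE_NAME = "UmptyFratzSvc"   # hypothetical service name under investigation

with Evtx(LOG) as log:
    for record in log.records():
        xml = record.xml()
        if "Service Control Manager" not in xml:
            continue
        match = re.search(r"<EventID[^>]*>(\d+)</EventID>", xml)
        if match and match.group(1) == "7000" and SERVICE_NAME in xml:
            # Each hit is a restart after which the persistence mechanism
            # did not actually work.
            print(xml)
```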

Conclusion
So, what’s the deal? Are you validating findings? What say you?
