Following on the heels of my previous post regarding file formats, and after sharing the link to that post on LinkedIn, I had some additional thoughts. Rather than blasting those thoughts out as comments on the original post, I felt they would benefit from being edited and refined via this medium.
My first thought was, is it necessary for every analyst to have deep, intimate knowledge of file formats? The answer to that is a resounding "no", because it's simply not possible, nor is it scalable. There are too many possible file formats for every analyst to be familiar with all of them; however, if a few knowledgeable analysts, ones who understand the value of file format information to DFIR, CTI, etc., document the information and are available to act as resources, then that should suffice. With the format and its value documented and reviewed/updated regularly, that documentation should act as a powerful resource.
What is necessary is that analysts be familiar enough with data sources and file formats to understand when something is amiss, or when something is not presented by the tools, and know enough to recognize that fact. From there, simple troubleshooting steps can allow the analyst to develop a thoughtful, reasoned question, and to seek guidance.
So, why is understanding file formats important for DFIR analysts?
1. Parsing The Data
First, you need to understand how to effectively parse the data. Again, not every analyst needs to be an expert in all file formats - that's simply impossible. But, if you're working on Windows systems, understanding file formats such as the MFT and the USN change journal, and how they can be tied together, is important. In fact, it can be critical to correctly answering analysis questions. Many analysts parse these two files separately, but David and Matt's TriForce tool allowed these files (and the $LogFile) to be automatically correlated.
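As a rough sketch of what ties these two artifacts together (not a description of TriForce's actual implementation): each USN change journal record carries a file reference number whose low 48 bits are the MFT record number, which is the hook for correlating journal entries back to parsed $MFT output. A minimal parser for a single USN_RECORD_V2 structure might look like this; the field layout follows Microsoft's documentation, and the record in the usage example below is synthetic:

```python
import struct

def parse_usn_v2(buf, offset=0):
    """Parse one USN_RECORD_V2 structure from raw $J bytes."""
    (rec_len, major, minor, frn, parent_frn, usn, timestamp,
     reason, source_info, sec_id, attrs,
     name_len, name_off) = struct.unpack_from("<LHHQQqqLLLLHH", buf, offset)
    name = buf[offset + name_off:offset + name_off + name_len].decode("utf-16-le")
    # The low 48 bits of the file reference number are the MFT record
    # number; the high 16 bits are the sequence number. This is what
    # lets a USN entry be tied back to a parsed $MFT record.
    return {
        "usn": usn,
        "mft_record": frn & 0xFFFFFFFFFFFF,
        "parent_mft_record": parent_frn & 0xFFFFFFFFFFFF,
        "reason": reason,
        "name": name,
    }
```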
So, do you need to be able to write an OLE parser when so many others already exist? No, not at all. However, we should have enough of an understanding to know that certain tools allow us to parse certain types of files, albeit only to a certain level. We should also have enough understanding of the file format to recognize that not all files that follow the format are necessarily going to have the same content. I know, that sounds somewhat rudimentary, but there are a lot of new folks in the DFIR industry who don't have previous experience with older, perhaps less frequently observed file formats.
Having an understanding of the format also allows us to ask better questions, particularly when it comes to troubleshooting an issue with the parser. Was something missed because the parser did not address or handle something, or is it because our "understanding" of the file format is actually a "misunderstanding"?
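As a small example of the kind of format knowledge that helps with troubleshooting: an OLE compound file starts with a fixed 8-byte signature, and the header records the sector size the rest of the file is built on. A quick sanity check like the following sketch (not a replacement for a real parser such as olefile) can help separate "the file isn't what the tool expects" from "the tool has a gap":

```python
# Signature documented in the compound file binary format specification
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

def check_ole_header(buf):
    """Sanity-check the first bytes of a suspected OLE compound file.

    Returns None if the signature doesn't match; otherwise returns the
    sector size derived from the sector shift field at offset 30.
    """
    if buf[:8] != OLE_MAGIC:
        return None
    sector_shift = int.from_bytes(buf[30:32], "little")
    return {"sector_size": 1 << sector_shift}
```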
2. The "Other" Data
Second, understanding file formats provides insight into what other data the file format may contain, such as deleted data, "slack", and metadata. Several file formats emulate file systems; MS describes OLE files as "a file system within a file", and Registry hives are described as a "hierarchical database". Each of these file formats has its own method for addressing deleted content, as well as managing "slack". Further, many file formats maintain metadata that can be used for a variety of purposes. In 2002, an MSWord document contained metadata that came back to haunt Tony Blair's administration. More recently, Registry hive files were found to contain metadata that identified hives as "dirty", prompting further actions from DFIR analysts. Understanding what metadata may be available is also valuable when that metadata is not present, as observed recently in "weaponized" LNK files delivered by Emotet threat actors, in a change of TTPs.
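The "dirty" hive determination mentioned above comes down to two fields in the REGF base block: the primary and secondary sequence numbers at offsets 4 and 8. If they don't match, the last write to the hive didn't complete, and the transaction logs may hold data not yet reconciled into the hive. A minimal check (the header bytes in the usage below are synthetic):

```python
def hive_is_dirty(header):
    """Compare the two REGF sequence numbers; a mismatch marks the
    hive as 'dirty', i.e., transaction logs may need to be replayed."""
    if header[:4] != b"regf":
        raise ValueError("not a REGF base block")
    seq1 = int.from_bytes(header[4:8], "little")
    seq2 = int.from_bytes(header[8:12], "little")
    return seq1 != seq2
```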
3. File Carving
Third, "file carving" has been an important topic since the early days of forensic analysis, and analysts have been asked to recover deleted files from a number of file systems. Recovering, or "carving", deleted files can be arduous and error prone, and if you understand file formats, it may be much more fruitful to carve for file records rather than for entire files. For example, understanding the file and record structure of Windows 2000 and XP Event Logs (.evt files) allowed records to be recovered from memory and unallocated space, where carving for entire files would yield limited results, if any. In fact, understanding the record structure allowed complete records to be extracted from "unallocated space" within the .evt files themselves. I used this successfully in a case where an Event Log file header stated that there were 20 records in the file, but I was able to extract 22 complete records from the file.
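The record-carving approach can be sketched in a few lines: each EVT record begins with its length, followed by the "LfLe" signature, and repeats the length as its final DWORD, so scanning arbitrary bytes for the signature and validating both length copies recovers records regardless of what the file header claims. A rough sketch, with field offsets per the documented EVENTLOGRECORD structure and synthetic test data:

```python
import struct

EVT_SIG = b"LfLe"  # signature at offset 4 of every EVT record

def carve_evt_records(buf):
    """Scan raw bytes (memory, unallocated space, .evt slack) for EVT
    records by signature rather than trusting any file header count."""
    records = []
    pos = buf.find(EVT_SIG)
    while pos != -1:
        start = pos - 4  # the record's length DWORD precedes the signature
        if start >= 0:
            (length,) = struct.unpack_from("<L", buf, start)
            # A valid record is at least 0x38 bytes, fits in the buffer,
            # and repeats its length in its last four bytes.
            if 0x38 <= length <= len(buf) - start and \
                    struct.unpack_from("<L", buf, start + length - 4)[0] == length:
                rec_num, time_gen = struct.unpack_from("<LL", buf, start + 8)
                records.append({"offset": start, "length": length,
                                "record_number": rec_num,
                                "time_generated": time_gen})
        pos = buf.find(EVT_SIG, pos + 1)
    return records
```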
Even today, the same holds true for other types of "records", including "records" such as Registry key and value nodes, etc. Rather than looking for file headers and then grabbing the subsequent X number of bytes, we can instead look for the smaller records. I've used this approach to extract deleted keys and values from the unallocated space within a Registry hive file, and the same technique can be used for other data sources, as well.
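The same idea applied to hive files: every cell in a hive begins with a 4-byte signed size (negative for allocated cells, positive for free ones), and key and value cells carry "nk" and "vk" signatures, respectively. A sketch of scanning raw hive bytes for free, i.e. candidate deleted, cells (synthetic data in the usage below):

```python
import struct

def carve_hive_cells(buf, sig=b"nk"):
    """Look for 'nk' (key) or 'vk' (value) cell signatures in raw hive
    bytes; a positive cell size marks the cell as free, making it a
    candidate deleted key or value node."""
    hits = []
    pos = buf.find(sig)
    while pos != -1:
        start = pos - 4  # the signed cell size precedes the signature
        if start >= 0:
            (size,) = struct.unpack_from("<l", buf, start)
            if size > 0:  # free (unallocated) cell
                hits.append({"offset": start, "size": size})
        pos = buf.find(sig, pos + 1)
    return hits
```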
4. What's "In" The File
Finally, understanding the file format will help you understand what should and should not be resident in the file. One example I like to look back on occurred during a PCI forensic investigation; an analyst on our team ran our process for searching for credit card numbers (CCNs) and stated in their draft report that CCNs were found "in" a Registry hive file. As this is not something we'd seen previously, this piqued our curiosity, and some of us wanted to take a closer look. It turned out that what had happened was this: the threat actor had compromised the system, and run their process for locating CCNs. At the time, the malware used would (a) dump process memory from the back office server process that managed CCN authorization and processing to a file, (b) parse the process memory dump with a 'compiled' Perl script that included seven regexes to locate CCNs, and then (c) write the potential CCNs to a text file. The threat actor then compressed, encrypted, and exfiltrated the output file, deleting the original text file. This deleted text file then became part of unallocated space within the file system, and the sectors that comprised the file were available for reallocation.
Later, as the Registry hive file "grew" and new data was added, sectors from the file system were added to the logical structure of the file, and some of those sectors were from the deleted text file. So, while the CCNs were found "in" the logical file structure, they were not actually part of the Registry. The CCN search process we used at the time returned the "hit" as well as the offset within the file; a visual inspection of the file via a hex editor illustrated that the CCNs were not part of the Registry structure, as they were not found to be associated with any key or value nodes.
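For the search process itself, a regex alone produces plenty of coincidental digit runs; pairing it with a Luhn checksum, and reporting the offset of each hit so it can be inspected in a hex editor, is the usual approach. A simplified sketch, not our actual production process; the 16-digit pattern is deliberately naive, and real tooling handles separators and multiple card lengths:

```python
import re

# Naive candidate pattern: exactly 16 contiguous digits.
CCN_RE = re.compile(rb"\b\d{16}\b")

def luhn_ok(digits):
    """Luhn checksum, the standard first filter for CCN search hits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_ccn_hits(buf):
    """Return (offset, candidate) pairs so each hit can be checked
    against surrounding structure in a hex editor."""
    return [(m.start(), m.group()) for m in CCN_RE.finditer(buf)
            if luhn_ok(m.group().decode())]
```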
As such, what at first looked like a new threat actor TTP was really just how the file system worked, which had a significant impact on the message that was delivered to Visa, who "ran" the PCI Council at the time.