Tomorrow I’ll be giving a talk on breach data including:
- Places where it’s located
- How to make large data sets searchable in a reasonable amount of time
- How some organizations are using breach data to improve their security posture
Whenever I give conference talks I try to remove or reduce any barriers to entry. When I have given talks on memory forensics, I have always used the Windows standalone version of Volatility instead of Linux for my demos so attendees who were not really comfortable with Linux wouldn’t feel like they couldn’t try the techniques.
With that idea in mind, I wanted to find a way to make large breach datasets searchable without the need to maintain huge databases, normalize hundreds (or more) of disparate datasets, etc. Similar to a recent blog post I wrote where I used a forensics tool called bulk_extractor to help quickly acquire selectors (emails, phone numbers, etc.) from a large dataset, I decided to apply a common forensics technique, indexing, to this problem.
Indexing has been used in forensics for years. You basically trade effort and extra storage space now for much quicker search results in the future. Imagine getting a hard drive in to examine on a Friday. You could let the drive process over the weekend and on Monday morning quickly view the results and perform your searches. Ironically, indexing isn’t nearly as common as it used to be in forensics, but the technique works very well for breach data. To understand the tradeoffs and advantages, here’s a real-world example.
I had a dataset of breach data that was 126 GB in size. Searching that data for an email address took about 50 minutes. I started up a job to index the data, which took two full days to run and used an extra 76 GB of storage space. I felt the results were well worth it since my searches now took 2 minutes instead of 50.
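To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the figures quoted above; the variable names are mine, and the "break-even" framing is just one way of looking at the tradeoff.

```python
# Figures taken from the example above.
DATA_GB = 126
INDEX_GB = 76
INDEX_BUILD_MINUTES = 2 * 24 * 60   # two full days of indexing
RAW_SEARCH_MINUTES = 50
INDEXED_SEARCH_MINUTES = 2

# How much faster a single search becomes.
speedup = RAW_SEARCH_MINUTES / INDEXED_SEARCH_MINUTES

# Index size relative to the data it covers.
storage_overhead = INDEX_GB / DATA_GB

# Number of searches after which the indexing time has paid for itself.
minutes_saved_per_search = RAW_SEARCH_MINUTES - INDEXED_SEARCH_MINUTES
break_even_searches = INDEX_BUILD_MINUTES / minutes_saved_per_search

print(f"Speedup per search: {speedup:.0f}x")
print(f"Extra storage: {storage_overhead:.0%} of the original data")
print(f"Searches needed to recoup indexing time: {break_even_searches:.0f}")
```

That works out to a 25x speedup for roughly 60% extra storage, and the two days of indexing are recouped after 60 searches.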
Years ago in a forensics class I learned of a free tool called “Agent Ransack” (https://www.mythicsoft.com/) which made searching drives for information easier. As I started searching for options to index this data, I realized that the same company that made Agent Ransack made a professional version called “FileLocator Pro” which has indexing capabilities. I try to stick to free resources whenever I can, and FileLocator Pro has a $60 cost, but it seemed to be the easiest and most affordable method of accomplishing what I was going for without the need to massage a lot of data.
While indexing large amounts of data, I figured out that FileLocator Pro has a few idiosyncrasies that I want to bring to the attention of anyone thinking of using it.
Before you index large datasets, I would highly recommend splitting up large files into chunks no bigger than 1 GB in size. This will not only help them index faster, but you’ll also get far fewer errors while indexing. There are a lot of ways to do this, but I used a program called G-Split (https://www.gdgsoft.com/gsplit) which made it really quick and easy. You can pick how big you want the chunks to be, what it should name the files, etc.
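If you would rather script the splitting than use a GUI tool, something like the sketch below should work. To be clear, this is not what G-Split does internally; it is a minimal Python alternative that assumes the dump is line-oriented text, and the function name, chunk naming scheme, and 1,000,000,000-byte default are my own choices. It breaks on line endings so a record never gets cut in half.

```python
from pathlib import Path


def split_file(src, out_dir, max_bytes=1_000_000_000):
    """Split src into numbered chunks of at most max_bytes, breaking on line ends.

    A single line longer than max_bytes is written to its own oversized
    chunk rather than being cut in half.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunks = []
    out = None
    written = 0
    with src.open("rb") as infile:
        for line in infile:
            # Start a new chunk when the current one cannot hold this line.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                name = f"{src.stem}_{len(chunks) + 1:04d}{src.suffix}"
                chunk_path = out_dir / name
                out = chunk_path.open("wb")
                chunks.append(chunk_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return chunks
```

Calling `split_file("big_dump.txt", "data/collection_1")` (both paths are placeholders) would write `big_dump_0001.txt`, `big_dump_0002.txt` and so on into that directory and return the list of chunk paths.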
For organizational purposes, I made two directories, data and indexes. In each of those directories, I make a subdirectory for each dataset. For instance, in data I may have a subdirectory called “collection_1” with all of the data from the collection 1 dump in chunks no bigger than 1 GB. In the indexes directory I will have a directory called “collection_1_index” where I have FileLocator Pro store the index it’s making.
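If you have a lot of datasets, creating that layout by hand gets tedious. Here is a small sketch that builds the same structure; the function name and the `root` parameter are my own, and the directory names simply follow the convention described above.

```python
from pathlib import Path


def dataset_dirs(root, dataset):
    """Create and return (data_dir, index_dir) for one dataset under root."""
    root = Path(root)
    data_dir = root / "data" / dataset
    index_dir = root / "indexes" / f"{dataset}_index"
    # exist_ok lets the function be re-run safely for an existing dataset.
    data_dir.mkdir(parents=True, exist_ok=True)
    index_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, index_dir
```

For example, `dataset_dirs("E:/breach", "collection_1")` (the drive path is a placeholder) creates `data/collection_1` and `indexes/collection_1_index` under that root.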
Thankfully storage space is really cheap so I bought an 8 TB hard drive to store the data on.
Making the index:
While making the index, I highly suggest making an index for each set of data rather than one massive index. It just makes things quicker overall and much more usable. With some particularly large datasets (100 GB plus), I would even have to split them into a couple of indexes. The command line interface we’ll talk about in the next section makes the search process painless even with a large number of indexes.
After making the index:
Once the indexes are made, when you go to search them you’ll notice that the graphical interface feels super sluggish. For some reason, FileLocator Pro starts doing searches while you’re typing in the search term (like Google does). That might be fine for searches of small datasets, but for big ones it can bog you down. I started typing my terms into a text file and just loading that in to search.
What I finally decided to do was use the command line interface to run my searches. I wrote a very basic Python wrapper to take all of the terms from a text file and search all of the indexes listed in another text file. All of the results are placed in an HTML file with a built-in header graphic, and all of the terms are bolded to make them easier to find in the results. I posted the code here: https://github.com/azmatt/DuckHunt
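To give a feel for the general shape of a wrapper like that, here is a simplified sketch. It is not the DuckHunt code, and it leaves out the header graphic. The actual search is passed in as a `run_search(term, index)` function, which is a placeholder of mine: in practice it would call the search tool's command line and return the matching lines, and I am deliberately not guessing at those command line options here.

```python
import html
import re
from pathlib import Path


def read_list(path):
    """Return the non-empty, stripped lines of a text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def bold_term(line, term):
    """HTML-escape line, wrapping each case-insensitive match of term in <b>."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    out = []
    pos = 0
    for match in pattern.finditer(line):
        # Escape the text between matches and the match itself separately
        # so the <b> tags are the only markup that survives.
        out.append(html.escape(line[pos:match.start()]))
        out.append(f"<b>{html.escape(match.group(0))}</b>")
        pos = match.end()
    out.append(html.escape(line[pos:]))
    return "".join(out)


def build_report(terms, indexes, run_search):
    """Search every index for every term and return one HTML document.

    run_search(term, index) is supplied by the caller (hypothetical here)
    and must return an iterable of matching lines.
    """
    parts = ["<html><body><h1>Search results</h1>"]
    for term in terms:
        parts.append(f"<h2>{html.escape(term)}</h2>")
        for index in indexes:
            hits = list(run_search(term, index))
            if not hits:
                continue  # skip indexes with no results for this term
            parts.append(f"<h3>{html.escape(str(index))}</h3><ul>")
            parts.extend(f"<li>{bold_term(hit, term)}</li>" for hit in hits)
            parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

With a real `run_search` in place, a run would look something like `Path("results.html").write_text(build_report(read_list("terms.txt"), read_list("indexes.txt"), run_search), encoding="utf-8")`, where all three file names are placeholders.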
This makes it really easy to run big batches of searches unattended and quickly scan through the results. I have two different index lists. One that has all of the indexes in it, and one that only has indexes for datasets that have phone numbers. That way if I’m searching phone numbers, I don’t waste time searching data that won’t contain any.
While searching your indexes:
It’s always a good idea to search for the username of an email in addition to the full email. For instance, if you’re searching for an email address with the username example034, you should also search example034 by itself to find the same username at other providers such as Hotmail. Unfortunately, a FileLocator Pro idiosyncrasy forces the issue. When doing searches for emails, the results were taking FAR too long to come back, if they came back at all.
I reached out to their tech support and they told me that it was splitting the email at the ‘@’ into two different searches. Because of this you never want to search for full email addresses, just the unique username (or a unique domain) part. Fortunately this isn’t usually a deal breaker since searching for the username is a best practice anyway.
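If your term list starts out as full email addresses, a few lines of Python can turn each one into the pieces you actually want to search. This is only a sketch: the function name is mine, and the example.com address in the usage note is a made-up placeholder.

```python
def email_search_terms(email):
    """Return (username, domain) so each part can be searched on its own."""
    # rpartition splits on the last '@', which is the one before the domain.
    username, sep, domain = email.strip().rpartition("@")
    if not sep or not username or not domain:
        raise ValueError(f"not an email address: {email!r}")
    return username, domain.lower()
```

For example, `email_search_terms("example034@example.com")` gives back `example034` and `example.com`, and you would normally feed only the username (or the domain, if it is unique enough) into your term list.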
Whenever you’re searching data, you should always know what your data looks like and do some test searches. One area where that is particularly true is phone numbers. Most datasets don’t have any special characters between the digits, but some have country codes and some don’t.
Another special idiosyncrasy is that FileLocator Pro finds partial hits on words, but not on numbers. Searching for “matt0177” will also hit on “matt01773”. This is fine. What sucks is that searching for 9155551212 will hit on that but WILL NOT hit on 19155551212. Because of that, I recommend searching for phone numbers both with and without the country code.
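Generating both variants by hand gets old fast, so here is a small sketch that does it for you. It assumes 10-digit national numbers with a country code of 1, which matches the example above but will not fit every country; the function name and parameters are my own.

```python
import re


def phone_search_terms(number, country_code="1", national_length=10):
    """Return the digits-only number without and with the country code."""
    # Drop spaces, dashes, parentheses, a leading '+', and anything else
    # that is not a digit.
    digits = re.sub(r"\D", "", number)
    if not digits:
        raise ValueError(f"no digits found in {number!r}")
    has_code = (
        len(digits) == len(country_code) + national_length
        and digits.startswith(country_code)
    )
    national = digits[len(country_code):] if has_code else digits
    return [national, country_code + national]
```

Both `phone_search_terms("915-555-1212")` and `phone_search_terms("+1 (915) 555-1212")` give you 9155551212 and 19155551212, ready to drop into the term list.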
This whole process feels a little rough around the edges at times, but it’s a pretty low-effort way to make huge datasets searchable and usable for OSINT efforts.