The Malware Lake Project

Finding a golden nugget in a lake full of trash


For a while I was wondering: where am I going to find interesting malware? I have these huge sources of unorganized data: Malshare, VirusTotal, VirusShare, Malware Bazaar and AnyRun (and so many more!), but they hold so much data that unless you are looking for something very specific, it’s highly doubtful that you would find something interesting right off the bat. It felt like all the big-boy companies have access to so many resources while I, a single analyst, have access to so many databases but no way to organize all the IOCs, domains and samples, just a huge mess of data. It’s all about perspective, and the way I saw the world looked like this:

I wanted that golden nugget! The problem wasn’t the lack of data but the mess of the data. I needed a way to gain access to an endless amount of data, but also a way to navigate through it and normalize it. I started looking at tools that might help me solve this problem without setting up 10 virtual machines with honeypots and a whole instance of some threat intelligence platform. And then I was introduced to Splunk. Splunk allowed me to process huge data sets and create beautiful dashboards so I could manage my data and search for very specific things.

Required Knowledge:

  • Microsoft Office Excel
  • Basic knowledge of the Splunk Processing Language
  • Python IDE with requests and pandas libraries installed
  • Some knowledge in Python scripting

I’ve created a Python script that merges Malware Bazaar and Malshare into one full database. The script is highly flexible and has objects that allow anyone to define a database and merge it with other databases to create huge collections of malware. In this post I will show you how to use the script. Any developers who wish to improve or use my script, please don’t forget to credit me :3. In addition, I’ve tried to document the script as much as possible so anyone can edit it. You can get a copy of my script here:

I’ve tried to make this script as generic and flexible as I possibly could; the difference in database structure between Malshare and Malware Bazaar did not make this easy.

**Important details -**

This code snippet is the most important part of the code. The Database constructor takes the following parameters:

  • The name of the database. This name is very important as it defines the settings for the API queries and database creation. Why are the settings different for each database? Because Malshare and Malware Bazaar don’t share the same database structure, so the data must be normalized.
  • The API key for the database.

  • The location of the database URL. In the context of Malware Bazaar, the recent database contains only the most recent additions to the bazaar database. To get the full database please use this URL:

  • The list of all the header fields the programmer wants to extract from the database. Why does this matter? Because the Malware Bazaar database contains a lot of unnecessary header fields and Malshare doesn’t have header fields at all. This parameter is used either to extract specific header fields from the database or to fall back to the default one, which is always sha256_hash.
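To make the four parameters concrete, here is a minimal sketch of what such an object could look like. This is not the code from my script: the attribute names, the `DEFAULT_HEADERS` constant and the placeholder URL are assumptions made up for illustration. The only things taken from the description above are the order of the parameters and the sha256_hash default.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: the fallback header list holds only sha256_hash, as described above.
DEFAULT_HEADERS = ["sha256_hash"]


@dataclass
class Database:
    """Illustrative stand-in for the Database object (hypothetical field names)."""

    name: str     # provider name, selects the per-provider settings
    api_key: str  # API key for that provider
    url: str      # where the database dump is downloaded from
    headers: List[str] = field(default_factory=lambda: list(DEFAULT_HEADERS))

    def __post_init__(self) -> None:
        # An empty or missing header list falls back to the default.
        if not self.headers:
            self.headers = list(DEFAULT_HEADERS)


# Placeholder values only; the real keys and URLs are not shown here.
malshare = Database("malshare", "MY_API_KEY", "https://example.invalid/dump")
print(malshare.headers)
```

A provider with no header row (like Malshare) can simply leave the last argument out and still end up with one usable column.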

The function createDatabase queries the database URLs. It accepts one parameter:

  • The number of bytes to remove from the start of the returned database. Why is this needed? Because the Malware Bazaar query response starts with a lot of database file information that is not needed for creating the database. It is exactly 467 bytes long and I’m simply filtering it out.

How does it work?

First the function checks if a path to the database exists; if it doesn’t, it creates one. If no database file exists, the function queries the database from the URL and writes it to the current working directory. It filters out double quotes and space characters from the database.
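The steps above can be sketched roughly like this. It is a simplified illustration and not the createDatabase function from my script: the names `clean_raw_database`, `create_database` and `fetch` are hypothetical, and the download is passed in as a callable so the example runs without touching the network.

```python
import os
from typing import Callable


def clean_raw_database(raw: bytes, skip_bytes: int = 0) -> str:
    """Drop the leading file-information bytes, then strip double quotes and spaces."""
    text = raw[skip_bytes:].decode("utf-8", errors="replace")
    return text.replace('"', "").replace(" ", "")


def create_database(path: str, fetch: Callable[[], bytes], skip_bytes: int = 0) -> str:
    """Write the cleaned database to `path` unless a file already exists there.

    `fetch` is a hypothetical zero-argument callable returning the raw response
    body, for example: lambda: requests.get(url).content
    """
    folder = os.path.dirname(path)
    if folder:
        # Create the database folder if it is missing.
        os.makedirs(folder, exist_ok=True)
    if not os.path.exists(path):
        # Only query the provider when there is no local copy yet.
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(clean_raw_database(fetch(), skip_bytes))
    return path
```

For Malware Bazaar the third argument would be 467, the length of the file information mentioned above; for a provider without that preamble it stays at 0.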

Finally the function generateFullDB accepts two Database objects and merges them together.

First the function creates a full data frame for each database: it sends specific API queries to each database provider to extract extra data for each hash. This creates a normalized template for the main database. The functions that perform this action can be viewed in the script. Once the data frames are created, the function checks if a main database exists. If it doesn’t, it creates a new one by concatenating both database objects. If a main database does exist, the function concatenates the database objects and appends them to the database, ignoring the header names of the concatenated database objects.
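The create-or-append logic can be sketched with pandas like this. It is a simplified illustration, not the generateFullDB function itself: `generate_full_db` is a hypothetical name, it takes two already-built data frames instead of Database objects, and the per-hash API enrichment step is left out.

```python
import os

import pandas as pd


def generate_full_db(frame_a: pd.DataFrame, frame_b: pd.DataFrame, main_path: str) -> pd.DataFrame:
    """Concatenate two provider frames and write or append them to the main CSV."""
    # Columns missing from one provider become empty values in the combined frame.
    combined = pd.concat([frame_a, frame_b], ignore_index=True, sort=False)

    main_exists = os.path.exists(main_path)
    combined.to_csv(
        main_path,
        mode="a" if main_exists else "w",  # append to an existing main database
        header=not main_exists,            # write header names only on first creation
        index=False,
    )
    return combined
```

Appending without headers only works if the column order stays the same between runs, which is one more reason to normalize both providers to a single template first.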

The result of this should give you a CSV file containing the following template:

It’s worth noting that the Malware Bazaar database is very rich and detailed, while Malshare kind of lacks some values I wish it had. This database can be uploaded to any SIEM platform; in my case I chose Splunk (you can learn about Splunk for free here!), which resulted in the creation of the following:

I can use this dashboard to search for samples by time, file type, signature, tags and SHA256 hash. This gives me access to a wide variety of malware samples. For example, here are all the malware samples that arrived as document files disguised as COVID-19 information:

That’s about it guys, have fun!
