I’ve been tweaking some of my tools to help me analyze large password dumps, like exploit.in. I have also done such analyses with built-in Unix tools (I say Unix tools because I started using Unix in the eighties, before Windows and Linux existed), but I must also be able to do this on Windows machines, where I don’t always have the option to install “Unix-for-Windows” tools like cygwin.
When I started to process the exploit.in files with my CSV tools, I ran into some problems. The data is not very clean: for example, some lines in the dump are so long that Python’s csv module raises an error on them. Normally the format of a line is “email-address:password”, where a colon (:) separates the email address from the password. But sometimes there is no separator in the line, and sometimes there is more than one. The latter happens when a password contains a colon (:), and the problem is that this colon is not properly escaped for a CSV parser.
That’s why I made some updates to my python-per-line.py tool.
With python-per-line’s SBC function (Separator Based Cut), I can extract passwords even if the line is too long for other parsers, has no separator (:), or has more than one separator. This is the expression I use:
SBC(line,':',2,1,[])
line is a Python variable, ':' is the separator, 2 is the number of fields, 1 is the field that needs to be selected (the index starts from 0, so 1 is the second field, i.e. the password), and [] is the value to return if there is no field with index 1. Returning [] means that python-per-line will not output anything for that line (not even an empty line). SBC splits the line on the : separator without taking any possible escape characters into account. It also splits the line into at most 2 fields, even if there is more than one : character. The split is done from left to right, so any remaining : characters become part of the second field.
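To make the splitting rule concrete, here is a minimal sketch of how such a cut could behave. It is not the actual code from python-per-line.py: the lowercase name sbc and its parameter names are mine, and I assume the behavior is exactly what is described above.

```python
def sbc(line, separator, fields, index, default):
    """Hypothetical stand-in for SBC, written from the description above."""
    # Split from left to right into at most `fields` parts. Separators
    # beyond that limit stay inside the last part.
    parts = line.split(separator, fields - 1)
    # Return the requested field, or the default if it does not exist.
    if index < len(parts):
        return parts[index]
    return default


normal = sbc("user@example.com:secret", ":", 2, 1, [])
with_colon = sbc("user@example.com:pa:ss", ":", 2, 1, [])
no_separator = sbc("no-separator-here", ":", 2, 1, [])
```

Here normal is the plain password, with_colon keeps the colon inside the password (pa:ss), and no_separator is the default [] because there is no second field.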
The other problem I encountered on Windows is that when I piped the output of python-per-line into count (to count passwords), the process would stop before all files were processed. It turns out that some passwords contain the CTRL-Z character (0x1A), which is treated as an end-of-file marker on Windows, and that is why processing stopped. I solved this problem by escaping the CTRL-Z character with a function I added to python-per-line: RIN (Repr If Needed). This is the expression I use:
RIN(SBC(line,':',2,1,[]),'\x1a')
In this case, RIN will escape its input (the first argument) with Python’s repr function if the input contains the CTRL-Z character (\x1A).
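A rough sketch of the “repr if needed” idea follows. Again, this is not the tool’s real code: the name rin is mine, and I assume the second argument is a string of characters that each trigger the escaping, and that non-string values (like the [] default) are passed through unchanged.

```python
def rin(value, characters):
    """Hypothetical stand-in for RIN, written from the description above."""
    # Assumption: values that are not strings (e.g. the [] default)
    # are returned untouched.
    if not isinstance(value, str):
        return value
    # Escape with repr only when one of the listed characters is present.
    if any(character in value for character in characters):
        return repr(value)
    return value


clean = rin("secret", "\x1a")
escaped = rin("ab\x1acd", "\x1a")
```

After this, clean is still the original string, while escaped no longer contains a raw 0x1A byte: repr has turned it into the four printable characters \x1a and wrapped the string in quotes.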
python-per-line can also handle gzip compressed text files, so I was able to free up a couple of gigabytes by compressing the exploit.in text files. My count program version 0.1.0 was able to count the passwords, but it required Python 64 bit and took a long time. That’s why I added sqlite3 support to count.py as a counting method.
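Reading plain and gzip-compressed text files through one code path can look roughly like this. It is only a sketch: the function name read_lines, the choice to detect compression by the .gz extension, and the latin-1 decoding (picked so that no byte sequence fails to decode) are my assumptions, not a description of how python-per-line does it.

```python
import gzip
import os
import tempfile


def read_lines(path):
    """Yield lines from a text file, transparently handling .gz files."""
    # Assumption: compressed files are recognised by their extension.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            yield line.rstrip("\r\n")


# Small demonstration with a temporary gzip file.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.txt.gz")
    with gzip.open(sample, "wt", encoding="latin-1") as handle:
        handle.write("a@example.com:123456\nb@example.com:qwerty\n")
    lines = list(read_lines(sample))
```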
Here is the command I used to count the passwords and create a database:
Option -c exploit-in-passwords.db instructs count.py to use a sqlite3 database on disk, named exploit-in-passwords.db, as the counting method instead of a Python dictionary (the default counting method).
Option --ranktop 100 makes count.py output the top 100 most frequent passwords, along with their frequency. -H prints out a header, and -t prints totals.
Option -o passwords-top-100.csv makes count.py write its output to the file passwords-top-100.csv, and finally, option -b also sends this output to stdout.
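The idea of counting in a sqlite3 database rather than in a Python dictionary can be sketched as follows. The table name counts, the column names, and the functions count_to_db and top are invented for this illustration; I am not showing the schema or the code that count.py actually uses.

```python
import sqlite3


def count_to_db(values, db_path):
    """Count occurrences of each value in a sqlite3 database."""
    connection = sqlite3.connect(db_path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS counts "
        "(value TEXT PRIMARY KEY, frequency INTEGER NOT NULL)"
    )
    for value in values:
        # Create the row on first sight, then increment its counter.
        connection.execute(
            "INSERT OR IGNORE INTO counts (value, frequency) VALUES (?, 0)",
            (value,),
        )
        connection.execute(
            "UPDATE counts SET frequency = frequency + 1 WHERE value = ?",
            (value,),
        )
    connection.commit()
    return connection


def top(connection, limit):
    """Return the most frequent values, highest frequency first."""
    query = (
        "SELECT value, frequency FROM counts "
        "ORDER BY frequency DESC, value LIMIT ?"
    )
    return connection.execute(query, (limit,)).fetchall()


# ":memory:" keeps this demo off disk; a file name would persist the counts.
connection = count_to_db(["123456", "qwerty", "123456"], ":memory:")
ranking = top(connection, 2)
```

Because the counters live in the database and not in memory, a file on disk can hold far more unique values than a dictionary in a 32-bit process, and the same file can be queried again later without reprocessing the input.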
Afterwards, I can use the database to print out other lists, like a top 20:
Option -z means that count.py does not require input files; it will just print out data from the database. Option -d sorts the output in descending order (by default, output is sorted by count in ascending order).
From this output, I can see that 123456 is the password with the highest frequency (a bit more than 5 million times), that there are almost 800 million passwords in total and a bit more than 200 million unique passwords.


Article Link: https://blog.didierstevens.com/2017/07/28/analyzing-password-dumps-with-my-tools-part-1/