Discovering File Types Using Content Histograms
I was reviewing a hard drive the other day when I located several deleted Windows Event Log files. A quick review showed that I could recover both files, but I could only read one of them with the Event Viewer. The second file produced an error and would not open. A quick look in Notepad++ displayed a long line of seemingly random characters. A quick check with the file command returned no results. I began to wonder why this file had been named like an Event Log, why it had been deleted, and why it contained random characters. Dreams of finding the smoking gun started dancing through my head, but how could I know what this file contained if there was no decent file signature information?
Well, a quick review of my analysis tool answered all of my questions. When I initially copied the file out I had noticed that it had been deleted, but I failed to notice that it had been overwritten. Mystery solved, sort of. I still didn’t know what was in that file, or to be more precise, what information was currently stored at the file’s old location. As I stated, the data appeared to be random characters, which initially led me to believe that the new file was encrypted. But how could I tell? I began to wonder about character frequency: how could I determine whether a file was encrypted and therefore contained a truly random spread of characters, and how would such files compare to other files?
I decided to try to write a program that would plot the number of times each character occurs in a file. Truly random data, such as you would find in an encrypted file, should show an even distribution across all 256 possible byte values. A text file, on the other hand, would generally show a greater concentration of characters with a byte value below 127 (the printable ASCII range runs from 32 to 126), and those characters would vary according to the normal letter frequencies of the English language (my language of choice).
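The counting step behind such a plot can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the script from this post; the function names `byte_histogram` and `printable_fraction` are made up for the example.

```python
def byte_histogram(data):
    """Count how many times each byte value (0-255) occurs in data."""
    counts = [0] * 256
    for value in data:          # iterating over bytes yields integers 0-255
        counts[value] += 1
    return counts


def printable_fraction(counts):
    """Return the share of bytes that fall in printable ASCII (32-126)."""
    total = sum(counts)
    if total == 0:
        return 0.0              # empty input: nothing to measure
    return sum(counts[0x20:0x7F]) / total


# Usage (hypothetical file name):
#   with open("recovered.evt", "rb") as handle:
#       counts = byte_histogram(handle.read())
#   print(printable_fraction(counts))
```

A plain text file should give a printable fraction close to 1, while random-looking data should land somewhere near 95/256, since only 95 of the 256 byte values are printable.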
Enter matplotlib, numpy, and Python(x,y). Or, more precisely, Python(x,y), since I was doing this on a Windows XP system. Python(x,y) provided a quick and easy Python installation which includes many powerful extras. As this installation was designed for scientists, it includes the Python math modules matplotlib and numpy, among others. Matplotlib and numpy were what I required to show the frequency of the characters within a file by drawing a histogram. After some trial and error I was able to run the script and get a nice pop-up window with the histogram plotted in enough detail to show the differences between file types.
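For readers who want to try something similar, here is a minimal sketch of how numpy and matplotlib can produce this kind of plot. It is not the script mentioned in this post; `count_bytes`, `plot_byte_histogram`, and the `out_path` parameter are names chosen for the example, and the figure size and labels are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt


def count_bytes(data):
    """Return an array of 256 counts, one per possible byte value."""
    values = np.frombuffer(data, dtype=np.uint8)
    return np.bincount(values, minlength=256)


def plot_byte_histogram(counts, title, out_path=None):
    """Draw one bar per byte value; show a window or save to out_path."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(np.arange(256), counts, width=1.0)
    ax.set_xlim(-0.5, 255.5)
    ax.set_xlabel("Byte value (0-255)")
    ax.set_ylabel("Occurrences")
    ax.set_title(title)
    if out_path is None:
        plt.show()              # pop-up window, as described above
    else:
        fig.savefig(out_path)   # write an image file instead
    plt.close(fig)


# Usage (hypothetical file name):
#   with open("sample.txt", "rb") as handle:
#       plot_byte_histogram(count_bytes(handle.read()), "sample.txt")
```

Using one bar per byte value, rather than letting matplotlib choose its own bins, keeps every character visible as its own column.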
The following examples show the graphed output of the script. The first image is a histogram of a Microsoft Word document that I converted to text. I would have used the MS Word document itself, but the large number of NULL characters stretches the frequency range on the Y-axis and therefore requires some adjustments. The second image is a histogram of a file that has been encrypted with TrueCrypt. A big difference.
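The post does not spell out what those adjustments are. One possible approach, shown here purely as an assumption, is to zero out the dominant byte before plotting so the remaining bars are not flattened. The helper name `without_dominant_byte` is hypothetical.

```python
def without_dominant_byte(counts, byte_value=0x00):
    """Return a copy of counts with one byte value (NULL by default) zeroed.

    This is an illustrative workaround, not the adjustment used in the
    original script. The input list is left unchanged.
    """
    trimmed = list(counts)
    trimmed[byte_value] = 0
    return trimmed


# Usage: a file dominated by NULL bytes
#   counts = [0] * 256
#   counts[0x00] = 50000
#   counts[ord("e")] = 120
#   max(without_dominant_byte(counts))   # Y-axis now scales to 120
```

Another option that keeps the NULL bar in the picture is a logarithmic Y-axis, which matplotlib supports through `ax.set_yscale("log")`.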
Now, I’m not going to string out this post by including a long list of histogram images of different file types. Rather, I’m going to leave this post here but point you to the File Content Histograms page, where I will be posting images of different file types. You will also be able to find the script that I used to gather this information. There is a Linux/Windows Python script that requires Python and matplotlib, and a Windows executable that should not have any requirements to function.
I would also like to say that this is wildly successful at identifying files, but I am not going to do that. You will see when you review the other images that certain files do have a distinguishable pattern, but some of them are so similar that the histogram cannot be used to give definitive answers. It can help, however, to point the investigator in the right direction. For instance, archived files have a similar pattern to executable files and even some image files. All of this is understandable because these file types use a lot of similar techniques to compress the data. Despite these similarities, histograms of media files are definitely different from those of the archived files. I did find it interesting that the TrueCrypt graph was different from the graph of the file encrypted with GnuPG.
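When two histograms look alike to the eye, a single number can sometimes make the comparison easier. One common choice, which is not part of the script from this post, is the Shannon entropy of the byte distribution: it is 0 when a file contains only one byte value and 8 bits per byte when all 256 values occur equally often. A rough sketch, with `byte_entropy` as a made-up name:

```python
import math
from collections import Counter


def byte_entropy(data):
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0              # empty input: define entropy as zero
    total = len(data)
    entropy = 0.0
    for count in Counter(data).values():
        probability = count / total
        entropy -= probability * math.log2(probability)
    return entropy


# Usage (hypothetical file name):
#   with open("unknown.bin", "rb") as handle:
#       print(byte_entropy(handle.read()))
```

Like the histograms themselves, this should be treated as a pointer rather than proof: compressed and encrypted data both tend to score close to 8, so a high value alone cannot tell them apart.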
Go forth and do good things,
Don C. Weber