The Role of Fuzzy Hashes in Security Operations

Enhancing Malware Analysis with Similarity-Based Hashing Techniques

Dec 19, 2024

Bonus Content! Hands-on Lab at the bottom :)

Daily security operations can feel like a game of whack-a-mole, because teams often rely on IOCs (indicators of compromise) that have a very short shelf life. For instance, an IT person might detect suspicious activity to or from a certain IP address, block that address, and move on. The problem with that approach is that IP addresses are disposable and easy to change. The same is true of malware hashes.

A traditional approach to identifying malware is to catalog its static hash (such as SHA1, SHA256, MD5) into a list of known malicious files. This is how traditional antivirus software operated—it compared files on your system against this list of known-bad hashes. However, this method has a significant weakness: static hashes are extremely fragile. Even the tiniest modification to a file completely changes its hash, allowing a previously identified "bad" file to appear "good" again.

Here are some common techniques malware authors use to evade static hash detection:

Byte manipulation: Adding, removing, or modifying single bytes in the malware code
Payload repackaging: Compressing or encrypting the malicious code differently each time
Polymorphic code: Automatically modifying the malware's code structure while maintaining functionality
Padding insertion: Adding random data between code sections to change the file's hash
Code reordering: Changing the sequence of code blocks without affecting the program's behavior

These simple modifications completely change the malware's static hash while preserving its malicious functionality, making traditional hash-based detection ineffective.
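To make that concrete, here is a minimal Python sketch of padding insertion defeating an exact-match blocklist. The sample bytes and the `is_known_bad` helper are made up for illustration; a real blocklist would be far larger, but the lookup logic is the same idea.

```python
import hashlib

# Hypothetical stand-in for a malware sample; any bytes behave the same way.
original = b"MZ\x90\x00 pretend this is a malicious executable"

# "Padding insertion": append one null byte that does not change behavior.
padded = original + b"\x00"

# A known-bad list holding the SHA-256 of the original sample only.
blocklist = {hashlib.sha256(original).hexdigest()}

def is_known_bad(data: bytes) -> bool:
    """Exact-match lookup, the way a static hash blocklist works."""
    return hashlib.sha256(data).hexdigest() in blocklist

print(is_known_bad(original))  # True
print(is_known_bad(padded))    # False
```

One extra byte is enough to make the padded copy look "clean" to this kind of check.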

That’s where fuzzy hashes come in. Unlike traditional hashes, which are great for spotting exact matches, fuzzy hashes focus on finding “close enough” similarities. This makes them incredibly useful for detecting malware that’s been slightly modified or identifying patterns in large sets of data. In this post, we’ll break down the different types of hashes, how fuzzy hashes work, and why they’re a key part of malware identification.

Static Hashes

Traditional static hashing algorithms like MD5, SHA1, and SHA256 work by processing a file's data in a deterministic way to produce a fixed-length output (the hash). Here's how they generally work:

The file is read as a stream of bytes
The data is processed in fixed-size blocks through a mathematical function
Each block's result is combined with the previous results in a way that any change, no matter how small, cascades through the entire calculation
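The final output is a fixed-length string of characters that, for practical purposes, uniquely identifies the input data (collisions are possible in theory, and practical ones have been demonstrated for MD5 and SHA1)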

This is why changing even a single bit in a file results in a completely different hash value — the cascading nature of the algorithm ensures that the entire hash changes. While this property makes static hashes excellent for verifying file integrity and detecting exact matches, it makes them ineffective for detecting similar or slightly modified files.
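If you want to see the cascade for yourself, the short Python sketch below feeds data to SHA-256 in blocks and then counts how many output bits change when a single input bit is flipped. The helper names and the sample message are my own choices for illustration.

```python
import hashlib

def sha256_streamed(data: bytes, block_size: int = 64) -> str:
    """Feed the data to SHA-256 one block at a time, like reading a file."""
    h = hashlib.sha256()
    for i in range(0, len(data), block_size):
        h.update(data[i:i + block_size])
    return h.hexdigest()

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the output bits that differ between the SHA-256 of a and of b."""
    x = int.from_bytes(hashlib.sha256(a).digest(), "big")
    y = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(x ^ y).count("1")

message = bytearray(b"The quick brown fox jumps over the lazy dog")
flipped = bytearray(message)
flipped[0] ^= 0x01  # flip the lowest bit of the first byte

changed = differing_bits(bytes(message), bytes(flipped))
print(f"{changed} of 256 output bits changed")
```

You should see roughly half of the 256 output bits change, which is what "completely different hash" means in practice.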

Fuzzy Hashes

Fuzzy hashes work differently from static hashes by breaking down files into smaller chunks and creating signatures that can survive minor modifications. Here's how they typically work:

Files are divided into variable-sized blocks based on content patterns rather than fixed sizes
Each block is hashed separately, creating a sequence of smaller hashes rather than one large hash
The sequence of block hashes is combined to create a "fuzzy hash signature" that represents the file's structure
When comparing files, their fuzzy hash signatures are analyzed for similarity, producing a match percentage rather than a binary match/no-match result
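Why content-based boundaries instead of fixed-size blocks? The sketch below is a toy comparison, not any real tool's algorithm: it inserts one byte at the start of some random data and checks how many blocks still hash the same under each scheme. The window size, the divisor, and the "sum of the window" trigger are all assumptions picked to keep the example short.

```python
import hashlib
import random

WINDOW = 8     # bytes the rolling trigger looks at
DIVISOR = 64   # a boundary fires roughly once every DIVISOR bytes

def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Split at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data: bytes) -> list:
    """Split wherever the last WINDOW bytes hit a trigger value."""
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        if sum(window) % DIVISOR == DIVISOR - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared_fraction(chunks_a: list, chunks_b: list) -> float:
    """Fraction of A's chunk hashes that also appear among B's."""
    ha = {hashlib.sha256(c).digest() for c in chunks_a}
    hb = {hashlib.sha256(c).digest() for c in chunks_b}
    return len(ha & hb) / len(ha)

rng = random.Random(2024)
original = bytes(rng.randrange(256) for _ in range(4096))
modified = b"X" + original  # one inserted byte at the very start

fixed_shared = shared_fraction(fixed_chunks(original), fixed_chunks(modified))
content_shared = shared_fraction(content_chunks(original), content_chunks(modified))
print(f"fixed-size blocks still matching:      {fixed_shared:.0%}")
print(f"content-defined blocks still matching: {content_shared:.0%}")
```

With fixed offsets, the inserted byte shifts every block, so nothing lines up anymore. With content-defined boundaries, the blocks resynchronize shortly after the change and most of them still match.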

This approach offers several key advantages for security analysts:

Can identify variants of known malware even after minor code modifications
Maintains effectiveness against common malware obfuscation techniques
Provides similarity scores that help analysts correlate separate investigations
Works well for identifying malware families or code reuse

Below, we’ll break down a few of the most common fuzzy hashing techniques.

SSDEEP

SSDEEP is one of the most widely used fuzzy hashing tools in cybersecurity. It implements context-triggered piecewise hashing (CTPH), a technique adapted from the spamsum algorithm, which was originally developed for spam detection but found great utility in malware analysis.

Here's how SSDEEP specifically works:

It divides the input file into blocks based on content patterns rather than fixed sizes
Uses a rolling hash function to identify trigger points in the data that determine block boundaries
Generates a hash for each block using a traditional hashing function
Combines these block hashes into a single signature that represents the file's structure
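The steps above can be sketched as a toy signature function. To be clear, this is not the real SSDEEP algorithm: the trigger rule, the one-character-per-block encoding, and the use of Python's `difflib` for scoring are simplifications I chose so the idea fits in a few lines. The real tool uses its own rolling hash, block hash, and a weighted edit distance.

```python
import difflib
import hashlib
import random
import string

ALPHABET = string.ascii_letters + string.digits + "+/"  # 64 symbols
WINDOW, DIVISOR = 8, 64  # assumed toy parameters

def toy_signature(data: bytes) -> str:
    """One character per content-defined block, joined into a signature."""
    sig, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        at_end = i == len(data) - 1
        # Close the block at a trigger point, or at the end of the data.
        if sum(window) % DIVISOR == DIVISOR - 1 or at_end:
            digest = hashlib.sha256(data[start:i + 1]).digest()
            sig.append(ALPHABET[digest[0] % 64])
            start = i + 1
    return "".join(sig)

def toy_score(sig_a: str, sig_b: str) -> int:
    """Similarity from 0 to 100, using difflib in place of SSDEEP's scoring."""
    return round(100 * difflib.SequenceMatcher(None, sig_a, sig_b).ratio())

rng = random.Random(7)
original = bytes(rng.randrange(256) for _ in range(4096))
variant = original[:2000] + b"INSERTED-GUID" + original[2000:]
unrelated = bytes(rng.randrange(256) for _ in range(4096))

sig = toy_signature(original)
print("signature:", sig)
print("original vs variant:  ", toy_score(sig, toy_signature(variant)))
print("original vs unrelated:", toy_score(sig, toy_signature(unrelated)))
```

The variant only disturbs the block or two around the inserted bytes, so its signature stays close to the original's, while the unrelated data scores far lower.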

Key advantages of SSDEEP include:

Ability to detect files that are similar but not identical
Resistance to simple obfuscation techniques commonly used by malware authors
Generation of compact signatures that are easy to store and compare
Fast comparison operations making it practical for large-scale analysis

SSDEEP outputs similarity scores from 0 to 100, where 0 means no similarity was detected and 100 means the two files are identical or nearly so. A common rule of thumb is to treat scores above 50 as significant enough to warrant further investigation.

IMPHASH (Import Hash)

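IMPHASH, originally developed by Mandiant in 2014, is another fuzzy hashing technique specifically designed for Windows Portable Executable (PE) files. Unlike SSDEEP, which analyzes the entire file content, IMPHASH focuses on the Import Address Table (IAT) of executable files. Strictly speaking, the IMPHASH value itself is an exact MD5 hash; the "fuzziness" comes from hashing only the imports, so files that differ everywhere else can still share the same value.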

Here's how IMPHASH works:

It extracts the Import Address Table from a PE file
Combines the DLL names and their imported function names in a specific order
Creates a hash of this combined string using MD5
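Here is a rough Python sketch of those three steps, working from an import list that has already been extracted. The import lists below are invented for illustration, and imports by ordinal are left out. For real PE files, the `pefile` library does the parsing and exposes a `get_imphash()` method, so treat this as a way to see the idea rather than a replacement for it.

```python
import hashlib

def toy_imphash(imports: list) -> str:
    """imports: (dll_name, function_name) pairs in import-table order."""
    parts = []
    for dll, func in imports:
        lib = dll.lower()
        # Drop common library extensions so "KERNEL32.dll" becomes "kernel32".
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[: -len(ext)]
                break
        parts.append(f"{lib}.{func.lower()}")
    # MD5 over the comma-joined "library.function" strings.
    return hashlib.md5(",".join(parts).encode()).hexdigest()

# Hypothetical import lists: two variants of one family, one unrelated tool.
variant_a = [("KERNEL32.dll", "VirtualAlloc"),
             ("KERNEL32.dll", "CreateRemoteThread"),
             ("WS2_32.dll", "connect")]
variant_b = [("kernel32.dll", "VirtualAlloc"),
             ("kernel32.dll", "CreateRemoteThread"),
             ("ws2_32.dll", "connect")]
other_tool = [("KERNEL32.dll", "ReadFile"),
              ("USER32.dll", "MessageBoxA")]

print(toy_imphash(variant_a))
print(toy_imphash(variant_a) == toy_imphash(variant_b))   # True
print(toy_imphash(variant_a) == toy_imphash(other_tool))  # False
```

Notice that the comparison is still all-or-nothing: two files either share an import hash or they don't.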

IMPHASH is particularly useful because:

Malware variants often maintain similar import patterns even when the rest of the code changes
It can identify malware families that use the same codebase or development patterns
It's effective at detecting packed or obfuscated malware that shares similar unpacking routines
The calculation is relatively fast compared to other fuzzy hashing methods
It’s natively supported in popular monitoring tools like Sysmon

However, IMPHASH does have limitations:

Only works with Windows PE files
Can be defeated if malware authors deliberately modify their import patterns
May produce false positives with legitimate applications that share common import patterns (happens more often than you’d think)

TLSH (Trend Micro Locality Sensitive Hash)

TLSH is another fuzzy hashing algorithm that was developed by Trend Micro and released as open source. It's particularly effective at identifying similarities between files, even when they've undergone significant modifications.

Here's how TLSH works:

The file is split into sliding windows
These windows are used to populate an array of bucket counts (similar to a counting Bloom filter)
The bucket counts are processed to generate a digest
Final hash includes file metadata like length and quartile points
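A heavily simplified Python sketch of those steps is below. It is not the real TLSH algorithm: real TLSH takes several byte triplets from each 5-byte window and maps them with a Pearson hash, while this toy takes one triplet per window and uses CRC32. The bucket count, the dictionary layout, and the distance function are my assumptions for illustration.

```python
import random
import zlib

BUCKETS = 128
WINDOW = 5

def bucket_counts(data: bytes) -> list:
    """Slide a 5-byte window over the data and count bucket hits."""
    counts = [0] * BUCKETS
    for i in range(len(data) - WINDOW + 1):
        w = data[i:i + WINDOW]
        triplet = bytes((w[0], w[2], w[4]))  # one triplet per window (toy)
        counts[zlib.crc32(triplet) % BUCKETS] += 1
    return counts

def toy_digest(data: bytes) -> dict:
    """Encode each bucket as 0..3 depending on which quartile it falls in."""
    counts = bucket_counts(data)
    ordered = sorted(counts)
    q1 = ordered[BUCKETS // 4 - 1]
    q2 = ordered[BUCKETS // 2 - 1]
    q3 = ordered[3 * BUCKETS // 4 - 1]
    body = [(c > q1) + (c > q2) + (c > q3) for c in counts]
    return {"length": len(data), "quartiles": (q1, q2, q3), "body": body}

def toy_distance(a: dict, b: dict) -> int:
    """0 means identical bodies; larger means less similar."""
    return sum(abs(x - y) for x, y in zip(a["body"], b["body"]))

rng = random.Random(42)
original = bytes(rng.randrange(256) for _ in range(4096))
variant = original[:1000] + b"0123456789abcdef" + original[1000:]
unrelated = bytes(rng.randrange(256) for _ in range(4096))

d_orig, d_var, d_unrel = map(toy_digest, (original, variant, unrelated))
print("original vs variant:  ", toy_distance(d_orig, d_var))
print("original vs unrelated:", toy_distance(d_orig, d_unrel))
```

Because the digest summarizes byte patterns across the whole file rather than a sequence of blocks, a small insertion only nudges a few bucket counts, and the distance to the variant stays small compared with the unrelated data.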

Key advantages of TLSH include:

More robust against certain types of file modifications compared to SSDEEP
Better performance when comparing large sets of files
Provides distance scores (0 means identical, and larger values mean less similar) that are more consistent across different file sizes
Works well with both small and large files (recommended minimum 50 bytes)

TLSH has some unique characteristics that make it particularly valuable for malware analysis:

The distance scoring is symmetrical, meaning comparing A to B gives the same result as B to A
It's less sensitive to file size differences than other fuzzy hashing algorithms
The algorithm is designed to be robust against adversarial modifications
Can effectively cluster similar files in large datasets

Like other fuzzy hashing methods, TLSH should be used as part of a comprehensive malware analysis strategy, often in combination with SSDEEP and IMPHASH for the most effective results.

Want to try it yourself?

I've created a hands-on lab so you can get practical experience with this concept.

The lab consists of a single PowerShell script which does the following:

Downloads and unzips SSDEEP to the working directory.
Copies itself 9 times, adding a slight modification (random GUID) to each copy.
Captures static and fuzzy hashes of the original and each copy.
Interactively compares the hash techniques.
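If you'd like a feel for the copy-and-modify step before running the lab, here is a small Python sketch of the same idea. It is not the lab script: the file name, the file contents, and the helper names are invented, and it only captures SHA-256 values (the fuzzy hashes in the lab come from the SSDEEP download).

```python
import hashlib
import tempfile
import uuid
from pathlib import Path

def make_variants(original: Path, count: int = 9) -> list:
    """Copy the file `count` times, appending a random GUID to each copy."""
    variants = []
    for n in range(1, count + 1):
        copy = original.with_name(f"{original.stem}_copy{n}{original.suffix}")
        guid_line = f"\n# {uuid.uuid4()}\n".encode()
        copy.write_bytes(original.read_bytes() + guid_line)
        variants.append(copy)
    return variants

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "sample.ps1"  # hypothetical file name
    original.write_text("Write-Host 'hello from the lab'\n")
    files = [original] + make_variants(original)
    hashes = {f.name: sha256_of(f) for f in files}

for name, digest in hashes.items():
    print(f"{digest[:16]}...  {name}")
print(len(set(hashes.values())), "distinct SHA-256 values")
```

Every copy differs from the original by a single appended line, yet each one gets its own unrelated-looking SHA-256.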

Lab Instructions

While you can fire up a VM for this one, it’s not necessary. Simply delete the folder when you’re done.

Create a folder anywhere on your system, like your Desktop or a temporary folder.
Download this script to a file called HashMorpher.ps1 in the newly created folder.
Inspect the script! It’s straightforward and well-commented. If you’re feeling paranoid, feed it to ChatGPT!
Run the script in PowerShell.
.\HashMorpher.ps1
Follow the prompts, learn cool stuff.
Repeat as many times as you wish.
Delete the folder when you’re done.

Eric’s Substack