What’s So Fuzzy About It?
Sometimes you know what you’re looking for in life, and other times you’re not so sure. The same concept applies to searching data when there’s so much of it. Whether you’re a cybersecurity analyst or a data scientist, you often only have a starting thread on what you need to find. But how do you search for data using ambiguous terms and qualifiers? One way is by using fuzzy logic.
But when would you actually use it? There are several use cases where fuzzy logic can be incredibly helpful and may offer the only solution:
- Spelling mistakes or typos
- Spoofing (customer names, domains)
- Abbreviations & synonyms
- Added/missing data
- Email addresses
- Customer names & addresses
- Product names
It can also be helpful for producing features for machine learning models, which can go on to produce classification, clustering, and field value predictions.
To apply fuzzy logic is to evaluate something based on “degrees of truth” rather than the “true or false” Boolean logic that machines typically render. It’s closer to the way the human brain works, with few things being 100% absolute in nature. For example, how suspicious is someone you randomly meet in public? You wouldn’t say they’re 0% or 100% suspicious, but generally somewhere in between “I’d let them borrow my tools” and “I’ll wait for the next train instead.” Fuzzy logic operates in a similar way, and there are multiple approaches to apply this type of logic to your data. It can be used for inferencing, applying deductive reasoning, or looking for similarities in datasets.
Fuzzy Logic in Splunk
One simple way to use fuzzy logic in Splunk is by using string similarity algorithms. The Python jellyfish library provides several of these, each with its own benefits and drawbacks. There’s also a Splunk app for it, called Jellyfisher, which will compare two strings within a given event. The algorithms provided by jellyfish are grouped into string comparison, phonetic encoding, and stemming. For our purposes, we’re going to focus on string comparison algorithms, which include:
- Levenshtein – The number of insertions, deletions, and substitutions required to turn one string into the other.
- Damerau-Levenshtein – Like Levenshtein, but counts transpositions as a single change.
- Hamming – The number of characters that are different between two strings of equal length.
- Jaro & Jaro-Winkler – A similarity score between 0 and 1 (1 = identical).
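To make the difference between these distance metrics concrete, here is a minimal pure-Python sketch. It is not the jellyfish implementation: the function names `hamming` and `edit_distance` are made up for illustration, and the transposition handling uses the "optimal string alignment" variant of Damerau-Levenshtein, which is an assumption on our part.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Levenshtein distance between a and b.

    With transpositions=True, swapping two adjacent characters also counts
    as a single edit (optimal string alignment, a restricted form of
    Damerau-Levenshtein; chosen here as an illustrative assumption).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


print(edit_distance("splunk", "spulnk"))                       # 2
print(edit_distance("splunk", "spulnk", transpositions=True))  # 1
print(hamming("splunk", "spulnk"))                             # 2
```

A swapped pair of letters, one of the most common typos, costs two edits under plain Levenshtein but only one once transpositions are counted.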
Of the string comparison algorithms provided, the Levenshtein algorithm is arguably the most popular. The URL Toolbox app uses it, which is well documented in this Splunk Blog article. There’s also the Fuzzy Search app, which will compare strings in a search field against strings you specify in your search command. But what if you have a large or dynamic comparison dataset (for example, a blacklist or whitelist)? What if you need to enrich your search results based on the most similar string? Enter Fuzzylookup.
Inspired by customer use cases, we built a search command that uses lookups to drive fuzzy logic searches. Similar to Fuzzy Search, Fuzzylookup uses the Levenshtein algorithm to determine string similarity between event fields and lookup fields and computes a score, while also enriching the event with the other values from the corresponding lookup row. We added some special sauce to make sure we grab the best entry from the lookup if there is more than one with the same “distance” metric. Since we’re comparing each search result event against each lookup row, it can get very CPU-intensive. For example, if you have 10,000 search results and a 1,000-row lookup, you’re talking about 10 million comparisons. Don’t worry though – we’ve given you the tools to make it more efficient (see the docs for text masking and deletion).
Before we get carried away, though, let’s take a look at a few simple examples of Levenshtein in action. For each pair of words below, we calculate the Levenshtein distance, subtract it from the number of letters in the longer word, then divide by that same number to get a similarity score.
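| String 1 | String 2 | Levenshtein distance | Similarity score |
|---|---|---|---|
| boom | boon | 1 | (4-1)/4 = 75% |
| hose | house | 1 | (5-1)/5 = 80% |
| book | back | 2 | (4-2)/4 = 50% |
| midnight | daylight | 4 | (8-4)/8 = 50% |
| mainstream | reinstate | 6 | (10-6)/10 = 40% |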
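If you’d like to check these numbers yourself, the calculation can be sketched in a few lines of Python. This is a plain textbook Levenshtein implementation written for illustration, not the code Fuzzylookup runs, and the names `levenshtein` and `similarity` are our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """(longest length - distance) / longest length, as a value from 0 to 1."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return (longest - levenshtein(a, b)) / longest


pairs = [("boom", "boon"), ("hose", "house"), ("book", "back"),
         ("midnight", "daylight"), ("mainstream", "reinstate")]
for left, right in pairs:
    print(f"{left:<12}{right:<12}{levenshtein(left, right)}  "
          f"{similarity(left, right):.0%}")
```

Dividing by the length of the longer word keeps the score between 0% and 100%, since the distance can never exceed that length.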
Fuzzylookup runs a computation like this to calculate a similarity score, but it also takes the character overlap (CO) into account: the Levenshtein component is weighted at 75% and the CO at 25%. Because several lookup entries can end up with the same Levenshtein distance, we need to make sure the most accurate match is made, so among the entries with the best score we keep the ones with the longest sequence length (LCS). Once we have a similarity score, we can look for near-matches and pull in their lookup fields to give us additional context around an event.
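Here is a rough Python sketch of what a 75/25 blend like this could look like. Only the two weights come from the description above. The definition of character overlap used here (shared characters, counted with multiplicity, divided by the longer length) is an assumption, as are the function names, and the LCS tie-break is left out. The Levenshtein distance is passed in as an argument so the sketch can be paired with any implementation.

```python
from collections import Counter

LEVENSHTEIN_WEIGHT = 0.75  # weight stated in the text
OVERLAP_WEIGHT = 0.25      # weight stated in the text


def character_overlap(a: str, b: str) -> float:
    """Hypothetical CO metric: characters the two strings share (counted
    with multiplicity, ignoring position) over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / longest


def blended_similarity(a: str, b: str, distance: int) -> float:
    """Blend a Levenshtein-based similarity with character overlap.

    `distance` is the precomputed Levenshtein distance between a and b.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    lev_similarity = (longest - distance) / longest
    return (LEVENSHTEIN_WEIGHT * lev_similarity
            + OVERLAP_WEIGHT * character_overlap(a, b))


# Both pairs score 50% on Levenshtein alone (distances 2 and 4).
print(blended_similarity("book", "back", distance=2))          # 0.5
print(blended_similarity("midnight", "daylight", distance=4))  # 0.53125
```

Under these assumptions, the overlap term separates two pairs that plain Levenshtein similarity scores identically, which is the kind of tie a second signal is meant to break.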
Let’s take a look at a fictional use case. An electrical company called Turtle Power has been seeing a lot of phishing attacks from external email addresses that appear to be spoofing internal employee account names. They’ve stood up defenses to block external senders that match internal accounts, but the adversaries have bypassed the controls using similar names, such as adding dots and numbers. Turtle Power took the following steps to combat this issue:
- Implemented Fuzzylookup within their Splunk environment.
- Created an identities lookup with internal email addresses.
- Created searches for their incoming email logs to look for external senders, then piped that into Fuzzylookup to identify near-matches to internal account names.
- Set up an automated alert action that quarantines incoming emails when the similarity score is above a specific threshold.
The alert search they used would look similar to the following:
index=mail src_user!="*@turtlepower.nyc" recipient="*@turtlepower.nyc" link_count>0
| fuzzylookup add_metrics=true email AS src_user OUTPUT email AS spoofedacct_email first AS spoofedacct_firstname last AS spoofedacct_lastname
| search fuzzy_score<6 OR fuzzy_similarity>85
| table _time src_user recipient spoofedacct_email spoofedacct_firstname spoofedacct_lastname
This search looks for incoming emails from external senders that contain links, and identifies which of those senders mimic internal email accounts.
Turtle Power estimates that they’ll stop about 75% of these new phishing attacks, which will significantly lower their risk of compromise from this adversary and the data loss that could result from it.
Where to Find It
We’re excited to publish this capability on Splunkbase for free to all Splunk users, and we look forward to your feedback. The project is open source and can be found on GitHub, where the usage is well documented and we welcome your pull requests and bug reports. Happy Splunking!