This is part two of a three-part blog series on building a DGA classifier, and it is split into the three phases of building a classifier: 1) data preparation, 2) feature engineering and 3) model selection. Back in part 1, we prepared the data, and we are starting with a nice clean list of domains labeled as either legitimate (“legit”) or generated by an algorithm (“dga”).

```r
library(dga)
data(sampledga)
```

In any machine learning approach, you will want to construct a set of “features” that help describe each class or outcome you are attempting to predict. The challenge with selecting features is that different models have different assumptions and restrictions on the type of data fed into them. For example, a linear regression model is very picky about correlated features, while a random forest model will handle those without much of a hiccup. But that’s something we’ll have to face when we are selecting a model. For now, we will want to gather up all the features we can think of (and have time for), and then we can sort them out in the final model. In our case, all we have to go off of is a domain name: a string of letters, numbers and maybe a dash. We have to think about what makes the domains generated by an algorithm that much different from a normal domain.
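As a first concrete example of a feature, domain length costs almost nothing to compute. Here is a minimal sketch; the toy data frame below is illustrative and simply mimics the `domain`/`class` layout we built in part 1, not the actual `sampledga` data:

```r
# length in characters is about the simplest feature we can compute;
# the data frame here is a stand-in mimicking the labeled data from part 1
domains <- data.frame(
  domain = c("facebook", "sandbandcandy", "kykwdvibps"),
  class  = c("legit", "legit", "dga"),
  stringsAsFactors = FALSE
)
domains$length <- nchar(domains$domain)
domains$length
## [1]  8 13 10
```

The same one-liner (`nchar()`) works directly on the `domain` column of whatever labeled data you have.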
For example, in the Click Security example they calculate the following features for each domain:

- length in characters
- entropy (range of characters)
- n-grams (3, 4, 5) and the “distance” from the n-grams of known legit domains
- n-grams (3, 4, 5) and the “distance” from the n-grams of dictionary words
- difference between the two distance calculations

There is an almost endless list of other features you could come up with beyond those:

- ratio of numbers to (length/vowels|consonants/non-numbers)
- ratio of vowels to (length/numbers/etc.)
- proportion matching dictionary words
- largest dictionary word match
- all the combinations of n-grams (mentioned above)
- Markov chain of probable combinations of characters

## Simplicity is the name of the game

I know it may seem a bit counter-intuitive, but simplicity is the name of the game when doing feature selection. At first thought, you may think you should try every feature (and combination of features) so you can build the very best model you can, but there are many reasons not to do that. First, no model or algorithm is going to be perfect, and the more robust solutions will employ a variety of approaches (not just a single algorithm), so striving for perfection has diminishing returns. Second, adding too many features may cause you to overfit to your training data. That means you could build a model that appears to be very accurate in your tests, but stinks with any new data (or new domain generating algorithms, in our case). Finally, every feature will take some level of time and effort to generate and process, and these add up quickly. The end result is that you should have just enough features to be helpful, and no more than that. The Click Security example, in my opinion, does an excellent job at this balance with just a handful of features. Also, there isn’t any exact science to selecting features, so get the notion that science is structured, clean and orderly right out of your head.
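The “entropy” feature from that first list is worth making concrete. Here is a minimal sketch of Shannon entropy over a string’s character distribution; this is my own illustration of the idea, not Click Security’s actual code:

```r
# Shannon entropy (in bits) of the character distribution of a string;
# a sketch of the "entropy" feature, not the Click Security implementation
entropy <- function(s) {
  p <- table(strsplit(s, "")[[1]]) / nchar(s)
  -sum(p * log2(p))
}
entropy("aaaaaaaa")    # one repeated character: no surprise at all
## [1] 0
entropy("kykwdvibps")  # varied characters: much higher entropy
## [1] 3.121928
```

A single repeated character carries no information, while a string of mostly distinct characters (as algorithmically generated domains tend to be) scores much higher.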
Feature selection will, at least at this stage, rely heavily on domain expertise. As we get to model selection, we will be weeding out variables that don’t help, are too slow or contradict the model we are testing. For now, think of what makes a domain name usable. For example, when people create a domain name they focus on readability, so they may include one set of digits together (“host55”) but would rarely do “h5os5t”, so perhaps looking at number usage could be good. Or you could assume that randomly selecting from 26 characters and 10 numbers will create some very strange combinations of characters not typically found in a language. Therefore, in legitimate domains, you expect to see more combinations like “est” and fewer like “0zq”. The task when doing feature selection is to find attributes that indicate the difference. Just as in real life, where a good feature for telling a car from a motorcycle may be the number of tires on the road, you want to find attributes to measure that separate legitimate domains from those generated by an algorithm.

## N-Grams

I hinted at n-grams in the previous paragraph, and they may be a little difficult to grasp if you’ve never thought about them. But they are built on the premise that there are frequent character patterns in natural language. Anyone who’s watched “Wheel of Fortune” knows that consonants like r, s, t, n and l appear a lot more often than m, w and q, and nobody guesses “q” as their first-choice letter. One of the features you could include is a simple count of characters like that (it could be called a “1-gram”, “n-gram of 1”, “unigram” or simply “character frequency”, since it counts single characters). Randomly generated domains would have a much different distribution of characters than those generated based on natural language. That difference should help an algorithm correctly classify between the two.
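The number-usage idea above is easy to turn into ratio features. A minimal sketch, with function names that are my own (they are illustrative, not from any package):

```r
# ratio-style features from the paragraph above; the function names
# are illustrative, not part of any library
digit_ratio <- function(s) {
  nchar(gsub("[^0-9]", "", s)) / nchar(s)
}
vowel_ratio <- function(s) {
  nchar(gsub("[^aeiou]", "", s)) / nchar(s)
}
digit_ratio("host55")   # 2 digits out of 6 characters
## [1] 0.3333333
vowel_ratio("facebook") # 4 vowels out of 8 characters
## [1] 0.5
```

The same `gsub()` pattern trick (strip everything *except* the class of interest, then count what is left) extends to consonants, dashes or any other character class you want a ratio for.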
But you can get fancier than that and look at the frequency of combinations of characters; the “n” in “n-grams” represents a variable length. You could look for combinations of 3 characters, so let’s take a look at how that looks with the stringdist package and the qgrams() function.

```r
library(stringdist)
qgrams("facebook", q=3)
##    fac ook ace ceb ebo boo
## V1   1   1   1   1   1   1
qgrams("sandbandcandy", q=3)
##    san and ndb ndc ndy dba dca ban can
## V1   1   3   1   1   1   1   1   1   1
qgrams("kykwdvibps", q=3)
##    kyk ykw wdv vib kwd dvi ibp bps
## V1   1   1   1   1   1   1   1   1
```

See how the function pulls out groups of 3 characters that appear contiguously? Also, look at the difference between the collections of trigrams: the first two don’t look too weird, but the output from “kykwdvibps” probably doesn’t match your expectation of the character combinations you are used to in the English language. That is what we want to capitalize on. All we have to do is teach the algorithm everything about the English language, easy right? Actually, we just have to teach it what should be “expected” as far as character combinations, and we can do that by figuring out what n-grams appear in legitimate domains and then calculating the difference.

```r
# pull domains where class is "legit"
legitgram3
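To make that “expected n-grams” idea concrete before we continue, here is a minimal sketch: tally trigrams across a handful of legit domains, then check what fraction of a candidate domain’s trigrams were ever observed in that set. The domain vector and all the names here (`legit_grams`, `seen_fraction`) are illustrative stand-ins, not the actual code we will build from `sampledga`:

```r
library(stringdist)

# illustrative stand-in for a set of known-legit domains
legit <- c("facebook", "google", "twitter", "linkedin", "youtube")

# qgrams() accepts a vector and tallies trigrams across all its elements
legit_grams <- qgrams(legit, q = 3)

# fraction of a candidate domain's trigrams seen in the legit set
seen_fraction <- function(domain, grams) {
  g <- colnames(qgrams(domain, q = 3))
  mean(g %in% colnames(grams))
}
seen_fraction("faceboot", legit_grams)   # mostly familiar trigrams
seen_fraction("kykwdvibps", legit_grams) # none of its trigrams appear
```

With a real corpus of legit domains you would weight by trigram frequency rather than simple membership, which is essentially what the “distance” features in the Click Security list do.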