{"id":6560,"date":"2016-11-18T04:45:12","date_gmt":"2016-11-18T09:45:12","guid":{"rendered":"https:\/\/www.kaspersky.co.in\/blog\/?p=6560"},"modified":"2017-05-19T00:31:30","modified_gmt":"2017-05-19T04:31:30","slug":"how-machine-learning-works-simplified","status":"publish","type":"post","link":"https:\/\/www.kaspersky.co.in\/blog\/how-machine-learning-works-simplified\/6560\/","title":{"rendered":"How machine learning works, simplified"},"content":{"rendered":"<p>Lately, tech companies have gone absolutely crazy for machine learning. They say it solves problems that only people could crack before. Some even go as far as calling it \u201cartificial intelligence.\u201d Machine learning is of special interest in IT security, where the threat landscape is rapidly shifting and we need to come up with adequate solutions.<\/p>\n<p><a href=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2016\/11\/05085920\/machine-learning-featured-1-1024x672.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-13489\" src=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2016\/11\/05085920\/machine-learning-featured-1-1024x672.jpg\" alt=\"How machine learning works, simplified\" width=\"1280\" height=\"840\"><\/a><\/p>\n<p>Technology comes down to speed and consistency, not tricks. And machine learning is based on technology, making it easy to explain in human terms. So, let\u2019s get down to it: We will be solving a real problem by means of a working algorithm \u2014 a machine-learning-based algorithm. The concept is quite simple, and it delivers real, valuable insights.<\/p>\n<h3>Problem: Distinguish meaningful text from gibberish<\/h3>\n<p>Human writing (in this case, Terry Pratchett\u2019s writing) might look like this:<\/p>\n<p><code>Give a man a fire and he's warm for the day. 
But set fire to him and he's warm for the rest of his life<\/code><br>\n<code>It is well known that a vital ingredient of success is not knowing that what you're attempting can't be done<\/code><br>\n<code>The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it<\/code><\/p>\n<p>Gibberish looks more like this:<\/p>\n<p><code>DFgdgfkljhdfnmn vdfkjdfk kdfjkswjhwiuerwp2ijnsd,mfns sdlfkls wkjgwl<\/code><br>\n<code>reoigh dfjdkjfhgdjbgk nretSRGsgkjdxfhgkdjfg gkfdgkoi<\/code><br>\n<code>dfgldfkjgreiut rtyuiokjhg cvbnrtyu<\/code><\/p>\n<p><b>Our task is to develop a machine-learning algorithm that can tell those apart.<\/b> Though trivial for a human, who can see immediately which line is which, the task is a real challenge for a machine: It takes a lot to formalize the difference and explain it to a computer. We use machine learning here: We feed some examples to the algorithm and let it \u201clearn\u201d how to reliably answer the question, \u201cIs it human or gibberish?\u201d Every time a real-world antivirus program analyzes a file, that\u2019s essentially what it\u2019s doing.<\/p>\n<p>Because we are covering the subject within the context of IT security, and the main aim of antivirus software is to find malicious code in a huge amount of clean data, we\u2019ll refer to meaningful text as \u201cclean\u201d and gibberish as \u201cmalicious.\u201d<\/p>\n<h3>Solution: Use an algorithm<\/h3>\n<p>Our algorithm will calculate the frequency of one particular letter being followed by another particular letter, thus analyzing all possible letter pairs. 
For example, for our first phrase, \u201cGive a man a fire and he\u2019s warm for the day. But set fire to him and he\u2019s warm for the rest of his life,\u201d which we know to be clean, the frequency of particular letter pairs looks like this:<\/p>\n<p>Bu \u2014 1<br>\nGi \u2014 1<br>\nan \u2014 3<br>\nar \u2014 2<br>\nay \u2014 1<br>\nda \u2014 1<br>\nes \u2014 1<br>\net \u2014 1<br>\nfe \u2014 1<br>\nfi \u2014 2<br>\nfo \u2014 2<br>\nhe \u2014 4<br>\nhi \u2014 2<br>\nif \u2014 1<br>\nim \u2014 1<\/p>\n<p>To keep it simple, we ignore punctuation marks and spaces. So, in that phrase, <em>a<\/em> is followed by <em>n<\/em> three times, <em>f<\/em> is followed by <em>i<\/em> two times, and <em>a<\/em> is followed by <em>y<\/em> one time.<\/p>\n<p>At this stage, we understand one phrase is not enough to make our model learn: We need to analyze a bigger string of text. So let\u2019s count the letter pairs in <em>Gone with the Wind<\/em>, by Margaret Mitchell \u2014 or, to be precise, in the first 20% of the book. Here are a few of them:<\/p>\n<p>he \u2014 11460<br>\nth \u2014 9260<br>\ner \u2014 7089<br>\nin \u2014 6515<br>\nan \u2014 6214<br>\nnd \u2014 4746<br>\nre \u2014 4203<br>\nou \u2014 4176<br>\nwa \u2014 2166<br>\nsh \u2014 2161<br>\nea \u2014 2146<br>\nnt \u2014 2144<br>\nwc \u2014 1<\/p>\n<p>As you can see, the probability of encountering the <em>he<\/em> combination is almost twice as high as that of seeing <em>an<\/em>. And <em>wc<\/em> appears just once (the only occurrence is in <em>newcomer<\/em>).<\/p>\n<p>So, now we have a model for clean text, but how do we use it? First, to define the probability of a line being clean or malicious, we\u2019ll define its <em>authenticity<\/em>. 
We will look up the frequency of each pair of letters in the model (that is, evaluate how realistic each combination of letters is) and then multiply those numbers:<\/p>\n<p><code>F(Gi) * F(iv) * F(ve) * F(e ) * F( a) * F(a ) * F( m) * F(ma) * F(an) * F(n ) * \u2026<\/code><br>\n<code>6 * 364 * 2339 * 13606 * 8751 * 1947 * 2665 * 1149 * 6214 * 5043 * \u2026<\/code><\/p>\n<p>In determining the final value of authenticity, we also consider the number of symbols in the line: The longer the line, the more numbers we multiply. So, to make this value equally suitable for short and long lines, we do some math magic: We take the root of degree \u201clength of the line in question minus one\u201d of the result. A line of N characters contains N minus one pairs, so this is the geometric mean of the pair frequencies.<\/p>\n<h3>Using the model<\/h3>\n<p>Now we can draw some conclusions: The higher the calculated number, the better the line in question fits into our model \u2014 and consequently, the greater the likelihood of it having been written by a human. If the text yields a high number, we can call it <em>clean<\/em>.<\/p>\n<p>If the line in question contains a suspiciously large number of rare combinations (like <em>wx<\/em>, <em>zg<\/em>, <em>yq<\/em>, etc.), it\u2019s more likely malicious.<\/p>\n<p>For the lines we chose for analysis, we measure the likelihood (\u201cauthenticity\u201d) in points, as follows:<\/p>\n<p><code>Give a man a fire and he's warm for the day. 
But set fire to him and he's warm for the rest of his life \u2014 1984 points<\/code><br>\n<code>It is well known that a vital ingredient of success is not knowing that what you're attempting can't be done \u2014 1601 points<\/code><br>\n<code>The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it \u2014 2460 points<\/code><br>\n<code>DFgdgfkljhdfnmn vdfkjdfk kdfjkswjhwiuerwp2ijnsd,mfns sdlfkls wkjgwl \u2014 16 points<\/code><br>\n<code>reoigh dfjdkjfhgdjbgk nretSRGsgkjdxfhgkdjfg gkfdgkoi \u2014 9 points<\/code><br>\n<code>dfgldfkjgreiut rtyuiokjhg cvbnrtyu \u2014 43 points<\/code><\/p>\n<p>As you can see, <em>clean<\/em> lines score well over 1,000 points and <em>malicious<\/em> ones don\u2019t reach even 100 points. It seems our algorithm works as expected.<\/p>\n<p>As for deciding which scores count as high and which as low, the best way is to delegate this work to the machine as well, and let it learn. To do this, we\u2019ll submit a number of real, clean lines and calculate their authenticity, and then submit some malicious lines and repeat. Then we\u2019ll calculate the baseline for evaluation, the score that separates the two groups. In our case, it is about 500 points.<\/p>\n<h3>In real life<\/h3>\n<p>Let\u2019s go over what we\u2019ve just done.<\/p>\n<p><b>1. We defined the features of clean lines (i.e., pairs of characters).<\/b><\/p>\n<p>In real life, when developing a working antivirus, analysts also define features of files and other objects. Their contributions are vital: It\u2019s still a human task to define what features to evaluate in the analysis, and the researchers\u2019 level of expertise and experience directly influences the quality of the features. For example, who said one needs to analyze characters in pairs and not in threes? Such hypotheses are also evaluated in antivirus labs. 
I should note here that we at Kaspersky Lab use machine learning to select the best, mutually complementary features.<\/p>\n<p><b>2. We used the defined indicators to build a mathematical model, which we trained on a set of examples.<\/b><\/p>\n<p>Of course, in real life the models are a tad more complex. Now, the best results come from a decision tree ensemble built by the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Gradient_boosting\" target=\"_blank\" rel=\"noopener nofollow\">gradient boosting<\/a> technique, but as we continue to strive for perfection, we cannot sit idle and simply accept today\u2019s best.<\/p>\n<p><b>3. We used a simple mathematical model to calculate the authenticity rating.<\/b><\/p>\n<p>To be honest, in real life, we do quite the opposite: We calculate the \u201cmalice\u201d rating. That may not seem very different, but imagine how inauthentic a line in another language or alphabet would seem in our model. It is unacceptable for an antivirus to provide false responses when checking a whole new class of files just because it does not know them yet.<\/p>\n<h3>An alternative to machine learning?<\/h3>\n<p>Some 20 years ago, when malware was less abundant, \u201cgibberish\u201d could be easily detected by signatures (distinctive fragments). In the examples above, the signatures might look like this:<\/p>\n<p><code>DFgdgfkljhdfnmn vdfkjdfk kdfjkswjhwiu<b>erwp2ij<\/b>nsd,mfns sdlfkls wkjgwl<\/code><br>\n<code>reoigh dfjdkjfhgdjbgk nretSRGs<b>gkjdxfhg<\/b>kdjfg gkfdgkoi<\/code><\/p>\n<p>An antivirus program scanning the file and finding <b>erwp2ij<\/b> would reckon: \u201cAha, this is gibberish #17.\u201d On finding <b>gkjdxfhg<\/b>, it would recognize gibberish #139.<\/p>\n<p>Then, some 15 years ago, when the population of malware samples had grown significantly, \u201cgeneric\u201d detection took center stage. A virus analyst defined the rules, which, when applied to meaningful text, looked something like this:<\/p>\n<p>1. 
The length of a word should be 1 to 20 characters.<\/p>\n<p>2. Capital letters and digits are rarely placed in the middle of a word.<\/p>\n<p>3. Vowels are relatively evenly mixed with consonants.<\/p>\n<p>And so on. If a line does not comply with a number of these rules, it is detected as malicious.<\/p>\n<p>In essence, the principle worked just the same, but in this case a set of rules, which analysts had to write manually, substituted for a mathematical model.<\/p>\n<p>And then, some 10 years ago, when the number of malware samples grew to surpass any previously imagined levels, machine-learning algorithms slowly started to find their way into antivirus programs. At first, in terms of complexity they did not stretch too far beyond the primitive algorithm we described earlier as an example. But by then we were actively recruiting specialists and expanding our expertise. As a result, we have the <a href=\"https:\/\/www.kaspersky.com\/top3\" target=\"_blank\" rel=\"noopener nofollow\">highest level<\/a> of detection among antiviruses.<\/p>\n<p>Today, no antivirus would work without machine learning. If we compare detection methods, machine learning would tie with some advanced techniques such as behavioral analysis. However, behavioral analysis itself uses machine learning! All in all, machine learning is essential for efficient protection. Period.<\/p>\n<h3>Drawbacks<\/h3>\n<p>Machine learning has so many advantages \u2014 is it a cure-all? Well, not really. 
This method works efficiently if the aforementioned algorithm functions in the cloud or some kind of infrastructure that learns from analyzing a huge number of both <em>clean<\/em> and <em>malicious<\/em> objects.<\/p>\n<p>Also, it helps to have a team of experts to supervise this learning process and intervene every time their experience would make a difference.<\/p>\n<p>In this case, drawbacks are minimized \u2014 down to, essentially, one drawback: the need for an expensive infrastructure solution and a highly paid team of experts.<\/p>\n<p>But if someone wants to severely cut costs and use only the mathematical model, and only on the product side, things may go wrong.<\/p>\n<p><b>1. False positives.<\/b><\/p>\n<p>Machine-learning-based detection is always about finding a sweet spot between the level of detected objects and the level of false positives. If we want to detect more, we will eventually get more false positives. With machine learning, they might emerge somewhere you never imagined or predicted. For example, the clean line \u201cVisit Reykjavik\u201d would be detected as malicious, getting only 101 points in our rating of authenticity. That\u2019s why it\u2019s essential for an antivirus lab to keep records of clean files to enable the model\u2019s learning and testing.<\/p>\n<p><b>2. Model bypass.<\/b><\/p>\n<p>A malefactor might take such a product apart and see how it works. Criminals are human, making them more creative (if not smarter) than a machine, and they would adapt. For example, the following line is considered clean, even though its first part is clearly (to human eyes) malicious: \u201cdgfkljhdfnmnvdfkHere\u2019s a whole bunch of good text thrown in to mislead the machine.\u201d However smart the algorithm, a smart human can always find a way to bypass it. 
That\u2019s why an antivirus lab needs a highly responsive infrastructure to react instantly to new threats.<\/p>\n<div id=\"attachment_13488\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2016\/11\/05085919\/gibberish-EN.gif\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-13488\" src=\"https:\/\/media.kasperskydaily.com\/wp-content\/uploads\/sites\/36\/2016\/11\/05085919\/gibberish-EN.gif\" alt=\"How machine learning works, simplified\" width=\"640\" height=\"204\"><\/a>\n<p class=\"wp-caption-text\">Here\u2019s an example of how the aforementioned mathematical model can be fooled: The words look authentic, but in fact it\u2019s gibberish. <a href=\"https:\/\/writingisfun-damental.com\/tag\/gibberish-ryan-leslie\/\" target=\"_blank\" rel=\"noopener nofollow\">Source<\/a><\/p>\n<\/div>\n<p><b>3. Model update.<\/b><\/p>\n<p>Describing the aforementioned algorithm, we mentioned that a model that learned from English texts won\u2019t work for texts in other languages. From this perspective, malicious files (provided they are created by humans, who can think outside the box) are like a steadily evolving alphabet. The threat landscape is very volatile. Through long years of research, Kaspersky Lab has developed a balanced approach: We update our models step-by-step directly in our antivirus databases. 
This enables us to provide extra learning or even a complete change of the learning angle for a model, without interrupting its usual operations.<\/p>\n<h3>Conclusion<\/h3>\n<p>With considerable respect for machine learning and its huge importance in the cybersecurity world, we at Kaspersky Lab think that <a href=\"https:\/\/www.kaspersky.com\/top3\" target=\"_blank\" rel=\"noopener nofollow\">the most efficient cybersecurity approach<\/a> is based on a multilevel paradigm.<\/p>\n<p>Antivirus should be all-around perfect, with its behavioral analysis, machine learning, and many other things. But we\u2019ll speak about those \u201cmany other things\u201d next time.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Lately, tech companies have gone absolutely crazy for machine learning. They say it solves the problems only people could crack before. Some even go as far as calling it \u201cartificial<\/p>\n","protected":false},"author":669,"featured_media":6561,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[7,1342],"tags":[1220,1922,1923,1924,1925,321],"class_list":{"0":"post-6560","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-products","8":"category-technology","9":"tag-antivirus","10":"tag-explainer","11":"tag-in-simple-words","12":"tag-machine-learning","13":"tag-mathematical-model","14":"tag-technology"},"hreflang":[{"hreflang":"en-in","url":"https:\/\/www.kaspersky.co.in\/blog\/how-machine-learning-works-simplified\/6560\/"},{"hreflang":"zh","url":"https:\/\/www.kaspersky.com.cn\/blog\/how-machine-learning-works-simplified\/5009\/"}],"acf":[],"banners":"","maintag":{"url":"https:\/\/www.kaspersky.co.in\/blog\/tag\/antivirus\/","name":"Antivirus"},"_links":{"self":[{"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/posts\/6560","targetHints":{"allow":["GET"]}}],"collection":[{"href":
"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/users\/669"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/comments?post=6560"}],"version-history":[{"count":1,"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/posts\/6560\/revisions"}],"predecessor-version":[{"id":7636,"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/posts\/6560\/revisions\/7636"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/media\/6561"}],"wp:attachment":[{"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/media?parent=6560"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/categories?post=6560"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kaspersky.co.in\/blog\/wp-json\/wp\/v2\/tags?post=6560"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}