Machine Learning is to Big Data as human learning is to life experience. It is the science of getting computers to act without being explicitly programmed. We interpolate and extrapolate from past experiences to deal with unfamiliar situations, and we use data to make decisions that lead to actions. Our decisions are descriptive, diagnostic or predictive, depending on whether the prior knowledge answers “what happened”, “why did it happen” or “what will happen”. Machine learning (ML), combined with data analysis, models this behavior at massive scale, and it has significant applicability in forensics, cybersecurity and information assurance.
Human learning is what we understand best, and it continues to be the best form of learning. One way to look at machine learning is as artificial intelligence that approaches human-level learning and can be applied broadly.
Machine learning, like human learning, is a 5-step process.
- Define the problem – e.g., determine whether a domain is safe or malicious.
- Harvest the data set – e.g., gather and label the data from which better models can be trained.
- Create a capability – e.g., build a new capability to detect novel data patterns.
- Validate the model – e.g., combine the capabilities created and test them to form a working model.
- Operationalize – e.g., use this model to make decisions; through continuous use the model keeps training (and becomes smarter at making predictions over time).
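The five steps above can be sketched end to end in code. Everything here is illustrative: the digit-ratio feature, the toy domains and the threshold search are assumptions made for the sketch, not a real detection rule.

```python
# A hypothetical walk through the 5-step loop for domain classification.
# Feature, domains and threshold search are illustrative assumptions.

def digit_ratio(domain):
    """Steps 1-2: define the problem signal over harvested data --
    the fraction of digits in the leftmost domain label."""
    name = domain.split(".")[0]
    return sum(c.isdigit() for c in name) / max(len(name), 1)

def build_classifier(labeled_domains):
    """Step 3: create a capability -- pick the threshold that best
    separates safe from malicious on the training data."""
    best_t, best_acc = 0.0, 0.0
    for t in [i / 10 for i in range(11)]:
        acc = sum((digit_ratio(d) > t) == bad
                  for d, bad in labeled_domains) / len(labeled_domains)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

training = [("mail.example.com", False), ("news.example.org", False),
            ("x9k2q7.example.net", True), ("a1b2c3d4.example.net", True)]
threshold = build_classifier(training)

# Step 4: validate against held-out examples the model never saw.
holdout = [("blog.example.com", False), ("q8w7e6r5.example.net", True)]
accuracy = sum((digit_ratio(d) > threshold) == bad
               for d, bad in holdout) / len(holdout)

def is_malicious(domain):
    """Step 5: operationalize -- use the model to make decisions."""
    return digit_ratio(domain) > threshold
```

In a real deployment, step 5 would also feed each decision (and any analyst correction) back into the training set, which is what makes the model "smarter over time".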
Machine learning, together with data science and big data, has emerged as a mainstream technology in cybersecurity after proving its success in applications ranging from recommendation systems (like Amazon and Netflix) to voice-recognition systems (like Apple Siri and Microsoft Cortana).
In cyber security, data models need to have predictive power to implicitly distinguish between normal benign network traffic and abnormal, potentially malicious traffic that can be an indicator of a cyber-attack vector. This is where machine learning is used to build classifiers, as the goal of the models is to provide a binary response (e.g., good or bad) to the network traffic being analyzed.
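A minimal sketch of such a binary classifier follows. The flow features (failed logins, distinct ports, bytes out) and the hand-picked weights are invented for illustration; a real model would learn them from labeled traffic.

```python
# Toy binary traffic classifier: score each flow and compare against a
# threshold. Features, weights and sample flows are assumptions.

def classify(flow, threshold=1.0):
    """Return the binary response: True = malicious, False = benign."""
    score = (flow["failed_logins"] * 0.5
             + flow["distinct_ports"] * 0.1
             + flow["bytes_out"] / 1_000_000 * 0.2)
    return score >= threshold

# (flow features, ground-truth label) pairs -- invented examples.
flows = [
    ({"failed_logins": 0, "distinct_ports": 2,  "bytes_out": 50_000},    False),
    ({"failed_logins": 0, "distinct_ports": 3,  "bytes_out": 120_000},   False),
    ({"failed_logins": 6, "distinct_ports": 40, "bytes_out": 900_000},   True),
    ({"failed_logins": 1, "distinct_ports": 25, "bytes_out": 2_500_000}, True),
]

# Count both kinds of error the model can make.
false_pos = sum(classify(f) and not bad for f, bad in flows)  # false alarms
false_neg = sum(not classify(f) and bad for f, bad in flows)  # missed threats
```

Counting false positives and false negatives separately matters because, as discussed below, a useful model must minimize both at once.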
4 Phases of Learning
Machine learning has gone through four phases of evolution: Collect, Analyze, Predict and Prescribe. In cybersecurity these phases initially developed in silos, because the ecosystems were built from the bottom up, experimenting with data, tools and choices, and building a set of practices and competencies around these disciplines.
The question is often asked: how much data is enough? To give an idea of the volumes involved, a medium-size network with 20,000 devices (servers, laptops, phones) transmits more than 50 TB of data in a 24-hour period. That means roughly 0.6 GB must be analyzed every second to detect cyber-attacks, targeted threats and malware attributed to malicious users. Dealing with such volumes of data in real time poses difficult challenges, so one needs models that can detect cyber-attacks while minimizing both false positives (false alarms) and false negatives (failing to detect real threats).
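The per-second figure follows from quick arithmetic on the daily volume:

```python
# Back-of-the-envelope check of the data rates quoted above.
TB = 10**12
daily_bytes = 50 * TB                      # ~50 TB per 24-hour period
per_second = daily_bytes / (24 * 60 * 60)  # bytes to analyze each second
print(f"{per_second / 10**9:.2f} GB/s")    # roughly 0.6 GB every second
```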
3 V’s of Cybersecurity
Big Data is being created at the rate of 2.5 quintillion bytes per day. So it is hard enough to find the haystack, let alone the needle in it. Describing big data in a cybersecurity context starts with the following common sensor sources:
- network flow,
- threat feeds,
- DNS captures,
- web page text,
- social activity,
- audit trails.
Finding the patterns to describe big data analytics in a cybersecurity context requires discussing the three V's: Volume, Variability and Velocity.
Large quantities of data are necessary to build and test the models. The question is when is "large" large enough? Sample sizes are never large. If N (the sample size) is too small to get a sufficiently precise estimate, you need to get more data (or make more assumptions). But once N is “large enough,” you can start subdividing the data to learn more. N is never enough because if it were “enough” you’d already be on to the next problem for which you need more data.
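One way to make "large enough" concrete is precision: the standard error of an estimated rate shrinks like 1/sqrt(N), so each extra digit of precision costs a 100x larger sample. The 1% malicious-traffic rate below is an illustrative assumption:

```python
# How the standard error of an estimated rate shrinks with sample size N.
# The assumed 1% rate of malicious flows is illustrative only.
import math

p = 0.01  # assumed true rate of malicious flows
std_err = {n: math.sqrt(p * (1 - p) / n)
           for n in (1_000, 100_000, 10_000_000)}

for n, se in std_err.items():
    print(f"N={n:>10,}: standard error ~ {se:.5f}")
```

This is also why N is never "enough": as soon as the overall estimate is precise, one starts subdividing the data (by subnet, by protocol, by time of day), and each subdivision shrinks the effective sample size again.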
In applications of big data there are two types of data available: structured and unstructured. For cybersecurity-specific data science models, Variability refers to the range of values that a given feature can take in a data set. The importance of having data with enough variability when building cybersecurity models is often underestimated. Network deployments in organizations – businesses, government agencies and private institutions – vary greatly. Commercial network applications are used differently across organizations, and custom applications are developed for specific purposes. If the data sample on which a given model is tested lacks variability, the risk of an incorrect assessment of the model’s performance is high. If a given machine learning model has been built properly (i.e., without "overtraining", which happens when the model picks up very specific properties of the data on which it has been trained), it should be able to generalize to "unseen" data.
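The contrast between an overtrained model and one that generalizes can be shown with a toy example. Both the data set and the "unusually long domain names are malicious" pattern are invented for illustration:

```python
# Toy contrast: a model that memorizes its training rows ("overtrained")
# versus one that captures the underlying pattern. Data is invented.

train = [("ab.com", False), ("cd.org", False),
         ("verylongrandomhost1.net", True), ("anotherlonghostname2.net", True)]
test = [("ef.com", False), ("yetanotherlonghost3.net", True)]

lookup = dict(train)  # "overtrained": remembers exact training inputs

def memorizer(domain):
    # Anything it has not literally seen defaults to benign.
    return lookup.get(domain, False)

def generalizer(domain):
    # Learned property of the data: unusually long names are suspicious.
    return len(domain) > 10

train_acc_mem = sum(memorizer(d) == bad for d, bad in train) / len(train)
test_acc_mem = sum(memorizer(d) == bad for d, bad in test) / len(test)
test_acc_gen = sum(generalizer(d) == bad for d, bad in test) / len(test)
```

The memorizer is perfect on its own training data yet fails on unseen domains, while the simpler rule generalizes, which is exactly the risk of testing a model only on low-variability data.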
If one has to analyze hundreds of millions of records and every single query to the data set requires hours, building and testing models would be a cumbersome and tedious process. Being able to quickly iterate through the data, modify some parameters in a particular model and quickly assess its performance are all crucial aspects of the successful application of data science techniques to cyber security.
Thus, Volume, Variability and Velocity are essential characteristics of big data that have high relevance for applying data science to cyber security. Together these characteristics increase the "Value" of data in data science for cyber security.
2 Types of Cyber Battles
Threats evolve every single day. As attack surfaces increase in business infrastructures, so does the diversity of cyber-attacks. The two broadest types of threats are
- outside-in attacks, and
- inside-out attacks.
In both types of threats, a combination of machine-based and human-based inputs is required before making a decision and taking an action. This is why the bad guys tend to win while the good guys (the defenders) are still analyzing the threat vectors.
Analytical tools in widespread use today are categorized into three groups based on their sophistication and ability to emulate the brain of a trained infosec analyst.
- Basic-level descriptive analytics, i.e., “what happened” – 25% ML-based findings; 75% reliance on the human analyst.
- Intermediate-level diagnostic analytics, i.e., providing context to “why did it happen” – 50% ML-based findings; 50% reliance on the human analyst.
- Advanced predictive analytics, i.e., “what is likely to happen” – 75% ML-based findings; 25% reliance on the human analyst.
1 Way to Secure Your Assets
The network defense of the future will consist of analytics-enhanced human operators interacting with the network. However, until then, one has to rely on ML plus humans to combat the threats.
ML is rapidly training computers (much as we train humans) to build better mousetraps for advanced threat vectors. As attack surfaces increase, so will the diversity of cyber-attacks. This post discussed the cybersecurity-specific basics of machine learning (in 5 steps), the 4 phases of learning, and the 3 V’s of threat analytics that safeguard the business against the 2 types of common threats, with 1 way (ML plus humans) to secure your assets. At the end of the day it is about big algorithms (less about big data) working in concert with the right machine learning models to train the system to identify and remediate threat risks.