CA Security

2 Posts authored by: chadi14 Employee

Machine learning is to Big Data as human learning is to life experience. It is the science of getting computers to act without being explicitly programmed. As humans, we interpolate and extrapolate from past experiences to deal with unfamiliar situations, and we use data to make decisions that lead to actions. Our decisions are descriptive, diagnostic or predictive depending on prior knowledge of “what happened”, “why did it happen” or “what will happen”. Machine learning (ML), combined with data analysis, models this behavior at massive scale, and it has significant applicability in forensics, cybersecurity and information assurance.


Human learning is what we understand best and it continues to be the best form of learning. One way to look at machine learning is human-level artificial intelligence that can be applied broadly.


machine learning in cybersecurity.png

5-Step Learning


Machine learning, like human learning, is a 5-step process.

  1. Define the problem – e.g., determine whether a domain is safe or malicious
  2. Harvest the data set – e.g., train on the data set to build better models
  3. Create a capability – e.g., build a new capability to detect novel data patterns
  4. Validate the model – e.g., use the multiple capabilities created to form a model
  5. Operationalize – e.g., use this model to make decisions; through continuous use the model keeps training (and gets better at making predictions over time).


Machine learning, together with data science and big data, has emerged as a mainstream technology in cybersecurity after proving its success in applications ranging from recommendation systems (like Amazon and Netflix) to voice recognition systems (like Apple Siri and Microsoft Cortana).


In cyber security, data models need to have predictive power to implicitly distinguish between normal benign network traffic and abnormal, potentially malicious traffic that can be an indicator of a cyber-attack vector. This is where machine learning is used to build classifiers, as the goal of the models is to provide a binary response (e.g., good or bad) to the network traffic being analyzed.
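To make the idea of a binary classifier concrete, here is a minimal sketch in Python. It uses a nearest-centroid rule, which is a deliberately simple stand-in and not a method named in this post. The two features (packets per second, distinct destination ports) and all sample values are hypothetical.

```python
from statistics import mean

def train_centroids(samples, labels):
    """Return one centroid (per-feature mean) for each class label."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def classify(centroids, sample):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))
    return min(centroids, key=lambda label: dist2(centroids[label]))

# Hypothetical flows: (packets per second, distinct destination ports)
training_flows = [(10, 2), (12, 3), (8, 1), (900, 150), (1100, 200), (950, 180)]
training_labels = ["good", "good", "good", "bad", "bad", "bad"]

centroids = train_centroids(training_flows, training_labels)
verdict_normal = classify(centroids, (11, 2))      # resembles the benign flows
verdict_scan = classify(centroids, (1000, 170))    # resembles the malicious flows
```

A real classifier would need scaled features, far more data and a held-out test set; the sketch only shows the shape of the good-or-bad decision.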


machine learning 2.png

4 Phases of Learning


Machine learning has gone through four phases of evolution: Collect / Analyze / Predict / Prescribe. These steps were initially in silos in cybersecurity because these ecosystems were built from the bottom up — experimenting with data, tools and choices — and building a set of practices and competencies around these disciplines.


A question that is often asked is how much data is enough. To give an idea of how much data needs to be processed, a medium-size network with 20,000 devices (servers, laptops, phones) transmits more than 50 TB of data in a 24-hour period. That means that roughly 4.6 Gbit (about 580 MB) of it must be analyzed every second to detect cyber-attacks, targeted threats and malware attributed to malicious users. Dealing with such volumes of data in real time poses difficult challenges, so one has to build models that can detect cyber-attacks while minimizing both false positives (false alarms) and false negatives (failing to detect real threats).
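The per-second figure follows from dividing the daily volume by the number of seconds in a day. A quick check, assuming decimal terabytes (1 TB = 10^12 bytes) and traffic spread evenly over the day:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def sustained_rate(bytes_per_day):
    """Return (bytes per second, bits per second) for a daily volume."""
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    return bytes_per_second, bytes_per_second * 8

daily_volume = 50 * 10**12  # 50 TB, assuming decimal terabytes
bytes_per_s, bits_per_s = sustained_rate(daily_volume)
print(f"{bytes_per_s / 10**6:.0f} MB/s, {bits_per_s / 10**9:.2f} Gbit/s")
# prints "579 MB/s, 4.63 Gbit/s"
```

Real traffic is bursty, so peak rates will be higher than this average.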


3 V’s of Cybersecurity


Big Data is being created at the rate of 2.5 quintillion bytes per day. So, it is hard enough to find the haystack, let alone the needle in the haystack. In a cybersecurity context, big data comes from the following ten common sensor sources:


  1. alerts,
  2. events,
  3. logs,
  4. pcaps,
  5. network flow,
  6. threat feeds,
  7. DNS captures,
  8. web page text,
  9. social activity,
  10. audit trails.


Any description of big data analytics in a cybersecurity context has to mention the three V's: Volume, Variability and Velocity.



Large quantities of data are necessary to build and test the models. The question is when is "large" large enough? Sample sizes are never large. If N (the sample size) is too small to get a sufficiently precise estimate, you need to get more data (or make more assumptions). But once N is “large enough,” you can start subdividing the data to learn more. N is never enough because if it were “enough” you’d already be on to the next problem for which you need more data.



In applications of big data there are two types of data available: structured and unstructured. For cybersecurity-specific data science models, Variability refers to the range of values that a given feature could take in a data set. The importance of having data with enough variability in building cyber security models is often underestimated. Network deployments in organizations – businesses, government agencies and private institutions – vary greatly. Commercial network applications are used differently across organizations, and custom applications are developed for specific purposes. If the data sample on which a given model is tested lacks variability, the risk of an incorrect assessment of the model’s performance is high. If a given machine learning model has been built properly (e.g., without "overtraining", which happens when the model picks up very specific properties of the data on which it has been trained), it should be able to generalize to "unseen" data.
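One rough way to check whether a test sample has enough variability is to compare the range of each feature in the sample with data from a different deployment. The sketch below is only an illustration: the feature names (`bytes`, `duration_s`) and all values are hypothetical, and the range check is a crude proxy for variability.

```python
def feature_ranges(rows):
    """Return {feature: (min, max)} over a list of dict records."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges

def fraction_outside(ranges, rows):
    """Fraction of records with at least one feature outside the given ranges."""
    def outside(row):
        return any(not (ranges[n][0] <= v <= ranges[n][1]) for n, v in row.items())
    return sum(outside(r) for r in rows) / len(rows)

# Hypothetical per-flow features from the sample used to test a model ...
test_sample = [
    {"bytes": 500, "duration_s": 1.0},
    {"bytes": 800, "duration_s": 2.5},
    {"bytes": 650, "duration_s": 1.8},
]
# ... and from a differently configured network the model will meet later.
other_network = [
    {"bytes": 700, "duration_s": 2.0},
    {"bytes": 90000, "duration_s": 40.0},
    {"bytes": 120, "duration_s": 0.2},
    {"bytes": 600, "duration_s": 1.5},
]

ranges = feature_ranges(test_sample)
uncovered = fraction_outside(ranges, other_network)  # share of records the sample never represented
```

A high uncovered share suggests that performance measured on the test sample says little about how the model will behave on the other network.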



If one has to analyze hundreds of millions of records and every single query to the data set requires hours, building and testing models would be a cumbersome and tedious process. Being able to quickly iterate through the data, modify some parameters in a particular model and quickly assess its performance are all crucial aspects of the successful application of data science techniques to cyber security.


Thus, Volume, Variability and Velocity are essential characteristics of big data that have high relevance for applying data science to cyber security. Together these characteristics increase the "Value" of data in data science for cyber security.


2 Types of Cyber Battles


Threats evolve every single day. As attack surfaces increase in business infrastructures, so does the diversity of cyber-attacks. The two broadest types of threats are

  1. outside-in attacks, and
  2. inside-out attacks.


In both types of threats, a combination of machine-based and human-based inputs is required before making a decision and taking an action. This is why the bad guys tend to win while the good guys (defenders) are still analyzing the threat vectors.


Analytical tools in widespread use today are categorized into three groups based on their sophistication and ability to emulate the brain of a trained infosec analyst.

Basic-level descriptive analytics, i.e., “what happened” – 25% ML-based finding; 75% reliance on human analyst.

Intermediate-level diagnostic analytics, i.e., providing context to “why did it happen” – 50% ML-based finding; 50% reliance on human analyst.

Advanced predictive analytics, i.e., “what is likely to happen” – 75% ML-based finding; 25% reliance on human analyst.


1 Way to Secure Your Assets


The network defense of the future will consist of analytics-enhanced human operators interacting with the network. However, until then, one has to rely on ML plus humans to combat the threats.


ML is rapidly training computers (like we train humans) to create better mousetraps for advanced threat vectors. As attack surfaces increase, so will the diversity of cyber-attacks. This post discussed the cybersecurity-specific basics of machine learning (in 5 steps) to categorize the threats (in 4 ways) by understanding the 3 V’s of threat analytics to safeguard the business against 2 types of common threats. At the end of the day it is about big algorithms (less about big data) working in concert with the right machine learning models to train the system to identify and remediate the threat risks.

Big Data is almost a household term, with not a single company out there not delving into it this year. One of the main enablers of Big Data has been the rise of cybersecurity, along with the "rise of the machines" through machine-to-machine (M2M) interactions. M2M has put Big Data in the headlines alongside cybersecurity. These seven myths and facts were compiled from a set of innovation forums where we reflected on the implications of Big Data for cybersecurity.

Facts





1. You are Big Data. Much of the world’s Big Data is created as metadata from users’ smartphones and GPS traffic.

Every day you create metadata with smartphones that enable GPS location services. Every picture you take, every Web site you visit, every route you map creates metadata, which is stored and available for analysis. With more than 5 billion mobile phones in use, including more than 1 billion smartphones in 2015, according to research firms, it’s no wonder that many enterprises and government organizations are interested in gleaning valuable content from the information.


2. Big Data tends to be mined poorly in cyber security, which results in ineffective threat analysis algorithms.

With all the metadata that exists, we are only now figuring out how to make sense of it and how to cultivate beneficial data from it. For one, enterprises traditionally haven’t had the resources in place to analyze metadata. As those investments increase, the mining for trends and useful analysis will increase as well.


3. Big Data in cyber security is automating tasks that used to involve tedious manual labor.

Software companies are developing tools that can not only analyze metadata but also automate tasks, so that the data can be put to use more quickly. This allows companies both to be more flexible and to make the analysis of Big Data much less costly than in the past.


4. Big Data is used in cybersecurity to categorize and classify cyber threats the same way Google ranks pages.

As more information is gleaned, algorithms for categorizing and classifying malware are being developed to help security providers. Most software companies use Big Data in four ways: first, to apply CART (Classification and Regression Trees) for predictive classification of event modifiers; second, to use Shewhart control charts for outlier threat detection; third, to use splines for non-linear exploratory modeling; and lastly, to apply the goodness-of-fit principle to check the stability of historical threat data and construct a parsimonious model.
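Of the four techniques, the Shewhart control chart is the easiest to sketch. The example below sets control limits at three standard deviations around a baseline mean and flags observations outside them. The hourly failed-login counts are hypothetical, and using the sample standard deviation (rather than a moving-range estimate) is a simplifying assumption.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Lower limit, center line and upper limit at k standard deviations."""
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - k * sigma, center, center + k * sigma

def out_of_control(values, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, _, upper = limits
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical failed-login counts per hour during a quiet baseline period
baseline = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15]
limits = control_limits(baseline)

# New observations monitored against the baseline limits
observed = [14, 13, 47, 12]
alerts = out_of_control(observed, limits)  # the spike of 47 is flagged
```

The limits must be computed from a period believed to be free of attacks; otherwise the outliers inflate the standard deviation and hide themselves.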


5. Big Data theory is moving faster than the reality of what an enterprise is capable of from both a technology and manpower standpoint.

Since much of Big Data is derived from user-centric behavior and usage, it moves a lot faster than what an enterprise typically generates from its application systems. About 70% of the digital universe has been created by individuals, not corporations. Even though enterprise IT departments store, protect and manage 70% of the digital data, the real power play is in the users’ hands. The user is in charge (not the IT department), and the epicenter for producing the majority of the world’s digital data is with the users. The Big Data tsunami has forced technologies to be modernized to solve security challenges. The conventional RDBMS and, later, NoSQL databases in which this data used to be stored are no longer sufficient, and the data cannot be accessed by direct record access methods. The current technology of choice is not a conventional RDBMS but a MapReduce-based framework like Hadoop that operates on a distributed hardware substrate.
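The MapReduce model behind Hadoop can be illustrated in a few lines of plain Python. This is a single-process sketch of the idea, not Hadoop's API: the function names, the log format and the IP addresses are all hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(log_line):
    """Emit (source_ip, 1) for each failed-login line; nothing otherwise."""
    source_ip, event = log_line.split()
    return [(source_ip, 1)] if event == "LOGIN_FAILED" else []

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

# Hypothetical log shards, as if stored on two different nodes
shard_a = ["10.0.0.5 LOGIN_FAILED", "10.0.0.7 LOGIN_OK", "10.0.0.5 LOGIN_FAILED"]
shard_b = ["10.0.0.9 LOGIN_FAILED", "10.0.0.5 LOGIN_FAILED"]

mapped = chain.from_iterable(map_phase(line) for line in chain(shard_a, shard_b))
failed_logins = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# failed_logins == {"10.0.0.5": 3, "10.0.0.9": 1}
```

In a real cluster the map calls run in parallel on the nodes that hold each shard, and only the grouped intermediate pairs travel over the network.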


6. Big Data is creating a major shift in the visualization of breaches and cyber-attacks.

Visualization of objects in excess of a few billion requires thinking differently. For instance, imagine the complexity of modeling huge data sets that grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, software logs, cameras, microphones, radio-frequency identification readers and wireless sensor networks. Right now, the largest memory requirements for visualizing Big Data working sets can’t be addressed by conventional computing models. That’s why the science of visualization has to be re-imagined and revisited to surface the patterns in the data in the case of events like privileged access violations, breaches and frauds.


7. Yesterday's endpoints have shifted to the users: with the proliferation of BYOD, user devices are today's endpoints.

With BYOD now the norm in corporate environments, the truly vulnerable endpoints of enterprises have turned out to be handhelds and smartphones. As more smartphones connect to corporate networks and data, organizations face more vulnerabilities in trying to secure all those additional points of entry.

Fiction








1. Cyber security companies are equipped to handle the volume and velocity of Big Data.

Like every business, security companies are still learning to get their arms around Big Data, eliminating potential vulnerabilities and ensuring that the data is cleansed for analysis. As the concept of Big Data grows and evolves, security companies must continually grow and evolve too.


2. Security developers are easily extracting value from collected data.

There’s a saying “You don’t know what you don’t know” that applies to intelligence and cybersecurity analysts. Without proper analysis tools in place, one isn’t able to extract valuable content from the collected data. Only with those analysis tools, algorithms and applications can developers truly garner valuable insight from collected data.


3. Analytics is ready-made for security.

To borrow the phrase “finding a needle in the haystack,” analytics is useless in “haystacks” of data where there are no “needles” to begin with. The hype has caused us to create massive data stacks with poor references (or indices) around those stacks. Any data analyst will attest that a better index on a smaller dataset yields better analytics than a larger dataset with poor indices.


4. Leveraging Big Data in a cybersecurity context is as simple as using it for any generalized purpose.

Leveraging Big Data must first address the point in Fiction No. 3, that analytics is ready-made for security. Establishing a security “context” is the next problem. Security context can be established by connecting the relationships between data sets (after map-reducing the data itself) to reveal valuable insights in patterns that were previously not correlated or compared. Mining for trends requires the data to be managed coherently first. Similarly, mining for relationships requires that the trends be understood. Only after the data has been map-reduced and the trends in it understood can you mine for relationships among those trends. Only after all of these prerequisites are met can you establish the big security context of Big Data. Think of cybersecurity context as the metadata fabric of relationships, which is a lot more powerful and useful for visualizing risks, threats and predictive analytics.


5. Big Data will cause major change in the cyber security industry within the next year.

No, the major change in the security industry will be in identifying anomalies that can be recognized as advanced security attack vectors. Big Data and cyber security algorithms will work in concert to realize value for businesses.


6. There is a belief that Big Data sets offer a higher form of intelligence that can generate insights that were previously impossible.

That’s not true by itself. We need to develop more algorithms that can offer more intelligence, not bigger data sets. The two kinds of algorithms are: Bayesian algorithms, which deal with prior occurrences, and predictive analytics, which is forward facing. Looking at the future, Big Context in security is going to be more innovative than Big Data in security.
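As a small illustration of the Bayesian side, the sketch below applies Bayes' theorem to update the probability that a connection is malicious once a detector raises an alert. The prior and the detector rates are hypothetical numbers chosen for the example.

```python
def posterior_malicious(prior, true_positive_rate, false_positive_rate):
    """P(malicious | alert) by Bayes' theorem."""
    p_alert = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / p_alert

# Hypothetical numbers: 1 in 1,000 connections is malicious; the detector
# fires on 99% of malicious connections and on 1% of benign ones.
posterior = posterior_malicious(prior=0.001, true_positive_rate=0.99,
                                false_positive_rate=0.01)
print(f"{posterior:.3f}")  # prints "0.090"
```

Even with an accurate detector, only about 9% of alerts are real attacks under these assumptions, because benign traffic vastly outnumbers malicious traffic. A smarter algorithm that lowers the false positive rate raises that share in a way that simply collecting more data does not.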


7. Big Data searched with naive algorithms fails to yield what little data can yield using smarter algorithms.

It should be about the algorithms and not about the data. Better precision and better searching techniques will trap the breaches. Better algorithms and smaller data stacks will provide more value than weaker algorithms and Big Data stacks. The better net will catch better stuff.