Scalable Machine Learning for Packet Capture Data with Kubeflow

Onsights.io
9 min read · Jul 24, 2020


The lovely rainbows of PCAP on Wireshark…

Framing the Use Case

The data science and DevOps team at Anno.Ai was recently posed with the challenge of creating a scalable machine learning (ML) pipeline for processing packet capture (PCAP) data for a cyber analytics use case. The team decided to build this pipeline on top of Kubeflow, an open-source toolkit for scalable ML infrastructure, to host our pipeline stages and processing jobs.

The analytical challenge posed to us was to use PCAP data to understand a network’s baseline activity and, from that analysis, visually represent its components, track traffic flows, and identify anomalies. The goal was to use these ML-enabled techniques against the PCAP data to identify anomalous network activity and flag it to operators for further investigation. These capabilities would help cyber operators answer some of their most challenging questions:

  • What countries and locations have been hitting my network over the past 24 hours?
  • Where is my network?
  • What does a baseline of traffic look like day to day on my network?
Our prototype machine learning (ML) output applies a threshold on the root mean squared error (RMSE) derived from autoencoder reconstruction to flag abnormal traffic for the chosen feature set.

So what is Kubeflow?

Before…

A more detailed post here outlines what Kubeflow is and how to get started. As an overview, Kubeflow is an open-source toolkit designed to automate and scale ML development operations; it originated at Google as a simplified way to run TensorFlow jobs on Kubernetes. Kubeflow accomplishes this by harnessing container orchestration frameworks, most often Kubernetes. At its core, Kubeflow offers end-to-end ML stack orchestration to efficiently deploy, scale, and manage complex machine learning workflows. Kubeflow allows for containerized deployment of models across an enterprise, agnostic to the production platform (GCP, AWS; OpenShift, Rancher, on-prem infrastructure; edge devices, etc.). Kubeflow takes a microservices approach to building enterprise-scale machine learning workflows, which lets data scientists collaborate effectively and build reusable machine learning solutions, eliminating duplicated effort.

After!

Overview of our Kubeflow Implementation

Our DevOps team developed an automated installation tool to install Kubeflow on Amazon Web Services (AWS), AWS GovCloud, and Google Cloud Platform. The installer was also integrated with GitLab CI/CD so DevSecOps teams could systematically deploy Kubeflow from a CI/CD environment. The Kubeflow deployment was pre-packaged with a Terraform installer that established cluster networking and the other foundational infrastructure needed to deploy our Kubeflow clusters (a sketch of how these steps can be sequenced follows the prerequisites list below). We built our Kubeflow installation based on the following assumptions and prerequisites:

Assumptions

● macOS / Linux instance with internet access

● Git Client

● Active AWS account with admin-level service credentials

Instance Prerequisites (the following components were used to support the deployment and configuration of Kubeflow on AWS):

● AWS CLI version 1

● Terraform >= 0.12

● kops >= 1.16

● Helm >= 3.0

● kfctl 0.7–1.0-rc
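For illustration, the sketch below shows roughly how the install steps can be sequenced from a CI job once the prerequisites above are in place. The Terraform module path and kfctl config filename are hypothetical placeholders, and our actual installer handles far more detail (state buckets, IAM, cluster networking).

```python
# A minimal sketch of sequencing the install steps; paths and config names are placeholders.
import subprocess

def run(cmd, cwd=None):
    """Run one CLI step and fail fast on a non-zero exit code."""
    print("+ " + " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)

# 1) Terraform lays down networking and the other foundational infrastructure.
run(["terraform", "init"], cwd="terraform/aws")            # hypothetical module path
run(["terraform", "apply", "-auto-approve"], cwd="terraform/aws")

# 2) kfctl deploys Kubeflow onto the cluster from a KfDef configuration file.
run(["kfctl", "apply", "-V", "-f", "kfctl_aws.yaml"])      # hypothetical config name
```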

Overview of our Machine Learning Approach to PCAP Baselining and Anomaly Detection

To tackle the scale and feature density of PCAP data, we selected a machine learning approach based on the Kitsune autoencoder framework, which has the following traits that proved attractive for our use case:

  • Efficiently processes full PCAP data to extract features for an overall performant machine learning model;
  • Uses an unsupervised machine learning approach that is scalable and does not require large amounts of subject matter expert (SME) attention to label and maintain training datasets; and,
  • Deploys a lightweight training and inference machine learning architecture that can operate on edge devices with limited compute and storage.

We developed an Artificial Neural Network (ANN)-based Network Intrusion Detection System (NIDS) toolkit that is online, unsupervised, and computationally efficient. The ML approach consists of a feature extraction process, a feature mapping process, and an anomaly detection process. The core machine learning algorithm deployed uses an ensemble of neural networks called autoencoders to differentiate between normal and abnormal traffic patterns. The features extracted from the network traffic are mapped to the visible neurons of the ensemble, and each autoencoder then attempts to reconstruct an instance’s features. A root mean squared error (RMSE) metric quantifies how poorly each instance is reconstructed. The RMSE values from the ensemble feed into a final output autoencoder that combines them to determine whether the input traffic is near the network’s baseline or anomalous.

During training, the autoencoder computes the difference (RMSE) between the original data and its learned reconstruction. Source.

This approach allows for efficient, lightweight, scalable, and distinctive methods of network baselining and anomaly detection that help overcome the volume and feature density of PCAP data. No more than one instance of PCAP data is stored in memory at a time, lightening the RAM load required to run the training process and speeding up the packet processing rate by up to a factor of five. Traditional NIDS are usually deployed at single points within the network (e.g. a gateway device). Our layered and distributed autoencoder architecture allows us to process, analyze, and infer with as many layers as required to extract the proper depth of features from the network.

During prediction, the trained autoencoder computes the difference (RMSE) between test data and its reconstruction and compares it to the RMSE values obtained during training to classify the test data as normal or abnormal.
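To make the training and prediction flow concrete, below is a simplified, self-contained sketch of the ensemble-of-autoencoders idea using tensorflow.keras and synthetic data. It is illustrative only, not the Kitsune implementation or our production code; the feature-group sizes, layer widths, epoch counts, and 99th-percentile threshold are arbitrary choices for the demo.

```python
import numpy as np
import tensorflow as tf

def small_autoencoder(n_inputs):
    """A tiny dense autoencoder that learns to reconstruct its own input."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(max(2, n_inputs // 2), activation="relu", input_shape=(n_inputs,)),
        tf.keras.layers.Dense(n_inputs, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def rmse(x, x_hat):
    """Per-instance root mean squared reconstruction error."""
    return np.sqrt(np.mean((x - x_hat) ** 2, axis=1))

# Synthetic "baseline" traffic: three groups of features per packet instance.
rng = np.random.default_rng(0)
groups = [rng.normal(size=(5000, 4)), rng.normal(size=(5000, 3)), rng.normal(size=(5000, 5))]

# 1) Train one small autoencoder per feature group (the ensemble layer).
ensemble = [small_autoencoder(g.shape[1]) for g in groups]
for model, g in zip(ensemble, groups):
    model.fit(g, g, epochs=5, batch_size=256, verbose=0)

# 2) Each member's per-instance RMSE becomes an input to the output autoencoder.
train_rmse = np.stack(
    [rmse(g, m.predict(g, verbose=0)) for m, g in zip(ensemble, groups)], axis=1)
output_ae = small_autoencoder(train_rmse.shape[1])
output_ae.fit(train_rmse, train_rmse, epochs=5, batch_size=256, verbose=0)

# 3) The anomaly score is the output autoencoder's reconstruction RMSE; a high
#    percentile of the training scores serves as the normal-vs-abnormal threshold.
train_scores = rmse(train_rmse, output_ae.predict(train_rmse, verbose=0))
threshold = np.percentile(train_scores, 99)

# Prediction: instances whose score exceeds the threshold are flagged as anomalous.
test_groups = [rng.normal(loc=3.0, size=(10, g.shape[1])) for g in groups]  # shifted = "abnormal"
test_rmse = np.stack(
    [rmse(g, m.predict(g, verbose=0)) for m, g in zip(ensemble, test_groups)], axis=1)
test_scores = rmse(test_rmse, output_ae.predict(test_rmse, verbose=0))
print(test_scores > threshold)  # True entries would be surfaced to operators
```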

The unsupervised approach used in the autoencoder model can either be deployed to a customer environment as an already-trained model (starting with potentially higher accuracy, but based on training from previous datasets) or deployed from scratch. Deploying the autoencoder architecture from scratch allows all unique behaviors of the network to be learned with no a priori knowledge, ensuring the model matches the traffic patterns of its deployed environment exactly. The autoencoder can also be deployed as a pre-trained model if the Anno.Ai data science team deems that the target environment’s PCAP data and patterns closely align with the training data used for the pre-trained model.

Why Kubeflow works for this Use Case

In the case of cyber analytics and cyber data, implementing a scalable ML infrastructure, pipeline, and approach is crucial to success. There are many reasons for this, but the main ones are the sheer velocity and volume of data one deals with in cyber scenarios.

The following graphic outlines the general steps in which Kubeflow was used, along with an explanation of how features within Kubeflow were integrated into the end-to-end pipeline for the PCAP use case:

Data Ingestion & Parsing:

We utilized Kubeflow to scale out our PCAP parser, written in Go. This proved invaluable, as we needed speed as much as anything else to quickly extract the many protocol fields we were targeting. We stored raw PCAP files in persistent blob storage, then used persistent volumes to carry the data forward as each job within the pipeline completed.

The team also found Argo, the workflow engine underlying Kubeflow Pipelines, to be extremely useful for data engineering. Argo handles container-native ETL in a way that minimized operational overhead. Scalable execution, parallelization, versioned workflows, and consistent deployment across development and production environments all made the data engineering process efficient.
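To make this concrete, here is a minimal sketch of how a parsing step with a persistent volume might be declared with the Kubeflow Pipelines (kfp) SDK, which compiles down to an Argo workflow. The container image, bucket, and paths are hypothetical placeholders rather than our actual parser.

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="pcap-ingest", description="Parse raw PCAP into normalized records")
def pcap_ingest_pipeline(bucket: str = "s3://example-pcap-bucket"):
    # Persistent volume to hold intermediate output between pipeline steps.
    scratch = dsl.VolumeOp(
        name="pcap-scratch",
        resource_name="pcap-scratch-pvc",
        size="50Gi",
        modes=dsl.VOLUME_MODE_RWO,
    )

    # Containerized Go parser; the image and arguments are placeholders.
    dsl.ContainerOp(
        name="parse-pcap",
        image="registry.example.com/pcap-parser:latest",
        arguments=["--input", bucket, "--output", "/data/parsed"],
        pvolumes={"/data": scratch.volume},
    )

if __name__ == "__main__":
    # Compiling produces an Argo workflow spec that Kubeflow Pipelines executes.
    kfp.compiler.Compiler().compile(pcap_ingest_pipeline, "pcap_ingest.yaml")
```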

Data Engineering:

We used Kubeflow for our data cleansing effort to ensure all of our PCAP dimensions were extracted and normalized (e.g. converting timestamps to standard UNIX time). We enriched our PCAP data with geolocation information using the MaxMind database.
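As a simplified illustration of this step (the column names, sample values, and database path below are hypothetical; the lookup uses the geoip2 reader for a MaxMind GeoLite2 database):

```python
import pandas as pd
import geoip2.database
import geoip2.errors

# Hypothetical parsed-PCAP records; column names are for illustration only.
df = pd.DataFrame({
    "timestamp": ["2020-07-24 12:00:01", "2020-07-24 12:00:02"],
    "src_ip": ["8.8.8.8", "10.0.0.5"],
})

# Normalize timestamps to standard UNIX time (seconds since the epoch).
ts = pd.to_datetime(df["timestamp"], utc=True)
df["unix_time"] = (ts - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")

# Enrich each source IP with a country code from a local MaxMind database.
reader = geoip2.database.Reader("GeoLite2-City.mmdb")  # path is a placeholder

def country_code(ip):
    """Return the ISO country code, or None for private / unknown addresses."""
    try:
        return reader.city(ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return None

df["src_country"] = df["src_ip"].apply(country_code)
```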

Kubeflow was especially useful and important for scaling out the resources required for feature engineering. For this step in particular, we required GPUs to run the Jupyter Notebook that contained our feature engineering steps. This meant provisioning different types of machines from what we had previously used (all previous and following jobs required only CPUs), and we also desperately needed to leverage autoscaling for what ended up being a roughly seventeen-hour process to extract our features. The flexibility of autoscaling and being able to provision different types of machines with GPUs saved us time and money from an end-to-end pipeline perspective. Once engineered and set up, the handoff from the cluster that performed data cleansing, enrichment, and validation to the cluster that ran the feature engineering process was also seamless.
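For reference, requesting a GPU for a single kfp pipeline step looks roughly like the sketch below; the image is a placeholder, and the node selector label/value are provider-specific (a GKE-style label is shown purely as an example).

```python
from kfp import dsl

def feature_engineering_op():
    """A pipeline step that asks the scheduler for one NVIDIA GPU."""
    op = dsl.ContainerOp(
        name="feature-engineering",
        image="registry.example.com/pcap-feature-eng:latest",  # placeholder image
        arguments=["--input", "/data/parsed", "--output", "/data/features"],
    )
    # The GPU request lets the cluster autoscaler add a GPU node pool only for
    # this step and tear it down afterwards, which is what saved time and money.
    op.set_gpu_limit(1)
    # Node selector label/value vary by provider; this is an illustrative GKE label.
    op.add_node_selector_constraint("cloud.google.com/gke-accelerator", "nvidia-tesla-t4")
    return op
```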

*As an aside data science note, we used the geo-enrichment performed in the enrichment step to one-hot encode one of the features used for analysis. The team mapped internal vs. external IP addresses, assigning a “1” to every externally geolocated IP address and a “0” to those that returned no geolocation, as an approximate way to distinguish internal from external IP addresses.
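A minimal sketch of that encoding, assuming a MaxMind-style lookup in which private/internal addresses are simply absent from the database:

```python
import geoip2.database
import geoip2.errors

def external_ip_flag(ip, reader):
    """Return 1 if the address geolocates (external), 0 if it does not (internal)."""
    try:
        reader.city(ip)
        return 1
    except geoip2.errors.AddressNotFoundError:
        # Private / internal ranges have no entry in the GeoLite2 database.
        return 0

# Example usage (database path is a placeholder):
# reader = geoip2.database.Reader("GeoLite2-City.mmdb")
# external_ip_flag("8.8.8.8", reader)   -> 1
# external_ip_flag("10.0.0.5", reader)  -> 0
```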

Data versioning capabilities in Kubeflow 0.7 could be better, but we are hopeful the 1.0 release will start providing those features for us.

Model Building & Training

Once we had our first pipeline code set up in Kubeflow, running experiments was as simple as re-running this code in a Jupyter notebook or from the Kubeflow UI. Being able to run a variety of experiments and runs (with metadata stored and saved) allowed our data science team to build the model itself, test it and tune hyperparameters, and finally select the model version that performed the best.
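As an example, launching such a run from a notebook takes only a few lines with the kfp client; the pipeline body, image, and host URL below are placeholders.

```python
import kfp
from kfp import dsl

@dsl.pipeline(name="pcap-experiment", description="One experiment run of the PCAP pipeline")
def pcap_pipeline(rmse_percentile: float = 99.0):
    # Single placeholder step; the real pipeline chains parsing, enrichment, and training.
    dsl.ContainerOp(
        name="train-autoencoder",
        image="registry.example.com/pcap-train:latest",  # placeholder image
        arguments=["--rmse-percentile", rmse_percentile],
    )

# Connect to the Kubeflow Pipelines API (host is a placeholder) and launch a run;
# parameters, status, and artifacts are then tracked per experiment in the UI.
client = kfp.Client(host="http://localhost:8080")
client.create_run_from_pipeline_func(
    pcap_pipeline,
    arguments={"rmse_percentile": 99.0},
    experiment_name="pcap-baseline-experiments",
)
```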

Another benefit of Kubeflow’s ability to track and run experiments was our team’s flexibility to test a variety of infrastructure options to find what worked best with the model we had chosen to test and deploy. Working with Kubeflow in a cloud environment improved this even further, since a variety of compute, processing power, and RAM ratios could be explored to optimize for cost and time. Being able to perform training in a distributed, parallelized, and scalable environment also sped up these two phases considerably.

Kubeflow 0.7 misses many visualization opportunities (such as visualizing experiments, model training performance, and tracking artifacts and metadata from within Kubeflow itself); however, the Kubeflow framework integrates fairly well with TensorBoard as a visualization tool. We are hoping the visualization capabilities organic to Kubeflow that enable data scientists to easily and automatically visualize model training iterations and processes are improved in the 1.0 release. There are also some missed opportunities (that we *think* might be resolved in the 1.0 release) for visualizing and tracking data and model versioning.

Model Operations

We found that Kubeflow greatly reduces infrastructure provisioning effort by automatically generating end-to-end workflows from minimal specifications. Kubeflow leverages a loosely coupled design of reusable containers to craft a complete, end-to-end pipeline. The elastic infrastructure provides the ability to handle multiple environments with specialized hardware components specific to model training. Sharing reusable containerized components dramatically increases the number of pipeline runs a team can deliver.
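One pattern that makes this reuse concrete in the kfp SDK is wrapping a plain Python function as a containerized component that any pipeline can call; a minimal sketch (the function body and base image are placeholders):

```python
from kfp import components

def normalize_timestamps(input_path: str, output_path: str) -> str:
    """Placeholder body: read parsed records, convert timestamps to UNIX time, write them back."""
    return output_path

# Wrap the function as a reusable containerized component; any pipeline in the
# enterprise can now call normalize_op(...) instead of re-implementing the step.
normalize_op = components.func_to_container_op(
    normalize_timestamps,
    base_image="python:3.8",
)
```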

NOT THAT RUNNING KUBEFLOW WAS ALL KITTENS AND RAINBOWS.

The team found that once initial pipelines are crafted and common workflows within an enterprise are identified, solidifying and re-using these components becomes easier, and the time it takes to instantiate them gradually shrinks, requiring less and less hooting and hollering between the data science and DevSecOps teams. In the meantime, however, creating the environment remained a challenge. Significant effort was spent setting up unique machine images, container images, and the dependencies needed by the ML code, and matching those with the correct packages (e.g. GPU vs. CPU vs. TPU). Scaling out nodes for scalable processes across different cloud environments was also a challenge.
