The yellow brick road to machine learning with honeypot data: Our lessons learned

Recently the Rapid7 Logentries team attended a hackathon over at one of our Boston offices. This was a great way for us to integrate with the other Rapid7 teams within the company and to have fun messing around with things we don’t usually have time for in a working day.

The project that my team worked on involved machine learning with the dataset collected by some of the various Heisenberg honeypots that Rapid7 has deployed. More information about these honeypots is available here. The goal was to predict whether a machine contacting us was an attacker or not based on incoming packet data.

The initial data that we had was the raw pcap (packet capture) data produced by the honeypots. This can be acquired using a tool like tshark or tcpdump. With this data we needed to find a property that would reveal some information about what kind of packet it was, i.e. malicious or non-malicious. The goal of our project was to produce a dataset that we could apply some machine learning algorithms to and thus predict whether a packet was malicious or not.

In this post I’ll outline the techniques we used to extract features from the dataset, normalize it and finally apply some machine learning to it.


To label the packets we used Suricata, an open source IDS, IPS and Network Security Monitoring engine. Using it is as simple as running

suricata -r my_data.pcap

on the command line.

Note: The -r flag specifies that we should read from a pcap file, instead of listening in on a network interface.

Our team also decided to configure Suricata to look only for CVEs.

CVE stands for Common Vulnerabilities and Exposures, a public catalogue of known security flaws. A packet that triggers a CVE rule is a reasonable proxy for a malicious one, so the presence or absence of a CVE tag seemed like the perfect target for our prediction model.

Suricata has a group of rules files in the /etc/suricata/rules directory. We can pull out just the CVE rules by running a simple command like:

cat /etc/suricata/rules/*.rules | grep "cve" > /etc/suricata/rules/custom.rules

This creates a new file containing only CVE rules.
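For readers who prefer to do the filtering in a script, here is a minimal Python sketch of the same idea. It is not what we ran at the hackathon: the function name select_cve_rules and the sample rule lines are made up for illustration, and unlike the grep command it matches "cve" case-insensitively and skips commented-out rules.

```python
def select_cve_rules(lines):
    """Return the rule lines that mention "cve", skipping blanks and comments."""
    selected = []
    for line in lines:
        stripped = line.strip()
        # Disabled rules are commented out with a leading '#'
        if not stripped or stripped.startswith("#"):
            continue
        if "cve" in stripped.lower():
            selected.append(stripped)
    return selected


# Made-up rule lines, for illustration only
sample_rules = [
    '# alert tcp any any -> any 80 (msg:"disabled"; reference:cve,2014-0160; sid:1;)',
    'alert tcp any any -> any 443 (msg:"example A"; reference:cve,2014-0160; sid:2;)',
    'alert udp any any -> any 53 (msg:"example B"; sid:3;)',
]
print(select_cve_rules(sample_rules))
```

Only the second sample line survives: the first is commented out and the third has no CVE reference.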

The rules used by Suricata can easily be enabled/disabled by modifying the /etc/suricata/suricata.yaml config file. The setting we need to change is “rule-files”. We can erase all .rules files referenced by this setting and set it to “custom.rules” to use that rules file that we created earlier.

Now when we run suricata we get some output in the /var/log/suricata directory. We will use the data collected in the /var/log/suricata/eve.json file.

This file contains a list of each packet that was detected as a CVE. Each item in the list has a packet number that we can use to uniquely identify each packet.
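Before combining anything, it can help to see which kinds of events actually ended up in eve.json. The sketch below counts events by type, assuming the file holds one JSON object per line with an "event_type" field (the same field our script reads later). The function name count_event_types and the sample lines are made up for illustration.

```python
import json
from collections import Counter


def count_event_types(lines):
    """Count events per "event_type" in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        counts[event.get("event_type", "unknown")] += 1
    return counts


# Made-up events, for illustration only
sample_lines = [
    '{"pcap_cnt": 12, "event_type": "alert"}',
    '{"pcap_cnt": 12, "event_type": "http"}',
    '{"pcap_cnt": 40, "event_type": "alert"}',
]
print(count_event_types(sample_lines))
```

To run it on real data, pass an open file handle for /var/log/suricata/eve.json in place of sample_lines.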


The next step is to extract relevant data from the pcap file we have. This is made super simple with Wireshark because you can export pcap data as a csv file.

Wireshark displays a table of the properties of each packet. The properties that are shown can be customised, which is useful for us, because we can easily choose what features we want to use for our prediction model.

To choose the features we want to extract we simply right click on one of the columns and click “Column Preferences…”. Here we can select which features to extract and we can even set our own names for them. For our example we’ll take the following features:

The packet number isn’t particularly interesting but if we want to combine our suricata data with this data then we need to output it here as well.

Now that we’ve decided what we want in our output data, it’s time to export it. Go to File > Export Packet Dissections > As CSV. Choose a name for the file, such as pcap_data.csv, which is the name the script below expects.

Hacky Python Script

Now that we have our two datasets, we just need to combine them. What is a hackathon project without hacky scripts? I chose Python for this part because it’s quick and easy to write prototypes without having to compile them. Perfect for writing code that manipulates data structures.

Here is the python code that combines these two datasets:

import json

marked_packets = {}

# First load in the event type + pcap number data
with open("/var/log/suricata/eve.json", "r") as f:
    for line in f:
        event = json.loads(line)
        # Skip events that carry no packet number (not every record has one)
        if "pcap_cnt" not in event:
            continue
        pcap_cnt = event["pcap_cnt"]
        event_type = event["event_type"]
        marked_packets[pcap_cnt] = event_type

# Output is a list of strings which we join together at the end
output = []

# Load the wireshark output and combine with the suricata data
with open("/home/eoinf/Hackathon/pcap_data.csv", "r") as f:
    first_line = True
    for full_line in f:
        line = full_line.strip()
        # If header line
        if first_line:
            # Add the event type to the existing header
            output.append(line + ',"Event"')
            first_line = False
        # All other lines
        else:
            columns = line.split(',')

            # Assume the first column is the packet number (and remove the quotes)
            pcap_cnt = int(columns[0].replace('"', ''))

            # Check if there is a known event type for this packet and if it's an alert
            if pcap_cnt in marked_packets and marked_packets[pcap_cnt] == "alert":
                event_type = marked_packets[pcap_cnt]
            else:
                event_type = "None"

            output.append('%s,"%s"' % (line, event_type))

# Join the rows together and write them out
with open("output.csv", "w") as f:
    f.write("\n".join(output) + "\n")
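The script above splits each row on commas by hand. That works here because we only read the first column, but any field that itself contains a comma would trip up code that reads the later columns. As a slightly less hacky alternative, here is a sketch using Python’s csv module. The function name label_rows, the "Info" column and the sample rows are assumptions for illustration, not part of our hackathon code.

```python
import csv
import io


def label_rows(csv_text, alert_packets):
    """Append an "Event" column: "alert" if the packet number is in alert_packets, else "None"."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")

    # Header row: keep the existing columns and add the label column
    header = next(reader)
    writer.writerow(header + ["Event"])

    for row in reader:
        # Assume the first column is the packet number
        pcap_cnt = int(row[0])
        label = "alert" if pcap_cnt in alert_packets else "None"
        writer.writerow(row + [label])
    return out.getvalue()


# Made-up export with a comma inside a quoted field
sample_csv = (
    '"No.","Protocol","Info"\n'
    '"1","TCP","80 > 1234 [SYN, ACK]"\n'
    '"2","DNS","Standard query"\n'
)
print(label_rows(sample_csv, alert_packets={1}))
```

The csv reader keeps the quoted "[SYN, ACK]" field in one piece, which a plain split(',') would not.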


We used Weka because it’s simple to use and provides out of the box machine learning algorithms which we can apply to almost any dataset.

In Weka explorer, open up the csv file we generated and remove the “No.” field since this couldn’t possibly be useful for predicting malicious packets.

Now we can finally start training a model with our data. Go to the “Classify” tab and “Choose” a classifier. A simple one to follow is J48, which builds a human-readable decision tree. Make sure “Event” is selected in the dropdown list above the “Start” button and then click “Start”. The results appear in the output pane on the right. Sadly they aren’t too interesting (for us anyway), since we get a tree like this:

: None (140974.0/496.0)

This means that we always predict that a packet is not malicious, regardless of its properties. And we’re usually right (about 99.6% of the time, in fact) because our dataset consists of 140,974 rows but only 496 of them are alerts.
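The arithmetic behind that baseline is worth spelling out, because it shows why accuracy alone is misleading on such a lopsided dataset. A quick sketch using the two counts from the Weka output (the variable names are just for illustration):

```python
total_rows = 140974   # all packets in output.csv
alert_rows = 496      # packets labelled "alert"

# A model that always answers "None" is right on every non-alert row...
majority_accuracy = (total_rows - alert_rows) / total_rows

# ...but it never finds a single alert
alerts_found = 0
alert_recall = alerts_found / alert_rows

print(f"accuracy: {majority_accuracy:.2%}")
print(f"alerts detected: {alert_recall:.0%}")
```

The accuracy looks excellent while the share of alerts detected is zero, and detecting alerts is the part we actually care about.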

To solve this problem we need an even distribution of alerts and non-alerts. If we go back to our output.csv file on the command line we can do this with built-in unix command line tools. First we want to grab the header row of the csv with:

head -n 1 output.csv > even.csv

Then take all the alerts and append them to the new file with the following command:

grep "alert" output.csv >> even.csv

Finally we can use the shuf command to take a random sample of the same number of non-alert packets and append those too:

grep "None" output.csv | shuf -n 496 >> even.csv
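The same balancing step (keep every alert, randomly sample an equal number of non-alerts) can also be sketched in Python. This is an illustration rather than what we ran: the function name balance, its parameters and the sample rows are made up, and a fixed seed is used only so the result is repeatable.

```python
import random


def balance(rows, label_index=-1, minority="alert", seed=0):
    """Keep every minority row plus an equally sized random sample of the rest."""
    minority_rows = [r for r in rows if r[label_index] == minority]
    majority_rows = [r for r in rows if r[label_index] != minority]

    rng = random.Random(seed)
    sample_size = min(len(minority_rows), len(majority_rows))
    sampled = rng.sample(majority_rows, sample_size)

    balanced = minority_rows + sampled
    rng.shuffle(balanced)  # avoid having all alerts grouped at the top
    return balanced


# Made-up rows of (protocol, event), for illustration only
rows = [("TCP", "None")] * 8 + [("STP", "alert")] * 2
balanced = balance(rows)
print(balanced)
```

Throwing away most of the majority class is the simplest way to even things out, at the cost of discarding data the model might have learned from.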

Now if we feed this into Weka again we get a slightly better decision tree:

Protocol = LOOP: alert (141.0/63.0)

Protocol = STP: alert (693.0/310.0)

Protocol = CDP: alert (25.0/12.0)

Protocol = LLDP: alert (37.0/15.0)

Protocol = DNS: None (55.0)

Protocol = NTP: None (5.0)

Protocol = TCP: None (34.0)

Protocol = ICMP: None (2.0)


The decision tree is deciding whether a packet is malicious based on protocol alone, which isn’t useful, but it’s starting to take shape. We can try to add more features or perhaps try some different techniques to evenly distribute our dataset, but that would be beyond the scope of this article.
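To put a number on “slightly better”: in Weka’s tree output, the pair in parentheses after each leaf is the number of training instances that reached the leaf, followed by how many of those were misclassified (omitted when it is zero). Adding these up gives the tree’s accuracy on the balanced training data. A small sketch, with the counts copied from the tree above:

```python
# protocol -> (instances reaching the leaf, misclassified instances)
leaves = {
    "LOOP": (141, 63),
    "STP": (693, 310),
    "CDP": (25, 12),
    "LLDP": (37, 15),
    "DNS": (55, 0),
    "NTP": (5, 0),
    "TCP": (34, 0),
    "ICMP": (2, 0),
}

total = sum(n for n, _ in leaves.values())
errors = sum(e for _, e in leaves.values())
accuracy = (total - errors) / total

print(total, errors, f"{accuracy:.1%}")
```

The total comes to 992 rows, which matches the 496 alerts plus 496 sampled non-alerts. Roughly 60% accuracy on a balanced set is better than a coin flip, though not by much.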

The general thing to take away from this article is that there’s usually a lot of work in moulding the dataset into what you want before you can even get to applying machine learning techniques. It all depends on the initial data you have though.
