Network Data Source | pcaps |
Network Data Labeled | Yes |
Host Data Source | - |
Host Data Labeled | - |
Overall Setting | Enterprise IT |
OS Types | Undisclosed |
Number of Machines | 2000+ |
Total Runtime | n/a |
Year of Collection | 2004-2010 |
Attack Categories | Botnets (Storm, Waledac) |
Benign Activity | Real users |
Packed Size | 3 GB |
Unpacked Size | 10.6 GB |
Download Link | goto |
Overview
The ISOT Botnet dataset is an amalgamation of several individual datasets: two containing malicious botnet traffic and five consisting of benign traffic. Malicious data was taken from the “French Chapter” of the Honeynet project, while (anonymized) benign traces come from the LBNL Enterprise Trace Repository. The combined traces, after some preprocessing to make them appear as if they stemmed from the same network, are then used to test several botnet detection methods leveraging network behavior analysis and machine learning. However, we were unable to find any information regarding the source of the malicious traces, as linked pages no longer exist and further searching remained fruitless.
Environment
The merged dataset contains traces from 23 individual subnets, 22 with only benign traffic (stemming from the LBNL traces) and one with both malicious and benign traffic (merged traffic from both sources). The IP addresses of the latter subnet can be obtained from Table 2 of the linked documentation. Information regarding services, operating systems and so on is not available.
Activity
Details regarding activity are not available; there might be some additional information hidden in LBNL publications, but we consider this to be out of scope.
Contained Data
As a first step to merge benign and malicious traces, the IP addresses of infected machines were mapped to two of the machines providing benign background traffic.
Then, the authors used the TcpReplay tool to replay all traces on the same network interface in order to homogenize the network behavior shown by the individual datasets.
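The IP mapping step can be illustrated with a minimal sketch. It assumes flows are represented as `(src_ip, dst_ip, dst_port)` tuples; the names `IP_MAPPING` and `remap_flow` and all addresses are hypothetical placeholders, not taken from the dataset or its documentation.

```python
# Hypothetical mapping: infected honeynet hosts -> hosts in the benign network.
# Addresses are placeholders from documentation ranges (RFC 5737), not the
# dataset's real ones.
IP_MAPPING = {
    "192.0.2.10": "198.51.100.21",
    "192.0.2.11": "198.51.100.22",
}


def remap_flow(flow, mapping):
    """Return the flow with source and destination IPs rewritten via mapping."""
    src, dst, *rest = flow
    # Addresses without an entry in the mapping are left unchanged.
    return (mapping.get(src, src), mapping.get(dst, dst), *rest)


flows = [
    ("192.0.2.10", "203.0.113.5", 25),     # originates from an "infected" host
    ("198.51.100.30", "203.0.113.9", 80),  # unrelated background flow
]
remapped = [remap_flow(f, IP_MAPPING) for f in flows]
```

Only the first flow changes, because only its source address appears in the mapping. The authors' actual procedure operates on packet traces and may differ in detail.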
The resulting traces are available as a single large pcap file containing 1,675,424 unique flows, of which 3.33% are malicious.
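To give a feeling for the class imbalance these figures imply, the following sketch derives approximate per-class flow counts. Since the stated percentage is rounded, the counts are estimates only; the function name `approx_class_counts` is our own.

```python
TOTAL_FLOWS = 1_675_424   # unique flows, as stated in the dataset description
MALICIOUS_SHARE = 0.0333  # 3.33 %, a rounded figure


def approx_class_counts(total, malicious_share):
    """Estimate (malicious, benign) flow counts from a rounded share."""
    malicious = round(total * malicious_share)
    return malicious, total - malicious


malicious, benign = approx_class_counts(TOTAL_FLOWS, MALICIOUS_SHARE)
ratio = benign / malicious
print(f"malicious ~ {malicious:,}, benign ~ {benign:,}, ratio ~ {ratio:.1f}:1")
```

This works out to roughly 56,000 malicious flows, or about 29 benign flows per malicious one, which is worth keeping in mind when choosing evaluation metrics.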
Labels are provided implicitly: malicious traffic carries a specific MAC address, as listed in Table 2 of the linked documentation.
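A MAC-based labeling step could look roughly like the sketch below, which inspects the Ethernet header of a raw frame. The addresses in `MALICIOUS_MACS` are hypothetical placeholders (the real values are in Table 2 of the linked documentation), and matching on either source or destination address is an assumption on our part.

```python
# Hypothetical placeholder addresses; the real ones are listed in Table 2 of
# the dataset documentation and are NOT reproduced here.
MALICIOUS_MACS = {"aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"}


def mac_to_str(raw: bytes) -> str:
    """Format 6 raw bytes as a lowercase colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def label_frame(frame: bytes, malicious_macs=MALICIOUS_MACS) -> str:
    """Label an Ethernet frame by its source/destination MAC address."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    dst = mac_to_str(frame[0:6])   # bytes 0-5: destination MAC
    src = mac_to_str(frame[6:12])  # bytes 6-11: source MAC
    # Assumption: a match on either side marks the frame as malicious.
    if src in malicious_macs or dst in malicious_macs:
        return "malicious"
    return "benign"


# Example frames: destination MAC, source MAC, EtherType (IPv4), dummy payload.
bad = bytes.fromhex("ffffffffffff" "aaaaaaaaaa01" "0800") + b"\x00" * 20
good = bytes.fromhex("ffffffffffff" "bbbbbbbbbb01" "0800") + b"\x00" * 20
```

In practice, the frames would be read from the pcap file with a capture library of choice; flow-level labels can then be derived from the labels of the packets belonging to each flow.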
It should be noted that applying machine-learning methods to merged datasets bears some additional risks: researchers must ensure that results are not a byproduct of anomalies that remained after the merging process, which might be caused not by the malicious behavior but by the simple fact that these traces stem from separate environments.
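One simple way to probe for such merging artifacts is to check whether a single environment-dependent field already separates the classes on its own. The sketch below is our own illustration, not a method from the dataset authors; the choice of field (an IP TTL value in the example) is purely hypothetical, and the score is computed in-sample, so it is only a rough warning signal.

```python
from collections import Counter


def single_feature_accuracy(samples):
    """Score a per-value majority-vote rule on one feature.

    samples: iterable of (feature_value, label) pairs.
    Returns (accuracy, baseline), where baseline is the accuracy of always
    predicting the most frequent label.
    """
    per_value = {}
    labels = Counter()
    for value, label in samples:
        per_value.setdefault(value, Counter())[label] += 1
        labels[label] += 1
    total = sum(labels.values())
    if total == 0:
        raise ValueError("no samples")
    # For each feature value, predicting its majority label gets that many right.
    correct = sum(max(counts.values()) for counts in per_value.values())
    return correct / total, max(labels.values()) / total


# Toy data: the hypothetical field separates the classes perfectly.
toy = [(64, "benign")] * 8 + [(128, "malicious")] * 2
accuracy, baseline = single_feature_accuracy(toy)
```

If a field unrelated to botnet behavior yields an accuracy far above the baseline, a detector trained on the merged data may be learning the difference between capture environments rather than the malicious behavior itself.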