ISCX Botnet Dataset 2014

Network Data Source pcaps
Network Data Labeled Yes
Host Data Source -
Host Data Labeled -
   
Overall Setting Enterprise IT
OS Types Undisclosed
Number of Machines 2000+
Total Runtime n/a
Year of Collection 2004-2014
Attack Categories Botnets
User Emulation Real Users
   
Packed Size -
Unpacked Size 13,8 GB
Download Link goto

Overview

The ISCX Botnet Dataset originated from the need for a modern botnet dataset suitable for evaluation of botnet detection methods; specifically, how much individual NetFlow features contribute towards detection. To this end, three overarching shortcomings affecting (at the time) current datasets are laid out:

  • Generality (few botnets included)
  • Realism (environments are not realistic / long-lasting enough for a botnet to perform all malicious functionality)
  • Representativeness (benign behavior does not resemble that of actual networks) In order to alleviate these flaws, the authors combine traffic from three sources - ISOT Botnet, ISCX IDS 2012, and the Malware Capture Facility Project - with the goal of creating one novel dataset.

Environment

For details regarding the respective underlying datasets, refer to the related entries and links below.

Activity

The final dataset, divided into a training and test set, contains a total of 7 and 16 types of botnets, respectively (the latter containing more to to evaluate novelty detection). For the training dataset, these are (total of 43.92% malicious): | Botnet Name | Type | Total Flows | | —————– | —- | ————- | | Neris | IRC | 21159 (12%) | | Rbot | IRC | 39316 (22%) | | Virut | HTTP | 1638 (0.94%) | | NSIS | P2P | 4336 (2.48%) | | SMTP Spam | P2P | 11296 (6.48%) | | Zeus | P2P | 31 (0.01%) | | Zeus Control (C2) | P2P | 20 (0.01%) |

For the test dataset, these are (total of 44.97% malicious): | Botnet Name | Type | Total Flows | | —————– | —- | ————- | | Neris | IRC | 25967 (5.67%) | | Rbot | IRC | 83 (0.018%) | | Menti | IRC | 2878 (0.62%) | | Sogou | HTTP | 89 (0.019%) | | Murlo | IRC | 4881 (1.06%) | | Virut | HTTP | 58576 (12.8%) | | NSIS | P2P | 757 (0.165%) | | Zeus | P2P | 502 (0.109%) | | SMTP Spam | P2P | 21633 (4.72%) | | UDP Storm | P2P | 44062 (9.63%) | | Tbot | IRC | 1296 (0.283) | | Weasel | P2P | 42313 (9.25%) | | Zero Access | P2P | 1011 (0.221%) | | Smoke Bot | P2P | 78 (0.17%) | | Zeus Control (C2) | P2P | 31 (0.006%) | | ISCX IRC Bot | P2P | 1816 (0.387%) |

The remaining flows are all benign, though their exact nature is not detailed.

Contained Data

The three sources have been combined using the overlay methodology [2], whereby malicious traffic was assigned to existing IPs outside the main network using the Bit-Twist packet generator. This combination of malicious and benign traffic was then replayed via TCPReplay and captured by TCPdump, resulting in a single dataset which was then divided into a training and test set with the properties outlined above. Labels are available in the form of IPs associated with each botnet and can be found in the homepage linked below.

Papers