Network Data Source | pcaps, NetFlows |
Network Data Labeled | Yes, NetFlows are labeled |
Host Data Source | - |
Host Data Labeled | - |
Overall Setting | Enterprise IT |
OS Types | Windows XP SP2 |
Number of Machines | n/a |
Total Runtime | 1-67 hrs |
Year of Collection | 2011 |
Attack Categories | Various Botnet activity (Neris, Rbot, Virut, Menti, Sogou, Murlo, NSIS.ay) |
Benign Activity | Real background traffic |
Packed Size | n/a |
Unpacked Size | 697 GB (sum of all 13 scenarios) |
Download Link | goto |
Overview
The Czech Technical University (CTU) dataset originated from the desire to compare different botnet detection methods. For this purpose, the authors deemed it necessary to use a dataset which is publicly available (i.e., without sensitive contents) and fulfils a number of requirements, such as containing real background traffic or representing different botnets. Thirteen different scenarios were executed, with each one representing a certain botnet behavior, resulting in thirteen individual datasets.
Note that these thirteen datasets are a subset of the Malware Capture Facility Project, though other datasets within this collection did not undergo any further analysis or study.
Environment
The infected network consists of an unspecified number of virtualized machines running Windows XP SP2, with each machine being bridged into the network of the university.
Activity
For each of the thirteen scenarios, some botnet malware (Neris, Rbot, Virut, Menti, Sogou, Murlo, or NSIS.ay) was deployed within the target network, configured to match certain desired behavior. More details regarding each scenario can be found in section 6.1 of the linked paper.
Contained Data
Traffic was captured on both infected hosts and university routers using tcpdump. The captures were then converted into NetFlow format, composed of the following fields: Start Time, End Time, Duration, Source IP address, Source Port, Direction, Destination IP address, Destination Port, State, SToS, Total Packets, and Total Bytes.
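To make the field list above concrete, the following is a minimal sketch of how one flow could be represented and parsed once the NetFlows have been exported to text. The column names (`start_time`, `src_ip`, and so on), the `FlowRecord` class, and the sample row are hypothetical and chosen only for illustration; the actual header names in an exported file may differ.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class FlowRecord:
    """One flow with the twelve fields listed in the text (names are hypothetical)."""
    start_time: str
    end_time: str
    duration: float
    src_ip: str
    src_port: str      # kept as text: not every flow carries a numeric port
    direction: str
    dst_ip: str
    dst_port: str
    state: str
    stos: str
    total_packets: int
    total_bytes: int


def parse_flows(text: str) -> list[FlowRecord]:
    """Parse comma-separated flows whose header uses the hypothetical column names."""
    flows = []
    for row in csv.DictReader(io.StringIO(text)):
        flows.append(FlowRecord(
            start_time=row["start_time"],
            end_time=row["end_time"],
            duration=float(row["duration"]),
            src_ip=row["src_ip"],
            src_port=row["src_port"],
            direction=row["direction"],
            dst_ip=row["dst_ip"],
            dst_port=row["dst_port"],
            state=row["state"],
            stos=row["stos"],
            total_packets=int(row["total_packets"]),
            total_bytes=int(row["total_bytes"]),
        ))
    return flows


# Made-up sample row (documentation IP ranges), not taken from the dataset.
SAMPLE = (
    "start_time,end_time,duration,src_ip,src_port,direction,"
    "dst_ip,dst_port,state,stos,total_packets,total_bytes\n"
    "2011-08-10 09:46:53,2011-08-10 09:46:54,1.5,192.0.2.10,1025,->,"
    "198.51.100.7,80,CON,0,4,276\n"
)

flows = parse_flows(SAMPLE)
print(flows[0].src_ip, flows[0].total_bytes)
```

Ports are stored as strings in this sketch because some flows may not have a plain numeric port; converting them to integers would be a separate, per-protocol decision.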
For the labeling process, each flow was first labeled as background. Next, flows originating from known and controlled machines within the university network were assigned the normal label, while the botnet label was assigned to any flows coming to or from any of the known infected machines.
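The labeling steps described above can be sketched as a small function. This is an illustration only, not the authors' tooling: the function name, the label strings, and the host sets are hypothetical, and it assumes that a later step overrides an earlier one (so a flow touching an infected machine ends up labeled botnet even if it originates from a known normal host).

```python
def label_flow(src_ip: str, dst_ip: str,
               normal_hosts: set[str], infected_hosts: set[str]) -> str:
    """Apply the three labeling steps in order; later steps override earlier ones."""
    # Step 1: every flow starts out as background.
    label = "background"
    # Step 2: flows originating from known, controlled machines are normal.
    if src_ip in normal_hosts:
        label = "normal"
    # Step 3: flows coming to or from a known infected machine are botnet.
    if src_ip in infected_hosts or dst_ip in infected_hosts:
        label = "botnet"
    return label


# Hypothetical addresses from documentation ranges, not from the dataset.
NORMAL = {"192.0.2.10"}
INFECTED = {"192.0.2.66"}

print(label_flow("192.0.2.10", "198.51.100.7", NORMAL, INFECTED))   # normal
print(label_flow("198.51.100.7", "192.0.2.66", NORMAL, INFECTED))   # botnet
print(label_flow("203.0.113.5", "198.51.100.7", NORMAL, INFECTED))  # background
```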
For each scenario, the following data is available:
- Full pcaps from infected machines
- Truncated pcaps from the complete capture (to preserve privacy)
- Unidirectional NetFlows (labeled)
- Bidirectional NetFlows (labeled)
- Bro (now Zeek) output files (presumably only for traffic originating from infected machines)
- The original malware
- Config information (e.g., for converting pcaps to NetFlows)
The authors suggest that only the bidirectional NetFlows should be used for research efforts, stating that they outperformed the unidirectional ones. Note that NetFlows are only available in binary format and should be opened with Argus, though other tools might work too.
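As a rough sketch of how the binary NetFlows might be exported to text, the snippet below builds an invocation of Argus's `ra` client from Python. The file name `scenario.argus` is hypothetical, and the field names passed to `-s` are an assumed mapping onto the twelve fields listed earlier; check `man ra` for the options and field names supported by the installed Argus client version before relying on it.

```python
import shutil
import subprocess

# Assumed mapping of the documented fields onto ra field names.
DEFAULT_FIELDS = (
    "stime", "ltime", "dur", "saddr", "sport", "dir",
    "daddr", "dport", "state", "stos", "pkts", "bytes",
)


def build_ra_command(path: str, fields: tuple[str, ...] = DEFAULT_FIELDS) -> list[str]:
    """Build an `ra` command: read `path`, print `fields`, comma-delimited."""
    return ["ra", "-r", path, "-c", ",", "-s", *fields]


def export_flows(path: str) -> str:
    """Run `ra` and return its text output; requires the Argus clients to be installed."""
    if shutil.which("ra") is None:
        raise RuntimeError("the Argus 'ra' client was not found on PATH")
    result = subprocess.run(build_ra_command(path),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Only builds and prints the command; nothing is executed here.
command = build_ra_command("scenario.argus")  # hypothetical file name
print(" ".join(command))
```

Separating command construction from execution keeps the example inspectable on machines without Argus installed; `export_flows` is only useful once `ra` and a downloaded scenario file are available.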