KDD Cup 1999

Network Data Source: Connection records
Network Data Labeled: Yes
Host Data Source: -
Host Data Labeled: -

Overall Setting: Military IT
OS Types: Linux 2.0.27, SunOS 4.1.4, Sun Solaris 2.5.1, Windows NT
Number of Machines: Thousands
Total Runtime: Nine weeks
Year of Collection: 1998
Attack Categories: DoS, Remote to Local, User to Root, Surveillance/Probing
Benign Activity: Scripts for synthetic traffic generation, real humans for performing complex tasks

Packed Size: 18 MB
Unpacked Size: 743 MB
Download Link: goto

Overview

This dataset was used in the fifth International Conference on Knowledge Discovery and Data Mining (KDD’99). It is based on data captured in DARPA’s 1998 IDS evaluation program, which comprises nine weeks of synthetic network activity, including a small variety of attacks. The task for participants was to develop a “network intrusion detector”: a model capable of distinguishing between bad and good/normal connections. Like the dataset it is based on, it should be used with reservations, if at all, due to its age and a number of known flaws.

Environment

Refer to the underlying DARPA’98 Intrusion Detection Program.

Activity

Refer to the underlying DARPA’98 Intrusion Detection Program.

Contained Data

The raw DARPA data, which comes in the form of binary TCP dumps, is divided and processed into seven weeks (~five million connection records) of training data, and two weeks (~two million connection records) of test data. A connection record is defined as “a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address to a target IP address under some well-defined protocol”. Each of these connection records contains 41 features (description linked below), with a 42nd label indicating whether this event is normal or malicious, which in the latter case also references the specific attack that event belongs to.
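Because each connection record is a flat comma-separated line, it can be parsed with nothing more than a string split. A minimal sketch (the helper `parse_record` is illustrative, not part of any official dataset tooling; the example line is taken from the data examples below):

```python
# One labeled connection record from kddcup.data (41 features + label).
record = ("1,tcp,smtp,SF,2633,331,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,3,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.67,222,168,0.76,0.02,0.00,"
          "0.00,0.00,0.00,0.00,0.00,normal.")

def parse_record(line: str):
    """Split a KDD'99 connection record into its 41 features and its label.

    Labels in the raw file end with a trailing dot (e.g. "normal.", "neptune."),
    which is stripped here for convenience.
    """
    fields = line.strip().split(",")
    features, label = fields[:-1], fields[-1].rstrip(".")
    assert len(features) == 41, "expected 41 features before the label"
    return features, label

features, label = parse_record(record)
print(len(features), label)  # 41 normal
```

The first few fields (duration, protocol, service, connection flag, bytes sent/received) are directly readable; the remaining features are derived statistics described in the linked feature documentation.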

The KDD’99 dataset fixes some issues present in its DARPA foundation, which was severely affected by simulation artifacts (e.g., attack traffic could often be distinguished by looking only at the TTL). However, it introduces problems of its own, such as a noticeably uneven data distribution (which hinders cross-validation), a large number of redundant/duplicate records, and the fact that DoS attacks constitute the vast majority (71%) of the entire test set.
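The duplicate-record problem is commonly mitigated by deduplicating before training (this is, for instance, part of the motivation behind the later NSL-KDD dataset). A minimal order-preserving sketch:

```python
def deduplicate(lines):
    """Keep only the first occurrence of each connection record,
    preserving the original order of the file."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

# Toy example; real input would be the lines of kddcup.data.
sample = ["a,normal.", "b,smurf.", "a,normal."]
print(deduplicate(sample))  # ['a,normal.', 'b,smurf.']
```

Note that deduplication changes the class distribution, so any reported results should state whether it was applied.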

Papers

Data Examples

Labeled features taken from kddcup.data.

[...]
1,tcp,smtp,SF,2633,331,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,3,0.00,0.00,0.00,0.00,1.00,0.00,0.67,222,168,0.76,0.02,0.00,0.00,0.00,0.00,0.00,0.00,normal.
2,tcp,smtp,SF,974,327,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,223,169,0.76,0.02,0.00,0.00,0.00,0.00,0.00,0.00,normal.
12359,tcp,telnet,SF,3676,23454,0,0,0,2,0,1,2,0,0,4,20,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,1,3,1.00,0.00,1.00,0.67,0.00,0.00,0.00,0.00,normal.
12345,tcp,telnet,SF,8070,14511,0,0,0,0,0,1,1,0,0,0,35,0,5,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,2,4,1.00,0.00,0.50,0.50,0.00,0.00,0.00,0.00,normal.
0,tcp,http,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1.00,1.00,0.00,0.00,1.00,0.00,0.00,3,1,0.33,0.67,0.33,0.00,0.33,1.00,0.00,0.00,neptune.
0,tcp,http,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,1.00,1.00,0.00,0.00,1.00,0.00,0.00,4,2,0.50,0.50,0.25,0.00,0.50,1.00,0.00,0.00,neptune.
0,tcp,http,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,1.00,1.00,0.00,0.00,1.00,0.00,0.00,5,3,0.60,0.40,0.20,0.00,0.60,1.00,0.00,0.00,neptune.
0,tcp,http,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,4,1.00,1.00,0.00,0.00,1.00,0.00,0.00,6,4,0.67,0.33,0.17,0.00,0.67,1.00,0.00,0.00,neptune.
[...]

List of attack types taken from training_attack_types.

back dos
buffer_overflow u2r
ftp_write r2l
guess_passwd r2l
imap r2l
ipsweep probe
land dos
loadmodule u2r
multihop r2l
neptune dos
nmap probe
perl u2r
phf r2l
pod dos
portsweep probe
rootkit u2r
satan probe
smurf dos
spy r2l
teardrop dos
warezclient r2l
warezmaster r2l
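The mapping above can be loaded into a lookup table to collapse the 22 specific attack labels into the four attack categories (a common preprocessing step; `label_category` is an illustrative helper):

```python
# Contents of training_attack_types: "<attack_name> <category>" per line.
attack_types = """\
back dos
buffer_overflow u2r
ftp_write r2l
guess_passwd r2l
imap r2l
ipsweep probe
land dos
loadmodule u2r
multihop r2l
neptune dos
nmap probe
perl u2r
phf r2l
pod dos
portsweep probe
rootkit u2r
satan probe
smurf dos
spy r2l
teardrop dos
warezclient r2l
warezmaster r2l"""

category = {name: cat for name, cat in
            (line.split() for line in attack_types.splitlines())}

def label_category(label: str) -> str:
    """Map a record label (e.g. 'neptune.') to its attack category,
    passing 'normal' through unchanged."""
    name = label.rstrip(".")
    return "normal" if name == "normal" else category[name]

print(label_category("neptune."))  # dos
print(label_category("normal."))  # normal
```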