The final project for my graduate level Data Mining course

rose-tmp/AWID-Intrusion-Detection

 
 

CIS 660 - Final Project

AWID Anomaly Detection

./images/AWID_box.png

Team

  • Paul Webster
  • Gabriel Madison
  • Brandon Marlowe
  • Mirza Baig

AWID (Aegean Wi-fi Intrusion) Dataset

Dataset produced from real Wireless Network logging

Full dataset

./images/full_set.png

  • ~ 160,000,000 Rows x 155 Columns

Reduced Training Set Used

./images/reduced_set.png

  • ~ 1.8 Million Rows x 155 Columns
  • Produced from 1 hour of logging

In both datasets, the majority of rows belong to the “normal” class

Project Goal

Build a classifier capable of properly classifying tuples with four specific attack types:

Amok
Deauthentication
Authentication Request
ARP

3 Major Tasks

  • Preprocessing/Cleaning
  • Feature Selection
  • Classification

About the Attacks

Deauthentication

A Denial of Service attack that uses unprotected deauthentication packets to spoof an entity. The attacker monitors traffic on a network to discover MAC addresses associated with specific clients. A deauthentication message is then sent to the access point on behalf of a particular MAC address, which forces that client off the network. The attacker then connects to the access point as the client that was previously disconnected.

Authentication Request

A type of Flooding Attack -> “In this case the aggressor attempts to exhaust the AP’s resources by causing overflow to its client association table. It is based on the fact that the maximum number of clients which can be maintained in the client AP’s association table is limited and depends either on a hard-coded value on the AP or on its physical memory constraints. An entry on the AP’s client association table is inserted upon the receipt of an Authentication Request message even if the client does not complete its authentication (i.e., is still in the unauthenticated/unassociated state).” - Intrusion Detection in 802.11 Networks: Empirical Evaluation of Threats and a Public Dataset

Amok

Another flooding attack, similar to Authentication Request

ARP (Address Resolution Protocol)

“In computer networking, ARP spoofing, ARP cache poisoning, or ARP poison routing, is a technique by which an attacker sends (spoofed) Address Resolution Protocol (ARP) messages onto a local area network. Generally, the aim is to associate the attacker’s MAC address with the IP address of another host, such as the default gateway, causing any traffic meant for that IP address to be sent to the attacker instead.” - Wikipedia

Preprocessing/Cleaning

Starting Point

./images/original_dataset.png

Wireshark Column Names

./images/wiresharkawid_attributes.png

Adding Wireshark Column Names

[FILE: col_names.txt]

frame.interface_id
frame.dlt
frame.offset_shift

...

wlan.qos.buf_state_indicated
data.len
class

from pathlib import Path

col_names = []
with open(Path(resource_dir, 'col_names.txt')) as cols_fp:
    # One Wireshark field name per line; strip the trailing newline
    for name in cols_fp:
        col_names.append(name.rstrip())

data.columns = col_names

After Appending Column Names

./images/cols_appended.png

Dropped columns not listed on course webpage

Replaced ‘?’ with NaN values, then dropped columns with over 60% NaN values

  • removed 7 columns
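
The pandas call used for this step takes a minimum count of non-NaN values (`thresh`) rather than a NaN percentage, so requiring at least 40% non-NaN entries is equivalent to dropping columns with more than 60% NaN. The following is a minimal pure-Python sketch of that rule, not the project's code; the function name, toy table, and column names are hypothetical.

```python
import math


def drop_sparse_columns(table, max_nan_fraction=0.60):
    """Return a copy of `table` without columns whose NaN share exceeds the limit.

    `table` maps column name -> list of values (a stand-in for a DataFrame).
    """
    kept = {}
    for name, values in table.items():
        nan_count = sum(
            1 for v in values if isinstance(v, float) and math.isnan(v)
        )
        # Keep the column unless MORE than 60% of its entries are NaN
        if nan_count / len(values) <= max_nan_fraction:
            kept[name] = list(values)
    return kept


nan = float("nan")
toy = {
    "mostly_present": [1.0, 2.0, nan, 4.0, 5.0],  # 20% NaN, kept
    "mostly_missing": [nan, nan, nan, nan, 5.0],  # 80% NaN, dropped
    "exactly_60pct": [nan, nan, nan, 4.0, 5.0],   # 60% NaN, kept (not over 60%)
}
cleaned = drop_sparse_columns(toy)
print(sorted(cleaned))
```

A column sitting exactly at the 60% boundary is kept, which matches a `thresh` of 40% of the row count.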
...

data = data.replace('?', np.nan)

...

# If over 60% of the values in a column are null, remove the column
# (thresh is the minimum number of non-NaN values a column needs to be kept)
prev_num_cols = len(data.columns)
data.dropna(axis='columns', thresh=len(data.index) * 0.40, inplace=True)
print("Removed " + str(prev_num_cols - len(data.columns)) +
      " columns with over 60% NaN values.")

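Drop the columns in which over 50% of the values are constant

cols_to_drop = []
for col in data:
    # As written, this flags a column when its number of distinct values
    # is at least 50% of the row count
    if data[col].nunique() >= (len(data.index) * 0.50):
        cols_to_drop.append(col)

data.drop(columns=cols_to_drop, inplace=True)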
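
Two different statistics are in play in this step: `nunique()` counts how many distinct values a column has, while "constant" describes how much of a column a single value occupies. The sketch below contrasts the two on hypothetical toy columns. The helper names `dominant_share` and `distinct_ratio` are illustrative and do not appear in the project.

```python
from collections import Counter


def dominant_share(values):
    """Fraction of entries equal to the single most common value."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)


def distinct_ratio(values):
    """Number of distinct values divided by the number of rows."""
    return len(set(values)) / len(values)


# Hypothetical columns, 8 rows each
near_constant = [0, 0, 0, 0, 1, 0, 0, 2]        # one value fills 6 of 8 rows
near_unique = [10, 11, 12, 13, 14, 15, 16, 16]  # 7 distinct values in 8 rows

for name, col in [("near_constant", near_constant), ("near_unique", near_unique)]:
    print(name, dominant_share(col), distinct_ratio(col))
```

A column dominated by one value has a high dominant share and a low distinct ratio; a nearly unique column (such as an identifier) shows the opposite pattern.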

Drop rows that contain at least one NaN value

  • ~ 2,000 rows removed

data.dropna(inplace=True)

Output the relatively clean data to a new file

# Output the minimized and preprocessed dataset to a ZIP file
# (with no index column added)
data.to_csv(
    Path(resource_dir, 'preproc_dataset.zip'),
    sep=',',
    index=False,
    compression='zip')

Perform min-max normalization on attributes used for classification (range 0-1)

normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x)))  }

...

wifiLog2$wlan.fc.type=normalize(as.numeric(wifiLog2$wlan.fc.type))

wifiLog2$frame.time_delta_displayed=normalize(as.numeric(
    wifiLog2$frame.time_delta_displayed
))

wifiLog2$wlan.duration=normalize(as.numeric(wifiLog2$wlan.duration))
View(wifiLog2)
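
The same min-max rescaling can be sketched outside R. The version below is a minimal Python illustration on hypothetical duration values. Returning zeros for a constant column is an assumption of this sketch; the R function above would produce NaN in that case because it divides by zero.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption of this sketch: a constant column maps to all zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw wlan.duration values
durations = [0, 44, 314, 48]
scaled = min_max_normalize(durations)
print(scaled)
```

Scaling all three attributes to the same 0-1 range matters for distance-based classifiers, since an unscaled attribute with a large range would otherwise dominate the distance.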

Normalization Output

./images/view_wifilog2.png

Feature Selection

We attempted PCA, but ran out of memory

…even on CSU’s Big Data Servers

We counted the distinct values of each remaining column per class, and chose columns with more distinct values for the normal class than for the attack classes

Using a little SQL magic…

select count(DISTINCT(wlan_fc_moredata))
  from AWID_REMOVED_NULL where class='normal'

select count(DISTINCT(wlan_fc_moredata))
  from AWID_REMOVED_NULL where class='arp'

select count(DISTINCT(wlan_fc_moredata))
  from AWID_REMOVED_NULL where class='amok'

select count(DISTINCT(wlan_fc_moredata))
  from AWID_REMOVED_NULL where class='authentication_request'

select count(DISTINCT(wlan_fc_moredata))
  from AWID_REMOVED_NULL where class='deauthentication'

select wlan_fc_moredata
  from AWID_REMOVED_NULL where class='normal'

Then, chose the following 3 columns for our analysis:

wlan.fc.type
frame.time_delta_displayed
wlan.duration
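
The per-class distinct counts gathered with the SQL queries above can also be computed in a single pass. This is a minimal Python sketch over hypothetical rows; the function name and sample values are illustrative, and only the column and class names come from the queries.

```python
from collections import defaultdict


def distinct_counts_by_class(rows, column, label="class"):
    """Count distinct values of `column` separately for each class label."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[label]].add(row[column])
    return {cls: len(values) for cls, values in seen.items()}


# Hypothetical rows standing in for AWID_REMOVED_NULL
rows = [
    {"wlan_fc_moredata": "0", "class": "normal"},
    {"wlan_fc_moredata": "1", "class": "normal"},
    {"wlan_fc_moredata": "0", "class": "arp"},
    {"wlan_fc_moredata": "0", "class": "amok"},
]
counts = distinct_counts_by_class(rows, "wlan_fc_moredata")
print(counts)
```

Running the function once per candidate column gives the same numbers as one `count(DISTINCT ...)` query per class.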

Classification

Isolated the attack types

 ...

 ATTACKTYPE<-"amok"

 # Keep only the target class and the normal packets
 wifiLog2<-wifiLog2[wifiLog2$class=="normal" | wifiLog2$class==ATTACKTYPE, ]

 wifiLog2$class<-as.character(wifiLog2$class)
 wifiLog2$class[wifiLog2$class=="normal"]<-as.character("0")
 wifiLog2$class[wifiLog2$class==ATTACKTYPE]<-as.character("1")
 wifiLog2$class<-as.factor(wifiLog2$class)

...
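
The relabeling step above turns the multi-class problem into one binary problem per attack. A minimal Python sketch of the same idea, using hypothetical rows and a hypothetical helper name:

```python
def isolate_attack(rows, attack_type, label="class"):
    """Keep only normal rows and rows of one attack type, relabeled as "0" and "1"."""
    subset = []
    for row in rows:
        if row[label] in ("normal", attack_type):
            relabeled = dict(row)  # copy so the input rows stay unchanged
            relabeled[label] = "1" if row[label] == attack_type else "0"
            subset.append(relabeled)
    return subset


# Hypothetical rows; only the class names come from the dataset
rows = [
    {"wlan.duration": 0.10, "class": "normal"},
    {"wlan.duration": 0.90, "class": "amok"},
    {"wlan.duration": 0.40, "class": "arp"},
    {"wlan.duration": 0.20, "class": "normal"},
]
subset = isolate_attack(rows, "amok")
print([r["class"] for r in subset])
```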

Separate files to handle each attack type

./images/KNN_file_names.png

Partitioned dataset into 66% training data and 34% test data

smp_size <- floor(0.66 * nrow(wifiLog2))
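
Only the sample-size line of the split is shown above. One common way to complete such a split is to draw that many row indices at random and use the rest as the test set. The Python sketch below illustrates this under that assumption; the seed, function name, and toy data are hypothetical.

```python
import math
import random


def split_train_test(rows, train_fraction=0.66, seed=123):
    """Randomly assign floor(train_fraction * n) rows to training, the rest to test."""
    smp_size = math.floor(train_fraction * len(rows))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx = set(rng.sample(range(len(rows)), smp_size))
    train = [r for i, r in enumerate(rows) if i in train_idx]
    test = [r for i, r in enumerate(rows) if i not in train_idx]
    return train, test


# Hypothetical data: 100 row ids
packets = list(range(100))
train, test = split_train_test(packets)
print(len(train), len(test))
```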

Performed SMOTE on training data

  • To create synthetic tuples of attack types

f<-formula("class~wlan.fc.type+frame.time_delta_displayed+wlan.duration")
train_smote<-SMOTE(f,train,perc.over=150,perc.under=90,k=3)
View(train_smote)
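
SMOTE creates each synthetic tuple by picking a minority-class point, choosing one of its k nearest minority neighbors, and interpolating at a random position between the two. The sketch below illustrates only that interpolation idea in plain Python. It is not the `SMOTE` function called above: `n_new_per_point` is a hypothetical simplification standing in for `perc.over`, and the undersampling controlled by `perc.under` is not modeled.

```python
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return sorted(others, key=lambda o: sq_dist(point, o))[:k]


def smote(minority, n_new_per_point, k=3, seed=0):
    """Generate synthetic minority points by interpolating toward random neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for i, point in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        neighbors = nearest_neighbors(point, others, k)
        for _ in range(n_new_per_point):
            neighbor = rng.choice(neighbors)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append(
                [p + gap * (n - p) for p, n in zip(point, neighbor)]
            )
    return synthetic


# Hypothetical minority-class points with two normalized features
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, n_new_per_point=2, k=3)
print(len(synthetic))
```

Because every synthetic point lies on a segment between two real minority points, the new tuples stay inside the region the attack class already occupies.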

K-Nearest Neighbor classifier to train model for each specific attack type

m<-kNN(f,train_smote,test_oversamp,norm=FALSE,k=5)

Made predictions using the model on the test dataset
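
For reference, a k-nearest-neighbor prediction labels a query point by majority vote among the k closest training points. The following is a minimal Python sketch of that rule on hypothetical normalized data; it is not the `kNN` function used above, and the points and labels are invented for illustration.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote of its k nearest training points."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: "0" = normal, "1" = attack
train_points = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [0.9, 1.0], [1.0, 0.9], [0.95, 0.95],
]
train_labels = ["0", "0", "0", "1", "1", "1"]

prediction = knn_predict(train_points, train_labels, [0.9, 0.9], k=3)
print(prediction)
```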

Parameter Selection/Interpretation

Recall - “completeness – what % of positive tuples did the classifier label as positive?”

./images/recall_eq.PNG

Precision - “exactness – what % of tuples that the classifier labeled as positive are actually positive”

./images/precision_eq.PNG

Recall and precision tend to trade off against each other: as precision increases, recall typically decreases.

Accuracy and recall are inversely related in our case (for a majority of our data)
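
The definitions above translate directly into code. The sketch below computes the measures from the four confusion-matrix counts, using the ARP (Test 1) counts reported in the Results section as input. The function name is hypothetical.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification measures from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all tuples labeled correctly
        "precision": tp / (tp + fp),     # of predicted positives, share truly positive
        "recall": tp / (tp + fn),        # of actual positives, share found (sensitivity)
        "specificity": tn / (tn + fp),   # of actual negatives, share labeled negative
    }


# Counts from the ARP (Test 1) confusion matrix
metrics = confusion_metrics(tp=21889, fp=1731, tn=552958, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4%}")
```

Recall and sensitivity are the same quantity, so a single `recall` entry covers both.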

Results

  • Performed multiple tests for each attack

ARP (Address Resolution Protocol) (Test 1)

ARP (Test 1) KNN Parameters

  • smote.k = 3
  • knn.k = 5
  • smote.perc.over = 150
  • smote.perc.under = 90

ARP (Test 1) - Confusion Matrix

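  • N = 576,582

|             | Predicted: NO | Predicted: YES | Total   |
|-------------|---------------|----------------|---------|
| Actual: NO  | 552,958       | 1,731          | 554,689 |
| Actual: YES | 4             | 21,889         | 21,893  |
| Total       | 552,962       | 23,620         | 576,582 |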

ARP (Test 1) - Anomaly Detection Metrics

| Outcome         | Count   |
|-----------------|---------|
| False Positives | 1,731   |
| True Positives  | 21,889  |
| True Negatives  | 552,958 |
| False Negatives | 4       |

ARP (Test 1) - Anomaly Detection Metrics (Contd.)

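| Metric                    | Value    |
|---------------------------|----------|
| Accuracy                  | 99.6990% |
| Error Rate                | 0.3009%  |
| Sensitivity (Recall)      | 99.9817% |
| Specificity               | 99.6879% |
| Precision                 | 92.6714% |
| Negative Predictive Value | 99.9992% |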

Only one set of results with ARP

  • Too many errors using other settings
  • Difficult to improve on already extremely good results

Amok (Test 1)

Amok (Test 1) KNN Parameters

  • smote.k = 3
  • knn.k = 5
  • smote.perc.over = 150
  • smote.perc.under = 90

Amok (Test 1) - Confusion Matrix

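  • N = 565,216

|             | Predicted: NO | Predicted: YES | Total   |
|-------------|---------------|----------------|---------|
| Actual: NO  | 511,451       | 42,928         | 554,379 |
| Actual: YES | 562           | 10,275         | 10,837  |
| Total       | 512,013       | 53,203         | 565,216 |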

Amok (Test 1) - Anomaly Detection Metrics

| Outcome         | Count   |
|-----------------|---------|
| False Positives | 42,928  |
| True Positives  | 10,275  |
| True Negatives  | 511,451 |
| False Negatives | 562     |

Amok (Test 1) - Anomaly Detection Metrics (Contd.)

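| Metric                    | Value    |
|---------------------------|----------|
| Accuracy                  | 92.3056% |
| Error Rate                | 7.6944%  |
| Sensitivity (Recall)      | 94.8140% |
| Specificity               | 92.2566% |
| Precision                 | 19.3128% |
| Negative Predictive Value | 99.8902% |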

Amok (Test 2)

Amok (Test 2) KNN Parameters

  • smote.k = 1
  • knn.k = 1
  • smote.perc.over = 120
  • smote.perc.under = 200

Amok (Test 2) - Confusion Matrix

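  • N = 565,216

|             | Predicted: NO | Predicted: YES | Total   |
|-------------|---------------|----------------|---------|
| Actual: NO  | 529,906       | 24,473         | 554,379 |
| Actual: YES | 1,099         | 9,738          | 10,837  |
| Total       | 531,005       | 34,211         | 565,216 |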

Amok (Test 2) - Anomaly Detection Metrics

| Outcome         | Count   |
|-----------------|---------|
| False Positives | 24,473  |
| True Positives  | 9,738   |
| True Negatives  | 529,906 |
| False Negatives | 1,099   |

Amok (Test 2) - Anomaly Detection Metrics (Contd.)

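| Metric                    | Value    |
|---------------------------|----------|
| Accuracy                  | 95.4757% |
| Error Rate                | 4.5242%  |
| Sensitivity (Recall)      | 89.8588% |
| Specificity               | 95.5855% |
| Precision                 | 28.4645% |
| Negative Predictive Value | 99.7930% |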

Deauthentication (Test 1)

Deauthentication (Test 1) KNN Parameters

  • smote.k = 3
  • knn.k = 5
  • smote.perc.over = 150
  • smote.perc.under = 90

Deauthentication (Test 1) - Confusion Matrix

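  • N = 558,167

|             | Predicted: NO | Predicted: YES | Total   |
|-------------|---------------|----------------|---------|
| Actual: NO  | 512,542       | 42,022         | 554,564 |
| Actual: YES | 95            | 3,508          | 3,603   |
| Total       | 512,637       | 45,530         | 558,167 |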

Deauthentication (Test 1) - Anomaly Detection Metrics

| Outcome         | Count   |
|-----------------|---------|
| False Positives | 42,022  |
| True Positives  | 3,508   |
| True Negatives  | 512,542 |
| False Negatives | 95      |

Deauthentication (Test 1) - Anomaly Detection Metrics (Contd.)

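| Metric                    | Value    |
|---------------------------|----------|
| Accuracy                  | 92.4544% |
| Error Rate                | 7.5455%  |
| Sensitivity (Recall)      | 97.3633% |
| Specificity               | 92.4225% |
| Precision                 | 7.7048%  |
| Negative Predictive Value | 99.9814% |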

Deauthentication (Test 2)

Deauthentication (Test 2) KNN Parameters

  • smote.k = 1
  • knn.k = 1
  • smote.perc.over = 90
  • smote.perc.under = 400

Deauthentication (Test 2) - Confusion Matrix

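  • N = 558,167

|             | Predicted: NO | Predicted: YES | Total   |
|-------------|---------------|----------------|---------|
| Actual: NO  | 527,780       | 26,784         | 554,564 |
| Actual: YES | 379           | 3,224          | 3,603   |
| Total       | 528,159       | 30,008         | 558,167 |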

Deauthentication (Test 2) - Anomaly Detection Metrics

| Outcome         | Count   |
|-----------------|---------|
| False Positives | 26,784  |
| True Positives  | 3,224   |
| True Negatives  | 527,780 |
| False Negatives | 379     |

Deauthentication (Test 2) - Anomaly Detection Metrics (Contd.)

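| Metric                    | Value    |
|---------------------------|----------|
| Accuracy                  | 95.1335% |
| Error Rate                | 4.8664%  |
| Sensitivity (Recall)      | 89.4809% |
| Specificity               | 95.1703% |
| Precision                 | 10.7438% |
| Negative Predictive Value | 99.9282% |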

Authentication Request (Test 1)

Authentication Request (Test 1) KNN Parameters

  • smote.k = 3
  • knn.k = 5
  • smote.perc.over = 150
  • smote.perc.under = 90

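Authentication Request (Test 1) - Confusion Matrix

  • N = 555,805

|             | Predicted: NO | Predicted: YES | Total   |
|-------------|---------------|----------------|---------|
| Actual: NO  | 513,668       | 40,945         | 554,613 |
| Actual: YES | 31            | 1,161          | 1,192   |
| Total       | 513,699       | 42,106         | 555,805 |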

Authentication Request (Test 1) - Anomaly Detection Metrics

| Outcome         | Count   |
|-----------------|---------|
| False Positives | 40,945  |
| True Positives  | 1,161   |
| True Negatives  | 513,668 |
| False Negatives | 31      |

Authentication Request (Test 1) - Anomaly Detection Metrics (Contd.)

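| Metric                    | Value    |
|---------------------------|----------|
| Accuracy                  | 92.6276% |
| Error Rate                | 7.3723%  |
| Sensitivity (Recall)      | 97.3993% |
| Specificity               | 92.6174% |
| Precision                 | 2.7573%  |
| Negative Predictive Value | 99.9939% |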

Authentication Request (Test 2)

Authentication Request (Test 2) KNN Parameters

  • smote.k = 1
  • knn.k = 1
  • smote.perc.over = 100
  • smote.perc.under = 300

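Authentication Request (Test 2) - Confusion Matrix

  • N = 555,805

|             | Predicted: NO | Predicted: YES | Total   |
|-------------|---------------|----------------|---------|
| Actual: NO  | 540,840       | 13,773         | 554,613 |
| Actual: YES | 152           | 1,040          | 1,192   |
| Total       | 540,992       | 14,813         | 555,805 |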

Authentication Request (Test 2) - Anomaly Detection Metrics

| Outcome         | Count   |
|-----------------|---------|
| False Positives | 13,773  |
| True Positives  | 1,040   |
| True Negatives  | 540,840 |
| False Negatives | 152     |

Authentication Request (Test 2) - Anomaly Detection Metrics (Contd.)

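| Metric                    | Value    |
|---------------------------|----------|
| Accuracy                  | 97.4946% |
| Error Rate                | 2.5053%  |
| Sensitivity (Recall)      | 87.2483% |
| Specificity               | 97.5166% |
| Precision                 | 7.0208%  |
| Negative Predictive Value | 99.9719% |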

Sources

  • Intrusion Detection in 802.11 Networks: Empirical Evaluation of Threats and a Public Dataset
  • https://en.wikipedia.org/wiki/Address_Resolution_Protocol

Thank You

Languages

  • HTML 73.8%
  • R 19.2%
  • Python 4.7%
  • Java 1.9%
  • Shell 0.4%