Detecting phishing websites is a crucial safety measure for most online platforms. To protect a platform from malicious requests originating from such websites, it is important to have a robust phishing detection system in place.
Thanks are due to Rami M. Mohammad, Fadi Thabtah, and Lee McCluskey, who have worked intensively in this area. In this post, we are going to use the Phishing Websites Data from the UCI Machine Learning Repository. This dataset was donated by Rami Mustafa A Mohammad for further analysis. Rami M. Mohammad, Fadi Thabtah, and Lee McCluskey have even used neural nets and various other models to create a really robust phishing detection system. I really encourage you to have a look at the original papers here and here.
For this fairly basic analysis, we are going to try multiple models and see which one fits our dataset best. And finally, the most important part: a breakdown of the most important features for detecting a phishing website, using a Random Forest fit.
We’ll start by loading the CSV file in our R script and setting new column names.
library(caret)
library(doMC)

# Register 4 cores for parallel computing
registerDoMC(4)

# Read the data from the csv file
data <- read.csv('Datasets/phising.csv',
                 header = F,
                 colClasses = "factor")

# Names list for the features
names <- c("has_ip", "long_url", "short_service", "has_at",
           "double_slash_redirect", "pref_suf", "has_sub_domain",
           "ssl_state", "long_domain", "favicon", "port",
           "https_token", "req_url", "url_of_anchor", "tag_links",
           "SFH", "submit_to_email", "abnormal_url", "redirect",
           "mouseover", "right_click", "popup", "iframe",
           "domain_Age", "dns_record", "traffic", "page_rank",
           "google_index", "links_to_page", "stats_report", "target")

# Add column names
names(data) <- names
Here we are importing the caret and doMC libraries and then registering 4 cores for parallel processing. You can set the number of cores according to your machine.
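If you'd rather not hard-code the core count, you can derive it from the machine itself; here is a minimal sketch using the parallel package (keeping one core free is just a conservative choice, not a requirement):

library(parallel)

# Use all available cores but one, so the machine stays responsive
registerDoMC(max(1, detectCores() - 1))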
All of the features in this dataset are factors, which is why I have passed colClasses = "factor" to read.csv. You can have a look at the README.md file in this Github Repo to get an overview of the possible values of each feature.
Now, first things first, let’s have a look at the data:

str(data)
'data.frame': 2456 obs. of 31 variables:
$ has_ip : Factor w/ 2 levels "0","1": 2 1 1 ...
$ long_url : Factor w/ 3 levels "0","1","-1": 2 2 1 ...
$ short_service : Factor w/ 2 levels "0","1": 1 1 1 ...
$ has_at : Factor w/ 2 levels "0","1": 1 1 1 ...
$ double_slash_redirect: Factor w/ 2 levels "0","1": 2 1 1 ...
$ pref_suf : Factor w/ 3 levels "0","1","-1": 3 3 3 ...
$ has_sub_domain : Factor w/ 3 levels "0","1","-1": 3 1 3 ...
$ ssl_state : Factor w/ 3 levels "0","1","-1": 3 2 3 ...
$ long_domain : Factor w/ 3 levels "0","1","-1": 1 1 1 ...
$ favicon : Factor w/ 2 levels "0","1": 1 1 1 ...
$ port : Factor w/ 2 levels "0","1": 1 1 1 ...
$ https_token : Factor w/ 2 levels "0","1": 2 2 2 ...
$ req_url : Factor w/ 2 levels "1","-1": 1 1 1 ...
$ url_of_anchor : Factor w/ 3 levels "0","1","-1": 3 1 1 ...
$ tag_links : Factor w/ 3 levels "0","1","-1": 2 3 3 ...
$ SFH : Factor w/ 2 levels "1","-1": 2 2 2 ...
$ submit_to_email : Factor w/ 2 levels "0","1": 2 1 2 ...
$ abnormal_url : Factor w/ 2 levels "0","1": 2 1 2 ...
$ redirect : Factor w/ 2 levels "0","1": 1 1 1 ...
$ mouseover : Factor w/ 2 levels "0","1": 1 1 1 ...
$ right_click : Factor w/ 2 levels "0","1": 1 1 1 ...
$ popup : Factor w/ 2 levels "0","1": 1 1 1 ...
$ iframe : Factor w/ 2 levels "0","1": 1 1 1 ...
$ domain_Age : Factor w/ 3 levels "0","1","-1": 3 3 1 ...
$ dns_record : Factor w/ 2 levels "0","1": 2 2 2 ...
$ traffic : Factor w/ 3 levels "0","1","-1": 3 1 2 ...
$ page_rank : Factor w/ 3 levels "0","1","-1": 3 3 3 ...
$ google_index : Factor w/ 2 levels "0","1": 1 1 1 ...
$ links_to_page : Factor w/ 3 levels "0","1","-1": 2 2 1 ...
$ stats_report : Factor w/ 2 levels "0","1": 2 1 2 ...
$ target : Factor w/ 2 levels "1","-1": 1 1 1 ...
So, we have 30 features and a target variable with two levels (1, -1), i.e. whether a website is a phishing website or not.
We’ll now create training and test sets using caret’s createDataPartition function. We’ll use the test set to validate the accuracy of our detection system.
# Set a random seed so we can reproduce the results
set.seed(1234)

# Create training and testing partitions
train_in <- createDataPartition(y = data$target,
                                p = 0.75,
                                list = FALSE)
training <- data[train_in, ]
testing  <- data[-train_in, ]
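Since createDataPartition samples within the levels of y, the class balance should be roughly preserved in both partitions. A quick, optional sanity check:

# Proportion of phishing vs. legitimate samples in each partition
prop.table(table(training$target))
prop.table(table(testing$target))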
Now we are ready to try a few models on the dataset, starting with a Boosted Logistic Regression model. Let’s see how it performs in our quest for the nearly perfect phishing detection system ;).
###################### Boosted Logistic Regression ######################

# trainControl for Boosted Logistic Regression
fitControl <- trainControl(method = 'repeatedcv',
                           repeats = 5,
                           number = 5,
                           verboseIter = T)

# Run a Boosted Logistic Regression over the training set
log.fit <- train(target ~ ., data = training,
                 method = "LogitBoost",
                 trControl = fitControl,
                 tuneLength = 5)

# Predict the testing target
log.predict <- predict(log.fit, testing[,-31])
confusionMatrix(log.predict, testing$target)
We are using caret’s trainControl function to find the best-performing parameters via repeated cross-validation. After creating a confusion matrix of the predicted values against the real target values, I get a prediction accuracy of 0.9357, which is actually pretty good for a Boosted Logistic Regression model.
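If you want the accuracy programmatically rather than reading it off the printed output, the object returned by confusionMatrix exposes it; a small sketch:

cm <- confusionMatrix(log.predict, testing$target)

# Overall accuracy and Kappa live in the 'overall' component
cm$overall[c("Accuracy", "Kappa")]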
But of course we have better model choices, right? And there is no reason not to use one of our favourites: an SVM with an RBF kernel.
###################### SVM - RBF Kernel ######################

# trainControl for Radial SVM
fitControl = trainControl(method = "repeatedcv",
                          repeats = 5,
                          number = 5,
                          verboseIter = T)

# Run an RBF-SVM over the training set
rbfsvm.fit <- train(target ~ .,
                    data = training,
                    method = "svmRadial",
                    trControl = fitControl,
                    tuneLength = 5)

# Predict the testing target
rbfsvm.predict <- predict(rbfsvm.fit, testing[,-31])
confusionMatrix(rbfsvm.predict, testing$target)
Woah! I am getting 0.9706 accuracy with an SVM and an RBF kernel. Looks like there is almost no escape for phishing websites now :D.
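As an aside, since both models were tuned with repeated cross-validation, caret’s resamples() can put their resampling results side by side. A minimal sketch, assuming the two fits above (the folds are drawn separately for each call, so treat this as a rough comparison):

# Collect the cross-validation results of both fits
results <- resamples(list(LogitBoost = log.fit, RBF_SVM = rbfsvm.fit))

# Summarize accuracy and Kappa across resamples
summary(results)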
But since one of the most important reasons I picked up this analysis was to find the most important predictors for identifying a phishing website, we’ll have to move to tree-based models to get the variable importance.
So, let’s fit a tree bagging model on our dataset.
###################### TreeBag ######################

# trainControl for Treebag
fitControl = trainControl(method = "repeatedcv",
                          repeats = 5,
                          number = 5,
                          verboseIter = T)

# Run a Treebag classification over the training set
treebag.fit <- train(target ~ .,
                     data = training,
                     method = "treebag",
                     trControl = fitControl,
                     importance = T)

# Predict the testing target
treebag.predict <- predict(treebag.fit, testing[,-31])
confusionMatrix(treebag.predict, testing$target)
Now this is something: an accuracy of 0.9739, and we also get our variable importances :). But I am not going to show those without fitting another tree model, the almighty (throw-anything-at-me) Random Forest.
###################### Random Forest ######################

# trainControl for Random Forest
fitControl = trainControl(method = "repeatedcv",
                          repeats = 5,
                          number = 5,
                          verboseIter = T)

# Run a Random Forest classification over the training set
rf.fit <- train(target ~ .,
                data = training,
                method = "rf",
                importance = T,
                trControl = fitControl,
                tuneLength = 5)

# Predict the testing target
rf.predict <- predict(rf.fit, testing[,-31])
confusionMatrix(rf.predict, testing$target)
That’s some coincidence (or not): with mtry = 21, we are still getting an accuracy of 0.9739 with our Random Forest model, which is actually pretty good, even for practical purposes.
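If you want to confirm which mtry caret settled on, and how the other candidate values fared, the train object keeps that information around; a quick sketch:

# Best tuning parameter chosen by cross-validation
rf.fit$bestTune

# Cross-validated performance for every candidate mtry
rf.fit$results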
So, finally, let’s have a look at the variable importance of the different features:
plot(varImp(rf.fit))
According to our Random Forest model, the 10 most important features are:
* pref_suf-1 100.00
* url_of_anchor-1 85.89
* ssl_state1 84.59
* has_sub_domain-1 69.18
* traffic1 64.39
* req_url-1 43.23
* url_of_anchor1 37.58
* long_domain-1 36.00
* domain_Age-1 34.68
* domain_Age1 29.54
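If you'd rather pull these numbers out programmatically than read them off the plot, the scores live in a data frame inside varImp's return value. A minimal sketch (the column names depend on the model type, so this just sorts by the first column):

imp <- varImp(rf.fit)$importance

# For a two-class random forest this data frame may carry per-class
# columns; rank the dummy-coded feature levels by the first one
head(imp[order(-imp[, 1]), , drop = FALSE], 10)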
The numerical values suffixed to the feature names are just the levels of the factor for that particular feature. As is apparent from this variable importance plot, and from our own intuition, the features listed here are indeed some of the most important attributes for finding out whether a given sample is a phishing website or not.
For instance, if prefixes or suffixes are being used in the URL, then there is a very high chance that it’s a phishing website. Likewise, a suspicious SSL state, a sub-domain in the URL, a long domain URL, etc. are really important features that can clearly identify a phishing website.
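You can eyeball any one of these relationships directly with a simple cross-tabulation; for example, checking how the pref_suf feature lines up with the target:

# Cross-tabulate the prefix/suffix feature against the target label
table(pref_suf = data$pref_suf, target = data$target)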
One could create a phishing detection system pretty easily, given the information about these predictors. Rami M. Mohammad, Fadi Thabtah, and Lee McCluskey have also described in their original paper how they did this.
I am sure that neural nets could further increase the accuracy of the phishing detection system, but I set out to do a very basic analysis and it worked out pretty well. Of course, getting and filtering the data, and creating factors out of the different attributes, is probably the most challenging task in phishing website detection.
You can further look at the Github repo with the above code at: rishy/phishing-websites. Your feedback and comments are always welcome.