1 Introduction

In this project I use a support vector machine (SVM), a supervised machine learning model used for classification, to classify a preprocessed dataset of emails as spam or not spam. The final model reaches 99.0% accuracy on the test set.

You can download my code from GitHub to follow along. I begin this guide in Python with the scikit-learn package and then repeat the same analysis in R with the e1071 package.

2 Python SVM Walkthrough

2.1 Setup

2.1.1 Load Packages

import scipy.io as sio
from sklearn import svm
import numpy as np
import h5py

2.1.2 Load Email Data

traindata = sio.loadmat('spamTrain.mat')
X = traindata['X']
y = traindata['y']
##The test dataset is saved in an HDF5-based format (MATLAB v7.3),
##so it is loaded with h5py instead of sio.loadmat.
##Arrays read this way come back transposed, hence the .T
with h5py.File('spamTest2.mat','r') as f:
    Xtest = np.array(f.get('Xtest')).T
    ytest = np.array(f.get('ytest')).T

The dataset I am using is a preprocessed set of emails from Stanford’s Machine Learning course on Coursera. Each email is described by 1899 features in X, one for each word that appeared at least 100 times in the entire email dataset. For a given email, a feature is 1 if the corresponding word is present and 0 otherwise. The label y indicates whether the email is spam (y=1) or not (y=0).
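To make the feature encoding concrete, here is a minimal sketch of how one email could be turned into a 0/1 word-presence vector. The four-word vocabulary and the helper `email_to_features` are hypothetical and only for illustration; the real dataset uses 1899 words and was preprocessed upstream, so the actual pipeline is likely more involved than a lowercase-and-split.

```python
# Hypothetical four-word vocabulary; the real dataset uses 1899 words.
vocab = ["free", "money", "meeting", "click"]

def email_to_features(text, vocab):
    """Return a list with 1 where the vocabulary word occurs in the email, else 0."""
    # Assumption: simple lowercase + whitespace tokenization, for illustration only
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocab]

features = email_to_features("Click here for FREE money", vocab)
print(features)  # [1, 1, 0, 1]
```

Each row of X has this shape: one entry per vocabulary word, regardless of how long the email is or how often a word repeats.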

2.2 Train Support Vector Machine

##Train SVM with linear kernel
##Use for loop to determine best value for regularization parameter C
C = [0.01,0.03,0.1,0.3,1,3,10]
trainingAccuracy = np.empty([len(C)])
testingAccuracy = np.empty([len(C)])
f1 = np.empty([len(C)])
for i in range(0,len(C)):
    clf = svm.SVC(C=C[i],kernel='linear')
    clf.fit(X, y.ravel())
    trainPredict = clf.predict(X)
    trainingAccuracy[i] = np.mean(trainPredict==y.ravel())*100
    
    testPredict = clf.predict(Xtest)
    testingAccuracy[i] = np.mean(testPredict==ytest.ravel())*100
    
    #F1 score on the test set, treating spam (1) as the positive class
    tp = np.logical_and(testPredict == 1, ytest.ravel() == 1)  #true positives
    p = ytest.ravel() == 1                                     #actual positives
    recall = np.sum(tp)/np.sum(p)
    fp = np.logical_and(testPredict == 1, ytest.ravel() == 0)  #false positives
    precision = np.sum(tp)/(np.sum(tp)+np.sum(fp))
    f1[i] = (2*(precision*recall))/(precision+recall)
    

2.3 Results

print(testingAccuracy)
## [98.  99.  98.9 98.3 97.8 97.3 97.5]
print(trainingAccuracy)
## [ 98.325  99.425  99.825  99.975  99.975  99.975 100.   ]
print(f1)
## [0.96710526 0.98381877 0.98228663 0.9726248  0.96451613 0.95638126
##  0.95974235]
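The F1 values printed above combine precision and recall in the same way as the training loop. As a small worked sketch, the function below applies that formula to made-up labels (not taken from the email dataset); the name `f1_score_from_labels` is my own for this illustration.

```python
def f1_score_from_labels(predicted, actual):
    """F1 for the positive class (label 1), from two equal-length label lists.

    Assumes at least one true positive, otherwise a division by zero occurs.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # true positives
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # false positives
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for illustration only
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
print(f1_score_from_labels(predicted, actual))
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and F1 is 2/3, even though 6 of the 8 predictions (75%) are correct. This is why I report F1 alongside accuracy.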

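Test set accuracy is highest at C = 0.03, so I use that value as the regularization parameter. Larger values of C keep raising the training accuracy (up to 100% at C = 10) while test accuracy falls, which suggests overfitting. With C = 0.03, the support vector machine classifies the test emails with an accuracy of 99.0% and an F1 score of 0.984.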
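The choice of C can also be made in code rather than by reading the printout. This is a sketch in plain Python using the test accuracies copied from the output above; with the NumPy arrays from the training loop, `np.argmax(testingAccuracy)` would play the same role. The variable names `testing_accuracy`, `best_index`, and `best_C` are mine.

```python
C = [0.01, 0.03, 0.1, 0.3, 1, 3, 10]
# Test accuracies (%) copied from the printed results above
testing_accuracy = [98.0, 99.0, 98.9, 98.3, 97.8, 97.3, 97.5]

# Index of the highest test accuracy (first one wins on ties)
best_index = max(range(len(C)), key=lambda i: testing_accuracy[i])
best_C = C[best_index]
print(best_C, testing_accuracy[best_index])  # 0.03 99.0
```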

3 R Code

3.1 Setup

3.1.1 Load Packages

library(e1071)
library(R.matlab)
library(rhdf5)

3.1.2 Load Data

traindata <- readMat('spamTrain.mat')
X <- traindata$X
Y <- traindata$y
Xtest <- h5read('spamTest2.mat','Xtest')
ytest <- h5read('spamTest2.mat','ytest')

3.2 Train Model

#Using cost determined in Python
clf <- svm(cost = .03, x=X,y=Y,kernel="linear",scale=FALSE,type="C-classification")
trainPredict <- fitted(clf)
trainingAccuracy <- mean(trainPredict == Y)
pred <- predict(clf, Xtest)
testingAccuracy <- mean(pred == ytest)

#F1 score, treating spam (1) as the positive class
tp <- pred == 1 & ytest == 1
p <- ytest == 1
recall <- sum(tp)/sum(p)
fp <- pred == 1 & ytest == 0
precision <- sum(tp)/(sum(tp)+sum(fp))
f1 <- (2*(precision*recall))/(precision+recall)

3.3 Results

print(testingAccuracy)
## [1] 0.99
print(f1)
## [1] 0.9838188