In this project I use a support vector machine (SVM) to classify a preprocessed dataset of emails as spam or not spam, reaching 99.0% accuracy on the test set. An SVM is a supervised machine learning model used for classification.
You can download my code from GitHub to follow along. I begin this guide in Python with the scikit-learn package and then repeat the same analysis in R with the e1071 package.
import scipy.io as sio
from sklearn import svm
import numpy as np
import pandas as pd
import h5py
traindata = sio.loadmat('spamTrain.mat')
X = traindata['X']
y = traindata['y']
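## The test set is saved in a different format (read here with h5py),
## so the process to load it is slightly different. The arrays are
## transposed after loading so that each row is one email, as in X and y.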
f = h5py.File('spamTest2.mat','r')
Xtest = np.array(f.get('Xtest'))
Xtest = Xtest.T
ytest = np.array(f.get('ytest'))
ytest = ytest.T
The dataset is a preprocessed set of emails from Stanford’s Machine Learning course on Coursera. Each email is represented by 1899 features in X, one for each word that appeared at least 100 times across the entire email dataset. For a given email, a feature is 1 if the corresponding word is present and 0 otherwise. The label y indicates whether the email is spam (y=1) or not (y=0).
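The word-presence encoding described above can be sketched in a few lines. This is only an illustration: the five-word vocabulary and the helper `email_to_features` are hypothetical, and the sketch does not reproduce the course's actual preprocessing, which produced the 1899-word vocabulary.

```python
import re

# Hypothetical five-word vocabulary, for illustration only.
# The real dataset uses 1899 words chosen by frequency.
vocabulary = ["buy", "click", "free", "meeting", "now"]

def email_to_features(text, vocab):
    """Return a 0/1 list: 1 if the vocabulary word occurs in the email."""
    # Lowercase the text and keep runs of letters as words.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if word in words else 0 for word in vocab]

features = email_to_features("Click now to claim your FREE prize", vocabulary)
print(features)  # [0, 1, 1, 0, 1]
```

Each row of X has this form, only with 1899 entries instead of five.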
##Train SVM with linear kernel
##Use for loop to determine best value for regularization parameter C
C = [0.01,0.03,0.1,0.3,1,3,10]
trainingAccuracy = np.empty([len(C)])
testingAccuracy = np.empty([len(C)])
f1 = np.empty([len(C)])
for i in range(0,len(C)):
    clf = svm.SVC(C=C[i],kernel='linear')
    clf.fit(X, y.ravel())
    trainPredict = clf.predict(X)
    trainingAccuracy[i] = np.mean(trainPredict==y.ravel())*100
    testPredict = clf.predict(Xtest)
    testingAccuracy[i] = np.mean(testPredict==ytest.ravel())*100
    #f1 score
    tp = (np.logical_and(testPredict[:] == 1,ytest.ravel()[:] == 1))
    p = ytest.ravel() == 1
    recall = sum(tp)/sum(p)
    fp = (np.logical_and(testPredict[:] == 1,ytest.ravel()[:] == 0))
    precision = sum(tp)/(sum(tp)+sum(fp))
    f1[i] = (2*(precision*recall))/(precision+recall)
print(testingAccuracy)
## [98. 99. 98.9 98.3 97.8 97.3 97.5]
print(trainingAccuracy)
## [ 98.325 99.425 99.825 99.975 99.975 99.975 100. ]
print(f1)
## [0.96710526 0.98381877 0.98228663 0.9726248 0.96451613 0.95638126
## 0.95974235]
Since the test set accuracy is highest at C=0.03, I use C=0.03 as the regularization parameter. With this setting, the support vector machine classifies the test email set with an accuracy of 99.0% and an F1 score of 0.9838.
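As a rough sketch of the two steps behind that choice, the block below picks the C with the highest reported test accuracy and spells out the F1 calculation in plain Python. The accuracy values are copied from the output above. The helper `f1_score_binary` and the toy labels are hypothetical, made up only to make the arithmetic easy to check by hand.

```python
# Values copied from the run reported above.
C_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10]
test_accuracy = [98.0, 99.0, 98.9, 98.3, 97.8, 97.3, 97.5]

def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # avoid dividing by zero when nothing is correctly flagged
    precision = tp / (tp + fp)  # share of flagged emails that are spam
    recall = tp / (tp + fn)     # share of spam emails that were flagged
    return 2 * precision * recall / (precision + recall)

# Pick the C whose test accuracy is highest.
best_index = max(range(len(C_values)), key=lambda i: test_accuracy[i])
best_C = C_values[best_index]
print(best_C, test_accuracy[best_index])  # 0.03 99.0

# Toy labels (not from the dataset): 3 true positives, 1 false positive,
# 1 false negative, so precision = recall = 0.75.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_f1 = f1_score_binary(toy_true, toy_pred)
print(toy_f1)  # 0.75
```

The recall here divides by tp + fn, which is the same quantity as the count of actual positives used in the loop above.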
library(e1071)
library(R.matlab)
library(rhdf5)
traindata <- readMat('spamTrain.mat')
X <- traindata$X
Y <- traindata$y
Xtest <- h5read('spamTest2.mat','Xtest')
ytest <- h5read('spamTest2.mat','ytest')
#Using cost determined in Python
clf <- svm(cost = .03, x=X,y=Y,kernel="linear",scale=FALSE,type="C-classification")
trainPredict <- fitted(clf)
trainingAccuracy <- mean(trainPredict == Y)
pred <- predict(clf, Xtest)
testingAccuracy <- mean(pred == ytest)
#f1 score
tp <- pred == 1 & ytest == 1
p <- ytest == 1
recall <- sum(tp)/sum(p)
fp <- pred == 1 & ytest == 0
precision <- sum(tp)/(sum(tp)+sum(fp))
f1 <- (2*(precision*recall))/(precision+recall)
print(testingAccuracy)
## [1] 0.99
print(f1)
## [1] 0.9838188