Intrusion detection
Summary
The task of developed intrusion detection system is to analyse the authentication records and separate genuine and fraudulent authentication attempts for each user account in the system. To achieve this we combine two machine learning algorithms – Expectation Maximisation (EM) algorithm followed by Bayesian Classification. The Expectation Maximisation algorithm uses a maximum likelihood type parameter estimation approach and is ideally suited for cases in which the available data set is incomplete or "hidden". It is an iterative procedure which under certain conditions converges to values at a local or global maximum of the likelihood function. EM algorithm has been first proposed more than 30 years ago and since then it has been used in fields as varied as statistics, computer vision, signal processing, machine learning, pattern recognition and many others. The problem we are trying to address using EM algorithm can be simplified and described as a problem of "learning a mixture of two Gaussians" described with Figure 1. Data presented to algorithm is generated by picking one of two Gaussians at random and then sampling from the selected distribution. If each Gaussian describes on of two users – genuine and fraudulent, trying to authenticate, knowing from which Gaussian each sample of our data originated would completely solve our ID problem. Gaussian type distributions are assumed here for both genuine and fraudulent user, although it is possible to modify the equations implementing the EM in order to tackle any other density function, as long as it can be parameterised. So what are the hidden variables in this problem? Well, if we knew which sample in our set is generated by which distribution we could easily solve the problem. It would then be easy to calculate sample mean and variance for each distribution. All that would be left in this situation would be to somehow classify the new samples (i.e. new authentication attempts) as members of one or the other Gaussian.

Figure 1. Recognising original pdfs for intrusion detection
Since this is not the case we need to think of each data sample as a triple of the form {xi, zi1, zi2}, where zij is 1 if xi is generated by jth Gaussian and 0 otherwise. zij variables are not known to us, i.e. those variables are hidden. The EM algorithm solves this problem by starting from an arbitrary hypothesis h={µ1, µ2} and working iteratively by estimating the expected values of hidden variables based on the current hypothesis for {µ1, µ2} and then adjusting this hypothesis. It can be proved that each iteration of the algorithm increases the likelihood of the data given and that EM algorithm therefore converges to one of local maxima depending on the initial hypothesis, i.e. its starting point. This approach can also be generalised to an arbitrary number of Gaussians but it has been found that the algorithm sensitivity to the choice of starting point increases in such situations. Various approaches to algorithm initialisation have been suggested in those situations. More generally, the setting of the EM algorithm involves a set of m independently drawn samples and two sets of variables X={x1, x2, …, xm} and Z={z1, z2, …, zm} that correspond to the observed data and hidden variables, respectively. If the set of all hypothesis can be parametrised by θ, the goal of the EM algorithm is to find θ that maximises E[ln P[Y|h]|X] where h denotes the current hypothesis. The EM algorithm consists of two previously mentioned steps. Speaking in mathematical terms:
- Expectation step: Calculate Q(h'|h) = E[ln P[Y|h']|h,X] (i.e. obtain score for each possible hypothesis h' based on h,X)
- Maximisation step: Set h' = argmax Q(h'|h)
For the case of two components equations for E and M steps of the algorithm can be derived from the general expressions given above. Iterative EM procedure and corresponding update equations for the case of two-Gaussian mixture are shown below.

Figure 2. Expectation Maximisation algorithm for two-Gaussian mixture
EM procedure performs well for well separated Gaussians but struggles to provide good results when distributions are close to each other. This problem is area of intensive research in data mining and pattern recognition fields. EM algorithm can also be extended to consider multidimensional data, i.e. each sample from our set can be described with more than one attribute. One example of such data is shown at the end of this section. Following the successful estimation of parameter set θ, we apply Bayes classification rule to assign the particular sample x to a class ωj with the highest posterior probability – probability that data sample x belongs to a particular class ωj where j = 1, 2, … J for J possible classes.

Figure 3. Bayes classification rule
The Bayes approach to pattern classification is a fundamental technique and is used in many pattern recognition applications. Since Bayes classifier poses the classification problem in terms of probabilities, all of the probabilities must be known or estimated from the data. In our system this has been done using the previously discussed EM algorithm. Figure 4 demonstrates the performance of EM algorithm for 3D (3 attribute) authentication data. More general results can be seen in the
results section.

Figure 4. EM and Bayes algorithms applied to authentication data
Matlab source code
% Expectation Maximisation algorithm for 3 attributes
% This program estimates the parameters
% of a Gaussian mixture of two normal pdfs.
clc;
clear all;
close all;
run generatedata
% generate simulation data
% load data % load simulation data
D =
3;
% number of attributes to collect (D<4)
Ntest =
2000;
% size of test data set
A = A
(:,
1:D
);
muA = muA
(1:D,:
);
covmA = covmA
(1:D,
1:D,:
);
P =
double(zeros(N,k
));
Ptot =
double(zeros(N,
1));
% Parameter estimation
% initialisation step
LL
(1) =
-999999999;
% initial loglikelihood values
LL
(2) =
-99999999;
tol =
10^
(-5);
% criterion for stopping EM
max_it =
50;
% maximum number of iterations
% initial values for mu and sigma (will be re-valued in the M-step)
% set them manually
mu
(1,:
) =
[1 10];
% means of two terms (G&F) in attribute 1
mu
(2,:
) =
[10 18];
% means of two terms (G&F) in attribute 2
mu
(3,:
) =
[10 4];
% means of two terms (G&F) in attribute 3
covm
(:,:,
1) =
eye(D
);
covm
(:,:,
2) =
eye(D
);
% use kmeans for initialisation
% [mu,covm] = kmeansinitialisation(A,mask);
mu = mu
(1:D,
1:k
);
mixcoef =
[0.5 0.5];
% mixing coeffients: sum(mixcoef)=1
covm_new =
zeros(size(covm
));
mu_new =
zeros(size(mu
));
i =
1;
% number of iterations
LLch =
abs((LL
(2)-LL
(1))/LL
(1));
% loglikelihood change
LLt = LLch;
% ********************************************************************
while (LLch>=tol &&
i <= max_it
)
% calculate the posterior probabilities
Ptot =
zeros(N,
1);
for j=
1:k,
P
(:,
j) = mixcoef
(j)*multivarpdf
(A,mu
(:,
j),covm
(:,:,
j));
Ptot = Ptot+P
(:,
j);
end
P = P./
repmat(Ptot,
1,k
);
% update the mixing coefficients
mixcoef_new =
sum(P
)/N;
% update the means and the variances
mu_new =
(A'*P
)./
(repmat(mixcoef,D,
1)*N
);
for j=
1:k,
Ac = A-
repmat(mu
(:,
j),
1,N
)';
covm_new
(:,:,
j) =
(Ac'*
diag(P
(:,
j))*Ac
)./
(mixcoef_new
(j)*N
);
end
% reset parameters
mu = mu_new;
covm = covm_new;
mixcoef = mixcoef_new;
% loglikelihood function
LL
(i
+2) =
sum(log(P
(:
)));
LLch =
abs((LL
(i
+2)-LL
(i
+1))/LL
(i
+1));
LLt =
[LLt LLch
];
w =
round(P
(:,
1));
if sum(w
)>N/
2, w=~w;
end
% plot EM algorithm convergence for 1D, 2D or 3D case
h =
figure(4);
set(h,'
Name',
['
EM Convergence ... iteration no. '
num2str(i)],'
NumberTitle','
off'
)
plotEMsearch
(D,muA,covmA,mu_new,covm_new,
i,max_it,LLch
);
i = i
+1;
end % while loop
fprintf('\n\nConvergence achieved in
%0.0f iterations!\n',i-1);
fprintf('
\n\nEstimated mean:\n'
);
mu
w =
round(P
(:,
1));
mask = w;
% ********************************************************************
% Test procedure
answer =
inputdlg('
Enter the number of sessions that you would like to use for test:',
...
'
ALGORITHM TEST',
1,
{num2str(Ntest
)});
if ~isempty
(answer
),
Ntest =
str2num(answer
{1});
else
return;
end
% increase size of arrays to hold results obtained for testing set
X =
zeros(Ntest,D
);
P =
zeros(Ntest,k
);
Ptot =
zeros(Ntest,
1);
% ********************************************************************
% generate new genuine and fraudulent sessions
mask = binornd
(10,
max(mixcoef
),Ntest,
1)<
5;
% set their attribute values
for q=
1:D,
a
(q
).
session = normrnd
(a
(q
).
mu(1+mask
),a
(q
).
sigma(1+mask
));
X
(:,q
) = a
(q
).
session;
end
% calculate posterior and prior probabilities
for j=
1:k,
P
(:,
j) = mixcoef
(j)*multivarpdf
(X,mu
(:,
j),covm
(:,:,
j));
Ptot = Ptot+P
(:,
j);
end
P = P./
repmat(Ptot,
1,k
);
% Bayes decision rule
% calculate all TP, TN, FP and FN
TP
(:,
1) =
(~mask.
*(P
(:,
1)>P
(:,
2)));
FP
(:,
1) =
(~mask.
*(P
(:,
1)<=P
(:,
2)));
TN
(:,
1) =
(mask.
*(P
(:,
1)<=P
(:,
2)));
FN
(:,
1) =
(mask.
*(P
(:,
1)>P
(:,
2)));
% calculate total number of genuine and fraudulent sessions in testing set
nF =
sum(mask
);
nG = Ntest-nF;
% calculate probability of TP, TN, FP and FN
errTP =
sum(TP
)/nG;
errFP =
sum(FP
)/nG;
errTN =
sum(TN
)/nF;
errFN =
sum(FN
)/nF;
% ********************************************************************
figure(5)
subplot(2,
2,
1)
set(gca,'
FontSize',
11);
bar(TP.
*P(:,
1),'
g'
)
title(['
True Positives (TP), p_T_P=',
num2str(errTP
)]);
xlabel('
session'
);
ylabel('
P(G|A(\mu_G,\sigma_G))'
);
axis([0 Ntest
0 1.2])
subplot(2,
2,
2)
set(gca,'
FontSize',
11);
bar(TN.
*P(:,
2),'
r'
)
% bar(1-mask(:),'r-')
title(['
True Negatives (TN), p_T_N=',
num2str(errTN
)]);
xlabel('
session'
);
ylabel('
P(F|A(\mu_F,\sigma_F))'
);
axis([0 Ntest
0 1.2])
% ********************************************************************
subplot(2,
2,
3)
set(gca,'
FontSize',
11);
bar(FP.
*P(:,
2),'
r'
)
title(['
False Positives (FP), p_F_P=',
num2str(errFP
)]);
xlabel('
session'
);
ylabel('
P(F|A(\mu_G,\sigma_G))'
);
axis([0 Ntest
0 1.2])
subplot(2,
2,
4)
set(gca,'
FontSize',
11);
bar(FN.
*P(:,
1),'
g'
)
title(['
False Negatives (FN), p_F_N=',
num2str(errFN
)]);
xlabel('
session'
);
ylabel('
P(G|A(\mu_F,\sigma_F))'
);
axis([0 Ntest
0 1.2])
% ********************************************************************
% attribute values taken mistakenly
fprintf('
\n\n*** Attribute values in GENUINE session identified as FRAUDULENT: ***'
);
for i=
1:D,
a
(i).
session(find((FP.
*P(:,
2))>
0))
end
fprintf('
\n\n*** Attribute values in FRAUDULENT session identified as GENUINE: ***'
);
for i=
1:D,
a
(i).
session(find((FN.
*P(:,
1))>
0))
end
figure(6)
stem(1.2*mask,'
r:','
fill'
);
hold on;
stem(mask.
*(P
(:,
1)<=P
(:,
2)),'
k:','
fill'
);
axis([0 Ntest
0 2])
title('
Real and Estimated Fraudulent Attempts'
);
xlabel('
session'
);
legend('
Real fraudulent attempts','
Estimated fraudulent attempts','
Location','
NE'
);
Convergence achieved in 14 iterations!
Estimated mean:
mu =
1.8412 6.9039
12.1577 16.9794
10.0447 4.2297
* Attribute values in GENUINE session identified as FRAUDULENT: *
ans =
4.6681
4.7567
4.5620
5.2124
ans =
11.7963
19.7565
12.6032
8.8465
ans =
5.8196
6.9216
6.3438
5.7980
* Attribute values in FRAUDULENT session identified as GENUINE: *
ans =
Empty matrix: 0-by-1
ans =
Empty matrix: 0-by-1
ans =
Empty matrix: 0-by-1
generatedata.m
% Generate synthetic training data from normal distribution
% ********************************************************************
% a(1)
% Number of incorrectly answered questions during the authentication process
% (e.g. for three allowed authentication attempts during the certain time interval
% this number can be between 0 - no incorrectly answered questions and
% 9 - all three questions incorrectly answered during all three attempts)
%
% a(2)
% Time of the day during which the authentication session
% has taken place (i.e. 0-24h)
%
% a(3)
% Length of the time user has spent on the system after successfully
% passing particular authentication session (i.e. 0-24h or whatever
% the maximum time that can be spent on the system is set to be).
% ********************************************************************
clc,
clear all,
close all
% ********************************************************************
% set total number of login sessions and probability of fraud events
N =
200;
% total number of login sessions
pF =
0.20;
% probability of fraudulent logins
k =
2;
% number of classes to be estimated
% ********************************************************************
nF = binornd
(N,pF
);
% number of fraudulent logins
while nF==
0,
nF = binornd
(N,pF
);
end
% mark fraudulent sessions with 1 and genuine with 0
mask =
[ones(nF,
1);
zeros(N-nF,
1)];
mask
(randperm(N
)) = mask;
% randomise sessions
% ********************************************************************
% set metadata for each attribute
a
(1).
name = '
Number of incorrectly answered questions';
a
(1).
range =
0:
1:
9;
a
(1).
mu =
[ 2 7]';
a
(1).
sigma =
[ 1 1]';
a
(2).
name = '
Time of the day when user log in';
a
(2).
range =
0:
1:
23;
a
(2).
mu =
[12 17]';
a
(2).
sigma =
[ 3 2]';
a
(3).
name = '
Time duration of the login session';
a
(3).
range =
0:
1:
23;
a
(3).
mu =
[10 4]';
a
(3).
sigma =
[ 2 2]';
% ********************************************************************
% generate attribute values using known pdfs for genuine and fraudulent sessions
for i=
1:
3,
a
(i).
session = normrnd
(a
(i).
mu(1+mask
),a
(i).
sigma(1+mask
),N,
1);
end
A =
[a
(1).
session a
(2).
session a
(3).
session];
muA =
[a
(1).
mu';a
(2).
mu';a
(3).
mu'
];
sigmaA =
[a
(1).
sigma';a
(2).
sigma';a
(3).
sigma'
];
covmA =
zeros(3,
3,
2);
covmA
(:,:,
1) =
diag(sigmaA
(1:
3,
1));
covmA
(:,:,
2) =
diag(sigmaA
(1:
3,
2));
save data a mask N nF pF A muA covmA k
run plotdata
multivarpdf.m
function y = multivarpdf
(x,mu,covm
);
% Multivariate normal probability density function
%
% y = multivarpdf(x,mu,covm)
% Returns the value of the multivariate PDF
% at the locations given in X.
%
% INPUTS: x - n x d matrix of domain locations
% mu - 1 x d mean vector
% covm - d x d covariance matrix
[n,d
] =
size(x
);
y =
zeros(n,
1);
for i =
1:n,
xc = x
(i,:
)-mu';
y
(i) =
exp((-.
5)*
(xc*
inv(covm
)*xc'
));
end
y =
((2*
pi)^
(d/
2)*
sqrt(det(covm
)))^
(-1)*y;
plotdata.m
% Plot synthetic training data
% ********************************************************************
% clc; clear all; close all;
load data
% ********************************************************************
% plot normal pdfs of genuine and fraudulent sessions for each attribute
figure('
Name','
Predefined pdfs','
NumberTitle','
off'
)
set(0,'
DefaultAxesColorOrder',
[0 1 0;
1 0 0])
for i=
1:
3,
subplot(3,
1,
i)
set(gca,'
FontSize',
12);
xx =
0:
0.2:
max(a
(i).
range);
a
(i).
pdf(:,
1) = normpdf
(xx,a
(i).
mu(1),a
(i).
sigma(1));
% pdf of genuine sessions
a
(i).
pdf(:,
2) = normpdf
(xx,a
(i).
mu(2),a
(i).
sigma(2));
% pdf of fraudulent sessions
plot(xx,a
(i).
pdf,'
LineWidth',
1.5)
legend('
genuine pdf','
fraudulent pdf','
Location','
NE'
)
legend('
boxoff'
)
axis([0 max(a
(i).
range) 0 1])
ylabel(['
Attribute ',
int2str(i)])
end
xlabel('
attribute value'
);
pause(2)
% plot random sessions
% plot synthetised attribute values for each attribute
figure('
Name','
Synthetised attribute values','
NumberTitle','
off'
)
for i=
1:
3,
subplot(3,
1,
i)
set(gca,'
FontSize',
12);
bar(~mask.
*a(i).
session,
0.7,'
g'
);
hold on;
bar(mask.
*a(i).
session,
0.7,'
r'
);
axis([0 N
0 max(a
(i).
range)+1])
if i==
3,
xlabel('
session no.'
);
end
ylabel(['
Attribute ',
int2str(i)])
end
pause(2)
figure('
Name','
Synthetised attribute values - fraudulent sessions hidden','
NumberTitle','
off'
)
for i=
1:
3,
subplot(3,
1,
i)
set(gca,'
FontSize',
12);
bar(a
(i).
session,
0.7,'
g'
);
hold on;
axis([0 N
0 max(a
(i).
range)+1])
if i==
3,
xlabel('
session no.'
);
end
ylabel(['
Attribute ',
int2str(i)])
end
pause(2)
% ********************************************************************
plotEMsearch.m
function plotEMsearch
(D,AM,AV,M,V,iter,max_iter,Lch
)
% Used to animate EM search in 1D, 2D or 3D casses
mug = AM
(:,
1);
muf = AM
(:,
2);
sigmag = AV
(:,:,
1);
sigmaf = AV
(:,:,
2);
mu1 =
real(M
(:,
1)'
);
mu2 =
real(M
(:,
2)'
);
sigma1 =
real(V
(:,:,
1));
sigma2 =
real(V
(:,:,
2));
subplot('
Position',
[0.1 0.5 0.8 0.4])
cla;
hold on;
switch D,
case 1,
plotGauss1D
(mug, sigmag, '
g'
);
plotGauss1D
(muf, sigmaf, '
r'
);
plotGauss1D
(mu1, sigma1, '
b'
);
plotGauss1D
(mu2, sigma2, '
b'
);
xlabel('
ATTRIBUTE 1 (0-9)'
);
legend('
Theoretical pdf1','
Theoretical pdf2',
...
'
Estimated pdf1','
Estimated pdf2','
Location','
NE'
);
axis([-5 15 0 0.8]);
case 2,
contourGauss2D
(mug, sigmag, '
g'
);
contourGauss2D
(muf, sigmaf, '
r'
);
contourGauss2D
(mu1, sigma1, '
b'
);
contourGauss2D
(mu2, sigma2, '
b'
);
xlabel('
ATTRIBUTE 1 (0-9)'
);
ylabel('
ATTRIBUTE 2 (0-24)'
);
axis([0 9 0 24]);
case 3,
view(-45,
15)
elipsoidGauss3D
(mug,sigmag,'
g'
);
elipsoidGauss3D
(muf,sigmaf,'
r'
);
elipsoidGauss3D
(mu1,sigma1,'
b'
);
elipsoidGauss3D
(mu2,sigma2,'
b'
);
grid on
xlabel('
ATTRIBUTE 1 (0-9)'
);
ylabel('
ATTRIBUTE 2 (0-24)'
);
zlabel('
ATTRIBUTE 3 (0-15)'
);
axis([0 9 0 24 0 15]);
end
title('
EM Search'
);
subplot('
Position',
[0.1 0.1 0.8 0.2])
stem(iter,Lch,'
Color',
[0.7 0.7 0.7],
...
'
MarkerFaceColor','
k',
...
'
MarkerEdgeColor','
k'
),
hold on
axis([0 max_iter
0 1])
xlabel('
iterations no.'
);
title('
Loglikelihood Change'
);
pause(0.4);
plotGauss1D.m
function plotGauss1D
(mu,sigma2,col
)
% Plot 1D Gaussian with mean mu and variance sigma2
sd =
sqrt(sigma2
);
% std deviation
x = mu
-5*sd:
0.02:mu
+5*sd;
% location of points at which x is calculated
g =
1/
(sqrt(2*
pi)*sd
)*
exp(-0.5*
(x-mu
).^
2/sigma2
);
plot(x,g,'
Color',col,'
LineWidth',
3);
contourGauss2D.m
function contourGauss2D
(mu, covar, col
)
% Make a contour plot and a surface plot of a 2D Gaussian
%
% INPUTS:
% mu - 2D mean
% covar - 2D covariance
% col - linespec
A =
inv(covar
);
% the inverse covariance matrix
maxsd =
sqrt(max(covar
(1,
1), covar
(2,
2)));
x = mu
(1)-3*maxsd:
0.1:mu
(1)+3*maxsd;
% location of points at which x is calculated
y = mu
(2)-3*maxsd:
0.1:mu
(2)+3*maxsd;
% location of points at which y is calculated
[X, Y
] =
meshgrid(x,y
);
% matrices used for plotting
% Compute value of Gaussian pdf at each point in the grid
z =
1/
(2*
pi*
sqrt(det(covar
)))*
exp(-0.5*
(A
(1,
1)*
(X-mu
(1)).^
2 ...
+
2*A
(1,
2)*
(X-mu
(1)).
*(Y-mu
(2)) + A
(2,
2)*
(Y-mu
(2)).^
2));
contour(x,y,z,col
);
elipsoidGauss3D.m
function [xout,yout,zout
]=elipsoidGauss3D
(mu,
cov,col
)
% This function return values to plot an ellipsoid surface in 3D
% the inputs are the location of the center (3 elements)
% and a 3x3 positive definite matrix denoting the covariance structure
% This was adapted from sphere.m provided with matlab.
%
% function [xout,yout,zout]=elipsoidGauss3D(mu,cov,col)
n =
16;
% spacing
theta =
(-n:
2:n
)/n*
pi;
phi =
(-n:
2:n
)'/n*
pi/
2;
cosphi =
cos(phi
);
cosphi
(1) =
0; cosphi
(n
+1) =
0;
sintheta =
sin(theta
);
sintheta
(1) =
0; sintheta
(n
+1) =
0;
% get the axes of the ellipsoid
inv_cov =
inv(cov);
[v,d
] =
eig(inv_cov
);
eigval =
diag(d
);
xout = cosphi*
cos(theta
)/
sqrt(eigval
(1));
yout = cosphi*sintheta/
sqrt(eigval
(2));
zout =
sin(phi
)*
ones(1,n
+1)/
sqrt(eigval
(3));
lo =
length(xout
);
data =
zeros(lo*lo,
3);
data
(:,
1) =
reshape(xout,lo*lo,
1);
data
(:,
2) =
reshape(yout,lo*lo,
1);
data
(:,
3) =
reshape(zout,lo*lo,
1);
datarot = data*v';
xouttemp = datarot
(:,
1)+mu
(1);
youttemp = datarot
(:,
2)+mu
(2);
zouttemp = datarot
(:,
3)+mu
(3);
xout =
reshape(xouttemp,lo,lo
);
yout =
reshape(youttemp,lo,lo
);
zout =
reshape(zouttemp,lo,lo
);
h =
surface(xout,yout,zout,'
FaceColor','
w','
FaceAlpha',
0.5,'
EdgeColor',col
);
kmeansinitialisation.m
% K-means algorithm applied on 3 attributes
% ********************************************************************
function [mu,covm
] = kmeansinitialisation
(X,mask
)
% positions of fraudulent attempts in the above sets
SW = mask;
% form a matrix of mixes for simultaneous unmixing
[Xrow,Xcol
] =
size(X
);
% ********************************************************************
% do the clustering
[idx,ctrs
] = kmeans
(X,
2,
...
'
Distance','
city',
...
'
Replicates',
10);
% ********************************************************************
% find the smaller cluster and considered it a cluster of fraudulent attempts
SW_est =
zeros(1,Xrow
);
I =
2-
(length(find(idx==
1))<length
(find(idx==
2)));
SW_est
(find(idx==
I)) =
1;
% group clusters and calculate their means
XG = X
(find(SW_est==
0),:
);
XF = X
(find(SW_est==
1),:
);
muG =
mean(XG
);
muF =
mean(XF
);
sigmaG =
std(XG
);
sigmaF =
std(XF
);
mu =
[muG' muF'
];
covm
(:,:,
1) =
diag(sigmaG
);
covm
(:,:,
2) =
diag(sigmaF
);
Back to
top of page.
Results
Collection of real data for ID algorithm testing and development is a difficult task, taking into consideration both technical and ethical aspects of this process. During the course of the work on this project we have developed a new idea for the collection of real data in "simulated" ID environment. This technique relies on concepts developed at Carnegie Mellon University. The researchers at Carnegie Mellon used the Internet to engage the public in playing casual games (2-3 minutes in duration) which indirectly contributed towards automated identification of images. We have implemented a variation which uses the Internet to engage the public in playing a casual game (trivia about animals) which indirectly generates usage data with similar statistical characteristics as user authentication activities. More details about this should soon be available (link). We hope to be able to collect data using this approach and test our ID algorithm on this data in the near future.
Meanwhile, in the first phase of work on ID algorithm testing and development we have used a data simulation approach where parameters of two uni- or multi-variate Gaussian distributions have initially been selected and data randomly drawn from those two distributions - one representing genuine, and another fraudulent user attributes. Number of samples drawn from one or the other distribution could also be selected.
We show here some results obtained using two sets of simulated data drawn from 1 and 3 dimensional Gaussians. The attributes used to describe 3-D data samples are: (i) number of incorrectly answered ques¬tions during the authentication session, (ii) time duration of the authentication session and (iii) time spent on the system following the successful authentication. A single attribute – duration of the authentication session was used to describe 1-D data samples. Part of data set presented to our ID algorithm for 3-D Gaussian case is shown on Figure 1. Obviously, to investigate performance of algorithm on 1-D type data, only one attribute (in this case attribute 2) from the Figure 5 was used.

Figure 5. Simulated authentication data (a) ID data presented to EM algorithm, (b) "Labelled" ID data.
Table 1. Parameters for each attribute used in the example discussed further in this section.
To test the sensitivity of the algorithm to the size of training data set, algorithm was initially trained using data sets of size ntraining=50, 100, 150, 200 and 400 and then tested on the set of another ntesting=500 data samples. Total percentage of misclassified samples, i.e. false positives (FP) and false negatives (FN), is calculated for both training and testing phases and results averaged over 100 independent runs of the algorithm. Tests have been performed for 1-D and 3-D Gaussian mixtures. Percentages of FPs and FNs for both cases (1-D and 3-D) are shown below.
a)
b)

Figure 6. Results obtained for different sizes of training data set (a) 1-D Gaussian mixture, (b) 3-D Gaussian mixture.
The importance of data dimensionality - number of attributes used to describe each authentication attempt can be clearly seen from this set of results. Percentage of misclassified authentication attempts is significantly reduced when three attributes (3-D) are used compared to single attribute case (1-D). Another important factor is the size of training data set. Larger number of samples used in training data set results in reduced misclassification error, although increasing the size of data set above 100 does not improve the performance of the algorithm significantly, especially for 3-D Gaussian mixtures.
Figure 7 shows the performance of EM based IDS for different ratios of genuine and fraudulent data sessions in the training and test sets. This parameter does not have a significant influence on the system performance, although some improvement is evident when the percentage of fraudulent authentication attempts in mixtures is reduced.
a)
b)

Figure 7. Results obtained for different genuine-fraudulent attempts ratios in both sets (a) 1-D Gaussian mixture, (b) 3-D Gaussian mixture.
Presented results suggest that there is a potential value in applying the proposed ID algorithm in knowledge based type of authentication systems. It should also be kept in mind that, so far we have tested our algorithm using data drawn from two Gaussian data distributions. Next parameter to be investigated in this phase of work is the performance of the algorithm with respect to a degree of overlap between those Gaussian distributions. With more than one fraudulent user trying to gain the access to the system we might also have to deal with the unknown number of Gaussians generating our authentication data. A modification of the EM algorithm able to automatically detect this number should therefore be considered.
Final challenge of this project, as we have already mentioned at the beginning of this section, is to prove the Gaussian nature of ID data by collecting real data in an unbiased manner. Alternatively, our algorithm might have to be modified further to account for differently shaped user distributions.
PowerPoint presentation format
This information is also available to view as a
PowerPoint presentation.
Back to
top of page.