Attribute selection



Setup

The process of authentication question selection utilises the statistics of the available data set in order to choose the most appropriate set of user attributes to be used during the authentication attempt. To be able to do this most effectively we use Metropolis algorithm. This algorithm is one of the methods used to perform importance sampling on a given data set. It performs a random walk through the data space of interest in a way that the frequency of points visited during that walk corresponds to some required probability distribution. In the context of our problem, the Metropolis algorithm is used to generate a sequence of authentication challenge questions that approximately follows the inverse of a probability mass function (pmf) defined by the values of attributes in the database of the authentication system. Let's say we have a database with N users, U(1), U(2), … U(N), which can request an authentication from our system. This database is symbolically shown in Table 1. Each user in this database is described with a set of M+1 attributes, A0, A1, … AM. A value of the m-th attribute for the n-th user in the database is denoted as am(n). The attribute A0 is the only attribute not used in the question generation process. This is a unique identification key for each user and is used as a username during the authentication procedure.

Table 1. Initial table of user attributes
Initial table of user attributes

The authentication system first generates a metadata table depending on the number of attributes/questions to be asked during the authentication. An example of the generated one-attribute metadata table is shown below.

Table 2. One-attribute metadata table
One-attribute metadata table

Here, the value cm(n) describes the total number of occurrences of a particular attribute value am(n) for the m-th attribute of the n-th user in the original database. The value cm(n) can therefore range from 1 to N where value of 1 would denotes a unique value of the m-th attribute for the n-th user, while in the other extreme, the value of N for cm(n) would mean that all users from the original database have the same value for the attribute Am. The aim of the question selection part of the system is to select the attributes from the database in inverse proportion to the corresponding cm(n) values. Lower cm(n) value for a particular attribute would imply high frequency of questions based on that attribute and vice versa, higher cm(n) value would yield low frequency of questions based on the related attribute. This process also needs to be performed in some random fashion, so the the Metropolis algorithm is well suited to accomplish this task.

Application of the Metropolis algorithm

Each row from the generated metadata table in Table 2 is normalised (i.e. divided by N) to represent a pmf for a particular user across the whole group of users from the dataset. This pmf is then “inverted” and the Metropolis algorithm is used to sample inverted pmf in order to select the most appropriate attribute for the next question in a sequence for a particular user.

Using Metropolis algorithm to sample users metadata
Figure 1. Using Metropolis algorithm to sample users metadata

A sample of inverted pmf for one particular row of the metadata table is shown below in Figure 2. The number of attributes, M, used in this example is 40.

Inverted pmf
Figure 2. Inverted pmf corresponding to an n-th row from the metadata table

The Metropolis algorithm starts by selecting a random attribute for the first authentication session of the particular user from the database. At the next authentication session of the same user a random trial move from the current attribute to some other attribute is performed. This trial move is then either accepted or rejected according to a simple probabilistic rule. If the move is accepted then the new attribute is used to generate an authentication question but if the move is rejected a new trial move needs to be randomly generated and the process of evaluation of this move repeated again. In this way, it should be possible to explore the whole attribute space in the database. The Metropolis algorithm therefore provides a prescription for choosing which move in attribute space to accept or reject.

Flow chart of Metropolis based algorithm
Figure 3. Flow chart of Metropolis based algorithm used to choose the set of user attributes S

The record of large number of steps of this random walk should follow the specified inverted pmf and in that case the walk is said to have reached equilibrium. Figure 4 shows the record of two random walks according to the inverted pmf given in Figure 2 for different number of steps, i.e. the authentication attempts, ns, through the attribute space. First figure is generated after 100 steps while the second one is the record of Metropolis walk through the attribute space after 1000 steps. While the first record still only wagely resembles the shape of pmf from Figure 2, the second plot achieves much better accuracy. After even more authentication sessions statistics of asked questions, the shape is going to be very close to the original pmf for that particular user.

Record of Metropolis algorithm after ns=100 steps
Record of Metropolis algorithm after ns=1000 steps
Figure 4. Record of Metropolis algorithm on the pmf given in Figure 2 after (a) ns=100 steps and (b) ns=1000 steps

Program to implement modified Metropolis algorithm


clc, clear all, close all
% first generate some distribution p for Ns question sets
Ns = 40;                    % total number of sets
triples = Ns*rand(Ns,1);    % number of occurences for each set
po = triples/Ns;            % original pmf
pi = (Ns-triples)/Ns;       % "inverted" pmf

% plot original distribution
figure();
subplot(3,1,1);
bar(1:Ns,po,1,'r');
axis([0 Ns+1 0 1]);
title('Original PMF');
subplot(3,1,2);
bar(1:Ns,pi,1,'g');
axis([0 Ns+1 0 1]);
title('Inverted PMF');

% The Metropolis algorithm
% use described method to generate Nconv sets of questions
% that will fit into the given distribution
Nconv = 5000;
set = [];
% generate number of the first set randomly
set0 = ceil(Ns*rand)        
% need to have an integer in range [1,Ns]

% or try with highest number in set distribution
% set0 = ceil(Ns*max(p));   % generate first set number

accepted = 0;
while (accepted < Nconv)
    % choose next set
    setN = ceil(Ns*rand);
    % check is it going to be accepted
    ratio = (pi(setN))/(pi(set0));
    if (ratio >= 1),
        set = [set setN];
        set0 = setN;
        accepted = accepted+1;
    else
        x = rand;
        if (ratio > x),
            set = [set setN];
            set0 = setN;
            accepted = accepted+1;
        end
    end
end

% Plot final distribution of asked question sets
% this histogram should look like given distribution pi
subplot(3,1,3); hist(double(set),Ns);
axis([0 Ns+1 0 max(hist(double(set),Ns))]);
title('Estimated Inverted PMF')
xlabel('set no')               


Final distribution of asked question sets
Figure 5. Estimated Inverted PMF

PowerPoint presentation format

This information is also available to view as a PowerPoint presentation.


Back to top of page.