Performance Evaluation of Exponential Discriminant Analysis with Feature Selection for Steganalysis

The performance of supervised learning-based seganalysis depends on the choice of both classifier and features which represent the image. Features extracted from images may contain irrelevant and redundant features which makes them inefficient for machine learning. Relevant features not only decrease the processing time to train a classifier but also provide better generalisation. Linear discriminant classifier which is commonly used for classification may not be able to classify in better way non-linearly separable data. Recently, exponential discriminant analysis, a variant of linear discriminant analysis (LDA), is proposed which transforms the scatter matrices to a new space by distance diffusion mapping. This provides exponential discriminant analysis (EDA) much more discriminant power to classify non-linearly separable data and helps in improving classification accuracy in comparison to LDA. In this paper, the performance of EDA in conjunction with feature selection methods has been investigated. For feature selection, Kullback divergence, Chernoff distance measures and linear regression measures are used to determine relevant features from higher-order statistics of images. The performance is evaluated in terms classification error and computation time. Experimental results show that exponential discriminate analysis in conjunction with linear regression significantly performs better in terms of both classification error and compilation time of training classifier.

Steganography is the science of embedding hidden message in one of the cover multimedia, i.e. texts, images, audio, video files. The goal of steganalysis is to detect hidden information from observed data with little or no information about the steganography algorithm. Steganalysis has drawn attention of research community in last few years since detection of hidden messages can lead to the prevention of devastating security incidents. Steganalysis1-6 is a very challenging field because of the scarcity of information about the specific characteristics of the cover multimedia that can be exploited to hide information and detect the same. Detection of hidden message becomes more difficult as the approaches adopted for steganalysis also sometimes depend on the underlying steganography algorithm(s) used.

Several steganalysis approaches6-9 have been proposed which can broadly be classified into four categories: Supervised learning-based steganalysis10,11, blind identification-based steganalysis7, parametric statistical steganalysis9,12,13 and hybrid techniques7. Supervised learning-based steganalysis techniques involve two phases: (a) training phase and (b) testing phase. In the training phase, examples of both stego and non-stego are provided to a statistical classifier. The learning classifier determines the best classification rule using these examples. In the testing phase, unknown images are given as input to the trained classifier to decide whether image contains a secret message or not. Blind identification methods7 pose the steganalysis problem as a system identification problem. Some statistical properties such as independence of host and secret message etc. are exploited. The embedding algorithm is represented as a channel and the goal is to invert this channel to identify the hidden message. Parametric statistical steganalysis9,12,13 approaches tend to assume a certain parametric statistical model for the cover image, stego image and the hidden message. Steganalysis is formulated as a hypothesis testing problem, namely, H0: no message present (null hypothesis) and H1: message present (alternate hypothesis). A statistical detection algorithm is then designed to test between the two hypotheses. Hybrid techniques7 involve more than one of the above approaches.

Fridrich14, et al. propose a technique for estimating the unaltered histogram to find the number of changes and length of the secret message. The process involves cropping the JPEG image by four columns and then applying a quantization table to re-compress the image. The resulting discrete cosine transform (DCT) coefficient histogram will be a close estimate of the original. Avcibas10, et al. use a set of image quality metrics (IQMs) to build up a discriminator algorithm that differentiates cover images from stego images. IQMs are ranked based on their F-scores using analysis of variance (ANOVA) statistical test to identify the embedded message. The success of the approach lies in the recognition of IQMs that are susceptible to steganography and whose changes as a result of message embedding can be measured in better way. To increase the probability of a successful detection of hidden message, several IQMs are normally employed to measure the distortions at different levels of sensitivity. The mean square values for the human visual system (HVS)-weighted errors demonstrate more sensitivity to pure blur; while the gradient measure responds to changes in the texture and the image periphery10. The message embedding steganography algorithms differ in the changes brought to the different IQMs. Farid1,5, et al. proposed the use of higher order statistics in the generic steganalysis techniques vis-à-vis the first-order statistics (such as the histogram DCT coefficients) employed by the specific steganalysis techniques. Steganalysis techniques fail that employs the changes in the first-order statistics for detecting the presence of hidden messages if a steganography algorithm keeps the first-order statistics intact. Farid, et al. propose the use of quadratic mirror filters (QMF) to decompose an image into sub-bands and then evaluate higher-order statistics metrics such as the mean, variance, kurtosis and skewness to each of the sub-bands obtained. However, the resultant features may contain noisy, irrelevant or redundant features which make them inefficient for machine learning. In fact, the presence of irrelevant and redundant features may deteriorate the performance of the classifier and requires high computation time and other resources for training and testing the data. Another important component of supervised based Steganalysis is the choice of classifier which distinguishes two different types of images. Many classification techniques have been proposed by machine learning, data mining and pattern recognition community. Each one of them is associated with some advantages and disadvantages. Linear discriminant analysis (LDA) is one of the commonly used classifier which is simple to implement, understand and computationally less expensive. However, it is not able to classify well non-linearly separable data. Recently, exponential discriminant analysis (EDA)15 is proposed which is a variant of LDA. Exponential discriminant analysis is based on matrix approach that maps the data in such a way that the margin between different classes is much more than the margin between different classes in the original space. This provides EDA much more discriminant power to classify non-linearly separable data and helps in improving classification accuracy in comparison to LDA.

Performance of EDA in conjunction with feature selection methods were investigated in this paper. For feature selection, Kullback divergence, Chernoff distance measures, and linear regression measures are used to determine relevant features from higher-order statistics of images. The performance is evaluated in terms of classification accuracy and computation time.

Although the human eye is most often not able to detect the presence of embedded message, but it may nevertheless changes the statistics of an image. The resultant distortions cause due to embedding in the cover image can be analysed by comparing the statistical properties of both cover and stegoimages8,13. Several techniques are available to detect such changes based on first order statistical distributions of intensity or transform coefficients13,16.

However, the shortcoming of this investigation is that simple counter-measures that match first order statistics are likely to spoil detection of embedding message in cover image. The research works1,5 has pointed out that steganalysis-based on higher-order statistical models may distinguish stegoimages and cover images. It has been observed across a large number of natural images that there exist strong higher-order statistical regularities within a wavelet-like decomposition. The embedding of a message may significantly change the statistics of image and thus becomes measures to detect stegoimages.

The decomposition of image is possible by using separable QMFs. Thus, the frequency space is divided into multiple scales and orientations. This can be accomplished by applying separable lowpass and highpass filters along the image axes generating a vertical, horizontal, diagonal, and lowpass subband. The diagonal, horizontal, and vertical subband at scale i = 1, 2, ..., n are represented as and ${\mathit{V}}_{\mathit{i}}\left(\mathit{x}\mathrm{,}\mathrm{}\mathit{y}\right)$ respectively. Subsequent scales are obtained by recursively filtering the lowpass subband. The research works1,5 pointed out that using above decomposition the statistical model containing the mean, variance, skewness, and kurtosis of the subband coefficients for each orientation and scales can be obtained for i = 1 to n. This characterises the basic coefficient distributions statistically. The second set of statistics is based on the errors in an optimal linear predictor of coefficient magnitude. It is pointed out1,5 that the subband coefficients of the image are correlated to their spatial, orientation and scale neighbours. Taking this into account, ${\mathit{V}}_{\mathit{i}}\left(\mathit{x}\mathrm{,}\mathrm{}\mathit{y}\right)\mathrm{,}$ a vertical band at scale i , can be represented in terms of neighbouring pixels in spatial domain as:

${\mathit{V}}_{\mathit{i}}\left(\mathit{x}\mathrm{,}\mathrm{}\mathit{y}\right)\mathrm{}={\mathit{w}}_{\mathrm{1}}{\mathit{V}}_{\mathit{i}}\left(\mathit{x}\mathrm{}-\mathrm{1}\mathrm{,}\mathrm{}\mathit{y}\right)\mathrm{}+{\mathit{w}}_{\mathrm{2}}{\mathit{V}}_{\mathit{i}}\left(\mathit{x}+\mathrm{}\mathrm{1}\mathrm{,}\mathrm{}\mathit{y}\right)\mathrm{}+{\mathit{w}}_{\mathrm{3}}{\mathit{V}}_{\mathit{i}}\left(\mathit{x}\mathrm{,}\mathrm{}\mathit{y}\mathrm{}-\mathrm{1}\right)\mathrm{}+{\mathit{w}}_{\mathrm{4}}{\mathit{V}}_{\mathit{i}}\left(\mathit{x}\mathrm{,}\mathrm{}\mathit{y}+\mathrm{}\mathrm{1}\right)$ $+{\mathit{w}}_{\mathrm{5}}\mathit{V}_{\mathit{i}}{}_{\mathrm{+}\mathrm{1}}\left(\mathit{x}/\mathrm{2}\mathrm{,}\mathrm{}\mathit{y}/\mathrm{2}\right)\mathrm{}+{\mathit{w}}_{\mathrm{6}}{\mathit{D}}_{\mathit{i}}\left(\mathit{x}\mathrm{,}\mathrm{}\mathit{y}\right)\mathrm{}+{\mathit{w}}_{\mathrm{7}}\mathit{D}_{\mathit{i}}{}_{\mathrm{+}\mathrm{1}}\left(\mathit{x}/\mathrm{2}\mathrm{,}\mathrm{}\mathit{y}/\mathrm{2}\right)$      (1)

where wk denotes scalar weighting values. In more compact form, it can be expressed as:

V = Qw      (2)

where $\mathit{w}=\mathrm{}{\left({\mathit{w}}_{\mathrm{1}}\mathrm{,}\mathrm{}{\mathit{w}}_{\mathrm{2}}\mathrm{,}\mathrm{}\mathrm{.........,}{\mathit{w}}_{\mathrm{7}}\right)}^{\mathit{t}}$ the vector V contains the coefficient magnitudes of Vi(x, y) strung into a column vector and the columns of the matrix Q contain the neighbouring coefficient magnitudes as specified in (1) also strung out in column vectors.

To determine coefficients magnitudes, quadratic error function is defined1,5 as

(3)

This error function E(w) can be minimised by differentiating Eqn (3) and substituting it equal to zero yields:

$\mathit{w}=\mathrm{}\left({\mathit{Q}}^{\mathit{T}}\mathit{Q}\right)^{\mathrm{-}}{}^{\mathrm{1}}{\mathit{Q}}^{\mathit{T}}\mathit{V}$      (4)

From the above, order statistics such as the mean, variance, skewness, and kurtosis can be evaluated. Similarly, the above procedure can be repeated to get the subbands Hi(x, y), Di(x, y). Since, there are four statistics and linear predictor for three subband are computed for (n −1) levels, we have total 12(n−1) error statistics and 12(n−1) coefficient statistics. Thus, these 24(n−1) statistics altogether will form a feature vector of the image.

Feature selection is used to remove noisy, irrelevant, and redundant features. There are two major approaches to feature selection: filter and wrapper approach11,17,18. Most filter methods employ statistical characteristics of data for feature selection which requires less computation time. It independently measures the importance of features without involving any classifier. Since, the filter approach does not take into account the learning bias introduced by the final learning algorithm, it may not be able to select the most relevant set of features for the learning algorithm. On the other hand, wrapper methods tend to find features better suited to the predetermined learning algorithm resulting in better performance. But, it tends to be computationally more expensive since the classifier must be trained for each candidate subset.

Feature ranking approaches have been widely and commonly investigated for feature selection19-21. However, disadvantage associated with feature ranking methods is that they ignore the correlation present among the features because of their univariate approach. Hence the selected features subset may possess low discriminatory capacity and increased redundancy. A forward/backward feature selection method or its combinations are used to remove redundancy and select relevant and non redundant features with a suitable measure. Among the most widely used filter methods4 for feature selection, there are techniques based on statistical separability measures which allow one to select a suitable subset of features by assigning the degree of interclass separability associated with each subset considered. In particular, Kullback divergence, Chernoff distance measures, and linear regression are commonly employed by research community20,22,23. To obtain a quantitative measure of how separable are two classes, a distance measure can be easily extracted from some parameters of the data. A very important aspect of probabilistic distance measures is that a number of these criteria can be analytically simplified in the case when the class conditional p.d.f.s p(Xi | Ci) follows multivariate normal distribution. For multivariate normal distribution for two classes, KD and CD measures are given as follows20,23:

$\begin{array}{c}{\mathit{J}}_{\mathit{d}}=\frac{\mathrm{1}}{\mathrm{2}}\mathrm{\left(}{\mathrm{\mu }}_{\mathrm{2}}-{\mathrm{\mu }}_{\mathrm{1}}{\mathrm{\right)}}^{\mathit{t}}\left({\left({\Sigma }_{\mathrm{1}}\right)}^{\mathrm{-}\mathrm{1}}+{\left({\Sigma }_{\mathrm{2}}\right)}^{\mathrm{-}\mathrm{1}}\right)\phantom{\rule{0.33em}{0ex}}\mathrm{\left(}{\mathrm{\mu }}_{\mathrm{2}}-{\mathrm{\mu }}_{\mathrm{1}}\mathrm{\right)}\hfill \\ +\frac{\mathrm{1}}{\mathrm{2}}\mathit{tr}\left({\left({\Sigma }_{\mathrm{1}}\right)}^{\mathrm{-}\mathrm{1}}{\Sigma }_{\mathrm{2}}+{\left({\Sigma }_{\mathrm{2}}\right)}^{\mathrm{-}\mathrm{1}}{\Sigma }_{\mathrm{1}}\right)\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\hfill \end{array}$      (5)

$\begin{array}{c}{\mathit{J}}_{\mathit{c}}=\frac{\mathrm{1}}{\mathrm{2}}\mathrm{\beta }\mathrm{\left(}\mathrm{1}-\mathrm{\beta }\mathrm{\right)\left(}{\mathrm{\mu }}_{\mathrm{2}}-{\mathrm{\mu }}_{\mathrm{1}}{\mathrm{\right)}}^{\mathit{t}}{\left[\mathrm{\left(}\mathrm{1}-\mathrm{\beta }\mathrm{\right)}{\Sigma }_{\mathrm{2}}+\mathrm{\beta }\phantom{\rule{0.33em}{0ex}}{\Sigma }_{\mathrm{1}}\right]}^{\mathrm{-}\mathrm{1}}\hfill \\ \mathrm{\left(}{\mathrm{\mu }}_{\mathrm{2}}-{\mathrm{\mu }}_{\mathrm{1}}\mathrm{\right)}+\frac{\mathrm{1}}{\mathrm{2}}\mathrm{log}\frac{\mathrm{|\left(}\mathrm{1}-\mathrm{\beta }\mathrm{\right)}{\Sigma }_{\mathrm{1}}+\mathrm{\beta }{\Sigma }_{\mathrm{2}}\mathrm{|}}{\mathrm{|}{\Sigma }_{\mathrm{1}}{\mathrm{|}}^{\mathrm{1}\mathrm{-}\mathrm{\beta }}\mathrm{|}{\Sigma }_{\mathrm{2}}{\mathrm{|}}^{\mathrm{\beta }}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\phantom{\rule{0.33em}{0ex}}\hfill \end{array}$     (6)

where ${\mathrm{\mu }}_{\mathit{k}}^{\mathit{i}}$ is a mean vector and ${\sum }_{\mathit{k}}^{\mathit{i}}\phantom{\rule{0ex}{0ex}}$ is a covariance matrix of k-dimensional data for class Ci, i = 1, 2 .

${\mathit{C}}_{\mathit{i}}\mathrm{,}\mathit{i}=\mathrm{}\mathrm{1}\mathrm{,}\mathrm{2}\mathrm{}\mathrm{.}$

The regression analysis considers the relations between the selected features which minimise redundancy. While using regression analysis for data a multiple regression model is considered because there can be many features which could affect the presence or absence of stegoimage. A multiple regression model with a target variable y and multiple variables X is given by22:

$\mathit{y}={\mathrm{\beta }}_{\mathrm{0}}+{\mathrm{\beta }}_{\mathrm{1}}{\mathit{X}}_{\mathrm{1}\mathit{i}}+{\mathrm{\beta }}_{\mathrm{2}}{\mathit{X}}_{\mathrm{2}\mathit{i}}+\mathrm{·}\mathrm{· ·}+{\mathrm{\beta }}_{\mathit{n}}{\mathit{X}}_{\mathit{ni}}+{\mathrm{\xi }}_{\mathit{i}}\mathrm{,}$

$\mathit{i}=\mathrm{}\mathrm{1}\mathrm{,}\mathrm{2}\mathrm{,}\mathrm{}\mathrm{...,}\mathrm{}\mathit{n}$      (7)

where ${\mathrm{\beta }}_{\mathrm{0}}\mathrm{,}\mathrm{}{\mathrm{\beta }}_{\mathrm{1}}\mathrm{,}\mathrm{}\mathrm{....,}\mathrm{}{\mathrm{\beta }}_{\mathit{n}}$ are constants estimated by observed values of X and class label y and is estimated by normal distribution having mean zero and a variance σ2.

The sum of squares error (SSE) which is sum of the squared residuals is given by

SSE=

$\underset{\mathit{i}=\mathrm{0}}{\overset{\mathit{n}}{\sum }}\mathrm{\left(}{\mathit{y}}_{\mathit{i}}-{\mathit{y}}_{\mathit{i}}^{\mathit{p}}\mathrm{\right)}$     (8)

where y and yp are observed and predicted values, respectively.

A large value of SSE signifies that the regression is predicted poorly. The total sum of squares is given by

SSTO= $\underset{\mathit{i}\mathrm{=}\mathrm{0}}{\overset{\mathit{n}}{\sum }}\mathrm{\left(}{\mathit{y}}_{\mathit{i}}-\stackrel{\phantom{\rule{0ex}{0ex}}}{\stackrel{¯}{\mathit{y}}}\mathrm{\right)}$     (9)

where $\overline{y}$ is the average of yi. In a regression model, the choice of features which best explains the class label depends on the value of R2 and is given by

R2 = 1 $\frac{\mathit{SSE}}{\mathit{SSTO}}$     (10)

Linear discriminant analysis based on statistical method is most commonly and widely used for feature extraction and pattern classification. The objective of LDA is to determine an optimal projection matrix W by maximising the ratio of between-class scatter Sb to within-class scatter Sw as follows:

$\mathrm{arg}\begin{array}{c}\mathrm{max}\\ \mathit{W}\end{array}\left|\frac{{\mathit{W}}^{\mathit{T}}{\mathit{S}}_{\mathit{b}}\mathit{W}}{{\mathit{W}}^{\mathit{T}}{\mathit{S}}_{\mathit{W}}\mathit{W}}\right|$       (11)

where ${\mathit{S}}_{\mathit{b}}=\left({\mathrm{\mu }}_{\mathit{k}}^{\mathrm{2}}-{\mathrm{\mu }}_{\mathit{k}}^{\mathrm{1}}\right){\left({\mathrm{\mu }}_{\mathit{k}}^{\mathrm{2}}-{\mathrm{\mu }}_{\mathit{k}}^{\mathrm{1}}\right)}^{\mathit{k}}$

${\mathit{S}}_{\mathit{w}}=\underset{\mathit{i}\mathrm{=}\mathrm{1}}{\overset{\mathit{c}}{\sum }}{\sum }_{\mathit{k}}^{\mathit{i}}\phantom{\rule{0ex}{0ex}}$

It is shown that a vector W that maximises J(.) must satisfy

${\mathit{S}}_{\mathit{b}}\mathit{W}=\mathrm{\lambda }{\mathit{S}}_{\mathit{w}}\mathit{W}$       (12)

for some constant λ, which is a generalised eigenvalue problem. If Sw is non singular, we can obtain a conventional eigenvalue problem by rewriting Eqn. (12) as:

${\mathit{S}}_{\mathit{w}}^{\mathrm{-}\mathrm{1}}{\mathit{S}}_{\mathit{b}}\mathit{W}=\mathrm{\lambda }\mathit{W}$       (13)

The classical LDA criterion can be rewritten as:

$\mathit{J}\mathrm{\left(}\mathit{W}\mathrm{\right)}=\mathrm{arg}\begin{array}{c}\mathrm{max}\\ \mathit{W}\end{array}\left|\frac{{\mathit{W}}^{\mathit{T}}\mathrm{\left(}{\Phi }_{\mathit{b}}^{\mathit{T}}{\Lambda }_{\mathit{b}}\mathrm{\right)}\mathit{W}}{{\mathit{W}}^{\mathit{T}}\mathrm{\left(}{\Phi }_{\mathit{W}}^{\mathit{T}}{\Lambda }_{\mathit{W}}\mathrm{\right)}\mathit{W}}\right|$       (14)

where ${\Phi }_{\mathit{b}}=\left({\mathrm{\phi }}_{\mathit{b}\mathrm{1}}\mathrm{,}\mathrm{}{\mathrm{\phi }}_{\mathit{b}\mathrm{2}}\mathrm{,}\mathrm{}\mathrm{,}\mathrm{}{\mathrm{\phi }}_{\mathit{bn}}\right)$ and are the eigenvector matrix of between class scatter S b and with-in class scatter S w that corresponds to eigenvalues and ${\Lambda }_{\mathit{w}}=\mathit{diag}\left({\mathrm{\lambda }}_{\mathit{w}\mathrm{1}}\mathrm{,}\mathrm{}{\mathrm{\lambda }}_{\mathit{w}\mathrm{2}}\mathrm{,}\mathrm{}\mathrm{,}\mathrm{}{\mathrm{\lambda }}_{\mathit{wn}}\right)$ respectively.

Generally, the matrix S w is not a full rank matrix in the under-sampled case. However, the discriminant information corresponding to the zero eigenvalues of Sw have the most discriminative power. But, LDA can extracts discriminant information only from the principal subspace of Sw. Zhang15, et al. have proposed another approach based on the matrix exponential and is known as exponential discriminant analysis (EDA)15. To extract this kind of discriminative information, they had replaced the eigenvalues of ${\mathit{S}}_{\mathit{w}}\mathit{i}\mathrm{.}{\mathit{el}}_{\mathit{wi}}\mathrm{,}\mathrm{}\mathit{by}\mathrm{}\mathit{exp}\mathrm{\left(}{\mathit{l}}_{\mathit{wi}}\mathrm{\right)}$ and λbi i.e, the eigenvalue of Sb, by exp(λbi) and represent

Then the LDA criterion is modified as below:

$\mathit{J}\mathrm{\left(}\mathit{W}\mathrm{\right)}=\mathrm{arg}\begin{array}{c}\mathrm{max}\\ \mathit{W}\end{array}\left|\frac{{\mathit{W}}^{\mathit{T}}\mathrm{\left(}{\Phi }_{\mathit{b}}^{\phantom{\rule{0ex}{0ex}}}\mathrm{e}\mathrm{x}\mathrm{p}\mathrm{\left(}{\Lambda }_{\mathit{b}}\mathrm{\right)}\Phi \mathrm{\right)}\mathit{W}}{{\mathit{W}}^{\mathit{T}}\mathrm{\left(}{\Phi }_{\mathit{w}}^{\phantom{\rule{0ex}{0ex}}}\mathrm{e}\mathrm{x}\mathrm{p}\mathrm{\left(}{\Lambda }_{\mathit{w}}\mathrm{\right)}\Phi \mathrm{\right)}\mathit{W}}\right|$

$=\mathrm{arg}\begin{array}{c}\mathrm{max}\\ \mathit{W}\end{array}\left|\frac{{\mathit{W}}^{\mathit{T}}\mathrm{e}\mathrm{x}\mathrm{p}\mathrm{\left(}{\mathit{S}}_{\mathit{b}}\mathrm{\right)}\mathit{W}}{{\mathit{W}}^{\mathit{T}}\mathrm{e}\mathrm{x}\mathrm{p}\mathrm{\left(}{\mathit{S}}_{\mathit{w}}\mathrm{\right)}\mathit{W}}\right|$      (16)

The matrix exp (Sw) is always a full-rank matrix; and hence the discriminative information which was contained in the null space of Sw can be extracted using Eqn (16), even though the small sample-size problem is involved.

As it is known that orthogonality is of utmost importance to discriminant analysis, because redundant features can be combined back to the same number of variables through orthogonal transformation of the measurement space. The benefit of employing orthogonal transformation is that the correlations among candidate features are decomposed so that the significance of individual features can independently be evaluated. Hence they have defined the EDA criterion by enforcing the projection matrix W in Eqn (16) to be orthogonal as follows:

In EDA, similar to the kernel method, there exists a non-linear mapping function Φ such that the scatter matrices are mapped in to a new space, i.e.,

$\mathit{J}\mathrm{\left(}\mathit{W}\mathrm{\right)}=\begin{array}{c}\mathrm{arg}\\ {\mathit{W}}^{\mathit{T}}\end{array}\begin{array}{c}\mathrm{max}\\ \mathit{W}=\mathrm{1}\end{array}\left|\frac{{\mathit{W}}^{\mathit{T}}\mathrm{e}\mathrm{x}\mathrm{p}\mathrm{\left(}{\mathit{S}}_{\mathit{b}}\mathrm{\right)}\mathit{W}}{{\mathit{W}}^{\mathit{T}}\mathrm{e}\mathrm{x}\mathrm{p}\mathrm{\left(}{\mathit{S}}_{\mathit{w}}\mathrm{\right)}\mathit{W}}\right|$      (17)

$\Phi \mathrm{}\mathrm{:}{\mathit{R}}^{\mathit{n}}\to \mathit{F}$      (18)

Such that

${\mathit{S}}_{\mathit{b}}\to \mathrm{}\Phi \left({\mathit{S}}_{\mathit{b}}\right)=\mathit{exp}\left({\mathit{S}}_{\mathit{b}}\right)$

${\mathit{S}}_{\mathit{w}}\to \mathrm{}\Phi \left({\mathit{S}}_{\mathit{w}}\right)=\mathit{exp}\left({\mathit{S}}_{\mathit{w}}\right)$

It is shown15 that with the help of the mapping Φ, there is a difference in diffusion scale between the within- and between-class distances, and the diffusion scale to the between–class distance is larger than the within-class distance. Hence, the margin between different classes under such mapping is much more than the margin between different classes in the original space. This provides EDA more discriminant power which helps in improving classification accuracy in comparison to LDA.

To investigate the efficacy of the EDA for classification of stegoimages and non-stego images, we prepared a database of 1500 natural images taken from different sources i.e., www.1000pictures.com, www.1000wallpapers.co. All the images were originally in JPEG format. The original image resolutions were ranging from 800 × 600 to 1600 × 1200. We first resized each one of these images to 640 × 480 pixels images and embedded message images of six different resolutions 256 × 256, 128 × 128, 64 × 64, 32 × 32, 16 × 16, and 8 × 8 into cover image using OUTGUESS13.

1000 non-stego images and 1000 stego images were created. Features are extracted from each one of the grey images using Haar wavelet. Each image is represented in terms of 72 statistics from four levels wavelet decomposition. To remove redundancy from the selected pool of features, three feature selection measures are investigated23: Kullback divergence measure, Chernoff distance measure and linear regression. For Chernoff distance measure, features are selected using 3 different values ranging from 0.1 to 0.9 with an increment of 0.4. We have used naive bayes classifier (Naivebc), Fisher linear classifier (Fisherc), and exponential discriminant classifier (EDA to evaluate the performance of the feature selection methods. The average classification error is computed using ten cross-validations. All the simulations are done using matlab. Tables 1, 2 and 3 show the minimum classification error achieved with different classifiers along with the number of features for different measures. For chernoff distance measure, the minimum classification error achieved for optimal value of β is also shown in Tables 1,2 and 3. The best results in each category are indicated in bold.

Table 1.Comparative results of classification error and minimum number of features for 8 × 8, 16 × 16

Table 2.Comparative results of classification error and minimum number of features for sizes 32 × 32 and 64 × 64

Table 3.Comparative results of classification error and minimum number of features for size 128 × 128 and 256 × 256

We observe the following from Tables 1, 2 and 3:

(a) The performance of exponential discriminant analysis is significantly better in comparison to naive bayesian and fisher discriminant analysis in terms of classification error for all sizes of embedding used in experiments.

(b) The minimum classification error is achieved with linear regression for all classifiers and for different size of embedding.

(c) The number of features required to obtain minimum classification error is significantly smaller using linear regression in comparison to baseline, kullback divergence measure and chernoff distance measure using all classifiers and different size of embedding.

(d) The number of features required to obtain minimum classification error is significantly smaller using linear regression in comparison to other feature selection methods. Hence, the computation time by all the learning methods will be significantly less with features selected using linear regression method.

(e) For each feature selection method, the number of features which achieves minimum classification error with EDA is less than Fisher discriminant analysis and more than naive Bayesian classifier. However, the classification error is significantly less with EDA in comparison to both Fisher discriminant analysis and naive Bayesian classifier.

Figure 1 shows the variation in computation time of different classifiers using different feature measures to achieve minimum classification accuracy for embedding size 8 x 8. The computation time required by each classifier for different measures without feature selection (baseline) is also shown in Fig 1. It can be observed from Fig 1 that the computation

time required will be significantly less with linear regression in comparison to baseline, and less in comparison to kullback divergence measure and chernoff distance measure for a given classifier. Similar observations are also made for other embedding size.

The performance of supervised learning-based steganalysis depends on the choice of features to represent the image and classification method used to distinguish stego from non stego image. Features based on higher order statistics are extracted from stego and non-stego images using wavelet decompositions. Feature extracted from image may contain irrelevant and redundant features which makes them inefficient for machine learning. LDA is commonly and widely used for classification. However, LDA is not able to classify well non-linearly seaparable data. Recently, Exponential discriminant analysis is proposed which maps the data in such a way that the margin between different classes is much more than the margin between different classes in the original space. This provides EDA much more discriminant power to classify non-linearly separable data and helps in improving classification accuracy in comparison to classical LDA.

Performance of exponential discriminant analysis in conjunction with different feature selection methods were investigated in this paper. For feature selection, Kullback Divergence, Chernoff distance measures and linear regression measures are used to determine relevant features from higher-order statistics of images. The performance of steganalysis is compared and evaluated in terms of classification error and computation time of training classifier. Experimental results show that the performance of exponential discriminant analysis is significantly better in comparison to naive bayesian and fisher discriminant analysis in terms of classification error for all sizes of embedding used in experiments. The minimum classification error is achieved with linear regression for all classifiers and for different size of embedding. Also, the number of features required to obtain minimum classification error is significantly smaller using linear regression in comparison to baseline, kullback divergence measure and chernoff distance measure using all classifiers and different size of embedding. Hence, exponential discriminant analysis in conjunction with linear regression outperforms other combination in terms of both classification error and computation time.

1. Farid, H. Detecting hidden messages using higher-order statistical models. In Proceedings of the IEEE International Conference on Image Processing, New York 2002, 2, pp. 905-08.

2. Johanson, N. & Jajodia, S. Exploring steganography: Seeing the unseen. IEEE Computer, 1998, 26-34.

3. Kahn, D. The history of steganography. In Anderson, R. (ed.) IH 1996. LNCS, 1174. Springer, Heidelberg, 1996.

4. Lie, W.N. & Lin, G.S. A Feature-based classification technique for blind image steganalysis. Proc. IEEE Trans. Multimedia, 2005, 7(6).

5. Lyu, S. & Farid, H. Steganalysis using higher-order image statistics. IEEE Trans. Inf. Forensics Security, 2006, 1(1), 111-19.

6. Mehrabi, M.A.; Faez, K. & Bayesteh, A.R. Image steganalysis on statistical moments of wavelet subband histograms in different frequencies and support vector machine. In Proceedings of the 3rd International Conference on Natural Computation, August 2007. pp. 24-27.

7. Chandramouli, R. A mathematical framework for active steganalysis. ACM Multimedia Systems, 2003, 9(3), 303-11.

8. Pevn´y, T. & Fridrich, J. Determining stego algorithm for JPEG images. IEEE Proc. Iinf. Security, 2006, 153(3).

9. Westfeld, A. & Pfitzmann, A. Attacks on steganographic systems. In: Pfitzmann, A. (ed.) IH 1999. LNCS, 1768, Springer, Heidelberg, 2000. pp. 61–75.

10. Avcibas, I.; Memon, N. & Sankur, B. Steganalysis using image quality metrics. IEEE Trans. Image Proc., 2003, 12(2), 221-29.

11. Guyon, I. & Elisseeff, A. An Introduction to Variable and feature Selection. J. Machine Learning Res., 2003, 3, 1157-182.

12. Harmsen, J. & Pearlman, W. Steganalysis of additive noise modelable information hiding. In Proceedings of SPIE Electronic Imaging, 2003.

13. Provos, N. & Honeyman, P. Detecting steganographic content on the internet. ISOC NDSS’02, San Diego, CA, February, 2002. August 2001, CITI Techreport.

14. Fridrich, J.; Du, R. & Long, M. Steganalysis of lsb encoding in color images, IEEE ICME, 2000.

15. Zhang, T; Fang, B.; Tang, Y.; Shang, Z. & Xu, B. Generalised discriminant analysis: A matrix exponential approach. IEEE Trans. Systems, Man, Cybernetics- Pt B, 2010, 40(1), 186-97.

16. Wang, Y. Optimised Feature Extraction for Learning-Based Image Steganalysis. IEEE Trans. Infor. Forensics Security, 2007, 1.

17. Kohavi, R. & John, G. Wrapper for feature subset selection. Artificial Intelligence, 1997, 1-2, 273-24.

18. Ruiz, R.; Riquelmea, José C. & Aguilar-Ruizb, Jesús S. Incremental wrapper based gene selection from microarray data for cancer classification. Pattern Recognition, 2006, 39(12), 2383-392.

19. Guyon, I.; Weston, Jason; Barnhill, Stephen & Vapnik, Vladimir. Gene selection for cancer classification using support vector machine. Machine Learning, 2003, 46, 263-68.

20. Pierre, A.D. & Kittler, J. Pattern recognition: A statistical approach. PHI, 1982.

21. Tibsrani, R.; Hastie, Trevor; Narasimhan, B. & Chu, Gilbert. Diagnosis of multiple cancer types by shrunken centriods of gene expression. Proc. Natl. Acad. Sci USA., 2002, 99, 6567-572.

22. Han-Saem, P.; Yoo, Si-Ho & Cho, Sung-Bae Forward selection Method with regression analysis for optimal gene selection in cancer classification. Int. J. Comp. Math., 2007, 84(5), 653-68.

23. Rajput, G.K. & Agrawal R. K. Evaluation of feature selection measures for steganalysis. PReMI 2009, LNCS 5909 Springer, 2009. pp. 432-439.

 Mr Gaurav Rajput received MSc (Math) from Aligarh Muslim University and MTech from Jawaharlal Nehru University, New Delhi, Currently he is pursuing PhD from School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi. Her current area of research are Steganography and Steganalysis. Dr R.K. Agrawal obtained MTech (Computer Application) from Indian Institute of Technology Delhi, New Delhi and PhD (Computational Physics) from University of Delhi, Delhi. Presently, he is working as an Associate Professor at the School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi. His current research areas are: Classification, feature extraction and selection for pattern recognition problems in domains of image processing, security, and bioinformatics. Ms Namita Aggarwal received BSc (Math) from University of Delhi, Delhi and MCA from Delhi University, Delhi. Currently she is pursuing PhD from School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi. Her current area of research is pattern recognition.