STEPWISE LINEAR DISCRIMINANT ANALYSIS:
A STATISTICAL TECHNIQUE WAITING
FOR AN MIS APPLICATION ?
Essay on Research Techniques Presented To The Department of Accounting, University of Cape Town
in partial fulfilment of the requirements of the Bachelor of Commerce Honours Degree in Information Systems
by Jean-Paul Van Belle, 22 April 1991.
1. INTRODUCTION
1.1 Definition
Discriminant analysis can best be defined as a technique which allows the classification of an individual into one of two or more distinctive populations, on the basis of a set of measurements [Afifi, 1984, p.246]. Although this technique can be used solely for descriptive purposes, it is more usefully employed as a predictive measure.
Stepwise discriminant analysis is concerned with selecting the most important variables whilst retaining the highest discrimination power possible.
The discrimination analysis will compute coefficients, the discriminant (function) coefficients, for the discrimination or discriminant function which usually takes the form of a linear (or quadratic) combination of the original measurements. Calculating the value of this function for a given member or instance of the population will yield a particular value of the discriminant variable, which can be compared to a single (or more if there are more than 2 populations) cut-off value(s).
1.2 Historical Development of Discriminant Analysis
Fisher initiated the intuitive development of the initial theory of linear discriminant functions. This development was largely based on the methodology employed to derive multiple linear regression, i.e. simple matrix algebra to solve the simultaneous linear equations. This approach is still intuitively the easiest to understand and features prominently in the introductory textbooks, especially those written for non-statisticians.
A second stage reflected the swing to the probabilistic approach in statistics in general. This is evidenced in the literature by the generous use of probability functions, advanced matrix algebra and theoretical calculus. This generally confirmed the formulae used to calculate the discrimination function but differed substantially in the interpretation of statistical reliability and error estimates. This approach is adopted in Kendall (1983). Notation is very synoptical and important formulae can be derived quickly, given a sufficiently advanced mathematical background.
Thirdly, discriminant analysis can be developed in a mathematical decision theory framework. This non-trivial approach is illustrated in Hand (1981, chapter 4) and employs the concepts of linear mathematical programming.
Finally (or simultaneously), additional theoretical research resulted in the development of a number of non-parametric methods, of which the Kernel method is perhaps the most widely used.
It must be noted that most popular statistical computer packages do not necessarily conform to the results of the latest advances made in these theories.
1.3 Comparison With Other Multivariate Statistical Techniques.
The classic linear discrimination analysis, as developed by Fisher, adopts an approach similar to multiple linear regression and a number of computational algorithms are borrowed from that field of statistics.
Three distinctive multivariate techniques exist that aid in the classification of individuals on the basis of multivariate measurements:
"(a) Discrimination. We are given the existence of two or more populations and a sample of individuals from each. The problem is to set up a rule, based on measurements from these individuals, which will enable us to allot some new individual to the correct population when we do not know from which it emanates.
(b) Clustering. We are given a sample of individuals [...] and the problem is to classify them into groups which shall be as distinct as possible. In the discrimination the existence of the groups is given, in clustering it is a matter to be determined.
(c) Dissection. We are given a sample [...] and wish to divide it into groups, whether the border-lines of subdivision are natural or not."
[Kendall, 1982, p.370]
The statistical use of dissection in the MIS field is not intuitively apparent. Cluster analysis, more specifically factor analysis, is more widely used by MIS researchers (refer to the essays by previous students and by Ms L. Cotty). Its popularity stems from its particular applicability when using questionnaires, especially where generally accepted theoretical models are absent. Its main scientific shortcoming lies in the interpretation by the researcher of the clusters that are revealed by the analysis.
Discrimination techniques have been employed very successfully in the natural and economic sciences. However, an admittedly superficial literature analysis could reveal no MIS studies that employed discrimination analysis to any significant extent. This may be due to a number of reasons:
1) Discriminant analysis appears to be a lesser known technique than many of the other multivariate techniques. Research tends to mimic previously "established" methodologies and hence adopts the more standard analysis techniques such as ANOVA or cluster analysis.
2) There is a definitive lack of a variety of measurable continuous variables for relatively large samples. While a number of variables are certainly ratio or interval variables, many tend to be ordinal (e.g. point-scales derived from questionnaires) or nominal (non-rankable categories).
3) Even where continuous data is available, there is no prior theory to suggest that assumptions of reasonable normality and constant variance hold. However, this does not seem to deter researchers from applying other multivariate techniques which make the same, or even more stringent, assumptions, such as multiple linear regression and factor analysis.
4) A number of discriminant analysis techniques have been developed to address some of the above-mentioned methodological problems. Validation of the results for smaller samples can be done through the "jack-knife" technique. A number of non-parametric techniques have also been developed which do not rely on assumptions about data distribution and in some cases can work with ordinal data: Kernel estimators, k-Nearest-neighbour methods and series expansion. However, each of these alternative methods has its own methodological disadvantages and difficulties. They require great care in application, usually require the assistance of a skilled statistician, and are not free of methodological controversy. In addition, non-parametric discrimination analysis techniques are not generally available in the more popular statistical computer packages (SPSS, SAS or BMDP), which reduces their appeal to researchers even further.
5) Finally, even applying the "standard" linear discriminant analysis requires a reasonable statistical background. This is not only required for the final interpretation of the results, particularly the statistical reliability of the results. Even the "computer analysis" requires an educated selection of parameters to govern the stepwise selection process and to establish the classification errors when selecting the Z cut-off value(s).
Then what should motivate the MIS researcher to consider discriminant analysis, especially in the light of the other, currently more popular, multivariate methods?
1.4 Advantages of Discriminant Analysis.
The main advantage of discriminant analysis could be considered its strong intuitive appeal to managers. A linear discriminant function is easy to calculate and all the manager has to do is to measure a small number of variables, multiply them with the appropriate discrimination coefficients, add them together and compare the result to the critical Z-score. This attractiveness has already led to its acceptance in the banking environment (for credit evaluation) and in the field of financial management (for corporate failure prediction). Its development in the banking field was fuelled by the increasing demand for credit and the lack of trained personnel; precisely the conditions within the commercial MIS environment.
In addition, unlike cluster or multiple regression analysis, discriminant analysis actually yields the input needed for an immediate decision: it predicts either "success" or "failure", "accept" or "reject", "yes" or "no". A related attraction is its capability to take the relative costs of misclassification into consideration. Projects with higher potential returns could employ a more optimistic Z-value, whilst projects with average return would use a more conservative cut-off value. In all cases a clear measure of risk can be quantified.
From a theoretico-methodological point of view, there are a substantial number of options to deal with the lack of well-behaved (i.e. normally distributed continuous) data in the MIS field. Whilst linear discriminant analysis is relatively robust (tolerant towards minor violations of its assumptions), specific alternatives exist in the form of quadratic functions, non-parametric methods and specialist procedures (jack-knife and bootstrap).
2. HYPOTHETICAL EXAMPLE.
The following hypothetical example may serve to illustrate the basic theory which follows.
Assume that data can be collected on a set of characteristics for a relatively large number (40 or more) of expert system projects, within a large financial company. Characteristics could include the following variables: the number of decision rules and facts; the budget; the development time of the prototype; the number of (human) subject experts involved; year of implementation; etc. In order to measure relative size, logarithms should be taken of some of these variables. Other variables would be of a binary nature, e.g. whether a commercially available expert system shell was used or not; whether the system was implemented on mainframe or PC; whether consultants were used for the knowledge base engineering. Qualitative factors such as management commitment, experience of the development team and amount of end-user involvement in the system development could be quantified using 20-point checklists. In all, for each project a so-called vector (or set) of 22 measurements is available.
Each of these projects can be classified as either a successful implementation ("success") or an unsuccessful implementation ("failure"). Success might be defined as whether an expert system is used by at least one third of the anticipated (number of) users in the production environment one year after original implementation.
The discriminant function would be in the form of a linear combination of the variables. A typical result would be that four to six variables would be sufficient to obtain an acceptable determination coefficient. The resultant discrimination function (not normalized!) might well look as follows:
Z = 0,0210xNR + 5,32xSE + 3,20xLB - 4,20xCS + 5,2xMC
with: NR = Number of Rules.
SE = Number of Subject Experts involved.
LB = Log10 of the Budget.
CS = 1 if a Commercial ES Shell was used; otherwise 0.
MC = Score on the Management Commitment checklist.
The coefficients of this function would be determined in such a way that maximum discrimination between successful and unsuccessful projects is obtained. The validity of the function depends on a number of factors, which include the extent to which the statistical assumptions have been satisfied, the parameters specified in the stepwise selection process, the correctness of the classification procedure which classifies projects as successful/unsuccessful, etc.
A "cut-off" value (or dividing point) needs to be determined in order to separate the two populations. Assuming that the average Z-value for successful projects equals 140,30 and that for unsuccessful projects equals 30,20, the midpoint of the two, 85,25, could be used.
If a new expert system is now being proposed of a similar nature as the ones for which the discriminant function was determined, and the values NR, SE, LB, CS, and MC are known, then a Z-value can easily and quickly be calculated. A Z-value larger than 85,25 would then suggest that a successful implementation can be expected, whereas values below 85,25 indicate likely failure.
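As a minimal sketch of this calculation, the Python fragment below evaluates the hypothetical function for one proposed project. The measurements of the proposal are invented purely for illustration, and decimal points replace the decimal commas used in the text.

```python
import math

# Coefficients of the hypothetical (non-normalized) discriminant function above.
COEFFICIENTS = {"NR": 0.0210, "SE": 5.32, "LB": 3.20, "CS": -4.20, "MC": 5.2}
# Midpoint between the assumed group means of 140.30 and 30.20.
CUT_OFF = (140.30 + 30.20) / 2


def z_score(project):
    """Return the discriminant score Z for a dict of measurements."""
    return sum(COEFFICIENTS[name] * project[name] for name in COEFFICIENTS)


def classify(project, cut_off=CUT_OFF):
    """Predict 'success' when Z exceeds the cut-off, otherwise 'failure'."""
    return "success" if z_score(project) > cut_off else "failure"


# Invented proposal: 800 rules, 3 subject experts, a budget of 100 000,
# a commercial shell, and 14 out of 20 on the management commitment checklist.
proposal = {"NR": 800, "SE": 3, "LB": math.log10(100_000), "CS": 1, "MC": 14}
print(round(z_score(proposal), 2), classify(proposal))
```

Passing a higher `cut_off` to `classify` corresponds to the more conservative choice of critical Z-value that management may prefer.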
A number of practical issues will be dealt with in more depth at a later stage but it should be mentioned that in this case the critical Z-value would probably be chosen at a higher level in order to reduce the probability of classifying a failure as a success, i.e. management may prefer to err on the conservative side (not going ahead with a project that might actually be a success) rather than on the optimistic side (implementing a system which will be a failure despite its high Z-value). Also, it is likely that several discrimination functions would be computed, since success might be measured one, two and three years into the future (instead of only one year ahead). Probably two sets of discrimination functions would be needed: one to decide whether to develop the prototype (based on estimated budgets etc.) and one based on actual historical project data which would aid with the decision at the actual implementation stage (predicting success when the system goes live in production). Each set of functions would then be estimated using the respective data sets. Probabilities of misclassification and confidence levels can also be calculated, but they require extreme care in interpretation.
3 ELEMENTARY THEORY OF THE LINEAR DISCRIMINANT ANALYSIS.
In what follows, discriminant analysis for two populations will be demonstrated. The extension to three or more populations is mathematically trivial but the practical interpretation of the results becomes increasingly difficult. Much of what follows has been condensed from Afifi (1984, chapter 11) and Flury (1988, chapter 7). A more rigorous but still readable treatment is found in Afifi (1979) which also includes a probabilistic approach, the extensions for classification into more than two populations and for binomially distributed populations.
Assume that a number of n individuals are drawn randomly from two multivariate normally distributed populations (i.e. n is the sample size). For each individual there are p observations X1, X2, ... , Xp with Xi (i=1 to p) a continuous variable.
The populations will be represented as PS and PF respectively (S=success; F=failure) and the number of individuals drawn from each population are nS and nF respectively (with nS + nF = n). It is important to note that it is known which population each of the sample individuals belongs to!
Further it is assumed that both populations have the same covariance matrix Σ (with a dimension of p x p) but different mean vectors μS and μF respectively (each with a dimension p). In practice, the population parameters Σ, μS and μF are unknown and will be estimated from the sample, i.e. by the pooled sample covariance matrix S and the sample means x_Si and x_Fi (i = 1 to p).
Using an arbitrary linear combination:
Z = a1.X1 + a2.X2 + ... + ap.Xp
a new (dependent) variable Z is obtained with its own distribution characteristics. It can be shown that the Z-values for each of the populations will be normally distributed. The best estimators for each population's new distribution statistics (means and standard deviation) will be x_S; sS, and x_F; sF respectively. These Z-statistics (and their estimates) will be linear or quadratic combinations of the population (and sample) statistics, using the same coefficients a1 ... ap.
The overall Z distribution which includes individuals from both populations will have a more complex appearance, being the combination of two normals with different statistics. A graphical illustration of this can be found in Figure 1 where five sample linear combinations of two variables, "BOTTOM" and "TOP", and the resultant (individual) Z-distributions are shown [Flury, 1988, p.90].
Depending on the choice of the coefficients, the Z-distributions for (the sample of) each population will overlap to a larger or lesser extent. The degree of overlap will depend on the difference between the respective means x_S and x_F, as well as on the standard deviations sS and sF. Hence a standardized distance, usually known as the Mahalanobis distance, can be computed as follows:
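D = |x_S - x_F| / s
with s² = ((nS - 1) x sS² + (nF - 1) x sF²) / (nS + nF - 2), the pooled variance of the Z-values.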
The larger the Mahalanobis distance, the smaller the area which overlaps between the Z-distributions of the samples of both populations. The smaller the overlap, the better the discrimination between the populations. Referring to the example in Figure 1, it is clear that the linear combinations (4) and (1) offer a much better discrimination than (5) and (2).
As might be expected, the idea is to maximize the Mahalanobis distance. Since this standard distance is a function of the linear coefficients, it can be solved using the standard techniques for finding the maximum of a function i.e. through linear equations or linear programming. The linear combination(s) for which this maximum is achieved is called the discriminant function, or more specifically in this case, Fisher's linear discriminant function. The maximum value of the Mahalanobis distance which is obtained is known as the multivariate standard distance.
Since multiplying all the coefficients of the discriminant function with any (non-zero) constant will not alter the multivariate standard distance, the function will usually be normalized so that the sum of the squares of its coefficients will equal one. These normalized coefficients are then known as the discriminant coefficients.
Evaluation of the discriminant function for any individual with given measurements X1, ... , Xp will yield the discriminant Z-score for that individual. The crux lies in choosing an appropriate "cut-off" Z-value, e.g. C, which gives the optimal discrimination between the two samples.
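The maximization can be sketched numerically. The fragment below assumes NumPy and uses the standard closed-form result that the maximizing coefficients are proportional to the inverse of the pooled covariance matrix times the difference of the sample mean vectors; the function names and the simulated data are illustrative assumptions, not part of the cited texts.

```python
import numpy as np


def standard_distance(a, x_s, x_f):
    """Standard distance between the groups along the combination Z = X a."""
    z_s, z_f = x_s @ a, x_f @ a
    n_s, n_f = len(z_s), len(z_f)
    pooled_var = ((n_s - 1) * z_s.var(ddof=1)
                  + (n_f - 1) * z_f.var(ddof=1)) / (n_s + n_f - 2)
    return abs(z_s.mean() - z_f.mean()) / np.sqrt(pooled_var)


def fisher_discriminant(x_s, x_f):
    """Return (unit-length coefficients, multivariate standard distance)."""
    n_s, n_f = len(x_s), len(x_f)
    mean_diff = x_s.mean(axis=0) - x_f.mean(axis=0)
    # Pooled estimate S of the common covariance matrix.
    pooled = ((n_s - 1) * np.cov(x_s, rowvar=False)
              + (n_f - 1) * np.cov(x_f, rowvar=False)) / (n_s + n_f - 2)
    # Coefficients proportional to S^-1 (mean_S - mean_F) maximize the distance.
    a = np.linalg.solve(np.atleast_2d(pooled), mean_diff)
    distance = float(np.sqrt(mean_diff @ a))
    return a / np.linalg.norm(a), distance


# Simulated samples: rows are individuals, columns are the p = 2 variables.
rng = np.random.default_rng(0)
successes = rng.normal([3.0, 1.0], 1.0, size=(30, 2))
failures = rng.normal([1.0, 2.0], 1.0, size=(25, 2))
coef, dist = fisher_discriminant(successes, failures)
print(coef, dist)
```

Evaluating `standard_distance` for any other coefficient vector, for example one that uses the first variable only, should give a value no larger than `dist`.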
The Z-score for any individual can now be calculated and compared to C. Individuals with a Z-score less than C would be classified as belonging to the one population (e.g. PS), whilst individuals with a calculated Z-score larger than C would be classified as part of PF. Depending on the actual value of C, a number of individuals would be misclassified. This is graphically illustrated in Figure 2 which shows a proposed Z-value equal to C.
C will, under the assumptions, minimize the overall proportion of misclassifications and can be calculated as the midpoint between the two sample means of Z:
C = (x_S + x_F) / 2
Note that there are actually two types of errors, as shown in the table below:
| Classified on basis of Z-score into: | Individual belongs to PS | Individual belongs to PF |
| Population PS | Correct | Type I error |
| Population PF | Type II error | Correct |
The Type I error refers to the individuals which actually belong to population PF, but by virtue of their "very low" Z-score are classified incorrectly as belonging to PS. The Type II error refers to those individuals belonging to population PS but misclassified into PF due to their high Z-score. The above formula ensures equal probability of either type of error occurring. However, an adjusted formula must be employed if the (prior) probabilities of an individual belonging to either population are not equal (evidenced by nS being significantly different from nF).
In most practical MIS situations, the cost of making a Type I error will not be equal to that of making a Type II error. If the relative costs of making either type of error can be quantified, the formula must be modified to compensate for this fact as well. However, all of these subsequent complications to the formula are still based on, and relatively sensitive to, the original assumptions of normality and equal variance. It makes practical sense therefore to base the critical cut-off value C on the actual, computed Z-frequency distribution. The C value can then be selected so that the observed occurrences of Type I and Type II errors match the risk profile desired by the decision maker.
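One way to picture this empirical choice is sketched below in Python. The relative costs, the candidate cut-offs (midpoints between neighbouring observed scores) and the example Z-scores are all assumptions made for illustration; the error labels follow the convention above, in which low Z-scores are classified into PS.

```python
def error_counts(z_s, z_f, cut_off):
    """Count errors when Z < cut_off is classified as PS, otherwise as PF."""
    type_1 = sum(1 for z in z_f if z < cut_off)   # PF member classified as PS
    type_2 = sum(1 for z in z_s if z >= cut_off)  # PS member classified as PF
    return type_1, type_2


def choose_cut_off(z_s, z_f, cost_type_1=1.0, cost_type_2=1.0):
    """Pick the candidate cut-off with the lowest total observed cost."""
    scores = sorted(set(z_s) | set(z_f))
    # Candidates: midpoints between neighbouring observed scores.
    candidates = [(low + high) / 2 for low, high in zip(scores, scores[1:])]

    def cost(c):
        type_1, type_2 = error_counts(z_s, z_f, c)
        return cost_type_1 * type_1 + cost_type_2 * type_2

    return min(candidates, key=cost)


# Invented Z-scores for the two samples.
z_ps = [1.0, 2.0, 3.0, 5.0]
z_pf = [4.0, 6.0, 7.0, 8.0]
print(choose_cut_off(z_ps, z_pf, cost_type_1=5.0))  # Type I errors are costly
print(choose_cut_off(z_ps, z_pf, cost_type_2=5.0))  # Type II errors are costly
```

Raising the cost of one type of error moves the chosen cut-off away from the region where that error occurs, at the price of more errors of the other type.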
It will be noted that this empirical method actually uses the sample which has been used to determine the discriminant function, to estimate the probability of classification errors. Hence this estimate is biased and, for the relatively small samples in the MIS field, may under-estimate the actual probability of misclassification substantially. Formulae exist which will provide unbiased estimates but they are again based on the initial distribution assumptions, which are unlikely to be fulfilled in MIS studies. Deviations from the assumptions will again result in an under-estimate of the real population error probability (Afifi, 1984, pp.259-267).
An alternative model validation method, generally called cross-validation, randomly splits the sample into two sub-samples and uses one sample to generate the discriminant function and the other to validate the function thus generated. (Note that cross-validation is not limited to discriminant analysis.) Again, an MIS researcher may be reluctant to do so since she may be working with a relatively small sample already. As a last resort, a technique called the jack-knife procedure can be employed. This procedure is a variant of the cross-validation technique. One observation from the first population is excluded and a discriminant function is estimated. The latter is then used to classify the omitted observation; this observation may be classified correctly or incorrectly. Subsequently another observation is omitted from the first population and replaced by the observation which was excluded originally. Again the discriminant function is estimated and used to classify the omitted observation. This process is repeated for the entire sample of the first population. The relative proportion of misclassifications will then approximate the unbiased estimator of the Type I error. The procedure can be repeated for the sample from the second population to estimate the Type II error.
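The jack-knife procedure can be sketched as follows, assuming NumPy, a midpoint cut-off and simulated data. The helper names are hypothetical, and the Type I and Type II labels follow the definitions given with the error table (a PF member classified as PS is a Type I error).

```python
import numpy as np


def fit_ldf(x_s, x_f):
    """Return (coefficients a, midpoint cut-off C) of Fisher's linear function."""
    n_s, n_f = len(x_s), len(x_f)
    pooled = ((n_s - 1) * np.cov(x_s, rowvar=False)
              + (n_f - 1) * np.cov(x_f, rowvar=False)) / (n_s + n_f - 2)
    # Sign chosen as mean_F - mean_S so that PS scores lie below the cut-off.
    mean_s, mean_f = x_s.mean(axis=0), x_f.mean(axis=0)
    a = np.linalg.solve(np.atleast_2d(pooled), mean_f - mean_s)
    return a, (mean_s @ a + mean_f @ a) / 2


def jackknife_error_rates(x_s, x_f):
    """Leave-one-out estimates of the (Type I, Type II) error rates."""
    type_2 = 0  # a PS member is left out, then classified as PF (score >= C)
    for i in range(len(x_s)):
        a, c = fit_ldf(np.delete(x_s, i, axis=0), x_f)
        type_2 += int(x_s[i] @ a >= c)
    type_1 = 0  # a PF member is left out, then classified as PS (score < C)
    for i in range(len(x_f)):
        a, c = fit_ldf(x_s, np.delete(x_f, i, axis=0))
        type_1 += int(x_f[i] @ a < c)
    return type_1 / len(x_f), type_2 / len(x_s)


# Simulated small samples with overlapping distributions.
rng = np.random.default_rng(1)
x_s = rng.normal([0.0, 0.0], 1.0, size=(15, 2))
x_f = rng.normal([1.5, 1.0], 1.0, size=(12, 2))
print(jackknife_error_rates(x_s, x_f))
```

Each observation is classified by a function that was estimated without it, which is what removes the optimistic bias of re-classifying the estimation sample.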
So far, the discussion has concentrated on the classification problem from a descriptive point of view. The practical MIS application will most likely employ discriminant analysis for predictive purposes: an observation with known measurements but unknown membership can then be classified. If the discriminant function is to be used in a real-world environment, it would be advisable to re-estimate the discriminant function on a continuous basis in order to ensure that time-induced variations in the underlying populations are reflected in the model parameters.
4. STEPWISE VARIABLE SELECTION
Often, discrimination functions can correctly classify most of the observations even when using a relatively small subset of the available variables (measurements).
4.1 Exhaustive Search.
From a theoretical perspective, one could estimate and validate the discrimination function for all possible subsets of the variables. Because of the computation intensive nature of this technique, this is impractical when more than fifteen to twenty variables are available. Hence a number of heuristic methods have been developed.
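The growth in the number of candidate subsets can be illustrated with a few lines of Python. The helper names are illustrative; 22 is the number of measurements in the hypothetical example of section 2.

```python
from itertools import combinations


def count_subsets(p):
    """Number of non-empty variable subsets an exhaustive search must try."""
    return 2 ** p - 1


def all_subsets(variables):
    """Yield every non-empty subset of the given variable names."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


for p in (5, 15, 22):
    print(p, count_subsets(p))
```

Each subset requires its own estimation and validation of a discriminant function, so the total effort roughly doubles with every additional variable.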
4.2 Sequential Forward Selection Method.
Starting from one variable (the one with the highest correlation with the population membership), add a single variable at a time, each time selecting that variable which increases the validity of the discrimination function the most (or reduces the misclassification error the most). Note that, contrary to multiple linear regression, these variables are not necessarily those with the highest correlation. Refer to the intuitive example: "Redundancy of variables in a discriminant function does not mean that these variables carry no information. What it means is that they furnish no additional information to the group separation than can be obtained using the sufficient (non-redundant) variables alone. It's like guessing the name of a president: knowing that he is a Republican is of some value, but once you know that he was a movie star before becoming a politician, the former information becomes redundant." (Flury, 1988, p.121).
This method ignores interrelationships between variables that have not yet been selected and cannot remove variables already added, which may become largely redundant through subsequent additions.
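A minimal sketch of forward selection is given below, assuming NumPy. It uses the squared Mahalanobis distance as the selection criterion and a fixed maximum number of variables as the stopping rule; both are simplifying assumptions for illustration, since statistical packages normally stop on an F-to-enter threshold instead.

```python
import numpy as np


def distance_squared(x_s, x_f, columns):
    """Squared Mahalanobis distance between the groups on the given columns."""
    s, f = x_s[:, columns], x_f[:, columns]
    n_s, n_f = len(s), len(f)
    pooled = ((n_s - 1) * np.cov(s, rowvar=False)
              + (n_f - 1) * np.cov(f, rowvar=False)) / (n_s + n_f - 2)
    diff = s.mean(axis=0) - f.mean(axis=0)
    return float(diff @ np.linalg.solve(np.atleast_2d(pooled), diff))


def forward_selection(x_s, x_f, max_variables):
    """Greedily add the column that raises the distance most; never remove one."""
    selected, remaining = [], list(range(x_s.shape[1]))
    while remaining and len(selected) < max_variables:
        best = max(remaining,
                   key=lambda j: distance_squared(x_s, x_f, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Simulated data: column 2 separates the groups strongly, column 0 weakly,
# and column 1 not at all.
rng = np.random.default_rng(0)
x_s = rng.normal([0.0, 0.0, 0.0], 1.0, size=(40, 3))
x_f = rng.normal([0.5, 0.0, 5.0], 1.0, size=(40, 3))
order = forward_selection(x_s, x_f, max_variables=2)
print(order)
```

Because a variable is never reconsidered once it has been added, the sketch shares the weakness described above: an early choice stays in the function even if later additions make it redundant.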
4.3 Sequential Backward Elimination Method.
This method is similar to the forward selection method but starts from the complete set of variables and deletes variables which contribute the least to the discriminatory power of the function. Similar disadvantages as for the above method, mutatis mutandis, are apparent.
4.4 Stepwise Procedure.
This method adopts a combination of both the forward selection and backward elimination techniques, generally described as a "plus l - take away r" algorithm. Theoretical and empirical research suggests that the algorithm works best if l is slightly larger than r. For computational reasons, both must be kept relatively small but for theoretical reasons they should be as large as possible. This compromise decision is one of the more difficult decisions facing the researcher and it may have a great influence on the reliability of the results.
Another decision is when to stop the process of iteration. One widely used statistic is the F-test which can be calculated as follows:
F = ((n - d - 1) / (d - e)) x (Dd² - De²) / (m + De²), with m = n x (n - 2) / (nS x nF)
where n is the total sample size, D² is the squared Mahalanobis distance, Dd represents D evaluated on the d (existing) variables and De the new D on the e proposed variables. Although this statistic follows an F distribution with (d - e) and (n - d - 1) degrees of freedom when determined independently, subsequent use in the stepwise algorithm results in bias. Generally, computer programs would require the researcher to specify a (maximum) computed F-to-remove and a (minimum) computed F-to-add criterion. Obviously the latter must be larger than the former, otherwise one would keep adding the variable which was rejected just before.
4.5 Additional Notes.
It must be noted that the above procedure does not guarantee optimal discrimination. Generally, a number of "runs" are required before the researcher develops a "feel" for the parameters of the data. Advances in computer processing power, however, tend to allow larger steps and thus ensure a better approximation of the optimal solution.
Also, the researcher may want to force certain variables into the discriminant function regardless of optimality for a given number of variables. This could be based on theoretical considerations (an a priori MIS model), to ensure compatibility with other research, or to include the knowledge of practical experience.
Other methods exist to reduce the number of variables, e.g. by variable transformation. These will probably not suit the MIS field since in essence new "artificial" variables are created by combining some of the original variables.
5. PROBLEMS IN APPLYING DISCRIMINANT ANALYSIS TO MIS
5.1 Small Sample Size.
The minimum sample size required is most likely to be a major constraint for MIS. Some of the reasons why a relatively large sample is needed are:
a) The more variables are incorporated, the larger the sample size must be to avoid over-determination of the model and a reduction of the number of degrees of freedom when estimating error probabilities.
b) Many variables may not be approximately normally distributed. To ensure that results are still reasonably valid, the sample size should be as large as possible, since the general robustness of multivariate discriminant analysis holds asymptotically (following the Central Limit Theorem). Fortunately, relatively few problems occur for binomial distributions, i.e. for binary variables.
c) If severe methodological problems are experienced with the data distribution, the researcher may have no alternative but to employ non-parametric discriminant analysis (e.g. Kernel analysis). Generally, non-parametric analysis requires significantly more data in order to reach a validity which would be on par with its parametric equivalent technique.
5.2 Continuous Variables.
Linear discriminant analysis assumes that the data is continuous (interval or ratio). Unfortunately, many variables in MIS are ordinal or nominal.
Ordinal MIS data typically originates from questionnaires using 5-, 7- or 9-point scales. Although a researcher may be tempted to code the data from, say, a 5-point scale as 0, 1, 2, 3 or 4, this raises methodological questions. Assume that the answer options were: Never, Once, Sometimes, Frequently and Always. Although "Never" and "Once" could arguably be coded as 0 and 1 respectively, should "Sometimes" then be coded as 2? As 5? As 10?
Even more problematic are the nominal data such as sex, (the dreaded) race, mainframe computer model, job title, ... Where binary variables are employed (Yes/No, Male/Female, True/False), current practice is to treat them as dummy variables taking only two discrete values (e.g. 0 and 1). As mentioned above, the discriminant analysis theory can be extended to cater for binomial distributions.
"For binary variables the general conclusions seem to be that if the true decision surface is roughly linear, Fisher's [linear discriminant function] will perform satisfactorily. [... I]f a non-linear decision surface is discovered it might be possible to transform the data. [...] In spite of all this one should be aware that while Fisher's linear discriminant function may yield satisfactory results on categorical data, it certainly need not always do so." (Hand, 1981, pp. 113-114).
An overview of the discrete variable discriminant analysis methods that could be employed in these cases is given by the same author in Figure 3 (Hand, p. 116).
5.3 Population Considerations.
The predictive use of discriminant analysis assumes that new individuals are drawn from the same populations as the samples. In the fast-changing MIS environment, this might not hold. Continuous updating of the discriminant function as new data becomes available and the inclusion of a "date stamp" as a separate variable may go some way in alleviating these problems.
Another factor to be considered is whether both populations have similar covariance matrices. This can be tested, for example with Box's M test, which is commonly reported alongside (M)ANOVA output. Typically the measurements of the "fail" population have higher variances than the "successful" population. If so, non-parametric analysis might be considered.
Particularly problematic in the MIS field is the issue of multi-collinearity, whose treatment falls outside the scope of this essay.
Finally, care must be taken that both populations are homogeneous, particularly the "fail" population. Conceivably, in MIS applications the second population is often defined as "comprising those observations that do not belong to the (success) population". This negative definition may well conceal a number of different populations (compare: IBM-compatible versus non-compatible). Cluster analysis or visual inspection of the Z-function scatter plot might readily reveal this. Possible remedies include data transformation or classification into more than two populations; although the latter option requires again a large sample size.
5.4 Accumulation of Errors.
Researchers might be tempted to look only at the misclassification error measures. It must be realized that most or all of the following errors would have a cumulative effect on the validity of the discriminant analysis. (Note that this is the author's 'own', provisional taxonomy.)
a) Variable Identification Error: does the variable measure what the researcher intended? E.g. is a system that is not in actual use one year after implementation really unsuccessful?
b) Measurement Error: includes observation errors, data transcription errors and errors resulting from conversion from one data type (e.g. ordinal) to another (continuous).
c) Population Statistics Estimation Errors: estimation of the a priori probabilities, population means and variance on the basis of the sample statistics. Reliable statistical error estimators exist to quantify this error.
d) Variable Selection Error: by excluding the variables with the smallest information content, some discriminatory power is lost. In addition, the process of stepwise variable selection is heuristic and does not ensure the best possible subset of variables.
e) Discriminant Coefficient Estimation Errors: even if the most discriminating variables are selected, the discriminant coefficients are still only estimates of the true "population" discriminant function. Very biased estimates will result from multicollinearity or non-random samples.
f) Misclassification Error: this can be quantified using a variety of modified F-statistics (often biased), sample misclassification, sample splitting or jack-knife procedures.
g) Prediction Error: induced by changes over time in the underlying population statistics.
h) Violations of Assumptions: if parametric methods are used, the distribution assumptions (in this case normality and common covariance) will almost always be violated. At the very least, a "fudge" factor should be allowed for smaller deviations from the perfect distributions. Goodness-of-fit chi-square tests and multivariate analysis of variance can give an indication about the reasonableness of the assumptions.
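To make the misclassification error in item (f) concrete, the sketch below contrasts the sample (resubstitution) misclassification rate with a jack-knife (leave-one-out) estimate for a two-population linear discriminant function. It is a minimal illustration under stated assumptions: equal a priori probabilities and equal misclassification costs, simulated data, and hypothetical function names (`fit_lda`, `resubstitution_error`, `jackknife_error`).

```python
import numpy as np


def fit_lda(x1, x2):
    """Two-group linear discriminant: returns (coefficients, cut-off)."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    # Pooled within-group covariance matrix (common covariance assumption).
    pooled = (
        (n1 - 1) * np.atleast_2d(np.cov(x1, rowvar=False))
        + (n2 - 1) * np.atleast_2d(np.cov(x2, rowvar=False))
    ) / (n1 + n2 - 2)
    a = np.linalg.solve(pooled, m1 - m2)
    # Midpoint cut-off: assumes equal priors and equal misclassification costs.
    cutoff = 0.5 * (a @ m1 + a @ m2)
    return a, cutoff


def resubstitution_error(x1, x2):
    """Error rate when the fitting sample is also used for classification."""
    a, c = fit_lda(x1, x2)
    wrong = np.sum(x1 @ a < c) + np.sum(x2 @ a >= c)
    return float(wrong) / (len(x1) + len(x2))


def jackknife_error(x1, x2):
    """Leave-one-out error: each observation is classified by a function
    estimated without that observation."""
    wrong = 0
    for i in range(len(x1)):
        a, c = fit_lda(np.delete(x1, i, axis=0), x2)
        wrong += int(x1[i] @ a < c)
    for i in range(len(x2)):
        a, c = fit_lda(x1, np.delete(x2, i, axis=0))
        wrong += int(x2[i] @ a >= c)
    return wrong / (len(x1) + len(x2))


# Hypothetical small sample: two populations, four measurements each.
rng = np.random.default_rng(1)
success = rng.normal(0.0, 1.0, size=(15, 4))
fail = rng.normal(1.0, 1.0, size=(15, 4))
print("resubstitution error:", resubstitution_error(success, fail))
print("jack-knife error:    ", jackknife_error(success, fail))
```

With small samples the resubstitution rate tends to be optimistic, which is why the jack-knife estimate is usually the more cautious figure to report.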
6. CHECKLIST FOR DISCRIMINANT ANALYSIS IN MIS
Construct a theoretical model.
Identify your variables. Pay particular attention to ensure that the variables measure what you intend to measure. Ensure specifically that two discrete populations are involved, and that observations can readily be identified as belonging to the one or the other.
Collect field data from a representative sample. The main problem areas will be the sample size, the randomness of the sample and the reliable measurement of the variables.
Test for normality of data with a Chi-square goodness-of-fit test and perform (M)ANOVA to check the equality of the covariance matrices. Additional checks should be made for multicollinearity, and the dimensionality of the data should be assessed using canonical correlation analysis.
Decide whether to use linear discriminant analysis or another type of discriminant analysis (quadratic, non-parametric) on the basis of the type of variables and the results of the previous tests. Perform any data transformations that might be needed.
Consider the possibility of splitting the sample for cross-validation purposes.
Perform the (computer-aided) discriminant analysis. "Force" the inclusion of the theoretically and practically important variables in the initial runs. Try different critical values to govern the stepwise selection process. If the discriminant function is very unstable, your sample may be too small or the statistical assumptions may have been violated.
Compare your discriminant function with your theoretical model, with other research and with the views of MIS practitioners in the field. This is an important part of the model validation.
Plot your Z-function and determine the critical cut-off value (or region) with particular attention to the a priori probabilities, validity of your function and (non-symmetrical) costs of misclassification.
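The last checklist item can be sketched numerically. The example below uses the standard textbook cut-off for two normal populations with a common covariance matrix, and assumes that the discriminant coefficients are scaled as the inverse pooled covariance matrix times the difference in mean vectors, so that an observation is assigned to population 1 when its Z value is at or above the cut-off. The function name `cutoff` and all figures are hypothetical.

```python
import math


def cutoff(z_mean_1, z_mean_2, prior_1=0.5,
           cost_1_given_2=1.0, cost_2_given_1=1.0):
    """Cut-off on the Z scale for assigning an observation to population 1.

    z_mean_1, z_mean_2: mean discriminant scores of populations 1 and 2.
    prior_1:            a priori probability of population 1.
    cost_1_given_2:     cost of assigning a population 2 member to population 1.
    cost_2_given_1:     cost of assigning a population 1 member to population 2.
    """
    prior_2 = 1.0 - prior_1
    midpoint = 0.5 * (z_mean_1 + z_mean_2)
    # The adjustment is zero when priors and costs are symmetrical.
    adjustment = math.log(
        (prior_2 * cost_1_given_2) / (prior_1 * cost_2_given_1)
    )
    return midpoint + adjustment


# Hypothetical mean Z scores for the "success" (1) and "fail" (2) populations.
print(cutoff(4.0, 0.0))                                     # symmetrical case
print(cutoff(4.0, 0.0, prior_1=0.2))                        # successes are rare
print(cutoff(4.0, 0.0, prior_1=0.2, cost_2_given_1=4.0))    # but costly to miss
```

A low a priori probability for population 1 raises the cut-off, so fewer observations are assigned to it; a high cost of missing a true member of population 1 lowers the cut-off again.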
7. REFERENCES
AFIFI A.A. & AZEN S.P. Statistical Analysis. A Computer Oriented Approach. London: Academic Press, 1979 (2nd Ed.).
AFIFI A.A. & CLARK V. Computer-aided Multivariate Analysis. London: Wadsworth, 1984.
FLURY B. & RIEDWYL H. Multivariate Statistics. A practical approach. London: Chapman & Hall, 1988.
HAND D.J. Discrimination and Classification. New York: J.Wiley, 1981.
HAWKINS D.M. (Ed.) Topics in Applied Multivariate Analysis. Cambridge University Press, 1982.
KENDALL M., STUART A. & ORD J.K. The Advanced Theory of Statistics. Volume 3: Design and Analysis, and Time Series. London: Griffin, 1983 (4th Ed.).
MUIRHEAD R.J. Aspects of Multivariate Statistical Theory. New York: J.Wiley, 1982.
SRIVASTAVA M.S. & CARTER E.M. An Introduction to Applied Multivariate Statistics. New York: North-Holland, 1983.
8. ACKNOWLEDGEMENTS
With special thanks to Mr. John Fresen of the Statistics Department of the University of the Western Cape for his constructive comments.
© Jean-Paul Van Belle