INTRODUCTION

Scientific inquiry is an iterative learning process. Objectives pertaining to the explanation of a social or physical phenomenon must be specified and then tested by gathering and analyzing data. In turn, an analysis of the data gathered by experimentation or observation will usually suggest a modified explanation of the phenomenon. Throughout this iterative learning process, variables are often added to or deleted from the study. Thus, the complexities of most phenomena require an investigator to collect observations on many different variables. This book is concerned with statistical methods designed to elicit information from these kinds of data sets. Because the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.

The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject. Often, the human mind is overwhelmed by the sheer bulk of the data.

Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. We have chosen to provide explanations based upon algebraic concepts and to avoid the derivations of statistical results that require the calculus of many variables. Our objective is to introduce several useful multivariate techniques in a clear manner, making heavy use of illustrative examples and a minimum of mathematics.

Nonetheless, some mathematical sophistication and a desire to think quantitatively will be required.

Most of our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made. Only in Chapters 6 and 7 shall we treat a few experimental plans (designs) for generating data that prescribe the active manipulation of important variables.

Although the experimental design is ordinarily the most important part of a scientific investigation, it is frequently impossible to control the generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.)

It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or commonsense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.

Multivariate analysis is a “mixed bag.” It is difficult to establish a classification scheme for multivariate techniques that both is widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. In Section 1.2, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study.

These problems, plus the examples in the text, should provide you with an appreciation for the applicability of multivariate techniques across different fields.

The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following:

- Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
- Sorting and grouping. Groups of “similar” objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
- Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?
- Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
- Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should keep it in mind whenever you attempt or read about a data analysis. It allows one to maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:

If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.

APPLICATIONS OF MULTIVARIATE TECHNIQUES

The published applications of multivariate methods have increased tremendously in recent years. It is now difficult to cover the variety of real-world applications of these methods with brief discussions, as we did in earlier editions of this book. However, in order to give some indication of the usefulness of multivariate techniques, we offer the following short descriptions of the results of studies from several disciplines. These descriptions are organized according to the categories of objectives given in the previous section. Of course, many of our examples are multifaceted and could be placed in more than one category.

Data reduction or simplification

- Using data on several variables related to cancer patient responses to radiotherapy, a simple measure of patient response to radiotherapy was constructed.
- Track records from many nations were used to develop an index of performance for both male and female athletes.
- Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images (pictures) of a shoreline in two dimensions.
- Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants.
- A matrix of tactic similarities was developed from aggregate data derived from professional mediators. From this matrix the number of dimensions by which professional mediators judge the tactics they use in resolving disputes was determined.

Sorting and grouping

- Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization.
- Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from nonalcoholics.
- Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease.
- The U.S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not.

Investigation of the dependence among variables

- Data on several variables were used to identify factors that were responsible for client success in hiring external consultants.
- Measurements of variables related to innovation, on the one hand, and variables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not.
- Data on variables representing the outcomes of the 10 decathlon events in the Olympics were used to determine the physical factors responsible for success in the decathlon.
- The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance.

Prediction

- The associations between test scores and several high school performance variables and several college performance variables were used to develop predictors of success in college.
- Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments.
- Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insurers.
- Data on several variables for chickweed plants were used to develop a method for predicting the species of a new plant.

Hypotheses testing

- Several pollution-related variables were measured to determine whether levels for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and weekends.

Some rule-of-thumb guidelines for multivariate statistical analysis:

- Sample Size Rule of Thumb: As a general guideline, you should have a minimum of 5 observations per variable in your dataset to ensure reliable results. However, this is a rough estimate, and the specific requirements may vary depending on the analysis technique.
- Multicollinearity Rule of Thumb: Multicollinearity occurs when there is a high correlation between predictor variables. A commonly used screen is to check that the absolute value of the correlation coefficient between any pair of predictor variables is less than 0.7 (or a stricter cutoff, such as 0.5). Pairwise correlations do not reveal every form of multicollinearity, so treat this as a first check only.
- Dimensionality Rule of Thumb: When reducing high-dimensional data with principal component analysis (PCA) or factor analysis, a common rule of thumb is to retain the smallest number of components or factors that together explain at least 70-80% of the total variance in the data.
- Significance Level Rule of Thumb: In hypothesis testing, a commonly used significance level is 0.05 (5%). This means that if the p-value associated with a statistical test is less than 0.05, the result is considered statistically significant.
- Rule of Thumb for Outliers: A common rule of thumb for identifying outliers is to treat any data point that lies more than 1.5 times the interquartile range (IQR) below the lower quartile or above the upper quartile as a potential outlier.
- Rule of Thumb for Variable Selection: When selecting variables for regression models or other analyses, a common rule of thumb is to include variables whose correlation coefficient with the outcome variable is at least 0.3 in absolute value.
- Rule of Thumb for Model Complexity: In model building, it is generally recommended to avoid overfitting by limiting the number of predictor variables to at most one for every 10 observations (or an even more conservative ratio for smaller sample sizes).
- Rule of Thumb for Normality: In many statistical analyses, the assumption of normality is important. As a rough guideline, if the skewness and excess kurtosis (kurtosis minus 3) of a variable are both within the range of -2 to +2, the data can be considered approximately normally distributed.
- Rule of Thumb for Homoscedasticity: Homoscedasticity refers to the assumption of equal error variance in regression models. A rule of thumb is to visually inspect the residual plot and look for a consistent spread of residuals across the range of predicted values.
- Rule of Thumb for Interpreting Factor Loadings: In factor analysis, a rule of thumb is to consider factor loadings above 0.3 (in absolute value) as significant for interpretation purposes. However, this guideline can vary depending on the specific research field and context.
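
Two of the guidelines above, the 1.5 × IQR outlier screen and the pairwise-correlation screen for multicollinearity, can be sketched in a few lines of Python. This is only an illustrative sketch that uses the standard library. The function names (`iqr_outliers`, `pearson`, `correlated_pairs`), the choice of the "inclusive" quartile convention, and the data are all assumptions made for the example, not part of the text.

```python
import math
import statistics
from itertools import combinations


def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3."""
    # Quartile conventions differ between packages; "inclusive" is one choice.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.7):
    """List predictor pairs whose absolute correlation exceeds threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Made-up illustrative data, not taken from any study in the text.
measurements = [10, 12, 11, 13, 12, 11, 14, 40]
predictors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],  # roughly 2 * x1
    "x3": [5, 1, 4, 2, 6, 3],
}

print(iqr_outliers(measurements))
print([(a, b) for a, b, _ in correlated_pairs(predictors)])
```

With these made-up numbers, the value 40 falls outside the upper fence, and only the pair (x1, x2) exceeds the 0.7 cutoff. Both cutoffs are parameters, since the appropriate values depend on the analysis.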

Remember that these rules of thumb are general guidelines and are not universally applicable, but they can help you make a rough decision when time is short. It is important to consider the specific context, data characteristics, and analysis goals when applying them in multivariate statistical analysis.
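
As one illustration of the dimensionality guideline, the sketch below counts how many leading components are needed to reach a target share of the total variance. It assumes the component variances (eigenvalues of the covariance or correlation matrix) have already been computed by whatever PCA routine is in use. The function name `components_to_retain` and the eigenvalues are hypothetical.

```python
from itertools import accumulate


def components_to_retain(eigenvalues, target=0.80):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    for count, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return count
    return len(ordered)


# Hypothetical eigenvalues, chosen only for illustration.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
print(components_to_retain(eigenvalues, target=0.80))  # 3 (cumulative 0.875)
print(components_to_retain(eigenvalues, target=0.70))  # 2 (cumulative 0.75)
```

Moving the target from 70% to 80% changes the answer from two components to three here, which shows how sensitive the rule is to the chosen cutoff.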