Construct Validity: Advances in Theory and Methodology

Measures of psychological constructs are validated by testing whether they relate to measures of other constructs as specified by theory. Each test of relations between measures reflects on the validity of both the measures and the theory driving the test. Construct validation concerns the simultaneous process of measure and theory validation. In this chapter, we review the recent history of validation efforts in clinical psychological science that has led to this perspective, and we review five recent advances in validation theory and methodology of importance for clinical researchers. These are: the emergence of nonjustificationist philosophy of science; an increasing appreciation for theory and the need for informative tests of construct validity; valid construct representation in experimental psychopathology; the need to avoid representing multidimensional constructs with a single score; and the emergence of effective new statistical tools for the evaluation of convergent and discriminant validity.

Keywords: Philosophy of Science, Construct Representation, Multitrait-Multimethod Validation, Construct Homogeneity, Construct Validation Programs

In this chapter, we highlight the centrality of construct validation to theory testing in clinical psychology. In doing so, we first provide a brief history of modern validation efforts and describe the foundational role construct validity theory plays in modern, scientific clinical psychology. We then highlight four recent developments in construct validity theory, as well as advances in statistical methodology, that we believe should play an important role in shaping construct and theory validation efforts. We begin with a brief history.

An Historical Overview of Validation Efforts in Clinical Psychology

When scientific clinical psychology emerged at the beginning of the 20th century, researchers faced the challenge of developing valid measures without an existing knowledge base on which to rely. The absence of a foundation of knowledge was an enormous problem for test validation efforts. Validating measures of psychological constructs necessarily requires criteria that are themselves valid. One cannot show that a predictor of some form of psychopathology is valid unless one can show that the predictor relates to an indicator of that form of psychopathology that is, itself, valid. One cannot show that a certain deficit in cognitive processing characterizes individuals with a certain disorder unless one has defined and validly measured the disorder. Inevitably, to validate scores on measures one needs a structure of existing knowledge to which one can relate those scores. To go further, to validate one's claim that scores on a measure play a certain role in a network of psychological processes, one needs valid measures of the different components of the specified processes.

As researchers developed measures, and confirmed or disconfirmed early, relatively crude predictive hypotheses, a knowledge base began to develop. The development of a knowledge base made possible the specification of procedures for measure validation. The specification of such procedures, in turn, facilitated further knowledge acquisition. And as knowledge continued to develop, the need for more theoretically sophisticated means of measure and theory validation emerged. We believe the recent history of validation efforts reflects this kind of reciprocal influence between existing knowledge and validation standards. We next briefly describe this process in greater detail.

Early Measure Development and Validity

An often-discussed early measure in the history of validation efforts is the Woodworth Personal Data Sheet (WPDS), a measure developed in 1919 to help the U.S. Army screen out individuals who might be vulnerable to “war neurosis” or “shell shock.” It was subsequently described as measuring emotional stability (Garrett & Schneck 1928; Morey 2002). During both the construction and the use of the test, researchers showed clear concern for its validity. Unfortunately, their efforts both to develop and to validate the test reflected the weak knowledge structure of clinical research at the time.

Woodworth constructed the 116-item test by relying on existing clinical psychological knowledge and by using empirical methods. Specifically, he drew his item content from case histories of individuals identified as neurotic. He then administered the items to a normal test group and deleted items scored in the presumably dysfunctional direction by 50% or more of that group (Garrett & Schneck 1928). Clearly, he sought to construct a valid measure of dysfunction. And although not all researchers who used the WPDS concerned themselves with its validity, some did. Flemming & Flemming (1929) chided researchers for neglecting to validate the test, and then conducted their own empirical test of the measure.

Items on the WPDS are quite diverse. They include, “Have you ever lost your memory for a time?”, “Can you sit still without fidgeting?”, “Does it make you uneasy to have to cross a wide street or an open square?”, and “Does some particular useless thought keep coming into your mind to bother you?” From the standpoint of today's knowledge base in clinical psychology, each of these four sample items seems to refer to a different construct. It is thus not surprising that the measure did not perform well. It did not differentiate college students from “avowed psychoneurotics” (Garrett & Schneck 1928), nor did it correlate with teacher ratings of students’ emotional stability (Flemming & Flemming 1929).

One can see two core limitations underlying the effort to develop and validate this test. First, in developing the WPDS item pool, Woodworth had to rely on a far too incomplete understanding of psychopathology and its contributors. Second, the validity of the criterion measures was not established independently; the criteria were based either on broad diagnostic classification or on subjective teacher ratings, and their validity was surely limited.

Researchers at the time expressed concerns related to these limitations. For example, Garrett & Schneck (1928) noted the heterogeneous items and the mixture of complaints represented in the item pool and drew a conclusion that anticipated recent advances in validation theory to be discussed later in this chapter:

“It is this [heterogeneity], among other [considerations], which is causing the present-day trend away from the concept of mental disease as an entity. Instead of saying that a patient has this or that disease, the modern psychiatrist prefers to say that the patient exhibits such and such symptoms.” (p. 465).

Based on this thinking, Garrett & Schneck (1928) investigated relations among individual items and specific diagnoses (rather than membership in the general category of “mentally disturbed”). In doing so, they recognized the need to avoid combining items of different content as well as the need to avoid combining individuals with different symptom pictures. Their use of an empirical item-person classification produced very different results from prior rational classifications (Laird 1925), thus (a) demonstrating the importance of empirical validation and (b) anticipating criterion-keying methods of test construction. In addition, they anticipated the current appreciation for construct homogeneity, with its emphasis on unidimensional traits and unidimensional symptoms as the preferred objects of theoretical study and measure validation (Edwards 2001; McGrath 2005; Smith et al. 2003; Smith & Combs 2008).

The Validation of Measures as Their Ability to Predict Criteria

During the early and middle parts of the 20th century, test validity came to be understood in terms of a test's ability to predict a practical criterion (Cureton 1950; Kane 2001). This focus on criterion prediction may have been a function of three forces: advances in substantive knowledge and precision of thought in the field, the obvious limitations in the tests constructed on purely rational grounds, and a philosophy-based suspicion of theories describing unobservable entities (Blumberg & Feigl 1931). Indeed, many validation theorists explicitly rejected the idea that scores on a test mean anything beyond their ability to predict an outcome. As Anastasi (1950) put it,

“It is only as a measure of a specifically defined criterion that a test can be objectively validated at all . . . . To claim that a test measures anything over and above its criterion is pure speculation.” (p. 67)

At the time, this approach to measure validation proved quite useful: it led to the generation of new methods of test construction as well as to important substantive advances in knowledge. Concerning test construction, it led to the criterion-keying approach, in which one selects items entirely on the basis of whether the items predict the criterion. This method represented an important advance: to some degree, validity as successful criterion prediction was built into the test. The method worked well. Two of the most prominent measures of personality and psychopathology, the MMPI (Butcher 1995) and the CPI (Megargee, 2008), were developed using criterion-keying. Each of those measures has generated a wealth of knowledge concerning personality, psychopathology, and adjustment: there are thousands of studies attesting to the measures’ clinical value. For example, the MMPI-2 distinguishes between psychiatric inpatients and outpatients and facilitates treatment planning (Butcher 1990; Greene 2006; Nichols & Crowhurst 2006; Perry et al. 2006). It has also been applied usefully to normal populations (such as in personnel assessment: Butcher 2002; Derksen et al. 2003), to head-injured populations (Gass 2002), and in correctional facilities (Megargee 2006). The CPI validly predicts a wide range of criteria as well (Gough 1996).

As Kane (2001) noted, the criterion-related validity perspective also led to more sophisticated treatments of the relationship between test scores and criteria, as well as to the development of utility-based decision rules (see Cronbach & Gleser 1965). Perhaps it is also true that the focus on prediction of criteria as the defining feature of validity contributed to the finding that statistical combinations of test data are superior to clinical combinations, and that this is true across domains of inquiry (Grove et al. 2000; Swets et al. 2000).

As prediction improved and knowledge advanced using this criterion validity perspective, the ultimate limitations of the method became clear. One core limitation reflects a difficulty in prediction that was present from the beginning: tests of criterion-related validity are only as good as the criteria used in the prediction task. As Bechtoldt (1951) put it, reliance on criterion-related validity “involves the acceptance of a set of operations as an adequate definition of whatever is to be measured [or predicted].” (p. 1245). Typically, the validity of the criterion was presumed, not evaluated independently. In hindsight, there was good reason to question the validity of many criteria: they were often based on some form of judgment (crude diagnostic classification, teacher rating), and those judgments had to be made with an insufficiently developed knowledge base. Limitations in the validity of criteria impose limitations on one's capacity to validate a measure.

The second limitation is one that led to the development of construct validity theory and that could only have become apparent once the core knowledge base in clinical psychology had developed sufficiently: the criterion-related validity approach does not facilitate the development of basic theory. When tests are developed for the specific intent of predicting a circumscribed criterion, as is the case with criterion-keying test construction, and when they are only validated with respect to that predictive task, as is the case with criterion-related validity, the validation process is likely to contribute little to theory development. As a result, criterion-related validity findings tend not to provide a strong foundation for deducing likely relationships among variables, and hence for the development of generative theory.

The Emergence of Construct Validity

In the early 1950s, there was an emerging concern with theory development that led to Meehl and Challman's introduction of the concept of construct validity in the 1954 Technical Recommendations (American Psychological Association 1954). Their work was part of the efforts of the American Psychological Association's Committee on Psychological Tests. In our view, the developing focus on theory was made possible, in part, by the substantive advances in clinical knowledge facilitated by the criterion-related validity approach. Perhaps ironically, the success of the criterion-related validity method led to its ultimate replacement with construct validity theory. The criterion approach led to significant advances in knowledge, which helped facilitate the development of integrative theories concerning cognition, personality, behavior, and psychopathology. But such theories could not be validated using the criterion approach; there was thus a need for advances in validation theory to make possible the emerging theoretical advances. This need was addressed by several construct validity authors in the middle of the 20th century (Campbell & Fiske 1959; Cronbach & Meehl 1955; Loevinger 1957).

Indeed, theoretical progress in clinical psychology has depended substantially on four seminal papers, all published within little more than a decade. The first (MacCorquodale & Meehl 1948) promoted the philosophical legitimacy of hypothetical constructs, concepts that have a “cognitive factual reference” (p. 107) that goes beyond the data used to support them. That is, hypothetical constructs are hypotheses about the existence of entities, processes, or events that are not directly observed. That paper advanced the legitimacy of psychological theories that describe entities that underlie, but are not equivalent to, what is observed in the laboratory or other research setting.

The second (Cronbach & Meehl 1955) described the methods and rules of inference by which one develops evidence for the validity of measures of such hypothetical constructs. We use the word develops rather than establishes to emphasize that construct validation is an ongoing process, the process of theory testing. Construct validation tests are also tests of the validity of the theory that specifies a measure's presumed meaning. Central to Cronbach and Meehl's conceptualization of construct validity was the need to articulate specific theories describing relations among psychological processes, in order to then evaluate the performance of measures thought to represent one such process (see also Garner et al. 1956). Cronbach & Meehl (1955) emphasized deductive processes in construct validity. The third (Loevinger 1957) identified the construct validation process as the general framework for the development and testing of psychological theories and the measures used to represent theoretical constructs. In Loevinger's view, construct validity subsumed both content validity and predictive/concurrent, or empirical, validity. In short, construct validity is validity (see also Landy 1986; Messick 1995).

The fourth paper (Campbell & Fiske 1959) considered issues in the validation of purported indicators of a construct. The title of their article, “Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix (MTMM),” refers to two of the three core ideas in their article that remain crucial in the process of validation of a measure as a construct indicator.

First, all measures are trait (construct)-method units. That is, variance in all psychological measures consists of substantive construct variance, variance due to the method of measurement that is independent of the construct, and, of course, errors of measurement. Second, two types of evidence are required to validate a test or other measurement in psychology. The first, convergent validity, is demonstrated by associations among “independent measurement procedures” designed to reflect the same or similar constructs (Campbell & Fiske 1959, p. 81, emphasis added). The second, discriminant validity, requires that a new measure of a construct be substantially less correlated with measures of conceptually unrelated constructs than with other indicators of that construct. Discriminant validity requires the contrast of relationships among measures of constructs in the same conceptual domain, e.g., personality or symptom dimension constructs. Although Campbell & Fiske (1959) gave even weight to convergent and discriminant validity, in later work the initial primacy of convergent validity was acknowledged (Cook & Campbell 1979; see Ozer 1989). Third, because of the ever-present, often substantial method variance in all psychological measures, validation studies require the simultaneous consideration of two or more traits measured by at least two different methods. Campbell & Fiske (1959) referred to this approach as multitrait-multimethod matrix methodology; we return to this specific methodology at the end of this chapter.

Although these papers are over 50 years old, each remains an invaluable place to begin one's mastery of the concept of construct validity. From the first three of these foundational papers, we understand that each study using a measure is simultaneously a test of the validity of the measure and a test of the theory defining the construct. Each new test provides additional information supporting or undermining one's theory or validation claims; with each new test, the validity evidence develops further. Thus, validation is a process, not an outcome. Often, the construct validity of a measure is described as “demonstrated,” which is incorrect (Cronbach & Meehl 1955). Although the process is ongoing, it is not necessarily infinite. For example, if a well-validated measure such as the Wechsler Adult Intelligence Scale-III (Wechsler 1997) or the Positive and Negative Affect Schedule (Watson et al. 1988b) does not behave as “expected” in a study, the measure would not be abandoned. One would likely retain one's confidence in the measure and consider other possible explanations for the outcome, such as deficient research design.

Since the time of these articles, it has also become clear that researchers should concern themselves with construct validity from the beginning of the test construction process. To develop a measure that validly represents a psychological entity, researchers should carefully define the construct and select items representing the definition (Clark & Watson 1995). This reasoning extends to the selection of parameters for manipulation in experimental psychopathology (see Knight & Silverstein 2001). As Bryant (2000) effectively put it for the assessment of a trait,

Imagine, for example, that you created an instrument to measure the extent to which an individual is a “nerd.” To demonstrate construct validity, you would need a clear initial definition of what a nerd is to show that the instrument in fact measures “nerdiness.” Furthermore, without a precise definition of nerd, you would have no way of distinguishing your measure of the nerdiness construct from measures of shyness, introversion or nonconformity. (p. 112).

There have been four recent developments in perspectives on construct validity theory of importance for clinical psychological measurement. First, the philosophical understanding of scientific inquiry has evolved in ways that underscore both the complexity and the indeterminate nature of the validation process (Bartley 1987; Weimer 1979). Second, it has become apparent that the relative absence of strong, precise theories in clinical psychology sometimes leads to weak, non-informative validation tests (Cronbach 1988; Kane 2001). Appreciation of this has led theorists to re-emphasize the centrality of theory testing in construct validation (Borsboom et al. 2004; Kane 2001). Third, researchers have accentuated the need to consider, as an aspect of construct validity, the evaluation of theories describing the psychological processes that lead to responses in psychological experiments, such as those used in experimental psychopathology research. Tests of such theories are evaluations of construct representation (Whitely [now Embretson] 1983; Embretson 1998; see Knight & Silverstein 2001). Fourth, researchers have stressed the importance of specifying and measuring homogeneous constructs, so that the meaning of validation tests is unambiguous (Edwards 2001; Hough & Schneider 1995; Schneider et al. 1996; McGrath 2005; Smith et al. 2003; Smith & McCarthy 1995; G.T. Smith, D.M. McCarthy, T.B. Zapolski, submitted manuscript). We consider each of these in turn. But first, what is the current view of construct validity in assessment?

Current Views on Construct Validity in Psychological Measurement

Construct validity is now generally viewed as a unifying form of validity for psychological measurements, subsuming both content and criterion validity, which traditionally had been treated as distinct forms of validity (Landy 1986). Messick (1989, as discussed in Messick 1995) has argued that even this notion of validity is too narrow. In his view, “[v]alidity is an overall evaluative judgment of the degree to which [multiple forms of] evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores” (Messick 1995, p. 741).

That is, construct validity is comprehensive, encompassing all sources of evidence supporting specific interpretations of a score from a measure as well as actions based on such interpretations. Messick, writing mainly with reference to educational assessment, identified six contributors to construct validity (Messick 1995, see Figure 1, p. 748): (1) content relevance and technical quality; (2) theoretical understanding of scores and associated empirical evidence, including process analyses; (3) structural data; (4) generalizability; (5) external correlates; and (6) consequences of score interpretation. We focus here on aspects (2), (3), and (5), considering points (1) and (4) to be relatively well established and not controversial, and the practical consequences of test use (point 6) to be beyond the scope of this chapter (but see Youngstrom 2008).

Advances in Philosophy of Science

In the first half of the 20th century, many philosophers of science held the view that theories could be fully justified or fully disproved based on empirical evidence. The classic idea of the critical experiment that could falsify a theory is part of this perspective, which has been called justificationism (Bartley 1962; Duhem 1914/1991; Lakatos 1968). Logical positivism (Blumberg & Feigl 1931), with its belief that theories are straightforward derivations from observed facts, is one example of justificationist philosophy of science. From this perspective, one could imagine the validity of a theory and its accompanying measures being fully and unequivocally established as a result of a series of critical experiments.

However, advances in the philosophy of science have led to a convergence on a different perspective, referred to as nonjustificationism (Bartley 1987; Campbell 1987, 1990; Feyerabend 1970; Kuhn 1970; Lakatos 1968; Weimer 1979). The nonjustificationist perspective is that no theory is ever fully proved or disproved. Instead, in the ongoing process of theory development and evaluation, at a given time certain theories are viewed as closer approximations to truth than are other theories. From this perspective (which dominates current philosophy of science, despite disagreement both within and outside this framework: Hacking 1999; Kusch 2002; Latour 1999), science is understood to be characterized by a lack of certainty.

The reason for the uncertainty is as follows. When one tests any theory, such as “individual differences in personality cause individuals to react differently to the same stimulus” (a theory of considerable importance for understanding the process of risk for psychopathology: Caspi 1993), one is presupposing the validity of multiple theories in order to conduct the test (Lakatos 1999; Meehl 1978, 1990). In this example, one must accept that (1) there are reliable individual differences in personality that are not fully a function of context; (2) one has measured the appropriate domains of individual differences in personality; (3) one's measure of personality is valid, in that variation on dimensions of personality underlie variation in responses to the measure; (4) one's measure of personality does not represent other, non-personality processes to any substantial degree; (5) one's measure of each specific dimension of personality is coherent and unidimensional, i.e., does not represent variation on multiple dimensions simultaneously; (6) one can validly expose different individuals to precisely the same stimulus; (7) one can validly measure reactions to that stimulus; and so on.

It is easy to see that a failed test of the initial, core hypothesis could actually be due not just to a failure of the theory, but instead to failures in any number of “auxiliary” theories invoked to test the hypothesis. Researchers typically consider a number of different possibilities when faced with a non-supportive finding. Often, when one faces a negative finding for a theory one believes has strong support otherwise, one questions any number of auxiliary issues: measurement, sample, context, etc. Doing so is quite appropriate (Cronbach & Meehl 1955).

Science is characterized by ongoing debates between proponents and opponents of a theoretical perspective. Through the ongoing process of theoretical criticism and new empirical findings, the debate comes to favor one side over the other. In considering this process, Weimer (1979) concluded that what characterizes science is “comprehensively critical rationalism” (p. 40), by which he meant that every aspect of the research enterprise must be open to criticism and potential revision. Researchers must make judgments as to whether one should question a core theory, an auxiliary theory, or both; they must then investigate the validity of those judgments empirically.

Thus, validation efforts can be understood as arguments concerning the overall evaluation of the claimed interpretation of test scores (Messick 1995), or of claims concerning the underlying theory (Kane 2001). The validation enterprise can thus be understood to include a coherent analysis of the evidence for and against theory claims. Researchers can design theory validation tests based on their analysis of the sum total of evidence relevant to the target theory.

Interestingly, this perspective, particularly as argued by psychological scientists, has begun to influence inquiry in historically non-empirical fields as well. For example, legal scholars, drawing on construct validation theory, have begun to argue that empirical investigation of legal arguments is a necessary part of the validation of those theories (Goldman 2007). Their contention is that sound arguments for the validity of legal theories require both theoretical coherence and supportive empirical evidence.

There is no obvious answer to the question of how one decides which theoretical arguments, embodied by programs of research, are convincing and which are not. Lakatos (1999) referred to progressing versus degenerating research programs. Progressing research programs predict facts that are subsequently confirmed by research; degenerating research programs may offer explanations for existing findings, but they do not make future predictions successfully, and they often require post hoc theoretical shifts to incorporate unanticipated findings (Lakatos 1999). Clearly, this perspective requires judgment on the part of researchers.

It is important to appreciate that the concept of the nonjustificationist nature of scientific inquiry did not spring from studies of psychology as a science. Most authors espousing these views have focused primarily on hard sciences, particularly physics. It is a reality of scientific inquiry that findings are always open to challenge and critical evaluation. Indeed, what separates science from other forms of inquiry is that it embraces critical evaluation, both by theory and by empirical investigation (Weimer 1979). A second point is equally important to appreciate: the reality that no theories are ever fully proved or disproved is no excuse to proceed without theory or without clearly articulated theoretical predictions.

Strong, Weak, and Informative Programs of Construct Validation

As discussed recently by Kane (2001), there have been drawbacks in the use of construct validity theory to organize measure and theory validation. The core idea that one can define constructs by their place in a lawful network of relationships (the network is deduced from the theory) assumes a theoretical precision that tends not to be present in the social sciences. Typically, clinical psychology researchers are faced with the task of validating their measures and theories despite the absence of a set of precisely definable, expected lawful relations among construct measures. Under this circumstance, the meaning of construct validity, and what counts as construct validation, is ambiguous.

Cronbach (1988) addressed this issue by contrasting strong and weak programs of construct validity. Strong programs depend on precise theory, and are perhaps accurately understood to represent an ideal. Weak programs, on the other hand, stem from weak, or less fully articulated, theories and construct definitions. With weak validation programs, there is less guidance as to what counts as validity evidence (Kane 2001). One result can be approaches in which almost any correlation can be described as validation evidence (Cronbach 1988). In the absence of a commitment to precise construct definitions and specific theories, validation research can have an ad hoc, opportunistic quality (Kane 2001), the results of which tend not to be very informative.

Informative, Rather than Strong or Weak, Theory Tests

In our view, clinical researchers are not trapped between an as-yet unattainable ideal of strong theory and ill-conceived, weak theory testing. Rather, there is an iterative process in which tests of partially developed theories provide information that leads to theory refinement and elaboration, which in turn provides a sounder basis for subsequent construct and theory validation research. Cronbach & Meehl (1955) referred to this bootstrapping process and to the inductive quality of construct definition and theory articulation; advances in testing partially formed theories lead to the development of more elaborate, complete theories. This process has proven effective; striking advances in clinical research have provided clear benefits to the consumers of clinical services.

One example of this process has been the development of an effective psychological treatment for many of the behaviors characteristic of a previously untreatable disorder: borderline personality disorder. Dialectical behavior therapy (DBT) provides improved treatment of parasuicidal behavior and excessive anger (Linehan 1993; Linehan et al. 1993). The emergence of this treatment depended on incremental advances in numerous domains of clinical inquiry. First, advances in temperament theory and personality theory led to awareness of the stability of human temperament and personality, even across decades (Caspi & Roberts 2001; Roberts & DelVecchio 2000). That finding carried the obvious implication that treatment aimed at altering personality may not prove effective. The second advance was the recognition of disorders of personality, i.e., chronic dysfunction in characteristic modes of thinking, perceiving, and behaving, as distinct from other sources of dysfunction (Millon et al. 1996). That recognition facilitated the emergence of treatments targeted toward one's ongoing, typical mode of reacting and behaving. The third advance was the finding that behavioral interventions were effective for disorders of mood: when depressed individuals began participating in numerous, previously rewarding activities, their mood improved (Dimidjian et al. 2006).

DBT can be understood to represent the fruitful integration of each of these three theoretical advances. DBT was designed to treat individuals with borderline personality disorder. One central aspect of DBT is that therapists do not attempt to change borderline clients’ characteristic, intense affective response style: attempts to do so are unlikely to be successful, given the stability of personality. Instead, therapists seek to provide behavioral skills that clients can employ to manage their intense affective reactivity. The therapeutic focus is thus on managing one's mood, an approach that has proven effective (Linehan 1993).

To facilitate the process of theory development, researchers should consider whether their theoretical statements and tests are informative, given the current state of knowledge (Smith 2005b). Is a theory consistent with what else is known in the field (MacCorquodale & Meehl 1948)? Can it be integrated with existing knowledge? To what degree does a hypothesis test shed light on the likely validity of a theory, or the likely validity of a measure? Does a hypothesis involve a direct comparison between two alternative theoretical explanations? Does a hypothesis involve a claim that, if supported, undermines criticism of a theory? Does a hypothesis involve direct criticism of a theory, or a direct challenge to the validity of a measure? Theory tests of this kind will tend to advance knowledge, because they facilitate the central components of the scientific process: critical evaluation and the accumulation of knowledge.

Recent Arguments for a Reconceptualization of the Role of Theory in Clinical Research

In recent years, validity theorists have argued for an increased emphasis on theory in several aspects of psychological inquiry (Barrett 2005; Borsboom 2006; Borsboom et al. 2003, 2004; Maraun & Peters 2005; Michell 2000, 2001). We next review three basic arguments offered in this recent writing; we believe two of these apply, straightforwardly, to clinical science and the third does not.

The first, which we consider both relevant to clinical research and uncontroversial, concerns latent variable theory. Latent variable theory reflects the idea that variation in responses to test items indicates variation in levels of an underlying trait. As Borsboom et al. (2003) most recently noted, latent variable theory involves a specific view of causality: variation in a construct causes variation in test scores. When clinical psychology researchers describe a scale as a valid measure of a construct, such as anxiety, they are saying that variation in anxiety among individuals causes variation in those individuals’ test responses. From this point of view, each item on a test is an indicator of the construct of interest. Borsboom et al. (2003) develop the implications of this theory for psychological assessment.
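
To make the causal claim of latent variable theory concrete, the brief sketch below simulates a single latent trait that causes responses on four items; the construct name, loadings, and data are illustrative assumptions, not any published measure. The items intercorrelate only because they share the latent cause.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Reflective (latent variable) model: one latent trait causes all item responses.
# The construct name and loadings are hypothetical, for illustration only.
anxiety = rng.normal(size=n)                    # latent construct (unobserved)
loadings = np.array([0.8, 0.7, 0.6, 0.75])      # hypothetical item loadings
noise = rng.normal(size=(n, loadings.size)) * np.sqrt(1 - loadings ** 2)
items = anxiety[:, None] * loadings + noise     # observed item responses

# The items intercorrelate only because they share the latent cause.
print(np.round(np.corrcoef(items, rowvar=False), 2))
```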

The second concerns the basic distinction between theory and empirical data: theories exist independently of data (Borsboom 2006). It is certainly appropriate for researchers to develop, adopt, and promote explicit theories of psychological processes. Of course, ideally, researchers avoid inferring that findings provide stronger support for theories than they do, but that appropriate caution should not dissuade researchers from taking clear theoretical stands. More explicit statements of theory would (a) clarify the degree to which a given empirical test truly pertains to the theory and (b) drive the development of more direct tests of theoretical mechanisms (Borsboom 2006; Borsboom et al. 2004).

The third recent argument is one that, we believe, does not accurately pertain to the development of clinical science. Several recent authors have emphasized the need for more explicit, well-developed theories in general (Barrett 2005; Borsboom 2006; Maraun & Peters 2005; Michell 2000, 2001). At least one of these writers (Borsboom 2006) emphasizes the need to begin with precise, fully developed theories; on this view, to do otherwise is a disservice to the field. For example, although psychological theories often refer to causal processes, they are neither detailed nor mathematically formal. From the point of view of these authors, this is regrettable.

This point of view has not gone without criticism. Both Clark (2006) and Kane (2006) note that the incomplete knowledge base in psychology requires that any theory be an approximation, to be modified as new knowledge becomes available. Formal mathematical theories of psychological phenomena, especially in clinical psychology, are quite premature. And regardless of how detailed and precise the explication of a theory is, each component of it would necessarily undergo critical evaluation and revision as part of the normal progress of science (Weimer 1979). It seems to us that this process is inevitable and a normal part of scientific inquiry.

Construct Representation and Nomothetic Span

Construct Representation

Whitely (1983; Embretson 1998) introduced an important distinction in construct validity theory between nomothetic span and construct representation. Nomothetic span refers to the pattern of significant relations among measures of the same or different constructs (i.e., convergent and discriminant validity). Nomothetic span is in the domain of individual differences (correlation). It is particularly relevant to research concerning expected relationships among trait measures or measures of intellectual skills, neuropsychological variables, or personality constructs. For example, IQ has excellent nomothetic span because individual differences on various measures of that construct all show the expected, meaningful patterns of relationships with other variables (Whitely 1983). Confirmatory factor analysis of a matrix of correlations among measures, with prior specification of which relationships should be present and which absent, is one method for evaluating nomothetic span.

Construct representation (Whitely 1983; Embretson 1998), on the other hand, refers to the validation of the theory of the response processes that result in a score (such as accuracy or reaction time) in the performance of cognitive tasks. That is, construct representation refers to the psychological processes that lead to a given response on a trial or to the pattern of responses across conditions in an experiment. For many authors, and particularly for cognitive psychologists, construct representation indicates the validity of the dependent variable as an index of a construct (Borsboom et al. 2004; Embretson 1998). That is to say, the goal of construct representation is to test a theory of the cognitive processes giving rise to a response.

An example may make the notion of construct representation clearer. Carpenter et al. (1990) proposed a theory of matrix reasoning problem solving to account for performance on tests such as Raven's Progressive Matrices, a widely used measure of intelligence. Their model posited that working memory was a critical determinant of performance and that working memory load is influenced by two parameters: (1) the number of relationships among elements in a matrix, and (2) the level of complexity of the relationships among the elements. Note that these are quantitative variables. By developing matrix items that systematically varied on these two dimensions, these investigators were able to evaluate the extent to which each parametric variation, separately and conjointly, determined performance.

The model, in other words, identified underlying psychological processes, and those processes were validated by showing that they accounted for performance on the task as they were parametrically manipulated. The validity of the model provides evidence of the construct representation component of the test. The Raven's thus has evidence of both construct representation (model predictions are confirmed) and nomothetic span (individual differences in performance on the standardized version of the test correlate meaningfully with other variables). The nomothetic span and construct representation aspects of construct validity can complement each other. As an example, the construct representation analysis of Carpenter et al. (1990) is supported by correlational analyses showing that working memory tests, but not tests of short-term memory, are related to measures of fluid intelligence (Engle et al. 1999).
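
As a concrete illustration of this kind of parametric analysis, the sketch below regresses item error rates on the two posited load parameters and their interaction, so that each contribution can be estimated separately and conjointly. The numbers are invented for illustration; they are not data from Carpenter et al. (1990).

```python
import numpy as np

# Hypothetical item-level data in the spirit of a Carpenter-et-al.-style analysis:
# error rate as a function of the number of relations among matrix elements and
# the complexity of those relations. All values are invented for illustration.
n_relations = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)
complexity  = np.array([1, 2, 1, 2, 1, 2, 1, 2], dtype=float)
error_rate  = np.array([0.05, 0.10, 0.12, 0.22, 0.20, 0.35, 0.28, 0.50])

# Design matrix: intercept, each load parameter separately, and their conjoint
# (interaction) term.
X = np.column_stack([np.ones_like(n_relations),
                     n_relations, complexity, n_relations * complexity])
beta, *_ = np.linalg.lstsq(X, error_rate, rcond=None)
print("intercept, relations, complexity, interaction:", np.round(beta, 3))
```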

On the other hand, measures may have developed evidence of construct validity of one sort but not the other. Most IQ measures have excellent nomothetic span but limited construct representation: scores predict many things, but the specific psychological processes underlying responses (and the underlying processes common across measures) are generally unknown. The converse may also be true. As Whitely (1983) describes, Posner's (1978) verbal encoding task has excellent construct representation: the psychological mechanisms underlying performance are well established. However, the task has poor nomothetic span because individual differences on that task do not correlate well with scores on other measures of verbal encoding (Whitely 1983).

Construct Representation Research in Clinical Psychology

Construct representation has been understudied in clinical psychology research, particularly in clinical neuropsychology and experimental psychopathology. Theories of schizophrenia, depression, and other disorders emphasize disruptions in cognitive processes, and the nomothetic span of a number of tests within neuropsychology, cognitive psychology, and clinical cognitive neuroscience paradigms is well established. But the construct representation of such tests is often less well developed: many are psychologically complex, many are adaptations of paradigms developed for studying normal cognition and, at least in the case of schizophrenia research, many are poorly understood in terms of the underlying processes accounting for task deficits (Strauss 2001; Strauss & Summerfelt 1994). How construct representation may be relevant to research with personality or symptom self-reports or interviews is unclear and is a topic for further conceptual analysis and research.

Although construct representation and nomothetic span are distinct, one can influence the other. Performance on cognitive and neuropsychological tasks involves the operation of multiple cognitive processes, each of which may be reliably measured by the task. However, some of the reliable variance may well be construct-irrelevant (Messick 1995; Silverstein 2008). In such instances, group differences on a task, as well as associations between task performance and conceptually relevant other variables (i.e., apparent nomothetic span), may be due to such reliable but construct-irrelevant variance (Messick 1995; Silverstein 2008). Theoretical progress in clinical cognitive neuroscience and experimental psychopathology depends on the conjoint analysis of nomothetic span and construct representation in the evaluation of the construct validity of measures.

Conjoint analysis of nomothetic span and construct representation is also important for theory development in the study of personality traits and symptoms, especially as the field becomes more focused on neurobiological processes in personality and psychopathology. For example, there are at least 27 studies of the relation of impulsivity to the Iowa Gambling Task, a proposed measure of neurobiologically based deficits in decision making (PsycINFO search, July 1, 2008, with the terms “Iowa Gambling Task and Impulsivity”). However, none of these studies has evaluated the construct representation of the task, which is necessary to develop links between neurobiology, psychological processes, and individual differences in impulsivity. An excellent example of the conjoint evaluation of construct representation and nomothetic span is the work of Richard Depue, who has proposed a detailed theory of the biology of extraversion and its link with psychopathology (e.g., Depue & Collins 1999).

The incorporation of converging operations (Garner et al. 1956) into research designs can facilitate the analysis of construct representation and identify the extent to which correlations between performance and other variables reflect construct-relevant associations. For clinical research, the ability of different tasks or individual difference measures to differentially predict markers of observable, clinically important behaviors speaks to the presence of substantial construct-relevant variance (Hammond et al. 1986).

Establishing the construct representation of a measure requires an explicit theoretical analysis of test performance and empirical tests of the theory (Whitely 1983). An example of such a research program is the experimental analysis of the basis of schizophrenia patients’ error patterns on the A-X CPT, a form of vigilance task widely used in schizophrenia research (see Cornblatt & Keilp 1994; Nuechterlein 1991). In the A-X CPT, subjects must respond to the brief occurrence of an X in a rapidly changing sequence of letters, but only if the X is preceded by an A (Cohen et al. 1999). Experiments evaluating a theory of construct representation in this task suggested deficits in context representation as the most fruitful interpretation of task performance. A number of experiments using converging operations along with manipulations of theoretically proposed constituent processes have converged on this conclusion (see Barch 2005; Cohen et al. 1999). There is also substantial evidence of nomothetic span validity for the A-X CPT, including the specificity of the deficit to schizophrenia among psychotic disorders, as well as association with specific symptoms, intellectual function, and genetic liability to schizophrenia spectrum disorders (Barch et al. 2003; MacDonald et al. 2005). Other research programs suggest that this deficit may be an instance of a more general deficit in contextual coordination at both the behavioral and neural levels (Phillips & Silverstein 2003; Uhlhaas & Silverstein 2005).
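
For readers unfamiliar with the task structure, the following minimal sketch shows how trials in an A-X CPT style paradigm might be classified and scored. The trial-type labels (AX, BX, AY, BY) are conventional in this literature, but the scoring function and the data are purely illustrative and are not drawn from any of the studies cited above.

```python
from collections import Counter

def score_ax_cpt(trials):
    """Tally error rates by cue-probe trial type in an A-X CPT style task.
    Each trial is (cue, probe, responded); only an X preceded by an A is a target."""
    counts, errors = Counter(), Counter()
    for cue, probe, responded in trials:
        kind = ("A" if cue == "A" else "B") + ("X" if probe == "X" else "Y")
        counts[kind] += 1
        is_target = (kind == "AX")
        if responded != is_target:      # a miss on AX trials, a false alarm otherwise
            errors[kind] += 1
    return {kind: errors[kind] / counts[kind] for kind in counts}

# Illustrative trials: (cue, probe, responded); not real data.
trials = [("A", "X", True), ("A", "X", False), ("B", "X", True),
          ("A", "Y", False), ("B", "Y", False), ("B", "X", False)]
print(score_ax_cpt(trials))   # error rate for each trial type (AX, BX, AY, BY)
```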

Construct Homogeneity

Over the last 10 to 15 years, psychometric theory has evolved in a fundamental way that is crucial for psychopathology researchers to appreciate. In the past, psychometrics writers argued for the importance of including items on scales that tap a broad range of content. Researchers were taught to avoid including items that were highly redundant with each other, because then the breadth of the scale would be diminished and the resulting high reliability would be associated with an attenuation of validity (Loevinger 1954). To take the logic further, researchers were sometimes encouraged to choose items that were largely uncorrelated with each other, so that each new item could add the most possible incremental predictive validity over the other items (Meehl 1992).

In recent years, a number of psychometricians have identified a core difficulty with this approach. If items are only moderately inter-correlated, it is likely that they do not represent the same underlying construct. As a result, the meaning of a score on such a test is unclear. Edwards (2001) noted that researchers have long appreciated the need to avoid heterogeneous items: if such an item predicts a criterion, one will not know which aspect of the item accounts for the covariance. The same reasoning extends to tests: if one uses a single score from a test with multiple dimensions, one cannot know which dimensions account for the test's covariance with measures of other constructs.

There are two sources of uncertainty built into any validation test that uses a single score to reflect multiple dimensions. The first is that one cannot know the nature of the different dimensions’ contributions to that score, and hence to correlations of the measure with measures of other constructs. The second source of uncertainty is perhaps more severe than the first. The same composite score is likely to reflect different combinations of constructs for different members of the sample.

McGrath (2005) clarified this point by drawing a useful distinction between psychological constructs that represent variability on a single dimension, on the one hand, and concepts designed to refer to covariations among unidimensional constructs, on the other. Consider the NEO-PI-R measure of the five factor model of personality (Costa & McCrae 1992). One of the five factors is neuroticism, which is understood to be composed of six elemental constructs. Two of those are angry hostility and self-consciousness. Measures of those two traits covary reliably; they consistently fall on a neuroticism factor in exploratory factor analyses conducted in different samples and across cultures (McCrae et al. 1996). However, they are not the same construct. Their correlation was .37 in the standardization sample; they share only 14% of their variance. When concerned with theoretical issues, it is appropriate to disattenuate correlations for unreliability. In this instance, the common variance between angry hostility and self-consciousness, corrected for unreliability, is estimated to be 19%.
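
The correction referred to here is the standard disattenuation formula. With the observed correlation of .37, and assuming for illustration facet reliabilities of roughly .85 (a value consistent with the 19% figure above, not one reported in the original source), the arithmetic works out approximately as follows:

```latex
r_{\text{corrected}} \;=\; \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
\;=\; \frac{.37}{\sqrt{(.85)(.85)}} \;\approx\; .44,
\qquad r_{\text{corrected}}^{2} \;\approx\; .19
```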

Clearly, one person could be high in angry hostility and low in self-consciousness, and another could be low in angry hostility and high in self-consciousness. Those two different patterns could produce exactly the same score on neuroticism as measured by the NEO-PI-R, even though the two traits may have importantly different correlates. For example, the consensus view of psychopathy, based on both expert ratings and measurement, involves being unusually high in angry hostility and unusually low in self-consciousness (Lynam & Widiger 2007). Thus, it makes sense to develop theories relating angry hostility, or self-consciousness, to other constructs, and tests of such theories would be coherent. However, a theory relating overall neuroticism to other constructs must be imprecise and unclear because of the relative independence of the facets of the construct. If neuroticism correlates with another measure, one does not know which traits account for the covariation, or even whether the same traits account for the covariation for each member of the sample.

The use of a neuroticism score, obtained as a summation of scores on several separable traits, is problematic because it introduces theoretical imprecision. That observation is separate from the theoretical claim that there is a unidimensional construct, whether referred to as negative affectivity or emotional instability, that relates to variability on each lower level construct within the broad neuroticism domain. There is, of course, considerable empirical support for that claim (Costa & McCrae 1992; Watson et al. 1988a), as well as support for the view that each lower level construct shares variance with general negative affectivity and also has variance specific to itself (Krueger et al. 2001). We are noting that, because the specific variance for each lower level construct can be substantial, summing scores on the lower level constructs to obtain a single overall score introduces the theoretical and empirical imprecision we described above.

Hough & Schneider (1995), McGrath (2005), Paunonen & Ashton (2001), Schneider et al. (1996), and Smith et al. (2003), among others, have all noted that use of scores on broad measures often obscures predictive relationships. Paunonen (1998) and Paunonen & Ashton (2001) have shown that prediction of theoretically relevant criteria is improved when one uses facets of the big five personality scales, rather than the composite, big five dimensions themselves. Using the NEO-PI-R operationalization of the five factor model of personality, Costa & McCrae (1995) compared different facets of conscientiousness in their prediction of aspects of occupational performance. Dutifulness was related to service orientation (.35) and employee reliability (.38), but achievement striving was not (−.01 and .02, respectively). In contrast, achievement striving was related to sales potential (.22), but dutifulness was not (.06). By definition, a broad conscientiousness score (which on the NEO-PI-R sums these two facets with four others) will produce correlations between the high and low values, because the sum effectively averages the different effects of the different facets. Use of the broad score would thus obscure the different roles of the different facets of conscientiousness. Should one wish to represent the full domain of a higher order dimension, such as conscientiousness or neuroticism, one can include each lower level facet as part of a multivariate analysis (such as multiple regression); doing so preserves the theoretical precision inherent in precise constructs while representing the full variance of the higher order domain (Goldberg 1993; Nunnally & Bernstein 1994).
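
A small simulation can make the averaging problem concrete. In the hypothetical sketch below, which borrows the facet names from the example above but uses invented effect sizes rather than the Costa & McCrae (1995) values, a criterion depends on one facet but not the other; the composite's correlation with the criterion falls between the two facet correlations, whereas a regression on the separate facets recovers their distinct contributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical, standardized facet scores; modestly correlated, as facets of one
# broad domain typically are. Effect sizes are illustrative only.
dutifulness = rng.normal(size=n)
achievement = 0.3 * dutifulness + np.sqrt(1 - 0.3 ** 2) * rng.normal(size=n)

# A criterion that depends on one facet but not the other.
criterion = 0.4 * dutifulness + 0.0 * achievement + rng.normal(size=n)

composite = dutifulness + achievement   # broad, summed "conscientiousness"-style score

def r(x, y):
    return np.corrcoef(x, y)[0, 1]

print("r(dutifulness, criterion) = %.2f" % r(dutifulness, criterion))
print("r(achievement, criterion) = %.2f" % r(achievement, criterion))
print("r(composite,   criterion) = %.2f" % r(composite, criterion))   # falls in between

# Entering the facets separately in a multiple regression preserves their
# distinct contributions instead of averaging them.
X = np.column_stack([np.ones(n), dutifulness, achievement])
beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
print("facet regression weights:", np.round(beta[1:], 2))
```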

Recently, this perspective has been extended to the study of disorders. For example, McGrath (2005), noting that individuals can obtain the same depression scores with very different symptom patterns, describes depression as a useful social construction but not a coherent psychological entity that can be used in validation studies. Indeed, using factor analysis, Jang et al. (2004) identified 14 subfactors in a set of depression measures. Examples included “feeling blue and lonely,” “insomnia,” “positive affect,” “loss of appetite,” and “psychomotor retardation.” They found that the inter-correlations among the factors ranged from .00 to .34; further, the factors were differentially heritable, with heritability coefficients ranging from .00 to .35. Evidence of multidimensionality is accruing for many disorders, including post-traumatic stress disorder (King et al. 1998; Simms et al. 2002), psychopathy (Brinkley et al. 2004), schizotypal personality disorder (Fossati et al. 2005), and many others (Smith & Combs, 2008).

For scientific clinical psychology to advance, researchers should study cohesive, unidimensional constructs. To use multi-faceted, complex constructs as predictors or criteria in validity or theory studies is difficult to defend. Researchers are encouraged to generate theories that identify putatively homogeneous, coherent constructs. It may often be useful to compare the theory that a putative attribute is homogeneous to the theory that it is a combination of separate attributes. The success of such efforts in the recent past bodes well for continued progress in the field as researchers study unidimensional constructs with meaningful test scores (Jang et al. 2004; Smith et al. 2007; Whiteside & Lynam 2001).
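
One rough way to screen a candidate item set for homogeneity is to inspect the eigenvalue structure of its inter-item correlation matrix: a single dominant eigenvalue is consistent with unidimensionality, whereas two comparable eigenvalues suggest a blend of separable constructs. The sketch below uses invented correlation matrices and is a heuristic only, not a substitute for confirmatory factor-analytic comparison of one- versus two-factor models.

```python
import numpy as np

def dominance_ratio(R):
    """Ratio of the first to the second eigenvalue of an item correlation matrix.
    A large ratio is consistent with (but does not prove) unidimensionality."""
    eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
    return eigenvalues[0] / eigenvalues[1]

# Hypothetical inter-item correlation matrices for two four-item scales.
homogeneous = np.full((4, 4), 0.55)          # uniformly intercorrelated items
np.fill_diagonal(homogeneous, 1.0)

two_dimensional = np.array([                 # two item pairs, weakly related blends
    [1.00, 0.60, 0.15, 0.15],
    [0.60, 1.00, 0.15, 0.15],
    [0.15, 0.15, 1.00, 0.60],
    [0.15, 0.15, 0.60, 1.00],
])

print("homogeneous item set:  %.1f" % dominance_ratio(homogeneous))      # ~5.9
print("two-dimensional blend: %.1f" % dominance_ratio(two_dimensional))  # ~1.5
```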

This discussion of construct homogeneity raises two important issues. The first is, when is a construct measure elemental enough? There is a risk of continually parsing constructs until one is left with a content domain specific to a single item, thus losing full coverage of a target construct and attenuating predictive power. We believe the guiding consideration should be theoretical clarity. When there is good theoretical or empirical reason to believe that an item set actually consists of two, separately definable constructs with different psychological meaning, and when those two constructs could reasonably have meaningfully different external correlates, measuring the two separately is likely to improve both understanding and empirical prediction. When there is no sound theoretical basis to separate items into multiple constructs, one should perhaps avoid doing so.

The second issue is whether a focus on construct homogeneity leads to a clear and unacceptable loss of parsimony. This possibility merits careful consideration. With respect to etiological models, the use of several homogeneous constructs rather than their aggregate can complicate theory testing, but that difficulty must be weighed against the improved precision of theory tests.

It is at least possible that an emphasis on construct homogeneity often does not compromise parsimony. For example, it appears to be the case that four broad personality dimensions and their underlying facets effectively describe the many different forms of dysfunction currently represented by the full set of personality disorders (Widiger & Simonsen 2005; Widiger et al. 2005). Perhaps it is instead the case that parsimony has been compromised by the current DSM system that names multiple putative syndromes that often appear to reflect slightly different combinations of personality dimensions.

It may be that parsimony would be better served by describing personality dysfunction in terms of a set of core, homogeneous personality traits rather than in terms of combinations of disparate, moderately related symptoms (Widiger & Trull 2007). This logic has been extended beyond the personality disorders domain: Serretti & Olgiati (2004) described basic dimensions of psychosis that apply across current diagnostic distinctions, suggesting parsimony in the dimensional description of psychosis.

Empirical Evaluation of Construct Validity

Campbell and Fiske's (1959) multitrait-multimethod (MTMM) matrix methodology presented a logic for evaluating construct validity through the simultaneous evaluation of convergent validity, discriminant validity, and the contribution of method variance to observed relationships. Wothke (1995) nicely summarized the central idea of MTMM matrix methodology:

The crossed-measurement design in the MTMM matrix derives from a simple rationale: Traits are universal, manifest over a variety of situations, and detectable with a variety of methods. Most importantly, the magnitude of a trait should not change just because different assessment methods are used. (p. 125)

Traits are latent variables, inferred constructs. The term trait, as used here, is not limited to enduring characteristics; it applies as well to more transitory phenomena such as moods and emotions, and to all other individual-differences constructs, such as attitudes and psychophysical measurements. Methods, for Campbell and Fiske, are the procedures through which responses are obtained: the operationalization of the assessment process that produces the responses, the quantitative summary of which is the measure itself (Wothke 1995).

As Campbell & Fiske (1959) emphasized, measurement methods contribute reliable but construct-irrelevant variance (method variance). When the same method is used across measures, the presence of reliable method variance can inflate the apparent magnitude of relations among constructs, leading to overestimates of convergent validity and underestimates of discriminant validity. This is why the use of multiple assessment methods is critical in establishing construct validity. Their distinction between validity (the correlation between dissimilar measures of a characteristic) and reliability (the correlation between similar measures of a characteristic) hinged on the differences between construct assessment methods.
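The inflationary effect of shared method variance can be made concrete with a small simulation. The sketch below is ours, not Campbell and Fiske's; the loadings of .7 for trait variance, .4 for method variance, and .5 for error are arbitrary values chosen only for illustration. It generates scores on two uncorrelated traits, each measured by a self-report and an interview method, and compares the resulting convergent and discriminant correlations.

```python
# Illustrative simulation (ours, not Campbell & Fiske's): how reliable
# method variance inflates correlations among measures sharing a method.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two uncorrelated traits and two method factors (self-report, interview).
trait_a, trait_b = rng.standard_normal((2, n))
method_self, method_int = rng.standard_normal((2, n))

def score(trait, method):
    # Observed score = trait + method + error, with arbitrary illustrative weights.
    return 0.7 * trait + 0.4 * method + 0.5 * rng.standard_normal(n)

a_self, b_self = score(trait_a, method_self), score(trait_b, method_self)
a_int, b_int = score(trait_a, method_int), score(trait_b, method_int)

def r(x, y):
    return round(float(np.corrcoef(x, y)[0, 1]), 2)

print("Convergent (same trait, different methods):        ", r(a_self, a_int))
print("Discriminant, shared method (different traits):    ", r(a_self, b_self))
print("Discriminant, different methods (different traits):", r(a_self, b_int))
```

With these arbitrary weights, the two traits, though uncorrelated in truth, correlate near .18 when measured with the same method, whereas the cross-method discriminant correlation is near zero and the convergent correlation is near .54; shared method variance thus mimics a failure of discriminant validity.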

Campbell & Fiske's (1959) observation remains important today: much clinical psychology research relies on the same method for both predictor and criterion measurement, typically self-report questionnaire or interview. Their call for attention to method variance is as relevant today as it was 50 years ago; examination of constructs with different methods is a crucial part of the construct validation process. Of course, the degree to which two methods are independent is not always clear. For example, how different are the methods of interview and questionnaire? Both rely on self-report, so are they independent sources of information? Perhaps not, but they do differ operationally. For example, questionnaire responses are often anonymous, whereas interview responses require disclosure to another person. Questionnaire responses are based on the perceptions of the respondent, whereas interview ratings are based, in part, on the perceptions of the interviewer. A conceptually based definition of “method variance” has not been easy to achieve, as Sechrest et al.'s (2000) analysis of this issue demonstrates. Certainly, method differences lie on a continuum: self-report and interview, for example, are closer to each other than self-report is to informant report or behavioral observation.

The guidance provided for evaluating construct validity in 1959 was qualitative; it involved the rule-based examination of patterns of correlations against the expectations of convergent and discriminant validity (Campbell & Fiske 1959). Developments in psychometric theory, multivariate statistics and analysis of latent traits in the decades since the Campbell & Fiske (1959) paper have made available a number of quantitative methods for modeling convergent and discriminant validity across different assessment methods.

Bryant (2000) provides a particularly accessible description of the use of ANOVA (and a nonparametric variant) and confirmatory factor analysis (CFA) in the analysis of MTMM matrices. A major advantage of CFA in construct validity research is the possibility of directly comparing alternative models of relationships among constructs, a critical component of theory testing (see Whitely 1983). Covariance component analysis of the MTMM matrix has also been developed (Wothke 1995). Both covariance component analysis and CFA are variants of structural equation modeling (SEM). With these advances, eyeball examination of MTMM matrices is no longer sufficient for evaluating the trait validity of a measure in modern assessment research.

Perhaps the first CFA approach was one that followed very straightforwardly from Campbell & Fiske (1959): it involved specifying a CFA model in which responses to any item can be understood as reflecting additive effects of trait variance, method variance, and measurement error (Marsh & Grayson 1995; Reichardt & Coleman 1995; Widaman 1985). So if traits A, B, and C are each measured with methods X, Y, and Z, there are six latent variables: three for the traits and three for the methods. Thus, if indicator i reflects method X for evaluating trait A, that part of the variance of i that is shared with other indicators of trait A is assigned to the trait A factor, that part of the variance of i that is shared with indicators of other constructs measured by method X is assigned to the method X factor, and the remainder is assigned to an error term (Eid et al. 2003; Kenny & Kashy 1992). The association of each type of factor with other measures can be examined, so, for example, one can test explicitly the role of a certain trait or a certain type of method variance on responses to a criterion measure. This approach can be expanded to include interactions between traits and methods (Campbell & O'Connell 1967, 1982), and therefore test multiplicative models (Browne 1984; Cudeck 1988).
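To make the structure of this model concrete, the sketch below writes it in lavaan-style model syntax. This is a sketch under our own naming assumptions: indicator names pair a trait letter with a method letter (e.g., a_x is trait A measured by method X), and the specification could be passed to any SEM package that accepts this syntax (e.g., the R package lavaan).

```python
# Illustrative lavaan-style specification of the additive trait + method CFA
# model (three correlated trait factors, three correlated method factors).
# Indicator names are hypothetical: trait letter followed by method letter.
CTCM_MODEL = """
# Trait factors, each defined by the same trait measured with three methods.
TraitA =~ a_x + a_y + a_z
TraitB =~ b_x + b_y + b_z
TraitC =~ c_x + c_y + c_z

# Method factors, each defined by the three traits measured with one method.
MethodX =~ a_x + b_x + c_x
MethodY =~ a_y + b_y + c_y
MethodZ =~ a_z + b_z + c_z

# Trait factors may correlate with one another, as may method factors;
# trait-method covariances are conventionally fixed to zero.
TraitA ~~ 0*MethodX
TraitA ~~ 0*MethodY
TraitA ~~ 0*MethodZ
TraitB ~~ 0*MethodX
TraitB ~~ 0*MethodY
TraitB ~~ 0*MethodZ
TraitC ~~ 0*MethodX
TraitC ~~ 0*MethodY
TraitC ~~ 0*MethodZ
"""
```

Unexplained indicator variance is assigned to residual terms by default, so each observed score is decomposed into trait, method, and error components, as described above.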

Although the potential advantages of this approach are obvious, it has generally not proven feasible. As Kenny & Kashy (1992) noted, the approach often requires modeling more factors than there is information to identify, and the frequent result is a statistical failure to converge on a factor solution. That reality led some researchers to turn away from multivariate statistical methods for evaluating MTMM results. In recent years, however, two alternative CFA modeling approaches have been developed that appear to work well.

The first is referred to as the “correlated uniquenesses” approach (Marsh & Grayson 1995). In this approach, one does not model method factors as in the approach previously described. Instead, one identifies the presence of method variance by allowing the residual variances of trait indicators that share the same method to correlate, after accounting for trait variation and covariation. To the degree there are substantial correlations between these residual terms, method variance is considered present and is accounted for statistically (although other forms of reliable specificity may be represented in those correlations as well). As a result, the latent variables reflecting trait variation do not include that method variance: one can test the relation between method-free trait scores and other variables of interest. And, since this approach models only trait factors, it avoids the over-factoring problem of the earlier approach. There is, however, an important limitation to the correlated uniquenesses approach. Without a representation of method variance as a factor, one cannot examine the association of method variance with other constructs, which may be important to do (Cronbach 1995).
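In the same illustrative notation used above, a sketch of the correlated-uniquenesses specification follows (hypothetical indicator names; a sketch rather than a prescription).

```python
# Illustrative lavaan-style specification of the correlated-uniquenesses model:
# trait factors only; method variance is captured by correlated residuals
# among indicators that share an assessment method.
CU_MODEL = """
TraitA =~ a_x + a_y + a_z
TraitB =~ b_x + b_y + b_z
TraitC =~ c_x + c_y + c_z

# Correlated residuals among the method X indicators ...
a_x ~~ b_x
a_x ~~ c_x
b_x ~~ c_x
# ... the method Y indicators ...
a_y ~~ b_y
a_y ~~ c_y
b_y ~~ c_y
# ... and the method Z indicators.
a_z ~~ b_z
a_z ~~ c_z
b_z ~~ c_z
"""
```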

The second alternative approach provides a way to model some method variance while avoiding the over-factoring problem (Eid et al. 2003). One constructs latent variables to represent all trait factors and all but one method factor. Since there are fewer factors than in the original approach, the resulting solution is mathematically identified: one has not over-factored. The idea is that one method is chosen as the baseline method and is not represented by a latent variable. One evaluates other methods for how they influence results compared to the baseline method. Suppose, for example, that one had interview and collateral report data for a series of traits. One might specify the interview method as the baseline method, so an interview method factor is not modeled as separate from trait variance, and trait scores are really trait-as-measured-by-interview scores. One then models a method factor for collateral report. If the collateral report method leads to higher estimates of trait presence than does the interview, one would find that the collateral report method factor correlated positively with the trait-as-measured-by-interview. That would imply that collaterals report higher levels of the trait than do individuals during interviews.
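Continuing the interview and collateral-report example, a minimal sketch of this model (again with hypothetical indicator names: _int for interview, _col for collateral report) takes the interview as the unmodeled baseline method and adds a single method factor for collateral report.

```python
# Illustrative lavaan-style specification of the CTC(M-1) approach: the
# baseline method (interview) gets no method factor, so the trait factors are
# interpreted as trait-as-measured-by-interview; one method factor captures
# systematic deviations of collateral reports from that baseline.
CTCM1_MODEL = """
TraitA =~ a_int + a_col
TraitB =~ b_int + b_col
TraitC =~ c_int + c_col

# Single method factor for the non-baseline method (collateral report).
CollateralMethod =~ a_col + b_col + c_col
"""
```

Because fewer factors are estimated than in the fully crossed specification, the model avoids the identification problems noted above.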

Interestingly, one can assess whether this process works differently for different traits. Perhaps collaterals perceive higher levels of some traits than are reported in interviews (unflattering traits?) and lower levels of other traits than are reported in interviews (flattering traits?). This possibility can be examined empirically with this method. In this way, the Eid et al. (2003) approach makes it possible to identify the contribution of method to measure scores. The limitation of this method, of course, is that the choice of “baseline method” influences the results and may be arbitrary (Eid et al. 2003).

Most recently, Courvoisier et al. (2008) have combined this approach with latent state-trait analysis; the latter method allows one to estimate variance due to stable traits, occasion-specific states, and error (Steyer et al. 1999). The result is a single analytic method to estimate variance due to trait, method, state, and error. Among the possibilities offered by this approach is that one can investigate the degree to which method effects are stable or variable over time.
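In schematic form (our notation, not Courvoisier et al.'s), the observed score of person i on trait t, assessed with method m on occasion o, is decomposed additively,

$$Y_{itmo} = \lambda_{T}\,T_{it} + \lambda_{M}\,M_{im} + \lambda_{O}\,O_{io} + \varepsilon_{itmo},$$

so that, assuming the components are mutually uncorrelated, its variance separates into trait, method, occasion-specific (state), and error portions:

$$\operatorname{Var}(Y_{itmo}) = \lambda_{T}^{2}\operatorname{Var}(T_{it}) + \lambda_{M}^{2}\operatorname{Var}(M_{im}) + \lambda_{O}^{2}\operatorname{Var}(O_{io}) + \operatorname{Var}(\varepsilon_{itmo}).$$

The proportion of variance attributable to each component can then be compared across methods and occasions, for example to ask whether method effects are stable over time.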

We wish to emphasize three points concerning these advances in methods for the empirical evaluation of construct validity. First, the concern that MTMM data cannot be analyzed successfully with CFA/SEM approaches is no longer warranted: analytic tools are now available that have proven successful (Eid et al. 2003). Second, statistical tools are available that enable one to estimate quantitatively the multiple sources of variance that matter for the construct validation enterprise (Eid et al. 2003; Marsh & Grayson 1995). One need not guess at the degree to which method variance is present, common across traits, or stable over time: one can investigate these sources of variance directly. Third, these analytic techniques are increasingly accessible to researchers (see Kline 2005, for a useful introduction to SEM). Clinical researchers also have a validity concern beyond the successful demonstration of convergent and discriminant validity: success at the level of MTMM validity does not ensure that the measured traits have utility. Typically, one also needs to investigate whether the traits enhance prediction of some criterion of clinical importance.

To this end, clinical researchers can rely on a classic contribution by Hammond et al. (1986). They offered a creative, integrative analytic approach for combining the results of MTMM designs with the evaluation of differential prediction of external criteria. In the best tradition of applying basic science advances to practical prediction, their design integrated the convergent/discriminant validity perspective of Campbell & Fiske (1959) with Brunswik's (1952, 1956) emphasis on representative design in research, which in part concerned the need to conduct investigations that yield findings one can apply to practical problems. They presented the concept of a performance validity matrix, which adds criterion variables for each trait to the MTMM design. By adding clinical outcome variables to one's MTMM design, one can provide evidence of convergent validity, discriminant validity, and differential clinical prediction in a single study.

Such analyses are critical clinically, because this sophisticated treatment of validity is likely to improve the usefulness of measures for clinicians. For many measures, validation research that considers practical prediction improves measures’ “three Ps”: predicting important criteria, prescribing treatments, and understanding the processes underlying personality and psychopathology (Youngstrom 2008), thereby improving clinical assessment. Such practical efforts in assessment must rely on observed scores, confounded as they may be with method variance. Construct validity research provides the clinician with an appreciation of the many factors entering into an observed score and, thus, of the mix of construct-relevant variance, reliable construct-irrelevant variance, and method variance in any score (see Richters 1992).

Conclusion

The term “construct validation” refers to the process of simultaneously validating measures of psychological constructs and the theories of which the constructs are a part. The study of the construct validation process is ongoing. It rests on core principles identified 50 years ago (Campbell & Fiske 1959; Cronbach & Meehl 1955; Loevinger 1957; MacCorquodale & Meehl 1948); those principles remain central to theory testing today. It is also true that our understanding of construct validation has evolved over these 50 years.

In this chapter, we emphasized five ways in which this is true. First, advances in the philosophy of science have helped clarify the ongoing, indeterminate nature of the construct validation process. This quality of theory testing represents a strength of the scientific method, because it reflects the continuing process of critical evaluation of all aspects of theory and measurement. Second, theoreticians now emphasize the pursuit of informative theory tests, in order to avoid weak, ad hoc theory tests conducted in the absence of fully specified theories. Third, the need to validate clinical laboratory tasks, by investigating the degree to which responses on a task truly reflect the influence of the target construct of interest, is increasingly appreciated. Fourth, the lack of clarity that follows from the use of a single score to represent multidimensional constructs has been described; researchers are increasingly relying on unidimensional measures to improve the validity of their theory tests. And fifth, important advances in the means to evaluate validity evidence empirically have been described; researchers have important new statistical tools at their disposal.

In sum, there are exciting new developments in the study of how to validate theories and their accompanying measures. These advances promise important improvements in measure and theory validation. As researchers fully incorporate sound construct validation theory in their methods, the rate of progress in clinical psychology research will continue to increase.

Acknowledgments

We thank Lee Anna Clark and Eric Youngstrom for their most helpful comments and suggestions, and Jill White for her excellent work in preparing the ms. Portions of this work were supported by NIAAA grant 1 RO 1 AA 016166 to Gregory T. Smith.