=== Key concepts ===
Key concepts in classical test theory are [[Reliability (psychometric)|reliability]] and [[Test validity|validity]]. A reliable measure is one that measures a construct consistently across time, individuals, and situations. A valid measure is one that measures what it is intended to measure. Reliability is necessary, but not sufficient, for validity.

Both reliability and validity can be assessed statistically. Consistency over repeated measures of the same test can be assessed with the Pearson correlation coefficient, and is often called ''test-retest reliability.''<ref name="gifted.uconn">{{cite web|url=http://www.gifted.uconn.edu/Siegle/research/Instrument+Reliability+and+Validity/Reliability.htm|title=Home – Educational Research Basics by Del Siegle|website=www.gifted.uconn.edu|date=17 February 2015}}</ref> Similarly, the equivalence of different versions of the same measure can be indexed by a [[Pearson product-moment correlation coefficient|Pearson correlation]], and is called ''equivalent forms reliability'' or a similar term.<ref name="gifted.uconn"/>
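
As a minimal sketch of the test-retest case, the following Python snippet computes a Pearson correlation between two administrations of the same test. The function name <code>pearson_r</code> and the score lists are hypothetical and serve only as illustration; they are not drawn from the cited source.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each list
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five test-takers on two occasions
time_1 = [12, 15, 9, 20, 17]
time_2 = [13, 14, 10, 19, 18]

# A value close to 1 suggests that scores are stable across administrations
test_retest_r = pearson_r(time_1, time_2)
```

The same function could be applied to scores on two alternate forms of a test to index equivalent forms reliability.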

Internal consistency, which addresses the homogeneity of a single test form, may be assessed by correlating performance on two halves of a test, which is termed ''split-half reliability''; the value of this [[Pearson product-moment correlation coefficient|Pearson correlation]] between the two half-tests is adjusted with the [[Spearman–Brown prediction formula]] to correspond to the correlation between two full-length tests.<ref name="gifted.uconn"/> Perhaps the most commonly used index of reliability is [[Cronbach's α]], which is equivalent to the [[mean]] of all possible split-half coefficients. Other approaches include the [[intra-class correlation]], which is the proportion of the total variance in the measurements that is attributable to differences between the targets being measured.
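
The two internal-consistency indices can be sketched in Python as follows. This assumes the standard two-half form of the Spearman–Brown adjustment and the usual variance-based formula for Cronbach's α; the function names and the data layout (one row of item scores per respondent) are hypothetical choices made for illustration.

```python
def spearman_brown(r_half):
    """Step a half-test correlation up to an estimated full-length reliability."""
    return 2 * r_half / (1 + r_half)

def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    # Variance of each item across respondents
    item_variances = [
        sample_variance([row[j] for row in item_scores]) for j in range(k)
    ]
    # Variance of the total (summed) score across respondents
    total_variance = sample_variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# A half-test correlation of 0.6 corresponds to an adjusted value of about 0.75
adjusted = spearman_brown(0.6)

# Hypothetical responses: four respondents, three items each
responses = [[3, 4, 3], [2, 2, 1], [5, 4, 5], [1, 2, 2]]
alpha = cronbach_alpha(responses)
```

Because the adjustment raises any positive half-test correlation, the reported split-half reliability is higher than the raw correlation between the two halves.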

There are a number of different forms of validity. [[Criterion validity|Criterion-related validity]] refers to the extent to which a test or scale predicts a sample of behavior, i.e., the criterion, that is "external to the measuring instrument itself."<ref>Nunnally, J.C. (1978). ''Psychometric theory'' (2nd ed.). New York: McGraw-Hill.</ref> That external sample of behavior can be many things, including another test; college grade point average, as when the high school SAT is used to predict performance in college; and even behavior that occurred in the past, for example, when a test of current psychological symptoms is used to predict the occurrence of past victimization (which is more accurately termed postdiction). When the criterion measure is collected at the same time as the measure being validated, the goal is to establish ''[[concurrent validity]]''; when the criterion is collected later, the goal is to establish ''[[predictive validity]]''. A measure has ''[[construct validity]]'' if it is related to measures of other constructs as required by theory. ''[[Content validity]]'' is a demonstration that the items of a test do an adequate job of covering the domain being measured. In personnel selection, for example, test content is based on a defined statement or set of statements of the knowledge, skills, abilities, or other characteristics obtained from a ''[[job analysis]]''.

[[Item response theory]] (IRT) models the relationship between [[latent trait]]s and responses to test items. Among other advantages, IRT provides a basis for obtaining an estimate of the location of a test-taker on a given latent trait as well as the standard error of measurement of that location. For example, a university student's knowledge of history can be estimated from their score on a university test and then be compared reliably with a high school student's knowledge estimated from a less difficult test. Scores derived by classical test theory do not have this characteristic, and actual ability (rather than ability relative to other test-takers) must be assessed by comparing scores to those of a "norm group" randomly selected from the population. In fact, all measures derived from classical test theory are dependent on the sample tested, while, in principle, those derived from item response theory are not.
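
The idea of locating a test-taker on a latent trait, together with a standard error, can be illustrated with a minimal Python sketch. It assumes the one-parameter (Rasch) model, which is only one of several IRT models, and item difficulties that are already known from a prior calibration; the function names, difficulty values, and response patterns are hypothetical.

```python
from math import exp, sqrt

def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + exp(-(theta - difficulty)))

def estimate_ability(responses, difficulties, iterations=25):
    """Maximum-likelihood ability estimate by Newton-Raphson.

    responses: list of 0/1 item scores (must contain both a 0 and a 1,
    otherwise the estimate is not finite).
    Returns (theta, standard_error).
    """
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        # First derivative of the log-likelihood with respect to theta
        gradient = sum(u - p for u, p in zip(responses, probs))
        # Test information (negative second derivative)
        information = sum(p * (1 - p) for p in probs)
        theta += gradient / information
    probs = [rasch_probability(theta, b) for b in difficulties]
    information = sum(p * (1 - p) for p in probs)
    return theta, 1.0 / sqrt(information)

# Hypothetical calibrated difficulties for an easier and a harder test
easy_items = [-2.0, -1.0, 0.0]
hard_items = [0.0, 1.0, 2.0]

# Both test-takers answer the first two items correctly and miss the third
theta_easy, se_easy = estimate_ability([1, 1, 0], easy_items)
theta_hard, se_hard = estimate_ability([1, 1, 0], hard_items)
```

In this sketch the same raw score yields a higher ability estimate on the harder test, because both estimates are expressed on the common scale defined by the item difficulties rather than relative to a norm group.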