
Reliability and validity are two technical properties that indicate the quality and usefulness of a test. They are the two most important features to examine when evaluating whether a test is suitable for your use. This chapter provides a simplified explanation of these two complex ideas. These explanations will help you understand the reliability and validity information reported in test manuals and reviews, and use that information to evaluate the suitability of a test for your use.
Use only reliable assessment instruments and procedures. Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used. Use assessment tools that are appropriate for the target population.
An employment test is considered "good" if the following can be said about it:
The degree to which a test has these qualities is indicated by two technical properties: reliability and validity.
Reliability refers to how dependably or consistently a test measures a characteristic. If a person takes the test again, will he or she get a similar test score, or a much different score? A test that yields similar scores for a person who repeats the test is said to measure a characteristic reliably.
How do we account for an individual who does not get exactly the same test score every time he or she takes the test? Some possible reasons are the following:
These factors are sources of chance or random measurement error in the assessment process. If there were no random errors of measurement, the individual would get the same test score, the individual's "true" score, each time. The degree to which test scores are unaffected by measurement errors is an indication of the reliability of the test.
Reliable assessment tools produce dependable, repeatable, and consistent information about people. In order to meaningfully interpret test scores and make useful employment or career-related decisions, you need reliable tools. This brings us to the next principle of assessment.
Use only reliable assessment instruments and procedures. In other words, use only assessment tools that provide dependable and consistent information.
Test manuals and independent review of tests provide information on test reliability. The following discussion will help you interpret the reliability information about any test.
The reliability of a test is indicated by the reliability coefficient. It is denoted by the letter "r," and is expressed as a number ranging between 0 and 1.00, with r = 0 indicating no reliability, and r = 1.00 indicating perfect reliability. Do not expect to find a test with perfect reliability. Generally, you will see the reliability of a test as a decimal, for example, r = .80 or r = .93. The larger the reliability coefficient, the more repeatable or reliable the test scores. Table 1 serves as a general guideline for interpreting test reliability. However, do not select or reject a test solely based on the size of its reliability coefficient. To evaluate a test's reliability, you should consider the type of test, the type of reliability estimate reported, and the context in which the test will be used.
Table 1. General Guidelines for Interpreting Reliability Coefficients
Reliability coefficient value | Interpretation |
.90 and up | excellent |
.80 - .89 | good |
.70 - .79 | adequate |
below .70 | may have limited applicability |
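The Table 1 guidelines can be written as a small lookup. This is only a sketch: the function name `interpret_reliability` is hypothetical, and the label it returns is a starting point for judgment, not a basis for selecting or rejecting a test on its own.

```python
def interpret_reliability(r: float) -> str:
    """Return the Table 1 guideline label for a reliability coefficient.

    Hypothetical helper for illustration; the cut-offs are those of Table 1.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("a reliability coefficient must lie between 0 and 1.00")
    if r >= 0.90:
        return "excellent"
    if r >= 0.80:
        return "good"
    if r >= 0.70:
        return "adequate"
    return "may have limited applicability"


# The two coefficients used as examples in the text above.
for coefficient in (0.80, 0.93):
    print(coefficient, "->", interpret_reliability(coefficient))
```

The type of test and the type of reliability estimate still need to be weighed alongside the label.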
There are several types of reliability estimates, each influenced by different sources of measurement error. Test developers have the responsibility of reporting the reliability estimates that are relevant for a particular test. Before deciding to use a test, read the test manual and any independent reviews to determine if its reliability is acceptable. The acceptable level of reliability will differ depending on the type of test and the reliability estimate used.
The discussion in Table 2 should help you develop some familiarity with the different kinds of reliability estimates reported in test manuals and reviews.
Table 2. Types of Reliability Estimates |
# Test-retest reliability indicates the repeatability of test scores with the passage of time. This estimate also reflects the stability of the characteristic or construct being measured by the test. Some constructs are more stable than others. For example, an individual's reading ability is more stable over a particular period of time than that individual's anxiety level. Therefore, you would expect a higher test-retest reliability coefficient on a reading test than you would on a test that measures anxiety. For constructs that are expected to vary over time, an acceptable test-retest reliability coefficient may be lower than is suggested in Table 1. |
# Alternate or parallel form reliability indicates how consistent test scores are likely to be if a person takes two or more forms of a test. A high parallel form reliability coefficient indicates that the different forms of the test are very similar which means that it makes virtually no difference which version of the test a person takes. On the other hand, a low parallel form reliability coefficient suggests that the different forms are probably not comparable; they may be measuring different things and therefore cannot be used interchangeably. |
# Inter-rater reliability indicates how consistent test scores are likely to be if the test is scored by two or more raters. On some tests, raters evaluate responses to questions and determine the score. Differences in judgments among raters are likely to produce variations in test scores. A high inter-rater reliability coefficient indicates that the judgment process is stable and the resulting scores are reliable. Inter-rater reliability coefficients are typically lower than other types of reliability estimates. However, it is possible to obtain higher levels of inter-rater reliabilities if raters are appropriately trained. |
# Internal consistency reliability indicates the extent to which items on a test measure the same thing. A high internal consistency reliability coefficient for a test indicates that the items on the test are very similar to each other in content (homogeneous). It is important to note that the length of a test can affect internal consistency reliability. For example, a very lengthy test can spuriously inflate the reliability coefficient. Tests that measure multiple characteristics are usually divided into distinct components. Manuals for such tests typically report a separate internal consistency reliability coefficient for each component in addition to one for the whole test. Test manuals and reviews report several kinds of internal consistency reliability estimates. Each type of estimate is appropriate under certain circumstances. The test manual should explain why a particular estimate is reported. |
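To make two of the estimates in Table 2 concrete, the sketch below computes a test-retest coefficient as the Pearson correlation between two administrations of the same test, and an internal consistency coefficient using Cronbach's alpha. The text above does not name a specific internal consistency estimate, so the choice of alpha is an assumption, as are the function names and the data.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / (sx * sy)


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores holds one list of item scores per person."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    columns = list(zip(*item_scores))  # one tuple of scores per item
    item_variance_sum = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: five people tested twice, several weeks apart.
first_administration = [78, 85, 62, 90, 71]
second_administration = [80, 83, 65, 92, 70]
print(f"test-retest r = {pearson_r(first_administration, second_administration):.2f}")

# Hypothetical data: four people answering a three-item scale.
responses = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3]]
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```

The same `pearson_r` function could be applied to scores from two alternate forms, or to scores assigned by two raters, to estimate the other coefficients described in Table 2.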
Test manuals report a statistic called the standard error of measurement (SEM). It gives the margin of error that you should expect in an individual test score because of imperfect reliability of the test. The SEM represents the degree of confidence that a person's "true" score lies within a particular range of scores. For example, an SEM of "2" indicates that a test taker's "true" score probably lies within 2 points in either direction of the score he or she receives on the test. This means that if an individual receives a 91 on the test, there is a good chance that the person's "true" score lies somewhere between 89 and 93.
The SEM is a useful measure of the accuracy of individual test scores. The smaller the SEM, the more accurate the measurements.
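Test manuals usually report the SEM directly, but it can also be derived from the reliability coefficient. The sketch below uses the standard classical test theory formula, SEM = SD × sqrt(1 − r), which is not stated in the text above; the score standard deviation of 10 and the reliability of .96 are hypothetical values chosen to reproduce the SEM of 2 used in the example.

```python
from math import sqrt


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), the classical test theory formula."""
    if score_sd < 0:
        raise ValueError("the standard deviation cannot be negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("a reliability coefficient must lie between 0 and 1.00")
    return score_sd * sqrt(1.0 - reliability)


def score_band(observed_score: float, sem: float, n_sems: float = 1.0):
    """Range within n_sems standard errors on either side of the observed score."""
    return (observed_score - n_sems * sem, observed_score + n_sems * sem)


# Hypothetical test: scores have SD = 10 and the manual reports r = .96.
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.96)
low, high = score_band(91, round(sem, 2))
print(f"SEM = {sem:.2f}; true score probably between {low:.0f} and {high:.0f}")
# prints: SEM = 2.00; true score probably between 89 and 93
```

If measurement errors are assumed to be normally distributed, a band of one SEM on either side corresponds to roughly 68 percent confidence; passing `n_sems=2.0` widens the band to roughly 95 percent.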
When evaluating the reliability coefficients of a test, it is important to review the explanations provided in the manual for the following:
For more information on reliability, consult the APA Standards, the SIOP Principles, or any major textbook on psychometrics or employment testing. Appendix A lists some possible sources.
Validity is the most important issue in selecting a test. Validity refers to what characteristic the test measures and how well the test measures that characteristic.
It is important to understand the differences between reliability and validity. Validity will tell you how good a test is for a particular situation; reliability will tell you how trustworthy a score on that test will be. You cannot draw valid conclusions from a test score unless you are sure that the test is reliable. Even when a test is reliable, it may not be valid. You should be careful that any test you select is both reliable and valid for your situation.
A test's validity is established in reference to a specific purpose; the test may not be valid for different purposes. For example, the test you use to make valid predictions about someone's technical proficiency on the job may not be valid for predicting his or her leadership skills or absenteeism rate. This leads to the next principle of assessment.
Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used.
Similarly, a test's validity is established in reference to specific groups. These groups are called the reference groups. The test may not be valid for different groups. For example, a test designed to predict the performance of managers in situations requiring problem solving may not allow you to make valid or meaningful predictions about the performance of clerical employees. If, for example, the kind of problem-solving ability required for the two positions is different, or the reading level of the test is not suitable for clerical applicants, the test results may be valid for managers, but not for clerical employees.
Test developers have the responsibility of describing the reference groups used to develop the test. The manual should describe the groups for whom the test is valid, and the interpretation of scores for individuals belonging to each of these groups. You must determine if the test can be used appropriately with the particular type of people you want to test. This group of people is called your target population or target group.
Use assessment tools that are appropriate for the target population.
Your target group and the reference group do not have to match on all factors; they must be sufficiently similar so that the test will yield meaningful scores for your group. For example, a writing ability test developed for use with college seniors may be appropriate for measuring the writing ability of white-collar professionals or managers, even though these groups do not have identical characteristics. In determining the appropriateness of a test for your target groups, consider factors such as occupation, reading level, cultural differences, and language barriers.
Recall that the Uniform Guidelines require assessment tools to have adequate supporting evidence for the conclusions you reach with them in the event adverse impact occurs. A valid personnel tool is one that measures an important characteristic of the job you are interested in. Use of valid tools will, on average, enable you to make better employment-related decisions. Both from business-efficiency and legal viewpoints, it is essential to only use tests that are valid for your intended use.
In order to be certain an employment test is useful and valid, evidence must be collected relating the test to a job. The process of establishing the job relatedness of a test is called validation.
The Uniform Guidelines discuss the following three methods of conducting validation studies. The Guidelines describe conditions under which each type of validation strategy is appropriate. They do not express a preference for any one strategy to demonstrate the job-relatedness of a test.
3. Current thinking in psychology is that construct validity encompasses all other forms of validity; validation is the cumulative and on-going process of giving meaning to test scores.
The three methods of validity (criterion-related, content, and construct) should be used to provide validation support depending on the situation. These three general methods often overlap, and, depending on the situation, one or more may be appropriate. French (1990) offers situational examples of when each method of validity may be applied.
First, as an example of criterion-related validity, take the position of millwright. Employees' scores (predictors) on a test designed to measure mechanical skill could be correlated with their performance in servicing machines (criterion) in the mill. If the correlation is high, it can be said that the test has a high degree of validation support, and its use as a selection tool would be appropriate.
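The millwright example can be sketched as a correlation between predictor and criterion scores. The data below are invented for illustration, and the function name `validity_coefficient` is hypothetical. A real validation study would need far more employees and professional guidance.

```python
from statistics import mean, pstdev


def validity_coefficient(predictor, criterion):
    """Pearson correlation, computed as the mean product of z-scores."""
    if len(predictor) != len(criterion) or len(predictor) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    sd_p, sd_c = pstdev(predictor), pstdev(criterion)
    if sd_p == 0 or sd_c == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    mean_p, mean_c = mean(predictor), mean(criterion)
    return mean(
        ((p - mean_p) / sd_p) * ((c - mean_c) / sd_c)
        for p, c in zip(predictor, criterion)
    )


# Hypothetical data for six millwrights:
# mechanical skill test scores (predictor) and supervisor ratings of
# machine-servicing performance on a 1-5 scale (criterion).
test_scores = [55, 60, 65, 70, 75, 80]
performance_ratings = [2.0, 2.5, 3.0, 3.0, 4.0, 4.5]

r = validity_coefficient(test_scores, performance_ratings)
print(f"validity coefficient r = {r:.2f}")
```

A sample this small and this tidy yields a far higher correlation than would be expected in practice; the point is only to show which two sets of numbers are being related.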
Second, the content validation method may be used when you want to determine if there is a relationship between behaviors measured by a test and behaviors involved in the job. For example, a typing test would have high validation support for a secretarial position, assuming much typing is required each day. If, however, the job required only minimal typing, then the same test would have little content validity. Content validity does not apply to tests measuring learning ability or general problem-solving skills (French, 1990).
Finally, the third method is construct validity. This method often pertains to tests that may measure abstract traits of an applicant. For example, construct validity may be used when a bank desires to test its applicants for "numerical aptitude." In this case, an aptitude is not an observable behavior, but a concept created to explain possible future behaviors. To demonstrate that the test possesses construct validation support, ". . . the bank would need to show (1) that the test did indeed measure the desired trait and (2) that this trait corresponded to success on the job" (French, 1990, p. 260).
Professionally developed tests should come with reports on validity evidence, including detailed explanations of how validation studies were conducted. If you develop your own tests or procedures, you will need to conduct your own validation studies. As the test user, you have the ultimate responsibility for making sure that validity evidence exists for the conclusions you reach using the tests. This applies to all tests and procedures you use, whether they have been bought off-the-shelf, developed externally, or developed in-house.
Validity evidence is especially critical for tests that have adverse impact. When a test has adverse impact, the Uniform Guidelines require that validity evidence for that specific employment decision be provided.
The particular job for which a test is selected should be very similar to the job for which the test was originally developed. Determining the degree of similarity will require a job analysis. Job analysis is a systematic process used to identify the tasks, duties, responsibilities and working conditions associated with a job and the knowledge, skills, abilities, and other characteristics required to perform that job.
Job analysis information may be gathered by direct observation of people currently in the job, interviews with experienced supervisors and job incumbents, questionnaires, personnel and equipment records, and work manuals. In order to meet the requirements of the Uniform Guidelines, it is advisable that the job analysis be conducted by a qualified professional, for example, an industrial and organizational psychologist or other professional well trained in job analysis techniques. Job analysis information is central in deciding what to test for and which tests to use.
Conducting your own validation study is expensive, and, in many cases, you may not have enough employees in a relevant job category to make it feasible to conduct a study. Therefore, you may find it advantageous to use professionally developed assessment tools and procedures for which documentation on validity already exists. However, care must be taken to make sure that validity evidence obtained for an "outside" test study can be suitably "transported" to your particular situation.
The Uniform Guidelines, the Standards, and the SIOP Principles state that evidence of transportability is required. Consider the following when using outside tests:
To ensure that the outside test you purchase or obtain meets professional and legal standards, you should consult with testing professionals. See Chapter 5 for information on locating consultants.
To determine if a particular test is valid for your intended use, consult the test manual and available independent reviews. (Chapter 5 offers sources for test reviews.) The information below can help you interpret the validity evidence reported in these publications.
Table 3. General Guidelines for Interpreting Validity Coefficients
Validity coefficient value | Interpretation |
above .35 | very beneficial |
.21 - .35 | likely to be useful |
.11 - .20 | depends on circumstances |
below .11 | unlikely to be useful |
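As with reliability, the Table 3 guidelines can be expressed as a simple lookup. This is a sketch under two assumptions: the coefficient is reported as a value between 0 and 1, and it is rounded to two decimals to match the precision of the table. The function name `interpret_validity` is hypothetical.

```python
def interpret_validity(r: float) -> str:
    """Return the Table 3 guideline label for a validity coefficient.

    Assumes the coefficient is reported as a value between 0 and 1 and
    rounds it to two decimals, the precision used in Table 3.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("expected a validity coefficient between 0 and 1.00")
    r = round(r, 2)
    if r > 0.35:
        return "very beneficial"
    if r >= 0.21:
        return "likely to be useful"
    if r >= 0.11:
        return "depends on circumstances"
    return "unlikely to be useful"


for coefficient in (0.40, 0.30, 0.15, 0.05):
    print(coefficient, "->", interpret_validity(coefficient))
```

The label is a general guideline only and does not replace a review of the validity evidence in the test manual.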
Here are three scenarios illustrating why you should consider these factors, individually and in combination with one another, when evaluating validity coefficients: