Do You Pass the Test?

It’s a funny thing, but everyone and their brother seems to think that developing a test is easy. Just put a few people in a room, brainstorm questions, do a quick review, and away you go! Wrong. If that’s the kind of test your organization uses, 1) pick it up using tongs, 2) walk to the nearest waste can, 3) drop it in, and 4) never look back. Sound a little drastic? Not if you want to be considered a “professional.” Guidelines for developing a good test are outlined in the Standards for Educational and Psychological Testing (www.apa.org/science/standards.html). They aren’t hard to follow, but that does not keep people from completely ignoring them. Here is a basic list of some important facts you need to know about testing:

Be sure the test is grounded in a “legitimate theory of job performance.” That means that high scores equal high performance, and low scores equal low performance. It is unacceptable to use a test just because you like it, or because it “described your personality.” Bogus! Hiring tests are supposed to predict job performance, not social style.

The world is filled with pseudo-psychologists who think they can use training room tests or mental illness tests to predict job performance. These folks suffer from a mental disease called “unconscious incompetence”; that is, they don’t know what they don’t know. It’s a very dangerous place to be. Attending a training class in DISC or MB may be fun, but it is much more important to know that these tests were designed not to predict future performance, but just to illustrate early personality theory. Unconsciously incompetent folks do an injustice to themselves, their employer, and applicants. Why? Because although DISC or MB test-content might seem intuitively attractive, you cannot use it to hire people until you know scores predict job performance. This takes a formal study, not someone’s “gut” opinion.
Reputable test vendors always publish a technical manual that includes words like, “XYZ has been found to be strongly associated with job performance in TUV jobs.” If these words are missing, or if the vendor suggests, “This test can be helpful in conducting an interview,” run away! This vendor probably knows less than you do. Tests have to be tested and proven before they can be called hiring tools.
You say you’re not “really” using the test results? Sometimes I hear, “We don’t really use the test results; we just use the test to guide our hiring decisions.” Come on. Do you think anyone over five years old really believes this statement? If you don’t use the scores, then why use the test?
All tests must have integrity. A good test will go through several iterations to ensure all the items and answers have equal difficulty. I once reviewed a test where some questions were answered correctly 90% of the time and some were answered incorrectly 90% of the time. Were those good test items? Nope. How do you find this out? You test, examine, retest, and repeat until you end up with good items. The vast majority of technical tests I see on the Internet seem to skip this “little” step.
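
For readers who want to see what that examination looks like, here is a minimal sketch of an item-difficulty check. The response data, the function names, and the 0.2 and 0.8 cutoffs are all made up for illustration; a real review would use far more test takers and cutoffs chosen for the test at hand.

```python
# Hypothetical data: one row per test taker, one column per item
# (1 = answered correctly, 0 = answered incorrectly).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]


def item_difficulty(rows):
    """Return the proportion of test takers who answered each item correctly."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def flag_items(difficulties, low=0.2, high=0.8):
    """Return indexes of items nearly everyone misses or nearly everyone gets right.

    The cutoffs are illustrative assumptions, not a published standard.
    """
    return [i for i, p in enumerate(difficulties) if p < low or p > high]


difficulties = item_difficulty(responses)
suspect_items = flag_items(difficulties)
print(difficulties)
print(suspect_items)
```

In this toy data the first item is answered correctly 90% of the time and the third only 10% of the time, so both are flagged for rewriting or removal.
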
Integrity also means individual test items will have a relationship with overall test scores. This is tricky to explain, but I’ll use an example. Suppose you wanted to measure creativity. You initially believe creative-related items include making suggestions, crazy ideas, being dissatisfied with the status quo, demonstrating eccentricity, liking science fiction novels, and daydreaming. So far we have your opinion, but how do we know whether these items actually relate to creativity? It takes work. We have to ask hundreds of people to take our test and statistically analyze which items overlap. If scores on the first four items moved up and down together while scores on the last two items moved independently, we must assume the last two items had nothing to do with creativity and eliminate them from the scale. When the items are stable, we should also check scores against a trusted measure of creativity. What if you don’t do this? Be prepared to make big mistakes.
Tests that ask a few questions and then deliver pages and pages of “deep” analysis are frequently flawed. If you give this kind of test to a few dozen people, you will soon discover most of the comments are standard boilerplate. If human behavior could be described by just checking off a list of adjectives, psychology would have dried up years ago. It takes five to seven interrelated items to make one good test scale, and that scale only describes a “self-reported tendency.” Twenty test items would only predict three or four factors. Only bad tests make broad assumptions using a little bit of data. Psychological tests are not like blood tests. Blood tests indicate chemical conditions directly linked to physical conditions. Psychological tests indicate trends and patterns that probably lead to behaviors.
Another measure of reliability is whether the same test yields about the same results with the same person over time. This is called test-retest reliability. You obtain test-retest reliability by giving the test to a group of people, waiting a year (and hoping they forget the test items), then giving it again to the same people and comparing the first-round scores to the second. This is the only way we know a test measures the same thing over time. Incidentally, one of the most popular “acronym tests” on the market has incredibly poor test-retest results. Is your “real” personality type the one you had on Monday, or the one you had on Wednesday?
Is it valid? Validity means test scores actually predict something useful. High test scores predict high performance and low scores predict low performance. Pretty basic? Only if you know how to recognize the “vendor shuffle.” Sometimes a vendor will divide a population into a high and low performance group. They will give everyone their test, average the scores of each group and proclaim, “Yea, verily, there is a difference! Our test withstandeth the ordeal!” Sorry, this is not validation. Done right, this is a statistical procedure for testing two groups for differences. The whole idea behind validation is to determine how test scores track individual performance along a continuum, not match a group average. To use an example, divide a work group into an equal number of males and females. Now, measure their feet. On the average, the male group would have bigger feet. Therefore, if big feet are our “target,” any applicant with a big foot would belong to the male group. Likewise, any applicant with a small foot would belong to the female group. See the problem?
Don’t buy until you read. Reputable test-catalog publishers don’t let just anyone buy their tests. They require users to pre-qualify. “A” users are either people with graduate degrees in the field or certifications, or practicing psychologists. “B” users are HR people who have attended certification workshops. “C” users are people with undergraduate degrees in the field. Reputable publishers try to minimize test misuse by at least establishing minimum technical qualifications for buyers.
Tests for selection and tests for training should not be mixed. Because the applications are so different, I am not aware of any test that can be used for both hiring and training. Be extremely wary of any vendor who suggests his or her training test can also be used to help clarify hiring decisions.
“Our test is validated.” Oh yeah? Have you tested it with your jobs? Do you have proof that test scores predict performance? If the test was validated for another job, have you done a job analysis to ensure your job requires equivalent skills? If you cannot answer yes to every one of these questions, you have been deceived. The test is not valid.
You need to know basic statistics. Validity and reliability are reported in weird terms like “alphas,” “correlations,” “discriminate,” “variability,” and the like. If you don’t know what these terms mean, there is no way to determine whether or not the test is any good.
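
As one example of those terms, an “alpha” usually refers to Cronbach's alpha, a summary of how consistently the items in a scale move together. The sketch below is a minimal illustration on hypothetical 1-to-5 ratings for a four-item scale; the data and function name are assumptions, and the formula is the standard one, k/(k-1) times one minus the ratio of summed item variances to total-score variance.

```python
from statistics import variance

# Hypothetical 1-5 ratings: one row per respondent, one column per item.
scale_responses = [
    [1, 2, 1, 1],
    [2, 1, 2, 2],
    [2, 3, 2, 3],
    [3, 3, 4, 3],
    [4, 3, 3, 4],
    [4, 5, 4, 4],
    [5, 4, 5, 5],
    [5, 5, 5, 4],
]


def cronbach_alpha(rows):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


alpha = cronbach_alpha(scale_responses)
print(round(alpha, 2))
```

Values closer to 1.0 mean the items hang together. What counts as “good enough” depends on how the scale will be used, which is exactly why you need to be able to read the technical manual.
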
Don’t forget that any method of separating “qualified” from “unqualified” people is a test. So, unless you plan to hire everyone who applies, that makes interviews, resumes, and applications a form of test. Even if you don’t think you use tests, if you make hiring decisions based on interviews or screen resumes and applications, you’re using tests. So why not do it right?
Demographic filters can be a problem. Does the government require you to hire unqualified people? How about unqualified minorities? Unqualified protected groups? No. So far, they don’t force anyone to hire an applicant who cannot do the job. The problem arises when you look at applicants through a demographic filter. That’s when you often see group-level differences. What to do? Well, you could change the job requirements. Or you could look for other methods that were equally effective but had less adverse impact. That’s what the government would like you to do. And that’s not so bad, is it?
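
One common way to spot those group-level differences is to compare selection rates. The sketch below uses the four-fifths rule of thumb from the federal Uniform Guidelines on Employee Selection Procedures, under which a group's selection rate below 80% of the highest group's rate is generally treated as a sign of adverse impact. The applicant counts, group labels, and function names are hypothetical, and this is a screening calculation only, not legal advice.

```python
# Hypothetical applicant flow: group -> (number hired, number who applied).
applicant_flow = {
    "group_a": (48, 80),
    "group_b": (12, 40),
}

FOUR_FIFTHS = 0.8  # rule-of-thumb threshold from the Uniform Guidelines


def impact_ratios(flow):
    """Each group's selection rate divided by the highest group's selection rate."""
    rates = {group: hired / applied for group, (hired, applied) in flow.items()}
    top_rate = max(rates.values())
    return {group: rate / top_rate for group, rate in rates.items()}


ratios = impact_ratios(applicant_flow)
flagged = [group for group, ratio in ratios.items() if ratio < FOUR_FIFTHS]
print(ratios)
print(flagged)
```

Here group_b is selected at half the rate of group_a, so it is flagged. A flag like this is the cue to look for an equally effective method with less adverse impact.
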

And finally, remember one of the most dangerous myths of all: “The vendor is responsible for test use.” Ha! In your dreams. The test user is always on the hook, and ultimately will be held accountable if discrimination or other problems can be proven.