“I’m in the Mood for Validity” or “A Rose by Any Other Name”

If I have heard it once, I have heard it a thousand times: “My test is validated.” “Oh, against what?” “You know, against performance.” “Show me your study.” “What study?” “The one that shows that test scores have a direct relationship with job performance. The one that shows that as scores increase, performance increases, and vice versa.” “Oh, that study… We don’t have one. We discussed the test with managers. They looked at the test and thought it was a good idea to use for selection. Besides, the test vendor said it was valid.” “Against what? No, never mind. Let me see if I heard this right. You discussed the test with managers. They thought it was a good idea, but you never did a scientific study to determine if a given score actually predicted performance?” “That’s correct. You make it sound so negative!”

Almost every day I read or hear someone talk about validation. Vendors say their test is valid, test givers talk about validation, and consultants say their tests are valid. This all sounds good on the surface, but just what is this “validation” thing anyway? How can you use it to your advantage as a recruiter? More importantly, how can you keep from being “snookered” by candidate and test vendor alike?

Validity simply means (and this is as simple as we get) that test scores have a legitimate relationship with some aspect of job performance. By “test” I mean any procedure that separates the “more” qualified from the “less” qualified. (By the way, I want to warn you that I’m about to compress a 16-week course in psychometrics into around 1000 words, so I’ll probably leave a “few” things out.)

There are four general types of validity: face, construct, content, and criterion. Let’s review these types one at a time. We’ll begin with the easiest one.

Face validity simply means that the test looks legitimate (i.e., my face is more valid than your face).
That is, the questions or items on the test include words or items that a person would expect to encounter on the job. The test does not ask invasive questions about your personal life, nor does it probe deeply into how you feel about your mother. There is a lot of research to suggest that using a test without face validity is like leaning out your office window and waving a flag that says, “Please sue me, I’ve got money to spare!” Tests that are not face valid are easy to spot. They have a lot of items that prompt you to ask yourself, “What the $%# does that item have to do with working here?”

The next type of validity is construct. This is one of three types that are cited by the EEOC in its Uniform Guidelines on Employee Selection Procedures. Constructs are deep-seated drivers like intelligence, dominance, extraversion, need for achievement, or lust for chocolate. Most tests used for training are construct-based. Constructs are fun for Ph.D.s to discuss because they give us an opportunity to use big words and feel like deep thinkers, but constructs are very difficult to link with job requirements and job performance. In addition, although constructs like intelligence have a strong correlation with job performance, intelligence tests have serious adverse impact. Consider this. Research shows that highly effective people are usually very intelligent (unless Dad owns the business); however, not all people who are intelligent are highly effective. Clear? The EEOC does not like construct validity.

The third type of validity is content. Content validity means the content of the test looks like the content of the job. A typing test is content valid when the job involves typing letters. The EEOC is generally OK with content validity as long as the “content” would take some time to learn.
A test that asks a prospective salesperson how to sell might be content valid, as might a test that measures some kind of arcane IT programming language or a test for a med-tech that measures medical knowledge. The snag with content-valid tests is that you are still burdened with the nasty task of setting cutoff scores. We’re pretty sure a person scoring 100 will be better than a person scoring 0, but what about the person who scores 80 compared to a person scoring 60? Will there really be a 20-point difference in performance? Is 60 good enough for the job anyway? Will it have adverse impact? Setting cutoff points can be ugly!

The last type of validity is criterion. Criterion validity means scores on the test track a measurable standard of performance, the criterion. A runner who clocks a 4-minute mile has demonstrated a performance criterion of 4 minutes. This is easy for work with clearly measurable outputs, such as sales, production, or distance running, but not so clean for white-collar or knowledge workers. How do we measure the performance of these folks? Do we use multi-rater feedback (also known as “popularity polls,” “I’ll scratch your back if you scratch mine,” or “I’ll get you this time, you %^$#”)? Maybe we should ask supervisors. Supervisors aren’t very biased, are they? Yeah, right! Ever hear of the halo effect? Give me a break! Gathering accurate performance criteria is very challenging and often filled with plenty of error.

So, the next time you want to test whether a test is valid, ask:

Just what kind of validity are you speaking about?

Can you show me where scores on your test actually correlate with performance on jobs?
Are the jobs you validated almost identical to mine?
Do your scores fall along a pretty bell-shaped curve?
If you did measure performance, how did you gather the data and can I trust the accuracy?
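
For readers who want to see what “scores correlate with performance” looks like in practice, here is a minimal sketch in Python. It computes a Pearson correlation between test scores and a performance measure, which is one common way a vendor’s study might express that relationship. The function name `pearson_r` and every number in the sample data are made up for illustration; they are assumptions, not results from any real test or study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from each mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical, invented data: one test score and one performance
# measure (say, a sales figure) for each of eight employees.
test_scores = [55, 60, 65, 70, 75, 80, 85, 90]
performance = [48, 52, 61, 58, 70, 69, 80, 84]

r = pearson_r(test_scores, performance)
print(f"correlation between score and performance: r = {r:.2f}")
```

A value near 1 means higher scores go with higher performance, a value near 0 means the score tells you little, and the number is only as trustworthy as the performance data behind it, which is exactly why the last question on the list matters.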

As a final note, if you are a trainer or consultant, constructs are fun to discuss, but if you are a recruiter, you need to cut through the psychobabble nonsense and assure yourself that the score on any test you use has a legitimate and documented relationship with job performance.