Bad Tests and Fake Bird Seed

An old Gary Larson cartoon once showed a kindly old lady hand-feeding birds in her back yard. Off to the side was a sack labeled with words that read something like: “Fake birdseed. Great fun! Birds just can’t figure it out!”

Fake bird seed represents many vendors’ test claims … and, what users don’t know about birdseed and test validity can cost them a fortune. Test validity does not mean people like the test; or, the test has zero adverse impact; or, the EEOC approves; or, the test looks sexy. Validity means test scores consistently predict some specific aspect of job performance. For example, if high scores predict more mistakes, then low scores should predict fewer. Validity predicts on-the-job performance … both ways.
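One way to picture “both ways” is a correlation computed on individual people, not on group averages. The sketch below is only an illustration: the scores and mistake counts are invented for this example (they come from no real test or study), and `pearson_r` is a hypothetical helper name. A test that tracks mistakes in both directions produces a coefficient near 1; a test that tracks nothing produces a coefficient near 0.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# ASSUMPTION: invented numbers, one entry per employee, for illustration only.
test_scores = [10, 20, 30, 40, 50, 60]

# Case 1: high scorers make more mistakes and low scorers make fewer.
mistakes_tracking = [2, 4, 3, 6, 7, 8]

# Case 2: the same mistake counts, but unrelated to the test score.
mistakes_unrelated = [6, 3, 8, 2, 7, 4]

r_tracking = pearson_r(test_scores, mistakes_tracking)
r_unrelated = pearson_r(test_scores, mistakes_unrelated)

print(f"Scores that track mistakes both ways: r = {r_tracking:.2f}")
print(f"Scores unrelated to mistakes:         r = {r_unrelated:.2f}")
```

With six people, neither number proves anything; the point is only what a two-way relationship looks like when each person's score is paired with that same person's performance.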

Reputable test vendors (i.e., those who follow professional test development standards) eagerly show controlled studies of test results … and, welcome questions about them. Bird seed vendors enthusiastically produce client testimonials … and get defensive when questioned. How can testimonials be unacceptable? For the same reason you cannot trust political ads. They have an agenda and are seldom supported by facts. Here is an example using a sales job:

Sales Manager Anecdote: We used XYZ test and our sales productivity increased.

OK. What is your definition of productivity? What else was happening at the time that could have affected the numbers? Did you land a big customer? Did the economy improve? Did lower and higher scores predict lower and higher sales? Are you using group results or individual data? Sales dollars are only one part of the job. What about satisfaction, service, returns, cross-selling? You see, anecdotes are rhetorical. They might sound good, but seldom tell the whole story. Anecdotes and validity are not equal. Birdseed vendors, because they don’t follow professional test-validation processes, don’t know they don’t know this.

Define Performance … or Else!

Let’s continue with our sales example. Nothing is more important than a highly productive sales staff. But wait. What does that mean? Are we discussing acquiring new customers? Farming or hunting? Cross-selling? Delivering great customer service? Customer retention? Solving service problems? Favorite golf buddies? Job turnover? Learning new products?

Get the picture? I have learned over time, especially with call centers, that many performance areas even conflict with one another. Problem Solving Quality and Calls Completed are often negatively related (i.e., it generally takes more time to better resolve problems). It drives employees crazy when an organization sets mutually conflicting objectives. So which one should they test for?

Performance is a loosey-goosey catch-all term that could actually mean something entirely different to different people. In my experience, few sales managers and even fewer HR departments ever take the time to think this through. So, before you decide on a test vendor, carefully define what you want to measure. If you think “performance” is a singular thing, then you are in a heap of trouble. If someone does not know what he/she wants to control, then any solution will be like bed wetting … warm and comfy at night, but cold and miserable in the morning.

Truth or Dare!

My bathroom scale is heartless. It tells me when I am overeating. It also tells me when I am at my healthy weight. Your hiring test should do the same thing. Good scores should have the same strong causal relationship with high performance; and, bad scores should have the same strong causal relationship with low performance. This is really important. Vendors who do not follow professional test development standards don’t seem to really understand that validation is a two-edged sword. Let’s look deeper at a very common, and very wrong-headed, practice.

Vendor A separates people into a good group and a bad group. The good group takes the test and the vendor averages their numbers. From that day forward, every applicant is benchmarked against the good-group average. Sounds good? Sorry. It’s a clear sign the vendor is selling fake birdseed.

Let’s start by asking how the people were group-classified. What constitutes performance? Are good schmoozers in the same group as slow learners? How about group size? Are there enough people in each group? (It takes at least 15 to 25 people before you can draw a decent conclusion.) Is the bad group the same size as the good group? (Groups should always be about equal-sized.) Are the differences between groups strong or subtle? (If everyone is at least good enough to stay employed, you will probably be able to see only strong differences.)

What about the test itself? Can the vendor show proof every item in the test directly affects group performance? How strong is it? Research shows that virtually all self-reported motivation, personality, and attitude test scores have weak relationships with “hard” job skills like learning ability, problem solving, and so forth. If the test factor doesn’t strongly predict job performance, the test won’t make any difference in hiring quality … it will just make your job more difficult.

One more comment about group scores. They tell you about groups — nothing about individuals. Consider the following: people in the Top Group have scores of 20, 30, 40, 50, 60, and 70 (average = 45). The Bottom Group scores are 10, 20, 30, 40, 50, and 60 (average = 35).

So the person doing this analysis figures that producers score an average of 45 — so let’s go test people and hire the ones who score 45 or more. Whoops! If we used top-group averages as our standard, we would eliminate three top producers and hire two bottom ones. Fake birdseed alert!
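The arithmetic is easy to check. Here is a minimal sketch using the twelve scores from the example above; the variable names are mine, and the cutoff rule (hire anyone scoring at or above the top-group average) is the one just described.

```python
# Scores from the example above.
top_group = [20, 30, 40, 50, 60, 70]
bottom_group = [10, 20, 30, 40, 50, 60]

def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)

# The vendor's "standard": the top-group average becomes the hiring cutoff.
cutoff = mean(top_group)

# Top producers who would be turned away (score below the cutoff).
rejected_top = [score for score in top_group if score < cutoff]

# Bottom producers who would be hired (score at or above the cutoff).
hired_bottom = [score for score in bottom_group if score >= cutoff]

print(f"Top-group average (cutoff): {cutoff}")
print(f"Bottom-group average:       {mean(bottom_group)}")
print(f"Top producers rejected:     {rejected_top}")
print(f"Bottom producers hired:     {hired_bottom}")
```

The two groups overlap on five of the six score values (20 through 60), so no cutoff drawn from an average can sort individuals cleanly. The averages differ; the people mostly do not.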

Separating Pros from Pretenders

Setting hiring-test standards is an all-or-nothing game. There are no shortcuts. In my personal experience, wrong-headed vendors are seldom intentionally deceitful. They enthusiastically believe in their fake birdseed; after all, people who make things with their own hands seldom welcome criticism. So, they rely on client anecdotes, claiming they are sufficient proof of validity. Some will even claim that the EEOC has validated their test. Sorry. This is completely wrong-headed and foolish thinking.

If they rely on vendor claims, users will never know how many good candidates they turn away, nor how many bad ones they will hire. They always pay the price for this mistake later. You see, legal challenges seldom happen in the hiring phase. They happen on the job. Challenges begin when incompetent employees challenge termination or being overlooked for promotion. Forget the short term and six-month guarantees. Bad hiring decisions start showing themselves about a year later.

So how do you identify a pretender? Anyone who is:

Producing client testimonials (not tightly controlled studies) claiming their test is valid;
Getting defensive when questioned;
Claiming their test doesn’t actually predict performance, but can be helpful;
Claiming the EEOC has approved their test;
Setting standards based on group or job averages;
Focused primarily on training, not professional test development;
Giving everyone a broad-based test (i.e., not based on performance requirements) and then measuring averages;
Giving everyone a broad-based test (i.e., not based on performance requirements) and then measuring differences;
Believing a self-descriptive test strongly and accurately predicts job skills;
Not able to produce a technical manual documenting what the test measures and why that factor leads to job performance;
Not clear on the definition of what the test actually predicts.

There are others, but this is a good start. Here is a quickie birdseed question users should ask every vendor: “Was your test specifically developed to predict job performance? If so, what part?” Any answer other than “Yes” means the test probably won’t work.

Birdseed or Not Birdseed!

As you might imagine, birdseed vendors complain the loudest. That’s really shameful. Validation principles are taught in major universities throughout the western world and religiously followed by every professional test development house. Just because a vendor does not know what they are is no excuse. It reminds me of Gary Larson’s little fat boy trying to enter the School for the Gifted and Talented by pushing hard against a door that clearly says “pull.”

Here are some believe-it-or-not examples:

V1: Vendor (who sells a self-reported personality test) … All you care about is assessment. Don’t you care about performance?

A: Hello! Assessment is anything used to evaluate a candidate and predict performance. Besides, there is abundant literature showing self-reported tests are miserable predictors of skills like problem solving ability, planning, and teamwork. You want accuracy? Start selling tests that measure hard-to-fake applicant skills.

V2: Vendor (who sells a post-WWII Nazi atrocity test) … Our test is validated. See our report. Wanna be a distributor?

A: No, thank you. I am not in the market for a concentration camp commandant. Besides, a technical report filled with anecdotes from unqualified people venturing their unsupported personal opinions about your test does not meet professional test standards.

V3: Vendor (who does group-level averaging). Group averaging is just another form of validation.

A: No. It’s not. Your test has no clear performance criteria; no proof a specific factor causes performance; group data is being used to make individual conclusions; and, your groups are so small, the numbers are either nonsense or chance.

U4: User … If I use a test, I’ll never place a candidate!

A: If there was ever a statement concerning the sad state of applicant screening, this was it!

U5: User … We like the DISC/MBTI/ACL/CPI/16PF/MMPI/Caliper test so much, we decided to use it for hiring.

A: That’s interesting. As far as I know, none of these publishers claim their test predicts job performance. Some even strongly recommend against using it for hiring. Perhaps you know something the publishers do not? Think about it. Just because a test measures a difference between people, does that mean it also predicts someone’s job performance?

U6: User … We use tests to match candidate personalities to managers.

A: That might be a good idea if company culture never changed; managers never changed jobs; people never changed departments; and, cloned personalities never led to group-think.

U7: User … We interview. We don’t use tests.

A: If you ask questions and make hiring decisions based on applicant answers, how is that not a test?

V8: Vendor (after learning what it takes to meet professional test requirements) … I can’t do that!

A: That said it all.

V9: Vendor … We keep adjusting top group scores until we get the maximum individuals in the group to pass. The results become our hiring standard.

A: Fine-tuning junk science yields finely-tuned junk science.

V10: Vendor … We compare every applicant against a country-wide manager/salesperson/driver/XYZ job norm.

A: So, you are assuming all jobs/companies/industries with the same title are alike; everyone in the group norm performs just like your people are expected to perform; every individual in the group norm matches the group average; jobholder answers are identical to applicant answers; applicants never try to make themselves look good on tests; and, every factor in the norm affects performance? Sure.