(This post was co-authored with Arnold Packer.)
Reliability and validity are the alpha and omega of testing. A reliable test yields consistent results each time it's given, while a valid test measures what it is supposed to measure. Tests that meet both criteria are the gold standard of assessment.
For example, making someone swim 100 yards to test whether or not he can swim would be a valid and reliable test. If you sink, you flunk, and that’s true each time the test is given and is independent of who is doing the testing.
However, when teachers are trying to assess ‘soft’ skills, the waters get murky. How can we measure the ability to work with others, process information from disparate sources, communicate persuasively, or work reliably?
Consider the concept of reliability. Is an employee who is late to work one day most weeks reliable? What grade would you assign her, on a scale of 1 to 5? Suppose you found her explanations plausible (child care problems, for example) and cut her some slack, giving her a 5 because you've been there yourself. Another employer might give that same employee a 2 or a 3, because, after all, late is late.
There’s no set scale for measuring ‘working with others,’ meaning that the rating may vary depending upon who’s doing the rating. And what to one teacher is ‘persuasive communication’ may fall flat with another. There’s just no easy way to measure those all-important ‘soft’ skills.
And they are important. Put yourself in the position of a Human Resources officer, trying both to be fair and to have some confidence in an applicant's likelihood of success on the job. You want to know as much as you can about a potential hire, but often all you have to go on are a resume, impressions from an interview, and maybe some recommendations.
Or consider higher education. College admission officers don’t want a freshman class made up entirely of students with perfect GPA’s. They know that students with ‘soft skills’ and academic proficiency contribute greatly to campus life.
And so they consult SAT (and maybe AP) scores, scores on high school exit exams, references, and high school GPAs, but how reliable and valid are these? High school grades and GPAs are clearly unreliable, as every student who has chosen courses and teachers to enhance a transcript knows. References are even less reliable, and the lack of predictive validity of SAT scores has led many selective colleges to abandon them.
Clearly, both employers and admissions officers could use more information—if only that information were reliable and valid.
While we recognize the complexity of measuring soft skills, we believe we are close to meeting this challenge. This summer, with help from a Kellogg Foundation grant, we are asking mentors at 28 community-based organizations to assess teenagers' performance and provide them with a Verified Resume (VR). We ask the mentors to assess middle and high school students on such traits as responsibility, work ethic, collaboration, communication, problem-solving, critical thinking, and creativity.
We ask the mentors to write a two-sentence description of the context in which each of the traits was demonstrated. Was the teenager responsible about picking up trash in the park or helping out on the surgical ward? Communicating to a friend about the homework assignment requires a different skill level than communicating about obesity to a large community audience. No reasonable rubric can cover this amount of variation.
Finally, mentors also grade the students' performance on a scale of one ("cannot do it") to five ("does it well enough to teach others").
Having a Verified Resume of performance will potentially improve the package of information available to college admissions and HR staff, if those busy folks take the time to look at the VR in the few minutes that they devote to considering each applicant. The challenge is to convince them that the VR will improve their decisions.
Five communities are involved in the current Kellogg-financed project: Baltimore, Boston, Grand Rapids, Montana, and Salt Lake. Baltimore provides a good illustration of the process. All eight participating not-for-profit organizations there are involved in youth development. Two have students creating videos; another sends students to City Hall; students in a third organization engage in debates; those in a fourth help younger students with algebra.
Here are two grades and comments pertaining to a student in a media project.
Responsibility (4.50): G. created and took on personal projects above and beyond the requirements of the programs in which he participated. He often came in early to ensure that these tasks were completed professionally and on time.

Team Player (3.50): G. has consistently led his team members to complete projects on time. In the Festival Committee, he helped to organize his peers to accomplish the production of promotional videos and prepare for public speaking events.
But the process doesn't stop when the mentors give their grades. Instead, we will survey employers to see whether they agree with, for example, the '5' a mentor gave their new employee. If not, revisions are in order, and perhaps some retraining of the mentor. That is, we envision the VR as a living document, one that is always subject to online revision.
We got into this because we believe performance traits like responsibility, tolerance for diversity, ability to communicate, and work ethic matter. Because they matter, we must also figure out how to measure them reliably.
It’s not going to be easy, but nothing of importance ever is.
Can the VR come close to the assessment community’s gold standard? The quest for the perfect is frequently the enemy of the good. John Maynard Keynes is credited with saying, “It is better to be roughly right than precisely wrong.” Teaching, measuring, and certifying soft performance skills are important, according to both employers and colleges. We cannot afford slavish adherence to shibboleths regarding statistical purity without asking if new measures can provide more real reliability and better predictive validity.