Thursday, May 21, 2009

Constructing Evaluation Instruments

The posting below gives some excellent advice on constructing evaluation instruments and their uses in testing and grading. The author is Stanford C. Ericksen.

Testing and Grading

Fair play is the first and final requirement in matters of testing and grading. Students will accept pressures for hard work but object strenuously, and rightly so, to signs of unfairness in a teacher's assessment of their efforts. Being an expert in an area of subject matter and having the speaking skills required for teaching are quite different dimensions of professional competence than are the abilities to construct discriminating examinations and to assign valid grades. Improvement on the part of instructors in the areas of testing and grading is nearly always in order.

An important distinction must be made between evaluation and grading. Evaluation is information provided to the student about particular aspects of what was said or done during the effort to learn, to solve a problem, or to organize and integrate facts and concepts. As they move into unknown intellectual territory, students must have guideposts to confirm that they are moving in the right direction. The qualitative comments about particular aspects of a term paper are far more constructive aids for the specifics of learning and remembering than is the grade on the cover page. Evaluation, therefore, is indispensable to students for gaining understanding and for fixing what is learned in memory. A grade, on the other hand, is a gross index which typically comes too late for the student to take corrective measures about the specifics of learning.

A few guidelines about testing and grading can help instructors to: (1) strengthen the process of instruction, (2) clarify the diagnostic value of testing, (3) make a fair assessment of what each student knows, and (4) report this achievement through grades.

Testing as a Tool for Instruction

Students tend to concentrate their study effort in preparation for an exam, and they structure this effort in anticipation of the nature of the questions they will be asked. If students anticipate the need to know unassimilated facts, they will concentrate on memorizing information; if they expect to be asked to integrate, extend, and evaluate information, they will try to prepare themselves along those lines. The management of testing is an opportunity for the instructor to underline the essential elements making up the course.

As a matter of fact, a program for the orientation and training of beginning college teachers could well be geared to the interdependence among: the objectives of a course, the sequence of topics (and their classroom presentation), and the manner in which this can be assessed by means of tests, papers, and special projects. I recall a science professor whose overriding goal was "to teach students to think like a scientist thinks" but whose tests were almost solely measures of how well students memorized. He changed his exams to emphasize integration of material, and everyone felt better about the course.

The Diagnostic Use of Tests

Placement testing is commonly used at the department and college level, but within our own courses we can also make effective use of similar testing for making a grade-free diagnostic appraisal of what information is already known by the students or is not known but should be. Diagnostic testing is an excellent instructional tool because when a student says, in effect, "I don't see why the question was scored that way," an inquiry is started toward unscrambling the false connections. In this close-up look, the teacher may note a pattern of mistakes showing a misunderstanding of a particular rule, procedure, or principle. It may also appear that a student has the right answer but for the wrong reasons.

A diagnostic test is a sort of intellectual X-ray showing the strengths and weaknesses in a student's inventory of information, understanding, and skill. The evaluative emphasis is on the responses to individual test items, on information prerequisite for understanding the larger concepts and procedures in this particular course of study.

When students realize the significance to themselves of grade-free probing, they are more likely to open up and reveal low points in their preparation profile, anxieties, misconceptions and deficiencies in knowing how to do certain tasks. A sprinkling of short, diagnostic quizzes early in the term suggests to students that the teacher cares about how they are doing and is taking corrective steps to help them along - an excellent climate for starting the semester.

Assessing Achievement

Although test scores in any setting are affected by students' aptitude, study skills, motivation, background preparation, and the influence of the teacher, our classroom examinations should be designed primarily to measure subject-matter achievement. To this end, the teacher and student seek the same wavelength within an assigned domain of knowledge. A frustrated student expressed a contrary state of affairs quite clearly, "I don't like to play the professor's game: I've got a secret, see if you can guess what it is."

Effective classroom instruction is central to student learning, but students are short-changed if the examinations are trivial, irrelevant, confusing, or tangential to the substance of the course. College teaching is not complete without an accurate and fair assessment of students' achievement during the term and at its conclusion.

Objective Tests

Objective (machine-scorable) tests are almost mandatory in large classes, but constructing such instruments is a demanding task. Although it is tempting for teachers to make use of commercially available examinations, to pull old tests from the file, or to overuse test items taken from a teacher's manual, students are best served when their instructors develop exams tailored to their specific course and based on sound principles.

Two basic concepts need to guide the development of classroom examinations:

1. Validity refers to whether an instrument measures what it is supposed to measure. A valid test, therefore, samples what students should have learned from your course offering. It measures here-and-now achievement rather than, for example, how well a student reads or how much information the student had gained outside the course. Test items about minutiae and footnote information are temptingly easy to put together but lack the validity of questions that elicit a student's understanding of key concepts, important factual data, and relevant procedures. A valid test is an unambiguous reflection of what is worth knowing and remembering.

2. Reliability refers to the consistency of an instrument's results. A good short quiz is better than a poorly constructed long test but, assuming equal quality of items, a 50-item test is more reliable (stable, consistent) than a 10-item quiz. The random errors due to ambiguous wording, idiosyncratic interpretations, distractions, and other flaws are more likely to cancel out in the longer test, resulting in a more dependable total score. Thus, the easiest way to reduce the unreliability in the measuring instrument is simply to increase the length of the test.
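The link between test length and reliability can be made concrete with the Spearman-Brown formula, a standard psychometric result that the author does not name. The sketch below assumes, as the text does, that the added items are of equal quality to the existing ones. The starting reliability of 0.50 is a hypothetical figure chosen only for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after multiplying test length by length_factor.

    Assumes the added items are comparable in quality to the original ones.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical: a 10-item quiz with reliability 0.50, lengthened to 50 items.
quiz_reliability = 0.50
predicted = spearman_brown(quiz_reliability, 50 / 10)
print(round(predicted, 2))  # 0.83
```

Under these assumptions, the longer test is predicted to be considerably more dependable, although each added block of items buys a smaller gain than the one before it.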

Objective tests come in many forms, but the multiple-choice format carries most of the burden. When carefully worded, multiple-choice items can probe a student's understanding of factual information, skills and procedures, concrete and abstract concepts, and the implications from different scales of values. (True-false items are altogether too constrained to be effective discriminators for most college courses.)

To strengthen the quality of the set of items used, a complete item analysis should be made of each new test. This test-of-the-test is mainly to determine and adjust the difficulty level of each item. It is normal to find that many of our carefully conceived questions turn out to be too easy or too difficult or just seem to ride along as excess baggage. Such items use valuable testing time but add little to the discriminating power of the test. They don't help to separate the top group of students from the bottom group of achievers.
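A minimal sketch of such an item analysis follows. It computes, for each item, a difficulty value (the proportion of students answering correctly) and a simple discrimination index (the proportion correct in the top group minus the proportion correct in the bottom group). The 27 percent group size is a common convention rather than something the author specifies, and the function name and score matrix are hypothetical.

```python
def item_analysis(responses, group_fraction=0.27):
    """Return (difficulty, discrimination) for each item.

    responses: one list per student, holding 1 (right) or 0 (wrong) per item.
    group_fraction: share of students in the top and bottom groups
                    (0.27 is an assumed convention, not from the text).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    # Rank students by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, round(n_students * group_fraction))
    top, bottom = ranked[:group_size], ranked[-group_size:]

    results = []
    for i in range(n_items):
        difficulty = sum(student[i] for student in responses) / n_students
        p_top = sum(student[i] for student in top) / group_size
        p_bottom = sum(student[i] for student in bottom) / group_size
        results.append((difficulty, p_top - p_bottom))
    return results


# Hypothetical scores: six students, four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
results = item_analysis(scores)
for number, (difficulty, discrimination) in enumerate(results, start=1):
    print(f"item {number}: difficulty={difficulty:.2f}, discrimination={discrimination:.2f}")
```

In this made-up data the first item is answered correctly by everyone, so its discrimination is zero: it is the kind of "excess baggage" item that uses testing time without separating stronger from weaker students.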

Because ambiguity of meaning is a persistent problem, the wording of test items is critical. Careful editing of the draft exam includes close attention to such pitfalls as cluing the right answer, overlapping correct alternatives, or asking for a positive answer to a negative question. Good test items are parsimonious in meaning and simple in wording. It is surprising how quickly excess words can lead to double meaning or obscure the correct answer. It is appropriate, however, to expand the stem - the lead-in statement of the multiple-choice question - by using a relevant quotation or making reference to a particular body of factual data.

Score the test in a straightforward manner, e.g., in terms of the number of right answers. Trying to adjust (punish) for guessing may, in effect, simply open further sources of variability. Combining raw scores from different performance measures, i.e., tests, term papers, class participation, special projects, etc., can easily distort your original intention. The statistical solution is to convert the different measures to a common scale through the use of some type of standard-score scale.
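One way to see the distortion, and one possible standard-score remedy, is sketched below. It uses z-scores (deviation from the class mean in units of standard deviation), which is only one of several standard-score scales. The exam and paper scores and the equal weights are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(raw):
    """Convert raw scores to z-scores: (score - mean) / standard deviation."""
    m, s = mean(raw), pstdev(raw)
    if s == 0:
        # Everyone scored the same, so the measure carries no information.
        return [0.0 for _ in raw]
    return [(x - m) / s for x in raw]


def composite(measures, weights):
    """Weighted sum of z-scores across measures, one value per student."""
    standardized = {name: z_scores(raw) for name, raw in measures.items()}
    n_students = len(next(iter(measures.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in measures)
        for i in range(n_students)
    ]


# Hypothetical class of three: exam out of 100, term paper out of 10.
measures = {"exam": [90, 70, 50], "paper": [6, 8, 10]}
weights = {"exam": 0.5, "paper": 0.5}  # intended: equal importance

raw_totals = [e + p for e, p in zip(measures["exam"], measures["paper"])]
print(raw_totals)  # [96, 78, 60]
combined = composite(measures, weights)
print(combined)
```

Adding raw scores lets the exam decide the ranking almost alone, because its scores spread over a much wider range. After standardizing, the two measures carry the equal weight that was intended, and in this contrived example the three students come out essentially tied.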

Subjective Evaluation

The distinctive value of essay exams or term papers is the freedom they offer for students to probe and develop the personal meaning of ideas and to express these thoughts in their own words. To organize an integrated chain of thought, to elaborate on findings, and to communicate ideas to others are stronger tests of achievement than is the recognition or recall of isolated units of information.

1. Essay Exams. In an essay examination, the student is staring at a blank page and generating, from within, a complicated sequence of responses aimed at organizing a meaningful unit of knowledge. This ability to recall is a more demanding test of memory than simply to recognize something. An essay examination elicits the ability to retrieve information with little help from presently given cues. The perceptive teacher (reader) can evaluate the strong and weak points in a written argument even when the student's perception of a question differs from the teacher's. Evaluative permissiveness can, of course, go only so far.

A steady and unwavering evaluative state of mind is difficult to sustain when reading page after page through a set of exams. Three procedural controls help to reduce the evaluating drift: (1) turn under the front (name) page to forestall confounding effects from those students we particularly like or dislike; (2) read one question at a time through the entire set of exam booklets; (3) shuffle the order of the booklets periodically to balance the inevitable effects of reader fatigue or an emerging tilt toward one pattern of answers.

2. Term Papers. In some respects, the term paper is the essence of what a student has gained from the course. It sets forth what the individual student has learned and how the student has pulled together all the information for comprehension and understanding. This, in turn, serves to keep the knowledge available in long-term memory.

A written handout is a useful guide regarding the due date, length, use of references, comments about style, and any other restrictions or suggestions about the assignment. It may, for example, be helpful to remind students about the difference between describing versus analyzing events and ideas. The heavy task of reading these papers is counterbalanced, somewhat, by the satisfaction of reading the better papers - some of which can be truly exciting.

Grading a stack of exams and papers is a time-consuming and pressured task because, throughout, the matter of fair play is squarely on the back of the reader. By way of evaluation, the teacher should indicate in some detail the rationale for assigning the gross grade, making specific reference to identified parts of the exam or paper. The instructional value of essay exams and term papers is practically wiped out if the student receives nothing back other than the grade.

Grading

Faculty standards for A-grade performance define the meaning of excellence within the university. We must guard the criteria of achievement, since everyone pays the price of academic inflation when these standards are lowered. Students work hard for grades because "making the grade" is personally rewarding and is an important basis for special awards, admission to advanced training, and employment prospects. With such payoff potential, it is unfair for a teacher to be casual or careless in assigning this index of achievement. Judgments about professional competence must take into account the quality of a teacher's procedures for testing and grading.

There are two basic options available to instructors for grading student achievement:

1. Norm-referenced grading, more commonly referred to as grading-on-the-curve, sets the scale of achievement by the average level of class performance. Students basically compete against one another in this approach.

2. Criterion-referenced grading has the teacher measuring the students against some absolute standard with respect to what they are expected to learn. The competition here is between the student and mastery of a finite body of knowledge.
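The two options can be contrasted in a small sketch. Both grading rules here are hypothetical: the fixed cutoffs (90, 80, 70, 60) stand in for a teacher's preset criteria, and the curve assigns letters by distance from the class mean in standard deviations, which is only one of many ways to grade on the curve.

```python
from statistics import mean, pstdev


def criterion_grade(score, cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
    """Grade against fixed cutoffs (hypothetical values), ignoring classmates."""
    for minimum, letter in cutoffs:
        if score >= minimum:
            return letter
    return "F"


def norm_grade(score, class_scores):
    """Grade by position relative to the class mean (an assumed curve rule)."""
    m, s = mean(class_scores), pstdev(class_scores)
    z = 0.0 if s == 0 else (score - m) / s
    for minimum, letter in ((1.0, "A"), (0.0, "B"), (-1.0, "C"), (-2.0, "D")):
        if z >= minimum:
            return letter
    return "F"


# Hypothetical strong class: every student scores 85 or better.
class_scores = [95, 92, 90, 88, 85]
criterion = [criterion_grade(s) for s in class_scores]
curved = [norm_grade(s, class_scores) for s in class_scores]
print(criterion)  # ['A', 'A', 'A', 'B', 'B']
print(curved)     # ['A', 'B', 'B', 'C', 'D']
```

With these made-up numbers, the same score of 85 earns a B against the fixed standard and a D on the curve, simply because the classmates did well.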

In practice, these two approaches overlap and merge since a teacher's judgment about levels of achievement is influenced by the levels of student performance with which one is accustomed at a given school. Also, the departmental culture enters into the picture, because a teacher's procedures and standards for testing and grading are expected to fall in line with the traditions or policies of the home department.

The danger in grading-on-the-curve is its diminishment of the teacher's responsibility for evaluating the students' level of understanding against his or her preset criteria of subject-matter achievement. The final examination, for example, is a revealing statement sampling the information and skills the teacher believes should be carried from the course.

With criterion-referenced grading, there is the danger that the instructor may set the expected level of achievement unrealistically high or low, with the result that students perceive the exam as inappropriate and unfair.

Grades serve the academic purpose of showing intellectual achievement in a limited domain defined by books, teachers, laboratories, and the like. They are not designed to predict success in the off-campus setting where special weight may be given to information, aptitudes, and personal characteristics extending beyond the boundaries of teachers and their courses. Only indirectly, or on occasion, do grades reflect a student's tolerance for stress, independent decision-making, congeniality in human relations, ability to cope with unexpected problems, and the like. Teachers can best sustain the credibility of the grading system by making their assessments reflect as fairly as possible how well each student has achieved the stated objectives of the course.
