A Fractal Thinker Looks at Student Evaluations
© 2005 Edward B. Nuhfer - Center for Teaching and Learning, Idaho State University
"It is the true believer's ability to shut his eyes and stop his ears to facts which in his own mind deserve never to be seen nor heard, which is the source of his unequalled fortitude and consistency." Eric Hoffer, 1942, The True Believer
Abstract. This review paper began as a reference for Boot Camp for Profs® and receives periodic updates. It evolved greatly between 1994 and 2005. This version has two parts. Part I is a general description of student evaluations and the results of studies, as well as a presentation of arguments based on evidence. Part II addresses student ratings through a framework that relates to all of education: thinking in terms of fractals.
Fractal patterns occur in neural networks and synaptic connections. Many aspects of higher education (teaching, learning, thinking, and the evaluation of all three) involve fractal neural networks that manifest fractal traits. Neural networks developed during acquisition of competency in teaching are comparable to the neural networks developed during acquisition of personality types, learning styles, and multiple intelligences. A trait of all fractal forms is that their characterization requires multiple measures. Distinguishing personality types and the like, by necessity, requires multiple measures, and the tools used for diagnoses, without exception, require spectra of diagnostic responses, because types are impossible to diagnose by single response items. Likewise, summative student ratings alone cannot define "good teaching," and the employment of students' responses to single questions to characterize "good teachers" represents a single-measure attempt to characterize a fractal system.
To those who understand fractals, it is obvious why such practices are flawed and doomed to failure. However, less obvious is the fact that such evaluations are not simply inept, but also destructive. All learning, including the learning of teaching competency, involves both the cognitive and affective domains of the brain. Meta-analyses have already defined self-esteem and enthusiasm as essential affective attributes of successful teachers. When student satisfaction ratings become tyrannical tools, the result is an attack on the positive affective attributes of teachers' minds. The result is counterproductive to teaching, learning, and even to the personal core of the individuals caught in this hapless situation.
Summative ratings result from a mix of cognitive and affective factors. Correlations reported between ratings and affective first impressions (thin slices) are even higher than those reported between ratings and learning performance. Thin-slices research confirms powerful, affective influence on ratings. Evaluative ratings are certainly honest expressions of student satisfaction and therefore represent one form of valuable information--one measure that becomes useful when informed by equally important multiple measures. However, students, unlike "customers," have responsibilities that go beyond paying for a product, so such satisfaction is not equivalent to "customer satisfaction." Affective influence seems weaker on formative ratings, which reveal the pedagogical practices present in a class and the degree to which each is present. There is strong research evidence for benefits of particular instructional practices on both student learning and student satisfaction. This provides reasonable basis for including, as one essential multiple measure of a faculty member, a formative profile of pedagogical practices.
Summative rating of faculty by college students is an evaluative challenge at the highest level of challenge in Bloom's 1956 taxonomy. The ability of students to do evaluative thinking rests upon their ability to use evidence and to meet a high Bloom-level challenge with a high-level thinking response on the Perry (1999) scale. Students' ability to handle well the evaluative challenge in the special case of rating professors is no different from their ability to handle other evaluative challenges. Research confirms that this ability in undergraduates (King and Kitchener, 1994) is generally marginal.
Recent emphases on assessment of student learning outcomes underscore the fact that evaluating the faculty implies nothing about improvement of student learning. Knowledge surveys are a kind of student evaluation. They constitute a direct and detailed portrayal of content learning from the viewpoint of students. Data yielded from knowledge surveys prove to be reliable and useful as an assessment tool. Instead of reliance on summative satisfaction, I recommend that "student evaluations" of faculty proficiency include summative ratings, formative profiles, and knowledge surveys. Multiple measures of evaluative input from students should be superior to the use of summative satisfaction alone, but even these together do not constitute the thorough review required for career decisions about faculty rewards and retention.
Preface
The area of student evaluations differs from most research areas in that it is
perhaps the paragon of emotionally charged academic topics. Emotions result
from both personal experiences and conflicts of interest. As a result, the
literature of student evaluations contains more than the usual share of diatribes
and polemics, and some take on unusually vitriolic tones. On one side are
staunch champions of student evaluations, whose rhetoric seems like advocacy
for student evaluations being the strongest cure for any classroom ill that
affects higher education. On the other side are those who hold equally unshakable
belief that students have no business evaluating faculty, that student evaluations
are, at best, popularity votes based on attributes that have little to do
with educational value. Strident arguments from both camps remind one of Eric
Hoffer's description of "the true believer." In the middle are the
vast majority of academics (faculty and administrators) who have not read
much research on evaluations, and hold opinions based mainly upon personal
experiences.
Pseudo-evaluation damages the credibility of legitimate evaluation and victimizes individuals
by irresponsibly publishing comments about them derived from anonymous sources.
This is voyeurism passed off as "evaluation" and examples lie at
http://www.pickaprof.com/ and http://ratemyprofessors.com/index.jsp.
Neither site provides evaluation of faculty through criteria that might be
valuable to a student seeking a professor who is conducive to their learning,
thinking or intellectual growth. Both sites are transparently obvious in their
advocacy that describes a "good teacher" as an easy grader. The
former site proudly displays a quotation attributed to the Houston Chronicle:
"...the most vital academic tool[s] to students seeking good grades."
Presenter Phil Abrami (see Theall and others, 2005) rated the latter as "The
worst evaluation I've seen" during a panel discussion on student evaluations
at the 2005 annual AERA meeting.
Lest one think that only crackpots armed with web servers structure such abuses,
an example of equally irresponsible use of student evaluations lies at University
of Colorado's http://www.colorado.edu/pba/fcq/.
Any individual, not just prospective students, but faculty members' friends,
children, and associates inside or outside Colorado can snoop through
what would normally be confidential personnel data. The exercise of such inexcusable disrespect for personal lives gets justified by the usual pompous bureaucratic excuse: "Collection,
publication and use of student ratings is mandated by the Regents." This exercise hasn't
yet been accompanied by any "Rate your Administrator" or "Evaluate
your Regent" equivalents, which speaks mostly to the significant abuse of rather insignificant power. Only petty arrogance produces an evaluation system to inflict onto others and their families that the system's designers will not allow to be applied to themselves. In practice, most institutions show better judgment
in management of their faculty and student evaluation data than does CU. Yet, the following
statement accurately captures the importance that administrators in general
ascribe to student evaluations.
"Student ratings of teaching serve as an important component of many faculty evaluation
systems. Either by design or default, institutions often place great weight
on student rating data in making decisions that impact faculty rewards,
career progress and professional growth. It is critical that student
rating forms be designed and constructed in such a way as to provide valid
and reliable information for these purposes." (See http://www.cedanet.com/sr_description.htm
web site current as of July 15, 2005)
"Dr. X's performance in the area of teaching meets expectations. However, if overall student evaluations of Dr. X do not improve during 2005, teaching may not meet performance expectations."
This quote illustrates the thinking that equates satisfactory teaching performance
with high student ratings, and often specifically with tabulated
agreement with a single item such as: "Overall this was an excellent
course." Such examples reveal a true paradox: academic cultures with
awareness of the necessity for multiple measures being unable to overcome
their own practices of making personnel decisions based upon single-measure
convenience. Units that collect data from multiple sources often subscribe
to the same mentality by failure to incorporate "other measures"
in meaningful ways. Thus, they too default to "evaluation" as a
tabulation of reactions to a single measure.
Part I: What's Known?
Seldin (1993) notes that "...hundreds of studies have determined that student
ratings are generally both reliable (yielding similar results consistently)
and valid (measuring what the instrument is supposed to measure.)" The volume of literature written about student evaluations
is indeed immense, larger than that for any other single topic in higher education.
Cashin (1988) noted that over 1300 articles on the topic existed in 1988 and
the number today is more than double that figure. Workers who survey the growing
literature on the subject express favor for the usefulness of student evaluations
(Braskamp, Brandenberg, and Ory, 1984;
Cashin, 1988 and 1995; Cohen, 1981; d'Apollonia and Abrami, 1997; Dunkin and
Barnes, 1986; Greenwald, 1997; Theall, Abrami, and Mets, 2001). This is primarily
because there is a very general trend for highly rated teachers to be associated
with students who achieve well (McKeachie, 1994). However, end users of evaluations
often confuse "reliable" and "valid" with "highly predictive," "precise" and
even "accurate."
Student evaluations come in two types
There are two very
different kinds of student evaluations. These are "formative"
(those that diagnose in ways that allow professors to improve their teaching)
and "summative" (those used to evaluate professors for rank,
salary and tenure purposes). Formative evaluations given during the ongoing
course, usually about mid-term, ask detailed questions that provide a profile
of pedagogy and instructional strategies being employed.
In contrast, summative
evaluations given at the end of a course are direct measures of student satisfaction.
"Satisfaction" is the sum of complex factors that include learning,
teaching traits, and affective personal reactions that are products of what
happens in a class and what an individual has brought with him or her to the
class in form of bias and motivation.
It is maddening
when writers of papers and books about "student evaluations" or
"student ratings" fail to specify whether they are talking about
summative or formative tools. The thorough compilation by Theall, Abrami,
and Mets (2001) is somewhat damaged by lack of such specificity, because when
one talks of the utility of evaluations to help to improve teaching, one cannot
be talking about summative evaluations.
Formative items
are specific rather than general and address practices of professors rather
than students' feelings about professors.
Typical formative items include the following.
"Discusses recent developments in the field";
"Uses examples and illustrations";
"Is well prepared";
"States objectives of each class session";
"Encourages class discussion/participation";
"Gives personal help to students having difficulty in the course," and
"Is enthusiastic."
Summative items
that describe general satisfaction receive the greatest (unfortunately, sometimes
the only) attention by evaluative supervisors. Summative items called "global"
solicit a general overview of satisfaction with the course experience. Typical
global summative items include:
"Overall, how do you rate this instructor's teaching ability compared to all other college instructors you have now and have had in the past?"
"Overall, how do you rate this course compared to all other college courses you have now and have had in the past?"
"How do you rate this course as a learning experience?"
Summative responses conventionally follow Likert scales that range from "poor" to "excellent"
or "strongly agree" to "strongly disagree."
Brief history of formative and summative uses
The following was provided via email by Dr. Michael Theall (now at Youngstown State
University), who has written extensively on student evaluations (see Theall
& Franklin, 1990; Theall, Abrami, and Mets, 2001).
"The earliest distinction between formative and summative uses was by Mike Scriven,
who coined the terms in his (1967) 'Methodology of evaluation' in Taylor,
Gagne, & Scriven's 'Perspectives of curriculum evaluation'. The earliest
studies were by H. Remmers (e.g., 'Experimental data on the Purdue Rating
Scale for Instructors' in 1927) and they were concerned with exploring student
opinions as one way to find out more about teaching/learning for 'self-improvement
of instruction' and for psychometric reasons (i.e. to validate the scale).
In 1928, Remmers investigated 'student marks and student attitude toward instructors'.
This time, the psychometric properties of ratings were more the focus, (perhaps
due to increasing summative use and resulting validity questions?). By 1949,
Remmers was referring (in 'Are student ratings of their instructors related
to their grades') to students' opinions of the teacher as one of the 'Two
criteria by which teachers are often evaluated...' In the 1949 study, Remmers
concluded that 'There is warrant for ascribing validity to student ratings
not merely as measures of student attitude toward instructors...but also as
measured by what students learn of the content of the course.'
The timing of the administration of the instrument isn't
mentioned in the 1927 study, for example, but in the 1949 study, the evaluations
were done at the close of the term. So it looks like: 1) the earliest intent
was formative; 2) summative uses developed fairly quickly; 3) psychometric
properties were first a measurement issue and then a matter of establishing
ratings validity due to summative use; and 4) specifics of the evaluation
process gradually evolved from end-of-term administration to other timing
and process changes."
Evidence for Value
Where is support for the belief that student evaluations are meaningful reflections
of instructional competence? The link presumed between student ratings and
student learning, which Theall mentions began in 1949 with Remmers' study,
remained unproven for many years. Objective support for this belief now rests
upon good data and numerical analysis. The most common statistical tool used
is the calculated correlation coefficient (r) between student ratings and
other measures, in particular student ratings and measures of student achievement
as expressed by examination scores. Positive coefficients range
from r = 0 (no correlation) to r = 1 (perfect correlation; see Figure 9),
and negative coefficients range from r = 0 (no correlation) to r = -1 (perfect
inverse correlation). What constitutes a positive correlation "good enough,"
given such data, to warrant being called "supportive?" Cashin (1988)
provides some guidelines and suggests regarding student ratings validity:
"Correlations between 0.20 and 0.49 are practically useful" and
correlations above 0.50 "are very useful but they are rare...."
Researchers generally accept these guidelines.
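The guideline bands quoted above can be written as a small lookup. This is only a sketch: the function name `cashin_band` is hypothetical, and the handling of values in the gap between 0.49 and 0.50, of 0.50 itself, and of values below 0.20 is an assumption made here, since the quoted guidelines do not address those cases.

```python
def cashin_band(r: float) -> str:
    """Label a validity correlation with the bands quoted from Cashin (1988).

    Assumptions (not in the source): 0.50 is treated as the cutoff between
    the two quoted bands, and values under 0.20 receive a neutral label
    because the quoted guidelines do not name that range.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if r >= 0.50:
        return "very useful (rare)"
    if r >= 0.20:
        return "practically useful"
    return "below the quoted bands"


# Coefficients reported elsewhere in this review (Cohen, 1981; Cashin, 1988).
for label, r in [("overall rating of course", 0.47),
                 ("explains clearly", 0.50),
                 ("expected grades", 0.12)]:
    print(f"{label}: r = {r:.2f} -> {cashin_band(r)}")
```

The bands describe how useful a relationship is across a population of courses; they say nothing by themselves about how well one instructor's rating predicts that instructor's students' learning.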
Reviews of the actual results from large numbers of teacher evaluations show that
global questions, in particular, correlate very highly with one another (Cashin,
1995). The correlations between global questions commonly reach higher than
r = 0.8 (see Figure 1). Global questions often carry a great deal of redundancy.
For example, a professor who is rated highly on one global question that has
to do with his/her overall rating as a good professor will likely also get
a high rating in an overall question about the quality of his/her course.
The positive affective feeling is heavily redundant for both items, and redundancy
clouds the actual merits of a high correlation coefficient or the high factor
loadings in a factor analysis (see for example Marsh, 1983).
Figure 1 displays scatter plots associated with correlation coefficients within the ranges of those represented between student ratings and other measures.
Figure 1. Scatter plots showing correlations typical of student ratings (global ratings of overall satisfaction) with other important parameters. The research that established the relationships shown in the accompanying tables was done on much larger populations than the points used to create these graphs. These scatter plots display "best-fit" lines to the data that show the degree of prediction of Y from X that can be expected with varied correlations. A "significant correlation" does not mean "high degree of predictability." Perfect predictability (r = 1) would place all points on the best-fit line. From these graphs, it is obvious why a correlation established on a large population cannot be applied reliably to judge an individual. This is why multiple means of assessment are required; global student evaluations are never in themselves sufficient to judge individuals.
The strongest argument that student evaluations are related to student learning emerged about a quarter century ago. Two of the largest meta-analyses aimed at resolving the relationship (Cohen, 1981; Feldman, 1989) found consistent correlations of about r = 0.5 between student learning and student ratings. These provide the strongest basis to date that student evaluations reflect cognitive gains and that high ratings generally reflect better student learning. Cohen (1981) utilized students' scores on an external exam as a measure of student learning and compared them with ratings given by students on their evaluation questionnaires. The global self-evaluation of achievement ("rating of how much I learned," r = 0.47), "overall rating of course" (r = 0.47), and "overall rating of instructor effectiveness" (r = 0.44) all fall solidly within the uppermost region of the range Cashin dubbed "useful." Advocates for use of student evaluations frequently cite these studies as evidence of the value of student evaluations. Certainly, the studies demonstrate irrefutably that students generally know when they are learning and credit their teachers accordingly for providing learning opportunities. Based upon the size and care of these studies, it is unlikely that this particular relationship will change with further study. Since the relationship was established, no subsequent credible study has refuted these findings.
Cohen's (1981) landmark paper also indicated that particular instructional practices influence student ratings. Using the same external-exam measure of student learning, he discovered that ratings correlated positively with particular teaching traits, including "explains clearly" (r = 0.50) and the teacher behavior "uses class time well" (r = 0.47). These two lead us to look at whether particular practices, in addition to their documented relationship to satisfaction, are indeed related to increased learning. Thus, we now have a tie between formative evaluation data and summative evaluations. Subsequent research has teased apart student ratings as functions of particular practices. Such practices, when shown to be useful in promoting learning, provide another strong argument for the value of student evaluations. We have data to show that high global ratings in general are related to teachers' practices that indeed have a basis in research as being beneficial to learning.
Professor Kenneth Feldman has studied the relationships between student evaluations and practices, perhaps more than any other individual, and does thorough meta-analytical studies. Feldman (1998; see Table 1) teased out the formative components that lead to summative ratings of satisfaction and the components that lead to measurably increased learning. There is similarity in the ranking (Table 1), but there are important differences. Recent studies indicate that "the most effective" teaching practices seen by students vary across ethnic groups (Sanders and Wiseman, 1998). Further, many of the classic studies cited here drew their inferences primarily from classes dominated by lecture-discussion pedagogy. Comparable studies derived from classes that use alternative pedagogies are yet to be produced.
"Top 5" Instructional Dimensions Based on Different Indicators (Modified from Feldman, 1998)
| Instructional Dimension | % Variation Explained | Importance Shown by Weighting (and Rank) with Student Achievement | Importance Shown by Rank with Overall Evaluations |
|---|---|---|---|
| Teacher's preparation; organization of the course | 30 - 35% | .57 (1) | (6) |
| Clarity and understandableness | 25 - 30% | .56 (2) | (2) |
| Perceived outcome or impact of instruction | 15 - 20% | .46 (3) | (3) |
| Teacher's stimulation of interest in the course and its subject matter | 10 - 15% | .38 (4) | (1) |
| Teacher's encouragement of questions, discussion, and openness to opinions of others | 10 - 15% | .36 (5) | (11) |
| Intellectual challenge and encouragement of independent thought (by teacher & course) | 5 - 10% | .25 (13) | (4) |
| Teacher's sensitivity to, and concern with class level and progress | 5 - 10% | .30 (10) | (5) |
Table 1. Instructional dimensions compared with their ranks of importance in producing satisfaction and producing learning. These reveal similarity but not congruence. The trait most important to develop in order to produce highest levels of student learning is attention to course organization and preparation. This importance was also confirmed in the National Study for Student Learning (Pascarella, 2001). Yet, this is only the sixth most important practice for producing high ratings of satisfaction.
Erdle and Murray (1986) showed that certain behaviors in class affect student satisfaction, and further, that the relative importance of these behaviors varies between disciplines (Table 2). Erdle and Murray's work indicates that the nature of what we are trying to teach influences the traits we should probably seek to address to obtain improvement.
Correlations between Ratings of Overall Teaching Effectiveness and Teaching Behavior Factors (After Erdle and Murray, 1986)
Perceived importance to teaching by students of:
| Behavior | Humanities | Social Science | Physical/life science |
|---|---|---|---|
| Rapport | 0.43 | 0.70 | 0.59 |
| Interest | 0.50 | 0.71 | 0.37 |
| Disclosure | 0.30 | 0.65 | 0.25 |
| Organization | 0.51 | 0.56 | 0.47 |
| Interaction | 0.48 | 0.51 | 0.34 |
| Course Pacing | 0.53 | 0.45 | 0.62 |
| Speech Clarity | 0.53 | 0.45 | 0.62 |
| Expressiveness | 0.58 | 0.59 | 0.51 |
| Emphasis | 0.61 | 0.58 | 0.51 |
| Mannerisms | -0.53 | -0.42 | -0.28 |
| Use of Graphic | 0.22 | 0.35 | 0.37 |
| Vocabulary | 0.16 | 0.35 | 0.37 |
| Presentation Rate | 0.23 | 0.14 | 0.31 |
| Media Use | 0.30 | 0.23 | 0.11 |
Table 2. Relationship of teaching traits to summative global student ratings. The variations between how students value various traits depend upon the subject being taught. Affective factors such as rapport and expressiveness exert a powerful influence, but so do traits related to learning such as organization, clarity, and emphasis of important points.
Cashin (1988, 1995) helped promote awareness through concise summaries of the research on student evaluations, and he presented this data along with correlation coefficients (Table 3). His compilations show that professors who teach classes where students are motivated (such as classes taken by choice or in one's own major) have a distinct advantage in the likelihood that they will garner higher ratings than other teachers. Those who teach large classes have some slight disadvantage to their ratings (Smith and Glass, 1980) and, over large populations, ratings of students are generally consistent with those of alumni, colleagues and administrators. Those who are productive in research are generally rated more highly in classes they teach, but the relationship is so slight that research productivity is useless as a predictor of student satisfaction with teaching. The higher professorial ranks have slightly better student satisfaction, but the relationship is likewise so weak that rank cannot be used as any predictor. The relationship between grade expectations and student satisfaction is weak.
Relationships with Student Evaluations: Correlations between Various Influences and Ratings of Overall Teaching Effectiveness (NR = Not Related at any Significance)
| FACTOR | CORRELATION WITH GLOBAL RATING |
|---|---|
| Sex of Instructor | NR |
| Sex of Student | NR |
| Level of Student | NR |
| Rank of Professor | 0.10 |
| Research productivity | 0.12 |
| Student's GPA | NR |
| Age of Student | NR |
| Age of Professor | NR |
| Time of day | NR |
| Class size | -0.18 |
| Student Motivation | 0.39 |
| Expected Grades | 0.12 |
| Course Level | 0.07 |
| Colleagues' Ratings | 0.48 to 0.69 |
| Administrators' Ratings | 0.47 to 0.62 |
| Alumni Ratings | 0.40 to 0.75 |
Table 3. Relationships of various factors to student ratings. Values come from various studies cited in Cashin (1988), with the exception of the relationship to sex of instructor, which comes from Basow and Silberg (1987), Marsh (1984), Feldman (1992), and Centra and Gaubatz (1998). Grades correlate with students' ratings, but this appears to be because students who learn more rate teachers and courses higher (Howard and Maxwell, 1982). The results are all outcomes based upon studies of large populations. Student motivation (willingness to participate actively in the learning process) has the greatest positive influence on student satisfaction of any instructional factor shown. Student ratings are also consistent with those of faculty colleagues and administrators, and the ratings remain consistent as students become alumni. Of interest is the fact that, in practice, administrators and colleagues spend little to no time in the classrooms from which these ratings are derived. Their only means of obtaining information are hearsay from the students or the results of the student evaluations. Thus, colleagues' and administrators' ratings are likely redundant with, rather than independently supportive of, the student evaluation ratings.
We know that formative evaluations, properly used, are highly beneficial. Abrami, Leventhal, and Perry (1982), Dunkin and Barnes (1986), Murray (1985), Stevens and Aleamoni (1985), and Cashin (1988) all show that formative evaluation, particularly with follow-up consultation, leads to improvement. Based upon a synthesis of 22 studies, Cohen (1980) showed that instructors with no student evaluations were rated at the 50th percentile at the end of a term; those who obtained student evaluation feedback were rated at the 58th percentile, and those who received feedback with follow-up consultation were rated at the 74th percentile. Practitioners are united in the view that follow-up consultations for individuals should occur in a neutral and supportive environment.
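One way to read these percentile figures is to convert them to standard-deviation units. The sketch below assumes that end-of-term ratings are roughly normally distributed; that assumption is made here for illustration and is not stated in the source, and `percentile_to_z` is a hypothetical helper name.

```python
from statistics import NormalDist


def percentile_to_z(percentile: float) -> float:
    """Convert a percentile (strictly between 0 and 100) to a z-score,
    assuming a standard normal distribution of ratings (an assumption)."""
    return NormalDist().inv_cdf(percentile / 100.0)


# Percentiles reported from Cohen (1980) for the three groups of instructors.
groups = {
    "no student evaluations": 50,
    "feedback only": 58,
    "feedback plus consultation": 74,
}
for name, pct in groups.items():
    print(f"{name}: {pct}th percentile = {percentile_to_z(pct):+.2f} SD")
```

Under that normality assumption, feedback alone corresponds to a shift of roughly a fifth of a standard deviation, and feedback with consultation to roughly two thirds of one.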
Given all of this, is there any basis for arguments against the value of student evaluations?
Evidence for the Contrary
Objective support for contrary beliefs also rests upon context of use and, paradoxically, on some of the same numerical analyses that support the value of student evaluations. The context is the fact that student evaluation data are routinely collected and used for the purpose of judging individuals. There is no credible evidence that such summative practices effectively improve either student learning or satisfaction. Critics note correctly that administrators fail to gather student ratings data under research conditions and don't use the data to address issues of educational research. Instead, the collectors of data use them to judge, reward, and punish individuals. A number of writers address weaknesses and misuses of student evaluations. Representative are Nerger and others (1995), Williams and Ceci (1997), Trout (1997), and Wilson (1998). They express discontent about student evaluations--usually because student evaluations express satisfaction or dissatisfaction based upon affective attributes rather than on learning or cognitive growth.
Critics correctly point out that evaluation of an individual tests a different hypothesis than deducing a general trend across a populace. Deducing a trend by fitting lines to a cluster gleaned from many individuals' varied performances poses a problem that differs from evaluating a single individual's performance, which is one point in the cluster. Figure 1 depicted the difference clearly. A measure that yields a "significant" correlation coefficient between student learning and summative evaluation scores over a large population of faculty (Cohen, 1981; d'Apollonia and Abrami, 1997) is not something that one can often reliably apply to an individual faculty member in a rank-salary-tenure decision. As Figure 1 shows, at correlation values of r = 0.5, almost none of the points actually fall on the line. These data confirm that efforts at predicting learning based on student ratings will more likely widely overestimate or underestimate the learning produced by an individual than pinpoint it accurately. The trends for the cluster are statistically reliable, but a trend used to evaluate an individual must not simply be valid and reliable; the trend also must reflect a high degree of predictability. Sufficient predictability to judge individuals responsibly simply isn't present at correlation values at or below r = 0.5.
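The limited predictability at r = 0.5 follows from two standard results for a least-squares line: the share of the variance in Y accounted for by X is r squared, and the scatter of points about the line, expressed as a fraction of the overall spread of Y, is the square root of (1 - r squared). The sketch below tabulates both for correlation values discussed in this review; the function names are hypothetical.

```python
import math


def variance_explained(r: float) -> float:
    """Fraction of the variance in Y accounted for by a linear fit on X."""
    return r * r


def residual_spread(r: float) -> float:
    """Standard deviation of points about the best-fit line,
    as a fraction of the standard deviation of Y."""
    return math.sqrt(1.0 - r * r)


# Correlation values that appear in this review.
for r in (0.12, 0.47, 0.50, 0.80):
    print(f"r = {r:.2f}: variance explained = {variance_explained(r):.0%}, "
          f"scatter about line = {residual_spread(r):.0%} of SD(Y)")
```

At r = 0.5, only a quarter of the variance in learning is shared with ratings, and the scatter about the line is still about 87% of the original spread, which is why a population-level trend gives little leverage on any single point.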
One reason that Cohen and Feldman resorted to meta-analysis was that smaller individual studies were producing conflicting conclusions. Critics point out that a correlative association between student evaluations and learning strong enough to allow a user to accurately deduce students' learning from a faculty member's ratings, if it existed, would have been discovered without need to resort to meta-analyses. Strong associations with high predictability do not require meta-analyses to discern them. They show reasonable consistency whether one uses a large or small study group to generate a database. Such is not the case with student evaluations. Application of student evaluations to deduce the career success of any single professor is a different challenge from statistically describing paired relationships within large databases. The same research that supports the validity of student ratings illuminates the dangers of trying to apply well-established associations, determined as valid on large populations, to individuals. Therefore, Seldin's (1993) statement: "...hundreds of studies have determined that student ratings are generally both reliable (yielding similar results consistently) and valid (measuring what the instrument is supposed to measure.)" accurately describes results on populations, but it has no business being used to justify applying trends to individuals.
Sources of conflict between pro and con advocates regarding use of student evaluations are based not merely on relationships between learning and evaluations but also on relationships between evaluations and other factors, some of which are affective and not clearly related to the mission or goals of education. Cohen's (1981) landmark paper provided an example when it indicated that particular affective practices also contributed to higher ratings ("Teacher rapport" r = 0.31). Naturally, the question arose "Can a teacher promote high ratings through emphasizing affective traits without really producing the kinds of learning in accord with goals and mission of an institution?"
Feldman (1986) showed that professors' personalities affect students' ratings of overall teaching effectiveness (Table 3). One striking aspect of Table 3 is the demonstration of how teachers tend not to see themselves as others see them. Of the personality traits, the only two traits that peers, students, and teachers agree upon as being of significant importance are enthusiasm and self-esteem, and students and peers give these much more importance than we tend to give them in ourselves.
Table 3. Overall Teaching Effectiveness and Personal Attributes of Professors (after Feldman, 1986)

| Personality trait | Importance as seen by self | Importance as seen by students | Importance as seen by peers |
|---|---|---|---|
| Self-esteem | 0.38 | 0.51 | not rated |
| Energy (enthusiasm) | 0.27 | 0.62 | 0.51 |
| Warmth | 0.15 | 0.55 | 0.50 |
| Cautiousness | -0.09 | -0.02 | -0.26 |
| Leadership | 0.07 | 0.56 | 0.48 |
| Sensitivity | 0.07 | 0.53 | 0.47 |
| Flexibility | 0.05 | 0.57 | 0.46 |
| Emotional stability | -0.02 | 0.47 | 0.54 |
| Friendliness | 0.04 | 0.42 | 0.49 |
| Neuroticism | -0.04 | -0.49 | -0.35 |
| Responsible/orderly | 0.06 | 0.31 | 0.25 |
| Brightness | -0.05 | 0.36 | 0.22 |
| Independence | -0.12 | 0.01 | 0.08 |
| Aggressiveness | 0.23 | 0.05 | 0.02 |

Any evaluation system that humiliates faculty rather than strengthens them will likely damage teaching through destroying self-esteem and enthusiasm on an institutional scale. The university leader who recognizes the significance of Feldman's research will do everything possible to nurture self-esteem and enthusiasm of faculty.
Perhaps the most heretical of all studies concerning affective influences on student evaluations was the famed "Dr. Fox experiment" (Naftulin, Ware and Donnelly, 1973), in which a hired actor posed as Doctor Fox and lectured to three groups of educators in a manner that was highly expressive but low in content. The groups, consisting of professors, professionals and administrators, gave satisfactory content marks to the actor, thus demonstrating a tremendously disturbing fact: even ratings by those who are above average in intelligence, trained in critical thinking and well educated will not always reveal whether a presenter delivered substantive educational value. Critics noted that if professionals could not make this distinction, then average undergraduates probably could not make it either. Could students know whether a professor was providing substantive content, or whether the content was current, or would they rate their professors more on expressiveness (or worse, entertainment value) than on content value? One implication from the study was that student ratings were not valid criteria to evaluate actual teaching effectiveness by lecture. The implication was argued based on similar data both pro (Ware and Williams, 1975) and con (Marsh and Ware, 1982). Marsh and Ware (1982) used factor analysis to divide "evaluation" into several dimensions and showed that, of the two most important influences, expressiveness (number 1 in importance) registered primarily through the rating of "Instructor Enthusiasm," whereas content coverage (number 2) was expressed through "Instructor Knowledge." The more comprehensive of the later studies (Perry, Abrami and Leventhal, 1979; Abrami, Leventhal and Perry, 1982 - see Dunkin and Barnes, 1986) respectively replicated the Dr. Fox experiment and analyzed data from their own and from 11 other studies.
They found that the effect of expressiveness alone on overall student ratings was "significant and reasonably large" whereas the effect of content alone was sadly "inconsistent and generally much smaller." However, on overall student achievement, content became significant and expressiveness became insignificant.
Critics of the Dr. Fox study speculate that such a ruse could not remain successful over an extended time and that students would eventually discover the hoax. Contradicting this speculation is an actual case study described in the first half of Generation X Goes to College by Peter Sacks. This autobiography of a tenure-track professor in an unnamed community college discloses the teacher initially finding himself in trouble with student evaluations. Sacks exploited affective factors to deliberately obtain higher evaluations, and described in detail how he did so in the chapter, "The Sandbox Experiment." Sacks obtained subsequent high evaluations through his efforts, but not through promotion of any learning outcomes. For years, he managed to deceive not only students but also peers and administrators, and he eventually received tenure based on higher student evaluations. Sacks demonstrated that a professor can emphasize particular practices that will change student ratings but not necessarily produce the best learning outcomes. His book is a brutal disclosure about himself and his institution. The case shows clearly that (1) a teacher can manipulate satisfaction without attending to students' learning, and (2) inept faculty peer reviewers and administrators promote the actions that Sacks chose by placing faculty careers in the hands of student raters.
The experience also made Sacks a powerful voice, not necessarily against the evaluations themselves, but rather in deprecation of the tyrannical ways in which administrators use these. Sacks describes his view in a sidebar in Trout (1997):
"Once employed as an innocuous tool for feedback about teaching, student surveys have evolved into surveillance and control devices for decisions about tenure and promotion. Add the consumeristic and entertainment values of the culture beyond academe and the result can be ugly: pandering teachers doing what's necessary to keep their student-consumers satisfied and their jobs secure."
Sacks' quotation captures a poignant source of conflict between those who champion student evaluations and those who deplore them: the vitriol in the conflict is not always about the research at all. Rather, the emotional polemics (typified by Fish, 2005) are about the use, and particularly the misuse, of evaluations (see also McKeachie and Kaplan, 1996). In the end, mindless misuse can lead to institution-wide erosion of self-esteem and enthusiasm, thereby destroying the very traits in instructors that Feldman's work (1986) shows are most important to their success. The practices through which student evaluations are used are, perhaps more often than not, directly at odds with the research on student evaluations. It is little wonder that the messy state of things led to the AERA panel in 2005 titled "Valid faculty evaluation data: are there any?" (Theall and others, 2005).
Part II. A Fractal Thinker Looks at Student Evaluations
Nature is full of fractal forms: trees, clouds, blood vessels, and landforms, to name a few. A fractal form is complex, but although it has the illusion of being randomly irregular and seemingly impossible to quantify at first sight, this intimidating complexity has an order within that provides a means to understand the form in surprising ways. Recursive operations on a small unit called a generator (Figure 2) produce the order found in fractal forms. Fractal forms have the characteristics of similarity when viewed at different scales, and predictable growth of a dimension such as length in accord with decreasing length of measuring tool.
Figure 2. Concept of a fractal form, in this case a branching network built from recursive operations on a generator--each branch of the Y being replaced with subsequent Y-shapes. (From Nuhfer, 2003a).
Fractals provide important insights to understanding much about education, because learning occurs by increasing the strength and numbers of synaptic connections (Leamnson, 1999), usually through repeated use. Exceptions to repetition occur when learning is accompanied by strong emotional-affective influences that seem to establish strong permanent connections instantly. Growth of these connections, much like growth of a tree, produces immensely complex forms by recursive growth of a simple generator into branching patterns (Nuhfer, 2003a; 2003b). Education is replete with fractal characteristics in both space and time, probably because neural networks, like blood vessels, are fractal networks. In space, physical brain changes include growth of such networks in the process of becoming educated. Many natural temporal patterns are fractal, and learning, too, is a product of a series of events in time. Relationships between student ratings and other measures are myriad and complex. Student ratings, as well as all other educational endeavors, arise from the brain's branching neural networks. Thinking in terms of fractals proves useful to many educational endeavors (Figure 3; see also Nuhfer, 2003a-c, 2004a-c, 2005a-c; Nuhfer and Adkison, 2003; Nuhfer, Krest and Handelsman, 2004; Nuhfer, Leonard, and Akersten, 2004). This paper concerns the specific case of student evaluations.
Figure 3. Generator for a professional application in college teaching (after Nuhfer, 2003a) deduced as result of years of design for faculty development. Note that the base, self-introspection, addresses primarily affective attributes. This model recognizes that all cognitive choices in selection of content, pedagogy, levels of thinking, rubrics and exercises to produce student self-assessment are rooted in and connect with affective feelings. All course products likewise manifest such feelings. Nuhfer attaches extreme importance to the generator in recognizing how earliest experiences and practices shape course outcomes and products in major ways.
The nature of learning and teaching (which is learned) involves both cognitive and affective imprints in the brain that begin to develop together from at least as early as birth. The cognitive inexorably communicates, both verbally and non-verbally, with the affective regions of the brain (Nuhfer, 2005c). One of the major obstacles in resolving the student ratings controversies lies in the inability to perceive the role and magnitude of the affective components. At times, those who deprecate the value of student evaluations and advocates for student evaluations seem unable to really hear one another. In itself, this situation is possibly a function of the affective domain, where emotional regions of the brain communicate with and influence the cognitive domain.
The missionary zeal inflicted on faculty by advocates for the use of student ratings has probably accounted for much faculty resistance. Sometimes individual faculty objections to student evaluations derive from actual personal experiences. The predictive character of correlations in the range of r = 0.5 and less is such that many individuals' situations won't "plot on the line." Despite this reality, advocates of student evaluations sometimes act as if the dominant positive trend precludes any possibility of exceptions. They label faculty disclosures of true events as "anecdotal," "misbeliefs" or even "myths." Some researchers (e.g. Boice, 1990) suggest that "countering" constitutes the proper response to such disclosures. "Countering" might be well suited to arguing about trends in general populations, but "countering" in the context of faculty development is an exercise unbridled by emotional intelligence. It denies the reality of individuals, and will be perceived as humiliating by faculty seeking help. Heavy-handed "countering" produces the undesired effect of scorning individuals' actual hardships and experiences. The skills needed to be a great researcher in education, psychology, or student ratings are not the same as those required to help an individual become a stronger teacher. Other traits, many of them affective, are far more important in campus leaders than the ability to argue.
There are many reasons why experiences with student evaluations tie strongly to affective domains. From the standpoint of faculty, any member who has received abusive ratings, been humiliated by inept uses of these ratings, or both, will naturally react against arguments perceived as advocacy for putting herself or himself into an endless cycle of abusive experiences. On the other hand, researchers of evaluations may press for faculty evaluation simply because of researchers' desires for their own research to be influential.
The ability to evaluate faculty provides a differential in which administrators have power over faculty. The tool that is easier to acquire than perhaps any other, and that gives administrators more power than anything else to affect faculty lives outside of the workplace, is the evaluative rating by students. Any intimation that administrators may be wielding such power badly or irresponsibly, or are not even doing real evaluation, will trigger the affective domains of administrators. They may simply discount the evidence based on affective perceptions that it is threatening to both them and their positions. Experts who sell evaluation forms and workshops that promote student evaluation may have built the most impervious of all affective defenses against seeing student evaluations as just one of a number of possible multiple measures. A characteristic of a fractal thinker is perpetual awareness that the ratings debate is not only about an objective application of research findings.
Affective feelings largely control responses to rating items associated with the general experience with the class. The instructor is surely a major contributor to such feelings (Marsh, 1982), but not the only one. Perception of a course experience depends heavily on what individual students bring to class pre-wired within their neural networks as expectations and levels of intellectual sophistication. Bimodal distributions on ratings showing a "love-hate" division within a single class produced by students who have undergone the same educational experience are common. These reveal that a rating provided by a student is as much about the student as it is about the experience. A misapplication of student evaluations in using ratings alone to judge good teaching is usually based on the rash presumption that both teaching and teachers' ratings involve only the cognitive domain (see Nuhfer, 2004a, 2005c). The presumption implies that student ratings reflect the instructor's ability as a teacher to meet assigned duties, the primary one of which is to produce beneficial cognitive growth in students. In contrast, a fractal thinker anticipates that a summative student evaluation will be a very honest expression of satisfaction, which is largely an affective trait that develops along with cognitive growth as a part of the course experience.
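The point about bimodal "love-hate" distributions can be illustrated with a small sketch. The Python example below uses two invented sets of ratings on a 1-to-5 scale (hypothetical data, not drawn from any study cited here) to show that a class average alone cannot distinguish a sharply divided class from a uniformly lukewarm one. The function name `summarize` is likewise an assumption made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize(ratings):
    """Return the mean, the spread, and the count of each rating value."""
    counts = Counter(ratings)
    return {
        "mean": mean(ratings),
        "spread": pstdev(ratings),           # population standard deviation
        "counts": dict(sorted(counts.items())),
    }


# Hypothetical classes: one split between extremes, one uniformly neutral.
love_hate = [1, 1, 1, 1, 5, 5, 5, 5]
lukewarm = [3, 3, 3, 3, 3, 3, 3, 3]

if __name__ == "__main__":
    for label, ratings in (("love-hate", love_hate), ("lukewarm", lukewarm)):
        print(label, summarize(ratings))
```

Both hypothetical classes produce the same average, yet the distributions describe very different course experiences. Reporting the full distribution alongside the mean is one modest way to keep a single number from hiding what the students actually said.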
Fractals provide a particularly damning exposure of the practice of employing single global questions as the dominant basis for an evaluation. Because teaching practice is learned behavior, an evaluation of teaching is an evaluation of the neural networks a faculty member has developed to conduct those practices. The neural network is fractal, so our problem of faculty evaluation can be framed as one of understanding a complex fractal form.
The fractal dimension is one of the essential manifestations required to describe a fractal form. Derivation of a fractal dimension requires multiple measures taken at different scales. A trait of a fractal form is that its dimensions increase in a surprisingly predictable way, depending upon the length of the measuring instrument one uses to measure the form. The fact that something would change in length depending upon the length of our measuring sticks defies our common sense, but that is exactly how fractals behave. For instance, take the profile of a coastline (Figure 4) on a large map. If one measures its length with a pair of dividers that are set initially at, say, four inches, and then again at two inches, one inch, a half-inch and so forth, each measured length of the coastline will increase as the dividers get smaller. To some, it may seem obvious that if one takes a very crooked line (like a coastline) and measures it first crudely and then more precisely, its length will increase. However, the length of a fractal form doesn't just increase; it increases with such regularity that one can predict accurately what length one will measure based upon any setting of the divider (Figure 5). This growth will be so regular that it will plot as a straight line when log(L) is plotted against log(1/r), where L = length derived from the number of divider widths required to measure the length of the feature and r = the width setting of the divider. The slope of that line, defined in a single number, is an expression of the fractal dimension for that landform. This provides a concise description expressed in numbers to distinguish one kind of coastline from another.
Figure 4. Measuring a fractal coastline several times by choosing first narrow and then wide divider widths, walking the divider along the coast and summing the number of divider steps and multiplying by the width to get the length of the coastline. The length of the coastline increases as divider length decreases (see Figure 5).
Figure 5. Understanding the fractal dimension of a coastline requires multiple measures, not just one. Note how the order of a fractal form becomes apparent as the coastal length changes almost in a perfectly predictable manner as the divider lengths are changed. This depiction of a fractal form was inspired by Benoit Mandelbrot's article in Science, 1967, "How long is the Coast of Britain?"
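The divider method described above can be sketched in a few lines of Python. Because no coastline data accompany this paper, the example substitutes an idealized Koch curve (an assumption made here purely for illustration), for which a divider of width (1/3)^k takes exactly 4^k steps. The function names are hypothetical. The sketch fits a straight line to log(L) against log(1/r) and then uses the standard relation that the fractal dimension of a curve equals one plus that slope.

```python
import math


def koch_divider_measurements(levels):
    """Idealized divider walk along a Koch curve spanning unit distance.

    At level k the divider width is r = (1/3)**k and it takes 4**k steps,
    so the measured length is L = 4**k * r. Returns (r, L) pairs.
    """
    measurements = []
    for k in range(levels + 1):
        r = 3.0 ** -k
        steps = 4 ** k
        measurements.append((r, steps * r))
    return measurements


def loglog_slope(measurements):
    """Least-squares slope of log(L) plotted against log(1/r)."""
    xs = [math.log(1.0 / r) for r, _ in measurements]
    ys = [math.log(length) for _, length in measurements]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    data = koch_divider_measurements(6)
    slope = loglog_slope(data)
    print(f"slope of log(L) vs log(1/r): {slope:.4f}")
    print(f"fractal dimension (1 + slope): {1.0 + slope:.4f}")
```

The fit needs several (r, L) pairs; a single measurement gives one point and no slope at all. That is the numerical form of the argument that a fractal form cannot be characterized by a single measure. Real coastlines scatter about the line rather than falling exactly on it, as Figure 5 suggests.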
To capture the insight of fractal dimension, there is no substitute for multiple measures; no single measure can capture the quality of a fractal form. Now, consider the practice of trying to measure teaching effectiveness with a single measure, or worse, by averages of responses to a single question/item. Chances of being able to capture teaching effectiveness through such an effort are much less than being able to capture the length of a coastline; the neural networks that govern teaching are much more complex than any line described by the interface of land and water.
All substantive classifications of patterns of thinking and acting have been demonstrated to require multiple measures. Consider the established patterns of learning styles, multiple intelligences and personality types. The way all originators who developed such tools were forced to classify individuals was through a battery of survey items--many measures. Not one of these researchers (David Kolb, Howard Gardner, or Isabel Myers and Katharine Briggs) was able to succeed in classifications based upon one question or one question for each of their types. To a fractal thinker, it is obvious why all such classifications required multiple measures. Without being overtly aware of the fact, all these researchers were trying to characterize fractal forms of particular neural networks. There was no way any of them could possibly succeed without multiple measures.
It should be equally obvious why attempting to evaluate teaching (more honestly in practice, teachers) based on student ratings alone, or worse, a single global item from a survey, is an approach to evaluation that is bankrupt from the outset. The nature of what we must evaluate simply cannot be captured with any single measure. Capturing successful teaching is likely an endeavor at understanding neural networks that are far more complex than those involved in personality types, learning styles or intelligence type. All three of these and more are almost certainly involved.
The arguments of relative merits of formative and summative evaluation are reminiscent of the affective "loyalties" associated within influential camps, wherein one might see one's classification of learning styles as better than another or multiple intelligences as more valuable than personality types. Fractal thinkers see these as merely different, multiple measures. Figure 6 is an example of a fractal thinker's perception of multiple intelligences and personality types.

Figure 6. A model in which multiple measures deduce "type" by sampling neural networks through surveys with batteries of different items. Although some claim their scheme is unique and different from others, the fractal thinker questions the purity of that claim. The "types" appear deduced by different samplings of the affective and cognitive neural networks that create choices of response. There is likely some overlap in the samplings. The idea of exclusivity in "type" appears absurd when looked at in this way. For example, all people with normally functioning brains likely have a measurable personality type and a measurable intelligence. These coexist easily.
A similar conceptualization can depict the summative versus formative character of student evaluations. Figure 7 displays a single global measure (summative) versus a diagnostic battery of multiple practices that formative tools try to capture. It depicts formative and summative data as different measures derived through a general assessment of students' satisfaction and a pedagogical profile of practices. Seen through the lens of reasoning through awareness of fractals, the two should provide separate but related useful information. This suspicion is substantiated by much of the research cited above.
Figure 7. Comparison of sampling of a summative item: "Overall this was an excellent course;" with a battery of formative measures such as "Uses examples and illustrations," and "States objectives of each class session." The global item is likely to trigger a response mainly from general feelings that arise in the limbic area along with some cognitive recollection. The items on the survey are more aligned to elicit a response that is more dominated by the cognitive, but nevertheless remains tied to affective feelings. Of importance here is that the global summative question taps information that the formative does not, and vice versa. Both are useful multiple measures, but neither is sufficient in itself for making judgmental ratings of individual professors.
Formative efforts to define, disclose, and improve weaknesses along with consultations with a developer to gain improvements need to take place in confidence. The practices of formative evaluation have traditionally omitted formative questions from summative surveys. From a fractal thinker's viewpoint, this convention is unwise, because of the vast research that's known about the positive effects of a variety of practices on student learning. If students confirm that good practices are in fact at work in the classes, then this "pedagogical fingerprint" is probably even more valuable in deducing the quality of teaching than a satisfaction rating. A fractal perspective confirms a need for full employment of multiple measures and refutes the summative practices that bow to convenience of "evaluation" afforded by global summative questions alone.
The assessment movement appears to have incorporated fractal thinking much better than those too caught up in arguing merits of student evaluations. Assessment refuses to accept convenient single measures as adequate, and assessment is beginning to get educators out of the rut of thinking myopically at just the level of evaluating courses and individuals. Fractal thinking (Figure 3) recognizes that individuals are part of something larger and that education is more than courses. Such recognition constitutes awareness of assessment. College administrators who have a history of practicing trying to understand a complex entity such as "good teaching" through single measures have likely adopted a neural network whose conceptual processing likewise fails to appreciate the need to capture student learning through multiple measures. This may account for the fact that assessment persists as the most vulnerable area to criticisms from accreditation agencies, which hold the chief executive officers on campus responsible.
Student evaluations are an evaluative exercise.
The ability of students to evaluate faculty is rarely considered in terms of the research on student thinking. The research on levels of thinking is extensive and it is as solid as any on evaluative tools. The best-known model is that of Perry (1999). Perry deduced nine levels of thinking, six of which apply to undergraduates and have been repeatedly corroborated by others' work (see Nuhfer and Pavelich, 2001, 2002). The average high school graduate reasons at about a level of 3.7 on Perry's 9-point scale. The average college graduate reasons at about a level of 4. There is scarcely any gain between high school and college in the ability to reason at higher levels or to think reflectively. Reasons for this are outside this discussion. A characteristic of levels below 5 is that students may do evaluative thinking but do it poorly. When a student evaluates a professor or a course, this is an exercise in evaluative thinking. The degree to which individual students can do this well does indeed depend upon the level of thinking each has reached. Those who argue that their undergraduates can do faculty evaluation well may, depending upon their institution, be unwittingly arguing against a massive amount of research revealing knowledge about intellectual development.
The research on levels of thinking differs in an important way from the research on multiple intelligences, learning styles, etc., where individuals create useful classifications through different tools based upon different objectives and considerations. In contrast, researchers who study levels of thinking independently and with some very differently held values nevertheless produce a remarkable concurrence about discernible differences between specific thinking levels. Leamnson (1999) captured this distinction by referring to the different classifications produced by tools as "inventions," whereas the repeatable concurrence typified by disparate studies on levels of thinking would fall more into the realm of "discovery."
Thin slices and other affective manifestations
One of the most surprising findings about the power of the affective domain came from Ambady and Rosenthal's (1993) "thin slice" studies. These researchers determined that the ratings students gave teachers after watching 30 seconds of silent, content-free video were highly consistent (r = 0.76) with end-of-semester ratings. Further, viewing of several 3-second video segments yielded only somewhat lower correlations (r = 0.68). Both values are higher than the established relationship (r = 0.5) between learning and ratings. Certainly, content-free video clips observed for a few seconds cannot confer learning, and these correlations are not reasonably explained as arising from cognitive growth. An explanation is that affective reactions form neural networks quickly, stabilize early and persist to the end of the course, where they manifest as a rating of the professor.
This is a find truly appealing to a fractal thinker's fancy! The first class period establishes a generator in the mind of the student--likely even unconsciously, and the character of this generator should persist in subsequent growth that the neural network produces during the course. The thin-slices finding isn't so surprising to a fractal thinker, but the strength of the generator and the incredible speed with which it forms are astounding. It shows particularly why the first day of class should be carefully planned in accord with one's highest aspirations and teaching philosophy.
Recent work underscores other affective influences on student evaluations and observations that fit well with a fractal thinker's perspective. University of Texas economist Daniel Hamermesh (Hamermesh and Parker, 2004) recently verified an influential factor behind student ratings, the beauty of the professor: "Instructors who are viewed as better looking receive higher instructional ratings, with the impact of a move from the 10th to the 90th percentile of beauty being substantial."
Communication specialists also confirm the influence of affective "nonverbal immediacy" on ratings (Smythe and Hess, 2005). Because affective reactions are intrinsic to the very discussion of the merits of student evaluations, researchers who report affective factors' strong influence on student ratings do so at some peril. "True believers" in student ratings invariably attack such studies as "biased," "erroneous" and/or "methodologically flawed." Responses to the work of Hamermesh and Parker posted on the POD Network reveal how affective hostile responses can extend past the work to the authors themselves:
"... deprecation of student eval's. *may* be the intent of the economist-authors. (Who knows, maybe the researchers themselves are not "beautiful," received low ratings, were denied raises, etc.; I empathize/sympathize)"
In the case cited above, Hamermesh, a well-respected teacher and researcher, had a long prior record of researching the influence of personal appearance on job success in many settings. The "deprecation" of student evaluations was obviously not on the personal agendas of these researchers. Their research, though unappreciated, reported real events. However, the comment in response shows the power of the affective over both reason and civility when this particular topic arises. The topic of student evaluations is accompanied by emotions that few, if any, other areas of academic research can match.
To a fractal thinker, the discovery of powerful affective influences is neither disturbing, unanticipated, nor a detraction from sound discoveries linking ratings to cognitive factors. Because student evaluations tap both affective and cognitive neural components, it would be surprising if explored relationships between affective attributes and student ratings showed no strong relationships. Statistical practices such as factor analyses and regression that focused on cognitive attributes seem to have been interpreted by some as indicative that there can be no powerful influences outside the factors studied. The reason more affective influences were not found by early researchers, whom advocates regard as orthodox, is not that affective influences are not real. Rather, it is because early researchers didn't look for them or take them seriously. We should hardly expect all meaningful data on student evaluations to arise only from the cognitive domain or for student evaluations to be understood completely through pedagogical practices and exam scores. Such an expectation runs counter to everything known about how the brain learns and operates.
How Can We Use Evaluations More Effectively?
Use multiple measures. NEVER evaluate faculty based on summative ratings alone
By default, student evaluations often become the sole basis for career decisions. Although data provided by them is meaningful and convenient to use, evaluations based on such data alone are wholly inadequate. One reason that multiple measures are not used in practice is because even when researchers on student evaluations advise employment of multiple measures, no measure other than the summative student evaluation tool is specifically recommended. Here, I recommend that the student input consist of three specific measures (Figure 8): (1) summative ratings as a measure of student satisfaction, (2) formative survey information to provide a picture of pedagogical practices and (3) knowledge survey data to provide information on content, levels of challenge and a student-based report on their learning.
Knowledge surveys (Nuhfer, 1993, 1995; Wirth, Perkins and Nuhfer, 2005; Nuhfer and Knipp, 2003; see the link in References Cited, where knowledge surveys are detailed) provide an additional source of detailed information gathered from students on perceptions of their learning. Knowledge surveys are an assessment tool that gathers information bridging what tests yield and what student evaluations yield. Statistically, knowledge surveys prove to be highly reliable (Figure 9). They are ideal for supporting learning in any course, and the student responses yield a wealth of both formative and summative information.
Assessment of student learning, not faculty ratings, is the major outcome now demanded by evaluators and accreditors. Traditionally, school administrators have generally been more engaged in rating faculty than in improving student learning. Although some advocates of student evaluations argue that student learning should not be a component of faculty evaluation, student learning is the only outcome that, in itself, can justify the maintenance of educational institutions. Faculty are the primary group responsible both for the curriculum and for the maintenance of an environment conducive to learning. In evaluating faculty, programs and institutions, it is simply good management to monitor learning outcomes at all scales. As an outcome, student learning is more important than student satisfaction, and to act otherwise communicates that happy customers are more important than a learning community. Such an approach conveys that degrees and learning can be bought by a "customer" but perhaps need not be earned by a student. Treating students as "customers" obfuscates responsibilities and subverts the core goals and learning missions of colleges and universities.
Regarding the "student as customer" model, chemist Mike Chejlava at Lafayette College notes that such "...forces lead students to believe that they must get the RIGHT answer the FIRST time ...(and) any faculty member who gives work that they cannot master the first time is trying to keep them from their goals by setting standards too high." In her dissertation, "Bridging the Gap Between What is Praised and What is Practiced: Supporting the Work of Change as Anatomy & Physiology Instructors Introduce Instructional Strategies to Promote Student Active Learning in Undergraduate Classrooms," Thorn (2003 & personal communication) revealed that all instructors in her study received lower student evaluations while attempting to emphasize critical thinking. One, called to task by her dean, was ordered to stop that emphasis because of low satisfaction ratings. Weimer (2002) is refreshingly candid in revealing that effective learner-centered practices will not receive initial appreciation from students. She stops short of stating that this "resistance" may express itself through lowered global evaluations of the faculty member who introduces such practices. Peter Sacks entered the "Generation X" college with the original intent of educating students but, like the instructors in the examples above, learned that it was safer to please students than to educate them. Student evaluations were the lever that pressured Sacks to change his goals, and it is likely that faculty such as those in Thorn's (2003) study face similar pressure. Inept use of evaluation, which forces faculty to please "students as customers," does considerable damage both to education and to individuals. The assessment movement, particularly its emphasis on direct assessment of student learning, offers the most promising road out of the misuse and abuse of student evaluations (see Huba and Freed, 2000).
The greatest reason for professors to embrace assessment (looking at the work that is done and the student learning that results) is to extricate themselves from the morass of having their livelihoods depend upon ratings that reflect how others "feel" about them.