My thoughts on improving the quality and effectiveness of student evaluation of teaching (student feedback)

21 June 2020

Towards a more meaningful evaluation of University lecturers

THILO HAGEN

As most lecturers can attest, teaching feedback can be frustrating. For instance, over the years I have consistently improved the feedback I provide to students. I now upload the answers and explanations to all my assessments, including the final examination, immediately after the test. I try my best to respond to all student emails in a timely manner, and I even send personalized feedback emails to all students at the end of the semester. Yet my scores on the student feedback question “The teacher provided timely and useful feedback.” have shown little to no change over the years and in fact consistently trail my other feedback scores.

This example highlights what has been confirmed by many studies: student evaluation of teaching (SET) is commonly not an accurate reflection of a lecturer’s teaching performance. This is problematic because Universities need to know how their lecturers perform in their teaching. They need to know what lecturers are doing well and where improvements are necessary. Having this information is essential to bring about change in teaching, to address shortcomings and to incentivise lecturers to improve. In addition, the information derived from SET is used to appraise lecturers and to make important administrative decisions. As such, there is a great need for high-quality and accurate teacher evaluation.

In principle, there are a number of ways through which teaching performance can be measured, each with its own shortcomings.

  1. Measuring the quantity (in terms of so-called teaching hours) of how much someone teaches:

This says nothing about the quality of teaching. Teaching hours are also a very inaccurate measure of teaching effort, as lecturers can spend vastly different amounts of time preparing classes and following up on taught content with assignments or other activities.

  2. Evaluation by peers (i.e. other lecturers):

This is a rather subjective evaluation that depends on individual lecturers’ experience, standards and their own views of what constitutes effective teaching. Indeed, it has been found that teachers commonly compare their peers’ teaching strategies to their own practice and judge positively only those practices which agree with their own preferences (Quinlan 2002; Courneya et al. 2008).

  3. Survey-based student evaluation of teaching (SET):

In this most common type of feedback, students usually award scores in response to specific questions and often also provide verbal comments. Numerous problems related to biasing and confounding factors in SET have been highlighted, as summarized by Spooren et al. (2013) in their SET meta-analysis. Based on my own experience, one main problem with SET is that many students award scores in a very narrow range. Students tend to give a default score of around 80% (e.g. 4 out of 5) and only deviate from it in extreme cases, where a lecturer is very obviously above or below the average. As a result, teacher scores are often nearly indistinguishable and do not really reflect differences in teaching quality.

There are likely several reasons for this student scoring behavior, including difficulties in giving absolute scores (not knowing the benchmarks or expectations), scoring apathy because there are too many lecturers to evaluate and questions to answer, not wanting to ‘hurt’ the lecturer, or completing the evaluation only because it is linked to some reward or benefit for the students.

However, the main reason for the narrow range of SET scores is likely that students do not put much effort into the evaluation. Much research has been done to identify reasons for the lack of student motivation to provide meaningful feedback. Dunegan and Hrivnak (2003) point out that there is a lack of incentives for students to invest significant effort into SET. Students see no obvious advantage in completing the survey thoroughly, and there are no negative consequences if they do not invest time and effort and only give “default” scores. In addition, students generally see no evidence that their feedback has any impact. They do not see any changes as a result of their feedback; in fact, they do not even see the results of their feedback. As such, it comes as no surprise that students have been found to have little confidence that their evaluations are actually taken into account by teachers (Spencer & Schmelkin, 2002).

The lack of student motivation could result in students completing SET without much cognitive engagement. Indeed, Dunegan and Hrivnak (2003) found in their study of student feedback using traditional SET questionnaires that students commonly complete SET in a mindless manner, unless the performance of the teacher clearly deviates from the students’ expectations of a good teacher. In other words, if the performance of a lecturer is generally in line with their expectation of a good teacher, the students do not put much effort into the evaluation and likely award default marks. Such student scoring behavior is a problem because it makes it difficult to distinguish lecturers in the good to exceptional range, which normally comprises the majority of lecturers.

Another prevalent problem is the type of questions used to evaluate lecturers. Often these questions do not really address the pedagogical approaches used by the lecturer and the quality of the actual teaching, but instead try to measure teaching/learning outcomes. This is problematic because teaching/learning outcomes are actually very difficult for students to evaluate and score. For instance, asking whether the lecturer has promoted critical thinking during classes (approach-oriented) is relatively easy for students to answer. In contrast, asking whether the lecturer has enhanced the student’s thinking ability (outcome-oriented) is very hard for a student to determine.

An excellent example of the difficulties that students have in judging learning outcomes is a recent study by Deslauriers et al. (2019) of students taking an introductory college physics course. The authors compared students’ perception of learning with their actual learning when exposed to two different pedagogical approaches, active versus passive learning. They demonstrated that the students perceived active learning to be inferior to a well-delivered passive lecture in terms of how much they had learned. The increased cognitive engagement with problems and intellectual challenges during the active learning process caused the students to feel that they learned less than when readily accessible concepts were taught using passive learning approaches. However, objective evaluation of student learning showed the opposite to be true: active learning was more, not less, effective in promoting student learning, consistent with the overwhelming evidence in the literature. Hence, students appear to have a poor ability to judge learning outcomes.

Further adding to the problem of inaccurate student feedback is the fact that survey questions are sometimes not very clear. For instance, asking if the teacher is effective is rather vague: effective in what? Furthermore, even asking if the teacher is effective in enhancing a certain quality, for instance in promoting analytical thinking, is again an outcome-based question that may be difficult for students to evaluate objectively. In contrast, asking the students if the lecturer has introduced analytical thinking tasks into the classes, assignments and assessments is something that the students can evaluate.

Another common question is whether the lecturer has increased the students’ interest in the subject. There are several problems with this question. Based on my personal communications, raising students’ interest is not a common desired learning outcome in most modules, and hence the question is not really aligned with what most modules are trying to achieve. Interest is also very subjective: some students may already be very interested in the topic before the module. If a student did not become more interested in a subject, does that mean he or she did not learn much? Not necessarily. And finally, asking if the lecturer has increased the students’ interest in the subject does not reveal anything specific about what the teacher has done, as there are many ways to increase someone’s interest.

In summary, there appear to be three main problems in the commonly administered SET exercises. Firstly, student evaluations lack discriminating power. Secondly, students often lack motivation and incentives to carry out the SET in a meaningful manner. Finally, the types of questions in SET questionnaires are often not approach-oriented, but inquire about outcomes, which are difficult for students to evaluate.

How to overcome these problems?

Since our teaching targets students, I do believe that students are best placed to evaluate lecturers. If one were to view education as a product and draw comparisons to industry, it is clear that companies evaluate their products mainly based on how many consumers buy them. Although it is helpful if a product receives great reviews from experts, this does not matter in the end if consumers do not value and buy the product.

If we compare “consumer” feedback in the commercial sector and in the education sector, there is one major difference. In the commercial world, the consumer compares products and chooses the best one to spend their money on. In contrast, when students evaluate lecturers they commonly give marks to each lecturer, and could potentially give the same marks to many lecturers (and often do). The approach used to evaluate products in the commercial sector is known as ordinal (different items are ranked against each other according to some criterion), whereas the method by which students evaluate lecturers is referred to as cardinal (each lecturer is evaluated independently on an absolute scale).

When trying to buy a product, ordinal evaluation makes intuitive sense. It is much easier to compare two products and pick the better one than to evaluate each product separately and then decide which received the better evaluation. In fact, when shopping, people often hold two items side by side in order to compare them directly.

There is also some scientific evidence in favor of ordinal grading by students in peer grading in the education setting. As discussed by McKinney and Niese (2017), when students grade their peers, ordinal grading usually outperforms cardinal grading in terms of accuracy. As pointed out by Billington (1997), while students have difficulty providing quantitative scores and often overrate, they can rank well. One important advantage of ordinal grading is that it reduces the evaluation of the lecturers that students have encountered during a semester to ranking the best lecturers, as opposed to assigning scores to all lecturers. This would decrease the students’ workload. At the same time, ranking lecturers according to specific criteria requires significantly more cognitive processing than awarding scores to all lecturers. Ordinal feedback would therefore be expected to lower student fatigue in evaluating lecturers and discourage mindless grading of lecturer performance. Hence, ordinal grading might be a way to address the normally large number of so-called “lazy graders” (Shah et al. 2013), which are a major reason why it is difficult to distinguish between the teaching performance of lecturers. As discussed by Shah et al. (2013), when comparing performance (e.g. of lecturers), student evaluators may also be able to offer more insightful comments on the positive and negative aspects of the teaching performance. Furthermore, students will preferentially remember the good lecturers and their teaching approaches. Hence, it makes sense to ask them about the information that they are best able to provide.
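To make the ordinal idea more concrete, the short Python sketch below shows one way in which ranked nominations could be aggregated into an overall ordering. Everything in it is a hypothetical illustration rather than part of any existing SET system: the lecturer names, the sample responses and the simple 3/2/1 Borda-style weighting of first, second and third choices are all assumptions.

from collections import Counter

# Hypothetical ordinal SET responses: each student names up to three
# lecturers (best first) for one criterion, e.g. "most innovative teaching".
# Students are taught by different lecturer combinations, so the lists differ.
responses = [
    ["Dr Lim", "Dr Tan", "Dr Ahmed"],
    ["Dr Tan", "Dr Lim"],
    ["Dr Ahmed", "Dr Lim", "Dr Wong"],
    ["Dr Lim", "Dr Wong"],
]

# Assumed Borda-style weights: 1st choice = 3 points, 2nd = 2, 3rd = 1.
WEIGHTS = [3, 2, 1]

def rank_lecturers(responses):
    """Aggregate ranked nominations into an overall ordering."""
    points = Counter()
    nominations = Counter()
    for ranking in responses:
        for position, lecturer in enumerate(ranking[:len(WEIGHTS)]):
            points[lecturer] += WEIGHTS[position]
            nominations[lecturer] += 1
    # Sort by total points, breaking ties by the number of nominations.
    return sorted(points, key=lambda l: (points[l], nominations[l]), reverse=True)

print(rank_lecturers(responses))
# ['Dr Lim', 'Dr Tan', 'Dr Ahmed', 'Dr Wong'] for the toy data above

Simply counting nominations would also work; weighting the first choice more strongly merely preserves a little more of the ranking information that students provide.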

How to increase the motivation of students to provide meaningful SET?

As discussed above, strategies to simplify the SET and improve students’ cognitive engagement would most likely be helpful. However, these strategies by themselves would not provide actual incentives. Without a doubt, the most important incentive for students would be to know and experience that their feedback has a real impact. Ideally, the students would see changes in the teaching itself. However, SET is usually administered at the end of a course, and as such this is difficult to achieve. A more realistic idea is to publicize the results of the student feedback (in terms of the best-performing lecturers). Alternatively, Universities could establish a direct link between the students’ feedback votes and the teaching awards given. It is important that this link is apparent to the students. Another strategy would be to encourage lecturers to discuss the previous semester’s course feedback at the beginning of a new course and emphasize which changes have been introduced as a result of previous student feedback. Through these suggestions, it might be possible to make the students realize that their opinions count, which is a prerequisite for meaningful SET.

One important problem is that many teacher feedback surveys favor lecturers who please students by making the module easy. Conversely, feedback surveys can disfavor lecturers who put in extra effort to introduce new elements into their teaching, elements which may not be successful during the first round of implementation and may require further optimization. Introducing new elements, e.g. active learning based components, may also result in lower feedback scores because students may perceive these components as additional work.

To address these issues and to also obtain more meaningful feedback, it is very important to ask about the information that one wants to obtain, as discussed above. For instance, if one wants to know if the lecturers have used active learning methods, questions about whether the teacher was effective or has increased the interest in the topic are not really helpful, as these questions are outcome-based and not concrete. Some examples of more concrete, process-related feedback questions are given below:

Informative example questions:

– Based on the lectures that you attended during the past semester, which lecturers have helped you most effectively to gain new knowledge and useful skills? (As with all the examples below, this question could be followed by the option to provide comments or examples.)

– Over the past semester, which lecturers have provided the best feedback to you (such as providing personal feedback or feedback via email, discussing student questions during the lectures or outside of class via email or other formats, providing model answers and explanations for assessments, promoting self-assessment)?

– Over the past semester, which lecturers have used the most innovative teaching methods?

– Over the past semester, which lecturers have used the most interesting and useful active learning methods? (Note that active learning may include for example problem-based lectures and assessments, independent learning assignments, use of class response systems or technology to make the learning more student-centric.)

– Bloom’s Taxonomy is a hierarchical ordering of cognitive skills, which comprises six levels: Level 1 (most basic) – Remember; Level 2 – Understand; Level 3 – Apply; Level 4 – Analyze; Level 5 – Evaluate; Level 6 – Create. Over the past semester, which lecturers have moved their teaching to the highest levels in this taxonomy?

How to practically implement these suggestions?

Changing the type of questions that students are asked during the teacher feedback evaluation is relatively easy. But is it feasible to change the evaluation method from cardinal to ordinal? One may argue that ordinal teacher evaluation is difficult because students enroll in different modules and are hence taught by different combinations of lecturers. Also, ordinal feedback may be unfair because lecturers teach different levels and class sizes. For instance, lecturers teaching higher-level modules generally receive higher feedback scores. On the other hand, teaching a larger number of students means that there is, in theory, a higher chance of being named a top lecturer in student evaluations. But on closer scrutiny, these are not really major concerns.

Let us assume an example whereby students select their top three lecturers for each evaluation question. Although each student is taught by a different combination of lecturers, this would likely average out over a large number of student responses. It is true that lecturers teaching large classes theoretically have a higher chance of being named among the top three lecturers. However, it is important to note that introducing active learning components into large-class modules is significantly more difficult and requires more effort and time than in smaller classes (in terms of class preparation, assessment and student evaluation or marking). It is also well known that feedback scores for large-class teaching are generally lower than for smaller classes, partially offsetting the theoretical advantage of receiving more student votes. Lastly, it is undeniable that teachers who teach large classes well ultimately have a greater impact than teachers who teach small classes well. This would probably partially justify a higher overall score for the large-class lecturer.

For example, take two lecturers, one teaching a year 1 or 2 undergraduate module with 200 students and the other a postgraduate module with 20 students. If both teach equally well, the lecturer teaching the 200-student undergraduate module would, in theory, be 10 times as likely to receive high student rankings. However, this is likely to be partially counterbalanced by the greater likelihood of obtaining good teaching feedback in small-class settings, as lecturers can dedicate more time to each student and build a closer relationship with the students. In small-class settings it is also easier for lecturers to introduce interesting assessment modes and to provide meaningful feedback.

Taking these factors into consideration, as well as the fact that an equally effective lecturer would have a greater impact in a large class than in a small class, different class sizes are unlikely to be a major obstacle to introducing ordinal ranking into lecturer evaluations. Nonetheless, it would be easy to analyze feedback for undergraduate and postgraduate teaching separately, or even to divide undergraduate teaching into different levels, in order to reduce the effects of different class sizes.
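As a minimal sketch of this last suggestion, the same hypothetical tally could simply be run separately for each teaching level, so that lecturers are only ranked against colleagues teaching comparable cohorts. The code below reuses the illustrative rank_lecturers helper from the earlier sketch; the level labels and sample responses are again purely assumed data.

from collections import defaultdict

# Hypothetical responses tagged with the level of the module in which the
# student encountered the lecturers (illustrative data only).
tagged_responses = [
    ("undergraduate", ["Dr Lim", "Dr Tan"]),
    ("undergraduate", ["Dr Tan", "Dr Lim", "Dr Ahmed"]),
    ("postgraduate", ["Dr Wong", "Dr Ahmed"]),
    ("postgraduate", ["Dr Ahmed", "Dr Wong"]),
]

# Group responses by level, then rank within each group so that small
# postgraduate classes are not compared directly with large undergraduate ones.
by_level = defaultdict(list)
for level, ranking in tagged_responses:
    by_level[level].append(ranking)

for level, level_responses in by_level.items():
    # rank_lecturers is the helper defined in the earlier sketch.
    print(level, rank_lecturers(level_responses))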

It is important to note that the proposed approach would only yield information about whom the students consider to be the best lecturers and the reasons for it. This is likely to be satisfactory for the purpose of teacher evaluation and appraisal, but it leaves two important issues unaddressed: How would the University become aware of poorly performing lecturers? And, even more importantly, how would lecturers obtain constructive feedback from students for the purpose of continuous improvement?

With regards to identifying lecturers with unsatisfactory teaching performance, it should be noted that in the commonly used SET questionnaires even poorly performing teachers often still receive reasonably good scores. Moreover, there can be many reasons for lower-than-average teaching scores that are not necessarily related to poor teaching. For instance, the lecturer may have experimented with introducing new elements into the teaching that the students perceived as too challenging or too time-consuming, or the newly introduced components may still require further optimization. Evaluation based on feedback scores may therefore flag lecturers as poorly performing when in fact they have been actively trying to improve their teaching and assessment methods and should be offered assistance to do so. Hence, the current system is not particularly good at identifying lecturers who need to improve their teaching.

Being a good teacher is not primarily about being able to teach well, but more about aiming for continuous improvement. Good teachers analyse their own and their students’ performance (see below), and based on this, try to improve. Good teachers keep up to date with current pedagogical and education related developments and try to actively improve their courses. As such, it would be much more meaningful for lecturers to perform and perhaps submit a self-reflection, in which they highlight how they have tried to improve their teaching practice, what new elements they have introduced into their teaching and assessment, and what the outcome of the introduced changes was.

Nonetheless, most Universities prefer quantitative assessments for appraisals and for tenure and promotion decisions. The most straightforward approach to address this would be to ask students to rank all their lecturers. This, however, may not be a very practical or popular approach. Alternatively, it would be feasible to ask students to identify lecturer(s) who have not performed to their expectations. For instance, when asking “Over the past semester, which lecturers have used the most innovative teaching methods?”, one could add a follow-up question: “Which lecturer(s) need improvement?” This would be easy to implement and would likely only be used by students who feel strongly dissatisfied with a lecturer’s performance.

In terms of obtaining formative feedback for the improvement of teaching, in common practice students provide both scores and comments on teaching performance in the form of a standardised questionnaire. Using the same questionnaire for both lecturer evaluation and formative feedback is problematic (Spooren et al., 2013). The type of questions one would ask for the purpose of evaluating teaching as opposed to improving teaching is very different, and well-meant formative feedback could be misinterpreted as poor teaching performance. There are also other reasons why standardised questionnaires have shortcomings. Firstly, as mentioned above, students have little confidence that their evaluations are actually taken into account by teachers (Spencer and Schmelkin, 2002). This is expected to limit the effort and quality of comments when students provide feedback. Secondly, Spooren et al. (2013) concluded, based on their meta-analysis of student evaluation of teaching, that University teachers make little or no use of student feedback. This may be in part because lecturers feel that the student feedback is not relevant and not focused, but consists essentially of random comments, with each student highlighting different issues. This makes it difficult to draw overall conclusions (Tiberius et al. 1987).

To address these shortcomings, it would likely be more useful if lecturers were to design and administer their own feedback surveys at the end of the semester. These surveys could be customised to contain the desired information specific to the teaching and assessment practices of the course. In fact, many teachers already conduct their own specific surveys, or they may like to, but refrain from doing so in order not to burden the students with an additional survey. The results of such customised surveys administered by individual lecturers would be highly useful for lecturers to reflect and improve their teaching and assessment approaches.

In summary, I propose a more goal-oriented SET approach, which takes into account the various purposes of teacher feedback. These include identifying the truly best educators, identifying lecturers who need to improve or may require assistance, and providing formative feedback for lecturers to improve their teaching and assessment.

To address these objectives, I firstly recommend replacing the current cardinal grading of lecturers with an ordinal system, in which students rank their best lecturers based on specific criteria. Importantly, these criteria should (i) be concrete, (ii) be aligned with the desired attributes of a good lecturer and (iii) be process-oriented rather than achievement-oriented (e.g. “The lecturer has incorporated critical thinking components into the lessons and assessments” rather than “The lecturer has improved my critical thinking ability”). In addition, the students could be given the opportunity to highlight lecturers who have not performed to their expectations with regards to the specific criteria. To increase student motivation to provide accurate feedback, student feedback should be directly linked to teaching awards and publicised in a transparent manner.

Secondly, to obtain meaningful formative feedback, lecturers should administer their own feedback surveys, tailored to the specific pedagogical approaches and learning outcomes of their modules, rather than relying on centralized uniform formative feedback questionnaires for all lecturers. Teachers should in some way be encouraged to reflect on the obtained feedback and address shortcomings. To increase student motivation to provide meaningful feedback, lecturers should be encouraged to discuss previous student feedback and changes that have been introduced as a result at the beginning of a course.

With these measures, it is hoped that a more meaningful student evaluation of teaching can be achieved. Lecturer evaluations do matter, and as such we should carefully consider how we can get the most out of the feedback that the students provide.

REFERENCES

Billington HL. Poster presentations and peer assessment: novel forms of evaluation and assessment. (1997) Journal of Biological Education 31:218–220

Courneya CA, Pratt DD, Collins J. Through what perspective do we judge the teaching of peers? (2008) Teaching and Teacher Education 24:69–79

Deslauriers L, McCarty LS, Miller K, Callaghan K, Kestin G. Measuring actual learning versus feeling of learning in response to being actively engaged in the classroom. (2019) Proceedings of the National Academy of Sciences USA 116:19251–19257

Dunegan KJ, Hrivnak MW. Characteristics of mindless teaching evaluations and the moderating effects of image compatibility. (2003) Journal of Management Education 27:280–303

McKinney E, Niese B. Effective Student Crowdpolling: Key Decisions and a Learning Case. (2017) Proceedings of the 23rd Americas Conference on Information Systems, Boston, MA, 1–7

Quinlan KM. Inside the peer review process: how academics review a colleague’s teaching portfolio. (2002) Teaching and Teacher Education 18:1035–1049

Shah NB, Bradley JK, Parekh A, Wainwright M, Ramchandran K. A Case for Ordinal Peer-evaluation in MOOCs. (2013) NIPS Workshop on Data Driven Education, 1–10

Spencer KJ, Schmelkin LP. Student perspectives on teaching and its evaluation. (2002) Assessment & Evaluation in Higher Education 27:397–409

Spooren P, Brockx B, Mortelmans D. On the Validity of Student Evaluation of Teaching: The State of the Art. (2013) Review of Educational Research 83:598–642

Tiberius RG, Sackin HD, Cappe L. A comparison of two methods for evaluating teaching. (1987) Studies in Higher Education 12:287–297