Rater Severity/Leniency and Bias in EFL Students' Composition Using Many-Facet Rasch Measurement (MFRM)

Yenni Arif Rahman(1*), Fitri Apriyanti(2), Rahmi Aulia Nurdini(3)

(1) Universitas Bina Sarana Informatika
(2) Universitas Bina Sarana Informatika
(3) Universitas Bina Sarana Informatika
(*) Corresponding Author

Abstract


This study investigates the extent to which raters are overly severe, lenient, or even biased when evaluating students' writing compositions in Indonesia. Data were collected from 15 student essays scored by four raters holding master's degrees in English education. Many-facet Rasch measurement (MFRM), implemented in the Minifac software, a program designed for many-facet Rasch analysis, was used for data analysis. The assessment process was decomposed into its distinct facets: raters, essay items, and the specific traits or criteria in the writing rubric. Each rater's level of severity or leniency, that is, how strictly or leniently they assigned scores, was examined, as were the potential biases raters might introduce into the grading process. The findings revealed that, while the raters used the rubric consistently across all test takers, they varied in how lenient or severe they were. Scores of 70 were assigned more frequently than any other score. These findings suggest that composition raters may differ in how they rate students, potentially leading to student dissatisfaction, particularly when raters adopt severe scoring. The bias analysis showed that certain raters consistently scored particular items inaccurately, deviating from the established criteria (traits). Furthermore, the study found that having more than four items/criteria (content, diction, structure, and mechanics) is essential to achieve a more diverse distribution of item difficulty and to measure students' writing abilities effectively. These results can help writing departments improve their oversight of inter-rater reliability and rating consistency. To address the problem, rater training is suggested as the most feasible way to ensure more dependable and consistent evaluations.
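For orientation, the many-facet Rasch model named in the abstract is conventionally written (following Linacre's formulation; the notation below is the standard one, not quoted from this paper) as

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k$$

where $B_n$ is the ability of student $n$, $D_i$ the difficulty of trait $i$ (e.g., content or mechanics), $C_j$ the severity of rater $j$, and $F_k$ the threshold of rating-scale category $k$ relative to category $k-1$. A larger $C_j$ marks a more severe rater; bias then appears as systematic rater-by-trait interaction residuals beyond these main effects.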


Keywords


Rater Severity/Leniency; Rater Bias; EFL Composition; Many-Facet Rasch Measurement


DOI: http://dx.doi.org/10.30998/scope.v8i1.19432


Copyright (c) 2023 Yenni Arif Rahman

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
