Publication Date
| In 2026 | 0 |
| Since 2025 | 56 |
| Since 2022 (last 5 years) | 282 |
| Since 2017 (last 10 years) | 778 |
| Since 2007 (last 20 years) | 2040 |
Descriptor
| Interrater Reliability | 3122 |
| Foreign Countries | 654 |
| Test Reliability | 503 |
| Evaluation Methods | 502 |
| Test Validity | 410 |
| Correlation | 401 |
| Scoring | 347 |
| Comparative Analysis | 327 |
| Scores | 324 |
| Validity | 310 |
| Student Evaluation | 308 |
Audience
| Researchers | 130 |
| Practitioners | 42 |
| Teachers | 22 |
| Administrators | 11 |
| Counselors | 3 |
| Policymakers | 2 |
Location
| Australia | 56 |
| Turkey | 53 |
| United Kingdom | 46 |
| Canada | 45 |
| Netherlands | 40 |
| China | 38 |
| California | 37 |
| United States | 30 |
| United Kingdom (England) | 24 |
| Taiwan | 23 |
| Germany | 22 |
What Works Clearinghouse Rating
| Meets WWC Standards without Reservations | 3 |
| Meets WWC Standards with or without Reservations | 3 |
| Does not meet standards | 3 |
Bosch, Holger; Steinkamp, Fiona; Boller, Emil – Psychological Bulletin, 2006
H. Bosch, F. Steinkamp, and E. Boller's (see record 2006-08436-001) meta-analysis, which demonstrated (a) a small but highly significant overall effect, (b) a small-study effect, and (c) extreme heterogeneity, has provoked widely differing responses. After considering D. B. Wilson and W. R. Shadish's (see record 2006-08436-002) and D. Radin, R.…
Descriptors: Meta Analysis, Publications, Bias, Models
Powell, Thomas W. – Clinical Linguistics & Phonetics, 2006
The third edition of the "Boston Diagnostic Aphasia Examination" (Goodglass, Kaplan, and Barresi) introduced standardized procedures for coding discourse samples elicited using the well-known Cookie Theft illustration. To evaluate the reliability of this discourse coding procedure, a transcribed sample was coded by 14 novice examiners…
Descriptors: Examiners, Interrater Reliability, Test Reliability, Aphasia
Johnson, Martin; Greatorex, Jackie – E-Learning, 2008
Technological innovation undoubtedly offers many potential benefits for education and the assessment of learning, which have been acknowledged elsewhere. One area that is relatively under-researched relates to the practice of how assessors interact with longer texts that are presented on screen. This is an important area of study because there…
Descriptors: Foreign Countries, Innovation, Technological Advancement, Technology Uses in Education
Vlach, Haley A.; Carver, Sharon M. – Early Childhood Research & Practice, 2008
Education programs have fostered advanced levels of graphic representation ability in young children but have not detailed the specific mechanisms responsible for the accelerated growth. Research suggests that between 6 and 8 years of age children begin to observe more carefully before drawing and that observation prompts aid children's…
Descriptors: Childrens Art, Observation, Scores, Early Childhood Education
Yiu, Edwin M.-L.; Chan, Karen M. K.; Mok, Rosa S.-M. – Clinical Linguistics & Phonetics, 2007
One way to improve the reliability of perceptual voice quality rating is to provide listeners with external anchors. A paired comparison matching paradigm using synthesized Cantonese voice stimuli that covered a range of rough and breathy qualities was used to investigate the rating reliability. Twenty-five speech pathology students rated…
Descriptors: Data Analysis, Measures (Individuals), Stimuli, Models
Hodge, Samuel R.; Kozub, Francis M.; Robinson, Leah E.; Hersman, Bethany L. – Adapted Physical Activity Quarterly, 2007
The purpose of this study was to determine what trends exist in the identification and description of participants used in data-based studies published in "Adapted Physical Activity Quarterly" and the "Journal of Teaching in Physical Education". Data were analyzed using frequency counts for journals and time periods from the 1980s to 2005 with…
Descriptors: Physical Education, Ethnicity, Socioeconomic Status, Physical Activities
Barkaoui, Khaled – Assessing Writing, 2007
Educators often have to choose among different types of rating scales to assess second-language (L2) writing performance. There is little research, however, on how different rating scales affect rater performance. This study employed a mixed-method approach to investigate the effects of two different rating scales on EFL essay scores, rating…
Descriptors: Writing Evaluation, Writing Tests, Rating Scales, Essays
Arnold, Margery E. – 1996
It is incorrect to say "the test is reliable" because reliability is a function not only of the test itself, but of many factors. The present paper explains how different factors affect classical reliability estimates such as test-retest, interrater, internal consistency, and equivalent forms coefficients. Furthermore, the limits of classical test…
Descriptors: Estimation (Mathematics), Generalizability Theory, Heuristics, Interrater Reliability
Spolsky, Bernard – 1990
A discussion of the differences between the Test of English as a Foreign Language (TOEFL), an American test battery, and the Cambridge English Examinations (Cambridge), a British battery, focuses on the different approaches to language test development embodied in the tests as the source of difficulty in translating between them for individual…
Descriptors: Comparative Analysis, Cultural Differences, English (Second Language), Foreign Countries
Zwick, Rebecca – 1986
Most currently used measures of inter-rater agreement for the nominal case incorporate a correction for "chance agreement." The definition of chance agreement is not the same for all coefficients, however. Three chance-corrected coefficients are Cohen's Kappa; Scott's Pi; and the S index of Bennett, Goldstein, and Alpert, which has…
Descriptors: Error of Measurement, Interrater Reliability, Mathematical Models, Measurement Techniques
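The three coefficients named in the Zwick abstract share the form (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement, and differ only in how chance agreement p_e is defined. The following is a minimal sketch for two raters and nominal categories, using the standard textbook definitions rather than anything taken from the paper itself; the function name `agreement_coefficients` and the sample ratings are hypothetical, chosen only for illustration.

```python
from collections import Counter
import math


def agreement_coefficients(rater_a, rater_b, categories=None):
    """Return observed agreement plus Cohen's kappa, Scott's pi, and the S index.

    Hypothetical helper for illustration: two raters, nominal categories.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must rate the same non-empty set of items")

    n = len(rater_a)
    if categories is None:
        categories = set(rater_a) | set(rater_b)
    cats = sorted(set(categories))
    k = len(cats)

    # Observed proportion of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    count_a = Counter(rater_a)
    count_b = Counter(rater_b)

    # Cohen's kappa: chance agreement from each rater's own marginals.
    pe_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in cats)
    # Scott's pi: chance agreement from the marginals pooled over both raters.
    pe_pi = sum(((count_a[c] + count_b[c]) / (2 * n)) ** 2 for c in cats)
    # S index: chance agreement assumes all k categories are equally likely.
    pe_s = 1 / k

    def corrected(p_e):
        # Undefined when chance agreement is already perfect.
        if math.isclose(p_e, 1.0):
            return float("nan")
        return (p_o - p_e) / (1 - p_e)

    return {
        "observed": p_o,
        "kappa": corrected(pe_kappa),
        "pi": corrected(pe_pi),
        "S": corrected(pe_s),
    }


# Made-up ratings: the raters agree on 8 of 10 items but use "y" at different rates.
a = ["y", "y", "y", "n", "n", "y", "n", "y", "y", "n"]
b = ["y", "y", "y", "n", "y", "y", "y", "y", "y", "n"]
result = agreement_coefficients(a, b)
print({name: round(value, 3) for name, value in result.items()})
```

Because the two raters' marginal distributions differ in this made-up sample, the three coefficients give different values for the same observed agreement, which is the point the abstract makes about differing definitions of chance agreement.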
Kang, Namjun – 1987
If content analysis is to satisfy the requirement of objectivity, measures and procedures must be reliable. Reliability is usually measured by the proportion of agreement of all categories identically coded by different coders. For such data to be empirically meaningful, a high degree of inter-coder reliability must be demonstrated. Researchers in…
Descriptors: Content Analysis, Interrater Reliability, Measurement Techniques, Media Research
Stelmachers, Zigfrids T.; Sherman, Robert E. – 1988
The clinical usefulness of various empirically derived suicide potential rating scales has been questioned by several suicidologists. This study used actual case histories in an attempt to anchor suicide risk ratings. Thirty-three brief case histories of suicidal patients were given to 19 experienced crisis workers for seven-point ratings of…
Descriptors: Clinical Diagnosis, Evaluation Criteria, Evaluation Methods, High Risk Persons
Tsui, Anne S. – 1983
Quality of performance data yielded by subjective judgment is of major concern to researchers in performance appraisal. However, some confusion exists in the analysis of quality on ratings obtained from different rating scale formats and from different raters. To clarify this confusion, a study was conducted to assess the quality of judgmental…
Descriptors: Administrator Evaluation, Administrators, Error of Measurement, Evaluation Methods
Edinger, Jack D.; Vosk, Barbara N. – 1983
Of the many short forms of the Minnesota Multiphasic Personality Inventory (MMPI) that have been developed, the MMPI-168 is among the most promising. To determine whether clinical judgments based on the MMPI-168 are comparable to judgments based on the standard MMPI, 30 clinical psychologists participated in a randomized block, repeated treatment…
Descriptors: Comparative Testing, Diagnostic Tests, Interrater Reliability, Personality Measures
Thompson, Richard T.; Johnson, Dora E. – 1988
Efforts to expand the generic language proficiency guidelines of the American Council on the Teaching of Foreign Languages (ACTFL) to the less commonly taught languages (LCTLs) began when developers realized that the ACTFL guidelines were too Eurocentric; the guidelines included grammatical categories specific to Western European languages and…
Descriptors: Cultural Context, Interrater Reliability, Language Proficiency, Language Tests

