Advancing Human Assessment: The Methodological, Psychological and Policy Contributions of ETS

Understanding a feature (even if only in a rudimentary way) points to the kinds of tasks or stimuli that might provide information about it. Explanations of the properties of test scores date back to at least the late nineteenth century and thus predate both the use of the term validity and the establishment of ETS.

Latent Traits

Given specification of the network as a confirmatory factor model (and sufficient data), the hypotheses inherent in the network can be checked by evaluating the fit of the model to the data. If the model fits, the substantive assumptions (about relationships between the constructs) in the model and the validity of the proposed measures of the constructs are both supported.
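In formal terms (a generic confirmatory factor analysis sketch, not the notation of any particular ETS study), the hypothesized network implies a structured covariance matrix that can be compared with the observed data:

x = \Lambda \xi + \delta, \qquad \Sigma(\theta) = \Lambda \Phi \Lambda^{\top} + \Theta_{\delta}

If the model-implied matrix \Sigma(\hat{\theta}) reproduces the sample covariance matrix S within sampling error (e.g., by a likelihood-ratio chi-square test or related fit indices), both the hypothesized relationships among constructs (\Phi) and the adequacy of the indicators (\Lambda) receive support.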

Controlling Irrelevant Variance

In the context of a statistical model, a latent trait relates test performance, actual and potential, to item or task parameters. A latent trait has a model-specific meaning and a model-specific use; it captures the enduring contribution of the test taker's "ability" to the probability of success across repeated, independent performances of a variety of tasks.
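For example, in a two-parameter logistic item response model (a standard formulation, used here only as an illustration), the probability that person i succeeds on item j depends jointly on the latent trait \theta_i and the item parameters a_j (discrimination) and b_j (difficulty):

P(X_{ij} = 1 \mid \theta_i) = \frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]}

The person parameter \theta_i carries the enduring contribution of ability across repeated, independent task performances, while the item parameters absorb task-specific features; this is what gives the latent trait its model-specific meaning and use.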

Validity of Score-Based Predictions

The fundamental role of criterion-related validity evidence in evaluating the accuracy of such predictions remains important to the validity of any interpretation or use that relies on predictions of future performance (Kane 2013a), but these paradigm cases of prediction now tend to be assessed in a wider theoretical context (Messick 1989) and from a wider set of perspectives (Dorans 2012; Holland 1994; Kane 2013b). In this broader context, the accuracy of predictions remains important, but concerns about fairness and utility receive more attention than they did before the 1970s.
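As a minimal illustration of that criterion-related logic (standard regression notation, not tied to any particular ETS study), a score-based prediction of a criterion Y from a test score X, and its accuracy, can be written as

\hat{Y} = \mu_Y + r_{XY}\,\frac{\sigma_Y}{\sigma_X}\,(X - \mu_X), \qquad \sigma_{\mathrm{est}} = \sigma_Y \sqrt{1 - r_{XY}^{2}}

The larger the validity coefficient r_{XY}, the smaller the standard error of estimate; fairness and utility questions then concern how such predictions behave across groups and decision contexts.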

Validity and Fairness

  • Fairness and Bias
  • Adverse Impact and Differential Prediction
  • Differential Item Functioning
  • Identifying and Addressing Specific Threats to Fairness

First, a major impetus for the development of these models was the belief in the late 1960s that at least part of the explanation for the observed differences in test scores across groups was to be found in the properties of the test. ETS played a major role in introducing DIF methods as a way to promote fairness in testing programs (Dorans and Holland 1993; Holland and Thayer 1988).
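One of those methods, the Mantel-Haenszel DIF procedure (Holland and Thayer 1988), compares the odds of success for reference- and focal-group members who have been matched on total score. The sketch below is a minimal illustration of that statistic; the function name, the toy data, and the omission of the accompanying chi-square test and ETS A/B/C classification rules are assumptions made for brevity, not ETS code.

```python
"""Minimal sketch of the Mantel-Haenszel DIF statistic (Holland & Thayer 1988)."""
import math
from collections import defaultdict

def mh_d_dif(responses):
    """responses: iterable of (group, matching_score, correct), with group in
    {'ref', 'focal'} and correct in {0, 1}. Returns MH D-DIF on the delta scale."""
    # Build a 2x2 table (correct/incorrect by group) at each matched score level.
    tables = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
    for group, score, correct in responses:
        key = ("A" if correct else "B") if group == "ref" else ("C" if correct else "D")
        tables[score][key] += 1

    num = den = 0.0
    for t in tables.values():
        n = t["A"] + t["B"] + t["C"] + t["D"]
        if n == 0:
            continue
        num += t["A"] * t["D"] / n  # reference correct x focal incorrect
        den += t["B"] * t["C"] / n  # reference incorrect x focal correct
    alpha_mh = num / den            # common odds ratio across score levels
    return -2.35 * math.log(alpha_mh)

# Toy usage (hypothetical data): values near 0 suggest negligible DIF; negative
# values indicate the item is relatively harder for the focal group after matching.
data = [("ref", 5, 1), ("ref", 5, 0), ("focal", 5, 1), ("focal", 5, 0),
        ("ref", 6, 1), ("ref", 6, 1), ("focal", 6, 1), ("focal", 6, 0)]
print(round(mh_d_dif(data), 2))
```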

Messick’s Unified Model of Construct Validity

  • Meaning and Values in Measurement
  • A Unified but Faceted Framework for Validity
  • The Evidential Basis of Test Score Interpretations
  • The Evidential Basis of Test Score Use
  • The Consequential Basis of Test Score Interpretation
  • The Consequential Basis of Test Score Use
  • Validity as a Matter of Consequences
  • The Central Messages

The article (Messick 1975) was the published version of his presidential address to Division 5 (Evaluation and Measurement) of the American Psychological Association. Third, Messick (1975) recognized the need to be precise about intended interpretations of test scores.

Fig. 16.1 Messick’s facets of validity. From Test Validity and the Ethics of Assessment (p. 30, Research Report No

Argument-Based Approaches to Validation

The interpretation/use argument (IUA) is intended to provide a reasonably detailed specification of the reasoning inherent in the proposed interpretations and uses of the test scores. In this way, an argument-based approach can provide necessary and sufficient conditions for validity in terms of the plausibility of the inferences and assumptions in the IUA.

Applied Validity Research at ETS

  • Predictive Validity
  • Beyond Correlations
  • Construct-Irrelevant Variance
    • Fatigue Effects
    • Time Limits
    • Guessing
    • Scoring Errors
  • Construct Underrepresentation

From the 1960s through the 1980s, ETS conducted a number of SAT validation studies that focused on routine predictions of freshman grade point average (FGPA) with data provided by colleges using the ETS/College Validation Study Service, as summarized by Ramist and Weiss (1990). A study of the impact of extending the amount of time allowed per item on the SAT concluded that there were some effects of the extended time (1.5 times the regular time); average gains for the verbal score were less than 10 points on the 200-800 scale and about 30 points for the math score (Bridgeman et al. 2004b).

Figure 16.3 indicates that, even within a UGPA quartile, GRE scores matter for identifying highly successful students (i.e., the percentage achieving a 4.0 average).

Fairness as a Core Concern in Validity

Frederiksen's group instituted performance tests that required students to service real weapons, and grades on end-of-course tests plummeted. Note that the utility of a test for selection is mostly assessed statistically in terms of the correlation coefficient between the test and the criterion.
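One common formalization of that statistical view is the Brogden-Cronbach-Gleser tradition (given here only as an illustration, not as a description of Frederiksen's studies), in which the expected criterion gain per person selected is proportional to the validity coefficient:

\Delta U \approx r_{xy}\, SD_y\, \bar{z}_x

where r_{xy} is the test-criterion correlation, SD_y is the standard deviation of the criterion in the chosen utility metric, and \bar{z}_x is the mean standardized test score of those selected. Under this view, even modest gains in r_{xy} translate directly into selection utility.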

Concluding Remarks

To the extent that testing programs play important roles in the public arena, their claims must be substantiated. In general, it is important to assess how well testing programs work in practice, in the contexts in which they operate (e.g., as a basis for hiring decisions, academic selection, placement, licensure, and certification).

The Study of the Reading Reliability of the College Board English Composition Tests of April and June 1947 (Research Bulletin No. RB-48-07). A theory of test scores and their relationship to the trait measured (Research Bulletin No. RB-51-13).

Understanding the Impact of Special Preparation for Admissions Tests

Definitions

  • Significance of Special Test Preparation
  • Interest in Special Test Preparation

For example, test familiarization is intended to ensure that prospective candidates are well informed about the general skills required by a test and to help them become familiar with the procedures required to take it. Finally, along with the validity of the scores, equity is often an issue in special test preparation, as typically not all students have equal opportunities to benefit in the ways described above. For example, in the early 1980s, a previously offered section of the GRE® General Test (the measure of analytical ability) was radically modified based on the results of a GRE Board-sponsored test preparation study (Powers and Swinton 1984).

Studying the Effects of Special Test Preparation

  • The SAT
    • The College Board Position
    • Early Studies
    • Test Familiarization
    • Federal Interest
    • Extending Lessons Learned
    • Studying the 1994 Revision to the SAT
  • The GRE General Test
    • Effects on Relationships of Test Scores with Other Measures

Perhaps the single most significant factor in increasing interest in coaching and test preparation was the US Federal Trade Commission's (FTC's) investigation, with which both ETS and some of the major commercial coaching companies cooperated. Data collected in studies of the GRE analytical measure were also used to evaluate the effectiveness of formal commercial training for the verbal and quantitative sections (Powers 1985a).

Summary

Federal Trade Commission Boston Regional Office Staff Note: The Effects of Coaching on Standardized Entrance Exams. Effects of coaching on standardized entrance exams: Revised statistical analyses of data collected by the Boston Regional Office, Federal Trade Commission. An examination of the effects of special preparation on GRE analytical scores and item types.

A Historical Survey of Research Regarding Constructed-Response Formats

Isaac I. Bejar

Reliability

  • The Emergence of a Solution
  • Conclusion

A second concern with holistic scores is the nature of the inferences that can be drawn from the scores. As Edgeworth (1890) recognized, readers may differ in the severity of the scores they assign, and such disagreements contribute to measurement error. After Brigham's death, there seemed to be no strong proponents of the format, at least not within the College Board, not even in the early years of ETS.

Validity

  • Validity Theory at ETS
  • Conclusion

However, as we will see in the following sections, much more was needed for constructed-response formats to become viable. The relevance of the unitary view of validity for educational testing first had to be established. The deployment of highly complex forms of assessment in the early 1990s was intended to maximize the positive educational effects of constructed-response formats and to avoid the negative effects of the multiple-choice format, such as teaching to the narrow segment of the curriculum that a multiple-choice test would represent.

The Interplay of Constructs and Technology

  • Computer-Mediated Scoring
  • Automated Scoring
  • Construct Theory and Task Design
    • Writing
    • Speaking
    • Mathematics
    • Interpersonal Competence
    • Professional Assessments
    • Advances in Assessment Design Theory
  • Conclusion

In the case of the GMAT and GRE, a design consisting of two prompts, generating and evaluating arguments, emerged after several rounds of research (Powers et al. 1999a). Writing was partially incorporated into the TOEFL during the 1980s in the form of the TWE. Predicting the TOEFL from this perspective was a possible outcome of the conference (Duran et al. 1987).

School-Based Testing

  • Advanced Placement
  • Educational Surveys 47
  • Accountability Testing
  • Conclusion

The Test of Written English (TWE) portion of the paper-based TOEFL was introduced in 1986 at selected TOEFL administrations. An informal history of the AP Readings, 1956–76 (College Board Advanced Placement Program 1980), noted that the new assessment had to be an integral part of the educational process.

Table 18.1 Writing assessment milestones for GMAT, GRE and TOEFL tests

Validity and Psychometric Research Related to Constructed-Response Formats

  • Construct Equivalence
  • Predictive Validity of Human and Computer Scoring
  • Equivalence Across Populations and Differential Item Functioning
  • Equating and Comparability
  • Medium Effects
  • Choice
  • Difficulty Modeling
  • Diagnostic and Formative Assessment

The main approach to ensuring comparability of results is equating (Dorans et al. 2007), a methodology that was developed for multiple-choice tests (a simple linear form is sketched after this paragraph). The difficulty of constructed-response items, and the basis for and control of variability in difficulty, have been studied in numerous fields, including mathematics (Katz et al. 2000), architecture (Bejar 2002), and writing (Bridgeman et al.). One approach is based on Bayesian networks (Almond et al. 2007), while a second follows a latent-variable tradition (von Davier 2013).
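In its simplest linear form (a textbook formulation rather than any specific ETS procedure for constructed-response tests), equating places a score x from form X onto the scale of form Y by matching means and standard deviations:

e_Y(x) = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)

Extending such transformations to constructed-response scores is complicated by rater effects and the small number of tasks per form, which is one reason comparability remained an active area of research.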

Summary and Reflections

  • What Is Next?

Gender and ethnic group differences on the GMAT Analytical Writing Assessment (Research Report No. RR-96-02). Toward communicative competence testing: Proceedings of the Second TOEFL Invitational Conference (TOEFL Research Report No. 21). Application of the Online Scoring Network (OSN) to Advanced Placement Program (AP) tests (Research Report No. RR-03-12).

Advancing Human Assessment: A Synthesis Over Seven Decades

The Years 1948–1959

  • Psychometric and Statistical Methodology
  • Validity and Validation
  • Constructed-Response Formats and Performance Assessment
  • Personal Qualities

In most cases, the citations included as examples of a line of work were selected based on their discussion in one of the book's chapters. Work on constructed-response formats and performance assessment was undertaken (Ryans and Frederiksen 1951), including the development of the in-basket test (Frederiksen et al. 1957), later used worldwide for job selection, and a measure of the ability to formulate hypotheses as an indicator of scientific thinking (Frederiksen 1959). Cognition, broadly defined, was a key interest, as evidenced by the publication of the kit of selected tests for reference aptitude and achievement factors (French 1954).

The Years 1960–1969

  • Psychometric and Statistical Methodology
  • Large-Scale Survey Assessments of Student and Adult Populations
  • Validity and Validation
  • Constructed-Response Formats and Performance Assessment
  • Personal Qualities
  • Teacher and Teaching Quality

In the 1960s, interest in prediction studies continued (Schrader and Pitcher 1964), although noticeably less than in the previous period. Among constructed-response formats (see Chap. 18, this volume), writing assessment deserves special mention for the seminal study by Diederich et al. (1961), which documented that raters brought "mindsets" to the evaluation of essays, sparking interest in the study of rater cognition, the mental processes underlying essay grading. A second milestone was the study by Godshalk et al. (1966) that resulted in the invention of holistic scoring.

The Years 1970–1979

  • Psychometric and Statistical Methodology
  • Large-Scale Survey Assessments of Student and Adult Populations
  • Validity and Validation
  • Constructed-Response Formats and Performance Assessment
  • Personal Qualities
  • Human Development
  • Educational Evaluation and Policy Analysis
  • Teacher and Teaching Quality

It is also worth noting that this period saw the beginning of ETS work on cognitive styles (Gardner et al.). Some of the accumulated wisdom gained during this period was synthesized in two books, the Encyclopedia of Educational Evaluation (Anderson et al. 1975) and The Profession and Practice of Program Evaluation (Anderson and Ball 1978). In addition to intensive evaluation activity, this period saw the beginning of a policy analysis work stream (see Coley et al., Chap. 12, this volume).

The Years 1980–1989

  • Psychometric and Statistical Methodology
  • Large-Scale Survey Assessments of Student and Adult Populations
  • Validity and Validation
  • Constructed-Response Formats and Performance Assessment
  • Personal Qualities
  • Human Development
  • Educational Evaluation and Policy Analysis
  • Teacher and Teaching Quality

Finally, while research continued on the hypothesis-formulation item type (Ward et al. 1980), the study of portfolios also emerged (Camp 1985). This focus remained largely centered on traditional academic ability, although limited research continued on creativity (Baird and Knapp 1981; Ward et al. 1980). As with program evaluation, the departure of key personnel during this period resulted in reduced activity, with only limited attention to the three dominant lines of research of the previous decade: the functioning of the NTE (Rosner and Howey 1982), classroom observation (Medley and Coker 1987; Medley et al. 1981), and university teaching (Centra 1983).

The Years 1990–1999

  • Psychometric and Statistical Methodology
  • Large-Scale Survey Assessments of Student and Adult Populations
  • Validity and Validation
  • Constructed-Response Formats and Performance Assessment
  • Personal Qualities
  • Human Development
  • Education Policy Analysis
  • Teacher and Teaching Quality

Articles describing these methodological innovations were published in a special issue of the Journal of Educational Statistics (Mislevy et al. 1992b; Yamamoto and Mazzeo 1992). During this period, many aspects of the functioning of constructed-response formats were investigated, including construct equivalence (Bennett et al. 1991; Bridgeman 1992) and population invariance (Breland et al.). Also introduced at the end of the decade, in the Graduate Management Admission Test, was the e-rater® automated scoring engine, an approach to automated essay scoring (Burstein et al. 1998).

The Years 2000–2009

  • Psychometric and Statistical Methodology
  • Large-Scale Survey Assessments of Student and Adult Populations
  • Validity and Validation
  • Constructed-Response Formats and Performance Assessment
  • Personal Qualities
  • Human Development
  • Education Policy Analysis
  • Teacher and Teaching Quality

Also, recent developments in the statistical procedures used in NAEP were summarized and future directions described (M. von Davier et al. 2006). In K-12 education, the achievement gap (Barton 2003), gender equity (Coley 2001), the role of the family (Barton and Coley 2007), and access to advanced coursework in high school (Handwerk et al. 2008) were each examined. In teacher policy and practice, staff examined approaches to teacher preparation (Wang et al. 2003) and the quality of the teaching force (Gitomer 2007b).

The Years 2010–2016

  • Psychometric and Statistical Methodology
  • Large-Scale Survey Assessments of Student and Adult Populations
  • Validity and Validation
  • Constructed-Response Formats and Performance Assessment
  • Personal Qualities
  • Education Policy Analysis
  • Teacher and Teaching Quality

The first direction was through the further development of ECD (evidence-centered design), especially its application in educational games (Mislevy et al. 2014). For English learners, themes covering accessibility (Guzman-Orth et al. 2016; Young et al. 2014), accommodations (Wolf et al.; Roberts et al. 2010), and stereotype threat (Stricker and Rock 2015) persisted. The period also saw a considerable expansion in the variety of noncognitive constructs studied.

Discussion

Evaluation of a program for training dentists in the care of disabled patients (Research Report No. RR-82-52). Computer-adaptive testing for students with disabilities: A review of the literature (Research Report No. RR-11-32). A job analysis of the knowledge important to newly licensed (certified) general science teachers (Research Report No. RR-92-77).

