LANGUAGE ASSESSMENT (ELT Teacher Training) Tark NCE

CHAPTER 1: TESTING, ASSESSING, AND TEACHING

In an era of communicative language teaching, tests should measure up to standards of authenticity and meaningfulness. Teachers should design tests that serve as motivating learning experiences rather than anxiety-provoking threats.

Tests:
should be positive experiences
should build a person's confidence and become learning experiences
should bring out the best in students
should not be degrading
should not be artificial
should not be anxiety-provoking

Language assessment aims to create more authentic, intrinsically motivating assessment procedures that are appropriate for their context and designed to offer constructive feedback to students.

What is a test?
A test is a method of measuring a person's ability, knowledge, or performance in a given domain. This definition has five components:

1. Method: A set of techniques, procedures, or items. To qualify as a test, the method must be explicit and structured. Examples:
multiple-choice questions with prescribed correct answers
a writing prompt with a scoring rubric
an oral interview based on a question script and a checklist of expected responses to be filled in by the administrator

2. Measure: A means of offering the test-taker some kind of result. If an instrument does not specify a form of reporting measurement, it cannot be defined as a test. Scoring may take forms such as the following: a classroom-based short-answer essay test may earn the test-taker a letter grade accompanied by the instructor's marginal comments; large-scale standardized tests provide a total numerical score, a percentile rank, and perhaps some sub-scores.
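As a minimal, purely illustrative sketch (the cohort, raw scores, and score scale are all invented for the example), here is how a total numerical score, a percentage, and a percentile rank relate to one another:

```python
def percentile_rank(score, all_scores):
    """Percentage of test-takers in the cohort who scored below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

# Invented raw scores (out of 300) for a small cohort of test-takers.
cohort = [180, 195, 210, 210, 225, 230, 240, 255, 260, 275]
raw = 230

print(f"total score:     {raw} out of 300")
print(f"percentage:      {100 * raw / 300:.0f}%")
print(f"percentile rank: {percentile_rank(raw, cohort):.0f}")
```

Large-scale tests compute percentile ranks against a norming population of thousands rather than ten, but the arithmetic is the same.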
3. The test-taker (the individual): The person who takes the test. Testers need to understand who the test-takers are, what their previous experience and background is, whether the test is appropriately matched to their abilities, and how test-takers should interpret their scores.
4. Performance: A test measures performance, but the results imply the test-taker's ability or competence. Some language tests measure one's ability to perform language: to speak, write, read, or listen to a subset of language. Others measure a test-taker's knowledge about language: defining a vocabulary item, reciting a grammatical rule, or identifying a rhetorical feature in written discourse.

5. Measuring a given domain: This means measuring the desired criterion and not including other factors. Proficiency tests: even though the actual performance on the test involves only a sampling of skills, the domain is overall proficiency in a language, that is, general competence in all skills of a language. Classroom-based performance tests have more specific criteria. For example, a test of pronunciation might well be a test of only a limited set of phonemic minimal pairs, and a vocabulary test may focus on only the set of words covered in a particular lesson.

A well-constructed test is an instrument that provides an accurate measure of the test-taker's ability within a particular domain.

TESTING, ASSESSMENT, AND TEACHING

TESTING
Tests are prepared administrative
procedures that occur at identifiable times in a curriculum. When tested, learners know that their performance is being measured and evaluated. When tested, learners muster all their faculties to offer peak
performance. Tests are a subset of assessment; they are only one among many procedures and tasks that teachers can ultimately use to assess students. Tests are usually time-constrained (usually spanning a class period or at most several hours) and draw on a limited sample of behaviour.

ASSESSMENT
Assessment is an ongoing process that encompasses a
much wider domain. A good teacher never ceases to assess students, whether those assessments are incidental or intended. Whenever a student responds to a question, offers a comment,
or tries out a new word or structure, the teacher subconsciously makes an assessment of the student's performance. Assessment includes testing, but it is broader and comprises many more components.

What about TEACHING?
For optimal learning to take place, learners must have opportunities to play with language without being formally graded.
Teaching sets up the practice games of language learning: the opportunities for learners to listen, think, take risks, set goals, and process feedback from the teacher (coach) and then recycle through the skills that they are trying to master. During these practice activities, teachers are indeed observing students performance and making various evaluations of each learner.
It can be said, then, that testing and assessment are subsets of teaching.

ASSESSMENT

Informal Assessment
Informal assessment consists of incidental, unplanned comments and responses. Examples include: "Nice job!", "Well done!", "Good work!", "Did you say can or can't?", "Broke or break?", or putting a ☺ on some homework. Classroom tasks are designed to elicit performance without recording results and without making fixed judgements about a student's competence. Examples of unrecorded assessment: marginal comments on papers, responding to a draft of an essay, advice about how to better pronounce a word, a suggestion of a strategy for compensating for a reading difficulty, and showing a student how to modify note-taking to better remember the content of a lecture.

Formal Assessment
Formal assessments are exercises or procedures specifically designed to tap into a storehouse of skills and knowledge. They are systematic, planned sampling techniques constructed to give teachers and students an appraisal of student achievement.
They are the tournament games that occur periodically in the course of teaching. It can be said that all tests are formal assessments, but not all formal assessment is testing. Example 1: A student's journal or portfolio of materials can be used as a formal assessment of the attainment of certain course objectives, but it is problematic to call those two procedures "tests". Example 2: A systematic set of
THE FUNCTION OF AN ASSESSMENT

Formative Assessment
Evaluating students in the process of forming their competencies and skills, with the goal of helping them to continue that growth process. It supports the ongoing development of the learner's language. Example: when you give students a comment or a suggestion, or call attention to an error, that feedback is offered in order to improve the learner's language ability.

Summative Assessment
It aims to measure, or summarize, what a student has grasped, and typically occurs at the end of a course. It does not necessarily point the way to future progress. Examples: final exams in a course and general proficiency exams. Most tests and formal assessments (quizzes, periodic review tests, final exams) tend to be summative.

IMPORTANT:
Where summative assessment is concerned, in the aftermath of a test students tend to think, "Whew! I'm glad that's over. Now I don't have to remember that stuff anymore!" An ideal teacher should try to change this attitude among students. A teacher should:
instill a more formative quality into lessons, and
offer students an opportunity to convert tests into learning experiences.

NORM-REFERENCED AND CRITERION-REFERENCED TESTS
Norm-Referenced Tests
Each test-taker's score is interpreted in relation to a mean (average score), median (middle score), standard deviation (extent of variance in scores), and/or percentile rank. The purpose is to place test-takers along a mathematical continuum in rank order. Scores are usually reported back to the test-taker in the form of a numerical score (e.g. 230 out of 300, 84%). Typical of these are standardized tests such as the SAT, TOEFL, ÜDS, and KPDS. These tests are intended to be administered to large audiences, with results efficiently disseminated to test-takers.

Criterion-Referenced Tests
They are designed to give test-takers feedback, usually in the form of grades, on specific course or lesson objectives. Tests that involve the students in only one class, and that are connected to a curriculum, are criterion-referenced tests. Much time and effort on the part of the teacher are required to deliver useful, appropriate feedback to students. The distribution of students' scores across a continuum may be of little concern as long as the instrument assesses the appropriate objectives. As opposed to standardized large-scale testing, criterion-referenced testing places its emphasis on classroom-based testing.

Approaches to Language Testing: A Brief History
Historically, language-testing trends have followed the trends of teaching methods.
During the 1950s: an era of behaviourism and special attention to contrastive analysis.
Testing focused on specific language elements such as phonological, grammatical, and lexical contrasts between two languages.
During the 1970s and 80s: communicative theories were widely accepted, bringing a more integrative view of testing.
Today: test designers are trying to develop authentic, valid instruments that simulate real-world interaction.

APPROACHES TO LANGUAGE TESTING

A) Discrete-Point Testing
Language can be broken down into its component parts, and those parts can be tested
successfully. Component parts: listening, speaking, reading, and writing. Units of language (discrete points): phonology, graphology, morphology, lexicon, syntax, and discourse. A language proficiency test, in this view, should sample all four skills and as many linguistic discrete points as possible. In the face of evidence from a study in which each student scored differently in the various skills depending on his or her background, country, and major field, Oller challenged this view.

B) Integrative Testing
Language competence is a unified set of interacting abilities that cannot be tested
separately. Communicative competence is global and requires such integration that it cannot be captured in additive tests of grammar, reading, vocabulary, and other discrete points of language. Two types of tests are classic examples of integrative tests: the cloze test and dictation.

Unitary trait hypothesis: It suggests an indivisible view of language proficiency: that vocabulary, grammar, phonology, the four skills, and other discrete points of language cannot be disentangled from each other in language performance.

Cloze Test
Cloze test results are good measures of overall proficiency. The ability to supply appropriate words in blanks requires a number of abilities that lie at the heart of competence in a language: knowledge of vocabulary, grammatical structure, discourse structure, and reading skills and strategies. It was argued that successful completion of cloze items taps into all of those abilities, which were said to be the essence of global language proficiency.

Dictation
Essentially, learners listen to a passage of 100 to 150 words read
aloud by an administrator (or on tape) and write what they hear, using correct spelling. Supporters argue that dictation is an integrative test because success on a dictation requires careful listening, reproduction in writing of what is heard, efficient short-term memory, and, to an extent, internalized expectancy rules that aid short-term memory.
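The fixed-ratio cloze procedure described above can be sketched mechanically. This is an illustrative toy (the passage, the deletion ratio, and the number of intact lead-in words are invented for the example); real cloze tests typically delete every fifth to seventh word after an intact opening stretch of text:

```python
def make_cloze(text, n=5, intact=8):
    """Blank out every nth word, leaving the first `intact` words
    untouched so the reader can establish context."""
    words = text.split()
    answers = []
    for i in range(intact, len(words), n):
        answers.append(words[i])
        words[i] = "____"
    return " ".join(words), answers

passage = ("The cloze procedure deletes words from a passage at a fixed "
           "interval and asks the learner to restore them, drawing on "
           "vocabulary, grammar, and discourse knowledge all at once.")
cloze_text, answer_key = make_cloze(passage)
print(cloze_text)
print("answer key:", answer_key)
```

A real test constructor would, of course, also vet each blank by hand: some mechanically chosen deletions are trivially recoverable, while others admit several defensible answers.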
C) Communicative Language Testing (a more recent approach, after the mid-1980s)

What does it criticize? For a particular language test to be useful for its intended purposes, test performance must correspond in demonstrable ways to language use in non-test situations. Integrative tests such as the cloze only tell us about a candidate's linguistic competence. They do not tell us anything directly about a student's performance ability (knowledge about a language, not the use of language).

Any suggestion? A quest for authenticity, as test designers centered on communicative performance. Its supporters emphasized the importance of strategic competence (the ability to employ communicative strategies to compensate for breakdowns as well as to enhance the rhetorical effect of utterances) in the process of communication.

Any problem in using this approach? Yes. Communicative testing presented challenges to test designers, because they had to identify the real-world tasks that language learners were called upon to perform. It became clear that the contexts for those tasks were widely varied and that the sampling of tasks for any one assessment procedure needed to be validated by what language users actually do with language. As a result, the assessment field became more and more concerned with the authenticity of tasks and the genuineness of texts.

D) Performance-Based Assessment
Performance-based assessment of language typically involves oral production, written production, open-ended responses, integrated performance (across skill areas), group performance, and other interactive tasks.

Any problems? It is time-consuming and expensive, but the extra effort pays off in more direct testing, because students are assessed as they perform actual or simulated real-world tasks.

The advantage of this approach? Higher content validity is achieved because learners are measured in the process of performing the targeted linguistic acts. Importantly, performance-based assessment means that teachers should rely a little less on formally structured tests and a little more on evaluation while students are performing various tasks.

In performance-based assessment: interactive tests (speaking, requesting, responding, etc.) are IN; paper-and-pencil tests are OUT. As a result, test tasks can approach the authenticity of real-life language use.

CURRENT ISSUES IN CLASSROOM TESTING

The design of communicative, performance-based assessment
continues to challenge both assessment experts and classroom teachers. Three issues are helping to shape our current understanding of effective assessment:
the effect of new theories of intelligence on the testing industry
the advent of what has come to be called "alternative" assessment
the increasing popularity of computer-based testing

New Views on Intelligence
In the past: Intelligence was once viewed strictly as the ability to perform linguistic and logical-mathematical problem solving. For many years we lived in a world of standardized, norm-referenced tests, timed and in multiple-choice format, consisting of a multiplicity of logic-constrained items, many of which were inauthentic. In measuring language we relied on timed, discrete-point, analytical tests; we were forced to stay within the limits of objectivity and to give impersonal feedback.
More recently, additional types of intelligence have been recognized:
spatial intelligence
musical intelligence
bodily-kinesthetic intelligence
interpersonal intelligence
intrapersonal intelligence
EQ (emotional quotient), which underscores the role of emotion in our cognitive processing. Those who manage their emotions tend to be more capable of fully intelligent processing, because anger, grief, resentment, and other feelings can easily impair peak performance in everyday tasks as well as in higher-order problem solving.

These conceptualizations of intelligence, with their intuitive appeal, infused the 1990s with a sense of both freedom and responsibility in our testing agenda. Our new challenge was to test interpersonal, creative, communicative, and interactive skills and, in doing so, to place some trust in our subjectivity and intuition.

Traditional and Alternative Assessment
Traditional Assessment                  Alternative Assessment
One-shot, standardized exams            Continuous, long-term assessment
Timed, multiple-choice format           Untimed, free-response format
Decontextualized test items             Contextualized communicative tests
Scores suffice for feedback             Individualized feedback and washback
Norm-referenced scores                  Criterion-referenced scores
Focus on the "right" answer             Open-ended, creative answers
Summative                               Formative
Oriented to product                     Oriented to process
Non-interactive process                 Interactive process

IMPORTANT: It is difficult to draw a clear line of distinction between traditional and alternative assessment.
Many forms of assessment fall in between the two, and some combine the best of both. More time and higher institutional budgets are required to administer and score assessments that presuppose more subjective evaluation, more individualization, and more interaction in the process of offering feedback. But the payoff of alternative assessment comes in more useful feedback to students, the potential for intrinsic motivation, and ultimately a more complete description of a student's ability.

Computer-Based Testing
Some computer-based tests are small-scale; others are standardized, large-scale tests (e.g. the TOEFL) in which thousands of test-takers are involved.
One type of computer-based test, the computer-adaptive test (CAT), is also available. In a CAT, the test-taker sees only one question at a time, and the computer scores each question before selecting the next one. Test-takers cannot skip questions and, once they have entered and confirmed their answers, they cannot return to earlier questions.

Advantages of computer-based testing:
o classroom-based testing
o self-directed testing on various aspects of a language (vocabulary, grammar, discourse, etc.)
o practice for upcoming high-stakes standardized tests
o some individualization, in the case of CATs
o electronic scoring for rapid reporting of results

Disadvantages of computer-based testing:
o lack of security and the possibility of cheating in unsupervised computerized tests
o home-grown quizzes may be mistaken for validated assessments
o open-ended responses are less likely to appear, because of the need for human scorers
o the human interactive element is absent

AN OVERALL SUMMARY
Tests:
Assessment is an integral part of the teaching-learning cycle. In an interactive, communicative curriculum, assessment is almost constant. Tests can provide authenticity, motivation, and feedback to the learner. Tests are essential components of a successful curriculum and learning process.

Assessments:
Periodic assessments can increase motivation as milestones of student progress. Appropriate assessments aid in the reinforcement and retention of information. Assessments can confirm areas of strength and pinpoint areas needing further work. Assessments provide a sense of periodic closure to modules within a curriculum. Assessments promote student autonomy by encouraging self-evaluation of progress. Assessments can spur learners to set goals for themselves. Assessments can aid in evaluating teaching effectiveness.
Decide whether the following statements are TRUE or FALSE.
1. It's possible to create authentic and motivating assessment that offers constructive feedback to students. ______
2. All tests should offer the test-takers some kind of measurement or result. ______
3. Performance-based tests measure test-takers' knowledge about language. ______
4. Tests are the best tools to assess students. ______
5. Assessment and testing are synonymous terms. ______
6. Teachers' incidental and unplanned comments and responses to students are an example of formal assessment. ______
7. Most of our classroom assessment is summative assessment. ______
8. Formative assessment always points toward the future formation of learning. ______
9. The distribution of students' scores across a continuum is a concern in norm-referenced tests. ______
10. Criterion-referenced testing has more instructional value than norm-referenced testing for classroom teachers. ______
Answers:
1. TRUE
2. TRUE
3. FALSE (They are designed to test the actual use of language, not knowledge about language.)
4. FALSE (We cannot say they are the best; they are one of many useful devices for assessing students.)
5. FALSE (They are not.)
6. FALSE (They are informal assessment.)
7. FALSE (It is formative assessment.)
8. TRUE
9. TRUE
10. TRUE

CHAPTER 2: PRINCIPLES OF LANGUAGE
ASSESSMENT

There are five criteria for "testing a test":
1. Practicality
2. Reliability
3. Validity
4. Authenticity
5. Washback

1. PRACTICALITY
A practical test is not excessively expensive, stays within appropriate time constraints, is relatively easy to administer, and has a scoring/evaluation procedure that is specific and time-efficient. For a test to be practical:
administrative details should be clearly established before the test,
students should be able to complete the test reasonably within the set time frame,
the test should be able to be administered smoothly, without procedural hitches,
all materials and equipment should be ready,
the cost of the test should be within budgeted limits,
the scoring/evaluation system should be feasible in the teacher's time frame, and
methods for reporting results should be determined in advance.

2. RELIABILITY
A reliable test is consistent and dependable. The issue of the reliability of a test is best addressed by considering the factors that may contribute to its unreliability. Consider the following possibilities: fluctuations in the student (student-related reliability), in scoring (rater reliability), in test administration (test administration reliability), and in the test itself (test reliability).

Student-Related Reliability
Temporary illness, fatigue, a "bad day", anxiety, and other physical or
psychological factors may make an observed score deviate from one's "true" score. A test-taker's test-wiseness, or strategies for efficient test taking, can also be included in this category.

Rater Reliability
Human error, subjectivity, lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores for the same test. Intra-rater unreliability results from unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students, or simple carelessness. One solution to intra-rater unreliability is to read through about half of the tests before rendering any final scores or grades, then to recycle back through the whole set of tests to ensure an even-handed judgment. The careful specification of an analytical scoring instrument can also increase rater reliability.

Test Administration Reliability
Unreliability may also result from the conditions in which the test is administered:
street noise, photocopying variations, poor lighting, temperature, and the condition of desks and chairs can all affect results.

Test Reliability
Sometimes the nature of the test itself can cause measurement errors. Timed tests may discriminate against students who do not perform well under a time limit. Poorly written test items may be a further source of test unreliability.
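To make rater (in)consistency concrete, here is a minimal sketch, with invented scores, that quantifies inter-rater reliability as the Pearson correlation between two raters' marks for the same set of essays; a coefficient well below 1.0 signals inter-rater unreliability:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented scores (0-10) given by two raters to the same ten essays.
rater_a = [8, 6, 9, 5, 7, 4, 8, 6, 9, 7]
rater_b = [7, 6, 9, 4, 7, 5, 8, 5, 9, 6]

print(f"inter-rater correlation: {pearson(rater_a, rater_b):.2f}")
```

In practice it is analytic scoring rubrics and rater training, rather than statistics alone, that push this coefficient toward 1.0; classroom teachers rarely compute it, but large testing programs routinely do.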
3. VALIDITY
Validity is the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment.

How is the validity of a test established?
There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in its support. It may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit of study being tested. In other cases we may be concerned with how well a test determines whether or not students have reached an established set of goals or level of competence. In others it could be appropriate to study statistical correlations with other related but independent measures. Further concerns about a test's validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker's perception of validity. We will look at these five types of evidence below.

Content Validity
If a test requires the test-taker to perform the behaviour that is being measured, it has content-related evidence of validity, often popularly referred to as content
validity. If you are assessing a person's ability to speak the target language, asking students to answer paper-and-pencil multiple-choice questions requiring grammatical judgements does not achieve content validity. For content validity to be achieved, the following conditions should hold:
Classroom objectives should be identified and appropriately framed. The first measure of an effective classroom test is the identification of objectives.
Lesson objectives should be represented in the form of test specifications. A test should have a structure that follows logically from the lesson or unit being tested.
If you can clearly perceive the performance of test-takers as reflective of the classroom objectives, then content validity has probably been achieved.

To understand content validity, consider the difference between direct and indirect testing. Direct testing involves the test-taker in actually performing the target task. Indirect testing involves performing not the target task itself, but a task related to it in some way. Direct testing is the most feasible way to achieve content validity in assessment.
Criterion-Related Validity
It examines the extent to which the criterion of the test has actually been achieved. For example, a classroom test designed to assess a point of grammar in communicative use will have criterion validity if test scores are corroborated either by observed subsequent behavior or by other communicative measures of the grammar point in question. Criterion-related evidence usually falls into one of two categories:
Concurrent validity: A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language.
Predictive validity: The criterion in such cases is not to measure concurrent ability but to assess (and predict) a test-taker's likelihood of future success. Predictive validity becomes important, for example, in the case of placement tests and language aptitude tests.
Construct Validity
Every issue in language learning and teaching involves theoretical constructs. In the field of assessment, construct validity asks: does this test actually tap into the theoretical construct as it has been defined? (That is, does the test really have the features needed to measure the topic or skill I want to test?) Imagine that you have been given a procedure for conducting an oral interview. The scoring analysis for the interview includes several factors in the final score: pronunciation, fluency, grammatical accuracy, vocabulary use, and sociolinguistic appropriateness. The justification for these five factors lies in a theoretical construct that claims them to be major components of oral proficiency. So if you were asked to conduct an oral proficiency interview that evaluated only pronunciation and grammar, you could be justifiably suspicious about the construct validity of that test. Large-scale standardized tests tend to be weak in construct validity, because for practicality's sake (for reasons of both time and cost) they cannot measure every language skill that ought to be measured. For example, the absence of an oral production section on the TOEFL stands out as a major obstacle to its construct validity.
Consequential Validity
Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test's interpretation and use. McNamara (2000, p. 54) cautions against test results that may reflect socioeconomic conditions, such as opportunities for coaching (private tutoring): only some families can afford coaching, and children of more highly educated parents can get help from their parents. Teachers should consider the effect of assessments on students'
motivation, subsequent performance in a course, independent learning, study habits, and attitudes toward school work.

Face Validity
Face validity is the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the test-takers. In other words, the students perceive the test to be valid. Face validity asks: does the test, on the face of it, appear from the learner's perspective to test what it is designed to test? Face validity cannot be empirically tested by a teacher or even by a testing expert; it depends on the subjective evaluation of the test-taker.
A classroom test is not the time to introduce new tasks. If a test samples the actual content of what the learner has achieved or expects to achieve, face validity will be more likely to be perceived. Content validity is thus a very important ingredient in achieving face validity. Students will generally judge a test to be face-valid if:
directions are clear,
the structure of the test is organized logically,
its difficulty level is appropriately pitched,
the test has no surprises, and
timing is appropriate.
To give an assessment procedure that is "biased for best", a teacher offers students appropriate review and preparation for the test, suggests strategies that will be beneficial, and structures the test so that the best students will be modestly challenged and the weaker students will not be overwhelmed.

4. AUTHENTICITY
In an authentic test:
the language is as natural as possible,
items are as contextualized as possible,
topics and situations are interesting, enjoyable, and/or humorous,
some thematic organization, such as through a story line or episode, is provided, and
tasks represent real-world tasks.
Reading passages are selected from real-world sources that test-takers are likely to have encountered or will encounter. Listening comprehension sections feature natural language with hesitations, white noise, and interruptions. More and more tests offer items that are "episodic", in that they are sequenced to form meaningful units, paragraphs, or stories.

5. WASHBACK
Washback includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment. Informal performance assessment is by nature more likely to have
built-in washback effects, because the teacher is usually providing interactive feedback. Formal tests can also have positive washback, but they provide no washback if the students receive a simple letter grade or a single overall numerical score. Tests should serve as learning devices through which washback is achieved. Students' incorrect responses can become windows of insight into further work. Their correct responses need to be praised, especially when they represent accomplishments in a student's interlanguage. Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence,
language ego, interlanguage, and strategic investment, among others. To enhance washback, comment generously and specifically on test performance. Washback also implies that students have ready access to the teacher to discuss the feedback and evaluation given. Teachers can raise the washback potential by asking students to use test results as a guide to setting goals for their future effort.

What is washback?
In general terms: the effect of testing on teaching and learning.
In large-scale assessment: the effects that tests have on instruction, in terms of how students prepare for the test.
In classroom assessment: the information that "washes back" to students in the form of useful diagnoses of strengths and weaknesses.

What does washback enhance?
intrinsic motivation
autonomy
language ego
interlanguage
self-confidence
strategic investment

What should teachers do to enhance washback?
Comment generously and specifically on test performance.
Respond to as many details as possible.
Praise strengths.
Criticize weaknesses constructively.
Give strategic hints to improve performance.

Decide whether the following statements are TRUE or FALSE.
1. An expensive test is not practical.
2. One of the sources of unreliability of a test is the school.
3. Students, raters, the test, and its administration may all affect a test's reliability.
4. In indirect tests, students do not actually perform the target task.
5. If students are aware of what is being tested when they take a test, and think that the questions are appropriate, the test has face validity.
6. Face validity can be tested empirically.
7. Diagnosing strengths and weaknesses of students in language learning is a facet of washback.
8. One way of achieving authenticity in testing is to use simplified language.
Answers: 1. TRUE  2. FALSE  3. TRUE  4. TRUE  5. TRUE  6. FALSE  7. TRUE  8. FALSE

Decide which type of validity each sentence relates to.
1. It is based on subjective judgment. ______
2. It questions the accuracy of measuring the intended criteria. ______
3. It appears to measure the knowledge and abilities it claims to measure. ______
4. It measures whether the test meets the classroom objectives. ______
5. It requires the test to be based on a theoretical background. ______
6. Washback is part of it. ______
7. It requires the test-taker to perform the behavior being measured. ______
8. The students (test-takers) think they are given enough time to do the test. ______
9. It assesses a test-taker's likelihood of future success (e.g. placement tests). ______
10. The students' psychological mood may affect it negatively or positively. ______
11. It includes consideration of the test's effect on the learner. ______
12. Items of the test do not seem to be complicated. ______
13. The test covers the objectives of the course. ______
14. The test has clear directions. ______
Answers: 1. Face  2. Consequential  3. Face  4. Content  5. Construct  6. Consequential  7. Criterion-related  8. Face  9. Criterion-related  10. Consequential  11. Consequential  12. Face  13. Content  14. Face

Decide which type of reliability each sentence relates to.
1. There are ambiguous items.
2. The student is anxious.
3. The tape is of bad quality.
4. The teacher is tired but continues scoring.
5. The test is too long.
6. The room is dark.
7. The student has had an argument with the teacher.
8. The scorers interpret the criteria differently.
9. There is a lot of noise outside the building.
Answers: 1. Test reliability  2. Student-related reliability  3. Test administration reliability  4. Rater reliability  5. Test reliability  6. Test administration reliability  7. Student-related reliability  8. Rater reliability  9. Test administration reliability

CHAPTER 3: DESIGNING CLASSROOM LANGUAGE TESTS
We examine test types and learn how to design new tests and revise existing ones. To start the process of designing tests, we will ask some critical questions. Five questions should form the basis of your approach to designing tests for your class.
Question 1: What is the purpose of the test? Why am I creating this test? For an evaluation of overall proficiency? (Proficiency Test) To place students into a course? (Placement Test) To measure achievement within a course? (Achievement Test) Once you have established the major purpose of a test, you can determine its objectives.
Question 2: What are the objectives of the test? What specifically am I trying to find out? What language abilities are to be assessed? Question 3: How will test specifications reflect both purpose and objectives? When a test is designed, the objectives should be incorporated into a structure that appropriately weights the various competencies
being assessed. Question 4: How will test tasks be selected and the separate items arranged? The tasks need to be practical. They should also achieve content validity by presenting tasks that mirror those of the course being assessed.
They should be evaluated reliably by the teacher or scorer. The tasks themselves should strive for authenticity, and the progression of tasks ought to be biased for best performance. Question 5: What kind of scoring, grading, and/or feedback is expected? Tests vary in the form and function of feedback, depending on their purpose.
For every test, the way results are reported is an important consideration. Under some circumstances a letter grade or a holistic score may be appropriate; other circumstances may require that a teacher offer substantive washback to the learner.
CHAPTER 4 STANDARDIZED TESTING: characteristics of a standardized test
CHAPTER 5 STANDARDS-BASED ASSESSMENT
CHAPTER 6 ASSESSING LISTENING

Designing Assessment Tasks: Intensive Listening

Recognizing Phonological and Morphological Elements

Phonemic pair, consonants
Test-takers hear: He's from California.
Test-takers read:
A. He's from California.
B. She's from California.

Phonemic pair, vowels
Test-takers hear: Is he living?
Test-takers read:
A. Is he leaving?
B. Is he living?

Morphological pair, -ed ending
Test-takers hear: I missed you very much.
Test-takers read:
A. I missed you very much.
B. I miss you very much.

Stress pattern in "can't"
Test-takers hear: My girlfriend can't go to the party.
Test-takers read:
A. My girlfriend can go to the party.
B. My girlfriend can't go to the party.

One-word stimulus
Test-takers hear: vine
Test-takers read:
A. vine
B. wine

Paraphrase Recognition

Sentence paraphrase
Test-takers hear: Hello, my name is Keiko. I come from Japan.
Test-takers read:
A. Keiko is comfortable in Japan.
B. Keiko wants to come to Japan.
C. Keiko is Japanese.
D. Keiko likes Japan.

Dialogue paraphrase
Test-takers hear:
Man: Hi, Maria, my name is George.
Woman: Nice to meet you, George. Are you American?
Man: No, I'm Canadian.
Test-takers read:
A. George lives in the United States.
B. George is American.
C. George comes from Canada.
D. Maria is Canadian.

Responsive Listening

Appropriate response to a question
Test-takers hear: How much time did you take to do your homework?
Test-takers read:
A. In about an hour.
B. About an hour.
C. About $10.
D. Yes, I did.

Open-ended response to a question
Test-takers hear: How much time did you take to do your homework?
Test-takers write or speak: __________________________________

CHAPTER 7 ASSESSING SPEAKING
CHAPTER 8 ASSESSING READING
CHAPTER 9 ASSESSING WRITING
CHAPTER 10 BEYOND TESTS: ALTERNATIVES IN ASSESSMENT

The Dilemma of Maximizing Both Practicality and Washback
LARGE-SCALE STANDARDIZED TESTS
One-shot, timed, multiple-choice performances; decontextualized; norm-referenced; foster extrinsic motivation. They are highly practical, reliable instruments that minimize time and money, but they cannot offer much washback or authenticity.

ALTERNATIVE ASSESSMENT
Open-ended in time orientation and format; contextualized to a curriculum; referenced to the criteria (objectives) of that curriculum; likely to build intrinsic motivation. They require considerable time and effort, but offer much authenticity and washback.

CHAPTER 11: GRADING AND STUDENT EVALUATION