TOEFL RESEARCH

Inside the TOEFL iBT Updates: Validity by Design

April 23, 2026

For more than six decades, TOEFL has operated as a major assessment of academic English language proficiency, serving as an important resource for decision-making in university admissions, as well as other higher education and professional contexts.

Since the exam’s inception in 1964, ETS has revised TOEFL on multiple occasions to reflect updated thinking in language teaching and assessment, advances in measurement science, and evolving societal needs. The current version of the test, TOEFL iBT, was created in 2005. In January 2026, ETS launched an updated version of the exam.

This update maintains the same core purpose and builds on the long history of the TOEFL test as a valid and trustworthy assessment based on decades of measurement research at ETS. We’re excited to share more about the design philosophy behind these enhancements.

Building an English Exam That Yields Meaningful Results

A critical requirement for all tests is that they are valid for the claims and ultimate uses of their results. In other words: The results must be meaningful. The evidence supporting these claims and uses should also be varied and sufficient. The more opportunities a student has to demonstrate what they can do on a variety of tasks (e.g., more items of different types), the greater the trust in the validity of the results.

In any validity argument, evidence is required. This evidence refers to the information we collect about what a person can do – that is, the test tasks and the scores awarded for performance on these tasks.

An English proficiency test for admissions purposes must include tasks that: (1) cover all four language skills (reading, writing, listening, and speaking); (2) reflect integrated use of these skills that is typical of university study (e.g., reading then writing); and (3) include features of real-life language use.

The scores produced by the test must also be a reliable estimate of overall language ability – with an appropriate level of precision – and be consistently accurate and precise across the required range of language proficiency levels. For tests of language proficiency, test results must also accurately reflect the ability to use language to succeed in diverse academic environments.

Over the last 20 years, modern academic environments have evolved to emphasize new ways of communicating, facilitated by new technologies and pedagogical models. Today’s students, for example, must be able to communicate with peers from around the world in group learning settings, not just passively absorb lectures. They must also be able to interpret a broader range of English texts.

To measure the English skills required to thrive in modern academic environments, the updated TOEFL iBT incorporates a wider variety of tasks, expanding the evidence of language ability on which valid results depend.

Increasing the Diversity and Volume of Task Types

The updated TOEFL iBT has added more test tasks of greater variety, building on the solid foundation of the original format. A test taker’s performance is meaningful if it aligns with an ability to communicate in an academic environment, which includes not just listening to lectures or reading textbooks, but also engaging in other university contexts that contribute to academic success.

One common challenge for test developers, however, is that test tasks which closely mimic real-world activities can be unfeasibly time-consuming to administer, while providing relatively little measurement information and related evidence.

For example, you can imagine a writing test that consists entirely of a single hour-long written essay that is scored on a 1-5 scale. Such a test might be seen as “authentic,” but it provides a narrow view of a student’s language ability and limits the opportunity for the test to gauge the full spectrum of a student’s skills beyond the single item.

Qualitatively, this hypothetical test provides information about the ability to accomplish only one type of writing. Quantitatively, it produces only five score points, which limits its ability to reliably discern different levels of performance. This approach is also vulnerable to random circumstances; for example, if an otherwise capable writer struggles with the essay topic, the consequences can be severe.

An alternative approach is to use not only more tasks but also a diversity of tasks, providing a broader view of ability and greater reliability in measurement. In pursuit of this goal, the updated TOEFL iBT includes tasks that measure foundational language skills, as well as modernized academic tasks that enable deeper insights into communicative ability.
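The reliability gain from adding tasks can be illustrated with the Spearman-Brown prophecy formula, a standard psychometric result that the discussion above does not itself cite. The sketch below assumes every task is equally reliable and measures the same ability, which a deliberately diverse test only approximates. The single-task reliability of 0.60 is a hypothetical value chosen for illustration, not a TOEFL statistic.

```python
def spearman_brown(single_task_reliability: float, n_tasks: int) -> float:
    """Predicted reliability of a test built from n_tasks parallel tasks."""
    r = single_task_reliability
    return n_tasks * r / (1 + (n_tasks - 1) * r)


# Hypothetical reliability of one essay-style task (illustrative only).
single_task = 0.60

for n in (1, 2, 4, 8):
    print(f"{n} task(s): predicted reliability = {spearman_brown(single_task, n):.2f}")
```

Under these assumptions, predicted reliability rises from 0.60 with one task to about 0.86 with four, with diminishing returns as more tasks are added.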

How TOEFL iBT Modernized Its Speaking Section

TOEFL iBT’s Speaking section shows this design philosophy in action. To start, a well-researched speaking task, Listen and Repeat, evaluates the ability to comprehend a spoken sentence and reproduce it accurately. The student must rapidly decode the language input, then accurately regenerate the language to produce a response, reflecting the development of their underlying language abilities.

This task incorporates foundational skills necessary for oral communication (Levelt, 1989). Additionally, individuals with a highly developed internal language system can more efficiently and accurately reproduce longer sentences, so by varying sentence length it is possible to efficiently measure general language ability across a broad range of language proficiency (Davis & Norris, 2021).

Listen and Repeat is used in combination with a communicative speaking task, Take an Interview, where students participate in a simulated conversation with a pre-recorded interviewer. The interview takes place within a variety of academic situations, such as participating in a research study, and students are scored on a total of four questions related to the interview context. Initial questions focus on factual information and personal experience, while later questions ask students to express and support opinions regarding broader issues.

This task measures the student’s ability to speak on a range of topics, producing a clear and coherent response with appropriate support and elaboration. The task also measures the ability to produce speech that is intelligible, fluent, and makes effective use of a range of vocabulary and grammatical structures.

This combination of tasks that target foundational (Listen and Repeat) and communicative (Take an Interview) speaking abilities provides diversity in construct representation and related evidence about students’ oral language ability, while maintaining the meaningfulness of scores for making decisions in academic contexts.

Linking TOEFL Results to Real-World Academic Performance

Regardless of whether a task focuses on foundational or communicative skills, its scores are meaningful only if task performance predicts the real-world language performance needed for academic success. Otherwise, assigning a test score would be an exercise in futility.

For the Listen and Repeat and Take an Interview tasks, recent research at the University of Hawai’i at Manoa (Isbell & Crowther, in press) found that scores on these tasks correlated highly with performance on other types of communicative language tasks assigned in a classroom setting.

These researchers found correlations of 0.84 between scores on the Listen and Repeat task and each of the two classroom communicative tasks, and 0.83-0.85 for the Take an Interview task. These results suggest that both tasks from the updated TOEFL iBT are strong predictors of performance in typical types of academic speaking.
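One common way to read correlations of this size is to square them, which gives the proportion of variance in one set of scores that is linearly associated with the other. The sketch below applies that arithmetic to the coefficients reported above. The squared values are derived here for illustration and are not reported in the source.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures with correlation r."""
    return r ** 2


# Correlation coefficients quoted in the text above.
reported = {
    "Listen and Repeat": 0.84,
    "Interview task (low end)": 0.83,
    "Interview task (high end)": 0.85,
}

for task, r in reported.items():
    print(f"{task}: r = {r:.2f}, shared variance = {shared_variance(r):.0%}")
```

By this reading, roughly 69% to 72% of the variance in classroom task performance is shared with scores on the test tasks.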

Modernizing the TOEFL iBT Reading & Writing Sections

ETS implemented a similar strategy in the Reading and Writing sections, increasing the diversity of task types and varying the opportunities to evaluate student performance.

In the Reading section’s newly added task, Complete the Words, the second half of every second word within a reading passage is deleted. Students are required to fill in the missing letters to recreate the original words and create a coherent text.

This task, commonly known as the C-test, efficiently provides information about the ability to process and understand text, as well as knowledge of vocabulary, syntax, and spelling. To supplement this task, more traditional reading comprehension tasks, like Read an Academic Passage, provide insight into the ability to obtain information and understand meanings, as is typical in academic study.
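The deletion rule described for Complete the Words can be sketched in a few lines. This is a minimal illustration of the general C-test idea, not the procedure ETS uses to build test items: the function name, the way odd-length words are split, the treatment of one-letter words, and the simple letters-only word pattern are all assumptions of this sketch.

```python
import re


def make_c_test(text: str, min_length: int = 2) -> str:
    """Blank the second half of every second word (hypothetical sketch)."""
    word_index = 0

    def blank(match):
        nonlocal word_index
        word = match.group(0)
        word_index += 1
        # Leave the 1st, 3rd, 5th, ... words and very short words intact.
        if word_index % 2 == 1 or len(word) < min_length:
            return word
        # Assumption: for odd-length words, keep the larger half.
        keep = (len(word) + 1) // 2
        return word[:keep] + "_" * (len(word) - keep)

    # Assumption: a "word" is a run of ASCII letters, so contractions
    # and hyphenated words are split at the punctuation.
    return re.sub(r"[A-Za-z]+", blank, text)


print(make_c_test("Students read academic texts every day"))
# Students re__ academic tex__ every da_
```

Each underscore stands for one deleted letter, so the student sees how many letters are missing from each word.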

In the Writing section, the Write for an Academic Discussion task assesses communicative aspects of literacy skills. This task takes place in the context of an ongoing class discussion of a question posed by the course instructor. The student adds their own views, supported with relevant reasoning, knowledge, or experience. They may also respond to peers’ contributions.

Additionally, Write for an Academic Discussion simulates a type of writing that has become increasingly common in academic contexts. It also provides a context for writing, which helps to clarify whether the writer can write appropriately for a given audience and situation. This stands in contrast to traditional writing tests that use a “bare” topic, with no description of the audience or circumstances.

Beyond these innovative features, the Write for an Academic Discussion task also measures other aspects of successful written communication, including coherence and clarity, quality of elaboration, and range and precision of language.

In sum: Developing the updated TOEFL iBT test was an intriguing design challenge. It required building on strong existing validity evidence by adding a greater variety and number of tasks that reflect the rigorous expectations and diverse academic environments of today’s higher education institutions.

In addition to the content and construct validity discussed above, the TOEFL iBT test also benefits from a newly implemented adaptive test design, innovations in measurement science, improvements to test security, and more. Stay tuned to this channel to learn more!

References

Davis, L., & Norris, J. (2021). Developing an innovative elicited imitation task for efficient English proficiency assessment (TOEFL Research Report No. 96). ETS. https://doi.org/10.1002/ets2.12338

Isbell, D. R., & Crowther, D. (in press). Investigating the real-world relevance of an academic English speaking test: Extrapolating subjective evaluations and linguistic performance characteristics. Language Testing.

Levelt, W. J. M. (1989). Speaking: From intention to articulation. MIT Press.

Pearlman, M. (2008). Finalizing the test blueprint. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Building a validity argument for the Test of English as a Foreign Language (pp. 227-258). Routledge.
