Background statement
Large-scale databases in education have increasingly drawn attention from researchers and policymakers, primarily because of their analytic power to support in-depth studies and their cost savings in the long run. More specifically, a well-designed large-scale database can not only address the study issues for which it was originally intended but also become an important resource for numerous secondary analyses, thanks to its elaborate sample design and comprehensive data content. Such capability and productivity consequently lower the unit cost of each study. Take, for example, High School and Beyond, a longitudinal database of 1980 high school sophomores and seniors sponsored by the National Center for Education Statistics (NCES), U.S. Department of Education. It has been widely used by researchers and policymakers, resulting in a large number of publications, including journal articles, research reports, policy analyses, graduate dissertations and theses, and technical reports (http://www.nces.edu.gov). Furthermore, because of its longitudinal nature, an individual's prior background, attributes, and performance can be used as control variables in estimating the impact of educational programs or practices. Such analyses, known as controlled studies in a natural setting, are greatly valued in educational research.
Although not all large-scale databases are longitudinal, their value for comprehensive and efficient evaluation and research studies is nevertheless widely recognized. Well-known studies such as the IEA-sponsored Trends in International Mathematics and Science Study (TIMSS) and Progress in International Reading Literacy Study (PIRLS), as well as the NCES-administered National Assessment of Educational Progress (NAEP) and National Teacher and Principal Survey (NTPS), are good examples. When repeated every few years, those studies also enable trend analyses that examine and monitor changes over time, providing valuable information for program and policy decision-making.
However, developing a large-scale database is not a simple task. It takes considerable effort and is not cheap. Thus, any large-scale data collection must be carefully planned, designed, and implemented to ensure that its value is maximized. To this end, the experience and lessons of prior studies are extremely helpful for future studies.
Objectives of this special issue
This special issue of Contemporary Educational Research Quarterly (CERQ) is, therefore, intended to compile collective wisdom about the design, implementation, and application of large-scale databases in education. Specifically, we would like to know what steps and factors should be considered in designing a study sample and data collection instruments. We would also like to know what statistical analysis techniques are required to mine the database, and how the database should be constructed and disseminated to facilitate its use. We hope that the information assembled in this special issue will further inspire and encourage the development and use of large-scale databases in education.
Overview of the articles
We have accepted five articles for this special issue.
The first article, Establishing a Comprehensive Large-Scale Data Infrastructure for Educational Research: The Example of the German National Educational Panel Study (NEPS), written by Jutta von Maurice, Daniel Fuß and Hans-Günther Roßbach from the Leibniz Institute for Educational Trajectories (LIfBi), presents some key processes in preparing a rich empirical database and disseminating it to researchers from different disciplines. It starts with an overview of the NEPS sample design and its selection of research topics on competence development and educational processes, both of which take into account the relevant learning environments as well as issues of social inequality, the special situation of migrants, and the various returns to education. The article then describes the processes of creating the database, including data cleaning and editing, coding and variable generation, documentation and metadata management, and data enrichment. The data protection and dissemination strategies are also explained. Finally, the article gives some basic information about data usage and an outlook on future developments within the NEPS. This article clearly illustrates the steps and considerations required for planning, designing, and implementing a resourceful longitudinal database, and for its eventual application in educational research and evaluation.
The second article, Data Analyses with IEA’s TIMSS and PIRLS International Databases, by Pierre Foy and Liqun Yin from the TIMSS and PIRLS International Study Center, states that large-scale assessments in education generally rely on sophisticated assessment instruments, elaborate sample designs, and leading-edge item response theory to meet their analytical objectives. Analyzing such databases requires proper analytic techniques and processes. The authors use TIMSS and PIRLS data to illustrate the required analysis procedures, including the use of sampling weights to produce accurate and reliable results, the application of the Jackknife Repeated Replication technique to derive proper estimates of sampling variance, and the correct handling of student achievement reported as sets of five plausible values to estimate students' performance. By following those procedures, researchers and users of the TIMSS and PIRLS databases can feel confident in the results of their analyses.
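To make these three procedures concrete, the following Python sketch estimates a weighted mean and its standard error from plausible values. It is only an illustration under stated assumptions, not the procedure given in the article or in the official TIMSS and PIRLS documentation: the function names and the `zone` and `rep` arrays are hypothetical, one replicate is formed per sampling zone, the sampling variance is averaged across plausible values, and the plausible values are combined with Rubin's rules. Operational details differ across assessment cycles, so the official user guides should be consulted for real analyses.

```python
import numpy as np


def weighted_mean(values, weights):
    """Mean of `values` using sampling weights `weights`."""
    return float(np.sum(weights * values) / np.sum(weights))


def jrr_replicate_weights(weights, zone, rep):
    """Build one replicate weight vector per sampling zone (assumed scheme).

    Within zone h, units with rep == 1 get double weight and units with
    rep == 0 get zero weight; units in all other zones keep their
    full-sample weight.
    """
    zones = np.unique(zone)
    replicates = np.empty((len(zones), len(weights)))
    for i, h in enumerate(zones):
        factor = np.where(zone == h, 2.0 * rep, 1.0)
        replicates[i] = weights * factor
    return replicates


def pv_mean_with_se(pvs, weights, zone, rep):
    """Weighted mean and standard error from plausible values.

    pvs has shape (n_students, n_plausible_values), with n_plausible_values
    of at least 2. Returns the tuple (point_estimate, standard_error).
    """
    n_pv = pvs.shape[1]
    replicates = jrr_replicate_weights(weights, zone, rep)
    estimates = np.empty(n_pv)
    sampling_vars = np.empty(n_pv)
    for m in range(n_pv):
        full = weighted_mean(pvs[:, m], weights)
        reps = np.array([weighted_mean(pvs[:, m], w) for w in replicates])
        estimates[m] = full
        # Jackknife sampling variance: squared deviations summed over zones.
        sampling_vars[m] = np.sum((reps - full) ** 2)
    point = estimates.mean()
    sampling_var = sampling_vars.mean()
    # Between-plausible-value (imputation) variance.
    imputation_var = estimates.var(ddof=1)
    total_var = sampling_var + (1.0 + 1.0 / n_pv) * imputation_var
    return float(point), float(np.sqrt(total_var))


# Synthetic data for illustration only (not real assessment data).
rng = np.random.default_rng(0)
n_students = 200
zone = np.repeat(np.arange(10), 20)        # 10 hypothetical sampling zones
rep = np.tile(np.repeat([0, 1], 10), 10)   # two halves within each zone
weights = rng.uniform(0.5, 2.0, n_students)
pvs = 500 + 100 * rng.standard_normal((n_students, 5))
mean, se = pv_mean_with_se(pvs, weights, zone, rep)
print(f"mean = {mean:.1f}, SE = {se:.1f}")
```

The standard error has two parts: a sampling component from the replicate weights and an imputation component from the spread of estimates across plausible values. Ignoring the weights, the replicates, or all but one plausible value would typically understate the uncertainty.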
In the third article, U.S. National Teacher and Principal Survey (NTPS) as a Valued Resource for Teacher and Principal Studies, Jiangang Xia from the University of Nebraska-Lincoln and Xingyuan Gao and Jianping Shen from Western Michigan University first give a brief review of the NTPS, which will replace the traditional Schools and Staffing Survey (SASS); both are administered by the National Center for Education Statistics, U.S. Department of Education. Since the data collected by the NTPS will be similar to those collected by SASS, the authors outline potential studies using this database by reviewing and discussing how the SASS data have been used for educational research and policy analyses over the years. They conclude that the NTPS database will be a unique and rich resource for studies on school teachers and principals, an area that has often been neglected in many countries. The authors strongly recommend that such databases be developed and that more studies about teachers and principals be conducted.
The fourth article, Challenges and Opportunities for Estimating Effects with Large-Scale Education Data Sets, by Guan Kung Saw and Barbara Schneider from Michigan State University, describes the capability of large-scale databases to approximate experimental conditions without employing traditional methods that require randomization of units (e.g., students, schools, districts) to treatment and control conditions. Such a capability for making robust inferences about the effects of educational programs and practices is very important for decision-making in education. The article examines both the opportunities and the potential statistical problems in estimating effects with large-scale databases. The information it presents should be valuable to researchers who use large-scale database analyses to make inferences about the effects of educational programs or practices.
Finally, the fifth article, Using R to Analyze International Large-Scale Educational Assessment Data, by Fu-An Chi and Ching-Fan Sheu from National Cheng Kung University, offers a solution for researchers who lack access to the analysis procedures of commercial software and therefore cannot replicate results that others have published. The authors demonstrate how to use the free, open-source R computing environment to manage and analyze international large-scale educational assessment data. The example comprises mathematics achievement and covariates for 15-year-old students from 15 countries, including Taiwan. The demonstration also covers the use of the 'intsvy' package developed by Caro and Biecek to manage the data and the maptools package to link numerical summaries to the geographical boundaries of the countries examined in the illustration. The R code for the PISA analyses detailed in the paper is provided to facilitate reproducibility.
Concluding remarks
In summary, as these articles show directly or indirectly, large-scale databases offer great opportunities to researchers and policymakers. However, most such databases require an elaborate sample design and comprehensive data collection methods to ensure that all important subgroups are covered and all relevant issues can be sufficiently addressed. Special statistical techniques are also required to analyze the data properly and obtain unbiased results. Such requirements often pose challenges to data developers and users. Fortunately, many large-scale databases have been developed over the years. The lessons and experience gained from those studies, together with the technological skills and tools developed in recent years, have advanced the use of large-scale data in educational evaluation and research. Because a well-designed database can facilitate many high-quality research and policy studies, we hope that large, comprehensive databases, rather than scattered small and localized ones, will become common practice in the near future and further advance educational evaluation and research.