ERIC Number: ED658543
Record Type: Non-Journal
Publication Date: 2022-Sep-23
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Custom Software to Efficiently Scale API-Based Data Collection for Education Research and Policy
Brandon Sepulvado; Jennifer Hamilton
Society for Research on Educational Effectiveness
Background/Context: Application Programming Interfaces (APIs) are becoming a core means to collect and disseminate data for education research and policy. Surveys can be costly, slow, and burdensome for respondents, and manually downloading data files is slow and cumbersome when many files are needed. APIs provide a programmatic way to collect data or obtain estimates, and many commonly used statistical programming languages, such as R and Python, have packages to access these APIs. This dramatically reduces the time and cost involved in collecting education data. Given these benefits, it should come as no surprise that many data providers, including federal agencies, are increasingly making APIs available to access their data. Despite the promise of APIs, challenges remain. Terms of Service often regulate how much data may be downloaded at any given time and how frequently it may be downloaded. Research teams must also understand how APIs and web requests work and know a common programming language. The computing environments in which API code is executed must be standardized. Further, as research team size increases, coordinating data collection across APIs becomes a considerable challenge.
Purpose: To address these challenges and to fully harness the advantages that API-based data collection affords, NORC is developing custom software to enable fast-paced and cost-efficient research on educational effectiveness. This data collection program is based upon the following principles: 1. Respect API Terms of Service, so as not to overburden data provider resources; 2. Minimize necessary technical expertise, to encourage inclusive research teams composed of subject matter experts; 3. Maximize the quality of data collected by standardizing the computing environment and centralizing the coordination of data collection efforts among research team members; and 4. Encode flexibility into the software so that it can be used with many APIs.
Tool: Although developed in Python, NORC's program is interactive and does not require Python knowledge for anyone other than core developers. Users simply double-click the program to run it. The first time the program is run on a computer, it automatically downloads the necessary Python packages and standardizes the environment across the research team. Once the program is installed, users are asked plain-language questions, such as "what is your name," "what are your API credentials," and "what piece of information or type of data are you trying to collect?" The program then uses this information to coordinate data collection. It uses the team member's name and the type of information desired to identify the sample members on whom data should be collected and the API that provides access to such information. Next, given the appropriate API, the program selects the appropriate custom Python module written by our team's core developers, populates the requisite API call fields, such as data sets, variable names, and geographies, and constructs the API requests. Finally, the program executes the API calls, checks for transmission errors, and stores the information when it is retrieved successfully. Because the custom modules are written with Terms of Service in mind, the rules placed on data collection by each source are respected. Logs, in addition to the data, are stored in a centralized location, so that developers can troubleshoot any problems that might arise.
Results: This program for API-based data collection dramatically reduces both the cost and the time involved in education research and policy projects. Most team members no longer need to worry about identifying the specific records they collect, reviewing the legal terms of API use, or possibly having to write code in a programming language with which they are not familiar; this translates to a reduced coordination burden regardless of data collection team size. Recent NORC efforts that used our software have reduced data collection time by up to 93%.
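The workflow described under Tool can be illustrated with a minimal Python sketch. It is not NORC's implementation, which the abstract does not show. Every name in it is hypothetical: the `ApiModule` class, the `MODULES` and `ASSIGNMENTS` tables, the `collect` function, and the example URL. The fixed pause between calls is an assumed stand-in for whatever limits a given provider's Terms of Service impose.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("collector")


@dataclass
class ApiModule:
    """Hypothetical per-API module holding an endpoint and a pacing rule."""
    base_url: str
    min_seconds_between_calls: float  # assumed stand-in for a Terms of Service limit

    def build_request(self, sample_id: str, api_key: str) -> dict:
        # Populate the fields a call needs; real modules would also add
        # data sets, variable names, geographies, and so on.
        return {"url": self.base_url, "params": {"id": sample_id, "key": api_key}}


# Hypothetical registry: type of data requested -> module that knows the API.
MODULES: Dict[str, ApiModule] = {
    "enrollment": ApiModule("https://api.example.org/enrollment", 1.0),
}

# Hypothetical assignments: (team member, data type) -> sample members to collect.
ASSIGNMENTS: Dict[Tuple[str, str], List[str]] = {
    ("Ada", "enrollment"): ["school-001", "school-002"],
}


def collect(
    name: str,
    data_type: str,
    api_key: str,
    fetch: Callable[[dict], dict],
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, dict]:
    """Run the assigned requests one by one, pausing between calls.

    `fetch` performs the actual web request; it is passed in so that the
    coordination logic can be exercised without network access.
    """
    module = MODULES[data_type]
    results: Dict[str, dict] = {}
    for i, sample_id in enumerate(ASSIGNMENTS.get((name, data_type), [])):
        if i > 0:
            sleep(module.min_seconds_between_calls)  # pace calls after the first
        request = module.build_request(sample_id, api_key)
        try:
            results[sample_id] = fetch(request)  # store only successful retrievals
        except Exception as exc:
            # Log the failure and continue with the remaining sample members.
            logger.error("request for %s failed: %s", sample_id, exc)
    return results
```

In this sketch, the answers to the plain-language questions map onto the `name`, `data_type`, and `api_key` arguments. A real tool would also write its logs and collected data to a shared location, which is omitted here.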
Descriptors: Data Collection, Computer Software, Educational Research, Access to Information, Barriers, Coding, Educational Policy
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A