ERIC Number: ED657152
Record Type: Non-Journal
Publication Date: 2021-Sep-28
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Reimagining Automated Large-Scale Data Collection
Brandon Sepulvado; Jennifer Hamilton
Society for Research on Educational Effectiveness
Background: Traditional survey efforts to gather outcome data at scale have significant limitations, including cost, time, and respondent burden. This pilot study explored new and innovative large-scale methods of collecting and validating data from publicly available sources. Taking advantage of emerging data science techniques, we leverage extant data to inform programmatic decision-making that counteracts inequality and better supports all students.
Purpose: The National Science Foundation has been awarding merit-based support for graduate study in STEM fields since 1952. Understanding the importance of supporting traditionally underrepresented students, NSF needed to be able to quickly obtain and analyze data on a large scale. This pilot was designed as a test case -- to develop and assess new approaches to collecting the career outcomes of Graduate Research Fellowship Program (GRFP) awardees. A key part of this pilot was investigating how these processes could be automated while simultaneously ensuring data quality. This project therefore established a data collection pipeline that enabled the large-scale ingestion of records, along with verification protocols to ensure that the collected records were those of Fellowship recipients. Our presentation will discuss in detail how we addressed these two challenges.
Challenge 1: Scaling API Data Collection: Collecting data to evaluate small STEM funding programs can be done manually, which ensures high data quality and typically is a straightforward process (e.g., searching within internet portals/webpages). However, when the sample includes tens of thousands of individuals, manual data collection becomes infeasible. To collect information on the scholarly impact, output, and collaboration as well as the patenting activities of 14,000 GRFP Fellows, we turned to Scopus and USPTO PatentsView. Although USPTO PatentsView allows users to download the entire database, or only the desired tables from it, for free, the only automated access that Scopus offers is through Application Programming Interfaces (APIs). Rather than manually searching every name on a webpage, APIs allow researchers and evaluators to obtain data by using a programming language, Python in our case, to run a given query automatically. APIs, however, often have "rate limits" that govern how many requests for data may be sent per second, as well as quotas on the total number of records that may be collected within a given period (e.g., per week or month). We used multiple APIs from the Scopus publications database, but the APIs had weekly quotas that were too low for our project's data science team to complete data collection within the evaluation period. We recruited a larger API data collection team but ran into two key problems: (1) not all team members had knowledge of APIs, and (2) not all team members knew Python. To resolve these issues, we developed software that required no programming knowledge: it asked users plain-language questions (e.g., which API would you like to call?) to obtain the necessary input, constructed queries that were sent automatically to the APIs, and ran in a standardized Python environment so that all users had the same versions of the required packages. This software enabled us to reduce data collection time from months to weeks.
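The collection tool itself is not published with this abstract. The following is a minimal Python sketch of the kind of prompt-driven, rate-limited API collection described above; the endpoint, header, query syntax, and pacing are assumptions for illustration, not the project's actual implementation.

```python
# Minimal sketch (not the project's code) of prompt-driven, rate-limited API collection.
# The endpoint, header, query syntax, and pacing below are illustrative assumptions.
import time
import requests

SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"  # assumed endpoint
REQUESTS_PER_SECOND = 2  # assumed rate limit; check the provider's documentation


def collect_records(api_key: str, queries: list[str]) -> list[dict]:
    """Send each prepared query to the API, pausing to stay under the rate limit."""
    pages = []
    for query in queries:
        response = requests.get(
            SCOPUS_SEARCH_URL,
            params={"query": query},
            headers={"X-ELS-APIKey": api_key, "Accept": "application/json"},
            timeout=30,
        )
        response.raise_for_status()
        pages.append(response.json())
        time.sleep(1 / REQUESTS_PER_SECOND)  # respect the per-second rate limit
    return pages


if __name__ == "__main__":
    # Plain-language prompts stand in for programming knowledge, as in the pilot.
    key = input("Paste your API key: ")
    name = input("Which Fellow would you like to search for? ")
    results = collect_records(key, [f'AUTHOR-NAME("{name}")'])  # illustrative query format
    print(f"Collected {len(results)} result page(s).")
```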
Challenge 2: Validating Records from the API: After the logistics of coordinating large-scale automated data collection are resolved, constructing the database/API query is a fundamental step in ensuring data quality. A key tension exists when deciding what pieces of information to include in queries. On the one hand, including as many pieces of information about an individual as possible should return only the most relevant records from the database being accessed by the API; on the other hand, broader queries with fewer pieces of information should ensure that the results include as many relevant records as possible. We attempted both strategies and manually reviewed the results from each. Our analyses indicated that the former approach excluded many records that should have been attributed to the Fellows, while searching for Scopus records using only the GRFP Fellows' names unsurprisingly returned far too many irrelevant results. Addressing the second problem entails identifying only those records that correspond to GRFP Fellows in our sample--a process we call validation. To tackle the validation problem, we turned to machine learning and natural language processing (NLP). We first started with a sample of 230 individuals and searched for information both manually and via automated means. After this initial stage, we used machine learning (e.g., LASSO) to predict which records from automated collection matched those found through manual searching. We developed modeling approaches that compared application and birth years to the year of first publication for Scopus and of first patent for PatentsView, and that included further data on field of study as well as the total number of potential matches. We also used NLP (i.e., short text topic models) to learn features from abstract text and interacted these features with the GRFP field of study, in order to learn the relationships between scholarship and the fields in which Fellows were active. Early results from these models show high classification accuracy, with precision and recall of 91.5% and 93.3% for Scopus and 88.9% and 76.2% for USPTO. Further exploration of advanced NLP techniques will only improve these validation models.
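As an illustration of the validation step, the sketch below trains an L1-penalized (LASSO-style) logistic regression to predict whether an automatically collected record matches one found manually. The feature names and synthetic data are placeholders standing in for the application-year, field-of-study, and match-count variables described above; this is not the authors' actual model code.

```python
# Illustrative validation model: L1-penalized (LASSO-style) logistic regression
# predicting whether an API-returned record truly belongs to a Fellow.
# Features and labels below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500  # stand-in for the manually validated training sample

year_gap = rng.integers(-5, 20, n)      # years from application to first publication/patent
field_match = rng.integers(0, 2, n)     # 1 if the record's subject area matches the GRFP field
n_candidates = rng.integers(1, 50, n)   # total number of potential matches for the name
X = np.column_stack([year_gap, field_match, n_candidates])

# Placeholder labels loosely tied to the features so the sketch trains sensibly.
y = ((field_match == 1) & (year_gap >= 0) & (n_candidates < 25)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The L1 penalty performs the LASSO-style shrinkage/feature selection mentioned above.
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred, zero_division=0))
print("recall:   ", recall_score(y_test, pred, zero_division=0))
```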
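The abstract-text features could be built along the following lines. LDA is used here only as a stand-in for the short text topic models the authors mention, and the example documents and field labels are invented for illustration.

```python
# Illustrative sketch: topic features from abstract text, crossed with field of study.
# LDA stands in for the short-text topic models described above; data are placeholders.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

abstracts = [
    "graphene transistors for low power electronics",
    "bayesian models of language acquisition in children",
    "crispr screens identify regulators of immune response",
]
fields = np.array([["Engineering"], ["Social Sciences"], ["Life Sciences"]])

# Per-document topic proportions learned from the abstract text.
counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# One-hot indicators for the GRFP field of study.
field_dummies = OneHotEncoder().fit_transform(fields).toarray()

# Interaction features: every topic proportion crossed with every field indicator,
# letting the validation model learn field-specific relationships to the text.
interactions = np.einsum("ij,ik->ijk", topics, field_dummies).reshape(len(abstracts), -1)
print(interactions.shape)  # (3, n_topics * n_fields) = (3, 6)
```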
Descriptors: Automation, Data Collection, Data Analysis, Validity, Decision Making, Program Evaluation, STEM Education, Computer Software, Computer Interfaces, Programming, Artificial Intelligence, Natural Language Processing, Classification, Accuracy
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: National Science Foundation (NSF), Directorate for Education and Human Resources (EHR)
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A