ERIC Number: ED575302
Record Type: Non-Journal
Publication Date: 2016
Pages: 102
Abstractor: As Provided
ISBN: 978-1-3696-6297-9
ISSN: N/A
EISSN: N/A
Available Date: N/A
Blocking Strategies for Performing Entity Resolution in a Distributed Computing Environment
Wang, Pei
ProQuest LLC, Ph.D. Dissertation, University of Arkansas at Little Rock
Entity resolution (ER) is an O(n²) problem, where n is the number of records to be processed. The pair-wise nature of ER makes it impractical to perform on large datasets without the use of a technique called blocking. In blocking, the records are separated into groups (called blocks) in such a way that the records most likely to match are within the same block. The ER system only compares pairs of records within the same block, thus reducing the total number of pairs to match. Traditionally, blocking algorithms build inverted indices in memory to quickly locate potential matches. With the advent of Big Data, processing has moved to a distributed environment of multiple processors to exploit the power of parallel processing. However, by design, distributed processing environments do not have a single, shared memory space. The design science research in this dissertation describes the design, verification, and validation of three new blocking strategies to support ER processes running in the Hadoop distributed processing environment. The three blocking strategies I designed and validated are (1) Pre-Resolution Transitive Closure of Match Keys, (2) Post-Resolution Transitive Closure of Cluster Identifiers, and (3) Incremental Transitive Closure of Cluster Identifier-Match Key Pairs. The research also describes the relative efficiency of the three approaches and identifies the strengths and weaknesses of each approach with respect to different characteristics of the input data. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page: http://www.proquest.com.bibliotheek.ehb.be/en-US/products/dissertations/individuals.shtml.]
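The traditional, single-machine form of blocking described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the dissertation and does not implement any of its three Hadoop strategies; the record layout, the `match_key` rule (first three letters of a surname), and the function names are all assumptions made for illustration. The sketch builds an in-memory inverted index from match key to record identifiers and then generates candidate pairs only within each block.

```python
from collections import defaultdict
from itertools import combinations


def match_key(record):
    # Hypothetical blocking key: first three letters of the surname, upper-cased.
    return record["surname"][:3].upper()


def build_blocks(records):
    # In-memory inverted index: match key -> list of record ids in that block.
    index = defaultdict(list)
    for rec_id, record in records.items():
        index[match_key(record)].append(rec_id)
    return index


def candidate_pairs(blocks):
    # Only records that share a block are ever paired for comparison.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample records, for illustration only.
records = {
    1: {"surname": "Johnson"},
    2: {"surname": "Johnsen"},
    3: {"surname": "Smith"},
    4: {"surname": "Smyth"},
    5: {"surname": "Smithe"},
}

blocks = build_blocks(records)
pairs = candidate_pairs(blocks)
all_pairs = len(records) * (len(records) - 1) // 2  # n(n-1)/2 without blocking
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

With this toy key, records 1 and 2 share a block and records 3 and 5 share a block, so only two of the ten possible pairs are compared. Record 4 ("Smyth") lands in a block of its own and is never compared with record 3, which shows how the choice of match key trades recall for fewer comparisons. The index here lives in one process's memory, which is the assumption the abstract says no longer holds in a distributed environment.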
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com.bibliotheek.ehb.be/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A
Author Affiliations: N/A