Hybrid Matching and Risk Assessment of the Misspelled Names [HMRA].

Varol, Cihan

ERIC Number: ED528627

Record Type: Non-Journal

Publication Date: 2009

Pages: 102

Abstractor: As Provided

ISBN: 978-1-1093-9715-4

ISSN: N/A

EISSN: N/A

Available Date: N/A

ProQuest LLC, Ph.D. Dissertation, University of Arkansas at Little Rock

Companies acquire personal information by phone, the World Wide Web, or email in order to sell or advertise their products. However, when this information is acquired, moved, copied, or edited, the data may lose its quality. Relying on data administrators, or on a tool with limited capabilities, to correct mistyped information can cause many problems. Moreover, most correction techniques are designed for the words used in daily conversation. Since personal names have different characteristics from general text, we first proposed a hybrid matching algorithm (PNRS) that employs phonetic encoding, string matching, and statistical facts to provide a possible candidate for a misspelled name. The "SoundD Phonetic Strategy" was created to provide name suggestions based on the phonetic structure of the misspelled name, the "Restricted Near Miss Strategy" was built to produce name suggestions based on the pattern of the ill-defined data, and the "Weighted Census Score" is used to produce the final suggestion based on the frequency of usage of the candidate names. The PNRS system makes it possible to suggest the closest match for ill-defined data, compared with the other algorithms available in the literature. Secondly, in order to justify the effectiveness of PNRS, we attempted to check the correctness of its results without looking at the reference table. To that end, a decision support system is embedded in the PNRS structure. This support system contains a similarity-based name cluster created using the k-medoids method. Finally, the PNRS Distance Metric (PNRSDM) is mathematically modeled in order to provide a confidence level for the results achieved by PNRS. Thirdly, in order to identify the impact on customer satisfaction caused by ill-defined or dirty data (misspelled or mistyped data), we defined a mathematical model for error propagation in Information Quality Products and created an N×N matrix. A framework and business case were created based on Talend Open Studio 3.0. Unified Modeling Language (UML) Activity Diagrams are used to model the estimation of error propagation, where the messages and their associated attributes are used to calculate single-step and multi-step error propagation in the workflow.
[The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com.bibliotheek.ehb.be/en-US/products/dissertations/individuals.shtml.]
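
As a rough illustration of how the three strategies named in the abstract could fit together, the following Python sketch combines a phonetic match, a single-edit "near miss" match, and a frequency-based final choice. It is not the dissertation's PNRS implementation: standard American Soundex stands in for the "SoundD" encoding (whose details are not given in this record), the near-miss step is the generic one-edit rule without the author's restrictions, and the function names and the name counts in HYPOTHETICAL_COUNTS are invented for the example.

```python
import string
from collections import defaultdict

# Standard American Soundex digit table (a stand-in, NOT the SoundD strategy).
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def near_misses(word):
    """All strings one deletion, transposition, substitution or insertion away."""
    w = word.lower()
    splits = [(w[:i], w[i:]) for i in range(len(w) + 1)]
    alphabet = string.ascii_lowercase
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    subs = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | swaps | subs | inserts


def suggest(misspelled, name_counts):
    """Return the most frequent reference name that matches phonetically
    or by a single edit, or None when no candidate is found."""
    by_code = defaultdict(set)
    for name in name_counts:
        by_code[soundex(name)].add(name)
    phonetic = by_code.get(soundex(misspelled), set())
    edits = near_misses(misspelled)
    pattern = {n for n in name_counts if n.lower() in edits}
    candidates = phonetic | pattern
    if not candidates:
        return None
    # Frequency decides; the name itself breaks ties deterministically.
    return max(candidates, key=lambda n: (name_counts[n], n))


# Hypothetical reference table: the counts are made up for illustration only.
HYPOTHETICAL_COUNTS = {"Smith": 1000, "Smyth": 20, "Schmidt": 150,
                       "Jones": 900, "Johns": 60}

print(suggest("Smiht", HYPOTHETICAL_COUNTS))  # Smith
```

In this sketch the phonetic and near-miss steps only collect candidates, and frequency alone picks the winner. A weighted score that also accounts for which strategy produced each candidate would be a natural refinement, but the weighting used in the dissertation is not described in this record.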

Descriptors: Business, Data Collection, Confidentiality, Error Patterns, Spelling, Mathematical Models, Item Sampling, Phonetics, Information Science, Computer Science

ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com.bibliotheek.ehb.be/en-US/products/dissertations/individuals.shtml

Publication Type: Dissertations/Theses - Doctoral Dissertations

Education Level: N/A

Audience: N/A

Language: English

Sponsor: N/A

Authoring Institution: N/A

Grant or Contract Numbers: N/A

Author Affiliations: N/A