Thesis and Dissertation Workshop (WTDBD)
MapReduce, Entity Matching, Adaptive Window, Sorted Neighborhood Method
Cloud computing enables efficient parallel execution of data-intensive tasks such as Entity Matching (EM) in the era of Big Data. Consequently, understanding the challenges that EM faces in the cloud computing paradigm, and the possible solutions to them, has become an important research topic. In this context, we investigate how the MapReduce programming model can be used to perform efficient parallel EM with a variation of the Sorted Neighborhood Method (SNM) that uses a variable-size window. We propose the Distributed Duplicate Count Strategy (DDCS), an efficient MapReduce-based approach to this adaptive SNM that aims to further reduce the execution time of SNM.
Demetrio Gomes Mestre, Carlos Eduardo Pires
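
The variable-size window mentioned in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the DDCS approach and involves no MapReduce; it only shows the general idea of growing a sorted-neighborhood window while the ratio of detected duplicates to comparisons stays high. The function names (`similar`, `adaptive_snm`), the growth rule (one record at a time), the threshold `phi`, and the string-similarity matcher are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Hypothetical matcher: two strings match if their similarity ratio is high."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def adaptive_snm(records, key=lambda r: r, window=3, phi=0.5, match=similar):
    """Sort records by key, then compare each record with its successors.

    The window starts at `window` records and grows by one record
    (an assumed rule) while duplicates / comparisons >= phi.
    Returns the sorted records and the set of matched index pairs.
    """
    ordered = sorted(records, key=key)
    n = len(ordered)
    pairs = set()
    for i in range(n):
        end = min(i + window, n)  # exclusive end of the current window
        j = i + 1
        comparisons = 0
        duplicates = 0
        while j < end:
            comparisons += 1
            if match(ordered[i], ordered[j]):
                duplicates += 1
                pairs.add((i, j))
            j += 1
            # Window exhausted: extend it only if the duplicate ratio is high enough.
            if j == end and end < n and duplicates / comparisons >= phi:
                end += 1
    return ordered, pairs
```

With a fixed window, a long run of near-identical records is only partially compared; the adaptive rule keeps extending the window across such a run and stops once non-duplicates dominate, which is where the savings in comparisons are expected to come from.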