Overview
[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 2008.
[2] Mauricio A. Hernández and Salvatore J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery 2, 1998.
[3] Stratosphere Project Homepage. http://www.stratosphere.eu/
Papers for Presentation
[4] Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the International Conference on Management of Data (SIGMOD), 2010.
[5] Ahmed Metwally and Christos Faloutsos. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. Proceedings of the VLDB Endowment (PVLDB) 5, 2012.
[6] Lars Kolb, Andreas Thor, and Erhard Rahm. Multi-pass Sorted Neighborhood Blocking with MapReduce. Computer Science - Research and Development 27, 2012.
[7] Foto N. Afrati, Anish Das Sarma, David Menestrina, Aditya G. Parameswaran, and Jeffrey D. Ullman. Fuzzy Joins Using MapReduce. In Proceedings of the International Conference on Data Engineering, 2012.
[8] Guilherme Dal Bianco, Renata Galante, and Carlos A. Heuser. A fast approach for parallel deduplication on multicore processors. In Proceedings of the ACM Symposium on Applied Computing, 2011.
[9] Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, and Philip Bohannon. CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks. In Proceedings of the International Conference on Information and Knowledge Management, 2012.