Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 2008.
 Mauricio A. Hernández and Salvatore J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery 2, 1998.
 Stratosphere Project Homepage. http://www.stratosphere.eu/
Papers for Presentation
 Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the International Conference on Management of Data (SIGMOD), 2010.
 Ahmed Metwally and Christos Faloutsos. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. Proceedings of the VLDB Endowment (PVLDB) 5, 2012.
 Lars Kolb, Andreas Thor, and Erhard Rahm. Multi-pass Sorted Neighborhood Blocking with MapReduce. Computer Science - Research and Development 27, 2012.
 Foto N. Afrati, Anish Das Sarma, David Menestrina, Aditya G. Parameswaran, and Jeffrey D. Ullman. Fuzzy Joins Using MapReduce. In Proceedings of the International Conference on Data Engineering, 2012.
 Guilherme Dal Bianco, Renata Galante, and Carlos A. Heuser. A fast approach for parallel deduplication on multicore processors. In Proceedings of the ACM Symposium on Applied Computing, 2011.
 Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, and Philip Bohannon. CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks. In Proceedings of the International Conference on Information and Knowledge Management, 2012.