The Big Data Plagiarism Pipeline

How we were able to cross-check millions of papers for plagiarism

Hi there! We are Lars, Max and William, three Dutch Computer Science Master's students at the VU in Amsterdam. Over the past month, we have been working on this project, where the goal was to cross-check a big data set of scientific papers in search of possible plagiarism. Sounds easy enough, right? However, when we were presented with 4.2 terabytes' worth of nearly four million scientific papers, it was hard to know where to start.

This website is dedicated to interactively teaching you (yes, you!) the steps we took to process such a huge data set in search of plagiarism. Click on any of the boxes below this text to see what each step does with the data set. If you're interested, you can read the full paper we wrote here.

Disclaimer: This website and the described pipeline were designed as a graded project for the course "Large Scale Data Engineering" within the Faculty of Computer Science at the VU Amsterdam in October 2020.