Results
Did we actually find any plagiarism?
The good news: our pipeline was very effective at finding overlapping pieces of text between two papers. The combination of our text cleaning sequence, LDA and the Smith-Waterman algorithm surfaced passages that clearly, and sometimes eerily, overlapped between two completely different papers. The bad news: overlapping text doesn't necessarily imply plagiarism. We were able to divide the 35930 overlapping pieces of text we found into 6 categories. This page presents those categories individually with some real examples from the dataset.
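To give a feel for how the Smith-Waterman step can surface a shared passage, here is a minimal sketch of local alignment over word tokens. The whitespace tokenization, the scoring values (match=2, mismatch=-1, gap=-1) and the function name smith_waterman are assumptions made for illustration; they are not necessarily what our pipeline used, and the text cleaning and LDA steps are left out.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment of two token lists.

    Returns (score, aligned_a, aligned_b); "-" marks a gap.
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh alignment here
                H[i - 1][j - 1] + sub,  # align the two tokens
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, out_a[::-1], out_b[::-1]


# Two made-up sentences that share an institution-style phrase.
text_a = "we thank the department of computer science for support".split()
text_b = "funded by the department of computer science at utrecht".split()
score, overlap_a, overlap_b = smith_waterman(text_a, text_b)
print(score, " ".join(overlap_a))
```

Because every cell is floored at 0, the alignment can start and stop anywhere in either text, which is why the algorithm picks out a shared passage even when the surrounding papers are otherwise unrelated.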
In the tables below, papers are referenced by their Digital Object Identifier (DOI), a unique and persistent identifier assigned to scientific papers and books. Using this identifier, anyone can look up the paper. Clicking on a row shows the overlap our algorithm found between the two papers. Clicking on a DOI in the row opens a new tab where you can view the full paper to see the overlap in more context.
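If you would rather look up a paper yourself, a DOI can be turned into a link by appending it to the public resolver at https://doi.org/. The small helper below is a sketch; the name doi_to_url is hypothetical and is not part of our pipeline.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Build a resolver link for a DOI (hypothetical helper)."""
    # Keep "/" intact; percent-encode characters that are unsafe in a URL path.
    return "https://doi.org/" + quote(doi.strip(), safe="/")


print(doi_to_url("10.1016/j.cancergencyto.2008.05.014"))
```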
Institution Names
Department and institution names can be quite long, especially when an address is attached to them. When the same name appears in two different papers, the overlap might look like plagiarism, when it most definitely is not.
Copyright and Download Notices
Everyone seems to copy everyone else's copyright statements, which is pretty ironic. This is still not considered plagiarism though.
Front Page Templates
When a paper is downloaded from the web, the download supplier will sometimes add a front page template to the paper. This template contains information on when the paper was downloaded, by whom it was downloaded, what the content of the paper is and some legal text. This obviously does not count as plagiarism either.
Overlapping References
Due to the standardized nature of references in scientific papers, when two papers use the same source, there will be overlap in their reference sections. Sometimes this is a really long reference with many authors or a really long title ("10.1016/j.cancergencyto.2008.05.014" and "10.1007/s00335-011-9346-2"). Sometimes two papers have a lot of overlapping references that are also sorted in the same way ("10.1002/cav.1596" and "10.1016/j.cag.2014.09.013"). Slowly but surely we're starting to see more dubious cases: is it a coincidence that these two reference sections are so similar?
Standardized Enumerations and Definitions
If a concept, definition or enumeration (list) is somehow standardized, it will show up in the overlapping text analysis. The pair "10.1017/s0033291703001168" and "10.1016/j.brat.2006.12.002" is an example of a standardized enumeration. Even though the two papers have different authors and core messages, these similarities stick out like two sore thumbs to our pipeline. Is this plagiarism? Probably not, because most papers that use a standardized enumeration or definition cite the source directly before or after it.
Plagiarism?
This is where we start entering dangerous waters. As a disclaimer: we are not accusing any authors of the following articles of case-closed plagiarism. Plagiarism is taken very seriously by the scientific community and the last thing we want to do is falsely accuse people over similarities that may be accidental. That being said, you can judge the examples for yourself.