The Smith-Waterman Algorithm
Finding the actual plagiarism
All of the steps until now have been leading up to this one: text matching using the Smith-Waterman algorithm. This algorithm was first designed for use in the field of molecular biology to find similar sub-sequences between two DNA, RNA or protein sequences. It can also be put to use in a plagiarism checker, finding a common sub-sequence between two lists of words or characters.
The Sequence Alignment Matrix
To do this sequence alignment, the algorithm uses a so-called "alignment matrix". It works as follows: Say we want to compare the two words: "waterman" and "somewhat". We first create a table with the letters of the first word on the horizontal axis and the letters of the second word on the vertical axis. Then we set the first column and row of the table to 0. We now have a table (matrix) that looks like this:
Now the fun can start. We first set a "match score" and a "gap penalty". These are the points and penalties we will award to the algorithm if it finds a match or doesn't. It's common practice to use a match score of 2 and a gap penalty of -1 so we will use that too. Let's first look at the empty cell in the top left corner, comparing "w" with "s". What value should it get? The Smith-Waterman Algorithm defines that you take the highest value out of 4 possible values:
- The value to the top left (diagonally) of the current cell, adding the match score (2) if the two items match or subtracting the match score (-2) if they don't
- The value to the left of the current cell, subtracting the gap penalty (-1)
- The value to the top of the current cell, subtracting the gap penalty (-1)
- Zero
But when we reach the "m" row, we find our first match! For this cell, we can pick option 1: 0 + 2 = 2, which is higher than 0! We add a 2 to that cell and also remember that we came from the cell to the top left, which will come in handy later. The cell next to that compares "a" with "m", which is not a match, but not 0 points either. For this cell we pick option 2, because 2 - 1 = 1 and that is still higher than 0. We write down a 1 and remember that we came from the left cell for later. Our matrix now looks like this:
Following this logic, we can fill in every value in the matrix. Doing that gives us the following end result:
Finding the biggest overlap
Now that we have calculated the alignment matrix, finding the overlap is very easy. Simply start at the highest value in the matrix, in our case, this is the 5 at the bottom, and follow the arrows back until you see a 0. So we go from the "t" match, to the "a" match, to the "w" - "h" gap, to the "w" match to the zero in the leftmost column. In the example matrix below, the backtracking path is made up of the blue arrows.
The Smith-Waterman algorithm has now determined that the common sub-sequence between the words "waterman" and "somewhat" is "w-at", with a similarity score of 5 (the highest value in the matrix).
Searching for plagiarism
Okay cool! we can now execute a sequence alignment of two words and calculate a score for how similar the two words are. How will this be used to find plagiarism though? As you might recall from the cleaning phase, our data set is not a list of characters, it is now a list of words. This same algorithm can be used to find a common sub-sequence between two lists of words.
The biggest issue here is that with very large papers with many many words, the alignment matrix gets so big that it sometimes might take a few minutes to calculate a single similarity score. This is one of the major drawbacks of using this algorithm and we sadly did not find a way around it, other than literally running it for days on end in the search for plagiarism. We did find some very interesting results, which you can find on the results page.