Latent Dirichlet Allocation Topic Modeling
Reducing the number of comparisons drastically
After cleaning the data we are left with a little less than 2 million papers. If we started searching for plagiarism right now, we would need to check every paper in the data set against every other paper. That means roughly 2 million x 2 million comparisons, which is about 4 trillion. The goal of this phase is to bring that number down using a technique known as unsupervised clustering.
Why will unsupervised clustering work?
First of all, what is unsupervised clustering? Clustering is the act of grouping things together based on similarity. If we divide the 2 million papers into two groups (clusters) of 1 million each, we only have to compare documents within the same cluster: 1 million x 1 million x 2 clusters, which is "only" 2 trillion comparisons. The savings keep compounding as we divide the documents over more clusters.
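The arithmetic above can be sketched in a few lines of Python. This uses the same all-pairs (n x n) counting the article uses; the function name is just for illustration.

```python
def comparisons(n_docs: int, n_clusters: int) -> int:
    """Comparisons needed when n_docs are split evenly over n_clusters,
    counting every pair within a cluster (the article's n x n counting)."""
    per_cluster = n_docs // n_clusters
    return n_clusters * per_cluster * per_cluster

print(comparisons(2_000_000, 1))  # 4 trillion: no clustering at all
print(comparisons(2_000_000, 2))  # 2 trillion: two clusters of 1 million
print(comparisons(2_000_000, 100))  # savings keep compounding
```

Doubling the cluster count halves the work, which is why even a modest topic model pays off quickly.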
The "unsupervised" part means that we don't tell the algorithm what to group the documents on. How does it decide that two documents are similar? We let the machine work that out for us, based on an analysis of the documents before the actual grouping happens.
Given that definition, this technique is a perfect fit for getting the number of comparisons as low as possible. Since we don't know how to decide that two documents are similar enough to be grouped together, we let the machine do it for us. By dividing the documents over a whole bunch of clusters, we speed up the comparison process without sacrificing much accuracy in the actual plagiarism detection. For example, it is highly unlikely that a medical research paper plagiarized a computer science paper. Because of this fundamental dissimilarity, these types of papers end up in different clusters, and we skip comparisons that would have been useless anyway.
What does Latent Dirichlet Allocation do?
Latent Dirichlet Allocation (LDA for short) creates a topic model for a given data set of text documents. This topic model consists of a list of "topics", each defined by a list of terms that appear frequently within it. For each of those terms, the model also stores the probability that a document belongs to the topic if it contains the term. For example: a document containing the words 'cell', 'protein', 'gene', and 'human' has a high probability of being a molecular biology paper, while a document containing the words 'voltage', 'circuit', 'power', and 'device' has a high probability of being an electrical engineering paper.
All of this allows us to add a column to the data set containing the topics a document belongs to. Now we only have to compare two documents if they share at least one topic. This technique brought the total number of comparisons down from about 4 trillion to just under 2 million, a HUGE speed optimization.
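The "only compare documents that share a topic" filter can be sketched like this. The document ids and topic labels are hypothetical; in the real pipeline the topic sets would come from the LDA column described above.

```python
from itertools import combinations

# Made-up mapping from document id to its set of assigned topics.
doc_topics = {
    "paper_a": {"biology", "genetics"},
    "paper_b": {"electronics"},
    "paper_c": {"biology"},
}

# Keep only pairs whose topic sets intersect; everything else is skipped.
pairs_to_compare = [
    (a, b)
    for a, b in combinations(doc_topics, 2)
    if doc_topics[a] & doc_topics[b]  # at least one shared topic
]
print(pairs_to_compare)  # [('paper_a', 'paper_c')]
```

Here only `paper_a` and `paper_c` share a topic, so the expensive plagiarism check runs on one pair instead of all three.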
Pictured below is a bubble chart of some of the topics LDA generated on our data set. Each topic has its own color, and the bigger a circle is, the higher the probability that a document containing that word belongs to the topic. This topic model was generated on the cleaned, stemmed version of the data set, as discussed in the cleaning phase, which is why the words look cut off ("solut" instead of "solution", "turbul" instead of "turbulence", etc.).