The Data Cleaning Process
Preparing the data for analysis
After extracting the raw text from the nearly 4 million PDF files and discarding the papers that couldn't be opened or that we couldn't find any further information about, we were left with a 32 GB data set of 2,636,933 papers. Even though it has already been reduced, this is still a LOT of data. Because of this, we implemented a data cleaning sequence that in the end was able to halve the data set in size, to about 16 GB. Not only is the data now smaller, it is also more fit to be analysed, for reasons that will be discussed later on.
Global Cleaning
The cleaning happens in two phases, the first of which is the "global cleaning" phase. This phase focuses on removing entire documents from the data set that have no business being there for whatever reason. The steps it takes are as follows (a sketch of these filters follows the list):
- First, it removes all papers from the data set that have more than 100 pages, because that is far too much data to analyse anyway.
- Some papers in the data set were just the same paper under different file names. We removed every record whose text was an exact duplicate of another's.
- Finally, we used some external code to analyse the languages of the remaining papers. It turns out that 85% of them were English, 10% were German, only 1% were French, and the remaining 4% were spread over about 50 other languages, including "could not detect language". After much deliberation, we decided to remove all papers that were not detected as English.
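To make these three filters concrete, here is a minimal sketch of what this phase could look like in Python. The field names (`text`, `num_pages`) and the use of the `langdetect` library are assumptions for illustration; the write-up only mentions "external code" for language detection.

```python
import hashlib
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def global_clean(papers):
    """Yield only the papers that survive the three global filters.

    `papers` is assumed to be an iterable of dicts with (hypothetical)
    "text" and "num_pages" fields.
    """
    seen_hashes = set()
    for paper in papers:
        # 1. Drop papers with more than 100 pages.
        if paper["num_pages"] > 100:
            continue
        # 2. Drop exact textual duplicates: hash each text once instead
        #    of comparing every pair of documents.
        digest = hashlib.sha1(paper["text"].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # 3. Drop everything that is not detected as English.
        try:
            if detect(paper["text"]) != "en":
                continue
        except LangDetectException:  # the "could not detect language" case
            continue
        yield paper
```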
After this global cleaning phase, the data set has already been reduced to just under 2 million papers, which is an improvement, but nowhere near the goal of halving the data set in size. The second, "local cleaning" phase takes care of cleaning the actual text inside the documents.
Local Cleaning
Strip whitespace and lowercase
As a running example for this chapter, we will follow, step by step, how a lyric from the Rick Astley classic "Never Gonna Give You Up" is cleaned by our local cleaning sequence. First of all, when a document enters the sequence, all leading and trailing whitespace is removed. After that, the entire text is converted to lowercase. Both of these transformations help with the steps that follow.
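Using a well-known line from the song as a hypothetical input, these two transformations are one-liners in Python:

```python
lyric = "  Never gonna give you up, never gonna let you down!  "

# Strip leading/trailing whitespace, then lowercase the whole text.
cleaned = lyric.strip().lower()
print(cleaned)  # "never gonna give you up, never gonna let you down!"
```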
Tokenization
Now that the text is stripped and lowercased, there is no longer any need to process the text character by character. From now on, we only need to analyse the words in it. This is why the next step is "tokenization", which converts a text (essentially, a list of characters) into a list of words. This step removes all punctuation and spaces from the text as well.
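The write-up does not say which tokenizer was used; a simple regex-based sketch that matches the behaviour described above (splitting into words while dropping punctuation and spaces) could look like this:

```python
import re

def tokenize(text):
    """Split a lowercased text into word tokens, discarding
    punctuation and whitespace along the way."""
    return re.findall(r"\w+", text)

tokens = tokenize("never gonna give you up, never gonna let you down!")
# ['never', 'gonna', 'give', 'you', 'up', 'never', 'gonna', 'let', 'you', 'down']
```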
Filter Forbidden Words
For the next step, the text is filtered on "forbidden words" (a sketch of this filter follows the list). These forbidden words are words that either:
- Contain numbers or any other character that is not considered a letter (examples: Fig3, 1999, niño)
- Are fewer than 4 characters long
- Are stop words, which are words that appear frequently in a text but rarely add any value (examples: I, me, we, our, am, could, most, other, very, so)
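A sketch of this filter, assuming NLTK's English stop word list (the actual stop word list used is not specified):

```python
from nltk.corpus import stopwords  # requires: pip install nltk, nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def is_allowed(word):
    """Keep a token only if it is made purely of ASCII letters
    (so "fig3", "1999" and "niño" are all rejected), is at least
    4 characters long, and is not a stop word."""
    return word.isascii() and word.isalpha() and len(word) >= 4 and word not in STOP_WORDS

tokens = ['never', 'gonna', 'give', 'you', 'up', 'never', 'gonna', 'let', 'you', 'down']
filtered = [w for w in tokens if is_allowed(w)]
# Short tokens such as "you" and "let" and stop words such as "down" are dropped.
```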
Word stemming
After that, the words in the text are stemmed. This means that each word is reduced to its stem: plural nouns become singular, conjugated verbs lose their conjugation, and so on. We do this so that the text-matching algorithm (discussed on the Smith-Waterman page) can focus on the meaning and function of a word in a sentence instead of on how exactly it is spelled. After all, for plagiarism checking, the words "colors", "coloring", "colored" and "color" should not be treated as different, because they represent the same idea even though they are written differently.
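The write-up does not name a specific stemmer; the classic Porter stemmer from NLTK is one common choice and handles the "color" example from above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["colors", "coloring", "colored", "color"]
stems = [stemmer.stem(w) for w in words]
print(stems)  # all four variants collapse onto the same stem: "color"
```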
Document Frequency Filtering
As a final step, we remove all words from the text that are used by a very large number of documents. A metric that is commonly used for this is the "document frequency" of a word: the number of documents that contain that word. The words with the highest document frequency are filtered out. Examples of words with a very high document frequency in this data set are: "however", "result", "study", "research", "thus", "also", "condition". This list effectively serves as a research-paper-specific list of stop words: these words are used in so many papers that it would be a waste of time to check them for plagiarism.
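A minimal sketch of document frequency filtering over the tokenized corpus; the actual cut-off used in the project is not stated, so the 50% threshold below is purely illustrative:

```python
from collections import Counter

def high_df_words(documents, max_df=0.5):
    """Return the words that occur in more than `max_df` (a fraction)
    of the documents. `documents` is a list of token lists."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # count each word at most once per document
    cutoff = max_df * len(documents)
    return {word for word, count in df.items() if count > cutoff}

def filter_frequent_words(documents):
    """Drop the corpus-specific 'stop words' from every document."""
    frequent = high_df_words(documents)
    return [[w for w in tokens if w not in frequent] for tokens in documents]
```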
Result
In just a few steps, we went from a simple lyric of a song we all know and love to a list of words that contains the essence of that lyric. As much data as possible has been removed without losing the core of the text, which was exactly the goal of the data cleaning phase. Next up is the clustering phase, which takes a huge workload off the text-matching phase in a completely different way.