Text Mining with R: cleaning and preparing data
This is the first post of a series intended to present an overview of the steps involved in text and data mining, from data preparation to sentiment analysis. It is an excerpt of the online seminar held as part of the Summer School for the Plan for Science Degrees (Piano di Lauree Scientifiche, PLS) promoted by the University of Naples Federico II. Participants were secondary school students and early university students. R was used for the statistical analyses. In this article I’ll introduce the first steps needed to preprocess textual data.
Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically.
Text mining is the automated process of selecting and analysing textual data in order to find patterns, extract information and perform semantic analysis. Common text mining tasks include text classification, text clustering, creation of granular taxonomies, document summarisation, entity extraction, and sentiment analysis.
The main advantage of text mining is that text can be found (nearly) everywhere; some examples are:
- Medical records
- Product reviews
- Social posts (Facebook, Twitter, etc.)
- Book recommendations
- Legislation, court decisions
- Emails
- Websites
However, text data is ‘dirty’ and unstructured, meaning that there is no ready-made feature-vector representation and we have to account for:
- Linguistic structure
- Language
- Relationships between words
- Importance of words
- Negations, etc.
- Grammatical and spelling errors, abbreviations, synonyms, homographs
More importantly, we have to consider that text is intended for communication between people: context and syntax matter! See for instance Van Hee, Lefever, and Hoste (2016). For this reason, there is no ‘standard’ method; each document requires a dedicated approach.
Example. There is a huge difference between the sentences ‘Even the sun sets in paradise’ and ‘The sun sets even in paradise’ or, again, between ‘she only told me she loved him’ and ‘she told me she loved only him,’ but for a computer those pairs of sentences are almost identical.
In general, text is stored in an unstructured way. Turning the raw, unstructured data into structured data is the first hurdle to overcome, and it is what makes it possible to analyse vast collections of text documents.
Structured data | Unstructured data |
---|---|
Data is organised in a defined format, allowing it to be easily parsed and manipulated by a computer. | Data has irregularities and ambiguities that make it difficult to process using traditional software. |
Thus, the first step involves pre-processing the raw data to transform unstructured data into structured data, turning a collection of documents into a feature-vector representation. A collection of documents is called a corpus, and each document is composed of individual tokens or terms (i.e. words).
Documents can take many forms: a full book (e.g. LOTR), a chapter, a few pages, a few sentences (e.g. Twitter posts) or even a single sentence. The process of converting this kind of unstructured data into a structured feature vector is called featurization.
In our case, each document is a short answer: we asked the participants of the seminar to answer (using a Google Form) the question ‘What would you like to do after graduation?’ (‘Cosa vorresti fare dopo il diploma?’), with an open-ended answer limited to 200 characters.
During the PLS seminar we gathered a total of 147 answers; the dataset has been translated into English and is available here (here you can find the original version).
Text Mining with R
The tm package is required for the text mining functions, while some stemming procedures need the SnowballC package.
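Both packages can be loaded (and installed, if missing) before reading the data; a minimal setup could look like this:
# load the packages used throughout the post
# (run install.packages(c("tm", "SnowballC")) first if they are not installed)
library(tm)
library(SnowballC)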
rawdata <- read.csv("pls2021_eng.csv",sep=";")
head(rawdata)
ID | answer |
---|---|
1 | University of Veterinary Medicine |
2 | University |
3 | University - Informatics |
4 | University |
5 | I would like to enroll in a physics or aerospace engineering degree program |
The function SimpleCorpus transforms raw texts, initially stored as a character vector, into a corpus.
corpus <- SimpleCorpus(VectorSource(rawdata[,2]),control = list(language="en"))
When the data is larger and more structured, other functions (e.g. VCorpus, PCorpus) are suggested to boost performance and minimise memory pressure.
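As an illustration (not needed for our small dataset), the same vector could be loaded as a volatile corpus with VCorpus; this is only a sketch, not the approach used in the rest of the post:
# hypothetical alternative: a volatile corpus built from the same vector of answers
corpus_v <- VCorpus(VectorSource(rawdata[,2]), readerControl = list(language = "en"))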
With str(corpus) we can inspect the newly defined data structure.
str(corpus)
## Classes 'SimpleCorpus', 'Corpus' hidden list of 3
## $ content: chr [1:147] "University of Veterinary Medicine" "University " "University - Informatics" "University " ...
## $ meta :List of 1
## ..$ language: chr "en"
## ..- attr(*, "class")= chr "CorpusMeta"
## $ dmeta :'data.frame': 147 obs. of 0 variables
Pre-processing data
Depending on the task, there are several methods we can use to standardise the text. Some common pre-processing techniques are described below.
Normalisation: this is the first attempt to reduce the number of unique tokens in the text, removing variations and cleaning out redundant information. Among the most common approaches are converting every character to lower case, correcting misspellings and removing special characters.
Usually, a typical normalisation involves lowercasing the text and removing stop words.
corpus_cl <- tm_map(corpus, tolower)
Additionally, it may be necessary to convert special symbols (e.g. emoticons) and accented characters, which are common in some languages (e.g. Italian), to their plain version. Several techniques can do the trick; one of the most common is to change the original encoding to ASCII with the transliteration option (ASCII//TRANSLIT) using the iconv() function.
corpus_cl <- tm_map(corpus_cl,iconv,from="UTF-8",to="ASCII//TRANSLIT")
Sometimes the conversion to ASCII returns a question mark “?”, meaning that the algorithm was not able to map the character from the initial encoding to ASCII.
corpus[116]$content
## [1] "the millionaire 😂"
corpus_cl[116]$content
## [1] "the millionaire ?"
This kind of error will be corrected in the next step, when punctuation symbols are removed.
Stopwords: texts contain many words that carry no significant meaning for classification algorithms (e.g. ‘and,’ ‘about,’ ‘however,’ ‘afterwards,’ ‘again,’ etc.). The most common technique to deal with these words is to remove them from the documents.
corpus_cl <- tm_map(corpus_cl, removeWords, c(stopwords('en')))
corpus_cl[1:5]$content
## [1] "university veterinary medicine"
## [2] "university "
## [3] "university - informatics"
## [4] "university "
## [5] " like enroll physics aerospace engineering degree program "
The corpus can be further filtered by using the following functions.
# remove numbers
corpus_cl <- tm_map(corpus_cl, removeNumbers)
# remove punctuation
corpus_cl <- tm_map(corpus_cl, removePunctuation)
# set the dictionary and remove additional words/acronyms
drop <- c("cuz")
corpus_cl <- tm_map(corpus_cl,removeWords,drop)
# remove extra white spaces
corpus_cl <- tm_map(corpus_cl, stripWhitespace)
corpus[117]$content
## [1] "My biggest dream is to be a forensic anthropologist, but I am very afraid of not being able to do it and of making the wrong choice, cuz I would also like to do criminology or be a detective."
corpus_cl[117]$content
## [1] " biggest dream forensic anthropologist afraid able making wrong choice also like criminology detective"
Stopword removal, like other procedures (see stemming below), is clearly language-dependent: each language has its own list of stopwords and special characters.
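For instance, the language-specific stopword lists bundled with tm can be inspected directly; the "it" code for Italian is assumed to work like the "en" code used above:
# peek at the built-in stopword lists for English and Italian
head(stopwords("en"))
head(stopwords("it"))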
Lemmatisation/Stemming: replaces the suffix of a word with a different one, or removes it completely, to obtain the basic word form (the lemma, or stem).
corpus_cl <- tm_map(corpus_cl, stemDocument,language = "it")
In some cases, to avoid ambiguity, the stem is slightly different from the original word. The final result is noticeably different from the initial corpus.
corpus[1:5]$content
## [1] "University of Veterinary Medicine"
## [2] "University "
## [3] "University - Informatics"
## [4] "University "
## [5] "I would like to enroll in a physics or aerospace engineering degree program "
corpus_cl[1:5]$content
## [1] "university veterinary medicin"
## [2] "university"
## [3] "university informatics"
## [4] "university"
## [5] "lik enroll physics aerospac engineering degre program"
Document-Term Matrix
After pre-processing, the data have the form of a ‘clean’ corpus, consisting of a collection of n = 147 documents, each containing a collection of terms. The corpus can be arranged into a Document-Term Matrix (DTM), whose rows correspond to the documents and whose columns correspond to the unique terms of the corpus; each cell holds the number of times a term appears in a document.
 | Term 1 | Term 2 | … | Term p |
---|---|---|---|---|
Doc 1 | f11 | f12 | … | f1p |
Doc 2 | f21 | f22 | … | f2p |
… | … | … | … | … |
Doc n | fn1 | fn2 | … | fnp |
where fij denotes how many times term j occurs in document i.
In our case, the DTM is obtained through the DocumentTermMatrix function.
DTM <- DocumentTermMatrix(corpus_cl)
The resulting DTM has dimension 147 x 298, and the value of each cell corresponds to the raw count of a given term in a given document. Row marginals represent the number of terms in each document, while column marginals count how many times each unique term appears in the corpus. As an illustration, a 5 x 5 portion of the DTM (the first five documents and a selection of their terms) looks as follows.
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 8/17
## Sparsity : 68%
## Maximal term length: 11
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aerospac informatics medicin university veterinary
## 1 0 0 1 1 1
## 2 0 0 0 1 0
## 3 0 1 0 1 0
## 4 0 0 0 1 0
## 5 1 0 0 0 0
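The marginals mentioned above can be computed directly on the sparse matrix; here is a minimal sketch, assuming the slam package (used internally by tm) is available:
# row marginals: number of (retained) terms in each document
# column marginals: number of occurrences of each term in the whole corpus
library(slam)
doc_lengths <- row_sums(DTM)
term_counts <- col_sums(DTM)
head(sort(term_counts, decreasing = TRUE))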
Different schemes for weighting raw counts can be applied, depending on the type of statistical measures to be derived.
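For instance, a TF-IDF weighting (shown here only as an illustration, not used in the rest of the post) can be requested directly when building the matrix:
# hypothetical variant: weight the entries by TF-IDF instead of raw term frequency
DTM_tfidf <- DocumentTermMatrix(corpus_cl, control = list(weighting = weightTfIdf))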
Next time, we will start from the DTM to provide some descriptive statistics and generate the word cloud.