How to Organize A Text Analytics Project?

Using natural language tools to uncover insights from conversational data.

Text analytics, or text mining, is the analysis of “unstructured” data contained in natural language text using various methods, tools, and techniques.

The popularity of text mining today is driven by the sheer volume of unstructured data available and the statistical tools to analyze it. With the growing popularity of social media and with the internet as a central location for all sorts of important conversations, text mining offers a low-cost method to gauge public opinion.

This was my inspiration to learn about text analytics, write this blog, and share what I learned with my fellow data scientists!

My key reference for this blog is DataCamp’s beautifully designed course Text Mining — Bag of Words.

Below are the six main steps for a text mining project. In this blog, I will focus on Steps 3, 4, 5 and 6 and discuss the key packages and functions in R which can be used for these steps.

1. Problem Definition

Identifying the specific goals or objectives for any project is key to its success. One needs to have domain understanding to define the problem statement appropriately.

For this article, I will ask two questions: according to online reviews, does Amazon or Google have the better pay perception, and which company offers a better work-life balance according to current employee reviews?

2. Identifying the Text Sources

There are multiple ways to collect employee reviews: from websites like Glassdoor and Indeed, from published articles on workplace reviews, or even through focus group interviews with employees.

3. Text Organization

This involves multiple steps for cleaning and pre-processing your text. There are two main packages in R that can be used for this: qdap and tm.

Points to Remember:

The tm package works on a text corpus object, whereas the qdap package is applied directly to the text vector.

Here, x is a character vector containing the positive reviews for Amazon.


# qdap cleaning function
> qdap_clean <- function(x) {
  x <- replace_abbreviation(x)  # expand common abbreviations
  x <- replace_contraction(x)   # expand contractions, e.g. "isn't" -> "is not"
  x <- replace_number(x)        # spell out numbers, e.g. "2" -> "two"
  x <- replace_ordinal(x)       # spell out ordinals, e.g. "1st" -> "first"
  x <- replace_symbol(x)        # spell out symbols, e.g. "%" -> "percent"
  x <- tolower(x)
  return(x)
}

You can also add more cleaning functions to the above, based on your specific requirements.
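For example, qdap's bracketX() removes any text inside brackets. A small sketch of such an extension (the extra step is just an illustration, not part of the course function):

# Optional extension of the cleaning function with another qdap helper
> qdap_clean_extended <- function(x) {
  x <- bracketX(x)     # remove text inside (), [] and {}
  x <- qdap_clean(x)   # then apply the cleaning steps defined above
  return(x)
}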

# Create a corpus object from the cleaned text vector
> corpus <- VCorpus(VectorSource(x))

Then use the tm_map() function — provided by the tm package — to apply cleaning functions to a corpus. Mapping these functions to an entire corpus makes scaling of the cleaning steps very easy.


# tm cleaning function
> clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  # custom stopwords are lowercase because tolower() has already been applied
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "google", "amazon", "company"))
  return(corpus)
}

Word stemming and stem completion on a sentence using the tm package

The tm package provides the stemDocument() function to get to a word’s root. This function either takes in a character vector and returns a character vector, or takes in a PlainTextDocument and returns a PlainTextDocument.


# Remove punctuation (text_data is a raw character string of review text)
> rm_punc <- removePunctuation(text_data)

# Create character vector
> n_char_vec <- unlist(strsplit(rm_punc, split = ' '))

# Perform word stemming: stem_doc
> stem_doc <- stemDocument(n_char_vec)

# Re-complete stemmed document: complete_doc
> complete_doc <- stemCompletion(stem_doc, comp_dict)

Point to remember:

Define your own comp_dict, a custom dictionary containing the words you want to use to re-complete the stemmed words.
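For illustration, comp_dict can be as simple as a character vector of the full words you expect to see (the words below are hypothetical):

# A hypothetical completion dictionary: a plain character vector of target words
# stemCompletion() will map stems such as "colleagu" or "manag" back to these words
> comp_dict <- c("colleague", "management", "benefit", "ambitious")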

4. Feature Extraction

After completing the basic cleaning and pre-processing of the text, the next step is to extract the key features, which can be done in the form of sentiment scoring or extracting n-grams and plotting them. For this purpose, the TermDocumentMatrix() (TDM) and DocumentTermMatrix() (DTM) functions come in very handy.


# Generate TDM
> coffee_tdm <- TermDocumentMatrix(clean_corp)

# Generate DTM
> coffee_dtm <- DocumentTermMatrix(clean_corp)

Points to remember:

Use a TDM when you have more words than documents to review, as it is easier to scan a large number of rows than columns.

You can then convert the results to matrices using the as.matrix() function, and then slice and dice and review parts of these matrices.
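For example (the slice indices below are just illustrative):

# Convert the TDM to a regular matrix
> coffee_tdm_m <- as.matrix(coffee_tdm)

# Rows are terms and columns are documents
> dim(coffee_tdm_m)

# Review a small slice: terms 100 to 105 across the first five documents
> coffee_tdm_m[100:105, 1:5]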

Let’s see a simple example of creating a TDM for bigrams:

To create a bigram TDM, we use TermDocumentMatrix() along with a control argument, which receives a list of control options (please refer to the TermDocumentMatrix documentation for more details). Here, a custom function called tokenizer is passed in, which tokenizes the text into bigrams.
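One common way to build such a tokenizer is with the RWeka package (a sketch, assuming RWeka is installed):

# Load RWeka for n-gram tokenization
> library(RWeka)

# Bigram tokenizer: min = 2 and max = 2 restrict tokens to two-word phrases
> tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))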


# Create bigram TDM
> amzn_p_tdm <- TermDocumentMatrix(
    amzn_pros_corp,
    control = list(tokenize = tokenizer))

# Create amzn_p_tdm_m
> amzn_p_tdm_m <- as.matrix(amzn_p_tdm)

# Create amzn_p_freq
> amzn_p_freq <- rowSums(amzn_p_tdm_m)

5. Feature Analysis

There are multiple ways to analyze the text features. A few of them are discussed below.

a. Barplot


# Sort term_frequency in descending order
> amzn_p_freq <- sort(amzn_p_freq, decreasing = TRUE)

# Plot a barchart of the 10 most common words
> barplot(amzn_p_freq[1:10], col = "tan", las = 2)

b. WordCloud


# Plot a wordcloud using amzn_p_freq values (wordcloud() comes from the wordcloud package)
> wordcloud(names(amzn_p_freq), amzn_p_freq, max.words = 25, colors = "red")

c. Cluster Dendrograms

This is a simple clustering technique: perform hierarchical clustering and plot a dendrogram to see how connected different phrases are.


# Create amzn_p_tdm2 by removing sparse terms
> amzn_p_tdm2 <- removeSparseTerms(amzn_p_tdm, sparse = .993)

# Create hc as a cluster of distance values
> hc <- hclust(dist(amzn_p_tdm2, method = "euclidean"), method = "complete")

# Produce a plot of hc
> plot(hc)

You can see similar topics throughout the dendrogram like “great benefits,” “good pay,” “smart people,” etc.

d. Word Association

This is used to examine top phrases that appear in the word clouds and find associated terms using the findAssocs() function from the tm package.

The code below is used to find the most associated words with the most frequent terms in the positive reviews for Amazon.


# Find associations with Top 2 most frequent words
> findAssocs(amzn_p_tdm, "great benefits", 0.2)
$`great benefits`
   stock options     options four     four hundred    vacation time
            0.35             0.28             0.27             0.26
  benefits stock  competitive pay great management    time vacation
            0.22             0.22             0.22             0.22

> findAssocs(amzn_p_tdm, "good pay", 0.2)
$`good pay`
pay benefits     pay good  good people    work nice
        0.31         0.23         0.22         0.22

e. Comparison Clouds

This is used when you wish to examine two different corpora in one go, rather than analyzing them separately (which can be more time-consuming).

The code below compares the positive and negative reviews for Google.
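As a quick sketch of the setup assumed here (goog_pros and goog_cons are assumed to be character vectors of cleaned reviews), all positive reviews are collapsed into one document and all negative reviews into another, so the corpus holds exactly two documents:

# Assumed setup: collapse pros and cons into one document each
> all_goog <- c(paste(goog_pros, collapse = " "),
                paste(goog_cons, collapse = " "))

# Build a two-document corpus
> all_goog_corpus <- VCorpus(VectorSource(all_goog))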


# Create all_goog_corp using the clean_corpus() function defined earlier
> all_goog_corp <- clean_corpus(all_goog_corpus)

# Create all_tdm
> all_tdm <- TermDocumentMatrix(all_goog_corp)

<<TermDocumentMatrix (terms: 2279, documents: 2)>>
Non-/sparse entries: 2845/1713
Sparsity           : 38%
Maximal term length: 27
Weighting          : term frequency (tf)

# Name the columns of all_tdm
> colnames(all_tdm) <- c("Goog_Pros", "Goog_Cons")

# Create all_m
> all_m <- as.matrix(all_tdm)

# Build a comparison cloud (comparison.cloud() comes from the wordcloud package)
> comparison.cloud(all_m, colors = c("#F44336", "#2196f3"), max.words = 100)

f. Pyramid Plots

A pyramid plot displays two sets of horizontal bars back to back (as opposed to a regular horizontal bar chart), which makes it easy to compare how often the same phrases appear in two groups.

The code below compares the frequency of positive phrases for Amazon vs Google.
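Note that all_tdm_m below is assumed to be the matrix form of a TDM whose two documents are the Amazon pros and the Google pros, built in the same way as the Google TDM above (a sketch; all_pros_corp is an assumed name):

# Assumed setup: TDM over a two-document corpus of Amazon pros and Google pros
> all_tdm <- TermDocumentMatrix(all_pros_corp)
> colnames(all_tdm) <- c("Amazon Pro", "Google Pro")
> all_tdm_m <- as.matrix(all_tdm)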


# Create common_words
> common_words <- subset(all_tdm_m, all_tdm_m[,1] > 0 & all_tdm_m[,2] > 0)
> str(common_words)
num [1:269, 1:2] 1 1 1 1 1 3 2 2 1 1 ...
- attr(*, "dimnames")=List of 2
..$ Terms: chr [1:269] "able work" "actual work" "area traffic" "atmosphere little" ...
..$ Docs : chr [1:2] "Amazon Pro" "Google Pro"

# Create difference
> difference <- abs(common_words[, 1] - common_words[, 2])

# Add difference to common_words
> common_words <- cbind(common_words, difference)
> head(common_words)
                  Amazon Pro Google Pro difference
able work                  1          1          0
actual work                1          1          0
area traffic               1          1          0
atmosphere little          1          1          0
back forth                 1          1          0
bad work                   3          1          2

# Order the data frame from most differences to least
> common_words <- common_words[order(common_words[,"difference"],decreasing = TRUE),]

# Create top15_df
> top15_df <- data.frame(x = common_words[1:15, 1],
                         y = common_words[1:15, 2],
                         labels = rownames(common_words[1:15, ]))

# Create the pyramid plot (pyramid.plot() comes from the plotrix package)
> pyramid.plot(top15_df$x, top15_df$y,
               labels = top15_df$labels, gap = 12,
               top.labels = c("Amzn", "Pro Words", "Google"),
               main = "Words in Common", unit = NULL)
[1] 5.1 4.1 4.1 2.1

6. Drawing Conclusions

Based on the above visual (the “Words in Common” pyramid plot), Amazon appears to have a better work environment and work-life balance than Google. Working hours seem to be longer at Amazon, but perhaps it provides other benefits to restore the work-life balance. We would need to collect more reviews to draw a firmer conclusion.

So we finally come to the end of this blog. We learned how to organize a text analytics project, the different steps involved in cleaning and pre-processing, and finally how to visualize the features and draw conclusions. I am on my way to completing my own text analytics project based on this blog and my learnings from DataCamp, and I will soon post the GitHub repository for the project to help you further. Our next goal should be to perform sentiment analysis. Till then, keep CODING!!

Hope you liked this blog. Do share your comments on what you liked and what you would like me to improve in my next blog.

Keep watching this space for more. Cheers!

(First published at www.datacritics.com)