So much valuable information is out there in the form of social media talk.
In this case study we will explore how STATISTICA Text Miner
was used to access tweets from Twitter
related to specific topics,
store and track trends,
find interesting relationships, and gain insight.
My name is Jennifer Thompson and I'm a statistician at StatSoft.
Let's explore ways to analyze social media content.
I want to learn more about discussions involving keywords like StatSoft,
STATISTICA,
data mining,
text mining, predictive analytics, and the list goes on.
What are people talking about when they mention these keywords?
When tracked over time, do we see spikes or lulls in the number of mentions?
What other words occur frequently
in these discussions,
and what's the overall sentiment?
What other interesting things can we learn from what people are saying online?
Currently, a few tools are available for looking at Twitter posts for things like
sentiment and the number of mentions for a keyword.
So, why did I use STATISTICA Text Miner?
The benefits are in the automation of reports and alerting,
as well as greater abilities to ascertain the content of conversations
automatically.
Using STATISTICA's Monitoring and Alerting Server,
I have long term and continuous tracking of conversations,
and the historical data are stored
so I can access them again later.
This allows me to compare conversations this year to last.
Tweets are searchable on Twitter for only a little over a week, so storing the information
becomes necessary.
Also, with automation, the analysis runs on its own and I am alerted only when interesting trends
are found--
for instance, if a new word or phrase is occurring in the conversation.
This emailed report shows the new trends and I can react accordingly.
I can look at the reports at my leisure
but I get a text message any time sentiment shifts towards the negative.
This shift would be a big deal,
so I need to be alerted as soon as possible.
I used a STATISTICA Visual Basic macro
to get the tweets and supporting information into a spreadsheet.
I extracted information like the timestamp,
user ID,
and the text of the post.
Then I can begin the analysis and perform text mining on the tweets.
Basic descriptive statistics on the Twitter data can be informative.
How many times
was the keyword mentioned in tweets in a given day?
This plot shows daily mentions
for the STATISTICA brand as well as two major competitors.
When spikes in the number of mentions occur,
it's very interesting to see what people are saying. Does it coincide with a press
release, marketing campaign, or some event such as a conference or the release of
a new version of the software?
Finding these spikes
is the first step to understanding their cause
and how we can
instigate more positive conversations about the brand.
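
The daily-mention count described above can be sketched outside STATISTICA in a few lines of Python. This is only an illustration, not the macro used in the study: the sample rows, the `daily_mentions` helper, and the timestamp format are all assumptions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows (timestamp, user ID, text); not data from the study.
tweets = [
    ("2012-03-01 09:15:00", "u1", "Trying out data mining tools today"),
    ("2012-03-01 17:40:00", "u2", "Data mining webinar was great"),
    ("2012-03-02 08:05:00", "u1", "Text mining is fun"),
    ("2012-03-03 12:30:00", "u3", "New data mining book out"),
]

def daily_mentions(rows, keyword):
    """Count tweets per calendar day whose text contains the keyword (case-insensitive)."""
    counts = Counter()
    for timestamp, _user, text in rows:
        if keyword.lower() in text.lower():
            day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
            counts[day.isoformat()] += 1
    # Sort by day so the result reads like a time series.
    return dict(sorted(counts.items()))

mentions = daily_mentions(tweets, "data mining")
print(mentions)
```

Days with no mentions are simply absent from the result; a plot of a real series would need those gaps filled with zeros.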
How much overlap occurs for users?
When looking at trends, are a handful of tweeters
creating the majority of the buzz,
or are the mentions coming from unique users?
Here we see that 71% of the tweets
are from unique users. That tells me the reach is likely further than if most
tweets were coming from just one person.
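
One way to read the unique-user figure is as the number of distinct user IDs divided by the number of tweets. The transcript does not say exactly how the percentage was computed, so the definition below, the `unique_user_share` helper, and the sample IDs are assumptions.

```python
def unique_user_share(user_ids):
    """Share of tweets attributable to distinct users: distinct IDs / total tweets."""
    if not user_ids:
        return 0.0
    return len(set(user_ids)) / len(user_ids)

# Hypothetical user IDs, one per tweet.
user_ids = ["u1", "u2", "u1", "u3"]
share = unique_user_share(user_ids)
print(f"{share:.0%} of tweets come from distinct users")
```

A value near 100% means almost every tweet comes from a different account; a low value means a few accounts produce most of the volume.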
Here we're looking at the daily mentions
of various keywords:
data mining, analysis, and statistics.
A change in trend could indicate an interesting place to drill down into the
content.
Retweets can be interesting as well.
For the keyword 'data mining,' 14% of posts were retweets.
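
A retweet share like this can be approximated by checking for the conventional "RT @" prefix. That detection rule is an assumption (the transcript does not say how retweets were identified), and the helper name and sample texts are hypothetical.

```python
def retweet_share(texts):
    """Fraction of posts that start with the conventional 'RT @' retweet prefix."""
    if not texts:
        return 0.0
    retweets = sum(1 for t in texts if t.lstrip().upper().startswith("RT @"))
    return retweets / len(texts)

# Hypothetical post texts.
texts = [
    "RT @someone: great data mining post",
    "Reading about data mining",
    "Data mining meetup tonight",
    "Anyone tried text mining on tweets?",
]
rt_share = retweet_share(texts)
print(f"{rt_share:.0%} of posts are retweets")
```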
Text mining on social media information has some special considerations,
such as the informality of the language used.
Acronyms and chat lingo are often used on Twitter.
To account for this, I changed the filters for what constitutes a word.
Also, I changed the characters
that are allowed to form a word.
To detect emoticons or smiley faces,
colons and parentheses should be added.
I'm looking for specific phrases like 'data mining,' 'text mining,'
'predictive analytics.'
These phrases can be detected, as well.
Using a synonyms list, I can combine words with the same meaning.
For example, 'stat' is an abbreviation for 'statistic.'
They're the same word and are recognized as such using a synonyms list.
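
The three adjustments just described (letting emoticon characters form a token, keeping multi-word phrases together, and merging synonyms) can be sketched with a small regex tokenizer. This is not how STATISTICA Text Miner implements its filters; the emoticon pattern, the phrase list, and the synonym table are illustrative assumptions.

```python
import re

# Assumed emoticon shapes: eyes, optional nose, mouth, e.g. :) :-( ;) :p
EMOTICON = r"[:;]-?[()p]"
# Try emoticons first, then ordinary word tokens (underscore allowed for joined phrases).
TOKEN = re.compile(rf"{EMOTICON}|[a-z0-9_']+")

PHRASES = ["data mining", "text mining", "predictive analytics"]
SYNONYMS = {"stat": "statistic"}  # abbreviation -> canonical form

def tokenize(text):
    """Lowercase, protect known phrases, split into tokens, then map synonyms."""
    lowered = text.lower()
    for phrase in PHRASES:
        # Join each known phrase into one token so it survives splitting.
        lowered = lowered.replace(phrase, phrase.replace(" ", "_"))
    tokens = TOKEN.findall(lowered)
    return [SYNONYMS.get(tok, tok) for tok in tokens]

tokens = tokenize("Love Data Mining :) great stat")
print(tokens)
```

Because the text is lowercased first, `:P` is matched as `:p`. Synonyms are applied to whole tokens only, so a word that merely contains "stat" is left alone.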
In the results, interesting trends can become apparent in a scatter plot of
singular value decomposition components.
This cluster of words--
fire, far, drink, and hose--
indicates that several tweets use this set of words together
in posts that also contain the phrase 'data mining.'
With further exploration, I found that several people had tweeted or retweeted
about a blog post with 'drinking from the fire hose' in the title.
This interesting trend was found through text mining of the data.
In tracking sentiment of the STATISTICA brand and two of its
competitors,
I made a pie chart to show the relative sentiment breakdown for each.
Sentiment, in this case, was measured by the use of emoticons,
and we see the smile, frown, and tongue-sticking-out emoticons.
Tweeters mentioning the second competitor
used a lot of tongue-sticking-out emoticons, which can be a playful, silly
expression.
Additionally, sentiment analysis can be performed by comparing the positive and
negative words found in the post,
giving an overall tone for the post.
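
The word-comparison approach can be illustrated with a minimal scorer that subtracts negative-word hits from positive-word hits. The word lists here are tiny hypothetical examples, and the scoring rule is a simplification rather than the method built into STATISTICA.

```python
import re

# Hypothetical word lists; a real analysis would use a much larger lexicon.
POSITIVE = {"great", "love", "good", "helpful"}
NEGATIVE = {"bad", "slow", "hate", "broken"}

def tone(text):
    """Label a post by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(tone("I love this, great tool"))
print(tone("So slow and broken"))
print(tone("Data mining talk today"))
```

A scorer this simple ignores negation ("not good") and sarcasm, both common in tweets.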
This quality control chart is tracking mentions over time
for keywords in posts to detect pattern shifts.
This type of information could be used to determine the effectiveness of a
marketing campaign.
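
The transcript does not say which type of control chart was used. As one possible sketch, a c-chart for daily counts sets its limits at the baseline mean plus or minus three times the square root of that mean. The chart type, the helper names, and the sample counts are assumptions.

```python
import math

def c_chart_limits(baseline_counts):
    """Return (lower, center, upper) limits for a c-chart built from baseline counts."""
    center = sum(baseline_counts) / len(baseline_counts)
    spread = 3 * math.sqrt(center)
    # Counts cannot be negative, so clamp the lower limit at zero.
    return max(0.0, center - spread), center, center + spread

def out_of_control(counts, limits):
    """Indexes of counts that fall outside the control limits."""
    lower, _center, upper = limits
    return [i for i, c in enumerate(counts) if c < lower or c > upper]

# Hypothetical daily mention counts.
baseline = [9, 11, 10, 8, 12, 10]
limits = c_chart_limits(baseline)
new_counts = [10, 12, 25, 9]
flagged = out_of_control(new_counts, limits)
print(limits, flagged)
```

A flagged day is a prompt to read the tweets from that day, not proof that a campaign worked.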
Breaking the analysis down by geographical location can give insight
into regional trends.
Let's review more of the results in STATISTICA.
After a basic analysis with plotting and some descriptive statistics tables,
I started the text mining analysis for posts mentioning the phrase,
'data mining.'
Here we're looking at the
'Words summary' output.
These will be the most commonly mentioned words and phrases in tweets
about data mining.
Some of them are expected--
things like knowledge, download, computer,
web, business, application, machine, intelligence, and so on.
Then we see the word 'Shakespeare.'
This isn't a term I expect to see in a discussion about data mining,
and this is a pretty interesting trend where 'Shakespeare' is showing up
in the top twenty terms
mentioned in data mining posts.
Further exploration will tell us more.
After monitoring the frequencies of terms showing up in tweets about data
mining,
I combined this frequency-across-time information into one plot.
This plot actually shows several interesting trends and spikes
in the number of mentions for several
terms, including that 'Shakespeare' term.
Many of these key terms are somewhat unexpected.
The first spike in the number of mentions of a word
is the term 'expert.'
A week or so later a cluster of words spikes in the number of mentions:
drink, fire, hose, and far.
This spike I tracked back to several tweets and retweets about blog posts with
'drinking from the fire hose' in the title.
Another cluster of terms to spike is
'European,' 'Facebook,' and 'crack,'
which is attributed to an article circulating about a European crackdown on
Facebook privacy.
Later, there's a small peak talking about books on the topic of machine
learning and artificial intelligence.
Then we come to the Shakespeare spike.
This peak comes from buzz around a presentation on data mining
of Shakespeare's classic works
taking center stage in the digital age.
This plot is particularly interesting as it shows trends across time
and we can see, at a glance, what people are buzzing about.
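
A frequency-across-time table like the one behind this plot can be sketched by counting, for each term, the number of posts per day that contain it. The sample posts, the stopword list, and the helper names below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (day, text) pairs; not data from the study.
posts = [
    ("2012-03-01", "expert tips on data mining"),
    ("2012-03-08", "drinking from the fire hose of data mining"),
    ("2012-03-08", "rt fire hose post on data mining"),
    ("2012-03-09", "data mining and the fire hose again"),
]
STOPWORDS = {"on", "from", "the", "of", "and", "rt"}

def term_counts_by_day(rows, stopwords):
    """Map each term to a Counter of {day: number of posts containing the term}."""
    table = defaultdict(Counter)
    for day, text in rows:
        # Use a set so a term repeated within one post counts once.
        for term in set(text.lower().split()):
            if term not in stopwords:
                table[term][day] += 1
    return table

def peak_day(table, term):
    """Return (day, count) for the day on which the term appeared in the most posts."""
    return max(table[term].items(), key=lambda kv: kv[1])

table = term_counts_by_day(posts, STOPWORDS)
print(peak_day(table, "fire"))
print(peak_day(table, "expert"))
```

Plotting each term's counter against the day axis gives one line per term, so clusters of terms that peak together stand out.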
Here's a look at the reports generated from the STATISTICA Monitoring and
Alerting Server.
These reports are generated automatically on a schedule.
The reports are sent to the appropriate personnel for review.
Also, the retrieved tweets are stored for future use.
This last page of the report shows information about the Twitter accounts
involved in the data mining posts.
The graph gives an overview of the number of followers a Twitter user has,
as well as the average number of people
that user follows.
This gives a general idea as to the reach these messages have.
For Twitter accounts tweeting about data mining topics,
the average number of followers
is 2,451.
The histogram shows a skewed distribution.
The middle half of the posters have between 65 and 725 followers.
These are the 25th and 75th percentiles.
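
Quartiles and the mean of a skewed follower distribution can be computed with Python's standard `statistics` module. The follower counts below are made up for illustration and are unrelated to the figures quoted above.

```python
import statistics

# Hypothetical follower counts with one very large account, giving a right-skewed distribution.
followers = [10, 40, 120, 250, 600, 1100, 20000]

# quantiles(..., n=4) returns the three quartile cut points (default 'exclusive' method).
q1, median, q3 = statistics.quantiles(followers, n=4)
mean = statistics.mean(followers)

print(f"25th percentile: {q1}, median: {median}, 75th percentile: {q3}, mean: {mean}")
```

In a right-skewed distribution like this one, a few large accounts pull the mean well above the 75th percentile, which is why the quartiles describe a typical poster better than the average does.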
To continue watching the series you can sign up at statsoft.com/tmsubscribe.
And if you'd like more information about StatSoft and its products including
STATISTICA Text Miner, you can visit StatSoft.com or call 918-749-1119.
And outside of the United States, Canada, and Mexico,
the 'Contact Us' page of StatSoft.com has links to our international offices
that can also be of assistance.