Text Mining Twitter, a STATISTICA Case Study

So much valuable information is out there in the form of social media talk. In this case study we will explore how STATISTICA Text Miner was used to access tweets from Twitter related to specific topics, store and track trends, find interesting relationships, and gain insight. My name is Jennifer Thompson and I'm a statistician at StatSoft.
Let's explore ways to analyze social media content. I want to learn more about discussions involving keywords like StatSoft, STATISTICA, data mining, text mining, and predictive analytics, and the list goes on. What are people talking about when they mention these keywords? When tracked over time, do we see spikes or lulls in the number of mentions? What other words occur frequently in these discussions, and what's the overall sentiment? What other interesting things can we learn from what people are saying online?
Currently, a few tools are available for looking at Twitter posts for things like sentiment and the number of mentions for a keyword. So, why did I use STATISTICA Text Miner? The benefits are in the automation of reports and alerting, as well as a greater ability to ascertain the content of conversations automatically. Using STATISTICA's Monitoring and Alerting Server, I have long-term, continuous tracking of conversations, and the historical data are stored so I can access them again later. This allows me to compare this year's conversations to last year's. Tweets are searchable from Twitter for only a little over a week, so storing the information becomes necessary.
Also, with automation, the analysis runs on its own and I am alerted only when interesting trends are found, for instance if a new word or phrase is occurring in the conversation. This emailed report shows the new trends and I can react accordingly. I can look at the reports at my leisure, but I get a text message any time sentiment shifts toward the negative. This shift would be a big deal, so I need to be alerted as soon as possible.
I used a STATISTICA Visual Basic macro to get the tweets and supporting information into a spreadsheet. I extracted information like the timestamp, user ID, and the text of the post. Then I can begin the analysis and perform text mining on the tweets.
Basic descriptive statistics on the Twitter data can be informative. How many times was the keyword mentioned in tweets on a given day? This plot shows daily mentions for the STATISTICA brand as well as two major competitors. When spikes in the number of mentions occur, it's very interesting to see what people are saying. Does the spike coincide with a press release, marketing campaign, or some event such as a conference or the release of a new version of the software? Finding these spikes is the first step to understanding their cause and how we can instigate more positive conversations about the brand.
How much overlap occurs for users? When looking at trends, are a handful of tweeters creating the majority of the buzz, or are the mentions coming from unique users? Here we see that 71% of the tweets are from unique users. That tells me the reach is likely further than if most tweets were coming from just one person.
Here we're looking at the daily mentions of various keywords: data mining, analysis, and statistics. A change in trend could indicate an interesting place to drill down into the content. Retweets can be interesting as well. For the keyword 'data mining,' 14% of posts were retweets.
Text mining on social media information has some special considerations, such as the informality of the language used. Acronyms in chat lingo are often used on Twitter. To account for this, I changed the filters for what constitutes a word. I also changed the characters that are allowed to form a word. To detect emoticons or smiley faces, colons and parentheses should be added. I'm looking for specific phrases like 'data mining,' 'text mining,' and 'predictive analytics.' These phrases can be detected as well.
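As a rough illustration of the word-filter idea just described, here is a minimal Python sketch of a tokenizer that keeps emoticons and multi-word phrases as single tokens. It is not the STATISTICA Text Miner configuration. The phrase list, the emoticon set, the regular expression, and the names `build_token_pattern` and `tokenize` are all assumptions made for this example.

```python
import re

# Hypothetical lists for illustration only; a real analysis would tune these.
PHRASES = ["data mining", "text mining", "predictive analytics"]
EMOTICONS = [":)", ":(", ":P"]


def build_token_pattern(phrases, emoticons):
    """Build one regex that tries phrases first, then emoticons, then plain words."""
    # Allow any run of whitespace between the words of a phrase.
    phrase_alts = [
        r"\s+".join(re.escape(word) for word in phrase.split())
        for phrase in sorted(phrases, key=len, reverse=True)
    ]
    emoticon_alts = [re.escape(e) for e in emoticons]
    pattern = (
        r"(?P<phrase>\b(?:" + "|".join(phrase_alts) + r")\b)"
        r"|(?P<emoticon>" + "|".join(emoticon_alts) + r")"
        r"|(?P<word>[a-z0-9']+)"
    )
    return re.compile(pattern, re.IGNORECASE)


TOKEN_RE = build_token_pattern(PHRASES, EMOTICONS)


def tokenize(text):
    """Split a tweet into lowercase words, normalized phrases, and emoticons."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup == "emoticon":
            # Normalize case so ':p' and ':P' count as the same emoticon.
            tokens.append(match.group().upper())
        elif match.lastgroup == "phrase":
            # Lowercase and collapse internal whitespace to a single space.
            tokens.append(" ".join(match.group().lower().split()))
        else:
            tokens.append(match.group().lower())
    return tokens


if __name__ == "__main__":
    print(tokenize("Loving Data  Mining with STATISTICA :) #textmining"))
```

Putting the phrase alternatives first in the pattern is the design choice that matters here: at each position the regex tries a whole phrase before falling back to a single word, so 'data mining' is counted once as a phrase and not as two separate words.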
Using a synonyms list, I can combine words with the same meaning. For example, 'stat' is an abbreviation for 'statistic.' They're the same word and are recognized as such using a synonyms list.
In the results, interesting trends can become apparent in a scatter plot of singular value decomposition components. This cluster of words (fire, far, drink, and hose) indicates that several tweets use this set of words together in posts that also contain the phrase 'data mining.' With further exploration, I found that several people had tweeted or retweeted about a blog post with 'drinking from the fire hose' in the title. This interesting trend was found through text mining of the data.
In tracking sentiment of the STATISTICA brand and two of its competitors, I made a pie chart to show the relative sentiment breakdown for each. Sentiment, in this case, was measured by the use of emoticons, and we see the smile, frown, and tongue-sticking-out emoticons. Tweeters mentioning the second competitor used a lot of tongue-sticking-out emoticons, which can be a playful, silly expression. Additionally, sentiment analysis can be performed by comparing the positive and negative words found in the post, giving an overall tone for the post.
This quality control chart is tracking mentions over time for keywords in posts to detect pattern shifts. This type of information could be used to determine the effectiveness of a marketing campaign. Breaking the analysis down by geographical location can give insight into regional trends.
Let's review more of the results in STATISTICA. After a basic analysis with plotting and some descriptive statistics tables, I started the text mining analysis for posts mentioning the phrase 'data mining.' Here we're looking at the 'Words summary' output. These will be the most commonly mentioned words and phrases in tweets about data mining.
Some of them are expected: things like knowledge, download, computer, web, business, application, machine, intelligence, and so on. Then we see the word 'Shakespeare.' This isn't a term I expect to see in a discussion about data mining, and it is pretty interesting that 'Shakespeare' is showing up in the top twenty terms mentioned in data mining posts. Further exploration will tell us more.
After monitoring the frequencies of terms showing up in tweets about data mining, I combined this frequency-across-time information into one plot. This plot shows several interesting trends and spikes in the number of mentions for several terms, including that 'Shakespeare' term. Many of these key terms are somewhat unexpected. The first spike in the number of mentions is for the term 'expert.' A week or so later, a cluster of words spikes in the number of mentions: drink, fire, hose, and far. This spike I tracked back to several tweets and retweets about blog posts with 'drinking from the fire hose' in the title. Another cluster of terms to spike is 'European,' 'Facebook,' and 'crack,' which is attributed to an article circulating about a European crackdown on Facebook privacy. Later, there's a small peak from talk about books on the topic of machine learning and artificial intelligence. Then we come to the Shakespeare spike. This peak comes from buzz around a presentation on data mining of Shakespeare's classic works taking center stage in the digital age. This plot is particularly interesting as it shows trends across time and we can see, at a glance, what people are buzzing about.
Here's a look at the reports generated from the STATISTICA Monitoring and Alerting Server. These reports are generated automatically on a schedule. The reports are sent to the appropriate personnel for review. Also, the retrieved tweets are stored for future use. This last page of the report shows information about the Twitter accounts involved in the data mining posts.
The graph gives an overview of the number of followers a Twitter user has, as well as the average number of people that user follows. This gives a general idea as to the reach these messages have. For Twitter accounts tweeting about data mining topics, the average number of followers is 2,451. The histogram shows a skewed distribution. The majority of the posters have between 65 and 725 followers. These are the 25th and 75th percentiles.
To continue watching the series, you can sign up at statsoft.com/tmsubscribe. If you'd like more information about StatSoft and its products, including STATISTICA Text Miner, you can visit StatSoft.com or call 918-749-1119. Outside of the United States, Canada, and Mexico, the 'Contact Us' page of StatSoft.com has links to our international offices that can also be of assistance.