On Cyber Monday Eve, Jeff Bezos revealed during a segment of the television show 60 Minutes that Amazon may one day deliver many of its goods by unmanned aerial vehicles through a service called Amazon Prime Air. This notebook explores ~125k tweets from Twitter's firehose that were captured shortly after the announcement and shows you how to be equipped to capture interesting data within moments of an announcement for your own analysis.
Let's seek to better understand the "Twitter reaction" to Amazon's announcement that drones may one day be delivering packages right to our doorsteps.
Twitter is an ideal source of data that can help you to understand the reaction to newsworthy events, because it has more than 200M active monthly users who tend to use it to frequently share short informal thoughts about anything and everything. Although Twitter offers a Search API that can be used to query for "historical data", tapping into the firehose with the Streaming API is a preferred option because it provides you the ability to acquire much larger volumes of data with keyword filters in real-time.
There are numerous options for storing the data that you acquire from the firehose. A document-oriented database such as MongoDB makes a fine choice and can provide useful APIs for filtering and analysis. However, we'll opt to simply store the tweets that we fetch from the firehose in a newline-delimited text file, because we'll use the pandas library to analyze it as opposed to relying on MongoDB or a comparable option.
Note: If you'd prefer to sink the data to MongoDB instead, the mongoexport command-line tool can export it to a newline-delimited format that is exactly the same as what we'll be writing to a file. Either way, you're covered.
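For what it's worth, here is a minimal sketch of what that export might look like with pymongo; it assumes a hypothetical local MongoDB instance with a 'twitter' database containing an 'Amazon' collection of tweets, and it produces the same newline-delimited output that mongoexport would.
import json
from pymongo import MongoClient
# A minimal sketch, assuming a local MongoDB instance whose 'twitter' database
# has an 'Amazon' collection of tweets. The mongoexport command-line tool
# accomplishes the same thing.
client = MongoClient()  # connects to localhost:27017 by default
with open('Amazon.json', 'w') as f:
    for tweet in client['twitter']['Amazon'].find():
        tweet.pop('_id', None)  # drop MongoDB's ObjectId, which isn't JSON serializable
        f.write(json.dumps(tweet) + '\n')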
Only a few third-party packages are required to use the code in this notebook: twitter, pandas, and nltk.
You can easily install these packages in a terminal with pip install twitter pandas nltk, or you can install them from within IPython Notebook by using "Bash magic". Bash magic is just a way of running Bash commands from within a notebook, as shown below, where the first line of the cell is prefixed with %%bash.
%%bash
pip install twitter pandas nltk
It's a lot easier to tap into Twitter's firehose than you might imagine if you're using the right library. The code below shows you how to create a connection to Twitter's Streaming API and filter the firehose for tweets containing keywords. For simplicity, each tweet is saved in a newline-delimited file as a JSON document.
import io
import json
import twitter
# XXX: Go to http://twitter.com/apps/new to create an app and get values
# for these credentials that you'll need to provide in place of these
# empty string values that are defined as placeholders.
#
# See https://vimeo.com/79220146 for a short video that steps you
# through this process
#
# See https://dev.twitter.com/docs/auth/oauth for more information
# on Twitter's OAuth implementation.
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''
# The keyword query
QUERY = 'Amazon'
# The file to write output as newline-delimited JSON documents
OUT_FILE = QUERY + ".json"
# Authenticate to Twitter with OAuth
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
CONSUMER_KEY, CONSUMER_SECRET)
# Create a connection to the Streaming API
twitter_stream = twitter.TwitterStream(auth=auth)
print 'Filtering the public timeline for "{0}"'.format(QUERY)
# See https://dev.twitter.com/docs/streaming-apis on keyword parameters
stream = twitter_stream.statuses.filter(track=QUERY)
# Write one tweet per line as a JSON document.
with io.open(OUT_FILE, 'w', encoding='utf-8', buffering=1) as f:
for tweet in stream:
f.write(unicode(u'{0}\n'.format(json.dumps(tweet, ensure_ascii=False))))
print tweet['text']
Assuming that you've amassed a collection of tweets from the firehose in a line-delimited format, one of the easiest ways to load the data into pandas for analysis is to build a valid JSON array of the tweets.
Note: With pandas, you will need to have an amount of working memory proportional to the amount of data that you're analyzing. For reference, it takes on the order of ~8GB of memory to analyze ~125k tweets as shown in this notebook. (Bear in mind that each tweet is roughly 5KB of text when serialized out to a file.)
import pandas as pd
# A text file with one tweet per line
DATA_FILE = "tmp/Amazon.json"
# Build a JSON array
data = "[{0}]".format(",".join(open(DATA_FILE).readlines()))
# Create a pandas DataFrame (think: 2-dimensional table) to get a
# spreadsheet-like interface into the data
df = pd.read_json(data, orient='records')
print "Successfully imported", len(df), "tweets"
Successfully imported 125697 tweets
Whereas you may be used to thinking of data such as a list of dictionaries in a row-oriented paradigm, a pandas DataFrame exposes a convenient columnar view of the data that makes it easy to slice and dice by particular fields in each record. You can print the DataFrame to display the columnar structure and some stats about each column.
# Printing a DataFrame shows how pandas exposes a columnar view of the data
print df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 125697 entries, 0 to 125696
Data columns (total 27 columns):
_id                          125697 non-null values
contributors                 0 non-null values
coordinates                  1102 non-null values
created_at                   125681 non-null values
entities                     125681 non-null values
favorite_count               125681 non-null values
favorited                    125681 non-null values
filter_level                 125681 non-null values
geo                          1102 non-null values
id                           125681 non-null values
id_str                       125681 non-null values
in_reply_to_screen_name      10001 non-null values
in_reply_to_status_id        5927 non-null values
in_reply_to_status_id_str    5927 non-null values
in_reply_to_user_id          10001 non-null values
in_reply_to_user_id_str      10001 non-null values
lang                         125681 non-null values
limit                        16 non-null values
place                        1442 non-null values
possibly_sensitive           90143 non-null values
retweet_count                125681 non-null values
retweeted                    125681 non-null values
retweeted_status             40297 non-null values
source                       125681 non-null values
text                         125681 non-null values
truncated                    125681 non-null values
user                         125681 non-null values
dtypes: datetime64[ns](1), float64(13), object(13)
Some of the items in a data frame may be null values, and these null values can wreak all kinds of havoc during analysis. Once you understand why they exist, it's wise to filter them out if possible. The null values in this collection of tweets are caused by "limit notices", which Twitter sends to tell you that you're being rate-limited. Notice in the columnar output above that the "limit" field (which is not typically part of a tweet) appears 16 times. This indicates that we received 16 limit notices and means that there are effectively 16 "rows" in our data frame that have null values for all of the fields we'd have expected to see.
Per the Streaming API guidelines, Twitter will only provide up to 1% of the total volume of the firehose; anything beyond that is filtered out, with each "limit notice" telling you how many tweets were filtered out. This means that tweets containing "Amazon" accounted for at least 1% of the total tweet volume at the time this data was being collected. The next cell shows how to extract the sixteen limit notices, remove them from the DataFrame, and sum up their totals so that we can learn exactly how many tweets were filtered out in the aggregate.
# Observe the "limit" field that reflects "limit notices" where the streaming API
# couldn't return more than 1% of the firehose.
# See https://dev.twitter.com/docs/streaming-apis/messages#Limit_notices_limit
# Capture the limit notices by indexing into the data frame for non-null field
# containing "limit"
limit_notices = df[pd.notnull(df.limit)]
# Filter the limit notices out of the DataFrame by keeping only the rows that have a tweet id
df = df[pd.notnull(df['id'])]
print "Number of total tweets that were rate-limited", sum([ln['track'] for ln in limit_notices.limit])
print "Total number of limit notices", len(limit_notices)
Number of total tweets that were rate-limited 1062
Total number of limit notices 16
From this output, we can observe that only about 1k tweets out of ~125k were withheld, which means that more than 99% of the tweets about "Amazon" were received for the time period during which they were being captured. In order to learn more about the bounds of that time period, let's create a time-based index on the created_at field of each tweet so that we can perform a time-based analysis.
# Create a time-based index on the tweets for time series analysis
# on the created_at field of the existing DataFrame.
df.set_index('created_at', drop=False, inplace=True)
print "Created date/time index on tweets"
Created date/time index on tweets
With a time-based index now in place, we can trivially do some useful things like calculate the boundaries, compute histograms, etc. Since tweets stream through our filter in roughly the order in which they are created, no additional sorting should be necessary in order to compute the timeframe for this dataset; we can just slice the DataFrame like a list.
# Get a sense of the time range for the data
print "First tweet timestamp (UTC)", df['created_at'][0]
print "Last tweet timestamp (UTC) ", df['created_at'][-1]
First tweet timestamp (UTC) 2013-12-02 01:41:45
Last tweet timestamp (UTC)  2013-12-02 05:01:18
Operations such as grouping by a time unit are also easy to accomplish and seem like a logical next step. The following cell illustrates how to group the tweets by the hour of their timestamps; each timestamp is exposed as a datetime.datetime now that we have a time-based index in place.
# Let's group the tweets by hour and look at the overall volumes with a simple
# text-based histogram
# First group by the hour
grouped = df.groupby(lambda x: x.hour)
print "Number of relevant tweets by the hour (UTC)"
print
# You can iterate over the groups and print
# out the volume of tweets for each hour
# along with a simple text-based histogram
for hour, group in grouped:
print hour, len(group), '*'*(len(group) / 1000)
Number of relevant tweets by the hour (UTC)

1 14788 **************
2 43286 *******************************************
3 36582 ************************************
4 30008 ******************************
5 1017 *
Bearing in mind that we just previously learned that tweet acquisition began at 1:41 UTC and ended at 5:01 UTC, it could be helpful to further subdivide the time ranges into smaller intervals so as to increase the resolution of the extremes. Therefore, let's group into a custom interval by dividing the hour into 15-minute segments. The code is pretty much the same as before except that you call a custom function to perform the grouping; pandas takes care of the rest.
# Let's group the tweets by (hour, minute) and look at the overall volumes with a simple
# text-based histogram
import matplotlib.pyplot as plt  # needed for the plot at the end of this cell
def group_by_15_min_intervals(x):
if 0 <= x.minute <= 15: return (x.hour, "0-15")
elif 15 < x.minute <= 30: return (x.hour, "16-30")
elif 30 < x.minute <= 45: return (x.hour, "31-45")
else: return (x.hour, "46-00")
grouped = df.groupby(lambda x: group_by_15_min_intervals(x))
print "Number of relevant tweets by intervals (UTC)"
print
for interval, group in grouped:
print interval, len(group), "\t", '*'*(len(group) / 200)
# Since we didn't start or end precisely on an interval, let's
# slice off the extremes. This has the added benefit of also
# improving the resolution of the plot that shows the trend
plt.plot([len(group) for hour, group in grouped][1:-1])
plt.ylabel("Tweet Volume")
plt.xlabel("Time")
Number of relevant tweets by intervals (UTC)

(1, '31-45') 2875 	**************
(1, '46-00') 11913 	***********************************************************
(2, '0-15') 13611 	********************************************************************
(2, '16-30') 11265 	********************************************************
(2, '31-45') 10452 	****************************************************
(2, '46-00') 7958 	***************************************
(3, '0-15') 10386 	***************************************************
(3, '16-30') 9542 	***********************************************
(3, '31-45') 8727 	*******************************************
(3, '46-00') 7927 	***************************************
(4, '0-15') 9042 	*********************************************
(4, '16-30') 7543 	*************************************
(4, '31-45') 7074 	***********************************
(4, '46-00') 6349 	*******************************
(5, '0-15') 1017 	*****
<matplotlib.text.Text at 0x1e9a9d50>
In addition to time-based analysis, we can do other types of analysis as well. Generally speaking, one of the first things you'll want to do when exploring new data is count things, so let's compute which Twitter accounts authored the most tweets and compare that to the total number of unique accounts that appear in the data.
from collections import Counter
# The "user" field is a record (dictionary), and we can pop it off
# and then use the Series constructor to make it easy to use with pandas.
user_col = df.pop('user').apply(pd.Series)
# Get the screen name column
authors = user_col.screen_name
# And count things
authors_counter = Counter(authors.values)
# And tally the totals
print
print "Most frequent (top 25) authors of tweets"
print '\n'.join(["{0}\t{1}".format(a, f) for a, f in authors_counter.most_common(25)])
print
# Get only the unique authors
num_unique_authors = len(set(authors.values))
print "There are {0} unique authors out of {1} tweets".format(num_unique_authors, len(df))
Most frequent (top 25) authors of tweets

_net_shop_	165
PC_shop_japan	161
work_black	160
house_book_jp	160
bousui_jp	147
Popular_goods	147
pachisuro_777	147
sweets_shop	146
bestshop_goods	146
__electronics__	142
realtime_trend	141
gardening_jp	140
shikaku_book	139
supplement_	139
__travel__	138
disc_jockey_jp	138
water_summer_go	138
Jungle_jp	137
necessaries_jp	137
marry_for_love	137
trend_realtime	136
sparkler_jp	136
PandoraQQ	133
flypfox	133
Promo_Culturel	132

There are 71794 unique authors out of 125681 tweets
At first glance, it would appear that there are quite a few bots accounting for a non-trivial portion of the tweet volume, and many of them appear to be Japanese! As usual, we can plot these values to get better intuition about the underlying distribution, so let's take a quick look at a frequency plot and histogram. We'll use logarithmic adjustments in both cases, so pay close attention to the axis values.
# Plot by rank (sorted value) to gain intuition about the shape of the distribution
author_freqs = sorted(authors_counter.values())
plt.loglog(author_freqs)
plt.ylabel("Num Tweets by Author")
plt.xlabel("Author Rank")
# Start a new figure
plt.figure()
# Plot a histogram to "zoom in" and increase resolution.
plt.hist(author_freqs, log=True)
plt.ylabel("Num Authors")
plt.xlabel("Num Tweets")
<matplotlib.text.Text at 0x21c29fd0>
Although we could filter the DataFrame for coordinates (or locations in user profiles), an even simpler starting point to gain rudimentary insight about where users might be located is to inspect the language field of the tweets and compute the tallies for each language. With pandas, it's just a quick one-liner.
# What languages do authors of tweets speak? This might be a useful clue
# as to who is tweeting. (Also bear in mind the general timeframe for the
# data when interpreting these results.)
df.lang.value_counts()
en     79151
ja     35940
und     3197
es      2713
de      1142
fr       717
id       442
pt       434
ko       283
vi       248
nl       212
th       209
zh       135
sk       114
ru        84
da        73
it        65
sl        65
pl        64
ht        63
et        56
tr        53
tl        43
ar        38
lt        30
no        17
lv        16
fi        15
hu        13
sv        12
bg         8
ne         7
el         5
he         5
fa         4
uk         3
my         2
is         2
ta         1
dtype: int64
A staggering number of Japanese speakers were talking about "Amazon" at the time the data was collected. Bearing in mind that it was already midday on Monday in Japan when the news of the Amazon drones started to surface in the United States on Sunday evening, is this really all that surprising given Twitter's popularity in Japan?
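To put those raw counts in perspective, here's a quick sketch (not part of the original analysis) that normalizes them into fractions of the overall volume; it shows that roughly 63% of the captured tweets are in English and about 29% are in Japanese.
# Express the language counts as fractions of the overall tweet volume
print df.lang.value_counts() / float(len(df))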
Filtering on language also lets us remove some noise from the analysis, since we can restrict our inspection to tweets in a single language, which will be handy for some analysis on the content of the tweets themselves. Let's extract just the 140 characters of text from the tweets whose authors write in English and use some natural language processing techniques to learn more about the reaction.
# Let's just look at the content of the English tweets by extracting it
# out as a list of text
en_text = df[df['lang'] == 'en'].pop('text')
Although NLTK provides some advanced tokenization functions, let's just split the English text on whitespace, normalize it to lowercase, remove some common trailing punctuation, and count things to get an initial glance into what's being talked about.
from collections import Counter
tokens = []
for txt in en_text.values:
tokens.extend([t.lower().strip(":,.") for t in txt.split()])
# Use a Counter to construct frequency tuples
tokens_counter = Counter(tokens)
# Display some of the most commonly occurring tokens
tokens_counter.most_common(50)
[(u'amazon', 54778), (u'rt', 36409), (u'the', 25749), (u'drones', 24474), (u'to', 24040), (u'a', 21341), (u'delivery', 18557), (u'in', 17531), (u'of', 15741), (u'on', 14095), (u'drone', 13800), (u'by', 13422), (u'is', 12034), (u'for', 10988), (u'@amazon', 9318), (u'i', 9263), (u'and', 8793), (u'prime', 8783), (u'30', 8319), (u'air', 8026), (u'with', 7956), (u'future', 7911), (u'deliver', 7890), (u'get', 6790), (u'you', 6573), (u'your', 6543), (u'via', 6444), (u'deliveries', 6432), (u'this', 5899), (u'bezos', 5738), (u'will', 5703), (u'#primeair', 5680), (u'unmanned', 5442), (u'aerial', 5313), (u'under', 5308), (u'-', 5257), (u'mins', 5199), (u'that', 4890), (u'vehicles', 4835), (u'my', 4728), (u'from', 4720), (u'peek', 4699), (u'sneak', 4684), (u'unveils', 4555), (u'it', 4473), (u'minutes', 4459), (u'just', 4396), (u'at', 4394), (u'http://t.c\u2026', 4391), (u'packages', 4302)]
Not surprisingly, "amazon" is the most frequently occurring token, there are lots of retweets (actually, "quoted retweets") as evidenced by "rt", and lots of stopwords (commonly occurring words like "the", "and", etc.) at the top of the list. Let's further remove some of the noise by removing stopwords.
import nltk
# Download the stopwords list into NLTK
nltk.download('stopwords')
# Remove stopwords to decrease noise
for t in nltk.corpus.stopwords.words('english'):
tokens_counter.pop(t)
# Redisplay the data (and then some)
tokens_counter.most_common(200)
[nltk_data] Downloading package 'stopwords' to /usr/share/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[(u'amazon', 54778), (u'rt', 36409), (u'drones', 24474), (u'delivery', 18557), (u'drone', 13800), (u'@amazon', 9318), (u'prime', 8783), (u'30', 8319), (u'air', 8026), (u'future', 7911), (u'deliver', 7890), (u'get', 6790), (u'via', 6444), (u'deliveries', 6432), (u'bezos', 5738), (u'#primeair', 5680), (u'unmanned', 5442), (u'aerial', 5313), (u'-', 5257), (u'mins', 5199), (u'vehicles', 4835), (u'peek', 4699), (u'sneak', 4684), (u'unveils', 4555), (u'minutes', 4459), (u'http://t.c\u2026', 4391), (u'packages', 4302), (u'jeff', 4040), (u'http://t.co/w6kugw4egt', 3922), (u"amazon's", 3669), (u'flying', 3599), (u'ceo', 3205), (u'#amazon', 3074), (u'new', 2870), (u'free', 2797), (u'testing', 2585), (u'could', 2568), (u'shipping', 2541), (u'', 2422), (u'says', 2343), (u"'60", 2324), (u'like', 2300), (u'stuff', 2263), (u'years', 2194), (u'60', 2157), (u'use', 2134), (u'using', 1939), (u'&', 1901), (u"minutes'", 1868), (u'kindle', 1735), (u"it's", 1657), (u'plans', 1655), (u'cyber', 1622), (u'one', 1617), (u'gift', 1614), (u"i'm", 1604), (u'monday', 1568), (u'wants', 1538), (u'first', 1522), (u'order', 1519), (u'good', 1479), (u'going', 1459), (u'package', 1446), (u'fire', 1400), (u'look', 1386), (u'plan', 1378), (u'4', 1377), (u'delivering', 1376), (u'@60minutes', 1371), (u'make', 1369), (u'experimenting', 1341), (u'30-minute', 1336), (u'book', 1330), (u'primeair', 1310), (u'real', 1285), (u'online', 1274), (u'coming', 1261), (u'think', 1195), (u'see', 1152), (u'video', 1149), (u'next', 1149), (u'would', 1135), (u'system', 1131), (u'service', 1115), (u'thing', 1099), (u'something', 1069), (u'hour', 1052), (u'black', 1043), (u'card', 1040), (u'half', 1033), (u'want', 1018), (u'half-hour', 1016), (u'futuristic', 1016), (u"you're", 998), (u'know', 987), (u'love', 985), (u'people', 964), (u'aims', 964), (u'(video)', 958), (u'day', 954), (u'shot', 936), (u'deploy', 921), (u'delivered', 919), (u'amazon\u2019s', 906), (u'basically', 902), (u'within', 888), (u'shop', 886), (u'really', 882), (u'buy', 876), (u'check', 859), (u'\u2026', 855), (u'us', 844), (u'time', 829), (u'autonomous', 817), (u'wait', 815), (u'right', 801), (u'@mashable', 787), (u'finding', 786), (u'go', 780), (u'2015', 779), (u"can't", 774), (u'@buzzfeed', 774), (u'top', 774), (u'cool', 770), (u'rebel', 767), (u'@amazondrone', 766), (u'(also', 762), (u'helpful', 762), (u'#drones', 761), (u'rifle', 759), (u'reveals', 759), (u'door', 755), (u'bases', 752), (u'store', 751), (u'hoth.)', 748), (u'shit', 748), (u'@bradplumer', 743), (u'waiting', 743), (u'looks', 735), (u'@deathstarpr', 735), (u"don't", 732), (u'5', 731), (u'win', 731), (u'floats', 717), (u'friday', 717), (u'way', 717), (u'great', 713), (u'http://t.co/jlfdnihzks', 711), (u'company', 710), (u'need', 709), (u'read', 704), (u'home', 704), (u'watch', 697), (u'moment', 691), (u'http://t.co/bxsavvzxzf', 690), (u'best', 685), (u'notion', 680), (u'news', 669), (u'blog', 669), (u'announces', 667), (u'got', 658), (u'$25', 654), (u'products', 646), (u'big', 645), (u'still', 642), (u'2', 642), (u'gonna', 642), (u'tip', 636), (u'sales', 623), (u'awkward', 618), (u'"amazon', 617), (u'idea', 609), (u'take', 604), (u'working', 600), (u'books', 597), (u"won't", 597), (u'hovers', 593), (u'wow', 589), (u'live', 587), (u'promises', 579), (u'back', 576), (u'package-delivery', 573), (u'@badbanana', 570), (u'soon', 563), (u'deals', 560), (u'+', 558), (u'work', 555), (u'ever', 552), (u"'octocopter'", 551), (u'$50', 549), (u'hit', 549), (u'holy', 546), (u'night', 537), (u'hdx', 535), (u'today', 526), 
(u'bits', 521), (u'many', 520), (u'awesome', 519), (u'amazing', 508), (u'window', 506)]
What a difference removing a little bit of noise can make! We now see much more meaningful data appear at the top of the list: drones; signs of the phrase "30 mins" (which turned out to be a possible timeframe for a Prime Air delivery by a drone, according to Bezos), based on the appearance of "30" and "mins"/"minutes" near the top of the list; signs of another phrase, "prime air" (as evidenced by "prime", "air", and the hashtag "#primeair"); references to Jeff Bezos; URLs to investigate; and more!
Even though we've already learned a lot, one of the challenges with only employing crude tokenization techniques is that you aren't left with any phrases. One of the simplest ways of discovering meaningful phrases in text is to treat the problem as one of detecting statistical collocations. NLTK provides some routines for finding collocations and includes a "demo" function that's a quick one-liner.
nltk_text = nltk.Text(tokens)
nltk_text.collocations()
Building collocations list
prime air; sneak peek; unmanned aerial; aerial vehicles; http://t.co/w6kugw4egt http://t.c…; vehicles http://t.co/w6kugw4egt; #primeair future; future deliveries; delivery drones; jeff bezos; @amazon get; amazon prime; '60 minutes'; amazon unveils; cyber monday; deliver packages; flying delivery; unveils flying; kindle fire; (also helpful
Even without any prior analysis of the tokens, it's pretty clear what the topic is, as evidenced by this list of collocations. But what about the context in which these phrases appear? As it turns out, NLTK supplies another handy data structure, called a concordance, that provides some insight into how words appear in context. Trying out the "demo functionality" for the concordance is as simple as calling it as shown below.
Toward the bottom of the list of commonly occurring tokens, the words "amazing" and "holy" appear. The word "amazing" is interesting because it is usually the basis of an emotional reaction, and we're interested in examining the reaction. What about the word "holy"? What might it mean? The concordance will help us find out...
nltk_text.concordance("amazing")
print
nltk_text.concordance("holy")
Building index... Displaying 25 of 508 matches: s - @variety http://t.c… this looks amazing how will it impact drone traffic? - it? amazon prime air delivery looks amazing http://t.co/icizw5vfrw rt @jeffreyg gift card? @budgetearth & other amazing bloggers are giving one away! ends k? damn that amazon prime air looks amazing im sure it would freak out some peo egt http://t.c… @munilass amazon is amazing for what i use it for i'm okay with wwglyqrq just in bonnie sold again! amazing book http://t.co/jce902iros #best-s ase of 1000) http://t.co/d6l8p3jgbz amazing prospects! “@brianstelter on heels riety http://t.c… rt @dangillmor by amazing coincidence amazon had a youtube dr rd_ferguson amazon prime air sounds amazing *hot* kindle fire hdx wi-fi 7' tabl t.co/hrgxtrlumx this is going to be amazing if the faa allows it welcome to the lying grill #primeair is absolutely amazing i'm excited to see the progress/dev .co/w6kugw4egt http://t.c… the most amazing thing to me about amazon - when bez //t.co/cad9zload3 rt @dangillmor by amazing coincidence amazon had a youtube dr that 60 minutes piece on amazon was amazing what an incredible company and deli jesus christ this is real and it’s amazing erohmygerd http://t.co/m4salqm0lo r /t.co/0trwr9qsoc rt @semil the most amazing thing to me about amazon - when bez yeah no this @amazon drone stuff is amazing me but i have the same questions as hqfg… 30 minutes delivery by amazon amazing http://t.co/ofu39suiov i really don eat show at #60minutes amazon is an amazing company! rt @zachpagano next year d on's future drone shipping tech was amazing to see amazon unveils futuristic pl the first review on this product is amazing http://t.co/20yn3jguxu rt @amazon g ttp://t.co/s2shyp48go this would be amazing jeff bezos promises half-hour ship wugrxq2oju have you guys seen these amazing steals on amazon?? wow!! some of my ttp://t.co/mhqfg… rt @dangillmor by amazing coincidence amazon had a youtube dr bezo http://t.co/2jt6pgn8an this is amazing rt @rdizzle7 rt @marquisgod dog rt Displaying 25 of 546 matches: @brocanadian http://t.co/zxyct2renf holy shit rt @amazon get a sneak peek of our shipping with amazon prime air - holy cow this is the future - http://t.co eo) http://t.co/hi3gviwto7 #technews holy shit wtf http://t.co/p3h2wn5pba awes es'! (other 1 suggested was usa 1!!) holy shit jeff bezos promises half-hour s es http://t.co/k… rt @joshuatopolsky holy shit jeff bezos promises half-hour s //t.co/tjdtdpkaaf rt @joshuatopolsky holy shit jeff bezos promises half-hour s //t.co/0gpvgsyatm rt @joshuatopolsky holy shit jeff bezos promises half-hour s when amazon prime air is available? holy shit very funny! @amazon rt tim sied w4egt http://t.c… rt @joshuatopolsky holy shit jeff bezos promises half-hour s drones rt @alexpenn amazon prime air holy shit http://t.co/g2b7dumgbl amazon i ijk0 via @oliryan rt @joshuatopolsky holy shit jeff bezos promises half-hour s s http://t.co/w6kugw4egt http://t.c… holy shit amazon what? https://t.co/qrhkx w4egt http://t.c… rt @joshuatopolsky holy shit jeff bezos promises half-hour s //t.co/zggekdoyhv rt @joshuatopolsky holy shit jeff bezos promises half-hour s me for free? http://t.co/euutqyuoox holy crap @60minutes by using drones amaz //t.c… rt @alexpenn amazon prime air holy shit http://t.co/g2b7dumgbl amazon i one #primeair http://t.co/jgflwmcth5 holy fucking shit this is badass watch th d in a back yard? rt @joshuatopolsky holy shit jeff bezos promises half-hour s @brocanadian http://t.co/zxyct2renf holy shit of course! 
“@verge delivery dro how many lawyers… rt @joshuatopolsky holy shit jeff bezos promises half-hour s #business #market rt @joshuatopolsky holy shit jeff bezos promises half-hour s ke from amazon ;d rt @joshuatopolsky holy shit jeff bezos promises half-hour s fan of commercials rt @maryforbes14 holy crap @60minutes by using drones amaz //t.co/lefbeec5ht rt @joshuatopolsky holy shit jeff bezos promises half-hour s g each other down rt @joshuatopolsky holy shit jeff bezos promises half-hour s
It would appear that there is indeed a common thread of amazement in the data, and it's evident that @joshuatopolsky (who turns out to be Editor-in-Chief of The Verge) is a commonly occurring tweet entity that warrants further investigation. Speaking of tweet entities, let's take an initial look at usernames, hashtags, and URLs by employing a simple heuristic to look for words prefixed with @, RT, #, and http to see what some of the most commonly occurring tweet entities are in the data.
# A crude look at tweet entities
entities = []
for txt in en_text.values:
for t in txt.split():
if t.startswith("http") or t.startswith("@") or t.startswith("#") or t.startswith("RT @"):
if not t.startswith("http"):
t = t.lower()
entities.append(t.strip(" :,"))
entities_counter = Counter(entities)
for entity, freq in entities_counter.most_common()[:100]:
print entity, freq
@amazon 8994
#primeair. 4767
http://t.c… 4391
http://t.co/w6kugw4EGt 3922
#amazon 3032
@60minutes 1325
#primeair 911
@mashable 787
@buzzfeed 774
@amazondrone 743
@bradplumer 743
@deathstarpr 735
#drones 729
http://t.co/JlFdNiHzks 711
http://t.co/BxSAVVzXZf 690
@badbanana 570
#kindle 467
@thenextweb 458
#amexamazon 441
http://t.co/MHqFG… 434
#giveaway 421
http:/… 417
#win 409
http:… 406
@techcrunch 391
#drone 383
#60minutes 380
http://t… 357
#tech 342
@levie 340
@variety 337
@breakingnews 331
@youtube 326
#cybermonday 325
@huffposttech 322
http://… 320
@jonrussell 304
@realjohngreen 300
#news 298
http://t.co/FNndPuMouA 294
@washingtonpost 284
@kotaku 283
@usatoday 283
http://t.… 280
#amazondrones 278
@nycjim 277
http://t.co/NG8… 270
http://t.co/rUu14XpvGo 270
@brianstelter 268
@majornelson 260
@benbadler 258
http://t.co/M7kqd26jVR 255
http… 254
@businessinsider 249
@huffingtonpost 245
http://t.co/DOEjXCC1vL 241
@sai 241
http://t.co/… 240
@verge 237
http://t.co/tAVunIXxo3 230
http://t.co/OqAZXaZdh9 223
http://t.co/sMBvNrasSN 221
#amazonprimeair 221
@buzzfeednews 214
@mattsinger 211
@ericstangel 211
#1 205
@byjasonng 200
#free 198
http://t.co/GQCvN0xZ7n 193
@americanexpress 190
#csrracing 183
@nickkristof 178
@orgasmicgomez 170
http://t.co/REJl0wLImt 168
http://t.co/zUWaeRjFC8 167
#ebook 165
http://t.co/4zk01TUugX 165
@joshuatopolsky 161
@percival 161
@lanceulanoff 160
@time 158
http://t.co/xQCjG6VQyb 157
#romance 154
#technology 154
#rt 148
@engadget 145
@arstechnica 142
@sapnam 142
http://t.co/JxBSx8rLBZ 141
http://t.co/IyV1qBhtJg 141
@youranonnews 139
@gizmodo 138
@abc 135
@mckaycoppins 133
http://t.co/zGgEkdOyhv 133
http://t.co/9ZBYtKpHce 132
@newsbreaker 132
http://t.co/z5JQkD4svO 132
@ 131
As you can see, there are lots of interesting tweet entities that give you helpful context for the announcement. One particularly notable observation is the appearance of "comedic accounts" such as @deathstarpr and @amazondrone near the top of the list, injecting a certain amount of humor into the conversation. The tweet embedded below that references Star Wars was eventually retweeted over 1k times in response to the announcement! It wouldn't be difficult to determine how many of those retweets occurred just within the ~3 hour timeframe corresponding to the dataset we're using here (see the sketch after the embedded tweet).
First look at Amazon's new delivery drone. (Also helpful for finding Rebel bases on Hoth.) pic.twitter.com/JlFdNiHzks
— Death Star PR (@DeathStarPR) December 2, 2013
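As a rough illustration of that last point, here's a hedged sketch of how you might count the retweets of @DeathStarPR's tweet that landed in our capture window; it assumes the retweeted_status column is still present in the DataFrame.
# A rough sketch (not from the original analysis): count how many tweets in our
# ~3 hour capture window are retweets of @DeathStarPR by inspecting the
# retweeted_status field, which is a dictionary for retweets and null otherwise.
def is_deathstarpr_retweet(rt):
    return isinstance(rt, dict) and rt['user']['screen_name'].lower() == 'deathstarpr'

num_rts = sum(1 for rt in df.retweeted_status if is_deathstarpr_retweet(rt))
print "Retweets of @DeathStarPR captured in this dataset:", num_rts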
When you take a closer look at some of the developing news stories, you also see sarcasm, disbelief, and even a bit of frustration about this being a "publicity stunt" for Cyber Monday.
Note: There is a more proper way of parsing out tweet entities from the entities field that you can see in the DataFrame. It's marginally more work but has the primary advantage that you can see the "expanded URL", which provides better insight into the nature of the URL since you'll know its domain name. See Example 9-10, Extracting Tweet Entities, from Mining the Social Web for more on how to do that.
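For reference, here's a minimal sketch of that idea (an illustration only, not the book's Example 9-10 verbatim); it assumes df still contains the entities column from the original tweets.
from collections import Counter

# A minimal sketch of parsing the structured entities field rather than splitting
# raw text. Each entities value is a dictionary containing lists of
# user_mentions, hashtags, and urls.
screen_names, hashtags, urls = Counter(), Counter(), Counter()
for e in df.entities.dropna():
    screen_names.update(['@' + um['screen_name'].lower() for um in e.get('user_mentions', [])])
    hashtags.update(['#' + ht['text'].lower() for ht in e.get('hashtags', [])])
    urls.update([u.get('expanded_url') or u['url'] for u in e.get('urls', [])])

for label, counter in [('Screen names', screen_names), ('Hashtags', hashtags), ('URLs', urls)]:
    print label
    for item, freq in counter.most_common(10):
        print item, freq
    print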
We aspired to learn more about the general reaction to Amazon's announcement about Prime Air by taking an initial look at the data from Twitter's firehose, and it's fair to say that we learned a few things about the data without too much effort. Lots more could be discovered, but a few of the themes that we were able to glean included amazement at the audacity of the idea, plenty of humor (as evidenced by comedic accounts such as @deathstarpr), a degree of sarcasm and skepticism that the announcement was a publicity stunt for Cyber Monday, and a surprisingly large volume of Japanese-language chatter.
Although these reactions aren't particularly surprising for such an outrageous announcement, you've hopefully learned enough that you could tap into Twitter's firehose to capture and analyze data that's of interest to you. There is no shortage of fun to be had, and as you've learned, it's easier than it might first appear.
Enjoy!
If you enjoy analyzing data from social websites like Twitter, then you might enjoy the book Mining the Social Web, 2nd Edition (O'Reilly). You can learn more about it at MiningTheSocialWeb.com. All source code is available in IPython Notebook format at GitHub and can be previewed in the IPython Notebook Viewer.
The book itself is a form of "premium support" for the source code and is available for purchase from Amazon or O'Reilly Media.