I consume online media through different apps on different devices (plain browsers, Feedly, Twitter, etc.), and to archive interesting reads for future reference I have developed a habit of e-mailing links to myself.
Gmail addresses are great for this purpose since an e-mail sent to firstname.lastname+Links@gmail.com still ends up in the inbox of firstname.lastname@gmail.com. Combining this with Gmail filters allows you to set up a rule so that all e-mails received on your-gmail-address+Links@gmail.com are archived automatically and are assigned a label that you can use to collect all blog posts you want to archive (I use the label Links).
I have already collected a couple of hundred links to blog articles, webpages, PDFs, etc. that I feel compelled to do something interesting with (such as topic modelling and automatic summarization).
In this blog article I describe a pipeline that automatically accesses my Gmail account and extracts the readable text from all links I have sent myself over time. Of course I am opening a can of worms here since these e-mails come in lots of different formats and point to different materials: some are forwarded by Feedly, some link to Hacker News discussion threads, and others point directly to webpages or PDFs.
Here I will use a rule-based approach to distinguish between these different formats.
These are the Python imports I will need throughout the pipeline:
import httplib2
from apiclient.discovery import build
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage
from oauth2client.tools import run
import base64
import re
import requests
from lxml import etree
from StringIO import StringIO
import itertools as it
import urllib2
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from collections import defaultdict
from lxml.etree import fromstring
import sqlite3
from datetime import datetime
import zlib
Google offers an API to many of its services, including Gmail. The free-of-charge quota of the Gmail API is very generous, so you essentially do not need to worry about exceeding the free allowance.
To access Gmail through Python you need to follow the steps outlined here (all steps assume that you are logged into your Gmail account in the browser):
Get the Google API Python package
pip install google-api-python-client
and the following code (source) will connect to the Gmail API and expose an API resource object pointed to by gmail_service.
# Path to the client_secret.json file downloaded from the Developer Console
CLIENT_SECRET_FILE = 'client_secret.json'
# Check https://developers.google.com/gmail/api/auth/scopes for all available scopes
OAUTH_SCOPE = 'https://www.googleapis.com/auth/gmail.readonly'
# Location of the credentials storage file
STORAGE = Storage('gmail.storage')
# Start the OAuth flow to retrieve credentials
flow = flow_from_clientsecrets(CLIENT_SECRET_FILE, scope=OAUTH_SCOPE)
http = httplib2.Http()
# Try to retrieve credentials from storage or run the flow to generate them
credentials = STORAGE.get()
if credentials is None or credentials.invalid:
credentials = run(flow, STORAGE, http=http)
# Authorize the httplib2.Http object with our credentials
http = credentials.authorize(http)
# Build the Gmail service from discovery
gmail_service = build('gmail', 'v1', http=http)
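As a quick sanity check that the OAuth flow worked (this snippet is just an illustration and not part of the pipeline itself), we can ask the API for the profile of the authenticated user:
# Optional sanity check: fetch the profile of the authenticated user;
# if the credentials or scope are wrong this call will raise an error
profile = gmail_service.users().getProfile(userId='me').execute()
print(profile['emailAddress'])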
The Gmail API resource gmail_service is the only object we need to query to eventually download all e-mail bodies of interest. The Gmail API reference explains how to use the different resource types (labels, message lists, messages) that we need to work with.
The Gmail label of interest is known to me as Links but internally Gmail identifies all labels by a unique ID. The following code identifies the label ID that corresponds to the label name Links:
labels = gmail_service.users().labels().list(userId='me').execute()['labels'] # 'me' is the currently logged-in user
label_id = filter(lambda x: x['name'] == 'Links', labels)[0]['id']
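For reference, each entry in labels is a small dictionary; the values below are purely hypothetical, but the id field is the piece we are after:
# Hypothetical example of one entry in `labels` (values made up):
# {'id': 'Label_7', 'name': 'Links', 'type': 'user'}
# label_id would then hold the string 'Label_7'.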
We can now query gmail_service for all e-mail messages with the label Links (as identified by label_id). The API returns messages in pages of at most 100 messages, so we will need to track a pointer to the next page (nextPageToken) until all pages are consumed.
def get_message_ids():
""" Page through all messages in `label_id` """
next_page = None
while True:
if next_page is not None:
response = gmail_service.users().messages().list(userId='me', labelIds=[label_id], pageToken=next_page).execute()
else:
response = gmail_service.users().messages().list(userId='me', labelIds=[label_id]).execute()
messages = response.get('messages')
next_page = response.get('nextPageToken')
for el in messages:
yield el['id']
if next_page is None:
break
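To get a feel for how much material we are dealing with, a throwaway one-liner such as the following works (it consumes one full pass through the generator):
# Count all messages carrying the Links label (pages through everything once)
print(sum(1 for _ in get_message_ids()))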
To extract message bodies we need to distinguish between plain text and multipart MIME e-mails (to my understanding the latter allows embedded images and such). When an e-mail is just plain text the body is found directly in the payload; however, when the e-mail is multipart I will extract the body as the first of potentially many parts. See the message API reference for more detail.
def message_bodies():
for ctr, message_id in enumerate(get_message_ids()):
message = gmail_service.users().messages().get(userId='me', id=message_id, format='full').execute()
try:
body = message['payload']['parts'][0]['body']['data'] # MIME
except KeyError:
body = message['payload']['body']['data'] # text/plain
body = base64.b64decode(str(body), '-_')
yield body
I think of every e-mail with the Links label as a pointer to a resource on the Internet (be it a link to a webpage, PDF, etc.). Generally to extract URLs from plain text I will make use of the following regular expression formulated by John Gruber:
pattern = (r'(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|'
r'www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
r'(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))'
r'+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|'
r'[^\s`!()\[\]{};:\'\".,<>?\«\»\“\”\‘\’]))')
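As a quick illustration of how the pattern behaves (the sample text below is made up): re.findall returns one tuple per match and the first group holds the complete URL, which is why the code further down always picks match[0].
# Illustration only: the first group of each match tuple is the complete URL
sample = 'Worth a read: http://example.com/post?id=1 and also www.example.org/about'
print([m[0] for m in re.findall(pattern, sample)])
# ['http://example.com/post?id=1', 'www.example.org/about']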
When I send myself a link through the Feedly iPad app, Feedly sometimes includes the entire article in the e-mail body, but it does not do so consistently as far as I can judge.
Some Feedly-originating e-mail bodies will contain multiple URLs (e.g. links that are part of a blog article), but they mostly provide a link to the original story as the first URL in the e-mail.
Other times I may type an e-mail myself in which I copy and paste multiple links to interesting resources, in which case I want to extract all URLs from the body and not just the first one.
def is_feedly(body):
return 'feedly.com' in body
def urls():
for body in message_bodies():
matches = re.findall(pattern, body)
if is_feedly(body):
match = matches[0]
yield match[0] # Feedly e-mail: first URL is link to original story
else:
for match in matches:
yield match[0]
I noticed a number of URLs in urls() that did not match my expectations, such as links to Disqus comments and an online retailer. This is where a rule-based filter comes in:
exclude = ['packtpub.com', 'disqus', '@', 'list-manage', 'utm_', 'ref=', 'campaign-archive']
def urls_filtered():
for url in urls():
if not any([pattern in url.lower() for pattern in exclude]):
yield url
I quite often send myself a link to the Hacker News message thread on an article of interest. To extract the URL to the actual article we need to parse the HTML of the Hacker News discussion and locate the relevant URL in a td element with class="title", a rule that works for these particular Hacker News pages.
def is_hn(url):
return 'news.ycombinator.com' in url
parser = etree.HTMLParser()
def urls_hn_filtered():
for url in urls_filtered():
if is_hn(url) and (re.search(r'item\?id=', url) is None):
continue # do not keep HN links that do not point to an article
elif is_hn(url):
r = requests.get(url)
if r.status_code != 200:
continue # download of HN html failed, skip
root = etree.parse(StringIO(r.text), parser).getroot()
title = root.find(".//td[@class='title']")
try:
a = [child for child in title.getchildren() if child.tag == 'a'][0]
except AttributeError:
continue # title is None
story_url = a.get('href')
yield story_url
else:
yield url
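To make the lookup a bit more concrete, here is the same extraction run against a simplified, made-up fragment of what such a Hacker News page looks like (the real markup carries more attributes and nesting):
# Illustration on a hypothetical, simplified HN fragment
snippet = "<table><tr><td class='title'><a href='http://example.com/story'>Story title</a></td></tr></table>"
demo_root = etree.parse(StringIO(snippet), parser).getroot()
demo_title = demo_root.find(".//td[@class='title']")
print([c.get('href') for c in demo_title.getchildren() if c.tag == 'a'][0])
# http://example.com/story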
Poor memory gets the better of me quite often and I end up sending myself the same link multiple times. To avoid downloading the same resource twice and skewing follow-on work with duplicates I will track seen URLs with the following code:
def unique_urls():
seen = defaultdict(bool)
for url in urls_hn_filtered():
key = hash(url)
if seen[key]:
continue
else:
seen[key] = True
yield url
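An equivalent way to do this, if you prefer not to rely on hash values as dictionary keys, is to keep the URLs themselves in a set; this variant is just a sketch and not what I ran:
# Alternative dedup sketch: keep the full URLs in a set instead of their hashes
def unique_urls_set():
    seen = set()
    for url in urls_hn_filtered():
        if url not in seen:
            seen.add(url)
            yield url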
Now that we have a stream of unique URLs of interesting articles provided by unique_urls, we can download each of these resources and extract their readable text. To extract readable text I will here distinguish between HTML and PDF. PDF text extraction is handled with pdfminer and I am making use of this code shared in a StackOverflow answer. HTML text extraction is handled with lxml and this code snippet.
def pdf_from_url_to_txt(url):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
# Open the url provided as an argument to the function and read the content
f = urllib2.urlopen(urllib2.Request(url)).read()
# Cast to StringIO object
fp = StringIO(f)
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos = set()
for page in PDFPage.get_pages(fp,
pagenos,
maxpages=maxpages,
password=password,
caching=caching,
check_extractable=True):
interpreter.process_page(page)
fp.close()
device.close()
string = retstr.getvalue()
retstr.close()
return string
def resource_text():
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0'}
html_parser = etree.HTMLParser(recover=True, encoding='utf-8')
for url in unique_urls():
if url.endswith('.pdf'):
try:
text = pdf_from_url_to_txt(url)
except:
continue # something went wrong, just skip ahead
yield url, text
else:
try:
r = requests.get(url, headers=headers)
except:
continue # something went wrong with HTTP GET, just skip ahead
if r.status_code != 200:
continue
if not 'text/html' in r.headers.get('content-type', ''):
continue
# from: http://stackoverflow.com/a/23929292 and http://stackoverflow.com/a/15830619
try:
document = fromstring(r.text.encode('utf-8'), html_parser)
except:
continue # error parsing document, just skip ahead
yield url, '\n'.join(etree.XPath('//text()')(document))
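Before running the full pipeline it can be handy to peek at just the first couple of results; itertools (imported as it above) makes this easy. Note that this is only a smoke test and it does hit the network:
# Smoke test: download and extract only the first two resources
for url, text in it.islice(resource_text(), 2):
    print('%s -> %d characters' % (url, len(text)))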
This concludes our pipeline: the generator exposed by resource_text returns our current best representation of useful text for each resource.
All that remains now is to consume this generator and store the extracted text together with the source URL for future reference.
Since this pipeline consists of a number of failure-prone steps (communicating with APIs, extracting text, etc.) I also want to make certain to catch errors and deal with them appropriately.
In the generator resource_text I already included a number of places where we just skip ahead to the next resource in case of failure; however, failures may still happen and it is best to catch those errors right here where we consume the pipeline.
I will be lazy here and just skip all resources that generate an error during text extraction.
Hence I will consume the generator resource_text until it throws the StopIteration exception and skip all other exceptions.
The finally clause makes certain that the database connection is closed no matter how an iteration ends.
text_generator = resource_text()
try:
db = sqlite3.connect('gmail_extracted_text.db')
db.execute('CREATE TABLE gmail (date text, url text, compression text, extracted blob)')
db.commit()
db.close()
except sqlite3.OperationalError:
pass # table gmail already exists
while True:
try:
db = sqlite3.connect('gmail_extracted_text.db')
url, text = text_generator.next()
now = datetime.now().__str__()
if isinstance(text, unicode):
text = zlib.compress(text.encode('utf-8')) # to decompress: zlib.decompress(text).decode('utf-8')
else:
text = zlib.compress(text)
db.execute('INSERT INTO gmail VALUES (?, ?, ?, ?)', (unicode(now), unicode(url), u'zlib', sqlite3.Binary(text)))
db.commit()
except StopIteration:
break # generator consumed, stop calling .next() on it
except Exception, e:
print e
continue # some other exception was thrown by the pipeline, just skip ahead
finally:
db.close() # tidy up
In my case the above pipeline extracted text from 571 resources and the resultant database is 7.9 MB in size.
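To get the text back out of the database later (for the topic modelling and summarization mentioned at the start), the rows simply need to be decompressed again. The sketch below assumes the stored text was UTF-8, which holds for everything that went through the unicode branch above:
# Sketch: read the stored rows back and decompress the text
db = sqlite3.connect('gmail_extracted_text.db')
for date, url, compression, blob in db.execute('SELECT date, url, compression, extracted FROM gmail'):
    text = zlib.decompress(str(blob)).decode('utf-8')  # assumes UTF-8, see the unicode branch above
    # ... hand `text` over to whatever analysis comes next
db.close()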