In [1]:

%matplotlib inline
%autosave 10

Autosaving every 10 seconds

In [2]:

import gensim
import cPickle as pickle
from sklearn import decomposition  # only this submodule is used below
import numpy
from matplotlib import pyplot

In [3]:

# Pickle files should be opened in binary mode; the with-block closes the file
with open('data/plos_biology_articles_unfurled.list', 'rb') as f:
    articles = pickle.load(f)

In [4]:

# Same pattern for the list of DOIs, which is ordered like the articles list
with open('data/plos_biology_dois.list', 'rb') as f:
    dois = pickle.load(f)

In [5]:

articles[0][:10]

Out[5]:

['introduction',
 'during',
 '1980s',
 '1990s',
 'methods',
 'molecular',
 'genetics',
 'used',
 'determine',
 'contributions']

In [6]:

dois[0]

Out[6]:

'10.1371/journal.pbio.1000584'

Comparing these tokens with the main text of the article published under the above DOI confirms that the article stored in articles[0] corresponds to the DOI stored in dois[0].

Next we load the same corpus as in articles, this time already formatted as a numerical matrix in which each row represents one article as a bag of words. We generated this corpus and the corresponding dictionary earlier.
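To make the bag-of-words representation concrete, here is a minimal sketch on two made-up token lists. The names toy_articles, vocabulary, token_to_id and toy_matrix are hypothetical and used for illustration only. gensim assigns its own token ids and stores the counts sparsely, so this shows the idea rather than how the files loaded below were produced.

```python
from collections import Counter

# Toy tokenised documents (hypothetical; the real input is the articles list)
toy_articles = [
    ['molecular', 'genetics', 'methods', 'genetics'],
    ['neurons', 'methods', 'neurons', 'neurons'],
]

# Give every distinct token a column index (sorted so the order is reproducible)
vocabulary = sorted(set(token for doc in toy_articles for token in doc))
token_to_id = dict((token, i) for i, token in enumerate(vocabulary))

# One row per document, one column per token; each entry is a token count
toy_matrix = []
for doc in toy_articles:
    counts = Counter(doc)
    toy_matrix.append([counts.get(token, 0) for token in vocabulary])
```

Word order within a document is discarded; only the per-token counts remain, which is what allows each article to be treated as a row vector.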

In [7]:

corpus = gensim.corpora.MmCorpus('data/plos_biology_corpus.mm')
dictionary = gensim.corpora.Dictionary.load('data/plos_biology.dict')

In [8]:

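# corpus2csc returns a sparse terms x documents matrix
corpus_mat = gensim.matutils.corpus2csc(corpus)
# Transpose so that each row is an article and each column a dictionary term
corpus_mat = corpus_mat.T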
print corpus_mat.shape

(1754, 27210)

In [9]:

svd = decomposition.TruncatedSVD(n_components=2)

In [10]:

corpus_mat_transform = svd.fit_transform(corpus_mat)
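As a rough illustration of what the two-component reduction computes, the following sketch applies a plain NumPy SVD to a small dense matrix. The toy matrix X and the names X_reduced and X_projected are assumptions for illustration. TruncatedSVD works on the uncentred sparse matrix and, as far as I know, uses an approximate randomized solver by default, so its output can differ in sign and in small numerical details from an exact SVD.

```python
import numpy

# Toy document-term count matrix (hypothetical data): 4 documents, 3 terms
X = numpy.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Thin SVD: X = U * diag(s) * Vt, with singular values in descending order
U, s, Vt = numpy.linalg.svd(X, full_matrices=False)

k = 2
# Coordinates of each document in the k-dimensional reduced space
X_reduced = U[:, :k] * s[:k]

# Equivalent view: project the rows of X onto the top-k right singular vectors
X_projected = X.dot(Vt[:k].T)
```

Each row of X_reduced plays the role of one row of corpus_mat_transform: a two-dimensional summary of a document's term counts.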

In [11]:

pyplot.scatter(corpus_mat_transform[:,0], corpus_mat_transform[:,1])
# Component-wise median of all articles (axis=0 for both coordinates)
median = numpy.median(corpus_mat_transform, axis=0)
pyplot.scatter(median[0], median[1], color='red')

Out[11]:

<matplotlib.collections.PathCollection at 0x3691fb10>

As we can see, a few articles lie relatively far away from the bulk of the corpus. Let's focus on those whose first component exceeds 150:

In [12]:

corpus_mat_transform[corpus_mat_transform[:,0]>150]

Out[12]:

array([[ 188.00372631,  206.98116152],
       [ 173.92168158,  149.52950725],
       [ 153.28884507,  215.76087149],
       [ 162.25142408,  102.5657386 ],
       [ 150.9967843 ,  145.03623178]])

In [13]:

numpy.where(corpus_mat_transform[:,0]>150)

Out[13]:

(array([  35, 1074, 1109, 1371, 1544]),)

In [14]:

for index in numpy.where(corpus_mat_transform[:,0]>150)[0]:
    print 'http://www.plosbiology.org/article/info:doi/%s' % dois[index]

http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1001060
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1001657
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1001283
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000135
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000373

The first of these "outliers" in the reduced space above is a Synopsis article, which may explain why it sticks out. The remaining four, however, are research articles that all deal with neurobiological topics, so offhand it is not obvious to me why they would lie further away from the bulk of the articles.
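As an alternative to the fixed threshold on the first component, one could rank articles by their distance from the component-wise median of the reduced space. This is only a sketch on made-up coordinates; coords stands in for corpus_mat_transform, and center, distances and ranking are hypothetical names.

```python
import numpy

# Toy 2-D coordinates (hypothetical); stands in for corpus_mat_transform
coords = numpy.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [1.5, 1.5],
    [9.0, 8.0],  # deliberately placed far from the rest
    [2.0, 2.0],
])

# Component-wise median: axis=0 takes the median down each column
center = numpy.median(coords, axis=0)

# Euclidean distance of every point from that centre
distances = numpy.sqrt(((coords - center) ** 2).sum(axis=1))

# Row indices ordered from most to least distant
ranking = numpy.argsort(distances)[::-1]
```

With the real data, the indices in ranking[:5] could be used to look up entries of dois in the same way as above. Whether this selects the same five articles as the threshold on the first component is not something the sketch can tell us.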