%matplotlib inline
%autosave 10
Autosaving every 10 seconds
import gensim
import cPickle as pickle
from sklearn import *
import numpy
from matplotlib import pyplot
/usr/lib64/python2.7/site-packages/sklearn/pls.py:7: DeprecationWarning: This module has been moved to cross_decomposition and will be removed in 0.16 "removed in 0.16", DeprecationWarning)
# pickles are binary data, so open the files in binary mode
articles = pickle.load(open('data/plos_biology_articles_unfurled.list','rb'))
dois = pickle.load(open('data/plos_biology_dois.list','rb'))
articles[0][:10]
['introduction', 'during', '1980s', '1990s', 'methods', 'molecular', 'genetics', 'used', 'determine', 'contributions']
dois[0]
'10.1371/journal.pbio.1000584'
Checking the main text of the article with the above DOI confirms that the article stored in articles[0] corresponds to the DOI stored in dois[0].
Let us now load the same corpus as in articles, but already formatted as a numerical matrix that represents each article (a row of the matrix) as a bag of words. We generated this corpus and the corresponding dictionary earlier.
corpus = gensim.corpora.MmCorpus('data/plos_biology_corpus.mm')
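# load() is a classmethod, so no Dictionary instance is needed
dictionary = gensim.corpora.Dictionary.load('data/plos_biology.dict')
# corpus2csc returns a sparse terms x documents matrix; transpose to documents x terms
corpus_mat = gensim.matutils.corpus2csc(corpus).T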
print corpus_mat.shape
(1754, 27210)
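To make the documents x terms layout of the matrix above concrete, here is a minimal, dependency-free sketch of how a bag-of-words matrix can be built from tokenized documents. The helper names (build_vocabulary, bag_of_words_matrix) and the toy documents are hypothetical and only for illustration; gensim assigns term ids in its own way and stores the result sparsely, so this is not how the corpus in this notebook was generated.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def bag_of_words_matrix(tokenized_docs, vocab):
    """Return one row per document and one column per vocabulary entry."""
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(vocab)
        for token, count in Counter(doc).items():
            row[vocab[token]] = count  # how often the token occurs in this document
        matrix.append(row)
    return matrix


# Hypothetical toy documents, not taken from the PLOS Biology corpus.
toy_docs = [
    ['molecular', 'genetics', 'methods', 'genetics'],
    ['neurons', 'methods'],
]
toy_vocab = build_vocabulary(toy_docs)
toy_matrix = bag_of_words_matrix(toy_docs, toy_vocab)
```

Each row of toy_matrix plays the role of one article and each column the role of one dictionary entry, which is the same orientation as the (1754, 27210) matrix above.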
svd = decomposition.TruncatedSVD(n_components=2)
corpus_mat_transform = svd.fit_transform(corpus_mat)
pyplot.scatter(corpus_mat_transform[:,0], corpus_mat_transform[:,1])
# component-wise median over all articles (axis=0 for both coordinates)
median_point = numpy.median(corpus_mat_transform, axis=0)
pyplot.scatter(median_point[0], median_point[1], color='red')
<matplotlib.collections.PathCollection at 0x3691fb10>
As we can see, a few articles lie relatively far away from the bulk of the corpus. Let's focus on some of these:
corpus_mat_transform[corpus_mat_transform[:,0]>150]
array([[ 188.00372631,  206.98116152],
       [ 173.92168158,  149.52950725],
       [ 153.28884507,  215.76087149],
       [ 162.25142408,  102.5657386 ],
       [ 150.9967843 ,  145.03623178]])
numpy.where(corpus_mat_transform[:,0]>150)
(array([ 35, 1074, 1109, 1371, 1544]),)
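The cutoff of 150 on the first component is a fixed value specific to this plot. As a hedged alternative, the sketch below flags points by their distance from the component-wise median, using the median absolute deviation (MAD) of those distances as a scale. The function name far_from_median, the n_mads parameter and its default of 3 are assumptions made for illustration, and the toy points are invented; this is not the selection used in this notebook.

```python
import math


def median(values):
    """Median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def far_from_median(points, n_mads=3.0):
    """Return indices of 2D points that lie unusually far from the median point.

    A point is flagged when its Euclidean distance to the component-wise
    median exceeds median(distance) + n_mads * MAD(distance). If the MAD is
    zero, every point beyond the median distance is flagged.
    """
    center_x = median([p[0] for p in points])
    center_y = median([p[1] for p in points])
    distances = [math.hypot(p[0] - center_x, p[1] - center_y) for p in points]
    typical = median(distances)
    mad = median([abs(d - typical) for d in distances])
    threshold = typical + n_mads * mad
    return [i for i, d in enumerate(distances) if d > threshold]


# Invented toy points: five close together and one far away.
toy_points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
toy_outliers = far_from_median(toy_points)
```

In this notebook the rows of corpus_mat_transform could be passed as points; the returned indices could then be used to look up entries in dois in the same way as the indices above.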
for index in numpy.where(corpus_mat_transform[:,0]>150)[0]:
    print 'http://www.plosbiology.org/article/info:doi/%s' % dois[index]
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1001060
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1001657
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1001283
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000135
http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000373
The first of these "outliers" in the reduced space above is a Synopsis article, which may explain why it sticks out. The remaining articles, however, are research articles that all deal with neurobiological topics, so it is not obvious to me offhand why they lie somewhat further away from the bulk of the articles.
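One way to start investigating this is to look at which terms carry the largest weights on the first SVD component. The sketch below is a small, dependency-free helper for that; the name top_terms and the toy values are hypothetical. In this notebook it might be called as top_terms(svd.components_[0], dictionary), assuming the columns of corpus_mat follow the term ids of dictionary, which should hold if the corpus was built with that dictionary.

```python
def top_terms(component, id2word, n=10):
    """Return the n (term, weight) pairs with the largest absolute weight.

    component: sequence of per-term weights, indexed by term id.
    id2word:   mapping from term id to the term itself.
    """
    ranked = sorted(enumerate(component),
                    key=lambda pair: abs(pair[1]),
                    reverse=True)
    return [(id2word[term_id], weight) for term_id, weight in ranked[:n]]


# Invented toy values, not taken from the fitted model.
toy_component = [0.1, -0.7, 0.05, 0.6]
toy_id2word = {0: 'cell', 1: 'neuron', 2: 'gene', 3: 'synapse'}
toy_top = top_terms(toy_component, toy_id2word, n=2)
```

If the highest-ranked terms turned out to be neurobiological vocabulary, that would suggest the first component partly captures topic; if they were generic, frequent words, article length might be a more likely driver. Either reading is speculative until checked against the data.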