"In this article we look at document clustering implemented in Python with scikit-learn."
... at least by some measure.
I recently downloaded 1754 PLoS Biology articles as XML files through the PLoS API and have looked at the distribution of the time to publication of PLoS Biology and other PLoS journals.
Here I will play a little with scikit-learn to see if I can discover those PLoS Biology articles (in my data set) that are most similar to one another.
I started writing a Python package (PLoSPy) for more convenient parsing of the XML files I have downloaded from PLoS.
import plospy
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import itertools
all_names = [name for name in os.listdir('../plos/plos_biology/plos_biology_data') if '.dat' in name]
all_names[0:10]
print len(all_names)
To keep memory use down, I wrote the following generator function that yields the article bodies one at a time. By passing this generator to the vectorizer, we avoid loading all articles into memory at once - although even with this approach, I have not been able to repeat the experiment with all 65,000-odd PLoS ONE articles without running out of memory.
ids = []
titles = []

def get_corpus(all_names):
    # Yield article bodies one at a time, recording the id and title of each
    # article as we go along.
    for name in all_names:
        docs = plospy.PlosXml('../plos/plos_biology/plos_biology_data/' + name)
        for article in docs.docs:
            ids.append(article['id'])
            titles.append(article['title'])
            yield article['body']

corpus = get_corpus(all_names)
tfidf = TfidfVectorizer().fit_transform(corpus)
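As noted above, even with the generator this approach does not scale to the full PLoS ONE corpus, because TfidfVectorizer keeps the entire vocabulary and the full sparse matrix in memory. One option I have not tried here is scikit-learn's HashingVectorizer, which does away with the in-memory vocabulary, combined with a TfidfTransformer for the IDF weighting - a rough, unbenchmarked sketch:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# HashingVectorizer hashes tokens into a fixed number of features, so it does
# not need to hold a vocabulary in memory (alternate_sign=False requires a
# reasonably recent scikit-learn).
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False)

# Note: get_corpus appends to ids and titles, so those lists would have to be
# reset before making a second pass over the corpus.
counts = hasher.transform(get_corpus(all_names))
tfidf_hashed = TfidfTransformer().fit_transform(counts)

The hashed matrix loses the mapping from columns back to words, but for computing pairwise similarities, as we do below, that mapping is not needed anyway.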
Just as a sanity check, the number of DOIs in our data set should now equal 1754 as this is the number of articles I downloaded in the first place.
len(ids)
The vectorizer generated a matrix with 139,748 columns (the tokens, i.e. roughly the distinct words that occur across all 1754 PLoS Biology articles) and 1754 rows (one per article).
tfidf.shape
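To make the row and column layout concrete, here is a tiny toy example (three made-up sentences, not PLoS data): each row of the TF-IDF matrix corresponds to one document and each column to one token of the fitted vocabulary.

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ['the cell divides',
              'the neuron fires',
              'the cell membrane']

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_corpus)

print(toy_tfidf.shape)                     # (3, 6): 3 documents, 6 distinct tokens
print(toy_vectorizer.get_feature_names())  # ['cell', 'divides', 'fires', 'membrane', 'neuron', 'the']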
Let us now compute all pairwise cosine similarities between all 1754 vectors (articles) in the matrix tfidf.
I copied and pasted most of this from a StackOverflow answer that I cannot find right now - I will add a link to the answer when I come across it again.
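A short aside on why linear_kernel gives us cosine similarities here: TfidfVectorizer L2-normalises every row by default (norm='l2'), so the plain dot product of two rows already is their cosine similarity. We can check this on the toy matrix from the snippet above (my own sanity check, not part of the original recipe):

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# Because the TF-IDF rows have unit length, the linear kernel (dot products)
# and the explicit cosine similarity agree.
print(np.allclose(linear_kernel(toy_tfidf, toy_tfidf),
                  cosine_similarity(toy_tfidf, toy_tfidf)))    # True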
To get the ten most similar articles (five pairs), we keep track of the top five pairwise matches while iterating over all articles.
top_five = [[-1, -1, -1.] for i in range(5)]
threshold = -1.

for index in range(len(ids)):
    # Cosine similarities between article `index` and every article in the corpus.
    cosine_similarities = linear_kernel(tfidf[index:index+1], tfidf).flatten()
    related_docs_indices = cosine_similarities.argsort()[:-5:-1]
    first = related_docs_indices[0]
    second = related_docs_indices[1]
    # Sanity check: every article should be most similar to itself.
    if first != index:
        print 'Error'
        break
    if cosine_similarities[second] > threshold:
        # Skip the pair if we have already recorded it the other way around.
        if first not in [top[0] for top in top_five] and first not in [top[1] for top in top_five]:
            scores = [top[2] for top in top_five]
            replace = scores.index(min(scores))
            top_five[replace] = [first, second, cosine_similarities[second]]
            # The new threshold is the weakest score still kept in the top five.
            threshold = min(top[2] for top in top_five)
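For a corpus of this size (1754 articles), an alternative that I find easier to reason about is to compute the full 1754 x 1754 similarity matrix in one go and pick out the largest off-diagonal entries with NumPy. A sketch along those lines, reusing the tfidf matrix and titles list from above - note that, unlike the loop above, this version does not prevent the same article from showing up in more than one pair:

import numpy as np

# Full pairwise cosine similarity matrix (dense, about 1754 x 1754 floats).
sim = linear_kernel(tfidf, tfidf)

# Blank out the diagonal and the lower triangle so that every pair of
# distinct articles appears exactly once.
sim[np.tril_indices_from(sim)] = -1.

# Indices of the five largest remaining entries, most similar pair first.
for flat_index in np.argsort(sim, axis=None)[-5:][::-1]:
    i, j = np.unravel_index(flat_index, sim.shape)
    print('%.2f  %s  <->  %s' % (sim[i, j], titles[i], titles[j]))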
Let us now take a look at the results!
for tf in top_five:
    print('')
    print('Cosine Similarity: %.2f' % tf[2])
    print('Title 1: %s' % titles[tf[0]])
    print('http://www.plosbiology.org/article/info%3Adoi%2F' + str(ids[tf[0]]))
    print('')
    print('Title 2: %s' % titles[tf[1]])
    print('http://www.plosbiology.org/article/info%3Adoi%2F' + str(ids[tf[1]]))
    print('')