Similar Items portlet for Plone

by Matt Hamilton on Mar 19, 2010

We are in the process of building a new website for Netsight, and one of the items on the wish-list was a 'related items' portlet for the blog. With Plone you can relate items manually, but there isn't really a way to display related items automatically. Many years ago we wrote something that did this on a client's site in Plone 1, but after a while the client asked for it to be removed, as it was bringing up some slightly embarrassing and controversial supposedly related items.

I now wanted to resurrect the idea and create a Plone 3 portlet that did something along those lines. I'd seen topia.termextract a while back and had been thinking of a fun use to put it to. I also saw collective.classification, which sounded like it did something similar, but it needs various external dependencies to be built (including C libraries, I think) and a large training corpus to be downloaded.

My final year thesis at university in 2000 was a full text indexer written in C, so I had a fairly good grasp of relevance ranking algorithms and the like. Indeed, back in 2002 at the first non-US Zope 3 sprint, I introduced Jim Fulton to 'Managing Gigabytes', a seminal book on indexing and information retrieval... this then led to the creation of ZCTextIndex, the text index used in Zope today.

So I knew from the ZCTextIndex code that much of the information needed for determining similar content should already be calculated in the text index data structures -- which means it should be pretty fast.

The basic idea is this:

  1. Find the most 'important words' in the document you are looking at
  2. Search for other 'similar' documents based on those words


So how do we find the most 'important' words? Well, in text indexing there is a common metric called TF*IDF: Term Frequency multiplied by Inverse Document Frequency. Basically, an important word is one that appears in this document at a higher frequency than it does in other documents.

As an example, the word 'the' appears in a specific document at roughly the same rate as it does across all documents in our entire site. It also appears in virtually every document. So it is not that special. By contrast, the word 'conference' might appear in our specific document a number of times, yet it doesn't appear that often in general and doesn't appear in that many documents overall -- hence it is pretty 'important'.
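One common way to write the metric down, with made-up numbers for the two example words. Here tf(t, d) is how often term t occurs in document d, df(t) is the number of documents containing t, and N is the total number of documents. The figures (1000 documents, 'the' in 990 of them and 30 times in ours, 'conference' in 10 of them and 5 times in ours) are purely illustrative, and the exact weighting a given text index uses may well differ from this textbook form:

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \ln\frac{N}{\mathrm{df}(t)}
\]
\[
\mathrm{tfidf}(\text{the}, d) = 30 \cdot \ln\frac{1000}{990} \approx 0.30,
\qquad
\mathrm{tfidf}(\text{conference}, d) = 5 \cdot \ln\frac{1000}{10} \approx 23.03
\]
```

Even though 'the' occurs six times as often in the document, its score is tiny because almost every document contains it.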

The SearchableText ZCTextIndex already has a way to find all words in a document efficiently (i.e. without having to parse the document again), as this information is stored in the index. It also stores various frequency and weight information -- i.e. everything we need for our calculation.

This means we can iterate over every word in our document, score each one as to how 'important' it is, and then return the top 20 words.
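As a rough illustration of that scoring loop, here is a minimal, self-contained sketch. It is not the portlet's code: the real thing reads the frequencies straight out of the ZCTextIndex data structures, whereas this toy version recounts words from plain lists. The function name `important_words` and the tiny corpus are made up for the example.

```python
import heapq
import math
from collections import Counter


def important_words(doc_words, corpus, n=20):
    """Return the n highest-scoring words of doc_words by TF*IDF.

    doc_words: list of words in the document being viewed.
    corpus: list of word lists, one per document (a toy stand-in for
    the statistics a real text index already keeps).
    """
    total_docs = len(corpus)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter()
    for words in corpus:
        doc_freq.update(set(words))

    term_freq = Counter(doc_words)
    scores = {
        word: count * math.log(total_docs / doc_freq[word])
        for word, count in term_freq.items()
        if doc_freq[word]  # skip words the corpus has never seen
    }
    # Pick the n best without sorting the whole vocabulary.
    return heapq.nlargest(n, scores, key=scores.get)


corpus = [
    "the conference venue is in bristol and the conference is great".split(),
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the weather is fine".split(),
]
top_words = important_words(corpus[0], corpus, n=3)
print(top_words)
```

In this toy corpus 'conference' comes out on top, while 'the' scores zero because it appears in every document.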

Once we have those words, we are onto the second part of the process: we search the catalog for all documents that match these terms. To do this efficiently, we again have to delve into the internals of ZCTextIndex and call some private methods. I know this is bad form, but it is really needed for efficiency. If we used the public API to do the search, the catalog would treat the 20 terms in our query with an implicit 'and', which means that if any one of the terms doesn't appear in a candidate document then that document will be excluded. This is not what we want. We want the document to be included, just ranked lower if a specific term doesn't appear.
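To make the 'or'-style ranking concrete, here is a small sketch using a toy inverted index. It does not touch ZCTextIndex at all: the `postings` dictionary, the weights and the document ids are hypothetical stand-ins, and the way scores are combined (a simple weighted sum) is an assumption for illustration only.

```python
from collections import defaultdict


def rank_by_any_term(term_weights, postings):
    """Rank documents that contain at least one of the query terms.

    term_weights: {term: weight}, e.g. the score of each important word.
    postings: {term: {doc_id: per-document weight}}, a toy inverted index.
    Both structures are hypothetical, not ZCTextIndex internals.
    """
    scores = defaultdict(float)
    for term, weight in term_weights.items():
        # A missing term adds nothing; it never excludes a document.
        for doc_id, doc_weight in postings.get(term, {}).items():
            scores[doc_id] += weight * doc_weight
    return sorted(scores, key=scores.get, reverse=True)


postings = {
    "bristol": {"bid": 3.0, "travel": 1.0},
    "conference": {"bid": 2.0, "sprint": 1.0},
    "venue": {"bid": 1.0},
}
query = {"bristol": 2.0, "conference": 1.5, "venue": 1.0}
ranking = rank_by_any_term(query, postings)
print(ranking)  # ['bid', 'travel', 'sprint']
```

With an implicit 'and', only 'bid' would be returned, since it is the only document containing all three terms. Here 'travel' and 'sprint' still show up, just further down the list.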

The end result is we get a list of documents related to the one we are looking at:

[Screenshot: Similar Items portlet]

I later did a bit of an experiment to see how topia.termextract would fare in comparison to my TF*IDF approach to determining which words are the most important in a document. The results were actually surprisingly close:

topia.termextract:
['2010', 'venue', 'people', 'idea', 'year', '/', 
'university', 'uk', 'bristol', 'number', 
'community', 'conference', 'city', 'lot', 'room', 
'work', 'conference', 'talk', 'plone']

tf.idf:
['bristol', 'conference', 'venue', 'ploneconf2010', 
'suits', 'rooms', 'vote', 'silicon', 'city', 
'media', 'talks', 'university', 'bid', 'delegates', 
'lots', 'bbc', 'west']

So topia.termextract is returning a list of all nouns it knows of in the document, and tf.idf is returning all words in this document that occur more frequently than average. You can probably guess the content of the document they were both looking at ;)

So for now, I'll stick with my tf.idf approach, but in the future it might be interesting to see how topia.termextract can be integrated, especially if you include the phrases that topia.termextract can find:

['plone projects', 'south west', '600,000 people', 
'plone conference', 'plastercine characters', 
'computer science', 'advocacy work', 
'silicon valley', 'case studies', 
'plone conferences', 'plone community', 
'cycle city', 'industry analysts', 
'technology evaluators', ...]

The product is available to download from plone.org, or via buildout and PyPI. If you want to install it, just add collective.portlet.similarcontent to the eggs and zcml lines in your buildout config file.
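In a typical buildout that might look something like the fragment below. The part name `[instance]` and the existing `Plone` egg entry are assumptions about your setup; keep whatever lines you already have and just add the package name to both lists.

```ini
[instance]
eggs =
    Plone
    collective.portlet.similarcontent
zcml =
    collective.portlet.similarcontent
```

After editing the file, re-run buildout and restart the instance so the new egg and its ZCML are picked up.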

Yiorgis Gozadinos says:
Mar 22, 2010 12:31 PM

Hey! It looks good and it's fast! Well done. A few words on topia.termextract and collective.classification... In order for both of them to do anything useful they need to have a way to assign each word its POS (part-of-speech) tag. topia.termextract is using an external product (not open source) to create a dictionary. The program is not part of the package, hence it will only work for the document parsed in the tests. If it happens that some other document contains the same words it will give good enough results... In contrast, collective.classification uses NLTK to perform POS tagging, which indeed is a big dependency. The added value is that in principle you can do it for non-English languages, as well as train it with specialized corpora (say for instance medical or law text). The dependencies (and size) of collective.classification will be reduced in the near future (whenever I get to work on it a bit that is;))
