Unicode stuff and lxml

by Matt Hamilton on Feb 25, 2011

This is just a little tip for others who, like me, often battle with Unicode issues in Python and need a little help.

First, go read Dan Fairs' great blog post on Unicode. It is a useful primer regardless of what programming language you use.

In this instance, I was having problems with this very blog you are reading. On the front page of our site, we grab the latest blog post and automatically extract the first non-empty paragraph to use as a description. To do this I use lxml to find all the paragraphs in the document. All was going well until recently, when our resident 'break anything better than anyone else' person, Astra, managed to break our blog with a posting on our support of the Childline 1600 club.

The issue turned out to be the pound sign in the first paragraph, which caused problems when the text was used on the front page of the site. It took me quite a bit of head scratching and experimenting to get it right. In the end, the key was to make sure the text I passed to lxml had already been decoded to unicode with the correct codec. If you don't do this, lxml converts it itself and, in this case, gets the encoding wrong.

So the code is:

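from lxml import html

def Description(self):
    """ When calling Description instead return the first paragraph in the blog post """
    text = self.getText()
    if not text:
        return ''
    # decode the UTF-8 byte string to unicode before lxml sees it
    text = text.decode('utf-8')
    h = html.fromstring(text)

    # find first non-empty paragraph and return the text of it
    for para in h.iterfind('.//p'):
        text = para.text_content()
        # skip paragraphs that contain only whitespace
        if text.strip():
            # encode back to a UTF-8 byte string for the caller
            return text.encode('utf-8')
    return ''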

The key here is to 'decode' the text using the utf-8 codec before you pass it to lxml (html.fromstring), and then 'encode' it back to utf-8 afterwards. Without the decode step at the beginning, lxml was being fed the byte string '\xc2\xa3200', which is the UTF-8 encoding of a pound sign followed by the digits 200. It apparently treated each byte as a separate character, resulting in the output of Â£200 instead of £200.
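If you want to see the mix-up for yourself without lxml, here is a minimal sketch using only the standard library. It is written for Python 3, where byte strings and text are separate types, whereas the method above is Python 2. Decoding as Latin-1 is my assumption for mimicking the wrong guess, and to_text is a hypothetical helper, not part of lxml or the blog code.

```python
# The UTF-8 encoding of the pound sign is two bytes: 0xC2 0xA3.
raw = b'\xc2\xa3200'

# Decoding with the right codec gives one pound sign plus three digits.
correct = raw.decode('utf-8')

# Decoding as Latin-1 (assumed here to mimic the wrong guess) maps each
# byte to its own character, so the pound sign becomes two characters.
wrong = raw.decode('latin-1')


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode bytes with a known codec, pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(len(correct), len(wrong))
print(to_text(raw) == correct)
```

The correct string is four characters long and the wrong one is five, which is exactly the stray extra character that showed up on our front page. Running input through something like to_text before parsing means the parser never has to guess the encoding.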
