pyquery: a jQuery-like library for Python

pyquery is a fantastic little library for dealing with XML and HTML documents. It brings the power and ease of jQuery into Python, letting you deal with CSS selectors and functions instead of a clunky DOM. I try to avoid dealing with XML as much as possible, but slinging around pyquery almost makes XML fun.

Building lxml

The hardest part of working with pyquery is getting it installed. pyquery gets all of its XML power from lxml, which has a reputation for being difficult to build. Ian Bicking mentioned that lxml 2.2 is much easier to install because it can compile the troublesome C libraries (libxml2 and libxslt) statically, and that option has spared me any problems. All you need to do is define STATIC_DEPS=true in the build environment:

STATIC_DEPS=true pip install pyquery

This has worked for me on OS X with pip, easy_install, buildout, and probably anything else based on distutils.
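
If you want to confirm the build actually took, a quick import check does the job. This is only a sketch using the standard library; the missing_modules helper is a name I made up for illustration, not part of pyquery or lxml.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["lxml", "pyquery"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("lxml and pyquery are both importable")
```

Note that find_spec only locates a module without loading it, so this doesn't prove the compiled extension works; actually importing lxml.etree is the stricter check.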

Web Scraping

Web scraping is ridiculously easy with pyquery. Grabbing a Shakespearean insult from the web is as simple as:

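import pyquery

# Fetch and parse the page; the url keyword tells PyQuery to download it.
p = pyquery.PyQuery(url='http://www.pangloss.com/seidel/Shaker/')
# The insult is the text inside the page's font tag.
insult = p('font').text()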

Finding the insult on that page is aided by the author’s semantic font tag.
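
The same selector call works on a string of markup, which is handy for trying things out without hitting the network. Here's a minimal sketch, assuming pyquery is installed. SAMPLE_HTML and font_text are names I invented, and the markup is a stand-in, not the real page's source.

```python
try:
    from pyquery import PyQuery
except ImportError:  # pyquery (or lxml underneath it) is not installed
    PyQuery = None

# Hypothetical stand-in markup; the real page's source may look different.
SAMPLE_HTML = """
<html><body>
  <h1>Insults</h1>
  <font size="+2">Thou art a placeholder insult.</font>
</body></html>
"""


def font_text(html):
    """Return the combined text of the <font> elements in an HTML string."""
    if PyQuery is None:
        raise RuntimeError("pyquery is required for this example")
    # parser="html" asks for lxml's forgiving HTML parser instead of strict XML.
    doc = PyQuery(html, parser="html")
    return doc("font").text()
```

Calling font_text(SAMPLE_HTML) goes through the same p('font').text() step as the live example, just against a local string.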

Testing

I like to make sure that my views are working correctly, which is another task where I’m finding pyquery indispensable. I’ve seen regexen used for the same job, but examining a real DOM is much more resilient than trying to pick out pieces by matching strings. Testing views is especially useful with template systems like Django’s and Jinja’s, which silently hide errors instead of raising exceptions (an undefined variable just renders as an empty string).

d = pyquery.PyQuery(response.data)
assert d('#stats').text() == '5 tests: +2 -3'
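
To make the regex comparison concrete, here is a small sketch. The markup, the pattern, and the helper names are all invented for illustration; the point is only that adding an attribute breaks the string match but not the selector.

```python
import re

try:
    from pyquery import PyQuery
except ImportError:  # pyquery (or lxml underneath it) is not installed
    PyQuery = None

# Hypothetical view output before and after a template tweak adds a class.
BEFORE = '<div><p id="stats">5 tests: +2 -3</p></div>'
AFTER = '<div><p class="summary" id="stats">5 tests: +2 -3</p></div>'

STATS_PATTERN = re.compile(r'<p id="stats">(.*?)</p>')


def stats_by_regex(html):
    """Return the stats text via string matching, or None if nothing matched."""
    match = STATS_PATTERN.search(html)
    return match.group(1) if match else None


def stats_by_selector(html):
    """Return the stats text by selecting #stats from the parsed document."""
    if PyQuery is None:
        raise RuntimeError("pyquery is required for this example")
    return PyQuery(html, parser="html")("#stats").text()
```

stats_by_regex(AFTER) comes back as None because the opening tag no longer matches the pattern character for character, while stats_by_selector reads the same text from both versions.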

I’ve noticed that testing the HTML in this manner has improved my semantic markup. Pulling out and testing pieces of the page forces me to add meaningful ids and classes to the elements.
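
Wrapped up as a test case, a check like this might look roughly as follows. It's a sketch under assumptions: RENDERED stands in for a real response body, and StatsViewTest is a hypothetical name, not code from my actual suite.

```python
import unittest

try:
    from pyquery import PyQuery
except ImportError:  # pyquery (or lxml underneath it) is not installed
    PyQuery = None

# Hypothetical rendered view; a real test would use the response body instead.
RENDERED = '<html><body><p id="stats">5 tests: +2 -3</p></body></html>'


@unittest.skipIf(PyQuery is None, "pyquery is not installed")
class StatsViewTest(unittest.TestCase):
    def setUp(self):
        # Parse once per test so each assertion queries a fresh document.
        self.d = PyQuery(RENDERED, parser="html")

    def test_stats_summary(self):
        # The meaningful id is what makes this element easy to pick out.
        self.assertEqual(self.d("#stats").text(), "5 tests: +2 -3")

    def test_missing_element_is_falsy(self):
        # A selector that matches nothing yields an empty, falsy result.
        self.assertFalse(self.d("#no-such-id"))


if __name__ == "__main__":
    unittest.main()
```

The skipIf guard just reports the tests as skipped on a machine where lxml hasn't been built yet, instead of failing at import time.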

Bonus

For extra HTML goodness, the tests submit response pages to the W3C validator using a multipart form encoder (the post_multipart helper in the snippet below). Then, of course, I use pyquery to make sure all is well.

validator = post_multipart('validator.w3.org', '/check',
                           {'fragment': response.data})
assert pyquery.PyQuery(validator)('#congrats')