Parsing an XML document with a default namespace in Scrapy

While writing a new spider for Feeds I stumbled upon the following problem. I wanted to parse an XML feed with a default namespace but couldn't get it to work.

XML namespaces

First of all, what are XML namespaces anyway? They are used to avoid element name conflicts and are usually used along with a prefix.

<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>

(All examples are taken from XML Namespaces by w3schools.com.)

As you can see, all tags are prefixed with h: and the root of the document has a special xmlns attribute.
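To make the role of the prefix more concrete, here is a minimal sketch using Python's standard library xml.etree.ElementTree. This is only a stand-in for illustration (Scrapy's selectors are not built on ElementTree), and the variable names are made up for this example. It shows that a parser resolves the h: prefix to the namespace URI, so the element name effectively becomes the URI plus the local name.

```python
import xml.etree.ElementTree as ET

# The prefixed example document from above.
PREFIXED_XML = """<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>"""

root = ET.fromstring(PREFIXED_XML)

# ElementTree stores names as "{namespace-uri}localname", so the
# "h:" prefix itself is gone after parsing; only the URI remains.
print(root.tag)  # {http://www.w3.org/TR/html4/}table

# Look up the cells by their fully qualified name.
cells = list(root.iter("{http://www.w3.org/TR/html4/}td"))
cell_texts = [cell.text for cell in cells]
print(cell_texts)  # ['Apples', 'Bananas']
```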

Iterating over nodes with a namespace

Parsing that example using an XMLFeedSpider from Scrapy is easy. We can derive from that class and just register the h namespace. Now we can iterate over the h:td tag.

Note that without the namespaces attribute on the class, it's not possible to select or extract data from elements that belong to a namespace.

from scrapy.spiders import XMLFeedSpider

class ExampleSpider(XMLFeedSpider):
    name = 'example'
    namespaces = [
        ('h', 'http://www.w3.org/TR/html4/'),
    ]
    itertag = 'h:td'
    iterator = 'xml'

    def parse_node(self, response, node):
        self.logger.debug(node.xpath('text()').extract_first())

This results in the following output:

2016-11-05 15:39:44 [example] DEBUG: Apples
2016-11-05 15:39:44 [example] DEBUG: Bananas

The URI is the part of the namespace definition that actually matters. The prefix we use to register the URI doesn't need to be the same as the one used in the XML document. For example, this would lead to the same output:

namespaces = [
    ('x', 'http://www.w3.org/TR/html4/'),
]
itertag = 'x:td'
iterator = 'xml'

Don't forget to set the iterator to xml when changing namespace prefixes. Otherwise the default iterator, which is based on regexes, is used, and it doesn't work for XML documents like these.
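The fact that only the URI matters can be checked outside of Scrapy as well. The following is a rough sketch using the standard library xml.etree.ElementTree as a stand-in for Scrapy's selectors; the helper cell_texts is hypothetical and exists only for this illustration. It queries the same document once with the prefix h and once with x, both bound to the same URI.

```python
import xml.etree.ElementTree as ET

HTML4_URI = "http://www.w3.org/TR/html4/"

PREFIXED_XML = """<h:table xmlns:h="http://www.w3.org/TR/html4/">
  <h:tr>
    <h:td>Apples</h:td>
    <h:td>Bananas</h:td>
  </h:tr>
</h:table>"""


def cell_texts(xml_text, prefix):
    """Return the text of all td elements, using `prefix` for the query.

    Hypothetical helper: the prefix is bound to HTML4_URI only for this
    query and is independent of the prefix used inside the document.
    """
    root = ET.fromstring(xml_text)
    namespaces = {prefix: HTML4_URI}
    return [td.text for td in root.findall(f".//{prefix}:td", namespaces)]


with_h = cell_texts(PREFIXED_XML, "h")
with_x = cell_texts(PREFIXED_XML, "x")
print(with_h)  # ['Apples', 'Bananas']
print(with_x)  # ['Apples', 'Bananas']
```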

XML default namespaces

To avoid repeating the prefix over and over again, we can use a default namespace, which is then implicitly applied to the element and all of its unprefixed child elements:

<table xmlns="http://www.w3.org/TR/html4/">
  <tr>
    <td>Apples</td>
    <td>Bananas</td>
  </tr>
</table>

This looks like an innocent namespace-less document but in fact it isn't. The xmlns attribute is still there and since no prefix is given it acts as a default namespace. The document is essentially equivalent to the first one, except that the elements carry no explicit prefix. This means, however, that XPath queries like //td don't match anything: an unprefixed name in an XPath query refers to elements in no namespace, while the td elements here belong to the default namespace.
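A small sketch can demonstrate this behaviour. It uses the standard library xml.etree.ElementTree rather than Scrapy's selectors, so it is an approximation of what happens inside a spider, and the prefix x is an arbitrary choice. The unqualified query finds nothing, while the same query with a prefix bound to the document's URI finds both cells.

```python
import xml.etree.ElementTree as ET

HTML4_URI = "http://www.w3.org/TR/html4/"

# The example document with a default namespace (no prefixes).
DEFAULT_NS_XML = """<table xmlns="http://www.w3.org/TR/html4/">
  <tr>
    <td>Apples</td>
    <td>Bananas</td>
  </tr>
</table>"""

root = ET.fromstring(DEFAULT_NS_XML)

# Even without a prefix, the elements live in the namespace.
print(root.tag)  # {http://www.w3.org/TR/html4/}table

# A query for plain "td" looks for elements in no namespace: no match.
unqualified = root.findall(".//td")
print(len(unqualified))  # 0

# Binding our own prefix to the same URI makes the query match.
qualified = root.findall(".//x:td", {"x": HTML4_URI})
qualified_texts = [td.text for td in qualified]
print(qualified_texts)  # ['Apples', 'Bananas']
```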

Iterating over nodes with a default namespace

How do we iterate over such documents now? The answer is actually quite simple. We have to register a namespace again, but this time we come up with our own prefix. The only important thing is that we use the same URI as specified in the XML document. We also have to set itertag to x:td (x being the prefix we registered for the URI) and use the prefix in all XPath queries that name an element. That's all.

from scrapy.spiders import XMLFeedSpider

class ExampleSpider(XMLFeedSpider):
    name = 'example'
    namespaces = [
        ('x', 'http://www.w3.org/TR/html4/'),
    ]
    itertag = 'x:td'
    iterator = 'xml'

    def parse_node(self, response, node):
        self.logger.debug(node.xpath('text()').extract_first())

The output is now the same as before:

2016-11-05 16:15:09 [example] DEBUG: Apples
2016-11-05 16:15:09 [example] DEBUG: Bananas
