Rules from web-poet

scrapy-poet fully supports the functionalities of web_poet.ApplyRule. It uses the registry from web_poet called web_poet.RulesRegistry which provides functionalties for:

  • Returning the page object override if it exists for a given URL.

  • Returning the page object capable of producing an item for a given URL.

A list of web_poet.ApplyRule can be configured by passing it to the SCRAPY_POET_RULES setting.

In this section, we go over its instead_of parameter for overrides and to_return for item returns. However, please make sure you also read web-poet’s Rules documentation to see all of the expected behaviors of the rules.

Overrides

This functionality opens the door to configure specific Page Objects depending on the request URL domain. Please have a look to Scrapy Tutorial to learn the basics about overrides before digging deeper in the content of this page.

Tip

Some real-world examples on this topic can be found in:

Page Objects refinement

Any web_poet.pages.Injectable or page input can be overridden. But the overriding mechanism stops for the children of any already overridden type. This opens the door to refining existing Page Objects without getting trapped in a cyclic dependency. For example, you might have an existing Page Object for book extraction:

class BookPage(ItemPage):
    def to_item(self):
        ...

Imagine this Page Object obtains its data from an external API. Therefore, it is not holding the page HTML code. But you want to extract an additional attribute (e.g. ISBN) that was not extracted by the original Page Object. Using inheritance is not enough in this case, though. No problem, you can just override it using the following Page Object:

class ISBNBookPage(WebPage):

    def __init__(self, response: HttpResponse, book_page: BookPage):
        super().__init__(response)
        self.book_page = book_page

    def to_item(self):
        item = self.book_page.to_item()
        item['isbn'] = self.css(".isbn-class::text").get()
        return item

And then override it for a particular domain using settings.py:

SCRAPY_POET_RULES = [
    ApplyRule("example.com", use=ISBNBookPage, instead_of=BookPage)
]

This new Page Object gets the original BookPage as dependency and enrich the obtained item with the ISBN from the page HTML.

Note

By design overrides rules are not applied to ISBNBookPage dependencies as it is an overridden type. If they were, it would end up in a cyclic dependency error because ISBNBookPage would depend on itself!

Note

This is an alternative more compact way of writing the above Page Object using attr.define:

@attr.define
class ISBNBookPage(WebPage):
    book_page: BookPage

    def to_item(self):
        item = self.book_page.to_item()
        item['isbn'] = self.css(".isbn-class::text").get()
        return item

Overrides rules

The following example configures an override that is only applied for book pages from books.toscrape.com:

from web_poet import ApplyRule


SCRAPY_POET_RULES = [
    ApplyRule(
        for_patterns=Patterns(
            include=["books.toscrape.com/cataloge/*index.html|"],
            exclude=["/catalogue/category/"]),
        use=MyBookPage,
        instead_of=BookPage
    )
]

Note how category pages are excluded by using a exclude pattern. You can find more information about the patterns syntax in the url-matcher documentation.

Decorate Page Objects with the rules

Having the rules along with the Page Objects is a good idea, as you can identify with a single sight what the Page Object is doing along with where it is applied. This can be done by decorating the Page Objects with web_poet.handle_urls() provided by web-poet.

Tip

Make sure to read the Rules documentation of web-poet to learn all of its other functionalities that is not covered in this section.

Let’s see an example:

from web_poet import handle_urls


@handle_urls("toscrape.com", instead_of=BookPage)
class BTSBookPage(BookPage):

    def to_item(self):
        return {
            'url': self.url,
            'name': self.css("title::text").get(),
        }

The web_poet.handle_urls() decorator in this case is indicating that the class BSTBookPage should be used instead of BookPage for the domain toscrape.com.

Using the rules in scrapy-poet

scrapy-poet automatically uses the rules defined by page objects annotated with the web_poet.handle_urls() decorator by having the default value of the SCRAPY_POET_RULES setting set to web_poet.default_registry.get_rules(), which returns a List[ApplyRule]. Moreover, you also need to set the SCRAPY_POET_DISCOVER setting so that these rules could be properly imported.

Note

For more info and advanced features of web-poet’s web_poet.handle_urls() and its registry, kindly read the web-poet documentation, specifically its Rules documentation.

Item Returns

scrapy-poet also supports a convenient way of asking for items directly. This is made possible by the to_return parameter of web_poet.ApplyRule. The to_return parameter specifies which item a page object is capable of returning for a given URL.

Let’s check out an example:

import attrs
import scrapy
from web_poet import WebPage, handle_urls, field
from scrapy_poet import DummyResponse


@attrs.define
class Product:
    name: str


@handle_urls("example.com")
@attrs.define
class ProductPage(WebPage[Product]):

    @field
    def name(self) -> str:
        return self.css("h1.name ::text").get("")


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products/some-product", self.parse
        )

    # We can directly ask for the item here instead of the page object.
    def parse(self, response: DummyResponse, item: Product):
        return item

From this example, we can see that:

  • Spider callbacks can directly ask for items as dependencies.

  • The Product item instance directly comes from ProductPage.

  • This is made possible by the ApplyRule("example.com", use=ProductPage, to_return=Product) instance created from the @handle_urls decorator on ProductPage.

Note

The slightly longer alternative way to do this is by declaring the page object itself as the dependency and then calling its .to_item() method. For example:

@handle_urls("example.com")
@attrs.define
class ProductPage(WebPage[Product]):
    product_image_page: ProductImagePage

    @field
    def name(self) -> str:
        return self.css("h1.name ::text").get("")

    @field
    async def image(self) -> Image:
        return await self.product_image_page.to_item()


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products/some-product", self.parse
        )

    async def parse(self, response: DummyResponse, product_page: ProductPage):
        return await product_page.to_item()

For more information about all the expected behavior for the to_return parameter in web_poet.ApplyRule, check out web-poet Rules documentation.