.. _rules-from-web-poet: =================== Rules from web-poet =================== scrapy-poet fully supports the functionalities of :class:`web_poet.ApplyRule `. It uses the registry from web_poet called :class:`web_poet.RulesRegistry ` which provides functionalties for: * Returning the page object override if it exists for a given URL. * Returning the page object capable of producing an item for a given URL. A list of :class:`web_poet.ApplyRule ` can be configured by passing it to the ``SCRAPY_POET_RULES`` setting. In this section, we go over its ``instead_of`` parameter for overrides and ``to_return`` for item returns. However, please make sure you also read web-poet's :ref:`rules` documentation to see all of the expected behaviors of the rules. Overrides ========= This functionality opens the door to configure specific Page Objects depending on the request URL domain. Please have a look to :ref:`intro-tutorial` to learn the basics about overrides before digging deeper in the content of this page. .. tip:: Some real-world examples on this topic can be found in: - `Example 1 `_: shorter example - `Example 2 `_: longer example - `Example 3 `_: rules using :py:func:`web_poet.handle_urls` decorator and retrieving them via :py:meth:`web_poet.RulesRegistry.get_rules ` Page Objects refinement ----------------------- Any :class:`web_poet.pages.Injectable` or page input can be overridden. But the overriding mechanism stops for the children of any already overridden type. This opens the door to refining existing Page Objects without getting trapped in a cyclic dependency. For example, you might have an existing Page Object for book extraction: .. code-block:: python class BookPage(ItemPage): def to_item(self): ... Imagine this Page Object obtains its data from an external API. Therefore, it is not holding the page HTML code. But you want to extract an additional attribute (e.g. ``ISBN``) that was not extracted by the original Page Object. Using inheritance is not enough in this case, though. No problem, you can just override it using the following Page Object: .. code-block:: python class ISBNBookPage(WebPage): def __init__(self, response: HttpResponse, book_page: BookPage): super().__init__(response) self.book_page = book_page def to_item(self): item = self.book_page.to_item() item["isbn"] = self.css(".isbn-class::text").get() return item And then override it for a particular domain using ``settings.py``: .. code-block:: python SCRAPY_POET_RULES = [ ApplyRule("toscrape.com", use=ISBNBookPage, instead_of=BookPage), ] This new Page Object gets the original ``BookPage`` as dependency and enrich the obtained item with the ISBN from the page HTML. .. note:: By design overrides rules are not applied to ``ISBNBookPage`` dependencies as it is an overridden type. If they were, it would end up in a cyclic dependency error because ``ISBNBookPage`` would depend on itself! .. note:: This is an alternative more compact way of writing the above Page Object using ``attr.define``: .. code-block:: python @attr.define class ISBNBookPage(WebPage): book_page: BookPage def to_item(self): item = self.book_page.to_item() item["isbn"] = self.css(".isbn-class::text").get() return item Overrides rules --------------- The following example configures an override that is only applied for book pages from ``books.toscrape.com``: .. code-block:: python from web_poet import ApplyRule SCRAPY_POET_RULES = [ ApplyRule( for_patterns=Patterns( include=["books.toscrape.com/cataloge/*index.html|"], exclude=["/catalogue/category/"], ), use=MyBookPage, instead_of=BookPage, ) ] Note how category pages are excluded by using a ``exclude`` pattern. You can find more information about the patterns syntax in the `url-matcher `_ documentation. Decorate Page Objects with the rules ------------------------------------ Having the rules along with the Page Objects is a good idea, as you can identify with a single sight what the Page Object is doing along with where it is applied. This can be done by decorating the Page Objects with :py:func:`web_poet.handle_urls` provided by `web-poet`_. .. tip:: Make sure to read the :external:ref:`rules` documentation of `web-poet`_ to learn all of its other functionalities that is not covered in this section. Let's see an example: .. code-block:: python from web_poet import handle_urls @handle_urls("toscrape.com", instead_of=BookPage) class BTSBookPage(BookPage): def to_item(self): return { "url": self.url, "name": self.css("title::text").get(), } The :py:func:`web_poet.handle_urls` decorator in this case is indicating that the class ``BSTBookPage`` should be used instead of ``BookPage`` for the domain ``toscrape.com``. Using the rules in scrapy-poet ------------------------------ scrapy-poet automatically uses the rules defined by page objects annotated with the :func:`web_poet.handle_urls` decorator by having the default value of the ``SCRAPY_POET_RULES`` setting set to :meth:`web_poet.default_registry.get_rules() `, which returns a ``List[ApplyRule]``. Moreover, you also need to set the ``SCRAPY_POET_DISCOVER`` setting so that these rules could be properly imported. .. note:: For more info and advanced features of `web-poet`_'s :py:func:`web_poet.handle_urls` and its registry, kindly read the `web-poet `_ documentation, specifically its :external:ref:`rules` documentation. Item Returns ============ scrapy-poet also supports a convenient way of asking for items directly. This is made possible by the ``to_return`` parameter of :class:`web_poet.ApplyRule `. The ``to_return`` parameter specifies which item a page object is capable of returning for a given URL. Let's check out an example: .. code-block:: python import attrs import scrapy from web_poet import WebPage, handle_urls, field from scrapy_poet import DummyResponse @attrs.define class Product: name: str @handle_urls("toscrape.com") @attrs.define class ProductPage(WebPage[Product]): @field def name(self) -> str: return self.css("h1.name ::text").get("") class MySpider(scrapy.Spider): name = "myspider" async def start(self): yield scrapy.Request("https://toscrape.com/products/some-product", self.parse) # We can directly ask for the item here instead of the page object. def parse(self, response: DummyResponse, item: Product): return item From this example, we can see that: * Spider callbacks can directly ask for items as dependencies. * The ``Product`` item instance directly comes from ``ProductPage``. * This is made possible by the ``ApplyRule("toscrape.com", use=ProductPage, to_return=Product)`` instance created from the ``@handle_urls`` decorator on ``ProductPage``. .. note:: The slightly longer alternative way to do this is by declaring the page object itself as the dependency and then calling its ``.to_item()`` method. For example: .. code-block:: python @handle_urls("toscrape.com") @attrs.define class ProductPage(WebPage[Product]): product_image_page: ProductImagePage @field def name(self) -> str: return self.css("h1.name ::text").get("") @field async def image(self) -> Image: return await self.product_image_page.to_item() class MySpider(scrapy.Spider): name = "myspider" async def start(self): yield scrapy.Request("https://toscrape.com/products/some-product", self.parse) async def parse(self, response: DummyResponse, product_page: ProductPage): return await product_page.to_item() For more information about all the expected behavior for the ``to_return`` parameter in :class:`web_poet.ApplyRule `, check out web-poet :ref:`rules` documentation.