.. _pitfalls:

========
Pitfalls
========

``scrapy.Request`` without callback
===================================

.. tip::

    The pitfalls discussed in this section mostly don't apply to
    Scrapy >= 2.8.

    However, if you have code somewhere which directly adds
    :class:`scrapy.Request` instances to the downloader, you need to ensure
    that they don't use ``None`` as the callback value. Instead, you can use
    the :func:`scrapy.http.request.NO_CALLBACK` value introduced in
    Scrapy 2.8.

.. note::

    This section *only applies* to specific cases where spiders define a
    ``parse()`` method.

    The TL;DR recommendation is to simply avoid defining a ``parse()`` method
    and instead choose another name.

Scrapy supports declaring :class:`scrapy.Request` instances without setting
any callback (i.e. ``None``). For these instances, Scrapy uses the ``parse()``
method as the callback.

Let's take a look at the following code:

.. code-block:: python

    import scrapy


    class MySpider(scrapy.Spider):
        name = "my_spider"
        start_urls = ["https://books.toscrape.com"]

        def parse(self, response):
            ...

Under the hood, the :meth:`~scrapy.Spider.start` method inherited from
:class:`scrapy.Spider` doesn't pass any callback value to
:class:`scrapy.Request`:

.. code-block:: python

    for url in self.start_urls:
        yield Request(url, dont_filter=True)

Apart from this, there are also some built-in Scrapy < 2.8 features which omit
the :class:`scrapy.Request` callback value:

* :class:`scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware`
* :class:`scrapy.pipelines.images.ImagesPipeline`
* :class:`scrapy.pipelines.files.FilesPipeline`

However, omitting the :class:`scrapy.Request` callback value presents
*some problems* for **scrapy-poet**.

Skipped Downloads
-----------------

.. note::

    This subsection is specific to cases wherein a :class:`~.DummyResponse`
    annotates the response in a ``parse()`` method.

Let's take a look at an example:

.. code-block:: python

    import scrapy
    from scrapy_poet import DummyResponse


    class MySpider(scrapy.Spider):
        name = "my_spider"
        start_urls = ["https://books.toscrape.com"]

        def parse(self, response: DummyResponse):
            ...

In order for the built-in Scrapy < 2.8 features listed above to work properly,
**scrapy-poet** chooses to ignore the :class:`~.DummyResponse` annotation
completely. This means that the response is downloaded instead of being
skipped.

Otherwise, :class:`scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware`
might not work properly and would **not** visit the ``robots.txt`` file from
the website.

Moreover, this **scrapy-poet** behavior avoids the problem of images or files
being missing when the following pipelines are used:

* :class:`scrapy.pipelines.images.ImagesPipeline`
* :class:`scrapy.pipelines.files.FilesPipeline`

Note that the following :class:`UserWarning` is emitted in such a scenario:

    A request has been encountered with callback=None which defaults to the
    parse() method. If the parse() method is annotated with
    scrapy_poet.DummyResponse (or its subclasses), we're assuming this isn't
    intended and would simply ignore this annotation.

To avoid both this warning and this **scrapy-poet** behavior, it is best to
avoid defining a ``parse()`` method and instead choose any other name.

Dependency Building
-------------------

.. note::

    This subsection is specific to cases wherein dependencies are provided by
    **scrapy-poet** in the ``parse()`` method.

Let's take a look at the following code:

.. code-block:: python

    import attrs
    import scrapy

    from myproject.page_objects import MyPage


    class MySpider(scrapy.Spider):
        name = "my_spider"
        start_urls = ["https://books.toscrape.com"]

        def parse(self, response: scrapy.http.Response, page: MyPage):
            ...

In the above example, this error would be raised:
``TypeError: parse() missing 1 required positional argument: 'page'``.

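
The error is ordinary Python behavior rather than anything specific to Scrapy:
because the dependencies are not built, the callback ends up being called with
the response alone. A minimal sketch using only the standard library
illustrates this. ``FakeSpider``, ``call_callback`` and the ``object()``
response are hypothetical stand-ins, not Scrapy or **scrapy-poet** APIs:

.. code-block:: python

    class FakeSpider:
        """Hypothetical stand-in for a spider; not a scrapy.Spider subclass."""

        def parse(self, response, page):
            return page


    def call_callback(callback, response):
        # Assumption: the caller passes only the response, which is what
        # happens when no dependencies are injected into the callback.
        return callback(response)


    spider = FakeSpider()
    message = ""
    try:
        call_callback(spider.parse, object())
    except TypeError as exc:
        # The message ends with:
        # missing 1 required positional argument: 'page'
        message = str(exc)
        print(message)
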

The reason for this **scrapy-poet** behavior is to prevent wasted dependency
building *(which could be expensive in some cases)* when the ``parse()`` method
is used unintentionally.

For example, if a spider is using the
:class:`scrapy.pipelines.images.ImagesPipeline`, **scrapy-poet**'s
:class:`scrapy_poet.downloadermiddlewares.InjectionMiddleware` could be wasting
precious compute resources to fulfill one or more dependencies that won't be
used at all. Specifically, the ``page`` argument to the ``parse()`` method is
not utilized. If there are a million images to be downloaded, then the ``page``
instance is created a million times as well.

The following :class:`UserWarning` is emitted in such a scenario:

    A request has been encountered with callback=None which defaults to the
    parse() method. On such cases, annotated dependencies in the parse()
    method won't be built by scrapy-poet. However, if the request has
    callback=parse, the annotated dependencies will be built.

As the warning message suggests, this can be fixed by ensuring that the
callback is **not** ``None``:

.. code-block:: python

    import scrapy

    from myproject.page_objects import MyPage


    class MySpider(scrapy.Spider):
        name = "my_spider"

        async def start(self):
            # Passing the callback explicitly lets scrapy-poet build ``page``.
            yield scrapy.Request("https://books.toscrape.com", callback=self.parse)

        def parse(self, response: scrapy.http.Response, page: MyPage):
            ...

The :class:`UserWarning` is only shown when the ``parse()`` method declares a
dependency that is fulfilled by a provider declared in
``SCRAPY_POET_PROVIDERS``. This means that the following code neither produces
the warning nor attempts to skip building any dependency, because there is
none:

.. code-block:: python

    import scrapy


    class MySpider(scrapy.Spider):
        name = "my_spider"
        start_urls = ["https://books.toscrape.com"]

        def parse(self, response: scrapy.http.Response):
            ...

Similarly, the best way to completely avoid both this warning and this
**scrapy-poet** behavior is to avoid defining a ``parse()`` method and instead
choose any other name.

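
The condition under which the warning appears, described earlier in this
subsection, can be pictured with a toy model. The sketch below is only an
illustration under stated assumptions, not **scrapy-poet**'s implementation:
``PROVIDED_TYPES`` stands in for whatever the providers in
``SCRAPY_POET_PROVIDERS`` can build, and ``Response``, ``MyPage`` and
``declares_provided_dependency`` are hypothetical names.

.. code-block:: python

    import typing


    class Response:
        """Hypothetical stand-in for a response class."""


    class MyPage:
        """Hypothetical stand-in for a page object."""


    # Assumption: the set of types that some provider is able to build.
    PROVIDED_TYPES = {MyPage}


    def declares_provided_dependency(callback) -> bool:
        """Return True if any annotated parameter is a provided type."""
        hints = typing.get_type_hints(callback)
        hints.pop("return", None)  # only parameters matter here
        return any(hint in PROVIDED_TYPES for hint in hints.values())


    class WithDependency:
        def parse(self, response: Response, page: MyPage): ...


    class WithoutDependency:
        def parse(self, response: Response): ...


    print(declares_provided_dependency(WithDependency.parse))     # True
    print(declares_provided_dependency(WithoutDependency.parse))  # False

In this model only the first ``parse()`` would be a candidate for the warning,
since the second one declares nothing that needs to be built.
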

Opening a response in a web browser
===================================

When using scrapy-poet, the ``open_in_browser`` function from Scrapy may raise
the following exception::

    TypeError: Unsupported response type: HttpResponse

To avoid that, use the ``open_in_browser`` function from ``scrapy_poet.utils``
instead:

.. code-block:: python

    from scrapy_poet.utils import open_in_browser