Pitfalls
scrapy.Request
without callback
Tip
Note that the pitfalls discussed in this section aren’t applicable to Scrapy >= 2.8 for most cases.
However, if you have code somewhere which directly adds
scrapy.Request
instances to the downloader,
you need to ensure that they don’t use None
as the callback value.
Instead, you can use the new scrapy.http.request.NO_CALLBACK()
value introduced in Scrapy 2.8.
Note
This section only applies to specific cases where spiders define a
parse()
method.
The TLDR; recommendation is to simply avoid defining a parse()
method
and instead choose another name.
Scrapy supports declaring scrapy.Request
instances
without setting any callbacks (i.e. None
). For these instances, Scrapy uses
the parse()
method as its callback.
Let’s take a look at the following code:
import scrapy
class MySpider(scrapy.Spider):
name = "my_spider"
start_urls = ["https://books.toscrape.com"]
def parse(self, response):
...
Under the hood, the inherited start_requests()
method from
scrapy.Spider
doesn’t declare any callback
value to scrapy.Request
:
for url in self.start_urls:
yield Request(url, dont_filter=True)
Apart from this, there are also some built-in Scrapy < 2.8 features which omit
the scrapy.Request
callback value:
However, omitting the scrapy.Request
callback
value presents some problems for scrapy-poet.
Skipped Downloads
Note
This subsection is specific to cases wherein a
DummyResponse
annotates the response in a parse()
method.
Let’s take a look at an example:
import scrapy
from scrapy_poet import DummyResponse
class MySpider(scrapy.Spider):
name = "my_spider"
start_urls = ["https://books.toscrape.com"]
def parse(self, response: DummyResponse):
...
In order for the built-in Scrapy < 2.8 features listed above to work properly,
scrapy-poet chooses to ignore the DummyResponse
annotation completely. This means that the response is downloaded instead of
being skipped.
Otherwise, scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware
might not work properly and would not visit the robots.txt
file from the
website.
Moreover, this scrapy-poet behavior avoids the problem of the images or files being missing when the following pipelines are used:
Note that the following UserWarning
is emitted when encountering such
scenario:
A request has been encountered with callback=None which defaults to the parse() method. If the parse() method is annotated with scrapy_poet.DummyResponse (or its subclasses), we’re assuming this isn’t intended and would simply ignore this annotation.
To avoid the said warning and this scrapy-poet behavior from occurring, it’d
be best to avoid defining a parse()
method and instead choose any other name.
Dependency Building
Note
This subsection is specific to cases wherein dependencies are provided by
scrapy-poet in the parse()
method.
Let’s take a look at the following code:
import attrs
import scrapy
from myproject.page_objects import MyPage
class MySpider(scrapy.Spider):
name = "my_spider"
start_urls = ["https://books.toscrape.com"]
def parse(self, response: scrapy.http.Response, page: MyPage):
...
In the above example, this error would be raised: TypeError: parse() missing 1
required positional argument: 'page'
.
The reason for this scrapy-poet behavior is to prevent the wasted dependency
building (which could be expensive in some cases) when the parse()
method
is unintentionally used.
For example, if a spider is using the scrapy.pipelines.images.ImagesPipeline
,
scrapy-poet’s scrapy_poet.downloadermiddlewares.InjectionMiddleware
could be wasting precious compute resources to fulfill one or more dependencies
that won’t be used at all. Specifically, the page
argument to the parse()
method is not utilized. If there are a million of images to be downloaded, then
the page
instance is created a million times as well.
The following UserWarning
is emitted on such scenario:
A request has been encountered with callback=None which defaults to the parse() method. On such cases, annotated dependencies in the parse() method won’t be built by scrapy-poet. However, if the request has callback=parse, the annotated dependencies will be built.
As the warning message suggests, this could be fixed by ensuring that the callback
is not None
:
class MySpider(scrapy.Spider):
name = "my_spider"
def start_requests(self):
yield scrapy.Request("https://books.toscrape.com", callback=self.parse)
def parse(self, response: scrapy.http.Response, page: MyPage):
...
The UserWarning
is only shown when the parse()
method declares any
dependency that is fullfilled by any provider declared in SCRAPY_POET_PROVIDERS
.
This means that the following code doesn’t produce the warning nor attempts to
skip any dependency from being built because there is none:
class MySpider(scrapy.Spider): name = "my_spider" start_urls = ["https://books.toscrape.com"] def parse(self, response: scrapy.http.Response): ...
Similarly, the best way to completely avoid the said warning and this scrapy-poet
behavior is to avoid defining a parse()
method and instead choose any other name.
Opening a response in a web browser
When using scrapy-poet, the open_in_browser
function from Scrapy may raise
the following exception:
TypeError: Unsupported response type: HttpResponse
To avoid that, use the open_in_browser
function from scrapy_poet.utils
instead:
from scrapy_poet.utils import open_in_browser