callback_for(page_cls: Type[web_poet.pages.ItemPage]) → Callable
This function is a helper for creating callbacks for
ItemPage sub-classes. The generated callback returns the result of calling the page’s to_item method.
The generated callback can be used as a spider instance method or passed as an inline/anonymous argument. Define it as a spider attribute, as in the example below, if you plan to use disk queues, because in that case Scrapy must be able to serialize your request object.
    import scrapy
    from scrapy_poet import callback_for

    # BookPage is an ItemPage sub-class defined elsewhere.
    class BooksSpider(scrapy.Spider):
        name = 'books'
        start_urls = ['http://books.toscrape.com/']
        parse_book = callback_for(BookPage)

        def parse(self, response):
            links = response.css('.image_container a')
            yield from response.follow_all(links, self.parse_book)
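To make the behaviour concrete, here is a minimal, library-free sketch of how a helper like callback_for could be built. It is not scrapy-poet’s actual implementation: make_callback, FakeItemPage and FakeBookPage are hypothetical stand-ins, and the real helper works with web_poet ItemPage sub-classes and Scrapy responses.

```python
from typing import Callable, Type


class FakeItemPage:
    """Hypothetical stand-in for web_poet.pages.ItemPage."""

    def to_item(self) -> dict:
        raise NotImplementedError


class FakeBookPage(FakeItemPage):
    """Hypothetical page object that returns a fixed item."""

    def to_item(self) -> dict:
        return {"title": "A Light in the Attic"}


def make_callback(page_cls: Type[FakeItemPage]) -> Callable:
    """Build a callback whose ``page`` argument is annotated with page_cls.

    The annotation is what would let an injection layer know which
    page object to build before the callback runs.
    """

    def parse(response, page: page_cls):
        # The callback only forwards the item produced by the page object.
        yield page.to_item()

    return parse


# Usage: here the page object is passed by hand; in a real spider an
# injection layer would be expected to supply it.
parse_book = make_callback(FakeBookPage)
items = list(parse_book(response=None, page=FakeBookPage()))
```

The key design point in this sketch is that the page class travels in the callback’s type annotation, so nothing about the page needs to be stored on the request itself.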
DummyResponse(url: str, request=typing.Union[scrapy.http.request.Request, NoneType])
This class is returned by the
InjectionMiddleware when it detects that the download could be skipped. It inherits from Scrapy’s
Response, acts as a marker that the download was skipped, stores the URL, and keeps a reference to the original Request.
If you want to skip downloads, you can type annotate your parse method with this class.
    def parse(self, response: DummyResponse):
        pass
If there’s no Page Input that depends on a Scrapy Response, the
InjectionMiddleware is going to skip the download and provide a
DummyResponse to your parser instead.
If your PageObjectInputProvider doesn’t need a request, you simply don’t need to list it as a dependency. But if you need, for example, the original request’s URL, you can use DummyResponse as a dependency instead:
    @provides(ResponseData)
    class ResponseDataProvider(PageObjectInputProvider):

        def __init__(self, response: DummyResponse):
            self.response = response

        async def __call__(self):
            data = await self.get_data()
            return ResponseData(
                url=self.response.url,
                html=data
            )

        async def get_data(self):
            # make an api call
            # make a database query
            # read from disk
            # ...
            pass
An important part of scrapy-poet is the Injection Middleware. It’s responsible for injecting Page Input dependencies before the request callbacks are executed.
This is a Downloader Middleware that’s supposed to:
- check if request downloads could be skipped
- inject dependencies before request callbacks are executed
process_request(request: scrapy.http.request.Request, spider: scrapy.spiders.Spider)
This method checks if the request is really needed and if its download could be skipped by trying to infer if a
Response is going to be used by the callback or a Page Input.
If the Response can be ignored, a
utils.DummyResponse object is returned in its place. This
DummyResponse is linked to the original Request.
With this behavior, we’re able to optimize spider executions by avoiding unnecessary downloads. That could be the case when the callback actually uses another source, such as an external API like Scrapinghub’s Auto Extract.
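The inference step can be illustrated with a small sketch based on type annotations. This is an assumption-laden simplification, not the middleware’s real logic: Response and DummyResponse below are hypothetical stand-ins for the Scrapy and scrapy-poet classes, is_response_needed is an invented helper name, and only the callback’s first parameter is inspected, whereas the description above says Page Inputs are considered as well.

```python
import inspect
from typing import Callable


class Response:
    """Hypothetical stand-in for scrapy.http.Response."""

    def __init__(self, url: str):
        self.url = url


class DummyResponse(Response):
    """Hypothetical stand-in for scrapy-poet's DummyResponse."""


def is_response_needed(callback: Callable) -> bool:
    """Guess from annotations whether the callback needs a real download."""
    params = list(inspect.signature(callback).parameters.values())
    if not params:
        return True
    annotation = params[0].annotation
    # No annotation: we cannot tell, so stay conservative and download.
    if annotation is inspect.Parameter.empty:
        return True
    # A DummyResponse annotation says the response body will not be read.
    if inspect.isclass(annotation) and issubclass(annotation, DummyResponse):
        return False
    return True


def parse_with_download(response: Response):
    return response.url


def parse_without_download(response: DummyResponse):
    return response.url


def parse_unannotated(response):
    return response.url
```

In this sketch an unannotated callback always triggers a download, which keeps existing spiders working unchanged; skipping is strictly opt-in through the annotation.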
process_response(request: scrapy.http.request.Request, response: scrapy.http.response.Response, spider: scrapy.spiders.Spider)
This method instantiates all
Injectable sub-classes declared as request callback arguments and any other parameter with a provider for its type. Otherwise, this middleware doesn’t populate
request.cb_kwargs for this argument.
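The rule described above (instantiate Injectable sub-classes, use a provider when one exists for the parameter’s type, and otherwise leave the argument alone) could be sketched as follows. All names here are hypothetical: Injectable and BookPage are stand-ins, ApiToken is an invented input type, and build_callback_kwargs with its plain dict of factories is an assumed simplification of what the middleware does.

```python
import inspect
from typing import Any, Callable, Dict


class Injectable:
    """Hypothetical stand-in for web_poet.pages.Injectable."""


class BookPage(Injectable):
    """Hypothetical page object with no dependencies of its own."""

    def to_item(self) -> dict:
        return {"title": "example"}


class ApiToken:
    """Hypothetical non-Injectable type that has a provider."""

    def __init__(self, value: str):
        self.value = value


def build_callback_kwargs(
    callback: Callable,
    response: Any,
    providers: Dict[type, Callable[[Any], Any]],
) -> Dict[str, Any]:
    """Collect keyword arguments for the callback from its annotations."""
    kwargs: Dict[str, Any] = {}
    parameters = list(inspect.signature(callback).parameters.values())
    # Skip the first parameter, which is the response itself.
    for parameter in parameters[1:]:
        annotation = parameter.annotation
        if inspect.isclass(annotation) and issubclass(annotation, Injectable):
            # Injectable sub-classes are instantiated and injected.
            kwargs[parameter.name] = annotation()
        elif annotation in providers:
            # Other types are injected only if a provider is known.
            kwargs[parameter.name] = providers[annotation](response)
        # Anything else is left out, so no entry is added for it.
    return kwargs


def parse_book(response, page: BookPage, token: ApiToken, note: str = "n/a"):
    return page.to_item(), token.value, note


providers = {ApiToken: lambda response: ApiToken("secret")}
cb_kwargs = build_callback_kwargs(parse_book, None, providers)
result = parse_book(None, **cb_kwargs)
```

Here note keeps its default value because str is neither an Injectable sub-class nor a type with a provider, which mirrors the "doesn’t populate" case.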
We should be able to inject any type into classes that inherit from
web_poet.pages.Injectable, but currently, we’re only able to build and inject
Page Input Providers
The Injection Middleware needs a standard way to build dependencies for
the Page Inputs used by the request callbacks. That’s why we have created a
You could implement different providers in order to acquire data from multiple external sources, for example, Splash or Auto Extract API.
This is an abstract class for describing Page Object Input Providers.
This method is responsible for building Page Input dependencies.
You can override this method to receive external dependencies.
This class provides
This method builds a
ResponseData instance using a Scrapy Response.
This class receives a Scrapy
Response as a dependency.
This decorator should be used with classes that inherit from
PageObjectInputProvider in order to automatically register them as providers.
See ResponseDataProvider’s implementation for an example.
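The relationship between a registering decorator and an explicit register call can be shown with a short, self-contained sketch. It uses only the standard library and is not scrapy-poet’s code: PROVIDERS, InputProvider, Greeting and GreetingProvider are hypothetical names, and the decorator here simply forwards to a register function with the same argument order as the signature documented below.

```python
import asyncio
from typing import Dict, Type

# Hypothetical registry mapping a provided type to its provider class.
PROVIDERS: Dict[type, Type["InputProvider"]] = {}


class InputProvider:
    """Hypothetical stand-in for PageObjectInputProvider."""

    async def __call__(self):
        raise NotImplementedError


def register(provider_class: Type[InputProvider], provided_class: type) -> None:
    """Record which provider builds instances of provided_class."""
    PROVIDERS[provided_class] = provider_class


def provides(provided_class: type):
    """Class decorator that registers the decorated provider."""

    def decorator(provider_class: Type[InputProvider]) -> Type[InputProvider]:
        register(provider_class, provided_class)
        return provider_class

    return decorator


class Greeting:
    """Hypothetical Page Input type."""

    def __init__(self, text: str):
        self.text = text


@provides(Greeting)
class GreetingProvider(InputProvider):
    async def __call__(self) -> Greeting:
        return Greeting("hello")


# Look up the provider for a type, instantiate it, and await its result.
provider = PROVIDERS[Greeting]()
greeting = asyncio.run(provider())
```

Writing the decorator as a thin wrapper over register keeps a single code path for registration, so the two forms cannot drift apart.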
register(provider_class: Type[scrapy_poet.page_input_providers.PageObjectInputProvider], provided_class: Type)
This method registers a Page Object Input Provider in the providers registry. It can be replaced by the use of the provides decorator.