API Reference

Utils

scrapy_poet.callback_for(page_cls: Type[web_poet.pages.ItemPage]) → Callable

This function is a helper for creating callbacks for ItemPage sub-classes. The generated callback returns the result of calling the ItemPage.to_item method.

The generated callback can be used as a spider instance method or passed as an inline/anonymous argument. If you plan to use disk queues, define it as a spider attribute (as in the example below), because in that case Scrapy must be able to serialize your request object.
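To make the behavior concrete, here is a minimal sketch of what a helper of this kind could look like. It is not scrapy-poet’s actual implementation: ItemPage below is a hypothetical stand-in for web_poet.pages.ItemPage, and callback_for_sketch and BookPage are illustrative names. The page object is assumed to arrive as a keyword argument named page.

```python
from typing import Callable, Type


class ItemPage:
    """Hypothetical stand-in for web_poet.pages.ItemPage."""

    def to_item(self) -> dict:
        raise NotImplementedError


def callback_for_sketch(page_cls: Type[ItemPage]) -> Callable:
    """Return a callback that yields the item built by a page_cls instance."""

    def parse(*args, page: page_cls):
        # *args absorbs ``self`` and ``response``; only the page is used.
        yield page.to_item()

    return parse


class BookPage(ItemPage):
    def to_item(self) -> dict:
        return {"title": "A Light in the Attic"}


parse_book = callback_for_sketch(BookPage)
# Simulate the call an injector would make once the page has been built.
items = list(parse_book(None, page=BookPage()))
```

The type annotation on page is what would let an injector discover which page class to build for this callback.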

Example:

import scrapy
from scrapy_poet import callback_for


class BooksSpider(scrapy.Spider):

    name = 'books'
    start_urls = ['http://books.toscrape.com/']
    # BookPage is an ItemPage sub-class defined elsewhere in the project.
    parse_book = callback_for(BookPage)

    def parse(self, response):
        # Follow every book link and handle it with the generated callback.
        links = response.css('.image_container a')
        yield from response.follow_all(links, self.parse_book)

class scrapy_poet.utils.DummyResponse(url: str, request=typing.Union[scrapy.http.request.Request, NoneType])

This class is returned by the InjectionMiddleware when it detects that the download can be skipped. It inherits from Scrapy’s Response, stores the URL, and keeps a reference to the original Request.

If you want to skip downloads, annotate the response argument of your parse method with this class.

def parse(self, response: DummyResponse):
    pass

If no Page Input depends on a Scrapy Response, the InjectionMiddleware skips the download and passes a DummyResponse to your callback instead.
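The callback side of this decision can be illustrated with a short sketch. It is an assumption about how such a check might be written, not the middleware’s real code: Response and DummyResponse are hypothetical stand-ins for the Scrapy and scrapy-poet classes, callback_needs_response is an illustrative name, and the callbacks are plain functions rather than bound spider methods. Page Input dependencies, which the middleware also has to consider, are left out.

```python
import inspect
from typing import Callable


class Response:
    """Hypothetical stand-in for scrapy.http.Response."""

    def __init__(self, url: str, request=None):
        self.url = url
        self.request = request


class DummyResponse(Response):
    """Placeholder handed to callbacks when nothing was downloaded."""


def callback_needs_response(callback: Callable) -> bool:
    """Return False only if the first argument is annotated as DummyResponse."""
    params = list(inspect.signature(callback).parameters.values())
    if not params:
        return True  # nothing to inspect: stay conservative
    annotation = params[0].annotation
    if inspect.isclass(annotation) and issubclass(annotation, DummyResponse):
        return False
    return True  # unannotated or annotated with a real response type


def parse_plain(response):
    pass


def parse_dummy(response: DummyResponse):
    pass


needs_plain = callback_needs_response(parse_plain)
needs_dummy = callback_needs_response(parse_dummy)
```

An unannotated argument is treated as needing the real response, so existing spiders keep downloading pages unless they opt in.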

If your PageObjectInputProvider doesn’t need the response, simply don’t list it as a dependency. If it needs, for example, the original request’s URL, you can depend on DummyResponse instead of Response:

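from scrapy_poet.page_input_providers import PageObjectInputProvider, provides
from scrapy_poet.utils import DummyResponse
from web_poet.page_inputs import ResponseData


@provides(ResponseData)
class ResponseDataProvider(PageObjectInputProvider):

    def __init__(self, response: DummyResponse):
        # Depending on DummyResponse gives access to the URL without a download.
        self.response = response

    async def __call__(self):
        data = await self.get_data()
        return ResponseData(
            url=self.response.url,
            html=data,
        )

    async def get_data(self):
        # make an api call
        # make a database query
        # read from disk
        # ...
        pass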
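Because __call__ is a coroutine here, the provider has to be awaited to produce its value. The following self-contained sketch mimics the shape of the provider above using only the standard library. UrlOnlyResponse and SketchProvider are hypothetical names, the data source is faked, and asyncio.run is used only for demonstration; how the middleware actually drives the coroutine is not covered by this reference.

```python
import asyncio


class UrlOnlyResponse:
    """Hypothetical stand-in for DummyResponse: carries a URL, no body."""

    def __init__(self, url: str):
        self.url = url


class SketchProvider:
    """Same shape as the provider above, with the data source faked."""

    def __init__(self, response: UrlOnlyResponse):
        self.response = response

    async def __call__(self) -> dict:
        data = await self.get_data()
        return {"url": self.response.url, "html": data}

    async def get_data(self) -> str:
        # Placeholder for an API call, a database query or a disk read.
        await asyncio.sleep(0)
        return "<html></html>"


provider = SketchProvider(UrlOnlyResponse("http://books.toscrape.com/"))
# Calling the provider returns a coroutine; asyncio.run awaits it to completion.
result = asyncio.run(provider())
```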

Injection Middleware

An important part of scrapy-poet is the Injection Middleware. It’s responsible for injecting Page Input dependencies before the request callbacks are executed.

class scrapy_poet.middleware.InjectionMiddleware

This Downloader Middleware is responsible for:

  • checking whether request downloads can be skipped

  • injecting dependencies before request callbacks are executed

process_request(request: scrapy.http.request.Request, spider: scrapy.spiders.Spider)

This method checks whether the request really needs to be downloaded. It does so by trying to infer whether a Response is going to be used by the callback or by a Page Input.

If the Response can be ignored, a utils.DummyResponse object is returned in its place. This DummyResponse is linked to the original Request instance.

This behavior lets us optimize spider execution by avoiding unnecessary downloads, for example when the callback gets its data from another source, such as an external API like Scrapinghub’s Auto Extract.

process_response(request: scrapy.http.request.Request, response: scrapy.http.response.Response, spider: scrapy.spiders.Spider)

This method instantiates every Injectable sub-class declared as a request callback argument, as well as any other argument whose type has a registered provider. For arguments without a provider, the middleware doesn’t populate request.cb_kwargs.
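The idea of filling callback arguments from their type annotations can be sketched as follows. This is an illustration under stated assumptions, not the middleware’s code: ResponseData and FakeResponse are hypothetical stand-ins, build_cb_kwargs is an illustrative name, and the providers mapping is simplified to a dictionary from type to a factory taking the response.

```python
import inspect
from typing import Any, Callable, Dict


class ResponseData:
    """Hypothetical stand-in for web_poet.page_inputs.ResponseData."""

    def __init__(self, url: str, html: str):
        self.url = url
        self.html = html


class FakeResponse:
    """Hypothetical stand-in for a Scrapy Response."""

    def __init__(self, url: str, text: str):
        self.url = url
        self.text = text


def build_cb_kwargs(callback: Callable, providers: Dict[Any, Callable],
                    response: FakeResponse) -> Dict[str, Any]:
    """Build keyword arguments for every parameter whose type has a provider."""
    kwargs = {}
    for name, param in inspect.signature(callback).parameters.items():
        factory = providers.get(param.annotation)
        if factory is not None:
            kwargs[name] = factory(response)
        # Parameters without a provider are left out of the result.
    return kwargs


def parse(response, data: ResponseData, page_number: int):
    pass


providers = {ResponseData: lambda r: ResponseData(url=r.url, html=r.text)}
response = FakeResponse("http://books.toscrape.com/", "<html></html>")
cb_kwargs = build_cb_kwargs(parse, providers, response)
```

Only data is populated; response has no annotation and int has no provider in this sketch, so both are skipped.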

Warning

We should be able to inject any type into classes that inherit from web_poet.pages.Injectable, but currently, we’re only able to build and inject scrapy.Response instances.

Page Input Providers

The Injection Middleware needs a standard way to build dependencies for the Page Inputs used by the request callbacks. That’s why we have created a repository of PageObjectInputProviders.

You can implement additional providers to acquire data from other external sources, for example Splash or the Auto Extract API.

class scrapy_poet.page_input_providers.PageObjectInputProvider

This is an abstract class for describing Page Object Input Providers.

abstract __call__()

This method is responsible for building Page Input dependencies.

__init__()

You can override this method to receive external dependencies.

class scrapy_poet.page_input_providers.ResponseDataProvider(response: scrapy.http.response.Response)

This class provides web_poet.page_inputs.ResponseData instances.

__call__()

This method builds a ResponseData instance using a Scrapy Response.

__init__(response: scrapy.http.response.Response)

This class receives a Scrapy Response as a dependency.

scrapy_poet.page_input_providers.provides(provided_class: Type)

This decorator should be used with classes that inherit from PageObjectInputProvider in order to automatically register them as providers.

See ResponseDataProvider’s implementation for an example.

scrapy_poet.page_input_providers.register(provider_class: Type[scrapy_poet.page_input_providers.PageObjectInputProvider], provided_class: Type)

This function registers a Page Object Input Provider in the providers registry. The provides decorator can be used instead.

Example:

register(ResponseDataProvider, ResponseData)
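The relationship between register and provides can be pictured with a small stand-alone sketch. It assumes the registry is a plain dictionary keyed by the provided class, which may differ from scrapy-poet’s internals. The functions below reuse the documented names for readability but are local stand-ins, and HtmlSnapshot and SnapshotProvider are hypothetical.

```python
from typing import Callable, Dict, Type

# Assumed registry shape: provided class -> provider class.
providers: Dict[Type, Type] = {}


def register(provider_class: Type, provided_class: Type) -> None:
    """Record provider_class as the builder of provided_class instances."""
    providers[provided_class] = provider_class


def provides(provided_class: Type) -> Callable[[Type], Type]:
    """Class decorator that registers the decorated provider."""

    def decorator(provider_class: Type) -> Type:
        register(provider_class, provided_class)
        return provider_class  # the class itself is returned unchanged

    return decorator


class ResponseData:
    """Stand-in for web_poet.page_inputs.ResponseData."""


class HtmlSnapshot:
    """Hypothetical second input type."""


@provides(ResponseData)
class ResponseDataProvider:
    pass


class SnapshotProvider:
    pass


# Explicit registration, equivalent to decorating SnapshotProvider.
register(SnapshotProvider, HtmlSnapshot)
```

Both styles end up with the same kind of registry entry; the decorator only saves the separate register call.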