Providers

Note

This document assumes familiarity with web-poet concepts; make sure you’ve read the web-poet docs.

This page is mostly aimed at developers who want to extend scrapy-poet, not at developers who are writing extraction and crawling code using scrapy-poet.

Creating providers

Providers are responsible for building dependencies needed by Injectable objects. A good example would be the scrapy_poet.HttpResponseProvider, which builds and provides a web_poet.HttpResponse instance for Injectables that need it, like the web_poet.WebPage.

from typing import Callable, Set

import web_poet
from scrapy.http import Response

from scrapy_poet.page_input_providers import PageObjectInputProvider


class HttpResponseProvider(PageObjectInputProvider):
    """This class provides ``web_poet.HttpResponse`` instances."""
    provided_classes = {web_poet.HttpResponse}

    def __call__(self, to_provide: Set[Callable], response: Response):
        """Build a ``web_poet.HttpResponse`` instance using a Scrapy ``Response``"""
        return [
            web_poet.HttpResponse(
                url=response.url,
                body=response.body,
                status=response.status,
                headers=web_poet.HttpResponseHeaders.from_bytes_dict(response.headers),
            )
        ]
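
On the consuming side, any Injectable that declares a web_poet.HttpResponse dependency receives the instance built by this provider. web_poet.WebPage already declares that dependency, so a page object such as the following minimal sketch works with no extra wiring:

import web_poet


class TitlePage(web_poet.WebPage):
    # WebPage declares a ``response: web_poet.HttpResponse`` dependency,
    # which HttpResponseProvider fills in; ``css`` and ``url`` are shortcuts
    # to that response.
    def to_item(self):
        return {"url": self.url, "title": self.css("title::text").get()}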

You can implement your own providers to extend or override the current scrapy-poet behavior. All providers should inherit from the PageObjectInputProvider base class.

Check the PageObjectInputProvider API reference for more details.

Cache Support in Providers

scrapy-poet also supports caching the dependencies returned by providers. For example, HttpResponseProvider supports this out of the box: it inherits from CacheDataProviderMixin and implements all of its abstract methods.

Extending the previous example to support caching leads to the following code:

from typing import Any, Callable, Sequence, Set

import web_poet
from scrapy import Request
from scrapy.http import Response

from scrapy_poet.page_input_providers import (
    CacheDataProviderMixin,
    PageObjectInputProvider,
)

class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
    """This class provides ``web_poet.HttpResponse`` instances."""
    provided_classes = {web_poet.HttpResponse}

    def __call__(self, to_provide: Set[Callable], response: Response):
        """Build a ``web_poet.HttpResponse`` instance using a Scrapy ``Response``"""
        return [
            web_poet.HttpResponse(
                url=response.url,
                body=response.body,
                status=response.status,
                headers=web_poet.HttpResponseHeaders.from_bytes_dict(response.headers),
            )
        ]

    def fingerprint(self, to_provide: Set[Callable], request: Request) -> str:
        """Returns a fingerprint to identify the specific request."""
        # Implementation here

    def serialize(self, result: Sequence[Any]) -> Any:
        """Serializes the results of this provider. The data returned will
        be pickled.
        """
        # Implementation here

    def deserialize(self, data: Any) -> Sequence[Any]:
        """Deserialize some results of the provider that were previously
        serialized using the serialize() method.
        """
        # Implementation here
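
The method bodies above are left as placeholders. For illustration, here is one way they could be filled in, assuming the provider returns a list with a single web_poet.HttpResponse; this is only a sketch (the fingerprinting scheme in particular is arbitrary), not the implementation shipped with scrapy-poet:

import hashlib
import json
from typing import Any, Callable, Sequence, Set

import web_poet
from scrapy import Request
from scrapy.http import Response

from scrapy_poet.page_input_providers import (
    CacheDataProviderMixin,
    PageObjectInputProvider,
)


class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
    provided_classes = {web_poet.HttpResponse}

    def __call__(self, to_provide: Set[Callable], response: Response):
        return [
            web_poet.HttpResponse(
                url=response.url,
                body=response.body,
                status=response.status,
                headers=web_poet.HttpResponseHeaders.from_bytes_dict(response.headers),
            )
        ]

    def fingerprint(self, to_provide: Set[Callable], request: Request) -> str:
        # Identify the request by its URL plus the set of requested classes.
        key = json.dumps([request.url, sorted(cls.__name__ for cls in to_provide)])
        return hashlib.sha256(key.encode()).hexdigest()

    def serialize(self, result: Sequence[Any]) -> Any:
        # Reduce each HttpResponse to plain, picklable data.
        return [
            {
                "url": str(response.url),
                "body": bytes(response.body),
                "status": response.status,
                "headers": list(response.headers.items()),
            }
            for response in result
        ]

    def deserialize(self, data: Any) -> Sequence[Any]:
        # Rebuild HttpResponse instances from the serialized data.
        return [
            web_poet.HttpResponse(
                url=item["url"],
                body=item["body"],
                status=item["status"],
                headers=web_poet.HttpResponseHeaders(item["headers"]),
            )
            for item in data
        ]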

Take note that even if you’re using providers that support the caching interface, caching is only applied when the SCRAPY_POET_CACHE setting is enabled.
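
For example, in your project’s settings.py (the cache directory below is just an example path; a value of True is assumed to select a default location):

SCRAPY_POET_CACHE = "/tmp/scrapy-poet-cache"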

The caching of provided dependencies is very useful for local development of Page Objects, as it reduces the waiting time for your responses (or any other type of external dependency, for that matter) by caching them locally.

Currently, the data is cached using a sqlite database in your local directory. This is implemented using SqlitedictCache.

The cache mechanism that scrapy-poet currently offers is quite different from Scrapy’s HttpCacheMiddleware. Although they are similar in their intended purpose, scrapy-poet’s cached data is directly tied to the provider that produced it, which can be anything beyond Scrapy responses (e.g. network database queries, API calls, AWS S3 files, etc.).

Note

The scrapy_poet.injection.Injector maintains a .weak_cache which stores the instances created by the providers for as long as the corresponding scrapy.Request instance exists. This means that instances created by earlier providers can be accessed and reused by later providers. This is turned on by default and the instances are stored in memory.

Configuring providers

The list of available providers should be configured in the spider settings. For example, the following configuration should be included in the settings to enable a new provider MyProvider:

"SCRAPY_POET_PROVIDERS": {MyProvider: 500}

The number used as value (500) defines the provider priority. See Scrapy Middlewares configuration dictionaries for more information.

Note

The providers in scrapy_poet.DEFAULT_PROVIDERS, which includes a provider for web_poet.HttpResponse, are always included by default. You can disable any of them by listing it in the configuration with the priority None.
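
For example, a settings.py that registers a custom provider and disables the default HttpResponseProvider could look like this (myproject.providers.MyProvider is a hypothetical import path):

from scrapy_poet.page_input_providers import HttpResponseProvider

from myproject.providers import MyProvider  # hypothetical custom provider

SCRAPY_POET_PROVIDERS = {
    MyProvider: 500,             # enabled with priority 500
    HttpResponseProvider: None,  # default provider disabled
}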

Ignoring requests

Sometimes requests can be skipped, for example when you’re fetching data using a third-party API such as AutoExtract or querying a database.

In cases like that, it makes no sense to send the request to Scrapy’s downloader, as it would only waste network resources. To avoid making such requests, you can annotate your response argument with the DummyResponse type.

That could be done in the spider’s parser method:

def parser(self, response: DummyResponse, page: MyPageObject):
    pass

A spider method that has its first argument annotated as DummyResponse signals that it is not going to use the response, so it should be safe to skip downloading the regular Scrapy Response.

This type annotation is already applied when you use the callback_for() helper: the callback created by callback_for() doesn’t use the Response; it just calls the page object’s to_item() method.
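
For example, the snippet below uses callback_for() with a hypothetical BookPage page object; the generated callback never touches the Scrapy response directly and simply yields BookPage.to_item():

import scrapy
from scrapy_poet import callback_for
from web_poet import WebPage


class BookPage(WebPage):
    def to_item(self):
        return {"url": self.url, "title": self.css("title::text").get()}


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    # Equivalent to a callback that takes a DummyResponse and a BookPage
    # and yields page.to_item(). Note that BookPage itself still needs the
    # response, so the download is not skipped in this case.
    parse = callback_for(BookPage)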

If neither spider callback nor any of the input providers are using Response, InjectionMiddleware skips the download, returning a DummyResponse instead. For example:

from typing import Callable, List

import attr
import scrapy

from scrapy_poet import DummyResponse, PageObjectInputProvider
from web_poet import ItemPage


def get_cached_content(key: str):
    # get cached html response from db or other source
    pass


@attr.define
class CachedData:
    key: str
    value: str


class CachedDataProvider(PageObjectInputProvider):
    provided_classes = {CachedData}

    def __call__(self, to_provide: List[Callable], request: scrapy.Request):
        return [
            CachedData(
                key=request.url,
                value=get_cached_content(request.url)
            )
        ]


@attr.define
class MyPageObject(ItemPage):
    content: CachedData

    def to_item(self):
        return {
            "url": self.content.key,
            "content": self.content.value,
        }


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        yield scrapy.Request("http://books.toscrape.com/", self.parse_page)

    def parse_page(self, response: DummyResponse, page: MyPageObject):
        # the request will be IGNORED because neither the spider callback
        # nor MyPageObject appears to use the response
        yield page.to_item()

However, if the spider callback is not using the Response but the Page Object uses it, the request is not ignored. For example:

from typing import Callable, Set

import attr
import scrapy
from scrapy.http import Response

from scrapy_poet import DummyResponse, PageObjectInputProvider
from web_poet import ItemPage


def parse_content(html: str):
    # parse content from html
    pass


@attr.define
class MyResponseData:
    url: str
    html: str


class MyResponseDataProvider(PageObjectInputProvider):
    provided_classes = {MyResponseData}

    def __call__(self, to_provide: Set[Callable], response: Response):
        return [
            MyResponseData(
                url=response.url,
                html=response.text,
            )
        ]


@attr.define
class MyPageObject(ItemPage):
    response: MyResponseData

    def to_item(self):
        return {
            "url": self.response.url,
            "content": parse_content(self.response.html),
        }


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        yield scrapy.Request("http://books.toscrape.com/", self.parse_page)

    def parse_page(self, response: DummyResponse, page: MyPageObject):
        # the request will be PROCESSED because the spider callback doesn't
        # use the response, but MyPageObject does
        yield page.to_item()

Note

The code above is just for example purposes. If you need to use scrapy.http.Response instances in your Page Objects, use the built-in web_poet.WebPage instead: it has a response attribute with a web_poet.HttpResponse instance, and no additional configuration is needed, since HttpResponseProvider is enabled in scrapy-poet by default.

Requests concurrency

Requests that are skipped with a DummyResponse never reach the downloader, so it makes sense not to check concurrency limits, download delays, or AutoThrottle settings for them, since no download is made at all.

By default, if your callback or its page inputs need a regular Response, the request is downloaded through Scrapy, and all settings and limits are respected, for example:

  • CONCURRENT_REQUESTS

  • CONCURRENT_REQUESTS_PER_DOMAIN

  • CONCURRENT_REQUESTS_PER_IP

  • RANDOMIZE_DOWNLOAD_DELAY

  • all AutoThrottle settings

  • DownloaderAwarePriorityQueue logic

But be aware when using third-party libraries to acquire content for a page object: if you make an HTTP request in a provider using some third-party async library (aiohttp, treq, etc.), the CONCURRENT_REQUESTS option will be respected, but not the others.

To have the other settings respected as well, you’d need to download through Scrapy itself, e.g. using crawler.engine.download. Alternatively, you could implement those limits in the library itself.
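
For illustration, here is a rough sketch of such a provider. ExternalData and the API URL are made up, and the sketch assumes that a provider’s __call__ may be a coroutine, that a Crawler argument can be injected into it, and that crawler.engine.download() accepts a single request argument (true in recent Scrapy versions); double-check these assumptions against your scrapy-poet and Scrapy versions:

from typing import Callable, Set

import attr
import scrapy
from scrapy.crawler import Crawler
from scrapy.utils.defer import maybe_deferred_to_future

from scrapy_poet import PageObjectInputProvider


@attr.define
class ExternalData:
    url: str
    body: bytes


class ExternalDataProvider(PageObjectInputProvider):
    provided_classes = {ExternalData}

    async def __call__(self, to_provide: Set[Callable], request: scrapy.Request, crawler: Crawler):
        # Route the extra request through Scrapy's downloader so that
        # concurrency limits, delays and AutoThrottle apply to it.
        api_request = scrapy.Request(f"https://api.example.com/?url={request.url}")
        response = await maybe_deferred_to_future(crawler.engine.download(api_request))
        return [ExternalData(url=request.url, body=response.body)]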

Attaching metadata to dependencies

Providers can support dependencies with arbitrary metadata attached and use that metadata when creating them. Attaching the metadata is done by wrapping the dependency class in typing.Annotated:

@attr.define
class MyPageObject(ItemPage):
    response: Annotated[HtmlResponse, "foo", "bar"]

To handle this you need the following changes in your provider:

from typing import Callable

from andi.typeutils import strip_annotated
from scrapy_poet import PageObjectInputProvider
from web_poet.annotated import AnnotatedInstance


class Provider(PageObjectInputProvider):
    ...

    def is_provided(self, type_: Callable) -> bool:
        # needed so that you can list just the base type in provided_classes
        return super().is_provided(strip_annotated(type_))

    def __call__(self, to_provide):
        result = []
        for cls in to_provide:
            metadata = getattr(cls, "__metadata__", None)
            obj = ...  # create the instance using cls and metadata
            if metadata:
                # wrap the instance into a web_poet.annotated.AnnotatedInstance object
                obj = AnnotatedInstance(obj, metadata)
            result.append(obj)
        return result
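
For example, a hypothetical provider could use the metadata to parametrize the instances it builds (Greeting, GreetingProvider, and the language convention below are all made up for illustration):

from typing import Annotated, Callable, Set

import attr
from andi.typeutils import strip_annotated
from scrapy_poet import PageObjectInputProvider
from web_poet import ItemPage
from web_poet.annotated import AnnotatedInstance


@attr.define
class Greeting:
    text: str


class GreetingProvider(PageObjectInputProvider):
    provided_classes = {Greeting}

    def is_provided(self, type_: Callable) -> bool:
        return super().is_provided(strip_annotated(type_))

    def __call__(self, to_provide: Set[Callable]):
        result = []
        for cls in to_provide:
            metadata = getattr(cls, "__metadata__", None)
            # pick the greeting language from the first metadata item, if any
            lang = metadata[0] if metadata else "en"
            obj = Greeting(text="hola" if lang == "es" else "hello")
            if metadata:
                # wrap it so the injector knows which annotated dependency
                # this instance corresponds to
                obj = AnnotatedInstance(obj, metadata)
            result.append(obj)
        return result


@attr.define
class MyPage(ItemPage):
    greeting: Annotated[Greeting, "es"]

    def to_item(self):
        return {"greeting": self.greeting.text}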