API Reference

API

scrapy_poet.callback_for(page_or_item_cls: Type) → Callable[source]

Create a callback for a web_poet.ItemPage subclass or an item class.

The generated callback returns the output of the to_item method, i.e. it extracts a single item from a web page using a Page Object.

This helper reduces the boilerplate of working with Page Objects. For example, instead of this:

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        links = response.css(".image_container a")
        yield from response.follow_all(links, self.parse_book)

    def parse_book(self, response: DummyResponse, page: BookPage):
        return page.to_item()

It lets you write this:

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        links = response.css(".image_container a")
        yield from response.follow_all(links, self.parse_book)

    parse_book = callback_for(BookPage)

It also supports producing an async generator callable if the Page Object’s to_item method is a coroutine that uses the async/await syntax.

So if we have the following:

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        links = response.css(".image_container a")
        yield from response.follow_all(links, self.parse_book)

    async def parse_book(self, response: DummyResponse, page: BookPage):
        yield await page.to_item()

It could be turned into:

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        links = response.css(".image_container a")
        yield from response.follow_all(links, self.parse_book)

    parse_book = callback_for(BookPage)

The generated callback can be used as a spider instance method or passed as an inline/anonymous argument. Make sure to define it as a spider attribute (as shown in the example above) if you’re planning to use disk queues, because only then is Scrapy able to serialize your request object.
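For the inline usage, a minimal sketch, assuming a BookPage Page Object as in the examples above (note the disk-queue caveat just mentioned):

import scrapy

from scrapy_poet import callback_for


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        links = response.css(".image_container a")
        # Anonymous callback: requests created this way cannot be
        # serialized to disk queues.
        yield from response.follow_all(links, callback_for(BookPage))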

class scrapy_poet.DummyResponse(*args: Any, **kwargs: Any)[source]

This class is returned by the InjectionMiddleware when it detects that the download can be skipped. It inherits from scrapy.http.Response, stores the URL, and references the original scrapy.Request.

If you want to skip downloads, you can type-annotate your parse method with this class.

def parse(self, response: DummyResponse):
    pass

If there’s no Page Input that depends on a scrapy.http.Response, the InjectionMiddleware skips the download and provides a DummyResponse to your parser instead.

Injection Middleware

An important part of scrapy-poet is the Injection Middleware. It’s responsible for injecting Page Input dependencies before the request callbacks are executed.
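To enable it, the project’s setup instructions add it to Scrapy’s DOWNLOADER_MIDDLEWARES setting; a minimal sketch (the priority value 543 follows the convention used in the scrapy-poet README):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy_poet.InjectionMiddleware": 543,
}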

class scrapy_poet.downloadermiddlewares.InjectionMiddleware(crawler: Crawler)[source]

This is a Downloader Middleware that’s supposed to:

  • check if request downloads could be skipped

  • inject dependencies before request callbacks are executed

__init__(crawler: Crawler) → None[source]

Initialize the middleware.

process_request(request: Request, spider: Spider) → DummyResponse | None[source]

This method checks whether the request is really needed and whether its download could be skipped, by trying to infer whether a scrapy.http.Response is going to be used by the callback or by a Page Input.

If the scrapy.http.Response can be ignored, a DummyResponse instance is returned in its place. This DummyResponse is linked to the original scrapy.Request instance.

With this behavior, we’re able to optimize spider execution by avoiding unnecessary downloads. That could be the case when the callback actually uses another source, such as an external API like Zyte’s AutoExtract.

process_response(request: Request, response: Response, spider: Spider) → Generator[Deferred, object, Response][source]

This method fills scrapy.Request.cb_kwargs with instances for the required Page Objects found in the callback signature.

In other words, this method instantiates all web_poet.Injectable subclasses declared as request callback arguments, as well as any other parameter with a PageObjectInputProvider configured for its type.

Page Input Providers

The Injection Middleware needs a standard way to build the Page Input dependencies that Page Objects use to get external data (e.g. the HTML). That’s why we have created a collection of Page Object Input Providers.

The current module implements a Page Input Provider for web_poet.HttpResponse, which is in charge of providing the response HTML from Scrapy. You could also implement different providers to acquire data from other external sources, for example from scrapy-playwright or from an API for automatic extraction.

class scrapy_poet.page_input_providers.HttpClientProvider(injector)[source]

This class provides web_poet.HttpClient instances.

__call__(to_provide: Set[Callable], crawler: Crawler)[source]

Creates a web_poet.HttpClient instance using Scrapy’s downloader.

class scrapy_poet.page_input_providers.HttpRequestProvider(injector)[source]

This class provides web_poet.HttpRequest instances.

__call__(to_provide: Set[Callable], request: Request)[source]

Builds a web_poet.HttpRequest instance using a scrapy.http.Request instance.

class scrapy_poet.page_input_providers.HttpResponseProvider(injector)[source]

This class provides web_poet.HttpResponse instances.

__call__(to_provide: Set[Callable], response: Response)[source]

Builds a web_poet.HttpResponse instance using a scrapy.http.Response instance.

class scrapy_poet.page_input_providers.ItemProvider(injector)[source]
async __call__(to_provide: Set[Callable], request: Request, response: Response) → List[Any][source]

Call self as a function.

__init__(injector)[source]

Initializes the provider. Invoked only at spider startup.

class scrapy_poet.page_input_providers.PageObjectInputProvider(injector)[source]

This is the base class for creating Page Object Input Providers.

A Page Object Input Provider (POIP) takes responsibility for providing instances of some types to Scrapy callbacks. The types a POIP provides must be declared in the class attribute provided_classes.

POIPs are initialized when the spider starts by invoking the __init__ method, which receives the scrapy_poet.injection.Injector instance as an argument.

The __call__ method must be overridden; it is inside this method that the actual instances must be built. The default __call__ signature is as follows:

def __call__(self, to_provide: Set[Callable]) -> Sequence[Any]:

Therefore, it receives a list of types to be provided and returns a list with the created instances (don’t be confused by the Callable annotation: think of it as a synonym of Type).

Additional dependencies can be declared in the __call__ signature and will be automatically injected. For example, the providers in this module receive scrapy.Request, scrapy.http.Response, and scrapy.crawler.Crawler instances this way.

Finally, the __call__ method can execute asynchronous code: either declare it with async def to use coroutines, or decorate it with @inlineCallbacks for deferred execution. Additionally, you might want to configure Scrapy’s TWISTED_REACTOR setting to support asyncio libraries.
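A minimal sketch of an asynchronous provider (the Token class, its value, and the provider name are purely illustrative):

from typing import Callable, Set

from scrapy_poet.page_input_providers import PageObjectInputProvider


class Token(str):
    """A hypothetical Page Input."""


class TokenProvider(PageObjectInputProvider):
    provided_classes = {Token}

    async def __call__(self, to_provide: Set[Callable]):
        # Asynchronous work (e.g. awaiting an HTTP request) can happen here.
        return [Token("example-token")]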

The available POIPs should be declared in the spider settings using the key SCRAPY_POET_PROVIDERS. It must be a dictionary that follows the same structure as the Scrapy middlewares configuration dictionaries.

A simple example of a provider:

from scrapy.http import Response

from scrapy_poet.page_input_providers import PageObjectInputProvider


class BodyHtml(str): pass

class BodyHtmlProvider(PageObjectInputProvider):
    provided_classes = {BodyHtml}

    def __call__(self, to_provide, response: Response):
        return [BodyHtml(response.css("html body").get())]

The provided_classes class attribute is the set of classes that this provider provides. Alternatively, it can be a function with type Callable[[Callable], bool] that returns True if and only if the given type, which must be callable, is provided by this provider.
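To enable the BodyHtmlProvider example above, declare it in the SCRAPY_POET_PROVIDERS setting; both the module path and the priority value below are illustrative:

# settings.py
from myproject.providers import BodyHtmlProvider  # hypothetical module path

SCRAPY_POET_PROVIDERS = {
    BodyHtmlProvider: 500,
}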

__init__(injector)[source]

Initializes the provider. Invoked only at spider startup.

is_provided(type_: Callable) → bool[source]

Return True if the given type is provided by this provider, based on the value of the provided_classes attribute.

class scrapy_poet.page_input_providers.PageParamsProvider(injector)[source]

This class provides web_poet.PageParams instances.

__call__(to_provide: Set[Callable], request: Request)[source]

Creates a web_poet.PageParams instance based on the data found in the meta["page_params"] field of the scrapy.Request instance.
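For example, a spider can pass Page Object parameters through the request meta like this (a sketch; the parameter name and the parse_book callback are assumptions):

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        # The "page_params" meta value becomes the web_poet.PageParams input.
        yield scrapy.Request(
            "http://books.toscrape.com/",
            callback=self.parse_book,
            meta={"page_params": {"currency": "USD"}},
        )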

class scrapy_poet.page_input_providers.RequestUrlProvider(injector)[source]

This class provides web_poet.RequestUrl instances.

__call__(to_provide: Set[Callable], request: Request)[source]

Builds a web_poet.RequestUrl instance using a scrapy.Request instance.

class scrapy_poet.page_input_providers.ResponseUrlProvider(injector)[source]
__call__(to_provide: Set[Callable], response: Response)[source]

Builds a web_poet.ResponseUrl instance using a scrapy.http.Response instance.

class scrapy_poet.page_input_providers.ScrapyPoetStatCollector(stats)[source]
__init__(stats)[source]
inc(key: str, value: int | float = 1) → None[source]

Increment the value of stat key by value, or set it to value if key has no value.

set(key: str, value: Any) → None[source]

Set the value of stat key to value.

class scrapy_poet.page_input_providers.StatsProvider(injector)[source]

This class provides web_poet.Stats instances.

__call__(to_provide: Set[Callable], crawler: Crawler)[source]

Creates a web_poet.Stats instance using Scrapy’s stats collector.
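A sketch of a Page Object consuming this input; it assumes web_poet.Stats exposes the same inc()/set() methods as ScrapyPoetStatCollector above (field and stat names are illustrative):

import attrs
from web_poet import ItemPage, Stats


@attrs.define
class BookPage(ItemPage):
    stats: Stats

    def to_item(self) -> dict:
        # Increment a custom stat each time an item is extracted.
        self.stats.inc("book_pages/parsed")
        return {}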

Cache

class scrapy_poet.cache.SerializedDataCache(directory: str | PathLike)[source]

Stores dependencies from Providers in a persistent local storage using web_poet.serialization.SerializedDataFileStorage.

__init__(directory: str | PathLike) → None[source]

Injection

class scrapy_poet.injection.Injector(crawler: Crawler, *, default_providers: Mapping | None = None, registry: RulesRegistry | None = None)[source]

Keep all the logic required to do dependency injection in Scrapy callbacks. Initializes the providers from the spider settings at initialization.

__init__(crawler: Crawler, *, default_providers: Mapping | None = None, registry: RulesRegistry | None = None)[source]
build_callback_dependencies(request: Request, response: Response)[source]

Scan the configured callback for this request looking for the dependencies and build the corresponding instances. Return a kwargs dictionary with the built instances.

build_instances(request: Request, response: Response, plan: Plan)[source]

Build the instances dict from a plan including external dependencies.

build_instances_from_providers(request: Request, response: Response, plan: Plan)[source]

Build dependencies handled by registered providers.

build_plan(request: Request) → Plan[source]

Create a plan for building the dependencies required by the callback.

discover_callback_providers(request: Request) → Set[PageObjectInputProvider][source]

Discover the providers that are required to fulfil the callback dependencies.

is_scrapy_response_required(request: Request)[source]

Check whether the Response to Scrapy’s Request is going to be used.

scrapy_poet.injection.get_callback(request, spider)[source]

Get the scrapy.Request.callback of a scrapy.Request.

scrapy_poet.injection.get_injector_for_testing(providers: Mapping, additional_settings: Dict | None = None, registry: RulesRegistry | None = None) → Injector[source]

Return an Injector using a fake crawler. Useful for testing providers.

scrapy_poet.injection.get_response_for_testing(callback: Callable) → Response[source]

Return a scrapy.http.Response with fake content, with the given callback configured on its request. Useful for testing providers.
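A sketch combining both helpers to test a provider, assuming the BodyHtmlProvider example from the previous section; the priority value and assertion are illustrative, and build_callback_dependencies is assumed to return a Deferred (hence @inlineCallbacks):

from twisted.internet.defer import inlineCallbacks

from scrapy_poet import DummyResponse
from scrapy_poet.injection import (
    get_injector_for_testing,
    get_response_for_testing,
)


def callback(response: DummyResponse, body_html: BodyHtml):
    pass


@inlineCallbacks
def test_body_html_provider():
    injector = get_injector_for_testing({BodyHtmlProvider: 500})
    response = get_response_for_testing(callback)
    kwargs = yield injector.build_callback_dependencies(response.request, response)
    assert "body_html" in kwargs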

scrapy_poet.injection.is_callback_requiring_scrapy_response(callback: Callable, raw_callback: Any = <object object>) → bool[source]

Check whether the request’s callback method requires the response. Basically, it won’t be required if the response argument in the callback is annotated with DummyResponse.

scrapy_poet.injection.is_class_provided_by_any_provider_fn(providers: List[PageObjectInputProvider]) → Callable[[Callable], bool][source]

Return a function of type Callable[[Type], bool] that returns True if the given type is provided by any of the registered providers.

The is_provided method from each provider is used.

scrapy_poet.injection.is_provider_requiring_scrapy_response(provider)[source]

Check whether an injectable provider makes use of a valid scrapy.http.Response.

Injection errors

exception scrapy_poet.injection_errors.InjectionError[source]
exception scrapy_poet.injection_errors.MalformedProvidedClassesError[source]
exception scrapy_poet.injection_errors.NonCallableProviderError[source]
exception scrapy_poet.injection_errors.ProviderDependencyDeadlockError[source]

This is raised when it’s not possible to create the dependencies due to a deadlock.

For example, as sketched below:
  • A Page Object named “ChickenPage” requires “EggPage” as a dependency.

  • A Page Object named “EggPage” requires “ChickenPage” as a dependency.
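A minimal sketch of such a cycle using web_poet-style dependency declarations (the field names are illustrative):

import attrs
from web_poet import ItemPage


@attrs.define
class ChickenPage(ItemPage):
    egg_page: "EggPage"  # ChickenPage requires EggPage...


@attrs.define
class EggPage(ItemPage):
    chicken_page: ChickenPage  # ...and EggPage requires ChickenPage back.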

exception scrapy_poet.injection_errors.UndeclaredProvidedTypeError[source]