API Reference
API
- scrapy_poet.callback_for(page_or_item_cls: type) -> Callable
Create a callback for a web_poet.ItemPage subclass or an item class.
The generated callback returns the output of the to_item method, i.e. it extracts a single item from a web page using a Page Object.
This helper reduces the boilerplate when working with Page Objects. For example, instead of this:
    class BooksSpider(scrapy.Spider):
        name = "books"
        start_urls = ["http://books.toscrape.com/"]

        def parse(self, response):
            links = response.css(".image_container a")
            yield from response.follow_all(links, self.parse_book)

        def parse_book(self, response: DummyResponse, page: BookPage):
            return page.to_item()
you can write this:
    class BooksSpider(scrapy.Spider):
        name = "books"
        start_urls = ["http://books.toscrape.com/"]

        def parse(self, response):
            links = response.css(".image_container a")
            yield from response.follow_all(links, self.parse_book)

        parse_book = callback_for(BookPage)
It also supports producing an async generator callable if the Page Object's to_item method is a coroutine that uses the async/await syntax.
So if we have the following:
    class BooksSpider(scrapy.Spider):
        name = "books"
        start_urls = ["http://books.toscrape.com/"]

        def parse(self, response):
            links = response.css(".image_container a")
            yield from response.follow_all(links, self.parse_book)

        async def parse_book(self, response: DummyResponse, page: BookPage):
            yield await page.to_item()
It could be turned into:
    class BooksSpider(scrapy.Spider):
        name = "books"
        start_urls = ["http://books.toscrape.com/"]

        def parse(self, response):
            links = response.css(".image_container a")
            yield from response.follow_all(links, self.parse_book)

        parse_book = callback_for(BookPage)
The generated callback could be used as a spider instance method or passed as an inline/anonymous argument. Make sure to define it as a spider attribute (as shown in the example above) if you’re planning to use disk queues, because in this case Scrapy is able to serialize your request object.
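One way to picture what such a helper does is the standard-library sketch below. It is not scrapy-poet's implementation: `make_callback`, `SyncBookPage` and `AsyncBookPage` are hypothetical names, and the only assumption is that a factory can inspect `to_item` and return either a plain callback or an async generator callback.

```python
import asyncio
import inspect


def make_callback(page_cls):
    """Hypothetical stand-in for a callback factory (not scrapy-poet code)."""
    if inspect.iscoroutinefunction(page_cls.to_item):
        # Coroutine ``to_item``: build an async generator callback.
        async def async_parse(self, response, page):
            yield await page.to_item()

        # Record the page type so an injector could discover it.
        async_parse.__annotations__["page"] = page_cls
        return async_parse

    # Regular ``to_item``: the callback returns its output directly.
    def parse(self, response, page):
        return page.to_item()

    parse.__annotations__["page"] = page_cls
    return parse


class SyncBookPage:
    def to_item(self):
        return {"title": "sync"}


class AsyncBookPage:
    async def to_item(self):
        return {"title": "async"}


async def collect(agen):
    # Drain an async generator into a list.
    return [item async for item in agen]


sync_cb = make_callback(SyncBookPage)
async_cb = make_callback(AsyncBookPage)

# ``self`` and ``response`` are unused by the sketch, so None is passed.
sync_item = sync_cb(None, None, SyncBookPage())
async_items = asyncio.run(collect(async_cb(None, None, AsyncBookPage())))
```

The sketch sets the `page` annotation on the generated function because dependency injection works from callback annotations; how the real helper achieves this may differ.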
- class scrapy_poet.DummyResponse(*args: Any, **kwargs: Any)
This class is returned by the InjectionMiddleware when it detects that the download could be skipped. It inherits from scrapy.http.Response, acts as a signal that the download was skipped, stores the URL, and references the original scrapy.Request.
If you want to skip downloads, you can annotate the response argument of your parse method with this class.
    def parse(self, response: DummyResponse):
        pass
If there’s no Page Input that depends on a scrapy.http.Response, the InjectionMiddleware is going to skip the download and provide a DummyResponse to your parser instead.
Scrapy components
- class scrapy_poet.InjectionMiddleware(crawler: Crawler)
This is a Downloader Middleware that’s supposed to:
check if request downloads could be skipped
inject dependencies before request callbacks are executed
- process_request(request: Request, spider: Spider | None = None) -> DummyResponse | None
This method checks whether the request is really needed and whether its download could be skipped, by trying to infer if a scrapy.http.Response is going to be used by the callback or a Page Input.
If the scrapy.http.Response can be ignored, a DummyResponse instance is returned in its place. This DummyResponse is linked to the original scrapy.Request instance.
With this behavior, we’re able to optimize spider executions by avoiding unnecessary downloads. That could be the case when the callback is actually using another source, such as an external API like Zyte API.
- async process_response(request: Request, response: Response, spider: Spider | None = None) -> Response | Request
This method fills scrapy.Request.cb_kwargs with instances for the required Page Objects found in the callback signature.
In other words, this method instantiates all web_poet.Injectable subclasses declared as request callback arguments and any other parameter with a PageObjectInputProvider configured for its type.
- class scrapy_poet.RetryMiddleware
Captures web_poet.exceptions.Retry exceptions from spider callbacks, and retries the source request.
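The capture-and-retry idea can be sketched with the standard library alone. This is only an analogy: the `Retry` class here is a local stand-in for web_poet.exceptions.Retry, `run_with_retries` and its `max_retries` limit are hypothetical, and the real middleware reschedules a Scrapy request rather than calling a function in a loop.

```python
class Retry(Exception):
    """Local stand-in for web_poet.exceptions.Retry (assumption)."""


def run_with_retries(callback, max_retries=2):
    """Call ``callback`` again each time it raises Retry, up to a limit."""
    attempts = 0
    while True:
        try:
            return callback()
        except Retry:
            attempts += 1
            if attempts > max_retries:
                # Give up and let the error propagate.
                raise


calls = {"count": 0}


def flaky_callback():
    # Fails on the first two calls, succeeds on the third.
    calls["count"] += 1
    if calls["count"] < 3:
        raise Retry("page content looked incomplete")
    return {"title": "A Light in the Attic"}


item = run_with_retries(flaky_callback)
```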
- class scrapy_poet.ScrapyPoetRequestFingerprinter(crawler: Crawler)
Page input providers
The Injection Middleware needs a standard way to build the Page Input dependencies that Page Objects use to get external data (e.g. the HTML). That’s why we have created a collection of Page Object Input Providers.
The current module implements a Page Input Provider for
web_poet.HttpResponse, which
is in charge of providing the response HTML from Scrapy. You could also implement
different providers in order to acquire data from multiple external sources,
for example, from scrapy-playwright or from scrapy-zyte-api.
- class scrapy_poet.page_input_providers.HttpClientProvider(injector)
This class provides web_poet.HttpClient instances.
- class scrapy_poet.page_input_providers.HttpRequestProvider(injector)
This class provides web_poet.HttpRequest instances.
- class scrapy_poet.page_input_providers.HttpResponseProvider(injector)
This class provides web_poet.HttpResponse instances.
- __call__(to_provide: Set[Callable], response: Response)
Builds a web_poet.HttpResponse instance using a scrapy.http.Response instance.
- class scrapy_poet.page_input_providers.PageObjectInputProvider(injector)
This is the base class for creating Page Object Input Providers.
A Page Object Input Provider (POIP) takes responsibility for providing instances of some types to Scrapy callbacks. The types a POIP provides must be declared in the class attribute provided_classes.
POIPs are initialized when the spider starts by invoking the __init__ method, which receives the scrapy_poet.injection.Injector instance as an argument.
The __call__ method must be overridden, and it is inside this method that the actual instances must be built. The default __call__ signature is as follows:
    def __call__(self, to_provide: Set[Callable]) -> Sequence[Any]: ...
Therefore, it receives the set of types to be provided and returns a sequence with the instances created (don’t be confused by the Callable annotation; think of it as a synonym of Type).
Additional dependencies can be declared in the __call__ signature, and they will be injected automatically. Currently, scrapy-poet is able to inject instances of the following classes:
Request
Finally, the __call__ function can execute asynchronous code. Just prepend the declaration with async.
The available POIPs should be declared in the spider settings using the key SCRAPY_POET_PROVIDERS. It must be a dictionary that follows the same structure as the Scrapy Middlewares configuration dictionaries.
A simple example of a provider:
    class BodyHtml(str):
        pass


    class BodyHtmlProvider(PageObjectInputProvider):
        provided_classes = {BodyHtml}

        def __call__(self, to_provide, response: Response):
            return [BodyHtml(response.css("html body").get())]
The provided_classes class attribute is the set of classes that this provider provides. Alternatively, it can be a function with type Callable[[Callable], bool] that returns True if and only if the given type, which must be callable, is provided by this provider.
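The two accepted forms of provided_classes can be illustrated with a small standalone model. The sketch below does not import scrapy-poet: `ProviderSketch`, `SetProvider`, `PredicateProvider` and `provides_str_subclasses` are hypothetical names, and the `is_provided` class method here is an assumption about how a membership check over both forms could look.

```python
from typing import Callable, Set, Union


class ProviderSketch:
    """Hypothetical stand-in; not the real PageObjectInputProvider."""

    provided_classes: Union[Set[type], Callable[[type], bool]] = set()

    @classmethod
    def is_provided(cls, type_: type) -> bool:
        provided = cls.provided_classes
        if isinstance(provided, (set, frozenset)):
            # Set form: plain membership test.
            return type_ in provided
        if callable(provided):
            # Function form: delegate the decision to the predicate.
            return bool(provided(type_))
        raise TypeError("provided_classes must be a set or a callable")


class BodyHtml(str):
    pass


class SetProvider(ProviderSketch):
    provided_classes = {BodyHtml}


def provides_str_subclasses(type_) -> bool:
    # Accept any class derived from str.
    return isinstance(type_, type) and issubclass(type_, str)


class PredicateProvider(ProviderSketch):
    provided_classes = provides_str_subclasses


set_result = SetProvider.is_provided(BodyHtml)
predicate_result = PredicateProvider.is_provided(BodyHtml)
```

The function form is useful when the set of supported types is open-ended, for example every subclass of some base class.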
- class scrapy_poet.page_input_providers.PageParamsProvider(injector)
This class provides web_poet.PageParams instances.
- __call__(to_provide: Set[Callable], request: Request)
Creates a web_poet.PageParams instance based on the data found in the meta["page_params"] field of a scrapy.Request instance.
- class scrapy_poet.page_input_providers.RequestUrlProvider(injector)
This class provides web_poet.RequestUrl instances.
- class scrapy_poet.page_input_providers.ScrapyPoetStatCollector(stats)
- class scrapy_poet.page_input_providers.StatsProvider(injector)
This class provides web_poet.Stats instances.
Cache
- class scrapy_poet.cache.SerializedDataCache(directory: str | os.PathLike)
Stores dependencies from Providers in persistent local storage using web_poet.serialization.SerializedDataFileStorage.
- __init__(directory: str | os.PathLike) -> None
Injection
- class scrapy_poet.injection.DynamicDeps
A container for dynamic dependencies provided via the "inject" request meta key.
The dynamic dependency instances are available at run time as dict values, with the dependency types as keys.
- class scrapy_poet.injection.Injector(crawler: Crawler, *, default_providers: Mapping | None = None, registry: RulesRegistry | None = None)
Keeps all the logic required to do dependency injection in Scrapy callbacks. The providers are initialized from the spider settings when the injector is created.
- __init__(crawler: Crawler, *, default_providers: Mapping | None = None, registry: RulesRegistry | None = None)
- async build_callback_dependencies(request: Request, response: Response)
Scan the configured callback for this request looking for the dependencies and build the corresponding instances. Return a kwargs dictionary with the built instances.
- async build_instances(request: Request, response: Response, plan: Plan)
Build the instances dict from a plan, including external dependencies.
- async build_instances_from_providers(request: Request, response: Response, plan: Plan)
Build the dependencies handled by registered providers.
- build_plan(request: Request) -> Plan
Create a plan for building the dependencies required by the callback.
- discover_callback_providers(request: Request) -> set[PageObjectInputProvider]
Discover the providers that are required to fulfil the callback dependencies.
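The general idea of scanning a callback signature and returning a kwargs dictionary can be sketched as follows. This is a simplified model built on the standard library, not the Injector's actual code: `build_kwargs`, `PageUrl`, `PageTitle` and the `builders` mapping are hypothetical, and real providers, plans and nested dependencies are left out.

```python
import inspect
from typing import Any, Callable, Dict, get_type_hints


class PageUrl(str):
    """Hypothetical dependency type."""


class PageTitle(str):
    """Hypothetical dependency type."""


def build_kwargs(callback: Callable, builders: Dict[type, Callable[[], Any]]) -> Dict[str, Any]:
    """Build one instance for every annotated parameter that has a builder."""
    hints = get_type_hints(callback)
    kwargs = {}
    for name in inspect.signature(callback).parameters:
        annotation = hints.get(name)
        if annotation in builders:
            kwargs[name] = builders[annotation]()
    return kwargs


def parse_book(response, url: PageUrl, title: PageTitle, note="unused"):
    return {"url": url, "title": title}


builders = {
    PageUrl: lambda: PageUrl("http://books.toscrape.com/"),
    PageTitle: lambda: PageTitle("Books"),
}

kwargs = build_kwargs(parse_book, builders)
# Parameters without a matching builder (response, note) are left alone.
result = parse_book(None, **kwargs)
```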
- scrapy_poet.injection.get_callback(request, spider)
Get the scrapy.Request.callback of a scrapy.Request.
- scrapy_poet.injection.get_injector_for_testing(providers: Mapping, additional_settings: dict | None = None, registry: RulesRegistry | None = None) -> Injector
Return an Injector using a fake crawler. Useful for testing providers.
- scrapy_poet.injection.get_response_for_testing(callback: Callable, meta: dict[str, Any] | None = None) -> Response
Return a scrapy.http.Response with fake content and the configured callback. It is useful for testing providers.
- scrapy_poet.injection.is_callback_requiring_scrapy_response(callback: Callable, raw_callback: Any = <object object>) -> bool
Check whether the request’s callback method requires the response. Basically, it won’t be required if the response argument in the callback is annotated with DummyResponse.
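The annotation check described above can be approximated with a few lines of standard-library code. The sketch is an assumption-laden model, not the library function: `Response` and `DummyResponse` are local stand-in classes, `requires_response` is a hypothetical name, and the response is assumed to be the first parameter of the callback.

```python
import inspect
from typing import Callable, get_type_hints


class Response:
    """Local stand-in for scrapy.http.Response."""


class DummyResponse(Response):
    """Local stand-in for scrapy_poet.DummyResponse."""


def requires_response(callback: Callable) -> bool:
    """Return False only when the first parameter is annotated as DummyResponse."""
    params = list(inspect.signature(callback).parameters)
    if not params:
        # No response parameter to inspect: stay conservative in this sketch.
        return True
    annotation = get_type_hints(callback).get(params[0])
    is_dummy = isinstance(annotation, type) and issubclass(annotation, DummyResponse)
    return not is_dummy


def parse_full(response: Response):
    return response


def parse_light(response: DummyResponse, page: object):
    return page


def parse_untyped(response):
    return response


needs_full = requires_response(parse_full)
needs_light = requires_response(parse_light)
needs_untyped = requires_response(parse_untyped)
```

An unannotated response argument counts as required here, which matches the rule that only an explicit DummyResponse annotation opts out of the download.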
- scrapy_poet.injection.is_class_provided_by_any_provider_fn(providers: list[PageObjectInputProvider]) -> Callable[[Callable], bool]
Return a function of type Callable[[Type], bool] that returns True if the given type is provided by any of the registered providers.
The is_provided method from each provider is used.
- scrapy_poet.injection.is_provider_requiring_scrapy_response(provider)
Check whether the injectable provider makes use of a valid scrapy.http.Response.
Injection errors
- exception scrapy_poet.injection_errors.ProviderDependencyDeadlockError
This is raised when it’s not possible to create the dependencies due to a deadlock.
For example:
Page object named “ChickenPage” requires “EggPage” as a dependency.
Page object named “EggPage” requires “ChickenPage” as a dependency.
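The ChickenPage/EggPage situation is a cycle in the dependency graph. The sketch below shows one generic way to detect such a cycle with a depth-first search. It does not reflect how scrapy-poet detects deadlocks internally; `find_cycle` and the graph-as-dict representation are assumptions made for illustration.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, finished


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if acyclic."""
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # ``dep`` is already on the current path: cycle found.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


deadlocked = {"ChickenPage": ["EggPage"], "EggPage": ["ChickenPage"]}
healthy = {"BookPage": ["HttpResponse"], "HttpResponse": []}

cycle = find_cycle(deadlocked)
no_cycle = find_cycle(healthy)
```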