API Reference
API
- scrapy_poet.callback_for(page_or_item_cls: Type) → Callable [source]
  Create a callback for a web_poet.ItemPage subclass or an item class.

  The generated callback returns the output of the to_item method, i.e. it extracts a single item from a web page using a Page Object.

  This helper reduces the boilerplate of working with Page Objects. For example, instead of this:
```python
class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        links = response.css(".image_container a")
        yield from response.follow_all(links, self.parse_book)

    def parse_book(self, response: DummyResponse, page: BookPage):
        return page.to_item()
```
It allows you to write this:
```python
class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        links = response.css(".image_container a")
        yield from response.follow_all(links, self.parse_book)

    parse_book = callback_for(BookPage)
```
  It also supports producing an async generator callable if the Page Object's to_item method is a coroutine that uses the async/await syntax. So if we have the following:
```python
class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        links = response.css(".image_container a")
        yield from response.follow_all(links, self.parse_book)

    async def parse_book(self, response: DummyResponse, page: BookPage):
        yield await page.to_item()
```
It could be turned into:
```python
class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        links = response.css(".image_container a")
        yield from response.follow_all(links, self.parse_book)

    parse_book = callback_for(BookPage)
```
  The generated callback can be used as a spider instance method or passed as an inline/anonymous argument. Make sure to define it as a spider attribute (as shown in the example above) if you plan to use disk queues, because in that case Scrapy must be able to serialize your request objects.
- class scrapy_poet.DummyResponse(*args: Any, **kwargs: Any)[source]
  This class is returned by the InjectionMiddleware when it detects that the download could be skipped. It inherits from scrapy.http.Response, stores the URL, and keeps a reference to the original scrapy.Request.

  If you want to skip downloads, you can annotate the response argument of your parse method with this class:
```python
def parse(self, response: DummyResponse):
    pass
```
  If there is no Page Input that depends on a scrapy.http.Response, the InjectionMiddleware skips the download and provides a DummyResponse to your parser instead.
Injection Middleware
An important part of scrapy-poet is the Injection Middleware. It’s responsible for injecting Page Input dependencies before the request callbacks are executed.
- class scrapy_poet.downloadermiddlewares.InjectionMiddleware(crawler: Crawler)[source]
  This is a Downloader Middleware that is supposed to:

  - check if request downloads could be skipped
  - inject dependencies before request callbacks are executed
- process_request(request: Request, spider: Spider) → DummyResponse | None [source]
  This method checks whether the request is really needed and whether its download could be skipped, by trying to infer whether a scrapy.http.Response is going to be used by the callback or a Page Input.

  If the scrapy.http.Response can be ignored, a DummyResponse instance is returned in its place. This DummyResponse is linked to the original scrapy.Request instance.

  With this behavior, we're able to optimize spider executions by avoiding unnecessary downloads. That can be the case when the callback actually uses another source, such as external APIs like Zyte's AutoExtract.
- process_response(request: Request, response: Response, spider: Spider) → Generator[Deferred, object, Response | Request] [source]
  This method fills scrapy.Request.cb_kwargs with instances of the Page Objects required by the callback signature.

  In other words, this method instantiates all web_poet.Injectable subclasses declared as request callback arguments, as well as any other parameter with a PageObjectInputProvider configured for its type.
Page Input Providers
The Injection Middleware needs a standard way to build the Page Input dependencies that Page Objects use to get external data (e.g. the HTML). That's why we have created a collection of Page Object Input Providers.
The current module implements a Page Input Provider for web_poet.HttpResponse, which is in charge of providing the response HTML from Scrapy. You could also implement different providers to acquire data from other external sources, for example from scrapy-playwright or from an API for automatic extraction.
- class scrapy_poet.page_input_providers.HttpClientProvider(injector)[source]
  This class provides web_poet.HttpClient instances.
- class scrapy_poet.page_input_providers.HttpRequestProvider(injector)[source]
  This class provides web_poet.HttpRequest instances.

  - __call__(to_provide: Set[Callable], request: Request)[source]
    Builds a web_poet.HttpRequest instance using a scrapy.http.Request instance.
- class scrapy_poet.page_input_providers.HttpResponseProvider(injector)[source]
  This class provides web_poet.HttpResponse instances.

  - __call__(to_provide: Set[Callable], response: Response)[source]
    Builds a web_poet.HttpResponse instance using a scrapy.http.Response instance.
- class scrapy_poet.page_input_providers.ItemProvider(injector)[source]
- class scrapy_poet.page_input_providers.PageObjectInputProvider(injector)[source]
  This is the base class for creating Page Object Input Providers.

  A Page Object Input Provider (POIP) takes responsibility for providing instances of some types to Scrapy callbacks. The types a POIP provides must be declared in the class attribute provided_classes.

  POIPs are initialized when the spider starts by invoking the __init__ method, which receives the scrapy_poet.injection.Injector instance as an argument.

  The __call__ method must be overridden, and it is inside this method where the actual instances must be built. The default __call__ signature is as follows:

  ```python
  def __call__(self, to_provide: Set[Callable]) -> Sequence[Any]:
  ```

  Therefore, it receives a list of types to be provided and returns a list with the created instances (don't be confused by the Callable annotation: think of it as a synonym of Type).

  Additional dependencies can be declared in the __call__ signature; they will be automatically injected, as long as scrapy-poet knows how to build instances of their classes.

  Finally, the __call__ function can execute asynchronous code: either prepend its declaration with async to use futures, or annotate it with @inlineCallbacks for deferred execution. Additionally, you might want to configure the Scrapy TWISTED_REACTOR setting to support asyncio libraries.

  The available POIPs should be declared in the spider settings using the key SCRAPY_POET_PROVIDERS. It must be a dictionary that follows the same structure as the Scrapy middleware configuration dictionaries.

  A simple example of a provider:
```python
class BodyHtml(str):
    pass


class BodyHtmlProvider(PageObjectInputProvider):
    provided_classes = {BodyHtml}

    def __call__(self, to_provide, response: Response):
        return [BodyHtml(response.css("html body").get())]
```
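A provider like this must then be registered through the SCRAPY_POET_PROVIDERS setting mentioned above. A minimal sketch, assuming the provider lives in a hypothetical myproject.providers module:

```python
# settings.py
from myproject.providers import BodyHtmlProvider  # hypothetical module path

# The value works like a Scrapy middleware order number.
SCRAPY_POET_PROVIDERS = {
    BodyHtmlProvider: 500,
}
```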
  The provided_classes class attribute is the set of classes that this provider provides. Alternatively, it can be a function of type Callable[[Callable], bool] that returns True if and only if the given type, which must be callable, is provided by this provider.
- class scrapy_poet.page_input_providers.PageParamsProvider(injector)[source]
  This class provides web_poet.PageParams instances.

  - __call__(to_provide: Set[Callable], request: Request)[source]
    Creates a web_poet.PageParams instance based on the data found in the meta["page_params"] field of a scrapy.http.Request instance.
- class scrapy_poet.page_input_providers.RequestUrlProvider(injector)[source]
  This class provides web_poet.RequestUrl instances.

  - __call__(to_provide: Set[Callable], request: Request)[source]
    Builds a web_poet.RequestUrl instance using a scrapy.Request instance.
- class scrapy_poet.page_input_providers.ResponseUrlProvider(injector)[source]
  - __call__(to_provide: Set[Callable], response: Response)[source]
    Builds a web_poet.ResponseUrl instance using a scrapy.http.Response instance.
- class scrapy_poet.page_input_providers.ScrapyPoetStatCollector(stats)[source]
- class scrapy_poet.page_input_providers.StatsProvider(injector)[source]
  This class provides web_poet.Stats instances.
Cache
Injection
- class scrapy_poet.injection.DynamicDeps[source]
  A container for dynamic dependencies provided via the "inject" request meta key.

  The dynamic dependency instances are available at run time as dict values, with the keys being the dependency types.
- class scrapy_poet.injection.Injector(crawler: Crawler, *, default_providers: Mapping | None = None, registry: RulesRegistry | None = None)[source]
  Keeps all the logic required to perform dependency injection in Scrapy callbacks. Initializes the providers from the spider settings.
- __init__(crawler: Crawler, *, default_providers: Mapping | None = None, registry: RulesRegistry | None = None)[source]
- build_callback_dependencies(request: Request, response: Response)[source]
Scan the configured callback for this request looking for the dependencies and build the corresponding instances. Return a kwargs dictionary with the built instances.
- build_instances(request: Request, response: Response, plan: Plan)[source]
  Build the instances dict from a plan, including external dependencies.
- build_instances_from_providers(request: Request, response: Response, plan: Plan)[source]
  Build the dependencies handled by the registered providers.
- build_plan(request: Request) → Plan [source]
  Create a plan for building the dependencies required by the callback.
- discover_callback_providers(request: Request) → Set[PageObjectInputProvider] [source]
  Discover the providers that are required to fulfil the callback dependencies.
- scrapy_poet.injection.get_callback(request, spider)[source]
  Get the scrapy.Request.callback of a scrapy.Request.
- scrapy_poet.injection.get_injector_for_testing(providers: Mapping, additional_settings: Dict | None = None, registry: RulesRegistry | None = None) → Injector [source]
  Return an Injector using a fake crawler. Useful for testing providers.
- scrapy_poet.injection.get_response_for_testing(callback: Callable, meta: Dict[str, Any] | None = None) → Response [source]
  Return a scrapy.http.Response with fake content, with the given callback configured on its request. It is useful for testing providers.
- scrapy_poet.injection.is_callback_requiring_scrapy_response(callback: ~typing.Callable, raw_callback: ~typing.Any = <object object>) → bool [source]
  Check whether the request's callback method requires the response. Basically, it won't be required if the response argument of the callback is annotated with DummyResponse.
- scrapy_poet.injection.is_class_provided_by_any_provider_fn(providers: List[PageObjectInputProvider]) → Callable[[Callable], bool] [source]
  Return a function of type Callable[[Type], bool] that returns True if the given type is provided by any of the registered providers. The is_provided method from each provider is used.
- scrapy_poet.injection.is_provider_requiring_scrapy_response(provider)[source]
  Check whether an injectable provider makes use of a valid scrapy.http.Response.
Injection errors
- exception scrapy_poet.injection_errors.ProviderDependencyDeadlockError[source]
  This is raised when it's not possible to create the dependencies due to a deadlock. For example:

  - A Page Object named "ChickenPage" requires "EggPage" as a dependency.
  - A Page Object named "EggPage" requires "ChickenPage" as a dependency.