Providers
Note
This document assumes a good familiarity with web-poet concepts; make sure you've read the web-poet docs.
This page is mostly aimed at developers who want to extend scrapy-poet, not at developers who are writing extraction and crawling code using scrapy-poet.
Creating providers
Providers are responsible for building dependencies needed by Injectable objects. A good example would be scrapy_poet.HttpResponseProvider, which builds and provides a web_poet.HttpResponse instance for Injectables that need it, like web_poet.WebPage.
from typing import Callable, Set

import web_poet
from scrapy.http import Response

from scrapy_poet.page_input_providers import PageObjectInputProvider


class HttpResponseProvider(PageObjectInputProvider):
    """This class provides ``web_poet.HttpResponse`` instances."""

    provided_classes = {web_poet.HttpResponse}

    def __call__(self, to_provide: Set[Callable], response: Response):
        """Build a ``web_poet.HttpResponse`` instance using a Scrapy ``Response``."""
        return [
            web_poet.HttpResponse(
                url=response.url,
                body=response.body,
                status=response.status,
                headers=web_poet.HttpResponseHeaders.from_bytes_dict(response.headers),
            )
        ]
You can implement your own providers in order to extend or override the current scrapy-poet behavior. All providers should inherit from the base class PageObjectInputProvider.
Please check the docs provided in the following API reference for more details: PageObjectInputProvider.
Cache Support in Providers
scrapy-poet also supports caching the dependencies returned by providers. For example, HttpResponseProvider supports this right off the bat. It does so by inheriting from CacheDataProviderMixin and implementing all of its abstract methods.
Extending the previous example to support caching leads to the following code:
from typing import Any, Callable, Sequence, Set

import web_poet
from scrapy.http import Request, Response

from scrapy_poet.page_input_providers import (
    CacheDataProviderMixin,
    PageObjectInputProvider,
)


class HttpResponseProvider(PageObjectInputProvider, CacheDataProviderMixin):
    """This class provides ``web_poet.HttpResponse`` instances."""

    provided_classes = {web_poet.HttpResponse}

    def __call__(self, to_provide: Set[Callable], response: Response):
        """Build a ``web_poet.HttpResponse`` instance using a Scrapy ``Response``."""
        return [
            web_poet.HttpResponse(
                url=response.url,
                body=response.body,
                status=response.status,
                headers=web_poet.HttpResponseHeaders.from_bytes_dict(response.headers),
            )
        ]

    def fingerprint(self, to_provide: Set[Callable], request: Request) -> str:
        """Return a fingerprint that identifies this specific request."""
        # Implementation here

    def serialize(self, result: Sequence[Any]) -> Any:
        """Serialize the results of this provider. The returned data will be
        pickled.
        """
        # Implementation here

    def deserialize(self, data: Any) -> Sequence[Any]:
        """Deserialize results of this provider that were previously serialized
        using the ``serialize()`` method.
        """
        # Implementation here
Take note that even if you're using providers that support the caching interface, caching is only used if the SCRAPY_POET_CACHE setting has been enabled.
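For example, a minimal sketch, assuming SCRAPY_POET_CACHE accepts a cache directory path:
# settings.py
SCRAPY_POET_CACHE = "/tmp/scrapy-poet-cache"  # illustrative path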
The caching of provided dependencies is very useful for the local development of Page Objects, as it lowers the waiting time for your responses (or any other type of external dependency, for that matter) by caching them locally.
Currently, the data is cached using a SQLite database in your local directory. This is implemented using SqlitedictCache.
The cache mechanism that scrapy-poet currently offers is quite different from Scrapy's HttpCacheMiddleware. Although they are similar in their intended purpose, scrapy-poet's cached data is directly tied to its corresponding provider, which could be anything beyond Scrapy's Responses (e.g. network database queries, API calls, AWS S3 files, etc.).
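As an illustration of such a non-HTTP provider, here is a minimal sketch of a cached provider for an API-backed dependency; ApiData and call_some_api are hypothetical names, and the fingerprint/serialize/deserialize bodies are only one possible implementation:
import json
from typing import Any, Callable, Sequence, Set

import attr
from scrapy import Request

from scrapy_poet.page_input_providers import (
    CacheDataProviderMixin,
    PageObjectInputProvider,
)


@attr.define
class ApiData:  # hypothetical dependency class
    payload: dict


class ApiDataProvider(PageObjectInputProvider, CacheDataProviderMixin):
    provided_classes = {ApiData}

    def __call__(self, to_provide: Set[Callable], request: Request):
        # call_some_api() is a placeholder for whatever external call you make
        return [ApiData(payload=call_some_api(request.url))]

    def fingerprint(self, to_provide: Set[Callable], request: Request) -> str:
        # cache per URL and per set of requested classes, so different
        # dependency combinations don't collide
        return json.dumps([request.url, sorted(c.__name__ for c in to_provide)])

    def serialize(self, result: Sequence[Any]) -> Any:
        return [attr.asdict(item) for item in result]

    def deserialize(self, data: Any) -> Sequence[Any]:
        return [ApiData(**d) for d in data]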
Note
The scrapy_poet.injection.Injector maintains a .weak_cache which stores the instances created by the providers for as long as the corresponding scrapy.Request instance exists. This means that the instances created by earlier providers can be accessed and reused by later providers. This is turned on by default and the instances are stored in memory.
Configuring providers
The list of available providers should be configured in the spider settings. For example, the following configuration should be included in the settings to enable a new provider MyProvider:
"SCRAPY_POET_PROVIDERS": {MyProvider: 500}
The number used as the value (500) defines the provider priority. See the Scrapy middlewares configuration dictionaries for more information.
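For instance, a minimal sketch of wiring this up through a spider's custom_settings (MyProvider is the hypothetical provider being enabled):
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"

    custom_settings = {
        "SCRAPY_POET_PROVIDERS": {
            MyProvider: 500,  # 500 is the provider priority
        },
    }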
Note
The providers in scrapy_poet.DEFAULT_PROVIDERS, which includes a provider for web_poet.HttpResponse, are always included by default. You can disable any of them by listing it in the configuration with the priority None.
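For example, disabling the default HttpResponseProvider while enabling a hypothetical custom provider could look like this:
"SCRAPY_POET_PROVIDERS": {
    HttpResponseProvider: None,  # disable this default provider
    MyProvider: 500,             # hypothetical custom provider
}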
Ignoring requests
Sometimes requests could be skipped, for example, when you're fetching data using a third-party API such as Auto Extract or querying a database.
In cases like that, it makes no sense to send the request to Scrapy's downloader, as it would only waste network resources. There is an alternative to avoid making such requests: you can annotate your response arguments with the DummyResponse type.
That could be done in the spider's parser method:
def parser(self, response: DummyResponse, page: MyPageObject):
    pass
A spider method whose first argument is annotated as DummyResponse signals that it is not going to use the response, so it is safe to skip downloading the Scrapy Response as usual.
This type annotation is already applied when you use the callback_for() helper: the callback created by callback_for doesn't use the Response, it just calls the page object's to_item method.
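For instance, a minimal sketch of a spider relying on callback_for (BookPage is a hypothetical page object):
import scrapy
from web_poet import WebPage

from scrapy_poet import callback_for


class BookPage(WebPage):  # hypothetical page object
    def to_item(self):
        return {"title": self.css("h1::text").get()}


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com/"]

    # the generated callback has its response annotated as DummyResponse
    # and simply calls BookPage.to_item()
    parse_book = callback_for(BookPage)

    def parse(self, response):
        for href in response.css("h3 a::attr(href)").getall():
            yield response.follow(href, self.parse_book)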
If neither the spider callback nor any of the input providers use the Response, InjectionMiddleware skips the download, returning a DummyResponse instead. For example:
from typing import Callable, List

import attr
import scrapy
from web_poet import ItemPage

from scrapy_poet import DummyResponse
from scrapy_poet.page_input_providers import PageObjectInputProvider


def get_cached_content(key: str):
    # get cached html response from db or other source
    pass


@attr.define
class CachedData:
    key: str
    value: str


class CachedDataProvider(PageObjectInputProvider):
    provided_classes = {CachedData}

    def __call__(self, to_provide: List[Callable], request: scrapy.Request):
        return [
            CachedData(
                key=request.url,
                value=get_cached_content(request.url),
            )
        ]


@attr.define
class MyPageObject(ItemPage):
    content: CachedData

    def to_item(self):
        return {
            "url": self.content.key,
            "content": self.content.value,
        }


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        yield scrapy.Request("http://books.toscrape.com/", self.parse_page)

    def parse_page(self, response: DummyResponse, page: MyPageObject):
        # the request will be IGNORED because neither the spider callback
        # nor MyPageObject appears to be making use of its response
        yield page.to_item()
However, if the spider callback is not using the Response but the page object uses it, the request is not ignored. For example:
from typing import Callable, Set

import attr
import scrapy
from scrapy.http import Response
from web_poet import ItemPage

from scrapy_poet import DummyResponse
from scrapy_poet.page_input_providers import PageObjectInputProvider


def parse_content(html: str):
    # parse content from html
    pass


@attr.define
class MyResponseData:
    url: str
    html: str


class MyResponseDataProvider(PageObjectInputProvider):
    provided_classes = {MyResponseData}

    def __call__(self, to_provide: Set[Callable], response: Response):
        return [
            MyResponseData(
                url=response.url,
                html=response.text,
            )
        ]


@attr.define
class MyPageObject(ItemPage):
    response: MyResponseData

    def to_item(self):
        return {
            "url": self.response.url,
            "content": parse_content(self.response.html),
        }


class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        yield scrapy.Request("http://books.toscrape.com/", self.parse_page)

    def parse_page(self, response: DummyResponse, page: MyPageObject):
        # the request will be PROCESSED because, even though the spider
        # callback is not making use of its response, MyPageObject is
        yield page.to_item()
Note
The code above is just for example purposes. If you need to use scrapy.http.Response instances in your Page Objects, use the built-in web_poet.WebPage: it has a response attribute with a web_poet.HttpResponse instance, and no additional configuration is needed, as HttpResponseProvider is enabled in scrapy-poet by default.
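A minimal sketch of that recommended approach (ProductPage and the selector are illustrative):
from web_poet import WebPage


class ProductPage(WebPage):  # illustrative page object
    def to_item(self):
        return {
            "url": str(self.response.url),       # self.response is a web_poet.HttpResponse
            "name": self.css("h1::text").get(),  # CSS shortcut provided by WebPage
        }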
Requests concurrency
Requests skipped through DummyResponse never reach the downloader, so it makes sense not to check concurrent request limits, download delays, or AutoThrottle settings for them, since no download is made at all.
By default, if your parser or its page inputs need a regular Request, this request is downloaded through Scrapy, and all the settings and limits are respected, for example:
CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
RANDOMIZE_DOWNLOAD_DELAY
all AutoThrottle settings
DownloaderAwarePriorityQueue logic
But be aware when using third-party libraries to acquire content for a page object. If you make an HTTP request in a provider using some third-party async library (aiohttp, treq, etc.), the CONCURRENT_REQUESTS option will be respected, but not the others.
To have the other settings respected in addition to CONCURRENT_REQUESTS, you'd need to use crawler.engine.download or something similar. Alternatively, you could implement those limits in the library itself.
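As a rough sketch of the crawler.engine.download approach: ExtraHtml is a hypothetical dependency class, the re-fetched URL is illustrative, and it assumes the provider can declare the crawler as a __call__ argument and use an async __call__:
import attr
import scrapy
from scrapy.crawler import Crawler
from scrapy.utils.defer import maybe_deferred_to_future

from scrapy_poet.page_input_providers import PageObjectInputProvider


@attr.define
class ExtraHtml:  # hypothetical dependency class
    html: str


class ExtraHtmlProvider(PageObjectInputProvider):
    provided_classes = {ExtraHtml}

    async def __call__(self, to_provide, request: scrapy.Request, crawler: Crawler):
        # downloading through the engine keeps Scrapy's concurrency limits,
        # delays and AutoThrottle in effect for this additional request
        extra_request = scrapy.Request(request.url)  # illustrative: re-fetch the same URL
        response = await maybe_deferred_to_future(
            crawler.engine.download(extra_request)
        )
        return [ExtraHtml(html=response.text)]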
Attaching metadata to dependencies
Note
This feature requires Python 3.9+.
Providers can support dependencies with arbitrary metadata attached and use that metadata when creating them. Attaching the metadata is done by wrapping the dependency class in typing.Annotated:
@attr.define
class MyPageObject(ItemPage):
    response: Annotated[HtmlResponse, "foo", "bar"]
To handle this you need the following changes in your provider:
from typing import Callable

from andi.typeutils import strip_annotated
from scrapy_poet import PageObjectInputProvider
from web_poet.annotated import AnnotatedInstance


class Provider(PageObjectInputProvider):
    ...

    def is_provided(self, type_: Callable) -> bool:
        # needed so that you can list just the base type in provided_classes
        return super().is_provided(strip_annotated(type_))

    def __call__(self, to_provide):
        result = []
        for cls in to_provide:
            metadata = getattr(cls, "__metadata__", None)
            obj = ...  # create the instance using cls and metadata
            if metadata:
                # wrap the instance into a web_poet.annotated.AnnotatedInstance object
                obj = AnnotatedInstance(obj, metadata)
            result.append(obj)
        return result
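To make the mechanics concrete, here is a small illustration of what typing.Annotated and andi's strip_annotated expose to such a provider, using str as a stand-in dependency type:
from typing import Annotated

from andi.typeutils import strip_annotated

dep = Annotated[str, "foo", "bar"]  # a stand-in annotated dependency
print(dep.__metadata__)             # ("foo", "bar") -- the attached metadata
print(strip_annotated(dep))         # <class 'str'> -- the base type, matching provided_classes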