Changelog
0.24.0 (2024-10-10)
- When dynamic dependencies are annotated with `typing.Annotated`, the keys in the resulting `DynamicDeps` instance are now the plain, unannotated types (a minimal sketch follows this entry).
- Improved the error message shown when passing incorrect values in the `"inject"` meta key.
- Fixed documentation builds with `sphinx-rtd-theme` 3.0.0+.
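A minimal sketch of the new behavior, assuming a hypothetical `MyPage` page object; dynamic dependencies are requested through the `"inject"` request meta key and received through a `DynamicDeps` callback argument:

```python
from typing import Annotated

import scrapy
from scrapy_poet import DynamicDeps
from web_poet import WebPage


class MyPage(WebPage):  # hypothetical page object
    ...


class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            callback=self.parse_page,
            # request an annotated dependency dynamically
            meta={"inject": [Annotated[MyPage, "some-metadata"]]},
        )

    def parse_page(self, response, deps: DynamicDeps):
        # as of 0.24.0, the key is the plain type, not the Annotated form
        page = deps[MyPage]
        self.logger.info("got page object: %r", page)
```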
0.23.0 (2024-07-18)
- Added support for specifying callback dependencies dynamically, via the `"inject"` request meta key (see the sketch under 0.24.0 above).
0.22.6 (2024-07-03)
- Raising `Retry` now also works as expected when callbacks ask for an item instead of asking for its page object.
0.22.5 (2024-06-27)
- When `Retry` is raised with a message, that message now becomes the retry reason, replacing the default (`page_object_retry`).
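A minimal sketch, assuming a hypothetical product page; the message passed to `Retry` becomes the retry reason:

```python
from web_poet import WebPage
from web_poet.exceptions import Retry


class ProductPage(WebPage):
    def to_item(self) -> dict:
        name = self.css("h1::text").get()
        if not name:
            # hypothetical ban check; this message is recorded as the
            # retry reason instead of the default "page_object_retry"
            raise Retry("missing product title, possible ban")
        return {"name": name}
```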
0.22.4 (2024-06-10)
- Additional requests, when mapped to `scrapy.Request` objects, now get their `dont_filter` parameter set to `True`, to ask downloader middlewares like `OffsiteMiddleware` not to drop those requests.
0.22.3 (2024-04-25)
0.22.2 (2024-04-24)
- `HttpRequestProvider`, added in 0.17.0, is now actually enabled by default.
0.22.1 (2024-03-07)
- Fixed `scrapy savefixture` not finding page object modules when used outside a Scrapy project.
0.22.0 (2024-03-04)
- Now requires `web-poet >= 0.17.0` and `time_machine >= 2.7.1`.
- Removed `scrapy_poet.AnnotatedResult`; use `web_poet.annotated.AnnotatedInstance` instead.
- Added support for annotated dependencies to the `scrapy savefixture` command.
- Test improvements.
0.21.0 (2024-02-08)
- Added a `.weak_cache` to `scrapy_poet.injection.Injector`, which stores instances created by providers for as long as the `scrapy.Request` exists.
- Fixed the incorrect value of `downloader/response_count` in the stats, caused by additional counting of `scrapy_poet.api.DummyResponse`.
- Fixed the detection of `scrapy_poet.api.DummyResponse` when some type hints are annotated using strings.
0.20.1 (2024-01-24)
- `ScrapyPoetRequestFingerprinter` now supports item dependencies.
0.20.0 (2024-01-15)
- Added `ScrapyPoetRequestFingerprinter`, a request fingerprinter that uses request dependencies in fingerprint generation (a configuration sketch follows).
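A minimal configuration sketch: the fingerprinter plugs into Scrapy's standard `REQUEST_FINGERPRINTER_CLASS` setting (available since Scrapy 2.7):

```python
# settings.py
REQUEST_FINGERPRINTER_CLASS = "scrapy_poet.ScrapyPoetRequestFingerprinter"
```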
0.19.0 (2023-12-26)
- Now requires `andi >= 0.6.0`.
- Changed the implementation of resolving and building item dependencies from page objects. Now `andi` custom builders are used to create a single plan that includes building both page objects and items. This fixes problems such as providers being called multiple times.
- `ItemProvider` is now a no-op. It is no longer enabled by default, and users should also stop enabling it.
- `PageObjectInputProvider.allow_prev_instances` and code related to it were removed, so custom providers may need updating.
- Fixed some tests.
0.18.0 (2023-12-12)
- Now requires `andi >= 0.5.0`.
- Added support for dependency metadata via `typing.Annotated` (requires Python 3.9+; a sketch follows).
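A minimal sketch of attaching metadata to a page object dependency with `typing.Annotated`; the metadata string is an arbitrary illustrative value:

```python
from typing import Annotated

import attrs
from web_poet import HttpResponse, ItemPage


@attrs.define
class MyPage(ItemPage):
    # the metadata ("my-metadata") travels with the dependency and can be
    # inspected by the provider that builds it
    response: Annotated[HttpResponse, "my-metadata"]
```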
0.17.0 (2023-12-11)
- Now requires `web-poet >= 0.15.1`.
- `HttpRequest` dependencies are now supported, via `HttpRequestProvider` (enabled by default).
- Enabled `StatsProvider`, which provides `Stats` dependencies, by default (see the sketch after this list).
- More robust disabling of `InjectionMiddleware` in the `scrapy savefixture` command.
- Official support for Python 3.12.
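A minimal sketch of a page object with a `Stats` dependency, assuming web-poet's `Stats.inc()` API:

```python
import attrs
from web_poet import Stats, WebPage, field


@attrs.define
class MyPage(WebPage):
    stats: Stats

    @field
    def name(self) -> str:
        # assumption: Stats exposes inc(key) for counter increments
        self.stats.inc("my-page/name-extracted")
        return self.css("h1::text").get("")
```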
0.16.1 (2023-11-02)
- Fixed a bug that caused requests produced by `HttpClientProvider` to be treated as if they needed the arguments of the `parse` callback as dependencies, which could cause an empty response to be returned and/or extra provider calls to be made.
0.16.0 (2023-09-26)
- Now requires `time_machine >= 2.2.0`.
- `ItemProvider` now supports page objects that declare a dependency on the same type of item that they return, as long as there is an earlier page object input provider that can provide such a dependency.
- Fixed running tests with Scrapy 2.11.
0.15.1 (2023-09-15)
- scrapy-poet stats now also include counters for injected dependencies (`poet/injector/<dependency import path>`).
- All scrapy-poet stats that used to be prefixed with `scrapy-poet/` are now prefixed with `poet/` instead.
0.15.0 (2023-09-12)
- Now requires `web-poet >= 0.15.0`.
- web-poet stats are now supported.
0.14.0 (2023-09-08)
- Python 3.7 support has been dropped.
- Caching is now built on top of web-poet serialization, extending caching support to additional inputs while making our code simpler, more reliable, and more future-proof. This has resulted in a few backward-incompatible changes:
  - The `scrapy_poet.page_input_providers.CacheDataProviderMixin` mixin class has been removed. Providers no longer need to use it or reimplement its methods.
  - The `SCRAPY_POET_CACHE_GZIP` setting has been removed.
- Added `scrapy_poet.utils.open_in_browser`, an alternative to `scrapy.utils.response.open_in_browser` that supports scrapy-poet (a usage sketch follows this list).
- Fixed some documentation links.
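A minimal usage sketch, assuming the function accepts a web-poet `HttpResponse` such as the one a `WebPage` holds:

```python
from scrapy_poet.utils import open_in_browser
from web_poet import WebPage


class MyPage(WebPage):
    def to_item(self) -> dict:
        # open the received response in a local browser for debugging
        # (assumption: a web_poet.HttpResponse is accepted here)
        open_in_browser(self.response)
        return {}
```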
0.13.0 (2023-05-08)
- Now requires `web-poet >= 0.12.0`.
- The `scrapy savefixture` command now uses the adapter from the `SCRAPY_POET_TESTS_ADAPTER` setting to save the fixture (a configuration sketch follows).
- Fixed a typo in the docs.
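A minimal configuration sketch; the import path below is a hypothetical itemadapter-compatible adapter class:

```python
# settings.py
SCRAPY_POET_TESTS_ADAPTER = "myproject.adapters.MyItemAdapter"  # hypothetical
```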
0.12.0 (2023-04-26)
- Now requires `web-poet >= 0.11.0`.
- The `scrapy savefixture` command can now generate tests that expect `to_item()` to raise a specific exception (only `web_poet.exceptions.PageObjectAction` and its descendants are expected).
- Fixed an error when using `scrapy shell` with `scrapy_poet.InjectionMiddleware` enabled.
- Added a `twine check` CI check.
0.11.0 (2023-03-17)
- The `scrapy savefixture` command can now generate a fixture using an existing spider.
0.10.1 (2023-03-03)
- More robust time freezing in the `scrapy savefixture` command.
0.10.0 (2023-02-24)
- Now requires `web-poet >= 0.8.0`.
- The `savefixture` command now also saves requests made via the `web_poet.page_inputs.client.HttpClient` dependency, and their responses.
0.9.0 (2023-02-17)
Added support for item classes which are used as dependencies in page objects and spider callbacks. The following is now possible:
```python
import attrs
import scrapy
from web_poet import WebPage, handle_urls, field
from scrapy_poet import DummyResponse


@attrs.define
class Image:
    url: str


@handle_urls("example.com")
class ProductImagePage(WebPage[Image]):
    @field
    def url(self) -> str:
        return self.css("#product img ::attr(href)").get("")


@attrs.define
class Product:
    name: str
    image: Image


@handle_urls("example.com")
@attrs.define
class ProductPage(WebPage[Product]):
    # ✨ NEW: The page object can ask for items as dependencies. An instance
    # of ``Image`` is injected behind the scenes by calling the ``.to_item()``
    # method of ``ProductImagePage``.
    image_item: Image

    @field
    def name(self) -> str:
        return self.css("h1.name ::text").get("")

    @field
    def image(self) -> Image:
        return self.image_item


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products/some-product", self.parse_product
        )

    # ✨ NEW: We can directly use the item here instead of the page object.
    def parse_product(self, response: DummyResponse, item: Product) -> Product:
        return item
```
In line with this, the following new features were added:
- New `scrapy_poet.page_input_providers.ItemProvider`, which makes the usage above possible.
- An item class is now supported by `scrapy_poet.callback_for()` alongside the usual page objects. This means it no longer raises a `TypeError` when not passing a subclass of `web_poet.pages.ItemPage`.
- New exception: `scrapy_poet.injection_errors.ProviderDependencyDeadlockError`. This is raised when it's not possible to create the dependencies due to a deadlock in their sub-dependencies, e.g. a circular dependency between page objects.
- New setting named `SCRAPY_POET_RULES`, with a default value of `web_poet.default_registry.get_rules()`. This deprecates `SCRAPY_POET_OVERRIDES` (see the settings sketch after this list).
- New setting named `SCRAPY_POET_DISCOVER` to ensure that `SCRAPY_POET_RULES` has properly loaded all intended rules annotated with the `@handle_urls` decorator.
- New utility functions in `scrapy_poet.utils.testing`.
- The `frozen_time` value inside the test fixtures no longer contains microseconds.
- Supports the new `scrapy.http.request.NO_CALLBACK` introduced in Scrapy 2.8. This means that the Pitfalls (introduced in `scrapy-poet==0.7.0`) don't apply when you're using Scrapy >= 2.8, unless you're using third-party middlewares which directly use the downloader to add `scrapy.Request` instances with the callback set to `None`. Otherwise, you need to set the callback value to `scrapy.http.request.NO_CALLBACK`.
- Fixed the `TypeError` raised when using Twisted <= 21.7.0: scrapy-poet previously used the `twisted.internet.defer.Deferred[object]` type annotation, which was not subscriptable in those early Twisted versions.
- Fixed the `twisted.internet.error.ReactorAlreadyInstalledError` error raised when using the `scrapy savefixture` command with Twisted < 21.2.0 installed.
- Fixed test configuration that didn't follow the intended commands and dependencies in these tox environments: `min`, `asyncio-min`, and `asyncio`. This ensures that page objects using `asyncio` work properly alongside the minimum specified Twisted version.
- Various improvements to tests and documentation.
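A minimal settings sketch for the new rule settings; the module path in `SCRAPY_POET_DISCOVER` is a hypothetical example:

```python
# settings.py
from web_poet import default_registry

# this is already the default value of SCRAPY_POET_RULES
SCRAPY_POET_RULES = default_registry.get_rules()

# modules imported so that @handle_urls-decorated page objects get registered
SCRAPY_POET_DISCOVER = ["myproject.pages"]  # hypothetical module path
```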
Backward incompatible changes:
- For the `scrapy_poet.page_input_providers.PageObjectInputProvider` base class:
  - It now accepts an instance of `scrapy_poet.injection.Injector` in its constructor instead of `scrapy.crawler.Crawler`. You can still access the `scrapy.crawler.Crawler` via the `Injector.crawler` attribute.
  - `scrapy_poet.page_input_providers.PageObjectInputProvider.is_provided()` is now an instance method instead of a class method.
- The `scrapy_poet.injection.Injector` attribute and constructor parameter called `overrides_registry` is now simply called `registry`.
- Removed the `SCRAPY_POET_OVERRIDES_REGISTRY` setting, which overrode the default registry.
- The `scrapy_poet.overrides` module, which contained `OverridesRegistryBase` and `OverridesRegistry`, has been removed. Instead, scrapy-poet directly uses `web_poet.rules.RulesRegistry`.
  - Everything works pretty much the same, except that `web_poet.rules.RulesRegistry.overrides_for()` now accepts `str`, `web_poet.page_inputs.http.RequestUrl`, or `web_poet.page_inputs.http.ResponseUrl` instead of `scrapy.http.Request`.
  - This also means that the registry no longer accepts tuples as rules; only `web_poet.rules.ApplyRule` instances are allowed. The same goes for `SCRAPY_POET_RULES` (and the deprecated `SCRAPY_POET_OVERRIDES`).
- The following type aliases have been removed:
  - `scrapy_poet.overrides.RuleAsTuple`
  - `scrapy_poet.overrides.RuleFromUser`
0.8.0 (2023-01-24)
- Now requires `web-poet >= 0.7.0` and `time_machine`.
- Added a `savefixture` command that creates a test for a page object. See Tests for Page Objects for more information.
0.7.0 (2023-01-17)
- Fixed the issue where a new page object containing new response data was not properly created when `web_poet.exceptions.core.Retry` was raised.
- To make the above fix possible, overriding the callback dependencies created by scrapy-poet via `scrapy.http.Request.cb_kwargs` is now unsupported. This is a backward-incompatible change.
- Fixed the broken `scrapy_poet.page_input_providers.HttpResponseProvider.fingerprint()`, which errored out when running a Scrapy job with `SCRAPY_POET_CACHE` enabled.
- Improved behavior when `spider.parse()` method arguments are supposed to be provided by scrapy-poet. Previously, this caused unnecessary work in unexpected places like `scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware`, `scrapy.pipelines.images.ImagesPipeline` or `scrapy.pipelines.files.FilesPipeline`. It is also a reason `web_poet.page_inputs.client.HttpClient` might not work in page objects. Now these cases are detected, and a warning is issued.

  As of Scrapy 2.7, it is not possible to fix the issue completely in scrapy-poet. Fixing it would require Scrapy changes; some 3rd-party libraries may also need to be updated.
  Note: The root of the issue is that when `request.callback` is `None`, the `parse()` callback is normally assumed. But sometimes `callback=None` is used when a `scrapy.http.Request` is added to Scrapy's downloader directly, in which case no callback is used at all. Middlewares, including scrapy-poet's, can't distinguish between these two cases, which causes all kinds of issues.

  We recommend that all scrapy-poet users modify their code to avoid the issue: don't define a `parse()` method with arguments that are supposed to be filled by scrapy-poet, and rename existing `parse()` methods if they have such arguments. Any other name is fine. This avoids all possible issues, including incompatibility with 3rd-party middlewares or pipelines. See the new Pitfalls documentation for more information.
- There are backward-incompatible changes related to this issue. They only affect you if you don't follow the advice of not using the `parse()` method with scrapy-poet:
  - When the `parse()` method has its response argument annotated with `scrapy_poet.api.DummyResponse` (for instance, `def parse(self, response: DummyResponse)`), the response is now downloaded instead of being skipped.
  - When the `parse()` method has dependencies that are provided by scrapy-poet, `scrapy_poet.downloadermiddlewares.InjectionMiddleware` no longer attempts to build any dependencies. This causes the following code to raise `TypeError: parse() missing 1 required positional argument: 'page'`:

```python
class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response: scrapy.http.Response, page: MyPage):
        ...
```
- `scrapy_poet.injection.is_callback_requiring_scrapy_response()` now accepts an optional `raw_callback` parameter, meant to represent the actual callback attribute value of `scrapy.http.Request`, since the original `callback` parameter could be normalized to the spider's `parse()` method when the `scrapy.http.Request` has its `callback` set to `None`.
- Official support for Python 3.11.
- Various updates and improvements to docs and examples.
0.6.0 (2022-11-24)
- Now requires `web-poet >= 0.6.0`.
  - All examples in the docs and tests now use `web_poet.WebPage` instead of `web_poet.ItemWebPage`.
  - The new `instead_of` parameter of the `@handle_urls` decorator is now preferred over the deprecated `overrides` parameter.
  - `scrapy_poet.callback_for` no longer requires an implemented `to_item` method.
  - The new `web_poet.rules.RulesRegistry` is used instead of the old `web_poet.overrides.PageObjectRegistry`.
  - The registry now uses `web_poet.ApplyRule` instead of `web_poet.OverrideRule`.
- Added a provider for `web_poet.ResponseUrl`, which allows accessing the response URL in the page object. Unlike the provider for `web_poet.RequestUrl`, it triggers a download.
- Fixed the error when using `scrapy shell` while `scrapy_poet.InjectionMiddleware` is enabled.
- Fixes and improvements to code and docs.
0.5.1 (2022-07-28)
- Fixed the minimum required web-poet version: it is 0.4.0, not 0.5.0.
0.5.0 (2022-07-28)
This release implements support for page object retries, introduced in web-poet 0.4.0.

To enable retry support, you need to configure a new spider middleware in your Scrapy settings:

```python
SPIDER_MIDDLEWARES = {
    "scrapy_poet.RetryMiddleware": 275,
}
```

web-poet 0.4.0 is now the minimum required version of web-poet.
0.4.0 (2022-06-20)
This release is backwards incompatible, following backwards-incompatible changes in web-poet 0.2.0.

The main new feature is support for `web-poet >= 0.2.0`, including support for `async def to_item` methods, making additional requests in the `to_item` method, new page object dependencies, and the new way to configure overrides.

Changes in line with `web-poet >= 0.2.0`:
- `web_poet.HttpResponse` replaces `web_poet.ResponseData` as the dependency to use.
- Additional requests inside page objects: a provider for `web_poet.HttpClient`, as well as a `web_poet.HttpClient` backend implementation which uses the Scrapy downloader.
- `callback_for` now supports page objects which define an `async def to_item` method.
- Added a provider for `web_poet.PageParams`, which uses the `request.meta["page_params"]` value (see the sketch after this list).
- Added a provider for `web_poet.RequestUrl`, which allows accessing the request URL in the page object without triggering a download.
- There are backward-incompatible changes, since `web_poet.OverrideRule` now follows a different structure:
  - Deprecated `PerDomainOverridesRegistry` in favor of the newer `OverridesRegistry`, which provides a wide variety of features for better URL matching.
  - This results in a new format for the `SCRAPY_POET_OVERRIDES` setting.
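A minimal sketch of the `PageParams` provider mentioned above; the key and value passed through `request.meta["page_params"]` are illustrative:

```python
import attrs
import scrapy
from web_poet import PageParams, WebPage


@attrs.define
class MyPage(WebPage):
    # filled by the provider from request.meta["page_params"]
    page_params: PageParams


class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com",
            # hypothetical key/value made available to the page object
            meta={"page_params": {"currency": "USD"}},
        )
```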
Other changes:
- A new `scrapy_poet/dummy_response_count` value appears in Scrapy stats; it is the number of times `DummyResponse` is used instead of downloading the response as usual.
- The deprecated `scrapy.utils.reqser` module is no longer used by scrapy-poet.
Dependency updates:
- The minimum supported Scrapy version is now 2.6.0.
- The minimum supported web-poet version is now 0.2.0.
0.3.0 (2022-01-28)
- Cache mechanism using `SCRAPY_POET_CACHE` (see the sketch below).
- Fixed and improved docs.
- Removed support for Python 3.6.
- Added support for Python 3.10.
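A minimal settings sketch for the cache mechanism above; the directory path is a hypothetical example:

```python
# settings.py
# assumption: a truthy value enables the cache; a string selects its location
SCRAPY_POET_CACHE = "/tmp/scrapy-poet-cache"  # hypothetical cache directory
```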
0.2.1 (2021-06-11)
- Improved logging message for `DummyResponse`.
- Various internal cleanups.
0.2.0 (2021-01-22)
- Overrides support.
0.1.0 (2020-12-29)
- New providers interface.
- One provider can provide many types at once.
- Single instance during the whole spider lifespan.
- Registration is now explicit and done in the spider settings.
- CI is migrated from Travis to GitHub Actions.
- Python 3.9 support.
0.0.3 (2020-07-19)
- Documentation improvements.
- Providers can now access various Scrapy objects: Crawler, Settings, Spider, Request, Response, StatsCollector.
0.0.2 (2020-04-28)
The repository was renamed to scrapy-poet, and split into two:

- `web-poet` (https://github.com/scrapinghub/web-poet) contains definitions and code useful for writing page objects for web data extraction; it is not tied to Scrapy.
- `scrapy-poet` (this package) provides Scrapy integration for such page objects.

The API of the library changed in a backwards-incompatible way; see the README and examples.
New features:
- The `DummyResponse` annotation allows skipping the download of a Scrapy Response.
- `callback_for` works with Scrapy disk queues if it is used to create a spider method (but not in its inline form).
- Page objects may require other page objects as dependencies; dependencies are resolved recursively and built as needed.
- `InjectionMiddleware` supports `async def` and asyncio providers.
0.0.1 (2019-08-28)
Initial release.