Changelog

0.22.1 (2024-03-07)

  • Fixed scrapy savefixture not finding page object modules when used outside a Scrapy project.

0.22.0 (2024-03-04)

  • Now requires web-poet >= 0.17.0 and time_machine >= 2.7.1.

  • Removed scrapy_poet.AnnotatedResult, use web_poet.annotated.AnnotatedInstance instead.

  • Added support for annotated dependencies to the scrapy savefixture command.

  • Test improvements.

0.21.0 (2024-02-08)

0.20.1 (2024-01-24)

  • ScrapyPoetRequestFingerprinter now supports item dependencies.

0.20.0 (2024-01-15)

  • Add ScrapyPoetRequestFingerprinter, a request fingerprinter that uses request dependencies in the fingerprint generation.
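
The fingerprinter is selected through Scrapy's standard REQUEST_FINGERPRINTER_CLASS setting. A minimal settings.py sketch, assuming the class is importable from the top-level scrapy_poet package:

```python
# settings.py (sketch; the import path is an assumption based on the
# class name given in this changelog)
# With this fingerprinter, the dependencies requested by a callback become
# part of the request fingerprint, in addition to the usual request data.
REQUEST_FINGERPRINTER_CLASS = "scrapy_poet.ScrapyPoetRequestFingerprinter"
```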

0.19.0 (2023-12-26)

  • Now requires andi >= 0.6.0.

  • Changed the implementation of resolving and building item dependencies from page objects. Now andi custom builders are used to create a single plan that includes building page objects and items. This fixes problems such as providers being called multiple times.

    • ItemProvider is now a no-op. It’s no longer enabled by default and users should also stop enabling it.

    • PageObjectInputProvider.allow_prev_instances and code related to it were removed so custom providers may need updating.

  • Fixed some tests.

0.18.0 (2023-12-12)

  • Now requires andi >= 0.5.0.

  • Add support for dependency metadata via typing.Annotated (requires Python 3.9+).
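
As a reminder of the underlying typing mechanism, the sketch below uses only the standard library to attach metadata to a dependency annotation and read it back. The classes and the "mobile" metadata value are hypothetical stand-ins; how scrapy-poet providers interpret such metadata is not shown here.

```python
from typing import Annotated, get_args, get_origin, get_type_hints


class HttpResponse:
    """Stand-in for a dependency type (hypothetical, for illustration)."""


class ProductPage:
    # The metadata ("mobile" here) travels with the type annotation.
    response: Annotated[HttpResponse, "mobile"]


def split_annotated(tp):
    """Return (base_type, metadata_tuple) for a possibly annotated type."""
    if get_origin(tp) is Annotated:
        base, *metadata = get_args(tp)
        return base, tuple(metadata)
    return tp, ()


# include_extras=True keeps the Annotated wrapper instead of stripping it.
hints = get_type_hints(ProductPage, include_extras=True)
base, metadata = split_annotated(hints["response"])
```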

0.17.0 (2023-12-11)

0.16.1 (2023-11-02)

  • Fix the bug that caused requests produced by HttpClientProvider to be treated as if they need arguments of the parse callback as dependencies, which could cause returning an empty response and/or making extra provider calls.

0.16.0 (2023-09-26)

  • Now requires time_machine >= 2.2.0.

  • ItemProvider now supports page objects that declare a dependency on the same type of item that they return, as long as there is an earlier page object input provider that can provide such dependency.

  • Fix running tests with Scrapy 2.11.

0.15.1 (2023-09-15)

  • scrapy-poet stats now also include counters for injected dependencies (poet/injector/<dependency import path>).

  • All scrapy-poet stats that used to be prefixed with scrapy-poet/ are now prefixed with poet/ instead.
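
Code that reads these stats by name (dashboards, alerts, tests) needs the new prefix. One possible way to map old key names to new ones is sketched below; the helper functions and the example_counter key are hypothetical and not part of scrapy-poet.

```python
OLD_PREFIX = "scrapy-poet/"
NEW_PREFIX = "poet/"


def rename_stat_key(key: str) -> str:
    """Map an old-style stat key to the new prefix; leave other keys alone."""
    if key.startswith(OLD_PREFIX):
        return NEW_PREFIX + key[len(OLD_PREFIX):]
    return key


def migrate_stats(stats: dict) -> dict:
    """Return a copy of a stats dict with old-style keys renamed."""
    return {rename_stat_key(key): value for key, value in stats.items()}


# "example_counter" is a made-up stat name used only for this example.
old_stats = {"scrapy-poet/example_counter": 3, "downloader/request_count": 10}
new_stats = migrate_stats(old_stats)
```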

0.15.0 (2023-09-12)

0.14.0 (2023-09-08)

  • Python 3.7 support has been dropped.

  • Caching is now built on top of web-poet serialization, extending caching support to additional inputs, while making our code simpler, more reliable, and more future-proof.

    This has resulted in a few backward-incompatible changes:

    • The scrapy_poet.page_input_providers.CacheDataProviderMixin mixin class has been removed. Providers no longer need to use it or reimplement its methods.

    • The SCRAPY_POET_CACHE_GZIP setting has been removed.

  • Added scrapy_poet.utils.open_in_browser, an alternative to scrapy.utils.response.open_in_browser that supports scrapy-poet.

  • Fixed some documentation links.

0.13.0 (2023-05-08)

  • Now requires web-poet >= 0.12.0.

  • The scrapy savefixture command now uses the adapter from the SCRAPY_POET_TESTS_ADAPTER setting to save the fixture.

  • Fix a typo in the docs.

0.12.0 (2023-04-26)

  • Now requires web-poet >= 0.11.0.

  • The scrapy savefixture command can now generate tests that expect that to_item() raises a specific exception (only web_poet.exceptions.PageObjectAction and its descendants are expected).

  • Fixed an error when using scrapy shell with scrapy_poet.InjectionMiddleware enabled.

  • Add a twine check CI check.

0.11.0 (2023-03-17)

0.10.1 (2023-03-03)

  • More robust time freezing in scrapy savefixture command.

0.10.0 (2023-02-24)

0.9.0 (2023-02-17)

  • Added support for item classes which are used as dependencies in page objects and spider callbacks. The following is now possible:

    import attrs
    import scrapy
    from web_poet import WebPage, handle_urls, field
    from scrapy_poet import DummyResponse
    
    @attrs.define
    class Image:
        url: str
    
    @handle_urls("example.com")
    class ProductImagePage(WebPage[Image]):
        @field
        def url(self) -> str:
            return self.css("#product img ::attr(href)").get("")
    
    @attrs.define
    class Product:
        name: str
        image: Image
    
    @handle_urls("example.com")
    @attrs.define
    class ProductPage(WebPage[Product]):
        # ✨ NEW: The page object can ask for items as dependencies. An instance
        # of ``Image`` is injected behind the scenes by calling the ``.to_item()``
        # method of ``ProductImagePage``.
        image_item: Image
    
        @field
        def name(self) -> str:
            return self.css("h1.name ::text").get("")
    
        @field
        def image(self) -> Image:
            return self.image_item
    
    class MySpider(scrapy.Spider):
        name = "myspider"
    
        def start_requests(self):
            yield scrapy.Request(
                "https://example.com/products/some-product", self.parse_product
            )
    
        # ✨ NEW: We can directly use the item here instead of the page object.
        def parse_product(self, response: DummyResponse, item: Product) -> Product:
            return item
    

    In line with this, the following new features were added:

  • New setting named SCRAPY_POET_RULES having a default value of web_poet.default_registry.get_rules. This deprecates SCRAPY_POET_OVERRIDES.

  • New setting named SCRAPY_POET_DISCOVER to ensure that SCRAPY_POET_RULES have properly loaded all intended rules annotated with the @handle_urls decorator.

  • New utility functions in scrapy_poet.utils.testing.

  • The frozen_time value inside the test fixtures won’t contain microseconds anymore.

  • Supports the new scrapy.http.request.NO_CALLBACK() introduced in Scrapy 2.8. This means that the Pitfalls (introduced in scrapy-poet==0.7.0) don’t apply when you’re using Scrapy >= 2.8, unless you’re using third-party middlewares that add scrapy.Request instances with callback set to None directly to the downloader. Such requests need their callback set to scrapy.http.request.NO_CALLBACK instead (the function object itself is passed; it is not called).

  • Fix the TypeError raised when using Twisted <= 21.7.0. scrapy-poet previously used the twisted.internet.defer.Deferred[object] type annotation, and Deferred is not subscriptable in those early Twisted versions.

  • Fix the twisted.internet.error.ReactorAlreadyInstalledError error raised when using the scrapy savefixture command and Twisted < 21.2.0 is installed.

  • Fix test configuration that didn’t follow the intended commands and dependencies in these tox environments: min, asyncio-min, and asyncio. This ensures that page objects using asyncio work properly alongside the minimum specified Twisted version.

  • Various improvements to tests and documentation.

  • Backward incompatible changes:

0.8.0 (2023-01-24)

  • Now requires web-poet >= 0.7.0 and time_machine.

  • Added a savefixture command that creates a test for a page object. See Tests for Page Objects for more information.

0.7.0 (2023-01-17)

  • Fixed the issue where a new page object containing the new response data was not properly created when web_poet.exceptions.core.Retry was raised.

  • In order for the above fix to be possible, overriding the callback dependencies created by scrapy-poet via scrapy.http.Request.cb_kwargs is now unsupported. This is a backward incompatible change.

  • Fixed the broken scrapy_poet.page_input_providers.HttpResponseProvider.fingerprint(), which raised an error when running a Scrapy job with SCRAPY_POET_CACHE enabled.

  • Improved behavior when spider.parse() method arguments are supposed to be provided by scrapy-poet. Previously, it was causing unnecessary work in unexpected places like scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware, scrapy.pipelines.images.ImagesPipeline or scrapy.pipelines.files.FilesPipeline. It is also a reason web_poet.page_inputs.client.HttpClient might not be working in page objects. Now these cases are detected, and a warning is issued.

    As of Scrapy 2.7, it is not possible to fix the issue completely in scrapy-poet. Fixing it would require Scrapy changes; some 3rd party libraries may also need to be updated.

    Note

    The root of the issue is that when request.callback is None, the parse() callback is normally assumed. But sometimes callback=None is used when a scrapy.http.Request is added to Scrapy’s downloader directly, in which case no callback is used. Middlewares, including scrapy-poet’s, can’t distinguish between these two cases, which causes all kinds of issues.

    We recommend that all scrapy-poet users modify their code to avoid the issue. Please don’t define a parse() method with arguments that are supposed to be filled by scrapy-poet, and rename existing parse() methods that have such arguments. Any other name is fine. This avoids all possible issues, including incompatibility with 3rd party middlewares or pipelines.

    See the new Pitfalls documentation for more information.

    There are backwards-incompatible changes related to this issue. They only affect you if you don’t follow the advice of not using parse() method with scrapy-poet.

    • When the parse() method has its response argument annotated with scrapy_poet.api.DummyResponse, for instance: def parse(self, response: DummyResponse), the response is downloaded instead of being skipped.

    • When the parse() method has dependencies that are provided by scrapy-poet, the scrapy_poet.downloadermiddlewares.InjectionMiddleware won’t attempt to build any dependencies anymore.

      This causes the following code to fail with TypeError: parse() missing 1 required positional argument: 'page':

      class MySpider(scrapy.Spider):
          name = "my_spider"
          start_urls = ["https://books.toscrape.com"]
      
          def parse(self, response: scrapy.http.Response, page: MyPage):
              ...
      
  • scrapy_poet.injection.is_callback_requiring_scrapy_response() now accepts an optional raw_callback parameter meant to represent the actual callback attribute value of scrapy.http.Request since the original callback parameter could be normalized to the spider’s parse() method when the scrapy.http.Request has callback set to None.

  • Official support for Python 3.11.

  • Various updates and improvements on docs and examples.
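
The advice above about parse() can be checked mechanically. The sketch below is a hypothetical audit helper, not part of scrapy-poet or Scrapy: it flags classes whose parse() method declares required parameters beyond self and the response. The example classes are plain stand-ins rather than real spiders, so the snippet runs with the standard library alone.

```python
import inspect


def parse_needs_extra_args(spider_cls) -> bool:
    """Return True if ``spider_cls.parse`` has required parameters besides
    ``self`` and the response argument."""
    parse = getattr(spider_cls, "parse", None)
    if parse is None:
        return False
    params = list(inspect.signature(parse).parameters.values())
    extra = params[2:]  # skip ``self`` and the response argument
    required_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return any(
        p.kind in required_kinds and p.default is inspect.Parameter.empty
        for p in extra
    )


class RiskySpider:
    # ``page`` would have to be injected into parse(): the discouraged pattern.
    def parse(self, response, page):
        ...


class SaferSpider:
    # parse() takes no injected arguments; a differently named callback does.
    def parse(self, response, **kwargs):
        ...

    def parse_product(self, response, page):
        ...
```

Under these assumptions, parse_needs_extra_args(RiskySpider) is true and parse_needs_extra_args(SaferSpider) is false, so only the first class would need its callback renamed.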

0.6.0 (2022-11-24)

  • Now requires web-poet >= 0.6.0.

    • All examples in the docs and tests now use web_poet.WebPage instead of web_poet.ItemWebPage.

    • The new instead_of parameter of the @handle_urls decorator is now preferred instead of the deprecated overrides parameter.

    • scrapy_poet.callback_for doesn’t require an implemented to_item method anymore.

    • The new web_poet.rules.RulesRegistry is used instead of the old web_poet.overrides.PageObjectRegistry.

    • The Registry now uses web_poet.ApplyRule instead of web_poet.OverrideRule.

  • Provider for web_poet.ResponseUrl is added, which allows accessing the response URL in the page object. This triggers a download, unlike the provider for web_poet.RequestUrl.

  • Fixes the error when using scrapy shell while the scrapy_poet.InjectionMiddleware is enabled.

  • Fixes and improvements on code and docs.

0.5.1 (2022-07-28)

Fixes the minimum web-poet version being 0.5.0 instead of 0.4.0.

0.5.0 (2022-07-28)

This release implements support for page object retries, introduced in web-poet 0.4.0.

To enable retry support, you need to configure a new spider middleware in your Scrapy settings:

SPIDER_MIDDLEWARES = {
    "scrapy_poet.RetryMiddleware": 275,
}

web-poet 0.4.0 is now the minimum required version of web-poet.

0.4.0 (2022-06-20)

This release is backwards incompatible, following backwards-incompatible changes in web-poet 0.2.0.

The main new feature is support for web-poet >= 0.2.0, including support for async def to_item methods, making additional requests in the to_item method, new Page Object dependencies, and the new way to configure overrides.

Changes in line with web-poet >= 0.2.0:

  • web_poet.HttpResponse replaces web_poet.ResponseData as a dependency to use.

  • Additional requests inside Page Objects: a provider for web_poet.HttpClient, as well as a web_poet.HttpClient backend implementation that uses the Scrapy downloader.

  • callback_for now supports Page Objects which define async def to_item method.

  • Provider for web_poet.PageParams is added, which uses request.meta["page_params"] value.

  • Provider for web_poet.RequestUrl is added, which allows accessing the request URL in the page object without triggering the download.

  • We have these backward incompatible changes since web_poet.OverrideRule follows a different structure:

    • Deprecated PerDomainOverridesRegistry in favor of the newer OverridesRegistry, which provides a wide variety of features for better URL matching.

    • This results in a newer format in the SCRAPY_POET_OVERRIDES setting.

Other changes:

  • New scrapy_poet/dummy_response_count value appears in Scrapy stats; it is the number of times DummyResponse is used instead of downloading the response as usual.

  • scrapy.utils.reqser deprecated module is no longer used by scrapy-poet.

Dependency updates:

  • The minimum supported Scrapy version is now 2.6.0.

  • The minimum supported web-poet version is now 0.2.0.

0.3.0 (2022-01-28)

  • Cache mechanism using SCRAPY_POET_CACHE

  • Fixed and improved docs

  • Removed support for Python 3.6

  • Added support for Python 3.10
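
The cache mentioned above is switched on through the SCRAPY_POET_CACHE setting. A minimal settings.py sketch, assuming a boolean value is accepted (check the settings documentation of your scrapy-poet version for other accepted values):

```python
# settings.py (sketch; the accepted value types are an assumption)
# Enable scrapy-poet's provider cache.
SCRAPY_POET_CACHE = True
```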

0.2.1 (2021-06-11)

  • Improved logging message for DummyResponse

  • Various internal cleanups

0.2.0 (2021-01-22)

  • Overrides support

0.1.0 (2020-12-29)

  • New providers interface

    • One provider can provide many types at once

    • Single instance during the whole spider lifespan

    • Registration is now explicit and done in the spider settings

  • CI is migrated from Travis to GitHub Actions

  • Python 3.9 support

0.0.3 (2020-07-19)

  • Documentation improvements

  • Providers can now access various Scrapy objects: Crawler, Settings, Spider, Request, Response, StatsCollector

0.0.2 (2020-04-28)

The repository is renamed to scrapy-poet, and split into two:

  • web-poet (https://github.com/scrapinghub/web-poet) contains definitions and code useful for writing Page Objects for web data extraction - it is not tied to Scrapy;

  • scrapy-poet (this package) provides Scrapy integration for such Page Objects.

API of the library changed in a backwards incompatible way; see README and examples.

New features:

  • The DummyResponse annotation allows skipping the download of the Scrapy Response.

  • callback_for works for Scrapy disk queues if it is used to create a spider method (but not in its inline form).

  • Page objects may require page objects as dependencies; dependencies are resolved recursively and built as needed.

  • InjectionMiddleware supports async def and asyncio providers.
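
To illustrate what "resolved recursively and built as needed" means, here is a deliberately simplified, standard-library-only model. It is not how scrapy-poet (or andi, which later entries in this changelog mention) is implemented; the build function and all class names are hypothetical.

```python
from typing import get_type_hints


def build(cls, leaves):
    """Build ``cls`` by recursively building its annotated __init__ arguments.

    ``leaves`` maps types to ready-made instances (for example the response),
    which stops the recursion.
    """
    if cls in leaves:
        return leaves[cls]
    hints = get_type_hints(cls.__init__)
    hints.pop("return", None)  # ignore a return annotation, if any
    kwargs = {name: build(dep, leaves) for name, dep in hints.items()}
    return cls(**kwargs)


class Response:
    def __init__(self, body: str):
        self.body = body


class BreadcrumbsPage:
    def __init__(self, response: Response):
        self.response = response


class ProductPage:
    # Depends both on the response and on another page object.
    def __init__(self, response: Response, breadcrumbs: BreadcrumbsPage):
        self.response = response
        self.breadcrumbs = breadcrumbs


response = Response("<html></html>")
page = build(ProductPage, {Response: response})
```

In this model, ProductPage receives a BreadcrumbsPage that was itself built from the same Response instance.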

0.0.1 (2019-08-28)

Initial release.