@@ -11,14 +11,14 @@ async def main() -> None:
		awaitdataset.export_to(
		content_type='csv',
		key='data.csv',
		to_key_value_store_name='my-cool-key-value-store',
		to_kvs_name='my-cool-key-value-store',

Copy link

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is this BC break worth it?

Copy link

ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

let's evaluate all the potential BCs at the end

Copy link

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Sure. I thought we are nearing that now 😁

Copy link

ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Since we're just re-exporting thestorages fromcrawlee here, there will be many more cases than this one. I'm not saying we have to rename this particular argument (and I will undo it if you insist—just I don't like those long identifiers, especially when we can use the KVS abbreviation).

src/apify/storage_clients/_apify/_dataset_client.py OutdatedShow resolvedHide resolved

src/apify/_actor.py OutdatedShow resolvedHide resolved

tests/integration/conftest.py OutdatedShow resolvedHide resolved

vdusek added2 commits

June 27, 2025 08:43

Move default storage IDs to Configuration

82e65fc

opening storages get default id from config

8de950b

vdusek added4 commits

June 27, 2025 10:50

Addressing more feedback

98b76c5

Fixing integration test test_push_large_data_chunks_over_9mb

7b5ee07

Abstract open method is removed from storage clients

afcb8c7

fixing generate public url for KVS records

3bacab7

vdusek force-pushed thenew-apify-storage-clients branch from0766ecd to3bacab7Compare

June 30, 2025 13:04

vdusek added3 commits

July 1, 2025 15:02

add async metadata getters

287a119

Merge branch 'master' into new-apify-storage-clients

e45d65b

better usage of apify config

51178ca

vdusek requested review fromjanbuchar andPijukatel

July 1, 2025 13:44

vdusek added a commit to apify/crawlee-python that referenced this pull request

Jul 1, 2025

refactor!: Introduce new storage client system (#1194)

de1c03f

## Description- I consolidated all commits from#1107 into this new PR.- The previous storage-clients implementation was completely replacedwith a redesigned clients, including:    - new storage-client interface,    - in-memory storage client,    - file-system storage client,- Apify storage client (implemented in the SDK; seeapify/apify-sdk-python#470),    - and various adjustments elsewhere in the codebase.- The old "memory plus persist" client has been split into separatememory and file-system implementations.- The `Configuration.persist_storage` and`Configuration.persist_metadata` options were removed.- All old collection clients have been removed, they're no longerneeded.- Each storage client now prints warnings if you pass method argumentsit does not support.- The creation management modules in the storage clients and storageswere removed.- Storage client parameters (e.g. `purge_on_start`, or `token` and`base_api_url` for the Apify client) are configured via the`Configuration`.- Every storage, and its corresponding client, now provides both a`purge` method (which clears all items but preserves the storage andmetadata) and a `drop` method (which removes the entire storage,metadata included).- All unused types, models, and helper utilities have been removed.- The detailed, per-storage/client changes are listed below...### Dataset- Properties:    - `id`    - `name`    - `metadata`- Methods:    - `open`    - `purge` (new method)    - `drop`    - `push_data`    - `get_data`    - `iterate_items`    - `list_items` (new method)    - `export_to`- Breaking changes:- `from_storage_object` method has been removed - Use the `open` methodwith `name` or `id` instead.    - `get_info` -> `metadata` property    - `storage_object` -> `metadata` property- `set_metadata` method has been removed (it wasn't propage to clients)        - Do we want to support it (e.g. for renaming)?- `write_to_json` -> method has been removed, use `export_to` instead    - `write_to_csv` -> method has been removed, use `export_to` instead```pythonimport asynciofrom crawlee.storage_clients import FileSystemStorageClientfrom crawlee.storages import Datasetasync def main() -> None:    dataset = await Dataset.open(storage_client=FileSystemStorageClient())    print(f'default dataset - ID: {dataset.id}, name: {dataset.name}')    await dataset.push_data({'name': 'John'})    await dataset.push_data({'name': 'John', 'age': 20})    await dataset.push_data({})    dataset_with_name = await Dataset.open(        name='my_dataset',        storage_client=FileSystemStorageClient(),    )    print(f'named dataset - ID: {dataset_with_name.id}, name: {dataset_with_name.name}')    await dataset_with_name.push_data([{'age': 30}, {'age': 25}])    print('Default dataset items:')    async for item in dataset.iterate_items(skip_empty=True):        print(item)    print('Named dataset items:')    async for item in dataset_with_name.iterate_items():        print(item)    items = await dataset.get_data()    print(items)    dataset_by_id = await Dataset.open(id=dataset_with_name.id)    print(f'dataset by ID - ID: {dataset_by_id.id}, name: {dataset_by_id.name}')if __name__ == '__main__':    asyncio.run(main())```### Key-value store- Properties:    - `id`    - `name`    - `metadata`- Methods:    - `open`    - `purge` (new method)    - `drop`    - `get_value`    - `set_value`- `delete_value` (new method, Apify platform's set_value support settingan empty value to a key, so having a separate method for deleting isuseful)    - `iterate_keys`    - `list_keys` (new method)    - `get_public_url`    - `get_auto_saved_value`    - `persist_autosaved_values`- Breaking changes:- `from_storage_object` method has been removed - Use the `open` methodwith `name` or `id` instead.    - `get_info` -> `metadata` property    - `storage_object` -> `metadata` property- `set_metadata` method has been removed (it wasn't propage to clients)        - Do we want to support it (e.g. for renaming)?```pythonimport asyncioimport requestsfrom crawlee.storage_clients import FileSystemStorageClientfrom crawlee.storages import KeyValueStoreasync def main() -> None:    print('Opening key-value store "my_kvs"...')    storage_client = FileSystemStorageClient()    kvs = await KeyValueStore.open(name='my_kvs', storage_client=storage_client)    print('Setting value to "file.json"...')    await kvs.set_value('file.json', {'key': 'value'})    print('Setting value to "file.jpg"...')    response = requests.get('https://avatars.githubusercontent.com/u/25082181')    await kvs.set_value('file.jpg', response.content)    print('Iterating over keys:')    async for key in kvs.iterate_keys():        print(f'Key: {key}')    print('Listing keys:')    keys = [key.key for key in await kvs.list_keys()]    print(f'Keys: {keys}')    for key in keys:        print(f'Getting value of {key}...')        value = await kvs.get_value(key)        print(f'Value: {str(value)[:100]}')    print('Deleting value of "file.json"...')    await kvs.delete_value('file.json')    kvs_default = await KeyValueStore.open(storage_client=storage_client)    special_key = 'key with spaces/and/slashes!@#$%^&*()'    test_value = 'Special key value'    await kvs_default.set_value(key=special_key, value=test_value)    record = await kvs_default.get_value(key=special_key)    assert record is not None    assert record == test_value    result = await kvs_default.list_keys()    print(f'kvs_default list keys = {result}')    kvs_2 = await KeyValueStore.open()    result = await kvs_2.list_keys()    print(f'kvs_2 list keys = {result}')if __name__ == '__main__':    asyncio.run(main())```### Request queue- Properties:    - `id`    - `name`    - `metadata`- Methods:    - `open`    - `purge` (new method)    - `drop`    - `add_request`    - `add_requests_batched` -> `add_requests`    - `fetch_next_request`    - `get_request`    - `mark_request_as_handled`    - `reclaim_request`    - `is_empty`    - `is_finished`- Breaking changes:- `from_storage_object` method has been removed - Use the `open` methodwith `name` or `id` instead.    - `get_info` -> `metadata` property    - `storage_object` -> `metadata` property- `set_metadata` method has been removed (it wasn't propage to clients)        - Do we want to support it (e.g. for renaming)?- `get_handled_count` method had been removed - Use`metadata.handled_request_count` instead.- `get_total_count` method has been removed - Use`metadata.total_request_count` instead.- `resource_directory` from the `RequestQueueMetadata` was removed, use`path_to...` property instead.- `RequestQueueHead` model has been removed - Use`RequestQueueHeadWithLocks` instead.- Notes:- New RQ `add_requests` contain `forefront` arg (Apify API supports it)```pythonimport asynciofrom crawlee import Requestfrom crawlee.configuration import Configurationfrom crawlee.storage_clients import FileSystemStorageClientfrom crawlee.storages import RequestQueueasync def main() -> None:    rq = await RequestQueue.open(        name='my-queue',        storage_client=FileSystemStorageClient(),        configuration=Configuration(purge_on_start=True),    )    print(f'RequestQueue: {rq}')    print(f'RequestQueue client: {rq._client}')    await rq.add_requests(        requests=[            Request.from_url('https://example.com', use_extended_unique_key=True),            Request.from_url('https://crawlee.dev', use_extended_unique_key=True),            Request.from_url('https://apify.com', use_extended_unique_key=True),        ],    )    print('Requests were added to the queue')    is_empty = await rq.is_empty()    is_finished = await rq.is_finished()    print(f'Is empty: {is_empty}')    print(f'Is finished: {is_finished}')    request = await rq.fetch_next_request()    print(f'Fetched request: {request}')    await rq.add_request('https://facebook.com', forefront=True)    request = await rq.fetch_next_request()    print(f'Fetched request: {request}')    rq_default = await RequestQueue.open(        storage_client=FileSystemStorageClient(),        configuration=Configuration(purge_on_start=True),    )    await rq_default.add_request('https://example.com/1')    await rq_default.add_requests(        [            'https://example.com/priority-1',            'https://example.com/priority-2',            'https://example.com/priority-3',        ]    )    await rq_default.add_request('https://example.com/2')if __name__ == '__main__':    asyncio.run(main())```### BaseDatasetClient- Properties:    - `metadata`- Methods:    - `open`    - `purge`    - `drop`    - `push_data`    - `get_data`    - `iterate_items`### BaseKeyValueStoreClient- Properties:    - `metadata`- Methods:    - `open`    - `purge`    - `drop`    - `get_value`    - `set_value`    - `delete_value`    - `iterate_keys`    - `get_public_url`### BaseRequestQueueClient- Properties:    - `metadata`- Methods:    - `open`    - `purge`    - `drop`- `add_requests_batch` -> `add_batch_of_requests` (one backend methodfor 2 frontend methods)    - `get_request`    - `fetch_next_request`    - `mark_request_as_handled`    - `reclaim_request`    - `is_empty`- Models    - `RequestQueueHeadWithLocks` -> `RequestQueueHead`    - `BatchRequestsOperationResponse` -> `AddRequestsResponse`- Notes:- Old file system (memory) version didn't persist the in-progressrequests- Old file system (memory) version didn't persist the forefront values(now there is a FS-specific `_sequence` field in the FS Request)- The methods manipulating locks and listing heads are now only internalmethods of Apify RQ client.## Issues-Closes:#92-Closes:#147-Closes:#783-Closes:#1247- Relates:#1175- Relates:#1191## Testing- The original tests were mostly removed and replaced with a new ones.- Each storage-client implementation now has its own dedicated tests atthe client level (more targeted/edge-case coverage).- On top of that, there are storage-level tests that use a parametrizedfixture for each storage client (`file-system` and `memory`), ensuringevery storage test runs against every client implementation.## Checklist- [x] CI passed

Pijukatel reviewed

Jul 2, 2025

View reviewed changes

src/apify/storage_clients/_apify/_dataset_client.py OutdatedShow resolvedHide resolved

renaming

3cd7dfe

vdusek requested a review fromPijukatel

July 2, 2025 10:54

Pijukatel approved these changes

Jul 2, 2025

View reviewed changes

Copy link

ContributorAuthor

vdusek commentedJul 2, 2025

Probably let's wait for@janbuchar, and also I'll add an upgrading guide for v2.

vdusekand others added7 commits

July 3, 2025 16:08

Merge branch 'master' into new-apify-storage-clients

6fe9eb3

fixes after merge commit

1547cbd

Merge branch 'master' into new-apify-storage-clients

bb47efc

Change from orphan commit to master in crawlee version

4e4fa93

Merge branch 'master' into new-apify-storage-clients

683cb31

fix encrypted secrets test

e5b2bc4

Add Apify's version of FS client that keeps the INPUT json

638756f

janbuchar reviewed

Jul 15, 2025

View reviewed changes

Copy link

Contributor

janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I haven't made it through the request queue client, but here's a batch of comments.

src/apify/storage_clients/_file_system/_key_value_store_client.py

		"""

		@override
		async def purge(self) -> None:

Copy link

Contributor

janbucharJul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Does this override mean that Crawlee does not treatINPUT.json differently from other files anymore and this logic is only added in the SDK? I'm a fan, but I wonder if the rest of the team is on board with this.

Copy link

ContributorAuthor

vdusekJul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Exactly - because Crawlee does not useINPUT.json for anything. We are in the same boat with@Pijukatel here.

@B4nan, would you be OK with this change? TLDR; so you don't have to study the code: As Honza wrote, Crawlee now treatsINPUT.json just like any other file. Here in SDK, I've added a minimal override ofFileSystemStorageClient that only customizes thepurge method, preventing it from deletingINPUT.json. That's it.

Copy link

Member

B4nanJul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

are we talking about the input.json in the project root? since that is implemented on crawlee level (in JS) and i think it should stay this way. i am not talking about the purging logic, in JS, if there is an INPUT.json file in the root,KVS.getInput() will prefer it over anything in the storage folder (the local/memory implementation, with apify client it is ignored completely). not sure if we even do this in the python version (and I am open to discuss if we want to remove this from v4, but it feels like a good thing to me)

if this is just about purging, i am not entirely sure i understand what it means, we dont want to autopurge inputs on crawlee level, crawlee shouldnt behave differently because you init the SDK (not locally).

Copy link

Member

B4nanJul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

this is the part where we handle input file in the project root in JS

https://github.com/apify/crawlee/blob/5c4726df96e358a9bbf44a0cd2760e4e269f0fae/packages/core/src/storages/key_value_store.ts#L697

Copy link

Contributor

janbucharJul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

From the "Crawlee is a web scraping framework" point of view, handling input feels a bit redundant here - if you intend to only run the crawler outside of Apify, you'll probably want to accept input some other way - CLI arguments, environment variables, some domain-speciffic config file... Providing a helper for loading a JSON file from two possible locations doesn't add much value IMO.

Asking Apifiers about this doesn't make much sense in this situation, because basically everything we write is intended to run on Apify 🙂

src/apify/storage_clients/_apify/_key_value_store_client.py OutdatedShow resolvedHide resolved

vdusek added4 commits

July 16, 2025 09:31

update metadata fixes

931b0ca

Merge branch 'master' into new-apify-storage-clients

ad7c0d8

KVS metadata extended model

1f3c481

fix url signing secret key

44d8e09

Pijukatel self-requested a review

July 17, 2025 13:23

Pijukatel requested changes

Jul 17, 2025

View reviewed changes

Copy link

Contributor

Pijukatel left a comment•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

While testing the change on platform I used modified benchmark tests. They are specific, because they scrape locally hosted site and so the speed of scraping is unusually fast - no network related delays. This exposes RequestQueue race condition that can happen when checking if the RQ is empty or not.
When running the test using this branch, I saw that the crawler premature finished even though RQ was still quite full. The hypothesis about premature exit was confirmed when running same test on branch modified with extra log and sleep.

Please see following tiny commit based on this branch and attached log:
5b25728