Description
We have a setup with multiple Documents that are chunked into Chunks. For some of these documents, we have an automated service that updates the Document daily. To correctly update the documents we:
- Get all UUIDs of Chunks belonging to that specific Document
- Generate deterministic uuid5 values to calculate the uuids for all new chunks
- Figure out which chunks to delete and which chunks to add
- Add only the new chunks
- Delete the chunks that are no longer relevant
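To make steps 2 and 3 concrete, here is a minimal sketch of how the diff can be computed. The namespace, the helper names (chunk_uuid, plan_update), and the choice to derive the uuid5 from the document id plus the chunk text are assumptions for illustration only, not necessarily what our service does exactly.

```python
import uuid

# Assumption: any fixed namespace works, as long as it never changes between runs.
CHUNK_NAMESPACE = uuid.NAMESPACE_URL


def chunk_uuid(document_id: str, chunk_text: str) -> str:
    """Return a deterministic UUID for a chunk (same input, same UUID)."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{document_id}:{chunk_text}"))


def plan_update(existing_uuids, document_id, new_chunks):
    """Compare stored chunk UUIDs with the freshly computed ones.

    Returns (to_add, to_delete):
      to_add    - dict of uuid -> chunk text that is not stored yet
      to_delete - set of stored uuids that no longer appear in the document
    """
    new_by_uuid = {chunk_uuid(document_id, text): text for text in new_chunks}
    existing = set(existing_uuids)
    to_add = {u: t for u, t in new_by_uuid.items() if u not in existing}
    to_delete = existing - set(new_by_uuid)
    return to_add, to_delete
```

Unchanged chunks appear in neither result, so they are never re-uploaded. The input existing_uuids is exactly what step 1 has to provide.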
This allows us to:
- have a fallback if any of the steps fail
- not reupload unnecessary Chunks
- save some cost & bandwidth
However, step 1 is giving us some challenges: to achieve it, we need to query all existing chunks of a document. The ‘normal’ Get with offset doesn’t work above QUERY_MAXIMUM_RESULTS, so the only other option we’ve seen so far is the Cursor API, which requires us to dump our entire Weaviate instance. That can’t be the suggested way to achieve this.
So, I’m wondering how we’re supposed to solve this problem. We can’t find anything in the documentation so far, and we’re slightly scared of the implications of increasing QUERY_MAXIMUM_RESULTS.
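To illustrate why the cursor route feels wrong to us, here is a rough sketch of what it amounts to. fetch_page is a hypothetical wrapper around a cursor query (given the last seen UUID, return the next page of objects), and document_id is a hypothetical property name. As far as we can tell the cursor cannot be combined with a filter, so the filtering has to happen on the client.

```python
from typing import Callable, Dict, Iterator, List, Optional, Set

# Hypothetical signature: fetch_page(after_uuid, limit) -> list of objects,
# where each object is a dict with at least "id" and "document_id".
FetchPage = Callable[[Optional[str], int], List[Dict[str, str]]]


def iter_all_objects(fetch_page: FetchPage, page_size: int = 100) -> Iterator[Dict[str, str]]:
    """Walk the whole collection page by page using the last UUID as cursor."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor, page_size)
        if not page:
            return
        yield from page
        cursor = page[-1]["id"]  # next page starts after the last object seen


def chunk_uuids_for_document(fetch_page: FetchPage, document_id: str,
                             page_size: int = 100) -> Set[str]:
    """Collect the UUIDs of one document's chunks by scanning every object."""
    return {
        obj["id"]
        for obj in iter_all_objects(fetch_page, page_size)
        if obj["document_id"] == document_id
    }
```

Every daily update of a single Document would then read every Chunk in the instance, which is the part we would like to avoid.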
Server Setup Information
- Weaviate Server Version: 1.24.6
- Deployment Method: Docker
- Multi Node? Number of Running Nodes: 1
- Client Language and Version: Python v3
- Multitenancy?: Nope
Any additional Information
Not really