
Issue During Batch Insert


Description

I am running Weaviate locally in a Docker container and using the Weaviate Python client. When I try to insert a large batch of data, I get a “Deadline Exceeded” error.

Code:

import weaviate
import os

client = weaviate.Client(
    url="http://localhost:8080",
    additional_headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]  # Replace with your inference API key
    },
)

client.schema.create_class({
    "class": "work_steps",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "generative-openai": {}
    }
})

# data_json is my dataset, loaded elsewhere; it contains 106,954 records
work_steps_data = [
    {"wtd_text": d["wtd_text"], "wta_text": d["wta_text"]}
    for d in data_json
]

try:
    client.batch.create_objects(work_steps_data)
except weaviate.exceptions.WeaviateBatchError as e:
    print(f"Error: {e}")

Error:

{
    "name": "WeaviateBatchError",
    "message": "Query call with protocol GRPC batch failed with message <AioRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = \"Deadline Exceeded\"
    debug_error_string = \"UNKNOWN:Error received from peer  {grpc_message:\"Deadline Exceeded\", grpc_status:4, created_time:\"2024-08-01T18:14:40.555441469+04:00\"}\"
>.",
    ...
}

docker-compose.yaml

version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.25.6
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data:/var/lib/weaviate
    environment:
      CLIP_INFERENCE_API: 'http://multi2vec-clip:8080'
      OPENAI_APIKEY: $OPENAI_APIKEY
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'multi2vec-clip'
      ENABLE_MODULES: 'multi2vec-clip,generative-openai,generative-cohere,text2vec-openai,text2vec-huggingface,text2vec-cohere,reranker-cohere'
      CLUSTER_HOSTNAME: 'node1'
    restart: on-failure:0
  multi2vec-clip:
    image: semitechnologies/multi2vec-clip:sentence-transformers-clip-ViT-B-32-multilingual-v1
    environment:
      ENABLE_CUDA: '0'
volumes:
  weaviate_data:

Additional Information

Docker Logs:

weaviate-1        | {"action":"startup","default_vectorizer_module":"multi2vec-clip","level":"info","msg":"the default vectorizer modules is set to \"multi2vec-clip\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-08-01T07:04:33Z"}
...
weaviate-1        | {"level":"warning","msg":"prop len tracker file /var/lib/weaviate/work_steps/iPkMMMILWoTR/proplengths does not exist, creating new tracker","time":"2024-08-01T08:54:43Z"}
...
multi2vec-clip-1  | INFO:     Model initialization complete
...
weaviate-1        | {"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2024-08-01T07:04:38Z"}

Problem

I am trying to insert a large dataset (106,954 records) into Weaviate, but I keep hitting a “Deadline Exceeded” error when using the batch insert functionality.

Questions

  1. How can I avoid the “Deadline Exceeded” error during batch insertion?
  2. Are there any recommended configurations or settings for handling large batch inserts?
  3. Is there a way to increase the timeout settings for gRPC batch operations in Weaviate? (My guess at what this would look like is sketched below.)
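
For question 3, my best guess from the v4 client docs is the AdditionalConfig/Timeout options below. The timeout values (in seconds) are placeholders I have not tuned, and I have not confirmed that this resolves the error:

import os
import weaviate
from weaviate.classes.init import AdditionalConfig, Timeout

# Untested sketch: raise the client-side timeouts (the values are guesses).
client = weaviate.connect_to_local(
    port=8080,
    grpc_port=50051,
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
    additional_config=AdditionalConfig(
        timeout=Timeout(init=30, query=60, insert=300)  # "insert" covers batch writes
    ),
)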

Any assistance or recommendations would be greatly appreciated. Thank you!

P.S.:

I used the workaround below for the batch insert to avoid the error:

import time

work_step_col = client.collections.get("work_steps")
# work_step_col.data.insert_many(work_steps_data)  # inserting everything in one call hit the Deadline Exceeded error

batch_size = 1000  # adjust the batch size as needed
for i in range(0, len(work_steps_data), batch_size):
    batch = work_steps_data[i:i + batch_size]
    work_step_col.data.insert_many(batch)
    time.sleep(2)  # pause between batches
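
I also noticed that the v4 client has a built-in batching context manager that, as I understand it, manages batch sizing and retries on its own. This is an untested sketch; batch_size and concurrent_requests are guesses:

# Untested sketch using the v4 client's fixed-size batching helper.
with work_step_col.batch.fixed_size(batch_size=1000, concurrent_requests=2) as batch:
    for obj in work_steps_data:
        batch.add_object(properties=obj)

# Objects that still failed after the client's automatic retries, if any:
if work_step_col.batch.failed_objects:
    print(f"{len(work_step_col.batch.failed_objects)} objects failed")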

It has been 15 minutes and counting, so I am posting this anyway.

