Retrieve Corpus

Overview

Fetch detailed metadata for a single corpus by its unique identifier. This endpoint returns the same comprehensive information as the list endpoint but scoped to one record. Only the corpus owner can retrieve this data.

Use this endpoint to check indexing status after creation, refresh UI state after updates, or confirm that a corpus exists before running other operations.

Authentication

Requires a valid JWT token or session authentication. You must own the target corpus.

Path Parameters

UUID

required

The corpus identifier returned during creation or from the list endpoint. Example: 8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd
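Because the path parameter must be a UUID, a client can reject malformed identifiers before sending a request. The following is a minimal sketch using only the Python standard library; the helper name `corpus_url` is hypothetical and not part of any SDK.

```python
import uuid


def corpus_url(base_url: str, corpus_id: str) -> str:
    """Build the retrieve URL, raising ValueError if corpus_id is not a UUID."""
    parsed = uuid.UUID(corpus_id)  # raises ValueError on malformed input
    # str(parsed) is the canonical lowercase, hyphenated form
    return f"{base_url.rstrip('/')}/api/corpora/{parsed}/"
```

Validating locally avoids a round trip for obviously invalid input; the server remains the authority on whether the corpus exists.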

Example request

curl -X GET https://{your-host}/api/corpora/8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd/ \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"

Example response

{
  "id": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd",
  "created_at": "2024-09-01T10:05:03.291Z",
  "updated_at": "2024-09-01T10:08:11.522Z",
  "corpora_name": "support_playbooks",
  "description": "Runbooks feeding LlamaIndex",
  "size_on_disk": 4194304.0,
  "index_location": "qdrant_free_collection",
  "is_published": false,
  "index_type": "VSI",
  "indexing_status": "IND",
  "creator": "eb81c1d5-78fe-4f35-b58e-0ff6a3ad5d12"
}

Response Structure

id

UUID

Unique corpus identifier.

created_at

timestamp

ISO 8601 timestamp when the corpus was created.

updated_at

timestamp

Last modification timestamp. Updates when metadata changes or resources are added.
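As an illustration of working with these timestamps, the sketch below parses the ISO 8601 strings and computes how long after creation the corpus was last modified. The function names are hypothetical. The trailing `Z` is rewritten because `datetime.fromisoformat` only accepts it from Python 3.11 onward.

```python
from datetime import datetime, timedelta


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-09-01T10:05:03.291Z."""
    if value.endswith("Z"):
        # Replace the UTC designator with an explicit offset for older Pythons
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def time_since_creation(corpus: dict) -> timedelta:
    """Return updated_at minus created_at for a corpus response."""
    return parse_timestamp(corpus["updated_at"]) - parse_timestamp(corpus["created_at"])
```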

corpora_name

string

Normalized corpus name (lowercase with underscores).

description

string

User-provided description of the corpus purpose.

size_on_disk

float

Storage size in bytes consumed by this corpus and its resources.

index_location

string

Vector database collection identifier where embeddings are stored.

is_published

boolean

Public visibility flag. true makes the corpus discoverable by other users.

index_type

string

Indexing strategy: VSI (Vector Store Index), SMI (Summary Index), or DSI (Document Summary Index).

indexing_status

string

Current processing state:

PND - Pending (waiting for resources)
IQE - In Queue
PRS - Processing (indexing in progress)
DEX - Data Extracted Successfully
DER - Data Extraction Error
IND - Indexed (ready for queries)
CMP - Completed
ERR - Error (indexing failed)
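One way to consume these codes in client code is to group them into coarse outcomes. The sketch below is an assumption about how a client might do that, not documented server behavior: it treats only IND as ready (as stated above), treats ERR and DER as failures, and reports every other code from the response schema as still in progress. Whether CMP should also count as ready is not stated on this page.

```python
# Labels taken from the response schema's indexing_status enum
STATUS_LABELS = {
    "PND": "Pending",
    "IQE": "In Queue",
    "PRS": "Processing",
    "DEX": "Data Extracted Successfully",
    "DER": "Data Extraction Error",
    "IND": "Indexed",
    "CMP": "Completed",
    "ERR": "Error",
}

# Assumed groupings; adjust if your deployment treats CMP as queryable
READY = frozenset({"IND"})
FAILED = frozenset({"ERR", "DER"})


def classify_status(code: str) -> str:
    """Return 'ready', 'failed', or 'in_progress' for a known status code."""
    if code not in STATUS_LABELS:
        raise ValueError(f"Unknown indexing_status: {code!r}")
    if code in READY:
        return "ready"
    if code in FAILED:
        return "failed"
    return "in_progress"
```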

creator

UUID

User ID of the corpus owner.

Common Use Cases

Poll for Indexing Completion

After creating a corpus and uploading resources, poll this endpoint to check when indexing completes:

import time
import requests

def wait_for_indexing(base_url, token, corpus_id, timeout=300):
    """Poll the corpus until it is indexed, fails, or the timeout elapses."""
    start = time.monotonic()  # monotonic clock is unaffected by system time changes
    while time.monotonic() - start < timeout:
        response = requests.get(
            f"{base_url}/api/corpora/{corpus_id}/",
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        response.raise_for_status()  # Surface 401/404/5xx instead of a KeyError below
        corpus = response.json()

        if corpus["indexing_status"] == "IND":
            print("Corpus ready for queries!")
            return corpus
        elif corpus["indexing_status"] in ("ERR", "DER"):
            raise RuntimeError("Indexing failed")

        time.sleep(5)  # Poll every 5 seconds

    raise TimeoutError("Indexing did not complete in time")

Verify Update Success

After modifying corpus settings, fetch the latest state to confirm changes:

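# Assumes base_url, corpus_id, and headers = {"Authorization": f"Bearer {token}"}
# are already defined.

# Update the corpus
response = requests.patch(
    f"{base_url}/api/corpora/{corpus_id}/",
    headers=headers,
    json={"is_published": True},
    timeout=30,
)
response.raise_for_status()

# Verify the change
response = requests.get(
    f"{base_url}/api/corpora/{corpus_id}/",
    headers=headers,
    timeout=30,
)
response.raise_for_status()
corpus = response.json()

assert corpus["is_published"] is True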

Check Storage Usage

Monitor storage consumption before adding more resources:

corpus = requests.get(
    f"{base_url}/api/corpora/{corpus_id}/",
    headers=headers
).json()

size_mb = corpus["size_on_disk"] / (1024 * 1024)
print(f"Current storage: {size_mb:.2f} MB")

# Warn above an example threshold (500 MB here; adjust to your deployment)
if size_mb > 500:
    print("Warning: Large corpus size may affect query performance")

Validate Before Operations

Ensure corpus exists and is ready before performing operations:

try:
    response = requests.get(
        f"{base_url}/api/corpora/{corpus_id}/",
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()  # Without this, a 404 never raises HTTPError
    corpus = response.json()

    # Check if ready for queries
    if corpus["indexing_status"] != "IND":
        raise ValueError("Corpus not fully indexed yet")

    # Proceed with operation (query_corpus is a placeholder for your own code)
    query_corpus(corpus_id)
except requests.HTTPError as e:
    if e.response.status_code == 404:
        print("Corpus not found or access denied")
    else:
        raise

Security Note: Requests return 404 Not Found if the corpus belongs to another user. This prevents leaking corpus existence to unauthorized users.
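Since a 404 covers both a missing corpus and a corpus owned by someone else, client code cannot tell the two cases apart and should handle them identically. A minimal sketch of that handling follows; `corpus_or_none` is a hypothetical helper that takes the status code and decoded body from whatever HTTP client you use.

```python
from typing import Optional


def corpus_or_none(status_code: int, body: Optional[dict]) -> Optional[dict]:
    """Map a retrieve response to a corpus dict, or None when it is not visible."""
    if status_code == 404:
        # Missing and not-owned are indistinguishable by design
        return None
    if status_code >= 400:
        raise RuntimeError(f"Retrieve corpus failed: {status_code}")
    return body
```

With `requests`, this could be called as `corpus_or_none(response.status_code, response.json() if response.ok else None)`.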

Client examples

Python
TypeScript / JavaScript
Java

import os
import requests

BASE_URL = "https://your-soar-instance.com"
TOKEN = os.environ["SOAR_LABS_TOKEN"]
CORPUS_ID = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"

response = requests.get(
    f"{BASE_URL}/api/corpora/{CORPUS_ID}/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()
corpus = response.json()

const BASE_URL = "https://your-soar-instance.com";
const token = process.env.SOAR_LABS_TOKEN!;

async function getCorpus(corpusId: string) {
  const response = await fetch(`${BASE_URL}/api/corpora/${corpusId}/`, {
    headers: {
      Authorization: `Bearer ${token}`,
    },
  });

  if (!response.ok) {
    throw new Error(`Retrieve corpus failed: ${response.status}`);
  }

  return response.json();
}

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

var BASE_URL = "https://your-soar-instance.com";
var token = System.getenv("SOAR_LABS_TOKEN");
var corpusId = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd";

var request = HttpRequest.newBuilder(URI.create(BASE_URL + "/api/corpora/" + corpusId + "/"))
    .header("Authorization", "Bearer " + token)
    .GET()
    .build();

var response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

if (response.statusCode() >= 400) {
    throw new RuntimeException("Retrieve corpus failed: " + response.statusCode());
}

var body = response.body();

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Path Parameters

string<uuid>

required

A UUID string identifying this corpus.

Response

200 - application/json

id

string<uuid>

required

created_at

string<date-time>

required

The date and time the corpus was created

updated_at

string<date-time>

required

Last updated time

corpora_name

string

required

Name of the corpus

Maximum string length: 100

size_on_disk

number<double>

required

Size of the corpus on disk (in bytes)

index_location

string | null

required

Location of the index on Remote Storage

creator

string<uuid>

required

description

string | null

Description of the corpus

is_published

boolean

Whether the corpus is visible to all users

index_type

enum<string>

Type of index to be used for the corpus

VSI - VectorStoreIndex
SMI - SummaryIndex
DSI - DocumentSummaryIndex

Available options: VSI, SMI, DSI

indexing_status

enum<string>

Processing status of the corpus

PND - Pending
IQE - In Queue
PRS - Processing
DEX - Data Extracted Successfully
DER - Data Extraction Error
IND - Indexed
CMP - Completed
ERR - Error

Available options: PND, IQE, PRS, DEX, DER, IND, CMP, ERR
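The schema above can be mirrored in a small typed model so that missing required fields are caught at the client boundary. This is a sketch under assumptions: the `Corpus` class and `parse_corpus` function are hypothetical names, the `id` field name is taken from the example response, and timestamps are kept as strings for simplicity.

```python
from dataclasses import dataclass
from typing import Optional

# Fields marked "required" in the response schema
REQUIRED_FIELDS = (
    "id", "created_at", "updated_at", "corpora_name",
    "size_on_disk", "index_location", "creator",
)


@dataclass(frozen=True)
class Corpus:
    id: str
    created_at: str
    updated_at: str
    corpora_name: str
    size_on_disk: float
    index_location: Optional[str]  # required, but may be null
    creator: str
    description: Optional[str] = None
    is_published: Optional[bool] = None
    index_type: Optional[str] = None
    indexing_status: Optional[str] = None


def parse_corpus(payload: dict) -> Corpus:
    """Validate required fields and build a Corpus, ignoring unknown keys."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"Missing required fields: {', '.join(missing)}")
    if len(payload["corpora_name"]) > 100:
        raise ValueError("corpora_name exceeds the 100-character maximum")
    known = {name: payload[name] for name in Corpus.__dataclass_fields__ if name in payload}
    return Corpus(**known)
```

Ignoring unknown keys keeps the client tolerant of fields the server may add later.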
