Create Corpus

Overview

Create a corpus before uploading any resources. Every corpus belongs to the authenticated user and encapsulates indexing configuration (vector index type, publication flag, etc.). Back-end logic normalizes the name into lowercase snake_case and enforces uniqueness per user.
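To preview what a name is likely to become before you send it, the normalization can be approximated client-side. This is a minimal sketch, not the back-end code: the function name normalize_corpus_name is hypothetical, and the handling of punctuation and repeated separators is an assumption. The only documented behavior is that "Support Playbooks" becomes "support_playbooks".

```python
import re


def normalize_corpus_name(name: str) -> str:
    """Approximate the documented lowercase snake_case normalization."""
    # Assumption: every run of non-alphanumeric characters collapses into a
    # single underscore. The real back end may treat punctuation differently.
    collapsed = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    # Drop leading/trailing underscores left over from punctuation, then lowercase.
    return collapsed.strip("_").lower()


print(normalize_corpus_name("Support Playbooks"))  # support_playbooks
```

Treat the result as a preview only; the corpora_name returned by the API is the authoritative value.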

Prerequisite: You must be authenticated with a valid JWT token or session cookie.

Request Body

corpora_name

string

required

Human-friendly name for your corpus. The system automatically converts it to lowercase with underscores (e.g., “Support Playbooks” becomes “support_playbooks”). Must be unique for your user account.

description

string

Optional context about what content lives in this corpus. Helps you and your team understand the corpus purpose.

is_published

boolean

default:"false"

Controls whether the corpus is discoverable via public listings. Set to true for shared knowledge bases.

index_type

string

default:"VSI"

Indexing strategy for the corpus. Available options:

VSI - Vector Store Index (recommended for semantic search)
SMI - Summary Index
DSI - Document Summary Index

indexing_status

string

System-managed field that tracks indexing progress. Leave it empty; Soar Labs updates it automatically as ingestion jobs complete. Main status values (the Body reference below lists the full set):

PND - Pending (newly created)
PRS - Processing (ingestion in progress)
IND - Indexed (ready for queries)
ERR - Error (ingestion failed)
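The field rules above can be checked before a request is sent. The following is a rough sketch under stated assumptions: build_corpus_payload is a hypothetical helper, not part of any Soar Labs SDK, and the validation only mirrors what this page documents (required name, 100-character limit, three index types). The server remains the source of truth.

```python
ALLOWED_INDEX_TYPES = {"VSI", "SMI", "DSI"}
MAX_NAME_LENGTH = 100  # "Maximum string length: 100" from the Body reference


def build_corpus_payload(name, description=None, is_published=False, index_type="VSI"):
    """Build a request body for POST /api/corpora/ (hypothetical helper)."""
    if not name or not name.strip():
        raise ValueError("corpora_name is required")
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"corpora_name exceeds {MAX_NAME_LENGTH} characters")
    if index_type not in ALLOWED_INDEX_TYPES:
        raise ValueError(f"index_type must be one of {sorted(ALLOWED_INDEX_TYPES)}")

    payload = {
        "corpora_name": name,
        "is_published": is_published,
        "index_type": index_type,
    }
    # description is optional, so include it only when provided.
    if description is not None:
        payload["description"] = description
    # indexing_status is deliberately omitted: it is system-managed.
    return payload
```

Calling build_corpus_payload("Support Playbooks", description="Runbooks feeding LlamaIndex") yields the same fields as the curl example that follows.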

Example request

curl -X POST https://{your-host}/api/corpora/ \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "corpora_name": "Support Playbooks",
    "description": "Runbooks feeding LlamaIndex",
    "is_published": false,
    "index_type": "VSI"
  }'

Response

id

UUID

Unique identifier for the corpus. Use this ID in all subsequent operations (uploading resources, querying, etc.).

created_at

timestamp

ISO 8601 timestamp when the corpus was created.

updated_at

timestamp

ISO 8601 timestamp of the last update to corpus metadata.

corpora_name

string

Normalized corpus name in lowercase snake_case format.

description

string

User-provided description of the corpus content and purpose.

size_on_disk

float

Total storage size in bytes. Starts at 0.0 for a new corpus and updates as resources are ingested.

index_location

string | null

Storage location of the vector index (e.g., "qdrant_free_collection"). null until the first resource is indexed.

is_published

boolean

Whether the corpus is publicly discoverable.

index_type

string

The indexing strategy: VSI (Vector Store), SMI (Summary), or DSI (Document Summary).

indexing_status

string

Current indexing status: PND (Pending), PRS (Processing), IND (Indexed), or ERR (Error).

creator

UUID

User ID of the corpus creator. Read-only field for ownership tracking.

Example Response

{
  "id": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd",
  "created_at": "2024-09-01T10:05:03.291Z",
  "updated_at": "2024-09-01T10:05:03.291Z",
  "corpora_name": "support_playbooks",
  "description": "Runbooks feeding LlamaIndex",
  "size_on_disk": 0.0,
  "index_location": null,
  "is_published": false,
  "index_type": "VSI",
  "indexing_status": "PND",
  "creator": "eb81c1d5-78fe-4f35-b58e-0ff6a3ad5d12"
}
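A response like the one above can be loaded into a typed object so that the id and status are easy to pass around. This is a minimal sketch: the Corpus dataclass and parse_corpus function are illustrative names, and only a subset of the documented fields is mapped.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

EXAMPLE = """{
  "id": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd",
  "created_at": "2024-09-01T10:05:03.291Z",
  "corpora_name": "support_playbooks",
  "size_on_disk": 0.0,
  "index_location": null,
  "index_type": "VSI",
  "indexing_status": "PND"
}"""


@dataclass
class Corpus:
    id: str
    corpora_name: str
    index_type: str
    indexing_status: str
    size_on_disk: float
    index_location: Optional[str]  # null until the first resource is indexed
    created_at: datetime


def parse_corpus(raw: str) -> Corpus:
    data = json.loads(raw)
    return Corpus(
        id=data["id"],
        corpora_name=data["corpora_name"],
        index_type=data["index_type"],
        indexing_status=data["indexing_status"],
        size_on_disk=float(data["size_on_disk"]),
        index_location=data["index_location"],
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        created_at=datetime.fromisoformat(data["created_at"].replace("Z", "+00:00")),
    )


corpus = parse_corpus(EXAMPLE)
print(corpus.id, corpus.indexing_status)
```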

Best Practices

Check Name Availability

Use GET /api/check_corpora_name/?corpora_name=your_name to validate name availability before creating a corpus. This prevents 400 errors from duplicate names.

curl -X GET "https://{your-host}/api/check_corpora_name/?corpora_name=support_playbooks" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
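The same check can be sketched in Python. The helper names below are hypothetical, the standard library's urllib is used only to keep the sketch dependency-free, and the shape of the response body is not documented on this page, so check_name simply returns whatever JSON the endpoint sends back (assuming it returns JSON at all).

```python
import json
import urllib.parse
import urllib.request


def check_name_url(base_url: str, name: str) -> str:
    """Build the availability-check URL with the name safely URL-encoded."""
    query = urllib.parse.urlencode({"corpora_name": name})
    return f"{base_url.rstrip('/')}/api/check_corpora_name/?{query}"


def check_name(base_url: str, token: str, name: str):
    """Call the endpoint and return the decoded JSON body (shape is an assumption)."""
    request = urllib.request.Request(
        check_name_url(base_url, name),
        headers={"Authorization": f"Bearer {token}"},
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Encoding the name through urlencode matters when it contains spaces or punctuation, which a hand-built query string would pass through unescaped.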

Store the Corpus ID

Save the returned id field immediately - you’ll need it for:

Uploading resources (POST /api/data/files/, /urls/, /strings/)
Executing queries (POST /api/query/)
Retrieving corpus details (GET /api/corpora/{id}/)

Monitor Indexing Status

Track the indexing_status field as resources are added:

PND → PRS → IND: Normal progression
ERR: Check resource ingestion logs for failures

Poll GET /api/corpora/{id}/ to monitor status changes.
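A polling loop for this could look roughly like the sketch below. The function wait_for_indexing and its parameters are illustrative, and treating only IND and ERR as terminal is an assumption based on the progression described above; the full enum also lists values such as CMP and DER that you may want to handle. The fetch_status callable is expected to call GET /api/corpora/{id}/ and return the indexing_status string.

```python
import time
from typing import Callable

# Assumption: these two statuses end the polling loop.
TERMINAL_STATUSES = {"IND", "ERR"}


def wait_for_indexing(
    fetch_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 600.0,
) -> str:
    """Poll until the corpus reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"corpus still in status {status!r} after {timeout}s")
        time.sleep(interval)
```

Passing the fetch step in as a callable keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub.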

Understanding Read-Only Fields

The following fields are managed by Soar Labs and cannot be set directly:

size_on_disk - Updated as resources are indexed
index_location - Assigned when first resource is processed
creator - Automatically set to your user ID
id, created_at, updated_at - System-generated metadata
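If you reuse a fetched corpus object as the basis for a new request, it can help to drop these fields first. The sketch below is one way to do that; writable_fields is a hypothetical helper, and how the server reacts when read-only fields are sent anyway is not documented here.

```python
# Fields listed above as system-managed.
READ_ONLY_FIELDS = {
    "id",
    "created_at",
    "updated_at",
    "size_on_disk",
    "index_location",
    "creator",
}


def writable_fields(corpus: dict) -> dict:
    """Return a copy of the corpus dict without the read-only fields."""
    return {key: value for key, value in corpus.items() if key not in READ_ONLY_FIELDS}


fetched = {
    "id": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd",
    "corpora_name": "support_playbooks",
    "description": "Runbooks feeding LlamaIndex",
    "size_on_disk": 0.0,
    "creator": "eb81c1d5-78fe-4f35-b58e-0ff6a3ad5d12",
}
print(writable_fields(fetched))
```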

Choose VSI (Vector Store Index) for most use cases. It provides the best semantic search capabilities and works well with the advanced RAG retrieval pipeline.

Client examples

Python
TypeScript / JavaScript
Java

import os
import requests

BASE_URL = "https://your-soar-instance.com"
TOKEN = os.environ["SOAR_LABS_TOKEN"]

payload = {
    "corpora_name": "support_playbooks",
    "description": "Runbooks feeding LlamaIndex",
}

response = requests.post(
    f"{BASE_URL}/api/corpora/",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=30,
)
response.raise_for_status()
corpus = response.json()

const BASE_URL = "https://your-soar-instance.com";
const token = process.env.SOAR_LABS_TOKEN!;

async function createCorpus() {
  const response = await fetch(`${BASE_URL}/api/corpora/`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({
      corpora_name: "support_playbooks",
      description: "Runbooks feeding LlamaIndex",
    }),
  });

  if (!response.ok) {
    throw new Error(`Create corpus failed: ${response.status}`);
  }

  return response.json();
}

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateCorpus {
    public static void main(String[] args) throws Exception {
        var baseUrl = "https://your-soar-instance.com";
        var token = System.getenv("SOAR_LABS_TOKEN");

        // Request body: only corpora_name is required.
        var json = "{" +
            "\"corpora_name\":\"support_playbooks\"," +
            "\"description\":\"Runbooks feeding LlamaIndex\"" +
        "}";

        var request = HttpRequest.newBuilder(URI.create(baseUrl + "/api/corpora/"))
            .header("Authorization", "Bearer " + token)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();

        // send() blocks until the response arrives; it can throw
        // IOException or InterruptedException, hence "throws Exception".
        var response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() >= 400) {
            throw new RuntimeException("Create corpus failed: " + response.statusCode());
        }

        // JSON body of the newly created corpus.
        var body = response.body();
        System.out.println(body);
    }
}

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
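As a small illustration, the header can be built in one place and reused across calls. This is only a sketch; auth_headers is a hypothetical helper name, and the Content-Type entry is included because the create request sends JSON.

```python
def auth_headers(token: str) -> dict:
    """Return the headers used by the examples on this page."""
    token = token.strip()
    if not token:
        raise ValueError("token must not be empty")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

Failing early on an empty token avoids sending a request that can only be rejected as unauthenticated.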

Body

corpora_name

string

required

Name of the corpus.

Maximum string length: 100

description

string | null

Description of the corpus.

is_published

boolean

Whether the corpus is visible to all users.

index_type

enum<string>

Type of index to use for the corpus:

VSI - VectorStoreIndex
SMI - SummaryIndex
DSI - DocumentSummaryIndex

Available options: VSI, SMI, DSI

indexing_status

enum<string>

Processing status of the corpus:

PND - Pending
IQE - In Queue
PRS - Processing
DEX - Data Extracted Successfully
DER - Data Extraction Error
IND - Indexed
CMP - Completed
ERR - Error

Available options: PND, IQE, PRS, DEX, DER, IND, CMP, ERR

Response

201 - application/json

id

string<uuid>

required

created_at

string<date-time>

required

The date and time the corpus was created.

updated_at

string<date-time>

required

Last updated time.

corpora_name

string

required

Name of the corpus.

Maximum string length: 100

size_on_disk

number<double>

required

Size of the corpus on disk (in bytes).

index_location

string | null

required

Location of the index on remote storage.

creator

string<uuid>

required

description

string | null

Description of the corpus.

is_published

boolean

Whether the corpus is visible to all users.

index_type

enum<string>

Type of index used for the corpus:

VSI - VectorStoreIndex
SMI - SummaryIndex
DSI - DocumentSummaryIndex

Available options: VSI, SMI, DSI

indexing_status

enum<string>

Processing status of the corpus:

PND - Pending
IQE - In Queue
PRS - Processing
DEX - Data Extracted Successfully
DER - Data Extraction Error
IND - Indexed
CMP - Completed
ERR - Error

Available options: PND, IQE, PRS, DEX, DER, IND, CMP, ERR
