POST /api/corpora
{
  "id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "created_at": "2023-11-07T05:31:56Z",
  "updated_at": "2023-11-07T05:31:56Z",
  "corpora_name": "<string>",
  "size_on_disk": 123,
  "index_location": "<string>",
  "creator": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "description": "<string>",
  "is_published": true,
  "index_type": "VSI",
  "indexing_status": "PND"
}

Overview

Create a corpus before uploading any resources. Every corpus belongs to the authenticated user and encapsulates indexing configuration (vector index type, publication flag, etc.). Back-end logic normalizes the name into lowercase snake_case and enforces uniqueness per user.
Prerequisite: You must be authenticated with a valid JWT token or session cookie.
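
To preview the name a corpus is likely to receive, the normalization described above can be approximated on the client. This is a minimal sketch: the exact back-end rules are not documented on this page, so the helper `normalize_corpus_name` and its handling of punctuation are assumptions. Only the lowercase-with-underscores behavior comes from the text above.

```python
import re


def normalize_corpus_name(name: str) -> str:
    """Approximate the server-side normalization (assumed rules, not the actual back-end code)."""
    # Trim and lowercase, then collapse each run of non-alphanumeric characters into one underscore.
    lowered = name.strip().lower()
    return re.sub(r"[^a-z0-9]+", "_", lowered).strip("_")


print(normalize_corpus_name("Support Playbooks"))  # support_playbooks
```

Treat the result as a preview only; the name returned in the creation response is authoritative.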

Request Body

corpora_name
string
required
Human-friendly name for your corpus. The system automatically converts it to lowercase with underscores (e.g., “Support Playbooks” becomes “support_playbooks”). Must be unique for your user account.
description
string
Optional context about what content lives in this corpus. Helps you and your team understand the corpus purpose.
is_published
boolean
default:"false"
Controls whether the corpus is discoverable via public listings. Set to true for shared knowledge bases.
index_type
string
default:"VSI"
Indexing strategy for the corpus. Available options:
  • VSI - Vector Store Index (recommended for semantic search)
  • SMI - Summary Index
  • DSI - Document Summary Index
indexing_status
string
System-managed field that tracks indexing progress. Leave it empty; Soar Labs updates it automatically as ingestion jobs complete. Common status values (the Body schema further down lists the full set):
  • PND - Pending (newly created)
  • PRS - Processing (ingestion in progress)
  • IND - Indexed (ready for queries)
  • ERR - Error (ingestion failed)
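
The field rules above can be checked locally before a request is sent. The following is a sketch, not part of any official client: `build_create_payload` is a hypothetical helper, and it enforces only the constraints stated on this page (a required name of at most 100 characters, and one of the three documented index types).

```python
from typing import Any, Dict, Optional

INDEX_TYPES = {"VSI", "SMI", "DSI"}  # documented index_type options
MAX_NAME_LENGTH = 100                # documented maximum length of corpora_name


def build_create_payload(
    corpora_name: str,
    description: Optional[str] = None,
    is_published: bool = False,
    index_type: str = "VSI",
) -> Dict[str, Any]:
    """Validate inputs locally and return a JSON-serializable request body."""
    if not corpora_name.strip():
        raise ValueError("corpora_name is required")
    if len(corpora_name) > MAX_NAME_LENGTH:
        raise ValueError(f"corpora_name exceeds {MAX_NAME_LENGTH} characters")
    if index_type not in INDEX_TYPES:
        raise ValueError(f"index_type must be one of {sorted(INDEX_TYPES)}")

    payload: Dict[str, Any] = {
        "corpora_name": corpora_name,
        "is_published": is_published,
        "index_type": index_type,
    }
    if description is not None:
        payload["description"] = description
    # indexing_status is deliberately omitted because it is system-managed.
    return payload
```

Name uniqueness cannot be verified locally, so a duplicate name can still be rejected by the server.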

Example request

curl -X POST https://{your-host}/api/corpora/ \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "corpora_name": "Support Playbooks",
    "description": "Runbooks feeding LlamaIndex",
    "is_published": false,
    "index_type": "VSI"
  }'

Response

id
UUID
Unique identifier for the corpus. Use this ID in all subsequent operations (uploading resources, querying, etc.).
created_at
timestamp
ISO 8601 timestamp when the corpus was created.
updated_at
timestamp
ISO 8601 timestamp of the last update to corpus metadata.
corpora_name
string
Normalized corpus name in lowercase snake_case format.
description
string
User-provided description of the corpus content and purpose.
size_on_disk
float
Total storage size in bytes. Initially 0.0 for new corpora; updated as resources are ingested.
index_location
string | null
Storage location of the vector index (e.g., "qdrant_free_collection"). null until the first resource is indexed.
is_published
boolean
Whether the corpus is publicly discoverable.
index_type
string
The indexing strategy: VSI (Vector Store), SMI (Summary), or DSI (Document Summary).
indexing_status
string
Current indexing status: PND (Pending), PRS (Processing), IND (Indexed), or ERR (Error).
creator
UUID
User ID of the corpus creator. Read-only field for ownership tracking.

Example Response

{
  "id": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd",
  "created_at": "2024-09-01T10:05:03.291Z",
  "updated_at": "2024-09-01T10:05:03.291Z",
  "corpora_name": "support_playbooks",
  "description": "Runbooks feeding LlamaIndex",
  "size_on_disk": 0.0,
  "index_location": null,
  "is_published": false,
  "index_type": "VSI",
  "indexing_status": "PND",
  "creator": "eb81c1d5-78fe-4f35-b58e-0ff6a3ad5d12"
}
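
A response like the one above can be mapped onto a typed object so that UUIDs and timestamps are parsed once. This is a sketch under assumptions: the `Corpus` dataclass and its `from_json` method are hypothetical names, and fields not marked as required in the response schema are read defensively.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional
from uuid import UUID


def _parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so replace it with an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass(frozen=True)
class Corpus:
    id: UUID
    created_at: datetime
    updated_at: datetime
    corpora_name: str
    size_on_disk: float
    index_location: Optional[str]
    creator: UUID
    description: Optional[str] = None
    is_published: Optional[bool] = None
    index_type: Optional[str] = None
    indexing_status: Optional[str] = None

    @classmethod
    def from_json(cls, data: Dict[str, Any]) -> "Corpus":
        return cls(
            id=UUID(data["id"]),
            created_at=_parse_timestamp(data["created_at"]),
            updated_at=_parse_timestamp(data["updated_at"]),
            corpora_name=data["corpora_name"],
            size_on_disk=float(data["size_on_disk"]),
            index_location=data["index_location"],
            creator=UUID(data["creator"]),
            # The remaining fields are optional in the schema, so use .get().
            description=data.get("description"),
            is_published=data.get("is_published"),
            index_type=data.get("index_type"),
            indexing_status=data.get("indexing_status"),
        )
```

With a parsed JSON body, `Corpus.from_json(body).id` gives the UUID needed for later upload and query calls.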

Best Practices

Use GET /api/check_corpora_name/?corpora_name=your_name to validate name availability before creating a corpus. This prevents 400 errors from duplicate names.
curl -X GET "https://{your-host}/api/check_corpora_name/?corpora_name=support_playbooks" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
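
The same availability check can be scripted. The sketch below uses only the Python standard library; `build_check_url` and `check_corpora_name` are hypothetical helper names. Because the response body of this endpoint is not documented on this page, the helper returns the raw status code and body rather than interpreting them.

```python
import urllib.error
import urllib.request
from typing import Tuple
from urllib.parse import urlencode


def build_check_url(base_url: str, corpora_name: str) -> str:
    """Return the availability-check URL with the name safely percent-encoded."""
    query = urlencode({"corpora_name": corpora_name})
    return f"{base_url.rstrip('/')}/api/check_corpora_name/?{query}"


def check_corpora_name(base_url: str, token: str, corpora_name: str) -> Tuple[int, str]:
    """Call the endpoint and return (HTTP status, raw body) without parsing the body."""
    request = urllib.request.Request(
        build_check_url(base_url, corpora_name),
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; return the status and body to the caller instead.
        return err.code, err.read().decode("utf-8")
```

Encoding the query with `urlencode` keeps names containing spaces or `&` from corrupting the URL.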
Save the returned id field immediately - you’ll need it for:
  • Uploading resources (POST /api/data/files/, /urls/, /strings/)
  • Executing queries (POST /api/query/)
  • Retrieving corpus details (GET /api/corpora/{id}/)
Track the indexing_status field as resources are added:
  • PND → PRS → IND: Normal progression
  • ERR: Check resource ingestion logs for failures
Poll GET /api/corpora/{id}/ to monitor status changes.
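
The polling advice above can be sketched as a small loop. The helper `wait_for_indexing` and its parameters are hypothetical. It treats IND and ERR as terminal, following the progression described here; the schema further down also lists values such as DER and CMP, which can be passed in `terminal` if your deployment emits them.

```python
import time
from typing import Callable, FrozenSet

DEFAULT_TERMINAL = frozenset({"IND", "ERR"})  # assumption: these two end the polling loop


def wait_for_indexing(
    fetch_status: Callable[[], str],
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
    terminal: FrozenSet[str] = DEFAULT_TERMINAL,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it returns a terminal status, then return that status."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in terminal:
            return status
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"corpus not indexed after {max_attempts} attempts")
```

Here `fetch_status` would wrap GET /api/corpora/{id}/ and return the `indexing_status` field of the JSON body. Injecting it as a callable keeps the loop independent of the HTTP client and easy to test.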
The following fields are managed by SOAR and cannot be set directly:
  • size_on_disk - Updated as resources are indexed
  • index_location - Assigned when the first resource is processed
  • creator - Automatically set to your user ID
  • id, created_at, updated_at - System-generated metadata
Choose VSI (Vector Store Index) for most use cases. It provides the best semantic search capabilities and works well with the advanced RAG retrieval pipeline.

Client examples

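import os
import requests

# Replace with the host of your own deployment.
BASE_URL = "https://your-soar-instance.com"
# Read the bearer token from the environment rather than hard-coding it.
TOKEN = os.environ["SOAR_LABS_TOKEN"]

# Only corpora_name is required; is_published and index_type fall back to
# their defaults (false and "VSI").
payload = {
    "corpora_name": "support_playbooks",
    "description": "Runbooks feeding LlamaIndex",
}

response = requests.post(
    f"{BASE_URL}/api/corpora/",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=30,
)
# Raise on 4xx/5xx responses, such as a 400 for a duplicate name.
response.raise_for_status()
corpus = response.json()
# Keep the id; later upload and query calls need it.
corpus_id = corpus["id"]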
Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

corpora_name
string
required

Name of the corpus

Maximum string length: 100
description
string | null

Description of the corpus

is_published
boolean

Whether the corpus is visible to all users

index_type
enum<string>

Type of index to use for the corpus

  • VSI - VectorStoreIndex
  • SMI - SummaryIndex
  • DSI - DocumentSummaryIndex
Available options:
VSI,
SMI,
DSI
indexing_status
enum<string>

Processing status of the corpus

  • PND - Pending
  • IQE - In Queue
  • PRS - Processing
  • DEX - Data Extracted Successfully
  • DER - Data Extraction Error
  • IND - Indexed
  • CMP - Completed
  • ERR - Error
Available options:
PND,
IQE,
PRS,
DEX,
DER,
IND,
CMP,
ERR

Response

201 - application/json
id
string<uuid>
required
created_at
string<date-time>
required

The date and time the corpus was created

updated_at
string<date-time>
required

Last updated time

corpora_name
string
required

Name of the corpus

Maximum string length: 100
size_on_disk
number<double>
required

Size of the corpus on disk (in bytes)

index_location
string | null
required

Location of the index in remote storage

creator
string<uuid>
required
description
string | null

Description of the corpus

is_published
boolean

Whether the corpus is visible to all users

index_type
enum<string>

Type of index to use for the corpus

  • VSI - VectorStoreIndex
  • SMI - SummaryIndex
  • DSI - DocumentSummaryIndex
Available options:
VSI,
SMI,
DSI
indexing_status
enum<string>

Processing status of the corpus

  • PND - Pending
  • IQE - In Queue
  • PRS - Processing
  • DEX - Data Extracted Successfully
  • DER - Data Extraction Error
  • IND - Indexed
  • CMP - Completed
  • ERR - Error
Available options:
PND,
IQE,
PRS,
DEX,
DER,
IND,
CMP,
ERR