Upload a File

Overview

This endpoint handles binary file uploads and triggers asynchronous ingestion (content extraction + chunking + vector indexing). Only corpora you own can accept uploads, and files must be in supported formats.

Supported Formats: PDF, DOCX, TXT, CSV, JSON, Markdown (MD), Excel (XLSX/XLS), HTML/HTM, LOG files

Files are processed asynchronously. Monitor indexing_status to track when files are ready for querying.

Authentication

Requires a valid JWT token or session authentication. You must be the owner of the target corpus.

Request Body

corpora

UUID

required

ID of the corpus that will own these files. Must be a corpus you created and have access to.

files

file[]

required

One or more files to upload. Send as multipart/form-data, repeating the files field once per file.

Processing pipeline:

File validation (format, size)
Upload to cloud storage
Content extraction (text, tables, images)
Chunking and metadata generation
Vector embedding and indexing

File size limits: Check your instance configuration (typically 50MB per file)
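The first pipeline step (format and size validation) can be mirrored on the client so that obviously invalid files never leave your machine. The following is a minimal sketch, not the server's actual validation logic: check_upload is a hypothetical helper, the extension set is copied from the supported-formats list, and the 50 MB default is only the "typical" limit mentioned above, interpreted here as 50 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Extensions taken from the supported-formats list on this page.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".txt", ".csv", ".json", ".md",
    ".xlsx", ".xls", ".html", ".htm", ".log",
}

# Assumption: the "typical" 50 MB limit; check your instance configuration.
MAX_FILE_BYTES = 50 * 1024 * 1024


def check_upload(file_name: str, size_bytes: int, max_bytes: int = MAX_FILE_BYTES) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    suffix = Path(file_name).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > max_bytes:
        problems.append(f"file too large: {size_bytes} bytes (limit {max_bytes})")
    return problems
```

For a file on disk, the size can be read with Path(path).stat().st_size before calling check_upload. The server remains the authority on what is accepted; this check only catches the obvious cases early.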

Example request

curl -X POST https://{your-host}/api/data/files/ \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -F "corpora=8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd" \
  -F "files=@playbook.pdf" \
  -F "files=@metrics.csv"

Response

Returns an array of file objects (one for each uploaded file):

id

UUID

Unique identifier for the file resource. Use for tracking, retrieval, or deletion.

created_at

timestamp

ISO 8601 timestamp when the file was uploaded.

updated_at

timestamp

Last update timestamp. Changes when indexing status updates.

indexed_on

timestamp | null

Timestamp when indexing completed successfully. null while processing.

indexing_status

string

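Current processing status of the file. The most common values, in the order they usually occur:

PND - Pending (queued for processing)
PRS - Processing (extraction and indexing in progress)
IND - Indexed (ready for queries)
ERR - Error (processing failed, check logs)

The response schema at the end of this page lists additional values (IQE, DEX, DER, CMP).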

file_name

string

Original filename as uploaded.

file_type

string

Detected file extension/type (e.g., pdf, docx, csv).

location

string (URL)

Cloud storage URL where the file is persisted. Access requires authentication.

file_size

float

File size in bytes.

corpora

UUID

ID of the parent corpus containing this file.

Example Response

[
  {
    "id": "0f75c73e-91bb-4e2b-9ff2-6820a8636ad8",
    "created_at": "2024-09-01T11:23:12.884Z",
    "updated_at": "2024-09-01T11:23:12.884Z",
    "indexed_on": null,
    "indexing_status": "PRS",
    "file_name": "playbook.pdf",
    "file_type": "pdf",
    "location": "https://storage/.../playbook.pdf",
    "file_size": 180423.0,
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"
  },
  {
    "id": "f5b69026-879f-457d-8b5a-78e20a8c912c",
    "created_at": "2024-09-01T11:23:12.901Z",
    "updated_at": "2024-09-01T11:23:12.901Z",
    "indexed_on": null,
    "indexing_status": "PRS",
    "file_name": "metrics.csv",
    "file_type": "csv",
    "location": "https://storage/.../metrics.csv",
    "file_size": 2048.0,
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"
  }
]
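Because the response is an array with one object per file, a client will usually want to record the returned IDs and their initial status right away. A minimal sketch, assuming only the id and indexing_status fields shown in the example response; group_by_status is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_status(file_objects: list[dict]) -> dict[str, list[str]]:
    """Map each indexing_status code to the IDs of the files in that state."""
    grouped = defaultdict(list)
    for item in file_objects:
        grouped[item["indexing_status"]].append(item["id"])
    return dict(grouped)


# Trimmed copy of the example response.
uploaded_files = [
    {"id": "0f75c73e-91bb-4e2b-9ff2-6820a8636ad8", "indexing_status": "PRS", "file_name": "playbook.pdf"},
    {"id": "f5b69026-879f-457d-8b5a-78e20a8c912c", "indexing_status": "PRS", "file_name": "metrics.csv"},
]
by_status = group_by_status(uploaded_files)
```

Here by_status holds both file IDs under the "PRS" key, which is a convenient starting point for status polling.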

Client examples

The examples below are given in Python, TypeScript / JavaScript, and Java, in that order.

import os
from contextlib import ExitStack

import requests

BASE_URL = "https://your-soar-instance.com"
TOKEN = os.environ["SOAR_LABS_TOKEN"]
CORPUS_ID = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"

# (path, MIME type) pairs to upload in one request.
uploads = [("playbook.pdf", "application/pdf"), ("metrics.csv", "text/csv")]

# ExitStack closes every file handle once the request has finished.
with ExitStack() as stack:
    files = [
        ("files", (path, stack.enter_context(open(path, "rb")), mime))
        for path, mime in uploads
    ]
    response = requests.post(
        f"{BASE_URL}/api/data/files/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        data={"corpora": CORPUS_ID},
        files=files,
        timeout=120,
    )

response.raise_for_status()
uploaded_files = response.json()

const BASE_URL = "https://your-soar-instance.com";
const token = process.env.SOAR_LABS_TOKEN!;

async function uploadFiles(corpusId: string, blobs: File[]) {
  const form = new FormData();
  form.append("corpora", corpusId);
  blobs.forEach((blob) => form.append("files", blob));

  const response = await fetch(`${BASE_URL}/api/data/files/`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
    },
    body: form,
  });

  if (!response.ok) {
    throw new Error(`Upload failed: ${response.status}`);
  }

  return response.json();
}

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Use a multipart helper (e.g., Apache HttpClient or OkHttp) for production.
// Simplified example with java.net.http sending a single file.
var BASE_URL = "https://your-soar-instance.com";
var token = System.getenv("SOAR_LABS_TOKEN");
var corpusId = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd";

String boundary = "----SoarBoundary";
// Text parts that surround the file payload.
var head = "--" + boundary + "\r\n" +
    "Content-Disposition: form-data; name=\"corpora\"\r\n\r\n" + corpusId + "\r\n" +
    "--" + boundary + "\r\n" +
    "Content-Disposition: form-data; name=\"files\"; filename=\"playbook.pdf\"\r\n" +
    "Content-Type: application/pdf\r\n\r\n";
var tail = "\r\n--" + boundary + "--\r\n";

// Read the file as raw bytes; decoding a PDF as text would corrupt it.
var body = List.of(
    head.getBytes(StandardCharsets.UTF_8),
    Files.readAllBytes(Path.of("playbook.pdf")),
    tail.getBytes(StandardCharsets.UTF_8));

var request = HttpRequest.newBuilder(URI.create(BASE_URL + "/api/data/files/"))
    .header("Authorization", "Bearer " + token)
    .header("Content-Type", "multipart/form-data; boundary=" + boundary)
    .POST(HttpRequest.BodyPublishers.ofByteArrays(body))
    .build();

var response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

if (response.statusCode() >= 400) {
    throw new RuntimeException("Upload failed: " + response.statusCode());
}

Post-Upload Operations

Monitor Indexing Status

Poll the files endpoint to track processing progress:

curl -X GET "https://{your-host}/api/data/files/?corpora={corpus-id}" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"

Status progression: PND → PRS → IND (success) or ERR (failure)

Typical processing times:

Small text files (< 1MB): 5-15 seconds
PDFs with images (5-10MB): 30-60 seconds
Large documents (20MB+): 2-5 minutes
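The polling described above can be wrapped in a small loop that stops on a final status or after a timeout. This is a sketch under stated assumptions: wait_until_indexed and its parameters are hypothetical names, IND and ERR are treated as the only final states (following the progression above), and the status lookup is passed in as a callable so the loop itself makes no assumption about the HTTP client.

```python
import time
from typing import Callable

# Assumption: IND and ERR are final states, per the status progression above.
FINAL_STATUSES = {"IND", "ERR"}


def wait_until_indexed(
    fetch_status: Callable[[], str],
    interval_s: float = 10.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns a final status or the timeout passes."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in FINAL_STATUSES:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"still {status} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

In practice fetch_status would wrap a GET to /api/data/files/{file-id}/ and return the indexing_status field of the JSON body, assuming that endpoint returns the same field as the upload response. The default timeout is sized against the typical processing times listed above and may need raising for very large files.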

Retrieve File Details

Get detailed information about a specific file:

curl -X GET "https://{your-host}/api/data/files/{file-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"

Returns full metadata including chunking statistics and extraction details.

Delete Files

Remove files from the corpus and vector index:

curl -X DELETE "https://{your-host}/api/data/files/{file-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"

Important: Deletion is immediate and irreversible. The operation:

Removes the file from cloud storage
Deletes all associated vector embeddings
Updates corpus size metadata
Cannot be undone - you’ll need to re-upload if deleted accidentally

Handle Processing Errors

If a file shows indexing_status: "ERR", common causes include:

Corrupted or invalid file format - Re-export and try again
Unsupported encoding - Convert to UTF-8 for text files
Password-protected PDFs - Remove protection before uploading
Extremely large files - Split into smaller chunks
Unsupported content - Check if file type is in supported list

To retry: Delete the failed file and re-upload with corrections.
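To find retry candidates, the file listing for a corpus can be filtered for the ERR status. A minimal sketch, assuming the listing endpoint returns an array of the same file objects as the upload response; files_to_retry is a hypothetical helper.

```python
def files_to_retry(file_objects: list[dict]) -> list[tuple[str, str]]:
    """Return (id, file_name) pairs for files whose indexing_status is ERR."""
    return [
        (item["id"], item["file_name"])
        for item in file_objects
        if item.get("indexing_status") == "ERR"
    ]


# Hypothetical listing with one failed file.
listing = [
    {"id": "a1", "file_name": "playbook.pdf", "indexing_status": "IND"},
    {"id": "b2", "file_name": "scan.pdf", "indexing_status": "ERR"},
]
failed = files_to_retry(listing)
```

Each returned ID can be passed to the DELETE endpoint, and the matching file name tells you which corrected file to upload again.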

Best Practices

Batch Uploads: Upload multiple related files in a single request to improve efficiency and reduce API calls.

Optimize File Preparation

Before uploading:

Remove unnecessary pages - Reduce file size by excluding cover pages, blank pages
Use OCR for scanned PDFs - Convert image-based PDFs to searchable text
Clean up formatting - Remove excessive whitespace, fix broken tables
Verify encoding - Ensure text files use UTF-8 encoding
Test file opens - Verify files aren’t corrupted before upload
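The encoding check in this list is easy to automate for text-based formats. A minimal sketch: is_utf8 is a hypothetical helper that only confirms the bytes decode as UTF-8; it does not detect which other encoding a failing file uses.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A typical call would be is_utf8(Path("notes.txt").read_bytes()), with Path imported from pathlib. Files that fail the check can be converted to UTF-8 before upload.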

Organize by Corpus

Create separate corpora for different content types or use cases:

Internal Documentation - Company policies, procedures
Product Knowledge - Technical specs, user guides
Customer Support - FAQs, troubleshooting guides
Training Materials - Onboarding docs, tutorials

This improves query accuracy and makes management easier.

Monitor Upload Queue

For bulk uploads:

Upload in reasonable batches (10-20 files per request)
Poll status every 10-30 seconds
Implement exponential backoff if servers are busy
Log file IDs for tracking and error recovery
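Splitting a large set of files into requests of the recommended size is a short generator. A minimal sketch: in_batches is a hypothetical helper, and the default of 10 is simply the lower end of the 10-20 range suggested above.

```python
from typing import Iterator, Sequence


def in_batches(paths: Sequence[str], batch_size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive slices of paths, each with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])
```

Each yielded batch maps to one multipart request; logging the IDs returned for a batch before sending the next one keeps error recovery simple.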

Rate Limits: Large file uploads may hit rate limits. Implement exponential backoff and retry logic in production systems.
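One common way to implement exponential backoff is to double the wait after each failed attempt, cap it, and add random jitter so that many clients do not retry in lockstep. The sketch below only computes the delays; backoff_delays and its defaults are illustrative assumptions, not values documented for this API.

```python
import random
from typing import Iterator


def backoff_delays(
    retries: int = 5,
    base_s: float = 1.0,
    cap_s: float = 60.0,
    jitter: bool = True,
) -> Iterator[float]:
    """Yield one delay per retry: base_s * 2**attempt, capped at cap_s.

    With jitter enabled, each delay is drawn uniformly from [0, capped delay].
    """
    for attempt in range(retries):
        delay = min(cap_s, base_s * (2 ** attempt))
        yield random.uniform(0, delay) if jitter else delay
```

A retry loop would sleep for each yielded delay between attempts and give up when the generator is exhausted. Which responses count as retryable (for example HTTP 429 or 503) is not stated on this page and should be confirmed against your instance.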

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

location

string<uri>

required

Location of the file on Remote Storage

corpora

string<uuid>

required

ID of the corpus to which the file belongs

Response

201 - application/json

id

string<uuid>

required

created_at

string<date-time>

required

The date and time the file was created

updated_at

string<date-time>

required

Last updated time

indexed_on

string<date-time> | null

required

indexing_status

enum<string>

required

PND - Pending
IQE - In Queue
PRS - Processing
DEX - Data Extracted Successfully
DER - Data Extraction Error
IND - Indexed
CMP - Completed
ERR - Error

Available options: PND, IQE, PRS, DEX, DER, IND, CMP, ERR
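Client code often needs to turn these codes into readable labels. A minimal sketch using a Python Enum: the codes and labels are copied from the list above, while IndexingStatus and describe_status are hypothetical names, and the fallback for unknown codes is a design choice rather than documented behavior.

```python
from enum import Enum


class IndexingStatus(Enum):
    """Status codes (member names) and labels (values) from the schema enum."""
    PND = "Pending"
    IQE = "In Queue"
    PRS = "Processing"
    DEX = "Data Extracted Successfully"
    DER = "Data Extraction Error"
    IND = "Indexed"
    CMP = "Completed"
    ERR = "Error"


def describe_status(code: str) -> str:
    """Return a readable label, or flag a code this client does not know."""
    try:
        return IndexingStatus[code].value
    except KeyError:
        return f"Unknown status ({code})"
```

Handling unknown codes explicitly keeps a client working if the server starts returning a status that this list does not include.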

file_name

string

required

Original, user-facing name of the uploaded file.

file_type

string | null

required

Type of the file

location

string<uri>

required

Location of the file on Remote Storage

file_size

number<double> | null

required

File size in bytes

corpora

string<uuid>

required

ID of the corpus to which the file belongs
