POST /api/data/files
{
  "id": "3c90c3cc-0d44-4b50-8888-8dd25736052a",
  "created_at": "2023-11-07T05:31:56Z",
  "updated_at": "2023-11-07T05:31:56Z",
  "indexed_on": "2023-11-07T05:31:56Z",
  "indexing_status": "PND",
  "file_name": "<string>",
  "file_type": "<string>",
  "location": "<string>",
  "file_size": 123,
  "corpora": "3c90c3cc-0d44-4b50-8888-8dd25736052a"
}

Overview

This endpoint handles binary file uploads and triggers asynchronous ingestion (content extraction + chunking + vector indexing). Only corpora you own can accept uploads, and files must be in supported formats.
Supported Formats: PDF, DOCX, TXT, CSV, JSON, Markdown (MD), Excel (XLSX/XLS), HTML/HTM, LOG files
Files are processed asynchronously. Monitor indexing_status to track when files are ready for querying.

Authentication

Requires a valid JWT token or session authentication. You must be the owner of the target corpus.

Request Body

corpora
UUID
required
ID of the corpus that will own these files. Must be a corpus you created and have access to.
files
file[]
required
One or more files to upload. Send as multipart/form-data, repeating the files field once per file.
Processing pipeline:
  1. File validation (format, size)
  2. Upload to cloud storage
  3. Content extraction (text, tables, images)
  4. Chunking and metadata generation
  5. Vector embedding and indexing
File size limits: Check your instance configuration (typically 50MB per file)
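The first pipeline step (format and size validation) can be approximated on the client before any bytes are sent. The following is a minimal sketch, not the server's actual validation: the extension set is taken from the Supported Formats list above, the 50MB figure is read as 50 * 1024 * 1024 bytes (an assumption, since your instance may be configured differently), and validate_for_upload is a hypothetical helper name.

```python
from pathlib import Path

# Extensions taken from the Supported Formats list in the Overview.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".txt", ".csv", ".json", ".md",
    ".xlsx", ".xls", ".html", ".htm", ".log",
}

# Assumption: the "typical" 50MB limit, interpreted as 50 * 1024 * 1024 bytes.
# Check your instance configuration for the real value.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def validate_for_upload(path, max_bytes=MAX_FILE_SIZE_BYTES):
    """Return a list of problems; an empty list means the file looks uploadable."""
    path = Path(path)
    if not path.is_file():
        return [f"{path}: not a regular file"]

    problems = []
    # Compare extensions case-insensitively so "REPORT.PDF" is accepted.
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"{path.name}: unsupported extension {path.suffix!r}")

    size = path.stat().st_size
    if size > max_bytes:
        problems.append(f"{path.name}: {size} bytes exceeds limit of {max_bytes}")
    return problems
```

A client-side check like this only catches obvious problems early; the server remains the authority on what it accepts.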

Example request

curl -X POST https://{your-host}/api/data/files/ \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -F "corpora=8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd" \
  -F "files=@playbook.pdf" \
  -F "files=@metrics.csv"

Response

Returns an array of file objects (one for each uploaded file):
id
UUID
Unique identifier for the file resource. Use for tracking, retrieval, or deletion.
created_at
timestamp
ISO 8601 timestamp when the file was uploaded.
updated_at
timestamp
Last update timestamp. Changes when indexing status updates.
indexed_on
timestamp | null
Timestamp when indexing completed successfully. null while processing.
indexing_status
string
Current processing status of the file. Common values (the response schema at the end of this page lists the full set):
  • PRS - Processing (extraction and indexing in progress)
  • IND - Indexed (ready for queries)
  • ERR - Error (processing failed, check logs)
  • PND - Pending (queued for processing)
file_name
string
Original filename as uploaded.
file_type
string
Detected file extension/type (e.g., pdf, docx, csv).
location
string (URL)
Cloud storage URL where the file is persisted. Access requires authentication.
file_size
float
File size in bytes.
corpora
UUID
ID of the parent corpus containing this file.

Example Response

[
  {
    "id": "0f75c73e-91bb-4e2b-9ff2-6820a8636ad8",
    "created_at": "2024-09-01T11:23:12.884Z",
    "updated_at": "2024-09-01T11:23:12.884Z",
    "indexed_on": null,
    "indexing_status": "PRS",
    "file_name": "playbook.pdf",
    "file_type": "pdf",
    "location": "https://storage/.../playbook.pdf",
    "file_size": 180423.0,
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"
  },
  {
    "id": "f5b69026-879f-457d-8b5a-78e20a8c912c",
    "created_at": "2024-09-01T11:23:12.901Z",
    "updated_at": "2024-09-01T11:23:12.901Z",
    "indexed_on": null,
    "indexing_status": "PRS",
    "file_name": "metrics.csv",
    "file_type": "csv",
    "location": "https://storage/.../metrics.csv",
    "file_size": 2048.0,
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"
  }
]

Client examples

import os
from contextlib import ExitStack

import requests

BASE_URL = "https://your-soar-instance.com"
TOKEN = os.environ["SOAR_LABS_TOKEN"]
CORPUS_ID = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"

# (local path, MIME type) pairs to upload in a single request
UPLOADS = [
    ("playbook.pdf", "application/pdf"),
    ("metrics.csv", "text/csv"),
]

# ExitStack closes every opened file handle once the request has been sent.
with ExitStack() as stack:
    files = [
        ("files", (path, stack.enter_context(open(path, "rb")), mime))
        for path, mime in UPLOADS
    ]
    response = requests.post(
        f"{BASE_URL}/api/data/files/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        data={"corpora": CORPUS_ID},
        files=files,
        timeout=120,
    )

response.raise_for_status()
uploaded_files = response.json()  # list of file objects, one per upload

Post-Upload Operations

Poll the files endpoint to track processing progress:
curl -X GET "https://{your-host}/api/data/files/?corpora={corpus-id}" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Status progression: PND → PRS → IND (success) or ERR (failure)
Typical processing times:
  • Small text files (< 1MB): 5-15 seconds
  • PDFs with images (5-10MB): 30-60 seconds
  • Large documents (20MB+): 2-5 minutes
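The polling described above can be wrapped in a small loop. This is a sketch under stated assumptions rather than an official client: wait_until_indexed and its fetch_files argument are hypothetical names, fetch_files is expected to return the list of file objects for the corpus (for example by wrapping the GET request shown above), and only IND and ERR are treated as final states, following the status progression in this section. Any other status code is treated as still in progress.

```python
import time

# Assumption: IND is the success state and ERR the failure state, as in the
# status progression above. Other codes are treated as "still in progress".
SUCCESS_STATUSES = {"IND"}
FAILURE_STATUSES = {"ERR"}


def wait_until_indexed(fetch_files, file_ids, interval=10.0, timeout=300.0,
                       sleep=time.sleep):
    """Poll until every file in file_ids reaches IND or ERR.

    fetch_files: hypothetical callable returning a list of file objects
                 (dicts with at least "id" and "indexing_status").
    Returns a dict mapping file ID to its final status.
    Raises TimeoutError if some files are still pending after `timeout` seconds.
    """
    final_statuses = SUCCESS_STATUSES | FAILURE_STATUSES
    pending = set(file_ids)
    results = {}
    deadline = time.monotonic() + timeout

    while pending:
        for item in fetch_files():
            if item["id"] in pending and item["indexing_status"] in final_statuses:
                results[item["id"]] = item["indexing_status"]
                pending.discard(item["id"])
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still processing: {sorted(pending)}")
        sleep(interval)

    return results
```

Passing the fetch function in as an argument keeps the loop independent of the HTTP library and of the exact shape of the list response, which this page does not specify (it may be paginated on your instance).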
Get detailed information about a specific file:
curl -X GET "https://{your-host}/api/data/files/{file-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Returns full metadata including chunking statistics and extraction details.
Remove files from the corpus and vector index:
curl -X DELETE "https://{your-host}/api/data/files/{file-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Important: Deletion is immediate and irreversible. The operation:
  • Removes the file from cloud storage
  • Deletes all associated vector embeddings
  • Updates corpus size metadata
  • Cannot be undone - you’ll need to re-upload if deleted accidentally
If a file shows indexing_status: "ERR", common causes include:
  • Corrupted or invalid file format - Re-export and try again
  • Unsupported encoding - Convert to UTF-8 for text files
  • Password-protected PDFs - Remove protection before uploading
  • Extremely large files - Split into smaller chunks
  • Unsupported content - Check if file type is in supported list
To retry: Delete the failed file and re-upload with corrections.
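The retry procedure (delete, then re-upload) can be sketched as a single helper. The function name retry_failed_file is hypothetical, and session is assumed to be a requests.Session (or any object with compatible delete and post methods) that already carries the Authorization header. The endpoints and field names are the ones documented on this page.

```python
def retry_failed_file(session, base_url, failed_file, corrected_path):
    """Delete a file whose indexing failed and upload a corrected copy.

    session:        assumed to be a requests.Session with auth already set
    failed_file:    file object from the API (needs "id", "corpora", "file_name")
    corrected_path: local path of the fixed file
    Returns the parsed JSON from the upload response.
    """
    # Step 1: remove the failed file (this is immediate and irreversible).
    delete_resp = session.delete(
        f"{base_url}/api/data/files/{failed_file['id']}/",
        timeout=30,
    )
    delete_resp.raise_for_status()

    # Step 2: upload the corrected copy to the same corpus, keeping the
    # original file name so downstream references stay recognizable.
    with open(corrected_path, "rb") as handle:
        upload_resp = session.post(
            f"{base_url}/api/data/files/",
            data={"corpora": failed_file["corpora"]},
            files=[("files", (failed_file["file_name"], handle))],
            timeout=120,
        )
    upload_resp.raise_for_status()
    return upload_resp.json()
```

With the requests library, a suitable session can be created with requests.Session() and session.headers["Authorization"] = f"Bearer {token}". Because deletion cannot be undone, confirm that the corrected file exists locally before calling the helper.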

Best Practices

Batch Uploads: Upload multiple related files in a single request to improve efficiency and reduce API calls.
Before uploading:
  1. Remove unnecessary pages - Reduce file size by excluding cover pages, blank pages
  2. Use OCR for scanned PDFs - Convert image-based PDFs to searchable text
  3. Clean up formatting - Remove excessive whitespace, fix broken tables
  4. Verify encoding - Ensure text files use UTF-8 encoding
  5. Test file opens - Verify files aren’t corrupted before upload
Create separate corpora for different content types or use cases:
  • Internal Documentation - Company policies, procedures
  • Product Knowledge - Technical specs, user guides
  • Customer Support - FAQs, troubleshooting guides
  • Training Materials - Onboarding docs, tutorials
This improves query accuracy and makes management easier.
For bulk uploads:
  1. Upload in reasonable batches (10-20 files per request)
  2. Poll status every 10-30 seconds
  3. Implement exponential backoff if servers are busy
  4. Log file IDs for tracking and error recovery
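Steps 1 and 4 of the list above can be sketched as two small helpers: one splits a list of local paths into request-sized batches, the other records the IDs returned for each batch. Both function names are hypothetical, and the default batch size of 10 is simply the lower end of the range suggested above.

```python
import logging

logger = logging.getLogger("bulk_upload")


def split_into_batches(paths, batch_size=10):
    """Split paths into consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    paths = list(paths)
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def log_uploaded_ids(batch_number, uploaded_files):
    """Log and return the file IDs from one upload response.

    uploaded_files is the parsed JSON array returned by the upload endpoint.
    """
    ids = [item["id"] for item in uploaded_files]
    logger.info("batch %d uploaded: %s", batch_number, ", ".join(ids))
    return ids
```

Keeping the returned IDs per batch makes it possible to poll only the files from that batch and to identify which uploads to repeat after a failure.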
Rate Limits: Large file uploads may hit rate limits. Implement exponential backoff and retry logic in production systems.
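One possible shape for the backoff and retry logic is shown below. It is a sketch, not a documented client behavior: with_backoff is a hypothetical helper, and the set of retryable status codes (429, 502, 503, 504) is an assumption, since this page does not say how the server signals rate limiting. The delay doubles on each attempt, is capped at max_delay, and is randomized ("full jitter") so that parallel clients do not retry in lockstep.

```python
import random
import time


def with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                 retry_statuses=(429, 502, 503, 504), sleep=time.sleep):
    """Call send() until it returns a non-retryable response.

    send:           zero-argument callable that performs one request and
                    returns an object with a status_code attribute
    retry_statuses: assumed set of "try again later" codes; adjust to what
                    your instance actually returns
    Returns the last response, which may still be a retryable one if all
    attempts were used up.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code not in retry_statuses:
            return response
        if attempt < max_attempts - 1:
            # Exponential growth: base, 2*base, 4*base, ... capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    return response
```

When send wraps a file upload, it should reopen the files on every call, because a file handle that has already been read by a previous attempt would otherwise be sent empty.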

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

location
string<uri>
required

Location of the file on Remote Storage

corpora
string<uuid>
required

Corpus to which the file maps.

Response

201 - application/json
id
string<uuid>
required
created_at
string<date-time>
required

The date and time the file was created

updated_at
string<date-time>
required

Last updated time

indexed_on
string<date-time> | null
required
indexing_status
enum<string>
required
  • PND - Pending
  • IQE - In Queue
  • PRS - Processing
  • DEX - Data Extracted Successfully
  • DER - Data Extraction Error
  • IND - Indexed
  • CMP - Completed
  • ERR - Error
Available options: PND, IQE, PRS, DEX, DER, IND, CMP, ERR
file_name
string
required

Original, user-facing name of the uploaded file.

file_type
string | null
required

Type of the file

location
string<uri>
required

Location of the file on Remote Storage

file_size
number<double> | null
required

File size in bytes

corpora
string<uuid>
required

Corpus to which the file maps.