Overview
This endpoint handles binary file uploads and triggers asynchronous ingestion (content extraction + chunking + vector indexing). Only corpora you own can accept uploads, and files must be in supported formats.
Supported Formats: PDF, DOCX, TXT, CSV, JSON, Markdown (MD), Excel (XLSX/XLS), HTML/HTM, LOG files
Files are processed asynchronously. Monitor indexing_status to track when files are ready for querying.
Authentication
Requires a valid JWT token or session authentication. You must be the owner of the target corpus.
Request Body
corpora
ID of the corpus that will own these files. Must be a corpus you created and have access to.
files
One or more files to upload. Send as multipart/form-data, repeating the files field once per file. Processing pipeline:
File validation (format, size)
Upload to cloud storage
Content extraction (text, tables, images)
Chunking and metadata generation
Vector embedding and indexing
File size limits: Check your instance configuration (typically 50 MB per file).
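The first pipeline step (format and size validation) can be mirrored on the client so that obviously invalid files fail before any bytes are sent. The following is a minimal sketch, not part of the API: the helper name `check_upload`, the extension set (derived from the supported-formats list above), and the 50 MB default are all assumptions, and the server remains the source of truth.

```python
from pathlib import Path

# Extensions derived from the supported-formats list above (assumed mapping).
SUPPORTED_EXTENSIONS = {
    "pdf", "docx", "txt", "csv", "json", "md",
    "xlsx", "xls", "html", "htm", "log",
}

# Assumed default; check your instance configuration for the real limit.
MAX_FILE_BYTES = 50 * 1024 * 1024


def check_upload(path, max_bytes=MAX_FILE_BYTES):
    """Return a list of problems for one file; an empty list means it looks uploadable."""
    p = Path(path)
    problems = []
    ext = p.suffix.lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not p.is_file():
        problems.append("file does not exist")
    elif p.stat().st_size > max_bytes:
        problems.append(f"file exceeds {max_bytes} bytes")
    return problems
```

Running `check_upload` over every path before building the multipart request lets a client skip or report bad files without spending an API call on them.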
Example request
curl -X POST "https://{your-host}/api/data/files/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -F "corpora=8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd" \
  -F "files=@playbook.pdf" \
  -F "files=@metrics.csv"
Response
Returns an array of file objects (one for each uploaded file):
id
Unique identifier for the file resource. Use for tracking, retrieval, or deletion.
created_at
ISO 8601 timestamp when the file was uploaded.
updated_at
Last update timestamp. Changes when indexing status updates.
indexed_on
Timestamp when indexing completed successfully. null while processing.
indexing_status
Current processing status of the file:
PRS - Processing (extraction and indexing in progress)
IND - Indexed (ready for queries)
ERR - Error (processing failed, check logs)
PND - Pending (queued for processing)
file_name
Original filename as uploaded.
file_type
Detected file extension/type (e.g., pdf, docx, csv).
location
Cloud storage URL where the file is persisted. Access requires authentication.
corpora
ID of the parent corpus containing this file.
Example Response
[
  {
    "id": "0f75c73e-91bb-4e2b-9ff2-6820a8636ad8",
    "created_at": "2024-09-01T11:23:12.884Z",
    "updated_at": "2024-09-01T11:23:12.884Z",
    "indexed_on": null,
    "indexing_status": "PRS",
    "file_name": "playbook.pdf",
    "file_type": "pdf",
    "location": "https://storage/.../playbook.pdf",
    "file_size": 180423.0,
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"
  },
  {
    "id": "f5b69026-879f-457d-8b5a-78e20a8c912c",
    "created_at": "2024-09-01T11:23:12.901Z",
    "updated_at": "2024-09-01T11:23:12.901Z",
    "indexed_on": null,
    "indexing_status": "PRS",
    "file_name": "metrics.csv",
    "file_type": "csv",
    "location": "https://storage/.../metrics.csv",
    "file_size": 2048.0,
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"
  }
]
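Since the response is a plain JSON array, a client will usually want to record the returned file IDs together with their initial status for later polling. A minimal sketch, assuming the response shape shown above; `group_ids_by_status` is a hypothetical helper name, and `sample` is a trimmed copy of the example response.

```python
import json
from collections import defaultdict


def group_ids_by_status(file_objects):
    """Map each indexing_status code to the list of file IDs currently in that state."""
    grouped = defaultdict(list)
    for item in file_objects:
        grouped[item["indexing_status"]].append(item["id"])
    return dict(grouped)


# Trimmed copy of the example response above.
sample = json.loads("""
[
  {"id": "0f75c73e-91bb-4e2b-9ff2-6820a8636ad8", "indexing_status": "PRS", "file_name": "playbook.pdf"},
  {"id": "f5b69026-879f-457d-8b5a-78e20a8c912c", "indexing_status": "PRS", "file_name": "metrics.csv"}
]
""")

print(group_ids_by_status(sample))
```

For the example response, both IDs end up under the `PRS` key, which is the set of files a client would keep watching.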
Client examples
Python
import os
import requests

BASE_URL = "https://your-soar-instance.com"
TOKEN = os.environ["SOAR_LABS_TOKEN"]
CORPUS_ID = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"

# Open both files in a context manager so the handles are closed after the upload.
with open("playbook.pdf", "rb") as playbook, open("metrics.csv", "rb") as metrics:
    # Repeat the "files" field once per file.
    files = [
        ("files", ("playbook.pdf", playbook, "application/pdf")),
        ("files", ("metrics.csv", metrics, "text/csv")),
    ]
    response = requests.post(
        f"{BASE_URL}/api/data/files/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        data={"corpora": CORPUS_ID},
        files=files,
        timeout=120,
    )

response.raise_for_status()
uploaded_files = response.json()  # List of file objects, one per uploaded file
TypeScript / JavaScript
const BASE_URL = "https://your-soar-instance.com";
const token = process.env.SOAR_LABS_TOKEN!;

async function uploadFiles(corpusId: string, blobs: File[]) {
  const form = new FormData();
  form.append("corpora", corpusId);
  // Repeat the "files" field once per file.
  blobs.forEach((blob) => form.append("files", blob));

  // Do not set Content-Type manually; fetch adds the multipart boundary itself.
  const response = await fetch(`${BASE_URL}/api/data/files/`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
    },
    body: form,
  });

  if (!response.ok) {
    throw new Error(`Upload failed: ${response.status}`);
  }
  return response.json();
}
Java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Use a multipart helper (e.g., Apache HttpClient or OkHttp) for production.
// Simplified example with java.net.http sending a single file.
public class UploadFile {
    public static void main(String[] args) throws IOException, InterruptedException {
        var baseUrl = "https://your-soar-instance.com";
        var token = System.getenv("SOAR_LABS_TOKEN");
        var corpusId = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd";
        var boundary = "----SoarBoundary";

        // Text parts of the multipart body: the corpora field and the file part headers.
        var head = "--" + boundary + "\r\n" +
            "Content-Disposition: form-data; name=\"corpora\"\r\n\r\n" + corpusId + "\r\n" +
            "--" + boundary + "\r\n" +
            "Content-Disposition: form-data; name=\"files\"; filename=\"playbook.pdf\"\r\n" +
            "Content-Type: application/pdf\r\n\r\n";
        var tail = "\r\n--" + boundary + "--\r\n";

        // Read the file as raw bytes; decoding a PDF as text would corrupt it.
        var body = new ByteArrayOutputStream();
        body.write(head.getBytes(StandardCharsets.UTF_8));
        body.write(Files.readAllBytes(Path.of("playbook.pdf")));
        body.write(tail.getBytes(StandardCharsets.UTF_8));

        var request = HttpRequest.newBuilder(URI.create(baseUrl + "/api/data/files/"))
            .header("Authorization", "Bearer " + token)
            .header("Content-Type", "multipart/form-data; boundary=" + boundary)
            .POST(HttpRequest.BodyPublishers.ofByteArray(body.toByteArray()))
            .build();

        var response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 400) {
            throw new RuntimeException("Upload failed: " + response.statusCode());
        }
    }
}
Post-Upload Operations
Poll the files endpoint to track processing progress:
curl -X GET "https://{your-host}/api/data/files/?corpora={corpus-id}" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Status progression:
PND → PRS → IND (success) or ERR (failure)
Typical processing times:
Small text files (< 1MB): 5-15 seconds
PDFs with images (5-10MB): 30-60 seconds
Large documents (20MB+): 2-5 minutes
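The polling flow described above can be sketched as a small loop that stops once every tracked file reaches IND or ERR. This is an illustrative sketch under stated assumptions, not an official client: `wait_until_indexed` and its parameters are hypothetical names, and `fetch_files` stands in for whatever function performs the GET request on the files endpoint and returns the decoded JSON list.

```python
import time


def wait_until_indexed(fetch_files, file_ids, interval=10.0, timeout=600.0, sleep=time.sleep):
    """Poll until every file in file_ids reaches IND or ERR, or the timeout elapses.

    fetch_files is any zero-argument callable returning the list of file objects,
    for example a wrapper around GET /api/data/files/?corpora={corpus-id}.
    Returns a dict mapping file ID to its last observed indexing_status.
    """
    pending = set(file_ids)
    last_seen = {}
    deadline = time.monotonic() + timeout
    while True:
        for item in fetch_files():
            if item["id"] in pending:
                last_seen[item["id"]] = item["indexing_status"]
                # IND and ERR are the two end states of the progression above.
                if item["indexing_status"] in ("IND", "ERR"):
                    pending.discard(item["id"])
        if not pending or time.monotonic() >= deadline:
            return last_seen
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in production the default `time.sleep` applies, and the timeout should be sized against the typical processing times listed above.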
Get detailed information about a specific file:
curl -X GET "https://{your-host}/api/data/files/{file-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Returns full metadata including chunking statistics and extraction details.
Remove files from the corpus and vector index:
curl -X DELETE "https://{your-host}/api/data/files/{file-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Important: Deletion is immediate and irreversible. The operation:
Removes the file from cloud storage
Deletes all associated vector embeddings
Updates corpus size metadata
Cannot be undone - you’ll need to re-upload if deleted accidentally
If a file shows indexing_status: "ERR", common causes include:
Corrupted or invalid file format - Re-export and try again
Unsupported encoding - Convert to UTF-8 for text files
Password-protected PDFs - Remove protection before uploading
Extremely large files - Split into smaller chunks
Unsupported content - Check if file type is in supported list
To retry: Delete the failed file and re-upload with corrections.
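The delete-then-re-upload retry can be sketched as follows. This is only an outline under assumptions: `retry_failed` is a hypothetical helper, and `delete_file` and `upload_file` are caller-supplied callables wrapping the DELETE and POST requests shown in this document, so no HTTP details are baked in.

```python
def retry_failed(file_objects, corrected_paths, delete_file, upload_file):
    """Delete each ERR file that has a corrected local copy, then upload that copy.

    corrected_paths maps the original file_name to the path of the fixed file.
    delete_file(file_id) and upload_file(path) are caller-supplied callables.
    Returns a list of (old_file_id, upload_result) pairs.
    """
    results = []
    for item in file_objects:
        if item["indexing_status"] != "ERR":
            continue
        path = corrected_paths.get(item["file_name"])
        if path is None:
            continue  # No corrected copy yet; leave the failed record in place.
        delete_file(item["id"])
        results.append((item["id"], upload_file(path)))
    return results
```

Files without a corrected copy are skipped on purpose, because deletion is irreversible and the failed record is still useful for diagnosis.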
Best Practices
Batch Uploads: Upload multiple related files in a single request to improve efficiency and reduce API calls.
Optimize File Preparation
Before uploading:
Remove unnecessary pages - Reduce file size by excluding cover pages and blank pages
Use OCR for scanned PDFs - Convert image-based PDFs to searchable text
Clean up formatting - Remove excessive whitespace, fix broken tables
Verify encoding - Ensure text files use UTF-8 encoding
Test file opens - Verify files aren’t corrupted before upload
Create separate corpora for different content types or use cases:
Internal Documentation - Company policies, procedures
Product Knowledge - Technical specs, user guides
Customer Support - FAQs, troubleshooting guides
Training Materials - Onboarding docs, tutorials
This improves query accuracy and makes management easier.
For bulk uploads:
Upload in reasonable batches (10-20 files per request)
Poll status every 10-30 seconds
Implement exponential backoff if servers are busy
Log file IDs for tracking and error recovery
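Splitting a large set of files into requests of 10 to 20 files each takes only a few lines. A minimal sketch; `in_batches` is a hypothetical helper name, and the default of 10 is simply the lower end of the range suggested above.

```python
def in_batches(paths, batch_size=10):
    """Yield consecutive slices of paths with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


# 25 files become three upload requests of 10, 10, and 5 files.
paths = [f"doc_{i}.pdf" for i in range(25)]
print([len(batch) for batch in in_batches(paths)])
```

Each yielded batch maps to one multipart request, and the file IDs returned by that request can be logged before the next batch is sent.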
Rate Limits: Large file uploads may hit rate limits. Implement exponential backoff and retry logic in production systems.
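Exponential backoff with jitter might look like the sketch below. The helper name `with_backoff` and all defaults are assumptions, and this document does not state which status codes signal rate limiting, so the retry condition is left to the caller (429 and 503 in the usage lines are illustrative only).

```python
import random
import time


def with_backoff(call, is_retryable, max_attempts=5, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep, jitter=random.random):
    """Call call() until is_retryable(result) is false or attempts run out.

    The delay before retry n (0-based) is min(max_delay, base_delay * 2**n),
    scaled by a jitter factor in [0.5, 1.0) to spread out concurrent clients.
    Returns the last result obtained (None if max_attempts is less than 1).
    """
    for attempt in range(max_attempts):
        result = call()
        if not is_retryable(result) or attempt == max_attempts - 1:
            return result
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(delay * (0.5 + jitter() / 2))


# Illustrative usage with a stand-in for the upload call that returns a status code.
codes = iter([429, 503, 200])
final = with_backoff(lambda: next(codes), lambda code: code in (429, 503), sleep=lambda s: None)
print(final)
```

Capping the delay at `max_delay` keeps a long outage from producing unbounded waits, while the jitter factor avoids many clients retrying in lockstep.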
Authorizations: jwtHeaderAuth, jwtCookieAuth, cookieAuth, basicAuth
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Body: application/json, application/x-www-form-urlencoded, multipart/form-data
location
Location of the file on remote storage.
corpora
Corpus to which the file maps.
created_at
string<date-time>
required
The date and time the file record was created.
updated_at
string<date-time>
required
indexed_on
string<date-time> | null
required
indexing_status
Available options:
PND - Pending
IQE - In Queue
PRS - Processing
DEX - Data Extracted Successfully
DER - Data Extraction Error
IND - Indexed
CMP - Completed
ERR - Error
file_name
Original, user-facing name of the uploaded file.
location
Location of the file on remote storage.
file_size
number<double> | null
required
corpora
Corpus to which the file maps.