Overview
URLs let you ingest external web content without uploading files. Each URL is crawled, normalized, and processed through the ingestion pipeline to extract text, generate chunks, and create vector embeddings. Submit multiple URLs in a single request for efficient batch ingestion.
Perfect for: Documentation sites, blog posts, knowledge bases, status pages, and any web content you want to make queryable.
URL content is fetched at ingestion time. Changes to the source page won’t automatically update - you’ll need to re-add the URL to refresh content.
Authentication
Requires valid JWT token or session authentication. You must be the owner of the target corpus.
Request Body
ID of the corpus that will own these URLs. Must be a corpus you created and have access to.
Array of URL objects to crawl and ingest. Each object represents one web resource. Batch size recommendations:
Optimal: 5-20 URLs per request
Maximum: Check your instance configuration (typically 50)
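The batch guidance above can be enforced with a small chunking helper before POSTing. This is a sketch, not an official client: the 20-URL batch size is the recommended optimum from this page, and `ingest_urls` is a hypothetical wrapper around the endpoint.

```python
import os

import requests

BASE_URL = "https://your-soar-instance.com"  # replace with your host
TOKEN = os.environ.get("SOAR_LABS_TOKEN", "")
CORPUS_ID = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"

def chunked(items, size=20):
    """Split a list into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def ingest_urls(urls, batch_size=20):
    """POST URLs in batches that stay within the recommended 5-20 range."""
    results = []
    for batch in chunked(urls, batch_size):
        response = requests.post(
            f"{BASE_URL}/api/data/urls/",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={"corpora": CORPUS_ID, "urls": [{"url": u} for u in batch]},
            timeout=30,
        )
        response.raise_for_status()
        results.extend(response.json())
    return results
```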
Absolute URL to crawl (must include protocol: https:// or http://). Supported URL types:
Public web pages (HTML)
Documentation sites
Blog posts and articles
Public APIs returning HTML/text
Sitemap URLs (when scrape_sitemap is enabled)
Unsupported:
Authentication-required pages
JavaScript-heavy SPAs (limited support)
PDF files served directly (use file upload instead)
Set to true to automatically discover and enqueue additional URLs from the page’s sitemap. How it works:
Soar Labs fetches the provided URL
Searches for sitemap.xml or sitemap links
Automatically enqueues all discovered URLs
Processes each URL as a separate resource
Use cases:
Documentation sites with full sitemaps
Blog archives
Product catalogs
Knowledge base articles
Caution: Large sitemaps (>100 URLs) may take significant time to process. Consider adding specific sections instead of entire sites.
Example request
curl -X POST https://{your-host}/api/data/urls/ \
-H "Authorization: Bearer $SOAR_LABS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd",
"urls": [
{"url": "https://docs.example.com/support/escalations"},
{"url": "https://status.example.com/incidents", "scrape_sitemap": true}
]
}'
Response
Returns an array of URL objects (one for each submitted URL):
Unique identifier for the URL resource. Use for tracking, retrieval, or deletion.
ISO 8601 timestamp when the URL was added.
Last update timestamp. Changes when crawling or indexing completes.
Timestamp when indexing completed successfully. null while processing.
Current processing status:
PND - Pending (queued for crawling)
PRS - Processing (crawling and indexing in progress)
IND - Indexed (content ready for queries)
ERR - Error (crawling or processing failed)
Intermediate statuses (IQE, DEX, DER, CMP) may also appear; see the full list of options at the end of this page.
The original URL as submitted.
Normalized/canonical version of the URL (removes tracking parameters, standardizes format).
Whether sitemap scraping was enabled for this URL.
Extracted metadata from the crawled page. Common fields:
content_type - MIME type of the fetched content (e.g., text/html, application/xhtml+xml)
title - Page title extracted from the <title> tag or Open Graph metadata
Additional fields may include:
description - Meta description or OG description
author - Content author if available
published_date - Publication date if found
ID of the parent corpus containing this URL.
Example Response
[
  {
    "id": "f0d6fe08-87c8-4eb0-80d8-7a2de638514b",
    "created_at": "2024-09-01T12:05:11.337Z",
    "updated_at": "2024-09-01T12:05:11.337Z",
    "indexed_on": null,
    "indexing_status": "PRS",
    "url": "https://docs.example.com/support/escalations",
    "clean_url": "https://docs.example.com/support/escalations",
    "scrape_sitemap": false,
    "metadata": {
      "content_type": "text/html",
      "title": "Escalation Playbook"
    },
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"
  }
]
Important Notes
URL Validation: Malformed URLs are rejected with 400 Bad Request and field-level errors. Ensure all URLs are absolute and properly formatted.
Status Tracking: Track crawling progress via GET /api/data/urls/?corpora={id} and monitor indexing_status transitions from PRS → IND.
Deletion Impact: Deleting a URL via DELETE /api/data/urls/{id}/ immediately removes all derived chunks from the vector store. This action cannot be undone.
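The status-tracking note above can be sketched as a polling loop. This is a sketch, not an official client: it assumes the list endpoint returns a plain array as in the example response, and the set of terminal statuses is an assumption based on the status codes documented on this page.

```python
import os
import time

import requests

BASE_URL = "https://your-soar-instance.com"  # replace with your host
TOKEN = os.environ.get("SOAR_LABS_TOKEN", "")

# Statuses after which polling can stop (assumption based on this page's codes).
TERMINAL_STATUSES = {"IND", "CMP", "ERR", "DER"}

def is_terminal(status: str) -> bool:
    """True once a URL resource has finished processing, successfully or not."""
    return status in TERMINAL_STATUSES

def wait_for_indexing(url_id, corpus_id, poll_seconds=5, timeout_seconds=300):
    """Poll GET /api/data/urls/?corpora={id} until url_id reaches a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        response = requests.get(
            f"{BASE_URL}/api/data/urls/",
            params={"corpora": corpus_id},
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        response.raise_for_status()
        for item in response.json():
            if item["id"] == url_id and is_terminal(item["indexing_status"]):
                return item
        time.sleep(poll_seconds)
    raise TimeoutError(f"URL {url_id} did not reach a terminal status in time")
```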
Client examples
Python
TypeScript / JavaScript
Java
import os

import requests

BASE_URL = "https://your-soar-instance.com"
TOKEN = os.environ["SOAR_LABS_TOKEN"]
CORPUS_ID = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"

payload = {
    "corpora": CORPUS_ID,
    "urls": [
        {"url": "https://docs.example.com/support/escalations"},
        {"url": "https://status.example.com/incidents", "scrape_sitemap": True},
    ],
}

response = requests.post(
    f"{BASE_URL}/api/data/urls/",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=30,
)
response.raise_for_status()
urls = response.json()
const BASE_URL = "https://your-soar-instance.com";
const token = process.env.SOAR_LABS_TOKEN!;

async function addUrls(corpusId: string) {
  const response = await fetch(`${BASE_URL}/api/data/urls/`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({
      corpora: corpusId,
      urls: [
        { url: "https://docs.example.com/support/escalations" },
        { url: "https://status.example.com/incidents", scrape_sitemap: true },
      ],
    }),
  });
  if (!response.ok) {
    throw new Error(`Add URLs failed: ${response.status}`);
  }
  return response.json();
}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

var BASE_URL = "https://your-soar-instance.com";
var token = System.getenv("SOAR_LABS_TOKEN");
var corpusId = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd";

var json = "{"
    + "\"corpora\":\"" + corpusId + "\","
    + "\"urls\":[{"
    + "\"url\":\"https://docs.example.com/support/escalations\""
    + "},{"
    + "\"url\":\"https://status.example.com/incidents\",\"scrape_sitemap\":true"
    + "}]"
    + "}";

var request = HttpRequest.newBuilder(URI.create(BASE_URL + "/api/data/urls/"))
    .header("Authorization", "Bearer " + token)
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(json))
    .build();

var response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
if (response.statusCode() >= 400) {
    throw new RuntimeException("Add URLs failed: " + response.statusCode());
}
Best Practices
Select URLs that provide maximum value. Good candidates:
Documentation pages with stable content
Knowledge base articles
Blog posts and tutorials
Product specifications
API reference pages
Avoid:
Dynamic pages with frequently changing content
Pages behind authentication/paywalls
JavaScript-heavy SPAs (content may not extract properly)
Pages with primarily images/videos (limited text content)
Auto-generated index pages with little content
Sitemap Scraping Strategy
Use scrape_sitemap: true strategically. When to enable:
Well-structured documentation sites
Blog archives (when you want all posts)
Product catalogs with many pages
Knowledge bases with comprehensive sitemaps
When to avoid:
Sites with 100+ pages (submit sections instead)
Frequently updated news sites (content becomes stale)
Sites with dynamic pagination (submit specific URLs)
Best practice: Start with a single page, verify content quality, then enable sitemap scraping if needed.
Common crawl failures and solutions:
404 Not Found:
Verify URL is correct and accessible
Check if page was moved or deleted
Try accessing in a browser first
403 Forbidden / 401 Unauthorized:
Page requires authentication
Site blocks bots/crawlers
Consider uploading content as file instead
Timeout Errors:
Page is too slow to load
Server is experiencing issues
Try again later or contact site admin
Content Extraction Failures:
JavaScript-heavy page (content not in initial HTML)
Page has anti-scraping measures
Consider using an alternative format (PDF, markdown)
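For transient timeouts, one option is to wrap the submission call in a small retry helper before falling back to the remedies above. This is a sketch, not part of the API: the backoff parameters are illustrative, and `submit` stands in for whatever function POSTs your URL batch.

```python
import time

def with_retries(submit, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Retry a zero-argument callable on timeout-style failures.

    `sleep` is injectable so tests (or callers) can skip the real delay.
    """
    for attempt in range(attempts):
        try:
            return submit()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # exponential backoff: 2s, 4s, ...
```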
Managing URL content updates:
Strategy 1: Manual Refresh
Delete the old URL resource
Re-add the URL to fetch latest content
Best for infrequently changing content
Strategy 2: Scheduled Updates
Implement periodic refresh via API calls
Delete and re-add URLs on a schedule
Good for documentation that updates regularly
Strategy 3: Webhooks (Advanced)
Set up webhooks to trigger updates on content changes
Requires integration with content management system
Best for real-time accuracy requirements
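Strategies 1 and 2 reduce to the same two calls. A minimal sketch, using the DELETE and POST endpoints documented on this page; `refresh_url` is a hypothetical helper, not part of an official client:

```python
import os

import requests

BASE_URL = "https://your-soar-instance.com"  # replace with your host
TOKEN = os.environ.get("SOAR_LABS_TOKEN", "")
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def url_detail_path(url_id: str) -> str:
    """Path for a single URL resource."""
    return f"/api/data/urls/{url_id}/"

def refresh_url(url_id, corpus_id, url):
    """Delete the stale resource, then re-add the URL so it is re-crawled."""
    deleted = requests.delete(BASE_URL + url_detail_path(url_id), headers=HEADERS, timeout=30)
    deleted.raise_for_status()
    created = requests.post(
        f"{BASE_URL}/api/data/urls/",
        headers=HEADERS,
        json={"corpora": corpus_id, "urls": [{"url": url}]},
        timeout=30,
    )
    created.raise_for_status()
    return created.json()[0]  # the freshly queued URL resource
```

For Strategy 2, run this helper from a scheduler (cron, Celery beat, etc.) over the URLs you want kept fresh.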
Soar Labs respects robots.txt directives:
Crawlers follow site-specific rate limits
Disallowed paths are not crawled
User-agent identification: SoarLabsBot
If crawling fails:
Check the site’s robots.txt
Verify your URLs aren’t disallowed
Consider reaching out to site owner for permission
Alternative: Upload content as files if you have access
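You can pre-check a site's rules against the documented SoarLabsBot user-agent with the standard library. A sketch: it evaluates a robots.txt body you have already fetched yourself, rather than fetching it for you.

```python
from urllib.robotparser import RobotFileParser

def robots_allows(url: str, robots_txt: str, user_agent: str = "SoarLabsBot") -> bool:
    """Evaluate already-fetched robots.txt rules for the given URL and user-agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```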
Management Operations
Retrieve all URLs in a corpus:
curl -X GET "https://{your-host}/api/data/urls/?corpora={corpus-id}" \
-H "Authorization: Bearer $SOAR_LABS_TOKEN"
Supports pagination with page and page_size parameters.
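The page and page_size parameters can drive a simple generator. A sketch with assumptions: the 404-at-end-of-pages behavior and the shape of the paginated body are guesses, so adapt them to what your instance actually returns.

```python
import os

import requests

BASE_URL = "https://your-soar-instance.com"  # replace with your host
TOKEN = os.environ.get("SOAR_LABS_TOKEN", "")

def list_params(corpus_id: str, page: int, page_size: int = 50) -> dict:
    """Query parameters for one page of the URL list endpoint."""
    return {"corpora": corpus_id, "page": page, "page_size": page_size}

def iter_corpus_urls(corpus_id, page_size=50):
    """Yield every URL resource in a corpus, one page at a time."""
    page = 1
    while True:
        response = requests.get(
            f"{BASE_URL}/api/data/urls/",
            params=list_params(corpus_id, page, page_size),
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        if response.status_code == 404:  # assumed: requested past the last page
            return
        response.raise_for_status()
        data = response.json()
        # Handle both a plain array and a {"results": [...]} paginated envelope.
        items = data["results"] if isinstance(data, dict) else data
        if not items:
            return
        yield from items
        page += 1
```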
Modify URL settings (e.g., enable sitemap scraping):
curl -X PATCH "https://{your-host}/api/data/urls/{url-id}/" \
-H "Authorization: Bearer $SOAR_LABS_TOKEN" \
-H "Content-Type: application/json" \
-d '{"scrape_sitemap": true}'
Note: Changing settings triggers re-crawling and re-indexing.
To refresh stale content, delete and re-add the URL:
# Step 1: Delete old version
curl -X DELETE "https://{your-host}/api/data/urls/{url-id}/" \
-H "Authorization: Bearer $SOAR_LABS_TOKEN"
# Step 2: Re-add with fresh content
curl -X POST "https://{your-host}/api/data/urls/" \
-H "Authorization: Bearer $SOAR_LABS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"corpora": "{corpus-id}",
"urls": [{"url": "https://example.com/updated-page"}]
}'
Remove URLs from the corpus:
curl -X DELETE "https://{your-host}/api/data/urls/{url-id}/" \
-H "Authorization: Bearer $SOAR_LABS_TOKEN"
Warning: Deletion immediately removes all crawled content and vector embeddings. This cannot be undone.
Typical Crawl Times:
Simple pages (< 100KB HTML): 2-5 seconds
Documentation pages (100-500KB): 10-30 seconds
Complex pages with many images: 30-60 seconds
Sitemap crawls (10-50 pages): 1-5 minutes
Rate limiting may apply to prevent overwhelming target sites. Large sitemap crawls are processed in batches with delays between requests.
Authorizations
jwtHeaderAuth, jwtCookieAuth, cookieAuth, basicAuth
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Body
application/json, application/x-www-form-urlencoded, multipart/form-data
url - Absolute URL to crawl. Maximum string length: 200
corpora - Corpora to which the URL belongs
clean_url - Canonical form of the URL, free of tracking parameters. Maximum string length: 200
scrape_sitemap - Whether to scrape the sitemap for additional URLs
metadata - Additional metadata for the URL
Response schema
created_at (string<date-time>, required) - The date and time the URL resource was created
updated_at (string<date-time>, required) - The date and time the URL resource was last updated
indexed_on (string<date-time> | null, required) - When indexing completed; null while processing
indexing_status - Available options:
PND - Pending
IQE - In Queue
PRS - Processing
DEX - Data Extracted Successfully
DER - Data Extraction Error
IND - Indexed
CMP - Completed
ERR - Error
url, corpora, clean_url, scrape_sitemap, and metadata are returned with the same meanings as in the request body above.