POST /api/data/urls/

Overview

URLs let you ingest external web content without uploading files. Each URL is crawled, normalized, and processed through the ingestion pipeline to extract text, generate chunks, and create vector embeddings. Submit multiple URLs in a single request for efficient batch ingestion.
Perfect for: Documentation sites, blog posts, knowledge bases, status pages, and any web content you want to make queryable.
URL content is fetched at ingestion time. Changes to the source page won’t automatically update - you’ll need to re-add the URL to refresh content.

Authentication

Requires valid JWT token or session authentication. You must be the owner of the target corpus.

Request Body

corpora
UUID
required
ID of the corpus that will own these URLs. Must be a corpus you created and have access to.
urls
array<object>
required
Array of URL objects to crawl and ingest. Each object represents one web resource.
Batch size recommendations:
  • Optimal: 5-20 URLs per request
  • Maximum: Check your instance configuration (typically 50)

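The batch guidance above can be sketched as a small helper. This is illustrative only: `batch_urls` is not part of the API, and the size of 20 simply follows the optimal range stated here; check your instance's configured maximum.

```python
# A minimal sketch of batching URLs before submission. The helper name
# batch_urls is illustrative, and 20 matches the optimal range above;
# your instance's configured maximum (typically 50) may differ.

def batch_urls(urls, batch_size=20):
    """Yield successive slices of at most batch_size URL objects."""
    for i in range(0, len(urls), batch_size):
        yield urls[i:i + batch_size]

# Example: 45 URLs split into requests of 20, 20, and 5.
all_urls = [{"url": f"https://docs.example.com/page-{n}"} for n in range(45)]
batches = list(batch_urls(all_urls))
```

Submit each batch as a separate POST request rather than one oversized payload.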
Example request

curl -X POST https://{your-host}/api/data/urls/ \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd",
    "urls": [
      {"url": "https://docs.example.com/support/escalations"},
      {"url": "https://status.example.com/incidents", "scrape_sitemap": true}
    ]
  }'

Response

Returns an array of URL objects (one for each submitted URL):
id
UUID
Unique identifier for the URL resource. Use for tracking, retrieval, or deletion.
created_at
timestamp
ISO 8601 timestamp when the URL was added.
updated_at
timestamp
Last update timestamp. Changes when crawling or indexing completes.
indexed_on
timestamp | null
Timestamp when indexing completed successfully. null while processing.
indexing_status
string
Current processing status:
  • PND - Pending (queued for crawling)
  • IQE - In Queue
  • PRS - Processing (crawling and indexing in progress)
  • DEX - Data Extracted Successfully
  • DER - Data Extraction Error
  • IND - Indexed (content ready for queries)
  • CMP - Completed
  • ERR - Error (crawling or processing failed)
url
string (URL)
The original URL as submitted.
clean_url
string (URL)
Normalized/canonical version of the URL (removes tracking parameters, standardizes format).
scrape_sitemap
boolean
Whether sitemap scraping was enabled for this URL.
metadata
object
Extracted metadata from the crawled page.
corpora
UUID
ID of the parent corpus containing this URL.

Example Response

[
  {
    "id": "f0d6fe08-87c8-4eb0-80d8-7a2de638514b",
    "created_at": "2024-09-01T12:05:11.337Z",
    "updated_at": "2024-09-01T12:05:11.337Z",
    "indexed_on": null,
    "indexing_status": "PRS",
    "url": "https://docs.example.com/support/escalations",
    "clean_url": "https://docs.example.com/support/escalations",
    "scrape_sitemap": false,
    "metadata": {
      "content_type": "text/html",
      "title": "Escalation Playbook"
    },
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"
  }
]

Important Notes

URL Validation: Malformed URLs are rejected with 400 Bad Request and field-level errors. Ensure all URLs are absolute and properly formatted.
Status Tracking: Track crawling progress via GET /api/data/urls/?corpora={id}. Monitor indexing_status as it transitions from PRS to IND.
Deletion Impact: Deleting a URL via DELETE /api/data/urls/{id}/ immediately removes all derived chunks from the vector store. This action cannot be undone.
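The status-tracking note above can be sketched as a polling loop. This is a sketch, not official client code: the function names are illustrative, the host is a placeholder, and it assumes the listing endpoint returns a plain JSON array with PND, IQE, and PRS as the in-progress statuses.

```python
import os
import time
import requests

BASE_URL = "https://your-soar-instance.com"  # assumption: replace with your host
TOKEN = os.environ.get("SOAR_LABS_TOKEN", "")

# Statuses that mean crawling/indexing has not finished yet (assumption).
IN_PROGRESS = {"PND", "IQE", "PRS"}

def still_processing(url_objects):
    """Return the URL objects that are still pending or processing."""
    return [u for u in url_objects if u["indexing_status"] in IN_PROGRESS]

def wait_for_indexing(corpus_id, interval=10, timeout=600):
    """Poll GET /api/data/urls/?corpora={id} until nothing is in progress."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            f"{BASE_URL}/api/data/urls/",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={"corpora": corpus_id},
            timeout=30,
        )
        resp.raise_for_status()
        urls = resp.json()
        if not still_processing(urls):
            return urls
        time.sleep(interval)
    raise TimeoutError(f"Corpus {corpus_id} URLs did not finish indexing")
```

Inspect any entries that come back with an error status (ERR or DER) and see the crawl-failure checklist under Best Practices.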

Client examples

import os
import requests

BASE_URL = "https://your-soar-instance.com"
TOKEN = os.environ["SOAR_LABS_TOKEN"]
CORPUS_ID = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"

payload = {
    "corpora": CORPUS_ID,
    "urls": [
        {"url": "https://docs.example.com/support/escalations"},
        {"url": "https://status.example.com/incidents", "scrape_sitemap": True},
    ],
}

response = requests.post(
    f"{BASE_URL}/api/data/urls/",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=30,
)
response.raise_for_status()
urls = response.json()

Best Practices

Select URLs that provide maximum value.
Good candidates:
  • Documentation pages with stable content
  • Knowledge base articles
  • Blog posts and tutorials
  • Product specifications
  • API reference pages
Avoid:
  • Dynamic pages with frequently changing content
  • Pages behind authentication/paywalls
  • JavaScript-heavy SPAs (content may not extract properly)
  • Pages with primarily images/videos (limited text content)
  • Auto-generated index pages with little content
Use scrape_sitemap: true strategically.
When to enable:
  • Well-structured documentation sites
  • Blog archives (when you want all posts)
  • Product catalogs with many pages
  • Knowledge bases with comprehensive sitemaps
When to avoid:
  • Sites with 100+ pages (submit sections instead)
  • Frequently updated news sites (content becomes stale)
  • Sites with dynamic pagination (submit specific URLs)
Best practice: Start with a single page, verify content quality, then enable sitemap scraping if needed.
Common crawl failures and solutions:
404 Not Found:
  • Verify URL is correct and accessible
  • Check if page was moved or deleted
  • Try accessing in a browser first
403 Forbidden / 401 Unauthorized:
  • Page requires authentication
  • Site blocks bots/crawlers
  • Consider uploading content as file instead
Timeout Errors:
  • Page is too slow to load
  • Server is experiencing issues
  • Try again later or contact site admin
Content Extraction Failures:
  • JavaScript-heavy page (content not in initial HTML)
  • Page has anti-scraping measures
  • Consider using an alternative format (PDF, markdown)
Managing URL content updates:
Strategy 1: Manual Refresh
  • Delete the old URL resource
  • Re-add the URL to fetch latest content
  • Best for infrequently changing content
Strategy 2: Scheduled Updates
  • Implement periodic refresh via API calls
  • Delete and re-add URLs on a schedule
  • Good for documentation that updates regularly
Strategy 3: Webhooks (Advanced)
  • Set up webhooks to trigger updates on content changes
  • Requires integration with content management system
  • Best for real-time accuracy requirements
Soar Labs respects robots.txt directives:
  • Crawlers follow site-specific rate limits
  • Disallowed paths are not crawled
  • User-agent identification: SoarLabsBot
If crawling fails:
  1. Check the site’s robots.txt
  2. Verify your URLs aren’t disallowed
  3. Consider reaching out to site owner for permission
  4. Alternative: Upload content as files if you have access

Management Operations

Retrieve all URLs in a corpus:
curl -X GET "https://{your-host}/api/data/urls/?corpora={corpus-id}" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Supports pagination with page and page_size parameters.
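One way to walk the page/page_size pagination is sketched below. It assumes each page returns a plain JSON array, so a short page signals the last one; adjust if your instance wraps results in an envelope. The host and function names are illustrative.

```python
import os
import requests

BASE_URL = "https://your-soar-instance.com"  # assumption: replace with your host
TOKEN = os.environ.get("SOAR_LABS_TOKEN", "")

def is_last_page(batch, page_size):
    """A page shorter than page_size means there is nothing after it."""
    return len(batch) < page_size

def list_all_urls(corpus_id, page_size=50):
    """Collect every URL in a corpus by walking page/page_size pagination."""
    results, page = [], 1
    while True:
        resp = requests.get(
            f"{BASE_URL}/api/data/urls/",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={"corpora": corpus_id, "page": page, "page_size": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        results.extend(batch)
        if is_last_page(batch, page_size):
            return results
        page += 1
```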
Modify URL settings (e.g., enable sitemap scraping):
curl -X PATCH "https://{your-host}/api/data/urls/{url-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"scrape_sitemap": true}'
Note: Changing settings triggers re-crawling and re-indexing.
To refresh stale content, delete and re-add the URL:
# Step 1: Delete old version
curl -X DELETE "https://{your-host}/api/data/urls/{url-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"

# Step 2: Re-add with fresh content
curl -X POST "https://{your-host}/api/data/urls/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "corpora": "{corpus-id}",
    "urls": [{"url": "https://example.com/updated-page"}]
  }'
Remove URLs from the corpus:
curl -X DELETE "https://{your-host}/api/data/urls/{url-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Warning: Deletion immediately removes all crawled content and vector embeddings. Cannot be undone.

Crawling Performance

Typical Crawl Times:
  • Simple pages (< 100KB HTML): 2-5 seconds
  • Documentation pages (100-500KB): 10-30 seconds
  • Complex pages with many images: 30-60 seconds
  • Sitemap crawls (10-50 pages): 1-5 minutes
Rate limiting may apply to prevent overwhelming target sites. Large sitemap crawls are processed in batches with delays between requests.

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

url
string<uri>
required

URL of the resource

Maximum string length: 200
corpora
string<uuid>
required

Corpus to which the URL belongs

clean_url
string<uri> | null

Canonical form of the URL - free of tracking parameters

Maximum string length: 200
scrape_sitemap
boolean

Whether to scrape the sitemap for additional URLs

metadata
any | null

Additional metadata for the URL

Response

201 - application/json
id
string<uuid>
required
created_at
string<date-time>
required

The date and time the URL resource was created

updated_at
string<date-time>
required

Last updated time

indexed_on
string<date-time> | null
required
indexing_status
enum<string>
required
  • PND - Pending
  • IQE - In Queue
  • PRS - Processing
  • DEX - Data Extracted Successfully
  • DER - Data Extraction Error
  • IND - Indexed
  • CMP - Completed
  • ERR - Error
url
string<uri>
required

URL of the resource

Maximum string length: 200
corpora
string<uuid>
required

Corpus to which the URL belongs

clean_url
string<uri> | null

Canonical form of the URL - free of tracking parameters

Maximum string length: 200
scrape_sitemap
boolean

Whether to scrape the sitemap for additional URLs

metadata
any | null

Additional metadata for the URL