POST /api/data/urls/

Overview

URLs let you ingest external web content without uploading files. Each URL is crawled, normalized, and processed through the ingestion pipeline to extract text, generate chunks, and create vector embeddings. Submit multiple URLs in a single request for efficient batch ingestion.
Perfect for: Documentation sites, blog posts, knowledge bases, status pages, and any web content you want to make queryable.
URL content is fetched at ingestion time. Changes to the source page won’t automatically update - you’ll need to re-add the URL to refresh content.

Authentication

Requires valid JWT token or session authentication. You must be the owner of the target corpus.

Request Body

corpora
UUID
required
ID of the corpus that will own these URLs. Must be a corpus you created and have access to.
urls
array<object>
required
Array of URL objects to crawl and ingest. Each object represents one web resource.
Batch size recommendations:
  • Optimal: 5-20 URLs per request
  • Maximum: Check your instance configuration (typically 50)

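The batch guidance above can be sketched as a small helper. This is illustrative only: `batch_urls` is not part of the API, and the size of 20 simply follows the optimal range stated here; check your instance's configured maximum.

```python
# A minimal sketch of batching URLs before submission. The helper name
# batch_urls is illustrative, and 20 matches the optimal range above;
# your instance's configured maximum (typically 50) may differ.

def batch_urls(urls, batch_size=20):
    """Yield successive slices of at most batch_size URL objects."""
    for i in range(0, len(urls), batch_size):
        yield urls[i:i + batch_size]

# Example: 45 URLs split into requests of 20, 20, and 5.
all_urls = [{"url": f"https://docs.example.com/page-{n}"} for n in range(45)]
batches = list(batch_urls(all_urls))
```

Submit each batch as a separate POST request rather than one oversized payload.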
Example request

curl -X POST https://{your-host}/api/data/urls/ \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd",
    "urls": [
      {"url": "https://docs.example.com/support/escalations"},
      {"url": "https://status.example.com/incidents", "scrape_sitemap": true}
    ]
  }'

Response

Returns an array of URL objects (one for each submitted URL):
id
UUID
Unique identifier for the URL resource. Use for tracking, retrieval, or deletion.
created_at
timestamp
ISO 8601 timestamp when the URL was added.
updated_at
timestamp
Last update timestamp. Changes when crawling or indexing completes.
indexed_on
timestamp | null
Timestamp when indexing completed successfully. null while processing.
indexing_status
string
Current processing status:
  • PND - Pending (queued for crawling)
  • IQE - In Queue
  • PRS - Processing (crawling and indexing in progress)
  • DEX - Data Extracted Successfully
  • DER - Data Extraction Error
  • IND - Indexed (content ready for queries)
  • CMP - Completed
  • ERR - Error (crawling or processing failed)
url
string (URL)
The original URL as submitted.
clean_url
string (URL)
Normalized/canonical version of the URL (removes tracking parameters, standardizes format).
scrape_sitemap
boolean
Whether sitemap scraping was enabled for this URL.
metadata
object
Extracted metadata from the crawled page.
corpora
UUID
ID of the parent corpus containing this URL.

Example Response

[
  {
    "id": "f0d6fe08-87c8-4eb0-80d8-7a2de638514b",
    "created_at": "2024-09-01T12:05:11.337Z",
    "updated_at": "2024-09-01T12:05:11.337Z",
    "indexed_on": null,
    "indexing_status": "PRS",
    "url": "https://docs.example.com/support/escalations",
    "clean_url": "https://docs.example.com/support/escalations",
    "scrape_sitemap": false,
    "metadata": {
      "content_type": "text/html",
      "title": "Escalation Playbook"
    },
    "corpora": "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"
  }
]

Important Notes

URL Validation: Malformed URLs are rejected with 400 Bad Request and field-level errors. Ensure all URLs are absolute and properly formatted.
Status Tracking: Track crawling progress via GET /api/data/urls/?corpora={id}. Monitor indexing_status as it transitions from PRS to IND.
Deletion Impact: Deleting a URL via DELETE /api/data/urls/{id}/ immediately removes all derived chunks from the vector store. This action cannot be undone.
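The status-tracking note above can be sketched as a polling loop. This is a sketch, not official client code: the function names are illustrative, the host is a placeholder, and it assumes the listing endpoint returns a plain JSON array with PND, IQE, and PRS as the in-progress statuses.

```python
import os
import time
import requests

BASE_URL = "https://your-soar-instance.com"  # assumption: replace with your host
TOKEN = os.environ.get("SOAR_LABS_TOKEN", "")

# Statuses that mean crawling/indexing has not finished yet (assumption).
IN_PROGRESS = {"PND", "IQE", "PRS"}

def still_processing(url_objects):
    """Return the URL objects that are still pending or processing."""
    return [u for u in url_objects if u["indexing_status"] in IN_PROGRESS]

def wait_for_indexing(corpus_id, interval=10, timeout=600):
    """Poll GET /api/data/urls/?corpora={id} until nothing is in progress."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            f"{BASE_URL}/api/data/urls/",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={"corpora": corpus_id},
            timeout=30,
        )
        resp.raise_for_status()
        urls = resp.json()
        if not still_processing(urls):
            return urls
        time.sleep(interval)
    raise TimeoutError(f"Corpus {corpus_id} URLs did not finish indexing")
```

Inspect any entries that come back with an error status (ERR or DER) and see the crawl-failure checklist under Best Practices.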

Client examples

import os
import requests

BASE_URL = "https://your-soar-instance.com"
TOKEN = os.environ["SOAR_LABS_TOKEN"]
CORPUS_ID = "8d0f0a5d-4b5e-4c09-9db6-0e9d2aa8a9fd"

payload = {
    "corpora": CORPUS_ID,
    "urls": [
        {"url": "https://docs.example.com/support/escalations"},
        {"url": "https://status.example.com/incidents", "scrape_sitemap": True},
    ],
}

response = requests.post(
    f"{BASE_URL}/api/data/urls/",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=30,
)
response.raise_for_status()
urls = response.json()

Best Practices

Select URLs that provide maximum value.
Good candidates:
  • Documentation pages with stable content
  • Knowledge base articles
  • Blog posts and tutorials
  • Product specifications
  • API reference pages
Avoid:
  • Dynamic pages with frequently changing content
  • Pages behind authentication/paywalls
  • JavaScript-heavy SPAs (content may not extract properly)
  • Pages with primarily images/videos (limited text content)
  • Auto-generated index pages with little content
Use scrape_sitemap: true strategically.
When to enable:
  • Well-structured documentation sites
  • Blog archives (when you want all posts)
  • Product catalogs with many pages
  • Knowledge bases with comprehensive sitemaps
When to avoid:
  • Sites with 100+ pages (submit sections instead)
  • Frequently updated news sites (content becomes stale)
  • Sites with dynamic pagination (submit specific URLs)
Best practice: Start with a single page, verify content quality, then enable sitemap scraping if needed.
Common crawl failures and solutions:
404 Not Found:
  • Verify URL is correct and accessible
  • Check if page was moved or deleted
  • Try accessing in a browser first
403 Forbidden / 401 Unauthorized:
  • Page requires authentication
  • Site blocks bots/crawlers
  • Consider uploading content as file instead
Timeout Errors:
  • Page is too slow to load
  • Server is experiencing issues
  • Try again later or contact site admin
Content Extraction Failures:
  • JavaScript-heavy page (content not in initial HTML)
  • Page has anti-scraping measures
  • Consider using an alternative format (PDF, markdown)
Managing URL content updates:
Strategy 1: Manual Refresh
  • Delete the old URL resource
  • Re-add the URL to fetch latest content
  • Best for infrequently changing content
Strategy 2: Scheduled Updates
  • Implement periodic refresh via API calls
  • Delete and re-add URLs on a schedule
  • Good for documentation that updates regularly
Strategy 3: Webhooks (Advanced)
  • Set up webhooks to trigger updates on content changes
  • Requires integration with content management system
  • Best for real-time accuracy requirements
Soar Labs respects robots.txt directives:
  • Crawlers follow site-specific rate limits
  • Disallowed paths are not crawled
  • User-agent identification: SoarLabsBot
If crawling fails:
  1. Check the site’s robots.txt
  2. Verify your URLs aren’t disallowed
  3. Consider reaching out to site owner for permission
  4. Alternative: Upload content as files if you have access

Management Operations

Retrieve all URLs in a corpus:
curl -X GET "https://{your-host}/api/data/urls/?corpora={corpus-id}" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Supports pagination with page and page_size parameters.
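One way to walk the page/page_size pagination is sketched below. It assumes each page returns a plain JSON array, so a short page signals the last one; adjust if your instance wraps results in an envelope. The host and function names are illustrative.

```python
import os
import requests

BASE_URL = "https://your-soar-instance.com"  # assumption: replace with your host
TOKEN = os.environ.get("SOAR_LABS_TOKEN", "")

def is_last_page(batch, page_size):
    """A page shorter than page_size means there is nothing after it."""
    return len(batch) < page_size

def list_all_urls(corpus_id, page_size=50):
    """Collect every URL in a corpus by walking page/page_size pagination."""
    results, page = [], 1
    while True:
        resp = requests.get(
            f"{BASE_URL}/api/data/urls/",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params={"corpora": corpus_id, "page": page, "page_size": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        results.extend(batch)
        if is_last_page(batch, page_size):
            return results
        page += 1
```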
Modify URL settings (e.g., enable sitemap scraping):
curl -X PATCH "https://{your-host}/api/data/urls/{url-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"scrape_sitemap": true}'
Note: Changing settings triggers re-crawling and re-indexing.
To refresh stale content, delete and re-add the URL:
# Step 1: Delete old version
curl -X DELETE "https://{your-host}/api/data/urls/{url-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"

# Step 2: Re-add with fresh content
curl -X POST "https://{your-host}/api/data/urls/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "corpora": "{corpus-id}",
    "urls": [{"url": "https://example.com/updated-page"}]
  }'
Remove URLs from the corpus:
curl -X DELETE "https://{your-host}/api/data/urls/{url-id}/" \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"
Warning: Deletion immediately removes all crawled content and vector embeddings. Cannot be undone.

Crawling Performance

Typical Crawl Times:
  • Simple pages (< 100KB HTML): 2-5 seconds
  • Documentation pages (100-500KB): 10-30 seconds
  • Complex pages with many images: 30-60 seconds
  • Sitemap crawls (10-50 pages): 1-5 minutes
Rate limiting may apply to prevent overwhelming target sites. Large sitemap crawls are processed in batches with delays between requests.

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

url
string<uri>
required

URL of the resource

Maximum string length: 200
corpora
string<uuid>
required

Corpus to which the URL belongs

clean_url
string<uri> | null

Canonical form of the URL - free of tracking parameters

Maximum string length: 200
scrape_sitemap
boolean

Whether to scrape the sitemap for additional URLs

metadata
any | null

Additional metadata for the URL

Response

201 - application/json
id
string<uuid>
required
created_at
string<date-time>
required

The date and time the URL resource was created

updated_at
string<date-time>
required

Last updated time

indexed_on
string<date-time> | null
required
indexing_status
enum<string>
required
  • PND - Pending
  • IQE - In Queue
  • PRS - Processing
  • DEX - Data Extracted Successfully
  • DER - Data Extraction Error
  • IND - Indexed
  • CMP - Completed
  • ERR - Error
url
string<uri>
required

URL of the resource

Maximum string length: 200
corpora
string<uuid>
required

Corpus to which the URL belongs

clean_url
string<uri> | null

Canonical form of the URL - free of tracking parameters

Maximum string length: 200
scrape_sitemap
boolean

Whether to scrape the sitemap for additional URLs

metadata
any | null

Additional metadata for the URL