DELETE /api/data/urls/{id}

Overview

Permanently delete a URL resource along with its crawled content, metadata, and vector embeddings. This immediately removes the web page content from search results and retrieval operations.
Irreversible: Deletion cannot be undone. The URL record, crawled content, and all vector embeddings are permanently removed.
Use cases: Removing outdated web content, refreshing stale pages, managing dead links, or retracting content from sources that no longer exist.

Authentication

Requires valid JWT token or session authentication. You must own the parent corpus.
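The Python snippets on this page assume a shared base_url and headers setup along the lines of the sketch below (host placeholder and environment variable name follow the client example further down; adjust to your deployment):

import os
import requests

# Assumed shared setup for the Python snippets on this page.
base_url = "https://your-soar-instance.com"  # your Soar instance
headers = {"Authorization": f"Bearer {os.environ['SOAR_LABS_TOKEN']}"}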

Path Parameters

id
UUID
required
URL resource identifier returned when the URL was created. Example: f0d6fe08-87c8-4eb0-80d8-7a2de638514b

Example request

curl -X DELETE https://{your-host}/api/data/urls/f0d6fe08-87c8-4eb0-80d8-7a2de638514b/ \
  -H "Authorization: Bearer $SOAR_LABS_TOKEN"

Response Codes

204
No Content
URL successfully deleted. No response body returned.
404
Not Found
URL does not exist, was already deleted, or belongs to a corpus you don’t own.
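A minimal sketch of handling both outcomes from Python, using the shared setup above (url_id is the path-parameter UUID of the resource to delete):

response = requests.delete(f"{base_url}/api/data/urls/{url_id}/", headers=headers)

if response.status_code == 204:
    print("URL deleted")
elif response.status_code == 404:
    print("URL not found - already deleted or owned by another corpus")
else:
    response.raise_for_status()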

What Gets Deleted

  • URL record - Database entry with URL and metadata
  • Crawled content - Extracted text and page information
  • Vector embeddings - All chunks removed from Qdrant
  • Crawl metadata - Content type, title, description, and sitemap info
Storage impact: Corpus size reported by GET /api/corpora/ decreases after background cleanup completes.
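To confirm the cleanup, you can re-read the corpus listing after deleting. A rough sketch; the exact shape of the GET /api/corpora/ response (pagination key, size field names) is an assumption here and may differ:

# Assumption: the listing is paginated under "results", as with /api/data/urls/.
corpora = requests.get(f"{base_url}/api/corpora/", headers=headers).json()["results"]
corpus = next(c for c in corpora if c["id"] == corpus_id)
print(corpus)  # inspect the size/usage fields reported for the corpus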

Common Use Cases

Delete and re-add URLs to fetch updated content:
# Step 1: Delete old version
response = requests.delete(
    f"{base_url}/api/data/urls/{url_id}/",
    headers=headers
)

if response.status_code == 204:
    print("Old URL deleted")

    # Step 2: Re-add to fetch fresh content
    response = requests.post(
        f"{base_url}/api/data/urls/",
        headers=headers,
        json={
            "corpora": corpus_id,
            "urls": [{
                "url": "https://docs.example.com/updated-page",
                "scrape_sitemap": False
            }]
        }
    )

    new_url = response.json()[0]
    print(f"New URL ID: {new_url['id']}")
    print("Refreshing content...")
Refresh strategy: Schedule periodic refreshes for documentation sites that update frequently (e.g., weekly or monthly).
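For scheduled refreshes, the delete-and-re-add steps above can be wrapped in a small helper. A sketch under the same assumptions as the snippets on this page (refresh_url is a hypothetical name, not part of the API):

def refresh_url(url_id, url_string, corpus_id):
    """Delete a stale URL, then re-add it so fresh content is crawled."""
    delete_resp = requests.delete(f"{base_url}/api/data/urls/{url_id}/", headers=headers)
    if delete_resp.status_code != 204:
        delete_resp.raise_for_status()

    add_resp = requests.post(
        f"{base_url}/api/data/urls/",
        headers=headers,
        json={"corpora": corpus_id, "urls": [{"url": url_string}]},
    )
    add_resp.raise_for_status()
    return add_resp.json()[0]["id"]  # ID of the newly created URL resource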
Remove multiple URLs from a specific domain or pattern:
# Get all URLs
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()["results"]

# Delete URLs matching pattern
domain_to_remove = "old-docs.example.com"
matched_urls = []

for url in urls:
    if domain_to_remove in url["url"]:
        matched_urls.append(url)

print(f"Found {len(matched_urls)} URLs from {domain_to_remove}")

# Delete with confirmation
for url in matched_urls:
    print(f"Deleting: {url['url']}")
    requests.delete(
        f"{base_url}/api/data/urls/{url['id']}/",
        headers=headers
    )

print(f"Removed {len(matched_urls)} URLs")
Replace URLs from an old domain with a new domain:
# Get all URLs from old domain
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()["results"]

old_domain = "old.example.com"
new_domain = "new.example.com"
migrated = []

for url in urls:
    if old_domain in url["url"]:
        # Create new URL with updated domain
        new_url_string = url["url"].replace(old_domain, new_domain)

        # Add new URL
        response = requests.post(
            f"{base_url}/api/data/urls/",
            headers=headers,
            json={
                "corpora": corpus_id,
                "urls": [{"url": new_url_string}]
            }
        )

        if response.status_code == 201:
            # Delete old URL after successful addition
            requests.delete(
                f"{base_url}/api/data/urls/{url['id']}/",
                headers=headers
            )
            migrated.append(new_url_string)
            print(f"Migrated: {url['url']} -> {new_url_string}")

print(f"\nMigrated {len(migrated)} URLs to new domain")
Remove URLs based on age or relevance:
from datetime import datetime, timedelta, timezone

# Get all URLs
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()["results"]

# Find URLs older than 6 months
cutoff_date = datetime.now(timezone.utc) - timedelta(days=180)  # timezone-aware, so it compares with parsed indexed_on timestamps
old_urls = []

for url in urls:
    indexed_on = url.get("indexed_on")
    if indexed_on:
        indexed_date = datetime.fromisoformat(indexed_on.replace("Z", "+00:00"))
        if indexed_date < cutoff_date:
            old_urls.append(url)

print(f"Found {len(old_urls)} URLs indexed over 6 months ago")

# Review and delete
for url in old_urls:
    print(f"\nURL: {url['url']}")
    print(f"Indexed: {url['indexed_on']}")

    if input("Delete this URL? (y/n): ").lower() == 'y':
        requests.delete(
            f"{base_url}/api/data/urls/{url['id']}/",
            headers=headers
        )
        print("Deleted")
Remove all URLs added via sitemap scraping:
# Get all URLs in corpus
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()["results"]

# Find sitemap URLs
sitemap_urls = [url for url in urls if url.get("scrape_sitemap")]

print(f"Found {len(sitemap_urls)} URLs added via sitemap")

# Delete sitemap URLs
if input("Remove all sitemap URLs? (yes/no): ") == "yes":
    for url in sitemap_urls:
        requests.delete(
            f"{base_url}/api/data/urls/{url['id']}/",
            headers=headers
        )
    print(f"Removed {len(sitemap_urls)} sitemap URLs")
Use case: Useful when sitemap scraping added too many irrelevant pages.
Content freshness: Set up automated workflows to periodically refresh documentation URLs (delete old → re-add new) for sources that update frequently.
Other resources unaffected: Removing URLs does not affect files or strings in the same corpus. Each resource type is independent.

Client examples

import os
import requests

BASE_URL = "https://your-soar-instance.com"
TOKEN = os.environ["SOAR_LABS_TOKEN"]
URL_ID = "f0d6fe08-87c8-4eb0-80d8-7a2de638514b"

response = requests.delete(
    f"{BASE_URL}/api/data/urls/{URL_ID}/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
if response.status_code != 204:
    response.raise_for_status()

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Path Parameters

id
string<uuid>
required

A UUID string identifying this URL.

Response

204

No response body