Permanently delete a URL resource along with its crawled content, metadata, and vector embeddings. This immediately removes the web page content from search results and retrieval operations.
Irreversible: Deletion cannot be undone. The URL record, crawled content, and all vector embeddings are permanently removed.
Use cases: Removing outdated web content, refreshing stale pages, managing dead links, or retracting content from sources that no longer exist.
Refresh strategy: Schedule periodic refreshes for documentation sites that update frequently (e.g., weekly or monthly).
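For reference, here is a minimal sketch of deleting a single URL resource before getting into the bulk workflows below. The endpoint path matches the calls used throughout this page, but base_url, the authentication header, and url_id are placeholders (the header name is an assumption, not a documented value), and the 204 check reflects typical REST behavior rather than a guaranteed response code.

import requests

# Placeholder connection details -- substitute your own deployment URL,
# credentials, and resource ID (the header name here is an assumption)
base_url = "https://your-deployment.example.com"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
url_id = "your-url-id"

# Permanently delete the URL record, its crawled content, and its embeddings
response = requests.delete(
    f"{base_url}/api/data/urls/{url_id}/",
    headers=headers
)

# Most REST APIs return 204 No Content on successful deletion
if response.status_code == 204:
    print(f"Deleted URL resource {url_id}")
else:
    print(f"Deletion failed: {response.status_code} {response.text}")

The later examples assume the same base_url, headers, and a corpus_id for the target corpus are already defined.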
Clean Up Dead Links
Remove URLs that return 404s or other crawl errors:
# Get all URLs in corpus
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()

# Find URLs with errors
dead_links = []
for url in urls["results"]:
    if url["indexing_status"] == "ERR":
        dead_links.append(url)
        print(f"Dead link found: {url['url']}")

# Delete dead links
if dead_links:
    print(f"\nFound {len(dead_links)} dead links")
    if input("Remove all dead links? (yes/no): ") == "yes":
        for url in dead_links:
            requests.delete(
                f"{base_url}/api/data/urls/{url['id']}/",
                headers=headers
            )
        print(f"Removed {len(dead_links)} dead links")
Batch URL Deletion
Remove multiple URLs matching a specific domain or pattern:
# Get all URLs
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()["results"]

# Find URLs matching the pattern
domain_to_remove = "old-docs.example.com"
matched_urls = []
for url in urls:
    if domain_to_remove in url["url"]:
        matched_urls.append(url)

print(f"Found {len(matched_urls)} URLs from {domain_to_remove}")

# Delete with confirmation
for url in matched_urls:
    print(f"Deleting: {url['url']}")
    requests.delete(
        f"{base_url}/api/data/urls/{url['id']}/",
        headers=headers
    )

print(f"Removed {len(matched_urls)} URLs")
Content Source Migration
Replace URLs from an old domain with their new-domain counterparts:
# Get all URLs from old domain
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()["results"]

old_domain = "old.example.com"
new_domain = "new.example.com"
migrated = []

for url in urls:
    if old_domain in url["url"]:
        # Create new URL with updated domain
        new_url_string = url["url"].replace(old_domain, new_domain)

        # Add new URL
        response = requests.post(
            f"{base_url}/api/data/urls/",
            headers=headers,
            json={
                "corpora": corpus_id,
                "urls": [{"url": new_url_string}]
            }
        )

        if response.status_code == 201:
            # Delete old URL after successful addition
            requests.delete(
                f"{base_url}/api/data/urls/{url['id']}/",
                headers=headers
            )
            migrated.append(new_url_string)
            print(f"Migrated: {url['url']} -> {new_url_string}")

print(f"\nMigrated {len(migrated)} URLs to new domain")
Selective Content Pruning
Remove URLs based on age or relevance:
from datetime import datetime, timedelta, timezone

# Get all URLs
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()["results"]

# Find URLs older than 6 months
# (use a timezone-aware cutoff so it compares cleanly with indexed_on timestamps)
cutoff_date = datetime.now(timezone.utc) - timedelta(days=180)
old_urls = []
for url in urls:
    indexed_on = url.get("indexed_on")
    if indexed_on:
        indexed_date = datetime.fromisoformat(indexed_on.replace("Z", "+00:00"))
        if indexed_date < cutoff_date:
            old_urls.append(url)

print(f"Found {len(old_urls)} URLs indexed over 6 months ago")

# Review and delete
for url in old_urls:
    print(f"\nURL: {url['url']}")
    print(f"Indexed: {url['indexed_on']}")
    if input("Delete this URL? (y/n): ").lower() == "y":
        requests.delete(
            f"{base_url}/api/data/urls/{url['id']}/",
            headers=headers
        )
        print("Deleted")
Sitemap URL Cleanup
Remove all URLs added via sitemap scraping:
# Get all URLs in corpus
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()["results"]

# Find sitemap URLs
sitemap_urls = [url for url in urls if url.get("scrape_sitemap")]
print(f"Found {len(sitemap_urls)} URLs added via sitemap")

# Delete sitemap URLs
if input("Remove all sitemap URLs? (yes/no): ") == "yes":
    for url in sitemap_urls:
        requests.delete(
            f"{base_url}/api/data/urls/{url['id']}/",
            headers=headers
        )
    print(f"Removed {len(sitemap_urls)} sitemap URLs")
Use case: This cleanup is useful when sitemap scraping pulled in too many irrelevant pages.
Content freshness: Set up automated workflows to periodically refresh documentation URLs (delete old → re-add new) for sources that update frequently.
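A sketch of that delete-then-re-add refresh, reusing the list, delete, and add calls shown above; the documentation URL itself is an illustrative placeholder.

# Refresh a frequently updated page: delete the stale record, then re-add it
doc_url = "https://docs.example.com/getting-started"  # placeholder

# Find the existing record for this URL
urls = requests.get(
    f"{base_url}/api/data/urls/?corpora={corpus_id}",
    headers=headers
).json()["results"]

# Delete any stale copies
for url in urls:
    if url["url"] == doc_url:
        requests.delete(
            f"{base_url}/api/data/urls/{url['id']}/",
            headers=headers
        )

# Re-add the URL so it is crawled and embedded again
requests.post(
    f"{base_url}/api/data/urls/",
    headers=headers,
    json={"corpora": corpus_id, "urls": [{"url": doc_url}]}
)
print(f"Refreshed {doc_url}")

Run this on a schedule (for example, weekly) to keep frequently changing sources current.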
Other resources unaffected: Removing URLs does not affect files or strings in the same corpus. Each resource type is independent.