🤖 AI & LLM Integrations

Fetch Web Content for RAG Pipelines

RAG pipelines need content. Instead of building a scraper that fights with JavaScript rendering, anti-bot measures, and redirect chains, send URLs to denkbot.dog and get back clean text ready to chunk and embed. The dog fetches. Your vector store remembers.

What you'd use this for

Building knowledge bases from websites; ingesting documentation into LlamaIndex or LangChain vector stores; keeping RAG indexes fresh with live web content; and extracting text from dynamic, JS-rendered pages.

How it works

example
from llama_index.core import Document, VectorStoreIndex
import httpx
import os

# Read the API key from the environment rather than hardcoding it.
DENKBOT_API_KEY = os.environ["DENKBOT_API_KEY"]

def url_to_document(url: str) -> Document:
    r = httpx.post("https://api.denkbot.dog/scrape",
        headers={"Authorization": f"Bearer {DENKBOT_API_KEY}"},
        json={"url": url, "renderJs": True, "format": "json"}, timeout=30)
    r.raise_for_status()  # fail fast on auth or rate-limit errors
    data = r.json()
    return Document(
        text=data["text"],
        metadata={
            "url": data["url"],
            "title": data["title"],
            "description": data["metadata"].get("description", ""),
        }
    )

urls = ["https://docs.example.com/intro", "https://docs.example.com/api"]
documents = [url_to_document(url) for url in urls]
index = VectorStoreIndex.from_documents(documents)

Questions & Answers

How do I handle JS-rendered documentation sites?

Set renderJs: true. Playwright renders the page before extraction, so SPA doc sites work correctly.

Can I crawl a whole docs site and index it?

Use POST /crawl to get all URLs, then batch-scrape them. The crawler returns a tree of all internal links up to 500 pages.

Does the text field strip HTML and navigation?

It strips HTML tags and returns readable text content. Some navigation text may remain — chunk by paragraph for best results.
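Chunking by paragraph can be as simple as splitting on blank lines and packing paragraphs up to a character budget. A minimal sketch (the function name and the 1200-character default are our choices, not part of the API):

```python
def chunk_by_paragraph(text: str, max_chars: int = 1200) -> list[str]:
    """Split scraped text into paragraph-aligned chunks for embedding."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Keeping chunk boundaries on paragraph breaks means stray navigation text tends to land in its own small chunk instead of polluting the content around it.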

Ready to start fetching?

€19/year. Unlimited requests. API key ready in 30 seconds.