Guides
Source-data guides
Practical guides for crawler decisions, website-to-RAG preparation, document processing, and source-data workflows.
Start here
Run the crawler
Collect public web content with scoped search, scrape, crawl, estimates, and export-ready output.
Plan a RAG pipeline
Use crawler and document output as reviewable source material before chunking, embeddings, and retrieval.
Prepare documents
Think through extraction, normalization, metadata, and evidence before processed files become AI input.
Review security posture
Keep crawler and source-data workflows bounded, auditable, and controlled before scaling paid usage.
Guide library
RAG preparation guide
How to prepare website content for RAG
A practical guide to preparing website content for RAG workflows with scoped crawling, clean exports, source review, chunking, and retrieval-ready structure.
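As a rough illustration of scoped crawling, the sketch below filters discovered links down to one host and path prefix and caps the number of pages. It uses only the Python standard library; the names `in_scope`, `select_targets`, `path_prefix`, and `max_pages` are hypothetical and are not part of any SourceOfTruth.io API.

```python
from urllib.parse import urljoin, urlparse


def in_scope(url: str, start_url: str, path_prefix: str = "/") -> bool:
    """Return True if url is http(s), on the start URL's host, and under path_prefix."""
    target = urlparse(url)
    start = urlparse(start_url)
    return (
        target.scheme in ("http", "https")
        and target.hostname == start.hostname
        and target.path.startswith(path_prefix)
    )


def select_targets(start_url, discovered_links, path_prefix="/", max_pages=50):
    """Resolve links against start_url, keep in-scope ones, dedupe, cap at max_pages."""
    selected = []
    for link in discovered_links:
        if len(selected) >= max_pages:
            break  # hard limit so a crawl cannot grow without bound
        # Make the link absolute and drop any #fragment before comparing.
        absolute = urljoin(start_url, link).split("#", 1)[0]
        if in_scope(absolute, start_url, path_prefix) and absolute not in selected:
            selected.append(absolute)
    return selected
```

Deciding the host, path prefix, and page cap before a crawl starts keeps the job bounded and makes the resulting export easier to review.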
Use explicit URL targets and same-site limits before crawling.
Export source material in reviewable formats before embedding.
Keep source evidence so bad answers can be traced back to bad inputs.
Crawler vs ETL guide
Web crawler vs. ETL pipeline: what is the difference?
Understand how web crawlers and ETL pipelines differ, when to use each one, and why SourceOfTruth.io keeps crawler collection separate from broader AI data preparation.
Use a crawler when the source is a website or documentation set.
Use ETL when data needs repeatable extraction, transformation, and loading across systems.
Crawler jobs need limits, estimates, and export evidence.
Document processing guide
AI document processing: from files to usable source text
A source-data guide to AI document processing, including extraction, metadata, cleaning, chunking, evidence review, and metered downstream processing.
Extract text before applying AI analysis.
Keep metadata and source evidence attached to processed output.
Review extraction quality before chunking and embedding.
RAG data pipeline guide
What belongs in a RAG data pipeline?
A practical overview of RAG data pipeline stages: source collection, cleaning, chunking, embeddings, indexing, retrieval testing, and evidence review.
The source layer determines the quality ceiling for RAG.
Chunking and embeddings should happen after source cleanup.
Retrieval tests need source evidence, not just answer demos.
Crawler comparison guide
Looking for a Firecrawl alternative for source-data workflows?
A practical guide for teams evaluating crawler tools for AI source-data workflows, clean exports, metered usage, and RAG preparation.
Evaluate crawler tools by workflow fit, not just raw scrape ability.
Look for clean exports, limits, estimates, and job history.
Keep crawler pricing separate from advanced AI processing costs.
Definitions
Crawler-first launch
The active product surface is Search + Web Crawler, with pricing and credits focused on bounded web collection.
RAG preparation
Source material should be reviewed, cleaned, and structured before it becomes chunks, embeddings, and retrieval context.
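The clean-then-chunk ordering in this definition can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how SourceOfTruth.io processes content: `Chunk`, `clean_text`, and `chunk_with_evidence` are hypothetical names, cleaning is reduced to whitespace normalization, chunks are fixed-size character windows, and the human review step is not modeled.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source_url: str  # where the text came from, kept as evidence
    start: int       # character offset into the cleaned text
    text: str


def clean_text(raw: str) -> str:
    """Collapse whitespace runs and trim, so offsets refer to stable text."""
    return re.sub(r"\s+", " ", raw).strip()


def chunk_with_evidence(source_url, raw, size=200, overlap=20):
    """Clean first, then split into overlapping chunks that keep their source."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    cleaned = clean_text(raw)
    if not cleaned:
        return []
    step = size - overlap
    # Stop early enough that the final chunk is not just a repeat of the overlap.
    starts = range(0, max(len(cleaned) - overlap, 1), step)
    return [Chunk(source_url, s, cleaned[s:s + size]) for s in starts]
```

Because every chunk carries its source URL and offset, a poor retrieval result can be traced back to the exact span of cleaned source text it came from.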
Document processing
File extraction and normalization are a future pipeline surface; page crawling and web exports stay separate.
ETL/ELT roadmap
ETL/ELT stays marked as coming soon until connector, retry, observability, governance, and pricing expectations are production-ready.
Crawler is the live revenue surface. Guides can explain future pipeline direction, but public checkout and active pricing should remain crawler-first.