Custom Web Corpora for LLM Pre-training

Define your domain sources and filtering rules. We crawl, clean, and deliver formatted text data at scale.

Target Specific Domains (e.g., medical, code, forums)
Custom PII redaction & deduping rules
Markdown/JSONL outputs
Incremental updates for fresh data

Keep Your LLM Updated

Targeted Crawling

Instead of crawling the whole web, focus your spend on high-quality domains that matter to your model's niche (e.g., newspapers, legal repositories).

GitHub/StackOverflow Scrapes

Get fresh code data from specific repositories, languages, or timeframes, filtered by license type (MIT, Apache 2.0).
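
For illustration, the license filter maps onto standard repository metadata; the public GitHub search API exposes the same qualifier. A rough sketch (query terms and cutoff date are example values, not our production pipeline):

    # Rough illustration: querying the public GitHub search API for
    # permissively licensed Python repositories pushed after a cutoff date.
    # The query string and cutoff are example values.
    import requests

    query = "language:python license:mit pushed:>2024-01-01"
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "sort": "updated", "per_page": 5},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    for repo in resp.json()["items"]:
        spdx = repo["license"]["spdx_id"] if repo["license"] else None
        print(repo["full_name"], spdx)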

Optimized File Formats

Data delivered as JSONL, Parquet, or plain Markdown text, ready for direct ingestion into HuggingFace datasets or custom data loaders.
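
For example, a delivered JSONL shard (the shard name and "text" field below are illustrative) loads directly with the Hugging Face datasets library:

    # Minimal sketch: loading a delivered JSONL shard with the "datasets" library.
    # The file name and the "text" field are illustrative placeholders.
    from datasets import load_dataset

    ds = load_dataset("json", data_files="corpus-00001.jsonl", split="train")
    print(ds[0]["text"][:200])  # inspect the first document

    # Parquet shards load the same way:
    # ds = load_dataset("parquet", data_files="corpus-00001.parquet", split="train")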

Vertical-Specific Datasets

Deep dives into Finance (reports), Healthcare (PubMed), or Law (case texts) with domain-specific metadata extraction.

Why Choose Our Text Datasets?

Privacy First

Personally Identifiable Information (PII) is automatically detected and redacted.

Pricing

Fixed pricing by token count or data processing volume.

Custom Tokenization

We can pre-tokenize data using your model's tokenizer (e.g. Llama 3, GPT-4) to save you compute time.
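
As a rough sketch of what pre-tokenization looks like on your side (assuming a Llama 3 tokenizer from the Hugging Face Hub; the documents and model name are illustrative):

    # Sketch: pre-tokenizing documents with a customer-specified tokenizer.
    # The model name assumes access to the gated Llama 3 repository.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    docs = ["First example document.", "Second example document."]
    token_ids = [tok(d, add_special_tokens=False)["input_ids"] for d in docs]
    print(sum(len(ids) for ids in token_ids), "tokens total")

    # For GPT-4-style encodings, tiktoken.get_encoding("cl100k_base") works similarly.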

Service Guarantee

Consistent data quality updates & 24/7 support.

Cost Effective

Eliminate engineering time spent on scraper maintenance.

Technical Specifications

Enterprise-grade web data capabilities.

Source Types: Specific Domains, Common Crawl Filtered, GitHub, StackOverflow, Reddit
Output Formats: JSONL, Parquet, Markdown, Plain Text
Technology: Headless Browsers (Full JS Rendering), Anti-bot Bypass
Cleaning & Processing: HTML-to-Text, PII Redaction, MinHash Deduplication, Language ID
Throughput: 1B+ Pages/Week
Delivery: Direct S3/GCS Push, Delta Updates

FAQs

Can you crawl specific domains or URL lists?

Yes. You provide the seed list or domain pattern (e.g., *.gov, specific blogs). We configure our crawler to stay strictly within those boundaries or explore linked pages up to a specified depth.
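
Conceptually, the boundary check is an allowlist on host plus a depth limit. A toy sketch (seed URL, allowed host, and depth limit are example values, not our production crawler):

    # Toy sketch of a boundary-restricted crawl: stay on allowed hosts and stop
    # at a fixed link depth. The seed URL and limits are example values.
    from collections import deque
    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    ALLOWED_HOSTS = {"example.gov"}
    MAX_DEPTH = 2

    queue, seen = deque([("https://example.gov/", 0)]), set()
    while queue:
        url, depth = queue.popleft()
        if url in seen or depth > MAX_DEPTH:
            continue
        seen.add(url)
        html = requests.get(url, timeout=30).text
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            nxt = urljoin(url, a["href"])
            if urlparse(nxt).netloc in ALLOWED_HOSTS:
                queue.append((nxt, depth + 1))
    print(len(seen), "pages visited")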

How do you handle dynamic content and JS rendering?

Our crawling infrastructure includes a headless browser fleet that fully renders JavaScript, ensuring we capture content from Single Page Applications (SPAs) and complex sites just as a user sees them.
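
The same effect can be reproduced with an off-the-shelf headless browser such as Playwright; a minimal sketch (the URL is an example, not our internal fleet):

    # Minimal sketch: rendering a JavaScript-heavy page with a headless browser
    # (Playwright) before extracting its content. The URL is an example.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/spa-page", wait_until="networkidle")
        rendered_html = page.content()
        browser.close()
    print(len(rendered_html), "bytes of rendered HTML")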

What does the custom cleaning pipeline include (PII, Deduping)?

We offer modular cleaning steps: HTML-to-Text conversion (Trafilatura/Readability), PII redaction (Presidio), fuzzy deduplication (MinHash LSH), and toxic content filtering. You choose which modules to enable.
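
As an illustration of two of those modules, here is a sketch combining HTML-to-text extraction (Trafilatura) with fuzzy deduplication (MinHash LSH via the datasketch library); the input documents and the 0.8 similarity threshold are example values:

    # Illustrative sketch of two cleaning modules: HTML-to-text extraction
    # (Trafilatura) and fuzzy deduplication (MinHash LSH via datasketch).
    # The documents and the similarity threshold are example values.
    import trafilatura
    from datasketch import MinHash, MinHashLSH

    html_docs = ["<html><body><p>Sample page text.</p></body></html>"]
    texts = [trafilatura.extract(h) or "" for h in html_docs]

    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    kept = []
    for i, text in enumerate(texts):
        mh = MinHash(num_perm=128)
        for token in text.lower().split():
            mh.update(token.encode("utf-8"))
        if not lsh.query(mh):          # no near-duplicate already kept
            lsh.insert(f"doc-{i}", mh)
            kept.append(text)
    print(len(kept), "documents kept after dedup")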

Can you filter datasets by language and dialect?

Yes. We use fasttext-based language identification to separate content into language-specific buckets. We can target over 100 languages and filter out mixed-language documents.
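
Under the hood this is the standard fastText language-identification model; a minimal sketch (assuming the publicly released lid.176.bin model file is available locally):

    # Minimal sketch: language identification with the public fastText lid.176
    # model. Assumes lid.176.bin has been downloaded to the working directory.
    import fasttext

    model = fasttext.load_model("lid.176.bin")
    labels, scores = model.predict("Ceci est une phrase en français.", k=1)
    print(labels[0], round(float(scores[0]), 3))  # e.g. "__label__fr" with its confidence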

Do you support incremental crawls and updates?

We can set up recurring jobs to revisit your target domains daily or weekly, extracting only new or modified content and delivering it as delta files to your S3 bucket.
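
A simplified picture of the delta step: hash page content to skip unchanged documents, then push only new or modified records to your bucket. In the sketch below the bucket name, key, and sample pages are placeholders:

    # Simplified sketch of a delta update: hash page content to skip unchanged
    # documents, then push only the new/changed records to S3 with boto3.
    # Bucket name, key, and sample pages are placeholders.
    import hashlib
    import json
    import boto3

    # Hashes recorded on the previous run (placeholder data).
    previous_hashes = {"https://example.com/a": hashlib.sha256(b"old text").hexdigest()}
    # Content seen on the current run (placeholder data).
    crawled = {"https://example.com/a": "old text", "https://example.com/b": "new text"}

    delta = []
    for url, text in crawled.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if previous_hashes.get(url) != digest:   # skip unchanged pages
            delta.append({"url": url, "text": text, "sha256": digest})

    body = "\n".join(json.dumps(rec) for rec in delta).encode("utf-8")
    boto3.client("s3").put_object(Bucket="customer-bucket", Key="deltas/2024-06-01.jsonl", Body=body)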