High-Bandwidth Proxies for Web Text Scraping

Customizable bandwidth from 1 to 100Gbps+
Scrape 100K+ sites at the same time
Supports text from Alexa/Tranco Top sites, news, forums, e-commerce, and more

Typical Use Cases

Alexa / Tranco Top Sites

Long-running crawls of Top sites to build a general-purpose LLM text corpus.

News, Blogs, Information

Collect article text and titles from news, blog, and information websites to build a time-series corpus.

E-commerce

Collect product titles, descriptions, and review text from e-commerce websites for recommendation and search models.

Forums, Q&A

Collect multi-turn dialogue text from forums and Q&A communities (subject to compliance requirements) to build a dialogue corpus.

High-Bandwidth Proxy to Scrape 100K+ Sites

High Bandwidth

Dedicated 1Gbps to 200Gbps+ (Customizable)

Pricing

Fixed pricing by bandwidth rather than by traffic, so costs are predictable.

High-Bandwidth Proxy

123Proxy provides a high-bandwidth proxy pool service built specifically for AI training data collection: fixed-bandwidth billing (1Gbps–100Gbps+), unlimited total traffic, and unlimited concurrent requests.

1-200Gbps+ dedicated bandwidth
Unlimited concurrent requests
Priced by bandwidth (per Gbps)
Service Guarantee

Automatic monitoring of target-site bot detection to help keep access to target websites from being blocked.

Cost Effective

Highly cost-effective for large-scale data scraping.

Scrapy / Playwright Integration Example


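# Example 1: Use the 123Proxy high-bandwidth proxy in Scrapy's settings.py
import os

# Scrapy's HttpProxyMiddleware reads proxies from the http_proxy / https_proxy
# environment variables (or from request.meta["proxy"]), not from a setting.
PROXY_URL = "http://USERNAME_sessionId_time:PASSWORD@gateway.123proxy.cn:31000"
os.environ["http_proxy"] = PROXY_URL
os.environ["https_proxy"] = PROXY_URL

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 400,
}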

By assigning different sessionIds to different crawler nodes or tasks, you can spread the large request volume of Top-site crawls across many exit IPs.
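As a rough illustration of per-node sessionIds, the sketch below builds one proxy URL per crawler node. It assumes the username parts are joined with underscores exactly as the USERNAME_sessionId_time placeholder suggests; the helper names `proxy_url` and `session_for_node` and the `node000` naming scheme are hypothetical, and the `time` segment is passed through untouched because its meaning is not specified here.

```python
from urllib.parse import quote

GATEWAY = "gateway.123proxy.cn:31000"  # gateway host and port from Example 1


def proxy_url(username: str, password: str, session_id: str, time_field: str) -> str:
    """Build a proxy URL whose username follows the USERNAME_sessionId_time pattern.

    Assumption: the three parts are joined with underscores, as the placeholder suggests.
    """
    user = f"{username}_{session_id}_{time_field}"
    # Percent-encode credentials so characters such as '@' or ':' cannot break the URL.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{GATEWAY}"


def session_for_node(node_index: int) -> str:
    """Hypothetical naming scheme: one stable session ID per crawler node."""
    return f"node{node_index:03d}"


# One distinct proxy URL for each of three crawler nodes.
urls = [proxy_url("USERNAME", "PASSWORD", session_for_node(i), "time") for i in range(3)]
```

Each node would then use its own URL wherever Example 1 uses the fixed one. Check the 123Proxy account documentation for the exact username format before relying on this.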


# Example 2: Access a page through a 123Proxy exit in Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://gateway.123proxy.cn:31000",
            "username": "USERNAME_sessionId_time",
            "password": "PASSWORD",
        },
    )
    page = browser.new_page()
    page.goto("https://example.com", timeout=60000)
    print(page.title())
    browser.close()

FAQs

Q1: Crawling Top sites often returns 403 / 503. What should I do?

Solution:
- Enable the 123Proxy high-bandwidth dedicated proxy to reduce the probability of blocking through multi-region, multi-IP rotation.
- Add rate limiting and backoff strategies in Scrapy or in-house crawlers to avoid concentrated bursts of requests in a short time.
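The backoff idea can be sketched in a few lines. This is a minimal example of exponential backoff with full jitter, not a 123Proxy or Scrapy API; the function names, the default base and cap, and the retry limit are all illustrative assumptions.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random.random) -> float:
    """Exponential backoff with full jitter.

    Returns a random delay between 0 and min(cap, base * 2**attempt) seconds.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


RETRYABLE_STATUS = {403, 503}  # the status codes named in Q1


def should_retry(status: int, attempt: int, max_attempts: int = 5) -> bool:
    """Retry only on the listed status codes, and only up to max_attempts."""
    return status in RETRYABLE_STATUS and attempt < max_attempts
```

A crawler would sleep for `backoff_delay(attempt)` seconds before each retry. In Scrapy, the built-in `AUTOTHROTTLE_ENABLED`, `DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS_PER_DOMAIN`, and `RETRY_TIMES` settings cover similar ground without custom code.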

Q2: What if I run into Cloudflare / bot protection and frequent CAPTCHAs?

For sites with this kind of protection, we recommend first using the open API or data interface the site itself provides, rather than brute-force crawling. If there is a legitimate business need to crawl, strictly control the request frequency to reduce the impact on the target site.
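One simple way to control request frequency is to enforce a minimum interval between requests to the same host. The class below is a sketch under that assumption; `PerHostThrottle` and its parameters are hypothetical names, and the right interval depends on the target site's terms and capacity.

```python
import time
from urllib.parse import urlsplit


class PerHostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url: str) -> float:
        """Block until the host of `url` may be requested again; return the seconds slept."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next slot relative to when this request actually starts.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

Calling `throttle.wait(url)` before each fetch keeps requests to any single host at least `min_interval` seconds apart, while requests to different hosts are not delayed by each other.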

Q3: What is the difference between 1Gbps, 10Gbps, and 100Gbps in text collection scenarios?

- 1Gbps: Suitable for small and medium-scale crawling tasks with hundreds of concurrent requests.
- 10Gbps: Supports thousands of concurrent requests and can complete a full round of Top-site crawling in a relatively short time.
- 100Gbps: Suitable for around-the-clock, global-scale LLM training data collection projects.
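To get a feel for these tiers, a back-of-the-envelope estimate converts link speed into pages per second. The 100 KB average page size below is purely an illustrative assumption, and the result is an upper bound that ignores protocol overhead, latency, and target-site rate limits.

```python
def pages_per_second(bandwidth_gbps: float, avg_page_kb: float, utilization: float = 1.0) -> float:
    """Upper bound on pages fetched per second for a given link speed.

    1 Gbps = 1e9 bits/s = 125,000,000 bytes/s; 1 KB is taken as 1,000 bytes.
    """
    bytes_per_second = bandwidth_gbps * 1e9 / 8 * utilization
    return bytes_per_second / (avg_page_kb * 1_000)


for gbps in (1, 10, 100):
    # avg_page_kb=100 is an illustrative assumption, not a measured figure.
    print(f"{gbps:>3} Gbps -> about {pages_per_second(gbps, 100):,.0f} pages/s")
```

Under these assumptions 1Gbps works out to at most 1,250 pages per second, and the figure scales linearly with bandwidth. Real throughput is usually lower, so measure your own average response size before sizing a plan.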

Q4: How do I combine a distributed Scrapy cluster with the proxy?

- Multiple crawler machines can share the same high-bandwidth proxy exit, so that all outbound traffic goes through 123Proxy.
- Assign different `sessionId` values or sub-accounts to different crawler projects or queues to achieve logical isolation.
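The isolation described above could be implemented as a small Scrapy downloader middleware that picks a sessionId based on the spider name. This is only a sketch: the class name, the `SESSIONS` mapping, and the `PROXY_USERNAME` / `PROXY_PASSWORD` settings are hypothetical, and the username format is assumed to follow the USERNAME_sessionId_time placeholder from Example 1.

```python
from types import SimpleNamespace


class SessionProxyMiddleware:
    """Sketch of a Scrapy downloader middleware that picks a proxy session per spider."""

    GATEWAY = "gateway.123proxy.cn:31000"

    # Hypothetical mapping from spider (project/queue) name to a session ID.
    SESSIONS = {"news": "newsq01", "shop": "shopq01"}

    def __init__(self, username: str, password: str, time_field: str = "time"):
        self.username = username
        self.password = password
        self.time_field = time_field

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_USERNAME / PROXY_PASSWORD are hypothetical custom settings.
        settings = crawler.settings
        return cls(settings.get("PROXY_USERNAME"), settings.get("PROXY_PASSWORD"))

    def proxy_for(self, spider_name: str) -> str:
        session_id = self.SESSIONS.get(spider_name, "default")
        user = f"{self.username}_{session_id}_{self.time_field}"
        return f"http://{user}:{self.password}@{self.GATEWAY}"

    def process_request(self, request, spider):
        # Leave requests alone if something else has already chosen a proxy.
        request.meta.setdefault("proxy", self.proxy_for(spider.name))
        return None  # let Scrapy continue processing the request


# Stand-ins for Scrapy's Request and Spider objects, just to exercise the method.
request = SimpleNamespace(meta={})
mw = SessionProxyMiddleware("USERNAME", "PASSWORD")
mw.process_request(request, SimpleNamespace(name="news"))
```

In a real project the class would be registered in `DOWNLOADER_MIDDLEWARES` with a lower order number than `HttpProxyMiddleware`, so that the proxy is set in `request.meta` before that middleware processes the request.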

Q5: How do I ensure privacy and compliance when collecting text data?

Using a proxy is legal in itself, but collecting and using text data must comply with the terms of the target site and with local laws. We recommend avoiding content behind a login and sensitive personal information, and establishing internal data audit and deletion mechanisms so that the data used to train LLMs stays within a legal and compliant scope.