High-Bandwidth Proxies for Web Text Scraping

Customizable bandwidth from 1Gbps to 100Gbps+
Scrape 100K+ sites at the same time
Supports text from Alexa/Tranco Top sites, news, forums, e-commerce, and more

Typical Use Cases

Alexa / Tranco Top Sites

Long-cycle crawling of Top sites to build a general LLM text corpus.

News, Blogs, Information

Collect articles and headlines from news, blog, and information websites to build a time-series corpus.

E-commerce

Collect product titles, descriptions, and review text from e-commerce websites for recommendation and search models.

Forums, Q&A

Collect multi-turn dialogue text from forums and Q&A communities (subject to compliance requirements) to build a dialogue corpus.

High-Bandwidth Proxies to Scrape 100K+ Sites

High Bandwidth

Dedicated 1Gbps to 200Gbps+ (Customizable)

Pricing

Fixed pricing by bandwidth rather than traffic, for predictable costs

High-Bandwidth Proxy

123Proxy provides a high-bandwidth proxy pool service built specifically for AI training data collection: fixed bandwidth billing (1Gbps-200Gbps+), unlimited total traffic, and unlimited concurrent requests.

1-200Gbps+ dedicated
Unlimited concurrent requests
Priced per Gbps of bandwidth
Contact Sales
Service Guarantee

Automatic monitoring of target-site bot detection, helping ensure target websites do not block your traffic

Cost Effective

Excellent cost-effectiveness for large-scale data scraping.

Scrapy / Playwright Integration Example


# Example 1: Use a 123Proxy high-bandwidth proxy in Scrapy
# Note: HttpProxyMiddleware reads the proxy from request.meta["proxy"]
# (or the http_proxy/https_proxy environment variables), not from a
# custom HTTP_PROXY setting in settings.py.

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 400,
}

# In your spider, attach the proxy to each request:
# yield scrapy.Request(url, meta={
#     "proxy": "http://USERNAME_sessionId_time:PASSWORD@gateway.123proxy.cn:31000",
# })

By assigning different sessionIds to different crawler nodes or tasks, you can spread the huge request volume of Top sites across a large pool of exit IPs.
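As a minimal sketch of this pattern, a proxy URL can be derived per crawler node from its own sessionId. The `USERNAME_sessionId_time` credential format and gateway address follow the example above; the `proxy_url` helper and node naming are illustrative, not part of 123Proxy's API.

```python
# Build per-node proxy URLs so traffic fans out across exit IPs.
# Credential format "USERNAME_sessionId_time" and the gateway address
# follow the Scrapy example above; everything else is illustrative.

def proxy_url(username: str, password: str, session_id: str,
              gateway: str = "gateway.123proxy.cn:31000") -> str:
    """Embed a sticky sessionId in the proxy username."""
    return f"http://{username}_{session_id}_time:{password}@{gateway}"

# Give each crawler node (or task queue) its own sessionId; the gateway
# then maps each session to a distinct exit IP.
node_proxies = {
    f"node-{i}": proxy_url("USERNAME", "PASSWORD", f"session{i}")
    for i in range(4)
}
```

A spider running on `node-2` would then send requests with `meta={"proxy": node_proxies["node-2"]}`.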


# Example 2: Access page via 123Proxy exit in Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://gateway.123proxy.cn:31000",
            "username": "USERNAME_sessionId_time",
            "password": "PASSWORD",
        },
    )
    page = browser.new_page()
    page.goto("https://example.com", timeout=60000)
    print(page.title())
    browser.close()

Technical Specifications

Optimized for full-web text scraping.

Supported Frameworks: Scrapy, Playwright, Selenium, Puppeteer, Colly
Target Sites: Alexa Top 1M, news media, forums, e-commerce, social media
Bandwidth: 1Gbps - 100Gbps+ dedicated
Concurrency: Unlimited (high-RPS optimized)
Protocols: HTTP/HTTPS, SOCKS5

FAQs

How to fix Scrapy 403 / 503 errors on Top sites?

Solution:
- Use 123Proxy's high-bandwidth dedicated proxies with multi-region, multi-IP rotation to reduce the probability of blocking.
- Add rate limiting and backoff strategies in Scrapy or self-developed crawlers to avoid bursts of concentrated requests.
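For the second point, Scrapy's built-in AutoThrottle extension and retry middleware can provide the rate limiting and backoff; the numbers below are illustrative values to tune per target site, not recommendations.

```python
# settings.py -- throttling and retry/backoff (illustrative values)
AUTOTHROTTLE_ENABLED = True           # adapt delay to observed latency
AUTOTHROTTLE_START_DELAY = 1.0        # initial download delay, seconds
AUTOTHROTTLE_MAX_DELAY = 30.0         # cap on the backoff delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # avoid concentrated bursts per site

RETRY_ENABLED = True
RETRY_TIMES = 3                       # retries per failed request
RETRY_HTTP_CODES = [403, 429, 503]    # retry on block/throttle responses
```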

How to bypass Cloudflare / Bot protection?

For sites with such complex protection, we recommend first using any open API or data interface the site provides rather than brute-force crawling. If there is a legitimate business need, strictly control the request frequency to minimize the impact on the target site.

What is the typical throughput for 1Gbps vs 10Gbps?

- 1Gbps: suitable for small and medium-scale crawling tasks with hundreds of concurrent requests.
- 10Gbps: supports thousands of concurrent requests and can complete a full pass over Top sites in a relatively short time.
- 100Gbps: suitable for around-the-clock, global-scale LLM training data collection projects.
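A rough back-of-envelope way to relate these tiers to page throughput, assuming an average page size (the 50 KB figure is an assumption, not a 123Proxy number) and ignoring protocol overhead:

```python
# Estimate an upper bound on page throughput from dedicated bandwidth.
# avg_page_kb is an assumed average page size, not a measured figure.

def pages_per_second(bandwidth_gbps: float, avg_page_kb: float = 50.0) -> float:
    bits_per_page = avg_page_kb * 1024 * 8           # page size in bits
    return bandwidth_gbps * 1e9 / bits_per_page      # link rate / page size

for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps ~ {pages_per_second(gbps):,.0f} pages/s (upper bound)")
```

Real throughput will be lower once TLS handshakes, headers, and retries are accounted for, but the ratio between tiers holds.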

How to configure proxies for Scrapy distributed clusters?

- Multiple crawler machines can share the same high-bandwidth proxy exit, so all outbound traffic is unified through 123Proxy.
- Assign different `sessionId`s / sub-accounts to different crawler projects or queues for logical isolation.
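For the second point, one option is to derive each `sessionId` deterministically from the project name and queue shard, so isolation survives worker restarts. A sketch; the naming scheme and helper are illustrative:

```python
import hashlib

# Derive a stable sessionId per (project, shard) pair so each crawler
# queue keeps its own logical identity at the proxy gateway.

def session_id(project: str, shard: int) -> str:
    digest = hashlib.sha1(f"{project}:{shard}".encode()).hexdigest()[:8]
    return f"{project}-{shard}-{digest}"

# Plug into the "USERNAME_sessionId_time" credential format shown above:
username = f"USERNAME_{session_id('news', 0)}_time"
```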

Is scraping public web data compliant?

Using a proxy is itself legal, but collecting and using text data must comply with the target site's terms and local laws. We recommend avoiding post-login content and sensitive personal information, and establishing internal data audit and deletion mechanisms so that data used for LLM training stays within legal and compliant bounds.