
Long-cycle crawling of top sites to build a general-purpose LLM text corpus.
Collect articles and titles from news sites, blogs, and information websites to build a time-series corpus.
Collect product titles, details, and review text from e-commerce websites for recommendation and search models.
Collect multi-turn dialogue text from forums and Q&A communities (subject to compliance requirements) to build a dialogue corpus.
Dedicated bandwidth from 1Gbps to 200Gbps+ (customizable)
Fixed pricing by bandwidth rather than by traffic, for predictable costs
123Proxy provides a high-bandwidth proxy pool service built specifically for AI training data collection: fixed-bandwidth billing (1Gbps–100Gbps+), unlimited total traffic, and unlimited concurrent requests.
Automatic monitoring of target-site bot detection, so that crawlers are not blocked by the target website
Highly cost-effective for large-scale data scraping.
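# Example 1: Use a 123Proxy high-bandwidth proxy in Scrapy's settings.py
import os

# Scrapy's HttpProxyMiddleware does not read an HTTP_PROXY setting; it picks up
# the http_proxy / https_proxy environment variables (or request.meta["proxy"]).
PROXY_URL = "http://USERNAME_sessionId_time:PASSWORD@gateway.123proxy.cn:31000"
os.environ["http_proxy"] = PROXY_URL
os.environ["https_proxy"] = PROXY_URL
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 400,
}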
By assigning different sessionIds to different crawler nodes or tasks, you can spread the heavy request volume of top-site crawls across a large number of exit IPs.
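As a rough illustration of that idea, the sketch below builds one proxy URL per crawler task. It assumes the username follows the `USERNAME_sessionId_time` pattern shown in Example 1, joined with underscores; the helper name `build_proxy_url`, the task names, and the `TIME` placeholder are hypothetical, and the exact meaning of the time field should be taken from the provider's documentation.

```python
from urllib.parse import quote


def build_proxy_url(username, password, session_id, time_field,
                    host="gateway.123proxy.cn", port=31000):
    """Return a proxy URL whose username carries a per-task session label.

    Assumption: the fields are joined with underscores, mirroring the
    USERNAME_sessionId_time placeholder. time_field is passed through as
    an opaque string because its semantics are provider-defined.
    """
    user = quote(f"{username}_{session_id}_{time_field}", safe="")
    pwd = quote(password, safe="")  # escape characters such as "@" or ":"
    return f"http://{user}:{pwd}@{host}:{port}"


# Hypothetical task names: one session label per crawler node or queue.
tasks = ["news-node-01", "blog-node-02", "forum-node-03"]
proxies = {
    task: build_proxy_url("USERNAME", "PASSWORD", task, "TIME")
    for task in tasks
}

for task, url in proxies.items():
    print(task, "->", url)
```

Each node or queue can then pass its own URL to its HTTP client or to `request.meta["proxy"]` in Scrapy, so that sessions stay separate even though all traffic goes through the same gateway.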
# Example 2: Access a page through a 123Proxy exit in Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Route all browser traffic through the proxy gateway
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://gateway.123proxy.cn:31000",
            "username": "USERNAME_sessionId_time",
            "password": "PASSWORD",
        },
    )
    page = browser.new_page()
    page.goto("https://example.com", timeout=60000)  # timeout in milliseconds
    print(page.title())
    browser.close()
Solution:
- Enable the 123Proxy high-bandwidth dedicated proxy to reduce the probability of blocking through multi-region, multi-IP rotation.
- Add rate limiting and backoff strategies in Scrapy or in-house crawlers to avoid concentrated bursts of requests in a short time.
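One common way to implement the backoff part is exponential backoff with jitter. The sketch below is a minimal, library-independent version; `fetch` is a hypothetical callable returning `(status_code, body)`, and the retry statuses (429, 503) and default delays are assumptions rather than values recommended by 123Proxy.

```python
import random
import time


def backoff_delays(max_retries=5, base=1.0, cap=60.0, rng=random.random):
    """Yield one delay in seconds per retry: exponential growth, full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling  # random point in [0, ceiling)


def fetch_with_backoff(fetch, url, sleep=time.sleep, retry_statuses=(429, 503)):
    """Call fetch(url); while the status signals throttling, wait and retry.

    fetch is a hypothetical callable: fetch(url) -> (status_code, body).
    Returns the last (status_code, body) once retries are exhausted.
    """
    status, body = fetch(url)
    for delay in backoff_delays():
        if status not in retry_statuses:
            break
        sleep(delay)
        status, body = fetch(url)
    return status, body


# Demo with a fake fetcher that is throttled twice before succeeding.
fake_responses = iter([(429, ""), (503, ""), (200, "ok")])
recorded_waits = []
result = fetch_with_backoff(lambda url: next(fake_responses),
                            "https://example.com",
                            sleep=recorded_waits.append)
print(result, "after", len(recorded_waits), "waits")
```

In Scrapy itself, the built-in `DOWNLOAD_DELAY` and `AUTOTHROTTLE_ENABLED` settings cover similar ground without custom code.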
For sites with such complex protection, prefer the open API or data interface that the site operator provides over brute-force crawling. If there is a legitimate business need to crawl, strictly control the request frequency to reduce the impact on the target site.
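Controlling request frequency can be as simple as enforcing a minimum interval between requests to the same host. The class below is a sketch under that assumption; the name `PerHostRateLimiter` and the one-second default are illustrative only, and a suitable interval depends on the target site.

```python
import time
from urllib.parse import urlsplit


class PerHostRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url):
        """Block until a request to url's host is allowed; return the wait time."""
        host = urlsplit(url).netloc
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # Reserve the next slot relative to when this request actually starts.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay


limiter = PerHostRateLimiter(min_interval=0.2)
for url in ["https://example.com/a", "https://example.com/b"]:
    waited = limiter.wait(url)
    print(f"{url}: waited {waited:.2f}s before the request")
```

Calling `limiter.wait(url)` immediately before each request keeps per-host traffic bounded no matter how many exit IPs the requests are spread across.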
- 1Gbps: Suitable for small and medium-scale crawling tasks with hundreds of concurrent requests.
- 10Gbps: Supports thousands of concurrent requests and can complete a full round of top-site crawling in a relatively short time.
- 100Gbps: Suitable for around-the-clock, global-scale LLM training data collection projects.
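To get a feel for what these tiers mean in practice, the back-of-the-envelope calculation below converts link speed into a theoretical upper bound on daily volume and page throughput. The 100 KB average page size is an assumption for illustration; real throughput will be lower because of protocol overhead, target-site latency, and rate limits.

```python
def bytes_per_second(bandwidth_gbps, utilization=1.0):
    """Convert link speed in Gbit/s to bytes per second (decimal units)."""
    return bandwidth_gbps * 1e9 / 8 * utilization


def daily_volume_tb(bandwidth_gbps, utilization=1.0):
    """Upper bound on data moved in 24 hours, in decimal terabytes."""
    return bytes_per_second(bandwidth_gbps, utilization) * 86_400 / 1e12


def pages_per_second(bandwidth_gbps, avg_page_kb, utilization=1.0):
    """Upper bound on pages fetched per second for a given average page size."""
    return bytes_per_second(bandwidth_gbps, utilization) / (avg_page_kb * 1000)


AVG_PAGE_KB = 100  # assumed average page size, for illustration only

for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps: up to {daily_volume_tb(gbps):.1f} TB/day, "
          f"about {pages_per_second(gbps, AVG_PAGE_KB):.0f} pages/s "
          f"at {AVG_PAGE_KB} KB per page")
```

At full utilization, 1 Gbps works out to 125 MB/s, or about 10.8 TB per day; the 10 Gbps and 100 Gbps tiers scale linearly from there.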
- Multiple crawler machines share the same high-bandwidth proxy exit, so all outbound traffic leaves through 123Proxy in a unified way.
- Assign different `sessionId` values or sub-accounts to different crawler projects or queues to achieve logical isolation.
Using a proxy is itself legal, but collecting and using text data must comply with the terms of the target site and with local laws. It is recommended to avoid collecting login-gated content and sensitive personal information, and to establish internal data audit and deletion mechanisms so that data used for LLM training stays within legal and compliant bounds.