
Batch-clone popular repositories whose star count exceeds a chosen threshold to build a training corpus for code LLMs.
Use the GitHub REST or GraphQL API to crawl issues, pull requests, commit history, and other metadata.
Periodically sync repository updates to keep the code knowledge base current.
Provide a dedicated-line static proxy for mirroring and backup synchronization of enterprise GitHub / GitHub Enterprise deployments.
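As a rough illustration of the star-threshold selection step, the sketch below builds a query for GitHub's repository search endpoint and pulls clone URLs out of a decoded response. The function names are hypothetical. Note that the search API caps `per_page` at 100 and returns at most 1,000 results per query, so a real crawler would likely need to split the star range into several queries.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def build_star_search_url(min_stars: int, page: int = 1, per_page: int = 100) -> str:
    """Return a search URL for repositories with more than `min_stars` stars."""
    params = {
        "q": f"stars:>{min_stars}",  # GitHub search qualifier for star count
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def clone_urls(search_response: dict) -> list[str]:
    """Extract HTTPS clone URLs from a decoded search response body."""
    return [item["clone_url"] for item in search_response.get("items", [])]
```

The resulting URL can be fetched with the same proxy and token settings used elsewhere on this page, and the returned clone URLs fed into the cloning queue.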
Dedicated bandwidth from 1Gbps to 200Gbps+ (customizable)
Fixed pricing by bandwidth rather than by traffic, so costs are predictable
123Proxy provides a high-bandwidth proxy pool service built for AI training data collection: fixed-bandwidth billing (1Gbps–100Gbps+), unlimited total traffic, and unlimited concurrent requests.
Automatic monitoring of target-site anti-bot measures, so that access to the target website is not blocked
Highly cost-effective for large-scale data scraping.
# Example 1: Use 123Proxy High Bandwidth Proxy IP for git clone
export https_proxy="http://USERNAME_sessionId_time:PASSWORD@gateway.123proxy.cn:31000"
export http_proxy="$https_proxy"
git clone https://github.com/owner/repo.git
Each sessionId can be bound to a different exit IP, so large numbers of repositories can be cloned concurrently.
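One way to put per-session exit IPs to work is to give each clone job its own proxy URL and run the jobs in a thread pool. This is only a sketch: the gateway and the `USERNAME_sessionId_time` username layout are copied from Example 1, while the helper names and the meaning of the `session_id` and `time` arguments are assumptions that should be checked against the 123Proxy documentation.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote

GATEWAY = "gateway.123proxy.cn:31000"  # gateway host and port from Example 1


def proxy_url(username: str, password: str, session_id: str, time: str) -> str:
    """Build a proxy URL using the USERNAME_sessionId_time layout from Example 1."""
    user = f"{username}_{session_id}_{time}"
    # Percent-encode credentials so characters such as '@' do not break the URL.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{GATEWAY}"


def clone_via(proxy: str, repo_url: str, dest: str) -> int:
    """Run `git clone` with the proxy set only for this child process."""
    env = dict(os.environ, https_proxy=proxy, http_proxy=proxy)
    return subprocess.run(["git", "clone", repo_url, dest], env=env).returncode


def clone_all(jobs, max_workers: int = 8) -> list[int]:
    """jobs: iterable of (proxy, repo_url, dest) tuples; returns git exit codes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: clone_via(*job), jobs))
```

Setting the proxy in each child's environment, rather than exporting it globally, lets every clone use a different session within the same Python process.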
# Example 2: Call GitHub API via 123Proxy
import requests

# Route both HTTP and HTTPS traffic through the same 123Proxy gateway.
PROXY = "http://USERNAME_sessionId_time:PASSWORD@gateway.123proxy.cn:31000"
proxies = {"http": PROXY, "https": PROXY}

# Authenticate with a GitHub token and request the standard JSON media type.
headers = {
    "Authorization": "Bearer YOUR_GITHUB_TOKEN",
    "Accept": "application/vnd.github+json",
}

resp = requests.get(
    "https://api.github.com/repos/owner/repo",
    proxies=proxies,
    headers=headers,
    timeout=30,
)
print(resp.status_code, resp.json().get("full_name"))
Solution:
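- Enable the 123Proxy high-bandwidth dedicated proxy to ensure link quality to GitHub.
- Adjust `http.postBuffer` if needed (it is usually raised, not lowered, for large transfers, although it only governs the buffer Git uses for HTTP POST requests) and add retry logic for large repositories.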
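The retry advice above could be implemented with a small exponential-backoff wrapper around `git clone`. The sketch below is illustrative only: the function names, attempt count, and delays are assumptions, not values recommended by 123Proxy.

```python
import subprocess
import time


def retry(action, attempts: int = 4, base_delay: float = 2.0, sleep=time.sleep) -> bool:
    """Call `action` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if action():
            return True
        if attempt < attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return False


def clone_once(repo_url: str, dest: str) -> bool:
    """Run a single `git clone`; True when git exits with status 0."""
    return subprocess.run(["git", "clone", repo_url, dest]).returncode == 0
```

A call such as `retry(lambda: clone_once("https://github.com/owner/repo.git", "repo"))` would then attempt the clone up to four times, pausing 2, 4, and 8 seconds between attempts.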
For large-scale API calls:
- Authenticate every request with a token, and spread the load sensibly across multiple tokens.
- Distribute requests across different `sessionId` values / exit IPs to avoid concentrating traffic on a single IP.
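The two points above can be combined by pairing each token with its own proxy session and cycling through the pairs, while watching GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. This is a minimal sketch with hypothetical helper names. Pinning one token to one session is a design assumption here, not a documented 123Proxy requirement.

```python
import itertools
import time


def rotation(tokens, proxies):
    """Yield (headers, proxies) pairs, cycling token/proxy pairs round-robin."""
    for token, proxy in itertools.cycle(list(zip(tokens, proxies))):
        yield (
            {"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"},
            {"http": proxy, "https": proxy},
        )


def seconds_until_reset(response_headers, now=None) -> float:
    """Return how long to pause if the rate limit is exhausted, else 0."""
    now = time.time() if now is None else now
    remaining = int(response_headers.get("X-RateLimit-Remaining", "1"))
    reset_at = int(response_headers.get("X-RateLimit-Reset", "0"))  # epoch seconds
    return max(0.0, reset_at - now) if remaining == 0 else 0.0
```

Each pair from `rotation(...)` can be passed as `headers=` and `proxies=` to `requests.get`, and `seconds_until_reset(resp.headers)` indicates whether that token should rest before its next turn (`resp.headers` in `requests` is case-insensitive, so the header names match regardless of case).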
- 1Gbps: suitable for small to medium collection tasks with dozens to hundreds of concurrent clones.
- 10Gbps: supports hundreds to thousands of concurrent clones, for building TB-scale datasets.
- 100Gbps: suitable for continuous full synchronization and enterprise-level code LLM training projects.
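To get a feel for these tiers, a back-of-the-envelope estimate divides the dataset size in bits by the link rate. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits/s) and an optional `utilization` factor, a hypothetical parameter for the share of the link actually achieved.

```python
def transfer_hours(dataset_bytes: float, link_gbps: float, utilization: float = 1.0) -> float:
    """Estimate the hours needed to move `dataset_bytes` over a link of `link_gbps`."""
    bits = dataset_bytes * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600


TB = 10 ** 12
for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps: {transfer_hours(10 * TB, gbps):.2f} h for 10 TB")
```

At full utilization, a 10 TB corpus takes about 22.2 hours at 1Gbps, about 2.2 hours at 10Gbps, and roughly 13 minutes at 100Gbps. Real throughput will be lower once protocol overhead and upstream limits are taken into account.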
- Use stable lines and high-bandwidth proxies to reduce disconnections caused by network jitter.
- For very large repositories, split the work into several incremental clones instead of one brute-force full pull.
- Add failure retry and resumable-transfer logic in the task scheduling layer.
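One possible reading of "incremental clones" is a shallow clone that is deepened in steps with `git fetch --deepen`, combined with a checkpoint file so an interrupted queue can pick up where it stopped. The sketch below takes that approach. The step size, checkpoint format, and function names are assumptions, and `git rev-parse --is-shallow-repository` requires a reasonably recent Git.

```python
import json
import subprocess
from pathlib import Path


def is_shallow(dest: str) -> bool:
    """True while the repository at `dest` still has truncated history."""
    out = subprocess.run(
        ["git", "-C", dest, "rev-parse", "--is-shallow-repository"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip() == "true"


def incremental_clone(repo_url: str, dest: str, step: int = 1000) -> None:
    """Shallow-clone once, then deepen `step` commits at a time until complete."""
    if not Path(dest, ".git").exists():
        subprocess.run(["git", "clone", "--depth", str(step), repo_url, dest], check=True)
    while is_shallow(dest):
        subprocess.run(["git", "-C", dest, "fetch", f"--deepen={step}"], check=True)


def load_done(checkpoint: str) -> set:
    """Read the set of finished repository URLs from the checkpoint file."""
    path = Path(checkpoint)
    return set(json.loads(path.read_text())) if path.exists() else set()


def run_queue(jobs, checkpoint: str, clone=incremental_clone) -> set:
    """jobs: iterable of (repo_url, dest). Skips repositories already recorded."""
    done = load_done(checkpoint)
    for repo_url, dest in jobs:
        if repo_url in done:
            continue
        clone(repo_url, dest)
        done.add(repo_url)
        # Persist after every repository so a crash loses at most one job.
        Path(checkpoint).write_text(json.dumps(sorted(done)))
    return done
```

Because `incremental_clone` reuses an existing directory, rerunning it after a failure continues deepening instead of starting over. Note that `git clone --depth` implies `--single-branch`, so only the default branch is fetched unless `--no-single-branch` is added.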
Open source code is released under its own licenses (MIT, Apache, GPL, etc.), and those licenses must be respected when the code is used for training or commercial purposes. 123Proxy only provides the network channel and has no part in how the data is used; please consult your legal team, based on your own business, to ensure that collection and use are legal and compliant.