
Use img2dataset to batch-download billions of images from URL lists and build visual or multimodal training corpora.
Batch-pull original and multi-size images from Flickr, Shutterstock, and community sites via gallery-dl or aria2.
Collect e-commerce product images, user-uploaded images, and similar content for training multimodal large models and recommendation systems.
Multi-region IP coverage (US/EU/JP/BR…) for building multi-language, multi-region, multi-scene image datasets.
Dedicated bandwidth from 1Gbps to 200Gbps+ (customizable)
Fixed pricing by bandwidth rather than by traffic, so costs are predictable
123Proxy provides a high-bandwidth proxy pool service built specifically for AI training data collection: fixed bandwidth billing (1Gbps–100Gbps+), unlimited total traffic, and unlimited concurrent requests.
Automatic monitoring of target-site bot detection, helping keep access to the target website from being blocked.
Highly cost-effective for large-scale data scraping.
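# Example: run img2dataset through the 123Proxy high-bandwidth proxy
# Replace USERNAME, PASSWORD, and the sessionId/time fields with your own values.
export http_proxy="http://USERNAME_sessionId_time:PASSWORD@gateway.123proxy.cn:31000"
export https_proxy="$http_proxy"
# Download every URL listed in urls.txt (one per line) into WebDataset shards.
img2dataset \
  --url_list urls.txt \
  --output_format webdataset \
  --input_format txt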
Assign a different sessionId to each machine or task. Appending _sessionId to the username binds each task to its own session, so different tasks use different exit IPs (automatic IP rotation).
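As a minimal sketch of per-task sessions, the helper below builds one proxy URL per task. It assumes the `USERNAME_sessionId_time` username layout from the example above; the function name `proxy_url_for_task` is hypothetical, and the meaning of the `time` field is not specified here, so it is passed through as a placeholder.

```shell
# Hypothetical helper: build a proxy URL whose username carries a per-task session ID.
# Assumes the USERNAME_sessionId_time layout; "time" is passed through unchanged.
proxy_url_for_task() {
  local user="$1" session_id="$2" session_time="$3" pass="$4"
  printf 'http://%s_%s_%s:%s@gateway.123proxy.cn:31000\n' \
    "$user" "$session_id" "$session_time" "$pass"
}

# Print a distinct proxy URL for each of three tasks.
# TIME is a placeholder, not a real value.
for task in 1 2 3; do
  echo "task ${task}: $(proxy_url_for_task USERNAME "task${task}" TIME PASSWORD)"
done
```

Each task would then export its own URL as `http_proxy` and `https_proxy` before starting its download.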
Solution:
- Enable the 123Proxy high-bandwidth dedicated proxy and let its large-scale IP pool automatically spread request pressure.
- Split tasks across different `sessionId` values to distribute concurrency over multiple proxy links.
Some sites return placeholder or blank images to IPs they consider abnormal. Recommended:
- Use 123Proxy automatic IP rotation so that requests do not keep triggering the same anti-crawling logic.
- Add an image integrity check in the post-processing stage to automatically remove corrupted files.
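One possible form of the integrity check is sketched below for JPEG files stored as individual files on disk (an assumption; WebDataset shards would need to be unpacked first). It only verifies the start and end markers, so it catches truncated downloads but not placeholder images, and it may reject valid JPEGs that carry trailing bytes. The function names are hypothetical.

```shell
# Return 0 if the file starts with the JPEG SOI marker (ff d8)
# and ends with the EOI marker (ff d9); return non-zero otherwise.
is_complete_jpeg() {
  local f="$1" first last
  [ -s "$f" ] || return 1   # reject missing or empty files
  first="$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')"
  last="$(tail -c 2 "$f" | od -An -tx1 | tr -d ' \n')"
  [ "$first" = "ffd8" ] && [ "$last" = "ffd9" ]
}

# Delete .jpg/.jpeg files under a directory that fail the check,
# printing each removed path.
remove_corrupted_jpegs() {
  local dir="$1" f
  find "$dir" -type f \( -name '*.jpg' -o -name '*.jpeg' \) -print0 |
    while IFS= read -r -d '' f; do
      if ! is_complete_jpeg "$f"; then
        echo "removing $f"
        rm -f -- "$f"
      fi
    done
}

# Usage (the directory name is illustrative):
# remove_corrupted_jpegs ./images
```

A decoder-based check would be stricter; the marker test is only a cheap first pass.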
- 1Gbps: suitable for datasets of tens of millions of images, with dozens of concurrent tasks running stably.
- 10Gbps: suitable for collection tasks of hundreds of millions of images, running in parallel on a multi-machine cluster.
- 100Gbps: suitable for long-term, continuous, massive image crawling shared across multiple teams.
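To relate a bandwidth tier to dataset size, a rough upper bound can be computed from the link speed and an average image size. The sketch below assumes a 100 KB average image (an illustrative figure, not from 123Proxy) and ignores protocol overhead, retries, and target-site rate limits, so real throughput will be lower.

```shell
# Estimate the theoretical maximum number of images per day
# for a link speed in Gbps and an average image size in KB.
images_per_day() {
  local gbps="$1" avg_image_kb="$2"
  local bytes_per_sec=$(( gbps * 1000000000 / 8 ))
  local images_per_sec=$(( bytes_per_sec / (avg_image_kb * 1000) ))
  echo $(( images_per_sec * 86400 ))
}

images_per_day 1 100    # prints 108000000
```

At 1Gbps and 100 KB per image, the ceiling is 1,250 images per second, or about 108 million per day if the link stays fully saturated.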
- Increase concurrency (process count and thread count) as appropriate, using more connections to fill the available bandwidth.
- Route metadata requests and image file downloads through different proxy exits so they do not interfere with each other.
- If the connection is still unstable, contact 123Proxy technical support to check the link and target site status.
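For img2dataset, concurrency is set with the `--processes_count` and `--thread_count` options. The wrapper below is a sketch only: the function name `download_images`, the output folder, and the example values are assumptions, and suitable numbers depend on CPU cores and observed bandwidth use.

```shell
# Hypothetical wrapper: run img2dataset with explicit concurrency settings.
# Usage: download_images <processes> <threads>
download_images() {
  local processes="$1" threads="$2"
  img2dataset \
    --url_list urls.txt \
    --input_format txt \
    --output_format webdataset \
    --output_folder images \
    --processes_count "$processes" \
    --thread_count "$threads"
}

# Example (illustrative values; raise them gradually while watching bandwidth use):
# download_images 16 128
```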
Using a proxy is itself legal, but collecting, storing, and using image data must comply with the target site's terms of use and with copyright law. Confirm the scope of authorization for your specific business scenario, and use data to train models only where you have legal authorization or are otherwise compliant.