English | 简体中文
You can use `paddle.distributed.launch` or `fleetrun` to start the training task. Below is an example of running the script.

```bash
fleetrun \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml \
    --eval >logs.txt 2>&1 &
```
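If you prefer `paddle.distributed.launch`, a minimal equivalent launch might look like the sketch below, assuming PaddlePaddle 2.x, where the launcher accepts the same `--gpus` argument as `fleetrun`.

```bash
# Minimal sketch: single-machine 8-GPU launch via paddle.distributed.launch
# (Paddle 2.x). Equivalent to the fleetrun command above.
python -m paddle.distributed.launch \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml \
    --eval >logs.txt 2>&1 &
```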
Compared with single-machine training, training on multiple machines only requires adding the `--ips` parameter, which lists the IPs of the machines participating in distributed training, separated by commas. Below is an example of the running code.

```bash
ip_list="10.127.6.17,10.127.5.142,10.127.45.13,10.127.44.151"
fleetrun \
    --ips=${ip_list} \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml \
    --eval >logs.txt 2>&1 &
```
Note:

- The IPs of the machines in `ip_list` are separated by commas; the IP of each machine can be viewed with `ifconfig` or `ipconfig`.
- The first device of the first machine in `ip_list` is trainer0, and so on.
- The starting port may differ across machines. Before launching the multi-machine task, set the same starting port on every machine with `export FLAGS_START_PORT=17000` (see the sketch after these notes); a port value in the range 10000~20000 is recommended.
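A minimal sketch of this per-machine preparation, assuming a Linux shell:

```bash
# Run on every machine before launching the multi-machine task.
ifconfig                       # look up this machine's IP to build ip_list
export FLAGS_START_PORT=17000  # same starting port everywhere; 10000~20000 recommended
```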
| Model | Dataset | Configuration | 8 GPU training time / Accuracy | 3x8 GPU training time / Accuracy | Acceleration ratio |
|---|---|---|---|---|---|
| PP-YOLOE-s | Objects365 | ppyoloe_crn_s_300e_coco.yml | 301h/- | 162h/17.7% | 1.85 |
| PP-YOLOE-l | Objects365 | ppyoloe_crn_l_300e_coco.yml | 401h/- | 178h/30.3% | 2.25 |
| Model | Dataset | Configuration | 8 GPU training time / Accuracy | 4x8 GPU training time / Accuracy | Acceleration ratio |
|---|---|---|---|---|---|
| PP-YOLOE-s | COCO | ppyoloe_crn_s_300e_coco.yml | 39h/42.7% | 13h/42.1% | 3.0 |
| PP-YOLOE-m | Objects365 | ppyoloe_crn_m_300e_coco.yml | 337h/- | 112h/24.6% | 3.0 |
| PP-YOLOE-x | Objects365 | ppyoloe_crn_x_300e_coco.yml | 464h/- | 125h/32.1% | 3.4 |
Note: for the PP-YOLOE series, the batch size per card is set to 8 and the learning rate is the same as that of a single machine.
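To make these settings explicit at launch time, PaddleDetection's `-o` option can override config fields on the command line. The following is a sketch only: the dotted keys `TrainReader.batch_size` and `LearningRate.base_lr` assume the standard PaddleDetection config layout, and the `base_lr` value shown is a placeholder for your single-machine value.

```bash
# Sketch: pin the per-card batch size to 8 and keep the single-machine
# learning rate when scaling out. base_lr=0.01 is a placeholder; use the
# value from your single-machine config.
fleetrun \
    --ips=${ip_list} \
    --gpus 0,1,2,3,4,5,6,7 \
    tools/train.py -c configs/ppyoloe/ppyoloe_crn_s_300e_coco.yml \
    -o TrainReader.batch_size=8 LearningRate.base_lr=0.01 \
    --eval >logs.txt 2>&1 &
```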