Notes on Performance

Aditya Atluri edited this page Apr 11, 2018 · 5 revisions

Introduction

These notes record experiments conducted to understand how to reach peak peer-to-peer (p2p) bandwidth. The machine under test has 5.26 GBps of uni-directional bandwidth and 10.128 GBps of bi-directional bandwidth.

Experiment 1

When a 16 kB buffer is transferred from one GPU to another, copying with dwordx4 moves gave 1.59 GBps (over 128 kernel launches with 1024 work items), whereas copying with dword moves gave 1.45 GBps.
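To put these numbers in context, the arithmetic below is a rough sketch of how many moves each work item issues and how long the 128 launches would take at each rate. It assumes that 16 kB means 16384 bytes, that each launch uses all 1024 work items and copies the whole buffer, and that the reported figure is total bytes divided by total time. None of these is stated on this page, and the helper names are made up for illustration.

```python
def moves_per_work_item(buffer_bytes, work_items, bytes_per_move):
    """Moves each work item issues if the buffer is split evenly."""
    return buffer_bytes // (work_items * bytes_per_move)


def transfer_seconds(buffer_bytes, launches, gbps):
    """Time to copy the buffer `launches` times at `gbps` (1 GB = 1e9 bytes, assumed)."""
    return buffer_bytes * launches / (gbps * 1e9)


BUFFER = 16 * 1024   # assumption: 16 kB = 16384 bytes
WORK_ITEMS = 1024    # assumption: work items per launch
LAUNCHES = 128

# A dwordx4 move is 16 bytes (four 32-bit dwords); a dword move is 4 bytes.
dwordx4_moves = moves_per_work_item(BUFFER, WORK_ITEMS, 16)
dword_moves = moves_per_work_item(BUFFER, WORK_ITEMS, 4)

t_dwordx4 = transfer_seconds(BUFFER, LAUNCHES, 1.59)
t_dword = transfer_seconds(BUFFER, LAUNCHES, 1.45)

print(dwordx4_moves, dword_moves)  # prints "1 4"
print(f"{t_dwordx4 * 1e3:.3f} ms vs {t_dword * 1e3:.3f} ms")
```

Under these assumptions a 16 kB buffer is exactly one dwordx4 move per work item, against four dword moves, which may be part of why the wider move comes out ahead at this size.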

Experiment 2

The smallest transfer size that reaches peak bandwidth with the copy kernel is 4 MB. The kernel should be launched with 1024 work items, each copying data with dwordx4 moves.
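With 1024 work items and a 4 MB buffer, each work item has to issue many dwordx4 moves. The sketch below models one plausible way to divide the work, a strided loop where work item i handles 16-byte chunks i, i + 1024, i + 2048, and so on. The strided layout, the 4 MB = 4 * 1024 * 1024 bytes reading, and the function name are assumptions for illustration; this page does not describe how the actual kernel indexes the buffer.

```python
def chunks_for_work_item(item, buffer_bytes, work_items=1024, bytes_per_move=16):
    """16-byte chunk indices one work item copies in a strided loop (assumed layout)."""
    total_chunks = buffer_bytes // bytes_per_move
    return range(item, total_chunks, work_items)


BUFFER_4MB = 4 * 1024 * 1024  # assumption: 4 MB = 4194304 bytes

# Each work item ends up with the same number of dwordx4 moves.
per_item = len(chunks_for_work_item(0, BUFFER_4MB))
print(per_item)  # prints "256"

# Sanity check: together the 1024 work items touch every chunk exactly once.
covered = sorted(c for i in range(1024) for c in chunks_for_work_item(i, BUFFER_4MB))
print(covered == list(range(BUFFER_4MB // 16)))  # prints "True"
```

In this layout neighbouring work items touch neighbouring 16-byte chunks on every iteration, which is the usual reason for preferring a strided loop over giving each work item one contiguous slice.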

Experiment 3

Writing to a peer GPU is 2x faster than reading from a peer GPU.
