Teledyne SP Devices ADQ7 User manual

ADQ7 GPU Peer-To-Peer
User Guide
Author(s): Teledyne SP Devices
Document ID: 19-2241
Classification: Public
Revision: PA1
Print date: 2019-05-02

Contents
1 Introduction
  1.1 Definitions and Abbreviations
2 Prerequisites
3 Working Principles
  3.1 Trigger and Data Alignment
  3.2 Double Buffering and Kernel Scheduling
  3.3 Backplane Peer-To-Peer Transfer
4 Setting up an Application
  4.1 Choosing Parameters
  4.2 B-scan and Data Valid
  4.3 Set up P2P GPU with SetupDMAP2p2D()
    4.3.1 Nvidia
    4.3.2 AMD
  4.4 Wait for a Completed Buffer
    4.4.1 Nvidia
    4.4.2 AMD
  4.5 Detect and Handle Overflows
  4.6 Process Received Data and Reset the Data Valid Buffer
  4.7 User Logic 2 Considerations
5 Example Code
  5.1 Signal Connections
  5.2 Nvidia Example
    5.2.1 Running the Example
    5.2.2 Adjusting Example Settings
    5.2.3 Known Bugs
  5.3 AMD Example
    5.3.1 Running the Example
    5.3.2 Adjusting Example Settings
ADQ7 GPU Peer-To-Peer – User Guide www.teledyne-spdevices.com Page 1 of 11

1 Introduction
This document describes how to perform peer-to-peer transfer from an ADQ7 digitizer to a GPU.
1.1 Definitions and Abbreviations
Table 1 lists the definitions and abbreviations used in this document and provides an explanation for each
entry.
Table 1: Definitions and abbreviations used in this document.
Term  Explanation
API   Application programming interface
DMA   Direct memory access
GPU   Graphics processing unit (graphics card)
OCT   Optical coherence tomography
P2P   Peer-to-peer
UL2   User logic 2: open FPGA area in the ADQ7 firmware
2 Prerequisites
Hardware
• ADQ7 digitizer
• Peer-to-peer capable GPU
  – Windows: AMD GPU with DirectGMA support
  – Linux: Nvidia GPU with GPUDirect support
• Host computer capable of P2P streaming with two PCIe Gen3 x8 slots
3 Working Principles
The peer-to-peer function for ADQ7 has been developed with special consideration for OCT applications
and the triggering scheme commonly used in these applications. Data is collected with the help of two
trigger signals: the A-scan trigger and the B-scan trigger.
3.1 Trigger and Data Alignment
A data collection is specified by a record length M and a line length N, where each line consists of N
records. The A-scan trigger starts collection of a record. Each record is written consecutively, directly to
GPU memory. In the normal case a line therefore takes up M × N samples of memory. When a line is
full, the next record continues at the start of the following line, immediately after the last record.
A B-scan trigger indicates the start of a new line. When a B-scan trigger is detected, the next record
is written at the start of the next line. If a B-scan trigger arrives before N records have been written,
one or several records are missing. In this case the line is marked as invalid, but the next line
is still properly aligned in memory, as if the previous line had been fully written.
Fig. 1 shows an example of the memory layout after a successful data collection. Each record is
represented by a dash and the corresponding A- and B-triggers are labeled. In this example the line length
N is four, meaning four A-triggers are followed by one B-trigger. At the right-hand side of the data
buffer, the data valid buffer is shown, containing the number one for each valid line.
Figure 1: Records in GPU memory with labels indicating triggers. The data valid buffer indicates that all
lines are valid.
Information about invalid lines is written as metadata to a separate part of the memory in the GPU.
Fig. 2 shows the GPU memory with several A-scans missing: the B-scan trigger has moved the following
record to the next line, and the data valid buffer indicates that the second line is not valid. The digitizer
writes a zero to the buffer only when an invalid line is encountered, so the buffer is expected to be filled
with ones initially.
The resulting data buffer in GPU memory will always contain consistent data aligned in records and
lines according to the trigger information received by the ADQ7, ready to be processed by the GPU.
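The alignment rule can be illustrated with a small host-side model. The sketch below is plain C, not firmware or example code; the names record_offset and model_alignment are made up for illustration, and it assumes that a B-trigger arriving before any record of a line has been written is ignored, which the text does not specify.

```c
#include <stddef.h>
#include <stdint.h>

/* Sample offset of record `rec` in line `line`, for record length M
   (samples) and N records per line, following the M x N line layout. */
static size_t record_offset(size_t line, size_t rec, size_t M, size_t N)
{
    return (line * N + rec) * M;
}

/* Illustrative model of the alignment rule. `triggers` is a string of
   'A' and 'B' events. valid[] must be pre-filled with ones; a zero is
   written for each line that a B-trigger cuts short. Returns the number
   of lines that received at least one record. */
static size_t model_alignment(const char *triggers, size_t N,
                              uint32_t *valid, size_t max_lines)
{
    size_t line = 0; /* current line */
    size_t rec = 0;  /* records written so far in the current line */

    for (const char *t = triggers; *t != '\0'; ++t) {
        if (*t == 'A') {
            if (rec == N) {          /* line full: continue on the next line */
                ++line;
                rec = 0;
            }
            if (line >= max_lines)   /* buffer exhausted */
                break;
            ++rec;                   /* record lands at (line, rec - 1) */
        } else if (*t == 'B' && rec > 0) {
            if (rec < N)
                valid[line] = 0;     /* records missing: mark line invalid */
            ++line;                  /* next record starts a new line */
            rec = 0;
        }
        /* Assumption: a 'B' with no record written in the line is ignored. */
    }
    return (rec > 0) ? line + 1 : line;
}
```

With N = 4, the sequence "AAAABABAAAAB" fills the first line, cuts the second line short after one record and fills the third, so only the second entry of valid[] is cleared.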
Figure 2: Records in GPU memory where several A-scans are missing, indicated in the data valid buffer.
3.2 Double Buffering and Kernel Scheduling
The example program employs a double buffering scheme when transferring data to the GPU. The digitizer
writes to one buffer and signals the host when the buffer is completely filled. At that signal the host may
schedule processing of the data in the first buffer. Simultaneously, the digitizer can start writing data to a
second buffer. When processing of the first buffer has completed, the second buffer can be passed
to the GPU and the process repeats.
This method ensures that there is always a buffer available for data transfer, avoiding any wait time.
Transfer rates above 7 GB/s have been measured using the attached example code.
3.3 Backplane Peer-To-Peer Transfer
Data is written from the digitizer directly to the GPU without going through the host CPU or host memory.
This significantly reduces the requirements on the host system while still utilizing the full transfer rate
capability of the PCIe backplane. The ADQ7 supports up to eight PCIe generation 3.0 lanes.
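As a rough sanity check of the measured transfer rates, the raw link bandwidth can be estimated from the PCIe 3.0 signalling rate of 8 GT/s per lane and its 128b/130b line encoding. The helper name below is illustrative only, and the result is an upper bound: packet headers and flow control reduce the achievable payload rate further.

```c
/* Raw PCIe 3.0 bandwidth in GB/s for a given lane count:
   8 GT/s per lane, 128b/130b encoding, 8 bits per byte. */
static double pcie_gen3_raw_gb_per_s(unsigned lanes)
{
    return lanes * 8.0 * (128.0 / 130.0) / 8.0;
}
```

For eight lanes this gives roughly 7.88 GB/s, so a measured rate slightly above 7 GB/s is close to what the link can carry.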
Figure 3: Peer-to-peer transfer from the digitizer to the GPU. (Block diagram: ADQ7, GPU, PCIe switch,
root complex and host CPU, with the P2P path from the ADQ7 to the GPU.)
Fig. 3 shows transfer from a digitizer to a GPU. Note that the transfer takes place via a PCIe switch.
While peer-to-peer transfer is sometimes supported directly through the root complex without involving
a switch, it is not a mandatory function of the PCIe standard. Therefore, make sure that there is a PCIe
switch between the two endpoints or that the root complex supports peer-to-peer transfer.
4 Setting up an Application
This section outlines the general steps for setting up P2P GPU streaming. Detailed information is given
where deemed necessary. It is recommended to study the example files main.c, which shows all the listed
steps, and gpu_streaming_defines.h, which contains macros for calculating buffer sizes, data access,
and settings. Information about ADQ functions is found in the ADQAPI reference guide [1].
• Choose parameters
• Decide if B-scan and data valid buffers should be used
• Initialize GPU driver
• Allocate and pin buffers in GPU
• Initialize ADQ
• Set up triggers
• Set up P2P GPU with SetupDMAP2p2D()
• Start streaming
• Wait for a completed buffer
• Detect and handle overflows
• Process received data and reset the data valid buffer
• Stop streaming
• User logic 2 considerations
4.1 Choosing Parameters
Streaming parameters should be chosen to satisfy at least the required conditions below. N is an integer
with 2 ≤ N ≤ 2^22 and Ch is the number of channels used: 1 or 2.
1. Record length = N × 32 / Ch, required
2. A-scans per B-scan ≥ 2, required
3. Samples per B-scan ≤ 2^31, required
4. 1 ≤ B-scans per buffer ≤ 2^12, required
5. Samples per buffer ≤ 2^33, required
6. N/2 × Records per buffer = Integer, recommended
Condition 6 is recommended for minimum latency at buffer completion. See gpu_streaming_defines.h
for macros to calculate correct buffer sizes from parameters. Please note that your GPU may impose
other restrictions on buffer size. The average transfer speed should also be considered:
Transfer speed [MB/s] = A-scans/s × record length × Ch × 10^-6
If the maximum throughput of your system is exceeded more than momentarily, the ADQ's buffer may
overflow and data will be lost. The throughput measured in our test setup, shown in Fig. 4, can be used as
a rough guideline for maximum throughput. GPU load may affect throughput, so it is recommended to
determine the maximum throughput of your system with data processing active.
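The conditions in Section 4.1 can be checked before buffers are allocated. The following is a minimal sketch, not the macros from gpu_streaming_defines.h: the struct and function names are hypothetical, record length is assumed to be counted per channel, and the sample limits are assumed to include all channels (the stricter reading). The transfer speed helper applies the formula exactly as printed in this document.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameter set; names are not taken from the example code. */
typedef struct {
    uint32_t record_length;     /* samples per record and channel (assumed) */
    uint32_t ascans_per_bscan;  /* records per line */
    uint32_t bscans_per_buffer; /* lines per buffer */
    unsigned channels;          /* Ch: 1 or 2 */
} stream_params_t;

/* Checks required conditions 1 to 5. Sample counts include all channels,
   which is an assumption. */
static bool params_required_ok(const stream_params_t *p)
{
    if (p->channels != 1 && p->channels != 2)
        return false;
    uint64_t per_record = (uint64_t)p->record_length * p->channels;
    if (per_record % 32 != 0)                       /* N must be an integer */
        return false;
    uint64_t n = per_record / 32;
    if (n < 2 || n > (1ull << 22))                  /* condition 1 */
        return false;
    if (p->ascans_per_bscan < 2)                    /* condition 2 */
        return false;
    uint64_t per_bscan = per_record * p->ascans_per_bscan;
    if (per_bscan > (1ull << 31))                   /* condition 3 */
        return false;
    if (p->bscans_per_buffer < 1 || p->bscans_per_buffer > (1u << 12))
        return false;                               /* condition 4 */
    return per_bscan * p->bscans_per_buffer <= (1ull << 33); /* condition 5 */
}

/* Recommended condition 6: N/2 x records per buffer is an integer,
   i.e. N or the number of records per buffer is even. */
static bool params_low_latency(const stream_params_t *p)
{
    uint64_t n = (uint64_t)p->record_length * p->channels / 32;
    uint64_t records = (uint64_t)p->ascans_per_bscan * p->bscans_per_buffer;
    return n % 2 == 0 || records % 2 == 0;
}

/* Average transfer speed in MB/s, using the formula as given in the text. */
static double transfer_speed_mb_per_s(double ascans_per_s,
                                      const stream_params_t *p)
{
    return ascans_per_s * p->record_length * p->channels * 1e-6;
}
```

For example, a record length of 1024 samples on two channels gives N = 64; with 512 A-scans per B-scan and 16 B-scans per buffer all required conditions hold, whereas a single-channel record length of 1000 fails condition 1 because it is not a multiple of 32.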
[Plot: throughput in MiB/s (axis from 5000 to 7200) versus record length in samples (32 to 8388608),
with one curve each for 1ch and 2ch.]
Figure 4: Maximum throughput in test setup for different record lengths.
4.2 B-scan and Data Valid
If no B-scan signal is connected, the system behaves as if a correct B-scan is present, and no writes to
the data valid buffer should occur. If no data valid buffer is specified, set the data valid size to 0 in the call
to SetupDMAP2p2D(). This blocks any unintentional B-scan trigger from initiating a write to an invalid
address.
4.3 Set up P2P GPU with SetupDMAP2p2D()
The input of marker addresses is vendor specific and described below. For general information, see the
ADQAPI reference guide [1].
4.3.1 Nvidia
Marker memory is allocated by the ADQAPI. The marker address fields can be left empty.
4.3.2 AMD
The DMA buffers allocated with OpenCL have one marker address each. Since the application only needs
two markers in total, use the marker addresses associated with the data buffers.
4.4 Wait for a Completed Buffer
Two markers are used to detect completed buffers. The marker associated with data and data valid
buffers 0 only takes on odd values, and the marker associated with data and data valid buffers 1
only takes on even values. The initial marker value is 1 and the maximum marker value is 2^32 − 1; the
succeeding marker value is 0.
4.4.1 Nvidia
Completed buffers are detected by the host. Call the function WaitforGPUMarker(), which returns as soon as
a buffer is completed (since the last function call) or when the specified timeout is reached.
WaitforGPUMarker() returns the latest marker value via the argument marker_list. The completed data and data
valid buffers can be determined as:
completed buffers = (returned marker + 1) mod 2
It is also recommended to compare the returned marker value with the expected marker value to detect missed
buffers, which typically means that streaming is faster than your application.
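The marker arithmetic can be sketched in a few lines of C. The helper names below are illustrative and not part of the ADQAPI; the sketch only assumes that the marker is handled as an unsigned 32-bit value, so that the wrap from 2^32 − 1 to 0 follows from ordinary unsigned arithmetic.

```c
#include <stdint.h>

/* Index (0 or 1) of the buffer pair completed for a marker value:
   (marker + 1) mod 2, so odd markers map to 0 and even markers to 1. */
static unsigned completed_buffer(uint32_t marker)
{
    return (uint32_t)(marker + 1u) % 2u;
}

/* Marker value expected for the next completed buffer; wraps to 0
   after 2^32 - 1. */
static uint32_t next_marker(uint32_t marker)
{
    return (uint32_t)(marker + 1u);
}

/* Number of buffers missed when `returned` arrives instead of
   `expected`. Zero means the application kept up with the stream. */
static uint32_t missed_buffers(uint32_t expected, uint32_t returned)
{
    return (uint32_t)(returned - expected);
}
```

A typical loop would keep an expected marker starting at 1, compare it with each returned marker, process the buffer pair given by completed_buffer(), and then set the expected marker to next_marker() of the returned value.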
4.4.2 AMD
Completed buffers are detected by the GPU. The function clEnqueueWaitSignalAMD() is used to make the
GPU wait until the marker associated with the specified buffer is equal to or greater than the specified
value. Operations added to the queue after a clEnqueueWaitSignalAMD() are not performed until
the clEnqueueWaitSignalAMD() operation has finished. When operations are added to the queue they
can be connected to a cl_event. Use the function clWaitForEvents() to make the host wait for one or several
events. It should be noted that on some systems clWaitForEvents() only returns when all
clEnqueueWaitSignalAMD() calls have finished.
4.5 Detect and Handle Overflows
It is recommended to check frequently for overflows with the ADQ function GetStreamOverflow(). No
detected overflow means that all data since the last successful GetStreamOverflow() call is correct. If an
overflow is detected, streaming has to be restarted by calling the ADQ functions StopStreaming(),
SetupDMAP2p2D() and StartStreaming(). Apart from loss of data, an overflow may cause an unexpected
change in throughput or a total halt of streaming.
4.6 Process Received Data and Reset the Data Valid Buffer
The samples are written to the buffers in int16_t format. In two-channel mode, samples from channels A
and B are interleaved. Macros in gpu_streaming_defines.h show how to access a given sample in a
buffer. After a set of buffers is completed, the user application has to reset the data valid buffers by
writing a 32-bit value of 1 to each position.
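Sample access and the data valid reset can be sketched as follows. This is plain C operating on ordinary arrays, for illustration only; in an application the same indexing would run in a GPU kernel on the pinned buffers. The function names are hypothetical, and the interleaving order A0, B0, A1, B1, ... is an assumption; the macros in gpu_streaming_defines.h are the authoritative definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Sample `index` of channel `channel` (0 = A, 1 = B) from a buffer with
   `channels` interleaved channels. Assumes the order A0 B0 A1 B1 ... */
static int16_t get_sample(const int16_t *buf, size_t index,
                          unsigned channel, unsigned channels)
{
    return buf[index * channels + channel];
}

/* Number of lines the digitizer marked as invalid (entry equal to 0). */
static size_t count_invalid_lines(const uint32_t *data_valid, size_t lines)
{
    size_t invalid = 0;
    for (size_t i = 0; i < lines; ++i)
        if (data_valid[i] == 0)
            ++invalid;
    return invalid;
}

/* Refill the data valid buffer with 32-bit ones before it is reused,
   since the digitizer only ever writes zeros to it. */
static void reset_data_valid(uint32_t *data_valid, size_t lines)
{
    for (size_t i = 0; i < lines; ++i)
        data_valid[i] = 1u;
}
```

The reset has to happen after the validity information has been read and before the digitizer starts filling the same buffer pair again.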
4.7 User Logic 2 Considerations
Metadata insertion and sample interleaving for two-channel mode are done in UL2. If UL2 is bypassed,
GPU streaming will not work, and writing to UL2 registers may alter functionality. If UL2 is modified with the
development kit, you have to make sure that metadata insertion still behaves correctly.
5 Example Code
The ADQ7 GPU P2P function is delivered with two examples, one for Nvidia GPUs under Linux and
another for AMD GPUs under Windows.¹
5.1 Signal Connections
The example code can collect data from one or two analog inputs using the A-trigger and the optional
B-trigger as described in Section 3.1. Table 2 shows how the different input signals are labeled on the
ADQ7 backplate.
Table 2: Signal connections for ADQ7 device
Signal            ADQ7 input
Analog channel A  A
Analog channel B  B (optional)
A-trigger         Trig
B-trigger         Sync (optional)
5.2 Nvidia Example
The example for Nvidia GPUs is written using GPUDirect with CUDA and OpenGL.
GPUDirect works by setting up bus-writable buffers in GPU memory. Additionally, marker buffers
are set up in host memory and used to synchronize writes to the GPU buffers with host program execution.
When a GPU buffer has been completely written, the digitizer writes an iterator to the associated
marker buffer in host memory and sends an interrupt. The host program waits for the marker write and
enqueues a CUDA kernel at that time. This sequence is repeated in an alternating pattern for two buffers.
The example uses a CUDA kernel which computes an FFT for each record in the buffer and draws
a point to an OpenGL buffer indicating the peak frequency of the FFT. Fig. 5 shows a screen capture of
the CUDA example.
¹ Nvidia and AMD are protected trademarks of their respective owners.
Figure 5: Screen capture from the CUDA example.
5.2.1 Running the Example
1. Make sure the CUDA toolkit, including examples, is installed
2. Make sure your user is in the ‘adq’ group: groups
3. Make sure the ADQ7 device driver is loaded: ls /dev/adq*
4. Go to ADQ7_GPUDirect_example/source/gdrdrv
5. Build kernel module for GPUDirect: make
6. Load kernel module: sudo ./insmod.sh
7. Go to directory ADQ7_GPUDirect_example/source
8. Build example: make
9. Run example: ./cuda_example
Once the example is running, press h for information about mouse and keyboard controls and i for
information.
5.2.2 Adjusting Example Settings
The file gpu_streaming_defines.h contains macros for settings and debug printouts. When two-channel
mode is active (STREAM_CHANNELS = 2), the channel used for generating graphics can be chosen by
replacing the GET_SAMPLE_CH_A macro in the function cast_and_validate_kernel located in OCT_func.cu.
Remember to rebuild the example (make) for the changes to take effect.
5.2.3 Known Bugs
1. Resizing the window causes a high number of rescaling events, leading to a temporary slowdown
of the GPU. This can be avoided by freezing the frame (f) before resizing.
2. Transfer speed is calculated from the buffer size. The size of samples skipped because of invalid
lines is not subtracted, so the figure is fully accurate only when all lines are valid.
5.3 AMD Example
The example for AMD GPUs is written using DirectGMA with OpenCL. DirectGMA works by setting up
bus-writable buffers with markers in GPU memory. Data is written into a buffer and when it is full an
iterator is written to the associated marker. A wait command is added to the OpenCL queue which blocks
the queue until a marker write is detected. At that time a kernel enqueued after the marker wait can run.
This sequence is repeated in an alternating pattern for two buffers. The enqueued operations copy the
content of the data and data_valid buffers to another set of buffers and reinitialize the data valid buffer.
5.3.1 Running the example
1. Make sure the ADQ7 device driver is installed
2. Make sure the AMD driver is installed and DirectGMA is activated
3. Make sure Visual Studio 2017 or newer is installed
4. Install OCL-SDK: OCL-SDK installer
5. Go to directory examples/amd/example
6. Open the Visual Studio project file and build Release x64
7. Run example: ADQ7_DirectGMA_example
8. For more details, see the included README
5.3.2 Adjusting Example Settings
The file gpu_streaming_defines.h contains macros for settings and debug printouts. Remember to
rebuild the example for the changes to take effect.
References
[1] Teledyne Signal Processing Devices Sweden AB, 14-1351 ADQAPI Reference Guide. Technical
Manual.

Worldwide Sales and Technical Support
www.teledyne-spdevices.com
Teledyne SP Devices Corporate Headquarters
Teknikringen 6
SE-583 30 Linköping
Sweden
Phone: +46 (0)13 645 0600
Fax: +46 (0)13 991 3044
Copyright © 2019 Teledyne Signal Processing Devices Sweden AB
All rights reserved, including those to reproduce this publication or parts thereof in any form without permission in writing from Teledyne SP Devices.