Graphcore IPU-POD64 reference design Manual

IPU-POD64 reference design | build and test guide
IPU-POD64 REFERENCE DESIGN
Build and test guide

Table of contents
Overview
  1.1 Acronyms and abbreviations
IPU-POD64 reference design components
  2.1 IPU-M2000
    Overview
    QR code label
  2.2 Server
  2.3 Switches
    100GbE RoCE/RDMA switch (ToR switch)
    1GbE management switch
  2.4 Power distribution units
  2.5 Rack
  2.6 Supplementary mounting components
  2.7 Cables
    RJ45 cables
    OSFP cables
    QSFP cables
Rack assembly
  3.1 Equipment checklist
  3.2 Preparing the rack
    Rail distance
    Unpackaging the rack
    Removing the side panels and doors
    Removing the vertical accessory channels
    Adjusting the rear accessory channels
    Adjusting the rear vertical rails
    Adjusting the front vertical rails
    Installing the rack rails
    Installing PDU brackets
  3.3 Installing the equipment
    Installing the IPU-M2000s
    Installing the management switch
    Installing the ToR switch
    Installing the PDUs
    Installing the Dell R6525 server(s)
  3.4 Wiring the rack
    IPU-M2000 to IPU-M2000 IPU-Link connectivity (OSFP)
    IPU-M2000 to IPU-M2000 Sync-Link cabling
    IPU-M2000 to management switch cabling (RJ45)
    Management switch – BMC wiring
      3.4.4.1. Management switch – BMC + GW SoC wiring
    IPU-M2000 to ToR switch cabling (QSFP)
    Dell R6525 server(s) wiring
    ToR switch to Dell server(s)
    Management switch to Dell server(s) – iDRAC
    Management switch to Dell server(s) – network connector
      3.4.9.1. Management switch to Dell server(s) – switch management
    Management switch to PDUs
  3.5 Power cabling
    IPU-M2000 power cabling
    Server power cabling – Dell R6525
    Switch power cabling
  3.6 Completing the rack
    Blanking panels
    Front and rear doors
    Side panels
IPU-POD64 server and switch configuration
  4.1 Server configuration
    Hardware recommendations
    Storage configuration recommendations
    Operating system recommendations
    User accounts and groups
    Ubuntu OS installation and packages
    CentOS OS installation and packages
    Python packages
  4.2 Network configuration
    Overview
    IPU-POD64 network interfaces
    Management switch configuration
    ToR switch configuration
    IPU-POD64 VLAN assignments
    Server network configuration
    Services: DHCP (Dynamic Host Configuration Protocol)
    Services: NTP (Network Time Protocol)
    Services: syslog
IPU-POD64 software installation and configuration
  5.1 Management server
  5.2 V-IPU software installation and configurations
  5.3 IPU-M2000 software installation and configuration
    Download IPU-M2000 software update bundle
    Software update of all IPU-M2000s
    IPU-M2000 GW’s root file system config files
  5.4 rack_tool
IPU-POD64 manual installation tests
  6.1 Running system BISTs
  6.2 Troubleshooting
    BMC BISTs
    V-IPU built in self tests
Automatic IPU-POD64 configuration
  7.1 Devices and preparation
  7.2 Scanning and test
    Description of each QR code step
    Scan troubleshooting
    Installation troubleshooting
    Example output for Sync-Link or Traffic test failure
    Other useful commands
Document revisions
  8.1 Revision history
Legal notices
  Warranties & licences

Overview
The IPU-POD64 reference design is a rack solution containing 16 IPU-M2000s, 1 to 4 host
servers (default 1 host server in reference configuration), network switches and IPU-POD
software. There are 64 Mk2 GC200 IPUs in total with four IPUs in each IPU-M2000.
For more information on IPU-POD systems available from Graphcore see
https://www.graphcore.ai/products.
This guide is for properly trained service personnel and technicians who are required to install
the IPU-POD64.
If you have any questions then please contact your Graphcore representative or use the
resources on the Graphcore support portal: https://www.graphcore.ai/support.
1.1 Acronyms and abbreviations
This is a short list that describes some of the most commonly used terms in this document.
BMC: Baseboard Management Controller – a standby power domain service processor that performs system hardware management
BOM: Bill of Materials
GW: Short for IPU-Gateway, a co-processor to the four IPUs in the IPU-M2000. It enables scaling with multiple IPU-M2000 units
IPU-Link: High speed communication links that interconnect IPUs within and between IPU-M2000 units. Special cables are required for IPU-Links between IPU-M2000 units
GW-Link: High speed communication link(s) that interconnect IPUs and IPU-GWs horizontally between IPU-M2000 units. Special cables are required for GW-Links between IPU-M2000 units
PDU: Power Distribution Unit
RDMA: Remote Direct Memory Access
RNIC: RDMA Network Interface Controller
RoCE: RDMA over Converged Ethernet
ToR: Top of Rack. Often used in combination with the ToR RDMA switch that is placed on top of the IPU-M2000 stacked units.
Warning: Only qualified personnel should install, service, or replace IPU-POD64 equipment.

IPU-POD64 reference design components
This section describes the components in the IPU-POD64. Each IPU-POD64 contains:
• 16 IPU-M2000s
• 1 server (default configuration is 1 host server, up to 4 can be supported)
• 1 1GbE management switch
• 1 100GbE ToR switch
• 2 power distribution units
• 1 rack
• Supplementary mounting components
• Cables
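The counts above can be captured in a small sanity check, for example when writing an inventory script for a build. This is only an illustrative sketch: the function and constant names are hypothetical and are not part of any Graphcore tool; the numbers (16 IPU-M2000s, 4 IPUs each, 1 to 4 servers) come from this guide.

```python
# Illustrative sketch only: names are hypothetical, values are from this guide.
IPU_M2000_COUNT = 16        # IPU-M2000 units per IPU-POD64
IPUS_PER_IPU_M2000 = 4      # Mk2 GC200 IPUs in each IPU-M2000
MIN_SERVERS, MAX_SERVERS = 1, 4


def total_ipus(units: int = IPU_M2000_COUNT,
               ipus_per_unit: int = IPUS_PER_IPU_M2000) -> int:
    """Return the total number of IPUs in the rack."""
    return units * ipus_per_unit


def validate_server_count(servers: int) -> int:
    """Return the server count if it is within the supported 1 to 4 range."""
    if not MIN_SERVERS <= servers <= MAX_SERVERS:
        raise ValueError(
            f"an IPU-POD64 supports {MIN_SERVERS} to {MAX_SERVERS} servers, "
            f"got {servers}"
        )
    return servers


print(total_ipus())              # 64
print(validate_server_count(1))  # 1 (default reference configuration)
```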
2.1 IPU-M2000
Overview
There are 16 IPU-M2000s in each IPU-POD64 making a total of 64 IPUs: 4 IPUs per IPU-M2000.
The IPU-M2000 front panel contains:
• 2 RNIC ports
• 8 IPU-Link ports
• 2 management GbE ports (BMC/GW SoC management ports)
• 2 GW-Link ports
• 8 IPU-Link ports
Front panel
The IPU-M2000 back panel contains:
• 2 power connectors per IPU-M2000
• Fan units
• Unit QR code
Back panel (the QR code label is marked in the figure)

QR code label
There is a QR code label on the back panel of each IPU-M2000. The QR code contains the
following information for each IPU-M2000:
• Company name (Graphcore)
• Serial number
• Part number
• BMC Ethernet MAC address
• GW Ethernet MAC address
• Graphcore support web URL (https://www.graphcore.ai/support)
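When label data is recorded during a build, it can help to model the six fields listed above and check them before use. The sketch below is an assumption-laden illustration: this guide does not specify how the fields are encoded inside the QR code, so the class only models the fields after they have been decoded. The class name, field names and example values are all hypothetical, and the check that the two MAC addresses differ is an assumption rather than a documented rule.

```python
import re
from dataclasses import dataclass
from typing import List

# Six colon- or hyphen-separated hexadecimal octets, e.g. 00:00:5E:00:53:01.
_MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}([:-][0-9A-Fa-f]{2}){5}$")


@dataclass(frozen=True)
class QrLabel:
    """Hypothetical record of the fields printed in an IPU-M2000 QR code."""
    company: str
    serial_number: str
    part_number: str
    bmc_mac: str
    gw_mac: str
    support_url: str

    def problems(self) -> List[str]:
        """Return a list of human-readable problems (empty if none found)."""
        issues = []
        for name in ("company", "serial_number", "part_number", "support_url"):
            if not getattr(self, name).strip():
                issues.append(f"{name} is empty")
        for name in ("bmc_mac", "gw_mac"):
            if not _MAC_RE.match(getattr(self, name)):
                issues.append(f"{name} is not a valid MAC address")
        # Assumption: the BMC and GW are separate interfaces with distinct MACs.
        if self.bmc_mac.lower() == self.gw_mac.lower():
            issues.append("BMC and GW MAC addresses are identical")
        return issues


# Made-up example values (MACs are from the documentation address range).
label = QrLabel(
    company="Graphcore",
    serial_number="EXAMPLE-SERIAL",
    part_number="EXAMPLE-PART",
    bmc_mac="00:00:5E:00:53:01",
    gw_mac="00:00:5E:00:53:02",
    support_url="https://www.graphcore.ai/support",
)
print(label.problems())
```

An empty list means that no problems were found in the recorded fields.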
2.2 Server
The default configuration of the IPU-POD64 uses a single PowerEdge R6525 server but up to
four servers can be connected. Contact Graphcore sales for details of other supported server
types. This document describes the default server (PowerEdge R6525) installation only – other
servers may have different installation requirements.
The default server configuration is described in section 4.1.
2.3 Switches
Each IPU-POD64 contains two switches serving different purposes.
100GbE RoCE/RDMA switch (ToR switch)
The 100GbE RoCE/RDMA switch (also referred to as the ToR switch) is used by the end user’s
machine learning (ML) jobs as a data-plane, connecting the host servers running the Poplar®
SDK with the IPUs running the ML model in the IPU-M2000s. The default ToR switch is an Arista
DCS-7060CX-32S-F. Contact Graphcore sales for details of other supported switch types. This
document describes the default switch (7060CX) installation only – other switches may have
different installation requirements.
1GbE management switch
The 1GbE management switch is used for connecting the management ports together inside
the rack. The default management switch is an Arista DCS-7010T-48-F. Contact Graphcore
sales for details of other supported switch types. This document describes the default switch
(7010T) installation only – other switches may have different installation requirements.
2.4 Power distribution units
Two power distribution units (PDUs) are installed in each IPU-POD64. The default unit is an
APC AP8886.

2.5 Rack
The IPU-M2000s, servers, switches, and PDUs are installed in an APC AR3300SP rack. This rack
has a packing system designed to safely transport and unload the rack.
It is important to follow the instructions carefully when packing or unpacking the rack.
2.6 Supplementary mounting components
The supplementary components listed below also need to be installed.
• Cable organizer
• Blanking panel
2.7 Cables
Each IPU-POD64 has three types of cabling:
• RJ45 cables
• OSFP cables
• QSFP cables
RJ45 cables
• Red: IPU-M2000 to IPU-M2000 within-rack Sync-Link connectivity
• Blue: Connecting IPU-M2000s to the management switch (BMC + GW management)
• Blue: Connecting servers to the management switch
• Yellow: Connecting IPU-M2000s to the management switch (BMC-only management)
OSFP cables
• IPU-M2000 to IPU-M2000 (IPU-Link) connectivity
QSFP cables
• IPU-M2000 to ToR switch connectivity
• Server to ToR switch connectivity
All cable connections are described in Section 3.4.

Rack assembly
Please note the correct orientation of the IPU-M2000, server and switch units in the rack to
ensure correct airflow.
The front interface of the IPU-M2000 units (connectivity ports) should face the
front door of the rack (cold aisle). The rear interface of the server and switches (power and
fans) should face the rear door of the rack (hot aisle).
Completed rack - cold aisle (four-server version)
Note that this photo shows a four-server version of the IPU-POD64. The default reference
design has one server which would be the server in the lowest position, closest to the
switches.

Completed rack – hot aisle (four-server version)
Note that this photo shows 3x blue RJ45 cables in each R6525 server. In the default build,
servers 2 to 4 have only 2x blue RJ45 cables. See later sections for more information about
server cabling.
Note also that this photo shows a four-server version of the IPU-POD64. The default reference
design has one server which would be the server in the lowest position, closest to the
switches.

3.1 Equipment checklist
| Description | Quantity (1 server) | Quantity (4 servers) |
| --- | --- | --- |
| Rack (AR3300SP) | 1 | 1 |
| Blanking panels (APC AR8136BLK) | 23 pieces for 42U reference rack (delivered in packs of 10) | 20 pieces for 42U reference rack (delivered in packs of 10) |
| AP8886 PDU | 2 | 2 |
| Hardware mounting kit (APC AR8100) | 1 | 1 |
| PDU bracket kit APC (AR7711) | 2 | 2 |
| Graphcore IPU-M2000 | 16 | 16 |
| IPU-M2000 slider kits | 16 | 16 |
| Dell R6525 server | 1 | 4 |
| Arista DCS-7010T-48-F switch | 1 | 1 |
| Arista DCS-7060CX-32S-F switch | 1 | 1 |
| 2m purple Ethernet | 2 | 2 |
| 1.5m blue Ethernet | 11 | 17 |
| 1m blue Ethernet | 9 | 9 |
| 1m yellow Ethernet | 12 | 12 |
| 1.5m yellow Ethernet | 4 | 4 |
| 1m red Ethernet | 2 | 2 |
| 0.15m red Ethernet | 30 | 30 |
| 1m QSFP28 | 8 | 8 |
| 1.5m QSFP28 | 9 | 12 |
| 0.3m OSFP | 60 | 60 |
| 1m OSFP | 4 | 4 |
| 0.5m red 10A C14 to C15 | 12 | 12 |
| 1m red 10A C14 to C15 | 4 | 4 |
| 0.5m blue 10A C14 to C15 | 12 | 12 |
| 1m blue 10A C14 to C15 | 4 | 4 |
| 1m red C13 to C14 | 2 | 5 |
| 1.5m red C13 to C14 | 1 | 1 |
| 1m blue C13 to C14 | 2 | 5 |
| 1.5m blue C13 to C14 | 1 | 1 |
| Velcro | 1 | 1 |
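The blanking panel quantities in the checklist can be cross-checked against the 42U rack height. The sketch below assumes that every IPU-M2000, server, switch and blanking panel occupies one rack unit (1U) and that the PDUs take no rack units; these are assumptions made for illustration, not statements from this guide. It also works out how many packs of 10 blanking panels each build needs. All names are hypothetical.

```python
import math

RACK_HEIGHT_U = 42          # "42U reference rack" from the checklist
BLANKING_PACK_SIZE = 10     # blanking panels are delivered in packs of 10

# Quantities from the checklist; each item is ASSUMED to occupy 1U.
BUILDS = {
    "1 server": {"IPU-M2000": 16, "server": 1, "switch": 2, "blanking panel": 23},
    "4 server": {"IPU-M2000": 16, "server": 4, "switch": 2, "blanking panel": 20},
}


def occupied_units(build: dict) -> int:
    """Total rack units used, under the 1U-per-item assumption."""
    return sum(build.values())


def blanking_packs(build: dict) -> int:
    """Number of packs of blanking panels needed for this build."""
    return math.ceil(build["blanking panel"] / BLANKING_PACK_SIZE)


for name, build in BUILDS.items():
    print(f"{name}: {occupied_units(build)}U of {RACK_HEIGHT_U}U, "
          f"{blanking_packs(build)} pack(s) of blanking panels")
```

Under these assumptions both builds fill the rack exactly (16 + 1 + 2 + 23 = 42 and 16 + 4 + 2 + 20 = 42), which explains why the four-server build needs three fewer blanking panels.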
3.2 Preparing the rack
Rail distance
The IPU-M2000 mounting system requires a rail-to-rail distance of 720mm. This document
describes the adjustments required for an AR3300SP rack. If using a different rack this rail
distance must be observed.
Unpackaging the rack
Follow the instructions to remove the outer packaging of the APC AR3300SP rack, ensuring
that you safely store these materials for later repackaging. Do not remove the rack from the
shock pallet.
Remove the white bag from the rack. This contains screws and cage nuts to be used in the
assembly of the components into the rack.

Removing the side panels and doors
Remove the front and rear doors from the rack.
Ensure the earth straps are disconnected before the doors are removed.
Remove the top and bottom side panels:

Removing the vertical accessory channels
Using a Torx TX30 screwdriver, remove two accessory channels from the rack.
Accessory channel removal
Adjusting the rear accessory channels
Set the rear accessory channel to the furthest position in the rack.
Tighten up the screws ensuring the teeth engage into the slots in the rail:

Adjusting the rear vertical rails
Using a Torx TX30 screwdriver, make both rear vertical rack rails loose and freely movable.
Position the rear vertical rack rails such that there is 20mm of distance between the rear face
of the vertical rack rail and the rack's rear frame. This should result in a square symbol being
visible through the alignment window at the top and bottom of the rail.
Secure the rail into position by moving the TX30 screws back upwards such that the teeth
engage with both the supporting rails. This must be done at the top and bottom of the
bracket.
Adjusting the front vertical rails
Using a Torx TX30 screwdriver, make both front vertical rack rails loose and freely movable.
Install the accessory channels in the front of the rack (one on the left hand side, one on the
right hand side) at the frontmost position possible, moving the TX30 screws back upwards
such that the teeth engage with both the supporting rails - this must be done at the top and
bottom of the bracket.
Note
To ensure the clips on the accessory channels align with the channel in the rack,
lift the accessory channels through the cut-out in the top of the rack and then
drop them down onto the channels.
Move the vertical rack rails tight against the vertical cable organisers such that only a single
diamond symbol is visible through the alignment window at the top and bottom of the rail:

Secure the rail into position by moving the TX30 screws back upwards such that the teeth
engage with both the supporting rails. This must be done at the top and bottom of the
bracket.
Installing the rack rails
IPU-M2000 rack rail kit – unboxed
The IPU-M2000 rail kit comprises two mated inner and outer rack rails and an accessory bag
containing screws. The inner rail affixes to the body of the IPU-M2000 and the outer rail
affixes to the vertical rack rails in the server cabinet.
Firstly, separate the mated inner and outer rails:
1) Fully extend the rails by pulling on the end which has the captive thumb screw attached:
2) Whilst pulling on the thumb screw end of the rails, push the white plastic release tab
towards the thumb screw end:
3) The inner and outer rails will now separate:

Mate the inner rails (the thinner of the two separated rails which has a captive thumb screw
at one end) to the body of the IPU-M2000. Please note that the inner rails are mirrored and
are not handed. As such, the procedure for inner rail fixing is the same for the left and right
hand inner rails.
The inner rail should be oriented such that the captive thumb screw end is at the end of the
IPU-M2000 containing the network receptacles.
To affix the inner rail to the body of the IPU-M2000:
1) Offer up the inner rail to the side of the IPU-M2000 and ensure that all fixing pins are
sitting within the enlarged opening of the retention channel:
2) Push the inner rail towards the end of the IPU-M2000 containing the network receptacles.
You should hear a click as the latching mechanism locks behind the head of a fixing pin:
3) Ensure all fixing pins are correctly engaged with their respective retention channel.
4) Locate the four flat head fixing screws from the rack rail accessory bag:

5) Using the above screws, affix the inner rail to the body of the IPU-M2000:
The inner rails are now securely affixed to the IPU-M2000 body.
Place the outer rails to one side for later use:
IPU-M2000 outer rails
Installing PDU brackets
Install four cage nuts on the outside of both accessory channels, at the top and bottom,
as shown below.
Top PDU bracket cage nuts
Bottom PDU bracket cage nuts

Screw the PDU support brackets to the inside of the cabinet.
The PDU brackets should be installed at the rear of the rack: one bracket at the top, 9cm
from the top of the rack, and one bracket at the bottom, 12cm from the bottom of the
rack. The figure below illustrates this. Follow the PDU bracket installation
instructions.
PDU support bracket

3.3 Installing the equipment
The following sections describe the installation of the IPU-M2000s, PDUs, servers, ToR and
management switches into the rack.
Installing the IPU-M2000s
Earlier in the guide we affixed the inner rack rails to the IPU-M2000 body. We now need to
install the outer rack rails into the rack.
It is possible to identify the front and rear of the outer rail by finding the large metal latching
mechanism – this is to be located at the rear of the rack. The outer rail is also embossed with
the text “FRONT” at the front end of the rail.
The large metal latch end of the outer rail is to be installed at the rear of the rack:
For each rack U in U1 through U16 (inclusive), perform the steps below with both the left
hand and right hand outer rack rails:
1) Pull on each end of the outer rail to adjust the rail length to suit your rack
2) Locate the front end of the outer rail and hold it behind the square holes in the vertical
rack rail for your installation U. Pull the outer rail towards the vertical rack rail and the
latching mechanism will click and hold the outer rail in place