Bull Escala M6-700 User manual


ESCALA Power7
High performance clustering
The ESCALA Power7 publications concern the following models:
- Bull Escala E5-700 (Power 750 / 8233-E8B)
- Bull Escala M6-700 (Power 770 / 9117-MMB)
- Bull Escala M6-705 (Power 770 / 9117-MMC)
- Bull Escala M7-700 (Power 780 / 9179-MHB)
- Bull Escala M7-705 (Power 780 / 9179-MHC)
- Bull Escala E1-700 (Power 710 / 8231-E2B)
- Bull Escala E1-705 (Power 710 / 8231-E1C)
- Bull Escala E2-700 / E2-700T (Power 720 / 8202-E4B)
- Bull Escala E2-705 / E2-705T (Power 720 / 8202-E4C)
- Bull Escala E3-700 (Power 730 / 8231-E2B)
- Bull Escala E3-705 (Power 730 / 8231-E2C)
- Bull Escala E4-700 / E4-700T (Power 740 / 8205-E6B)
- Bull Escala E4-705 (Power 740 / 8205-E6C)
References to Power 755 / 8236-E8C models are irrelevant.
Hardware
October 2011
BULL CEDOC
357 AVENUE PATTON
B.P.20845
49008 ANGERS CEDEX 01
FRANCE
REFERENCE
86 A1 93FF 03

The following copyright notice protects this book under Copyright laws which prohibit such actions as, but not limited to, copying,
distributing, modifying, and making derivative works.
Copyright Bull SAS 2011
Printed in France
Suggestions and criticisms concerning the form, content, and presentation of this book are
invited. A form is provided at the end of this book for this purpose.
To order additional copies of this book or other Bull Technical Publications, you are invited
to use the Ordering Form also provided at the end of this book.
Trademarks and Acknowledgements
We acknowledge the right of proprietors of trademarks mentioned in this book.
The information in this document is subject to change without notice. Bull will not be liable for errors contained herein, or
for incidental or consequential damages in connection with the use of this material.

Contents
Safety notices .................................ix
High-performance computing clusters using InfiniBand hardware...........1
Clustering systems by using InfiniBand hardware .......................2
Cluster information resources.............................2
Fabric communications ...............................6
IBM GX+ or GX++ host channel adapter ........................7
Logical switch naming convention .........................9
Host channel adapter statistics counter .......................10
Vendor and IBM switches ............................10
QLogic switches supported by IBM .........................10
Cables ...................................10
Subnet Manager ................................11
POWER Hypervisor ..............................12
Device drivers ................................12
IBM host stack ................................12
Management subsystem function overview ........................13
Management subsystem integration recommendations ...................13
Management subsystem high-level functions ......................14
Management subsystem overview ..........................15
xCAT..................................17
Fabric manager ...............................17
Hardware Management Console .........................18
Switch chassis viewer .............................19
Switch command-line interface ..........................19
Server Operating system ............................20
Network Time Protocol ............................20
Fast Fabric Toolset ..............................20
Flexible Service processor............................21
Fabric viewer................................21
Email notifications ..............................22
Management subsystem networks .........................22
Vendor log flow to xCAT event management ......................23
Supported components in an HPC cluster ........................24
Cluster planning..................................26
Cluster planning overview .............................27
Required level of support, firmware, and devices......................28
Server planning .................................29
Server types .................................29
Planning InfiniBand network cabling and configuration ...................30
Topology planning ...............................30
Example configurations using only 9125-F2A servers ..................33
Example configurations: 9125-F2A compute servers and 8203-E4A storage servers .........43
Configurations with IO router servers .......................47
Cable planning ................................48
Planning QLogic or IBM Machine Type InfiniBand switch configuration .............49
Planning maximum transfer unit (MTU).......................51
Planning for global identifier prefixes........................52
Planning an IBM GX HCA configuration .......................53
IP subnet addressing restriction with RSCT......................53
Management subsystem planning ...........................54
Planning your Systems Management application .....................55
Planning xCAT as your Systems Management application .................55
Planning for QLogic fabric management applications ...................56
Planning the fabric manager and fabric Viewer ....................56
© Copyright IBM Corp. 2011 iii

Planning Fast Fabric Toolset ...........................63
Planning for fabric management server ........................64
Planning event monitoring with QLogic and management server ...............66
Planning event monitoring with xCAT on the cluster management server ...........66
Planning to run remote commands with QLogic from the management server ...........67
Planning to run remote commands with QLogic from xCAT/MS ..............67
Frame planning .................................68
Planning installation flow .............................68
Key installation points..............................68
Installation responsibilities by organization .......................68
Installation responsibilities of units and devices .....................69
Order of installation ..............................70
Installation coordination worksheet .........................73
Planning for an HPC MPI configuration .........................74
Planning 12x HCA connections ............................75
Planning aids..................................75
Planning checklist ................................75
Planning worksheets ...............................76
Cluster summary worksheet ............................77
Frame and rack planning worksheet .........................79
Server planning worksheet ............................81
QLogic and IBM switch planning worksheets ......................83
Planning worksheet for 24-port switches.......................84
Planning worksheet for switches with more than 24 ports .................85
xCAT planning worksheets ............................89
QLogic fabric management worksheets ........................92
Installing a high-performance computing (HPC) cluster with an InfiniBand network ...........96
IBM Service representative installation responsibilities ....................97
Cluster expansion or partial installation .........................97
Site setup for power, cooling, and floor .........................98
Installing and configuring the management subsystem ....................98
Installing and configuring the management subsystem for a cluster expansion or addition ......101
Installing and configuring service VLAN devices ....................102
Installing the Hardware Management Console .....................102
Installing the xCAT management server .......................104
Installing operating system installation servers .....................105
Installing the fabric management server .......................105
Set up remote logging .............................112
Remote syslogging to an xCAT/MS ........................112
Using syslog on RedHat Linux-based xCAT/MS ...................120
Set up remote command processing .........................120
Setting up remote command processing from the xCAT/MS................120
Installing and configuring servers with management consoles ................122
Installing and configuring the cluster server hardware....................123
Server installation and configuration information for expansion ...............123
Server hardware installation and configuration procedure .................124
Installing the operating system and configuring the cluster servers ...............127
Installing the operating system and configuring the cluster servers information for expansion .....127
Installing the operating system and configuring the cluster servers ..............128
Installation sub procedure for AIX only.......................134
RedHat rpms required for InfiniBand .......................135
Installing and configuring vendor or IBM InfiniBand switches .................137
Installing and configuring InfiniBand switches when adding or expanding an existing cluster .....137
Installing and configuring the InfiniBand switch.....................138
Attaching cables to the InfiniBand network .......................143
Cabling the InfiniBand network information for expansion .................144
InfiniBand network cabling procedure ........................144
Verifying the InfiniBand network topology and operation ..................145
Installing or replacing an InfiniBand GX host channel adapter .................147
Deferring replacement of a failing host channel adapter ..................149
Verifying the installed InfiniBand network (fabric) in AIX ..................150
iv Power Systems: High performance clustering

Fabric verification ................................150
Fabric verification responsibilities .........................150
Reference documentation for fabric verification procedures .................150
Fabric verification tasks .............................150
Fabric verification procedure ...........................151
Runtime errors .................................151
Cluster Fabric Management .............................152
Cluster fabric management flow ...........................152
Cluster Fabric Management components and their use ...................152
xCAT Systems Management ...........................152
QLogic subnet manager .............................153
QLogic fast fabric toolset ............................154
QLogic performance manager ...........................155
Managing the fabric management server .......................155
Cluster fabric management tasks ...........................155
Monitoring the fabric for problems ..........................156
Monitoring fabric logs from the xCAT Cluster Management server ..............156
Health checking ...............................157
Setting up periodic fabric health checking ......................158
Output files for health check ..........................164
Interpreting health check .changes files .......................167
Interpreting health check .diff files ........................172
Querying status ...............................174
Remotely accessing QLogic management tools and commands from xCAT/MS ...........174
Remotely accessing the Fabric Management Server from xCAT/MS ..............175
Remotely accessing QLogic switches from the xCAT/MS ..................175
Updating code .................................176
Updating Fabric Manager code ..........................176
Updating switch chassis code ...........................179
Finding and interpreting configuration changes ......................180
Hints on using iba_report .............................180
Cluster service ..................................183
Service responsibilities ..............................183
Fault reporting mechanisms ............................183
Fault diagnosis approach .............................185
Types of events................................185
Isolating link problems .............................186
Restarting or repowering on scenarios ........................187
The importance of NTP .............................187
Table of symptoms ...............................187
Service procedures ...............................191
Capturing data for fabric diagnosis ..........................193
Using script command to capture switch CLI output ...................196
Capture data for Fabric Manager and Fast Fabric problems ..................196
Mapping fabric devices ..............................197
General mapping of IBM HCA GUIDs to physical HCAs ..................197
Finding devices based on a known logical switch ....................199
Finding devices based on a known logical HCA.....................201
Finding devices based on a known physical switch port ..................203
Finding devices based on a known ib interface (ibX/ehcaX) .................205
IBM GX HCA Physical port mapping based on device number .................207
Interpreting switch vendor log formats .........................207
Log severities ................................207
Switch chassis management log format ........................208
Subnet Manager log format............................209
Diagnosing link errors ..............................210
Diagnosing and repairing switch component problems ...................213
Diagnosing and repairing IBM system problems......................213
Diagnosing configuration changes ..........................213
Checking for hardware problems affecting the fabric ....................214
Checking for fabric configuration and functional problems ..................214

Checking InfiniBand configuration in AIX ........................215
Checking system configuration in AIX .........................217
Verifying the availability of processor resources .....................217
Verifying the availability of memory resources .....................217
Checking InfiniBand configuration in Linux .......................218
Checking system configuration in Linux ........................220
Verifying the availability of processor resources .....................220
Verifying the availability of memory resources .....................220
Checking multicast groups .............................221
Diagnosing swapped HCA ports ...........................221
Diagnosing swapped switch ports ..........................222
Diagnosing events reported by the operating system ....................223
Diagnosing performance problems ..........................224
Diagnosing and recovering ping problems........................225
Diagnosing application crashes ...........................226
Diagnosing management subsystem problems ......................226
Problem with event management or remote syslogging ..................226
Event not in xCAT/MS:/tmp/systemEvents .....................227
Event not in xCAT/MS: /var/log/xcat/syslog.fabric.notices................228
Event not in xCAT/MS: /var/log/xcat/syslog.fabric.info.................230
Event not in log on fabric management server ....................231
Event not in switch log ............................232
Reconfiguring xCAT event management .......................232
Reconfiguring xCAT on the AIX operating system ...................232
Reconfiguring xCAT on the Linux operating system ..................233
Recovering from an HCA preventing a logical partition from activating ..............235
Recovering ibX interfaces .............................235
Recovering a single ibX interface in AIX .......................235
Recovering all of the ibX interfaces in an LPAR in the AIX .................236
Recovering an ibX interface tcp_sendspace and tcp_recvspace ................237
Recovering ml0 in AIX .............................237
Recovering icm in AIX .............................237
Recovering ehcaX interfaces in Linux .........................237
Recovering a single ibX interface in Linux .......................237
Recovering all of the ibX interfaces in an LPAR in the Linux ................238
Recovering to 4K maximum transfer units in the AIX ....................238
Recovering to 4K maximum transfer units in the Linux ...................241
Recovering the original master SM ..........................243
Re-establishing Health Check baseline .........................244
Verifying link FRU replacements ...........................244
Verifying repairs and configuration changes .......................245
Restarting the cluster ...............................246
Restarting or powering off an IBM system........................247
Counting devices ................................248
Counting switches...............................248
Counting logical switches ............................249
Counting host channel adapters ..........................249
Counting end ports ..............................249
Counting ports ................................249
Counting Subnet Managers............................250
Counting devices example ............................250
Handling emergency power off situations ........................251
Monitoring and checking for fabric problems.......................252
Retraining 9125-F2A links .............................252
How to retrain 9125-F2A links...........................252
When to retrain 9125-F2A links ..........................254
Error counters ..................................254
Interpreting error counters .............................255
Interpreting link Integrity errors ..........................256
Interpreting remote errors ............................260
Example PortXmitDiscard analyses ........................261

Example PortRcvRemotePhysicalErrors analyses....................262
Interpreting security errors ............................264
Diagnose a link problem based on error counters .....................264
Error counter details ...............................265
Categorizing Error Counters ...........................265
Link Integrity Errors ..............................266
LinkDownedCounter .............................266
LinkErrorRecoveryCounter ...........................266
LocalLinkIntegrityErrors............................267
ExcessiveBufferOverrunErrors ..........................267
PortRcvErrors ...............................268
SymbolErrorCounter .............................269
Remote Link Errors (including congestion and link integrity) ................271
PortRcvRemotePhysicalErrors ..........................271
PortXmitDiscards ..............................271
Security errors ................................273
PortXmitConstraintErrors ...........................273
PortRcvConstraintErrors............................273
Other error counters ..............................273
VL15Dropped ...............................273
PortRcvSwitchRelayErrors ...........................274
Clearing error counters ..............................274
Example health check scripts .............................275
Configuration script ...............................276
Error counter clearing script ............................276
Healthcheck control script .............................277
Cron setup on the Fabric MS ............................279
Improved healthcheck ..............................279
Notices ...................................283
Trademarks ...................................284
Electronic emission notices ..............................285
Class A Notices.................................285
Terms and conditions................................288

Safety notices
Safety notices may be printed throughout this guide:
- DANGER notices call attention to a situation that is potentially lethal or extremely hazardous to
people.
- CAUTION notices call attention to a situation that is potentially hazardous to people because of some
existing condition.
- Attention notices call attention to the possibility of damage to a program, device, system, or data.
World Trade safety information
Several countries require the safety information contained in product publications to be presented in their
national languages. If this requirement applies to your country, a safety information booklet is included
in the publications package shipped with the product. The booklet contains the safety information in
your national language with references to the U.S. English source. Before using a U.S. English publication
to install, operate, or service this product, you must first become familiar with the related safety
information in the booklet. You should also refer to the booklet any time you do not clearly understand
any safety information in the U.S. English publications.
German safety information
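Das Produkt ist nicht für den Einsatz an Bildschirmarbeitsplätzen im Sinne § 2 der
Bildschirmarbeitsverordnung geeignet.
(English translation: This product is not suitable for use at visual display workstations within the
meaning of § 2 of the Bildschirmarbeitsverordnung, the German ordinance on work with visual display units.)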
Laser safety information
IBM® servers can use I/O cards or features that are fiber-optic based and that utilize lasers or LEDs.
Laser compliance
IBM servers may be installed inside or outside of an IT equipment rack.

DANGER
When working on or around the system, observe the following precautions:
Electrical voltage and current from power, telephone, and communication cables are hazardous. To
avoid a shock hazard:
- Connect power to this unit only with the IBM provided power cord. Do not use the IBM
provided power cord for any other product.
- Do not open or service any power supply assembly.
- Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration
of this product during an electrical storm.
- The product might be equipped with multiple power cords. To remove all hazardous voltages,
disconnect all power cords.
- Connect all power cords to a properly wired and grounded electrical outlet. Ensure that the outlet
supplies proper voltage and phase rotation according to the system rating plate.
- Connect any equipment that will be attached to this product to properly wired outlets.
- When possible, use one hand only to connect or disconnect signal cables.
- Never turn on any equipment when there is evidence of fire, water, or structural damage.
- Disconnect the attached power cords, telecommunications systems, networks, and modems before
you open the device covers, unless instructed otherwise in the installation and configuration
procedures.
- Connect and disconnect cables as described in the following procedures when installing, moving,
or opening covers on this product or attached devices.
To Disconnect:
1. Turn off everything (unless instructed otherwise).
2. Remove the power cords from the outlets.
3. Remove the signal cables from the connectors.
4. Remove all cables from the devices.
To Connect:
1. Turn off everything (unless instructed otherwise).
2. Attach all cables to the devices.
3. Attach the signal cables to the connectors.
4. Attach the power cords to the outlets.
5. Turn on the devices.
(D005)
DANGER

Observe the following precautions when working on or around your IT rack system:
- Heavy equipment: personal injury or equipment damage might result if mishandled.
- Always lower the leveling pads on the rack cabinet.
- Always install stabilizer brackets on the rack cabinet.
- To avoid hazardous conditions due to uneven mechanical loading, always install the heaviest
devices in the bottom of the rack cabinet. Always install servers and optional devices starting
from the bottom of the rack cabinet.
- Rack-mounted devices are not to be used as shelves or work spaces. Do not place objects on top
of rack-mounted devices.
- Each rack cabinet might have more than one power cord. Be sure to disconnect all power cords in
the rack cabinet when directed to disconnect power during servicing.
- Connect all devices installed in a rack cabinet to power devices installed in the same rack
cabinet. Do not plug a power cord from a device installed in one rack cabinet into a power
device installed in a different rack cabinet.
- An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of
the system or the devices that attach to the system. It is the responsibility of the customer to
ensure that the outlet is correctly wired and grounded to prevent an electrical shock.
CAUTION
- Do not install a unit in a rack where the internal rack ambient temperatures will exceed the
manufacturer's recommended ambient temperature for all your rack-mounted devices.
- Do not install a unit in a rack where the air flow is compromised. Ensure that air flow is not
blocked or reduced on any side, front, or back of a unit used for air flow through the unit.
- Consideration should be given to the connection of the equipment to the supply circuit so that
overloading of the circuits does not compromise the supply wiring or overcurrent protection. To
provide the correct power connection to a rack, refer to the rating labels located on the
equipment in the rack to determine the total power requirement of the supply circuit.
- (For sliding drawers.) Do not pull out or install any drawer or feature if the rack stabilizer brackets
are not attached to the rack. Do not pull out more than one drawer at a time. The rack might
become unstable if you pull out more than one drawer at a time.
- (For fixed drawers.) This drawer is a fixed drawer and must not be moved for servicing unless
specified by the manufacturer. Attempting to move the drawer partially or completely out of the
rack might cause the rack to become unstable or cause the drawer to fall out of the rack.
(R001)

CAUTION:
Removing components from the upper positions in the rack cabinet improves rack stability during
relocation. Follow these general guidelines whenever you relocate a populated rack cabinet within a
room or building:
- Reduce the weight of the rack cabinet by removing equipment starting at the top of the rack
cabinet. When possible, restore the rack cabinet to the configuration of the rack cabinet as you
received it. If this configuration is not known, you must observe the following precautions:
  - Remove all devices in the 32U position and above.
  - Ensure that the heaviest devices are installed in the bottom of the rack cabinet.
  - Ensure that there are no empty U-levels between devices installed in the rack cabinet below the
32U level.
- If the rack cabinet you are relocating is part of a suite of rack cabinets, detach the rack cabinet from
the suite.
- Inspect the route that you plan to take to eliminate potential hazards.
- Verify that the route that you choose can support the weight of the loaded rack cabinet. Refer to the
documentation that comes with your rack cabinet for the weight of a loaded rack cabinet.
- Verify that all door openings are at least 760 x 2030 mm (30 x 80 in.).
- Ensure that all devices, shelves, drawers, doors, and cables are secure.
- Ensure that the four leveling pads are raised to their highest position.
- Ensure that there is no stabilizer bracket installed on the rack cabinet during movement.
- Do not use a ramp inclined at more than 10 degrees.
- When the rack cabinet is in the new location, complete the following steps:
  - Lower the four leveling pads.
  - Install stabilizer brackets on the rack cabinet.
  - If you removed any devices from the rack cabinet, repopulate the rack cabinet from the lowest
position to the highest position.
- If a long-distance relocation is required, restore the rack cabinet to the configuration of the rack
cabinet as you received it. Pack the rack cabinet in the original packaging material, or equivalent.
Also lower the leveling pads to raise the casters off the pallet and bolt the rack cabinet to the
pallet.
(R002)
(L001)
(L002)
(L003)
All lasers are certified in the U.S. to conform to the requirements of DHHS 21 CFR Subchapter J for class
1 laser products. Outside the U.S., they are certified to be in compliance with IEC 60825 as a class 1 laser
product. Consult the label on each part for laser certification numbers and approval information.
CAUTION:
This product might contain one or more of the following devices: CD-ROM drive, DVD-ROM drive,
DVD-RAM drive, or laser module, which are Class 1 laser products. Note the following information:
- Do not remove the covers. Removing the covers of the laser product could result in exposure to
hazardous laser radiation. There are no serviceable parts inside the device.
- Use of the controls or adjustments or performance of procedures other than those specified herein
might result in hazardous radiation exposure.
(C026)

CAUTION:
Data processing environments can contain equipment transmitting on system links with laser modules
that operate at greater than Class 1 power levels. For this reason, never look into the end of an optical
fiber cable or open receptacle. (C027)
CAUTION:
This product contains a Class 1M laser. Do not view directly with optical instruments. (C028)
CAUTION:
Some laser products contain an embedded Class 3A or Class 3B laser diode. Note the following
information: laser radiation when open. Do not stare into the beam, do not view directly with optical
instruments, and avoid direct exposure to the beam. (C030)
CAUTION:
The battery contains lithium. To avoid possible explosion, do not burn or charge the battery.
Do Not:
- Throw or immerse into water
- Heat to more than 100°C (212°F)
- Repair or disassemble
Exchange only with the IBM-approved part. Recycle or discard the battery as instructed by local
regulations. In the United States, IBM has a process for the collection of this battery. For information,
call 1-800-426-4333. Have the IBM part number for the battery unit available when you call. (C003)
Power and cabling information for NEBS (Network Equipment-Building System)
GR-1089-CORE
The following comments apply to the IBM servers that have been designated as conforming to NEBS
(Network Equipment-Building System) GR-1089-CORE:
The equipment is suitable for installation in the following:
- Network telecommunications facilities
- Locations where the NEC (National Electrical Code) applies
The intrabuilding ports of this equipment are suitable for connection to intrabuilding or unexposed
wiring or cabling only. The intrabuilding ports of this equipment must not be metallically connected to the
interfaces that connect to the OSP (outside plant) or its wiring. These interfaces are designed for use as
intrabuilding interfaces only (Type 2 or Type 4 ports as described in GR-1089-CORE) and require isolation
from the exposed OSP cabling. The addition of primary protectors is not sufficient protection to connect
these interfaces metallically to OSP wiring.
Note: All Ethernet cables must be shielded and grounded at both ends.
The ac-powered system does not require the use of an external surge protection device (SPD).
The dc-powered system employs an isolated DC return (DC-I) design. The DC battery return terminal
shall not be connected to the chassis or frame ground.

High-performance computing clusters using InfiniBand hardware
You can use this information to guide you through the process of planning, installing, managing, and
servicing high-performance computing (HPC) clusters that use InfiniBand hardware.
This information serves as a navigation aid through the publications, produced by IBM or other
vendors, that are required to install the hardware units, firmware, operating system, software, or
applications. It provides configuration settings and an order of installation, and it acts as a
launch point for typical service and management procedures. In some cases, this information provides
detailed procedures instead of referencing procedures that are so generic that their use within the context
of a cluster is not readily apparent.
This information is not intended to replace the existing or vendor-supplied publications for the various
hardware units, firmware, operating systems, software, or applications produced by IBM or other
vendors. These publications are referenced throughout this information.
The following table provides a high-level view of the cluster implementation process. This information is
required to effectively plan, install, manage, and service your HPC clusters that use InfiniBand hardware.
Table 1. High-level view of the cluster implementation process and associated information
Content | Description
“Clustering systems by using InfiniBand hardware” on page 2 | Provides references to information resources, an overview of cluster components, and the supported component levels.
“Cluster information resources” on page 2 | Provides a list of the various information resources for the key components of the cluster fabric and where they can be obtained. These information resources are used extensively during your cluster implementation, so it is important to collect the required documents early in the process.
“Fabric communications” on page 6 | Provides a description of the fabric data flow.
“Management subsystem function overview” on page 13 | Provides a description of the management subsystem.
“Supported components in an HPC cluster” on page 24 | Provides a list of the supported components and pertinent features, and the minimum shipment levels for software and firmware.
“Cluster planning” on page 26 | Provides information about planning for the cluster and the fabric.
“Cluster planning overview” on page 27 | Provides navigation through the planning process.
“Required level of support, firmware, and devices” on page 28 | Provides the minimum ship level for firmware and devices and provides a website to obtain the latest information.
“Server planning” on page 29, “Planning InfiniBand network cabling and configuration” on page 30, and “Management subsystem planning” on page 54 | Provides the planning requirements for the main subsystems.
“Planning installation flow” on page 68 | Provides guidance in how the various tasks relate to each other and who is responsible for the various planning tasks for the cluster. This information also illustrates how certain tasks are prerequisites to other tasks. This topic assists you in coordinating the activities of the installation team.
“Planning worksheets” on page 76 | Provides planning worksheets that are used to plan the important aspects of the cluster fabric. If you are using your own worksheets, they must cover the items provided in these worksheets.
Other planning |
“Installing a high-performance computing (HPC) cluster with an InfiniBand network” on page 96 | Provides procedures for installing the cluster.
“Cluster Fabric Management” on page 152 | Provides tasks for managing the fabric.
“Cluster service” on page 183 | Provides high-level service tasks. This topic is intended to be a launch point for servicing the cluster fabric components.
Planning installation worksheets | Provides blank copies of the planning worksheets for easy printing.
© Copyright IBM Corp. 2011
Clustering systems by using InfiniBand hardware
This information provides planning and installation details to help guide you through the process of
installing a cluster fabric that incorporates InfiniBand switches.
IBM server hardware supports clustering through InfiniBand host channel adapters (HCAs) and switches.
This information also describes how to manage and service a cluster that uses InfiniBand hardware.
The following figure shows servers that are connected in a cluster configuration with InfiniBand switch
networks (fabric). The servers are connected to this network by using IBM GX HCAs. In System p® Blade
servers, the HCAs are based on PCI Express (PCIe).
Notes:
1. Switch refers to the InfiniBand technology switch unless otherwise noted.
2. Not all configurations support the following network configuration. See the IBM sales information for
supported configurations.
Figure 1. InfiniBand network with four switches and four servers connected
Cluster information resources
The following tables indicate important documentation for the cluster, where to get it, and when to use it
relative to the Planning, Installation, and Management and Service phases of a cluster's life.
The tables are arranged into categories of components:
- “General cluster information resources” on page 3
- “Cluster hardware information resources” on page 3
- “Cluster management software information resources” on page 4
- “Cluster software and firmware information resources” on page 5
General cluster information resources
The following table lists general cluster information resources:
Table 2. General cluster resources
Component | Document | Plan | Install | Manage and service
IBM Cluster Information | This document | x | x | x
IBM Clusters with the InfiniBand Switch website | IBM Clusters with the InfiniBand Switch readme file http://www14.software.ibm.com/webapp/set2/sas/f/networkmanager/home.html Note: This site lists exceptions that differ from the IBM and vendor documentation. | | |
QLogic | QLogic InfiniBand Switches and Management Software for IBM System p Clusters website http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/Product_detail.aspx?oemid=389 | x | x | x
InfiniBand Architecture | InfiniBand architecture documents and standard specifications are available from the InfiniBand Trade Association: http://www.infinibandta.org/home | | |
HPC Central wiki and HPC Central forum | The HPC Central wiki enables collaboration between customers and IBM teams. This wiki includes questions and comments. http://www.ibm.com/developerworks/wikis/display/hpccentral/HPC+Central | x | x | x
Note: QLogic uses the Silverstorm name in its product documentation.
Cluster hardware information resources
The following table lists cluster hardware resources:
Table 3. Cluster hardware information resources
Component Document Plan Install Manage and
service
Site planning for all
IBM systems
System i® and System p Site Preparation and Physical
Planning Guides
x
POWER6® systems
9125-F2A
8204-E8A
8203-E4A
9119-FHA
9117-MMA
8236-E8C
Site and Hardware Planning Guide x
Installation Guide for [MachineType and Model] x
Servicing the IBM System p [MachineType and
Model]
x
PCI Adapter Placement x x
Worldwide Customized Installation Instructions
(WCII) IBM service representative installation
instructions for IBM machines and features
http://w3.rchland.ibm.com/projects/WCII.
x
Logical partitioning
for all systems
Logical Partitioning Guide x
Install Instructions for IBM LPAR on System i and
System p
x
BladeCenter® JS22
and JS23
Planning, Installation, and Service Guide x x x
IBM GX HCA
Custom Installation
Custom Installation Instructions, one for each
HCA feature (http://w3.rchland.ibm.com/projects/WCII)
x x x
BladeCenter JS22 and
JS23 HCA
Users guide for 1350 x x x
Pass-through module 1350 documentation x x x
Fabric management
server
IBM System x® 3550 and 3650 documentation
Management node
HCA
HCA vendor documentation x x x
QLogic switches [Switch model] Users Guide x x x
[Switch model] Quick Setup Guide x x x
[Switch Model] Quick Setup Guide x x
QLogic InfiniBand Cluster Planning Guide x x
QLogic 9000 CLI Reference Guide x x
IBM Power Systems™ documentation is available in the IBM Power Systems Hardware Information
Center.
Any exceptions to the location of information resources for cluster hardware as stated above have been
noted in the table. Any future changes to the location of information that occur before a new release of
this document will be noted on the IBM Clusters with the InfiniBand Switch website.
Note: QLogic uses the Silverstorm name in its product documentation.
Cluster management software information resources
The following table lists cluster management software information resources:
Table 4. Cluster management software resources
Component | Document | Plan | Install | Manage and service
QLogic Subnet Manager | Fabric Manager and Fabric Viewer Users Guide http://filedownloads.qlogic.com/files/ms/72922/QLogic_FM_FV_UG_Rev_A.pdf | x | x | x
QLogic Fast Fabric Toolset | Fast Fabric Toolset Users Guide http://filedownloads.qlogic.com/files/ms/70168/User%27s_Guide_FF_v4_3_Rev_B.pdf | x | x | x