
IA-32 Intel® Architecture Optimization Reference Manual

Order Number: 248966-013US

April 2006

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.

This IA-32 Intel® Architecture Optimization Reference Manual as well as the software described in it is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.

Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.

Developers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in developer's software code when running on an Intel® processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use.

Hyper-Threading Technology requires a computer system with an Intel® Pentium® 4 processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading for more information including details on which processors support HT Technology.

Intel, Pentium, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Pentium D, Itanium, MMX, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

*Other names and brands may be claimed as the property of others.


Contents

Introduction

Chapter 1 IA-32 Intel® Architecture Processor Family Overview

SIMD Technology.................................................................................................................... 1-2

Summary of SIMD Technologies...................................................................................... 1-5

MMX™ Technology..................................................................................................... 1-5

Streaming SIMD Extensions....................................................................................... 1-5

Streaming SIMD Extensions 2.................................................................................... 1-6

Streaming SIMD Extensions 3.................................................................................... 1-6

Intel® Extended Memory 64 Technology (Intel® EM64T)...................................................... 1-7

Intel NetBurst® Microarchitecture........................................................................................... 1-8

Design Goals of Intel NetBurst Microarchitecture ............................................................ 1-8

Overview of the Intel NetBurst Microarchitecture Pipeline ............................................... 1-9

The Front End........................................................................................................... 1-11

The Out-of-order Core.............................................................................................. 1-12

Retirement................................................................................................................ 1-12

Front End Pipeline Detail............................................................................................... 1-13

Prefetching................................................................................................................ 1-13

Decoder.................................................................................................................... 1-14

Execution Trace Cache ............................................................................................ 1-14

Branch Prediction ..................................................................................................... 1-15

Execution Core Detail..................................................................................................... 1-16

Instruction Latency and Throughput......................................................................... 1-17

Execution Units and Issue Ports............................................................................... 1-18

Caches...................................................................................................................... 1-19

Data Prefetch............................................................................................................ 1-21

Loads and Stores...................................................................................................... 1-24

Store Forwarding ...................................................................................................... 1-25

Intel® Pentium® M Processor Microarchitecture................................................................. 1-26

The Front End........................................................................................................... 1-27

Data Prefetching....................................................................................................... 1-29

Out-of-Order Core..................................................................................................... 1-30

In-Order Retirement.................................................................................................. 1-31

Microarchitecture of Intel® Core™ Solo and Intel® Core™ Duo Processors...................... 1-31

Front End........................................................................................................................ 1-32

Data Prefetching............................................................................................................. 1-33

Hyper-Threading Technology................................................................................................ 1-33

Processor Resources and Hyper-Threading Technology............................................... 1-36

Replicated Resources............................................................................................... 1-36

Partitioned Resources .............................................................................................. 1-36

Shared Resources.................................................................................................... 1-37

Microarchitecture Pipeline and Hyper-Threading Technology........................................ 1-38

Front End Pipeline......................................................................................................... 1-38

Execution Core............................................................................................................... 1-39

Retirement...................................................................................................................... 1-39

Multi-Core Processors........................................................................................................... 1-39

Microarchitecture Pipeline and Multi-Core Processors................................................... 1-42

Shared Cache in Intel Core Duo Processors ................................................................. 1-42

Load and Store Operations....................................................................................... 1-42

Chapter 2 General Optimization Guidelines

Tuning to Achieve Optimum Performance.............................................................................. 2-1

Tuning to Prevent Known Coding Pitfalls................................................................................ 2-2

General Practices and Coding Guidelines.............................................................................. 2-3

Use Available Performance Tools..................................................................................... 2-4

Optimize Performance Across Processor Generations.................................................... 2-4

Optimize Branch Predictability.......................................................................................... 2-5

Optimize Memory Access................................................................................................. 2-5

Optimize Floating-point Performance............................................................................... 2-6

Optimize Instruction Selection.......................................................................................... 2-6

Optimize Instruction Scheduling....................................................................................... 2-7

Enable Vectorization......................................................................................................... 2-7

Coding Rules, Suggestions and Tuning Hints......................................................................... 2-8

Performance Tools.................................................................................................................. 2-9

Intel® C++ Compiler ........................................................................................................ 2-9

General Compiler Recommendations............................................................................ 2-10

VTune™ Performance Analyzer..................................................................................... 2-10

Processor Perspectives ........................................................................................................ 2-11

CPUID Dispatch Strategy and Compatible Code Strategy............................................. 2-13

Transparent Cache-Parameter Strategy......................................................................... 2-14

Threading Strategy and Hardware Multi-Threading Support.......................................... 2-14

Branch Prediction.................................................................................................................. 2-15

Eliminating Branches...................................................................................................... 2-15

Spin-Wait and Idle Loops................................................................................................ 2-18

Static Prediction.............................................................................................................. 2-19

Inlining, Calls and Returns ............................................................................................. 2-22

Branch Type Selection ................................................................................................... 2-23

Loop Unrolling ............................................................................................................... 2-26

Compiler Support for Branch Prediction......................................................................... 2-28

Memory Accesses................................................................................................................. 2-29

Alignment ....................................................................................................................... 2-29

Store Forwarding............................................................................................................ 2-32

Store-to-Load-Forwarding Restriction on Size and Alignment.................................. 2-33

Store-forwarding Restriction on Data Availability...................................................... 2-38

Data Layout Optimizations ............................................................................................. 2-39

Stack Alignment.............................................................................................................. 2-42

Capacity Limits and Aliasing in Caches.......................................................................... 2-43

Capacity Limits in Set-Associative Caches............................................................... 2-44

Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors .......................... 2-45

Aliasing Cases in the Pentium M Processor............................................................. 2-46

Mixing Code and Data.................................................................................................... 2-47

Self-modifying Code ................................................................................................. 2-47

Write Combining............................................................................................................. 2-48

Locality Enhancement.................................................................................................... 2-50

Minimizing Bus Latency.................................................................................................. 2-52

Non-Temporal Store Bus Traffic ..................................................................................... 2-53

Prefetching ..................................................................................................................... 2-55

Hardware Instruction Fetching.................................................................................. 2-55

Software and Hardware Cache Line Fetching.......................................................... 2-55

Cacheability Instructions ................................................................................................ 2-56

Code Alignment.............................................................................................................. 2-57

Improving the Performance of Floating-point Applications.................................................... 2-57

Guidelines for Optimizing Floating-point Code............................................................... 2-58

Floating-point Modes and Exceptions............................................................................ 2-60

Floating-point Exceptions ......................................................................................... 2-60

Floating-point Modes................................................................................................ 2-62

Improving Parallelism and the Use of FXCH.................................................................. 2-68

x87 vs. Scalar SIMD Floating-point Trade-offs............................................................... 2-69

Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors............ 2-70

Memory Operands.......................................................................................................... 2-71

Floating-Point Stalls........................................................................................................ 2-72

x87 Floating-point Operations with Integer Operands.............................................. 2-72

x87 Floating-point Comparison Instructions ............................................................. 2-72

Transcendental Functions ........................................................................................ 2-72

Instruction Selection.............................................................................................................. 2-73

Complex Instructions...................................................................................................... 2-74

Use of the lea Instruction................................................................................................ 2-74

Use of the inc and dec Instructions................................................................................ 2-75

Use of the shift and rotate Instructions........................................................................... 2-75

Flag Register Accesses.................................................................................................. 2-75

Integer Divide ................................................................................................................. 2-76

Operand Sizes and Partial Register Accesses............................................................... 2-76

Prefixes and Instruction Decoding.................................................................................. 2-80

REP Prefix and Data Movement..................................................................................... 2-81

Address Calculations...................................................................................................... 2-86

Clearing Registers.......................................................................................................... 2-87

Compares....................................................................................................................... 2-87

Floating Point/SIMD Operands....................................................................................... 2-88

Prolog Sequences.......................................................................................................... 2-90

Code Sequences that Operate on Memory Operands................................................... 2-90

Instruction Scheduling........................................................................................................... 2-91

Latencies and Resource Constraints.............................................................................. 2-91

Spill Scheduling.............................................................................................................. 2-92

Scheduling Rules for the Pentium 4 Processor Decoder ............................................... 2-92

Scheduling Rules for the Pentium M Processor Decoder .............................................. 2-93

Vectorization ......................................................................................................................... 2-93

Miscellaneous ....................................................................................................................... 2-95

NOPs.............................................................................................................................. 2-95

Summary of Rules and Suggestions..................................................................................... 2-96

User/Source Coding Rules............................................................................................. 2-97

Assembly/Compiler Coding Rules.................................................................................. 2-99

Tuning Suggestions...................................................................................................... 2-108

Chapter 3 Coding for SIMD Architectures

Checking for Processor Support of SIMD Technologies......................................................... 3-2

Checking for MMX Technology Support........................................................................... 3-2

Checking for Streaming SIMD Extensions Support.......................................................... 3-3

Checking for Streaming SIMD Extensions 2 Support....................................................... 3-5

Checking for Streaming SIMD Extensions 3 Support....................................................... 3-6


Considerations for Code Conversion to SIMD Programming.................................................. 3-8

Identifying Hot Spots ...................................................................................................... 3-10

Determine If Code Benefits by Conversion to SIMD Execution...................................... 3-11

Coding Techniques ............................................................................................................... 3-12

Coding Methodologies.................................................................................................... 3-13

Assembly.................................................................................................................. 3-15

Intrinsics.................................................................................................................... 3-15

Classes..................................................................................................................... 3-17

Automatic Vectorization............................................................................................ 3-18

Stack and Data Alignment..................................................................................................... 3-20

Alignment and Contiguity of Data Access Patterns........................................................ 3-20

Using Padding to Align Data..................................................................................... 3-20

Using Arrays to Make Data Contiguous.................................................................... 3-21

Stack Alignment For 128-bit SIMD Technologies ........................................................... 3-22

Data Alignment for MMX Technology............................................................................. 3-23

Data Alignment for 128-bit data...................................................................................... 3-24

Compiler-Supported Alignment................................................................................. 3-24

Improving Memory Utilization................................................................................................ 3-27

Data Structure Layout..................................................................................................... 3-27

Strip Mining..................................................................................................................... 3-32

Loop Blocking................................................................................................................. 3-34

Instruction Selection.............................................................................................................. 3-37

SIMD Optimizations and Microarchitectures .................................................................. 3-38

Tuning the Final Application.................................................................................................. 3-39

Chapter 4 Optimizing for SIMD Integer Applications

General Rules on SIMD Integer Code .................................................................................... 4-2

Using SIMD Integer with x87 Floating-point............................................................................ 4-3

Using the EMMS Instruction............................................................................................. 4-3

Guidelines for Using EMMS Instruction............................................................................ 4-4

Data Alignment........................................................................................................................ 4-6

Data Movement Coding Techniques....................................................................................... 4-6

Unsigned Unpack............................................................................................................. 4-6

Signed Unpack................................................................................................................. 4-7

Interleaved Pack with Saturation...................................................................................... 4-8

Interleaved Pack without Saturation............................................................................... 4-10

Non-Interleaved Unpack................................................................................................. 4-11

Extract Word................................................................................................................... 4-13

Insert Word..................................................................................................................... 4-14

Move Byte Mask to Integer............................................................................................. 4-16


Packed Shuffle Word for 64-bit Registers ...................................................................... 4-18

Packed Shuffle Word for 128-bit Registers .................................................................... 4-19

Unpacking/interleaving 64-bit Data in 128-bit Registers................................................. 4-20

Data Movement.............................................................................................................. 4-21

Conversion Instructions.................................................................................................. 4-21

Generating Constants........................................................................................................... 4-21

Building Blocks...................................................................................................................... 4-23

Absolute Difference of Unsigned Numbers .................................................................... 4-23

Absolute Difference of Signed Numbers ........................................................................ 4-24

Absolute Value................................................................................................................ 4-25

Clipping to an Arbitrary Range [high, low]...................................................................... 4-26

Highly Efficient Clipping............................................................................................ 4-27

Clipping to an Arbitrary Unsigned Range [high, low]................................................ 4-28

Packed Max/Min of Signed Word and Unsigned Byte.................................................... 4-29

Signed Word............................................................................................................. 4-29

Unsigned Byte .......................................................................................................... 4-30

Packed Multiply High Unsigned...................................................................................... 4-30

Packed Sum of Absolute Differences............................................................................. 4-30

Packed Average (Byte/Word)......................................................................................... 4-31

Complex Multiply by a Constant..................................................................................... 4-32

Packed 32*32 Multiply.................................................................................................... 4-33

Packed 64-bit Add/Subtract............................................................................................ 4-33

128-bit Shifts................................................................................................................... 4-33

Memory Optimizations .......................................................................................................... 4-34

Partial Memory Accesses............................................................................................... 4-35

Supplemental Techniques for Avoiding Cache Line Splits........................................ 4-37

Increasing Bandwidth of Memory Fills and Video Fills................................................... 4-39

Increasing Memory Bandwidth Using the MOVDQ Instruction................................. 4-39

Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM Page ............ 4-39

Increasing UC and WC Store Bandwidth by Using Aligned Stores........................... 4-40

Converting from 64-bit to 128-bit SIMD Integer .................................................................... 4-40

SIMD Optimizations and Microarchitectures .................................................................. 4-41

Packed SSE2 Integer versus MMX Instructions....................................................... 4-42

Chapter 5 Optimizing for SIMD Floating-point Applications

General Rules for SIMD Floating-point Code.......................................................................... 5-1

Planning Considerations......................................................................................................... 5-2

Using SIMD Floating-point with x87 Floating-point................................................................. 5-3

Scalar Floating-point Code...................................................................................................... 5-3

Data Alignment........................................................................................................................ 5-4

Data Arrangement............................................................................................................ 5-4

Vertical versus Horizontal Computation...................................................................... 5-5

Data Swizzling............................................................................................................ 5-9

Data Deswizzling...................................................................................................... 5-14

Using MMX Technology Code for Copy or Shuffling Functions................................ 5-17

Horizontal ADD Using SSE....................................................................................... 5-18

Use of cvttps2pi/cvttss2si Instructions .................................................................................. 5-21

Flush-to-Zero and Denormals-are-Zero Modes .................................................................... 5-22

SIMD Floating-point Programming Using SSE3 ................................................................... 5-22

SSE3 and Complex Arithmetics ..................................................................................... 5-23

SSE3 and Horizontal Computation................................................................................. 5-26

SIMD Optimizations and Microarchitectures .................................................................. 5-27

Packed Floating-Point Performance......................................................................... 5-27

Chapter 6 Optimizing Cache Usage

General Prefetch Coding Guidelines....................................................................................... 6-2

Hardware Prefetching of Data................................................................................................. 6-4

Prefetch and Cacheability Instructions.................................................................................... 6-5

Prefetch................................................................................................................................... 6-6

Software Data Prefetch .................................................................................................... 6-6

The Prefetch Instructions – Pentium 4 Processor Implementation................................... 6-8

Prefetch and Load Instructions......................................................................................... 6-8

Cacheability Control................................................................................................................ 6-9

The Non-temporal Store Instructions.............................................................................. 6-10

Fencing..................................................................................................................... 6-10

Streaming Non-temporal Stores ............................................................................... 6-10

Memory Type and Non-temporal Stores................................................................... 6-11

Write-Combining....................................................................................................... 6-12

Streaming Store Usage Models...................................................................................... 6-13

Coherent Requests................................................................................................... 6-13

Non-coherent requests............................................................................................. 6-13

Streaming Store Instruction Descriptions ....................................................................... 6-14

The fence Instructions.................................................................................................... 6-15

The sfence Instruction .............................................................................................. 6-15

The lfence Instruction ............................................................................................... 6-16

The mfence Instruction............................................................................................. 6-16

The clflush Instruction .................................................................................................... 6-17

Memory Optimization Using Prefetch.................................................................................... 6-18

Software-controlled Prefetch.......................................................................................... 6-18

Hardware Prefetch ......................................................................................................... 6-19

Example of Effective Latency Reduction with H/W Prefetch.......................................... 6-20

Example of Latency Hiding with S/W Prefetch Instruction ............................................ 6-22

Software Prefetching Usage Checklist........................................................................... 6-24

Software Prefetch Scheduling Distance......................................................................... 6-25

Software Prefetch Concatenation................................................................................... 6-26

Minimize Number of Software Prefetches...................................................................... 6-29

Mix Software Prefetch with Computation Instructions.................................................... 6-32

Software Prefetch and Cache Blocking Techniques....................................................... 6-34

Hardware Prefetching and Cache Blocking Techniques ................................................ 6-39

Single-pass versus Multi-pass Execution....................................................................... 6-41

Memory Optimization using Non-Temporal Stores................................................................ 6-43

Non-temporal Stores and Software Write-Combining..................................................... 6-43

Cache Management....................................................................................................... 6-44

Video Encoder.......................................................................................................... 6-45

Video Decoder.......................................................................................................... 6-45

Conclusions from Video Encoder and Decoder Implementation .............................. 6-46

Optimizing Memory Copy Routines.......................................................................... 6-46

TLB Priming.............................................................................................................. 6-47

Using the 8-byte Streaming Stores and Software Prefetch....................................... 6-48

Using 16-byte Streaming Stores and Hardware Prefetch......................................... 6-50

Performance Comparisons of Memory Copy Routines ............................................ 6-52

Deterministic Cache Parameters .......................................................................................... 6-53

Cache Sharing Using Deterministic Cache Parameters................................................. 6-55

Cache Sharing in Single-core or Multi-core.................................................................... 6-55

Determine Prefetch Stride Using Deterministic Cache Parameters ............................... 6-56

Chapter 7 Multi-Core and Hyper-Threading Technology

Performance and Usage Models............................................................................................. 7-2

Multithreading................................................................................................................... 7-2

Multitasking Environment ................................................................................................. 7-4

Programming Models and Multithreading ............................................................................... 7-6

Parallel Programming Models .......................................................................................... 7-7

Domain Decomposition............................................................................................... 7-7

Functional Decomposition................................................................................................ 7-8

Specialized Programming Models.................................................................................... 7-8

Producer-Consumer Threading Models.................................................................... 7-10

Tools for Creating Multithreaded Applications................................................................ 7-14

Optimization Guidelines........................................................................................................ 7-16

Key Practices of Thread Synchronization ...................................................................... 7-16
