
IA-32 Intel® Architecture Optimization Reference Manual
Order Number: 248966-013US
April 2006

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO
LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY
RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS
OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL
DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL
PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL
PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining
applications. Intel may make changes to specifications and product descriptions at any time, without notice.
This IA-32 Intel® Architecture Optimization Reference Manual as well as the software described in it is furnished
under license and may only be used or copied in accordance with the terms of the license. The information in this
manual is furnished for informational use only, is subject to change without notice, and should not be construed as a
commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies
that may appear in this document or any software that may be provided in association with this document.
Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means without the express written consent of Intel Corporation.
Developers must not rely on the absence or characteristics of any features or instructions marked “reserved” or
“undefined.” Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in
developer's software code when running on an Intel® processor. Intel reserves these features or instructions for future
definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized
use.
Hyper-Threading Technology requires a computer system with an Intel® Pentium® 4 processor supporting
Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary
depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading for more
information including details on which processors support HT Technology.
Intel, Pentium, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Pentium D, Itanium, MMX, and
VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other
countries.
*Other names and brands may be claimed as the property of others.
Copyright © 1999-2006 Intel Corporation.

Contents
Introduction
Chapter 1 IA-32 Intel® Architecture Processor Family Overview
SIMD Technology.................................................................................................................... 1-2
Summary of SIMD Technologies...................................................................................... 1-5
MMX™ Technology..................................................................................................... 1-5
Streaming SIMD Extensions....................................................................................... 1-5
Streaming SIMD Extensions 2.................................................................................... 1-6
Streaming SIMD Extensions 3.................................................................................... 1-6
Intel® Extended Memory 64 Technology (Intel® EM64T)...................................................... 1-7
Intel NetBurst® Microarchitecture........................................................................................... 1-8
Design Goals of Intel NetBurst Microarchitecture ............................................................ 1-8
Overview of the Intel NetBurst Microarchitecture Pipeline ............................................... 1-9
The Front End........................................................................................................... 1-11
The Out-of-order Core.............................................................................................. 1-12
Retirement................................................................................................................ 1-12
Front End Pipeline Detail............................................................................................... 1-13
Prefetching................................................................................................................ 1-13
Decoder.................................................................................................................... 1-14
Execution Trace Cache ............................................................................................ 1-14
Branch Prediction ..................................................................................................... 1-15
Execution Core Detail..................................................................................................... 1-16
Instruction Latency and Throughput......................................................................... 1-17
Execution Units and Issue Ports............................................................................... 1-18
Caches...................................................................................................................... 1-19
Data Prefetch............................................................................................................ 1-21
Loads and Stores...................................................................................................... 1-24
Store Forwarding ...................................................................................................... 1-25
Intel® Pentium® M Processor Microarchitecture................................................................. 1-26
The Front End........................................................................................................... 1-27
Data Prefetching....................................................................................................... 1-29
Out-of-Order Core..................................................................................................... 1-30
In-Order Retirement.................................................................................................. 1-31
Microarchitecture of Intel® Core™ Solo and Intel® Core™ Duo Processors...................... 1-31
Front End........................................................................................................................ 1-32
Data Prefetching............................................................................................................. 1-33
Hyper-Threading Technology................................................................................................ 1-33
Processor Resources and Hyper-Threading Technology............................................... 1-36
Replicated Resources............................................................................................... 1-36
Partitioned Resources .............................................................................................. 1-36
Shared Resources.................................................................................................... 1-37
Microarchitecture Pipeline and Hyper-Threading Technology........................................ 1-38
Front End Pipeline......................................................................................................... 1-38
Execution Core............................................................................................................... 1-39
Retirement...................................................................................................................... 1-39
Multi-Core Processors........................................................................................................... 1-39
Microarchitecture Pipeline and Multi-Core Processors................................................... 1-42
Shared Cache in Intel Core Duo Processors ................................................................. 1-42
Load and Store Operations....................................................................................... 1-42
Chapter 2 General Optimization Guidelines
Tuning to Achieve Optimum Performance.............................................................................. 2-1
Tuning to Prevent Known Coding Pitfalls................................................................................ 2-2
General Practices and Coding Guidelines.............................................................................. 2-3
Use Available Performance Tools..................................................................................... 2-4
Optimize Performance Across Processor Generations.................................................... 2-4
Optimize Branch Predictability.......................................................................................... 2-5
Optimize Memory Access................................................................................................. 2-5
Optimize Floating-point Performance............................................................................... 2-6
Optimize Instruction Selection.......................................................................................... 2-6
Optimize Instruction Scheduling....................................................................................... 2-7
Enable Vectorization......................................................................................................... 2-7
Coding Rules, Suggestions and Tuning Hints......................................................................... 2-8
Performance Tools.................................................................................................................. 2-9
Intel® C++ Compiler ........................................................................................................ 2-9
General Compiler Recommendations............................................................................ 2-10
VTune™ Performance Analyzer..................................................................................... 2-10
Processor Perspectives ........................................................................................................ 2-11
CPUID Dispatch Strategy and Compatible Code Strategy............................................. 2-13
Transparent Cache-Parameter Strategy......................................................................... 2-14
Threading Strategy and Hardware Multi-Threading Support.......................................... 2-14
Branch Prediction.................................................................................................................. 2-15
Eliminating Branches...................................................................................................... 2-15
Spin-Wait and Idle Loops................................................................................................ 2-18
Static Prediction.............................................................................................................. 2-19
Inlining, Calls and Returns ............................................................................................. 2-22
Branch Type Selection ................................................................................................... 2-23
Loop Unrolling ............................................................................................................... 2-26
Compiler Support for Branch Prediction......................................................................... 2-28
Memory Accesses................................................................................................................. 2-29
Alignment ....................................................................................................................... 2-29
Store Forwarding............................................................................................................ 2-32
Store-to-Load-Forwarding Restriction on Size and Alignment.................................. 2-33
Store-forwarding Restriction on Data Availability...................................................... 2-38
Data Layout Optimizations ............................................................................................. 2-39
Stack Alignment.............................................................................................................. 2-42
Capacity Limits and Aliasing in Caches.......................................................................... 2-43
Capacity Limits in Set-Associative Caches............................................................... 2-44
Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors ........................... 2-45
Aliasing Cases in the Pentium M Processor............................................................. 2-46
Mixing Code and Data.................................................................................................... 2-47
Self-modifying Code ................................................................................................. 2-47
Write Combining............................................................................................................. 2-48
Locality Enhancement.................................................................................................... 2-50
Minimizing Bus Latency.................................................................................................. 2-52
Non-Temporal Store Bus Traffic ..................................................................................... 2-53
Prefetching ..................................................................................................................... 2-55
Hardware Instruction Fetching.................................................................................. 2-55
Software and Hardware Cache Line Fetching.......................................................... 2-55
Cacheability Instructions ................................................................................................ 2-56
Code Alignment.............................................................................................................. 2-57
Improving the Performance of Floating-point Applications.................................................... 2-57
Guidelines for Optimizing Floating-point Code............................................................... 2-58
Floating-point Modes and Exceptions............................................................................ 2-60
Floating-point Exceptions ......................................................................................... 2-60
Floating-point Modes................................................................................................ 2-62
Improving Parallelism and the Use of FXCH.................................................................. 2-68
x87 vs. Scalar SIMD Floating-point Trade-offs............................................................... 2-69
Scalar SSE/SSE2 Performance on Intel Core Solo and Intel Core Duo Processors............. 2-70
Memory Operands.......................................................................................................... 2-71
Floating-Point Stalls........................................................................................................ 2-72
x87 Floating-point Operations with Integer Operands.............................................. 2-72
x87 Floating-point Comparison Instructions ............................................................. 2-72
Transcendental Functions ........................................................................................ 2-72
Instruction Selection.............................................................................................................. 2-73
Complex Instructions...................................................................................................... 2-74
Use of the lea Instruction................................................................................................ 2-74
Use of the inc and dec Instructions................................................................................ 2-75
Use of the shift and rotate Instructions........................................................................... 2-75
Flag Register Accesses.................................................................................................. 2-75
Integer Divide ................................................................................................................. 2-76
Operand Sizes and Partial Register Accesses............................................................... 2-76
Prefixes and Instruction Decoding.................................................................................. 2-80
REP Prefix and Data Movement..................................................................................... 2-81
Address Calculations...................................................................................................... 2-86
Clearing Registers.......................................................................................................... 2-87
Compares....................................................................................................................... 2-87
Floating Point/SIMD Operands....................................................................................... 2-88
Prolog Sequences.......................................................................................................... 2-90
Code Sequences that Operate on Memory Operands................................................... 2-90
Instruction Scheduling........................................................................................................... 2-91
Latencies and Resource Constraints.............................................................................. 2-91
Spill Scheduling.............................................................................................................. 2-92
Scheduling Rules for the Pentium 4 Processor Decoder ............................................... 2-92
Scheduling Rules for the Pentium M Processor Decoder .............................................. 2-93
Vectorization ......................................................................................................................... 2-93
Miscellaneous ....................................................................................................................... 2-95
NOPs.............................................................................................................................. 2-95
Summary of Rules and Suggestions..................................................................................... 2-96
User/Source Coding Rules............................................................................................. 2-97
Assembly/Compiler Coding Rules.................................................................................. 2-99
Tuning Suggestions...................................................................................................... 2-108
Chapter 3 Coding for SIMD Architectures
Checking for Processor Support of SIMD Technologies......................................................... 3-2
Checking for MMX Technology Support........................................................................... 3-2
Checking for Streaming SIMD Extensions Support.......................................................... 3-3
Checking for Streaming SIMD Extensions 2 Support....................................................... 3-5
Checking for Streaming SIMD Extensions 3 Support....................................................... 3-6
Considerations for Code Conversion to SIMD Programming.................................................. 3-8
Identifying Hot Spots ...................................................................................................... 3-10
Determine If Code Benefits by Conversion to SIMD Execution...................................... 3-11
Coding Techniques ............................................................................................................... 3-12
Coding Methodologies.................................................................................................... 3-13
Assembly.................................................................................................................. 3-15
Intrinsics.................................................................................................................... 3-15
Classes..................................................................................................................... 3-17
Automatic Vectorization............................................................................................ 3-18
Stack and Data Alignment..................................................................................................... 3-20
Alignment and Contiguity of Data Access Patterns........................................................ 3-20
Using Padding to Align Data..................................................................................... 3-20
Using Arrays to Make Data Contiguous.................................................................... 3-21
Stack Alignment For 128-bit SIMD Technologies ........................................................... 3-22
Data Alignment for MMX Technology............................................................................. 3-23
Data Alignment for 128-bit data...................................................................................... 3-24
Compiler-Supported Alignment................................................................................. 3-24
Improving Memory Utilization................................................................................................ 3-27
Data Structure Layout..................................................................................................... 3-27
Strip Mining..................................................................................................................... 3-32
Loop Blocking................................................................................................................. 3-34
Instruction Selection.............................................................................................................. 3-37
SIMD Optimizations and Microarchitectures .................................................................. 3-38
Tuning the Final Application.................................................................................................. 3-39
Chapter 4 Optimizing for SIMD Integer Applications
General Rules on SIMD Integer Code .................................................................................... 4-2
Using SIMD Integer with x87 Floating-point............................................................................ 4-3
Using the EMMS Instruction............................................................................................. 4-3
Guidelines for Using EMMS Instruction............................................................................ 4-4
Data Alignment........................................................................................................................ 4-6
Data Movement Coding Techniques....................................................................................... 4-6
Unsigned Unpack............................................................................................................. 4-6
Signed Unpack................................................................................................................. 4-7
Interleaved Pack with Saturation...................................................................................... 4-8
Interleaved Pack without Saturation............................................................................... 4-10
Non-Interleaved Unpack................................................................................................. 4-11
Extract Word................................................................................................................... 4-13
Insert Word..................................................................................................................... 4-14
Move Byte Mask to Integer............................................................................................. 4-16
Packed Shuffle Word for 64-bit Registers ...................................................................... 4-18
Packed Shuffle Word for 128-bit Registers .................................................................... 4-19
Unpacking/interleaving 64-bit Data in 128-bit Registers................................................. 4-20
Data Movement.............................................................................................................. 4-21
Conversion Instructions.................................................................................................. 4-21
Generating Constants........................................................................................................... 4-21
Building Blocks...................................................................................................................... 4-23
Absolute Difference of Unsigned Numbers .................................................................... 4-23
Absolute Difference of Signed Numbers ........................................................................ 4-24
Absolute Value................................................................................................................ 4-25
Clipping to an Arbitrary Range [high, low]...................................................................... 4-26
Highly Efficient Clipping............................................................................................ 4-27
Clipping to an Arbitrary Unsigned Range [high, low]................................................ 4-28
Packed Max/Min of Signed Word and Unsigned Byte.................................................... 4-29
Signed Word............................................................................................................. 4-29
Unsigned Byte .......................................................................................................... 4-30
Packed Multiply High Unsigned...................................................................................... 4-30
Packed Sum of Absolute Differences............................................................................. 4-30
Packed Average (Byte/Word)......................................................................................... 4-31
Complex Multiply by a Constant..................................................................................... 4-32
Packed 32*32 Multiply.................................................................................................... 4-33
Packed 64-bit Add/Subtract............................................................................................ 4-33
128-bit Shifts................................................................................................................... 4-33
Memory Optimizations .......................................................................................................... 4-34
Partial Memory Accesses............................................................................................... 4-35
Supplemental Techniques for Avoiding Cache Line Splits........................................ 4-37
Increasing Bandwidth of Memory Fills and Video Fills................................................... 4-39
Increasing Memory Bandwidth Using the MOVDQ Instruction................................. 4-39
Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM Page ................ 4-39
Increasing UC and WC Store Bandwidth by Using Aligned Stores........................... 4-40
Converting from 64-bit to 128-bit SIMD Integer .................................................................... 4-40
SIMD Optimizations and Microarchitectures .................................................................. 4-41
Packed SSE2 Integer versus MMX Instructions....................................................... 4-42
Chapter 5 Optimizing for SIMD Floating-point Applications
General Rules for SIMD Floating-point Code.......................................................................... 5-1
Planning Considerations......................................................................................................... 5-2
Using SIMD Floating-point with x87 Floating-point................................................................. 5-3
Scalar Floating-point Code...................................................................................................... 5-3
Data Alignment........................................................................................................................ 5-4
Data Arrangement............................................................................................................ 5-4
Vertical versus Horizontal Computation...................................................................... 5-5
Data Swizzling............................................................................................................ 5-9
Data Deswizzling...................................................................................................... 5-14
Using MMX Technology Code for Copy or Shuffling Functions................................ 5-17
Horizontal ADD Using SSE....................................................................................... 5-18
Use of cvttps2pi/cvttss2si Instructions .................................................................................. 5-21
Flush-to-Zero and Denormals-are-Zero Modes .................................................................... 5-22
SIMD Floating-point Programming Using SSE3 ................................................................... 5-22
SSE3 and Complex Arithmetics ..................................................................................... 5-23
SSE3 and Horizontal Computation................................................................................. 5-26
SIMD Optimizations and Microarchitectures .................................................................. 5-27
Packed Floating-Point Performance......................................................................... 5-27
Chapter 6 Optimizing Cache Usage
General Prefetch Coding Guidelines....................................................................................... 6-2
Hardware Prefetching of Data................................................................................................. 6-4
Prefetch and Cacheability Instructions.................................................................................... 6-5
Prefetch................................................................................................................................... 6-6
Software Data Prefetch .................................................................................................... 6-6
The Prefetch Instructions – Pentium 4 Processor Implementation................................... 6-8
Prefetch and Load Instructions......................................................................................... 6-8
Cacheability Control................................................................................................................ 6-9
The Non-temporal Store Instructions.............................................................................. 6-10
Fencing..................................................................................................................... 6-10
Streaming Non-temporal Stores ............................................................................... 6-10
Memory Type and Non-temporal Stores................................................................... 6-11
Write-Combining....................................................................................................... 6-12
Streaming Store Usage Models...................................................................................... 6-13
Coherent Requests................................................................................................... 6-13
Non-coherent Requests............................................................................................. 6-13
Streaming Store Instruction Descriptions ....................................................................... 6-14
The fence Instructions.................................................................................................... 6-15
The sfence Instruction .............................................................................................. 6-15
The lfence Instruction ............................................................................................... 6-16
The mfence Instruction............................................................................................. 6-16
The clflush Instruction .................................................................................................... 6-17
Memory Optimization Using Prefetch.................................................................................... 6-18
Software-controlled Prefetch.......................................................................................... 6-18
Hardware Prefetch ......................................................................................................... 6-19
Example of Effective Latency Reduction with H/W Prefetch.......................................... 6-20
Example of Latency Hiding with S/W Prefetch Instruction ............................................ 6-22
Software Prefetching Usage Checklist........................................................................... 6-24
Software Prefetch Scheduling Distance......................................................................... 6-25
Software Prefetch Concatenation................................................................................... 6-26
Minimize Number of Software Prefetches...................................................................... 6-29
Mix Software Prefetch with Computation Instructions.................................................... 6-32
Software Prefetch and Cache Blocking Techniques....................................................... 6-34
Hardware Prefetching and Cache Blocking Techniques ................................................ 6-39
Single-pass versus Multi-pass Execution....................................................................... 6-41
Memory Optimization using Non-Temporal Stores................................................................ 6-43
Non-temporal Stores and Software Write-Combining..................................................... 6-43
Cache Management....................................................................................................... 6-44
Video Encoder.......................................................................................................... 6-45
Video Decoder.......................................................................................................... 6-45
Conclusions from Video Encoder and Decoder Implementation .............................. 6-46
Optimizing Memory Copy Routines.......................................................................... 6-46
TLB Priming.............................................................................................................. 6-47
Using the 8-byte Streaming Stores and Software Prefetch....................................... 6-48
Using 16-byte Streaming Stores and Hardware Prefetch......................................... 6-50
Performance Comparisons of Memory Copy Routines ............................................ 6-52
Deterministic Cache Parameters .......................................................................................... 6-53
Cache Sharing Using Deterministic Cache Parameters................................................. 6-55
Cache Sharing in Single-core or Multi-core.................................................................... 6-55
Determine Prefetch Stride Using Deterministic Cache Parameters ............................... 6-56
Chapter 7 Multi-Core and Hyper-Threading Technology
Performance and Usage Models............................................................................................. 7-2
Multithreading................................................................................................................... 7-2
Multitasking Environment ................................................................................................. 7-4
Programming Models and Multithreading ............................................................................... 7-6
Parallel Programming Models .......................................................................................... 7-7
Domain Decomposition............................................................................................... 7-7
Functional Decomposition................................................................................................ 7-8
Specialized Programming Models.................................................................................... 7-8
Producer-Consumer Threading Models.................................................................... 7-10
Tools for Creating Multithreaded Applications................................................................ 7-14
Optimization Guidelines........................................................................................................ 7-16
Key Practices of Thread Synchronization ...................................................................... 7-16
Key Practices of System Bus Optimization.................................................................... 7-17
Key Practices of Memory Optimization .......................................................................... 7-17
Key Practices of Front-end Optimization........................................................................ 7-18
Key Practices of Execution Resource Optimization ....................................................... 7-18
Generality and Performance Impact............................................................................... 7-19
Thread Synchronization........................................................................................................ 7-19
Choice of Synchronization Primitives............................................................................. 7-20
Synchronization for Short Periods.................................................................................. 7-22
Optimization with Spin-Locks ......................................................................................... 7-25
Synchronization for Longer Periods ............................................................................... 7-26
Avoid Coding Pitfalls in Thread Synchronization...................................................... 7-28
Prevent Sharing of Modified Data and False-Sharing.................................................... 7-30
Placement of Shared Synchronization Variable ............................................................. 7-31
System Bus Optimization...................................................................................................... 7-33
Conserve Bus Bandwidth............................................................................................... 7-34
Understand the Bus and Cache Interactions.................................................................. 7-35
Avoid Excessive Software Prefetches............................................................................ 7-36
Improve Effective Latency of Cache Misses................................................................... 7-36
Use Full Write Transactions to Achieve Higher Data Rate............................................. 7-37
Memory Optimization............................................................................................................ 7-38
Cache Blocking Technique............................................................................................. 7-38
Shared-Memory Optimization......................................................................................... 7-39
Minimize Sharing of Data between Physical Processors.......................................... 7-39
Batched Producer-Consumer Model........................................................................ 7-40
Eliminate 64-KByte Aliased Data Accesses................................................................... 7-42
Preventing Excessive Evictions in First-Level Data Cache............................................ 7-43
Per-thread Stack Offset ............................................................................................ 7-44
Per-instance Stack Offset ......................................................................................... 7-46
Front-end Optimization.......................................................................................................... 7-48
Avoid Excessive Loop Unrolling..................................................................................... 7-48
Optimization for Code Size............................................................................................. 7-49
Using Thread Affinities to Manage Shared Platform Resources........................................... 7-49
Using Shared Execution Resources in a Processor Core.............................................. 7-59
Chapter 8 64-bit Mode Coding Guidelines
Introduction ............................................................................................................................. 8-1
Coding Rules Affecting 64-bit Mode........................................................................................ 8-1
Use Legacy 32-Bit Instructions When The Data Size Is 32 Bits....................................... 8-1
Use Extra Registers to Reduce Register Pressure .......................................................... 8-2
Use 64-Bit by 64-Bit Multiplies That Produce 128-Bit Results Only When Necessary..... 8-2
Sign Extension to Full 64-Bits........................................................................................... 8-3
Alternate Coding Rules for 64-Bit Mode.................................................................................. 8-4
Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic..................... 8-4
Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible................................. 8-6
Using Software Prefetch................................................................................................... 8-6
Chapter 9 Power Optimization for Mobile Usages
Overview................................................................................................................................. 9-1
Mobile Usage Scenarios......................................................................................................... 9-2
ACPI C-States......................................................................................................................... 9-4
Processor-Specific C4 and Deep C4 States..................................................................... 9-6
Guidelines for Extending Battery Life...................................................................................... 9-7
Adjust Performance to Meet Quality of Features ............................................................. 9-8
Reducing Amount of Work................................................................................................ 9-9
Platform-Level Optimizations.......................................................................................... 9-10
Handling Sleep State Transitions ................................................................................... 9-11
Using Enhanced Intel SpeedStep® Technology ............................................................. 9-12
Enabling Intel® Enhanced Deeper Sleep ....................................................................... 9-14
Multi-Core Considerations.............................................................................................. 9-15
Enhanced Intel SpeedStep® Technology.................................................................. 9-15
Thread Migration Considerations.............................................................................. 9-16
Multi-core Considerations for C-States..................................................................... 9-17
Appendix A Application Performance Tools
Intel® Compilers..................................................................................................................... A-2
Code Optimization Options ............................................................................................. A-3
Targeting a Processor (-Gn) ...................................................................................... A-3
Automatic Processor Dispatch Support (-Qx[extensions] and -Qax[extensions])...... A-4
Vectorizer Switch Options ............................................................................................... A-5
Loop Unrolling............................................................................................................ A-5
Multithreading with OpenMP*.................................................................................... A-6
Inline Expansion of Library Functions (-Oi, -Oi-) ............................................................. A-6
Floating-point Arithmetic Precision (-Op, -Op-, -Qprec, -Qprec_div, -Qpc, -Qlong_double)............................................................................................................ A-6
Rounding Control Option (-Qrcd) .................................................................................... A-6
Interprocedural and Profile-Guided Optimizations .......................................................... A-7
Interprocedural Optimization (IPO)............................................................................ A-7
Profile-Guided Optimization (PGO) ........................................................................... A-7
Intel® VTune™ Performance Analyzer................................................................................... A-8
Sampling ......................................................................................................................... A-9
Time-based Sampling................................................................................................ A-9
Event-based Sampling............................................................................................. A-10
Workload Characterization ...................................................................................... A-11
Call Graph ..................................................................................................................... A-13
Counter Monitor............................................................................................................. A-14
Intel® Tuning Assistant.................................................................................................. A-14
Intel® Performance Libraries................................................................................................ A-14
Benefits Summary......................................................................................................... A-15
Optimizations with the Intel® Performance Libraries.................................................... A-16
Enhanced Debugger (EDB) ................................................................................................. A-17
Intel® Threading Tools.......................................................................................................... A-17
Intel® Thread Checker................................................................................................... A-17
Thread Profiler............................................................................................................... A-19
Intel® Software College........................................................................................................ A-20
Appendix B Using Performance Monitoring Events
Pentium 4 Processor Performance Metrics............................................................................ B-1
Pentium 4 Processor-Specific Terminology............................................................................ B-2
Bogus, Non-bogus, Retire............................................................................................... B-2
Bus Ratio......................................................................................................................... B-2
Replay............................................................................................................................. B-3
Assist............................................................................................................................... B-3
Tagging............................................................................................................................ B-3
Counting Clocks..................................................................................................................... B-4
Non-Halted Clockticks..................................................................................................... B-5
Non-Sleep Clockticks ...................................................................................................... B-6
Time Stamp Counter........................................................................................................ B-7
Microarchitecture Notes......................................................................................................... B-8
Trace Cache Events........................................................................................................ B-8
Bus and Memory Metrics................................................................................................. B-8
Reads due to program loads................................................................................... B-11
Reads due to program writes (RFOs)...................................................................... B-11
Writebacks (dirty evictions)...................................................................................... B-12
Usage Notes for Specific Metrics .................................................................................. B-13
Usage Notes on Bus Activities ...................................................................................... B-15
Metrics Descriptions and Categories ................................................................................... B-16
Performance Metrics and Tagging Mechanisms.................................................................. B-46
Tags for replay_event.................................................................................................... B-46
Tags for front_end_event............................................................................................... B-48
Tags for execution_event .............................................................................................. B-48
Using Performance Metrics with Hyper-Threading Technology........................................... B-50
Using Performance Events of Intel Core Solo and Intel Core Duo processors.................... B-56
Understanding the Results in a Performance Counter.................................................. B-56
Ratio Interpretation........................................................................................................ B-57
Notes on Selected Events............................................................................................. B-58
Appendix C IA-32 Instruction Latency and Throughput
Overview................................................................................................................................ C-2
Definitions .............................................................................................................................. C-4
Latency and Throughput........................................................................................................ C-4
Latency and Throughput with Register Operands.......................................................... C-6
Table Footnotes....................................................................................................... C-19
Latency and Throughput with Memory Operands ......................................................... C-20
Appendix D Stack Alignment
Stack Frames......................................................................................................................... D-1
Aligned esp-Based Stack Frames ................................................................................... D-4
Aligned ebp-Based Stack Frames................................................................................... D-6
Stack Frame Optimizations.............................................................................................. D-9
Inlined Assembly and ebx.................................................................................................... D-10
Appendix E Mathematics of Prefetch Scheduling Distance
Simplified Equation ................................................................................................................ E-1
Mathematical Model for PSD................................................................................................. E-2
No Preloading or Prefetch............................................................................................... E-6
Compute Bound (Case: Tc >= Tl + Tb)............................................................................. E-7
Compute Bound (Case: Tl + Tb > Tc > Tb) ..................................................................... E-8
Memory Throughput Bound (Case: Tb >= Tc)............................................................... E-10
Example ........................................................................................................................ E-11
Index

Examples
Example 2-1 Assembly Code with an Unpredictable Branch ............................. 2-17
Example 2-2 Code Optimization to Eliminate Branches.....................................2-17
Example 2-3 Eliminating Branch with CMOV Instruction.................................... 2-18
Example 2-4 Use of pause Instruction ...............................................................2-19
Example 2-5 Pentium 4 Processor Static Branch Prediction Algorithm.............. 2-20
Example 2-6 Static Taken Prediction Example ................................................... 2-21
Example 2-7 Static Not-Taken Prediction Example ............................................2-21
Example 2-8 Indirect Branch With Two Favored Targets .................................... 2-25
Example 2-9 A Peeling Technique to Reduce Indirect Branch Misprediction .....2-26
Example 2-10 Loop Unrolling ............................................................................... 2-28
Example 2-11 Code That Causes Cache Line Split ............................................. 2-31
Example 2-12 Several Situations of Small Loads After Large Store .................... 2-35
Example 2-14 A Non-forwarding Situation in Compiler Generated Code............. 2-36
Example 2-15 Two Examples to Avoid the Non-forwarding Situation in Example 2-14 ................................................................................ 2-36
Example 2-13 A Non-forwarding Example of Large Load After Small Store ........2-36
Example 2-16 Large and Small Load Stalls ......................................................... 2-37
Example 2-17 An Example of Loop-carried Dependence Chain .......................... 2-39
Example 2-18 Rearranging a Data Structure ....................................................... 2-39
Example 2-19 Decomposing an Array ..................................................................2-40
Example 2-20 Dynamic Stack Alignment ............................................................. 2-43
Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions............ 2-54
Example 2-22 Non-temporal Stores and Partial Bus Write Transactions ............. 2-54
Example 2-23 Algorithm to Avoid Changing the Rounding Mode......................... 2-66
Example 2-24 Dependencies Caused by Referencing Partial Registers.............. 2-77
Example 2-25 Recombining LOAD/OP Code into REG,MEM Form.....................2-91
Example 2-26 Spill Scheduling Example Code .................................................... 2-92
Example 3-1 Identification of MMX Technology with cpuid................................... 3-3
Example 3-3 Identification of SSE by the OS ....................................................... 3-4
Example 3-2 Identification of SSE with cpuid .......................................................3-4
Example 3-4 Identification of SSE2 with cpuid ..................................................... 3-5
Example 3-5 Identification of SSE2 by the OS ..................................................... 3-6
Example 3-6 Identification of SSE3 with cpuid ..................................................... 3-7
Example 3-7 Identification of SSE3 by the OS ..................................................... 3-8
Example 3-8 Simple Four-Iteration Loop ............................................................ 3-14
Example 3-9 Streaming SIMD Extensions Using Inlined Assembly Encoding ...3-15
Example 3-10 Simple Four-Iteration Loop Coded with Intrinsics.......................... 3-16
Example 3-11 C++ Code Using the Vector Classes.............................................3-18
Example 3-12 Automatic Vectorization for a Simple Loop.................................... 3-19
Example 3-13 C Algorithm for 64-bit Data Alignment ...........................................3-23
Example 3-14 AoS Data Structure ....................................................................... 3-27
Example 3-16 AoS and SoA Code Samples ........................................................ 3-28
Example 3-15 SoA Data Structure ....................................................................... 3-28
Example 3-17 Hybrid SoA Data Structure ............................................................3-30
Example 3-18 Pseudo-code Before Strip Mining.................................................. 3-32
Example 3-19 Strip Mined Code...........................................................................3-33
Example 3-20 Loop Blocking................................................................................ 3-35
Example 3-21 Emulation of Conditional Moves.................................................... 3-37
Example 4-1 Resetting the Register between __m64 and FP Data Types...........4-5
Example 4-2 Unsigned Unpack Instructions......................................................... 4-7
Example 4-3 Signed Unpack Code ...................................................................... 4-8
Example 4-4 Interleaved Pack with Saturation ................................................... 4-10
Example 4-5 Interleaved Pack without Saturation ..............................................4-11
Example 4-6 Unpacking Two Packed-word Sources in a Non-interleaved Way . 4-13
Example 4-7 pextrw Instruction Code................................................................. 4-14
Example 4-8 pinsrw Instruction Code................................................................. 4-15
Example 4-9 Repeated pinsrw Instruction Code ................................................ 4-16
Example 4-10 pmovmskb Instruction Code.......................................................... 4-17
Example 4-12 Broadcast Using 2 Instructions...................................................... 4-19
Example 4-11 pshuf Instruction Code .................................................................. 4-19
Example 4-13 Swap Using 3 Instructions............................................................. 4-20
Example 4-14 Reverse Using 3 Instructions......................................................... 4-20
Example 4-15 Generating Constants ................................................................... 4-21
Example 4-16 Absolute Difference of Two Unsigned Numbers ............................ 4-23
Example 4-17 Absolute Difference of Signed Numbers ....................................... 4-24
Example 4-18 Computing Absolute Value ............................................................ 4-25
Example 4-19 Clipping to a Signed Range of Words [high, low] .......................... 4-27
Example 4-20 Clipping to an Arbitrary Signed Range [high, low]......................... 4-27
Example 4-21 Simplified Clipping to an Arbitrary Signed Range ......................... 4-28
Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low]..................... 4-29
Example 4-23 Complex Multiply by a Constant ....................................................4-32
Example 4-24 A Large Load after a Series of Small Stores (Penalty).................. 4-35
Example 4-25 Accessing Data without Delay....................................................... 4-35
Example 4-26 A Series of Small Loads after a Large Store ................................. 4-36
Example 4-27 Eliminating Delay for a Series of Small Loads after a Large Store.................................................................................... 4-36
Example 4-28 An Example of Video Processing with Cache Line Splits.............. 4-37
Example 4-29 Video Processing Using LDDQU to Avoid Cache Line Splits........ 4-38
Example 5-1 Pseudocode for Horizontal (xyz, AoS) Computation .......................5-8
Example 5-2 Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation........ 5-9
Example 5-3 Swizzling Data............................................................................... 5-10
Example 5-4 Swizzling Data Using Intrinsics .....................................................5-12
Example 5-5 Deswizzling Single-Precision SIMD Data ...................................... 5-14
Example 5-6 Deswizzling Data Using the movlhps and shuffle Instructions .................................................................................... 5-15
Example 5-7 Deswizzling Data 64-bit Integer SIMD Data ..................................5-16
Example 5-8 Using MMX Technology Code for Copying or Shuffling................. 5-18
Example 5-9 Horizontal Add Using movhlps/movlhps ........................................ 5-19
Example 5-10 Horizontal Add Using Intrinsics with movhlps/movlhps ................. 5-21
Example 5-11 Multiplication of Two Pairs of Single-precision Complex Numbers ... 5-24
Example 5-12 Division of Two Pairs of Single-precision Complex Numbers ........... 5-25
Example 5-13 Calculating Dot Products from AOS .............................................. 5-26
Example 6-1 Pseudo-code for Using clflush ...................................................... 6-18
Example 6-2 Populating an Array for Circular Pointer Chasing with
Constant Stride.............................................................................. 6-21
Example 6-3 Prefetch Scheduling Distance ....................................................... 6-26
Example 6-4 Using Prefetch Concatenation....................................................... 6-28
Example 6-5 Concatenation and Unrolling the Last Iteration of Inner Loop ....... 6-28
Example 6-6 Spread Prefetch Instructions .........................................................6-33
Example 6-7 Data Access of a 3D Geometry Engine without Strip-mining........6-37
Example 6-8 Data Access of a 3D Geometry Engine with Strip-mining............. 6-38
Example 6-9 Using HW Prefetch to Improve Read-Once Memory Traffic .......... 6-40
Example 6-10 Basic Algorithm of a Simple Memory Copy................................... 6-46
Example 6-11 A Memory Copy Routine Using Software Prefetch........................6-48

Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation..6-50
Example 7-1 Serial Execution of Producer and Consumer Work Items ...............7-9
Example 7-2 Basic Structure of Implementing Producer Consumer Threads....7-11
Example 7-3 Thread Function for an Interlaced Producer Consumer Model .....7-13
Example 7-4 Spin-wait Loop and PAUSE Instructions........................................ 7-24
Example 7-5 Coding Pitfall using Spin Wait Loop .............................................. 7-29
Example 7-6 Placement of Synchronization and Regular Variables .................. 7-32
Example 7-7 Declaring Synchronization Variables without Sharing
a Cache Line ................................................................................. 7-32
Example 7-8 Batched Implementation of the Producer Consumer Threads ...... 7-41
Example 7-9 Adding an Offset to the Stack Pointer of Three Threads............... 7-45
Example 7-10 Adding a Pseudo-random Offset to the Stack Pointer
in the Entry Function ..................................................................... 7-47
Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical
Processor ......................................................................................7-51
Example 7-12 Assembling a Look up Table to Manage Affinity Masks
and Schedule Threads to Each Core First .................................... 7-54
Example 7-13 Discovering the Affinity Masks for Sibling Logical
Processors Sharing the Same Cache ........................................... 7-55
Example D-1 Aligned esp-Based Stack Frames .................................................. D-5
Example D-2 Aligned ebp-based Stack Frames................................................... D-7
Example E-1 Calculating Insertion for Scheduling Distance of 3 ..........................E-3

Figures
Figure 1-1 Typical SIMD Operations ................................................................... 1-3
Figure 1-2 SIMD Instruction Register Usage ...................................................... 1-4
Figure 1-3 The Intel NetBurst Microarchitecture ............................................... 1-10
Figure 1-4 Execution Units and Ports in the Out-Of-Order Core....................... 1-19
Figure 1-5 The Intel Pentium M Processor Microarchitecture........................... 1-27
Figure 1-6 Hyper-Threading Technology on an SMP........................................ 1-35
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition
and Intel Core Duo Processor ......................................................... 1-41
Figure 2-1 Cache Line Split in Accessing Elements in an Array ........................ 2-31
Figure 2-2 Size and Alignment Restrictions in Store Forwarding...................... 2-34
Figure 3-1 Converting to Streaming SIMD Extensions Chart ............................. 3-9
Figure 3-2 Hand-Coded Assembly and High-Level Compiler
Performance Trade-offs ................................................................... 3-13
Figure 3-3 Loop Blocking Access Pattern ......................................................... 3-36
Figure 4-1 PACKSSDW mm, mm/mm64 Instruction Example ............................ 4-9
Figure 4-2 Interleaved Pack with Saturation ....................................................... 4-9
Figure 4-3 Result of Non-Interleaved Unpack Low in MM0 .............................. 4-12
Figure 4-4 Result of Non-Interleaved Unpack High in MM1.............................. 4-12
Figure 4-5 pextrw Instruction ............................................................................ 4-14
Figure 4-6 pinsrw Instruction............................................................................. 4-15
Figure 4-7 pmovmskb Instruction Example....................................................... 4-17
Figure 4-8 pshuf Instruction Example ............................................................... 4-18
Figure 4-9 PSADBW Instruction Example ........................................................4-31
Figure 5-1 Homogeneous Operation on Parallel Data Elements ........................5-5
Figure 5-2 Dot Product Operation ....................................................................... 5-8
Figure 5-3 Horizontal Add Using movhlps/movlhps ..........................................5-19
Figure 5-4 Asymmetric Arithmetic Operation of the SSE3 Instruction .............. 5-23
Figure 5-5 Horizontal Arithmetic Operation of the SSE3 Instruction
HADDPD ......................................................................................... 5-23
Figure 6-1 Effective Latency Reduction as a Function of Access Stride........... 6-22

Figure 6-2 Memory Access Latency and Execution Without Prefetch .............. 6-23
Figure 6-3 Memory Access Latency and Execution With Prefetch ................... 6-23
Figure 6-4 Prefetch and Loop Unrolling ............................................................6-29
Figure 6-5 Memory Access Latency and Execution With Prefetch ................... 6-31
Figure 6-6 Cache Blocking – Temporally Adjacent and Non-adjacent
Passes ............................................................................................. 6-35
Figure 6-7 Examples of Prefetch and Strip-mining for Temporally
Adjacent and Non-Adjacent Passes Loops .....................................6-36
Figure 6-8 Single-Pass Vs. Multi-Pass 3D Geometry Engines ......................... 6-42
Figure 7-1 Amdahl’s Law and MP Speed-up ...................................................... 7-3
Figure 7-2 Single-threaded Execution of Producer-consumer
Threading Model................................................................................ 7-9
Figure 7-3 Execution of Producer-consumer Threading Model on
a Multi-core Processor..................................................................... 7-10
Figure 7-4 Interlaced Variation of the Producer Consumer Model.................... 7-12
Figure 7-5 Batched Approach of Producer Consumer Model ........................... 7-40
Figure 9-1 Performance History and State Transitions ....................................... 9-3
Figure 9-2 Active Time Versus Halted Time of a Processor ...............................9-4
Figure 9-3 Application of C-states to Idle Time ................................................... 9-6
Figure 9-4 Profiles of Coarse Task Scheduling and Power Consumption.........9-12
Figure 9-5 Thread Migration in a Multi-Core Processor .................................... 9-17
Figure 9-6 Progression to Deeper Sleep ..........................................................9-18
Figure A-1 Sampling Analysis of Hotspots by Location.....................................A-10
Figure A-2 Intel Thread Checker Can Locate Data Race Conditions................A-18
Figure A-3 Intel Thread Profiler Can Show Critical Paths of Threaded
Execution Timelines.........................................................................A-20
Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ
and Front Side Bus ..........................................................................B-10
Figure D-1 Stack Frames Based on Alignment Type .......................................... D-3
Figure E-1 Pentium II, Pentium III and Pentium 4 Processors Memory
Pipeline Sketch ..................................................................................E-4
Figure E-2 Execution Pipeline, No Preloading or Prefetch ..................................E-6
Figure E-3 Compute Bound Execution Pipeline ..................................................E-7
Figure E-4 Another Compute Bound Execution Pipeline.....................................E-8
Figure E-5 Memory Throughput Bound Pipeline ...............................................E-10
Figure E-6 Accesses per Iteration, Example 1 ..................................................E-12
Figure E-7 Accesses per Iteration, Example 2 ..................................................E-13