A Look at SSE4


  • SSE4 Vectorizing Compiler and Media Accelerators
Packed DWORD Multiplies (PMULLD, PMULDQ): New support for four signed or unsigned 32x32 bit multiplications per instruction, as well as signed forms of 32x32->64 multiplication. Broadly useful for improved automated compiler vectorization of data processing written in high level languages (like C and Fortran).
Floating Point Dot Product (DPPS, DPPD): Improved performance for AOS (Array of Structs) data processing through support for single- and double-precision dot products. Applications: 3-D content creation, gaming, and support for languages like Cg and HLSL.
Packed Blending (BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW): Blending conditionally copies one field in the source onto the same field in the destination. These new instructions improve the performance of blending operations for most field sizes through packing multiple operations in a single instruction. Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming.
Packed Integer Min and Max (PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD): Compares packed signed/unsigned byte/word/dword integers in the destination operand and the source operand, and returns the minimum or maximum as per the instruction type for each packed operand in the destination operand. Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming.
Floating Point Round (ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD): Efficiently rounds the scalar and packed single- and double-precision operands to integers, with enhanced support for Fortran, Java and C99 language requirements. Applications: image processing, graphics, video processing, 2-D/3-D applications, multimedia, and gaming.
Register Insertion/Extraction (INSERTPS, PINSRB, PINSRD, PINSRQ, EXTRACTPS, PEXTRB, PEXTRD, PEXTRW, PEXTRQ): These new instructions simplify data insertion and extraction between GPR (or memory) and XMM registers. Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming.
Packed Format Conversion (PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ): Converts from a packed integer (from XMM register or memory) to a zero- or sign-extended integer with wider type. Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming.
Packed Test and Set (PTEST): Faster branching from SIMD decisions to support conditionally vectorized code. Useful for improved automated compiler vectorization of data processing, image and video processing, 3-D content creation, multimedia, and gaming.
Packed Compare for Equal (PCMPEQQ, PCMPGTQ): Performs SIMD compare for equality (PCMPEQQ) or signed greater-than (PCMPGTQ) of the packed QWORDs in the destination and the source operand. Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming.
Pack DWORD to Unsigned WORD (PACKUSDW): Converts packed signed DWORDs into packed unsigned WORDs using unsigned saturation to handle overflow conditions. This new instruction completes the set of other instructions of this type. Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming.
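
To make two of the rows above concrete, here is a minimal scalar sketch in portable C of what PACKUSDW does to one lane and what BLENDPS does with its immediate mask. The function names `packusdw_lane` and `blendps_model` are made up for illustration; real code would use the SSE4.1 intrinsics (`_mm_packus_epi32`, `_mm_blend_ps` from `<smmintrin.h>`), which process all lanes in one instruction.

```c
#include <stdint.h>

/* One lane of PACKUSDW (hypothetical helper name):
 * signed 32-bit input -> unsigned 16-bit output with unsigned saturation. */
uint16_t packusdw_lane(int32_t x)
{
    if (x < 0)      return 0;        /* negative values clamp to 0      */
    if (x > 0xFFFF) return 0xFFFF;   /* too-large values clamp to 65535 */
    return (uint16_t)x;
}

/* Scalar model of BLENDPS with an immediate mask (hypothetical helper name):
 * if bit i of mask is set, lane i is copied from src; otherwise dst keeps its lane. */
void blendps_model(float dst[4], const float src[4], unsigned mask)
{
    for (int i = 0; i < 4; ++i) {
        if (mask & (1u << i)) {
            dst[i] = src[i];
        }
    }
}
```

For example, blending {1, 2, 3, 4} with {10, 20, 30, 40} under mask 0x5 replaces lanes 0 and 2 only.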

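Register insertion/extraction presumably means putting a scalar value into some lane of a vector, or taking one out. It is only natural for this operation to exist. In fact, since SIMD code is usually bottlenecked by data movement, I think the more variations of register-to-register and register-to-memory moves there are, the better. The same goes for shuffles.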
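
As a rough check of that reading, here is a scalar model in portable C of PINSRD and PEXTRD. The `vec4u32` struct and the function names are hypothetical stand-ins for an XMM register and the instructions; the matching SSE4.1 intrinsics are `_mm_insert_epi32` and `_mm_extract_epi32`, where the lane index must be a compile-time constant.

```c
#include <stdint.h>

/* Hypothetical stand-in for an XMM register holding four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec4u32;

/* Model of PINSRD: overwrite one lane with a scalar, leave the others alone.
 * Only the low two bits of idx select the lane, as with the imm8 operand. */
vec4u32 pinsrd_model(vec4u32 v, uint32_t x, unsigned idx)
{
    v.lane[idx & 3u] = x;
    return v;
}

/* Model of PEXTRD: read one lane out as a scalar. */
uint32_t pextrd_model(vec4u32 v, unsigned idx)
{
    return v.lane[idx & 3u];
}
```

So the operation really is "scalar in, scalar out" on a single lane, without going through memory.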

  • SSE4 Efficient Accelerated String and Text Processing
Advanced String Operations (PCMPESTRI, PCMPESTRM, PCMPISTRI, PCMPISTRM): These new instructions provide a rich set of string and text processing capabilities that traditionally required many more opcodes. Improved performance for virus scan, text search, string processing libraries like ZLIB, databases, compilers and state machine-oriented applications.
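
To get a feel for what a single one of these instructions replaces, here is a scalar sketch in portable C of PCMPISTRI in its "equal any" mode with the least-significant-index result (imm8 = 0, unsigned bytes). The function name is hypothetical, and this models only one of the many modes the immediate can select. Both operands are treated as NUL-terminated strings of at most 16 bytes, which is the "implicit length" form.

```c
/* Scalar model of PCMPISTRI, "equal any" mode (hypothetical helper name).
 * Returns the index of the first byte of `text` that matches any byte of `set`,
 * looking only at bytes before the NUL terminator and at most 16 bytes of each.
 * Returns 16 when nothing matches, as the instruction does. */
int pcmpistri_equal_any_model(const char *set, const char *text)
{
    for (int j = 0; j < 16 && text[j] != '\0'; ++j) {
        for (int i = 0; i < 16 && set[i] != '\0'; ++i) {
            if (set[i] == text[j]) {
                return j;
            }
        }
    }
    return 16;
}
```

With the set "aeiou" and the text "rhythm and", the result is 7, the position of the first vowel. The hardware does this whole double loop (up to 16 x 16 byte comparisons) in one instruction.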

  • Overview of Application Targeted Accelerators
Fast CRC (Cyclic Redundancy Check) (CRC32): Finds the CRC value using a specific polynomial of a given source operand. Applications: fast and efficient data integrity checks in data transfer protocols for networked storage (e.g., iSCSI, RDMA).
Accelerated searching and pattern recognition of large data sets (POPCNT): Calculates the number of bits set to 1 in the given operand. Helps to deliver higher performance in applications such as genome mining, handwriting recognition, digital health workloads, fast hamming algorithms, and others.
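
The "specific polynomial" of the CRC32 instruction is the CRC-32C (Castagnoli) polynomial 0x11EDC6F41. As a sketch of the work the instruction saves, here is a bit-at-a-time software version in portable C. The function name is made up; the initial value 0xFFFFFFFF and the final inversion are the usual CRC-32C convention applied around the accumulate step, not something the instruction does by itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (hypothetical helper name), reflected polynomial 0x82F63B78.
 * The CRC32 instruction performs the accumulate step for 8 to 64 bits at once;
 * here every bit costs a shift, a mask, and an XOR. */
uint32_t crc32c_bitwise(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;                 /* conventional initial value */
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; ++k) {
            /* if the low bit is set, shift and XOR in the polynomial */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;                   /* conventional final inversion */
}
```

Feeding it the nine ASCII bytes "123456789" gives 0xE3069283, the standard CRC-32C check value. At eight loop iterations per byte, it is easy to see why a dedicated instruction helps here.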

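POPCNT appears to be what is called a population count instruction: it counts the 1 bits in a register. Hacker's Delight and Bit Twiddling Hacks, among others, explain how to compute it in software. Bit tricks like this are quite fun. Unlike with CRC, a dedicated instruction does not cut the step count dramatically here: as those references explain, a 32-bit value can be counted in about 12 instructions at most. Still, population count has many applications, so having it never hurts.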
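
For reference, a sketch of the software method in portable C, following the parallel bit-summing approach described in those references. The function name is my own. It comes to 12 arithmetic and logic operations for a 32-bit value, which is where the "about 12 instructions" figure comes from.

```c
#include <stdint.h>

/* Software population count for 32 bits (hypothetical helper name).
 * Sums bits in parallel: 2-bit fields, then 4-bit, then bytes,
 * then adds the four byte counts with one multiply. */
unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                  /* counts per 2-bit field */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* counts per 4-bit field */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* counts per byte        */
    return (x * 0x01010101u) >> 24;                    /* sum of the four bytes  */
}
```

With SSE4.2 the same result comes from a single POPCNT, exposed in C as the `_mm_popcnt_u32` intrinsic.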
Last updated: 2008-02-14 03:08

