Packed DWORD Multiplies |
PMULLD, PMULDQ |
New support for four signed or unsigned 32x32 bit multiplications per instruction, as well as signed forms of 32x32->64 multiplication. |
Broadly useful for improved automated compiler vectorization of data processing written in high level languages (like C and Fortran). |
Floating Point Dot Product |
DPPS, DPPD |
Improved performance for AOS (Array of Structs) data processing through support for single- and double-precision dot products. |
3-D content creation, gaming, and support for languages like CG and HLSL. |
Packed Blending |
BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW |
Blending conditionally copies one field in the source onto the same field in the destination. These new instructions improve the performance of blending operations for most field sizes through packing multiple operations in a single instruction. |
Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming. |
Packed Integer Min and Max |
PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD |
Compares packed signed/unsigned byte/word/dword integers in the destination operand and the source operand, and returns the minimum or maximum as per the instruction type for each packed operand in the destination operand. |
Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming. |
Floating Point Round |
ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD |
Efficiently rounds scalar and packed single- and double-precision operands to integers, with enhanced support for Fortran, Java, and C99 language requirements. |
Image processing, graphics, video processing, 2-D/3-D applications, multimedia, and gaming. |
Register Insertion/Extraction |
INSERTPS, PINSRB, PINSRD, PINSRQ, EXTRACTPS, PEXTRB, PEXTRD, PEXTRW, PEXTRQ |
These new instructions simplify data insertion and extraction between GPR (or memory) and XMM registers. |
Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming. |
Packed Format Conversion |
PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ |
Converts from a packed integer (from XMM register or memory) to a zero- or sign-extended integer with wider type. |
Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming. |
Packed Test and Set |
PTEST |
Faster branching from SIMD decisions to support conditionally vectorized code. |
Useful for improved automated compiler vectorization of data processing, image and video processing, 3-D content creation, multimedia, and gaming. |
Packed Compare for Equal |
PCMPEQQ, PCMPGTQ |
Performs SIMD compares of the packed QWORDs in the destination and source operands: PCMPEQQ tests for equality, and PCMPGTQ tests for signed greater-than. |
Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming. |
Pack DWORD to Unsigned WORD |
PACKUSDW |
Converts packed signed DWORDs into packed unsigned WORDs, using unsigned saturation to handle overflow. This new instruction completes the existing set of pack instructions. |
Broadly useful for automated compiler vectorization of data processing written in high level languages (like C and Fortran), and applications such as image processing, video processing, multimedia, and gaming. |
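As an illustration of what "unsigned saturation" means in the PACKUSDW row, here is a minimal scalar sketch in C. The names `saturate_u16` and `packusdw_ref` are hypothetical helpers introduced only for this example; the element order (four words from the destination operand first, then four from the source operand) is my reading of the instruction, not something stated in the table above.

```c
#include <stdint.h>

/* Clamp a signed 32-bit value into the unsigned 16-bit range 0..65535. */
static uint16_t saturate_u16(int32_t x)
{
    if (x < 0)     return 0;       /* negative values saturate to 0 */
    if (x > 65535) return 65535;   /* large values saturate to 0xFFFF */
    return (uint16_t)x;
}

/* Scalar reference model (hypothetical name): dst_in supplies the low
 * four result words, src supplies the high four. */
static void packusdw_ref(const int32_t dst_in[4], const int32_t src[4],
                         uint16_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[i]     = saturate_u16(dst_in[i]);
        out[i + 4] = saturate_u16(src[i]);
    }
}
```

With SSE4.1 enabled in the compiler, the same operation on real registers should be available through the `_mm_packus_epi32` intrinsic.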
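These appear to be string-comparison instructions that operate between short-vector registers. Presumably they check whether the byte sequences held in the registers given as operands match. I think instructions like this, which operate horizontally across a vector, are important. Personally, I believe that as more instructions become available for uses that do not look like typical DLP (data-level parallelism), the range of applications for SIMD will widen.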
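Computing a CRC in software is fairly tedious and tends to be slow. CRCs are commonly used to detect errors in data communication, so this instruction should have a wide range of uses. Reading the description, it seems to be aimed at error detection when transferring data in distributed-memory processor arrays.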
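To give a feel for why a software CRC is slow, here is a minimal bit-at-a-time sketch in C. It assumes the CRC-32C (Castagnoli) polynomial, which is the one the SSE4.2 CRC32 instruction implements, together with the conventional initial value and final inversion of 0xFFFFFFFF; those conventions and the name `crc32c_sw` are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise software CRC-32C using the reflected polynomial 0x82F63B78.
 * The inner loop runs eight shift/xor steps for every input byte. */
uint32_t crc32c_sw(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;                  /* conventional initial value */

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = 0u - (crc & 1u);     /* all ones if the low bit is set */
            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;                    /* conventional final inversion */
}
```

For the standard check string "123456789" this yields 0xE3069283. Table-driven variants cut the work to a few operations per byte, but a dedicated instruction can still process several bytes in a single step.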
POPCNT appears to be what is known as a population count instruction. It counts the 1 bits in a register.
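Hacker's Delight and
Bit Twiddling Hacks, among others, explain how to compute it. Bit manipulation of this kind is quite interesting. Unlike CRC, a dedicated instruction does not cut the step count dramatically here: as the links above explain, a 32-bit value can be counted in about 12 instructions at most. Even so, a population count has many applications, so having it does no harm.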
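For reference, the branch-free parallel count of the kind described in Bit Twiddling Hacks can be sketched as follows. The function name `popcount32` is chosen for this example; the technique sums adjacent bit fields of doubling width and costs roughly a dozen arithmetic and logic operations.

```c
#include <stdint.h>

/* Count the 1 bits of a 32-bit value without loops or branches. */
uint32_t popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                  /* 2-bit sums */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* 4-bit sums */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
    return (v * 0x01010101u) >> 24;                    /* add the four bytes */
}
```

On a CPU with POPCNT, GCC and Clang can usually compile `__builtin_popcount` down to the single instruction when the target enables it.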
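A population count tallies bits, but seen another way it accumulates an array of 1-bit elements. I do not know how many bits wide the SSE4 registers are, but the operation resembles, for example, summing every element of an array of 8-bit values. If population count were extended into a family of summation instructions for element widths of 1, 2, 3, 4, 5, ... bits, the instruction set would likely become more orthogonal.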
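The idea of population count as the 1-bit case of a more general field sum can be sketched in C. The function `sum_fields32` is purely hypothetical (no such instruction is claimed to exist); it treats a 32-bit word as an array of `width`-bit elements and adds them up, zero-extending a partial top field when 32 is not a multiple of `width`.

```c
#include <stdint.h>

/* Sum all `width`-bit fields of a 32-bit word.
 * width == 1 gives the population count; width == 8 gives the byte sum.
 * Returns 0 for an invalid width (0 or more than 32). */
uint32_t sum_fields32(uint32_t v, unsigned width)
{
    if (width == 0 || width > 32)
        return 0;

    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    uint32_t sum = 0;

    for (unsigned pos = 0; pos < 32; pos += width)
        sum += (v >> pos) & mask;   /* extract one field and accumulate it */

    return sum;
}
```

For example, `sum_fields32(0x01020304u, 8)` adds the four bytes 1, 2, 3, and 4, while `sum_fields32(x, 1)` is the population count of `x`.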