[Background Art]
H.264 achieves a very high data compression ratio: at equal picture quality, its compression ratio is more than 2 times that of MPEG-2 and 1.5 to 2 times that of MPEG-4. However, this high compression ratio and excellent image quality place heavy demands on video encoding performance. The DCT is one of the most time-consuming computation modules in H.264 encoding, and in a real-time encoding system there is constant pressure to increase its speed. The TMS320DM642 DSP is a high-performance digital multimedia processor released by Texas Instruments (TI); its multimedia instruction set and parallel-processing features leave considerable room for optimizing modules such as the DCT. This section studies how to optimize the DCT function at the level of DM642 instructions.
The DM642 uses TI's second-generation advanced very long instruction word architecture (VelociTI), which allows several instructions to be processed in parallel in one instruction cycle. It has four types of functional unit, .L, .M, .S and .D, each of which exists on two sides, A and B, and handles different kinds of operations, so up to eight instructions can execute per instruction cycle. At a clock frequency of 600 MHz and eight 32-bit instructions in parallel per cycle, the peak computation rate is 600 × 8 = 4800 MIPS. The DM642 uses a two-level cache structure: the first level consists of separate L1P (16 kB) and L1D (16 kB) caches, which can be used only as cache; the second level, L2 (256 kB), is a unified program/data space that can be mapped entirely into memory space as SRAM, used entirely as second-level cache, or split between the two in a chosen proportion. Using and managing memory and the functional units sensibly and to the fullest can therefore greatly improve program performance.
Digital video communication and storage have many applications, and corresponding international standards have been and continue to be developed. Low-bit-rate communication such as video telephony and conferencing, and the compression of large video files such as animation, have produced various video compression standards: H.261, H.263, MPEG-1, MPEG-2, AVS and others. These compression methods rely on the discrete cosine transform (DCT) or a similar transform, together with quantization of the transform coefficients, to reduce the number of bits needed for encoding.
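The transform formulas for the two-dimensional DCT and IDCT are given below.
Two-dimensional DCT
Let f(x, y) be an N × N discrete image sequence. The two-dimensional DCT is expressed as:
$$F(u,v) = \frac{2}{N}\, c(u)\, c(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N}, \qquad u, v = 0, 1, \ldots, N-1$$
where $c(k) = 1/\sqrt{2}$ for $k = 0$ and $c(k) = 1$ otherwise.
The two-dimensional IDCT is:
$$f(x,y) = \frac{2}{N} \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} c(u)\, c(v)\, F(u,v) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N}, \qquad x, y = 0, 1, \ldots, N-1$$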
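As a point of reference, the following is a minimal floating-point sketch of the standard orthonormal two-dimensional DCT/IDCT pair for N = 4, assuming the usual normalisation 2/N · c(u) · c(v) with c(0) = 1/√2. The names dct2d, idct2d, basis and norm are illustrative and do not come from the original text. The cosines are read from a table so that the sketch needs no math library.

```c
#define N 4  /* block size; fixed at 4 so the cosine table below suffices */

/* cos(k*pi/8) for k = 0..15. With N = 4 the argument (2x+1)*u*pi/(2N)
   is always a multiple of pi/8, and the cosine has period 16*(pi/8). */
static const double COS_PI_8[16] = {
     1.0,  0.92387953251128674,  0.70710678118654752,  0.38268343236508977,
     0.0, -0.38268343236508977, -0.70710678118654752, -0.92387953251128674,
    -1.0, -0.92387953251128674, -0.70710678118654752, -0.38268343236508977,
     0.0,  0.38268343236508977,  0.70710678118654752,  0.92387953251128674
};

/* Basis term cos((2x+1)*u*pi/(2N)) for N = 4. */
static double basis(int x, int u) { return COS_PI_8[((2 * x + 1) * u) % 16]; }

/* Normalisation factor c(k): 1/sqrt(2) for k = 0, otherwise 1. */
static double norm(int k) { return k == 0 ? 0.70710678118654752 : 1.0; }

/* Forward 2-D DCT: computes F(u,v) from f(x,y). */
void dct2d(double f[N][N], double F[N][N])
{
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++) {
            double sum = 0.0;
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    sum += f[x][y] * basis(x, u) * basis(y, v);
            F[u][v] = (2.0 / N) * norm(u) * norm(v) * sum;
        }
}

/* Inverse 2-D DCT: recovers f(x,y) from F(u,v). */
void idct2d(double F[N][N], double f[N][N])
{
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++) {
            double sum = 0.0;
            for (int u = 0; u < N; u++)
                for (int v = 0; v < N; v++)
                    sum += norm(u) * norm(v) * F[u][v] * basis(x, u) * basis(y, v);
            f[x][y] = (2.0 / N) * sum;
        }
}
```

Each output coefficient in this direct form needs N × N multiply-accumulate steps, which is why practical encoders use a separable integer approximation such as the 4x4 transform discussed next.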
As the formulas show, the DCT algorithm involves a large number of data reads and writes and a large number of additions and subtractions. A single 4x4 DCT requires 64 additions and subtractions. Data must also be accessed frequently during the computation: one 4x4 DCT loads and stores data 32 times each. Many CPU cycles are therefore spent waiting for data accesses to complete, which makes the operation inefficient:
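void dct4x4(int16_t dct[4][4], uint8_t *pix1, uint8_t *pix2)
{
    int16_t d[4][4];    /* residual block pix1 - pix2; the source listing omits how it is filled */
    int16_t tmp[4][4];  /* result of the first pass, stored transposed */
    int i;

    /* First pass: 1-D transform of each row of d, written transposed into tmp. */
    for (i = 0; i < 4; i++)
    {
        const int s03 = d[i][0] + d[i][3];
        const int s12 = d[i][1] + d[i][2];
        const int d03 = d[i][0] - d[i][3];
        const int d12 = d[i][1] - d[i][2];
        tmp[0][i] = s03 + s12;
        tmp[1][i] = 2 * d03 + d12;
        tmp[2][i] = s03 - s12;
        tmp[3][i] = d03 - 2 * d12;
    }

    /* Second pass: the same 1-D transform applied to each row of tmp. */
    for (i = 0; i < 4; i++)
    {
        const int s03 = tmp[i][0] + tmp[i][3];
        const int s12 = tmp[i][1] + tmp[i][2];
        const int d03 = tmp[i][0] - tmp[i][3];
        const int d12 = tmp[i][1] - tmp[i][2];
        dct[i][0] = s03 + s12;
        dct[i][1] = 2 * d03 + d12;
        dct[i][2] = s03 - s12;
        dct[i][3] = d03 - 2 * d12;
    }
}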
In the original DCT method above, each access reads a single 16-bit datum, and each addition or subtraction operates on 16-bit data. This makes it hard to exploit the features of the DM642 and to achieve parallel computation, and a large number of CPU cycles (processor cycles) are spent waiting for data accesses, so the DCT becomes one of the main bottlenecks in H.264 encoding.
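The two butterfly passes of the scalar listing can be checked against a plain matrix product. The sketch below assumes the usual H.264 4x4 forward core matrix (called CF here) and a residual block d that is already filled in; the function names are illustrative. Because the first pass stores its result transposed, the butterfly output equals the transpose of CF · d · CF^T.

```c
#include <stdint.h>

/* H.264 4x4 forward core transform matrix (integer approximation of the DCT). */
static const int CF[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 }
};

/* Scalar butterfly, same arithmetic as the listing above; d is the residual. */
void dct4x4_butterfly(int16_t d[4][4], int16_t dct[4][4])
{
    int16_t tmp[4][4];
    for (int i = 0; i < 4; i++) {
        const int s03 = d[i][0] + d[i][3], s12 = d[i][1] + d[i][2];
        const int d03 = d[i][0] - d[i][3], d12 = d[i][1] - d[i][2];
        tmp[0][i] = (int16_t)(s03 + s12);
        tmp[1][i] = (int16_t)(2 * d03 + d12);
        tmp[2][i] = (int16_t)(s03 - s12);
        tmp[3][i] = (int16_t)(d03 - 2 * d12);
    }
    for (int i = 0; i < 4; i++) {
        const int s03 = tmp[i][0] + tmp[i][3], s12 = tmp[i][1] + tmp[i][2];
        const int d03 = tmp[i][0] - tmp[i][3], d12 = tmp[i][1] - tmp[i][2];
        dct[i][0] = (int16_t)(s03 + s12);
        dct[i][1] = (int16_t)(2 * d03 + d12);
        dct[i][2] = (int16_t)(s03 - s12);
        dct[i][3] = (int16_t)(d03 - 2 * d12);
    }
}

/* Direct evaluation of Y = CF * d * CF^T with two matrix products. */
void dct4x4_matrix(int16_t d[4][4], int y[4][4])
{
    int t[4][4];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            t[i][j] = 0;
            for (int k = 0; k < 4; k++)
                t[i][j] += CF[i][k] * d[k][j];
        }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            y[i][j] = 0;
            for (int k = 0; k < 4; k++)
                y[i][j] += t[i][k] * CF[j][k];
        }
}
```

The butterfly needs 8 additions or subtractions per row and pass, that is 8 × 4 × 2 = 64 in total, which matches the count quoted for one 4x4 block.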
Because H.264 must encode in real time, overall encoding performance can be improved by increasing the computation speed of the DCT module.
In view of this, an improved technical solution is needed to overcome the shortcomings of the prior art.
[Embodiment]
The improvement over the existing method is to unroll the loops and rearrange the program structure. The DM642 can access 64 bits of data at a time. The unrolled program can therefore load and store 64 bits per access, which reduces the number of accesses to 1/4 of the original.
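The effect of a 64-bit access can be pictured on an ordinary host with a portable stand-in. The sketch below is an assumption-laden emulation, not DM642 code: packed_row, load_row4 and lane are hypothetical helpers, and the lane order assumes the DSP runs in little-endian mode, so that the lower-indexed element sits in the lower 16 bits.

```c
#include <stdint.h>

/* Hypothetical helper type: one row of four 16-bit values held in two
   32-bit words, as a 64-bit load split into low and high halves would give. */
typedef struct {
    uint32_t lo;  /* elements 1:0 (element 0 in the low 16 bits) */
    uint32_t hi;  /* elements 3:2 (element 2 in the low 16 bits) */
} packed_row;

/* Emulates one 64-bit load of a row, replacing four separate 16-bit loads. */
packed_row load_row4(const int16_t *row)
{
    packed_row r;
    r.lo = (uint32_t)(uint16_t)row[0] | ((uint32_t)(uint16_t)row[1] << 16);
    r.hi = (uint32_t)(uint16_t)row[2] | ((uint32_t)(uint16_t)row[3] << 16);
    return r;
}

/* Extracts signed 16-bit lane n of a packed word (0 = low half, 1 = high half). */
int16_t lane(uint32_t v, int n)
{
    return (int16_t)(uint16_t)(v >> (16 * n));
}
```

A 4x4 block of 16-bit values then needs 4 row accesses instead of 16 element accesses, which is the 1/4 ratio stated above.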
The DM642 provides a family of dot-product instructions such as DOTPSU2, which multiply and accumulate the high and low 16-bit halves of two 32-bit operands. For example, C = _dotpsu2(A, B) multiplies the high 16 bits of A by the high 16 bits of B, multiplies the low 16 bits of A by the low 16 bits of B, and assigns the sum of the two products to C. With this instruction, the high and low 16-bit halves of a 32-bit value can be summed directly, without first splitting the value and then computing.
Instructions such as DOTPSU2 operate on full 32-bit data, achieve a degree of parallelism, and save CPU cycles. However, DOTPSU2 can execute only on the .M units, so in each CPU cycle at most two such operations can run in parallel, one on side A and one on side B. This does not make full use of the eight functional units. After further restructuring and reordering of the program, a more effective method was found.
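The arithmetic performed by such a dot-product instruction can be sketched in portable C. The function dotpsu2_emul below is a hypothetical host-side emulation, assuming that the first operand holds signed and the second operand unsigned 16-bit halves; it is not the TI intrinsic itself.

```c
#include <stdint.h>

/* Host-side emulation of a packed signed-by-unsigned 16-bit dot product:
   result = hi(a) * hi(b) + lo(a) * lo(b),
   with the halves of a read as signed and the halves of b as unsigned. */
int32_t dotpsu2_emul(uint32_t a, uint32_t b)
{
    int64_t a_hi = (int16_t)(uint16_t)(a >> 16);
    int64_t a_lo = (int16_t)(uint16_t)(a & 0xFFFFu);
    int64_t b_hi = (int64_t)(b >> 16);
    int64_t b_lo = (int64_t)(b & 0xFFFFu);
    int64_t sum = a_hi * b_hi + a_lo * b_lo;
    /* Truncate to 32 bits, as a 32-bit destination register would. */
    return (int32_t)(uint32_t)sum;
}
```

Passing 0x00010001 as the second operand multiplies both halves by 1, so the result is simply the sum of the high and low halves of the first operand.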
This method is based on two very simple instructions, ADD2 and SUB2. Unlike ordinary addition and subtraction, they treat a 32-bit operand as separate high and low 16-bit halves and add or subtract each half independently. For example, C = _add2(A, B) adds the high 16 bits of A and B and stores the result in the high 16 bits of C, and adds the low 16 bits of A and B and stores the result in the low 16 bits of C. Their clear advantage over DOTPSU2 is that they are not confined to the .M units: they can be issued on the .L, .S and .D units of both sides, so several of them can execute in the same cycle. This raises the degree of parallelism of the program. Using ADD2 and SUB2 throughout, with no DOTPSU2 at all, therefore greatly reduces the CPU cycle count and noticeably increases the DCT computation speed.
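The lane-wise behaviour of these two instructions can be sketched in portable C. The functions add2_emul, sub2_emul and pack16 are hypothetical host-side helpers written for illustration; each half wraps around independently and no carry crosses from the low half into the high half.

```c
#include <stdint.h>

/* Packs two signed 16-bit values into one 32-bit word (hi in bits 31..16). */
uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Emulates a packed 16-bit addition: both halves are added independently. */
uint32_t add2_emul(uint32_t a, uint32_t b)
{
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    uint32_t lo = ((a & 0xFFFFu) + (b & 0xFFFFu)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Emulates a packed 16-bit subtraction: both halves are subtracted independently. */
uint32_t sub2_emul(uint32_t a, uint32_t b)
{
    uint32_t hi = ((a >> 16) - (b >> 16)) & 0xFFFFu;
    uint32_t lo = ((a & 0xFFFFu) - (b & 0xFFFFu)) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

One call therefore performs two of the 16-bit additions or subtractions of the scalar butterfly at once, for example the sums for two neighbouring columns of a row.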
The optimized program for the DM642 is given below. The main structural change is a matrix transposition, which needs only a few packing instructions of the PACK2 family.
void dct4x4_opt(uint8_t *p_dst, int16_t dct[4][4])
{
    int tmp0_10, tmp0_32, tmp1_10, tmp1_32;
    int tmp2_10, tmp2_32, tmp3_10, tmp3_32;
    int pack0_10, pack1_10, pack0_32, pack1_32;
    int pack2_10, pack3_10, pack2_32, pack3_32;
    int pack3_10_r, pack3_32_r, pack1_10_r, pack1_32_r;
    int d0_10, d0_32, d1_10, d1_32, d2_10, d2_32, d3_10, d3_32;

    /* Load each row of four 16-bit coefficients with 64-bit accesses:
       _10 holds elements 1:0, _32 holds elements 3:2. */
    int dct0_32 = _hi(_amemd8((void *)dct[0]));
    int dct0_10 = _lo(_amemd8((void *)dct[0]));
    int dct1_32 = _hi(_amemd8((void *)dct[1]));
    int dct1_10 = _lo(_amemd8((void *)dct[1]));
    int dct2_32 = _hi(_amemd8((void *)dct[2]));
    int dct2_10 = _lo(_amemd8((void *)dct[2]));
    int dct3_32 = _hi(_amemd8((void *)dct[3]));
    int dct3_10 = _lo(_amemd8((void *)dct[3]));

    /* Halved rows 3 and 1 (packed 16-bit right shift by 1). */
    int h_dct3_32 = _shr2(dct3_32, 1);
    int h_dct3_10 = _shr2(dct3_10, 1);
    int h_dct1_32 = _shr2(dct1_32, 1);
    int h_dct1_10 = _shr2(dct1_10, 1);

    /* First pass: packed sums and differences across rows. */
    int add20_10 = _add2(dct0_10, dct2_10);
    int add20_32 = _add2(dct0_32, dct2_32);
    int sub20_10 = _sub2(dct0_10, dct2_10);
    int sub20_32 = _sub2(dct0_32, dct2_32);
    int add31_10 = _add2(h_dct3_10, dct1_10);
    int add31_32 = _add2(h_dct3_32, dct1_32);
    int sub13_10 = _sub2(h_dct1_10, dct3_10);
    int sub13_32 = _sub2(h_dct1_32, dct3_32);

    /* Rounding constant 32 in both 16-bit halves. */
    int round = 32 | (32 << 16);

    tmp0_10 = _add2(add20_10, add31_10);
    tmp0_32 = _add2(add20_32, add31_32);
    tmp1_10 = _add2(sub20_10, sub13_10);
    tmp1_32 = _add2(sub20_32, sub13_32);
    tmp2_10 = _sub2(sub20_10, sub13_10);
    tmp2_32 = _sub2(sub20_32, sub13_32);
    tmp3_10 = _sub2(add20_10, add31_10);
    tmp3_32 = _sub2(add20_32, add31_32);

    /* Transpose the 4x4 block with packing instructions only. */
    pack0_10 = _pack2(tmp1_10, tmp0_10);
    pack1_10 = _packh2(tmp1_10, tmp0_10);
    pack0_32 = _pack2(tmp3_10, tmp2_10);
    pack1_32 = _packh2(tmp3_10, tmp2_10);
    pack2_10 = _pack2(tmp1_32, tmp0_32);
    pack3_10 = _packh2(tmp1_32, tmp0_32);
    pack2_32 = _pack2(tmp3_32, tmp2_32);
    pack3_32 = _packh2(tmp3_32, tmp2_32);

    /* Second pass on the transposed data. */
    pack3_10_r = _shr2(pack3_10, 1);
    pack3_32_r = _shr2(pack3_32, 1);
    pack1_10_r = _shr2(pack1_10, 1);
    pack1_32_r = _shr2(pack1_32, 1);
    add20_10 = _add2(pack2_10, pack0_10);
    add20_32 = _add2(pack2_32, pack0_32);
    add31_10 = _add2(pack3_10_r, pack1_10);
    add31_32 = _add2(pack3_32_r, pack1_32);
    sub20_10 = _sub2(pack0_10, pack2_10);
    sub20_32 = _sub2(pack0_32, pack2_32);
    sub13_10 = _sub2(pack1_10_r, pack3_10);
    sub13_32 = _sub2(pack1_32_r, pack3_32);

    /* Add the rounding constant and shift right by 6. */
    d0_10 = _shr2(_add2(_add2(add20_10, add31_10), round), 6);
    d0_32 = _shr2(_add2(_add2(add20_32, add31_32), round), 6);
    d1_10 = _shr2(_add2(_add2(sub20_10, sub13_10), round), 6);
    d1_32 = _shr2(_add2(_add2(sub20_32, sub13_32), round), 6);
    d2_10 = _shr2(_add2(_sub2(sub20_10, sub13_10), round), 6);
    d2_32 = _shr2(_add2(_sub2(sub20_32, sub13_32), round), 6);
    d3_10 = _shr2(_add2(_sub2(add20_10, add31_10), round), 6);
    d3_32 = _shr2(_add2(_sub2(add20_32, add31_32), round), 6);

    /* Add each result row to the four destination pixels of that row and
       store them back as saturated unsigned bytes with one 32-bit access. */
    {
        int dst32 = _amem4((void *)p_dst);
        int p_dst10, p_dst32;
        p_dst10 = _add2(d0_10, _unpklu4(dst32));
        p_dst32 = _add2(d0_32, _unpkhu4(dst32));
        _amem4((void *)p_dst) = _spacku4(p_dst32, p_dst10);
        p_dst += FDEC_STRIDE;
        dst32 = _amem4((void *)p_dst);
        p_dst10 = _add2(d1_10, _unpklu4(dst32));
        p_dst32 = _add2(d1_32, _unpkhu4(dst32));
        _amem4((void *)p_dst) = _spacku4(p_dst32, p_dst10);
        p_dst += FDEC_STRIDE;
        dst32 = _amem4((void *)p_dst);
        p_dst10 = _add2(d2_10, _unpklu4(dst32));
        p_dst32 = _add2(d2_32, _unpkhu4(dst32));
        _amem4((void *)p_dst) = _spacku4(p_dst32, p_dst10);
        p_dst += FDEC_STRIDE;
        dst32 = _amem4((void *)p_dst);
        p_dst10 = _add2(d3_10, _unpklu4(dst32));
        p_dst32 = _add2(d3_32, _unpkhu4(dst32));
        _amem4((void *)p_dst) = _spacku4(p_dst32, p_dst10);
        p_dst += FDEC_STRIDE;
    }
}
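The transposition step of the listing can be illustrated on a host with portable stand-ins for the two packing operations it uses. The sketch below assumes that the first packing operation combines the low halves of its operands and the second combines the high halves, with the first operand supplying the upper half of the result; pack2_emul, packh2_emul and transpose4x4_packed are hypothetical names, and the call order mirrors the eight packing lines of the listing.

```c
#include <stdint.h>

/* Low half of a goes to the high half of the result, low half of b to the low half. */
uint32_t pack2_emul(uint32_t a, uint32_t b)
{
    return (a << 16) | (b & 0xFFFFu);
}

/* High half of a goes to the high half of the result, high half of b to the low half. */
uint32_t packh2_emul(uint32_t a, uint32_t b)
{
    return (a & 0xFFFF0000u) | (b >> 16);
}

/* Transposes a 4x4 block of 16-bit lanes.
   Row i is held as r[i][0] (elements 1:0) and r[i][1] (elements 3:2),
   with the lower-indexed element in the low 16 bits. */
void transpose4x4_packed(uint32_t r[4][2], uint32_t t[4][2])
{
    t[0][0] = pack2_emul (r[1][0], r[0][0]);  /* column 0, rows 1:0 */
    t[1][0] = packh2_emul(r[1][0], r[0][0]);  /* column 1, rows 1:0 */
    t[0][1] = pack2_emul (r[3][0], r[2][0]);  /* column 0, rows 3:2 */
    t[1][1] = packh2_emul(r[3][0], r[2][0]);  /* column 1, rows 3:2 */
    t[2][0] = pack2_emul (r[1][1], r[0][1]);  /* column 2, rows 1:0 */
    t[3][0] = packh2_emul(r[1][1], r[0][1]);  /* column 3, rows 1:0 */
    t[2][1] = pack2_emul (r[3][1], r[2][1]);  /* column 2, rows 3:2 */
    t[3][1] = packh2_emul(r[3][1], r[2][1]);  /* column 3, rows 3:2 */
}
```

Eight packing operations are enough for the whole 4x4 block, and no value has to be written to memory and read back between the two passes.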
After optimization, the CPU cycle count of the DCT computation falls from the original 176 to 56, a speedup of more than 3 times (176 / 56 ≈ 3.1), which is a significant improvement.
The above embodiments describe the present invention only by way of example; those skilled in the art can devise various embodiments to suit different practical needs without departing from the scope and spirit of the protection sought by the present invention.