CN101610418A - A kind of optimization method of the two-dimensional dct based on DM642 - Google Patents

A kind of optimization method of the two-dimensional dct based on DM642
Info

Publication number
CN101610418A
Authority
CN
China
Prior art keywords
add2
instruction
sub2
int
optimization method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200910108686
Other languages
Chinese (zh)
Inventor
喻军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN RONGCHUANG TIANXIA TECHNOLOGY DEVELOPMENT Co Ltd
Shenzhen Temobi Science and Technology Co Ltd
Original Assignee
SHENZHEN RONGCHUANG TIANXIA TECHNOLOGY DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN RONGCHUANG TIANXIA TECHNOLOGY DEVELOPMENT Co Ltd
Priority to CN 200910108686
Publication of CN101610418A
Legal status: Pending

Abstract

The present invention relates to a method for optimizing the two-dimensional DCT on the DM642, a digital signal processor built on the second-generation advanced very long instruction word architecture (VelociTI). The method relies on two instructions, ADD2 and SUB2, which operate on 32-bit data by adding or subtracting the upper and lower 16-bit halves separately. With the ADD2 instruction, C=_add2(A, B) adds the upper 16 bits of A and B and stores the result in the upper 16 bits of C, and adds the lower 16 bits of A and B and stores the result in the lower 16 bits of C. With the SUB2 instruction, C=_sub2(A, B) subtracts the upper 16 bits of B from the upper 16 bits of A and stores the result in the upper 16 bits of C, and subtracts the lower 16 bits of B from the lower 16 bits of A and stores the result in the lower 16 bits of C. Speeding up the DCT computation in this way improves overall encoding performance.

Description

A kind of optimization method of the two-dimensional dct based on DM642
[technical field]
The present invention relates to digital image and video signal processing, and more particularly to a method of optimizing the DCT function at the instruction level of the DM642.
[background technology]
H.264 has a very high data compression ratio: at equal picture quality, the compression ratio of H.264 is more than 2 times that of MPEG-2 and 1.5 to 2 times that of MPEG-4. However, the high compression ratio and excellent image quality place heavy demands on video encoding capability. The DCT is one of the most time-consuming computing modules in H.264 encoding, and in a real-time encoding system there is a constant demand to make it faster. The TMS320DM642 DSP is a high-performance digital multimedia processor released by TI (Texas Instruments); its multimedia instruction set and parallel processing features leave considerable room for optimizing a module such as the DCT. This document focuses on methods for optimizing the DCT function at the instruction level of the DM642.
The DM642 uses TI's second-generation advanced very long instruction word architecture (VelociTI), which allows several instructions to execute in parallel within one instruction cycle. It has four kinds of functional unit, .L, .M, .S and .D, each of which exists on two sides, A and B, and handles different types of operations, so up to eight instructions can run in each instruction cycle. It can operate at a clock frequency of 600 MHz; with eight 32-bit instructions executing in parallel per instruction cycle, it reaches a peak rate of 4800 MIPS. The DM642 uses a two-level cache structure. The first level consists of separate L1P (16 kB) and L1D (16 kB) memories, which can only be used as cache. The second level, L2 (256 kB), is a unified program/data space that can be mapped entirely into the memory space as SRAM, used entirely as a second-level cache, or split between the two in some proportion. Using the memory and the functional units sensibly and to the fullest can therefore improve program performance substantially.
Digital video communication and storage have many applications, and corresponding international standards have been developed and continue to be developed. Low-bit-rate communication such as video telephony and conferencing, and the compression of large video files such as motion pictures, have produced various video compression standards: H.261, H.263, MPEG-1, MPEG-2, AVS and others. These compression methods rely on the discrete cosine transform (DCT) or a similar transform, together with quantization of the transform coefficients, to reduce the number of bits needed for encoding.
The transformation formulas of the two-dimensional DCT and IDCT are given below.
Two-dimensional DCT
Let f(x,y) be an N×N discrete image sequence. The two-dimensional DCT is expressed as:
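$$F(u,v)=\frac{2}{N}\,c(u)\,c(v)\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y)\cos\left[\frac{\pi}{2N}(2x+1)u\right]\cos\left[\frac{\pi}{2N}(2y+1)v\right],\qquad u,v=0,1,\ldots,N-1$$
The two-dimensional IDCT is:
$$f(x,y)=\frac{2}{N}\sum_{u=0}^{N-1}\sum_{v=0}^{N-1} c(u)\,c(v)\,F(u,v)\cos\left[\frac{\pi}{2N}(2x+1)u\right]\cos\left[\frac{\pi}{2N}(2y+1)v\right],\qquad x,y=0,1,\ldots,N-1$$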
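As a reference point for the forward formula, here is a minimal C sketch that evaluates F(u,v) directly for N = 4. It is not taken from the patent. The function names, the lookup table and the normalisation c(0) = 1/√2, c(k) = 1 for k > 0 are assumptions (the text does not define c), chosen to match the usual DCT-II convention.

```c
/* Cosine table for N = 4: COS_PI_8[k] = cos(k*pi/8) for k = 0..15 (one full period). */
static const double COS_PI_8[16] = {
     1.0,  0.9238795325112867,  0.7071067811865476,  0.3826834323650898,
     0.0, -0.3826834323650898, -0.7071067811865476, -0.9238795325112867,
    -1.0, -0.9238795325112867, -0.7071067811865476, -0.3826834323650898,
     0.0,  0.3826834323650898,  0.7071067811865476,  0.9238795325112867
};

/* Assumed normalisation factor: c(0) = 1/sqrt(2), c(k) = 1 otherwise. */
static double dct_c(int k)
{
    return k == 0 ? 0.7071067811865476 : 1.0;
}

/* Direct evaluation of F(u,v) for a 4x4 block, term by term from the formula. */
void dct2d_4x4_reference(double f[4][4], double F[4][4])
{
    for (int u = 0; u < 4; u++) {
        for (int v = 0; v < 4; v++) {
            double sum = 0.0;
            for (int x = 0; x < 4; x++) {
                for (int y = 0; y < 4; y++) {
                    /* pi/(2N) = pi/8 for N = 4, so the table index is (2x+1)u mod 16 */
                    sum += f[x][y]
                         * COS_PI_8[((2 * x + 1) * u) % 16]
                         * COS_PI_8[((2 * y + 1) * v) % 16];
                }
            }
            F[u][v] = (2.0 / 4.0) * dct_c(u) * dct_c(v) * sum;
        }
    }
}
```

Under these assumptions a constant block of ones yields F(0,0) = 4 and zero for every other coefficient, which is a quick sanity check. The direct form costs 16 multiply-accumulate steps per coefficient, which is why practical encoders use separable integer butterflies instead.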
As the formulas show, the DCT algorithm involves a large number of data reads and writes and a large number of additions and subtractions. A single 4x4 DCT requires 64 additions and subtractions. Computing the DCT also requires frequent data access: one 4x4 DCT performs 32 loads and 32 stores. Many CPU cycles are therefore spent waiting for data accesses to complete, which makes the operation inefficient:
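void dct4x4(int16_t dct[4][4], uint8_t *pix1, uint8_t *pix2)
{
    /* d holds the 4x4 input block. The source does not show how it is filled
       from pix1 and pix2 (presumably the pixel difference); that step is omitted here. */
    int16_t d[4][4];
    int16_t tmp[4][4];
    int i;

    /* First pass: 1-D transform of each row of d, stored transposed in tmp. */
    for (i = 0; i < 4; i++)
    {
        const int s03 = d[i][0] + d[i][3];
        const int s12 = d[i][1] + d[i][2];
        const int d03 = d[i][0] - d[i][3];
        const int d12 = d[i][1] - d[i][2];
        tmp[0][i] = s03 + s12;
        tmp[1][i] = 2 * d03 + d12;
        tmp[2][i] = s03 - s12;
        tmp[3][i] = d03 - 2 * d12;
    }
    /* Second pass: 1-D transform of each row of tmp. */
    for (i = 0; i < 4; i++)
    {
        const int s03 = tmp[i][0] + tmp[i][3];
        const int s12 = tmp[i][1] + tmp[i][2];
        const int d03 = tmp[i][0] - tmp[i][3];
        const int d12 = tmp[i][1] - tmp[i][2];
        dct[i][0] = s03 + s12;
        dct[i][1] = 2 * d03 + d12;
        dct[i][2] = s03 - s12;
        dct[i][3] = d03 - 2 * d12;
    }
}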
With the original DCT computation above, each access reads 16 bits of data and each addition or subtraction operates on 16-bit data. This makes it hard to exploit the features of the DM642 and hard to compute in parallel, and a large number of CPU cycles (processor cycles) are spent waiting for data accesses, so the DCT becomes one of the main bottlenecks in H.264 encoding.
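To make the structure of the baseline routine easier to see, the sketch below places the two butterfly passes next to an equivalent matrix formulation in portable C. The matrix CF is read off the coefficients in the listing. The function names are illustrative and not part of the patent. With the loop order used in the listing, the result works out to CF·dᵀ·CFᵀ, that is, the transpose of the more familiar CF·d·CFᵀ.

```c
#include <stdint.h>

/* Coefficient matrix implied by the butterflies in the listing. */
static const int CF[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 }
};

/* Butterfly form: the same arithmetic as the two loops of the baseline routine. */
void dct4x4_butterfly(int16_t d[4][4], int16_t dct[4][4])
{
    int16_t tmp[4][4];
    int i;
    for (i = 0; i < 4; i++) {
        const int s03 = d[i][0] + d[i][3];
        const int s12 = d[i][1] + d[i][2];
        const int d03 = d[i][0] - d[i][3];
        const int d12 = d[i][1] - d[i][2];
        tmp[0][i] = (int16_t)(s03 + s12);
        tmp[1][i] = (int16_t)(2 * d03 + d12);
        tmp[2][i] = (int16_t)(s03 - s12);
        tmp[3][i] = (int16_t)(d03 - 2 * d12);
    }
    for (i = 0; i < 4; i++) {
        const int s03 = tmp[i][0] + tmp[i][3];
        const int s12 = tmp[i][1] + tmp[i][2];
        const int d03 = tmp[i][0] - tmp[i][3];
        const int d12 = tmp[i][1] - tmp[i][2];
        dct[i][0] = (int16_t)(s03 + s12);
        dct[i][1] = (int16_t)(2 * d03 + d12);
        dct[i][2] = (int16_t)(s03 - s12);
        dct[i][3] = (int16_t)(d03 - 2 * d12);
    }
}

/* Matrix form: out = CF * d^T * CF^T, written as two plain matrix products. */
void dct4x4_matrix(int16_t d[4][4], int16_t out[4][4])
{
    int t[4][4];
    int i, j, k, m;
    for (i = 0; i < 4; i++) {
        for (m = 0; m < 4; m++) {
            t[i][m] = 0;
            for (j = 0; j < 4; j++)
                t[i][m] += CF[i][j] * d[m][j];   /* (CF * d^T)[i][m] */
        }
    }
    for (i = 0; i < 4; i++) {
        for (k = 0; k < 4; k++) {
            int acc = 0;
            for (m = 0; m < 4; m++)
                acc += t[i][m] * CF[k][m];       /* (t * CF^T)[i][k] */
            out[i][k] = (int16_t)acc;
        }
    }
}
```

Both functions should agree on any input whose results fit in 16 bits. The butterfly form needs only additions, subtractions and doublings, which is what makes it a natural candidate for packed 16-bit instructions.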
Because H.264 requires real-time encoding, overall encoding performance can be improved by speeding up the DCT module.
In view of this, an improved technical solution is needed to overcome the shortcomings of the prior art.
[summary of the invention]
The invention provides a method for optimizing the two-dimensional DCT on the DM642, the DM642 being a digital signal processor with the second-generation advanced very long instruction word architecture (VelociTI), characterized in that it uses two instructions, ADD2 and SUB2, which operate on 32-bit data by adding or subtracting the upper and lower 16-bit halves separately. With the ADD2 instruction, C=_add2(A, B) adds the upper 16 bits of A and B and stores the result in the upper 16 bits of C, and adds the lower 16 bits of A and B and stores the result in the lower 16 bits of C. With the SUB2 instruction, C=_sub2(A, B) subtracts the upper 16 bits of B from the upper 16 bits of A and stores the result in the upper 16 bits of C, and subtracts the lower 16 bits of B from the lower 16 bits of A and stores the result in the lower 16 bits of C.
Preferably, the DM642 digital signal processor can execute several instructions in parallel within one instruction cycle. It has four kinds of functional unit, .L, .M, .S and .D, each of which exists on two sides, A and B, and handles different types of operations, so up to eight instructions can run in each instruction cycle.
Preferably, the ADD2 and SUB2 instructions are executed in parallel on the eight functional units .M, .L, .S and .D.
Preferably, the change to the program structure consists mainly of transposing the matrix, and the transposition is performed with the SPACK2 instruction.
Compared with the prior art, the method reduces the CPU cycles of the DCT computation from 176 to 56 after optimization, a speedup of more than 3 times, which is a significant improvement.
[description of drawings]
No accompanying drawing.
[embodiment]
The improvement over the existing method is to unroll the loops and rearrange the program structure. The DM642 can access 64 bits of data at a time. The unrolled program can therefore load and store 64 bits in a single access, which reduces the number of memory accesses to 1/4 of the original.
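The effect of one 64-bit access can be pictured in portable C as follows. This is only an illustration: the type `pair32_t` and the function `load4x16` are hypothetical names, and the layout assumes little-endian packing, with element 0 in the low 16 bits of the low word.

```c
#include <stdint.h>

/* Two 32-bit words, each carrying two packed 16-bit coefficients. */
typedef struct {
    uint32_t lo;   /* elements 1 (upper half) and 0 (lower half) */
    uint32_t hi;   /* elements 3 (upper half) and 2 (lower half) */
} pair32_t;

/* Gather one row of four 16-bit coefficients into two packed 32-bit words,
   mimicking what a single aligned 64-bit load delivers on a little-endian target. */
static pair32_t load4x16(const int16_t row[4])
{
    pair32_t p;
    p.lo = (uint32_t)(uint16_t)row[0] | ((uint32_t)(uint16_t)row[1] << 16);
    p.hi = (uint32_t)(uint16_t)row[2] | ((uint32_t)(uint16_t)row[3] << 16);
    return p;
}
```

For the row {1, 2, 3, -1} this gives lo = 0x00020001 and hi = 0xFFFF0003. One access thus replaces four separate 16-bit reads, which is where the factor of 1/4 comes from.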
The DM642 has a family of dot-product instructions such as DOTPSU2, which computes a multiply-add over the upper and lower 16-bit halves of two 32-bit values. For example, C=_dotpsu2(A, B) multiplies the upper 16 bits of A and B, multiplies the lower 16 bits of A and B, and assigns the sum of the two products to C. With this instruction, the upper and lower 16-bit halves of a 32-bit value can be summed without first splitting the value and then computing.
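The dot-product behaviour described above can be emulated in portable C roughly as shown below. The name `dotpsu2_emul` is hypothetical. Treating the halves of the first operand as signed and those of the second as unsigned is an assumption based on the instruction name, so the TI instruction set reference should be consulted for the exact semantics.

```c
#include <stdint.h>

/* Emulates a packed 16-bit dot product: hi(a)*hi(b) + lo(a)*lo(b).
   Assumption: halves of a are signed, halves of b are unsigned.
   The sum is returned as int64_t so this sketch cannot overflow;
   the hardware result is a 32-bit value. */
static int64_t dotpsu2_emul(uint32_t a, uint32_t b)
{
    const int16_t  a_hi = (int16_t)(uint16_t)(a >> 16);
    const int16_t  a_lo = (int16_t)(uint16_t)(a & 0xFFFFu);
    const uint16_t b_hi = (uint16_t)(b >> 16);
    const uint16_t b_lo = (uint16_t)(b & 0xFFFFu);
    return (int64_t)a_hi * b_hi + (int64_t)a_lo * b_lo;
}
```

Passing 0x00010001 as the second operand multiplies each half by 1, so the result is simply the sum of the two halves of the first operand. For instance, the halves -7 and 10 give 3.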
Instructions such as DOTPSU2 operate on full 32-bit data and achieve some parallelism, which saves CPU cycles. However, DOTPSU2 executes only on the CPU's .M units, so each CPU cycle allows at most two such operations in parallel, one on side A and one on side B. This does not exploit all eight functional units at once. After further optimizing and reordering the program structure, we found a more effective method.
This method is based on two very simple instructions, ADD2 and SUB2. Unlike ordinary addition and subtraction, they operate on 32-bit data by adding or subtracting the upper and lower 16-bit halves separately. For example, C=_add2(A, B) adds the upper 16 bits of A and B and stores the result in the upper 16 bits of C, and adds the lower 16 bits of A and B and stores the result in the lower 16 bits of C. Compared with DOTPSU2, these instructions have one clear advantage: they are not restricted to a particular functional unit, that is, they can execute simultaneously on the CPU's eight functional units .M, .L, .S and .D. This helps raise the degree of parallelism of the program. Using ADD2 and SUB2 and dropping DOTPSU2 entirely therefore greatly reduces the CPU cycle count and noticeably speeds up the DCT computation.
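For readers without a TI toolchain, the packed semantics of the two instructions can be sketched in portable C. The names `add2_emul` and `sub2_emul` are hypothetical stand-ins for the `_add2` and `_sub2` intrinsics. The sketch assumes that each 16-bit half wraps around independently, with no carry or borrow crossing between halves and no saturation.

```c
#include <stdint.h>

/* Packed add: each 16-bit half of a is added to the matching half of b.
   A carry out of the lower half is discarded rather than passed upward. */
static uint32_t add2_emul(uint32_t a, uint32_t b)
{
    const uint32_t lo = (a + b) & 0xFFFFu;
    const uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* Packed subtract: each 16-bit half of b is subtracted from the matching half of a.
   A borrow in the lower half does not affect the upper half. */
static uint32_t sub2_emul(uint32_t a, uint32_t b)
{
    const uint32_t lo = (a - b) & 0xFFFFu;
    const uint32_t hi = ((a >> 16) - (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

The difference from a plain 32-bit add shows up when the lower half overflows: 0x0001FFFF + 0x00010001 is 0x00030000 as a 32-bit sum, whereas the packed add gives 0x00020000 because the carry is dropped. One packed operation thus performs two independent 16-bit butterflies.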
Below is the optimized program for the DM642. The change to the program structure consists mainly of transposing the matrix, and the transposition uses only a few packing instructions (SPACK2 in the original text; the listing uses the _pack2 and _packh2 intrinsics).
/* Named 4x4dct in the source; renamed because a C identifier cannot start with a digit. */
void dct4x4_dm642(uint8_t *p_dst, int16_t dct[4][4])
{
    int16_t d[4][4];
    int16_t tmp[4][4];
    int x, y;
    int i;
    int tmp0_10, tmp0_32, tmp1_10, tmp1_32;
    int tmp2_10, tmp2_32, tmp3_10, tmp3_32;
    int pack0_10, pack1_10, pack0_32, pack1_32;
    int pack2_10, pack3_10, pack2_32, pack3_32;
    int pack3_10_r, pack3_32_r, pack1_10_r, pack1_32_r;
    int d0_10, d0_32, d1_10, d1_32, d2_10, d2_32, d3_10, d3_32;

    /* Load each row with one aligned 64-bit access: _10 holds elements 1 and 0, _32 holds elements 3 and 2. */
    int dct0_32 = _hi(_amemd8((void*)dct[0]));
    int dct0_10 = _lo(_amemd8((void*)dct[0]));
    int dct1_32 = _hi(_amemd8((void*)dct[1]));
    int dct1_10 = _lo(_amemd8((void*)dct[1]));
    int dct2_32 = _hi(_amemd8((void*)dct[2]));
    int dct2_10 = _lo(_amemd8((void*)dct[2]));
    int dct3_32 = _hi(_amemd8((void*)dct[3]));
    int dct3_10 = _lo(_amemd8((void*)dct[3]));

    /* First 1-D pass: packed 16-bit shifts, additions and subtractions. */
    int h_dct3_32 = _shr2(dct3_32, 1);
    int h_dct3_10 = _shr2(dct3_10, 1);
    int h_dct1_32 = _shr2(dct1_32, 1);
    int h_dct1_10 = _shr2(dct1_10, 1);
    int add20_10 = _add2(dct0_10, dct2_10);
    int add20_32 = _add2(dct0_32, dct2_32);
    int sub20_10 = _sub2(dct0_10, dct2_10);
    int sub20_32 = _sub2(dct0_32, dct2_32);
    int add31_10 = _add2(h_dct3_10, dct1_10);
    int add31_32 = _add2(h_dct3_32, dct1_32);
    int sub13_10 = _sub2(h_dct1_10, dct3_10);
    int sub13_32 = _sub2(h_dct1_32, dct3_32);
    int round = 32 | (32 << 16);   /* rounding constant 32 in both 16-bit halves */

    tmp0_10 = _add2(add20_10, add31_10);
    tmp0_32 = _add2(add20_32, add31_32);
    tmp1_10 = _add2(sub20_10, sub13_10);
    tmp1_32 = _add2(sub20_32, sub13_32);
    tmp2_10 = _sub2(sub20_10, sub13_10);
    tmp2_32 = _sub2(sub20_32, sub13_32);
    tmp3_10 = _sub2(add20_10, add31_10);
    tmp3_32 = _sub2(add20_32, add31_32);

    /* Transpose: regroup the 16-bit halves so that each word pair holds one column. */
    pack0_10 = _pack2(tmp1_10, tmp0_10);
    pack1_10 = _packh2(tmp1_10, tmp0_10);
    pack0_32 = _pack2(tmp3_10, tmp2_10);
    pack1_32 = _packh2(tmp3_10, tmp2_10);
    pack2_10 = _pack2(tmp1_32, tmp0_32);
    pack3_10 = _packh2(tmp1_32, tmp0_32);
    pack2_32 = _pack2(tmp3_32, tmp2_32);
    pack3_32 = _packh2(tmp3_32, tmp2_32);

    /* Second 1-D pass on the transposed data. */
    pack3_10_r = _shr2(pack3_10, 1);
    pack3_32_r = _shr2(pack3_32, 1);
    pack1_10_r = _shr2(pack1_10, 1);
    pack1_32_r = _shr2(pack1_32, 1);
    add20_10 = _add2(pack2_10, pack0_10);
    add20_32 = _add2(pack2_32, pack0_32);
    add31_10 = _add2(pack3_10_r, pack1_10);
    add31_32 = _add2(pack3_32_r, pack1_32);
    sub20_10 = _sub2(pack0_10, pack2_10);
    sub20_32 = _sub2(pack0_32, pack2_32);
    sub13_10 = _sub2(pack1_10_r, pack3_10);
    sub13_32 = _sub2(pack1_32_r, pack3_32);

    /* Add the rounding constant and shift each 16-bit half right by 6. */
    d0_10 = _shr2(_add2(_add2(add20_10, add31_10), round), 6);
    d0_32 = _shr2(_add2(_add2(add20_32, add31_32), round), 6);
    d1_10 = _shr2(_add2(_add2(sub20_10, sub13_10), round), 6);
    d1_32 = _shr2(_add2(_add2(sub20_32, sub13_32), round), 6);
    d2_10 = _shr2(_add2(_sub2(sub20_10, sub13_10), round), 6);
    d2_32 = _shr2(_add2(_sub2(sub20_32, sub13_32), round), 6);
    d3_10 = _shr2(_add2(_sub2(add20_10, add31_10), round), 6);
    d3_32 = _shr2(_add2(_sub2(add20_32, add31_32), round), 6);
    {
        /* Add each result row to four 8-bit destination pixels and pack back with
           unsigned saturation. FDEC_STRIDE is the destination row stride, defined
           elsewhere (not shown in the source). */
        int dst32 = _amem4((void*)p_dst);
        int p_dst10, p_dst32;
        p_dst10 = _add2(d0_10, _unpklu4(dst32));
        p_dst32 = _add2(d0_32, _unpkhu4(dst32));
        _amem4((void*)p_dst) = _spacku4(p_dst32, p_dst10);
        p_dst += FDEC_STRIDE;
        dst32 = _amem4((void*)p_dst);
        p_dst10 = _add2(d1_10, _unpklu4(dst32));
        p_dst32 = _add2(d1_32, _unpkhu4(dst32));
        _amem4((void*)p_dst) = _spacku4(p_dst32, p_dst10);
        p_dst += FDEC_STRIDE;
        dst32 = _amem4((void*)p_dst);
        p_dst10 = _add2(d2_10, _unpklu4(dst32));
        p_dst32 = _add2(d2_32, _unpkhu4(dst32));
        _amem4((void*)p_dst) = _spacku4(p_dst32, p_dst10);
        p_dst += FDEC_STRIDE;
        dst32 = _amem4((void*)p_dst);
        p_dst10 = _add2(d3_10, _unpklu4(dst32));
        p_dst32 = _add2(d3_32, _unpkhu4(dst32));
        _amem4((void*)p_dst) = _spacku4(p_dst32, p_dst10);
        p_dst += FDEC_STRIDE;
    }
}
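The transposition step in the listing can be illustrated in portable C. The functions `pack2_emul` and `packh2_emul` are hypothetical stand-ins for the `_pack2` and `_packh2` intrinsics. The sketch assumes that `_pack2` combines the lower halves of its two operands and `_packh2` the upper halves, with the first operand supplying the upper half of the result.

```c
#include <stdint.h>

/* Lower half of a becomes the upper half of the result, lower half of b the lower half. */
static uint32_t pack2_emul(uint32_t a, uint32_t b)
{
    return ((a & 0xFFFFu) << 16) | (b & 0xFFFFu);
}

/* Upper half of a stays the upper half of the result, upper half of b becomes the lower half. */
static uint32_t packh2_emul(uint32_t a, uint32_t b)
{
    return (a & 0xFFFF0000u) | (b >> 16);
}
```

Take two packed rows r0 = 0x00020001 (elements 1 and 2) and r1 = 0x00040003 (elements 3 and 4). Then pack2_emul(r1, r0) gives 0x00030001, the first column, and packh2_emul(r1, r0) gives 0x00040002, the second column. Applying this to each 2x2 sub-block is how the listing transposes the 4x4 matrix without any per-element loads or stores.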
After optimization, the CPU cycles of the DCT computation fall from 176 to 56, an efficiency gain of more than 3 times, which is a significant improvement.
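Using only the two cycle counts quoted in the text, the gain works out as:

```latex
\text{speedup} = \frac{176}{56} \approx 3.14,
\qquad
\text{cycle reduction} = 1 - \frac{56}{176} \approx 68\%
```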
The embodiments above describe the invention by way of example only. Those skilled in the art may devise various implementations according to actual needs without departing from the scope and spirit of the invention as claimed.

Claims (4)

1. A method for optimizing the two-dimensional DCT on the DM642, the DM642 being a digital signal processor with the second-generation advanced very long instruction word architecture (VelociTI), characterized in that: two parallel-operation instructions, ADD2 and SUB2, are used, which operate on 32-bit data by adding or subtracting the upper and lower 16-bit halves separately; with the ADD2 instruction, C=_add2(A, B) adds the upper 16 bits of A and B and stores the result in the upper 16 bits of C, and adds the lower 16 bits of A and B and stores the result in the lower 16 bits of C; with the SUB2 instruction, C=_sub2(A, B) subtracts the upper 16 bits of B from the upper 16 bits of A and stores the result in the upper 16 bits of C, and subtracts the lower 16 bits of B from the lower 16 bits of A and stores the result in the lower 16 bits of C.
CN 200910108686 | 2009-07-14 | 2009-07-14 | A kind of optimization method of the two-dimensional dct based on DM642 | Pending | CN101610418A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN 200910108686 CN101610418A (en) | 2009-07-14 | 2009-07-14 | A kind of optimization method of the two-dimensional dct based on DM642

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN 200910108686 CN101610418A (en) | 2009-07-14 | 2009-07-14 | A kind of optimization method of the two-dimensional dct based on DM642

Publications (1)

Publication Number | Publication Date
CN101610418A (en) | 2009-12-23

Family

ID=41483958

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN 200910108686 (Pending) CN101610418A (en) | A kind of optimization method of the two-dimensional dct based on DM642 | 2009-07-14 | 2009-07-14

Country Status (1)

Country | Link
CN (1) | CN101610418A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102137257A (en)* | 2011-03-01 | 2011-07-27 | 北京声迅电子有限公司 | Embedded H.264 coding method based on TMS320DM642 chip

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102137257A (en)* | 2011-03-01 | 2011-07-27 | 北京声迅电子有限公司 | Embedded H.264 coding method based on TMS320DM642 chip
CN102137257B (en)* | 2011-03-01 | 2013-05-08 | 北京声迅电子有限公司 | Embedded H.264 coding method based on TMS320DM642 chip

Similar Documents

Publication | Publication Date | Title
CN101123723B (en) | | Digital Video Decoding Method Based on Graphics Processor
CN102158694B (en) | | Remote-sensing image decompression method based on GPU (Graphics Processing Unit)
Park et al. | | Design and performance evaluation of image processing algorithms on GPUs
KR20020026243A (en) | | Method and device for variable complexity decoding of motion-compensated block-based compressed digital video
Makkaoui et al. | | Fast zonal DCT-based image compression for wireless camera sensor networks
CN103096077A (en) | | Apparatus And Method For Performing Transforms On Data
CN102186076B (en) | | Image compression method and image compression device for real-time code rate pre-allocation
US20140010284A1 (en) | | Image transform and inverse transform method, and image encoding and decoding device using same
CN101188761A (en) | | Method for optimizing DCT quick algorithm based on parallel processing in AVS
CN102547261B (en) | | A kind of Fractal Image Coding
CN105120276A (en) | | Adaptive Motion JPEG coding method and system
CN101426134A (en) | | Hardware device and method for video encoding and decoding
CN101610418A (en) | | A kind of optimization method of the two-dimensional dct based on DM642
CN102572436B (en) | | Intra-frame compression method based on CUDA (Compute Unified Device Architecture)
Zheng et al. | | Research in a fast DCT algorithm based on JPEG
Rubino et al. | | Real-time rate distortion-optimized image compression with region of interest on the ARM architecture for underwater robotics applications
CN101330614B (en) | | Method for implementing motion estimation of fraction pixel precision using digital signal processor
CN103188487B (en) | | Convolution method in video image and video image processing system
CN107027039B (en) | | Discrete Cosine Transform Implementation Method Based on High Efficiency Video Coding Standard
Mancini et al. | | Enhancing non-linear kernels by an optimized memory hierarchy in a high level synthesis flow
KR101722215B1 (en) | | Apparatus and method for discrete cosine transform
CN203279074U (en) | | Two-dimensional discrete cosine transform (DCT)/inverse discrete cosine transform (IDCT) circuit
CN104683817A (en) | | AVS-based methods for parallel transformation and inverse transformation
Wakatani | | Implementation of fractal image coding for GPGPU systems and its power-aware evaluation
CN103279963A (en) | | Geographic information image compression method

Legal Events

Date | Code | Title | Description
 | C06 | Publication |
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | C12 | Rejection of a patent application after its publication |
 | RJ01 | Rejection of invention patent application after publication | Open date: 20091223

