

International Journal of INTELLIGENT SYSTEMS AND APPLICATIONS IN ENGINEERING

ISSN:2147-6799

www.ijisae.org

**Original Research Paper** 

### Implementation of an Efficient and Reconfigurable Architecture for DCT on FPGA

Dr. Anil Kumar C.<sup>1</sup>, Dr. Poornima G. R.<sup>2</sup>, Dr. Aruna R.<sup>3</sup>, Dr. Pradeep Kumar B. P.<sup>4</sup>, Dr. Harish S.<sup>5</sup>, Lavanaya Vaishnavi D. A.<sup>6</sup>

Submitted: 21/10/2023 Revised: 19/12/2023 Accepted: 28/12/2023

Abstract: The Discrete Cosine Transform (DCT) is approximated to reduce its computational complexity without significantly degrading coding performance. Most conventional DCT approximation techniques concentrate on short transform lengths, and some of them are non-orthogonal. This work presents a generalized recursive approach to orthogonal DCT approximation in which an approximate DCT of length N is obtained from a pair of DCTs of length N/2 at the cost of N additions for input pre-processing. The approximation is derived using recursive sparse matrix decomposition and the symmetries of the DCT basis vectors. The approach is highly scalable for both hardware and software implementation of DCTs of various lengths, and it uses a conventional 8-point DCT approximation to obtain an approximate DCT of any power-of-two length. The proposed approximation compares well with conventional DCT approximations in image and video compression, and it has lower arithmetic complexity. We also present a fully scalable, reconfigurable parallel architecture for computing the approximate DCT. Its most significant feature is that it can be configured to compute one 32-point DCT, two 16-point DCTs, or four 8-point DCTs in parallel with negligible overhead. The design offers benefits in hardware complexity, reliability, and flexibility, which are illustrated by experimental results from an FPGA implementation.

Keywords: DCT, FPGA, Transformations, Video Processing

<sup>1</sup> Associate Professor & HoD, Dept. of ECE, R.L. Jalappa Institute of Technology, Doddaballapur, canilkumarc22@gmail.com

<sup>2</sup> Professor, Dept. of ECE, Sri Venkateshwara College of Engineering, Bangalore, poornima.gr\_ece@svcengg.edu.in

<sup>3</sup> Associate Professor, Dept. of ECE, AMC College of Engineering, Bangalore, Aruna.ramalingam@gmail.com

<sup>4</sup> Associate Professor, Dept. of ECE, HKBK College of Engineering, Bangalore, pradi14cta@gmail.com

<sup>5</sup> Associate Professor, Dept. of ECE, R.L. Jalappa Institute of Technology, Doddaballapur, harishsrinivasaiah@gmail.com

<sup>6</sup> Assistant Professor, Dept. of ECE, R.L. Jalappa Institute of Technology, Doddaballapur, lavanayavaishanvi@gmail.com

#### 1. Introduction

The discrete cosine transform (DCT) is widely employed in image and video compression. Because the DCT is computationally intensive, several algorithms have been proposed in the literature to compute it efficiently. Recently, extensive research has been carried out on approximating the 8-point DCT to reduce computational complexity [4] – [9]. The primary goal of these approximation techniques is to eliminate multiplications, which consume a significant share of power and processing time, while still achieving a reliable approximation of the DCT. Haweel developed the signed DCT (SDCT) for 8×8 blocks, in which the basis vector elements are replaced by their sign, i.e., ±1. Bouguezel, Ahmad, and Swamy (BAS) have proposed a series of methods; by replacing the basis vector elements with 0, ±1/2, and ±1, they obtained a good estimate of the DCT. In a similar manner, Bayer and Cintra proposed two transforms whose kernel elements are 0 and ±1. They also showed that their methods outperform the conventional techniques, particularly in low- and high-compression-ratio scenarios. The need for approximation is greater for larger DCT sizes, because the computational complexity of the DCT grows nonlinearly.

To achieve a high compression ratio, recent video coding standards such as High Efficiency Video Coding (HEVC) [10] employ DCTs with block sizes up to 32. However, the design procedure used in H.264/AVC cannot be extended to 16-point and 32-point transform sizes. Furthermore, many image processing applications, such as tracking and simultaneous compression and encryption, require larger DCT sizes. In this context, Cintra has proposed a novel class of integer transforms that can be applied to a variety of block lengths. Cintra et al. proposed a novel 16×16 matrix for approximating the 16-point DCT and tested it experimentally. Two new 8-point DCT approximations have been introduced recently: a low-complexity 8-point DCT based on integer functions, proposed by Cintra et al., and an 8-point DCT approximation that requires only 14 additions, introduced by Potluri et al.

A DCT approximation strategy should have the following characteristics: i) its computational overhead must be modest; ii) it must be orthogonal and have low error energy, so that its compression performance is close to that of the exact DCT; iii) it must support larger DCT lengths, to accommodate new video coding standards and applications such as tracking, monitoring, and simultaneous compression and encryption. However, conventional DCT approximations do not meet all of these criteria; many of them lack scalability, extension to larger sizes, or orthogonality. There are two main reasons for retaining orthogonality in a DCT approximation: i) if the transform is orthogonal, its inverse always exists, and ii) the kernel matrix of the inverse transform is obtained simply by transposing the kernel matrix of the forward transform. This property means that similar computing structures can be used to calculate the forward and the inverse DCT.
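The transpose property of an orthogonal kernel can be checked numerically. The following is a minimal sketch, assuming the exact orthonormal 8-point DCT-II kernel rather than any of the approximate kernels discussed in this paper; the helper names (`dct_matrix`, `matvec`, `transpose`) and the sample vector are illustrative only.

```python
import math

def dct_matrix(n):
    """Exact orthonormal DCT-II kernel (row k, column i). Illustrative stand-in."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]

C = dct_matrix(8)
x = [10.0, 12.0, 9.0, 7.0, 3.0, 1.0, 4.0, 8.0]   # arbitrary sample input

X = matvec(C, x)                  # forward transform
x_rec = matvec(transpose(C), X)   # inverse transform: only a transpose is needed

max_err = max(abs(a - b) for a, b in zip(x, x_rec))
print(max_err < 1e-9)
```

Because the inverse uses the same kernel entries in transposed order, a forward and an inverse datapath can share most of their structure, which is the hardware motivation stated above.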

Furthermore, for orthogonal transforms, the same fast algorithms are applicable to both the forward and the inverse transform [19], [20]. In this work, we present an approach for deriving approximate DCTs that satisfy all three of the above criteria. The proposed approximation is obtained by recursive decomposition of the sparse DCT matrix. Compared with conventional DCT approximation techniques, the proposed algorithm has lower arithmetic complexity and lower error energy. The decomposition allows the proposed transform to be extended to larger DCT sizes. The technique is also highly scalable for hardware and software implementation of DCTs of various lengths, and it makes use of the best existing 8-point DCT approximations. We also develop a fully scalable, reconfigurable parallel architecture for computing the approximate DCT.

#### 2. Literature Survey

#### Title: A Survey of Multimedia Streaming in Wireless Sensor Networks Author: Satyajayant Misra, Martin Reisslein, and Guoliang Xue

A wireless sensor network with multimedia capabilities is typically made up of data sensor nodes, which detect sound or motion, and video sensor nodes, which record video. This survey focuses on the encoding of video at the video sensors and the transmission of the encoded video to a base station (BS). End-to-end delay and distortion during network transmission are the two main criteria for real-time video streams. The survey classifies the requirements of multimedia traffic and the techniques for multimedia streaming in WSNs at each layer of the network protocol stack, examining techniques at the application, transport, network, and MAC layers. It also evaluates several cross-layer strategies and presents a few cross-layer alternatives for improving the effectiveness of WSNs for multimedia streaming.

#### Title: A Row-Parallel 8×8 2-D DCT Architecture Using Algebraic Integer-Based Exact Computation

This work presents an algebraic integer (AI) based, time-multiplexed, row-parallel architecture and two final reconstruction step (FRS) techniques for bivariate AI-encoded 2-D DCT computation. The design computes the 2-D DCT without intermediate error because no FRS is applied between the row and column transforms, which yields an 8×8 2-D DCT that is entirely free of quantization errors in the AI basis. The precision chosen by the user for each coefficient in the FRS therefore allows the accuracy of each of the 64 coefficients to be set separately, and it eliminates the quantization noise between channels that is present in existing DCT systems. Improved Dempster-Macleod multipliers and expansion factor scaling are the two FRS techniques. The design serves digital video processing applications that require high dynamic range, low noise, and complete control over the finite-precision computation of the 2-D DCT. The proposed designs and FRS approaches are verified through hardware implementations evaluated on a field-programmable gate array (FPGA) chip. Using the two FRS methods, six designs for 4-bit and 8-bit input word sizes have been implemented and evaluated. The 8-bit input design achieves a maximum clock rate of 307.787 MHz and a maximum block rate of 38.47 MHz; when integrated into a real video-processing system, this corresponds to a pixel rate of 8 × 307.787 MHz ≈ 2.462 GHz. For a 1920×1080 image, the approximate frame rate is 1187.35 Hz. All implementations were carried out on a Xilinx Virtex-6 XC6VLX240T FPGA device.

#### Title: Fast Multiplierless Approximations of the DCT with the Lifting Scheme Author: Jie Liang, Student Member, IEEE, and Trac D. Tran, Member, IEEE

This work presents the design, implementation, and application of several families of fast multiplierless approximations of the DCT based on the lifting scheme, collectively named the binDCT. The binDCT families are derived from Chen's and Loeffler's plane-rotation-based factorizations of the DCT matrix, and the design technique can be applied to a DCT of any size. Two design approaches are described: i) an optimization program is defined, and the multiplierless transform is obtained by approximating its solution with dyadic values; ii) a general lifting-based scaled DCT structure is used, and the analytical values of all lifting parameters are derived, which allows dyadic approximations of varying precision. The binDCT can therefore be tuned to bridge the gap between the Walsh–Hadamard transform and the DCT. The 2-D binDCT with a 16-bit implementation enables lossless compression while retaining acceptable compatibility with the floating-point DCT. The efficiency of the binDCT is also demonstrated in JPEG, H.263+, and lossless compression.
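To illustrate why lifting steps with dyadic coefficients need no multipliers and still support lossless coding, the following is a minimal sketch of one predict/update lifting pair on integers. The function names and the coefficient 3/8 (`p = u = 3`, `k = 3`) are illustrative assumptions, not parameters taken from the binDCT paper.

```python
def lift_forward(a, b, p, u, k):
    """Two lifting steps with dyadic multipliers p/2**k and u/2**k.

    In hardware, (p * a) >> k reduces to shifts and adds for small p.
    """
    b = b - ((p * a) >> k)   # predict step: b is updated from a
    a = a + ((u * b) >> k)   # update step: a is updated from the new b
    return a, b

def lift_inverse(a, b, p, u, k):
    """Undo lift_forward by running the steps in reverse with opposite signs."""
    a = a - ((u * b) >> k)
    b = b + ((p * a) >> k)
    return a, b

# Check perfect reconstruction over a small grid of integer inputs.
pairs = [(a, b) for a in range(-20, 21) for b in range(-20, 21)]
round_trip_ok = all(
    lift_inverse(*lift_forward(a, b, 3, 3, 3), 3, 3, 3) == (a, b)
    for a, b in pairs
)
print(round_trip_ok)
```

Each step changes only one of the two values using a function of the other, so the rounding introduced by the shift is reproduced exactly in the inverse. This is the property that makes integer-to-integer, lossless operation possible.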

#### Title: Scalable Variable Complexity Approximate Forward DCT

The DCT plays a very important role in many image and video compression systems. In decoders, variable-complexity algorithms have been found effective for reducing the complexity of inverse DCT computation. These gains are made possible by the inherent sparseness of the quantized DCT coefficients in real image and video data. Encoder complexity must also be analyzed for real-time video messaging and for two-way video transmission over mobile communication systems running on general-purpose embedded processors. The main objective of this work is to reduce the complexity of the forward DCT, which is the most demanding function in the encoder. Unlike the inverse DCT, the forward DCT does not operate on sparse input data, although it produces sparse output data. Techniques other than those used for the inverse DCT are therefore needed to reduce its complexity. Two key strategies are used to speed up forward DCT computation: i) frequency selection, in which only a subset of the DCT coefficients is computed, and ii) accuracy selection, in which all DCT coefficients are computed with reduced precision. Both strategies can reduce computational complexity with negligible loss in output quality, provided that the coding error, such as quantization error, is larger than the error caused by the approximate computation of the DCT. These strategies must be combined with an efficient model that determines the "correct" degree of approximation from the input and the target rate, a choice that is often made experimentally. The work presents fast, variable-complexity forward DCT algorithms based on frequency selection and accuracy selection, together with an extensive study of the additional distortion introduced by each strategy as a function of the quantization parameter and the variance of the input block.
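The frequency selection strategy can be sketched as follows, assuming an exact orthonormal 8-point DCT-II and a simple rule that keeps only the lowest-frequency coefficients. The function names, the `keep` parameter, and the sample block are hypothetical and are not taken from the surveyed work.

```python
import math

def dct_coeff(x, k):
    """Exact orthonormal DCT-II coefficient k of the sequence x."""
    n = len(x)
    scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
    return scale * sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                       for i in range(n))

def forward_dct_selected(x, keep):
    """Frequency selection: compute only the first `keep` coefficients.

    Coefficients with index >= keep are not computed and are set to zero.
    """
    return [dct_coeff(x, k) if k < keep else 0.0 for k in range(len(x))]

block = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]  # arbitrary samples

full = forward_dct_selected(block, 8)     # all 8 coefficients
partial = forward_dct_selected(block, 4)  # only the 4 lowest frequencies

# Energy of the skipped coefficients is the distortion added by the selection.
skipped_energy = sum(c * c for c in full[4:])
print(partial[:4] == full[:4], skipped_energy >= 0.0)
```

The work saved is proportional to the number of skipped coefficients, and the added distortion is the energy of those coefficients, which is small when coarse quantization would have zeroed them anyway.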

#### 3. Proposed System

This work presents a reconfigurable DCT structure that can be used to compute DCTs of various lengths. Fig 1 depicts the reconfigurable architecture for the approximate 16-point DCT. It is made up of three computing modules: two approximate 8-point DCT blocks and one 16-point input adder block. The input to the first 8-point DCT block passes through 8 MUXes, which select either (a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]) or (x[0], x[1], x[2], x[3], x[4], x[5], x[6], x[7]), depending on whether the block is used for 16-point or 8-point DCT computation. Similarly, the input to the second 8-point DCT block passes through 8 MUXes that select either (b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]) or (x[8], x[9], x[10], x[11], x[12], x[13], x[14], x[15]), depending on whether the block is used for 16-point or 8-point DCT computation.

The output permutation module uses 14 MUXes to select and reorder the outputs according to the size of the selected DCT. The MUXes use SEL16 as a control input to choose the inputs and to carry out the permutation, depending on the DCT size to be computed. SEL16 = 1 selects the approximation of one 16-point DCT, and SEL16 = 0 selects the parallel approximation of two 8-point DCTs. The architecture in Fig. 2 thus computes either one 16-point DCT or two 8-point DCTs in parallel.
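The data flow of this reconfigurable unit can be modeled in software as a rough behavioral sketch. Several details are assumptions because they are not spelled out in the text: the input adder is taken to be the usual butterfly a[i] = x[i] + x[15−i], b[i] = x[i] − x[15−i], the output permutation is taken to interleave the two 8-point results, a 1/√2 scale factor is applied, and an exact 8-point DCT stands in for the approximate 8-point blocks. The names `dct` and `reconfigurable_dct16` are hypothetical.

```python
import math

def dct(x):
    """Exact orthonormal DCT-II; a stand-in for the approximate 8-point block."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                               for i in range(n)))
    return out

def reconfigurable_dct16(x, sel16, dct8=dct):
    """Behavioral model: one 16-point transform (sel16=1) or two 8-point ones (sel16=0)."""
    assert len(x) == 16
    if sel16:
        # Input adder unit (assumed butterfly form).
        a = [x[i] + x[15 - i] for i in range(8)]
        b = [x[i] - x[15 - i] for i in range(8)]
        top, bottom = dct8(a), dct8(b)
        # Output permutation (assumed): even indices from a, odd indices from b.
        y = [0.0] * 16
        for k in range(8):
            y[2 * k] = top[k] / math.sqrt(2)
            y[2 * k + 1] = bottom[k] / math.sqrt(2)
        return y
    # SEL16 = 0: the MUXes bypass the adder and feed x directly to both blocks.
    return dct8(x[:8]) + dct8(x[8:])

x = [float((7 * i) % 11) for i in range(16)]   # arbitrary test input
y16 = reconfigurable_dct16(x, 1)
y8x2 = reconfigurable_dct16(x, 0)
exact16 = dct(x)
even_err = max(abs(y16[2 * k] - exact16[2 * k]) for k in range(8))
print(even_err < 1e-9)
```

In this model the even-indexed outputs equal the exact 16-point DCT coefficients, because the even basis vectors are symmetric; only the odd-indexed outputs are approximated by reusing the 8-point block on b.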



Fig 2. Reconfigurable Architecture for 16-Point DCT



Fig 3. 32-Point DCT Architecture

#### 8-POINT DCT

[Simulation waveform: 8-bit signals of the `dct` module (including /dct/x4, /dct/x5, /dct/x7, and /dct/g1) plotted against time in ps.]

Fig 4. 8-point DCT

Figure 4 shows the simulation output of the 8-point DCT. All 8 outputs of the proposed block take different values at the initial simulation step. The 'g' and 'h' bits are determined and calculated from the other parameters.

#### **INPUT ADDER UNIT**



Fig 5. Input representation for the adder unit

The simulation results in Figure 5 show the output of the input adder units that are used to combine the 8-point wetted DCT. The results show that each adder module works independently according to its function. The output of each adder is 16 bits wide.





Fig 6. Processing of 16-point DCT

Figure 6 shows the simulation results of the 16-point DCT, which is obtained by combining two 8-point DCTs. The results also show the processing of the input data in the 16-point DCT. The proposed algorithm takes around 600 picoseconds to process each set of input data.

#### **32-POINT DCT**



Fig 7. Processing of 32-point DCT

Figure 7 shows the simulation of the 32-point DCT. All the outputs of the 32-point DCT are shown in the simulation above. The block combines smaller structures such as the 16-point DCT, the 8-point DCT, and others.

#### 4. Conclusion

In this work, we presented a recursive technique for deriving orthogonal DCT approximations in which an approximate DCT of length N is obtained from a pair of DCTs of length N/2 at the cost of N additions for input preprocessing. Regularity, architectural simplicity, low computational complexity, and scalability are among the advantages of the proposed approximate DCT. Compared with conventional techniques, the proposed approximation shows better results in terms of hardware resource utilization, compressed image quality, and error energy. We have also presented a fully scalable reconfigurable architecture for the approximate DCT, in which a 32-point DCT unit can be configured to compute two 16-point DCTs or four 8-point DCTs in parallel.
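The addition count implied by this recursion can be worked out with a short sketch. It assumes that a length-N transform uses exactly two length-N/2 transforms plus N pre-processing additions, and that the 8-point base block is the 14-addition approximation mentioned in the introduction; the base block actually used in the implementation may differ. The function name `additions` is illustrative.

```python
def additions(n, base_len=8, base_adds=14):
    """Additions for a length-n approximate DCT under A(n) = 2*A(n/2) + n.

    base_len and base_adds describe the assumed 8-point base block.
    """
    if n < base_len or n & (n - 1):
        raise ValueError("n must be a power of two and at least base_len")
    if n == base_len:
        return base_adds
    return 2 * additions(n // 2, base_len, base_adds) + n

counts = {n: additions(n) for n in (8, 16, 32)}
print(counts)  # {8: 14, 16: 44, 32: 120}
```

Under these assumptions the count grows as 2·A(N/2) + N, so doubling the transform length slightly more than doubles the number of adders while requiring no multipliers.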

#### 5. References

- [1] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," *IEEE Trans. Signal Process.*, vol. 54, no. 3, pp. 955–964, 2006.
- [2] C. Loeffler, A. Ligtenberg, and G. S. Moschytz, "Practical fast 1-D DCT algorithms with 11 multiplications," in *Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP)*, May 1989, pp. 988–991.
- [3] M. Jridi, P. K. Meher, and A. Alfalou, "Zero-quantised discrete cosine transform coefficients prediction technique for intra-frame video encoding," *IET Image Process.*, vol. 7, no. 2, pp. 165–173, Mar. 2013.
- [4] S. Bouguezel, M. O. Ahmad, and M. N. S. Swamy, "Binary discrete cosine and Hartley transform," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 60, no. 4, pp. 989–1002, Apr. 2013.
- [5] F. M. Bayer and R. J. Cintra, "DCT-like transform for image compression requires 14 additions only," *Electron. Lett.*, vol. 48, no. 15, pp. 919–921, Jul. 2012.
- [6] R. J. Cintra and F. M. Bayer, "A DCT approximation for image compression," *IEEE Signal Process. Lett.*, vol. 18, no. 10, pp. 579–582, Oct. 2011.
- [7] S. Bouguezel, M. Ahmad, and M. N. S. Swamy, "Low-complexity 8×8 transform for image compression," *Electron. Lett.*, vol. 44, no. 21, pp. 1249–1250, Oct. 2008.
- [8] T. I. Haweel, "A new square wave transform based on the DCT," *Signal Process.*, vol. 81, no. 11, pp. 2309–2319, Nov. 2001.
- [9] V. Britanak, P. Y. Yip, and K. R. Rao, *Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms, and Integer Approximations*. London, U.K.: Academic, 2007.
- [10] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
- [11] F. Bossen, B. Bross, K. Suhring, and D. Flynn, "HEVC complexity and implementation analysis," *IEEE Trans. Circuits Syst. Video Technol.*, vol. 22, no. 12, pp. 1685–1696, 2012.
- [12] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang, "Incremental learning of 3D-DCT compact representations for robust visual tracking," *IEEE Trans. Pattern Anal. Mach. Intell.*, vol. 35, no. 4, pp. 863–881, Apr. 2013.
- [13] A. Alfalou, C. Brosseau, N. Abdallah, and M. Jridi, "Assessing the performance of a method of simultaneous compression and encryption of multiple images and its resistance against various attacks," *Opt. Express*, vol. 21, no. 7, pp. 8025–8043, 2013.
- [14] R. J. Cintra, "An integer approximation method for discrete sinusoidal transforms," *Circuits, Syst., Signal Process.*, vol. 30, no. 6, pp. 1481–1501, 2011.
- [15] F. M. Bayer, R. J. Cintra, A. Edirisuriya, and A. Madanayake, "A digital hardware fast algorithm and FPGA-based prototype for a novel 16-point approximate DCT for image compression applications," *Meas. Sci. Technol.*, vol. 23, no. 11, pp. 1–10, 2012.
- [16] R. J. Cintra, F. M. Bayer, and C. J. Tablada, "Low-complexity 8-point DCT approximations based on integer functions," *Signal Process.*, vol. 99, pp. 201–214, 2014.