Google Summer of Code Proposal: Multithreading Theora Decoder

Name: Leonardo de Paula Rosa Piga
Email: leonardo.piga@gmail.com
Website: http://www.students.ic.unicamp.br/~ra033956

1. Project Goal

Parallelize the most time-consuming functions of the Theora decoder.

2. The Need for Parallelization

Video resolutions have increased over the years, and with them the workload of video codecs. At the same time, processor manufacturers have concluded that increasing clock frequency, pipeline depth, cache size, and so on might not provide the expected rise in performance. Therefore, they have decided to increase the number of cores on a single chip and decrease the clock frequency. However, raising the number of cores improves performance only when the software exposes parallelism.

Moore's Law states that the number of transistors doubles every 18 to 24 months. Applying the same reasoning to the number of cores, if the law remains true, by the end of the next decade we will have processors with five hundred to a thousand cores. For mobile devices, it is known that IBM intends to use CELL for embedded systems, and this processor performs very well when the work is done in parallel to exploit its multiple cores.

Thus parallelism is the new trend. My intent in this work is to implement a multithreaded version of the most time-consuming functions of the Theora decoder.

3. Programming Model

The Fork-Join Model will be used. Although this approach is not the best for achieving the best results, it is the easiest method for programmers. Execution begins as a single thread, called the master thread. The master executes sequentially until a parallel region is encountered. When this happens, the master thread creates a team of parallel threads. After the team completes its work, the threads synchronize and terminate, leaving only the master thread again.
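The fork-join pattern described above can be sketched with OpenMP in C. This is a minimal sketch, not decoder code: process_row and process_frame are hypothetical names, and inverting pixels is only a stand-in for real per-row work. It assumes the rows are independent of each other, which the real decoder functions would first need a data dependency analysis to confirm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for per-row decoder work: invert every pixel. */
static void process_row(unsigned char *row, size_t width)
{
    for (size_t x = 0; x < width; x++) {
        row[x] = (unsigned char)(255 - row[x]);
    }
}

/* Hypothetical frame routine: one fork-join region over all rows.
 * Assumes each row can be processed independently of the others. */
void process_frame(unsigned char *frame, int height, size_t width)
{
    assert(frame != NULL && height >= 0);

    /* Fork: the master thread creates a team of threads and the
     * loop iterations (rows) are divided among them. */
    #pragma omp parallel for
    for (int y = 0; y < height; y++) {
        process_row(frame + (size_t)y * width, width);
    }
    /* Join: the implicit barrier at the end of the parallel loop
     * synchronizes the team; only the master thread continues. */
}
```

Compiled with OpenMP enabled (for example, gcc -fopenmp), the rows are divided among a team of threads. Without it the pragma is ignored and the loop runs sequentially with the same result, which is part of what makes this model easy to adopt incrementally.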
Figure 1 illustrates the fork-join model.

Figure 1: Fork-Join Model

         |-|------>|-|      |-|------>|-|
         |F|       |J|      |F|       |J|
--------\|O|------>|O|-----\|O|------>|O|-----\
--------/|R|       |I|-----/|R|       |I|-----/
 Master  |K|------>|N|      |K|------>|N|
 Thread  | |       | |      | |       | |
         |-|------>|-|      |-|------>|-|
        Parallel region    Parallel region

4. Profile Analysis

A simple profile analysis of the Theora software decoder shows that more than 70% of the CPU time is spent in only 2 functions. They are:

* The reconstruction procedures
* The deblocking filter, called the loop filter

Table 1 - Profile Analysis
-----------------------------------------------------------
Function                          | CPU-Time (%) | Count
----------------------------------+--------------+---------
th_decode_packetin                | 99.20        | 2038
+-> oc_dec_frags_recon_mcu_plane  | 56.60        | 103938
  +-> oc_state_frag_recon         | 47.00        | 43113293
+-> oc_state_loop_filter_frag_row | 17.00        | 103938
----------------------------------+--------------+---------

Table 1 shows the 2 most CPU-expensive functions of the decoding software and their share of CPU time, as given by a simple profile of a Theora stream. These functions are good candidates for a parallel implementation because of their high CPU usage. However, a data dependency analysis should be done first. The function oc_state_frag_recon alone should not be parallelized because it is called too many times, and the overhead of creating and terminating threads could outweigh the possible performance gains.

5. The Parallel Implementation

An implementation using OpenMP requires fewer code modifications and provides an easy way to write scalable programs. Thus, as a first try, the implementation will use OpenMP, aiming at code maintainability and scalability. If the preliminary results are not as good as expected, a pthread implementation will be necessary.

6. Work Schedule

* Make a data dependency analysis (2 weeks)
* Make some tests to check the feasibility of an OpenMP implementation (2 weeks)
* Parallelize the oc_dec_frags_recon_mcu_plane function (4 weeks)
* Parallelize the oc_state_loop_filter_frag_row function (4 weeks)

If I finish the work before the end of the GSoC period, I will study more functions to be parallelized.

7. Bio

I'm an undergraduate Computer Engineering student at the State University of Campinas (Brazil) in the ninth period and a researcher at the Computer Systems Laboratory (http://www.lsc.ic.unicamp.br). My areas of interest are hardware design, computer architecture, multithreaded programming, compilers, and image processing. Since June 2006 I've been working on a hardware Theora decoder project. I implemented all modules except the iDCT in VHDL, and the hardware could decode 96x80 Theora videos. I have been involved with Theora for about 2 years and have been able to observe its progress over time. For these reasons I think I should be selected.