Introduction
The MAXQ's unique transfer-triggered architecture makes it a top performer in the 16-bit microcontroller market. The MAXQ instruction set features single-clock and instruction-cycle operations for jumps, calls, returns, loop control, and arithmetic operations. As a result the MAXQ enables applications to process more data in less time than other microcontrollers. Designers can thus add more functionality in their applications or reduce power consumption by completing required tasks quickly and spending more time in low-power stop modes.To demonstrate the MAXQ's capabilities for this competitive analysis, we took benchmark code written to showcase the MSP430, ran it on the MAXQ, and monitored MAXQ performance. The competitor's code initially made the MAXQ function comparatively slow and inefficiently. Later when Rowley's highly optimized CrossWorks compiler for the MAXQ was released to the market, we reran the benchmark code. We found that Rowley's compiler used MAXQ architectural features more effectively, and that the MAXQ outperformed both the Texas Instruments (TI) MSP430 and the Atmel AVR. The MAXQ executed the same code in fewer clock cycles. In addition, this accelerated performance did not penalize the user with extra code size—the MAXQ's code size is within 2% of the competitors' code sizes.
This application note presents the details of our study of the MAXQ, Atmel AVR, and TI MSP430 architectures. This study is transparent—there are no tricks of compiler optimizations or specialized code made to force one microcontroller to perform better than another. Project files and source code are provided on the Maxim Web site so that the results can be duplicated. The results of this study (and other MAXQ performance studies) can be found at MAXQBenchmark.
Notes About Methodology
Of the two compilers in this study, the IAR Embedded Workbench and the Rowley CrossWorks, we used Rowley's compiler to generate the MAXQ's benchmark data because it made the best use of MAXQ capabilities. Both the IAR and Rowley compiler results were used for the MSP430 and the AVR microcontroller tests.The data for execution time were gathered with the simulators that ship with IAR's Embedded Workbench and Rowley's CrossWorks toolsets. The execution cycles counted did not include startup time; the count started at the entry point into the main() function and ended with the main() function's return statement.
Code sizes are in bytes and include both CONSTANT and CODE segments. This is because some tools include application constants in the CODE segment, which would make a device's code density appear incorrectly high. Combining the sizes of the CODE and CONSTANT segments ensures an equivalent comparison.
In general, we configured the compilers to use their highest code-optimization levels for ALL devices. This typically meant that all optimizations were enabled when targeting the smallest code size, and almost all optimizations were enabled when targeting the fastest code (because some compiler optimizations sacrifice speed for code size). In some instances, the high optimization settings caused problems—the code generated failed to simulate properly, never reaching the return statement. Often, the code began to work when the optimization level changed. We will indicate when such reductions of the optimization level were required. The project files that accompany this application note contain the optimization settings used to generate the benchmark data.
TI Benchmark
This benchmark is a suite of tests published by Texas Instruments to showcase the MSP430. The suite contains 10 individual benchmarks:- 8-bit math routines
- 8-bit matrix (array) accesses
- 8-bit switch statements
- 16-bit math routines
- 16-bit matrix (array) accesses
- 16-bit switch statement
- 32-bit math routines
- Floating point math routines
- Finite impulse response algorithm
- Matrix multiplication
The TI application note, including the source code, is available for download.
TI Results
The TI study provided results for execution speed (in clock cycles) and code density (in bytes), as shown in Table 1 and Table 2. Note that some of the device names (taken directly from the TI application note) are unclear. For instance, does 8051 refer to a 12-clock, 6-clock, 4-clock, or even 1-clock 8051 architecture?Table 1. TI Study Results: Execution Speed (no. of cycles)
Application | MSP430F135 | ATmega8 | PIC18F242 | 8051 | H8/300L | MC68HC11 | MAXQ20 | ARM7-TDMI (Thumb) |
8-bit math | 299 | 157 | 318 | 112 | 680 | 387 | 421 | 185 |
8-bit matrix | 2899 | 5300 | 20045 | 17744 | 9098 | 15412 | 31691 | 2227 |
8-bit switch | 50 | 131 | 109 | 84 | 388 | 214 | 58 | 146 |
16-bit math | 343 | 319 | 625 | 426 | 802 | 508 | 815 | 259 |
16-bit matrix | 5784 | 24426 | 27021 | 29468 | 15280 | 23164 | 60214 | 2998 |
16-bit switch | 49 | 144 | 163 | 120 | 398 | 230 | 51 | 146 |
32-bit math | 792 | 782 | 1818 | 2937 | 1756 | 1446 | 1034 | 115 |
Floating point | 1207 | 1601 | 1599 | 2487 | 2458 | 4664 | 1943 | 108 |
FIR filter | 152193 | 164793 | 248655 | 206806 | 245588 | 567139 | 464558 | 43191 |
Matrix multiply | 6633 | 16027 | 36190 | 9454 | 26750 | 26874 | 66534 | 2918 |
TOTALS | 170249 | 213680 | 336543 | 269638 | 303198 | 640038 | 627319 | 52293 |
Table 2. TI Study Results: Code Size (no. of bytes)
Application | MSP430F135 | ATmega8 | PIC18F242 | 8051 | H8/300L | MC68HC11 | MAXQ20 | ARM7-TDMI (Thumb) |
8-bit math | 172 | 116 | 386 | 141 | 354 | 285 | 352 | 660 |
8-bit matrix | 118 | 364 | 676 | 615 | 356 | 380 | 378 | 408 |
8-bit switch | 180 | 342 | 404 | 209 | 362 | 387 | 202 | 504 |
16-bit math | 172 | 174 | 598 | 361 | 564 | 315 | 286 | 676 |
16-bit matrix | 156 | 570 | 846 | 825 | 450 | 490 | 526 | 428 |
16-bit switch | 178 | 388 | 572 | 326 | 404 | 405 | 188 | 504 |
32-bit math | 250 | 316 | 960 | 723 | 876 | 962 | 338 | 620 |
Floating point | 662 | 1042 | 1778 | 1420 | 1450 | 1429 | 1596 | 1556 |
FIR filter | 668 | 1292 | 2146 | 1915 | 1588 | 1470 | 1828 | 1420 |
Matrix multiply | 252 | 510 | 936 | 345 | 462 | 499 | 494 | 432 |
TOTALS | 2808 | 5114 | 9302 | 6880 | 6866 | 6622 | 6188 | 7208 |
From this data, the MSP430 produced the densest code—45% smaller than the Atmel AVR microcontroller. The MSP430 also appeared to perform best, with the exception of the 32-bit ARM processor. These results also showed the MAXQ to be comparatively slow and inefficient.
Flaws with the TI Benchmark Study
The manner in which TI produced its benchmarks raised some questions.The first problem is that TI did not use any optimizations in their study. TI argued against compiler optimizations in order to remove the compiler from consideration and to make the microcontroller perform on its own. The problem with this argument is that engineers still use a compiler to generate machine code. If a compiler does not take advantage of the architectural features of a microcontroller when optimizations are not enabled, then you do not get a realistic idea of the microcontroller's performance. In addition, benchmarks are only valuable if they model real applications. An engineer is likely to enable optimizations for size or speed in a real application, and these should thus be included as part of the benchmark study.
The second flaw in the TI benchmark study is that they only considered one compiler. Admittedly, the Rowley compiler was not available to TI at that time. Now available, the Rowley compiler dramatically updates the earlier TI results.
Maxim's Approach
As explained above, our reevaluation of the TI benchmark focused on the MSP430, Atmel AVR, and MAXQ architectures. We considered execution and code size data for both the IAR Embedded Workbench and the Rowley CrossWorks toolsets. All results for execution speed were obtained through simulation.The MAXQ device in this study was the MAXQ2000 microcontroller. In addition to an array of peripherals including an LCD controller, the MAXQ2000 has 16 16-bit accumulators and a 16 x 16 hardware multiply accelerator. For this study, we enabled the hardware multipliers on all three devices under test—we assumed that if performance on mathematical computations (such as a FIR filter) was important, a designer would choose a microcontroller with a multiply accelerator.
For the MSP430 device, we targeted the MSP430F149, a different device than TI targeted in their study (the MSP430F135). We chose the F149 because it has a hardware multiply unit, making comparison to the MAXQ2000 more equitable.
The ATmega8 was selected for study because the current IAR compiler could generate code using the hardware multiplier for this microcontroller. The IAR compiler could not do so for the other AVR devices like the ATmega64 or ATmega128.
Gathering benchmark results from both toolsets was straightforward. In IAR, the code size data is found in a map file (make sure it is generated under Project Options → Linker → List). Scroll down to the bottom of the map file and the following three lines appear:
184 bytes of CODE memory
80 bytes of DATA memory
66 bytes of CONST memory
As mentioned earlier, we count both CODE and CONST memory sections in the total code size, because compilers differ on where they place constant program data. For testing, the only legitimate way to compare code size is to include the constant size.
To find execution cycles in IAR, select the Simulator as the Debug tool and begin debugging. Launch the code profiler under View → Profiling. Click the Activate button and the Autorefresh button (see Figure 1). The debugger should automatically run to the first line of the C code. Press the Run key, and (if no breakpoints are set) the IAR debugger terminates at program exit. Look at the code profiler and report the number of cycles under Accumulated Time for main()—this is the number of cycles spent in the main routine and all subroutines called by main.
Figure 1. IAR Code Profiler: accumulated time (cycles) means the number of cycles spent in that routine and all subroutines which it calls.
Finding the generated code size in the Rowley toolset is also very easy. When the project builds, the Project Explorer lists the code size with the project. Figure 2 shows that for the MSP430F149, the 16-bit math benchmark code size is 238 bytes.
Figure 2. Rowley Project Explorer shows code size details for each project.
Determining the number of execution cycles in the Rowley tool is not quite as easy as with IAR—Rowley does not automatically stop at the end of the program nor does it separate where the cycles are spent. You must reset the cycle counter upon entry to the main program. To do this, first start debugging the program. When the compiler stops at the entry point to main, reset the cycle counter by double clicking on it.
Figure 3. When the Rowley simulator stops at main(), reset the cycle counter (the picture with the hourglass) by double-clicking on it.
Next, set a breakpoint at the end of the application. (Note that lines with the blue triangles in the margins indicate where you can set breakpoints.) Run to the breakpoint and record the number of cycles reported.
There are other possible complications with using the Rowley simulator.
- Depending on the optimizations, you may only be able to simulate at the assembly level, in which case it is more difficult to find the end of the application. The best approach is to scan through the code and find the next RETURN statement in the assembly code, set your breakpoint there, and run to it.
- The simulator may not always stop at the main entry point. When this occurs, try pressing the Restart Debugging button. You may also need to manually find the main entry point and set a breakpoint there.
Compiler Settings
When using the IAR toolset, the compiler options window in the project options is configured for the highest optimization level with all optimizations enabled (see Figure 4). To change between targeting smallest code and fastest execution, switch the selected radio button from Size to Speed.Figure 4. Options for the IAR compiler: all optimizations are enabled. The radio button switches the compiler between optimizing for speed and for size.
Rowley's CrossWorks allows users to create build configurations in addition to the default Debug and Release configurations. Therefore, the benchmark projects for this study also included the Fastest (see Figure 5) and Smallest (Figure 6) configuration options. The Fastest configuration removes any optimization that values code size at the expense of an instruction cycle.
Figure 5. Project options used in Rowley's CrossWorks for the fastest configuration.
The settings for the smallest configuration appear in Figure 6. Options that favored code size at the expense of cycles were enabled, and the overall optimization strategy was to minimize size.
Figure 6. Project options used in Rowley's CrossWorks for the Smallest configuration.
The project and source files for each benchmark run by Maxim are available at www.maxim-ic.com/products/microcontrollers/maxq/performance/competitive.cfm#compiler_detail_links. The configurations in these project files are the same configurations used for the benchmarking. Links to trial versions of the IAR and Rowley tools are available with other third-party tools on the Maxim website, so you can easily reproduce these benchmark results.
MAXQ Benchmark Results
Tables 3 and 4 show the MAXQ benchmark results. Execution speed is again given as clock cycles and code size is given in bytes.Table 3. Results from Maxim's Study: Execution Speed (no. of cycles)
Application | MSP430F149 IAR | MSP430F149 Rowley | ATmega8 IAR | ATmega8 Rowley | MAXQ2000 Rowley | |||||
Configuration | Small | Fast | Small | Fast | Small | Fast | Small | Fast | Small | Fast |
8-bit math | 243 | 243 | 276 | 272 | 110 | 110 | 279 | 278 | 278 | 245 |
8-bit matrix | 1629 | 963 | 6243 | 2659 | 1508 | 1074 | 7348 | 3763 | 3461 | 2947 |
8-bit switch | 31 | 31 | 24 | 24 | 84 | 36 | 45 | 45 | 39 | 39 |
16-bit math | 219 | 219 | 250 | 250 | 275 | 266 | 348 | 330 | 194 | 191 |
16-bit matrix | 1906 | 899 | 6755 | 3171 | 1147 | 697 | 5251 | 5250 | 3205 | 2691 |
16-bit switch | 30 | 30 | 24 | 24 | 111 | 44 | 50 | 50 | 39 | 39 |
32-bit math | 575 | 575 | 790 | 716 | 746 | 731 | 995 | 885 | 545 | 521 |
Floating point | 784 | 784 | 1097 | 921 | 1614 | 1565 | 1491 | 919 | 763 | 744 |
FIR filter | 86042 | 82748 | 90812 | 82592 | 82779 | 82779 | 73598 | 66249 | 62280 | 59470 |
Matrix multiply | 4254 | 2761 | 6036 | 5436 | 7799 | 2396 | 11081 | 9231 | 3704 | 3027 |
TOTALS | 95713 | 89253 | 112307 | 96065 | 96173 | 89698 | 100486 | 87000 | 74508 | 69914 |
Figure 7 graphs the data for execution speed. Only the fastest results are shown. Speed is measured in execution cycles—a smaller bar means better performance.
Figure 7. Execution speed results for the fastest configuration setting. The smaller MAXQ2000 bar shows better performance.
Table 4. Results from Maxim's Study: Code Size (no. of bytes)
Application | MSP430F149 IAR | MSP430F149 Rowley | ATmega8 IAR | ATmega8 Rowley | MAXQ2000 Rowley | |||||
Configuration | Small | Fast | Small | Fast | Small | Fast | Small | Fast | Small | Fast |
8-bit math | 192 | 192 | 258 | 262 | 98 | 98 | 212 | 212 | 248 | 284 |
8-bit matrix | 152 | 180 | 240 | 232 | 318 | 304 | 220 | 250 | 202 | 222 |
8-bit switch | 180 | 180 | 230 | 230 | 312 | 164 | 202 | 200 | 152 | 152 |
16-bit math | 140 | 140 | 220 | 220 | 162 | 154 | 222 | 238 | 162 | 164 |
16-bit matrix | 240 | 240 | 312 | 312 | 398 | 374 | 294 | 350 | 260 | 378 |
16-bit switch | 178 | 178 | 230 | 230 | 346 | 178 | 212 | 240 | 152 | 152 |
32-bit math | 236 | 236 | 284 | 388 | 306 | 296 | 380 | 460 | 274 | 324 |
Floating point | 1100 | 1100 | 966 | 1004 | 1026 | 1046 | 816 | 936 | 1018 | 1090 |
FIR filter | 1178 | 1174 | 924 | 966 | 1258 | 1258 | 860 | 896 | 1024 | 1044 |
Matrix multiply | 266 | 250 | 312 | 316 | 476 | 324 | 294 | 348 | 254 | 264 |
TOTALS | 3862 | 3870 | 4076 | 4160 | 4700 | 4196 | 3712 | 4130 | 3746 | 4074 |
The following graph (Figure 8) shows the code size data for the smallest configuration results. Code size is measured in number of bytes—a smaller bar means better code density.
Figure 8. Code size results for the smallest configuration setting. The MAXQ2000's smaller bar indicates better code density.
Table 5. The Compiler Versions for This Study
Microcontroller | Compiler | Version |
MAXQ2000 | Rowley | CrossWorks for MAXQ, Release 1.0, Build 2 |
MSP430F149 | Rowley | CrossWorks for MSP430, Release 1.3, Build 3 |
MSP430F149 | IAR | IAR C/C++ Compiler for MSP430, V3.30A/W32 (3.30.1.1) |
ATmega8 | Rowley | CrossWorks for AVR, Release 1.1, Build 1 |
ATmega8 | IAR | IAR C/C++ Compiler for AVR, 4.10B/W32 (4.10.2.3) |
Table 6. Issues Encountered When Running These Benchmarks
Device | Tool | Configuration | Benchmark | Issue |
ATmega8 | Rowley | Smallest | 16-bit matrix | The simulation would not terminate unless the Code Factoring optimization was set to NONE. |
ATmega8 | IAR | Fastest | 8-bit matrix, 16-bit matrix | The simulation would not terminate unless the optimization level was set to medium instead of high. |
ATmega8 | IAR | Smallest | FIR filter | Simulation would not terminate even at lowest optimization level. The numbers included in Table 3 and Table 4 are for the FIR filter in the fastest configuration. |
ATmega8 | IAR | Both | Matrix multiplication | The simulation would not terminate on the ATmega8, ATmega16, or ATmega32 targets. The project was targeted instead for the ATmega64. |
Analysis and Summary
Across different compilers and with optimizations enabled, the above results show that the MSP430 is not the best performing microcontroller, even when running TI's specially crafted benchmark code.When considering the total number of execution cycles required to run the entire benchmark suite, the MAXQ2000 outperforms the MSP430F149 and the ATmega8. The MAXQ2000 runs in 69,914 cycles, while the MSP430F149 (IAR) and ATmega8 (Rowley) take 89,253 and 87,000 cycles, respectively. When considering the total size for the benchmark code, the best-case results for the three microcontrollers vary by only 2%, making any difference in code size irrelevant.
Since code density is not a factor for this benchmark, we look deeper into the execution speed results. The total execution-cycle results are heavily weighted by the FIR filter results, where the MAXQ2000 clearly outperforms the competition. The MAXQ2000 is the best performer on the math benchmarks except for the ATmega8 in the 8-bit math benchmark. The MAXQ2000's weakest performance is on the 8-bit and 16-bit matrix benchmarks, which copy items from one multidimensional array to another.
To this point, we are only considering the performance of the test microcontrollers in terms of clock cycles. We have not considered the speed at which a device can run. For the sake of absolute comparison, we use benchmark iterations per second—the number of times that the entire TI benchmark suite can run in a second. Table 7 shows that when all devices run at the same clock speed, the MAXQ2000 is 28% faster than the MSP430F149 and 24% faster than the ATmega8. When the devices run at the maximum clock rate, the MAXQ2000 is 56% faster than the ATmega8 and 218% faster than the MSP430F149.
Table 7. Results from Maxim's Study: Speed (Iterations per Second and at FMAX)
Device | Cycles | Fmax | Iterations/s at 1MHz | Iterations/s at Fmax |
MSP430F149 | 89,253 | 8 | 11.20 | 89.60 |
ATmega8 | 87,000 | 16 | 11.49 | 183.84 |
MAXQ2000 | 69,914 | 20 | 14.30 | 286.00 |
Figure 9. Benchmark iterations per second when running at the maximum clock rate. The taller MAXQ2000 bar shows better performance.
How should we summarize the results of the Maxim benchmark study? At the very least, it counters the results of the TI benchmark study, which showed the MAXQ microcontroller architecture as unremarkable. This updated study shows that the MAXQ2000 is a code-efficient, fast microcontroller that should be considered for any new designs and redesigns that will benefit from a higher performance microcontroller.
This study is part of an ongoing effort. Please visit the homepage for MAXQ benchmarking for additional and updated studies. An evaluation kit is available for the MAXQ2000 microcontroller. For information on the EV kit, links to demonstration code, software, and application information, go to Evaluate the MAXQ2000 Microcontroller with the MAXQ2000-KIT.
評(píng)論
查看更多