Skein hash and ThreeFish block cipher
Skein is a hash function proposed by Niels Ferguson, Stefan Lucks, Bruce Schneier, Doug Whiting, Mihir Bellare, Tadayoshi Kohno, Jon Callas and Jesse Walker, as their submission to the SHA-3 competition. It is based around a new block cipher introduced in the very same paper, called ThreeFish. Both will certainly receive attention and cryptoanalysis in the forthcoming SHA-3 process.
There is barely any reason to implement on 8-bit microcontrollers anything else than the smallest block/key size, 256 bits. Contrary: if the autors would take the pain to try to reduce the block - and possibly also key - size; the low-complexity approach they (deliberately, I guess) used, with straightforward ways for implementation on 8-bitters, would certainly yield a serious competitor for Rijndael in this class of microcontrollers. However, I lack the abilities even to estimate, whether this is viable at all.
The presented implementation for 8051-class microcontrollers and for AVR are aimed at speed at the cost of memory, especially of ThreeFish. Skein is implemented in both cases in a more relaxed style, with indirect memory access and loops rather than explicit coding and unrolled loops. To be honest, I am not interested in the hash itself, as I see much less use of it in the general 8-bitters, than of a block cipher (and I may be wrong, as so often). Also, the block cipher presents the most of the work anyway. Whoever wishes to make a fast Skein, can take the already fast implementation of ThreeFish, and recode the - relatively simple, short and straighforward - Skein part. Also, if only an optimal Skein is desired, and not a generic ThreeFish, the initial key xor of ThreeFish can be performed in line with the plaintext xor, and tweak manipulation can be simplified by taking into account that most of its content is zero.
Only encryption is presented in both cases, as only that is needed for Skein - and, the block cipher being symmetrical, in most applications only one of the encryption/decryption pair is sufficient to have. Both implementations use the unrolled 8 rounds - which again is almost certainly a deliberate design choice to make this sort of implementation easy, and in both the initial key injection is performed separately in line with generation of the key and tweak extra word. Both implementations employ "renaming" instead of byte shifts, resulting in need to "put things in place" byte shift after the 8 rounds.
For both implementations, a "testbench" is included for both standalone ThreeFish and Skein, using vectors from the submission package.
'51
My primary interest - of course - was the '51 implementation. In fact, it is a '52 implementation, as the classic 8051 with mere 128 bytes of RAM altogether (including register area and stack) lacks the memory needed just to store data, key, tweak, some state variable, and to do any other useful job.
I started out with implementing the rotations - which is clearly the key element - using the mul ab instruction (conventional rotations in all places would be hopelessly slow, as they perform only one bit at a time, with the need to move variables in and out of accumulator). This is easy at it results in uniform implementation of all rotations; at the other time, it still needs to move the two separated "halfs" around, and to merge (OR) it with the neighbouring byte's bits. So, in several steps (visible in the source as conditionally assembled alternatives) I replaced the mul's with rotations in case where 2 rotations were needed (left or right - i.e. rot mod 8 = 2 or 6); with a xchd-based construct in cases when 4 rotations were needed; and the combination of the two in cases where 3 or 5 rotations were needed. There is one round where 0 rotations are needed, and none with 1 or 7.
The resulting implementation of ThreeFish-256 occupies around 2.5kBytes of code memory (FLASH), which is a similar value to e.g. many Rijndael implementations on '51; whereas to encrypt a block takes less than 13k cycles on the conventional cycle-structure derivatives, such as the classical 12-clockers, the 6-clockers such as the 'RD2 subfamily, and the NXP P89LPC9xx 2-clockers. The 4-clockers (such as the Dallas DS5xxx "secure" family), and the 1-clockers (such as the SiLabs derivatives, Ramtron Versas, Atmel AT89LP's, Dallas DS89C4x0) have a different cycles-per-instruction structure (with a tendency to have 1 cycle per code (FLASH) byte), so they will require slightly more cycles, although performed faster.
Although the implementation eats up much of the precious internal RAM - both direct and indirect, it needs no stack (and Skein needs only 3 bytes of stack, of which 1 byte can be stored conveninently in the registers), and it occupies only one register bank (and not all registers of it) and no bit addressable area. In most of the applications, the data space can be reclaimed and overlaid by other variables of the application, or stack.
AVR
The main motivation for the AVR implementation was pride. The authors published an estimate of 9.5kcycles of a highly optimised application, where the state would be stored in the 32 registers. I, of course, wanted to know, whether I can reach this figure, beat it, or be beaten.
Clearly, 32 registers all connected to ALU provide a big edge over the accumulator-based architecture of '51. On the other hand, it is clear, that some of the register(s) have to be swapped out to SRAM regularly, to allow operations with key and tweak; and that there is no xchd instruction (which is such a relic left over from the '48, that it is often omitted on some '51 softcores, too) which would save the day, so we need to perform all the rotations ... as... rotations.
To make the story short, the implementation performs ThreeFish-256 at above 10.5kcycles per block. For Skein, it would mean some more cycles for the plaintext xor and to move in a new plaintext to the registers, plus some other housekeeping. I clearly failed the 9k5 limit. But, as the authors did not publish the source of the estimate (they published only the way how they estimated the 19k cycles for the "conventional" swap-in-swap-out implementation), I cannot say, whether their estimate was overly optimistic, omitting perhaps some key factor, or did I simply miss some important optimisation step. Flash consumption is around 2.5kWord, i.e. 5kBytes of FLASH - the 16-bit opcode width of AVR clearly takes its toll.
Conclusion
Skein is undoubtedly one of the hottest candidates for SHA-3. And, as such, it will receive a serious cryptoanalysis in the forthcoming months. As such, it is attractive to implement it, especially when the authors took pain to design the cipher and the hash with 8-bit-implementability in mind.
On the other hand, a 256-bit block cipher is not likely to be needed to secure an application and/or its data on a '51/AVR-class of microcontrollers. Although core of these microcontrollers is used also on smartcards and security devices, these devices often employ hardware acceleration of the implemented cipher, thus a generic implementation for them is mostly useless.
However, in cases where a general-purpose-microcontroller-based device has to implement some already established data exchange protocol based on a given cipher or hash, these implemetations may serve at least for an estimate of fitness, if not basis for a real application.
Enjoy!
7.12.2008