STM32 gotchas
66. SPI master Tx outputs 16 clocks, even if its Data Size is set to 8 bits

The SPI module, even in its original form in the early STM32 families, is relatively complex; however, its basic operation follows the expected pattern. In Master mode, after setting the proper mode, selecting the baudrate, and deciding how NSS is used, it is enough to select whether the transmitted frame is 8 or 16 bits wide. Writing the Data register when it is empty then simply makes the SPI module output 8 or 16 pulses on SCK while shifting the data out onto MOSI (and simultaneously shifting data in from MISO).

However, starting with the 'F0/'F3 families, users often find that following this simple procedure with an 8-bit frame setting sometimes results in 16 bits being transmitted instead of 8 for each written piece of data. The extra 8 bits, the upper byte of the written halfword, are usually all zeros. Why is that?

Of the three basic serial communication interfaces - USART, SPI, I2C - SPI is generally the one capable of the highest transfer speeds. As such, it makes sense to try to reduce any bottleneck in moving data between the SPI module and the rest of the chip. So, one of the improvements to the SPI module is what ST calls data packing: whenever the frame size is set to 8 bits (or less), it is possible to read or write two bytes at once, simply by writing/reading a whole 16-bit halfword [1].

In the new SPI module this behaviour is implicit and cannot be disabled. It makes use of the fact that within the 32-bit bus system (AHB/APB) used in the STM32, besides data and addresses, there are also signals indicating the width of the data being transferred. It is the bus masters - the processor and the DMA units - which control these signals, although users may not always be aware of them.
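The effect of the access width can be illustrated with a toy host-side model. This is only a sketch, not real hardware: `spi_model`, `spi_model_write_dr` and `MODEL_FIFO_DEPTH` are hypothetical names, and the byte order (low byte out first) is assumed to match the description of data packing in ST's reference manuals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: a transmit FIFO holding 8-bit frames.
   The depth is an arbitrary choice for this sketch. */
#define MODEL_FIFO_DEPTH 4u

struct spi_model {
    uint8_t fifo[MODEL_FIFO_DEPTH];
    size_t  count;
};

/* Models one bus write to the data register while the SPI is set to
   8-bit frames. width_bytes stands in for the bus signal that tells
   the peripheral how wide the access is: 1 = byte, 2 = halfword. */
static void spi_model_write_dr(struct spi_model *spi, uint16_t value,
                               unsigned width_bytes)
{
    assert(width_bytes == 1u || width_bytes == 2u);
    assert(spi->count + width_bytes <= MODEL_FIFO_DEPTH);

    /* The low byte is always queued as the first frame. */
    spi->fifo[spi->count++] = (uint8_t)(value & 0xFFu);

    /* A halfword access is treated as two packed frames, so the upper
       byte (often zero) is queued and clocked out as well. */
    if (width_bytes == 2u) {
        spi->fifo[spi->count++] = (uint8_t)(value >> 8);
    }
}
```

In this model, a halfword write of 0x00A5 queues two frames, 0xA5 followed by 0x00, which corresponds to the 16 clocks observed on SCK; a byte write of the same value queues a single frame.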

In DMA units, the transfer width (incorrectly named "alignment" in the Cube realm) is explicitly set in the control register.
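As a rough sketch of what setting the width explicitly looks like, the helper below builds a channel control register value with byte-wide transfers on both sides. The bit positions (PSIZE in bits 9:8, MSIZE in bits 11:10) are an assumption based on the channel-style DMA of the 'F0/'F3 type and must be checked against the reference manual of the actual part; the macro, enum and function names are made up for this example.

```c
#include <stdint.h>

/* ASSUMPTION: field positions as in a channel-style DMA control
   register ('F0/'F3 type). Other families lay these fields out
   differently, so verify against the reference manual. */
#define DMA_PSIZE_POS 8u   /* peripheral-side transfer width */
#define DMA_MSIZE_POS 10u  /* memory-side transfer width */

/* Assumed encoding of the 2-bit width fields. */
enum dma_width { DMA_WIDTH_8 = 0u, DMA_WIDTH_16 = 1u, DMA_WIDTH_32 = 2u };

/* Returns ccr with both width fields replaced, other bits untouched. */
static uint32_t dma_ccr_set_widths(uint32_t ccr, enum dma_width periph,
                                   enum dma_width mem)
{
    ccr &= ~((3u << DMA_PSIZE_POS) | (3u << DMA_MSIZE_POS));
    ccr |= ((uint32_t)periph << DMA_PSIZE_POS)
         | ((uint32_t)mem << DMA_MSIZE_POS);
    return ccr;
}
```

For an SPI with 8-bit frames and no intentional packing, the peripheral-side width would be `DMA_WIDTH_8`, so that each DMA transfer reaches the data register as a byte access.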

In the processor, load and store instructions have the data width encoded directly in the instruction; this is reflected in the respective mnemonics (STRB/LDRB for byte, STRH/LDRH for halfword). So, to prevent data packing from sending out 16 bits where the user intends to send only 8, one has to use STRB to write the data to the SPI data register. But most users don't use assembler, they use C instead, so how to accomplish that there?

As the CMSIS-mandated device headers define the SPI data register as a 16-bit type, simply writing to it as

SPI->DR = data;
results in exactly such a 16-bit transfer, and thus in the SPI transmitting 16 bits, which we wanted to avoid. The universally accepted way around this is to use type punning [2]: take a pointer to the SPI data register, cast it to a pointer to an 8-bit type, and write through the dereferenced pointer [3]:
*(volatile uint8_t *)&SPI->DR = data; 
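To keep the cast out of application code, the idiom can be wrapped in small helpers. The following is a minimal sketch: `spi_regs_t` is a hypothetical stand-in for the CMSIS `SPI_TypeDef` (only DR is modelled, so that the snippet compiles on its own), and `spi_write8`/`spi_read8` are names made up here. The C standard does not promise a byte-wide bus access, but the common Cortex-M compilers are expected to emit STRB/LDRB for these accesses.

```c
#include <stdint.h>

/* Hypothetical stand-in for the device header's SPI_TypeDef;
   only the data register is modelled, as a 16-bit volatile. */
typedef struct {
    volatile uint16_t DR;
} spi_regs_t;

/* Byte-wide store to DR: one 8-bit frame is queued, no data packing. */
static inline void spi_write8(spi_regs_t *spi, uint8_t data)
{
    *(volatile uint8_t *)&spi->DR = data;
}

/* Byte-wide load from DR: reads a single received 8-bit frame. */
static inline uint8_t spi_read8(spi_regs_t *spi)
{
    return *(volatile uint8_t *)&spi->DR;
}
```

With the real device header, the same bodies would take an `SPI_TypeDef *` instead, e.g. `spi_write8(SPI1, data)`, where `SPI1` is whichever instance the header provides.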


1. The next logical question is: why go only halfway? Why doesn't this great feature exploit the fact that the STM32 are 32-bit, so that writing a whole word would transfer 2 halfwords or 4 bytes at once? The answer is twofold. One reason is that, at the end of the day, even if the features of these chips are incredibly small, distributing a full-fledged 32-bit bus across major portions of the chip (read: APB buses connecting dozens of peripherals) still results in high complexity and silicon area, reduced yield, and thus increased cost. The other reason is historical: many of the STM32 peripherals are IPs taken over from older 16-bit ST MCUs, so they still bear some of their legacy.

2. C, being a high(-ish) level language, does not define a good, universally portable method of prescribing exactly how the hardware should work. Type punning in any form (through pointer casts or unions) is, in the general case, a feature with undefined or implementation-defined behaviour, and thus, broadly speaking, can't guarantee the expected result. However, we are talking here about a very particular usage of the language, where the handful of existing compilers make a conscious effort to behave in the "expected way" in this particular regard.

3. ARM's CMSIS headers define __IO as an alias for volatile, which provides no benefit while violating C naming rules (C99 7.1.3#1).