EFTON - STM32 gotchas

9. TIMx->SR &= ~TIM_SR_flag results in lost interrupts (don't RMW or bit-band on status registers)

This gotcha is by no means limited to the timer status registers, but pertains to all peripherals' status registers, bits of which are of rc_w1 or rc_w0 type.

Status registers in peripherals are not "memory-like" registers, where one can simply write a value which then affects how the peripheral works, and which reads back the same as written. Instead, they are "changed from inside", i.e. hardware changes their values (usually single bits) to indicate various state changes of the hardware. Often, the user "acknowledges" the change by writing a certain value to the same register.

As the various peripherals' IPs in STM32 have various origins (some are purchased from 3rd parties, others are developed in-house but historically for various purposes), the exact method by which the status bits are cleared varies. This is indicated both by the narrative in the description of the given register, and by the marking of the register bits: rc_w0 indicates a bit cleared by writing 0 to it (e.g. bits in TIMx_SR); rc_w1 indicates a bit cleared by writing 1 to it (e.g. bits in EXTI_PR).

When writing to registers with rc_w0 bits, these bits are not affected by writing 1; similarly, rc_w1 bits are unaffected by writing 0. Thus, the correct way to clear a given flag in such a register is to write a mask directly, e.g.

TIM1->SR = ~TIM_SR_UIF;   // rc_w0: 0 clears UIF, the 1s leave all other flags alone
EXTI->PR = 1<<10;         // rc_w1: 1 clears pending bit 10, the 0s leave all others alone

Unfortunately, many users want to treat these registers as normal "memory-like" registers, using the C "compound" |= and &= operators. This results in a read-modify-write (RMW) operation, where the original value is read out from the peripheral register into the processor, the required logic operation is performed, and the result is written back. Performing RMW (&= or |=) on status registers is incorrect and dangerous, as this code can inadvertently clear bits which were not intended to be cleared.

For example, let's assume that in a timer, the Update and Capture-Compare 1 interrupts are enabled; an Update interrupt was detected and is about to be cleared using &=, just before a CC1 event happens. As the first part of &=, the processor reads TIMx_SR, where only UIF is set. It performs the &, which results in all-zeros, but at the same time the CC1 event occurs, so in the actual TIMx_SR both UIF and CC1IF are set. The processor then writes the calculated all-zero value into TIMx_SR, by which it clears both UIF and CC1IF, and this means that the processor will never "see", and thus won't handle, that CC1 event.
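The sequence above can be replayed on a PC with a small model of an rc_w0 register. This is only a sketch: sim_timer_t, hw_set_flag(), sr_write() and the two clear_uif_*() functions are made-up names for illustration, not part of any ST header, and the "hardware event" is simply injected by hand between the read and the write.

```c
#include <stdint.h>

/* Bit positions as in TIMx_SR: UIF is bit 0, CC1IF is bit 1. */
#define SR_UIF   (1u << 0)
#define SR_CC1IF (1u << 1)

/* Hypothetical host-side model of a status register with rc_w0 bits. */
typedef struct { uint32_t sr; } sim_timer_t;

/* Hardware side: an event sets a flag. */
static void hw_set_flag(sim_timer_t *t, uint32_t flag) { t->sr |= flag; }

/* Bus write with rc_w0 semantics: a written 0 clears the flag,
   a written 1 leaves it as it is (software cannot set a flag). */
static void sr_write(sim_timer_t *t, uint32_t value) { t->sr &= value; }

static uint32_t sr_read(const sim_timer_t *t) { return t->sr; }

/* Models "SR &= ~UIF" with a CC1 event landing between read and write. */
static uint32_t clear_uif_rmw(void)
{
    sim_timer_t t = { .sr = SR_UIF };
    uint32_t tmp = sr_read(&t);   /* read: only UIF is set        */
    tmp &= ~SR_UIF;               /* modify: result is all zeros  */
    hw_set_flag(&t, SR_CC1IF);    /* the CC1 event happens now    */
    sr_write(&t, tmp);            /* write: zeros clear CC1IF too */
    return sr_read(&t);
}

/* Models "SR = ~UIF": there is no read, so no window to fall into. */
static uint32_t clear_uif_mask(void)
{
    sim_timer_t t = { .sr = SR_UIF };
    hw_set_flag(&t, SR_CC1IF);    /* the CC1 event happens first  */
    sr_write(&t, ~SR_UIF);        /* only the UIF position is 0   */
    return sr_read(&t);
}
```

In this model clear_uif_rmw() returns 0, i.e. the CC1 flag is gone without ever having been seen by the program, while clear_uif_mask() returns SR_CC1IF, i.e. the flag survives and can be handled.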

There are also status registers which are cleared by writing to a different register, see the DMA status registers; and some bits are cleared by a certain sequence of reads/writes to other registers, e.g. notoriously, the UART RXNE flag is cleared by reading the UART data register. These schemes may have their own quirks, but generally are not affected by the problem described here. It's not unusual to see users perform RMW on the separate clear registers; it's useless, as those registers usually read as constant 0, but it's also harmless there.

It may be surprising that using bit-banding is subject to the same RMW problem. The following article goes into details on that, partially reiterating what has been written above, from a slightly different perspective.

The following article was originally published at community.st.com, but due to a software migration on that site it ended up somewhat crippled. This is a reconstruction, hopefully slightly more readable.

Bit-banding is dangerous when used on hardware-set status registers

On ST's FAQ page, the following could be read some time ago:

Use Cortex-M3 Bit-banding feature for interrupt clearing since it is an atomic operation and NVIC pending interrupts will be ignored during this operation, however Read-Modify-Write is not.

This FAQ page is gone by now, but the quote is perpetuated on the web and in some materials. The problem with it is that it is not entirely true. It's true that NVIC pending interrupts will be ignored during bit-banding, but it's not true that it's a good method to clear interrupts. In fact, it's dangerous; don't do it unless you know exactly what you are doing.

So let's get this straight.

Bit-banding is a feature of ARM Cortex-M3 and Cortex-M4 processors, allowing certain portions of the memory space (including a portion which is usually mapped to peripherals) to be accessed in a bit-wise manner. This feature was introduced to attract programmers used to bit-addressable memory from other mcu architectures, most prominently the x51. It is present only in Cortex-M3 and M4, i.e. not present in M0 and M0+, nor in M7. Even in M3 and M4 it is an optional feature and implementers (semiconductor manufacturers) may choose whether to implement it or not - ST's implementations always include it, i.e. bit-banding is available on the 'F1, 'F3, 'F4, 'G4, and 'L1 and 'L4 subfamilies.

The bit-wise access is realized through a respective alias region, where every single bit in the original memory/peripheral address space is assigned a corresponding word (32 bits). Reading that word returns 0x00000000 or 0x00000001, depending on the state of the corresponding bit. When writing to that word in the alias region, the lowermost bit is written into the original bit, without affecting the other bits in the word containing the original bit.

This is what things look like from the processor's (and thus the programmer's) point of view. But to understand what is going on, we need to get down to the nasty details of how this feature is implemented in hardware.

The truth is that, contrary to the x51, there is no special hardware for flipping individual bits. The processor is still interfaced through a 32-bit bus matrix to 32-bit memories and peripherals, so it can only manipulate data in 32-bit chunks (more precisely, it can also do it in 8-bit and 16-bit chunks, if the attached memory or peripheral implements the byte-select signals of the AHB bus; but never in single bits). So the trick lies in a simple attachment between the processor's S-port and the bus matrix: when the processor attempts to read from the bit-addressable area, the attachment converts the bit address to the basic word's address, reads from the bus matrix at that address, takes the read word, rotates it by the required number of bits and submits that as the result to the processor (the processor is stalled by the attachment all that time). Writing is slightly more tricky: the attachment first issues a read on the real word address, then takes the read data, masks the required bit, replaces it with the written one, and then performs the writeback through the bus matrix.
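The mapping between a bit and its alias word is a simple calculation, which can be sketched in C as below. The two base addresses are those of the peripheral bit-band region and its alias region as defined by ARM for Cortex-M3/M4; periph_bb_alias() itself is a name made up for this illustration, and a macro for real use would typically wrap the result in a volatile pointer dereference.

```c
#include <stdint.h>

#define PERIPH_BASE_ADDR    0x40000000u  /* start of the peripheral bit-band region */
#define PERIPH_BB_BASE_ADDR 0x42000000u  /* start of the corresponding alias region */

/* Returns the address of the alias word for bit `bit` of the register
   at `reg_addr`. Every byte of the original region takes up 32 bytes
   (8 words) in the alias region, every bit takes up one 4-byte word.  */
static uint32_t periph_bb_alias(uint32_t reg_addr, uint32_t bit)
{
    uint32_t byte_offset = reg_addr - PERIPH_BASE_ADDR;
    return PERIPH_BB_BASE_ADDR + byte_offset * 32u + bit * 4u;
}
```

For example, for a register at 0x40012C10 (which should be TIM1_SR on the 'L476, but treat the address as illustrative and check the reference manual), bit 0 maps to the alias word at 0x42258200 and bit 2 to 0x42258208.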

So, a bit-banding write is in fact a read-modify-write operation on a whole 32-bit word, from the point of view of the attached memory or peripheral. During this time, the AHB bus is locked down (there's a special signal for that in the bus), so no other master (such as DMA) can interfere. The processor is left to run until it attempts to access the S-port again, when it is stalled until the operation ends.

This means, atomicity of the operation is preserved, as far as the program is concerned (in this the quote is true); and also the possibility of other busmasters interfering has been taken care of. So what could possibly go wrong?

The peripheral itself.

In many peripherals, there are status words containing individual status bits indicating the states through which the internal state machine of the peripheral has passed. As these are set by hardware, they are usually of the clear-by-writing-1 (c1) or clear-by-writing-0 (c0) type - in the former, writing 1 clears such a bit but writing 0 leaves it unaffected (and in the latter it's exactly the opposite), so the proper operation to clear certain bits in such a register is to write a mask, not to read-modify-write. And this applies not only to software RMW (i.e. register |= mask or register &= ~mask, depending on whether it's the c1 or c0 type, which many users already know is a no-no), but also to hardware RMW. If the hardware sets a bit while another bit is being cleared through RMW, the writeback clears the newly set bit, too. The following scheme may perhaps illustrate this better on the case of the TIM_SR register (bits of which are of the c0 type):

[Figure: diagram depicting changes in TIM_SR over time]

The write from BB's internal register unexpectedly clears the CC2 interrupt flag. I made up the particular numbers - I don't know exactly what the latencies will be, so the "sweet spot" for the bit-banding write instruction timing for the problem to occur will most likely be different from 30. Note that even then the CC interrupt *will* happen, as the signal has already started to be passed to NVIC; except that in that ISR, when checking for the interrupt source, none will be found.
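The same scheme can be replayed on a PC. The sketch below models the bit-band attachment as two separate phases, a read which latches the whole word and a writeback of the modified word, and lets a CC2 event fall either between them or after them. All the names (sim_timer_t, bb_unit_t, bitband_clear_uif() and so on) are made up for this illustration, and the model says nothing about the real latencies.

```c
#include <stdint.h>

/* Bit positions as in TIMx_SR: UIF is bit 0, CC2IF is bit 2. */
#define SR_UIF   (1u << 0)
#define SR_CC2IF (1u << 2)

/* Hypothetical model of a status register with c0-type bits. */
typedef struct { uint32_t sr; } sim_timer_t;

/* Hardware side: an event sets a flag. */
static void hw_set_flag(sim_timer_t *t, uint32_t flag) { t->sr |= flag; }

/* Bus write with c0 semantics: written 0 clears, written 1 does nothing. */
static void sr_write(sim_timer_t *t, uint32_t value) { t->sr &= value; }

/* Hypothetical model of the bit-band attachment's internal register. */
typedef struct { uint32_t latched; } bb_unit_t;

/* First half of a bit-band write: read the whole word. */
static void bb_read_phase(bb_unit_t *bb, const sim_timer_t *t)
{
    bb->latched = t->sr;
}

/* Second half: replace one bit in the latched word, write it all back. */
static void bb_write_phase(const bb_unit_t *bb, sim_timer_t *t,
                           unsigned bit, uint32_t value)
{
    uint32_t word = (bb->latched & ~(1u << bit)) | ((value & 1u) << bit);
    sr_write(t, word);
}

/* Clears UIF through the modelled bit-band write and returns the final
   SR. If cc2_between is nonzero, the CC2 event falls between the read
   and the write phases; otherwise it comes after the writeback.       */
static uint32_t bitband_clear_uif(int cc2_between)
{
    sim_timer_t t = { .sr = SR_UIF };
    bb_unit_t bb;

    bb_read_phase(&bb, &t);                       /* latches UIF only */
    if (cc2_between)  hw_set_flag(&t, SR_CC2IF);  /* the "sweet spot" */
    bb_write_phase(&bb, &t, 0u, 0u);              /* writes all zeros */
    if (!cc2_between) hw_set_flag(&t, SR_CC2IF);  /* harmless timing  */
    return t.sr;
}
```

In this model bitband_clear_uif(1) returns 0, i.e. the CC2 flag is lost, whereas bitband_clear_uif(0) returns SR_CC2IF.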

I tried to visualize this risk in a simple example (to be compiled with augmented device headers) for the 'L476 DISCO. The whole system runs on a slow system clock, MSI set to 100kHz, so that the result is visible on blinking LEDs. There are no AHB/APB prescalers nor prescalers in the timer, as that's the simplest possible setting, directly converting to the scheme above. A timer (TIM1) runs with ARR set so that it overflows roughly at a 10Hz rate. There are two interrupts set, one from Update and the other from Capture2. The green LED is toggled at the update rate (in fact it is toggled by hardware through CH1; I might have done it in the Update ISR by software, the result would be the same); the red LED is toggled in the CC2 interrupt. To find the "sweet spot", the CC2 event is delayed from the start of the cycle more and more in each update cycle, simply by incrementing the CCR2 content (shadowing is switched on for the changing CCR2 to be accepted correctly). The fact that CC2 interrupts are missed because of the bit-banding clearing of the Update flag, when the "sweet spot" is reached, is visualized by the red LED stopping toggling from time to time, while the green LED toggles continuously.

In the isrCnts struct-array there are counters counting the occurrences of the Update ISR with the Update flag set (.up), occurrences of the CC ISR itself (.cc) and occurrences of that ISR with the CC2 flag set (.cc2). This is what the vicinity of the "sweet spot" looks like in these counters:

{up = 28, cc = 28, cc2 = 28},
{up = 29, cc = 29, cc2 = 29},
{up = 30, cc = 30, cc2 = 30},
{up = 31, cc = 31, cc2 = 30},
{up = 32, cc = 32, cc2 = 30},
{up = 33, cc = 33, cc2 = 30},
{up = 34, cc = 34, cc2 = 30},
{up = 35, cc = 35, cc2 = 30},
{up = 36, cc = 36, cc2 = 31},
{up = 37, cc = 37, cc2 = 32},
{up = 38, cc = 38, cc2 = 33},

The same program compiled with

// #define USE_BITBAND

commented out, i.e. the TIM_SR flags clear is performed by straight write of the mask:

#ifdef USE_BITBAND
    PERIPH_BB(TIM1->SR, TIM_SR_UIF_Pos) = 0;
#else
    TIM1->SR = ~TIM_SR_UIF;
#endif

results in the three counters in isrCnts matching perfectly, while counting up, all the time. No ISR missed.

Boring... ;-)