STM32 gotchas
43. No bit-banding in Cortex-M0/M0+ - and none in Cortex-M7 either

When ARM and its mcu-manufacturing partners started to introduce the Cortex-M family some 15 years ago, their first product, the Cortex-M3 core, was aimed at the mid-range. However, they wanted to capture not only the nascent 32-bit mcu market, but also parts of the existing 16-bit and the - then still dominant - 8-bit mcu market. As part of this effort, they introduced so-called bit-banding into the new processor's design.

This feature was obviously aimed at users of the venerable Intel-originated 8051 and its numerous clones and variants, produced by many companies in huge numbers. In the 8051, part of the memory can be accessed both byte-wise and bit-wise. In a similar move, in the Cortex-M3, a certain portion of the memory map[1] can be accessed a second time through a different area of the memory map, in such a way that each bit of the words in the first area is aliased to the lowest bit (bit 0) of an otherwise empty word in the second area. Thus, reading a word from the second area returns 0x0000'0000 or 0x0000'0001 depending on the state of the given bit; conversely, writing 0x0000'0000 or 0x0000'0001 clears or sets a single bit in the respective word in the first area[2].

This feature is called bit-banding[3].
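The mapping between the two areas can be sketched as a small address calculation. This is only an illustration: the helper name `bitband_alias` and the macro names are made up for this example, and the region bases are the architecturally defined Cortex-M3/M4 ones (1 MB regions at 0x2000'0000 and 0x4000'0000, aliased at 0x2200'0000 and 0x4200'0000). Check the manuals mentioned in footnote 2 before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-band regions and their alias areas on Cortex-M3/M4. */
#define BB_SRAM_BASE     0x20000000u  /* 1 MB bit-band region (memory)      */
#define BB_SRAM_ALIAS    0x22000000u  /* 32 MB alias area for that region   */
#define BB_PERIPH_BASE   0x40000000u  /* 1 MB bit-band region (peripherals) */
#define BB_PERIPH_ALIAS  0x42000000u  /* 32 MB alias area for that region   */
#define BB_REGION_SIZE   0x00100000u  /* 1 MB                               */

/* Hypothetical helper: returns the address of the alias word for bit `bit`
 * of the byte (or aligned word) at `addr`, or 0 if `addr` lies outside both
 * bit-band regions. Each bit occupies one 32-bit alias word, hence the
 * factors 32 (bytes of alias per byte of region) and 4 (bytes per bit). */
static uint32_t bitband_alias(uint32_t addr, unsigned bit)
{
    uint32_t base, alias;

    assert(bit < 32u);

    /* Unsigned wrap-around makes addresses below the base fail the test. */
    if (addr - BB_SRAM_BASE < BB_REGION_SIZE) {
        base  = BB_SRAM_BASE;
        alias = BB_SRAM_ALIAS;
    } else if (addr - BB_PERIPH_BASE < BB_REGION_SIZE) {
        base  = BB_PERIPH_BASE;
        alias = BB_PERIPH_ALIAS;
    } else {
        return 0u;  /* not bit-banded */
    }
    return alias + (addr - base) * 32u + bit * 4u;
}
```

For example, `bitband_alias(0x20000300u, 2)` yields 0x2200'6008: on a Cortex-M3/M4 with bit-banding implemented, a word write of 1 to that address would set bit 2 of the byte at 0x2000'0300.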

In contrast to the 8051, where the bit-addressable area is an integral part of the core design and individual bits are accessed using dedicated instructions, the Cortex-M3 designers employed an ingenious trick: bit-banding is performed by an extension unit on the port through which the processor accesses the rest of the mcu (usually via a bus matrix)[4]. This extension detects accesses to the bit-addressable area and intercepts them, while making the processor wait. When a read from the bit-addressable area is to be performed, the extension reads the corresponding word from the "normal" area, extracts the required bit, and presents it to the processor as the LSB of an otherwise zeroed word. A write to the bit-addressable area is more complex: the extension has to read the corresponding word, insert the new bit value into that word, and write the word back to memory; however, it does not need to waitstate the processor (unless the processor attempts another write while this process is still running). This trick allows bit-banding to be implemented without any impact on the instruction set, any other portion of the processor, or the rest of the system.
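What the extension does on each access can be mimicked in plain C. The following is a software model only, meant to make the read and the read-modify-write sequences explicit; the function names are invented for this sketch, and the real unit does all of this in hardware on the bus, not in code.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a read through the alias area: fetch the whole underlying word
 * and present the selected bit as the LSB of an otherwise zero word. */
static uint32_t bb_model_read(const uint32_t *word, unsigned bit)
{
    assert(bit < 32u);
    return (*word >> bit) & 1u;
}

/* Model of a write through the alias area: read the underlying word,
 * insert the new bit value, write the word back. Only bit 0 of `value`
 * is used, matching the 0x0000'0000 / 0x0000'0001 convention. */
static void bb_model_write(uint32_t *word, unsigned bit, uint32_t value)
{
    uint32_t tmp;

    assert(bit < 32u);
    tmp   = *word;                              /* 1. read         */
    tmp  &= ~(1u << bit);                       /* 2. clear target */
    tmp  |= (value & 1u) << bit;                /*    insert bit   */
    *word = tmp;                                /* 3. write back   */
}
```

The three numbered steps in `bb_model_write` are the reason a bit-band write is atomic only from the processor's point of view: between steps 1 and 3 something else may still change the underlying word.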

Bit-banding was also retained in the Cortex-M4 core[5], which expanded on the Cortex-M3, adding mainly the DSP instructions, an optional FPU, and several smaller features. However, it did not gain much popularity among users. Perhaps the reason lies in its lukewarm acceptance by toolmakers, who typically provided little to no support for bit-banding, not even a relatively simple guideline for placing variables into the dedicated bit-addressable area using compiler pragmas and/or (already existing) linker features. There are also a couple of issues when using bit-banding on peripherals without exercising due care[6], leading to further alienation of users.

When ARM proceeded to create the stripped-down Cortex-M0 core for the smaller, cheaper end of the whole group of mcus, they decided to omit the bit-banding extension. This is somewhat surprising, as bit-banding was created primarily to attract existing 8-bit mcu users, and the Cortex-M0 was again intended for mcus directly replacing the 8-bitters. Whether the reason was the previous weak acceptance of bit-banding or the attempt to create as compact a processor as possible, this strategy was also adopted in the successor: Cortex-M0+ does not have bit-banding either.

Implementing the bit-banding extension in the high-end Cortex-M7 would probably be cumbersome, given the caches and the conversion of the 32-bit processor bus into the 64-bit AXI bus, and might even impact performance; so bit-banding was omitted from the Cortex-M7 too[7].

The corollary is that while bit-banding is a useful way to have bit-wide variables (usually of boolean nature), sparing memory while retaining atomicity[8], it is available only in Cortex-M3/M4. So, if portability of code to Cortex-M0/M0+ or Cortex-M7 is desired, it is better to give up bit-banding in favour of spending a whole byte on each single-bit (boolean) variable. With a bit of planning and proper placement of these variables using linker features, a seamless transition may be achieved, but the cost of increased memory consumption still has to be considered.
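One way such a transition might be prepared is to hide the flag storage behind a pair of accessors, so that only one definition changes between cores. The sketch below is an assumption-laden illustration, not a recipe: `APP_USE_BITBAND`, `APP_FLAGS_ADDR`, `flag_write` and `flag_read` are all hypothetical names, and in the bit-band variant the word at `APP_FLAGS_ADDR` would have to be reserved by the linker script.

```c
#include <assert.h>
#include <stdint.h>

#define APP_FLAG_COUNT 32u

#ifdef APP_USE_BITBAND  /* hypothetical switch, define only for Cortex-M3/M4 */
/* Flags packed into one word in the SRAM bit-band region. The address is
 * an assumption; the linker script must keep this word free for us. */
#define APP_FLAGS_ADDR 0x20000000u
#define FLAG(n) (*(volatile uint32_t *)(uintptr_t) \
                 (0x22000000u + (APP_FLAGS_ADDR - 0x20000000u) * 32u + (n) * 4u))
#else
/* Portable fallback: one whole byte per flag. A byte store is a single
 * instruction, so setting or clearing a flag stays atomic. */
static volatile uint8_t app_flags[APP_FLAG_COUNT];
#define FLAG(n) (app_flags[(n)])
#endif

/* Example flag numbers, purely illustrative. */
enum { FLAG_RX_READY = 0, FLAG_TX_DONE = 1 };

static void flag_write(unsigned n, int value)
{
    assert(n < APP_FLAG_COUNT);
    FLAG(n) = value ? 1u : 0u;
}

static int flag_read(unsigned n)
{
    assert(n < APP_FLAG_COUNT);
    return FLAG(n) != 0u;
}
```

With this arrangement the application code only ever calls `flag_write(FLAG_RX_READY, 1)` and `flag_read(FLAG_RX_READY)`; the price of the portable variant is 32 bytes of RAM instead of 4.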

1. In fact, two portions - one is intended to provide some bit-aliased memory, the other to provide a bit-aliased peripheral area.

2. This process, the exact location of the mentioned areas, and the exact mapping, together with formulas to calculate the respective addresses from one area to the other, can be found in ARM's Technical Reference Manuals for the Cortex-M3 and M4. In ST's materials, refer to the Bit-banding chapters in PM0056 and PM0214 (Programming Manuals for Cortex-M3 and Cortex-M4 in STM32 mcus). Here is a simple bit-band-addresses calculator.

3. Sometimes confused with the somewhat slangish expression used with microcontrollers, bit-banging, which refers to a non-hardware implementation of certain sequenced protocols such as SPI or I2C, where the individual bits (and respective clock transitions) are output to the port pins (or input from them, i.e. "banged out/in") step by step, "manually", in software.

4. In particular, the bit-banding extension is only on the S-port (System) of the processor. That is the von-Neumann-ish port (i.e. a port through which instructions can be fetched as well as data read and written), through which addresses from 0x2000'0000 upwards are accessed - this is also obvious from the addresses of the particular bit-banded areas. Neither the Harvard-ish I-port (Instruction) nor the D-port (Data) (for accesses below 0x2000'0000) is bit-banded.
   The fact that bit-banding is an extension on the processor's memory interface also means that the bit-addressable area is "invisible" (non-existent) to all other bus masters, so, for example, DMA cannot write or read the "single bit" areas.

5. Note that bit-banding is an optional feature of Cortex-M3/M4, and implementers may choose to omit it. There are Cortex-M4-based mcus from other manufacturers which don't have bit-banding implemented, but as of 2020, every STM32 based on Cortex-M3/M4 does.

6. The main problem with this scheme in peripherals is that while accessing the bit-area is atomic from the program's point of view, it may not be so from the hardware's point of view; so registers containing bits which are set/cleared autonomously by hardware may be dangerous to access through the bit-area.
   On an NXP Cortex-M3-based LPCxxxx I've also experienced a different gotcha, which might or might not be a consequence of bit-banding, but which I've experienced with bit-banding: there were two back-to-back writes into a GPIO port, one using bit-band and the other not. The memory interface merged these two accesses into a single write (plus obviously one read for the bit-banded write), outputting the result of both writes at once and ruining my bit-banged SPI implementation. This was probably due to the fact that NXP decided - not very wisely - to put GPIO into the bit-banded area intended for memories, which in the default MPU setting is tagged as "Normal", allowing such merging of writes. No STM32 has peripherals in this area, and STM32 do have the BSRR registers, which perform atomic writes at the GPIO peripheral itself, obviating the need for placing them into the bit-banding area at all.
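The first hazard in this footnote can be illustrated with a small software model. It assumes a status register whose flags are set by hardware and cleared by software writing 0 to them (writing 1 leaves a flag untouched); this is one common flag style, picked here only for illustration. All names are hypothetical, and the "race" is staged by hand between the read and write halves of the bit-band access.

```c
#include <assert.h>
#include <stdint.h>

/* Toy peripheral with a single status register. */
typedef struct { uint32_t sr; } model_periph_t;

/* Hardware side: an event sets a flag. */
static void hw_set_flag(model_periph_t *p, unsigned bit)
{
    assert(bit < 32u);
    p->sr |= 1u << bit;
}

/* Software write to a write-0-to-clear register:
 * a written 0 clears the flag, a written 1 leaves it unchanged. */
static void sw_write_sr(model_periph_t *p, uint32_t value)
{
    p->sr &= value;
}

/* Clearing `clear_bit` the bit-band way (read, modify, write back), with
 * the hardware setting `racing_bit` between the read and the write.
 * Returns the final register value. */
static uint32_t bitband_clear_with_race(model_periph_t *p,
                                        unsigned clear_bit,
                                        unsigned racing_bit)
{
    uint32_t tmp = p->sr;            /* extension reads the register      */
    hw_set_flag(p, racing_bit);      /* hardware event arrives meanwhile  */
    tmp &= ~(1u << clear_bit);       /* requested bit cleared in old copy */
    sw_write_sr(p, tmp);             /* stale 0 wipes the new flag too    */
    return p->sr;
}

/* Clearing the same flag with one ordinary word write: 0 only in the bit
 * to be cleared, 1 everywhere else. The same event now survives. */
static uint32_t direct_clear_with_race(model_periph_t *p,
                                       unsigned clear_bit,
                                       unsigned racing_bit)
{
    hw_set_flag(p, racing_bit);
    sw_write_sr(p, ~(1u << clear_bit));
    return p->sr;
}
```

Starting from `sr == 0x1`, the bit-band style clear of bit 0 ends with `sr == 0x0`, so the event on bit 1 is lost, while the direct word write ends with `sr == 0x2`.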

7. Bit-addressable aliases can actually be implemented at various points of the system, not just at the processor-to-system interface, and this may then have less impact on overall performance and processor complexity. An example is the "Bit Manipulation Engine" of certain NXP/ex-Freescale Cortex-Mx-based mcus, where bits can be not only set and cleared, but some other boolean operations can also be performed on them - all this through a seemingly innocuous access to an aliasing area.
   However, this is still just an insert into the bus structure, making use of the waitstate/handshake nature of data exchange on the mcu buses (read: the processor waits until the whole read-modify-write process in the "insert" is finished). But, while I haven't seen one yet, I can envisage a piece of truly bit-addressable memory, without an alias into a byte-wise or word-wise collection of the bits, to be used for boolean variables/arrays. While this would spare silicon area at the cost of wasted address space, there is plenty of the latter in the ARMs.

8. The apparent replacement - C bitfields - notoriously suffers from not being atomic, as bitfields form part of a byte/word in which bits cannot be flipped atomically with a single instruction, at least not in standard memory with a standard ARM Cortex-M core.