STM32 gotchas
108. Toggling GPIO in 'H7 is slow (due to complex bus structure)

We've talked about how frustrating it is for novices coming to the 32-bit mcu world, whose very first test program does nothing but toggle a pin in a loop as rapidly as it can be toggled, to see the pin toggling much slower than they anticipated.
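For reference, such a test program usually boils down to something like the sketch below. It is only an illustration: the function name toggle_pin_n is made up, and the register address is passed in as a pointer so that the same code can be tried on a PC with an ordinary variable. On a real target you would pass the address of the port's BSRR register (e.g. &GPIOA->BSRR, assuming the usual CMSIS device header naming).

```c
#include <stdint.h>

/* Hypothetical helper, not part of any vendor library.
   Writes n set/reset pairs to a GPIO BSRR-style register.
   bsrr - address of the register (on target e.g. &GPIOA->BSRR, assumed CMSIS name)
   pin  - pin number, 0..15 */
void toggle_pin_n(volatile uint32_t *bsrr, unsigned pin, unsigned n)
{
    const uint32_t set_mask   = (uint32_t)1 << pin;          /* bits 0..15 set the pin    */
    const uint32_t reset_mask = (uint32_t)1 << (pin + 16u);  /* bits 16..31 reset the pin */

    while (n--) {
        *bsrr = set_mask;    /* pin goes high */
        *bsrr = reset_mask;  /* pin goes low  */
    }
}
```

Each pass through the loop makes two writes to the GPIO, so the resulting square wave can never be faster than one period per two completed bus writes.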

Nowhere is this frustration bigger than with the 'H7 family, where the processor is clocked at hundreds of MHz, yet a GPIO pin toggles only at low tens of MHz. Why?

To start, let's realize that in the 'H7 it is only the processor core which runs at the advertised highest frequency. All main buses run at half that frequency at most, and most peripherals at a quarter of the core frequency at most.
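To put numbers on this, here is a trivial sketch of the arithmetic. The type and function names are made up for illustration, and the 480 MHz used in the comment is just an example core clock; check the datasheet of the particular 'H7 for the real limits.

```c
#include <stdint.h>

/* Hypothetical container for the upper clock limits described above. */
typedef struct {
    uint32_t core_hz;        /* processor core clock                       */
    uint32_t bus_max_hz;     /* main buses: at most half the core clock    */
    uint32_t periph_max_hz;  /* most peripherals: at most a quarter of it  */
} h7_clock_limits;

/* Derives the upper limits from the core clock.
   E.g. an assumed 480 MHz core gives 240 MHz for the buses
   and 120 MHz for most peripherals. */
h7_clock_limits h7_clock_limits_from_core(uint32_t core_hz)
{
    h7_clock_limits c;
    c.core_hz       = core_hz;
    c.bus_max_hz    = core_hz / 2u;
    c.periph_max_hz = core_hz / 4u;
    return c;
}
```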

Moreover, the 'H7 is an incredibly complex design, based on three bus matrices. There are several reasons for this. A single bus matrix would be physically too big, and as a result too slow and too power-hungry. The "first" matrix is a 64-bit AXI matrix, which would be unnecessarily extensive/wasteful for the slower "normal" peripherals. The second matrix is mostly derived from the 'F2/'F4/'F7 families and partially caters to the secondary Cortex-M4 processor core; even though that core is present only in some high-end members of the 'H7 family, its impact is still seen in the other members, which largely copy the general layout to keep the overall family design simpler.

Access from processor to GPIO crosses several bus matrix domains in H7.

Bus matrix separation also allows simpler clock separation between various parts of the chip, which in turn allows a whole bunch of power-saving modes - and those are desperately needed in the 'H7 if it is to survive more than just a few years of operation.

This latter requirement is probably why the GPIO modules are placed on the AHB4 bus, which is part of the relatively small D3-domain bus matrix. This matrix is intended to stay "alive" in the lowest-power, yet still peripherally-active modes, and thus also contains the low-power asynchronous timers and UART (as well as one SPI and one I2C unit) and a small single-port DMA unit, so that these peripherals can be handled partially without processor intervention. It also contains the optionally battery-powered RTC, comparators, watchdog and the EXTI unit - all of these related to low-power-run and wakeup functionality.

The consequence of this arrangement is that accessing GPIO registers from the processor or from the general-purpose DMA always crosses at least two bus matrices. This involves not only delays due to two bus-matrix arbitration processes; as different bus matrices potentially run at different clocks, there are also resynchronization units between the matrices, which need additional clock periods to do their work.
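How such per-write delays translate into pin frequency is simple arithmetic, sketched below. The function name is made up, and the cycle count used in the comment is purely hypothetical, picked only to show the order of magnitude; it is not a measured or documented value.

```c
#include <stdint.h>

/* Hypothetical estimate: frequency of the square wave produced by a
   set/reset loop, if every GPIO write effectively costs
   core_cycles_per_write core clock cycles (arbitration and
   resynchronization included). One period needs two writes.
   E.g. an assumed 480 MHz core and an assumed 12 cycles per write
   give 20 MHz. Returns 0 for a nonsensical zero cycle count. */
uint32_t square_wave_hz(uint32_t core_hz, uint32_t core_cycles_per_write)
{
    if (core_cycles_per_write == 0u) {
        return 0u;
    }
    return core_hz / (2u * core_cycles_per_write);
}
```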

These delays accumulate up to the point where toggling GPIO from the processor (and from the general-purpose DMA) results in frustratingly slow changes on the pin. This is not to say that the pins themselves can't toggle faster than that - indeed they can, when set to Alternate Function in GPIOx_MODER and then driven directly from the associated peripherals.
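Switching a pin to Alternate Function means writing 0b10 into that pin's two-bit field in GPIOx_MODER. A minimal sketch of the bit manipulation follows; the names moder_with_pin_mode and PIN_MODE_* are made up for this example and do not come from any vendor header.

```c
#include <stdint.h>

/* Two-bit mode encodings used in GPIOx_MODER (names are hypothetical). */
enum {
    PIN_MODE_INPUT     = 0u,
    PIN_MODE_OUTPUT    = 1u,
    PIN_MODE_ALTERNATE = 2u,
    PIN_MODE_ANALOG    = 3u
};

/* Returns the MODER value with the field for the given pin (0..15)
   replaced by mode; all other pins are left untouched. */
uint32_t moder_with_pin_mode(uint32_t moder, unsigned pin, uint32_t mode)
{
    const unsigned shift = pin * 2u;       /* two bits per pin      */

    moder &= ~((uint32_t)3u << shift);     /* clear the old mode    */
    moder |= (mode & 3u) << shift;         /* insert the new mode   */
    return moder;
}
```

On the target, this would be used along the lines of GPIOA->MODER = moder_with_pin_mode(GPIOA->MODER, 5, PIN_MODE_ALTERNATE); (again assuming CMSIS-style names). The particular alternate function also has to be selected in GPIOx_AFRL/GPIOx_AFRH, and the peripheral itself set up, before anything appears on the pin.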

Hand in hand with this comes jitter, i.e. uncertainty in timing - so e.g. a single pulse generated by two consecutive writes to GPIO from the processor can have variable length, depending on the load on the bus matrices from other bus masters, and also on the momentary mutual state of synchronicity between the involved bus matrices.

The exact working of the inter-matrix synchronizers is undocumented, so calculating these timing effects is mostly impossible.