Individually controlled clocks for mcu peripherals are a relatively novel concept, which appeared roughly together with the advent of 32-bit mcus. It was enabled by the increasing number of transistors per chip, so this feature was no longer seen as an unnecessary complexity. At the same time, it was necessitated by the huge number of peripherals on the new mcus, the vast majority of which are unused in any particular application - and would only increase consumption, were their clocks left running.
However, for users migrating from 8-bit mcus this was a surprising change. So much so that it became the topic of the very first gotcha...
While enabling a peripheral in STM32 is as easy as setting the respective enable bit in one of the enable registers in RCC, in some STM32 it may come as a surprise that accessing a register in the peripheral immediately after it has been enabled may go wrong: writes may be ignored by the peripheral and reads may return unexpected values. This is due to the SoC nature of these mcus, where signals may need to pass through synchronizers between various clock domains, and this may take some time.
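To illustrate, the problematic sequence may look entirely innocent. The following sketch uses plain variables as stand-ins for the real memory-mapped registers (on a real 'F4 these would be RCC->AHB1ENR and GPIOA->MODER from the CMSIS device header; bit positions follow the 'F4 reference manual):

```c
#include <stdint.h>

/* mock registers standing in for the real memory-mapped ones; on a real
   STM32F4 these would be RCC->AHB1ENR and GPIOA->MODER from the CMSIS
   device header */
volatile uint32_t RCC_AHB1ENR;
volatile uint32_t GPIOA_MODER;

#define RCC_AHB1ENR_GPIOAEN  (1u << 0)   /* GPIOA clock enable bit */

void gpioa_init(void)
{
    /* enabling the peripheral is just setting one bit in RCC... */
    RCC_AHB1ENR |= RCC_AHB1ENR_GPIOAEN;

    /* ...but this immediately following write may arrive "too early"
       and get ignored by the not-yet-running peripheral */
    GPIOA_MODER |= (1u << 10);           /* PA5 as output */
}
```

On the host (or on an unaffected model) this of course behaves as expected; the gotcha is that on an affected model the second write may silently vanish.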
ST realized this problem quite soon (e.g. in the STM32F407 errata, the Delay after an RCC peripheral clock enabling erratum appeared in 2012, one year after the 'F407 was introduced) and documented it in errata for the older STM32 families and directly in the RM for the newer ones.
Unfortunately, one of the offered workarounds - the one which appears to be the most foolproof and the easiest to implement, and which is even used in Cube - i.e. performing a dummy read of the target peripheral's register after enabling its clock and before any "real" access - has been experimentally proven wrong on at least the STM32F407/'F429.
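The workaround in question might be sketched as follows (mock variables again stand in for the real registers; note that Cube's __HAL_RCC_..._CLK_ENABLE() macros implement a variant of this, reading back the RCC enable register itself rather than a register of the target peripheral):

```c
#include <stdint.h>

volatile uint32_t RCC_AHB1ENR;   /* mock RCC enable register */
volatile uint32_t GPIOA_MODER;   /* mock register of the target peripheral */

#define RCC_AHB1ENR_GPIOAEN  (1u << 0)

void gpioa_clk_enable(void)
{
    RCC_AHB1ENR |= RCC_AHB1ENR_GPIOAEN;

    /* the "foolproof" workaround: a dummy read of the target peripheral's
       register; volatile prevents the compiler from optimizing it away -
       yet, as described above, even this proved insufficient on the
       'F407/'F429 */
    (void)GPIOA_MODER;
}
```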
The fact that the problem couldn't be reproduced on the 'F411 nor the 'L476, together with the slightly different overall timing of the related events on those models, indicates that the bus fabric attached to the processor has subtle yet significant differences between individual models/subfamilies, which are entirely undocumented (the processor itself is a component sourced by ARM and reports the same version/revision number, so it's most likely identical in all four discussed models).
Solving the root problem is complicated by the fact that the required delay is very short, a few system clock cycles; and that in C or any other higher-level language there's little guarantee of code ordering (beyond what volatile can offer), let alone of execution timing (which in the 32-bitters is a matter of uncertainty for multiple reasons anyway). Most probably, the vast majority of users have never encountered the problem, even without adding a delay in their C code (or using the library macro which has the delay incorporated).
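One way to obtain a short delay which does not depend on the compiler's timing of ordinary C statements is to perform a fixed number of volatile reads of a register in the always-clocked RCC itself - each read is a bus transaction the compiler must emit, costing at least one cycle each. This is only a sketch of the idea, not a method guaranteed safe on any particular model:

```c
#include <stdint.h>

volatile uint32_t RCC_AHB1ENR;   /* mock RCC enable register */

/* enable a peripheral clock, then burn a few cycles by repeatedly reading
   the RCC register back; each volatile read must be performed by the
   compiled code, regardless of optimization level */
void rcc_enable_with_delay(volatile uint32_t *enr, uint32_t mask)
{
    *enr |= mask;
    for (int i = 0; i < 8; i++) {
        (void)*enr;
    }
}
```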
One easy and effective way to avoid problems stemming from this issue is the traditional controller-oriented style of mcu development. There, the particular application is seen as central, so all resources needed by that application are controlled/enabled at once, at the beginning of the program (this also makes such a program easier to read, as it's immediately clear which resources are enabled/set up, and how). So, here, after enabling all the clocks, an operation not requiring access to the just-enabled peripherals is performed - e.g. setting up the various clocks, clearing/initializing variables in RAM, etc. - and that provides enough time for all peripherals to be safely enabled by the time they are accessed for the first time. This is also more efficient in terms of code size and execution time.
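In this traditional style, initialization might look roughly like the following sketch (mock registers, and an arbitrary selection of peripherals chosen purely for illustration; bit positions follow the 'F4 reference manual):

```c
#include <stdint.h>

volatile uint32_t RCC_AHB1ENR;   /* mock clock-enable registers */
volatile uint32_t RCC_APB1ENR;

#define RCC_AHB1ENR_GPIOAEN   (1u << 0)
#define RCC_AHB1ENR_GPIOBEN   (1u << 1)
#define RCC_APB1ENR_TIM2EN    (1u << 0)
#define RCC_APB1ENR_USART2EN  (1u << 17)

static uint32_t app_state[16];   /* some application variables */

void system_init(void)
{
    /* enable everything this application will ever need, at once */
    RCC_AHB1ENR |= RCC_AHB1ENR_GPIOAEN | RCC_AHB1ENR_GPIOBEN;
    RCC_APB1ENR |= RCC_APB1ENR_TIM2EN | RCC_APB1ENR_USART2EN;

    /* work not touching the just-enabled peripherals - here, clearing
       application variables - provides far more than enough delay */
    for (unsigned i = 0; i < sizeof app_state / sizeof app_state[0]; i++) {
        app_state[i] = 0;
    }

    /* ...only now configure the GPIO, TIM2, USART2 registers */
}
```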
However, this approach goes against the grain of the now-fashionable "driver" approach, where all manipulation of resources (including clocks and GPIO) is performed individually, "locally": in the code, and at the time, where/when a given peripheral is initialized. In that case, the delay has to be provided at each point where a peripheral's clock is enabled.
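Under the "driver" approach, the delay then has to be repeated in every driver's init function, along these lines (mock registers again; the dummy-read loop is one possible delay, to be replaced by whatever method is proven safe on the given model):

```c
#include <stdint.h>

volatile uint32_t RCC_APB1ENR;   /* mock RCC enable register */
volatile uint32_t USART2_BRR;    /* mock register of the driven peripheral */

#define RCC_APB1ENR_USART2EN  (1u << 17)

void usart2_driver_init(void)
{
    /* "local" clock enable, as a driver would do it... */
    RCC_APB1ENR |= RCC_APB1ENR_USART2EN;

    /* ...followed by a locally provided delay - here, a few dummy reads
       of the RCC register */
    for (int i = 0; i < 8; i++) {
        (void)RCC_APB1ENR;
    }

    /* then the actual configuration (illustrative value) */
    USART2_BRR = 0x683;
}
```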