Software delays in various libraries are often derived from a free-running timer, i.e. the timer is not reset, started, or stopped by the delay routine call, as it potentially provides timing for other parts of the software, too. In the ARM Cortex-M realm, such delays are often derived from the SysTick timer, which is part of the processor core. In the STM32/Cube/HAL environment, HAL_Delay() is exactly such a delay.
Users are sometimes surprised to find that these delays last one time unit (usually a millisecond) longer than their prescribed duration.
However, this follows quite naturally from the granularity of the underlying timer. The delay routine, when called with parameter N, reads the timer's current value, say K, and then waits until the timer rolls over from K+N to K+N+1. If the call happens just moments before the timer rolls over from K to K+1, the delay will be only slightly longer than N time units. If it is called slightly after the rollover from K-1 to K, the delay will be only slightly shorter than N+1 time units. In other words, the delay always lasts somewhere between N and N+1 time units, depending on where within the tick period the call happens.
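For illustration, a minimal sketch of such a delay (not the exact HAL source; the `tick` variable stands in for a counter incremented once per time unit in the timer's ISR, the way HAL_GetTick() returns the SysTick-driven uwTick):

volatile uint32_t tick;              /* incremented once per time unit in the timer ISR */

void delay(uint32_t n)
{
    uint32_t start = tick;           /* K - value sampled at call time */
    /* Wait for n + 1 increments, i.e. until the rollover from K+n to K+n+1,
     * so that the delay lasts at least n full time units. */
    while ((tick - start) < (n + 1u)) {
        /* busy wait */
    }
}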

Often, a basic test consists of repeatedly toggling an output pin and calling the delay, as in the sketch below. In this case, each call to the delay happens shortly after the previous delay ended, i.e. shortly after the timer's rollover. This is exactly the scenario where the user sees the delay lasting consistently around one time unit longer than expected.
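A typical form of such a test (port and pin here are just placeholders for whatever LED or test pin is used); with a 1 ms tick, the time measured between edges comes out at roughly 11 ms rather than 10 ms:

while (1) {
    HAL_GPIO_TogglePin(GPIOA, GPIO_PIN_5);   /* e.g. an LED pin - placeholder */
    HAL_Delay(10);                           /* measured edge to edge: ~11 ms */
}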

Also, if any interrupts are running in the microcontroller, execution of their ISRs can further lengthen the delay.
So, these delays have to be understood as lasting "at least the prescribed period".
If a more precise delay is desired, a timer running at a higher clock should be used, possibly employing interrupts of appropriate priority (so that they can preempt other, potentially lengthy ISRs). In 32-bit MCUs, interrupt entry and software loops may take dozens of machine cycles, so that is also the limit of the achievable precision. If even higher precision is needed, hardware has to be used, e.g. the output of a timer's compare unit.
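As one example of a finer time base, on Cortex-M3/M4/M7 parts (not M0/M0+) the DWT cycle counter can be used for busy-waits with roughly cycle granularity. A sketch, assuming an STM32F4 part and the CMSIS headers; it is still subject to interrupt latency, so it too lasts "at least" the requested time:

#include "stm32f4xx.h"                     /* assumption: an STM32F4 target */

static void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable DWT block   */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start cycle counter */
}

static void delay_us(uint32_t us)
{
    uint32_t start  = DWT->CYCCNT;
    uint32_t cycles = us * (SystemCoreClock / 1000000u);
    while ((DWT->CYCCNT - start) < cycles) {
        /* busy wait */
    }
}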
Time-wasting (busy-wait) delays lasting milliseconds are normally used only in quick testing setups. In production code, only delays lasting a few machine cycles are constructed as time-wasting, as it is inefficient to handle those using timers/interrupts. In this case, too, the delay routine/sequence is usually specified as "lasts at least xxx cycles/time units". In the past, NOP instructions were used for this sort of delay, but with the increasing complexity of 32-bit MCUs it is hard to construct such delays properly, as they have to defeat all the measures added to increase execution speed (caching of all forms, removal of NOPs before execution, branch prediction, etc.), and do so in a defined manner.
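For completeness, a sketch of such a cycle-wasting sequence using the CMSIS __NOP() intrinsic; as said above, on a complex 32-bit MCU its real duration depends on flash wait states, caches and the core's handling of NOPs, so it can only be trusted as "at least a few cycles":

static inline void short_delay(void)
{
    __NOP();    /* each __NOP() nominally costs one cycle, */
    __NOP();    /* but the core may fold or overlap them   */
    __NOP();
    __NOP();
}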