This is one of the "not confirmed by ST" issues; treat it as such. Also, having used mostly the dual-port DMA (as found in the 'F2/'F4/'F7 and covered by AN4031), I don't have extensive experience with the single-port STM32 DMA (covered by AN2548), so I am not entirely sure how exactly this issue pertains to that one.
To keep up with a data stream arriving1 at communication devices such as a UART or SPI, they have to be serviced by the processor in a timely manner, usually in interrupts. Under higher processor load, or if multiple ports are to be handled, the processor might not be able to keep up with the incoming data stream, resulting in data loss. Traditionally, this problem was addressed by adding FIFOs to the transceiver chips, storing up to several received frames before they were read out by the processor. We saw this for example in PCs in the 1990s, when the original 825x/16450 UART (having only the basic Rx holding register for one frame) was replaced by the 16550 (with a 16-byte FIFO).
Why this works is that interrupts (or even polling) may involve serious overhead, both in the number of instructions executed (ISR/function entry/prologue and exit/epilogue) and in the speed at which they are executed (program-flow disruption resulting in cache miss/purge/reload). Neither of these is really a problem on 8-bit microcontrollers (and FIFOs are rare indeed on those); but as MCUs get faster, by employing all sorts of tricks from the "big" computers' world, these issues get more and more aggravated. It then pays off to be interrupted only seldom, and to process bigger chunks of data at once.
The FIFOs on communication chips attempted to mimic the working of the original FIFO-less chips as closely as possible, so that the original software drivers would continue to work. The same interrupt signaling and the same register structure were employed, perhaps with a few added control bits in the registers. For Rx, for the above benefit to materialize, this meant that the interrupt (originally signaling that the Rx holding register was no longer empty) could not go up before several frames were received into the FIFO (i.e. a certain threshold of FIFO "fullness" was reached); but then, once read, it would not go down until the FIFO was drained completely. If only a few frames (below the threshold) were received, they would "hang" in the FIFO indefinitely; to avoid this, an idle timer is also needed to invoke the interrupt. All this results in nontrivial circuitry (and indeed, the FIFO-containing UARTs have seen their fair share of early design flaws, which programmers had to learn to cope with).
In STM32, the original approach was to build simple communication modules, with the general-purpose DMA serving in the role of the FIFO2. This makes sense: there are many communication peripherals on a single STM32, yet usually only a few of them are used at once, so this approach saves silicon area by avoiding FIFO memory and control circuitry in each peripheral, most of which would go unused in a typical design.
To use DMA for reception, the Rx-nonempty signal of the communication module (UART, SPI, I2C) is not used to interrupt the processor directly, but serves as the request (trigger) signal for the DMA, which in turn transfers the received frame from the Rx buffer to a buffer in memory. The basic idea works nicely; however, the signaling described above is missing entirely, so it has to be worked around somehow. We won't deal here with the methods to "alert" (interrupt) the processor about frames present in the "FIFO"3; we will assume the processor is already aware of them and wants to pick and process the frames from the buffer.
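As a concrete illustration of this arrangement, here is a hedged, register-level sketch of circular-mode DMA reception on a USART, assuming an STM32F4 with recent CMSIS device headers. The stream/channel choice (USART2_RX on DMA1 Stream 5, Channel 4) follows the 'F4 request-mapping table; verify it against the reference manual for your part. Clock enables and USART baud-rate setup are assumed to be done elsewhere; buffer name and size are made up.

```c
/* Sketch only: circular DMA reception for USART2 Rx on an STM32F4.
 * Register and bit names follow the CMSIS device header (stm32f4xx.h). */
#include "stm32f4xx.h"

#define RX_BUF_SIZE 64u
static volatile uint8_t rx_buf[RX_BUF_SIZE];

void uart_rx_dma_init(void)
{
    USART2->CR3 |= USART_CR3_DMAR;               /* Rx-nonempty acts as DMA request */

    DMA1_Stream5->CR = 0;                        /* disable the stream first */
    while (DMA1_Stream5->CR & DMA_SxCR_EN) ;     /* wait for EN to actually clear */

    DMA1_Stream5->PAR  = (uint32_t)&USART2->DR;  /* peripheral side: Rx data register */
    DMA1_Stream5->M0AR = (uint32_t)rx_buf;       /* memory side: our circular buffer */
    DMA1_Stream5->NDTR = RX_BUF_SIZE;

    DMA1_Stream5->CR = (4u << DMA_SxCR_CHSEL_Pos) /* channel 4 = USART2_RX */
                     | DMA_SxCR_MINC              /* increment memory address */
                     | DMA_SxCR_CIRC              /* circular: wrap at buffer end */
                     | DMA_SxCR_EN;               /* direct mode (FIFO off), enable */
}
```

Direct mode (FIFO disabled) is deliberately chosen here, as the FIFO only widens the NDTR-versus-memory gap discussed below.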
To find out how many frames have been transferred by the DMA, there's only one choice: to use the NDTR register, which the DMA decrements with each transfer. (The DMA also increments the address registers if set to do so, as the memory side normally is; but the real "acting" address register is, strangely, internal to the DMA and unreadable by the processor. The visible address registers never change and always read the value last written.)
The problem is that NDTR counts the peripheral-side transactions: it gets decremented at the moment when the DMA unit has internally arbitrated the request from the peripheral and started to transfer the frame from the peripheral to its internal FIFO 4 5. But it takes time until the DMA succeeds in storing that frame into memory. If the processor reads the already decremented NDTR and, based on that, reads the frame from memory before the DMA has written it, the processor reads an incorrect value.
The problem gets exacerbated by using the FIFO in the DMA. The FIFO does not get stored into memory until its set threshold is reached; and even finding out from NDTR that the FIFO threshold has been reached does not mean the data are already stored in memory, only that the process leading to the store has begun.
Many users see a DMA transfer as an instantaneous process (a notion supported by the correct recommendation to avoid the relatively slow processor-driven transfers where DMA is available for the same task), but it is not. First, the frame has to be picked up from the peripheral, and that may be slowed both by arbitration for APB access between the DMA and other busmasters (usually the processor) in the dual-AHB-to-APB bridges usually present with the dual-port DMA, and by the sheer fact that the APB bus is usually slower than AHB in the devices with the dual-port DMA. Then, on the memory side, the internal arbitration within the DMA has to be performed, and the DMA then has to arbitrate for the target memory in the busmatrix.
Unfortunately, I know of no good method to find out when the data indicated by NDTR are already stored in memory. In direct mode, if NDTR indicates that the transfer of the next frame has begun, the previous frame is surely stored in memory; but a method relying on that fact is usable only for a continuous incoming stream. The only official way to purge the FIFO is to disable the DMA and wait until its enable bit gets cleared (the very reason for this wait being that the FIFO-to-memory store has to be completed); but that would defeat the very purpose of using DMA. A "reasonable" delay between the NDTR read and the data read from memory could be established by benchmarking the worst case within the given application and then adding some "safety margin"; but it is far from trivial to construct a benchmark which ensures that the worst case is actually reached.
This issue has also been discussed on community.st.com (the thread got mangled due to a forum software change).