EFTON - Microcontroller programming 205

Microcontroller programming 205: Guts of startup code

This article is STM32-specific and gcc-specific, with some overlaps to other Cortex-M-based mcus and other C-based toolchains.

Programs are run in a certain environment. In the world of "big" computers, they are run from the operating system; in microcontrollers or other so-called "freestanding" environments, they run directly once the hardware comes out of reset.

Startup code is a piece of code which is supposed to prepare everything needed to run the binary resulting from program's compilation/linking. As the bare minimum, it has to initialize/zero global and static variables, as required by the C standard, and then pass control to main() (a normal function, which is never called from within the program itself, so there has to be some "external" mechanism to call it)¹.

gcc is not a monolithic project.

It is a project that was, and still is, actively developed by many developers with no common affiliation, who usually have strong opinions on many aspects of the project, both technical and organisational. Over time it has grown in many directions, as it aims to serve multiple goals and targets, resulting in a moving "center of gravity" (e.g. the dominant data word width shifting from 32-bit to 64-bit). It is in fact a project which consists of several subprojects, bound together more or less tightly.

So it is no wonder that there is no single point of documentation. While individual sub-projects are documented more or less extensively and precisely, when it comes to the intimate details of how the individual parts of gcc are "held together", the general approach is just to make the tools so that "things do work".

It is also a fact that writing documentation is an extremely hard task, which additionally requires a different mindset from the one needed for writing programs. It appears to be quite rare that a good programmer is also a good teacher (documentation writer).² That's why it appears simpler to write the tools in the way that "they just work", rather than write (and maintain!) extensive documentation, which is also not likely ever to be read by most users.

However, this is exactly the opposite of what we intend to do here, i.e. gain full control.

We therefore try here to put the pieces together, in a Cortex-M-specific manner. However, in this effort, details may get omitted and/or misinterpreted. The following text - which is far from being complete or concise - should be taken with this caveat in mind.

What does startup code look like?

Startup code for most targets is delivered as a precompiled binary (object, .o) and is considered part of the standard library (libc). This is so because most targets are relatively compact and stable environments. This is even true for some of the microcontroller targets, such as AVR. Any need for customization is then reduced to selecting one of the supplied library variants (through innocuous-looking command-line switches, selecting mcu model/family)³, and to "hooks" which are designed into the startup code (as well as into the linker script) to bring in snippets of code to insert into the startup process (see below).

This is not that simple in the ARM Cortex-M world. The "bundled" startup code object is traditionally overridden with one which is provided as source code and compiled together with the application's sources. One reason for this is that the interrupt vector table, which is part of the startup code, differs dramatically between individual mcus using this core, so unification would be difficult.

Having startup code as source appears to give more freedom to the user, and also it appears that it's easy to restrict it to the absolute minimum we've mentioned above. The minimal startup code would then contain:

interrupt vector table - this has to be located at a particular address - 0x0000'0000⁴ - and have at least the two words in it which are read in by the processor after reset as the initial content of the stack pointer and the program counter. While the former is usually a symbol defined in the linker script and may point anywhere in RAM (or actually nowhere, to be set in the reset routine later), the latter is the address of the reset routine, where execution starts. The rest of the vector table follows⁵, each word being the address of the Interrupt Service Routine (ISR) corresponding to one of the interrupts, as they are listed in the Interrupt chapter of the Reference Manual⁶.
ISR stubs - see footnote 8 here.
reset routine - the routine to which the second word in the vector table points, so that's where program execution starts. It should, at the very least:
- copy from FLASH to RAM initialization values for global/static variables, which in C program are explicitly initialized⁷. For this we need start address in RAM and in FLASH, and given area's size (or one of the end addresses, which is equivalent); these are not known at time of assembling the startup code so symbols (equivalent to variable names in C) are used for these purposes, which are then filled in by linker, according to the linker script,
- in a similar process, store zeros to global/static variables, which are not explicitly initialized (or are explicitly initialized to zero)⁸,
- jump to main()
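
The two memory initialization steps of the reset routine can be sketched in C as follows. This is only a host-runnable illustration, not the code of any real startup file: the arrays flash_init_values, ram_data and ram_bss and all function names are hypothetical stand-ins. In a real startup file the areas are not arrays but addresses given by linker-script symbols (in ST's files these are commonly named _sidata, _sdata, _edata, _sbss and _ebss), and the loops are usually written in asm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for memory regions. In a real startup file these
   would not be arrays but addresses provided by linker-script symbols. */
const uint32_t flash_init_values[3] = { 0x11111111u, 0x22222222u, 0x33333333u };
uint32_t ram_data[3] = { 0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu };  /* plays .data */
uint32_t ram_bss[4]  = { 0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu, 0xDEADBEEFu };

/* Copy initial values from their load address (FLASH) to their run address (RAM). */
void copy_data(const uint32_t *src, uint32_t *dst, const uint32_t *dst_end)
{
    assert(dst <= dst_end);          /* start/end symbols must be ordered */
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Store zeros over the area holding variables without an explicit initializer. */
void zero_bss(uint32_t *dst, const uint32_t *dst_end)
{
    assert(dst <= dst_end);
    while (dst < dst_end) {
        *dst++ = 0u;
    }
}

/* The two mandatory steps of the reset routine, before main() is called. */
void init_c_memory(void)
{
    copy_data(flash_init_values, ram_data, ram_data + 3);
    zero_bss(ram_bss, ram_bss + 4);
}
```

The RAM arrays are deliberately pre-filled with a "garbage" pattern, to mimic the undefined content of RAM after power-up; after init_c_memory() returns, ram_data holds the FLASH values and ram_bss is all zeros.
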
However, this startup code is still not in a "vacuum" and is part of a system that is already there. So, in usual STM32 startup codes (e.g. here for STM32F407), in the reset routine, after initializing the .data and .bss sections and before the jump to main(), we usually find two additional subroutine calls:
- __libc_init_array() - this is part of the "hooks" we've mentioned, which allow additional function calls before main(), see discussion below
- SystemInit() - this is ST's method to do exactly the same as __libc_init_array(), i.e. execute additional code (e.g. to set up external memories, higher clock frequencies etc.), but for some unknown reason bypassing the already established mechanism. It is also called before __libc_init_array(), so it can't enjoy whatever those "system/library" hooks bring in; OTOH this allows __libc_init_array() to enjoy whatever SystemInit() provides (in a way a chicken-egg problem).
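
The resulting order of steps in such a reset routine can be sketched as below. This is a host-runnable model only: all sketch_* functions and the call_trace buffer are hypothetical stand-ins that merely record the order of calls, and the real routine is normally written in asm. The order shown follows the description above; individual startup files may differ.

```c
#include <assert.h>
#include <string.h>

char call_trace[64];   /* records the order of the steps */

static void note(const char *step)
{
    /* room for the step name, the ';' separator and the terminating NUL */
    assert(strlen(call_trace) + strlen(step) + 2 <= sizeof call_trace);
    strcat(call_trace, step);
    strcat(call_trace, ";");
}

/* Hypothetical stand-ins; the real functions have different names and bodies. */
void sketch_SystemInit(void)      { note("SystemInit"); }
void sketch_libc_init_array(void) { note("init_array"); }
int  sketch_main(void)            { note("main"); return 0; }

void sketch_reset_routine(void)
{
    call_trace[0] = '\0';
    note("data");               /* copy .data initial values from FLASH to RAM */
    note("bss");                /* zero .bss */
    sketch_SystemInit();        /* ST's hook: clocks, external memories, ... */
    sketch_libc_init_array();   /* hooks of the compiler, libraries and user */
    (void)sketch_main();        /* on a real target, main() never returns */
}
```

Swapping the two hook calls in this model is all it takes to reproduce the ordering problem: whatever SystemInit() sets up would then not yet be available to the functions called from __libc_init_array().
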

asm vs. C

Some users - and even some toolmakers - like to (re)write the startup in C, instead of asm. This is probably to use a language more familiar to the developers.

But if the startup is code setting up things needed to run C properly, is it appropriate to run C code (the startup itself) without having set those things needed to run C properly? I leave this to be contemplated by the reader.

The same objection applies to any function called from the startup code using the described mechanisms; and of course, SystemInit(), too.

What's up with __libc_init_array()?

This function is defined in the standard C library accompanying gcc. Most Cortex-M-specific gcc bundles use newlib, maintained by Red Hat, as their standard C library, so we can look at its source. It's a relatively simple function, which in itself "does nothing". It goes through an array of function pointers, calling each of them in turn; then it calls the _init() function, and finally it goes through another array of function pointers. The symbols used to determine the position and size of those arrays come from the linker script, and they represent the boundaries of the .preinit_array and .init_array sections⁹ ¹⁰.
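
The behaviour just described can be modelled on a host as follows. This is a sketch, not newlib's source: the two tables are ordinary arrays standing in for the .preinit_array and .init_array sections (which in the real function are delimited by linker symbols such as __preinit_array_start/__preinit_array_end and __init_array_start/__init_array_end), and stub_init(), run_init_hooks() and the hook functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*init_fn)(void);

int    hook_log[8];     /* ids of the hooks, in the order they were called */
size_t hook_count;

static void record(int id)
{
    assert(hook_count < sizeof hook_log / sizeof hook_log[0]);
    hook_log[hook_count++] = id;
}

static void preinit_hook(void) { record(1); }
static void stub_init(void)    { record(2); }   /* stand-in for _init() */
static void init_hook_a(void)  { record(3); }
static void init_hook_b(void)  { record(4); }

/* Stand-ins for the two sections filled with function pointers. */
static const init_fn preinit_table[] = { preinit_hook };
static const init_fn init_table[]    = { init_hook_a, init_hook_b };

void run_init_hooks(void)
{
    size_t i;

    for (i = 0; i < sizeof preinit_table / sizeof preinit_table[0]; i++) {
        preinit_table[i]();          /* first array */
    }
    stub_init();                     /* then _init() */
    for (i = 0; i < sizeof init_table / sizeof init_table[0]; i++) {
        init_table[i]();             /* finally the second array */
    }
}
```

Note that the function itself has no idea what it calls; everything depends on what the compiler, the libraries and the linker script have placed into the tables.
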

The _init() function is generated by the compiler, is located in the .init section, and for some targets it contains/calls the constructors, instead of the already described function-pointer-array .init_array method. Why there are two competing mechanisms to perform the same task is unclear. It appears that _init() is effectively "empty" for the Cortex-M targets.

The main problem with __libc_init_array() is that it provides a capability rather than a particular functionality, so several parties may use this facility, with potential problems both when using it and when avoiding it. Also, this facility is used quite indiscriminately, with no catalogue, nor any agreed mechanism to create and maintain one. So, which "parties" could add something surprising to it?

the compiler - after all, these mechanisms are created mainly for it. One notable example is C++ constructors.
the standard library - it is intertwined and to certain extent overlaps with the compiler
any third-party library
the user himself - there's even a convenient mechanism for this, in the form of constructor function attribute
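
As an illustration of the last item, a minimal sketch of the constructor function attribute follows. The names early_hook, early_hook_ran and check_early_hook are made up for this example. On a host system the C runtime calls the hook before main(); on a Cortex-M target the same happens only if the startup code calls __libc_init_array() (or otherwise walks .init_array).

```c
#include <assert.h>

int early_hook_ran;   /* no initializer, so it lives in .bss and starts as 0 */

/* gcc emits a pointer to this function into .init_array, so that
   __libc_init_array() (or the host's C runtime) calls it before main(). */
__attribute__((constructor))
static void early_hook(void)
{
    early_hook_ran = 1;
}

/* To be called from main(): by then the hook must already have run. */
void check_early_hook(void)
{
    assert(early_hook_ran == 1);
}
```

If the __libc_init_array() call is removed from the startup code, such a function is still linked into the binary, but is silently never called.
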

Can __libc_init_array() and SystemInit() be removed?

Unfortunately, there is no good answer to the question above. Let's discuss some of the related issues, to allow for a more informed decision.

There are two reasons to remove __libc_init_array():

to have full control over what goes into the binary
to reduce total code size (and, less importantly, startup time)

One can argue that __libc_init_array(), and whatever is called from there, is written by presumably knowledgeable people aware of the potential risks. After all, this is part of the compiler suite, and we rely on that (i.e. the compiler and libraries) being correct anyway. However, most of the compiler and libraries are subject to scrutiny by a far bigger user and developer community than the relatively small userbase of one particular embedded target, or even of its particular incarnation (such as any given Cortex-M core; a slightly different libc binary, resulting from a different compile-time setup, is available for each of them, in several variants for some).

Removing the __libc_init_array() call from the startup file, together with using -nostartfiles (which also removes code which is referenced nowhere and is likewise a result of gcc attempting to provide a "complete environment"), decreases binary size by around 2kB. That is not negligible for smaller mcus.

By disassembling a simple compiled binary, one can judge whether __libc_init_array() does anything useful at all. For simple C projects (i.e. not C++), it appears that there are no functions called through .preinit_array, and only one function through .init_array, frame_dummy(), which sets up the exception mechanism (try-throw-catch in C++) and here effectively does nothing.

As an example of "what may happen if the project is not simple": if in Cortex-M4 the FPU is used (-mfloat-abi=hard -mfpu=fpv4-sp-d16), and the -ffast-math option is used (a non-standard-compliant option), an additional function pointer appears in the .init_array section, to call __arm_set_fast_math(), which switches on the FZ bit of the Floating-point Status and Control Register, FPSCR[24], enabling the non-IEEE754-compliant flush-to-zero mode.¹¹
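
The register operation involved, and the ordering hazard described in footnote 11, can be modelled on a host as below. This is not the source of __arm_set_fast_math(): struct sim_cpu and the sim_* functions are hypothetical, the registers are plain variables, and the "fault" is just a flag. The model assumes that the FPU is enabled by granting full access to coprocessors CP10 and CP11 in CPACR bits 20 to 23.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CPACR_CP10_CP11_FULL (0xFu << 20)   /* CPACR[23:20]: full access to the FPU */
#define FPSCR_FZ             (1u << 24)     /* FPSCR[24]: flush-to-zero mode */

/* Simulated registers; on a real Cortex-M4 these are hardware registers. */
struct sim_cpu {
    uint32_t cpacr;
    uint32_t fpscr;
    bool     fault;
};

/* Models the part of SystemInit() which enables the FPU. */
void sim_enable_fpu(struct sim_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->cpacr |= CPACR_CP10_CP11_FULL;
}

/* Models setting the FZ bit; touching FPSCR with the FPU disabled faults. */
void sim_set_fast_math(struct sim_cpu *cpu)
{
    assert(cpu != NULL);
    if ((cpu->cpacr & CPACR_CP10_CP11_FULL) != CPACR_CP10_CP11_FULL) {
        cpu->fault = true;
        return;
    }
    cpu->fpscr |= FPSCR_FZ;
}
```

Calling sim_set_fast_math() on a freshly zeroed struct sim_cpu sets the fault flag; calling sim_enable_fpu() first makes the same call succeed and leaves FPSCR at 0x01000000.
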

There may be compromise settings, which retain __libc_init_array() and its mechanisms, yet allow reduced code size; but developing them is hindered by the same lack of documentation as deciding on complete removal.

It's up to the user to judge the risks and benefits.

All the discussion above applies to SystemInit(), too; except the decision is simpler: it's a relatively straightforward piece of code with no external dependencies, so it can be easily scrutinized or reproduced by the user, whether in normal code (main) or in startup.

Disappointed? Sorry. And so am I. This is as much as the open-source community has to offer.

Views expressed here are personal, arguable. YMMV. Things change, accommodate.
All trademarks belong to their respective holders, and other legalese blah blah blah.
Comments are welcome, please email them to stm32 at efton dot sk.

1. "External" here means, something which is beyond the C program itself.
Startup code is also sometimes/traditionally marked as "crt" (or "crt0" or similar), standing for "C runtime".
In an operating system (also called hosted) environment, it is also expected that a program has an exit point, where it passes control back to its caller, the startup program, which is then supposed to "tear down" whatever resources it acquired at startup, and then return control to the operating system. Microcontrollers (freestanding applications) are not supposed to exit ever, so this does not apply to them and we won't discuss this aspect here. Should it be needed, in principle the mechanisms used for teardown/exit are the same as those used for startup/initialization, so users can easily infer the related functionality from these similarities.

2. As an illustration, this is how program startup is described in gcc's documentation. To be fair, that document describes it from a very different angle: from the perspective of somebody who writes a compiler and writes the functions needed to generate the code needed for the startup to work, rather than from the viewpoint of the user of the compiler. Nonetheless, that document is too vague and hints at a vast variability of solutions for different targets, so even prospective compiler writers must be confused and have to refer to the "real documentation", i.e. the source code, to get the complete information they may need.
Here is newlib/libc documentation. It has no mention of the init/fini mechanisms whatsoever.
Here is a relevant portion of picolibc's documentation, describing roughly the same issues as discussed in the second part of this article. Note that while it's by far the best of the available descriptions, it still avoids answering many of the vital questions, as they (code generated by gcc or by other libraries or by users) are beyond its control. Picolibc is a project derived from both newlib and AVR-LibC, intended to be a newlib replacement for embedded targets. Whether it gains traction and is adopted by ARM or ST for their embedded-targeted gcc bundles is questionable.

3. The multitude of C library variants (e.g. for targets with vs. without floating point hardware) in the gcc binary bundles, and the methods of selecting between them, is a similarly poorly documented area, and a rabbit hole too deep to be discussed at this point. Maybe later.

4. In most STM32, after reset, the user FLASH, normally at 0x0800'0000, is aliased to 0x0000'0000, so the vector table in the user program may simply be located at the beginning of the user FLASH. If the bootloader is invoked (the exact process depending on family, e.g. by pulling the BOOT0 pin high), a piece of ROM called System memory is aliased to 0x0000'0000 instead, so the processor picks the first two words from there and uses them to start the bootloader program.

5. In source form, these are labels (names) of the corresponding routines (assuming startup code is in asm; in C this would correspond to function pointers). These are colloquially called "vectors", hence the array of them is called the "vector table". As Cortex-M processors run exclusively in the 16-bit Thumb mode, i.e. instructions are 16-bit, in binary these addresses have the lowermost bit set, i.e. they are odd numbers. This is so because in ARM processors this bit is used to switch between the modes; so if the program counter is loaded with (= performs a jump to) an even address, the processor attempts to switch to the 32-bit mode, which in Cortex-M causes a Fault.
Fletcher J pointed out, that not all Thumb instructions are 16-bit and some Thumb-2 instructions used in Cortex-M3/4/7 are 32-bit. That's true; however, they are treated by processor as a multi-half-word instruction i.e. a pair of 16-bit words, rather than a single 32-bit instruction word. As a consequence, they are not required to be word-aligned. These instructions have also different encoding and function from the "native ARM" 32-bit instructions.
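
The "lowermost bit set" rule can be illustrated with a few lines of C. The helper names and the two addresses in sketch_vectors are made up for this example; in a real project the linker sets this bit by itself for any symbol that it knows to be a Thumb function, so the user never has to do this by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Turns the address of a routine into the value stored in the vector table. */
uint32_t thumb_vector(uint32_t routine_address)
{
    assert((routine_address & 1u) == 0u);   /* code is halfword-aligned */
    return routine_address | 1u;
}

/* A vector with bit 0 cleared would cause a fault when taken. */
int is_thumb_vector(uint32_t entry)
{
    return (entry & 1u) != 0u;
}

/* Hypothetical first two words of a vector table. */
const uint32_t sketch_vectors[2] = {
    0x20020000u,   /* initial stack pointer: a RAM address, not a vector */
    0x08000189u    /* reset routine located at 0x08000188, with bit 0 set */
};
```

The first word is not a code address, so the rule does not apply to it; only the second and following words are vectors.
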

6. Strictly speaking, only the first two words have to be at the address 0x0000'0000, from where the processor picks them. The rest of the table may reside at any address (subject to proper alignment, given by the size of the table), as before enabling any interrupt, the reset routine or application may change the content of the SCB_VTOR register, which determines from where the processor picks the vectors in case of an interrupt (with the exception of Cortex-M0, which does not have the SCB_VTOR register).
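
The alignment requirement can be estimated with the following sketch. It rests on assumptions which are not stated in this article and should be checked against the documentation of the particular core: that the table has 16 system entries followed by one entry per interrupt, that each entry is 4 bytes, that the table must be aligned to its size rounded up to the next power of two, and that the minimum alignment is 128 bytes. The function name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the assumed required alignment (in bytes) of a relocated vector
   table, for a device with the given number of interrupts. */
uint32_t vector_table_alignment(uint32_t n_interrupts)
{
    uint32_t table_bytes;
    uint32_t alignment = 128u;            /* assumed minimum alignment */

    assert(n_interrupts <= 1024u);        /* sanity bound for this sketch */
    table_bytes = (16u + n_interrupts) * 4u;
    while (alignment < table_bytes) {
        alignment <<= 1;                  /* next power of two */
    }
    return alignment;
}
```

For example, under these assumptions a device with 82 interrupts has a table of 98 words, i.e. 392 bytes, so the table has to start at a multiple of 512.
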

7. Explicitly initialized variables are, according to the ELF specification, placed into the .data section. This is one of the special sections explicitly listed by the specification, with the purpose of holding initialized data. Section names in ELF starting with a dot are reserved for the system; users should use section names without a leading dot.

8. This is called the .bss section, another "special ELF section".
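
From the C program's point of view, footnotes 7 and 8 translate into the following. The variable and function names are made up for this example, and the section placement given in the comments is what gcc does by default; it may be changed by compiler options or attributes.

```c
#include <assert.h>
#include <stdint.h>

uint32_t with_initializer = 0x12345678u;  /* .data: value copied from FLASH at startup */
uint32_t without_initializer;             /* .bss: zeroed at startup */
uint32_t zero_initializer = 0u;           /* gcc normally puts this into .bss, too */

/* What a C program is entitled to assume on entry to main(). */
void check_initial_values(void)
{
    assert(with_initializer == 0x12345678u);
    assert(without_initializer == 0u);
    assert(zero_initializer == 0u);
}
```

On a host these checks pass trivially, as the operating system's loader does the work. On a microcontroller they pass only because the reset routine has copied .data and zeroed .bss before calling main().
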

9. These two arrays are probably also a result of a chicken-egg type of problem, i.e. there was a need for functions to be run before the functions which are to be run before main(). It aptly illustrates how the gcc community solves problems.

10. There's also an analogous function for "tear down" after the program exits, __libc_fini_array(), calling functions through pointers in the .fini_array section and calling the _fini() function. As microcontroller programs never exit, these are irrelevant for our discussion and can be removed safely.

11. This, btw., is a good example of the chicken-egg problem between __libc_init_array() and SystemInit(): some older incarnations of ST-provided startup code enable the FPU in SystemInit(), but as that is called after __libc_init_array(), setting the FPU register in __arm_set_fast_math() leads to a fault. Newer ST-provided startup codes tend to call SystemInit() earlier on.