Build Flow

Preprocessor -> Compiler -> Assembler -> Linker
The first three are usually lumped together as the compile phase, leaving the link phase on its own. The compiler driver (gcc) converts .c/.h files to .o files; the linker (ld) converts .o/.a files to executables or relocatables.

The -c option flag for gcc will run the first three steps, but skip the linking.

.bin vs .elf

A bin file is a raw binary image with no relocation information, so it must be loaded at the specific memory address it was linked for. On some platforms it also carries a header (an IVT, for example) with explicit instructions on where to load it...

ELF stands for Executable and Linkable Format. An ELF file contains symbol tables and relocation tables, so a relocatable or position-independent ELF can be loaded at any memory address: all symbols used are adjusted by the offset of the address where it was loaded. ELF files usually have a number of sections, such as .data, .text, .bss, etc. It is within those sections that the loader or runtime linker adjusts the symbols' memory references.

An ELF file also contains the bin within; objcopy -O binary extracts it.
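To make the difference concrete, here is a minimal sketch that tells an ELF image apart from a raw bin by looking at the identification bytes at the start of the file. The function name elf_class_bits and the enum names are made up for this example; the magic bytes and the class byte at offset 4 come from the ELF specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the ELF identification (e_ident) layout. */
enum { EI_CLASS_OFFSET = 4, ELF_CLASS_32 = 1, ELF_CLASS_64 = 2 };

/* Returns 32 or 64 for a buffer that starts with a valid ELF
 * identification, or 0 for anything else (such as a raw .bin). */
int elf_class_bits(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 5)
        return 0;
    /* Magic: 0x7F 'E' 'L' 'F' */
    if (buf[0] != 0x7F || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return 0;
    if (buf[EI_CLASS_OFFSET] == ELF_CLASS_32)
        return 32;
    if (buf[EI_CLASS_OFFSET] == ELF_CLASS_64)
        return 64;
    return 0;
}
```

A raw bin has no such header: its first bytes are simply whatever lives at the load address, often a vector table.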

Object files

GCC adds symbols like _start and frame_dummy to the executable. _start is the actual beginning, even before main(). _start comes from crt1.o while _init comes from crti.o. These are not really libraries but small startup object files, usually written in assembly, that do pre-main init such as setting up interrupts and initializing the stack. The assembly source files have a .s extension (or .S when they need the C preprocessor).

Linker scripts (name.ld, name.ld.S, or sometimes even name.lsl as on the Aurix TriCore), also called ldscripts, are commands to the linker telling it where to place the symbols. This is where the load memory addresses are established; the address assigned to the .text section is where the code will be loaded. The main purpose of the linker script is to describe how the sections in the input files should be mapped into the output file, and to control the memory layout of the output file. Linker scripts are not meant to be modified by end users, only by toolchain developers, but sometimes you have to step in to make a fix. They can also define sections that hold init code to be run before start or main or anything else.

The most fundamental command of the ld command language is the SECTIONS command (see "Specifying Output Sections" in the ld manual). Every meaningful command script must have a SECTIONS command: it specifies a "picture" of the output file's layout, in varying degrees of detail. No other command is required in all cases.

The MEMORY command complements SECTIONS by describing the available memory in the target architecture. This command is optional; if you don't use a MEMORY command, ld assumes sufficient memory is available in a contiguous block for all output. See "Memory Layout" in the ld manual.

Here is the full syntax of a section definition, including all the optional portions:

SECTIONS {
...
secname start BLOCK(align) (NOLOAD) : AT ( ldadr )
  { contents } >region :phdr =fill
...
}

You can determine the name of the compiler/linker used by looking at the debug .map file. The .map file shows the addresses of the symbols, sizes of the functions, etc. It can be generated with the -Map=name.map linker flag (or -Wl,-Map=name.map when linking through gcc).

If you get a linker error saying "cannot move location counter backwards", you are probably exceeding your available memory. You could try adding the gcc flag -ffunction-sections and the ld flag --gc-sections (garbage collection of unused sections), but this may strip things that nothing references yet are still needed, such as important headers in your binary; wrapping those input sections in KEEP() in the linker script protects them. Also take a look at the size of your ELF and of all the libraries that are getting linked in.

Explain 'relocatable'

The GNU linker has an option -r for creating relocatable output (spelled "relocateable" in older ld documentation). Another name for this is partial or incremental linking.

-r
--relocatable
Generate relocatable output--i.e., generate an output file that can in turn 
serve as input to ld. This is often called partial linking.

A linker takes .o or .a files as input. It produces executable or relocatable output. Passing -r (or --relocatable) to ld creates an object that is suitable as input to ld. In the nominal use case, a linker receives relocatable object files (like ELF) and produces an executable of the same format (ELF). Warning: when an input file does not have the same format as the output file, partial linking is only supported if that input file does not contain any relocations.

A relocatable has no final address information for symbols, only offsets from the start of the section each symbol lives in. The linker moves blocks of bytes of your program to their run-time addresses. These blocks slide to their run-time addresses as rigid units; their length does not change and neither does the order of bytes within them. Such a rigid unit is called a section. Assigning run-time addresses to sections is called relocation. Apart from the text, data and bss sections you need to know about the absolute section. When the linker mixes partial programs, addresses in the absolute section remain unchanged.
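The sliding of a section to its run-time address can be sketched in a few lines. This is a simplified illustration, not how ld or any real ELF relocation type is implemented: struct reloc and apply_relocs are hypothetical, and the only relocation modelled is "add the section's base address to a 32-bit slot".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified relocation record: the byte offset of a
 * 32-bit address slot inside the section image. */
struct reloc { size_t slot_offset; };

/* Adds `base` to every slot listed in `relocs`. The image keeps its
 * length and byte order; only the listed address slots change. */
void apply_relocs(uint8_t *image, size_t image_len,
                  const struct reloc *relocs, size_t n, uint32_t base)
{
    for (size_t i = 0; i < n; i++) {
        size_t off = relocs[i].slot_offset;
        uint32_t value;
        assert(off + sizeof value <= image_len);
        memcpy(&value, image + off, sizeof value); /* section-relative */
        value += base;                             /* now a run-time address */
        memcpy(image + off, &value, sizeof value);
    }
}
```

Before relocation a slot holds something like 0x10 (an offset into the section); after the section is assigned base 0x80000000 the same slot holds 0x80000010.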

With an incremental link, you can leave "undefined references" in the code because it is presumed they will be resolved at the final linking.

Dependencies

Many build systems write automatically detected make dependencies into a .d file (gcc can generate one with -MMD). In particular, for C/C++ source files they determine which #include files are required and automatically generate that information into the .d file. It contains a list of targets and dependencies like

foo.o : foo.h bar.h biz.h

Newer C++ Versions

If you want to compile against a newer C++ standard, you'll need to specify the version on the compile command, e.g. g++ -std=c++11 <yada-yada>

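Compile-time computations (constexpr, for example) produce results that typically end up in the .rodata section. The new static_assert is also evaluated at compile time, but it is only a check and emits no data at all.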
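A small sketch of the same idea, written in C11 to match the rest of these notes (in C++11 the keyword is static_assert and the table could be constexpr). The names TABLE_SIZE, squares and square_of are invented for the example, and the .rodata placement is what toolchains usually do rather than a guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Constant expression: folded by the compiler, no run-time code. */
enum { TABLE_SIZE = 4 };

/* Read-only data: toolchains usually place this in .rodata. */
static const uint16_t squares[TABLE_SIZE] = {0, 1, 4, 9};

/* Checked at compile time; emits no code or data. */
_Static_assert(sizeof(squares) == TABLE_SIZE * sizeof(uint16_t),
               "table size mismatch");

/* Run-time lookup into the compile-time table. */
uint16_t square_of(unsigned n)
{
    assert(n < TABLE_SIZE);
    return squares[n];
}
```

The .map file or objdump -h on the object is one way to check where the table actually landed.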

Libraries

A .a file is a static library. The Unix archive format is a collection of relocatable object files with a header for size and location descriptions. The general rule for libraries is to be at the end of the linking command line, otherwise references may not be resolved because the definition (library) is read before the call (module).
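As a rough illustration of the archive format, the sketch below recognizes a .a file by its global header. The 8-byte magic string "!<arch>\n" is the standard start of a Unix ar archive; the macro and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Every Unix ar archive starts with this 8-byte global header. */
#define AR_MAGIC "!<arch>\n"
#define AR_MAGIC_LEN 8

/* Returns 1 if the buffer starts like a static library, else 0. */
int looks_like_archive(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    return len >= AR_MAGIC_LEN && memcmp(buf, AR_MAGIC, AR_MAGIC_LEN) == 0;
}
```

The member headers and the relocatable object files themselves follow that magic string.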

Good explanation from over on stack overflow:
Shared libraries are .so (or in Windows .dll, or in OS X .dylib) files. All the code relating to the library is in this file, and it is referenced by programs using it at run-time. A program using a shared library only makes reference to the code that it uses in the shared library.

Static libraries are .a (or in Windows .lib) files. All the code relating to the library is in this file, and it is directly linked into the program at compile time. A program using a static library takes copies of the code that it uses from the static library and makes it part of the program. [Windows also has .lib files which are used to reference .dll files, but they act the same way as the first one].

There are advantages and disadvantages in each method. Shared libraries reduce the amount of code that is duplicated in each program that makes use of the library, keeping the binaries small. It also allows you to replace the shared object with one that is functionally equivalent, but may have added performance benefits without needing to recompile the program that makes use of it. Shared libraries will, however have a small additional cost for the execution of the functions as well as a run-time loading cost as all the symbols in the library need to be connected to the things they use. Additionally, shared libraries can be loaded into an application at run-time, which is the general mechanism for implementing binary plug-in systems.

Static libraries increase the overall size of the binary, but it means that you don't need to carry along a copy of the library that is being used. As the code is connected at compile time there are not any additional run-time loading costs. The code is simply there.

Personally, I prefer shared libraries, but use static libraries when needing to ensure that the binary does not have many external dependencies that may be difficult to meet, such as specific versions of the C++ standard library or specific versions of the Boost C++ library.

The general recommendation is to prefer dynamic linking when possible. Note that there are problems statically linking some libraries with some compilers on certain platforms. For example, the pthread library fails silently in some circumstances, and if you want to statically link it you need to make sure to use the --whole-archive flag. (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52590, https://stackoverflow.com/questions/7090623/c0x-thread-static-linking-problem)

How can I know which dependencies an executable has?

Use readelf -d <bin> and look for the NEEDED entries.

How can I know the arch target for a .a archive file?

The file command won't work here. Use readelf -h <archive>.a | grep 'Class\|File\|Machine'.

Where does the loader search for dependencies on running a program?

The rpath designates the run-time search path hard-coded in an executable file or library. Dynamic linking loaders use the rpath to find required libraries. The rpath is revealed with readelf -d <prog>

objdump

The meaning of the columns for -T (dynamic symbol table)

COLUMN ONE: the symbol's value
COLUMN TWO: a set of characters and spaces indicating the flag bits that are set on the symbol.
There are seven groupings, listed below:

group one: (l,g,,!) local, global, neither, both.
group two: (w,) weak or strong symbol.
group three: (C,) symbol denotes a constructor or an ordinary symbol.
group four: (W,) symbol is warning or normal symbol.
group five: (I,) indirect reference to another symbol or normal symbol.
group six: (d,D,) debugging symbol, dynamic symbol or normal symbol.
group seven: (F,f,O,) symbol is the name of function, file, object or normal symbol.

COLUMN THREE: the section in which the symbol lives;
ABS means not associated with any section, UND means referenced but not defined in this file
COLUMN FOUR: the symbol's size or alignment.
COLUMN FIVE: the symbol's version string (for example GLIBC_2.2.5), when version information is present
COLUMN SIX: the symbol's name.

I've seen binaries bereft of .text sections, presumably because they are getting everything from a dynamic lib. An object file written by the GNU assembler has at least three sections, any of which may be empty. These are named the .text, .data and .bss sections. You may allocate address space in the .bss section, but you may not dictate data to load into it before your program executes. When your program starts running, all the contents of the .bss section are zeroed bytes.
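In C terms, the .data/.bss split looks like the sketch below. The variable and function names are invented for the example, and the section placement described in the comments is the usual toolchain behaviour, not something the C language itself guarantees.

```c
#include <assert.h>
#include <stddef.h>

/* No initializer: normally placed in .bss, which takes no space in
 * the file and is zeroed before main() runs. */
static unsigned char scratch[256];

/* Explicit non-zero initializer: normally placed in .data, so the
 * value 5 is stored in the file. */
static int counter = 5;

/* Bounds-checked read of the .bss buffer. */
unsigned char scratch_at(size_t i)
{
    assert(i < sizeof scratch);
    return scratch[i];
}

int initial_counter(void)
{
    return counter;
}
```

Running size on the resulting object shows the 256 bytes counted under bss rather than data.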

Starting and Running

The very beginning is not main... instead the linker script has a command to identify the actual starting label:
ENTRY(_start) for example.

What goes on once you get into the _start function? For the Aurix TriCore:

disable watchdog
set CPU special register pointers for interrupt vector table, trap table, stacks
load other address registers with needed pointers
init the context switch area
other CPU init

Note with some processors you can specify a program start location using a boot mode index defined in a boot mode header file - just a compiled C file of structures and pointers to form into an image to place into memory. It has its own location in flash and this is given by the processor architecture and specified in the linker script. On reset, with the right boot mode pins, the processor will look to that flash location to load a program counter.

Here is the sequence as the program counter is first given the address of the start location in instruction memory (could be RAM or flash):
start section
init section (pre-main)
text section (where main lives)
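The start, init, main ordering can be simulated on a host machine. This is only a sketch: the arrays stand in for flash and RAM regions that a linker script would normally define, and start_sketch and app_main are hypothetical names. On a real target the init step is usually assembly or a vendor routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for regions a linker script would define. */
static const unsigned char flash_data_image[4] = {1, 2, 3, 4}; /* load copy of .data */
static unsigned char ram_data[4];                              /* run-time .data */
static unsigned char ram_bss[8] = {0xAA, 0xAA};                /* pretend RAM is dirty */

/* Stands in for main(): relies on .data being copied and .bss zeroed. */
static int app_main(void)
{
    return ram_data[2] + ram_bss[0];
}

/* Pre-main init in the order listed above, then hand off to main. */
int start_sketch(void)
{
    assert(sizeof ram_data == sizeof flash_data_image);
    memcpy(ram_data, flash_data_image, sizeof ram_data); /* init: copy .data */
    memset(ram_bss, 0, sizeof ram_bss);                  /* init: clear .bss */
    return app_main();                                   /* text: run main */
}
```

If the init step were skipped, app_main would see stale RAM contents instead of the initializers the C code asked for.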

Run Time

What's with this term, "runtime"?

I think it's often a poorly used term. It literally means run or execution time, but is often used as a lazy replacement of the more descriptive terms runtime library, runtime system, or runtime environment. Runtime is not a thing, it's a time.

The C runtime library is small and distinct from the C standard library. It does define the stdlib interface, but the implementation of those functions must be added. A runtime library is always specific to the platform and compiler, so it is hardware-dependent. GCC provides libgcc. The runtime environment consists of environment variables, which are accessed via the runtime system. The runtime system also includes the stack management instructions.

Also from Wikipedia:

"Most programming languages have some form of runtime system that provides an environment in which programs run. This environment may address a number of issues including the management of application memory, how the program accesses variables, mechanisms for passing parameters between procedures, interfacing with the operating system, and otherwise. The compiler makes assumptions depending on the specific runtime system to generate correct code. Typically the runtime system will have some responsibility for setting up and managing the stack and heap, and may include features such as garbage collection, threads or other dynamic features built into the language."

"As a simple example of a basic runtime, the runtime system of the C language is a particular set of instructions inserted into the executable image by the compiler. Among other things, these instructions manage the processor stack, create space for local variables, and copy function-call parameters onto the top of the stack."

Another distinction to make is that a compiled object file contains only the machine code of the functions, while an executable binary also contains the runtime environment implementation. The object files depend on that environment. There is actually a hierarchy of runtime systems in a complex machine, with the microcode/logic of the CPU itself being the lowest-level runtime system.

For ARM (and others?), there is a file of assembly instructions called crt0.s that sets up the runtime environment by initializing stack pointers for things like IRQ, FIQ, services, etc. Its final instruction is a branch to main, so it executes before main. http://en.wikipedia.org/wiki/Crt0

Aurix C Runtime Environment Init

    /* Initialization of C runtime variables */
    Ifx_Ssw_C_Init();

IFX_SSW_INLINE void Ifx_Ssw_C_InitInline(void)
{
    Ifx_Ssw_CTablePtr pBlockDest, pBlockSrc;
    unsigned int      uiLength, uiCnt;
    unsigned int     *pTable;
    /* clear table */
    pTable = (unsigned int *)&__clear_table;

    while (pTable)
    {
        pBlockDest.uiPtr = (unsigned int *)*pTable++;
        uiLength         = *pTable++;

        /* we are finished when length == -1 */
        if (uiLength == 0xFFFFFFFF)
        {
            break;
        }

        uiCnt = uiLength / 8;

        while (uiCnt--)
        {
            *pBlockDest.ullPtr++ = 0;
        }

        if (uiLength & 0x4)
        {
            *pBlockDest.uiPtr++ = 0;
        }

        if (uiLength & 0x2)
        {
            *pBlockDest.usPtr++ = 0;
        }

        if (uiLength & 0x1)
        {
            *pBlockDest.ucPtr = 0;
        }
    }

    /* copy table */
    pTable = (unsigned int *)&__copy_table;

    while (pTable)
    {
        pBlockSrc.uiPtr  = (unsigned int *)*pTable++;
        pBlockDest.uiPtr = (unsigned int *)*pTable++;
        uiLength         = *pTable++;

        /* we are finished when length == -1 */
        if (uiLength == 0xFFFFFFFF)
        {
            break;
        }

        uiCnt = uiLength / 8;

        while (uiCnt--)
        {
            *pBlockDest.ullPtr++ = *pBlockSrc.ullPtr++;
        }

        if (uiLength & 0x4)
        {
            *pBlockDest.uiPtr++ = *pBlockSrc.uiPtr++;
        }

        if (uiLength & 0x2)
        {
            *pBlockDest.usPtr++ = *pBlockSrc.usPtr++;
        }

        if (uiLength & 0x1)
        {
            *pBlockDest.ucPtr = *pBlockSrc.ucPtr;
        }
    }
}

What is a dynamic run-time environment?

IBM says it's the idea of having different libraries attached to the user portion of the "Runtime". What dynamically changes is library lists. Other than that, this term is unclear. Not much source material discussing it.

Demystifying the Stack Terminology

user stack = sometimes used to refer to a user interrupt stack, i.e. an interrupt stack that is only for one core, in its own memory
kernel stack =
interrupt stack = a.k.a. istack; when an interrupt is taken, this can be set to the shared global stack or a core-specific user stack
shared global stack = same as interrupt stack, but in a place in core0 memory where everyone can share, really shared global interrupt stack

A thread consists of a user stack and a kernel stack.

Interrupt stacks are associated on a per-processor basis, and are only used while the kernel is currently using that particular CPU. When an (external) interrupt happens, the kernel switches to the interrupt stack, since this saves reserving more space on the kernel stack of the associated thread.

What is the frame pointer?

AKA base pointer. When a function is invoked, its prologue typically saves the caller's frame pointer on the stack and then sets the frame pointer register to the base of the new frame. Locals and arguments are then referenced relative to the frame pointer. Strictly speaking, it is not necessary, because the stack pointer can serve as the anchor instead; gcc gives you the -fomit-frame-pointer option, for example. Within a function the FP remains constant while the SP moves around as the stack grows and shrinks.

The -fomit-frame-pointer option instructs the compiler to not store stack frame pointers if the function does not need it. You can use this option to reduce the code image size. The -fno-omit-frame-pointer option instructs the compiler to store the stack frame pointer in a register.

What is RTL?

RTL = register transfer language, GCC's intermediate representation just above assembly, a kind of architecture-independent assembly.
In digital HW design, RTL means register-transfer level: a lower-level, more explicit way of writing HDL logic, with slightly different syntax.

Aurix TriCore Predefined Program Sections

Default sections
.text           Section for instructions (code)
.data           Initialized data
.bss            Non-initialized data
.rodata         Read-only data
.version_info   Information on the compiler and options used to build the module

Small addressable sections
.sdata          Initialized data addressable via the small data area pointer (%a0)
.sbss           Non-initialized data addressable via the small data area pointer (%a0)
.srodata        Read-only data that can be small addressed

Absolute addressable sections
.zdata          Initialized data that are absolute addressable
.zbss           Non-initialized data, absolute addressable
.zrodata        Read-only data that can be absolute addressed

PCP sections
.pcptext        PCP code section
.pcpdata        PCP data section

C++ sections
.eh_frame       Exception handling frame for C++ exceptions
.ctors          Section for constructors
.dtors          Section for destructors

Debug sections
.debug_<name>   Various debug sections

System Design: Resource Estimates

It turns out this is a very difficult problem without a lot of direct experience with the target architecture, compiler, software itself (drivers/modules/apps), libraries involved, and requirements.

Some good thoughts on RAM usage estimates: https://electronics.stackexchange.com/questions/140116/how-do-you-determine-how-much-flash-ram-you-need-for-a-microcontroller

Measuring "software size" typically uses one of two approaches:

Count the source lines of code (SLOC) written to implement the software requirements.
Measure the size of the requirements for the software. (This is the valuable one, as it gives you foreknowledge in the design phase.)

MMU, Translation, Program Memory

In a 32-bit system you have 4GB of virtual address space to play with. But there may only be a smaller amount of RAM. A translation table is used by the MMU to perform the mapping from virtual addresses to physical addresses.
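A table lookup of this kind can be sketched as follows. It assumes a single-level table of 1MB sections in the style of the ARM short-descriptor format (low two bits 0b10 mark a section entry, an all-zero entry faults) and ignores attribute bits, supersections and second-level tables. The names map_section and translate are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SHIFT     20u      /* 1MB sections */
#define NUM_SECTIONS      4096u    /* 4GB / 1MB */
#define DESC_TYPE_MASK    0x3u
#define DESC_TYPE_SECTION 0x2u     /* section entry marker */

/* First-level table; an all-zero entry is a translation fault. */
static uint32_t table[NUM_SECTIONS];

/* Maps one 1MB-aligned virtual section onto a physical section. */
void map_section(uint32_t va, uint32_t pa)
{
    assert((va & 0xFFFFFu) == 0 && (pa & 0xFFFFFu) == 0);
    table[va >> SECTION_SHIFT] = pa | DESC_TYPE_SECTION;
}

/* Returns 0 and stores the physical address on success, -1 on a fault. */
int translate(uint32_t va, uint32_t *pa_out)
{
    uint32_t entry = table[va >> SECTION_SHIFT];
    if ((entry & DESC_TYPE_MASK) != DESC_TYPE_SECTION)
        return -1;
    /* Top 12 bits come from the table, low 20 bits from the address. */
    *pa_out = (entry & 0xFFF00000u) | (va & 0x000FFFFFu);
    return 0;
}
```

With map_section(0x80000000, 0x00000000), virtual address 0x80001234 translates to physical 0x00001234, while an access to the unmapped section at 0x0 faults.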

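Here's readelf output from a 32-bit system application that has been placed at the 2GB location of 0x8000.0000. Note that entry [0] at address 0 is the mandatory null section header that every ELF file carries, not a real section; the NULL exception region at the origin comes from the translation tables described below.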

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        80000000 008000 0a7fa4 00  AX  0   0 64

The application translation tables reserve an invalid/inaccessible region of memory at 0x0 (it may be cacheable for the bootloader). It is small in size, for example 1MB, just because that's the minimum size of a region: one section entry.

In this case both the bootloader and application translation tables have been altered to indicate an accessible DDR cacheable region up at 0x8000.0000, and this is translated through the FPGA and AXI interface to the 0x0000.0000 location in off-chip DDR memory accessible through the processor system's DDR interface. In other words it goes out to the FPGA and then back before going to RAM.

Table entries:

/* BOOTLOADER */
.rept	0x0100			/* 0x80000000 - 0x8fffffff (DDR cacheable, virt mapping to ECC-prot) */
.word	SECT + 0x15de6		/* S=b1 TEX=b101 AP=b11, Domain=b1111, C=b0, B=b1 */
.set	SECT, SECT+0x100000
.endr

/* APP */
.rept	CPU0_CACHEABLE_PAGES/16			/*  (DDR Cacheable) */
.rept	16
#if XPAR_CPU_ID==0
.word	SECT + 0x45de6		/* 16MB page, S=b0 TEX=b101 AP=b11, Domain=b1111, C=b0, B=b1 */
#else
.word	SECT + 0x0			/* invalid, generates a translation fault */
#endif
.endr
.set	SECT, SECT+0x01000000
.endr

Uncached Memory

When the linker script sets up memory regions, stack and heap, etc, you can also designate an uncached region and the umalloc() function can actually allocate space there instead of in the cached heap that's part of CPU memory.

ARM example:

MEMORY
{
  /* ... */
  UNCACHED_BASEADDR : ORIGIN = 0x20000000, LENGTH = 0x0A000000
  /* ... */
}

/* ... */

uncached (NOLOAD) : {
  _uncached_start = .;
  . += _UNCACHED_END - _UNCACHED_START;
  _uncached_end = .;
} > UNCACHED_BASEADDR
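The implementation of umalloc() is not shown in these notes, so the following is only a guess at the simplest thing that could sit on top of such a region: a bump allocator. The static array stands in for the memory between _uncached_start and _uncached_end, and uncached_alloc is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the region between _uncached_start and _uncached_end. */
static _Alignas(8) unsigned char uncached_region[64];
static size_t uncached_used;

/* Hypothetical bump allocator: hands out 8-byte-aligned chunks and
 * never frees. Returns NULL when the region is exhausted. */
void *uncached_alloc(size_t size)
{
    size_t rounded = (size + 7u) & ~(size_t)7u;
    assert(uncached_used <= sizeof uncached_region);
    /* The first test catches overflow when rounding up huge sizes. */
    if (rounded < size || rounded > sizeof uncached_region - uncached_used)
        return NULL;
    void *p = &uncached_region[uncached_used];
    uncached_used += rounded;
    return p;
}
```

On a real target the array would be replaced by pointers built from the linker symbols, for example by declaring extern char _uncached_start[], _uncached_end[];.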