
Comparing bare-metal FW vs. RTOS for your project:
https://www.edn.com/electronics-blogs/embedded-basics/4441028/RTOS-or-bare-metal---five-decisive-factors
https://www.beningo.com/how-to-choose-between-bare-metal-rtos-and-gpos/#

An RTOS can give you faster response in exchange for reduced throughput; i.e., bare metal is good for doing a few simple things at a high rate, whereas an RTOS is better for handling many different kinds of work. Response requirements generally drive the choice of architecture.

Here is the Xilinx take:
A bare-metal software system typically does not need many of the features (such as networking) that an operating system provides. An operating system consumes some small amount of processor throughput and tends to be less deterministic than a simple software system. Some system designs cannot tolerate the overhead and reduced determinism of an operating system. As embedded processing speeds have continued to increase, the overhead of an operating system has become mostly negligible in many system designs. Some designers also choose not to use an operating system because of the added system complexity.

From the SAFERTOS folks: “Should we Use an RTOS?”
There are well-established techniques for writing good embedded software without the use of an RTOS. In some cases, these techniques may provide the most appropriate solution; however, as the solution becomes more complex, the benefits of an RTOS become more apparent.
These include:

  • Priority Based Scheduling: The ability to separate critical processing from non-critical is a powerful tool.
  • Abstracting Timing Information: The RTOS is responsible for timing and provides API functions. This allows for cleaner (and smaller) application code.
  • Maintainability/Extensibility: Abstracting timing dependencies and task based design results in fewer interdependencies between modules. This makes for easier maintenance.
  • Modularity: The task based API naturally encourages modular development as a task will typically have a clearly defined role.
  • Promotes Team Development: The task based system allows separate designers/teams to work independently on their parts of the project.
  • Easier Testing: Modular task based development allows for modular task based testing.
  • Code Reuse: Another benefit of modularity is that similar applications on similar platforms will inevitably lead to the development of a library of standard tasks.
  • Improved Efficiency: An RTOS can be entirely event driven; no processing time is wasted polling for events that have not occurred.
  • Idle Processing: Background or idle processing is performed in the idle task. This ensures that things such as CPU load measurement, background CRC checking, etc. will not affect the main processing.

More comparisons:

PXROS vs. SAFERTOS most important points of difference
- PXROS is larger in memory, with more features/utils
- PXROS does not copy for messaging, it transfers memory region ownership between tasks; SAFERTOS uses copy
- PXROS uses Messages/Mailboxes/Events while SAFERTOS uses Queues/Notifications/Semaphores for task sync/comm
- SAFERTOS can implement shared memory between tasks by passing pointers on a Queue
- PXROS claims to take advantage of Aurix HW architecture (vendors have close association)
- PXROS vendor might be able to provide better platform support (preferred design, Aurix expertise)
- SAFERTOS seems more platform generic in construction, and is ported to many more processors
- We have identified an ASIL D TCP/IP stack for SAFERTOS
- PXROS allows tasks to allocate memory dynamically, SAFERTOS does not
- PXROS runs micro-kernels on each core and the OS handles the internals
- SAFERTOS needs an independent instance on each core and extra communication management (does not support AMP or SMP)
- SAFERTOS is based on the open-source and very popular FreeRTOS kernel
- SAFERTOS delivered as source, PXROS delivered as libraries/headers
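The "shared memory via pointers on a Queue" point above can be sketched generically. This is an illustrative C fragment, not the SAFERTOS API: a small fixed-depth queue holds only buffer pointers, so the payload is never copied and the receiver effectively takes over ownership of the buffer.

```c
/* Minimal sketch (NOT the SAFERTOS API): a fixed-depth queue of buffer
 * pointers, illustrating zero-copy messaging where only the pointer is
 * queued, never the payload. */
#include <assert.h>
#include <stddef.h>

#define PTRQ_DEPTH 4

typedef struct {
    void  *slots[PTRQ_DEPTH];
    size_t head, tail, count;
} ptr_queue_t;

static int ptrq_send(ptr_queue_t *q, void *buf)
{
    if (q->count == PTRQ_DEPTH)
        return 0;                       /* queue full, caller retries */
    q->slots[q->head] = buf;
    q->head = (q->head + 1) % PTRQ_DEPTH;
    q->count++;
    return 1;
}

static void *ptrq_receive(ptr_queue_t *q)
{
    void *buf;
    if (q->count == 0)
        return NULL;                    /* nothing pending */
    buf = q->slots[q->tail];
    q->tail = (q->tail + 1) % PTRQ_DEPTH;
    q->count--;
    return buf;                         /* receiver now "owns" the buffer */
}
```

In a real system the send/receive would of course need the kernel's locking or a lock-free design; this only shows why passing pointers avoids the copy that a by-value queue performs.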

You can probably use bare metal if your application is fulfilling one purpose while handling one slow I/O stream. Beyond that, you want to consider an RTOS.

From The Embedded Muse contributor:
It seems to me that the author has not at all done justice to the possibilities of a cooperative scheduler on bare metal. A few years ago I implemented (on a 32 bit PIC) such a system which accepted and responded to serial commands via UART, simultaneously activated and controlled 3 stepper motors (with a constant acceleration profile to minimise forces on the load), while also reading a laser sensor, using averaged readings from which to find edges of objects in the path and report them back. All of this was achieved without an RTOS and has been extremely stable over several years of daily use. (In fact so solid that I would have to get the source code out to remember how I did a lot of it.)

The basis of such a system is of course a scheduler where all tasks MUST complete before the next timer tick occurs. I took the inspiration for mine from RIOS ( https://www.cs.ucr.edu/~vahid/rios/ ) which I then refined to add debug printing from a buffer in the "dead" zone, a "run" LED which also served as a simple load monitor, and other niceties. There are of course a few considerations with such a system:

1. Obviously all tasks must complete in the allotted time, or you have a fatal error.

2. Everything is single thread, so generally atomicity is only a problem for memory objects which are accessed in ISRs. The general rule to only write to shared buffers etc. from ONE (higher priority) source serves very well here, but specific cases do need to be thought through.

3. Tasks needing finer granularity than the timer tick (in my case, actual generation of pulses for the motors) can operate in higher priority interrupts, driven from additional timers or other sources.

4. If tasks are kept reasonably simple, determining worst case timing is pretty easy - you just measure the worst case for each task, take the worst case sum, and add the worst case time given to ISRs. This for me is the KEY advantage. (No worries about task switching overhead or when certain tasks might pop up and throw things out. They all run, every time.)

5. Obviously blocking in a task (or an ISR) is a complete no-no. Many tasks end up being little state machines.

Eventually the system becomes a group of state machines with separate responsibilities, which communicate via their APIs.
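The RIOS-style scheduler described above can be sketched in a few lines of C. The structure below is an assumed simplification (task table with periods in ticks, run-to-completion functions); here the timer tick is simulated by calling `scheduler_tick()` from a loop rather than from a timer ISR.

```c
/* Cooperative tick-scheduler sketch, loosely after RIOS
 * (https://www.cs.ucr.edu/~vahid/rios/): each task has a period in
 * ticks and a function that must run to completion before the next
 * tick arrives. */
#include <assert.h>

typedef struct {
    unsigned period;        /* ticks between releases   */
    unsigned elapsed;       /* ticks since last release */
    void (*fn)(void);       /* run-to-completion task   */
} task_t;

static unsigned fast_runs, slow_runs;
static void fast_task(void) { fast_runs++; }  /* e.g. poll sensor   */
static void slow_task(void) { slow_runs++; }  /* e.g. report status */

static task_t tasks[] = {
    { 1, 0, fast_task },    /* every tick       */
    { 5, 0, slow_task },    /* every fifth tick */
};
#define NUM_TASKS (sizeof tasks / sizeof tasks[0])

/* On hardware this body would run once per timer tick. */
static void scheduler_tick(void)
{
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        if (++tasks[i].elapsed >= tasks[i].period) {
            tasks[i].elapsed = 0;
            tasks[i].fn();  /* MUST complete before the next tick */
        }
    }
}
```

The worst-case timing argument from point 4 falls out directly: the tick budget must cover the sum of every task's worst-case execution time plus worst-case ISR time, because in the worst tick all tasks are released together.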

C lends itself to this approach quite well if you see each .c/.h pair as an "object". (These days I might have multiple files in a folder, with only one "external API", to allow for unit testing within a module using Ceedling.) In such a way, separation of concerns and modularity is quite achievable. In modern systems, for instance certain SoCs that combine several embedded processors with FPGA real estate, such an approach could be hugely powerful - one bare metal processor (with hardware support as needed) does the hard RT stuff, while the other perhaps runs embedded Linux or similar to provide networking and higher level tasks.

All in all it seems to me that this approach has many advantages over an RTOS, at least until a certain level of complexity, from the point of view of ensuring hard real time performance.

I find it a bit prosaic that Gliwa has entirely passed over such an approach in a book dedicated to the subject of timing in real time systems, and particularly the implication that anything more than the simplest tasks must use an RTOS. For me, one of the key factors in building systems successfully is to select the simplest and most appropriate techniques of "getting it done". I think that Gliwa has fallen back on the "defaults" of the auto industry here (well-proven though they may be) and bypassed a very important class of simpler solutions.


Terminology

Device stacks are all about making hardware peripherals available to application code through abstract and generic software interfaces. By placing more or fewer modules on the stack, you can choose the abstraction level you want to use in your application. The lowest-level modules are specific to a particular hardware device. On top of those, you can stack higher-level modules that provide more generic functionality to access the device. For example, at the higher, abstract level, you could choose to use a module to access a file system in your application. At the lower levels you can still select modules to decide which specific storage device you want to access (a hard drive, SD card, RAM drive, ...). Thus, the lower-level modules are more specific to a particular peripheral, while the higher-level modules are less hardware specific and can even be used in combination with multiple peripheral devices.
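One common way to realize this layering in C is an ops table of function pointers. The sketch below is illustrative (all names are made up, not from any real stack): higher-level code programs against the generic interface, while each backend (RAM drive here; an SD card would look the same) supplies the device-specific functions.

```c
/* Device-stack layering sketch: a generic block-device interface (the
 * "higher" module) over device-specific backends (the "lower" modules).
 * All names here are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    /* function pointers form the generic interface */
    int  (*read )(void *ctx, size_t off, void *buf, size_t len);
    int  (*write)(void *ctx, size_t off, const void *buf, size_t len);
    void *ctx;                       /* backend-specific state */
} block_dev_t;

/* Lower-level module: a RAM-drive backend. An SD-card backend would
 * provide the same two functions over real hardware. */
typedef struct { unsigned char mem[256]; } ram_drive_t;

static int ram_read(void *ctx, size_t off, void *buf, size_t len)
{
    memcpy(buf, ((ram_drive_t *)ctx)->mem + off, len);
    return 0;
}
static int ram_write(void *ctx, size_t off, const void *buf, size_t len)
{
    memcpy(((ram_drive_t *)ctx)->mem + off, buf, len);
    return 0;
}

/* Higher-level code sees only block_dev_t, never the backend type. */
static block_dev_t make_ram_dev(ram_drive_t *rd)
{
    block_dev_t d = { ram_read, ram_write, rd };
    return d;
}
```

A file-system module would then be written entirely against `block_dev_t`, which is exactly what lets you swap the storage device without touching the upper layers.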

Middleware is a term sometimes used to describe software that lies between the device or service layer and application software. A service that connects applications to a device of some type could itself be called middleware. A TCP/IP stack is sometimes called middleware. An RTOS itself is sometimes lumped in with middleware, given that it often operates above a target-specific platform abstraction layer that holds all the peripheral/driver modules.

Some RTOS vendors use the terms peripheral and device specifically: a peripheral would be the lowest-level module, while a driver would make use of the peripheral to provide services to an application. I like the convention that devices are off-chip ICs, while a peripheral is an interface module that's part of the SoC. Device drivers then make use of the lower-level peripheral drivers.

Whether or not you use an RTOS, be aware that an increasing number of threads will decrease run-time determinism.

Mechanisms

How does a context switch work?

In the case of the TriCore, the OS definitions code includes a struct for a special chunk of memory called the Context Switch Area (CSA). On a switch, the core registers are dumped to this memory block. Presumably there should be one for each task.

RTOS basics

SAFERTOS training materials: https://www.highintegritysystems.com/rtos/rtos-tutorials/

context switching

Does context switching (aside from ISRs) happen even if there is only one task? Possibly, because the scheduler may be periodically invoked by the kernel for housekeeping; it depends on the kernel's behavior when only one task is created.

interrupts

In real-time kernel based systems, the routines that service hardware interrupts are typically small and fast. Their main function is to capture or send data, and to notify the task scheduling kernel of any further processing required. The bulk of application processing is carried out by tasks running at the “background” level of the processor; i.e. its normal state when it is not executing an ISR. There are two general types of task scheduling in real-time kernel based systems:

  • Non-preemptive scheduling,
  • Preemptive scheduling (most commercial RTOSes).

multi-threading

ThreadX recommends: Stack size is always an important debug topic in multithreading. Whenever unexplained behavior is observed, it is usually a good first guess to increase stack sizes for all threads—especially the stack size of the last thread to execute!

If you want to end a thread that normally just runs an infinite loop, you can't use join alone because the master will wait forever. Instead, one approach is to have the thread check a global flag (such as an atomic) each time through the loop; the master sets the flag when it wants the thread to end, and then calls join.
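A minimal sketch of this stop-flag pattern, assuming POSIX threads and C11 atomics are available:

```c
/* Stop-flag pattern sketch: the worker polls an atomic flag each loop
 * iteration; the master sets the flag and only then joins, so join
 * cannot wait forever. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool stop_requested = false;
static unsigned long iterations;   /* touched only by the worker */

static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_requested)) {
        iterations++;              /* normal loop body goes here */
    }
    return NULL;                   /* loop exits once flag is set */
}

/* Returns 1 on a clean start/stop/join cycle, 0 on error. */
static int run_stop_flag_demo(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return 0;
    atomic_store(&stop_requested, true);  /* ask the worker to exit... */
    return pthread_join(tid, NULL) == 0;  /* ...then join returns */
}
```

The atomic is what makes the flag read/write safe without a mutex; a plain global would be a data race.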

re-entrant (reentrant) functions

A reentrant function is one that can be used safely by multiple threads at once. A recursive function must also be reentrant. ARM says "Code is reentrant if it can be interrupted in the middle of its execution and then be called again before the previous invocation has completed."

(Ganssle) A routine must satisfy the following conditions to be reentrant:

  • It never modifies itself. That is, the instructions of the program are never changed. Period. Under any circumstances.
  • Any variables changed by the routine must be allocated to a particular "instance" of the function's invocation. Thus, if reentrant function FOO is called by three different functions, then FOO's data must be stored in three different areas of RAM.

Testing for re-entrancy is not straightforward, but here are some ideas: http://www.ganssle.com/articles/areentra.htm

An example of a non-reentrant function is the string token function “strtok” found in the standard C library. This function remembers the previous string pointer on subsequent calls. It does this with a static string pointer. If this function is called from multiple threads, it would most likely return an invalid pointer.

Functions that are not reentrant must be protected from interrupts. One way this can be accomplished is to put a wrapper that disables and re-enables interrupts around the calls.

thread safety

Thread safety is a property that allows code to run in multithreaded environments by re-establishing some of the correspondences between the actual flow of control and the text of the program, by means of synchronization. These mechanisms can be used to provide safety:

  • re-entrancy
  • thread-local storage
  • immutable objects (e.g. read-only)
  • mutual exclusion
  • atomic operations
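Of the mechanisms above, mutual exclusion is the workhorse. A minimal sketch with POSIX threads (assuming pthreads is available): two threads increment a shared counter, and the mutex guarantees no increment is lost to interleaving.

```c
/* Mutual-exclusion sketch: a counter shared by two threads, protected
 * by a pthread mutex so read-modify-write increments never interleave. */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* critical section begins */
        counter++;
        pthread_mutex_unlock(&lock);  /* critical section ends   */
    }
    return NULL;
}

/* Runs two incrementing threads to completion and returns the total. */
static long run_two_threads(void)
{
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```

Without the lock, the same program would typically lose updates, since `counter++` compiles to separate load, add, and store steps.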

watchdog

A nice feature of embOS to allow setting up software watchdogs: https://www.segger.com/doc/UM01001_embOS.html#Watchdog

Multi-core

Information about OpenAMP, a framework for communication between OSes in a multi-core system. http://openamp.github.io/docs/linaro-2017/OpenAMP-Intro-Feb-2017.pdf

Some RTOSes build in handling of multi-core systems, while others don't. For example, PXROS, if configured correctly, will run a microkernel on each core to provide common system services and automate the communication of messages between processes even if they are on different cores. ThreadX requires you to set up and run an instance on each core and create an inter-core messaging system yourself.

Security Principles

Green Hills Integrity has been certified at the highest security rating: https://www.eetimes.com/document.asp?doc_id=1169789
Some keys: figuring out how to avoid running drivers in the security kernel, guaranteeing response time in user mode, not granting full privileges to new processes and not letting new processes rank their own priority or get access to full system resources.

Program Space

For example, a ThreadX application program might look like this in memory:

ROM/Flash
  • instruction area (machine code)
  • constant area (used to set up the RAM initialized data)
  • ...
RAM
  • initialized data
  • uninitialized data
  • stack

The two RAM data areas contain all the global and static variables.

I/O

General Practices

ISRs can be part of the peripheral driver, which allows non-blocking communication that does not stall the send/receive threads. ISRs could also be associated with the task (application code) instead.
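The usual way the ISR-in-the-driver pattern stays non-blocking is a ring buffer between the ISR and the thread. A sketch (ISR simulated here by a plain call; on hardware it would be the UART RX vector):

```c
/* ISR-to-task handoff sketch: the (simulated) UART RX ISR pushes bytes
 * into a ring buffer; the receive thread polls the buffer and never
 * blocks on the hardware. Indices are volatile because one side runs
 * in interrupt context. */
#include <assert.h>

#define RB_SIZE 16

static unsigned char     rb_data[RB_SIZE];
static volatile unsigned rb_head;   /* advanced by the ISR  */
static volatile unsigned rb_tail;   /* advanced by the task */

/* Called from the UART receive interrupt. */
static void uart_rx_isr(unsigned char byte)
{
    unsigned next = (rb_head + 1) % RB_SIZE;
    if (next != rb_tail) {          /* drop the byte if buffer full */
        rb_data[rb_head] = byte;
        rb_head = next;
    }
}

/* Called from the receive task; returns -1 when no data is pending. */
static int uart_rx_poll(void)
{
    if (rb_tail == rb_head)
        return -1;
    unsigned char byte = rb_data[rb_tail];
    rb_tail = (rb_tail + 1) % RB_SIZE;
    return byte;
}
```

With one producer (the ISR) and one consumer (the task), each side writes only its own index, which is what makes this single-reader/single-writer design safe without disabling interrupts.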

Latency Estimation Chart

I/O type     cycles
L1           3
L2           14
RAM          250
Disk         41,000,000
Network      240,000,000

QNX Intro

  • microkernel (µkernel)
  • requires an MMU (VxWorks does not), therefore does not work on Cortex M-series
  • no kernel space applications, just user space applications due to one API
  • hardware access by applications does not involve kernel?? (this does not seem exactly true, because there is a resource manager for every HW peripheral and it registers as such, and every request to use it goes through the OS)
  • OS does virtual memory, and each process has private address space
  • various scheduling options like sporadic, round-robin
  • drivers are architected as resource manager/servers, apps are clients
  • supports SMP with automatic load balancing by default, up to 32 cores; tasks can also be locked to specific cores
  • High Availability manager (a runtime monitor, or more aptly a smart watchdog); this is a separate application outside the kernel that is included with licensing
  • they appear to aim for POSIX compliance
  • has a chip-to-chip distributed message passing system, so different QNX nodes can easily reference resources on other chips
  • gnu-based compiler toolchain that is certified
  • file system structure very similar to Unix; by default use QNX6, a power safe file system; also support NFS, NTFS (RO)
  • have experience porting applications from Linux, ROS
  • Supports x86 and ARM architecture, including 64-bit for both
  • ASIL D
  • performance metrics can be captured live via Ethernet, or recorded
  • application sandboxing, whitelisting, resource access control are some of the security features

VxWorks

  • everything built into one binary, can’t do incremental upgrade, designed for infrequent upgrades
  • one single address space for all processes, for performance reasons
  • drivers are in kernel space, and kernel API is proprietary, less POSIX compliant

ThreadX

Here is an example application showing use of timers and queues: ThreadX Example Application

The ThreadX SMP solution was certified in 2018 for critical applications with 100% code coverage. This certification ensures compliance with the industrial safety standard IEC 61508 and all standards derived from it, including IEC 61508 SIL 4, IEC 62304 Class C, ISO 26262 ASIL D, and EN 50128 SW-SIL 4.

Thread Init

  1. Create a memory pool from which to allocate thread stacks
  2. Allocate the stack for the new thread from the pool, giving it the pointer from the memory pool allocation
  3. Create the thread with the stack pointer, a control block object, name, entry function, etc.
  4. Create queues, semaphores, events, mutexes as needed (note these are not pre-assigned to threads, but connected in the thread entry functions)
  5. Return from application define and kick off the kernel

Messaging

ThreadX also uses messages and queues/mailboxes for IPC, with notifications. Mailbox = single message Q. Create a Q to hold messages. Each message queue is a public resource. ThreadX places no constraints on how message queues are used. The memory area for buffering messages is specified during queue creation. Like other memory areas in ThreadX, it can be located anywhere in the target’s address space.
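The queue semantics described above (caller supplies the buffer area at creation, messages are a fixed size, a mailbox is a depth-one queue) can be sketched generically. This is NOT the tx_queue_* API, just an illustration of the same shape:

```c
/* Generic sketch of ThreadX-style queue semantics (not the ThreadX
 * API): the caller supplies the message buffer area at creation time,
 * messages are a fixed size and copied in/out, and a mailbox is just
 * a queue of depth one. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char *area;       /* caller-supplied buffer area  */
    size_t msg_size, depth;    /* fixed message size, capacity */
    size_t head, tail, count;
} msg_queue_t;

static void mq_create(msg_queue_t *q, void *area, size_t msg_size, size_t depth)
{
    q->area = area;            /* area can live anywhere in RAM */
    q->msg_size = msg_size;
    q->depth = depth;
    q->head = q->tail = q->count = 0;
}

static int mq_send(msg_queue_t *q, const void *msg)
{
    if (q->count == q->depth)
        return 0;                                   /* queue full */
    memcpy(q->area + q->head * q->msg_size, msg, q->msg_size);
    q->head = (q->head + 1) % q->depth;
    q->count++;
    return 1;
}

static int mq_receive(msg_queue_t *q, void *msg)
{
    if (q->count == 0)
        return 0;                                   /* queue empty */
    memcpy(msg, q->area + q->tail * q->msg_size, q->msg_size);
    q->tail = (q->tail + 1) % q->depth;
    q->count--;
    return 1;
}
```

Note the contrast with the PXROS comparison earlier: this design copies every message, which is simpler but costs throughput for large payloads.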

> How did ThreadX add SMP support?

This review only covers the headers/interface, not the code body.

Add time slice measure for each core, and a global interrupt active count

TIMER_DECLARE ULONG             _tx_timer_time_slice;
/*to*/
TIMER_DECLARE ULONG             _tx_timer_time_slice[TX_THREAD_SMP_MAX_CORES];

/* Define count to detect when timer interrupt is active.  */
TIMER_DECLARE ULONG             _tx_timer_interrupt_active;

Add a series of SMP-specific prototypes and variables

/* Define all internal SMP prototypes.  */
void        _tx_thread_smp_current_state_set(ULONG new_state);
UINT        _tx_thread_smp_find_next_priority(UINT priority);
void        _tx_thread_smp_high_level_initialize(void);
void        _tx_thread_smp_rebalance_execute_list(UINT core_index);

/* Define all internal ThreadX SMP low-level assembly routines.   */
VOID        _tx_thread_smp_core_wait(void);
void        _tx_thread_smp_initialize_wait(void);
void        _tx_thread_smp_low_level_initialize(UINT number_of_cores);
void        _tx_thread_smp_core_preempt(UINT core);

/* Define the ThreadX SMP scheduling and mapping data structures.  */
THREAD_DECLARE  TX_THREAD *                 _tx_thread_smp_schedule_list[TX_THREAD_SMP_MAX_CORES];
THREAD_DECLARE  ULONG                       _tx_thread_smp_reschedule_pending;
THREAD_DECLARE  TX_THREAD_SMP_PROTECT       _tx_thread_smp_protection;
THREAD_DECLARE  volatile ULONG              _tx_thread_smp_release_cores_flag;
THREAD_DECLARE  ULONG                       _tx_thread_smp_system_error;
THREAD_DECLARE  ULONG                       _tx_thread_smp_inter_core_interrupts[TX_THREAD_SMP_MAX_CORES];

System stack pointer, current and next execution thread pointers added for additional cores

THREAD_DECLARE  VOID * _tx_thread_system_stack_ptr;
/*to*/
THREAD_DECLARE  VOID * _tx_thread_system_stack_ptr[TX_THREAD_SMP_MAX_CORES];
etc
current_ptr, execute_ptr

Added an option to remap the system state struct and current thread pointer to be function calls instead.

ThreadX vs. ThreadX SMP in a multicore environment

Their advice is to use the standard, single-core ThreadX on as many cores as necessary if it is reasonable for the application to distribute the processing load. This mode of use is typically called Asymmetric Multiprocessing (AMP). If automatic load balancing is required because of the dynamic nature of the application, then ThreadX SMP (Symmetric Multiprocessing) is better. ThreadX SMP has more overhead and is more complicated in general. SMP also requires that all processors share the same cache-coherent memory space and that suitable inter-core locks and interrupts are available.

As for how to communicate with the cores in AMP, most customers write their own shared memory mechanism for inter-core communication. We have also done integration with OpenAMP, but certification is an issue. We are not sure how to do this since a majority of OpenAMP is open source and not under our control. There might also be coding issues and testing issues that would make it very difficult to certify.

As for ThreadX SMP, it isn't available on the TriCore. It would likely require NRE and considerable schedule as well. Again, if your application can load the processors, then we would recommend sticking with standard ThreadX anyway.

What's a BSP anyway?

https://www.windriver.com/products/bsp_web/what_is_a_bsp.pdf

The above is a nice write-up, explaining how this is a very imprecise term.

  • kernel's interface to drivers - A Board Support Package (BSP) provides a standardized interface between hardware and the operating system.
  • a set of HAL libraries for the kernel - The BSP enforces a modular design by isolating hardware-specific functionality into a set of libraries that provide an identical software interface to the hardware functions available on an embedded system.
  • host/target cross-development environment - The BSP provides support for developers using tools on a host computer for engineering development such as an editor, compiler, linker, and debugger through a client/server communications protocol between the developer's host computer and the embedded target’s CPU, hardware devices and on-board memory.

Timing

JG: A little bit of C code that looks quite deterministic probably makes calls to the black hole that is the runtime library, which is generally uncharacterized (in the time domain) by the vendor. Does that call take a microsecond or a week? No one knows.


Page last modified on January 23, 2024, at 12:26 PM