Crash course on GPU CUDA development

  • IPC = instructions per clock

Each GPU has multiple streaming multiprocessors (SMs), each with many cores that are individually simple but identical in function. A GPU clock is typically slower than a CPU clock because of heat and power concerns with so many cores. A GPU has no scheduler and no system calls, which makes it more efficient.

The paradigm for a compute system is that the CPU is the host and the GPU is the device. GPUs are a set of slave peripherals managed by the host CPU via the CUDA API, with CUDA acting like system calls under the hood, so not unlike an FPGA controlled by a driver. Most CUDA API calls (memory copies, for example) are synchronous by default, so you know things happen in the right order; kernel launches are the notable asynchronous exception. Functions can be tagged with __global__, __host__, or __device__.
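A minimal sketch of the three qualifiers (the function names are made up for illustration):

    // __device__: callable only from GPU code
    __device__ float square(float x) { return x * x; }

    // __global__: a kernel; runs on the GPU, launched from the host
    __global__ void squareAll(int n, float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = square(data[i]);
    }

    // __host__: an ordinary CPU function (the default when untagged)
    __host__ void prepareInput() { /* runs on the CPU */ }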

The Nvidia Orin has a 12-core Arm CPU. Unified memory is accessible by both CPU and GPU (on Jetson modules they share the same physical DRAM), and you can do a malloc on it; a host allocation and a GPU allocation are one and the same.
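A hedged sketch of unified memory via cudaMallocManaged (the kernel and sizes are illustrative): one allocation is touched first by the CPU, then by the GPU, then by the CPU again.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void doubleAll(int n, float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));  // one pointer, valid on host and device
        for (int i = 0; i < n; i++) x[i] = 1.0f;   // CPU writes
        doubleAll<<<(n + 255) / 256, 256>>>(n, x); // GPU reads and writes the same pointer
        cudaDeviceSynchronize();                   // wait before the CPU reads again
        printf("x[0] = %f\n", x[0]);               // prints 2.0
        cudaFree(x);
        return 0;
    }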

When you run a kernel, you specify the number of blocks and threads to use (a trivial <<<1, 1>>> launch also works; it just runs a single thread). What are blocks and threads? In the memory layout, you have multiple blocks of threads:
| block of threads | block of threads | ... |

What's a kernel?
It runs on the GPU; think of it as a function executed in parallel by many threads at once. The command to run it comes from the CPU.
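A minimal launch sketch (the kernel name, the sizes, and the device pointer d_out are illustrative; d_out is assumed to hold at least 1024 floats):

    __global__ void fill(float *out, float value) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = value;            // each thread writes exactly one element
    }

    // Host side: <<<4, 256>>> launches 4 blocks of 256 threads = 1024 threads
    fill<<<4, 256>>>(d_out, 3.14f);
    cudaDeviceSynchronize();       // the launch returns immediately; wait here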

What is a stride?
It defines how far a thread jumps to reach the next piece of data to work on, determined by how many blocks and threads were launched. The analogy is the step of a for loop: with i += 4, the stride is 4.
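This is the textbook grid-stride loop; a sketch (the kernel and array names are illustrative). The stride is the total number of threads in the grid, so the same kernel covers any n regardless of launch configuration:

    __global__ void add(int n, const float *x, float *y) {
        int index  = blockIdx.x * blockDim.x + threadIdx.x; // this thread's first element
        int stride = blockDim.x * gridDim.x;                // total threads in the grid
        for (int i = index; i < n; i += stride)
            y[i] = x[i] + y[i];    // each thread handles every stride-th element
    }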

What is a stream?
A stream helps with parallelization by making work submission asynchronous: operations in different streams can overlap, while operations within a single stream run in order. You can also wait until an individual stream is done.
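A hedged sketch of two streams overlapping (d_a and d_b are assumed to be pre-allocated device arrays of n floats):

    __global__ void scale(int n, float *x, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    // Host side: launches in different streams may overlap on the device
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    scale<<<64, 256, 0, s1>>>(n, d_a, 2.0f);
    scale<<<64, 256, 0, s2>>>(n, d_b, 0.5f);
    cudaStreamSynchronize(s1);     // wait for stream s1 only
    cudaDeviceSynchronize();       // wait for all outstanding GPU work
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);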

Thoughts on GPU Research

AMD Big Navi and Nvidia Ampere graphics cards, and maybe one of these days we'll even need to add Intel Arc into the mix.

Tom's Hardware gives a slight nod to AMD Radeon over Nvidia GeForce in performance, but different evaluators value different things. Tom's Hardware article about Nvidia in space:

https://www.tomshardware.com/news/nvidias-jetson-ai-board-is-ready-to-go-to-space

https://link.springer.com/article/10.1007/s12567-020-00321-9

In addition, a half CubeSat volume unit (0.5U) (10 × 10 × 5 cm³) cloud computing solution, called SpaceCloud iX5100, based on AMD 28 nm APU technology, is presented as an example of a heterogeneous computing solution. An evaluation of the AMD 14 nm Ryzen APU is presented as a candidate for future advanced onboard processing for space vehicles. The authors have explored edge computing, and especially onboard AI data processing, since 2013, leading up to a scalable radiation-tolerant heterogeneous architecture first implemented using an AMD 1st-generation (28 nm) G-series System-on-Chip (SoC) paired with a MicroSemi FPGA on an input/output (IO)-expanded industrial Qseven form-factor board [6]. AMD denotes their SoCs as accelerated processing units (APUs). This paper expands on the previous work to include a full heterogeneous computer architecture also for the AMD 2nd-generation (28 nm) G-series SoC, the AMD R-series (28 nm) SoC, and the latest AMD V1000-series (14 nm) SoC.

However, there is a big difference in radiation performance behaviour between the Nvidia TX2 and the AMD SOC, which is further discussed below.

ESA has investigated GPUs for space applications through analysis of different low-end and high-end GPUs from radiation and power-consumption perspectives in the GPU4Space project [10].

NASA has conducted several studies from a radiation perspective on different GPUs from both Nvidia and AMD [11, 12, 14]. Notably, Salazar et al. conducted radiation testing on five COTS graphics cards, two with AMD GPUs and three with Nvidia GPUs, aiming for application on the International Space Station (ISS) in the low-earth-orbit (LEO) radiation environment [14]. The top three of the five GPUs were chosen for testing to a total dose of 6 krad. However, 6 krad is very low, and the ISS is not a representative environment for most missions. An expanded description of radiation effects is discussed in Sect. 4. None of the cards failed permanently, but all of them experienced several functional interrupts that required a reboot or power cycle to regain control. The MSI HD6450, which uses an AMD GPU, performed the best, recording an MTTFI (mean time to functional interrupt) of 43.1 days.

The V1000 and R-Series can be made to leverage AMD's high-performance computing (HPC) software stack, Radeon Open Compute (ROCm) [16]. A particularly interesting aspect of the ROCm stack is that it can convert and execute Nvidia CUDA code, and hence provides an avenue for radiation-tolerant execution of CUDA code. This is also of interest since large algorithm investments have been made in the CUDA framework, and ROCm offers an avenue to leverage these investments on an open-source platform.

However, there is a large performance price paid for such radiation immunity: rad-hard processors are typically tens to hundreds of times less capable than modern COTS processors [23]. If chosen carefully and validated through extensive radiation testing, COTS processors can be selected that have favourable destructive-radiation-effect characteristics; indeed, the AMD processors mentioned in this paper have been shown to have favourable TID and SEL characteristics [24]. However, all COTS processors exhibit high rates of non-destructive SEEs compared to rad-hard processors and thus typically require frequent rebooting to mitigate these effects. The time between reboots varies greatly with the underlying technology and the radiation environment in which the processor operates. In benign environments such as the International Space Station, the time between reboots can be weeks to months, while in more stringent environments such as polar, geostationary, MEO, or HEO orbits, or exoplanetary missions, SEFI rates can reach multiple events per day. (In short: higher-performance processors suffer worse radiation effects.)

To overcome this limitation and make COTS processors viable for space applications that require both improved processing capability and reduced radiation susceptibility (i.e., less frequent reboots), Troxel Aerospace developed an SEE Mitigation Middleware that greatly improves non-destructive SEE upset rates. Troxel Aerospace's SEE Mitigation Middleware (SMM) provides core-, device-, and system-level fault tolerance by implementing multicore checking in the background in Linux.

NOTE: the paper never seems to reveal the big radiation difference between Nvidia and AMD.

https://space.stackexchange.com/questions/33019/have-any-probes-spacecraft-used-gpu-hardware-if-so-what-for

This is a little bit of a cop-out answer, but I have some pertinent experience. There are GPUs in use on the ISS ... in the laptops. The astronauts on the ISS receive briefings before EVAs in a "3D walkthrough" form. This uses NASA's EDGE renderer and a super-accurate 10-million-polygon model of the exterior of the Station. They also stay up to date on SAFER procedure training using an Oculus Rift (VR headset). The JSC VR Lab had to bypass significant portions of Oculus' software in order to optimize the VR to be usable on the (radiation-tested) HP ZBook 15 Gen 2 laptops they have there.

For general purpose rendering use, the extra radiation hasn't had a noticeable impact. This could change once outside of the Van Allen belts.

The Mars helicopter Ingenuity uses a Qualcomm Snapdragon 801 System-on-Chip which is well known from smartphones and includes an Adreno 330 integrated GPU.

"A Near Real Time Space Based Computer Vision System for Accurate Terrain Mapping" by Caleb Adams, NASA and UGA, small satellite conference

https://www.researchgate.net/publication/335401782_A_Near_Real_Time_Space_Based_Computer_Vision_System_for_Accurate_Terrain_Mapping

The MOCI satellite will have an Nvidia Jetson GPU module for 3D reconstruction of Earth surface features: The Multiview Onboard Computational Imager (MOCI) is a 3U CubeSat designed to convert high-resolution imagery, 4K images at 8 m Ground Sample Distance (GSD), into useful end data products in near real time. The primary data products MOCI seeks to provide are 3D terrain models of the surface of Earth that can be directly compared to the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) v3 global Digital Elevation Model (DEM). MOCI utilizes an Nvidia TX2 Graphics Processing Unit (GPU)/System on a Chip (SoC) to perform the complex calculations required for such a task.

Radiation Mitigation: The primary concerns in LEO are Single Event Upsets (SEU), Single Event Functional Interrupts (SEFI), and Single Event Latchups (SEL) [11]. These are certainly concerns for a dense SoC like the TX2. Thus, MOCI will utilize aluminized Kapton as a thin layer of protection for the payload. Software mitigation is also implemented. The Clyde Space OBC contains hardware-encoded ECC and could flash a new image onto the TX2 if necessary. The TX2 also utilizes a custom implementation of software-encoded error correction coding (ECC). Further, more detailed research will soon be published on how we have managed and characterized these radiation conditions.

NASA slides:
https://ntrs.nasa.gov/api/citations/20170006038/downloads/20170006038.pdf

https://nepp.nasa.gov/workshops/etw2018/talks/20JUNE18/0915 - Wyrwas--NEPP-ETW-GPU-TN57824_v2.pdf

https://ntrs.nasa.gov/api/citations/20170009004/downloads/20170009004.pdf

https://nepp.nasa.gov/files/30378/NEPP-TR-2019-Wyrwas-NEPPweb-TR-19-024-AMD-2200G-Microprocessor-2019June02-TN72756.pdf

https://nepp.nasa.gov/files/30362/NEPP-TR-2019-Wyrwas-TR-19-022_AMD-e9173-GPU-2019June02-TN72682.pdf

Nvidia provides a list of "qualified" or "certified" boards, but it's not clear any of them have undergone radiation testing.

https://www.nvidia.com/en-us/data-center/data-center-gpus/qualified-system-catalog/

