Main / Fpga

HDL Notes Here
Design, Layout, Timing Notes Here
Vivado In Focus
PetaLinux basics
Xilinx SDK and Vitis
Logic IP Core Addenda
UltraScale RFSoc
Verification
Using Tcl
Migrating Older Designs
CPLDs

Glossary (including Xilinx terms)

  • ACE = AXI coherency extension port
  • ACP = accelerator coherency port
  • AF = activity factor, how often a signal is toggling
  • AHB = advanced high-performance bus, an older non-AXI AMBA bus protocol (http://www.vlsiip.com/amba/axi_vs_ahb.html)
  • AFI = AXI FIFO interface? this seems to be an AXI bridge for smoothing data crossing between PS and PL, but details are scarce
  • AMP = asymmetric multi-processing, cores running independent code from one another and only loosely associated
  • APM = AXI performance monitor
  • APU = application processing unit, the generic core of the ARM
  • ASMBL = advanced silicon module block architecture, columnar Xilinx FPGA building technology
  • ATG = AXI traffic generator
  • ATP = ACP transaction checker, for debug
  • AXI = advanced eXtensible interface, the AMBA bus from ARM; there are three types of ports for the Zynq ARMs, AXI_HP (high perf), AXI_ACP (also high throughput, low latency, cache coherency), AXI_GP (general purpose, medium throughput, mainly for control functions and peripheral access)
  • BD = block design/diagram, manifest in .bd block diagram files which are XML descriptions of the Vivado IP Integrator block design
  • BEL = basic elements, the primary logic building blocks like LUTs, FFs, BRAMs, DSP cells, clock BUFGs, I/O BUFs, etc
  • BFM = bus functional model, enables functional verification of PL by mimicking the PS/PL interfaces and memories.
  • BIF = boot image format
  • BISC = built-in self-calibration
  • BMG = block memory generator
  • BPD = battery power domain
  • BTT = bytes to transfer? field in part of AXI DMA IP block
  • CCI = cache coherent interconnect
  • CCIO = clock capable I/O, meaning high speed pins that can handle high fanout
  • CDC = clock domain crossing
  • CDR = clock data recovery, a function of high-speed xcvr blocks
  • CIPS = control, interface, and processing system; an IP logic block
  • CLB = configurable logic block, made of FF, LUT, MUX
  • CMB = clock modifying block, which may have the ability to compensate for clock tree insertion delay
  • CML = current-mode logic
  • CMT = clock management tiles, provide clock frequency synthesis, deskew, and jitter filtering functionality with one MMCM and two PLLs
  • COE = Vivado coefficient file; Vivado reads the text COE file and writes out one or more MIF files when the core is generated (or during synthesis?); COE files can be generated by MATLAB
  • CPM =
  • CPR = clock pessimism removal, added or subtracted to the skew depending on analysis performed
  • CR = clock region, a subdivision of the fabric with a clock spine through the center
  • CSU = configuration and security unit
  • CTI = cross trigger interface, for debug
  • CTM = cross trigger matrix, for debug
  • CTS = clock tree synthesis
  • DAP = debug access port, for JTAG ARM PS control functions
  • DBC = dedicated byte clock, bitslice strobe input for inter-nibble clocking
  • DBI = data bus inversion, an option for memory interfaces that can invert the DQ bits in a byte
  • DCD = destination clock delay
  • DCI = digitally controlled impedance for the high performance I/O banks
  • DCP = design checkpoint file; created in the out-of-context design flow along with a constraints file
  • DDS = direct digital synthesizer, an IP core block for creating DAC source data; the core sources sinusoidal waveforms and consists of a Phase Generator and a SIN/COS Lookup Table (phase to sinusoid conversion); apparently this is the same function as an NCO
  • DFE =
  • DFX = dynamic function exchange design flow, allows portions of a running device to be reconfigured in real time with a partial bitstream
  • DLA = deep learning acceleration
  • DNA = DNA_PORT is a primitive Device DNA Access Port that allows access to a dedicated shift register that can be loaded with the unique Device DNA data bits (factory-programmed, read-only ID) for a given 7 Series device.
  • DPA =
  • DQ = from a DQ flip-flop, designates a bi-directional data port, inout
  • DQS = the S is for a differential data strobe pair acting as a "burst clock" for external memory read/write
  • DRC = design rule check, a review of hardware/logic to make sure all rules are met
  • DRP = dynamic reconfiguration port - seems to be a programming/debug port like JTAG, for communicating between PS and PL IP cores
  • DSA = digital step attenuator, RF-ADC attenuation code
  • ECT = embedded cross-trigger, for debug
  • EDAC = error detection and correction
  • EDIF = electronic design interchange format, extension .edf, an EDA vendor-agnostic format for netlists and schematics
  • EMC = external memory controller, an AXI IP core
  • EMIO = extended MIO; PS peripheral I/O signals routed through the PL (to PL logic or PL device pins) instead of the dedicated MIO pins
  • EPC = external peripheral controller, an AXI IP core
  • ETB = embedded trace buffer, for debug
  • FDPE = flip-flop type D with asynch preset and enable (Primitive: D Flip-Flop with Clock Enable and Asynchronous Preset)
  • FDRE = same as above but with synch reset
  • FPD = full power domain, which includes APU core, DDR memory controller, and high speed serial ports
  • FSBL = the Xilinx first stage boot loader; loaded from the boot medium into OCM by the BootROM (it is user code, not itself in ROM), it can hand off to a second stage like U-Boot
  • FTM = fabric trace monitor, supplied by Xilinx, part of or interfaced with the ARM CoreSight debug and trace mechanism
  • GC = global clock capable pins
  • GEM = gigabit eth mac; in the ARM PS
  • GIC = generic interrupt controller; despite the "scu" prefix in the driver name it is not part of the snoop control unit, but an independent module in the APU
  • GT? = a series of various gigabit transceivers in the PL fabric that provide physical links for high speed serial interfaces of many types
  • HCS = horizontal clock spine, runs through CR
  • HDF = handoff design file, sometimes called the hardware definition file; a platform spec generated by Vivado, deprecated in later versions from 2020 on (2019?); it contains all the information Xilinx SDK requires to create the corresponding software project for your design and is created when you export your design; the hardware is exported as a ZIP file (<project wrapper>.hdf), and when SDK launches, the file unzips automatically and all the files appear in the SDK project hardware platform folder
  • HFNS = high fanout nets synthesis
  • HLS = high level synthesis, a way to write C/C++ code and turn it into PL
  • HPM = high performance master
  • HSDP = high speed debug port
  • HSSIO = high speed SelectIO wizard, the recommended flow for building native-mode I/O designs
  • IBERT = integrated bit-error ratio tester
  • ICAP = internal configuration access port
  • ILA = integrated logic analyzer, an IP core
  • IPC = instructions per cycle --OR-- inter-process communications
  • IPI = IP integrator (logic cores)
  • ISim = ISE simulator
  • ITM = instrumentation trace macrocell, for debug
  • LC =
  • LMB = local memory bus, a fast local bus for connecting MicroBlaze instruction and data ports to high-speed peripherals, primarily on-chip block RAM (BRAM)
  • LPD = low power domain, which includes RPU core, performance monitoring, and OCM control
  • LVCMOS = low voltage complementary metal-oxide semiconductor, replaced NMOS and TTL, remains the standard for chip manufacturing (99%)
  • LVDS = low-voltage differential signaling
  • MAD = microprocessor application definition
  • MC = memory controller
  • MCS = an .mcs file can be used by Xilinx's iMPACT software to program an FPGA system board indirectly via platform flash (PROM); it can be created from a .bit file and converted by promgen to a .hex; it can also be created during boot image creation in the XSDK as an alternative to .bin, and apparently carries extra metadata that the .bin does not
  • MDD = microprocessor driver definition
  • MGT = multi-gigabit transceiver
  • MIF = memory initialization file
  • MIG = memory interface generator, Xilinx tool to create external DDR interface blocks
  • MIO = multiplexed I/O, the I/O pins of the PS which are muxed to various hard peripherals of the ARM; software programs the routing of the I/O signals to the MIO pins; the I/O peripheral signals can also be routed to the PL (including PL device pins) through the EMIO interface
  • MIS = memory interface solution, a core consisting of a controller and PHY for interfacing UltraScale logic to external SDRAM
  • MLD = microprocessor library definition
  • MMCM = mixed-mode clock manager, needed for the FPGA fabric but not for memory interfaces; the Clocking Wizard IP is built on the MMCM and/or PLL
  • MRMAC = multirate Eth MAC
  • MSS = .mss microprocessor software specification in the SDK - directives for customizing OS, libraries, drivers
  • MTS = multi-tile synchronization, for matching RF data converter latency across tiles and chips via clocking infrastructure
  • NCF = netlist constraints file, not supported by Vivado
  • NDR = non-default rule, user-defined routing rule, to reduce sensitivity to crosstalk or EM
  • NGC = .ngc is the Xilinx generated netlist file, storing the synthesized design and used to create .ngd and .bit files, can also be converted to .edf
  • NoC = network on chip
  • ODT = on-die termination, the impedance matching resistor is inside the chip
  • OOC = out of context, referring to a module that has its own synthesis run
  • OSC = offset calibration for IBUF input stage
  • PCAP = processor configuration access port; driver code in the FSBL appears to provide the mechanism used for programming the PL on boot
  • PCF = physical constraints file
  • PCS =
  • PCW = processor configuration wizard, used by Vivado or XSDK to setup the ARM PS
  • PLM =
  • PLPD = PL power domain
  • PMA =
  • PMC = platform management controller, part of the PS
  • PMU = performance monitor unit, part of the Cortex-A9 CPU
  • POD = pseudo open drain standard for DDR4
  • PRBS = pseudo-random bit sequence
  • PS = processing system, or processor system; the ARM core and supporting logic, the hard/fixed portion of the hybrid chip
  • PS7 = processing system 7, a wrapper around the hard PS core of a Zynq providing connections between the PS and PL
  • PSM =
  • PTM = program trace macrocell, for debug
  • QBC = quad byte clock, this clock input can clock resources in all bytes of an I/O bank
  • QMC = quadrature modulation correction, a feature of the RFSoC data converter IP with DACs and ADCs
  • RTL = register transfer level, one layer up from gate level and synonymous with the way Verilog/VHDL are typically written (i.e., we usually write RTL code); HDL at this level can also be described as one layer down from HLS code
  • SCD = source clock delay
  • SCU = snoop control unit in the ARM, connects the two Cortex-A9 processors to the memory subsystem and manages the data cache coherency between the two processors and the L2 cache
  • SD-FEC = soft decision forward error correction
  • SDB = related to VCS simulator symbol library, not associated with Xilinx tools
  • SDC = Synopsys design constraints, a constraint file format well adopted by industry
  • SDF = standard delay (default) format, an IEEE standard text file format for describing delays/timing of electronics
  • SDR =
  • SIMD = single instruction multiple data, a feature of UltraScale DSP blocks
  • SLCR = system level control registers, for things like resets and level shifters
  • SLR = super logic region, a single die slice contained in an SSI device, contains typical FPGA circuits (slices, DSP, RAM, GTs)
  • SMC = static memory controller, a module on the PS used mainly for interfacing to NAND and NOR flash via parallel APB interface (as opposed to a serial QSPI interface, for example)
  • SMMU =
  • SPI = shared peripheral interrupts, routed between PS and PL (distinct from the usual serial peripheral interface meaning)
  • SPL = secondary program loader, AKA secondary boot loader, which is loaded by FSBL; often U-boot
  • SSI = stacked silicon interconnect, Xilinx technology that combines multiple dies (SLRs) in one package for larger, better-performing FPGAs
  • SSR = super sample rate, PL filters for RF data converter block, interpolation and decimation
  • TAP = test access port, for JTAG PL control functions
  • TBU =
  • TCF =
  • TEX = type extension, ARM translation table bits; the Bufferable (B), Cacheable (C), and Type Extension (TEX) bit names are inherited from earlier versions of the architecture and no longer adequately describe the function of the B, C, and TEX bits
  • THS = total (negative) hold slack, the sum of the worst slack violation for all the endpoints that belong to paths crossing the specified clock domains for min delay analysis (hold/removal)
  • TNS = total negative slack, timing; the sum of the worst slack violation for all the endpoints that belong to paths crossing the specified clock domains (see the formulas just after this glossary)
  • TPIU = trace port interface unit, for debug
  • TPWS = total pulse width slack, the sum of all WPWS violations, when considering only the worst violation of each pin in the design (0ns when constraints met)
  • TTCL = TTCL seems to be a TCL preprocessing engine created by Xilinx to help aid in HDL file generation
  • UCDB = unified coverage database
  • UCF = user constraints file, the old way of specifying constraints, not supported by Vivado (UCF should be translated to XDC)
  • UPF = unified power format
  • VCO = voltage controlled oscillator; FVCO is the frequency thereof
  • VCU = video codec unit
  • VIO = virtual I/O
  • VIP = verification IP
  • VLNV =
  • WHS = worst hold slack, the worst slack calculated for various paths crossing the specified clock domains; negative slack indicates a problem in which the path violates a required hold (or removal) time
  • WNS = worst negative slack, the worst slack calculated for various paths crossing the specified clock domains; a negative slack indicates a problem in which the path violates a required setup (or recovery) time
  • WPWS = worst pulse width slack, the worst slack of all the timing checks on min/max high/low pulses/periods and skew
  • XCF = XST constraints file
  • XCI = Xilinx core instance, an .xci IP customization file; these can be used to import IP blocks into a project, XML format
  • XCIC = container of the above; one of these is associated with a core instance, see CORE_CONTAINER in the Properties box
  • XCO = Xilinx core generator log file, holds parameters and is used to generate Xilinx cores
  • XCRG = Xilinx coverage report generator
  • XDC = Xilinx design constraints, based on the SDC standard (appears to expand it), from about 2013 on; each constraint is a Tcl command
  • XMPU = Xilinx memory protection unit
  • XO = Xilinx object, an .xo file created by the Vitis C/C++ compiler or the RTL kernel wizard; combined by the Vitis linker to form the compiled accelerator
  • XPE = Xilinx power estimator
  • XPM = Xilinx parameterized macros, a library and alternative to the block memory generator; replaces the UniMacro library for UltraScale
  • XPPU = Xilinx peripheral protection unit
  • XPR = Xilinx project file
  • XPS = Xilinx Platform Studio, the ISE-era logic development tool analogous to Vivado?
  • XRT = Xilinx runtime library
  • XSA = Xilinx shell archive, an exported Vivado hardware spec to allow SW development with Vitis; also used by the PetaLinux build system
  • XSCT = Xilinx software command-line tool, an interactive and scriptable interface to Xilinx SDK, based on Tcl
  • XSDB = Xilinx system debugger, an alternative to GDB in the XSDK
  • XST = Xilinx synthesis technology, the Xilinx HDL synthesis tool integrated into ISE; Vivado has moved past this
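
The slack metrics above (WNS/TNS for setup, WHS/THS for hold) can be stated compactly. With s(e) the worst setup slack at endpoint e, over the set E of endpoints being analyzed:

  \mathrm{WNS} = \min_{e \in E} s(e), \qquad \mathrm{TNS} = \sum_{e \in E} \min\bigl(0,\, s(e)\bigr)

The same definitions hold for WHS/THS with hold slack in place of setup slack. TNS/THS accumulate only violating (negative-slack) endpoints, so both are 0 when timing is met.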

AXI

What is an "interface" to Xilinx?
An interface is a grouping of signals that share a common function and can contain both individual signals and buses. The Zynq ARM PS is an AXI master; it connects to an AXI interconnect PL IP block, which in turn connects to the AXI interfaces of various slave devices. An AXI slave could be a user-defined module or, for example, a UART IP block.

Vivado now offers two AXI interconnect IP blocks: the newer AXI SmartConnect and the older AXI Interconnect. Vivado may use AXI SmartConnect as the default with connection automation. The LogiCORE IP AXI SmartConnect core connects one or more AXI memory-mapped masters to one or more memory-mapped slaves. It is a hierarchical IP block added to an IP Integrator block design and is a drop-in replacement for the AXI Interconnect v2 core; it is more tightly integrated into the Vivado design environment so that it automatically configures and adapts to the connected AXI master and slave IP with minimal user intervention.

AXI is part of ARM AMBA, a family of microcontroller buses first introduced in 1996. The first version of AXI was included in AMBA 3.0, released in 2003. AMBA 4.0, released in 2010, includes the second version, AXI4. There are three types of AXI4 interfaces:

  • AXI4—for high-performance memory-mapped requirements.
  • AXI4-Lite—for simple, low-throughput memory-mapped communication (for example, to and from control and status registers).
  • AXI4-Stream—for high-speed streaming data.

AXI3/AXI4 appear to be little-endian by definition. AXI provides separate read and write channels (independent wires/ports), so reads and writes can proceed simultaneously.

Master and slave interfaces are designated MXX_AXI and SXX_AXI, counting up from 00. The master initiates all transactions.

Bursts

FIXED burst = The address is the same for every transfer in the burst. The byte lanes that are valid are constant for all beats in the burst. However, within those byte lanes, the actual bytes that have WSTRB asserted can differ for each beat in the burst. This burst type is used for repeated accesses to the same location such as when loading or emptying a FIFO.

INC burst = Incrementing. In an incrementing burst, the address for each transfer in the burst is an increment of the address for the previous transfer. The increment value depends on the size of the transfer. For example, the address for each transfer in a burst with a size of four bytes is the previous address plus four. This burst type is used for accesses to normal sequential memory.

WRAP burst = A wrapping burst is similar to an incrementing burst, except that the address wraps around to a lower address when an upper address limit is reached. There are alignment and length restrictions (see the rules below).

The burst type is specified by:
ARBURST[1:0], for read transfers
AWBURST[1:0], for write transfers.

The burst length is specified by:
ARLEN[7:0], for read transfers
AWLEN[7:0], for write transfers.

The maximum number of bytes to transfer in each data transfer, or beat, in a burst, is specified by:
ARSIZE[2:0], for read transfers
AWSIZE[2:0], for write transfers.
AXI has the following rules governing the use of bursts:

  • For wrapping bursts, the burst length must be 2, 4, 8, or 16.
  • A burst must not cross a 4KB address boundary (a worked check appears after this list).
  • Early termination of bursts is not supported; no component can terminate a burst early. However, to reduce the number of data transfers in a write burst, the master can disable further writing by deasserting all the write strobes, though it must still complete the remaining transfers in the burst. In a read burst, the master can discard read data, but it must complete all transfers in the burst.
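
As a worked example of the 4KB rule and the WRAP boundary, here is a minimal C sketch (my own illustration, not from the AXI spec; the function names are made up):

  #include <stdint.h>
  #include <stdbool.h>

  /* beats = AxLEN + 1, size_bytes = 1 << AxSIZE */
  static bool burst_stays_in_4kb(uint64_t start, unsigned beats, unsigned size_bytes)
  {
      uint64_t last = start + (uint64_t)beats * size_bytes - 1;
      return (start >> 12) == (last >> 12);   /* first and last beats on the same 4KB page */
  }

  /* A WRAP burst (length 2, 4, 8, or 16, start aligned to the transfer size)
     wraps at a boundary equal to the total burst size in bytes. */
  static uint64_t wrap_lower_boundary(uint64_t start, unsigned beats, unsigned size_bytes)
  {
      uint64_t total = (uint64_t)beats * size_bytes;
      return start & ~(total - 1);
  }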

AXI Infrastructure IP Cores

The AXI Infrastructure is a collection of the following IP cores:

  • AXI Crossbar: Connects one or more similar AXI memory-mapped masters to one or more similar memory-mapped slaves.
  • AXI Data Width Converter: Connects one AXI memory-mapped master to one AXI memory-mapped slave having a wider or narrower data path.
  • AXI Clock Converter: Connects one AXI memory-mapped master to one AXI memory-mapped slave operating in a different clock domain.
  • AXI Protocol Converter: Connects one AXI4, AXI3 or AXI4-Lite master to one AXI slave of a different AXI memory-mapped protocol.
  • AXI Data FIFO: Connects one AXI memory-mapped master to one AXI memory-mapped slave through a set of FIFO buffers.
  • AXI Register Slice: Connects one AXI memory-mapped master to one AXI memory-mapped slave through a set of pipeline registers, typically to break a critical timing path.
  • AXI MMU: Provides address range decoding services for AXI Interconnect

…plus the AXI Interconnect and SmartConnect cores described above.

Stream

The stream protocol has no address lines, so it is not memory-mapped. It is for directly connected data sharing and is managed by ready/valid handshaking.

IP Core Wire Naming

A _S2MM interface is stream to memory-mapped, and _MM2S is the opposite. The _STS interface is a status stream, but it's unclear how this is used. _SG is the scatter-gather memory-mapped interface.

DMA

AXI DMA can be configured in Direct Register mode or SG (Scatter/Gather) mode. Register mode is less resource intensive but offers lower performance. SG mode performs DMA transactions and management using buffer descriptors (BDs), which can be placed in any memory-mapped storage such as BRAM; placing the BDs on the PL side of the FPGA allows higher-performance data transactions.

There's a lot going on with this IP block. Here are some example notes from working with an FFT block connected to it.

Interfaces:

  • S_AXI_LITE
  • S_AXIS_S2MM (write channel) = S2MM Slave Stream Interface Signals
  • S_AXIS_STS (write channel, cont/status stream)
  • M_AXI_SG = Scatter Gather Memory Map Read/Write Interface Signals
  • M_AXI_MM2S (read channel) = MM2S Memory Map Read Interface Signals
  • M_AXI_S2MM (write channel) = S2MM Memory Map Write Interface Signals
  • M_AXIS_MM2S (read channel) = MM2S Master Stream Interface Signals
  • M_AXIS_CNTRL (read channel, cont/status stream)

Enabling the read channel means including MM2S (which one?); unchecking it removes three of the M_ ports. Enabling the MM2S channel allows read transfers from memory to AXI4-Stream to occur.

Enabling the write channel means including S2MM (which one?); unchecking it removes two of the S_ ports and M_AXI_S2MM. Enabling the S2MM channel allows write transfers from AXI4-Stream to memory to occur.

  • READ from DDR mem to FFT uses MM2S
  • WRITE from FFT to DDR mem uses S2MM
  • S_AXIS_S2MM is the port for the AXI stream data being written to memory, and M_AXIS_MM2S is the port for the AXI stream data being read out of memory; S2MM stands for stream-to-memory-map and MM2S for memory-map-to-stream.
  • Both M_AXI_MM2S and M_AXI_S2MM connect to the PS slave port through an interconnect IP.

I think an AXI4-Stream interface must run in one direction only, master to slave; needing data back from the slave would explain why another interface is required.

Setting up the AXI DMA transfer
Direct Register mode (Scatter/Gather engine disabled) provides a configuration for doing simple DMA transfers on the MM2S and S2MM channels that requires less FPGA resource utilization. Transfers are initiated by accessing the DMACR, the source or destination address, and the length registers. When the transfer completes, the DMASR.IOC_Irq bit asserts for the associated channel and, if enabled, generates an interrupt out.
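
In software, those register pokes are wrapped by the standard xaxidma driver from the BSP. A polled-mode C sketch of a round-trip transfer (the XPAR device ID macro and the buffer addresses are placeholders; adapt them to your xparameters.h and memory map):

  #include "xaxidma.h"
  #include "xparameters.h"
  #include "xil_cache.h"

  #define TX_BUF 0x01000000   /* placeholder DDR buffer, stream source (MM2S) */
  #define RX_BUF 0x02000000   /* placeholder DDR buffer, stream sink (S2MM)   */
  #define LEN    1024

  int dma_roundtrip(void)
  {
      XAxiDma dma;
      XAxiDma_Config *cfg = XAxiDma_LookupConfig(XPAR_AXIDMA_0_DEVICE_ID);
      if (!cfg || XAxiDma_CfgInitialize(&dma, cfg) != XST_SUCCESS)
          return XST_FAILURE;

      Xil_DCacheFlushRange(TX_BUF, LEN);        /* push source data out to DDR   */
      Xil_DCacheInvalidateRange(RX_BUF, LEN);   /* drop stale destination lines  */

      /* Arm the write (S2MM) channel first, then kick off the read (MM2S);
         SimpleTransfer does the address/length/DMACR writes described above */
      XAxiDma_SimpleTransfer(&dma, RX_BUF, LEN, XAXIDMA_DEVICE_TO_DMA);
      XAxiDma_SimpleTransfer(&dma, TX_BUF, LEN, XAXIDMA_DMA_TO_DEVICE);

      while (XAxiDma_Busy(&dma, XAXIDMA_DEVICE_TO_DMA) ||
             XAxiDma_Busy(&dma, XAXIDMA_DMA_TO_DEVICE))
          ;   /* poll DMASR via the driver until both channels go idle */

      return XST_SUCCESS;
  }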

Port Flavors

HPM are the general purpose ports used by the PS to hit AXI slaves in the PL
HP/HPC are the high-performance ports used by the PL AXI masters to hit the PS
ACP is the cache-coherent accelerator coherency slave port on the PS, accessed by PL AXI masters; note that this port cannot be used in a loopback structure, meaning the PS cannot execute code out of RAM addressed via the ACP port, because the instruction fetch would have to pass through the APU SCU twice.
ACE is the two-way coherency PS port accessed by PL AXI masters

Status/Control Register

The recommended way to create an AXI status/control register for the FPGA top level is to put in an AXI GPIO IP core, and then the GPIO inputs can be connected to status lines around the device and the outputs connected to control lines. Then it is mapped in the address editor.
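
A sketch of the software side using the standard xgpio driver, assuming a dual-channel AXI GPIO with channel 1 wired to the status inputs and channel 2 to the control outputs (the device ID macro follows the usual xparameters.h pattern; verify the name for your block design):

  #include "xgpio.h"
  #include "xparameters.h"

  #define STATUS_CH 1   /* GPIO channel wired to status inputs   */
  #define CTRL_CH   2   /* GPIO channel wired to control outputs */

  static XGpio csr;

  int csr_example(void)
  {
      if (XGpio_Initialize(&csr, XPAR_GPIO_0_DEVICE_ID) != XST_SUCCESS)
          return XST_FAILURE;

      u32 status = XGpio_DiscreteRead(&csr, STATUS_CH);  /* sample status lines */
      XGpio_DiscreteWrite(&csr, CTRL_CH, status & 0x1);  /* example: echo a status bit */
      return XST_SUCCESS;
  }

The base address the driver uses comes from the address editor mapping, via xparameters.h.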

Zynq

Memory

There are four fundamental memory regions. These memory regions are the double data rate (DDR) memory, on-chip memory (OCM), tightly-coupled memory (TCM), and advanced eXtensible interface (AXI) block RAM in the PL. Access to memory is controlled by the memory controllers, direct memory access controllers (DMACs), memory management units (MMUs), SMMUs, and the XMPUs.

What are level shifters?

Level shifters are added to ensure that blocks operating at different voltages will operate correctly when integrated together in the SoC. Level shifters must ensure the proper drive strength and accurate timing as signals transition from one voltage level to another.

You must enable the PL level shifters using LVL_SHFTR_EN before PS-PL communication can occur. The enabling happens in the FSBL, more precisely in the function 'FsblHandoff' in main.c. If you boot with an FSBL, ps7_init.c/h will enable the level shifters for you (it's a huge file, but search for 'shift' and you should see the correct values set there). If you use JTAG to download your bitstream/code (hence there is no FSBL), the equivalent Tcl script 'ps7_init.tcl' enables the level shifters (search for 'PS_LVL_SHFTR_EN' in that file).

All these files (ps7_init.c/h and ps7_init.tcl) are extracted by SDK from the .hdf file generated during a Vivado 'export hardware', so I assume Vivado sets the right flags if you enable EMIO, and you don't need to worry about this.
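
If you ever need to do it by hand, it boils down to two SLCR writes. A bare-metal C sketch; the register offsets and the 0xF enable value are from my reading of the Zynq-7000 TRM (ug585), so treat them as assumptions and verify before relying on this:

  #include "xil_io.h"

  #define SLCR_UNLOCK   0xF8000008   /* SLCR write-protect unlock register (assumed offset) */
  #define SLCR_KEY      0x0000DF0D   /* unlock key (assumed value)                          */
  #define LVL_SHFTR_EN  0xF8000900   /* PS-PL level shifter enable (assumed offset)         */

  void enable_pl_level_shifters(void)
  {
      Xil_Out32(SLCR_UNLOCK, SLCR_KEY);  /* SLCR registers are write-protected */
      Xil_Out32(LVL_SHFTR_EN, 0xF);      /* enable all PS-PL level shifter groups */
  }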

Board Layout

Bank0 is known as the dedicated configuration bank for the 7000 family, and has the DONE signal and JTAG pins, etc.

Boot

  1. The BootROM executes first.
  2. It configures the system in a fixed way.
  3. It copies the FSBL boot image from the boot medium (flash) to the OCM and then begins FSBL execution.
  4. The FSBL inits the PS using the PS7 init data from Vivado.
  5. The FSBL grabs the PL bitstream from flash and programs the FPGA.
  6. The FSBL loads the OS into DDR memory (at the END of memory).
  7. The FSBL hands off control.

FPGA Size and Logic Footprint

A nice look at what Xilinx means by slice here: http://www.ni.com/product-documentation/54503/en/

It used to be that the rule of thumb was to keep your design around 40-50% of resource utilization as the optimum. Less means you are overpaying for the chip; more doesn't give you room to grow or debug, increases PnR times, and makes timing closure more difficult.

As of 2019, it's not quite so straightforward, because there are large complex blocks and using one piece of one counts the entire block as used. Chris Chan said he's seen 95% usage that meets timing just fine due to this factor, so you have to take a closer look. The integrated processor cores are also a factor now.

Clocks and Resets

Digital Clock Manager (DCM) – The DCM is a device primitive used to support clock manipulation (multiplication, division, frequency synthesis, phase shifting, etc). Tasks for this module may include:

  -Double the input clock
  -Create a 180-degree phase-shifted version of the doubled input clock
  -Create a divided version of the input clock
  -Create a frequency-synthesized version of the input clock
  -Create 90-degree, 180-degree, and 270-degree versions of the input clock
  -Positive-edge align all output clocks, excluding the phase-shifted and frequency-synthesized clocks
    (frequency-synthesized alignment is design dependent)
  -Create a reset for each clock generated.

Reset Synchronizer – The Reset Synchronizer is a common HDL block used to generate a synchronous reset on each clock domain based on either the DCM locked signal or external reset assertion.
Power On Reset (POR) Logic – This block is used to guarantee that the reset is asserted correctly to the DCM reset counter upon configuration of the FPGA.

An FPGA is divided into clock regions, each of which is a grid of so many CLBs, BRAMs, and DSP blocks. UltraScale architecture clock regions have a rectangular shape with a fixed width and height and are organized in tiles. Horizontal and vertical clock tracks are segmented at the clock region boundaries.

Clock buffers are different from normal buffers because they are designed to achieve equal rise and fall times to prevent the duty cycle from shifting when passing through a chain of buffers. A normal buffer just seeks to minimize the sum of rise and fall times.

Output clock PLLs

PLLs in the PS and PMC are:

  • APLL = APU PLL which is in FPD domain
  • NPLL = NoC PLL which is in PMC domain
  • RPLL = RPU PLL which is in LPD domain
  • PPLL = PMC PLL which is in PMC domain

ModelSim

A fatal error of 'Obsolete library format' means libraries are mis-mapped or they need to be refreshed.

The message "Error: Illegal Character .in string parameter!" means that there is a period in a generic of type string. Clearly VHDL doesn't like that. Not sure if perhaps this is only the case with digits, and has been recognized as a number.

If ModelSim ignores a specific Altera package inside of a library, even though it is actually present, it may be a sim compilation library version problem. Try using a newer version of Quartus-generated libraries. For altera.altera_europa_support_lib it worked to change the .ini to point to quartus_12 instead of quartus_9 versions.

To add a `define on the command line, use

vlog ...  +define+FOOBAR_IS_DEFINED ...

run -all runs the sim until there are no more scheduled events, including clock events. run -continue is for use after hitting a breakpoint.

To enforce a permanent state change on a signal that has other values being driven onto it, use force -freeze (for example: force -freeze /tb/dut/enable 0). Otherwise you may end up with an 'X'.

Running ModelSim in Linux

After install using Mentor's install.linux script, setup a launch script like this:

LM_LICENSE_FILE=port@server:port2@server2;
export LM_LICENSE_FILE
export MTI_VCO_MODE=32
<vsim location> (e.g. /usr/local/modelsim10.1c/modelsim_dlx/linuxpe/vsim)

If a sim fails to start properly, and the command line accepts but ignores your commands as well as clicks on the stop and break buttons, try closing all the open views to recover.

Wave

To view signals from a set of loop-generated VHDL components (such as for i in 1 to X generate) use label_generate__<#>/label_instance

Error codes and meanings

"(vcom-1078) Identifier "constant_name" is not directly visible" means the constant is declared twice (for example, in two packages that are both in scope).

Steps to compile VHDL for sim

vlib work
vdel -all -lib work
vlib work
vmap work work 
vmap xilinxcorelib <path>
#(etc libs)
vcom <.vhd files>

Viewing VHDL constant values while simulating

Open the file for viewing from the work entry in the Library window. Right click and select Examine.

Critical steps for building a Xilinx FPGA on a Linux system with ISE and Synplify

- Add Synplify and Xilinx to PATH
- Export variables for part, bitgen options, script, constraints, results, work, and reports directories
- Update coregen if needed
- If applicable, run version updating script
- Run Synplify:
synplify -batch -licensetype <license> -log log.txt <pname>.prj
- Run ISE PNR

  Build netlist:
  ngdbuild -verbose -intstyle xflow -dd _ngo -p $PART -nt timestamp -uc "<name>.ucf" "<name>.edn" <name>.ngd
  Mapper:
  map -intstyle xflow -p $PART -cm speed -logic_opt on -register_duplication off -pr b -k 4 -c 100 -l
    -ol high -timing -o <mapname>.ncd <name>.ngd <name>.pcf
  Place and Route:
  par -w -intstyle xflow -ol std -t 1 <mapname>.ncd <name>.ncd <name>.pcf
  Trace paths for timing reports:
  trce -intstyle xflow -e 3 -s 4 -xml <name> -tsi <name>.tsi <name>.ncd -o <name>.twr <name>.pcf
  Generate bit file:
  bitgen $BITGENOPTIONS

- Generate report if desired
- Copy results and clean up

Notes: .ngc is a netlist file and can be converted to an .edf

Xilinx Parts

The Zynq-7000 series uses the 28nm PL technology shared with the Artix-7/Kintex-7 families. Zynq UltraScale+ uses the 16nm technology shared with the Kintex/Virtex UltraScale+ families (plain UltraScale is 20nm). The Versal ACAP architecture is 7nm technology.

Zynq Architecture

The PS consists of the ARM core, I/O peripherals, flash interfaces, DDR interfaces, clocks, and the PL interconnect. Aside from the muxed I/O, you also have a number of AXI interfaces: a central routing block connects 32b AXI master and slave ports between the PL and the PS I/O, there are 32b/64b AXI slave ports for the PL-to-memory interconnect, and a 64b AXI ACP slave port.

Older Xilinx Tools

ISE

To get a utilization and basic timing analysis, you don't actually need a .ucf; you can declare any block to be your top-level file and run synthesis to get an idea of the size of that particular block. Just add as sources everything that block needs in terms of components. You can also add a .ucf and constrain your clock to get a timing report and a max frequency.

Interpreting Xilinx' vague errors

ERROR:Bitgen:4 - The input NCD file "filename.ncd" is not in the specified location...

This most likely means there was a problem with NGDBuild. Check runme.log, and you might see something like

<symbol> with type <symbol> could not be resolved.  A pin name misspelling can cause this...

Xilinx Sim Libraries

UNISIM library for functional simulation of Xilinx primitives
UniMacro library for functional simulation of Xilinx macros
XilinxCoreLib library for functional simulation of Xilinx cores
SIMPRIM library for timing simulation of Xilinx primitives

http://www.xilinx.com/support/documentation/sw_manuals/xilinx13_1/ise_c_simulation_libraries.htm

Synplify

A synthesis error of "Internal Error in m_altera or m_xilinx" is undocumented, and the suggestion is to submit a support request to Synopsys.

Low-impact interrupt re-design

Suppose the SW team requests a design with a single interrupt line to the uP. The logic blocks can each generate one or more interrupts, and they are created with a single interrupt master clear originating from the register block, such that all interrupts are cleared when the interrupt flag register is read. Then SW comes back and says they want to handle interrupts individually instead of all at once, and wants the PL to maintain a service record. Not wanting to redesign the logic with new block-by-block interrupt clears, there must be another solution.

Create two additional SW R/W registers: one to hold a single bit that indicates new interrupts require service (different from the IRQ line to the uP), and one to hold a copy of the contents of the interrupt flag reg written when the SW clears the 'new' reg/bit.

Here's the theory. When a new interrupt occurs, it is marked in the flag reg. Since the copy reg is clear but the flag reg is not, the 'new' interrupt bit gets set in its register. This immediately triggers the IRQ, so SW knows there's an interrupt. Suppose another interrupt comes in, so now there are two flags set. The first thing SW needs to do is ack the interrupt by clearing the 'new' bit. This causes the copy to occur from the flag reg to the copy reg, which also clears the flag reg. It also silences all the interrupts internally, at the blocks.

SW then reads the copy reg to find out which interrupts went hot. SW only wants to handle one of these interrupts right now, not both. So it picks one and services it, then clears only that bit in the copy reg. That leaves one more bit in the copy reg, which still must be serviced by SW. The IRQ line will not be de-asserted until the copy reg is clear.

If a new interrupt occurs after SW has cleared the 'new' bit and the copy happens, that's ok. It simply sits in the flag register. Once all the interrupts in the copy reg have been serviced and all is cleared, on the next clock cycle the cleared copy reg in conjunction with a set bit in the flag reg trigger the 'new' bit and IRQ line once again.

Alternatively, SW can still use the 'old' way to read the interrupt flag reg and clear all the interrupts at once if desired. In this way SW can get around a certain problem: say SW wants to ignore an interrupt that's in the copy reg but has no other way of knowing that a new, more important interrupt has set a flag. They should have just masked off the one to ignore, but there is the fallback of reading the flag reg.
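
To make the SW side concrete, here is a hypothetical ISR sketch in C for the scheme above. The register addresses, the write-1-to-clear behavior, and service_one() are placeholders of my own, not an existing driver:

  #include <stdint.h>

  #define REG(a)   (*(volatile uint32_t *)(a))
  #define IRQ_BASE 0x43C00000u        /* placeholder AXI base address          */
  #define IRQ_NEW  (IRQ_BASE + 0x0)   /* bit 0: new interrupts need service    */
  #define IRQ_COPY (IRQ_BASE + 0x4)   /* latched copy of the flag reg          */

  extern void service_one(uint32_t source_bit);   /* user-supplied handler */

  void irq_handler(void)
  {
      REG(IRQ_NEW) = 0;                  /* ack: HW copies flag reg -> copy reg, clears flags */
      uint32_t pending = REG(IRQ_COPY);  /* which interrupts went hot */
      if (pending) {
          uint32_t one = pending & (~pending + 1);  /* pick the lowest set bit */
          service_one(one);
          REG(IRQ_COPY) = one;           /* clear just that bit (assuming write-1-to-clear) */
      }
      /* The IRQ line stays asserted until the copy reg is fully clear,
         so the ISR re-enters for any remaining bits. */
  }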

Chip Packaging and Substrates

For high-speed applications, it has been standard design practice to use flip-chip packages over wire-bonded ones. The I/O trace lengths in the substrate affect performance, especially at high speeds; this stems from the current path between the die and the substrate balls. The electrical delay of flip-chip substrate traces is much easier to predict than that of wire-bonded package wires. In high-speed applications these delays can be quite significant, so they must be included in performance analysis models. Flip-chip packages have a structural arrangement that makes the delays associated with these traces very predictable within reasonable tolerance. Some newer bond-wire packages do have flight times specified for them.

Lidded packages support higher interface bitrates; presumably the lid acts as some kind of shielding, extra ground plane, or heat spreader? Cost is higher for lidded packages.

Bitstream Programming (AKA FPGA Configuration)

Here's a nice writeup: http://lastweek.io/fpga/bitstream/

Bitstream size is fixed for a given part regardless of the logic.

  • Zynq 7030 = 5,980,035 bytes

Hard-Core Processor Designs

More often in 2019 you are seeing chips with integrated ARM cores alongside the FPGA fabric, as with the Xilinx Zynq family. Xilinx uses the terms PS (processing system) and PL (programmable logic) for the two different parts of the chip.

All of the signals and interfaces that go between the PS and PL traverse a voltage boundary. These input and output signals are routed through voltage level shifters. The PS must be powered on to program the logic in the PL.

Technology Updates

2019-era Zynq FPGAs come with general-purpose high-speed SERDES links. With the right logic IP and firmware, any fast serial protocol can be built on top of them, such as USB, SATA, PCIe, GMII, MIPI, or HDMI.

The phaser appears to be a modified PLL in the CMT used only for memory interfaces.
The 7 series FPGAs and Zynq-7000 AP SoCs have new hard blocks that mitigate memory interface timing challenges: the Phasers, the I/O FIFOs, and the I/O PLL, all contained within or adjacent to the enhanced CMTs that encompass the traditional MMCM as in the Virtex-6 FPGAs. These blocks provide and respond to a higher resolution of clock timing control; together with the ISERDES and OSERDES they handle higher input frequencies (up to 933 MHz for DDR3 at 1,866 Mb/s) and allow finer phase-shift steps than the older methods in Virtex-6 FPGAs.

Virtex 7 is a 28nm process. UltraScale is a 20nm process. UltraScale+ is a 16nm process. Versal is a 7nm process.

Memory Interfaces

From a 2017 Micron DDR4 datasheet:

  • The terms "_t" and "_c" are used to represent the true and complement of a differential signal pair. These terms replace the previously used notation of "#" and/or overbar characters. For example, differential data strobe pair DQS, DQS# is now referred to as DQS_t, DQS_c.
  • The term "_n" is used to represent a signal that is active LOW and replaces the previously used "#" and/or overbar characters. For example: CS# is now referred to as CS_n.
  • The terms "DQS" and "CK" are to be interpreted as DQS_t, DQS_c and CK_t, CK_c respectively, unless specifically stated otherwise

State of the Art: A Discussion with an Expert in 2022

Vitis HLS is still a scary mess. Xilinx has been saying for years that they'll take HLS and make it work, but they still haven't delivered. Even their GitHub-hosted examples (which appear to be posted by grad students) don't work. They have spent too much time evolving the FPGA technology rather than working on HLS; they probably could have had it working by now if they hadn't changed the FPGA underneath so much.

Moving from the old XSDK to Vitis is not that painful if you were familiar with XSDK.

Where have things changed the most in 10 years? Probably in verification and UVM (Universal Verification Methodology). There are about 30 different UVM modules. It's a stimulus-to-checker-to-wrapper model, and it feels like a ton of overhead for something that used to be simpler, but there is a lot of useful stuff that appears to be largely reusable. It's worth the training and learning. Everyone in industry now expects UVM-style testbenching, which wasn't the case 10 years ago. It's also common now to hire out verification; it has developed into rather its own specialty, and there are some great contracting experts out there to hire. It's work that can largely be done independently/remotely, which is very nice.

Cadence is now more popular than Synplify for synthesis, but Vivado has improved enough in simulation, synthesis, and P&R that third-party tools are no longer necessary. Xilinx tools used to be terrible.

Agilex is the first joint Intel/Altera effort since the takeover, and it competes head to head with UltraScale+. Altera had always used Intel fabs, so the fit made sense; it took a while to really get going, but things are now ramping up. With Xilinx and AMD, which was motivated more by AMD's need to keep up than anything else, nothing is likely to change for 4 to 5 years. There doesn't appear to be a significant difference in chip offerings or market targets between the two.

Lattice and Microchip still have their traditional niches.

References

Cool website with lots of informational stuff about FPGA technical implementation details: https://zipcpu.com/about/

