Building Blocks

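The CLB (configurable logic block) is the primary building block. Each CLB slice contains LUTs, carry logic, muxes, and FFs. (Xilinx cites eight 6-input LUTs, an 8-bit carry chain, and 16 FFs.) A LUT used as logic has no clocked storage; it is purely combinational. Cascading LUTs therefore adds combinational delay only; delaying a signal by whole clock cycles or balancing pipeline stages requires FFs (or shift registers) between the LUTs.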

A Xilinx SLICEM is a slice whose LUTs have extra functions: they can be used as distributed RAM or as 32-bit shift registers (SR).

How can you add a simple latch to provide clock relief?

There is no flip-flop or latch IP block, but you can create this with the "RAM-based shift register" and customize the output and control signals.

Buffers

An IBUFG is the input buffer that brings a clock from an external pin onto a global clock net. A BUFG drives a global clock net from an internal signal. A BUFH drives a horizontal clock row within a single clock region, so you wouldn't use it for global nets.

Memories

Distributed Memory

The Distributed Memory Generator IP core creates a variety of memory structures using Select RAM. It can be used to create read-only memory (ROM), single-port random-access memory (RAM), and simple dual-port or dual-port RAM, as well as SRL16-based RAM.

FIFOs

The built-in block RAM and FIFO primitives in 7 series FPGAs can be used to implement RAMs, ROMs, and FIFO blocks for a design. The block RAM and FIFO are optimized for performance and allow you to implement a RAM, ROM, or FIFO block without consuming large amounts of slice logic.

Timing Introduction

Xilinx recommends the "UltraFast" design methodology, which defines timing constraints in this order:

clocks
clock interactions
I/O
exceptions

The launch edge is the active edge of the source clock that launches the data.
The capture edge is the active edge on the destination clock that captures the data.
The source clock is also referred to as the launch clock.
The destination clock is also referred to as the capture clock.
The setup requirement is the relationship between the launch edge and the capture edge that defines the most restrictive setup constraint.
The setup relationship is the setup check verified by the timing analysis tool.
The hold requirement is the relationship between the launch edge and capture edge that defines the most restrictive hold constraint.
The hold relationship is the hold check verified by the timing analysis tool.

What is skew?
The insertion delay difference between the launch edge of the source clock and the capture edge of the destination clock, plus clock pessimism correction (if any). Positive skew means that the clock arrives at the source FF earlier than at the destination FF; it is usually OK because it relaxes setup, although it tightens hold. Negative skew is the opposite and is bad for setup!

What is slew?
The word represents quickly sliding around or changing direction, and for electronics is the change in voltage per unit of time (volts/second). When given for the output of a circuit, such as an amplifier, the slew rate specification guarantees that the speed of the output signal transition will be at least the given minimum, or at most the given maximum. When applied to the input of a circuit, it instead indicates that the external driving circuitry needs to meet those limits in order to guarantee the correct operation of the receiving device. If these limits are violated, some error might occur and correct operation is no longer guaranteed.

Which process corner(s) are used for setup and hold in IDELAY component modes?
In COUNT mode, Slow corner for Setup, Fast corner for Hold. In TIME mode, Fast corner for Setup, Slow corner for Hold.

What is static timing analysis?
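Static timing analysis checks every static timing path (clocked element -> combinational segment -> clocked element) against its constraints, without simulating the design. A single port can have separate static timing paths with different delays; to add a second delay constraint on the same port without overwriting the first, use the -add_delay flag with the set_input_delay command.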

What is I/O pin delay?
Each input port has a min and max delay. Min is used for hold time check, max is used for setup time check.
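
A minimal XDC sketch of min/max input delays, including a second pair added with -add_delay for a double-data-rate input. The clock name rx_clk, the port rx_data[*], and all delay values are hypothetical and would come from the external device's datasheet in a real design.

```tcl
# Hypothetical names and values, for illustration only.
create_clock -name rx_clk -period 8.000 [get_ports rx_clk]

# Max delay feeds the setup check, min delay feeds the hold check.
set_input_delay -clock rx_clk -max 2.500 [get_ports {rx_data[*]}]
set_input_delay -clock rx_clk -min 0.500 [get_ports {rx_data[*]}]

# Second pair, relative to the falling clock edge. -add_delay keeps the
# rising-edge constraints above instead of overwriting them.
set_input_delay -clock rx_clk -clock_fall -max 2.500 -add_delay [get_ports {rx_data[*]}]
set_input_delay -clock rx_clk -clock_fall -min 0.500 -add_delay [get_ports {rx_data[*]}]
```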

What is the general timing constraints flow?
Create clocks, then set I/O delays, then specify timing exceptions.
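
The three steps might look like this in an XDC file. This is only a sketch: the port names (sys_clk, din, dout, rst_n), the 100 MHz period, and the delay values are assumptions, not values from any particular design.

```tcl
# 1. Clocks (hypothetical 100 MHz clock on port sys_clk)
create_clock -name sys_clk -period 10.000 [get_ports sys_clk]

# 2. I/O delays relative to that clock
set_input_delay  -clock sys_clk -max 3.000 [get_ports {din[*]}]
set_input_delay  -clock sys_clk -min 1.000 [get_ports {din[*]}]
set_output_delay -clock sys_clk -max 2.000 [get_ports {dout[*]}]
set_output_delay -clock sys_clk -min -0.500 [get_ports {dout[*]}]

# 3. Timing exceptions (here, an asynchronous reset input is not timed)
set_false_path -from [get_ports rst_n]
```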

The "UltraFast" Design Methodology

Clock pins

generate all I/O and clocking IP before pin assignments
consolidate clocks and MMCMs, fewer is better
re-visit CDC
consider manual placement for closure

Data pins

group related pins in same or adjacent banks
put clocks in the same bank
consider control signals and data flow
high fanout signals should be in the middle of the chip, and maybe even use clock-capable pins
evaluate pin attributes on placement

I/F control

group I/F data, addr, control lines to same or adjacent banks
place clock/enable/reset/strobe lines in the middle of the data bus
use provided memory bank/byte planner for assigning mem I/O pin groups
GTs have a specific pinout requirement; watch SSI crossing of SLR boundaries, and there is a GT wizard to generate the core

How to reach closure and reduce build time?

use pblocks to assign logic to SLR
don't allow excessive utilization of single SLR
don't allow data to cross SLR boundaries over and over
solve large timing problems as early as possible in the design flow to get a high quality netlist
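
A rough sketch of assigning logic to one SLR with a pblock. The pblock name and the cell u_datapath are hypothetical, and the sketch assumes the target is an SSI device where SLR0 is accepted as a pblock range.

```tcl
# Hypothetical pblock and cell names; SLR0 must exist on the target device.
create_pblock pblock_slr0
resize_pblock [get_pblocks pblock_slr0] -add {SLR0}

# Keep one hierarchical block entirely inside SLR0 so its internal paths
# do not cross SLR boundaries.
add_cells_to_pblock [get_pblocks pblock_slr0] [get_cells u_datapath]
```

Checking utilization per pblock afterwards helps confirm that no single SLR is overfilled.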

Build Work Flow

At Synthesis (after Synthesis run is complete, but before Implementation is run):

Setup violations: must be resolved thru timing constraints.
Hold violations: ignored and will be resolved by the tool during implementation thru introduction of delays into logic path.

At Implementation (after Implementation run is complete):

Hold violation: should not occur at the completion of the implementation run.
Setup violations: ideally do not occur here; if they occur, apply logic optimization coding techniques in your design's code.

Clocks

Clocks of the same frequency can still have different phases. Note that Vivado's default timescale is ns. The clock object database is flat, with no concept of hierarchy.

A primary clock enters through an input port, sourced externally. A forwarded clock is an internal FPGA logic clock that is driven to an output pin, used as a reference for other outputs. Physically exclusive clocks have the same source point, and same clock tree. Logically exclusive clock groups have different source points, and share part of the clock tree.
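
As an illustration, a primary clock and a forwarded clock could be constrained as follows. The port names, the period, and the ODDR instance path u_clk_fwd/oddr_inst are assumptions made for this sketch.

```tcl
# Primary clock entering through an input port (hypothetical name and period).
create_clock -name sys_clk -period 10.000 [get_ports sys_clk]

# Forwarded clock: sys_clk re-driven to output port clk_out through an ODDR.
# The instance path is hypothetical.
create_generated_clock -name fwd_clk -source [get_pins u_clk_fwd/oddr_inst/C] -divide_by 1 [get_ports clk_out]

# Other outputs are then constrained relative to the forwarded clock.
set_output_delay -clock fwd_clk -max 1.500 [get_ports {dout[*]}]
set_output_delay -clock fwd_clk -min -0.500 [get_ports {dout[*]}]
```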

A virtual clock has no physical connection to ports/pins and does not really exist in the design; it is used only for I/O delay constraints and represents a clock external to the FPGA. A virtual clock is declared with the Tcl create_clock command with no source object specified. https://support.xilinx.com/s/article/55287?language=en_US
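
A short sketch of a virtual clock used as the reference for input delays. The name virt_clk, the period, the port din, and the delay values are hypothetical.

```tcl
# No source object is given, so this clock exists only for timing analysis.
create_clock -name virt_clk -period 10.000

# Input data launched by an external device that is clocked by virt_clk.
set_input_delay -clock virt_clk -max 3.000 [get_ports {din[*]}]
set_input_delay -clock virt_clk -min 1.000 [get_ports {din[*]}]
```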

You may get this warning about a perfectly valid clock name and set_clock_groups command: [Vivado 12-4739] set_clock_groups:No valid object(s) found for '-group clk_fpga_1'. ["/mnt/newHHD/sandbox/midnight_fw/fpga_build/midnight_FPGAsrc/VSG_xilinx_2ch/VSG_xilinx.srcs/constrs_1/new/VSG_xilinx.xdc":312] This may be because out-of-context (OOC) synthesis is being used. Once synthesis is complete, run the same set_clock_groups command again and see if the warning is still there. Accordingly, some have suggested that this can be ignored during synthesis.

Inter/Intra

There are a number of timing constraints that apply to flip-flops (or other elements) that are all relative to the same clock signal. These are called intra-clock constraints, or timing, etc. Sometimes there are timing paths or constraints that cross from one clock to another. These are called inter-clock constraints, or timing. For example, you can have a skew defined for one clock (intra-clock skew) or you can define the skew between two or more different clocks (inter-clock skew). Suggest fixing inter-clock timing violations first.
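
To see which clock pairs actually interact, a few Vivado report commands can be run from the Tcl console on an open synthesized or implemented design. This is a sketch; the clock names clk_a and clk_b are hypothetical.

```tcl
# Matrix of clock pairs, showing how the paths between them are constrained.
report_clock_interaction -delay_type min_max

# Structural check of clock-domain crossings.
report_cdc

# Worst setup paths between two specific clocks (names are hypothetical).
report_timing -from [get_clocks clk_a] -to [get_clocks clk_b] -delay_type max -max_paths 10
```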

Managers

MMCMs and PLLs live on CMTs. The MMCM and PLL share many characteristics. Both can serve as a frequency synthesizer for a wide range of frequencies and as a jitter filter for incoming clocks. At the center of both components is a voltage-controlled oscillator (VCO), which speeds up and slows down depending on the input voltage it receives from the phase frequency detector (PFD).

Visualizing Clocks

Handy Tcl command to show system clock schematic: show_schematic [get_cells -hierarchical -filter { PRIMITIVE_TYPE =~ CLK.*.* } ]

Groups

set_clock_groups is a commonly used constraint that can associate or dissociate clocks. Paths between dissociated clocks are treated as false paths. It has three mutually exclusive arguments: -physically_exclusive, -logically_exclusive and -asynchronous. What are the differences between the three arguments?

The three clock relationships originate from SDC, where they have different impacts on signal-integrity (SI) crosstalk analysis. From an FPGA timing analysis perspective, the impact is the same.

-asynchronous
When there are valid timing paths between two clock groups, but the two clocks do not have any frequency or phase relationship and these paths do not need to be timed, use -asynchronous. When there are false timing paths (physically or logically non-existent) between two clock groups, use -physically_exclusive or -logically_exclusive.

-logically_exclusive
Used for two clocks that are defined on different source roots. Logically exclusive clocks do not have any functional paths between them, but might have coupling interactions with each other. An example is multiple clocks that are selected by a MUX but can still interact through coupling upstream of the MUX cell. When there are physically existing but logically false paths between the two clocks, use "set_clock_groups -logically_exclusive".

-physically_exclusive
Used for two clocks that are defined on the same source root by "create_clock -add". Timing paths between these two clocks do not physically exist, so use "set_clock_groups -physically_exclusive" to set them as false paths.
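
One example of each form, as a sketch. Note that Vivado spells the flags -logically_exclusive and -physically_exclusive. All clock and port names and both periods are hypothetical.

```tcl
# Asynchronous: real paths exist, but the clocks have no known phase
# relationship, so the paths are not timed (CDC logic must handle them).
set_clock_groups -asynchronous -group [get_clocks clk_a] -group [get_clocks clk_b]

# Logically exclusive: two clocks selected by a mux, so only one of them
# is active downstream of the mux at any time.
set_clock_groups -logically_exclusive -group [get_clocks clk_mux_0] -group [get_clocks clk_mux_1]

# Physically exclusive: two clocks defined on the same port with -add.
create_clock -name clk_mode0 -period 10.000 [get_ports clk_in]
create_clock -name clk_mode1 -period 8.000 -add [get_ports clk_in]
set_clock_groups -physically_exclusive -group [get_clocks clk_mode0] -group [get_clocks clk_mode1]
```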

FREQ_HZ Disagreements

[BD 41-237] Bus Interface property FREQ_HZ does not match between /ecc_proxy_ip_0/ECC_S00_AXI(180000360) and /processing_system7_0/M_AXI_GP1(180000000) O

For errors like this, open the IP block for editing. Then go to the Ports and Interfaces listing and edit the interface in question. On the Parameters tab of the box that opens up, you can change the FREQ_HZ value.

PS FCLKs

The PS side of the Zynq provides clocks called FCLK0-FCLK3 for clocking the PL side. It is convenient to use these FCLKs. However, the jitter associated with these clocks is considerably higher than the jitter for clocks normally used for Programmable Logic (PL). Specifically, the set_input_jitter constraint shown on pg93 of PG082 indicates that FCLK jitter is 0.6ns Pk-Pk. Good clocks have jitter well below 0.1ns Pk-Pk. The higher jitter of the FCLKs could limit the run/clocking speed for your PL-side applications. When you correctly specify clock jitter with the set_input_jitter constraint and correctly specify clock period with the create_clock constraint, Vivado timing analysis will tell you whether your PL-side application will operate properly. -Mark
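
A sketch of the jitter constraint described above. It assumes the FCLK clock object is named clk_fpga_0 and has already been created (the PS IP normally supplies its own create_clock); check the actual clock name in your design before using it.

```tcl
# clk_fpga_0 is an assumed clock name; 0.6 ns peak-to-peak is the FCLK
# jitter figure quoted above.
set_input_jitter [get_clocks clk_fpga_0] 0.600

# Re-run timing to see the effect of the added jitter on slack.
report_timing_summary
```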

Slack Timing Analysis

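When the slack is positive, timing is said to be MET. Hold slack is reported as arrival time - required time:

Slack (MET) :             0.111ns  (arrival time - required time)

If negative, timing is said to be VIOLATED. Setup slack is reported as required time - arrival time:

Slack (VIOLATED) :        -0.633ns  (required time - arrival time)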

Paths

The path characteristics fall into four main categories: timing, logic, physical, and property. You can find the definition of each characteristic in the command's long help. Tcl Command: report_design_analysis -help

Timing

The timing path requirement is typically one clock period for setup/recovery analysis, 0ns for hold/removal analysis, when the startpoint and endpoint are controlled by the same clock, or by clocks with no phase-shift. When the path is between two different clocks, the requirement corresponds to the smallest positive difference between any source and destination clock edges. This value is overridden by timing exception constraints such as multicycle path, max delay and min delay.
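
The exceptions mentioned above override the default requirement. A sketch with hypothetical cell names and values:

```tcl
# Allow two clock cycles for setup on a slow path (cell names are hypothetical)...
set_multicycle_path 2 -setup -from [get_cells src_reg] -to [get_cells dst_reg]
# ...and move the hold check back by one cycle so it stays at the launch edge.
set_multicycle_path 1 -hold -from [get_cells src_reg] -to [get_cells dst_reg]

# Alternatively, bound the datapath delay on a path directly (value is hypothetical).
set_max_delay 8.000 -from [get_cells cfg_reg] -to [get_cells status_reg]
```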

Paths with setup requirement under 2 ns are difficult to meet and must be avoided in general, especially for the older architectures.

The Path Delay = Logic Delay + Net Delay info details the total datapath delay. If the Logic Delay makes up an unusually high proportion of the total datapath delay, for example 50% or higher, it is advised to examine the datapath logic depth and types of cells on the logic path, and possibly modify the RTL or synthesis options to reduce the path depth or use cells with faster delays. If the Net Delay dominates the total path delay for a setup path where the Requirement is reasonable, it is advised to analyze some of the physical characteristics and property characteristics of the path listed in this section. Specific items to look at include the High Fanout and Cumulative Fanout characteristics, to understand if some nets of the path have a high fanout that could potentially be causing a placement problem.

What is high number of logic levels?
This is a case where logic exceeds some percentage of the total path delay, implying that there is too much logic between timing end points; the amount of logic must be reduced in order to meet timing requirements. This number was traditionally around 50% for older architectures; it would need to be quantified for Virtex families (60%).
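
To find such paths, the following commands can be run from the Tcl console on an open design. This is a sketch; the path counts are arbitrary.

```tcl
# Histogram of logic levels on the worst paths.
report_design_analysis -logic_level_distribution

# Timing, logic, physical, and property characteristics of the worst paths.
report_design_analysis -timing -max_paths 10

# Detailed report of the ten worst setup paths; each one lists logic delay
# and net (route) delay separately.
report_timing -delay_type max -max_paths 10 -nworst 1 -sort_by group
```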

The Vivado Design Suite router prioritizes fixing hold over setup. This is because your design may work in the lab if you are failing setup by a small amount. There is always the option of lowering the clock frequency. If you have hold violations, the design will most likely not work.

Floorplanning

Floorplanning can improve the setup slack (TNS, WNS) by reducing the average route delay. During implementation, the timing engine works on resolving the worst setup violations and all the hold violations. Floorplanning can only improve setup slack.

Strategies

Note that the qualifiers of Low and High on the NetDelay strategies refer to the priority level, not the delay amount. So if you need shorter net delays, pick NetDelay High.

External Device Interface Timing

Example from Xilinx course:

Estimate if the AD5404 500-Mbps interface can meet the timing for this interface.

Compare the input data width and the minimum data window required.
The minimum input data window is 1.2 ns wide, as determined from the AD5404 datasheet.
If TIME mode is used with a 1173-ps delay, the data window required by the FPGA is 1.661 ns:
  Total slack available is 1.2 - 1.661 = -0.461 ns.
  Negative slack implies this will not meet the timing.
  At 500 Mbps, the maximum possible input data window is 1/500 MHz = 2 ns. This is sufficiently wide to meet the timing. However, the AD5404 consumes 40% of this window for its timing uncertainties, making it more challenging to meet in Component/TIME mode.
If COUNT mode is used with a 255-tap delay, the data window required by the FPGA is 1.115 ns:
  Total slack available is 1.2 - 1.115 = 0.085 ns.
  Since the input data is center aligned (at the input and within the FPGA), this positive slack will divide equally between setup slack and hold slack. This will meet timing with about 42 ps slack for setup and hold each.
For the 500-Mbps interface, timing can be closed by using COUNT mode.

Estimate if the 900-Mbps interface of the AD5409 can meet the timing for this interface with dynamic delay adjustment.

Compare the input data width and the minimum data window required.
The minimum input data window is 0.460 ns wide (0.672 ns typical), as found from the AD5409 datasheet.
The worst-case data window required by the FPGA is 0.887 ns (Slow corner).
There is no positive slack to meet the timing. At 900 Mbps, the maximum possible input data window is 1/900 MHz = 1.1 ns. This is sufficiently wide to meet the timing. However, the AD5409 consumes 58% of this window for timing uncertainties, making it more challenging to meet in Component mode.
Native mode would need to be considered to meet this timing.

UltraScale Family

Virtex and Kintex available as US, but US+ also has Zynq and Artix. Artix is the low-power low-cost chip, which has double the power efficiency and b/w of the 7-series Artix chips.

Has three IOB options:

high perf HP = high speed memory I/F, <= 1.8V
high density HD = low speed I/F
high range HR = <= 3.3V wide standard set options

HBM Gen 2 has the highest DRAM b/w available, using SSI technology. Has a hard AXI I/F controller and supports CCIX.

Routing delay now dominates overall delay and clock skew consumes more margin than before.

Two slice types, SLICEL (LUTs, MUX, CLA) and SLICEM (RAM or 32-bit SR).

The traditional unimacros library is not supported for US, use XPM instead.

RAM Options

Distributed RAM is created with LUTs and is faster/smaller than BRAM but increases chip utilization.

BRAMs are dedicated blocks built into the fabric. They can be dynamically powered down when not in use while retaining their contents. Up to 36 Kb per block. They can be cascaded and have integrated error correction. Often used for FIFO implementation.

UltraRAM is for larger amounts of data. 288 Kb/block (72 bits wide by 4096 deep). Optional 64-bit ECC and sleep mode. 16 URAM blocks per clock region per column. The rule of thumb for design is to target URAM only if you need 144 Kb or more (four or more 36 Kb BRAMs).

I/O Resources

types are IOB, ILOGIC/ISERDES, OLOGIC/OSERDES, IODELAY
1-3.3V for various standards like single-ended, differential, ref input, tri-state

For DDR4, up to 2400 Mb/s speed.

Component mode vs Native mode:
The former is manual design and primitive instantiation; the latter uses primitives mapped directly to PHY circuits for high-speed parallel interfaces like DDR4 (baseline speed is 1600 Mb/s), using the I/O wizard for design. Delays can be controlled using TIME or COUNT mode. TIME mode is convenient for fixed delays, and the delay is maintained across voltage/temperature changes. COUNT mode gives you 512 possible taps but no calibration, so the delay varies with voltage and temperature; choosing the right number of taps is therefore iterative. Best performance is achieved in component mode with fixed delay. The native mode RX_BITSLICE primitive is equivalent to ISERDES+IDELAY+RX_FIFO, and TX_BITSLICE is equivalent to OSERDES+ODELAY. The HSSIO wizard creates an HDL wrapper for the bitslice and PLL configuration.

Migration from 7-series Designs to US+

Two options for IP, managed or project local. The former is preferred. A managed migrating IP project needs to be moved before the FPGA design is moved.

For clock optimization, consider merging clocks sharing freq+phase, remove redundant buffers, consider replacing MMCM with BUFGCE_DIV or BUFG_GT for simple divided clocks. In timing analysis, intra-clock "partial false paths" may be reported due to native IP generating false path specs, which is fine.

Logic and Digital Electronics Principles

http://en.wikipedia.org/wiki/Four_value_logic

Note that you cannot simply route bi-directional "pass-thru" lines through an FPGA between devices that need to talk to one another directly. You can't just use the FPGA as a wire connector. This works fine for unidirectional signals, but the problem with bi-directional signals is the FPGA has an input or output buffer for each pin or a dual buffer for bi-directional lines. A signal controls how the pin is driven, in or out, and if the FPGA doesn't have smarts about who is supposed to be driving the line at the proper instant it won't know what to do. You'd have to create a state machine in the fabric to handle this.

A good illustration of this is using I2C because it has a bi-directional data/address line SDA.

Related to this issue, inout ports are an interesting special case. They are typically used only in the top level component at the chip edges. You'll need an IO buffer that can tri-state the pin, such as this VHDL example:

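-- Requires the Xilinx UNISIM library at the top of the file:
--   library UNISIM; use UNISIM.vcomponents.all;
IOBUF_inst : IOBUF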
   generic map (
      DRIVE => 12,
      IOSTANDARD => "DEFAULT",
      SLEW => "SLOW")
   port map (
      O => RIGHT_CONNECTOR_A1_O,     -- Buffer output
      IO => RIGHT_CONNECTOR_A(1),   -- Buffer inout port (connect directly to top-level port)
      I => RIGHT_CONNECTOR_A1_I,     -- Buffer input
      T => RIGHT_CONNECTOR_A1_T      -- 3-state enable input, high=input, low=output 
   );