This page belongs to a series of pages about timing. The previous pages explained the theory behind timing calculations, discussed the clock period constraint, and introduced timing closure. But what if you've tried to get it right, and yet there's a timing problem? This page attempts to answer that question.
Introduction
In the previous page I tried to convince you that there's no single method for solving a timing closure problem. Sometimes it's correct to focus on the critical path, and sometimes it isn't. Sometimes the problem can be fixed easily with a simple change in the tool's settings, and sometimes it's much harder than that. There is no substitute for using all the experience and wisdom you have in order to reach the root cause of the problem. There is no way to summarize timing closure with a checklist.
And yet, it often helps to have a list of possible strategies. So on this page I've collected a few topics that are worth thinking about when facing a problem with timing. If you're reading this because you have a specific timing problem to solve, it's possible that one of these ideas will lead you to the solution. But don't expect your solution to be spelled out here.
Also keep in mind that this series of pages doesn't end here. I've chosen to discuss timing closure before many other topics for the sake of motivation. However, the information in the pages that come later is also relevant.
For the same reason, the discussion about I/O timing constraints is postponed to later. For the time being, I'm focusing on paths that begin and end inside the FPGA.
So here come a few ideas to consider in relation to timing closure.
Idea #1: Fix the logic design
This is always the least appealing solution, in particular if the design is already known to work properly: You don't want to change something that works. And yet, often the fundamental reason for the problem is that the Verilog code wasn't written well enough for the required performance. A change in the logic design solves the problem once and for all, instead of leaving you to struggle with it again and again.
There are a few suggestions for writing fast logic in the previous page. And it's worth repeating this: Always have timing in mind during the development process. It's much harder to fix timing problems than to write the Verilog code properly from the beginning.
Idea #2: Reduce the fan-out
When a net has a high fan-out, the propagation delay increases for two main reasons:
- The capacitance of the physical wire is larger, so more electric charge is required to change the logic state.
- It becomes more difficult for the tools to find a routing with a low delay for all destinations of the net: These destinations are logic elements that are scattered on the logic fabric. So the more destinations there are, the harder it becomes to optimize the timing so that all of these connections have a low delay.
If a synchronous reset is used in the design, this signal is likely to have a high fan-out. This topic is discussed in a separate page.
However, every signal that reaches a lot of logic elements can potentially cause timing problems due to a high fan-out. Sometimes this high fan-out is obvious (e.g. clock enable signals), and sometimes it's harder to anticipate. The FPGA tools can usually help by listing the nets that have the highest fan-outs.
There are two methods to keep the fan-out low:
- The FPGA tools have a limit on the fan-out. When this limit is reached, the tools duplicate the register that is the source of the net. It's possible to change this limit's value for each register by virtue of synthesis constraints. It's also possible to change the global limit by changing the synthesizer's parameters.
- Edit the Verilog code: Explicitly replicate the register with a high fan-out into several registers.
Clearly, both methods reach the same result: The register that has a high fan-out is replicated into several registers. So if this can be done automatically by the tools (with the first method), why bother doing this manually (as in the second method)?
The second method requires more effort, but has one significant advantage: It's possible to replicate the register in a sensible way. Keep in mind that the goal is not only to reduce the fan-out. It's also important that the output of each register is distributed to logic elements that are placed in a small region on the logic fabric. Otherwise, the result is large routing delays due to the physical distance. So if the Verilog code is written with fan-out in mind, it's possible to ensure short connections between the logic elements. This is demonstrated for a synchronous reset signal on a different page.
By contrast, if the FPGA tools are responsible for replicating the registers, the result may not be as efficient. The improvement of the routing delay depends on the algorithm that decides how each replicated register is used. The quality of the result hence depends on which FPGA tool is used.
Note that by default, when the synthesizer detects two registers that behave exactly the same, these registers are automatically merged into one register. So if a register is replicated in the Verilog code, the synthesizer will replace all replicas with one register. This is often true even if these equivalent registers are defined in different modules. To avoid this merging, it's necessary to disable this feature explicitly. A common way to accomplish this is with synthesis attributes, e.g. "dont_touch", "dont_merge" or "keep".
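To make this more concrete, here's a minimal sketch of both methods in Verilog. The attributes are Xilinx-style (other tools have equivalents with similar names), and the signal names (@enable_pre, @enable and its replicas) are made up for this illustration. The first register demonstrates a per-register fan-out limit; the other two demonstrate explicit replication, protected against merging with "dont_touch":

// Method 1: Request that the synthesizer replicates this register
// automatically when its fan-out exceeds 16 (the limit's value here
// is arbitrary).
(* max_fanout = 16 *) reg enable;

// Method 2: Replicate the register explicitly. The "dont_touch"
// attribute prevents the synthesizer from merging the equivalent
// registers back into one register.
(* dont_touch = "true" *) reg enable_a;
(* dont_touch = "true" *) reg enable_b;

always @(posedge clk)
  begin
    enable   <= enable_pre;
    enable_a <= enable_pre; // Used only by logic in one part of the design
    enable_b <= enable_pre; // Used only by logic in another part
  end

With the second method, the Verilog code (and not the tools) decides which part of the design is driven by each replica, which is the key to keeping the connections short.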
Idea #3: Check the floorplanning
By default, the placement of the logic elements on the logic fabric is determined automatically by the FPGA tools (the placer, more specifically). It is however possible to request that certain logic elements are placed in specific areas on the FPGA. It's also possible to request that a specific logic element is placed at a specific position. Requests of this sort are referred to as floorplanning. These requests are made by virtue of placement constraints, which are often Tcl commands that have a syntax that is similar to timing constraints.
In most cases, placement constraints make it more difficult to achieve the timing constraints. The first reason is obvious: When the placer's choices are limited, the results can only be worse than without a limitation. However, there are more concrete reasons for this:
- Floorplanning can force much logic into a small area on the FPGA. This can lead to routing congestion: The logic elements inside that area require more routing than usual. As a result, the router is forced to use sub-optimal resources. This leads to sub-optimal routing delay, and possibly a failure to meet the requirements.
- Placement constraints may force logic elements to be far away, even though it would be better to place them close to each other. The placer is hence prevented from moving the logic elements for the sake of reducing the routing delay.
- Floorplanning can create regions that are obstacles to routing. For example, suppose that the floorplanning has created an area that is densely populated with logic elements. The routing of other logic may be forced to avoid entering this congested area. As a result, this routing needs to go a longer distance, and hence the routing delay becomes larger.
In most designs, it's best to avoid floorplanning and thus give the placer the freedom to optimize the placement of the logic elements. There are however cases where placement constraints are commonly used, for example:
- An IP block can contain placement constraints for the logic elements that it generates. For example, an IP that implements a PCIe interface often creates placement constraints that determine the position of its most important components: The transceivers, PLLs and the dedicated PCIe hard IP. Placement constraints of this sort are usually necessary and correct.
- The FPGA can be divided into regions, so that each region contains only specific modules. This is the original meaning of floorplanning. The motivation for dividing the FPGA like this can be to allow different teams of a project to work independently.
- Partial Reconfiguration is a feature that allows loading a bitstream into the FPGA while it is working, so that only a part of the FPGA is affected. For this to work, floorplanning is required: The FPGA is divided into areas that remain untouched when the new bitstream arrives, and areas that are updated by this bitstream.
For the purpose of timing closure, it's important to be aware of placement constraints as a potential reason for problems. In particular if routing delays are larger than expected, the root cause can be the placer's inability to optimize the positions of the logic elements. Remember that floorplanning can have a negative effect on paths that are unrelated to the logic elements whose placement is restricted.
Idea #4: Check the timing constraints
The timing constraints are crucial to the reliable operation of the FPGA. Hence they should be verified before the implementation of the project. However, sometimes the timing constraints turn out wrong anyhow. The mistakes can become evident during the timing closure process. This shouldn't happen, but when it does, it's of course better to fix the problem.
There is a whole page about checking timing constraints. Here I'll only discuss two common mistakes that can lead to problems with timing closure:
- Unnecessary enforcement of timing constraints on paths between unrelated clocks.
- Unnecessary enforcement of timing constraints on asynchronous resets.
So first, about unrelated clocks: The topic of clock domain crossings has already been discussed earlier: There are several reasons why it's important that the timing constraints reflect which clocks are related clocks and which are not. The most important reason is to ensure proper operation of the logic, but timing closure is also affected: If a pair of clocks is unnecessarily treated as related clocks by the tools, this leads to unnecessary enforcement of timing requirements on the paths between these two clocks. As a result, the tools waste efforts on these paths at the expense of paths that really need these efforts.
The timing constraint that fixes this problem is explained later on in this series of pages.
Regarding asynchronous resets: In most cases, it's necessary to enforce timing constraints on a path that ends at an asynchronous reset. But sometimes there is no need for this: for example, if there is a guarantee that the clock will not be active when the reset changes to inactive. Another possibility is when the flip-flop that receives the asynchronous reset has a protection mechanism against timing violations, just like with a clock domain crossing. In these situations, the tools' efforts to meet the timing requirements are pointless.
It can be hard to realize that problems of this sort are what makes it difficult to achieve timing closure: Sometimes the critical path has nothing to do with the two clocks that are unnecessarily treated as related clocks. If an asynchronous reset diverts the tools' efforts, that is even harder to recognize. In such situations, attempts to focus on the critical path in order to improve its timing can be futile.
The timing constraints can of course be wrong in a variety of other ways. The situation described here is just one possibility. So a problem with timing can be a good opportunity to review the timing constraints in general.
Idea #5: Just try again
Recall that the place and route process begins with scattering the logic elements quite arbitrarily on the logic fabric. The tools then try to improve the timing through repeated attempts. Hence the success of this process relies on a certain amount of luck. It's also possible that a slightly different behavior of the place and route algorithm will achieve better results, even when there is no logical explanation for this.
So if the timing constraints fail, and the negative slack is relatively small (about 10-20% of the total delay), it might be enough to just try again. But simply re-running the implementation probably won't do any good: Most FPGA software is designed to repeat its result accurately when it's run with the same input. So it's necessary to change something before rerunning. This change doesn't have to be related to the critical path. The point is only to avoid an accurate repetition of the previous implementation.
For example, in Vivado each run has an attribute that is called "strategy". As its name implies, this attribute controls the strategy that the tools apply during the implementation. Changing this attribute ensures that the next implementation will not be identical to the previous one. It's also possible that a different strategy makes more sense in relation to the specific logic design.
All FPGA tools offer similar possibilities to modify the parameters of the implementation process. It's often possible to request a higher level of effort to meet the design's goals. Sometimes a higher effort is really required, but often requesting a higher level of effort helps just because the tools do something else.
Another way to avoid a repetition is making changes in the Verilog code. Once again, the change doesn't have to be related to the critical path. Sometimes it's enough to change the name of a register in order to obtain an implementation that is different enough from the previous one.
This idea can be taken to the extreme: It's possible to run the implementation on several computers in parallel, so that each implementation has slightly different parameters. This can make sense when the price of the FPGA is important: In this scenario, it's worthwhile to let the computers work hard so that the timing constraints are achieved on a cheaper FPGA.
To summarize, this method relies mostly on luck. The expectations should be accordingly: Trying again helps only when the tools occasionally fail to achieve the timing constraints. But it's always better to improve the timing in other ways, if possible.
Idea #6: Is the FPGA full?
It's quite common that problems with timing constraints begin when the FPGA's fill level reaches about 70%. There are three main reasons why this happens:
- The logic elements are packed more densely into the FPGA's logic fabric. The placer's freedom to improve timing is therefore more limited, because it's harder to move logic elements from one place to another. This leads to larger routing delays.
- The amount of wiring (routing resources) in an FPGA is limited. As long as the FPGA is relatively empty, the router is allowed to select the most suitable pathway between the logic elements. When more logic is added, sub-optimal choices lead to larger routing delays.
- The specialized logic elements in the FPGA may run out. For example, most FPGAs have block RAMs and dedicated components for arithmetic multiplication. When these resources run out, the tools implement the required functionality with simple logic elements (slices). This often leads to a large number of logic levels, and hence an increased logic delay. The FPGA's fill level can also begin to increase faster than expected, because slices are used instead of specialized logic elements.
Among the three reasons that are given above, only the third has some kind of solution: For example, it may help to manually decide which logic uses the FPGA's block RAMs and other similar resources. Other than this, the only solution to a full FPGA is to select a larger one. This is not always an option, however. It's therefore important to anticipate that the timing closure will become harder as more logic is added.
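As an example of deciding manually, here's a minimal sketch that requests that a small memory is implemented with LUTs (distributed RAM) rather than a block RAM, so that the block RAMs remain available for logic that benefits from them more. The "ram_style" attribute is Xilinx-style (other tools have equivalent attributes), and the signal names are made up for this illustration:

// Without the attribute, the synthesizer may pick a block RAM for this
// small memory. The attribute requests an implementation with LUTs
// (distributed RAM) instead.
(* ram_style = "distributed" *) reg [7:0] small_mem [0:63];

reg [7:0] readout;

always @(posedge clk)
  begin
    if (we)
      small_mem[write_addr] <= write_data;

    readout <= small_mem[read_addr];
  end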
Idea #7: Maybe you need a different FPGA?
Sometimes, there is no choice but to admit that the FPGA isn't up to the job. If the same FPGA is available with a higher speed grade, the solution can be to upgrade to the faster FPGA. This decision increases the purchase cost of course, but it can also have another, less expected consequence: There might be a shortage of FPGAs with the higher speed grade. Even if these faster FPGAs are abundant at a specific time, faster FPGAs are usually the first ones to disappear from the market when the demand exceeds the supply.
This is quite natural: The faster FPGAs are those that passed the tests better, and they can always be used instead of the slower FPGAs. Sometimes the manufacturer fails to produce the faster FPGAs, and sometimes there's a large consumer that buys everything it can work with.
So if you work on a product that is intended for production over a long time, always prefer the lowest speed grade that your design can work with. This is true even when money is not an issue.
A completely different type of upgrade is to choose an FPGA from a more recent FPGA family. Or maybe choose an FPGA from a different vendor. This change is more drastic, but it should be considered if the project is in its early stages. We all tend to fall in love with the tools and components that we're familiar with. But when everything feels too familiar, it's a good time to look around for alternatives.
That said, experienced FPGA engineers know that choosing the latest, brand new FPGA and its tools is a hazardous gamble. But often there's a fairly established alternative that is much better than the current choice of FPGA. In such a situation, it's better to leave the comfort zone and try something new.
Idea #8: Reduce the temperature range
There's a reason why I put this possibility almost last: This is the ugliest solution of them all. But sometimes there's no other choice.
By default, the tools' enforcement of the timing constraints guarantees that the FPGA works reliably on the entire temperature range that is defined in the datasheet. Some FPGA tools allow choosing another temperature range for a project (for example, Quartus has an attribute called "MAX_CORE_JUNCTION_TEMP" for this purpose). This can be used to inform the tools that there is no need to support the full temperature range.
Generally speaking, the delays of logic elements in the FPGAs increase as the temperature rises. If the maximal temperature is reduced, the values of the delays are smaller in the timing calculations. This makes it easier for the tools to achieve tsetup requirements. Sometimes, this is the only way to make the tools achieve the timing constraints.
It's important to understand the risks of using this method. In particular, note that we're talking about the junction temperature. In other words, this is the temperature on the silicon of the FPGA. This is not the ambient temperature.
So when the maximal temperature is 85°C, it doesn't mean that the FPGA works in an oven at that temperature. This temperature can also be reached at room temperature (25°C), in particular if there is no heatsink on the FPGA. There is always a difference between the temperature at the junction and the ambient temperature. The magnitude of this difference depends on the power consumption of the FPGA and the cooling solution.
If you work on a commercial product, be aware that the ambient temperature of the FPGA can be significantly warmer than room temperature. In particular, if the FPGA is inside a box with a bad air flow, the temperature inside the box can rise much above the temperature outside the box. To make things worse, most electronic products are expected to work at an ambient temperature between 0°C and 40°C (approximately; these numbers differ from one product to another). So when the final product is tested at the highest temperature, and the FPGA is inside the product's enclosure, what's the temperature at the junction? That's the question to ask. The timing calculations must be based upon that temperature (or higher).
In other words, if you reduce the maximal temperature for the sake of achieving timing constraints and everything works fine in the lab, that means nothing. Recall that where timing constraints are concerned, the fact that the FPGA design works in the lab always means nothing. But reducing the maximal temperature is even worse in this regard. Narrowing the temperature range thoughtlessly can be an invitation for real trouble: The product's final test before production, which is done at the highest temperature, may fail, and it will be impossible to achieve the timing constraints once the temperature range is corrected. The only way to solve this problem will be to rewrite the FPGA design from the beginning.
So before changing the temperature range for the sake of timing closure, make sure that it's safe to do so: Make a rigorous evaluation of the temperature range at the junction in all possible working conditions.
Note that it's impossible to expand the temperature range beyond the default by changing the implementation's parameters. The tool's default temperature range is always the same as stated in the datasheet. Hence it's not possible to guarantee a reliable operation of the FPGA beyond this default temperature range.
Idea #9: Unaligned related clocks
This is a rather esoteric situation, and it's also a bit difficult to understand. That's why I've put this topic last.
Suppose there is a clock domain crossing between two related clocks that are not aligned. In other words, the two clocks are derived from the same reference clock, but there is no mechanism to control the clock skew between these clocks.
As a result, it becomes more difficult for the tools to meet the timing requirements of paths between these two clocks. There are two possible kinds of difficulties (recall that tsetup and thold were explained earlier):
- When the clock arrives later to the first flip-flop because of the clock skew: There is hence less time until the next clock edge reaches the second flip-flop. It's therefore harder to meet the tsetup requirement.
- When the clock arrives earlier to the first flip-flop because of the clock skew: Hence, the first flip-flop updates its output before the same clock edge has reached the second flip-flop. As a result, the thold requirement may be violated on the second flip-flop. The tools prevent this by making the routing of the path artificially long. This can lead to a failure to meet the tsetup requirement. It's also a waste of routing resources.
It's important to note that when the tools need to work harder than usual to overcome these difficulties, it may come at the expense of optimizing other paths.
But having unaligned related clocks is not an error, and sometimes it's inevitable. This situation only means that the tools need to work harder. If the timing constraints are achieved, there's no problem with the design. However, this situation should be avoided whenever this is possible with reasonable effort.
Note that the mistake that is mentioned in "Idea #4" above is a different situation, even though both make it unnecessarily hard for the tools, and both are related to clock domain crossing.
The best way to solve a situation with unaligned related clocks is to align these clocks. This is usually done by adding a PLL, or by adding a clock output to an existing PLL. The goal is that both clocks in question are outputs of the same PLL.
Another possible solution is to treat the clocks as unrelated clocks. This requires a change in the logic design itself as well as a change in the timing constraints. It may be worth the effort if this isn't too difficult. More on this topic later.
We shall now look at an example of a clock domain crossing between related clocks that are unaligned.
wire pll_clk;

reg [24:0] result;
reg [11:0] x, y, x1, y1;

clk_wiz_0 pll_i
  (.clk_in1(clk),
   .clk_out1(pll_clk));

always @(posedge clk)
  begin
    x1 <= x;
    y1 <= y;
  end

always @(posedge pll_clk)
  result <= x1 * y1;
Note that there's a PLL (clk_wiz_0). This PLL uses @clk as its reference clock, which has a frequency of 250 MHz. It's the same PLL that was shown in the example at the top of a previous page. The frequency of @pll_clk is 125 MHz.
The important part of this example is the clock domain crossing between two related clocks (@clk and @pll_clk). Because only @pll_clk is generated by the PLL, these two clocks are not aligned. So there is a clock skew in the paths to @result (from @x1 and @y1). Despite this clock skew, the clocks are still related clocks, and the tools will try to meet the timing requirements.
If the only reason for using @clk is the need for a clock with 250 MHz frequency, the correct solution is to generate another clock with the PLL. It's not a waste of resources to produce a clock with the same frequency as the reference clock. On the contrary, doing this saves the tools a lot of effort. There is only one good reason to use @clk directly as shown in the example: When the logic that uses @clk must work before the PLL generates usable clocks.
The timing report that was generated by Vivado is as follows. In this specific case, there was only a problem with the tsetup requirement.
Slack (VIOLATED) :        -1.456ns  (required time - arrival time)
  Source:                 x1_reg[7]/C
                            (rising edge-triggered cell FDRE clocked by clk  {rise@0.000ns fall@2.000ns period=4.000ns})
  Destination:            result_reg/DSP_OUTPUT_INST/ALU_OUT[10]
                            (rising edge-triggered cell DSP_OUTPUT clocked by clk_out1_clk_wiz_0  {rise@0.000ns fall@4.000ns period=8.000ns})
  Path Group:             clk_out1_clk_wiz_0
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            4.000ns  (clk_out1_clk_wiz_0 rise@8.000ns - clk rise@4.000ns)
  Data Path Delay:        3.012ns  (logic 2.677ns (88.878%)  route 0.335ns (11.122%))
  Logic Levels:           5  (DSP_A_B_DATA=1 DSP_ALU=1 DSP_M_DATA=1 DSP_MULTIPLIER=1 DSP_PREADD_DATA=1)
  Clock Path Skew:        -2.192ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    0.998ns = ( 8.998 - 8.000 )
    Source Clock Delay      (SCD):    3.202ns = ( 7.202 - 4.000 )
    Clock Pessimism Removal (CPR):    0.012ns
  Clock Uncertainty:      0.148ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Discrete Jitter          (DJ):    0.103ns
    Phase Error              (PE):    0.086ns
  Clock Net Delay (Source):      1.414ns (routing 0.002ns, distribution 1.412ns)
  Clock Net Delay (Destination): 1.184ns (routing 0.002ns, distribution 1.182ns)
  Clock Domain Crossing:  Inter clock paths are considered valid unless explicitly excluded by timing constraints such as set_clock_groups or set_false_path.

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clk rise edge)        4.000     4.000 r
    AG12                                              0.000     4.000 r  clk (IN)
                         net (fo=0)                   0.000     4.000    clk_IBUF_inst/I
    AG12                 INBUF (Prop_INBUF_HRIO_PAD_O)
                                                      0.738     4.738 r  clk_IBUF_inst/INBUF_INST/O
                         net (fo=1, routed)           0.105     4.843    clk_IBUF_inst/OUT
    AG12                 IBUFCTRL (Prop_IBUFCTRL_HRIO_I_O)
                                                      0.049     4.892 r  clk_IBUF_inst/IBUFCTRL_INST/O
                         net (fo=1, routed)           0.795     5.687    clk_IBUF
    BUFGCE_X1Y2          BUFGCE (Prop_BUFCE_BUFGCE_I_O)
                                                      0.101     5.788 r  clk_IBUF_BUFG_inst/O
    X2Y0 (CLOCK_ROOT)    net (fo=62, routed)          1.414     7.202    clk_IBUF_BUFGCE
    SLICE_X52Y45         FDRE                                         r  x1_reg[7]/C
  -------------------------------------------------------------------    -------------------
    SLICE_X52Y45         FDRE (Prop_HFF_SLICEM_C_Q)   0.138     7.340 f  x1_reg[7]/Q
                         net (fo=1, routed)           0.335     7.675    result_reg/A[7]
    DSP48E2_X8Y18        DSP_A_B_DATA (Prop_DSP_A_B_DATA_DSP48E2_A[7]_A2_DATA[7])
                                                      0.396     8.071 r  result_reg/DSP_A_B_DATA_INST/A2_DATA[7]
                         net (fo=1, routed)           0.000     8.071    result_reg/DSP_A_B_DATA.A2_DATA<7>
    DSP48E2_X8Y18        DSP_PREADD_DATA (Prop_DSP_PREADD_DATA_DSP48E2_A2_DATA[7]_A2A1[7])
                                                      0.182     8.253 r  result_reg/DSP_PREADD_DATA_INST/A2A1[7]
                         net (fo=1, routed)           0.000     8.253    result_reg/DSP_PREADD_DATA.A2A1<7>
    DSP48E2_X8Y18        DSP_MULTIPLIER (Prop_DSP_MULTIPLIER_DSP48E2_A2A1[7]_U[10])
                                                      0.994     9.247 f  result_reg/DSP_MULTIPLIER_INST/U[10]
                         net (fo=1, routed)           0.000     9.247    result_reg/DSP_MULTIPLIER.U<10>
    DSP48E2_X8Y18        DSP_M_DATA (Prop_DSP_M_DATA_DSP48E2_U[10]_U_DATA[10])
                                                      0.164     9.411 r  result_reg/DSP_M_DATA_INST/U_DATA[10]
                         net (fo=1, routed)           0.000     9.411    result_reg/DSP_M_DATA.U_DATA<10>
    DSP48E2_X8Y18        DSP_ALU (Prop_DSP_ALU_DSP48E2_U_DATA[10]_ALU_OUT[10])
                                                      0.803    10.214 r  result_reg/DSP_ALU_INST/ALU_OUT[10]
                         net (fo=1, routed)           0.000    10.214    result_reg/DSP_ALU.ALU_OUT<10>
    DSP48E2_X8Y18        DSP_OUTPUT                                   r  result_reg/DSP_OUTPUT_INST/ALU_OUT[10]
  -------------------------------------------------------------------    -------------------

                         (clock clk_out1_clk_wiz_0 rise edge)
                                                      8.000     8.000 r
    BUFGCE_X1Y2          BUFGCE                       0.000     8.000 r  clk_IBUF_BUFG_inst/O
                         net (fo=62, routed)          1.078     9.078    pll_i/inst/clk_in1
    MMCME3_ADV_X1Y0      MMCME3_ADV (Prop_MMCME3_ADV_CLKIN1_CLKOUT0)
                                                     -1.777     7.301 r  pll_i/inst/mmcme3_adv_inst/CLKOUT0
                         net (fo=1, routed)           0.422     7.723    pll_i/inst/clk_out1_clk_wiz_0
    BUFGCE_X1Y0          BUFGCE (Prop_BUFCE_BUFGCE_I_O)
                                                      0.091     7.814 r  pll_i/inst/clkout1_buf/O
    X2Y0 (CLOCK_ROOT)    net (fo=6, routed)           1.184     8.998    result_reg/CLK
    DSP48E2_X8Y18        DSP_OUTPUT                                   r  result_reg/DSP_OUTPUT_INST/CLK
                         clock pessimism              0.012     9.010
                         clock uncertainty           -0.148     8.862
    DSP48E2_X8Y18        DSP_OUTPUT (Setup_DSP_OUTPUT_DSP48E2_CLK_ALU_OUT[10])
                                                     -0.104     8.758    result_reg/DSP_OUTPUT_INST
  -------------------------------------------------------------------
                         required time                          8.758
                         arrival time                          -10.214
  -------------------------------------------------------------------
    slack                                                       -1.456
This report shows that the tools failed to achieve the timing constraints. The path that is shown starts from @clk's rising edge at 4 ns, and ends at @pll_clk's rising edge at 8 ns. The problem is the time that it takes for the clock at the input pin to reach the first flip-flop's clock input pin: 3.2 ns. The time of arrival of this clock edge is hence 7.2 ns.
But @pll_clk is generated by the PLL, so this clock is aligned with @clk as it appears at the input pin. The delay is therefore only 1.0 ns. @pll_clk's time of arrival at the second flip-flop is hence 9.0 ns. So the time that is left for the data path is 9.0 - 7.2 = 1.8 ns (approximately, because of clock uncertainty etc.). This is not enough for an arithmetic multiplication, even when the dedicated arithmetic unit is used. Hence the timing requirements could not be met.
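It's also possible to reconstruct the slack from the report's exact numbers: The data arrives at the second flip-flop at 7.202 + 3.012 = 10.214 ns. The required time is 8.998 + 0.012 (clock pessimism removal) - 0.148 (clock uncertainty) - 0.104 (the setup requirement of the DSP's output register) = 8.758 ns. The slack is hence 8.758 - 10.214 = -1.456 ns, as stated at the top of the report.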
So in this example the clock arrives late to the first flip-flop because of the clock skew. This results in a failure to meet the tsetup requirement.
Note that this can be solved by manipulating the alignment of @pll_clk. For example, the PLL's reference clock can be taken from the output of the global clock buffer that distributes @clk. It's also possible to define a phase shift for the PLL in order to achieve a better alignment. These are however solutions that should be used only as a last resort.
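For completeness, here's a minimal sketch of the preferred solution that was suggested above: generating the 250 MHz clock with the PLL as well. This sketch assumes that the clk_wiz_0 IP has been regenerated with a second output at 250 MHz; the port name (clk_out2) and the signal name (@fast_clk) are made up for this illustration. All logic that previously used @clk directly is moved to @fast_clk, so both clocks of the clock domain crossing are outputs of the same PLL:

wire pll_clk, fast_clk;

reg [24:0] result;
reg [11:0] x, y, x1, y1;

clk_wiz_0 pll_i
  (.clk_in1(clk),        // 250 MHz reference clock, as before
   .clk_out1(pll_clk),   // 125 MHz
   .clk_out2(fast_clk)); // 250 MHz, aligned with @pll_clk

always @(posedge fast_clk) // Previously @(posedge clk)
  begin
    x1 <= x;
    y1 <= y;
  end

always @(posedge pll_clk)
  result <= x1 * y1;

This way, the large clock skew that was shown in the timing report is avoided, and almost the entire 4 ns requirement remains available for the data path.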
Summary
Once again, these were just a few ideas that might help solve a timing closure problem. Unfortunately, solving a problem of this sort may require much more than that. In fact, there isn't a single topic in the field of FPGAs that is not somehow related to timing closure.
As already mentioned, the best strategy is to write the logic design with care from the beginning. The best way to tackle timing closure is to avoid it.
Until this point, this series of pages has discussed timing, but didn't say much about timing constraints. This is about to change: Starting from the next page, the discussion becomes more technical.