FPGA Implementation Strategy for AONM

19-FEB-1998

The logic design for the FPGA was captured in a straightforward fashion, drawing the logic as described in the documentation. The scalers, monitoring readout, and high-speed readout were omitted from this design. The necessary pinout constraints were placed in the .ucf file, and the design was run through the back-end tools just to verify correct schematic entry and pinout constraints. No attempt was made to have the design meet any timing specifications.

The next step was to provide some timing specifications for the triggering logic path for this FPGA. Recall that the triggering logic path is the IOBs, RAMs, and AND gates between the MSA_Input_PAD(127:0) bus and the MSA_Output_PAD(3:0) bus. This includes the input and output IOB delay. The following constraints were used:

    TIMEGRP Input_Pads = PADS ( MSA_Input_PAD* ) ;
    TIMEGRP Output_Pads = PADS ( MSA_Output_PAD* ) ;
    TIMESPEC TS_MSA_In_to_MSA_Out = FROM Input_Pads TO Output_Pads 53.6 ns ;

These specify that all paths covered by this constraint (there are 128*4 = 512 of them) are allowed 3*18.8 ns minus 5%, i.e. 53.6 ns, to operate.

Additionally, the FAST constraint was put on the MSA_Output_PAD(3:0) bus, to use high-slew-rate drivers in an attempt to speed up this path. That constraint is as follows:

    NET MSA_Output_PAD* FAST ;

With only these constraints in place, the design was again run through the back-end tools. Of the 512 constrained paths, 104 failed to meet timing. The minimum slack on these paths was -8.881 ns. The maximum slack was 13.080 ns. If all routing delays could be reduced to 0, the slack on all channels would have been around 29 ns. Thus the logic was consuming about 24 ns (the 53.6 ns budget less roughly 29 ns), or slightly less than half of the available path delay. This was considered a good sign, as a 50/50 logic/routing mix is generally considered to be attainable.

A careful analysis of the design was then performed, looking for ways to reduce the routing delay. Two things were noted:

A.
   The initial 4X1 RAMs were generally NOT located near the IOBs they serviced, contributing to a long delay on the IOB-to-RAM nets.
   - This was most likely due to the need for these RAMs to have access to the Read_Data and Write_Data buses, which the Xilinx tools typically route on horizontal longlines spread across the middle of the chip.

B. Long delays between the RAMs and the 2nd tier 8-input AND gates, and between the 2nd tier 8-input AND gates and the 3rd tier 4-input AND gates.
   - Some delay on these lines is inevitable, due to the distributed nature of the RAMs on chip, but these nets were NOT generally routed on (relatively fast) longlines, contributing to excess delay on these routes.

An attempt was made to modify the design, via both placement constraints and re-arranging the structure of the 2 tiers of AND gates. Following is a summary of those modifications and the results obtained.

AONM FPGA Modifications and Results

1. Specify Boxes for RAM CLBs

This was accomplished using the LOC constraint in the user constraint file. The constraints took the form

    INST "ao_lup/aoit_31_0/aoit_15_0/lup3_0" LOC = CLB_R15C1:CLB_R16C4 ;

Note that the instance name is the one specified on the schematic rather than the name given to the component in EPIC. Additionally, the name must not contain a colon. In total there were 32 such constraints, one for each 16x4 DPRAM (recall that each 16x4 DPRAM is implemented in 4 separate CLBs).

Initially the box size was set to 4 x 2 (2 x 4) or 4 x 3 (3 x 4), depending on whether the four and-or input terms spanned two or three columns (rows). In the corners of the FPGA the box size was increased to 4 x 4. In all cases the boxes were located on the edges of the FPGA.

This approach noticeably improved the layout. The number of paths failing to meet timing decreased to 16 and the minimum slack on the constrained paths increased to -2.654 ns.
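With 32 box constraints to write, it may be convenient to generate the lines rather than type them. The following is a minimal Python sketch under stated assumptions: the helper names loc_box_constraint and box_size are hypothetical and are not part of any Xilinx tool. The script only reproduces the constraint form shown above and enforces the no-colon rule for instance names.

```python
import re

# Matches a single CLB location such as CLB_R15C1 (row 15, column 1).
CLB_RE = re.compile(r"^CLB_R(\d+)C(\d+)$")


def loc_box_constraint(instance, corner_a, corner_b):
    """Return a UCF line confining `instance` to the CLB box corner_a:corner_b.

    Hypothetical helper: it only formats text in the style used in this note.
    """
    if ":" in instance:
        # The note states that the instance name must not contain a colon.
        raise ValueError("instance name must not contain a colon")
    for corner in (corner_a, corner_b):
        if not CLB_RE.match(corner):
            raise ValueError(f"not a CLB location: {corner}")
    return f'INST "{instance}" LOC = {corner_a}:{corner_b} ;'


def box_size(corner_a, corner_b):
    """Return (rows, columns) spanned by the box, counting both corners."""
    r1, c1 = map(int, CLB_RE.match(corner_a).groups())
    r2, c2 = map(int, CLB_RE.match(corner_b).groups())
    return (abs(r2 - r1) + 1, abs(c2 - c1) + 1)


# The one constraint quoted in this note, regenerated from its parts.
line = loc_box_constraint(
    "ao_lup/aoit_31_0/aoit_15_0/lup3_0", "CLB_R15C1", "CLB_R16C4"
)
print(line)
print(box_size("CLB_R15C1", "CLB_R16C4"))
```

For the quoted constraint the box spans 2 rows and 4 columns, which is one of the 2 x 4 boxes described above. The remaining instance names and box corners would have to be taken from the schematic and the chosen floorplan.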
Later an attempt was made to reduce the box sizes for the two sets of higher order input terms (64 to 95 and 96 to 127). This did not significantly alter the layout or the timing characteristics, and in fact resulted in some errors in place and route when areas of the chip became too congested.

2. RAM -> AND4 -> AND8

The schematic was modified so that the outputs from the RAMs were sent to 4-input AND gates rather than 8-input AND gates. This also required that the third tier of logic be switched from 4-input to 8-input AND gates. The property HBLKNM was used to confine the 8-input AND gates in single CLBs. Note that it is not sufficient to attach this property to the three gates comprising the 8-input AND; each gate must have an FMAP or HMAP associated with it, and the HBLKNM property must be attached to the FMAP or HMAP.

This modification also resulted in a significant improvement in the timing characteristics of the layout. The number of paths failing to meet the timing requirement dropped to 1, and the minimum slack on the constrained paths increased to -0.429 ns.

3. Prioritize Nets

Place and route assigns longlines based on priority and routes higher-priority nets and buses first. The default priority for all nets and buses is 3. The priority of a net or bus can be changed either by adding a PRI property on the schematic or by adding a constraint of the form

    NET "ao_lup/t2_63:48_paof(0)" PRIORITIZE = 94 ;

in the physical constraint file. These two methods are completely interchangeable. However, adding constraints directly to the physical constraint file is somewhat awkward when using the Xilinx design manager; in practice it is much simpler to apply the constraints directly to the schematic. There does not appear to be any means of adding a priority constraint to the user constraint file.
The nets from the AND4 gates to the AND8 gates were assigned a new priority of 94, while the nets from the AND8 gates to the output pads were assigned a somewhat lower priority of 54. The values 94 and 54 were chosen completely arbitrarily.

Prioritizing the nets resulted in a moderate improvement in the timing characteristics of the layout. When combined with the two previous modifications, it produced a layout that satisfied the timing requirements after only 5 iterations of place and route.

4. Connect the Read and Write Buses

Several times an attempt was made to free up routing resources by connecting the 16-bit read and write buses to form a single bidirectional bus. In practice this made very little difference to the layout because the two buses could only be joined by the tristate buffers near the horizontal longlines. Consequently, even in the bidirectional bus design the read and write buses were routed separately between the CLBs and the horizontal longlines. In principle fewer horizontal longlines were used when the bus was bidirectional, but the timing characteristics of the layout did not seem to depend on whether or not the buses were joined. Presumably this is because there were sufficient routing resources to accommodate the separate buses without interfering with the rest of the layout.

5. Constrain the Location of the AND4 Gates

The current software does not allow the use of the LOC constraint with simple gates. It should be possible to add an FMAP or HMAP to the AND gates and then constrain the location of the FMAP or HMAP.
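As a cross-check of the figures quoted in this note, the timing budget and the per-run worst-case path delays can be recomputed from the reported numbers. This is only a sketch: the variable names are hypothetical, every value is copied from the text above, and the worst-case delay is taken to be the constraint value minus the reported minimum slack.

```python
# All numbers below are copied from this note; nothing is measured here.
INTERVAL_NS = 18.8        # the 18.8 ns unit quoted in the budget (3 * 18.8 ns)
MARGIN = 0.05             # the "minus 5%" allowance

budget_ns = 3 * INTERVAL_NS * (1 - MARGIN)   # constraint value before rounding
constraint_ns = round(budget_ns, 1)          # value used in the TIMESPEC
n_paths = 128 * 4                            # 128 input pads, 4 output pads

# Slack with all routing delay removed was quoted as "around 29 ns".
zero_routing_slack_ns = 29.0
logic_delay_ns = constraint_ns - zero_routing_slack_ns
logic_fraction = logic_delay_ns / constraint_ns

# (description, failing paths, minimum slack in ns) for each reported run.
runs = [
    ("timing constraints only", 104, -8.881),
    ("1. RAM CLB boxes", 16, -2.654),
    ("2. RAM -> AND4 -> AND8", 1, -0.429),
]

# Worst-case path delay implied by each minimum slack.
rows = [(name, fails, slack, constraint_ns - slack) for name, fails, slack in runs]

print(f"budget {budget_ns:.2f} ns, constraint {constraint_ns} ns, {n_paths} paths")
print(f"logic delay about {logic_delay_ns:.1f} ns ({logic_fraction:.0%} of budget)")
for name, fails, slack, worst in rows:
    print(f"{name:26s} fails={fails:3d} min slack={slack:7.3f} worst path={worst:.3f} ns")
```

No slack figure is listed for the run with prioritized nets because the note reports only that it met timing after 5 iterations of place and route.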