FPGA Implementation Strategy for AONM                       19-FEB-1998 

The logic design for the FPGA was captured in a straightforward fashion,
drawing the logic as described in the documentation.  The scalers, 
monitoring readout, and high-speed readout were omitted from this design.
The necessary pinout constraints were placed in the .ucf file, and the
design was run through the back-end tools just to verify correct 
schematic entry and pinout constraints.  No attempt was made to have
the design meet any timing specifications.  

The next step was to provide some timing specifications for the triggering
logic path for this FPGA.  Recall that the triggering logic path is
the IOBs, RAMs, and AND gates between the MSA_Input_PAD(127:0) bus and
the MSA_Output_PAD(3:0) bus.  This includes the input and output IOB
delay.  The following constraints were used:

    TIMEGRP  Input_Pads           = PADS ( MSA_Input_PAD* ) ;
    TIMEGRP  Output_Pads          = PADS ( MSA_Output_PAD* ) ;
    TIMESPEC TS_MSA_In_to_MSA_Out = FROM Input_Pads TO Output_Pads 53.6 ns ;

which specify that every path covered by this constraint (128*4 = 512
paths in all) is allowed 3*18.8 ns less 5%, i.e. 53.6 ns, to operate.
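
The arithmetic behind the 53.6 ns figure can be checked with a short
sketch.  The variable names are illustrative only, and reading the
128*4 path count as 128 inputs times 4 outputs is an assumption; the
numbers themselves are the ones quoted above.

```python
# Timing budget for the MSA input-to-output path, using the figures in the text.
INTERVAL_NS = 18.8        # the 18.8 ns interval quoted in the text
INTERVALS_ALLOWED = 3     # the path is allowed three such intervals
MARGIN = 0.05             # 5% is taken off the raw allowance

raw_budget_ns = INTERVALS_ALLOWED * INTERVAL_NS
budget_ns = raw_budget_ns * (1.0 - MARGIN)

# Assumption: the 128*4 in the text means 128 inputs times 4 outputs.
N_INPUTS = 128
N_OUTPUTS = 4
n_paths = N_INPUTS * N_OUTPUTS

print(f"raw allowance : {raw_budget_ns:.1f} ns")
print(f"with margin   : {budget_ns:.2f} ns (the TIMESPEC uses 53.6 ns)")
print(f"paths covered : {n_paths}")
```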

Additionally, the FAST constraint was put on the MSA_Output_PAD(3:0) bus,
to use high-slew-rate drivers in an attempt to speed up this path.  That
constraint is as follows:

    NET       MSA_Output_PAD*     FAST ;

With only these constraints in place, the design was again run through
the back-end tools.  Of the 512 constrained paths, 104 failed to meet
timing.  The minimum slack on these paths was -8.881 ns.  The maximum
slack was 13.080 ns.  If all routing delays could be reduced to 0, the
slack on all channels would have been around 29 ns.  Thus the logic
alone was consuming about 24 ns of the 53.6 ns allowed, or slightly less
than half of the available path delay.  This was taken as a good sign,
since a 50/50 logic/routing mix is generally considered attainable.
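
The split between logic and routing delay implied by these numbers can
be sketched as follows.  This is a rough estimate only: it assumes the
logic delay is the same on every path and uses the rounded 29 ns figure,
so it gives 24.6 ns where the text says "about 24 ns".  The function
name is illustrative.

```python
# Rough logic/routing split implied by the reported slack figures.
BUDGET_NS = 53.6               # TIMESPEC value
ZERO_ROUTING_SLACK_NS = 29.0   # "around 29 ns" with all routing delay removed
WORST_SLACK_NS = -8.881        # minimum slack reported
BEST_SLACK_NS = 13.080         # maximum slack reported

# With no routing delay the path delay is pure logic delay.
logic_delay_ns = BUDGET_NS - ZERO_ROUTING_SLACK_NS
logic_fraction = logic_delay_ns / BUDGET_NS


def routing_delay_ns(slack_ns):
    """Estimate the routing delay on a path from its reported slack.

    Total path delay is the budget minus the slack; subtracting the
    (assumed uniform) logic delay leaves the routing contribution.
    """
    return (BUDGET_NS - slack_ns) - logic_delay_ns


print(f"logic delay        : {logic_delay_ns:.1f} ns "
      f"({100 * logic_fraction:.0f}% of budget)")
print(f"routing, worst path: {routing_delay_ns(WORST_SLACK_NS):.3f} ns")
print(f"routing, best path : {routing_delay_ns(BEST_SLACK_NS):.3f} ns")
```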

A careful analysis of the design was then performed, looking for ways
to reduce the routing delay.  Two things were noted:

    A. The initial 4X1 RAMs were generally NOT located near the IOBs
       they serviced, contributing to a long delay on the IOB-to-RAM
       nets.

        - this was most likely due to the need for these RAMs to have
          access to the Read_Data and Write_Data buses, which the 
          Xilinx tools typically route on horizontal longlines spread
          across the middle of the chip

    B. There were long delays between the RAMs and the 2nd-tier 8-input
       AND gates, and between the 2nd-tier 8-input AND gates and the
       3rd-tier 4-input AND gates.

        - some delay on these lines is inevitable, due to the distributed
          nature of the RAMs on chip, but these nets were NOT generally
          routed on (relatively fast) longlines, contributing to 
          excess delay on these routes

An attempt was made to modify the design, both by adding placement
constraints and by rearranging the structure of the 2 tiers of AND gates.
A summary of those modifications and the results obtained follows.

AONM FPGA Modifications and Results

1.  Specify Boxes for RAM CLBs

	This was accomplished using the LOC constraint in the user
	constraint file.  The constraints took the form

	INST "ao_lup/aoit_31_0/aoit_15_0/lup3_0" LOC = CLB_R15C1:CLB_R16C4 ;

	Note that the instance name is the one specified on the
	schematic rather than the name given to the component in EPIC.
	Additionally, the name must not contain a colon.  In total
	there were 32 such constraints, one for each 16x4 DPRAM (recall
	that each 16x4 DPRAM is implemented in 4 separate CLBs).
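
	With 32 constraints of the same form, the lines could be
	generated rather than typed by hand.  The sketch below is not
	part of the original flow; the helper name and its arguments
	are hypothetical.  It reproduces the line format shown above
	and rejects instance names containing a colon.

```python
def loc_box_constraint(instance, row0, col0, row1, col1):
    """Return one UCF line placing `instance` inside a rectangle of CLBs.

    The rectangle runs from CLB_R<row0>C<col0> to CLB_R<row1>C<col1>,
    written in the same form as the example constraint in the text.
    """
    if ":" in instance:
        # The text notes that the instance name must not contain a colon.
        raise ValueError("instance name must not contain a colon")
    return (f'INST "{instance}" LOC = '
            f'CLB_R{row0}C{col0}:CLB_R{row1}C{col1} ;')


# Reproduce the example constraint quoted in the text.
line = loc_box_constraint("ao_lup/aoit_31_0/aoit_15_0/lup3_0", 15, 1, 16, 4)
print(line)
```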

	Initially the box size was set to 4 x 2 (2 x 4) or 4 x 3 (3 x
	4), depending on whether the four and-or input terms spanned
	two or three columns (rows).  In the corners of the FPGA the
	box size was increased to 4 x 4.  In all cases the boxes were
	located on the edges of the FPGA.

	This approach noticeably improved the layout.  The number of
	paths failing to meet timing decreased to 16 and the minimum
	slack on the constrained paths increased to -2.654 ns.

	Later an attempt was made to reduce the box sizes for the two
	sets of higher order input terms (64 to 95 and 96 to 127).
	This did not significantly alter the layout or the timing
	characteristics, and in fact resulted in some errors in place
	and route when areas of the chip became too congested.

2.  RAM -> AND4 -> AND8

	The schematic was modified so that the outputs from the RAMs
	were sent to 4-input AND gates rather than 8-input AND gates.
	This also required that the third tier of logic be switched
	from 4-input to 8-input AND gates.  The property HBLKNM was
	used to confine each 8-input AND gate to a single CLB.  Note
	that it is not sufficient to attach this property to the three
	gates comprising the 8-input AND; each gate must have an FMAP
	or HMAP associated with it, and the HBLKNM property must be
	attached to the FMAP or HMAP.
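
	As an illustration of why three gates fit in one CLB, the
	sketch below models the 8-input AND as two 4-input ANDs
	feeding a 2-input AND and checks it against a direct 8-input
	AND for all 256 input patterns.  The assumption here is that
	the three gates are arranged this way, with the two 4-input
	gates carrying FMAPs and the combining gate an HMAP; the
	function names are illustrative.

```python
from itertools import product


def and4(a, b, c, d):
    """4-input AND, the size of gate an FMAP would cover."""
    return a and b and c and d


def and8_direct(bits):
    """Reference 8-input AND."""
    return all(bits)


def and8_clb(bits):
    """8-input AND built from three gates, as assumed for one CLB."""
    f = and4(*bits[0:4])   # first 4-input gate (would carry an FMAP)
    g = and4(*bits[4:8])   # second 4-input gate (would carry an FMAP)
    return f and g         # combining gate (would carry an HMAP)


# Exhaustively compare the two forms over all 2**8 input patterns.
mismatches = sum(
    1
    for bits in product((False, True), repeat=8)
    if and8_direct(bits) != and8_clb(bits)
)
print(f"mismatches: {mismatches}")
```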

	This modification also resulted in a significant improvement
	in the timing characteristics of the layout.  The number of
	paths failing to meet the timing requirement dropped to 1, and
	the minimum slack on the constrained paths increased to
	-0.429 ns.

3.  Prioritize Nets

	Place and route assigns longlines based on priority and routes
	higher-priority nets and buses first.  The default priority
	for all nets and buses is 3.  The priority of a net or bus can
	be changed either by adding a PRI property on the schematic or
	by adding a constraint of the form

	NET "ao_lup/t2_63:48_paof(0)" PRIORITIZE = 94 ;

	in the physical constraint file.  These two methods are
	completely interchangeable.  However, adding constraints
	directly to the physical constraint file is somewhat awkward
	when using the Xilinx design manager; in practice it is much
	simpler to apply the constraints directly to the schematic.
	There does not appear to be any means of adding a priority
	constraint to the user constraint file.

	The nets from the AND4 gates to the AND8 gates were assigned a
	new priority of 94 while the nets from the AND8 gates to the
	output pads were assigned a somewhat lower priority of 54.
	The values 94 and 54 were chosen completely arbitrarily.
	Prioritizing the nets resulted in a moderate improvement in
	the timing characteristics of the layout.  When combined with
	the two previous modifications, it produced a layout that
	satisfied the timing requirements after only 5 iterations of
	place and route.
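
	If the constraint-file route is used despite its awkwardness,
	the lines follow a fixed pattern and could be generated.  The
	helper below is a hypothetical sketch, not part of the
	original flow; it only reproduces the line format shown
	above and does not check the priority value against any
	range accepted by the tools.

```python
def prioritize_constraint(net, priority):
    """Return one physical-constraint-file line setting a net's priority."""
    return f'NET "{net}" PRIORITIZE = {priority} ;'


# Priorities used in the text; the values themselves were chosen arbitrarily.
AND4_TO_AND8_PRIORITY = 94
AND8_TO_PAD_PRIORITY = 54

# Reproduce the example constraint quoted in the text.
example = prioritize_constraint("ao_lup/t2_63:48_paof(0)",
                                AND4_TO_AND8_PRIORITY)
print(example)
```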

4.  Connect the Read and Write Buses

	Several attempts were made to free up routing resources
	by connecting the 16-bit read and write buses to form a single
	bidirectional bus.  In practice this made very little
	difference to the layout because the two buses could only be
	joined by the tristate buffers near the horizontal longlines.
	Consequently, even in the bidirectional bus design the read
	and write buses were routed separately between the CLBs and
	the horizontal longlines.  In principle fewer horizontal
	longlines were used when the bus was bidirectional, but the
	timing characteristics of the layout did not seem to depend on
	whether or not the buses were joined.  Presumably this is
	because there were sufficient routing resources to accommodate
	the separate buses without interfering with the rest of the
	layout.

5.  Constrain the Location of the AND4 Gates 

	The current software does not allow the use of the LOC
	constraint with simple gates.  It should be possible to add
	an FMAP or HMAP to the AND gates and then constrain the location
	of the FMAP or HMAP.