Trics and Bit3 IO Timing Estimates and Measurements                 20-Jun-2003

I) Introduction
===============

This file collects timing measurements for simple and compound IOs to the
L1FW and L1CT. TCC is connected to the L1FW and L1CT by a combination of a
Bit3 PCI-VME bus interface, Vertical Interconnects, and an Ironics IO card.
cf. vme_communication_crate.txt in
http://www.pa.msu.edu/hep/d0/ftp/tcc/vme_access/

Estimates of compound IOs below use canonical rounded numbers of 2 us per
VME IO within the Communication Crate (e.g. to the Ironics) and 5 us per
VME IO over a Vertical Interconnect (e.g. to the L1/L2 FW).

Most measurements were made on MSUL1A using logfile timestamps and a kludged
version of Trics in which the "Register IO" and "CBus IO" dialogs repeat the
same IO one million times (1M = 10**6). The VME IO dialog already has a
repeat count available by default. The L1CT diagnostics pedestal analysis
tool was used to measure Histogram collection times. The exception is FMLN
programming, whose programming time is intrinsically long enough to be
measured directly in DZero logfiles.

II) Elementary IO
=================

II.1. Memory Mapped IO method
-----------------------------

II.1.A. Base Estimate

   Within Communication Crate
      about 2 us per VME Read to Ironics
      about 2 us per VME Write to Ironics

   Over Vertical Interconnect
      about 5 us per VME Read via V.I. to L1FW register
      about 5 us per VME Write via V.I. to L1FW register

II.1.B. Measurement

   MMAP read to Ironics
      1Mx MMAP read to Ironics
         18:12:23.156
         18:12:25.312    2.16
         18:12:27.421    2.11 s

         18:12:52.062
         18:12:54.171    2.23
         18:12:56.296    2.12
         18:13:09.453    2.16 s

      -> 1x MMAP read to Ironics ~ 2.2 us

   MMAP write to Ironics
      1Mx MMAP write to Ironics
         18:12:27.421
         18:12:29.281    1.86

         18:12:46.515
         18:12:48.390    1.88
         18:12:50.203    1.81
         18:12:52.062    1.86 s

      -> 1x MMAP write to Ironics ~ 1.9 us

   MMAP read to L1FW
      Needs re-measuring

   MMAP write to L1FW
      Needs re-measuring

II.2. Bit3 API IO method
------------------------

II.2.A. Base Estimate

   about 50 us per VME Read to anywhere
   about 50 us per VME Write to anywhere

II.2.B. Measurement

   API read to Ironics
      1Mx API read to Ironics
         18:07:19.687
         18:08:10.359   50.67
         18:08:59.203   48.84
         18:09:50.734   51.53
         18:10:42.500   51.77 s

      -> 1x API read to Ironics ~ 52 us

   API write to Ironics
      1Mx API write to Ironics
         18:10:42.500
         18:11:32.703   50.20
         18:12:23.156   50.45 s

      -> 1x API write to Ironics ~ 51 us

   API read to L1FW
      Needs re-measuring

   API write to L1FW
      Needs re-measuring

II.3. IO to non-existing address
--------------------------------

II.3.A. Estimate

   This was not the focus of the study, but is captured here along with the
   other measurements.

II.3.B. Measurement

   MMAP VME Read to non-existing register
      1Mx L1FW Register Read while rack off
         10:32:20.000
         10:33:31.234   71.2 s
         10:33:37.078   65.8 s

      -> 1x L1FW Register Read without DTACK ~ 70 us

   MMAP VME Write to non-existing register
      Not measured. This would require more temporary hacking to skip the
      post-write check. The time is presumably the same. The Bit3 driver may
      have been set to log errors, which is something the user can turn
      on/off, I believe.

   API VME Read/Write to non-existing register
      Not measured.

III) L1FW IO
============

III.1. L1FW Register Read
-------------------------

III.1.A. Estimate

   Read Data

   Total 1x IO via V.I. ~ 5 us

III.1.B. Measurement

III.2. L1FW Register Write
--------------------------

III.2.A. Estimate

   Read Data  (Pre-Write check)
   Write Data
   Read Data  (Post-Write check)

   Total 3x IO via V.I. ~ 15 us

III.2.B. Measurement
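Note: as a summary of the elementary accesses of section II and the checked
register write of section III, here is a minimal sketch in C. It assumes the
Bit3 window has already been memory mapped by the driver; the window pointer,
the 16-bit register width, the offsets, and the helper names are illustrative
placeholders, not actual Trics or Bit3 API identifiers.

   /* Minimal sketch of memory-mapped VME IO and a checked L1FW register write.
    * Assumes the Bit3 PCI-VME window has already been mapped into the process
    * by the Bit3 driver (not shown); vme_window and the helper names are
    * hypothetical placeholders, not actual Trics/Bit3 identifiers. */

   #include <stdint.h>
   #include <stdio.h>

   static volatile uint16_t *vme_window;   /* base of the mapped VME window,
                                              assumed set up elsewhere        */

   /* One memory-mapped read or write is a single VME cycle:
    * ~2 us to the Ironics, ~5 us over a Vertical Interconnect. */
   static uint16_t mmap_read16(uint32_t offset)
   {
       return vme_window[offset / 2];
   }

   static void mmap_write16(uint32_t offset, uint16_t value)
   {
       vme_window[offset / 2] = value;
   }

   /* L1FW register write with pre- and post-write checks:
    * Read Data (Pre-Write check), Write Data, Read Data (Post-Write check),
    * i.e. 3 VME cycles over the V.I. ~ 15 us total. */
   static int l1fw_write_checked(uint32_t offset, uint16_t value)
   {
       (void) mmap_read16(offset);          /* pre-write check: read before writing  */
       mmap_write16(offset, value);
       if (mmap_read16(offset) != value) {  /* post-write check: read back the value */
           fprintf(stderr, "L1FW register 0x%06x readback mismatch\n",
                   (unsigned) offset);
           return -1;
       }
       return 0;
   }

The same pattern applies to registers in the Communication Crate, with ~2 us
per cycle to the Ironics instead of ~5 us over the Vertical Interconnect.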
IV) CBUS IO
===========

IV.1. CBus IO Read
------------------

IV.1.A. Estimate

   Write FA  (Function Address)
   Write CA  (Card Address and Direction=Read)
   Write MBA (Mother Board Address last, to minimize transient selections)
   Read Data
   Write MBA (=0 to deselect the MBA)

   Total 5x IOs to Ironics ~ 10 us

IV.1.B. Measurement

   CBus Read
      1Mx CBus Read
         17:58:24.687
         17:58:34.437    9.75

         17:58:50.562
         17:59:00.296    9.73
         17:59:09.812    9.52
         17:59:19.234    9.42 s

      -> 1x CBus Read ~ 10 us

IV.2. CBus IO Write
-------------------

IV.2.A. Estimate

   Write FA  (Function Address)
   Write CA  (Card Address and Direction=Read)
   Write MBA (Mother Board Address last, to minimize transient selections)
   Read Data (Pre-Write check)
   Write Data
   Write Dir  (=Write)
   Write Strb (=High)
   Write Strb (=Low)
   Write Dir  (=Read)
   Read Data (Post-Write check)
   Write MBA (=0 to deselect the MBA)

   Total 11x IOs to Ironics ~ 22 us

IV.2.B. Measurement

   CBus Write
      1Mx CBus Write
         17:59:41.890
         18:00:02.828   20.93
         18:00:23.609   20.78
         18:00:45.203   21.60 s

      -> 1x CBus Write ~ 21 us

IV.3. Serial DAC Command
------------------------

IV.3.A. Estimate

   Read CBus Reg      =  5 IOs
   Write CBus Reg     = 11 IOs
   Write Stream Setup =  4 IOs
   Write Stream Item  =  3 IOs, repeated (2*12 + 4*16) Serial Bits
                         * 2 for clock up/down          = 528 IOs
   Write Stream Done  =  2 IOs
   Write CBus Reg     = 11 IOs

   Total = 561 IOs * 2 us = 1.12 ms

IV.3.B. Measurement

   Serial DAC
      100k Serial DAC Commands
         10:22:17.953
         10:24:02.187   104.2 s
         10:25:48.156   106.0
         10:27:34.515   106.0 s

      -> 1x Serial DAC Command ~ 1 ms

IV.4. CBus IO Histogram of 1k points
------------------------------------

IV.4.A. Estimate

   With NO control of Read/Write A/B
      Option #0 = set address and 1k reads from Ironics Port
         ~ 2 ms per Histogram

   While Controlling Read/Write A/B
      Method (the per-option arithmetic is sketched after this subsection):
         Pulse Write A/B to load Pipe
            requires 4 + 2*3 + 2 = 12 Ironics IOs
         Read a number of slices; repeat until N samples collected
            requires 2 + N*( 1 + 1 ) + 1 = 2N+3 Ironics IOs

      Option #1 = Keeping every Slice
         -> 7 Slices read per Pipeful -> 143 Pipe Loads
         143*( 12 + 2*7+3 )  =  4,147 Ironics IOs  ~  8.3 ms per Histogram

      Option #2 = Keeping every 2nd Slice
         -> 4 Slices read per Pipeful -> 250 Pipe Loads
         250*( 12 + 2*4+3 )  =  5,750 Ironics IOs  ~ 11.5 ms per Histogram

      Option #3 = Keeping every 3rd Slice
         -> 3 Slices read per Pipeful -> 334 Pipe Loads
         334*( 12 + 2*3+3 )  =  7,014 Ironics IOs  ~ 14.0 ms per Histogram

      Option #4 = Keeping every 4th Slice
         -> 2 Slices read per Pipeful -> 500 Pipe Loads
         500*( 12 + 2*2+3 )  =  9,500 Ironics IOs  ~ 19.0 ms per Histogram

      Option #5 = Keeping every 5th Slice
         -> 2 Slices read per Pipeful -> 500 Pipe Loads
         500*( 12 + 2*2+3 )  =  9,500 Ironics IOs  ~ 19.0 ms per Histogram

      Option #6 = Keeping every 6th Slice
         -> 2 Slices read per Pipeful -> 500 Pipe Loads
         500*( 12 + 2*2+3 )  =  9,500 Ironics IOs  ~ 19.0 ms per Histogram

      Option #7 = Keeping every 7th Slice
         -> 1 Slice read per Pipeful -> 1,000 Pipe Loads
         1000*( 12 + 2*1+3 ) = 17,000 Ironics IOs  ~ 34.0 ms per Histogram
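The per-option numbers above all follow a single formula: a 1k-point Histogram
needs ceil(1000/N) Pipe loads when keeping N slices per pipeful, and each load
costs 12 + (2N+3) Ironics IOs. The sketch below reproduces the table under the
canonical 2 us per Ironics IO; it is an arithmetic check only, not Trics code.

   /* Arithmetic check of the per-option Ironics IO counts in IV.4.A.
    * slices_kept per pipeful (7,4,3,2,2,2,1 for Options #1..#7) restates
    * the text above; 2 us per Ironics IO is the canonical estimate. */

   #include <stdio.h>

   int main(void)
   {
       const int    samples       = 1000;                    /* 1k-point histogram       */
       const int    load_cost     = 12;                      /* 4 + 2*3 + 2 IOs per load */
       const int    slices_kept[] = { 7, 4, 3, 2, 2, 2, 1 }; /* Options #1..#7           */
       const double us_per_io     = 2.0;                     /* Ironics IO estimate      */

       for (int opt = 1; opt <= 7; opt++) {
           int kept  = slices_kept[opt - 1];
           int loads = (samples + kept - 1) / kept;          /* ceil(1000/kept)          */
           int ios   = loads * (load_cost + 2 * kept + 3);   /* 12 + (2N+3) per load     */
           printf("Option #%d: %4d loads, %6d IOs, ~%5.1f ms per Histogram\n",
                  opt, loads, ios, ios * us_per_io / 1000.0);
       }
       return 0;
   }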
IV.4.B. Measurement

   CBus Histogram
      1M point CBus Histogram -- Option #0 = No Pipe Control
         16:53:10.562
         16:53:12.875    2.31
         16:53:15.250    2.37
         16:53:17.609    2.36
         16:53:19.968    2.36
         16:53:22.328    2.36 s

      -> 1x 1k point CBus Histogram Option #0 ~ 2.3 ms

      1M point CBus Histogram -- Option #1 = Pipe Control and keep 7
         16:53:43.453
         16:53:51.750    8.30
         16:54:00.046    8.30
         16:54:08.656    8.61
         16:54:16.968    8.31
         16:54:25.343    8.38 s

      -> 1x 1k point CBus Histogram Option #1 ~ 8.3 ms

      1M point CBus Histogram -- Option #2 = Pipe Control and skip 1
         16:55:06.562
         16:55:17.718   11.16
         16:55:28.906   11.19
         16:55:40.078   11.17
         16:55:52.718   12.64 s

      -> 1x 1k point CBus Histogram Option #2 ~ 11.2 ms

      1M point CBus Histogram -- Option #3 = Pipe Control and skip 2
         16:55:55.500
         16:56:09.218   13.71
         16:56:22.687   13.47
         16:56:36.125   13.44
         16:56:49.546   13.42 s

      -> 1x 1k point CBus Histogram Option #3 ~ 13.5 ms

      1M point CBus Histogram -- Option #4 = Pipe Control and skip 3
         16:56:52.703
         16:57:13.828   21.12
         16:57:31.750   17.92
         16:57:49.671   17.92
         16:58:09.562   19.89 s

      -> 1x 1k point CBus Histogram Option #4 ~ 18 ms

      1M point CBus Histogram -- Option #5 = Pipe Control and skip 4
         16:58:12.843
         16:58:30.828   17.98
         16:58:48.687   17.86
         16:59:06.593   17.91
         16:59:27.593   21.00 s

      -> 1x 1k point CBus Histogram Option #5 ~ 18 ms

      1M point CBus Histogram -- Option #6 = Pipe Control and skip 5
         16:59:29.093
         16:59:47.234   18.14
         17:00:05.109   17.87
         17:00:24.921   19.81
         17:00:44.000   19.08 s

      -> 1x 1k point CBus Histogram Option #6 ~ 18 ms

      1M point CBus Histogram -- Option #7 = Pipe Control and keep only 1
         17:00:47.046
         17:01:18.625   31.58
         17:01:52.781   34.16
         17:02:24.437   31.66
         17:02:58.562   34.13 s

      -> 1x 1k point CBus Histogram Option #7 ~ 32 ms

IV.5. FMLN programming
----------------------

IV.5.A. Estimate

   Write FMLN
      Update Write Protect
         Write CBus Reg       =     11 IOs
      Reset Address Generator
         2x Write CBus Reg    =  2x 11 IOs
      Read Counter
         3x Read CBus Reg     =  3x  5 IOs
      Write memory in 4 sections, i.e. repeat 4 times:
         Write Stream Setup   =      4 IOs
         Write Stream Item    =      3 IOs (most often 2 IOs),
                                     repeated 128*512 times
         Write Stream Done    =      2 IOs
         Read Counter
            3x Read CBus Reg  =  3x  5 IOs
      Set Normal Mode = Reset Write Protect, OverflowOverrule, Mode
         3x Write CBus Reg    =  3x 11 IOs

      Total 11 + 22 + 15 + 4*( 4 + 128*( 510*2 + 2*3 ) + 2 + 15 ) + 33
            = 525,477 IOs
      Total ~ 1.05 s

   Verify FMLN
      Reset Address Generator
         2x Write CBus Reg    =  2x 11 IOs
      Read Counter
         3x Read CBus Reg     =  3x  5 IOs
      Update Overflow Overrule
         Write CBus Reg       =     11 IOs
      Read memory in 4 sections, i.e. repeat 4 times:
         Read Stream Setup    =      3 IOs
         Read Stream Item     =      3 IOs, repeated 128*512 times
         Read Stream Done     =      1 IO
         Read Counter
            3x Read CBus Reg  =  3x  5 IOs
      Set Normal Mode = Reset Write Protect, OverflowOverrule, Mode
         3x Write CBus Reg    =  3x 11 IOs

      Total 11 + 22 + 15 + 4*( 3 + 128*512*3 + 1 + 15 ) + 33
            = 786,589 IOs
      Total ~ 1.57 s
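For completeness, the two FMLN totals can be checked the same way. A minimal
sketch, again assuming the canonical 2 us per Ironics IO; the constants simply
restate the counts listed in the estimate above and are not taken from Trics.

   /* Arithmetic check of the FMLN Write and Verify IO totals in IV.5.A,
    * using the canonical 2 us per Ironics IO. */

   #include <stdio.h>

   int main(void)
   {
       /* Write FMLN:  11 + 2*11 + 3*5 + 4*( 4 + 128*(510*2 + 2*3) + 2 + 3*5 ) + 3*11 */
       long write_ios  = 11 + 22 + 15
                       + 4L * ( 4 + 128L * (510 * 2 + 2 * 3) + 2 + 15 )
                       + 33;

       /* Verify FMLN: 11 + 2*11 + 3*5 + 4*( 3 + 128*512*3 + 1 + 3*5 ) + 3*11 */
       long verify_ios = 11 + 22 + 15
                       + 4L * ( 3 + 128L * 512 * 3 + 1 + 15 )
                       + 33;

       printf("Write  FMLN: %ld IOs  ~ %.2f s\n", write_ios,  write_ios  * 2e-6);
       printf("Verify FMLN: %ld IOs  ~ %.2f s\n", verify_ios, verify_ios * 2e-6);
       return 0;
   }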
IV.5.B. Measurement

   Zero + Verify FMLN Comparator
   (From D0 logfile TRICS_II_20030610_V10_4_D.LOG;1)
      09:21:43.204
      09:21:44.987   1.78 s  Zero
      09:21:46.519   1.53 s  Verify

      09:23:09.388
      09:23:11.170   1.78
      09:23:13.073   1.90

      10:23:38.526
      10:23:40.529   2.00
      10:23:42.342   1.81

      14:54:46.228
      14:54:48.031   1.80
      14:54:49.823   1.79

      14:59:34.212
      14:59:36.065   1.85
      14:59:37.958   1.89

      -> FMLN Initialization takes 2 x 1.8 s = 3.6 s
         (note: the Write phase had a bug and should take less time by not
          rewriting the data that never changes.)

   Write + Verify FMLN Comparator
   (From D0 logfile TRICS_II_20030610_V10_4_D.LOG;1)
      11:11:16.856
      11:11:18.829   1.97 s  Write
      11:11:20.662   1.83 s  Verify

      11:11:39.048
      11:11:40.871   1.82
      11:11:42.674   1.80

      11:11:59.598
      11:12:01.421   1.82
      11:12:03.223   1.80

      11:12:15.751
      11:12:17.544   1.79
      11:12:19.527   1.98

      10:08:12.412
      10:08:14.204   1.79
      10:08:16.197   1.99

      10:08:26.532
      10:08:28.325   1.79
      10:08:30.127   1.80

      -> FMLN Programming takes 2 x 1.8 s = 3.6 s
         (note: the Write phase had a bug and should take less time by not
          rewriting the data when it hasn't changed.)

V) Misc
=======

V.1. Resource Locks (Semaphores)
--------------------------------

cf. TRICS II Version 9.1 Release Notes:
   "Lock acquiring+release time penalty at about 0.18 us per pair
    (assuming the lock is available)"

V.2. Find_DAC
-------------

V.2.A. Estimate

   Processing one Channel involves changing the Pedestal Control value
   31 times (Max, Min, Med, Low, 26*Near, Selected) with a Serial Command
   time of ~1 ms, for a total of ~31 ms.

   Changing the Pedestal Control includes an implicit 2 ms settling wait
   (for a desired total of 62 ms of waiting), but the NT scheduler's
   resolution is 10 ms (for an actual default total wait time of 310 ms).

   Processing one Channel also involves building 2x 10-point histograms
   (Max, Min) and 28x 1k-point histograms (Med, Low, 26*Near). The time per
   histogram is a function of the requested spacing between samples
   (cf. above), e.g. ~2.3 ms without Read/Write Pipe A/B control, or
   ~11.2 ms when skipping every other 396 ns sample per pipeful captured.

V.2.B. Wait Time Measurement

   Trics calls the sleep function to allow a 2 ms settling time after each
   Serial Command DAC programming. The measurements below use logfile time
   stamps while running Find_DAC on Tadpole.pa.msu.edu (fake IO); CPU time
   is expected to be small/negligible compared to the waiting.

   Explicit wait via sleep() method     31x 2 ms wait =  62 ms
   Default NT scheduler                 measured        310 ms

   Calling timeBeginPeriod( int ) with an argument specifying the timer
   period:
      20 -> 310 ms
      10 -> 310 ms
       5 -> 160 ms
       4 -> 130 ms
       3 -> 100 ms
       2 ->  70 ms
       1 -> 100 ms
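The table above is driven by how Sleep() rounds up to the scheduler tick unless
the multimedia timer period is lowered. A minimal Win32 sketch of the wait
pattern, assuming the 2 ms settling wait is implemented with Sleep(); the loop
and the timing code are illustrative, not the actual Trics implementation.

   /* Minimal Win32 sketch of the settling-wait pattern measured above.
    * Sleep(2) only gives ~2 ms when the multimedia timer period has been
    * lowered with timeBeginPeriod(); at the default ~10 ms resolution each
    * Sleep(2) rounds up to a full scheduler tick. Illustrative only. */

   #include <windows.h>
   #include <mmsystem.h>   /* timeBeginPeriod / timeEndPeriod */
   #include <stdio.h>
   #pragma comment(lib, "winmm.lib")

   int main(void)
   {
       const UINT period_ms = 2;       /* requested timer resolution */
       timeBeginPeriod(period_ms);     /* lower the scheduler tick   */

       DWORD start = GetTickCount();
       for (int i = 0; i < 31; i++) {  /* 31 Pedestal Control changes per channel */
           /* ... program the DAC via a Serial Command here (~1 ms of CBus IO) ... */
           Sleep(2);                   /* 2 ms settling wait after each command   */
       }
       printf("31x 2 ms waits took ~%lu ms\n",
              (unsigned long)(GetTickCount() - start));

       timeEndPeriod(period_ms);       /* restore the default resolution */
       return 0;
   }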
V.3. AONM and FOM card programming
----------------------------------

cf. TRICS II Version 10.3 Rev A Revision Notes:

   Programming the AONM and FOM FPGAs requires loading a memory lookup of
   the desired response for groups of input terms. This is very IO
   intensive. The worst case is the deallocation of a Specific Trigger,
   where Trics has to remove the Trigger programming from many places:

        2    for L1 And-Or Fired AONM
       16    for L1 Qualifier
      128x3  for L1 Accept, L2 Accept, L2 Reject FOM
       12    for FOM++ SpTrg Fired Strobe, Skip Next, Skip Next N
        2    for L2 SpTrg Accept, Reject AONM
      -----
      416 channels x ( 8x(16+2) + 4 IO per AONM/FOM channel ) = 62k FW IO

   One VME access takes about 5 us, and one full FW write with pre- and
   post-write checks takes 3 times that much. This adds up to about 1 second
   per SpTrg deallocated. Note that the bulk of these IOs are in fact made
   to a register that was already declared "volatile", and thus not
   pre-write checked, because these registers are a window into a larger
   space controlled by a programmable base address.

   COOR started deallocating SpTrg by ranges, and deallocating N triggers
   would take N seconds. COOR would eventually give up and time out, which
   would percolate to subsequent taker actions and cause problems.

   Because Trigger IO has been rock solid, and we have never had IO errors
   outside of power problems, we chose to drop the systematic post-write
   check (there already was no pre-write check) for the bulk of the IOs.
   We keep one full pre+post-write checked IO at the beginning of the stream
   of IOs to each FPGA in order to always be able to clearly detect an
   un-configured FPGA. The net result is that the 1 second per SpTrg has
   become 0.5 second. This will keep us out of trouble. COOR will also be
   changed to limit the number of SpTrg deallocated in a single message.
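A minimal sketch that restates the deallocation arithmetic quoted above, using
the per-channel counts and the 5 us per VME access from the revision notes;
the exact breakdown inside Trics may differ.

   /* Arithmetic check of the SpTrg deallocation IO count in V.3, using the
    * per-channel counts and the 5 us per VME access quoted above. */

   #include <stdio.h>

   int main(void)
   {
       int channels        = 2 + 16 + 128 * 3 + 12 + 2;  /* = 416 AONM/FOM channels  */
       int ios_per_channel = 8 * (16 + 2) + 4;           /* = 148 FW IOs per channel */
       long fw_ios = (long) channels * ios_per_channel;  /* ~ 62k FW IOs             */

       double vme_access    = 5e-6;            /* one VME access over the V.I.       */
       double checked_write = 3 * vme_access;  /* pre-check + write + post-check     */

       printf("%d channels x %d IOs = %ld FW IOs\n",
              channels, ios_per_channel, fw_ios);
       printf("at %.0f us per fully checked write: ~%.1f s per SpTrg deallocated\n",
              checked_write * 1e6, fw_ios * checked_write);
       /* The text above quotes ~1 s per SpTrg with systematic checking, and
        * ~0.5 s once the post-write check is dropped for the bulk of the IOs. */
       return 0;
   }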