Trics and Bit3 IO Timing Estimates and Measurements                 20-Jun-2003

I) Introduction
===============

This file collects timing measurements for simple and compound IOs to the
L1FW and L1CT. TCC is connected to the L1FW and L1CT by a combination of a
Bit3 PCI-VME bus interface, Vertical Interconnects, and an Ironics IO card.
cf. vme_communication_crate.txt in
http://www.pa.msu.edu/hep/d0/ftp/tcc/vme_access/

Estimates of compound IOs below use canonical rounded numbers of 2 us per
VME IO within the Communication Crate (e.g. to the Ironics) and 5 us per
VME IO over a Vertical Interconnect (e.g. to the L1/L2 FW).

Most measurements were made on MSUL1A using logfile timestamps and a kludged
version of Trics in which the "Register IO" and "CBus IO" dialogs repeat the
same IO one million times (1M = 10**6). The VME IO dialog already has a
repeat count available by default. The L1CT diagnostics pedestal analysis
tool was used to measure Histogram collection times. The exception is FMLN
programming, whose programming time is intrinsically long enough to be
measured directly in DZero logfiles.

II) Elementary IO
=================

II.1. Memory Mapped IO method
-----------------------------

II.1.A. Base Estimate

   Within Communication Crate
      about 2 us per VME Read to Ironics
      about 2 us per VME Write to Ironics

   Over Vertical Interconnect
      about 5 us per VME Read via V.I. to L1FW register
      about 5 us per VME Write via V.I. to L1FW register

II.1.B. Measurement

   MMAP read to Ironics
      1Mx MMAP read to Ironics
         18:12:23.156
         18:12:25.312    2.16
         18:12:27.421    2.11 s

         18:12:52.062
         18:12:54.171    2.23
         18:12:56.296    2.12
         18:13:09.453    2.16 s

      -> 1x MMAP read to Ironics ~ 2.2 us

   MMAP write to Ironics
      1Mx MMAP write to Ironics
         18:12:27.421
         18:12:29.281    1.86

         18:12:46.515
         18:12:48.390    1.88
         18:12:50.203    1.81
         18:12:52.062    1.86 s

      -> 1x MMAP write to Ironics ~ 1.9 us

   MMAP read to L1FW
      Needs re-measuring

   MMAP write to L1FW
      Needs re-measuring

II.2. Bit3 API IO method
------------------------

II.2.A. Base Estimate

   about 50 us per VME Read to anywhere
   about 50 us per VME Write to anywhere

II.2.B. Measurement

   API read to Ironics
      1Mx API read to Ironics
         18:07:19.687
         18:08:10.359   50.67
         18:08:59.203   48.84
         18:09:50.734   51.53
         18:10:42.500   51.77 s

      -> 1x API read to Ironics ~ 52 us

   API write to Ironics
      1Mx API write to Ironics
         18:10:42.500
         18:11:32.703   50.20
         18:12:23.156   50.45 s

      -> 1x API write to Ironics ~ 51 us

   API read to L1FW
      Needs re-measuring

   API write to L1FW
      Needs re-measuring

II.3. IO to non-existing address
--------------------------------

II.3.A. Estimate

   This was not the focus of the study, but is captured here along with the
   other measurements.

II.3.B. Measurement

   MMAP VME Read to non-existing register
      1Mx L1FW Register Read while rack off
         10:32:20.000
         10:33:31.234   71.2 s
         10:33:37.078   65.8 s

      -> 1x L1FW Register Read without DTACK ~ 70 us

   MMAP VME Write to non-existing register
      Not measured. This would require more temporary hacking to skip the
      post-write check. The time is presumably the same. The Bit3 driver may
      have been set to log errors, which is something the user can turn
      on/off, I believe.

   API VME Read/Write to non-existing register
      Not measured.

III) L1FW IO
============

III.1. L1FW Register Read
-------------------------

III.1.A. Estimate

   Read Data

   Total 1x IO via V.I. ~ 5 us

III.1.B. Measurement

III.2. L1FW Register Write
--------------------------

III.2.A. Estimate

   Read Data  (Pre-Write check)
   Write Data
   Read Data  (Post-Write check)

   Total 3x IO via V.I. ~ 15 us

III.2.B. Measurement
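Note: as a summary of the elementary accesses of section II and the checked
register write of section III, here is a minimal sketch in C. It assumes the
Bit3 window has already been memory mapped by the driver; the window pointer,
the 16-bit register width, the offsets, and the helper names are illustrative
placeholders, not actual Trics or Bit3 API identifiers.

   /* Minimal sketch of memory-mapped VME IO and a checked L1FW register write.
    * Assumes the Bit3 PCI-VME window has already been mapped into the process
    * by the Bit3 driver (not shown); vme_window and the helper names are
    * hypothetical placeholders, not actual Trics/Bit3 identifiers. */

   #include <stdint.h>
   #include <stdio.h>

   static volatile uint16_t *vme_window;   /* base of the mapped VME window,
                                              assumed set up elsewhere        */

   /* One memory-mapped read or write is a single VME cycle:
    * ~2 us to the Ironics, ~5 us over a Vertical Interconnect. */
   static uint16_t mmap_read16(uint32_t offset)
   {
       return vme_window[offset / 2];
   }

   static void mmap_write16(uint32_t offset, uint16_t value)
   {
       vme_window[offset / 2] = value;
   }

   /* L1FW register write with pre- and post-write checks:
    * Read Data (Pre-Write check), Write Data, Read Data (Post-Write check),
    * i.e. 3 VME cycles over the V.I. ~ 15 us total. */
   static int l1fw_write_checked(uint32_t offset, uint16_t value)
   {
       (void) mmap_read16(offset);          /* pre-write check: read before writing  */
       mmap_write16(offset, value);
       if (mmap_read16(offset) != value) {  /* post-write check: read back the value */
           fprintf(stderr, "L1FW register 0x%06x readback mismatch\n",
                   (unsigned) offset);
           return -1;
       }
       return 0;
   }

The same pattern applies to registers in the Communication Crate, with ~2 us
per cycle to the Ironics instead of ~5 us over the Vertical Interconnect.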
IV) CBUS IO
===========

IV.1. CBus IO Read
------------------

IV.1.A. Estimate

   Write FA  (Function Address)
   Write CA  (Card Address and Direction=Read)
   Write MBA (Mother Board Address last, to minimize transient selections)
   Read Data
   Write MBA (=0 to deselect the MBA)

   Total 5x IOs to Ironics ~ 10 us

IV.1.B. Measurement

   CBus Read
      1Mx CBus Read
         17:58:24.687
         17:58:34.437    9.75

         17:58:50.562
         17:59:00.296    9.73
         17:59:09.812    9.52
         17:59:19.234    9.42 s

      -> 1x CBus Read ~ 10 us

IV.2. CBus IO Write
-------------------

IV.2.A. Estimate

   Write FA  (Function Address)
   Write CA  (Card Address and Direction=Read)
   Write MBA (Mother Board Address last, to minimize transient selections)
   Read Data (Pre-Write check)
   Write Data
   Write Dir  (=Write)
   Write Strb (=High)
   Write Strb (=Low)
   Write Dir  (=Read)
   Read Data (Post-Write check)
   Write MBA (=0 to deselect the MBA)

   Total 11x IOs to Ironics ~ 22 us

IV.2.B. Measurement

   CBus Write
      1Mx CBus Write
         17:59:41.890
         18:00:02.828   20.93
         18:00:23.609   20.78
         18:00:45.203   21.60 s

      -> 1x CBus Write ~ 21 us

IV.3. Serial DAC Command
------------------------

IV.3.A. Estimate

   Read CBus Reg      =  5 IOs
   Write CBus Reg     = 11 IOs
   Write Stream Setup =  4 IOs
   Write Stream Item  =  3 IOs, repeated (2*12 + 4*16) Serial Bits
                         * 2 for clock up/down          = 528 IOs
   Write Stream Done  =  2 IOs
   Write CBus Reg     = 11 IOs

   Total = 561 IOs * 2 us = 1.12 ms

IV.3.B. Measurement

   Serial DAC
      100k Serial DAC Commands
         10:22:17.953
         10:24:02.187   104.2 s
         10:25:48.156   106.0
         10:27:34.515   106.0 s

      -> 1x Serial DAC Command ~ 1 ms

IV.4. CBus IO Histogram of 1k points
------------------------------------

IV.4.A. Estimate

   With NO control of Read/Write A/B
      Option #0 = set address and 1k reads from Ironics Port
         ~ 2 ms per Histogram

   While Controlling Read/Write A/B
      Method (the per-option arithmetic is sketched after this subsection):
         Pulse Write A/B to load Pipe
            requires 4 + 2*3 + 2 = 12 Ironics IOs
         Read a number of slices; repeat until N samples collected
            requires 2 + N*( 1 + 1 ) + 1 = 2N+3 Ironics IOs

      Option #1 = Keeping every Slice
         -> 7 Slices read per Pipeful -> 143 Pipe Loads
         143*( 12 + 2*7+3 )  =  4,147 Ironics IOs  ~  8.3 ms per Histogram

      Option #2 = Keeping every 2nd Slice
         -> 4 Slices read per Pipeful -> 250 Pipe Loads
         250*( 12 + 2*4+3 )  =  5,750 Ironics IOs  ~ 11.5 ms per Histogram

      Option #3 = Keeping every 3rd Slice
         -> 3 Slices read per Pipeful -> 334 Pipe Loads
         334*( 12 + 2*3+3 )  =  7,014 Ironics IOs  ~ 14.0 ms per Histogram

      Option #4 = Keeping every 4th Slice
         -> 2 Slices read per Pipeful -> 500 Pipe Loads
         500*( 12 + 2*2+3 )  =  9,500 Ironics IOs  ~ 19.0 ms per Histogram

      Option #5 = Keeping every 5th Slice
         -> 2 Slices read per Pipeful -> 500 Pipe Loads
         500*( 12 + 2*2+3 )  =  9,500 Ironics IOs  ~ 19.0 ms per Histogram

      Option #6 = Keeping every 6th Slice
         -> 2 Slices read per Pipeful -> 500 Pipe Loads
         500*( 12 + 2*2+3 )  =  9,500 Ironics IOs  ~ 19.0 ms per Histogram

      Option #7 = Keeping every 7th Slice
         -> 1 Slice read per Pipeful -> 1,000 Pipe Loads
         1000*( 12 + 2*1+3 ) = 17,000 Ironics IOs  ~ 34.0 ms per Histogram
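The per-option numbers above all follow a single formula: a 1k-point Histogram
needs ceil(1000/N) Pipe loads when keeping N slices per pipeful, and each load
costs 12 + (2N+3) Ironics IOs. The sketch below reproduces the table under the
canonical 2 us per Ironics IO; it is an arithmetic check only, not Trics code.

   /* Arithmetic check of the per-option Ironics IO counts in IV.4.A.
    * slices_kept per pipeful (7,4,3,2,2,2,1 for Options #1..#7) restates
    * the text above; 2 us per Ironics IO is the canonical estimate. */

   #include <stdio.h>

   int main(void)
   {
       const int    samples       = 1000;                    /* 1k-point histogram       */
       const int    load_cost     = 12;                      /* 4 + 2*3 + 2 IOs per load */
       const int    slices_kept[] = { 7, 4, 3, 2, 2, 2, 1 }; /* Options #1..#7           */
       const double us_per_io     = 2.0;                     /* Ironics IO estimate      */

       for (int opt = 1; opt <= 7; opt++) {
           int kept  = slices_kept[opt - 1];
           int loads = (samples + kept - 1) / kept;          /* ceil(1000/kept)          */
           int ios   = loads * (load_cost + 2 * kept + 3);   /* 12 + (2N+3) per load     */
           printf("Option #%d: %4d loads, %6d IOs, ~%5.1f ms per Histogram\n",
                  opt, loads, ios, ios * us_per_io / 1000.0);
       }
       return 0;
   }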
IV.4.B. Measurement

   CBus Histogram
      1M point CBus Histogram -- Option #0 = No Pipe Control
         16:53:10.562
         16:53:12.875    2.31
         16:53:15.250    2.37
         16:53:17.609    2.36
         16:53:19.968    2.36
         16:53:22.328    2.36 s

      -> 1x 1k point CBus Histogram Option #0 ~ 2.3 ms

      1M point CBus Histogram -- Option #1 = Pipe Control and keep 7
         16:53:43.453
         16:53:51.750    8.30
         16:54:00.046    8.30
         16:54:08.656    8.61
         16:54:16.968    8.31
         16:54:25.343    8.38 s

      -> 1x 1k point CBus Histogram Option #1 ~ 8.3 ms

      1M point CBus Histogram -- Option #2 = Pipe Control and skip 1
         16:55:06.562
         16:55:17.718   11.16
         16:55:28.906   11.19
         16:55:40.078   11.17
         16:55:52.718   12.64 s

      -> 1x 1k point CBus Histogram Option #2 ~ 11.2 ms

      1M point CBus Histogram -- Option #3 = Pipe Control and skip 2
         16:55:55.500
         16:56:09.218   13.71
         16:56:22.687   13.47
         16:56:36.125   13.44
         16:56:49.546   13.42 s

      -> 1x 1k point CBus Histogram Option #3 ~ 13.5 ms

      1M point CBus Histogram -- Option #4 = Pipe Control and skip 3
         16:56:52.703
         16:57:13.828   21.12
         16:57:31.750   17.92
         16:57:49.671   17.92
         16:58:09.562   19.89 s

      -> 1x 1k point CBus Histogram Option #4 ~ 18 ms

      1M point CBus Histogram -- Option #5 = Pipe Control and skip 4
         16:58:12.843
         16:58:30.828   17.98
         16:58:48.687   17.86
         16:59:06.593   17.91
         16:59:27.593   21.00 s

      -> 1x 1k point CBus Histogram Option #5 ~ 18 ms

      1M point CBus Histogram -- Option #6 = Pipe Control and skip 5
         16:59:29.093
         16:59:47.234   18.14
         17:00:05.109   17.87
         17:00:24.921   19.81
         17:00:44.000   19.08 s

      -> 1x 1k point CBus Histogram Option #6 ~ 18 ms

      1M point CBus Histogram -- Option #7 = Pipe Control and keep only 1
         17:00:47.046
         17:01:18.625   31.58
         17:01:52.781   34.16
         17:02:24.437   31.66
         17:02:58.562   34.13 s

      -> 1x 1k point CBus Histogram Option #7 ~ 32 ms

IV.5. FMLN programming
----------------------

IV.5.A. Estimate

   Write FMLN
      Update Write Protect
         Write CBus Reg       =     11 IOs
      Reset Address Generator
         2x Write CBus Reg    =  2x 11 IOs
      Read Counter
         3x Read CBus Reg     =  3x  5 IOs
      Write memory in 4 sections, i.e. repeat 4 times:
         Write Stream Setup   =      4 IOs
         Write Stream Item    =      3 IOs (most often 2 IOs),
                                     repeated 128*512 times
         Write Stream Done    =      2 IOs
         Read Counter
            3x Read CBus Reg  =  3x  5 IOs
      Set Normal Mode = Reset Write Protect, OverflowOverrule, Mode
         3x Write CBus Reg    =  3x 11 IOs

      Total 11 + 22 + 15 + 4*( 4 + 128*( 510*2 + 2*3 ) + 2 + 15 ) + 33
            = 525,477 IOs
      Total ~ 1.05 s

   Verify FMLN
      Reset Address Generator
         2x Write CBus Reg    =  2x 11 IOs
      Read Counter
         3x Read CBus Reg     =  3x  5 IOs
      Update Overflow Overrule
         Write CBus Reg       =     11 IOs
      Read memory in 4 sections, i.e. repeat 4 times:
         Read Stream Setup    =      3 IOs
         Read Stream Item     =      3 IOs, repeated 128*512 times
         Read Stream Done     =      1 IO
         Read Counter
            3x Read CBus Reg  =  3x  5 IOs
      Set Normal Mode = Reset Write Protect, OverflowOverrule, Mode
         3x Write CBus Reg    =  3x 11 IOs

      Total 11 + 22 + 15 + 4*( 3 + 128*512*3 + 1 + 15 ) + 33
            = 786,589 IOs
      Total ~ 1.57 s
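For completeness, the two FMLN totals can be checked the same way. A minimal
sketch, again assuming the canonical 2 us per Ironics IO; the constants simply
restate the counts listed in the estimate above and are not taken from Trics.

   /* Arithmetic check of the FMLN Write and Verify IO totals in IV.5.A,
    * using the canonical 2 us per Ironics IO. */

   #include <stdio.h>

   int main(void)
   {
       /* Write FMLN:  11 + 2*11 + 3*5 + 4*( 4 + 128*(510*2 + 2*3) + 2 + 3*5 ) + 3*11 */
       long write_ios  = 11 + 22 + 15
                       + 4L * ( 4 + 128L * (510 * 2 + 2 * 3) + 2 + 15 )
                       + 33;

       /* Verify FMLN: 11 + 2*11 + 3*5 + 4*( 3 + 128*512*3 + 1 + 3*5 ) + 3*11 */
       long verify_ios = 11 + 22 + 15
                       + 4L * ( 3 + 128L * 512 * 3 + 1 + 15 )
                       + 33;

       printf("Write  FMLN: %ld IOs  ~ %.2f s\n", write_ios,  write_ios  * 2e-6);
       printf("Verify FMLN: %ld IOs  ~ %.2f s\n", verify_ios, verify_ios * 2e-6);
       return 0;
   }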
IV.5.B. Measurement

   Zero + Verify FMLN Comparator
   (From D0 logfile TRICS_II_20030610_V10_4_D.LOG;1)
      09:21:43.204
      09:21:44.987   1.78 s  Zero
      09:21:46.519   1.53 s  Verify

      09:23:09.388
      09:23:11.170   1.78
      09:23:13.073   1.90

      10:23:38.526
      10:23:40.529   2.00
      10:23:42.342   1.81

      14:54:46.228
      14:54:48.031   1.80
      14:54:49.823   1.79

      14:59:34.212
      14:59:36.065   1.85
      14:59:37.958   1.89

      -> FMLN Initialization takes 2 x 1.8 s = 3.6 s
         (note: the Write phase had a bug and should take less time by not
          rewriting the data that never changes.)

   Write + Verify FMLN Comparator
   (From D0 logfile TRICS_II_20030610_V10_4_D.LOG;1)
      11:11:16.856
      11:11:18.829   1.97 s  Write
      11:11:20.662   1.83 s  Verify

      11:11:39.048
      11:11:40.871   1.82
      11:11:42.674   1.80

      11:11:59.598
      11:12:01.421   1.82
      11:12:03.223   1.80

      11:12:15.751
      11:12:17.544   1.79
      11:12:19.527   1.98

      10:08:12.412
      10:08:14.204   1.79
      10:08:16.197   1.99

      10:08:26.532
      10:08:28.325   1.79
      10:08:30.127   1.80

      -> FMLN Programming takes 2 x 1.8 s = 3.6 s
         (note: the Write phase had a bug and should take less time by not
          rewriting the data when it hasn't changed.)

V) Misc
=======

V.1. Resource Locks (Semaphores)
--------------------------------

cf. TRICS II Version 9.1 Release Notes:
   "Lock acquiring+release time penalty at about 0.18 us per pair
    (assuming the lock is available)"

V.2. Find_DAC
-------------

V.2.A. Estimate

   Processing one Channel involves changing the Pedestal Control value
   31 times (Max, Min, Med, Low, 26*Near, Selected) with a Serial Command
   time of ~1 ms, for a total of ~31 ms.

   Changing the Pedestal Control includes an implicit 2 ms settling wait
   (for a desired total of 62 ms of waiting), but the NT scheduler's
   resolution is 10 ms (for an actual default total wait time of 310 ms).

   Processing one Channel also involves building 2x 10-point histograms
   (Max, Min) and 28x 1k-point histograms (Med, Low, 26*Near). The time per
   histogram is a function of the requested spacing between samples
   (cf. above), e.g. ~2.3 ms without Read/Write Pipe A/B control, or
   ~11.2 ms when skipping every other 396 ns sample per pipeful captured.

V.2.B. Wait Time Measurement

   Trics calls the sleep function to allow a 2 ms settling time after each
   Serial Command DAC programming. The measurements below use logfile time
   stamps while running Find_DAC on Tadpole.pa.msu.edu (fake IO); CPU time
   is expected to be small/negligible compared to the waiting.

   Explicit wait via sleep() method     31x 2 ms wait =  62 ms
   Default NT scheduler                 measured        310 ms

   Calling timeBeginPeriod( int ) with an argument specifying the timer
   period:
      20 -> 310 ms
      10 -> 310 ms
       5 -> 160 ms
       4 -> 130 ms
       3 -> 100 ms
       2 ->  70 ms
       1 -> 100 ms
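The table above is driven by how Sleep() rounds up to the scheduler tick unless
the multimedia timer period is lowered. A minimal Win32 sketch of the wait
pattern, assuming the 2 ms settling wait is implemented with Sleep(); the loop
and the timing code are illustrative, not the actual Trics implementation.

   /* Minimal Win32 sketch of the settling-wait pattern measured above.
    * Sleep(2) only gives ~2 ms when the multimedia timer period has been
    * lowered with timeBeginPeriod(); at the default ~10 ms resolution each
    * Sleep(2) rounds up to a full scheduler tick. Illustrative only. */

   #include <windows.h>
   #include <mmsystem.h>   /* timeBeginPeriod / timeEndPeriod */
   #include <stdio.h>
   #pragma comment(lib, "winmm.lib")

   int main(void)
   {
       const UINT period_ms = 2;       /* requested timer resolution */
       timeBeginPeriod(period_ms);     /* lower the scheduler tick   */

       DWORD start = GetTickCount();
       for (int i = 0; i < 31; i++) {  /* 31 Pedestal Control changes per channel */
           /* ... program the DAC via a Serial Command here (~1 ms of CBus IO) ... */
           Sleep(2);                   /* 2 ms settling wait after each command   */
       }
       printf("31x 2 ms waits took ~%lu ms\n",
              (unsigned long)(GetTickCount() - start));

       timeEndPeriod(period_ms);       /* restore the default resolution */
       return 0;
   }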
V.3. AONM and FOM card programming
----------------------------------

cf. TRICS II Version 10.3 Rev A Revision Notes:

   Programming the AONM and FOM FPGAs requires loading a memory lookup of
   the desired response for groups of input terms. This is very IO
   intensive. The worst case is the deallocation of a Specific Trigger,
   where Trics has to remove the Trigger programming from many places:

        2    for L1 And-Or Fired AONM
       16    for L1 Qualifier
      128x3  for L1 Accept, L2 Accept, L2 Reject FOM
       12    for FOM++ SpTrg Fired Strobe, Skip Next, Skip Next N
        2    for L2 SpTrg Accept, Reject AONM
      -----
      416 channels x ( 8x(16+2) + 4 IO per AONM/FOM channel ) = 62k FW IO

   One VME access takes about 5 us, and one full FW write with pre- and
   post-write checks takes 3 times that much. This adds up to about 1 second
   per SpTrg deallocated. Note that the bulk of these IOs are in fact made
   to a register that was already declared "volatile", and thus not
   pre-write checked, because these registers are a window into a larger
   space controlled by a programmable base address.

   COOR started deallocating SpTrg by ranges, and deallocating N triggers
   would take N seconds. COOR would eventually give up and time out, which
   would percolate to subsequent taker actions and cause problems.

   Because Trigger IO has been rock solid, and we have never had IO errors
   outside of power problems, we chose to drop the systematic post-write
   check (there already was no pre-write check) for the bulk of the IOs.
   We keep one full pre+post-write checked IO at the beginning of the stream
   of IOs to each FPGA in order to always be able to clearly detect an
   un-configured FPGA. The net result is that the 1 second per SpTrg has
   become 0.5 second. This will keep us out of trouble. COOR will also be
   changed to limit the number of SpTrg deallocated in a single message.
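A minimal sketch that restates the deallocation arithmetic quoted above, using
the per-channel counts and the 5 us per VME access from the revision notes;
the exact breakdown inside Trics may differ.

   /* Arithmetic check of the SpTrg deallocation IO count in V.3, using the
    * per-channel counts and the 5 us per VME access quoted above. */

   #include <stdio.h>

   int main(void)
   {
       int channels        = 2 + 16 + 128 * 3 + 12 + 2;  /* = 416 AONM/FOM channels  */
       int ios_per_channel = 8 * (16 + 2) + 4;           /* = 148 FW IOs per channel */
       long fw_ios = (long) channels * ios_per_channel;  /* ~ 62k FW IOs             */

       double vme_access    = 5e-6;            /* one VME access over the V.I.       */
       double checked_write = 3 * vme_access;  /* pre-check + write + post-check     */

       printf("%d channels x %d IOs = %ld FW IOs\n",
              channels, ios_per_channel, fw_ios);
       printf("at %.0f us per fully checked write: ~%.1f s per SpTrg deallocated\n",
              checked_write * 1e6, fw_ios * checked_write);
       /* The text above quotes ~1 s per SpTrg with systematic checking, and
        * ~0.5 s once the post-write check is dropped for the bulk of the IOs. */
       return 0;
   }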