This is the log book for the TRICS software started 22-JAN-1992
================================================================================
+--------------------------------------------------- Updated - 7-JUL-1994 -+
|Code development Policy:                                                  |
|- DZERO::EWORK1 has current working code.                                 |
|- DZERO::ETRICS has last stable code. Backup EWORK1 before modifying.     |
|- MSU::ETRICS is a backup of DZERO::ETRICS:                               |
|- MSU::EWORK1: is a backup of DZERO::EWORK1:                              |
+--------------------------------------------------------------------------+
|Archival Policy:                                                          |
|- subdirectory [.TCC] has the files needed on TCC                         |
|- subdirectory [.TRGUSER] has all important files from the trguser account|
|- subdirectory [.TRG_LIB] has all important files for linking purpose     |
+--------------------------------------------------------------------------+
|Code Update reminder:                                                     |
|- Check Version number and comments in SITE_DEPENDENT.CST                 |
|- SITE_DEPENDENT.CST different parameters  -- cf *.CST_MSU and *.CST_DZERO|
|- TRICS_Vnm.DAT different node name+number -- cf *.DAT_MSU and *.DAT_DZERO|
+--------------------------------------------------------------------------+
================================================================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
19-JUN-1996 Philippe: MSU:: load system from 24-MAY find problem, make new system
- kludge version of MOD103_HANDLE_SCALERS.PAS
  Make version that does not toggle the scan/reset line on the 36x36
  scan/reset MTG
- Modify MOD227_PHAT_EXECUTE.PAS
  change argument checking restriction that was limiting MOD_HDB messages
  to CBUS <= 2. Now set to CBUS <= 3.
- make and load new system EWORK1:TRICS_V64.SYS_19JUN96
  The old ("production") system is still in EWORK1:TRICS_V64.SYS_24OCT95
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
24-MAY-1996 Philippe: MSU:: kludge system to run with L1.5 CT and 36x36 scalers off
- kludge version of MOD100_HANDLE_L15CT.PAS and MOD100_HANDLE_L15CT.PAS
  that has nearly empty routines for handling the L1.5CT VME IO
- new system is EWORK1:TRICS_V64.SYS_24MAY96
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
26-FEB-1996 Philippe: MSU:: clean up disk directory
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*JAN95.LOG and TRICS*FEB95.LOG to MSUD02$DUA1:[TCC_LOG_IC]
- also copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG from FEB96
- delete TRICS* logfiles from Dec, Jan, Feb
- Delete MPOOL*/LOG*/MAIL*.LOG from Dec, Jan, Feb
- [TRIGGER] now uses 59k blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17-JAN-1996 Philippe: MSU:: clean up disk directory
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*DEC95.LOG to MSUD02$DUA1:[TCC_LOG_IC] ** use new DIR **
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete TRICS* logfiles from Dec
- leave MPOOL*/LOG*/MAIL*.LOG from Dec for now (will kill next month)
- Delete MPOOL*/LOG*/MAIL*.LOG from Nov (left from last month)
- [TRIGGER] now uses 57k blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
16-NOV-1995 Philippe: MSU:: clean up disk directory
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*OCT95.LOG to MSUD02$DUA1:[TCC_LOG_IC] ** note new DIR **
- copy TRICS*NOV95.LOG to MSUD02$DUA1:[TCC_LOG_IC]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete TRICS* logfiles from Oct
- leave MPOOL*/LOG*/MAIL*.LOG from Nov for now (will kill next month)
- [TRIGGER] now uses 41k blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7-DEC-1995 Philippe: TCC needs reboot
- Power supply failure in upper tier 1 supply in rack M105. (probably)
  Right after the system was back on, all TRGMON sessions were dumped,
  and could not be restarted. COOR's requests to TCC were not being serviced.

  Here is what could be reconstructed by looking at logfiles:
  11:52 COOR Initialized TCC with Caltrig not turned on
        this must be to switch to the special run
  11:54 Initialization completes with a bunch of errors since most of the
        hardware is turned off, but no problem here
  ...   Everything is fine, COOR sends requests and TCC can perform IOs
        to the hardware all ok.
  13:59 "SOMETHING" happens preventing all subsequent IOs from TCC to L1FW+CT
  14:02 the Monitoring server, more exactly ITC, dumps all TRGMON connections
        and quits accepting new ones. The error in TCC's Mpool_server was
        ITC-E-NO_CHANNEL, Channel requested has not been activated

  COOR keeps sending requests to unwind from the special run.
  TCC is bogged down by slow and failed IOs as it tries to get control of the
  L1FW+CT bus (remember this bus is shared between TCC control and high
  speed event readout). Each individual IO has to time out (7s): this takes
  forever. 15 min later TCC hadn't even got to the INITIALIZE request
  from COOR.

  Now for what the "something" was: we believe the ZRL pQBA interface in the
  BA23 QBUS enclosure tied to the microVAX 4000/60 TCC must have gotten
  corrupted. Flipping on power supplies in L1 CT might have been correlated
  to this. The pQBA stayed hosed until it could be reset.
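
  A quick back-of-the-envelope check of the timeout arithmetic in the
  7-DEC-1995 entry. This is a Python sketch, not TRICS code: the function
  names are made up for illustration, and only the 7 s timeout and the
  15 minute interval come from the entry.

```python
def timeouts_elapsed(elapsed_s, timeout_s=7.0):
    """Number of failed IOs that can fully time out, back to back, in elapsed_s."""
    return int(elapsed_s // timeout_s)


def backlog_drain_s(failed_ios, timeout_s=7.0):
    """Seconds needed to work through failed_ios IOs that each wait out the timeout."""
    return failed_ios * timeout_s


if __name__ == "__main__":
    # Failed IOs that fit in the 15 minutes mentioned in the entry.
    print(timeouts_elapsed(15 * 60))
    # Minutes to drain a hypothetical backlog of 500 failed IOs.
    print(backlog_drain_s(500) / 60)
```

  With strictly serial 7 s timeouts, 15 minutes covers only about 128 failed
  IOs. How many IOs the pending COOR requests actually needed is not recorded
  here, so the 500 in the example is purely an assumption.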
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4-DEC-1995 First weekly reboot of TCC
- TCC is now rebooted every Monday. During "solid store operation" this
  should happen before flipping magnet polarity, as the latter is done after
  the protons are already in.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
16-NOV-1995 Philippe: D0:: clean up disk directory
- clean [TRIGGER] directory. Delete (not saved) files from Oct.
  [TRIGGER] now holds 42,480 blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15-NOV-1995 Philippe: MSU:: power failure, bug noticed
- After a sitewide power failure, TCC was booted before D0HSC was available.
  This caused an access violation in the booting sequence that hung TCC.
  This happened when TCC was trying to write the begin/end run file to
  capture the scaler counts before initializing them; the file could not be
  written to the host, and the error message that was trying to advertise
  the fact tried to access OPTIONAL arguments not passed from this
  initialization call.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
30-OCT-1995 Philippe: D0:: clean up disk directory
- clean [TRIGGER] directory. Delete (not saved) files from July and Sept.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
23-24-OCT-1995 Philippe: D0:: add 36x36 bunch test scalers
- New system in EWORK1: TRICS_V64.SYS_23OCT95
  add support for the 36x36 bunch test scaler crate
  and control line length of the show sptrg message
- TRICS_V64.SYS_24OCT95;1 Fix bug introduced preventing SBSC loading
- TRICS_V64.SYS_24OCT95;2 Fix bug preventing finding all mtg36x36_ctrl registers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
20-JUN-1995 Philippe: MSU:: timing of COOR message execution
- numbers in seconds, population sample on the righthand column.
  note that the first "SPECTRIG L15_TYPE" takes longer, as the special
  command file TRICS_L1_OBEY_L15.DAT needs to be executed

  SPECTRIG ENABLE   %% min=  0.10 max=  0.10 ave=  0.10 n =  1
  SPECTRIG FEBZDIS  %% min=  0.09 max=  0.17 ave=  0.09 n = 30
  SPECTRIG RD_TIME  %% min=  0.08 max=  0.12 ave=  0.09 n = 30
  SPECTRIG ANDORREQ %% min=  0.12 max=  0.22 ave=  0.12 n = 30
  SPECTRIG L15_TERM %% min=  0.09 max=  0.17 ave=  0.10 n = 12
  SPECTRIG L15_TYPE %% min=  0.08 max=  1.43 ave=  0.17 n = 15
  SPECTRIG OBEYBUSY %% min=  0.08 max=  0.11 ave=  0.09 n = 30
  SPECTRIG OBEYLEV2 %% min=  0.08 max=  0.17 ave=  0.09 n = 30
  SPECTRIG PRESCALE %% min=  0.08 max=  0.18 ave=  0.09 n = 58
  SPECTRIG STARTDGT %% min=  0.09 max=  0.10 ave=  0.09 n = 30
  REFSET   EMET     %% min=  0.10 max=  0.26 ave=  0.17 n = 25
  REFSET   LRG_TILE %% min=  0.10 max=  0.12 ave=  0.10 n = 14
  THRESHLD EMETCNT  %% min=  0.09 max=  0.09 ave=  0.09 n =  5
  THRESHLD MISPTSUM %% min=  1.26 max=  1.36 ave=  1.31 n =  3
  THRESHLD TOTETCNT %% min=  0.09 max=  0.09 ave=  0.09 n =  8
  ST_VS_RS TOT_LIST %% min=  0.09 max=  0.17 ave=  0.09 n = 16
  PAUSE             %% min=  0.15 max=  0.27 ave=  0.21 n =  5
  RESUME            %% min=  0.15 max=  0.23 ave=  0.17 n =  4
  L15CTERM REFSET   %% min=  0.09 max=  0.12 ave=  0.10 n =  9
  L15CTERM LOC_DSP  %% min=  0.09 max=  0.09 ave=  0.09 n =  3
  L15CTERM FRAMECOD %% min=  0.08 max=  0.16 ave=  0.11 n =  3
  L15CTERM GLOB_DSP %% min=  0.09 max=  0.09 ave=  0.09 n =  3
  L15CTERM ST_VS_TM %% min=  0.09 max=  0.09 ave=  0.09 n =  3
  L15CTSYS START    %% min= 11.27 max= 11.58 ave= 11.31 n = 80
  L15CTSYS LOADCODE %% min= 13.08 max= 13.68 ave= 13.33 n = 79
  WRT_HOST END_RUN  %% min=  0.19 max=  0.19 ave=  0.19 n =  2
  WRT_HOST BEG_STOR %% min=  0.18 max=  0.18 ave=  0.18 n =  1
  WRT_HOST SYNCHRO  %% min=  0.08 max= 11.36 ave=    --
  INITIAL           %% min= 50.35 max= 51.59 ave= 50.66 n = 99
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15-JUN-1995
- Jan caught a problem with TCC:

  We were in the middle of a store. We had just done an end run, change
  prescales and begin run.
  Then about 10 minutes (or so) into the new run I noticed that all the L2
  nodes were in wait data. We were not getting any events into L2. I looked
  at TRGMON and saw a single line message in the middle of the screen.
  I forget exactly what it said, but it was telling us that TRGMON could not
  talk to TCC. I then tried edebugging TCC and it looked ok. I did a
  directory of TCC's disk, and that responded. Dean and I then went in to
  look at TCC's console. We didn't see anything unusual. It just looked like
  it froze. When I edited TCC's log file, the last line was that it was
  closing the log file. I then edebugged TCC again and got some output for
  you, shown below. TRGMON still couldn't talk to TCC. So at this point we
  triggered TCC. We have now redownloaded the triggers and are running
  happily.

  Answer:
  After examining the logfiles from TCC, I would believe that the problem
  was related to screen access. All of TCC's jobs and subprocesses have to
  synchronize to avoid writing at the same time. There is a semaphore to
  handle this, and it has been performing reliably as far as I can tell.
  It hasn't always been simple, and I had to address some difficulties
  -- e.g. when interrupt routines want to make screen IO, or for special
  time critical sequences -- but I believe everything has been ironed out
  by now. There is also an emergency recovery that can automatically de-jam
  the semaphore: it was tried last night but seems not to have been enough.

  There are two possible explanations for last night's lock-up. Either there
  still is some possible but rare sequence of events that I don't understand
  and that can get us in trouble once every 6 months; but I couldn't see any
  trace of improper activity at the time. Or the screen was physically
  locked up, either by someone pushing the hold screen key, or by hardware
  failure.
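
  The semaphore-plus-emergency-recovery scheme described in the answer can
  be sketched roughly as follows. This is illustrative Python, not the TRICS
  Pascal code: the class and method names are hypothetical, and the 5 s
  default is taken from the "Console Locked for 5s, Recover: Force Unlock"
  message that TCC writes to its logfiles.

```python
import threading


class ConsoleLock:
    """Console semaphore with a timed emergency 'de-jam' (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self, timeout_s=5.0):
        """Return 'ok' if acquired normally, 'forced' if the lock had to be
        broken after timeout_s, or 'stuck' if even that did not help."""
        if self._lock.acquire(timeout=timeout_s):
            return "ok"
        try:
            # Emergency recovery: release on behalf of the jammed holder.
            self._lock.release()
        except RuntimeError:
            pass  # the holder released it in the meantime
        return "forced" if self._lock.acquire(timeout=timeout_s) else "stuck"

    def release(self):
        try:
            self._lock.release()
        except RuntimeError:
            pass  # already force-unlocked by another job


if __name__ == "__main__":
    console = ConsoleLock()
    print(console.acquire(0.1))  # first writer gets the console
    print(console.acquire(0.1))  # second caller waits, then forces the lock
    console.release()
```

  Forcing the lock only clears a software jam. If the terminal itself is
  held (hold screen key, or a hardware fault), the next write blocks again
  no matter what the semaphore says, which fits the second explanation.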
  extracted from logfiles:

  %% time: 09-JUN-1995 13:42:02.02
  TRICS V6.4 CLOSED LOGFILE, DUA0:[TRIGGER]LOG_SERVER_09JUN95.LOG
  %% time: 15-JUN-1995 18:15:28.58
  E-EXC/MBX% Message Mailbox is Full but Not Signaled
  S-EXC/MBX% Flush_to_File now Servicing Exception Mailbox
  X-WAI/CNS%flush% Console Locked for 5s, Recover: Force Unlock
  S-EXC/MBX% Exception Mailbox now empty
  %% time: 15-JUN-1995 18:15:28.91
  TRICS V6.4 CLOSED LOGFILE, DUA0:[TRIGGER]LOG_SERVER_09JUN95.LOG

  C-RCV/CH1% 1:17 %00001034 RESUME
  S-PRS/CHK% COOR Lets Framework Resume
  C-ACK/CH1% 1:44 %00002201 ACKNOW 00001034 OK DONE
  S-MPL/FRH% Start Getting Fresh Data Blocks @ 15-JUN-1995 18:04:55.15
  %% time: 15-JUN-1995 18:06:09.71
  TRICS V6.4 CLOSED LOGFILE, DUA0:[TRIGGER]TRICS_09JUN95.LOG

  %% time: 15-JUN-1995 05:38:11.01
  I-MAI/SRV% Mailed to TRGMGR: TRICS V6.4/09-JUN-1995/ COOR Initializing Trigger
  %% time: 15-JUN-1995 05:38:44.21
  TRICS V6.4 CLOSED LOGFILE, DUA0:[TRIGGER]MAIL_SERVER_09JUN95.LOG

  %% time: 15-JUN-1995 18:12:15.22
  S-MON/SRV% Channel #4 Disconnected after generating 6 messages
  %% time: 15-JUN-1995 18:15:30.59
  TRICS V6.4 CLOSED LOGFILE, DUA0:[TRIGGER]MPOOL_SERVER_09JUN95.LOG
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9-JUN-1995
- TCC was rebooted because it had quit writing to its TRICS logfile.
  It had been up for almost 10 days, and leaking memory anyway.
  All other logfiles seemed fine and active. Remote file access was happy
  too. It seems like one IO to the logfile failed, and TRICS switched over
  to its protective mode of no longer writing to the logfile.
  As far as we know, this is the first time this happened.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5-JUN-1995
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*APR95.LOG to MSUD02$DUA1:[TCC_LOG_IB]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete TRICS logfiles from April
- [TRIGGER] now uses 84,825 blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
31-MAY-1995
- Jan rebooted TCC this afternoon. The store was just being scraped, and
  there was an "apparent problem" between COOR and TCC. As I understood it,
  they sent an initialize command and never got notified that it completed.
  Moreover, she said the TRGMON display did NOT show that the time since
  initialize was reset. Jan didn't take any chances and rebooted TCC.
  I do not dispute that. But I still wanted to find out what really
  happened. Jan saw the DECNET links still alive, the COOR logfiles seemed
  to just show that the timeout value was changed (yes, it seems that Bruce
  just changes the timeout back and forth) and then nothing else.
  I looked in TCC's logfile. It did get the INITIALize message on time, and
  acknowledged 50 sec later, as usual. There was absolutely no sign of
  problem. So maybe Jan was too quick to give up, or there might be
  something sick in COOR (some new bug introduced recently?).
- It became clear that there was absolutely nothing wrong. People on shift
  (including Jan) are used to getting the feedback from COOR saying
  "TCC acknowledge timeout". They no longer get the message. It seems that
  they don't get any acknowledgement in COOR's logfile (but I don't think
  they get an acknowledgement for ANY message, only when bad), just the
  change of timeout message.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
29-MAY-1995
- TCC disk problem again. Not during store, MR was in access.
- The SCSI port driver PKCDRIVER starts forgetting (after running fine for
  several days) to delete messages that pile up and consume "Pool Blocks"
  (one per message lost), and memory. This time I could see that PKCDRIVER
  had not run out of pool blocks, nor exhausted all the memory.
  I need to spend more time on this. After a 30 sec look, I still don't know
  for sure which quota it bumped against.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
22-APR-1995 and 24-APR-1995
- Jan boots TCC along with the rest of the ELN mob to stay clear from name
  server problems.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17-APR-1995 Philippe:
- TCC had another spell of ELN/Disk problems last night;
  DAP code = 01F77C54
  2.5 hours of low lum beam were lost.
  The DAQEXP waited 2 hours before calling for help.
  The last entry in TRICS logfile is from 16-APR-1995 17:05
  Mail messages to TRGMGR shows last COOR INIT at 16-APR-1995 03:52
- a mail message was sent to Jan and Stu Fuess (CC D.Owen)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6-APR-1995 Philippe:
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*MAR95.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete TRICS logfiles from March
- [TRIGGER] now uses 21,600 blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
31-MAR-1995 Philippe:
- Steve notices that TCC is in trouble again.
  This is identical to 30-DEC-1994, 25-MAY-1994, and 12-JUL-1994.
  Trying to do a directory on TCC produces:
  -RMS-F-NET, network operation failed at remote node; DAP code = 01F77C54
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
30-MAR-1995 Philippe:
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*JAN95.LOG and *FEB95 to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete TRICS logfiles from January and February
- [TRIGGER] now uses 72,800 blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1-2-MAR-1995 Philippe:
- move V6.3 from EWORK1: -> ETRICS:
- Archive TRICS V6.3 to MSUD01::DUA1:[ARCHIVE.TRICS_V63]
  done from MSUD01:: using 000_ARCHIVE.V63
- EWORK1: update version number V6.3 -> V6.4
- Copy new files from MSU to DZERO
    MPOOL_SERVER.PAS
    MPOOL_DATA.TYP
    MOD223_COOR_GLOBAL_EXECUTE.PAS
    SITE_DEPENDENT.CST_MSU/DZERO
    MOD067_HANDLE_ZRL.PAS
    MOD245_PHAT_DISPATCH.PAS
    MOD227_PHAT_EXECUTE.PAS
- build New System EWORK1:TRICS_V64.SYS_2MAR95, load it
- old one is in ETRICS:TRICS_V63.SYS_19OCT94
- New Monit Pool Server for TRGMON with longer integration time.
  (only one mpool_server now)
- Fix message length for mail message "BAD returned to COOR".
  (this problem was crashing the mail server)
- Add new message to change the threshold for the error message filter.
  This would be useful to see ALL error messages during initialize.
  (boot default value is 50)
    $ @EENV:COMMANDS
    $ PHAT ERR_FILT nnn        with 1 <= nnn <= 9,999,999
- Update TRGUSERROOT:[TRGMON] release of TRGMON V6.1 with option for setting
  longer integration time
- old files are renamed from x.y to x.y_OLD and will be saved for awhile
- "improve" the menus for setting integr. time and polling/refresh time
  (these are now 2 separate commands)
- Add options during Print/Dump Screen to close the file or change name
- add between screen dumps.
- sense and use full screen length during program startup.
- Only one version of TRGMON left (no more TEST version)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7-FEB-1995 Philippe:
- Dan had power off for a long time. The ZRL hardware does not hang this
  time. TCC is now resetting the ZRL interfaces during initialize.
  This is not a proof that all problems are solved.
  Also read the pQBA/pVBA registers before and after the Initialize.
  No difference

  pQBA Register Base Address = %X34000000
  pQBA Q-Bus Error Address Reg - Longwrd (QBus Address) = %X00000000
  pQBA Q-Bus Interrupt Reg INT0 - Word #1 (QBus Int Vect) = %X0020
  pQBA Q-Bus Interrupt Reg INT0 - Byte #3 (Qbus Int Enb) = %B00000000
  pQBA Q-Bus Interrupt Reg INT0 - Byte #4 (QBus Int Pend) = %B00000000
  pQBA Q-Bus Interrupt Reg INT1 - Byte #1 (QBus Err Msk) = %B00000000
  pQBA Q-Bus Interrupt Reg INT1 - Byte #2 (Reset Error) = %B00000000
  pQBA Q-Bus Interrupt Reg INT1 - Byte #3 (QBus Err Enb) = %B10000000
  pQBA Q-Bus Interrupt Reg INT1 - Byte #4 (QBus Ext Err) = %B00000000
  pQBA Reset Register - LSBit (1 = Problem) = %B0

  pVBA Register Base Address = %X35000000
  pVBA Bus Control Register - Longwrd (Bus Ownership) = %XFF7F7F7F
  pVBA Error Status Register - Byte #1 (Error Mask) = %B11111111
  pVBA Mailbox FIFO Register - Byte #1 (Register Numb) = %B00000000
  pVBA Mailbox FIFO Register - MSBit (1=FIFO empty) = %B1
  pVBA VME Interrupt Reg INT0 - Byte #1 (Vector Number) = %X00
  pVBA VME Interrupt Reg INT0 - Byte #3 (IRQ enab Mask) = %B11111111
  pVBA VME Interrupt Reg INT0 - MSBit (0 = Int Pend) = %B1
  pVBA VSB Interrupt Reg INT1 - Byte #1 (Vector Number) = %X00
  pVBA VSB Interrupt Reg INT1 - Byte #3 (Int enab Mask) = %B00000000
  pVBA VSB Interrupt Reg INT1 - MSBit (0 = Int Pend) = %B1
  pVBA Reset Register - LSBit (1 = Problem) = %B0

  Note that the byte #3 of pQBA INT0 should read %X80 instead of %X00.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
27-JAN-1995 Philippe:
- status of backup of TCC logfiles
  all of run Ib logfiles are in MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
  another copy is in MSUD02::DUA1:[TCC_LOG_Ib]
  run Ia files are in MSUD02::DUA1:[TCC_LOG_Ia]
- There are 2 DAT tapes with a copy of all the files from run Ia and Ib
  from 92, 93 and 94 (up to TRICS_30DEC94.LOG).
  (note that for some files from 1992, only the COOR messages were saved in
  files with suffix *.MSG)
  A listing of their content is in the file drawer labelled "TAPES" in the
  "lobby" of the E-shop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
26-JAN-1995 Philippe:
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*DEC94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete TRICS logfiles from December
- Purge all files (nothing to purge)
- [TRIGGER] now uses 13,500 blocks
- The December logfiles are also backed up to MSUD02::DUA0:[TCC_LOG_IB]
  (also backup the November logfiles)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12-JAN-1995 Philippe:
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*NOV94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete TRICS logfiles from November
- delete MPOOL*.LOG, LOG*.LOG, MAIL*.LOG from October
- Purge all files (nothing to purge)
- [TRIGGER] now uses 66,500 blocks
- The NOV logfiles are NOT YET backed up to MSUD02::DUA0:[TCC_LOG_IB]
  (done 27-JAN-1995)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
30-DEC-1994 Philippe:
- Jan (just back from vacation) called.
  They were trying a new COOR (which was unrelated); after the initialize,
  the system stays stale, with no sptrg #31, no lights on the L1FW.
  This is identical to 25-MAY-1994, 12-JUL-1994 and probably 13-JUN-1994.
  Trying to do a directory on TCC produces the error message:
  %DIRECT-E-OPENIN, error opening D0HTCC::DUA0:[TRIGGER]TRICS_*.LOG; as input
  -RMS-F-NET, network operation failed at remote node; DAP code = 01F77C54

  Edebug CTRL/C>halt 8,2 (FLUSH->FILE)
  Edebug CTRL/C>set ses 8,2
  Edebug 8,2>ex mod_handle_logfile\no_logfile
  Loading symbols for module "MOD_HANDLE_LOGFILE".
  NO_LOGFILE: TRUE

  This means that TRICS had a problem writing to its logfile at one time,
  and switched mode: to fly with NO logfile.
  But TCC does not do a full switch over, as it still tries to get its input
  files (init_auxi, reset,...) from the disk.

  There was no particular rush today, so I tried to see if I could find a
  possible "emergency recovery" action (in case this happens in the middle
  of a run, and we don't want to lose scaler information).
  So I changed the variable holding the location of TRICS's command files.

  Edebug 8,2>dep mod_common_global_flags\boot_directory_name = '57.3"TRGUSER TRGGER"::TRGCUR:'

  I then told TCC to initialize, and it did well with the L1FW (lights
  flashed, and sptrg #31 appeared) but the initialization got in trouble
  when it reached the L1.5CT and the load from_local_disk command.
  The initialization seemed to hang, but was just slow, as it had to timeout
  each of the 12 EXE file OPEN.

  The disk loss probably occurred around 6 am (last entry in
  MPOOL_SERVER.LOG). TCC didn't try reaching its disk (after giving up on
  writing logfiles) and there were no problems or symptoms for COOR, as long
  as no request for initialization, or begin/end run file was sent.

  Reboot TCC and everything looks back to normal. There is an access now.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2-DEC-1994 Philippe:
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*OCT94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete TRICS logfiles from October
- delete MPOOL*.LOG, LOG*.LOG, MAIL*.LOG from October
- Purge all files
- [TRIGGER] now uses 57,000 blocks
- also backup all TRICS logfiles from MAY, JUN, JUL, AUG, SEP, OCT
  to MSUD02::DUA0:[TCC_LOG_IB]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11-NOV-1994 Philippe:
- TCC was locked up this morning:
  TRGMON said: 0.00 Hz, Paused, Stale, with the expected TWB's
  (Paused, Stale, VME saw no L1 activity).
  DIR D0HTCC::[TRIGGER]*.* gave a normal response
  READLOG LAST 100 BLOCKS got as far as "asking TRICS to flush records to
  file" but no further.

  These are the tail ends of the TRICS and LOG_SERVER logfiles:

  S-15C/HDL% Preparing Params for L1.5 CT Crate
  %% time: 11-NOV-1994 09:23:24.53
  S-15C/HDL% Copying Params to L1.5 CT Crate
  %% time: 11-NOV-1994 09:23:24.60
  S-EXC/MBX% Flush_to_File now Servicing Exception Mailbox
  %% time: 11-NOV-1994 09:23:31.11
  X-DSP/EXC%2203468%PAS-F-FILALRACT, file already active
  %% time: 11-NOV-1994 09:23:30.85
  X-DSP/EXC%Skipping
  %% time: 11-NOV-1994 09:23:30.85
  S-EXC/MBX% Exception Mailbox now empty
  %% time: 11-NOV-1994 09:23:31.48
  TRICS V6.3 CLOSED LOGFILE, DUA0:[TRIGGER]TRICS_30OCT94.LOG
  %% time: 11-NOV-1994 09:23:31.48
  C-RCV/CH2% 1:26 %00000001 PHAT CLOSELOG
  %% time: 11-NOV-1994 09:42:47.84
  TRICS V6.3 CLOSED LOGFILE, DUA0:[TRIGGER]TRICS_30OCT94.LOG
  %% time: 11-NOV-1994 09:43:31.59
  --------------END-OF-LOG-FILE------------------

  I-LOG/SRV% Log Server Closed DUA0:[TRIGGER]TRICS_30OCT94.LOG;
  %% time: 10-NOV-1994 22:40:09.93
  TRICS V6.3 CLOSED LOGFILE, DUA0:[TRIGGER]LOG_SERVER_30OCT94.LOG
  %% time: 10-NOV-1994 22:43:24.70
  E-EXC/MBX% Message Mailbox is Full but Not Signaled
  %% time: 11-NOV-1994 09:23:31.02
  S-EXC/MBX% Flush_to_File now Servicing Exception Mailbox
  %% time: 11-NOV-1994 09:23:31.29
  X-WAI/CNS%flush% Console Locked for 5s, Recover: Force Unlock
  %% time: 11-NOV-1994 09:23:30.85
  S-EXC/MBX% Exception Mailbox now empty
  %% time: 11-NOV-1994 09:23:31.54
  TRICS V6.3 CLOSED LOGFILE, DUA0:[TRIGGER]LOG_SERVER_30OCT94.LOG
  %% time: 11-NOV-1994 09:23:31.54
  --------------END-OF-LOG-FILE------------------

  The last message in the MAIL_SERVER logfile was for sending mail about
  this last initialize, and closing the logfile.

  The last message in the MPOOL_SERVER logfile was from 7:04, which is a bit
  old, but not inconceivable. There are supposed to be 2 MPOOL_SERVER
  logfiles (for the 2 servers running in parallel).
  I just realized that there is a problem with this, because they have the
  same name and just different revision numbers. The first one (the old
  trgmon) only had a few, old records. There is no problem in opening a new
  logfile, but when the logfile is closed by the flush_to_logfile process
  for that job, the file is later reopened by name, and the 2 jobs fight for
  the same file. There might be long time gaps between the time the file is
  closed and the time it is reopened. One of the jobs loses when it cannot
  open the file for write access and just gives up on writing to a logfile.
  The other one goes on. It is probably a matter of luck that decides who
  wins, with preference to the second server that starts with the file
  already opened.

  TRICS was in the process of servicing an initialization request from COOR.
  The next message after "Copying Params to L1.5 CT Crate" would have been
  "Starting L1.5 CT Crate", about 7s later. This is right when we see the
  "file already active" message. It seems like TRICS was just trying to put
  out this message to the screen, and this is what caused the file access
  conflict.

  We see in the LOG_SERVER logfile that its Flush_to_File process (which
  wakes up every 3 min to see if the logfile needs closing, if the console
  stays locked for more than 5s, or if the mailbox needs servicing) had been
  waiting for the console lock for more than 5s and forced it unlocked,
  which was just at the time when TRICS was also waiting/ready to write its
  "Starting L1.5 CT Crate" message.

  It sounds like some third job must have just been in the process of
  writing to the screen at that time, which caused the "file already active"
  error. Maybe that third process was interrupted halfway while writing to
  the screen, and control was passed to the TRICS dispatcher process which
  was then busy using the full CPU for 7s during which the console stayed
  unlocked. There is no trace of another message in other logfiles, but we
  couldn't see one of the MPOOL_Servers.
  This is only a hypothesis, but it would probably be wise to increase the
  timeout on the flush_to_logfile process for waiting on a locked console
  from 5 to 15 or 30s.

  Another possibility is that someone bumped the hold screen button on the
  TCC keyboard.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
21-OCT-1994 Philippe:
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*SEP94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
  but couldn't do MSUD02::DUA0:[TCC_LOG_IB]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG from August
- delete TRICS logfiles from September
- delete MPOOL*.LOG, LOG*.LOG, MAIL*.LOG from July
  but keep ones from September until next month (deleted 21-OCT-1994)
- Purge all files
- [TRIGGER] now uses 53,000 blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
19-OCT-1994 Philippe:
- build New System EWORK1:TRICS_V63.SYS_19OCT94;1, loaded
  has the "old MENU" upgraded to address any of the 4 CBUSs, in particular
  the items #1 write, #2 read and #13 write register step
- build New System EWORK1:TRICS_V63.SYS_19OCT94;2, loaded
  fixed bug (was missing the upgraded MOD059_MENU_IO_HANDLING.PAS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
13-14-OCT-1994 Philippe:
- build New System EWORK1:TRICS_V63.SYS_13OCT94;1
  add messages PHAT READHIST and PHAT SHOW_REG
  the READHIST message calls a special CBUS cycle that does multiple reads
  on the same register address. The address is selected, and the data is
  read multiple times, a few microseconds apart, and histogrammed.
- build New System EWORK1:TRICS_V63.SYS_14OCT94;1
  fix (forgot to zero histogram)
- build New System EWORK1:TRICS_V63.SYS_14OCT94;2
  remove 9999 limit on histogram sample size
- This system was left in.
- Backup Directory MSU::EWORK1: was updated
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10-12-OCT-1994 Philippe:
- build New System EWORK1:TRICS_V63.SYS_10OCT94;1 (never loaded)
  fix problem of interference with L1 FW
  add logic to cope with the clipped coverage at eta>16
  and missing T1 EM and HD CAT2
- build New System EWORK1:TRICS_V63.SYS_12OCT94;1 (never loaded)
  deselect the MBA at the end of a CBUS cycle
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6-7-OCT-1994 Philippe:
- Build New System EWORK1:TRICS_V63.SYS_6OCT94;1
  This version better incorporates the large tile tests
- cf. D0_HALL_LOGBOOK.LBK for details.
- Build New System EWORK1:TRICS_V63.SYS_6OCT94;2
  This version reads the andor IMLROs twice in a row, and reads the other
  andor backplane for comparison.
- Bug in diagnostics code interferes with L1 FW,
  thus reload ETRICS:TRICS_V62.SYS_7SEP94
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
23-SEP-1994 Philippe:
- Move TRICS V6.2 from EWORK1: to ETRICS:
- Build New System V6.3 in EWORK1: TRICS_V63.SYS_23SEP94
  (new additions made to random test to check the large tiles up to the
  andor network).
  This version also has additional "progress report" messages in CHTCR and
  CTFE lookup PROM tests to show which PROM is being checked.
  The CTFE PROM test spends 56.15 s/page for 64 towers (but the remote
  console was on [at MSU!], and this may slow things down) which is about
  20 % slower (was 23.7 s/page for 32 towers). This test should be redone
  without the remote console to decide if this progress report is worth the
  slow down.
- cf. D0_HALL_LOGBOOK.LBK for details.
- V6.3 diagnostics code is not yet stable/useful,
  thus reload ETRICS:TRICS_V62.SYS_7SEP94
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
22-SEP-1994 Philippe:
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*AUG94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
  but couldn't do MSUD02::DUA0:[TCC_LOG_IB]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG from August
- delete TRICS logfiles from August
- delete MPOOL*.LOG, LOG*.LOG, MAIL*.LOG from July
  but keep ones from August until next month (deleted 25-AUG-1994)
- Purge all files
- [TRIGGER] now uses 38,000 blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7-SEP-1994 Philippe:
- New system is EWORK1:TRICS_V62.SYS_7SEP94;1
  1) TCC overwrites the 68k Dual Port Mem with %XFF at Load code & start crate
  2) TRICS expects to see all 32 L1.5 FW Terms
- backup new files in DZERO::EWORK1: to MSU::EWORK1:
- system loaded 8-SEP-1994
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
24-AUG-1994 Philippe:
- New system is EWORK1:TRICS_V62.SYS_24AUG94;1
  Allow Com File Code to properly service re-entrant requests
  Send "alert" mail msg to MSU whenever TRICS answers "BAD" to COOR INIT
  17:44 - Dan loads system, which complains about bad parameters...
  17:59 - Dan returns to system from 19-AUG
- New system is EWORK1:TRICS_V62.SYS_24AUG94;2
  SITE_DEPENDENT.CST had "lost" the L1.5 CT upgrade to allow 4 terms.
  22:18 - Dan loads system again, ok now.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
19-AUG-1994 Philippe:
- New system is EWORK1:TRICS_V62.SYS_19AUG94;1
  Four L1.5 CT terms are allowed [0..3]
  (raise the maximum L1.5 CT Term Number from 0 to 3 in SITE_DEPENDENT.CST)
  never meant to be loaded
- New system is EWORK1:TRICS_V62.SYS_19AUG94;2
  Execute L15CT_DEFAULT_CONFIG.DAT as last step of COOR's LOADCODE msg
  Implement keyword FROM_LOCAL_DISK for msg LOADCODE to load executables
  from local disk D0HTCC::[L15CT$EXEC].
  11:46 - Dan Loads system and quickly notices problem.
- New system is EWORK1:TRICS_V62.SYS_19AUG94;3
  remove (third time) overwriting of 68k Dual Port Memory
  12:04 - Dan loads system
- New system is EWORK1:TRICS_V62.SYS_19AUG94;4
  Philippe noticed that bug was re-introduced because it wasn't propagated
  back to MSU on 6-AUG: missing return status from 3 routines in L1.5 CT
  that cause errors at initialize in TRICS_INIT_AUXI_L15CT
  14:04 - Dan loads system, now ok
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11-AUG-1994 Philippe:
- 20:27 - Dan loads EWORK1:TRICS_V62.SYS_10AUG94;1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10-AUG-1994 Philippe:
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*JUN94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
  and TRICS*JUL94.LOG
  but couldn't do MSUD02::DUA0:[TCC_LOG_IB]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG from June or July
- delete TRICS logfiles from June and July
- delete MPOOL*.LOG, LOG*.LOG, MAIL*.LOG from June,
  but keep ones from July until next month (deleted 25-AUG-1994)
- Purge LSM_ZEBRA.LOG
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10-AUG-1994 Philippe:
- 12:36 - Dan loads EWORK1:TRICS_V62.SYS_6AUG94;1 and notices problem:
  TCC is again overwriting 68k Dual Port Memory, but 68k not yet upgraded.
- 14:07 - Dan returns to EWORK1:TRICS_V61.SYS_27JUL94
- New system EWORK1:TRICS_V62.SYS_10AUG94
  remove overwriting of 68k dual port memory
  Not loaded until 11-AUG
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6-AUG-1994 Philippe:
- Dan had L1.5 CT initialization messages in TRICS_INIT_AUXI_L15CT.DAT
  and noticed errors when executing "Preparing Params", and "Copying Params".
  The error only occurs from the command file, not from TRICS_ACCESS or COOR.
  The problem was a status argument in a couple of routines that was never
  explicitly written, and kept whatever random value it had.
  The random values were ok for outside messages, not for command file
  messages.
- New system is EWORK1:TRICS_V62.SYS_6AUG94
  Fix (missing) return status from Preparing and Copying Params to L15CT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5-AUG-1994 Philippe:
- Archive TRICS V6.0 to MSUD01::DUA1:[ARCHIVE.TRICS_V60]
  done from MSUD01:: using 000_ARCHIVE.V60
- Save TRICS V6.1 from DZERO::EWORK1 to DZERO::ETRICS
  done by swapping directory file names
- Backup V6.1 from DZERO::ETRICS to MSU::ETRICS
  done from MSU
  $ @COPY_TRICS ALL DZERO::[TRG_TARGET.SOURCE_TRICS] ETRICS:
- Build TRICS V6.2 in EWORK1: starting from V6.1 files from ETRICS
  run @CHANGE_VERSION_NUMBER.com 6.1 -> 6.2
  copy files
    MOD071_DEF_HARDWARE_TABLES.PAS (add ERPB-MTG)
    MOD123_INIT_CBUS_CARDS.PAS     (add ERPB-MTG)
    MOD171_PARSE_GLOBAL.PAS        (use OTS$CVT_T_F)
    SITE_DEPENDENT.CST*            (new V6.2, fix scaler_recover_dir)
    TRICS_V62.PAS to remove the kludge from 27-JUL-1994
  - note that the kludge in MOD100_HANDLE_L15CT.PAS to skip painting FF's
    in 68k dual port memory is still there (WRONG! see 10-AUG)
  $ MMS/SKIP
  New system is EWORK1:TRICS_V62.SYS_5AUG94
- Backup V6.2 from DZERO::EWORK1 to MSU::EWORK1
  done from MSU
  $ @COPY_TRICS ALL DZERO::[TRG_TARGET.SOURCE_1WORK] EWORK1:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
27-JUL-1994 Philippe:
- Temporarily modify MOD100_HANDLE_L15CT.PAS, skip painting FF in the 68k
  dual port. The proper long term action needs to be thought of.
- Temporarily change TRICS_V61.PAS to add a ":" between LOGGER$BRD and
  TCC_BOOT_ddmmmyy.INFO
- There is a new system in EWORK1:TRICS_V61.SYS_27JUL94;2,
  older V6 systems were deleted
- This new system file was successfully loaded
- These temporary kludges were NOT copied to MSU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
27-JUL-1994 Philippe:
- Add code to display L1.5CT 68k cycle scalers (big N, little n,...).
  There is a message to read the scalers any time:
    L15CTSYS 68K_CNT CRATE(0)
  This is also executed automatically at Init and Load code, along with
  68k errors and 68k flags.
- Update L1.5CT 68K errors display with the latest addition:
  count of byte misalignment errors in Object Lists.
- There is a new system in EWORK1:TRICS_V61.SYS_27JUL94;1
- Backup V6.0 DZERO::ETRICS: to MSU::ETRICS: (using MSU::EWORK1: from 7-JUL)
- Backup V6.1 DZERO::EWORK1: to MSU::EWORK1:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
26-JUL-1994 Philippe:
- Fixed 2 reasons why the "watch double buffer" process was not kicking in
  (that gave the 50% chance to have read/write_A_B start off wrong).
- Also fixed an alignment problem in the shared memory space
  (that was causing all weird boot status, wake up words,...).
- There is a new system in EWORK1:TRICS_V61.SYS_26JUL94 (never loaded)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
22-JUL-1994 Philippe:
- Dan Loads new code TRICS_V61.SYS_21JUL94
- Problem with "watch double buffer" task obviously not doing its job.
- Problem with DSP status
- Return to old V5.3 code
- L1C coverage was "clipped down" to eta = 3.2 (TT_Eta=16)
    Global Missing Et
    Global EM Et  \
    Global HD Et  / and thus Global Total Et
  Tower Counts and Large Tiles are still using full coverage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
21-JUL-1994 Philippe:
1) Move to Version V6.1
2) Add repeating screen message at boot time in case of pQBA or pVBA
   hardware problem detected at boot time (e.g. power off).
3) At Initialize, the ZRL cards are reset, and reloaded. The DRV11Js too.
4) At boot time, clarify the various screen messages related to
   ZRL/pVBA/pQBA/DRV
5) At initialize the 68k is now "parked" before the DSPs are reset
6) At initialize, and at loadcode, the 68k error counters and flags are read
   and put in the logfile.
   (Your latest COM_PORT error counter isn't in yet).
7) Execute file TRICS_INIT_AUXI_L15CT.DAT right after TRICS_INIT_AUXI.DAT
   This is done at boot time, and at initialize (note: missing file is not
   fatal)
   Dan, please start a new file with the ERPB MTG stuff at your convenience.
8) add messages
     L15CTSYS DSP_STAT   check all DSP status
     L15CTSYS 68K_CTRL   check 68k control words (wake up + transfer words)
     L15CTSYS 68K_STAT   check 68k status
     L15CTSYS 68K_ERR    check 68k run-time error counters
     L15CTSYS 68K_FLAG   Check 68k software flags
     L15CTSYS VER_TSEL   Verify Term Select Paddle Board Memory
   These messages aren't in TRICS_ACCESS yet. Use:
     $ @ EENV:COMMANDS
     $ SEND_TRICS L15CTSYS DSP_STAT CRATE(0)
   Note that they ALL need the CRATE argument.
9) add message PHAT READPVBA to read pVBA control registers.
10) L1.5 CT Crate_ID and Term_Num are range checked against the current
    implementation (must be 0 and 0)
11) remove integer decoding in error message at "start L1.5CT" of
    local/global/frame 'param out of range' messages, display Hex only.
    (TCC doesn't know the data type)
12) Fix bug, there were two CLOSE(initfile) statements in INIT_AUXI service.
    There is a small chance that this could have been causing our TCC/Disk
    hangup problems.
13) Read all scalers, as a recovery for end run/store in case of TCC
    crash/reboot.
    Done After initializing the ZRL, DRV11j...
         After Boot_Auxi because this is where the scaler list is defined
         But before any register is initialized.
    File is LOGGER$BRD:TCC_BOOT_ddmmmyy.INFO
- There is a new system in EWORK1:TRICS_V61.SYS_21JUL94
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
20-JUL-1994 Philippe:
- Message from Oscar Ramirez about a problem during the PR_TRIG run summary
  command file that calls TRGMON
    Hi, I don't know if this is important but in any way I want to send
    you the following trace back of the PR_TRIG COMMAND for the
    TRIGMON DISPLAY.
    when I pick up the print out there was only a couple of sheets
    that came from the file SUPDUMP.TXT and the triggers bits info is
    missing.
    DJOKO AND RAMIREZ THE SHIFTERS
    P.S. After try again everything seems to be O.K. ie, I got the
    trgmon_dump.txt file printed

    >>> Doing TRIGGERS
    %SYSTEM-F-FILNOTACC, file not accessed on channel
    %TRACE-F-NOMSG, Message number 0009804C
    module name     routine name     line   rel PC     abs PC
    MESSAGE_XFER    RECORD_CALL       527   00000190   000035BC
                                            80892B55   80892B55
    CONNECT         ITC_DISCONNECT   1033   000000C1   00000FAD
                                            00000850   00000850
- reply is:
    I don't know what happened, this looks to me like an ITC link problem.
    There is not much I can do right now, especially if this was an
    isolated problem. I will enter this information in our records for
    further reference.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12-JUL-1994 Philippe:
- TCC became messed up again, in a manner similar to the time (25-MAY-1994)
  when there were network problems and many level 2 nodes got hung.
  I don't know what caused today's problem.
  TCC could no longer access its disk, with the same weird error messages.
  I tried changing the boot directory to using the host, on the fly, from
  the begin_end_run task
    Edebug 8,12>dep mod_common_global_flags\boot_directory_name = '57.3"TRGUSER TRGGER"::TRGCUR:'
  But the begin_end_run task was stuck in the CLOSE statement in
  close_auxi_file for the file 'TRICS_FORCE_BUF_UPDATE.DAT', and I couldn't
  get it un-stuck anyway.
  (also notice a bug: the close statement appears twice in the routine.
  That bug has been there since the beginning of the service, but the task
  was stuck in the first call, not the second).
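
  Note on the duplicated close statement mentioned above: a minimal sketch,
  in Python rather than the TRICS Pascal source, of arranging a
  file-servicing routine so that the close happens in exactly one place.
  The class TrackedFile and both routines are hypothetical stand-ins. The
  stand-in raises on a second close only to make the duplicate visible; the
  log does not say how the ELN run-time reacts to one.

```python
class TrackedFile:
    """Hypothetical stand-in for an auxiliary file; counts close calls."""

    def __init__(self, name):
        self.name = name
        self.is_open = True
        self.close_calls = 0

    def close(self):
        self.close_calls += 1
        if not self.is_open:
            # Assumption for illustration only: a second close is an error.
            raise RuntimeError(f"{self.name}: close called on a closed file")
        self.is_open = False


def service_with_duplicate_close(f):
    """Shape of the bug: the close statement appears twice."""
    f.close()
    f.close()


def service_with_single_close(f, process=lambda f: None):
    """Close in one place only, even if the processing step fails."""
    try:
        process(f)
    finally:
        f.close()


good = TrackedFile("TRICS_FORCE_BUF_UPDATE.DAT")
service_with_single_close(good)
print(good.close_calls)
```

  This only removes the duplicate. It would not by itself explain a task
  hanging inside the first close, which is what was observed here.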
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
7-JUL-1994 Philippe:
- DZERO::EWORK2: moved to DZERO::ETRICS (see new policy on top of file)
- Backed up V6.0 from DZERO::EWORK1: to MSU::EWORK1:
- Note that V5.3 DZERO::ETRICS: was not backed up to MSU:: because it is
  obsolete, and EWORK1: is already stable.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
28-JUN-1994 Philippe:
- Problems from last week (below) were due to a bug in the ZRL Interrupt
  Service Routine MACRO Dual_ISR, where the register R5 was not saved.
  New system is ework1:TRICS_V60.SYS_28JUN94;1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
27-JUN-1994 Philippe:
- Send message to Norm Amos on how to recover most of the luminosity data
  after a TCC reboot during a store.

    From: MSUPA::LAURENS "Philippe" 27-JUN-1994 10:56:35.25
    To:   FNAL::AMOS
    CC:   FNBIT::D0::JGUIDA,laurens,edmunds
    Subj: RE: recovering lost luminosity info

    Norm,

    We had TCC problems during a store last week, here is what I think is
    the method that would recover the most information. I don't expect you
    will find anything surprising here, and you might have already figured
    it out for yourself in earlier instances of TCC crash.

    For the missing end of run file (the run during which TCC rebooted),
    I would simply take the last pause/resume run file from before the
    crash, and maybe extrapolate for the actual end of the run.

    The other half of the problem is that all scalers have been reset
    between the begin and end of store. I would simply take the end of
    store file (which is from after the reboot) and correct each scaler by
    adding the values from the last pause/resume file (this is from right
    before the reboot).

    Philippe
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
24-JUN-1994 Philippe: another crash, but now in user process.
Job 10, process 1, program MPOOL_SERVER raised exception.
%SYSTEM-F-ACCVIO, access violation, reason mask=04, virtual address=00000004,
PC=00003C9E, PSL=03C000B4
Module MOD_MPOOL_SERVER
  1612:   FOR vetonum := 0 TO 6
>>1613:   DO base_address^.sptrgcnt[sptrgnum].stvetos[vetonum] :=
  1614:        mpool_rec^.current.sptrgcnt[sptrgnum].stvetos[vetonum]
  1615:      - mpool_rec^.previous.sptrgcnt[sptrgnum].stvetos[vetonum] ;
  1616:
  1617:   base_address^.sptrgcnt[sptrgnum].L15_incr.st_confirm :=
--Edebug 10,1>sho call
Module name       Routine or Psect name   Line   Rel PC     Abs PC
MOD_MPOOL_SERVER  READ_MPOOL_ST_GS        1613   0000016C   00003C9E
MOD_MPOOL_SERVER  SERVICE_REQUEST         1357   000000DD   00002E65
MOD_MPOOL_SERVER  MPOOL_SERVER            1286   00000459   00002C84
                                                 00000000   800049A7
Edebug 10,1>ex vetonum
VETONUM: 2 (00000002)
Edebug 10,1>ex/inst %Line 1612
%Line 1612 + 0000: MOVL #00,-08(FP)
Edebug 10,1>e/i
%Line 1613 + 0000: MULL3 #00000060,-0C(FP),R3
Edebug 10,1>e/i
%Line 1613 + 0009: MULL3 #04,-08(FP),R2
Edebug 10,1>e/i
%Line 1613 + 000E: ADDL2 R2,R3
Edebug 10,1>e/i
%Line 1613 + 0011: MOVL -10(FP),R2
Edebug 10,1>e/i
%Line 1613 + 0015: MOVAB 0B0C(R2)[R3],R5
Edebug 10,1>e/i
%Line 1613 + 001B: MULL3 #00000050,-0C(FP),R3
Edebug 10,1>ex r5
R5: 4 (00000004)
Edebug 10,1>ex r3
R3: 2328 (00000918)
Edebug 10,1>ex r2
R2: 0 (00000000)
Edebug 10,1>ex fp
FP: 2147480072 (7FFFF208)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
22-23-JUN-1994 Philippe:
- Install new Double Port "Stereo" ZRL VAXStation 4000, with new "baby" VNE
  backplane in BA23 enclosure.
  New TCC first powered at 14:00,
  Booted with new Code for Dual Interface, and with the new store at 14:42
  22-jun-94
  TCC crashed at about 22:28 with an Access violation
    800066DC  MTPR 50(R5),#10   (Move To Processor Register)
              LDPCTX            (Load Process Context)
              REI               (Return from Interrupt)
  This Absolute PC address is part of the shareable image KER$SCHEDULE_JOB
  (as listed in ELN$:4NNKER.MAP)
  With R5 = 00000004 this is not a legal address.
  Call Stack:
    Absolute PC  800066DC
                 7FFFFD4C
                 7FFFFDBC
    R0           General register 0
    .
    .
    R11          General register 11
    R12 or AP    General register 12 or argument pointer.
    FP           Frame pointer
    SP           Stack pointer
    PC           Program counter
- TCC is rebooted, crashes again at about 16:10 23-JUN-94
  again with an Access violation
    800060EF  MOVZBL 0A(R1),14(R5)   (Move Zero-Extended Byte to LongWord)
  This is part of the shareable image KER$UNWAIT
  With R1 = 80441680
       R5 = 00000004
  this is not a legal address, and by coincidence (?) it is the same
  processor register, and the same bad content.
  Call Stack now
    800060EF
    7FFFFB84
    7FFFFBDC
    7FFFFC34
    7FFFFC98
    7FFFFD28
    7FFFFD7C
    7FFFFDBC
- Boot the old code in the new dual ZRL interface box, to try to get an idea
  if this is a hardware problem with the dual box, or a software problem
  with the new code.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
13-JUN-1994 Philippe:
- Investigate cause of "Error during Reset COMINT file"
    11-JUN-1994  18:18:34.48
                 18:19:38.61
                 18:20:43.17
                 18:21:47.47
                 18:22:51.92
                 18:23:56.41
                 18:25:00.77
                 18:26:05.23
  Then Initialize
    12-JUN-1994  08:19:38.60
                 09:13:01.68
                 09:31:15.32
  and Boot at
    12-JUN-1994  09:36:27.23
  These messages were generated when TRICS was not able to access the disk
  (RESET COMINT is in TRICS_RESET_DIRECTIVES.DAT)
  The last disk accesses were the regular flush buffer to file, at
    LOG_SERVER_26MAY94.LOG;2        4/10        31-MAY-1994 09:05:19.00
    MAIL_SERVER_26MAY94.LOG;2       57/60       11-JUN-1994 12:05:30.00
    MPOOL_SERVER_26MAY94.LOG;2      4506/4510   11-JUN-1994 14:29:33.00
    TRICS_26MAY94.LOG;2             25721/25730 11-JUN-1994 13:43:34.00
  I imagine we had either a disk failure that caused the ELN disk driver job
  to quit (remember that ELN has no recovery ability), or a re-occurrence of
  a system/network problem like on 25-MAY-1994 evening.
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*MAY94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
  and to MSUD02::DUA0:[TCC_LOG_IB]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete ALL [TRIGGER] logfiles from May
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9-JUN-1994 Philippe:
- MSUD02::DUA0:[TCC_LOG_IA] and MSUD02::DUA1:[TCC_LOG_IA] have the logfiles
  from run Ia. They no longer are on MSUD01::
- MSUD02::DUA0:[TCC_LOG_IB] has a copy of MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
  It must be updated at the same time.
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*APR94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
  and to MSUD02::DUA0:[TCC_LOG_IB]
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete ALL [TRIGGER] logfiles from April
- [TRIGGER] now uses 68,000 blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
31-MAY-1994 Philippe:
- check logfile for source of MAIL error message
    Date: 30-MAY-1994 12:24:47.45
    Subj: TRICS V5.3/26-MAY-1994/ 1 Errors Initializing Framework Registers
  TRICS_26MAY94.LOG had
    %% time: 30-MAY-1994 11:25:55.77
    S-INI/HDB%COORini% Initializing All Framework Registers
    E-HIO/HDB% Failure Writing 15 @ cbus 2 mba 129 ca 16 fa 3 read 11
  then following in the same initialize,
    S-INI/ODB%COORini% Initializing all Specific Triggers
    E-HIO/HDB%COORini% Previously 15 instead of 11 @ cbus 2 mba 129 ca 16 fa 3
  This is an ANDOR card that has shown this behavior for a long time.
  This is probably nothing to panic about, as the correct value was probably
  programmed, and only the immediate read back failed.
  But also next initialize had no error at init-all-fw-reg, but
    %% time: 30-MAY-1994 14:40:48.12
    S-INI/ODB%COORini% Initializing all Specific Triggers
    E-HIO/HDB%COORini% Previously 11 instead of 15 @ cbus 2 mba 129 ca 4 fa 9
  Hopefully this was only a read back problem, and the proper value was
  stored and not lost.
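
  Note on the readback values above: 15 and 11 differ in a single bit. A
  minimal sketch in Python (not part of TRICS; the helper name
  differing_bits and the 8-bit default mask are assumptions made for this
  illustration) that lists the bit positions where a written value and its
  readback disagree:

```python
def differing_bits(expected, actual, mask=0xFF):
    """Return the bit positions (0 = least significant) where the two
    values disagree, restricted to the bits selected by mask."""
    diff = (expected ^ actual) & mask
    return [bit for bit in range(mask.bit_length()) if diff & (1 << bit)]


# Values from the 30-MAY-1994 messages: wrote 15, read back 11.
print(differing_bits(15, 11))  # prints [2]
```

  Both 30-MAY messages (15 vs 11 and 11 vs 15) involve the same pair of
  values, so the same bit (value 4) is in question each time.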
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
26-MAY-1994 Philippe:
Jan called this morning, here are the reconstructed facts:
1) 15mn after Dan left last night, we had a water drip trip.
2) They were probably not using it because they only called this morning.
3) Jan was there, resetting the RMI and the RPSS worked.
4) From TRICS logfiles, it was clear there was a global communication
   problem, and symptoms with assistant CBUS and every write reads zero.
5) I had her check there was power and LED's in M114.
   There were only a few LEDs in M102, M103, and no Beam X LED's
6) asked Jan to turn off the BA23 and the 4000, count to 60 and turn them
   back on.
7) Everything is back to normal.
This is the "standard" way the 4000/ZRL can get hosed: having M114 powered
off for extended periods of time.
- In the process, I noticed error messages at the beginning of the logfile
  from this morning and from last night (i.e. the new system) that look like
  TRICS failed all the BOOT_AUXI messages defining the end of run scalers.
  So I took the opportunity to restore the earlier system. It is not clear
  to me what went wrong, as I changed nothing that could do this (that's
  what I thought, but obviously, I am wrong).
  8-JUN-1994 update: Yes, I was wrong. The bug was in
  MOD171_PARSE_GLOBAL.PAS, see TRICS.LBK
  24-JUN-1994 update: in EWORK2: delete the files
    DESCRIP.MMS
    MOD171_PARSE_GLOBAL.PAS
    MOD247_L15CT_DISPATCH.PAS
    MOD263_SOFT_CONN_DISPATCH.PAS
    TRICS_V53.EXE and TRICS_V53.SYS_12MAY94
  restore correct versions of MOD171_PARSE_GLOBAL.PAS and
  MOD263_SOFT_CONN_DISPATCH.PAS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
25-MAY-1994 Philippe:
- Coor had a time out waiting for an Acknowledge from TCC at end of run time.
  Read log gets the following:
    Looking for Current Log File
    %DIRECT-E-OPENIN, error opening D0HTCC::DUA0:[TRIGGER]TRICS_*.LOG; as input
    -RMS-F-NET, network operation failed at remote node; DAP code = 01F77C54
    $ dirs d0htcc::dua0:[trigger]
    %DIRECT-E-OPENIN, error opening D0HTCC::DUA0:[TRIGGER]*.*;* as input
    -RMS-F-NET, network operation failed at remote node; DAP code = 01F77C54
  TRICS can talk to TCC OK e.g. Dan can do a read reg ok.
  Philippe didn't learn anything, and has no idea of what is going on.
  Philippe doesn't believe it is a problem with our disk.
  None of TRICS' process/jobs seems to be in trouble, including the disk
  driver.
  From Set host TCC, ECL> DIR complained about "network timeout".
  What could it possibly need the network for (name service ?!?)
  Jan says that 12 L2 nodes have died with some strange network problem.
  This remains a mystery.
- Later on, Dan slides in new system that accepts L1.5 CalTrig messages
  (but takes no action) EWORK2:TRICS_V53.SYS_12MAY94
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12-MAY-1994 Philippe:
- build new system that accepts L1.5 CalTrig messages (but takes no action)
  new files are
    MOD171_PARSE_GLOBAL.PAS (to accept more than 8 consecutive blanks)
    MOD263_SOFT_CONN_DISPATCH.PAS \ to admit L15CT messages
    MOD247_L15CT_DISPATCH.PAS     /
    and DESCRIP.MMS
  new system is EWORK2:TRICS_V53.SYS_12MAY94
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
22-APR-1994 Philippe:
- Clean D0HTCC::[TRIGGER] directory.
- copy TRICS*FEB94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
- copy TRICS*MAR94.LOG to MSUD01::DUA1:[BACKUP.TCC_LOG_Ib]
- files TRICS*JAN94.LOG and TRICS*DEC93.LOG had already been copied
- did NOT copy MPOOL*.LOG, LOG*.LOG, MAIL*.LOG
- delete ALL logfiles from before April
- [TRIGGER] now uses 48,200 blocks
- Area MSUD01::DUA1:[BACKUP.LOGFILES_D0HALL] has been deleted.
  all files from run Ia have been copied to disks D0MSU2$DUA0: and
  D0MSU2$DUA1: in area [TCC_LOG_Ia] for archival.
- From now on, all new run Ib files copied to
  MSUD01::DUA1:[BACKUP.TCC_LOG_Ib] are also copied to D0MSU2$DUA0: and
  D0MSU2$DUA1: in area [TCC_LOG_Ib] for backup.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
28-MAR-1994 Philippe:
Control Room calls, they can no longer create new TRGMON windows.
It seems that it is only the TRGMON server that is in trouble, with some
kind of ITC problem. In order to learn more, I use edebug to see what D0HTCC
is doing.
TRGMON service is back to 100 % normal. I restarted the separate job that
serves the monitoring information to TRGMON. This will not have any effect
on the data taken in this run, nor will it affect the end of run luminosity
information.
TCC had simply reached the maximum number of 15 allowed ITC connections.
However I don't think this was real, that is I don't think there were 15
TRGMON running at the same time. I am suspicious that ITC is "sometimes"
forgetting to notice that a channel has been fully released and is available
for re-use -- Maybe this happens when a host node crashes, as we had a few
times last week -- I will investigate some more tomorrow morning.
While investigating this problem, I noticed weird logfile content, and the
bug that is causing it:
  the logfile shows a series (every 5s) of
    Channel #11 ... (ReConnecting)
  the variable message_cnt is declared 1..10 while the maximum number of ITC
  channels is now set to 15 in [.itc.inc]ITC_CONFIG.INC.
I don't think this is related to the current problem, but this was clearly
overwriting something else...
Update 1-APR-1994 : investigating in the old TCC Mpool_server logfiles, the
time where TCC/ITC started losing ITC connections coincides with the Edebug
session where I located the source of the MP_FOREIGN integer overflow.
This most likely is simply what triggered it.
Also note that the new logfile started 28-mar shows no sign of any channel
being lost. There doesn't seem to be an intrinsic problem here.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
25-MAR-1994 Philippe:
- Notice that the remote console displays some Mpool_server integer overflow
  message during the MP_foreign request.
  use Edebug to capture a few sets of mpool_rec^.current.foreign.scaler and
  mpool_rec^.previous.foreign.scaler to understand where this is coming
  from. It is simply that the first 8 scalers are not plugged in and read
  unpredictably.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
18-FEB-1994 Philippe:
- load new code that solves the problem
  "E-MPL/STL% Pilot captured Spy, but not Assistant"
  File EWORK2:TRICS_V53.SYS_18FEB94
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17-FEB-1994 Philippe:
- Load new code where the prom test reads global counts as signed integers
  and tracks tier#1 truncation.
  File EWORK2:TRICS_V53.SYS_17FEB94
  Run PROM tests on all PROMS, all pages
  Run over 5 Mega Loops of random test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10-FEB-1994 Philippe:
- The directory ETRICS at Dzero is now officially used to create a copy of
  the last stable code. I will be using it to back up the EWORKn directory
  before I start modifying the code for the next version of TRICS at Fermi.
  -> Etrics will keep the n-1 version of the TCC code and system.
- The directory ETRICS at MSU will now officially be used as a backup of the
  latest code and system from DZero.
- I will also keep a subdirectory [.tcc] for the files needed on TCC
- and a subdirectory [.trguser] for all the important files of the trguser
  account at DZero.
  (This is not a redundancy with respect to [trg_current.dzero], as the
  appropriate files will be archived with the corresponding version of
  TRICS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3-FEB-1994 Philippe:
- Install dual COMINT system.
  This is TRICS Version V5.3, and the code at DZero is in directory EWORK2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2-FEB-1994 Philippe:
- check TRICS_23JAN94.LOG logfile for source of mail message.
  One error while initializing framework.
    Failure Writing 15 @ cbus 0 mba 129 ca 11 fa 12 read 11
  but later
    Previously 15 instead of 11 @ cbus 0 mba 129 ca 11 fa 12
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
23-JAN-1994 Philippe:
- Control Room (Mike T.) calls: TCC unreachable from COOR and from TRGMON.
  Use set host/log to run edebug: many processes were stuck while writing
  to the console and 2 of the processes had noticed the system ran out of
  Pool.
    %PAS-F-ERRDURPUT, error during PUT
    -KERNEL-F-NO_POOL, no pool available
- The system parameter "Pool Blocks" was down at 2048, I don't know why.
  I made a new system with 10 k blocks.
  There is obviously still something consuming this resource.
  $ COPY DZERO::EWORK1:TRICS_V52.DAT MSU::EWORK3:*.DAT_DZERO
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14-JAN-1994 Philippe:
- modify fill_monit_pool to correct for the offset in Tier#3 energies copied
  into the data block.
  file name TRICS_V52.SYS_14JAN94. File loaded.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
13-JAN-1994 Philippe:
- we had another occurrence of ZRL/pQBA being screwed up without notifying
  TCC. This happened after a power glitch that tripped Level 1.
- make and load same system without Name Server enable
  file TRICS_V52.SYS_13JAN94
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12-JAN-1994 Philippe:
- Inspect logfile TRICS_08JAN94.LOG (up to 12-JAN-1994 11:11:42.07)
  Notice many occurrences of incorrect prescaler messages from ELNCON Ch#2,
  need to check with Dan to see who sent them
  one occurrence of bad register readout during sptrg initialization
    Previously 11 instead of 15 @ cbus 0 mba 129 ca 1 fa 28
- Inspect logfiles
    TRICS_02JAN94.LOG;1 (includes the pQBA problem noted earlier)
      (No pQBA device error was logged at the initial problem time.
      But the end of the logfile shows what probably was a manual
      power-cycling of the BA23. The logfile shows that the pQBA device was
      woken up and that TRICS reset it. There is also a successful write
      afterwards)
    TRICS_03JAN94.LOG;1
    TRICS_06JAN94.LOG;2
    TRICS_06JAN94.LOG;1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
..-JAN-1994 Philippe, while I was sick
- There has been one occurrence where TRICS would systematically read zeroes
  back from the CalTrig (through the pQBA). Power cycling both boxes cured
  it. Re-triggering TCC was probably enough. There had been (it is believed)
  a power glitch in the HV racks, and TCC/BA23 were still taking power from
  these racks at the time.
- There is a problem in Level 2 that, when started, kills the Level 2 nodes
  (and probably TCC) as they pass a corrupted table of the name server
  around each other. All nodes need to be rebooted then.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
28-DEC-1993 Philippe:
- inspect logfiles for closer analysis of earlier tests
    TRICS_21DEC93.LOG;2
    TRICS_21DEC93.LOG;1
    TRICS_22DEC93.LOG;1
  cf.
  D0_HALL_LOGBOOK.LBK
- also look in TRICS_22DEC93.LOG;2 but the file is not closed
  one init error instance of
    %% time: 25-DEC-1993 01:31:12.53
    Previously 11 instead of 15 @ cbus 0 mba 129 ca 7 fa 28
    (cont) 00001011 i/of 00001111 Msk= W 11111111 R 11111111, Writing 240
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
20-22-DEC-1993 Philippe:
- fix all known remaining problems in random tests
  truncation needs to watch Px/Py sign
- fix some of the ctfe prom test problems
  (signed Px/Py sums, initializing all tested cells between pages)
  but still some problems are left (reading signed results)
  cf. D0_HALL_LOGBOOK.LBK
- Temporary init error on one L15 digimem reg after radiator work
- Is the eta 17..20 tier#2 wiring different?
- Eta +/- 1..16 CHTCR PROMs were tested.
- Eta +/- 1..20 Phi 1..5 CTFE PROMs tested
- several Random Test Runs
- temporary 66 GeV EMEt traced to tier#3
- Run Find DAC on 17..20.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
16-DEC-1993 Philippe:
- We made a second run of Find DAC last night that wasn't interrupted by an
  INITIALIZE. The new values for 1..16 were loaded in INIT_DAC_BYTES.LSM
  The values for 17..20 were left unchanged
- Update TRICS_INIT_AUXI.DAT
  Change the location of the L0-L1 box from Tier#2 9..16 to Tier#2 17..20
  (i.e. from mba 209 to 249)
- propagate above change to
  USER1:[TRGUSER.DIRECT_TO_TCC]FORCE_L0_FAST_Z.MSG
  and copy to msu::hepe:[TRGUSER.DIRECT_TO_TCC]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15-DEC-1993 Philippe:
- copy MSU::LSMP$DATA:NEW_D0HTCC_FILE_FOR_L15CT.LSO
  to D0HTCC::[TRIGGER]LOOKUP_SYSTEM_MANAGER.ZEB, and TRGCUR:
  move TEMP_D0HTCC_FILE_L15CT.LSO to [TRG_CURRENT.OBSOLETE]
  move and rename old TRGCUR:LOOKUP_SYSTEM_MANAGER.ZEB
  to [.OBSOLETE]LOOKUP_SYSTEM_MANAGER.ZEB_PRE_L15CT
- modify TRICS_BOOT_AUXI.DAT
  Change method for shrinking eta coverage.
  It is required by TRICS that the towers in INIT_DAC_BYTES.LSM exactly
  match the coverage currently defined. Failure to interpret the file will
  cause TRICS to keep the default value 10 as DAC_BYTE. Also any tower not
  defined in INIT_DAC_BYTES.LSM keeps 10 as DAC_BYTE.
  This becomes inconvenient when we later want to restore a larger coverage.
  We now have a file with the pedestal for eta 1..20, but would like to
  limit the current coverage to 1..16.
  TRICS_BOOT_AUXI.DAT now first defines full coverage (or whatever coverage
  is appropriate for the INIT_DAC_BYTES.LSM), then loads INIT_DAC_BYTES.LSM,
  then defines the actual more limited coverage (if necessary).
- restore the 2-DEC-93 version of INIT_DAC_BYTES.LSM
- do trics ->Initialize.Trg.Twr and get
    C-RCV/CH2% 1:42 %00000009 INITIAL TRGTWR MAGN_ETA(1:16)
    E-HIO/HDB%C09/1: Previously 34 instead of 10 @ cbus 0 mba 201 ca 55 fa 4
    E-HIO/HDB% (cont) 00100010 i/of 00001010 Msk= W 11111111 R 11111111, Writing 3
    E-HIO/HDB%C09/1: Previously 41 instead of 45 @ cbus 1 mba 169 ca 21 fa 22
    E-HIO/HDB% (cont) 00101001 i/of 00101101 Msk= W 00111111 R 00111111, Writing 3
    C-ACK/CH2% 1:44 %00000E6D ACKNOW 00000009 OK DONE
  0 201 55  4 is CTFE (+9..12,12) ped dac control chan #3 i.e eta +11
  1 169 21 22 is CAT2 (+1..4,25..32) -Py comparator register #2 bit 1..6
  do it again, and get no complaint
- Fix TRICS tree offset computation and build new system.
  HD energy sums showed about -15 GeV.
  Also noticed that the corrections loaded in Tier 3 were the same for
  EM Et, EM L2, HD Et, HD L2.
- Fix TRICS Tree Browsing software to read CAT3 operands.
  The problem was with a register address shift while reading CAT3 operands
  Use Tree browsing to locate a problem with
    tier #1 cat2 for eta -5..-8, phi 1..8 was 255 too low
      and read phi 7 input as 0
      card replaced, problem gone
    and tier #1 cat2 for eta -5..-8, phi 17..24 was %X180 too high
      card replaced, problem gone
- Try to detect a problem that seemed to make Global EMEt read 16 GeV more
  than EM L2, even when all towers were excluded and both lookups locked on
  page 4. We excluded all EM towers, and set a specific trigger to require
  250 MeV of EM Et. The Andor rate stayed at zero.
- Extend this testing method to other quantities.
  Exclude all EM and HD trigger towers and verify that the andor rate stays
  at zero when one requires
    250 MeV of Tot Et
    or any tower above an EM refset of 250 MeV
    or any tower above a Tot refset of 500 MeV
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10-DEC-1993 Philippe:
- Notify John F and Jan G that new system is 4000 and now uses disk booting
- John updates permanent databases
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9-DEC-1993 Philippe:
- Move EWORK1 TRICS_V5.0 to ETRICS
- Fill EWORK1 with TRICS V5.2 for 4000 M60 code
- now use disk booting
- Change INITIALIZE_TIME.PAS
  Write D0HSC address in KER$GQ_HOST_ADDRESS[2] at DZero for Disk Boot
- Change MOD033_HANDLE_CONSOLE.PAS
  increase the wait for semaphore timeout to 15 seconds
- Modify COM:TRIGGER_NODE.COM to use the disk boot method everywhere
  (was doing this at MSU only). use [SYS0.SYSEXE]SYSBOOT.EXE
- Temporarily modify MOD052_TCS_IO_COMINT_HANDLING.PAS
  not to wait for Assistant COMINT on Pauses
- build system TRICS_V52.SYS_9DEC93
- New D0HTCC has ethernet hardware address 08-00-2B-34-EA-E5
  Old D0HTCC had ethernet hardware address 08-00-2B-07-04-85
- Load a simple system with FAL only
  $ COPY EENV:INIT_DISK.SYS ESYS:TRGD0HTCC.SYS /OVER
  Note we do this by hand because I have modified the ELOAD command.
- Turn uVAXII off, and 4000M60 on. >>> B ESA We don't let it boot from the disk, otherwise it would wake up as MSUD03, address 46.193 - copy the environment files $ DELETE D0HTCC::[trigger]*.*.* $ COPY TRGCUR:*.* D0HTCC::[trigger] - copy the new system over $ CD EWORK1: $ ELOADHTCC TRICS_V52.SYS_9DEC93 y "New 4000 M60, TRICS V5.2, disk boot" I have switched the behavior of the ELOAD command; it now copies the system file to target::[SYS0.SYSEXE]SYSBOOT.EXE. Note that the system files are now different. They don't have the same (any?) header like the ones for downloading. - Reboot. Using ncp trigger node is ok because the disk boot is the default method. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 9-DEC-1993 Philippe: - uVAXII system locks up. use Edebug process flush_to_file error in display_console "file already active" message that was going to be displayed "mailbox full but not signaled" mailbox content was wai/cns%flush% console locked for 5s- recover 10:12:56 wai/cns% console locked for 5s 10:14:15 wai/cns% console locked for 5s 10:20:19 file already active 10:20:19 skipping 10:20:19 process dispatch was waiting for new message process refresh mpool was waiting in display console with some frog blinking data channel #1 was waiting for display_console with a pause message to process channel #2 was gone! channel #3 and #4 were not stoppable (i.e. waiting for elncon message) watch_double_buffer was waiting for its interrupt on port C begin/end_run was waiting for its cue. - Rebooting fails, initialize time is waiting inside the first WRITELN - Dan does a CTRL_Q, and the first message goes through, but things still screwed up, and initialize_time still waiting. - Reboot, and now everything ok. --> was it simply that someone (Dan) pushed the hold screen key? The message time was old, and it might have stayed un-noticed for a while --> Does it explain the weird symptoms from 23-NOV-1993?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3-DEC-1993 Philippe: - Dan Loads the new system. There is also a funny D0HTCC::[TRIGGER]LOOKUP_SYSTEM_MANAGER.ZEB with eta 5:8 having old HD PROM's and Dan edited Boot-Auxi to tell it eta 1:8 should be used. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2-DEC-1993 Philippe: - fix bug to readjust the Large Tile upper and lower range boundaries in the code for the message modifying the trigger tower range on the fly. - build new system EWORK1:TRICS_V50.SYS_2DEC93 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 23-NOV-1993 Philippe: - TCC crashed. Trace back shows MPOOL_SERVER with an error during screen IO %PAS-F-FILALRACT, file already active >>494: WRITELN ( message ) ; - Jan has some more information about additional (?) problems: After the crash we tried triggering TCC, but it didn't come back up. Then we tried rebooting it by pressing the halt button. That didn't work either. Then we tried power cycling it. No luck. We saw that it was stuck in self test #9. John found in a book, that 9 means it's having problems talking to its console. John then powered down TCC, then power cycle the terminal, then powered TCC back up. Then it was ok. We haven't had any problems since. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 22-NOV-1993 Philippe: - There has been 2 instances of TCC hanging. One reported by K.Johns at 12-NOV-1993 23:53:37.00, one reported by Jan at 19-NOV-1993 21:10:13.00. - Edebug shows a small number of Pool blocks. - build new system EWORK1:TRICS_V50.SYS_22NOV93;1 with 10,000 Pool blocks, and 1024 ports. 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5-NOV-1993 Philippe: - Dan loads TCC with EWORK1:TRICS_V50.SYS_3NOV93;1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3-NOV-1993 Philippe: - Copy TRICS_V50 code and build new system EWORK1:TRICS_V50.SYS_3NOV93;1 Large Tiles, ITC fix, MPt FMLN Programming ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3-NOV-1993 Philippe: - clean D0HTCC::[TRIGGER] directory. - delete ALL logfiles from October - the system was rebooted on TRICS_20OCT93.LOG;1 5738/5740 20-OCT-1993 18:01 TRICS_26OCT93.LOG;2 4386/4390 26-OCT-1993 14:30 TRICS_26OCT93.LOG;1 156/160 26-OCT-1993 14:18 TRICS_29OCT93.LOG;2 1303/1305 29-OCT-1993 17:30 TRICS_29OCT93.LOG;1 451/455 29-OCT-1993 16:57 - [TRIGGER] now uses 6,000 blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ......-1993 Philippe: - clean D0HTCC::[TRIGGER] directory. - delete ALL logfiles from August - [TRIGGER] now uses 18,000 blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 24-AUG-1993 Philippe: - clean D0HTCC::[TRIGGER] directory. - delete ALL logfiles from June and July - [TRIGGER] now uses 12,000 blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 9-JUN-1993 Philippe: - Jan G. turns Calorimeter Trigger OFF for rest of shutdown. TRICS_BOOT_AUXI.DAT updated to reduce coverage to minimum of 1 tower TT(eta,phi)= (+1,1) and limit the number of errors reported by TRICS. The same command was also sent to TRICS by hand. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5-JUN-1993 Philippe: - Power outage (snake in feeder line). Errors when TCC restarted while trigger was still off. Booted again after power restored. Everything ok.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2-JUN-1993 Philippe: - clean D0HTCC::[TRIGGER] directory. - copy TRICS*MAY*.LOG to MSUD01::DUA1:[BACKUP.LOGFILES_D0HALL] - also copy MPOOL*MAY*.LOG, LOG*MAY*.LOG, MAIL*MAY*.LOG - delete ALL logfiles from May - [TRIGGER] now uses 2,600 blocks - This is also an attempt to help the disk problems by using a different area of the disk, maybe on the outer perimeter of the disk, where the flux is greater. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 27-MAY-1993 Philippe: - TCC was rebooted. They had problems doing end run, and had no andor rate after initialize, also leds were off. - Looking in the logfile shows the last entry at 17:50 while the machine was only rebooted at 10:45. This is suggesting we had a recurrence of the disk error, as on 21-may. This explanation is also consistent with problems during begin/end run and consistent with initialization problems (no LEDs, no andor rate, probably no sptrg #31) as both of these actions need to read a file of commands from the disk. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 25-MAY-1993 Philippe: - Stu found bug in ITC that was consuming system resources (missing Delete IO port on failure) - rebuild the TRICS_V40.SYS_NO_DISK system using new ITC. - build TRICS_V40.SYS_25MAY93 but do not load, the system file in ESYS has >NOT< been overwritten ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 23-MAY-1993 Philippe: - read logfile, no problem doing it, nothing learned. Last entry in TRICS logfile was at 7:49 21-MAY-1993, Jan's message was from 11am. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 21-MAY-1993 Philippe: - TCC DUDRIVER job exited in the middle of physics running, TCC still running but cannot open TRICS_FORCE_UPDATE.DAT and thus cannot make end of run files. Is it a disk bad block?
- reboot, and run fine. I will >not< try to read the logfiles until tomorrow's study period. - build a new system TRICS_V40.SYS_NO_DISK that doesn't mount the disk, ready to load if TCC fails again. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 20-MAY-1993 Philippe: I believe there is a different problem with the current version of ITC, showing up at least in TCC, where TRGMON connects and disconnects ITC channels quite often. When an ITC channel is connected to TCC from TRGMON, 7 blocks of the "System Pool" resource are allocated (seen using Edebug) but when the channel is disconnected, only 6 blocks are released, leading to a net loss of one block of system pool. This did not use to happen with the private version I had been using until 8-APR-93. That is 7 blocks were allocated and 7 blocks were released. The difference in the ITC I loaded is in the fix from MAR-92 "to reset CCB[Channel].In_Use on connect failures". I haven't tried yet to guess where the bug is in ELN_CONNECT.EPAS. One block of this System Pool is used for each Kernel object created. It is probably a PORT, MESSAGE or something else that is created, and not deleted before it is created again. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11-12-MAY-1993 Philippe: - build system with new ITC (last attempt failed) which also has logfile messages in ignore_problem routine old fix for preventing CPU hogging when reaching max number of channel system TRICS_V40.SYS_11MAY93 is loaded by Jan G. Wed. morning. - plot REPEAT_EDEBUG for previous system - restart REPEAT_EDEBUG ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 10-MAY-1993 Philippe: - REPEAT_EDEBUG now shows pool size below 9000 read MPOOL_SERVER_07MAY93.LOG, see truncate messages still there. Use Edebug to halt the ITC connect process and view the source file, it is NOT the new code.
The cause was probably MPOOL_SERVER.EXE not having been relinked on 4-MAY while building new system (ITC OLB not listed in MMS). ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 7-MAY-1993 00:10 - Dan reboots TCC to read new BOOT_AUXI with active veto for begin/end run. - restart REPEAT_EDEBUG in the morning, previous recording (since 5-may) does not show a visible drop. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5-MAY-1993 Philippe: - build new system 4-MAY late evening with Ports parameter increased 256 -> 1024 New ITC from Stu Beta release which has fix to the recover connection update that deletes the port clear the truncated message flag that isn't used anywhere - system loaded by Jan G., today is accelerator study day - notice that the global caltrig energy is large and negative. TRGMON's ADC dump shows that every tower is 0, 1 or 2. This is traced back to a failure loading the DAC_BYTE file. TRICS stopped after noticing that INIT_DAC_BYTES.LSM had values for eta > 16 (that are now turned off). - clean D0HTCC::[TRIGGER] directory. - copy TRICS*APR*.LOG to MSUD01::DUA1:[BACKUP.LOGFILES_D0HALL] - also copy MPOOL*APR*.LOG, LOG*APR*.LOG, MAIL*APR*.LOG - delete ALL logfiles from April - [TRIGGER] now uses 9,700 blocks - use REPEAT_EDEBUG from D0MSU2 (MSU2:[TMP12.LAURENS.EDEBUG]) to monitor TCC in general and system pool in particular ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 26-APR-1993 Philippe: - water leak in the cal trig. Turn off the last 2 racks. change TRICS_BOOT_AUXI to have a tower range of 1:16 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 23-APR-1993 Philippe: - EDISPLAY shows that the pool size has dropped from 500 to 200 - build a new system with pool increased from 1024 to 10,000 - There is beam in the machine and the pool size is dropping to 70. call Jan to find out when it is possible to reboot.
A run is about to end, but the beam is lost right before the reboot anyway. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 21-APR-1993 Philippe: - start watching D0HTCC with SETHOST/LOG on D0MSU2 running EDISPLAY with 60 s refresh rate SETHOST/LOG on D0MSU2 running EDEBUG every 10 mn ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 20-APR-1993 Philippe: Accelerator is rebuilding the Pbar stack and I get an opportunity to investigate ITC truncate messages. - load MPOOL_SERVER.EXE from previous system (before ITC fix) no truncate message - add a report about the channel and the request in the error message, recompile ITC, relink and reload MPOOL_SERVER.EXE truncate messages are back not correlated to a particular channel or particular request type - find an old ITC library and relink truncate messages are still there - Stu Fuess thinks the truncate flag is never set and thus picks up random previous memory content - increase system parameters and reboot P1 1024 -> 2048 System Interrupt stack 2 -> 128 System region size 1024 -> 2048 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 19-APR-1993 Philippe: 1) TRGMON timeout and ITC truncate messages. These ITC truncate messages have started appearing with the last TCC system change on 8-APR-93. It is probably not a coincidence. They come somewhat randomly. Unfortunately I didn't have the TCC monitoring pool server also advertise which channel they come from, and I don't know what to correlate them with. From what I have gathered so far, these errors are associated with incoming messages. This is a flag reported by, but not generated by the ELN ITC code. I am not quite sure if the flag is set by the host ITC code or by the system services sending and receiving the messages. 2) multiple reboots on 17-APR-93 I believe there was a problem and TCC needed to be rebooted.
I believe there was some confusion while trying to reboot and restart. I will not go in details, but there is (again) evidence that TCC was still booting while COOR was told to talk to it. I don't know if this string of 3 problems in a week is a fluctuation of the previously "once per month" rate, or if it is linked to the latest system change on 8-APR-93. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 18-APR-1993 Philippe: Jan reports TCC being rebooted on the owl shift (Sunday morning). There were other problems at the time with the host, ethernet, ... All HSB windows disappeared at about that time. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 17-APR-93 Philippe: Jan reports notes from the DAQ log book 19:40 TRGMON can't connect to D0HTCC insufficient resources at remote node 19:45 End run 63792. Fail to read luminosity scalers. Seems like a good time to re-init a few things. stop/start COOR and Data Logger Trigger D0HTCC {We were having problems earlier on the day shift and I had asked them to stop/start COOR and data logger during the next shot setup. The COOR/data logger connection was not right.} Boot unreachable L2 nodes ... COOR dies in downlaod. Still a problem with D0HTCC. Reboot D0HTCC (push the button). Initialize trigger framework. Still cannot connect to D0HTCC. EDEBUG D0HTCC - looks OK. Trig_init still fails {Initialize framework with COOR} Power cycle D0HTCC. Try TRIG_INIT several times. now downloading OK. 20:40 Start run 63800 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 14-APR-1993 Philippe: - get a call from the DAQEXP, Gene Alvarez. D0HTCC "crashed" during machine studies so that I have time to investigate %PAS-F-ERRDURPUT, error during PUT -KERNEL-F-DISCONNECT, circuit disconnected by partner Job 5, process 1705, program DUDRIVER has exited. Job 5, process 1704, program DUDRIVER has exited. - Answer to Gene, Jan G. 
: Some jobs and sub-processes had disappeared. The main problem that Edebug showed was the "-KERNEL-F-DISCONNECT, circuit disconnected by partner" exception inside a call to WRITELN. This makes little sense to me. The only "circuit" that I can think of would be the connection that has to be made inside the EPASCAL IO routines to the CONSOLE (VT300) driver. At this time, I believe that the problem did not originate in Level 1 software but in the ELN Kernel. It could be a software or hardware problem that screwed up (all?) the Kernel's datagram tables. (Not just for the Level 1 software, e.g. trying to SET HOST to D0HTCC also generates a similar exceptions in the ECL process). - also later notice that D0HTCC is very low on "Pool" pages. Is this a cause or a consequence? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 13-APR-1993 Philippe: - Found in the electronic captain logbook that TCC was booted over the weekend, ask Jan for what is in the DAQEXP logbook. April 10 23:30 TRGMON has disconnected a number of times with Error detected during > ITC disconnect %ITC-E-NO_CHANNEL, channel request has not been activated TRGMON cannot get data from TCC, edebug D0HTCC raised exception in TRICS_V40 489: WRITELN(message) Trigger D0HTCC April 11 0:00 loose all D0HSB windows COOR dies 0:15 COOR dies - unable TO TALK TO TCC EDEBUG TCC - looks ok Triggered TCC Restart COOR - answer to Jan: The ethernet problems might explain some of the confusion and needing to boot TCC a second time, but I am not sure that it explains everything. It seems that one of the processes was halted in Edebug. There is no telling now what really happened. I hope that I will be around to investigate if this happens again. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 10-11-APR-1993 Philippe: - TCC was booted twice, at 10-APR-93 23:57 and 11-APR-93 00:30. With no message from COOR in TRICS_10APR93.LOG. 
TRICS_08APR93.LOG last message is from 23:17 MPOOL_SERVER_08APR93.LOG has the following messages at 23:21 E-EXC/MBX% Message Mailbox is Full but Not Signaled S-EXC/MBX% Flush_to_File now Servicing Exception Mailbox X-WAI/CNS% Console Locked for 5s, Recover: Force Unlock S-EXC/MBX% Exception Mailbox now empty ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 8-APR-1993 Philippe: - TCC booted to load EWORK1:TRICS_V40.SYS_7APR93. - MRBS_LOSS was moved from andor term #121 to 124 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 7-APR-1993 Philippe: DZERO::EWORK1: - V3.1 has been moved to ETRICS - version change from V3.1 to V4.0; system is EWORK1:TRICS_V40.SYS_7APR93 - upgrade the remote console to be able to receive the messages from the other jobs (MAIL_SERVER, MPOOL_SERVER, LOG_SERVER). - The remote console is still a sub-process of TRICS and created/deleted by TRICS, but relays messages from >all< the jobs. - The remote console now also survives a temporary bottleneck while sending the messages to the host's terminal. If a message is not serviced by the remote console within 1 second, further message copying is suspended for 1 minute, then resumed with an error message notification. - wait 5mn before starting MONIT POOL SERVER (keep off Nina's monitoring and trgmon... while booting) - fix LOG_SERVER to gracefully survive a link-to-host problem (e.g. exceeded quota) - remove confusion between tree OFFSET and tree CORRECTION - implement all the tree browsing messages. - modify the method of servicing exceptions in exception handlers, - all messages generated by an exception handler are now put in a mailbox. An "exception tracing" state is set at the beginning and cleared at the end of each exception handler. Any message generated, even indirectly by routines called by the exception handler, is put on the mailbox stack.
This method also makes the exception handler execute without interruption (no screen IO to wait for). The mailbox is then "signaled" to show that it needs emptying at the end of the exception handler. - the mailbox only holds a maximum of 10 messages, which is plenty for all known cases. The mailbox counts the number of messages that overflow, and the time of the last entry. - Previously, the job of the process "flush to logfile" was to wait for a fixed time interval of 2mn30 to wake up and close the logfile. It is now given the additional responsibility of also waking up when an exception handler signals the exception message mailbox as needing servicing. The process "flush to logfile" will empty the mailbox to the screen and logfile (note that the messages keep their original time stamp). These messages are prefixed by X- The process flush to logfile can also unlock the console when it finds it remains locked for more than 5s. - Other jobs (MAIL_SERVER, MPOOL_SERVER, LOG_SERVER) also receive the same treatment for their exception handlers and now have their own "flush to logfile" subprocess - MPOOL_SERVER and ITC Increase the maximum number of connections to MPOOL_SERVER from 10 to 15 Also use a more recent version of ITC that has a fix to recover channel resources in case of connection failure. Add a fix to prevent indefinite loop when the maximum number of channels are connected. - make the "mailing to" message a system class message to have it in the logfile. - hardware updates Update the read/write mask of L1.5 Control MTG Ch 29 & 30 (long timeout) Quit initializing L1.5 control MTG channel 29 Initialize L1.5 receiving MTG channels 1:19 Fix mtgbusy, comint busy stretch PAL is at FA 1, not 31. 
- watch double-buffer Move screen message notifying of re-synchronization to AFTER doing it Raise process priority (8 -> 7) for un-interrupted service - refresh monit pool reset comint Resetting COMINT to restore data flow is now done through the file TRICS_RESET_DIRECTIVES.DAT. - INITIALIZE_TIME Remove the wait for 15 seconds before starting executing Clear the screen when starting, for better visibility In case of problem, wait 10 seconds and retry ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 5-APR-1993 Philippe: - move to Daylight savings time by the following simple method Edebug> Create Job Initialize_Time ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3-4-APR-1993 Philippe: - clean D0HTCC::[TRIGGER] directory. - copy TRICS*MAR*.LOG to MSUD01::DUA1:[BACKUP.LOGFILES_D0HALL] - also copy MPOOL*MAR*.LOG, LOG*MAR*.LOG, MAIL*MAR*.LOG - delete ALL logfiles from March - [TRIGGER] now uses 11,000 blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2-APR-1993 about 13:30 Philippe: - TCC locks up at the beginning of a store, but the run is already started. When Dan called me, I used Edebug to reach the node, and browse around. I couldn't make any of the symptoms fit in a coherent manner. - None of the ITC, or ELNCON connections could get to it. - None of the Level 1 processes appeared to be doing anything. - None of the processes ran out of memory, or other resources - I couldn't halt any but one (begin_end_run) of the L1 processes. - even FAL wouldn't work, which is the first time I saw it. - Only Edebug could talk to it. - I couldn't SET HOST either, meaning also no ELN monitoring. I was starting to think it was an ELN kernel problem that had all the processes stuck, or a hardware error in the primitive uVAX II CPU board (e.g. it only has a one-bit-per-byte parity checking in memory). A new store was in and I gave up after 10 minutes and told Dan to reboot TCC.
I had captured my session with Edebug in a logfile and picked it apart, comparing memory usage and delta CPU time. It took me almost an hour to reach the following conclusion (which fits all the symptoms), and TCC was rebooted by then. I didn't get to use edebug and see what ITC was doing. My current understanding of what happened is that one of the two subprocesses that ITC created in the ELN server job for TRGMON became 100 % busy doing something. I don't really understand ITC, but I believe this process handles all incoming new connections. It is at a higher priority than regular user jobs. The ELN system is not time-sharing and this job took every CPU cycle it could find. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 17-MAR-1993 Philippe: - clean D0HTCC::[TRIGGER] directory. - copy TRICS*FEB*.LOG to MSUD01::DUA1:[BACKUP.LOGFILES_D0HALL] - also copy MPOOL*FEB*.LOG, LOG*FEB*.LOG, MAIL*FEB*.LOG - delete ALL logfiles from February - [TRIGGER] now uses 17,000 blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 29-MAR-1993 Philippe: - there was this entry (3am?) in the daqexp logbook (reported by Jan G.) Fatal alarm - TCC_Link_Err.COOR Lose COMM_TKR's? RESET_COOR Test trigger complains - Failure on effort to connect to D0HTCC COOR dies restart COOR COOR.OUT file full of NETCON error %SYSTEM-F-REMRSCRC, insufficient system resources at remote node and DISCON error: %SYSTEM-F-IVCHAN, invalid lack of connection to D0HTCC TRIG_INIT fails in TALK TRGMON error detected during ITC connect to D0HTCC Insufficient system resources at remote node. Error: Cannot get data from control computer. 
%ITC-E-NO_CHANNEL Channel requested has not been activated EDEBUG D0HTCC - Loading traceback from MAIL_SERVER.EXE crashed in MSUTRGOUT:[TRG_TARGET.SOURCE_1WORK]MAIL_SERVER.EXE LINE 458F: ELN$UNLOCK_AREA(console_obj, console_synch^.lock) MCR NCP TR NO D0HTCC EDEBUG D0HTCC only jobs up are XQDRIVER CONSOLE EDEBUGREM DUDRIVER INITIALIZE_TIME priority 16-waiting no other jobs running INITIALIZE_TIME => D0HTCC is still booting L1 68k display is going crazy!! The slave ready did not drop Event with no Spec Trigs Fired, not transfered. But no runs are in progress. Reload 68k and immediately (at Go 95000) get the same message as above. Reboot D0SUPR, D0SEQR Retrigger D0HTCC TALK INIT_TRIG Fatal error message goes away. Edebug looks ok. Exit. Restart. Everything's OK. Test trigger running smoothly. - here is part of my answer to Jan 1) I have no doubt that there was a problem, and it was right to boot. 2) Either they didn't wait long enough for TCC to boot, or the host failed to answer to the INITIALIZE_TIME task. The trigger control software started two logfiles, at 3:14 and 3:16. If the logbook is correct that they only had to trigger TCC twice, then it looks like they didn't wait long enough. There also were 2 initialize messages from COOR only 90 s apart at 3:20. 3) I wouldn't worry much about the VME 68k until TCC is fully booted. Here is what I propose for improving the overall situation. 1) John restoring FDDI and returning to fluid network link to the host(s) is going to help any TRGMON, and/or booting problem. 2) I will improve the INITIALIZE_TIME task to do a better job at displaying what phase of the booting process it is in. And make it automatically retry in case of failure. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 12-MAR-1993 Philippe: - Dan has fixed COMINT to clear most recent data block while it is trying to start. The rate of resynchronization messages in the logfile should go from every 20 mn to almost never.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 24-FEB-1993 Philippe: - Dan loads TRICS_V31.SYS_11FEB93 into D0HTCC. This is the fix to the Begin/End Run file Synchro problem with COOR and it has 5x the margin for virtual address space. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 12-FEB-1993 Philippe: - clean D0HTCC::[TRIGGER] directory. - copy TRICS*JAN*.LOG to MSUD01::DUA1:[BACKUP.LOGFILES_D0HALL] - delete ALL logfiles from January - [TRIGGER] now uses 33,000 blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 8-JAN-1993 Philippe: - clean D0HTCC::[TRIGGER] directory. - copy TRICS*DEC*.LOG to MSUD01::DUA1:[BACKUP.LOGFILES_D0HALL] - delete all logfiles from December - [TRIGGER] has now 8,000 blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 7-JAN-1993 Philippe: - new system, installed, EWORK1:TRICS_V31.SYS_7JAN93 Add 2 messages in begin/end run: "file opened" and "done". Increase the priority of the begin/end run task by one unit. All this is on top of the modifications from 19-dec-92 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 19-DEC-1992 Philippe: - New system was built, but never installed EWORK1:TRICS_V31.SYS_19DEC92 Modify update_register (more messages, data not write masked) Modify definition and initialization of FW TSS write A/B PAL Send messages at error during register initialization Add prescaler ratio to begin/end run file Initialize only a fraction of the l1.5 terms ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3-DEC-1992 Philippe: - clean D0HTCC::[TRIGGER] directory. - copy TRICS*NOV*.LOG to MSUD01::DUA1:[BACKUP.LOGFILES_D0HALL] - delete all logfiles from November ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 12-NOV-1992 Philippe: - clean D0HTCC::[TRIGGER] directory.
- copy TRICS*OCT*.LOG to MSUD01::DUA1:[BACKUP.LOGFILES_D0HALL] - delete all logfiles from October - [TRIGGER] has now 17,000 blocks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11-NOV-1992 Philippe: - mpool_server - fix mpool_server for Nina's messages - upgrade to skip building the same message as sent before - change exception handler messages ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6-NOV-1992 Philippe: - Add new monitoring message for foreign scalers. - update message to Nina's Cross system monitoring with Level 1.5 quantities - system loaded 7-nov-92 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 21-OCT-1992 Philippe: - new system with new version number 3.0 -> 3.1 - new definition for mtgfwtss Ch # 4 as MTG05 mtgcttss Ch # 5 as MTG06 (latch/shift) mtgcttss Ch # 29 as MTG05 (IMLRO latch/shift) mtgbusy Ch # 31 as FEBzGS01 (double buffer full) - hardware initialize mtgbusy Ch # 31 to 0 (instead of 10) - fix bug in CHTCR test that was truncating error messages for phi - tighten Find_DAC_byte "median" requirement - Find_DAC_byte now has exception handler to close result file and resignal. This will now properly close the file when the test task is deleted. - change the console messages for the "auxiliary init" to be less specific since the concept is used in other instances. "Command File Closed" instead of "Auxiliary Init File Closed", ... - fix bug in monit pool filler, (ZERO mpool_rec.twr) to solve non existing towers appearing excluded in TRGMON - delete area mpool_obj before quitting or after delete task, to prevent eating up virtual address space - fix bug in the result_file service that caused binary garbage in the first line of the result file (e.g. DAC_*.LOG) - Begin/End Run now forces a latch/shift using the file D0HTCC::[TRIGGER]TRICS_FORCE_BUF_UPDATE.DAT - the new pause run, .... messages are implemented.
- I forgot to make FIND_DAC use the proper value to take control of latch/shift. Fortunately, one can still start find_dac and (quickly) overwrite the register, as long as it is before the first pause. - make new system. Dan loads it. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 29-SEP-1992 Philippe: - fix FIND_DAC to make a requirement that the median is close to the proposed DAC_BYTE. This is to suppress problems where the gaussian tail low statistics produces an abnormal high reading. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 9-SEP-1992 Philippe: - fix set_trgtwr_simu (used by exclude trigger tower) to still complete action in case of failure. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 9-JUL-1992 Philippe: - upgrade to TRICS V3.0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3-JUL-1992 Philippe: - failed attempt to upgrade to trics V3.0. Some problems were fixed with disk mounting missing from Ebuild. Still problem with TWB bits always on.
- return to system from 1-jul-92 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1-JUL-1992 Philippe: - bug fix in card address of L1.5 SBSC ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 26-JUN-1992 Philippe: - add monitoring Level 1.5 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 18-JUN-1992 Philippe: - new system: L1.5 bug fixes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 11-12-JUN-1992 Philippe: - new system - fix bug initializing all triggers as l1.5 capable - properly define new L1.5 PALs in FW TSS MTG, Hld Trf. MTG, St Dgt MTG - define new L1 PALs listening to Level 1.5 - tune Level 1.5 programming ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4-JUN-1992 Philippe: - new system - ignores COOR's GEO_SECT DGTZ_OFF messages, for solving new level 1.5 PAL programming problem in start digitize. - Special TRICS initialization of MTG FW TSS ch#14 incr.St.Dgt.Num to be ROM Gated. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 3-JUN-1992 Philippe: - prepare new system using the new MENU with updated andor card test, and including correct restoring of reference sets. Not loaded - use 1470 byte network segment. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 19-MAY-1992 Philippe: - unexplained communication to TCC hanging shortly after beginning of run. - build a system without a disk. - update SITE_DEPENDENT.CST - update EBUILD, not mount the disk - arrange to open FOR003 on the host before zebra in MOD095_INIT_LSM.PAS - the diskless system didn't help - the best guess is that there was a hardware or configuration problem on the ethernet link to the host.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
13-MAY-1992 Philippe:
- new system
    global threshold data base
    manage tree offsets
    restore caltrig towers, threshold, jet list
    progr monitor thresholds
- manually modified by leaving out 200 EM and 200 HD counts of offset
  in Tier #3 in order to keep 400 counts of offset in Tier #4 Tot
    MOD130_INIT_THRESHOLDS.PAS
    MOD095_INIT_LSM.PAS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
24-APR-1992 Philippe:
- build new system
    prepared to look for the jet list argument andor card in mba 106.
    reset the new DBSCs in M101.
    fix the TRGMON 68k state problem (items swapped)
- build new system
    fix missing initialization of jet list andor card
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
23-APR-1992 Philippe:
- fix ITC max message size and rebuild system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
21-APR-1992 Philippe:
- copy EWORK1: (TRICS_V25) to ETRICS
- build new system in EWORK1: in preparation for new COMINT PROMs
    copy MSU's new Code (TRICS_V26) to EWORK1:
    modify MOD123_INIT_CBUS_CARDS.PAS to ignore FMLN cards
- recompile official TRGMON to match new data format.
    update lv1_mpool_raw.inc in TRGMGR::SHTRGMON: directory
    update HTRGMON:TRGMON_DRIVER_LINK.OPT for new common block
    $ @COMPILE_TRGMON
    $ @LINK_TRGMON
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
20-APR-1992 Philippe:
- check and delete logfile TRICS_02APR92.LOG
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17-APR-1992 Philippe:
- D0HTCC was rebooted, thus using latest system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
13-APR-1992 Philippe:
- install in TRGUSER account directory [.DIRECT_TO_TCC]
  RUN_PAUSE_RESUME.COM and PAUSE_RESUME.EXE
- mail sent to Jan Guida, Joey Thompson
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
9-APR-1992 Philippe:
- Build new system, placed as next load file, but NO reboot made.
    hardware initialize L1 fired for bunch in FWTSS as gated,
    fix bug in message parsing (R.Astur, second refset receives BAD PARAM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2-APR-1992 Philippe:
- modify LOGIN.COM for TRGUSER so as to establish a permanent NETSERVER
  for use by MAIL, and write begin/end run.
- This was discovered by tracking the code in SYS$SYSTEM:NETSERVER.COM
    $! If this is a network request, tell NETSERVER.COM to create a PERMANENT
    $! server (with a maximum of 1 permanent server).
    $ IF ( F$MODE() .NES. "NETWORK") THEN EXIT
    $ DEFINE NETSERVER$SERVERS_TRGUSER 1
    $! AND set a timeout of 24 hours for any additional netserver that will be
    $! created non-permanent.
    $ DEFINE NETSERVER$TIMEOUT "0 23:59:59"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1-APR-1992 Philippe:
- relink TRGMON with recent ITC, this appears to solve the hanging problem.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
25-MAR-1992 Philippe:
- fix mail server.
  It was releasing the area too early, and thus could advertise the job
  done for the next message.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
25-MAR-1992 Philippe:
- prepare a new system for Bruce's test on WRT_HOST messages
  only acknowledge, no action
    WRT_HOST BEG_RUN
    WRT_HOST END_RUN
    WRT_HOST SYNCHRO
- move resulting set of TRICS V2.4 files EWORK1: -> ETRICS:
- copy MSU's files from MSUHEP::EWORK1: to D0::EWORK1: and start building
  new system TRICS V2.5
- step up and now open file on host. No data yet.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
20-MAR-1992 Philippe:
- Install new TRICS V2.4 system with
    - full turn 6 on 6 MTG,
    - new find_dac,
    - load_dac,
    - no error on dbsc reset
    - add BOOT AUXI file, with find&load DAC
    - init trigger number & sptrg strobe fwtss to ROM gated
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5-MAR-1992 Philippe:
- There is a file in TRGCUR:SET_LEVEL1_BEAM_CROSSING_PERIOD.COM
  There are 2 new files on D0HTCC::[TRIGGER] and TRGCUR:
    TRICS_INIT_AUXI.DAT_4_BUNCH
    TRICS_INIT_AUXI.DAT_6_BUNCH
  The COM_file copies the *.DAT_n_BUNCH to *.DAT
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17-FEB-1992 Philippe, D0::EWORK1:
- John installed ELN V4.3 last Friday.
- save OLB from ELN V4.2
  $ REN *.OLB TRICS_V23_DEB.OLB_ELNV42
- start EWORK1: $ MMS
  fails on MPOOL_SERVER.OBJ because ITC still references old $KERNELMSG
- copy MSUHEP MSUTRGROOT:[TRG_LIB.ITC]ELNOLBS.COM, [.ELN]*.*, [.INC]*.*
- $ @ ELNOLBS
- $ REN DEB_ELN_ITC.OLB TRGOLB:
- restart EWORK1: $ MMS/SKIP
- prepare system for next load.
- send mail to John to rebuild D0$ITC:
- try TRICS_V23.SYS file on MSUD04::
  fails on INITIALIZE_TIME when asking boot node 557095 = 544.39!
  -> INITIALIZE_TIME probably needs recompiling.
- $ DELETE TRICS_V23.SYS;
  $ DELETE INITIALIZE_TIME.EXE;
  $ DELETE UNLOCK_SHA.EXE;
  $ MMS/SKIP
- retry TRICS_V23.SYS file on MSUD04::
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
30&31-JAN-1992 Philippe, D0::EWORK1
- install new system with st_vs_rs, global_threshold, bug fix in
  reference set programming, error_filter
- increase time constant to 0.5 s
- saved in EWORK1: $ REN TRICS_V23.SYS TRICS_V23.SYS_31JAN92
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
21&22-JAN-1992 Philippe:
- check logfiles up to TRICS_10JAN92.LOG
################################################################################
**** NOTE
################################################################################
VAXELN - V4.3  D0HTCC  UP: 2 19:16:49.21  5-APR-1993 10:01:18.08
(57.560)  IDLE: 2 07:29:46.54
PAGES: 11849/11849/26624  S0-REGION: 5143/5143/5280  POOL: 526/1024
JOB-SLOTS: 8/24  LOADER: 0/0/0  PROCESS-SLOTS: 25/80

                                R/W   R/O
Job#  Program_name  Mode  Pri    Pages     State  Runtime
   2  XQDRIVER      K       1   131   104  WAIT   0 02:20:12.62
   3  CONSOLE       K       2    23    23  WAIT   0 00:09:37.09
   4  EDEBUGREM     K       3     2    27  WAIT   0 00:00:00.33
   5  DUDRIVER      K       4    89   143  WAIT   0 00:03:38.49
   7  FALSERVER     U      16    14    13  WAIT   0 00:00:53.04
   8  TRICS_V31     K      16  3713  1562  WAIT   0 07:30:49.77
   9  MAIL_SERVER   U      16    23   233  WAIT   0 00:00:00.75
  10  MPOOL_SERVER  U      16   372   258  WAIT   0 01:30:40.71
  11  LOG_SERVER    U      16    25   230  WAIT   0 00:00:00.06
  12  RTDRIVER      U      16    48    39  WAIT   0 00:00:00.69
  14  ECL           U      16    26   242  WAIT   0 00:00:00.25
  15  EDISPLAY      U      16    52    47  RUN    0 00:00:02.45
################################################################################
VAXELN - V4.3  D0HTCC  UP: 1 07:56:19.98  9-APR-1993 11:24:20.01
(57.560)  IDLE: 1 02:21:15.98
PAGES: 11716/11675/26624  S0-REGION: 5141/5141/5280  POOL: 517/1024
JOB-SLOTS: 8/24  LOADER: 0/0/0  PROCESS-SLOTS: 23/80

                                R/W   R/O
Job#  Program_name  Mode  Pri    Pages     State  Runtime
   2  XQDRIVER      K       1   127   104  WAIT   0 01:03:59.23
   3  CONSOLE       K       2    20    23  WAIT   0 00:04:54.66
   4  EDEBUGREM     K       3     2    27  WAIT   0 00:00:00.00
   5  DUDRIVER      K       4    48   143  WAIT   0 00:02:41.41
   7  FALSERVER     U      16     6    13  WAIT   0 00:00:00.83
   8  TRICS_V40     K      16  3506  1637  WAIT   0 03:36:14.76
   9  MAIL_SERVER   U      16    33   244  WAIT   0 00:00:00.43
  10  MPOOL_SERVER  U      16   640   269  WAIT   0 00:41:53.04
  11  LOG_SERVER    U      16    35   242  WAIT   0 00:00:14.86
  12  RTDRIVER      U      16    48    39  WAIT   0 00:00:01.16
  13  ECL           U      16    26   242  WAIT   0 00:00:00.22
  14  EDISPLAY      U      16    52    47  RUN    0 00:00:01.64
################################################################################