Proposed Luminosity Accounting/Tracking Procedure
23-April-1999, RDS

There will be a luminosity monitoring process running on the online system. It may have connections to other DAQ processes which depend on run status, but it will monitor luminosity independent of data runs in order to supply a continuous stream of data to the accelerator. To do this it will query TCC (the trigger control computer) at a rate not to exceed 1 Hz. The actual rate will be selected to have sufficient counting rate not to be dominated by statistical fluctuations, while maximizing feedback to the accelerator. In order to allow TCC to capture the required information, independent of run status, the framework will be "paused" for about 4 mSec for each request. The meaning of "pause" is that no Level 1 accepts will be issued during this time. Live time for active trigger sets will be properly accounted for.

The scalers used for this monitoring are never reset, and the raw numbers are sent to the luminosity monitoring process. It is the responsibility of the luminosity monitoring process to remember previous scaler values and to take the appropriate differences. The block of data sent to the luminosity monitoring process will also contain the following numbers (a sketch of this block and of the scaler bookkeeping is given at the end of this section):
1) A 32 bit "luminosity index" number,
2) An 8 bit "tick" number,
3) A 32 bit "turn" number, and
4) A 64 bit "date-time stamp".

The luminosity index is a number which is set to one (1) until Jan 1, 2000 00:00:01. After that it is incremented by one whenever the luminosity monitoring process requests from TCC a luminosity update WITH index update. The data shipped on that update is the final set of scaler readings for the old index number, and all data (including raw data) after the turn/crossing listed will be marked with the next index. Whenever the index number is changed, an entry must be made in a luminosity database containing all relevant information for the just-ended luminosity interval. This index number is maintained in the trigger framework, with a copy in non-volatile memory so it is remembered even through power outages. It is expected that this index will only be updated once every 5 minutes or so, but there is no restriction on when it is updated.

The tick number ranges from 0-158 and increments every 7 rf buckets (132 nSec). It is synchronized to the accelerator and indicates which of the potential 159 locations for a proton bunch is currently in front of the D0 detector.

The turn number increments by one every time the same beam particle could go by D0. This is NOT the turn number used in stamping different parts of the same event, which gets reset on SCL init commands. This number only resets on power failures.

The date-time stamp is a packed BCD date-time value identical to the format used by all geographic sectors to time stamp locally detected significant events (including errors). For details on the format see the D0 DAQ Geographic Section Specifications on the web. It is not meant to be more accurate than a few minutes (although typically it should be good to a second or two). To automatically ensure at least this level of accuracy, I propose that a standard TCP time server be set up (either in D0 or elsewhere on site) where all network accessible systems can check their local time a few times per day.
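To make the scaler bookkeeping concrete, the following minimal C sketch shows a data block holding the four numbers listed above plus a set of raw scaler readings, and the differencing the monitoring process must do. Only the field widths come from the text; the struct layout, the field and function names, and the scaler count are illustrative assumptions, not a defined TCC interface.

    #include <stdint.h>

    #define N_LUM_SCALERS 16   /* hypothetical number of luminosity scalers */

    /* One monitoring block as described above: the four bookkeeping numbers
       plus the raw, never-reset scaler readings.  Layout is illustrative. */
    struct lum_block {
        uint32_t lum_index;             /* 32 bit luminosity index           */
        uint8_t  tick;                  /* 8 bit tick number, 0-158          */
        uint32_t turn;                  /* 32 bit turn number                */
        uint64_t datetime;              /* 64 bit packed BCD date-time stamp */
        uint32_t scaler[N_LUM_SCALERS]; /* raw scaler readings               */
    };

    /* The monitoring process keeps the previous raw readings and takes the
       differences itself.  Unsigned subtraction gives the correct count even
       if a 32 bit scaler wraps between two requests. */
    static uint32_t prev_scaler[N_LUM_SCALERS];

    void accumulate(const struct lum_block *blk, uint64_t counts[N_LUM_SCALERS])
    {
        int i;
        for (i = 0; i < N_LUM_SCALERS; i++) {
            uint32_t delta = blk->scaler[i] - prev_scaler[i];  /* modulo 2^32 */
            counts[i] += delta;
            prev_scaler[i] = blk->scaler[i];
        }
    }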
Numbers that could be used to identify events

The following numbers are (or could be) stamped on every event (by the trigger framework); a short check of the rate and overflow arithmetic quoted below is given at the end of this section:

1) Luminosity Index (32 bits). Should be used to split high rate streams into multiple files for a single run. Each file is then associated with a beginning and ending index number. Different streams can split at different index values. This works assuming that the index number is incremented often enough.

2) Coor Run Number (32 bits). These 32 bits are fully under the control of Coor and can be updated whenever data taking is paused. A suggested use of these bits is a global run number. Remember that multiple runs, containing different trigger bits, are allowed to run simultaneously, but this number can not be different as a function of trigger bits. If this number changes on every begin/end run, then Coor can store all global run numbers used during a particular local run.

3) 132ns Clock (32 or 64 bits). This number increments about 7 million times per second and is only reset on power failures. This ensures that the number for any run starts at its smallest value and continually increases until the end of the run. It will be the same if a single event occurs in two separate (but simultaneous) runs. In a typical 4 hour run this number changes by about 1.1e11, which is a large number but not much worse than 131.225.224.121. The full 64 bit number is large enough that I will not live to see it overflow, and it is therefore unique in my lifetime.

4) L1 accept number (32 bits). This number is similar to but different from the L1 accept number used by the geographic sectors, in that it only resets on power failure (and possibly on a request from Coor). At a 10 KHz rate this number overflows about once every 100+ hours and is therefore unique and monotonic in any run.

5) L2 accept number (32 bits). This number is identical to the L3 transfer number, except that it is 32 bits long instead of the 16 lsb's used by the geographic sectors. It is reset on power failure (and possibly on a request from Coor). At a 1 KHz rate this number overflows about once every 1000+ hours and is therefore unique and monotonic in any run.

6) Date-Time number (12 byte BCD). This is the standard geographic sector time stamp. It is only precise to a second and accurate to a few seconds, so it can not be used by itself to uniquely identify an event.

Other noteworthy observations

Any number assigned after the events enter the L3 farm can no longer be in chronological order. In fact, you can not even assign a number which is not mixed across stream parts, because we will be splitting files on a luminosity index number boundary.

The present plan is to NOT write an event more than once per run. This requires analysis programs to access multiple streams to get all the data for a given stream, but eliminates the need to check if the event exists in a different file. To hold down the number of different streams, it is expected that a few of the lower rate streams will be combined into one "stream". This implies that analysis programs running on these "mixed streams" will have to check whether the event they are reading actually passes the particular trigger bits they are interested in.

I have not been able to understand where the stream assignment is made, as different people appear to think it is made at different levels. My personal opinion is that it should be made in the L3 output path, as the event is shipped off over the net.
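As a cross-check of the rate and overflow arithmetic quoted in items 3-5, here is a short C program that reproduces those numbers. Its only inputs are the 132 nSec crossing period and the 10 KHz and 1 KHz accept rates taken from the text; everything else is plain arithmetic.

    #include <stdio.h>

    int main(void)
    {
        double tick_hz   = 1.0 / 132e-9;            /* ~7.6 million per second */
        double run_ticks = tick_hz * 4.0 * 3600.0;  /* ~1.1e11 in a 4 hour run */
        double two_32    = 4294967296.0;            /* 2^32                    */
        double l1_hours  = two_32 / 1.0e4 / 3600.0; /* ~119 hours at 10 KHz    */
        double l2_hours  = two_32 / 1.0e3 / 3600.0; /* ~1193 hours at 1 KHz    */

        printf("132ns clock rate        : %.2e per second\n", tick_hz);
        printf("clock change, 4 hour run: %.2e\n", run_ticks);
        printf("L1 accept overflow      : about %.0f hours\n", l1_hours);
        printf("L2 accept overflow      : about %.0f hours\n", l2_hours);
        return 0;
    }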
Level 3 should also define a small header block containing all the information needed to properly handle the data. During the online meeting it was made clear that the L3 framework will handle the stream assignment before the data leaves the node. If multiple runs are in progress, this might entail defining more than one stream destination for an event.

During the online meeting it was also noted that a restriction should be placed on multiple runs which are taking data simultaneously. Since it is expected that events that have trigger bits from different runs will be written to multiple streams, the stream name should not be allowed to be identical for simultaneous runs. Coor should enforce this restriction so that other subsystems can assume this will never happen. Since this means that the same event can appear in different files, but only if they are associated with different runs, no analysis should attempt to combine data taken in different runs during the same luminosity index number, since one can not correctly calculate the luminosity for that mixed exposure.

Also during the online meeting it was noted that if the full bandwidth of the Level 3 system were directed to a single stream, it would fill the design file size (1 GB) in a minute or so. Since we were expecting to close files only on luminosity index number changes, this would require an index number update about once per minute (instead of the proposed once per 5 minutes). If logging the luminosity data once every minute is not practical, more than just the luminosity index number must be checked and logged when deciding whether one is allowed to close a stream file (a sketch of such a decision is given below).

Restricting files to close only on luminosity index number changes also means that the luminosity monitoring process must be running whenever data is being logged, even if the accelerator is not running. Since we can still recover the integrated luminosity that occurred while the luminosity monitoring process was down, we should not require data taking to stop if the luminosity monitoring process goes away. On the other hand, runs which need luminosity recorded should not be able to start until the luminosity monitoring process indicates that it has successfully incremented the luminosity index number. Likewise, ending a run which is marked as needing luminosity should ensure that the luminosity index is changed at the instant that trigger accepts are disabled for the trigger bits in that run.
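To illustrate the file-closing discussion, here is a minimal C sketch of the per-event decision a stream writer could make: close the current file when the luminosity index changes, and also when a size limit is reached (the alternative raised at the meeting for streams that would otherwise fill 1 GB in about a minute). The struct, the function name, and the use of a fixed 1 GB constant are assumptions for illustration only.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_FILE_BYTES (1ULL << 30)   /* assumed ~1 GB design file size */

    struct stream_file {
        FILE     *fp;
        uint32_t  lum_index;   /* luminosity index the file was opened under */
        uint64_t  bytes;       /* bytes written to the file so far           */
    };

    /* Returns nonzero if the current file must be closed before this event is
       written, so that every file spans a single luminosity index and stays
       below the design size. */
    int must_close(const struct stream_file *f, uint32_t event_lum_index,
                   uint64_t event_bytes)
    {
        if (event_lum_index != f->lum_index)
            return 1;                              /* index boundary reached */
        if (f->bytes + event_bytes > MAX_FILE_BYTES)
            return 1;                              /* size limit reached     */
        return 0;
    }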