A simple program to generate mock data for CMS ML projects based on L1 Calorimeter Trigger region data.

SridharaDasu/CMSMLProjectData

This package contains a simple program that generates mock data for detector background and, for the moment, a single electron or tau signal superimposed on that background. The mock data is not realistic at all. The output files are in the format that can be downloaded to the Wisconsin APx boards with Xilinx VU9P FPGAs using the APx core software tools.

The goal of the ML project is to develop techniques for separating background from signal.

Data files produced by this code use the following parameters for the geometry and for packing data into gigabit optical links. 16-bit data is available from each of the "regions" in two-dimensional eta-phi space. Reporting 252 16-bit numbers every 25 ns requires 16 optical fibers of ~10 Gbps each, running the 64/66 protocol, which deliver the event data as 64-bit words in a frame of 4 cycles. The parameters used are:

  const size_t N_REGIONS_ETA = 14;                                 // Number of eta divisions
  const size_t N_REGIONS_PHI = 18;                                 // Number of phi divisions
  const size_t N_REGIONS = N_REGIONS_ETA * N_REGIONS_PHI;          // 252 numbers
  const size_t N_REGION_BITS = 16;                                 // each of 16-bits width
  const size_t N_WORD_SIZE = 64;                                   // Each 64-bit word packs data
  const size_t N_REGIONS_PER_WORD = N_WORD_SIZE / N_REGION_BITS;   // for 4 regions
  const size_t N_BITS_PER_EVENT = N_REGION_BITS * N_REGIONS;       // Each event has 4032 bits of data
  const size_t N_WORDS_PER_FRAME = 4;                              // Event is read out in a 4-clock cycle frame
  const size_t N_BITS_PER_FRAME = N_WORD_SIZE * N_WORDS_PER_FRAME; // So, each link can accommodate 256 bits
  size_t N_INPUT_LINKS;                                            // needing 16 links
  if (N_BITS_PER_EVENT % N_BITS_PER_FRAME)
    N_INPUT_LINKS = (N_BITS_PER_EVENT / N_BITS_PER_FRAME) + 1;
  else
    N_INPUT_LINKS = (N_BITS_PER_EVENT / N_BITS_PER_FRAME);
  const size_t N_OUTPUT_LINKS = N_INPUT_LINKS;
  const size_t MAX_MEMORY_WORDS = 1024;                            // Firmware uses 1024 64-bit wide memories
  const size_t N_EVENTS = MAX_MEMORY_WORDS / N_WORDS_PER_FRAME;    // So, each file can accommodate 256 events
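
As an illustration of how these parameters fit together, here is a minimal sketch that packs one event's 252 region words into 16 links of four 64-bit words each. The names `Frame` and `packEvent` are hypothetical, and the ordering (region index to link, word, and 16-bit slot) is an assumption made for this example; the ordering actually used by genMLProjectData may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N_REGIONS = 14 * 18;                  // 252 regions
constexpr std::size_t N_REGION_BITS = 16;
constexpr std::size_t N_WORD_SIZE = 64;
constexpr std::size_t N_REGIONS_PER_WORD = N_WORD_SIZE / N_REGION_BITS;   // 4
constexpr std::size_t N_WORDS_PER_FRAME = 4;
constexpr std::size_t N_BITS_PER_EVENT = N_REGION_BITS * N_REGIONS;       // 4032
constexpr std::size_t N_BITS_PER_FRAME = N_WORD_SIZE * N_WORDS_PER_FRAME; // 256

// Ceiling division: same result as the if/else in the parameter list.
constexpr std::size_t N_INPUT_LINKS =
    (N_BITS_PER_EVENT + N_BITS_PER_FRAME - 1) / N_BITS_PER_FRAME;
static_assert(N_INPUT_LINKS == 16, "4032 bits need 16 links of 256 bits");

// One event: N_INPUT_LINKS links, each carrying N_WORDS_PER_FRAME words.
using Frame =
    std::array<std::array<std::uint64_t, N_WORDS_PER_FRAME>, N_INPUT_LINKS>;

// Assumed ordering (hypothetical): consecutive regions fill the 16-bit slots
// of a word from the least significant bits up, then the next word of the
// same link, then the next link.
Frame packEvent(const std::array<std::uint16_t, N_REGIONS>& regions) {
  Frame frame{};  // all words start at zero
  for (std::size_t i = 0; i < N_REGIONS; ++i) {
    const std::size_t link = i / (N_REGIONS_PER_WORD * N_WORDS_PER_FRAME);
    const std::size_t word = (i / N_REGIONS_PER_WORD) % N_WORDS_PER_FRAME;
    const std::size_t slot = i % N_REGIONS_PER_WORD;
    frame[link][word] |= std::uint64_t(regions[i]) << (slot * N_REGION_BITS);
  }
  return frame;
}
```

Because 252 is not a multiple of 16, the last link is only partly filled in this scheme: its final word stays zero.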

The 16 bits for each region contain a 10-bit transverse energy deposit, a 4-bit position of the cluster within the 4x4 towers comprising the region, an electron tag bit, and a tau tag bit:

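#include <cstdint>

class UCTRegion {
public:
  UCTRegion(double et, int pos, bool ele, bool tau) {
    // Bits 0-9: transverse energy, clamped to the 10-bit range [0, 1023]
    if (et <= 0) _bits = 0;
    else if (et < 1024.0) _bits = uint16_t(et); // 10-bit et
    else _bits = 0x3FF;
    // Bits 10-13: cluster position; masked so it cannot spill into the tag bits
    _bits |= ((pos & 0xF) << 10);
    if (ele) _bits |= 0x4000; // Bit 14: electron tag
    if (tau) _bits |= 0x8000; // Bit 15: tau tag
  }
  uint16_t bits() const {return _bits;}
private:
  uint16_t _bits;
};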
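
For reading the data back, the same bit layout can be inverted. The sketch below is not part of the repository; `DecodedRegion` and `decodeRegion` are hypothetical names, and the field positions are taken directly from the packing shown above.

```cpp
#include <cstdint>

// Unpacked view of one 16-bit region word (hypothetical helper type).
struct DecodedRegion {
  std::uint16_t et;  // bits 0-9: transverse energy
  std::uint8_t pos;  // bits 10-13: cluster position within the region
  bool ele;          // bit 14: electron tag
  bool tau;          // bit 15: tau tag
};

// Split a packed region word into its four fields.
DecodedRegion decodeRegion(std::uint16_t bits) {
  DecodedRegion r;
  r.et = bits & 0x3FF;
  r.pos = static_cast<std::uint8_t>((bits >> 10) & 0xF);
  r.ele = (bits & 0x4000) != 0;
  r.tau = (bits & 0x8000) != 0;
  return r;
}
```

For example, a region built with et = 50, pos = 5, and only the electron tag set decodes back to those same four values.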

To get started, get the code and compile it (verified to work on macOS and Linux):

git clone https://github.com/SridharaDasu/CMSMLProjectData.git
cd CMSMLProjectData
c++ *.cpp -o genMLProjectData

Run genMLProjectData to produce data. The --background option is required; its value is the random number seed. Use --write to save the produced data. The --electron and --tau options produce single-electron and single-tau signals; the value of each option specifies the transverse momentum of the particle.

The goal of the ML project is to produce a model with good efficiency (>70%) for identifying 25 GeV objects. The efficiency for 50 GeV objects should be very good (>95%). Background fakes should be 10% or less.
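
One way to check a model against these targets is sketched below. It assumes that efficiency means the fraction of signal events the model flags as signal, and that the fake rate is the fraction of background-only events it flags as signal; that reading, and the names `acceptedFraction` and `meetsTargets`, are assumptions made for this example rather than definitions from the project.

```cpp
#include <cstddef>
#include <vector>

// Fraction of events in a sample that the model flagged as signal.
// Applied to a signal sample this is the efficiency; applied to a
// background-only sample it is the fake rate (assumed definitions).
double acceptedFraction(const std::vector<bool>& flagged) {
  if (flagged.empty()) return 0.0;
  std::size_t accepted = 0;
  for (bool f : flagged) {
    if (f) ++accepted;
  }
  return static_cast<double>(accepted) / static_cast<double>(flagged.size());
}

// Compare measured fractions with the targets stated above:
// >70% at 25 GeV, >95% at 50 GeV, and at most 10% background fakes.
bool meetsTargets(double eff25, double eff50, double fakeRate) {
  return eff25 > 0.70 && eff50 > 0.95 && fakeRate <= 0.10;
}
```

With 256 events per file, a single file gives only a coarse estimate of these fractions, so several files generated with different seeds are likely needed for a stable measurement.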

genMLProjectData --background=232341231 --write=BackgroundRegionData.txt --compare=BackgroundRegionData.ref --dump=BackgroundRegionData.csv
genMLProjectData --background=987654323 --electron=50 --write=ElectronRegionData.txt --compare=ElectronRegionData.ref --dump=ElectronRegionData.csv
genMLProjectData --background=478343223 --tau=50 --write=TauRegionData.txt --compare=TauRegionData.ref --dump=TauRegionData.csv

Reference files for the three cases above, created with the random seeds shown, are in the repository. You can check that your production worked by using the --compare option with the same seeds. You can generate multiple files with different seeds, giving each a unique name (for example, labeled by the seed). Each file holds 256 events. To run over more data, vary the seeds and file names and omit the --compare option.

To produce a CSV file with the data as ints and floats for development work, use --dump (note that --background and --write are always required):

genMLProjectData --background=232341231 --write=BackgroundRegionData.txt --dump=BackgroundRegionData.csv

For assistance contact: [email protected]
