
Multigpu hcal #498

Conversation

vkhristenko

PR description:

This PR allows the HCAL reconstruction to run on a node with multiple GPUs.
All the modules have been updated accordingly, so essentially no protection for the CUDA service is needed anymore.

The only caveat in this PR is that the newly added condition's Record should eventually be moved to CondFormats/DataRecord. For now it sits inside the HCAL MAHI package.

PR validation:

As usual, using the provided executable for HCAL.

@fwyzard fwyzard added the HCAL HCAL-related developments label Jul 4, 2020
@fwyzard

fwyzard commented Jul 6, 2020

@vkhristenko why do we move the pulseOffsets to a new "condition"?
@mariadalfonso how are they handled in the CPU code, as a configuration parameter or as a condition?

@vkhristenko
Author

@fwyzard
I moved it because I see no other way to handle a configuration parameter that is an array/vector generically for the multi-GPU case.

@mariadalfonso

@fwyzard pulseOffsets is an algorithm configuration parameter. We use it to decide whether to fit with 1, 3, or 8 templates.
It's not a detector condition.

My understanding is that the GPU fit is not configurable, so in principle there is no need for all these gymnastics.

@vkhristenko
Author

@mariadalfonso

This is actually not the right way to think about it. This array gives the distances from the sample of interest, and that can be configurable. For instance, if the sample of interest moves from 4 to 3 for whatever reason, all you need to change is the contents of this array, not its length. As for the length, true, it is currently a static parameter, but that could be changed in principle (by removing Eigen even further and using more mapping-style code). See the sketch after this comment for an illustration.

Furthermore, this is not specific to HCAL; I will need the same for ECAL. Whenever you want an array of parameters to be allocated and handled generically for the GPU, I found that going through an ESProducer is the easiest way to make it generic, although you have to write a fair amount of code. I'm happy to change that if there are suggestions on what I should use instead.
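
For illustration, a minimal sketch of the offsets-vs-SOI relationship described above (the helper name and the eight-sample example are assumptions, not code from this PR): the offsets are just sample indices measured relative to the SOI, so moving the SOI changes only the array's contents, not its length.

```cpp
#include <vector>

// Hypothetical helper: build pulse offsets as distances from the sample of interest (SOI).
std::vector<int> makePulseOffsets(int nSamples, int soi) {
  std::vector<int> offsets(nSamples);
  for (int i = 0; i < nSamples; ++i)
    offsets[i] = i - soi;  // distance of sample i from the SOI
  return offsets;
}

// makePulseOffsets(8, 4) yields {-4, -3, -2, -1, 0, 1, 2, 3}
// makePulseOffsets(8, 3) yields {-3, -2, -1, 0, 1, 2, 3, 4}
```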

@fwyzard

fwyzard commented Jul 6, 2020

If it's not configurable but constant, I guess a simple approach is to use a __constant__ C array on the GPU.
If it's configurable, the simplest approach could be to copy it to the GPU on each event (using the caching allocator, the stream from the context, etc.).
@makortel, do you have other suggestions for this?
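
A minimal CUDA sketch of the first option, assuming a fixed set of eight offsets (the array contents, names, and kernel below are illustrative only, not code from this PR):

```cpp
#include <cuda_runtime.h>

constexpr int kNOffsets = 8;
__constant__ int pulseOffsets[kNOffsets];  // lives in constant memory, read-only for kernels

__global__ void applyOffsets(int const* soi, int* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = soi[i] + pulseOffsets[i % kNOffsets];
}

void initPulseOffsets() {
  int const host[kNOffsets] = {-3, -2, -1, 0, 1, 2, 3, 4};
  // one-time host-to-device copy into the constant-memory array
  cudaMemcpyToSymbol(pulseOffsets, host, sizeof(host));
}
```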

@vkhristenko
Author

vkhristenko commented Jul 6, 2020 via email

@fwyzard

fwyzard commented Jul 6, 2020 via email

@mariadalfonso

> @mariadalfonso
>
> This is actually not the right way to think about it. This array gives the distances from the sample of interest, and that can be configurable. For instance, if the sample of interest moves from 4 to 3 for whatever reason, all you need to change is the contents of this array, not its length. As for the length, true, it is currently a static parameter, but that could be changed in principle (by removing Eigen even further and using more mapping-style code).

The SOI is stored in the frame and actually has a dedicated condition for it:
https://github.com/cms-sw/cmssw/blob/7a01de257e0495120f3568fe326659909281c9f6/RecoLocalCalo/HcalRecProducers/src/HBHEPhase1Reconstructor.cc#L472

The pulseOffsets array is a parameter of the multifit templates, not of the HCAL frame, so it's not a detector condition.
Of course you can configure all the parameters you want.
Once this goes into the release and we do a meaningful review, we will harmonise everything.

> Furthermore, this is not specific to HCAL; I will need the same for ECAL. Whenever you want an array of parameters to be allocated and handled generically for the GPU, I found that going through an ESProducer is the easiest way to make it generic, although you have to write a fair amount of code. I'm happy to change that if there are suggestions on what I should use instead.

@makortel

makortel commented Jul 6, 2020

> The problem with the per-event approach is how to make sure things have already been transferred for that device... I guess I could keep track of that somehow.

What I meant was to transfer them for every event.

> Previously I was advised to use the same mechanism as for conditions...

OK, let's keep it like this.

(I still need to look at the code, but) in general, configurable "constants" are currently most effectively treated as conditions data, to avoid transferring them on each event.

In principle, transferring them from the EDProducer is possible as well, e.g. at beginJob() time, but currently there are no helpers for that. And even then the mode of operation would need to resemble ESProducts (transfer to all devices, then somehow check and cache on the first events whether that transfer has completed).
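
For comparison, a minimal sketch of the per-event alternative mentioned above, using plain CUDA calls rather than the framework helpers (the function and parameter names are assumptions): each event re-uploads the offsets on the CUDA stream obtained from that event's context, so the copy is ordered with the rest of the event's work.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-event upload: called from acquire() with the stream from the context.
void uploadPulseOffsets(std::vector<int> const& hostOffsets, int* deviceOffsets, cudaStream_t stream) {
  // asynchronous copy, queued on the same stream as the event's kernels
  cudaMemcpyAsync(deviceOffsets,
                  hostOffsets.data(),
                  hostOffsets.size() * sizeof(int),
                  cudaMemcpyHostToDevice,
                  stream);
}
```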


@makortel makortel left a comment


Overall looks good to me.

edm::ValidityInterval&) override;

private:
edm::ParameterSet const& pset_;

Storing the std::vector<int> would be better.

@@ -239,6 +215,10 @@ void HBHERecHitProducerGPU::acquire(edm::Event const& event,
setup.get<HcalSiPMCharacteristicsRcd>().get(sipmCharacteristicsHandle);
auto const& sipmCharacteristicsProduct = sipmCharacteristicsHandle->getProduct(ctx.stream());

edm::ESHandle<HcalMahiPulseOffsetsGPU> pulseOffsetsHandle;
setup.get<HcalMahiPulseOffsetsGPURecord>().get(pulseOffsetsHandle);
auto const& pulseOffsetsProduct = pulseOffsetsHandle->getProduct(ctx.stream());

I guess we need to do it once we move to CMSSW_11_2_X?


> I guess we need to do it once we move to CMSSW_11_2_X?

The "new code must use ESGetToken" is rather a policy than technical requirement (so "need" is before making a PR to CMSSW master). Technically the migration can be done at any time, the ESGetToken API has been there already for over a year.


OK - I was wondering whether it is technically enforced in 11.2.x.
If it isn't, "before making the PR for master" (or rather "while the PR for master is being reviewed") seems like a good moment :-)

@makortel makortel Jul 8, 2020

It's not enforced (yet) for EDModules because of ~5500 existing calls that need to be migrated first :)

Comment on lines +7 to +12
HcalMahiPulseOffsetsGPU::HcalMahiPulseOffsetsGPU(edm::ParameterSet const& ps)
{
  auto const& values = ps.getParameter<std::vector<int>>("pulseOffsets");
  values_.resize(values.size());
  std::copy(values.begin(), values.end(), values_.begin());
}

Suggested change
-HcalMahiPulseOffsetsGPU::HcalMahiPulseOffsetsGPU(edm::ParameterSet const& ps)
-{
-  auto const& values = ps.getParameter<std::vector<int>>("pulseOffsets");
-  values_.resize(values.size());
-  std::copy(values.begin(), values.end(), values_.begin());
-}
+HcalMahiPulseOffsetsGPU::HcalMahiPulseOffsetsGPU(edm::ParameterSet const& ps) :
+  values_(ps.getParameter<std::vector<int>>("pulseOffsets"))
+{
+}

But I do agree with @makortel: it would be better to pass the std::vector<int> directly rather than the edm::ParameterSet.
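
A sketch of the direction both comments point to, assuming the ESProducer parses its configuration once and keeps the vector itself (the producer class name and the exact code below are illustrative assumptions, not the PR's implementation):

```cpp
// Condition class stores the vector directly instead of referencing the ParameterSet.
HcalMahiPulseOffsetsGPU::HcalMahiPulseOffsetsGPU(std::vector<int> values)
    : values_{std::move(values)} {}

// Hypothetical ESProducer: parse the configuration once, keep a std::vector<int> member.
class HcalMahiPulseOffsetsGPUESProducer : public edm::ESProducer {
public:
  explicit HcalMahiPulseOffsetsGPUESProducer(edm::ParameterSet const& ps)
      : pulseOffsets_{ps.getParameter<std::vector<int>>("pulseOffsets")} {
    setWhatProduced(this);
  }

  std::unique_ptr<HcalMahiPulseOffsetsGPU> produce(HcalMahiPulseOffsetsGPURecord const&) {
    return std::make_unique<HcalMahiPulseOffsetsGPU>(pulseOffsets_);
  }

private:
  std::vector<int> pulseOffsets_;
};
```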

auto const& product =
    product_.dataForCurrentDeviceAsync(cudaStream, [this](HcalMahiPulseOffsetsGPU::Product& product, cudaStream_t cudaStream) {
      // malloc
      cudaCheck(cudaMalloc((void**)&product.values, this->values_.size() * sizeof(int)));

@makortel is there any reason we shouldn't use the caching allocator here as well?


The current caching allocator API does not suit the ESProduct case well, because the lifetime of the memory is not tied to the processing queued on the argument cudaStream_t.

I do have an idea (and an old prototype, if I can still find it) on how to improve on that on top of #412, although given #487 I'd do it a bit differently.
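
In other words, the device memory held by an ESProduct follows the lifetime of the conditions data rather than of any CUDA stream. A hedged sketch of how such a per-device payload might release it (the field name follows the excerpt above; the destructor shown is an assumption, not the PR's actual code):

```cpp
// Per-device payload of the ESProduct: freed when the conditions product itself
// goes away (end of IOV / end of job), not when a particular stream's work finishes.
struct Product {
  ~Product() { cudaFree(values); }
  int* values = nullptr;
};
```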


Maybe I should interpret the silence on the RFCs as "no strong objections" and just go ahead with further prototyping.

@fwyzard fwyzard Jul 8, 2020

I'll do a re-read to remind myself about it once 11.2.0-pre2 Patatrack is out...

@vkhristenko
Author

Superseded by #502.

@vkhristenko vkhristenko closed this Jul 8, 2020
Labels
HCAL HCAL-related developments
4 participants