DeepMark #101

soumith opened this issue Apr 14, 2016 · 125 comments

soumith commented Apr 14, 2016

Hi all,

The reason I've been slow on convnet-benchmarks these days is that I've been working on the side on DeepMark.

I initially wrote convnet-benchmarks to increase competition among frameworks so that we could work towards faster ConvNets, and they served their purpose well. After the release of convnet-benchmarks, multiple frameworks pulled up their socks to speed up ConvNets, with a deep sense of prestige for being on top of these benchmarks. In these two years, we as a community accelerated GPU ConvNets across all frameworks by 4x to 10x, efficiently implementing tricks such as FFT and Winograd, and powered by faster hardware. Alex Krizhevsky, Yangqing Jia, Scott Gray, Nicolas Vasilache, Sander Dieleman, Michael Mathieu, Julien Demouth and many other human compilers helped make this a reality -- looking at the diversity in terms of where each of us work(ed) professionally shows that this kind of acceleration was truly a community effort with a ton of openness, something that is plain awesome! :)
I've also enjoyed reading the deeply technical discussions that take place on convnet-benchmarks (my favorites in recent times: #93, #66, #59).

Moving on, convnet-benchmarks does not accurately capture everything we think of when we say deep learning. We don't cover recurrent nets, video use-cases, speech, NLP, etc. There is a need for such comprehensive benchmarks, especially as the space is getting ready for dedicated hardware chips, multi-GPU and multi-machine frameworks, and more complex use-cases.

I've sat down with a few of you at NIPS and GTC to discuss and freeze the initial round of benchmarks for what I am calling DeepMark. My initial plan was to work on the initial set of benchmark scripts by myself and cover the most popular frameworks, and then let the direction and maintenance of the benchmarks be community-driven. But the breadth of this effort has been overwhelming, to say the least. After careful thought, I've decided that I'll just ask everyone to pitch in for their part of the benchmarks by writing scripts etc., especially as many of you were very receptive to the idea offline.

Here is the initial set of use-cases we want to cover:

Networks

  • Images
  • Video
  • Audio
  • Text

Platform

  • Initially multi-GPU (1 to 4 Titan X cards)
  • However, multi-machine, custom hardware, other GPU cards such as AMD, CPUs, etc. can and should be accommodated; we will work this out after the initial push.

Metrics

  • Round-trip time for 1 epoch of training (we will define an epoch size separately for each network)
  • Maximum batch size that fits (to show and focus on the extra memory consumption of the framework) -- see the measurement sketch below
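
For concreteness, here is a minimal, framework-agnostic sketch of how these two metrics could be collected; `train_step` and `build_and_run` are hypothetical callables that each framework's script would supply, and synthetic Gaussian data stands in for a real dataset:

```python
import time
import numpy as np

def time_one_epoch(train_step, epoch_size, batch_size, input_shape):
    """Round-trip wall-clock time for one epoch on synthetic data."""
    # two reusable synthetic batches, so data generation stays off the clock
    batches = [np.random.randn(batch_size, *input_shape).astype(np.float32)
               for _ in range(2)]
    start = time.perf_counter()
    for i in range(epoch_size // batch_size):
        train_step(batches[i % 2])   # assumed to block until the step has finished
    return time.perf_counter() - start

def max_batch_size(build_and_run, input_shape, upper=4096):
    """Doubling search for the largest batch size that still fits in GPU memory."""
    best, bs = 0, 1
    while bs <= upper:
        try:
            build_and_run(batch_size=bs, input_shape=input_shape)  # assumed to raise on OOM
            best = bs
        except RuntimeError:
            break
        bs *= 2
    return best
```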

Frameworks

Everyone who wants to join in, but I thought an important initial set to cover would be:

  • Caffe
  • Chainer
  • MXNet
  • Neon
  • Theano
  • TensorFlow
  • Torch

Scripts format

Guarantees

  • I will personally, to the best of my abilities, make sure that the benchmarking is fair and unbiased. The hope is that the community at large will watch these and point out / fix mistakes.

Governance

  • The benchmarks will be placed at https://github.com/DeepMark/deepmark and other key community members / organizations who want ownership will be welcome to join in, proposing new benchmarks that become relevant as the field progresses.

Timeframe

  • Initial Release: June 15th, 2016

My hope is that this new set of benchmarks will not only increase competition but will also be beneficial to the community in other ways, serving as common examples to get started, etc.

Let me know what you think :)
Soumith

cc: @hughperkins @f0k @scott-gray @rajatmonga @vrv @benanne @nouiz @Yangqing @tqchen @unnonouno

soumith commented Apr 14, 2016

Oh lastly, a good timeline for this would be to get an initial round of benchmarks by June 15th (since I only gave some of you a heads-up right now)

@daviddao

So awesome and useful. What are the data sets one should benchmark on? ImageNet, CIFAR10? It would also be nice to compare the accuracy of current implementations for each framework (although that would probably be a lot of work).

Smerity commented Apr 15, 2016

For text, I'd hope to expand beyond just RNN character generation. It doesn't capture many of the complexities of other models, such as variable sequence lengths or bidirectional RNNs.

The Attention Sum Reader is a simple architecture (bidirectional GRU + dot product) that currently holds the SotA and could allow for optimizing sequences of different lengths, a major issue in RNNs. The model also comes with three different dataset sizes -- small (Children's Book Test), medium (CNN), and large (Daily Mail) -- which are publicly available.

vrv commented Apr 15, 2016

This is great, thanks for organizing this! One thing I've also been thinking about, like @daviddao, is how to validate that the models are actually computing the same thing -- I've seen some benchmarks elsewhere that have raised personal doubts that the benchmark code written in different frameworks is computing the same function. As part of the benchmark framework, maybe the API could include a way to validate that, given a specified initialization and input, the outputs (forward and backward) are approximately equal. Open to thoughts :). Cheers!
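
A rough sketch of what such a check could look like, assuming each framework dumps its forward outputs and gradients to an `.npz` file under agreed tensor names, with a float64 CPU run as the reference:

```python
import numpy as np

def compare_outputs(reference_file, candidate_file, rtol=1e-3, atol=1e-5):
    """Compare dumped tensors (forward outputs, gradients) across frameworks."""
    ref = np.load(reference_file)    # e.g. produced by a float64 CPU reference run
    cand = np.load(candidate_file)   # produced by the framework under test
    for name in ref.files:
        max_err = np.max(np.abs(ref[name] - cand[name]))
        ok = np.allclose(ref[name], cand[name], rtol=rtol, atol=atol)
        print("%s: %s (max abs err %.2e)" % (name, "OK" if ok else "MISMATCH", max_err))
```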

craffel commented Apr 15, 2016

Nice! I am very excited for this.

I have https://github.com/craffel/lstm_benchmarks, which is an out-of-date benchmark of Theano vs. rnnlib vs. currennt (which, at the time I wrote the benchmarks, were essentially the only options for LSTM). The task was CHIME noisy speech recognition, which has pretty limited adoption, so I would not strongly advocate for adding it as a task. And I assume that rnnlib and currennt shouldn't be included in these benchmarks as they are RNN-only, right?

I'll be happy to contribute to some of the Theano RNN benchmarks once it becomes appropriate to do so.

This is great, thanks for organizing this! One thing I've also been thinking about like @daviddao is how to validate that the models are actually computing the same thing -- I've seen some benchmarks elsewhere that have raised personal doubts that the frameworks are computing the same function.

This would be very cool, but from my own experience with the LSTM benchmark it can be very difficult - you have to make sure literally every hyperparameter is identical, and you effectively can't use any RNGs. Not to say it's impossible, but it would add a lot of overhead to implementing new benchmarks.

@hughperkins

Caveat: per Paul Graham, it's better to go deep and do something very well than to blur one's focus over many things. I worry gently that with too many benchmarks:

  • each benchmark less well maintained
  • more confusing to read

f0k commented Apr 15, 2016

👍

What are the data sets one should benchmark on? ImageNet, CIFAR10?

Training on something like ImageNet would move the focus away from pure computation to fast dataset iteration -- this would be interesting as well, but should probably become a separate benchmark, since not all frameworks actually provide any tools for this. The other extreme would be training on random dummy data (e.g. sampled from a Gaussian), but this makes sense only if we can guarantee the running time does not depend on the input data. So probably we should have some realistic set of inputs for each task, just large enough to fill two batches or so?

As part of the benchmark framework, maybe the API could include a way to validate that given specified initialization and input, the outputs (forward and backward) are approximately equal.

This seems useful. It requires initial model parameters to be dumped in some format and loaded into each framework, but it would help to ensure that all implementations are the same.
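
As a small sketch (assuming NumPy `.npz` as the interchange format and illustrative layer names), the parameters could be generated once and then loaded by every framework's benchmark script:

```python
import numpy as np

rng = np.random.RandomState(1234)   # fixed seed so the dump is reproducible
np.savez("init_params.npz",
         conv1_W=(rng.randn(64, 3, 3, 3) * 0.01).astype(np.float32),
         conv1_b=np.zeros(64, dtype=np.float32))

# Each framework then copies the same tensors into its own parameter storage:
params = np.load("init_params.npz")
conv1_W = params["conv1_W"]
```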

hughperkins commented Apr 15, 2016

This seems useful. It requires initial model parameters to be dumped in some format and loaded into each framework, but it would help to ensure that all implementations are the same.

In the strongest case, weight initialization could be defined precisely as:

  • a precise order of initialization, eg by layer, then by infeatureplane, then by outfeatureplane, then by height, then by width (for example)
  • a precise random function to use (eg mt19937)
  • the exact seed to use
  • (edit: and of course the exact initialization function to use, e.g. sqrt(numberinputs) * 0.1, or similar -- see the sketch below)
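
For illustration only, such a spec might be realized like the sketch below (NumPy's RandomState is MT19937-based; the ordering, seed, shapes and scaling rule here are placeholders for whatever the spec would pin down):

```python
import numpy as np

SEED = 42
LAYER_SHAPES = [("conv1", (64, 3, 3, 3)),     # fixed, agreed-upon ordering
                ("conv2", (128, 64, 3, 3))]

def init_weights():
    rng = np.random.RandomState(SEED)          # MT19937 under the hood
    weights = {}
    for name, shape in LAYER_SHAPES:           # layer by layer, in the agreed order
        fan_in = int(np.prod(shape[1:]))
        # placeholder scaling rule; the spec would fix the exact formula
        weights[name] = (rng.randn(*shape) / np.sqrt(fan_in)).astype(np.float32)
    return weights
```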

f0k commented Apr 15, 2016

In the strongest case, weight initialization could be defined precisely as [...]

I guess getting the same stream of pseudo-random values in all different frameworks is more difficult than importing a set of tensors into all different frameworks. We wouldn't want to exclude candidates from being benchmarked because they fail to implement the same RNG.

hughperkins commented Apr 15, 2016

I guess getting the same stream of pseudo-random values in all different frameworks is more difficult than importing a set of tensors into all different frameworks. We wouldn't want to exclude candidates from being benchmarked because they fail to implement the same RNG.

Having gone through the exact same process, to compare DeepCL with convnetjs, I found it significantly easier to make convnetjs use the exact same weight generator as DeepCL than to load weights from a file: https://github.com/hughperkins/DeepCL/blob/master/prototyping/convnetjs-reference/testconvnet2.js#L143-L201 . It was a long time ago, so I don't remember why. I do remember I initially tried writing weights to a file, though, and I couldn't get it to work as easily as syncing weight generators, for some reason.

f0k commented Apr 15, 2016

I found it significantly easier to make convnetjs use the exact same weight generator as DeepCL, than to load weights from a file

If that's the case, one could of course create the initial weights in a reproducible way and save them to files, so implementers for the different benchmarked frameworks can choose whatever is easiest.
(Note that loading from files has the additional benefit of documenting how to load foreign model parameters into each framework.)
Umm... are we supposed to discuss such details here or over at the deepmark repository?

soumith commented Apr 15, 2016

@daviddao @vrv @hughperkins @f0k for V1, I thought we should just go with synthetic data. It's very hard to set up to-convergence benchmarks, as there are very fine details w.r.t. guaranteeing convergence; for example, some of the models (like googlenetv3) have taken a year to reproduce outside of the paper.

@Smerity In terms of evaluating perf, we can add a bidirectional RNN too. In fact, DeepSpeech2 has bidirectional-RNNs, so that should be sufficient?

@vrv definitely a great idea, but very, very hard and takes a ton of resources. I feel like at least for V1, we should just go with code review + synthetic data.

@craffel Awesome! At the moment I don't have a point of contact for Theano; maybe a combination of you, @f0k and @benanne could work (especially if they're implemented in Lasagne).

craffel commented Apr 15, 2016

(especially if they're implemented in Lasagne).

That would be nice :) though I am personally interested in which of the Theano-based libraries manages to eke out the most performance, since their implementations are non-identical.

Yangqing commented Apr 15, 2016

Thanks @soumith for organizing this effort! I think this would definitely help us advance the field to the next level.

I am also very interested in benchmarking not only the training pipeline, but a wider range of evaluation criteria. The reason is as follows: if I may make a bold claim, I believe that all frameworks will again very quickly converge to the same performance, because there is no fundamental difference between them. What we saw at convnet-benchmarks is that almost everyone is using the same underlying library, and we are effectively benchmarking framework overheads -- something that is good to know, of course, but seems to be overwhelmed by other factors, such as ease of use etc.

Given the wide attention this benchmark gets, I think it would be great if we can draw attention to some of the more practical issues, such as small batch sizes at deployment time -- several frameworks (including some non-open-source production systems I've worked on) have historically ignored this, and I think it is worthwhile to invite people to invest more in this direction.

I don't have a perfect idea on this, of course. One thing we can do is to simply benchmark different batch sizes, but a more complex, and potentially more useful, way is probably to set up a harness that simulates requests generated from a Poisson distribution, with latency requirements, and see whether frameworks can address that in an optimal fashion -- this might be too application-specific, though. Just my 2 cents.
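
A toy sketch of such a harness, with a hypothetical `infer` callable for single-request (batch size 1) inference and illustrative rate/latency numbers:

```python
import time
import numpy as np

def serve_benchmark(infer, rate_hz=100.0, duration_s=10.0, latency_budget_s=0.010):
    """Replay Poisson-distributed requests and report latency percentiles."""
    rng = np.random.RandomState(0)
    latencies, t = [], time.perf_counter()
    end_time = t + duration_s
    while t < end_time:
        t += rng.exponential(1.0 / rate_hz)   # next Poisson arrival time
        while time.perf_counter() < t:        # simplification: one request at a time,
            pass                              # so a slow infer() delays later arrivals
        start = time.perf_counter()
        infer()                               # single-request inference
        latencies.append(time.perf_counter() - start)
    p50, p99 = np.percentile(latencies, [50, 99])
    within = np.mean(np.array(latencies) <= latency_budget_s)
    print("p50 %.1f ms, p99 %.1f ms, %.0f%% within budget"
          % (p50 * 1e3, p99 * 1e3, within * 100))
```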

(Also adding @ajtulloch to the conversation. Andrew first raised this point when we were discussing offline.)

moloned commented Apr 15, 2016

How about per-pixel scene labelling and optical flow?

Regards

-David

tqchen commented Apr 15, 2016

I agree that we should evaluate more aspects. Some of them are already covered in this proposal, for example:

  • Memory consumption
  • Parallelization and Scheduling overhead

One of the most important factors, the trade-off between ease of use and optimization, is unfortunately not easy to benchmark, as everyone has their own taste.

What @Yangqing suggested is more about measuring perf for production and serving pipelines, which could be a whole new area of directions, as this benchmark is primarily about training. One alternative could be a deep-serving benchmark under the DeepMark organization, dedicated to this topic.

moloned commented Apr 15, 2016

It would be interesting to log the power dissipation in each testcase, as well as fps, memory BW, FLOPS etc.

@hughperkins

It would be interesting to log the power dissipation in each testcase

I like this idea. A Titan draws 250 watts peak (I think?). Running 24 hours a day for a year, 250 watts comes to about ~600 USD, which is in the same order of magnitude as the purchase price.

And power dissipation is going to become the main bottleneck plausibly in years to come. ("And over here we have our farm of 1000 Titan 2026s, and over there is the 20MW pebble bed we are using to power them" :-) )

@forresti

@Yangqing

"if I may make a bold claim, I believe that all frameworks will again very quickly converge to the same performance, because there is no fundamental difference between them."

Agreed. Soumith's current benchmarks are useful, but they mainly evaluate "who can make the thinnest wrapper around cuDNN, Neon, or similar?"

It would be useful to benchmark implementations of emerging algorithms for which tuned libraries may not yet exist -- certain versions of LSTMs and RNNs, for instance.

soumith commented Apr 16, 2016

@forresti yeah -- for historical context, it didn't use to be like that, but it is like that now. I think for LSTMs and RNNs, a lot of perf is still up for grabs.

hughperkins commented Apr 16, 2016

Agreed. Soumith's current benchmarks are useful, but they mainly evaluate "who can make the thinnest wrapper around cuDNN, Neon, or similar?"

To be fair, cuDNN and Neon are competing with each other. The OpenCL implementations mostly copy the Caffe CUDA implementation of im2col, as far as I know :-D but have different performance from cuDNN. There is also 16-bit vs 32-bit.

(Edit: by the way, the cuDNN vs Neon comparison is exactly what comes to mind about power consumption. I don't know if it's still the case, but as far as I know cuDNN used to run cooler than Neon, and it'd be useful to be able to see this in the results.)

forresti commented Apr 16, 2016

@hughperkins Good point. I didn't mean to imply that there isn't a diverse array of low-level computational libraries for DNNs.

To tune up my comment a bit: "When doing speed/efficiency benchmarks, it's hard to avoid conflating low-level computational libraries (cuDNN, Neon, various OpenCL efforts, ...) and higher-level frameworks (Caffe, Torch, Tensorflow, Theano, ...)."

@scott-gray

I would say that convolution is far from a solved problem. I still have a long list of optimizations I want to make. The biggest area to explore is how to best leverage lower precision without sacrificing accuracy. The obvious target there would be xnor nets but maybe a bit more precision is best for the highest levels of accuracy. The 4x int8 performance that Pascal will soon have (unfortunately not in P100 though) is a tantalizing format to target. And also obviously the native fp16 support.

Another area is better efficiency at smaller batch sizes. I have some brand new work there that I'd like to show off. This is important for both inference and scaling to many nodes.

Power comparisons are useful but only when looking at implementations that have the same computational throughput. Or just use some kind of flops/watt metric. With my newer kernels I'm getting very good at squeezing the most out of cache utilization and hence I'm hitting and maintaining higher boost clocks (while using smaller and more versatile tiles).

As for the frameworks, the big area to focus on is graph based optimizations. Maximizing data locality (compounding), memory allocation vs compute trade-offs, auto-parallelizing independent work across streams and gpus, and lots of other creative things computational graphs greatly simplify.

As for synthetic vs real data and parameters.. In fp32 I think only the distribution matters for performance comparisons. But in lower precision like fp16 it's very easy to saturate or underflow with synthetic data which leads to far higher performance than is warranted. At the very least you want to account for the fan-in when setting weight magnitudes (Kaiming, Xavier, etc). Batch norm helps a lot here too. Basically you should be able to prove that you can train with the params you benchmark with.

At the end of the day we care about speed and usability. I think these benchmarks should make both pretty clear. For usability you'll be able to inspect the script to see who has the cleanest syntax and solves the most problems for you without extra steps.

@forresti

@scott-gray That all sounds great. :)

@hughperkins

just use some kind of flops/watt metric

Well, the ideal would be joules per batch. But I think this will be tricky to measure. Might need some specialized hardware device, that sits on the power bus?

scott-gray commented Apr 16, 2016

Maybe it wouldn't be quite so tricky. You'd just need to collect some running average of the on chip power stats during the execution of the epoch. Something like this would give you realtime stats:

nvidia-smi -i 1 --loop-ms=333 --format=csv,noheader --query-gpu=power.draw,clocks.gr,temperature.gpu,fan.speed,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown

Or even better tie your benchmark script directly into NVML queries:
https://developer.nvidia.com/nvidia-management-library-nvml

But I guess you'd want to be running these queries continuously so maybe as a separate process would be better. You'd just need to synchronize the collection with the execution of the network. Just a small bit of shell scripting should achieve this.

@hughperkins

Or even better tie your benchmark script directly into NVML queries:
https://developer.nvidia.com/nvidia-management-library-nvml

Interesting. Seems it's just a C interface, so accessible using FFI etc.

nvmlDeviceGetPowerUsage(nvmlDevice_t device, unsigned int *power);

@scott-gray

And python bindings can be found here:
https://pypi.python.org/pypi/nvidia-ml-py
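
For example, a benchmark could sample power in a background thread while a workload runs and integrate it into joules per epoch or per batch; a sketch assuming the nvidia-ml-py bindings (`pynvml`) and GPU index 0:

```python
import threading
import time
import pynvml  # from the nvidia-ml-py package

def measure_energy(workload, device_index=0, interval_s=0.1):
    """Run workload() while sampling GPU power draw; return (seconds, joules)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler)
    thread.start()
    start = time.perf_counter()
    workload()                                   # e.g. one training epoch
    elapsed = time.perf_counter() - start
    stop.set()
    thread.join()
    pynvml.nvmlShutdown()
    joules = (sum(samples) / max(len(samples), 1)) * elapsed  # average watts * seconds
    return elapsed, joules
```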

@scott-gray

But, it's worth pointing out that the boost clock is already tightly coupled with these real-time power and temperature measurements so the overall timings should be reflective of this. So perhaps it's not worth the effort.

gnawice commented May 17, 2016

I'd hope this also supports CPU training/testing.
My mojo-cnn project seems to be relatively fast on CPU for smaller problems, and I know Intel is still trying to push CPU use with their DNN wrappers for MKL to better compete with GPUs.

@andravin

@jbalma yes I find strong scaling the training problem to be very interesting, and of course that is an active area of research. I hope you find a good forum for pursuing it. Let me know if I can help.

I would also point out, with regard to profiling, that an external profiling tool will of course not be able to correlate timings with the graph structure of the network, so it cannot group statistics by layer, which is essential for understanding the performance of the network. I think all successful machine learning frameworks will be instrumented for profiling eventually, and TensorFlow appears to be taking the lead there. Now imagine the power of being able to compare network performance graphs across different frameworks, because they all were designed to output profiler stats in the same format.

hughperkins commented May 19, 2016

I think all successful machine learning frameworks will be instrumented for profiling eventually, and tensorflow appears to be taking the lead there

cltorch profiling and instrumentation :-)

f0k commented May 19, 2016

I think all successful machine learning frameworks will be instrumented for profiling eventually, and tensorflow appears to be taking the lead there

cltorch profiling and instrumentation :-)

THEANO_FLAGS=profile=1 works since at least 2013. Caveat: it profiles both CPU and GPU operations, so it times things "from outside"; this means it needs to be combined with CUDA_LAUNCH_BLOCKING=1 to make GPU operations synchronous, and hence it cannot profile a real-life program as-is. We could time kernels with CUDA events instead, but that would limit timings to things happening in a CUDA stream.

I would also point out with regards to profiling, an external profiling tool will of course not be able to correlate timings with the graph structure of the network, so it cannot group statistics by layer, which is essential for understanding the performance of the network.

If we want metrics per layer (not per operation, or per operation type), we'll need to figure out how/where to introduce timing checkpoints in the graph without hindering optimization. That's a fundamental problem for frameworks based on computation graphs: They are free to rearrange operations across layer boundaries, so you cannot necessarily correlate the optimized graph with the network layers at all.

nouiz commented May 19, 2016

THEANO_FLAGS=profile=1 works since at least 2013. [...] We could time kernels with CUDA events instead, but that would limit timings to things happening in a CUDA stream.

As the CPU nodes are already blocking, we could just add this as a second timing column that would be non-zero only for the new GPU back-end.

If we want metrics per layer (not per operation, or per operation type), we'll need to figure out how/where to introduce timing checkpoints in the graph without hindering optimization.

We could reject optimizations that merge nodes from two layers, to get exact per-layer timing. Otherwise we could just let layers be merged in the profiler.

As always, we can always do better!

hughperkins commented May 19, 2016

Or, we could use the existing per-layer timings method, which seems to me a reasonably rigorous way of doing it, and avoids fudge-factors based on differing opinions on how to do in-situ timings, like:

  • do we call synchronize() between layers?
  • wallclock timings? kernel profiling timings?

I'm sure there are a bunch more of such questions....

f0k commented May 19, 2016

We could reject optimization that merge node from 2 layers to have exact per layer timing.

The event-based timing could be realized as Ops that insert events into the CUDA stream. If inserted between layers, these Ops would naturally block any optimizations across layers if existing optimizers are not adapted to ignore them. But again, this would prevent real in-situ timings, because the graph might not be fully optimized.

I'm sure there are a bunch more of such questions....

Yes... in-situ timings would be the best, but depending on the framework, we cannot get per-layer timings without changing the process, and maybe not even distinguish the forward and backward pass, just training (fw+bw) and inference (fw only). For the start, we should probably have per-epoch or per-batch timings only.

soumith commented May 19, 2016

On the Caffe side, Mike Houston and team at NVIDIA have agreed to do the benchmark scripts. So that concludes all the volunteers for each framework :)

naibaf7 commented May 19, 2016

@soumith
I might do some adaptations to the Caffe scripts for OpenCL/Greentea-libDNN if required :)

soumith commented May 19, 2016

Sure :)

naibaf7 commented May 26, 2016

@soumith
I know we're on the edge of switching to a new benchmark system, but it would be great if you could give this a shot: #106
Thanks :)

hughperkins commented May 26, 2016

Opinion: the new system should be a collection of specialized benchmark suites, each with its own repo:

  • images: owner Soumith, basically this one
  • video: owner: ???
  • nlp: owner: ???

... etc. Therefore, I see no reason for this repo disappearing in any way, shape or form, personally :-)

hughperkins commented May 30, 2016

forresti wrote:

Agreed. Soumith's current benchmarks are useful, but they mainly evaluate "who can make the thinnest wrapper around cuDNN, Neon, or similar?"

On a somewhat related note: GEMM seems to be at the heart of convolution -- it's used by FFT, Winograd, im2col, and presumably also implicit GEMM. So, could it be worth having some GEMM benchmarks? For OpenCL, there are at least 3 GEMM implementations I know of: clBLAS, clBLAST, ViennaCL.

(Edit: and for CUDA, there's at least: cublas, and the sass gemm implementation that is part of neon)

bhack commented May 30, 2016

@hughperkins Do you mean something like https://github.com/dividiti/gemmbench?

@hughperkins

@bhack

Possibly. I'm not sure what I mean to be honest. I'm not sure that's quite exactly what I was thinking of. I was thinking of something more like the simple tables in convnet-benchmarks, but comparing these 5 or so GEMM implementations. Presumably, the actual workloads, if we are targeting convolution, should be workloads sampled by running a forwards-backwards batch through a few common convolutional models, such as those currently in convnet-benchmarks.
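
As a rough illustration of what such a table could be built from, here is a sketch of a GEMM timing loop; the (M, N, K) shapes are placeholders standing in for sizes sampled from real convolutional layers, and NumPy's CPU BLAS is just a stand-in for whichever GEMM library is under test:

```python
import time
import numpy as np

# placeholder (M, N, K) problem sizes; real ones would be sampled from conv layers
SHAPES = [(3025, 96, 363), (729, 128, 1200), (169, 384, 2304)]

def bench_gemm(m, n, k, repeats=10):
    a = np.random.randn(m, k).astype(np.float32)
    b = np.random.randn(k, n).astype(np.float32)
    np.dot(a, b)                                  # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        np.dot(a, b)
    secs = (time.perf_counter() - start) / repeats
    gflops = 2.0 * m * n * k / secs / 1e9
    return secs, gflops

for m, n, k in SHAPES:
    secs, gflops = bench_gemm(m, n, k)
    print("M=%d N=%d K=%d: %.3f ms, %.1f GFLOP/s" % (m, n, k, secs * 1e3, gflops))
```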

naibaf7 commented May 30, 2016

@hughperkins
I actually think GEMM will be a bit less at the heart of convolutions as we move forward -- at least not pure GEMM as implemented in BLAS libraries.
There are many nice optimization and memory-coalescing possibilities when implementing kernel fusion, so that only the GEMM core remains (register-blocked GEMM, at most shared-memory GEMM).
I am doing these optimizations on libDNN right now to get closer to cuDNN scores.

Also, what makes benchmarking the GEMMs more difficult is that their performance differs a lot on a variety of devices. And then the GEMMs can also be autotuned, which works more or less depending on what BLAS and architecture combination is used.

I think benchmarking on the network and layer level is enough for the DeepMark project.

@hughperkins

There are many nice optimization and memory-coalescing possibilities when implementing kernel fusion, so that only the GEMM core remains (register-blocked GEMM, at most shared-memory GEMM).
I am doing these optimizations on libDNN right now to get closer to cuDNN scores.

Ah, sounds good :-)

hughperkins commented Jun 14, 2016

Hi Andrew, So ... guess what? I wrote a correctness checking script :-) And ... ironically enough... it targets Neon :-D Because I need it for testing the OpenCL port.

It outputs for each layer:

  • average forward time
  • average backwards time
  • average delta between each output value and a CPU-calculated reference value, assumed correct, since Python uses float64 (I think?)
  • ditto for weights gradient
  • ditto for input gradient

Example results, for neon on Titan X, using vgga model:

Maxwell kernels, Winograd, SASS:

Layer 0: fprop=0.004 bprop=0.036 eps_O=3e-07 eps_gradW=2e-03 eps_gradI=4e-05 (note: direct, not winograd)
Layer 1: fprop=0.012 bprop=0.032 eps_O=1e-05 eps_gradW=1e-02 eps_gradI=3e-05
Layer 2: fprop=0.009 bprop=0.021 eps_O=5e-05 eps_gradW=2e-03 eps_gradI=6e-05
Layer 3: fprop=0.017 bprop=0.036 eps_O=1e-04 eps_gradW=3e-03 eps_gradI=8e-05
Layer 4: fprop=0.007 bprop=0.016 eps_O=8e-05 eps_gradW=6e-04 eps_gradI=2e-04
Layer 5: fprop=0.015 bprop=0.031 eps_O=9e-05 eps_gradW=8e-04 eps_gradI=1e-04
Layer 6: fprop=0.005 bprop=0.010 eps_O=8e-05 eps_gradW=4e-04 eps_gradI=9e-05
Layer 7: fprop=0.005 bprop=0.010 eps_O=8e-05 eps_gradW=4e-04 eps_gradI=9e-05

Kepler kernels, Direct, CUDA:

Layer 0: SKIPPING
Layer 1: fprop=0.032 bprop=0.158 eps_O=9e-06 eps_gradW=1e-03 eps_gradI=2e-05
Layer 2: fprop=0.033 bprop=0.110 eps_O=2e-05 eps_gradW=3e-04 eps_gradI=2e-05
Layer 3: fprop=0.067 bprop=0.222 eps_O=2e-05 eps_gradW=3e-04 eps_gradI=3e-05
Layer 4: fprop=0.033 bprop=0.111 eps_O=3e-05 eps_gradW=1e-04 eps_gradI=9e-05
Layer 5: fprop=0.066 bprop=0.222 eps_O=4e-05 eps_gradW=9e-05 eps_gradI=5e-05
Layer 6: fprop=0.016 bprop=0.053 eps_O=4e-05 eps_gradW=2e-05 eps_gradI=3e-05
Layer 7: fprop=0.016 bprop=0.053 eps_O=4e-05 eps_gradW=2e-05 eps_gradI=3e-05

Kepler kernels, Direct, OpenCL:

Layer 0: SKIPPING
Layer 1: fprop=0.039 bprop=0.173 eps_O=1e-05 eps_gradW=8e-04 eps_gradI=1e-05
Layer 2: fprop=0.039 bprop=0.124 eps_O=1e-05 eps_gradW=4e-04 eps_gradI=2e-05
Layer 3: fprop=0.073 bprop=0.237 eps_O=3e-05 eps_gradW=2e-04 eps_gradI=4e-05
Layer 4: fprop=0.038 bprop=0.125 eps_O=2e-05 eps_gradW=5e-05 eps_gradI=2e-05
Layer 5: fprop=0.073 bprop=0.238 eps_O=3e-05 eps_gradW=1e-04 eps_gradI=4e-05
Layer 6: fprop=0.022 bprop=0.069 eps_O=5e-05 eps_gradW=2e-05 eps_gradI=7e-05
Layer 7: fprop=0.021 bprop=0.067 eps_O=5e-05 eps_gradW=2e-05 eps_gradI=7e-05

https://github.com/hughperkins/neon-benchmarks

(Edited with the layer 0 results for Maxwell CUDA kernels, summary page at https://github.com/hughperkins/neon-benchmarks/blob/master/results/vgga_summary.md )

hughperkins commented Jun 21, 2016

For the benchmarks with correctness checker, added stride and padding, so it can handle eg alexnet now:

https://github.com/hughperkins/neon-benchmarks/blob/master/results/alexnet_summary.md

Nervana Neon CUDA/SASS Winograd kernels for Maxwell

neon_maxwell

Layer 0: fprop=0.003 bprop=0.012 eps_O=4e-06 eps_gradW=5e-04 eps_gradI=2e-06 (note: direct, not winograd)
Layer 1: fprop=0.009 bprop=0.020 eps_O=9e-06 eps_gradW=4e-04 eps_gradI=3e-05 (note: direct, not winograd)
Layer 2: fprop=0.003 bprop=0.006 eps_O=5e-05 eps_gradW=3e-04 eps_gradI=8e-05
Layer 3: fprop=0.004 bprop=0.007 eps_O=5e-05 eps_gradW=7e-04 eps_gradI=1e-04
Layer 4: fprop=0.003 bprop=0.005 eps_O=1e-04 eps_gradW=3e-04 eps_gradI=6e-05

Nervana Neon Kepler direct kernels, in CUDA

neon_kepler

Layer 0: SKIPPED
Layer 1: fprop=0.015 bprop=0.046 eps_O=9e-06 eps_gradW=2e-04 eps_gradI=3e-05
Layer 2: fprop=0.008 bprop=0.023 eps_O=2e-05 eps_gradW=3e-05 eps_gradI=5e-05
Layer 3: fprop=0.010 bprop=0.030 eps_O=4e-05 eps_gradW=3e-05 eps_gradI=2e-05
Layer 4: fprop=0.007 bprop=0.020 eps_O=3e-05 eps_gradW=2e-05 eps_gradI=2e-05

OpenCL port of Nervana Neon Kepler direct kernels

neoncl_direct

Layer 0: SKIPPED
Layer 1: fprop=0.016 bprop=0.049 eps_O=1e-05 eps_gradW=1e-04 eps_gradI=6e-05
Layer 2: fprop=0.008 bprop=0.024 eps_O=2e-05 eps_gradW=3e-05 eps_gradI=3e-05
Layer 3: fprop=0.010 bprop=0.031 eps_O=4e-05 eps_gradW=3e-05 eps_gradI=3e-05
Layer 4: fprop=0.007 bprop=0.021 eps_O=2e-05 eps_gradW=2e-05 eps_gradI=2e-05

However, there's still a couple of things we'd want, if we wanted to generalize this to other networks:

  • some way of running the other frameworks from python, and / or
  • some way of transmitting the tensors to the other frameworks

Actually, torch can run from python, as can theano, tensorflow, mxnet (I think?), caffe, DeepCL, chainer. So maybe python is all that is needed???

hughperkins commented Jun 22, 2016

Andrew, hmmm, just noticed, this is figure 4 in your paper. Hadn't noticed that before :-D Well, I hadn't read the paper earlier... I see that you are using element-wise max though. I think this might be sensitive to outliers, and also larger for layers with more output values? Maybe median or average, possibly also with standard deviation, is good? (edited to spell median with a 'd' and an 'i' ...)

f0k commented Aug 24, 2016

Possibly relevant: Fathom: Reference Workloads for Modern Deep Learning Methods

Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.

bhack commented Sep 27, 2016

See also Baidu Research DeepBench

@pedropgusmao

Hi everyone,
we will soon be benchmarking an NVIDIA DGX-1 and we are currently looking for ways to do it effectively. So if you have ever wondered how well the DGX-1 would perform on a specific task, let me know which. We will release all results online.

For now, we have run some classic benchmarks with a K40/Maxwell Titan X/GTX 1080/ Pascal Titan X, using various architectures and frameworks. Results are available here.
Again, all suggestions on how to benchmark the DGX-1 are welcome. Thanks

@Yangqing

@pedropgusmao from the curve it seems that you basically need to increase the workspace limit in Caffe:

https://github.com/BVLC/caffe/blob/80f44100e19fd371ff55beb3ec2ad5919fb6ac43/src/caffe/layers/cudnn_conv_layer.cpp#L113

The value was chosen in order to support all platforms (old and new), but it usually turns out to be slow for the most recent cuDNN on the most recent hardware.

@pedropgusmao

@Yangqing , thank you very much. Do you have any suggested values for workspace_limit_bytes considering both those GPUs and the DGX-1? Again, thanks.

vrv commented Feb 28, 2017

@pedropgusmao take a look at https://www.tensorflow.org/performance/performance_guide with TF 1.0.0. I believe we are also working on benchmarks for some of these models, so you'll have comparison code at some point soon.

@pedropgusmao

@vrv, thanks a lot! We will modify our code to follow those suggestions. We look forward to seeing your results.
