From 832288f83cd6bcfa5b1aaeb53ef61c3b038da861 Mon Sep 17 00:00:00 2001
From: nick black
Date: Wed, 4 Sep 2024 11:42:07 -0400
Subject: [PATCH 1/3] chapter 1 section 0 edits

Changed up rather more content than I would like here due to style, but it's the beginning of the book; it ought to punch a bit harder imho. Feel free to reject any changes you don't like.

I did think a few things needed fixing: microarchitecture improvements did a lot, if not as much as frequency boosts. Moore's law has *not* continued according to its original track, but slowed. From Wikipedia (ugh):

Microprocessor architects report that semiconductor advancement has slowed industry-wide since around 2010, below the pace predicted by Moore's law.[17] Brian Krzanich, the former CEO of Intel, announced, "Our cadence today is closer to two and a half years than two."[103] Intel stated in 2015 that improvements in MOSFET devices have slowed, starting at the 22 nm feature width around 2012, and continuing at 14 nm.[104] Pat Gelsinger, Intel CEO, stated at the end of 2023 that "we're no longer in the golden era of Moore's Law, it's much, much harder now, so we're probably doubling effectively closer to every three years now, so we've definitely seen a slowing."

I mean, we've definitely seen slowing there; I don't see how you can argue otherwise. Compiler improvements haven't achieved much, unfortunately.

I would maybe mention the Sprangle/Carmean 2002 paper "Increasing Processor Performance by Implementing Deeper Pipelines." Maybe mention Cerebras for transistor count fun.

There weren't any grammatical issues in this section iirc, so seriously, you can dump all the wording changes if you'd rather not admit them.
---
 chapters/0-Preface/0-2 Preface.md | 2 +-
 chapters/1-Introduction/1-0 Introduction.md | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/chapters/0-Preface/0-2 Preface.md b/chapters/0-Preface/0-2 Preface.md
index f94b5987be..b255569db2 100644
--- a/chapters/0-Preface/0-2 Preface.md
+++ b/chapters/0-Preface/0-2 Preface.md
@@ -18,7 +18,7 @@ I started this book with a simple goal: educate software developers to better un
 When I was taking my first steps in performance engineering, the only good sources of information on the topic were software developer manuals, which are not what mainstream developers like to read. Frankly, I wish I had this book when I was trying to learn low-level performance analysis. In 2016 I started sharing things that I learned on my blog, and received some positive feedback from my readers. Some of them suggested I aggregate this information into a book. This book is their fault.
-Many people have asked me why I decided to self-publish the book. In fact, I initially tried to pitch it to several reputable publishers, but they didn't see the financial benefits of making such a book. However, I really wanted to write it, so I decided to do it anyway. In the end, it turned out quite well, so I decided to self-publish the second edition also.
+Many people have asked me why I decided to self-publish the book. In fact, I initially tried to pitch it to several reputable publishers, but they didn't see the financial benefits of making such a book. However, I really wanted to write it, so I decided to do it anyway. In the end, it turned out quite well, so I decided to self-publish the second edition as well.
 The first edition was released in November 2020. It was well-received by the community, but I also received a lot of constructive criticism. The most popular feedback was to include exercises for experimentation. Some readers complained that it was too focused on Intel CPUs and didn't cover other architectures like AMD, ARM, etc. Other readers suggested that I should cover system performance, not just CPU performance. The second edition expands in all these and many other directions. It came out to be twice as big as the first book.
diff --git a/chapters/1-Introduction/1-0 Introduction.md b/chapters/1-Introduction/1-0 Introduction.md
index 018729aeb3..74fd49e417 100644
--- a/chapters/1-Introduction/1-0 Introduction.md
+++ b/chapters/1-Introduction/1-0 Introduction.md
@@ -1,18 +1,18 @@
 # Introduction {#sec:chapter1}
-They say, "Performance is king". It was true a decade ago, and it certainly is now. According to [@Domo2017], in 2017, the world has been creating 2.5 quintillion[^1] bytes of data every day, and as predicted in [@Statista2024], it will reach 400 quintillion bytes per day in 2024. In our increasingly data-centric world, the growth of information exchange fuels the need for both faster software and faster hardware.
+Performance is king: this was true a decade ago, and it certainly is now. According to [@Domo2017], in 2017, the world was creating 2.5 quintillion[^1] bytes of data every day. [@Statista2024] predicts 400 quintillion bytes per day in 2024. In our increasingly data-centric world, the growth of information exchange requires both faster software and faster hardware.
-Software programmers have had an "easy ride" for decades, thanks to Moore’s law. It used to be the case that some software vendors preferred to wait for a new generation of hardware to speed up their software products and did not spend human resources on making improvements in their code. By looking at Figure @fig:50YearsProcessorTrend, we can see that single-threaded[^2] performance growth is slowing down. From 1990 to 2000, single-threaded performance grew by a factor of approximately 25 to 30 times based on SPECint benchmarks. The increase in CPU frequency was the key factor driving performance growth.
+Software programmers have had an "easy ride" for decades, thanks to Moore’s law. Software vendors could rely on new generations of hardware to speed up their software products, even if they did not spend human resources on making improvements. By looking at Figure @fig:50YearsProcessorTrend, we can see that single-threaded[^2] performance growth is slowing down. From 1990 to 2000, single-threaded performance on SPECint benchmarks increased by a factor of approximately 25 to 30, driven largely by higher CPU frequencies and improved microarchitecture.
 ![50 Years of Microprocessor Trend Data. *© Image by K. Rupp via karlrupp.net*. Original data up to the year 2010 was collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2021 by K. Rupp.](../../img/intro/50-years-processor-trend.png){#fig:50YearsProcessorTrend width=100%}
-However, from 2000 to 2010, single-threaded CPU performance growth was more modest compared to the previous decade (approximately 4 to 5 times). Clock speed stagnated due to a combination of power consumption, heat dissipation challenges, limitations in voltage scaling (Dennard Scaling[^3]), and other fundamental problems. Despite slower clock speed improvements, architectural advancements continued, including better branch prediction, deeper pipelines, larger caches, and more efficient execution units.
+Single-threaded CPU performance growth was more modest from 2000 to 2010 (a factor between four and five). Clock speeds topped out around 4GHz due to power consumption, heat dissipation challenges, limitations in voltage scaling (Dennard Scaling[^3]), and other fundamental problems. Architectural advancements continued: better branch prediction, deeper pipelines, larger caches, prefetching, and more efficient execution units.
-From 2010 to 2020, single-threaded performance grew only by about 2 to 3 times. During this period, CPU manufacturers began to focus more on multi-core processors and parallelism rather than solely increasing single-threaded performance.
+From 2010 to 2020, single-threaded performance grew only by a factor between two and three. Multicore processors entered the mainstream during this decade, as did simultaneous multithreading.
-The original interpretation of Moore's law is still standing, as transistor count in modern processors maintains its trajectory. For instance, the number of transistors in Apple chips grew from 16 billion in M1 to 20 billion in M2, to 25 billion in M3, to 28 billion in M4 in a span of roughly four years. The growth in transistor count enables manufacturers to add more cores to a processor. As of 2024, you can buy a high-end server processor that will have more than 100 logical cores on a single CPU socket. This is very impressive, unfortunately, it doesn't always translate into better performance. Very often, application performance doesn't scale with extra CPU cores.
+Transistor counts continue to increase in modern processors. Apple's M1 shipped in 2020 with 16 billion transistors. M2 made use of 20, M3 included 25, and 2024's M4 employs 28 billion transistors, a seventy-five percent increase over four years. The growth in transistor count enables manufacturers to add more cores to a processor. Intel's Sierra Forest processors are expected to boast a formidable 288 cores per socket. This is very impressive. Unfortunately, it doesn't always translate into better performance. Very often, application performance doesn't scale with extra CPU cores.
-When it's no longer the case that each hardware generation provides a significant performance boost, we must start paying more attention to how fast our code runs. When seeking ways to improve performance, developers should not rely on hardware. Instead, they should start optimizing the code of their applications.
+As it's no longer the case that each hardware generation provides a significant performance boost, we must start paying more attention to our code's efficiency. When seeking ways to improve performance, developers should not rely on hardware. Instead, they should start optimizing the code of their applications.
 > “Software today is massively inefficient; it’s become prime time again for software programmers to get really good at optimization.” - Marc Andreessen, the US entrepreneur and investor (a16z Podcast)

From 8f1393025b9178c8e9b51c526b7518c6f427c968 Mon Sep 17 00:00:00 2001
From: Denis Bakhvalov
Date: Sat, 7 Sep 2024 16:43:14 -0400
Subject: [PATCH 2/3] Denis fixes

---
 chapters/1-Introduction/1-0 Introduction.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/chapters/1-Introduction/1-0 Introduction.md b/chapters/1-Introduction/1-0 Introduction.md
index 74fd49e417..5f02a8889a 100644
--- a/chapters/1-Introduction/1-0 Introduction.md
+++ b/chapters/1-Introduction/1-0 Introduction.md
@@ -2,17 +2,17 @@
 Performance is king: this was true a decade ago, and it certainly is now. According to [@Domo2017], in 2017, the world was creating 2.5 quintillion[^1] bytes of data every day. [@Statista2024] predicts 400 quintillion bytes per day in 2024. In our increasingly data-centric world, the growth of information exchange requires both faster software and faster hardware.
-Software programmers have had an "easy ride" for decades, thanks to Moore’s law. Software vendors could rely on new generations of hardware to speed up their software products, even if they did not spend human resources on making improvements. By looking at Figure @fig:50YearsProcessorTrend, we can see that single-threaded[^2] performance growth is slowing down. From 1990 to 2000, single-threaded performance on SPECint benchmarks increased by a factor of approximately 25 to 30, driven largely by higher CPU frequencies and improved microarchitecture.
+Software programmers have had an "easy ride" for decades, thanks to Moore’s law. Software vendors could rely on new generations of hardware to speed up their software products, even if they did not spend human resources on making improvements in their code. This strategy doesn't work any longer. By looking at Figure @fig:50YearsProcessorTrend, we can see that single-threaded[^2] performance growth is slowing down. From 1990 to 2000, single-threaded performance on SPECint benchmarks increased by a factor of approximately 25 to 30, driven largely by higher CPU frequencies and improved microarchitecture.
 ![50 Years of Microprocessor Trend Data. *© Image by K. Rupp via karlrupp.net*. Original data up to the year 2010 was collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2021 by K. Rupp.](../../img/intro/50-years-processor-trend.png){#fig:50YearsProcessorTrend width=100%}
-Single-threaded CPU performance growth was more modest from 2000 to 2010 (a factor between four and five). Clock speeds topped out around 4GHz due to power consumption, heat dissipation challenges, limitations in voltage scaling (Dennard Scaling[^3]), and other fundamental problems. Architectural advancements continued: better branch prediction, deeper pipelines, larger caches, prefetching, and more efficient execution units.
+Single-threaded CPU performance growth was more modest from 2000 to 2010 (a factor between four and five). Clock speeds topped out around 4GHz due to power consumption, heat dissipation challenges, limitations in voltage scaling (Dennard Scaling[^3]), and other fundamental problems. Despite clock speed stagnation, architectural advancements continued: better branch prediction, deeper pipelines, larger caches, and more efficient execution units.
-From 2010 to 2020, single-threaded performance grew only by a factor between two and three. Multicore processors entered the mainstream during this decade, as did simultaneous multithreading.
+From 2010 to 2020, single-threaded performance grew only by a factor between two and three. During this period, CPU manufacturers began to focus more on multi-core processors and parallelism rather than solely increasing single-threaded performance.
-Transistor counts continue to increase in modern processors. Apple's M1 shipped in 2020 with 16 billion transistors. M2 made use of 20, M3 included 25, and 2024's M4 employs 28 billion transistors, a seventy-five percent increase over four years. The growth in transistor count enables manufacturers to add more cores to a processor. Intel's Sierra Forest processors are expected to boast a formidable 288 cores per socket. This is very impressive. Unfortunately, it doesn't always translate into better performance. Very often, application performance doesn't scale with extra CPU cores.
+Transistor counts continue to increase in modern processors. For instance, the number of transistors in Apple chips grew from 16 billion in M1 to 20 billion in M2, to 25 billion in M3, to 28 billion in M4 in a span of roughly four years. The growth in transistor count enables manufacturers to add more cores to a processor. As of 2024, you can buy a high-end server processor that will have more than 100 logical cores on a single CPU socket. This is very impressive. Unfortunately, it doesn't always translate into better performance. Very often, application performance doesn't scale with extra CPU cores.
-As it's no longer the case that each hardware generation provides a significant performance boost, we must start paying more attention to our code's efficiency. When seeking ways to improve performance, developers should not rely on hardware. Instead, they should start optimizing the code of their applications.
+As it's no longer the case that each hardware generation provides a significant performance boost, we must start paying more attention to how fast our code runs. When seeking ways to improve performance, developers should not rely on hardware. Instead, they should start optimizing the code of their applications.
 > “Software today is massively inefficient; it’s become prime time again for software programmers to get really good at optimization.” - Marc Andreessen, the US entrepreneur and investor (a16z Podcast)

From 87a04f2b1058616ed4e0d1dddd22c1214ac2ac9e Mon Sep 17 00:00:00 2001
From: Denis Bakhvalov
Date: Sat, 7 Sep 2024 16:44:34 -0400
Subject: [PATCH 3/3] Denis fixes++

---
 chapters/0-Preface/0-2 Preface.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/0-Preface/0-2 Preface.md b/chapters/0-Preface/0-2 Preface.md
index b255569db2..f94b5987be 100644
--- a/chapters/0-Preface/0-2 Preface.md
+++ b/chapters/0-Preface/0-2 Preface.md
@@ -18,7 +18,7 @@ I started this book with a simple goal: educate software developers to better un
 When I was taking my first steps in performance engineering, the only good sources of information on the topic were software developer manuals, which are not what mainstream developers like to read. Frankly, I wish I had this book when I was trying to learn low-level performance analysis. In 2016 I started sharing things that I learned on my blog, and received some positive feedback from my readers. Some of them suggested I aggregate this information into a book. This book is their fault.
-Many people have asked me why I decided to self-publish the book. In fact, I initially tried to pitch it to several reputable publishers, but they didn't see the financial benefits of making such a book. However, I really wanted to write it, so I decided to do it anyway. In the end, it turned out quite well, so I decided to self-publish the second edition as well.
+Many people have asked me why I decided to self-publish the book. In fact, I initially tried to pitch it to several reputable publishers, but they didn't see the financial benefits of making such a book. However, I really wanted to write it, so I decided to do it anyway. In the end, it turned out quite well, so I decided to self-publish the second edition also.
 The first edition was released in November 2020. It was well-received by the community, but I also received a lot of constructive criticism. The most popular feedback was to include exercises for experimentation. Some readers complained that it was too focused on Intel CPUs and didn't cover other architectures like AMD, ARM, etc. Other readers suggested that I should cover system performance, not just CPU performance. The second edition expands in all these and many other directions. It came out to be twice as big as the first book.