
Provide more cpufreq steps for sun8i/legacy #298

Closed
ThomasKaiser opened this issue May 15, 2016 · 70 comments


@ThomasKaiser
Contributor

ThomasKaiser commented May 15, 2016

Based on the discussion in the forum I would propose adding more cpufreq steps above 816 MHz on sun8i/legacy kernel so that sunxi-cpufreq.c looks like this:

struct cpufreq_frequency_table sunxi_freq_tbl[] = {
    { .frequency = 60000  , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 120000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 240000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 312000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 408000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 480000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 504000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 600000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 648000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 720000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 816000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 864000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 912000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 960000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1008000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1056000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1104000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1152000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1200000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1248000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1296000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1344000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1440000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
    { .frequency = 1536000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },

    /* table end */
    { .frequency = CPUFREQ_TABLE_END,  .index = 0,              },
};

This alone should help with performance in situations where throttling occurs, since throttling gets more efficient when finer-grained jumps between frequencies are possible. It's already known that Allwinner's defaults in every BSP kernel so far were close to horrible, so it's up to us to improve things.

What do you think, both regarding the whole approach and where to apply it (PR against Igor's linux repo or a patch 'as usual')? IMO we should use sun8i as a trial balloon (not strictly necessary, since sun50i/A64 has already served as a test, but the H3 boards with a primitive or no programmable voltage regulator would benefit the most from more cpufreq steps). If no complaints are heard, we should add the additional cpufreq steps to at least all sunxi legacy kernels, and start testing on sun7i and the mainline kernel too whether finer-grained throttling also helps with increasing performance there.

Any thoughts?

@zador-blood-stained
Member

zador-blood-stained commented May 15, 2016

While adding more steps may not improve things a lot for interactive governor, it still may be useful for ondemand and conservative (yes, I saw your post on forum about switching latency, and in any case switching governors is a tradeoff between performance and power saving).

What do you think both regarding the whole approach as well as where to apply (PR against Igor's linux repo or a patch 'as usual')?

Igor's repo doesn't have any previous commits history, so I think a patch "as usual" will be fine.
I added patch with your table to my repo.

@zador-blood-stained
Member

and if no complaints are heard should add additional cpufreqs at least to all sunxi legacy kernels and start to test with sun7i and mainline kernel too whether more fine graded throttling there also helps with increasing performance.

I thought that it's not easy to heat up A10/A20 to the point of throttling without using synthetic loads like cpuburn.

@ssvb

ssvb commented May 15, 2016

@zador-blood-stained FYI, here is an explanation why interactive is better than ondemand: https://lkml.org/lkml/2012/2/7/483

@ssvb

ssvb commented May 15, 2016

@zador-blood-stained

I thought that it's not easy to heat up A10/A20 to the point of throttling without using synthetic loads like cpuburn.

There are real workloads which are only a little bit less power hungry than cpuburn. A10 and A20 don't have 4 cores, but they are also less power efficient than H3 and need a higher core voltage. Still, as far as I know, A10 and A20 currently don't implement any thermal throttling at all, and the current DVFS settings are almost on the edge of being prone to overheating.

@zador-blood-stained
Member

here is an explanation why interactive is better than ondemand

Here "better" is still subjective and depends on types of workloads and whole system use case.

This governor is designed for latency-sensitive workloads, such as
interactive user interfaces.

So for battery saving (for A10,A20 and A64) other governors may still be better.

A10 and A20 don't have 4 cores, but they are also less power efficient than H3 and need higher core voltage.

Also they don't have precise enough thermal sensor to base thermal throttling on.

@ssvb

ssvb commented May 15, 2016

Here "better" is still subjective and depends on types of workloads and whole system use case. So for battery saving (for A10,A20 and A64) other governors may still be better.

And this is based on what? Yes, I know that some people think that "the ondemand governor is supported in the mainline kernel, and everything that is included in the mainline kernel can't be bad by definition" :-)

But the ondemand governor is just a horrible piece of code. It is based on the "waking up to decide whether the CPU is idle" concept. And having unnecessary periodic wakeups is exactly the thing that ruins battery life. This governor is very clearly not fit for the job and Android people had no choice but to replace it with something else. There is nothing like "tradeoffs" here, the ondemand is just inferior in every possible way.

That said, the work is being done in the mainline kernel to clean up this ugly mess, see http://lkml.iu.edu/hypermail/linux/kernel/1603.1/05278.html and http://marc.info/?l=linux-acpi&m=145814049919895&w=2

@zador-blood-stained
Member

I know that some people think that "the ondemand governor is supported in the mainline kernel, and everything that is included in the mainline kernel can't be bad by definition" :-)

I'm not one of them.

And having unnecessary periodic wakeups is exactly the thing that ruins battery life.

Now I see your point. The patch notes don't focus on this aspect.
I based my comparison on the assumption that power consumption scales non-linearly with frequency, so staying longer at lower frequencies is more efficient.

@ThomasKaiser
Contributor Author

While adding more steps may not improve things a lot for interactive governor

Hmm... I've been talking about throttling that is already happening. With the H3 BSP kernel it's not as bad as it was with A64, since here we currently have 96 MHz steps (with the one strange exception before @ssvb added the 1296 MHz cpufreq), but providing the ability to use 48 MHz steps will increase performance in throttling situations for sure (regardless of the governor used, since the throttling frequency then defines the maximum cpufreq).

Regarding A20: mea culpa. This isn't a real throttling candidate but I was already thinking about A20E (based on .dts stuff available through A64 BSP) and thought about checking throttling activity with mainline instead of BSP kernel just to realize that things are moving (reading through ssvb's links right now)

Thx for adopting changes that fast. And am really looking forward to sun8i-simple-cpu-corekeeper.patch.disabled. What's missing? Testing?

@zador-blood-stained
Member

And am really looking forward to sun8i-simple-cpu-corekeeper.patch.disabled. What's missing? Testing?

[ 532.981703] thermal_sys: Critical temperature reached(100 C),shutting down

Needs more work. cpuburn-a7 kills OPi One with heatsink in 5-10 minutes with current trip points configuration.

@ThomasKaiser
Contributor Author

cpuburn-a7 kills OPi One with heatsink in 5-10 minutes with current trip points configuration.

Well, I've thought for 2 months now that there's a lot of room for improvement based on ssvb's comments about the strange cooling map nodes in the A64 BSP kernel settings. But I lack the skills...

@zador-blood-stained
Member

With killing and powering back cores one by one and an extended DVFS table, the in-kernel corekeeper (new version) works stably enough (cpuburn-a7 running for an hour never killed more than 2 cores):

[cooler_table]
cooler_count = 6
cooler0 = "1200000 4 4294967295 0"
cooler1 = "1008000 4 4294967295 0"
cooler2 = "648000 4 4294967295 0"
cooler3 = "600000 3 4294967295 0"
cooler4 = "504000 2 4294967295 0"
cooler5 = "480000 1 4294967295 0"
[dvfs_table]
pmuic_type = 1
pmu_gpio0 = port:PL06<1><1><2><1>
pmu_level0 = 11300
pmu_level1 = 1100
max_freq = 1200000000
min_freq = 480000000
LV_count = 12
LV1_freq = 1200000000
LV1_volt = 1300
LV2_freq = 1104000000
LV2_volt = 1300
LV3_freq = 1056000000
LV3_volt = 1300
LV4_freq = 1008000000
LV4_volt = 1300
LV5_freq = 960000000
LV5_volt = 1300
LV6_freq = 912000000
LV6_volt = 1100
LV7_freq = 816000000
LV7_volt = 1100
LV8_freq = 720000000
LV8_volt = 1100
LV9_freq = 648000000
LV9_volt = 1100
LV10_freq = 600000000
LV10_volt = 1100
LV11_freq = 504000000
LV11_volt = 1100
LV12_freq = 480000000
LV12_volt = 1100

but killing cores earlier may affect both performance in CPU-intensive tasks and benchmarking results.

@zador-blood-stained
Member

[monitoring graph]

This graph doesn't tell the whole picture: during an hour of testing, the 3rd core was killed 20 times and the 4th core was killed 303 times.

@ThomasKaiser
Contributor Author

Hmm... according to the graph, the H3 was running at 600 MHz most of the time, where it's already allowed to kill CPU cores. Another approach would be to allow further throttling down to lower frequencies without killing cores, and also to increase some trip points (IMO it's fine to exceed 90°C under full load).

BTW: Did you check VDD_CPUX voltage in this test? Really at 1.1V all the time?

@zador-blood-stained
Member

Another small test: with these settings and the corekeeper disabled, cpuburn kills only one core in ~10 minutes, and that's enough to keep the temperature down at 648 MHz.
[monitoring graph]

BTW: Did you check VDD_CPUX voltage in this test? Really at 1.1V all the time?

There was a small peak at ~19:46 where it jumped to 1.3V with the frequency going up; besides that it stayed at 1.1V. Or do you want me to measure the voltage at the test points?

Hmm... according to the graph H3 was most of the times running at 600 MHz where it's already allowed to kill CPU cores.

Killing and bringing back cores at higher frequencies may create too big a temperature increase, which would trigger the auto shutdown. And the situation without a heatsink may be even worse.

@ThomasKaiser
Contributor Author

I was really talking about measuring since the graphs are based on parsing script.bin and then displaying VDD_CPUX only according to dvfs fex settings. And on OPi One it's not even possible to query SY8106A for the voltage really used.

@zador-blood-stained
Member

zador-blood-stained commented May 16, 2016

Here I was already talking about measured voltages (1.13V and 1.33V), relative to one of GND pins on GPIO header.

Edit: Even though my multimeter is relatively cheap and old, I tested it on REF01CPZ voltage reference, and it should be precise enough for measuring DC voltage.

@ThomasKaiser
Contributor Author

I know but since I never looked into the driver I simply have no trust at all in the readouts. And temperatures appear to be pretty high compared to the stuff I measured with OPi PC so far.

I still fail to understand why OPi One differs that much compared to PC here (since it should be related to VDD_CPUX voltage and workload -- same settings, same results)

@zador-blood-stained
Member

zador-blood-stained commented May 16, 2016

I still fail to understand why OPi One differs that much compared to PC here (since it should be related to VDD_CPUX voltage and workload -- same settings, same results)

Maybe board size matters: a bigger PCB means a bigger surface area for heat dissipation and a bigger volume for heat accumulation, smoothing fast temperature changes.

@zador-blood-stained
Member

@ThomasKaiser
So what is better in your opinion (I mean as future default settings for general use): more cores running at low frequency, or fewer cores running at high frequency? Obviously multithreaded tasks will benefit from the first option and single-threaded ones from the second.

@ssvb

ssvb commented May 17, 2016

Single-threaded tasks are unlikely to trigger thermal throttling in the first place, so more cores running at low frequency seems to be a universally good choice.

@ssvb

ssvb commented May 17, 2016

Theoretically there could be single-threaded GPU heavy workloads, but then the budget cooling needs to take the GPU into account properly. Which is a part of the budget cooling design in principle, but I'm not sure if it is implemented correctly in Allwinner BSP kernels yet.

@zador-blood-stained
Member

I'm not sure if it is implemented correctly in Allwinner BSP kernels yet.

Well, quick grepping through kernel source shows that there is some sort of implementation: this in theory should call that or that if all is configured correctly.

@ThomasKaiser
Contributor Author

Single-threaded tasks are unlikely to trigger thermal throttling in the first place, so more cores running at low frequency seems to be a universally good choice.

I agree, for the same reason. And even in case many unrelated single-threaded workloads running in parallel lead to a throttling situation, keeping CPU cores online while reducing clockspeed (and, on systems that implement DVFS, also VDD_CPUX!) is the better option, since we end up with more overall performance.

I tested 2 cores running at 1200 MHz vs. 4 cores running at 600 MHz back in December on OPi PC, and both temperature and consumption were lower when running with the full core count at half the speed. That would mean that at the same consumption/temperature level higher clockspeeds would be possible (720 MHz or maybe even 816 MHz).

BTW: The more I think about this stuff from the perspective of OPi One/Lite and especially SinoVoip's M2+ (at 1.3V all the time), the more I come to the conclusion that the intermediate steps between 480 MHz and 816 MHz should rather look like:

{ .frequency = 528000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 576000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 624000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 672000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 720000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 768000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },

(currently there are just 4 clockspeeds defined there that don't follow the 48 MHz step rule)

@zador-blood-stained
Member

If I understand things correctly, adding more frequencies to the driver won't help if you don't define operating points in FEX file, and current limit is 16 operating points.

@ThomasKaiser
Contributor Author

ThomasKaiser commented May 17, 2016

Hmm... just had a look through screenshots taken (with wrong voltage assumptions) since I can not test currently:

[screenshot: orange_pi_one_comparison]

Seems like I only used 2 dvfs entries in the fex file, but more intermediate steps were used. But even if we have to deal with a maximum of 16 dvfs operating points, it shouldn't be a problem at all, since the boards with a primitive or no programmable voltage regulator would end up with (translated to dvfs fex entries, of course):

{ .frequency = 240000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 408000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 480000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 576000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 672000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 720000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 768000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 816000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 864000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 912000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 960000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1008000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1056000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1104000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1152000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1200000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },

and the ones with SY8106A with

{ .frequency = 480000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 576000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 672000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 720000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 768000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 816000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 864000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 912000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 960000 , .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1008000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1056000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1104000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1152000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1200000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1248000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },
{ .frequency = 1296000, .index = SUNXI_CLK_DIV(0, 0, 0, 0), },

@zador-blood-stained
Member

zador-blood-stained commented May 17, 2016

Seems like I only used 2 dvfs entries in the fex file but more intermediate steps were used.

Yes, you are right. Output of cpufreq-info lists frequencies from driver and not from DVFS table.

But even if we have to deal with 16 dvfs operating points max

This can be patched to allow more, but we don't have more than 16 voltage options in any case.

@ThomasKaiser
Contributor Author

This can be patched to allow more, but we don't have more than 16 voltage options in any case.

But if I understand correctly there's nothing to patch? We could replace the 4 cpufreq entries now with the 7 in 48 MHz steps as proposed above, and all we have to do is exchange 648 MHz with 672 here (or switch back to 2 dvfs operating points, now with 816 MHz as the threshold and not 648 MHz as before; that was based on wrong assumptions by me back then).

BTW: I always looked into /sys/devices/system/cpu/cpu0/cpufreq/stats/total_trans instead of the cpufreq-info output.

@zador-blood-stained
Member

Updated

BTW: I always looked into /sys/devices/system/cpu/cpu0/cpufreq/stats/total_trans instead cpufreq-info output.

I didn't set up zsh and history-based autocompletion on OPi One, so cpufreq-info is easier to remember and faster to type 😄

@zador-blood-stained
Member

New settings: Reduced DVFS table + new cooler table

[dvfs_table]
pmuic_type = 1
pmu_gpio0 = port:PL06<1><1><2><1>
pmu_level0 = 11300
pmu_level1 = 1100
max_freq = 1200000000
min_freq = 480000000
LV_count = 4
LV1_freq = 1200000000
LV1_volt = 1300
LV2_freq = 960000000
LV2_volt = 1300
LV3_freq = 912000000
LV3_volt = 1100
LV4_freq = 480000000
LV4_volt = 1100
[cooler_table]
cooler_count = 6
cooler0 = "1200000 4 4294967295 0"
cooler1 = "912000 4 4294967295 0"
cooler2 = "768000 4 4294967295 0"
cooler3 = "720000 3 4294967295 0"
cooler4 = "600000 2 4294967295 0"
cooler5 = "504000 1 4294967295 0"

cpuburn-a7 running for an hour, first 10 minutes with active cooling (fan).
Strangely, no CPU cores were killed this time (CPU was running at 768MHz mostly)
[monitoring graph]

@ThomasKaiser
Contributor Author

I would call this success and would also use these settings with 5.11 for One/Lite/M1 :)

On M2+ the corekeeper patch should help a lot (did you already disable installation of the ugly core-keeper.sh hack on sun8i?)

@zador-blood-stained
Member

@ThomasKaiser
Nothing yet

@ThomasKaiser
Contributor Author

Ok, then I will postpone any PRs adjusting ths/cooler table settings for now.

I made many tests, just to realize that Xunlong obviously changed the PCB material on the 3 new boards (it's thicker and spreads heat more efficiently).

There is little room for improvement regarding the SY8106A based boards, but that's stuff for a (wiki) article. On One/Lite we might want to increase trip points to better match the switch from 1.1V to 1.3V, and regarding the only board without any programmable voltage regulator that also shows poor heat dissipation (BPi M2+) I'm a bit clueless how to proceed. I have some ths/cooler table settings that show better behaviour when using cpuburn-a7 and cpuminer in parallel, but that is not a realistic workload.

Maybe that's also stuff for documentation (don't buy this board and if you do so, don't expect full performance over longer periods of time)?

Anyway: I would increase the trip point where an emergency shutdown occurs to 105°C and define the last throttling trip point 10°C lower (or increase both a bit and use 100°C for the last throttling trip point and 110°C for emergency shutdown).

Anyway: I leave it up to you. With the increased emergency shutdown temperature I didn't manage to shut down the OPi Lite with your in-kernel core-keeper activated. IMO we should activate your code ASAP. :)

@zador-blood-stained
Member

New OPi One, no heatsink, current default settings
[monitoring graph]

@ThomasKaiser
Contributor Author

Hmm... since I assume 'Active CPUs' is 4 this means cpufreq jumps between 816 and 912 MHz and voltage at 1.1V all the time? And you're still running cpuburn-a7?

@zador-blood-stained
Member

Yes, this is cpuburn-a7. Frequency was jumping between 768 and 912MHz, so voltage stayed at 1.1V.

@ThomasKaiser
Contributor Author

It would be interesting to test this also with a less demanding workload (e.g. cpuminer and especially Linpack) to watch the behaviour when the voltage starts to jump between 1.1V and 1.3V.

Linpack starts pretty softly for the first x seconds and then increases the load dramatically. But it's somewhat time consuming to install:

At least when trying to optimise dvfs settings on Pine64 it was worth the effort, since it pretty reliably detected undervoltage situations. (I started a few days ago on a script for users to automagically improve dvfs/cpufreq settings on their specific H3 board, just to realise that my attempt to heat up the SoC prior to the linpack run with cpuburn-a7 doesn't work: when the SoC is already undervolted, cpuburn-a7 kills it reliably -- still searching for a better way.)

Anyway: A workload that lets the board jump between the two voltages would be great to test, since we're interested in whether the temperature increase could be critical with the new settings.

Looking at cpuburn-a7 alone above I'm pretty happy already :)

@zador-blood-stained
Member

But it's somewhat time consuming to install it:

This, with this compilation command, isn't it?

@ssvb

ssvb commented Jun 1, 2016

@zador-blood-stained What you are referring to is a toy-grade Linpack, which uses a simplistic naive algorithm with poor memory locality and has no assembly optimizations. It is demonstrating laughable GFLOPS numbers too.

The true Linpack is a bit more complex piece of software, which relies on a highly optimized OpenBLAS library. As such, it also happens to be pretty stressful for the hardware and is sensitive to undervoltage conditions.

@ThomasKaiser
Contributor Author

@ssvb I recompiled OpenBLAS and hpl (this time using version 2.2 and not 2.1 as back then) again for H3, and now I slightly exceed 2.0 GFLOPS on OPi Plus 2E. So either I did something wrong back then or the new version is 'better'. Anyway: I still want to use this Linpack to be able to detect undervoltage.

So I'm currently searching for a way to heat up the SoC prior to running Linpack (or starting to understand the settings and simply adjusting parameters so that a single benchmark run takes 3 or 4 times longer, since with the settings from the RPi thread the benchmark duration is too short to really create considerable heat).

Anyway: For the tests we're currently after (testing @zador-blood-stained's in-kernel core-keeper and THS settings) the optimized Linpack might be great, since a switch between light and heavy load is involved.

@zador-blood-stained
Member

But it's somewhat time consuming to install it

Nothing special; it took only ~40 minutes to compile.

So do I run it in parallel with cpuburn-a7, or is it more complicated? What's the proper testing procedure?

@ssvb

ssvb commented Jun 1, 2016

@ThomasKaiser Sounds like making use of the hardware watchdog built into the SoC might be a good idea for automation. I did use it when automatically tuning DRAM settings for A10/A13/A20. Some modifications might be necessary for H3 though.

@ssvb

ssvb commented Jun 1, 2016

@zador-blood-stained

So do I run it in parallel with cpuburn-a7 or it's more complicated? What's the proper testing procedure?

I have dropped the ball on this front, but IMHO the right way to proceed would be to implement ssvb/cpuburn-arm#4 and improve these tools in general.

@ThomasKaiser
Contributor Author

@zador-blood-stained For now it should be enough to let Linpack run on the small boards to check the THS and core-keeper stuff (maybe while another lightweight workload runs in parallel to force switching between 1.1V and 1.3V more often; the goal would be to test whether bringing back CPU cores from your kernel code might kill boards with the current THS settings or not).

@ssvb thanks for mentioning the watchdog. It seems useful for exactly that, so in case I get stuck with this (if I follow that route; the main problem with such a 'test out hardware reliability' approach is the user in question) I'll dig deeper.

@zador-blood-stained
Member

zador-blood-stained commented Jun 1, 2016

@ThomasKaiser I can always jump between operating points (and thus voltage) manually with cpufreq-set.
In case SoC temperature matters here too, I can heat the board with soldering fan 😄

@ThomasKaiser
Contributor Author

Well, I thought we're still testing whether bringing back CPU cores (too fast) might cause problems, since we then reach a critical threshold where an emergency shutdown is initiated because the temperature increases too fast again? That's the focus of testing now, at least if I understood your concerns from a while back?

I'm pretty fine already with current settings and would like to see your in-kernel core-keeper being default rather sooner than later :)

@zador-blood-stained
Member

My concern was that with the old settings, bringing several cores back (since we had either 4 cores or 1 core in the cooler_table) heats the SoC so fast that it would trigger an emergency shutdown before the budget cooling algorithm had time to react to the temperature; and this was on an OPi One with a defective thermal sensor.

With current settings and "normal" Oranges this shouldn't cause any problems unless somebody decides to use shitty power supply and killing/bringing back cores causes momentary CPU undervoltage.

@ssvb

ssvb commented Jun 1, 2016

@zador-blood-stained The current 1 core state is supposed to be unreachable. The last state with 4 active cores should be already running at a sufficiently low clock speed to handle any load without overheating. If we ever reach the 1 core state, then it's already a catastrophic event similar to thermal shutdown.

@ssvb

ssvb commented Jun 1, 2016

In fact we may revise this last 4 core state by running the lima-textured-cube demo together with cpuburn-a7 and putting the board in a box with poor ventilation :-) The 648MHz CPU clock speed might be too high.

@ThomasKaiser
Contributor Author

ThomasKaiser commented Jun 1, 2016

@zador-blood-stained OK, now I start to understand. But I'm also sure that I do not understand relationship between cooler_table and THS trip points. On the other hand I don't care that much since I would like to run mainline kernel on H3 devices.

So while playing around with this stuff with legacy kernel to check hardware limits I really hope we get support for THS in mainline kernel soon. @ssvb IIRC megi and you talked a while ago in linux-sunxi IRC about the state of these commits. Have to look through IRC logs to get the idea. I still fear that sending patches upstream gets delayed and we can't benefit from thermal/throttling on H3 boards with mainline kernel before 2017 :\

@zador-blood-stained
Member

OPi PC, no heatsink, cpuburn-a7
[monitoring graph]

So IMO it's OK to enable corekeeper for all H3 oranges.

@ThomasKaiser
Contributor Author

ThomasKaiser commented Jun 3, 2016

I'm fine with this, but would suggest that we adjust two more values on all SY8106A equipped boards: increase the 1st trip point by 5°C and the shutdown threshold as well, so that we get 10°C between the last throttling step and the emergency shutdown:

 ths_trip1_0 = 75
 ths_trip1_1 = 80
 ths_trip1_2 = 85
 ths_trip1_3 = 90
 ths_trip1_4 = 95
 ths_trip1_5 = 105

BTW: While we're at it (H3 boards). What about decreasing DRAM clockspeed for both BPi M2+ and NanoPi M1 in u-boot and fex file? The test results from yesterday and today do not look that promising when relying on mainline u-boot (Tido also pointed out that DRAM chips on BPi M2+ are slightly different than those Samsungs used on Orange Pis: K4B4G1646D-BCK0 vs. K4B4G1646Q-HYK0 on Oranges -- I asked Tido to correct this in linux-sunxi wiki)

@zador-blood-stained
Member

Increase 1st trip point by 5°C and shutdown treshold also so that we get 10°C between last throttling step and emergency shutdown

You need to adjust second part of THS table, which defines cooling states, too.

What about decreasing DRAM clockspeed for both BPi M2+ and NanoPi M1 in u-boot and fex file?

Don't have any of these boards to test, but if there are 2 or more cases of failing lima-memtester tests with current DRAM speed (or current + 24MHz), then it's better safe than sorry I guess.

@zador-blood-stained
Member

BTW, Do we need any more tests without heatsinks? I think I have enough small heatsinks and adhesive stuff for all new boards.

@ThomasKaiser
Contributor Author

You need to adjust second part of THS table, which defines cooling states, too.

Really? I just want to modify the first and last entries, therefore letting the first throttling step happen 5°C higher than before and gaining some safety headroom regarding emergency shutdowns at the upper end of the thermal scale. IMO adjusting both temperature values should be enough?

Regarding tests without a heatsink, IMO only confirming the thermal behaviour of the OPi Plus 2E would be interesting, since I'm still amazed how little throttling occurs here. So just one graph with our current settings and information regarding the ambient temperature would be fine (still preparing a side-by-side review of BPi M2+ and OPi Plus 2E).

@ssvb

ssvb commented Jun 3, 2016

IIRC, the emergency shutdown temperature is configured by the ths_trip2_0 = 105 line in FEX.

@ThomasKaiser
Contributor Author

IIRC, the emergency shutdown temperature is configured by the ths_trip2_0 = 105 line in FEX.

Sure, but reaching the last ths_trip1 entry already has the same effect. So in case ths_trip1_count = 8 is defined, reaching ths_trip1_7 will also trigger a shutdown.

@zador-blood-stained
Member

Regarding tests without heatsink IMO only confirming thermal behaviour of OPi Plus 2E would be interesting

Opi Plus 2E, cpuburn-a7, no heatsink
[monitoring graph]

Most of the time spent at 1008MHz

@ThomasKaiser
Contributor Author

Thx for the test. I just merged #340, so from now on we have identical THS settings on all H3 boards. I also asked for more testers regarding BPi M2+ (both DRAM reliability and thermal readouts, since the results I got so far are simply weird, or an indication that this board overheats like hell).

Hopefully a few more users get back to us soon. As long as this is unresolved we should clock the DRAM at 624 MHz, as on the other H3 boards already.

I'm currently preparing an article regarding H3 boards and performance tuning, e.g. analysing one's own workload and thermal behaviour and then tuning the THS settings so that throttling happens in a more fine-grained way in the appropriate thermal range (then really making use of the additional cpufreq operating points we added).

IMO no more tests in this area (and w/o heatsinks) are necessary :)
