Commit 2802bf3

msrasmussen authored and Ingo Molnar committed
sched/fair: Add over-utilization/tipping point indicator
Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point, task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority-scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criterion for when we
make the switch.

The util_avg for each cpu converges towards 100% regardless of how many
additional tasks we may put on it. If we define over-utilized as:

    sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own, as we have 1.5*n_cpus tasks in total and 55%+55% is
less over-utilized than 55%+60% for those cpus that have to be shared.
The system utilization is only 85% of the system capacity, but we are
breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency instead:

    cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority and preserve smp_nice.

With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.

For systems where some cpus might have reduced capacity (RT-pressure
and/or big.LITTLE), we want periodic load-balance checks as soon as just
a single cpu is fully utilized, as it might be one of those with reduced
capacity and in that case we want to migrate it.

[ peterz: Added a comment explaining why new tasks are not accounted
  during overutilization detection. ]

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-13-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1 parent 630246a, commit 2802bf3
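
To make the per-CPU tipping-point condition described in the changelog concrete, here is a minimal userspace sketch; it is not part of the patch. It mirrors the comparison added to cpu_overutilized() in the diff below, but the capacity_margin value of 1280 (~20% headroom, the value used by kernel/sched/fair.c around this time) and the per-CPU capacity/utilization figures are assumptions for illustration only.

/*
 * Minimal userspace sketch of the per-CPU tipping-point check.
 * Assumptions (not part of this commit): capacity_margin = 1280 and
 * made-up capacity/utilization numbers for a big.LITTLE system.
 */
#include <stdbool.h>
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024
static const unsigned long capacity_margin = 1280;	/* ~20% margin (assumed) */

/* Same comparison as cpu_overutilized() in the diff below. */
static bool cpu_overutilized(unsigned long capacity, unsigned long util)
{
	return (capacity * SCHED_CAPACITY_SCALE) < (util * capacity_margin);
}

int main(void)
{
	/* Hypothetical CPUs: two LITTLEs (capacity 430) and two bigs (1024). */
	const unsigned long capacity[] = { 430, 430, 1024, 1024 };
	const unsigned long util[]     = { 200, 380,  512,  900 };

	for (int cpu = 0; cpu < 4; cpu++)
		printf("cpu%d: util %lu / capacity %lu -> %s\n",
		       cpu, util[cpu], capacity[cpu],
		       cpu_overutilized(capacity[cpu], util[cpu]) ?
		       "over-utilized (tips the root domain)" : "ok");

	/*
	 * With a 1280 margin a CPU tips over once util exceeds ~80% of its
	 * capacity: here cpu1 (380 > 0.8 * 430) and cpu3 (900 > 0.8 * 1024).
	 * One such CPU is enough to set rd->overutilized and fall back to
	 * load_avg-based balancing.
	 */
	return 0;
}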

File tree

2 files changed: +61 -2 lines changed

kernel/sched/fair.c (57 additions, 2 deletions)

@@ -5082,6 +5082,24 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5139,8 +5157,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 	}
 
-	if (!se)
+	if (!se) {
 		add_nr_running(rq, 1);
+		/*
+		 * Since new tasks are assigned an initial util_avg equal to
+		 * half of the spare capacity of their CPU, tiny tasks have the
+		 * ability to cross the overutilized threshold, which will
+		 * result in the load balancer ruining all the task placement
+		 * done by EAS. As a way to mitigate that effect, do not account
+		 * for the first enqueue operation of new tasks during the
+		 * overutilized flag detection.
+		 *
+		 * A better way of solving this problem would be to wait for
+		 * the PELT signals of tasks to converge before taking them
+		 * into account, but that is not straightforward to implement,
+		 * and the following generally works well enough in practice.
+		 */
+		if (flags & ENQUEUE_WAKEUP)
+			update_overutilized_status(rq);
+
+	}
 
 	hrtick_update(rq);
 }
@@ -7940,6 +7976,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (nr_running > 1)
 			*sg_status |= SG_OVERLOAD;
 
+		if (cpu_overutilized(i))
+			*sg_status |= SG_OVERUTILIZED;
+
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
@@ -8170,8 +8209,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 
 	if (!env->sd->parent) {
+		struct root_domain *rd = env->dst_rq->rd;
+
 		/* update overload indicator if we are at root domain */
-		WRITE_ONCE(env->dst_rq->rd->overload, sg_status & SG_OVERLOAD);
+		WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
+
+		/* Update over-utilization (tipping point, U >= 0) indicator */
+		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
+	} else if (sg_status & SG_OVERUTILIZED) {
+		WRITE_ONCE(env->dst_rq->rd->overutilized, SG_OVERUTILIZED);
 	}
 }
 
@@ -8398,6 +8444,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * this level.
 	 */
 	update_sd_lb_stats(env, &sds);
+
+	if (static_branch_unlikely(&sched_energy_present)) {
+		struct root_domain *rd = env->dst_rq->rd;
+
+		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+			goto out_balanced;
+	}
+
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
@@ -9798,6 +9852,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);
 
 	update_misfit_status(curr, rq);
+	update_overutilized_status(task_rq(curr));
 }
 
 /*
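
The smp_nice scenario from the changelog can also be reproduced with a short userspace sketch (again, not part of the patch; n_cpus = 4 and the task mix are assumed purely for illustration). Greedily placing each task on the least-utilized CPU, the way a purely util_avg-based balance tends to, ends up sharing CPUs between the nice=-10 tasks while each nice=0 task gets a CPU of its own, even though the system is only at 85% of capacity. That is exactly the situation the conservative per-CPU tipping point is meant to avoid.

/*
 * Sketch of the smp_nice example: four nice=-10 tasks at 55% util and
 * two nice=0 tasks at 60% util, packed greedily by utilization onto
 * four CPUs (values assumed for illustration).
 */
#include <stdio.h>

#define NR_CPUS	4

int main(void)
{
	/* 1.5 * n_cpus tasks: util in percent, nice level only for printing */
	const struct { int util; int nice; } tasks[] = {
		{ 60, 0 }, { 60, 0 },					/* nice=0   */
		{ 55, -10 }, { 55, -10 }, { 55, -10 }, { 55, -10 },	/* nice=-10 */
	};
	int cpu_util[NR_CPUS] = { 0 };
	int total = 0;

	for (unsigned int t = 0; t < sizeof(tasks) / sizeof(tasks[0]); t++) {
		int target = 0;

		/* place the task on the least-utilized CPU */
		for (int cpu = 1; cpu < NR_CPUS; cpu++)
			if (cpu_util[cpu] < cpu_util[target])
				target = cpu;

		cpu_util[target] += tasks[t].util;
		total += tasks[t].util;
		printf("task util=%d%% nice=%d -> cpu%d\n",
		       tasks[t].util, tasks[t].nice, target);
	}

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu%d: %d%%\n", cpu, cpu_util[cpu]);
	printf("system: %d%% of %d%%\n", total, NR_CPUS * 100);

	/* Result: cpu0/cpu1 hold one nice=0 task each (60%), while the
	 * nice=-10 tasks share cpu2/cpu3 (110% each) - smp_nice is broken
	 * at only 85% system utilization. */
	return 0;
}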

kernel/sched/sched.h (4 additions, 0 deletions)

@@ -718,6 +718,7 @@ struct perf_domain {
 
 /* Scheduling group status flags */
 #define SG_OVERLOAD		0x1 /* More than one runnable task on a CPU. */
+#define SG_OVERUTILIZED		0x2 /* One or more CPUs are over-utilized. */
 
 /*
  * We add the notion of a root-domain which will be used to define per-domain
@@ -741,6 +742,9 @@ struct root_domain {
 	 */
 	int			overload;
 
+	/* Indicate one or more cpus over-utilized (tipping point) */
+	int			overutilized;
+
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).
