blkcg: implement blk-iocost

This patchset implements IO cost model based work-conserving
proportional controller.

While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others.  In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.

One challenge of controlling IO resources is the lack of trivially
observable cost metric.  The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern.  However, the cost isn't a complete mystery.  Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.

The function which determines the cost of a given IO is the IO cost
model for the device.  This controller distributes IO capacity based
on the costs estimated by such model.  The more accurate the cost
model the better but the controller adapts based on IO completion
latency and as long as the relative costs across differents IO
patterns are consistent and sensible, it'll adapt to the actual
performance of the device.

Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well.  All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.

Please see the top comment in blk-iocost.c and documentation for
more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This commit is contained in:
Tejun Heo 2019-08-28 15:05:58 -07:00 committed by Jens Axboe
parent 6f816b4b74
commit 7caa47151a
7 changed files with 2656 additions and 0 deletions

View File

@ -1435,6 +1435,100 @@ IO Interface Files
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
io.cost.qos
A read-write nested-keyed file with exists only on the root
cgroup.
This file configures the Quality of Service of the IO cost
model based controller (CONFIG_BLK_CGROUP_IOCOST) which
currently implements "io.weight" proportional control. Lines
are keyed by $MAJ:$MIN device numbers and not ordered. The
line for a given device is populated on the first write for
the device on "io.cost.qos" or "io.cost.model". The following
nested keys are defined.
====== =====================================
enable Weight-based control enable
ctrl "auto" or "user"
rpct Read latency percentile [0, 100]
rlat Read latency threshold
wpct Write latency percentile [0, 100]
wlat Write latency threshold
min Minimum scaling percentage [1, 10000]
max Maximum scaling percentage [1, 10000]
====== =====================================
The controller is disabled by default and can be enabled by
setting "enable" to 1. "rpct" and "wpct" parameters default
to zero and the controller uses internal device saturation
state to adjust the overall IO rate between "min" and "max".
When a better control quality is needed, latency QoS
parameters can be configured. For example::
8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
shows that on sdb, the controller is enabled, will consider
the device saturated if the 95th percentile of read completion
latencies is above 75ms or write 150ms, and adjust the overall
IO issue rate between 50% and 150% accordingly.
The lower the saturation point, the better the latency QoS at
the cost of aggregate bandwidth. The narrower the allowed
adjustment range between "min" and "max", the more conformant
to the cost model the IO behavior. Note that the IO issue
base rate may be far off from 100% and setting "min" and "max"
blindly can lead to a significant loss of device capacity or
control quality. "min" and "max" are useful for regulating
devices which show wide temporary behavior changes - e.g. a
ssd which accepts writes at the line speed for a while and
then completely stalls for multiple seconds.
When "ctrl" is "auto", the parameters are controlled by the
kernel and may change automatically. Setting "ctrl" to "user"
or setting any of the percentile and latency parameters puts
it into "user" mode and disables the automatic changes. The
automatic mode can be restored by setting "ctrl" to "auto".
io.cost.model
A read-write nested-keyed file with exists only on the root
cgroup.
This file configures the cost model of the IO cost model based
controller (CONFIG_BLK_CGROUP_IOCOST) which currently
implements "io.weight" proportional control. Lines are keyed
by $MAJ:$MIN device numbers and not ordered. The line for a
given device is populated on the first write for the device on
"io.cost.qos" or "io.cost.model". The following nested keys
are defined.
===== ================================
ctrl "auto" or "user"
model The cost model in use - "linear"
===== ================================
When "ctrl" is "auto", the kernel may change all parameters
dynamically. When "ctrl" is set to "user" or any other
parameters are written to, "ctrl" become "user" and the
automatic changes are disabled.
When "model" is "linear", the following model parameters are
defined.
============= ========================================
[r|w]bps The maximum sequential IO throughput
[r|w]seqiops The maximum 4k sequential IOs per second
[r|w]randiops The maximum 4k random IOs per second
============= ========================================
From the above, the builtin linear model determines the base
costs of a sequential and random IO and the cost coefficient
for the IO size. While simple, this model can cover most
common device classes acceptably.
The IO cost model isn't expected to be accurate in absolute
sense and is scaled to the device behavior dynamically.
io.weight io.weight
A read-write flat-keyed file which exists on non-root cgroups. A read-write flat-keyed file which exists on non-root cgroups.
The default is "default 100". The default is "default 100".

View File

@ -135,6 +135,16 @@ config BLK_CGROUP_IOLATENCY
Note, this is an experimental interface and could be changed someday. Note, this is an experimental interface and could be changed someday.
config BLK_CGROUP_IOCOST
bool "Enable support for cost model based cgroup IO controller"
depends on BLK_CGROUP=y
select BLK_RQ_ALLOC_TIME
---help---
Enabling this option enables the .weight interface for cost
model based proportional IO control. The IO controller
distributes IO capacity between different groups based on
their share of the overall weight distribution.
config BLK_WBT_MQ config BLK_WBT_MQ
bool "Multiqueue writeback throttling" bool "Multiqueue writeback throttling"
default y default y

View File

@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o

2371
block/blk-iocost.c Normal file

File diff suppressed because it is too large Load Diff

View File

@ -15,6 +15,7 @@ struct blk_mq_debugfs_attr;
enum rq_qos_id { enum rq_qos_id {
RQ_QOS_WBT, RQ_QOS_WBT,
RQ_QOS_LATENCY, RQ_QOS_LATENCY,
RQ_QOS_COST,
}; };
struct rq_wait { struct rq_wait {
@ -84,6 +85,8 @@ static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
return "wbt"; return "wbt";
case RQ_QOS_LATENCY: case RQ_QOS_LATENCY:
return "latency"; return "latency";
case RQ_QOS_COST:
return "cost";
} }
return "unknown"; return "unknown";
} }

View File

@ -169,6 +169,9 @@ struct bio {
*/ */
struct blkcg_gq *bi_blkg; struct blkcg_gq *bi_blkg;
struct bio_issue bi_issue; struct bio_issue bi_issue;
#ifdef CONFIG_BLK_CGROUP_IOCOST
u64 bi_iocost_cost;
#endif
#endif #endif
union { union {
#if defined(CONFIG_BLK_DEV_INTEGRITY) #if defined(CONFIG_BLK_DEV_INTEGRITY)

View File

@ -0,0 +1,174 @@
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM iocost
#if !defined(_TRACE_BLK_IOCOST_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_BLK_IOCOST_H
#include <linux/tracepoint.h>
TRACE_EVENT(iocost_iocg_activate,
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
u64 last_period, u64 cur_period, u64 vtime),
TP_ARGS(iocg, path, now, last_period, cur_period, vtime),
TP_STRUCT__entry (
__string(devname, ioc_name(iocg->ioc))
__string(cgroup, path)
__field(u64, now)
__field(u64, vnow)
__field(u64, vrate)
__field(u64, last_period)
__field(u64, cur_period)
__field(u64, last_vtime)
__field(u64, vtime)
__field(u32, weight)
__field(u32, inuse)
__field(u64, hweight_active)
__field(u64, hweight_inuse)
),
TP_fast_assign(
__assign_str(devname, ioc_name(iocg->ioc));
__assign_str(cgroup, path);
__entry->now = now->now;
__entry->vnow = now->vnow;
__entry->vrate = now->vrate;
__entry->last_period = last_period;
__entry->cur_period = cur_period;
__entry->last_vtime = iocg->last_vtime;
__entry->vtime = vtime;
__entry->weight = iocg->weight;
__entry->inuse = iocg->inuse;
__entry->hweight_active = iocg->hweight_active;
__entry->hweight_inuse = iocg->hweight_inuse;
),
TP_printk("[%s:%s] now=%llu:%llu vrate=%llu "
"period=%llu->%llu vtime=%llu->%llu "
"weight=%u/%u hweight=%llu/%llu",
__get_str(devname), __get_str(cgroup),
__entry->now, __entry->vnow, __entry->vrate,
__entry->last_period, __entry->cur_period,
__entry->last_vtime, __entry->vtime,
__entry->inuse, __entry->weight,
__entry->hweight_inuse, __entry->hweight_active
)
);
DECLARE_EVENT_CLASS(iocg_inuse_update,
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
u32 old_inuse, u32 new_inuse,
u64 old_hw_inuse, u64 new_hw_inuse),
TP_ARGS(iocg, path, now, old_inuse, new_inuse,
old_hw_inuse, new_hw_inuse),
TP_STRUCT__entry (
__string(devname, ioc_name(iocg->ioc))
__string(cgroup, path)
__field(u64, now)
__field(u32, old_inuse)
__field(u32, new_inuse)
__field(u64, old_hweight_inuse)
__field(u64, new_hweight_inuse)
),
TP_fast_assign(
__assign_str(devname, ioc_name(iocg->ioc));
__assign_str(cgroup, path);
__entry->now = now->now;
__entry->old_inuse = old_inuse;
__entry->new_inuse = new_inuse;
__entry->old_hweight_inuse = old_hw_inuse;
__entry->new_hweight_inuse = new_hw_inuse;
),
TP_printk("[%s:%s] now=%llu inuse=%u->%u hw_inuse=%llu->%llu",
__get_str(devname), __get_str(cgroup), __entry->now,
__entry->old_inuse, __entry->new_inuse,
__entry->old_hweight_inuse, __entry->new_hweight_inuse
)
);
DEFINE_EVENT(iocg_inuse_update, iocost_inuse_takeback,
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
u32 old_inuse, u32 new_inuse,
u64 old_hw_inuse, u64 new_hw_inuse),
TP_ARGS(iocg, path, now, old_inuse, new_inuse,
old_hw_inuse, new_hw_inuse)
);
DEFINE_EVENT(iocg_inuse_update, iocost_inuse_giveaway,
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
u32 old_inuse, u32 new_inuse,
u64 old_hw_inuse, u64 new_hw_inuse),
TP_ARGS(iocg, path, now, old_inuse, new_inuse,
old_hw_inuse, new_hw_inuse)
);
DEFINE_EVENT(iocg_inuse_update, iocost_inuse_reset,
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
u32 old_inuse, u32 new_inuse,
u64 old_hw_inuse, u64 new_hw_inuse),
TP_ARGS(iocg, path, now, old_inuse, new_inuse,
old_hw_inuse, new_hw_inuse)
);
TRACE_EVENT(iocost_ioc_vrate_adj,
TP_PROTO(struct ioc *ioc, u64 new_vrate, u32 (*missed_ppm)[2],
u32 rq_wait_pct, int nr_lagging, int nr_shortages,
int nr_surpluses),
TP_ARGS(ioc, new_vrate, missed_ppm, rq_wait_pct, nr_lagging, nr_shortages,
nr_surpluses),
TP_STRUCT__entry (
__string(devname, ioc_name(ioc))
__field(u64, old_vrate)
__field(u64, new_vrate)
__field(int, busy_level)
__field(u32, read_missed_ppm)
__field(u32, write_missed_ppm)
__field(u32, rq_wait_pct)
__field(int, nr_lagging)
__field(int, nr_shortages)
__field(int, nr_surpluses)
),
TP_fast_assign(
__assign_str(devname, ioc_name(ioc));
__entry->old_vrate = atomic64_read(&ioc->vtime_rate);;
__entry->new_vrate = new_vrate;
__entry->busy_level = ioc->busy_level;
__entry->read_missed_ppm = (*missed_ppm)[READ];
__entry->write_missed_ppm = (*missed_ppm)[WRITE];
__entry->rq_wait_pct = rq_wait_pct;
__entry->nr_lagging = nr_lagging;
__entry->nr_shortages = nr_shortages;
__entry->nr_surpluses = nr_surpluses;
),
TP_printk("[%s] vrate=%llu->%llu busy=%d missed_ppm=%u:%u rq_wait_pct=%u lagging=%d shortages=%d surpluses=%d",
__get_str(devname), __entry->old_vrate, __entry->new_vrate,
__entry->busy_level,
__entry->read_missed_ppm, __entry->write_missed_ppm,
__entry->rq_wait_pct, __entry->nr_lagging, __entry->nr_shortages,
__entry->nr_surpluses
)
);
#endif /* _TRACE_BLK_IOCOST_H */
/* This part must be outside protection */
#include <trace/define_trace.h>