Commit Graph

193 Commits

Author SHA1 Message Date
Vlad Buslov
4ce70b4aed net: sched: sch_htb: don't call qdisc_put() while holding tree lock
Recent changes that removed rtnl dependency from rules update path of tc
also made tcf_block_put() function sleeping. This function is called from
ops->destroy() of several Qdisc implementations, which in turn is called by
qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock,
which results sleeping-while-atomic BUG.

Steps to reproduce for htb:

tc qdisc add dev ens1f0 root handle 1: htb default 12
tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps
tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10
tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps

Resulting dmesg:

[ 4791.148551] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909
[ 4791.151354] in_atomic(): 1, irqs_disabled(): 0, pid: 27273, name: tc
[ 4791.152805] INFO: lockdep is turned off.
[ 4791.153605] CPU: 19 PID: 27273 Comm: tc Tainted: G        W         5.3.0-rc8+ #721
[ 4791.154336] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 4791.155075] Call Trace:
[ 4791.155803]  dump_stack+0x85/0xc0
[ 4791.156529]  ___might_sleep.cold+0xac/0xbc
[ 4791.157251]  __mutex_lock+0x5b/0x960
[ 4791.157966]  ? console_unlock+0x363/0x5d0
[ 4791.158676]  ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 4791.159395]  ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 4791.160103]  tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 4791.160815]  tcf_block_put_ext.part.0+0x21/0x50
[ 4791.161530]  tcf_block_put+0x50/0x70
[ 4791.162233]  sfq_destroy+0x15/0x50 [sch_sfq]
[ 4791.162936]  qdisc_destroy+0x5f/0x160
[ 4791.163642]  htb_change_class.cold+0x5df/0x69d [sch_htb]
[ 4791.164505]  tc_ctl_tclass+0x19d/0x480
[ 4791.165360]  rtnetlink_rcv_msg+0x170/0x4b0
[ 4791.166191]  ? netlink_deliver_tap+0x95/0x400
[ 4791.166907]  ? rtnl_dellink+0x2d0/0x2d0
[ 4791.167625]  netlink_rcv_skb+0x49/0x110
[ 4791.168345]  netlink_unicast+0x171/0x200
[ 4791.169058]  netlink_sendmsg+0x224/0x3f0
[ 4791.169771]  sock_sendmsg+0x5e/0x60
[ 4791.170475]  ___sys_sendmsg+0x2ae/0x330
[ 4791.171183]  ? ___sys_recvmsg+0x159/0x1f0
[ 4791.171894]  ? do_wp_page+0x9c/0x790
[ 4791.172595]  ? __handle_mm_fault+0xcd3/0x19e0
[ 4791.173309]  __sys_sendmsg+0x59/0xa0
[ 4791.174024]  do_syscall_64+0x5c/0xb0
[ 4791.174725]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 4791.175435] RIP: 0033:0x7f0aa41497b8
[ 4791.176129] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5
4
[ 4791.177532] RSP: 002b:00007fff4e37d588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 4791.178243] RAX: ffffffffffffffda RBX: 000000005d8132f7 RCX: 00007f0aa41497b8
[ 4791.178947] RDX: 0000000000000000 RSI: 00007fff4e37d5f0 RDI: 0000000000000003
[ 4791.179662] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000020149a0
[ 4791.180382] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001
[ 4791.181100] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000

In htb_change_class() function save parent->leaf.q to local temporary
variable and put reference to it after sch tree lock is released in order
not to call potentially sleeping cls API in atomic section. This is safe to
do because Qdisc has already been reset by qdisc_purge_queue() inside sch
tree lock critical section.

Fixes: c266f64dbf ("net: sched: protect block state with mutex")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:13:55 +02:00
Thomas Gleixner
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Cong Wang
b362487a3b sch_htb: redefine htb qdisc overlimits
In commit 3c75f6ee13 ("net_sched: sch_htb: add per class overlimits counter")
we added an overlimits counter for each HTB class which could
properly reflect how many times we use up all the bandwidth
on each class. However, the overlimits counter in HTB qdisc
does not, it is way bigger than the sum of each HTB class.
In fact, this qdisc overlimits counter increases when we have
no skb to dequeue, which happens more often than we run out of
bandwidth.

It makes more sense to make this qdisc overlimits counter just
be a sum of each HTB class, in case people still get confused.

I have verified this patch with one single HTB class, where HTB
qdisc counters now always match HTB class counters as expected.

Eric suggested we could fold this field into 'direct_pkts' as
we only use its 32bit on 64bit CPU, this saves one cache line.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:55:20 -07:00
Johannes Berg
8cb081746c netlink: make validation more configurable for future strictness
We currently have two levels of strict validation:

 1) liberal (default)
     - undefined (type >= max) & NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted
     - garbage at end of message accepted
 2) strict (opt-in)
     - NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted

Split out parsing strictness into four different options:
 * TRAILING     - check that there's no trailing data after parsing
                  attributes (in message or nested)
 * MAXTYPE      - reject attrs > max known type
 * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
 * STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
 * nla_parse           -> nla_parse_deprecated
 * nla_parse_strict    -> nla_parse_deprecated_strict
 * nlmsg_parse         -> nlmsg_parse_deprecated
 * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
 * nla_parse_nested    -> nla_parse_nested_deprecated
 * nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:21 -04:00
Michal Kubecek
ae0be8de9a netlink: make nla_nest_start() add NLA_F_NESTED flag
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
Paolo Abeni
e5f0e8f8e4 net: sched: introduce and use qdisc tree flush/purge helpers
The same code to flush qdisc tree and purge the qdisc queue
is duplicated in many places and in most cases it does not
respect NOLOCK qdisc: the global backlog len is used and the
per CPU values are ignored.

This change addresses the above, factoring-out the relevant
code and using the helpers introduced by the previous patch
to fetch the correct backlog len.

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:50:13 -07:00
Paolo Abeni
5dd431b6b9 net: sched: introduce and use qstats read helpers
Classful qdiscs can't access directly the child qdiscs backlog
length: if such qdisc is NOLOCK, per CPU values should be
accounted instead.

Most qdiscs no not respect the above. As a result, qstats fetching
for most classful qdisc is currently incorrect: if the child qdisc is
NOLOCK, it always reports 0 len backlog.

This change introduces a pair of helpers to safely fetch
both backlog and qlen and use them in stats class dumping
functions, fixing the above issue and cleaning a bit the code.

DRR needs also to access the child qdisc queue length, so it
needs custom handling.

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:50:13 -07:00
Toke Høiland-Jørgensen
f6bab19931 sched: Avoid dereferencing skb pointer after child enqueue
Parent qdiscs may dereference the pointer to the enqueued skb after
enqueue. However, both CAKE and TBF call consume_skb() on the original skb
when splitting GSO packets, leading to a potential use-after-free in the
parent. Fix this by avoiding dereferencing the skb pointer after enqueueing
to the child.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-15 20:12:00 -08:00
Vlad Buslov
86bd446b5c net: sched: rename qdisc_destroy() to qdisc_put()
Current implementation of qdisc_destroy() decrements Qdisc reference
counter and only actually destroy Qdisc if reference counter value reached
zero. Rename qdisc_destroy() to qdisc_put() in order for it to better
describe the way in which this function currently implemented and used.

Extract code that deallocates Qdisc into new private qdisc_destroy()
function. It is intended to be shared between regular qdisc_put() and its
unlocked version that is introduced in next patch in this series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25 20:17:35 -07:00
Cong Wang
11957be20f htb: use anonymous union for simplicity
cl->leaf.q is slightly more readable than cl->un.leaf.q.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:39:57 -07:00
David S. Miller
aea890b8b2 sch_htb: Remove local SKB queue handling code.
Instead, adjust __qdisc_enqueue_tail() such that HTB can use it
instead.

The only other caller of __qdisc_enqueue_tail() is
qdisc_enqueue_tail() so we can move the backlog and return value
handling (which HTB doesn't need/want) to the latter.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:52 -07:00
Cong Wang
fe0b082fed net_sched: remove unused htb drop_list
After commit a09ceb0e08 ("sched: remove qdisc->drop"),
it is no longer used.

Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-24 16:42:46 +09:00
Li RongQing
da01ec4ee5 net: sched: do not emit messages while holding spinlock
move messages emitting out of sch_tree_lock to avoid holding
this lock too long.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-01 20:53:50 -04:00
Alexander Aring
a38a98821c net: sch: api: add extack support in qdisc_create_dflt
This patch adds extack support for the function qdisc_create_dflt which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why qdisc_create_dflt failed. The function qdisc_create_dflt
will also call an init callback which can fail by any per-qdisc specific
handling.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:51 -05:00
Alexander Aring
8d1a77f974 net: sch: api: add extack support in tcf_block_get
This patch adds extack support for the function tcf_block_get which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why tcf_block_get failed.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:51 -05:00
Alexander Aring
e9bc3fa28b net: sch: api: add extack support in qdisc_get_rtab
This patch adds extack support for the function qdisc_get_rtab which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why qdisc_get_rtab failed.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring
653d6fd68d net: sched: sch: add extack for graft callback
This patch adds extack support for graft callback to prepare per-qdisc
specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring
cbaacc4e8a net: sched: sch: add extack for block callback
This patch adds extack support for block callback to prepare per-qdisc
specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring
793d81d6a1 net: sched: sch: add extack to change class
This patch adds extack support for class change callback api. This prepares
to handle extack support inside each specific class implementation.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Alexander Aring
e63d7dfd2d net: sched: sch: add extack for init callback
This patch adds extack support for init callback to prepare per-qdisc
specific changes for extack.

Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-21 12:32:50 -05:00
Gustavo A. R. Silva
f3ae608edb net: sched: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 02:07:08 +01:00
Jiri Pirko
69d78ef25c net: sched: store Qdisc pointer in struct block
Prepare for removal of tp->q and store Qdisc pointer in the block
structure.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:00:40 +01:00
Eric Dumazet
3c75f6ee13 net_sched: sch_htb: add per class overlimits counter
HTB qdisc overlimits counter is properly increased, but we have no per
class counter, meaning it is difficult to diagnose HTB problems.

This patch adds this counter, visible in "tc -s class show dev eth0",
with current iproute2.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-18 19:12:14 -07:00
David S. Miller
6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Nikolay Aleksandrov
88c2ace69d sch_htb: fix crash on init failure
The commit below added a call to the ->destroy() callback for all qdiscs
which failed in their ->init(), but some were not prepared for such
change and can't handle partially initialized qdisc. HTB is one of them
and if any error occurs before the qdisc watchdog timer and qdisc work are
initialized then we can hit either a null ptr deref (timer->base) when
canceling in ->destroy or lockdep error info about trying to register
a non-static key and a stack dump. So to fix these two move the watchdog
timer and workqueue init before anything that can err out.
To reproduce userspace needs to send broken htb qdisc create request,
tested with a modified tc (q_htb.c).

Trace log:
[ 2710.897602] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 2710.897977] IP: hrtimer_active+0x17/0x8a
[ 2710.898174] PGD 58fab067
[ 2710.898175] P4D 58fab067
[ 2710.898353] PUD 586c0067
[ 2710.898531] PMD 0
[ 2710.898710]
[ 2710.899045] Oops: 0000 [#1] SMP
[ 2710.899232] Modules linked in:
[ 2710.899419] CPU: 1 PID: 950 Comm: tc Not tainted 4.13.0-rc6+ #54
[ 2710.899646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 2710.900035] task: ffff880059ed2700 task.stack: ffff88005ad4c000
[ 2710.900262] RIP: 0010:hrtimer_active+0x17/0x8a
[ 2710.900467] RSP: 0018:ffff88005ad4f960 EFLAGS: 00010246
[ 2710.900684] RAX: 0000000000000000 RBX: ffff88003701e298 RCX: 0000000000000000
[ 2710.900933] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88003701e298
[ 2710.901177] RBP: ffff88005ad4f980 R08: 0000000000000001 R09: 0000000000000001
[ 2710.901419] R10: ffff88005ad4f800 R11: 0000000000000400 R12: 0000000000000000
[ 2710.901663] R13: ffff88003701e298 R14: ffffffff822a4540 R15: ffff88005ad4fac0
[ 2710.901907] FS:  00007f2f5e90f740(0000) GS:ffff88005d880000(0000) knlGS:0000000000000000
[ 2710.902277] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2710.902500] CR2: 0000000000000000 CR3: 0000000058ca3000 CR4: 00000000000406e0
[ 2710.902744] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2710.902977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2710.903180] Call Trace:
[ 2710.903332]  hrtimer_try_to_cancel+0x1a/0x93
[ 2710.903504]  hrtimer_cancel+0x15/0x20
[ 2710.903667]  qdisc_watchdog_cancel+0x12/0x14
[ 2710.903866]  htb_destroy+0x2e/0xf7
[ 2710.904097]  qdisc_create+0x377/0x3fd
[ 2710.904330]  tc_modify_qdisc+0x4d2/0x4fd
[ 2710.904511]  rtnetlink_rcv_msg+0x188/0x197
[ 2710.904682]  ? rcu_read_unlock+0x3e/0x5f
[ 2710.904849]  ? rtnl_newlink+0x729/0x729
[ 2710.905017]  netlink_rcv_skb+0x6c/0xce
[ 2710.905183]  rtnetlink_rcv+0x23/0x2a
[ 2710.905345]  netlink_unicast+0x103/0x181
[ 2710.905511]  netlink_sendmsg+0x326/0x337
[ 2710.905679]  sock_sendmsg_nosec+0x14/0x3f
[ 2710.905847]  sock_sendmsg+0x29/0x2e
[ 2710.906010]  ___sys_sendmsg+0x209/0x28b
[ 2710.906176]  ? do_raw_spin_unlock+0xcd/0xf8
[ 2710.906346]  ? _raw_spin_unlock+0x27/0x31
[ 2710.906514]  ? __handle_mm_fault+0x651/0xdb1
[ 2710.906685]  ? check_chain_key+0xb0/0xfd
[ 2710.906855]  __sys_sendmsg+0x45/0x63
[ 2710.907018]  ? __sys_sendmsg+0x45/0x63
[ 2710.907185]  SyS_sendmsg+0x19/0x1b
[ 2710.907344]  entry_SYSCALL_64_fastpath+0x23/0xc2

Note that probably this bug goes further back because the default qdisc
handling always calls ->destroy on init failure too.

Fixes: 87b60cfacf ("net_sched: fix error recovery at qdisc creation")
Fixes: 0fbbeb1ba4 ("[PKT_SCHED]: Fix missing qdisc_destroy() in qdisc_create_dflt()")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30 15:26:11 -07:00
WANG Cong
143976ce99 net_sched: remove tc class reference counting
For TC classes, their ->get() and ->put() are always paired, and the
reference counting is completely useless, because:

1) For class modification and dumping paths, we already hold RTNL lock,
   so all of these ->get(),->change(),->put() are atomic.

2) For filter bindiing/unbinding, we use other reference counter than
   this one, and they should have RTNL lock too.

3) For ->qlen_notify(), it is special because it is called on ->enqueue()
   path, but we already hold qdisc tree lock there, and we hold this
   tree lock when graft or delete the class too, so it should not be gone
   or changed until we release the tree lock.

Therefore, this patch removes ->get() and ->put(), but:

1) Adds a new ->find() to find the pointer to a class by classid, no
   refcnt.

2) Move the original class destroy upon the last refcnt into ->delete(),
   right after releasing tree lock. This is fine because the class is
   already removed from hash when holding the lock.

For those who also use ->put() as ->unbind(), just rename them to reflect
this change.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-25 17:19:10 -07:00
Konstantin Khlebnikov
959466588a net_sched: call qlen_notify only if child qdisc is empty
This callback is used for deactivating class in parent qdisc.
This is cheaper to test queue length right here.

Also this allows to catch draining screwed backlog and prevent
second deactivation of already inactive parent class which will
crash kernel for sure. Kernel with print warning at destruction
of child qdisc where no packets but backlog is not zero.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 10:55:34 -07:00
Konstantin Khlebnikov
898904226b net_sched: reset pointers to tcf blocks in classful qdiscs' destructors
Traffic filters could keep direct pointers to classes in classful qdisc,
thus qdisc destruction first removes all filters before freeing classes.
Class destruction methods also tries to free attached filters but now
this isn't safe because tcf_block_put() unlike to tcf_destroy_chain()
cannot be called second time.

This patch set class->block to NULL after first tcf_block_put() and
turn second call into no-op.

Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 17:16:39 -07:00
Jiri Pirko
e25ea21ffa net: sched: introduce a TRAP control action
There is need to instruct the HW offloaded path to push certain matched
packets to cpu/kernel for further analysis. So this patch introduces a
new TRAP control action to TC.

For kernel datapath, this action does not make much sense. So with the
same logic as in HW, new TRAP behaves similar to STOLEN. The skb is just
dropped in the datapath (and virtually ejected to an upper level, which
does not exist in case of kernel).

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 12:45:23 -04:00
Jiri Pirko
6529eaba33 net: sched: introduce tcf block infractructure
Currently, the filter chains are direcly put into the private structures
of qdiscs. In order to be able to have multiple chains per qdisc and to
allow filter chains sharing among qdiscs, there is a need for common
object that would hold the chains. This introduces such object and calls
it "tcf_block".

Helpers to get and put the blocks are provided to be called from
individual qdisc code. Also, the original filter_list pointers are left
in qdisc privs to allow the entry into tcf_block processing without any
added overhead of possible multiple pointer dereference on fast path.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Jiri Pirko
87d83093bf net: sched: move tc_classify function to cls_api.c
Move tc_classify function to cls_api.c where it belongs, rename it to
fit the namespace.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 15:22:13 -04:00
Johannes Berg
fceb6435e8 netlink: pass extended ACK struct to parsing functions
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Jiri Kosina
49b499718f net: sched: make default fifo qdiscs appear in the dump
The original reason [1] for having hidden qdiscs (potential scalability
issues in qdisc_match_from_root() with single linked list in case of large
amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
qdisc linked list to hashtable").

This allows us for bringing more clarity and determinism into the dump by
making default pfifo qdiscs visible.

We're not turning this on by default though, at it was deemed [2] too
intrusive / unnecessary change of default behavior towards userspace.
Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
applications to request complete qdisc hierarchy dump, including the
ones that have always been implicit/invisible.

Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
about singletons would require quite some surgery with very little gain
(seeing no qdisc or seeing noop qdisc in the dump is probably setting
the same user expectation).

[1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
[2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-12 22:53:02 -07:00
Jiri Pirko
cf1facda2f sched: move tcf_proto_destroy and tcf_destroy_chain helpers into cls_api
Creation is done in this file, move destruction to be at the same place.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-10 11:38:08 -05:00
Eric Dumazet
1c0d32fde5 net_sched: gen_estimator: complete rewrite of rate estimators
1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
  gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:21:59 -05:00
Eric Dumazet
73e42ff72a sch_htb: do not report fake rate estimators
When I prepared commit d250a5f90e ("pkt_sched: gen_estimator: Dont
report fake rate estimators"), htb still had an implicit rate estimator
for all its classes.

Then later, I made this rate estimator optional in commit 64153ce0a7
("net_sched: htb: do not setup default rate estimators"), but I forgot
to update htb use of gnet_stats_copy_rate_est()

After this patch, "tc -s qdisc ..." no longer report fake rate
estimators for HTB classes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 17:13:49 -04:00
Florian Westphal
48da34b7a7 sched: add and use qdisc_skb_head helpers
This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head.

Its similar to the skb_buff_head api, but does not use skb->prev pointers.

Qdiscs will commonly enqueue at the tail of a list and dequeue at head.
While skb_buff_head works fine for this, enqueue/dequeue needs to also
adjust the prev pointer of next element.

The ->prev pointer is not required for qdiscs so we can just leave
it undefined and avoid one cacheline write access for en/dequeue.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:47:18 -04:00
David S. Miller
de0ba9a0d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just several instances of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24 00:53:32 -04:00
Konstantin Khlebnikov
0564bf0afa net/sched/sch_htb: clamp xstats tokens to fit into 32-bit int
In kernel HTB keeps tokens in signed 64-bit in nanoseconds. In netlink
protocol these values are converted into pshed ticks (64ns for now) and
truncated to 32-bit. In struct tc_htb_xstats fields "tokens" and "ctokens"
are declared as unsigned 32-bit but they could be negative thus tool 'tc'
prints them as signed. Big values loose higher bits and/or become negative.

This patch clamps tokens in xstat into range from INT_MIN to INT_MAX.
In this way it's easier to understand what's going on here.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-18 22:44:31 -07:00
David S. Miller
ee58b57100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:03:36 -04:00
Eric Dumazet
338ed9b4de net_sched: sch_htb: export class backlog in dumps
We already get child qdisc qlen, we also can get its backlog
so that class dumps can report it.

Also replace qstats by a single drop counter, but move it in
a separate cache line so that drops do not dirty useful cache lines.

Tested:

$ tc -s cl sh dev eth0
class htb 1:1 root leaf 3: prio 0 rate 1Gbit ceil 1Gbit burst 500000b cburst 500000b
 Sent 2183346912 bytes 9021815 pkt (dropped 2340774, overlimits 0 requeues 0)
 rate 1001Mbit 517543pps backlog 120758b 499p requeues 0
 lended: 9021770 borrowed: 0 giants: 0
 tokens: 9 ctokens: 9

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 12:19:35 -04:00
Eric Dumazet
520ac30f45 net_sched: drop packets after root qdisc lock is released
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.

Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.

Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.

This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.

I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 12:19:35 -04:00
Florian Westphal
0ee13627f9 htb: call qdisc_root with rcu read lock held
saw a debug splat:
net/include/net/sch_generic.h:287 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
 2 locks held by kworker/2:1/710:
  #0:  ("events"){.+.+.+}, at: [<ffffffff8106ca1d>]
  #1:  ((&q->work)){+.+...}, at: [<ffffffff8106ca1d>] process_one_work+0x14d/0x690
Workqueue: events htb_work_func
Call Trace:
 [<ffffffff812dc763>] dump_stack+0x85/0xc2
 [<ffffffff8109fee7>] lockdep_rcu_suspicious+0xe7/0x120
 [<ffffffff814ced47>] htb_work_func+0x67/0x70

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:42:46 -07:00
Eric Dumazet
a5a9f5346f net_sched: sch_htb: defer skb freeing
Both htb_reset() and htb_destroy() can use __qdisc_reset_queue()
instead of __skb_queue_purge() to defer skb freeing of internal
queues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:08:35 -07:00
Eric Dumazet
45f50bed1d net_sched: remove generic throttled management
__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.

I already removed it for sch_fq in commit f2600cf02b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.

When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.

Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.

This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:58:21 -07:00
Florian Westphal
a09ceb0e08 sched: remove qdisc->drop
after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no
more callers of ->drop() outside of other ->drop functions, i.e.
nothing calls them.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 23:58:52 -07:00
Eric Dumazet
edb09eb17e net: sched: do not acquire qdisc spinlock in qdisc/class stats dump
Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :

For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats()

In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Kevin Athey <kda@google.com>
Cc: Xiaotian Pei <xiaotian@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07 16:37:14 -07:00
Eric Dumazet
a9efad8b24 net_sched: avoid too many hrtimer_start() calls
I found a serious performance bug in packet schedulers using hrtimers.

sch_htb and sch_fq are definitely impacted by this problem.

We constantly rearm high resolution timers if some packets are throttled
in one (or more) class, and other packets are flying through qdisc on
another (non throttled) class.

hrtimer_start() does not have the mod_timer() trick of doing nothing if
expires value does not change :

	if (timer_pending(timer) &&
            timer->expires == expires)
                return 1;

This issue is particularly visible when multiple cpus can queue/dequeue
packets on the same qdisc, as hrtimer code has to lock a remote base.

I used following fix :

1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
it.

2) Cache watchdog prior expiration. hrtimer might provide this, but I
prefer to not rely on some hrtimer internal.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-24 14:49:14 -07:00
Nicolas Dichtel
2a51c1e8ec sched: use nla_put_u64_64bit()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:09:09 -04:00
WANG Cong
431e3a8e36 sch_htb: update backlog as well
We saw qlen!=0 but backlog==0 on our production machine:

qdisc htb 1: dev eth0 root refcnt 2 r2q 10 default 1 direct_packets_stat 0 ver 3.17
 Sent 172680457356 bytes 222469449 pkt (dropped 0, overlimits 123575834 requeues 0)
 backlog 0b 72p requeues 0

The problem is we only count qlen for HTB qdisc but not backlog.
We need to update backlog too when we update qlen, so that we
can at least know the average packet length.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-29 17:02:33 -05:00