In the Linux kernel, the following vulnerability has been resolved:
sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
There is a code path in dequeue_entities() that can set the slice of a sched_entity to U64_MAX, which sometimes results in a crash.
The offending case is when dequeue_entities() is called to dequeue a delayed group entity, and then the entity's parent's dequeue is delayed. In that case:
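- In the if (entity_is_task(se)) else block at the beginning of dequeue_entities(), slice is set to cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
- The first for_each_sched_entity() loop dequeues the entity.
- If the entity was its parent's only child, then the next iteration tries to dequeue the parent.
- If the parent's dequeue needs to be delayed, then it breaks from the first for_each_sched_entity() loop without updating slice.
- The second for_each_sched_entity() loop sets the parent's ->slice to the saved slice, which is still U64_MAX.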
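The sequence above can be sketched as a small userspace model. This is not kernel code: struct toy_rq, struct toy_se, and every toy_* function are hypothetical stand-ins that keep only the control flow described in the list, and the slice values are arbitrary. The model assumes that the minimum slice of an empty queue is reported as U64_MAX and that a delayed entity stays on its queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for cfs_rq and sched_entity. */
struct toy_rq {
    int nr_queued;        /* entities currently on this queue */
    uint64_t min_slice;   /* only meaningful while nr_queued > 0 */
};

struct toy_se {
    uint64_t slice;
    struct toy_rq *rq;     /* queue this entity sits on */
    struct toy_rq *my_q;   /* queue owned by this group entity */
    struct toy_se *parent;
    int delay_dequeue;     /* 1 if this entity's dequeue must be delayed */
};

/* Assumed behaviour: an empty queue reports U64_MAX as its minimum slice. */
uint64_t toy_min_slice(const struct toy_rq *rq)
{
    return rq->nr_queued ? rq->min_slice : UINT64_MAX;
}

/* Returns 0 if the dequeue was delayed, 1 if the entity left its queue. */
int toy_dequeue_entity(struct toy_se *se)
{
    if (se->delay_dequeue)
        return 0;
    assert(se->rq->nr_queued > 0);
    se->rq->nr_queued--;
    return 1;
}

/* Models only the pre-fix control flow for a group entity. */
void toy_dequeue_entities_buggy(struct toy_se *se)
{
    assert(se->my_q != NULL);
    uint64_t slice = toy_min_slice(se->my_q);  /* empty queue: U64_MAX */

    for (; se; se = se->parent) {
        if (!toy_dequeue_entity(se))
            break;                  /* delayed: slice is NOT refreshed */
        if (se->rq->nr_queued) {    /* parent has other children */
            slice = toy_min_slice(se->rq);
            se = se->parent;
            break;
        }
    }
    for (; se; se = se->parent) {
        se->slice = slice;          /* stale U64_MAX lands here */
        slice = toy_min_slice(se->rq);
    }
}

/* Delayed child group (empty queue) under a parent whose dequeue is delayed. */
uint64_t toy_run_scenario(void)
{
    struct toy_rq root_rq  = { .nr_queued = 1, .min_slice = 3000000 };
    struct toy_rq parent_q = { .nr_queued = 1, .min_slice = 3000000 };
    struct toy_rq child_q  = { .nr_queued = 0, .min_slice = 0 };
    struct toy_se parent = { .slice = 3000000, .rq = &root_rq,
                             .my_q = &parent_q, .parent = NULL,
                             .delay_dequeue = 1 };
    struct toy_se child  = { .slice = 3000000, .rq = &parent_q,
                             .my_q = &child_q, .parent = &parent,
                             .delay_dequeue = 0 };

    toy_dequeue_entities_buggy(&child);
    return parent.slice;
}
```

In this model, toy_run_scenario() returns UINT64_MAX: the child is removed, the parent's dequeue is delayed, and the second loop writes the stale value into the parent.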
This throws off subsequent calculations with potentially catastrophic results. A manifestation we saw in production was:
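- In update_entity_lag(), se->slice is used to calculate limit, which ends up as a huge negative number.
- limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit is negative, vlag > limit, so se->vlag is set to the same huge negative number.
- In place_entity(), se->vlag is scaled, which overflows and results in another huge (positive or negative) number.
- The adjusted lag is subtracted from se->vruntime, which increases or decreases se->vruntime by a huge number.
- pick_eevdf() calls entity_eligible()/vruntime_eligible(), which incorrectly returns false because the vruntime is so far from the other vruntimes on the queue, causing the (vruntime - cfs_rq->min_vruntime) * load calculation to overflow.
- Nothing appears to be eligible, so pick_eevdf() returns NULL.
- pick_next_entity() tries to dereference the return value of pick_eevdf() and crashes.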
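The clamp step in this chain is easy to check in isolation. The sketch below is a userspace illustration, not kernel code: toy_clamp() is written in a ternary form assumed to match the kernel's clamp() for these inputs, toy_clamped_vlag() is a hypothetical helper, and the negative limit is an arbitrary stand-in for whatever update_entity_lag() actually computes from the bogus slice (that computation is not modeled here).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed shape of clamp(): test the upper bound first, then the lower. */
int64_t toy_clamp(int64_t val, int64_t lo, int64_t hi)
{
    return val >= hi ? hi : (val <= lo ? lo : val);
}

/* Models se->vlag = clamp(vlag, -limit, limit) for a given limit. */
int64_t toy_clamped_vlag(int64_t vlag, int64_t limit)
{
    assert(limit != INT64_MIN);  /* -INT64_MIN would overflow in C */
    return toy_clamp(vlag, -limit, limit);
}
```

With a sane positive limit, the result stays within [-limit, limit]. With a negative limit the bounds are inverted, so any vlag greater than limit is replaced by limit itself: toy_clamped_vlag(5, -(INT64_C(1) << 62)) yields -(INT64_C(1) << 62).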
Dumping the cfs_rq states from the core dumps with drgn showed tell-tale huge vruntime ranges and bogus vlag values, and I also traced se->slice being set to U64_MAX on live systems (which was usually "benign" since the rest of the runqueue needed to be in a particular state to crash).
Fix it in dequeue_entities() by always setting slice from the first non-empty cfs_rq.
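One way to picture the effect of the fix is to compare which value reaches the parent with and without a refresh of slice when the first loop stops at a delayed entity. The function below is a hypothetical userspace model of that single decision, not the actual patch. Its name and parameters are invented for illustration, and it assumes that the queue a delayed entity remains on is non-empty and therefore reports a finite minimum slice.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the two-level case: the child group entity is dequeued,
 * then the parent's dequeue is delayed and the first loop stops.
 * Returns the slice that the second loop would write into the parent.
 */
uint64_t toy_slice_written_to_parent(uint64_t child_q_min_slice,
                                     uint64_t parent_rq_min_slice,
                                     int refresh_on_delayed_break)
{
    /* Saved at entry; U64_MAX when the child's own queue is empty. */
    uint64_t slice = child_q_min_slice;

    if (refresh_on_delayed_break) {
        /* The delayed parent is still queued, so its queue is non-empty. */
        assert(parent_rq_min_slice != UINT64_MAX);
        slice = parent_rq_min_slice;
    }
    return slice;
}
```

Called with (UINT64_MAX, 3000000, 0), the model returns UINT64_MAX, matching the buggy path. With the last argument set to 1 it returns 3000000, the minimum slice of the first non-empty queue.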