In the Linux kernel, the following vulnerability has been resolved:
btrfs: fix space cache corruption and potential double allocations
When testing space_cache v2 on a large set of machines, we encountered a few symptoms:
- "unable to add free space :-17" (EEXIST) errors.
- Missing free space info items, sometimes caught with a "missing free
- Double-accounted space: ranges that were allocated in the extent tree
- On some hosts with no on-disk corruption or error messages, the
All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache.
This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") subtly broke this by allowing multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
4. __btrfs_free_extent() -> do_free_extent_accounting() ->
5. do_free_extent_accounting() -> btrfs_update_block_group() ->
6. btrfs_commit_transaction() for transaction A calls
7. The caching thread gets to our block group. Since the commit roots
8. btrfs_commit_transaction() advances transaction A to
9. fsync calls btrfs_commit_transaction() for transaction B. Since
10. btrfs_commit_transaction() for transaction B calls
11. btrfs_commit_transaction() for transaction A calls
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7 and 11, then step 11 will silently re-add that space to the space cache as free even though it is actually allocated. Then, if that space is allocated *again*, the free space tree will be corrupted (namely, the wrong item will be deleted).
* If we don't catch this free space tree corr ---truncated---
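The first two outcomes can be illustrated with a toy model of a free-range set. This is a minimal sketch and not btrfs code: every name here (toy_space_cache, toy_add_free_space, toy_alloc, toy_double_add, toy_double_alloc) is hypothetical, the range values are arbitrary, and the only behavior assumed from the text above is that adding a range which is already present fails with EEXIST (-17). One add stands in for the caching thread and the other for the unpin path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_RANGES 16

/* Toy stand-in for an in-memory space cache: an unsorted set of free ranges. */
struct toy_space_cache {
	struct { uint64_t start, len; } ranges[MAX_RANGES];
	int count;
};

/* True if [a, a + alen) and [b, b + blen) share at least one byte. */
static bool toy_overlaps(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
	return a < b + blen && b < a + alen;
}

/* Add a free range; refuse with -EEXIST if it overlaps one already present. */
static int toy_add_free_space(struct toy_space_cache *c, uint64_t start, uint64_t len)
{
	assert(len > 0);
	for (int i = 0; i < c->count; i++) {
		if (toy_overlaps(start, len, c->ranges[i].start, c->ranges[i].len))
			return -EEXIST;
	}
	if (c->count == MAX_RANGES)
		return -ENOSPC;
	c->ranges[c->count].start = start;
	c->ranges[c->count].len = len;
	c->count++;
	return 0;
}

/* Allocate a range only if an identical free range exists (toy simplification). */
static int toy_alloc(struct toy_space_cache *c, uint64_t start, uint64_t len)
{
	for (int i = 0; i < c->count; i++) {
		if (c->ranges[i].start == start && c->ranges[i].len == len) {
			c->ranges[i] = c->ranges[c->count - 1];
			c->count--;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Outcome 1: both racing paths add the same range; the second add fails. */
static int toy_double_add(void)
{
	struct toy_space_cache c = {0};
	int ret = toy_add_free_space(&c, 4096, 4096);

	if (ret)
		return ret;
	return toy_add_free_space(&c, 4096, 4096);
}

/* Outcome 2: the range is allocated between the two adds, so the second
 * add succeeds silently and the same bytes can be handed out twice.
 * Returns how many times the range was successfully allocated. */
static int toy_double_alloc(void)
{
	struct toy_space_cache c = {0};
	int allocs = 0;

	toy_add_free_space(&c, 4096, 4096);
	if (toy_alloc(&c, 4096, 4096) == 0)
		allocs++;
	if (toy_add_free_space(&c, 4096, 4096) == 0 &&
	    toy_alloc(&c, 4096, 4096) == 0)
		allocs++;
	return allocs;
}
```

In this model, toy_double_add() returns -EEXIST, and toy_double_alloc() returns 2, meaning the same range was allocated twice. The overlap check is what turns the first interleaving into a visible error. The second interleaving passes unnoticed because the range is no longer in the set when it is re-added.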