  1. Jun 18, 2024
    • KVM: Stop processing *all* memslots when "null" mmu_notifier handler is found · c3f3edf7
      Babu Moger authored
      Bail from outer address space loop, not just the inner memslot loop, when
      a "null" handler is encountered by __kvm_handle_hva_range(), which is the
      intended behavior.  On x86, which has multiple address spaces thanks to
      SMM emulation, breaking from just the memslot loop results in undefined
      behavior due to assigning the non-existent return value from kvm_null_fn()
      to a bool.
      
      In practice, the bug is benign as kvm_mmu_notifier_invalidate_range_end()
      is the only caller that passes handler=kvm_null_fn, and it doesn't set
      flush_on_ret, i.e. assigning garbage to r.ret is ultimately ignored.  And
      for most configurations the compiler elides the entire sequence, i.e. there
      is no undefined behavior at runtime.
      
        ------------[ cut here ]------------
        UBSAN: invalid-load in arch/x86/kvm/../../../virt/kvm/kvm_main.c:655:10
        load of value 160 is not a valid value for type '_Bool'
        CPU: 370 PID: 8246 Comm: CPU 0/KVM Not tainted 6.8.2-amdsos-build58-ubuntu-22.04+ #1
        Hardware name: AMD Corporation Sh54p/Sh54p, BIOS WPC4429N 04/25/2024
        Call Trace:
         <TASK>
         dump_stack_lvl+0x48/0x60
         ubsan_epilogue+0x5/0x30
         __ubsan_handle_load_invalid_value+0x79/0x80
         kvm_mmu_notifier_invalidate_range_end.cold+0x18/0x4f [kvm]
         __mmu_notifier_invalidate_range_end+0x63/0xe0
         __split_huge_pmd+0x367/0xfc0
         do_huge_pmd_wp_page+0x1cc/0x380
         __handle_mm_fault+0x8ee/0xe50
         handle_mm_fault+0xe4/0x4a0
         __get_user_pages+0x190/0x840
         get_user_pages_unlocked+0xe0/0x590
         hva_to_pfn+0x114/0x550 [kvm]
         kvm_faultin_pfn+0xed/0x5b0 [kvm]
         kvm_tdp_page_fault+0x123/0x170 [kvm]
         kvm_mmu_page_fault+0x244/0xaa0 [kvm]
         vcpu_enter_guest+0x592/0x1070 [kvm]
         kvm_arch_vcpu_ioctl_run+0x145/0x8a0 [kvm]
         kvm_vcpu_ioctl+0x288/0x6d0 [kvm]
         __x64_sys_ioctl+0x8f/0xd0
         do_syscall_64+0x77/0x120
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
         </TASK>
        ---[ end trace ]---
      
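      For illustration, a minimal standalone C sketch of the control-flow
      bug and its fix (all names here are hypothetical stand-ins, not the
      actual KVM code):

        #include <stdbool.h>

        #define NR_ADDRESS_SPACES 2     /* e.g. normal + SMM on x86 */
        #define NR_MEMSLOTS       4

        static bool handle_hva_range(bool null_handler)
        {
            bool ret = false;

            for (int as = 0; as < NR_ADDRESS_SPACES; as++) {
                for (int slot = 0; slot < NR_MEMSLOTS; slot++) {
                    /*
                     * The fix: a "null" handler must bail from BOTH loops.
                     * The bug was a plain "break" here, which exited only
                     * the inner memslot loop, so the outer loop's next
                     * iteration consumed a value that was never produced.
                     */
                    if (null_handler)
                        goto out;
                    ret |= true;        /* stands in for handler() */
                }
            }
        out:
            return ret;
        }
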
      Fixes: 071064f1 ("KVM: Don't take mmu_lock for range invalidation unless necessary")
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Link: https://lore.kernel.org/r/b8723d39903b64c241c50f5513f804390c7b5eec.1718203311.git.babu.moger@amd.com
      [sean: massage changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  2. Jun 05, 2024
    • KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin() · 49f683b4
      Breno Leitao authored
      Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the
      loads and stores are atomic.  In the extremely unlikely scenario the
      compiler tears the stores, it's theoretically possible for KVM to attempt
      to get a vCPU using an out-of-bounds index, e.g. if the write is split
      into multiple 8-bit stores, and is paired with a 32-bit load on a VM with
      257 vCPUs:
      
        CPU0                              CPU1
        last_boosted_vcpu = 0xff;
      
                                          (last_boosted_vcpu = 0x100)
                                          last_boosted_vcpu[15:8] = 0x01;
        i = (last_boosted_vcpu = 0x1ff)
                                          last_boosted_vcpu[7:0] = 0x00;
      
        vcpu = kvm->vcpu_array[0x1ff];
      
      As detected by KCSAN:
      
        BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]
      
        write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
        kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
        handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
        vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
      		 arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
        vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
        kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
        kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
        __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
        __x64_sys_ioctl (fs/ioctl.c:890)
        x64_sys_call (arch/x86/entry/syscall_64.c:33)
        do_syscall_64 (arch/x86/entry/common.c:?)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
      
        read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
        kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
        handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
        vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
      			arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
        vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
        kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
        kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
        __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
        __x64_sys_ioctl (fs/ioctl.c:890)
        x64_sys_call (arch/x86/entry/syscall_64.c:33)
        do_syscall_64 (arch/x86/entry/common.c:?)
        entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
      
        value changed: 0x00000012 -> 0x00000000
      
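      The shape of the fix, as a hedged sketch (simplified from
      kvm_vcpu_on_spin(); the surrounding search loop is elided):

        /*
         * Load the index exactly once so the compiler can neither tear
         * nor reload it while the search for a yield candidate runs.
         */
        int last_boosted_vcpu = READ_ONCE(kvm->last_boosted_vcpu);

        /* ... pick vCPU "i" to boost, starting after last_boosted_vcpu ... */

        /* Publish the new index with a single, non-torn store. */
        WRITE_ONCE(kvm->last_boosted_vcpu, i);
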
      Fixes: 217ece61 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
      Cc: stable@vger.kernel.org
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  3. May 02, 2024
    • KVM: Remove kvm_make_all_cpus_request_except() · 82e9c84d
      Venkatesh Srinivas authored
      Remove kvm_make_all_cpus_request_except() as it effectively has no users,
      and arguably should never have been added in the first place.
      
      Commit 54163a34 ("KVM: Introduce kvm_make_all_cpus_request_except()")
      added the "except" variation for use in SVM's AVIC update path, which used
      it to skip sending a request to the current vCPU (commit 7d611233
      ("KVM: SVM: Disable AVIC before setting V_IRQ")).
      
      But the AVIC usage of kvm_make_all_cpus_request_except() was essentially a
      hack-a-fix that simply squashed the most likely scenario of a racy WARN
      without addressing the underlying problem(s).  Commit f1577ab2 ("KVM:
      SVM: svm_set_vintr don't warn if AVIC is active but is about to be
      deactivated") eventually fixed the WARN itself, and the "except" usage was
      subsequently dropped by df63202f ("KVM: x86: APICv: drop immediate
      APICv disablement on current vCPU").
      
      That kvm_make_all_cpus_request_except() hasn't gained any users in the
      last ~3 years isn't a coincidence.  If a VM-wide broadcast *needs* to skip
      the current vCPU, then odds are very good that there is an underlying bug
      that could be better fixed elsewhere.
      
      Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org>
      Link: https://lore.kernel.org/r/20240404232651.1645176-1-venkateshs@chromium.org
      [sean: rewrite changelog with --verbose]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  4. Mar 04, 2024
    • KVM: pfncache: simplify locking and make more self-contained · 6addfcf2
      David Woodhouse authored
      The locking on the gfn_to_pfn_cache is... interesting. And awful.
      
      There is a rwlock in ->lock which readers take to ensure protection
      against concurrent changes. But __kvm_gpc_refresh() makes assumptions
      that certain fields will not change even while it drops the write lock
      and performs MM operations to revalidate the target PFN and kernel
      mapping.
      
      Commit 93984f19 ("KVM: Fully serialize gfn=>pfn cache refresh via
      mutex") partly addressed that — not by fixing it, but by adding a new
      mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh()
      calls on a given gfn_to_pfn_cache, but is still only a partial solution.
      
      There is still a theoretical race where __kvm_gpc_refresh() runs in
      parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has
      dropped the write lock, kvm_gpc_deactivate() clears the ->active flag
      and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous
      ->pfn and ->khva are still valid, and reinstalls those values into the
      structure. This leaves the gfn_to_pfn_cache with the ->valid bit set,
      but ->active clear. And a ->khva which looks like a reasonable kernel
      address but is actually unmapped.
      
      All it takes is a subsequent reactivation to cause that ->khva to be
      dereferenced. This would theoretically cause an oops which would look
      something like this:
      
      [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0
      [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0
      
      I say "theoretically" because theoretically, that oops that was seen in
      production cannot happen. The code which uses the gfn_to_pfn_cache is
      supposed to have its *own* locking, to further paper over the fact that
      the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own
      rwlock abuse is not sufficient.
      
      For the Xen vcpu_info that external lock is the vcpu->mutex, and for the
      shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect
      the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all
      but the cases where the vcpu or kvm object is being *destroyed*, in
      which case the subsequent reactivation should never happen.
      
      Theoretically.
      
      Nevertheless, this locking abuse is awful and should be fixed, even if
      no clear explanation can be found for how the oops happened. So expand
      the use of the ->refresh_lock mutex to ensure serialization of
      activate/deactivate vs. refresh and make the pfncache locking entirely
      self-sufficient.
      
      This means that a future commit can simplify the locking in the callers,
      such as the Xen emulation code which has an outstanding problem with
      recursive locking of kvm->arch.xen.xen_lock, which will no longer be
      necessary.
      
      The rwlock abuse described above is still not best practice, although
      it's harmless now that the ->refresh_lock is held for the entire duration
      while the offending code drops the write lock, does some other stuff,
      then takes the write lock again and assumes nothing changed. That can
      also be fixed^W cleaned up in a subsequent commit, but this commit is
      a simpler basis for the Xen deadlock fix mentioned above.
      
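      A hedged sketch of the resulting shape (simplified, using the names
      discussed above; the kernel's scope-based guard(mutex) releases the
      mutex on every return path, per the [sean: ...] note below):

        static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa,
                                     unsigned long len)
        {
            /*
             * Serialize against activate/deactivate for the entire
             * refresh, including the window where the write lock is
             * dropped for MM operations.
             */
            guard(mutex)(&gpc->refresh_lock);

            write_lock_irq(&gpc->lock);
            if (!gpc->active) {
                write_unlock_irq(&gpc->lock);
                return -EINVAL;
            }
            /*
             * ... may drop and retake gpc->lock to revalidate the PFN and
             * mapping; holding ->refresh_lock means deactivate can no
             * longer clear ->active and unmap ->khva underneath us ...
             */
            return 0;
        }
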
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org
      [sean: use guard(mutex) to fix a missed unlock]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
  5. Feb 08, 2024
    • treewide: remove CONFIG_HAVE_KVM · f48212ee
      Paolo Bonzini authored
      
      It has no users anymore.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: move "select IRQ_BYPASS_MANAGER" to common code · 61df71ee
      Paolo Bonzini authored
      
      CONFIG_IRQ_BYPASS_MANAGER is a dependency of the common code included by
      CONFIG_HAVE_KVM_IRQ_BYPASS.  There is no advantage in adding the corresponding
      "select" directive to each architecture.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbol · 8886640d
      Paolo Bonzini authored
      
      KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask
      unused definitions in include/uapi/linux/kvm.h.  __KVM_HAVE_READONLY_MEM however
      was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on
      architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly
      return nonzero.  This however does not make sense, and it prevented userspace
      from supporting this architecture-independent feature without recompilation.
      
      Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and
      is only used in virt/kvm/kvm_main.c.  Userspace does not need to test it
      and there should be no need for it to exist.  Remove it and replace it
      with a Kconfig symbol within Linux source code.
      
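      The shape of the change in virt/kvm/kvm_main.c, as a sketch (the new
      symbol's name, CONFIG_HAVE_KVM_READONLY_MEM, is an assumption based
      on the description above):

        /* Before: gated on the uapi-visible define */
        #ifdef __KVM_HAVE_READONLY_MEM
            valid_flags |= KVM_MEM_READONLY;
        #endif

        /* After: gated on a kernel-internal Kconfig symbol (assumed name) */
        #ifdef CONFIG_HAVE_KVM_READONLY_MEM
            valid_flags |= KVM_MEM_READONLY;
        #endif
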
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. Feb 06, 2024
    • KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed · c2744ed2
      Sean Christopherson authored
      
      Nullify the async #PF worker's local "apf" pointer immediately after the
      point where the structure can be freed by the vCPU.  The existing comment
      is helpful, but easy to overlook as there is no associated code.
      
      Update the comment to clarify that it can be freed as soon as the lock
      is dropped, as "after this point" isn't strictly accurate, nor does it
      help understand what prevents the structure from being freed earlier.
      
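      A hedged sketch of the pattern (simplified; the lock and list names
      are assumed from the async #PF code, not quoted from the commit):

        spin_lock(&vcpu->async_pf.lock);
        list_add_tail(&apf->link, &vcpu->async_pf.done);
        spin_unlock(&vcpu->async_pf.lock);

        /*
         * The vCPU can free the structure as soon as the lock is dropped;
         * nullify the local pointer so any later use fails loudly as a
         * NULL dereference instead of a silent use-after-free.
         */
        apf = NULL;
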
      Reviewed-by: Xu Yilun <yilun.xu@intel.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Link: https://lore.kernel.org/r/20240110011533.503302-5-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: Get reference to VM's address space in the async #PF worker · 8284765f
      Sean Christopherson authored
      
      Get a reference to the target VM's address space in async_pf_execute()
      instead of gifting a reference from kvm_setup_async_pf().  Keeping the
      address space alive just to service an async #PF is counter-productive,
      i.e. if the process is exiting and all vCPUs are dead, then NOT doing
      get_user_pages_remote() and freeing the address space asap is desirable.
      
      Handling the mm reference entirely within async_pf_execute() also
      simplifies the async #PF flows as a whole, e.g. it's not immediately
      obvious when the worker task vs. the vCPU task is responsible for putting
      the gifted mm reference.
      
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Xu Yilun <yilun.xu@intel.com>
      Link: https://lore.kernel.org/r/20240110011533.503302-4-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: Put mm immediately after async #PF worker completes remote gup() · 422eeb54
      Sean Christopherson authored
      
      Put the async #PF worker's reference to the VM's address space as soon as
      the worker is done with the mm.  This will allow deferring getting a
      reference to the worker itself without having to track whether or not
      getting a reference succeeded.
      
      Note, if the vCPU is still alive, there is no danger of the worker getting
      stuck with tearing down the host page tables, as userspace also holds a
      reference (obviously), i.e. there is no risk of delaying the page-present
      notification due to triggering the slow path in mmput().
      
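      Taken together with the previous commit, the worker's mm handling
      ends up with roughly this shape (hedged sketch; locking details
      simplified):

        if (mmget_not_zero(mm)) {
            int locked = 1;

            mmap_read_lock(mm);
            get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
            if (locked)
                mmap_read_unlock(mm);
            /* Put the reference the moment the gup() is done. */
            mmput(mm);
        }
        /* ... the page-present notification proceeds without the mm ... */
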
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Xu Yilun <yilun.xu@intel.com>
      Link: https://lore.kernel.org/r/20240110011533.503302-3-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
    • KVM: Always flush async #PF workqueue when vCPU is being destroyed · 3d75b8aa
      Sean Christopherson authored
      Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
      completion queue, e.g. when a VM and all its vCPUs are being destroyed.
      KVM must ensure that none of its workqueue callbacks is running when the
      last reference to the KVM _module_ is put.  Gifting a reference to the
      associated VM prevents the workqueue callback from dereferencing freed
      vCPU/VM memory, but does not prevent the KVM module from being unloaded
      before the callback completes.
      
      Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
      async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
      result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
      finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
      
       WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
       Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
       CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
       Workqueue: events async_pf_execute [kvm]
       RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
       Call Trace:
        <TASK>
        async_pf_execute+0x198/0x260 [kvm]
        process_one_work+0x145/0x2d0
        worker_thread+0x27e/0x3a0
        kthread+0xba/0xe0
        ret_from_fork+0x2d/0x50
        ret_from_fork_asm+0x11/0x20
        </TASK>
       ---[ end trace 0000000000000000 ]---
       INFO: task kworker/8:1:251 blocked for more than 120 seconds.
             Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
       Workqueue: events async_pf_execute [kvm]
       Call Trace:
        <TASK>
        __schedule+0x33f/0xa40
        schedule+0x53/0xc0
        schedule_timeout+0x12a/0x140
        __wait_for_common+0x8d/0x1d0
        __flush_work.isra.0+0x19f/0x2c0
        kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
        kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
        kvm_put_kvm+0x1c1/0x320 [kvm]
        async_pf_execute+0x198/0x260 [kvm]
        process_one_work+0x145/0x2d0
        worker_thread+0x27e/0x3a0
        kthread+0xba/0xe0
        ret_from_fork+0x2d/0x50
        ret_from_fork_asm+0x11/0x20
        </TASK>
      
      If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
      then there's no need to gift async_pf_execute() a reference because all
      invocations of async_pf_execute() will be forced to complete before the
      vCPU and its VM are destroyed/freed.  And that in turn fixes the module
      unloading bug as __fput() won't do module_put() on the last vCPU reference
      until the vCPU has been freed, e.g. if closing the vCPU file also puts the
      last reference to the KVM module.
      
      Note that kvm_check_async_pf_completion() may also take the work item off
      the completion queue and so also needs to flush the work queue, as the
      work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
      on the workqueue could theoretically delay a vCPU due to waiting for the
      work to complete, but that's a very, very small chance, and likely a very
      small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
      new request, i.e. will effectively delay entering the guest, so the
      remaining work is really just:
      
              trace_kvm_async_pf_completed(addr, cr2_or_gpa);
      
              __kvm_vcpu_wake_up(vcpu);
      
              mmput(mm);
      
      and mmput() can't drop the last reference to the page tables if the vCPU is
      still alive, i.e. the vCPU won't get stuck tearing down page tables.
      
      Add a helper to do the flushing, specifically to deal with "wakeup all"
      work items, as they aren't actually work items, i.e. are never placed in a
      workqueue.  Trying to flush a bogus workqueue entry rightly makes
      __flush_work() complain (kudos to whoever added that sanity check).
      
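      A hedged sketch of such a helper (the helper and field names here
      are assumptions based on the description above, not quoted code):

        static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
        {
            /*
             * "Wakeup all" items are placed directly on the completion
             * queue and never enter the workqueue, so there is nothing
             * to flush, and __flush_work() would rightly complain.
             */
            if (!work->wakeup_all)
                flush_work(&work->work);

            kmem_cache_free(async_pf_cache, work);
        }
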
      Note, commit 5f6de5cb ("KVM: Prevent module exit until all VMs are
      freed") *tried* to fix the module refcounting issue by having VMs grab a
      reference to the module, but that only made the bug slightly harder to hit
      as it gave async_pf_execute() a bit more time to complete before the KVM
      module could be unloaded.
      
      Fixes: af585b92 ("KVM: Halt vcpu if page it tries to access is swapped out")
      Cc: stable@vger.kernel.org
      Cc: David Matlack <dmatlack@google.com>
      Reviewed-by: Xu Yilun <yilun.xu@intel.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Link: https://lore.kernel.org/r/20240110011533.503302-2-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>