Skip to content

Paper describes heap-based Top-k kernel, but codebase seems to use a different Top-k implementation #6

@limin2021

Description

@limin2021

Hi, thanks for open-sourcing MSA.

In Section 4.1 of the paper, the Top-k kernel is described as a heap-based implementation:

Each of the warp’s 32 lanes streams a 1/32 stride of the input row and maintains a k-element min-heap in shared memory. The heap root is cached in a register, and insertions are performed with deferred writes. Finally, a k-round shuffle merge combines the 32 local TopK results.

However, in the current codebase I could not find this heap-based Top-k implementation.

The exposed API:

from fmha_sm100 import sparse_topk_select

appears to call:

python/fmha_sm100/api.py::sparse_topk_select
python/fmha_sm100/csrc/sparse_topk_select.cu
python/fmha_sm100/csrc/include/sparse_topk_select.cuh

The implementation in sparse_topk_select.cuh seems to be based on TensorRT-LLM indexerTopK, using histogram/threshold selection plus insertion sort, rather than the per-lane min-heap + shuffle-merge algorithm described in the paper.

Could you clarify:

  1. Is the heap-based Top-k kernel from the paper included in this repository?
  2. If yes, where is the implementation located?
  3. If no, is the current sparse_topk_select implementation intended to replace the paper-described heap-based kernel?
  4. Are the benchmark numbers in the paper based on the heap-based kernel or the currently released sparse_topk_select kernel?

Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions