Skip to content

Add a Warp (NVIDIA warp-lang) GPU backend for PySPH WCSPH#435

Open
kunalpuri-prediqt wants to merge 40 commits into
pypr:mainfrom
kunalpuri-prediqt:blast-from-the-past
Open

Add a Warp (NVIDIA warp-lang) GPU backend for PySPH WCSPH#435
kunalpuri-prediqt wants to merge 40 commits into
pypr:mainfrom
kunalpuri-prediqt:blast-from-the-past

Conversation

@kunalpuri-prediqt

@kunalpuri-prediqt kunalpuri-prediqt commented Jun 18, 2026

Copy link
Copy Markdown

Summary

This PR adds a prototype Warp (NVIDIA warp-lang) GPU backend for PySPH's weakly-compressible SPH (WCSPH) path, validated against the elliptical-drop and 3D dam-break (Lobovsky no-obstacle) benchmarks. Particle state stays device-resident across timesteps, and the design mirrors PySPH's equation/group transpilation model on the GPU.

This is a prototype put up for review/discussion — feedback on the approach and on how (or whether) it should integrate with mainline PySPH is very welcome.

Second validation case — 3D dam-break (Lobovsky no-obstacle, ADR-0005)

The backend is now also validated on the 3D dam-break (Lobovsky no-obstacle) — the first 3D, gravity-driven, multi-array (fluid + solid wall) case. It is added as purely additive extensions: the 2D elliptical-drop path and its generated kernel source stay byte-identical (test-pinned).

New physics in warp_sph.py (all additive): WendlandQuintic kernel (new kernel id), gravity (apply_body_force), TaitEOSHGCorrection for fixed solid walls (clamp ρ≥ρ₀ so wall p≥0), and wc_sph_dam_break_step — an E-P-E-C step (matching the reference EPECIntegrator) that sums the fluid's acceleration + density rate over [fluid, wall], takes wall density from [fluid] only, XSPH from fluid only, and holds the walls fixed. The fluid acceleration blocks (pressure gradient + Monaghan AV + continuity) are fused into one kernel per source.

Validation (.ai/.../experiments/2026-06-18_warp-dam-break-3d-runner/):

  • Tier-1 — a hand-rolled CPU EPEC reimplementation (LinkedListNNPS(dim=3) + WendlandQuintic) matches Warp (fp32) on x/y/z/u/v/w + fluid/wall density to ~1e-8; pressure matches in absolute terms at the fp32 Tait-EOS cancellation floor.
  • Tier-2 — vs the real shipped dam_break_3d_lobovsky.py Application (WCSPHScheme + EPECIntegrator, fp64) at matched checkpoint times: kinetic energy / surge-front / max-height / density agree to fp32 scale, and pressure to ~1% once the flow develops.
  • Focused suite test_warp_* 67 passed, including a guard asserting the 2D-path generated kernel source is byte-identical.

3D structure at ~1M particles (Warp fp32, t = 0.2 s) — x-z side, x-y top-down, y-z end, and a 3D scatter:

Warp 3D dam-break, ~1M particles, four views

Cross-GPU performance (fused dam-break step, fixed-dt, fp32, 32k–9.5M particles; L40S + RTX PRO 6000 Blackwell + an RTX 4060 anchor — 5090/B300 pending). At ~1M particles:

GPU throughput (M particle-steps/s) speedup vs 1 CPU core $ / billion p-steps kJ / billion p-steps
RTX PRO 6000 Blackwell (sm_120) 39.1 163× $0.0187 15.4
L40S (sm_89) 25.9 108× $0.0114 13.5
RTX 4060 Laptop (sm_89) 4.5 19× 25.3

CPU ref: single-thread PySPH Cython 4.18 s/step at ~1M. The fixed-dt sweep is the cross-GPU metric; the adaptive-dt production step is ~1.5× slower. $ and energy are estimates from public hourly rates / datasheet TDP. The L40S is cheapest per unit work and most energy-efficient; the RTX PRO 6000 is fastest and still scaling at 9.5M. Per-particle-step throughput is ~3.5× below the elliptical drop's — expected for the heavier multi-array step. Details + figures: .ai/.../2026-06-18_warp-dam-break-3d-runner/gpu-sweep/README.md.

Warp 3D dam-break throughput vs particle count

What's included

Core backend (pysph/base/):

  • warp_codegen.py — dynamic Warp equation-group code generation: WarpEquation blocks contribute initialize/loop/post_loop snippets, and a group emits one fused, disk-cached JIT kernel per (equation-set, dtype). Supports neighbor_mode='flat' (CSR neighbor cache) and 'grid' (direct uniform-grid cell-list walk with the support cutoff inline), additive accumulate_outputs, and a periodic minimum-image variant. Kernel names are a deterministic md5 of the structural cache key so the Warp on-disk cache survives across sessions.
  • warp_nnps.pyUniformGridWarpNNPS cell-list NNPS with device rebuild from updated coordinates and set_periodic_box (wrapped binning + tiled bounds).
  • warp_sph.py — WCSPH equations (continuity / summation density, Tait & isothermal EOS, pressure gradient, Monaghan artificial viscosity, XSPH, Gaussian kernel), a device-reduced adaptive timestep, and Euler / KDK / continuity-PEC step paths. Plus (ADR-0005): WendlandQuintic kernel, gravity, TaitEOSHGCorrection solid walls, and a fused multi-array dam-break step.
  • warp_device_helper.py, particle_array.pyx — device-mirror plumbing.
  • Tests: test_warp_codegen.py, test_warp_sph.py, test_warp_nnps.py, test_warp_device_helper.py.

Build & docs: BUILD.md (reproducible build + GPU-arch compatibility: V100/A100/H100/RTX 5090/RTX PRO 6000 Blackwell/B300), a docs/ tutorial, and a setup.py fix that skips the MPI/Zoltan parallel extension when PyZoltan is absent.

Validation (elliptical drop)

  • The resolved nx=100 elliptical drop matches the PySPH CPU Application (continuity density + PySPH timestep policy) to floating-point scale, with identical step counts (1393).
  • Focused test suite passes (test_warp_*).
  • The device path is fp32 (compyle use_double=False).

Performance — elliptical drop (cross-GPU sweep)

Grid-direct continuity-PEC step, fp32, 31k–78.5M particles, across RTX 4060 / L40S / RTX 5090 / RTX PRO 6000 Blackwell / B300 SXM6. Headline: up to ~491× vs a single CPU core at 1M particles. The artifact also reports a cost-of-compute lens ($/billion particle-steps), an estimate-only energy-to-solution lens, and an analytic roofline lens (which suggests the ~1.43e8 Blackwell plateau is occupancy/launch/grid-build bound, not a memory-bandwidth wall). Details in .ai/implementations/blast-from-the-past/experiments/2026-06-16_warp-elliptical-drop-runner/gpu-sweep/README.md.

Note on scope

This branch also carries the .ai/implementations/blast-from-the-past/ tree — the AI-assisted development memory/journal (plans, ADRs, daily/session logs, experiment scripts and figures) used while building this. It is included for full provenance. If you'd rather review only the code, I'm happy to strip it down to a code-only PR (pysph/ + BUILD.md + docs/ + setup.py).

🤖 Generated with Claude Code

kunalpuri-prediqt and others added 30 commits June 15, 2026 07:35
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-repeated-step-leapfrog-checkpoint.md

ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-elliptical-drop-runner-smoke-ramp.md

ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-artificial-viscosity-momentum-term.md

ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-tait-eos-and-sound-speed.md

ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-xsph-gaussian-adaptive-baseline.md

ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-continuity-density-resolved-elliptical-drop.md

Plans: .ai/implementations/blast-from-the-past/plans/2026-06-16_warp-resolved-elliptical-drop-performance-comparison.md; .ai/implementations/blast-from-the-past/plans/2026-06-16_warp-continuity-density-leapfrog-parity.md
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-17_warp-adaptive-timestep-policy-parity.md
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-17_warp-fp32-resolved-rerun.md
…003)

Add pysph/base/warp_codegen.py: a dynamic Warp equation-group code
generator. WarpEquation blocks contribute initialize/loop/post_loop
snippets and a group emits one fused, cached JIT kernel per
(equation-set, dtype), mirroring PySPH's equation/group transpilation.

The continuity-density PEC half-stage now runs one fused launch
(continuity + pressure gradient + Monaghan viscosity + XSPH) instead of
four. The summation-density path, Euler step, per-equation helpers
(kept as oracle), and adaptive _wcsph_dt_factors traversal are unchanged.

Million-particle fixed-step (nx=565, 10 steps): equation launches 8->2
per step, equation-kernel time ~5-6x lower; neighbor-cache build is now
the dominant per-step cost. Parity vs prior Warp essentially exact
(KE delta -6.4e-09); adaptive nx=100 guard kept exactly 1393 steps.

Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell).
Reviewed-by: @prabhu (LGTM)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-0004)

After ADR-0003 fused the continuity-density PEC half-stage into one neighbor
loop, the flat CSR neighbor-cache build (count traversal + scan + host readback
+ alloc + fill traversal) became the dominant per-step cost. With a single
fused consumer per half-stage, the flat list was built only to be read once.

Add neighbor_mode='grid' to the warp_codegen generator: the same fused kernel
body, but the flat starts/lengths/neighbors loop is replaced by a direct
uniform-grid cell-list walk with the support cutoff inline (geometry split
pre/post cutoff; cell-block walk wraps the equation snippets; mode in the
structural cache key). compute_wcsph_accel_continuity and (via hand-written
grid-direct _wcsph_dt_factors_grid_{f32,f64}) the adaptive timestep now walk the
cell list directly; the continuity PEC step builds only the grid. The flat path
is retained for the host get_nearest_particles query API, the per-equation
oracle, summation density, and compute_neighbor_sum; compute_wcsph_adaptive_timestep
defaults to flat so the summation leapfrog path is byte-identical.

Focused suite: 50 passed. Million-particle (nx=565): build_neighbor_cache_gpu
called 0 times on the continuity path; steady step wall 0.076-0.098 -> 0.059-0.064 s
(~25-35%). Adaptive nx=100 kept exactly 1393 steps at fp32-scale deltas.

Fresh same-session CPU (single-threaded PySPH Cython) vs Warp grid-direct
(RTX 4060, fp32): nx=100 resolved 160.10 s vs 6.78 s = 23.6x; 1M particles /
100 fixed steps 344.50 s vs 8.37 s = 41.2x wall / 57.6x per-step, KE rel delta
1.6e-9.

A 6-dimension adversarial review (review + skeptic-verify per finding) returned
one nit: the adaptive-dt grid default had leaked into the summation path; fixed
by defaulting that helper to flat and opting the continuity step into grid.

Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell).
Reviewed-by: @prabhu (LGTM)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document how to build the repo as set up for the Warp-backend work:
core PySPH (Cython/C++ extensions via setup.py build_ext / pip install -e .)
separated from the Warp GPU backend extra (warp-lang + NVIDIA GPU), with the
known-good toolchain (Python 3.14, numpy 2.4, Cython 3.2, compyle 0.9.1,
warp-lang 1.14 / CUDA Toolkit 12.9), verification + test commands, the
fp32 note, and gotchas (bare `python`, pre-commit hook, OpenMP fallback).

Add a GPU architecture compatibility section for V100 (sm_70), A100 (sm_80),
H100 (sm_90), RTX 5090 and RTX PRO 6000 Blackwell (sm_120): the build is
arch-independent (Warp JIT-compiles per-arch at runtime), so support is a
driver + warp-lang matter. warp-lang 1.14 (CUDA 12.9) covers sm_70-sm_120;
Blackwell needs an R570+ driver and CUDA >= 12.8. Includes per-arch driver
floors, fp64 guidance (strong on V100/A100/H100, keep fp32 on Blackwell),
VRAM/problem-size scaling, single-GPU note, and a warp.init() verification.

Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…follow-up)

Make warp_codegen the single source for every neighbor-loop kernel and retire
the ~14 duplicated hand @wp.kernels (_summation_density, _continuity,
_pressure_gradient, _artificial_viscosity, _xsph_correction,
_wcsph_dt_factors{,_grid}, f32+f64).

- warp_codegen: add accumulate_outputs (seed _acc_<out> from the existing
  d_<out>[i] for read-modify-write composition, used by the standalone additive
  viscosity term). Make the generated kernel func_name a deterministic md5 of
  the structural cache key instead of len(_KERNEL_CACHE): the old order-
  dependent name changed the generated source whenever the build order shifted,
  defeating Warp's on-disk kernel cache and forcing a cold recompile every
  session. Verified the source is byte-stable across processes; kernels now
  cold-compile once per machine then load cached (~20 ms).
- warp_sph: add SummationDensity and WcsphCflFactor blocks (the CFL dt_cfl is a
  free-form wp.max neighbor reduction, dt_force a per-particle post_loop) and a
  shared _run_equation_group launcher (flat/grid, accumulate, cross-array).
  Repoint compute_summation_density / _continuity / _pressure_gradient /
  _artificial_viscosity (accumulate) / _xsph_correction /
  compute_wcsph_adaptive_timestep onto the generator; refactor
  compute_wcsph_accel_continuity onto the launcher. Helpers reject non-canonical
  out_prop/out_props (fail-fast) since the generated path writes canonical names.

Behavior-preserving consolidation: the continuity hot path uses the byte-
identical fused grid kernel (perf-neutral; million steady floor 0.059 s, 0 flat
builds, KE identical); the summation path keeps its pressure-gradient(overwrite)
-> viscosity(add) composition. Focused suite 52 passed; adaptive nx=100 guard
keeps exactly 1393 steps (KE relative ~8e-9). Headline CPU-vs-Warp numbers
(nx=100 23.6x; 1M/100-step 41.2x wall / 57.6x per-step) unchanged.

A 6-dimension adversarial review (skeptic-verified) found one minor issue (the
out_prop/out_props silent-write), addressed by the guards.

Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell).
Reviewed-by: @prabhu (LGTM)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend ADR-0004 grid-direct from the continuity PEC path to the summation Euler
(wc_sph_euler_step) and KDK leapfrog step paths, so no device step path builds a
flat CSR neighbor cache.

- The five standalone equation helpers (compute_summation_density, _continuity,
  _pressure_gradient, _artificial_viscosity, _xsph_correction) gain
  neighbor_mode='flat' (default), threaded to _run_equation_group. The flat
  default keeps the oracle/cross-array tests and compute_neighbor_sum unchanged.
- _compute_wcsph_acceleration and wc_sph_euler_step pass neighbor_mode='grid' to
  summation density, pressure gradient, viscosity (accumulate), continuity; the
  summation branch of wc_sph_leapfrog_step passes 'grid' to its adaptive-dt and
  xsph calls. No fusing: the pressure-gradient(overwrite) -> viscosity(add)
  composition is preserved, so the only numerical change is neighbor visitation
  order (fp32 reorder ~1e-7).

build_neighbor_cache_gpu is now used only by the host get_nearest_particles
query API, compute_neighbor_sum, and the flat-mode oracle/cross-array tests.

Focused suite 53 passed (summation Euler/KDK CPU-parity tests pass under
grid-direct with no tolerance changes; new no-flat-cache assertion). Continuity
adaptive nx=100 guard unchanged (1393 steps, KE 7797.7071). A 4-dimension
adversarial review (skeptic-verified) returned 0 findings.

Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell).
Reviewed-by: @prabhu (LGTM)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…R-0004)

Add true periodic neighbors (wrapped cell walk + minimum-image distance) to the
grid-direct Warp WCSPH path, enabling periodic benchmarks (Taylor-Green etc.).
Approach: minimum-image + wrapped cell walk, not ghost particles. The
free-surface elliptical drop is unaffected.

- warp_codegen: generate_group_source/build_group_kernel gain periodic=False
  (grid only). The periodic variant adds box_lx/ly/lz + periodic_x/y/z runtime
  flags, wraps the cell index per periodic dim (((ix%nx)+nx)%nx), and applies
  minimum image (dx -= box_lx*wp.round(dx/box_lx)) before rij2/cutoff. periodic
  is in the cache key only when set, so non-periodic kernels keep their exact
  name/source/disk-cache and do not recompile.
- warp_nnps: UniformGridWarpNNPS.set_periodic_box(bounds) tiles a cubic periodic
  box exactly; the cell-id binning kernels now WRAP (not clamp) the cell index
  in periodic dims so out-of-box source positions bin into their image cell,
  consistent with the wrapped walk.
- warp_sph: _run_equation_group auto-detects periodicity from nnps._bounds and
  builds the periodic variant + passes the box via _grid_launch_args.

A 5-dimension adversarial review (skeptic-verified) found 5 issues, all fixed:
MAJOR -- out-of-box source positions were clamp-binned, so the wrapped walk
missed wrap-around neighbors (now wrap-binned; out-of-box regression test added);
MINOR -- a box narrower than 2*support was silently accepted (now requires
floor(L/cell_min) >= 3 with a clear error); plus clear errors for missing
min/max and a raw-length equal-length check.

Validation: periodic summation density matches a CPU minimum-image reference to
5.5e-6 (including an out-of-box particle in the nx=4 clamp-bug regime); periodic
uniform lattice density is uniform (no boundary deficiency); focused suite 57
passed; non-periodic path byte-identical (continuity guard 1393 steps, 0
recompiles). MVP supports equal-length (cubic) periodic boxes.

Also add the production results-report tooling (generate_results_report.py +
RESULTS_REPORT_HOWTO.md) to be run on a capable GPU machine: it runs the nx=100
apples-to-apples and 1M/100-step comparisons fresh and assembles a Markdown
report (smoke-validated end-to-end via --quick).

Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell).
Reviewed-by: @prabhu (LGTM)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l runs

Document the optional MPI/Zoltan path (not needed for the Warp single-GPU
backend): mpi4py + the Zoltan library (via ZOLTAN / ZOLTAN_INCLUDE /
ZOLTAN_LIBRARY, USE_TRILINOS) + the separate pyzoltan package
(`pip install pyzoltan --no-build-isolation`), matching setup.py's Zoltan
options and docs/source/installation.rst. Points to the PyZoltan docs for the
environment-specific Zoltan-library build. Cross-referenced from the
prerequisites note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
get_parallel_extensions() gated only on HAVE_MPI, and get_zoltan_args()
early-returned when HAVE_ZOLTAN was False without disabling the parallel build.
So on a box with mpi4py installed but PyZoltan absent, the build tried to
cythonize pysph/parallel/parallel_manager.pyx and failed on the missing
pyzoltan .pxd cimports. Now, when PyZoltan is not installed, get_zoltan_args
prints a notice and sets HAVE_MPI=False so the parallel extension is skipped and
the serial/GPU build (which does not need it) succeeds. Documented in BUILD.md
section 5a.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Million-particle / 100 fixed steps, single-threaded PySPH CPU vs grid-direct
Warp (fp32) on an NVIDIA RTX PRO 6000 Blackwell Server Edition (95 GiB, sm_120;
Warp 1.14 / CUDA 12.9 / driver 13.0), run by @kunalpuri-prediqt on a cloud box:

  CPU  357.43 s total (3.574 s/step), KE 7854.038961
  Warp   1.32 s total (0.008625 s/step), KE 7854.038955
  speedup 271.5x wall / 414.4x per-step | KE rel delta 8.0e-10 | all finite
  segmented: 0 flat-cache builds; steady step wall ~8.0 ms

Per-step ~8.0 ms vs the RTX 4060 laptop's ~60 ms (~7.4x), so the headline jumps
from 41x/57x (4060) to 271x/414x (Blackwell). The fused grid kernel cold-compiled
once (38.6 s) then loaded cached (~6 ms), confirming the deterministic-name disk
cache on a fresh machine; Blackwell ran with no code/build changes.

Summary JSON + experiment.md + current.md updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path

Sweeps particle count (via nx) and reports steady per-step wall time and
throughput (particle-steps/s) for the fixed-step continuity PEC step on the
current GPU. GPU-only (fast), warms the kernel once to absorb the one-time cold
compile, and catches OOM per point to find the capacity ceiling. Run it on each
GPU and collect the JSON blocks as a cross-GPU performance artifact. Documented
in RESULTS_REPORT_HOWTO.md. Smoke-validated on the RTX 4060.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kunalpuri-prediqt and others added 10 commits June 18, 2026 19:05
Collect per-GPU gpu_perf_sweep.py runs under
experiments/.../gpu-sweep/ (per-GPU JSON + comparison README). L40S (sm_89,
44 GiB) full sweep: throughput ramps to a ~9.8e7 particle-steps/s plateau from
~1M to ~10M particles, super-linear knee at 21M (5.96e7), all finite; ~89 s
cold compile then disk-cached. Cross-GPU at 1M particles:
RTX 4060 1.68e7 < L40S 9.19e7 < RTX PRO 6000 Blackwell 1.16e8 particle-steps/s.
Full Blackwell/4060 sweeps still pending. experiment.md + current.md updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…omparison

Four full gpu_perf_sweep.py curves now in gpu-sweep/ (L40S sm_89, RTX 5090
sm_120, RTX PRO 6000 Blackwell sm_120, B300 SXM6 sm_103) plus the 4060 1M
anchor. Throughput at 1M particles (particle-steps/s):
  4060 1.68e7 < 5090 6.74e7 < L40S 9.19e7 < RTX PRO 6000 1.22e8 < B300 1.41e8.
Both Blackwell cards plateau at ~1.43e8 (likely a bandwidth/occupancy ceiling);
the B300's edge is scale -- it holds the peak to 6M, runs 78.5M particles, and
keeps a higher post-knee plateau (~7.0e7).

Open finding (flagged, cause unconfirmed): a super-linear per-step knee whose
location is non-monotonic with VRAM (RTX PRO 6000 95 GiB at 6M, B300 268 GiB at
10M, 5090 32 GiB at 10M, L40S 44 GiB at 21M) -- not capacity-bound; an
algorithmic/grid effect to profile with profile_grid_direct_neighbors.py.
All points finite. README + experiment.md + current.md updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fact

Speedup vs a single CPU core (single-threaded PySPH Cython ~3.5 s/step at 1M;
GPU per-step from the sweep) at 1M particles: RTX 4060 ~59x, RTX 5090 ~235x,
L40S ~321x, RTX PRO 6000 Blackwell ~426x, B300 SXM6 ~491x. Directly-measured
CPU-vs-Warp pairs corroborate (4060 57.6x, RTX PRO 6000 414x from
headline_million_100step.py). Speedup is ~flat with particle count until the
GPU knee, then falls (B300 at 10M ~273x). README + experiment.md + current.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
plot_gpu_sweep.py renders three PNGs from the sweep JSONs (pure matplotlib, no
GPU): throughput vs particle count (ramp, ~1.43e8 Blackwell ceiling, per-GPU
knees), per-step wall time vs particle count (log-log), and the headline speedup
vs a single CPU core @ 1M (B300 ~491x ... RTX 4060 ~59x). Figures embedded in
gpu-sweep/README.md and experiment.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a dollar-cost lens alongside the existing performance figures
(throughput/per-step/speedup), priced by NVIDIA Brev on-demand rates
(2026-06-18). New metric: $ per billion particle-steps =
($/hr)/(throughput*3600)*1e9 -- the price of the numerical work, not
perf-per-dollar.

plot_gpu_sweep.py: BREV_COST_PER_HR table + cost_per_billion(); two new
figures cost_per_billion_1M.png (bar @1m) and
cost_per_billion_vs_particles.png (log-log vs N). Performance figures
unchanged.

Finding @1m: RTX 5090 $0.0032 ~= L40S $0.0032 < RTX PRO 6000 $0.0060
(1.9x) < B300 $0.0187 (5.8x). The cheap cards do the same SPH work for
~6x less money; the datacenter cards buy scale and latency (78.5M
particles, ~2x faster single step), not cost-per-work.

gpu-sweep/README.md and experiment.md: cost table + both cost figures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Following a 5-lens "missing metric" panel (HPC/TCO/energy/physicist/
architect), adds the two convergent decision metrics as ESTIMATES from
the existing sweep data (no measured power/profiler run, and none
planned -- estimates are the final reported values, each with caveats):

Energy-to-solution (energy_per_gpstep_1M.png + table): kJ/Gp-step =
TDP_W/throughput*1e6 from datasheet board power. @1m L40S 3.8 < RTX PRO
6000 4.9 < 4060 6.9 < 5090 8.5 < B300 10.0. Breaks the 5090=L40S dollar
tie: the 350W L40S uses ~2.2x fewer joules than the 575W 5090 at equal
$/work, so L40S wins on a power-capped/owned fleet; B300 least efficient.

Roofline / bandwidth utilization (mbu_at_1M.png + table): analytic MBU =
throughput x ~1.12 KB/p-step / peak_BW. @1m every card under ~12% of
peak (B300 ~2% of 8 TB/s). Refines/partly corrects the earlier
"bandwidth-bound" guess: the ~1.43e8 Blackwell ceiling is occupancy/
launch/grid-build bound, NOT a memory wall -- large untapped headroom;
next perf win is occupancy/launch tuning, not faster memory. Updated the
README Observations bullet accordingly.

plot_gpu_sweep.py gains BOARD_TDP_W / PEAK_BW_TBS / B_EFF_BYTES + helpers
and the two bar figures. README + experiment.md carry both lenses with
explicit model-based caveats. Closeout: current.md, daily 2026-06-18,
session-log 2026-06-18_2008.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ADR-0005)

Second Warp validation case, and the first 3D / gravity-driven / multi-array
(fluid + solid wall) one. Additive to the backend; the 2D elliptical-drop path
and its generated source stay byte-identical (test-pinned).

warp_sph.py:
- WendlandQuintic as kernel id 2 (cubic 0 / gaussian 1 routers gain an additive
  branch; their code paths unchanged); register the Wendland device leaves in
  _WARP_DEVICE_FUNCS for parity with cubic/gaussian.
- apply_body_force (ramped gravity), compute_tait_eos_hg_correction (clamp
  rho>=rho0 so wall p>=0), and wc_sph_dam_break_step: an E-P-E-C
  (== reference EPECIntegrator) continuity-density step that sums fluid
  accel+density over [fluid, walls], takes wall density from [fluid] only, XSPH
  from fluid only, holds walls fixed via zero accel.
- Fuse the dam-break fluid step (_WCSPH_DAM_BREAK_FLUID_BLOCKS = pressure + AV +
  continuity) into one kernel per source -> 1.23x faster Warp step at >1M.

tests:
- dim=3 codegen-grid, fused/grid-direct, and adaptive-timestep pins; new-kernel
  parity tests (Wendland, gravity, Tait-HG, two-array dam-break step).
- test_2d_path_generated_source_is_byte_identical_to_golden: md5-pins the 2D
  elliptical-drop generated source (cache-stability guard).

experiment packet experiments/2026-06-18_warp-dam-break-3d-runner/:
- dam_break_3d_runner.py, run_correctness.sh smoke wrapper, tier-1 hand-rolled
  CPU EPEC parity, tier-2 vs the real dam_break_3d_lobovsky.py Application,
  perf+snapshot and 1M bench + 3D-explicit snapshot.

Validation: tier-1 kinematics/density match Warp(fp32) to ~1e-8, pressure at the
fp32 Tait-EOS cancellation floor; tier-2 KE/surge-front/height/density to fp32,
p_max ~1% relative once developed. ~13-15x per-step vs single-thread PySPH CPU
fp64 at >1M particles (1,014,072). Focused suite 67 passed.

ADR-0005 Accepted; decisions graph/index, aspect contexts, current.md, daily,
session log, and review (reviews/2026-06-19_warp-3d-dam-break-lobovsky.md, with
_assets/ figures) updated. The review is pending @prabhu sign-off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gpu_perf_sweep_dam_break.py: GPU-only per-step throughput sweep vs particle
count (via dx over the Lobovsky no-obstacle geometry), mirroring the elliptical
drop's gpu_perf_sweep.py. Uses a vectorized numpy-mask IC (same box conditions
as DamBreak3DGeometry) so 1M-10M ICs build in seconds, not the example's
per-point Python loop. Fixed dt, fp32, fused wc_sph_dam_break_step; warmup
absorbs the cold compile; per-point OOM is caught so the sweep finds each GPU's
capacity ceiling. Run on each GPU and paste the JSON for a cross-GPU artifact.

Local 4060 check: dx=0.011 -> 999,975 particles (matches the geometry count),
4.54M particle-steps/s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gpu-sweep/: per-GPU sweep JSONs (L40S, RTX PRO 6000 Blackwell), the lens+plot
script (plot_gpu_sweep_dam_break.py), README, and figures. At ~1M (fixed-dt
fused step): RTX PRO 6000 39.1 Mp-st/s (163x vs 1 CPU core) > L40S 25.9 (108x) >
RTX 4060 4.5 (19x). $/energy favour the L40S ($0.0114 vs $0.0187 per billion
p-steps; 13.5 vs 15.4 kJ/billion). Roughly ~24-33% analytic BW utilization (rough
~11 KB/p-step model) -- ~10x the elliptical's bytes/p-step but still not BW-bound.
5090 + B300 deferred (drop their JSON in and re-run the plot script).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Record the cross-GPU sweep (L40S + RTX PRO 6000 Blackwell + 4060 anchor) and the
PR pypr#435 update in current.md, the session log, and the daily. Note the GitHub MCP
can't write to upstream pypr (403) -- the PR body was edited with the author's own
token. Remaining: RTX 5090 + B300 sweep (boxes busy), then @prabhu review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kunalpuri-prediqt

Copy link
Copy Markdown
Author

@prabhuramachandran
could we have a meeting about this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant