Add a Warp (NVIDIA warp-lang) GPU backend for PySPH WCSPH#435
Open
kunalpuri-prediqt wants to merge 40 commits into
Open
Add a Warp (NVIDIA warp-lang) GPU backend for PySPH WCSPH#435kunalpuri-prediqt wants to merge 40 commits into
kunalpuri-prediqt wants to merge 40 commits into
Conversation
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-repeated-step-leapfrog-checkpoint.md ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-elliptical-drop-runner-smoke-ramp.md ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-artificial-viscosity-momentum-term.md ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-tait-eos-and-sound-speed.md ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-xsph-gaussian-adaptive-baseline.md ADRs: none
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-16_warp-continuity-density-resolved-elliptical-drop.md Plans: .ai/implementations/blast-from-the-past/plans/2026-06-16_warp-resolved-elliptical-drop-performance-comparison.md; .ai/implementations/blast-from-the-past/plans/2026-06-16_warp-continuity-density-leapfrog-parity.md
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-17_warp-adaptive-timestep-policy-parity.md
Review: .ai/implementations/blast-from-the-past/reviews/2026-06-17_warp-fp32-resolved-rerun.md
…003) Add pysph/base/warp_codegen.py: a dynamic Warp equation-group code generator. WarpEquation blocks contribute initialize/loop/post_loop snippets and a group emits one fused, cached JIT kernel per (equation-set, dtype), mirroring PySPH's equation/group transpilation. The continuity-density PEC half-stage now runs one fused launch (continuity + pressure gradient + Monaghan viscosity + XSPH) instead of four. The summation-density path, Euler step, per-equation helpers (kept as oracle), and adaptive _wcsph_dt_factors traversal are unchanged. Million-particle fixed-step (nx=565, 10 steps): equation launches 8->2 per step, equation-kernel time ~5-6x lower; neighbor-cache build is now the dominant per-step cost. Parity vs prior Warp essentially exact (KE delta -6.4e-09); adaptive nx=100 guard kept exactly 1393 steps. Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell). Reviewed-by: @prabhu (LGTM) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-0004)
After ADR-0003 fused the continuity-density PEC half-stage into one neighbor
loop, the flat CSR neighbor-cache build (count traversal + scan + host readback
+ alloc + fill traversal) became the dominant per-step cost. With a single
fused consumer per half-stage, the flat list was built only to be read once.
Add neighbor_mode='grid' to the warp_codegen generator: the same fused kernel
body, but the flat starts/lengths/neighbors loop is replaced by a direct
uniform-grid cell-list walk with the support cutoff inline (geometry split
pre/post cutoff; cell-block walk wraps the equation snippets; mode in the
structural cache key). compute_wcsph_accel_continuity and (via hand-written
grid-direct _wcsph_dt_factors_grid_{f32,f64}) the adaptive timestep now walk the
cell list directly; the continuity PEC step builds only the grid. The flat path
is retained for the host get_nearest_particles query API, the per-equation
oracle, summation density, and compute_neighbor_sum; compute_wcsph_adaptive_timestep
defaults to flat so the summation leapfrog path is byte-identical.
Focused suite: 50 passed. Million-particle (nx=565): build_neighbor_cache_gpu
called 0 times on the continuity path; steady step wall 0.076-0.098 -> 0.059-0.064 s
(~25-35%). Adaptive nx=100 kept exactly 1393 steps at fp32-scale deltas.
Fresh same-session CPU (single-threaded PySPH Cython) vs Warp grid-direct
(RTX 4060, fp32): nx=100 resolved 160.10 s vs 6.78 s = 23.6x; 1M particles /
100 fixed steps 344.50 s vs 8.37 s = 41.2x wall / 57.6x per-step, KE rel delta
1.6e-9.
A 6-dimension adversarial review (review + skeptic-verify per finding) returned
one nit: the adaptive-dt grid default had leaked into the summation path; fixed
by defaulting that helper to flat and opting the continuity step into grid.
Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell).
Reviewed-by: @prabhu (LGTM)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document how to build the repo as set up for the Warp-backend work: core PySPH (Cython/C++ extensions via setup.py build_ext / pip install -e .) separated from the Warp GPU backend extra (warp-lang + NVIDIA GPU), with the known-good toolchain (Python 3.14, numpy 2.4, Cython 3.2, compyle 0.9.1, warp-lang 1.14 / CUDA Toolkit 12.9), verification + test commands, the fp32 note, and gotchas (bare `python`, pre-commit hook, OpenMP fallback). Add a GPU architecture compatibility section for V100 (sm_70), A100 (sm_80), H100 (sm_90), RTX 5090 and RTX PRO 6000 Blackwell (sm_120): the build is arch-independent (Warp JIT-compiles per-arch at runtime), so support is a driver + warp-lang matter. warp-lang 1.14 (CUDA 12.9) covers sm_70-sm_120; Blackwell needs an R570+ driver and CUDA >= 12.8. Includes per-arch driver floors, fp64 guidance (strong on V100/A100/H100, keep fp32 on Blackwell), VRAM/problem-size scaling, single-GPU note, and a warp.init() verification. Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…follow-up)
Make warp_codegen the single source for every neighbor-loop kernel and retire
the ~14 duplicated hand @wp.kernels (_summation_density, _continuity,
_pressure_gradient, _artificial_viscosity, _xsph_correction,
_wcsph_dt_factors{,_grid}, f32+f64).
- warp_codegen: add accumulate_outputs (seed _acc_<out> from the existing
d_<out>[i] for read-modify-write composition, used by the standalone additive
viscosity term). Make the generated kernel func_name a deterministic md5 of
the structural cache key instead of len(_KERNEL_CACHE): the old order-
dependent name changed the generated source whenever the build order shifted,
defeating Warp's on-disk kernel cache and forcing a cold recompile every
session. Verified the source is byte-stable across processes; kernels now
cold-compile once per machine then load cached (~20 ms).
- warp_sph: add SummationDensity and WcsphCflFactor blocks (the CFL dt_cfl is a
free-form wp.max neighbor reduction, dt_force a per-particle post_loop) and a
shared _run_equation_group launcher (flat/grid, accumulate, cross-array).
Repoint compute_summation_density / _continuity / _pressure_gradient /
_artificial_viscosity (accumulate) / _xsph_correction /
compute_wcsph_adaptive_timestep onto the generator; refactor
compute_wcsph_accel_continuity onto the launcher. Helpers reject non-canonical
out_prop/out_props (fail-fast) since the generated path writes canonical names.
Behavior-preserving consolidation: the continuity hot path uses the byte-
identical fused grid kernel (perf-neutral; million steady floor 0.059 s, 0 flat
builds, KE identical); the summation path keeps its pressure-gradient(overwrite)
-> viscosity(add) composition. Focused suite 52 passed; adaptive nx=100 guard
keeps exactly 1393 steps (KE relative ~8e-9). Headline CPU-vs-Warp numbers
(nx=100 23.6x; 1M/100-step 41.2x wall / 57.6x per-step) unchanged.
A 6-dimension adversarial review (skeptic-verified) found one minor issue (the
out_prop/out_props silent-write), addressed by the guards.
Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell).
Reviewed-by: @prabhu (LGTM)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend ADR-0004 grid-direct from the continuity PEC path to the summation Euler (wc_sph_euler_step) and KDK leapfrog step paths, so no device step path builds a flat CSR neighbor cache. - The five standalone equation helpers (compute_summation_density, _continuity, _pressure_gradient, _artificial_viscosity, _xsph_correction) gain neighbor_mode='flat' (default), threaded to _run_equation_group. The flat default keeps the oracle/cross-array tests and compute_neighbor_sum unchanged. - _compute_wcsph_acceleration and wc_sph_euler_step pass neighbor_mode='grid' to summation density, pressure gradient, viscosity (accumulate), continuity; the summation branch of wc_sph_leapfrog_step passes 'grid' to its adaptive-dt and xsph calls. No fusing: the pressure-gradient(overwrite) -> viscosity(add) composition is preserved, so the only numerical change is neighbor visitation order (fp32 reorder ~1e-7). build_neighbor_cache_gpu is now used only by the host get_nearest_particles query API, compute_neighbor_sum, and the flat-mode oracle/cross-array tests. Focused suite 53 passed (summation Euler/KDK CPU-parity tests pass under grid-direct with no tolerance changes; new no-flat-cache assertion). Continuity adaptive nx=100 guard unchanged (1393 steps, KE 7797.7071). A 4-dimension adversarial review (skeptic-verified) returned 0 findings. Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell). Reviewed-by: @prabhu (LGTM) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…R-0004) Add true periodic neighbors (wrapped cell walk + minimum-image distance) to the grid-direct Warp WCSPH path, enabling periodic benchmarks (Taylor-Green etc.). Approach: minimum-image + wrapped cell walk, not ghost particles. The free-surface elliptical drop is unaffected. - warp_codegen: generate_group_source/build_group_kernel gain periodic=False (grid only). The periodic variant adds box_lx/ly/lz + periodic_x/y/z runtime flags, wraps the cell index per periodic dim (((ix%nx)+nx)%nx), and applies minimum image (dx -= box_lx*wp.round(dx/box_lx)) before rij2/cutoff. periodic is in the cache key only when set, so non-periodic kernels keep their exact name/source/disk-cache and do not recompile. - warp_nnps: UniformGridWarpNNPS.set_periodic_box(bounds) tiles a cubic periodic box exactly; the cell-id binning kernels now WRAP (not clamp) the cell index in periodic dims so out-of-box source positions bin into their image cell, consistent with the wrapped walk. - warp_sph: _run_equation_group auto-detects periodicity from nnps._bounds and builds the periodic variant + passes the box via _grid_launch_args. A 5-dimension adversarial review (skeptic-verified) found 5 issues, all fixed: MAJOR -- out-of-box source positions were clamp-binned, so the wrapped walk missed wrap-around neighbors (now wrap-binned; out-of-box regression test added); MINOR -- a box narrower than 2*support was silently accepted (now requires floor(L/cell_min) >= 3 with a clear error); plus clear errors for missing min/max and a raw-length equal-length check. Validation: periodic summation density matches a CPU minimum-image reference to 5.5e-6 (including an out-of-box particle in the nx=4 clamp-bug regime); periodic uniform lattice density is uniform (no boundary deficiency); focused suite 57 passed; non-periodic path byte-identical (continuity guard 1393 steps, 0 recompiles). MVP supports equal-length (cubic) periodic boxes. Also add the production results-report tooling (generate_results_report.py + RESULTS_REPORT_HOWTO.md) to be run on a capable GPU machine: it runs the nx=100 apples-to-apples and 1M/100-step comparisons fresh and assembles a Markdown report (smoke-validated end-to-end via --quick). Validated: validate-memory PASS (hook run manually; git `python` unresolved in shell). Reviewed-by: @prabhu (LGTM) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l runs Document the optional MPI/Zoltan path (not needed for the Warp single-GPU backend): mpi4py + the Zoltan library (via ZOLTAN / ZOLTAN_INCLUDE / ZOLTAN_LIBRARY, USE_TRILINOS) + the separate pyzoltan package (`pip install pyzoltan --no-build-isolation`), matching setup.py's Zoltan options and docs/source/installation.rst. Points to the PyZoltan docs for the environment-specific Zoltan-library build. Cross-referenced from the prerequisites note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
get_parallel_extensions() gated only on HAVE_MPI, and get_zoltan_args() early-returned when HAVE_ZOLTAN was False without disabling the parallel build. So on a box with mpi4py installed but PyZoltan absent, the build tried to cythonize pysph/parallel/parallel_manager.pyx and failed on the missing pyzoltan .pxd cimports. Now, when PyZoltan is not installed, get_zoltan_args prints a notice and sets HAVE_MPI=False so the parallel extension is skipped and the serial/GPU build (which does not need it) succeeds. Documented in BUILD.md section 5a. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Million-particle / 100 fixed steps, single-threaded PySPH CPU vs grid-direct Warp (fp32) on an NVIDIA RTX PRO 6000 Blackwell Server Edition (95 GiB, sm_120; Warp 1.14 / CUDA 12.9 / driver 13.0), run by @kunalpuri-prediqt on a cloud box: CPU 357.43 s total (3.574 s/step), KE 7854.038961 Warp 1.32 s total (0.008625 s/step), KE 7854.038955 speedup 271.5x wall / 414.4x per-step | KE rel delta 8.0e-10 | all finite segmented: 0 flat-cache builds; steady step wall ~8.0 ms Per-step ~8.0 ms vs the RTX 4060 laptop's ~60 ms (~7.4x), so the headline jumps from 41x/57x (4060) to 271x/414x (Blackwell). The fused grid kernel cold-compiled once (38.6 s) then loaded cached (~6 ms), confirming the deterministic-name disk cache on a fresh machine; Blackwell ran with no code/build changes. Summary JSON + experiment.md + current.md updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path Sweeps particle count (via nx) and reports steady per-step wall time and throughput (particle-steps/s) for the fixed-step continuity PEC step on the current GPU. GPU-only (fast), warms the kernel once to absorb the one-time cold compile, and catches OOM per point to find the capacity ceiling. Run it on each GPU and collect the JSON blocks as a cross-GPU performance artifact. Documented in RESULTS_REPORT_HOWTO.md. Smoke-validated on the RTX 4060. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collect per-GPU gpu_perf_sweep.py runs under experiments/.../gpu-sweep/ (per-GPU JSON + comparison README). L40S (sm_89, 44 GiB) full sweep: throughput ramps to a ~9.8e7 particle-steps/s plateau from ~1M to ~10M particles, super-linear knee at 21M (5.96e7), all finite; ~89 s cold compile then disk-cached. Cross-GPU at 1M particles: RTX 4060 1.68e7 < L40S 9.19e7 < RTX PRO 6000 Blackwell 1.16e8 particle-steps/s. Full Blackwell/4060 sweeps still pending. experiment.md + current.md updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…omparison Four full gpu_perf_sweep.py curves now in gpu-sweep/ (L40S sm_89, RTX 5090 sm_120, RTX PRO 6000 Blackwell sm_120, B300 SXM6 sm_103) plus the 4060 1M anchor. Throughput at 1M particles (particle-steps/s): 4060 1.68e7 < 5090 6.74e7 < L40S 9.19e7 < RTX PRO 6000 1.22e8 < B300 1.41e8. Both Blackwell cards plateau at ~1.43e8 (likely a bandwidth/occupancy ceiling); the B300's edge is scale -- it holds the peak to 6M, runs 78.5M particles, and keeps a higher post-knee plateau (~7.0e7). Open finding (flagged, cause unconfirmed): a super-linear per-step knee whose location is non-monotonic with VRAM (RTX PRO 6000 95 GiB at 6M, B300 268 GiB at 10M, 5090 32 GiB at 10M, L40S 44 GiB at 21M) -- not capacity-bound; an algorithmic/grid effect to profile with profile_grid_direct_neighbors.py. All points finite. README + experiment.md + current.md updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fact Speedup vs a single CPU core (single-threaded PySPH Cython ~3.5 s/step at 1M; GPU per-step from the sweep) at 1M particles: RTX 4060 ~59x, RTX 5090 ~235x, L40S ~321x, RTX PRO 6000 Blackwell ~426x, B300 SXM6 ~491x. Directly-measured CPU-vs-Warp pairs corroborate (4060 57.6x, RTX PRO 6000 414x from headline_million_100step.py). Speedup is ~flat with particle count until the GPU knee, then falls (B300 at 10M ~273x). README + experiment.md + current.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
plot_gpu_sweep.py renders three PNGs from the sweep JSONs (pure matplotlib, no GPU): throughput vs particle count (ramp, ~1.43e8 Blackwell ceiling, per-GPU knees), per-step wall time vs particle count (log-log), and the headline speedup vs a single CPU core @ 1M (B300 ~491x ... RTX 4060 ~59x). Figures embedded in gpu-sweep/README.md and experiment.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a dollar-cost lens alongside the existing performance figures (throughput/per-step/speedup), priced by NVIDIA Brev on-demand rates (2026-06-18). New metric: $ per billion particle-steps = ($/hr)/(throughput*3600)*1e9 -- the price of the numerical work, not perf-per-dollar. plot_gpu_sweep.py: BREV_COST_PER_HR table + cost_per_billion(); two new figures cost_per_billion_1M.png (bar @1m) and cost_per_billion_vs_particles.png (log-log vs N). Performance figures unchanged. Finding @1m: RTX 5090 $0.0032 ~= L40S $0.0032 < RTX PRO 6000 $0.0060 (1.9x) < B300 $0.0187 (5.8x). The cheap cards do the same SPH work for ~6x less money; the datacenter cards buy scale and latency (78.5M particles, ~2x faster single step), not cost-per-work. gpu-sweep/README.md and experiment.md: cost table + both cost figures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Following a 5-lens "missing metric" panel (HPC/TCO/energy/physicist/ architect), adds the two convergent decision metrics as ESTIMATES from the existing sweep data (no measured power/profiler run, and none planned -- estimates are the final reported values, each with caveats): Energy-to-solution (energy_per_gpstep_1M.png + table): kJ/Gp-step = TDP_W/throughput*1e6 from datasheet board power. @1m L40S 3.8 < RTX PRO 6000 4.9 < 4060 6.9 < 5090 8.5 < B300 10.0. Breaks the 5090=L40S dollar tie: the 350W L40S uses ~2.2x fewer joules than the 575W 5090 at equal $/work, so L40S wins on a power-capped/owned fleet; B300 least efficient. Roofline / bandwidth utilization (mbu_at_1M.png + table): analytic MBU = throughput x ~1.12 KB/p-step / peak_BW. @1m every card under ~12% of peak (B300 ~2% of 8 TB/s). Refines/partly corrects the earlier "bandwidth-bound" guess: the ~1.43e8 Blackwell ceiling is occupancy/ launch/grid-build bound, NOT a memory wall -- large untapped headroom; next perf win is occupancy/launch tuning, not faster memory. Updated the README Observations bullet accordingly. plot_gpu_sweep.py gains BOARD_TDP_W / PEAK_BW_TBS / B_EFF_BYTES + helpers and the two bar figures. README + experiment.md carry both lenses with explicit model-based caveats. Closeout: current.md, daily 2026-06-18, session-log 2026-06-18_2008. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ADR-0005) Second Warp validation case, and the first 3D / gravity-driven / multi-array (fluid + solid wall) one. Additive to the backend; the 2D elliptical-drop path and its generated source stay byte-identical (test-pinned). warp_sph.py: - WendlandQuintic as kernel id 2 (cubic 0 / gaussian 1 routers gain an additive branch; their code paths unchanged); register the Wendland device leaves in _WARP_DEVICE_FUNCS for parity with cubic/gaussian. - apply_body_force (ramped gravity), compute_tait_eos_hg_correction (clamp rho>=rho0 so wall p>=0), and wc_sph_dam_break_step: an E-P-E-C (== reference EPECIntegrator) continuity-density step that sums fluid accel+density over [fluid, walls], takes wall density from [fluid] only, XSPH from fluid only, holds walls fixed via zero accel. - Fuse the dam-break fluid step (_WCSPH_DAM_BREAK_FLUID_BLOCKS = pressure + AV + continuity) into one kernel per source -> 1.23x faster Warp step at >1M. tests: - dim=3 codegen-grid, fused/grid-direct, and adaptive-timestep pins; new-kernel parity tests (Wendland, gravity, Tait-HG, two-array dam-break step). - test_2d_path_generated_source_is_byte_identical_to_golden: md5-pins the 2D elliptical-drop generated source (cache-stability guard). experiment packet experiments/2026-06-18_warp-dam-break-3d-runner/: - dam_break_3d_runner.py, run_correctness.sh smoke wrapper, tier-1 hand-rolled CPU EPEC parity, tier-2 vs the real dam_break_3d_lobovsky.py Application, perf+snapshot and 1M bench + 3D-explicit snapshot. Validation: tier-1 kinematics/density match Warp(fp32) to ~1e-8, pressure at the fp32 Tait-EOS cancellation floor; tier-2 KE/surge-front/height/density to fp32, p_max ~1% relative once developed. ~13-15x per-step vs single-thread PySPH CPU fp64 at >1M particles (1,014,072). Focused suite 67 passed. ADR-0005 Accepted; decisions graph/index, aspect contexts, current.md, daily, session log, and review (reviews/2026-06-19_warp-3d-dam-break-lobovsky.md, with _assets/ figures) updated. The review is pending @prabhu sign-off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gpu_perf_sweep_dam_break.py: GPU-only per-step throughput sweep vs particle count (via dx over the Lobovsky no-obstacle geometry), mirroring the elliptical drop's gpu_perf_sweep.py. Uses a vectorized numpy-mask IC (same box conditions as DamBreak3DGeometry) so 1M-10M ICs build in seconds, not the example's per-point Python loop. Fixed dt, fp32, fused wc_sph_dam_break_step; warmup absorbs the cold compile; per-point OOM is caught so the sweep finds each GPU's capacity ceiling. Run on each GPU and paste the JSON for a cross-GPU artifact. Local 4060 check: dx=0.011 -> 999,975 particles (matches the geometry count), 4.54M particle-steps/s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gpu-sweep/: per-GPU sweep JSONs (L40S, RTX PRO 6000 Blackwell), the lens+plot script (plot_gpu_sweep_dam_break.py), README, and figures. At ~1M (fixed-dt fused step): RTX PRO 6000 39.1 Mp-st/s (163x vs 1 CPU core) > L40S 25.9 (108x) > RTX 4060 4.5 (19x). $/energy favour the L40S ($0.0114 vs $0.0187 per billion p-steps; 13.5 vs 15.4 kJ/billion). Roughly ~24-33% analytic BW utilization (rough ~11 KB/p-step model) -- ~10x the elliptical's bytes/p-step but still not BW-bound. 5090 + B300 deferred (drop their JSON in and re-run the plot script). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Record the cross-GPU sweep (L40S + RTX PRO 6000 Blackwell + 4060 anchor) and the PR pypr#435 update in current.md, the session log, and the daily. Note the GitHub MCP can't write to upstream pypr (403) -- the PR body was edited with the author's own token. Remaining: RTX 5090 + B300 sweep (boxes busy), then @prabhu review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
|
@prabhuramachandran |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds a prototype Warp (NVIDIA
warp-lang) GPU backend for PySPH's weakly-compressible SPH (WCSPH) path, validated against the elliptical-drop and 3D dam-break (Lobovsky no-obstacle) benchmarks. Particle state stays device-resident across timesteps, and the design mirrors PySPH's equation/group transpilation model on the GPU.This is a prototype put up for review/discussion — feedback on the approach and on how (or whether) it should integrate with mainline PySPH is very welcome.
Second validation case — 3D dam-break (Lobovsky no-obstacle, ADR-0005)
The backend is now also validated on the 3D dam-break (Lobovsky no-obstacle) — the first 3D, gravity-driven, multi-array (fluid + solid wall) case. It is added as purely additive extensions: the 2D elliptical-drop path and its generated kernel source stay byte-identical (test-pinned).
New physics in
warp_sph.py(all additive): WendlandQuintic kernel (new kernel id), gravity (apply_body_force),TaitEOSHGCorrectionfor fixed solid walls (clamp ρ≥ρ₀ so wall p≥0), andwc_sph_dam_break_step— an E-P-E-C step (matching the referenceEPECIntegrator) that sums the fluid's acceleration + density rate over[fluid, wall], takes wall density from[fluid]only, XSPH from fluid only, and holds the walls fixed. The fluid acceleration blocks (pressure gradient + Monaghan AV + continuity) are fused into one kernel per source.Validation (
.ai/.../experiments/2026-06-18_warp-dam-break-3d-runner/):LinkedListNNPS(dim=3)+ WendlandQuintic) matches Warp (fp32) on x/y/z/u/v/w + fluid/wall density to ~1e-8; pressure matches in absolute terms at the fp32 Tait-EOS cancellation floor.dam_break_3d_lobovsky.pyApplication (WCSPHScheme+EPECIntegrator, fp64) at matched checkpoint times: kinetic energy / surge-front / max-height / density agree to fp32 scale, and pressure to ~1% once the flow develops.test_warp_*67 passed, including a guard asserting the 2D-path generated kernel source is byte-identical.3D structure at ~1M particles (Warp fp32, t = 0.2 s) — x-z side, x-y top-down, y-z end, and a 3D scatter:
Cross-GPU performance (fused dam-break step, fixed-dt, fp32, 32k–9.5M particles; L40S + RTX PRO 6000 Blackwell + an RTX 4060 anchor — 5090/B300 pending). At ~1M particles:
CPU ref: single-thread PySPH Cython 4.18 s/step at ~1M. The fixed-dt sweep is the cross-GPU metric; the adaptive-dt production step is ~1.5× slower. $ and energy are estimates from public hourly rates / datasheet TDP. The L40S is cheapest per unit work and most energy-efficient; the RTX PRO 6000 is fastest and still scaling at 9.5M. Per-particle-step throughput is ~3.5× below the elliptical drop's — expected for the heavier multi-array step. Details + figures:
.ai/.../2026-06-18_warp-dam-break-3d-runner/gpu-sweep/README.md.What's included
Core backend (
pysph/base/):warp_codegen.py— dynamic Warp equation-group code generation:WarpEquationblocks contributeinitialize/loop/post_loopsnippets, and a group emits one fused, disk-cached JIT kernel per(equation-set, dtype). Supportsneighbor_mode='flat'(CSR neighbor cache) and'grid'(direct uniform-grid cell-list walk with the support cutoff inline), additiveaccumulate_outputs, and aperiodicminimum-image variant. Kernel names are a deterministic md5 of the structural cache key so the Warp on-disk cache survives across sessions.warp_nnps.py—UniformGridWarpNNPScell-list NNPS with device rebuild from updated coordinates andset_periodic_box(wrapped binning + tiled bounds).warp_sph.py— WCSPH equations (continuity / summation density, Tait & isothermal EOS, pressure gradient, Monaghan artificial viscosity, XSPH, Gaussian kernel), a device-reduced adaptive timestep, and Euler / KDK / continuity-PEC step paths. Plus (ADR-0005): WendlandQuintic kernel, gravity,TaitEOSHGCorrectionsolid walls, and a fused multi-array dam-break step.warp_device_helper.py,particle_array.pyx— device-mirror plumbing.test_warp_codegen.py,test_warp_sph.py,test_warp_nnps.py,test_warp_device_helper.py.Build & docs:
BUILD.md(reproducible build + GPU-arch compatibility: V100/A100/H100/RTX 5090/RTX PRO 6000 Blackwell/B300), adocs/tutorial, and asetup.pyfix that skips the MPI/Zoltan parallel extension when PyZoltan is absent.Validation (elliptical drop)
nx=100elliptical drop matches the PySPH CPUApplication(continuity density + PySPH timestep policy) to floating-point scale, with identical step counts (1393).test_warp_*).use_double=False).Performance — elliptical drop (cross-GPU sweep)
Grid-direct continuity-PEC step, fp32, 31k–78.5M particles, across RTX 4060 / L40S / RTX 5090 / RTX PRO 6000 Blackwell / B300 SXM6. Headline: up to ~491× vs a single CPU core at 1M particles. The artifact also reports a cost-of-compute lens ($/billion particle-steps), an estimate-only energy-to-solution lens, and an analytic roofline lens (which suggests the ~1.43e8 Blackwell plateau is occupancy/launch/grid-build bound, not a memory-bandwidth wall). Details in
.ai/implementations/blast-from-the-past/experiments/2026-06-16_warp-elliptical-drop-runner/gpu-sweep/README.md.Note on scope
This branch also carries the
.ai/implementations/blast-from-the-past/tree — the AI-assisted development memory/journal (plans, ADRs, daily/session logs, experiment scripts and figures) used while building this. It is included for full provenance. If you'd rather review only the code, I'm happy to strip it down to a code-only PR (pysph/+BUILD.md+docs/+setup.py).🤖 Generated with Claude Code