feat(aggregation): Add STCH #719
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
PierreQuinton
left a comment
Many thanks for the PR, LGTM. We'll still wait for @ValerianRey's review.
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
Thanks a lot for the PR. It already looks really good.

I agree with the choice of going for what's described in the paper: the official implementation (and likewise the LibMTL implementation) seems to differ from the paper in many important ways, which makes it not representative at all of what the paper describes. On the other hand, what's described in the paper seems reasonable, and it's also what's implemented in LibMoon (see the link I just added in the issue).

I would add that there are two extra things mentioned in the paper: appendices B1 and B2. In B1, they describe a way to stabilize the method. This would maybe be a bit hard for a user to implement outside of STCH, so maybe we could add a boolean parameter for it. Appendix B2 is about normalization of the function values, which can easily be handled by the user or by a torchjd.normalization that we'll add in the near future, so I think it shouldn't be added in this PR.

I'll make an in-depth review of the PR later.
Thanks for the review @ValerianRey! I looked more carefully at B.1. The stabilization there is specifically the max-subtraction trick applied before dividing by mu. The current code divides first, so the /mu step itself can overflow before logsumexp gets a chance to stabilize anything:

    exponents = weights * shifted / self.mu
    return self.mu * torch.logsumexp(exponents.flatten(), dim=-1)

B.1's approach of centering y_i before dividing by mu avoids this entirely, since the values fed to exp are always non-positive regardless of input scale. In float32 this only bites at extreme magnitudes, but in float16 mixed precision the /mu step can overflow at values of just a few hundred when mu is small, so it's a real footgun in low precision. And since the fix is value-preserving (same output, same gradient) and essentially free, I'd rather just make it the default than gate it behind a flag:

    y = weights * shifted
    max_y = y.max()
    exponents = (y - max_y) / self.mu
    return self.mu * torch.logsumexp(exponents.flatten(), dim=-1) + max_y

This is what B.1 describes minus the dropped constant: it returns exactly the same STCH value and gradient as now, but never overflows in the /mu step. The + max_y cancels the centering in the gradient, so it stays correct without needing to detach anything. Happy to update the PR with this if you agree it's the right default.
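The value-preserving claim can be checked outside PyTorch. The following is a minimal pure-Python sketch, not the torchjd code: `stch_naive` and `stch_centered` are hypothetical names, and the inputs are arbitrary. Because Python floats are float64, the naive version here fails inside `exp` rather than in the division, but the centering identity it illustrates is the same.

```python
import math


def stch_naive(values, weights, reference, mu):
    # Direct transcription of mu * log(sum_i exp(lambda_i * (f_i - z_i) / mu)).
    y = [w * (f - z) for f, w, z in zip(values, weights, reference)]
    return mu * math.log(sum(math.exp(y_i / mu) for y_i in y))


def stch_centered(values, weights, reference, mu):
    # Same quantity, but max(y) is subtracted before dividing by mu,
    # so every argument passed to exp is <= 0.
    y = [w * (f - z) for f, w, z in zip(values, weights, reference)]
    max_y = max(y)
    return mu * math.log(sum(math.exp((y_i - max_y) / mu) for y_i in y)) + max_y


weights = [1 / 3] * 3
reference = [0.0] * 3

# On moderate inputs both forms agree up to rounding.
moderate = [1.0, 2.0, 3.0]
moderate_gap = abs(
    stch_naive(moderate, weights, reference, 0.5)
    - stch_centered(moderate, weights, reference, 0.5)
)

# On large inputs only the centered form survives.
large = [3000.0, 6000.0, 9000.0]
large_centered = stch_centered(large, weights, reference, 0.5)
try:
    stch_naive(large, weights, reference, 0.5)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(moderate_gap, large_centered, naive_overflowed)
```

In the centered form the largest term contributes exactly `exp(0) = 1`, so the sum inside the logarithm is always at least 1 and never overflows.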
Is the result the same, or is it more than just numerical considerations? If the former, then I would go for it as it looks more stable.
@PierreQuinton Purely numerical. The value and the gradient are mathematically identical to what we have now, it just avoids the overflow in the |
Thanks for explaining. I understand better now, and I agree with your suggested change. Please go ahead with it.
ValerianRey
left a comment
Just did the thorough review, and I have nothing to report, this is super clean.
We can merge after you add the extra stabilization trick from appendix B1.
Signed-off-by: ppraneth <pranethparuchuri@gmail.com>
@ValerianRey I have made the changes.
New `torchjd.scalarization.STCH`, the smooth Tchebycheff scalarization from Smooth Tchebycheff Scalarization for Multi-Objective Optimization. It returns a differentiable approximation of the weighted, shifted maximum of the values:

$$\mu \log\left(\sum_{i=1}^{m} \exp\left(\frac{\lambda_i (f_i - z_i^*)}{\mu}\right)\right)$$

where, following the paper's notation:

- $\lambda_i$ are the weights (the `weights` parameter)
- $z_i^*$ is the reference point (the `reference` parameter)
- $\mu$ is the smoothing parameter (the `mu` parameter)

As $\mu \to 0$ this recovers the classical (non-differentiable) Tchebycheff $\max_i \lambda_i (f_i - z_i^*)$; larger $\mu$ gives a smoother approximation.
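The limiting behaviour can be seen numerically. Below is a minimal pure-Python sketch, assuming the log-sum-exp form $\mu \log \sum_i \exp(\lambda_i (f_i - z_i^*) / \mu)$; the function name `smooth_tchebycheff` and all numbers are illustrative, not part of torchjd. It also checks the standard log-sum-exp bound: the smooth value lies between the hard maximum and the hard maximum plus $\mu \log m$.

```python
import math


def smooth_tchebycheff(values, weights, reference, mu):
    # mu * log(sum_i exp(lambda_i * (f_i - z_i) / mu)), with the maximum
    # factored out so the exponentials cannot overflow.
    y = [w * (f - z) for f, w, z in zip(values, weights, reference)]
    max_y = max(y)
    return max_y + mu * math.log(sum(math.exp((y_i - max_y) / mu) for y_i in y))


values = [0.8, 2.5, 1.7]      # f_i (arbitrary illustrative numbers)
weights = [0.5, 0.2, 0.3]     # lambda_i
reference = [0.1, 0.0, 0.2]   # z_i^*

# Classical (non-smooth) Tchebycheff value for comparison.
hard_max = max(w * (f - z) for f, w, z in zip(values, weights, reference))

# Gap between the smooth and the hard value for decreasing mu.
gaps = {
    mu: smooth_tchebycheff(values, weights, reference, mu) - hard_max
    for mu in (1.0, 0.1, 0.01)
}
for mu, gap in gaps.items():
    bound = mu * math.log(len(values))
    print(f"mu={mu}: smooth - max = {gap:.6f}, upper bound mu*log(m) = {bound:.6f}")
```

The gap shrinks as `mu` decreases, which is the sense in which the smooth version recovers the classical Tchebycheff scalarization.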
Design decisions
Confirmed with the maintainer before implementation:
- The `Xi-L/STCH` impl and LibMTL's copy of it are stateful (epoch-based warmup + a running nadir estimate, applied to `log(loss / nadir)`), which diverges from eq. 9. We use the clean stateless formula the paper proves its theory on.
- `STCH` (the paper's acronym).
- `mu` is required, no default. The paper tests [...]
- `mu` (required), `weights` (optional, default uniform on the simplex), `reference` (optional, default none).

One thing worth a look: the 1/m default and mu

The default `weights` is uniform on the simplex (1/m), matching the paper. A consequence: the exponent becomes $(f_i - z_i^*) / (m\mu)$, so the effective smoothing temperature is $m\mu$, not $\mu$. In practice this means the meaning of `mu` is coupled to the number of objectives: more objectives gives a smoother result for the same `mu`. This is faithful to the paper's simplex convention, but if you'd rather decouple `mu` from `m`, the alternative is an all-ones default. Happy to switch if you prefer.

Implementation

- Uses `torch.logsumexp`, so it's numerically stable without manual max-subtraction.
- [...] `GeometricMean`); negative values are fine.
- `mu <= 0` raises in `__init__`.
- `weights`/`reference` shape mismatches raise at call time (same pattern as `Constant`).
- [...] `weights` is not enforced (permissive, consistent with `Constant`).

Files

- `src/torchjd/scalarization/_stch.py`
- `src/torchjd/scalarization/__init__.py`
- `docs/source/docs/scalarization/stch.rst`
- `docs/source/docs/scalarization/index.rst`
- `tests/unit/scalarization/test_stch.py`
- `CHANGELOG.md` `[Unreleased]` entry

Test plan

- `uv run pytest tests/unit/scalarization/test_stch.py -W error -v`
- `uv run pytest tests/unit -W error` (full regression)
- `uv run ruff check && uv run ruff format --check`
- `uv run ty check`
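The coupling between the 1/m default weights and `mu` described in the design notes can be verified with a small pure-Python sketch. It assumes the log-sum-exp form of the scalarization and omits the reference point for brevity; `smooth_tchebycheff` is a hypothetical helper, not the torchjd class. With uniform weights 1/m and temperature `mu`, the result equals 1/m times the all-ones-weights result at temperature `m * mu`.

```python
import math


def smooth_tchebycheff(values, weights, mu):
    # mu * log(sum_i exp(lambda_i * f_i / mu)), reference point omitted
    # (equivalent to z_i = 0); the maximum is factored out for stability.
    y = [w * f for f, w in zip(values, weights)]
    max_y = max(y)
    return max_y + mu * math.log(sum(math.exp((y_i - max_y) / mu) for y_i in y))


values = [0.8, 2.5, 1.7, 1.1]  # arbitrary illustrative objective values
m = len(values)
mu = 0.1

# Uniform simplex weights (1/m) with temperature mu ...
uniform = smooth_tchebycheff(values, [1.0 / m] * m, mu)

# ... match all-ones weights with temperature m * mu, rescaled by 1/m.
all_ones_rescaled = smooth_tchebycheff(values, [1.0] * m, m * mu) / m

print(uniform, all_ones_rescaled)
```

So under the simplex default, adding objectives while holding `mu` fixed raises the effective temperature `m * mu`, which is the smoothing effect noted above.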