Zaps and replies are high signal, true, but what you need here is a relative measure, and they don't provide one.
You might have an excellent app that got 2 zaps of 21 sats, vs a mediocre app that got 10 zaps of 5 sats simply because it was discovered a bit earlier. Will the computation give 2 stars to the former and 5 to the latter?
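
To put numbers on that example, here is a rough sketch. The app names and the two ranking rules (by zap count, by total sats) are my assumptions for illustration, not how any actual computation works:

```python
# Hypothetical zap data taken from the example above (amounts in sats).
zaps = {
    "excellent_app": [21, 21],   # 2 zaps of 21 sats
    "mediocre_app": [5] * 10,    # 10 zaps of 5 sats, discovered earlier
}

def zap_stats(amounts):
    """Summarize a list of zap amounts as a count and a total."""
    return {"count": len(amounts), "total": sum(amounts)}

stats = {name: zap_stats(amounts) for name, amounts in zaps.items()}

# Two naive ways a computation might rank apps (both are assumptions).
by_count = max(stats, key=lambda name: stats[name]["count"])
by_total = max(stats, key=lambda name: stats[name]["total"])

for name, s in stats.items():
    print(f"{name}: {s['count']} zaps, {s['total']} sats")
print("top by count:", by_count)
print("top by total:", by_total)
```

Both rules put the mediocre app on top (10 zaps vs 2, 50 sats vs 42), even though nothing in the data says it is the better app. It just had more time to collect zaps.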
Plus, you introduce another thing to trust (the computation). Ratings are signed by users and are incredibly low friction when someone already wants to rate something, write a reply, zap, etc.