Mixture of Experts

A gating network selects a subset of the experts for each input, and the output is the weighted sum of the selected experts' outputs. The weights are computed by the same gating network.
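
A minimal sketch of the gating described above, in plain Python. The linear gate, the softmax over only the selected experts, the choice of k, and all names (`gate`, `moe_forward`, `gate_weights`) are assumptions for illustration; real implementations differ in how they score and normalize.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate(x: Vector, gate_weights: Sequence[Vector], k: int) -> Dict[int, float]:
    """Hypothetical linear gate: one score per expert, keep the top k.

    Returns {expert_index: weight}; the weights sum to 1.
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return dict(zip(top, softmax([logits[i] for i in top])))


def moe_forward(x: Vector, gate_weights: Sequence[Vector],
                experts: Sequence[Expert], k: int = 2) -> Vector:
    """Weighted sum of the outputs of the k selected experts."""
    weights = gate(x, gate_weights, k)
    out = [0.0] * len(x)  # assumes each expert returns a vector the size of x
    for idx, w in weights.items():
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Three toy experts that scale the input by 1, 2 and 3.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
gate_weights = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # one row per expert
result = moe_forward([1.0, 2.0], gate_weights, experts, k=2)
print(result)
```

Here experts 1 and 2 get equal scores, so each contributes with weight 0.5 and expert 0 is never run.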

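One problem is load balancing: the gate can assign inputs non-uniformly, so a few experts receive most of the inputs while the others sit mostly idle. There is also a lot of communication overhead when the experts are placed on different devices, because each input has to be sent to the device that holds its assigned expert and the result sent back.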
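
Both problems can be made concrete with a small measurement sketch. The function names, the max/mean imbalance metric, and the toy device placement below are assumptions for illustration, not a standard API.

```python
from collections import Counter
from typing import List, Sequence


def expert_load(assignments: Sequence[int], num_experts: int) -> List[float]:
    """Fraction of inputs routed to each expert."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def imbalance(load: Sequence[float]) -> float:
    """Max load divided by mean load; 1.0 means perfectly uniform."""
    mean = sum(load) / len(load)
    return max(load) / mean


def cross_device_fraction(assignments: Sequence[int],
                          token_device: Sequence[int],
                          expert_device: Sequence[int]) -> float:
    """Fraction of inputs whose assigned expert lives on another device."""
    moved = sum(1 for e, d in zip(assignments, token_device)
                if expert_device[e] != d)
    return moved / len(assignments)


# Toy routing result: 8 inputs, 4 experts, expert 0 is heavily favored.
assignments = [0, 0, 0, 1, 0, 2, 0, 0]
token_device = [0, 0, 0, 0, 1, 1, 1, 1]   # device each input starts on
expert_device = [0, 0, 1, 1]              # device holding each expert

load = expert_load(assignments, num_experts=4)
print(load, imbalance(load))
print(cross_device_fraction(assignments, token_device, expert_device))
```

In this toy case expert 0 takes 6 of the 8 inputs and expert 3 takes none, and 3 of the 8 inputs have to cross devices to reach their expert.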