Mixture of Experts
A gating network selects a subset of the experts for each input, and the output is the weighted sum of the outputs of the selected experts. The weights are also computed by the gating network.
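A minimal sketch of the gating step, to make the weighted sum concrete. It assumes a linear gate, top-k selection, and a softmax over the selected experts only; none of these choices are stated in the notes, and the names `moe_forward`, `gate_weights`, and `k` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, k=2):
    """Route one input vector x through the top-k experts.

    gate_weights: one row of weights per expert (assumed linear gate).
    experts: list of callables, each mapping a vector to a vector.
    Returns (output, selected expert indices, their weights).
    """
    # Gate score for each expert: dot product of its weight row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]

    # The gate "opens" only the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Weights are normalised over the selected experts so they sum to 1.
    weights = softmax([logits[i] for i in top])

    # Only the selected experts are evaluated.
    outputs = [experts[i](x) for i in top]

    # Weighted sum of the selected experts' outputs.
    dim = len(outputs[0])
    combined = [
        sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(dim)
    ]
    return combined, top, weights
```

The experts that are not selected are never called, which is where the compute saving comes from.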
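One problem is load balancing: the gate can assign inputs to the experts non-uniformly, so a few experts receive most of the inputs while the others are underused. Another is communication overhead, which is substantial when the experts are placed on different devices, since each input has to be sent to the device holding its expert and the result sent back.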
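A small sketch of how the non-uniform assignment could be measured. It assumes each input is assigned to a single expert and that the assignments are available as a list of expert indices; the helper names `expert_load` and `imbalance` are hypothetical, and the max-over-uniform ratio is just one simple statistic, not a standard named metric.

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Fraction of inputs routed to each expert."""
    if not assignments:
        raise ValueError("assignments must not be empty")
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def imbalance(load):
    """Busiest expert's share divided by the uniform share.

    1.0 means perfectly balanced; larger values mean one expert
    is handling more than its fair share.
    """
    uniform = 1.0 / len(load)
    return max(load) / uniform


# Eight inputs, four experts, most inputs sent to expert 0.
load = expert_load([0, 0, 0, 1, 0, 2, 0, 0], num_experts=4)
print(load)             # [0.75, 0.125, 0.125, 0.0]
print(imbalance(load))  # 3.0
```

When experts sit on different devices, the busiest expert's device also becomes the slowest one, so the two problems compound each other.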