Learn how Multi-Head Attention lets Transformer models jointly attend to information from different representation subspaces at different positions, with each head learning its own projection of the input.
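
To make the idea concrete, here is a minimal NumPy sketch of multi-head attention. It is an illustration, not a reference implementation: the function name `multi_head_attention`, the weight matrices `Wq`, `Wk`, `Wv`, `Wo`, and the random-weight usage example are all assumptions chosen for this demo, and it omits masking and batching. Each head operates on its own slice of the model dimension, which is exactly the "different representation subspaces" the description refers to.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product attention run in parallel over several heads.

    X:              (seq_len, d_model) input sequence
    Wq, Wk, Wv, Wo: (d_model, d_model) projection weights (illustrative names)
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project the input, then split the model dimension into heads:
    # each head sees its own d_head-sized slice (a representation subspace).
    def split_heads(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ Wq)  # (num_heads, seq_len, d_head)
    K = split_heads(X @ Wk)
    V = split_heads(X @ Wv)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (heads, seq, d_head)

    # Concatenate the heads back along the model dimension and mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Tiny usage example with random weights (hypothetical values, for shape-checking only).
rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 4, 2
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)  # (4, 8)
```

Note the design choice this sketch reflects: the model dimension is split across heads (`d_head = d_model // num_heads`) rather than duplicated, so running several heads in parallel costs roughly the same as one full-width attention while letting each head specialize.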