
Conformer Multi-Headed Self Attention


Bases: Module

Conformer employs multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL, the relative sinusoidal positional encoding scheme. The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the resulting encoder is more robust to variance in utterance length. Conformer uses prenorm residual units with dropout, which helps with training and regularizing deeper models.
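As a rough illustration of the sinusoidal scheme mentioned above, here is a minimal sketch in plain Python of the standard sinusoidal position table. The function name `sinusoidal_encoding` and the base of 10000 are assumptions taken from the common formulation, not from this module; how the `encoding` tensor passed to `forward` is actually built is not shown on this page.

```python
import math
from typing import List


def sinusoidal_encoding(max_len: int, d_model: int) -> List[List[float]]:
    """Return a (max_len, d_model) table of sinusoidal position encodings.

    Hypothetical helper for illustration only. Even columns hold
    sin(pos / 10000^(i / d_model)) and the following odd column holds the
    matching cosine, where i is the even column index.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        table.append(row)
    return table
```

Because each row depends only on the position index, a table built for a long sequence can be sliced down for shorter inputs, which is one reason this kind of encoding copes well with varying utterance lengths.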


Parameters:

Name        Type    Description                      Default
d_model     int     The dimension of the model.      required
num_heads   int     The number of attention heads.   required
dropout_p   float   The dropout probability.         required

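Inputs: query, key, value, mask, encoding
  • query, key, value (batch, time, dim): Tensors containing the query, key, and value vectors
  • mask (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked
  • encoding: Tensor containing the positional encoding, with a leading dimension of 1; it is trimmed to the key length and repeated across the batch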
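To make the `(batch, 1, time2)` mask shape concrete, the following is a small sketch in plain Python that builds a padding mask from per-utterance lengths. The helper name `padding_mask` is hypothetical, and the convention that `True` marks a position to be masked is an assumption; the page only says the mask contains "indices to be masked".

```python
from typing import List


def padding_mask(lengths: List[int], max_len: int) -> List[List[List[bool]]]:
    """Build a nested-list mask of shape (batch, 1, max_len).

    Hypothetical helper for illustration only. Assumes True means
    "mask this position", i.e. the position lies beyond the utterance length.
    """
    return [[[t >= n for t in range(max_len)]] for n in lengths]


# Two utterances of lengths 3 and 1, padded to 4 frames.
mask = padding_mask([3, 1], 4)
```

The singleton middle dimension lets the same row of key positions apply to every query position; a `(batch, time1, time2)` mask would instead give each query position its own row.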


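Returns:

Type                 Description
(batch, time, dim)   Tensor produced by the relative multi-headed self-attention module.

forward also returns, as a second value, the attention tensor produced by the underlying RelativeMultiHeadAttention module.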

Source code in models/tts/delightful_tts/attention/
from typing import Tuple

import torch
from torch import nn
from torch.nn import Module

# RelativeMultiHeadAttention is defined elsewhere in this package; its import path is not shown here.


class ConformerMultiHeadedSelfAttention(Module):
    """Conformer employs multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL,
    the relative sinusoidal positional encoding scheme. The relative positional encoding allows the self-attention
    module to generalize better to different input lengths, and the resulting encoder is more robust to variance in
    utterance length. Conformer uses `prenorm` residual units with dropout, which helps with training
    and regularizing deeper models.

    Args:
        d_model (int): The dimension of the model.
        num_heads (int): The number of attention heads.
        dropout_p (float): The dropout probability.

    Inputs: query, key, value, mask, encoding
        - **query**, **key**, **value** (batch, time, dim): Tensors containing the query, key, and value vectors
        - **mask** (batch, 1, time2) or (batch, time1, time2): Tensor containing indices to be masked
        - **encoding**: Tensor containing the positional encoding, with a leading dimension of 1

    Returns:
        (batch, time, dim): Tensor produced by the relative multi-headed self-attention module,
        together with the attention tensor returned by RelativeMultiHeadAttention.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        dropout_p: float,
    ):
        super().__init__()

        # Initialize the RelativeMultiHeadAttention module with the model dimension and number of attention heads
        self.attention = RelativeMultiHeadAttention(
            d_model=d_model, num_heads=num_heads,
        )
        self.dropout = nn.Dropout(p=dropout_p)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        mask: torch.Tensor,
        encoding: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        batch_size, _, _ = key.size()

        # Trim the "encoding" to the length of key (slicing cannot extend it), then repeat it for each input in the batch
        encoding = encoding[:, : key.shape[1]]
        encoding = encoding.repeat(batch_size, 1, 1)

        # Pass the inputs and the positional encoding through the RelativeMultiHeadAttention layer
        outputs, attn = self.attention(
            query, key, value, pos_embedding=encoding, mask=mask,
        )
        # Apply dropout to the attention outputs
        outputs = self.dropout(outputs)
        return outputs, attn
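
The trim-and-repeat step in `forward` can be sketched without PyTorch as follows. This is only an illustration on nested lists of what `encoding[:, : key.shape[1]]` followed by `encoding.repeat(batch_size, 1, 1)` does to the shape, assuming `encoding` has a leading dimension of 1; the helper name `trim_and_repeat` is hypothetical.

```python
import copy
from typing import List

Tensor3D = List[List[List[float]]]


def trim_and_repeat(encoding: Tensor3D, key_len: int, batch_size: int) -> Tensor3D:
    """Mimic `encoding[:, :key_len].repeat(batch_size, 1, 1)` on nested lists.

    Hypothetical helper for illustration only.
    `encoding` is assumed to have shape (1, max_len, dim).
    """
    # Slicing keeps at most key_len positions; it never pads a shorter encoding.
    trimmed = [positions[:key_len] for positions in encoding]
    # Tile the single trimmed entry once per item in the batch.
    return [copy.deepcopy(row) for _ in range(batch_size) for row in trimmed]


enc = [[[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]]  # shape (1, 3, 2)
out = trim_and_repeat(enc, key_len=2, batch_size=3)  # shape (3, 2, 2)
```

Note that if `key_len` exceeds the number of positions in `encoding`, the slice simply returns everything available, so the encoding needs to be at least as long as the longest key sequence.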