Phoneme Prosody Predictor

PhonemeProsodyPredictor

Bases: Module

A class to define the Phoneme Prosody Predictor.

In linguistics, prosody (/ˈprɒsədi, ˈprɒzədi/) is the study of elements of speech that are not individual phonetic segments (vowels and consonants) but which are properties of syllables and larger units of speech, including linguistic functions such as intonation, stress, and rhythm. Such elements are known as suprasegmentals.

Wikipedia Prosody (linguistics)

This prosody predictor is non-parallel and is inspired by the work of Du et al., 2021. It consists of two blocks, each made up of a `ConvTransposed` layer, a LeakyReLU activation, LayerNorm, and dropout, followed by a linear transformation that projects the result to the bottleneck size.
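
The layer stack described above can be sketched in a self-contained form. `ConvTransposedSketch` is a hypothetical stand-in: it assumes that the project's `ConvTransposed` swaps the time and channel axes, applies a 1-D convolution, and swaps them back. The sizes (`d_model=8`, `kernel_size=3`, `bottleneck_size=4`) and the slope and dropout values are illustrative, not taken from any real configuration, and the masking step of `forward` is left out for brevity.

```python
import torch
from torch import nn


class ConvTransposedSketch(nn.Module):
    """Hypothetical stand-in for ConvTransposed (assumed behaviour, not the project's code)."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int, padding: int):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # [B, T, C] -> [B, C, T] for Conv1d, then back to [B, T, C]
        return self.conv(x.transpose(1, 2)).transpose(1, 2)


def build_predictor_sketch(
    d_model: int = 8,
    kernel_size: int = 3,
    bottleneck_size: int = 4,
    dropout: float = 0.1,
    slope: float = 0.3,
) -> nn.Sequential:
    blocks = []
    for _ in range(2):  # two identical conv blocks, as in the class description
        blocks += [
            ConvTransposedSketch(d_model, d_model, kernel_size, (kernel_size - 1) // 2),
            nn.LeakyReLU(slope),
            nn.LayerNorm(d_model),
            nn.Dropout(dropout),
        ]
    # Final projection from d_model to the bottleneck size
    return nn.Sequential(*blocks, nn.Linear(d_model, bottleneck_size))


model = build_predictor_sketch().eval()
x = torch.randn(2, 5, 8)  # [B, src_len, d_model]
out = model(x)
print(out.shape)  # torch.Size([2, 5, 4])
```

The last dimension of the output is the bottleneck size, while the batch and sequence dimensions pass through unchanged.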

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_config` | `AcousticModelConfigType` | Configuration object with model parameters. | required |
| `phoneme_level` | `bool` | A flag to decide whether to use phoneme level bottleneck size. | required |
| `leaky_relu_slope` | `float` | The negative slope of LeakyReLU activation function. | `LEAKY_RELU_SLOPE` |

Source code in models/tts/delightful_tts/acoustic_model/phoneme_prosody_predictor.py
class PhonemeProsodyPredictor(Module):
    r"""A class to define the Phoneme Prosody Predictor.

    In linguistics, prosody (/ˈprɒsədi, ˈprɒzədi/) is the study of elements of speech that are not individual phonetic segments (vowels and consonants) but which are properties of syllables and larger units of speech, including linguistic functions such as intonation, stress, and rhythm. Such elements are known as suprasegmentals.

    [Wikipedia Prosody (linguistics)](https://en.wikipedia.org/wiki/Prosody_(linguistics))

    This prosody predictor is non-parallel and is inspired by the work of **Du et al., 2021**. It consists of
    two blocks of `ConvTransposed`, LeakyReLU activation, LayerNorm, and dropout layers, followed by a
    linear transformation that projects to the bottleneck size.

    Args:
        model_config (AcousticModelConfigType): Configuration object with model parameters.
        phoneme_level (bool): A flag to decide whether to use phoneme level bottleneck size.
        leaky_relu_slope (float): The negative slope of LeakyReLU activation function.
    """

    def __init__(
        self,
        model_config: AcousticModelConfigType,
        phoneme_level: bool,
        leaky_relu_slope: float = LEAKY_RELU_SLOPE,
    ):
        super().__init__()

        # Get the configuration
        self.d_model = model_config.encoder.n_hidden
        kernel_size = model_config.reference_encoder.predictor_kernel_size
        dropout = model_config.encoder.p_dropout

        # Decide on the bottleneck size based on phoneme level flag
        bottleneck_size = (
            model_config.reference_encoder.bottleneck_size_p
            if phoneme_level
            else model_config.reference_encoder.bottleneck_size_u
        )

        # Define the layers
        self.layers = nn.ModuleList(
            [
                ConvTransposed(
                    self.d_model,
                    self.d_model,
                    kernel_size=kernel_size,
                    padding=(kernel_size - 1) // 2,
                ),
                nn.LeakyReLU(leaky_relu_slope),
                nn.LayerNorm(
                    self.d_model,
                ),
                nn.Dropout(dropout),
                ConvTransposed(
                    self.d_model,
                    self.d_model,
                    kernel_size=kernel_size,
                    padding=(kernel_size - 1) // 2,
                ),
                nn.LeakyReLU(leaky_relu_slope),
                nn.LayerNorm(
                    self.d_model,
                ),
                nn.Dropout(dropout),
            ],
        )

        # Output bottleneck layer
        self.predictor_bottleneck = nn.Linear(
            self.d_model,
            bottleneck_size,
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        r"""Forward pass of the prosody predictor.

        Args:
            x (torch.Tensor): A 3-dimensional tensor `[B, src_len, d_model]`.
            mask (torch.Tensor): A 2-dimensional boolean tensor `[B, src_len]`; positions set to `True` are zeroed.

        Returns:
            torch.Tensor: A 3-dimensional tensor `[B, src_len, bottleneck_size]`.
        """
        # Expand the mask tensor's dimensions from [B, src_len] to [B, src_len, 1]
        mask = mask.unsqueeze(2)

        # Pass the input through the layers
        for layer in self.layers:
            x = layer(x)

        # Apply mask
        x = x.masked_fill(mask, 0.0)

        # Final linear transformation
        return self.predictor_bottleneck(x)
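
The constructor passes `padding=(kernel_size - 1) // 2` to each convolution. A small worked check shows why: under the assumption that `ConvTransposed` wraps a stride-1, dilation-1 1-D convolution (the listing does not confirm this), that padding keeps the sequence length unchanged for odd kernel sizes, which is what lets the `[B, src_len, 1]` mask line up with the output. The helper name `conv1d_out_len` is hypothetical; the formula is the standard 1-D convolution output length.

```python
def conv1d_out_len(
    in_len: int,
    kernel_size: int,
    padding: int,
    stride: int = 1,
    dilation: int = 1,
) -> int:
    """Output length of a 1-D convolution (standard formula)."""
    return (in_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


src_len = 10

# Odd kernel sizes: the length is preserved
odd_lengths = {k: conv1d_out_len(src_len, k, (k - 1) // 2) for k in (3, 5, 7)}
print(odd_lengths)  # {3: 10, 5: 10, 7: 10}

# Even kernel size: one position is lost per convolution
even_length = conv1d_out_len(src_len, 4, (4 - 1) // 2)
print(even_length)  # 9
```

With an even `predictor_kernel_size`, each convolution under this assumption would shorten the sequence by one position, so an odd value is the safer choice.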

forward(x, mask)

Forward pass of the prosody predictor.

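Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `x` | `Tensor` | A 3-dimensional tensor `[B, src_len, d_model]`. | required |
| `mask` | `Tensor` | A 2-dimensional boolean tensor `[B, src_len]`; positions set to `True` are zeroed. | required |

Returns:

| Type | Description |
| --- | --- |
| `Tensor` | A 3-dimensional tensor `[B, src_len, bottleneck_size]`. |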
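
Because `forward` calls `masked_fill(mask, 0.0)`, the mask has to be boolean and `True` at the positions to be zeroed. The sketch below shows one common way to build such a mask from per-item sequence lengths and how `unsqueeze(2)` lets it broadcast over the feature dimension. The helper `padding_mask` and the example lengths are hypothetical and are not part of the documented module.

```python
import torch


def padding_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Boolean mask [B, max_len] that is True at padded positions."""
    positions = torch.arange(max_len, device=lengths.device)
    return positions.unsqueeze(0) >= lengths.unsqueeze(1)


lengths = torch.tensor([3, 5])  # true lengths of two sequences
mask = padding_mask(lengths, max_len=5)

x = torch.ones(2, 5, 4)  # [B, src_len, d_model]
# [B, src_len] -> [B, src_len, 1], broadcast across the last dimension
masked = x.masked_fill(mask.unsqueeze(2), 0.0)

print(mask[0].tolist())  # [False, False, False, True, True]
print(masked[0, 3:].sum().item())  # 0.0
```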

Source code in models/tts/delightful_tts/acoustic_model/phoneme_prosody_predictor.py
def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    r"""Forward pass of the prosody predictor.

    Args:
        x (torch.Tensor): A 3-dimensional tensor `[B, src_len, d_model]`.
        mask (torch.Tensor): A 2-dimensional boolean tensor `[B, src_len]`; positions set to `True` are zeroed.

    Returns:
        torch.Tensor: A 3-dimensional tensor `[B, src_len, bottleneck_size]`.
    """
    # Expand the mask tensor's dimensions from [B, src_len] to [B, src_len, 1]
    mask = mask.unsqueeze(2)

    # Pass the input through the layers
    for layer in self.layers:
        x = layer(x)

    # Apply mask
    x = x.masked_fill(mask, 0.0)

    # Final linear transformation
    return self.predictor_bottleneck(x)
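
One consequence of the order of operations in `forward` is that the mask is applied before the final linear layer, so masked positions come out equal to that layer's bias rather than to zero. The sketch below illustrates this with a standalone `nn.Linear`; the sizes are illustrative, and the second `masked_fill` is only a possible caller-side step, not something the documented module does.

```python
import torch
from torch import nn

torch.manual_seed(0)

linear = nn.Linear(8, 4)  # stands in for predictor_bottleneck
x = torch.randn(1, 3, 8)  # [B, src_len, d_model]
mask = torch.tensor([[False, False, True]])  # last position is padding

# Same order as forward(): zero the masked rows, then project
out = linear(x.masked_fill(mask.unsqueeze(2), 0.0))

# A zeroed input row maps to the bias vector
matches_bias = torch.allclose(out[0, 2], linear.bias)
print(matches_bias)  # True

# Optional caller-side step if exact zeros are needed downstream
out_zeroed = out.masked_fill(mask.unsqueeze(2), 0.0)
```

Whether the extra masking is needed depends on how the caller consumes the padded positions.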