Pitch Adaptor Conv

PitchAdaptorConv

Bases: Module

The PitchAdaptorConv class is the pitch adaptor network of the model. This updated version of the pitch adaptor uses convolutional embeddings for the pitch.

Parameters:

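Name | Type | Description | Default
channels_in | int | Number of input channels for the conv layers. | required
channels_hidden | int | Number of hidden channels in the pitch predictor; also the number of output channels of the pitch embedding. | required
channels_out | int | Number of output channels. | required
kernel_size | int | Size of the kernel for the conv layers. | required
dropout | float | Probability of dropout. | required
leaky_relu_slope | float | Slope for the leaky ReLU. | required
emb_kernel_size | int | Size of the kernel for the pitch embedding. | required
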
Inputs: inputs, target, dr, mask
  • inputs (batch, time1, dim): Tensor containing the input vector
  • target (batch, 1, time2): Tensor containing the pitch target
  • dr (batch, time1): Tensor containing the aligner durations
  • mask (batch, time1): Tensor containing the indices to be masked

Returns:
  • pitch prediction (batch, 1, time1): Tensor produced by the pitch predictor
  • pitch embedding (batch, channels, time1): Tensor produced by the pitch adaptor
  • average pitch target (train only) (batch, 1, time1): Tensor produced by averaging over durations
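
The convolutional pitch embedding described above can be illustrated in isolation. The sketch below is not part of the class; it only reproduces the `nn.Conv1d` configuration from the constructor with illustrative values for `channels_hidden` and `emb_kernel_size` (both values are assumptions). It shows that an odd kernel size keeps the time dimension unchanged, while an even kernel size with the same padding rule shortens it by one step.

```python
import torch
from torch import nn

channels_hidden = 8  # illustrative value, not taken from the project config
emb_kernel_size = 3  # illustrative value; odd, so the sequence length is preserved

# Same layout as the pitch embedding layer: 1 input channel -> channels_hidden.
pitch_emb = nn.Conv1d(
    1,
    channels_hidden,
    kernel_size=emb_kernel_size,
    padding=(emb_kernel_size - 1) // 2,
)

pitch = torch.randn(2, 1, 5)  # [B, 1, T_src]
emb = pitch_emb(pitch)        # [B, channels_hidden, T_src]

# With an even kernel size the same padding rule loses one time step.
even_emb = nn.Conv1d(1, channels_hidden, kernel_size=4, padding=(4 - 1) // 2)
shorter = even_emb(pitch)     # [B, channels_hidden, T_src - 1]
```

Because the embedding is later added to the input after a transpose, an odd `emb_kernel_size` is the safe choice: the time axis of the embedding then matches the time axis of the input.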

Source code in models/tts/delightful_tts/acoustic_model/pitch_adaptor_conv.py
class PitchAdaptorConv(nn.Module):
    """The PitchAdaptorConv class is a pitch adaptor network in the model.
    Updated version of the PitchAdaptorConv uses the conv embeddings for the pitch.

    Args:
        channels_in (int): Number of input channels for the conv layers.
        channels_hidden (int): Number of hidden channels in the pitch predictor;
            also the number of output channels of the pitch embedding.
        channels_out (int): Number of output channels.
        kernel_size (int): Size of the kernel for the conv layers.
        dropout (float): Probability of dropout.
        leaky_relu_slope (float): Slope for the leaky ReLU.
        emb_kernel_size (int): Size of the kernel for the pitch embedding.

    Inputs: inputs, target, dr, mask
        - **inputs** (batch, time1, dim): Tensor containing the input vector
        - **target** (batch, 1, time2): Tensor containing the pitch target
        - **dr** (batch, time1): Tensor containing the aligner durations
        - **mask** (batch, time1): Tensor containing the indices to be masked
    Returns:
        - **pitch prediction** (batch, 1, time1): Tensor produced by the pitch predictor
        - **pitch embedding** (batch, channels, time1): Tensor produced by the pitch adaptor
        - **average pitch target (train only)** (batch, 1, time1): Tensor produced by averaging over durations

    """

    def __init__(
        self,
        channels_in: int,
        channels_hidden: int,
        channels_out: int,
        kernel_size: int,
        dropout: float,
        leaky_relu_slope: float,
        emb_kernel_size: int,
    ):
        super().__init__()
        self.pitch_predictor = VariancePredictor(
            channels_in=channels_in,
            channels=channels_hidden,
            channels_out=channels_out,
            kernel_size=kernel_size,
            p_dropout=dropout,
            leaky_relu_slope=leaky_relu_slope,
        )
        self.pitch_emb = nn.Conv1d(
            1,
            channels_hidden,
            kernel_size=emb_kernel_size,
            padding=int((emb_kernel_size - 1) / 2),
        )

    def get_pitch_embedding_train(
        self,
        x: torch.Tensor,
        target: torch.Tensor,
        dr: torch.Tensor,
        mask: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        r"""Function is used during training to get the pitch prediction, average pitch target,
        and pitch embedding.

        Args:
            x (torch.Tensor): A 3D tensor of shape [B, T_src, C] where B is the batch size,
                            T_src is the source sequence length, and C is the number of channels.
            target (torch.Tensor): A 3D tensor of shape [B, 1, T_max2] where B is the batch size,
                                T_max2 is the maximum target sequence length.
            dr (torch.Tensor): A 2D tensor of shape [B, T_src] where B is the batch size,
                                T_src is the source sequence length. The values represent the durations.
            mask (torch.Tensor): A 2D tensor of shape [B, T_src] where B is the batch size,
                                T_src is the source sequence length. The values represent the mask.

        Returns:
            pitch_pred (torch.Tensor): A 3D tensor of shape [B, 1, T_src] where B is the batch size,
                                        T_src is the source sequence length. The values represent the pitch prediction.
            avg_pitch_target (torch.Tensor): A 3D tensor of shape [B, 1, T_src] where B is the batch size,
                                            T_src is the source sequence length. The values represent the average pitch target.
            pitch_emb (torch.Tensor): A 3D tensor of shape [B, C, T_src] where B is the batch size,
                                    C is the number of channels, T_src is the source sequence length. The values represent the pitch embedding.
        Shapes:
            x: :math:`[B, T_src, C]`
            target: :math:`[B, 1, T_max2]`
            dr: :math:`[B, T_src]`
            mask: :math:`[B, T_src]`
        """
        pitch_pred = self.pitch_predictor.forward(x, mask)
        pitch_pred = pitch_pred.unsqueeze(1)

        avg_pitch_target = average_over_durations(target, dr)
        pitch_emb = self.pitch_emb(avg_pitch_target)

        return pitch_pred, avg_pitch_target, pitch_emb

    def add_pitch_embedding_train(
        self,
        x: torch.Tensor,
        target: torch.Tensor,
        dr: torch.Tensor,
        mask: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        r"""Add pitch embedding during training.

        This method calculates the pitch embedding and adds it to the input tensor 'x'.
        It also returns the predicted pitch and the average target pitch.

        Args:
            x (torch.Tensor): The input tensor to which the pitch embedding will be added.
            target (torch.Tensor): The target tensor used in the pitch embedding calculation.
            dr (torch.Tensor): The duration tensor used in the pitch embedding calculation.
            mask (torch.Tensor): The mask tensor used in the pitch embedding calculation.

        Returns:
            x (torch.Tensor): The input tensor with added pitch embedding.
            pitch_pred (torch.Tensor): The predicted pitch tensor.
            avg_pitch_target (torch.Tensor): The average target pitch tensor.
        """
        pitch_pred, avg_pitch_target, pitch_emb = self.get_pitch_embedding_train(
            x=x,
            target=target.unsqueeze(1),
            dr=dr,
            mask=mask,
        )
        x_pitch = x + pitch_emb.transpose(1, 2)
        return x_pitch, pitch_pred, avg_pitch_target

    def get_pitch_embedding(
        self,
        x: torch.Tensor,
        mask: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        r"""Function is used during inference to get the pitch embedding and pitch prediction.

        Args:
            x (torch.Tensor): A 3D tensor of shape [B, T_src, C] where B is the batch size,
                            T_src is the source sequence length, and C is the number of channels.
            mask (torch.Tensor): A 2D tensor of shape [B, T_src] where B is the batch size,
                                T_src is the source sequence length. The values represent the mask.

        Returns:
            pitch_emb_pred (torch.Tensor): A 3D tensor of shape [B, C, T_src] where B is the batch size,
                                            C is the number of channels, T_src is the source sequence length. The values represent the pitch embedding.
            pitch_pred (torch.Tensor): A 3D tensor of shape [B, 1, T_src] where B is the batch size,
                                        T_src is the source sequence length. The values represent the pitch prediction.
        """
        pitch_pred = self.pitch_predictor.forward(x, mask)
        pitch_pred = pitch_pred.unsqueeze(1)

        pitch_emb_pred = self.pitch_emb(pitch_pred)
        return pitch_emb_pred, pitch_pred

    def add_pitch_embedding(
        self,
        x: torch.Tensor,
        mask: torch.Tensor,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        r"""Add pitch embedding during inference.

        This method calculates the pitch embedding and adds it to the input tensor 'x'.
        It also returns the predicted pitch.

        Args:
            x (torch.Tensor): The input tensor to which the pitch embedding will be added.
            mask (torch.Tensor): The mask tensor used in the pitch embedding calculation.

        Returns:
            x (torch.Tensor): The input tensor with added pitch embedding.
            pitch_pred (torch.Tensor): The predicted pitch tensor.
        """
        pitch_emb_pred, pitch_pred = self.get_pitch_embedding(x, mask)
        x_pitch = x + pitch_emb_pred.transpose(1, 2)
        return x_pitch, pitch_pred

add_pitch_embedding(x, mask)

Add pitch embedding during inference.

This method calculates the pitch embedding and adds it to the input tensor 'x'. It also returns the predicted pitch.

Parameters:

Name | Type | Description | Default
x | Tensor | The input tensor to which the pitch embedding will be added. | required
mask | Tensor | The mask tensor used in the pitch embedding calculation. | required

Returns:

Name | Type | Description
x | Tensor | The input tensor with added pitch embedding.
pitch_pred | Tensor | The predicted pitch tensor.
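
The inference path described above (predict the pitch, embed the prediction, add the embedding to the input) can be sketched without the project code. `StandInPredictor` below is a hypothetical replacement for `VariancePredictor`, whose implementation is not shown on this page; the mask convention (`True` marks padded positions) and all sizes are assumptions made for the example.

```python
import torch
from torch import nn


class StandInPredictor(nn.Module):
    """Hypothetical stand-in for VariancePredictor: [B, T_src, C] -> [B, T_src]."""

    def __init__(self, channels_in: int):
        super().__init__()
        self.proj = nn.Linear(channels_in, 1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        out = self.proj(x).squeeze(-1)
        # Assumption: True in the mask marks padded positions, which are zeroed.
        return out.masked_fill(mask, 0.0)


B, T_src, C = 2, 6, 8  # illustrative sizes
predictor = StandInPredictor(C)
# The embedding must produce C channels so it can be added to x.
pitch_emb = nn.Conv1d(1, C, kernel_size=3, padding=1)

x = torch.randn(B, T_src, C)
mask = torch.zeros(B, T_src, dtype=torch.bool)
mask[1, 4:] = True  # the second sequence is two steps shorter

with torch.no_grad():
    pitch_pred = predictor(x, mask).unsqueeze(1)   # [B, 1, T_src]
    pitch_emb_pred = pitch_emb(pitch_pred)         # [B, C, T_src]
    x_pitch = x + pitch_emb_pred.transpose(1, 2)   # [B, T_src, C]
```

The residual addition only works when the embedding has as many channels as the last dimension of `x`, which in the class means `channels_hidden` has to match the channel dimension of the input.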

Source code in models/tts/delightful_tts/acoustic_model/pitch_adaptor_conv.py
def add_pitch_embedding(
    self,
    x: torch.Tensor,
    mask: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    r"""Add pitch embedding during inference.

    This method calculates the pitch embedding and adds it to the input tensor 'x'.
    It also returns the predicted pitch.

    Args:
        x (torch.Tensor): The input tensor to which the pitch embedding will be added.
        mask (torch.Tensor): The mask tensor used in the pitch embedding calculation.

    Returns:
        x (torch.Tensor): The input tensor with added pitch embedding.
        pitch_pred (torch.Tensor): The predicted pitch tensor.
    """
    pitch_emb_pred, pitch_pred = self.get_pitch_embedding(x, mask)
    x_pitch = x + pitch_emb_pred.transpose(1, 2)
    return x_pitch, pitch_pred

add_pitch_embedding_train(x, target, dr, mask)

Add pitch embedding during training.

This method calculates the pitch embedding and adds it to the input tensor 'x'. It also returns the predicted pitch and the average target pitch.

Parameters:

Name | Type | Description | Default
x | Tensor | The input tensor to which the pitch embedding will be added. | required
target | Tensor | The target tensor used in the pitch embedding calculation. | required
dr | Tensor | The duration tensor used in the pitch embedding calculation. | required
mask | Tensor | The mask tensor used in the pitch embedding calculation. | required

Returns:

Name | Type | Description
x | Tensor | The input tensor with added pitch embedding.
pitch_pred | Tensor | The predicted pitch tensor.
avg_pitch_target | Tensor | The average target pitch tensor.
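
One consequence of the training path is worth making explicit: the embedding added to `x` is computed from the averaged target, not from the prediction, so `pitch_pred` is returned alongside it but does not influence `x_pitch`. The sketch below illustrates this with a hypothetical linear layer standing in for `VariancePredictor` and a random tensor standing in for an already averaged pitch target; both are assumptions for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, T_src, C = 2, 5, 8  # illustrative sizes

predictor = nn.Linear(C, 1)  # hypothetical stand-in for VariancePredictor
pitch_emb = nn.Conv1d(1, C, kernel_size=3, padding=1)

x = torch.randn(B, T_src, C)
# Assumption: the target has already been averaged over durations.
avg_pitch_target = torch.randn(B, 1, T_src)

# The prediction is computed and returned, but not embedded during training.
pitch_pred = predictor(x).squeeze(-1).unsqueeze(1)         # [B, 1, T_src]
x_pitch = x + pitch_emb(avg_pitch_target).transpose(1, 2)  # [B, T_src, C]

# Gradients from x_pitch reach the embedding layer but not the predictor.
x_pitch.sum().backward()
predictor_has_grad = predictor.weight.grad is not None
embedding_has_grad = pitch_emb.weight.grad is not None
```

In this sketch the predictor receives no gradient through `x_pitch`; it can only be trained through a term that uses `pitch_pred` directly.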

Source code in models/tts/delightful_tts/acoustic_model/pitch_adaptor_conv.py
def add_pitch_embedding_train(
    self,
    x: torch.Tensor,
    target: torch.Tensor,
    dr: torch.Tensor,
    mask: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    r"""Add pitch embedding during training.

    This method calculates the pitch embedding and adds it to the input tensor 'x'.
    It also returns the predicted pitch and the average target pitch.

    Args:
        x (torch.Tensor): The input tensor to which the pitch embedding will be added.
        target (torch.Tensor): The target tensor used in the pitch embedding calculation.
        dr (torch.Tensor): The duration tensor used in the pitch embedding calculation.
        mask (torch.Tensor): The mask tensor used in the pitch embedding calculation.

    Returns:
        x (torch.Tensor): The input tensor with added pitch embedding.
        pitch_pred (torch.Tensor): The predicted pitch tensor.
        avg_pitch_target (torch.Tensor): The average target pitch tensor.
    """
    pitch_pred, avg_pitch_target, pitch_emb = self.get_pitch_embedding_train(
        x=x,
        target=target.unsqueeze(1),
        dr=dr,
        mask=mask,
    )
    x_pitch = x + pitch_emb.transpose(1, 2)
    return x_pitch, pitch_pred, avg_pitch_target

get_pitch_embedding(x, mask)

Function is used during inference to get the pitch embedding and pitch prediction.

Parameters:

Name | Type | Description | Default
x | Tensor | A 3D tensor of shape [B, T_src, C] where B is the batch size, T_src is the source sequence length, and C is the number of channels. | required
mask | Tensor | A 2D tensor of shape [B, T_src] where B is the batch size and T_src is the source sequence length. The values represent the mask. | required

Returns:

Name | Type | Description
pitch_emb_pred | Tensor | A 3D tensor of shape [B, C, T_src] where B is the batch size, C is the number of channels, and T_src is the source sequence length. The values represent the pitch embedding.
pitch_pred | Tensor | A 3D tensor of shape [B, 1, T_src] where B is the batch size and T_src is the source sequence length. The values represent the pitch prediction.

Source code in models/tts/delightful_tts/acoustic_model/pitch_adaptor_conv.py
def get_pitch_embedding(
    self,
    x: torch.Tensor,
    mask: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    r"""Function is used during inference to get the pitch embedding and pitch prediction.

    Args:
        x (torch.Tensor): A 3D tensor of shape [B, T_src, C] where B is the batch size,
                        T_src is the source sequence length, and C is the number of channels.
        mask (torch.Tensor): A 2D tensor of shape [B, T_src] where B is the batch size,
                            T_src is the source sequence length. The values represent the mask.

    Returns:
        pitch_emb_pred (torch.Tensor): A 3D tensor of shape [B, C, T_src] where B is the batch size,
                                        C is the number of channels, T_src is the source sequence length. The values represent the pitch embedding.
        pitch_pred (torch.Tensor): A 3D tensor of shape [B, 1, T_src] where B is the batch size,
                                    T_src is the source sequence length. The values represent the pitch prediction.
    """
    pitch_pred = self.pitch_predictor.forward(x, mask)
    pitch_pred = pitch_pred.unsqueeze(1)

    pitch_emb_pred = self.pitch_emb(pitch_pred)
    return pitch_emb_pred, pitch_pred

get_pitch_embedding_train(x, target, dr, mask)

Function is used during training to get the pitch prediction, average pitch target, and pitch embedding.

Parameters:

Name | Type | Description | Default
x | Tensor | A 3D tensor of shape [B, T_src, C] where B is the batch size, T_src is the source sequence length, and C is the number of channels. | required
target | Tensor | A 3D tensor of shape [B, 1, T_max2] where B is the batch size and T_max2 is the maximum target sequence length. | required
dr | Tensor | A 2D tensor of shape [B, T_src] where B is the batch size and T_src is the source sequence length. The values represent the durations. | required
mask | Tensor | A 2D tensor of shape [B, T_src] where B is the batch size and T_src is the source sequence length. The values represent the mask. | required

Returns:

Name | Type | Description
pitch_pred | Tensor | A 3D tensor of shape [B, 1, T_src] where B is the batch size and T_src is the source sequence length. The values represent the pitch prediction.
avg_pitch_target | Tensor | A 3D tensor of shape [B, 1, T_src] where B is the batch size and T_src is the source sequence length. The values represent the average pitch target.
pitch_emb | Tensor | A 3D tensor of shape [B, C, T_src] where B is the batch size, C is the number of channels, and T_src is the source sequence length. The values represent the pitch embedding.

Shapes:
  • x: [B, T_src, C]
  • target: [B, 1, T_max2]
  • dr: [B, T_src]
  • mask: [B, T_src]
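
The step that turns a frame-level target of shape [B, 1, T_max2] into a token-level target of shape [B, 1, T_src] is delegated to `average_over_durations`, whose implementation is not shown on this page. The function below is a hypothetical, loop-based sketch of the idea only. It assumes that durations are integer frame counts, that each token's value is the plain mean of its frames, and that zero-duration tokens get 0; the project helper may differ, for example in how it treats unvoiced frames.

```python
import torch


def average_over_durations_sketch(values: torch.Tensor, durs: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: average frame-level values over each token's duration.

    values: [B, 1, T_frames] frame-level pitch.
    durs:   [B, T_src] integer number of frames per token.
    Returns a [B, 1, T_src] tensor of per-token means.
    """
    batch, t_src = durs.shape
    out = torch.zeros(batch, 1, t_src, dtype=values.dtype)
    for b in range(batch):
        start = 0
        for t in range(t_src):
            d = int(durs[b, t])
            if d > 0:
                # Mean over the frames aligned to token t.
                out[b, 0, t] = values[b, 0, start:start + d].mean()
            start += d
    return out


values = torch.tensor([[[1.0, 3.0, 5.0, 7.0, 9.0]]])  # [1, 1, 5]
durs = torch.tensor([[2, 0, 3]])                      # [1, 3]
avg = average_over_durations_sketch(values, durs)     # [1, 1, 3]
```

Here the first token covers frames 1.0 and 3.0, the second token has no frames, and the third covers 5.0, 7.0, and 9.0, so the result has one value per source token.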

Source code in models/tts/delightful_tts/acoustic_model/pitch_adaptor_conv.py
def get_pitch_embedding_train(
    self,
    x: torch.Tensor,
    target: torch.Tensor,
    dr: torch.Tensor,
    mask: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    r"""Function is used during training to get the pitch prediction, average pitch target,
    and pitch embedding.

    Args:
        x (torch.Tensor): A 3D tensor of shape [B, T_src, C] where B is the batch size,
                        T_src is the source sequence length, and C is the number of channels.
        target (torch.Tensor): A 3D tensor of shape [B, 1, T_max2] where B is the batch size,
                            T_max2 is the maximum target sequence length.
        dr (torch.Tensor): A 2D tensor of shape [B, T_src] where B is the batch size,
                            T_src is the source sequence length. The values represent the durations.
        mask (torch.Tensor): A 2D tensor of shape [B, T_src] where B is the batch size,
                            T_src is the source sequence length. The values represent the mask.

    Returns:
        pitch_pred (torch.Tensor): A 3D tensor of shape [B, 1, T_src] where B is the batch size,
                                    T_src is the source sequence length. The values represent the pitch prediction.
        avg_pitch_target (torch.Tensor): A 3D tensor of shape [B, 1, T_src] where B is the batch size,
                                        T_src is the source sequence length. The values represent the average pitch target.
        pitch_emb (torch.Tensor): A 3D tensor of shape [B, C, T_src] where B is the batch size,
                                C is the number of channels, T_src is the source sequence length. The values represent the pitch embedding.
    Shapes:
        x: :math:`[B, T_src, C]`
        target: :math:`[B, 1, T_max2]`
        dr: :math:`[B, T_src]`
        mask: :math:`[B, T_src]`
    """
    pitch_pred = self.pitch_predictor.forward(x, mask)
    pitch_pred = pitch_pred.unsqueeze(1)

    avg_pitch_target = average_over_durations(target, dr)
    pitch_emb = self.pitch_emb(avg_pitch_target)

    return pitch_pred, avg_pitch_target, pitch_emb