Transformer

Outline

  • Brief Artificial Neural Network
  • Brief Recurrent Neural Network
  • Brief Long Short-Term Memory (LSTM)
  • Transformer

Brief Artificial Neural Network

Figure: Artificial Neural Network Analogy (source)

Figure: Simple Artificial Neural Network Analogy (source, source, source)

# A minimal single-neuron network trained with gradient descent on a toy dataset
from numpy import exp, array, random, dot

training_set_inputs = array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
training_set_outputs = array([[0, 1, 1, 0]]).T
random.seed(1)
synaptic_weights = 2 * random.random((3, 1)) - 1  # random weights in [-1, 1)
for iteration in range(10000):
    # Forward pass: sigmoid activation of the weighted inputs
    output = 1 / (1 + exp(-(dot(training_set_inputs, synaptic_weights))))
    # Weight update: error scaled by the sigmoid gradient
    synaptic_weights += dot(training_set_inputs.T, (training_set_outputs - output) * output * (1 - output))
# Predict the output for a new, unseen input
print(1 / (1 + exp(-(dot(array([1, 0, 0]), synaptic_weights)))))
Python
import tensorflow as tf

# Load and normalize the MNIST digit dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A simple feed-forward network for digit classification
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
Python

Brief Recurrent Neural Network

Figure: RNN Architecture (source)

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.Embedding(input_dim=1000, output_dim=64))
model.add(layers.SimpleRNN(128))  # recurrent layer that processes the sequence step by step
model.add(layers.Dense(10))

loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])

# x_train: padded integer token sequences, y_train: class labels
model.fit(x_train, y_train, epochs=5)
Python

Brief Long Short-Term Memory (LSTM)

Figure: LSTM Architecture (source)

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.Embedding(input_dim=1000, output_dim=64))
model.add(layers.LSTM(128))  # LSTM layer with 128 hidden units
model.add(layers.Dense(10))

loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])

# x_train: padded integer token sequences, y_train: class labels
model.fit(x_train, y_train, epochs=5)
Python

Transformer

Transformers can process language in parallel rather than sequentially, allowing them to handle vast amounts of data and capture complex relationships in language — all without losing track of long-range dependencies.

Transformer Tasks: Language Understanding, Machine Translation, Text Generation

Transformer Configurations: encoder-only, decoder-only, and encoder-decoder models

Encoder-Only Models (e.g., BERT)

  • Encoder-only models are designed primarily for understanding and representation tasks, where generating new text isn’t required. These models excel at capturing intricate relationships within the text by learning high-quality embeddings of language data. Encoder-only models process input bidirectionally, allowing them to understand the context around each token by looking both forward and backward within a sentence.
  • Strengths: Exceptional at language comprehension and representation learning. The bidirectional processing enables the model to discern complex relationships between words, providing a deeper understanding of context.
  • Applications: Ideal for tasks that require analyzing or classifying text, such as:
      • Sentence Classification: Determining the sentiment, topic, or intent of a sentence.
      • Sentiment Analysis: Evaluating whether a piece of text expresses positive, negative, or neutral sentiments.
      • Named Entity Recognition (NER): Identifying entities like people, organizations, and locations within the text.
      • Question Answering: Finding relevant answers to questions based on text input.
  • Real-World Example: BERT (Bidirectional Encoder Representations from Transformers) is widely used in search engines, helping improve relevance by analyzing the context of user queries to deliver more accurate results.
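
As a minimal illustration of an encoder-only model in practice, the sketch below uses the Hugging Face transformers library (an assumption here, not part of the original material) to let a pretrained BERT fill in a masked word from bidirectional context:

from transformers import pipeline

# BERT's bidirectional encoder predicts the masked token from context on both sides
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
Python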

Decoder-Only Models (e.g., GPT Series)

  • Decoder-only models, like the GPT (Generative Pre-trained Transformer) series, are optimized for generative tasks. Unlike encoder-only models, decoder-only architectures are unidirectional, processing tokens sequentially from left to right. This design is ideal for tasks that require predicting the next word or token in a sequence, ensuring that generated text is coherent and contextually appropriate.
  • Strengths: Built specifically for text generation, these models excel at creating fluid, logical continuations of a prompt. They are particularly effective for tasks that need sequential prediction, which is key to producing natural, flowing language.
  • Applications: Commonly used in tasks that require generating new text, such as:
      • Conversational AI: Powering chatbots and virtual assistants with human-like dialogue.
      • Creative Writing: Assisting in writing stories, poems, or other forms of creative content.
      • Content Generation: Generating articles, social media posts, or summaries based on a brief or prompt.
      • Code Generation: Converting natural language descriptions into executable code.
  • Real-World Example: GPT models, like ChatGPT, are widely used in chatbots and virtual assistants to produce contextually relevant responses that align with the conversation history, delivering an engaging user experience.
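
As a hedged counterpart for decoder-only models, the sketch below uses the Hugging Face transformers library (assumed, not part of the original material) to generate a left-to-right continuation with GPT-2:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
# The decoder predicts one token at a time, conditioned only on the tokens before it
print(generator("Once upon a time", max_new_tokens=30)[0]["generated_text"])
Python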

Encoder-Decoder Models (e.g., T5)

  • Encoder-decoder models combine both encoder and decoder stacks, making them exceptionally versatile. This architecture is designed for tasks where an input sequence needs to be transformed into a specific output sequence, such as translation, summarization, or paraphrasing. In these models, the encoder processes the input to create a rich representation, which is then passed to the decoder to generate the desired output.
  • Strengths: Versatile and flexible, encoder-decoder models can handle a variety of tasks that require both understanding the input context and generating coherent output. This setup makes them robust for tasks that involve complex input-output mappings.
  • Applications: Suitable for a range of NLP applications, including:
      • Machine Translation: Translating text from one language to another (e.g., English to French).
      • Text Summarization: Condensing lengthy articles or documents into shorter summaries.
      • Text-to-Text Transformations: Tasks like rephrasing sentences, generating questions from statements, or filling in missing parts of a text.
  • Real-World Example: T5 (Text-To-Text Transfer Transformer) is renowned for its ability to handle multiple NLP tasks within a single unified framework. By casting every problem as a text-to-text transformation, T5 can seamlessly switch between tasks, making it a versatile tool for diverse NLP applications.
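
A hedged sketch of an encoder-decoder model at work, again via the Hugging Face transformers library (assumed, not part of the original material): T5 casts translation as a text-to-text problem, with the encoder reading the full input and the decoder generating the output sequence.

from transformers import pipeline

translator = pipeline("text2text-generation", model="t5-small")
print(translator("translate English to French: The house is wonderful.")[0]["generated_text"])
Python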

How Transformers Work

  • Tokenization and Embedding
  • Adding Positional Encodings
  • Self-Attention Computation (see the sketch after this list)
  • Multi-Head Attention and Feed-Forward Networks (FFNs)
  • Stacking Layers
  • Output Generation
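
The heart of the self-attention step above is scaled dot-product attention: each token's query vector is scored against every token's key vector, the scores are normalized with a softmax, and the resulting weights mix the value vectors. Below is a minimal NumPy sketch of the formula with toy numbers (not the masked, multi-head version used later):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy self-attention for a 3-token sentence with d_k = 4; random vectors stand in
# for the learned query/key/value projections of real tokens
np.random.seed(0)
Q = np.random.randn(3, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(K.shape[-1])   # (3, 3): every token scored against every token
weights = softmax(scores)                 # each row sums to 1
output = weights @ V                      # (3, 4): context-mixed representation per token
print(weights.round(2))
Python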

Architecture

Embedding

Each word is embedded into a vector of size 512; in the illustration below, each of those vectors is drawn as a simple box.

Figure: Embedding Illustration (source)
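
As a small hedged example (the vocabulary size of 8,000 is an assumed toy value), a Keras Embedding layer maps token ids to 512-dimensional vectors, matching the d_model = 512 of the original Transformer:

import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=8000, output_dim=512)  # 8,000 token ids -> 512-d vectors
token_ids = tf.constant([[12, 57, 3019]])  # a toy three-token sentence
vectors = embedding(token_ids)
print(vectors.shape)                       # (1, 3, 512)
Python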


Architecture – Self Attention

“The animal didn’t cross the street because it was too tired”

When encoding the word “it”, self-attention lets the model weigh every other word in the sentence and associate “it” with “the animal” rather than with “the street”.

Transformer – Self-Attention – Multi-Head Attention
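
A brief hedged sketch of the multi-head idea using Keras' built-in MultiHeadAttention layer (the code further below implements its own MultiHeadAttention class instead): d_model = 512 is split into 8 heads of 64 dimensions each, so each head can attend to different relations in the sentence.

import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)  # 8 heads of 64 dimensions
x = tf.random.normal((1, 10, 512))                                 # one sentence of 10 tokens
out, attn_scores = mha(query=x, value=x, key=x, return_attention_scores=True)
print(out.shape, attn_scores.shape)  # (1, 10, 512) and (1, 8, 10, 10): one attention map per head
Python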

Positional Encoding
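
Self-attention by itself is order-agnostic, so positional information must be added to the token embeddings. Below is a hedged sketch of the sinusoidal positional encoding from the original Transformer, following the approach of the TensorFlow tutorial that the layer code in this section is based on:

import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
  # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
  # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
  angle_rates = 1 / np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model))
  angle_rads = np.arange(position)[:, np.newaxis] * angle_rates
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # even dimensions use sine
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # odd dimensions use cosine
  return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)  # (1, position, d_model)

print(positional_encoding(50, 512).shape)  # (1, 50, 512), added to the embedded input
Python

The encoder and decoder layers below combine multi-head attention, a position-wise feed-forward network, dropout, residual connections, and layer normalization: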

class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(EncoderLayer, self).__init__()

    self.mha = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

    ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
    ffn_output = self.dropout2(ffn_output, training=training)
    out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

    return out2
Python
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(DecoderLayer, self).__init__()

    self.mha1 = MultiHeadAttention(d_model, num_heads)
    self.mha2 = MultiHeadAttention(d_model, num_heads)

    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)


  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):
    # enc_output.shape == (batch_size, input_seq_len, d_model)

    attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)

    attn2, attn_weights_block2 = self.mha2(
        enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

    return out3, attn_weights_block1, attn_weights_block2
Python
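
For completeness, here is a hedged sketch of the two helpers the encoder and decoder layers above rely on; this code appears to follow the TensorFlow Transformer tutorial, so treat the exact definitions below as an assumption. The feed-forward network is two dense layers applied at every position, and MultiHeadAttention projects queries, keys, and values, splits them into heads, applies masked scaled dot-product attention, and recombines the heads:

import tensorflow as tf

def point_wise_feed_forward_network(d_model, dff):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch, seq_len, dff)
      tf.keras.layers.Dense(d_model)                  # (batch, seq_len, d_model)
  ])

def scaled_dot_product_attention(q, k, v, mask):
  matmul_qk = tf.matmul(q, k, transpose_b=True)       # (..., seq_len_q, seq_len_k)
  dk = tf.cast(tf.shape(k)[-1], tf.float32)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
  if mask is not None:
    scaled_attention_logits += (mask * -1e9)          # masked positions get near-zero weight
  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
  return tf.matmul(attention_weights, v), attention_weights

class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadAttention, self).__init__()
    assert d_model % num_heads == 0
    self.num_heads = num_heads
    self.d_model = d_model
    self.depth = d_model // num_heads
    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)
    self.dense = tf.keras.layers.Dense(d_model)

  def split_heads(self, x, batch_size):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
    return tf.transpose(x, perm=[0, 2, 1, 3])

  def call(self, v, k, q, mask):
    batch_size = tf.shape(q)[0]
    q = self.split_heads(self.wq(q), batch_size)
    k = self.split_heads(self.wk(k), batch_size)
    v = self.split_heads(self.wv(v), batch_size)
    scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch, seq_len, heads, depth)
    concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
    return self.dense(concat_attention), attention_weights
Python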
class Transformer(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               target_vocab_size, pe_input, pe_target, rate=0.1):
    super(Transformer, self).__init__()

    self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                           input_vocab_size, pe_input, rate)

    self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                           target_vocab_size, pe_target, rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inp, tar, training, enc_padding_mask,
           look_ahead_mask, dec_padding_mask):

    enc_output = self.encoder(x=inp, training=training, mask=enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

    # dec_output.shape == (batch_size, tar_seq_len, d_model)
    dec_output, attention_weights = self.decoder(
        x=tar, enc_output=enc_output, training=training, look_ahead_mask=look_ahead_mask, padding_mask=dec_padding_mask)

    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

    return final_output, attention_weights
Python
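
A hedged usage sketch for the Transformer model above, using small illustrative hyperparameters and random token ids (the numbers, the None masks, and the data are assumptions for a shape check, not values from the original material). The Encoder and Decoder classes it relies on stack the layers defined earlier together with embeddings and positional encodings; they come from the same tutorial and are not reproduced here.

import tensorflow as tf

sample_transformer = Transformer(
    num_layers=2, d_model=128, num_heads=8, dff=512,
    input_vocab_size=8500, target_vocab_size=8000,
    pe_input=10000, pe_target=6000)

temp_input = tf.random.uniform((64, 38), dtype=tf.int64, minval=0, maxval=200)
temp_target = tf.random.uniform((64, 36), dtype=tf.int64, minval=0, maxval=200)

# None masks are enough for a shape check; real training builds padding and
# look-ahead masks from the token ids
fn_out, _ = sample_transformer(temp_input, temp_target, training=False,
                               enc_padding_mask=None,
                               look_ahead_mask=None,
                               dec_padding_mask=None)

print(fn_out.shape)  # (64, 36, 8000): a logit over the target vocabulary at each position
Python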

Transformer Visualization

Future Work

  • Sparse Attention Mechanisms: real-time language translation and document summarization with extended context.
  • Linear Transformers: time-sensitive applications like live speech-to-text transcription and financial forecasting.
  • Dynamic and Selective Sparsity in Transformers: optimizing chatbot response times and adaptive language models for dynamic dialogues.
  • Multi-Modal Transformers: augmented reality systems that process visual and auditory cues simultaneously, enhancing user interaction.
  • Energy-Efficient and Hardware-Optimized Transformers: practical NLP applications on mobile devices and real-time IoT sensors for environmental monitoring.
