Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. The fine-tuned models, called Llama-2-Chat, are optimized for dialogue use cases.
How do large language models work?[15]
LLMs take a complex approach that involves multiple components.
At the foundational layer, an LLM needs to be trained on a large volume of data, sometimes referred to as a corpus, that is typically petabytes in size. Training can take multiple steps, usually starting with an unsupervised learning approach, in which the model is trained on unstructured, unlabeled data. The benefit of training on unlabeled data is that there is often vastly more of it available. At this stage, the model begins to derive relationships between different words and concepts.
The next step for some LLMs is training and fine-tuning with a form of self-supervised learning. Here, some data labeling has occurred, helping the model identify different concepts more accurately.
Next, the LLM undertakes deep learning as it goes through the transformer neural network process. The transformer architecture enables the LLM to understand and recognize the relationships and connections between words and concepts using a self-attention mechanism. That mechanism assigns a score, commonly referred to as a weight, to a given item (called a token) in order to determine the relationship.
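To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. It is a toy illustration of the mechanism described above, not the actual Llama implementation (which adds learned projections, multiple attention heads, and other refinements).

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    # queries, keys, and values are all taken directly from the input here;
    # a real transformer layer would first apply learned linear projections
    q, k, v = x, x, x
    # score every token against every other token, scaled by sqrt(d)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    # softmax turns the scores into attention weights that sum to 1 per token
    weights = F.softmax(scores, dim=-1)
    # each output is a weighted sum of the value vectors
    return weights @ v

x = torch.randn(1, 4, 8)        # toy batch: 4 tokens, 8-dimensional embeddings
print(self_attention(x).shape)  # torch.Size([1, 4, 8])
```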
Once an LLM has been trained, a base exists on which the AI can be used for practical purposes. By querying the LLM with a prompt, the model can run inference and generate a response, which could be an answer to a question, newly generated text, a summary, or a sentiment analysis.
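As a quick illustration of inference, the Hugging Face pipeline API can be used to query trained models for such tasks. The snippet below relies on the pipelines' default models (downloaded on first use) and is only meant to show the kinds of outputs mentioned above.

```python
from transformers import pipeline

# sentiment analysis with the pipeline's default model
sentiment = pipeline("sentiment-analysis")
print(sentiment("I really enjoyed working with this model."))

# summarization with the pipeline's default model
summarizer = pipeline("summarization")
article = ("Large language models are trained on huge text corpora. Once trained, "
           "they can answer questions, generate new text, summarize documents, "
           "and estimate sentiment.")
print(summarizer(article, max_length=25, min_length=5))
```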
Different Kinds of LLMs[26]
LLMs can be broadly classified into 2 types depending on their task:
- Continuing the text
- Dialogue optimized
1. Continuing the Text
These LLMs are trained to predict the next sequence of words in the input text. Their task at hand is to continue the text.
For example, given the text “How are you”, these LLMs might complete the sentence with “How are you doing?” or “How are you? I am fine.”
LLMs falling under this category include Transformer-based models such as BERT, XLNet, GPT, and GPT variants like GPT-2, GPT-3, and GPT-4.
Now, the problem with these LLMs is that they are very good at completing text but not necessarily at answering; sometimes we expect an answer rather than a completion.
As discussed above, given “How are you?” as input, such an LLM tries to complete the text with “doing?” or “I am fine.” The response can be either a completion or an answer, and this is exactly why dialogue-optimized LLMs were introduced.
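A minimal sketch of this completion behavior, using GPT-2 purely as an illustrative base model:

```python
from transformers import pipeline, set_seed

set_seed(42)
generator = pipeline("text-generation", model="gpt2")
# a text-continuation model simply extends the prompt instead of answering it
outputs = generator("How are you", max_new_tokens=10,
                    num_return_sequences=2, do_sample=True)
for out in outputs:
    print(out["generated_text"])
```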
2. Dialogue Optimized
These LLMs respond with an answer rather than completing the input. Given the input “How are you?”, such an LLM might respond with “I am doing fine.” rather than completing the sentence.
Dialogue-optimized LLMs include InstructGPT, ChatGPT, Bard, Falcon-40B-Instruct, etc.
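A sketch of prompting a dialogue-optimized model, assuming access to the gated meta-llama/Llama-2-7b-chat-hf weights and using Llama 2's [INST] instruction format:

```python
from transformers import pipeline

# assumes the gated Llama 2 chat weights are available via the Hugging Face Hub
chat = pipeline("text-generation",
                model="meta-llama/Llama-2-7b-chat-hf",
                device_map="auto")
# the chat model is tuned to answer the instruction rather than continue it
prompt = "<s>[INST] How are you? [/INST]"
result = chat(prompt, max_new_tokens=64, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```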
The following sections list the available Llama 2 model variants, walk through two example implementations, and discuss how LLMs are evaluated.
Model
| Model | Llama2 | Llama2-hf | Llama2-chat | Llama2-chat-hf |
| --- | --- | --- | --- | --- |
| 7B | link | link | link | link |
| 13B | link | link | link | link |
| 70B | link | link | link | link |
Implementation 1
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-2-70b-chat-hf")
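An illustrative call with this pipeline might look like the following; it assumes the gated 70B chat weights are available and that there is enough GPU memory (or offloading) to host them, and the prompt and parameters are only examples.

```python
output = pipe("What is a large language model?", max_new_tokens=64)
print(output[0]["generated_text"])
```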
import re
import string
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

class TokenSimilarity:

    def load_pretrained(self, from_pretrained: str = "indobenchmark/indobert-base-p1"):
        self.tokenizer = AutoTokenizer.from_pretrained(from_pretrained)
        self.model = AutoModel.from_pretrained(from_pretrained)

    def __cleaning(self, text: str):
        # lower case
        text = text.lower()
        # clear punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # collapse multiple whitespace characters into a single space
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    def __process(self, first_token: str, second_token: str):
        inputs = self.tokenizer([first_token, second_token],
                                max_length=self.max_length,
                                truncation=self.truncation,
                                padding=self.padding,
                                return_tensors='pt')
        attention = inputs.attention_mask
        outputs = self.model(**inputs)
        # take the last hidden states as token embeddings
        embeddings = outputs[0]  # when used with older transformers versions
        # embeddings = outputs.last_hidden_state  # when used with newer ones
        # add a dimension and expand the attention mask
        # to match the embeddings shape by duplicating its values
        mask = attention.unsqueeze(-1).expand(embeddings.shape).float()
        masked_embeddings = embeddings * mask
        # mean pooling over the token (2nd) dimension:
        # first, sum over the 2nd dimension
        # second, count the non-padding tokens in the 2nd dimension
        # third, calculate the mean, i.e. sums / counts
        summed = masked_embeddings.sum(1)
        counts = torch.clamp(mask.sum(1), min=1e-9)
        mean_pooled = summed / counts
        # return the mean-pooled embeddings as a numpy array
        return mean_pooled.detach().numpy()

    def predict(self, first_token: str, second_token: str,
                return_as_embeddings: bool = False, max_length: int = 16,
                truncation: bool = True, padding: str = "max_length"):
        self.max_length = max_length
        self.truncation = truncation
        self.padding = padding
        first_token = self.__cleaning(first_token)
        second_token = self.__cleaning(second_token)
        mean_pooled_arr = self.__process(first_token, second_token)
        if return_as_embeddings:
            return mean_pooled_arr
        # cosine similarity between the two sentence embeddings
        similarity = cosine_similarity([mean_pooled_arr[0]], [mean_pooled_arr[1]])
        return similarity
model = TokenSimilarity()
model.load_pretrained('indobenchmark/indobert-base-p2')
import pandas as pd
import numpy as np
df = pd.read_csv('{}/Klaster_Pertanyaan.csv'.format(path), encoding= 'unicode_escape')
df_test = pd.read_csv('{}/Data all.csv'.format(path), encoding= 'unicode_escape')
%%time
# Initialize lists to store the cosine similarities and the corresponding question pairs
sims = []
qp = []
pred = []
highest_sim = []
# Compute the cosine similarity between each test question (Question1) and each cluster question (Klaster_Pertanyaan)
for i in range(len(df_test.Question1)):
    q1 = df_test.Question1[i]
    # print(i)
    for j in range(len(df.Klaster_Pertanyaan)):
        q2 = df.Klaster_Pertanyaan[j]
        sim = model.predict(q1, q2)[0][0]
        pred_label = int(np.round(sim).flatten()[0])
        sims.append(sim)
        pred.append(pred_label)
        qp.append((q1, q2))
highest_similarity_indices1 = np.where(np.array(sims) == np.max(sims))[0]
# Convert the results into a DataFrame for inspection and export
result = {
    'Pertanyaan1': [pair[0] for pair in qp],
    'Pertanyaan2': [pair[1] for pair in qp],
    'Similarity': sims,
    #'prediction_label': pred,
}
res_csv = pd.DataFrame(result)
res_csv.to_csv('{}/result_fix.csv'.format(path), index=True)
Implementation 2
%%capture
%pip install transformers SentencePiece accelerate
import transformers, torch
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map="auto",
)
instruction = "How old is the universe?"
inputs = tokenizer(
    f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: {instruction}
### Response:""",
    return_tensors="pt",
)
input_ids = inputs["input_ids"].to("cuda")
generation_config = transformers.GenerationConfig(
    do_sample=True,
    temperature=0.1,
    top_p=0.75,
    top_k=80,
    repetition_penalty=1.5,
    max_new_tokens=128,
)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        attention_mask=torch.ones_like(input_ids),
        generation_config=generation_config,
    )
output_text = tokenizer.decode(
    generation_output[0], skip_special_tokens=True
).strip()
print(output_text)
Evaluation
- common sense reasoning
- trivia
- reading comprehension
- question answering
- mathematical reasoning
- code generation
- general domain knowledge.
Evaluation Examples:
- Common sense reasoning. The LLaMA-65B model outperformed state-of-the-art (SOTA) model architectures on the PIQA, SIQA, and OpenBookQA reasoning benchmarks. Even the smaller 33B model outperformed them on ARC, both the easy and challenge sets.
- Closed-Book Question Answering & Trivia. These tests measure an LLM's ability to interpret and respond to realistic human questions. The LLaMA model consistently outperformed GPT-3, Gopher, Chinchilla, and PaLM on the Natural Questions and TriviaQA benchmarks.
- Reading comprehension. This evaluation uses the RACE-middle and RACE-high benchmarks. LLaMA models outperformed GPT-3 and performed comparably to PaLM 540B.
- Mathematical Reasoning. LLaMA was not fine-tuned on any mathematical data, and it performed quite poorly compared to Minerva.
- Code Generation. This evaluation uses the HumanEval and MBPP benchmarks. LLaMA outperformed both LaMDA and PaLM on HumanEval@100, MBPP@1, and MBPP@80.
Intrinsic Methods [26]
Traditional language models were evaluated using intrinsic metrics such as perplexity and bits per character. These metrics track performance on the language-modeling front, i.e., how well the model predicts the next word.
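As a sketch of how perplexity is computed in practice, the snippet below uses GPT-2 as a small illustrative causal language model: the model returns the average next-token cross-entropy loss, and perplexity is the exponential of that loss.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models predict the next token in a sequence."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # passing labels makes the model return the mean next-token cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss
perplexity = torch.exp(loss)  # perplexity = exp(average negative log-likelihood)
print(perplexity.item())
```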
Extrinsic Methods [26]
With the advancements in LLMs today, extrinsic methods are preferred for evaluating their performance. The recommended way to evaluate LLMs is to look at how well they perform at different tasks such as problem-solving, reasoning, mathematics, computer science, and competitive entrance exams (e.g., the JEE).
EleutherAI released a framework called the Language Model Evaluation Harness to compare and evaluate the performance of LLMs. Hugging Face integrated this evaluation framework into its Open LLM Leaderboard to evaluate open-source LLMs developed by the community. An example invocation is sketched after the list of datasets below.
The framework evaluates LLMs across four different datasets. The final score is an aggregation of the scores from each dataset.
- AI2 Reasoning Challenge: A collection of science questions designed for elementary school students.
- HellaSwag: A test that challenges state-of-the-art models to make common-sense inferences, which are relatively easy for humans (about 95% accuracy).
- MMLU: A comprehensive test that evaluates the multitask accuracy of a text model. It includes 57 different tasks covering subjects like basic math, U.S. history, computer science, law, and more.
- TruthfulQA: A test specifically created to assess a model’s tendency to generate accurate answers and avoid reproducing false information commonly found online.
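A rough sketch of running the harness from a notebook cell is shown below. The exact package name, flags, and task identifiers vary across harness versions, so treat this invocation as an assumption to be checked against the harness documentation rather than a verified recipe.

```python
# notebook-style cell; flags and task names depend on the lm-evaluation-harness version
%pip install lm-eval
!lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks arc_challenge,hellaswag --device cuda:0 --batch_size 8
```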
References
- Paper home page : https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
- Paper : https://scontent.fcgk29-1.fna.fbcdn.net/v/t39.2365-6/10000000_662098952474184_2584067087619170692_n.pdf
- Github repository : https://github.com/facebookresearch/llama
- Technical overview : https://ai.meta.com/resources/models-and-libraries/llama/
- Community : https://ai.meta.com/llama/open-innovation-ai-research-community/
- Model : https://huggingface.co/meta-llama
- https://huggingface.co/blog/llama2
- Demo https://huggingface.co/spaces/ysharma/Explore_llamav2_with_TGI
- https://beebom.com/how-use-llama-2-ai-model/
- https://aituts.com/llama/
- https://research.aimultiple.com/meta-llama/
- Type:
- https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
- https://www.datacamp.com/blog/introduction-to-meta-ai-llama
- https://www.datacamp.com/blog/12-gpt4-open-source-alternatives
- https://www.datacamp.com/blog/5-projects-you-can-build-with-generative-ai-models
- https://www.techtarget.com/whatis/definition/large-language-model-LLM
- https://www.elastic.co/what-is/large-language-models
- Training:
- https://towardsdatascience.com/different-ways-of-training-llms-c57885f388ed
- https://blog.replit.com/llm-training
- https://www.datacamp.com/tutorial/how-to-train-a-llm-with-pytorch
- https://kili-technology.com/large-language-models-llms/data-labeling-and-large-language-models-training
- https://www.analyticsvidhya.com/blog/2023/07/build-your-own-large-language-models/
- https://research.aimultiple.com/llm-fine-tuning/
- https://medium.com/@lokaregns/fine-tuning-transformers-with-custom-dataset-classification-task-f261579ae068
- https://platform.openai.com/docs/guides/fine-tuning/fine-tuning-examples