How to use Llama 2 with langchain

Contents
Get the code for converting the Llama 2 model
Download the Llama 2 model
Convert the Llama 2 model to gguf
Install llama-cpp-python
  CPU
  CPU + GPU
  macOS on Apple Silicon
Run
References

Get the code for converting the Llama 2 model

git clone https://github.com/ggerganov/llama.cpp.git

Download the Llama 2 model

Download the pre-conversion model from the models uploaded by the following user.

TheBloke (Tom Jobbins)

LLM: quantisation, fine tuning

https://huggingface.co

Search the model list above for the Llama 2 model you want to use.

I used the following model.

TheBloke/Llama-2-7B-Chat-GGML · Hugging Face

https://huggingface.co

The model name is as follows.

TheBloke/Llama-2-7B-Chat-GGML

From the URL above, click Files and versions to go to the following page.

TheBloke/Llama-2-7B-Chat-GGML at main

https://huggingface.co

Next, download

llama-2-7b-chat.ggmlv3.q4_0.bin

into the models directory of the llama.cpp repository you cloned earlier.
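
If you would rather script the download than click through the browser, here is a minimal sketch using the huggingface_hub package. This is an assumption on my part: the steps above do not use it, and it has to be installed separately (pip install huggingface_hub). The helper names target_path and download_model, and the default directory llama.cpp/models, are hypothetical names introduced only for illustration.

```python
from pathlib import Path

# Repository and file name taken from the steps above.
REPO_ID = "TheBloke/Llama-2-7B-Chat-GGML"
FILENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"


def target_path(models_dir: str = "llama.cpp/models") -> Path:
    """Return where the model file is expected to end up (hypothetical helper)."""
    return Path(models_dir) / FILENAME


def download_model(models_dir: str = "llama.cpp/models") -> Path:
    """Download the GGML file into models_dir and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    Path(models_dir).mkdir(parents=True, exist_ok=True)
    local_file = hf_hub_download(
        repo_id=REPO_ID,
        filename=FILENAME,
        local_dir=models_dir,  # assumed location of the cloned repo's models directory
    )
    return Path(local_file)
```

Calling download_model() from the directory that contains the llama.cpp clone should put the file where the conversion step expects it. Adjust models_dir if your clone lives elsewhere.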

Convert the Llama 2 model to gguf

Run the following command to convert the model.

python ./convert-llama-ggmlv3-to-gguf.py --eps 1e-5 --input models/llama-2-7b-chat.ggmlv3.q4_0.bin --output models/llama-2-7b-chat.gguf.q4_0.bin

If you get a ModuleNotFoundError, install the missing package with pip install or poetry add as appropriate.

Installing from requirements.txt should also work.
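
After the conversion, it can be reassuring to confirm that the output really is a GGUF file. The sketch below is not part of the original steps. It relies only on the fact that a GGUF file starts with the four bytes GGUF. The function name is_gguf is a hypothetical helper introduced for illustration.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def is_gguf(path: str) -> bool:
    """Return True if the file at path begins with the GGUF magic bytes."""
    with Path(path).open("rb") as f:
        return f.read(4) == GGUF_MAGIC
```

For example, is_gguf("models/llama-2-7b-chat.gguf.q4_0.bin") should return True for the converted file, while the original ggmlv3 file should return False.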

Install `llama-cpp-python`

To actually run Llama 2, install llama-cpp-python.

CPU

pip install llama-cpp-python

CPU + GPU

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

macOS on Apple Silicon

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

For details, see the references.

I have a GPU environment, so I ran the second command.
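
The three variants differ only in the environment variables passed to pip. As a rough sketch, the same choice can be expressed in Python. The backend labels "cpu", "cuda", and "metal" and the helper names build_env and install are hypothetical, introduced here for illustration. The CMake flags themselves are copied from the commands above.

```python
import os
import subprocess
import sys

# CMake flags copied from the three install variants above.
# The keys are hypothetical labels, not names used by llama-cpp-python.
CMAKE_FLAGS = {
    "cpu": None,
    "cuda": "-DLLAMA_CUBLAS=on",
    "metal": "-DLLAMA_METAL=on",
}


def build_env(backend: str) -> dict:
    """Return a copy of the environment with the variables for the chosen backend."""
    if backend not in CMAKE_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    env = dict(os.environ)
    flag = CMAKE_FLAGS[backend]
    if flag is not None:
        env["CMAKE_ARGS"] = flag
        env["FORCE_CMAKE"] = "1"
    return env


def install(backend: str) -> None:
    """Run pip install llama-cpp-python with the environment for backend."""
    cmd = [sys.executable, "-m", "pip", "install", "llama-cpp-python"]
    subprocess.run(cmd, env=build_env(backend), check=True)
```

In my case this would correspond to install("cuda"). Running the shell commands directly is just as good; the sketch only makes the difference between the variants explicit.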

Run

Run the following code.

Change model_path to match your environment.

import os

from langchain import LLMChain, PromptTemplate
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

template = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""

prompt = PromptTemplate(template=template, input_variables=["question"])


# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])


n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

model_dir = os.path.join("path/to/your/models")  # Replace with the directory that contains the model
model_path = os.path.join(model_dir, "llama-2-7b-chat.gguf.q4_0.bin")


# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)
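
To see what text the model actually receives, here is a small sketch that fills in the template by hand with str.format. It only approximates what PromptTemplate does with its input variable and does not need langchain. render_prompt is a name I made up for illustration.

```python
TEMPLATE = """Question: {question}

Answer: Let's work this out in a step by step way to be sure we have the right answer."""


def render_prompt(question: str) -> str:
    """Substitute the question into the template, for illustration only."""
    return TEMPLATE.format(question=question)


print(render_prompt("What NFL team won the Super Bowl in the year Justin Bieber was born?"))
```

The printed text is the prompt that is sent to the model. The model's continuation after "Answer: ..." is what gets streamed to stdout by the callback handler.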