QEUR23_LMABYS12:　Llama2の精度評価（日本語方面）

8月 17, 2023

～結局、このモデルの日本語の能力は「まあまあ」～

D先生： “LLM（大規模言語モデル）を使う環境が揃ったので、これからモデルの精度を評価するんですよね？”

QEU:FOUNDER ： “US永住権認証試験（改）のデータを使いますが、日本語の部分を使います。”

D先生： “ちなみに、前回、前々回にやってみたGPT4AllとLlama2の英語を使ったときの正解率は、以下のような感じでした。英語の質問では、Llama2はさすがのパフォーマンスでした。もちろん、英語でトレーニングされていますから「得意領域」ですからね・・・。”

（GPT4All）

       (Langchainなし) / (LangChainあり)

Open_qa : 32/61 - 36/61

Classification : 4/39 - 9/39

(Llama-v2-13b-GPTQ：英語)

QEU:FOUNDER ： “そして、今回は、LLMの、おそらく不得意領域にあたる「日本語での質問に対して、どのように回答するか」です。それでは、LLMで質問に回答するプログラムを晒します”

# llama2-13B-chat-GPTQによる回答プログラム
import pandas as pd
import numpy as np
from langchain import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# モデル名
model_name_or_path = "TheBloke/h2ogpt-4096-llama2-13B-chat-GPTQ"
use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

"""
# To download from a specific branch, use the revision parameter, as in this example:
# Note that `revision` requires AutoGPTQ 0.3.1 or later!

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-4bit-32g-actorder_True",
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        quantize_config=None)
"""

# -----
# Inference can also be done using transformers' pipeline
# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

#print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0,            # GPUを使うことを指定 (cuda:0と同義)
    max_new_tokens=512,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.15
)

#################
##################################

llm = HuggingFacePipeline(
    pipeline = pipe, 
    model_kwargs = {"temperature": 0.2}
)

# Prompts: プロンプトを作成する(default)
template = """<prompt>Question: {question} Write answer within 15 words.</prompt>
<answer>Answer:"""

# Promptを完成する
prompt = PromptTemplate(template=template, input_variables=["question"])
#print(f"prompt: {prompt}")

# 質問の文字列リストを定義する
df = pd.read_excel("/content/drive/MyDrive/LCdocs/QEU-verify100-ja2(en).xlsx")
#print(df)
len(df)

question_list = df["instruction(jp)"].values
#print(question_list[0:20])

reference_list = df["response(jp)"].values
#print(referece_list[0:20])

type_list = df["type"].values
#print(type_list[0:20])

df_out = df.loc[:, ["NO", "instruction(jp)", "type"]]
print(df_out)

context = """
Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger; its name was coined as a portmanteau of "wiki" and "encyclopedia". Initially available only in English, versions in other languages were quickly developed. The English Wikipedia, with 6.3 million articles as of Feb-ruary 2021, is the largest of the 321 language editions. Combined, Wikipedia's editions comprise more than 55 million articles, and attract more than 17 million edits and more than 1.7 billion unique visitors per month.
"""

# 個別の質問応答のタスクを実行する関数を定義する
def qa(i, question, context):

    # Chains: 作成したモデルを利用可能な状態にする
    llm_chain = LLMChain(prompt=prompt, llm=llm)

    # 回答を表示する
    print(f"No.{i}: Question: {question}")
    answer = llm_chain.predict(question=question)
    pos_str = answer.find('</answer>')  # </answer>記号以前の情報を使う
    cut_answer = answer[0:pos_str]
    print(f"Answer: {cut_answer}")
    
    return cut_answer

# -----
# 繰り返し質問する
arr_answer = []
for i in range(100):     # range(len(question_list))

    # 今回の質問を抽出する
    question = question_list[i]
    # 質問に回答する
    answer = qa(i, question, context)
    # 質問リストを生成する
    arr_answer.append(answer)

# 回答リストを出力する
print("--------")
print("generate arr_answer")
print(np.array(arr_answer))

#################
##################################

# -----
# Create DataFrame
df_out["Referece"] = reference_list
df_out["Answer"] = arr_answer
print(df_out)

# -----
# 出力のファイル名を QEU-verify100-jp(Raw).xlsx, シート名を Sheet_name_1 として保存する。
df_out.to_excel("/content/drive/MyDrive/LCdocs/QEU-verify100-jp(Raw).xlsx",sheet_name="Sheet_name_1")

QEU:FOUNDER ： “そして、以下が回答の結果です。・・・ドン！！”

D先生： “やっぱり英語による回答事例が多いですね。ただし、LLMの回答そのものは正解が多いです。次は、COS類似度による評価のステップなんですが、英語と日本語はマッチングしてくれるんですかねえ？”

(Llama-v2-13b-GPTQ：日本語)

(Llama-v2-13b-GPTQ：英語、再掲)

QEU:FOUNDER ： “英語と日本語のパフォーマンスを比較すると面白いでしょ？Open-QAの質問はメタメタだが、Classification（分類）の質問にはそこそこに対応できています。もうちょっと詳しくみてみましょう。”

D先生： “NO2の回答を見てみましょう。「LLMさん」は英語で回答しました。その回答は正解です。しかし、ベクトル化では日本語と英語はマッチングしませんでしたね。”

QEU:FOUNDER ： “このLLMについて、大まかな姿が見えてきましたね。なぜかClassificationの得点だけが良いという・・・。”

D先生： “今回の問題は「プロンプト・エンジニアリング」で解決できる可能性がある。つまり、LangChainのVector Storeで解決できる余地があります。”

QEU:FOUNDER ： “それでは、次はLangChainの適用に入りましょう。”

～まとめ～

C部長 : “そういえば、昔、FOUNDERが支持していたイケメン（↓）については、今現在、どう思っているんですか？”

QEU:FOUNDER ： “なんとも思っていないです。あの人たちは「経済政策が目玉」なんだけど、「あんなにうまく行くかね～」と思っているから・・・（笑）。”

C部長 : “MMT（現代貨幣理論）に対する不信感？”

QEU:FOUNDER ： “あの人はMMTという言葉を使ったことがないよ。そこらへんはすごいなぁ～、と思います。”

C部長 : “より深く現状を理解している可能性があります。だれだろう・・・、（彼の）背後にいるブレーンは・・・。”

QEU:FOUNDER ： “消極的支持・・・（笑）。”

このブログを検索

QEU_ROUND2-3：　大規模言語モデル上にQEUシステムを構築する

QEUR23_LMABYS12:　Llama2の精度評価（日本語方面）

～結局、このモデルの日本語の能力は「まあまあ」～

～まとめ～

コメント

コメントを投稿

このブログの人気の投稿

QEUR23_LMABYS2:　閑話休題～ブログをつくるとわかること

QEUR23_LMABYS7: LangChainで本格的に性能を評価する

QEUR23_LMABYS16:　Llama2の精度評価（LangChain-日本語-その2）

QEUR23_LMABYS12: Llama2の精度評価（日本語方面）

～ 結局、このモデルの日本語の能力は「まあまあ」 ～

～ まとめ ～

コメント

コメントを投稿

このブログの人気の投稿

QEUR23_LMABYS2: 閑話休題～ブログをつくるとわかること

QEUR23_LMABYS7: LangChainで本格的に性能を評価する

QEUR23_LMABYS16: Llama2の精度評価（LangChain-日本語-その2）

QEUR23_LMABYS12:　Llama2の精度評価（日本語方面）

～結局、このモデルの日本語の能力は「まあまあ」～

～まとめ～

QEUR23_LMABYS2:　閑話休題～ブログをつくるとわかること

QEUR23_LMABYS16:　Llama2の精度評価（LangChain-日本語-その2）