Thursday, July 25, 2024

AI Dev Notes - Running Meta Llama 3.1 8B with llama.cpp, including the weight format conversion @ MacBook Pro M1 32GB RAM



First, the steps for downloading Meta Llama 3.1 8B are all documented on the website: go to llama.meta.com/llama-downloads and fill out the form. Once it is submitted, you get a download URL in a format like this:

https://llama3-1.llamameta.net/*?Policy=XXX&Key-Pair-Id=XXX&Download-Request-ID=XXX

You don't download it by opening that URL in a browser; it has to be fetched the designated way, through the download.sh script.


The full sequence:

```
% git clone https://github.com/meta-llama/llama-models
% bash llama-models/models/llama3_1/download.sh
Enter the URL from email:  https://llama3-1.llamameta.net/*?Policy=XXX&Key-Pair-Id=XXX&Download-Request-ID=XXX

 **** Model list ***
 -  meta-llama-3.1-405b
 -  meta-llama-3.1-70b
 -  meta-llama-3.1-8b
 -  meta-llama-guard-3-8b
 -  prompt-guard
Choose the model to download: meta-llama-3.1-8b

**** Available models to download: *** 
 -  meta-llama-3.1-8b-instruct
 -  meta-llama-3.1-8b

Enter the list of models to download without spaces or press Enter for all: meta-llama-3.1-8b
Downloading LICENSE and Acceptable Usage Policy
...
```

Size on disk:

```
% du -hd1 Meta-Llama-3.1-8B 
 15G    Meta-Llama-3.1-8B

% tree Meta-Llama-3.1-8B 
Meta-Llama-3.1-8B
├── consolidated.00.pth
├── params.json
└── tokenizer.model

1 directory, 3 files
```

Now we get to enjoy the AI model Meta released to the world. Huge thanks (Orz) for sparing us from paying for machines and training one ourselves. This Llama 3.1 release is reportedly claimed to rival GPT-4o and Claude 3.5 Sonnet, but with the limited compute at home, I'll just try the 8B!

Next up is trying it with llama.cpp. Besides compiling the code, the weights also have to be converted to another format:

```
% wget https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
% python3 -m venv venv
% source venv/bin/activate
(venv) % pip install transformers torch huggingface_hub tiktoken blobfile accelerate
(venv) % python3 convert_llama_weights_to_hf.py --input_dir Meta-Llama-3.1-8B --model_size 8B --output_dir llama3_1_hf --llama_version 3.1

(venv) % du -hd1 llama3_1_hf
 15G    llama3_1_hf
(venv) % tree llama3_1_hf
llama3_1_hf
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer.json
└── tokenizer_config.json

1 directory, 10 files
```
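Before building llama.cpp, the converted llama3_1_hf folder can optionally be sanity-checked straight from transformers. A minimal sketch, assuming the same venv and plenty of free RAM (the bf16 weights alone are about 16 GB), so it is easy to skip on smaller machines:

```python
# Optional sanity check: load the converted HF weights and generate a few tokens.
# Assumes the llama3_1_hf folder produced above; bf16 on CPU is slow but works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llama3_1_hf")
model = AutoModelForCausalLM.from_pretrained("llama3_1_hf", torch_dtype=torch.bfloat16)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```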

Compile and run llama.cpp:

```
(venv) % git clone https://github.com/ggerganov/llama.cpp
(venv) % cd llama.cpp
(venv) llama.cpp % LLAMA_METAL=1 make  
(venv) llama.cpp % pip install -r requirements.txt
(venv) llama.cpp % time python3 convert_hf_to_gguf.py ../llama3_1_hf/ --outfile llama3_1-8B.gguf
(venv) llama.cpp % 
(venv) llama.cpp % du -hd1 llama3_1-8B.gguf
 15G    llama3_1-8B.gguf
(venv) llama.cpp % ./llama-server -m ./llama3_1-8B.gguf
...
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
...
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:      Metal KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:      Metal compute buffer size =  8480.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   264.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
...
^Cggml_metal_free: deallocating
```
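The out-of-memory error is mostly about the KV cache: llama-server defaults to the model's full 131072-token context, and at f16 that cache alone takes 16 GiB, on top of the ~15 GB of weights and the compute buffers. A quick back-of-the-envelope check, assuming the published Llama 3.1 8B shape (32 layers, 8 KV heads via GQA, head dim 128):

```python
# Rough KV-cache size estimate for Llama 3.1 8B at f16.
# Assumes n_layers=32, n_kv_heads=8, head_dim=128 (the published model shape).
def kv_cache_mib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store n_ctx * n_kv_heads * head_dim elements per layer.
    total_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / (1024 ** 2)

print(kv_cache_mib(131072))  # 16384.0 MiB -> matches the failing default run
print(kv_cache_mib(31072))   #  3884.0 MiB -> matches the -c 31072 run below
```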

With the default settings it hits the resource limits of the Apple MacBook Pro, so I added -c 31072 to the run. With that, the whole llama-server process shows under 4 GB of memory in macOS Activity Monitor and looks like it has stabilized:

```
(venv) llama.cpp % ./llama-server -m ./llama3_1-8B.gguf -c 31072
...
.........................................................................................
llama_new_context_with_model: n_ctx      = 31072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:      Metal KV buffer size =  3884.00 MiB
llama_new_context_with_model: KV self size  = 3884.00 MiB, K (f16): 1942.00 MiB, V (f16): 1942.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:      Metal compute buffer size =  2034.69 MiB
llama_new_context_with_model:        CPU compute buffer size =    68.69 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

INFO [                    init] initializing slots | tid="0x1f3b34c00" timestamp=1721895109 n_slots=1
INFO [                    init] new slot | tid="0x1f3b34c00" timestamp=1721895109 id_slot=0 n_ctx_slot=31072
INFO [                    main] model loaded | tid="0x1f3b34c00" timestamp=1721895109
INFO [                    main] chat template | tid="0x1f3b34c00" timestamp=1721895109 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="0x1f3b34c00" timestamp=1721895109 port="8080" n_threads_http="9" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="0x1f3b34c00" timestamp=1721895109
...
```

With that, you can open http://localhost:8080 and start playing with Llama 3.1 8B.
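Besides the built-in web UI, llama-server also exposes an HTTP API. A minimal sketch for calling it from Python through the OpenAI-compatible chat endpoint, assuming the default 127.0.0.1:8080 and using only the standard library:

```python
# Minimal client for the local llama-server.
# Assumes the default host/port and the OpenAI-compatible /v1/chat/completions endpoint.
import json
import urllib.request

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one short sentence."},
    ],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])
```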
