Thursday, July 25, 2024

AI Dev Notes - Running Meta Llama 3.1 8B with llama.cpp, including the weight format conversion @ MacBook Pro M1 32GB RAM



First, the steps for downloading Meta Llama 3.1 8B are all documented on the website: go to llama.meta.com/llama-downloads and fill out the form. Once it is submitted, you get a download URL in a format like this:

https://llama3-1.llamameta.net/*?Policy=XXX&Key-Pair-Id=XXX&Download-Request-ID=XXX

You don't download it by opening that URL in a browser; it has to be fetched the designated way, through the download.sh script.


The full sequence:

```
% git clone https://github.com/meta-llama/llama-models
% bash llama-models/models/llama3_1/download.sh
Enter the URL from email:  https://llama3-1.llamameta.net/*?Policy=XXX&Key-Pair-Id=XXX&Download-Request-ID=XXX

 **** Model list ***
 -  meta-llama-3.1-405b
 -  meta-llama-3.1-70b
 -  meta-llama-3.1-8b
 -  meta-llama-guard-3-8b
 -  prompt-guard
Choose the model to download: meta-llama-3.1-8b

**** Available models to download: *** 
 -  meta-llama-3.1-8b-instruct
 -  meta-llama-3.1-8b

Enter the list of models to download without spaces or press Enter for all: meta-llama-3.1-8b
Downloading LICENSE and Acceptable Usage Policy
...
```

Size on disk:

```
% du -hd1 Meta-Llama-3.1-8B 
 15G    Meta-Llama-3.1-8B

% tree Meta-Llama-3.1-8B 
Meta-Llama-3.1-8B
├── consolidated.00.pth
├── params.json
└── tokenizer.model

1 directory, 3 files
```

Now we get to enjoy the AI model Meta released to the world. Huge thanks (Orz) for sparing us from paying for machines and training one ourselves. This Llama 3.1 release is reportedly claimed to rival GPT-4o and Claude 3.5 Sonnet, but with the limited compute at home, I'll just try the 8B!

Next up is trying it with llama.cpp. Besides compiling the code, the weights also have to be converted to another format:

```
% wget https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
% python3 -m venv venv
% source venv/bin/activate
(venv) % pip install transformers torch huggingface_hub tiktoken blobfile accelerate
(venv) % python3 convert_llama_weights_to_hf.py --input_dir Meta-Llama-3.1-8B --model_size 8B --output_dir llama3_1_hf --llama_version 3.1

(venv) % du -hd1 llama3_1_hf
 15G    llama3_1_hf
(venv) % tree llama3_1_hf
llama3_1_hf
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer.json
└── tokenizer_config.json

1 directory, 10 files
```
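Before building llama.cpp, the converted llama3_1_hf folder can optionally be sanity-checked straight from transformers. A minimal sketch, assuming the same venv and plenty of free RAM (the bf16 weights alone are about 16 GB), so it is easy to skip on smaller machines:

```python
# Optional sanity check: load the converted HF weights and generate a few tokens.
# Assumes the llama3_1_hf folder produced above; bf16 on CPU is slow but works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llama3_1_hf")
model = AutoModelForCausalLM.from_pretrained("llama3_1_hf", torch_dtype=torch.bfloat16)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```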

Compile and run llama.cpp:

```
(venv) % git clone https://github.com/ggerganov/llama.cpp
(venv) % cd llama.cpp
(venv) llama.cpp % LLAMA_METAL=1 make  
(venv) llama.cpp % pip install -r requirements.txt
(venv) llama.cpp % time python3 convert_hf_to_gguf.py ../llama3_1_hf/ --outfile llama3_1-8B.gguf
(venv) llama.cpp % 
(venv) llama.cpp % du -hd1 llama3_1-8B.gguf
 15G    llama3_1-8B.gguf
(venv) llama.cpp % ./llama-server -m ./llama3_1-8B.gguf
...
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
...
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:      Metal KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:      Metal compute buffer size =  8480.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   264.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
...
^Cggml_metal_free: deallocating
```
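The out-of-memory error is mostly about the KV cache: llama-server defaults to the model's full 131072-token context, and at f16 that cache alone takes 16 GiB, on top of the ~15 GB of weights and the compute buffers. A quick back-of-the-envelope check, assuming the published Llama 3.1 8B shape (32 layers, 8 KV heads via GQA, head dim 128):

```python
# Rough KV-cache size estimate for Llama 3.1 8B at f16.
# Assumes n_layers=32, n_kv_heads=8, head_dim=128 (the published model shape).
def kv_cache_mib(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store n_ctx * n_kv_heads * head_dim elements per layer.
    total_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / (1024 ** 2)

print(kv_cache_mib(131072))  # 16384.0 MiB -> matches the failing default run
print(kv_cache_mib(31072))   #  3884.0 MiB -> matches the -c 31072 run below
```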

With the default settings it hits the resource limits of the Apple MacBook Pro, so I added -c 31072 to the run. With that, the whole llama-server process shows under 4 GB of memory in macOS Activity Monitor and looks like it has stabilized:

```
(venv) llama.cpp % ./llama-server -m ./llama3_1-8B.gguf -c 31072
...
.........................................................................................
llama_new_context_with_model: n_ctx      = 31072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:      Metal KV buffer size =  3884.00 MiB
llama_new_context_with_model: KV self size  = 3884.00 MiB, K (f16): 1942.00 MiB, V (f16): 1942.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:      Metal compute buffer size =  2034.69 MiB
llama_new_context_with_model:        CPU compute buffer size =    68.69 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

INFO [                    init] initializing slots | tid="0x1f3b34c00" timestamp=1721895109 n_slots=1
INFO [                    init] new slot | tid="0x1f3b34c00" timestamp=1721895109 id_slot=0 n_ctx_slot=31072
INFO [                    main] model loaded | tid="0x1f3b34c00" timestamp=1721895109
INFO [                    main] chat template | tid="0x1f3b34c00" timestamp=1721895109 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="0x1f3b34c00" timestamp=1721895109 port="8080" n_threads_http="9" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="0x1f3b34c00" timestamp=1721895109
...
```

With that, you can open http://localhost:8080 and start playing with Llama 3.1 8B.
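Besides the built-in web UI, llama-server also exposes an HTTP API. A minimal sketch for calling it from Python through the OpenAI-compatible chat endpoint, assuming the default 127.0.0.1:8080 and using only the standard library:

```python
# Minimal client for the local llama-server.
# Assumes the default host/port and the OpenAI-compatible /v1/chat/completions endpoint.
import json
import urllib.request

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one short sentence."},
    ],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])
```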
