First, the procedure for downloading Meta Llama 3.1 8B is all documented on the website. Go to llama.meta.com/llama-downloads and fill out the form; once it is submitted, a download URL is shown, in a format like this:
https://llama3-1.llamameta.net/*?Policy=XXX&Key-Pair-Id=XXX&Download-Request-ID=XXX
Then, rather than opening that URL directly in a browser, you have to follow the prescribed procedure and fetch the weights through download.sh. The full sequence of steps:
```
% git clone https://github.com/meta-llama/llama-models
% bash llama-models/models/llama3_1/download.sh
Enter the URL from email: https://llama3-1.llamameta.net/*?Policy=XXX&Key-Pair-Id=XXX&Download-Request-ID=XXX

**** Model list ***
- meta-llama-3.1-405b
- meta-llama-3.1-70b
- meta-llama-3.1-8b
- meta-llama-guard-3-8b
- prompt-guard

Choose the model to download: meta-llama-3.1-8b

**** Available models to download: ***
- meta-llama-3.1-8b-instruct
- meta-llama-3.1-8b

Enter the list of models to download without spaces or press Enter for all: meta-llama-3.1-8b
Downloading LICENSE and Acceptable Usage Policy...
```
Size of the downloaded data:
```
% du -hd1 Meta-Llama-3.1-8B
 15G    Meta-Llama-3.1-8B
% tree Meta-Llama-3.1-8B
Meta-Llama-3.1-8B
├── consolidated.00.pth
├── params.json
└── tokenizer.model

1 directory, 3 files
```
With that, you can enjoy the AI model Meta has released to the whole world. Huge thanks (Orz) for sparing us the cost of buying machines and training it ourselves. This Llama 3.1 release is reportedly able to compete with GPT-4o and Claude 3.5 Sonnet, but with only the compute I have at home, I'll just try the 8B model!
Next, try it out with llama.cpp. Besides compiling the code, the weights also need to be converted into a different format:
```
% wget https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
% python3 -m venv venv
% source venv/bin/activate
(venv) % pip install transformers torch huggingface_hub tiktoken blobfile accelerate
(venv) % python3 convert_llama_weights_to_hf.py --input_dir Meta-Llama-3.1-8B --model_size 8B --output_dir llama3_1_hf --llama_version 3.1
(venv) % du -hd1 llama3_1_hf
 15G    llama3_1_hf
(venv) % tree llama3_1_hf
llama3_1_hf
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer.json
└── tokenizer_config.json

1 directory, 10 files
```
Build and run llama.cpp:
```
(venv) % git clone https://github.com/ggerganov/llama.cpp
(venv) % cd llama.cpp
(venv) llama.cpp % LLAMA_METAL=1 make
(venv) llama.cpp % pip install -r requirements.txt
(venv) llama.cpp % time python3 convert_hf_to_gguf.py ../llama3_1_hf/ --outfile llama3_1-8B.gguf
(venv) llama.cpp %
(venv) llama.cpp % du -hd1 llama3_1-8B.gguf
 15G    llama3_1-8B.gguf
(venv) llama.cpp % ./llama-server -m ./llama3_1-8B.gguf
...
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
...
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:      Metal KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:      Metal compute buffer size =  8480.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   264.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
...
^Cggml_metal_free: deallocating
```
With the default settings, it hits the resource limits of an Apple MacBook Pro laptop, so I added -c 31072 to cap the context size. With that, the whole llama-server process shows less than 4 GB of memory in the macOS Activity Monitor and looks like it has stabilized:
```
(venv) llama.cpp % ./llama-server -m ./llama3_1-8B.gguf -c 31072
............................................................................................
llama_new_context_with_model: n_ctx      = 31072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:      Metal KV buffer size =  3884.00 MiB
llama_new_context_with_model: KV self size  = 3884.00 MiB, K (f16): 1942.00 MiB, V (f16): 1942.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.98 MiB
llama_new_context_with_model:      Metal compute buffer size =  2034.69 MiB
llama_new_context_with_model:        CPU compute buffer size =    68.69 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="0x1f3b34c00" timestamp=1721895109 n_slots=1
INFO [                    init] new slot | tid="0x1f3b34c00" timestamp=1721895109 id_slot=0 n_ctx_slot=31072
INFO [                    main] model loaded | tid="0x1f3b34c00" timestamp=1721895109
INFO [                    main] chat template | tid="0x1f3b34c00" timestamp=1721895109 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="0x1f3b34c00" timestamp=1721895109 port="8080" n_threads_http="9" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="0x1f3b34c00" timestamp=1721895109
...
```
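The saving comes almost entirely from the KV cache, which grows linearly with the context length. As a rough sanity check (assuming the usual Llama 3.1 8B shape: 32 layers, 8 KV heads with head dimension 128, cached in f16), each token costs 2 × 32 × 8 × 128 × 2 bytes = 128 KiB of K/V storage, so the default n_ctx = 131072 needs 131072 × 128 KiB = 16384 MiB, while -c 31072 needs only 31072 × 128 KiB = 3884 MiB, which matches the two KV buffer sizes reported in the logs above.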
From there, you can point a browser at http://localhost:8080 and start playing with Llama 3.1 8B.
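Besides the built-in web UI, the server can also be queried over HTTP directly. Here is a minimal sketch, assuming the llama.cpp server still exposes its OpenAI-compatible /v1/chat/completions endpoint on the default port 8080:

```
% curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Hello, who are you?"}
          ],
          "temperature": 0.7
        }'
```

Keep in mind that the weights downloaded above are the base meta-llama-3.1-8b rather than the -instruct variant, so chat-style replies can be rough; the server's plain /completion endpoint with a raw prompt may suit a base model better.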