# Using llama.cpp in the web UI
- Re-install the requirements from `requirements.txt`:

```
pip install -r requirements.txt -U
```

- Follow the instructions in the llama.cpp README to generate the `ggml-model-q4_0.bin` file: https://github.com/ggerganov/llama.cpp#usage

- Create a folder inside `models/` for your model and put `ggml-model-q4_0.bin` in it. For instance, `models/llamacpp-7b/ggml-model-q4_0.bin`.

- Start the web UI normally:

```
python server.py --model llamacpp-7b
```

- This procedure should work for any `ggml*.bin` file. Just put it in a folder and use the name of that folder as the argument after `--model`, or as the model loaded inside the interface.

- You can change the number of threads with `--threads N`.
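
The web UI loads these files through the llama-cpp-python bindings, so a quick way to confirm that the `.bin` file itself is valid is to load it with those bindings directly. This is a minimal sketch, not the web UI's own loading code; the model path and the thread count of 4 are assumptions you should adjust to your setup:

```python
# Minimal sanity check using the llama-cpp-python bindings directly.
# The path and n_threads=4 are assumptions; adjust them to your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llamacpp-7b/ggml-model-q4_0.bin",
    n_ctx=512,     # context window size
    n_threads=4,   # plays the same role as the web UI's --threads N flag
)

output = llm(
    "Building a website can be done in 10 simple steps:",
    max_tokens=64,
)
print(output["choices"][0]["text"])
```

If this prints a completion, the file is fine and any remaining issue is on the web UI side.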
## Performance

This was the performance of llama-7b int4 on my i5-12400F:

```
Output generated in 33.07 seconds (6.05 tokens/s, 200 tokens, context 17)
```
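
The tokens/s figure is simply the token count divided by the wall-clock generation time, as a quick check with the numbers above shows:

```python
# Throughput reported by the web UI is tokens divided by wall-clock seconds.
tokens = 200
seconds = 33.07
print(f"{tokens / seconds:.2f} tokens/s")  # -> 6.05 tokens/s
```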
Limitations
~* The parameter sliders in the interface (temperature, top_p, top_k, etc) are completely ignored. So only the default parameters in llama.cpp can be used.~
~* Only 512 tokens of context can be used.~
~Both of these should be improved soon when llamacpp-python receives an update.~