Mirror of https://github.com/oobabooga/text-generation-webui.git, synced 2026-04-04 14:17:28 +00:00

Backend cleanup (#6025)

This commit is contained in:
parent 6a1682aa95
commit bd7cc4234d

23 changed files with 57 additions and 442 deletions
@@ -64,14 +64,6 @@ Loads: GPTQ models.
 * **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.
 * **desc_act**: For ancient models without proper metadata, sets the model "act-order" parameter manually. Can usually be ignored.
 
-### GPTQ-for-LLaMa
-
-Loads: GPTQ models.
-
-Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlamaV2 and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
-
-* **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.
-
 ### llama.cpp
 
 Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.
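The llama.cpp loader mentioned above takes a GGUF file directly. As a sketch of a minimal invocation (the model filename is a placeholder for any GGUF file under `models/`, and `--n-gpu-layers` controls how many layers are offloaded to the GPU):

```shell
# Load a GGUF model with the llama.cpp loader; the filename is an
# example placeholder. More --n-gpu-layers means more VRAM use and
# faster generation; 0 keeps everything on the CPU.
python server.py --loader llama.cpp --model llama-7b.Q4_K_M.gguf --n-gpu-layers 35
```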
@@ -13,28 +13,6 @@ Source: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/1126
 
 This file will be automatically detected the next time you start the web UI.
 
-## Using LoRAs with GPTQ-for-LLaMa
-
-This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
-
-To use it:
-
-Install alpaca_lora_4bit using pip
-
-```
-git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
-cd alpaca_lora_4bit
-git fetch origin winglian-setup_pip
-git checkout winglian-setup_pip
-pip install .
-```
-
-Start the UI with the --monkey-patch flag:
-
-```
-python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
-```
-
 ## DeepSpeed
 
 `DeepSpeed ZeRO-3` is an alternative offloading strategy for full-precision (16-bit) transformers models.
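The DeepSpeed line above corresponds to launching the server through the `deepspeed` launcher instead of plain `python`. A hedged sketch, assuming a single GPU and `deepspeed` already installed via pip (the model name is a placeholder):

```shell
# Launch through the DeepSpeed launcher so ZeRO-3 offloading is active.
# Only applies to full-precision (16-bit) transformers models; the model
# name below is an example placeholder for a directory under models/.
deepspeed --num_gpus=1 server.py --deepspeed --model opt-1.3b
```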
@@ -2,15 +2,13 @@
 
 | Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
 |----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
-| Transformers | ✅ | ✅\*\*\* | ✅\* | ✅ | ✅ |
+| Transformers | ✅ | ✅\*\* | ✅\* | ✅ | ✅ |
 | llama.cpp | ❌ | ❌ | ❌ | ❌ | use llamacpp_HF |
 | llamacpp_HF | ❌ | ❌ | ❌ | ❌ | ✅ |
 | ExLlamav2_HF | ✅ | ✅ | ❌ | ❌ | ✅ |
 | ExLlamav2 | ✅ | ✅ | ❌ | ❌ | use ExLlamav2_HF |
 | AutoGPTQ | ✅ | ❌ | ❌ | ✅ | ✅ |
 | AutoAWQ | ? | ❌ | ? | ? | ✅ |
-| GPTQ-for-LLaMa | ✅\*\* | ✅\*\*\* | ✅ | ✅ | ✅ |
-| QuIP# | ? | ? | ? | ? | ✅ |
 | HQQ | ? | ? | ? | ? | ✅ |
 
 ❌ = not implemented
@@ -19,6 +17,4 @@
 
 \* Training LoRAs with GPTQ models also works with the Transformers loader. Make sure to check "auto-devices" and "disable_exllama" before loading the model.
 
-\*\* Requires the monkey-patch. The instructions can be found [here](https://github.com/oobabooga/text-generation-webui/wiki/08-%E2%80%90-Additional-Tips#using-loras-with-gptq-for-llama).
-
-\*\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
+\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
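As the table indicates, the Transformers loader can apply more than one LoRA at load time. A sketch of what that looks like on the command line (the model and LoRA directory names are placeholders for folders under `models/` and `loras/`):

```shell
# Load a Transformers model with two LoRAs stacked; names below are
# example placeholders. Per the footnote, multi-LoRA in PEFT does not
# work reliably in all cases.
python server.py --loader transformers --model llama-7b --lora alpaca-lora gpt4-lora
```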