Backend cleanup (#6025)

oobabooga 2024-05-21 13:32:02 -03:00 committed by GitHub
parent 6a1682aa95
commit bd7cc4234d
GPG key ID: B5690EEEBB952194
23 changed files with 57 additions and 442 deletions

@@ -64,14 +64,6 @@ Loads: GPTQ models.
* **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.
* **desc_act**: For ancient models without proper metadata, sets the model "act-order" parameter manually. Can usually be ignored.
### GPTQ-for-LLaMa
Loads: GPTQ models.
Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlamaV2 and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
* **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.
### llama.cpp
Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.

@@ -13,28 +13,6 @@ Source: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/1126
This file will be automatically detected the next time you start the web UI.
## Using LoRAs with GPTQ-for-LLaMa
This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
To use it:
Install alpaca_lora_4bit using pip
```
git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
cd alpaca_lora_4bit
git fetch origin winglian-setup_pip
git checkout winglian-setup_pip
pip install .
```
Start the UI with the --monkey-patch flag:
```
python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
```
## DeepSpeed
`DeepSpeed ZeRO-3` is an alternative offloading strategy for full-precision (16-bit) transformers models.

@@ -2,15 +2,13 @@
| Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
|----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
| Transformers | ✅ | ✅\*\*\* | ✅\* | ✅ | ✅ |
| Transformers | ✅ | ✅\*\* | ✅\* | ✅ | ✅ |
| llama.cpp | ❌ | ❌ | ❌ | ❌ | use llamacpp_HF |
| llamacpp_HF | ❌ | ❌ | ❌ | ❌ | ✅ |
| ExLlamav2_HF | ✅ | ✅ | ❌ | ❌ | ✅ |
| ExLlamav2 | ✅ | ✅ | ❌ | ❌ | use ExLlamav2_HF |
| AutoGPTQ | ✅ | ❌ | ❌ | ✅ | ✅ |
| AutoAWQ | ? | ❌ | ? | ? | ✅ |
| GPTQ-for-LLaMa | ✅\*\* | ✅\*\*\* | ✅ | ✅ | ✅ |
| QuIP# | ? | ? | ? | ? | ✅ |
| HQQ | ? | ? | ? | ? | ✅ |
❌ = not implemented
@@ -19,6 +17,4 @@
\* Training LoRAs with GPTQ models also works with the Transformers loader. Make sure to check "auto-devices" and "disable_exllama" before loading the model.
\*\* Requires the monkey-patch. The instructions can be found [here](https://github.com/oobabooga/text-generation-webui/wiki/08-%E2%80%90-Additional-Tips#using-loras-with-gptq-for-llama).
\*\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.