Mirror of https://github.com/oobabooga/text-generation-webui.git, synced 2026-04-04 14:17:28 +00:00

Backend cleanup (#6025)

This commit is contained in:
parent 6a1682aa95
commit bd7cc4234d

23 changed files with 57 additions and 442 deletions
@@ -64,14 +64,6 @@ Loads: GPTQ models.
 * **no_use_cuda_fp16**: On some systems, the performance can be very bad with this unset. Can usually be ignored.
 * **desc_act**: For ancient models without proper metadata, sets the model "act-order" parameter manually. Can usually be ignored.
 
-### GPTQ-for-LLaMa
-
-Loads: GPTQ models.
-
-Ancient loader, the first one to implement 4-bit quantization. It works on older GPUs for which ExLlamaV2 and AutoGPTQ do not work, and it doesn't work with "act-order", so you should use it with simple 4-bit-128g models.
-
-* **pre_layer**: Used for CPU offloading. The higher the number, the more layers will be sent to the GPU. GPTQ-for-LLaMa CPU offloading was faster than the one implemented in AutoGPTQ the last time I checked.
-
 ### llama.cpp
 
 Loads: GGUF models. Note: GGML models have been deprecated and do not work anymore.
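The llama.cpp loader mentioned above takes a GGUF file directly. As a sketch of a minimal invocation (the model filename is a placeholder for any GGUF file under `models/`, and `--n-gpu-layers` controls how many layers are offloaded to the GPU):

```shell
# Load a GGUF model with the llama.cpp loader; the filename is an
# example placeholder. More --n-gpu-layers means more VRAM use and
# faster generation; 0 keeps everything on the CPU.
python server.py --loader llama.cpp --model llama-7b.Q4_K_M.gguf --n-gpu-layers 35
```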
@@ -13,28 +13,6 @@ Source: https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/1126
 
 This file will be automatically detected the next time you start the web UI.
 
-## Using LoRAs with GPTQ-for-LLaMa
-
-This requires using a monkey patch that is supported by this web UI: https://github.com/johnsmith0031/alpaca_lora_4bit
-
-To use it:
-
-Install alpaca_lora_4bit using pip
-
-```
-git clone https://github.com/johnsmith0031/alpaca_lora_4bit.git
-cd alpaca_lora_4bit
-git fetch origin winglian-setup_pip
-git checkout winglian-setup_pip
-pip install .
-```
-
-Start the UI with the --monkey-patch flag:
-
-```
-python server.py --model llama-7b-4bit-128g --listen --lora tloen_alpaca-lora-7b --monkey-patch
-```
-
 ## DeepSpeed
 
 `DeepSpeed ZeRO-3` is an alternative offloading strategy for full-precision (16-bit) transformers models.
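The DeepSpeed line above corresponds to launching the server through the `deepspeed` launcher instead of plain `python`. A hedged sketch, assuming a single GPU and `deepspeed` already installed via pip (the model name is a placeholder):

```shell
# Launch through the DeepSpeed launcher so ZeRO-3 offloading is active.
# Only applies to full-precision (16-bit) transformers models; the model
# name below is an example placeholder for a directory under models/.
deepspeed --num_gpus=1 server.py --deepspeed --model opt-1.3b
```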
@@ -2,15 +2,13 @@
 
 | Loader | Loading 1 LoRA | Loading 2 or more LoRAs | Training LoRAs | Multimodal extension | Perplexity evaluation |
 |----------------|----------------|-------------------------|----------------|----------------------|-----------------------|
-| Transformers | ✅ | ✅\*\*\* | ✅\* | ✅ | ✅ |
+| Transformers | ✅ | ✅\*\* | ✅\* | ✅ | ✅ |
 | llama.cpp | ❌ | ❌ | ❌ | ❌ | use llamacpp_HF |
 | llamacpp_HF | ❌ | ❌ | ❌ | ❌ | ✅ |
 | ExLlamav2_HF | ✅ | ✅ | ❌ | ❌ | ✅ |
 | ExLlamav2 | ✅ | ✅ | ❌ | ❌ | use ExLlamav2_HF |
 | AutoGPTQ | ✅ | ❌ | ❌ | ✅ | ✅ |
 | AutoAWQ | ? | ❌ | ? | ? | ✅ |
-| GPTQ-for-LLaMa | ✅\*\* | ✅\*\*\* | ✅ | ✅ | ✅ |
-| QuIP# | ? | ? | ? | ? | ✅ |
 | HQQ | ? | ? | ? | ? | ✅ |
 
 ❌ = not implemented
@@ -19,6 +17,4 @@
 
 \* Training LoRAs with GPTQ models also works with the Transformers loader. Make sure to check "auto-devices" and "disable_exllama" before loading the model.
 
-\*\* Requires the monkey-patch. The instructions can be found [here](https://github.com/oobabooga/text-generation-webui/wiki/08-%E2%80%90-Additional-Tips#using-loras-with-gptq-for-llama).
-
-\*\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
+\*\* Multi-LoRA in PEFT is tricky and the current implementation does not work reliably in all cases.
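As the table indicates, the Transformers loader can apply more than one LoRA at load time. A sketch of what that looks like on the command line (the model and LoRA directory names are placeholders for folders under `models/` and `loras/`):

```shell
# Load a Transformers model with two LoRAs stacked; names below are
# example placeholders. Per the footnote, multi-LoRA in PEFT does not
# work reliably in all cases.
python server.py --loader transformers --model llama-7b --lora alpaca-lora gpt4-lora
```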