Hermann Hans Klie
779795266f
Update models.py
...
In def load_model(model_name, loader=None), register ktransformers as an available loader;
add def ktransformers_loader just before def unload_model(keep_model_name=False).
2025-10-24 08:53:23 +03:00
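A minimal sketch of the change this commit describes, assuming models.py dispatches loaders through a dict inside load_model() as it does for the existing backends; the ktransformers import, class name, and from_pretrained call below are placeholders, not the real ktransformers API.

```python
# Sketch only. The dispatch-dict pattern mirrors models.py; the ktransformers
# import and class are assumptions, not the real ktransformers API.

def ktransformers_loader(model_name):
    # Hypothetical loader, defined just before unload_model() as the commit notes.
    from ktransformers import KTransformersModel  # assumed import
    return KTransformersModel.from_pretrained(f"models/{model_name}")


def load_model(model_name, loader=None):
    load_func_map = {
        'ktransformers': ktransformers_loader,  # new entry this commit adds
        # ... existing loaders (Transformers, llama.cpp, ExLlamaV3, ...) omitted
    }
    loader = loader or 'ktransformers'
    return load_func_map[loader](model_name)
```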
oobabooga
7f06aec3a1
exllamav3: Implement the logits function for /v1/internal/logits
2025-10-09 11:24:25 -07:00
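For context on what a logits endpoint returns, a generic sketch in plain torch (not the exllamav3 code path from this commit): take the next-token logits, softmax them, and report the top-k candidates.

```python
# Generic illustration only; the project's /v1/internal/logits handler is not
# reproduced here.
import torch

def top_next_token_probs(logits: torch.Tensor, tokenizer, k: int = 10) -> dict:
    # logits: 1-D tensor of scores over the vocabulary for the next token
    probs = torch.softmax(logits.float(), dim=-1)
    values, indices = torch.topk(probs, k)
    return {
        tokenizer.decode([token_id]): float(p)
        for token_id, p in zip(indices.tolist(), values.tolist())
    }
```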
oobabooga
00ed878b05
Slightly more robust model loading
2025-09-02 10:16:26 -07:00
oobabooga
8028d88541
Lint
2025-08-30 21:29:20 -07:00
oobabooga
cb8780a4ce
Safer check for is_multimodal when loading models
...
Avoids unrelated multimodal error when a model fails to load due
to lack of memory.
2025-08-28 11:13:19 -07:00
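A hedged sketch of the kind of guard this describes: when loading fails (for example, out of memory) the model object may be None, so the multimodal check should not assume the attribute exists. The helper name is illustrative.

```python
def model_is_multimodal(model) -> bool:
    # Illustrative guard: a failed load leaves no model, so return False
    # instead of raising an unrelated multimodal error.
    return model is not None and getattr(model, 'is_multimodal', False)
```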
oobabooga
f247c2ae62
Make --model work with absolute paths, e.g. --model /tmp/gemma-3-270m-it-IQ4_NL.gguf
2025-08-22 11:47:33 -07:00
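Roughly what this behaviour amounts to; the helper name and models directory are assumptions.

```python
from pathlib import Path

def resolve_model_path(model_arg: str, models_dir: str = "models") -> Path:
    # Absolute paths (e.g. /tmp/gemma-3-270m-it-IQ4_NL.gguf) are used as-is;
    # bare names are resolved under the models directory.
    path = Path(model_arg)
    return path if path.is_absolute() else Path(models_dir) / model_arg
```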
oobabooga
9e7b326e34
Lint
2025-08-19 06:50:40 -07:00
oobabooga
7d23a55901
Fix model unloading when switching loaders (closes #7203)
2025-08-18 09:05:47 -07:00
altoiddealer
57f6e9af5a
Set multimodal status during Model Loading (#7199)
2025-08-13 16:47:27 -03:00
Katehuuh
88127f46c1
Add multimodal support (ExLlamaV3) (#7174)
2025-08-08 23:31:16 -03:00
oobabooga
ad6d0218ae
Fix after 219f0a7731
2025-06-01 19:27:14 -07:00
oobabooga
219f0a7731
Fix exllamav3_hf models failing to unload (closes #7031)
2025-05-30 12:05:49 -07:00
oobabooga
9ec46b8c44
Remove the HQQ loader (HQQ models can be loaded through Transformers)
2025-05-19 09:23:24 -07:00
oobabooga
5534d01da0
Estimate the VRAM for GGUF models + autoset gpu-layers (#6980)
2025-05-16 00:07:37 -03:00
oobabooga
d4b1e31c49
Use --ctx-size to specify the context size for all loaders
...
Old flags are still recognized as alternatives.
2025-04-25 16:59:03 -07:00
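One way the "old flags still recognized as alternatives" behaviour can be expressed with argparse; the alias names below are the historical loader-specific flags and the exact set is an assumption.

```python
import argparse

parser = argparse.ArgumentParser()
# A single destination, with the old loader-specific names kept as aliases.
parser.add_argument('--ctx-size', '--n_ctx', '--max_seq_len',
                    dest='ctx_size', type=int, default=None,
                    help='Context size for all loaders.')

args = parser.parse_args(['--n_ctx', '8192'])
print(args.ctx_size)  # 8192
```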
oobabooga
86c3ed3218
Small change to the unload_model() function
2025-04-20 20:00:56 -07:00
oobabooga
b3bf7a885d
Fix ExLlamaV2_HF and ExLlamaV3_HF after ae02ffc605
2025-04-20 11:32:48 -07:00
oobabooga
ae02ffc605
Refactor the transformers loader (#6859)
2025-04-20 13:33:47 -03:00
oobabooga
ae54d8faaa
New llama.cpp loader (#6846)
2025-04-18 09:59:37 -03:00
oobabooga
8b8d39ec4e
Add ExLlamaV3 support (#6832)
2025-04-09 00:07:08 -03:00
SamAcctX
f28f39792d
update deprecated deepspeed import for transformers 4.46+ (#6725)
2025-02-02 20:41:36 -03:00
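The shape of this kind of fix, hedged: transformers moved its DeepSpeed helpers into transformers.integrations.deepspeed, and the old transformers.deepspeed path is gone in 4.46+, so a version-tolerant import can fall back for older releases. The exact names models.py imports are an assumption.

```python
try:
    # new location; required on transformers 4.46+
    from transformers.integrations.deepspeed import (
        HfDeepSpeedConfig,
        is_deepspeed_zero3_enabled,
    )
except ImportError:
    # older transformers releases kept these under transformers.deepspeed
    from transformers.deepspeed import HfDeepSpeedConfig, is_deepspeed_zero3_enabled
```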
oobabooga
c08d87b78d
Make the huggingface loader more readable
2025-01-09 12:23:38 -08:00
oobabooga
7157257c3f
Remove the AutoGPTQ loader (#6641)
2025-01-08 19:28:56 -03:00
oobabooga
c0f600c887
Add a --torch-compile flag for transformers
2025-01-05 05:47:00 -08:00
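What a --torch-compile flag typically boils down to once the model is loaded; wiring it through the project's argument parser is assumed.

```python
import torch

def maybe_compile(model, torch_compile: bool):
    # torch.compile is available in PyTorch 2.0+; leave the model untouched otherwise.
    if torch_compile and hasattr(torch, 'compile'):
        model = torch.compile(model)
    return model
```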
Petr Korolev
13c033c745
Fix CUDA error on MPS backend during API request (#6572)
...
---------
Co-authored-by: oobabooga <oobabooga4@gmail.com>
2025-01-02 00:06:11 -03:00
oobabooga
7b88724711
Make responses start faster by removing unnecessary cleanup calls (#6625)
2025-01-01 18:33:38 -03:00
oobabooga
b92d7fd43e
Add warnings for when AutoGPTQ, TensorRT-LLM, or HQQ are missing
2024-09-28 20:30:24 -07:00
oobabooga
e926c03b3d
Add a --tokenizer-dir command-line flag for llamacpp_HF
2024-08-06 19:41:18 -07:00
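llamacpp_HF pairs a GGUF model with a Hugging Face tokenizer, so a --tokenizer-dir flag lets the tokenizer come from a separate directory. A hedged sketch; the helper name and fallback behaviour are assumptions.

```python
from pathlib import Path
from transformers import AutoTokenizer

def load_llamacpp_hf_tokenizer(model_dir: str, tokenizer_dir: str = None):
    # If --tokenizer-dir is given, load the tokenizer from there;
    # otherwise fall back to the model's own directory.
    source = Path(tokenizer_dir) if tokenizer_dir else Path(model_dir)
    return AutoTokenizer.from_pretrained(source)
```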
oobabooga
9dcff21da9
Remove unnecessary shared.previous_model_name variable
2024-07-28 18:35:11 -07:00
oobabooga
514fb2e451
Fix UI error caused by --idle-timeout
2024-07-28 18:30:06 -07:00
oobabooga
e6181e834a
Remove AutoAWQ as a standalone loader
...
(it works better through transformers)
2024-07-23 15:31:17 -07:00
oobabooga
8b44d7b12a
Lint
2024-07-04 20:16:44 -07:00
GralchemOz
8a39f579d8
transformers: Add eager attention option to make Gemma-2 work properly (#6188)
2024-07-01 12:08:08 -03:00
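The option this commit exposes maps onto the standard transformers keyword; a hedged example with an arbitrary Gemma-2 checkpoint.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",        # example checkpoint, not prescribed by the commit
    attn_implementation="eager",   # the attention backend Gemma-2 needs here
)
```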
oobabooga
577a8cd3ee
Add TensorRT-LLM support (#5715)
2024-06-24 02:30:03 -03:00
oobabooga
536f8d58d4
Do not expose alpha_value to llama.cpp & rope_freq_base to transformers
...
To avoid confusion
2024-06-23 22:09:24 -07:00
oobabooga
a36fa73071
Lint
2024-06-12 19:00:21 -07:00
oobabooga
bd7cc4234d
Backend cleanup (#6025)
2024-05-21 13:32:02 -03:00
oobabooga
9f77ed1b98
--idle-timeout flag to unload the model if unused for N minutes (#6026)
2024-05-19 23:29:39 -03:00
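A minimal sketch of an idle-timeout watchdog, assuming the application records a last-used timestamp each time the model serves a request; the names below (last_used, unload_model) are placeholders for the project's real state.

```python
import threading
import time

last_used = time.monotonic()  # placeholder: updated by the app on every request

def start_idle_watchdog(timeout_minutes: int, unload_model):
    def loop():
        while True:
            time.sleep(60)
            if time.monotonic() - last_used > timeout_minutes * 60:
                unload_model()
    threading.Thread(target=loop, daemon=True).start()
```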
Tisjwlf
907702c204
Fix gguf multipart file loading (#5857)
2024-05-19 20:22:09 -03:00
oobabooga
e9c9483171
Improve the logging messages while loading models
2024-05-03 08:10:44 -07:00
oobabooga
dfdb6fee22
Set llm_int8_enable_fp32_cpu_offload=True for --load-in-4bit
...
To allow for 32-bit CPU offloading (it's very slow).
2024-04-26 09:39:27 -07:00
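Roughly what the flag change amounts to when --load-in-4bit is used; BitsAndBytesConfig and both parameters are real transformers names, while the surrounding wiring is assumed.

```python
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # allow slow 32-bit CPU offload of overflow layers
)
# quant_config would then be passed to from_pretrained(..., quantization_config=quant_config)
```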
oobabooga
4094813f8d
Lint
2024-04-24 09:53:41 -07:00
Colin
f3c9103e04
Revert walrus operator for params['max_memory'] (#5878)
2024-04-24 01:09:14 -03:00
wangshuai09
fd4e46bce2
Add Ascend NPU support (basic) (#5541)
2024-04-11 18:42:20 -03:00
oobabooga
d02744282b
Minor logging change
2024-04-06 18:56:58 -07:00
oobabooga
1bdceea2d4
UI: Focus on the chat input after starting a new chat
2024-04-06 12:57:57 -07:00
oobabooga
1b87844928
Minor fix
2024-04-05 18:43:43 -07:00
oobabooga
6b7f7555fc
Logging message to make transformers loader a bit more transparent
2024-04-05 18:40:02 -07:00
oobabooga
308452b783
Bitsandbytes: load preconverted 4bit models without additional flags
2024-04-04 18:10:24 -07:00
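A hedged sketch of how a loader can tell that a checkpoint is already quantized and therefore needs no extra bitsandbytes flags: preconverted 4-bit models carry a quantization_config in their config.json. The helper name is hypothetical.

```python
from transformers import AutoConfig

def is_preconverted_quantized(model_path: str) -> bool:
    # Preconverted bitsandbytes checkpoints ship a quantization_config;
    # if it is present, the loader can skip building its own.
    config = AutoConfig.from_pretrained(model_path)
    return getattr(config, 'quantization_config', None) is not None
```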
oobabooga
d423021a48
Remove CTransformers support (#5807)
2024-04-04 20:23:58 -03:00