This is llama.cpp's official SOTA 2-bit quantization of deepseek-llm-67b-chat, using an importance matrix (imatrix), down to as low as 16.95 GB in size — for the best overall bilingual model, scoring 73.8 on HumanEval!
It turns out that computing the imatrix, even with GPU acceleration, is very compute-intensive. Testing shows that the perplexity difference between an imatrix computed from Q6_K and one computed from Q2_K is negligible.
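For reference, a minimal sketch of how an imatrix is computed and applied with llama.cpp's own tools (`imatrix` and `quantize`, as the binaries were named at the time of writing); the calibration file and model paths are placeholders, not files shipped in this repo:

```sh
# Compute the importance matrix by running calibration text through a
# quantized model (here Q6_K); -ngl 99 offloads all layers to the GPU.
./imatrix -m deepseek-llm-67b-chat.Q6_K.gguf -f calibration.txt -o imatrix.dat -ngl 99

# Requantize to one of the SOTA 2-bit formats using that imatrix.
./quantize --imatrix imatrix.dat deepseek-llm-67b-chat.f16.gguf deepseek-llm-67b-chat.Q2_K_S.gguf Q2_K_S
```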
It is now possible to run Q2_K_S on a 24 GB VRAM GPU with a 2K context size, IQ2_XS on a 22 GB VRAM GPU, and IQ2_XXS on a 20 GB VRAM GPU.
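A sketch of running one of these quantizations fully offloaded with llama.cpp's `main` binary (file name and prompt are placeholders; `-c 2048` sets the 2K context, `-ngl 99` offloads all layers to VRAM):

```sh
./main -m deepseek-llm-67b-chat.Q2_K_S.gguf -c 2048 -ngl 99 \
    -p "Write a hello world program in Python."
```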
Clone with HTTP
git clone https://www.modelscope.cn/whatever1983/deepseek-llm-67b-chat-SOTA2bit-imatrix-GGUF.git