Gemma 2B - 10M Context

Gemma 2B with recurrent local attention, supporting a context length of up to 10M tokens. Our implementation keeps memory requirements manageable by using recurrent local attention blocks.
Quick Start
Install the model from Huggingface - Huggingface Model.

python main.py
Change the main.py inference code to the specific prompt you desire.

import torch
# Import paths below are assumptions; the repository may ship its own
# GemmaForCausalLM class and generate helper alongside main.py.
from transformers import AutoTokenizer, GemmaForCausalLM

model_path = "./models/gemma-2b-10m"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = GemmaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16
)
prompt_text = "Summarize this harry potter book..."
with torch.no_grad():
    # generate is the inference helper provided by this repo's main.py.
    generated_text = generate(
        model, tokenizer, prompt_text, max_length=512, temperature=0.8
    )
    print(generated_text)
How does this work?

The largest bottleneck (in terms of memory) for LLMs is the KV cache. It grows quadratically in vanilla multi-head attention, thus limiting the size of your sequence length.

Our approach splits the attention into local attention blocks, as outlined by InfiniAttention. We take those local attention blocks and apply recurrence to them, producing the final result of 10M-context global attention, as sketched below.

A lot of the inspiration for our ideas comes from the Transformer-XL paper.
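To make this concrete, here is a minimal, self-contained PyTorch sketch of block-wise local attention combined with a recurrent memory carried from block to block. It is an illustration of the general technique, not this repository's actual implementation: the function name recurrent_local_attention, the fixed 0.5 mixing gate, and the linear-attention style memory update are all assumptions made for the example.

import torch
import torch.nn.functional as F

def recurrent_local_attention(q, k, v, block_size=512):
    # q, k, v: (seq_len, d) single-head tensors, purely illustrative.
    seq_len, d = q.shape
    outputs = []
    # Fixed-size recurrent memory summarizing all previous blocks
    # (a linear-attention style state: a d x d matrix plus a normalizer).
    mem = torch.zeros(d, d)
    mem_norm = torch.zeros(d)

    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        qb, kb, vb = q[start:end], k[start:end], v[start:end]

        # 1) Ordinary softmax attention restricted to the current local block.
        scores = (qb @ kb.T) / d ** 0.5
        local_out = F.softmax(scores, dim=-1) @ vb

        # 2) Read the recurrent memory of everything seen in earlier blocks.
        sigma_q = F.elu(qb) + 1.0  # positive feature map
        mem_out = (sigma_q @ mem) / (sigma_q @ mem_norm + 1e-6).unsqueeze(-1)

        # 3) Mix the local and memory outputs (a fixed gate here; learned in practice).
        outputs.append(0.5 * local_out + 0.5 * mem_out)

        # 4) Fold this block's keys/values into the memory and move on, so only
        #    block_size keys/values are ever materialized at once.
        sigma_k = F.elu(kb) + 1.0
        mem = mem + sigma_k.T @ vb
        mem_norm = mem_norm + sigma_k.sum(dim=0)

    return torch.cat(outputs, dim=0)

# Usage: a sequence processed block by block with constant per-step memory.
q, k, v = (torch.randn(2048, 64) for _ in range(3))
out = recurrent_local_attention(q, k, v, block_size=512)
print(out.shape)  # torch.Size([2048, 64])

The key property is that only one block of keys and values plus a fixed-size memory state is held at any time, so memory does not grow with the full sequence length as it would with a standard KV cache.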
Credits

This was built by: