A significantly faster implementation of my novel 'needle in a haystack' methodology for SLMs.
SLMs struggle to respond effectively in a chat-modelling setting because they cannot properly utilise longer context windows. To address this, I propose two key changes to how logits are sampled for Chat SLMs (both are sketched in code below the list):
- All tokens the user has not yet written should be masked to negative infinity, so they can never be sampled
- The agent's response should be generated by sampling the highest logit across all batches of previous messages
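A minimal sketch of the first change, assuming a Hugging Face-style tokenizer and raw PyTorch logits; the names `build_allowed_ids` and `mask_unseen_tokens` are illustrative, not part of this repo's API:

```python
import torch

def build_allowed_ids(chat_history: list[str], tokenizer) -> torch.Tensor:
    """Collect every token id that has appeared in the user's messages so far."""
    allowed = set()
    for message in chat_history:
        allowed.update(tokenizer.encode(message, add_special_tokens=False))
    return torch.tensor(sorted(allowed), dtype=torch.long)

def mask_unseen_tokens(logits: torch.Tensor, allowed_ids: torch.Tensor) -> torch.Tensor:
    """Set every token the user has never written to -inf so it cannot be sampled."""
    masked = torch.full_like(logits, float("-inf"))
    masked[..., allowed_ids] = logits[..., allowed_ids]
    return masked
```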
Together, these changes create an interesting interaction experience: the user acts as the sole source of vocabulary, so the agent evolves to speak in a similar way. In addition, the model considers each previous message in its own concentrated, small context window, which ensures all context can be attended to properly by the SLM, allowing it to consistently remember birthdays, events, and so on. A minimal sketch of this batched sampling follows.
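The sketch below reuses `mask_unseen_tokens` from above. It assumes a causal LM whose forward pass returns `.logits` of shape `(batch, seq_len, vocab)`; for readability the per-message contexts are looped over rather than padded into one batch, and `generate_reply` / `max_new_tokens` are illustrative names:

```python
@torch.no_grad()
def generate_reply(model, tokenizer, chat_history: list[str],
                   allowed_ids: torch.Tensor, max_new_tokens: int = 25) -> str:
    # One short context per previous message, so each fits comfortably in the SLM's window.
    contexts = [tokenizer.encode(m, add_special_tokens=False) for m in chat_history]
    reply_ids: list[int] = []
    for _ in range(max_new_tokens):
        best_logit, best_id = float("-inf"), None
        for ctx in contexts:
            ids = torch.tensor([ctx + reply_ids])
            next_logits = model(ids).logits[0, -1]          # next-token logits for this context
            next_logits = mask_unseen_tokens(next_logits, allowed_ids)
            top_logit, top_id = next_logits.max(dim=-1)
            if top_logit.item() > best_logit:               # keep the single highest logit across all contexts
                best_logit, best_id = top_logit.item(), top_id.item()
        reply_ids.append(best_id)
    return tokenizer.decode(reply_ids)
```

Padding the contexts into one true batch and taking the max over the batch dimension would recover the batched speedup reported in the table below; the loop here is only for clarity.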
| Aspect | Batched Multi-Contextual Token Sampling | Linear Multi-Contextual Token Sampling |
|---|---|---|
| Tokens considered | 30K | 20K |
| Response length | 25 tokens | 25 tokens |
| Time constraint | 10 seconds | 10 seconds |
| Performance | 50% more tokens considered within the time constraint | Baseline |
| Hardware | RTX 3090 (24 GB) | RTX 3090 (24 GB) |