GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
arXiv preprint, 2026
GISTBench is a benchmark for evaluating how well LLMs understand users from their interaction histories in recommendation systems. We propose two metric families, Interest Groundedness and Interest Specificity, and evaluate eight open-weight LLMs (7B to 120B parameters). The results reveal performance bottlenecks in counting and attributing engagement signals across heterogeneous interactions.