Meta Releases AI Model with Different Test and Public Versions

Meta’s recently released AI model, Maverick, currently ranks second on LM Arena, a crowdsourced benchmark where human raters compare the outputs of different models side by side and vote for the one they prefer. Documentation indicates, however, that the version of Maverick used for LM Arena testing differs from the version available through Meta’s developer platform.
In its official announcement, Meta describes the Maverick version on LM Arena as an “experimental chat version,” and documentation on the official Llama website specifies that the LM Arena testing used “Llama 4 Maverick optimized for conversationality.”
LM Arena is one of several benchmarks for evaluating AI model performance, each with its own methodology and limitations. Tuning a model specifically for LM Arena, however, is not a practice AI companies have widely disclosed.
The difference between the test and release versions matters for developers evaluating model performance: when the version that was benchmarked is not the version that ships, the benchmark score becomes a weaker predictor of real-world behavior, and developers may need to run their own evaluations to determine how the model will perform in their intended applications, as sketched below.
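As a rough illustration of what such additional testing might look like, the minimal Python sketch below sends the same prompts to two model versions and records the outputs for side-by-side review. The two query functions are hypothetical placeholders, not Meta’s actual API; a developer would replace them with real calls to a hosted endpoint and to the downloadable model.

```python
# Minimal side-by-side evaluation sketch. Both query functions are
# placeholders standing in for whatever interfaces a developer actually
# uses (e.g., a hosted benchmark-tuned endpoint vs. local weights).

PROMPTS = [
    "Summarize the plot of Hamlet in two sentences.",
    "Explain what a hash table is to a beginner.",
]

def query_benchmark_version(prompt: str) -> str:
    """Hypothetical stand-in for the benchmark-tuned model."""
    return f"(benchmark-version reply to: {prompt})"

def query_public_version(prompt: str) -> str:
    """Hypothetical stand-in for the publicly released model."""
    return f"(public-version reply to: {prompt})"

for prompt in PROMPTS:
    a = query_benchmark_version(prompt)
    b = query_public_version(prompt)
    print(f"PROMPT: {prompt}")
    print(f"  benchmark version: {a}")
    print(f"  public version:    {b}")
```

Even a small paired-prompt comparison like this can surface stylistic gaps (tone, length, formatting) that a single leaderboard score hides.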
Researchers examining both versions have documented clear behavioral differences between the publicly downloadable Maverick and the LM Arena version: the LM Arena model uses emojis far more often and tends to give much longer answers.
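Differences like these are straightforward to quantify. The sketch below shows one way to measure emoji frequency and response length over paired outputs; the sample replies are invented for illustration, and the regex is a simplification covering common Unicode emoji ranges rather than an exhaustive definition of “emoji.”

```python
import re

# Rough emoji matcher over common Unicode emoji blocks; a simplification,
# not a complete emoji definition.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def response_stats(text: str) -> dict:
    """Return word count and emoji count for one model response."""
    return {
        "words": len(text.split()),
        "emojis": len(EMOJI_RE.findall(text)),
    }

# Invented sample outputs, purely for illustration.
arena_reply = "Great question! 😄 Here's a long, enthusiastic answer... 🎉"
public_reply = "A hash table maps keys to values in near-constant time."

print("arena: ", response_stats(arena_reply))
print("public:", response_stats(public_reply))
```

Run over a few hundred paired responses, simple counts like these would make the reported stylistic gap between the two versions directly measurable.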
Requests for clarification have been sent to both Meta and Chatbot Arena, the organization that maintains LM Arena.