Towards Personalized Evaluation of Large Language Models with an Anonymous Crowd-Sourcing Platform
WWW 2024 (2024)
Abstract
Evaluating large language models plays a pivotal role in improving their capabilities. Numerous evaluation methods have been proposed in this area. Despite their effectiveness, existing works mainly focus on assessing objective questions and overlook subjective questions, which are extremely common for large language models. Additionally, these methods predominantly rely on centralized datasets, with question banks concentrated within the evaluation platforms themselves. Moreover, the evaluation processes employed by these platforms often overlook personalized factors, neglecting the individual characteristics of both the evaluators and the models being evaluated. To address these limitations, we propose BingJian, a novel anonymous crowd-sourcing evaluation platform for large language models that employs a competitive scoring mechanism in which users rank models based on their performance. The platform not only supports centralized evaluations to assess the general capabilities of models but also offers an open evaluation gateway. Through this gateway, users can submit their own questions, testing the models on a personalized and potentially broader range of capabilities. Furthermore, the platform introduces personalized evaluation scenarios, leveraging various forms of human-computer interaction to assess large language models in a manner that accounts for individual user preferences and contexts. The demonstration of BingJian can be accessed at https://github.com/Mingyue-Cheng/Bingjian.
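The abstract does not say which scoring rule lies behind the competitive mechanism. As an illustration only, the sketch below assumes an Elo-style update driven by anonymous pairwise votes. The function names, the initial rating of 1000, and the K-factor of 32 are hypothetical choices, not details taken from BingJian.

```python
# Illustrative sketch: Elo-style ratings from pairwise user votes.
# Assumption: the platform's actual scoring rule is not given in the
# abstract; Elo is swapped in here as one common competitive scheme.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a model rated r_a beats a model rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str,
               k: float = 32.0, initial: float = 1000.0) -> dict:
    """Apply one vote in which a user preferred `winner` over `loser`."""
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    gain = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + gain  # winner gains what the loser gives up
    ratings[loser] = r_l - gain
    return ratings


def rank_models(ratings: dict) -> list:
    """Return model names ordered from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
ratings = {}
for preferred, other in votes:
    update_elo(ratings, preferred, other)
print(rank_models(ratings))
```

Each vote moves the same number of points from the loser to the winner, so the total rating mass stays constant. An upset against a higher-rated model shifts more points than an expected win.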
Key words
Topic Modeling, User Modeling