Towards Personalized Evaluation of Large Language Models with an Anonymous Crowd-Sourcing Platform

Mingyue Cheng, Hao Zhang, Jiqian Yang, Qi Liu, Li, Xin Huang, Liwei Song, Zhi Li, Zhenya Huang, Enhong Chen

WWW 2024 (2024)

Abstract

Evaluation plays a pivotal role in improving the capabilities of large language models. Numerous methods for evaluating large language models have been proposed in this area. Despite their effectiveness, these existing works mainly focus on assessing objective questions, overlooking the capability to evaluate subjective questions, which are extremely common for large language models. Additionally, these methods predominantly utilize centralized datasets for evaluation, with question banks concentrated within the evaluation platforms themselves. Moreover, the evaluation processes employed by these platforms often overlook personalized factors, neglecting to consider the individual characteristics of both the evaluators and the models being evaluated. To address these limitations, we propose a novel anonymous crowd-sourcing evaluation platform, BingJian, for large language models that employs a competitive scoring mechanism where users participate in ranking models based on their performance. This platform stands out not only for its support of centralized evaluations to assess the general capabilities of models but also for offering an open evaluation gateway. Through this gateway, users have the opportunity to submit their own questions, testing the models on a personalized and potentially broader range of capabilities. Furthermore, our platform introduces personalized evaluation scenarios, leveraging various forms of human-computer interaction to assess large language models in a manner that accounts for individual user preferences and contexts. The demonstration of BingJian can be accessed at https://github.com/Mingyue-Cheng/Bingjian.

Key words

Topic Modeling, User Modeling
