LLM Leaderboard Platform
- The Platform for LLM evals
- LLM Organization and Product
1. The Platform for LLM evals
1.1. LMSYS
Organization:
LMSYS and UC Berkeley SkyLab
Evaluation Method:
Chatbot Arena, a crowdsourced, randomized battle platform.
It evaluates LLMs by human preference in the real world.
Users ask any question to two anonymous models (e.g., ChatGPT, Gemini, Claude, Llama) and vote for the better one.
Evaluation Result:
Arena Elo
Elo, short for Elo rating system, is named after its inventor, Hungarian-American physicist Arpad Elo. It was originally developed for ranking chess players in the 1960s.
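As an illustration of how pairwise votes can turn into a ranking, here is a minimal sketch of the classic Elo update rule. The K-factor of 32, the starting rating of 1000, and the model names and votes are assumptions chosen for the example; the leaderboard's actual computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical votes: (model_a, model_b, score for model_a).
battles = [
    ("model-x", "model-y", 1.0),  # model-x preferred
    ("model-y", "model-z", 0.5),  # tie
    ("model-x", "model-z", 1.0),  # model-x preferred
]

ratings = {}
for a, b, score_a in battles:
    # Every model starts at an assumed initial rating of 1000.
    ra, rb = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
    ratings[a], ratings[b] = update_elo(ra, rb, score_a)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Each update is zero-sum: whatever one model gains, the other loses, so a win against a higher-rated opponent moves the ratings more than a win against a lower-rated one.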
Website:
https://chat.lmsys.org/
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
1.2. LiveBench
Organization:
Abacus.AI
Properties:
- LiveBench is designed to limit potential contamination by releasing new questions monthly and by basing questions on recently released datasets, arXiv papers, news articles, and IMDb movie synopses.
- Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
- LiveBench currently contains a set of 18 diverse tasks across 6 categories, and its maintainers plan to release new, harder tasks over time.
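To make the idea of automatic scoring against ground truth concrete, here is a minimal sketch that grades answers by normalized exact match, with no LLM judge involved. The question IDs, answers, and the normalization rule are hypothetical; LiveBench's real scoring functions are task-specific and are not reproduced here.

```python
from typing import Mapping


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.strip().lower().split())


def accuracy(predictions: Mapping[str, str], ground_truth: Mapping[str, str]) -> float:
    """Fraction of questions whose normalized prediction equals the ground truth.

    A question with no prediction counts as wrong.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(truth)
        for qid, truth in ground_truth.items()
    )
    return correct / len(ground_truth)


# Hypothetical data for illustration only.
ground_truth = {"q1": "42", "q2": "Paris", "q3": "x = 3"}
predictions = {"q1": " 42 ", "q2": "paris", "q3": "x = 4"}

result = accuracy(predictions, ground_truth)
print(f"accuracy = {result:.3f}")
```

Because the comparison is deterministic, the same predictions always receive the same score, which is the property that makes hard questions gradable without a judge model.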
Website:
https://livebench.ai/
1.3. Fine-tuning Index
The Fine-tuning Leaderboard compares the performance of GPT-4 with that of popular open-source models that Predibase fine-tuned across a series of tasks.
Remarkably, most of the fine-tuned open-source models surpass GPT-4, with Llama-3, Phi-3, and Zephyr demonstrating the strongest performance.
Website:
https://predibase.com/fine-tuning-index
1.4. SuperCLUE
A leaderboard for domestic (Chinese) LLMs.
Website:
https://www.superclueai.com/
2. LLM Organization and Product
Organization | Product | Open Source | Location |
---|---|---|---|
Foreign | | | |
OpenAI | GPT | Closed | US, UK |
Google | Gemini/Bard/Gemma/PaLM | Open | - |
Anthropic | Claude | Closed | US, UK |
Meta | Llama/Alpaca | Open | - |
Microsoft | Phi/WizardLM/Bing | Open | - |
Mistral | Mistral/Mixtral | Open | US, France |
HuggingFace | Zephyr | Open | - |
Cohere | Command R | Open | - |
NousResearch | Nous/OpenHermes | Open | - |
LMSYS | Vicuna/FastChat | - | - |
Reka AI | Reka | Open | US, UK, Singapore |
Nvidia | Nemotron/NV/ChipNeMo | Open | - |
Nexusflow | Starling | Open | Palo Alto, CA |
Databricks/MosaicML | DBRX/Dolly/MPT | Open | Many |
OpenChat | OpenChat | - | - |
Snowflake | Snowflake | Closed | - |
UC Berkeley | Starling/Koala/Gorilla | Closed | - |
Perplexity AI | pplx | Closed | - |
Cognitive Computations | Dolphin | Open | Personal |
Upstage AI | SOLAR | Open | Korea |
TII | Falcon | Open | Abu Dhabi, UAE |
Together AI | StripedHyena | Open | San Francisco |
Allen AI | Tulu/OLMo | Open | Seattle, WA, United States |
Nomic AI | GPT4All | Open | New York |
RWKV | RWKV | Open | - |
OpenAssistant | OpenAssistant | Open | - |
Stability AI | StableLM | Open | Canada |
Bloomberg | BloombergGPT | Closed | US, UK |
Inflection AI | Inflection | Closed | San Francisco Bay Area |
xAI (Elon Musk) | Grok | Closed | San Francisco Bay Area, California, U.S. |
Scale | Scale | Closed | San Francisco |
Character AI | Character | Closed | Menlo Park, CA |
Domestic (China) | | | |
Alibaba | Qwen | Open | Hangzhou |
Tsinghua/Zhipu AI | GLM/ChatGLM | Open | Beijing |
Baichuan | Baichuan | Open | Beijing |
ModelBest | CPM | Open | Beijing |
01 AI | Yi | Open | Beijing |
DeepSeek AI | DeepSeek | Open | Hangzhou |
Colossal AI | Colossal | Open | Beijing |
Moonshot | Moonshot | Closed | Beijing |
Step | Step | Closed | Shanghai |
MiniMax | ABAB | Closed | Shanghai |
Baidu | ERNIE | Closed | Beijing |
SenseTime | SenseChat | Closed | Shanghai |
ByteDance | Doubao/Coze | Closed | Beijing |
Tencent | Hunyuan | Closed | Shenzhen |
360 | 360gpt | Closed | Beijing |
XVERSE | XVERSE | Open | Shenzhen |