LLM Leaderboard Platform
- The Platform for LLM evals
- LLM Organization and Product
1. The Platform for LLM evals
1.1. LMSYS
Organization:
LMSYS and UC Berkeley SkyLab
Evaluation method:
Chatbot Arena - a crowdsourced, randomized battle platform.
Evaluates LLMs by human preference in the real world.
Ask any question to two anonymous models (e.g., ChatGPT, Gemini, Claude, Llama) and vote for the better one!
Evaluation result:
Arena Elo
Elo, short for Elo rating system, is named after its inventor, Hungarian-American physicist Arpad Elo. It was originally developed for ranking chess players in the 1960s.
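To make the rating idea concrete, here is a minimal sketch of the classic Elo update for a single pairwise battle. The scale factor of 400 and the K-factor of 32 are the conventional chess defaults and are assumptions here; the function names are hypothetical, and the leaderboard's actual computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two models start level; a voter prefers model A.
    print(update_elo(1000.0, 1000.0, 1.0))
```

An upset moves the ratings more than an expected result: a win for the lower-rated model produces a larger `score_a - expected_a` gap, and therefore a larger adjustment.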
Website:
https://chat.lmsys.org/
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
1.2. LiveBench
Organization:
Abacus.AI
Properties:
- LiveBench is designed to limit potential contamination by releasing new questions monthly and by basing questions on recently released datasets, arXiv papers, news articles, and IMDb movie synopses.
- Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
- LiveBench currently contains a set of 18 diverse tasks across 6 categories, with new, harder tasks to be released over time.
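The idea of scoring against objective ground truth, without an LLM judge, can be sketched as a simple exact-match scorer. This is only an illustration: the function names, the question IDs, and the normalization rule are hypothetical, and LiveBench's real scoring is task-specific and likely more involved.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(answer.strip().lower().split())


def score_answers(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Return the fraction of questions whose prediction matches the ground truth exactly.

    Questions missing from `predictions` are counted as wrong.
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one question")
    correct = sum(
        1
        for question_id, truth in ground_truth.items()
        if normalize(predictions.get(question_id, "")) == normalize(truth)
    )
    return correct / len(ground_truth)


if __name__ == "__main__":
    truth = {"q1": "paris", "q2": "42"}          # hypothetical ground-truth answers
    model_output = {"q1": " Paris ", "q2": "41"}  # hypothetical model answers
    print(score_answers(model_output, truth))
```

Because the comparison is deterministic, the same model output always receives the same score, which is the property that makes automatic scoring of hard questions possible.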
Website:
https://livebench.ai/
1.3. Fine-tuning Index
The Fine-tuning Leaderboard compares the performance of GPT-4 and popular open-source models that Predibase fine-tuned across a series of tasks.
Remarkably, most of the fine-tuned open-source models surpass GPT-4, with Llama-3, Phi-3, and Zephyr demonstrating the strongest performance.
Website:
https://predibase.com/fine-tuning-index
1.4. SuperCLUE
A leaderboard for Chinese (domestic) LLMs.
Website:
https://www.superclueai.com/
2. LLM Organization and Product
| Organization | Product | Open Source | Location |
|---|---|---|---|
| Foreign | | | |
| OpenAI | GPT | Closed | US, UK |
| Google | Gemini/Bard/Gemma/PaLM | Open | - |
| Anthropic | Claude | Closed | US, UK |
| Meta | Llama/Alpaca | Open | - |
| Microsoft | Phi/WizardLM/Bing | Open | - |
| Mistral | Mistral/Mixtral | Open | US, France |
| HuggingFace | Zephyr | Open | - |
| Cohere | Command R | Open | - |
| NousResearch | Nous/OpenHermes | Open | - |
| LMSYS | Vicuna/FastChat | - | - |
| Reka AI | Reka | Open | US, UK, Singapore |
| Nvidia | Nemotron/NV/ChipNeMo | Open | - |
| Nexusflow | Starling | Open | Palo Alto, CA |
| Databricks/MosaicML | DBRX/Dolly/MPT | Open | Many |
| OpenChat | OpenChat | - | - |
| Snowflake | Snowflake | Closed | - |
| UC Berkeley | Starling/Koala/Gorilla | Closed | - |
| Perplexity AI | pplx | Closed | - |
| Cognitive Computations | Dolphin | Open | Personal |
| Upstage AI | SOLAR | Open | Korea |
| TII | Falcon | Open | Abu Dhabi, UAE |
| Together AI | StripedHyena | Open | San Francisco |
| Allen AI | Tulu/OLMo | Open | Seattle, WA, United States |
| Nomic AI | GPT4All | Open | New York |
| RWKV | RWKV | Open | - |
| OpenAssistant | OpenAssistant | Open | - |
| Stability AI | StableLM | Open | Canada |
| Bloomberg | BloombergGPT | Closed | US, UK |
| inflection.ai | Inflection | Closed | San Francisco Bay Area |
| xAI (Elon Musk) | Grok | Closed | San Francisco Bay Area, California, U.S. |
| Scale | Scale | Closed | San Francisco |
| Character AI | Character | Closed | Menlo Park, CA |
| Domestic | | | |
| Alibaba | Qwen | Open | Hangzhou |
| Tsinghua/Zhipu AI | GLM/ChatGLM | Open | Beijing |
| Baichuan | Baichuan | Open | Beijing |
| ModelBest | CPM | Open | Beijing |
| 01 AI | Yi | Open | Beijing |
| DeepSeek AI | DeepSeek | Open | Hangzhou |
| Colossal AI | Colossal | Open | Beijing |
| Moonshot | Moonshot | Closed | Beijing |
| Step | Step | Closed | Shanghai |
| MiniMax | ABAB | Closed | Shanghai |
| Baidu | ERNIE | Closed | Beijing |
| SenseTime | SenseChat | Closed | Shanghai |
| Bytedance | Doubao/Coze | Closed | Beijing |
| Tencent | Hunyuan | Closed | Shenzhen |
| 360 | 360gpt | Closed | Beijing |
| XVERSE | XVERSE | Open | Shenzhen |
