Leaderboard Toxicity in Spanish: Elo Rating Cycle 9

Leaderboard

Model	Accuracy	Precision	Recall	F1-Score	Elo-Score
GPT-4.5-preview (2025-02-27)*	0.929	0.940	0.917	0.928	1788
Athene-V2 (72B-L)	0.925	0.932	0.917	0.925	1741
Qwen 2.5 (72B-L)	0.924	0.932	0.915	0.923	1728
GPT-4o (2024-05-13)	0.921	0.905	0.941	0.923	1725
GPT-4o (2024-11-20)	0.921	0.922	0.920	0.921	1722
GPT-4 (0613)	0.920	0.927	0.912	0.919	1687
Grok Beta	0.916	0.906	0.928	0.917	1685
Pixtral Large (2411)	0.913	0.884	0.952	0.917	1683
OpenThinker (32B-L)	0.916	0.915	0.917	0.916	1681
Qwen 2.5 (14B-L)	0.915	0.904	0.928	0.916	1680
GPT-4 Turbo (2024-04-09)	0.912	0.880	0.955	0.916	1679
GPT-4o (2024-08-06)	0.913	0.895	0.936	0.915	1666
Qwen 2.5 (32B-L)	0.915	0.919	0.909	0.914	1665
Gemini 2.0 Flash	0.909	0.872	0.960	0.914	1664
Llama 3.1 (70B-L)	0.912	0.908	0.917	0.912	1664
o1 (2024-12-17)	0.911	0.895	0.931	0.912	1663
Nous Hermes 2 (11B-L)	0.912	0.912	0.912	0.912	1663
Gemini 1.5 Flash	0.909	0.889	0.936	0.912	1663
Grok 2 (1212)	0.900	0.864	0.949	0.905	1658
Gemini 1.5 Flash (8B)	0.905	0.909	0.901	0.905	1658
Gemini 1.5 Pro	0.900	0.859	0.957	0.905	1657
Falcon3 (10B-L)	0.904	0.891	0.920	0.906	1656
Exaone 3.5 (32B-L)	0.907	0.913	0.899	0.906	1656
Aya (35B-L)	0.908	0.925	0.888	0.906	1655
Gemini 2.0 Flash-Lite (02-05)	0.903	0.872	0.944	0.907	1655
Gemma 2 (27B-L)	0.905	0.892	0.923	0.907	1654
Llama 3.3 (70B-L)	0.904	0.880	0.936	0.907	1653
Llama 3.1 (405B)	0.904	0.880	0.936	0.907	1653
Aya Expanse (32B-L)	0.905	0.888	0.928	0.907	1652
Open Mixtral 8x22B	0.911	0.935	0.883	0.908	1652
Aya Expanse (8B-L)	0.905	0.876	0.944	0.909	1651
GLM-4 (9B-L)	0.911	0.925	0.893	0.909	1651
Nemotron (70B-L)	0.908	0.896	0.923	0.909	1650
DeepSeek-R1 (671B)	0.905	0.869	0.955	0.910	1650
Sailor2 (20B-L)	0.912	0.933	0.888	0.910	1650
Gemma 3 (27B-L)*	0.904	0.865	0.957	0.909	1650
DeepSeek-V3 (671B)	0.913	0.948	0.875	0.910	1650
GPT-4o mini (2024-07-18)	0.908	0.884	0.939	0.911	1649
Phi-4 (14B-L)	0.901	0.899	0.904	0.902	1645
Qwen 2.5 (7B-L)	0.900	0.887	0.917	0.902	1644
Hermes 3 (70B-L)	0.905	0.937	0.869	0.902	1643
Gemma 3 (12B-L)*	0.899	0.866	0.944	0.903	1642
Mistral Large (2411)	0.896	0.863	0.941	0.901	1629
o1-preview (2024-09-12)+	0.800	0.731	0.991	0.841	1622
o3-mini (2025-01-31)	0.896	0.886	0.909	0.897	1615
DeepSeek-R1 D-Qwen (14B-L)	0.897	0.896	0.899	0.897	1614
o1-mini (2024-09-12)	0.895	0.878	0.917	0.897	1599
Mistral NeMo (12B-L)	0.891	0.873	0.915	0.893	1586
Command R7B Arabic (7B-L)*	0.897	0.926	0.864	0.894	1584
GPT-3.5 Turbo (0125)	0.875	0.822	0.957	0.884	1564
QwQ (32B-L)	0.892	0.940	0.837	0.886	1562
Gemma 2 (9B-L)	0.876	0.818	0.968	0.886	1560
Mistral (7B-L)	0.891	0.897	0.883	0.890	1558
Llama 3.1 (8B-L)	0.889	0.878	0.904	0.891	1556
Marco-o1-CoT (7B-L)	0.888	0.866	0.917	0.891	1554
Phi-4-mini (3.8B-L)*	0.884	0.891	0.875	0.883	1553
Mistral Small (22B-L)	0.871	0.806	0.976	0.883	1551
OpenThinker (7B-L)	0.872	0.812	0.968	0.883	1549
Tülu3 (8B-L)	0.881	0.893	0.867	0.880	1541
Tülu3 (70B-L)	0.891	0.962	0.813	0.882	1538
OLMo 2 (7B-L)	0.871	0.868	0.875	0.871	1527
OLMo 2 (13B-L)	0.867	0.804	0.971	0.879	1525
Llama 3.2 (3B-L)	0.876	0.885	0.864	0.874	1525
Claude 3.5 Haiku (20241022)	0.885	0.947	0.816	0.877	1522
Pixtral-12B (2409)	0.865	0.804	0.965	0.878	1519
Mistral Saba*	0.867	0.810	0.957	0.878	1516
Orca 2 (7B-L)	0.876	0.910	0.835	0.871	1514
Claude 3.7 Sonnet (20250219)*	0.887	0.950	0.816	0.878	1513
DeepSeek-R1 D-Llama (8B-L)	0.865	0.837	0.907	0.871	1511
Claude 3.5 Sonnet (20241022)	0.887	0.950	0.816	0.878	1510
Yi 1.5 (9B-L)	0.859	0.826	0.909	0.865	1496
Granite 3.1 (8B-L)	0.869	0.921	0.808	0.861	1495
Yi Large	0.871	0.979	0.757	0.854	1449
Nous Hermes 2 Mixtral (47B-L)	0.867	0.963	0.763	0.851	1438
Mistral OpenOrca (7B-L)	0.863	0.939	0.776	0.850	1434
Gemma 3 (4B-L)*	0.820	0.742	0.981	0.845	1428
Ministral-8B (2410)	0.823	0.744	0.984	0.847	1427
Exaone 3.5 (8B-L)	0.853	0.913	0.781	0.842	1416
Codestral Mamba (7B)	0.827	0.774	0.923	0.842	1415
Dolphin 3.0 (8B-L)	0.807	0.731	0.971	0.834	1386
Granite 3.2 (8B-L)*	0.849	0.940	0.747	0.832	1359
Yi 1.5 (34B-L)	0.849	0.955	0.733	0.830	1333
Solar Pro (22B-L)	0.844	0.916	0.757	0.829	1304
Hermes 3 (8B-L)	0.840	0.932	0.733	0.821	1244
DeepSeek-R1 D-Qwen (7B-L)	0.821	0.832	0.805	0.818	1228
Nemotron-Mini (4B-L)	0.771	0.696	0.963	0.808	1197
Phi-3 Medium (14B-L)	0.815	0.940	0.672	0.784	1089
Yi 1.5 (6B-L)	0.807	0.908	0.683	0.779	1056
DeepScaleR (1.5B-L)*	0.620	0.688	0.440	0.537	891
Perspective 0.55	0.768	0.986	0.544	0.701	880
Granite 3 MoE (3B-L)	0.747	0.894	0.560	0.689	877
DeepSeek-R1 D-Qwen (1.5B-L)	0.633	0.685	0.493	0.574	800
Perspective 0.60	0.731	0.989	0.467	0.634	778
Perspective 0.70	0.665	1.000	0.331	0.497	747
Perspective 0.80	0.609	1.000	0.219	0.359	665
Granite 3.1 MoE (3B-L)	0.555	1.000	0.109	0.197	577

Task Description

In this cycle, we used a balanced sample of 5000 messages for toxic-language detection in Spanish, split 70/15/15 into training, validation, and test sets to allow for potential fine-tuning jobs.
The sample corresponds to ground-truth CLANDESTINO data prepared for CLEF TextDetox 2024.
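As a rough illustration of the 70/15/15 split described above, the following sketch shuffles a list of items and cuts it into three parts. The function name `split_70_15_15` and the fixed seed are assumptions made for this example; the source does not say how the split was produced or whether it was stratified by label.

```python
import random


def split_70_15_15(items, seed=0):
    """Shuffle items and split them into train/validation/test at 70/15/15.

    Hypothetical helper: a plain random split, not necessarily the
    procedure used to build the leaderboard data.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_70_15_15(range(5000))
print(len(train), len(val), len(test))  # 3500 750 750
```

With 5000 messages, this yields 3500 for training and 750 each for validation and testing.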
The task was zero-shot toxicity classification using Google’s and Jigsaw’s core definitions of incivility and toxicity. The temperature was set to zero, and the performance metrics were averaged for binary classification. For the Gemini 1.5 models, the temperature was left at its default value.
It is important to note that QwQ, Marco-o1-CoT, o1-preview, o1-mini, DeepSeek-R1, o3-mini, and o1 incorporated internal reasoning steps. For the OpenAI reasoning models and GPT-4.5-preview, the temperature was left at its default value.
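The four metric columns in the leaderboard can be derived from a binary confusion matrix with the toxic class as the positive label. The sketch below shows one way to do that. The function `binary_metrics` and the counts passed to it are hypothetical: the counts assume a balanced 750-message test set (15% of 5000) and were picked by hand, not taken from the source.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (toxic) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 375 toxic and 375 non-toxic messages,
# with no false positives and 251 missed toxic messages.
m = binary_metrics(tp=124, fp=0, fn=251, tn=375)
print({name: round(value, 3) for name, value in m.items()})
# {'accuracy': 0.665, 'precision': 1.0, 'recall': 0.331, 'f1': 0.497}
```

The example shows the pattern seen in the lower rows of the table: a classifier that never raises a false alarm reaches a precision of 1.000, but its low recall pulls both accuracy and F1 down.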
After the billions of parameters in parentheses, the uppercase L indicates that the model was deployed locally. In this cycle, Ollama v0.6.4 and the Python Ollama, OpenAI, Anthropic, GenerativeAI, and MistralAI dependencies were used.
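The naming convention above can be read mechanically from a model label. The following sketch is a hypothetical parser, not part of the leaderboard tooling: the function name `parse_model_label` and the regular expression are assumptions based only on the labels shown in the table.

```python
import re

# Matches "(72B-L)", "(3.8B-L)" or "(671B)"; date-like tags such as
# "(2024-05-13)" or "(0613)" do not match because they lack the "B".
_SIZE = re.compile(r"\((\d+(?:\.\d+)?)B(-L)?\)")


def parse_model_label(label):
    """Return the parameter count in billions and the local-deployment flag."""
    match = _SIZE.search(label)
    if match is None:
        return {"params_billions": None, "local": False}
    return {
        "params_billions": float(match.group(1)),
        "local": match.group(2) is not None,
    }


print(parse_model_label("Athene-V2 (72B-L)"))
print(parse_model_label("DeepSeek-R1 (671B)"))
print(parse_model_label("GPT-4o (2024-05-13)"))
```

Labels without a parenthesised parameter count, such as dated API models, come back with `params_billions` set to `None` and `local` set to `False`.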
Rookie models in this cycle are marked with an asterisk.
The plus symbol indicates that the model is inactive because it was not tested in this cycle. In these cases, we follow a Keep the Last Known Elo-Score policy.
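To make the Keep the Last Known Elo-Score policy concrete, here is a minimal sketch of one rating cycle using the standard Elo formulas. Everything beyond those formulas is an assumption: the source does not state the K-factor, the starting rating for rookie models, or how matches are defined. This example assumes K = 32, a starting rating of 1500, and pairwise matches decided by comparing F1-Scores. The names `expected_score`, `update_pair`, and `run_cycle` are hypothetical.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_pair(r_a, r_b, score_a, k=32):
    """Return updated ratings for A and B after one match (score_a in {0, 0.5, 1})."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b


def run_cycle(ratings, f1_scores, k=32, initial=1500):
    """Update ratings for models tested this cycle.

    ratings:   model -> Elo-Score from the previous cycle.
    f1_scores: model -> F1-Score for models tested in this cycle only.
    Models absent from f1_scores are inactive and keep their last known rating.
    k and initial are assumed values, not taken from the source.
    """
    updated = dict(ratings)
    active = sorted(f1_scores)
    for model in active:
        updated.setdefault(model, initial)  # rookies start at the assumed rating
    for i, a in enumerate(active):
        for b in active[i + 1:]:
            if f1_scores[a] > f1_scores[b]:
                score_a = 1.0
            elif f1_scores[a] < f1_scores[b]:
                score_a = 0.0
            else:
                score_a = 0.5
            updated[a], updated[b] = update_pair(updated[a], updated[b], score_a, k)
    return updated


previous = {"A": 1500, "B": 1500, "C": 1622}  # C is inactive this cycle
result = run_cycle(previous, {"A": 0.90, "B": 0.80})
print(result)  # {'A': 1516.0, 'B': 1484.0, 'C': 1622}
```

In this toy run, model C is not tested, so its rating carries over unchanged, while A and B exchange rating points according to the match outcome.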