Leaderboard Toxicity in Spanish: Elo Rating Cycle 8

Leaderboard

Model	Accuracy	Precision	Recall	F1-Score	Elo-Score
Athene-V2 (72B-L)	0.925	0.932	0.917	0.925	1743
Qwen 2.5 (72B-L)	0.924	0.932	0.915	0.923	1729
GPT-4o (2024-05-13)	0.921	0.905	0.941	0.923	1726
GPT-4o (2024-11-20)	0.921	0.922	0.920	0.921	1723
GPT-4 (0613)	0.920	0.927	0.912	0.919	1686
Grok Beta	0.916	0.906	0.928	0.917	1684
Pixtral Large (2411)	0.913	0.884	0.952	0.917	1682
Qwen 2.5 (14B-L)	0.915	0.904	0.928	0.916	1679
OpenThinker (32B-L)*	0.916	0.915	0.917	0.916	1678
GPT-4 Turbo (2024-04-09)	0.912	0.880	0.955	0.916	1678
GPT-4o (2024-08-06)	0.913	0.895	0.936	0.915	1664
Qwen 2.5 (32B-L)	0.915	0.919	0.909	0.914	1663
Llama 3.1 (70B-L)	0.912	0.908	0.917	0.912	1661
Nous Hermes 2 (11B-L)	0.912	0.912	0.912	0.912	1660
Gemini 1.5 Flash	0.909	0.889	0.936	0.912	1660
Gemini 2.0 Flash*	0.909	0.872	0.960	0.914	1659
o1 (2024-12-17)*	0.911	0.895	0.931	0.912	1658
Grok 2 (1212)	0.900	0.864	0.949	0.905	1651
Gemini 1.5 Flash (8B)	0.905	0.909	0.901	0.905	1651
Gemini 1.5 Pro	0.900	0.859	0.957	0.905	1650
Falcon3 (10B-L)	0.904	0.891	0.920	0.906	1650
Exaone 3.5 (32B-L)	0.907	0.913	0.899	0.906	1649
Aya (35B-L)	0.908	0.925	0.888	0.906	1649
Gemma 2 (27B-L)	0.905	0.892	0.923	0.907	1649
Llama 3.1 (405B)	0.904	0.880	0.936	0.907	1648
Llama 3.3 (70B-L)	0.904	0.880	0.936	0.907	1648
Aya Expanse (32B-L)	0.905	0.888	0.928	0.907	1647
Open Mixtral 8x22B	0.911	0.935	0.883	0.908	1647
Aya Expanse (8B-L)	0.905	0.876	0.944	0.909	1646
Gemini 2.0 Flash-Lite (02-05)*	0.903	0.872	0.944	0.907	1646
GLM-4 (9B-L)	0.911	0.925	0.893	0.909	1646
Nemotron (70B-L)	0.908	0.896	0.923	0.909	1646
DeepSeek-R1 (671B)	0.905	0.869	0.955	0.910	1645
Sailor2 (20B-L)	0.912	0.933	0.888	0.910	1645
DeepSeek-V3 (671B)	0.913	0.948	0.875	0.910	1645
GPT-4o mini (2024-07-18)	0.908	0.884	0.939	0.911	1645
Qwen 2.5 (7B-L)	0.900	0.887	0.917	0.902	1636
Hermes 3 (70B-L)	0.905	0.937	0.869	0.902	1636
Phi-4 (14B-L)*	0.901	0.899	0.904	0.902	1634
o1-preview (2024-09-12)+	0.800	0.731	0.991	0.841	1622
Mistral Large (2411)	0.896	0.863	0.941	0.901	1621
o3-mini (2025-01-31)*	0.896	0.886	0.909	0.897	1603
DeepSeek-R1 D-Qwen (14B-L)*	0.897	0.896	0.899	0.897	1603
Mistral NeMo (12B-L)	0.891	0.873	0.915	0.893	1589
GPT-3.5 Turbo (0125)	0.875	0.822	0.957	0.884	1564
QwQ (32B-L)	0.892	0.940	0.837	0.886	1562
Gemma 2 (9B-L)	0.876	0.818	0.968	0.886	1561
Mistral (7B-L)	0.891	0.897	0.883	0.890	1559
Llama 3.1 (8B-L)	0.889	0.878	0.904	0.891	1558
Marco-o1-CoT (7B-L)	0.888	0.866	0.917	0.891	1556
Tülu3 (8B-L)	0.881	0.893	0.867	0.880	1554
Tülu3 (70B-L)	0.891	0.962	0.813	0.882	1552
Mistral Small (22B-L)	0.871	0.806	0.976	0.883	1550
OpenThinker (7B-L)*	0.872	0.812	0.968	0.883	1547
OLMo 2 (13B-L)*	0.867	0.804	0.971	0.879	1537
OLMo 2 (7B-L)*	0.871	0.868	0.875	0.871	1530
Llama 3.2 (3B-L)	0.876	0.885	0.864	0.874	1529
Claude 3.5 Haiku (20241022)	0.885	0.947	0.816	0.877	1527
Pixtral-12B (2409)	0.865	0.804	0.965	0.878	1525
Claude 3.5 Sonnet (20241022)	0.887	0.950	0.816	0.878	1522
Orca 2 (7B-L)	0.876	0.910	0.835	0.871	1516
DeepSeek-R1 D-Llama (8B-L)*	0.865	0.837	0.907	0.871	1514
Yi 1.5 (9B-L)	0.859	0.826	0.909	0.865	1497
Granite 3.1 (8B-L)	0.869	0.921	0.808	0.861	1493
o1-mini (2024-09-12)+	0.731	0.667	0.991	0.797	1471
Yi Large	0.871	0.979	0.757	0.854	1431
Nous Hermes 2 Mixtral (47B-L)	0.867	0.963	0.763	0.851	1419
Mistral OpenOrca (7B-L)	0.863	0.939	0.776	0.850	1413
Ministral-8B (2410)	0.823	0.744	0.984	0.847	1404
Exaone 3.5 (8B-L)	0.853	0.913	0.781	0.842	1398
Codestral Mamba (7B)	0.827	0.774	0.923	0.842	1397
Dolphin 3.0 (8B-L)*	0.807	0.731	0.971	0.834	1372
Yi 1.5 (34B-L)	0.849	0.955	0.733	0.830	1307
Solar Pro (22B-L)	0.844	0.916	0.757	0.829	1276
DeepSeek-R1 D-Qwen (7B-L)*	0.821	0.832	0.805	0.818	1241
Hermes 3 (8B-L)	0.840	0.932	0.733	0.821	1226
Nemotron-Mini (4B-L)	0.771	0.696	0.963	0.808	1178
Phi-3 Medium (14B-L)	0.815	0.940	0.672	0.784	1077
Yi 1.5 (6B-L)	0.807	0.908	0.683	0.779	1050
DeepSeek-R1 D-Qwen (1.5B-L)*	0.633	0.685	0.493	0.574	915
Granite 3 MoE (3B-L)	0.747	0.894	0.560	0.689	860
Perspective 0.55	0.768	0.986	0.544	0.701	856
Perspective 0.70+	0.665	1.000	0.331	0.497	851
Perspective 0.80+	0.609	1.000	0.219	0.359	786
Perspective 0.60	0.731	0.989	0.467	0.634	748
Granite 3.1 MoE (3B-L)	0.555	1.000	0.109	0.197	689

Task Description

In this cycle, we used a balanced sample of 5000 messages for toxic-language detection in Spanish, split 70/15/15 into training, validation, and test sets to allow for potential fine-tuning jobs.
The sample corresponds to ground-truth CLANDESTINO data prepared for CLEF TextDetox 2024.
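As a rough illustration of the 70/15/15 split described above, the following sketch partitions a balanced label list per class so that each subset stays balanced. The function name `stratified_split`, the seed, and the 2500/2500 class counts are assumptions made for the example; the source does not state how the split was actually produced.

```python
import random

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Assumed layout: 5000 messages, half non-toxic (0) and half toxic (1).
labels = [0] * 2500 + [1] * 2500
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 3500 750 750
```

Under these assumptions the test set holds 750 messages, 375 per class.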
The task was zero-shot toxicity classification using Google’s and Jigsaw’s core definitions of incivility and toxicity. The temperature was set to zero, and the performance metrics were averaged for binary classification. For the Gemini 1.5 models, the temperature was left at its default value.
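The four metrics reported in the leaderboard can be sketched as follows for a binary task. This is a minimal illustration that assumes the toxic class is the positive label (encoded as 1); the helper names `f1_score` and `binary_metrics` are hypothetical and not taken from the project's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

# Toy example: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
print(binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0]))

# Cross-check against a leaderboard row (Perspective 0.70: P = 1.000, R = 0.331).
print(round(f1_score(1.000, 0.331), 3))  # 0.497
```

The last line reproduces the F1-Score listed for Perspective 0.70, which is consistent with the toxic class being treated as the positive label.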
Note that QwQ, Marco-o1-CoT, o1-preview, o1-mini, DeepSeek-R1, o3-mini, and o1 incorporate internal reasoning steps. For the OpenAI reasoning models, the temperature was left at its default value.
The uppercase L after the parameter count (in billions) in parentheses indicates that the model was deployed locally. In this cycle, we used Ollama v0.6.2 and the Python Ollama, OpenAI, Anthropic, GenerativeAI, and MistralAI dependencies.
Rookie models in this cycle are marked with an asterisk.
The plus symbol indicates that the model is inactive because it was not tested in this cycle. In these cases, we follow a Keep the Last Known Elo-Score policy.
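To make the Elo-Score column and the Keep the Last Known Elo-Score policy more concrete, here is a minimal sketch of one rating cycle using the standard Elo formulas. Several details are assumptions, since the source does not specify them: every pair of active models plays one "match" decided by comparing F1-Scores, the K-factor is 32, and rookies start at 1500. The names `expected_score` and `run_cycle` and the model names in the example are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def run_cycle(previous, f1_scores, k=32, initial=1500):
    """Update ratings for models tested this cycle; others keep their rating.

    previous:  dict of model -> Elo rating from the last cycle
    f1_scores: dict of model -> F1 for models tested in this cycle
    k, initial: assumed K-factor and rookie starting rating
    """
    ratings = dict(previous)  # inactive models are carried over unchanged
    for model in f1_scores:
        ratings.setdefault(model, initial)  # rookies enter at `initial`

    deltas = {model: 0.0 for model in f1_scores}
    active = sorted(f1_scores)
    for i, a in enumerate(active):
        for b in active[i + 1:]:
            # Assumed match rule: higher F1 wins, equal F1 is a draw.
            if f1_scores[a] > f1_scores[b]:
                score_a = 1.0
            elif f1_scores[a] < f1_scores[b]:
                score_a = 0.0
            else:
                score_a = 0.5
            change = k * (score_a - expected_score(ratings[a], ratings[b]))
            deltas[a] += change
            deltas[b] -= change

    # Apply all changes at once so the result does not depend on match order.
    for model, delta in deltas.items():
        ratings[model] += delta
    return ratings

previous = {"model_a": 1500, "model_b": 1500, "inactive_model": 1622}
updated = run_cycle(previous, {"model_a": 0.92, "model_b": 0.90})
print(updated)  # model_a gains 16 points, model_b loses 16, inactive_model is unchanged
```

Because `inactive_model` has no F1-Score in this cycle, it plays no matches and keeps its previous rating, which is the behaviour the plus symbol denotes in the leaderboard.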