Leaderboard

| Model | Accuracy | Precision | Recall | F1-Score | Elo-Score |
|---|---|---|---|---|---|
| GPT-4.5-preview (2025-02-27)* | 0.929 | 0.940 | 0.917 | 0.928 | 1788 |
| Athene-V2 (72B-L) | 0.925 | 0.932 | 0.917 | 0.925 | 1741 |
| Qwen 2.5 (72B-L) | 0.924 | 0.932 | 0.915 | 0.923 | 1728 |
| GPT-4o (2024-05-13) | 0.921 | 0.905 | 0.941 | 0.923 | 1725 |
| GPT-4o (2024-11-20) | 0.921 | 0.922 | 0.920 | 0.921 | 1722 |
| GPT-4 (0613) | 0.920 | 0.927 | 0.912 | 0.919 | 1687 |
| Grok Beta | 0.916 | 0.906 | 0.928 | 0.917 | 1685 |
| Pixtral Large (2411) | 0.913 | 0.884 | 0.952 | 0.917 | 1683 |
| OpenThinker (32B-L) | 0.916 | 0.915 | 0.917 | 0.916 | 1681 |
| Qwen 2.5 (14B-L) | 0.915 | 0.904 | 0.928 | 0.916 | 1680 |
| GPT-4 Turbo (2024-04-09) | 0.912 | 0.880 | 0.955 | 0.916 | 1679 |
| GPT-4o (2024-08-06) | 0.913 | 0.895 | 0.936 | 0.915 | 1666 |
| Qwen 2.5 (32B-L) | 0.915 | 0.919 | 0.909 | 0.914 | 1665 |
| Gemini 2.0 Flash | 0.909 | 0.872 | 0.960 | 0.914 | 1664 |
| Llama 3.1 (70B-L) | 0.912 | 0.908 | 0.917 | 0.912 | 1664 |
| o1 (2024-12-17) | 0.911 | 0.895 | 0.931 | 0.912 | 1663 |
| Nous Hermes 2 (11B-L) | 0.912 | 0.912 | 0.912 | 0.912 | 1663 |
| Gemini 1.5 Flash | 0.909 | 0.889 | 0.936 | 0.912 | 1663 |
| Grok 2 (1212) | 0.900 | 0.864 | 0.949 | 0.905 | 1658 |
| Gemini 1.5 Flash (8B) | 0.905 | 0.909 | 0.901 | 0.905 | 1658 |
| Gemini 1.5 Pro | 0.900 | 0.859 | 0.957 | 0.905 | 1657 |
| Falcon3 (10B-L) | 0.904 | 0.891 | 0.920 | 0.906 | 1656 |
| Exaone 3.5 (32B-L) | 0.907 | 0.913 | 0.899 | 0.906 | 1656 |
| Aya (35B-L) | 0.908 | 0.925 | 0.888 | 0.906 | 1655 |
| Gemini 2.0 Flash-Lite (02-05) | 0.903 | 0.872 | 0.944 | 0.907 | 1655 |
| Gemma 2 (27B-L) | 0.905 | 0.892 | 0.923 | 0.907 | 1654 |
| Llama 3.3 (70B-L) | 0.904 | 0.880 | 0.936 | 0.907 | 1653 |
| Llama 3.1 (405B) | 0.904 | 0.880 | 0.936 | 0.907 | 1653 |
| Aya Expanse (32B-L) | 0.905 | 0.888 | 0.928 | 0.907 | 1652 |
| Open Mixtral 8x22B | 0.911 | 0.935 | 0.883 | 0.908 | 1652 |
| Aya Expanse (8B-L) | 0.905 | 0.876 | 0.944 | 0.909 | 1651 |
| GLM-4 (9B-L) | 0.911 | 0.925 | 0.893 | 0.909 | 1651 |
| Nemotron (70B-L) | 0.908 | 0.896 | 0.923 | 0.909 | 1650 |
| DeepSeek-R1 (671B) | 0.905 | 0.869 | 0.955 | 0.910 | 1650 |
| Sailor2 (20B-L) | 0.912 | 0.933 | 0.888 | 0.910 | 1650 |
| Gemma 3 (27B-L)* | 0.904 | 0.865 | 0.957 | 0.909 | 1650 |
| DeepSeek-V3 (671B) | 0.913 | 0.948 | 0.875 | 0.910 | 1650 |
| GPT-4o mini (2024-07-18) | 0.908 | 0.884 | 0.939 | 0.911 | 1649 |
| Phi-4 (14B-L) | 0.901 | 0.899 | 0.904 | 0.902 | 1645 |
| Qwen 2.5 (7B-L) | 0.900 | 0.887 | 0.917 | 0.902 | 1644 |
| Hermes 3 (70B-L) | 0.905 | 0.937 | 0.869 | 0.902 | 1643 |
| Gemma 3 (12B-L)* | 0.899 | 0.866 | 0.944 | 0.903 | 1642 |
| Mistral Large (2411) | 0.896 | 0.863 | 0.941 | 0.901 | 1629 |
| o1-preview (2024-09-12)+ | 0.800 | 0.731 | 0.991 | 0.841 | 1622 |
| o3-mini (2025-01-31) | 0.896 | 0.886 | 0.909 | 0.897 | 1615 |
| DeepSeek-R1 D-Qwen (14B-L) | 0.897 | 0.896 | 0.899 | 0.897 | 1614 |
| o1-mini (2024-09-12) | 0.895 | 0.878 | 0.917 | 0.897 | 1599 |
| Mistral NeMo (12B-L) | 0.891 | 0.873 | 0.915 | 0.893 | 1586 |
| Command R7B Arabic (7B-L)* | 0.897 | 0.926 | 0.864 | 0.894 | 1584 |
| GPT-3.5 Turbo (0125) | 0.875 | 0.822 | 0.957 | 0.884 | 1564 |
| QwQ (32B-L) | 0.892 | 0.940 | 0.837 | 0.886 | 1562 |
| Gemma 2 (9B-L) | 0.876 | 0.818 | 0.968 | 0.886 | 1560 |
| Mistral (7B-L) | 0.891 | 0.897 | 0.883 | 0.890 | 1558 |
| Llama 3.1 (8B-L) | 0.889 | 0.878 | 0.904 | 0.891 | 1556 |
| Marco-o1-CoT (7B-L) | 0.888 | 0.866 | 0.917 | 0.891 | 1554 |
| Phi-4-mini (3.8B-L)* | 0.884 | 0.891 | 0.875 | 0.883 | 1553 |
| Mistral Small (22B-L) | 0.871 | 0.806 | 0.976 | 0.883 | 1551 |
| OpenThinker (7B-L) | 0.872 | 0.812 | 0.968 | 0.883 | 1549 |
| Tülu3 (8B-L) | 0.881 | 0.893 | 0.867 | 0.880 | 1541 |
| Tülu3 (70B-L) | 0.891 | 0.962 | 0.813 | 0.882 | 1538 |
| OLMo 2 (7B-L) | 0.871 | 0.868 | 0.875 | 0.871 | 1527 |
| OLMo 2 (13B-L) | 0.867 | 0.804 | 0.971 | 0.879 | 1525 |
| Llama 3.2 (3B-L) | 0.876 | 0.885 | 0.864 | 0.874 | 1525 |
| Claude 3.5 Haiku (20241022) | 0.885 | 0.947 | 0.816 | 0.877 | 1522 |
| Pixtral-12B (2409) | 0.865 | 0.804 | 0.965 | 0.878 | 1519 |
| Mistral Saba* | 0.867 | 0.810 | 0.957 | 0.878 | 1516 |
| Orca 2 (7B-L) | 0.876 | 0.910 | 0.835 | 0.871 | 1514 |
| Claude 3.7 Sonnet (20250219)* | 0.887 | 0.950 | 0.816 | 0.878 | 1513 |
| DeepSeek-R1 D-Llama (8B-L) | 0.865 | 0.837 | 0.907 | 0.871 | 1511 |
| Claude 3.5 Sonnet (20241022) | 0.887 | 0.950 | 0.816 | 0.878 | 1510 |
| Yi 1.5 (9B-L) | 0.859 | 0.826 | 0.909 | 0.865 | 1496 |
| Granite 3.1 (8B-L) | 0.869 | 0.921 | 0.808 | 0.861 | 1495 |
| Yi Large | 0.871 | 0.979 | 0.757 | 0.854 | 1449 |
| Nous Hermes 2 Mixtral (47B-L) | 0.867 | 0.963 | 0.763 | 0.851 | 1438 |
| Mistral OpenOrca (7B-L) | 0.863 | 0.939 | 0.776 | 0.850 | 1434 |
| Gemma 3 (4B-L)* | 0.820 | 0.742 | 0.981 | 0.845 | 1428 |
| Ministral-8B (2410) | 0.823 | 0.744 | 0.984 | 0.847 | 1427 |
| Exaone 3.5 (8B-L) | 0.853 | 0.913 | 0.781 | 0.842 | 1416 |
| Codestral Mamba (7B) | 0.827 | 0.774 | 0.923 | 0.842 | 1415 |
| Dolphin 3.0 (8B-L) | 0.807 | 0.731 | 0.971 | 0.834 | 1386 |
| Granite 3.2 (8B-L)* | 0.849 | 0.940 | 0.747 | 0.832 | 1359 |
| Yi 1.5 (34B-L) | 0.849 | 0.955 | 0.733 | 0.830 | 1333 |
| Solar Pro (22B-L) | 0.844 | 0.916 | 0.757 | 0.829 | 1304 |
| Hermes 3 (8B-L) | 0.840 | 0.932 | 0.733 | 0.821 | 1244 |
| DeepSeek-R1 D-Qwen (7B-L) | 0.821 | 0.832 | 0.805 | 0.818 | 1228 |
| Nemotron-Mini (4B-L) | 0.771 | 0.696 | 0.963 | 0.808 | 1197 |
| Phi-3 Medium (14B-L) | 0.815 | 0.940 | 0.672 | 0.784 | 1089 |
| Yi 1.5 (6B-L) | 0.807 | 0.908 | 0.683 | 0.779 | 1056 |
| DeepScaleR (1.5B-L)* | 0.620 | 0.688 | 0.440 | 0.537 | 891 |
| Perspective 0.55 | 0.768 | 0.986 | 0.544 | 0.701 | 880 |
| Granite 3 MoE (3B-L) | 0.747 | 0.894 | 0.560 | 0.689 | 877 |
| DeepSeek-R1 D-Qwen (1.5B-L) | 0.633 | 0.685 | 0.493 | 0.574 | 800 |
| Perspective 0.60 | 0.731 | 0.989 | 0.467 | 0.634 | 778 |
| Perspective 0.70 | 0.665 | 1.000 | 0.331 | 0.497 | 747 |
| Perspective 0.80 | 0.609 | 1.000 | 0.219 | 0.359 | 665 |
| Granite 3.1 MoE (3B-L) | 0.555 | 1.000 | 0.109 | 0.197 | 577 |
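
As a quick reference for reading the table: the F1-Score column is the harmonic mean of the Precision and Recall columns, and the Elo-Score column reflects pairwise-comparison ratings. The sketch below reproduces the F1 relation and a textbook Elo expected-score/update rule; the Elo function is purely illustrative, since the pairing scheme and K-factor behind the leaderboard are not specified in this section.

```python
# F1 as the harmonic mean of precision and recall, plus a textbook Elo update.
# The Elo parameters (K-factor, pairing) are assumptions, not the leaderboard's
# exact procedure.

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One standard Elo update; score_a is 1.0 (win), 0.5 (draw) or 0.0 (loss) for A."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: the top row's precision 0.940 and recall 0.917 give F1 ≈ 0.928.
print(round(f1_score(0.940, 0.917), 3))  # 0.928
```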

Task Description

  • In this cycle, we used a balanced sample of 5,000 messages for toxic-language detection in Spanish, split 70/15/15 into training, validation, and test sets in case of potential fine-tuning jobs (see the split sketch after this list).
  • The sample corresponds to ground-truth CLANDESTINO data prepared for CLEF TextDetox 2024.
  • The task involved zero-shot toxicity classification using Google’s and Jigsaw’s core definitions of incivility and toxicity. The temperature was set to zero, and the performance metrics were averaged for binary classification (a request sketch follows this list). In the Gemini 1.5 models, the temperature was left at its default value.
  • Note that QwQ, Marco-o1-CoT, o1-preview, o1-mini, DeepSeek-R1, o3-mini and o1 incorporate internal reasoning steps. The temperature was left at its default value for the OpenAI reasoning models and GPT-4.5-preview.
  • An uppercase L after the parameter count in parentheses indicates that the model was deployed locally. In this cycle, Ollama v0.6.4 and the Python Ollama, OpenAI, Anthropic, GenerativeAI and MistralAI dependencies were utilised (a local-inference sketch follows this list).
  • Rookie models in this cycle are marked with an asterisk.
  • The plus symbol indicates that a model is inactive because it was not tested in this cycle. In these cases, we follow a keep-the-last-known-Elo-Score policy.
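
The 70/15/15 split mentioned in the first bullet can be reproduced with a standard stratified split. The sketch below is illustrative only: the DataFrame and the `toxic` label column are hypothetical, and scikit-learn is used here for convenience even though it is not one of the dependencies listed above.

```python
# Minimal sketch of a stratified 70/15/15 train/validation/test split.
# Assumes a pandas DataFrame `df` with a hypothetical binary `toxic` column.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_70_15_15(df: pd.DataFrame, label_col: str = "toxic", seed: int = 42):
    # 70% training, stratified on the label to preserve the balanced classes.
    train, rest = train_test_split(
        df, train_size=0.70, stratify=df[label_col], random_state=seed
    )
    # Split the remaining 30% evenly: 15% validation, 15% test overall.
    val, test = train_test_split(
        rest, test_size=0.50, stratify=rest[label_col], random_state=seed
    )
    return train, val, test
```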
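The zero-shot set-up described above amounts to one request per message at temperature zero with a binary toxic / non-toxic verdict. The sketch below uses the OpenAI Python client named in the dependencies; the system prompt is an assumption, since the cycle’s exact instructions built on Google’s and Jigsaw’s definitions are not reproduced here.

```python
# Minimal sketch of one zero-shot classification request at temperature 0.
# The system prompt is an illustrative assumption, not the cycle's exact wording.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a classifier for toxic language in Spanish. Using Jigsaw's notion of "
    "toxicity (a rude, disrespectful, or unreasonable comment likely to make someone "
    "leave a discussion), answer with exactly one word: 'toxic' or 'non-toxic'."
)

def classify(message: str, model: str = "gpt-4o-2024-11-20") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic as far as the API allows
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```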
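For the locally deployed models (entries marked with an uppercase L), the same request can be issued through the Python Ollama client against a local Ollama server (v0.6.4 in this cycle). The model tag and prompt below are illustrative assumptions, not the exact configuration used for the leaderboard.

```python
# Minimal sketch of one local request via the Python ollama client.
# The model tag and prompt are illustrative; any pulled Ollama model works.
import ollama

def classify_local(message: str, model: str = "qwen2.5:72b") -> str:
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with exactly one word: 'toxic' or 'non-toxic'."},
            {"role": "user", "content": message},
        ],
        options={"temperature": 0},  # temperature fixed at zero, as described above
    )
    return response["message"]["content"].strip().lower()
```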