Leaderboard

| Model | Accuracy | Precision | Recall | F1-Score | Elo-Score |
|-------|---------:|----------:|-------:|---------:|----------:|
| Athene-V2 (72B-L) | 0.925 | 0.932 | 0.917 | 0.925 | 1743 |
| Qwen 2.5 (72B-L) | 0.924 | 0.932 | 0.915 | 0.923 | 1729 |
| GPT-4o (2024-05-13) | 0.921 | 0.905 | 0.941 | 0.923 | 1726 |
| GPT-4o (2024-11-20) | 0.921 | 0.922 | 0.920 | 0.921 | 1723 |
| GPT-4 (0613) | 0.920 | 0.927 | 0.912 | 0.919 | 1686 |
| Grok Beta | 0.916 | 0.906 | 0.928 | 0.917 | 1684 |
| Pixtral Large (2411) | 0.913 | 0.884 | 0.952 | 0.917 | 1682 |
| Qwen 2.5 (14B-L) | 0.915 | 0.904 | 0.928 | 0.916 | 1679 |
| OpenThinker (32B-L)* | 0.916 | 0.915 | 0.917 | 0.916 | 1678 |
| GPT-4 Turbo (2024-04-09) | 0.912 | 0.880 | 0.955 | 0.916 | 1678 |
| GPT-4o (2024-08-06) | 0.913 | 0.895 | 0.936 | 0.915 | 1664 |
| Qwen 2.5 (32B-L) | 0.915 | 0.919 | 0.909 | 0.914 | 1663 |
| Llama 3.1 (70B-L) | 0.912 | 0.908 | 0.917 | 0.912 | 1661 |
| Nous Hermes 2 (11B-L) | 0.912 | 0.912 | 0.912 | 0.912 | 1660 |
| Gemini 1.5 Flash | 0.909 | 0.889 | 0.936 | 0.912 | 1660 |
| Gemini 2.0 Flash* | 0.909 | 0.872 | 0.960 | 0.914 | 1659 |
| o1 (2024-12-17)* | 0.911 | 0.895 | 0.931 | 0.912 | 1658 |
| Grok 2 (1212) | 0.900 | 0.864 | 0.949 | 0.905 | 1651 |
| Gemini 1.5 Flash (8B) | 0.905 | 0.909 | 0.901 | 0.905 | 1651 |
| Gemini 1.5 Pro | 0.900 | 0.859 | 0.957 | 0.905 | 1650 |
| Falcon3 (10B-L) | 0.904 | 0.891 | 0.920 | 0.906 | 1650 |
| Exaone 3.5 (32B-L) | 0.907 | 0.913 | 0.899 | 0.906 | 1649 |
| Aya (35B-L) | 0.908 | 0.925 | 0.888 | 0.906 | 1649 |
| Gemma 2 (27B-L) | 0.905 | 0.892 | 0.923 | 0.907 | 1649 |
| Llama 3.1 (405B) | 0.904 | 0.880 | 0.936 | 0.907 | 1648 |
| Llama 3.3 (70B-L) | 0.904 | 0.880 | 0.936 | 0.907 | 1648 |
| Aya Expanse (32B-L) | 0.905 | 0.888 | 0.928 | 0.907 | 1647 |
| Open Mixtral 8x22B | 0.911 | 0.935 | 0.883 | 0.908 | 1647 |
| Aya Expanse (8B-L) | 0.905 | 0.876 | 0.944 | 0.909 | 1646 |
| Gemini 2.0 Flash-Lite (02-05)* | 0.903 | 0.872 | 0.944 | 0.907 | 1646 |
| GLM-4 (9B-L) | 0.911 | 0.925 | 0.893 | 0.909 | 1646 |
| Nemotron (70B-L) | 0.908 | 0.896 | 0.923 | 0.909 | 1646 |
| DeepSeek-R1 (671B) | 0.905 | 0.869 | 0.955 | 0.910 | 1645 |
| Sailor2 (20B-L) | 0.912 | 0.933 | 0.888 | 0.910 | 1645 |
| DeepSeek-V3 (671B) | 0.913 | 0.948 | 0.875 | 0.910 | 1645 |
| GPT-4o mini (2024-07-18) | 0.908 | 0.884 | 0.939 | 0.911 | 1645 |
| Qwen 2.5 (7B-L) | 0.900 | 0.887 | 0.917 | 0.902 | 1636 |
| Hermes 3 (70B-L) | 0.905 | 0.937 | 0.869 | 0.902 | 1636 |
| Phi-4 (14B-L)* | 0.901 | 0.899 | 0.904 | 0.902 | 1634 |
| o1-preview (2024-09-12)+ | 0.800 | 0.731 | 0.991 | 0.841 | 1622 |
| Mistral Large (2411) | 0.896 | 0.863 | 0.941 | 0.901 | 1621 |
| o3-mini (2025-01-31)* | 0.896 | 0.886 | 0.909 | 0.897 | 1603 |
| DeepSeek-R1 D-Qwen (14B-L)* | 0.897 | 0.896 | 0.899 | 0.897 | 1603 |
| Mistral NeMo (12B-L) | 0.891 | 0.873 | 0.915 | 0.893 | 1589 |
| GPT-3.5 Turbo (0125) | 0.875 | 0.822 | 0.957 | 0.884 | 1564 |
| QwQ (32B-L) | 0.892 | 0.940 | 0.837 | 0.886 | 1562 |
| Gemma 2 (9B-L) | 0.876 | 0.818 | 0.968 | 0.886 | 1561 |
| Mistral (7B-L) | 0.891 | 0.897 | 0.883 | 0.890 | 1559 |
| Llama 3.1 (8B-L) | 0.889 | 0.878 | 0.904 | 0.891 | 1558 |
| Marco-o1-CoT (7B-L) | 0.888 | 0.866 | 0.917 | 0.891 | 1556 |
| Tülu3 (8B-L) | 0.881 | 0.893 | 0.867 | 0.880 | 1554 |
| Tülu3 (70B-L) | 0.891 | 0.962 | 0.813 | 0.882 | 1552 |
| Mistral Small (22B-L) | 0.871 | 0.806 | 0.976 | 0.883 | 1550 |
| OpenThinker (7B-L)* | 0.872 | 0.812 | 0.968 | 0.883 | 1547 |
| OLMo 2 (13B-L)* | 0.867 | 0.804 | 0.971 | 0.879 | 1537 |
| OLMo 2 (7B-L)* | 0.871 | 0.868 | 0.875 | 0.871 | 1530 |
| Llama 3.2 (3B-L) | 0.876 | 0.885 | 0.864 | 0.874 | 1529 |
| Claude 3.5 Haiku (20241022) | 0.885 | 0.947 | 0.816 | 0.877 | 1527 |
| Pixtral-12B (2409) | 0.865 | 0.804 | 0.965 | 0.878 | 1525 |
| Claude 3.5 Sonnet (20241022) | 0.887 | 0.950 | 0.816 | 0.878 | 1522 |
| Orca 2 (7B-L) | 0.876 | 0.910 | 0.835 | 0.871 | 1516 |
| DeepSeek-R1 D-Llama (8B-L)* | 0.865 | 0.837 | 0.907 | 0.871 | 1514 |
| Yi 1.5 (9B-L) | 0.859 | 0.826 | 0.909 | 0.865 | 1497 |
| Granite 3.1 (8B-L) | 0.869 | 0.921 | 0.808 | 0.861 | 1493 |
| o1-mini (2024-09-12)+ | 0.731 | 0.667 | 0.991 | 0.797 | 1471 |
| Yi Large | 0.871 | 0.979 | 0.757 | 0.854 | 1431 |
| Nous Hermes 2 Mixtral (47B-L) | 0.867 | 0.963 | 0.763 | 0.851 | 1419 |
| Mistral OpenOrca (7B-L) | 0.863 | 0.939 | 0.776 | 0.850 | 1413 |
| Ministral-8B (2410) | 0.823 | 0.744 | 0.984 | 0.847 | 1404 |
| Exaone 3.5 (8B-L) | 0.853 | 0.913 | 0.781 | 0.842 | 1398 |
| Codestral Mamba (7B) | 0.827 | 0.774 | 0.923 | 0.842 | 1397 |
| Dolphin 3.0 (8B-L)* | 0.807 | 0.731 | 0.971 | 0.834 | 1372 |
| Yi 1.5 (34B-L) | 0.849 | 0.955 | 0.733 | 0.830 | 1307 |
| Solar Pro (22B-L) | 0.844 | 0.916 | 0.757 | 0.829 | 1276 |
| DeepSeek-R1 D-Qwen (7B-L)* | 0.821 | 0.832 | 0.805 | 0.818 | 1241 |
| Hermes 3 (8B-L) | 0.840 | 0.932 | 0.733 | 0.821 | 1226 |
| Nemotron-Mini (4B-L) | 0.771 | 0.696 | 0.963 | 0.808 | 1178 |
| Phi-3 Medium (14B-L) | 0.815 | 0.940 | 0.672 | 0.784 | 1077 |
| Yi 1.5 (6B-L) | 0.807 | 0.908 | 0.683 | 0.779 | 1050 |
| DeepSeek-R1 D-Qwen (1.5B-L)* | 0.633 | 0.685 | 0.493 | 0.574 | 915 |
| Granite 3 MoE (3B-L) | 0.747 | 0.894 | 0.560 | 0.689 | 860 |
| Perspective 0.55 | 0.768 | 0.986 | 0.544 | 0.701 | 856 |
| Perspective 0.70+ | 0.665 | 1.000 | 0.331 | 0.497 | 851 |
| Perspective 0.80+ | 0.609 | 1.000 | 0.219 | 0.359 | 786 |
| Perspective 0.60 | 0.731 | 0.989 | 0.467 | 0.634 | 748 |
| Granite 3.1 MoE (3B-L) | 0.555 | 1.000 | 0.109 | 0.197 | 689 |
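
The Elo-Score column ranks models through pairwise comparisons rather than through the raw metrics alone. As a rough illustration of how such ratings behave, the sketch below implements a standard Elo update in Python with an assumed K-factor of 32; the leaderboard's actual pairing scheme and update parameters may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated ratings after one pairwise comparison.

    score_a is 1.0 if model A wins, 0.5 for a draw, 0.0 if it loses.
    The K-factor of 32 is an illustrative assumption.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Example: the top-rated model (1743) beats the runner-up (1729).
print(elo_update(1743, 1729, 1.0))
```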

Task Description

  • In this cycle, we used a balanced sample of 5,000 messages for toxic-language detection in Spanish, split 70/15/15 into training, validation, and test sets for potential fine-tuning jobs (a sketch of such a stratified split follows this list).
  • The sample corresponds to ground-truth CLANDESTINO data prepared for the CLEF TextDetox 2024 shared task.
  • The task involved zero-shot toxicity classification using Google’s and Jigsaw’s core definitions of incivility and toxicity. The temperature was set to zero, and the performance metrics were averaged for binary classification (see the classification and metric sketches after this list). For the Gemini 1.5 models, the temperature was left at its default value.
  • Note that QwQ, Marco-o1-CoT, o1-preview, o1-mini, DeepSeek-R1, o3-mini, and o1 incorporate internal reasoning steps. The temperature was left at its default value for the OpenAI reasoning models.
  • An uppercase L after the parameter count in parentheses indicates that the model was deployed locally. In this cycle, Ollama v0.6.2 and the Python Ollama, OpenAI, Anthropic, GenerativeAI, and MistralAI dependencies were utilised.
  • Rookie models, i.e., models tested for the first time in this cycle, are marked with an asterisk (*).
  • A plus symbol (+) indicates that a model is inactive because it was not tested in this cycle; in these cases, we follow a keep-the-last-known-Elo-Score policy.
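
As referenced in the first bullet, below is a minimal sketch of a 70/15/15 stratified split, assuming scikit-learn and placeholder data; the actual CLANDESTINO preprocessing pipeline is not reproduced here.

```python
from sklearn.model_selection import train_test_split

# Placeholder stand-ins for the balanced 5,000-message sample.
messages = [f"mensaje {i}" for i in range(5000)]
labels = [i % 2 for i in range(5000)]  # 1 = toxic, 0 = non-toxic, balanced

# Carve out 70% for training, then split the remaining 30% in half,
# yielding 15% validation and 15% test, stratified on the toxicity label.
train_x, rest_x, train_y, rest_y = train_test_split(
    messages, labels, test_size=0.30, stratify=labels, random_state=42
)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42
)
print(len(train_x), len(val_x), len(test_x))  # 3500 750 750
```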
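For the zero-shot setup itself, the following sketch shows how a locally deployed model might be queried through the Python Ollama client at temperature zero. The model tag, prompt wording, and answer parsing are illustrative assumptions, not the exact prompt used in this cycle.

```python
import ollama  # Python Ollama client, one of the dependencies listed above

# Hypothetical system prompt paraphrasing Jigsaw's toxicity definition;
# the benchmark's exact instructions are not reproduced here.
SYSTEM_PROMPT = (
    "Classify the following Spanish message as TOXIC or NON-TOXIC. "
    "A toxic message is rude, disrespectful, or unreasonable and likely "
    "to make someone leave a discussion. Answer with a single word."
)


def classify(message: str, model: str = "qwen2.5:72b") -> int:
    """Return 1 for toxic, 0 for non-toxic."""
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        options={"temperature": 0},  # deterministic decoding, as in the setup
    )
    answer = response["message"]["content"].strip().upper()
    return 0 if answer.startswith("NON") else 1
```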
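Finally, to make the metric columns concrete, the snippet below computes accuracy, precision, recall, and F1 for a toy binary run with scikit-learn. Reading "averaged" as macro-averaging over the two classes is an assumption; the leaderboard may aggregate differently.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground truth and predictions; 1 = toxic, 0 = non-toxic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
# average="macro" reflects the assumed reading of "averaged" above.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
      f"Recall={recall:.3f}  F1={f1:.3f}")
```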