Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation

Introduction

Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this work, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 650,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability.

Contamination-Resistant Evaluation

LiveAoPSBench

Accuracy trends of various LLMs on LiveAoPSBench over an 18-month period show a consistent decline in performance. We separate math-expert models from general-purpose models and observe that the degradation in accuracy varies across models, ranging from 2.4% to 23.6%.
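As a rough illustration of how such a trend can be measured, the sketch below groups timestamped evaluation results by month and compares the earliest and latest months. The record layout, model names, and function names are illustrative assumptions, not the actual LiveAoPSBench evaluation code.

```python
# Minimal sketch: measuring accuracy drift over time on a timestamped benchmark.
from collections import defaultdict

def monthly_accuracy(results):
    """results: iterable of dicts like {"model": str, "month": "2023-07", "correct": bool}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        key = (r["model"], r["month"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / totals[key] for key in totals}

def accuracy_drop(acc, model, first_month, last_month):
    """Absolute accuracy decline between the earliest and latest evaluation month."""
    return acc[(model, first_month)] - acc[(model, last_month)]
```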

Dataset Pipeline

AoPS-Instruct Pipeline

Top: Dataset curation pipeline (Training). First, a small LLM detects and discards irrelevant topics; next, questions and answers are extracted from the relevant discussions; finally, each answer is rewritten into a step-by-step solution.
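A minimal sketch of this three-stage flow is given below, assuming forum topics arrive as dictionaries with a first post and a list of replies. The `chat` helper, the prompts, the `looks_like_solution` heuristic, and the model names are placeholders rather than the actual implementation, which delegates filtering, extraction, and rewriting to open-source LLMs such as Qwen 2.5 72B.

```python
# Sketch of the three-stage training-data curation flow (filter -> extract -> rewrite).
def chat(model, prompt):
    """Placeholder for a call to a locally hosted open-source LLM."""
    raise NotImplementedError

def looks_like_solution(post):
    """Crude stand-in heuristic; the real pipeline relies on an LLM for extraction."""
    return len(post.split()) > 10

def curate_topic(topic):
    """topic: {"first_post": str, "replies": [str, ...]} (assumed layout)."""
    # Stage 1: a small LLM detects and discards irrelevant (non-math) topics.
    verdict = chat("small-filter-model",
                   f"Is the following post a math question? Answer yes or no.\n\n{topic['first_post']}")
    if not verdict.strip().lower().startswith("yes"):
        return []

    # Stage 2: extract the question and candidate answers from the discussion.
    question = topic["first_post"]
    candidate_answers = [p for p in topic["replies"] if looks_like_solution(p)]

    # Stage 3: rewrite each community answer into a step-by-step solution.
    qa_pairs = []
    for answer in candidate_answers:
        solution = chat("rewrite-model",
                        f"Rewrite as a step-by-step solution.\nQuestion: {question}\nAnswer: {answer}")
        qa_pairs.append({"question": question, "solution": solution})
    return qa_pairs
```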

Bottom: LiveAoPSBench curation pipeline (Evaluation). We take the most recent posts, use two LLMs to rewrite the solutions, filter out questions without clear final answers, and create the final evaluation set.
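One way to realize this filtering step is a cross-check between two independent rewrites, sketched below. The cross-agreement criterion, the `rewrite_solution` placeholder, the \boxed{}-based answer extraction, and the date handling are assumptions made for illustration, not the paper's exact procedure.

```python
import re

def rewrite_solution(model, question, reply):
    """Placeholder for an LLM call that rewrites a forum reply into a clean solution."""
    raise NotImplementedError

def extract_final_answer(solution_text):
    """Return the content of the last \\boxed{...}, or None if there is no clear final answer."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution_text)
    return matches[-1].strip() if matches else None

def build_eval_item(post, cutoff_date):
    """post: {"date": "YYYY-MM-DD", "question": str, "best_reply": str} (assumed layout)."""
    # Only posts newer than the cutoff enter the evolving, timestamped benchmark.
    if post["date"] <= cutoff_date:
        return None

    finals = []
    for model in ("rewriter-model-a", "rewriter-model-b"):
        rewritten = rewrite_solution(model, post["question"], post["best_reply"])
        final = extract_final_answer(rewritten)
        if final is None:            # no clear final answer -> drop the question
            return None
        finals.append(final)

    # Keep the item only if the two independent rewrites agree on the final answer.
    if finals[0] != finals[1]:
        return None
    return {"question": post["question"], "answer": finals[0], "timestamp": post["date"]}
```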

Dataset Statistics

Here is a comparison of our dataset with other related datasets from the literature. Our dataset uniquely includes timestamp information and leverages open-source large language models (LLMs) such as Qwen 2.5 72B for solution rewrites. In the comparison table, one marker denotes the inclusion of additional training datasets such as GSM8K, Orca-Math, and MATH; a second marker denotes datasets whose solutions are entirely generated by LLMs.

We provide an overview of the AoPS dataset in the figure below. As shown in Figure a, more than 60% of the questions have only one answer, while around 24% and 8% have two and three answers, respectively. Figure b shows the number of posts per year, with a cut-off of August 2024. Each year, at least 15K mathematical questions are posted to the forum; this translates to more than 1,000 questions per month, showcasing the potential of the AoPS forum as a source for training and, especially, evaluation data. Figure c shows a breakdown of the question types in our dataset: proof questions and numerical questions, at about 32% and 28% respectively, constitute the majority.
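These panel statistics can be recomputed from the raw dataset with simple counting, as in the sketch below. The field names ("answers", "post_date") and the ISO date format are assumptions about the dump layout, not the released schema.

```python
from collections import Counter

def answer_count_distribution(dataset):
    """Fraction of questions with 1, 2, 3, ... community answers (cf. Figure a)."""
    counts = Counter(len(item["answers"]) for item in dataset)
    total = sum(counts.values())
    return {k: v / total for k, v in sorted(counts.items())}

def posts_per_year(dataset):
    """Number of questions posted each year, e.g. {"2023": 17500, ...} (cf. Figure b)."""
    return Counter(item["post_date"][:4] for item in dataset)
```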

Finally, Figure d shows the pairwise overlap between popular supervised fine-tuning datasets, measured via substring matching between the two datasets of each pair. Among the two Olympiad-level datasets (i.e., ours and Numina), our dataset has the least overlap with common datasets (less than 14.1% overlap), indicating a significant number of new data points.
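A minimal sketch of such a substring-matching overlap check is given below; the whitespace normalization and the 64-character probe length are assumptions for illustration, not the exact procedure used in the paper.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace before matching."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def overlap_fraction(dataset_a, dataset_b, probe_len=64):
    """Fraction of questions in dataset_a whose normalized prefix of length
    `probe_len` appears verbatim inside some question of dataset_b."""
    corpus_b = "\n".join(normalize(q) for q in dataset_b)
    hits = 0
    for q in dataset_a:
        probe = normalize(q)[:probe_len]
        if probe and probe in corpus_b:
            hits += 1
    return hits / max(len(dataset_a), 1)
```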