Top 10 LLM Research Papers of 2025

Vasu Deo Sankrityayan · Last Updated: 15 Jun, 2025
7 min read

2025 has been home to several breakthroughs in large language models (LLMs). The technology has found a home in almost every domain imaginable and is increasingly integrated into conventional workflows. With so much happening, it is a tall order to keep track of significant findings. This article acquaints you with the most popular LLM research papers published this year, helping you stay up to date with the latest breakthroughs in AI.

Top 10 LLM Research Papers

The research papers were sourced from Hugging Face, an online platform for AI-related content, and ranked by their number of upvotes there. The following are 10 of the most well-received research papers of 2025:

1. Mutarjim: Advancing Bidirectional Arabic-English Translation

Category: Natural Language Processing
Mutarjim is a compact yet powerful 1.5B-parameter language model for bidirectional Arabic-English translation. Built on Kuwain-1.5B, it achieves state-of-the-art performance against significantly larger models and introduces the Tarjama-25 benchmark.
Objective: The main objective is to develop an efficient, accurate language model optimized for bidirectional Arabic-English translation, addressing the limitations of current LLMs in this domain and introducing a robust benchmark for evaluation.

Outcome:

  1. Mutarjim (1.5B parameters) achieved state-of-the-art performance on the Tarjama-25 benchmark for Arabic-to-English translation.
  2. Unidirectional variants, such as Mutarjim-AR2EN, outperformed the bidirectional model.
  3. The continued pre-training phase significantly improved translation quality.

Full Paper: https://arxiv.org/abs/2505.17894

2. Qwen3 Technical Report

Category: Natural Language Processing
This technical report introduces Qwen3, a new series of LLMs featuring integrated thinking and non-thinking modes, diverse model sizes, enhanced multilingual capabilities, and state-of-the-art performance across various benchmarks.
Objective: The primary objective of the paper is to introduce the Qwen3 LLM series, designed to enhance performance, efficiency, and multilingual capabilities, notably by integrating flexible thinking and non-thinking modes and optimizing resource usage for diverse tasks.

Outcome:

  1. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks.
  2. The flagship Qwen3-235B-A22B model achieved 85.7 on AIME’24 and 70.7 on LiveCodeBench v5.
  3. Qwen3-235B-A22B-Base outperformed DeepSeek-V3-Base on 14 out of 15 evaluation benchmarks.
  4. Strong-to-weak distillation proved highly efficient, requiring approximately 1/10 of the GPU hours compared to direct reinforcement learning.
  5. Qwen3 expanded multilingual support from 29 to 119 languages and dialects, enhancing global accessibility and cross-lingual understanding.

Full Paper: https://arxiv.org/abs/2505.09388

3. Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

Category: Multi-Modal
This paper provides a comprehensive survey of large multimodal reasoning models (LMRMs), outlining a four-stage developmental roadmap for multimodal reasoning research.
Objective: The main objective is to clarify the current landscape of multimodal reasoning and inform the design of next-generation multimodal reasoning systems capable of comprehensive perception, precise understanding, and deep reasoning in diverse environments.

Outcome: The survey’s experimental findings highlight current LMRM limitations in the Audio-Video Question Answering (AVQA) task. Additionally, GPT-4o scores 0.6% on the BrowseComp benchmark, improving to 1.9% with browsing tools, demonstrating weak tool-interactive planning.

Full Paper: https://arxiv.org/abs/2505.04921

4. Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Category: Reinforcement Learning
This paper introduces Absolute Zero, a novel Reinforcement Learning with Verifiable Rewards (RLVR) paradigm. It enables language models to autonomously generate and solve reasoning tasks, achieving self-improvement without relying on external human-curated data.
Objective: The primary objective is to develop a self-evolving reasoning system that overcomes the scalability limitations of human-curated data by learning to propose tasks that maximize its own learning progress and, in solving them, improve its reasoning capabilities.

Outcome:

  1. AZR achieves overall state-of-the-art (SOTA) performance on coding and mathematical reasoning tasks.
  2. Specifically, AZR-Coder-7B achieves an overall average score of 50.4, surpassing previous best models by 1.8 absolute percentage points on combined math and coding tasks without any curated data.
  3. The performance improvements scale with model size: 3B, 7B, and 14B coder models achieve gains of +5.7, +10.2, and +13.2 points, respectively.
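The propose-solve loop with a verifiable reward can be illustrated with a toy sketch. This is purely illustrative and not the paper's code: the "proposer" samples tiny arithmetic tasks whose ground truth comes from execution, and the "solver" is a deliberately flawed stand-in policy, so the rewards carry a learning signal.

```python
import random

def propose_task(rng):
    """Proposer: sample a small task whose ground truth is obtained by
    executing it -- the 'verifiable reward' in RLVR."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    op = rng.choice(["+", "-", "*"])
    expr = f"{a} {op} {b}"
    return expr, eval(expr)

def solve_task(expr):
    """Solver: in Absolute Zero this is the same LLM; here a flawed
    stand-in that mishandles subtraction, so rewards vary."""
    return eval(expr.replace("-", "+"))

rng = random.Random(0)
rewards = []
for _ in range(20):
    expr, truth = propose_task(rng)
    rewards.append(1.0 if solve_task(expr) == truth else 0.0)
# In the real system, these rewards would update both the proposer
# (to pick tasks of useful difficulty) and the solver.
```

Because the environment itself verifies each answer, no human-curated labels are needed anywhere in the loop, which is the core of the paper's "zero data" claim.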

Full Paper: https://arxiv.org/abs/2505.03335

5. Seed1.5-VL Technical Report

Category: Multi-Modal
This report introduces Seed1.5-VL, a compact vision-language foundation model designed for general-purpose multimodal understanding and reasoning.
Objective: The primary objective is to advance general-purpose multimodal understanding and reasoning by addressing the scarcity of high-quality vision-language annotations and efficiently training large-scale multimodal models with asymmetrical architectures.

Outcome:

  1. Seed1.5-VL achieves state-of-the-art (SOTA) performance on 38 out of 60 evaluated public benchmarks.
  2. It excels in document understanding, grounding, and agentic tasks.
  3. The model achieves an MMMU score of 77.9 (thinking mode), which is a key indicator of multimodal reasoning ability.

Full Paper: https://arxiv.org/abs/2505.07062

6. Shifting AI Efficiency From Model-Centric to Data-Centric Compression

Category: Machine Learning
This position paper advocates for a paradigm shift in AI efficiency from model-centric to data-centric compression, focusing on token compression to address the growing computational bottleneck of long token sequences in large AI models.
Objective: The paper aims to reposition AI efficiency research by arguing that the dominant computational bottleneck has shifted from model size to the quadratic cost of self-attention over long token sequences, necessitating a focus on data-centric token compression.

Outcome: 

  1. Token compression is quantitatively shown to reduce computational complexity quadratically and memory usage linearly with sequence length reduction.
  2. Empirical comparisons reveal that simple random token dropping often surprisingly outperforms meticulously engineered token compression methods.
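The quadratic payoff of token compression, and the random-dropping baseline, can be sketched in a few lines. This is a toy illustration with made-up numbers; `attention_flops` is a crude cost proxy, not a real profiler.

```python
import random

def attention_flops(n_tokens, dim=64):
    # Self-attention cost grows quadratically with sequence length:
    # every token attends to every other token.
    return n_tokens ** 2 * dim

def random_token_drop(tokens, keep_ratio, rng):
    """The surprisingly strong baseline: keep a random subset of
    tokens, preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(rng.sample(range(len(tokens)), k))
    return [tokens[i] for i in keep]

rng = random.Random(0)
tokens = list(range(1000))
kept = random_token_drop(tokens, keep_ratio=0.5, rng=rng)
speedup = attention_flops(len(tokens)) / attention_flops(len(kept))
# Halving the token count cuts attention compute by roughly 4x,
# which is the paper's argument for data-centric compression.
```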

Full Paper: https://arxiv.org/abs/2505.19147

7. Emerging Properties in Unified Multimodal Pretraining

Category: Multi-Modal
BAGEL is an open-source foundational model for unified multimodal understanding and generation, exhibiting emerging capabilities in complex multimodal reasoning.

Objective: The primary objective is to bridge the gap between open academic models and proprietary systems in unified multimodal understanding and generation.

Outcome:

  1. BAGEL significantly outperforms existing open-source unified models in both multimodal generation and understanding across standard benchmarks.
  2. On image understanding benchmarks, BAGEL achieved an 85.0 score on MMBench and 69.3 on MMVP.
  3. For text-to-image generation, BAGEL attained a 0.88 overall score on the GenEval benchmark.
  4. The model exhibits advanced emerging capabilities in complex multimodal reasoning.
  5. The integration of Chain-of-Thought (CoT) reasoning improved BAGEL’s IntelligentBench score from 44.9 to 55.3.

Full Paper: https://arxiv.org/abs/2505.14683

8. MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

Category: Natural Language Processing
MiniMax-Speech is an autoregressive Transformer-based Text-to-Speech (TTS) model that employs a learnable speaker encoder and Flow-VAE to achieve high-quality, expressive zero-shot and one-shot voice cloning across 32 languages.

Objective: The primary objective is to develop a TTS model capable of high-fidelity, expressive zero-shot voice cloning from untranscribed reference audio.

Outcome:

  1. MiniMax-Speech achieved state-of-the-art results on objective voice-cloning metrics.
  2. The model secured the top position on the Artificial Arena leaderboard with an ELO score of 1153.
  3. In multilingual evaluations, MiniMax-Speech significantly outperformed ElevenLabs Multilingual v2 in languages with complex tonal structures.
  4. The Flow-VAE integration improved TTS synthesis, as evidenced by a test-zh zero-shot WER of 0.748.

Full Paper: https://arxiv.org/abs/2505.07916

9. Beyond ‘Aha!’: Toward Systematic Meta-Abilities Alignment

Category: Natural Language Processing
This paper introduces a systematic method to align large reasoning models (LRMs) with fundamental meta-abilities. It does so using self-verifiable synthetic tasks and a three-stage reinforcement learning pipeline.

Objective: To overcome the unreliability and unpredictability of emergent “aha moments” in LRMs by explicitly aligning them with domain-general reasoning meta-abilities (deduction, induction, and abduction).

Outcome:

  1. Meta-ability alignment (Stages A + B) transferred to unseen benchmarks: the merged 32B model showed a 3.5-percentage-point gain in overall average accuracy (48.1%) over the instruction-tuned baseline (44.6%) across math, coding, and science benchmarks.
  2. Domain-specific RL from the meta-ability-aligned checkpoint (Stage C) further boosted performance: the 32B Domain-RL-Meta model achieved a 48.8% overall average, a 4.2-point absolute gain over the 32B instruction baseline (44.6%) and a 1.4-point gain over direct RL from instruction models (47.4%).
  3. The meta-ability-aligned model demonstrated a higher frequency of targeted cognitive behaviors.
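A self-verifiable synthetic task for the deduction meta-ability can be sketched as follows. This is a hypothetical toy, far simpler than the paper's task suite: a forward-chaining verifier makes each instance programmatically checkable, so an RL reward needs no human labels.

```python
def forward_chain(facts, rules):
    """Deduction: derive everything entailed by Horn-clause rules
    (premises -> conclusion) via forward chaining."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and all(p in derived for p in premises):
                derived.add(conclusion)
                changed = True
    return derived

# One synthetic instance: the verifier above is the ground truth, so
# a model's answer can be scored automatically.
facts = {"A"}
rules = [({"A"}, "B"), ({"B"}, "C"), ({"D"}, "E")]
entailed = forward_chain(facts, rules)
reward = 1.0 if ("C" in entailed and "E" not in entailed) else 0.0
```

Analogous generators for induction (infer the rule from examples) and abduction (find a premise explaining an observation) would complete the three meta-abilities the paper targets.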

Full Paper: https://arxiv.org/abs/2505.10554

10. Chain-of-Model Learning for Language Model

Category: Natural Language Processing
This paper introduces “Chain-of-Model” (CoM), a novel learning paradigm for language models that integrates causal relationships into hidden states as a chain, enabling improved scaling efficiency and inference flexibility.

Objective: The primary objective is to address the limitations of existing LLM scaling strategies, which often require training from scratch and activate a fixed scale of parameters, by developing a framework that allows progressive model scaling, elastic inference, and more efficient training and tuning for LLMs.

Outcome:

  1. The CoLM family achieves performance comparable to standard Transformer models.
  2. Chain Expansion demonstrates performance improvements (e.g., TinyLLaMA-v1.1 with expansion showed a 0.92% improvement in average accuracy).
  3. CoLM-Air significantly accelerates prefilling (e.g., CoLM-Air achieved nearly 1.6x to 3.0x faster prefilling, and up to 27x speedup when combined with MInference).
  4. Chain Tuning boosts GLUE performance by fine-tuning only a subset of parameters.
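The chain idea can be illustrated with a minimal sketch, under a big simplification: each "chain" here is a single scalar, whereas the real CoM partitions whole blocks of hidden dimensions. The causal constraint is that output chain i may read only input chains 0..i, so any prefix of chains is itself a working sub-model, which is what enables elastic inference and progressive expansion.

```python
def chained_linear(x_chains, weights):
    """Chain-structured linear map: weights[i][j] sends input chain j
    to output chain i, and is None for j > i (the causal constraint)."""
    outputs = []
    for i, row in enumerate(weights):
        acc = 0.0
        for j in range(i + 1):  # only chains at or before position i
            acc += row[j] * x_chains[j]
        outputs.append(acc)
    return outputs

x = [1.0, 2.0, 3.0]               # three 1-d "chains"
W = [[2.0, None, None],
     [1.0, 1.0,  None],
     [0.5, 0.5,  1.0]]
full = chained_linear(x, W)        # all chains active
sub = chained_linear(x[:1], W[:1]) # elastic inference: first chain only
# sub[0] == full[0]: dropping later chains never changes earlier
# outputs, so the truncated model is a valid smaller model.
```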

Full Paper: https://arxiv.org/abs/2505.11820

Conclusion

What these LLM research papers show is that language models are now used extensively for a wide variety of purposes; their use cases have expanded far beyond text generation, the original workload they were designed for. The studies build on the plethora of frameworks and protocols that have grown up around LLMs, and they underscore how much research is concentrated in AI, machine learning, and related disciplines, making it all the more necessary to stay up to date.

With the most popular LLM research papers now at your disposal, you can build on their findings in your own work. While most of them improve upon preexisting techniques, the results they achieve can be transformative. This gives a promising outlook for further research and development in the already booming field of language models.

I specialize in reviewing and refining AI-driven research, technical documentation, and content related to emerging AI technologies. My experience spans AI model training, data analysis, and information retrieval, allowing me to craft content that is both technically accurate and accessible.
