- [2025-09] One paper accepted to the NeurIPS 2025 Datasets and Benchmarks Track!
- [2024-11] Glad to have joined Mμ Lab at IAI, PKU as a research intern! This marks the beginning of my journey into generative model research.
- [2024-10] Honored to receive the National Scholarship from the Ministry of Education of the PRC!
- [2023-10] Honored to receive the National Scholarship from the Ministry of Education of the PRC!
|
Papers
My research interests center on foundation models, with a focus on model compression, model acceleration, and related areas.
|
TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang
arXiv preprint, 2025
arxiv /
We propose Tensor-Parallel Latent Attention (TPLA): a scheme that partitions both the latent representation and each head's input dimension across devices, performs attention independently per shard, and then combines the results with an all-reduce. TPLA preserves the benefits of a compressed KV cache while unlocking tensor-parallel (TP) efficiency.
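The sketch below is a single-process simulation of the sharding idea described in the abstract: the compressed latent KV cache and each head's input dimension are split across tensor-parallel shards, attention runs independently on each shard, and the partial outputs are summed, which stands in for the all-reduce across devices. The function name, the shapes, the per-shard softmax, and the row-split output projection are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sharded_latent_attention(q, latent_kv, w_out, num_shards):
    """q: [d_latent] query slice for one head,
       latent_kv: [seq_len, d_latent] compressed KV cache,
       w_out: [d_latent, d_model] output projection."""
    d_latent = q.shape[-1]
    assert d_latent % num_shards == 0
    shard = d_latent // num_shards
    d_model = w_out.shape[-1]

    out = np.zeros(d_model)
    for r in range(num_shards):                  # each iteration plays the role of one TP rank
        sl = slice(r * shard, (r + 1) * shard)
        q_r, kv_r = q[sl], latent_kv[:, sl]      # this rank's slice of the query and latent cache
        scores = kv_r @ q_r / np.sqrt(shard)     # [seq_len] attention logits on this shard only
        ctx_r = softmax(scores) @ kv_r           # [shard] per-shard context vector
        out += ctx_r @ w_out[sl, :]              # partial output; summation emulates the all-reduce
    return out

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
seq_len, d_latent, d_model = 8, 16, 32
q = rng.normal(size=d_latent)
latent_kv = rng.normal(size=(seq_len, d_latent))
w_out = rng.normal(size=(d_latent, d_model))
y = sharded_latent_attention(q, latent_kv, w_out, num_shards=4)
print(y.shape)  # (32,)
```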
|
LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?
Ziyuan He*, Yuxuan Wang*, Jiaqi Li*, Kexin Liang, Muhan Zhang
NeurIPS Datasets and Benchmarks Track, 2025
arxiv /
code /
huggingface /
LooGLE v2 is a benchmark designed to evaluate the long-context and long-dependency capabilities of large language models. It is built around ultra-long texts with a strong emphasis on long dependencies, and its tasks are drawn entirely from real-world, real-task scenarios.
|
This homepage is based on Jon Barron's website. Last updated: Nov. 5, 2025
© 2025 Yuxuan Wang
|