AI News Daily - 2025/7/5

👨‍🔬 Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng, Qian Xu, Meng Li, Yong Gui, Yijun He, Jianing Qiu, Jindong Hong, Jiankai Sun

Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Academic paper ArXiv Importance: 8

Reveals a self-correction blind spot in LLMs and proposes a simple intervention that significantly reduces it.

👨‍🔬 Ken Tsui

Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Academic paper ArXiv Importance: 7

Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model's free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice--but we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models--even small ones--achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers align poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.

👨‍🔬 Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
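
The answer matching procedure described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Item` type, the prompt wording in `matcher_prompt`, and the `candidate` and `matcher` callables are all hypothetical names introduced here, and the yes/no verdict format is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Item:
    question: str   # question text only, with the answer options removed
    reference: str  # reference (ground-truth) answer


def matcher_prompt(item: Item, response: str) -> str:
    """Build the grading prompt. The wording is an assumption, not the paper's prompt."""
    return (
        "Question: " + item.question + "\n"
        "Reference answer: " + item.reference + "\n"
        "Candidate response: " + response + "\n"
        "Does the candidate response match the reference answer? Reply yes or no."
    )


def answer_matching_accuracy(
    items: Sequence[Item],
    candidate: Callable[[str], str],  # hypothetical: question -> free-form answer
    matcher: Callable[[str], str],    # hypothetical: grading prompt -> "yes"/"no"
) -> float:
    """Fraction of items whose free-form response the matcher judges to match."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    for item in items:
        # Step 1: the candidate model sees the question without any options.
        response = candidate(item.question)
        # Step 2: a separate model compares the response with the reference answer.
        verdict = matcher(matcher_prompt(item, response))
        if verdict.strip().lower().startswith("yes"):
            correct += 1
    return correct / len(items)
```

In practice `candidate` and `matcher` would each wrap a call to a language model. Passing them in as plain callables keeps the sketch independent of any particular model API.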

StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

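Academic paper ArXiv Importance: 7

StepHint uses multi-level stepwise hints to strengthen reasoning learned through reinforcement learning, addressing the near-miss reward problem and exploration stagnation.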

👨‍🔬 Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, Rui Yan

USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network

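Academic paper ArXiv Importance: 7

USAD improves human activity recognition performance through unsupervised data augmentation and a spatio-temporal attention diffusion network.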

👨‍🔬 Ying Yu, Hang Xiao, Siyao Li, Jiarui Li, Haotian Tang, Hanyu Liu, Chao Li

DNN-Based Precoding in RIS-Aided mmWave MIMO Systems With Practical Phase Shift

Academic paper ArXiv Importance: 7

Studies DNN-based precoding design for RIS-aided mmWave MIMO systems to improve system throughput.

👨‍🔬 Po-Heng Chou, Ching-Wen Chen, Wan-Jen Huang, Walid Saad, Yu Tsao, Ronald Y. Chang

Subtyping in DHOL -- Extended preprint

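Academic paper ArXiv Importance: 6

Extends DHOL with refinement and quotient types as special cases of subtyping, improving expressiveness and automation support.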

👨‍🔬 Colin Rothgang, Florian Rabe

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Academic paper ArXiv Importance: 6

Proposes the Agentic Benchmark Checklist (ABC), a set of guidelines for building rigorous agentic benchmarks.

👨‍🔬 Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang


📊 Today's Trend Summary

Ask HN: What's the pain using current AI algorithms?

Ask HN: Is the rate of progress in AI exponential?

NLP, AI, ML, bots – a passing trend or much more? What's your take on this?

Ask HN: Anyone concerned about NYC Local Law 144?

The AI Crackpot Index

Common Lisp + Machine Learning Internship at Google (Mountain View, CA)

The Next Bill Gates or Albert Einstein in AI “Chris Clark” – Yourobot

50% Cheaper GPUs for cloud-computing / Saving devs 50% compared to AWS

Ask HN: Dipping my toes with artificial intelligence and what to expect? (CS)

Ask HN: Thoughts on grad school? (CS PhD)

Bioinformatician

Show HN: Startup Raising capital through Book Sales

NVIDIA/DeepLearningExamples

microsoft/nni

horovod/horovod

flairNLP/flair

ivy-llc/ivy

hindupuravinash/the-gan-zoo

iterative/dvc

aleju/imgaug

vdumoulin/conv_arithmetic

gunthercox/ChatterBot

Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

Moral Responsibility or Obedience: What Do We Want from AI?

LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Answer Matching Outperforms Multiple Choice for Language Model Evaluation

StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network

DNN-Based Precoding in RIS-Aided mmWave MIMO Systems With Practical Phase Shift

Subtyping in DHOL -- Extended preprint

Establishing Best Practices for Building Rigorous Agentic Benchmarks
