AI News Daily - 2025/6/29

👨‍🔬 Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su

Whole-Body Conditioned Egocentric Video Prediction

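Academic Paper ArXiv Importance: 8

Predicts egocentric (first-person) video from whole-body human motion, and proposes an autoregressive conditional model based on a diffusion transformer.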

👨‍🔬 Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik

WorldVLA: Towards Autoregressive Action World Model

Academic Paper ArXiv Importance: 8

Proposes WorldVLA, an autoregressive action world model that combines a vision-language-action model with a world model.

👨‍🔬 Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, Hao Chen

Potemkin Understanding in Large Language Models

Academic Paper ArXiv Importance: 8

Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.

👨‍🔬 Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan

TITAN: Query-Token based Domain Adaptive Adversarial Learning

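Academic Paper ArXiv Importance: 8

Proposes TITAN, a query-token-based domain adaptive adversarial learning framework for source-free domain adaptive object detection.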

👨‍🔬 Tajamul Ashraf, Janibul Bashir

HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation

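Academic Paper ArXiv Importance: 7

Introduces HalluSegBench, the first benchmark for evaluating segmentation hallucination through counterfactual visual reasoning.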

👨‍🔬 Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Ismini Lourentzou

"What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets

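Academic Paper ArXiv Importance: 7

Analyzes how users seek health information in large-scale conversational AI datasets, revealing the diversity of user interactions and the potential risks.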

👨‍🔬 Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal

Process mining-driven modeling and simulation to enhance fault diagnosis in cyber-physical systems

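Academic Paper ArXiv Importance: 7

Proposes a novel method that combines process mining with stochastic simulation to enhance fault diagnosis in cyber-physical systems.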

👨‍🔬 Francesco Vitale, Nicola Dall'Ora, Sebastiano Gaiardelli, Enrico Fraccaroli, Nicola Mazzocca, Franco Fummi

PsyLite Technical Report

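Academic Paper ArXiv Importance: 6

Introduces PsyLite, a lightweight large language model agent for psychological counseling, whose capabilities are improved through a two-stage training strategy.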

👨‍🔬 Fangjun Ding, Renyu Zhang, Xinyu Feng, Chengye Xie, Zheng Zhang, Yanting Zhang

Ad-Hoc Human-AI Coordination Challenge

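Academic Paper ArXiv Importance: 6

Introduces the AH2AC2 challenge, which aims to overcome the high cost and poor reproducibility of human evaluation and to advance research on human-AI coordination.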

👨‍🔬 Tin Dizdarević, Ravi Hammond, Tobias Gessler, Anisoara Calinescu, Jonathan Cook, Matteo Gallici, Andrei Lupu, Jakob Nicolaus Foerster

skLEP: A Slovak General Language Understanding Benchmark

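Academic Paper ArXiv Importance: 5

Introduces skLEP, the first comprehensive benchmark designed specifically for Slovak natural language understanding models.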

👨‍🔬 Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko

📊 Today's Trend Summary

The Next Bill Gates or Albert Einstein in AI “Chris Clark” – Yourobot

Ask HN: Is the rate of progress in AI exponential?

Ask HN: What's the pain using current AI algorithms?

NLP, AI, ML, bots – a passing trend or much more? What's your take on this?

Common Lisp + Machine Learning Internship at Google (Mountain View, CA)

50% Cheaper GPUs for cloud-computing / Saving devs 50% compared to AWS

The AI Crackpot Index

Ask HN: Dipping my toes with artificial intelligence and what to expect? (CS)

Ask HN: Anyone concerned about NYC Local Law 144?

Ask HN: Thoughts on grad school? (CS PhD)

Show HN: Startup Raising capital through Book Sales

Bioinformatician

deepset-ai/haystack

mlflow/mlflow

huggingface/datasets

RasaHQ/rasa

apache/mxnet

recommenders-team/recommenders

serengil/deepface

zergtant/pytorch-handbook

fchollet/deep-learning-with-python-notebooks

amusi/CVPR2025-Papers-with-Code

bee-san/Ciphey

TheAlgorithms/C

mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Whole-Body Conditioned Egocentric Video Prediction

WorldVLA: Towards Autoregressive Action World Model

Potemkin Understanding in Large Language Models

TITAN: Query-Token based Domain Adaptive Adversarial Learning

HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation

"What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets

Process mining-driven modeling and simulation to enhance fault diagnosis in cyber-physical systems

PsyLite Technical Report

Ad-Hoc Human-AI Coordination Challenge

skLEP: A Slovak General Language Understanding Benchmark
