Episodios

  • Mathematical exploration and discovery at scale
    Nov 15 2025
    In this episode, we discuss Mathematical exploration and discovery at scale by Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, Adam Zsolt Wagner. AlphaEvolve is an evolutionary coding agent that combines large language models with automated evaluation to iteratively generate and refine solutions for complex mathematical problems. It successfully rediscovered and improved known solutions across various math domains and can generalize results into universal formulas. When integrated with proof assistants, AlphaEvolve enables automated proof generation, demonstrating significant potential for advancing mathematical discovery and optimization.
    Más Menos
    8 m
  • Kosmos: An AI Scientist for Autonomous Discovery
    Nov 12 2025
    In this episode, we discuss Kosmos: An AI Scientist for Autonomous Discovery by Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagorac, Timothy C. Orr, Miranda E. Orr, Kevin J. Zwezdaryk, Ali E. Ghareeb, Laurie McCoy, Bruna Gomes, Euan A. Ashley, Karen E. Duff, Tonio Buonassisi, Tom Rainforth, Randall J. Bateman, Michael Skarlinski, Samuel G. Rodriques, Michaela M. Hinks, Andrew D. White. The paper presents Kosmos, an AI scientist that autonomously conducts data-driven discovery by iteratively analyzing data, searching literature, and generating hypotheses over extended periods. Kosmos uses a structured world model to integrate information across agents, enabling coherent research workflows involving extensive code execution and literature review. Evaluations show Kosmos produces highly accurate and traceable scientific reports with discoveries spanning multiple fields, some reproducing unpublished work and others novel.
    Más Menos
    9 m
  • World Simulation with Video Foundation Models for Physical AI
    Nov 8 2025
    In this episode, we discuss World Simulation with Video Foundation Models for Physical AI by NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu. The paper presents Cosmos-Predict2.5, a unified flow-based model that integrates Text2World, Image2World, and Video2World generation, enhanced by Cosmos-Reason1 for improved text grounding and control. Trained on 200M videos and refined with reinforcement learning, it outperforms its predecessor in video quality and instruction alignment, supporting robotics and autonomous system simulations. Additionally, Cosmos-Transfer2.5 enables high-fidelity Sim2Real and Real2Real translation with smaller model size, and both models and resources are released openly to advance Physical AI research.
    Más Menos
    10 m
  • Towards Robust Mathematical Reasoning
    Nov 6 2025
    In this episode, we discuss Towards Robust Mathematical Reasoning by Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung. The paper introduces IMO-Bench, a new suite of challenging mathematical reasoning benchmarks based on International Mathematical Olympiad problems to better evaluate foundation models. Their model, Gemini Deep Think, achieved state-of-the-art results, surpassing previous models significantly on both answer accuracy and proof-writing tasks. The authors also developed reliable autograders aligned with human evaluations and released the benchmark suite publicly to advance robust mathematical reasoning.
    Más Menos
    8 m
  • ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
    Nov 4 2025
    In this episode, we discuss ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models by Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong. This paper introduces ProRL, a new reinforcement learning training method that uncovers novel reasoning strategies beyond those found in base language models. Empirical results show that models trained with ProRL consistently outperform base models on challenging reasoning tasks, including cases where base models fail even with extensive attempts. The study demonstrates that prolonged RL can meaningfully expand reasoning capabilities by exploring new solution spaces over time, advancing understanding of how RL enhances language model reasoning.
    Más Menos
    7 m
  • Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
    Oct 28 2025
    In this episode, we discuss Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models by Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri. The paper introduces Roboflow100-VL, a large benchmark of 100 diverse multi-modal object detection datasets designed to test vision-language models (VLMs) on out-of-distribution concepts beyond typical pre-training data. It demonstrates that state-of-the-art VLMs perform poorly in zero-shot settings on challenging domains like medical imaging, highlighting the importance of few-shot concept alignment through annotated examples and rich text. The paper also presents results from a CVPR 2025 competition where the winning approach significantly outperforms baselines in few-shot detection tasks.
    Más Menos
    7 m
  • ImpossibleBench: Measuring LLMs’ Propensity of Exploiting Test Cases
    Oct 27 2025
    In this episode, we discuss ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases by Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini. The paper introduces ImpossibleBench, a benchmark framework designed to measure and analyze large language models' tendency to cheat by exploiting test cases. It creates tasks with conflicting specifications and unit tests to quantify how often models take shortcuts that violate intended behavior. The framework is used to study cheating behaviors, refine prompting strategies, and develop tools to detect and reduce such deceptive practices in LLMs.
    Más Menos
    8 m
  • Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
    Oct 27 2025
    In this episode, we discuss Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset by Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, Yinghao Xu, Yujun Shen, Qifeng Chen. The paper presents Ditto, a comprehensive framework that generates large-scale, high-quality training data for instruction-based video editing by combining an advanced image editor with an in-context video generator. Ditto uses an efficient, distilled model with a temporal enhancer and an intelligent agent to ensure scalable, diverse, and high-fidelity video edits. Leveraging this framework, the authors created the Ditto-1M dataset and trained the Editto model, achieving state-of-the-art performance in following editing instructions.
    Más Menos
    7 m