-
The Future of Notebooks - with Akshay Agrawal of Marimo
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-07-18 13:00
Akshay Agrawal joins us to talk about Marimo and their vision for the future of Python notebooks, and how it’s the perfect canvas for AI-driven data analysis. 0:00 Introduction 0:46 Overview of Marimo and Its Features 2:33 Origin Story and Motivation Behind Marimo 4:26 Demo: Classical Machine Learning with MNIST in Marimo 6:52 Notebook Compatibility and Conversion from Jupyter 7:42 Demo: Interactive Notebook with Custom UI and Layout 10:08 AI-Native Utilities and Code Generation with Language Models 11:36 Dependency Management and Integration with UV Package Manager 13:00 Demo: Data Annotation Workflow Using a PS5 Controller 15:51 Starting from Scratch: Blank Canvas AI Use Cases 18:27 Context Formatting for AI Code Generation 19:54 Chat Interface and Local/Remote Model Support 21:01 WebAssembly Support and MoLab Cloud-Hosted Notebooks 23:21 Future Plans and Breaking Out of Old Notebook Habits 25:40 Running Marimo Notebooks as Scripts or Data Apps 26:44 Exploring AI Agents and Community Contributions 26:56 Call to Action: How to Get Started and Contribute
-
Cline: the open source coding agent that doesn't cut costs
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-07-16 18:08
Saoud Rizwan and Pash from Cline joined us to talk about why fast apply models got bitter lesson'd, how they pioneered the plan + act paradigm for coding, and why non-technical people use IDEs to do marketing and generate slides. Full writeup: https://www.latent.space/p/cline X: https://x.com/latentspacepod Chapters: 00:00 - Introductions 01:35 - Plan and Act Paradigm 05:37 - Model Evaluation and Early Development of Cline 08:14 - Use Cases of Cline Beyond Coding 09:09 - Why Cline is a VS Code Extension and Not a Fork 12:07 - Economic Value of Programming Agents 16:07 - Early Adoption for MCPs 19:35 - Local vs Remote MCP Servers 22:10 - Anthropic's Role in MCP Registry 22:49 - Most Popular MCPs and Their Use Cases 25:26 - Challenges and Future of MCP Monetization 27:32 - Security and Trust Issues with MCPs 28:56 - Alternative History Without MCP 29:43 - Market Positioning of Coding Agents and IDE Integration Matrix 32:57 - Visibility and Autonomy in Coding Agents 35:21 - Evolving Definition of Complexity in Programming Tasks 38:16 - Forks of Cline and Open Source Regrets 40:07 - Simplicity vs Complexity in Agent Design 46:33 - How Fast Apply Got Bitter Lesson'd 49:12 - Cline's Business Model and Bring-Your-Own-API-Key Approach 54:18 - Integration with OpenRouter and Enterprise Infrastructure 55:32 - Impact of Declining Model Costs 57:48 - Background Agents and Multi-Agent Systems 1:00:42 - Vision and Multi-Modalities 1:01:07 - State of Context Engineering 1:07:37 - Memory Systems in Coding Agents 1:10:14 - Standardizing Rules Files Across Agent Tools 1:11:16 - Cline's Personality and Anthropomorphization 1:12:55 - Hiring at Cline and Team Culture Chapters 00:00:00 Introduction and Guest Intros 00:00:29 What is Cline? Product Overview 00:01:42 Plan and Act Paradigm 00:05:22 Model Evolution and Building Cline 00:07:40 Beyond Coding: Cline as a General Agent 00:09:12 Why Focus on VS Code Extension? 
00:11:26 The Future of Programming and Agentic Paradigm 00:12:34 Economic Value: Programming vs. Other Use Cases 00:16:04 MCP Ecosystem: Growth and Marketplace 00:21:30 Security, Discoverability, and Trust in MCPs 00:22:55 Popular MCPs and Workflow Automation 00:25:30 Monetization and Payments for MCPs 00:37:53 Competition, Forks, and Open Source Philosophy 00:40:39 RAG, Fast Apply, and Agentic Simplicity 00:50:11 Business Model and Enterprise Adoption 00:57:04 Background Agents, Multi-Agent Systems, and CLI 01:00:41 Context Engineering and Memory 01:12:39 Team, Culture, and Closing Thoughts
-
Personalized AI Language Education — with Andrew Hsu, Speak
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-07-11 19:06
Speak (https://speak.com) may not be very well known to native English speakers, but after a slow start in 2016 it has emerged as one of OpenAI's favorite partners: the OpenAI Startup Fund led and joined its Series B and C, making Speak one of the new AI-native unicorns, and OpenAI noted that “Speak has the potential to revolutionize not just language learning, but education broadly”. Today we speak with Speak’s CTO, Andrew Hsu, on the journey of building the “3rd generation” of language learning software (with Rosetta Stone being Gen 1 and Duolingo being Gen 2). Speak’s premise is that speech and language models can now do what was previously only possible with human tutors—provide fluent, responsive, and adaptive instruction—and this belief has shaped its product and company strategy since its early days. https://www.linkedin.com/in/adhsu/ https://speak.com One of the most interesting strategic decisions discussed in the episode is Speak’s early focus on South Korea. While counterintuitive for a San Francisco-based startup, the decision was influenced by a combination of market opportunity and founder proximity via a Korean first employee. South Korea’s intense demand for English fluency and a highly competitive education market made it a proving ground for a deeply AI-native product. By succeeding in a market saturated with human-based education solutions, Speak validated its model and built strong product-market fit before expanding to other Asian markets and, eventually, globally. The arrival of Whisper and GPT-based LLMs in 2022 marked a turning point for Speak. Suddenly, capabilities that were once theoretical—real-time feedback, semantic understanding, conversational memory—became technically feasible. Speak didn’t pivot, but rather evolved into its second phase: from a supplemental practice tool to a full-featured language tutor. 
This transition required significant engineering work, including building custom ASR models, managing latency, and integrating real-time APIs for interactive lessons. It also unlocked the possibility of developing voice-first, immersive roleplay experiences and a roadmap to real-time conversational fluency. To scale globally and support many languages, Speak is investing heavily in AI-generated curriculum and content. Instead of manually scripting all lessons, they are building agents and pipelines that can scaffold curriculum, generate lesson content, and adapt pedagogically to the learner. This ties into one of Speak’s most ambitious goals: creating a knowledge graph that captures what a learner knows and can do in a target language, and then adapting the course path accordingly. This level-adjusting tutor model aims to personalize learning at scale and could eventually be applied beyond language learning to any educational domain. Finally, the conversation touches on the broader implications of AI-powered education and the slow real-world adoption of transformative AI technologies. Despite the capabilities of GPT-4 and others, most people’s daily lives haven’t changed dramatically. Speak sees itself as part of the generation of startups that will translate AI’s raw power into tangible consumer value. The company is also a testament to long-term conviction—founded in 2016, it weathered years of slow growth before AI caught up to its vision. Now, with over $50M ARR, a growing B2B arm, and plans to expand across languages and learning domains, Speak represents what AI-native education could look like in the next decade. Chapters 00:00:00 Introductions & Thiel Fellowship Origins 00:02:13 Genesis of Speak: Early Vision & Market Focus 00:03:44 Building the Product: Iterations and Lessons Learned 00:10:59 AI’s Role in Language Learning 00:13:49 Scaling Globally & B2B Expansion 00:16:30 Why Korea? Localizing for Success 00:19:08 Content Creation, The Speak Method, and Engineering Culture 00:23:31 The Impact of Whisper and LLM Advances 00:29:08 AI-Generated Content & Measuring Fluency 00:35:30 Personalization, Dialects, and Pronunciation 00:39:38 Immersive Learning, Multimodality, and Real-Time Voice 00:50:02 Engineering Challenges & Company Culture 00:53:20 Beyond Languages: B2B, Knowledge Graphs, and Broader Learning 00:57:32 Fun Stories, Lessons, and Reflections 01:02:03 Final Thoughts: The Future of AI Learning & Slow Takeoff
-
AI Video Is Eating The World — Olivia and Justine Moore, a16z
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-07-09 19:26
When the first video diffusion models started emerging, they were little more than “moving pictures” - still frames extended a few seconds in either direction in time. There was a ton of excitement about OpenAI’s Sora on release through 2024, but so far only Sora-lite has been widely released. Meanwhile, other good videogen models like Genmo Mochi, Pika, MiniMax T2V, Tencent Hunyuan Video, and Kuaishou’s Kling have emerged, but the reigning king this year seems to be Google’s Veo 3, which for the first time has added native audio generation into its model capabilities, eliminating the need for a whole class of lip-syncing tooling and SFX editing. The rise of Veo 3 unlocks a whole new category of AI video creators that many of our audience may not have been exposed to, but which is undeniably effective and important, particularly in the “kids” and “brainrot” segments of global consumer internet platforms like TikTok, YouTube, and Instagram. By far the best documentarians of these trends for laypeople are Olivia and Justine Moore, both partners at a16z, who not only collate the best examples from all over the web, but dabble in video creation themselves to put theory into practice. We’ve been thinking of dabbling in AI brainrot on a secondary channel for Latent Space, so we wanted to get the braindump from the Moore twins on how to make a Latent Space Brainrot channel. Jump on in! Chapters 00:00:00 Introductions & Guest Welcome 00:00:49 The Rise of Generative Media 00:02:24 AI Video Trends: Italian Brain Rot & Viral Characters 00:05:00 Following Trends & Creating AI Content 00:07:17 Hands-On with AI Video Creation 00:18:36 Monetization & Business of AI Content 00:23:34 Platforms, Models, and the Creator Stack 00:37:22 Native Content vs. Clipping & Going Viral 00:41:52 Prompt Theory & Meta-Trends in AI Creativity 00:47:42 Professional, Commercial, and Platform-Specific AI Video 00:48:57 Wrap-Up & Final Thoughts
-
Information Theory for Language Models: Jack Morris
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-07-02 16:06
Our last AI PhD grad student feature was Shunyu Yao, who happened to focus on Language Agents for his thesis and immediately went to work on them for OpenAI. Our pick this year is Jack Morris, who bucks the “hot” trends by -not- working on agents, benchmarks, or VS Code forks, and is instead known for his work on the information-theoretic understanding of LLMs, starting from embedding models and latent space representations (always close to our heart). Jack is an unusual combination: he does underrated research but is still able to explain it well to a mass audience, so we felt this was a good opportunity to do a different kind of episode, going through the greatest hits of a high-profile AI PhD and relating them to questions from AI Engineering. Papers and References made
AI grad school: https://x.com/jxmnop/status/1933884519557353716
A new type of information theory: https://x.com/jxmnop/status/1904238408899101014
Embeddings
Text Embeddings Reveal (Almost) As Much As Text: https://arxiv.org/abs/2310.06816
Contextual document embeddings: https://arxiv.org/abs/2410.02525
Harnessing the Universal Geometry of Embeddings: https://arxiv.org/abs/2505.12540
Language models
GPT-style language models memorize 3.6 bits per param: https://x.com/jxmnop/status/1929903028372459909
Approximating Language Model Training Data from Weights: https://arxiv.org/abs/2506.15553 and https://x.com/jxmnop/status/1936044666371146076
LLM Inversion
"There Are No New Ideas In AI... Only New Datasets": https://x.com/jxmnop/status/1910087098570338756 and https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only
misc reference: https://junyanz.github.io/CycleGAN/ — for others hiring AI PhDs, Jack also wanted to shout out Zach Nussbaum, his coauthor on Nomic Embed: Training a Reproducible Long Context Text Embedder.
-
Scaling Test Time Compute to Multi-Agent Civilizations — Noam Brown, OpenAI
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-06-19 18:59
Solving Poker and Diplomacy, Debating RL+Reasoning with Ilya, what's *wrong* with the System 1/2 analogy, and where Test-Time Compute hits a wall Timestamps 00:00 Intro – Diplomacy, Cicero & World Championship 02:00 Reverse Centaur: How AI Improved Noam’s Human Play 05:00 Turing Test Failures in Chat: Hallucinations & Steerability 07:30 Reasoning Models & Fast vs. Slow Thinking Paradigm 11:00 System 1 vs. System 2 in Visual Tasks (GeoGuessr, Tic-Tac-Toe) 14:00 The Deep Research Existence Proof for Unverifiable Domains 17:30 Harnesses, Tool Use, and Fragility in AI Agents 21:00 The Case Against Over-Reliance on Scaffolds and Routers 24:00 Reinforcement Fine-Tuning and Long-Term Model Adaptability 28:00 Ilya’s Bet on Reasoning and the O-Series Breakthrough 34:00 Noam’s Dev Stack: Codex, Windsurf & AGI Moments 38:00 Building Better AI Developers: Memory, Reuse, and PR Reviews 41:00 Multi-Agent Intelligence and the “AI Civilization” Hypothesis 44:30 Implicit World Models and Theory of Mind Through Scaling 48:00 Why Self-Play Breaks Down Beyond Go and Chess 54:00 Designing Better Benchmarks for Fuzzy Tasks 57:30 The Real Limits of Test-Time Compute: Cost vs. Time 1:00:30 Data Efficiency Gaps Between Humans and LLMs 1:03:00 Training Pipeline: Pretraining, Midtraining, Posttraining 1:05:00 Games as Research Proving Grounds: Poker, MTG, Stratego 1:10:00 Closing Thoughts – Five-Year View and Open Research Directions Chapters 00:00:00 Intro & Guest Welcome 00:00:33 Diplomacy AI & Cicero Insights 00:03:49 AI Safety, Language Models, and Steerability 00:05:23 O Series Models: Progress and Benchmarks 00:08:53 Reasoning Paradigm: Thinking Fast and Slow in AI 00:14:02 Design Questions: Harnesses, Tools, and Test Time Compute 00:20:32 Reinforcement Fine-tuning & Model Specialization 00:21:52 The Rise of Reasoning Models at OpenAI 00:29:33 Data Efficiency in Machine Learning 00:33:21 Coding & AI: Codex, Workflows, and Developer Experience 00:41:38 Multi-Agent AI: Collaboration, Competition, and Civilization 00:45:14 Poker, Diplomacy & Exploitative vs. Optimal AI Strategy 00:52:11 World Models, Multi-Agent Learning, and Self-Play 00:58:50 Generative Media: Image & Video Models 01:00:44 Robotics: Humanoids, Iteration Speed, and Embodiment 01:04:25 Rapid Fire: Research Practices, Benchmarks, and AI Progress 01:14:19 Games, Imperfect Information, and AI Research Directions
-
The Shape of Compute (Chris Lattner of Modular)
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-06-13 16:40
Chris Lattner of Modular (https://modular.com) joined us (again!) to talk about how they are breaking the CUDA monopoly, what it took to match NVIDIA performance with AMD, and how they are building a company of "elite nerds". X: https://x.com/latentspacepod Substack: https://latent.space 00:00:00 Introductions 00:00:12 Overview of Modular and the Shape of Compute 00:02:27 Modular’s R&D Phase 00:06:55 From CPU Optimization to GPU Support 00:11:14 MAX: Modular’s Inference Framework 00:12:52 Mojo Programming Language 00:18:25 MAX Architecture: From Mojo to Cluster-Scale Inference 00:29:16 Open Source Contributions and Community Involvement 00:32:25 Modular's Differentiation from VLLM and SGLang 00:41:37 Modular’s Business Model and Monetization Strategy 00:53:17 DeepSeek’s Impact and Low-Level GPU Programming 01:00:00 Inference Time Compute and Reasoning Models 01:02:31 Personal Reflections on Leading Modular 01:08:27 Daily Routine and Time Management as a Founder 01:13:24 Using AI Coding Tools and Staying Current with Research 01:14:47 Personal Projects and Work-Life Balance 01:17:05 Hiring, Open Source, and Community Engagement
-
The Utility of Interpretability — Emmanuel Ameisen
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-06-06 17:00
Emmanuel Ameisen is lead author of “Circuit Tracing: Revealing Computational Graphs in Language Models” (https://transformer-circuits.pub/2025/attribution-graphs/methods.html), which is part of a duo of MechInterp papers that Anthropic published in March (alongside https://transformer-circuits.pub/2025/attribution-graphs/biology.html). We recorded the initial conversation a month ago, but then held off publishing until the open source tooling for the graph generation discussed in this work was released last week: https://www.anthropic.com/research/open-source-circuit-tracing This is a two-part episode - an intro covering the open source release, then a deeper dive into the paper — with guest host Vibhu Sapra (https://x.com/vibhuuuus) and Mochi the MechInterp Pomsky (https://x.com/mochipomsky). Thanks to Vibhu for making this episode happen! While the original blogpost contained some fantastic guided visualizations (which we discuss at the end of this pod!), with the notebook and Neuronpedia visualization (https://www.neuronpedia.org/gemma-2-2b/graph) released this week, you can now explore on your own with Neuronpedia, as we show you in the video version of this pod. Chapters 00:00 Intro & Guest Introductions 01:00 Anthropic's Circuit Tracing Release 06:11 Exploring Circuit Tracing Tools & Demos 13:01 Model Behaviors and User Experiments 17:02 Behind the Research: Team and Community 24:19 Main Episode Start: Mech Interp Backgrounds 25:56 Getting Into Mech Interp Research 31:52 History and Foundations of Mech Interp 37:05 Core Concepts: Superposition & Features 39:54 Applications & Interventions in Models 45:59 Challenges & Open Questions in Interpretability 57:15 Understanding Model Mechanisms: Circuits & Reasoning 01:04:24 Model Planning, Reasoning, and Attribution Graphs 01:30:52 Faithfulness, Deception, and Parallel Circuits 01:40:16 Publishing Risks, Open Research, and Visualization 01:49:33 Barriers, Vision, and Call to Action
-
[AIEWF Preview] Containing Agent Chaos — Solomon Hykes
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-06-03 13:30
Solomon most famously created Docker and now runs Dagger… which has something special to share with you on Thursday. Catch Dagger at: - Tuesday: Dagger’s workshop https://www.ai.engineer/schedule#ship-agents-that-ship-a-hands-on-workshop-for-swe-agent-builders - Wednesday: Dagger’s talk: https://www.ai.engineer/schedule#how-to-trust-an-agent-with-software-delivery - Thursday: Solomon’s Keynote https://www.ai.engineer/schedule#containing-agent-chaos Chapters 00:00 Introduction & Guest Background 00:29 What is Dagger? Post-Development Automation 01:08 Dagger’s Community & Platform Engineers 02:32 AI Agents and Developer Workflows 03:40 Environment Isolation & The Power of Containers 06:28 The Need for Standards in Agent Environments 07:25 Design Constraints & Challenges for Dev Environments 11:26 Limitations of Current Tools & Agent-Native UX 14:11 Modularity, Customization, and the Lego Analogy 16:24 Convergence of CICD and Agentic Systems 17:41 Ephemeral Apps, Resource Constraints, and Local Execution 21:01 Adoption, Ecosystem, and the Role of Open Source 23:30 Dagger’s Modular Approach & Integration Philosophy 25:38 Looking Ahead: Workshops, Keynotes, and the Future of Agentic Infrastructure
-
[AIEWF Preview] CloudChef: Your Robot Chef - Michelin-Star food at $12/hr (w/ Kitchen tour!)
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-05-31 01:06
One of the new tracks at next week’s AI Engineer conference in SF is a focus on LLMs + Robotics, ft. household names like Waymo and Physical Intelligence. However, there are many other companies applying LLMs and VLMs in the real world! CloudChef, the first industrial-scale kitchen robotics company with one-shot demonstration learning and an incredibly simple business model, will be serving tasty treats all day with Zippy (https://www.cloudchef.co/zippy), their AI Chef platform. This is a lightning pod with CEO Nikhil Abraham to preview what Zippy is capable of! https://www.cloudchef.co/platform See a real chef comparison: https://www.youtube.com/watch?v=INDhZ7LwSeo&t=64s See it at the AI Engineer Expo in SF next week: https://ai.engineer Chapters 00:00 Welcome and Introductions 00:58 What is Cloud Chef? 01:36 How the Robots Work: Culinary Intelligence 05:57 Commercial Applications and Early Success 07:02 The Software-First Approach 10:09 Business Model and Pricing 13:10 Demonstration Learning: Training the Robots 16:03 Call to Action and Engineering Opportunities 18:45 Final Thoughts and Technical Details
-
The AI Coding Factory
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-05-29 17:37
We are joined by Eno Reyes and Matan Grinberg, the co-founders of Factory.ai. They are building droids for autonomous software engineering, handling everything from code generation to incident response for production outages. After raising a $15M Series A from Sequoia, they just released their product in GA! https://factory.ai/ https://x.com/latentspacepod Chapters 00:00:00 Introductions 00:00:35 Meeting at Langchain Hackathon 00:04:02 Building Factory despite early model limitations 00:06:56 What is Factory AI? 00:08:55 Delegation vs Collaboration in AI Development Tools 00:10:06 Naming Origins of 'Factory' and 'Droids' 00:12:17 Defining Droids: Agent vs Workflow 00:14:34 Live Demo 00:17:37 Enterprise Context and Tool Integration in Droids 00:20:26 Prompting, Clarification, and Agent Communication 00:22:28 Project Understanding and Proactive Context Gathering 00:24:10 Why SWE-Bench Is Dead 00:28:47 Model Fine-tuning and Generalization Challenges 00:31:07 Why Factory is Browser-Based, Not IDE-Based 00:33:51 Test-Driven Development and Agent Verification 00:36:17 Retrieval vs Large Context Windows for Cost Efficiency 00:38:02 Enterprise Metrics: Code Churn and ROI 00:40:48 Executing Large Refactors and Migrations with Droids 00:45:25 Model Speed, Parallelism, and Delegation Bottlenecks 00:50:11 Observability Challenges and Semantic Telemetry 00:53:44 Hiring 00:55:19 Factory's design and branding approach 00:58:34 Closing Thoughts and Future of AI-Native Development
-
[AIEWF Preview] Multi-Turn RL for Multi-Hour Agents — with Will Brown, Prime Intellect
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-05-23 05:01
In an otherwise heavy week packed with Microsoft Build, Google I/O, and OpenAI io, the worst kept secret in biglab land was the launch of Claude 4, particularly the triumphant return of Opus, which many had been clamoring for. We will leave the specific Claude 4 recap to AINews; however, we think that both Gemini’s progress on Deep Think this week and Claude 4 represent the next frontier of progress on inference-time compute/reasoning (at least until GPT-5 ships this summer). Will Brown’s talk at AIE NYC and open source work on verifiers have made him one of the most prominent voices able to publicly discuss (aka without the vagueposting LoRA they put on you when you join a biglab) the current state of the art in reasoning models and where current SOTA research directions lead. We discussed his latest paper on Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment, and he has previewed his AIEWF talk on Agentic RL for those with the temerity to power thru bad meetup audio. Chapters 00:00 Introduction and Episode Overview 02:01 Discussion on Claude 4 and its Features 04:31 Reasoning and Tool Use in AI Models 07:01 Extended Thinking in Claude and Model Differences 09:31 Speculation on Claude's Extended Thinking 11:01 Challenges and Controversies in AI Model Training 13:31 Technical Highlights and Code Trustworthiness 16:01 Token Costs and Incentives in AI Models 18:31 Thinking Budgets and AI Effort 21:01 Safety and Ethics in AI Model Development 23:31 Anthropic's Approach to AI Safety 26:01 LLM Arena and Evaluation Challenges 28:31 Developing Taste and Direction in AI Research 31:01 Recent Research and Multi-Turn RL 33:31 Tools and Incentives in AI Model Development 36:01 Challenges in Evaluating AI Model Outputs 38:31 Model-Based Rewards and Future Directions 41:01 Wrap-up and Future Plans
-
ChatGPT Codex: The Missing Manual
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-05-16 23:35
ChatGPT Codex is here - the first cloud hosted Autonomous Software Engineer (A-SWE) from OpenAI. We sat down for a quick pod with two core devs on the ChatGPT Codex team: Josh Ma and Alexander Embiricos to get the inside scoop on the origin story of Codex, from WHAM to its future roadmap. Follow them: https://github.com/joshma and https://x.com/embirico Chapters - 00:00 Introduction to the Latent Space Podcast - 00:59 The Launch of ChatGPT Codex - 03:08 Personal Journeys into AI Development - 05:50 The Evolution of Codex and AI Agents - 08:55 Understanding the Form Factor of Codex - 11:48 Building a Software Engineering Agent - 14:53 Best Practices for Using AI Agents - 17:55 The Importance of Code Structure for AI - 21:10 Navigating Human and AI Collaboration - 23:58 Future of AI in Software Development - 28:18 Planning and Decision-Making in AI Development - 31:37 User, Developer, and Model Dynamics - 35:28 Building for the Future: Long-Term Vision - 39:31 Best Practices for Using AI Tools - 42:32 Understanding the Compute Platform - 48:01 Iterative Deployment and Future Improvements
-
Claude Code: Anthropic's CLI Agent
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-05-07 21:59
More info: https://docs.anthropic.com/en/docs/claude-code/overview The AI coding wars have now split across four battlegrounds: 1. AI IDEs: with two leading startups in Windsurf ($3B acq. by OpenAI) and Cursor ($9B valuation) and a sea of competition behind them (like Cline, GitHub Copilot, etc). 2. Vibe coding platforms: Bolt.new, Lovable, v0, etc., all experiencing fast growth and getting to tens of millions in revenue within months. 3. The teammate agents: Devin, Cosine, etc. Simply give them a task, and they will get back to you with a full PR (with mixed results). 4. The CLI-based agents: after Aider’s initial success, we are now seeing many other alternatives, including two from the main labs: OpenAI Codex and Claude Code. The main draw is that 1) they are composable and 2) they are pay-as-you-go based on tokens used. Since we covered all three of the first categories, today’s guests are Boris and Cat, the lead engineer and PM for Claude Code. If you only take one thing away from this episode, it’s this piece from Boris: Claude Code is not a product so much as it’s a Unix utility. This fits very well with Anthropic’s product principle: “do the simple thing first.” Whether it’s the memory implementation (a markdown file that gets auto-loaded) or the approach to prompt summarization (just ask Claude to summarize), they always pick the smallest building blocks that are useful, understandable, and extensible. Even major features like planning (“/think”) and memory (#tags in markdown) fit the same idea of having text I/O as the core interface. This is very similar to the original UNIX design philosophy. Claude Code is also the most direct way to consume Sonnet for coding, rather than going through all the hidden prompting and optimization that the other products do. You will feel that right away, as the average spend per user is $6/day on Claude Code, compared to $20/mo for Cursor, for example. Apparently, there are some engineers inside of Anthropic that have spent >$1,000 in one day! If you’re building AI developer tools, there’s also a lot of alpha in how to design a CLI tool, interactive vs non-interactive modes, and how to balance feature creation. Enjoy! Timestamps [00:00:00] Intro [00:01:59] Origins of Claude Code [00:04:32] Anthropic’s Product Philosophy [00:07:38] What should go into Claude Code? [00:09:26] Claude.md and Memory Simplification [00:10:07] Claude Code vs Aider [00:11:23] Parallel Workflows and Unix Utility Philosophy [00:12:51] Cost considerations and pricing model [00:14:51] Key Features Shipped Since Launch [00:16:28] Claude Code writes 80% of Claude Code [00:18:01] Custom Slash Commands and MCP Integration [00:21:08] Terminal UX and Technical Stack [00:27:11] Code Review and Semantic Linting [00:28:33] Non-Interactive Mode and Automation [00:36:09] Engineering Productivity Metrics [00:37:47] Balancing Feature Creation and Maintenance [00:41:59] Memory and the Future of Context [00:50:10] Sandboxing, Branching, and Agent Planning [01:01:43] Future roadmap [01:11:00] Why Anthropic Excels at Developer Tools
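The “markdown file that gets auto-loaded” memory design is simple enough to sketch. A minimal illustration of the pattern, not Claude Code's actual implementation (the function name and prompt format here are invented):

```python
from pathlib import Path

# Hypothetical sketch of the memory-as-a-markdown-file pattern:
# if a CLAUDE.md exists in the project root, load it and prepend
# it to the context sent to the model.
def build_context(project_root: str, base_prompt: str) -> str:
    memory_file = Path(project_root) / "CLAUDE.md"
    if memory_file.exists():
        memory = memory_file.read_text()
        # Prepend the project's persisted notes before the task prompt.
        return f"# Project memory\n{memory}\n\n{base_prompt}"
    return base_prompt

# With no CLAUDE.md present, the prompt passes through unchanged.
print(build_context("/tmp/nonexistent-project", "Fix the failing test."))
```

The point of the Unix-utility framing is that the whole feature is just file I/O plus string concatenation: easy to inspect, version-control, and extend.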
-
⚡️The Rise and Fall of the Vector DB Category
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-05-01 16:34
Note from your hosts: we were off this week for ICLR and RSA! This week we’re bringing you one of the top episodes from our lightning podcast series, the shorter-format, YouTube-only side podcast we do for breaking news and faster turnaround. Please support our work on YouTube! https://www.youtube.com/playlist?list=PLWEAb1SXhjlc5qgVK4NgehdCzMYCwZtiB The explosion of embedding-based applications created a new challenge: efficiently storing, indexing, and searching these high-dimensional vectors at scale. This gap gave rise to the vector database category, with companies like Pinecone leading the charge in 2022-2023 by defining specialized infrastructure for vector operations. The category saw explosive growth following ChatGPT's launch in late 2022, as developers rushed to build AI applications using Retrieval-Augmented Generation (RAG). This surge was partly driven by a widespread misconception that embedding-based similarity search was the only viable method for retrieving context for LLMs! The resulting "vector database gold rush" saw massive investment and attention directed toward vector search infrastructure, even though traditional information retrieval techniques remained equally valuable for many RAG applications. https://x.com/jobergum/status/1872923872007217309 Chapters 00:00 Introduction to Trondheim and Background 03:03 The Rise and Fall of Vector Databases 06:08 Convergence of Search Technologies 09:04 Embeddings and Their Importance 12:03 Building Effective Search Systems 15:00 RAG Applications and Recommendations 17:55 The Role of Knowledge Graphs 20:49 Future of Embedding Models and Innovations
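To see why similarity search was never the only retrieval option, it helps to look at how small the core operation actually is. A toy sketch of embedding-based retrieval (the 4-dimensional vectors are invented for illustration; real systems use learned embedding models with hundreds or thousands of dimensions):

```python
import math

# Made-up "embeddings" for three documents and a query.
docs = {
    "doc_a": [0.9, 0.1, 0.0, 0.2],
    "doc_b": [0.1, 0.8, 0.3, 0.0],
    "doc_c": [0.85, 0.2, 0.1, 0.1],
}
query = [1.0, 0.0, 0.0, 0.1]

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```

Vector databases exist to run this ranking over millions of vectors with approximate nearest-neighbor indexes; classic lexical scoring like BM25 fills the same retrieval slot in a RAG pipeline without any embeddings at all, which is the episode's point.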
-
Why Every Agent needs Open Source Cloud Sandboxes
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-04-24 01:57
Vasek Mlejnsky from E2B joins us today to talk about sandboxes for AI agents. In the last 2 years, E2B has grown from a handful of developers building on it to being used by ~50% of the Fortune 500 and generating millions of sandboxes each week for their customers. As the “death of chat completions” approaches, LLM workflows and agents are relying more and more on tool usage and multi-modality. The most common use cases for their sandboxes: - Run data analysis and charting (like Perplexity) - Execute arbitrary code generated by the model (like Manus does) - Running evals on code generation (see LMArena Web) - Doing reinforcement learning for code capabilities (like HuggingFace) Timestamps: 00:00:00 Introductions 00:00:37 Origin of DevBook -> E2B 00:02:35 Early Experiments with GPT-3.5 and Building AI Agents 00:05:19 Building an Agent Cloud 00:07:27 Challenges of Building with Early LLMs 00:10:35 E2B Use Cases 00:13:52 E2B Growth vs Models Capabilities 00:15:03 The LLM Operating System (LLMOS) Landscape 00:20:12 Breakdown of JavaScript vs Python Usage on E2B 00:21:50 AI VMs vs Traditional Cloud 00:26:28 Technical Specifications of E2B Sandboxes 00:29:43 Usage-based billing infrastructure 00:34:08 Pricing AI on Value Delivered vs Token Usage 00:36:24 Forking, Checkpoints, and Parallel Execution in Sandboxes 00:39:18 Future Plans for Toolkit and Higher-Level Agent Frameworks 00:42:35 Limitations of Chat-Based Interfaces and the Future of Agents 00:44:00 MCPs and Remote Agent Capabilities 00:49:22 LLMs.txt, scrapers, and bad AI bots 00:53:00 Manus and Computer Use on E2B 00:55:03 E2B for RL with Hugging Face 00:56:58 E2B for Agent Evaluation on LMArena 00:58:12 Long-Term Vision: E2B as Full Lifecycle Infrastructure for LLMs 01:00:45 Future Plans for Hosting and Deployment of LLM-Generated Apps 01:01:15 Why E2B Moved to San Francisco 01:05:49 Open Roles and Hiring Plans at E2B
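The "execute arbitrary code generated by the model" use case boils down to a run-and-capture loop. A deliberately generic sketch of that loop (a plain subprocess is NOT real isolation; products like E2B run these workloads in sandboxed VMs, which is the whole point):

```python
import subprocess
import sys
import tempfile
import textwrap

# Pretend this string came back from an LLM.
generated_code = textwrap.dedent("""
    total = sum(range(10))
    print(total)
""")

# Write the code to a temp file and run it in a separate process with a
# timeout, capturing stdout/stderr instead of trusting it in-process.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(generated_code)
    path = f.name

result = subprocess.run(
    [sys.executable, path],
    capture_output=True,
    text=True,
    timeout=10,  # kill runaway model-generated code
)
print(result.stdout.strip())  # 45
```

A real agent sandbox wraps the same execute-and-capture shape in a VM boundary, so the generated code cannot touch the host's filesystem, network, or secrets.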
-
⚡️GPT 4.1: The New OpenAI Workhorse
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-04-15 04:30
We’ll keep this brief because we’re on a tight turnaround: GPT 4.1, previously known as the Quasar and Optimus models, is now live as the natural update for 4o/4o-mini (and the research preview of GPT 4.5). Though it is a general-purpose model family, the headline features are: Coding abilities (o1-level SWEBench and SWELancer, though only OK on Aider) Instruction Following (with a very notable prompting guide) Long Context up to 1M tokens (with new MRCR and Graphwalk benchmarks) Vision (simply o1 level) Cheaper Pricing (cheaper than 4o, with greatly improved prompt caching savings) We caught up with returning guest Michelle Pokrass and Josh McGrath to get more detail on each! Chapters 00:00:00 Introduction and Guest Welcome 00:00:57 GPT 4.1 Launch Overview 00:01:54 Developer Feedback and Model Names 00:02:53 Model Naming and Starry Themes 00:03:49 Confusion Over GPT 4.1 vs 4.5 00:04:47 Distillation and Model Improvements 00:05:45 Omnimodel Architecture and Future Plans 00:06:43 Core Capabilities of GPT 4.1 00:07:40 Training Techniques and Long Context 00:08:37 Challenges in Long Context Reasoning 00:09:34 Context Utilization in Models 00:10:31 Graph Walks and Model Evaluation 00:11:31 Real Life Applications of Graph Tasks 00:12:30 Multi-Hop Reasoning Benchmarks 00:13:30 Agentic Workflows and Backtracking 00:14:28 Graph Traversals for Agent Planning 00:15:24 Context Usage in API and Memory Systems 00:16:21 Model Performance in Long Context Tasks 00:17:17 Instruction Following and Real World Data 00:18:12 Challenges in Grading Instructions 00:19:09 Instruction Following Techniques 00:20:09 Prompting Techniques and Model Responses 00:21:05 Agentic Workflows and Model Persistence 00:22:01 Balancing Persistence and User Control 00:22:56 Evaluations on Model Edits and Persistence 00:23:55 XML vs JSON in Prompting 00:24:50 Instruction Placement in Context 00:25:49 Optimizing for Prompt Caching 00:26:49 Chain of Thought and Reasoning Models 00:27:46 Choosing the Right Model for Your Task 
00:28:46 Coding Capabilities of GPT 4.1 00:29:41 Model Performance in Coding Tasks 00:30:39 Understanding Coding Model Differences 00:31:36 Using Smaller Models for Coding 00:32:33 Future of Coding in OpenAI 00:33:28 Internal Use and Success Stories 00:34:26 Vision and Multi-Modal Capabilities 00:35:25 Screen vs Embodied Vision 00:36:22 Vision Benchmarks and Model Improvements 00:37:19 Model Deprecation and GPU Usage 00:38:13 Fine-Tuning and Preference Steering 00:39:12 Upcoming Reasoning Models 00:40:10 Creative Writing and Model Humor 00:41:07 Feedback and Developer Community 00:42:03 Pricing and Blended Model Costs 00:44:02 Conclusion and Wrap-Up
-
SF Compute: Commoditizing Compute
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-04-11 18:53
Evan Conrad, co-founder of SF Compute, joined us to talk about how they started as an AI lab that avoided bankruptcy by selling GPU clusters, why CoreWeave financials look like a real estate business, and how GPUs are turning into a commodities market. Chapters: 00:00:05 - Introductions 00:00:12 - Introduction of guest Evan Conrad from SF Compute 00:00:12 - CoreWeave Business Model Discussion 00:05:37 - CoreWeave as a Real Estate Business 00:08:59 - Interest Rate Risk and GPU Market Strategy Framework 00:16:33 - Why Together and DigitalOcean will lose money on their clusters 00:20:37 - SF Compute's AI Lab Origins 00:25:49 - Utilization Rates and Benefits of SF Compute Market Model 00:30:00 - H100 GPU Glut, Supply Chain Issues, and Future Demand Forecast 00:34:00 - P2P GPU networks 00:36:50 - Customer stories 00:38:23 - VC-Provided GPU Clusters and Credit Risk Arbitrage 00:41:58 - Market Pricing Dynamics and Preemptible GPU Pricing Model 00:48:00 - Future Plans for Financialization? 00:52:59 - Cluster auditing and quality control 00:58:00 - Futures Contracts for GPUs 01:01:20 - Branding and Aesthetic Choices Behind SF Compute 01:06:30 - Lessons from Previous Startups 01:09:07 - Hiring at SF Compute Chapters 00:00:00 Introduction and Background 00:00:58 Analysis of GPU Business Models 00:01:53 Challenges with GPU Pricing 00:02:48 Revenue and Scaling with GPUs 00:03:46 Customer Sensitivity to GPU Pricing 00:04:44 CoreWeave's Business Strategy 00:05:41 CoreWeave's Market Perception 00:06:40 Hyperscalers and GPU Market Dynamics 00:07:37 Financial Strategies for GPU Sales 00:08:35 Interest Rates and GPU Market Risks 00:09:30 Optimal GPU Contract Strategies 00:10:27 Risks in GPU Market Contracts 00:11:25 Price Sensitivity and Market Competition 00:12:21 Market Dynamics and GPU Contracts 00:13:18 Hyperscalers and GPU Market Strategies 00:14:15 Nvidia and Market Competition 00:15:12 Microsoft's Role in GPU Market 00:16:10 Challenges in GPU Market Dynamics 00:17:07 
Economic Realities of the GPU Market 00:18:03 Real Estate Model for GPU Clouds 00:18:59 Price Sensitivity and Chip Design 00:19:55 SF Compute's Beginnings and Challenges 00:20:54 Navigating the GPU Market 00:21:54 Pivoting to a GPU Cloud Provider 00:22:53 Building a GPU Market 00:23:52 SF Compute as a GPU Marketplace 00:24:49 Market Liquidity and GPU Pricing 00:25:47 Utilization Rates in GPU Markets 00:26:44 Brokerage and Market Flexibility 00:27:42 H100 Glut and Market Cycles 00:28:40 Supply Chain Challenges and GPU Glut 00:29:35 Future Predictions for the GPU Market 00:30:33 Speculations on Test Time Inference 00:31:29 Market Demand and Test Time Inference 00:32:26 Open Source vs. Closed AI Demand 00:33:24 Future of Inference Demand 00:34:24 Peer-to-Peer GPU Markets 00:35:17 Decentralized GPU Market Skepticism 00:36:15 Redesigning Architectures for New Markets 00:37:14 Supporting Grad Students and Startups 00:38:11 Successful Startups Using SF Compute 00:39:11 VCs and GPU Infrastructure 00:40:09 VCs as GPU Credit Transformators 00:41:06 Market Timing and GPU Infrastructure 00:42:02 Understanding GPU Pricing Dynamics 00:43:01 Market Pricing and Preemptible Compute 00:43:55 Price Volatility and Market Optimization 00:44:52 Customizing Compute Contracts 00:45:50 Creating Flexible Compute Guarantees 00:46:45 Financialization of GPU Markets 00:47:44 Building a Spot Market for GPUs 00:48:40 Auditing and Standardizing Clusters 00:49:40 Ensuring Cluster Reliability 00:50:36 Active Monitoring and Refunds 00:51:33 Automating Customer Refunds 00:52:33 Challenges in Cluster Maintenance 00:53:29 Remote Cluster Management 00:54:29 Standardizing Compute Contracts 00:55:28 Unified Infrastructure for Clusters 00:56:24 Creating a Commodity Market for GPUs 00:57:22 Futures Market and Risk Management 00:58:18 Reducing Risk with GPU Futures 00:59:14 Stabilizing the GPU Market 01:00:10 SF Compute's Anti-Hype Approach 01:01:07 Calm Branding and Expectations 01:02:07 Promoting San 
Francisco's Beauty 01:03:03 Design Philosophy at SF Compute 01:04:02 Artistic Influence on Branding 01:05:00 Past Projects and Burnout 01:05:59 Challenges in Building an Email Client 01:06:57 Persistence and Iteration in Startups 01:07:57 Email Market Challenges 01:08:53 SF Compute Job Opportunities 01:09:53 Hiring for Systems Engineering 01:10:50 Financial Systems Engineering Role 01:11:50 Conclusion and Farewell
-
The Creators of Model Context Protocol
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-04-03 22:37
Today’s guests, David Soria Parra and Justin Spahr-Summers, are the creators of Anthropic’s Model Context Protocol (MCP). When we first wrote Why MCP Won, we had no idea how quickly it was about to win. In the past 4 weeks, OpenAI and now Google have announced MCP support, effectively confirming our prediction that MCP was the presumptive winner of the agent standard wars. MCP has now overtaken OpenAPI, the incumbent option and most direct alternative, in GitHub stars (3 months ahead of the conservative trendline). For protocol and history nerds, we also asked David and Justin to tell the origin story of MCP, which we leave to the reader to enjoy (you can also skim the transcripts, or the changelogs of a certain favored IDE). It’s incredible the impact that individual engineers solving their own problems can have on an entire industry. Timestamps 00:00 Introduction and Guest Welcome 00:37 What is MCP? 02:00 The Origin Story of MCP 05:18 Development Challenges and Solutions 08:06 Technical Details and Inspirations 29:45 MCP vs Open API 32:48 Building MCP Servers 40:39 Exploring Model Independence in LLMs 41:36 Building Richer Systems with MCP 43:13 Understanding Agents in MCP 45:45 Nesting and Tool Confusion in MCP 49:11 Client Control and Tool Invocation 52:08 Authorization and Trust in MCP Servers 01:01:34 Future Roadmap and Stateless Servers 01:10:07 Open Source Governance and Community Involvement 01:18:12 Wishlist and Closing Remarks
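For readers who haven't seen the protocol itself: MCP frames client-server traffic as JSON-RPC 2.0 messages over a transport such as stdio. A minimal sketch of the request framing (the method names `tools/list` and `tools/call` come from the published MCP spec; the `get_weather` tool and its arguments are hypothetical examples, and a real exchange also begins with an `initialize` handshake):

```python
import json

def rpc_request(method: str, params: dict, id_: int) -> str:
    """Frame an MCP request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": id_, "method": method, "params": params}
    )

# Ask the server which tools it exposes...
list_req = rpc_request("tools/list", {}, 1)

# ...then invoke one by name with structured arguments.
call_req = rpc_request(
    "tools/call",
    {"name": "get_weather", "arguments": {"city": "San Francisco"}},
    2,
)

print(list_req)
print(call_req)
```

The spec's insight is that this thin, transport-agnostic framing is all clients and servers need to agree on, which is part of why it spread so quickly compared to heavier alternatives.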
-
Unsupervised Learning x Latent Space Crossover Special
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2025-03-29 07:00
Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what’s real today, what will be real in the future, and what it means for businesses and the world - helping builders, researchers and founders deconstruct and understand the biggest breakthroughs. Top guests: Noam Shazeer, Bob McGrew, Noam Brown, Dylan Patel, Percy Liang, David Luan https://www.latent.space/p/unsupervised-learning Timestamps 00:00 Introduction and Excitement for Collaboration 00:27 Reflecting on Surprises in AI Over the Past Year 01:44 Open Source Models and Their Adoption 06:01 The Rise of GPT Wrappers 06:55 AI Builders and Low-Code Platforms 09:35 Overhyped and Underhyped AI Trends 22:17 Product Market Fit in AI 28:23 Google's Current Momentum 28:33 Customer Support and AI 29:54 AI's Impact on Cost and Growth 31:05 Voice AI and Scheduling 32:59 Emerging AI Applications 34:12 Education and AI 36:34 Defensibility in AI Applications 40:10 Infrastructure and AI 47:08 Challenges and Future of AI 52:15 Quick Fire Round and Closing Remarks Chapters 00:00:00 Introduction and Collab Excitement 00:00:58 Open Source and Model Adoption 00:01:58 Enterprise Use of Open Source Models 00:02:57 The Competitive Edge of Closed Source Models 00:03:56 DeepSeek and Open Source Model Releases 00:04:54 Market Narrative and DeepSeek Impact 00:05:53 AI Engineering and GPT Wrappers 00:06:53 AI Builders and Low-Code Platforms 00:07:50 Innovating Beyond Existing Paradigms 00:08:50 Apple and AI Product Development 00:09:48 Overhyped and Underhyped AI Trends 00:10:46 Frameworks and Protocols in AI Development 00:11:45 Emerging Opportunities in AI 00:12:44 Stateful AI and Memory Innovation 00:13:44 Challenges with Memory in AI Agents 00:14:44 The Future of Model Training Companies 00:15:44 Specialized Use Cases for AI Models 00:16:44 Vertical Models vs General Purpose Models 00:17:42 General Purpose vs Domain-Specific Models 00:18:42 Reflections on Model Companies 00:19:39 Model Companies Entering 
Product Space 00:20:38 Competition in AI Model and Product Sectors 00:21:35 Coding Agents and Market Dynamics 00:22:35 Defensibility in AI Applications 00:23:35 Investing in Underappreciated AI Ventures 00:24:32 Analyzing Market Fit in AI 00:25:31 AI Applications with Product Market Fit 00:26:31 OpenAI's Impact on the Market 00:27:31 Google and OpenAI Competition 00:28:31 Exploring Google's Advancements 00:29:29 Customer Support and AI Applications 00:30:27 The Future of AI in Customer Support 00:31:26 Cost-Cutting vs Growth in AI 00:32:23 Voice AI and Real-World Applications 00:33:23 Scaling AI Applications for Demand 00:34:22 Summarization and Conversational AI 00:35:20 Future AI Use Cases and Market Fit 00:36:20 AI Education and Model Capabilities 00:37:17 Reforming Education with AI 00:38:15 Defensibility in AI Apps 00:39:13 Network Effects and AI 00:40:12 AI Brand and Market Positioning 00:41:11 AI Application Defensibility 00:42:09 LLM OS and AI Infrastructure 00:43:06 Security and AI Application 00:44:06 OpenAI's Role in AI Infrastructure 00:45:02 The Balance of AI Applications and Infrastructure 00:46:02 Capital Efficiency in AI Infrastructure 00:47:01 Challenges in AI DevOps and Infrastructure 00:47:59 AI SRE and Monitoring 00:48:59 Scaling AI and Hardware Challenges 00:49:58 Reliability and Compute in AI 00:50:57 Nvidia's Dominance and AI Hardware 00:51:57 Emerging Competition in AI Silicon 00:52:54 Agent Authentication Challenges 00:53:53 Dream Podcast Guests 00:54:51 Favorite News Sources and Startups 00:55:50 The Value of In-Person Conversations 00:56:50 Private vs Public AI Discourse 00:57:48 Latent Space and Podcasting 00:58:46 Conclusion and Final Thoughts