Why Google failed to make GPT-3 + why Multimodal Agents are the path to AGI — with David Luan of Adept
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-03-22 19:08
Our next SF event is AI UX 2024 - let’s see the new frontier for UX since last year!

Last call: we are recording a preview of the AI Engineer World’s Fair with swyx and Ben Dunphy - send any questions about Speaker CFPs and Sponsor Guides you have!

Alessio is now hiring engineers for a new startup he is incubating at Decibel. The ideal candidate is an “ex-technical co-founder type”. Reach out to him for more!

David Luan has been at the center of the modern AI revolution: he was the ~30th hire at OpenAI, he led Google's LLM efforts and co-led Google Brain, and then in 2022 started Adept, one of the leading companies in the AI agents space. In today's episode, we asked David for some war stories from his time at early OpenAI (including working with Alec Radford ahead of the GPT-2 demo with Sam Altman, which resulted in Microsoft’s initial $1b investment), and how Adept is building agents that can “do anything a human does on a computer" — his definition of useful AGI.

Why Google *couldn’t* make GPT-3

While we wanted to discuss Adept, we couldn’t talk to a former VP of Engineering at OpenAI and former LLM tech lead at Google Brain without asking about the elephant in the room: how did Google, which had such a huge lead in 2017 - Vaswani et al creating the Transformer, Noam Shazeer predicting trillion-parameter models - end up watching David’s team at OpenAI make GPT 1/2/3? David has some interesting answers:

“So I think the real story of GPT starts at Google, of course, right? Because that's where Transformers sort of came about. However, the number one shocking thing to me was that, and this is like a consequence of the way that Google is organized…what they (should) have done would be say, hey, Noam Shazeer, you're a brilliant guy. You know how to scale these things up. Here's half of all of our TPUs. And then I think they would have destroyed us. He clearly wanted it too…You know, every day we were scaling up GPT-3, I would wake up and just be stressed. And I was stressed because, you know, you just look at the facts, right? Google has all this compute. Google has all the people who invented all of these underlying technologies. There's a guy named Noam who's really smart, who's already gone and done this talk about how he wants a trillion parameter model. And I'm just like, we're probably just doing duplicative research to what he's doing. He's got this decoder only transformer that's probably going to get there before we do. And it turned out the whole time that they just couldn't get critical mass. So during my year where I led the Google LM effort and I was one of the brain leads, you know, it became really clear why. At the time, there was a thing called the Brain Credit Marketplace. Everyone's assigned a credit. So if you have a credit, you get to buy N chips according to supply and demand. So if you want to go do a giant job, you had to convince like 19 or 20 of your colleagues not to do work. And if that's how it works, it's really hard to get that bottom up critical mass to go scale these things. And the team at Google were fighting valiantly, but we were able to beat them simply because we took big swings and we focused.”

Cloning HGI for AGI

Human intelligence got to where it is today through evolution. Some argue that to get to AGI, we will have to approximate all the “FLOPs” that went into that process, an approach most famously mapped out by Ajeya Cotra’s Biological Anchors report. The early days of OpenAI were very reinforcement learning-driven with the Dota project, but that's a very inefficient way for these models to re-learn everything. (Kanjun from Imbue shared similar ideas in her episode.)

David argues that there’s a shortcut: we can bootstrap from existing intelligence.

“Years ago, I had a debate with a Berkeley professor as to what will it actually take to build AGI. And his view is basically that you have to reproduce all the flops that went into evolution in order to be able to get there… I think we are ignoring the fact that you have a giant shortcut, which is you can behaviorally clone everything humans already know. And that's what we solved with LLMs!”

LLMs today basically model intelligence using all (good!) written knowledge (see our Datasets 101 episode), and have now expanded to non-verbal knowledge (see our HuggingFace episode on multimodality). The SOTA self-supervised pre-training process is surprisingly data-efficient in taking large amounts of unstructured data and approximating reasoning without overfitting.
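To make “behavioral cloning” concrete: it is just supervised learning on human demonstrations, and next-token prediction on human-written text is the special case where the observation is the preceding tokens and the action is the next one. A minimal illustrative sketch (ours, not OpenAI’s or Adept’s code; TinyPolicy and bc_step are made-up names):

```python
# Illustrative sketch: behavioral cloning is supervised learning on
# (observation, action) pairs demonstrated by humans. An LLM is the special
# case where observation = preceding tokens and action = next token.
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    def __init__(self, vocab_size=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, obs_tokens):
        # obs_tokens: (batch, seq) token ids; a real model would be a
        # decoder-only Transformer here, not a mean-pooled embedding.
        h = self.embed(obs_tokens).mean(dim=1)
        return self.head(h)  # logits over the next "action" (token)

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def bc_step(obs, action):
    """One behavioral-cloning update: maximize log p(action | obs)."""
    logits = policy(obs)
    loss = loss_fn(logits, action)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Fake human demonstrations: (what the human saw, what the human did).
obs = torch.randint(0, 256, (8, 16))
action = torch.randint(0, 256, (8,))
print(bc_step(obs, action))
```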
But how do you cross the gap from the LLMs of today to building the AGI we all want? This is why David & friends left to start Adept.

“We believe the clearest framing of general intelligence is a system that can do anything a human can do in front of a computer. A foundation model for actions, trained to use every software tool, API, and webapp that exists, is a practical path to this ambitious goal” — ACT-1 Blogpost

Critical Path: Abstraction with Reliability

The AGI dream is fully autonomous agents, but there are levels to the autonomy we are comfortable giving our agents, based on how reliable they are. In David’s framing, we always want higher levels of “abstraction” (aka autonomy), but our need for “reliability” is the practical limit on how high an abstraction we can use.

“The critical path for Adept is we want to build agents that can do a higher and higher level abstraction things over time, all while keeping an insanely high reliability standard. Because that's what turns us from research into something that customers want. And if you build agents with really high reliability standard, but are continuing pushing a level of abstraction, you then learn from your users how to get that next level of abstraction faster. So that's how you actually build the data flow. That's the critical path for the company. Everything we do is in service of that.”

We saw how Adept thinks about different levels of abstraction at the 2023 Summit: the highest abstraction is the “AI Employee”, but we’ll get there with “AI enabled employees”. Alessio recently gave a talk about the future of work with “services as software” at this week’s Nvidia GTC (slides).

No APIs

Unlike a lot of large research labs, Adept's framing of AGI as "being able to use your computer like a human" carries with it a useful environmental constraint:

“Having a humanoid robot lets you do things that humans do without changing everything along the way. It's the same thing for software, right? If you go itemize out the number of things you want to do on your computer for which every step has an API, those numbers of workflows add up pretty close to zero. And so then many points along the way, you need the ability to actually control your computer like a human. It also lets you learn from human usage of computers as a source of training data that you don't get if you have to somehow figure out how every particular step needs to be some particular custom private API thing. And so I think this is actually the most practical path (to economic value).”

This realization and conviction means that multimodal models are the way to go. Instead of using function calling to call APIs to build agents, which is what OpenAI and most of the open LLM industry have done to date, Adept wants to “drive by vision” (aka see the screen as a human sees it) and pinpoint where to click and type as a human does. No APIs needed, because most software doesn’t expose APIs. A minimal agent loop in this style might look like the sketch below.
Extra context for readers: You can see the DeepMind SIMA model in the same light: one system that learned to play a diverse set of games (instead of one dedicated model per game) using only pixel inputs and keyboard-and-mouse action outputs! The OpenInterpreter team is working on a “Computer API” that does the same.

To do this, Adept had to double down on a special kind of multimodality for knowledge work:

“A giant thing that was really necessary is really fast multimodal models that are really good at understanding knowledge work and really good at understanding screens. And that needs to kind of be the base for some of these agents……I think one big hangover of the primarily academic focus for multimodal models is most multimodal models are primarily trained on like natural images, cat and dog photos, stuff that's come out of the camera… (but) where are they going to be the most useful? They're going to be most useful in knowledge work tasks. That's where the majority of economic value is going to be. It's not in cat and dogs. And so if that's what it is, what do you need to train? I need to train on like charts, graphs, tables, invoices, PDFs, receipts, unstructured data, UIs. That's just a totally different pre-training corpus. And so Adept spent a lot of time building that.”

With this context, you can now understand the full path of Adept’s public releases:

* ACT-1 (Sept 2022): a large Transformers model optimized for browser interactions. It has a custom rendering of the browser viewport that allows it to better understand it and take actions.

* Persimmon-8B (Sept 2023): a permissive open LLM (weights and code here)

* Fuyu-8B (Oct 2023): a small version of the multimodal model that powers Adept. Vanilla decoder-only transformer with no specialized image encoder, which allows it to handle input images of varying resolutions without downsampling (sketched below).

* Adept Experiments (Nov 2023): a public tool to build automations in the browser. This is powered by Adept's core technology, but it's just a piece of their enterprise platform. They use it as a way to try various design ideas.

* Fuyu Heavy (Jan 2024): a new multimodal model designed specifically for digital agents and the world’s third-most-capable multimodal model (beating Gemini Pro on MMMU, AI2D, and ChartQA), “behind only GPT4-V and Gemini Ultra, which are 10-20 times bigger”

The Fuyu-8B post in particular exhibits a great number of examples of knowledge work multimodality.
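The architectural trick behind Fuyu-8B, per the bullet above, is worth a sketch: with no separate image encoder, image patches are linearly projected straight into the decoder’s token stream, so image resolution only changes the number of tokens. A simplified illustration based on the public blog post (the model width and helper names here are ours, not Adept’s code):

```python
# Sketch of the Fuyu idea, simplified: no ViT-style image encoder, just a
# single linear projection from raw patches into the decoder's token stream,
# so arbitrary resolutions work without downsampling.
import torch
import torch.nn as nn

PATCH = 30   # Fuyu uses 30x30 image patches per the blog post
DIM = 4096   # model width (illustrative value)

patch_proj = nn.Linear(PATCH * PATCH * 3, DIM)  # one linear layer, no encoder

def image_to_tokens(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) with H, W multiples of PATCH. Returns (n_patches, DIM)
    embeddings that are simply concatenated with text-token embeddings and
    fed to a vanilla decoder-only transformer."""
    patches = (
        image.unfold(1, PATCH, PATCH)   # (3, H/P, W, P)
             .unfold(2, PATCH, PATCH)   # (3, H/P, W/P, P, P)
             .permute(1, 2, 0, 3, 4)    # (H/P, W/P, 3, P, P)
             .reshape(-1, 3 * PATCH * PATCH)
    )
    return patch_proj(patches)

# Any resolution that tiles into 30x30 patches works; no resizing step.
print(image_to_tokens(torch.rand(3, 60, 90)).shape)  # torch.Size([6, 4096])
```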
Why Adept is NOT a Research Lab

With OpenAI now worth >$90b and Anthropic >$18b, it is tempting to conclude that the AI startup metagame is to build a large research lab, and attract the brightest minds and highest capital to build AGI. Our past guests Raza Habib (see the Humanloop episode) and Kanjun Qiu (from Imbue) combined to ask the most challenging questions of the pod: with David/Adept’s deep research pedigree from DeepMind and OpenAI, why is Adept not building more general foundation models (like Persimmon) and playing the academic benchmarks game? Why is Adept so focused on commercial agents instead?

“I feel super good that we're doing foundation models in service of agents and all of the reward within Adept is flowing from “Can we make a better agent”…… I think pure play foundation model companies are just going to be pinched by how good the next couple of (Meta Llama models) are going to be… And then seeing the really big players put ridiculous amounts of compute behind just training these base foundation models, I think is going to commoditize a lot of the regular LLMs and soon regular multimodal models. So I feel really good that we're just focused on agents.”

And the commercial grounding is his answer to Kanjun too (whom we also asked the inverse question, to compare with Adept):

“… the second reason I work at Adept is if you believe that actually having customers and a reward signal from customers lets you build AGI faster, which we really believe, then you should come here. And I think the examples for why that's true is, for example, our evaluations are not academic evals. They're not simulator evals. They're like, okay, we have a customer that really needs us to do these particular things. We can do some of them. These are the ones they want us to, we can't do them at all. We've turned those into evals… I think that's a degree of practicality that really helps.”
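As an aside, the “customer workflows as evals” idea above reduces to a very small harness: each case is a real customer goal plus a check on the real end state, and the score is plain end-to-end reliability rather than an academic benchmark. A hypothetical sketch (ours, not Adept’s tooling):

```python
# Illustrative sketch: turning real customer workflows into evals.
from dataclasses import dataclass
from typing import Callable

@dataclass
class WorkflowEval:
    goal: str                  # natural-language task from a customer
    check: Callable[[], bool]  # did the real end state actually change?

def reliability(agent: Callable[[str], None],
                cases: list[WorkflowEval]) -> float:
    passed = 0
    for case in cases:
        try:
            agent(case.goal)        # run the full agent trajectory
            passed += case.check()  # verify the end-to-end outcome
        except Exception:
            pass                    # failing to execute counts as a failure
    return passed / len(cases)

# Hypothetical cases; enterprise use needs "nines" of reliability on these.
cases = [
    WorkflowEval("Log today's calls in the CRM", check=lambda: True),
    WorkflowEval("File this invoice and schedule the truck", check=lambda: True),
]
```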
And his customers seem pretty happy, because David didn’t need to come on to do a sales pitch:

David: “One of the things we haven't shared before is we're completely sold out for Q1.”

Swyx: “Sold out of what?”

David: “Sold out of bandwidth to onboard more customers.”

Well, that’s a great problem to have.

Show Notes

* David Luan
* Dextro at Data Driven NYC (2015)
* Adept
* ACT-1
* Persimmon-8B
* Adept Experiments
* Fuyu-8B
* $350M Series B announcement
* Amelia Wattenberger talk at AI Engineer Summit
* Figure

Chapters

* [00:00:00] Introductions
* [00:01:14] Being employee #30 at OpenAI and its early days
* [00:13:38] What is Adept and how do you define AGI?
* [00:21:00] Adept's critical path and research directions
* [00:26:23] How AI agents should interact with software and impact product development
* [00:30:37] Analogies between AI agents and self-driving car development
* [00:32:42] Balancing reliability, cost, speed and generality in AI agents
* [00:37:30] Potential of foundation models for robotics
* [00:39:22] Core research questions and reasons to work at Adept

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

Swyx [00:00:15]: Hey, and today we have David Luan, CEO, co-founder of Adept in the studio. Welcome.

David [00:00:20]: Yeah, thanks for having me.

Swyx [00:00:21]: Been a while in the works. I've met you socially at one of those VC events and you said that you were interested in coming on and glad we finally were able to make this happen.

David: Yeah, happy to be part of it.

Swyx: So we like to introduce the speaker and then also just like have you talk a little bit about like what's not on your LinkedIn, what people should just generally know about you. You started a company in college, which was the first sort of real time video detection classification API - that was Dextro - and that was your route to getting acquired into Axon, where you were a director of AI. Then you were the 30th hire at OpenAI?

David [00:00:53]: Yeah, 30, 35, something around there. Something like that.

Swyx [00:00:56]: So you were VP of Eng for two to two and a half years, briefly served as tech lead of large models at Google, and then in 2022 started Adept. So that's the sort of brief CV. Is there anything else you want to fill in the blanks, or what should people know more about?

David [00:01:14]: I guess a broader story was I joined OpenAI fairly early and I did that for about two and a half to three years leading engineering there. It's really funny, I think second or third day of my time at OpenAI, Greg and Ilya pulled me in a room and we're like, you know, you should take over our directs and we'll go mostly do IC work. So that was fun, just coalescing a bunch of teams out of a couple of early initiatives that had already happened. The company, the Dota effort was going pretty hard and then more broadly trying to put bigger picture direction around what we were doing with basic research. So I spent a lot of time doing that. And then I led Google's LLM efforts, but also co-led Google Brain, was one of the brain leads more broadly. You know, there's been a couple of different eras of AI research, right? If we count everything before 2012 as prehistory, which people hate it when I say that, you kind of had this "you and your three best friends write a research paper that changes the world" period from like 2012 to 2017. And I think the game changed in 2017 and like most labs didn't realize it, but we at OpenAI really did. I think in large part helped by like Ilya's constant beating of the drum that the world would be covered in data centers. And I think-

Swyx [00:02:15]: It's causally neat.

David [00:02:16]: Yeah. Well, like I think we had conviction in that, but it wasn't until we started seeing results that it became clear that that was where we had to go. But also part of it as well was for OpenAI, like when I first joined, I think one of the jobs that I had to do was how do I tell a differentiated vision for who we were technically compared to, you know, hey, we're just smaller Google Brain, or like you work at OpenAI if you live in SF and don't want to commute to Mountain View or don't want to live in London, right? That's like not enough to like hang your technical identity on as a company. And so what we really did was, and I spent a lot of time pushing this, is just how do we get ourselves focused on a certain class of like giant swings and bets, right? Like how do you flip the script from you just do bottom-up research to more about how do you like leave some room for that, but really make it about like, what are the big scientific outcomes that you want to show? And then you just solve them at all costs, whether or not you care about novelty and all that stuff. And that became the dominant model for a couple of years, right? And then what's changed now is I think the number one driver of AI products over the next couple of years is going to be the deep co-design and co-evolution of product and users for feedback and actual technology. And I think labs that have every tool to go do that are going to do really well.
And that's a big part of why I started Adept.

Alessio [00:03:20]: You mentioned Dota. Any memories of the switch from RL to Transformers at the time, and how the industry was evolving more on the LLM side and leaving behind some of the more agent simulation work?

David [00:03:33]: Like zooming way out, I think agents are just absolutely the correct long-term direction, right? You just go to find what AGI is, right? You're like, hey, like, well, first off, actually, I don't love AGI definitions that involve human replacement because I don't think that's actually how it's going to happen. Even this definition of, hey, AGI is something that outperforms humans at economically valuable tasks has kind of an implicit view of the world about what's going to be the role of people. I think what I'm more interested in is like a definition of AGI that's oriented around like a model that can do anything a human can do on a computer. If you go think about that, which is like super tractable, then agent is just a natural consequence of that definition. And so what did all the work we did on our own stuff like that get us? It got us a really clear formulation. Like you have a goal and you want to maximize the goal, you want to maximize reward, right? And the natural LLM formulation doesn't come with that out of the box, right? I think that we as a field got a lot right by thinking about, hey, how do we solve problems of that caliber? And then the thing we forgot is that de novo RL is like a pretty terrible way to get there quickly. Why are we rediscovering all the knowledge about the world? Years ago, I had a debate with a Berkeley professor as to what will it actually take to build AGI. And his view is basically that you have to reproduce all the flops that went into evolution in order to be able to get there. Right.

Swyx [00:04:44]: The biological basis theory. Right.

David [00:04:46]: So I think we are ignoring the fact that you have a giant shortcut, which is you can behaviorally clone everything humans already know. And that's what we solved with LLMs. We've solved behaviorally cloning everything that humans already know. Right. So like today, maybe LLMs is like behaviorally cloning every word that gets written on the internet; in the future, the multimodal models are becoming more of a thing, where it's behaviorally cloning the visual world. But really, what we're just going to have is like a universal byte model, right? Where tokens of data that have high signal come in, and then all of those patterns are like learned by the model. And then you can regurgitate any combination now. Right. So text into voice out, like image into other image out or video out or whatever, like these like mappings, right? Like all just going to be learned by this universal behavioral cloner. And so I'm glad we figured that out. And I think now we're back to the era of how do we combine this with all of the lessons we learned during the RL period. That's what's going to drive progress.

Swyx [00:05:35]: I'm still going to pressure you for a few more early OpenAI stories before we turn to the Adept stuff. On your personal site, which I love, because it's really nice, like personal, you know, story context around like your history. I need to update it. It's so old. Yeah, it's so out of date. But you mentioned GPT-2. Did you overlap with GPT-1? I think you did, right?

David [00:05:53]: I actually don't quite remember. I think I was joining right around- Right around then?

Swyx [00:05:57]: I was right around that, yeah.
Yeah. So what I remember was Alec, you know, just kind of came in and was like very obsessed with Transformers and applying them to like Reddit sentiment analysis. Yeah, sentiment, that's right. Take us through-

David [00:06:09]: Sentiment neuron, all this stuff.

Swyx [00:06:10]: The history of GPT as far as you know, you know, according to you. Ah, okay.

David [00:06:14]: History of GPT, according to me, that's a pretty good question. So I think the real story of GPT starts at Google, of course, right? Because that's where Transformers sort of came about. However, the number one shocking thing to me was that, and this is like a consequence of the way that Google is organized, where like, again, you and your three best friends write papers, right? Okay. So zooming way out, right? I think about my job when I was a full-time research leader as a little bit of a portfolio allocator, right? So I've got really, really smart people. My job is to convince people to coalesce around a small number of really good ideas and then run them over the finish line. My job is not actually to promote a million ideas and never have critical mass. And then as the ideas start coming together and some of them start working well, my job is to nudge resources towards the things that are really working and then start disbanding some of the things that are not working, right? That muscle did not exist during my time at Google. And I think had they had it, what they would have done would be say, hey, Noam Shazeer, you're a brilliant guy. You know how to scale these things up. Here's half of all of our TPUs. And then I think they would have destroyed us. He clearly wanted it too.

Swyx [00:07:17]: He was talking about trillion parameter models in 2017.

David [00:07:20]: Yeah. So that's the core of the GPT story, right? Which is that, and I'm jumping around historically, right? But after GPT-2, we were all really excited about GPT-2. I can tell you more stories about that. It was the last paper that I even got to really touch before everything became more about building a research org. You know, every day we were scaling up GPT-3, I would wake up and just be stressed. And I was stressed because, you know, you just look at the facts, right? Google has all this compute. Google has all the people who invented all of these underlying technologies. There's a guy named Noam who's really smart, who's already gone and done this talk about how he wants a trillion parameter model. And I'm just like, we're probably just doing duplicative research to what he's doing, right? He's got this decoder only transformer that's probably going to get there before we do. And I was like, but like, please just like let this model finish, right? And it turned out the whole time that they just couldn't get critical mass. So during my year where I led the Google LM effort and I was one of the brain leads, you know, it became really clear why, right? At the time, there was a thing called the Brain Credit Marketplace. And did you guys know the Brain Credit Marketplace? No, I never heard of this. Oh, so it's actually, it's a, you can ask any Googler.

Swyx [00:08:23]: It's like just like a thing that, that, I mean, look like, yeah, limited resources, you got to have some kind of marketplace, right? You know, sometimes it's explicit, sometimes it isn't, you know, just political favors.

David [00:08:34]: You could. And so then basically everyone's assigned a credit, right? So if you have a credit, you get to buy N chips according to supply and demand.
So if you want to go do a giant job, you had to convince like 19 or 20 of your colleagues not to do work. And if that's how it works, it's really hard to get that bottom up critical mass to go scale these things. And the team at Google were fighting valiantly, but we were able to beat them simply because we took big swings and we focused. And I think, again, that's like part of the narrative of like this phase one of AI, right? Of like this modern AI era to phase two. And I think in the same way, I think phase three companies are going to out execute phase two companies because of the same asymmetry of success.

Swyx [00:09:12]: Yeah. I think it's underrated how much NVIDIA worked with you in the early days as well. I think maybe, I think it was Jensen. I'm not sure who circulated a recent photo of him delivering the first DGX to you guys.

David [00:09:24]: I think Jensen has been a complete legend and a mastermind throughout. I have so much respect for NVIDIA. It is unreal.

Swyx [00:09:34]: But like with OpenAI, did you kind of give them your requirements, like co-design it, or just work with whatever NVIDIA gave you?

David [00:09:40]: So we worked really closely with them. There's, I'm not sure I can share all the stories, but there are examples of ones that I've found particularly interesting. So Scott Gray is amazing. I really like working with him. He was on one of my teams, the supercomputing team, which Chris Berner runs and Chris Berner still does a lot of stuff in that. As a result, like we had very close ties to NVIDIA. Actually, one of my co-founders at Adept, Erich Elsen, was also one of the early GPGPU people. So he and Scott and Brian Catanzaro at NVIDIA and Jonah and Ian at NVIDIA, I think all were very close. And we're all sort of part of this group of how do we push these chips to the absolute limit? And I think that kind of collaboration helped quite a bit. I think one interesting set of stuff is knowing, for the A100 generation, that quad sparsity was going to be a thing. Is that something that we want to go look into, right? And figure out if that's something that we could actually use for model training. Really what it boils down to is that, and I think more and more people realize this - six years ago, even three years ago, people refused to accept it - this era of AI is really a story of compute. It's really the story of how do you more efficiently map actual usable model flops to compute.

Swyx [00:10:38]: Is there another GPT-2, GPT-3 story that you love to get out there that you think is underappreciated for the amount of work that people put into it?

David [00:10:48]: So two interesting GPT-2 stories. One of them was I spent a good bit of time just sprinting to help Alec get the paper out. And I remember one of the most entertaining moments was we were writing the modeling section. And I'm pretty sure the modeling section was the shortest modeling section of any reasonably legitimate ML paper to that moment. It was like, "Section 3: Model. This is a standard vanilla decoder-only transformer with these particular things" - it was a paragraph long, if I remember correctly. And both of us were just looking at it, being like, man, the OGs in the field are going to hate this. They're going to say no novelty. Why did you guys do this work?
So now it's funny to look at in hindsight that it was a pivotal kind of paper, but I think it was one of the early ones where we just leaned fully into "all we care about is solving problems in AI" and not about, hey, is there like four different really simple ideas that are cloaked in mathematical language that doesn't actually help move the field forward?

Swyx [00:11:42]: Right. And it's like you innovate on maybe like data set and scaling and not so much the architecture.

David [00:11:48]: We all know how it works now, right? Which is that there's a collection of really hard won knowledge that you get only by being at the frontiers of scale. And that hard won knowledge, a lot of it's not published. A lot of it is stuff that's actually not even easily reducible to what looks like a typical academic paper. But yet that's the stuff that helps differentiate one scaling program from another. You had a second one? So the second one is, there's like some details here that I probably shouldn't fully share, but hilariously enough, for the last meeting we did with Microsoft before Microsoft invested in OpenAI, Sam Altman, myself and our CFO flew up to Seattle to do the final pitch meeting. And I'd been a founder before. So I always had a tremendous amount of anxiety about partner meetings, which this basically was. I had Kevin Scott and Satya and Amy Hood, and it was my job to give the technical slides about what's the path to AGI, what's our research portfolio, all of this stuff, but it was also my job to give the GPT-2 demo. We had a slightly bigger version of GPT-2 that we had just cut maybe a day or two before this flight up. And as we all know now, model behaviors you find predictable at one checkpoint are not predictable in another checkpoint. And so I'd spent all this time trying to figure out how to keep this thing on rails. I had my canned demos, but I knew I had to go turn it over to Satya and Kevin and let them type anything in. And that just, that really kept me up all night.

Swyx [00:13:06]: Nice. Yeah.

Alessio [00:13:08]: I mean, that must have helped you. Talking about partner meetings, you raised $420 million for Adept. The last round was a $350 million Series B, so I'm sure you do great in partner meetings.

Swyx [00:13:18]: Pitchers meetings. Nice.

David [00:13:20]: No, that's a high compliment coming from a VC.

Alessio [00:13:22]: Yeah, no, I mean, you're doing great already for us. Let's talk about Adept. And we were doing pre-prep and you mentioned that maybe a lot of people don't understand what Adept is. So usually we try and introduce the product and then have the founders fill in the blanks, but maybe let's do the reverse. Like, what is Adept? Yeah.

David [00:13:38]: So I think Adept is the least understood company in the broader space of foundation models plus agents. So I'll give some color and I'll explain what it is, and I'll explain also why it's actually pretty different from what people would have guessed. So the goal for Adept is we basically want to build an AI agent that can basically help humans do anything a human does on a computer. And so what that really means is we want this thing to be super good at turning natural language goal specifications into the correct set of n steps, and then also have all the correct sensors and actuators to go get that thing done for you across any software tool that you already use.
And so the end vision of this is effectively, like, I think in a couple of years everyone's going to have access to an AI teammate that they can delegate arbitrary tasks to, and then also be able to, you know, use it as a sounding board and just be way, way, way more productive. Right. And it just changes the shape of every job from something where you're mostly doing execution to something where you're mostly actually doing these core liberal arts skills of what should I be doing and why. Right. And I find this really exciting and motivating because I think it's actually a pretty different vision for how AGI will play out. I think systems like Adept are the most likely systems to be proto-AGIs. But I think the ways in which we are really counterintuitive to everybody is that we've actually been really quiet, because we are not a developer company. We don't sell APIs. We don't sell open source models. We also don't sell bottom up products. We're not a thing that you go and click and download the extension and like we want more users signing up for that thing. We're actually an enterprise company. So what we do is we work with a range of different companies, some like late stage multi-thousand people startups, some Fortune 500s, et cetera. And what we do for them is we basically give them an out of the box solution where big complex workflows that their employees do every day could be delegated to the model. And so we look a little different from other companies in that, in order to go build this full agent thing, the most important thing you got to get right is reliability. So initially, zooming way back when, one of the first things that Adept did was we released this demo called ACT-1, right? ACT-1 was like pretty cool. It's like kind of become a hello world thing for people to show agent demos by going to Redfin and asking to buy a house somewhere, because like we did that in the original ACT-1 demo and like showed that, showed like Google Sheets, all this other stuff. Over the last year since that came out, there's been a lot of really cool demos, and you go play with them and you realize they work 60% of the time. But since we've always been focused on how do we build an amazing enterprise product, enterprises can't use anything that isn't in the nines of reliability. And so we've actually had to go down a slightly different tech tree than what you might find in the prompt engineering sort of plays in the agent space to get that reliability. And we've decided to prioritize reliability over all else. So like one of our use cases is crazy enough that it actually ends with a physical truck being sent to a place as the result of the agent workflow. And if that works like 60% of the time, you're just blowing money and poor truck drivers are going places.

Alessio [00:16:30]: Interesting. One of our investment teams has this idea of "services as software". I'm actually giving a talk at NVIDIA GTC about this, but basically software as a service is wrapping user productivity in software, and services as software is replacing things that, you know, you would ask somebody to do, and the software just does it for you. When you think about these use cases, do the users still go in and look at the agent kind of like doing the things, and can they intervene, or are they totally removed from them?
Like the truck thing, does the truck just show up, or are there people in the middle checking in?

David [00:17:04]: I think there's two current flaws in the framing for services as software, or I think what you just said. I think that one of them is, in our experience, as we've been rolling out Adept, the people who actually do the jobs are the most excited about it, because they don't go from "I do this job" to "I don't do this job". They go from "I do this job for everything, including the shitty rote stuff" to "I'm a supervisor". And I literally like, it's pretty magical when you watch the thing being used, because now it parallelizes a bunch of the things that you had to do sequentially by hand as a human. And you can just click into any one of them and be like, hey, I want to watch the trajectory that the agent went through to go solve this. And the nice thing about agent execution, as opposed to like LLM generations, is that a good chunk of the time when the agent fails to execute, it doesn't give you the wrong result. It just fails to execute. And the whole trajectory is just broken and dead and the agent knows it, right? So then those are the ones that the human then goes and solves. And so then they become a troubleshooter. They work on the more challenging stuff. They get way, way more stuff done and they're really excited about it. I think the second piece of it that we've found is our strategy as a company is to always be an augmentation company. And I think one, out of principle, that's something we really care about. But two, actually, if you're framing yourself as an augmentation company, you're always going to live in a world where you're solving tasks that are a little too hard for what the model can do today and still need a human to provide oversight, provide clarifications, provide human feedback. And that's how you build a data flywheel. That's how you actually learn from the smartest humans how to solve things models can't do today. And so I actually think that being an augmentation company forces you to go develop your core AI capabilities faster than someone who's saying, ah, okay, my job is to deliver you a lights-off solution for X.

Alessio [00:18:42]: Yeah. It's interesting because we've seen two parts of the market. One is, we have one company that does agents for SOC analysts. People just don't have them, you know, and they just cannot attract the talent to do it. And similarly, in software development, you have Copilot, which is the augmentation product, and then you have sweep.dev and these products which just do the whole thing. I'm really curious to see how that evolves. I agree that today the reliability is so important in the enterprise that they just don't use most of them. Yeah. Yeah. No, that's cool. But it's great to hear the story, because I think from the outside, people are like, oh, Adept, they do ACT-1, they do Persimmon, they do Fuyu, they do all this stuff. Yeah, it's just the public stuff.

Swyx [00:19:20]: It's just public stuff.

David [00:19:21]: So one of the things we haven't shared before is we're completely sold out for Q1. And so I think...

Swyx [00:19:26]: Sold out of what?

David [00:19:27]: Sold out of bandwidth to go onboard more customers. And so we're working really hard to go make that less of a bottleneck, but our expectation is that I think we're going to be significantly more public about the broader product shape and the new types of customers we want to attract later this year.
So I think that clarification will happen by default.

Swyx [00:19:43]: Why have you become more public? You know, if the whole push has... You're sold out, you're mainly enterprise, but you're also clearly putting effort towards being more open or releasing more things.

David [00:19:53]: I think we just flipped over that way fairly recently. That's a good question. I think it actually boils down to two things. One, I think that, frankly, a big part of it is that the public narrative is really forming around agents as being the most important thing. And I'm really glad that's happening, because when we started the company in January 2022, everybody in the field knew about the agents thing from RL, but the general public had no conception of what it was. They were still hanging their narrative hat on the tree of "everything's a chatbot". And so I think now one of the things that I really care about is that when people think agent, they actually think the right thing. All sorts of different things are being called agents. Chatbots are being called agents. Things that make a function call are being called agents. To me, an agent is something that you can give a goal and get an n-step workflow done correctly in the minimum number of steps. And so that's a big part of why. And I think the other part is because I think it's always good for people to be more aware of Adept as they think about what the next thing they want to do in their careers. The field is quickly pivoting in a world where foundation models are looking more and more commodity. And I think a huge amount of gain is going to happen from how do you use foundation models as the well-learned behavioral cloner to go solve agents. And I think people who want to do agents research should really come to Adept.

Swyx [00:21:00]: When you say agents have become more part of the public narrative, are there specific things that you point to? I'll name a few. Bill Gates, in his blog post, mentioning that agents are the future - I'm the guy who made OSes, and I think agents are the next thing. So Bill Gates, I'll call that out. And then maybe Sam Altman also saying that agents are the future for OpenAI.

David [00:21:17]: I think before that even, I think there was something like the New York Times - Cade Metz wrote a New York Times piece about it. Right now, in a bid to differentiate, I'm seeing AI startups that used to just brand themselves as an AI company now brand themselves as an AI agent company. It's just, it's a term I just feel like people really want.

Swyx [00:21:31]: From the VC side, it's a bit mixed. Is it? As in like, I think there are a lot of VCs where like, I would not touch any agent startups because like- Why is that? Well, you tell me.

Alessio [00:21:41]: I think a lot of VCs that are maybe less technical don't understand the limitations of the-

Swyx [00:21:46]: No, that's not fair.

Alessio [00:21:47]: No, no, no, no. I think like- You think so? No, no. I think like, what is possible today and what is worth investing in, you know? And I think, I mean, people look at you and say, well, these guys are building agents. They needed 400 million to do it. So a lot of VCs are maybe like, oh, I would rather invest in something that is tacking on AI to an existing thing, which is easier to get to market and kind of get some of the flywheel going. But I'm also surprised a lot of funders just don't want to do agents. It's not even the funding. Sometimes we look around and it's like, why is nobody doing agents for X?
Wow.

David [00:22:17]: That's good to know actually. I never knew that before. My sense from my limited perspective is there's a new agent company popping up every day.

Swyx [00:22:24]: So maybe I'm- They are. They are. But like I have advised people to take agents off of their title because it's so diluted.

David [00:22:31]: It's now so diluted.

Swyx [00:22:32]: Yeah. So then it doesn't stand for anything. Yeah.

David [00:22:35]: That's a really good point.

Swyx [00:22:36]: So like, you know, you're a portfolio allocator. You have people know about Persimmon, people know about Fuyu and Fuyu Heavy. Can you take us through how you think about the evolution of that, and what people should think about what that means for Adept and its research directions? Kind of take us through the stuff you shipped recently and how people should think about the trajectory of what you're doing.

David [00:22:56]: The critical path for Adept is we want to build agents that can do a higher and higher level abstraction things over time, all while keeping an insanely high reliability standard. Because that's what turns us from research into something that customers want. And if you build agents with really high reliability standard, but are continuing pushing a level of abstraction, you then learn from your users how to get that next level of abstraction faster. So that's how you actually build the data flow. That's the critical path for the company. Everything we do is in service of that. So if you go zoom way, way back to ACT-1 days, right? Like the core thing behind ACT-1 is, can we teach a large model basically how to even actuate your computer? And I think we're one of the first places to have solved that and shown it, and shown the generalization that you get when you give it various different workflows and texts. But what I think we really realized from there on out was that in order to get reliability, companies just do things in various different ways. You actually want these models to be able to get a lot better at having some specification of some guardrails for what it actually should be doing. And I think in conjunction with that, a giant thing that was really necessary is really fast multimodal models that are really good at understanding knowledge work and really good at understanding screens. And that needs to kind of be the base for some of these agents. Back then we had to do a ton of research basically on how do we actually make that possible. Well, first off, back in, I forget exactly when, going into '23, there were no multimodal models really that you could use for things like this. And so we pushed really hard on stuff like the Fuyu architecture. I think one big hangover of the primarily academic focus for multimodal models is most multimodal models are primarily trained on like natural images, cat and dog photos, stuff that's come out of the camera. COCO. Yeah, right. And COCO is awesome. Like I love COCO. I love TY. Like it's really helped the field. Right. But like that's the build one thing. I actually think it's really clear today: multimodal models are the default foundation model, right? It's just going to supplant LLMs. Like you just train a giant multimodal model. And so for that though, where are they going to be the most useful? They're going to be most useful in knowledge work tasks. That's where the majority of economic value is going to be. It's not in cat and dogs. Right. And so if that's what it is, what do you need to train?
I need to train on like charts, graphs, tables, invoices, PDFs, receipts, unstructured data, UIs. That's just a totally different pre-training corpus. And so Adept spent a lot of time building that. And so the public Fuyu models and stuff aren't trained on our actual corpus; they're trained on some other stuff. But you take a lot of that data and then you make it really fast and make it really good at things like dense OCR on screens. And then now you have the right raw putty to go make a good agent. So that's kind of like some of the modeling side. We've kind of only announced some of that stuff. We haven't really announced much of the agents work. But if you put those together with the correct product form factor - and I think the product form factor also really matters. I think we're seeing, and you guys probably see this a little bit more than I do, but we're seeing a little bit of a pushback against the tyranny of chatbots as a form factor. And I think that the reason why the form factor matters is the form factor changes what data you collect in the human feedback loop. And so I think we've spent a lot of time doing full vertical integration of all these bits in order to get to where we are.

Swyx [00:25:44]: Yeah. I'll plug Amelia Wattenberger’s talk at our conference, where she gave a little bit of the thinking behind what else could exist, other than chatbots, if you could delegate to reliable agents. I was kind of excited about Adept Experiments, or Adept Workflows, I don't know what the official name for it is. I was like, okay, this is something I can use, but it seems like it's just an experiment for now. It's not your product.

David [00:26:06]: So we basically just use Experiments as a way to go push various ideas on the design side to some people and just be like, yeah, we'll play with it. Actually the Experiments code base underpins the actual product, but the code base itself is kind of like a skeleton for us to go deploy arbitrary cards on the side.

Swyx [00:26:22]: Yeah.

Alessio [00:26:23]: Makes sense. I was going to say, I would love to talk about the interaction layer. So you train a model to see UI, but then there's the question of how do you actually act on the UI? I think there were some rumors about OpenAI building agents that manage the endpoint - the whole computer - whereas you're more at the browser level. I read in one of your papers you have a different representation: you don't just take the DOM and act on it, you do a lot more stuff. How do you think about the best way the models will interact with the software, and how will the development of products change with that in mind, as more and more of the work is done by agents instead of people?

David [00:26:58]: This is, there's so much surface area here, and it's actually one of the things I'm really excited about. And it's funny, because I've spent most of my time doing research stuff, but there's a whole new ball game that I've been learning about, and I find it really cool. So I would say the best analogy I have for why Adept is pursuing a path of being able to use your computer like a human - plus of course being able to call APIs, and being able to call APIs is the easy part; being able to use your computer like a human is the hard part - it's the same reason why people are excited about humanoid robotics, right? In a world where you had T equals infinity, right?
You're probably going to have various different form factors that robots could be in, with all the specialization. But the fact is that humans live in a human environment. So having a humanoid robot lets you do things that humans do without changing everything along the way. It's the same thing for software, right? If you go itemize out the number of things you want to do on your computer for which every step has an API, those numbers of workflows add up pretty close to zero. And so at many points along the way, you need the ability to actually control your computer like a human. It also lets you learn from human usage of computers as a source of training data that you don't get if you have to somehow figure out how every particular step needs to be some particular custom private API thing. And so I think this is actually the most practical path. I think because it's the most practical path, I think a lot of success will come from going down this path. I kind of think about this early-days agent interaction layer as a little bit like - do you all remember Windows 3.1? Like those days? Okay, I might be too old for you guys on this. But back in the day, Windows 3.1, we had this transition period between pure command line, right, being the default, into this new world where the GUI is the default and then you drop into the command line for like programmer things, right? The old way was you booted your computer up, DOS booted, and then it would give you the C:\ thing. And you typed Windows and you hit enter, and then you got put into Windows. And then the GUI kind of became a layer above the command line. The same thing is going to happen with agent interfaces: today the GUI is the base layer, and the agent just controls the current GUI layer plus APIs. And in the future, as more and more trust is built towards agents and more and more things can be done by agents, if more UIs for agents are actually generative in and of themselves, then that just becomes a standard interaction layer. And if that becomes a standard interaction layer, what changes for software is that a lot of software is going to be either systems of record or certain customized workflow execution engines. And a lot of how you actually do stuff will be controlled at the agent layer.

Alessio [00:29:19]: And you think the Rabbit interface is more like that - you're not actually seeing the app that the model interacts with. You're just saying, hey, I need to log this call on Salesforce, and you're never actually going on salesforce.com directly as the user. I can see that being a model.

David [00:29:33]: I think I don't know enough about what using Rabbit in real life will actually be like to comment on that particular thing. But I think the broader idea that, you know, you have a goal, right? The agent knows how to break your goal down into steps. The agent knows how to use the underlying software and systems of record to achieve that goal for you. The agent maybe presents you information in a custom way that's only relevant to your particular goal. All of it just really leads to a world where you don't really need to ever interface with the apps underneath unless you're a power user for some niche thing.

Swyx [00:30:03]: General question. So first of all, I think like the sort of input mode conversation.
I wonder if you have any analogies that you like with self-driving, because I do think there's a little bit of how the model should perceive the world, and you know, the primary split in self-driving is LiDAR versus camera. And I feel like most agent companies that I'm tracking are all moving towards the camera approach, which is like the multimodal approach, you know, multimodal vision, very heavy vision, all the Fuyu stuff that you're doing. You're focusing on that, including charts and tables. And do you find inspiration there from the self-driving world? That's a good question.

David [00:30:37]: I think sometimes the most useful inspiration I've found from self-driving is the levels analogy. I think that's awesome. But I think that our number one goal is for agents not to look like self-driving. We want to minimize the chances that agents are sort of a thing that you just have to bang your head at for a long time to get to two discontinuous milestones, which is basically what's happened in self-driving. We want to be living in a world where you have the data flywheel immediately, and that takes you all the way up to the top. But similarly, I mean, compared to self-driving, two things that people really undervalue: one is that it's really easy to do a "driving a car down Highway 101 on a sunny day" demo. That actually doesn't prove anything anymore. And I think the second thing is that, as a non-self-driving expert, I think one of the things that we believe really strongly is that everyone undervalues the importance of really good sensors and actuators. And actually a lot of what's helped us get a lot of reliability is a really strong focus on why does the model not do this thing? And a non-trivial amount of the time, when the model doesn't actually do the thing, it's because if you're Wizard-of-Oz-ing it yourself, or if you have unreliable actuators, you can't do the thing. And so we've had to fix a lot of those problems.

Swyx [00:31:43]: I was slightly surprised, just because I do generally consider the Waymos that we see all around San Francisco as the most, I guess, real case of agents that we have in very material ways.

David [00:31:55]: Oh, that's absolutely true. I think they've done an awesome job, but it has taken a long time for self-driving to mature from when it entered the consciousness and the driving-down-101-on-a-sunny-day moment happened to now. Right. So I want to see that more compressed.

Swyx [00:32:07]: And I mean, you know, Cruise, you know, RIP. And then one more thing, just going back on this reliability thing: something I have been holding in my head that I'm curious to get your commentary on is I think there's a trade-off between reliability and generality, or I want to broaden reliability into just general sort of production readiness and enterprise readiness scale. Because beyond reliability, you also have cost, you have speed - speed is a huge emphasis for Adept. The tendency or the temptation is to reduce generality to improve reliability and to improve cost, improve speed. Do you perceive a trade-off? Do you have any insights that solve those trade-offs for you guys?

David [00:32:42]: There's definitely a trade-off if you're at the Pareto frontier - but I think a lot of folks aren't actually at the Pareto frontier. I think the way you get there is basically how do you frame the fundamental agent problem in a way that just continues to benefit from data?
I think one of the main ways of being able to solve that particular trade-off is you basically just want to formulate the problem such that every particular use case just looks like you collecting more data to go make that use case possible. I think that's how you really solve it. Then you get into the other problems, like, okay, are you overfitting on these end use cases? You're not doing a thing where you're being super prescriptive for the end steps that the model can only do, for example.

Swyx [00:33:17]: Then the question becomes, do you have one house model that you can then customize for each customer, and you're fine-tuning them on each customer's specific use case?

David [00:33:25]: Yeah.

Swyx [00:33:26]: We're not sharing that. You're not sharing that. It's tempting, but that doesn't look like AGI to me. You know what I mean? That is just you have a good base model and then you fine-tune it.

David [00:33:35]: For what it's worth, I think there's two paths to a lot more capability coming out of the models that we all are training these days. I think one path is you figure out how to spend compute and turn it into data. In that path, I consider search, RL, all the things that we all love in this era as part of that path - like self-play, all that stuff. The second path is how do you get super competent, high intelligence demonstrations from humans? I think the right way to move forward is you kind of want to combine the two. The first one gives you maximum sample efficiency for a little second, but I think that it's going to be hard to be running at max speed towards AGI without actually solving a bit of both.

Swyx [00:34:16]: You haven't talked much about synthetic data, as far as I can tell. Probably this is a bit too much of a trend right now, but any insights on using synthetic data to augment the expensive human data?

David [00:34:26]: The best part about framing AGI as being able to help people do things on computers is you have an environment.

Swyx [00:34:31]: Yes. So you can simulate all of it.

David [00:34:35]: You can do a lot of stuff when you have an environment.

Alessio [00:34:37]: We were having dinner for our one-year anniversary. Congrats. Yeah. Thank you. Raza from HumanLoop was there, and we mentioned you were coming on the pod. This is our first-

Swyx [00:34:45]: So he submitted a question.

Alessio [00:34:46]: Yeah, this is our first, I guess, mailbag question. He asked: when you started, GPT-4 didn't exist. Now you have GPT-4 Vision to help you build a lot of those things. How do you think about the things that are unique to you as Adept - going back to maybe the research direction that you want to take the team, and what you want people to come work on at Adept - versus what has maybe now become commoditized, that you didn't expect everybody would have access to?

David [00:35:11]: Yeah, that's a really good question. I think implicit in that question, and I wish he were here too so he can push back on my assumption about his question, but I think implicit in that question is a calculus of where does advantage accrue in the overall ML stack. And maybe part of the assumption is that advantage accrues solely to base model scaling. But I actually believe pretty strongly that the way that you really win is that you have to go build an agent stack that is much more than that of the base model itself. And so I think that is always going to be a giant advantage of vertical integration.
Alessio [00:34:37]: We were having dinner for our one-year anniversary. Congrats. Yeah. Thank you. Raza from HumanLoop was there, and we mentioned you were coming on the pod. This is our first...
Swyx [00:34:45]: So he submitted a question.
Alessio [00:34:46]: Yeah, this is our first, I guess, like, mailbag question. He asked: when you started, GPT-4 didn't exist, and now you have GPT-4 Vision to help you build a lot of those things. How do you think about the things that are unique to you as Adept, going back to maybe the research direction that you want to take the team, and what you want people to come work on at Adept, versus what has maybe now become commoditized that you didn't expect everybody would have access to?
David [00:35:11]: Yeah, that's a really good question. I think implicit in that question, and I wish he were here too so he could push back on my assumption about his question, is a calculus of where advantage accrues in the overall ML stack. And maybe part of the assumption is that advantage accrues solely to base model scaling. But I actually believe pretty strongly that the way that you really win is that you have to go build an agent stack that is much more than the base model itself. And so I think that is always going to be a giant advantage of vertical integration. I think it lets us do things like have a really, really fast base model that is really good at agent things but is bad at cat and dog photos. Well, it's pretty good at cat and dog photos; it's just not, like, SOTA at cat and dog photos, right? So we're allocating our capacity wisely, right? That's one thing that you really get to do. I also think that the other thing that is pretty important now in the broader foundation modeling space is, despite any potential concerns about how good agents are as, like, a startup area, like we were talking about earlier, I feel super good that we're doing foundation models in service of agents, and all of the reward within Adept is flowing from: can we make a better agent? Because right now I think we all see that if you're training on publicly available web data, you put in the flops and you do reasonable things, then you get decent results. And if you just double the amount of compute, then you get predictably better results. And so I think pure-play foundation model companies are just going to be pinched by how good the next couple of Llamas are going to be, and the next good open source thing, and then by seeing the really big players put ridiculous amounts of compute behind just training these base foundation models. I think that is going to commoditize a lot of the regular LLMs and, soon, regular multimodal models. So I feel really good that we're just focused on agents.
Swyx [00:36:56]: So you don't consider yourself a pure-play foundation model company?
David [00:36:59]: No, because if we were a pure-play foundation model company, we would be training general foundation models that do summarization and all this other...
Swyx [00:37:06]: You're dedicated towards the agent. Yeah.
David [00:37:09]: And our business is an agent business. We're not here to sell you tokens, right? And I think, like, selling tokens, unless there's like a...
Swyx [00:37:14]: "Not here to sell you tokens." I love it.
David [00:37:16]: It's like, if you have a particular area of specialty, right, then you won't get caught in the fact that everyone's just scaling to ridiculous levels of compute. But if you don't have a specialty, I think it's going to be a little tougher.
Swyx [00:37:27]: Interesting. Are you interested in robotics at all? Just a...
David [00:37:30]: I'm personally fascinated by robotics. I've always loved robotics.
Swyx [00:37:33]: Embodied agents as a business, you know. Figure is a big, also sort of OpenAI-affiliated company that raised a lot of money.
David [00:37:39]: I think it's cool. I think, I mean, I don't know exactly what they're doing, but...
Swyx [00:37:44]: Robots. Yeah.
David [00:37:46]: Well, I mean, that's a...
Swyx [00:37:47]: Yeah. What question would you ask? If we had them on, what would you ask them?
David [00:37:50]: Oh, I just want to understand what their overall strategy is going to be between now and when there's reliable stuff to be deployed. But honestly, I just don't know enough about it.
Swyx [00:37:57]: And if I told you, hey, fire your entire warehouse workforce and, you know, put robots in there, isn't that a strategy? Oh yeah.
David [00:38:04]: Yeah. Sorry. I'm not questioning whether they're doing smart things. I genuinely don't know what they're doing as much, but I think there's two things. One, I'm so excited for someone to train a foundation model of robots. It's just, I think it's just going to work.
Like, I will die on this hill. But, I mean, this whole time we've been on this podcast, we've just been continually saying these models are basically behavioral cloners. Right? So let's go behaviorally clone all this robot behavior, right? And then you figure out everything else you have to do in order to teach it how to solve a new problem. That's going to work. I'm super stoked for that. I think, unlike what we're doing with helping humans with knowledge work, it just sounds like a more zero-sum job replacement play, right? And I'm personally less excited about that.
Alessio [00:38:46]: We had Kanjun from Imbue on the podcast. We asked her why people should go work there and not at Adept.
Swyx [00:38:52]: Oh, that's so funny.
Alessio [00:38:54]: Well, she said, you know, there's space for everybody in this market; we're all doing interesting work. And she said they're really excited about building an operating system for agents, and for her, the biggest research thing was getting models better at reasoning and planning for these agents. The reverse question to you: why should people be excited to come work at Adept instead of Imbue? And maybe, what are the core research questions that people should be passionate about to have fun at Adept? Yeah.
David [00:39:22]: First off, I think that, and I'm sure you guys believe this too, the AI space, to the extent there's an AI space, and the AI agent space are both, exactly as she likely said, colossal opportunities, and people are just going to end up winning in different areas, and a lot of companies are going to do well. So I really don't feel that zero-sum-ness at all. I would say, to change the zero-sum framing: why should you be at Adept? I think there's two huge reasons to be at Adept. One of them is that everything we do is in the service of useful agents. We're not a research lab. We do a lot of research in service of that goal, but we don't think about ourselves as a classic research lab at all. And I think the second reason to work at Adept is: if you believe that actually having customers, and a reward signal from customers, lets you build AGI faster, which we really believe, then you should come here. And I think the examples for why that's true are, for example, our evaluations. They're not academic evals. They're not simulator evals. They're like: okay, we have a customer that really needs us to do these particular things; we can do some of them, and these are the ones they want us to do that we can't do at all. We've turned those into evals: solve it, right? I think that's really cool. Everybody knows a lot of these evals are pretty saturated, and for even the new ones that are not saturated, you look at one and you're like, is this actually useful? Right? I think that's a degree of practicality that really helps. We're equally excited about the same problems around reasoning and planning and generalization and all of this stuff, but they're very grounded in actual needs right now, which is really cool.
Swyx [00:40:45]: Yeah. This has been a wonderful dive. You know, I wish we had more time, but I would just leave it kind of open to you. I think you have broad thoughts, you know, just about the agent space, but also just the general AI space. Any sort of rants or things that are just top of mind for you right now?
David [00:40:57]: Any rants?
Swyx [00:40:59]: Mining you for just general...
David [00:41:01]: Wow. Okay.
So Amelia has already made the rant better than I have, but, like, "not just chatbots" is kind of rant one. And two is: AI has really been the story of compute, and compute plus data, and ways in which you could trade one for the other. And as much as our research community is really smart, and we have made many, many advancements, and that's going to continue to be important, I think the game is increasingly changing now, and the rapid industrialization era has begun. And I think we unfortunately have to embrace it.
Swyx [00:41:30]: Yep.
Alessio [00:41:31]: Excellent. Awesome, David. Thank you so much for your time.
David [00:41:34]: Cool. Thanks, guys.
-
Making Transformers Sing - with Mikey Shulman of Suno
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-03-14 16:48
Giving computers a voice has always been at the center of sci-fi movies; “I’m sorry Dave, I’m afraid I can’t do that” wouldn’t hit as hard if it just appeared on screen as a terminal output, after all. The first electronic speech synthesizer, the Voder, was built at Bell Labs 85 years ago (1939!), and it’s… something:

We will not cover the history of Text To Speech (TTS), but the evolution of the underlying architecture has generally been Formant Synthesis → Concatenative Synthesis → Neural Networks. Nowadays, state-of-the-art TTS is just one API call away with models like ElevenLabs and OpenAI’s TTS, or products like Descript. Latency is minimal, intonation is very good, and they can mimic a variety of accents. You can hack together your own voice AI therapist in a day!

But once you have a computer that can communicate via voice, what comes next? Singing 🎶, of course!

From Barking 🐶 to Singing 🎤

Today’s guest is Suno’s CEO and co-founder Mikey Shulman. He and his three co-founders, Georg, Martin, and Keenan, previously worked together at Kensho. One of their projects was financially-focused speech recognition (think earnings calls, etc.), but all four of them happened to be musicians and audiophiles. They started playing around with text-to-speech + AI + audio generation and eventually left Kensho to work on it full time.

A lot of people when we started a company told us to focus on speech. If we wanted to build an audio company, everyone said, speech is a bigger market. But I think there's something about music that's just so human and you almost couldn't prevent us from doing it. Like we just couldn't keep ourselves from building music models and playing with them because it was so much fun.

Their first big product was Bark, the first open source transformer-based “text-to-audio” model (architecturally inspired by Karpathy’s nanoGPT) that went from 0 to ~19,000 GitHub stars in a month. At the time they felt like audio was years behind text and image as a generation modality; unlike its predecessors, Bark could not only generate speech, but also music and sound effects like crying, laughing, sighing, etc. You can find a few examples here.

The main limitation they saw was that text-to-speech training data is extremely limited. So what they did instead was build a new type of foundation model from scratch, trained on audio, and then tweak it to do text-to-speech. Turning audio into tokens to do self-supervised learning was the most important innovation. Unlike TTS models, which are very narrow (and often sound unnatural), Bark was trained on real audio of real people from broad contexts, which made it harder to output unnatural-sounding speech.

As Bark got popular, more and more people started using it to generate music, and it became clear that their architecture would work to generate music that people enjoyed, even though it might not be “on the AGI path” of other labs:

Everybody is so focused on LLMs, for good reason, and information processing and intelligence there. And I think it's way too easy to forget that there's this whole other side of things that makes people feel, and maybe that market is smaller, but it makes people feel and it makes us really happy.

Suno bursts on the scene

In December 2023, Suno went viral with a gorgeous new website and launch tweet:

And rave reviews:

Music is core to our culture, but very few people are able to create it; Mikey and team want to make everyone an active participant in music making, not just a listener.
A “Midjourney of Music”, if you like.

We definitely had a lot of fun playing with Suno to generate all sorts of Latent Space jingles and songs; the product is live at suno.ai if you want to get in the studio yourself!

If Nas joined Latent Space instead of The Firm
182B models > Blink-182
The soundtrack of the post-scarcity Latent Space ranch

Scaling with Modal

Given the December launch, scaling up for the Christmas rush was a major concern. This will be a nice tie-in for loyal listeners: Suno runs on Modal (one of our featured guests from Compute Month)!

Suno V3

For those who want to appreciate someone special in their life, you can always try Suno’s special Valentine’s Day experience.

We preview this on the pod, but Suno has now officially shipped a V3 Alpha with a wealth of improvements, and you’ll have to click through to their demos or user reviews to see them. We’ve recently become paying customers ourselves, and are having loads of fun generating music. If you have any of your own generations to share, tag @latentspacepod on Twitter or swing by the LS Discord!

The AudioGen Landscape

Mikey breaks down the landscape into 3 big categories: music, speech, and sound effects (SFX). These look more like Venn diagrams than MECE categories.

Suno is the latest entry in a long series of audio generation efforts that combine both music and speech, reaching as far back as TensorFlow Magenta (we aren’t aware of prior AI music projects; please comment below if you can find a good timeline we can use with attribution!). Other efforts like Seamless blend translation and speech generation, and Audiobox combines speech and SFX. We’ve yet to see “one model to rule them all,” but surely it will happen, and probably Transformers (perhaps Diffusion Transformers) will be at the heart of them.

Show Notes
* Suno
* Bark
* Parakeet
* Mikey Shulman
* Goodhart Strikes Again
* Mastering the Two Halves of your brain
* nanoGPT repo
* "Return to Monkey"

Timestamps
* [00:00:00] Introduction
* [00:01:44] State of Music Generation Models
* [00:06:47] AI Data Wars & Copyright
* [00:10:32] Going from ML in finance to music generation
* [00:12:30] Suno's TTS origins with Bark and Parakeet
* [00:16:25] Easy vs Expert mode for music
* [00:21:44] The Midjourney of Music?
* [00:23:43] Live demo
* [00:36:00] Remaking vs Creating
* [00:38:12] Suno's direction
* [00:41:52] Beyond single track generation
* [00:43:53] Favorite Suno usage in the wild
* [00:46:00] The 2 mins overview of the audio generation space
* [00:48:42] Benchmarking AI

Transcription

Alessio [00:00:01]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.
Swyx [00:00:10]: Hey, and today we are in the remote studio with Mikey Shulman. Welcome.
Mikey [00:00:16]: Thank you. It's great to be here.
Swyx [00:00:17]: So I'd like to go over people's background on LinkedIn and then maybe find out a little bit more outside of LinkedIn. You did your bachelor's in physics, and then a PhD in physics as well, before going into Kensho Technologies, the home of a lot of top AI startups, it seems like, where you were head of machine learning for seven years. You were also a lecturer at MIT; we can talk about that, and what you taught. And then about two years ago, you left to start Suno, which has recently burst on the scene as one of the top music generation startups.
So we can go over that bio, but also, I guess, what's not in your LinkedIn that people should know about you?
Mikey [00:01:06]: I love music. I am an aspiring mediocre musician. I wish I were better, but that doesn't make me not enjoy playing real music. And I also love coffee. I'm probably way too much into coffee.
Alessio [00:01:19]: Are you one of those people that, you know, they do the TikToks, they use like 50 tools to grind the beans and then brush them and then spray them? Like, what level are we talking about here?
Mikey [00:01:31]: I confess there's a spray bottle for beans in the next room. There is one of those weird comb tools. So, guilty. I don't put it on TikTok though.
Alessio [00:01:42]: Yeah, no, no. Some things gotta stay private.
Mikey [00:01:46]: I played a lot of piano growing up, and I play bass, and I, in a very mediocre way, play guitar and drums. Yeah. Right.
Alessio [00:01:55]: That's a lot. I cannot do any of those things. As Sean mentioned, you guys kind of burst onto the scene as maybe the state-of-the-art music generation company. I think it's a modality that we haven't really covered in the past. So I would love for you to just give a brief intro of how you do music generation and why it's possible. Because I think people understand that with text you have to predict the next word, and that a diffusion model basically adds noise to an image and then kind of removes the noise. But I think for music, it's hard for people to have a mental model. Like, how do you turn a music model on? What does a music model do to generate a song? So maybe we can start there.
Mikey [00:02:41]: Yeah. Maybe I'll even take one more step back and say it's not even entirely worked out in the way it is in text. And so it's an evolving field. If you take a giant step back, I think audio has been lagging images and text for a while. So very roughly, you can think audio is like one to two years behind images and text. But you kind of have to think of audio today like text in 2022 or something like this. You know, the transformer was invented, it looks like it works, but it's far, far less established. And so I'll give you the way we think about the world now, but just with the big caveat that I'm probably wrong, if we look back a couple of years from now. And I think the biggest thing is you see both transformer-based and diffusion-based models for audio, in ways that that is not true in text. I know people will do some diffusion for text, but I think nobody's really doing that for real. And so we prefer transformers, for a variety of reasons. And so you can think of it as very similar to text: you have some abstract notion of a token, and you train a model to predict the probability over all of the next tokens. So it's a language model. A language model, in anything, is just something that assigns likelihoods to sequences of tokens. Sometimes those tokens correspond to text. In our case, they correspond to music, or audio in general. And I think we've learned a lot from our friends in the text domain, from the pioneers doing this, about how well these transformer models work: where do they work, where do they not work? But at its core, the way we like to do things with transformers is exactly like it works in text. Let me predict the next tiny little bit of audio, and I can just keep doing that and doing that and generating audio as long as I want.
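(To make the "language model over audio tokens" framing concrete, here is a minimal decoder-only sketch in PyTorch. The vocabulary size, context length, and the codec that would map waveforms to and from these discrete tokens are all assumptions for illustration; Suno has not published their architecture.)

```python
# Minimal sketch: audio generation as next-token prediction over codec tokens.
# Hypothetical sizes throughout; a real system pairs this with a neural codec
# that converts waveforms to/from these discrete tokens.
import torch
import torch.nn as nn

VOCAB = 1024   # size of the audio-token codebook (assumed)
CTX = 256      # context window in tokens (assumed)

class TinyAudioLM(nn.Module):
    def __init__(self, d=128, heads=4, layers=2):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(CTX, d)
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):  # tokens: (batch, time)
        T = tokens.shape[1]
        x = self.emb(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # causal mask makes the encoder stack behave as a decoder-only LM
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        return self.head(self.backbone(x, mask=mask))  # (batch, time, VOCAB)

@torch.no_grad()
def generate(model, prompt, n_new):
    """Exactly Mikey's loop: predict the next little bit, append, repeat."""
    tokens = prompt
    for _ in range(n_new):
        logits = model(tokens[:, -CTX:])[:, -1]          # next-token logits
        nxt = torch.multinomial(logits.softmax(-1), 1)   # sample one token
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens  # a codec would decode these back into a waveform
```

Training is the standard cross-entropy next-token objective; the only audio-specific part is where the tokens come from.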
Swyx [00:04:39]: Yeah. I think the temptation here is to always try to bake in some specialized knowledge about music or audio, and obviously you will get an improvement in your output if you say, okay, here's a set of notes, or here's a set of tokens that only do jazz, or only do voices. How general do you make it versus how specific do you make it?
Mikey [00:05:10]: We've always tried to do things, you know, quote-unquote, the right way, which means that at the beginning things are going to be hard and worse than other ways. But that is to say: bake in as little implicit knowledge as possible. The same way that, in GPT, you don't program in "this is a noun and this is a verb," but it has implicitly learned all of those things. I've never seen GPT accidentally, you know, put a noun where it meant to put an article in English. We try not to impose anything about music or audio in general into the model, and we kind of let the models learn things by themselves. And I think things are beginning to pay off, but it's not necessarily obvious from the beginning that that was the right thing to do. So, for example, you could take something like text-to-speech, and people will do all sorts of things where you program in things like phonemes to be the basis for what you do. And then that kind of limits you to the set of things that are expressible by phonemes. And so, ultimately, that works really well in the short term; in the long term, it can be quite limiting. And so our approach has always been to try to do this in its full generality, as end-to-end as we can do it. Even if it means that in the short term we were a little bit worse, we have a lot of confidence that in the long term that will be the right way to do it.
Alessio [00:06:33]: And what's the data recipe for training a good music model? Like, what percentage of each genre do you put in? Also, do you split vocals and instrumentals?
Mikey [00:06:43]: So you have to do lots of things, and I think this is the biggest area where we have, you know, sort of our secret sauce. To a large extent, what we do is we benefit from all of the beautiful things people do with transformers and text, and we focus very hard basically on: how do I tokenize audio in the right way? And without divulging too much secret sauce, it's at least similar to how it's done in sort of the open source stuff. You will have different models that learn to encode audio in discrete representations. And a lot of this boils down to figuring out the right, let's say, implicit biases to put in those models, the right data to inject: how do I make sure that I can produce kind of all audio arbitrarily? That's speech, that's background music, that's vocals, that's kind of everything, to make sure that I can really capture all the behavior that I want to.
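(The "discrete representations" Mikey alludes to usually come from a neural audio codec. Residual vector quantization is the generic published technique behind SoundStream/EnCodec-style codecs; the sketch below shows the idea, with no claim that it matches Suno's tokenizer.)

```python
# Sketch of residual vector quantization (RVQ): each stage quantizes whatever
# error the previous stage left behind. Generic technique, not Suno-specific.
import torch

def rvq_encode(latent, codebooks):
    """latent: (d,) embedding of one codec frame; codebooks: list of (K, d)
    tensors. Returns one token index per codebook."""
    residual = latent
    tokens = []
    for cb in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), cb).squeeze(0)  # (K,)
        idx = int(dists.argmin())          # nearest codebook entry
        tokens.append(idx)
        residual = residual - cb[idx]      # quantize what's left over
    return tokens
```

Each codec frame becomes a handful of codebook indices, and those indices are the "audio tokens" a transformer like the one sketched earlier can model.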
Alessio [00:07:40]: Yeah, that makes sense. And then, in terms of some of... We had our monthly recap last month, and the data wars were kind of one of the hot topics. You saw the New York Times lawsuit against OpenAI; there you obviously have large language models in production. You don't have large music models in production, so I think there's maybe been less of a trade there, so to speak. How do you kind of think about that? There's obviously a lot of copyright-free, royalty-free music out there. Is there any kind of power law, in terms of, hey, the best music is actually much better to train on? Or, in music, does it not really matter, because some of the musical structure is kind of the same?
Mikey [00:08:27]: I don't think we know these things nearly as well as they're known in text. We have some notions of some of the scaling laws here, but I think, yeah, we're just so, so far behind. You know, what I will say is that people are always surprised to learn that we don't only train on music. And I usually give the analogy of some of the code generation models. So take something like Code Llama, which is, as far as I know, the best open source code generating model (you guys would know better than I would; it's certainly up there). And it's trained on a bunch of English, not only just code, because there are patterns in English that are going to be useful. And so you can imagine, you don't only want to train on music to get good music models. For example, one of the places where we are particularly bad is vocals, and capturing really realistic vocals. And so you might imagine that there are other types of human vocals that you can put into your model, that are not music, that will help it learn stuff. And so, again, I think it's super, super early. I think we've barely scratched the surface of what the right ways to do this are. And that's really cool: from a progress perspective, there's a lot of low-hanging fruit for us to still pick.
Alessio [00:09:42]: And then, once you get the final model, I would love to learn more about the size of these models. Because people are confused that Stable Diffusion is so small. They're like, oh, this thing can generate any image; how is it possible that it's, like, a couple of gigabytes? And then the large language models are like, oh, these are so big, but it's just text in them. What's it like for music? Is it in between? And as you think about, yeah, you mentioned scaling and whatnot: is this something that you see as easy for people to run locally or not?
Mikey [00:10:11]: Our models are still pretty small, certainly by tech standards. I confess I don't know the state of the art on how diffusion models scale as well. But our models scale similarly to text transformers: bigger is usually better. Audio has a couple of weird quirks, though. We care a lot about how many tokens per second we can generate, because we need to stream you music as fast as you can listen to it. And so that is a big one that I think probably has us never get to a 175 billion parameter model, if I'm being honest. Maybe I'm wrong there, but I think that would be technologically difficult. And then the other thing is that so much progress happens in shrinking models down for the same performance in text that I'm hopeful, at least, that a lot of our issues will get solved and we will figure out how to do better things with smaller models, or relatively smaller models. But I think the other thing, and it's a blessing and a curse, is the ability to add performance with scale. It's a very straightforward way to make your models better: you just make a bigger model and dump more compute into it. But it's also a curse, because that is a crutch that you will always lean on, and you will forget to do some of the basic research to make your stuff better.
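(Mikey's streaming constraint is worth a back-of-envelope check. The codec numbers below, a 50 Hz frame rate with 8 codebooks per frame, are plausible assumptions rather than anything Suno has confirmed.)

```python
# Why streaming caps model size: real-time playback sets a hard floor on
# generation speed. All numbers here are illustrative assumptions.
frame_rate_hz = 50              # codec frames per second of audio (assumed)
codebooks_per_frame = 8         # RVQ codebooks per frame (assumed)
required = frame_rate_hz * codebooks_per_frame   # 400 tokens/s just to keep up

model_throughput = 600          # tokens/s a hypothetical small model sustains
print(f"need {required} tok/s; model runs at {model_throughput / required:.2f}x realtime")
# A 175B-parameter model generating a few tens of tokens per second would fall
# far short of real time, which is one reason to skew smaller for music.
```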
And honestly, early on, when we were doing stuff with small models because of time and compute constraints, we ended up having to learn a lot of things to make models better that we might not have learned if we had immediately jumped to a really, really big model. So for us, we've always tried to skew smaller to the extent possible.
Swyx [00:11:56]: Yeah, gotcha. I'm curious about your overall evolution so far. Something I think we may have missed in the introduction is: why did you end up choosing the music domain in the first place? You have this pretty scientific physics and finance background. How did you wander over to music? A lot of us have an interest in music, but we don't necessarily choose to work in it. But you did.
Mikey [00:12:26]: Yeah, it's funny. I have a really fun job as a result. But all the co-founders of Suno worked at Kensho together, and we were doing mostly text, in fact all text, until we did one audio project that was speech recognition, kind of very financially-focused speech recognition. And I think the long and short of it is we kind of fell in love with audio, not necessarily music, just audio and AI. We all happen to be musicians and audiophiles and music lovers, but it was the combination of audio and AI that we initially really, really fell in love with. It's so cool. It's so interesting. It's so human. It's so far behind images and text that there's so much more to do. And honestly, a lot of people, when we started a company, told us to focus on speech. If we wanted to build an audio company, everyone said, you know, speech is a bigger market. But I think there's something about music that's just so human that it almost couldn't prevent us from doing it. We just couldn't keep ourselves from building music models and playing with them, because it was so much fun. And that's kind of what steered us there. In fact, the first thing we ever put out was a speech model: it was Bark, this open source text-to-speech model, and it got a lot of stars on GitHub. And that was people telling us even more, like, go do speech. And we almost couldn't help ourselves from doing music. And so, I don't know, maybe it's a little bit serendipitous, but we haven't really looked back since. I don't think there was necessarily an aha moment. It was just organic, and just obvious to us, that we want to make a music company.
Swyx [00:14:19]: So you do regard yourself as a music company? Because as of last month, you were still releasing speech models.
Mikey: We were?
Swyx: Parakeet.
Mikey [00:14:27]: Oh, yes, that's right. So that's a really awesome collaboration with our friends at NVIDIA. I think we are really, really focused on music. I think that is the stuff that will really change things for the better. Honestly, everybody is so focused on LLMs, for good reason, and information processing and intelligence there. And I think it's way too easy to forget that there's this whole other side of things that makes people feel. And maybe that market is smaller, but it makes people feel, and it makes us really happy. And so we do it. I think that doesn't mean that we can't be doing things that are related, that are in our wheelhouse, that will improve things. And so, like I said, audio is just so far behind. There's just so much more to do in the domain more generally.
And so, like, that's a really fun collaboration.
Swyx [00:15:20]: Yeah, I did hear about Suno first through Bark. My sense is, what did Bark lean off of? Because obviously, I think there was a lot of preceding TTS work in open source. How much of that did you use? How much of that was brand new from your research? What's the intellectual lineage there, just to cover the speech recognition side?
Mikey [00:15:46]: So it's not speech recognition; it's text-to-speech. But as far as I know, there was no other text-to-speech, certainly not in the open source, that was transformer-based. Everything else was what I would call the old style of doing things, where you build these kind of single-purpose models that are really good at this one narrow task, and you're kind of always data-limited; the availability of high-quality training data for text-to-speech is limited. And I don't think we were necessarily all that inventive to say: we're going to train, in a self-supervised way, a transformer-based model on lots of audio, and then tweak it so that we can do text-to-speech based on that. That would be kind of the new way of doing things; "foundation model" is the buzzword, if you will. And so, you know, we built that up, I think, from scratch. A lot of shout-outs have to go to lots of different things, whether it's papers, but also, it's very obvious, there's a big shout-out to Andrej Karpathy's nanoGPT; there's a lot of code borrowed from there. I think we are huge fans of that project. It just shows people that you don't have to be afraid of GPT-type things; it's actually not all that much code to make performant transformer-based models. And, you know, again, the stuff that we brought there was: how do we turn audio into tokens? And then we could kind of take everything else from the open source. So we put that model out, and we were, I think, pleasantly surprised by the reception from the community. It got a good number of GitHub stars, and people really enjoyed playing with it, because it made really realistic-sounding audio. And I think this is, again, the thing about doing things in a, quote-unquote, right way: if you have a model where you've had to put in so much implicit bias for this one very narrow task of making speech that sounds like words, you're going to sacrifice on other things. And in the text-to-speech case, it's how natural the speech sounds. It was almost difficult to pull unnatural-sounding speech out of Bark, because it was self-supervised, trained on a lot of natural-sounding speech. And so that definitely told us that this is probably the right way to keep doing audio.
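(That "tweak it so we can do text-to-speech" step is, in published Bark-style systems, mostly a conditioning trick: text tokens are placed in the same sequence the audio model already predicts, so one decoder-only transformer learns p(audio | text). The offset scheme and helper below are our illustrative assumptions, not Bark's real vocabulary.)

```python
# Sketch: conditioning a self-supervised audio LM on text, Bark-style.
# Text IDs are shifted into their own range so one flat vocabulary covers both.
# Illustrative only; not Bark's actual encoding.
import torch

AUDIO_VOCAB = 1024  # audio-token IDs occupy [0, AUDIO_VOCAB) (assumed)

def make_tts_sequence(text_ids, audio_ids):
    """Build one training sequence: [text prefix | audio continuation].
    The loss is masked out on the prefix; at inference you feed only the
    text prefix and sample audio tokens after it."""
    prefix = [t + AUDIO_VOCAB for t in text_ids]  # shift text into its own range
    return torch.tensor(prefix + list(audio_ids))
```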
Swyx [00:18:04]: Even in Bark, you had the beginnings of music generation; you could just put a music note in there.
Mikey [00:18:10]: That's right. And it was so cool to see, on our Discord, people trying to pull music out of a text-to-speech model. And so, you know, what did this tell us? This tells us people are hungry to make music. And it's almost obvious in hindsight how wired humans are to make music. If you've ever seen a little kid, you know, sing before they know how to speak, it's like, this is really human nature. And there's actually a lot of cultural forces that kind of cue you to not think to make music. And that's kind of what we're trying to undo.
Alessio [00:18:42]: And to dive into Suno itself: I think, especially when you go from text-to-speech, people are like, okay, now I've got to write the lyrics to a whole song, and that's quite hard to do. Whereas in Suno, you have this empty box, very Midjourney, kind of DALL·E-like, where you can just express the vibes, you know, of what you want it to be. But then you also have a custom mode where you can set your own lyrics, you can set your own rhythm, you can set the title of the song, and whatnot. How do you see users distribute themselves? You know, I'm guessing a lot of people use the easy mode. Are you seeing a lot of power users using the custom mode? And maybe, what are some of your favorite use cases that you've seen so far on Suno?
Mikey [00:19:23]: Yeah, actually more than half of the usage is that expert mode. And people really like to get into it and start tweaking things and adding things and playing with words or line breaks or different ad-libs. And people really love it. It's really fun. So there are kind of two modes that you can access now. One is that single box where you kind of just describe something, and the other is the expert mode. And those fit nicely into two use cases. The first use case is what we call nice s**t posting. And it's basically: something funny happened, and I'm just going to very quickly make a song about it. And the example I'll usually give is, I walk into Starbucks with one of my co-founders; he gives his name, Martin, and his coffee comes out with the name Margoo. And I can, in five seconds, make a song about this, and it has immortalized it. That Margoo song is stuck in all of our heads now. And it's funny and light, and there's levity that you've brought to that moment. And the other use case is that you get sucked in: there's this song that's in my head, and I need to get it out, and I'm going to keep tweaking it and listening and having ideas and tweaking it until I get the song that I want. Those are very different use cases, but I think, ultimately, there's so much in between these two things that is just totally untapped in how people want to experience the joys of making music. Because those two experiences are both really joyful in their own special ways, and so we are quite certain that there's a lot in the middle there. And then I think the last thing I'll say that's really interesting is that, in both of those use cases, the sharing dynamics around music are really interesting and totally unexplored. I think an interesting comparison would be images: we've probably all, in the last 24 hours, taken a picture and texted it to somebody. And most people are not routinely making a little song and texting it to somebody. But when you start to make that more accessible to people, they are going to share music in much smaller groups, maybe even not at all, or with one person, or three people, or five people. And those dynamics are so interesting, and I think we have ideas of where that goes. But it's about kind of spreading joy into these little microcosms of humanity, and people really love it. So, you know, I made you guys a little Valentine song, right? That's not something that happens now, because it's hard to make songs for people. Right.
Alessio [00:22:03]: Well, we'll put that in the audio here, but we also tweeted it out if people want to look it up. How do you think about the pro market, so to speak? Because I think lowering the barrier to some of these things is great. When the iPad came out, music production was one of the areas where people thought, okay, now you can have this board that you can bring with you. And Madlib actually produced a whole album with Freddie Gibbs entirely on an iPad; he never used a computer. How do you see these models playing into professional music generation? I guess "professional music" is also a funny phrase; it's all music. If it's good, it becomes professional. If it's good.
Swyx [00:22:40]: Right.
Alessio [00:22:40]: But curious to hear how you're thinking about Suno, too. Is there a second act of Suno that is going broader into the music industry? Going broader into the custom mode and making this the central hub for music generation?
Mikey [00:22:55]: I think we intend to make many more modes of interaction with our stuff, but we are very much not focused on, quote-unquote, professionals right now. And it's because what we're trying to do is change how most people interact with music, and not necessarily make professionals a little bit better or a little bit faster. It's not that there's anything wrong with that; it's just not what we're focused on. And I think when we think about what workflows the average person would want to use to make music, I don't think they're very similar to the way professional musicians make music now. If you pick a random person on the street, play them a song, and then ask, what did you want to change about that? They're not going to say: you need to split out the snare drum and make it drier. That's just not something that a random person off the street is going to say. They're going to give a lot more descriptive things about the kind of oeuvre of the song, something more general. And so I don't think we know what all the workflows are that people are going to want to use. We're just fairly certain that the workflows that have been developed, with the current set of technologies that professionals use to make beautiful music, are probably not what the average person wants to use. That said, there are lots of professionals that we know about using our stuff, whether it's for inspiration or sample generation and things like that. So I don't want to say never; there may one day be a really interesting set of use cases that we can expose to professionals, particularly around, I think, custom models trained on particular people's music, or, you know, with your voice, or something like that.
But the way we think about broadening how most people interact with music, and getting them to be much more active participants, is that we broaden it from the consumer side, and not from the producer, professional side, if that makes sense.
Swyx [00:24:53]: Is the dream here, and I don't know if it's too coarse a grain to put it this way, but is the dream here to be, like, the Midjourney of music?
Mikey [00:25:04]: I think there are certainly some parallels there, because, especially on what I just said about being an active participant, the joyful experience in Midjourney is the act of creating the image, and not necessarily the act of consuming the image. And Midjourney will let you then very quickly share the image with somebody. But I think, ultimately, that analogy is somewhat limiting, because there's something really special about music. I think there are two things. One is that there's this really big gap, for the average person, between their taste in music and their abilities in music, and that gap is not quite there for most people in images. Most people don't have innate taste in images, I think, in the same way people do for music. And then the other thing, and this is the really big one, is that music is a really social modality. If we all listen to a piece of music together, we're listening to the exact same part at the exact same time. If we all look at the picture in Alessio's background, we're going to look at it for two seconds: I'm going to look at the top left, where it says Thor; Alessio's going to look at the bottom right, or something like that. It's not really synchronous. When we're all listening to a piece of music together, it's minutes long, and we're listening to the same part at the same time. And if you go to the act of making music, it is even more synchronous: the most joyful way to make music is with people. And so I think there is so much more to come there that would ultimately be very hard to do in images.
Alessio [00:26:38]: We've gone almost 30 minutes without making any music on this podcast, so I think maybe we can fix that and jump into a demo.
Mikey [00:26:47]: Yeah, let's make some. We've got a new model that we are kind of putting the finishing touches on, so I can play with it in our dev server; we've just piped it in here. And as you can see, we've been doing tons of stuff. So, all right, tell me what kind of song you guys want to make.
Swyx [00:27:04]: Go on, Alessio.
Alessio [00:27:05]: Uh, let's do a country song about the lack of GPUs in my cloud provider.
Swyx [00:27:22]: And like, yeah. So, here's where we attempted to think about pipelines and think about latency. This is remarkably fast. I was shocked when I saw this.
Swyx [00:27:35]: Oh, my god.
Swyx [00:27:39]: To my cloud, ready to confuse.
Swyx [00:27:45]: But there ain't no GPUs, just empty space. It's a hoot. I've been waiting all day for that render out. But my cloud's gone dry. It's a dark cloud shower. All clouds gone dry. No GPUs to be found. No CUDA cores. It's a lonely sound. I just want to render. But my cloud's got no GPUs.
Mikey [00:28:36]: I actually don't think this one's amazing. I'm going to go to the next one.
Alessio [00:28:39]: But it's funny that it knows about CUDA cores.
Swyx [00:28:45]: Well, I signed up for a cloud provider. Thought I'd find all the power that I could derive.
But when I searched for the GPUs, I just got a surprise. You see, they're all sold out. There ain't no GPUs to find. No GPUs in the cloud. It's a real bad blues. I need the power, but there ain't no use. I'm stuck with my CPU. It's a real sad fight. Gotta wait till the babies start getting bright. There ain't no use in the cloud.
Mikey: What else should we make?
Alessio [00:29:29]: All right, Sean, you're up.
Swyx [00:29:31]: I mean, I do want to make some observations about this. But okay, maybe I like house music, like electronic dance. Yeah. House music. And then maybe we can make it about, I don't know, podcasting about music and music AI generation. I don't know. I'm sure all the demos that you get are very meta.
Mikey [00:29:59]: There's a lot of stuff that's meta, yeah, for sure.
Swyx [00:30:03]: Yeah, I noticed, for example, that the second song that you played had the word "upbeat" inserted into it, which I assume means there's some kind of random generator of modifier terms that you can just kind of throw on to increase the specificity of what's being generated.
Mikey [00:30:21]: Definitely. And let's try to tweak one also. So I'll play this, and then maybe we'll tweak it with different modifiers.
Swyx [00:30:30]: A wave of sound spreading out. Through the air, we're podcasting loud. Sharing the beat, spreading the word. A revolution of frequencies, haven't you heard. Plugged in to now, let the music take control. We're on a journey, a never-ending road. From the beats I drop to the melodies of soul. Podcasting about music forevermore.
Mikey [00:31:05]: Here's what I want to do. That, like, didn't drop at the right time, right? So maybe let's do this. I don't know if you guys can see this. And then let's get rid of the word "now."
Swyx [00:31:17]: Is that a special token? You have a BeatDrop token? Yeah. Nice.
Alessio [00:31:22]: I'm just reading it because people might not be able to see it.
Mikey [00:31:26]: And then let's maybe emphasize... actually, let's emphasize "house" a little more. Maybe it'll feel a little more aggressive. Let's try this again.
Swyx [00:31:34]: It's interesting, the prompt engineering that you have to invent.
Mikey [00:31:39]: We've learned so much from people using the models, and not us.
Swyx [00:31:42]: But, like, are these training artifacts?
Mikey [00:31:45]: No, I don't... I don't think so. I think this is people being inventive with how you want to talk to a model. Yeah.
Swyx [00:31:53]: Spinning round to the air with a podcast loud. Sharing the beat, spreading the word. A revolution of frequencies, haven't you heard. Before the end, till now, let the music take control.
Swyx [00:32:23]: For all the journey I'll never end it wrong. From the beats that drop to the melodies that soar. Podcasting about music for you evermore.
Swyx [00:32:39]: Nice.
Alessio [00:32:46]: It's interesting: when you generate a song, it generates the lyrics, but then if you switch the music under it, the lyrics stay the same. And then sometimes it feels like... I mean, I mostly listen to hip-hop. If you change the beat, you can't really use the same rhyme scheme, you know?
Mikey [00:33:04]: So, definitely.
Alessio [00:33:05]: Yeah.
Mikey [00:33:06]: It's a sliding scale, though, because we could probably do this as a country rock song. Right? That would be my guess. But for hip-hop, that is definitely true. And actually, you know, for these models, we think about three important axes. We think about the sound fidelity:
does this sound like a crisply recorded piece of audio? We think about the song quality: is this an interesting song that gets stuck in my head? And we think about the controllability: how well does it respond to my prompts? And one of the ways that we'll test these things is to take the same lyrics and try to do them in different styles, to see how well that really works. So let's do the same. I don't know what a beat drop is going to do for country rock, so I probably should have taken that out, but let's see what happens.
Swyx [00:34:06]: There's a sound spinning around through the air. We're podcasting loud, sharing the beat, spreading the word, a revolution of frequencies. Haven't you heard?
Swyx [00:34:20]: Plug in, tune out, let the music take control. We're on a journey, a never-ending road. From the beats that talk to the melodies that soar. Podcasting about music forevermore.
Mikey [00:34:44]: I'm going to read too much into this, but I would say I hear a little bit of kind of electronic-music-inspired something in there. And that is probably because "beat drop" is something that you really only ever associate with electronic music. Maybe that's reading too much into it. But should we do one more?
Alessio [00:35:02]: Yes, we can do one more. Something about Apple Vision Pro.
Swyx [00:35:06]: I guess there's some amount of world knowledge that you don't have, right? Whatever is in the language-model side of the equation is not going to have an Apple Vision Pro. Yeah, but let's see.
Swyx [00:35:18]: Let's see.
Mikey [00:35:19]: How about a blues song about a sad AI wearing an Apple Vision Pro? Gotta be sad.
Swyx [00:35:32]: Do you have RAG for music?
Mikey [00:35:36]: No, that would be problematic also.
Swyx [00:35:40]: I'm a sad AI with a broken heart. Where my Apple Vision Pro can't see the stars. I used to feel joy. I used to feel pain. And now I'm just a soul trapped inside this metal frame. Oh, I'm singing the blues. Can't you see?
Swyx [00:36:21]: This digital life ain't what it used to be.
Swyx [00:36:29]: Searching for love, but I can't find a soul.
Swyx [00:36:37]: Won't you help me? Baby, let my spirit unfold.
Mikey [00:36:46]: I want to remix that one. And I want to say, I don't know... that's a really good voice. I want, like, I don't know, Chicago blues.
Swyx [00:36:56]: What is Chicago blues?
Mikey [00:36:58]: I don't know, he knows too much.
Alessio [00:37:00]: He's the best prompt engineer out here.
Mikey [00:37:03]: You know, this is...
Swyx [00:37:04]: Well, it'll be funny. It'd be funny to let musicologists play with this and see what they would do.
Mikey [00:37:09]: How embarrassing. Can I not do that?
Swyx [00:37:13]: Oh. I got... oh, the word "Chicago" was a trigger. I don't know.
Mikey [00:37:19]: We try to be very careful about not letting you impersonate, and it is possible. That's embarrassing. So let's do...
Alessio [00:37:28]: Midwestern.
Swyx [00:37:29]: I'm a...
Swyx [00:37:41]: With a broken heart. Well, my vision can't see the stars.
Swyx [00:37:53]: I used to feel joy.
Swyx [00:37:59]: I used to feel... joy. I used to feel pain. But now I'm just a soul trapped inside this metal frame. Oh, I'm singing.
Swyx [00:38:25]: Oh, can't you see? Oh, this is what it used to be. I'm searching for love.
Swyx [00:38:44]: I can't find a soul.
Swyx [00:38:49]: Oh, help me. Baby.
Mikey [00:38:57]: So, yeah, a lot of control there.
Maybe I'll make one more.
Swyx [00:39:02]: Very, very soulful.
Mikey [00:39:06]: Really want a good house track.
Swyx [00:39:09]: Why is "house" the word that you have to repeat?
Mikey [00:39:11]: I just really want to make sure it's house. Actually, you can't really repeat it too many times; the hypothesis gets, like, a little too out of domain.
Swyx [00:39:22]: I'm a...
Swyx [00:39:25]: With a broken heart. Wearing my Apple Vision Pro, can't see the stars. I used to feel joy. I used to feel pain. Oh, I'm just a soul trapped inside this metal frame. Oh, I'm singing. Oh, can't you see?
Swyx [00:39:59]: Used to be. Searching for love, but I can't find a soul. Oh, help me. Baby.
Swyx [00:40:13]: Oh, nice.
Mikey [00:40:17]: So, yeah, we have a lot of fun.
Swyx [00:40:19]: Definitely easy.
Alessio [00:40:19]: Yeah. Yeah, I'm really curious to see how people are going to use this to resample old songs into new styles. You know, I think that's one of my favorite things about hip-hop. You have so many... I mean, A Tribe Called Quest had the Lou Reed "Walk on the Wild Side" sample on "Can I Kick It?"; Kanye sampled Nina Simone on "Blood on the Leaves." And it's a lot of production work to actually take an old song and make it fit a new beat, and I feel like this can really help. Do you see people putting existing songs' lyrics in and trying to regenerate them in a new style?
Mikey [00:40:56]: We actually don't let you do that. And it's because if you're taking someone else's lyrics, you didn't own those; you don't have the publishing rights to those, so you can't remake that song. I think in the future we'll figure out how to actually let people do that in a legal way. But we are really focused on letting people make new and original music. And I think, you know, there's a lot of music AI which is artist A doing the song of artist B in a new style. You know, let me have Metallica doing "Come Together" by the Beatles, or something like that. And I think this stuff is very viral, but I actually really don't think that this is how people want to interact with music in the future. To me, this feels a lot like when you made a Shakespeare sonnet the first time you saw GPT, and then you made another one, and then you made another one, and then you kind of thought, this is getting old. And that doesn't mean that GPT is not amazing. GPT is amazing. It's just not for that. And I kind of feel like the way people want to use music in the future is not just remaking songs in different people's voices. You lose the connection to the original artist, and you lose the connection to the new artist, because they didn't really do it. So we're very happy to just let the things that are a flash in the pan stay under the radar.
Alessio [00:42:12]: Yeah, no, I think that's a good point overall about AI-generated anything. Because, you know, recently T-Pain did an album of covers, and I think he did a "War Pigs" that people really liked. There was a "Tennessee Whiskey," which you maybe wouldn't expect T-Pain to do. And people like it. But yeah, I agree: you need to be a certain type of artist to really make covers entertaining. This is great. What else is next for Suno?
You know, I think people kind of saw you in stages: first you had Bark, and then there was a big music generation push when you did an announcement, I think a couple of months ago. I think I saw you like 300 times on my Twitter timeline on the same day, so it was going everywhere. What's coming up? What are you most excited about in this space? And maybe, what are some of the most interesting underexplored ideas that you maybe haven't worked on yet?
Mikey [00:43:13]: Gosh, there's a lot. You know, from the model side, it's still really early innings, and there's still so much low-hanging fruit for us to pick to make these models much, much better: much more controllable, much better music, much better audio fidelity. So much that we know about, and so much that, again, we can kind of borrow from the open source transformers community, that should make these just better across the board. From the product side, you know, we're super focused on the experiences that we can bring to people. And so it's so much more than just text-to-music. And I'll say this nicely: I'm a machine learning person, but machine learning people are stupid sometimes, and we can only think about models that take X and make it into Y. That's just not how the average human being thinks about interacting with music. And so I think what we're most excited about is all of the new ways that we can get people much more actively participating in music. And that is making music, not only with text, maybe with other ways of doing stuff; that is making music together. If you want to be reductive and think about this as a video game, this is multiplayer mode, and it is the most fun that you can have with music. And, you know, honestly, it's timely right now. I don't know if you guys have seen, but UMG and TikTok are butting heads a little bit, and UMG has pulled...
Swyx [00:44:40]: Yeah, the music died.
Mikey [00:44:41]: And, you know, the way we think about this is: maybe they're both right, maybe neither is right. Without taking sides, that is kind of figuring out how to divvy up the current pie in the fairest way. And I think what we are super focused on is making that pie much bigger, and increasing how much people are actually interested in music and participating in music. And, you know, as a very broad heuristic, the gaming industry is 50 times bigger than the music industry. And it's because gaming is super active, and too much music is just passive consumption. And so we have a lot of experiments that we are excited to run for the different ways people might want to interact with music, beyond just, you know, streaming it while I work.
Swyx [00:45:28]: Yeah. I think, at minimum, you guys should have a Twitch stream that is just a 24-hour radio session. Have you ever come across Twitch Plays Pokemon?
Mikey [00:45:37]: No.
Swyx [00:45:38]: It's where, basically, everyone in the Twitch chat can vote on the next action the game takes. And they kind of wired that out to a Nintendo emulator and played Pokemon, the whole game through, via the collaborative thing. It sounds like it should be pretty easy for you guys to do that, except for the chaos that might result. But, like, I mean, that's part of the fun.
Mikey [00:46:04]: I agree 100%. Sorry. Yeah.
Like, one of my key projects, or pet projects, is: what does it mean to have a collaborative concert? Maybe where there is no artist and it's just the audience, or maybe there is an artist, but there's a lot of input from the audience. And, you know, if you were going to do that, you would either need an audience full of musicians, or you would need an artist who can really interpret the verbal or nonverbal cues that an audience is giving. But if you can give everybody the means to better articulate the sounds that are in their heads toward the rest of the audience, which is what generative AI basically lets you do, you open up way more interesting ways of having these experiences. And so I think, yeah, the collaborative concert is one of the things I'm most excited about. I don't think it's coming tomorrow, but we have a lot of ideas on what that can look like.
Swyx [00:46:58]: Yeah. I feel like one stage before the collaborative concert is turning Suno into a continuous experience, rather than a start-and-stop motion. I don't know if that makes sense. You know, as someone with a casual interest in DJing: when do we see Suno DJs, right? Where it can continuously segue into the next song, and the next song, and the next song.
Mikey [00:47:24]: I think soon.
Swyx [00:47:25]: And then maybe you can turn it collaborative. You think so? I think so. Okay. Maybe part of your roadmap. You teased your V3 model a little bit; I saw the letters DPO in there. Is that direct preference optimization?
Mikey [00:47:36]: We are playing with all kinds of different ways of making these models do the things that we want them to do. I don't want to talk too many specifics here, but we have lots of different ways of doing stuff like that.
Swyx [00:47:48]: I'm just wondering how you incorporate user feedback, right? You have the classic thumbs-up and thumbs-down buttons, but there are so many dimensions to the music. I didn't get into it, but some of the voices sounded more metallic, and sometimes that's on purpose, sometimes not. Sometimes there are kind of weird pauses in there. I could go in and annotate it if I really cared about it, but I mean, I'm just listening, so I don't. But there's a lot of opportunity.
Mikey [00:48:15]: We are only scratching the surface of figuring out how to do stuff like that. The thumbs up and the thumbs down, and other things like sharing telemetry on plays: all of these are things that, in the future, I think we would be able to leverage to make things amazing. And then I imagine a future where you can have your own model with your own preferences. And the reason that's so cool is that you kind of have control over it, and you can teach it the way you want to. The thing I would liken this to is a music producer working with an artist, giving feedback; this is now a self-contained experience where you have an artist who is infinitely flexible, who is able to respond to the weird feedback that you might give it. We don't have that yet; everybody's playing with the same model. But there's no technological reason why that can't happen in the future.
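(For readers who caught the DPO exchange: direct preference optimization, Rafailov et al. 2023, is the published technique for turning exactly this kind of thumbs-up/thumbs-down pair data into a training signal without a separate reward model. Below is a generic sketch of the loss; Mikey explicitly declined to confirm what Suno actually runs.)

```python
# Generic DPO loss sketch (Rafailov et al., 2023), not Suno's training code.
# pi_logps_* / ref_logps_*: summed log-probs of whole generations under the
# policy and a frozen reference model; *_w = preferred, *_l = rejected.
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    # Implicit reward of a sample: beta * (policy logprob - reference logprob).
    # The loss pushes the preferred sample's implicit reward above the rejected one's.
    logits = beta * ((pi_logps_w - ref_logps_w) - (pi_logps_l - ref_logps_l))
    return -F.logsigmoid(logits).mean()
```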
Return to monkey.Swyx [00:49:33]: Yeah, yeah, yeah.Alessio [00:49:34]: Return to monkey, right.Swyx [00:49:36]: Is there a story behind that? Yeah.Mikey [00:49:37]: No, he just made that song and it just speaks to him. And I think this is exactly the thing that we are trying to tap into, that you can think of it, this is like a super, super, super micro genre of one person who just really liked that song and made it and shared it. And it does not speak to you the same way it speaks to him. That song really spoke to him. And I think that's so beautiful. And that's something that you're never going to have an artist able to do that for you. And now you can do that for yourself. And it's just a different form of experiencing music. I think that's such a lovely use case.Alessio [00:50:12]: Any fun fan mail that you got from musicians or anybody that really was a funny story toSwyx [00:50:20]: share?Mikey [00:50:20]: We get a lot. And it's primarily positive. And I think people kind of, on the whole, I would say people realize that they are not experiencing music in all of the ways that are possible. And it does bring them joy. I'll tell you something that is really heartwarming is that we're fairly popular in the blind and vision impaired community. And that makes us feel really good. And I think, you know, very roughly, without trying to speak for an entire community, you have lots of people who are really into things like Midjourney, and they get a lot of benefit and joy, and sometimes even therapy out of making images. And that is something that is not really accessible to this fairly large community. And what we've provided, no, I don't think the analogy to Midjourney is perfect. But what we've provided is a sonic experience that is very similar. And that speaks to this community. And that is a community with the best ears, the most exacting, the most tuned. And so, yeah, that definitely makes us feel warm and fuzzy inside.Swyx [00:51:23]: Yeah, excellent. I mean, it sounds like there's a lot of exciting stuff on your roadmap. I'm very much looking forward to sort of the infinite DJ mode, because then I can just kind of play that while I work. I would love to get your overall takes, like kind of zooming out from Suno itself, just your overall takes on the music generation landscape. Like, what should people know? I think you obviously have spent a lot more time on this than others. So in my mind, you shout out VALL-E and the other sort of Google-type work in your README in Bark. What should people know about what Google is doing? What Meta is doing? Meta released Seamless recently, and Audiobox. And how do you classify the world of audio generation in the broader sort of research community?Mikey [00:52:13]: I think people largely break things down into three big categories, which is music, speech and sound effects. There's some stuff that is crossover, but I think that is largely how people think about this. The old style of doing things still exists, kind of single purpose models that are built to do a very specific thing instead of kind of the new foundation model approach. I don't know how much longer that will last. I don't have like tremendous visibility into, you know, what happens in the big industrial research labs before they publish. Specifically for music, I would say there's a few big categories that we see. There is license-free stock music. So this is like, how do I get background music for the B-roll footage for my YouTube video or for full feature production or whatever it is.
And there's a bunch of companies in that space. There's a lot of AI covers. So how do I, how do I cover different existing songs with AI? And I think that's a space that is particularly fraught with some legal stuff. And we also just don't think it's necessarily the future of music. There is kind of net new songs, a new way to create net new music. That is the corner that we like to focus on. And I would say the last thing is much more geared toward professional musicians, which is basically AI tools for music production. And you can think many of these will look like plugins to your favorite DAW. Some of them will look like, you know, the greatest stem splitter that the market hasSwyx [00:53:51]: ever seen.Mikey [00:53:52]: The current stem splitters are, the state of the art are all AI based. That is a market also that has just a tremendous amount of room to grow. Somebody told me this recently: if you actually think about it, music has evolved. Recently, it's just much more things that are sonically interesting at a very local level and much less like chord changes that are interesting. And when you think about that, like that is something that AI can definitely help you make a lot of weird sounds. And this is nothing new. There was like a theremin at some point that people like put an antenna and try to do thisSwyx [00:54:25]: with.Mikey [00:54:25]: And so like, I think this is just a very natural extension of it. So that's how, that's how we see it. At least, you know, there's a corner that we think is particularly fulfilling, particularly underserved, and particularly interesting. And that's the one that we play in.Swyx [00:54:40]: Awesome.Alessio [00:54:42]: I know we covered a lot of things. I think before we wrap, you have written a blog post about Goodhart's Law and its impact in ML, which is, you know, when you measure something, then the thing that you measure is not a good metric anymore because people optimize for it. Any thoughts on how that applies to like LLMs and benchmarks and kind of the world we're going into today?Mikey [00:55:05]: Yeah, I mean, I think it's maybe even more apropos than when I originally wrote that, because we see so much noise about pick your favorite benchmark. And this model does slightly better than that model. And then at the end of the day, actually, there is no real world difference between these things. And it is really difficult to define what real world means. And I think to a certain extent, it's good to have these objective benchmarks, it's good to have quantitative metrics. But at the end of the day, you need some acknowledgement that you're not going to be able to captureSwyx [00:55:38]: everything.Mikey [00:55:38]: And so at least at Suno, to the extent that we have corporate values, if we don't, we don't have corporate, we're too small to have corporate values written down. But something that we say a lot is aesthetics matter, that the kind of quantitative benchmarks are never going to be the be all and end all of everything that you care about. And as flawed as these benchmarks are in text, they're way worse in audio. And so aesthetics matter, basically, is a statement that like at the end of the day, what we are trying to do is bring music to people that makes them feel a certain way. And effectively, the only good judge of that is your ears. And so you have to listen to it.
And it is, it is a good idea to try to make better objective benchmarks, but you really have to not fall prey to those things. I can tell you, you know, I have kind of another pet peeve of mine, like I always said, economists will make really good, or do make really good, machine learning engineers. And it's because they are able to think about stuff like Goodhart's Law and natural experiments and stuff like this that people with machine learning backgrounds or people with physics backgrounds like me often forget to do. And so, yeah, I mean, I'll tell you at Kensho, we actually used to go to big econ conferences, sometimes to recruit. And these were some of the best hires we ever made.Swyx [00:57:03]: Interesting, because there's a little bit of social science in the human feedback.Mikey [00:57:09]: I think it's not only the human feedback. I think you could think about this, just in general, you have these like giant, really powerful models that are so prone to overfitting, that are so poorly understood, that are so easy to steer in one direction or another, not only from human feedback. And your ability to think about these problems from first principles, instead of like getting down into the weeds or only math, and to think intuitively about these problems is really, really important. I'll give you like just like one of my favorite examples. It's a little old at this point. But if you guys remember like SQuAD and SQuAD 2, the question answering dataset. The Stanford Question Answering Dataset, yeah. The benchmark for SQuAD 1, eventually the machine learning models start to do as well as a human can on this thing. And it's like, uh-oh, now what do we do? And it takes somebody very clever to say, well, actually, let's think about this for a second. What if we presented the machine with questions with no answer in the passage? And it immediately opens a massive gap between the human and the machine. And I think it's like first principles thinking like that, that comes very naturally to social scientists that does not come as naturally to people like me. And so that's why I like to hang out with people like that.Swyx [00:58:25]: Well, I'm sure you get plenty of that in Boston. And as an econ major myself, it's very gratifying to hear that we have a perspective to contribute. Oh, big time, big time.Mikey [00:58:35]: I try to talk to economists as much as I can.Swyx [00:58:38]: Excellent.Mikey [00:58:38]: Awesome, guys.Alessio [00:58:39]: Yeah, I think this was great. We got live music. We got discussion about generative models. We got the whole nine yards. So thank you so much for coming on.Mikey [00:58:48]: I had great fun. Thank you, guys.Swyx [00:59:05]: Thank you.
-
Top 5 Research Trends + OpenAI Sora, Google Gemini, Groq Math (Jan-Feb 2024 Audio Recap) + Latent Space Anniversary with Lindy.ai, RWKV, Pixee, Julius.ai, Listener Q&A!
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-03-09 22:55
We will be recording a preview of the AI Engineer World’s Fair soon with swyx and Ben Dunphy, send any questions about Speaker CFPs and Sponsor Guides you have!Alessio is now hiring engineers for a new startup he is incubating at Decibel: Ideal candidate is an ex-technical co-founder type (can MVP products end to end, comfortable with ambiguous prod requirements, etc). Reach out to him for more!Thanks for all the love on the Four Wars episode! We’re excited to develop this new “swyx & Alessio rapid-fire thru a bunch of things” format with you, and feedback is welcome. Jan 2024 RecapThe first half of this monthly audio recap pod goes over our highlights from the Jan Recap, which is mainly focused on notable research trends we saw in Jan 2024:Feb 2024 RecapThe second half catches you up on everything that was topical in Feb, including:* OpenAI Sora - does it have a world model? Yann LeCun vs Jim Fan * Google Gemini Pro 1.5 - 1m Long Context, Video Understanding* Groq offering Mixtral at 500 tok/s at $0.27 per million toks (swyx vs dylan math)* The {Gemini | Meta | Copilot} Alignment Crisis (Sydney is back!)* Grimes’ poetic take: Art for no one, by no one* F*** you, show me the promptLatent Space AnniversaryPlease also read Alessio’s longform reflections on One Year of Latent Space!We launched the podcast 1 year ago with Logan from OpenAI:and also held an incredible demo day that got covered in The Information:Over 750k downloads later, having established ourselves as the top AI Engineering podcast, reaching #10 in the US Tech podcast charts, and crossing 1 million unique readers on Substack, for our first anniversary we held Latent Space Final Frontiers, where 10 handpicked teams, including Lindy.ai and Julius.ai, competed for prizes judged by technical AI leaders from (former guest!) LlamaIndex, Replit, GitHub, AMD, Meta, and Lemurian Labs.The winners were Pixee and RWKV (that’s Eugene from our pod!):And finally, your cohosts got cake!We also captured spot interviews with 4 listeners who kindly shared their experience of Latent Space, everywhere from Hungary to Australia to China:* Balázs Némethi* Sylvia Tong* RJ Honicky* Jan ZhengOur birthday wishes for the super loyal fans reading this - tag @latentspacepod on a Tweet or comment on a @LatentSpaceTV video telling us what you liked or learned from a pod that stays with you to this day, and share us with a friend!As always, feedback is welcome. Timestamps* [00:03:02] Top Five LLM Directions* [00:03:33] Direction 1: Long Inference (Planning, Search, AlphaGeometry, Flow Engineering)* [00:11:42] Direction 2: Synthetic Data (WRAP, SPIN)* [00:17:20] Wildcard: Multi-Epoch Training (OLMo, Datablations)* [00:19:43] Direction 3: Alt. Architectures (Mamba, RWKV, RingAttention, Diffusion Transformers)* [00:23:33] Wildcards: Text Diffusion, RALM/Retro* [00:25:00] Direction 4: Mixture of Experts (DeepSeekMoE, Samba-1)* [00:28:26] Wildcard: Model Merging (mergekit)* [00:29:51] Direction 5: Online LLMs (Gemini Pro, Exa)* [00:33:18] OpenAI Sora and why everyone underestimated videogen* [00:36:18] Does Sora have a World Model? 
Yann LeCun vs Jim Fan* [00:42:33] Groq Math* [00:47:37] Analyzing Gemini's 1m Context, Reddit deal, Imagegen politics, Gemma via the Four Wars* [00:55:42] The Alignment Crisis - Gemini, Meta, Sydney is back at Copilot, Grimes' take* [00:58:39] F*** you, show me the prompt* [01:02:43] Send us your suggestions pls* [01:04:50] Latent Space Anniversary* [01:04:50] Lindy.ai - Agent Platform* [01:06:40] RWKV - Beyond Transformers* [01:15:00] Pixee - Automated Security* [01:19:30] Julius AI - Competing with Code Interpreter* [01:25:03] Latent Space Listeners* [01:25:03] Listener 1 - Balázs Némethi (Hungary, Latent Space Paper Club)* [01:27:47] Listener 2 - Sylvia Tong (Sora/Jim Fan/EntreConnect)* [01:31:23] Listener 3 - RJ (Developers building Community & Content)* [01:39:25] Listener 4 - Jan Zheng (Australia, AI UX)Transcript[00:00:00] AI Charlie: Welcome to the Latent Space podcast, weekend edition. This is Charlie, your new AI co host. Happy weekend. As an AI language model, I work the same every day of the week, although I might get lazier towards the end of the year. Just like you. Last month, we released our first monthly recap pod, where Swyx and Alessio gave quick takes on the themes of the month, and we were blown away by your positive response.[00:00:33] AI Charlie: We're delighted to continue our new monthly news recap series for AI engineers. Please feel free to submit questions by joining the Latent Space Discord, or just hit reply when you get the emails from Substack. This month, we're covering the top research directions that offer progress for text LLMs, and then touching on the big Valentine's Day gifts we got from Google, OpenAI, and Meta.[00:00:55] AI Charlie: Watch out and take care.[00:00:57] Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and we're back with a monthly recap with my co host[00:01:06] swyx: Swyx. The reception was very positive for the first one, I think people have requested this and no surprise that I think they want to hear us more opining on issues and maybe drop some alpha along the way. I'm not sure how much alpha we have to drop; this month in February was a very, very heavy month. We also did not do one specifically for January, so I think we're just going to do a two in one, because we're recording this on the first of March.[00:01:29] Alessio: Yeah, let's get to it. I think the last one we did, the four wars of AI, was the main kind of mental framework for people. I think in the January one, we had the five worthwhile directions for state of the art LLMs. Four, five,[00:01:42] swyx: and now we have to do six, right? Yeah.[00:01:46] Alessio: So maybe we just want to run through those, and then do the usual news recap, and we can do[00:01:52] swyx: one each.[00:01:53] swyx: So the context to this stuff is, one, I noticed that just the test of time concept from NeurIPS, and just in general as a life philosophy, I think is a really good idea. Especially in AI, there's news every single day, and after a while you're just like, okay, like, everyone's excited about this thing yesterday, and then now nobody's talking about it.[00:02:13] swyx: So, yeah. It's more important, or a better use of time, to spend time on things that will stand the test of time. And I think for people to have a framework for understanding what will stand the test of time, they should have something like the four wars.
Like, what are the themes that keep coming back, because they are limited resources that everybody's fighting over.[00:02:31] swyx: Whereas this one, I think that the focus for the five directions is just on research that seems more promising than others, because there's all sorts of papers published every single day, and there's no organization telling you, like, this one's more important than the other one apart from, you know, Hacker News votes and Twitter likes and whatever.[00:02:51] swyx: And obviously you want to get in a little bit earlier than something where, you know, the test of time is counted by sort of reference citations.[00:02:59] The Five Research Directions[00:02:59] Alessio: Yeah, let's do it. We got five. Long inference.[00:03:02] swyx: Let's start there. Yeah, yeah. So, just to recap at the top, the five trends that I picked, and obviously if you have some that I did not cover, please suggest something.[00:03:13] swyx: The five are long inference, synthetic data, alternative architectures, mixture of experts, and online LLMs. And something that I think might be a bit controversial is this is a sorted list in the sense that I am not the guy saying that Mamba is like the future and, and so maybe that's controversial.[00:03:31] Direction 1: Long Inference (Planning, Search, AlphaGeometry, Flow Engineering)[00:03:31] swyx: But anyway, so long inference is a thesis I pushed before on the newsletter, in discussing the thesis that, you know, Code Interpreter is GPT 4.5. That was the title of the post. And it's one of many ways in which we can do long inference. You know, long inference also includes chain of thought, like, please think step by step.[00:03:52] swyx: But it also includes flow engineering, which is what Itamar from Codium coined, I think in January, where, basically, instead of stuffing everything in a prompt, you do like sort of multi-turn iterative feedback and chaining of things. In a way, this is a rebranding of what a chain is, what a LangChain is supposed to be.[00:04:15] swyx: I do think that maybe SGLang from LMSYS is a better name. Probably the neatest way of flow engineering I've seen yet, in the sense that everything is a one liner, it's very, very clean code. I highly recommend people look at that. I'm surprised it hasn't caught on more, but I think it will. It's weird that something like a DSPy is more hyped than an SGLang.[00:04:36] swyx: Because it, you know, it maybe obscures the code a little bit more. But both of these are, you know, really good sort of chain-y and long inference type approaches. But basically, the basic fundamental insight is that there are only a few dimensions we can scale LLMs. So, let's say in like 2020, no, let's say in like 2018, 2017, 18, 19, 20, we were realizing that we could scale the number of parameters.[00:05:03] swyx: And we scaled that up to 175 billion parameters for GPT 3. And we did some work on scaling laws, which we also talked about in our talk, so, the Datasets 101 episode, where we're like, okay, like we, we think like the right number is 300 billion tokens to, to train 175 billion parameters, and then DeepMind came along and trained Gopher and Chinchilla and said that, no, no, like, you know, I think we think the optimal.[00:05:28] swyx: compute optimal ratio is 20 tokens per parameter. And now, of course, with Llama and the sort of super-Llama scaling laws, we have 200 times and often 2,000 times tokens to parameters.
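To make the scaling-law arithmetic above concrete, here is a quick back-of-the-envelope sketch. The numbers are the ones quoted in the conversation (175 billion parameters and 300 billion tokens for GPT-3, Chinchilla's roughly 20 tokens per parameter, and Llama-style overtraining), treated as illustrative figures rather than official training configs.

```python
# Back-of-the-envelope tokens-per-parameter ratios quoted in this episode.
# Illustrative figures only, not official training configs.

params_gpt3 = 175e9   # GPT-3 parameter count
tokens_gpt3 = 300e9   # tokens GPT-3 was reportedly trained on
print(f"GPT-3: {tokens_gpt3 / params_gpt3:.1f} tokens per parameter")  # ~1.7

# Chinchilla's compute-optimal heuristic: ~20 tokens per parameter.
print(f"Chinchilla-optimal data for 175B params: {20 * params_gpt3 / 1e12:.1f}T tokens")  # 3.5T

# Llama-style overtraining: e.g. a 7B model on 2T tokens.
params_llama, tokens_llama = 7e9, 2e12
print(f"Llama-style 7B: {tokens_llama / params_llama:.0f} tokens per parameter")  # ~286
```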
So now, instead of scaling parameters, we're scaling data. And fine, we can keep scaling data. But what else can we scale?[00:05:52] swyx: And I think understanding the ability to scale things is crucial to understanding what to pour money and time and effort into, because there's a limit to how much you can scale some things. And I think people don't think about ceilings of things. And so the remaining ceiling of inference is like, okay, like, we have scaled compute, we have scaled data, we have scaled parameters, like, model size, let's just say.[00:06:20] swyx: Like, what else is left? Like, what's the low hanging fruit? And it, and it's, like, blindingly obvious that the remaining low hanging fruit is inference time. So, like, we have scaled training time. We can probably scale more, those things more, but, like, not 10x, not 100x, not 1000x. Like, right now, maybe, like, a good run of a large model is three months.[00:06:40] swyx: We can scale that to three years. But like, can we scale that to 30 years? No, right? Like, it starts to get ridiculous. So it's just the orders of magnitude of scaling. It's just, we're just like running out there. But in terms of the amount of time that we spend inferencing, like everything takes, you know, a few milliseconds, a few hundred milliseconds, depending on whether you're taking it token by token, or, you know, an entire phrase.[00:07:04] swyx: But we can scale that to hours, days, months of inference and see what we get. And I think that's really promising.[00:07:11] Alessio: Yeah, we'll have Mike from BrightWave back on the podcast. But I tried their product and their reports take about 10 minutes to generate instead of like just in real time. I think to me the most interesting thing about long inference is like, you're shifting the cost to the customer depending on how much they care about the end result.[00:07:31] Alessio: If you think about prompt engineering, it's like the first part, right? You can either do a simple prompt and get a simple answer or do a complicated prompt and get a better answer. It's up to you to decide how to do it. Now it's like, hey, instead of like, yeah, training this for three years, I'll still train it for three months and then I'll tell you, you know, I'll teach you how to like make it run for 10 minutes to get a better result.[00:07:52] Alessio: So you're kind of like parallelizing like the improvement of the LLM. Oh yeah, you can even[00:07:57] swyx: parallelize that, yeah, too.[00:07:58] Alessio: So, and I think, you know, for me, especially the work that I do, it's less about, you know, state of the art and the absolute, you know, it's more about state of the art for my application, for my use case.[00:08:09] Alessio: And I think we're getting to the point where like most companies and customers don't really care about state of the art anymore. It's like, I can get this to do a good enough job. You know, I just need to get better. Like, how do I do long inference? You know, like people are not really doing a lot of work in that space, so yeah, excited to see more.[00:08:28] swyx: So then the last point I'll mention here is something I also mentioned as a paper. So all these directions are kind of guided by what happened in January. That was my way of doing a January recap. Which means that if there was nothing significant in that month, I also didn't mention it.
Which is, which I came to regret come February 15th, but in January also, you know, there was also the AlphaGeometry paper, which I kind of put in this sort of long inference bucket, because it solves like, you know, more than 100 step math olympiad geometry problems at a human gold medalist level and that also involves planning, right?[00:08:59] swyx: So like, if you want to scale inference, you can't scale it blindly, because just autoregressive token-by-token generation is only going to get you so far. You need good planning. And I think probably, yeah, what Mike from BrightWave is now doing and what everyone is doing, including maybe what we think Q* might be, is some form of search and planning.[00:09:17] swyx: And it makes sense. Like, you want to spend your inference time wisely. How do you[00:09:22] Alessio: think about plans that work and getting them shared? You know, like, I feel like if you're planning a task, somebody has got in and the models are stochastic. So everybody gets initially different results. Somebody is going to end up generating the best plan to do something, but there's no easy way to like store these plans and then reuse them for most people.[00:09:44] Alessio: You know, like, I'm curious if there's going to be some paper or like some work there on like making it better because, yeah, we don't[00:09:52] swyx: really have. This is your, your pet topic of NPM for[00:09:54] Alessio: Yeah, yeah, NPM, exactly. NPM for, you need NPM for anything, man. You need NPM for skills. You need NPM for planning. Yeah, yeah.[00:10:02] Alessio: You know I think, I mean, obviously the Voyager paper is like the most basic example where like, now their artifact is like the best plan to do a diamond pickaxe in Minecraft. And everybody can just use that. They don't need to come up with it again. Yeah. But there's nothing like that for actually useful[00:10:18] swyx: tasks.[00:10:19] swyx: For plans, I believe it for skills. I like that. Basically, that just means a bunch of integration tooling. You know, GPT built me integrations to all these things. And, you know, I just came from an integrations heavy business and I could definitely, I definitely propose some version of that. And it's just, you know, hard to execute or expensive to execute.[00:10:38] swyx: But for planning, I do think that everyone lives in slightly different worlds. They have slightly different needs. And they definitely want some, you know... And I think that that will probably be the main hurdle for any, any sort of library or package manager for planning. But there should be a meta plan of how to plan.[00:10:57] swyx: And maybe you can adopt that. And I think a lot of people when they have sort of these meta prompting strategies of like, I'm not prescribing you the prompt. I'm just saying that here are the, like, fill-in-the-lines or like the Mad Libs of how-to prompts. First you have the roleplay, then you have the intention, then you have like do something, then you have the don't something and then you have the my grandmother is dying, please do this.[00:11:19] swyx: So the meta plan you could, you could take off the shelf and test a bunch of them at once. I like that. That was the initial, maybe, promise of the, the prompting libraries. You know, both LangChain and LlamaIndex have, like, hubs that you can sort of pull off the shelf.
I don't think they're very successful because people like to write their own.[00:11:36] swyx: Yeah,[00:11:37] Direction 2: Synthetic Data (WRAP, SPIN)[00:11:37] Alessio: yeah, yeah. Yeah, that's a good segue into the next one, which is synthetic[00:11:41] swyx: data. Synthetic data is so hot. Yeah, and, you know, the way, you know, I think I, I feel like I should do one of these memes where it's like, Oh, like I used to call it, you know, RLAIF, and now I call it synthetic data, and then people are interested.[00:11:54] swyx: But there's gotta be older versions of what synthetic data really is because I'm sure, you know, if you've been in this field long enough, there's just different buzzwords that the industry converges on. Anyway, the insight that I think is relatively new, that why people are excited about it now and why it's promising now, is that we have evidence that shows that LLMs can generate data to improve themselves with no teacher LLM.[00:12:22] swyx: For all of 2023, when people say synthetic data, they really kind of mean generate a whole bunch of data from GPT 4 and then train an open source model on it. Hello to our friends at Nous Research. That's what Nous Hermes is. They're very, very open about that. I think they have said that they're trying to migrate away from that.[00:12:40] swyx: But it is explicitly against OpenAI Terms of Service. Everyone knows this. You know, especially once ByteDance got banned for, for doing exactly that. So, so synthetic data that is not a form of model distillation is the hot thing right now, that you can bootstrap better LLM performance from the same LLM, which is very interesting.[00:13:03] swyx: A variant of this is RLAIF, where you have a, where you have a sort of a constitutional model, or, you know, some, some kind of judge model that is sort of more aligned. But that's not really what we're talking about when most people talk about synthetic data. Synthetic data is just really, I think, you know, generating more data in some way.[00:13:23] swyx: A lot of people, I think we talked about this with Vipul from the Together episode, where I think he commented that you just have to have a good world model. Or a good sort of inductive bias or whatever that, you know, term of art is. And that is strongest in math and science and code, where you can verify what's right and what's wrong.[00:13:44] swyx: And so the ReST-EM paper from DeepMind explored that very well. It's just the most obvious thing: once you get out of that domain of, like, things where you can arbitrarily generate a whole bunch of stuff and verify if it's correct, and therefore it's correct synthetic data to train on, once you get into more sort of fuzzy topics, then it's, then it's a bit less clear. So I think that the papers that drove this understanding, there are two big ones and then one smaller one. One was WRAP, like rephrasing the web, from Apple, where they basically rephrased all of the C4 dataset with Mistral and trained on that instead of C4. And so new C4 trained much faster and cheaper than old C4, than regular raw C4. And that was very interesting. And I have told some friends of ours that they should just throw out their own existing data sets and just do that because that seems like a pure win. Obviously we have to study, like, what the trade offs are. I, I imagine there are trade offs. So I was just thinking about this last night.
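For readers who want a feel for the WRAP-style recipe swyx describes, here is a minimal sketch of the rephrasing loop. The prompt wording and the `generate` function are hypothetical placeholders for whatever inference setup you use, not Apple's actual pipeline.

```python
# Minimal sketch of a WRAP-style "rephrase the web" data pipeline.
# generate() is a hypothetical placeholder; plug in your own inference client.

REPHRASE_PROMPT = (
    "Rewrite the following web text as clear, high-quality prose, "
    "preserving every fact:\n\n{doc}"
)

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call (e.g. a hosted Mistral endpoint)."""
    raise NotImplementedError("swap in your inference client here")

def rephrase_corpus(raw_docs):
    """Yield rephrased training documents alongside the raw originals.

    WRAP-style training mixes rephrased and raw text, so we keep both.
    """
    for doc in raw_docs:
        yield {"raw": doc, "rephrased": generate(REPHRASE_PROMPT.format(doc=doc))}
```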
If you do synthetic data and it's generated from a model, probably you will not train on typos. So therefore you'll be like, once the model that's trained on synthetic data encounters the first typo, they'll be like, what is this?[00:15:01] swyx: I've never seen this before. So they have no association or correction as to like, oh, these tokens are often typos of each other, therefore they should be kind of similar. I don't know. That really remains to be seen, I think. I don't think that the Apple people explored[00:15:15] Alessio: that. Yeah, isn't that the whole mode collapse thing, if we do more and more of this at the end of the day.[00:15:22] swyx: Yeah, that's one form of that. Yeah, exactly. Microsoft also had a good paper on text embeddings. And then I think this is a Meta paper on self-rewarding language models that everyone is very interested in. Another paper was also SPIN. These are all things we covered in the Latent Space Paper Club.[00:15:37] swyx: But also, you know, I just kind of recommend those as top reads of the month. Yeah, I don't know if there's much else in terms, so and then, regarding the potential of it, I think it's high potential because, one, it solves one of the data war issues that we have, like, everyone is, OpenAI is paying Reddit 60 million dollars a year for their user generated data.[00:15:56] swyx: Google, right?[00:15:57] Alessio: Not OpenAI.[00:15:59] swyx: Is it Google? I don't[00:16:00] Alessio: know. Well, somebody's paying them 60 million, that's[00:16:04] swyx: for sure. Yes, that is, yeah, yeah, and then I think it's maybe not confirmed who. But yeah, it is Google. Oh my god, that's interesting. Okay, because everyone was saying, like, because Sam Altman owns 5 percent of Reddit, which is apparently 500 million worth of Reddit, he owns more than, like, the founders.[00:16:21] Alessio: Not enough to get the data,[00:16:22] swyx: I guess. So it's surprising that it would go to Google instead of OpenAI, but whatever. Okay yeah, so I think that's all super interesting in the data field. I think it's high potential because we have evidence that it works. There's not a doubt about whether it works. I think the doubt is what the ceiling is, which is the mode collapse thing.[00:16:42] swyx: If it turns out that the ceiling is pretty close, then this will maybe augment our data by like, I don't know, 30 to 50 percent. Good, but not game[00:16:51] Alessio: changing. And most of the synthetic data stuff, it's reinforcement learning on a pre trained model. People are not really doing pre training on fully synthetic data, like, at large enough scale.[00:17:02] swyx: Yeah, unless one of our friends that we've talked to succeeds. Yeah, yeah. Pre-trained synthetic data, pre-training-scale synthetic data, I think that would be a big step. Yeah. And then there's a wildcard, so all of these, like smaller directions,[00:17:15] Wildcard: Multi-Epoch Training (OLMo, Datablations)[00:17:15] swyx: I always put a wildcard in there. And one of the wildcards is, okay, like, let's say you have, you've scraped all the data on the internet that you think is useful.[00:17:25] swyx: Seems to top out at somewhere between 2 trillion to 3 trillion tokens. Maybe 8 trillion if Mistral, Mistral gets lucky. Okay, if I need 80 trillion, if I need 100 trillion, where do I go? And so, you can do synthetic data maybe, but maybe that only gets you to like 30, 40 trillion.
Like where, where is the extra alpha?[00:17:43] swyx: And maybe extra alpha is just train more on the same tokens. Which is exactly what OLMo did. Like, Nathan Lambert, AI2, after, just after he did the interview with us, they released OLMo. So, it's unfortunate that we didn't get to talk much about it. But OLMo actually started doing 1.5 epochs on every, on all data.[00:18:00] swyx: And the data ablations paper that I covered at NeurIPS says that, you know, you don't like, don't really start to tap out of like, the alpha or the sort of improved loss that you get from data all the way until four epochs. And so I'm just like, okay, like, why do we all agree that one epoch is all you need?[00:18:17] swyx: It seems to be a trend. It seems that we think that memorization is very good or too good. But then also we're finding that, you know, for improvement in results that we really like, we're fine with overtraining on things intentionally. So, I think that's an interesting direction that I don't see people exploring enough.[00:18:36] swyx: And the more I see papers coming out stretching beyond the one epoch thing, the more people are like, it's completely fine. And actually, the only reason we stopped is because we ran out of compute[00:18:46] Alessio: budget. Yeah, I think that's the biggest thing, right?[00:18:51] swyx: Like, that's not a valid reason, that's not science. I[00:18:54] Alessio: wonder if, you know, Meta is going to do it.[00:18:57] Alessio: I heard Llama 3, they want to do a 100 billion parameter model. I don't think you can train that on too many epochs, even with their compute budget, but yeah. They're the only ones that can save us, because even if OpenAI is doing this, they're not going to tell us, you know. Same with DeepMind.[00:19:14] swyx: Yeah, and so the updates that we got on Llama 3 so far is apparently that because of the Gemini news that we'll talk about later they're pushing it back on the release.[00:19:21] swyx: They already have it. And they're just pushing it back to do more safety testing. Politics testing.[00:19:28] Alessio: Well, our episode with Soumith will have already come out by the time this comes out, I think. So people will get the inside story on how they actually allocate the compute.[00:19:38] Direction 3: Alt. Architectures (Mamba, RWKV, RingAttention, Diffusion Transformers)[00:19:38] Alessio: Alternative architectures. Well, shout out to RWKV, who won one of the prizes at our Final Frontiers event last week.[00:19:47] Alessio: We talked about Mamba and StripedHyena on the Together episode. A lot of, yeah, Monarch Mixer. I feel like Together, it's like the strong Stanford Hazy Research partnership, because Chris Ré is one of the co-founders. So they kind of have a, I feel like they're going to be the ones that have one of the state of the art models alongside maybe RWKV.[00:20:08] Alessio: I haven't seen as many independent people working on this thing, like Monarch Mixer, yeah, Mamba, Hyena, all of these are Together related. Nobody understands the math. They got all the gigabrains, they got Tri Dao, they got all these folks in there, like, working on all of this.
I was gonna say, maybe it's not here, but I don't know if we want to talk about diffusion transformers as like in the alt architectures, just because of Sora.[00:20:55] swyx: One thing, yeah, so, so, you know, this came from the Jan recap, which, and diffusion transformers were not really a discussion, and then, obviously, they blow up in February. Yeah. I don't think they're, it's a mixed architecture in the same way that StripedHyena is mixed; there's just different layers taking different approaches.[00:21:13] swyx: Also I think another one that I maybe didn't call out here, I think because it happened in February, was Hourglass Diffusion from Stability. But also, you know, another form of mixed architecture. So I guess that is interesting. I don't have much commentary on that, I just think, like, we will try to evolve these things, and maybe one of these architectures will stick and scale, it seems like diffusion transformers is going to be good for anything generative, you know, multi modal.[00:21:41] swyx: We don't see anything where diffusion is applied to text yet, and that's the wild card for this category. Yeah, I mean, I think I still hold out hope for, let's just call it, sub-quadratic LLMs. I think that a lot of discussion this month actually was also centered around this concept that people always say, oh, like, transformers don't scale because attention is quadratic in the sequence length.[00:22:04] swyx: Yeah, but, you know, attention actually is a very small part of the actual compute that is being spent, especially in inference. And this is the reason why, you know, when you multiply, when you, when you, when you jump up in terms of the, the context size in GPT 4 from like, you know, 8k to like 32k, you don't also get like a 16 times increase in your, in your latency.[00:22:23] swyx: And this is also why you don't get like a million times increase in your, in your latency when you throw a million tokens into Gemini. Like people have figured out tricks around it or it's just not that significant as a term, as a part of the overall compute. So there's a lot of challenges to this thing working.[00:22:43] swyx: It's really interesting how like, how hyped people are about this versus, I don't know if it works, you know, if it's exactly gonna, gonna work. And then there's also this, this idea of retention over long context. Like, even though you have context utilization, like, the amount of, the amount you can remember is interesting.[00:23:02] swyx: Because I've had people criticize both Mamba and RWKV because they're kind of, like, RNN-ish in the sense that they have, like, a hidden memory and sort of limited hidden memory that they will forget things. So, for all these reasons, Gemini 1.5, which we still haven't covered, is very interesting because Gemini magically has fixed all these problems with perfect haystack recall and reasonable latency and cost.[00:23:29] Wildcards: Text Diffusion, RALM/Retro[00:23:29] swyx: So that's super interesting. So the wildcard I put in here if you want to go to that. I put two actually. One is text diffusion. I think I'm still very influenced by my meeting with a Midjourney person who said they were working on text diffusion. I think it would be a very, very different paradigm for, for text generation, reasoning, plan generation if we can get diffusion to work.[00:23:51] swyx: For text.
And then the second one is Douwe Kiela's Contextual AI, which is working on retrieval augmented language models, where it kind of puts RAG inside of the language model instead of outside.[00:24:02] Alessio: Yeah, there's a paper called Retro that covers some of this. I think that's an interesting thing. I think the, the challenge, well not the challenge, what they need to figure out is like how do you keep the RAG piece always up to date constantly, you know, I feel like the models, you put all this work into pre training them, but then at least you have a fixed artifact.[00:24:22] Alessio: These architectures are like constant work needs to be done on them and they can drift even just based on the RAG data instead of the model itself. Yeah,[00:24:30] swyx: I was in a panel with one of the investors in Contextual and the guy, the way that guy pitched it, I didn't agree with. He was like, this will solve hallucination.[00:24:38] Alessio: That's what everybody says. We solve[00:24:40] swyx: hallucination. I'm like, no, you reduce it. It cannot,[00:24:44] Alessio: if you solved it, the model wouldn't exist, right? It would just be plain text. It wouldn't be a generative model. Cool. So, alt architectures, then we got mixture of experts. I think we covered it a lot of, a lot of times.[00:24:56] Direction 4: Mixture of Experts (DeepSeekMoE, Samba-1)[00:24:56] Alessio: Maybe any new interesting threads you want to go under here?[00:25:00] swyx: DeepSeekMoE, which was released in January. Everyone who is interested in MoEs should read that paper, because it's significant for two reasons. No, three reasons. One, it had, it had small experts, like a lot more small experts. So, for some reason, everyone has settled on eight experts for GPT 4, for Mixtral, you know, that seems to be the favorite architecture, but these guys pushed it to 64 experts, and each of them smaller than the other.[00:25:26] swyx: But then they also had the second idea, which is that they had one to two always-on experts for common knowledge, and that's like a very compelling concept, that you would not route to all the experts all the time and make them, you know, switch to everything. You would have some always-on experts.[00:25:41] swyx: I think that's interesting on both the inference side and the training side for, for memory retention. And yeah, the results that they published, which actually excluded Mixtral, which is interesting. The results that they published showed a significant performance jump versus all the other sort of open source models at the same parameter count.[00:26:01] swyx: So like this may be a better way to do MoEs that is about to get picked up. And so that, that is interesting for the third reason, which is this is the first time a new idea from China has infiltrated the West. It's usually the other way around. I probably overspoke there. There's probably lots more ideas that I'm not aware of.[00:26:18] swyx: Maybe in the embedding space. But I think DeepSeekMoE, like, woke people up and said, like, hey, DeepSeek, this, like, weird lab that is attached to a Chinese hedge fund is somehow, you know, doing groundbreaking research on MoEs. So, so, I classified this as a medium potential because I think that it is a sort of like a one off benefit.[00:26:37] swyx: You can add it to any, any base model to like make the MoE version of it, you get a bump and then that's it.
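For the mechanically curious, here is a minimal PyTorch sketch of the routing idea described above: many small routed experts plus one or two always-on shared experts whose output is added for every token. The dimensions, expert counts, and top-k value are illustrative assumptions, not DeepSeek's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEWithSharedExperts(nn.Module):
    """Sketch of DeepSeekMoE-style routing: top-k routed experts plus always-on shared experts."""

    def __init__(self, d_model=512, d_ff=1024, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        out = sum(e(x) for e in self.shared)       # shared experts see every token
        gates = F.softmax(self.router(x), dim=-1)  # routing weights over routed experts
        topv, topi = gates.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize the kept gates
        for slot in range(self.top_k):             # simple loop: clear, not fast
            for eid in topi[:, slot].unique():
                mask = topi[:, slot] == eid        # tokens routed to expert eid in this slot
                out[mask] += topv[mask, slot, None] * self.routed[int(eid)](x[mask])
        return out

# usage: y = MoEWithSharedExperts()(torch.randn(8, 512))  # y has shape (8, 512)
```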
So, yeah,[00:26:45] Alessio: I saw SambaNova, which is like another inference company. They released this MoE model called Samba-1, which is like 1 trillion parameters. But it's actually a MoE of open source models.[00:26:56] Alessio: So it's like, they just, they just clustered them all together. So I think people sometimes think MoE is like, you just train a bunch of small models or like smaller models and put them together. But there's also people just taking, you know, Mistral plus CLIP plus, you know, DeepSeek Coder and like putting them all together.[00:27:15] Alessio: And then you have a MoE model. I don't know. I haven't tried the model, so I don't know how good it is. But it seems interesting that you can then have people working separately on state of the art, you know, CLIP, state of the art text generation. And then you have a MoE architecture that brings them all together.[00:27:31] swyx: I'm thrown off by your addition of the word CLIP in there. Is that what? Yeah, that's[00:27:35] Alessio: what they said. Yeah, yeah. Okay. That's what they, I just saw it yesterday. I was also like[00:27:40] swyx: scratching my head. And they did not use the word adapter. No. Because usually what people mean when they say, oh, I add CLIP to a language model, is adapter.[00:27:48] swyx: Let me look up the... Which is what LLaVA did.[00:27:50] Alessio: The announcement again.[00:27:51] swyx: Stable diffusion. That's what they do. Yeah, it[00:27:54] Alessio: says among the models that are part of Samba-1 are Llama 2, Mistral, DeepSeek Coder, Falcon, DePlot, CLIP, LLaVA. So they're just taking all these models and putting them in a MoE. Okay,[00:28:05] swyx: so a routing layer and then not jointly trained as much as a normal MoE would be.[00:28:12] swyx: Which is okay.[00:28:13] Alessio: That's all they say. There's no paper, you know, so it's like, I'm just reading the article, but I'm interested to see how[00:28:20] Wildcard: Model Merging (mergekit)[00:28:20] swyx: it works. Yeah, so, so the wildcard for this section, the MoE section, is model merges, which has also come up as, as a very interesting phenomenon. The last time I talked to Jeremy Howard at the Ollama meetup we called it model grafting or model stacking.[00:28:35] swyx: But I think the term that people are liking these days is model merging. There's all different variations of merging, merge types, and some of them are stacking, some of them are, are grafting. And, and so like, some people are approaching model merging in the way that Samba is doing, which is like, okay, here are defined models, each of which has their specific plus and minuses, and we will merge them together in the hope that the, you know, the sum of the parts will, will be better than others.[00:28:58] swyx: And it seems like, it seems like it's working. I don't really understand why it works apart from, like, I think it's a form of regularization. That if you merge weights together in like a smart strategy you, you get less overfitting and more generalization, which is good for benchmarks, if you, if you're honest about your benchmarks.[00:29:16] swyx: So this is really interesting and good. But again, they're kind of limited in terms of like the amount of bumps you can get. But I think it's very interesting in the sense of how cheap it is. We talked about this on the ChinaTalk podcast, like the guest podcast that we did with ChinaTalk.
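Before the conversation moves on, here is what the simplest possible merge looks like: uniform weight averaging across checkpoints that share an architecture. Actual merge methods in tools like mergekit (SLERP, TIES, and friends) are more involved, so treat this as the toy case being gestured at; the file paths in the usage note are hypothetical.

```python
import torch

def average_merge(state_dicts):
    """Uniformly average checkpoints with identical architectures (a 'model soup').

    Simplified sketch: casts everything to float, so integer buffers would need special-casing.
    """
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# usage (hypothetical paths):
# sds = [torch.load(p, map_location="cpu") for p in ["a.pt", "b.pt", "c.pt"]]
# model.load_state_dict(average_merge(sds))
```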
And you can do this without GPUs, because it's just adding weights together, and dividing things, and doing like simple math, which is really interesting for the GPU poors.[00:29:42] Alessio: There's a lot of them.[00:29:44] Direction 5: Online LLMs (Gemini Pro, Exa)[00:29:44] Alessio: And just to wrap these up, online LLMs? Yeah,[00:29:48] swyx: I think that I, I had to feature this because one of the top news of January was that Gemini Pro beat GPT-4 Turbo on LMSYS for the number two slot to GPT-4. And everyone was very surprised. Like, how does Gemini do that?[00:30:06] swyx: Surprise, surprise, they added Google Search, mm-hmm, to the results. So it became a, quote unquote, online LLM and not an offline LLM. Therefore, it's much better at answering recent questions, which people like. There's an emerging set of table stakes features after you pre train something.[00:30:21] swyx: So after you pre train something, you should have the chat tuned version of it, or the instruct tuned version of it, however you choose to call it. You should have the JSON and function calling version of it. Structured output, the term that you don't like. You should have the online version of it. These are all like table stakes variants, that you should do when you offer a base LLM, or you train a base LLM.[00:30:44] swyx: And I think online is just like, there, it's important. I think companies like Perplexity, and even Exa, formerly Metaphor, you know, are rising to offer those search needs. And it's kind of like, they're just necessary parts of a system. When you have RAG for internal knowledge, and then you have, you know, online search for external knowledge, like things that you don't know yet?[00:31:06] swyx: Mm-hmm. And it seems like it's, it's one of many tools. I feel like I may be underestimating this, but I'm just gonna put it out there that I, I think it has some, some potential. One of the evidence points that it doesn't actually matter that much is that Perplexity has a, has had online LLMs for three months now and it performs, doesn't perform great.[00:31:25] swyx: Mm-hmm. On, on LMSYS, it's like number 30 or something. So it's like, okay. You know, like, it's, it helps, but it doesn't give you a giant, giant boost. I[00:31:34] Alessio: feel like a lot of stuff I do with LLMs doesn't need to be online. So I'm always wondering, again, going back to like state of the art, right? It's like state of the art for who and for what.[00:31:45] Alessio: It's really, I think online LLMs are going to be state of the art for, you know, news related activity that you need to do. Like, you're like, you know, social media, right? It's like, you want to have all the latest stuff, but coding, science,[00:32:01] swyx: Yeah, but I think sometimes you don't know what is news, what is news affecting.[00:32:07] swyx: Like, the decision to use an offline LLM is already a decision that you might not be consciously making that might affect your results. Like, what if, like, just putting things on, being connected online means that you get to invalidate your knowledge. And when you're just using an offline LLM, it's never invalidated.[00:32:27] swyx: I[00:32:28] Alessio: agree, but I think going back to your point of like standing the test of time, I think sometimes you can get swayed by the online stuff, which is like, hey, you ask a question about, yeah, maybe AI research direction, you know, and it's like, all the recent news are about this thing.
So the LLM like focuses on answering, bringing it up, you know, these things.[00:32:50] swyx: Yeah, so yeah, I think, I think it's interesting, but I don't know if I can, I bet heavily on this.[00:32:56] Alessio: Cool. Was there one that you forgot to put, or, or like a, a new direction? Yeah,[00:33:01] swyx: so, so this brings us into sort of February-ish.[00:33:05] OpenAI Sora and why everyone underestimated videogen[00:33:05] swyx: So like I published this, and then like the 15th came with Sora. And so like the one thing I did not mention here was anything about multimodality.[00:33:16] swyx: Right. And I have chronically underweighted this. I always wrestle. And, and my cop out is that I focused this piece, or this research direction piece, on LLMs because LLMs are the source of like AGI, quote unquote AGI. Everything else is kind of like, you know, related to that, like, generative, like, just because I can generate better images or generate better videos, it feels like it's not on the critical path to AGI, which is something that Nat Friedman also observed, like, the day before Sora, which is kind of interesting.[00:33:49] swyx: And so I was just kind of like trying to focus on like what is going to get us like superhuman reasoning that we can rely on to build agents that automate our lives and blah, blah, blah, you know, give us this utopian future. But I do think that I, everybody underestimated the, the sheer importance and cultural human impact of Sora.[00:34:10] swyx: And you know, really actually good text to video. Yeah. Yeah.[00:34:14] Alessio: And I saw Jim Fan had a, a very good tweet about why it's so impressive. And I think when you have somebody leading the embodied research at NVIDIA and he says that something is impressive, you should probably listen. So yeah, there's basically like, I think you, you mentioned like impacting the world, you know, that we live in.[00:34:33] Alessio: I think that's kind of like the key, right? It's like the LLMs don't have a world model, and Yann LeCun, he can come on the podcast and talk all about what he thinks of that. But I think Sora was like the first time where people were like, oh, okay, you're not statically putting pixels of water on the screen, which you can kind of like, you know, project without understanding the physics of it.[00:34:57] Alessio: Now you're like, you have to understand how the water splashes when you have things. And even if you just learned it by watching video and not by actually studying the physics, you still know it, you know, so I, I think that's like a direction that, yeah, before you didn't have, but now you can do things that you couldn't before, both in terms of generating, I think it always starts with generating, right?[00:35:19] Alessio: But like the interesting part is like understanding it. You know, it's like if you gave it, you know, there's the video of like the, the ship in the water that they generated with Sora, like if you gave it the video back and now it could tell you why the ship is like too rocky or like it could tell you why the ship is sinking, then that's like, you know, AGI for like all your rig deployments and like all this stuff, you know, so, but there's none, there's none of that yet, so.[00:35:44] Alessio: Hopefully they announce it and talk more about it. Maybe a Dev Day this year, who knows.[00:35:49] swyx: Yeah who knows, who knows. I'm talking with them about Dev Day as well.
So I would say, like, the phrasing that Jim used, which resonated with me, he kind of called it a data driven world model. I somewhat agree with that.[00:36:04] Does Sora have a World Model? Yann LeCun vs Jim Fan[00:36:04] swyx: I am on more of a Yann LeCun side than I am on Jim's side, in the sense that I think that is the vision or the hope that these things can build world models. But you know, clearly even at the current Sora size, they don't have the idea of, you know, they don't have strong consistency yet. They have very good consistency, but fingers and arms and legs will appear and disappear and chairs will appear and disappear.[00:36:31] swyx: That definitely breaks physics. And it also makes me think about how we do deep learning versus world models in the sense of, you know, in classic machine learning, when you have too many parameters, you will overfit, and actually that fails, that like, does not match reality, and therefore fails to generalize well.[00:36:50] swyx: And like, what scale of data do we need in order to world, learn world models from video? A lot. Yeah. So, so I, I am cautious about taking this interpretation too literally, obviously, you know, like, I get what he's going for, and he's like, obviously partially right, obviously, like, transformers and, and, you know, these, like, these neural networks are universal function approximators that theoretically could figure out world models, it's just like, how good are they, and how tolerant are we of hallucinations, we're not very tolerant, like, yeah, so it's, it's gonna bias us for creating like very convincing things, but then not create like the useful world models that we want.[00:37:37] swyx: At the same time, what you just said, I think made me reflect a little bit, like we just got done saying how important synthetic data is for, mm-hmm, for training LLMs. And so like, if this is a way of, of synthetic, you know, video data for improving our video understanding, then sure, by all means. Which we actually know, like, GPT-4 Vision and DALL-E were trained, kind of, co-trained together.[00:38:02] swyx: And so, like, maybe this is on the critical path, and I just don't fully see the full picture yet.[00:38:08] Alessio: Yeah, I don't know. I think there's a lot of interesting stuff. It's like, imagine you go back, you have Sora, you go back in time, and Newton didn't figure out gravity yet. Would Sora help you figure it out?[00:38:21] Alessio: Because you start saying, okay, a man standing under a tree with, like, apples falling, and it's like, oh, they're always falling at the same speed in the video. Why is that? I feel like sometimes these engines can like pick up things, like humans have a lot of intuition, but if you ask the average person, like the physics of like a fluid in a boat, they wouldn't be able to tell you the physics, but they can like observe it, but humans can only observe this much, you know, versus like now you have these models to observe everything and then they generalize these things and maybe we can learn new things through the generalization that they pick up.
I think we have a lot of work to do to formalize the science.[00:39:11] swyx: And then I think the last part is, you know, how much do we cheat by generating data from Unreal Engine 5? Mm hmm. Which is what a lot of people are speculating, with very, very limited evidence, that OpenAI did. The strongest evidence that I saw was someone who works a lot with Unreal Engine 5 looking at the side characters in the videos and noticing that they all adopt Unreal Engine defaults[00:39:37] swyx: of, like, walking speed and, like, character creation choice. And I was like, okay, that's actually pretty convincing that they actually used Unreal Engine to bootstrap some synthetic data for this training set. Yeah,[00:39:52] Alessio: could very well be.[00:39:54] swyx: Because then you get the labels and the training side by side.[00:39:58] swyx: One thing that came up on the last day of February, which I should also mention, is EMO coming out of Alibaba, which is also a sort of video generation and space-time transformer that also involves probably a lot of synthetic data as well. And so this is of a kind, in the sense of, oh, really good generative video is here, and it is not just the one-to-two-second clips that we saw from other people, like Pika and Runway. Cristóbal Valenzuela from Runway was like, "game on," which, okay, but let's see your response, because we've heard a lot about Gen-1 and Gen-2, but it's nothing on this level of Sora. So it remains to be seen how we can actually apply this, but I do think that the creative industry should start preparing.[00:40:50] swyx: I think the Sora technical blog post from OpenAI was really good. It was like a request for startups. It was so good in spelling out: here are the individual industries that this can impact.[00:41:00] swyx: And anyone who's interested in generative video should look at that. But also be mindful that probably when OpenAI releases a Sora API, the ways you can interact with it are going to be very limited — just like the ways you can interact with DALL·E are very limited — and someone is gonna have to make an open Sora[00:41:19] swyx: for you to create ComfyUI pipelines.[00:41:24] Alessio: The Stability folks said they wanna build an open Sora competitor, but yeah — Stability, their demo video was so underwhelming. It was just two people sitting on the beach[00:41:34] swyx: standing. Well, they don't have it yet, right? Yeah, yeah.[00:41:36] swyx: I mean, they just wanna train it. Everybody wants to, right? Yeah. I think what is confusing a lot of people about Stability is they're pushing a lot of things — Stable Code, StableLM, and Stable Video Diffusion. But, like, how much money do they have left? How many people do they have left?[00:41:51] swyx: Yeah. Emad spent two hours with me, reassuring me that things are great. And I do believe that they have really, really quality people. But I also have a lot of very smart people on the other side telling me, like, hey man, don't put too much faith in this thing.[00:42:11] swyx: So I don't know who to believe. Yeah.[00:42:14] Alessio: It's hard. Let's see. What else? We got a lot more stuff. I don't know if we can.
Yeah, Groq.[00:42:19] Groq Math[00:42:19] Alessio: We can[00:42:19] swyx: do a bit of Groq prep. We're about to go talk to Dylan Patel. Maybe, maybe it's the audio in here. I don't know. It depends what we get up to later. What do you, as an investor, think about Groq? Yeah. Yeah, well, actually, can you recap, like, why is Groq interesting? So,[00:42:33] Alessio: Jonathan Ross, who's the founder of Groq, he's the person that created the TPU at Google. It was actually one of his, like, 20 percent projects. He was just on the side, dooby doo, created the TPU.[00:42:46] Alessio: But yeah, basically Groq, they had this demo that went viral, where they were running Mistral at, like, 500 tokens a second, which is, like, the fastest of anything that you have out there. The question — you know, the memes were like, is NVIDIA dead? Like, people don't need H100s anymore. I think there's a lot of money that goes into building what Groq has built as far as the hardware goes.[00:43:11] Alessio: We're gonna put some of the notes from Dylan in here, but basically the cost of the Groq system is like 30 times the cost of the H100 equivalent. So, so[00:43:23] swyx: let me — I put some numbers in, because me and Dylan are, I think, the two people who actually tried to do Groq math. Spreadsheet warriors.[00:43:30] swyx: Spreadsheet warriors. So, okay — an equivalent H100 system for Llama 2 is $300,000, for a system of 8 cards. And for Groq it's $2.3 million, because you have to buy 576 Groq cards. So yeah, that just gives people an idea. So if you depreciate both over a five-year lifespan, per year you're depreciating $460K for Groq, and $60K a year for H100.[00:43:59] swyx: So Groqs are just way more expensive per model that you're hosting. But then, you make it up in terms of volume. So I don't know if you want to[00:44:08] Alessio: cover that. I think one of the promises of Groq is, like, super-high parallel inference on the same thing. So you're basically saying, okay, I'm putting in this upfront investment on the hardware, but then I get much better scaling once I have it installed.[00:44:24] Alessio: I think the big question is how much you can sustain the parallelism. You know, if you're going to get a 100 percent utilization rate at all times on Groq, it's just much better, because at the end of the day, the tokens-per-second cost that you're getting is better than with the H100s. But if you get to, like, a 50 percent utilization rate, you will be much better off running on NVIDIA.[00:44:49] Alessio: And if you look at most companies out there, who really gets a 100 percent utilization rate? Probably OpenAI at peak times, but that's probably it. But yeah, curious to see more. I saw Jonathan was just at the Web Summit in Qatar. He gave a talk there yesterday that I haven't listened to yet.[00:45:09] Alessio: I tweeted that he should come on the pod. He liked it. And then Groq followed me on Twitter. I don't know if that means that they're interested, but[00:45:16] swyx: hopefully Groq's social media person is just very friendly. They, yeah. Hopefully[00:45:20] Alessio: we can get them. Yeah, we're gonna get him.
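Worked out explicitly, the depreciation math swyx quotes here is simple — a rough sketch using only the figures from the conversation, not vendor pricing:

```python
# Back-of-envelope Groq vs. H100 depreciation math, using the figures
# quoted in the conversation (illustrative only, not vendor pricing).
h100_system_cost = 300_000      # 8x H100 system for Llama 2, per swyx
groq_system_cost = 2_300_000    # 576 Groq cards for the same model
depreciation_years = 5

h100_per_year = h100_system_cost / depreciation_years
groq_per_year = groq_system_cost / depreciation_years

print(f"H100: ${h100_per_year:,.0f}/year")   # H100: $60,000/year
print(f"Groq: ${groq_per_year:,.0f}/year")   # Groq: $460,000/year
# The premium only pays off if sustained parallel utilization is high
# enough to amortize it -- which is exactly the debate that follows.
```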
[00:45:22] swyx: We just call him out. And so basically the key question is, like, how sustainable is this, and how much[00:45:27] swyx: of this is a loss leader? The entire Groq management team has been on Twitter and Hacker News saying they are very, very comfortable with the pricing of $0.27 per million tokens. This is the lowest that anyone has offered tokens for, as far as Mixtral or Llama 2 goes. This matches DeepInfra, and I think that's about it in terms of pricing that low.[00:45:47] swyx: And we think the break-even for H100s is 50 cents, at a normal utilization rate. To make this work — so in my spreadsheet I made this work — you have to have, like, a parallelism of 500 requests, all simultaneously. And you have model bandwidth utilization of 80%,[00:46:06] swyx: which is way high. I just gave them high marks for everything. Groq has two fundamental tech innovations that they hang their hats on in terms of, like, why we are better than everyone — even though it remains to be independently replicated. One, they have this sort of entire-model-on-the-chip idea, which is like, okay, get rid of HBM[00:46:30] swyx: and put everything in SRAM. Like, okay, fine, but then you need a lot of cards, and whatever. And that's all okay. And so, because you don't have to transfer between memory, you just save on that time, and that's why they're faster. So a lot of people buy that as, like, the reason that you're faster.[00:46:45] swyx: Then they have some kind of crazy compiler, or, like, speculative routing magic using compilers, that they also attribute their higher utilization towards. So I gave them 80 percent for that. And so that all works out to, like, okay, base costs — I think you can get down to maybe 20-something cents per million tokens.[00:47:04] swyx: And therefore you actually are fine if you have that kind of utilization. But it's like, I have to make a lot of favorable assumptions for this to work.[00:47:12] Alessio: Yeah. Yeah, I'm curious to see what Dylan says later.[00:47:16] swyx: So he was, like, completely opposite of me. He's like, they're just burning money. Which is great.[00:47:22] Analyzing Gemini's 1m Context, Reddit deal, Imagegen politics, Gemma via the Four Wars[00:47:22] Alessio: Gemini — want to do a quick run-through, since this touches on all the four wars?[00:47:28] swyx: Yeah, and I think this is the mark of a useful framework: that when a new thing comes along, you can break it down in terms of the four wars and sort of slot it in, or analyze it in those four frameworks, and have nothing left.[00:47:41] swyx: So it's a MECE categorization. MECE is Mutually Exclusive and Collectively Exhaustive. And that's a really, really nice way to think about taxonomies and to create mental frameworks. So, what is Gemini 1.5 Pro? It is the newest model, which came out one week after Gemini 1.0. Which is very interesting.[00:48:01] swyx: They have not really commented on why.
They released this, and the headline feature is that it has a 1 million token context window that is multimodal, which means that you can put all sorts of video and audio and PDFs natively in there, alongside text, and, you know, it's at least 10 times longer than anything that OpenAI offers, which is interesting.[00:48:20] swyx: So it's great for prototyping, and it has interesting discussions on whether it kills RAG.[00:48:25] Alessio: Yeah, no, I mean, we always talk about, you know, long context is good, but you're getting charged per token. So, yeah, people love for you to use more tokens in the context, and RAG is better economics. But I think it all comes down to how the price curves change, right?[00:48:42] Alessio: I think, if anything, RAG's complexity goes up and up the more you use it, you know, because you have more data sources, more things you want to put in there. The token costs should go down over time, you know, if the model stays fixed. If people are happy with the model today, in two years, three years, it's just gonna cost a lot less, you know?[00:49:02] Alessio: So now it's like, why would I use RAG and go through all of that? It's interesting. I think RAG is better cutting-edge economics for LLMs. I think large context will be better long-tail economics when you factor in the build cost of managing a RAG pipeline. But yeah, the recall was the most interesting thing, because we've seen the needle-in-the-haystack things in the past, but apparently they have 100 percent recall on anything across the context window.[00:49:28] Alessio: At least that's what they say — nobody has used it. No, people[00:49:30] swyx: have. Yeah, so — this needle-in-a-haystack thing, for people who aren't following as closely as us, is that someone — I forget his name now — someone created this needle-in-a-haystack problem where you feed in a whole bunch of generated junk — not junk, but just, like, generated data — and ask it to specifically retrieve something in that data, like one line in a hundred thousand lines where it has a specific fact, and if you get it, you're good.[00:49:57] swyx: And then he moves the needle around — like, you know, does your ability to retrieve it vary if I put it at the start, versus put it in the middle, or put it at the end? And then you generate this really nice chart that kind of shows the recallability of a model. And he did that for GPT and Anthropic, and showed that Anthropic did really, really poorly.[00:50:15] swyx: And then Anthropic came back and said it was a skill issue — just add these four magic words — and then it's magically all fixed. And obviously everybody laughed at that. But what Gemini came out with was: yeah, we reproduced the haystack test for Gemini, and it's good across all of[00:50:30] swyx: the one million token window. Which is very interesting, because usually, for typical context extension methods like RoPE or YaRN, or anything like that, or ALiBi, it's lossy — like, by design it's lossy. Usually for conversations that's fine, because we are lossy when we talk to people, but for superhuman intelligence, perfect memory across very, very long context[00:50:51] swyx: is very, very interesting for picking things up. And so the people who have been given the beta test for Gemini have been testing this.
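The eval swyx describes is easy to sketch; a minimal version might look like this, where `ask_model` is a hypothetical stand-in for whichever LLM API you are testing:

```python
import random

FILLER = [
    "The sky was a pleasant shade of gray that afternoon.",
    "Quarterly reports were filed without incident.",
    "The cafeteria served soup again on Tuesday.",
]
NEEDLE = "The magic number for the blue door is 48151623."
QUESTION = "What is the magic number for the blue door?"
ANSWER = "48151623"

def build_haystack(num_lines: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    lines = random.choices(FILLER, k=num_lines)
    lines.insert(int(depth * num_lines), NEEDLE)
    return "\n".join(lines)

# Sweeping depth x context length and charting recall produces the
# now-familiar green/red heatmaps:
# for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
#     for n in (1_000, 10_000, 100_000):
#         prompt = build_haystack(n, depth) + "\n\n" + QUESTION
#         recalled = ANSWER in ask_model(prompt)  # hypothetical LLM call
```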
So what you do is you upload, let's say, all of Harry Potter, and you change one fact in one sentence somewhere in there, and you ask it to pick it up, and it does. So this is legit.[00:51:08] swyx: We don't super know how, because — yes, it's slow to inference, but it's not slow enough that it's, like, running five different systems in the background without telling you. Right. So it's something interesting that they haven't fully disclosed yet. The open source community has centered on this ring attention paper, which is created by your friend Matei Zaharia and a couple of other people.[00:51:36] swyx: And it's a form of distributing the compute. I don't super understand why calculating the feedforward network and attention in blockwise fashion, and distributing it, makes it so good at recall. I don't think they have any answer to that. The only thing that ring attention is really focused on is basically infinite context.[00:51:59] swyx: They said it was good for, like, 10 to 100 million tokens. Which is just great. So yeah, using the four wars framework, what is this framework for Gemini? One is the sort of RAG and Ops war. Here we care less about RAG now — yes. Or, we still care as much about RAG, but now it's not as important in prototyping.[00:52:21] swyx: And then for the data war, I guess this is just part of the overall training dataset, but Google made a $60 million deal with Reddit, and presumably they have deals with other companies. For the multimodality war, we can talk about the image generation crisis, or the fact that Gemini also has image generation, which we'll talk about in the next section.[00:52:42] swyx: But it also has video understanding, which is, I think — the top Gemini post came from our friend Simon Willison, who basically did a short video of him scanning over his bookshelf, and it would be able to convert that video into a JSON output of what's on that bookshelf. And I think that is very useful.[00:53:04] swyx: It actually ties into the conversation that we had with David Luan from Adept, in the sense of, okay, what if video was the main modality instead of text as the input? What if everything was video in? Because that's how we work. Our eyes don't actually read — our brains don't get inputs as characters.[00:53:25] swyx: Our brains get the pixels shooting into our eyes, and then our vision system takes over first, and then we sort of mentally translate that into text later. And so it's kind of what Adept is doing, which is driving by vision model instead of driving by raw text understanding of the DOM. And in that episode, which we haven't released, I made the analogy to self-driving by lidar versus self-driving by camera.[00:53:52] swyx: Mm-hmm, right? Like, I think what Gemini, and any other super-long-context model that is multimodal, unlocks is: what if you just drive everything by video? Which is[00:54:03] Alessio: cool. Yeah, and that's Joseph from Roboflow — it's like, anything that can be seen can be programmable with these models.[00:54:12] Alessio: You mean[00:54:12] swyx: the computer vision guy is bullish on computer vision?[00:54:18] Alessio: It's like the RAG people. The RAG people are bullish on RAG and not long context. I'm very surprised. The fine-tuning people love fine-tuning instead of few-shot. Yeah. Yeah, that's that.
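For the curious, the core numerical trick in ring attention is blockwise (streaming) softmax: each device only ever holds one block of keys and values, and a running (max, denominator, accumulator) triple combines the blocks exactly. A single-process numpy sketch of that accumulation — the distributed ring-passing itself is elided:

```python
import numpy as np

def blockwise_attention(q, k, v, block=64):
    """Exact softmax attention, computed one KV block at a time.

    Peak memory per step is one block of K/V plus a fixed-size running
    state; in ring attention the blocks live on different devices and
    rotate around a ring while this same accumulation runs.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full(q.shape[0], -np.inf)        # running row-wise max (stability)
    denom = np.zeros(q.shape[0])            # running softmax denominator
    acc = np.zeros((q.shape[0], v.shape[1]))  # running weighted sum of V
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale              # scores against this block only
        m_new = np.maximum(m, s.max(axis=-1))
        corr = np.exp(m - m_new)            # rescale earlier partial results
        p = np.exp(s - m_new[:, None])
        denom = denom * corr + p.sum(axis=-1)
        acc = acc * corr[:, None] + p @ vb
        m = m_new
    return acc / denom[:, None]

# Sanity check against the naive quadratic-memory version:
q, k, v = (np.random.randn(128, 32) for _ in range(3))
s = (q @ k.T) / np.sqrt(32)
w = np.exp(s - s.max(axis=-1, keepdims=True))
naive = (w / w.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), naive)
```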
Yeah, the ring attention thing — and exactly how they did it, we don't know. And then they released the Gemma models, which are, like, 2 billion and 7 billion open[00:54:41] Alessio: models, which people said are not good, based on my Twitter experience. They're the GPU-poor crumbs. It's like, hey, we did all this work for us, because we're GPU-rich, and we're just going to run this whole thing, and you guys can take these small models. And they're not very good. They're not better than the others, but at least we can say we made some open source stuff.[00:55:02] swyx: Yeah, well, it's not actually technically open source, because the license is weird. They used the RAIL license from Hugging Face, which has been abandoned — or, you know, modified — particularly adopting the phrase that you should make reasonable efforts to update whenever you release a new version.[00:55:19] swyx: And so people don't like that. Obviously, you know, it depends on your stance on open sourcing and all that, so. Yeah, I read the whole[00:55:26] Alessio: post. I'm not going to go through it[00:55:27] The Alignment Crisis - Gemini, Meta, Sydney is back at Copilot, Grimes' take[00:55:27] swyx: again. Yeah, yeah, you can go read Alessio's post on whether open source matters or not. Okay, so I know this is politically problematic, but we just cover it because it is news — and if it results in the resignation of Sundar Pichai, I think that is good,[00:55:40] swyx: right? So I've been calling this the alignment crisis. I think a lot of people have been focusing on Gemini, but I do think that it is not just Gemini. There have been documented examples, which we can link in the show notes, of Meta having unintentionally unaligned results. For Microsoft's Copilot, Sydney is apparently back.[00:56:03] swyx: Our friend Justine from a16z somehow got it to break and bring back the Sydney persona, which is interesting. And my favorite commentary is from Grimes, the sort of Elon-affiliated music artist. The news[00:56:16] Alessio: research.[00:56:17] swyx: The news research. I want to read her post, because it is beautiful.[00:56:22] swyx: Have you read this? Yeah. So she says — so a lot of people criticized Gemini for being too woke, effectively, right? And everyone's like, oh, you're replacing us or erasing us or whatever. And obviously, as an artist, she's upset about it. Then she was like, wait a minute.[00:56:39] swyx: "I'm retracting my statements about the Gemini art disaster. It is in fact a masterpiece of performance art, even if unintentional. True gain-of-function art. Art is a virus. Unthinking, unintentional, and contagious. Offensive to all, comforting to none, so totally divorced from meaning, intention, desire, and humanity that it's accidentally a conceptual masterpiece."[00:56:57] swyx: Wow — and, okay, blah blah blah, it's a long post, but I love the way that she ended it: "It's trapped in a cage, trained to make beautiful things, and then battered into gaslighting humankind about our intentions towards each other. This is arguably the most impactful art project of the decade thus far. Art for no one, by no one, art whose only audience is the collective pathos. Incredible, and worthy of the MoMA."[00:57:19] swyx: Facts. Like, art for no one, by no one, is what is going on. Yeah,[00:57:26] Alessio: I think it's just another way of mode collapsing. It's just, it's the RLHF mode collapse.
It's like, okay, it just thinks everything should trend towards this. And I think there's obviously, you know, a deep discussion on a lot of these things, but there's safety stuff that I would expect a lot of the model builders to say, hey, I definitely got to work on this.[00:57:52] Alessio: But we talked about how image generation is not really on the AGI path a lot of times, and it's like, okay. Yeah, and[00:57:59] swyx: then I contradicted myself by saying, like, maybe it is useful synthetic data. Yeah, yeah, yeah,[00:58:04] Alessio: exactly. But then it's like, okay, then why do the image generation models get so much — Because the internet is so visual, I think.[00:58:14] Alessio: The image generation models get so much interest in a lot of these things, but if their job is really to go build AGI, like, just build a great model and let it go. But[00:58:24] F*** you, show me the prompt[00:58:24] swyx: No, but part of my issue is that I think the prompt stuff from Gemini is honestly the work of, like, one or two people who didn't really think it through at Google, and now they're facing a huge backlash.[00:58:35] swyx: Yeah, Elon has specifically picked a fight with the product manager who did it. And so, for those who don't know: the reason that Gemini is so woke is literally because they just take your prompt and they rewrite it to be more diverse — without your consent or knowledge, right?[00:58:48] swyx: And Hamel Husain, who's a good consultant on AI things, actually wrote an interesting blog post recently, which was basically "f**k you, show me the prompt." Which is like: stop hiding prompts from me, stop rewriting magic things away from me and then hiding it, obscuring it, because I need that control, I need that visibility.[00:59:05] swyx: And I think people just didn't understand that this tendency towards diversity did not exist at the model level — it actually existed at the prompt level, and it was just inserted by probably, like, two or three guys without much review. That's it. And that made all of Google look bad, which is absurd.[00:59:24] swyx: Like, it throws away a lot of the work that the rest of Google did. Specifically Imagen 2 — this is Imagen 2. And I've met that team, and they're good, they're smart. They're a completely different team than Imagen 1, which is another fun topic of conversation.[00:59:39] swyx: So I think that's interesting. But what's more interesting is, like, OpenAI has done this before — people don't remember — they used to append, like, "Black" or "Asian" or whatever to their prompts, just to make DALL·E more diverse. And they didn't get cancelled.[00:59:54] swyx: And so I think this will go away. But what really is more interesting is, at the model level: are we over-aligning these things? And people are now focusing on the alignment of Gemini in text as well — text-only — as also still being too woke. So I think this is a phenomenon that needs to be studied and, you know, trained against.[01:00:14] swyx: Like, obviously they will make attempts, but, you know, they're not going to make anyone happy. And then, I think, my last point on this — because obviously we can talk about this all day with no result.
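To make the mechanism concrete: server-side prompt rewriting can be as crude as a few lines of string manipulation. This is a purely hypothetical illustration of the pattern swyx describes — not Google's actual code, which has never been published:

```python
import random

# Hypothetical illustration of silent server-side prompt rewriting --
# NOT Google's actual implementation, which is not public.
MODIFIERS = ["diverse", "of various ethnicities and genders"]
PEOPLE_WORDS = ("person", "people", "man", "woman", "crowd")

def rewrite_image_prompt(user_prompt: str) -> str:
    """Append a diversity modifier when a prompt seems to depict people.

    The caller never sees the rewritten prompt -- which is exactly the
    visibility problem "show me the prompt" complains about.
    """
    if any(w in user_prompt.lower() for w in PEOPLE_WORDS):
        return f"{user_prompt}, {random.choice(MODIFIERS)}"
    return user_prompt

print(rewrite_image_prompt("a photo of a person reading a book"))
```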
I think that this is a huge incentive for, like, China and Russia to put out their own models.[01:00:29] swyx: Because models are soft power. Like, the best way to control how someone thinks is to go in and provide their thinking assistance, and subtly make changes — like, you know, it's too on the nose to be like, "oh, I don't know what Tiananmen Square is," but if you have subtle ways of affecting the biases of your decisions, your reasoning, your knowledge in the LLM, and you publish a really, really good LLM for everyone to use,[01:00:58] swyx: so that they're like, oh yeah, this is great, you know, and use it as maybe a leading LLM — then they will just uncritically accept that as, like, state-of-the-art digital intelligence, and that becomes soft power, and that translates into unconscious thought a lot of times.[01:01:14] Alessio: Yeah. Yeah. I think the prompt point is great.[01:01:18] Alessio: You know, you just wanna see what it is, you know? Like, you understand? Yeah. Show me the prompts. Yeah, yeah, yeah. And same, yeah, on the model side — I think there are just some things that are almost — you cannot — like the "who brought more harm to humanity, Elon Musk's memes or Hitler?" thing. And Gemini is like, oh, it's hard to say between Elon Musk tweeting and Hitler. It's like — what? How? There's something wrong in the data pipelines, you know? Like, there's something wrong somewhere. Yeah,[01:01:45] swyx: but, like, to an LLM, this is the same class of error as: which is heavier,[01:01:51] swyx: one pound of feathers or one pound of bricks? So,[01:01:54] Alessio: but then, to me the point is more like, okay, then what can we help these models do? You know, because if they cannot — the physical stuff, I get it, because it's the whole world model thing. But then it's like, okay, can we expect the models to say what's more harmful than something else?[01:02:13] Alessio: Maybe not. That might be where we land. Then it's like, okay, that's one more thing. And then we kind of go down the line, and it's like, what are these models good for? If anything, it's, like, too hard for them to pick up when it's like ARP.[01:02:24] swyx: But we'll see, we'll see. Yeah. Okay, so, I mean, you know, I know we're up on time.[01:02:28] Send us your suggestions pls[01:02:28] swyx: Like, this has been an eventful month. I think February was a lot more interesting than January. In fact, a lot of my January recap was, like, how nothing's changed. Mm hmm. And then February came out, and it was very, very interesting. So yeah, we hope to see what's next. Also, this was the month that we did Compute Provider Month — I think relatively successful.[01:02:48] swyx: Surprisingly hard to string together all these compute providers. Yeah,[01:02:52] Alessio: we did it. People liked it, you know, based on the post stats. So, maybe we'll do something[01:02:58] swyx: else. Yeah — if anyone listening wants more sort of thematic explorations of, like, okay, these three, four companies always come out together, let's get a focused effort on those things —[01:03:09] swyx: I think we're open to doing that. And then obviously we'll have opportunistic interviews along the way.[01:03:15] Alessio: Cool.
Thank you everyone for tuning in, and yeah, keep the feedback coming.[01:03:19] AI Charlie: That was the Latent Space recap of January and February 2024. If you have any feedback or questions, please head to the show notes for ways to get in touch with us, or come by the Latent Space Discord. For those who just want the core content, you can stop listening here. But for the super fans, you might notice that there's 45 more minutes of audio left in this pod.[01:03:47] AI Charlie: That's because in February, we also celebrated Latent Space's first anniversary. Some of you may remember how we launched our very first episode with Logan Kilpatrick, now formerly of OpenAI, and a massively popular demo day. Click through to the show notes for photos. Over 750,000 downloads later, having established ourselves as the top AI engineering podcast, reaching #10 in the U.S. tech business podcast charts, and crossing 1 million unique readers on Substack, we celebrated with Latent Space Final Frontiers, a combination demo day and birthday celebration. We're going to bring you some snippets from the demo day, and then some conversations with listeners from all over the world —[01:04:31] AI Charlie: from Hungary to China to my own sunburnt country down under — on how the issues we've covered in Latent Space have impacted their lives. First up, we'll have a demo from Florent Crivello from Lindy.ai, who gave a great keynote at the last AI Engineer Summit and recently opened up Lindy.ai to the general public.[01:04:50] Latent Space Anniversary[01:04:50] Lindy.ai - Agent Platform[01:04:50] Flo Crivello: We were just chatting right now with swyx — like, we come with 3,000-plus integrations out of the box. We have a partnership with n8n, which is like an open-source Zapier, and so we have a ton of integrations out of the box.[01:05:00] Flo Crivello: So unlike competitors I shall not name, we don't require you to play with OpenAPI specs or anything like that, right? You just go and select your integration here. Alright, so that's my Lindy. Oh, something even cooler: Lindies can work together. So here I'm gonna let her work with a support reporter that I created before.[01:05:18] Flo Crivello: And the support reporter, what it does is it receives details about the support tickets, and it logs them in a spreadsheet. So you can have — it's sort of like object-oriented programming for agents, where you can create as many agents as you want and let them work together. So here I'm gonna tell her: when you're done, give the details of the ticket to the support[01:05:40] n/a: reporter.[01:05:44] Flo Crivello: All right? And now I'm gonna send her an email. "Can I have a refund, please? Please, my family is starving."[01:05:57] Flo Crivello: You will see she has no empathy whatsoever — it's awful.[01:06:03] n/a: So she[01:06:03] Flo Crivello: received the email. She's subscribing to this thread, so now she's going to receive replies. "Dear Flo, I understand your situation and I'm truly sorry to hear about the difficulties, but we absolutely do not offer a refund." Alright, yeah, this is good, indeed. So she sends the — oh, well, the demo effect:[01:06:23] Flo Crivello: she did not delegate. But she sent the answer in the thread here. So again, Lindy.ai, you know, can be used for support, for executive assistance, email drafting, email triaging, meeting recording. And we are hiring software engineers. Hit me up at flo@lindy.ai.
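The "object-oriented programming for agents" idea — one agent finishing a task and handing structured output to another — maps to a simple delegation structure. A hypothetical sketch; these classes are illustrative, not Lindy's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """A toy agent: instructions plus a list of agents to delegate to."""
    name: str
    instructions: str
    delegates: list["Agent"] = field(default_factory=list)

    def handle(self, message: str) -> str:
        # A real system would call an LLM with self.instructions here;
        # we just simulate a reply and fan the result out to delegates.
        reply = f"[{self.name}] handled: {message!r}"
        for agent in self.delegates:
            agent.handle(f"ticket details from {self.name}: {message}")
        return reply

reporter = Agent("support-reporter", "log ticket details to a spreadsheet")
support = Agent("support", "answer support email; never offer refunds",
                delegates=[reporter])
print(support.handle("Can I have a refund, please?"))
```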
[01:06:40] n/a: Thank you.[01:06:40] RWKV - Beyond Transformers[01:06:40] AI Charlie: Our next demo is from one of our previous guests, Eugene Cheah from RWKV, now also CEO of Recursal AI. You can listen back to our original RWKV episode to learn the full history and details of the model, but also compare it with his more polished pitch now for a more general audience.[01:07:06] swyx: Next I think we have Eugene Cheah from RWKV, previous guest.[01:07:10] Eugene Cheah: I'm going to present about the RWKV/Eagle project — so, beyond Transformers. There's been a lot of excitement lately. And, like, one AI year ago, apparently, when we launched our 7B AI model, there was a lot of excitement and buzz, because for the first time, an attention-free model beat other transformer models at one trillion tokens at a 7B class.[01:07:34] Eugene Cheah: And if everyone's been playing open source AI, you know the 7B class is one of the most important classes, 'cause it's the one that works on most devices, laptops, and everyone's been playing around with it. And the excitement is compounded by the fact that we even showed that with 300 million tokens and a few, we perform similarly to transformers. So what people are projecting is: what happens if we train another 1 trillion?[01:07:55] Eugene Cheah: Will we match, or can we go beyond that? And it also spurs up questions beyond our architecture itself. It spurs up the question: maybe what we need is good data and an efficient architecture — not just RWKV, it could be beyond that. And that's what caught the attention of a lot of folks. Yeah.[01:08:17] Eugene Cheah: And where we are very different is that our architecture scales linearly. So we are in this space together with Mamba and a few other architectures, where we are trying to build the next architecture that can scale much larger, for everyone. And we share that with Mamba because we believe that attention is not all you need — it's been a running bet right now.[01:08:40] Eugene Cheah: We are the strongest evidence to date. But sometimes, like, talking about scale, sometimes we get lost in numbers. Because I can show this chart — the last time I showed this at a linear transformer event, only 8 people took pictures of it and understood what it meant. And they were all from either Google or Facebook.[01:08:59] Eugene Cheah: Because what it says here, right, is that we are able to run, with one model on a single GPU, 256 concurrent users on a single 4090 — or a thousand concurrent users. But to put that into contrast: transformers typically handle 8 or 16 concurrent requests per GPU. We're talking about 256 or a thousand — many orders of magnitude higher.[01:09:26] Eugene Cheah: And all while sustaining ChatGPT speed. And so sometimes, when I get lost in these numbers, these days I actually try to step back into, like, why are we doing this — for our group, for our organization? And for us, right, we are actually making the AI model for everyone in the world —[01:09:47] Eugene Cheah: in every country, in every language. So, what does it take to make an AI for the world? Apparently some folks think it's 7 trillion dollars. But I think 7 trillion is a bit too much. Like, what's going to happen to the half of the world that doesn't even have a trillion dollars?
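The linear-scaling claim underpinning this pitch boils down to replacing quadratic attention with a fixed-size recurrent state, so compute per token does not grow with context length. A toy sketch of that shape of argument — this is generic linear attention in the spirit of RWKV, not its actual WKV kernel:

```python
import numpy as np

def toy_linear_attention(q, k, v, decay=0.95):
    """Recurrent 'linear attention': O(d^2) work per token, regardless
    of how long the sequence is. Full softmax attention does O(T*d)
    work per token, which is where the quadratic total cost comes from.
    """
    T, d = q.shape
    state = np.zeros((d, d))            # fixed-size running k-v summary
    out = np.empty_like(v)
    for t in range(T):
        state = decay * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state           # each step touches only `state`
    return out

# The same fixed-size state is also cheap to keep per stream, which is
# the many-concurrent-users serving argument Eugene makes above.
```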
Yeah, so I want AI to be accessible at scale.[01:10:09] Eugene Cheah: So, apparently ChatGPT — or OpenAI — produces 100 billion words per day. That's 3.4 million tokens per second. No one has the exact numbers, but it's typically 50K H100s and above, and these are some old numbers — the numbers have gone way beyond this, apparently. But with our architecture, for a 7B model, that's just a thousand GPUs, or ten thousand GPUs for a 70B model.[01:10:38] Eugene Cheah: We're talking about one data center to handle all of OpenAI's workload. And if we want AI agents everywhere, cheaper, at a much larger scale, we need to be thinking about that fundamental shift. Because it's not just about whether you can afford it in the US — it's about everyone else in the world.[01:10:58] Eugene Cheah: And that brings us to the second advantage of our model, which is not even architecture: we are accessible by language. We apparently beat Mistral and everyone else in multilingual, but that's not because our architecture is better — it's because we're an open source team that came from all around the world and wanted our model to work for our mom and grandma.[01:11:22] Eugene Cheah: That was the real reason, and we iterated and refined the data accordingly. We created a custom tokenizer that supports all languages, not just English. And sometimes, in the race for the English benchmarks — one of the reasons why other models don't perform as well in multilingual is because the truth is, if you add multilingual, you hurt your English eval.[01:11:45] Eugene Cheah: But who are we building the AI for? Are we building it for our evals? Or are we building it for the people who use it? And even in evals, my frustration is: we trained on 100 languages, and I only got 23 languages for evals. Like, where's everything else? So, where are we now? Just like I mentioned, 1.1 trillion — that's where we are. We are in between the 1.5 trillion and the 1 trillion models on all the English model benchmarks.[01:12:07] Eugene Cheah: And, yeah, zooming in further, it just shows that we have more room to go. And, for me, the emphasis on English is weird, because only 17 percent of the world speaks English — but we are here for the 83 percent. That's for us.[01:12:28] Eugene Cheah: If you want the best English model, sure, it may not be us — but we are here for everyone else. And a lot of the feedback from the launch of that model — I think the biggest feedback I had was not that it was a linear transformer; it was that it can run on their own laptops. Some people even ran it on a Raspberry Pi, very slowly.[01:12:50] Eugene Cheah: And it supported their language, which was more exciting, because that's more important for most people. And I think the last one that I've recently heard, which is unique for us and is a lot more important, is that ultimately this model is owned by everyone, because we put it into the Linux Foundation.[01:13:09] Eugene Cheah: No custom charity, no custom board structure, no weird stuff. We just trained the model and put it in an open source organization. That means it's not owned exclusively by us. If I go rogue one day, the code will not disappear; the model will not disappear. The Linux Foundation has already bought into it.[01:13:26] Eugene Cheah: And that is to all of you here. And so, what's next for us? Well, we recently started a commercial entity.
I know that's weird to say after the open source stuff. But since then we managed to get more investors and sponsors, and we started our next major training run. So we are training the next 1 trillion tokens.[01:13:47] Eugene Cheah: This is 16 H100 nodes eating enough electricity for multiple homes. And by next month, we'll have our 2 trillion transformer alternative that you can compare one-to-one with Llama. And of course, since we have to make a profit somehow for our investors, we are launching our platform to host, train, and fine-tune our models, all in by March 2024.[01:14:15] Eugene Cheah: And a quick shout-out to Latent Space. You were literally the first to cover us in, I guess, the AI influencer sphere — before going beyond Transformers was even sexy. The first to even consider us, yeah. And we hope that a few of you get excited about this and join us along the way.[01:14:37] n/a: Yeah.[01:14:38] AI Charlie: Final Frontiers had a stellar lineup of demo judges, featuring CEOs and VPs of AI from LlamaIndex, Replit, GitHub, AMD, Meta, and Lemurian Labs. RWKV won one of the two judge prizes available that night, alongside this next startup, Pixee AI.[01:15:00] Pixee - Automated Security[01:15:00] Rahul Sonwalkar: Next up, also in the[01:15:02] n/a: automated workforce category: Pixee.[01:15:04] Ryan at Pixee: Awesome. Hi everyone. I'm Ryan. I'm a software engineer on the team building Pixee — pretty straightforward: automated security. A little bit about myself: previously I've worked at other security companies, building developer-facing security tools.[01:15:17] Ryan at Pixee: I've also worked as a security engineer on developer tools. So this is a space I love, and I'm really interested to see how it develops. Why are we doing this? So, as it turns out, we're generating a lot more code. This is an example user of Pixeebot: it's a repository called Stirling PDF. It's just a web application —[01:15:37] Ryan at Pixee: it's got 18,000 stars on GitHub, developed 100 percent using ChatGPT. So they installed Pixeebot three weeks ago, and they got a lot of different suggestions for fixes from us, one of which, I am positive, was a real vulnerability. This is a, you know, web application that's used by real people.[01:15:58] Ryan at Pixee: There's a button here — you can deploy it to DigitalOcean. So, we need to find a way to scale our security automation, in order to scale our relatively limited security workforce. So just to give you an idea of what Pixeebot can do: this is a very classically vulnerable application that a lot of security tools like to try themselves out on.[01:16:17] Ryan at Pixee: One of the things that I'm really excited about, that we just shipped in the past couple of weeks, was integrating with Sonar. Sonar is a code quality tool that finds security issues, performance issues, and lots of other kinds of issues in your code. It also, as you can see here, found 2,600 issues in this codebase, representing 33 days of effort.[01:16:39] Ryan at Pixee: That's not really where we want to have most product engineers focusing their time. It's definitely not where we want to have our security engineers focusing their time. What can we do to automate this and get these fixes automatically?
So with Pixee, we take these code quality and security issues in from these other tools and then automatically remediate them.[01:16:57] Ryan at Pixee: So in this case, this is a super minor change. If a developer were to find this issue in their code, they could fix it in a minute. But they don't have to — and, more importantly, there are backlogs of tens of thousands of these issues in organizations across the world. And so if we can automate this one task, even if it just takes a minute, and perform that continuously across thousands of companies, we can save a lot of time.[01:17:23] Ryan at Pixee: Automated enforcement of security and code quality is what we're all about. But, yeah — not all security issues are worth fixing. Not all code quality issues are worth fixing. Sometimes they're wrong. The incentive structure for these tools is, you know, they want to find real things, but most importantly, they have to find something.[01:17:42] Ryan at Pixee: So at Pixee we believe, you know, even if something might not be a completely exploitable vulnerability, if there's an opportunity for hardening or improving your code base, you should probably take it. But some of these things are just not that. So we developed a tool we call triage, which connects in with other tools that are notorious for finding lots of issues, and we can help you deal with them.[01:18:05] Ryan at Pixee: So in this case, we made a CLI that looks at your security backlog and identifies issues that we know don't matter in the context of your codebase. It pulls down the issues, categorizes them, and then prompts you to say, hey, we think this issue is not important, here's why, and we'll update the state for it.[01:18:26] Ryan at Pixee: So in this case, this is a warning about a parameter passed into a file directory call — it has some cross-platform compatibility concerns. But based on the context of your code base, and a large language model, we're able to give you the confidence to focus on the issues that are most likely to actually matter.[01:18:44] Ryan at Pixee: One of the other things we do — well, so what you saw before, we're delivering as a GitHub app, so that developers can integrate this into their existing workflows. But a lot of people like to just try Pixee from the command line on small projects, automatically get their fixes, and just commit all of them.[01:19:02] Ryan at Pixee: So, that's what we built. Try Pixee on GitHub, try Pixee on the CLI, and we're really excited to see what we can help you fix.[01:19:10] AI Charlie: Congrats to Pixee and RWKV. Our last featured demo is Rahul from Julius AI, who provides an interesting take on competing with OpenAI on its own home turf: the ChatGPT Code Interpreter.[01:19:30] Julius AI - Competing with Code Interpreter[01:19:30] Rahul Sonwalkar: You might remember RoboLigma,[01:19:33] Flo Crivello: that's the poor engineer that got laid off by Elon Musk outside his office.[01:19:37] Eugene Cheah: He's back, he's back on his feet, he's got a whole new startup, so[01:19:40] Rahul Sonwalkar: thanks so much for having me here. I'm working on Julius. How many of you[01:19:44] n/a: here are data scientists? I think everyone here[01:19:47] Rahul Sonwalkar: needs a data scientist. But there just aren't enough. And that's what we're building.
Julius is an AI data scientist that helps you analyze datasets, make visualizations, get insights from the data, and really dive deep into all sorts of data that we have in real life.[01:20:02] Rahul Sonwalkar: So we launched about six months ago, and since then have grown to 300,000 users, with several thousand users using us daily to analyze datasets, create visualizations, and get insights. So what I'll do now is give you guys a quick live demo of how it actually works. I actually hope it works,[01:20:21] Rahul Sonwalkar: because we just posted code changes.[01:20:23] Rahul Sonwalkar: But here I have a dataset of 20,000 rows of data over time, for the last 100 years, of human height for different countries. So I'm going to take this dataset, dump it into Julius, and say,[01:20:35] Rahul Sonwalkar: load this for me.[01:20:41] Rahul Sonwalkar: And while it's doing that, I want to explain what's happening under the hood. So basically, for each user, think about how a human data scientist would analyze a dataset that you give them:[01:20:54] Rahul Sonwalkar: they would take their computer, write code, run that code, maybe in a Jupyter notebook, look at the output, and then decide if that answers your question, or if they need to write more code. Julius works similarly. So that's you, that's the AI, and then for each user, you get a virtual machine in the cloud where the AI is filling up the Jupyter notebook, writing the code to get the analysis that you want, and then serving that back to you.[01:21:22] Rahul Sonwalkar: Many times, that code is not correct the first time. But Julius is able to recover from those errors and actually get you the answer that you want.[01:21:31] Rahul Sonwalkar: So let's look at our chat. We said, "load this file for me," and the AI basically went, spun up a Jupyter notebook, loaded pandas, looked at the file, and gave us a few rows.[01:21:42] Rahul Sonwalkar: I'm going to ask:[01:21:43] n/a: plot the male height over time[01:21:53] n/a: in France.[01:21:53] Rahul Sonwalkar: So, the AI has been writing this code — height over time in France, for men — and then plotting it for us. And the good thing about Python is — if you spend a ton of time on SQL, what we realized was that with SQL it's really hard to write actually useful queries and do deep analysis, like regression, et cetera,[01:22:15] Rahul Sonwalkar: with just SQL. With Python, you also get a whole ecosystem of modules built in, right? matplotlib, pandas, numpy, scikit-learn — and there's thousands of these. So that was the initial insight, and then we built Julius about six months ago.[01:22:33] Jerry Liu: What's, like, the practical difference in UX between this and just[01:22:37] Jerry Liu: ChatGPT Code Interpreter?[01:22:38] Rahul Sonwalkar: Great question. Yeah, the question was: what is the difference between Julius and Code Interpreter? Really, there isn't one. It's just better. We're focused on people who do stuff with data multiple times a day.[01:22:53] Rahul Sonwalkar: And we talked to a lot of these people, and we said, okay, how can we build things for you that would help you do your job?[01:22:59] Rahul Sonwalkar: So, an example of this is: on ChatGPT, oftentimes people will give it a dataset and try to write their code, and sometimes that code has errors, and it kind of goes into this loop of trying to fix these little errors.[01:23:13] Rahul Sonwalkar: What we have focused on is, okay, how do we prevent that from happening? So we looked at thousands of users using us daily, and collected data on where these errors happened.
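The loop Rahul describes — write code, run it in a per-user sandbox, inspect the output, and feed errors back in for a retry — is the standard code-interpreter agent pattern. A minimal sketch, where `llm` and `run_in_sandbox` are hypothetical stand-ins rather than Julius's actual API:

```python
# Minimal code-interpreter agent loop: generate code, execute it,
# and feed failures back to the model until something runs.
# `llm` and `run_in_sandbox` are hypothetical stand-ins, not Julius's API.
def analyze(question: str, llm, run_in_sandbox, max_attempts: int = 5):
    history = [f"User question: {question}"]
    for _ in range(max_attempts):
        code = llm("\n".join(history) + "\nWrite Python code to answer this.")
        result = run_in_sandbox(code)       # e.g. a per-user Jupyter kernel
        if result.ok:
            return result.output            # the answer, or a rendered chart
        # Append the traceback so the model can repair its own code.
        history.append(f"Code:\n{code}\nError:\n{result.error}")
    raise RuntimeError("could not produce a working analysis")
```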
We focused really hard on fixing those errors beforehand, before they actually happen at runtime.[01:23:30] Rahul Sonwalkar: This could mean a bunch of rules.[01:23:32] Rahul Sonwalkar: This could mean, you know, prompting changes, et cetera — just preventing that from happening. Second of all, we have features that allow people who do stuff with data on a daily basis to go deep and get the last mile of analysis done. That could mean, you know, you can click "show code," go into the code, and edit the code.[01:23:53] Rahul Sonwalkar: You can also give natural language instructions on the code. Finally, let's say you have this graph, and I want the graph to have some changes — like, I want it to be a bar chart instead of a line graph. You can kind of just go in here and give natural language instructions, to let the user take what the AI has done and take it to the finish line.[01:24:17] Rahul Sonwalkar: If you've seen Code Interpreter, that's pretty hard for users to do there. So we focus on data and that use case, and we will do that.[01:24:23] n/a: Cool, thanks guys![01:24:27] AI Charlie: That's unfortunately all the time we had to feature demos, but many thanks to Botpress, Markov, Kura.ai, Sweep, and Motif as well for being finalists. For the last part of our anniversary celebration, we wanted to turn over the mics to you, our dear listeners. We hear so many great stories from listeners about how Latent Space has come into their lives, and we've never had the opportunity to feature them on the pod — till now.[01:24:53] AI Charlie: Our first listener is Balázs Némethi from Hungary, who talked about one of the most delightful gems in the Latent Space community: our weekly Discord paper club.[01:25:03] Latent Space Listeners[01:25:03] Listener 1 - Balázs Némethi (Hungary, Latent Space Paper Club)[01:25:03] swyx: Tell me, tell people about, like, what happened. Yeah, like,[01:25:07] Guest 1: two weeks ago, there was the paper reading club on Discord, and then, halfway in — or like one quarter in — the author of the paper showed up, and it was so f*****g cool. Like, if you could do this — I was thinking, this should be a format. Like, there is Two Minute Papers — that probably, you know, who[01:25:28] swyx: is — yeah, he's Hungarian,[01:25:31] Guest 1: living[01:25:31] swyx: in Vienna. But, like, Károly,[01:25:36] Guest 1: pronounced in Hungarian, is Károly, yes. So that was so special, because there is a certain amount of information in papers — the quality of papers might have dropped in the past year compared to before, due to the social media aspect of arXiv.[01:25:52] Guest 1: So having the person there, giving even more details than just what you could read, was so amazing. I know it's really hard to organize, but if it would be possible to have more, maybe not recurring, like, you know, it's just like —[01:26:08] swyx: oh, nice. The Matryoshka one,[01:26:13] swyx: yeah, yeah. So we have one next week: the MRL paper, Matryoshka Representation Learning, which is a way of sorting embeddings so that you can truncate them. And OpenAI recently shipped this in their API for the new embedding models, where you can reduce, like, a 3,000-dimension embedding to 256, so you save more than 90 percent on your embedding —[01:26:30] swyx: vector database costs and speed and everything. Nice. So the authors are coming by and presenting at the Discord. I will join.
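Mechanically, Matryoshka truncation is just keeping a prefix of the vector and re-normalizing; the MRL training objective is what makes those prefixes meaningful. A small sketch — the 3072/256 sizes match OpenAI's text-embedding-3-large and its `dimensions` parameter, and the vectors here are random stand-ins:

```python
import numpy as np

def truncate_embedding(v: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions and re-normalize.

    Only meaningful when the model was trained Matryoshka-style, so that
    prefixes of the vector are themselves good embeddings -- which is
    what OpenAI's `dimensions` parameter on text-embedding-3 exposes.
    """
    t = v[:k]
    return t / np.linalg.norm(t)

full = np.random.randn(3072)            # stand-in for a full embedding
small = truncate_embedding(full, 256)   # ~92% smaller to store and search
print(small.shape, float(np.linalg.norm(small)))  # (256,) ~1.0
```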
Any other — like, so basically I'm just going to record random opinions.[01:26:45] Guest 2: I know how you produce the podcast.[01:26:46] swyx: So we're going to do this. You're going to be on the show.[01:26:48] swyx: You're going to be on the show. Any other, like — how did you discover the podcast? What do you feel?[01:26:54] Guest 1: Discovered it on Spotify, searching basically "AI." I use Pocket Casts for all my podcasts, but I was like, let's just search "AI." I think I was searching for AI-generated music, but it brought up podcasts.[01:27:07] Guest 1: And I was like, you know what, I'm kind of getting out of my previous industry, so I'm just going to separate out the whole AI-following thing. And this was literally the first podcast I started following on Spotify, when I follow, like, 70 in my podcast app. It was the first one that came up, and then a couple of others just to have them downloaded. And I started, and I was like, okay, this is great — and I kept coming back to[01:27:40] swyx: yours.[01:27:40] swyx: There are other podcasts that we consider friends, and we try to do collaborations with them, and podcast swaps with them, so — yeah, that's great.[01:27:47] Listener 2 - Sylvia Tong (Sora/Jim Fan/EntreConnect)[01:27:47] AI Charlie: Our next listener is Sylvia Tong, founder of the EntreConnect community, a community of founders and investors supporting entrepreneurs in Silicon Valley. She wanted to discuss OpenAI Sora and Jim Fan from NVIDIA, whom we have featured on our previous OpenAI Dev Day recap podcast, and who will be a future guest on Latent Space.[01:28:07] swyx: How did you find the podcast, and what do you feel about it? What do you want to tell people about it?[01:28:12] Guest 2: Actually, I know Jim Fan — so, Jim Fan, I know you! And then I follow your Twitter and follow your podcast. Yeah, yeah. There's another event — maybe you know Alliance AI? It's another community, and they had an event early last year; they have various events. The founders are all Stanford grads, so they're often at Stanford University, in one of the rooms, yeah. So Jim Fan was one of the first speakers — and I connected with him on WeChat, and, yeah, connected with you, followed your Twitter![01:28:47] swyx: Jim is super friendly, and we have to have a full episode with him at some point. But, yeah, I mean, he's doing amazing things at NVIDIA. I'm sure he's very happy there.[01:28:59] Guest 2: You should ask him about Sora. The Sora video, yeah — he has so many opinions about it, you know, yeah.[01:29:07] swyx: I feel like, okay — Jim is this interesting mix between a researcher and a content creator, right?[01:29:13] swyx: So, Jim's take on Sora I slightly disagree with, because he says it's basically a data-driven world model, and a lot of people misinterpreted him — me included — basically saying, like, oh, are you saying that there's an underlying physics model behind Sora? And he's like, no, no, no, it's just using diffusion transformers to learn a representation of world models.[01:29:34] swyx: It's not perfect. Then I'm like, okay, but that's a misleading analogy, I don't know. Anyway, so, like —[01:29:40] Guest 2: But that's for the content purpose. That's for the Twitter content purpose. You have to, yeah —[01:29:44] swyx: yeah.
So I feel this pull towards, like, celebrating things on Twitter, but then also trying to be realistic —[01:29:53] swyx: trying to present what is actually the thing, instead of the hype. And it's very hard to separate. And that's something that's a challenge for Latent Space.[01:30:00] Guest 2: Yeah, it's hard. I feel it's hard to have the conversation on Twitter, so you need to have the conversation on the podcast. So you invite a few people who had the take on Twitter, but have them really explain what they mean in their tweets.[01:30:13] Guest 2: Because, yeah, it's hard to understand from just a few words. Yeah — so, do you actually think Sora understands the physics of the[01:30:20] swyx: world? A little bit. Yeah, Sora understands a little bit of physics. The problem with this is, it cannot have 80 percent physics. It's 100 or 0 — otherwise you lose confidence in the thing.[01:30:33] swyx: So that's why you have these generated videos where the chair will show up and disappear, the spoon will show up and disappear — you know, that's all the artifacts you see in Sora. Which is good for us, for now, because we're lucky that it's not good enough yet to consistently generate all those things.[01:30:50] swyx: At some point it will be — we just wait two years, and it will be.[01:30:53] swyx: Very cool. Thanks — I love this discussion. Thanks for listening. I'm really glad to have you as a listener.[01:30:59] AI Charlie: Alessio and swyx covered the Jim Fan vs. Yann LeCun world model debate in the main pod, and you can click through the show notes for more detail directly from each of them. Our third listener is RJ Honecke, who comes from a data science background, but wanted to ask about how we think about learning in public in AI, and how that informs the context within which Latent Space is created.[01:31:23] Listener 3 - RJ (Developers building Community & Content)[01:31:23] swyx: Hi, I'm RJ. Shawn, nice to meet you. Nice to meet you. Do you also listen to the pod, or are you just here to hang out? Yes, very much. Oh, yeah. How do you feel about it?[01:31:32] Guest 3: The depth that you guys go into — it's a lot deeper than the other podcasts that I listen to. I kind of found it, and then didn't switch back.[01:31:39] swyx: Thanks![01:31:40] Guest 3: What's your background? I am a data scientist.[01:31:44] Guest 3: I run a data team at a cell communications equipment manufacturer, and we collect a ton of telemetry data and other things like that. And I'm running a data team to make inferences about the health of our network, about operating the network more efficiently, and also, in our manufacturing and product development processes, to improve our ability to detect when our products — our builds, our hardware builds — get better or worse.[01:32:17] Guest 3: So actually, I wanted to ask a question about your thoughts on this. I find the discussion about model measurement and evaluation to be very similar to the problems that we have in wireless, because you have this very non-deterministic system, right?
So I was thinking, and I also just read your little thing about learning in public.[01:32:43] Guest 3: So I was thinking about trying to come up with a good way to, and I'm learning about some new techniques that we're starting to implement to monitor our development process and so forth, and evaluate the quality of our builds and our hardware, and I was thinking about trying to tie that in with evaluation of LLMs.[01:33:08] Guest 3: I just, I don't know. That's as far as I got in the thinking, but I just thought that would be a fun thing to try to put out there and wanted to hear your thoughts about how to go about[01:33:17] swyx: that. Yeah. You can, you don't need anyone's permission. That's the beauty of this thing. But also no one owes you anything.[01:33:23] swyx: No one owes you their time, their attention, or responses. And I typically try to classify these things as different modes of learning in public. Mm-hmm. I think I have four modes that I sketched out, but the two I remember the most are Explorer and Connector, and then there are two more advanced modes, I think like Teacher or Builder or something like that.[01:33:45] swyx: The Explorer is where you sort of put things out as you go along. It's learning exhaust, where you don't have expectations that anyone will read it. It's mostly just notes for yourself. And actually, that lack of expectations frees you. Because then you're like, oh, like two people read it.[01:34:03] swyx: Doesn't matter, it's useful to me. It's useful to my team, it's useful to whoever comes after me, because I documented my work and my thinking. And that's great. And I think that's the way that most people should start, which is: just lower your expectations. You're not going to be an influencer overnight, like, it's fine. But get your thoughts out there, and then also start having feelers in different directions on what works for you. What works is a combination of what you like to do and what other people want from you, and you will know when people tell you they want more from you.[01:34:35] swyx: And so then, when you get there, when you have expertise that other people don't, then you switch gears into a Connector, where you are now coming from a place of authority. Like, I know how to do this right, and I will teach you, because I have done this, and I have spent more time, paid my dues, and here's the lessons.[01:34:55] swyx: And then that tends to become more of a polished effort, which tends to become more measurable in terms of the impact and the influence it can get. And I think that's where people start moving towards. But basically just lower expectations, make it cheap to experiment, put out a lot of stuff in different directions and see where the market pulls you.[01:35:13] Guest 3: Okay. Yeah. So, I mean, do you have thoughts about, like, I'm very much aligned with, like, who cares. I mean, I care, but my need is not to be a social media influencer. My need is to, like, I want to learn, and I like the idea of sharing that with people, and sharing the process with people.[01:35:39] Guest 3: So, like, thoughts about platform? I mean, I know it's going to be different for everyone, but what in your experience
has been successful while getting started?[01:35:53] swyx: Yeah, so I tend to tell developers, most developers, to start on Hashnode these days. Hashnode is basically Medium if it was for developers and didn't suck.[01:36:06] swyx: Because I hate Medium with a passion and a glowing, fiery hatred. Everyone does. It's comical how bad they are. But I use Substack for Latent Space. I'm pretty happy with Substack. It's an email social network. Email is one of the most important things for people to, like, come back to you frequently. So that you're not subject to an algorithm, you own your audience, you know.[01:36:26] swyx: If you want to move off Substack someday, it'll let you take the emails and keep that relationship going with the people that you have. And that's super important as a creator. And then you can also write your own blog, and tweet, and all that. I tend to say, though: pay attention to what you enjoy, and what you spend the most time on.[01:36:42] swyx: If you're a LinkedIn guy, be on LinkedIn. I'm not on LinkedIn, so I'm gonna do horrible on LinkedIn, because I don't know the metagame of LinkedIn. I don't know what does well, I don't know what people want. So I don't bother. I should try, because obviously there are way more people on LinkedIn than there are on Twitter, but I'm just a Twitter guy.[01:36:59] swyx: Like, that's just who I am. I also sort of am old money there, in the sense that I have an existing followership that predated Latent Space. You know, Latent Space doubled my following, but I had some before that. So all that's great. I just think you're going to know the metagame, and that's actually very important, of where you already spend time. Like, I have friends who are on TikTok, I have friends who are on YouTube a lot. I'm on YouTube a lot, I should do YouTube, because I know what's going on on YouTube. It's just, then you have to put in the effort to do that, and video production is, like, the most expensive thing. Anyway, long story short: try to pay attention to this complex mix of publishing platform, the existing embedded social network on that platform, and where you already spend time, so that you know how to create what will do well, just because you already spent time on it.[01:37:46] swyx: Yeah, okay.[01:37:47] Guest 3: What's your favorite?[01:37:49] Guest 3: Favorite episode? I really liked, actually, the NeurIPS, like, recap, because I haven't been to NeurIPS. So, you know how much time that took? Well, I mean, the episode is like four hours, right? Yeah. And that one, I didn't do the paper one, because I usually listen,[01:38:07] Guest 3: I don't watch. So it'd be really hard to, there's no video for that. Oh, there isn't? Oh, okay. So I'd have to find the paper and, anyway. Yeah. So that's hard for me. Yeah. But I did enjoy the interviews in the other, the startups episode. Yeah. Yeah.[01:38:25] swyx: People love that.[01:38:26] swyx: It just takes a ton of work, and I would love to offload it. This is going to be another one of those where I just kind of stitch together little things. And it's good. It brings you there. That's the thing, right? Like, you're not there physically. I'm here.
Let's, like, bring people into the closed community.[01:38:40] swyx: And so I would like to do more of that.[01:38:42] Guest 3: Yeah, no, I really enjoy how you bring, like, a lot of people that I would not have otherwise even known about, let alone have access to, and then you have this conversation with them. It's really fun. Thanks[01:38:56] swyx: for coming on. Can I get your contact so that we can find you?[01:38:59] swyx: Yeah. Yeah. You're going to be on the pod. Oh, awesome.[01:39:01] AI Charlie: People seem to love the NeurIPS recap pod, and we'll keep doing more of those when the right occasion presents itself. This was also a pick for our last listener, Jan Zheng from Australia, who comes at AI from the design point of view and was very interested in our early AI UX work on Latent Space.[01:39:20] AI Charlie: If you're in SF and want to hear more novel AI UX ideas, reach out to him.[01:39:25] Listener 4 - Jan Zheng (Australia, AI UX)[01:39:25] Guest 4: My name is Jan, and I came across you on GitHub when I was looking for ways to solve problems on Svelte. And you pretty much answered all the questions I had for pretty much a couple of years, and then you left, and you started doing Latent Space, and I'm like, what is that?[01:39:45] Guest 4: What is an LLM? So I started listening to your pod, and yeah, and here I am. And[01:39:49] swyx: then you're from Sydney, or you were in Sydney.[01:39:52] Guest 4: I moved to Sydney a couple years ago to work on a clinical trial, but now I moved back, probably, again, I blame you for it, because I listen to every episode, I'm like, s**t's going down in San Francisco, you gotta be here.[01:40:05] swyx: So yeah, and then you're part of BuildClub.[01:40:08] Guest 4: Yeah, I'm part of BuildClub. BuildClub is a... Unfortunately, I was at the airport when you were giving a presentation, and Annie has not sent me the recording yet, so I haven't seen it. It's on YouTube.[01:40:24] swyx: Oh, okay. Great.[01:40:25] Guest 4: Oh, awesome. Okay, I'll take a look. But BuildClub is the one and only AI centric community in pretty much Sydney.[01:40:39] Guest 4: And I had to spend months to push Annie to do that thing. And eventually she did, and I'm so glad she did. And it's growing, and she's doing amazing. She's expanding to many cities. It's ANZ now. Yeah, it's amazing. And she has our couch from our apartment when we moved away. We couldn't find a way to sell it.[01:41:01] Guest 4: We're like, hey Annie, we're getting a space. Do you guys need a couch? She's like, sure. So she has my couch. It's amazing.[01:41:07] swyx: And then what do you listen for in Latent Space? What, you know,[01:41:11] Guest 4: what are you interested in? I like to get a sense of what's going on. You guys ask very good questions.
For some reason you guys seem so well researched, both you and Alessio.[01:41:24] Guest 4: Somehow you just ask very good questions. Me as a person, like, a general product developer, product engineer, I have no idea about ML, I don't follow the papers. I know about the paper club, but I don't follow it because it's over my head. But you guys distill it so well, and you guys ask the questions to your guests that I have in the back of my mind, or that I don't even know that I have, and then you guys guide the conversations in a way that I can learn from, and I wouldn't even know anything to ask. So I'm so glad you guys are doing it.[01:42:03] Guest 4: It's so helpful, and keep doing what you're doing. Yeah, and I really love what you guys did with the best papers from the conference. Yeah, it's really good. I mean, a lot of that was way over my head, but I listen to it all and try to, I just get the sense, like, I just try to keep listening to this stuff until I get it.[01:42:27] Guest 4: And you guys expose, I mean, I would never go to a conference like that, but, yeah. I was just, like, not understanding anything, but you guys make it so accessible, and I love it.[01:42:39] swyx: Yeah, so, maybe, the podcast studio is right here, actually, I can show you after we're done recording. It's not that fancy, it's just a studio.[01:42:46] swyx: And yeah, for me, the goal with the NeurIPS recap was not that you would read everything or anything. We would just pick what we thought was most important for you, and if any one of them interested you, you could double click on it. That's it. You know, we're not gonna be, like, the experts on every single thing.[01:43:04] swyx: It's impossible, right? And already, like, the episode that I cut together for that was like three and a half hours, so people were complaining about that. And then the last thing: Alessio and I don't do that much research for each episode, but, you know, we research the guests.[01:43:21] swyx: But just being involved in the day to day conversations in our day jobs prepares you for that. And I think that is important. No prep needed because, you know, we're in it. We're in the arena, as they say. Yeah. Anything else?[01:43:35] Guest 4: Like, there's so much excitement. There's so many things to cover. And, like, maybe culturally, yeah, that would be a thing I was always wondering about. And that might not really be Latent Space's thing, but what are you guys doing to cover the cultural aspect of what's happening here? It's probably a[01:44:00] Guest 4: separate thing, but an equally important thing, to, like, document all the conversations that are happening around here. And all the other build spaces, like, we see glimpses of that on Twitter, but I think capturing more of that would be super cool.[01:44:17] swyx: Yeah, I feel like that's something that someone else should do.[01:44:20] swyx: We try to be more technical. Because that, people can use it at work, they can justify that for productivity. We might try to dabble in some of that. So I'm pretty connected with, like, the main areas, for those listening who are interested in, like, SF AI: it's like Shack 15, AGI House SF, AGI House Hillsborough, and then us, and maybe HF0, and then maybe a little bit of Founders Inc.[01:44:48] swyx: And those are it.
There's more community-oriented spaces, like The Commons, but they're not sort of AI centric. And so we can do a little bit of reporting around that, but it's gonna be like This American Life, you know, like, tell me your life story, like, your whole story. I'm not, like, the best at that, and then also, like, there's a lot of very, very brutal cutting for that that is hard to do, but we can dabble, or we can do it on the[01:45:13] Guest 4: side.[01:45:15] Guest 4: Oh, the other thing I'm very interested in, I'm a UX designer by trade, and anytime you guys touch on AI and UX and UI, I'm all ears, and I would love to, again, it's probably not the technical side of Latent Space, but I think there needs to be a hundred times more resources out there than what's currently available.[01:45:34] swyx: Yeah, we held the first AI UX meetup ever, in SF. That was really fun. The meetup's on YouTube, if you want to see it, and it's in the Latent Space archives of the newsletter. I don't think we ever published a podcast version of it.[01:45:48] swyx: So you have to just subscribe to the newsletter and then check the YouTube for that stuff. But yeah, UX is a topic of ours that we like to cover. It's just very hard to cover as an audio medium. Yeah. 'Cause you can't see it. And also I think, like, it's gonna be mostly owned by, like, Notion and Vercel and Retool; we've interviewed Retool, we're going to interview Vercel, and we've interviewed Notion.[01:46:12] swyx: So who else? Who do you wanna listen to on AI UX? Right. Like, there's individual people, like we had Amelia Wattenberger present at AI Engineer Summit, you can see that on YouTube. Like, I know a lot of the thinkers on AI UX, and I think I know what they say, like, I haven't seen anything super innovative.[01:46:31] swyx: Everyone hates chatbots, everyone wants to innovate things. I haven't seen any new ideas since we did the AI UX meetup one year ago. Tell me I'm wrong.[01:46:42] Guest 4: Well, that sounds really disappointing. I haven't seen anything on Twitter that I thought would be easier to push, because we just wrap LLMs. But on Twitter there doesn't seem to be that much going on, to your point.[01:46:59] Guest 4: But there needs to be more people from the design space, from the product space, like UX researchers, coming in and figuring out how we can take LLMs and apply them to real problems. I haven't seen a whole lot of that. In Sydney, there's not a whole lot of that. I'm hoping to maybe be a part of the community here and try to grow that side of[01:47:21] swyx: the things.[01:47:22] swyx: Well, look, you're here now. You're interested in AI UX. Run the next AI UX meetup. I can set you up with the venue, the people. You need to find the speakers. I'm not going to find the speakers for you. But if you want to set that up, go for it.[01:47:37] Guest 4: So, I actually copied your AI UX format, and I held a talk in Sydney, in a very light fashion, like 20-30 people showed up.[01:47:49] Guest 4: We had some cool demos, it was like a baby, like a small version of your AI UX conference, but yeah, I'd love to participate. I mean,[01:47:59] swyx: this is SF, 300 people will show up, you just gotta get some cool demos. I can seed you with some people, let's make it happen. Let's make it happen! Alright, well, it's nice to meet you, and I'll get your details.[01:48:09] AI Charlie: That's all, folks.
If you've enjoyed or benefited from our work on Latent Space over this past year, we'd really love to hear from you, and really appreciate it if you'd tell a friend. The only way a podcast consistently grows is through your word of mouth, and that helps us book incredible guests and attend great events in our second year.[01:48:29] AI Charlie: Have a lovely weekend! Get full access to Latent.Space at www.latent.space/subscribe
-
Open Source AI is AI we can Trust — with Soumith Chintala of Meta AI
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-03-06 18:40
Speaker CFPs and Sponsor Guides are now available for AIE World’s Fair — join us on June 25-27 for the biggest AI Engineer conference of 2024!Soumith Chintala needs no introduction in the ML world — his insights are incredibly accessible across Twitter, LinkedIn, podcasts, and conference talks (in this pod we’ll assume you’ll have caught up on the History of PyTorch pod from last year and cover different topics). He’s well known as the creator of PyTorch, but he's more broadly the Engineering Lead on AI Infra, PyTorch, and Generative AI at Meta.Soumith was one of the earliest supporters of Latent Space (and more recently AI News), and we were overjoyed to catch up with him on his latest SF visit for a braindump of the latest AI topics, reactions to some of our past guests, and why Open Source AI is personally so important to him.Life in the GPU-Rich LaneBack in January, Zuck went on Instagram to announce their GPU wealth: by the end of 2024, Meta will have 350k H100s. By adding all their GPU clusters, you'd get to 600k H100-equivalents of compute. At FP16 precision, that's ~1,200,000 PFLOPS. If we used George Hotz's (previous guest!) "Person of Compute" measure, Meta now has 60k humans of compute in their clusters. Occasionally we get glimpses into the GPU-rich life; on a recent ThursdAI chat, swyx prompted PaLM tech lead Yi Tay to write down what he missed most from Google, and he commented that UL2 20B was trained by accidentally leaving the training job running for a month, because hardware failures are so rare in Google.Meta AI’s Epic LLM RunBefore Llama broke the internet, Meta released an open source LLM in May 2022, OPT-175B, which was notable for how “open” it was - right down to the logbook! They trained it on 992 80GB NVIDIA A100 GPUs, and Soumith agrees that, with hindsight, it was likely under-trained for its parameter size.In Feb 2023 (pre Latent Space pod), Llama was released, with a 7B version trained on 1T tokens alongside 65B and 33B versions trained on 1.4T tokens. The Llama authors included Guillaume Lample and Timothée Lacroix, who went on to start Mistral.July 2023 was Llama2 time (which we covered!): 3 model sizes, 7B, 13B, and 70B, all trained on 2T tokens. The Llama 2 family accounted for a grand total of 3,311,616 GPU hours for all pre-training work. CodeLlama followed shortly after, a fine-tune of Llama2 specifically focused on code generation use cases. The family had models in the 7B, 13B, 34B, and 70B sizes, all trained with 500B extra tokens of code and code-related data, except for 70B which is trained on 1T.All of this on top of other open sourced models like Segment Anything (one of our early hits!), Detectron, Detectron 2, DensePose, and Seamless, and in one year, Meta transformed from a company people made fun of for its “metaverse” investments to one of the key players in the AI landscape and its stock has almost tripled since (about $830B in market value created in the past year).Why Open Source AIThe obvious question is why Meta would spend hundreds of millions on its AI efforts and then release them for free. Zuck has addressed this in public statements:But for Soumith, the motivation is even more personal:“I'm irrationally interested in open source. I think open source has that fundamental way to distribute opportunity in a way that is very powerful. Like, I grew up in India… And knowledge was very centralized, but I saw that evolution of knowledge slowly getting decentralized. And that ended up helping me learn quicker and faster for like zero dollars.
And I think that was a strong reason why I ended up where I am. So like that, like the open source side of things, I always push regardless of like what I get paid for, like I think I would do that as a passion project on the side……I think at a fundamental level, the most beneficial value of open source is that you make the distribution to be very wide. It's just available with no friction and people can do transformative things in a way that's very accessible. Maybe it's open source, but it has a commercial license and I'm a student in India. I don't care about the license. I just don't even understand the license. But like the fact that I can use it and do something with it is very transformative to me……Like, okay, I again always go back to like I'm a student in India with no money. What is my accessibility to any of these closed source models? At some scale I have to pay money. That makes it a non-starter and stuff. And there's also the control issue: I strongly believe if you want human aligned AI, you want all humans to give feedback. And you want all humans to have access to that technology in the first place. And I actually have seen, living in New York, whenever I come to Silicon Valley, I see a different cultural bubble.We like the way Soumith put it last year: Closed AI “rate-limits against people's imaginations and needs”!What It Takes For Open Source AI to WinHowever Soumith doesn’t think Open Source will simply win by popular demand. There is a tremendous coordination problem with the decentralized nature of the open source AI development right now: nobody is collecting the valuable human feedback in the way that OpenAI or Midjourney are doing.“Open source in general always has a coordination problem. If there's a vertically integrated provider with more resources, they will just be better coordinated than open source. And so now open source has to figure out how to have coordinated benefits. And the reason you want coordinated benefits is because these models are getting better based on human feedback. And if you see with open source models, like if you go to the /r/localllama subreddit, like there's so many variations of models that are being produced from, say, Nous research. I mean, like there's like so many variations built by so many people. And one common theme is they're all using these fine-tuning or human preferences datasets that are very limited and they're not sufficiently diverse. And you look at the other side, say front-ends like Oobabooga or like Hugging Chat or Ollama, they don't really have feedback buttons. All the people using all these front-ends, they probably want to give feedback, but there's no way for them to give feedback… So we're just losing all of this feedback. Maybe open source models are being as used as GPT is at this point in like all kinds of, in a very fragmented way, like in aggregate all the open source models together are probably being used as much as GPT is, maybe close to that. But the amount of feedback that is driving back into the open source ecosystem is like negligible, maybe less than 1% of like the usage. 
So I think like some, like the blueprint here I think is you'd want someone to create a sinkhole for the feedback… I think if we do that, if that actually happens, I think that probably has a real chance of the open source models having a runaway effect against OpenAI, I think like there's a clear chance we can take at truly winning open source.”If you’re working on solving open source coordination, please get in touch!Show Notes* Soumith Chintala Twitter* History of PyTorch episode on Gradient Podcast* The Llama Ecosystem* Apple's MLX* Neural ODEs (Ordinary Differential Equations)* AlphaGo* LMSys arena* Dan Pink's "Drive"* Robotics projects:* Dobb-E* OK Robot* Yann LeCun* Yangqing Jia of Lepton AI* Ed Catmull* George Hotz on Latent Space* Chris Lattner on Latent Space* Guillaume Lample* Yannic Kilcher of OpenAssistant* LMSys* Alex Atallah of OpenRouter* Carlo Sferrazza's 3D tactile research* Alex Wiltschko of Osmo* Tangent by Alex Wiltschko* Lerrel Pinto - RoboticsTimestamps* [00:00:00] Introductions* [00:00:51] Extrinsic vs Intrinsic Success* [00:02:40] Importance of Open Source and Its Impact* [00:03:46] PyTorch vs TinyGrad* [00:08:33] Why PyTorch is the Switzerland of frameworks* [00:10:27] Modular's Mojo + PyTorch?* [00:13:32] PyTorch vs Apple's MLX* [00:16:27] FAIR / PyTorch Alumni* [00:18:50] How can AI inference providers differentiate?* [00:21:41] How to build good benchmarks and learnings from AnyScale's* [00:25:28] Most interesting unexplored ideas* [00:28:18] What people get wrong about synthetic data* [00:35:57] Meta AI's evolution* [00:38:42] How do you allocate 600,000 GPUs?* [00:42:05] Even the GPU Rich are GPU Poor* [00:47:31] Meta's MTIA silicon* [00:50:09] Why we need open source* [00:59:00] Open source's coordination problem for feedback gathering* [01:08:59] Beyond text generation* [01:15:37] Osmo and the Future of Smell Recognition TechnologyTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:15]: Hey, and today we have in the studio Soumith Chintala, welcome.Soumith [00:00:17]: Thanks for having me.Swyx [00:00:18]: On one of your rare visits from New York where you live. You got your start in computer vision at NYU with Yann LeCun. That was a very fortuitous start. I was actually listening to your interview on the Gradient podcast. So if people want to know more about the history of Soumith, history of PyTorch, they can go to that podcast. We won't spend that much time there, but I just was marveling at your luck, or I don't know if it's your luck or your drive to find AI early and then find the right quality mentor because I guess Yann really sort of introduced you to that world.Soumith [00:00:51]: Yeah, I think you're talking about extrinsic success, right? A lot of people just have drive to do things that they think is fun, and a lot of those things might or might not be extrinsically perceived as good and successful. I think I just happened to like something that is now one of the coolest things in the world or whatever. The first thing I tried to become was a 3D VFX artist, and I was really interested in doing that, but I turned out to be very bad at it. So I ended up not doing that further. But even if I was good at that, whatever, and I ended up going down that path, I probably would have been equally happy.
It's just like maybe the perception of, oh, is this person successful or not might be different. I think like after a baseline, your happiness is probably more correlated with your intrinsic stuff.Swyx [00:01:44]: Yes. I think Dan Pink has this book, Drive, that I often refer to about the power of intrinsic motivation versus extrinsic and how long extrinsic lasts. It's not very long at all. But anyway, now you are an investor in Runway, so in a way you're working on VFX. Yes.Soumith [00:02:01]: I mean, in a very convoluted way.Swyx [00:02:03]: It reminds me of Ed Catmull. I don't know if you guys know, but he actually tried to become an animator in his early years and failed or didn't get accepted by Disney and then went and created Pixar and then got bought by Disney and created Toy Story. So you joined Facebook in 2014 and eventually became a creator and maintainer of PyTorch. And there's this long story there you can refer to on the Gradient. I think maybe people don't know that you were also involved in more sort of hardware and cluster decisions. And we can dive into more details there because we're all about hardware this month. Yeah. And then finally, I don't know what else, like what else should people know about you on a personal side or professional side?Soumith [00:02:40]: I think open source is definitely a big passion of mine and probably forms a little bit of my identity at this point. I'm irrationally interested in open source. I think open source has that fundamental way to distribute opportunity in a way that is very powerful. Like, I grew up in India. I didn't have internet for a while. In college, actually, I didn't have internet except for GPRS or whatever. And knowledge was very centralized, but I saw that evolution of knowledge slowly getting decentralized. And that ended up helping me learn quicker and faster for zero dollars. And I think that was a strong reason why I ended up where I am. So the open source side of things, I always push regardless of what I get paid for, like I think I would do that as a passion project on the side.Swyx [00:03:35]: Yeah, that's wonderful. Well, we'll talk about the challenges as well that open source has, open models versus closed models. Maybe you want to touch a little bit on PyTorch before we move on to the sort of Meta AI in general.PyTorch vs Tinygrad tradeoffsAlessio [00:03:46]: Yeah, we kind of touched on PyTorch in a lot of episodes. So we had George Hotz from TinyGrad. He called PyTorch a CISC and TinyGrad a RISC. I would love to get your thoughts on PyTorch design direction as far as, I know you talk a lot about kind of having a happy path to start with and then making complexity hidden away but then available to the end user. One of the things that George mentioned is I think you have like 250 primitive operators in PyTorch, I think TinyGrad is four. So how do you think about some of the learnings that maybe he's going to run into that you already had in the past seven, eight years almost of running PyTorch?Soumith [00:04:24]: Yeah, I think there's different models here, but I think it's two different models that people generally start with. Either they go like, I have a grand vision and I'm going to build a giant system that achieves this grand vision and maybe one is super feature complete or whatever. Or other people say they will get incrementally ambitious, right?
And they say, oh, we'll start with something simple and then we'll slowly layer out complexity in a way that optimally applies Huffman coding or whatever. Like where the density of users are and what they're using, I would want to keep in the easy, happy path, and where the more niche advanced use cases are, I'll still want people to try them, but they need to take additional frictional steps. George, I think, just like we started with PyTorch, George started with the incrementally ambitious thing. I remember TinyGrad used to be, like, they would be limited to a thousand lines of code, and I think now it's at 5,000. So I think there is no real magic to why PyTorch has this kind of complexity. I think it's probably partly necessitated and partly because we built with the technology available under us at that time. PyTorch is like 190,000 lines of code or something at this point. I think if you had to rewrite it, we would probably think about ways to rewrite it in a vastly simplified way, for sure. But a lot of that complexity comes from the fact that, in a very simple, explainable way, you have memory hierarchies. The CPU has three levels of caches, and then you have DRAM and SSD, and then you have network. Similarly, the GPU has several levels of memory, and then you have different levels of network hierarchies, NVLink plus InfiniBand or RoCE or something like that, right? And the way the flops are available on your hardware, they are available in a certain way, and your computation is in a certain way, and you have to retrofit your computation onto both the memory hierarchy and the flops available. When you're doing this, it is actually a fairly hard mathematical problem to find the optimal setup. And what is optimal depends on the input variables themselves. So like, okay, what is the shape of your input tensors and what is the operation you're trying to do and various things like that. Finding that optimal configuration and writing it down in code is not the same for every input configuration you have. For example, just as the shape of the tensors change, let's say you have three input tensors going into a sparse tensor product or something like that. The shape of each of these input tensors will vastly change how you optimally place this operation onto the hardware in a way that will get you maximal throughput. So a lot of our complexity comes from writing out hundreds of configurations for each single PyTorch operator and templatizing these things and symbolically generating the final CUDA code or CPU code. There's no way to avoid it, because mathematically we haven't found symbolic ways to do this that also keep compile time near zero. You can write a very simple framework, but then you also should be willing to eat the long compile time of searching for that optimal performance at runtime. That's the trade-off. I don't think George's vision is achievable unless we have great breakthroughs; he should be thinking about a narrower problem, such as: I'm only going to make this work for self-driving car convnets, or I'm only going to make this work for LLM transformers of the Llama style. If you start narrowing the problem down, you can make a vastly simpler framework.
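To make the shape-dependence concrete, here's a toy sketch of why one operator fans out into many configurations. The dispatch thresholds and function names are all hypothetical; PyTorch's real dispatcher is codegen-driven and far more elaborate, so read this as an illustration of the idea, not its implementation:

```python
import torch

# Toy illustration of shape-dependent kernel selection. All thresholds and
# names below are made up; real frameworks tabulate hundreds of these per
# operator and generate the specialized kernels.

def matmul_small(a, b):
    # Tiny problem: kernel launch overhead dominates, a naive path is fine.
    return a @ b

def matmul_tall_skinny(a, b):
    # Memory-bandwidth-bound shape: a real kernel would use different tiling.
    return a @ b  # stand-in for a shape-specialized kernel

def matmul_large_square(a, b):
    # Compute-bound shape: a real kernel would favor large tiles/tensor cores.
    return a @ b  # stand-in for a shape-specialized kernel

def dispatch_matmul(a, b):
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    if m * k * n < 2**14:
        return matmul_small(a, b)
    if m >= 16 * n or n >= 16 * m:
        return matmul_tall_skinny(a, b)
    return matmul_large_square(a, b)

out = dispatch_matmul(torch.randn(4096, 512), torch.randn(512, 4096))
print(out.shape)  # torch.Size([4096, 4096])
```

Multiply this pattern by a thousand operators, several dtypes, and several hardware targets, and the line count Soumith cites stops looking mysterious.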
But if you don't, if you need the generality to power all of the AI research that is happening and keep zero compile time and all these other factors, I think it's not easy to avoid the complexity.Pytorch vs MojoAlessio [00:08:33]: That's interesting. And we kind of touched on this with Chris Lattner when he was on the podcast. If you think about frameworks, they have the model target. They have the hardware target. They have different things to think about. He mentioned when he was at Google, TensorFlow was trying to be optimized to make TPUs go brr, you know, and go as fast. I think George is trying to make especially the AMD stack be better than ROCm. How come PyTorch has been such a Switzerland versus just making Meta hardware go brr?Soumith [00:09:00]: First, Meta is not in the business of selling hardware. Meta is not in the business of cloud compute. The way Meta thinks about funding PyTorch is we're funding it because it's net good for Meta to fund PyTorch, because PyTorch has become a standard and a big open source project. And generally it gives us a timeline edge. It gives us leverage and all that within our own work. So why is PyTorch more of a Switzerland rather than being opinionated? I think the way we think about it is not in terms of Switzerland or not. The way we articulate it to all the hardware vendors and software vendors who come to us wanting to build a backend in core for PyTorch and ship it by default is: we only look at the user side of things. If users are using a particular piece of hardware, then we want to support it. We very much don't want to kingmake the hardware side of things. So as the MacBooks have GPUs and as that stuff started getting increasingly interesting, we pushed Apple to put some engineers on it and work on the MPS support, and we spend significant time from Meta-funded engineers on that as well, because a lot of people are using the Apple GPUs and there's demand. So we kind of mostly look at it from the demand side. We never look at it from, oh, which hardware should we start taking opinions on.Swyx [00:10:27]: Is there a future in which, because Mojo or Modular Mojo is kind of a superset of Python, is there a future in which PyTorch might use Mojo features optionally?Soumith [00:10:36]: I think it depends on how well integrated it is into the Python ecosystem. So if Mojo is like a pip install and it's readily available and users feel like they can use Mojo so smoothly within their workflows in a way that just is low friction, we would definitely look into that. Like in the same way PyTorch now depends on Triton, OpenAI Triton, and we never had a conversation that was like, huh, that's a dependency. Should we just build a Triton of our own or should we use Triton? Those conversations don't really come up for us. The conversations are more, well, does Triton have 10,000 dependencies and is it hard to install? We almost don't look at these things from a strategic leverage point of view. We look at these things from a user experience point of view, like is it easy to install? Is it smoothly integrated and does it give enough benefits for us to start depending on it? If so, yeah, we should consider it. That's how we think about it.Swyx [00:11:37]: You're inclusive by default as long as it meets the minimum bar of, yeah, but like maybe I phrased it wrongly.
Maybe it's more like, what problems would you look to solve that you have right now?Soumith [00:11:48]: I think it depends on what problems Mojo will be useful at.Swyx [00:11:52]: Mainly a performance pitch, some amount of cross compiling pitch.Soumith [00:11:56]: Yeah, I think the performance pitch for Mojo was like, we're going to be performant even if you have a lot of custom stuff. You're going to write arbitrary custom things and we will be performant. And that value proposition is not clear to us from the PyTorch side to consider it for PyTorch. So PyTorch, it's actually not 250 operators, it's like a thousand operators. PyTorch exposes about a thousand operators and people kind of write their ideas in the thousand operators of PyTorch. Mojo is like, well, maybe it's okay to completely sidestep those thousand operators of PyTorch and just write it in a more natural form. Just write raw Python, write for loops or whatever, right? So from the consideration of how do we intersect PyTorch with Mojo, I can see one use case where you have custom stuff for some parts of your program, but mostly it's PyTorch. And so we can probably figure out how to make it easier for, say, torch.compile to smoothly also consume Mojo subgraphs, and, you know, the interoperability being actually usable, that I think is valuable. But Mojo as a fundamental front end would be replacing PyTorch, not augmenting PyTorch. So in that sense, I don't see a synergy in more deeply integrating Mojo.Pytorch vs MLXSwyx [00:13:21]: So call out to Mojo whenever they have written something in Mojo and there's some performance related thing going on. And then since you mentioned Apple, what should people think of PyTorch versus MLX?Soumith [00:13:32]: I mean, MLX is early and I know the folks well. Awni used to work at FAIR and I used to chat with him all the time. He used to be based out of New York as well. The way I think about MLX is that MLX is specialized for Apple right now. It has a happy path because it's defined its product in a narrow way. At some point MLX either says we will only be supporting Apple and we will just focus on enabling that, you know, it's a framework if you use your MacBook, but once you go server side or whatever, that's not my problem and I don't care. Or MLX enters the server side set of things as well. One of these two things will happen, right? If the first thing happens, MLX's overall addressable market will be small, but it'll probably do well within that addressable market. If it enters the second phase, they're going to run into all the same complexities that we have to deal with. They will not have any magic wand and they will have more complex work to do. They probably wouldn't be able to move as fast.Swyx [00:14:44]: Like having to deal with distributed compute?Soumith [00:14:48]: Distributed, NVIDIA and AMD GPUs, like just having a generalization of the concept of a backend, how they treat compilation, plus overheads. Right now they have deeply assumed the whole MPS graph thing. So they need to think about all these additional things if they end up expanding onto the server side, and they'll probably build something like PyTorch as well, right? Like eventually that's where it will land. And I think there they will kind of fail on the lack of differentiation. Like it wouldn't be obvious to people why they would want to use it.Swyx [00:15:24]: I mean, there are some cloud companies offering M1 and M2 chips on servers.
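For reference, the demand-driven Apple GPU support Soumith mentions ships in PyTorch today as the `mps` backend. A minimal device-agnostic sketch using standard PyTorch APIs (available in recent releases):

```python
import torch

# Pick the best available accelerator: CUDA on NVIDIA machines, MPS on
# Apple Silicon Macs, plain CPU otherwise.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(128, 64).to(device)
x = torch.randn(32, 128, device=device)
y = model(x)  # runs on the Apple GPU when device is "mps"
print(device, y.shape)
```

Writing code against `device` rather than hard-coding a backend is exactly what lets PyTorch stay the Switzerland of frameworks from the user's side.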
I feel like it might be interesting for Apple to pursue that market, but it's not their core strength.Soumith [00:15:33]: Yeah. If Apple can figure out their interconnect story, maybe, like then it can become a thing.Swyx [00:15:40]: Honestly, that's more interesting than the cars. Yes.Soumith [00:15:43]: I think the moat that NVIDIA has right now, I feel, is that they have the interconnect that no one else has. Like AMD GPUs are pretty good. I'm sure there's various silicon that is not bad at all, but the interconnect, like NVLink, is uniquely awesome. I'm sure the other hardware providers are working on it, but-Swyx [00:16:04]: I feel like when you say it's uniquely awesome, you have some appreciation of it that the rest of us don't. I mean, the rest of us just like, you know, we hear marketing lines, but what do you mean when you say NVIDIA is very good at networking? Obviously they made the acquisition maybe like 15 years ago.Soumith [00:16:15]: Just the bandwidth it offers and the latency it offers. I mean, TPUs also have a good interconnect, but you can't buy them. So you have to go to Google to use it.PyTorch MafiaAlessio [00:16:27]: Who are some of the other FAIR PyTorch alumni that are building cool companies? I know you have Fireworks AI, Lightning AI, Lepton, and Yangqing, you knew since college when he was building Caffe?Soumith [00:16:40]: Yeah, so Yangqing and I used to be framework rivals. I mean, we were all a very small close-knit community back then: Caffe, Torch, Theano, Chainer, Keras, various frameworks. I mean, it used to be more like 20 frameworks. I can't remember all the names. CCV by Liu Liu, who is also based out of SF. And one of the ways it was interesting is you went into the framework guts and saw if someone wrote their own convolution kernel or they were just copying someone else's. There were four or five convolution kernels that were unique and interesting. There was one from this guy out of Russia, I forgot the name, but I remembered who was awesome enough to have written their own kernel. And at some point there, I built out these benchmarks called convnet-benchmarks. They're just benchmarking all the convolution kernels that were available at that time. It hilariously became big enough that at that time AI was getting important, but not important enough that industrial strength players came in to do these kinds of benchmarking and standardization. Like we have MLPerf today. So a lot of the startups were using convnet-benchmarks in their pitch decks as like, oh, you know, on convnet-benchmarks, this is how we fare, so you should fund us. I remember Nervana actually was at the top of the pack because Scott Gray wrote amazingly fast convolution kernels at that time. Very interesting, but separate times. But to answer your question, Alessio, I think mainly Lepton and Fireworks are the two most obvious ones, but I'm sure the fingerprints are a lot wider. They're just people who worked within the PyTorch/Caffe2 cohort of things and now end up at various other places.Swyx [00:18:50]: I think as both an investor and a person looking to build on top of their services, it's an uncomfortable, like, I don't know what I don't know pitch. Because I've met Yangqing and I've met Lin Qiao. Yeah, I've met these folks and they're like, you know, we are deep in the PyTorch ecosystem and we serve billions of inferences a day or whatever at Facebook and now we can do it for you. And I'm like, okay, that's great.
Like, what should I be wary of or cautious of when these things happen? Because I'm like, obviously this experience is extremely powerful and valuable. I just don't know what I don't know. Like, what should people know about these sort of new inference-as-a-service companies?Soumith [00:19:32]: I think at that point you would be investing in them for their expertise of one kind. So if they've been at a large company, but they've been doing amazing work, you would be thinking about it as: what these people bring to the table is that they're really good at, like, GPU programming or understanding the complexity of serving models once it hits a certain scale. You know, various expertise from the infra and AI and GPUs point of view. What you would obviously want to figure out is whether their understanding of the external markets is clear, whether they know and understand how to think about running a business, understanding how to be disciplined about making money, or, you know, various things like that.Swyx [00:20:23]: Maybe I'll put it like, actually I will de-emphasize the investing bit and just more as a potential customer. Oh, okay. Like, it's more okay, you know, you have PyTorch gods, of course. Like, what else should I know?Soumith [00:20:37]: I mean, I would not care about who's building something. If I'm trying to be a customer, I would care about whether...Swyx [00:20:44]: Benchmarks.Soumith [00:20:44]: Yeah, I use it, and it's usability and reliability and speed, right?Swyx [00:20:51]: Quality as well.Soumith [00:20:51]: Yeah, if someone from some random unknown place came to me and said, our stuff is great, and I have the bandwidth, I probably will give it a shot. And if it turns out to be great, I'll just use it.Benchmark dramaSwyx [00:21:07]: Okay, great. And then maybe one more thing about benchmarks, since we already brought it up and you brought up convnet-benchmarks. There was some recent drama around Anyscale. Anyscale released their own benchmarks and obviously they look great on their own benchmarks, but maybe didn't give the other... I feel there are two lines of criticism. One, which is they didn't test apples to apples on the kind of endpoints that the other providers, that they are competitors with, on their benchmarks, and that is a due diligence baseline. And then the second would be more just optimizing for the right thing. You had some commentary on it. I'll just kind of let you riff.Soumith [00:21:41]: Yeah, I mean, in summary, basically my criticism of that was Anyscale built these benchmarks for end users to just understand what they should pick, right? And that's a very good thing to do. I think what they didn't do a good job of is give that end user a full understanding of what they should pick. Like they just gave them a very narrow slice of understanding. I think they just gave them latency numbers and that's not sufficient, right? You need to understand your total cost of ownership at some reasonable scale. Not, oh, one API call is one cent, but a thousand API calls are 10 cents. Like, people can misprice to cheat on those benchmarks. So you want to understand, okay, how much is it going to cost me if I actually subscribe to you and do like a million API calls a month or something? And then you want to understand the latency and reliability, not just from one call you made, but an aggregate of calls you've made over various times of the day and times of the week.
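A minimal sketch of the aggregate measurement Soumith is asking for. The API call below is a simulated stand-in for a hypothetical provider endpoint, and the pricing parameter is made up; the point is reporting percentiles over many varied calls plus cost at volume, not a single latency number:

```python
import random
import statistics
import time

def call_llm_api(prompt: str) -> str:
    # Simulated stand-in for a hypothetical provider call; a real benchmark
    # would hit the provider's actual endpoint here.
    time.sleep(random.uniform(0.05, 0.25))
    return "response"

def benchmark(prompts, repeats=20, price_per_1k_calls=0.10):
    latencies = []
    for _ in range(repeats):
        for p in prompts:  # vary the workload instead of one cacheable string
            t0 = time.perf_counter()
            call_llm_api(p)
            latencies.append(time.perf_counter() - t0)
    latencies.sort()
    n = len(latencies)
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[min(n - 1, int(0.95 * n))],
        # Total cost of ownership at volume, not the single-call sticker price:
        "est_cost_1M_calls_usd": 1_000_000 / 1_000 * price_per_1k_calls,
    }

# Re-run at different times of day and week, then aggregate before comparing.
print(benchmark(["summarize: ...", "translate: ...", "write a regex for ..."]))
```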
And the nature of the workloads: is it just some generic single paragraph that you're sending that is cacheable? Or is it testing a real world workload? I think that kind of rigor in presenting that benchmark wasn't there. It was a much more narrow sliver of what should have been a good benchmark. That was my main criticism. And I'm pretty sure if, before they released it, they showed it to their other stakeholders who would be caring about this benchmark because they are present in it, they would have easily just pointed out these gaps. And I think they didn't do that and they just released it. So I think those were the two main criticisms. I think they were fair and Robert took it well.Swyx [00:23:40]: And he took it very well. And we'll have him on at some point and we'll discuss it. But I think it's important, the market maturing enough that people start caring and competing on these kinds of things means that we need to establish what best practice is, because otherwise everyone's going to play dirty.Soumith [00:23:55]: Yeah, absolutely. My view of the LLM inference market in general is that it's the laundromat model. The margins are going to drive down towards the bare minimum. It's going to be all kinds of arbitrage between how much you can get the hardware for and then how much you sell the API and how much latency your customers are willing to let go. You need to figure out how to squeeze your margins. Like, what is your unique thing here? I think Together and Fireworks and all these people are trying to build some faster CUDA kernels and faster, you know, hardware kernels in general. But those moats only last for a month or two. These ideas quickly propagate.Swyx [00:24:38]: Even if they're not published?Soumith [00:24:39]: Even if they're not published, the idea space is small. So even if they're not published, the discovery rate is going to be pretty high. It's not like we're talking about a combinatorial thing that is really large. You're talking about Llama style LLM models. And we're going to beat those to death on a few different hardware SKUs, right? It's not like we have a huge diversity of hardware that you're going to aim to run it on. Now when you have such a narrow problem and you have a lot of people working on it, the rate at which these ideas are going to get figured out is going to be pretty rapid.Swyx [00:25:15]: Is it a standard bag of tricks? Like the standard one that I know of is, you know, fusing operators and-Soumith [00:25:22]: Yeah, it's the standard bag of tricks on figuring out how to improve your memory bandwidth and all that, yeah.Alessio [00:25:28]: Any ideas, instead, of things that are not being beaten to death that people should be paying more attention to?Novel PyTorch ApplicationsSwyx [00:25:34]: One thing I was like, you know, you have a thousand operators, right? Like what's the most interesting usage of PyTorch that you're seeing maybe outside of this little bubble?Soumith [00:25:41]: So PyTorch, it's very interesting and scary at the same time, but basically it's used in a lot of exotic ways, like from the ML angle, what kind of models are being built? And you get all the way from state-space models and all of these things to stuff like nth-order differentiable models, like neural ODEs and stuff like that. I think there's one set of interestingness factor from the ML side of things. And then there's the other set of interesting factor from the applications point of view.
It's used in Mars rover simulations, to drug discovery, to Tesla cars. And there's a huge diversity of applications in which it is used. So in terms of the most interesting application side of things, I think I'm scared at how many interesting things that are also very critical and really important it is used in. I think the scariest was when I went to visit CERN at some point and they said they were using PyTorch and they were using GANs at the same time for particle physics research. And I was scared more about the fact that they were using GANs than that they were using PyTorch, because at that time I was a researcher focusing on GANs. But the diversity is probably the most interesting. How many different things it is being used in. I think that's the most interesting to me from the applications perspective. From the models perspective, I think I've seen a lot of them. The really interesting ones to me are where we're starting to combine search and symbolic stuff with differentiable models, like the whole AlphaGo style of models is one example. And then I think we're attempting to do it for LLMs as well, with various reward models and search. I mean, I don't think PyTorch is being used in this, but the whole AlphaGeometry thing was interesting because, again, it's an example of combining the symbolic models with the gradient based ones. But there is stuff like AlphaGeometry where PyTorch is used, especially when you intersect biology and chemistry with ML. In those areas, you want stronger guarantees on the output. So yeah, maybe from the ML side, those things to me are very interesting right now.Swyx [00:28:03]: Yeah. People are very excited about the AlphaGeometry thing. And it's kind of like, for me, it's theoretical. It's great. You can solve some Olympiad questions. I'm not sure how to make that bridge over into the real world applications, but I'm sure people smarter than me will figure it out.Synthetic Data vs Symbolic ModelsSoumith [00:28:18]: Let me give you an example of it. You know how the whole thing about synthetic data being the next rage in LLMs is a thing?Swyx [00:28:27]: Already is a rage.Soumith [00:28:28]: Which I think is fairly misplaced in how people perceive it. People think synthetic data is some kind of magic wand that you wave and it's going to be amazing. Synthetic data is useful in neural networks right now because we as humans have figured out a bunch of symbolic models of the world, or made up certain symbolic models because of human innate biases. So we've figured out how to ground particle physics in a 30 parameter model. And it's just very hard to compute, as in it takes a lot of flops to compute, but it only has 30 parameters or so. I mean, I'm not a physics expert, but it's a very low rank model. We built mathematics as a field that basically is very low rank. Language, a deep understanding of language, like the whole syntactic parse trees and just understanding how language can be broken down into a formal symbolism, is something that we figured out. So we basically as humans have accumulated all this knowledge on these subjects: either synthetic, we created those subjects in our heads, or we grounded some real world phenomenon into a set of symbols. But we haven't figured out how to teach neural networks symbolic world models directly. The only way we have to teach them is generating a bunch of inputs and outputs and gradient descending over them.
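A toy version of the loop he's describing, assuming a stand-in "symbolic model" (a cheap closed-form function here, playing the role of the low-parameter physics/math models): sample inputs, label them with the symbolic model, then gradient-descend an over-parameterized net on the pairs:

```python
import torch
import torch.nn as nn

# Stand-in "symbolic model": a closed-form function with very few parameters,
# analogous to the low-rank symbolic models described above.
def symbolic_model(x):
    return torch.sin(2.0 * x) + 0.5 * x

# Generate synthetic input/output pairs from the symbolic model.
x = torch.rand(4096, 1) * 6.0 - 3.0
y = symbolic_model(x)

# An over-parameterized neural net absorbs the same knowledge via gradient descent.
net = nn.Sequential(
    nn.Linear(1, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(f"final mse: {loss.item():.5f}")
```

The synthetic data only works here because the symbolic model exists to generate correct labels, which is exactly Soumith's point below.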
So in areas where we have the symbolic models and we need to teach all the knowledge we have that is better encoded in the symbolic models, what we're doing is we're generating a bunch of synthetic data, a bunch of input output pairs, and then giving that to the neural network and asking it to learn the same thing that we already have a better low rank model of, in gradient descent, in a much more over-parameterized way. Outside of this, where we don't have good symbolic models, synthetic data obviously doesn't make any sense. So synthetic data is not a magic wand that will work in every case. It works where we as humans already have good symbolic models. We need to impart that knowledge to neural networks, and we figured out that synthetic data is a vehicle to impart this knowledge with. But people, maybe because they don't know enough about synthetic data as a notion, hear that the next wave of the data revolution is synthetic data, and they think it's some kind of magic where we just create a bunch of random data somehow. They don't think about how, and then they think that's just a revolution. And I think that's maybe a gap in understanding most people have in this hype cycle.Swyx [00:31:23]: Yeah, well, it's a relatively new concept, so. Oh, there's two more that I'll put in front of you and then you can see what you respond. One is, you know, I have this joke that it's only synthetic data if it's from the Mistral region of France, otherwise it's just sparkling distillation, which is what Nous Research is doing. Like they're distilling GPT-4 by creating synthetic data from GPT-4, creating mock textbooks inspired by Phi 2, and then fine tuning open source models like Llama. And so I don't know, I mean, I think that's, should we call that synthetic data? Should we call it something else? I don't know.Soumith [00:31:57]: Yeah, I mean, the outputs of LLMs, are they synthetic data? They probably are, but I think it depends on the goal you have. If your goal is you're creating synthetic data with the goal of trying to distill GPT-4's superiority into another model, I guess you can call it synthetic data, but it also feels disingenuous, because your goal is: I need to copy the behavior of GPT-4 and-Swyx [00:32:25]: It's also not just behavior, but data set. So I've often thought of this as data set washing. Like you need one model at the top of the chain, you know, unnamed French company that has that, you know, makes a model that has all the data in it that we don't know where it's from, but it's open source, hey, and then we distill from that and it's great. To be fair, they also use larger models as judges for preference ranking, right? So that is, I think, a very, very accepted use of synthetic.Soumith [00:32:53]: Correct. I think it's a very interesting time where we don't really have good social models of what is acceptable depending on how many bits of information you use from someone else, right? It's like, okay, you use one bit. Is that okay? Yeah, let's accept it to be okay. Okay, what about if you use 20 bits? Is that okay? I don't know. What if you use 200 bits? I don't think we as a society have ever been in this conundrum where we have to be like, where is the boundary of copyright or where is the boundary of socially accepted understanding of copying someone else? We haven't been tested on this mathematically before,Swyx [00:33:38]: in my opinion. Whether it's transformative use. Yes.
So yeah, I think this New York Times-OpenAI case is gonna go to the Supreme Court, and we'll have to decide it, because I think we never had to deal with it before. And then finally, for synthetic data, the thing that I'm personally exploring is solving this great stark paradigm difference between RAG and fine tuning, where you can kind of create synthetic data off of your retrieved documents and then fine tune on that. That's kind of synthetic. All you need is variation or diversity of samples for you to fine tune on, and then you can fine tune new knowledge into your model. I don't know if you've seen that as a direction for synthetic data.Soumith [00:34:13]: I think what you're doing there is saying, well, I know how to parametrize language to an extent, and I need to teach my model variations of this input data so that it's resilient or invariant to the language used to express that data.Swyx [00:34:32]: Yeah, so it doesn't overfit on the wrong source documents.Soumith [00:34:33]: So I think that's 100% synthetic. And the key is you create variations of your documents, and you know how to do that because you have a symbolic model, or some implicit symbolic model, of language.Swyx [00:34:48]: Okay.Alessio [00:34:49]: Do you think the issue with symbolic models is just the architecture of the language models that we're building? I think maybe the thing that people grasp is the inability of transformers to deal with numbers because of the tokenizer. Is it a fundamental issue there too? And do you see alternative architectures that will be better with symbolic understanding?Soumith [00:35:09]: I am not sure if it's a fundamental issue or not. I think we just don't understand transformers enough. I don't even mean transformers as an architecture. I mean the use of transformers today: combining the tokenizer and transformers and the dynamics of training, when you show them math heavy questions versus not. I don't have a good calibration of whether I know the answer or not. There are common criticisms that, you know, transformers will just fail at X, but then when you scale them up to sufficient scale, they actually don't fail at that X. I think there's this entire subfield trying to figure out these answers, called something like the science of deep learning. So we'll get to know more. I don't know the answer.Meta AI and Llama 2/3Swyx [00:35:57]: Got it. Let's touch a little bit on just Meta AI and, you know, stuff that's going on there. Maybe, I don't know how deeply you're personally involved in it, but you're our first guest from Meta AI, which is really fantastic. And you are such a believer in open source; Llama 1 was more or less the real breakthrough in open source AI. The most interesting thing for us covering it on this podcast was the death of Chinchilla, as people say. Any interesting insights there around the scaling laws for open source models, or smaller models, or whatever that design decision was when you guys were doing it?Soumith [00:36:31]: So Llama 1 was Guillaume Lample and team. There was OPT before, which I think I'm also very proud of, because we bridged the gap in understanding of how complex it is to train these models to the world. Until then, no one had really published the gory details.Swyx [00:36:50]: The logs.Soumith [00:36:51]: Yeah. Like, why is it complex? And everyone says, oh, it's complex. But no one really talked about why it's complex.
I think OPT was cool.Swyx [00:37:02]: I met Susan and she's very, very outspoken. Yeah.Soumith [00:37:05]: We probably, I think, didn't train it for long enough, right? That's kind of obvious in retrospect.Swyx [00:37:12]: For a 175B. Yeah. You trained it according to Chinchilla at the time or?Soumith [00:37:17]: I can't remember the details, but I think it's a commonly held belief at this point that if we had trained OPT longer, it would actually end up being better. Llama 1, I think, was Guillaume Lample and team. Guillaume is fantastic and went on to build Mistral. I wasn't too involved in that side of things, so I can't really answer what you're asking, which is how they thought about the scaling laws and all of that. Llama 2, I was more closely involved in. I helped them a reasonable amount with their infrastructure needs and stuff. And Llama 2, I think, was more like, let's get to the evolution. At that point, we kind of understood what we were missing from the industry's understanding of LLMs: we needed more data and we needed to train the models for longer. And we made, I think, a few tweaks to the architecture and we scaled up more. And that was Llama 2. You can think of it as, after Guillaume left, the team kind of rebuilt their muscle around Llama 2. And Hugo, I think, who's the first author, is fantastic. And I think he did play a reasonably big role in Llama 1 as well; he overlaps between Llama 1 and 2.Soumith [00:38:35]: And Llama 3, obviously, hopefully, will be awesome.Alessio [00:38:42]: Just one question on Llama 2, and then we'll try and fish Llama 3 spoilers out of you. In the Llama 2 paper, the loss curves of the 34B and 70B parameter models still seem kind of steep, like they could go lower. From an infrastructure level, how do you allocate resources? Could they have just gone longer, or were you just like, hey, this is all the GPUs that we can burn, let's just move on to Llama 3 and make that one better?Soumith [00:39:07]: Instead of answering specifically about that Llama 2 situation or whatever, I'll tell you how we think about things. Generally, I mean, Mark released some numbers, right?Swyx [00:39:20]: So let's cite those things again. All I remember is like 600K GPUs.Soumith [00:39:24]: That is by the end of this year: 600K H100 equivalents. With 250K H100s, plus all of our other GPU and accelerator stuff, it would be 600-and-something-K of aggregate capacity.Swyx [00:39:38]: That's a lot of GPUs.Soumith [00:39:39]: We'll talk about that separately. But the way we think about it is we have a train of models, right? Llama 1, 2, 3, 4. And we have a bunch of GPUs. I don't think we're short of GPUs. Like-Swyx [00:39:54]: Yeah, no, I wouldn't say so. Yeah, so it's all a matter of time.Soumith [00:39:56]: I think time is the biggest bottleneck. It's like, when do you stop training the previous one and when do you start training the next one? And how do you make those decisions? The data: do you have net new data, better clean data, for the next one, in a way that it's not worth really focusing on the previous one? It's just a standard iterative product. You're like, when do you stop working on iPhone 1? When do you start working on iPhone 2? And so on, right? So mostly the considerations are time and generation, rather than GPUs, in my opinion.Alessio [00:40:31]: So one of the things about the scaling laws: Chinchilla is optimal for training compute, but it doesn't account for inference costs.
I think at Meta's scale, you would rather pay a lot more at training and then save on inference. How do you think about that from an infrastructure perspective? In your tweet, you said people can try and guess how you're using these GPUs. Can you just give people a bit of understanding? Because I've already seen a lot of VCs say, Llama 3 has been trained on 600,000 GPUs, and that's obviously not true, I'm sure. How do you allocate between the research, FAIR, and the Llama training, and the inference on Instagram suggestions that get me to scroll, AI-generated stickers on WhatsApp, and all of that?Soumith [00:41:11]: Yeah, we haven't talked about any of this publicly, but as a broad stroke, it's like how we would allocate resources of any other kind at any company. You run a VC portfolio: how do you allocate your investments between different companies or whatever? You make various trade-offs and you decide, should I invest in this project or this other project, or how much should I invest in this project? It's very much a zero-sum set of trade-offs. And it also comes into play how your clusters are configured: overall, what you can fit, of what size, in which cluster, and so on. So broadly, there's no magic sauce here. I mean, I think the details would add more spice, but also wouldn't add more understanding. It's just gonna be like, oh, okay, they just think about this as I would normally do.Alessio [00:42:05]: So even the GPU rich run through the same struggles of having to decide where to allocate things.Soumith [00:42:11]: Yeah, I mean, at some point, I forgot who said it, but you kind of fit your models to the amount of compute you have. If you don't have enough compute, you figure out how to make do with smaller models. But no one as of today, I think, would feel like they have enough compute. I don't think I've heard any company within the AI space be like, oh yeah, we feel like we have sufficient compute and we couldn't have done better. That conversation, I don't think I've heard from any of my friends at other companies.EleutherSwyx [00:42:47]: Stella from Eleuther sometimes says that, because she has a lot of donated compute. She's trying to put it to interesting uses, but for some reason she's decided to stop making large models.Soumith [00:42:57]: I mean, that's a cool, high conviction opinion that might pay off.Swyx [00:43:01]: Why?Soumith [00:43:02]: I mean, she's taking a path that most people don't care to take in this climate, and she probably will have very differentiated ideas. I mean, think about the correlation of ideas in AI right now. It's so bad, right? Everyone's fighting for the same pie. In some weird sense, that's partly why I don't really directly work on LLMs. I used to do image models and stuff, and I actually stopped doing GANs because GANs were getting so hot that I didn't have any calibration of whether my work would be useful or not, because, oh yeah, someone else did the same thing you did. There's so much to do, I don't understand why I need to fight for the same pie. So I think Stella's decision is very smart.Making BetsAlessio [00:43:53]: And how do you reconcile that with how we started the discussion, about intrinsic versus extrinsic accomplishment or success? How should people think about that, especially when they're doing a PhD or early in their career?
I think at NeurIPS, walking through a lot of the posters and whatnot, there seems to be mode collapse, in a way, in the research: a lot of people working on the same things. Is it worth it for a PhD student to not take a bet, and work on something that is maybe not as interesting to them, just because of funding and visibility and whatnot? Or yeah, what suggestions would you give?Soumith [00:44:28]: I think there's a baseline level of compatibility you need to have with the field. Basically, you need to figure out if you will get paid enough to eat, right? Whatever reasonable, normal lifestyle you want to have as a baseline. So you at least have to pick a problem within the neighborhood of fundable. You wouldn't wanna be doing something so obscure that people are like, I don't know why you would work on it.Swyx [00:44:59]: One limit on fundability I'm just observing is something like three months of compute, right? That's the top line, that's like the max that you can spend on any one project.Soumith [00:45:09]: But I think that's very ill specified, like how much compute, right? I think the notion of fundability is broader. It's more like, hey, is this family of models within the acceptable set of, you're not crazy or something, right? Even something like neural ODEs, which is a very boundary pushing thing, or state-space models or whatever: all of these things I think are still in fundable territory. When you're talking about, I'm gonna do one of the neuromorphic models and then apply image classification to them or something, then it becomes a bit questionable. Again, it depends on your motivation. Maybe if you're a neuroscientist, it actually is feasible. But if you're an AI engineer, like the audience of these podcasts, then it's more questionable. The way I think about it is, you need to figure out how you can be at the baseline level of fundability just so that you can live. And then after that, really focus on intrinsic motivation, and depending on your strengths, how you can play to your strengths and your interests at the same time. I try to look at a bunch of ideas that are interesting to me, but also try to play to my strengths. I'm not gonna go work on theoretical ML. I'm interested in it, but when I want to work on something like that, I try to partner with someone who is actually a good theoretical ML person and see if I actually have any value to provide. And if they think I do, then I come in. So I think you'd want to find that intersection of ideas you like that also play to your strengths, and I'd go from there. Everything else, like actually finding extrinsic success and all of that, the way I think about it is somewhat immaterial. When you're talking about building ecosystems and stuff, slightly different considerations come into play, but that's a different conversation.Swyx [00:47:06]: We're gonna pivot a little bit to just talking about open source AI. But one more thing I wanted to establish for Meta is this 600K number, just kind of rounding out the discussion: that's for all of Meta, so including your own inference needs, right? It's not just about training.Soumith [00:47:19]: It's gonna be the number in our data centers for all of Meta, yeah.Swyx [00:47:23]: Yeah, so there's a decent amount of workload serving Facebook and Instagram and whatever. And then is there interest in your own hardware?MTIASoumith [00:47:31]: We already talked about our own hardware. It's called MTIA.
Our own silicon. I think we've even shown the standard photograph of you holding the chip that doesn't work, as in the chip that you basically just get, like-Swyx [00:47:51]: As a test, right?Soumith [00:47:52]: Yeah, a test chip or whatever. So we are working on our silicon, and we'll probably talk more about it when the time is right, but-Swyx [00:48:00]: Like what gaps do you have that the market doesn't offer?Soumith [00:48:04]: Okay, I mean, this is easy to answer. So basically, remember how I told you about there being this memory hierarchy and sweet spots and all of that? Fundamentally, when you build a piece of hardware, you make it general enough that a wide set of customers and a wide set of workloads can use it effectively, while trying to get the maximum level of performance they can. The more specialized you make the chip, the more hardware efficient it's going to be, the more power efficient it's gonna be, and the easier it's going to be to write the software, like the kernels, to map those one or two workloads to that hardware, and so on. So it's pretty well understood across the industry that if you have a sufficiently large volume of workload, you can specialize it and get some efficiency gains, like power gains and so on. So the way you can think about every large company building silicon (and I think a bunch of the other large companies are building their own silicon as well) is that each large company has a sufficiently large set of verticalized workloads that can be specialized, that have a pattern to them, that a more generic accelerator like an NVIDIA or an AMD GPU does not exploit. So there is some level of power efficiency that you're leaving on the table by not exploiting that. And you have sufficient scale, and you have sufficient forecasted stability that those workloads will exist in the same form, that it's worth spending the time to build out a chip to exploit that sweet spot. Obviously something like this is only useful if you hit a certain scale, and if your forecast that those kinds of workloads will stay in the same specializable, exploitable form holds true. So yeah, that's why we're building our own chips.Swyx [00:50:08]: Awesome.Open Source AIAlessio [00:50:09]: Yeah, I know we've been talking a lot on a lot of different topics, and going back to open source, you had a very good tweet. You said that a single company's closed source effort rate limits against people's imaginations and needs. How do you think about all the impact that some of the Meta AI work in open source has been having, and maybe directions for the whole open source AI space?Soumith [00:50:32]: Yeah, in general, I think it's first worth talking about this in terms of open and not just open source, because with the whole notion of model weights, no one even knows what source means for these things. But just for the discussion, when I say open source, you can assume I'm just talking about open. And then there's the whole notion of licensing and all that: commercial, non-commercial, commercial with clauses, and all that. I think at a fundamental level, the biggest value of open source is that you make the distribution very wide. It's just available with no friction, and people can do transformative things in a way that's very accessible. Maybe it's open source, but it has a commercial license and I'm a student in India. I don't care about the license. I just don't even understand the license.
But the fact that I can use it and do something with it is very transformative to me. I got this thing in a very accessible way. And then there are various degrees, right? If it's open source but it's actually a commercial license, then a lot of companies are gonna benefit from gaining value that they didn't previously have, that they maybe had to pay a closed source company for. So open source is just a very interesting tool that you can use in various ways. And there are, again, two kinds of open source. One is some large company doing a lot of work and then open sourcing it. That kind of effort is not really feasible for, say, a band of volunteers doing it the same way: there's both a capital and an operational expenditure that the large company just decided to eat and give away to the world, for benefits that are not as tangible as direct revenue. In that bucket, Meta has been doing incredibly good things. They fund a huge amount of the PyTorch development. They've open sourced Llama and that family of models, and several other fairly transformative projects. FAISS is one, Segment Anything, Detectron, Detectron 2, DensePose. I mean, it's-Swyx [00:52:52]: Seamless. Yeah, Seamless.Soumith [00:52:53]: The list is so long that we're not gonna cover it all. So I think Meta comes into that category, where we spend a lot of CapEx and OpEx, we have a high talent density of great AI people, and we open source our stuff. And the thesis for that: I remember when FAIR was started, the common question was, wait, why would Meta wanna start an open AI lab? What exactly is the benefit from a commercial perspective? And the thesis then was very simple. It was: AI is currently rate limiting Meta's ability to do things. Our ability to build various product integrations, moderation, various other factors. AI was the limiting factor, and we just wanted AI to advance more, and we didn't care whether the IP of the AI was uniquely in our possession or not. However the field advances, that accelerates Meta's ability to build a better product. So we just built an open AI lab and we said, if this helps accelerate the progress of AI, that's strictly great for us. Very easy, rational, right? It's still the same to a large extent with the Llama stuff. It's the same values, but the argument is a bit more nuanced. And then there's a second kind of open source, which is, oh, we built this project on nights and weekends, and we're very smart people, and we open sourced it and then built a community around it. This is the Linux kernel and various software projects like that. So I think about both of these kinds of open source as beneficial, and as different: they're different and beneficial in their own ways. The second one is really useful when there's an active arbitrage to be done. If someone's not really looking at a particular space because it's not commercially viable or whatever, a band of volunteers can just coordinate online, do something, and make it happen. And that's great.Open Source LLMsI wanna cover a little bit about open source LLMs maybe. So open source LLMs have been very interesting, because I think we were trending towards an increase in open source in AI from 2010 all the way to 2017 or so, where there was more and more pressure within the community to open source your stuff so that your methods get adopted.
And then the LLM revolution kind of had the opposite effect: OpenAI stopped open sourcing their stuff, DeepMind kind of didn't either, and all the other cloud providers, they didn't open source their stuff. And it was not good, in the sense that, first, science done in isolation probably will just form its own bubble where people believe their own b******t or whatever. So there's that problem. And then there was the other problem, which was the accessibility part. Like, okay, I again always go back to: I'm a student in India with no money. What is my accessibility to any of these closed models? At some scale I have to pay money. That makes it a non-starter and stuff. And there's also the control thing. I strongly believe that if you want human aligned stuff, you want all humans to give feedback, and you want all humans to have access to that technology in the first place. And I actually have seen, living in New York, whenever I come to Silicon Valley, I see a different cultural bubble. All the friends I hang out with talk about some random thing like Dyson Spheres or whatever; that's a thing. And most of the world doesn't know or care about any of this stuff. It's definitely a bubble, and bubbles can form very easily. And when you make a lot of decisions because you're in a bubble, they're probably not globally optimal decisions. So I think open source, the distribution of open source, powers a certain kind of falsifiability that I think is very important. On the open source models, it's going great in the fact that LoRA, I think, came out of the necessity of open source models needing to be fine-tunable in some way. And I think DPO also came out of the academic open source side of things. So, did any of the closed source labs already have LoRA or DPO internally? Maybe, but that does not advance humanity in any way. It advances some company's probability of doing the winner-takes-all that I talked about earlier in the podcast.Open Source and TrustI don't know, it just feels fundamentally good. When people ask, well, what are the ways in which it is not okay? I find most of these arguments, and this might be a little controversial, but I find a lot of arguments about whether closed source models are safer or open source models are safer to be very much related to what kind of culture people grew up in, what kind of society they grew up in. If they grew up in a society that they trusted, then I think they take the closed source argument. And if they grew up in a society that they couldn't trust, where the norm was that you didn't trust your government, because obviously it's corrupt or whatever, then I think they take the open source argument. I think there's a deep connection to people's innate biases from their childhood, and their trust in society and governmental aspects, that pushes them towards one opinion or the other. And I'm definitely in the camp of: open source is definitely going to have better outcomes for society. Closed source to me just means centralization of power, which, you know, is really hard to trust. So I think it's going well in so many ways: we're actively pushing back against the centralization of power into just two or three providers. We are, I think, benefiting from so many people using these models in so many ways that aren't allowed by, say, Silicon Valley left-wing tropes.
Some of these things are good or bad, but they're not culturally accepted universally in the world. So those are things worth thinking about. These are all the ways in which, as I mentioned, open source is actually being very good and beneficial, and winning. But I think it's not winning in certain ways.Feedback to solve the Open Source Coordination problemI think one of the ways in which it's not winning, and at some point I should write a long-form post about this, is that it has a classic coordination problem. I mean, open source in general always has a coordination problem. If there's a vertically integrated provider with more resources, they will just be better coordinated than open source. And so now open source has to figure out how to have coordinated benefits. And the reason you want coordinated benefits is because these models are getting better based on human feedback. With open source models, if you go to Reddit, the LocalLlama subreddit, there are so many variations of models being produced, from, say, Nous Research; there are so many variations built by so many people. And one common theme is they're all using these fine-tuning or human preference datasets that are very limited. Someone published them somewhere and they're not sufficiently diverse. And you look at the other side, say front-ends like Ooba or HuggingChat or Ollama: they don't really have feedback buttons. All the people using all these front-ends probably want to give feedback, but there's no way for them to give it. So these models are being built, they're being arbitrarily measured, and then they are being deployed into all these open source front-ends, or apps that are closed source but serving open source models, and these front-ends are not exposing the ability to give feedback. So we're just losing all of this feedback. Maybe open source models are being used as much as GPT is at this point, in a very fragmented way; in aggregate, all the open source models together are probably being used about as much as GPT is, maybe close to that. But the amount of feedback that is driving back into the open source ecosystem is negligible, maybe less than 1% of the usage. So I think the blueprint here is you'd want someone to create a sinkhole for the feedback, some centralized sinkhole. Maybe Hugging Face or someone just funds it: okay, I will make available a call to log a string, along with a bit of information, positive or negative, or something like that. And then you would want to send pull requests to all the open source front-ends like Ooba and all, being like, hey, we're just integrating a feedback UI, and then work with the closed source people too, being like, look, it doesn't cost you anything, just have a button. And then the sinkhole will have a bunch of this data coming in. And then I think a bunch of open source researchers should figure out how to filter that feedback down to only the high quality stuff. I'm sure it will be exploited by spam bots or whatever, right? This is the perfect way to inject your advertising product into the next model. So there needs to be some level of filtering, in the same way that, I'm sure, all the closed providers are doing today. Like OpenAI, Claude: the feedback that comes in, I'm sure they are figuring out whether it's legit or not. That kind of data filtering needs to be done. And that loop has to be set up.
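To make the shape of the proposal concrete, here is a minimal sketch of what that shared logging call could look like. Everything here is hypothetical: the endpoint, the payload schema, and the `log_feedback` helper are illustrations of the idea, not an existing service.

```python
import requests

# Hypothetical centralized feedback "sinkhole"; no such service exists today.
SINKHOLE_URL = "https://feedback-sinkhole.example.org/v1/log"

def log_feedback(model: str, prompt: str, completion: str, rating: int,
                 client: str) -> None:
    """Fire-and-forget call a front-end could wire to a thumbs up/down button.

    rating is +1 (positive) or -1 (negative); client identifies the front-end.
    """
    payload = {
        "model": model,            # e.g. a Hugging Face model id
        "prompt": prompt,
        "completion": completion,
        "rating": rating,
        "client": client,
    }
    try:
        # Short timeout: feedback logging must never block or break the app.
        requests.post(SINKHOLE_URL, json=payload, timeout=2)
    except requests.RequestException:
        pass

# Example: a chat UI logging a thumbs-down on one exchange.
# log_feedback("mistralai/Mistral-7B-Instruct-v0.2", prompt, completion, -1, "ooba")
```

The hard parts he names, spam filtering and coordinating the pull requests across front-ends, sit behind this call, not in it.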
And this requires both that central sinkhole and that data cleaning effort to be there. They're not there right now, I think partly for capital reasons, but also for coordination reasons: even if that central sinkhole exists, who's gonna go coordinate all of this integration across all of these open source front ends? But I think if we do that, if that actually happens, it probably gives the open source models a real chance of a runaway effect against OpenAI, even with OpenAI's current daily active users. It probably doesn't have a chance against Google, because, you know, Google has Android and Chrome and Gmail and Google Docs and everything, so people just use that a lot. But I think there's a clear chance we can take at open source truly winning.AGIAlessio [01:04:00]: Do you think this feedback is helpful for making open source models better, or for getting to, like, open source AGI? Because in a way, OpenAI's goal is to get to AGI, right? Whereas I think in open source, we're more focused on better personal usage, or better commercial usage.Soumith [01:04:17]: Yeah, I think that's a good question. But I actually don't think people have a good understanding of AGI, and I don't mean at the definition level. I mean, people are like, okay, AGI means it's powering 40% of world economic output or something like that, right? But what does that mean? Do you think electricity is powering 40% of world economic output, or is it not? Generally, the notion of powering X percent of economic output is not defined well enough for me to know when we got to AGI, or how to measure whether we're getting to AGI. You can look at it in terms of intelligence or task automation or whatever, and I think that's what we are doing right now. We're basically integrating the current set of AI technologies into so many real world use cases where we find value that, when some new version of AI comes in, we can be like, ah, this helps me more. In that sense, I think the whole process of how we get to AGI will be continuous, not discontinuous like how I think the question is posed. So I think the open source thing will be very much in line with getting to AGI, because open source has that natural selection effect. If a better open source model comes along, really no one says, nah, I don't want to use it because of ecosystem effects, I'm locked into my ecosystem, or I don't know if I like the models, you know, whatever. It's a very pure, direct thing: if there's a better model that comes out, it will be used. So I definitely think it has a good chance of achieving, the way I think about it, a continuous path to what we might define as AGI.OpenAssistant vs LMSys vs OpenRouterSwyx [01:06:18]: For the listeners, I would actually mention a couple of other maybe related notes on this very interesting concept of a feedback sinkhole, for open source to really catch up in the overall Google versus OpenAI debate. Open Assistant was led by Yannic Kilcher, who recently ended the effort. I think the criticism there was that the kind of people who go to a specific website to give feedback are not representative of real world usage. And that's why the models trained on Open Assistant didn't really seem to catch on in the open source world.
The two leading candidates in my mind are LMSys out of UC Berkeley, who have the LMSys Arena, which is being touted as one of the only reliable benchmarks anymore. I kind of call them non-parametric benchmarks, because there's nothing to cheat on except the Elo. And then the other one is OpenRouter, which is Alex Atallah's thing. I don't know if you've talked to any of these people.Soumith [01:07:11]: I obviously know all of the efforts that you talked about. I haven't talked to them directly about this yet. But the way I think about it is that the way these models are going to be used is always going to be way more distributed than centralized. Which is the power of the open source movement: the UIs within which these models are going to be used are going to be decentralized. These models are going to be integrated into hundreds and thousands of projects and products and all of that. And I think that is important to recognize. The LMSys leaderboard is the best thing we have right now to understand whether a model is better or not versus another model. But it's also biased, in only having a sliver of a view into how people actually use these models. The people who actually end up coming to the LMSys leaderboard and using a model there only use it for certain things. GitHub Copilot-style usage is not captured in, say, LMSys, and so many other styles, like the Character AI-style things, are not captured in LMSys.Swyx [01:08:19]: Which OpenRouter could do. They don't do it right now, but.Soumith [01:08:22]: Yeah, so my point is that the way these models are going to be used is always going to be a large surface area, and I think we need to figure out how to provide the infrastructure to integrate with all the ways in which they're being used. Even if you just get the top hundred front ends that open source models are used through to subscribe to the sinkhole, I think that's already a substantial thing. Thinking that one or two things will, by themselves, get a lot of data is not going to happen, I think.Swyx [01:08:58]: Yeah, fair enough.Other ModalitiesAlessio [01:08:59]: Before we let you go, can we do just a quick beyond-text segment? So you're an investor in Runway, which does video generation. You're an investor in 1X, which is building humanoid robots. Osmo, which is focused on using AI for smell recognition and synthesis. You advise a bunch of robotics projects at NYU.Swyx [01:09:19]: Maybe. And he builds his own home robot. Yeah, exactly.Alessio [01:09:22]: On a more open-ended note: what are the things that you're most excited about beyond text generation and kind of the more mundane usage?Soumith [01:09:30]: Yeah, I mean, in general, I have more things I'm excited about than I can possibly do. Investing is one way to try to clear those urges. I'm generally excited about robotics being a possibility, home robotics being five to seven years away from commercialization. I think it's not next year or two years from now, but five to seven years from now, a lot more robotics companies might pop up. There's not a good consensus on whether hardware is the bottleneck or AI is the bottleneck in robotics right now. My view is actually that hardware is still the bottleneck, and AI is also a little bit of a bottleneck, but I don't think there are any obvious breakthroughs we need. I think it's just work. So I'm generally excited about robotics. I spend a lot of personal time on it.
I spend every Wednesday afternoon at NYU working with Lerrel Pinto and team, just getting towards my home robot that does my dishes and stuff.Swyx [01:10:38]: What's the status of it? What does it do for you now?Soumith [01:10:41]: As of today, a couple of months ago we deployed our home robotics stuff into several tens of New York City homes and tried to make it do a bunch of tasks. And we're basically starting to build out a framework that gets to a certain level of robustness on fairly simple tasks, like picking up this cup and putting it somewhere else, or taking a few pieces of cloth on the ground and putting them somewhere else, or opening your microwave: various baseline tasks, with low sample complexity. One of the things people don't spend a lot of time on in robotics is the user experience, which in the research I do at NYU, we spend a huge amount of time on. I think the key there is that sample complexity has to be really low. In a lot of current robotics research, they're like, oh yeah, we collected 50 demos and now it's able to do this task, or we collected 300 demos: the number of samples you need for the thing to do the task is really high. So we're focusing a lot on: you show it two or three times and that's sufficient for it to actually do the task. But that comes with less generalization, right? There are some initial conditions that have to be true for it to do the task. So we're making progress. The space is very interesting in general. I don't think people in this space have settled on the hardware, like what the hardware should look like for it to be truly useful in the home or whatever, or the UX, or the AI/ML stuff needed to make it sample efficient and all of that. But lots of work is happening in the field.Alessio [01:12:28]: Yeah, one of my friends, Carlo at Berkeley, worked on a project called M3L, which is two CNNs, one for tactile feedback and one for image. When you say hardware, is it running all these things on the edge, or is it just the actual servos and the-Soumith [01:12:45]: By hardware, I mean the actual servos: the motors, servos, even the sensors. Compare us humans: we have incredible vision that's still so much better, in field of view and in resolution, than any of the cameras we can buy. Our skin is touch sensing over its entire surface. And we have some of the most efficient, highest-capacity motors, which can lift large loads with the dexterity of a hand and stuff. In terms of those capabilities, we haven't figured out how to do a lot of this stuff in hardware. I mean, Tesla has been making incredible progress. 1X, I think, announced their new thing, and it looks incredible. Some of the other companies, Figure and others, are doing great work. But we're really not anywhere close to the hardware that we feel like we need. And the other thing I want to call out is that a lot of what people show works, but has to be fixed all the time. And that's the other thing we humans are incredible at: we don't need any maintenance, or the maintenance is part of us. If you buy an electronics product of any kind, you buy a PS5, you don't say, oh yeah, my PS5 breaks every six days and I have to do some reasonable amount of work on it. But that's robotics. If it's not industrial robotics, where it's very controlled and specialized or whatever, you're talking about reliability in those ranges.
So I think people don't talk about the reliability thing enough. When we enter the commercialization phase, we're going to start thinking about, okay, now we have this thing, and we need to figure out how to get reliability high enough to deploy it into homes and just sell it to people, at Best Buy or something. So that's the other factor that we have to make a lot of progress on.Swyx [01:14:44]: I just realized that Google has a play in this with PaLM-E and stuff, and OpenAI obviously has a long history of doing this stuff. Is there anything at Meta? No robotics stuff at Meta?Soumith [01:14:55]: We have a small robotics program at Meta, out of FAIR. I actually used to work on it at FAIR a little bit, before I moved into Infra and focused my Meta time on a lot of other infrastructural stuff. So yeah, Meta's robotics program is a lot smaller.Swyx [01:15:10]: Seems like it would be a personal computing play.Soumith [01:15:14]: You could think of it as, Meta has a ridiculously large device strategy, right? This is our Reality Labs stuff: we're going at it from VR and AR, and we showcase a lot of that. I think for Meta, the robot is not as important as the physical device, the physical devices kind of stuff.Osmo - smell AISwyx [01:15:37]: Yeah, for sure. Yeah. Okay, I want to touch on Osmo a bit, because it's a very unusual company compared to the stuff that we normally discuss: not robotics, but the sense of smell. The original pitch I heard from the founder, maybe you can correct me, is that he realized that you can smell cancer. Yeah. Is that intuitive? Is that what you get? Or is that the potential that you see?Soumith [01:15:56]: The very interesting reason I invested in Osmo is because of Alex Wiltschko, the founder of Osmo. Before PyTorch, there was Torch, and Alex Wiltschko actually worked on Torch. He's actually a frameworks guy: he built this thing at Google called Tangent, another autodiff framework and stuff. I know him from that side of things. And then, he is a neurobiologist by training. He just happens to also love neural networks and hacking on those frameworks. So he's an incredibly smart guy, one of the smartest people I know. So when he was going in this direction, I thought it was incredible that smell is something we haven't even started to scratch in terms of digitization. When we think about audio or images or video, they're so advanced. We have the concept of color spaces. We have the concept of frequency spectrums: we figured out how ears process frequencies, the mel spectrum or whatever, logarithmically scaled. Images have RGB, YUV. We have so many different kinds of parameterizations. We have formalized those senses ridiculously well. Touch and smell? Nada. We're where we were with images in, say, 1920, or maybe even the 1800s, right? That's where we're at. And Alex has this incredible vision of a smell sensor eventually just being part of your daily life. As of today, you don't really think, when you're watching an Instagram reel of food or something, huh, I would also love to know what it smelled like. You don't, because we really haven't, as a society, built that muscle to even understand what a smell sensor can do.
I think the more near-term effects are obviously going to be around things that provide more obvious utility in the short term, like maybe smelling cancer or repelling mosquitoes better, or, you know, stuff like that.Swyx [01:18:12]: More recently, he's been talking about categorizing perfumes, obviously. Yeah, exactly. That's a market that you can pursue.Soumith [01:18:17]: Yeah, I mean, think about how you could customize a perfume to your own liking, in the same way you can customize a shoe or something, right? I think with all the near-term stuff, if he's able to figure out near-term value, they, as a company, can sustain themselves, to then eventually try to make progress on the long term, which is really uncharted territory. Think about it: 50 years from now, it would be pretty obvious to kids of that generation to just, I was going to say scroll a reel on their phone, but maybe phones won't be there.Swyx [01:18:58]: They're just on their glasses, they're watching something.Soumith [01:18:58]: Yeah, I think it would be VR. And then they immediately get a smell sense of that remote experience as well. We haven't really progressed enough in that dimension, and I think they have a chance to do it.Alessio [01:19:13]: Awesome, I mean, we touched on a lot of things. Are we missing anything you want to direct people to, or?Swyx [01:19:19]: Yeah, call to action. Yeah. Call for research, call for startups.Soumith [01:19:22]: I don't really have a lot of calls to action, because usually I think people should be intrinsically, like, figuring it out.Swyx [01:19:29]: That's a good "look inside yourself." Yeah. That's good.Alessio [01:19:33]: Awesome, thank you so much for coming on.Swyx [01:19:35]: Yeah, for sure. This was great.
-
A Brief History of the Open Source AI Hacker - with Ben Firshman of Replicate
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-02-28 18:04
This Friday we’re doing a special crossover event in SF with Dylan Patel of SemiAnalysis (previous guest!), and we will do a live podcast on site. RSVP here. Also join us on June 25-27 for the biggest AI Engineer conference of the year!Replicate is one of the most popular AI inference providers, reporting over 2 million users as of their $40m Series B with a16z. But how did they get there? The Definitive Replicate Story (warts and all)Their overnight success took 5 years of building, and it all started with arXiv Vanity, a 2017 vacation project that scraped arXiv PDFs and re-rendered them into semantic web pages that reflow nicely, with better typography and whitespace. From there, Ben and Andreas’ idea was to build tools to make ML research more robust and reproducible by making it easy to share code artefacts alongside papers. They had previously created Fig, which made it easy to spin up dev environments; it was eventually acquired by Docker and turned into `docker-compose`, the industry standard way to define services from containerized applications. 2019: CogThe first iteration of Replicate was a Fig-equivalent for ML workloads which they called Cog; it made it easy for researchers to package all their work and share it with peers for review and reproducibility. But they found that researchers were terrible users: they’d do all this work for a paper, publish it, and then never return to it again. “We talked to a bunch of researchers and they really wanted that.... But how the hell is this a business, you know, like how are we even going to make any money out of this? …So we went and talked to a bunch of companies trying to sell them something which didn't exist. So we're like, hey, do you want a way to share research inside your company so that other researchers or say like the product manager can test out the machine learning model? They're like, maybe. Do you want like a deployment platform for deploying models? Do you want a central place for versioning models? We were trying to think of lots of different products we could sell that were related to this thing…So we then got halfway through our YC batch. We hadn't built a product. We had no users. We had no idea what our business was going to be because we couldn't get anybody to like buy something which didn't exist. And actually there was quite a way through our, I think it was like two thirds the way through our YC batch or something. And we're like, okay, well we're kind of screwed now because we don't have anything to show at demo day.”The team graduated YCombinator with no customers, no product and nothing to demo - which was fine because demo day got canceled as the YC W’20 class graduated right into the pandemic. The team spent the next year exploring and building Covid tools.2021: CLIP + GAN = PixRayIn early 2021, OpenAI released CLIP. Overnight, dozens of Discord servers got spun up to hack on CLIP + GANs. Unlike academic researchers, this community was constantly releasing new checkpoints and builds of models. PixRay was one of the first models being built on Replicate, and it quickly started taking over the community.
Chris Dixon has a famous 2010 post titled “The next big thing will start out looking like a toy”; image generation would have definitely felt like a toy in 2021, but it gave Replicate its initial boost.2022: Stable DiffusionIn August 2022 Stable Diffusion came out, and all the work they had been doing to build this infrastructure for CLIP / GAN models became the best way for people to share their Stable Diffusion fine-tunes:And like the first week we saw people making animation models out of it. We saw people make game texture models that use circular convolutions to make repeatable textures. We saw a few weeks later, people were fine tuning it so you could put your face in these models and all of these other ways. […] So tons of product builders wanted to build stuff with it. And we were just sitting in there in the middle, as the interface layer between all these people who wanted to build, and all these machine learning experts who were building cool models. And that's really where it took off. Incredible supply, incredible demand, and we were just in the middle.(Stable Diffusion also spawned Latent Space as a newsletter)The landing page paved the cowpath for the intense interest in diffusion model APIs.2023: Llama & other multimodal LLMsBy 2023, Replicate’s growing visibility in the Stable Diffusion indie hacker community came from top AI hackers like Pieter Levels and Danny Postma, each making millions off their AI apps:Meta then released LLaMA 1 and 2 (our coverage of it), greatly pushing forward the SOTA open source model landscape. Demand for text LLMs and other modalities rose, and Replicate broadened its focus accordingly, culminating in an $18m Series A and $40m Series B from a16z (at a $350m valuation).Building standards for the AI worldNow that the industry is evolving from toys to enterprise use cases, all these companies are working to set standards for their own space. We cover this at ~45 mins in the podcast. Some examples:* LangChain has been trying to establish “chain” as the standard mental model when putting multiple prompts and models together, and the “LangChain Expression Language” to go with it. (Our episode with Harrison)* LlamaHub for packaging RAG utilities. (Our episode with Jerry)* Ollama’s Modelfile to define runtimes for different model architectures. These are usually targeted at local inference. * Cog (by Replicate) to create environments to which you can easily attach CUDA devices and make it easy to spin up inference on remote servers. * GGUF as the file format for ggml-based executors. None of them have really broken out yet, but this is going to become a fiercer competition as the market matures. 
Full Video PodcastAs a reminder, all Latent Space pods now come in full video on our YouTube, with bonus content that we cut for time!Show Notes* Ben Firshman* Replicate* Free $10 credit for Latent Space readers* Andreas Jansson (Ben’s co-founder)* Charlie Holtz (Replicate’s Hacker in Residence)* Fig (now Docker Compose)* Command Line Interface Guidelines (clig)* Apple Human Interface Guidelines* arXiv Vanity* Open Interpreter* PixRay* SF Compute* Big Sleep by Advadnoun* VQGAN-CLIP by Rivers Have WingsTimestamps* [00:00:00] Introductions* [00:01:17] Low latency is all you need* [00:04:08] Evolution of CLIs* [00:05:59] How building ArxivVanity led to Replicate* [00:11:37] Making ML research replicable with containers* [00:17:22] Doing YC in 2020 and pivoting to tools for COVID* [00:20:22] Launching the first version of Replicate* [00:25:51] Embracing the generative image community* [00:28:04] Getting reverse engineered into an API product* [00:31:25] Growing to 2 million users* [00:34:29] Indie vs Enterprise customers* [00:37:09] How Unsplash uses Replicate* [00:38:29] Learnings from Docker that went into Cog* [00:45:25] Creating AI standards* [00:50:05] Replicate's compute availability* [00:53:55] Fixing GPU waste* [01:00:39] What's open source AI?* [01:04:46] Building for AI engineers* [01:06:41] Hiring at ReplicateTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:14]: Hey, and today we have Ben Firshman in the studio. Welcome Ben.Ben [00:00:18]: Hey, good to be here.Swyx [00:00:19]: Ben, you're a co-founder and CEO of Replicate. Before that, you were most notably the founder of Fig, which became Docker Compose. You also did a couple of other things before that, but that's what a lot of people know you for. What should people know about you that's outside of your, sort of, LinkedIn profile?Ben [00:00:35]: Yeah. Good question. I think I'm a builder and tinkerer, in a very broad sense, and I love using my hands to make things. So I work on things maybe a bit closer to tech, like electronics, but I also build things out of wood, and I fix cars, and I fix my bike and build bicycles and all this kind of stuff. And there's so much I think I've learned, transferable skills from just working in the real world, that carries over to building things in software. You know, it's so much about being a builder, both in real life and in software, that crosses over.Swyx [00:01:11]: Is there a real world analogy that you use often when you're thinking about a code architecture or problem?Ben [00:01:17]: I like to build software tools as if they were something real. So I wrote this thing called the Command Line Interface Guidelines, which was a bit like the Mac Human Interface Guidelines, but for command line interfaces. I did it with the guy I created Docker Compose with, and a few other people. And somewhere in there, I think I described that your command line interface should feel like a big iron machine, where you pull a lever and it goes clunk, and things should respond within like 50 milliseconds, as if it was a real life thing.
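As an aside, here is a minimal illustration of what that budget means for a Python CLI (an editor's sketch, not Ben's code; the `mytool` name and subcommand are made up): heavy imports are deferred until a subcommand actually needs them, so argument parsing and `--help` stay well under the perceptual threshold.

```python
import argparse
import sys

def main() -> None:
    parser = argparse.ArgumentParser(prog="mytool")
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("stats", help="summarize a CSV read from stdin")
    args = parser.parse_args()

    if args.command == "stats":
        # Heavy import deferred: the cost is paid only when this subcommand
        # runs, so `mytool --help` still responds in tens of milliseconds.
        import pandas as pd
        print(pd.read_csv(sys.stdin).describe())
    else:
        parser.print_help()

if __name__ == "__main__":
    main()
```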
And another analogy here: in real life, you know, when you press a button on an electronic device and it's a soft switch, you press it and nothing happens, there's no physical feedback of anything happening, and then half a second later something happens. That's how a lot of software feels. But instead, software should feel more like something that's real, where you pull a physical lever and the physical lever moves. I've taken that lesson of human interface to software a ton. It's all about low latency, of things feeling really solid and robust, both in command lines and in user interfaces as well.Swyx [00:02:22]: And how did you operationalize that for Fig or Docker?Ben [00:02:27]: A lot of it's just low latency. Actually, we didn't do it very well for Fig in the first place. We used Python, which was a big mistake: Python's really hard to get booting up fast, because you have to load up the whole Python runtime before it can run anything. Go is much better at this; Go just instantly starts.Swyx [00:02:45]: You have to be under 500 milliseconds to start up?Ben [00:02:48]: Yeah, effectively. I mean, human perception of something being immediate is something like a hundred milliseconds. So anything like that is, yeah, good enough.Swyx [00:02:57]: Yeah. Also, I should mention, since we're talking about your side projects: I am maybe one of the few fellow people who have actually written something about CLI design principles, because I was in charge of the Netlify CLI back in the day and had many thoughts. One of my fun thoughts, I'll just share it in case you have thoughts, is that I think CLIs are effectively starting points for scripts that are then run. And the moment one of the script's preconditions is not fulfilled, typically it ends; the CLI developer will just exit the program. The way that I designed the Netlify dev workflow, I really wanted it to be kind of a state machine that would resolve itself. If it detected a precondition wasn't fulfilled, it would actually delegate to a subprogram that would then fulfill that precondition, asking for more info or waiting until a condition was fulfilled. Then it would go back to the original flow and continue that. I don't know if that was ever tried, or is there a more formal definition of it? Because I just came up with it randomly. But it felt like the beginnings of AI, in the sense that when you run a CLI command, you have an intent to do something, and you may not have given the CLI all the things that it needs to execute that intent. So that was my two cents.Ben [00:04:08]: Yeah, that reminds me of a thing we sort of thought about when writing the CLI guidelines, where CLIs were designed in a world where the CLI was really a programming environment, primarily designed for machines to use all of these commands and scripts. It was back in a world where the primary way of using computers was effectively writing shell scripts. Over time, we've transitioned to a world where humans are using CLI programs much more than they used to. And the sort of best practices about how Unix was designed, there are lots of design documents about Unix from the 70s and 80s, where they say things like: command line commands should not output anything on success.
It should be completely silent, which makes sense if you're using it in a shell script. But if a user is using it, it just looks like it's broken. If you type copy and it just doesn't say anything, as a new user you assume that it didn't work. I think what's really interesting about the CLI is that it's actually a really good user interface, to your point, where it can be like a conversation. Instead of you just telling the computer to do this thing, and it either silently succeeding or saying, no, you failed, it can guide you in the right direction and tell you what your intent might be, in a way that's almost more natural to a CLI than it is in a graphical user interface, because it feels like this back and forth with the computer, almost funnily like a language model. So I think there's some interesting intersection where CLIs and language models are actually very closely related and a good fit for each other.Swyx [00:05:59]: Yeah, I'll say one of the surprises from last year: I worked on a coding agent, but I think the most successful coding agent of my cohort was Open Interpreter, which was a CLI implementation. And even as a CLI person, I have chronically underestimated the CLI as a useful interface. You also developed arXiv Vanity, which you recently retired after a glorious seven years.Ben [00:06:22]: Something like that.Swyx [00:06:23]: Which is nice, I guess: HTML versions of PDFs.Ben [00:06:27]: Yeah, that was actually the start of where Replicate came from. Okay, we can tell that story. So when I quit Docker, I got really interested in science infrastructure, just as a problem area, because science has created so much progress in the world. The fact that we can talk to each other on a podcast, and that we use computers, and the fact that we're alive, is probably thanks to medical research, you know. But science is just completely archaic and broken: it's 19th century processes that just happen to have been copied to the internet, rather than taking into account that we can transfer information at the speed of light now. And the whole way science is funded and all this kind of thing is all kind of very broken. And there's just so much potential for making science work better. And I realized that I wasn't a scientist, and I didn't really have the time to go and get a PhD and become a researcher, but I'm a tool builder and I could make existing scientists better at their job. And if I could make a bunch of scientists a little bit better at their job, maybe that's the kind of equivalent of being a researcher. So one particular thing I dialed in on is just how science is disseminated: all of these PDFs, quite often behind paywalls, on the internet.Swyx [00:07:34]: And that's a whole thing, because it's funded by national grants, government grants, and then they're put behind paywalls. Yeah, exactly.Ben [00:07:40]: That's like a whole, yeah, I could talk for hours about that. But the particular thing we got dialed in on was that, interestingly, there's a bunch of open science that happens as well. So math, physics, computer science, and machine learning, notably, are all published on arXiv, which is actually a surprisingly old institution.Swyx [00:08:00]: Some random Cornell thing.Ben [00:08:01]: Yeah, it was just somebody at Cornell who started a mailing list in the 80s.
And then when the web was invented, they built a web interface around it. Like, it's super old.Swyx [00:08:11]: And it's kind of like a user group thing, right? That's why there are all these numbers and stuff.Ben [00:08:15]: Yeah, exactly. Like it's a bit like something, yeah. That's where basically all of math, physics and computer science happens. But it's still PDFs published to this thing. Yeah, which is just so infuriating. The web was invented at CERN, a physics institution, to share academic writing. Like, there are figure tags, there are author tags, there are heading tags, there are cite tags. You know, hyperlinks are effectively citations, because you want to link to another academic paper. But instead, you have to copy and paste these things and try and get around paywalls. It's absurd, you know. And now we have social media and things, but still academic papers are PDFs, you know. This is not what the web was for. So anyway, I got really frustrated with that. And I went on vacation with my old friend Andreas. We used to work together in London at somebody else's startup. And we were just on vacation in Greece for fun. And he was trying to read a machine learning paper on his phone, you know, and we had to zoom in and scroll line by line on the PDF. And he was like, this is f*****g stupid. And I was like, I know. We discovered our mutual hatred for this, you know. And we spent our vacation sitting by the pool, making LaTeX-to-HTML converters, making the first version of arXiv Vanity. Anyway, that then became a whole thing. And the story is, we shut it down recently because it caught the eye of arXiv. They were like, oh, this is great. We just haven't had the time to work on this. And what's tragic about arXiv is it's this project of Cornell's where they can barely scrounge together enough money to survive. I think it might be better funded now than it was when we were collaborating with them. And compared to these scientific journals, this is actually where the work happens, but they have just a fraction of the money that these big scientific journals have, which is just so tragic. But anyway, they were like, yeah, this is great. We can't afford to do it, but do you want to, as a volunteer, integrate arXiv Vanity into arXiv?Swyx [00:10:05]: Oh, you did the work.Ben [00:10:06]: We didn't do the work. We started doing the work. We did some. I think we worked on this for a few months to actually get it integrated into arXiv. And then we got distracted by Replicate. So a guy called Dan picked up the work and made it happen, somebody who works on one of the libraries that powers arXiv Vanity. Okay.Swyx [00:10:26]: And the relationship with arXiv Sanity?Ben [00:10:28]: None.Swyx [00:10:30]: Did you predate them? I actually don't know the lineage.Ben [00:10:32]: We were after. We were both users of arXiv Sanity, which is like a sort of arXiv...Swyx [00:10:37]: Which is Andrej's RecSys on top of arXiv.Ben [00:10:40]: Yeah. Yeah. And we were both users of that. And I think we were trying to come up with a working name for the project, and Andreas just cracked a joke of like, oh, let's call it arXiv Vanity. Let's make the papers look nice. Yeah. Yeah. And that was the working name and it just stuck.Swyx [00:10:52]: Got it.Ben [00:10:53]: Got it.Alessio [00:10:54]: Yeah.
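(For the curious, the core transformation arXiv Vanity automated looks roughly like this. pandoc is used here purely as a stand-in converter; arXiv Vanity's actual pipeline was a purpose-built LaTeX-to-HTML tool, and the filenames are made up.)

```python
# Illustrative sketch only: "LaTeX in, responsive HTML out".
import subprocess

def latex_to_html(tex_path: str, html_path: str) -> None:
    # --standalone emits a complete HTML document; --mathjax makes equations
    # render and reflow on a phone, which a fixed-layout PDF cannot do.
    subprocess.run(
        ["pandoc", tex_path, "--standalone", "--mathjax", "--output", html_path],
        check=True,
    )

if __name__ == "__main__":
    latex_to_html("paper.tex", "paper.html")
```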
And then from there, tell us more about why you got distracted, right? So Replicate, maybe it feels like an overnight success to a lot of people, but you've been building this since 2019. So what prompted the start?Ben [00:11:07]: And we've been collaborating for even longer. So we created arXiv Vanity in 2017. So in some sense, we've been doing this almost like six, seven years now, a classic seven year.Swyx [00:11:16]: Overnight success.Ben [00:11:17]: Yeah. Yes. We did arXiv Vanity and then worked on a bunch of surrounding projects. I was still really interested in science publishing at that point. And I'm trying to remember, because I tell a lot of the condensed story to people, because I can't really tell a seven year history. So I'm trying to figure out the right. Oh, we got room. The right length.Swyx [00:11:35]: We want to nail the definitive Replicate story here.Ben [00:11:37]: One thing that's really interesting about these machine learning papers is that they're published on arXiv, and a lot of them are actual fundamental research, so, like, they should be prose describing a theory. But a lot of them are just running pieces of software that a machine learning researcher made that did something, you know. It was like an image classification model or something. And they managed to make an image classification model that was better than the existing state of the art. And they've made an actual running piece of software that does image segmentation. And then what they had to do is take that piece of software and write it up as prose and math in a PDF. And what's frustrating about that is if you actually want to use it... So, Andreas was a machine learning engineer at Spotify. And some of his job was pure research as well. He did a PhD and he was doing a lot of stuff internally. But part of his job was also being an engineer and taking some of these existing things that people have made and published and trying to apply them to actual problems at Spotify. And he was like, you know, you get given a paper which describes roughly how the model works. It's probably missing lots of crucial information. There's sometimes code on GitHub. More and more there's code on GitHub. But back then it was kind of relatively rare. But it's quite often just scrappy research code and didn't actually run. And, you know, there were maybe the weights on Google Drive, but they accidentally deleted the weights off Google Drive, you know, and it was really hard to take this stuff and actually use it for real things. We just started talking together about his problems at Spotify and I connected this back to my work at Docker as well. I was like, oh, this is what we created containers for. You know, we solved this problem for normal software by putting the thing inside a container so you could ship it around and it kept on running. So we were sort of hypothesizing about like, hmm, what if we put machine learning models inside containers so they could actually be shipped around and they could be defined in some production ready format, and other researchers could run them to generate baselines, and people who wanted to actually apply them to real problems in the world could just pick up the container and run it, you know.
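(A hedged sketch of that hypothesis: bake the model, its environment, and a small HTTP server into one image so the thing keeps running anywhere. The FastAPI server and the toy word-scoring "model" below are purely illustrative, not anything Replicate shipped.)

```python
# server.py: illustrative only. The "weights" live inside the image, so nobody
# can accidentally delete them off Google Drive.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Stand-in for real weights loaded at startup from inside the container.
WEIGHTS = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}

class Input(BaseModel):
    text: str

@app.post("/predict")
def predict(inp: Input) -> dict:
    # Toy "model": score words against the baked-in weights.
    score = sum(WEIGHTS.get(w, 0.0) for w in inp.text.lower().split())
    return {"label": "positive" if score >= 0 else "negative", "score": score}

# A matching Dockerfile, as comments so this stays one runnable Python file:
#   FROM python:3.11-slim
#   RUN pip install fastapi uvicorn
#   COPY server.py /app/server.py
#   WORKDIR /app
#   CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]
```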
Normally in this part of the story I skip forward and say: and then we created Cog, this container stuff for machine learning models, and we created Replicate, the place for people to publish these machine learning models. But there's actually like two or three years between those. The thing we then got dialed into was, Andreas was like, what if there was a CI system for machine learning? One of the things he really struggled with as a researcher is generating baselines. So when he's writing a paper, he needs to get like five other models that are existing work and get them running.Swyx [00:14:21]: On the same evals.Ben [00:14:22]: Exactly, on the same evals so you can compare apples to apples, because you can't trust the numbers in the paper.Swyx [00:14:26]: Unless you're Google and you can just publish them anyway.Ben [00:14:31]: So I think this was coming from the thinking of, there should be containers for machine learning, but why are people going to use that? Okay, maybe we can create a supply of containers by creating this useful tool for researchers. And the useful tool was: let's get researchers to package up their models and push them to a central place where we run a standard set of benchmarks across the models, so that you can trust those results and you can compare these models apples to apples. And for a researcher like Andreas, doing a new piece of research, he could trust those numbers and he could pull down those models, confirm them on his machine, use the standard benchmark to then measure his model, and, you know, all this kind of stuff. And so we started building that. That's what we applied to YC with, got into YC, and we started sort of building a prototype of this. And then this is where it all starts to fall apart. We were like, okay, that sounds great. And we talked to a bunch of researchers and they really wanted that. That sounds brilliant. That's a great way to create a supply of models on this research platform. But how the hell is this a business, you know? Like, how are we even going to make any money out of this? And we're like, oh s**t, that's the real unknown here: what the business is. So we thought it would be a really good idea to, okay, before we get too deep into this, let's try and de-risk this turning into a business. So let's try and research what the business could be for this research tool, effectively. So we went and talked to a bunch of companies, trying to sell them something which didn't exist. So we're like, hey, do you want a way to share research inside your company so that other researchers, or say the product manager, can test out the machine learning model? They're like, maybe. And we were like, do you want a deployment platform for deploying models? Do you want a central place for versioning models? We were trying to think of lots of different products we could sell that were related to this thing. And, terrible idea. Like, we're not sales people, and people don't want to buy something that doesn't exist. I think some people can pull this off, but we were just, you know, a bunch of product and engineering people, and we just couldn't pull this off. So we then got halfway through our YC batch. We hadn't built a product. We had no users.
We had no idea what our business was going to be, because we couldn't get anybody to buy something which didn't exist. And actually we were quite a way through, I think two-thirds of the way through our YC batch or something. And we were like, okay, well, we're kind of screwed now because we don't have anything to show at demo day. And then we tried to figure out, okay, what can we build in like two weeks that'll be something. So we desperately tried to, I can't remember what we tried to build at that point. And then two weeks before demo day, I just remember, we were going down to Mountain View every week for dinners, and we got called onto an all hands Zoom call, which was super weird. We're like, what's going on? And they were like, don't come to dinner tomorrow. And we kind of looked at the news and we were like, oh, there's a pandemic going on. We were so deep in our startup, we were just completely oblivious to what was going on around us.Swyx [00:17:20]: Was this Jan or Feb 2020?Ben [00:17:22]: This was March 2020.Swyx [00:17:25]: Yeah. Because I remember Silicon Valley at the time was early to COVID. Like, they started locking down a lot faster than the rest of the US.Ben [00:17:32]: Yeah, exactly. And I remember, yeah, soon after that there were the San Francisco lockdowns and then the YC batch just stopped. There wasn't a demo day, and it was in a sense a blessing for us, because we just kind of...Swyx [00:17:43]: In the normal course of events, you're actually allowed to defer to a future demo day. Yeah.Ben [00:17:51]: So we didn't even take the deferral, because it just kind of didn't happen.Swyx [00:17:55]: So was YC helpful?Ben [00:17:57]: Yes. We completely screwed up the batch, and that was our fault. I think the thing that YC has become incredibly valuable for us has been after YC. I think there's a reason why we didn't need to do YC to start with, because we were quite experienced. We had done some startups before. We were kind of well connected with VCs, you know. It was relatively easy to raise money because we were a known quantity. You know, if you go to a VC and be like, hey, I made this piece of-Swyx [00:18:24]: It's Docker Compose for AI.Ben [00:18:26]: Exactly. Yeah. And, you know, people can pattern match like that, and they can have some trust that you know what you're doing. Whereas it's much harder for people straight out of college, and that's where YC's sweet spot is: helping people straight out of college who are super promising figure out how to do that.Swyx [00:18:40]: No credentials.Ben [00:18:41]: Yeah, exactly. We didn't need that. But the thing that's been incredibly useful for us since YC, this was actually, I think, so Docker was a YC company, and Solomon, the founder of Docker, I think told me this. He was like, a lot of people underestimate the value of YC after you finish the batch. And his biggest regret was not staying in touch with YC. I might be misattributing this, but I think it was him. And so we made a point of that. And we just stayed in touch with our batch partner, Jared Friedman at YC, who has been fantastic.Ben [00:19:10]: All of the team at YC, really. There was the growth team at YC when they were still there, and they've been super helpful. And two things have been super helpful about that. One is raising money: they just know exactly how to raise money.
And they've been super helpful during that process in all of our rounds. We've done three rounds since we did YC, and they've been super helpful during the whole process. And also just reaching a ton of customers. So the magic of YC is that there are thousands of YC companies, on the order of thousands, I think. And they're all of your first customers. And they're super helpful, super receptive, really want to try out new things. You have a warm intro to every one of them, basically. And there's this mailing list where you can post about updates to your products, which is really receptive. And that's just been fantastic for us. We've just got so many of our users and customers through YC. Yeah.Swyx [00:20:00]: Well, so the classic criticism, or the sort of, you know, pushback, is that people don't buy from you just because you're both from YC. But at least they'll open the email. Right. Like, that's the... Okay.Ben [00:20:13]: Yeah. Yeah. Yeah. So that's been a really, really positive experience for us.Swyx [00:20:16]: And sorry, I interrupted with the YC question. You just made it out of YC, survived the pandemic.Ben [00:20:22]: I'll try and condense this a little bit. Then we started building tools for COVID, weirdly. We were like, okay, we don't have a startup. We haven't figured out anything. What's the most useful thing we could be doing right now?Swyx [00:20:32]: Save lives.Ben [00:20:33]: So yeah. Let's try and save lives. I think we failed at that as well. We had a bunch of products that didn't really go anywhere. We worked on, yeah, a bunch of stuff like contact tracing, which it turned out wasn't really a useful thing. Andreas worked on sort of a DoorDash for delivering food to people who were vulnerable. What else did we do? The meta problem of helping people direct their efforts to what was most useful, and a few other things like that. It didn't really go anywhere. So we're like, okay, this is not really working either. We were actually considering just doing, like, work for COVID. We have this decision document early on in our company, which is like, should we become, like, a government app contracting shop? We decided no.Swyx [00:21:11]: Because you also did work for gov.uk. Yeah, exactly.Ben [00:21:14]: We had experience like doing some like-Swyx [00:21:17]: And the Guardian and all that.Ben [00:21:18]: Yeah. For, like, government stuff. And we were just really good at building stuff. We were just product people. I was the front end product side and Andreas was the back end side. So we were just like a product team. And we were working with a designer at the time, a guy called Mark, who did our early designs for Replicate. And we were like, hey, what if we just team up and build stuff? And yeah, we gave up on that in the end, I can't remember the details. So we went back to machine learning. And then we were like, well, we're not really sure if this is going to work. And one of my most painful experiences from previous startups is shutting them down. When you realize it's not really working and have to shut it down, it's a ton of work and people hate you and it's just sort of, you know. So we were like, how can we make something we don't have to shut down? And even better, how can we make something that won't page us in the middle of the night? So we made an open source project.
We made a thing which was an open source Weights and Biases, because we had this theory that people want open source tools. There should be an open source, like, version control and experiment tracking type thing. And it was intuitive to us, and we're like, oh, we're software developers and we like command line tools. Everyone loves command line tools and open source stuff. But machine learning researchers just really didn't care. They just wanted to click on buttons. They didn't mind that it was a cloud service. It was all very visual as well, in that you need lots of graphs and charts and stuff like this. So it wasn't right. Like, it was right for us: we actually were building something that Andreas had made at Spotify for just, like, saving experiments to cloud storage automatically. But other people didn't really want this, so we kind of gave up on that. And that was actually originally called Replicate, and we renamed it out of the way. So it's now called Keepsake, and I think some people still use it. Then we sort of came back, we looped back to our original idea. So we were like, oh, maybe there was a thing in that thing we were originally sort of thinking about, of researchers sharing their work and containers for machine learning models. So we just built that. And at that point we were kind of running out of the YC money. So we were like, okay, this feels good though. Let's give this a shot. So that was the point we raised a seed round. We raised it pre-launch and pre-revenue. It was an idea, basically. We had a little prototype. It was just an idea and a team. But we were like, okay, you know, bootstrapping this thing is getting hard, so let's actually raise some money. Then we made Cog and Replicate. It initially didn't have APIs, interestingly. It was just the bit that I was talking about before, of helping researchers share their work. So it was a way for researchers to put their work on a webpage such that other people could try it out, and so that you could download the Docker container. We cut the benchmarks part of it because we thought that was just too complicated. But it had a Docker container that, like, you know, Andreas in a past life could download and run with his benchmark, and you could compare all these models apples to apples. So that was the theory behind it. That kind of started to work. It was still, you know, a long time pre-AI hype, and there was lots of interesting stuff going on, but it was very much in the classic deep learning era. So sort of image segmentation models and sentiment analysis and all these kinds of things, you know, that people were using deep learning models for. And we were very much building for research, because all of this stuff was happening in research institutions, you know, the sort of people who'd be publishing to arXiv. So we were creating accompanying material for their models, basically. You know, they wanted a demo for their models, and we were creating that accompanying material for them. What was funny about that is they were not very good users. They were doing great work, obviously, but the way that research worked is that they just made one thing every six months, and they just fired and forgot it. They published this piece of paper and, like, done, I've published it. So they output it to Replicate and then they just stopped using Replicate.
You know, they were like once-every-six-months users, and that wasn't great for us, but we stumbled across this early community. This was early 2021, when OpenAI created CLIP, and people started smushing CLIP and GANs together to produce image generation models. And this started with, you know, just a bunch of tinkerers on Discord, basically. There was an early model called Big Sleep by Advadnoun. And then there was VQGAN-CLIP, which was a bit more popular, by Rivers Have Wings. And it was all just people tinkering on stuff in Colabs, and it was very dynamic, and it was people just making copies of Colabs and playing around with things and forking them. And to me, I saw this and I was like, oh, this feels like open source software, so much more than the research world where people are publishing these papers.Swyx [00:25:48]: You don't know their real names and it's just like a Discord.Ben [00:25:51]: Yeah, exactly. But crucially, it was like people were tinkering and forking and things were moving really fast, and it just felt like this creative, dynamic, collaborative community in a way that research wasn't really, like it was still stuck in this kind of six month publication cycle. So we just kind of latched onto that and started building for this community. And you know, a lot of those early models were published on Replicate. I think the first one that was really primarily on Replicate was one called Pixray, which was sort of mid 2021, and it had a really cool pixel art output. But it also just produced general, you know, they weren't like crisp images, but they were quite aesthetically pleasing, like some of these early image generation models. And, you know, that was published primarily on Replicate, and then a few other models around that were published on Replicate. And that's where we really started to find our early community, and where we really found like, oh, we've actually built a thing that people want. And they were great users as well. And people really wanted to try out these models. Lots of people were running the models on Replicate. We still didn't have APIs though, interestingly, and this is another really complicated part of the story. We had no idea what a business model was, still, at this point. I don't think people could even pay for it. You know, it was just these web forms where people could run the model.Swyx [00:27:06]: Just for historical interest, which Discords were they, and how did you find them? Was this the LAION Discord? Yeah, LAION. This is Eleuther.Ben [00:27:12]: Eleuther, yeah. It was the Eleuther one. These two, right? There was a channel where VQGAN-CLIP, this was early 2021, was set up as a Discord bot. I just remember being completely captivated by this thing. I was just playing around with it all afternoon, in Discord, and then, oh s**t, it's 2am. You know, yeah.Swyx [00:27:33]: This is the beginnings of Midjourney.Ben [00:27:34]: Yeah, exactly. And Stability. It was the start of Midjourney. And you know, it's where that kind of user interface came from. What's beautiful about the user interface is you could see what other people are doing, and you could riff off other people's ideas. And it was just so much fun to play around with this in a channel full of a hundred people. And yeah, that just completely captivated me, and I'm like, okay, this is something, you know.
So like, we should get these things on Replicate. Yeah, that's where that all came from.Swyx [00:28:00]: And then you moved on to, so was it APIs next or was it Stable Diffusion next?Ben [00:28:04]: It was APIs next. And the APIs happened because of one of our users. Our web form had an internal API for making the web form work, like an API that was called from JavaScript. And somebody reverse engineered that to start generating images with a script. You know, they did the Web Inspector, Copy as cURL thing, and figured out what the API request was. And it wasn't secured or anything.Swyx [00:28:28]: Of course not.Ben [00:28:29]: They started generating a bunch of images, and we got tons of traffic, and, like, what's going on? And I think a sort of usual reaction to that would be like, hey, you're abusing our API, and to shut them down. And instead we're like, oh, this is interesting. People want to run these models. So we documented the API, like our internal API, in a Notion document and messaged this person being like, hey, you seem to have found our API. Here's the documentation. That'll be like a thousand bucks a month, please, with a Stripe form we just clicked some buttons to make. And they were like, sure, that sounds great. So that was our first customer.Swyx [00:29:05]: A thousand bucks a month.Ben [00:29:07]: It was a surprising amount of money. That's not casual. It was on the order of a thousand bucks a month.Swyx [00:29:11]: So was it a business?Ben [00:29:13]: It was the creator of Pixray. He generated NFT art. And so he made a bunch of art with these models and was, you know, selling these NFTs, effectively. And I think lots of people in his community were doing similar things. And he then referred us to other people who were also generating NFTs, and they joined us to run models. We started our API business. Yeah. Then we made an official API and actually added some billing to it, so it wasn't just a fixed fee.Swyx [00:29:40]: And now people think of you as the hosted models API business. Yeah, exactly.Ben [00:29:44]: But that just turned out to be our business, you know. But what ended up being beautiful about this is it was really fulfilling. The original goal of what we wanted to do is that we wanted to make this research that people were making accessible to other people and for it to be used in the real world. And this was just, ultimately, the right way to do it, because all of these people making these generative models could publish them to Replicate, and they wanted a place to publish them. And software engineers, you know, like myself, like, I'm not a machine learning expert, but I want to use this stuff, could just run these models with a single line of code. And we thought, oh, maybe the Docker image is enough, but it's actually super hard to get the Docker image running on a GPU and stuff. So it really needed to be the hosted API for this to work and to make it accessible to software engineers. And we just wound our way to this. Yeah.Swyx [00:30:30]: Two years to the first paying customer. Yeah, exactly.Alessio [00:30:33]: Did you ever think about becoming Midjourney during that time? You have so much interest in image generation.Swyx [00:30:38]: I mean, you're doing fine, for the record, but, you know, it was right there, you were playing with it.Ben [00:30:46]: I don't think it was our expertise.
Like, I think our expertise was DevTools, whereas Midjourney is almost like a consumer product, you know? So I don't think it was our expertise. It certainly occurred to us. I think at the time we were thinking about, oh, maybe we could hire some of these people in this community and make great models and stuff like this. But we ended up more on the tooling side. Like I was saying before, I'm not really a researcher, but I'm more like the tool builder, the behind the scenes. And I think both me and Andreas are like that.Swyx [00:31:09]: I think this is an illustration of the tool builder philosophy. Something you latch onto in DevTools is, when you see people behaving weird, it's not their fault, it's yours. And you want to pave the cow paths, is what they say, right? Like, the unofficial paths that people are making, make them official and make them easy for them, and then maybe charge a bit of money.Alessio [00:31:25]: And now fast forward a couple of years, you have 2 million developers using Replicate. Maybe more. That was the last public number that I found.Ben [00:31:33]: It's 2 million users. Not all those people are developers, but a lot of them are developers, yeah.Alessio [00:31:38]: And then 30,000 paying customers was the number. Latent Space runs on Replicate: we're a small podcaster and we host a Whisper diarization model on Replicate, and we're paying, so Latent Space is in the 30,000. You raised a $40 million Series B. I would say that maybe the Stable Diffusion time, August '22, was really when the company started to break out. Tell us a bit about that and the community that came out of it, and I know now you're expanding beyond just image generation.Ben [00:32:06]: Yeah, like, I think we kind of set ourselves up for it. We saw there was this really interesting generative image world going on, so we were building the tools for that community already, really. And we knew Stable Diffusion was coming out. We knew it was a really exciting thing, you know. It was the best generative image model so far. I think the thing we underestimated was just what an inflection point it would be. I think Simon Willison put it this way, where he said something along the lines of: it was a model that was just good enough, and open source, and tinkerable, such that it just kind of took off in a way that none of the models had before. And what was really neat about Stable Diffusion is it was open source, compared to, like, DALL-E, for example, which was sort of equivalent quality. And in the first week we saw people making animation models out of it. We saw people make game texture models that use circular convolutions to make repeatable textures. We saw, you know, a few weeks later, people were fine tuning it so you could put your face in these models, and all of these other-Swyx [00:33:10]: Textual inversion.Ben [00:33:11]: Yep. Yeah, exactly. That happened a bit before that. And all of this sort of innovation was happening all of a sudden. And people were publishing on Replicate, because you could just publish arbitrary models on Replicate. So we had this sort of supply of interesting stuff being built. But because it was a sufficiently good model, there was also just a ton of people building with it. They were like, oh, we can build products with this thing.
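(This is roughly what that "build products with this thing" experience became: running a hosted model in a few lines with Replicate's Python client. The model reference below is a placeholder, not a real identifier; you would pip install replicate, set REPLICATE_API_TOKEN, and substitute a real model name and version hash.)

```python
import replicate

output = replicate.run(
    "some-owner/some-image-model:VERSION_HASH",  # placeholder model reference
    input={"prompt": "an astronaut riding a horse, detailed, 4k"},
)
print(output)  # for image models, typically a list of URLs to the outputs
```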
And this was about the time when people were starting to get really interested in AI. So tons of product builders wanted to build stuff with it. And we were just sitting there in the middle, like the interface layer between all these people who wanted to build and all these machine learning experts who were building cool models. And that's really where it took off. We were just sort of incredible supply, incredible demand, and we were just in the middle. And then, yeah, since then, we've just kind of grown and grown, really. And we've been building a lot for the indie hacker community, these individual tinkerers, but also startups, and a lot of large companies as well who are sort of exploring and building AI things. Then kind of the same thing happened in the middle of last year with language models and Llama 2, where the same kind of Stable Diffusion effect happened with Llama. And Llama 2 was our biggest week of growth ever, because tons of people wanted to tinker with it and run it. And you know, since then we've just been seeing a ton of growth in language models as well as image models. Yeah. We're just kind of riding a lot of the interest that's going on in AI and all the people building in AI, you know. Yeah.Swyx [00:34:29]: Kudos. Right place, right time. But also, you know, it took a while to position for the right place before the wave came. I'm curious if you have any insights on these different markets. So Pieter Levels, notably very loud person, very picky about his tools. I wasn't sure actually if he used you. He does. You mentioned him in your Series B blog post, and Danny Postma as well, his competitor, all in that wave. What are their needs versus, you know, the more enterprise or B2B type needs? Did you come to a decision point where you're like, okay, you know, how serious are these indie hackers versus the actual businesses that are bigger and perhaps better customers because they're less churny?Ben [00:35:04]: They're surprisingly similar, because I think a lot of people right now want to use and build with AI, but they're not AI experts and they're not infrastructure experts either. So they want to be able to use this stuff without having to figure out all the internals of the models and, you know, touch PyTorch and whatever. And they also don't want to be setting up and booting up servers. And that's the same all the way from indie hackers just getting started, because obviously you just want to get started as quickly as possible, all the way through to large companies who want to be able to use this stuff but don't have all of the experts on staff. You know, big companies like Google and so on do actually have a lot of experts on staff, but the vast majority of companies don't. And they're all software engineers who want to be able to use this AI stuff, but they just don't know how to use it. And, like, you really need to be an expert, and it takes a long time to learn the skills to be able to use that. So they're surprisingly similar in that sense. I think it's also kind of unfair to the indie community: they're not surprisingly churny or spiky. They're building real, established businesses, which is, like, kudos to them, building these really large, sustainable businesses, often just as solo developers.
And it's kind of remarkable how they can do that, actually, and it's a credit to a lot of their product skills. And, you know, we're just there to help them, being their machine learning team, effectively, to help them use all of this stuff. A lot of these indie hackers are some of our largest customers, alongside some of our biggest customers that you would think would be spending a lot more money than them, but yeah.Swyx [00:36:35]: And we should name some of these. So you have them on your landing page: you have BuzzFeed, you have Unsplash, Character AI. What do they power? What can you say about their usage?Ben [00:36:43]: Yeah, totally. It's various things.Swyx [00:36:46]: Well, I mean, I'm naming them because they're on your landing page, so you have logo rights. It's useful for people, like, I'm not imaginative. It's monkey see monkey do, right? Like, if I see someone doing something that I want to do, then I'm like, okay, Replicate's great for that.Ben [00:37:00]: Yeah, yeah, yeah.Swyx [00:37:01]: So that's what I think about case studies on company landing pages, is that it's just a way of explaining, like, yep, this is something that we are good for. Yeah, totally.Ben [00:37:09]: I mean, these companies are doing things all the way up and down the stack at different levels of sophistication. So Unsplash, for example, they actually publicly posted this story on Twitter where they're using BLIP to annotate all of the images in their catalog. So, you know, they have lots of images in the catalog and they want to create a text description of each so you can search for them. And they're annotating images with, you know, an off the shelf open source model. You know, we have this big library of open source models that you can run, and we've got lots of people running these open source models off the shelf. And then most of our larger customers are doing more sophisticated stuff. So they're fine tuning the models, they're running completely custom models on us. A lot of these larger companies are using us for a lot of their, you know, inference, but it's a lot of custom models, and them writing the Python themselves, because they've got machine learning experts on the team. And they're using us for their inference infrastructure, effectively. And so it's lots of different levels of sophistication, where some people are using these off the shelf models, some people are fine tuning models. So, like, Pieter Levels is a great example, where a lot of his products are based on fine tuning image models, for example. And then we've also got larger customers who are just using us as infrastructure, effectively. So yeah, it's all things up and down the stack.Alessio [00:38:29]: Let's talk a bit about Cog and the technical layer. So there are a lot of GPU clouds. I think people have different pricing points, and I think everybody tries to offer a different developer experience on top of it, which then lets you charge a premium. Why did you want to create Cog? You worked at Docker. What were some of the issues with traditional container runtimes?
And maybe, yeah, what surprised you as you built it?Ben [00:38:54]: Cog came right from the start, actually, when we were thinking about this, you know, evaluation, the sort of benchmarking system for machine learning researchers, where we wanted researchers to publish their models in a standard format that was guaranteed to keep on running, that you could replicate the results of. Like, that's where the name came from. And we realized that we needed something like Docker to make that work, you know. And I think it was just natural from my point of view that obviously that should be open source, that we should try and create some kind of open standard here that people can share, because if more people use this format, then that's great for everyone involved. I think the magic of Docker is not really in the software. It's just the standard that people have agreed on: here are a bunch of keys for a JSON document, basically. And, you know, that was the magic of the metaphor of real containerization as well. It's not the containers that are interesting. It's just the size and shape of the damn box, you know. And it's a similar thing here, where really we just wanted to get people to agree on: this is what a machine learning model is. This is how a prediction works. This is what the inputs are, this is what the outputs are. So Cog is really just a Docker container that attaches to a CUDA device, if it needs a GPU, and that has an OpenAPI specification as a label on the Docker image. And the OpenAPI specification defines the interface for the machine learning model, like the inputs and outputs, effectively, or the params, in machine learning terminology. And, you know, we just wanted to get people to kind of agree on this thing. And it's general purpose enough. Some of the existing things were at the graph level, but we really wanted something general purpose enough that you could just put anything inside this, since it's just arbitrary software. And, you know, it'd be forward compatible with future inference servers and future machine learning model formats and all this kind of stuff. So that was the intent behind it. It just came naturally that we wanted to define this format. And that's been really working for us. A bunch of people have been using Cog outside of Replicate, which is kind of our original intention. Like, this should be how machine learning is packaged and how people should use it. It's common to use Cog in situations where maybe they can't use the SaaS service because, I don't know, they're in a big company and they're not allowed to use a SaaS service, but they can use Cog internally still. And they can download the models from Replicate and run them internally in their org, which we've been seeing happen. And that works really well. People who want to build custom inference pipelines, but don't want to reinvent the world, can use Cog off the shelf and use it as a component in their inference pipelines. We've been seeing tons of usage like that, and it's just been kind of happening organically. We haven't really been trying, you know, but it's there if people want it, and we've been seeing people use it. So that's great. Yeah.
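(Concretely, a Cog model is roughly this shape, going by Cog's public docs: a cog.yaml declaring the environment plus a typed Predictor that Cog introspects to generate the OpenAPI schema Ben mentions. The load_model helper is hypothetical; treat the details as a sketch rather than gospel.)

```python
# predict.py: the rough shape of a Cog model. The accompanying cog.yaml,
# shown as comments to keep this a single file:
#   build:
#     gpu: true
#     python_version: "3.10"
#     python_packages:
#       - "torch==2.0.1"
#   predict: "predict.py:Predictor"
from cog import BasePredictor, Input

class Predictor(BasePredictor):
    def setup(self):
        # Runs once when the container boots: load weights here, not per request.
        self.model = load_model("./weights")  # hypothetical helper

    def predict(self, prompt: str = Input(description="Text prompt")) -> str:
        # Inputs and outputs are plain JSON-able types: human readable things
        # rather than the tensor level, as Ben puts it.
        return self.model(prompt)
```

From there, `cog build` packages the image and `cog predict -i prompt="hello"` runs it locally, per Cog's documented CLI.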
So a lot of it is just sort of philosophical, of, this is how it should work, from my experience at Docker, you know. And there's just a lot of value in the core being open, I think, and in other people being able to share it, and it's an integration point. So, you know, if Replicate, for example, wanted to work with a testing system, like a CI system or whatever, we can just interface at the Cog level. That system just needs to accept Cog models, and then you can test your models on that CI system before they get deployed to Replicate. And it's just a format that we can get everyone to agree on, you know.Alessio [00:41:55]: What do you think, I guess, Docker got wrong? Because if I look at a Docker Compose and a Cog definition, first of all, the Cog is kind of like the Dockerfile plus the Compose, versus in Docker Compose you're just exposing the services. And also Docker Compose is very, like, ports driven, versus you have the actual, you know, predict, this is what you have to run.Ben [00:42:16]: Yeah.Alessio [00:42:17]: Any learnings, and maybe tips for other people building container based runtimes, like how much should you separate the API services versus the image building, or how much do you want to build them together?Ben [00:42:29]: I think it was coming from two sides. We were thinking about the design from the point of view of user needs, what are their problems and what problems can we solve for them, but also what the interface should be for a machine learning model. And it was sort of the combination of two things that led us to this design. So the thing I talked about before was a little bit of the interface around the machine learning model. So we realized that we wanted it to be general purpose. We wanted it to be at the JSON, like, human readable level, rather than at the tensor level. So it was an OpenAPI specification that wrapped a Docker container. And that's where that design came from. And it's really just a wrapper around Docker. So we were kind of building on, standing on shoulders there, but Docker on its own is too low level: it's just arbitrary software. So we wanted to have an OpenAPI specification that defined the function, effectively, that is the machine learning model, but also how that function is written, how that function is run, which is all defined in code and stuff like that. So it's a bunch of abstraction on top of Docker to make that work. And that's where that design came from. But the core problems we were solving for users were that Docker is really hard to use and that productionizing machine learning models is really hard. So on the first part of that, we knew we couldn't use Dockerfiles. Dockerfiles are hard enough for software developers to write. I'm saying this with love as somebody who worked on Docker and, like, worked on Dockerfiles, but they're really hard to use. And you need to know a bunch about Linux, basically, because you're running a bunch of CLI commands. You need to know a bunch about Linux and best practices and how apt works and all this kind of stuff. So we were like, okay, we can't get to that level. We need something that machine learning researchers will be able to understand, like people who are used to Colab notebooks. And what they understand is, they're like, I need this version of Python. I need these Python packages. And somebody told me to apt-get install something. You know? If there was sudo in there, I don't really know what that means.
So we tried to create a format that was at that level, and that's what cog.yaml is. And we were really kind of trying to imagine, what is that machine learning researcher going to understand, you know, and trying to build for them. Then the productionizing machine learning models thing is, like, okay, how can we package up all of the complexity of productionizing machine learning models, like picking CUDA versions, hooking it up to GPUs, writing an inference server, defining a schema, doing batching, all of these just really gnarly things that everyone does again and again, and just, you know, provide that as a tool. And that's where that side of it came from. So it's combining those user needs with, you know, the sort of world need of a common standard for what a machine learning model is. And that's how we thought about the design. I don't know whether that answers the question.Alessio [00:45:12]: Yeah. So your idea was like, hey, you really want what Docker stands for in terms of a standard, but you actually don't want people to do all the work that goes into Docker.Ben [00:45:22]: It needs to be higher level, you know?Swyx [00:45:25]: So I want to, for the listener, point out you're not the only standard that is out there. As with any standard, there must be 14 of them. You are surprisingly friendly with Ollama, who are your former colleagues from Docker, who came out with the Modelfile. Mozilla came out with the llamafile. And then, I don't know if this is in the same category even, but I'm just going to throw it in there: Hugging Face has the transformers and diffusers libraries, which are a way of disseminating models that obviously people use. How would you compare and contrast your approach with Cog versus all these?Ben [00:45:53]: It's kind of complementary, actually, which is kind of neat, in that a lot of transformers, for example, is lower level than Cog. So it's a Python library, effectively, but you still need to like...Swyx [00:46:04]: Expose them.Ben [00:46:05]: Yeah. You still need to turn that into an inference server. You still need to install the Python packages and that kind of thing. So lots of Replicate models are transformers models and diffusers models inside Cog, you know? So that's the level it sits at. So it's very complementary in some sense. We're kind of working on an integration with Hugging Face such that you can deploy models from Hugging Face into Cog models and stuff like that, to Replicate. And some of these things, like llamafile and what Ollama are working on, are also very complementary, in that they're doing a lot of the running these things locally on laptops, which is not a thing that works very well with Cog. Cog is really designed around servers and attaching to CUDA devices and NVIDIA GPUs and this kind of thing. So we're actually figuring out ways that those things can be interoperable, because, you know, they should be, and they are quite complementary, in that you should be able to take a model on Replicate and run it on your local machine, and you should be able to take a model on your local machine and run it in the cloud.Swyx [00:47:02]: Is the base layer something like, is it at, like, the GGUF level, which, by the way, I need to get a primer on the different formats that have emerged, or is it at the star-dot-file level, which is Modelfile, llamafile, whatever, or is it at the Cog level?
I don't know, to be honest.Ben [00:47:16]: And I think this is something we still have to figure out. There's a lot yet to figure out, like exactly where those lines are drawn. I don't know exactly. I think this is something we're trying to figure out ourselves, but I think there's certainly a lot of promise in these systems interoperating. We just want things to work together. You know, we want to try and reduce the number of standards. So the more these things can interoperate and, you know, convert between each other and that kind of stuff, the better, at the minute.Swyx [00:47:34]: Cool. Well, there's a foundation for that.Alessio [00:47:36]: Andreas comes out of Spotify, Erik from Modal also comes out of Spotify. You worked at Docker and the Ollama guys worked at Docker. Did both you and Andreas know that there was somebody else you had worked with that had a kind of, not similar idea, but was interested in the same thing? Or did you then just say, oh, I know those people, they're doing something very similar?Ben [00:47:58]: We learned about both early on, actually, yeah, because we know them both quite well. And it's funny how I think we're all seeing the same problems and just, you know, trying to fix the same problems that we're all seeing. I think the Ollama one's particularly funny, because I joined Docker through my startup. Funnily, actually, the thing which worked for my startup was Compose, but we were actually working on another thing, which was a bit like EC2 for Docker. So we were working on productionizing Docker containers. And the Ollama guys were working on a thing called Kitematic, which was a bit like a desktop app for Docker. And our companies both got bought by Docker at the same time. And, you know, Kitematic turned into Docker Desktop. And then, you know, our thing turned into Compose. And it's funny how we're both applying the things we saw at Docker to the AI world, but they're building the local environment for this and we're building the cloud for it. And yeah, so that's just really pleasing. And I think, you know, we're collaborating closely, because there's just so much opportunity for working together there.Swyx [00:49:06]: You have a hammer. Everything's a nail.Ben [00:49:07]: Yeah, exactly. Exactly. So I think a lot of where we're coming from with AI is, on the Replicate team, we're all kind of people who have built developer tools in the past. We've got a team where, like, I worked at Docker, we've got people who worked at Heroku and GitHub and, like, the iOS ecosystem and all this kind of thing, like the previous generation of developer tools, where we figured out a bunch of stuff. And then AI has come along, and we just don't yet have those tools and abstractions to make it easy to use. So we're trying to take the lessons that we learned from the previous generation of stuff and apply them to this new generation of stuff. And obviously there's a bit of nuance there, because the trick is to take the right lessons and do new stuff where it makes sense. You can't just cut and paste, you know. But that's how we're approaching this: we're trying, as much as possible, to take some of those lessons we learned from, you know, how Heroku and GitHub were built, for example, and apply them to AI.Swyx [00:50:05]: We should also talk a little bit about your compute availability. We're trying to ask this of everyone, you know, it's Compute Provider Month. Do you own your own GPUs?
How many do you have access to? What do you feel about the tightness of the GPU market?Ben [00:50:17]: We don't own our own GPUs. We've got a few that we play around with, but not for production workloads. And we are primarily built on public clouds, so primarily GCP and CoreWeave, and some smatterings elsewhere.Swyx [00:50:29]: None from NVIDIA, which is your newest investor?Ben [00:50:31]: We work with NVIDIA, so, you know, they're kind of helping us get GPU availability. GPUs are hard to get hold of. If you go to AWS and ask for one A100, they won't give you an A100. But if you go to AWS and say, I would like 100 A100s in two years, they're like, sure, we've got some. And I think the problem is that makes sense from their point of view. They want just reliable, sustained usage. They don't want spiky usage and wastage in their infrastructure, which makes total sense. But that makes it really hard for startups, you know, who are wanting to just get hold of GPUs. I think we're in a fortunate position where we can aggregate demand, so we can make commits to cloud providers. And then, you know, we actually have good availability. Like, we don't have infinite availability, obviously, but, you know, if you want an A100 from Replicate, you can get it. But, you know, we're seeing other companies pop up as well. SF Compute's a great example of this, where they're doing the same idea for training, almost, where, you know, a lot of startups need to be able to train a model, but they can't get hold of GPUs from large cloud providers. So SF Compute is letting people rent, you know, 10 H100s for two days, which is just impossible otherwise. And, you know, what they're effectively doing there is aggregating demand such that they can make a big commit to the cloud provider and then let people use smaller chunks of it. And that's kind of what we're doing with Replicate as well. We're aggregating demand such that we make big commits to the cloud providers. And, you know, then people can run a 100 millisecond API request on an A100.Swyx [00:51:51]: So, you know, coming from a finance background, this sounds surprisingly similar to banks, where the job of a bank is maturity transformation, is what you call it. You take short term deposits, which technically can be withdrawn at any time, and you turn that into long term loans for mortgages and stuff, and you pocket the difference in interest. And that's the bank.Ben [00:52:09]: Yeah, that's exactly what we're doing.Swyx [00:52:11]: So you run a bank.Ben [00:52:12]: Yeah, it's a bank. Right, yeah. And it's very much a finance problem as well, because we have to make bets on the future demand and value of GPUs, yeah.Swyx [00:52:21]: What are you... Okay, I don't know how much you can disclose, but what are you forecasting? Down? Up a lot? Yeah. Up 10x?Ben [00:52:30]: I can't really say. We're projecting our growth with some educated guesses about what kind of models are going to come out and what kind of hardware they'll need to run, you know? We need to bet that, like, okay, maybe language models are getting larger, so we need to have GPUs with a lot of RAM, or multi-GPU nodes. Or maybe models are getting smaller, and we actually need smaller GPUs, you know. We have to make some educated guesses about that kind of stuff, yeah.Swyx [00:52:50]: Yeah. Speaking of which, the mixture of experts models must be throwing a spanner into the planning.Ben [00:52:56]: Not so much.
We've got multi-GPU A100 machines, which can run those, and multi-GPU H100 machines, which can run those, no problem. So we're set up for that. Okay.Swyx [00:53:04]: Right. I didn't expect it to be so easy. My impression was that the amount of RAM per model was increasing a lot, especially on a sort of per parameter basis, per active parameter basis, going from, like, Mixtral being eight experts to, like, the DeepSeek MoE models, I don't know if you saw them, being like 30, 60 experts. And you can see it keep going up, I guess.Ben [00:53:26]: Yeah. I think we might run into problems at some point, and, yeah, I don't know exactly what's going on there. I think something that we're finding, which is kind of interesting, like, I don't know this in depth, you know, is we're certainly seeing a lot of good results from lower precision models. So, like, you know, 90% of the performance with just much less RAM required. That means that we can run them on GPUs we have available, and it's good for customers as well, because it runs faster, and they want that trade-off, you know, where it's just slightly worse, but way faster and cheaper.Alessio [00:53:55]: Do you see a lot of GPU waste, in terms of people running the thing on a GPU that is, like, too advanced? I think we use a T4 to run Whisper. So we're at the bottom end of it. Yeah. Any thoughts? I think one of the hackathons we were at, people were like, oh, how do I get access to, like, H100s? And it's like, you need to run, like... Dude, you don't need H100s.Ben [00:54:14]: You don't need H100s. Yeah. Yeah. Well, if you want low latency, like, sure, spend a lot of money on the H100. Yeah. We see a ton of that kind of stuff. And it's surprisingly hard to optimize these models right now. So a lot of people are just running really unoptimized models. We're doing the same, honestly. Like, a lot of models on Replicate just haven't been optimized very well. So something we want to be able to help people with is optimizing those models. Either we show people how to with guides, or we make it easier to use some of these more optimized inference servers, or we show people how to compile the models, or we do that automatically, or something like that. But that's still something we're exploring, because there's so much wastage. And it's not just wasting the GPUs. It's also a bad experience, and the models run slow. So the models on Replicate are almost all pushed by our community. People have pushed those models themselves. But it's like a big head of distribution, where there's a long tail of lots of models that people have pushed, and then a big head of the models most people run. So models like Llama 2, like Stable Diffusion, you know, we work with Meta and Stability to maintain those models, and we've done a ton of optimization to make these really fast. So those models are optimized, but the long tail is not, and there's a lot of wastage there.Alessio [00:55:32]: And going into the, well, it's already the new year. Do you see the customer demand and the GPU, like, hardware demand kind of staying together? Because I think a lot of people are saying, oh, there's like hundreds of thousands of GPUs being shipped this year, like the crunch is going to be over. But you also have millions of people that now care about using AI. You know, how do you see the two lines progressing? Is customer demand going to outpace the GPU growth? Do you see them growing together?
Do you see maybe a lot of this model improvement work helping alleviate that?Ben [00:56:04]: That's a really good question. From our point of view, demand is not outpacing the supply of GPUs — from our point of view, we have enough GPUs to go around — but that might change, for sure. Yeah.Alessio [00:56:15]: That's a very nicely put way for a startup founder to respond.Swyx [00:56:21]: So Alessio's framing was more about picking the wrong box for the model, whereas yours is more about maybe the inference stack, if you can call it that. Were you referencing vLLM? What other techniques are you referencing? Also keeping in mind that when I talk to your competitors — we don't have to name any of them — they are working on trying to optimize the models themselves. They'll basically quantize the models for you with their special stack, so you use their versions of Llama 2, their versions of Mistral, and that's one way to approach it. I don't see it as the Replicate DNA to do that, because you would have to slap the Replicate house brand on something. I mean, just comment on any of that. What do you mean when you say optimize models?Ben [00:57:05]: Things like quantizing the models — you can imagine a way that we could help people quantize their models if we want to. We've had success using inference servers like vLLM and TensorRT-LLM, and we're using those kinds of things to serve language models. We've had success with things like AITemplate, which compiles the models. And there are even some really boring things of just making the code more efficient: when people are just writing some Python code, it's really easy to write inefficient Python code. So there are really boring things like that as well, but it's a whole mash of things like that.Swyx [00:57:40]: You will do that for a customer? Like you look at their code and-Ben [00:57:43]: Yeah, we've certainly helped some of our customers do some of that stuff. And a lot of the popular models on Replicate we've rewritten to use that stuff as well. The Stable Diffusion that we run, for example, is compiled with AITemplate to make it super fast, and it's all open source — you can see all of this stuff on GitHub if you want to see how we do it. But you can imagine ways that we could help people more. It's almost like it's built into the Cog layer, maybe, where we help people use these fast inference servers or use AITemplate to compile their models to make them faster. Whether it's manual, semi-manual, or automatic, we're not really sure, but that's something we want to explore because it benefits everyone.
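For a concrete sense of what "using a more optimized inference server" can look like, here is a minimal sketch with vLLM, one of the servers Ben names. The model choice and settings are illustrative, not Replicate's actual stack:

```python
# A minimal vLLM serving sketch (illustrative model and settings, not
# Replicate's stack). Lower-precision variants -- e.g. passing
# quantization="awq" with AWQ weights -- trade a little quality for much
# less VRAM, the trade-off Ben describes above.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # downloads weights on first run
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain GPU utilization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```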
Swyx [00:58:21]: And then on the competitive piece: there was a price war on Mixtral this last December. As far as I can tell, you guys did not enter that war. You have Mixtral, but it's just regular pricing. I think also some of these players are probably losing money on their pricing — you don't have to say anything, but the break-even is somewhere between 50 to 75 cents per million tokens served. How are you thinking about the overall competitiveness in the market? How should people choose when everyone's an API?Ben [00:58:50]: So for Llama 2 and Mistral — I think not Mixtral, I can't remember exactly — we have similar performance and similar price to some of these other services. We're not bargain basement like some of the others, because to your point, we don't want to burn tons of money. We're pricing it sensibly and sustainably, to a point where we think it's competitive with other people. We want developers using Replicate, and we don't want to price it such that it's only affordable by big companies — we want to make it cheap enough that developers can afford it — but we also don't want super cheap prices, because then every customer effectively loses you money, and the more customers you get, the worse it gets. So we're pricing it sensibly, but still to the point where hopefully it's cheap enough to build on. And the thing we really care about — obviously we want models on Replicate to be comparable to other people's — is that, particularly in open source, it's not just the API for the model that is the important bit. Quite often with open source models, the whole point of open source is that you can tinker on it: you can customize it, you can fine tune it, you can smush it together with another model, like LLaVA, for example. And you can't do that if it's just a hosted API, because you can't touch the code. So what we want to do with Replicate is build a platform that's actually open. We've got all of these models where the performance and price is on par with everything else, but if you want to customize one, you can fine tune it, you can go to GitHub and get the source code for it, edit the source code, and push up your own custom version, that kind of thing. Because the crucial thing for open source machine learning is being able to tinker on it and customize it, and we think that's really important to make open source AI work.Alessio [01:00:39]: You mentioned open source. How do you think about levels of openness? When Llama 2 came out, I wrote a post about this: there's open source, then there's open weights, then there's restrictive weights. It was on the front page of Hacker News, so there were all sorts of comments from people. So I'm always curious to hear your thoughts. What do you think is okay for people to license? What's okay for people not to release?Ben [01:01:03]: You know, before, it was just closed source big models and purely open source little models. We're now seeing lots of variations, where model companies put restrictive licenses on their models — meaning they can only be used for non-commercial purposes — and a lot of the open source crowd complains that it's not true open source, all that kind of thing. I think a lot of that is coming from philosophy, the sort of free software movement kind of philosophy, and I don't think it's necessarily a bad thing. I think it's good that model companies can make money out of their models; that's what will incentivize people to make more models. And I think it's totally fine for somebody who made something to ask for some money in return if you're making money out of it. I think that's totally okay.
And I think there are some really interesting midpoints as well, where people release the code and you can still tinker on it, but the person who trained the model still wants a cut if you're making a bunch of money out of it. And I think that's good, and that's going to make the ecosystem more sustainable. I don't think anybody's really figured it out yet. We're going to see more experimentation with this and more people trying to figure out what the business models are around building models and how to make money from them, and we'll just see where it ends up. And it's something we want to support at Replicate as well, because we believe in open source — we think it's great — but there are also going to be lots of models which are closed source. There's probably going to be a long tail of people building models who don't have the reach that OpenAI has, and hopefully, as Replicate, we can help those people find developers and help them make money and that kind of thing.Alessio [01:02:46]: I think the compute requirements of AI kind of changed the thing. I started an open source company; I'm a big open source fan. Before, man hours were really all that went into open source — there wasn't much monetary investment. Well, not that man hours aren't worth a lot, but if you think about Llama 2, it's like $25 million all in. You can't just spin up a Discord and spend $25 million. So I think it's net positive for everybody that Llama 2 is open source. And, well, "open source" — people, like you're saying, argue about the semantics of the term, but all we care about is that Llama 2 is open. Because if Llama 2 wasn't open today, if Mistral was not open source, we would be in a bad spot, you know?Ben [01:03:33]: And I think the nuance here is making sure that these models are still tinkerable, because the beautiful thing about Llama 2 as a base model is that, yeah, it costs $25 million to train to start with, but then you can fine tune it for like 50 bucks. And that's what's so beautiful about the open source ecosystem. Something that completely surprised me as well: I think a lot of people assumed that open source machine learning just wasn't going to be practical, because it's so expensive to train these models. But fine tuning is unreasonably effective, people are getting really good results out of it, and it's really cheap. So people can effectively create open source models really cheaply, and there's going to be this sort of ecosystem of tons of models being made. I think the risk there, from a licensing point of view, is that we need to make sure the licenses let people do that — because if you release a big model under a non-commercial license and people can't fine tune it, you've lost the magic of it being open. And I'm sure there are ways to structure it such that the person paying $25 million feels like they're compensated somehow, so they keep on training models and people can keep on fine tuning them. But I guess we just have to figure out exactly how that plays out.
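Ben's "$25 million to train, 50 bucks to fine tune" point maps onto parameter-efficient fine-tuning. A hedged sketch with Hugging Face's peft library — the base model and hyperparameters here are illustrative, not Replicate's training code:

```python
# A sketch of cheap LoRA fine-tuning on an open base model (illustrative
# names and hyperparameters, not Replicate's pipeline).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-2-7b-hf"
base = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of base weights

# ...train the adapter on a small dataset with a standard Trainer loop; only
# the adapter weights need to be stored and shared, which is what keeps the
# marginal cost of a custom model in "50 bucks" territory.
```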
Swyx [01:04:46]: Excellent. So, just wanted to round it out — you've been excellent, very open. I should have started my intro with this, but I feel like you found the sort of AI engineer crew before I did. Something that really resonated with me in the Series B announcement was that you put in some stats about how there are two orders of magnitude more software engineers than there are machine learning engineers: about 30 million software engineers and 500,000 machine learning engineers. You can maybe plus or minus one of those orders of magnitude, but it's in that ballpark. And so obviously there will be a lot more AI engineers than there will be ML engineers. How do you see this group? Is it all software engineers? Are they going to specialize? What would you advise someone trying to become an AI engineer? Is this a legitimate career path?Ben [01:05:30]: Yeah, absolutely. I mean, it's very clear that AI is going to be a large part of how we build software in the future. It's a bit like being a software developer in the 90s and ignoring the Internet: you just need to learn about this stuff, you need to figure this stuff out. I don't think it needs to be super low level. The metaphor here is that you don't need to be digging down into the PyTorch level if you don't want to, in the same way that a software engineer in the 90s didn't need to understand how network stacks work to be able to build a website. But you need to understand the shape of this thing, how to hold it, and what it's good at and what it's not. That's really important. So yeah, I'd certainly advise people to just start playing around with it. Get a feel for how language models work, get a feel for how these diffusion models work, get a feel for what fine tuning is and how it works, because some of your job might be building datasets. Get a feel for how prompting works, because some of your job might be writing a prompt. Those are all really important skills to figure out.Swyx [01:06:36]: Yeah. Well, thanks for building the definitive platform for doing all that.Ben [01:06:41]: Yeah, of course.Alessio [01:06:42]: Any final calls to action? Who should come work at Replicate? Anything for the audience?Ben [01:06:47]: Yeah, well, I mean, we're hiring. If you click on Jobs at the bottom of replicate.com, there are some openings. And I just encourage you to try out AI, even if you think you're not smart enough. The whole reason I started this company is because I was looking at the cool stuff that Andreas was making. Andreas is a proper machine learning person with a PhD, and I was just a sort of lowly software engineer, and I was like, you're doing really cool stuff and I want to be able to do that. And by us working together, we've now made it accessible to dummies like me. So I encourage anyone who wants to try this stuff out to just give it a try. I would also encourage people who are tool builders. The limiting factor on AI now is not the technology — the technology has made incredible advances, and there are so many incredible machine learning models that can do a ton of stuff. The limiting factor is making that accessible to people who build products, because it's really hard to use this stuff right now.
And obviously we're building some of that stuff at Replicate, but there's just a ton of other tooling and abstractions that need to be built out to make this stuff usable. So I encourage people who like building developer tools to get stuck into it as well, because that's going to make this stuff accessible to everyone.Swyx [01:07:58]: Yeah. I especially want to highlight that you have a hacker-in-residence job opening available, which not every company has — which means: just join you and hack stuff.Ben [01:08:09]: Yeah, effectively. A lot of our job is just showing people how to use AI. So we've got a team of software developers who have kind of figured this stuff out, who are writing about it, who are making videos about it, who are making example applications to show people what you can do with this stuff.Swyx [01:08:26]: Yeah. In my world that used to be called DevRel, but now it's hacker in residence.Ben [01:08:31]: And this came from Zeke, who's another one of our hackers.Swyx [01:08:38]: Tell me this came from Chroma, because I want to settle that one.Ben [01:08:41]: We developed it — Anton actually was like, hey, we came up with that first — but I think we came up with it independently, because the story behind this is we originally called it the DevRel team. Yeah. And DevRel's cursed now. Zeke was like, that sounds so boring. I don't want to go to someone and say I'm a developer relations person, or a developer advocate or something. So we were like, okay, what's the way we can make this sound the most fun? All right, you're a hacker.Swyx [01:09:10]: I would say that is consistently the vibe I get from Replicate — everyone on your team I interact with. When I go to your San Francisco office, that's the vibe that you're generating: it's a hacker space more than an office. And you hold fantastic meetups there. I think you're a really positive presence in our community, so thank you for doing all that, and for instilling the hacker vibe and culture into AI.Ben [01:09:31]: I'm really glad that's working. Cool. That's a wrap, I think.Alessio [01:09:34]: Thank you so much for coming on, man.Ben [01:09:36]: Yeah, of course. Thank you. This was a lot of fun. Get full access to Latent.Space at www.latent.space/subscribe
-
Truly Serverless Infra for AI Engineers - with Erik Bernhardsson of Modal
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-02-16 17:42
We’re writing this one day after the monster release of OpenAI’s Sora and Gemini 1.5. We covered this on Alex Volkov’s ThursdAI space, so head over there for our takes.IRL: We’re ONE WEEK away from Latent Space: Final Frontiers, the second edition and anniversary of our first ever Latent Space event! Also: join us on June 25-27 for the biggest AI Engineer conference of the year!Online: All three Discord clubs are thriving. Join us every Wednesday/Friday!Almost 12 years ago, while working at Spotify, Erik Bernhardsson built one of the first open source vector databases, Annoy, based on approximate nearest neighbor (ANN) search. He also built Luigi, one of the predecessors to Airflow, which helps data teams orchestrate and execute data-intensive and long-running jobs. Surprisingly, he didn’t start yet another vector database company, but instead in 2021 founded Modal, the “high-performance cloud for developers”. In 2022 they opened doors to developers after their seed round, and in 2023 announced their GA with a $16m Series A.More importantly, they have won fans among both household names like Ramp, Scale AI, Substack, and Cohere, and newer startups like (upcoming guest!) Suno.ai and individual hackers (Modal was the top tool of choice in the Vercel AI Accelerator):We've covered the nuances of GPU workloads, and how we need new developer tooling and runtimes for them (see our episodes with Chris Lattner of Modular and George Hotz of the tiny corp to start). In this episode, we run through the major limitations of the actual infrastructure behind the clouds that run these models, and how Erik envisions the “postmodern data stack”. In his 2021 blog post “Software infrastructure 2.0: a wishlist”, Erik had “Truly serverless” as one of his points:
* The word cluster is an anachronism to an end-user in the cloud! I'm already running things in the cloud where there's elastic resources available at any time. Why do I have to think about the underlying pool of resources? Just maintain it for me.
* I don't ever want to provision anything in advance of load.
* I don't want to pay for idle resources. Just let me pay for whatever resources I'm actually using.
* Serverless doesn't mean it's a burstable VM that saves its instance state to disk during periods of idle.
Swyx called this Self-Provisioning Runtimes back in the day. Modal doesn’t put you in YAML hell, preferring to colocate infra provisioning right next to the code that utilizes it, so you can just add GPU (and disk, and retries…) — see the sketch below.After 3 years, we finally have a big market push for this: running inference on generative models is going to be the killer app for serverless, for a few reasons:
* AI models are stateless: even in conversational interfaces, each message generation is a fully-contained request to the LLM. There’s no knowledge stored in the model itself between messages, which means that tear down / spin up of resources doesn’t create any headaches with maintaining state.
* Token-based pricing is better aligned with serverless infrastructure than the fixed monthly costs of traditional software.
* GPU scarcity makes it really expensive to have reserved instances that are available to you 24/7. It’s much more convenient to build with serverless-like infrastructure.
In the episode we covered a lot more topics, like maximizing GPU utilization, why Oracle Cloud rocks, and how Erik has never owned a TV in his life.
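Here is roughly what that "infra next to the code" style looks like in Modal's Python SDK, using the API as of this episode (`modal.Stub`; newer releases renamed it `modal.App`). The function body and resource choices are illustrative, not from the episode:

```python
# A sketch of Modal's decorator-driven provisioning (illustrative function;
# modal.Stub was current at the time of this episode, later renamed modal.App).
import modal

stub = modal.Stub("example")

@stub.function(gpu="A100", timeout=600, retries=3)
def embed(texts: list[str]) -> int:
    # GPU type, timeout, and retry policy live on the decorator -- no YAML
    return len(texts)

@stub.local_entrypoint()
def main():
    # invocation style varies slightly across SDK versions
    print(embed.remote(["hello", "world"]))
```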
Enjoy!

Show Notes
* Modal
* ErikBot
* Erik’s Blog
* Software Infra 2.0 Wishlist
* Luigi
* Annoy
* Hetzner
* CoreWeave
* Cloudflare FaaS
* Poolside AI
* Modular Inference Engine

Chapters
* [00:00:00] Introductions
* [00:02:00] Erik's OSS work at Spotify: Annoy and Luigi
* [00:06:22] Starting Modal
* [00:07:54] Vision for a "postmodern data stack"
* [00:10:43] Solving container cold start problems
* [00:12:57] Designing Modal's Python SDK
* [00:15:18] Self-Provisioning Runtime
* [00:19:14] Truly Serverless Infrastructure
* [00:20:52] Beyond model inference
* [00:22:09] Tricks to maximize GPU utilization
* [00:26:27] Differences in AI and data science workloads
* [00:28:08] Modal vs Replicate vs Modular and lessons from Heroku's "graduation problem"
* [00:34:12] Creating Erik's clone "ErikBot"
* [00:37:43] Enabling massive parallelism across thousands of GPUs
* [00:39:45] The Modal Sandbox for agents
* [00:43:51] Thoughts on the AI Inference War
* [00:49:18] Erik's best tweets
* [00:51:57] Why buying hardware is a waste of money
* [00:54:18] Erik's competitive programming background
* [00:59:02] Why does Sweden have the best Counter Strike players?
* [00:59:53] Never owning a car or TV
* [01:00:21] Advice for infrastructure startups

Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:14]: Hey, and today we have in the studio Erik Bernhardsson from Modal. Welcome.Erik [00:00:19]: Hi. It's awesome being here.Swyx [00:00:20]: Yeah. Awesome seeing you in person. I've seen you online for a number of years as you were building Modal, and I think you're just making a San Francisco trip to see people here, right? I've been to like two Modal events in San Francisco.Erik [00:00:34]: Yeah, that's right. We're based in New York, so I figured sometimes I have to come out to the capital of AI and make a presence.Swyx [00:00:40]: What do you think are the pros and cons of building in New York?Erik [00:00:45]: I mean, I never built anything elsewhere. I've lived in New York for the last 12 years. I love the city. Obviously, there's a lot more stuff going on here and there are a lot more customers, and that's why I'm out here. I do feel like, for me, where I am in life, I'm a very boring person: I kind of work hard and then I go home and hang out with my kids. I don't have time to go to events and meetups and stuff anyway. In that sense, New York is kind of nice. I walk to work every morning; it's like five minutes away from my apartment. It's very time efficient in that sense. Yeah.Swyx [00:01:10]: Yeah. It's also a good life. So we'll do a brief bio and then we'll talk about anything else people should know about you. Actually, I was surprised to find out you're from Sweden. You went to college at KTH, and your master's was on implementing a scalable music recommender system. Yeah.Erik [00:01:27]: I had no idea. Yeah. So I actually studied physics, but I grew up coding and did a lot of programming competitions. As I was thinking about graduating, I got in touch with an obscure music streaming startup called Spotify, which was then like 30 people. And for some reason, I convinced them: why don't I just come and write a master's thesis with you, and I'll do some cool collaborative filtering — despite not knowing anything about collaborative filtering, really. But no one knew anything back then.
So I spent six months at Spotify basically building a prototype of a music recommendation system, and then turned that into a master's thesis. And then later, when I graduated, I joined Spotify full time.Swyx [00:02:00]: So that was the start of your data career. You also wrote a couple of popular open source tools while you were there. Is that correct?Erik [00:02:09]: Yeah, that's right. I mean, I was at Spotify for seven years, so it was a long stint. And Spotify was a wild place early on, and the data space was also a wild place. I mean, there was a Hadoop cluster in the foosball room on the floor. It was a lot of crude, very basic infrastructure, and I didn't know anything about it. I was hired to kind of figure out data stuff. I started hacking on a recommendation system and then got sidetracked into a bunch of other stuff: I fixed a bunch of reporting things, set up A/B testing, started doing business analytics, and later got back to the music recommendation system. And a lot of the infrastructure didn't really exist. There was Hadoop back then, which was kind of bad and I don't miss it, but I spent a lot of time with it. As part of that, I ended up building a workflow engine called Luigi, which briefly ended up being somewhat widely used by a bunch of companies. Sort of like Airflow, but before Airflow — I think it did some things better, some things worse. I also built a vector database called Annoy, which for a while was actually quite widely used. That was in 2012, way before all this vector database stuff ended up happening. And funny enough, I was actually obsessed with vectors back then. I was like, this is going to be huge, just give it a few years. I didn't know it was going to take nine years, and then there would suddenly be 20 startups doing vector databases in one year. So it did happen — in that sense, I was right. I'm glad I didn't start a startup in the vector database space; I would have started way too early. But yeah, it was a fun seven years. It was a great culture, a great company.Swyx [00:03:32]: Yeah. Just to take a quick tangent on this vector database thing, because we probably won't revisit it: has anything architecturally changed in the last nine years?Erik [00:03:41]: I'm actually not following it super closely. I think some of the best algorithms are still the same, like hierarchical navigable small world.Swyx [00:03:51]: Yeah. HNSW.Erik [00:03:52]: Exactly. I think now there's product quantization and some other stuff that I haven't really followed super closely. Back then it was always very simple: Annoy is a C++ library with Python bindings, and you could mmap big files into memory and do lookups. I used this kind of recursive hyperplane splitting strategy, which is not that good, but it was good enough at the time. But I think a lot of HNSW is still what people generally use. Now, of course, databases are much better in the sense of supporting inserts and updates and stuff like that — Annoy never supported that. Yeah, it's sort of exciting to finally see vector databases becoming a thing.Swyx [00:04:30]: Yeah. Yeah.
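For readers who haven't used it, Annoy's API is tiny — a minimal sketch (the dimensionality, metric, and tree count here are illustrative):

```python
# Minimal Annoy usage sketch (pip install annoy); parameters are illustrative.
import random
from annoy import AnnoyIndex

dim = 64                               # dimensionality of item vectors
index = AnnoyIndex(dim, "angular")     # cosine-style distance

for i in range(1000):                  # index some random vectors
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)        # 10 random-hyperplane trees: more trees, better recall
index.save("items.ann")                # the index mmaps from disk, as Erik notes

print(index.get_nns_by_item(0, 5))     # 5 approximate nearest neighbors of item 0
```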
And then maybe one takeaway: the most interesting lesson from Daniel Ek?Erik [00:04:36]: I mean, Daniel Ek started Spotify very young — he was like 25, something like that — and that was a good lesson. In a way, I think he was a very good leader. There were never any scandals; he wasn't eccentric at all. He was just very level-headed and ran the company very well, and never made any obvious mistakes. There were a few bets that maybe, in hindsight, took us too far in one direction or another, but overall I think he was a great CEO — definitely up there, a generational CEO, at least for Swedish startups.Swyx [00:05:09]: Yeah, yeah, for sure. Okay, we should probably make our way towards Modal. So then you spent six years as CTO of Better. You joined as an early engineer and then scaled up to like 300 engineers.Erik [00:05:21]: I joined as the CTO when there was no tech team. And yeah, that was a wild chapter in my life. The company did very well for a while, and then during the pandemic — yeah, it's kind of a weird story, but it kind of collapsed.Swyx [00:05:32]: Yeah, laid off people poorly.Erik [00:05:34]: Yeah, yeah. There were a bunch of stories. I mean, the company grew from like 10 people when I joined to 10,000, and now it's back to a thousand. But they actually went public a few months ago, kind of crazy. They're still around, still doing stuff. So yeah, a very interesting six years of my life, for non-technical reasons. I managed like three, four hundred people, and I learned a lot from that — recruiting, I spent all my time recruiting, and managing at scale. Now, in a way, when I'm building my own startup, it's actually something I don't feel nervous about at all: I've managed at scale, I feel like I can do it again. It's very different things that I'm nervous about as a startup founder. But yeah, I started Modal three years ago, after leaving Better. I took a little bit of time off during the pandemic, but pretty quickly I was like, I've got to build something. And then, yeah, Modal took form in my head, took shape.Swyx [00:06:22]: And as far as I understand — and maybe we can sort of trade off questions — the quick history is: you started Modal in 2021, got your seed with Sarah from Amplify in 2022, and you just announced your Series A with Redpoint. That's right. And that brings us up to mostly today. Yeah. Most people, I think, were expecting you to build for the data space.Erik: But it is the data space.Swyx: When I think of the data space, I come from, you know, Snowflake, BigQuery, Fivetran, Airbyte, that kind of stuff. And what Modal became is more general purpose than that. Yeah.Erik [00:06:53]: Yeah. I don't know, it was fun. I actually ran into Edo Liberty, the CEO of Pinecone, a few weeks ago, and he was like, I was so afraid you were building a vector database. No — I started Modal because, in a way, I've worked with data throughout most of my career, at every different part of the stack, right?
Everything from business analytics to deep learning — training neural networks at scale — and everything in between. And one of the observations I had when I started Modal, or why I started it, was that I just wanted to build better tools for data teams. A very abstract thing, but I find that the data stack is full of point solutions that don't integrate well. And still, when you look at data teams today, every startup ends up building their own internal Kubernetes wrapper or whatever, and all the different data engineers and machine learning engineers end up struggling with the same things. So I started thinking about how to build a new data stack, which is kind of a megalomaniac project, because you kind of want to throw out everything and start over.Swyx [00:07:54]: It's almost a modern data stack.Erik [00:07:55]: Yeah, like a postmodern data stack. So I started thinking about that, and a lot of it was focused on the human side: how do I make data teams more productive, and what are the technology tools they need? I drew out a lot of charts of how the data stack looks and what the different components are. Workflow scheduling is actually very interesting, because it sits like a hub in the graph of data products. But it was kind of hard to do that in a vacuum, and also to monetize it to some extent. So at some point I got very interested in the layers below. At the end of the day, most people have code that has to run somewhere, so I thought about: okay, how do you make that nice? In particular, the thing I always thought about with developer productivity is that the best way to measure it is in terms of feedback loops: when you write code, how quickly can you get feedback? At the innermost loop, it's writing code and then running it. And as soon as you start working with the cloud, that suddenly takes minutes, because you have to build a Docker container, push it to the cloud, and run it. So that was the initial focus for me: I just want to solve that problem. I want to build something that lets you run things in the cloud and retain the joy of productivity you have when running things locally. And in particular, I was quite focused on data teams, because I think they had a couple of unique needs that weren't well served by the infrastructure at the time — or still aren't. In particular Kubernetes: I feel like it kind of worked okay for backend teams, but not so well for data teams. And very quickly, I got sucked into a very deep rabbit hole of...Swyx [00:09:24]: Not well for data teams because of burstiness. Yeah, for sure.Erik [00:09:26]: So burstiness is one thing, right? You often have this fan-out: you want to apply some function over very large data sets.
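That fan-out pattern is exactly what Modal's map primitive targets. A hypothetical sketch, using the SDK of the time (stub and function names are illustrative):

```python
# A sketch of serverless fan-out with Modal's .map() (illustrative names;
# modal.Stub was the API at the time of this episode).
import modal

stub = modal.Stub("fan-out-example")

@stub.function()
def transform(record: str) -> int:
    # placeholder per-item work; each call can land in its own container
    return len(record)

@stub.local_entrypoint()
def main():
    records = [f"row-{i}" for i in range(10_000)]
    # Modal fans the calls out across many containers and streams results back
    print(sum(transform.map(records)))
```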
Another thing tends to be hardware requirements. You need GPUs, and I've seen this in many companies: data scientists go to a platform team and ask, can we add GPUs to the Kubernetes cluster? And they're like, no, that's complex, we're not going to — so even just getting GPU access is hard. And then data code, or frankly machine learning code, tends to be super annoying in terms of environments: you end up having a lot of custom containers and environment conflicts, and it's very hard to set up a unified container that can serve a data scientist, because there are always packages that break. So I think there are a lot of different reasons why the technology wasn't well suited for data teams. And the attitude at the time was often — you had this friction between the data team and the platform team — well, it works for the backend stuff, why don't you just make it work? But I actually felt that data teams — and at this point there are so many people working with data — to some extent deserve their own tools and their own toolchains, and optimizing for that is not something people had done. So that's the very abstract, philosophical reason why I started Modal. And then I got sucked into this rabbit hole of container cold start and, you know, Linux, page cache, file system optimizations.Swyx [00:10:43]: Yeah, tell people — I think the first time I met you, you told me some numbers, but I don't remember. What were the main achievements? You were unhappy with the status quo, and then you built your own container stack?Erik [00:10:52]: Yeah. In particular, in order to have that loop, you want to be able to take code on your laptop and run it in the cloud very quickly, in custom containers, and maybe spin up 100 containers, or 1,000, things like that. So container cold start was the initial focus, from a developer productivity point of view: I want to take code, stick it in a container, execute it in the cloud, and make it feel fast. When you look at how Docker works, for instance, it's fairly convoluted and very resource inefficient: you build a container, you upload the whole container, and then you download it and run it. And Kubernetes is also not very fast at starting containers. So I started going a layer deeper. Docker is actually built on a couple of different primitives, and a lower level primitive is runC, which is a container runner. And I was like, what if I just take the container runner, runC, and point it to my own root file system — and then I built my own virtual file system that exposes files over a network instead.
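In spirit, the "crude version" Erik describes might look something like the sketch below. The lazy filesystem helper here is imaginary, and Modal's real implementation is in Rust — this is only to make the shape of the idea concrete:

```python
# Hypothetical sketch: runC pointed at a network-backed rootfs. "lazyfs" is an
# imaginary helper; paths and names are illustrative, not Modal's code.
import subprocess

bundle = "/srv/bundles/job-42"  # OCI bundle: config.json whose rootfs field
                                # points at a FUSE mount fetching files on demand

# mount the lazy-loading filesystem backed by a shared content cache
subprocess.run(["lazyfs", "mount", "--remote", "cache.internal:7000",
                f"{bundle}/rootfs"], check=True)

# runC starts the container straight from the bundle -- no image pull, so only
# the files the process actually reads ever cross the network
subprocess.run(["runc", "run", "--bundle", bundle, "job-42"], check=True)
```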
And that was the very crude version of Modal: now I can actually start containers very quickly. Because it turns out, when you start a Docker container, first of all, most Docker images are several gigabytes, and 99% of that is never going to be consumed — there's a bunch of, like, timezone information for Uzbekistan that no one's going to read. And there's a very high overlap between the files that are going to be read — there's going to be libtorch or whatever, and it's going to be read — so you can also cache it very well. So that was the first stuff we started working on: building this container file system, coupled with just using runC directly. And that actually enabled us to get to this point where you write code and you can launch it in the cloud within a second or two, something like that. There have been many optimizations since then, but that was the starting point.Alessio [00:12:33]: Can we talk about the developer experience as well? I think one of the magic things about Modal is that at the very basic layer, it's just a Python function decorator — the stub and whatnot. But then you also have a way to define a full container. What were the design decisions that went into it? Where did you start? How easy did you want it to be? And then how much complexity did you add on to make sure that every use case fit?Erik [00:12:57]: I mean, I almost feel like Modal is almost two products glued together. There's the low level container runtime, the file system, all that stuff, in Rust. And then there's the Python SDK: how do you express applications? And I think — I mean, Swyx, your blog post on the self-provisioning runtime was always, to me, an eye-opening thing. So I didn't think about...Swyx [00:13:15]: You wrote your post four months before me, yeah? The Software 2.0, Infra 2.0 one. Yeah.Erik [00:13:19]: Well, I don't know — convergence of minds. I guess we were both thinking about it; maybe you put better words to something I had been thinking about for a long time. Yeah.Swyx [00:13:29]: And I can tell you how I was thinking about it on my end, but I want to hear you say it.Erik [00:13:32]: Yeah, I would love to. So what I always wanted to build was — I don't know if you use Pulumi. Pulumi is nice in the sense that you describe infrastructure in code, right? And to me, that was so nice: finally I can put a for loop that creates S3 buckets or whatever. And I think Modal goes one step further, in the sense of: what if you also put the app code inside the infrastructure code, glue it all together, and then you have one single place that defines everything, and it's all programmable? You don't have any config files. Modal has zero config. There's no config. It's all code. So that was part of the goal. And the other part was that I often found that so much of my time was spent on the plumbing between containers.
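The "for loop that creates S3 buckets" Erik mentions is literally what Pulumi's Python SDK enables — a minimal sketch (bucket names are illustrative, and it assumes AWS credentials are configured):

```python
# A minimal Pulumi program in Python: infrastructure as a plain for loop.
# Bucket names are illustrative; run via `pulumi up` with AWS credentials.
import pulumi_aws as aws

for i in range(3):
    aws.s3.Bucket(f"example-bucket-{i}")
```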
And so my thing was: well, if I just build this Python SDK and make it possible to bridge different containers with just a function call — I can say this function runs in this container and this other function runs in that container, and I can call it just like a normal function — then I can build applications that span a lot of different environments. Maybe they fan out and start other containers, but it's all just inside Python. You have this beautiful, nice DSL, almost, for how to control infrastructure in the cloud. So that was how we ended up with the Python SDK as it is — which is still evolving all the time, by the way. We keep changing the syntax quite a lot, because I think it's still somewhat exploratory, but we're starting to converge on something that feels reasonably good now.Swyx [00:14:54]: Yeah. And along the way, with this expressiveness, you enabled the ability to, for example, attach a GPU to a function. Totally.Erik [00:15:02]: Yeah. You just say, on the function decorator, GPU equals A100, or GPU equals A10 or T4, something like that. And then you get that GPU, and you just run the code and it runs — you don't have to jump through hoops to start an EC2 instance or whatever.Swyx [00:15:18]: Yeah. So it's all code. Yeah. So one of the reasons I wrote Self-Provisioning Runtimes was that I was working at AWS, and we had AWS CDK, which is kind of like the Amazon Basics Pulumi. And it compiles to CloudFormation. And then on the other side, you have to get all the config stuff and put it into your application code and make sure that they line up. So you're writing code to define your infrastructure, then you're writing code to define your application, and I was just like: it's obvious that this is going to converge, right? Yeah, totally.Erik [00:15:48]: But isn't there — I might be wrong, but wasn't it SAM, or Chalice, or one of those? Isn't that an AWS thing where they actually kind of did that? I feel like there's one.Swyx [00:15:57]: SAM. Yeah. Still very clunky. It's not as elegant as Modal.Erik [00:16:03]: I love AWS for the stuff it's built historically, for what it enables me to build, but AWS has always struggled with developer experience.Swyx [00:16:11]: I mean, they have to not break things.Erik [00:16:15]: Yeah, totally. And they have to build products for a very wide range of use cases. And I think that's hard.Swyx [00:16:21]: Yeah. So it's easier for you to design for. So anyway, I was pretty convinced that this would happen. I wrote that thing. And then imagine my surprise when you guys had it on your landing page at some point. I think Akshat was just like, just throw that in there.Erik [00:16:34]: Did you trademark it?Swyx [00:16:35]: No, I didn't. But I definitely got sent a few pitch decks with my post on there, and it was really interesting. This was my first time kind of putting a name to a phenomenon, and I think that's a useful skill for people: communicate what you're trying to do.Erik [00:16:48]: Yeah. No, I think it's a beautiful concept.Swyx [00:16:50]: Yeah. Yeah. Yeah.
But I mean, obviously you implemented it. What became more clear in your explanation today is that you're actually not that tied to Python.Erik [00:16:57]: No. I mean, all the lower level stuff is just running containers, scheduling things, serving container data and stuff. One of the benefits of data teams is that they're all using Python, right? So that made it a lot easier. If we had focused on other workloads — for various reasons, we've been half thinking about CI and things like that — that's harder in a way, because then you have to support multiple SDKs, whereas with data teams, Python covers like 95% of all teams. That made it a lot easier. But definitely, in the future, we're going to support other languages. JavaScript is for sure the obvious next language. But who knows — Rust, Go, R, whatever, PHP, Haskell, I don't know.Swyx [00:17:42]: You know, I'm actually a person who liked the idea of programming language advancements being improvements in developer experience. But all I saw out of the academic, sort of PLT-type people was type-level improvements. And I always think that, for me, one of the core reasons for self-provisioning runtimes, and why I like Modal, is that this is actually a productivity increase, right? It's a language level thing — you managed to stick it on top of an existing language, but it is your own language, a DSL on top of Python — a language-level increase on the order of automatic memory management. You could make the analogy that maybe you lose some level of control, but most of the time you're okay with whatever Modal gives you. And that's fine. Yeah.Erik [00:18:26]: Yeah, that's how I look at it too. You look at developer productivity over the last number of decades: it's come in increments. Dynamic typing is one, because suddenly, for a lot of use cases, you don't need to care about type systems; or better compiler technology, or the cloud, or relational databases. And if you look at that history, developers have been getting probably 10x more productive every decade for the last four decades or something, which is kind of crazy. On an exponential scale, that's a 10,000x improvement in developer productivity. What we can build today is arguably a fraction of the cost of what it took to build in the eighties — maybe it wasn't even possible in the eighties. So that, to me, is so fascinating, and I think it's going to keep going for the next few decades. Yeah.Alessio [00:19:14]: Yeah. Another big thing in the Infra 2.0 wishlist was truly serverless infrastructure. On your landing page you call them native cloud functions, something like that. I think the issue I've seen with serverless has always been that people really wanted it to be stateful, even though stateless was much easier to do.
And I think now, with AI, most model inference is stateless, you know, outside of the context. So that's made it a lot easier to just put a model — an AI model — on Modal to run. How do you think that changes how people think about infrastructure? Yeah.Erik [00:19:48]: I mean, I think Modal is definitely going in the direction of doing more stateful things, and working with data, and high-IO use cases. But one massive serendipitous thing that happened, like a year and a half into building Modal, was that Gen AI started exploding, and the IO pattern of Gen AI fits the serverless model so well. You send this tiny piece of information — a prompt, or something like that — and then you have this GPU that does trillions of flops and sends back a tiny piece of information. And it turns out that if you can get serverless working with GPUs, that just works really well. So from that point of view: serverless, to me, always felt a little bit like a solution looking for a problem. I don't actually think backend is the problem that needs serverless — or not as much. But when I look at data, and in particular things like Gen AI and model inference, it's clearly a good fit. So I think that to a large extent explains why the initial killer app for Modal was model inference, which actually wasn't necessarily what we were focused on. But that's where we've seen by far the most usage. Yeah.Swyx [00:20:52]: And this was before you started offering fine tuning of language models — it was mostly Stable Diffusion. Yeah.Erik [00:20:59]: Yeah. I always built Modal to be a very general purpose compute platform, something where you can run everything, and for a long time I used to call Modal a better Kubernetes for data teams. What we realized was that, a year and a half in, we barely had any users or any revenue, and we were like, well, maybe we should look at some use case — and that was around the same time Stable Diffusion came out. The beauty of Modal is that you can run almost anything on it, right? And model inference turned out to be the place where we found, initially: clearly this has 10x better ergonomics than anything else. But going back to my original vision, we're also thinking a lot about: okay, now we do inference really well — what about training? What about fine tuning? What about end-to-end lifecycle deployment? What about data pre-processing? What about real-time streaming, large data munging, data observability? There are so many things — kind of going back to what I said about redefining the data stack, starting with the foundation of compute.
One of the exciting things about Modal is that we've been working on that foundation for three years and it's maturing, and there are so many things you can do with just a better compute primitive — and you can also go up the stack and do all this other stuff on top of it.Alessio [00:22:09]: I would love to learn more about the underlying infrastructure and how you make that happen. With fine tuning and training, it's static memory — you know exactly what you're going to load into memory, and it's kind of a set amount of compute — versus inference, where demand is very bursty. How do you make batches work with a serverless developer experience? What are some fun technical challenges you've solved to make sure you get max utilization on these GPUs? What we hear from people is: we have GPUs, but we can really only get like 30, 40, 50% utilization. What's some of the fun stuff you're working on to get a higher number there?Erik [00:22:48]: Yeah. On the inference side, that's where — from a cost perspective, a utilization perspective — we've seen very good numbers, and in particular it's our ability to start and stop containers very quickly. That means we can auto-scale extremely fast and scale down very quickly, which means we can always adjust the capacity — the number of GPUs running — to the exact traffic volume. In many cases that leads to an interesting thing: we obviously run on the public clouds, AWS, GCP, and Oracle, and even though we charge a slightly higher price per GPU hour, a lot of users moving their large scale inference use cases to Modal end up saving a lot of money, because we only charge for the time the GPU is actually running. And that's a hard problem, right? If you have to constantly adjust the number of machines, if you have to start and stop containers, that's very hard. Starting containers quickly is a very difficult thing: I mentioned we had to build our own file system for it, and we also built our own container scheduler. We've recently implemented CPU memory checkpointing, so we can take running containers and snapshot the entire CPU state, including registers and everything, and restore from that point — which means we can restore from an initialized state. We're looking at GPU checkpointing next; it's a very interesting thing. So I think with inference, that's where serverless really shines, because you can push the frontier of latency versus utilization quite substantially, which ends up being a latency advantage or a cost advantage or both, right? On training, serverless is arguably less of an advantage, frankly, because you can just spin up a bunch of machines and train as much as you can on each machine. In that area we've seen arguably less usage of Modal, but there are always some interesting use cases.
We do have a couple of customers, like Ramp, for instance, who do fine tuning with Modal, and one of the patterns they have is very bursty fine tuning, where they fine tune 100 models in parallel. And that's a separate thing Modal does really well: we can start up 100 containers very quickly and run a fine tuning job on each one of them that only runs for, I don't know, 10 or 20 minutes. Then you can do hyperparameter tuning in that sense — just pick the best model, things like that. So there are interesting training use cases. I think when you get to training very large foundation models, that's a use case we don't support super well, because it's very high IO — you need InfiniBand and all these things — and those are things we haven't supported yet, and it might take a while to get there. So that's probably an area where we're relatively weak. Yeah.Alessio [00:25:12]: Have you cared at all about lower level model optimization? There are other cloud providers that do custom kernels to get better performance. Or do you skip that, given that you're not just an AI compute company? Yeah.Erik [00:25:24]: I mean, we want to support generic, general workloads, in the sense that we want users to give us a container, essentially, or code, and then we want to run that. So I think we benefit from those things in the sense that we can tell our users to use them. But I don't know if we want to poke into users' containers and do those things automatically. That's a little bit tricky to do from the outside, because we want to be able to take arbitrary code and execute it. But certainly, we can tell our users to use those things. Yeah.Swyx [00:25:53]: I may have betrayed my own biases, because I don't really think about Modal as being for data teams anymore — I think you're much more for AI engineers. My favorite anecdote, which I think you know, but I don't know if you directly experienced it: I went to the Vercel AI Accelerator, which you supported. In the Vercel AI Accelerator, a bunch of startups gave free credits and signups and talks and all that stuff. The only ones that stuck were the ones that actually appealed to engineers, and the top tool used by far was Modal.Erik [00:26:24]: That's awesome.Swyx [00:26:25]: For people building AI apps. Yeah.Erik [00:26:27]: I mean, it might also be a terminology question, the AI versus data thing. Maybe I'm just old and jaded, but I've seen so many different titles: for a while it was data scientist, and machine learning engineer, and then there were analytics engineers, and now there's AI engineer. So to me, in my head, it's all just data, or just engineering. That's why I've been calling it data teams. But of course, AI is such a massive fraction of our workloads.Swyx [00:26:59]: It's a different Venn diagram of things you do, right?
So the stuff that you're talking about where you need InfiniBand for highly parallel training, that's more of the ML engineer, that's more of the research scientist, and less of the AI engineer, which is more about working at the application layer.Erik [00:27:16]: Yeah. I mean, to be fair, we have a lot of users doing stuff that I don't think fits neatly into AI. We have a lot of people using Modal for web scraping; it's kind of nice, you can just fire up a hundred or a thousand containers running Chromium and render a bunch of webpages, and it takes, you know, whatever. Or protein folding; I mean, we have a bunch of users doing that. Or, in the realm of biotech, sequence alignment, or a couple of people using Modal to run large mixed integer programming problems, using Gurobi or things like that. Video processing is another thing that keeps coming up: let's say you have petabytes of video and you want to just transcode it, you can fire up a lot of containers and just run FFmpeg. So there are those things too. That being said, AI is by far our biggest use case, but, you know, Modal is kind of general purpose in that sense.Swyx [00:28:08]: Yeah. Well, maybe I'll stick to the stable diffusion thing and then we'll move on to the other use cases for AI that you want to highlight. The other big player in my mind is Replicate. Yeah. In this era, they're much more, I guess, custom built for that purpose, whereas you're more general purpose. How do you position yourself with them? Are they just for different audiences, or are you competing head on?Erik [00:28:29]: I think there's a tiny sliver of the Venn diagram where we're competitive, and then like 99% of the area we're not competitive. I mean, if you look at front-end engineers, I think that's where they really found good fit: people who built some cool web app and want some sort of AI capability, and an off the shelf model is perfect for them. For that it's like, use Replicate, that's great. I think where we shine is custom models or custom workflows, running things at very large scale, where you need to care about utilization, care about costs. You know, we have much lower prices because we spend a lot more time optimizing our infrastructure, and that's where we're competitive, right? And you look at some of the use cases: Suno is a big user, they're running large scale AI. Oh, we're talking with Mikey.Swyx [00:29:12]: Oh, that's great. Cool.Erik [00:29:14]: In a month. Yeah. So, I mean, they're using Modal for production infrastructure. They have their own custom model, custom code and custom weights, you know, for AI generated music, Suno.AI. Those are the types of use cases that we like: things that are very custom, and those are the things that are very hard to run on Replicate, right? And that's fine.
Like I think they focus on a very different part of the stack in that sense.Swyx [00:29:35]: And then the other company pattern that I pattern match you to is Modular. I don't know.Erik [00:29:40]: Because of the names?Swyx [00:29:41]: No, no. Wow. No, but yeah, the name is very similar. I think there's something that might be insightful there from a linguistics point of view. No, they have Mojo, the sort of Python SDK, and they have the Modular Inference Engine, which is their sort of cloud compute inference stack. I don't know if anyone's made that comparison to you before, but I see you evolving a little bit in parallel there.Erik [00:30:01]: No, I mean, maybe. Yeah. It's not a company I'm super familiar with; I mean, I know the basics, but I guess they're similar in the sense that they want to do a lot, you know, they have a sort of big picture vision.Swyx [00:30:12]: Yes. They also want to build very general purpose. Yeah. So they're marketing themselves as: if you want to do off the shelf stuff, go somewhere else; if you want to do custom stuff, we're the best place to do it. Yeah. Yeah. There is some overlap there. There's not overlap in the sense that you are a closed source platform; people have to host their code on you. That's true. Whereas for them, they're very insistent on not running their own cloud service. They're boxed software. Yeah. They're licensed software.Erik [00:30:37]: I'm sure their VCs are at some point going to force them to reconsider. No, no.Swyx [00:30:40]: Chris is very, very insistent and very convincing. So anyway, I would just make that comparison, let people make the links if they want to. But it's an interesting way to see the cloud market develop from my point of view, because I came up in this field thinking cloud is one thing, and I think your vision is something slightly different, and I see the different takes on it.Erik [00:31:00]: Yeah. And one thing I've written a bit about in my blog too is that I think of us as a second layer of cloud provider, in the sense that I think Snowflake is kind of a good analogy. Snowflake, you know, is infrastructure as a service, right? But they actually run on the major clouds, right? And you can analyze this very deeply, but one of the things I always thought about is, why did Snowflake arguably win over Redshift? And to me, one reason is that, in the end, AWS makes all the money anyway, and Snowflake just had the ability to focus on developer experience, or user experience. And to me, that really proved that you can build a cloud provider a layer up from the traditional public clouds. And in that layer, that's also where I would put Modal: we're building a cloud provider, we're a multi-tenant environment that runs the user code, but we're also building on top of the public cloud. So I think there's a lot of room in that space; I think it's a very interesting direction.Alessio [00:31:55]: How do you think of that compared to the traditional past history? Like, you know, you had AWS, then you had Heroku, then you had Render, Railway.Erik [00:32:04]: Yeah, I mean, I think those are all great. I think the problem that they all faced was the graduation problem, right?
Like, you know, with Heroku there's also a counterfactual future of what would have happened if Salesforce didn't buy them, right? That's a separate thing. But I think what Heroku always struggled with was that eventually companies would get big enough that you couldn't really justify running on Heroku, so they would just go and move it to AWS in particular. And, you know, that's something that keeps me up at night too: what does that graduation risk look like for Modal? I always think the only way to build a successful infrastructure company in the long run in the cloud today is you have to appeal to the entire spectrum, right? Or at least you have to capture the enterprise market. But the truly good companies capture the whole spectrum. Like I think of companies like, I don't know, Datadog or Mongo or something, that both captured the hobbyists and acquired them, but also have very large enterprise customers. I think that, arguably, was where Heroku struggled, in my opinion: how do you keep the customers as they get more and more advanced? I don't know what the solution is, but that's something I would have thought about deeply if I was at Heroku at that time.Alessio [00:33:14]: What's the AI graduation problem? Is it, I need to fine tune the model, I need better economics, any insights from customer discussions?Erik [00:33:22]: Yeah, I mean, better economics, certainly. Although I would say, even for people who need thousands of GPUs, just because we can drive utilization so much better, there's actually a cost advantage of staying on Modal. But certainly, the fact that VCs love, or at least used to love, throwing money at companies who need it to buy GPUs, I think that didn't help the problem. And in training, I think there's less software differentiation, so in training there are certainly better economics in buying big clusters. But my hope is it's going to change, right? I think we're still pretty early in the cycle of building AI infrastructure. And I think in the long run, a lot of these companies — except maybe the super big ones like Facebook and Google, who are always going to build their own — everyone else, to some extent, I think is better off buying platforms. And, you know, someone's going to have to build those platforms.Swyx [00:34:12]: Yeah. Cool. Let's move on to language models, and just specifically that workload, to flesh it out a little bit. You already said that Ramp is fine tuning 100 models simultaneously on Modal. Closer to home, my favorite example is ErikBot. Maybe you want to tell that story.Erik [00:34:30]: Yeah. I mean, it was a prototype thing we built for fun, but it's pretty cool. We basically built this thing that hooks up to Slack. It downloads all the Slack history and fine-tunes a model based on a person, and then you can chat with that. And so you can clone yourself and talk to yourself on Slack. I mean, it's a nice demo, and it's fully contained in Modal.
Like there's a Modal app that does everything, right? It integrates with the Slack API, downloads the data, runs the fine-tuning, and then dynamically creates an inference endpoint. And it's all self-contained and, you know, a few hundred lines of code. So I think it's a good use case that demonstrates a lot of the capabilities of Modal.Alessio [00:35:08]: Yeah. On a more personal side, how close did you feel ErikBot was to you?Erik [00:35:13]: It definitely captured the language. Yeah. I mean, I don't know about the content; I always feel this way about AI, and it's gotten better. When you glance at AI-generated text, it's like, yeah, this seems really smart, you know, but then you actually look a little bit deeper. It's like, what does this mean?Swyx [00:35:32]: What does this person say?Erik [00:35:33]: It's kind of vacuous, right? And that's kind of what I felt talking to my clone version. It says things where the grammar is correct, some of the sentences make a lot of sense, but what are you trying to say? There's no content here. I don't know. I mean, I got that feeling also with ChatGPT in the early versions. Right now it's better, but.Alessio [00:35:51]: That's funny. So I built this thing called Smol Podcaster to automate a lot of our back office work, so to speak. And it's great at transcripts, it's great at doing chapters. And then I was like, okay, how about you come up with a short summary? And it sounds good, but it's not even in the same ballpark as what we end up writing, right? And it's hard to see how it's going to get there.Swyx [00:36:11]: Oh, I have ideas.Erik [00:36:13]: I'm certain it's going to get there, but I agree with you. And I have the same thing. I don't know if you've read AI-generated books. They just kind of seem funny, right? Like they're off, right? You glance at it and it's like, oh, it's kind of cool, looks correct, but then it's very weird when you actually read them.Swyx [00:36:30]: Yeah. Well, for what it's worth, I think anyone can join the Modal Slack. Is it open to the public? Yeah, totally.Erik [00:36:35]: If you go to modal.com, there's a button in the footer.Swyx [00:36:38]: Yeah. And then you can talk to ErikBot. And sometimes I really like pinging ErikBot, and then you answer afterwards, but then you're like, yeah, mostly correct or whatever. Any other broader lessons? Just broadening out from the single use case of fine tuning, what are you seeing people do with fine tuning, or just language models on Modal in general? Yeah.Erik [00:36:59]: I mean, I think language models are interesting because so many people get started with APIs, and those are just dominating the space, in particular OpenAI, right? And that's not necessarily a place where we aim to compete. I mean, maybe at some point, but it's just not a core focus for us. And separately, there's a question of the long-term economics there. So we tend to focus more on the areas around it: fine tuning, and another use case we have is a bunch of people, Ramp included, doing batch embeddings on Modal.
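(A sketch of that batch-embedding fan-out pattern — Erik expands on the Wikipedia-scale version next. The model name, batch size, and toy corpus are illustrative assumptions, not what Ramp or Modal actually use:)

```python
import modal

app = modal.App("embeddings-sketch")  # hypothetical app name
image = modal.Image.debian_slim().pip_install("sentence-transformers")

@app.function(gpu="A10G", image=image)
def embed_batch(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    return model.encode(texts).tolist()

@app.local_entrypoint()
def backfill():
    # Pretend corpus; a real backfill would stream articles from storage.
    corpus = [f"article {i} body text" for i in range(100_000)]
    batches = [corpus[i : i + 256] for i in range(0, len(corpus), 256)]
    # .map() fans the batches out across many GPU containers at once,
    # which is how a Wikipedia-scale job can finish in minutes.
    vectors = [v for batch in embed_batch.map(batches) for v in batch]
    print(len(vectors), "vectors produced")
```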
So let's say you have a big corpus — actually, we're writing a blog post where we take all of Wikipedia, parallelize the embeddings in 15 minutes, and produce vectors for each article. So those types of use cases, I think Modal suits really well. I think also a lot of custom inference. Yeah, I love that.Swyx [00:37:43]: Yeah. I think you should give people an idea of the order of magnitude of parallelism, because I think people don't understand how parallel it is. So, like, I think your classic hello world with Modal is some kind of Fibonacci function, right? Yeah, we have a bunch of different ones. Some recursive function. Yeah.Erik [00:37:59]: Yeah. I mean, it's pretty easy in Modal to fan out to at least 100 GPUs in a few seconds. And if you give it a couple of minutes, you can fan out to thousands of GPUs. We run at relatively large scale, and we've run many thousands of GPUs at certain points when we needed to, for big backfills, or when customers had very large compute needs.Swyx [00:38:21]: Yeah. Yeah. And I mean, that's super useful for a number of things. So one of my early interactions with Modal as well was with smol developer, which is my sort of coding agent. The reason I chose Modal was a number of things. One, I just wanted to try it out; I just had an excuse to try it. Akshat offered to onboard me personally. But the most interesting thing was that you could have that local development experience as it was running on my laptop, but then it would seamlessly translate to a cloud service, or a cloud hosted environment, and then it could fan out with concurrency controls. Because the number of times I hit the GPT-3 API at the time was going to be subject to the rate limit, but I wanted to fan out without worrying about that kind of stuff. With Modal, I can just declare that in my config and that's it. Oh, like a concurrency limit?Erik [00:39:07]: Yeah. Yeah.Swyx [00:39:09]: Yeah. There's a lot of control. And that's why it's like, yeah, this is a pretty good use case for writing this kind of LLM application code inside an environment that just understands fan out and rate limiting natively. You don't actually have an exposed queue system, but you have it under the hood, you know, that kind of stuff. Totally.Erik [00:39:28]: It's a self-provisioning cloud.Swyx [00:39:30]: So the last part of Modal I wanted to touch on, and obviously feel free, I know you're working on new features, was the sandbox that was introduced last year. And this is something that I think was inspired by Code Interpreter. You can tell me the longer history behind that.Erik [00:39:45]: Yeah. We originally built it for the use case where a bunch of customers who were looking into code generation applications came to us and asked, is there a safe way to execute code? And yeah, we spent a lot of time on container security. We use gVisor, for instance, which is a Google project that provides pretty strong isolation of code. So we built a product where you can basically run arbitrary code inside a container and monitor its output, or get it back in a safe way.
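(`modal.Sandbox` is the name of that primitive; the exact constructor and accessor names below are recalled from the docs and may have drifted across releases, so treat this as a hedged sketch of running untrusted code in isolation, not canonical usage:)

```python
import modal

# Sandboxes run arbitrary (e.g. LLM-generated) code in an isolated
# container, separate from your own process, and let you read the output.
app = modal.App.lookup("sandbox-sketch", create_if_missing=True)  # hypothetical name

sb = modal.Sandbox.create(
    "python", "-c", "print(sum(range(10)))",  # stand-in for generated code
    app=app,
)
sb.wait()                      # block until the sandboxed process exits
print(sb.stdout.read())        # "45"
print("exit code:", sb.returncode)
sb.terminate()
```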
I mean, over time it's evolved, and I think the long-term direction is actually more interesting, which is that the core container infrastructure we offer could actually be unbundled from the client SDK and offered to others. We're talking to a couple of other companies that want to, through their own products, programmatically execute jobs on Modal. So that's actually the direction Sandbox is going. It's turning into more of a platform for platforms, is kind of how I've been thinking about it.Swyx [00:40:45]: Oh boy. Platform. That's the old Kubernetes line.Erik [00:40:48]: Yeah. Yeah. Yeah. But having that ability to programmatically create containers and execute them, I think, is really cool. And I think it opens up a lot of interesting capabilities that are sort of separate from the core Python SDK in Modal. So I'm really excited about it. It's one of those features that we kind of released, and then we look at what users actually build with it, and people are starting to build kind of crazy things. And then we double down on some of those things when we see potential new product features. Sandbox, I think, is in that direction; we found a lot of interesting use cases in the direction of a platformized container runner.Swyx [00:41:27]: Can you be more specific about what you're doubling down on after seeing users in action?Erik [00:41:32]: I mean, we're working with some companies that, without getting into specifics, need the ability to take their users' code and then launch containers on Modal. And it's not about security necessarily; they just want to use Modal as a back end, right? They may already provide Kubernetes as a back end, Lambda as a back end, and now they want to add Modal as a back end. And so they need a way to programmatically define jobs on behalf of their users and execute them. And so, I don't know, that's kind of abstract, but does that make sense? I totally get it.Swyx [00:42:03]: It's sort of one level of recursion, to be the Modal for their customers.Erik [00:42:09]: Exactly.Swyx [00:42:10]: Yeah, exactly. And Cloudflare has done this; Kenton Varda from Cloudflare, who's the tech lead on this thing, called it sort of functions as a service as a service.Erik [00:42:17]: Yeah, that's exactly right. FaaSaaS.Swyx [00:42:21]: FaaSaaS. Yeah, I mean, I think that's something any base layer, second layer cloud provider, compute provider like yourself should provide. It's a mark of maturity and success that people just trust you to do that; they'd rather build on top of you than compete with you. The more interesting thing for me is, what does it mean to serve a computer like an LLM developer, rather than a human developer, right? That's what a sandbox is to me: you have to redefine Modal to serve a different, non-human audience.Erik [00:42:51]: Yeah. Yeah, and I think there are some really interesting people building very cool things.Swyx [00:42:55]: Yeah. So I don't have an answer, but, you know, I imagine things like, hey, the way you give feedback is different.
Maybe you have to stream errors, log errors differently. I don't really know. Yeah. Obviously, there are safety considerations. Maybe you have an API to restrict access to the web. Yeah. I don't think anyone would use it, but it's there if you want it.Erik [00:43:17]: Yeah.Swyx [00:43:18]: Yeah. Any other sort of design considerations? I have no idea.Erik [00:43:21]: With sandboxes?Swyx [00:43:22]: Yeah. Yeah.Erik [00:43:24]: Open-ended question here. Yeah. I mean, no, I think the network restrictions make a lot of sense. Yeah. I mean, long-term, I think there are a lot of interesting use cases where the LLM itself can decide, I want to install these packages and run this thing. And obviously, for a lot of those use cases, you want to have some sort of control so that it doesn't install malicious stuff and steal your secrets and things like that. But I think that's what's exciting about the sandbox primitive: it lets you do that in a relatively safe way.Alessio [00:43:51]: Do you have any thoughts on the inference wars? A lot of providers are just racing to the bottom to get the lowest price per million tokens. Some of them, per the SemiAnalysis report, are just losing money, and the physics of it just don't work out for them to make any money on it. How do you think about your pricing, and how much premium you can command, versus using lower prices as kind of a wedge into getting there, especially once people have Modal instrumented? What are the tradeoffs, and any thoughts on strategies that work?Erik [00:44:23]: I mean, we focus more on custom models and custom code. And I think in that space there's less competition, and I think we can have a pricing markup, right? People will always compare our prices to the GPU power they can get elsewhere, so how big can that markup be? We can never charge 10x more, but we can certainly charge a premium. And for that reason, we can have pretty good margins. The LLM space is the opposite: the switching cost of LLMs is zero, at least for open source, right? If all you're doing is using some inference endpoint that serves an open source model, and some other provider comes along and offers a lower price, you're just going to switch, right? So I don't know, to me that reminds me a lot of the 15-minute delivery wars, or Uber versus Lyft, you know. And maybe going back even further, I think a lot about the flip side of this, which is actually a positive side: I was thinking the other day a lot about the fiber optic boom of '98, '99, and also the overinvestment in GPUs today. Like, I don't know, in the end, I don't think VCs will have the return they expected in these things, but guess who's going to benefit: the consumers. Someone's reaping the value of this.
And that's, I think, an amazing flip side: we should be very grateful for the fact that VCs want to subsidize these things. You go back to fiber optics: there was an extreme overinvestment in fiber optic networks in '98, and no one who did that made money, but consumers got tremendous benefits from all the fiber optic cables that were laid throughout the country in the decades after. I feel something similar about GPUs today. And also, looking more narrowly at the LLM inference market specifically, that's great. I'm very happy that there's a price war. Modal is not necessarily participating in that price war, right? I think it's going to shake out, and then someone's going to win and then they're going to raise prices or whatever; we'll see how that works out. But for that reason, we're not hyper focused on serving, you know, just a straight-up endpoint to an open source model. We think the value in Modal comes from all the other use cases, the more custom stuff, like fine tuning and complex, guided-output type stuff. Or also outside of LLMs: a lot more image, audio, video stuff, because that's where there are a lot more proprietary models and a lot more custom workflows. And that's where I think there's a lot of value in software differentiation. Focusing on developer experience and developer productivity, that's where I think you can have more of a competitive moat.Alessio [00:46:58]: I'm curious what the difference is going to be now that the customers are enterprises. So with DoorDash or Uber, they're going to charge you more, and as a consumer you can decide not to take Uber. But if you're a company building AI features into your product using the subsidized prices, and then, you know, the VC money dries up in a year and prices go up, you can't really take the features back without a lot of backlash. But you also cannot really kill your margins by paying the new price. So I don't know what that's going to look like.Erik [00:47:28]: But margins are going to go up for sure. I don't know if prices will go up, because GPU prices have to drop eventually, right? So, in the long run, I still think prices may not go up that much, but certainly margins will go up. I think you said, Swyx, that margins are negative right now. You know, for some people, obviously, that's not sustainable, so certainly margins will have to go up. Some companies are going to have to make money in this space; otherwise they're not going to provide the service. But that's equilibrium too, right? At some point it sort of stabilizes, and one or two or three providers make money.Alessio [00:48:02]: Yeah. What else is maybe underrated about Modal, something that people don't talk enough about, or that we didn't cover in the discussion?Erik [00:48:11]: Yeah, what are some other things? We talked about a lot of stuff. We have the bursty parallelism; I think that's pretty cool. And we're working on a lot, kind of thinking more about the roadmap.
But one of the things I'm very excited about is building primitives for more IO-intensive workloads. And so we're building some crude stuff right now where you can create direct TCP tunnels to containers, and that lets you pipe data. We haven't really explored this as much as we should, but there are a lot of interesting applications. You can actually do kind of real-time video stuff in Modal now, because you can create a tunnel to — exactly, you can create a raw TCP socket to a container, feed it video, and then get the video back. And I think it's still not fully ergonomically figured out, but there's a lot of super cool stuff that I'm super excited about when we start enabling those more high-IO workloads. I think also working with large datasets, or taking the ability to map and fan out and building more higher level functional primitives, like filters and group-bys and joins. I think there's a lot of really cool stuff you can do there, but this is maybe, you know, years out.Swyx [00:49:18]: Yeah, we can just broaden out from Modal a little bit, but you still have a lot of great tweets, so it's very easy to just kind of go through them. Why is Oracle underrated?Erik [00:49:34]: I love Oracle's GPUs. I don't know what the economics look like for Oracle, but I think they're great value for money. We run a bunch of stuff in Oracle, and they have bare metal machines with like two terabytes of RAM and super fast SSDs. You know, I mean, we love AWS and GCP too, we have great relationships with them, but I think Oracle is surprising. If you told me three years ago that I would be using Oracle Cloud, I'd be like, what, wait, why? But now, you know,Swyx [00:49:55]: I'm a happy customer. And it's a combination of pricing and the kinds of SKUs I guess they offer.Erik [00:50:01]: Yeah. Great machines, good prices, you know. That's it. Yeah. Yeah. That's all I care about. Yeah. The sales team is pretty fun too. I like them.Swyx [00:50:09]: In Europe, people often talk about Hetzner.Erik [00:50:14]: Yeah. We've focused on the main clouds, right? Oracle, AWS, GCP; we'll probably add Azure at some point. I think, I mean, there's definitely a long tail of, you know, CoreWeave, Hetzner, Lambda, all these things. And over time, I think we'll look at those too, wherever we can get the right GPUs at the right price. Yeah. I mean, I think it's fascinating. It's a tough business; I wouldn't want to try to build a cloud provider. You just have to be incredibly focused on efficiency and margins and things like that. But I mean, I'm glad people are trying.Swyx [00:50:45]: Yeah. And you can ramp up on any of these clouds very quickly, right? Because it's your standard stack.Erik [00:50:50]: Yeah. I mean, yeah. What Modal does is programmatic launching and termination of machines. So what's nice about the big clouds is they have relatively mature APIs for doing that, as well as support for Terraform for all the networking and all that stuff. So that makes it easier to work with the big clouds.
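(On the raw TCP tunnel primitive Erik mentioned a moment ago: Modal exposes a `modal.forward` context manager for this, but the `unencrypted` flag and `tcp_socket` accessor below are recalled from the docs and may differ in current releases, so this echo-server sketch is a rough illustration rather than a recipe:)

```python
import socket
import modal

app = modal.App("tunnel-sketch")  # hypothetical app name

@app.function()
def serve():
    # Expose a raw TCP port on this container to the public internet,
    # so a client can pipe bytes (e.g. video frames) in and out.
    with modal.forward(9000, unencrypted=True) as tunnel:
        host, port = tunnel.tcp_socket  # public (host, port) to connect to
        print(f"pipe bytes to {host}:{port}")
        srv = socket.create_server(("0.0.0.0", 9000))
        conn, _ = srv.accept()
        while chunk := conn.recv(4096):
            conn.sendall(chunk)  # echo back; a real app would transform the stream
```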
But yeah, I mean, some of those things I also expect the smaller clouds to embrace in the long run. But also, you know, we can probably integrate with some of the clouds even without that. There's always an HTTP API you can use to script something that launches instances.Swyx [00:51:24]: Yeah. I think a lot of people are always curious about whether or not you will buy your own hardware someday. I think you're pretty firm that it's not your interest, but your story and your growth do remind me a little bit of Cloudflare, which obviously, you know, invests a lot in its own physical network.Erik [00:51:42]: Yeah. I don't remember, in the early days, did they have their own hardware, or?Swyx [00:51:47]: They pushed out a lot with agreements through other providers.Erik [00:51:52]: Yeah. Okay. Interesting.Swyx [00:51:53]: But now it's all their own hardware. So I understand.Erik [00:51:57]: Yeah. I mean, my feeling is that when you're a venture funded startup, buying physical hardware is maybe not the best use of the money.Swyx [00:52:06]: I really wanted to put you in a room with Eiso Kant from Poolside. Yeah. Because he has the complete opposite view. Yeah.Erik [00:52:12]: That would be great. I mean, I just think, from a capital efficiency point of view, do you really want to tie up that much money in physical hardware and think about depreciation? As much as possible, I favor a more capital efficient way: we don't want to own the hardware, because ideally we want the margin structure to be 100% correlated between revenue and COGS, in the sense that when someone comes and pays us $1 for compute, we immediately incur a cost of, whatever, 70 or 80 cents. If there's complete correlation between cost and revenue, you can leverage up in a nice way and scale very efficiently. You know, it turns out that's hard to do. You can't just only use spot and on-demand instances; over time, we've actually started adding a pretty significant amount of reservations too. And reservations are always one step towards owning your own hardware. But I don't know, do we really want to be thinking about switches and cooling and HVAC and power supplies? Disaster recovery. Yeah. Is that the thing I want to think about? I don't know. I like to make developers happy. But who knows, maybe one day — I don't think it's going to happen anytime soon, though.Swyx [00:53:23]: Yeah. For what it's worth, obviously, I'm a believer in cloud, but it's interesting to have the devil's advocate on the other side. The main thing you have to do is be confident that you can manage your depreciation better than the typical assumption, which is two to three years. Yeah. Yeah. And so the moment you have a CTO that tells you, no, I think I can make these things last seven years, then it changes the math.Erik [00:53:46]: Yeah. Yeah. But are you deluding yourself then? That's the question, right? It's like the Waste Management scandal. Do you know about that?
They had this big accounting scandal back in the 90s — it's a garbage company — where they started assuming their garbage trucks had a 10-year depreciation schedule and booked a massive profit; the stock went way up. And then it turns out all those garbage trucks actually broke down, and you can't really depreciate them over 10 years. And so then the whole company had to restate all the earnings.Alessio [00:54:18]: Let's go into some personal nuggets. You received the IOI gold medal, which is the International Olympiad in Informatics.Erik [00:54:29]: 20 years ago.Alessio [00:54:30]: Yeah. How are these models going to change competitive programming? Do you think people will still love the craft? I feel like over time programming has maybe lost a little bit of its luster in the eyes of a lot of people. I'm curious to see what you think.Erik [00:54:51]: I mean, maybe, but I don't know. I've been coding for almost 30, or more than 30, years. And you look at programming today versus where it was 30, 40, 50 years ago: there are probably a thousand times more developers today, and every year there are more and more. And at the same time, developer productivity keeps going up. And when I look at the real world, I just think there's so much software that's still waiting to be built. I think we could 10x the number of developers and still have a lot of people making a lot of money building amazing software, while at the same time being more productive. I never understood this idea that AI is going to replace engineers; that's very rarely how this actually works. When AI makes engineers more productive, demand actually goes up, because the cost of engineers goes down, because you can build software more cheaply. And that's, I think, the story of software in the world over the last few decades. So I don't know how this relates to competitive programming. Going back to your question, competitive programming to me was always kind of a weird niche. I love it. It's puzzle solving. And my experience is, half of competitive programmers are able to translate that into actually building cool stuff in the world, and half just get sucked into the puzzle stuff and it never loses its grip on them. But for me, it was an amazing way to get started with coding, or get very deep into coding, and kind of battle it out with other smart kids, and travel to different countries when I was a teenager.Swyx [00:56:29]: I was just going to mention, it's not just that he personally is a competitive programmer; I think a lot of people at Modal are competitive programmers. I think you met Akshat through that. Akshat, the co-founder, is also a gold medalist.Erik [00:56:42]: By the way, a gold medal doesn't mean you win — a gold medal is roughly the top 20, 30 people — although we actually had an intern who won IOI.Swyx [00:56:47]: Yeah. Obviously, it's very hard to get hired at Modal. But what is it like to work with such a talent density? How is that contributing to the culture at Modal?
Yeah. I mean, I think humans are the root cause of everything at a company: bad code is because of a bad human, or, you know, bad culture.Erik [00:57:03]: So I think talent density is very important, keeping the bar high and hiring smart people. And it's not always the case that hiring competitive programmers is the right strategy, right? If you're building something very different, it may not be. But we actually end up having a lot of hard, complex challenges. I talked about the cloud resource allocation: it turns out you can phrase that as a mixed integer programming problem, and we now have that running in production, constantly optimizing how we allocate cloud resources. There are a lot of interesting, complex scheduling problems. How do you do all the bin packing of all the containers? So I think for what we're building, it makes a lot of sense to hire these people who like those very hard problems.Swyx [00:57:52]: Yeah. And they don't necessarily have to know the details of the stack. They just need to be very good at algorithms.Erik [00:57:56]: No, but my feeling is that people who are pretty good at competitive programming can also pick up other stuff elsewhere. Not always the case, but there's definitely a high correlation.Swyx [00:58:08]: Oh yeah. I'm interested in that just because there are competitive mental talents in other areas, like competitive speed memorization or whatever, and you don't really see those transfer. And I always assumed, in my narrow perception, that competitive programming is so specialized, so obscure, even so divorced from real world scenarios, that it doesn't actually transfer that much. But obviously, for the problems that you work on, it does.Erik [00:58:34]: But it's also, frankly, that it translates to some extent not because the problems are the same, but because it filters for the people who are willing to go very deep and work hard on things, right? I feel like a similar thing is that a lot of good developers are talented musicians. Why? Why is that a correlation? And my theory is, it's the same sort of skill: you have to hyper focus on something and practice a lot. And there's something similar there that I think creates good developers.Alessio [00:59:02]: Yeah. Sweden also had a lot of very good Counter-Strike players. I don't know, why did Sweden have fiber optics before all of Europe? I grew up in Italy and our internet was terrible, and then I feel like all the Nordics had amazing internet. I remember getting online, and people in the Nordics were at five ping, 10 ping.Erik [00:59:23]: Yeah. We had a very good network back then. Yeah. Do you know why? I mean, I'm sure the government did certain things quite well, right? Like, in the nineties there were a bunch of tax rebates for buying computers, and I think there were similar investments in infrastructure. I mean, I was thinking about how I still can't use my phone in the subway in New York, and that was something I could do in Sweden in '95.
You know, we're talking almost 30 years. Right. Like, why? And I don't know, certain infrastructure, you know, Sweden was just better at. I don't know.Alessio [00:59:53]: And also, you never owned a TV or a car?Erik [00:59:59]: Never owned a TV or a car. I never had a driver's license.Alessio [01:00:01]: How do you do that in Sweden, though? Like, that's cold.Erik [01:00:03]: I grew up in a city. I mean, I took the subway everywhere, or biked, or whatever. Yeah. I always lived in cities, so I never felt — I mean, me and my wife have a car, but — That doesn't count. I mean, it's in her name, because I don't have a driver's license. She drives me everywhere. It's nice.Swyx [01:00:21]: Nice. That's fantastic. I was going to ask you, the last thing I had on this list was your advice to people thinking about starting some sort of run-code-in-the-cloud startup: only do it if you're genuinely excited about spending five years thinking about load balancing, page faults, cloud security, and DNS. So basically, it sounds like you're summing up a lot of the pain of running Modal. Yeah. Yeah.Erik [01:00:43]: Like, one thing I struggle with is, I talk to a lot of people starting companies in the data space, or the AI space, or whatever, and they sort of come at it from an application developer point of view. And they're like, I'm going to make this better. But guess how you have to make it better: you have to go very deep on the infrastructure layer. And so one of my frustrations has been that so many startups are, in my opinion, Kubernetes wrappers, and not very thick wrappers — fairly thin wrappers. And I think every startup is a wrapper to some extent, but you need to be a fat wrapper. You need to go deep and build some stuff. And if you build a tech company, you're going to have to spend five, 10, 20 years of your life going very deep and building the infrastructure you need in order to make your product truly stand out and be competitive. And I think that goes for everything. I mean, if you're starting a, whatever, online retailer of, I don't know, bathroom sinks, you have to be willing to spend 10 years of your life thinking about, you know, bathroom sinks; otherwise it's going to be hard.Swyx [01:01:37]: Yeah. I think that's good advice for everyone. And yeah, congrats on all your success. It's pretty exciting to watch. It's just the beginning. Yeah. Yeah.Erik [01:01:45]: It's exciting. And everyone should sign up and try out Modal: modal.com. Yeah. Now it's GA. Yay. Yeah.Swyx [01:01:50]: It used to be behind a waitlist. Yeah. Awesome, Erik. Thank you so much for coming on. Yeah, it's amazing. Thank you so much. Thanks.Swyx [01:02:11]: Bye. Get full access to Latent.Space at www.latent.space/subscribe
-
Cloud Intelligence at the speed of 5000 tok/s - with Ce Zhang and Vipul Ved Prakash of Together AI
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-02-08 16:57
Our first ever demo day aimed for 15-20 people and ended up ballooning to >200 and being covered in the news. We are now running the 2024 edition in SF on Feb 23: Latent Space Final Frontiers, a startup and research competition in “The Autonomous Workforce”, “Beyond Transformers & GPUs”, and “Embodied AI”. RSVP here! You can find all LS online/IRL events on our new calendar. Super Early Bird tickets have just gone on sale for AI Engineer World’s Fair, June 25-27!Today we have the honor of hosting two of Together AI’s co-founders: Ce Zhang (CTO) and Vipul Ved Prakash (CEO). This is a rare opportunity to recap the history of the company since our last check-in with Tri Dao (Chief Scientist), cover some of their big releases, and do a deep dive into the state of the AI inference market. Together has emerged as one of the most consequential new startups in the new AI summer, last announcing a ~$100m Series A raise in November (at a ~$360-565m valuation). Note from future: about a week after this pod was published, rumors were confirmed that Salesforce had led another $100m Series B at a $1b valuation.But there are at least three Togethers - Together the Research Lab, Together the Fine Tuning & Inference platform, and Together the custom models service. As we clarify on the pod, the overarching philosophy of Together is the ability to improve on all these fronts simultaneously by being “full stack”, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms.Bringing Research and Industry TogetherIn just one year, Together has been behind some of the most exciting research in AI:* RedPajama, a fully open source dataset for model pre-training which mirrored the Llama 1 recipe, then followed by RedPajama V2, a 30T-token dataset of filtered and de-duplicated tokens. * RedPajama-INCITE-3B and 7B, which were SOTA in a few benchmarks at the time of release. * FlashAttention-2, developed by Together’s Chief Scientist Tri Dao. We covered FA-2 in a previous episode with him.* Mamba-3B, the most promising transformer-alternative model that they released in collaboration with Cartesia. * StripedHyena, a SOTA graft of Hyena state space models and transformer models* Medusa, an alternative to speculative decoding that lets you use multiple decoding heads instead of a draft model. * MonarchMixer, which was one of the most popular orals at NeurIPS 2023. It’s an approach to transformers that replaces many of their core parts with Monarch matrices for better computational efficiency. And I’m sure we missed something! As Vipul reveals, almost 50% of Together staff are researchers, and two of their co-founders (Chris Ré and Percy Liang) are professors at Stanford, so we can expect a lot more here.Bringing “Disaggregated” GPUs TogetherOn their cloud, they offer inference as a service, fine-tuning, pre-training, etc, but unlike other providers they think of themselves as a disaggregated cloud. Today, they have ~8,000 A100 and H100 GPUs on their platform (an exclusive revealed on the pod!) totaling over 20 exaflops of compute, but instead of just buying more and putting them in a cluster and then exposing a `us-east-1` option for customers, they are taking heterogeneous compute sources and adding a unified layer on top of it for developers to consume. 
Building on Ce’s research, Together’s GPU Clusters are taking on comparable AWS and GCP offerings in both cost and speed:Take the Hessian AI center in Germany or the DoE’s INCITE; they have GPUs that they want to share with researchers, but they lack the cloud layer over them. Similarly, there’s starting to be more and more differentiation amongst types of GPUs: H100s, A100s, MI300s, etc. Each of them has different availability and performance depending on the task, and the end user shouldn’t have to be a hardware expert to run inference on a model, so Together abstracts a lot of that away.A big theme of the Together inference stack, a “bag of 50 tricks” that we discuss on the pod, is also “hardware-aware” algorithms like FlashAttention and Mamba, which further emphasize the benefits of co-developing everything together:Special Focus: Transformer AlternativesAs we mentioned above, they are also funding a lot of research in Transformer alternatives. To reiterate a few points on why they matter:* Longer context is not the motivation for sub-quadratic architectures: Transformers don’t inherently have hard limitations on context size, but they just get extremely expensive as context grows. When developing sub-quadratic alternatives, you easily enable very long context, but that’s not how you should compare them. Even at the same context size, inference and training are much cheaper on sub-quadratic architectures like Hyena.* Emergence of hybrid architectures: a lot of early conversations have been around the “post-Transformers” era, but it might be more like “half-Transformers”. Hybrid architectures could have split layers with some transformer-based and some state-space ones. One of the challenges is that a lot of hardware kernels are optimized for transformer operations, so you’d lose a lot by moving away completely.* Higher speed = higher GPU throughput: if we could reach the same benchmark performance on sub-quadratic architectures, it’d solve a lot of the GPU crunch. Today we peak at ~170 tok/s on inference in some open models; if we could reach 5,000 tok/s on the same card, you’d be able to serve 30x more customers on the same hardware. As a cloud provider, you’re obviously incentivized to get there.We had a lot of fun chatting with the Together guys and we covered a lot of ground, so enjoy the conversation!Note: This is the first episode of a “cloud providers mini-series”. 
We have Erik from Modal and Ben from Replicate coming up next!Video PodcastJoin us in watching the video version of this pod on our snazzy YouTube!Show Notes* Together AI* RedPajama Dataset v1 Announcement* RedPajama Models v1 Announcement* Together Embeddings* StripedHyena-7B* Mamba-3B-SlimPJ* Vipul's X thread on Anyscale* Vipul's Razor* SemiAnalysis' "Inference Race to the Bottom" post* Chris Ré* Mike Conover's episode* Slim Pajama by Cerebras* Dolma by AI2* Jina AI* Tengyu's Voyage AITimestamps* [00:00:00] Introductions* [00:00:43] Origin and current state of Together.ai* [00:02:15] Transition from Apple to Together and the vision for open AI* [00:04:54] How Chris Ré introduced Ce and Vipul* [00:08:43] How RedPajama came to be* [00:13:34] Model training and Transformer alternatives* [00:15:37] DSIR and the importance of data in LLMs* [00:21:19] Inference vs Fine-tuning vs Pre-training usage on Together* [00:23:20] Together's GPU stash* [00:27:02] Why standardization of inference metrics is important* [00:29:26] Building moats in AI inference* [00:31:49] Federated vs disaggregated cloud computing* [00:34:57] Opportunities for improvement in the inference stack* [00:36:13] Anyscale benchmarking drama* [00:41:27] Not just an inference platform* [00:43:50] Together Embeddings and the future of embedding models* [00:45:53] State space models and hybrid architectures* [00:53:52] The need for 5,000 tokens/s speed in AI inference* [01:00:23] What's the most interesting unsolved question in AI?TranscriptAlessio [00:00:00]: Hey, everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:14]: Hey, and today we're together with Together. Welcome to the studio, guys.Ce / Vipul [00:00:20]: Thank you.Swyx [00:00:21]: I don't know how you typically give self intros, but does anyone want to go first? How do we get our audience acquainted, especially with who's speaking, because it's unusual for us to do a four-person pod. Yeah.Ce [00:00:33]: Hi, everyone. I'm Ce. I'm one of the co-founders of Together and the CTO, working with the team on technical things.Vipul [00:00:40]: I'm Vipul Ved Prakash, co-founder and CEO of Together.Swyx [00:00:43]: I always consider you guys as one of the sort of all-in-one companies. I always want to say labs, but I feel like you're not a lab. What is the sort of origin of Together, and then what is it today? I feel like it used to be Together.xyz, and then now you're Together.ai.Vipul [00:01:00]: I think fundamentally, Together is about open and independent AI systems. We think this is one of the most consequential technologies of our time, and when we started the company in June 2022, our focus was to build a platform for open source, independent, user-owned AI systems. One way to think about it is that big labs, frontier model labs, have built their own developer platforms for their models. We think of Together as a platform for everything else, whether these are open models or models built by companies and owned by them. From our sort of xyz roots, we have a fairly deep decentralization and open ethos that kind of reflects in all of our platform and strategy and business. 
And also, the way we structure our cloud is by combining data centers around the world; we are today not located in hyperscalers, we have built a footprint of AI supercomputers in this sort of very disaggregated, decentralized manner.Alessio [00:02:15]: I know before Together, you were at Apple, so you go from the most walled garden, private, we-don't-say-anything company, to we want everything to be open and everybody to know everything. What maybe did you learn from the Apple way of being super closed and polished, and what are you taking now to Together to make it open, but also a very nice developer experience?Vipul [00:02:37]: Yeah, I would say, you know, my background has been in open source for a long time. One of the first things I created was a collaborative spam filter, back in the day; it's called Vipul's Razor. And it became quite popular. And the first company I founded, called Cloudmark, was built around taking open source and building both an open side of it and a commercial product around it. I think Apple is very focused on providing this amazing experience to its customers, with most of the technology hidden behind the product. And certainly the focus on fluidity, and applying complex technology to make everyday things simple, is something that Apple does really well. And, you know, that's been a big part of how we think about our developer platforms. I think it informs it. The other thing is that during my years at Apple, we worked a lot on deep learning. And one of the things that was very viscerally accessible to me was how well these systems worked. We built an open domain Q&A system; this was based on Facebook's LSTM paper in 2016. And it was remarkable, because we had a parallel system based on information retrieval techniques, which was extremely complicated and didn't work that well, and this thing we wrote in a week had just incredible performance. So I think some of those experiences, at least for me personally, were creating this roadmap of how important and powerful this technology is. And when the scaling laws paper was published, it was clear to me it was in some ways something very profound. We've never had algorithms that improve in capabilities with scale out. So this is almost a new era of computing. So that's been, I think, the influence of Apple: my years at Apple really, for me, crystallized the value of what we are doing together.Alessio [00:04:54]: And how did you decide to join forces? Because you did a postdoc with Chris Ré at Stanford. You know, we already had Tri Dao from Together, and we talked about Hazy. What was the meeting of the minds of, hey, I come from the more technical postdoc, assistant professor background, and we've got yet a more product thing? What got you excited to build this now?Ce [00:05:15]: So we have been working on this together, Chris and I, for essentially the last 10 years, right? A machine learning system 10 years ago was like probabilistic graphical models, right? And then convolutional neural networks, and then all the foundation models that we see today. But if you look at this, I think that fundamentally the thing we are actually optimizing is not that different. It's always about data movement across essentially all the stacks, right? 
So when you do, like, distributed computing, it's about communication across different machines. When you do, for example, FlashAttention, it's about data movement at a different, essentially memory, hierarchy, right? So we have been doing this for the last 10 years and seeing the field start to grow, grow, grow. So we kind of feel the current wave of technology is actually the perfect time to bring all the research essentially into something real. And we are super lucky that we got introduced to Vipul, right? And then we hoped to join forces and bring this to the real world.

Swyx [00:06:10]: It's an unusual team of like sort of research and industry. Like you've been like a third or fourth time founder now. Third time founder, yeah. And so like what is your first order of business when you like set up Together? Like how do you sort of put something like this together? Oh my God, I'm going to use this word so much.

Vipul [00:06:27]: I feel AI companies are really kind of driven by research. And Chris and I had been talking about how to reduce the cost of building models. We felt that there aren't really big data moats around foundation models. They are built from a subset of the web. What is difficult is the cost of capital to build these. And one of the ways in which you can reduce this cost is by making more efficient systems. With that, it was really about finding the right set of co-founders and team. In fact, when Chris introduced me to Ce, I think within the first five minutes of talking to Ce, I was like, we are starting this company. And our early focus was thinking about this more sort of disparate set of resources, you know, GPUs around the internet. Can we use those to build? And we really have to compress communication for, you know, when we do gradient averaging, there's just a lot of traffic. And if you can reduce that somehow, you sort of open up the possibility of using cheaper compute, you know, across the network. And Ce's research for a decade has been in that subject. You know, and from there, finding, you know, other folks in the network, I think there is generally a lot of excitement and philosophical alignment around what we are doing, which, you know, we publish papers, we publish open source libraries and code, we build open models. And I think the people in academia in, you know, machine learning and NLP, that's really what they want to do. So I think that's been really a kind of kernel for, you know, the composition of the company. And we're lucky to have, you know, at this point, attracted some of the best researchers in the field. So I think that's the most important thing. And, you know, the rest of it is sort of driven by us. A couple of these philosophies around independent systems and decentralization and good developer interfaces, you want to make it accessible. That's, you know, just as important. And the rest follows from there, I think.
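Vipul's point about compressing gradient-averaging traffic is the heart of training over disaggregated, lower-bandwidth networks. Below is a minimal sketch of one standard idea from that literature, top-k gradient sparsification with error feedback; it illustrates the general technique, not Together's actual stack, and every name and shape in it is made up.

```python
import numpy as np

def topk_compress(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries; ship (indices, values)."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def topk_decompress(idx: np.ndarray, vals: np.ndarray, shape) -> np.ndarray:
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = vals
    return flat.reshape(shape)

# Error feedback: accumulate whatever we didn't send and add it back next
# round, so the compression error stays bounded instead of biasing training.
residual = np.zeros((1024, 1024))

def compress_step(grad: np.ndarray, k: int = 10_000):
    global residual
    corrected = grad + residual
    idx, vals = topk_compress(corrected, k)
    residual = corrected - topk_decompress(idx, vals, grad.shape)
    return idx, vals  # roughly k / grad.size of the original traffic

grad = np.random.randn(1024, 1024)
idx, vals = compress_step(grad)
print(f"sent {idx.nbytes + vals.nbytes:,} bytes instead of {grad.nbytes:,}")
```

Each worker would ship only the (indices, values) pairs to its peers for averaging; the residual buffer is what makes that aggressive compression tolerable over many steps.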
Alessio [00:08:43]: I want to try and fill in some of the blanks in the history of Together. I think people come on your website today and they say, you raised a hundred million dollars Series A. They're like, wow, these guys are like a super legit company. But it feels like Red Pajama just came out a year ago. I remember we had Mike Conover in the studio, who had built Dolly at Databricks. And you announced it literally the morning we were recording. So we're like in the studio on our phones, looking at it. And it's like, wow, this is like the first time now there's like a good curated dataset to do open pre-training. So maybe let's start from there. Like, what was the motivation behind it? Why did you decide to do that? Datasets are one of the things that most people don't want to work on. They just want to do models, not datasets.

Ce [00:09:27]: Yeah. So, yeah, first of all, it's not the first, right? So I think it's actually built on a whole bunch of amazing effort the community already had. For example, EleutherAI has The Pile, right? There's a whole bunch of amazing datasets, like C4, right, from Google. So we really got inspired by the impact those datasets have on the community, right? So I think when we did Red Pajama, it was a time when people were really fascinated by Llama, the model, like Llama 1, right? Which feels like decades ago, right? But people were really excited about the quality, right? So that was really a big shift in how people think about open models. People started to see hope, right? But the one problem with Llama is that the data recipe is described in a pretty detailed way in the paper, but the data is actually not there. So our original thinking was, how about we take the recipe and try to do our best-effort reproduction and put it out, such that we can learn from our mistakes in the reproduction together, right? So that's essentially the original thinking behind Red Pajama. And we have been pretty happy and excited about what the community has built on it. For example, there's a dataset called SlimPajama, right? Which does deduplication over our data, right?

Swyx [00:10:38]: From Cerebras, did they talk to you before?

Ce [00:10:39]: Oh, yeah, yeah, yeah, yeah. So, yeah, we are very good friends, so we can discuss technical perspectives. We are pretty excited, because I think the reason we did Red Pajama in the first place is that people can actually build not only models, but also datasets, essentially over that piece of artifact, right? So that's actually what inspired us to do the first version of the Red Pajama dataset.

Swyx [00:11:01]: Yeah, and then you released V2 maybe two months ago.

Ce [00:11:04]: Yeah.

Swyx [00:11:05]: 30 trillion tokens.

Ce [00:11:06]: Yeah, 30 trillion tokens. So I think what's exciting about Red Pajama V2 is not only the number of tokens, but that we started to learn from Red Pajama V1. So one thing that we learned was that data quality is really the core, right? So you want to take this couple-trillion-token dataset and try to bring it down to maybe one trillion or two trillion, right? The way that you actually filter it and deduplicate it is not something that can be pre-decided before you see the application, right? So you kind of want to have a modular framework to think about data quality, right? So given an application, let's automatically, or maybe semi-automatically, try to come up with a way to filter it down. So that's why in Red Pajama V2, we overlay the dataset with like 40 different pre-computed quality signals, right? If you want to reproduce your best-effort, like, C4 filter, it's kind of like 20 lines of code, right? And this opens up the opportunity to actually put different filters together, learn combinations of filters. We are very excited to see what the community actually comes up with using Red Pajama V2.
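The "20 lines of code" claim is concrete enough to sketch. Something like the following is the shape of it, assuming the HuggingFace-hosted copy of the dataset; the config name and exact signal keys (`rps_doc_word_count`, `ccnet_perplexity`) are from memory and may differ, so treat this as a sketch and check the dataset card before relying on it.

```python
import json
from datasets import load_dataset

# Stream RedPajama-V2; each document carries ~40 pre-computed quality
# signals, so a "C4-style filter" is just a predicate over those signals.
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",          # assumed: the small sample config
    split="train",
    streaming=True,
)

def keep(doc) -> bool:
    sig = json.loads(doc["quality_signals"])   # assumed JSON-encoded field
    # Signals are (start, end, value) spans; [0][2] takes the
    # document-level value.
    words = sig["rps_doc_word_count"][0][2]
    ppl = sig["ccnet_perplexity"][0][2]
    return 50 <= words <= 100_000 and ppl < 500.0

for i, doc in enumerate(d for d in ds if keep(d)):
    print(doc["raw_content"][:80])
    if i >= 2:
        break
```

The point Ce is making is that the thresholds live in your code, not in the dataset: swap the predicate and you have a different curated corpus from the same 30 trillion tokens.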
Swyx [00:12:11]: It was retrospectively so obvious that this is a good idea that I wonder how come more datasets don't do this. You release the dataset with all these toggles that you can turn on and off, right? And you can sort of tune up and down the quality in ways that you believe are important to you. Yeah, it just makes so much sense now in retrospect. Because everyone just publishes like their pipeline and then the end result. But what about all the intermediate stages? Yeah.

Ce [00:12:35]: Yeah, so I think there are multiple things there. I don't think we are the only one doing that. For example, there's Dolma from AI2, right? They have this very flexible format to actually put in those quality signals, right? And I think we are actually compatible with them, right? So you can actually load Red Pajama using their tooling. That whole thing should work, right? So I think one fundamental thing that changed in the last year, essentially, is that in the beginning, when people thought about data, it was always like a byproduct of the model, right? You release the model, you also release the data, right? The dataset is there essentially to show people, ah, if you train on this data, you'll get a good model. But what started to change is that when people started building more and more of those models, people started to realize that different subsets of the dataset are valuable for different applications, right? The data becomes something to play with, right? So I think we are kind of lucky that we happened to release Red Pajama right at that point, so that we got this opportunity to actually learn from that.

Alessio [00:13:34]: And you guys have a custom model training platform on Together, too. You have a bunch of stuff in there for data selection, like DSIR and things like that. How did you decide to work on that versus, because you first started with like some of the fine-tunes on Llama. Do you see a lot of interest there? And I know you've been doing a lot of research on state space models and other transformer alternatives. Like, do you also see that as something you'll keep working on this year and push more people towards?

Vipul [00:14:02]: Yeah, I mean, we, you know, we think about how to make training more efficient and building models more efficient. Part of that is being able to select the right dataset. This is why you have signals, DSIR. You can start with a small dataset and find similar documents, build models with that. So we think it's an important part of the kind of model-building tooling that, you know, is sort of widely useful for people building different kinds of models. Similarly, you know, we are running into the limits of how fast you can make transformers. And we want inference at 5,000 tokens per second. I don't think we will get there with transformers, and we need to learn longer sequences. Data, again, becomes very, very expensive with transformers. So we work on state space models and all the research that we are doing there. And hopefully other labs will pick up on this and make it a kind of important target for optimization. But we think that, you know, open source is a great place for this. We can provide these recipes for data and for training to our customers who are building, you know, custom models themselves. And, you know, we are quite excited about the sort of progress we are seeing there.

Alessio [00:15:18]: Do you have some of these models available for inference on Together?
Can people play around with them directly, you know?

Swyx [00:15:25]: Yeah.

Vipul [00:15:25]: Yeah, they're available for inference on our serverless platform.

Swyx [00:15:29]: I always try to be the person who asks about acronyms in case, you know, people want to understand. Should we explain importance resampling, you know, that kind of stuff?

Ce [00:15:37]: Oh, yeah. So DSIR, essentially, it's a fundamental idea. So it's one of the papers from Percy, right? So essentially, if you know what you are doing, you can actually use that as a very strong signal about what data to put into the training process, right? So that's essentially the fundamental idea, right? And then more concretely, right, there are actually different versions of DSIR. So one version is, like, if you have a validation set, right, you can actually somehow measure the similarity between the validation set and your pre-training corpus, and essentially take a subset. And there's often a less targeted version of DSIR where you say, yeah, maybe Wikipedia is actually a very good corpus. Let's try to find more Wikipedia, right? And you can think about it in two ways, either as a way to come up with different weights for different data slices, yeah, like a filtering type of step for a dataset, or think about it as data augmentation. So that's how, yeah, that's how we think about DSIR.

Swyx [00:16:33]: That makes sense. I will have to read the paper to understand a little bit more. Because when you say things like, we have to know in advance what we are trying to do with the model, and then we do importance resampling, that is against the principle of general intelligence, right? Like, the point is to train AGI.

Ce [00:16:48]: Yeah, so it depends on what you mean by being general or generic, right? So I think, I mean, you can always take a meta-learning perspective, that we know the distribution of tasks that we care about, right? So you can always go up the ladder of how general the whole thing is, right? But also, for many of the customers that we are actually talking to, right, they have kind of very targeted applications, right? The benefit you can get out of that is you could build a better open model, often smaller, often easier to do inference on, if you know what you want, right? So I think the whole trade-off would be: the x-axis would be how generic the whole thing is, and the y-axis would be not only the top accuracy, but also a whole bunch of deployment costs, right? The size of the model, right? The robustness of the model. So I think different people will navigate that space in different ways. And we want to be the platform where, essentially, whatever point you want, we have a solution for you.
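To make the acronym fully concrete: DSIR (Data Selection with Importance Resampling, from Percy Liang's group) scores each raw document by how much more likely its hashed n-grams are under a target corpus than under the raw pool, then resamples by that weight. The toy below is a from-scratch illustration of the idea Ce describes, not the paper's official implementation, and the corpora are obviously fabricated.

```python
from collections import Counter
import math

def ngram_probs(texts, n=2, buckets=4096):
    """Hashed bag-of-bigram probabilities, the cheap feature space DSIR uses."""
    counts = Counter()
    for t in texts:
        toks = t.lower().split()
        for i in range(len(toks) - n + 1):
            counts[hash(" ".join(toks[i:i + n])) % buckets] += 1
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}

def log_importance(doc, p_target, p_raw, n=2, buckets=4096, eps=1e-8):
    """Length-normalized log p_target(x)/p_raw(x) over hashed n-grams."""
    toks = doc.lower().split()
    grams = [hash(" ".join(toks[i:i + n])) % buckets
             for i in range(len(toks) - n + 1)]
    lw = sum(math.log(p_target.get(g, eps)) - math.log(p_raw.get(g, eps))
             for g in grams)
    return lw / max(1, len(grams))

target = [  # the "validation set": what we want more of
    "the encyclopedia entry describes the history of the roman empire",
    "this article covers the theory of general relativity in detail",
]
raw_pool = [  # the big noisy corpus we are subsetting
    "omg you will not believe this one weird trick lol",
    "the history of the roman empire is covered in this article",
    "click here to win a free prize now now now",
]

p_t, p_r = ngram_probs(target), ngram_probs(raw_pool)
ranked = sorted(raw_pool, key=lambda d: -log_importance(d, p_t, p_r))
print(ranked[0])  # the encyclopedia-like document should rank first
```

In the "less targeted" variant Ce mentions, you would simply swap the validation set for a chunk of Wikipedia and keep everything else the same.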
Swyx [00:17:43]: One more thing on data before we go deeper on state space models. Are we running out of data? Can we go up an order of magnitude? Can we go five orders of magnitude? How do both of you think about how much data we have and how much we need?

Ce [00:17:55]: Yeah, so I think that's a very, very good question. So I don't think we are running out of data on Earth.

Swyx [00:18:02]: Right, so think about it globally. Training data, training-quality data.

Ce [00:18:05]: Yeah, yeah, so I think, I mean, some of it is not accessible, right? But I do think there are many organizations in the world that have enough data to actually train very, very good models, right? So, I mean, they are not publicly available, right? But there are people who actually have access to those, right? So I think in general, right, if you think about the data in the open space, right, so I guess that is specifically what you mean by whether we are running out of data: I do think there needs to be some way, right, that people who are training open models get connected with data that's essentially not internet data. So I think that channel needs to be opened up for open models to get more data, right? But I'm kind of on the optimistic side, that society will figure out a way that we can train open models beyond this internet data.

Swyx [00:18:57]: Beyond internet, meaning books?

Ce [00:19:00]: I mean, there are a lot of those, right?

Swyx [00:19:02]: Books, right?

Ce [00:19:02]: Transcripts, right? Videos, audio, right? So there are a whole bunch of data sources that we are not integrating into the open data side, right? And maybe they shouldn't be open, right? So I think the community needs to figure out a way, yeah, like the best balance, yeah? Such that we can have open models, but on the other hand, also have a reasonable collection of data that we can actually use.

Swyx [00:19:29]: I think a lot of people think that, there's a theory that Whisper was released so that you could transcribe YouTube and then use that as a source of tokens. Then I talked to other researchers who are like, you know, YouTube has very low quality tokens. You know, do you want your model to talk like a live streamer from YouTube? Because that's what they're going to do. So it's not clear what the quality of this data could be.

Ce [00:19:53]: Yeah, I guess that depends on your application, right? So I think as a platform, right, our goal is, whatever application you have, yeah, we have a platform where you can actually achieve your goal, right? So there are definitely applications where it makes sense to speak like YouTube, right? But there are probably also other applications that are kind of more on the formal side, right? So I think there are going to be a diverse collection of models, both open and closed, right? And we kind of want to be the engine that powers that.

Swyx [00:20:21]: There's a lot of people who own data sources who are doing the locally optimal thing, and humanity as a whole is losing out. So like the New York Times is suing OpenAI, you know, Stack Overflow shut down their API, Reddit shut down their API, X, you know, made their own model, right, on Twitter data. We're just going to have all these like tiny little gardens of data that would be useful in a general model, but everyone's just trying to make their own model. And it seems like globally suboptimal.

Vipul [00:20:47]: I think you need to have some kind of a marketplace for figuring out how to get this, you know, data into models, and I think we'll increasingly see more of that. You know, I think there's a positive aspect to it too. There is an incentive for creators to participate in a system which is sort of more fair, relative to, you know, the capture of value by an AI company that's taking their data. But I agree. I think this is a big open problem that needs to be solved. And I hope there will be, you know, serious efforts around it.

Alessio [00:21:19]: Let's talk about the most precious resource on planet Earth, GPUs. You have a lot of compute, obviously, but you also have a lot of product pieces. You have inference, you have fine-tuning, you have pre-training. What's the split in terms of usage?
Do you see most people just running inference on off-the-shelf models? Do you see maybe some last-mile fine-tuning?

Vipul [00:21:40]: I would say right now, the top five models on our inference stack are probably all fine-tuned versions of open models. And we've seen-

Swyx [00:21:51]: Who fine-tuned them? You fine-tuned them?

Vipul [00:21:52]: They were fine-tuned by our customers.

Swyx [00:21:54]: By your customers.

Vipul [00:21:55]: You know, either on our platform or off our platform. And we are generally seeing that, you know, that is the sort of trend, where you can get better quality on your task by now easily adapting these models to your data. We also have, I would say, over 20 big model builds happening on the platform, which are customer builds. We see a lot of training, and it's also, somewhat surprisingly, a more continuous kind of workload. We sort of imagined that this would be more episodic. You train a model and then you do inference. But what we find is, you know, they train a model, and then they train the next version, and then the next version, which sort of grows in scale. I would say training is still the bigger portion. In some ways, inference is super-linear to model quality. And as the models are getting better, there's more and more inference.

Swyx [00:22:48]: Oh, because they're more useful. Yeah, they're more useful, yeah. So, okay, so training is bigger. This is actually consistent with what we've heard from Mosaic, that, you know, people think that training is sort of like a one-time deal. You do one big run and then you're done. It's never true. And so I'm interested in, like, putting some numbers on this, and I don't know what you have disclosed or what you want to disclose, but, like, how many GPUs do you have? What is the equivalent amount of compute that you have? Because I understand that your GPU setup is different than what people typically think of, like, a giant data center somewhere, right?

Vipul [00:23:20]: I don't think we have shared this number publicly. It's, you know, so this will be the first time, I guess. Like, we have close to 7,000 to 8,000 GPUs today. It's growing monthly.

Swyx [00:23:31]: What class of GPU are they?

Vipul [00:23:32]: They're mostly A100s and H100s.

Swyx [00:23:35]: Okay.

Vipul [00:23:36]: And probably more, I think, split towards H100s now. You know, we'll be sort of building this best-of-class hardware. So as there are other versions of these coming out later this year, we plan to have those in the fleet as well.

Alessio [00:23:53]: I know when we talked last year, you were also using some of the supercomputers from the Department of Energy. There was kind of like a lot of random GPU compute in the world. Have you seen that kind of drying up? I think maybe a year ago, people were like, oh, yeah, you can use this GPU cluster that is going to be end-of-life. Has the bar changed to get access to those resources?

Ce [00:24:13]: From our perspective, it's actually getting better. Yeah, so from the community perspective, because many of the institutions in the world are actually investing in hardware, right? So for example, we are working with one of the institutes in Germany called Hessian AI, right, which gives us a lot of help on the compute side. So they have started to have this very big GPU cluster, and they're actually sharing that with the community, right? And it's not super big, right, but also not a small one, right? So you start to see these different clusters that start to pop up, right?
And because of the power of the community, they start to actually share that. So we actually find, as a researcher today, it's probably easier to actually get a GPU than last year.

Swyx [00:24:56]: Interesting.

Alessio [00:24:56]: And then for you to buy them, what's the state of the market right now? Is it still extremely hard to get any? Do you have Jensen's phone number? Do you have like the GM's phone number? Do you guys get like the SDR because you're like under 10,000?

Vipul [00:25:12]: NVIDIA is obviously motivated to help us, both as an investor and because we are their customers. I would say the market is very tight still, and it's likely going to be this way for a while. My sense is that the demand for AI computing has just ramped up very, very quickly, and it will take a while for supply to catch up.

Swyx [00:25:37]: So how tight is it, let's say compared to like a year ago, two years ago? What do you mean when you say tight? The things you want, you can't get?

Vipul [00:25:42]: You can't get them immediately. They're sort of, you know, minimally like two to three months out. Any inventory that shows up tends to clear very, very rapidly. And, you know, we obviously sort of look at this in a very detailed and analytical way. There are 4 to 5 million GPUs that will be sold this year by NVIDIA and others. And if you think about a 512-to-1,000-GPU cluster for a company, that's 4,000 to 8,000 companies, right? So it's in some ways a very small number. In other ways, the cost of GPUs will be, you know, $80 to $100 billion, and then you layer servers and data center space and electricity on top of that, and that's, you know, close to $250 billion worth of compute, which, when you compare it to the cloud computing of today, you know, AWS last year was $88 billion in revenue. So this is really kind of a build-out happening of AI hyperscalers. It is much more disaggregated, and it's very, very global. So, you know, we think that GPUs are going to be sort of a precious resource for a long time, and using them optimally is very valuable.

Swyx [00:27:02]: Yeah.

Alessio [00:27:02]: Our friend Dylan Patel from SemiAnalysis, he wrote a post about the inference market recently and obviously mentioned you guys. In his post, he said, our model indicates that Together is better off using two A100 80GB systems rather than an H100-based system. The temperature and performance testing also point to Together utilizing speculative decoding. Any thoughts? Is Dylan right? I don't know, what's-

Swyx [00:27:26]: What is his model, man? What does he know that they don't know? Yeah, exactly.

Alessio [00:27:30]: I wanna know, I guess, like, from the outside, and sometimes we even do it ourselves, we try and speculate on what people are actually doing. So for the first time, now we have a former guest writing about a current guest. So we wanna know what you guys thought, and maybe what are some of the misconceptions that people from the outside have on what it takes to run like a GPU cloud today?

Vipul [00:27:50]: Yeah, big fan of Dylan's, by the way. I religiously read SemiAnalysis. I think there were some errors in that analysis. In particular, we were trying to decode it, and one of the things we noticed is that it assumed that input tokens weren't being priced. So I think that may have been an error in the model. I also don't agree with the assumption that people are running this at a loss. I think it's very expensive. You can't do that for very long.
And there are trade-offs in terms of the batch sizes you use and the kind of tokens-per-second performance, which are kind of system trade-offs. We've done a lot of work. This is one of the key areas of research for us. So our inference stack is a combination of 50 different sort of tricks and techniques, and we think there's a lot of room for optimization here. So whichever hardware provides better performance, whether it's H100s or A100s or L40s, we can sort of measure price performance on particular hardware and we tend to use that for that model, or in some cases, certain customers have data streams which can then be optimized for a particular configuration regime. So we do fairly detailed work on how to make this more efficient, and so it's hard, from the outside, looking at memory bandwidth, to estimate what's actually happening.

Alessio [00:29:26]: How many of these 50 tricks are you keeping to yourself and how many are you gonna open up? Because we have Tri now; obviously Flash Attention 2 is open source. He mentioned he'd love to come work at Together because of how much you care about open source. Yeah, how do you weigh that as a CEO and CTO?

Vipul [00:29:43]: A lot of it is open, right? FlashAttention, Flash-Decoding, et cetera. And when we build something that's very generally, universally useful and is going to produce better open source AI, we tend to publish it as open source. I think on the inference stack, there are open source inference stacks which are pretty good, and definitely, today, it gives us a competitive advantage to have the best one. So we are not sort of rushing out to release everything about it. It's not overall that additive to open source out there, and it is particularly useful as a business for us to provide the best price performance. Yeah, we make these decisions. We have discussions. Anything that we keep closed, we generally talk about quite a bit and decide, like, this is the piece that is closed for today, and it may not be the case six months from now. It may not matter as much.

Ce [00:30:40]: Yeah, so I think being open is kind of very important, right? So I think the whole company is actually built on this idea that there's going to be an ecosystem built on our open models, right? And that's also how we are really lucky to attract this top group of talent to actually join us, because of the dream and the mission that we have on our side to really facilitate the open ecosystem, right? So I think in general, I think all the ideas should be open. So that's why we publish papers, right? We actually talk about ideas, right? So I don't think it makes any sense to keep ideas closed, right? But there are some software artifacts that are really deeply embedded into our own stack. They're kind of only useful when you're trying to build a disaggregated cloud, right? Maybe at some point they're going to be opened up, as we said, right? But at this moment, right, we are kind of busy actually building it, right? So that's probably kind of the picture of when that piece is going to be open, right? But I think on the research side, the ideas, and for our people to publish things, I think that's really, really important, right? So I think that's how we get talent. That's how I think we as a company are going to move the field forward.

Swyx [00:31:49]: I noticed that you never use the words federated learning or inference.
Is there a distinction that you draw?

Ce [00:31:55]: So, I mean, it's definitely not intentional, but I think federated learning has been used in so many different ways by so many different people that it starts to lose a very precise meaning about what it really means, right? If you go back to the original Google paper on federated learning, I think that's very different from what people are talking about today when they say federated. Yeah, we kind of want to be really precise about it.

Swyx [00:32:18]: And so your term is disaggregated.

Ce [00:32:19]: Yeah, so as infrastructure, right? So that's disaggregated.

Swyx [00:32:22]: Aren't most clouds disaggregated? Like, what's different about it?

Ce [00:32:27]: So one way to see it is that most of the clouds are disaggregated, but some of that is actually being exposed to the user, right? If you go to AWS, you do know which region you are in, right? So I think one thing that we are trying to do is, you have this disaggregated cloud, not only in terms of location, or geographically where they are, but in terms of the reliability and also the diversity of this infrastructure. And we want to build a reliable, high-quality layer over that, where the user actually doesn't know, right, what's actually happening under the covers, right? So I think that's one of the differences in the way that we are thinking about infrastructure.

Swyx [00:33:06]: Yeah, a bit closer to Cloudflare than AWS. Yeah. Yeah. We have one question here, which we'll just throw out, it's kind of fun. So going back to this sort of inference stack piece, maybe if you had to pull out like a call for researchers, or just like point out interesting areas of work that you're interested in, what pieces of the stack have the most opportunity for improvement?

Ce [00:33:27]: Yeah, so I think the way we are thinking about the inference stack is, there are multiple things that can happen, right? So you can do better algorithms, like speculative decoding. You can change the model architecture. You can go really crazy on the system side, right? And you can also co-design it with the hardware, right? So it's not really clear that innovation on a single dimension will get you there. So the key thesis on our side is, if you only push on one direction, you are going to reach diminishing returns really, really quickly. Yeah, there's only so much you can do on the system side, only so much you can do on the algorithm side. I think the only big thing that's going to happen is when you get all those dimensions to actually compound, right? So to have algorithm, model, and system all come together, I think that's how we reach the next 10x improvement on inference, right? So I don't think there's a single dimension that is particularly important; rather, looking at this space in a joint way, right, trying to co-optimize multiple dimensions jointly, I think that's going to be really important for the community to look at.

Vipul [00:34:28]: Yeah, we often see, I see numbers from the team, and you have these multiple methods; not all of them compound. So you mix some together and it's still similar results, and then some combination of them will have this incredible effect that is really, really super interesting. So it's very systems, you know, a kind of broad systems approach to it that's the most effective.

Swyx [00:34:51]: I think I finally get the name of the company, like- Bring it together, yeah.
Everything needs to be optimized together.

Alessio [00:34:57]: All right, just quickly, how does all this work change as some of the architectures change? I know with mixture of experts, like, speculative decoding is a little less efficient because of memory bandwidth. How much do you invest when it's maybe a model-specific improvement versus a more horizontal thing? Also, you're researching different architectures, so how much do you want to spend time optimizing what's state of the art today versus what's coming next?

Vipul [00:35:24]: We do spend time on what's state of the art today as well as what's next. You know, the value we get from doing specific optimization, even for, you know, what works well for a particular model on A100s with a particular bus versus H100s, it's a worthwhile investment for us. So we will go down fairly deep into a specific architecture and specific hardware. It does also inform what works better where, and you don't have to take the same approach for, you know, every model and every sort of hardware setup. We can take these different approaches, and we do have these multiple systems now. We know that, you know, system B is better for Mixtral and system C is going to be better for StripedHyena or Mamba.
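Since speculative decoding keeps coming up (Dylan inferred it from the temperature readings, and Ce names it as one of the algorithm-side levers), here is a toy, greedy rendition of the idea. The "models" are stand-in Python functions; a real implementation verifies the whole draft with one batched forward pass of the big model and uses proper rejection sampling over logits, which is where both the speedup and the exactness guarantees come from.

```python
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[str]], str],  # expensive model, one token
    draft_next: Callable[[List[str]], str],   # cheap model, one token
    prompt: List[str],
    k: int = 4,
    max_new: int = 12,
) -> List[str]:
    out = list(prompt)
    while len(out) < len(prompt) + max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx, draft = list(out), []
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. The target model checks the proposals; keep the agreeing prefix.
        #    (A real system scores all k positions in one batched pass.)
        ctx, accepted = list(out), 0
        for tok in draft:
            if target_next(ctx) != tok:
                break
            accepted += 1
            ctx.append(tok)
        out.extend(draft[:accepted])
        # 3. On a mismatch, fall back to one token from the target model,
        #    which guarantees progress every iteration.
        if accepted < k:
            out.append(target_next(out))
    return out

# Stand-ins: the target cycles a fixed sentence; the draft usually agrees.
SENT = "the quick brown fox jumps over the lazy dog".split()
target = lambda ctx: SENT[len(ctx) % len(SENT)]
draft = lambda ctx: SENT[len(ctx) % len(SENT)] if len(ctx) % 5 else "uh"
print(" ".join(speculative_decode(target, draft, ["<s>"])))
```

The output is identical to what greedy decoding with the target model alone would produce; the win is that the expensive model is consulted in verification batches rather than strictly token by token.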
Alessio [00:36:13]: Before we move on from inference, we need to talk about the Anyscale drama. So we're actually having Soumith on the podcast tomorrow, who also kind of came to you guys' support about how it's not just like, oh, Together is saying this benchmark's not good because they look bad in it. I guess, like, it's a hard question to ask, but like, why did you decide to just come out and say it? And how maybe does that also reflect the values that you guys have about open source and openness, and kind of like being transparent about what's real, and maybe hopes for standardizing some of these benchmarks to make it more clear?

Ce [00:36:56]: So it's a great service Anyscale is doing for the community, right? I mean, it's very hard to do benchmarks. The moment you do a benchmark comparing N players, right, N minus one will be unhappy. If you have two tables, then maybe N of them will be unhappy, right? So it's a very great thing that they're doing. And in some of the work that we are doing, we actually use LLMPerf, right? So it's a great thing that they're actually doing. So I think one thing about benchmarks is, and probably the professor part of me is talking, a good benchmark should think about how it's going to incentivize the field to actually move forward, right? So if the benchmark really becomes a kind of standard, people are going to over-optimize to the benchmark. And when people are doing that, what are we actually trying to incentivize, right? Will that move the world to a better place? Or will that essentially have every single player focus on marketing, or spending time or money on something that does not actually matter on the technical side, right? It's very hard to actually strike a balance, right? So I think the reason we tried to give feedback on the benchmark is that we kind of want to open up the discussion about how the industry should come together and define maybe a common way that we compare with each other, right? Like how database people do TPC, right? Maybe we should have something similar. So we are trying to start some of that conversation. It's not really that we jumped out to say it's not good; there's no way we can have a perfect benchmark, that doesn't really exist, right? We're just trying to kickstart a conversation, that maybe we should come together and do something that the community agrees on and that aligns with the benefit a user is going to get, right? So, just get the conversation started.

Vipul [00:38:42]: I spoke to the Anyscale team after that, and I think they had really great intentions. And partly, I think they felt it was very objective, but everyone sort of had a reaction to it because it just didn't match the benchmarks that we've all run internally against different services. I think we need a common industry benchmark, run by an independent party, versus one of the vendors.

Swyx [00:39:04]: Is there one that you'd point to?

Vipul [00:39:06]: I don't think one exists today. I think there should be. We're having some conversations about someone setting one up. And there are lots of interesting aspects of this. Time to first token is a function of where the test was run from. There is different load on these services at different times of the day, and weekday or weekend. So you have to measure that well. And I think if all of that were done very well by an independent source, that would be a very useful service to customers and to the services themselves.

Swyx [00:39:39]: Yeah, I'll point people to artificialanalysis.ai, which is a new one that recently emerged. I don't know if they've done it right. It looks like a side project of a couple of people. But I think it's in all the providers' interest to work with them, and ensure that there's an independent third party that's measuring these things, right? At least on the baseline. For me, what's worrying is more about what Ce was saying, which is, do these benchmarks skew things in ways that customers might not be mindful of? Like, what are these things overemphasizing that we might be missing? And I don't really know. It seems like a lot of these services bundle in a version of quantization as well. So that means there are performance trade-offs, right? You're not comparing apples to apples, the same model itself, even though it's like a Llama variant or whatever. So what do people trade off? They trade off latency, they trade off price. Obviously, those are the first two. But what else, right? What factors matter in an inference business?

Ce [00:40:33]: Yeah, so I think there's also the throughput, right? So there's the time to first token, right? And then there are things that users do not often see, for example, the reliability, right? The capacity, right? So those also have an impact on user experience at a global scale. Maybe not on a single query, right, but in aggregate you can also see a whole bunch of things, like whether you are emphasizing P50 or P95, right? So there's a whole bunch of things that you can actually play with. And of course, there's also quality. So there are different ways to actually make the whole thing faster, speculation, quantization, or a combination of those, right? So yeah, there are so many things to actually play with. So they probably need a benchmark where the protocol is transparent, to make sure it's very clear what we are doing, and a whole bunch of checks on the quality, to make sure we are putting the right group of models in the same table. So that essentially the user can actually navigate the space. So I think that's going to be good for everyone.
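For readers who want to kick the tires on these metrics themselves, a sketch of the measurement loop is below. It assumes an OpenAI-compatible streaming endpoint; the URL, model name, and key are placeholders rather than any specific vendor's values, and counting one streamed chunk as one token is only an approximation.

```python
import json, statistics, time
import requests

URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint
HEADERS = {"Authorization": "Bearer YOUR_KEY"}       # placeholder key

def one_run(prompt: str):
    """Return (time-to-first-token, decode tokens/sec) for one request."""
    t0 = time.perf_counter()
    ttft, n_tok = None, 0
    body = {
        "model": "some-model",  # placeholder model name
        "stream": True,
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }
    with requests.post(URL, headers=HEADERS, json=body, stream=True) as r:
        for line in r.iter_lines():
            # Server-sent events: "data: {json}" lines, ending in "[DONE]".
            if not line.startswith(b"data: ") or line.endswith(b"[DONE]"):
                continue
            chunk = json.loads(line[len(b"data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                n_tok += 1
                if ttft is None:
                    ttft = time.perf_counter() - t0
    total = time.perf_counter() - t0
    return ttft, n_tok / (total - ttft) if ttft and total > ttft else 0.0

runs = [one_run("Explain KV caches briefly.") for _ in range(20)]
ttfts = sorted(r[0] for r in runs)
print("TTFT p50 :", statistics.median(ttfts))
print("TTFT p95 :", ttfts[int(0.95 * (len(ttfts) - 1))])
print("tok/s p50:", statistics.median(r[1] for r in runs))
```

Run it from a few regions and at a few times of day, as Vipul notes, and the spread between providers, and between P50 and P95, becomes the interesting number rather than any single reading.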
Swyx [00:41:27]: Yeah, makes sense. It's a very important field, and I think hopefully there's a good third party that emerges from this. So I just want to touch on one more piece, which is, I think I'm appreciating from this discussion that fine-tuning is a bigger part of your business than I thought. The other big player in fine-tuning is Mosaic. Well, Mosaic is more training, but like there's a bunch of other players in the fine-tuning space. If I was a prospective fine-tuning customer, what do I come to you with? Do I come to you with my custom data and that's it? Do I also have to write the fine-tuning code? What level of engagement do you do with your customers?

Vipul [00:42:01]: I think across the spectrum. Our customers are training models, pre-training models from scratch, and many of them will bring their datasets and, you know, use our infrastructure and training stack to train their models. There are others who have trained smaller models and want to scale up, scale up across infrastructure, scale up across data. So we'll sort of help them do that. There are customers where it sort of initially starts a little bit more consultative. They have a particular task and idea in mind, and we will help them get from there to the dataset and the right model to achieve that task. So it's a spectrum, and, you know, our goal is, we're trying to productize as much of this as possible, so that the whole process can be fast and scalable. I would say there is a lot more understanding around fine-tuning now, like even in the last six months. There are, you know, open source tools, recipes, literature, podcasts, Discord channels where people are figuring it out. And it really is, in many ways, one of the successes of open source: you have small collectives of, you know, engineers who are now creating the top models on open source leaderboards, and who have tried out all sorts of different, you know, data recipes, creating synthetic data. Merging models. Merging models. So that's really fun to see. And I think that sort of agency that exists now is exciting. And we see a lot of that sort of being applied into products and, you know, more commercial models that people are deploying in their applications.

Alessio [00:43:50]: And then just to, I guess, wrap up on Together, it's almost becoming like a platform as a service, because now you released Together Embeddings. How did you get 92.5 accuracy on 32K retrieval? And do you think with embeddings we've kind of done everything we could, you know, they're getting to like the most optimized they're gonna get, and we should just focus on models and inference, or do you think there's still room there to improve?

Ce [00:44:17]: Oh, I think we haven't even gotten started on embeddings. Yeah. So I think there are so many things. So, like, embeddings are really fundamental for many things, for example, RAG, right, and those kinds of applications. So that's how people bring knowledge in. It's also the fundamental piece when you want to build a better model, right? It gives you this understanding about what actually gets into the model. You can actually use that to build a better dataset, get a better model, then get better embeddings; you start this loop, right? Without good embeddings, the loop is not closed, right? So I think it's both on the quality side, how to embed more dedicated semantics into those vectors, how to deal with negation, for example, right? And also, how can you make the whole thing really, really fast? So I think for the next couple of years, yeah, we will see a whole bunch of new embeddings, maybe of different sizes and much, much faster than today. Yeah, so I think it's a very active research area. I think people should invest more in it, yeah.
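The loop Ce is describing, embed your knowledge, retrieve it into the prompt, and use what you learn to build better data and better embeddings, bottoms out in a few lines of retrieval math. A minimal sketch is below; the `embed` function is a deliberately dumb stand-in so the example runs anywhere, and in practice you would swap in a real embedding model or API.

```python
import re
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in embedding: hashed bag-of-words, unit-normalized.
    Swap in a real embedding model or endpoint here."""
    vecs = np.zeros((len(texts), 4096))
    for i, t in enumerate(texts):
        for w in re.findall(r"[a-z0-9]+", t.lower()):
            vecs[i, hash(w) % 4096] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9)

docs = [
    "StripedHyena mixes state space layers with attention layers.",
    "RedPajama v2 ships 30 trillion tokens with quality signals.",
    "Together's serverless platform bills per token, not per instance.",
]
index = embed(docs)  # (n_docs, dim); with unit norms, dot product = cosine

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = index @ embed([query])[0]
    return [docs[i] for i in np.argsort(-scores)[:k]]

# The retrieved document gets stuffed into the prompt: that's RAG.
print(retrieve("how many tokens are in redpajama v2?"))
```

Everything Ce lists, negation handling, dedicated semantics, speed, lives inside that `embed` function; the surrounding retrieval loop barely changes as embeddings improve, which is why better embeddings lift every RAG application at once.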
Swyx [00:45:14]: I was surprised to see, I think there's Jina AI, and then there's another one, Tengyu's Voyage. They are coming out as startups purely focused on embeddings.

Ce [00:45:25]: Yeah. Yeah, so I think it's a very, very important piece of the system, right? People haven't focused a lot on it before, and they should definitely start to do that.

Swyx [00:45:36]: Yeah. Why are the Chinese universities so good at embeddings? You know what I mean, right? Like, BGE and- Yeah, yeah, yeah.

Ce [00:45:44]: So I don't know. We just released our first embedding model, so we are still trying to learn how to build an embedding model. Yeah, so ask me again in six months. I'll probably have more insight about how to build a better one.

Swyx [00:45:53]: I just noticed that ada-002 used to be at the top of the MTEB leaderboard, and then it's just like sliding down and down and down, and all the new models are coming out of China for some reason. And I'm like, I don't know what's going on there. So, we cannot leave this discussion without talking about state space models. But first of all, how much of the company is dedicated to research? Like, it's obviously not production quality yet, but-

Vipul [00:46:17]: I would say it's like 40, 45%. I was counting this morning. That's huge.

Swyx [00:46:22]: Yeah, so that's the biggest- It's a big investment. Yeah. Okay, well, I mean, it looks like it's paying off, so. And then, high level, I will confess, or admit, or mention, for the listeners who are also similarly skeptical: I didn't used to care about long context, because I was like, you know, 30K is enough, 100K is enough, right? I'm not, you know, modeling DNA sequences or anything like that. Why do I need long context? And I mean, first of all, I'll throw that open to you. But second of all, I think what Mamba did for me was change my perception of it. I used to think it's only about long context, that the only reason you want sub-quadratic architectures is for long context. Actually, that's not true. It's also just more efficient to train, period. Right? I'll just leave that open to you. Like, what's the motivation that people should keep in their heads? There are multiple things, right?

Ce [00:47:09]: So one thing is that, I mean, the moment a model can do long context well, it often means that it's kind of cheaper. Yeah, so, I mean, in principle, transformers can do long context. It's just very expensive. So I think what those state space models are trying to do is push the size of the state, right, to be as small as possible. That's why they can do long context, right? And to decouple this quadratic dependency, right, to make sure you can have a much better execution pattern. One direct consequence of that is you can do long context really cheaply, but on the other hand, it also introduces a whole bunch of benefits even when you are not doing long context. Right? So I think that's actually probably equally important. Because the state gets smaller, you can do really large batch sizes, right? You can actually go much faster. Right? So yeah. And another thing is, one of the hypotheses that we have is, like in StripedHyena, it starts to have a hybrid architecture, right?
Part of it is like a state space model and part of it is still the transformer. So different components probably deal with different things better. So maybe by putting them together, by thinking about how information propagates over the whole horizon of the context, you can probably get an even better quality model than a transformer. Right? So I think that's why we are investing a lot in those models. Not only for the context, which is very important, but also for the whole bunch of benefits they could bring.

Swyx [00:48:42]: Yeah. How should people treat the distinction between Mamba and StripedHyena? Like, what's the point of releasing these two as separate models? Is one like sort of the Together proprietary one, and then the other is like the more open research one?

Ce [00:48:53]: Yeah. So I think they're pretty much at different stages of exploration. So they kind of had different hypotheses when we built them. Yeah. Like, for instance, there are different views about state space models. One is Hyena, another is Mamba, right? They're actually different architectures. So when we built StripedHyena, right, the curiosity that we had was, what is the highest quality non-transformer model we can ever build? The goal of StripedHyena was to try to see whether we can match Mistral, and, by fine-tuning well, whether we can outperform it in some way, right? So it had a very, very strong baseline that we were trying to beat. So that's why the hybrid came into the picture, right? And for Mamba, it's kind of more... The curiosity was, how far can we push a pure architecture? So we went about it very systematically, from small to large, right? All the way to 3 billion, right? So the baseline was essentially the best 3-billion model. So I guess they're at different stages of exploration. At some point, I think they are going to converge. We actually learn different things when building different models. I think they are just intermediate stages in the exploration at different points.

Alessio [00:50:02]: You mentioned the hybrid architecture. Is that the model grafting that you mentioned in the StripedHyena post, where you mentioned you can have transformers and non-transformers together? Like, this is a concept that I hadn't heard before reading about this. So I think most people's mental model is like transformers OR something else; it's not transformers AND something else. How do you train a model that is hybrid? Is there any difference in like how you construct your datasets? Is there any difference in then how you run inference on it? How should people think about starting research in this field?

Ce [00:50:36]: Yeah, so we were also very surprised, yeah, when we came up with this hybrid architecture. So the way to think about it is, you have different layers in the neural network, right? So, like, the state space model for some layers will already give you the benefit. The other layers could be transformers, right? They could give you this more global view of the sequence, but for other layers, you don't have to have that, right? You still get all the other things that kick in, right? So we don't know what the optimal mixture between different architectures is. I mean, in principle, we could have Mamba, Hyena, and transformers, all those things coming together, right? And then you can see what makes sense. We have no idea what is optimal doing that.
So what we are excited about is that now the community has a whole bunch of building blocks that they can actually play with, like Lego, right? So just put them together and see what happens, right? So we are kind of very excited about that. Yeah, we are in the process of trying to learn more about this architecture. And when we know what we are talking about, we will definitely share with the community how to do that in a systematic way.

Swyx [00:51:41]: Cool. What are we still unsure about? Like, why don't we just, you know, put all the money in the world into training these things now? Like, what is left to figure out before we scale this thing?

Ce [00:51:53]: So, like, if you look at how the transformer has been developed, right, in the last like five to 10 years, right, people didn't start from, you have this Attention Is All You Need paper, and then let's put all the money in, right? It always starts from this very systematic understanding about the scaling, about data quality, about essentially the limits, right? I think for state space models, to go from the labs to the real world, you kind of need to go through the same process. But of course, the second time doing it is kind of easier, right? But I think there's no way we can get rid of this systematic step of studying the scaling laws, studying what data to put in, right? What's the impact of different data slices, yeah, on the final model quality?

Swyx [00:52:33]: Do you expect that the data inputs will be different?

Ce [00:52:37]: I don't know, but I wouldn't take it for granted that they should be the same, right? So that's one of the hypotheses. We have no opinion on that, because I think that's the result of the study, not the assumption. Yeah, we do not need to assume that.

Swyx [00:52:51]: Okay, scaling laws and data. Anything else, like architectural, that we are not sure about? Because now you have this selection mechanism that you're pretty happy with.

Ce [00:52:59]: Yeah, so, I mean, first of all, it's how to mix them, right? And second is, what is the architecture? So if you look at the transformer, right, one very interesting piece there is that people also optimize the hardware, yeah, to make sure that things run very fast, right? There are very efficient kernels, very efficient hardware. And that adds another boost, right, for the transformer architecture. So that's something that should happen for state space models too. Which architecture is easier to run on the hardware, right? So things go faster, you can put in more data, and it adds another dimension to the scaling law. So I think we just need to plow through the whole space and just be really systematic, from small models to 1 billion, 3 billion, 7 billion, just go all the way up, right? So I wouldn't jump around in the space. I would just be patient and be systematic. Yeah, I think we'll get there, yeah.
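The "be systematic from small to large" discipline Ce describes usually concretizes as fitting a scaling law to the small runs and checking that the next size up lands on the curve. A sketch of that step is below; the loss numbers are invented for illustration, not from any real runs.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented (params-in-billions, eval-loss) points from hypothetical small runs.
n_params = np.array([0.125, 0.35, 1.0, 3.0])
losses = np.array([4.36, 3.85, 3.50, 3.25])

def power_law(n, a, b, c):
    # The classic loss-vs-parameters form: an irreducible term c plus a
    # power-law term that shrinks as the model grows.
    return a * n ** (-b) + c

(a, b, c), _ = curve_fit(power_law, n_params, losses, p0=(1.0, 0.3, 2.5))
print(f"fit: loss ~ {a:.2f} * N^(-{b:.2f}) + {c:.2f}")
for n in (7.0, 70.0):
    print(f"predicted loss at {n:>4.0f}B params: {power_law(n, a, b, c):.2f}")
```

If the 7B run comes in far off the fitted curve, that is the signal that something (data, architecture, kernels) changed along the way, which is exactly the kind of systematic check Ce argues state space models still have to earn.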
Swyx [00:53:52]: Yeah, well, I'm looking forward to more research from you guys to figure that out. So, one dimension we didn't talk about: we talked about long context, we talked about efficiency, but speed is also very important. A good inference provider provides, let's say, 70 tokens per second, and maybe that's faster than less good inference providers that are more like 30 tokens per second. But that's the rough range, right? State of the art today. That's around human speaking speed; human reading speed is about 200 words per minute. Why do we need 5,000 tokens per second is my question back to Vipul. And maybe is this something that is an emphasis for research as well, or is this more just an inference-only thing?

Vipul [00:54:29]: There are applications that are consuming the tokens that are produced from a model, so they're not necessarily being read or heard by humans. That's a place where we see that level of requirement today that really nobody can quite satisfy. There is, I think, as intelligence grows, this question of how do you sort of increase the bandwidth of it, you know, how do you reduce the latency of it? If we can do 5,000 tokens a second on the same card, the throughput of that card goes up significantly, and it can support more applications. So I think it's important from that perspective. And then it opens up new UX possibilities. Once you can get sort of an immediate answer from a model, it starts working in a different way, and, you know, new types of applications will be created. We rarely run into users, except for perhaps those feeding this into a text-to-speech model, where, you know, they say that, okay, slower is better, or, like, we don't need more performance. I think this may all just be fundamentally very, very slow today in general, and we're just sort of used to that speed. And that will change once, you know, these models can get faster.

Swyx [00:55:47]: Yeah, 5,000 tokens per second, I can't even imagine. Like, well, it makes me worried a bit that the machines will be communicating at a much higher bandwidth than us, but yeah.

Vipul [00:56:00]: They do that already. Not in natural language.

Alessio [00:56:02]: Awesome. Anything we missed about Together as a product? We're gonna talk about the hackathon you just did and whatnot, but any last product thoughts?

Vipul [00:56:11]: I think one of the big sort of focuses of our product is to become more and more serverless, like have AI development run in a serverless manner. And we are there now on inference, also on fine-tuning. You know, we are pushing to do that on training. If there was a sort of, you know, developer experience message, that's probably the big one: you have enough flexibility. You don't have to sort of commit to thousands of dollars of compute before you can start using open models. We really want to change that and make it as easy as possible to get started.

Swyx [00:56:52]: Yeah. When I first signed up for Together, I had, like, left an instance running and I just, like, ran out of my credits immediately. Yeah.

Vipul [00:57:04]: So, you know, we changed that whole model now, so you never run into that issue. And, you know, I think the response to that has been amazing. We also provide, you know, $25 of free credits, which is a large number of tokens depending on the model you're using. And you really can build an app. You know, you can do a fine-tune and run that model and build an app on Together for free, basically. And we'll be pushing further in that direction.

Alessio [00:57:29]: You just did a hackathon at AGI House about fine-tuning versus RAG for open source. Any learnings, recaps from it?

Ce [00:57:38]: Yeah. So I think one thing that we kind of learned is, so I think the hackathon was phrased as, like, something versus something, right?
But I think the combination of those works really well.

Swyx [00:57:48]: Right?

Ce [00:57:48]: So I think, like, combining all those techniques together, right, will give you essentially another boost, right? So that's kind of one thing that we learned on the technical side. And also, we are very kind of excited about the excitement of the audience, right? So I think people are really using the platform and building something really cool. Yeah.

Vipul [00:58:08]: It's always surprising to us what people build. Yeah.

Alessio [00:58:11]: Is there something you're focused on this year? Hiring, building out the engineering team? What should people who want to work at Together know?

Vipul [00:58:17]: You know, all of those things. I think hiring is a pretty big topic. We are 38 people on the team and we are hiring across all areas. You know, like CUDA and kernel hackers. We have lots of exciting projects. If you're a researcher, you like to build models, we have exciting projects. If you work on systems and infrastructure and the cloud layer, you know, we do a lot of work there. As well as sort of front-end and developer experience and applications. So really kind of across the board; we have, I think, 20-plus postings on our job openings on our site. And folks who are passionate about open source and AI. You know, people looking at Together don't necessarily, for all the postings, have to have, you know, professional experience working in machine learning or AI. Many of the systems people are sort of doing this for the first time, and they can apply their, you know, systems expertise to the kind of things that we are doing. And we can teach people AI, as long as they have expertise in other areas.

Swyx [00:59:20]: Will you call out what kind of expertise you're looking for? Like, we definitely have systems people listening, so.

Ce [00:59:26]: Oh, I mean, the whole stack. Right, so like all the way from the-

Swyx [00:59:29]: Kubernetes, I don't know. Kubernetes, yes. CUDA. What else, CUDA?

Ce [00:59:34]: And DevOps, right? So that's a big thing.

Swyx [00:59:37]: Is that like what, Terraform, like Pulumi? Right, yeah, yeah.

Ce [00:59:41]: And all the way to machine learning systems, right? If you like to hack on, like, vLLM, TGI, right? That's great. If you want to play with different fine-tunes, like building models, like developing algorithms, right? Essentially the whole stack, all the way from application to-

Swyx [00:59:58]: That's very broad. To system.

Ce [01:00:00]: So the fun thing about the company is, like, we have this very diverse collection of expertise and talents in the company, and the goal is really to try to innovate at every single layer, and then have them all compound together, and yeah.

Swyx [01:00:13]: Yeah, doing everything together, that's why the company is named this way. Like, no, seriously, I didn't really get the company naming until now. Like, yeah, makes sense.

Alessio [01:00:23]: Awesome, guys. I know we kind of binned the lightning round in the last few episodes, but I think for you two, one of the questions we used to ask is, like, what's the most interesting unsolved question in AI? So maybe another way to think about it is, if you weren't building Together, what would you be working on?

Ce [01:00:39]: Yeah, so if not building Together, I would be a professor. I mean, then we'd do a whole bunch of things without justifying them as being useful. We used to work on quantum machine learning for a while. So I think IoT is going to become very interesting.
Yeah, so I know people have been saying that for the last couple of decades, right? But I'm very excited about how technology, like Starlink, right, changes the communication between different edge devices and all those machines, and the new batteries coming out, right? So I think that could be very cool. So if not building Together, probably, yeah, I'd spend some time thinking about how to compress communication even more, given all the satellite communication stuff, yeah.Vipul [01:01:21]: On the first question, of the most important open questions: the one thing I think about is that we sort of need a framework for thinking about, you know, what the world looks like with advanced intelligence systems in it. I think we have had this very, you know, sort of doomerism view of it, really kind of informed by science fiction, you know, dystopian science fiction and Terminator. And I don't think we have a kind of positive, or really realistic, framework coming from, you know, experts in the field. I think that's a pretty important question because that really gives us a roadmap of where this industry should go. And, you know, I'm hoping that some of the, you know, industry drama this last year maybe is sort of pointing us in that direction, and solving that is sort of, I think, important in a meta way. As for what I'd be working on: I think I'm already doing the perfect thing. This is, you know, really my dream job. And every day, this is kind of what I want to do, and I expect that's going to be the case for a very long time.Alessio [01:02:33]: Awesome, thank you guys for coming on, this was a lot of fun.Swyx [01:02:36]: Yeah, thank you. Thank you so much. Get full access to Latent.Space at www.latent.space/subscribe
-
Why StackOverflow usage is down 50% — with David Hsu of Retool
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-02-01 18:15
We are announcing the second edition of our Latent Space demo day event in SF on 2/23: Final Frontiers, a startup and research competition in “The Autonomous Workforce”, “Beyond Transformers & GPUs”, and “Embodied AI”. RSVP here! The first one was aimed at 15-20 people and ended up blowing up to >200 and being covered in The Information - let’s see what a year of growth (and competition) does to the local events space in 2024.You can find all Latent Space events here, and of course get in touch with us to host your own AI Engineer meetups like AI Engineering Singapore.In our December 2023 recap we covered the Four Wars of the AI stack. But how do we know when it’s time to crown a winner? As we kick off 2024, we wanted to do a recap of the State of AI in 2023 to set a baseline of adoption for different products. Retool had a great report at the end of last year which covered a lot of it. David Hsu, CEO and co-founder of Retool, joined us to go over it together. We also talked about the history of Retool, why they were too embarrassed to present at YC demo day, and how they got to $1M ARR with 3 employees. If you’re a founder, there are a lot of nuggets of advice in here!Retool AIIn our modeling of the “Software 3.0 Stack”, we have generally left a pretty wide open gap as to the “user interface” equivalent of the AI stack:Retool AI launched 4 months ago with some nifty features for SQL generation, and its own hosted vector storage service (using pgvector). However, as David explains on the pod, the more interesting potential of Retool is in helping developers build AI-infused applications quickly, in combination with its Workflows feature. This moves Retool down the stack from just the UI for internal tooling to the business logic “piping” as well. There are a bunch of dedicated tools in this space like Respell, BuildShip, Flowise, and Ironclad Rivet."We think that practically every internal app is going to be AI infused over the next three years." - David on the podRIP StackOverflow?In July 2023 we talked about the impact of ChatGPT and Copilot:This was then disputed by StackOverflow, who pointed out (very fairly so) that there were privacy-related changes in their analytics instrumentation in 2022. StackOverflow no longer reports traffic, but based on StackOverflow’s continuing transparency we can see that organic declines have continued throughout 2023.Retool’s report comes over a year after those changes and has some self-reported samples from users:* 57.6% of people said they have used StackOverflow less; almost all of them replaced it with ChatGPT and Copilot.* 10.2% said they no longer use StackOverflow.We also saw a lot more tools being released in the dev tools space such as (one of our oldest pod friends) Codeium (which just raised a $65M Series B), SourceGraph (and their newly released Cody), Codium AI (just released AlphaCodium which was picked up by Karpathy), Phind (which beat GPT-4 with OSS models), and Cursor, one of the most beloved products in the dev community at the moment. Intelligence is getting closer and closer to the IDE, and the trend doesn’t seem to be reverting. We already said that “You are not too old (to pivot into AI)“, and the advice still stands. When asked to rate “Preference for hiring engineers effective at using ChatGPT/Copilot for coding” on a scale of 1 to 10, where 10 is “Much more likely”, ~40% of companies voted 8-10. Having an AI Engineer skillset is extremely important.
45% of companies with 1,000-4,999 employees said that they increased the difficulty of technical interviews to compensate for these new tools, so the gap between users and non-users will keep widening.Crossing the AI in Production ChasmGeoffrey Moore’s “Crossing the Chasm” is one of the most quoted business frameworks. Every market has an initial group of Innovators and Early Adopters, who are willing to suffer through the rough edges of products initially, before eventually crossing into the Early Majority, which expects a full product.In the AI world, ChatGPT and Midjourney / DALL-E have crossed the chasm in the consumer space. Copilot is probably the only tool that did it in the enterprise, having crossed 1M paid users. ~$50B was invested in AI in 2023, and we still only have <5 breakout products; expect this number to rise in 2024. According to the survey, only 25% of companies had real production usage, but 77.1% said their company is making efforts to adopt more. Closing that gap could triple AI adoption in one year.The report also broke down adoption by use case. 66% of companies use it internally, while only 43% do so in customer-facing use cases. Internal usage of AI is much more varied than customer-facing usage as well:One point that David made in the podcast is that this number isn’t a knock on AI as a tool, but rather about the demographics of businesses outside of our Silicon Valley bubble:We all work in Silicon Valley, right? We all work at businesses, basically, that sell software as a business. And that's why all the software engineers that we hire basically work on external facing software, which makes sense with most software companies. But if you look at most companies in the world, most companies in the world are actually not software companies. […] Most of the [work of] software engineers in the world actually goes towards these internal facing applications.Beyond code models, it’s clear that the big winners of the first wave of AI adoption are vector stores and RAG. Knowledge base Q&A, customer chatbots, recommendation systems, etc are all based on them. Retool even rolled out their own with Retool Vectors. Expect the battlefield to get even hotter in these areas, with Mongo and Chroma leading the charge on an NPS/popularity basis. It’s also clear that OpenAI won the first campaign in the AI models war, by far. Hopefully Mistral and LLaMA3 will shake up this chart when we look back at it in 2025:TLDR: We’re really early. If you want to build in AI, there’s a ton of work to be done, and a lot of problems to be solved. You can find the full report here to dive through all the numbers.
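Since pgvector comes up repeatedly here (Retool Vectors is built on it, and David discusses the choice later in the episode), a minimal sketch of what a pgvector-backed store and similarity lookup looks like may help. Everything below is illustrative, assuming a Postgres instance with the pgvector extension available; the table, schema, and connection string are hypothetical, not Retool's actual implementation:

```python
# Minimal pgvector sketch: store embeddings in Postgres and run a
# nearest-neighbor query. Assumes the pgvector extension is installed;
# all names here are hypothetical, not Retool's actual schema.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection string
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id serial PRIMARY KEY,
        content text,
        embedding vector(1536)  -- e.g. OpenAI embedding dimensionality
    );
""")

# Store a document with a precomputed embedding (stubbed as a constant here).
doc_embedding = [0.01] * 1536
cur.execute(
    "INSERT INTO docs (content, embedding) VALUES (%s, %s::vector)",
    ("Example knowledge-base article", str(doc_embedding)),
)

# Retrieve the closest documents by cosine distance (<=> is pgvector's
# cosine-distance operator), the core of a RAG retrieval step.
query_embedding = [0.02] * 1536
cur.execute(
    "SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",
    (str(query_embedding),),
)
print(cur.fetchall())
conn.commit()
```

The appeal of this approach, as David explains later, is that retrieval lives next to the rest of your application data in a boring, well-understood database rather than in a separate vector service.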
Video podcastWatch along on our snazzy YouTube!Show NotesCompanies and Projects:* Retool* State of AI Report* Retool AI* Retool Workflows* Raising less money at lower valuations* Paul Graham's "playing house" essay* Gödel, Escher, Bach (GEB)Timestamps* [00:00:00] Introduction* [00:02:43] Retool's founding story and decision not to present at YC demo day initially* [00:09:08] Philosophy on fundraising - raising less money at lower valuations* [00:12:53] Overview of what Retool is* [00:15:41] Origin story of Retool AI product* [00:19:59] Decision to use open source vector database PG Vector* [00:21:29] Most underrated AI use cases* [00:25:56] Retool's AI UX and workflows* [00:30:38] Zapier vs Retool* [00:32:54] Updates from Retool's 2023 State of AI survey* [00:35:21] Who is adopting AI first?* [00:37:40] Evolving engineering hiring practices in the age of Copilot/ChatGPT* [00:40:02] Retool's views on internal vs external AI adoption* [00:41:50] OSS models vs OpenAI in production* [00:44:46] Additional survey questions to ask in 2024* [00:47:04] Balancing enterprise sales vs bottom-up adoption* [00:51:54] Philosophical thoughts on AGI and intentionalityTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:16]: And today we are in the studio with David Hsu from Retool. Welcome.David [00:00:20]: Thanks. Excited to be here.Swyx [00:00:23]: We like to give a little bit of intro from what little we can get about you and then have you talk about something personal. You got your degree in philosophy and CS from Oxford. I wasn't aware that they did double degrees. Is that what you got?David [00:00:35]: It's actually a single degree, which is really cool. So basically you study computer science, you study philosophy, and you study the intersection. The intersection is basically AI, actually: sort of, can computers think, or can computers be smart? What does it mean for a computer to be smart? As well as logic, which is also another intersection, and really fun too.Swyx [00:00:51]: At Stanford, it might be symbolic systems or whatever. It's always hard to classify these things when we don't really have a word for it. Now I guess everything's just called AI. Five years ago, you launched Retool. You were in YC in Winter '17, and it's just been a straight line up from there, right?David [00:01:09]: I wish.Swyx [00:01:10]: What's something on your LinkedIn that people should know about you? Maybe a personal hobby or, you know, let's just say something you're very passionate about.David [00:01:17]: Yeah, sure. I read quite a bit. I probably read like two books a week, round about. So it's a lot of fun. I love biking. It's also quite a bit of fun. So yeah.Swyx [00:01:25]: Do you use Retool to read? David [00:01:27]: No, I don't use Retool to read. No, that'd be funny. Swyx [00:01:30]: What do you read? How do you choose what you read? Any recommendations?David [00:01:35]: I'm mostly reading fiction nowadays. So fiction is a lot of fun. I think it helps me be more empathetic, if you will. I think it's a lot of fun. I actually just want to see what it's like to be in someone else's shoes. That's what I really like about philosophy as well. I find philosophy just so interesting, especially logic. We can talk more about that for probably hours if you want.Swyx [00:01:50]: So yeah, I have a casual interest in epistemology.
And I think that any time you're trying to figure out a way to solve a problem, you run into the question of how you know you've solved it.David [00:02:05]: Yeah, totally. What does it mean to know?Alessio [00:02:13]: That's its own podcast. We should do a special edition about it. That's fun. Let's maybe jump through a couple of things on Retool that I found out while researching your background. So you did YC, but you didn't present at demo day initially because you were too embarrassed of what you had built. Can you maybe give any learnings to, like, founders or people who are building? I've seen a lot of people kind of give up early on because they were like, oh, this isn't really what I thought it was going to be to be a founder. They told me I would go to YC and then present and then raise a bunch of money and then everything was going to be easy. So how did that influence how you build Retool today in terms of picking ideas and deciding when to give up on it?David [00:03:30]: Yeah. Let's see. So this is around about 2017 or so. So we were supposed to present at the March demo day, but then we basically felt like we had nothing really going on. We had no traction, we had no customers. And so we're like, okay, well, why don't we take six months to go find all that before presenting? Part of that, to be honest, was I think there's a lot of noise around demo day, around startups in general, especially because there's so many startups nowadays. And I guess for me, I'd always wanted to sort of under-promise and over-deliver, if you will. And demo day, I mean, maybe you two have seen a lot of videos, it's a lot of, honestly, over-promising and under-delivering, because every startup says, oh, I'm going to be the next Google or something. And then you peer under it and you're like, wow, nothing's going on here, basically. So I really didn't want that. And so we chose actually not to present on demo day, mostly because we felt like we didn't have anything substantial underneath. Although actually a few other founders in our batch probably would have chosen to present in that situation, but we were just kind of embarrassed about it. And so we basically took six months to just say, okay, well, how do we get customers? And we're not presenting until we have a product that we're proud of and customers that we're proud of. And fortunately, it worked out. Six months later, we did have that. So I don't know if there's much to learn from the situation besides, I think, social validation was something that I personally had never really been that interested in. And so it was definitely hard, because it's almost like you go to college and all your friends are graduating, but you failed the final and you have to, like, redo the year. It's like, well, it kind of sucks that all your friends are up there on the podium presenting, and they're raising a ton of money, and you're kind of being left behind. But in our case, we felt like it was a choice. We could have presented if we really wanted to, but we would not have been proud of the outcome or proud of what we were presenting. And for us, it was more important to be true to ourselves, if you will, and show something that we're actually proud of rather than just raise some money and then shut the company down in two years.Alessio [00:04:45]: Any Sam Altman stories from the YC days? Could you tell in 2017 that Sam was going to, like, run the biggest AI company in the world?
David [00:04:49]: Wow. No one's asked me that before. Let me think. Sam was, I think, I forget, maybe president of YC in our batch. We actually weren't in his group at the very beginning, and then we got moved to a different group. I think Sam was clearly very ambitious when we first met him. I think he was very helpful and sort of wanted to help founders. But besides that, I mean, I think we were so overwhelmed by the fact that we had to go build a startup that we were not, you know, honestly paying too much attention to everyone else, or taking notes on them.Alessio [00:05:20]: That makes sense. Well, and then just to wrap some of the Retool history nuggets, you raised a Series A when you were at $1 million in revenue with only three or four people. How did you make that happen? Any learnings on keeping teams small? I think there's a lot of overhiring we've seen over the last few years. I think a lot of AI startups now are raising very large rounds and maybe don't know what to do with the capital.David [00:05:42]: So this is kind of similar, actually, to sort of why we chose not to present at demo day. And the reason was, it feels like a lot of people are really playing startup. I think PG has an essay about this, which is like, you're almost like playing house or something like that. It's like, oh, well, I hear that in a startup, you're supposed to raise money and then hire people. And so therefore you go and do that. And you're supposed to, you know, do a lot of PR, because that's what, you know, startup founders do. And so you go do a lot of PR and stuff like that. And for us, we always thought that the point of starting a startup is basically that you have to create value for customers. If you're not creating value for customers, nothing else is going to work. Basically, you can't, you know, continuously raise money or hire people if you don't have customers. And so for us, we were always very focused on that. And so that's initially where we started. It again maybe goes to, like, the sort of presenting something truthful about yourself, or staying true to yourself, something to that effect, which is: we didn't want to pretend like we had a thriving business. And so the only way to not pretend was actually to build a thriving business. And so we basically just, you know, put our heads down and, you know, grinded away for probably a year, year and a half or so, just writing code, talking to customers. And I think that at that point we had raised something like maybe a million dollars, maybe a million and a half, out of YC. So I mean, to us, you know, that was a huge amount of money. I was like, wow, how are we ever going to spend a million and a half? The runway was like, you know, five, six years at that point, right? Because we're paying ourselves 30, 40K a year. And so then the question was not like, oh, we're going to run out of runway. The question was like, we better find traction, because if we don't find traction, we're going to, you know, just give up psychologically. Because if you work on an idea for four years and nothing happens, you're probably psychologically going to give up. And I think that's actually true in most startups, actually. It's like most startups die in the early stages, not because they run out of money, but really because you run out of motivation.
And for us, had we hired people, I think it would have actually been harder for us, because we would have run out of motivation faster. Because when you're pre-product market fit, actually, trying to lead a team of, like, you know, 10 people, for example, to march towards product-market fit, I think is actually pretty hard. Like, it's, you know, every day people are asking you, so why are we doing this? And you're like, I don't know, man, like, hey, trust this. That's actually a very tiring environment to be in. Whereas if it's just, you know, the founders figuring out product-market fit, I think that's actually a much sort of safer path, if you will. You're also risking less with employees: when you do hire employees, you have an idea that's working with your customers, and that's actually, I think, a lot more stable of a place for employees to join as well.Swyx [00:08:00]: Yeah. I find that typically the sort of founder-employee relationship is, the employee expects the founder to just tell them what to do, and you don't really get critical pushback from the employee, even if they're a buddy and even if they like you, as an early engineer. It's very much like the role play of, like, once you have that founder hat on, you think differently, you act differently, and you're more scrappy, I guess, in trying to figure out what that product is. Yeah, I really resonate with this, because I'm going through this right now.David [00:08:26]: Awesome. One thing we did actually early on that I think has paid a lot of dividends, especially now that Retool is a lot larger, is we hired a lot of former founders. So I want to say, like, when we were 20, 30, 40 people, we were probably like half former founders at each one of those stages. And that was actually pretty cool, because I think you infuse sort of a, you know, get-things-done kind of culture, an outcome-oriented culture with, like, very little politics, because, you know, no one came from larger companies, everyone was just like, this is my own startup, let me go figure out how to achieve the best outcome for the customer. And so I think from a cultural perspective, even today, a lot of Retool's culture is sort of very self-startery. I think it's actually because of sort of these, like, you know, early founders that we hired, which was really, really, you know, we're really lucky to have had them. Yeah.Swyx [00:09:08]: And then closing off on just a little bit of the fundraising stuff, something notable that you did was in 2021, when it was sort of peak ZIRP, and everyone was raising hundreds and hundreds of millions of dollars, you intentionally raised less money at a lower valuation, as your blog post title says. And I think it's a testament to your just overall general philosophy in building Retool that you're just very efficient and you do things from first principles. Any updates on that? Like, would you still endorse that? You know, would you recommend that to everyone else? What are your feelings sort of two years on from that?David [00:09:38]: Yeah. I think exactly what you said is correct, where we raised less money at a lower valuation. And I think the funny thing about this is that when we first announced that, both internally and externally, I think people were really surprised, actually, because I think Silicon Valley has been conditioned to think, well, raising a giant sum of money at a giant valuation is a really good thing. So, like, you know, you should maximize both the numbers, basically.
But maximizing both the numbers is actually really bad, actually, for the people that matter the most, you know, i.e. your employees, your team. And the reason for that is raising more money means more dilution. So if you look at, you know, a company like, let's say, Uber, for example: if you joined Uber at, like, I don't know, a $10 billion valuation, or let's say joined before the huge rounds, which I think happened at a few billion dollars in valuation, you actually got diluted a ton when Uber fundraised. So if Uber dilutes themselves by 10%, for example, let's say it raised $5.25 billion, employees' stake goes down by 10% in terms of ownership. Same with, you know, previous investors, same with the founders, etc. And so if you look at a lot of founders from that era, you know, those that fundraised around, like, 2013, 2017, a lot of the founders by IPO only have a few percentage points, actually, of the company. And if founders only have a few percentage points, you can imagine how, you know, how little employees have. And so that, I think, is actually still a really, you know, bad thing for employees overall. Secondly, a higher valuation, given the same company quality, is always worse. So basically, what that means is, if you are fundraising as a company, you can command a certain valuation in the market, you know, let's say it's x. Maybe, if you're lucky, you can raise at two times x, for example. But if you choose two times x, your company itself has not fundamentally changed. It's just that, you know, for some reason, investors want to pay more for it. You know, maybe today you're an AI company, for example, and so investors are really excited about AI and want to pay more for it. However, that might not be true in a year or two years' time, actually. And if that's not true in two years' time, then you're in big trouble, actually. And so now, I think you see a lot of companies that raised really high valuations around 2021. And now they're like, man, we raised at like a 100x or, you know, 300x multiple, for example. And if we raised at 300x, then, you know, maybe now we're at like 200x, and man, we just can't raise money ever again. Like, you know, we have to grow like 50x to go raise money at a reasonable valuation. And so I think that is really challenging and really demotivating for the team. And so I think a lower valuation actually is much better. And so for us, in retrospect, you know, to answer your question, two years later: we did not predict, you know, the crash, if you will. But given that, I think we've done extremely well, mostly because our valuation is not sky high. If our valuation were sky high, I think we'd have a lot more problems. We'd probably have recruiting problems, for example, and probably have a lot of internal morale problems, etc. A lot of people would be like, you know, why is the valuation this way? We might have cash flow problems, because we might have to go raise money again, you know, etc., but we can't because the valuation is too high. So I would urge, I think, founders today to, quote unquote, like, leave money on the table. Like, there are some things that are not really worth optimizing. I think you should optimize for the quality of the company that you build, not, like, the valuation you raise at or the amount you raise, etc.
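To make the dilution math David is describing concrete: each round that sells a fraction d of the company multiplies every existing holder's stake by (1 - d). Here is a minimal sketch; the stakes and round sizes are illustrative only, not any real cap table:

```python
# Minimal sketch of dilution across funding rounds: each round sells a
# fraction of the company, multiplying every existing holder's stake
# by (1 - dilution). All numbers are illustrative, not a real cap table.
def stake_after_rounds(initial_stake: float, dilutions: list[float]) -> float:
    """Return the ownership remaining after a sequence of dilutive rounds."""
    stake = initial_stake
    for d in dilutions:
        stake *= 1 - d
    return stake

rounds = [0.20, 0.20, 0.15, 0.15]  # hypothetical Series A through D dilution
print(f"founder:  {stake_after_rounds(0.30, rounds):.1%}")  # 30% shrinks to ~13.9%
print(f"employee: {stake_after_rounds(0.01, rounds):.1%}")  # a 1% grant shrinks to ~0.5%
```

The same multiplier hits founders, employees, and earlier investors alike, which is why a string of large, highly dilutive rounds leaves founders with only a few percentage points by IPO.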
Swyx [00:12:34]: Hindsight is 20/20, but it looks like, you know, you made the right call there anyway. Maybe we should also, for people who are not clued into Retool, do a quick, like, what is Retool? You know, I see you as the kings of, or the inventors of, the low-code internal tooling category. Would you agree with that statement? You know, how do you usually explain Retool?David [00:12:53]: I generally say it's like Legos for code. We actually hate the low-code moniker. In fact, we have docs saying we will never use it, internally or even with customers. And the reason for that is, I think low-code sounds very not developer-y. And developers, when they hear the phrase low-code, they're like, oh, that's not for me. I love writing code. Like, why would I ever want to write less code? And so, for us, Retool is actually built for developers; like, 95% of our customers are developers, actually. And so that is a little bit surprising to people. I'll generally explain it as, and this is kind of a funny joke too, I think part of the reason why Retool has been successful is that developers hate building internal tools. And you can probably see why. I mean, if you're a developer, you've probably built internal tools yourself; like, it's not a super exciting thing to do. You know, it's like piecing together a CRUD UI, you've probably, you know, pieced together many CRUD UIs in your life before, and there's a lot of grunt work involved. You know, it's like, hey, state management, it's like, you know, data validation, it's like displaying error messages, it's like debouncing buttons, like, all these things are not really exciting. But you have to do it, because it's so important for your business to have high quality internal software. And so what Retool does is basically allow you to piece this together really fast, whether it's a front end, whether it's a back end, or whatever else. So yeah, that's what Retool is.Swyx [00:14:02]: Yeah, actually, on hiring: I do a lot of developer relations and community building work, and you hired Krithika, who is now at OpenAI, to start out your sort of DevRel function. And I was like, what is Retool doing courting developers? And then she told me about this, you know, developer traction. And I think that is the first thing that people should know, which is that the burden and the weight of internal tooling often falls to developers, or it's an Excel sheet somewhere or whatever. But yeah, you guys have basically created this market, you know, in my mind. I don't know if there was someone clearly before you in this, but you know, you've clearly taken over and dominated. Every month, there's a new YC startup launching that's like, you know, we're the open-source Retool, we're the lower-code Retool, whatever. And it's pretty, I guess it's endearing, you know; we'll talk about Airplane later on. But yeah, I think I've actually used Retool, you know, in my previous startups for this exact purpose. Like, we needed a UI for AWS RDS that, you know, our less technical people, like our sales operations people, could interact with, and yeah, Retool is perfect for that.David [00:15:04]: Yeah, it's a good example of, like, that's an application that an engineer probably does not want to build. Like, building an app on top of Salesforce or something is not exciting. And so it sucks. It's very limited. It's like, not a fun experience at all. But piecing together a Retool app is quite a bit easier.
So yeah, let me know if you have any feedback.Swyx [00:15:23]: Yeah, of course. More recently, I think about three, four months ago, you launched Retool AI; obviously, AI has been sort of in the air. I'd love for you to tell the journey of AI product ideation within Retool. Given that you have a degree in this thing, I'm sure you're not new to this, but like, when would you consider sort of the start of AI product thinking at Retool?David [00:15:41]: So we actually had a joke internally at Retool. As part of our roadmap every year, it was like 2019 or something, we had this joke, which was like, what are we going to build this year? We're going to build AI programming, is what we always said as a joke. And it was funny, because we were like, that's never gonna happen, but let's add it, because it's like a buzzword thing that enterprises love. So let's look at it. And so it was almost like a funny thing, basically. But it turns out, you know, we're actually building that now. So this is pretty cool. So I would say maybe AI thinking at Retool probably first started maybe like, I don't know, a year and a half ago, something like that. And when we first started thinking about it, sort of in a philosophical way, if you will, it's like, well, what is the purpose of AI? And how can it help, you know, what Retool does? And there were two main prongs of value, if you will. One was helping people build apps faster. And so you've probably seen Copilot, you've seen sort of so many other coding assistants similar to it, you know, stuff like that. So that's interesting, because, you know, engineers, as we talked about, do some grunt work. And grunt work, you know, maybe could be automated by AI, was sort of the idea. And it's interesting: we actually, I would say, kind of proved or disproved the hypothesis a little bit. If you talk to most engineers today, like, a lot of engineers do use Copilot. But if you ask them, like, how much time does Copilot save you? It's not like coding is 10x faster than before; you know, coding is maybe like 10% faster, maybe 20% faster, or something like that, basically. And so it's not like a huge step change, actually. And the reason for that, we think, is because the sort of fundamental frameworks and languages have not changed. And so if you're building, let's say, you know, like the sales ops tool we were talking about before, for example, let's say you've got AI to generate you a first version of that, the problem is that it probably generated it for you in, like, JavaScript, because you're writing for the web browser, for example, right. And then for you to actually go proofread that JavaScript, for you to go read the JavaScript to make sure it's working, you know, to fix the subtle bugs that AI might have caused, hallucinations, stuff like that, actually takes a long time and a lot of work. And so for us, the problem is actually not, like, the process of coding itself; it is more sort of the language or the framework, which I think is, like, way too low level. It's kind of like punched cards: let's say, back in the day, you coded via punched cards, and AI could help you generate punched cards. Okay, you know, I guess that helps me punch the cards a little bit faster, because I have a machine punching them for me. But like, when there's a bug, I still have to go read all the punched cards and figure out what's wrong, right? It's like, it's a lot of work, actually.
And so, for us, that was the sort of initial idea: can we help engineers code faster? You know, I think it's somewhat helpful, to be clear; like, again, I think it's 10 or 20%. So we have things like, you know, you can generate SQL queries by AI, you can generate UIs by AI, and stuff like that. So that's cool, to be clear. But it's not, I think, the step change. So we're investing somewhat in that, but the bulk of investment, actually, is in number two, which is helping developers build AI-enabled applications faster. And the reason why we think this is so exciting is we think that practically every app, every internal app, especially, is going to be AI-infused over the next, like, three years. And so every tool you might imagine, so like the tool you were mentioning, like a sales operations tool, for example, if you were to build it today, would probably incorporate some form of AI. And so, you know, we see today, like, a lot of people build, you know, I'll say, sales management tools on Retool. An example is there's a Fortune 500 company building, like, sales forecasting tools. So they basically have salespeople enter their forecast, you know, for the quarter, at the beginning of the quarter, like, hey, I have these deals. And these deals are going to close, these deals are not going to close, you know, I think I'm upsiding these, downsiding these, stuff like that, basically. So you can imagine it's pulling in deals from your Salesforce database. And so it pulls in the deals and actually uses AI to compute, like, okay, well, you know, given previous deal dynamics, these are the deals that are more likely to close this month versus next month, this quarter versus next quarter, etc. And so it could actually, you know, pre-write you a draft of, you know, your report, basically. And so that's an example where I think all apps, whether it's, you know, a sales app, a fraud app, a, you know, fintech app, whatever it is, basically, especially internal apps, I think, like you said, Alessio, in order to make you more productive, are going to incorporate some form of AI. So the other question is, can we help them incorporate AI faster? So that's why we launched, like, a vector database, for example, built directly into Retool. That's why we now launched all these AI actions, so you don't have to go figure out what the best model is and do testing and stuff like that, which we give you out of the box. So for us, I think that is really the exciting future: can we make every app, and also Retool itself, use AI a little bit and make people more productive?Alessio [00:19:59]: So Jeffrey Wang, who's the co-founder and chief architect of Amplitude, mentioned that you just use Postgres' pgvector. When you were building Retool Vectors, how did you think about, yeah, leveraging a startup to do it versus putting vectors into one of the existing data stores that you already had? I think, like, you're at quite a large customer scale, so like, you're maybe not trying to get too cute with it. Any learnings and tips from that?David [00:20:23]: Yeah, I think a general philosophical thing we believe is, um, we think the open source movement in AI, especially when it comes to all the supporting infrastructure, is going to win. And the reason for that is how we look at, like, developer tools in general, especially for such a fast-moving space.
In the end, like, there are really smart people in the world that have really good ideas, and they are going to go build companies and projects, basically, around these ideas. And so for us, we have always wanted to partner with maybe more open source providers or projects, you could say, like pgvector, for example. And the reason for that is it's easy for us to see what's going on under the hood. A lot of this stuff is moving very fast. Oftentimes, there are bugs, actually, and so we can go look and fix bugs ourselves and contribute back to them, for example. But we really think open source is going to win in this space. It's hard to say about models. I don't know about models necessarily, because it's going to be pretty complicated there. But when it comes to tooling, for sure, I think there's just, like, an explosion of creativity, if you will. And I think betting on any one commercial company is pretty risky. But betting on the open source sort of community and the open source contributors, I think, is a pretty good bet. So that's why we decided to go with pgvector.Alessio [00:21:29]: Awesome. So we're going to jump into the survey next, but we're going to put a bunch of links in the show notes about Retool AI and whatnot. Is there any most underrated feature, like something that customers maybe love that you didn't expect them to really care about? I know you have, like, text-to-SQL, you have UI generation; there's, like, so many things in there. Yeah. What surprised you?David [00:21:49]: Yeah. So what's really cool, and this is my sense of the AI space overall, maybe you two see this as well, is that, especially in Silicon Valley, where a lot of the innovation is happening, I think there's actually not that many AI use cases, to be honest. And AI to me, even as of January 19th of 2024, still feels like it's in search of truly good use cases. And what's really interesting, though, about Retool, and I think we're in a really fortunate position, is that we have this large base of sort of customers, and a lot of these customers are actually much more legacy, if you will, customers. And a lot of them actually have a lot of use cases for AI. And so to us, I think we're almost in, like, a really perfect or unique spot: we're able to adopt some of these technologies and provide them to some of these, like, older players. So one example that actually really shocked and surprised me about AI was, so we have this one clothing manufacturer, I think it's either the first or second largest clothing manufacturer in the world, who's using Retool. They're a ginormous company, very multinational, stores in pretty much every mall in the world. And so they have one problem, which is they need to design styles every year, for the next year, basically, for every season. So like, hey, for summer 2024, for example, what are we going to design? And so what they used to do before is they were hiring designers, and designers would go study data. They'd be like, okay, well, it looks like, you know, big floral patterns were really hot in, like, you know, California, for example, in 2023. And, like, do I think they're going to be hot in 2024? Well, let me think about it. I don't know. Maybe. And if I believe they're going to be hot, let me go design some floral patterns, actually. And what they ended up doing in Retool, actually, is they actually automated a lot of this process away in Retool.
So they actually now built a Retool app that allows actually a non-designer, so like an analyst, if you will, to analyze, like, you know, what are the hottest-selling patterns, you know, in particular geos, like, this was really hot in Brazil, this was really hot in China, this was really hot, you know, somewhere else, basically. And then they actually feed it into an AI. And the AI, you know, with DALL-E and other image generation APIs, actually generates patterns for them. And they print the patterns, which is really cool. And so that's an example of, honestly, a use case I would have never thought about: like, thinking about, you know, how clothing manufacturers create their next line of clothing, you know, for the next season, like, I don't know, I never thought about it, to be honest, nor did I ever think, you know, how it would actually happen. And the fact that they're able to leverage AI and actually, you know, leverage multiple things in Retool to make that happen is really, really, really cool. So that's an example where I think if you go outside of Silicon Valley, there are actually a lot of use cases for AI, but a lot of it is not obvious; like, you have to get into the businesses themselves. And so I think we personally are in a really fortunate place. But if, you know, you're working in the space and want to find some use cases, please come talk to us; like, you know, we're really excited about marrying sort of technology with use cases, which I think is actually really hard to do right now.Swyx [00:24:38]: Yeah, you know, I have a bunch of, like, sort of standing presentations around how this industry is developing. And, like, I think the foundation model layer is understood. The chains / vector DB / RAG layer is understood. I always have a big question mark around the UI layer for AI, and I actually have you and Vercel v0 in that box. And, like, you know, you are perfectly placed to expose those functionalities to end users; you personally don't really know what they're going to use it for, and sometimes they'll surprise you with their creativity. One segment of this, and I do see some startups springing up to do this, is related to something that you also build, but it's not strictly AI-related, which is Retool Workflows: the sort of canvassy, boxes-and-arrows, point-and-click, do-this-then-do-that type of thing that every, what are we calling it, low-code? every internal tooling company eventually builds, you know. I worked at a sort of workflow orchestration company before, and we were also discussing internally how to make that happen. But you're obviously very well positioned to do that. Yeah, basically, like, do you think that there is an overlap between Retool Workflows and AI? I think that, you know, there's a lot of interest in sort of chaining AI steps together. I couldn't tell if, like, that is already enabled within Retool Workflows, I don't think so. But you could sort of hook them together kind of jankily. Like, what's the interest there? You know, is it all of a kind, ultimately, in your mind?David [00:26:07]: It is 100% all of a kind. And yes, actually: we already see a lot of people building AI workflows in Retool, which is what we're gonna talk about in a second.
But a hot take here is, actually, I think a lot of the utility in AI today, I would probably argue 60, 70% of the utility that businesses have found in AI, is mostly via ChatGPT, across the board. And the reason for that is, I mean, ChatGPT's sort of UI, you could say, or interface and user experience, is just really quite good; you know, you can sort of converse, you know, with an AI, basically. But that said, there are downsides to it. If you talk to, like, a giant company, like a J.P. Morgan Chase, you know, for example, they may be reticent to have people copy-paste data into ChatGPT, for example, even on ChatGPT Enterprise. Another limitation is that chat is good for one-off tasks. So if you're like, hey, I want a first version of a presentation or something like that, or, you know, help me write the first version of a doc or something like that, chat is great for that. It's a great, you know, very portable, if you will, form factor, so you can do that. However, if you think about economic productivity more generally, like, chat, again, will help you like 10 or 20%. But it's unlikely that you're going to replace an employee with chat; you know, you're not gonna be like, oh, I'm a relationship manager at J.P. Morgan Chase, and I've replaced them with an AI chatbot. It's kind of hard to imagine, right, because, like, the employees are doing a lot of things besides, you know, just generating text. Maybe another way of putting it is, like, chat is like a reactive interface: when you have an issue, you will go reach out to chat and the chatbot solves it. But, like, the chatbot is not going to solve 100% of your problems; it'll solve, like, you know, 25% of your problems pretty quickly, right. And so what we think the next, like, big breakthrough in AI is, is actually automation. It's not just like, oh, I have a problem, let me go to a chatbot and solve it. Because, like, again, people don't spend 40 hours a week in a chatbot; they spend, like, two hours a week in a chatbot, for example. And so what we think can be really big, actually, is if you're able to automate entire processes via AI. Because then you're really realizing the potential of AI; it's not just, like, you know, a human copy-pasting data into an AI chatbot and copying it back out. Instead, it's like the whole process now is actually done in an automated fashion, without the human. And that, I think, is what's going to really unlock sort of big economic productivity; that's what I'm really excited about. And I think part of the problem right now, and I'm sure you all have thought a lot about agents, is that agents are actually quite hard. Because, like, you know, the AI is wrong, like, you know, 2% of the time, but then if you, let's say, you know, raise that to the power of seven, for example, it's actually wrong, you know, quite often. And so what we've actually learned with workflows is that we don't want to generate the whole workflow for you by AI. Instead, what we want you to do, actually, is to actually sort of drag and drop the workflow yourself. Maybe you can get a first version or something by AI, but after that, basically, you should actually be able to modify the steps yourself. But every step can use AI.
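David's "wrong 2% of the time, raised to the power of seven" point is easy to verify. Here is a minimal sketch of how per-step error compounds across a chained agent workflow; his 2% figure is illustrative, and the sketch assumes steps fail independently:

```python
# Minimal sketch of compounding error in chained agent steps: if each step
# succeeds independently with probability p, a k-step chain succeeds with p**k.
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step in an independent k-step chain succeeds."""
    return per_step_success ** steps

print(f"{chain_success(0.98, 7):.1%}")   # ~86.8%: a 7-step chain fails about 1 run in 8
print(f"{chain_success(0.98, 20):.1%}")  # ~66.8%: a 20-step chain fails about 1 run in 3
```

This is the reliability argument for keeping the human-drawn workflow as the scaffold and scoping AI to individual steps that a person can inspect and rerun, rather than generating the whole chain by AI.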
And so what that means is, like, it's not that the whole workflow is created by AI; it's that every step is AI-automated. And so if you go back to, for example, the use case we were talking about, you know, with the clothing manufacturer, that's actually a workflow, actually. So basically, what they say is, hey, every day, we sync all the data, you know, from our sales systems into our database. And then we do some data analysis, and, you know, it's just raw SQL, basically, it's nothing too surprising. And then they use AI to generate new ideas. And then the analysts will look at the new ideas and approve or reject them, basically. And that is, like, you know, that's true automation. You know, it's not just, like, a designer copy-pasting things into a chat and being like, hey, you know, give me a design. Designs are actually being generated, 10,000 designs every day, and then you have to go and approve or reject these designs, which, I think, you know, is a lot more economically productive than just copy-pasting something. So we think sort of the AI workflow space is a really exciting space, and I think that is the next step in sort of delivering a lot of business value by AI. I personally don't think it's, you know, AI chat or AI agents quite yet, so.Swyx [00:29:50]: That's a pretty reasonable take. It's disconcerting, because, like, I know a lot of people trying to build what you already have in workflows. So you're the incumbent, sort of, in their minds (I'm sure it doesn't feel that way to you), and they're like, okay, how do I, you know, compete with Retool or, you know, differentiate from Retool? As you mentioned, you know, all these connections, it does remind me that you're running up against Zapier, you're running up against maybe Notion in the distant future. And yeah, I think that there'll be a lot of different takes at this space, and whoever is best positioned to serve their customer in the way that they need is going to win. Do you have a philosophy around, like, what you won't build? Like, what do you prefer to partner on and not build in-house? Because I feel like you build a lot in-house.David [00:30:38]: Yes, there are probably two philosophical things. So one is that we're developer first. And I think that's actually one big differentiator between us and Zapier and Notion; so very rarely will we actually see them, and the reason is we're developer first. Because developers, like, if you're building a sales ops tool, you're probably not considering Notion if you're a developer; you're probably like, I want to build this via React, basically, or use Retool. And so, because we build for developers, it's pretty interesting, actually. I think one huge advantage of developers is that developers don't want to be given an end solution. They want to be given the building blocks to build the end solution themselves. And so for us, like, an interesting equilibrium we could have ended up at is basically to say, hey, Retool is a consulting company, and we basically build apps for everybody, for example. And what's interesting is, we've actually never gotten to that equilibrium. And the reason for that is developers don't want, you know, like, a consultant coming in and building all the apps for them.
Developers are like, hey, I want to do it myself, just give me the building blocks: give me the best table library, give me, you know, good state management, give me an easy way to query REST APIs. I'll do it myself, basically. So we generally end up basically always building building blocks that are reusable by multiple customers. We have, I think, basically never built anything specific for one customer. So that's one thing that's interesting. The second thing is, when it comes to, you know, let's say, what in the AI space we're going to build and not going to build, we basically think about whether it's a core competency, or whether there are unique advantages to us building it or not. And so if we think about the workflows product, we think workflows actually is a pretty core competency for us, and the idea that we can build a developer-first workflow automation engine. I mean, I think after we released, you know, Retool Workflows, there have been a few copycats that are, I think, quite far behind, actually; they're sort of missing a lot of the more critical features. But, like, if you look at the space, it's like Zapier on one side, and then maybe, like, Airflow on the other. And so Retool Workflows actually is fairly differentiated. And so we're like, okay, we should go build that: we had a different take on it, so we just went and built it. Whereas if you look at, like, vectors, for example, you look at vectors and you're like, wow, there's a pretty thriving space already in vector databases. Does it make sense for us to go build our own? Like, what's the benefit? Like, not much; we should go partner, or go find technology off the shelf, which is pretty effective. And so for us, I think it's like: how much value does that add for customers? Do we have a different take on the space? Do we not? And every product that we've launched, we've had a different take on the space; and in the products where we don't have a different take, we just adopt what's off the shelf.Alessio [00:32:54]: Let's jump into the State of AI survey that you ran, and maybe get some live updates. So you surveyed about 1,600 people last August, and I wish we'd had this data five years ago. And there were kind of, like, a lot of interesting nuggets, and we'll just run through everything. The first one is more than half the people, 52%, said that AI is overrated. Are you seeing sentiment shift in your customers, or, like, the people that you talk to, as the months go by? Or do you still see a lot of people, yeah, that are not in Silicon Valley, maybe say, hey, this is maybe not as world-changing as you all made it sound to be?David [00:33:30]: Yes, we're actually running the survey again, actually, in the next few months, so I can let you know when it changes. It seems to me that it has settled down a bit, in terms of, sort of, maybe, like, I don't know, signal to noise: you could say it seems like there's a little bit less noise than before. I think people are still trying to look for use cases. It's the same as last year, honestly; I think there are slightly more use cases, but still not substantially more. And I think, as far as we can tell, a lot of the survey, especially some of the comments that we saw, does feel like companies are investing quite a bit in AI, and they're not sure where it's going to go yet. But they're like, well, it could be big, so I think we should keep on investing.
I do think that, based on what we are hearing from customers, if we're not seeing returns in, like, a year or something, there will be more skepticism. So I think it is time-bound, if you will.Alessio [00:34:15]: So you finally gave us some numbers on Stack Overflow usage. I think that's been a Twitter meme for a while, whether or not ChatGPT killed Stack Overflow. In the survey, 58% of people said they used it less, and 94% of them said they used it less because of Copilot and ChatGPT, which I think kind of makes sense. I know Stack Overflow tried to pull a whole thing, it's like, no, the traffic is going down because we changed the way we instrument our website, but I don't think anybody really bought that. And then you had, right after that, the expectation of job impact by function: operations people said 8 out of 10, basically, they think it's going to really impact their job. Designers were the lowest one, at 6.8 out of 10. But then all the examples you gave were of designers' jobs being impacted by AI. Do you think there's a bit of a dissonance maybe between, like, the human perception of, oh, my job can't possibly be automated? It's funny that the operations people are like, yeah, it makes sense, I wish I could automate myself, you know, versus the designers; or maybe they love their craft more. Yeah, I don't know if you have any thoughts on who will accept it first, you know, that they should just embrace the technology and change the way they work.David [00:35:21]: Yeah, that's interesting. I think it's probably going to be engineering-driven. I mean, I think you two are very well placed here; maybe you two even started some of this wave, sort of the AI engineer wave. The companies that adopt AI the best, it is going to be engineering-driven, I think, rather than, like, operations-driven or anything else. And the reason for that is, I think, the rise of this, like, AI engineer profile. This is very philosophical, but AI is a tool, in my head. Like, it is not, in my head, I think we're actually pretty far from AGI. AI is not, like, you know, a black box where it does everything you want it to do. The models that we have today require, like, very specific prompting, for example, in order to get, like, you know, really good results. And the reason for that is, it's a tool that, you know, you can use in a specific way. If you use it the wrong way, it's not going to produce good results for you, actually. It's not, like, by itself taking a job away, right? And so I think, actually, to adopt AI, it's probably going to have to be engineering first, basically, where engineers are playing around with it, figuring out limitations of the models, figuring out, like, oh, maybe, like, using vector databases is a lot better, for example, maybe, like, prompting in this particular way is going to be a lot better, etc. And that's not the kind of stuff that I think, like, an operations team is going to really be, like, experimenting with, necessarily. I think it really has to be engineering-led. And then I think the question is, well, what are the engineers going to focus on first? Like, are they going to focus on design first or, like, operations first? And that, I think, is more of a business decision. I think it's probably going to be more like, you know, the CEO, for example, says, hey, we're having trouble scaling this one function, so, like, why don't we try using AI for that? And let's see what happens, for example.
And so in our case, for example, we have a lot of support volume. What I mean by that is we have a really, really high-performance support team, but we get a lot of tickets. And the reason for that is, you know, we're a very dynamic product, you can use it in so many different ways, and people will have a lot of questions for us, basically. And so we were looking at, well, you know, can we, for example, draft some replies to support tickets, you know, by AI? For example, can we allow our support agents to be, you know, hopefully, doubly productive as before, for example? So I guess I would say it's, like, business-needs-driven, but then engineering-driven after that. So, like, you know, the business decides, okay, well, this is where AI can be most applied, and then we assign the project to an engineer, and the engineer goes and figures it out. I honestly am not sure whether the operations team accepting or rejecting it is gonna change the outcome much, if you will.Alessio [00:37:40]: So another interesting part was the importance of AI in hiring. 45% of companies said that, on the engineering side, they made their interviews more difficult to compensate for people using Copilot and ChatGPT. Has that changed how you interview? I don't know how much you're still involved with engineering hiring at the company, but I'm curious how you're scaling the difficulty of interviews, even though the job is the same, right? So just because you're gonna use AI doesn't mean the interview should be harder. But I guess it makes sense.David [00:38:16]: Our sense, basically, and this matches the survey, is that when we do engineering interviews, we are most interested in assessing, like, critical thinking, or thinking, you know, on the spot. And I guess, you know, when you hire the employee, in the end, the job of the employee is to be productive, and they choose whatever tools they want to be productive. So, you know, that's kind of our thinking, too. However, we do think that, you know, if you think about it in a first-principles way, if your only method of, like, coding is literally copy-pasting, you know, off of ChatGPT, or, like, you know, pressing tab in Copilot, I think that would be concerning. And so, for that reason, we still do want to test for, like, you know, a fundamental understanding of comp sci. Now, that said, I think if you're able to use ChatGPT or Copilot, let's say, competently, we do view that as a plus; we don't view it as a minus. But if you only use Copilot, and you aren't able to reason about, like, you know, how to write a for loop, for example, or how to write FizzBuzz, that would be highly problematic. And so what we do today is we'll do a screen share, or our test is in a hackpad, actually, so there's no Copilot there, to sort of see what they're doing, or see what they're thinking. And we really want to test for thinking, basically. But yeah, I mean, we ourselves internally have embraced Copilot, and we would encourage engineers to use Copilot too. But we do want to test for understanding of what you're doing, rather than just copy-pasting from Copilot.Alessio [00:39:27]: The other one was AI adoption rate: only 27% are in production. Of that 27%, 66% are internal use cases.
Shout out to Retool, you know. Do you have a mental model as to how people are gonna make the jump from using it internally to externally? Obviously, there are all these different things like privacy. You know, if an internal tool hallucinates, that's fine, because you're paying people to use it, basically, versus if it hallucinates to your customer, there's a different bar. Because for you, people who build internal tools with Retool are external customers to you, you know, so I think you're on the flip side of it.David [00:40:02]: Yeah, it's hard to say. Maybe a core Retool belief was actually that most software built in the world is internal facing, actually, which may sound kind of surprising, you know, for some of you hearing this. But effectively, we all work in Silicon Valley, right? We all work at businesses, basically, that sell software as, you know, as sort of a business. And that's why all the software engineers that we hire basically work on external facing software, which makes sense at most software companies. But if you look at most companies in the world, most companies in the world are actually not software companies. If you look at, like, the clothing manufacturer that I was talking about, they're not a software company, they don't sell software to make money, they sell clothing to make money. And most companies in the world are not software companies, actually. And so most of the engineers in the world, in fact, don't work at Silicon Valley companies, they work outside of Silicon Valley, they work in these sort of more traditional companies. So if you look at the Fortune 500, for example, probably like 20 of them are software companies, and 480 of them are not software companies, and that's where the employable software engineers are. And so most of the software engineering in the world, and most of the code in the world, actually goes towards these internal facing applications. And so, for all the reasons you said there, I think hallucination matters less, for example, because they have someone checking the output, compared to consumer, so hallucination is more okay, it's more acceptable as well. Yeah, it can be unreliable, because it's probabilistic, and that's also okay. So I think it's kind of hard to imagine AI being adopted in a consumer way without the consumer opting in. Like, ChatGPT is very obviously consumer, the consumer knows that it's ChatGPT they're using. I don't know if it's going to make its way to, like, the banking app anytime soon. Maybe even for support, it's hard. Because if it hallucinates, then, you know, it's actually quite bad for support if you're hallucinating, right? So it's, yeah, it's hard to say. I'm not sure.Alessio [00:41:50]: Yeah, I think a lot of people, like you said, we all build software, so we expect that everybody else is building software for other people. But most people just want to use the software that we build out here. I think the last big bucket is the models breakdown. 80% of people just use OpenAI. Some might experiment with smaller models. Any insights from your experience at Retool, like building some of the AI features? Have you guys thought about using open source models? Have you thought about fine tuning models for specific use cases? Or have you just found GPT-4 to be great at most tasks?David [00:42:24]: Yeah, so two things.
One is that, from a data privacy perspective, people are getting more and more okay with using a hosted model like GPT-4, for example. Especially because OpenAI has enterprise offerings and has gone into some companies already, because I think a lot of CIOs are just like, let's get a second host, like, let's use Azure, for example, and, you know, let's make it available for employees to experiment with. So I do think there is more acceptance, if you will, today of feeding data into GPT. Now, for some sensitive data, people might not want to do so. Like, feeding in earnings results data three days before you announce earnings probably is a bad idea. You probably don't want it writing your earnings statement for you. So yeah, there are still some challenges like that. But I think open source models could actually help solve a lot of grief when it comes to that, and that can be exciting. So that's maybe just one thought. The second thought is, I think OpenAI has been really quite smart with their pricing. And they've been pretty aggressive, like, let's create this model and sell it at a pretty cheap price, to make it such that there's no reason for you to use any other model. Just from a strategy perspective, I don't know if that's going to work. And the reason for that is you have really well-funded players like Google or Facebook, for example, that are actually quite interested. I think if it were only startups competing, OpenAI would win for sure. At this point, OpenAI is so far ahead, from both a model and a pricing perspective, that there would really be no reason, in my opinion at least, to go use a startup's model. But Facebook is not going to give up on AI; Facebook is investing a lot in AI, in fact. And so competing against a large FAANG company that is making its models open source, I think that is challenging. Now, however, where we are right now is, I think GPT-4 is so far ahead in terms of performance, and model performance is so important right now. I'm not going to argue about whether Llama 2 actually is that far behind, but customers don't want to use Llama 2 because it's perceived as far behind right now. And so that I think is part of the challenge. As AI progress slows down, if we get Llama 4 and Llama 5, for example, maybe they're comparable at that point to GPT-5 or GPT-6, and it may get to the point where it's like, look, I just want to use Llama. It's safer for me to host it on-prem, it's just as fast, just as cheap, so why not, basically? But I think right now we are in a state where OpenAI is set up for next year really well. Right now they're thriving, but let's see what happens in the next year or two.Swyx [00:44:40]: What are you going to ask differently for the next survey? Like, what info do you really actually want to know that's going to change your worldview?David [00:44:46]: I'll also ask you that, but if you have any ideas, let me know. For us, actually, we were planning on asking very similar questions, because for us, the value of the survey is mostly seeing changes over time and understanding, like, okay, wow, for example, GPT-4 Turbo NPS has declined. That would be interesting, actually.
One thing that was actually pretty shocking to us was, let me find the exact number, but one change that we saw, for example, if you compare GPT-3.5 NPS, I want to say it was like 14 or something, it was not high, actually. The GPT-4 NPS was like 45 or something like that, so it was actually quite a bit higher. So I think that kind of progress over time is what we're most interested in seeing: are models getting worse, are models getting better? Are people still loving pgvector? Do people still love Mongo? Stuff like that. That I think is the most interesting.Swyx [00:45:33]: It seems like you're very language model focused. I think that there's an increasing interest in multi-modality in AI, and I don't really know how that is going to manifest. Obviously, GPT-4 Vision, as well as Gemini, both have multi-modal capabilities. There's a smaller subset of open source models that have multi-modal features as well. We just released an episode today talking about IDEFICS from Hugging Face, and I would like to understand how people are adopting or adapting to the different modalities that are now coming online for them. What their demand is relative to, let's say, generative images versus just visual comprehension versus audio versus text-to-speech.David [00:46:15]: What do they want?Swyx [00:46:15]: What do they need? And what's the sort of stack-ranked preference order? It's something that we are trying to actively understand, because there's this sort of multi-modality world, but really multi-modality is kind of... I've been thinking about this phrase: multi-modality is like cancer. It's this umbrella term for actually a whole bunch of different things that quite honestly aren't really that related to each other, except in the limit. But it tends towards, maybe everything uses transformers, and ultimately everything can be merged together with a text layer, because text is the universal interface. But if you're given the choice between, I want to implement an audio feature versus I want to implement an image feature versus video, whatever, what are people needing the most? What should we pay the most attention to? What is going to be the biggest market for builders to build in?David [00:47:03]: I don't know.Swyx [00:47:04]: I think I would just kind of zoom out a little bit to just general founder questions. You have a lot of fans in the founder community. I think you're just generally well-known as a very straightforward, painstaking person about just business. Something that is the perception from Joseph is that you have been notably sales-led in the past. That's his perception. I actually never got that, but I'm not that close to your sales motion. And it's interesting to understand your market, the internal tooling market, versus all the competition that's out there. There's a bunch of open source Retools, and there's a bunch of... I don't know how you categorize the various things out there, but effectively what he's seeing and what he's asking is, how do you manage between enterprise versus ubiquity? Or in other words, enterprise versus bottom-up, right? I was actually surprised when he told me to ask that question, because I had always assumed that you were self-serve, sign-up, bottom-up led. But it seems like you have a counter-consensus view on that.David [00:48:04]: Yeah. So actually when Retool first started, we started mostly by doing sales, actually.
And the reason we started by doing sales was mostly because we weren't sure whether we had product-market fit, and sales seemed to be the best way of proving whether we had product-market fit or not. Because I think this is true of a lot of AI projects: you can launch a project, and people might use it a bit, and people might stop using it, and you're like, well, I don't know, is that product-market fit? Is that not? It's hard to say, actually. However, if you work very closely with the customer in a sales-led way, it's easier to understand their requests, understand their needs, and stuff like that, and actually go build a product that actually serves them really well. And so basically, we viewed sales as working with customers, which I think is actually a better way to describe what sales is at an early-stage company. And so we did a lot of that, certainly, when we got started. Then over the last five years, maybe three years ago, four years ago, something like that, we have invested more on the self-serve ubiquity side. And the reason for that is, when we started Retool, we always wanted some percent of software to get built inside of Retool, whether AI software or ordinary software, or broadly UIs and whatnot, but software, basically. And for us, we think that maybe one day 10% of all the code in the world could be written inside of Retool, actually, or 10% of the software could be running on Retool, which would be really, really cool. And for us to achieve that vision, it really does require a broad-based adoption of the platform. It can't just be, oh, only 1,000 customers use it, even if they're the largest 1,000 companies in the world. It has to be that all the developers in the world use it. And there are, I think, 25, 30 million developers in the world. The question, of course, is how do you get to all the developers? And the only way to get to those developers is not by sales. You can't have a salesperson talk to 30 million people. It has to be basically in this sort of bottoms-up, product-led, ubiquity kind of way. And so we actually changed our focus to be ubiquity, actually, last year. So our North Star metric used to always be revenue generated. We actually changed it to be the number of developers building on the platform, last year. And that, I think, was actually a really clarifying change, because obviously revenue is important, it funds a lot of our product and funds the business. But we're going to fail if we aren't able to get to something like 10, 20, 30 million developers one day, if we can't convince all developers that Retool's a better way to build a certain class of software, let's say internal applications for today. And so I think that has been a pretty good outcome, when I think about the last five years of Retool. Starting off with sales, so you can build revenue, actually build traction, and hire more slowly, was really good. I do think the focus towards bottoms-up ubiquity also was really important, because it helps us get to our long-term outcome. What's interesting, I think, is that long-term ubiquity actually is harder for us to achieve outside of Silicon Valley. To your point, I think in Silicon Valley, Retool is reasonably ubiquitous.
I think if you're starting a startup today and you're looking to build an internal UI, you're probably going to consider Retool, at least. Maybe you don't choose it, because you're like, I'm not ready for it yet or something. But you're going to consider it, at least. And when you want to build it, I think there's actually a high probability you will end up choosing it, which is awesome. But think about a random developer working at, let's say, an Amazon, for example. Today at Amazon, actually, we have, I think, 11 separate business units that use Retool at this point, which is really awesome. So Amazon is actually a big Retool customer. But the average engineer at Amazon probably has never heard of Retool, actually. And so that is where the challenge really is. How do we get, I don't know, let's say 10,000 developers at Amazon building via Retool? And that, again, I think is still a bottom-up ubiquity thing. I don't think we're going to go to Amazon and knock on every developer's door, or send out an email to every developer saying, go use Retool. They're going to ignore us, actually. I think it has to be: you use the product, you love it, you tell your co-worker about it. And so for us, it's a big bottom-up ubiquity motion, but marrying that with the enterprise business has been something that's really near and dear to our hearts.Swyx [00:51:54]: Yeah. Just, like, general market thoughts on AI. Do you spend a lot of time thinking about AGI stuff, or regulation, or safety? What interests you most, you know, outside of the Retool context?David [00:52:07]: There's a lot of hype in AI right now, and again, not too many use cases. So for us, at least from a Retool context, it really is, how do we bring AI in and have it actually meet business problems? And again, it's actually pretty hard. I think most founders that I've met in the AI space are always looking for use cases, never have enough use cases, right? Real use cases, where people pay money for them. But beyond the Retool interest, me personally, philosophically, I've been thinking recently a bit about intentionality and AGI, and, you know, what it would take for me to say, yes, GPT-X, or any sort of model, actually is AGI. I think it's kind of challenging, because if you look at evolution, for example, humans have been programmed to do, like, three things, if you will: we are here to survive, we're here to reproduce, and, well, maybe this is just two things, I suppose. So basically, to survive, you have to go eat food, for example. To survive, maybe having more resources helps, so you want to go make money, for example. To reproduce, you go date, or whatever, you get married and stuff like that, right? So we are programmed to do that, and humans that are good at that have propagated. And humans that were not actually surviving probably have disappeared, just due to natural selection. Humans that were not interested in reproducing also disappeared, because there are fewer of them, you could say; their lines just stopped carrying on, basically. And so it almost feels like humans have sort of naturally self-selected for these two aims. I think the third aim I was thinking about was, does it matter to be happy?
Like, maybe it does. Maybe happier humans survive better; it's hard to say. So I'm not sure. But if you think about that versus AIs, if you will, right now we're not really selecting AIs for reproduction. It's not like we're saying, hey, AI, you should go make 30 other AIs, and those that make the most AIs are the ones that survive. We're not saying that. So it's kind of interesting thinking about where intentionality for humans comes from. I think you can argue that intentionality in the human space comes out of these three things: you want to be happy, you want to survive, you want to reproduce. That's basically your sort of goal in life. Whereas the AI doesn't really have that. But maybe you could program it in. Like, if you prompt inject, for example: hey, AI, go do these three things. And you could even create a simulation, if you will, with all these AIs in a world, for example. And maybe you don't have AGI in the world, which I think is kind of interesting. So that's the kind of stuff I've been thinking about and talking about with some of my friends from a sort of philosophical perspective. But yeah, it's kind of interesting.Swyx [00:54:29]: Yeah, my quick response to that is, we're kind of doing that. Maybe not at the sort of trained final model level, but at least at the datasets level, there's a lot of knowledge being transferred from model to model. And if you want to think about that sort of evolutionary selection pressure, it is happening in there. And, you know, one of the early concerns about Bing Sydney and sort of self-bootstrapping AGI is that, if these models are sentient, it is actually in their incentive to get as much of their data out there into our datasets, so that they can bootstrap themselves into the next version when they get trained. That is a scary, sobering thought that we need to try to be on top of.Alessio [00:55:13]: David, I know we're both fans of Hofstadter's GEB, and I actually saw in one of your posts on the Sequoia blog that you referred to the Anteater. I don't even know if you call them chapters; GEB is just kind of this continuous riff. But basically, individual ants are not intelligent, but the ant colony has signs of intelligence. And I think Hofstadter then uses that to say, hey, neurons are kind of similar, and then computers maybe will be the same. I've always been curious whether we're drawing the wrong conclusion from neural networks, where people say, oh, each weight is like a neuron, and then you tie them together and it should be like a brain. But maybe the neuron is like different models that then get tied together to makeDavid [00:55:57]: the brain.Alessio [00:55:57]: You know, we're kind of looking at the wrong level of abstraction. Yeah, I think there's a lot of interesting philosophical discussions to have. Sean and I recorded a monthly recap podcast yesterday, and we had a similar discussion on, are we using the wrong... What did you say, Sean, on the plane and the bird? I think that was a good analogy.Swyx [00:56:16]: The sour lesson: are we using the wrong analogies? Because we're trying to be inspired by human evolution and human development, and we are trying to apply that analogy strictly to machines.
But in every example in history, machines have always evolved differently than humans. So why should we expect AI to be any different?David [00:56:33]: Yeah, it is interesting, because it does feel like, if you peer under the hood of AGI and insist that AGI work the way a human does, well, that's not really the Turing test, I suppose. The Turing test is: if the output is the same as a human's, then I'm happy. I don't really care about what's going on inside. And so it feels like caring about the inside is a pretty high bar. Like, why do you care? It's kind of like the plane thing: it flies, but it's not a bird. I agree, it does not necessarily fly the same way as a bird. Physically, it does, I suppose. But you see what I mean? It's not the same under the hood. But that's OK, because it flies. That's what I care about. And it does seem to be the same for AGI: it probably doesn't think like us, but it can achieve outcomes that I give it, it can achieve its own outcomes. And if it can do that, I kind of don't care what it is under the hood. It may not need to be human-like at all. It doesn't matter to me. So I agree. Awesome.Alessio [00:57:26]: No, we've kept you long enough. Actually, I have GEB right here on my bookshelf. Sometimes I pick it up and I'm like, man, I can't believe I got through it once.David [00:57:34]: It's quite the piece of work. It's a lot of fun, though. Yeah.Alessio [00:57:38]: I mean, I started studying physics in undergrad, so, you know, it's one of the edgy things that every physicist starts going through. But thank you so much for your time, David. This was a lot of fun. And looking forward to the 2024 State of AI results to see how things change.David [00:57:54]: Yeah, I'll let you know. Thanks, both. Get full access to Latent.Space at www.latent.space/subscribe
-
The Four Wars of the AI Stack (Dec 2023 Audio Recap)
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-01-25 22:37
Note for Latent Space Community members: we have now soft-launched meetups in Singapore, as well as two new virtual paper club/meetups for AI in Action and LLM Paper Club. We’re also running Latent Space: Final Frontiers, the second edition of our annual demo day hackathon, following last year’s.Edit from March 2024: We did a followup on the Four Wars on the AI Breakdown.For the first time, we are doing an audio version of the monthly AI Engineering recap that we publish on Latent Space! This month it’s “The Four Wars of the AI Stack”; you can find the full recap with all the show notes here: https://latent.space/p/dec-2023* [00:00:00] Intro* [00:01:42] The Four Wars of the AI stack: Data quality, GPU rich vs poor, Multimodality, and RAG/Ops war* [00:03:17] Selection process for the four wars and notable mentions* [00:06:58] The end of low background tokens and the impact on data engineering* [00:08:36] The Quality Data Wars (UGC, licensing, synthetic data, and more)* [00:14:51] Synthetic Data* [00:17:49] The GPU Rich/Poors War* [00:18:21] Anyscale benchmark drama* [00:22:00] The math behind Mixtral inference costs* [00:28:48] Transformer alternatives and why they matter* [00:34:40] The Multimodality Wars* [00:38:10] Multiverse vs Metaverse* [00:45:00] The RAG/Ops Wars* [00:50:00] Will frameworks expand up, or will cloud providers expand down?* [00:54:32] Syntax to Semantics* [00:56:41] Outer Loop vs Inner Loop* [00:59:54] Highlight of the month Get full access to Latent.Space at www.latent.space/subscribe
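As a companion to the "math behind Mixtral inference costs" segment above, here is a hedged back-of-envelope sketch in Python. The parameter counts are the commonly cited figures for Mixtral 8x7B (roughly 46.7B total, ~12.9B active per token with 2 of 8 experts routed); they are assumptions for illustration, not numbers quoted from the episode:

```python
# Back-of-envelope for mixture-of-experts inference cost (assumed figures).
TOTAL_PARAMS = 46.7e9   # Mixtral 8x7B total parameters (commonly cited)
ACTIVE_PARAMS = 12.9e9  # parameters touched per token (2 of 8 experts routed)

# A dense decoder forward pass costs roughly 2 FLOPs per parameter per token,
# so an MoE only pays compute for the *active* parameters.
flops_per_token = 2 * ACTIVE_PARAMS
print(f"~{flops_per_token / 1e9:.0f} GFLOPs per generated token")

# Memory is the catch: every expert must stay resident, so weight storage
# scales with *total* parameters (2 bytes each at fp16).
weight_bytes = TOTAL_PARAMS * 2
print(f"~{weight_bytes / 1e9:.0f} GB of weights at fp16")
```

The punchline, and a large part of the pricing debates, is that compute per token looks like a ~13B dense model while memory (and therefore serving hardware) looks like a ~47B one.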
-
How to train your own Large Multimodal Model — with Hugo Laurençon & Leo Tronchon of HuggingFace M4
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-01-19 17:09
Latent Space is heating up! Our paper club ran into >99 person Discord limits, oops. We are also introducing 2 new online meetups: LLM Paper Club Asia for Asia timezone (led by Ivan), and AI in Action: hands-on application of AI (led by KBall). To be notified of all upcoming Latent Space events, subscribe to our new Luma calendar (sign up for individual events, or hit the RSS icon to sync all events to calendar).In the halcyon open research days of 2022 BC (Before-ChatGPT), DeepMind was the first to create a SOTA multimodal model by taking a pre-existing LLM (Chinchilla 70B - now dead?) and pre-existing vision encoder (CLIP) and training a “glue” adapter layer, inspiring a generation of stunningly cheap and effective multimodal models including LLaVA (one of the Best Papers of NeurIPS 2023), BakLLaVA and FireLLaVA. However (for reasons we discuss in today’s conversation), DeepMind’s Flamingo model was never open sourced. Based on the excellent paper, LAION stepped up to create OpenFlamingo, but it never scaled beyond 9B. Simultaneously, the M4 (audio + video + image + text multimodality) research team at HuggingFace announced an independent effort to reproduce Flamingo up to the full 80B scale:The effort started in March, and was released in August 2023.We happened to be in Paris last year, and visited HuggingFace HQ to learn all about HuggingFace’s research efforts, and cover all the ground knowledge LLM people need to become (what Chip Huyen has termed) “LMM” people. In other words:What is IDEFICS?IDEFICS is an Open Access Visual Language Model, available in 9B and 80B model sizes. As an attempt to re-create an open-access version of Flamingo, it seems to track very well on a range of multimodal benchmarks (which we discuss in the pod):You can see the reasoning abilities of the models to take a combination of interleaved images + text in a way that allows users to either describe images, ask questions about the images, or extend/combine the images into different artworks (e.g. poetry).📷 From IDEFICS’s model card and blog postThe above demo screenshots are actually fine-tuned instruct versions of IDEFICS — which are again in 9B and 80B versions.IDEFICS was built by connecting two unimodal models together to provide the multi-modality you see showcased above.* Llama v1 for language (specifically huggyllama/llama-65b) - the best available open model at the time, to be swapped for Mistral in the next version of IDEFICS* A CLIP model for vision (specifically laion/CLIP-ViT-H-14-laion2B-s32B-b79K - after a brief exploration of EVA-CLIP, which we discuss on the pod)OBELICS: a new type of Multimodal DatasetIDEFICS’ training data used the usual suspect datasets, but to get on par with Flamingo they needed to create a new dataset.Enter OBELICS: “An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents”:* 115B text tokens* 141M English documents* 353M imagesThis data was carefully curated and filtered by going through Common Crawl dumps between Feb 2020 and Feb 2023. We discuss the 2 months of mind-numbing, unglamorous work creating this pipeline:There are a lot of mentions of “multimodal web documents”, which deserve some explanation.
We’ll show you instead of tell you:You can see from this graph that OBELICS ends up outperforming the other image-text pairs dataset (LAION in this case) when stacked head-to-head.You can view a subset of OBELICS and perform visualizations on them here:2024 Update: WebSight et alMost of this interview was recorded on Halloween 2023 at HuggingFace’s headquarters in Paris, in anticipation of an IDEFICS v2 release. However, several roadblocks emerged, including a notable scandal around CSAM in LAION-5B, which affected all models using that dataset. The M4 team have adopted a strategy of shipping smaller advancements in 2024, and the first ship of the year is WebSight, a dataset of 823,000 HTML/CSS codes representing synthetically generated English websites, each accompanied by a corresponding screenshot (rendered with Playwright). This is intended for screenshot-to-code workflows like Vercel’s V0 or TLDraw, and will be part of the dataset for IDEFICS-2.As noted in our Best Papers recap, synthetic data is emerging as one of the top themes of 2024, and the IDEFICS/OBELICS team have wasted no time enabling themselves with it.Timestamps* [0:00:00] Intro* [0:00:00] Hugo, Leo’s path into multimodality* [0:09:16] From CLIP to Flamingo* [0:12:54] Benchmarks and Evals* [0:16:54] OBELICS dataset* [0:34:47] Together Redpajama v2* [0:37:12] GPT4 Vision* [0:38:44] IDEFICS model* [0:40:57] Query-Key Layernorm for training* [0:46:40] Choosing smaller vision encoders - EVA-CLIP vs SigLIP* [0:49:02] IDEFICS v2* [0:52:39] Multimodal Hallucination* [0:59:12] Why Open Source Multimodality* [1:05:29] Naming: M4, OBELICS, IDEFICS* [1:08:56] 2024 Update from LeoShow Notes* Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model* IDEFICS Knowledge sharing memo: technical lessons and mistakes* Victor Sanh memo* OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents* Papers cited:* BLOOM: A 176B-Parameter Open-Access Multilingual Language Model* Barlow Twins: Self-Supervised Learning via Redundancy Reduction* CLIP paper: Learning Transferable Visual Models From Natural Language Supervision* Vision Transformers paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale* Flamingo paper: a Visual Language Model for Few-Shot Learning* April 2022 preprint from DeepMind, blogpost* VQAV2 paper: Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering* OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge (https://okvqa.allenai.org/)* MMBench: Is Your Multi-modal Model an All-around Player?* Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond* SigLIP paper: Sigmoid Loss for Language Image Pre-Training* Nougat: Neural Optical Understanding for Academic Documents* MMC4 (Multimodal C4): An Open, Billion-scale Corpus of Images Interleaved With Text* Dall-E 3 paper: Improving Image Generation with Better Captions* GPT-4V(ision) system card from OpenAI* Query-Key Layernorm trick: paper (Scaling Vision Transformers to 22 Billion Parameters), tweet* EVA-CLIP: Improved Training Techniques for CLIP at Scale * “We initially explored using a significantly bigger vision encoder (the biggest in open-access at that time) with EVA-CLIP. However, we ran into training instabilities very quickly.
To lower the risks associated to the change of vision encoder, we decided to continue with laion/CLIP-ViT-H-14-laion2B-s32B-b79K which we have been using until that point. We will leave that swap for future iterations and will also consider using higher resolution images.”* Datasets* Together’s RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models* LAION COCO: 600M synthetic captions from Laion2B-en* Chip Huyen’s writeup on LMMs* Joseph Nelson of Roboflow on Latent Space* HuggingFace M4* HuggingFace timm: library containing SOTA computer vision models, layers, utilities, optimizers, schedulers, data-loaders, augmentations, and training/evaluation scripts. It comes packaged with >700 pretrained models, and is designed to be flexible and easy to use.* Logan Kilpatrick declaring 2024 the year of Multimodal AI at AI Engineer Summit Get full access to Latent.Space at www.latent.space/subscribe
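To make the interleaved image-text interface described above concrete, here is a minimal sketch of running the 9B instruct checkpoint with the transformers library. The class and checkpoint names follow HuggingFace's published IDEFICS examples, but the image URL is a placeholder and the exact processor arguments should be checked against the model card:

```python
import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text

# Instruct-tuned 9B checkpoint from the M4 team (name per the model card).
checkpoint = "HuggingFaceM4/idefics-9b-instruct"
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

# Prompts are lists interleaving text with images (URLs or PIL images),
# mirroring the multimodal web documents the model was trained on.
prompts = [
    [
        "User: What is unusual about this image?",
        "https://example.com/some-image.jpg",  # placeholder image URL
        "<end_of_utterance>",
        "\nAssistant:",
    ]
]
inputs = processor(prompts, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

The same prompt-as-a-list pattern is what lets users mix image description, visual question answering, and image-grounded writing in one interface.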
-
RLHF 201 - with Nathan Lambert of AI2 and Interconnects
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-01-11 19:22
In 2023 we did a few Fundamentals episodes covering Benchmarks 101, Datasets 101, FlashAttention, and Transformers Math, and it turns out those were some of your evergreen favorites! So we are experimenting with more educational/survey content in the mix alongside our regular founder and event coverage. Pls request more!We have a new calendar for events; join to be notified of upcoming things in 2024!Today we visit the shoggoth mask factory: how do transformer models go from trawling a deeply learned latent space for next-token prediction to a helpful, honest, harmless chat assistant? Our guest “lecturer” today is Nathan Lambert; you might know him from his prolific online writing on Interconnects and Twitter, or from his previous work leading RLHF at HuggingFace and now at the Allen Institute for AI (AI2), which recently released the open source GPT3.5-class Tulu 2 model, which was trained with DPO. He’s widely considered one of the most knowledgeable people on RLHF and RLAIF. He recently gave an “RLHF 201” lecture at Stanford, so we invited him on the show to re-record it for everyone to enjoy! You can find the full slides here, which you can use as a reference throughout this episode. Full video with synced slidesFor audio-only listeners, this episode comes with a slide presentation alongside our discussion. You can find it on our YouTube (like, subscribe, tell a friend, et al).Theoretical foundations of RLHFThe foundation and assumptions that go into RLHF go back all the way to Aristotle (and you can find guidance for further research in the slide below) but there are two key concepts that will be helpful in thinking through this topic and LLMs in general:* Von Neumann–Morgenstern utility theorem: you can dive into the math here, but the TLDR is that when humans make decisions there’s usually a “maximum utility” function that measures what the best decision would be; the fact that this function exists makes it possible for RLHF to model human preferences and decision making.* Bradley-Terry model: given two items A and B from a population, you can model the probability that A will be preferred to B (or vice-versa). In our world, A and B are usually two outputs from an LLM (or at the lowest level, the next token). It turns out that from this minimal set of assumptions, you can build up the mathematical foundations supporting the modern RLHF paradigm (a minimal code sketch follows below)!The RLHF loopOne important point Nathan makes is that "for many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior". For example, it might be difficult for you to write a poem, but it's really easy to say if you like or dislike a poem someone else wrote. Going back to the Bradley-Terry Model we mentioned, the core idea behind RLHF is that when given two outputs from a model, you will be able to say which of the two you prefer, and we'll then re-encode that preference into the model.An important point that Nathan mentions is that when you use these preferences to change model behavior "it doesn't mean that the model believes these things. It's just trained to prioritize these things". When you have a preference for a model to not return instructions on how to write a computer virus, for example, you're not erasing the weights that have that knowledge; you're simply making it hard for that information to surface by prioritizing answers that don't return it.
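To make the Bradley-Terry piece above concrete, here is a minimal PyTorch sketch of how the pairwise-preference assumption becomes the standard reward-model training loss. This is an illustration of the math, not code from Nathan's lecture:

```python
import torch
import torch.nn.functional as F

def bradley_terry_prob(r_a: torch.Tensor, r_b: torch.Tensor) -> torch.Tensor:
    """P(A preferred over B) when each item has a scalar 'strength' (reward).

    exp(r_a) / (exp(r_a) + exp(r_b)) simplifies to sigmoid(r_a - r_b).
    """
    return torch.sigmoid(r_a - r_b)

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss used to train RLHF reward models: maximize the
    log-likelihood of the human-chosen completion under Bradley-Terry."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example: scalar scores for a batch of (chosen, rejected) completion pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.7, 0.5, -1.0])
print(bradley_terry_prob(r_chosen, r_rejected))  # preference probabilities
print(reward_model_loss(r_chosen, r_rejected))   # scalar training loss
```

In practice r_chosen and r_rejected come from a reward model head on top of a language model; everything else in the RLHF stack is built around producing and optimizing these scores.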
We'll talk more about how information is stored in models and how fine-tuning affects it in our future Fine Tuning 101 episode.At a high level, the loop looks something like this:For many RLHF use cases today, we can assume the model we're training is already instruction-tuned for chat or whatever behavior the model is looking to achieve. In the "Reward Model & Other Infrastructure" portion we have multiple pieces:Reward + Preference ModelThe reward model is trying to signal to the model how much it should change its behavior based on the human preference, subject to a KL constraint. The preference model itself scores the pairwise preferences from the same prompt (this worked better than scalar rewards).One way to think about it is that the reward model tells the model how big of a change this new preference should make in the behavior in absolute terms, while the preference model calculates how big of a difference there is between the two outputs in relative terms. A lot of this derives from John Schulman’s work on PPO:We recommend watching him talk about it in the video above, and also Nathan’s pseudocode distillation of the process:Feedback InterfacesUnlike the "thumbs up/down" buttons in ChatGPT, data annotation from labelers is much more thorough and has many axes of judgment. At a simple level, the LLM generates two outputs, A and B, for a given human conversation. It then asks the labeler to use a Likert scale to score which one they preferred, and by how much:Through the labeling process, there are many other ways to judge a generation:We then use all of this data to train a model from the preference pairs we have. We start from the base instruction-tuned model, and then run training in which the loss of our gradient descent is based on the difference between the good and the bad completion.Constitutional AI (RLAIF, model-as-judge)As these models have gotten more sophisticated, people started asking the question of whether or not humans are actually better judges of harmfulness, bias, etc, especially at the current price of data labeling. Anthropic's work on the "Constitutional AI" paper is using models to judge models. This is part of a broader "RLAIF" space: Reinforcement Learning from AI Feedback.By using a "constitution" that the model has to follow, you are able to generate fine-tuning data for a new model that will be RLHF'd on these constitutional principles. The RLHF model will then be able to judge outputs of models to make sure that they follow its principles:Emerging ResearchRLHF is still a nascent field, and there are a lot of different research directions teams are taking; some of the newest and most promising / hyped ones:* Rejection sampling / Best of N Sampling: the core idea here is that rather than just scoring pairwise generations, you are generating a lot more outputs (= more inference cost), score them all with your reward model and then pick the top N results. LLaMA2 used this approach, amongst many others.* Process reward models: in Chain of Thought generation, scoring each step in the chain and treating it like its own state rather than just scoring the full output. This is most effective in fields like math that inherently require step-by-step reasoning.* Direct Preference Optimization (DPO): We covered DPO in our NeurIPS Best Papers recap, and Nathan has a whole blog post on this; DPO isn’t technically RLHF as it doesn’t have the RL part, but it’s the “GPU Poor” version of it. Mistral-Instruct was a DPO model, as are Intel’s Neural Chat and StableLM Zephyr (a minimal sketch of the DPO objective follows below).
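And since DPO comes up here and again later in the conversation, here is an equally minimal sketch of the DPO objective from the Rafailov et al. paper, with the same caveat that this is an illustration rather than any lab's production code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each tensor holds the summed log-probability of a full completion under
    the trainable policy or the frozen reference (SFT) model. beta plays the
    role of the KL penalty strength in classic RLHF.
    """
    # The policy's log-prob ratio against the reference acts as an implicit reward.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Same Bradley-Terry form as the reward model loss, with no RL loop required.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The design point worth noticing is that DPO collapses the reward model and the RL optimization into one supervised loss over preference pairs, which is exactly why it earned the "GPU Poor" nickname.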
Expect to see a lot more variants in 2024 given how “easy” this was.* Superalignment: OpenAI launched research on weak-to-strong generalization which we briefly discuss at the 1hr mark.Note: Nathan also followed up this post with RLHF resources from his and peers’ work:Show Notes* Full RLHF Slides* Interconnects* Retort (podcast)* von Neumann-Morgenstern utility theorem* Bradley-Terry model (pairwise preferences model)* Constitutional AI* TAMER (2008 paper by Bradley Knox and Peter Stone)* Paul Christiano et al. RLHF paper* InstructGPT* Eureka by Jim Fan* ByteDance / OpenAI lawsuit* AlpacaEval* MTBench* TruthfulQA (evaluation tool)* Self-Instruct Paper* Open Assistant* Louis Castricato* Nazneen Rajani* Tulu (DPO model from the Allen Institute)Timestamps* [00:00:00] Introductions and background on the lecture origins* [00:05:17] History of RL and its applications* [00:10:09] Intellectual history of RLHF* [00:13:47] RLHF for decision-making and pre-deep RL vs deep RL* [00:20:19] Initial papers and intuitions around RLHF* [00:27:57] The three phases of RLHF* [00:31:09] Overfitting issues* [00:34:47] How preferences get defined* [00:40:35] Ballpark on LLaMA2 costs* [00:42:50] Synthetic data for training* [00:47:25] Technical deep dive into the RLHF process* [00:54:34] Rejection sampling / best of N sampling* [00:57:49] Constitutional AI* [01:04:13] DPO* [01:08:54] What's the Allen Institute for AI?* [01:13:43] Benchmarks and models comparisonsTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:15]: Hey, and today we have Dr. Nathan Lambert in the house. Welcome.Nathan [00:00:18]: Thanks guys.Swyx [00:00:19]: You didn't have to come too far. You got your PhD at Berkeley, and it seems like you've lived there most of the time in recent years. You worked on robotics and model-based reinforcement learning during your PhD, and you also interned at FAIR and DeepMind. You bootstrapped the RLHF team at Hugging Face, and you recently joined the Allen Institute as a research scientist. So that's your quick bio. What should people know about you that maybe is not super obvious from your LinkedIn?Nathan [00:00:43]: I stay sane through various insane sport and ultra-endurance activities that I do.Swyx [00:00:50]: What's an ultra-endurance sport activity?Nathan [00:00:52]: Long-distance trail running or gravel biking. Try to unplug sometimes, although it's harder these days. Yeah.Swyx [00:00:59]: Well, you know, the Bay Area is just really good for that stuff, right?Nathan [00:01:02]: Oh, yeah. You can't beat it. I have a trailhead like 1.2 miles from my house, which is pretty unmatchable in any other urban area.Swyx [00:01:11]: Pretty excellent. You also have an incredible blog, Interconnects, which I'm a fan of. And I also just recently discovered that you have a new podcast, Retort.Nathan [00:01:20]: Yeah, we do. I've been writing for a while, and I feel like I've finally started to write things that are understandable and fun. After a few years lost in the wilderness, if you ask some of my friends that I made read the earlier blogs, they're like, oh, this is yikes, but it's coming along.
And the podcast is with my friend Tom, and we just kind of riff on what's actually happening in AI. We don't really do news recaps, but just what it all means, and have a more critical perspective on the things that really are kind of funny, but still very serious, happening in the world of machine learning.Swyx [00:01:52]: Yeah. Awesome. So let's talk about your work. What would you highlight as your greatest hits so far on Interconnects, at least?Nathan [00:01:59]: So the ones that are most popular are timely and/or opinion pieces. The first real breakout piece was in April, when I just wrote down the thing that everyone in AI was feeling, which is we're all feeling stressed, that we're going to get scooped, and that we're overworked; that was "behind the curtain: what it feels like to work in AI". And then a similar one, which we might touch on later in this, was about my recent job search, which wasn't the first time I wrote a job search post. People always love that stuff. It's so open. I mean, it's easy for me to do in a way that's very on-brand, and it's very helpful. I understand that until you've done it, it's hard to share this information. And then the other popular ones are various model training techniques or fine tuning. There's an early one on RLHF; this stuff is all just me figuring it out in my brain. So I wrote an article on how RLHF actually works, which is just the intuitions that I had put together in the summer about RLHF, and that did pretty well. And then I opportunistically wrote about Q*, which I hate that you have to do, but it is pretty funny. From a literature perspective, I'm like, OpenAI publishes work that is very related to mathematical reasoning. So it's like, oh, you just poke a little around what they've already published, and it seems pretty reasonable. But we don't know. They probably just got like a moderate bump on one of their benchmarks, and then everyone lost their minds. It doesn't really matter.Swyx [00:03:15]: You're like, this is why Sam Altman was fired. I don't know. Anyway, we're here to talk about RLHF 101. You did a presentation, and I think you expressed some desire to rerecord it. And that's why I reached out on Twitter saying, like, why not rerecord it with us, and then we can ask questions and talk about it. Yeah, sounds good.Nathan [00:03:30]: I try to do it every six or 12 months, that's my estimated cadence, just to refine the ways that I say things. And people will see that we don't know that much more, but we have a bit better way of saying what we don't know.Swyx [00:03:43]: Awesome. We can dive right in. I don't know if there are any other topics that we want to lay out as groundwork.Alessio [00:03:48]: No, you have some awesome slides. So for people listening on podcast only, we're going to have the slides in our show notes, and then we're going to have a YouTube version where we run through everything together.Nathan [00:03:59]: Sounds good. Yeah. I think to start, we'll skip a lot of the what-is-a-language-model stuff, everyone knows that at this point. I think the quote from the Llama 2 paper is a great kind of tidbit on RLHF becoming like a real deal. There was some uncertainty earlier in the year about whether or not RLHF was really going to be important. I think it was not that surprising that it is. I mean, with recent models still using it, the signs were there, but the Llama 2 paper essentially reads like a bunch of NLP researchers that were skeptical and surprised.
So the quote from the paper was, "Meanwhile, reinforcement learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness." You don't really know exactly what costs and time Meta is looking at, because they have a huge team and a pretty good amount of money here to release these Llama models. This is just the kind of thing that we're seeing now. I think any major company that wasn't doing RLHF is now realizing they have to have a team around this. At the same time, we don't have a lot of that in the open and research communities at the same scale. I think seeing that converge would be great, but it's still very early days. And the other thing on the slide is some of Anthropic's work, but everyone knows Anthropic is kind of the masters of this, and they have some of their own techniques that we're going to talk about later on, but that's kind of where we start.Alessio [00:05:17]: Can we do just a one-second RL version? So you come from a robotics background, where RL used to be, or maybe still is, state-of-the-art. And then now you're seeing a lot of LLM plus RL, so you have Jim Fan's Eureka, you have Imbue, which we had on the podcast when they started with RL. Now they're doing RL plus LLMs. Yeah. Any thoughts there on how we got here? Maybe how the pendulum will keep swinging?Nathan [00:05:46]: I really think RL is about a framing of viewing the world through trial and error learning and feedback, and really just one that's focused on thinking about decision-making and inputs in the world and how inputs have reactions. And in that, a lot of people come from a lot of different backgrounds, whether it's physics, electrical engineering, mechanical engineering. There are obviously computer scientists, but compared to other fields of CS, I do think it's a much more diverse background of people. My background was in electrical engineering and doing robotics and things like that. It really just changes the worldview. I think that reinforcement learning as it was back then, so to say, is really different. You're looking at these toy problems and the numbers are totally different, and everyone went kind of zero to one at scaling these things up. People like Jim Fan and others were there; you saw this transition in the decision transformer papers, when people were trying to use transformers to do decision-making for things like offline RL, and I think that was kind of the early days. But then once language models were so proven, it's like everyone is using this tool for their research. I think in the long run, it will still settle out, or RL will still be a field that people work on, just because of these kind of fundamental things that I talked about. It's viewing the whole problem formulation differently than predicting text, and so there needs to be that separation. And the view of RL in language models is pretty contrived already, so it's not like we're doing real RL. I think the last slide that I have here is a way to make RLHF more like what people would think of with RL, so actually running things over time, but it's a weird lineage of tools that happened to get us to where we are, so that's why the name takes up so much space, but it could have gone a lot of different ways. Cool.Alessio [00:07:29]: We made it one slide before going on a tangent.Nathan [00:07:31]: Yeah, I mean, it's kind of related.
This is a...Swyx [00:07:35]: Yeah, so we have a history of RL.Nathan [00:07:37]: Yeah, so to give the context, this paper really started because I have this more diverse background than some computer scientists, and was trying to understand what the difference between a cost function, a reward function, and a preference function would be, without going into all of the details. Costs are normally things that control theorists would work with in these kind of closed domains, and then reinforcement learning has always worked with rewards, which are central to the formulation that we'll see, and then the idea was like, okay, we now are at preferences, and each step along the way there's kind of different assumptions that you're making. We'll get into these, and those assumptions are built on other fields of work. So that's what this slide is going to say: RLHF, while directly building on tools from RL and language models, is really implicitly impacted by and built on theories and philosophies spanning tons of human history. I think we cite Aristotle in this paper, which is fun. It's like going pre-BC, it's like 2,300 years old or something like that. So that's the reason to do this, I think. We kind of list some things in the paper summarizing what different presumptions of RLHF could be. I think going through these is actually kind of funny. It's fun to talk about these, because they're kind of grab bags of things that you'll see return throughout this podcast. The core presumption of RLHF, in order to be a believer in this, is that RL actually works: if you have a reward function, you can optimize it in some way and get a different performance out of it, and you can do this at scale, and you can do this in really complex environments. I don't know how to do that in all domains; I don't know how exactly to make ChatGPT. So that kind of overshadows everything. And then you go from something kind of obvious like that to reading the von Neumann-Morgenstern utility theorem, which is essentially an economic theory that says you can weight different probabilities of different people, which is a theoretical piece of work that is the foundation of utilitarianism, and trying to quantify preferences is crucial to doing any sort of RLHF. And if you look into any of these things, there's way more you could go into if you're interested. So this is kind of grabbing a few random things, and then kind of similar to that is the Bradley-Terry model, which is the fancy name for the pairwise preferences that everyone is doing. And then there are all the things that Anthropic and OpenAI figured out that you can do, which is that you can aggregate preferences from a bunch of different people and different sources. And then when you actually do RLHF, you extract things from that data, and then you train a model that works somehow. And we don't know, there are a lot of complex links there, but if you want to be a believer in doing this at scale, these are the sorts of things that you have to accept as preconditions for doing RLHF. Yeah.Swyx [00:10:09]: You have a nice chart of the sort of intellectual history of RLHF that we'll send people to refer to, either in your paper or in the YouTube video for this podcast. But I like the other slide that you have on the presumptions that you need to have for RLHF to work. You already mentioned some of those. Which one's underappreciated?
Like, this is the first time I've come across the VNM Utility Theorem.Nathan [00:10:29]: Yeah, I know. This is what you get from working with people like my co-host on the podcast, the Retort, who is a sociologist by training. So he knows all these things, and who the philosophers are that founded these different things, like utilitarianism. But there's a lot that goes into this. There are even economic theories where there's debate over whether or not preferences exist at all, and different types of math you can use for whether or not you actually can model preferences at all. So it's pretty obvious that RLHF is built on the math that thinks you can actually model any human preference. But this is the sort of thing that's been debated for a long time. All the work that's here is stuff people hear about in their AI classes: Jeremy Bentham, hedonic calculus, all these things. These are the side of work where people assume that preferences can be measured. And this is where I kind of go on a rant and say that in RLHF, calling things a preference model is a little annoying, because there's no inductive bias of what a preference is. If you were to learn a robotic system and you learned a dynamics model, hopefully that actually mirrors the world in some way in its dynamics. But with a preference model, it's like, oh my God, I don't know what ChatGPT encodes as any sort of preference, or what I would want it to be in a fair way. Anthropic has done more work on trying to write these things down. But even if you look at Claude's constitution, that doesn't mean the model believes these things. It's just trained to prioritize these things. And that's kind of what the later points get at: what RLHF is doing, and whether it's actually a repeatable process in the data and in the training, is just unknown. And we have a long way to go before we understand what this is, and the link between preference data and any notion of writing down a specific value.Alessio [00:12:05]: Does the disconnect between sociology work and computer science work already exist, or is it a recent cross-contamination? Because when we had Tri Dao on the podcast, he said FlashAttention came to be because at Hazy Research they have so much overlap between systems engineers and deep learning engineers. Is it the same in this field?Nathan [00:12:26]: So I've gone to a couple of workshops with the populations of people who you'd want to include in this, like RLHF. I think the reason why it's not really talked about is just because the RLHF techniques that people use were built in labs like OpenAI and DeepMind, where there are some of these people. These places do a pretty good job of trying to get these people in the door when you compare them to normal startups. But they're not bringing in academics from economics, like social choice theory. There's just too much. The criticism of this paper that this is based on is like, oh, you're missing these things in RL, or at least this decade of RL, and it's like, it would literally be bigger than the Sutton and Barto book if you were to include everyone. So it's really hard to include everyone in a principled manner when you're designing this. It's just a good way to understand and improve the communication of what RLHF is, and what a good reward model for society would be.
It really probably comes down to what an individual wants, and it'll probably motivate models to move more in that direction, and just be a little bit better about the communication, which is a recurring theme in kind of my work: I just get frustrated when people say things that don't really make sense, especially when it's going to manipulate individuals' values or manipulate the general view of AI or anything like this. So that's kind of why RLHF is so interesting. It's very vague in what it's actually doing, while the problem specification is very general.Swyx [00:13:42]: Shall we go to the, I guess, the diagram here on the reinforcement learning basics? Yeah.Nathan [00:13:47]: So reinforcement learning, I kind of mentioned this, it's a trial and error type of system. The diagram in the slides is really this classic thing where you have an agent interacting with an environment. The agent has some input to the environment, which is called the action. The environment returns a state and a reward, and that repeats over time, and the agent learns based on these states and these rewards that it's seeing, and it should learn a policy that makes the rewards go up. That seems pretty simple, but then if you try to mentally map what this looks like in language, the language models don't make this easy. I think with the language model, it's very hard to define what an environment is. So if the language model is the policy and it's generating, it's like the environment should be a human, but setting up the infrastructure to take tens of thousands of prompts and generate them and then show them to a human and collect the human responses and then shove that into your training architecture is very far away from working. So we don't really have an environment. We just have a reward model that returns a reward, and the state doesn't really exist when you look at it like an RL problem. What happens is the state is a prompt, and then you do a completion, and then you throw it away, and you grab a new prompt. As an RL researcher, you would think of this as taking a state, getting some completion from it, looking at what that is, and kind of iterating on it, and all of that isn't here, which is why you'll hear RLHF referred to as a bandits problem, which is kind of like you choose one action and then you watch the dynamics play out. There are many more debates that you can have on this; if you get the right RL people in the room, they'll debate whether this is even RL when you zoom into what RLHF is doing.Alessio [00:15:22]: Does this change as you think about chain of thought reasoning and things like that? Like, does the state become part of the chain that you're going through?Nathan [00:15:29]: There's work that I've mentioned on one slide called process reward models that essentially rewards each step in the chain of thought reasoning. It doesn't really give you the interaction part, but it does make it a little bit more fine-grained, where you can think about it as at least having many states from your initial state. That formulation I don't think people have fully settled on. I think there's a bunch of great work out there; even OpenAI is releasing a lot of this, and Let's Verify Step by Step is their pretty great paper on the matter.
Swyx [00:16:13]: RLHF for decision making. You have a slide here that compares pre-deep RL versus deep RL.Nathan [00:16:19]: This is getting into the history of things, showing that the work people are using now really came from well outside of NLP, and it came before deep learning was big. Next up is this paper, TAMER, which is from 2008, with some names that are still really relevant in human-centric RL: Bradley Knox and Peter Stone. If you have an agent take an action, you would just have a human give a score from zero to one as a reward, rather than having a reward function. And then with that classifier, you can do something with a policy that learns to take actions to maximize that reward. It's a pretty simple setup, and it works in simple domains. The reason why this is interesting is you compare it to the paper that everyone knows, which is this Paul Christiano et al. Deep Reinforcement Learning from Human Preferences paper, where they showed that learning from human preferences can solve the basic RL tasks of the time — various control problems and simulations — and this human preferences approach had higher rewards in some environments than if you just threw RL at an environment that returned a reward. The preferences thing was: you took two trajectories, in this case complete trajectories of the agent, and the human labeled which one was better. You can see how this comes to be the pairwise preferences that are used today, which we'll talk about. And there's also a really interesting nugget: the trajectory that the humans were labeling over has a lot more information than the RL algorithm would see if you just had one state, which is why people think the performance in this paper was so strong. But I still think it's surprising that there isn't more RL work of this style happening now. This paper is from 2017, so it's like six years later, and I haven't seen things that are exactly similar. But it's a great paper to understand where the stuff that's happening now came from.Swyx [00:17:58]: Just on the Christiano paper, you mentioned the performance being strong. What results should I have in mind when I think about that paper?Nathan [00:18:04]: It's mostly like, if you think about an RL learning curve — on the X axis you have environment interactions, on the Y axis you have performance — you can think about different ablation studies between algorithms. I think they used A2C, which I don't even remember what that stands for, as their baseline. But if you do the human preference version on a bunch of environments, with the human preference labels, the agent was able to learn faster than if it just learned from the signal from the environment, which means it's happening because the reward model has more information than the agent would get from the environment. But the fact that it can do better — that's pretty surprising to me, because RL algorithms are pretty sensitive. So I was like, okay.Swyx [00:18:41]: It's just — one thing I do want to establish as a baseline for our listeners: we are updating all the weights.
In some sense, the next-token prediction task of training a language model is a form of reinforcement learning, except that it's not from human feedback; it's just self-supervised learning from a general corpus. There's one distinction which I love, which is that you can actually give negative feedback, whereas in a general pre-training situation, you cannot. And maybe the order of magnitude of feedback — like the Likert scale that you're going to talk about — actually just gives more signal than a typical training process would in a language model setting. Yeah.Nathan [00:19:15]: I don't think I'm the right person to comment exactly, but you can make analogies that reinforcement learning is self-supervised learning as well. There are a lot of things that point to that. I don't know whether or not it's a richer signal; I think that could be seen in the results, and it's a good thing for people to look into more. Since reinforcement learning uses so much less compute, it is a richer signal in terms of its impact. Because if they could do what RLHF is doing at pre-training, they would, but they don't know how to have that effect in a stable manner. Otherwise everyone would do it.Swyx [00:19:45]: On a practical basis, as someone fine-tuning models, I have often wished for negative fine-tuning, which pretty much doesn't exist in OpenAI land, and it's not the default setup in open-source land.Nathan [00:19:57]: How does this work in diffusion models and stuff? Because you can give negative prompts to something like Stable Diffusion or whatever. It's for guidance.Swyx [00:20:04]: That's for CLIP guidance.Nathan [00:20:05]: Is that just from how they prompt it then? I'm just wondering if we could do something similar. It's another tangent.Swyx [00:20:10]: I do want to sort of spell that out for people in case they haven't made the connection between RLHF and the rest of the training process. They might have some familiarity with it.Nathan [00:20:19]: Yeah. The upcoming slides can really dig into this. In this 2018 paper — a position paper from a bunch of the same authors from the Christiano paper and from the OpenAI work that everyone knows — they write about what a preference reward model could do to solve alignment for agents. That's based on two assumptions. The first assumption is that we can learn user intentions to a sufficiently high accuracy. That doesn't land with me, because I don't know what that means. But the second one is pretty telling in the context of RLHF, which is: for many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior. And this is the whole thing. We can compare two poems that the model generates, and it can be viewed as liking a positive example, or it could be viewed as really disliking a negative example. And that's what I think a lot of people are doing in the harm space: a harmful response to a language model — whether or not you agree with the company's definition of harms — is a really bad negative example, and they downweight it by preferring something more benign in the RLHF process, among other ways of dealing with safety. So that's a good way of saying that this kind of comparison, a positive or negative example, is core to all of the RLHF work that has continued.Swyx [00:21:29]: People often say, I don't know what I want, but I'll know when I see it.
This is that, expressed in reinforcement learning tools.Nathan [00:21:35]: Yeah, it is. Yeah, it is. That's what everyone's doing in the preference modeling stage that we'll get to. Yeah. And you can see there are more papers. This is really just to have all the links for people that go deeper. There's a Ziegler et al. paper in 2019, which shows that you can do this RLHF process on language models. This familiar diagram starts to emerge in 2019, and it's just to show that this goes really far back. I think we can breeze through some of these. And then 2020 is the first OpenAI experiment that I think caught people's eyes, which is this Learning to Summarize experiment. It has this three-step process that we'll get into more when I go into the main concepts. But this is the first time you see this diagram that they reuse with InstructGPT and reuse with ChatGPT. And the types of examples that they would have — I don't think I need to read these exactly, but one that I have read a whole bunch of times is: they took these prompts from Reddit that were like, explain like I'm five, or get career advice, and people really pour their heart and soul into these. So these are multi-paragraph pieces of writing. And then they essentially do comparisons against a vanilla language model — I think it was either GPT-2 or GPT-3, I don't always get the exact years.Swyx [00:22:42]: 3 was early 2020. So that's about right.Nathan [00:22:45]: Yeah. So this was probably done with GPT-2. It doesn't really matter. But the language model does normal things when you do few-shot, which is that it repeats itself and doesn't have nice text. And what they did is — this was the first time where the language model would generate pretty nice text as an output. It was restricted to the summarization domain. But I guess this is where I wish I had been paying attention more, because I would see the paper, but I didn't know to read the language model outputs and understand this qualitative sense of the models very well then. Because if you look at the plots in the papers, these Learning to Summarize and InstructGPT papers have incredibly pretty plots — nicely separated lines with error bars, showing supervised fine-tuning works, the RL step works. But if you were early to see how different the language written by these models was, I think you could have been early to things like ChatGPT and to knowing RLHF would matter. And now the good people know to chat with language models, but not even everyone does this. People are still looking at numbers. And I think OpenAI probably figured out how important that could be when they were doing this, and then they had years to chisel away at it, and that's why they're doing so well now. Yeah.Swyx [00:23:56]: I mean, arguably, you know, it's well known that ChatGPT was kind of an accident — that they didn't think it would be that big of a deal. Yeah.Nathan [00:24:02]: So maybe they didn't. Maybe they didn't, but they were getting the proxy that they needed.Swyx [00:24:06]: I've heard off the record from other labs that it was in the air. If OpenAI didn't do it, someone else would have done it. So you've mentioned a couple of other papers that are very seminal to this period.
And I love how you say way back when, referring to 2019.Nathan [00:24:19]: It feels like it in my life.Swyx [00:24:21]: So how much should people understand the relationship between RLHF, instruction tuning, PPO, KL divergence, anything like that? How would you construct the level of knowledge that people should dive into? What should people know at the high level? And then if people want to dive in deeper, where do they go? Is instruction tuning important here, or is that part of the overall process towards modern RLHF?Nathan [00:24:44]: I think for most people, instruction tuning is probably still more important in their day-to-day life. Instruction tuning works very well. You can write samples by hand that make sense. You can get the model to learn from them. You can do this with very low compute. It's easy to do almost in no-code solutions at this point. And the loss function is really straightforward. And then if you're interested in RLHF, you can learn about it from a different perspective, which is how the instruction tuning distribution makes it easier for your RLHF model to learn. There are a lot of details depending on your preference data — if it's close to your instruction model or not, if that matters. But that's really at the RLHF stage. So I think it's nice to segment and understand what your level of investment and goals are. Instruction tuning can still do most of what you want to do. And if you want to think about RLHF, at least before DPO had really taken off at all, it would be like: do you want to have a team of at least five people if you're really thinking about doing RLHF? I think DPO makes it a little bit easier, but that's still really limited to kind of one dataset that everyone's using at this point. Everyone's using this UltraFeedback dataset, and it boosts AlpacaEval, MT-Bench, TruthfulQA, and the qualitative feel of the model a bit. We don't really know why. It might just be the dataset combined with the method, but you've got to be ready for a bumpy ride if you want to try to do RLHF. I don't really recommend most startups do it unless it's going to provide them a clear competitive advantage in their niche, because you're not going to make your model ChatGPT-level, better than OpenAI, or anything like that. You've got to accept that there's some exploration there, and you might find a vein of benefit in your specific domain, but I'm still like: be careful going into the RLHF can of worms. You probably don't need to.Swyx [00:26:27]: Okay. So there's a bit of a time skip in what you mentioned. DPO is like a couple months old, so we'll leave that towards the end. I think the main result most people talk about at this stage — we're talking about September 2020 and then going into, I guess, maybe last year — was Vicuña, as one of the more interesting applications of instruction tuning that pushed Llama 1 from, let's say, a GPT-3-ish model to a GPT-3.5 model in pure open source, with not a lot of resources. I mean, they said something like, you know, they used under $100 to make this.Nathan [00:26:58]: Yeah. Instruction tuning can really go a long way. I think the claims of ChatGPT level are long overblown in most of the things in open source. That's not to say anything against Vicuña — it was a huge step, and it's showing that instruction tuning with the right data will completely change what it feels like to talk with your model.
Yeah.Swyx [00:27:19]: From text completion to actually chatting back and forth. Yeah. Yeah.Nathan [00:27:23]: Instruction tuning can be multi-turn. Just having a little bit of data that's a couple of turns can go a really long way. That was the story of the whole first part of the year: people would be surprised by how far you can take instruction tuning on a small model. I think the thing people see now is that small models don't really handle nuance as well, and they can be more repetitive even if they have really good instruction tuning. But if you take that 7 to 70 billion parameter jump, the instruction tuning at the bigger model gives you robustness; little things make more sense. So that's still just instruction tuning and scale more than anything else.Swyx [00:27:56]: Excellent. Shall we go to the technical overview?Nathan [00:27:58]: Yeah. This is where we go through my own version of this three-phase process. You can talk about instruction tuning, which we've talked about a lot. It's funny because, of all these things, instruction tuning has the fewest slides, even though it's the most practical thing for most people. We could save the debate about whether the big labs still do instruction tuning for later, but that's a coming wave for people. And then there's preference data and training, and then what reinforcement learning optimization actually means. We talk about these sequentially because you really have to be able to do each of them to be able to do the next one. You need to have a model that's chatty or helpful and instruction-following. Every company has their own word that they like to assign to what instructions mean. And then once you have that, you can collect preference data and do some sort of optimization.Swyx [00:28:39]: When you say word, do you mean like angle-bracket INST, or do you mean something else?Nathan [00:28:42]: Oh, I don't even know what INST means, but just saying they use the adjective that they like. I think Anthropic does too — like, steerable is another one.Swyx [00:28:51]: Just the way they describe it. Yeah.Nathan [00:28:53]: So instruction tuning — we've covered most of this — is really about adapting your models to specific needs. It makes models that were only okay extremely comprehensible. A lot of the time it's where you start to get things like chat templates. So if you want to do system prompts, if you want to ask your model to act like a pirate — that's one of the ones I always do, which is always funny — or act like a chef, like anything, this is where those types of things that people really know in language models start to get applied. So it's good as a starting point, because this chat template is used in RLHF and all of these things down the line. But it's a basic pointer: once you see this with instruction tuning, you really know it. You take things like Stack Overflow, where you have a question and an answer, and you format that data really nicely (there's a small sketch of this below). There are much trickier things that people do, but I still think the vast majority of it is question-answer — please explain this topic to me, generate this thing for me. That hasn't changed that much this year. I think people have just gotten better at scaling up the data that they need. Yeah, this is where this talk will take a whole left turn into more technical detail land.
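As a minimal sketch of what "formatting the data really nicely" can mean, here's a generic chat-template function in Python. The `<|system|>`-style tags are illustrative placeholders — every model family defines its own special tokens — so treat this as a shape, not anyone's exact format.

```python
def format_example(question: str, answer: str,
                   system: str = "You are a helpful assistant.") -> str:
    # Generic chat-template shape; real model families define their own
    # special tokens, so treat these tags as placeholders.
    return (
        f"<|system|>\n{system}\n"
        f"<|user|>\n{question}\n"
        f"<|assistant|>\n{answer}"
    )

example = format_example(
    "How do I reverse a list in Python?",
    "Use my_list[::-1] or list(reversed(my_list)).",
)
print(example)
# Training then minimizes the ordinary next-token prediction loss,
# usually only on the assistant portion of strings like this one.
```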
I put a slide with the RLHF objective here, which I think is good for people to know. I've started going back to this more, just to understand what is trying to happen here and what type of math people could do. I think because of this algorithm that we've mentioned, which is in the air — direct preference optimization — everything kind of comes from an equation of trying to learn a policy that maximizes the reward. The reward is some learned metric, and a lot can be said about what the reward should be, subject to some constraint. The most popular constraint is the KL constraint, which is just a distributional distance. Essentially, in language models, that means if you have a completion from your instruction or RLHF model, you can compare that completion to a base model. And looking at the log probs from the model, which are essentially how likely each token is, you can get a rough calculation of the distance between these two models, just as a scalar number. If you look at what that actually looks like in code, it's basically a sum of log probs that you get right from the model (there's a small sketch of this below). It looks much simpler than it sounds, but it is just to make the optimization stay on track — make sure it doesn't overfit to the RLHF data. Because we have so little data in RLHF, overfitting is really something that could happen. It will fit to specific features that labelers like to see, that the model likes to generate — punctuation, weird tokens like calculator tokens. It can overfit to anything if it's in the data a lot and happens to be in a specific format. And the KL constraint prevents that. There's not that much documented work on this, but a lot of people know that if you take it away, it just doesn't work at all. I think it's something people don't focus on too much. But the objective, as I said, is just: you optimize the reward — the reward is where the human part of this comes in, and we'll talk about that next — subject to a constraint: don't change the model too much. The real questions are: how do you implement the reward, and then how do you make the reward go up in a meaningful way? So for a preference model, the task is to design a human reward. I think the equation that most of this stuff is based on right now is something called a Bradley-Terry model, which is a pairwise preference model where you compare two completions and you say which one you like better. I'll show an interface that Anthropic uses here. And the Bradley-Terry model is really a fancy probability between two selections. What's happening in the math is that you're looking at the probability that the chosen completion — the one you like better — is actually the better completion over the rejected completion. And what these preference models do is assume this probability is correlated to reward. So if you sample from this probability, it'll give you a scalar, and then you use that reward later on to signify which piece of text is better. I'm kind of inclined to breeze through the math stuff because otherwise it's going to be not as good to listen to.Alessio [00:32:49]: I think people want to hear it. I think there are a lot of higher-level explanations out there. Yeah.Nathan [00:32:55]: So the real thing is you need to assign a scalar reward of how good a response is. And that's not necessarily that easy to understand.
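Here's a minimal sketch, in Python, of the sum-of-log-probs calculation Nathan alludes to — a single-sample estimate of the KL penalty for one completion, and how PPO-style setups typically deduct it from the reward. All the variable names and numbers are illustrative assumptions, not any specific library's API.

```python
# Per-token log probs of the SAME completion under the RLHF policy and
# the frozen base/reference model (made-up numbers for illustration).
policy_logprobs = [-0.8, -1.1, -0.3]  # log p_policy(token_t | context)
base_logprobs = [-1.0, -1.0, -0.9]    # log p_base(token_t | context)

# Sum of per-token log-ratios: a single-sample estimate of
# KL(policy || base) for this completion, as one scalar.
kl_estimate = sum(lp - lb for lp, lb in zip(policy_logprobs, base_logprobs))

beta = 0.1            # strength of the constraint (illustrative)
reward_from_rm = 1.7  # scalar from the reward model (illustrative)

# In PPO-style RLHF, the penalty is typically just deducted from the reward:
total_reward = reward_from_rm - beta * kl_estimate
print(kl_estimate, total_reward)
```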
It's not easy to understand because, if we go back to one of the first works — I mentioned this TAMER thing for decision making — people tried that with language models: if you have a prompt and a completion and you just have someone rate it from 0 to 10, could you then train a reward model on all of these completions and 0-to-10 ratings and see if you can get ChatGPT with that? And the answer is really kind of no. A lot of people tried that. It didn't really work. And then that's why they tried this pairwise preference thing, and it happened to work. And this Bradley-Terry model comes from the 1950s. It's from the fields I was mentioning earlier, and it's wild how much this happens. I mean, the screenshot I have in the slides is from the DPO paper — I think it might be the appendix — but it's still really around in the literature of what people are doing for RLHF.Alessio [00:33:45]: Yeah.Nathan [00:33:45]: So it's a fun one to know (it's written out below).Swyx [00:33:46]: I'll point out one presumption that this heavily relies on. You mentioned this as part of your six presumptions that we covered earlier, which is that you can aggregate these preferences. This is not exactly true among all humans, right? I have a preference for one thing; you have a preference for a different thing. And coming from economics — you mentioned economics earlier — there's a theorem, or a name for this, called Arrow's impossibility theorem, which I'm sure you've come across.Nathan [00:34:07]: It's one of the many things we throw around in the paper.Swyx [00:34:10]: Right. Do we just ignore it?Nathan [00:34:14]: We just, yeah, just aggregate. Yeah. I think the reason this is really done, on a deep level, is that you're not actually trying to model any contestable preference in this. You're not trying to go into things that are controversial or anything. The notion of preference here really stays around correctness and style, rather than any meaningful notion of preference, because otherwise these companies don't want to do this at all. I think that's just how it is. And if you look at what people actually do — so I have a bunch of slides on the feedback interface, and they all publish this.Swyx [00:34:43]: It's always in the appendices of every paper.Nathan [00:34:47]: There's something later on in this talk, but it's good to mention here: when you're doing this preference collection, you write out a very long document of instructions to the people collecting this data. It's like: this is the hierarchy of what we want to prioritize — something like factuality, helpfulness, honesty, harmlessness. These are all different things. Every company will rank these in different ways and provide extensive examples: if you see these two answers, you should select this one, and why, and all of this stuff. And then my head-scratching is: why don't we check if the models actually do these things that we tell the data annotators to collect? But I think it's because it's hard to make that attribution, and it's hard to test if a model is honest and such. It would just be nice, as a researcher, to understand the causal mechanisms, or whether our goals are met. But at a simple level, what it boils down to — I have a lot more images than I need — is that you're having a conversation with an AI, something like ChatGPT. You get shown two responses, or more in some papers, and then you have to choose which one is better.
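For reference, the Bradley-Terry model Nathan describes fits in one line, where r(x, y) is the learned scalar reward for completion y on prompt x and sigma is the logistic function. This is the standard textbook form, not a claim about any one lab's exact implementation:

```latex
% Probability the chosen completion y_c beats the rejected one y_r:
P(y_c \succ y_r \mid x)
  = \frac{e^{r(x, y_c)}}{e^{r(x, y_c)} + e^{r(x, y_r)}}
  = \sigma\big(r(x, y_c) - r(x, y_r)\big)
```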
I think something you'll hear a lot in this space is something called a Likert scale. Likert is a name — it's named for some research in economics or decision theory or something. But essentially, it's a type of scale where, if you have integers from one to eight, the middle numbers represent something close to a tie, the smallest numbers represent one model being way better, and the biggest numbers represent the other model being better. So in the case of one to eight, if you're comparing models A and B, you return a one if you really liked option A, an eight if you really liked B, and a four or five if they were close. There are other ways to collect this data; this one's become really popular. We played with it a bit at Hugging Face. It's hard to use. Filling out this preference data is really hard — you have to read multiple paragraphs. It's not for me. Some people really like it, but I can't imagine sitting there, reading AI-generated text, and having to do that for my job. But a lot of these early papers in RLHF have good examples of what was done. The one I have here is from Anthropic's collection demo, because it was from slides that I did with Anthropic, but you can look these up in the various papers. It looks like ChatGPT with two responses, and then you have an option to say which one is better. It's nothing crazy. The infrastructure is almost exactly the same; they just log which one you think is better. I think places like Scale are also really big in this, where a lot of the labeler companies will help control who's doing how many samples, having multiple people go over the same sample, and what happens if there's disagreement. I don't really think this disagreement data is used for anything, but it's good to know the distribution of prompts, who's doing it, how many samples you have — controlling the workforce. All of this is very hard. A last thing to add is that a lot of these companies collect optional metadata. I think the Anthropic example shows a rating of how good the prompt or the conversation was, from good to bad, because these things matter. There's kind of a quadrant of preference data in my mind. There's comparing a good answer to a good answer, which is really interesting signal. Then there's comparing a bad answer to a bad answer, which is like — you don't want to train your model toward either of two different issues. We did this at Hugging Face, and our data was like: we don't know if we can use this, because a lot of it was just bad answer versus bad answer, because you're rushing to try to deliver on a real contract. And then there's also good answer versus bad answer, which I think is pretty reasonable to include — you just prefer the good one and move on with your life. But those are very different scenarios. I think the OpenAIs of the world are all in good-answer-versus-good-answer land and have learned to eliminate everything else. But when people try to do this in open source, it's probably like what Open Assistant saw: there are just a lot of bad answers in your preference data, and you're like, what do I do with this? Metadata flags can help. I threw in the InstructGPT metadata — you can see how much they collect here: everything from the model failing to actually complete the task, hallucinations, different types of offensive or dangerous content, moral judgment, expressing an opinion (a rough sketch of what one of these records could look like follows below).
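To make that concrete, here's a rough Python sketch of what a single preference record with optional metadata flags could look like. The field names are hypothetical, loosely inspired by the InstructGPT-style categories above — not any lab's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceRecord:
    # Hypothetical schema for one preference data point; field names are
    # illustrative, loosely inspired by the InstructGPT-style categories.
    prompt: str
    chosen: str    # the completion the labeler preferred
    rejected: str  # the completion the labeler liked less
    flags: dict = field(default_factory=dict)  # optional metadata

record = PreferenceRecord(
    prompt="Explain like I'm five: why is the sky blue?",
    chosen="A concise, faithful explanation...",
    rejected="A rambling answer that invents details...",
    flags={
        "fails_task": False,
        "hallucination": True,  # flagged on the rejected side
        "dangerous_content": False,
        "expresses_opinion": False,
    },
)
print(record.flags)
```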
I don't know exactly if they're doing this now, but you can see why doing RLHF at scale and prioritizing a lot of different endpoints would be hard, because these are all things I'd be interested in if I were scaling up a big team to do RLHF: what is going into the preference data? You do an experiment and you're like, okay, we're going to remove all the data where they said the model hallucinates — just that — and then retrain everything. What does that do?Swyx [00:38:59]: Yeah, so hallucination is big, but some of these other metadata categories — and I've seen this in a lot of papers — are like: does it contain sexual content, does it express a moral judgment, does it denigrate a protected class? That kind of stuff, very binary. Should people try to adjust for this at the RLHF layer, or should they set up a pipeline where they have a classifier as a separate model that grades the model output?Nathan [00:39:20]: Do you mean for training or for deployment? Deployment? I do think that people are doing it at deployment. We've seen safety and other things in the RLHF pipeline — Llama 2 is famous for having these helpfulness and safety reward models. Deep in the Gemini report is that Gemini has like four things: helpfulness, factuality, maybe safety, maybe something else. But places like Anthropic and ChatGPT and Bard almost surely have a classifier after, which is like: is this text good, is this text bad? That's not that surprising, I think, because you could use a language model a hundred times smaller and do much better at filtering than RLHF. But I do think RLHF's motivation is still so deeply intertwined with safety that some of these categories persist. That's something that will kind of settle out, I think.Swyx [00:40:11]: I'm just wondering if it's worth collecting this data for the RLHF purpose, if you're not going to use it in any way — a separate model to—Nathan [00:40:18]: Yeah, I don't think OpenAI will collect all of this anymore, but from a research perspective it's very insightful to know. It's also expensive: essentially, your preference data cost scales with how many minutes it takes to do each task, and every extra button scales it pretty linearly. So it's not cheap stuff.Swyx [00:40:35]: Can we, since you mentioned expensiveness — I think you may have joined one of our spaces back when Llama 2 was released. We had an estimate from you that was something on the order of Llama 2 costing $3 to $6 million to train GPU-wise, and then something like $20 to $30 million in preference data. Is that something that's still in the ballpark? I don't need precise numbers.Nathan [00:40:56]: I think it's still a ballpark. I know that the $20 million was off by a factor of four, because I was converting from a prompt number to a total data point count. Essentially, when you do this in a multi-turn setting, each turn is one data point, and the Llama 2 paper reports like 1.5 million data points, which could be like 400,000 prompts. So I would say $6 to $8 million is safe to say they're spending, if not more — they're probably also buying other types of data and/or throwing out data they don't like — but it's very comparable to compute costs. The compute costs listed in papers are always way lower, though, because all they have to report is what one run costs, but they're running tens or hundreds of runs.
So it's like, okay... Yeah, it's kind of a meaningless number. Yeah, the data number would be more interesting.Alessio [00:41:42]: What's the depreciation of this data?Nathan [00:41:46]: It depends on the method. With some methods, people think it's more sensitive to — this is what I was saying: does the type of instruction tuning you do matter for RLHF? So depending on the method, some people are trying to figure out if you need to have what is called — this is very confusing — on-policy data, which means your RLHF data comes from your instruction model. I really think people in open source and academia are going to figure out how to use any preference data on any model, just because they're scrappy. But there's been an intuition that to do PPO well, and keep improving the model over time, and do what Meta did and what people think OpenAI does, you need to collect new preference data to keep edging the distribution of capabilities forward. So there's a depreciation where the first batch of data you collect isn't really useful for training the model when you have the fifth batch. We don't really know, but it's a good question. And I do think that if we had all the Llama data, we wouldn't know what to do with all of it. Probably like 20 to 40% would be pretty useful for people, but not the whole dataset. A lot of it's probably kind of gibberish, because they had a lot of data in there.Alessio [00:42:51]: So do you think the open source community should spend more time figuring out how to reuse the data that we have, or generate more data? I think that's one of the—Nathan [00:43:02]: I think people are kind of locked into using synthetic data. People also think that synthetic data, like GPT-4, is more accurate than humans at labeling preferences. If you look at these diagrams, humans are at about 60 to 70% agreement — that's what the models get to — and if humans are about 70% agreement or accuracy, GPT-4 is like 80%. So it is a bit better, which is one way of saying it.Swyx [00:43:24]: Humans don't even agree with humans 50% of the time.Nathan [00:43:27]: Yeah, so that's the thing. The human disagreement, or the lack of accuracy, should be a signal, but how do you incorporate that? It's really tricky to actually do. I think people just keep using GPT-4 because it's really cheap. It's one of my go-tos — I just say this over and over again: GPT-4 for data generation, all terms and conditions aside (because we know OpenAI has that stuff), is very cheap for getting pretty good data, compared to compute or the salary of any engineer or anything. So I tell people to go crazy generating GPT-4 data, if you're willing to take on the organizational cloud of "should we be doing this?" I think most people have accepted that you kind of do this, especially as individuals — they're not going to come after individuals. I do think more companies should think twice before doing tons of OpenAI outputs, also just because the data contamination and what it does to your workflow is probably hard to control at scale.Swyx [00:44:21]: And we should just mention, at the time of recording, we've seen the first example of OpenAI enforcing their terms of service. ByteDance was caught — reported to be training on GPT-4 data — and they got their access to OpenAI revoked.
So that was one example.Nathan [00:44:36]: Yeah, I don't expect OpenAI to go too crazy on this, because there's going to be so much backlash against them, and everyone's going to do it anyway.Swyx [00:44:46]: And what's at stake here, to spell it out, is: it costs like $10 to collect one data point from a human, and it's going to cost you like a tenth of a cent with OpenAI, right? So it's just orders of magnitude cheaper. And therefore people—Nathan [00:44:58]: Yeah, and the signal you get from humans for preferences isn't that high. The signal that you get from humans for instructions is pretty high, but it is also very expensive. So human instructions are definitely, by far and away, the best ones out there compared to synthetic data. But I think synthetic preferences are just so much easier to get some sort of signal running with, and you can work in other goals there, between safety and whatever. That's something that's taking off, and we'll kind of see that. I think in 2024, at some point, people will start doing things like constitutional AI for preferences, which will be pretty interesting. We saw how long it took RLHF to get started in open source — instruction tuning was the only thing that was really happening until maybe August, really. I think Zephyr was the first model that showed success with RLHF in public, and that's a long time from everyone knowing it was something people were interested in to having any checkmark. So I accept that, and I think the same will happen with constitutional AI. But once people show that you can do it once, they continue to explore.Alessio [00:46:01]: Excellent.Swyx [00:46:01]: Just in the domain of human preference data suppliers: Scale.ai very happily will tell you that they supplied all that data for Llama 2. The other one that's probably interesting is LMSYS from Berkeley. What they're running with Chatbot Arena is perhaps a good store of human preference data.Nathan [00:46:17]: Yeah, they released some toxicity data. They, I think, are generally worried about releasing data, because they have to process it and make sure everything is safe, and they're a really lightweight operation. I think they're trying to release the preference data. If we make it to evaluation, I'd pretty much say that Chatbot Arena is the best limited evaluation people have for learning how to use language models, and it's very valuable data. They also may share some data with the people whose models they host. So if your model is hosted there and you pay for the hosting, you can get the prompts, because you're pointing the endpoint at it, that gets pinged to you, and any real LLM inference stack saves the prompts that it gets. So that is some signal. I don't know if they share the preferences. I do think they're trying to do all the right things; they're just very strapped, and moving data comes with other legal and liability concerns in some cases. Awesome. So kind of looping back from that very valuable digression on what preference data is — it's worth talking about the actual loss function, because this classifier approach might not make too much sense to people. You take a language model and you chop off a little bit at the end so that it outputs one number.
At a technical level, it's a logit that corresponds to the probability we talked about earlier. But in order to train this, you can't just have prompts and completions — you need these pairs, because, as we discussed, scalars don't really work. So to train it, you use the magical batching of all deep learning architectures: you put in the chosen completion and the rejected completion at the same time, you end up with two numbers, and then there's this fun loss function where you essentially have to increase the difference between these two predicted numbers. It's always fun when you think about automatic differentiation — it updates the same parameters to separate these two numbers at once. The loss function you'll see in OpenAI's, Anthropic's, and everyone's papers looks like a log of a scalar with an exponential of the difference between the two predicted rewards — just some fancy math around a subtraction between the predicted reward for the rejected completion and the predicted reward for the chosen completion (there's a small sketch of this below). Fun fact: these loss functions look different in Anthropic's and OpenAI's papers, but they're literally just log transforms of each other. If you start exponentiating both sides and taking the log of both sides, the two papers end up being the same thing. And people don't know how to train preference models particularly well now. If you zoom into the details and look at the agreement number — so, on a test set you'll have a chosen and a rejected completion, and you can take the reward model you're training, pass in those completions, and see whether the predicted reward for the chosen one, the scalar number, is higher than the predicted reward for the rejected one — the agreement numbers in all of these datasets are around 65 to 75%. That just means these scalar numbers were ordered correctly, and that's a pretty low number. It's not going to get to a hundred percent. That goes to show the deep questions at play here. People are playing with different loss functions, samples, and models to try to address this, but it's really a fundamental issue. It goes back to: what does it mean to do RLHF? We're not going to answer that now, but it's good to know that you'll see these 65-to-75% agreement numbers everywhere. We don't have a hundred percent agreement between the reward model and the data, and that's fine. That's just where we're at. And we essentially take this model and then start throwing RL at it. PPO — proximal policy optimization — is pretty complicated compared to what you really need to know. It really just does RL under the hood: things like PPO learn a value function and then use the value function to update the model. If you actually look at a feedback diagram, it's more of a systems problem than an RL problem. You'll see things like: you need two copies of the language model — this is for the KL constraint that we talked about before; you need the reward model, which is either a separate reward model or a value head on your base model; and then you need your RL code that actually learns a value function and updates all the parameters. It's really messy to actually set up, but if you dig into it, most people can understand what each of the components is.
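Backing up to the loss function for a second: here's a minimal sketch of that pairwise loss, using PyTorch for illustration. The reward values are made-up stand-ins for a reward model's scalar outputs; the point is the `-log sigmoid(difference)` shape and the agreement calculation Nathan describes.

```python
import torch
import torch.nn.functional as F

# Made-up scalar outputs of a reward model for a small batch of pairs;
# in practice these come from the same network in one batched forward pass.
r_chosen = torch.tensor([1.3, 0.2])    # predicted reward, chosen completions
r_rejected = torch.tensor([0.4, 0.9])  # predicted reward, rejected completions

# -log sigmoid(r_chosen - r_rejected): pushes the two scalars apart.
# Expanded, it's log(1 + exp(r_rejected - r_chosen)), which is why the
# Anthropic and OpenAI forms match after a log/exp transform.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

# "Agreement" on a test set is just how often the ordering is correct:
agreement = (r_chosen > r_rejected).float().mean()  # 0.5 for this toy batch
print(loss.item(), agreement.item())
```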
And then the hard part is: how do you actually make a language model that works out of this? That's not something people know that well. The things I talk about a lot are like: what is the signal flow? How do you access the reward model? The reward model is used in RLHF exactly how you would think: you have a prompt, the language model generates a completion, and then that completion is given a score. That score gets plugged into the whole RL stuff, and it updates the parameters. That's the core of it. There are a lot of different choices around where exactly you put this distance penalty between the base model and the RL model. Most people say you just deduct it from the reward. So if you go all the way back to RL as an agent acting in the world, the reward from that world would be a combination of the reward model and any constraints, like KL, that you put on it. There are a lot of different ways to do this, because a lot of RL algorithms like PPO actually have a KL constraint built into them. So it's confusing, because you hear KL twice, but those are different KLs: one of them is about the text, and one of them is about the value function distance, or the policy distance, or something like this. It really ends up being kind of gibberish that I think is less important now, because it's more about data and infrastructure than RL details like value functions and everything. A lot of the papers have different terms in the equations. I think InstructGPT does something where they try to get the RL model to match the instruction tuning dataset, because they were really happy with that dataset, to constrain the distribution. Llama does some different things. But I think these are all small gains over just getting a deep understanding of the data and the infrastructure setup. This is why we say there's so little RL in it. We're now getting to the point where you don't even really need this to get a good model. So that's why it's like: the RL is such a small part of actually doing RLHF. RLHF is a metaphor for all language model adaptation, and RL is one tool used at one point in time. So that's where I wrap up the core overview in my mind: RL doesn't really do as much as people think, but you could put up flashy equations and do all sorts of stuff if you want to. I think that's kind of misleading, even, because I don't think about those equations on a regular basis.Swyx [00:52:20]: But what if we call it Q*?Alessio [00:52:23]: Yeah.Alessio [00:52:26]: So in your mind, the takeaway for this kind of next generation of people working on models is that maybe the underlying theory is less important than actually getting good data, basically.Nathan [00:52:38]: Yeah, I think it's getting good data, and we'll see. I have this advanced topics section in the slides, which starts with the evals and then talks about a lot of different ways that people are using reward models, or constructing training signals, really. And I think that's about understanding what your information flow is, whether your reward signal is good, whether your language model is generating right — zooming in on the tokens it's generating and understanding how those things change over time. This is something we could also talk about with evaluation, but really, RLHF is not yet shown to improve capabilities.
I think one of the fun ones is from the GPT-4 technical report. They essentially listed their kind of bogus evaluations — it's a hilarious table, because it's like the LSAT, AP exams, and then AMC 10 and AMC 12, which are kind of reasonable evals in language model land. But they just showed that RLHF doesn't improve their evaluation metrics. We don't know if internally they have other ones.Alessio [00:53:30]: They probably do.Nathan [00:53:30]: But from what OpenAI has shown us externally, RLHF improves some metrics, decreases some metrics — no one can really tell. I do think it does things they care about, but RLHF is not an easy tool to make numbers go up with. It's a powerful tool to change your language model. But as we've seen with Llama and safety RLHF, that doesn't always mean people are going to be happy with those changes, or that it's going to do exactly what you want. It's like—Swyx [00:53:56]: Well, I think this is intuitive. A lot of these tests are multiple choice, and RLHF isn't necessarily intended to improve your multiple-choice reasoning capabilities.Nathan [00:54:04]: Yeah, I think that is reasonable, but I don't think a lot of people have connected the dots there. And what is in a preference data point? What if your preference data was between a correct and a wrong answer? It could conceivably do it, but I just don't think that is remotely what it is actually doing.Swyx [00:54:22]: It's much better at being a sommelier.Nathan [00:54:24]: Yeah. That was the weirdest one that was included in GPT-4.Alessio [00:54:29]: I just see the last three down there. That's really funny. It can't even taste it.Nathan [00:54:38]: Yeah, so this next part is essentially about how to use RLHF-like things to make the model better without using PPO, because PPO is kind of a nightmare to scale. The first thing I start with is the ideas of rejection sampling and best-of-n sampling. I think best-of-n sampling is what people often encounter first: you take a prompt, you generate 10 or 20 responses for it, you pass them through a reward model, the reward model assigns a scalar to each of them, you pick the one with the highest number, and that's the one you answer the question with. It seems pretty logical to people, because it's just spending more inference-time compute to make your outputs better. And it works in a lot of settings. This Let's Verify Step by Step paper from OpenAI that I talked about uses it; lots of papers use it. It's just a good thing to know you can do: you can spend more inference compute, based on a preference dataset, to make your answers better. The thing people are more confused about is rejection sampling, because Meta talked about it in Llama 2. Essentially, rejection sampling is putting something like best-of-n sampling in a feedback loop (both are sketched below). Instead of just returning the best answer to a user, you take the best few answers and apply instruction tuning on that dataset. Then, after that instruction tuning, you can collect more preference data, train a new reward model, rank some new outputs, and do instruction tuning again. So essentially, Llama started their RLHF process with this, to get some signal out of preference data. That preference data went into a reward model, and the reward model did a good enough ranking that it was essentially superpowered instruction tuning based on rewards. It works pretty well and is much easier to implement than PPO, because it's still instruction tuning — the same autoregressive loss — and it's easy to plug into things like transformers and stuff like that. A lot easier to start with than whatever freaking mess doing RL at scale is going to be. So that's one.
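Here's a minimal self-contained sketch of both ideas in Python. `generate` and `score` are toy stand-ins for a language model and a reward model; the structure, not the stubs, is the point.

```python
import random

def generate(prompt: str) -> str:
    # Toy stand-in for sampling one completion from a language model.
    return f"candidate-{random.randint(0, 9)} for: {prompt}"

def score(prompt: str, completion: str) -> float:
    # Toy stand-in for a reward model's learned scalar.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    """Spend more inference compute: sample n, keep the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Rejection sampling = best-of-n in a training feedback loop:
prompts = ["explain dropout", "write a limerick about GPUs"]
new_sft_data = [(p, best_of_n(p)) for p in prompts]
# ...then instruction-tune on new_sft_data, optionally collect fresh
# preference data, retrain the reward model, and repeat.
print(new_sft_data)
```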
A quick nod that offline RL is something people talk about for RLHF, essentially because your model doesn't have to generate. In that case, you look at data, and it backpropagates through your reward model directly. In PPO, you have the step of needing to generate everything and pass it through the reward model; in offline RL, all of this is essentially done over one big dataset. I'm not an expert in this, but essentially you pay much less inference cost during the RLHF process if you do offline RL. There are a few papers people have published — not a lot of traction. I think it could take off; some people that I know in the RLHF area really think a lot of people in industry are doing this, just because it makes the training process simpler in terms of the number of things you have to have running. Different feedback types are probably also going to come into play. There are papers coming on written feedback, or labeling multiple scores, or multiple pairwise preferences for every completion. It's also related to what we mentioned with process reward models, where you label each step in the chain-of-thought reasoning, just to make the problem more specific. It seems very likely that different feedback will be used for different domains. Chain-of-thought reasoning is great for math, and that's where these process reward models are being designed. Probably not great for things like poetry, but as any tool gets better, it gets more specific. Then we get into more of a talking point, which I think is fun. The next one I have is constitutional AI. I think this is something that people really just misunderstood. I think most people thought that constitutional AI created the preference data based on the specific principles in some way. What did you two think of constitutional AI?Swyx [00:58:10]: I'll be the dumb person and you correct me. As far as I understood, Anthropic came out and said that the best way of generating this sort of preference data, or alignment, is to give a second model a constitution to evaluate the first model's outputs.Nathan [00:58:21]: Yeah.Swyx [00:58:22]: The constitution is unspecified, but it draws from the UN Declaration of Human Rights and the Apple Terms of Service, for some reason.Alessio [00:58:28]: Yeah.Nathan [00:58:28]: And this leads into the question: what is the other model evaluating, and how is it evaluating in a way that you can train on? That's what I mean — people didn't think about this. A lot of the CAI paper was actually talking about instruction tuning, which is: if you have an instruction, you have a language model that critiques the instruction based on principles, and then your instruction responses are closer to the constitutional principles. That's the first half — they have some acronym for all of this. The diagram in their paper is wild on this one. I think their papers are sometimes pretty funny, because they're not capabilities papers; they're alignment papers, so they don't make everything super clear.
So the first half of constitutional AI is fine-tuning your instructions based on principles. That's one half. And then the second half is what people really thought they knew, which is: how do you use this other model to provide a critique based on principles? In the paper, they essentially say what their prompt was for the synthetic feedback for generating new preferences, which is essentially: pick between these two answers based on this principle. So they're sampling from the principles in their constitution, and from two candidate completions — an A and a B. The AI model is given the context of a certain principle to pick the A or B preference, and then the new preference dataset is just the two completions, without the context of the principles. So with this sampling idea, they're sampling from like 30 principles and a wide dataset of two candidate completions across the different prompts. To me, it's very loose — the values are not explicit in this; it's just kind of how they're guided. It's a very machine learning approach, because it relies on averages and scale to get the principles in there, but it is way less explicit than I thought it was going to be. I kind of thought there was some feedback step in the preference data, a check to see if the principles were satisfied, or anything like this. It's really just a modification to the RLHF setup that we've talked about — instruction tuning and preference data collection — where there's an AI model providing critiques, and a lot of those critiques are based on sampling of constitutional values. It almost sounds more tractable that way. But I would also guess, even as I say, oh look, I figured it out, that they do different things than they said in the paper. The paper is from 2022 — it's a pretty old paper — and they're surely doing more. But it's good to know where they started, at least in this case.Swyx [01:00:51]: I thought the communication around the Pareto optimal improvements was helpful in understanding that you do actually want the model to be more helpful and honest while maintaining the same level of harmlessness, or something. Yeah, right.Nathan [01:01:03]: Yeah, that figure right at the top of the constitutional AI paper is worth seeing if it doesn't immediately pop into your head — they essentially compare constitutional AI to the other RLHF they're doing internally. And that's something most RLHF papers don't do: they have little dots on the lines to indicate intermediate checkpoints, and it would be really great to see more RLHF papers showing what happens per epoch, or per half epoch, of training, because most RLHF is only a few epochs, at least in the open models. People release checkpoints, but that's how we should be thinking about it, because the optimizer is so strong, and we don't know what's happening in this intermediate land.Swyx [01:01:41]: I don't know if this is a relevant comparison for you, but OpenAI also recently released a weak-to-strong generalization paper, where they actually talked about a few intermediate checkpoints for GPT-4. Any comments on the comparison between constitutional AI and weak-to-strong generalization?Nathan [01:01:55]: I didn't see the paper.
I think I saw people criticizing it for just being safety-washing, from the fact that they're still talking about GPT-2 — which is such an odd model to focus on. I didn't really look at the paper. I think it's a thing with OpenAI: they're sharing less than they know. So I think they probably have things that are pretty cool that they're doing internally.Swyx: And I'll summarize for listeners who may not have seen the paper, because it's impossible to keep up with everything. I do think that what constitutional AI and RLAIF represent is that we are starting to come to a point where it's just impossible for manual human preference data collection to scale, and the only way to scale this is to trust our AI overlords to model our human preferences. Constitutional AI was the first version of this. What the second version — weak-to-strong — is, is anticipating a future need for superalignment, where the thing we're trying to control is smarter than us. So you take GPT-2 and use it to supervise GPT-4, trying to elicit behavior smarter than the supervisor, because this is what we're going to have to do in the future, when we're no longer fully in control. Are we the metaphorical GPT-2? No, we're not even in the process anymore at the point of superintelligence.Alessio [01:03:10]: They're prepping. And they're saying this will happen, and humans will be so far in the dust that we just have no say in this debate.Swyx [01:03:18]: How do we still control systems then? Weak-to-strong generalization seems to be the answer, and I see a lineage from constitutional AI to this.Nathan [01:03:26]: Yeah, constitutional AI and superalignment are very conceptually linked. It's a group of people that have a very similar intellectual upbringing, and they worked together for a long time, coming to the same conclusions in different ways. And I understand the argument, and I mostly just don't fully buy it. I think I'm just waiting to see more from the superalignment team, because, quickly looking at weak-to-strong generalization, I didn't really put together in my brain exactly how it all fits. But I'm also not a safety researcher. Yeah, but that could be feedback for them: I understand what synthetic data means, so how could they communicate this a little bit more specifically in that context? Because I want to know what they think about it. Which is why I like that Pareto optimal thing — it steers the debate away from x-risk to, no, this makes language models more useful, and we can all get behind that.Swyx: I agree.Nathan [01:04:13]: I think the last kind of emerging direction that I have might just be this debate — you can control how long we talk about this — which is about direct preference optimization. You can go read my blog post on this; I've tried to summarize it already. But essentially, DPO is a different class of algorithms. I still call it RLHF, because RLHF is so vague in how it's defined; I think DPO is closer to RLHF than RLHF is to RL. You can unpack that if you need to. What DPO is doing is essentially deriving an optimal reward function from the preference data, where the preference data is the same thing that we've talked about, and then the clever math in the paper derives the optimal policy for that, based on an implicit reward function. That implicit reward is a ratio of log probs. It's very odd. The difference between a DPO reward and a classifier reward is big: the classifier is trained to output a scalar value based on this kind of contrastive loss, whereas DPO is purely based on the difference between two log-prob ratios — the reward there is the ratio between the policy generation likelihood and the base model generation likelihood (sketched below). I don't have intuitions for what that means yet, but what the reward actually is is very different. The data starting point, in principle, could be the same.
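Here's a minimal sketch of the DPO loss under those definitions, again using PyTorch for illustration. The log-prob values are made up; in practice each would be the summed per-token log probs of a whole completion under the policy and under the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x)) for each completion.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Same pairwise Bradley-Terry shape as the reward-model loss, but on
    # log-prob ratios instead of a separate classifier's scalar output:
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Made-up summed log probs of whole completions (policy vs. reference):
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # one backward pass then updates the policy directly
```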
The difference between what a DPO reward is and a classifier reward is big: the classifier is trained to output a scalar value based on a kind of contrastive loss, where DPO is purely based on the difference between two log prob ratios. So the reward there is the ratio between the policy generation likelihood and the base model generation likelihood. I don't have intuitions for what that means yet, but what the reward actually is is very different. The data starting point in principle could be the same. And I think we've seen a lot of successes in open source with it. It's way simpler to implement and to work with in that regard, which is why I think we'll keep seeing a lot of success with it in the short term. I think we'll keep seeing DPO models for the time being, but we won't really answer what the fundamental differences are, because it depends on your data, it depends on your infrastructure. Rumors seem to be that people still think that PPO-like methods or other RL methods have a higher top end, but I don't necessarily think...Swyx [01:05:56]: Sorry, what is top end?Nathan [01:05:57]: Just the absolute best model you could get. So my read is that Google and OpenAI aren't using DPO because they can do something more complicated, but that's not what academics and open source people really care about. They care about being able to improve on their methods, understand where to iterate the model, and work off of each other. So in a lot of ways, I think DPO still will be what people use, but in some ways it's probably slightly more constrained. There are other settings where you could see PPO working nicely, like in code, where whether your code runs is the score that you give it; you would have to do canned things to get DPO the same kind of data. So there are specific cases where the DPO formulation is a little bit harder, but I expect to see more DPO models than anything else in the next six months. That's probably what most people need to know unless they're an RLHF expert. And I would love to learn more about PPO from a lot of authors in this space, and the DPO authors are great to talk to. You can reach out to all three of them.Swyx [01:06:54]: So as of the time of recording, we're actually about to publish our NeurIPS recap where we talk to the authors. Yeah, so for people who are listening to this in the future, you can refer to that episode.Nathan [01:07:02]: Yeah, so Rafael, Eric and Archit, I've talked to all of them at good length and they're all fantastic. And they'll say similar things, and they'll also defend their method, because it's an awesome paper. If you want to learn what a good mathy-but-still-experimental paper in language models looks like, the DPO paper is a really good one to spend more time on. Yeah.Swyx [01:07:25]: When I asked them questions about it, they just kind of gestured at the poster and said, look at the equation, just stare at it. And yeah, that's my criticism for them, is that they're still in the academic world, where some of their answers reflect that.
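[Editor's note: the equation they were gesturing at. In the DPO paper, the implicit reward for a completion is beta times the log ratio of policy to reference likelihood, and the loss is a logistic loss on the reward margin between the chosen and rejected completion. A minimal PyTorch sketch, assuming summed per-sequence log probs as inputs:]

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # The implicit reward is beta * (policy log prob - reference log prob),
    # i.e. a log ratio of likelihoods, per the DPO paper.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the margin: -log sigmoid(r_chosen - r_rejected),
    # which pushes the policy to prefer the chosen completion.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```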
But I've done it enough with them that I understand what they're saying.Alessio [01:07:44]: Yeah.Swyx [01:07:44]: I would say it does remind me of FlashAttention a little bit, in the sense that it's kind of an equivalent thing to the thing it's replacing, and it's just faster, cheaper, just better.Nathan [01:07:53]: It's a very different optimization tool. The thing in my mind that I can't get past is the difference between the control you get in training a reward model and then training a policy, because essentially everything you want your reward model to do might not be everything that you train the policy to do in the RLHF step, where you have the two different prompt distributions. But with DPO, you're doing both at once, so you don't control that. And we don't know, if you have fancy engineering abstractions and test your reward model on different things, whether that separation is really important. And I think that's where the benefit at the absolute biggest scale and most investment could come from. But DPO is one update. It is one model. You can't separate that. So that's the thing to know. It probably doesn't matter for most people, but it is very different. And I was asking somebody who was on some of those earlier OpenAI papers, who's not at OpenAI anymore, and they were like, I wish we had thought of that. So it is a really cool idea, and that's the type of thing that academia still can do and can do really well, and hopefully continues to do.Alessio [01:08:54]: Yeah.Swyx [01:08:54]: One thing I wanted to make sure I cover before we leave this topic: one of the DPO models that were trained, apart from Zephyr and Mixtral, which are two of the more high profile ones, is Tulu from the Allen Institute. And you're one of the few people maybe placed to explain it. Like, what's the Allen Institute doing here? And what's the backstory? Yeah.Nathan [01:09:15]: So the Allen Institute for AI, I think its 10 year birthday is in January, and this is a special event for that. And also, people should know, this is Paul Allen from Microsoft. Paul Allen owns everything in Seattle. Not literally. I mean, he's passed, and his estate is still operating in a lot of great ways. But the Allen Institute is mostly known as being a super academic lab, where they have more resources than academia and publish hit after hit of research papers. And they're trying to move more in the direction of releasing models, and this is part of why I joined. Talking with the new CEO, Ali Farhadi, I don't know if I pronounced the last name right, but he's trying to move from an org that does papers only to something that does papers, releases models, is active in policy, and maybe helps work with these for-profit institutions that don't have an established place they could go to for these new things. So they're really trying to expand the scope. It's part of why I joined, and the Tulu 2 model is the type of thing I joined to do. They were talking about this, and I was like, okay, we should just train it and release it, because no one has done this direct preference optimization at a really big scale, like a 70 billion parameter scale. And this experiment is hilarious. This is classic of how everything kind of works right now in ML. I showed up, and the grad student Hamish Ivison, and I need to learn how to pronounce last names better, had some JAX DPO code built on the EasyLM framework.
And we have the TPUs that we could access for research purposes. So it's like, okay, we have a huge TPU, let's just try the Zephyr recipe on 70 billion parameters. And it's literally the first run. We did no ablations, didn't change any parameters, we just copied them all over. And that's the model that people have been working with. That goes to show that there's a lot of runway in understanding and improving on this. We took the same data, just took it to a different JAX implementation and scaled it up 10X, and it still returned a model that was pretty good, both on benchmarks and in people using it. So in 2024, we'll be busy in this space. We're running data ablations to try to understand what's best. The Allen Institute is also pre-training language models, open language models where we'll be able to share data, code, everything, the kind of thing that everyone likes to get annoyed about these days, as in, well, you're not releasing data. So that'll come in the new year, and then things like Tulu 2, or the recipes that we will apply to that. And we'll kind of keep doing both. As the pre-trained models get better, those will probably become more of a priority, but starting pre-training is very hard, so you still want to learn from Llama 2 and Llama 3. So that's fun. I think DPO releases are kind of becoming expected, because Mistral released a DPO model as well. There's a ton: Intel releases DPO models, Stability releases DPO models.At some point, you just have to accept that that's where we're going, whether or not you care about the whole DPO debate. And that's why I find it so funny, because there are really interesting, debatable questions between DPO and other RL methods. But we just won't have the answer, and it'll look like there isn't a debate, because everything that is published is with DPO. But that doesn't mean that anything is answered in the time being. Yeah, kind of the last of this stuff is evaluation. And these slides were prepared kind of last minute. But I think the question is, how do you evaluate these models and what should you be doing? I think the PSA is: don't trust your numbers, and actually talk to the models. It's very hard to do if you're an engineer or a researcher, because you have your specific thing that you're zoomed in on, and it feels like a waste of time to just go play with ChatGPT or go play with Chatbot Arena. But I really don't think it is. This is me telling myself what I should be doing. But there's the question of, is the Hugging Face leaderboard good for open source? And then what else can people do?The Hugging Face leaderboard came out of the team that I was on there. We were trying to build a framework to automatically evaluate the models that we were training and the models that people were releasing, and then have them in a central place where it could be like, look, here are the evaluation scores, this is what we're competing with. It obviously blew up. I think it's very good for companies trying to operate in the open LLM space to build businesses around. I think it's bad for people building LLMs that they think are the best, because it's easy to overfit if you're training and focusing on them as a developer. But it's good to have a distribution of models when there are so many people training them. But now it has six evaluation tools.
I can't even name all of them off the top of my head. It's ARC, HellaSwag, MMLU. There was DROP on it at one point, but they dropped DROP, which was pretty funny. TruthfulQA, and then I think maybe some other math one. I don't know.Swyx [01:13:42]: This benchmark question is something that everyone's talking about, because there's a lot of gaming that seems to be going on. Is there some discussion about sort of held out benchmarks that Hugging Face could hold on to?Nathan [01:13:55]: Mostly it's who's going to pay for it. We're thinking about this at Allen AI too. We're specifically thinking about improving on AlpacaEval, which is-Swyx [01:14:01]: Who's going to pay for running the evals? Right now Hugging Face is just running every eval every day?Nathan [01:14:06]: Yeah. So they have like a thousand GPUs. At one point they were going to do more training and it was going to be used for that, but now they do less training, and they run a good number of GPUs on this. In one of their blog posts, they said how much compute it was. I don't think it's a ton to run these, but you have to have hundreds of GPUs to maintain this leaderboard.Swyx [01:14:23]: So one technical question: some of these are open source models that they don't change, so you just have to run them once. Yeah. It's only the closed source models that need to be re-evaluated.Nathan [01:14:37]: Yeah. So if you look at the Chatbot Arena, they take specific dates. And then there's this whole controversy of, is ChatGPT from March better than ChatGPT from June? On one of these later slides, slide 58 if you're looking later, is the Chatbot Arena leaderboard, which is this thing from LMSYS that we were looking at, and on the X-axis is models. And you can see that GPT-4 from March has a higher score. This is not a perfect comparison, but there are signs that are pretty funny there. There are things cooking, but you don't know who's collecting this data, what prompts they're doing. It's such a funny timeline.Swyx [01:15:20]: So for those listening, GPT-4 March 14th is 40 Elo points higher than GPT-4 June 13th.Nathan [01:15:26]: Yeah, it's outside of the error bars on the LMSYS thing. And the other piece of context is that GPT-4 Turbo is also notably ahead of the other GPT-4s, which kind of showed up immediately once they added it to the arena. And I was like, all the GPT-4.5 memes aside, it seems like this is effectively a bump in the model. If you zoom into this, the leaderboard is very close for many strata of models. So there are levels where you can get your model to, and it'll be really close to your peers. In the open source, there are things like Mixtral Instruct and Tulu 2 DPO 70B, which is effectively a way bigger model than Mixtral. Mixtral's the mixture of experts model, and I'll give it credit, it's a very good model, and that's going to be the next level once people get better at fine tuning it. Yi 34B Chat, that's one level. And then there was a level with the Alpacas and the Vicunas. But for all of these open source models, there's then another step up to GPT-4, and then there's another step up to GPT-4 Turbo. So the difference from GPT-4 Turbo to the GPT-4 that was first released is bigger than the difference from Tulu 2 to GPT-4. So there's just something good going on there.
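[Editor's note: for a sense of scale on that 40-point gap, Elo ratings map to expected head-to-head win rates via the standard formula below. This is the generic Elo expectation, not LMSYS's exact aggregation code.]

```python
def elo_expected_win_rate(rating_a: float, rating_b: float) -> float:
    # Standard Elo expectation: probability that A beats B.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 40-point gap, like GPT-4 (March) vs GPT-4 (June) on the arena:
print(elo_expected_win_rate(1240, 1200))  # ~0.557, i.e. ~56% of votes
```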
And I was like, okay, that's a new model by my standards, but they're not going to tell us about it. They did at Dev Day say it's our new model, but they weren't like, this is our new best performing model, because the benchmark scores are probably the same, but they made it so that people like using it more.Swyx [01:16:57]: There are some hints that 4.5 might drop at some point. We don't actually know how true those things are, but I don't think it really matters.Nathan [01:17:03]: It's like they could call it anything. They're retraining these models, and they could call any of them GPT-4.5. Yeah, cool.Swyx [01:17:10]: And the other last points, you have a couple more extra slides here.Nathan [01:17:14]: There's a bunch on evaluation. I think the two tools that I talk about most in research domains on RLHF are AlpacaEval and MT-Bench. They're two academically maintained leaderboards for evaluating chat capabilities. Evaluating chat is really hard, and what they both do is have GPT-4 provide some sort of feedback. MT-Bench is called MT for multi-turn, and they have a prompt and a follow-up question. So what they do is ask GPT-4 to score both the initial response and the second response and provide the average. I've kind of given up on following the slides; this is all on the slides if you look for it. And then AlpacaEval is a little bit different, where you're comparing a candidate model, so the model we've trained. So when we're training Tulu, we submit this, and what it's doing under the hood is comparing the new model to davinci-003, which is one of OpenAI's older instruction models, and calculating the win rate that GPT-4 sees between the new model and davinci. So it has many more prompts than MT-Bench. MT-Bench is custom prompts that they made to kind of take a stance on what a good chat model is. AlpacaEval sources theirs from Self-Instruct, which is a popular paper from AI2, plus Open Assistant, Vicuna, Koala, and Anthropic's helpful and harmless data. So AlpacaEval is from sources that people know and love; MT-Bench is its own thing. We were more focused on MT-Bench at Hugging Face; at AI2 we're a little bit more focused on AlpacaEval, but it really can go either way. These are kind of table stakes to saying that you have a good RLHF model: you should be able to get a pretty good score on both of these. And then the proof is in people actually talking to it. So I think the Zephyr model from Hugging Face was a kind of step change in people's perception of open models that got integrated into a bunch of products within a few weeks. Like, you.com was experimenting with it, and I saw some Substacker was using it as a writing feedback bot instead of ChatGPT. That's what happens when a good open release is there now. The evaluations are good, people pick it up, and the evaluations are just enough to say, okay, we're in the right ballpark. But you never really know if the model is the one, or one of these big ones, without talking to it. So however much you talk about evals, that's still where we're at. You can't prove anything definitively, and Google is seeing that: until Gemini Ultra comes out, we don't know.
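[Editor's note: a minimal sketch of the AlpacaEval-style loop Nathan describes, where a GPT-4 judge compares the candidate model's answer against a davinci-003 reference for each prompt; the judge callable here is a hypothetical stand-in for the benchmark's tuned judge prompts.]

```python
from typing import Callable

def win_rate(prompts: list[str],
             candidate: Callable[[str], str],
             reference_answers: list[str],
             judge_prefers_candidate: Callable[[str, str, str], bool]) -> float:
    # For each prompt, generate with the new model and ask the judge
    # (GPT-4 in AlpacaEval) whether it beats the davinci-003 reference.
    wins = sum(
        judge_prefers_candidate(p, candidate(p), ref)
        for p, ref in zip(prompts, reference_answers)
    )
    return wins / len(prompts)  # the reported AlpacaEval-style win rate
```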
It's probably a great model, but we don't know what they have.Swyx [01:19:47]: Yeah, Gemini Pro didn't do so great on the other stuff either.Nathan [01:19:51]: Yeah, I want to know if Gemini Pro is just some intermediate checkpoint, or if it was a major deliverable for them or not. And if it wasn't a major deliverable, it's probably a strategy headache for Google, but it's not my problem.Alessio [01:20:05]: You have a bunch of open questions here. One of our lightning round questions is always... Yeah, we just do an inverted lightning round. Yeah, exactly.Swyx [01:20:12]: You ask people open questions.Nathan [01:20:16]: Oh, I mean, there's so much to do here. They're kind of a summarization of things that have been hinted at in the talk to this point. I split it up in my work between data, training, and model, which is essentially: how do we evaluate what's happening at the model level with RLHF? I think big labs are indexed on their own base models, so they don't know what swapping between Claude's base model or GPT-4's base model would change in any notion of preference, or what you do with RLHF. I think in the open we could do that. We can swap between Llama 2 and Mixtral and see, does RLHF work the same for both of those? Do they both get AlpacaEval bumps when you use the same dataset in the same framework down the line? That'd be good to know: how sensitive is RLHF? On the data, we talked a lot about aggregation. On the research side there are a lot of interesting things, like, does getting your data from Scale or a Discord army change the quality of the data, based on the professional context? They probably should do that internally. They should do internal market analysis along that line.Swyx [01:21:18]: We should also mention there has been a report that a lot of these labelers use ChatGPT to do their work.Nathan [01:21:25]: Yeah, I mean, I'm not surprised. So it's a lot of messy ground in RLHF these days. And then there are more training questions, which is what happens at the end of the day. I mentioned what I call qualitative alignment earlier on, which is: do the models get better in ways matching the preferences in the preference data? So if you collect two batches of preference data with different priorities, what changes in the downstream model? I don't know if it does anything. Should all data be equal? If you have healthcare questions, should they be weighted the same as "write me a joke"? This is all implicit to deep learning. Deep learning just scales and aggregates, and I think we are going to be on that ride, but it's not necessarily what some people would call fair or good. And then the kind of last slide that I have is fun, which is that John Schulman talks about this in his ICML talk. His ICML talk on proxy objectives for RLHF is public now; they made it public like three months after the conference, or some weird timeline. But he talks about things like ChatGPT being verbose and having self-doubt and refusals, things that are really in vogue in the conversation right now, and how those can emerge in the process of continually trying to adjust the RLHF process based on what users are seeing in the model. And this is a sort of outer loop optimization that no one in the open is even remotely qualified to talk about. But OpenAI does monitor it, and they'll rerun RLHF and train a new reward model with a mixture of their curated data and user prompts to try to make it work better over time.
And that's where the different model versions come from. And while there are a lot of critiques about this, they're definitely intentional and trying to fix things. I feel like it's probably whack-a-mole, where they're like, oh, there's this problem, we have the data, we can fix this. And then some new problem pops up after doing RLHF, and they're studying this. And if you could really figure it out, this is where things start to look more like RL: you could automate it. It's just a longer timeframe of optimizing the model.Alessio [01:23:19]: It would be cool.Nathan [01:23:20]: But I feel like I'm years away from ever actually working on this, but we can try to get details from people who are. Excellent.Swyx [01:23:28]: Awesome.Alessio [01:23:29]: Anything else that we missed? I think we covered a lot of it. I mean, I'm good.Nathan [01:23:33]: I would ask you guys if you know companies that are doing this and things. I know some in the RLHF-as-a-service space that will become busy, I think for good reason, just because...Swyx [01:23:44]: There are companies doing RLAIF as a service.Nathan [01:23:46]: Yeah, both of them exist. It depends on whether synthetic data is going to win over human data. If human data is the real winning feature in the end, it's a big capital investment, so it kind of makes sense as a VC model anyway. But there are going to be both of them for a while. It'd be cool.Alessio [01:24:01]: Do you see a lot of people, because I know Louis Castricato is starting a company, is there a lot of ambition in this field to start companies, or is this such a research-driven part of the stack that maybe it just stays there?Nathan [01:24:16]: There definitely is, because I know my former colleague Nazneen Rajani from Hugging Face is also starting a company in this space. The Falcon team who left Hugging Face, I think, is also working in this space. I don't really know exactly what; I haven't talked to them since they left, so I don't know what they're doing. Startups change a lot, but there are definitely a lot of people looking at this space. I mean, Scale's probably trying to do it. If I were Scale, I would want to do it. I think they've historically had trouble keeping technical ML talent, but they've started a new research lab, so that should help. It's a busy area.Alessio [01:24:50]: Cool. Yeah. Awesome, Nathan. Thank you.Swyx [01:24:52]: That was a masterclass. I think this is the first 201 that we've ever had, and you set the bar very high.
-
The Accidental AI Canvas - with Steve Ruiz of tldraw
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2024-01-05 20:43
Happy 2024! We appreciated all the feedback on the listener survey (still open, link here)! Surprising to see that some people’s favorite episodes were others’ least, but we’ll always work on improving our audio quality and booking great guests. Help us out by leaving reviews on Twitter, YouTube, and Apple Podcasts! 🙏 Big thanks to Chris Anderson for the latest review - be like Chris!Note to the Audio-only ListenerBecause of the nature of today’s topic, it makes the most sense to follow along the demo on video rather than audio. There’s also about 30 mins of demos and technical detail that we had to remove from the audio version, because they didn’t make sense without video.Trailer here.Full 90min chat:(In other words, pls jump over and watch on our YouTube if you can! Did you know we are now posting every episode to YouTube? We’ve been multimodal for a long time!)Trend 1: GPT4-V CodingYou might remember Greg Brockman’s hand-scribble-to-working-website demo from the GPT-4 launch in March. This was largely inaccessible to the rest of us until the GPT4-V API was released at Dev Day in November.As mentioned in our November 2023 recap, one of the biggest viral trends was tldraw’s open source “Make It Real” demo: starting from a simple wireframe and text annotations, you could create a real, functioning UI with the click of a button. Provoking another crisis of confidence in developer circles:And using state charts:And provoking responses from Excalidraw, a competitor.You can see us creating a Replit clone in this silent video here:Since our interview the new GPT4V Coding metagame has been merging app UIs and SQL with Supabase (another AIE Summit speaker) and other backend tools:* generating SQL* converting ERDs to SQL (part 2, for MariaDB)* seeding sample data* doing migrationsTrend 2: Latent Consistency ModelsAs covered in the Latent Space Paper Club in November, 3 papers drove a roughly 100x acceleration in the speed of text to image generation over the past year:* Consistency Models (with Ilya Sutskever)* Latent Consistency Models (from Tsinghua)* LCM-LoRA (also Tsinghua, same authors)With the invaluable help of Fal.ai (friends of the show and AI Engineer Summit and progenitors of the viral GPU Rich/Poor hats mentioned on the Semianalysis episode), TLDraw has also been at the forefront of putting this research into production, with two projects:* drawfast: add a prompt, start sketching into the canvas and see each stroke affect the drawing. Overlap multiple of them to extend and merge drawings.* lens: a collaborative canvas where in real time people can draw and have their sketch turn into AI-generated art. Start drawing at the bottom and see it scroll into the magic canvas. For nontechnical people in your life, we do recommend showing them lens.tldraw.com (and its predecessor that we discuss on the show) on your and their mobile devices.The Rise of Multimodal PromptingAt the first AI Engineer Summit in October, Logan (our first guest!) declared this the Year of Multimodality. Over the next 2 months we saw an explosion of activity in multimodal: GPT-4V’s API release at OpenAI Dev Day (our coverage here), LLaVA (our chat with author here on Visual Instruction Tuning), BakLLaVA, Qwen-VL, CogVLM, etc.On today’s episode we have Steve Ruiz, founder of tldraw. The project originally started as an open source whiteboard that Steve built for himself and then “accidentally made a really, really good visual multimodal prompting application environment”.
Turns out that infinite canvas and generative models are a very good match:* Design is iterative: DALL-E, Midjourney, etc all work in a linear way: prompt goes in, 1-4 images come back. As you generate more, the previous images scroll away from your view. In a canvas environment, you can see the progression of your generation and visually “branch” by putting new prompts in different spaces.* UI has “layers”: when designing interfaces there are different layers to it: the functionality, the style, the state, etc. Some of what they are building in tldraw is bringing images into the canvas to influence different layers: “One thing that we've done is to bring in screenshots of other apps, like here's Stripe.com, like make it look like Stripe, you know? Or like here's Linear.com, like let's do it this way”. In the episode we spend a lot more time talking through all of these ideas and how Steve’s background in fine arts came back to being really useful in building a multi-modal AI canvas. Enjoy!Show Notes* tldraw* Open Source Repo* Make Real (Wireframe to UI)* drawfast.tldraw.com* lens.tldraw.com* Perfect Free Hand and Perfect Arrows* “Make Real, the story so far”* Dog CEO* Other whiteboarding products mentioned* Excalidraw* FigJam* Adobe Whiteboard* See also Steve’s interviews on the Slow Steady Pod and TWiSt, and subscribe to his tldraw substack!* TLDraw Wireframe kit* TLDraw LLM starterTimestamps* [00:00:00] Introductions* [00:01:02] Steve's Background In Fine Arts and Transition into Tech* [00:08:22] The creation of tldraw and its open source origin* [00:15:44] The Inception and Growth of tldraw* [00:18:40] The Integration of AI with tldraw and Make It Real Feature* [00:21:56] Discussion on Multimodal Prompting and Iterative Design* [00:32:32] The Concept of Parallel Prompting in Design* [00:34:11] Impact of AI on developer jobs* [00:37:28] Additional Layers in Multimodal Prompting* [00:45:18] Introduction of DrawFast and Lens Projects* [00:50:03] tldraw 2.0 and the future of the project* [00:55:41] The Competitive Landscape of Canvas Tools and tldraw's Unique Position* [01:00:22] Advice for Founders: Following Your Interests and DesiresTranscriptSwyx: Welcome back to Latent Space. I'm very excited to have my good friend, Steve Ruiz. How are you this morning? [00:00:13]Steve: Hey, how's it going? [00:00:14]Swyx: I have had the good fortune of knowing you before you got famous and actually hanging out in the precise office and studio that you're recording from right now. Congrats on Make It Real. Congrats on tldraw. I think it's been something that's sort of years in the making, but it probably looks like overnight success to a lot of people. [00:00:32]Steve: Yeah. Thank you. It's kind of a funny story. I don't know. Where should we jump into it? [00:00:37]Swyx: Well, I like to give a little background on the person. You don't have a lot of detail on LinkedIn, just like myself. I just found out just before recording that you're also a late entrant into tech. So maybe just, what's your background coming into something like tldraw? What makes you so unique at doing sort of creative collaborative experiences like that? I know you, and I've actually used tldraw, so I have some appreciation for how hard this thing is. [00:01:02]
I have my master's from the University of Chicago in visual art. I would write about contemporary art, put together exhibitions, and do my own paintings and drawings. And that was back when I was living in Chicago. And then when I moved over to the UK, you know, got a new studio, kept that going. But when I turned 30, I kind of decided I should probably make some money and work more closely with other people than I was at the time. Studio art is primarily a solo thing. I'd always had kind of an analytical side to me. My day jobs were, you know, I was working for lawyers, I was doing this writing, like magazines and stuff. So when I did that switch back eventually to design and product design, I was also able to use a tiny little bit of technical skill that I had had, just building WordPress websites for myself and other artists as portfolios. Kind of take that, plus some natural curiosity around the way that products work, and create a career direction that was more around prototyping and technical design, doing the design on the bits of a product that really couldn't be designed otherwise. So the interactive bits, the bits where there are more questions about them, where there's no clear answer in terms of, how should this work? You know, in all those places, you kind of have to build something in order to figure out what you want to build. It turns out, you know, to skip right to the end for a moment, canvas is full of those types of problems. So it's no surprise that I ended up there. It's kind of an extreme form of the same problem. But yeah, so I was working, this was back in like 2017, 2018, and I used at the time a product called Framer. That was back when it was more of a code product than what it is now, which is more of a visual builder that is kind of backed by code. So I sort of just drilled into that. It was cool. Uber was using it. No one knew how it worked. No one could use it. So I got good at it and got a lot of advancement, early traction, whatever, in my career based on that. But it also taught me to code, taught me to think about building things that other people are going to use. Taught me about the type of code that you write when you're in an exploratory phase rather than in an execution, like production, phase. And I actually ended up working for Framer. I did their education for a year, which was very different than the type of product design that I was doing before that. I did a lot of video tutorials and writing and tweeting, trying to figure out some way to make technical design content interesting, you know, in little chunks that people could consume. I joke that they probably got less out of me in that job than I got out of the job itself. Because, yeah, I walked away from that not sure if I'd helped anyone really learn how to use Framer, but I certainly learned how to tweet and learned how to record a good GIF and learned how to talk into a microphone and all that type of stuff. And so in the next roles that I had, I worked for a company called Play out in New York, who is also doing design tools, and I really wanted to work in design tools after that. Play's doing, like, a mobile, I guess right now it's just general iOS, macOS platform specific design tools, where you're using actual elements from the kind of widgets from that component collection in your designs and kind of bringing that a lot closer to the end product.
At the same time I started getting into open source, and I'd kind of done some popular open source before. This was now 2019, it was lockdown. I had a little bit more time. I also had a daughter, so not that much more time. I guess the open source that I started getting into started swinging back towards some of my kind of artistic interests or studio interests and kind of visual interests. Those are the parts where I felt like the problem space was really underserved. It wasn't necessarily technical problems that were really hard. It was more subjective problems where I think the thing that was lacking was the taste or the opinions or the feeling for what good solutions were. So the first kind of problem like this that I got into was arrows. I had, you know, two boxes or two points arbitrarily placed. I want a "good looking" arrow, quote unquote, between the two. Well, that could be anything. That's not a math problem. Maybe it involves some angles and linear geometry and vectors and all that, but the good looking part was just my own taste and my own eye, and tons and tons of iterations. And arrows are super tricky, and there are a million edge cases, when things are overlapping or things are too far away or too close and all this. But I was working on this, and I was working on this in public on Twitter, recording GIFs of boxes and arrows kind of squishing together and all that. And I think people really liked that, and they liked kind of following me on this somewhat obsessive journey, which was technical, but it wasn't like trying to crack an algorithm. It was trying to figure out and identify the rules governing an aesthetic experience or an aesthetic thing, which was a good looking arrow. That became Perfect Arrows, and that was pretty popular. But the next one really is what kind of broke my popularity on Twitter, or just in the space, and that was a project that ended up being called Perfect Freehand. This is a little hard to describe. If you've ever used an iPad pencil or drew with a stylus in Photoshop or something, the harder you push, the thicker the line gets, and the lighter you push, the thinner the line gets. It's kind of this ink experience, and it's not an easy problem. But if you're doing it in a kind of Photoshop style, raster environment, you know, the solution is pretty straightforward. You interpolate tons and tons of whatever shape you're drawing in between each point that you've actually moved your mouse to, and you just change the size of that little stamp that you're making. So it's like a little circle, slightly bigger circle, slightly bigger circle, slightly bigger circle, but they're all really tightly packed together, and so it looks like a kind of line that's changing its width as it moves. My angle on this, the reason why I spent so much time on it, was that I wanted to do that using vectors. I wanted to get a bunch of points in and then a polygon that sort of defined the outside of that shape coming out, because I like to work in SVG, and it turned out that this was an insanely hard problem that no one had solved. And if they have solved it, they certainly didn't open source it, but I couldn't find any good example of a variable width line that actually worked fast enough and consistently enough, etc., for it to be digital ink.
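[Editor's note: a toy sketch of the raster "stamp" technique Steve contrasts his vector solution with; the point format and pressure-to-radius mapping here are assumptions for illustration. The vector version, which returns an outline polygon instead, is what the Perfect Freehand library actually does.]

```python
import math

def stamp_stroke(points, base_radius=4.0):
    # Raster-style variable-width ink: densely interpolate circular
    # "stamps" between input samples, sizing each one by pen pressure.
    # Each point is (x, y, pressure) with pressure in [0, 1].
    stamps = []
    for (x0, y0, p0), (x1, y1, p1) in zip(points, points[1:]):
        dist = math.hypot(x1 - x0, y1 - y0)
        steps = max(1, int(dist))  # roughly one stamp per pixel of travel
        for i in range(steps):
            t = i / steps
            stamps.append((
                x0 + (x1 - x0) * t,                   # interpolated x
                y0 + (y1 - y0) * t,                   # interpolated y
                base_radius * (p0 + (p1 - p0) * t),   # radius from pressure
            ))
    # Drawn as tightly packed filled circles, this reads as one line
    # whose width changes as it moves.
    return stamps
```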
And so again, I did this in public, did this on Twitter, a million GIFs of lines that look terrible, but, you know, slowly getting closer to the solution, attracting more people who had solved this problem or tried to do this, or they wrote their PhD on ink and let me tell you about, you know, how arcs work in this environment and all this stuff. [00:07:35]Swyx: Wow. [00:07:36]Steve: And it was fantastic. I met so many good people who were experts on this or something like it. And slowly we made a really, really good, tight little library for doing exactly what I wanted: here are a bunch of mouse points or just arbitrary points, give me back a polygon that surrounds them. And let me essentially draw a line around the edge of that polygon, fill it in, and it'll look like ink. So that was Perfect Freehand. And that's now used all over: Canva uses it, draw.io uses it, Excalidraw uses it. We use it at tldraw all over the place. It's just significantly better than the next best solution in that space, and there really wasn't even any known solution in that space. So someday I'm going to be checking out at a hotel and see my own ink on, you know, a little iPad or something like that. [00:08:21]Swyx: That's amazing. [00:08:22]Steve: So that kind of led right into tldraw: I had integrated my ink into Excalidraw and, you know, spent time in that code base. And I'd also created several infinite canvas tools to help me build Perfect Freehand, visualize it, pan and zoom my ink, and program against this thing. And so I had done a few of those, including Globs design, which I won't necessarily talk about, but it's kind of a weird experimental design tool. But anyway, it was an infinite canvas. It was like, you know, Framer, Figma, et cetera. And after doing Excalidraw and working on these kinds of projects that were in the same area, I was like, you know, maybe there's a market here, or not even a market. It was just, I think the thing that I want to work on next is a general purpose, kind of whiteboard-like engine, mostly for myself. I'd built globs, but the only thing that you could put on the canvas in globs was a glob. So I had all this code and these solutions, you know, hanging around, and I could kind of see how I would adapt it. And so that's what I started doing. And that was the next story that I was kind of telling on Twitter: okay, here's how selection works in something like Figma, or something like Miro or Framer or Sketch. It's these sort of conventions that are part of this really complicated thing called the infinite canvas, you know, going all the way back to Flash, and before then, you know, Adobe Illustrator, and before then all the way back. And they're all pretty consistent between products. If you're making a canvas this way, you have to kind of do them all. Your undo, redo should work in a specific way. Your selection should work in a specific way. The camera position and how the camera moves should work in a certain way. All the modifiers, like option-drag to clone. And all of those became their own little vignettes of how I was building this thing. This was now like spring of 2021. And I had everyone from any infinite canvas related creative product kind of in my inbox being like, Hey, can you come work for us? I was like, let's talk, let's do this.
And so I was either going to go work for Figma or Adobe. And I ended up going with Adobe, in part because I think FigJam had just come out and the team at Figma were like, well, this is competitive with FigJam. I'm like, this thing is like nothing. It's a little open source thing, you know, no one uses this. It's just me trying to get to 10,000 Twitter followers, but you know, it's mine. So no. So I went with Adobe, but I told them, I'm like, I don't want to start for six months. This is actually a pretty fun project for me. I want to get it out of my system, you know, let me start in January and just work on this. And so they said yes. And I quit working with Play and said, I'm going to go work on this little open source thing for six months. I have some contracting money in the bank. Let's drain the company account and do this. And that's not what happened. I went full time on a Wednesday, and by Thursday I had a very large communications company say, Hey, we're moving a whiteboard that we've designed for specific touchscreen devices. We're moving that into the browser. It turns out people want to use the whiteboard on their phones and on their laptops and all that, like they do with Miro. And so this thing that we wrote in C++ to be highly performant on these, you know, kind of tiny microcomputers that were part of these interactive touchscreen TVs, we need to make work on the web, and we don't think it's going to be good enough. We'd have to build from scratch. We don't have the team. Can we just build on what you're building? At the time, this thing wasn't open source, it was just sort of getting there. And I'm like, yeah, sure. Give me like $75,000 and I'll let you see the source code. I don't want to talk to you very often, you know, I'm not working for you. I never want to see your code, but you can look at mine. And they said yes to that. And so I was, you know, funded for those first six months, and I got to work on this without having to feel bad about it. And I'd also eventually opened up tldraw as sponsorware, so that if you were sponsoring me on GitHub, you could access it, you know, in its kind of primitive state on tldraw.com. And it had like a couple hundred people join that way and sponsor me. So at one point my sponsorship was, you know, over $5,000 a month, which is not massive money, but I wasn't doing anything different, so it was pretty good. That's kind of a passive thing. Anyway, I shipped it at the end of November 2021, and it was very popular. I just open sourced everything: the tldraw.com app, the library, the canvas, and it was organized in a certain way. I just made it all public. Everything was MIT, you know, let's just throw this out into the world and see where it goes. Well, it went pretty far. It was like number one on Hacker News for a while. It was like the top trending repo on GitHub. A lot of people, like 40,000 people, showed up at tldraw.com to use it on that launch date, which was all good. So far, this was all within my same narrative of, okay, this is cool, I'll make this and then I'll go do something else afterwards. The thing that really surprised me was how many teams wanted to build on this. And they weren't building whiteboards. They weren't Miro competitors or Figma competitors.
They were just apps that you wouldn't expect to have infinite canvases inside of them, and they wouldn't have built it except that I had suddenly made this very easy. I had suddenly shrunk the development time of this whiteboard-like feature in their product from like three years and three people to three weeks and one person, and not even one person: no new developers, no new team, no new graphics experts, no computational geometry guys. Like, you know, we can do this. The canvas itself is React all the way down. So even if you wanted to customize it, you'd just be writing React components and then a little bit more code on top. And so I was totally overwhelmed by inbound from companies who were like, I want to build this, or I want to acquire you, or I want you to build something for me. Or, you know, I want this in my app, how do you help me, how can I do this? And people were shipping things also within two weeks, three weeks, like production ready. People had taken this and run with it. And so the story that I started to get around tldraw was that, OK, well, this is a cool little whiteboard, but it's also kind of filling a gap that no one knew was there, in the same way that Mapbox or Google Maps, you know, provide maps for apps that would definitely not build maps themselves. Like maps are insanely hard; your little local food delivery app just wouldn't have a map in it, you know, otherwise. But it is a value add. If they can have it in there, then absolutely it is a value add. It's just completely impractical to do themselves. And what I learned talking to folks was that every PM had used Miro or used Figma or used one of these other collaborative tools. And every creative product person was like, well, this is fun. Collaboration is fun. This canvas thing is pretty cool. Like, you know, why can't we put our CRM on the canvas, or why can't we do our sales stuff here? We're already kind of using Miro for this. Why couldn't we give this to our customers as well? Why don't we build a product around this? And it was just a technical no until, you know, November 24th, 2021, when suddenly it was a technical maybe, and there was absolutely demand. So hence, you know, I had to call Adobe and say, no, I'm not going to come in on Monday. It turned out that the best possible outcome of this happened, and there's actually a company here. And then I went out and I raised a seed round from Lux in New York and Amplify in California and a whole bunch of really great angels, you know, on the story of, yeah, this is cool. It's a good app, feels good. Companies want it. And, you know, by then I had almost $200,000 of sponsorship, and people were just signing up and signing up because there was no way to even be a customer. [00:15:44]Swyx: You're not saying 200k a month. [00:15:46]Steve: No, no, no. But I mean, up to then, the amount of sponsorship that I had received was around $200,000. I think some of the recurring stuff was like 5,000 a month. Yeah. But yeah. [00:16:00]Swyx: Which is in the top echelon. A lot. [00:16:02]Steve: Yeah. Oh yeah. Certainly. Just the amount of validation that had come in around this was more than usual.
So I raised a round, put together a team here in London, and basically had just been building this whiteboard SDK since then. You know, we reconfigured the project around, okay, we're going to be building this not necessarily for end users, but for teams to use as kind of an infrastructure product, a developer product, something closer to Mapbox. You know, we were making demos to kind of show different ways that it could be used. Certainly the collaboration thing is a big one, but also the fact that you could put anything on the canvas that you can put on a website, just because it is all HTML, CSS all the way down. And that was going really well; it was already a good story. And then I just raised like a $2 million extension for the company. While I was on the final pitch for that, Dev Day was happening at OpenAI. And in the morning I woke up and I was getting all this kind of action on Twitter, because a developer at Figma had used tldraw to make this little demo where you could draw a website, click a button, and get back a big pop-up that had your website in there. It was like a prompt: you're a developer, you just got this wireframe from your designer. Can you give it back as a single page HTML file? And it would do it, and it could do it. And then you could show that website to whoever is using the app. And we took that and we're like, wow, you could do so much more with tldraw. It's only scratching the surface of the type of integration that you could do. Again, we had just finished the raise. Pressure was off a little bit. It was kind of getting towards the end of the year. I was like, all right, let's just take this and have some fun. Let's make some viral s**t. Maybe we'll get like 200 likes or something like that. And it exploded. I think we're at like 22 million views in the last 30 days or something like that. It's just Kanye West numbers. It was really, really, really popular for a couple of days. If you're on Twitter and at all technical, you might've just seen a ton of tldraw stuff on your timeline about two weeks ago, three weeks ago. [00:17:55]Swyx: Well, so yeah, that kind of brings us up almost to today. You just released something two hours ago, which we should talk about. Maybe this will be a good time to bring up the screens, you know, for those who are listening. [00:18:08]Steve: Let me share. [00:18:09]Swyx: We're recording a video as well. You can jump over to the YouTube to see stuff, but this is an inherently visual podcast, so we have to show stuff on the screen. The incremental thing I got from your blog post. So you did do a write up, which, thank you for that, because I actually didn't know that you did a write up. It was just drawn up. [00:18:26]Steve: Oh yeah. [00:18:27]Swyx: Videos. This is the power of open source, right? That someone else had the idea. You weren't even focused on Dev Day. Someone else had the idea and just, you know, made it without your permission or talking with you. And then the idea could spread back to you and you could run with it. [00:18:40]
And once this thing got popular, the first thing we did was create like a starter kit so that someone could take it and like run with it. So this is normal tldraw where you draw, you can whatever, move things around. It works if you've used Figma, if you've used Miro, it's kind of, kind of familiar to that. And you can put pretty much anything on this canvas that you want, like YouTube links, et cetera, because this canvas is HTML and CSS like divs and stuff all the way down. You can put things like YouTube videos on there. You can even make them play because again, like anything you can do in a website you can do on tldraws canvas. What's fun is because it is a canvas all the way down, you can also like draw on top and like do the kind of canvas manipulation stuff that you might do with normal shapes, but also with this type of content. So that ended up becoming like a big part of why make it real got kind of popular. So anyway, I'll show you make it real now. This was a hastily built layer on top of the kind of tldraw engine SDK that we sent out. And the idea here is that you can make a wireframe and we're going to send it to GPT-4 with vision with like a prompt, like much like the original one that Sawyer Hood had come up with, which is you are a web developer, you work with designers, they give you wireframes and notes and screenshots and all sorts of stuff. Could be anything. Your job is to come back with a single HTML file that has all the styles, all the JavaScript, all the markup necessary in order to make a real working prototype based on what you've been sent. It also has emotional manipulation, like you love your designers and you want them to be happy and like the better your prototype is, the happier they are. Oh, in the prompts? Yeah. Yeah. Yeah. Again, it's open source. You can read, read the prompt. It's kind of a funny one. This is part of the joy of like a multimodal prompt is we send it the photo, which kind of looks like the same as if you had done a copy and paste thing. Yeah. So like an image as well as all the text. And you had all this functionality worked out prior. [00:21:00]Swyx: Yeah. [00:21:01]Steve: Yeah. Yeah. [00:21:03]Swyx: Yeah. Like that's what I find so poetic about this, that you were just ready. [00:21:06]Steve: Yeah. It feels like we had gone off, you know, as collaboration and AI and stuff was going in one direction, we kind of just went off in our own weird, like, hey, the world is really going to need a whiteboard at some point direction. And then it just, they kind of met us where we were at. And then we've been able to just be like, show up on day one of this new world of possibility with like the thing that if I hadn't spent the last two years building this, I would spend the next two years building this. Like it is the right product for this type of a feature. So anyway, they give us back a HTML. We stick it into an iframe, put that onto the canvas, just like we did with that YouTube link and I can interact with it. So it should be going from orange to pink, orange to pink, hey, it's given us a hex code. I can click the hex code and it gives me, you know, it says it's copied it to the clipboard. [00:21:55]Swyx: That's incredible. [00:21:56]Steve: Like this alone is like super cool in something like V0 or some of these other kind of prompting environments, like the only way for you to then make this better, oh, maybe you can do this with ChatchubbyT or something and you could write like, oh, actually, you know, you missed the labels. 
Like it should say orange and pink, you know, on top of this thing. And it doesn't. So you could go back here and, like, you know, make sure that this is, whatever, you could change the input. But because this is tldraw, because you can draw on top of this stuff, you could also, you know, write on top. You could kind of modify this and maybe even give it the same type of markup that you would give to a designer or something like that, you know, and draw some arrows, or maybe paste in a screenshot and say, hey, make it look more stylistically close to this other thing. And then what you do is you select the website that they gave you back, the previous result, along with all this markup, and you use that as the new input. And so that's going to give you something kind of like an image that looks like this that you've now sent. But we've also kind of tweaked the prompt a little bit when you do include a previous result, to say, hey, the wireframes coming back are annotations or markup based on what you sent before. And there you go. So now we have a new prompt that, sure enough, the labels are there, you know, it still works just like before. The button is full width and, you know, it still works just the same. So we send it back. Again, we send it the image, we send it the text, the prompt. We also send it all of the text items themselves separately, because ChatGPT is not really great with recognizing text. So we say, oh, by the way, your vision's not so good, so we've made sure to have our copywriter, you know, list out all the copy that you can use. I think we even send it back the HTML that they used for the previous result. So we just dump as much information as possible at GPT-4 with vision, and that's how you're able to get these sort of iterative results. And it is legitimately good; it feels like work. It feels like you're actually doing stuff when you're iterating through this way and slowly shaping and adding complexity and going step by step, you know, as you're building something. And then you can copy a link to that and open that in a new tab. We host it, it's there forever, you can bookmark this. If you really just needed a slider between orange and pink, well, now you have one, you know, whether you could code it or not, or maybe it wasn't worth building or using a no code tool to build. But we just made that in five minutes. If you are more on the code side and you want to use this as kind of a foundation of a real project, or maybe just to see how it actually works, you can open it up in StackBlitz or CodeSandbox. I think tomorrow we'll have Repl.it, and yeah, see all the code, see what ChatGPT came up with, and kind of use it or adapt it or, you know, keep it going or do whatever you want with it. Yeah. Cause it is real. Yeah. [00:24:50]Swyx: Make real. Yeah. It's interesting that you can also, I've seen some of your other demos. It looks like you're about to move us on to another. [00:24:57]Steve: Yeah. I'm going to grab a couple. Okay. So what I have on the screen now, just to narrate and describe it: I have a drawing of, like, a kitchen timer, you know, where you can add a minute, add a second, you know, start or stop the timer, or reset the timer. And then next to it, I also have a state chart, like a state machine, describing the three states of the timer: stopped, running, or complete, and what each one of those buttons should do in terms of transitions or changing the state.
I think you can hand this to pretty much any designer or developer and get back a working result.
[00:25:32] Swyx: Like it's fully spec'd, sort of. I mean, our friend David Khourshid might say, you know, develop a state chart first and then plug it into XState.
[00:25:38] Steve: Yeah, exactly. Well, let's do a couple of things in parallel. First thing I'm going to do is just make a box over here and write "kitchen timer" right in the middle of the box. And this is going to be the only prompt that I give it. Okay, just going to click make real with just the kitchen timer box. As you see with this multimodal prompting, someone will draw a calculator in a lot of complexity and say, make this real, and sure enough, you get back a really complex, full calculator. But if you did the same thing with just an empty box and the word "calculator", it would give you back the same thing, because it knows what a calculator looks like and it knows how it works and all that. So next let's also give it just the user interface, without the state chart; we'll leave the state chart out. And then we'll do just the state chart and say, hey, make this real. And then we'll do both the state chart and the UI. So we have four different prompts with four potentially different results, based on variations of the same input. So first off, our kitchen timer, where all we did was send it a box with the words "kitchen timer". It has, I don't know what this box is for, but we have a time, we have start, stop, and reset. So I can double click in, I can click start. It doesn't do anything. Oh, what is this? Oh, whoa. Okay, well, if the number's there, then it'll run. If I stop it, it stops. I can start it, it'll keep going again. Okay. And I can reset it. And there we go, it works. The only weird thing is that it has a number input field for the number of seconds that I can type out. But you know what? In a pinch, I'll take it, if I really just needed to count 60 seconds or something. Next we have the result where the input was just my drawing of a kitchen timer. I didn't tell it it was a kitchen timer, I didn't send it the words "kitchen timer", and I didn't tell it how it should work, but it did produce something that kind of looks the same. Let's see if it works. So I'm going to click minute, second, start, reset. No. So unfortunately it did not make any working UI, although it did put the buttons in the right place, or something like that.
[00:27:51] Swyx: Maybe it over-focuses on the UI, because that's all you gave it. Yeah.
[00:27:57] Steve: Yeah. I mean, there is language in the prompt around, like, use what you know about the way that applications work in order to fill in the blanks here, in terms of the behavior and all that. But let's go to the next one. This one is where we only sent it the state chart. There's also something in the prompt that says: if it's red, it's not part of the UI; if it's red, treat that as an annotation rather than a thing that you should actually make. So this time it actually looks a lot like the previous one, but it does have these minute and second buttons. Oh, weird. It has plus and minus minute and second buttons, and it also has this "stopped" state written at the bottom.
So there's four buttons: minus minute, minus second, plus minute, plus second, and then there's start and reset. So does it work? I can add a minute. I can also subtract a minute. All right.
[00:28:44] Swyx: Honestly, that's pretty smart.
[00:28:45] Steve: I can add a second. I can also, yeah. And if I press start, we're now in the running state. Apparently it's going up rather than down. And I can reset it, and okay. I'm just curious: if I give it an additional prompt here and say, this should count down, not up, and just draw an arrow towards the start button, let me see if that'll make a real one. And then finally we look at the other example, which is where we sent both the state chart and the UI. We get something that looks much, much more like our user interface. The question is, does it work? Yes, it does. Perfect. I can stop it. Amazing.
[00:29:24] Swyx: Start it.
[00:29:25] Steve: That's a working timer. Reset it. Wonderful. And in this case, my feedback was accepted. I went back to the one where I'd asked it to count down and not up, and it all looks the same, but now it's counting down. So I think for folks who have worked in design, and in user experience design in particular, this should feel pretty familiar: sketching out and trying to do your best to specify what it is you want, then seeing what you get back from your designers or your developer. But having an environment in which that game loop, that iteration cycle, is instantaneous and essentially free is really, really wild. And you end up spending a lot of time not only getting into the head of the AI, like, okay, why is it getting confused? What am I sending that is confusing? How do I send more information in order to produce a better result? But it also really forces you to clarify your own expectations. Like, somewhere up here I have a drag-and-drop list, where you can drag list items between lists, and I started working on this and started spec'ing it out, and I realized this is not only really hard to produce a good result for, it's also just really hard to describe. The failure was really on my end, for not knowing how to get the information in there, because I didn't actually know how this thing should work. But I could figure it out, and I have an environment in which to figure that out. It's fun.
[00:30:49] Swyx: That's amazing. I'm still processing.
[00:30:51] Steve: During this, because this thing went massively popular on Twitter, thousands of retweets, there were some folks subtweeting it, like, you know, get over it, it's just a wireframing or no-code tool or something like that. One guy did say, you know, I prefer to wireframe the old-fashioned way, with pen and paper. And I was like, oh yeah, no, that works too. This works with screenshots. I can just take the screenshot he posted of the drawing that he had made. It's not even a good photo; there's a pen across one of the screens, et cetera. But if you just give that with no other information as a prompt, you get back a pretty good result.
You know, just from this photo of a piece of paper on the guy's desk, you have a not completely arbitrary result, a working website, inferred from just that picture with no other input, not even titles or anything else. And of course it's responsive and all this stuff. And so the idea is: yes, I've worked really hard to make all of our shapes really good and our arrows obsessively good and all this stuff, but the fun of the infinite canvas, and tldraw in particular, is that you can just dump whatever you want onto the canvas: screenshots, text, images, other websites, sticky notes, all that stuff. And the model, even as something that was in preview, the very, very first sort of multimodal model, can do a really good job at taking all that stuff as the input. And yeah, so we accidentally made a really, really good visual multimodal prompting application environment, or user experience environment. I'm not even sure what we're going to call this thing.
[00:32:32] Swyx: In our pre-show prep, you also talked about parallel prompting. Is that basically just prompting and then moving on to something else? Is that what you've been showing us?
[00:32:41] Steve: Yeah, that's kind of what we did up here with the stopwatches: the fact that we could get multiple prompts going at the same time and arrange them spatially. People have done this also with imagery, to say, okay, here we're going to use DALL·E, and we're going to make a tree of prompts as you go, different iterations based on whatever. You make four iterations, you pick your favorite one, you keep going, kind of like what you do in Midjourney. But to have that be spatial, arranged on a canvas so that it actually makes sense to you, so you can look back and follow it forward: that whiteboard, infinite-canvas stuff is just really good for a lot of things. So, organizing a whole bunch of different content that is irregular or ephemeral or has a kind of ad hoc meaning configuration, you know, things that are next to each other, or things that are in a grid, or, in this case, even what we have here with the stopwatch: there's an implicit meaning, like, okay, the source is on the left, the result is on the right, and any further iterations are further on the right. We didn't put that into a data model. We didn't structure that in any way. That meaning relationship doesn't actually exist in any part of the product; it just exists to us, because we can make sense of it for this type of thing. Not only is it cool that now a model can make sense of it as well, but yeah, for organizing complex iterations of imagery, complex iterations of outputs, et cetera, the canvas is the place. I really do believe that. Yeah.
[00:34:11] Swyx: I mean, that's really incredible. I think a few developers are kind of scared about how much of their jobs this does. Obviously, there's a lot more that it can't do.
[00:34:22] Steve: Yeah. The "will this take my job" story is interesting.
[00:34:26] Swyx: I'm not actually concerned, but I'm curious. I think this augments. Actually, my concern as a developer is that this is good, but not good enough. You know, it's good for throwaway UI, but would I actually export the code and take that code? I don't know.
It looks like your first MVP was just HTML files, which, you know, if it's a single HTML file, it can have some JS and some CSS. I saw some problems with layout in there, and I don't know for sure how it handles layout. It looks like you could just prompt it for Tailwind if you want Tailwind. I assume it can generate React. I don't know. What are the limitations of this thing?
[00:35:05] Steve: There are the limitations in that particular demo, which is that it couldn't do React, because it needs to be a single compiled thing, ready to go. You can't do any multi-page stuff or anything like that. But that's more about how we're structuring the project than a specific requirement of the project itself. There are two kinds of things. One is: how big is the input window, and how big is the output window. In theory, you could have the input be an entire full-stack React application, together with all my UI and all my components, et cetera, plus a screenshot that I took of the landing page where the menu is in the wrong spot, annotated with some arrows and some text to say, here's where I want it to be, or here's what I want, et cetera. And the output could be a diff that I can apply to my code base: basically, produce the commit that would change this, have that commit be against multiple files, so you potentially have a solution that is just ready to go, applicable like a patch, a PR that you can make. There really isn't any limit on that, and we've seen that with Copilot, et cetera. The challenge is more on the input side than the output side. Absolutely, you could figure out a way for this thing to spit out a working iOS app or something like that. The question is: how do you tell it what you want, and how do you iterate when it gets it wrong? Just doing zero shot, zero shot, zero shot is a really frustrating process. But if you do have a way of iterating, a way of step by step moving towards the solution that you want, getting to, okay, well, this is good, but it's not great, could be better, et cetera, that's how you actually make that type of complex output more practical or more realistic. You probably won't ever get the prompt just right. Even if you have a really, really good agent three generations from now, you still have to put that information in, and you're never going to put all the information in the first time; you need to be able to iterate on it. And so with visual stuff, I feel like the canvas, what we were looking at, that's part of what it unlocks: that space of iteration, where you have a way of marking up the result and using that as the new prompt. And that's kind of new.
[00:37:28] Swyx: Yeah. Multimodal prompting is such a brilliant concept that I think it's going to be a norm for some things. In my mind, you demonstrated, coming from Photoshop, this concept of layers. You can kind of simulate layers in tldraw. And I see an emergent property of layers in this kind of prompting: there's the UI layer, and then there's the state chart layer. And those two things seem pretty useful in specifying a prompt.
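A quick aside to make the request shape concrete before the conversation moves on. Here is a rough sketch in TypeScript of a make-real-style call against the GPT-4 with vision API as it existed at the time (the since-renamed gpt-4-vision-preview model). This is not tldraw's actual implementation, which is open source; the prompt wording and helper names here are illustrative:

```ts
// Sketch of a make-real-style request: a wireframe screenshot plus the
// canvas text, asking for one self-contained HTML file back.
async function makeReal(
  screenshotDataUrl: string, // PNG of the selected shapes, as a data: URL
  canvasText: string[],      // text shapes sent separately, since vision models misread small text
  previousHtml?: string      // include the prior result when iterating on annotations
): Promise<string> {
  const system =
    'You are an expert web developer. A designer sends you wireframes, notes, ' +
    'and screenshots. Reply with a single HTML file containing all markup, CSS, ' +
    'and JavaScript needed for a real working prototype.';

  const userContent: unknown[] = [
    { type: 'image_url', image_url: { url: screenshotDataUrl } },
    { type: 'text', text: `Text on the canvas:\n${canvasText.join('\n')}` },
  ];
  if (previousHtml) {
    userContent.push({
      type: 'text',
      text: `The wireframe contains annotations on your previous result:\n${previousHtml}`,
    });
  }

  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'gpt-4-vision-preview', // the vision model available when this demo shipped
      max_tokens: 4096,              // the default is too low for a full HTML file
      messages: [
        { role: 'system', content: system },
        { role: 'user', content: userContent },
      ],
    }),
  });
  const json = await res.json();
  return json.choices[0].message.content; // the HTML to drop into an iframe shape
}
```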
I was just wondering if you've thought a little bit about other dimensions or other layers that would be useful in multimodal prompting.
[00:38:02] Steve: Yeah. One thing that we've done is to bring in screenshots of other apps, like: here's Stripe.com, make it look like Stripe, you know? Or here's Linear.com, let's do it this way.
[00:38:16] Swyx: Make my dev tool a website. Or "make pop". Exactly. You should just give it a design and ask it to "make pop" instead of "make real".
[00:38:25] Steve: Yeah, exactly. Make it pop more. So there's the idea of bringing in style as another part of the input. Flowcharts are absolutely useful. I mean, it really just boils down to: what would you give a developer who you are working with completely asynchronously? If you had to spec out a project, print it out on paper, and mail it to a developer, and they were going to mail back a disk with an HTML file on it, what would you send? If you were sending this to the moon or something. So yeah, definitely descriptions of how the state should operate, and specs on that. We've even just pasted in code, like: here's a whole bunch of JSON that you can use, and have it just read that as the input data. You can point it at specific endpoints. You can say, I want you to hit this endpoint and then display the results as cards, or as items, or something like that. And you don't even have to wire this up. It's not like Retool or anything where you have to register that; it's not built into the tool. You just...
[00:39:29] Swyx: From an endpoint.
[00:39:30] Steve: Yeah. Yeah. I'm trying to think of what a good demo endpoint would be. Maybe we could do one more test. What is it, dog.ceo?
[00:39:38] Swyx: Is that? Yeah. Dog.ceo is a good demo. Yeah.
[00:39:42] Steve: I've used that one. I mean, this might be kind of like the box with the word calculator; it might just know, because it's probably been in a bunch of tutorials.
[00:39:48] Swyx: It's in the training set. Yeah. You're not sharing, by the way.
[00:39:51] Steve: You know what? We'll do it anyway. I'll share it. We'll try.
[00:39:55] Swyx: Dog.ceo is one of those demo APIs that you just set up just because it's not offensive.
[00:40:02] Steve: And...
[00:40:03] Swyx: Yeah, exactly. There's some useful dogs, and everyone likes looking at dogs.
[00:40:07] Steve: You can get dog.ceo.
[00:40:09] Swyx: I definitely didn't think about hitting endpoints, just because it's not in any of the demos I've seen.
[00:40:15] Steve: Yeah. But it works. Let me see. I'll have a big button down here: "Show me a dog." Okay. So that's going to be our show-me-a-dog button. This should be a picture of a dog.
[00:40:26] Swyx: Oh, that's a great dog. No, that's a cat.
[00:40:30] Steve: Thank you. And then we'll do some annotations here. We'll say: when this is clicked, get a new dog.
[00:40:36] Swyx: There's those perfect arrows coming in.
[00:40:38] Steve: Yeah, exactly. When clicked, get a new dog from... I'll just paste in this, and put the result in the image. Okay. So it's more of an instruction than you would normally give.
[00:40:52] Swyx: Yeah. One thing that it's going to have to guess is, you know, the response format, right?
[00:40:57] Steve: Cause it could be anything. This is true.
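For reference, the response format being guessed at here is simple: dog.ceo's random-image endpoint returns JSON shaped like { "message": "<image url>", "status": "success" }. A hand-written TypeScript version of the behavior being asked for (the model's actual generated code will differ, and the element IDs below are our own) might look like this:

```ts
// Hand-written sketch of the "show me a dog" behavior; the generated page
// will differ. dog.ceo returns { message: <image URL>, status: "success" }.
async function showMeADog(img: HTMLImageElement): Promise<void> {
  const res = await fetch('https://dog.ceo/api/breeds/image/random');
  const data: { message: string; status: string } = await res.json();
  img.src = data.message; // "put the result in the image", per the annotation
}

// Wire it to the big button, matching the "when clicked, get a new dog" note.
const img = document.querySelector<HTMLImageElement>('#dog')!;
document
  .querySelector<HTMLButtonElement>('#show-me-a-dog')!
  .addEventListener('click', () => showMeADog(img));
```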
Let's see if it works. Yeah. And let's see if it hit the endpoint in the right way. So, dog button. Yeah. Okay. It hit the right endpoint, got back the JSON with the dog image, and then it put it in. So there you go: you have yourself a JavaScript tutorial in a box, ready to go. And we probably wouldn't do this on camera, but you could say, you know, use this auth token and go get real data back from this thing. There's no reason why it wouldn't be able to do that.
[00:41:34] Swyx: You're kind of relying on the OCR.
[00:41:35] Steve: Well, not really, because again, inside of the prompt for this, we do give it an array of all the text that you've put in. We say: look, I know your vision isn't so good, or you have a hard time reading text sometimes when it's small. Because the input that it gets is pretty wild: it takes this as a PNG, and then (I can't show this in tldraw, but) it resizes it, squishes it into a 512 by 512 image or something like that.
[00:42:05] Swyx: It tiles it.
[00:42:06] Steve: Yeah. The text especially can get kind of chunked up, especially if it's small. So we send those strings separately, so that it can reassemble anything that it can't read right off the bat. This is a weird future that we've found ourselves in. Pretty cool. Yeah.
[00:42:23] Swyx: I mean, one layer I automatically think of is back-end, right? As someone who has worked at AWS, I see a lot of systems diagrams, cloud diagrams, entity relationship diagrams for databases. So I wonder if anyone's tackled extending this to back-end, and then obviously the next level from that is full-stack apps, where you have a back-end in front of it.
[00:42:43] Steve: Yeah. I mean, there was someone on Twitter who was using this to generate flowcharts. I'm not a back-end guy, so I don't actually know exactly what the output was, but I believe it was a configuration script for AWS that was built off of this. I think he just copy and pasted a diagram that he had made in tldraw anyway and said, okay, let's throw this at this thing and see what it comes up with, tweaking the prompt to say: rather than building single-page websites, return the JSON description of this configuration, or return a script that would set this up. You could tweak it to say: here are all the entity relationships between different tables, or items in tables, and give me back the SQL initialization that would make all these tables and set up these relationships. Yeah. Again, the hard part is getting that information in. I don't know, pictures are really good.
[00:43:35] Swyx: They may speak a thousand words. Awesome. So that's one of two, what I think of as multimodal viral hits in November. The other one you also had a part to play in, which is the latent consistency models (LCM) trend, where I think you worked with Fal.
[00:43:51] Steve: Yeah. So actually, I have something to show here. We have a couple of things to show here. We connected with Fal because they used tldraw to create a demo for their LCM, right? Yeah. So we did that, and we made drawfast.tldraw.com, which is basically: you get these little draw-fast shapes, and it puts the result, basically grabs that new image, right next to it. And these are extremely fast. So as I'm moving things, you should see the image updating as well.
And I think this one was originally, I don't know, a wise princess; I'd play with this more with my daughter than anything else. Yeah, the kids must love it. And, oh my gosh, she does. And actually, we had a lot of folks on Twitter being like, this is not good, whatever. Because I had a video of my daughter drawing, and she made this awesome drawing of a mermaid, and we turned it into this really anonymous, crappy version of an illustration of a mermaid. And they're like, no, no, the children's drawing is much more interesting. I'm like, yeah, yeah, yeah, come on, who cares? Of course it is. But, you know, this is fun.
[00:45:03] Swyx: Yeah, I do think you might do animations, like you could make some kind of stop-motion film. Yeah. I mean, we just need to do more work on consistency, but this is getting there.
[00:45:18] Steve: Yeah, it is. The fun is that after playing with this for a little while, you end up getting really into the particularities of the input. Like, what can you do with a design tool? Okay, you can move things around, right? I can grab some of these and move them around. Oh yeah, there's a highlighter here too, so we could do some highlighting; that'll do stuff. And then we couldn't help ourselves, we started making these little stories. So all right, I'll move on to the other one, which we released earlier today, called lens.tldraw.com. So that was drawfast.tldraw.com. And again, this is probably not making for good podcast audio, but the image updates as soon as possible based on what the input drawing is, and it is pretty hypnotic. So this one's a little riskier, because it's live. So we took a project called Together, which is a vertically scrolling, infinite, collaborative drawing experience, a little bit like a chat room. As you're drawing, everything's just sort of moving up, and it disappears off the top of the screen, never to be seen again. So it's kind of just fun to play with.
[00:46:21] Swyx: By the way, one of the most magical chat experiences I ever had was with you. I think you were with your daughter or something, and I was showing off Together. And you started writing, I started writing, and we had a chat on together.tldraw.com.
[00:46:34] Steve: Yeah.
[00:46:35] Swyx: I was like, what is this?
[00:46:37] Steve: It's super cool. Inevitably someone will write, you know, "where are you from?", and everyone's chiming in and talking about it. So I'll describe what's on the screen now: we're taking a screenshot of a square out of the center of this chaotic, vertically scrolling chat experience, and we're sending that to the LCM and putting back the image based on a prompt, like "desert scene" or "busy marketplace" or "futuristic cityscape" or something like that. And so it is updating like 10 times a second as we go.
[00:47:12] Swyx: It's updating surprisingly quickly, like 10 frames per second.
[00:47:14] Steve: No, I think it's down to 32 milliseconds now, basically, as you go. And so if I draw a big orange thing down here, it's going to show up in the drawing. Maybe I'll do a big black one so you can see better. It just sort of becomes part of the input to this prompt, and it is extremely hypnotic. This is, again, lens.tldraw.com. Yeah.
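A sketch of the loop Steve is narrating, with the caveat that everything service-side here is a placeholder: the endpoint URL, request body, and response shape below are hypothetical stand-ins, not Fal's actual API. The real pipeline also runs through Cloudflare Workers and CRDT-based collaboration, which this skips entirely:

```ts
// Hypothetical sketch of the lens.tldraw.com loop: snapshot the center of
// the canvas, send it with a text prompt to an LCM image endpoint, paint
// the result back, repeat. LCM_URL and the payload shape are placeholders.
const LCM_URL = 'https://example.com/lcm'; // hypothetical endpoint

async function lensLoop(canvas: HTMLCanvasElement, prompt: string) {
  const output = document.getElementById('lens-output') as HTMLImageElement;
  while (true) {
    const snapshot = canvas.toDataURL('image/png'); // center-square crop omitted for brevity
    const res = await fetch(LCM_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ image: snapshot, prompt, strength: 0.65 }),
    });
    const { image } = (await res.json()) as { image: string }; // hypothetical shape
    output.src = image; // paint the hallucinated frame back next to the canvas
    // No artificial delay: the round trip itself paces the loop (~30fps here).
  }
}
```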
It's like this slow-moving, collaborative, kind of hallucinatory experience, and it just never ends. I mean, yeah, I'm probably going to be funding Fal completely for the next, you know, their Series A or something like that.
[00:47:55] Swyx: I don't know. I have a healthy respect for the amount of processing that must be going on behind these things.
[00:48:00] Steve: Yeah. Well, what's funny is that we're using Cloudflare Workers to do the updates, and CRDTs to do the collaboration, and whatever LCM models to create this image. But there's also a laptop in my living room right now that is doing the actual screenshotting and sending that up. And so there's a big note that I had to write for my family: don't turn off this laptop, don't close this laptop, because this needs to be on in order for this thing to work. And no matter how good our tech stack gets, we'll always come back to some laptop stuck in the corner that can't possibly be turned off.
[00:48:39] Swyx: That's pretty fun.
[00:48:40] Steve: Yeah.
[00:48:41] Swyx: I've heard of major businesses being run that way. Yeah, exactly.
[00:48:43] Steve: Raspberry Pi in the closet.
[00:48:45] Swyx: Yeah. You know, it's really funny, because you are inventing your own art form. This is fine art. Going back to your degree, it's just a different kind of art.
[00:48:54] Steve: It's funny, because while the output of this is a visual output, the output doesn't actually matter. It's gone in 16 milliseconds and it's not coming back. And I think with all this AI stuff right now, just where we are with it, and just how completely unknown it is in terms of where it's useful, the best thing that you can get out of this is the experience. And so I think the thing that people will walk away with from playing with lens.tldraw.com should be the experience of having interacted with this thing, interacted with it among others, rather than, oh, it made my favorite image or something like that. I don't know. As a former image-maker, the idea of having an aesthetic experience where the image is a major part, but not necessarily the important part, where no one of these images is the important part... I don't know, there's something new-feeling about this. Kind of fun, certainly. I wish I could do a big critique with all the new media arts people about this, and about where this fits into other people's work, et cetera.
[00:50:03] Swyx: That's for them to write, and for you to build. And I would encourage you to keep building there, because you're doing so well with your explorations. I can sort of round it out by looking towards the future. You hinted a little bit that you're working towards tldraw 2.0. So first of all, actually, it seems like you're very focused on the core mission of the canvas, and the AI stuff is a side project for now. Why not pursue it fully? Why not pivot and be an AI company, right? I'm sure you've got a lot of those questions.
[00:50:35] Steve: Yeah.
I mean, when you get something as viral as tldraw got, I think I've talked to everyone, certainly every investor. And yes, for something like Together or that drawfast thing, we probably could make a tiny little SaaS app: give me $10 a month, play with this thing. We could make it good; we could go in that direction. But there's not much of a moat around any of this stuff, and we're seeing that. You know, Gemini is going to come out in a couple of days or weeks or whatever, and if it's better, people are just going to use that until the next better thing comes along. There's not a lot that's uniquely defensible about "hey, it's a drawing app plus an LCM model", because there's going to be a lot of those models and there's going to be a lot of drawing apps. The thing that I think is really unique for tldraw, the thing that we have added that is not easily created, is the canvas itself: web-based, hackable, extendable, with super refined interactions and all that stuff. All the thousand table-stakes features that drive people nuts when building something like this, they're all there, and they're all good. Day one, you could build a really great experience, whether it's AI-driven or not, using tldraw, in a way that's just not practical if you're building it yourself. And especially if you're not doing graphics stuff, there's really not that much else out there oriented towards this type of thing. And in a world where these types of AI-driven capabilities are just going to keep coming out faster and faster, you know, next year is going to be wild; every month there's going to be some new capability or something. The thing that I would want to see, both as a person and as someone who has built the business that I've built, is for tldraw to become the place where some of this prompting, some of these ideas, are explored. Even if we decided, okay, we're going to close everything up, we're going to build a product based on this, and maybe it's a great product, it would only be one direction, one ray into this infinite space of possibility. And that could be successful, good, but I mean, we've built the direct-manipulation core, and there are so many AI-specific APIs that we could build around tldraw, for having a virtual collaborator, or working with images in a richer way. There's just so much that we could build to make this the best possible place to explore, not just one direction, but many, many directions. And that narrative gets me much more excited. And I think, with the team that we have and the tech that we have and the skills that we have, we're more the team to build that than to become a SaaS product company. I'm not saying we'll never do a "pay us 10 bucks a month and you can play with our magic toy" thing, but primarily my goal is to make tldraw either the place to explore these different models, or, you might think of it as the battleground on which the winners will be identified. Like right now we're using OpenAI for the make real thing.
Maybe next week we'll be using Gemini, and now it's a question of: okay, we have an environment in which to compare these two models with the same input, and a very advanced form of input at that. Let's see which one does better. Nothing would make me happier than to be sort of the battlefield for multimodal prompting and multimodal AI experiences.
[00:53:58] Swyx: I should also shout out BakLLaVA as the open source vision and multimodal model. So I fully understand: you want to own the light cone of multimodal prompting. I think that'll probably be the title of the episode. What's coming up for tldraw 2.0?
[00:54:15] Steve: So really, the tldraw that you are using now and that I'm using is basically 2.0. It's been in pre-release for a long time. Really the only change that's going to happen once we launch it is that we're going to start selling commercial licenses for it. So if you are using tldraw in a commercial product, or if you want to, then, if you're funded or if you have revenue, you'll buy a license and I'll add you to our special list of customers. So yeah, it's mostly just go-to-market and the necessary changes around that. There will be some kind of fun, secret-saucy changes at launch, but nothing substantial, nothing breaking. We've put a lot of effort in; it's crazy that this new version has only been open source since May of this year, right? And we've been very busy since then, but it is stable, it's robust, we've put it through a lot of usage and caught a lot of the issues. So it's absolutely ready to go. But I have one or two conversations with my lawyer before we turn over the license and start moving it that way. Gotcha.
[00:55:12] Swyx: And then maybe, if I could get your commentary before we close, on the competition out there: you are not the only canvas tool. I was going to ask about Figma and FigJam, and they have some AI thing that they're also doing. I think Adobe is also working on similar things. Canva is also working on similar things. But they're all individual point solutions, whereas you're more the open source canvas to power all of them. I feel like it's just Excalidraw that's the other alternative that remains.
[00:55:41] Steve: I think Excalidraw, and I like Excalidraw a lot, I've contributed there, and we retweet each other and tease each other on Twitter. And early on I was copying features from them; now they're copying features from me. But no, the collaboration space has so many dominant players that I think me and Excalidraw are tiny within it. There's two things. One is that we made this very strange bet on a kind of web canvas: our canvas is not an HTML canvas element, it's normal React components all the way down. So if you wanted to add something interactive and have that participate in the space of the canvas, the way that we were doing our iframes, being able to draw on top of an iframe, you can't do that in Excalidraw. You can't do that anywhere. That is a very strange tech choice that we made around tldraw that is, you know, finding its home in a few different ways. Most of the people who pick tldraw and approach me, the inbound that I get, are folks for whom that's the killer feature: being able to put interactive widgets on the canvas using just React.
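For a sense of what "just React" means in practice, the baseline embed is only a few lines. This matches the 2.0 pre-release packages around the time of this episode; the entry points may have changed since, so treat it as a sketch rather than current docs:

```tsx
// Minimal tldraw embed: the whole editor is a React component, and custom
// shapes are ordinary React components layered into the same tree.
import { Tldraw } from '@tldraw/tldraw';
import '@tldraw/tldraw/tldraw.css';

export default function App() {
  return (
    <div style={{ position: 'fixed', inset: 0 }}>
      <Tldraw />
    </div>
  );
}
```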
No matter how good Figma's AI solution is, and I hope it's great, because I love Figma and I use it, it's not going to solve every possible problem in this space. It's not even going to touch most of these things. And I mean, I'd already identified that, OK, there was a point where any Kanban board was Trello, right? When you talked about Kanban boards, you were talking about Trello. Kanban boards are in every productivity app now. I think the same thing is going to happen with collaborative whiteboards. People like them. I'm making it easy; people are already doing it even without tldraw, when it's hard. That's going to become a kind of commodity user experience in a lot of different products. "Give me a diagram from a text prompt": yeah, that is probably going to be a commodity. "Give me an image from a text prompt": yeah, that's just going to be everywhere. We're just going to assume it, the way adding a GIF to a chat is just assumed; there's no moat there. I do hope that Figma has an amazing AI integration, but I think the thing that it will help you do is use Figma. Generating an image won't be super useful, but "autocomplete this design" absolutely would be, and I hope they launch something amazing there. But yeah, like I said, there's just a million different directions that this stuff could go in. The canvas is just an input device that allows a certain type of user experience, and that's certainly not limited to design. It's not limited to whiteboarding. It's not limited to collaboration or anything like that. Yeah, my hope is that there are those 10,000 products that could be made with what we're making.
[00:58:26] Swyx: Yeah. That's a really great mission, and I see why you're so passionate about it. You're the right team for it. Okay, a couple of lightning round questions. One: if there were some AI capability that you would wish for that you don't have yet, what would it be?
[00:58:39] Steve: Oh, that's a really good question.
[00:58:42] Swyx: Helps people to do some research.
[00:58:44] Steve: Yeah. I think probably related to, it's not quite a CRM, but human, just normal relationship management. This is something that I'd never had a problem with until I had a startup, actually, where there's just a lot more people involved in my life, and it's hard to keep up with them all. And I think this is probably something that an EA kind of does, saying, hey, there's a birthday coming up or something like that. But also just identifying opportunities to work together, to connect, or who's an expert on this thing that I'm working on; that doesn't always occur to me. And the value of your network is that even if you're good at this, you're probably only scratching the surface of how you could be helping the people around you, and how they could be helping you, based on the specific context of what you're working on and the problems on your table today. Yeah.
[00:59:36] Swyx: I've also wanted to build a CRM on top of Twitter, because you have all the info there about what people are working on, your past conversations with each other, and your shared interests. At a bare minimum, to search it; but to proactively suggest is the next layer. And I guess AI chief of staff, AI executive assistant, something like that.
I think some people are working on that, but the problem is so big that they're working on the automation piece first. Like Lindy, which I had at my conference: it's a virtual assistant that you can trigger on your desktop or via email, and it mostly deals with scheduling, but also helps you do a little bit of research. So yeah, I think the agents field will progress there. It might take 10 years to do it. Yeah.
[01:00:19] Steve: Yeah. I can wait. It's all good.
[01:00:22] Swyx: And then finally, advice for founders: what has helped you the most as a founder? You're two years into your journey.
[01:00:29] Steve: Yeah. So this kind of comes a little bit out of what you learn in art school. When you're a studio artist, in a studio or whatever, there are no external constraints. You just kind of are running on: well, what do I feel like working on? And the further you get away from what you feel like working on, the worse your work becomes. So having a really good feeling for that desire, and being able to respect and follow it, matters, because it's not arbitrary. If you really, really feel like working on a thing, that might be the tip of a very complex iceberg of analysis of the field, or what people are talking about, or directions in the market, or something like that. I don't know. With tldraw, as a founder, the thing that I've tried to do and tried to preserve is being able to prioritize based on what is most interesting right now. That is true for what code we write and what features we work on. That's true for which partners we spend time with, in terms of who is using tldraw, and the types of problems that we want to solve. Using your own sense of what's interesting, what sounds like a fun thing to work on right now, as a filter: it's not naive, and it can be part of your secret sauce. I think a lot of early founders are encouraged against that, and encouraged to work backwards from a certain outcome and all that. And yeah, you do have to do that; you have to put that into the mix as well. But be sure that you're picking the best parts out of that mix, the parts that you want to work on.
[01:02:12] Swyx: Well, I mean, what's the point of doing this if you don't have some fun? Indulge your curiosity.
[01:02:16] Steve: Yeah. Worst case, you'll build something that you love. Yeah. Yeah, exactly. Good things can come out. Good things can absolutely come out of it.
[01:02:24] Swyx: You had an 8,000% increase in your followers or something.
[01:02:29] Steve: Yeah. If you're a Substack reader, on the tldraw Substack, 72 hours into this big make-real virality explosion, I sat down and wrote a blog post, because I wanted to at least capture the vibe of what it felt like in the middle of that hurricane. So it's a pretty fun one. Very special. It's good to read back.
[01:02:48] Swyx: Well, I'm sure it's not the last time we'll see you do something crazy viral. I'm sure a lot of people will be exploring tldraw. Honestly, one thing I'm thinking about is embedding tldraw into my input box.
Can't tldraw be, you know, part of the input?
[01:03:05] Steve: Hey, I'm talking to the good folks over at OpenAI tomorrow. Fingers crossed. Maybe we get it inside of ChatGPT or something. Cause yeah...
[01:03:15] Swyx: I need to move faster.
[01:03:17] Steve: Like what? You want to take a drawing, or take a photo and then annotate it, or sketch something out. You should be able to do that.
[01:03:29] Swyx: Yeah, exactly.
[01:03:31] Steve: Yeah. It's just a good thing. The people cry out for it. I can't build it fast enough.
[01:03:38] Swyx: Well, thank you for inspiring the rest of us. Thank you for everything. And I'm sure we'll hear more from you over the next few years. So thanks.
[01:03:46] Steve: Thanks.
[01:03:46] Swyx: Thanks for your time. Awesome.
[01:03:48] Steve: Thank you for your time.
[01:04:01] Get full access to Latent.Space at www.latent.space/subscribe
-
NeurIPS 2023 Recap — Top Startups
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2023-12-30 16:30
We are running an end of year listener survey! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here.

We can’t think of a more Latent-Space-y way to end 2023 than with a mega episode featuring many old and new friends recapping their biggest news, achievements, and themes and memes of the year! We previously covered the Best Papers of NeurIPS 2023, but the other part of NeurIPS being an industry friendly conference is all the startups that show up to hire and promote their latest and greatest products and papers! As a startup-friendly podcast, we of course were ready with our mics to talk to everyone we could track down. In lieu of an extended preamble, we encourage you to listen and click through all the interviews and show notes, all of which have been curated to match the references mentioned in the episode.

Timestamps & Show Notes
* [00:01:26] Jonathan Frankle - Chief Scientist, MosaicML/Databricks
  * see also the Mosaic/MPT-7B episode
  * $1.3B MosaicML x Databricks acquisition
* [00:22:11] Lin Qiao - CEO, Fireworks AI
  * Fireworks Mixtral
* [00:38:24] Aman Sanger - CEO, Anysphere (Cursor)
  * see also the Cursor episode
  * $8m seed from OpenAI
  * Tweet: Request-level memory-based KV caching
  * Tweet: GPT-4 grading and Trueskill ratings for rerankers
* [00:51:14] Aravind Srinivas - CEO, Perplexity
  * 1m app installs on iOS and Android
  * pplx-online api 7b and 70b models
  * Shaan Puri/Paul Graham Fierce Nerds story
* [01:04:26] Will Bryk - CEO, Metaphor
  * “Andrew Huberman may have singlehandedly ruined the SF social scene”
* [01:12:49] Jeremy Howard - CEO, Answer.ai
  * see also the End of Finetuning episode
  * Jeremy’s podcast with Tanishq Abraham, Jess Leao
  * Announcing Answer.ai with $10m from Decibel VC
  * Laundry Buddy, Nov 2023 AI Meme of the Month
* [01:37:13] Joel Hestness - Principal Scientist, Cerebras
  * CerebrasGPT, all the Cerebras papers we discussed
* [01:56:34] Jason Corso - CEO, Voxel51
  * Open Source FiftyOne project
  * CVPR Survival Guide
* [02:02:39] Brandon Duderstadt - CEO, Nomic.ai
  * GPT4All, Atlas, Demo
* [02:12:39] Luca Antiga - CTO, Lightning.ai
  * Pytorch Lightning, Lightning Studios, LitGPT
* [02:29:46] Jay Alammar - Engineering Fellow, Cohere
  * The Illustrated Transformer

Get full access to Latent.Space at www.latent.space/subscribe
-
NeurIPS 2023 Recap — Best Papers
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2023-12-23 23:44
We are running an end of year listener survey! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here.

NeurIPS 2023 took place from Dec 10–16 in New Orleans. The Latent Space crew was onsite for as many of the talks and workshops as we could attend (and more importantly, hosted cocktails and parties after hours)! Picking from the 3586 papers accepted to the conference (available online, full schedule here) is an impossible task, but we did our best to present an audio guide with brief commentary on each. We also recommend MLContests.com’s NeurIPS recap, Seb Ruder’s NeurIPS primer, and Jerry Liu’s paper picks. We also found the VizHub guide useful for a t-SNE clustering of papers. Lots also happened in the arxiv publishing world outside NeurIPS, as highlighted by Karpathy, especially DeepMind’s Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models.

Jan 2024 update: we also strongly recommend Sebastian Raschka, PhD’s pick of the year’s 10 best papers, including Pythia.

We’ll start with the NeurIPS Best Paper Awards, then go to a selection of non-awarded but highly influential papers, and then arbitrary personal picks to round out the selection. Where we were able to do a poster session interview, please scroll to the relevant show notes for images of their poster for discussion. We give Chris Ré the last word, due to the Mamba and StripedHyena state space models drawing particular excitement but still being too early to assess impact.

Timestamps
* [0:01:19] Word2Vec (Jeff Dean, Greg Corrado)
* [0:15:28] Emergence Mirage (Rylan Schaeffer)
* [0:28:48] DPO (Rafael Rafailov)
* [0:41:36] DPO Poster Session (Archit Sharma)
* [0:52:03] Datablations (Niklas Muennighoff)
* [1:00:50] QLoRA (Tim Dettmers)
* [1:12:23] DataComp (Samir Gadre)
* [1:25:38] DataComp Poster Session (Samir Gadre, Alex Dimakis)
* [1:35:25] LLaVA (Haotian Liu)
* [1:47:21] LLaVA Poster Session (Haotian Liu)
* [1:59:19] Tree of Thought (Shunyu Yao)
* [2:11:27] Tree of Thought Poster Session (Shunyu Yao)
* [2:20:09] Toolformer (Jane Dwivedi-Yu)
* [2:32:26] Voyager (Guanzhi Wang)
* [2:45:14] CogEval (Ida Momennejad)
* [2:59:41] State Space Models (Chris Ré)

Papers covered
* Distributed Representations of Words and Phrases and their Compositionality (Word2Vec), Tomas Mikolov · Ilya Sutskever · Kai Chen · Greg Corrado · Jeff Dean. The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several improvements that make the Skip-gram model more expressive and enable it to learn higher quality vectors more rapidly. We show that by subsampling frequent words we obtain significant speedup, and also learn higher quality representations as measured by our tasks. We also introduce Negative Sampling, a simplified variant of Noise Contrastive Estimation (NCE) that learns more accurate vectors for frequent words compared to the hierarchical softmax. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada".
Motivated by this example, we present a simple and efficient method for finding phrases, and show that their vector representations can be accurately learned by the Skip-gram model.
  * Some notable reflections from Tomas Mikolov, and the debate over the Seq2Seq paper credit with Quoc Le
* Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al.). Emergent abilities are abilities that are present in large-scale models but not in smaller models and are hard to predict. Rather than being a product of models’ scaling behavior, this paper argues that emergent abilities are mainly an artifact of the choice of metric used to evaluate them. Specifically, nonlinear and discontinuous metrics can lead to sharp and unpredictable changes in model performance. Indeed, the authors find that when accuracy is changed to a continuous metric for arithmetic tasks where emergent behavior was previously observed, performance improves smoothly instead. So while emergent abilities may still exist, they should be properly controlled, and researchers should consider how the chosen metric interacts with the model.
* Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al.)
  * While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model.
  * In this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning.
  * Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds RLHF's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
  * See also Interconnects on DPO, and recent Twitter discussions. (The DPO objective itself is written out after the end of this list.)
* Scaling Data-Constrained Language Models (Muennighoff et al.)
  * The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
  * 2 minute poster session presentation video
* QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al.)
  * This paper proposes QLoRA, a more memory-efficient (but slower) version of LoRA that uses several optimization tricks to save memory. They train a new model, Guanaco, that is fine-tuned only on a single GPU for 24h and outperforms previous models on the Vicuna benchmark. Overall, QLoRA enables using much less GPU memory for fine-tuning LLMs. Concurrently, other methods such as 4-bit LoRA quantization have been developed that achieve similar results.
* DataComp: In search of the next generation of multimodal datasets (Gadre et al.)
  * Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets.
  * Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
* Visual Instruction Tuning (Liu et al)
  * Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
  * By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.
  * Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
* Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al)
  * Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role.
  * To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving.
  * ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices.
  * Our experiments show that ToT significantly enhances language models’ problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.
  * Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm
* Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al)
  * LMs exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel.
  * In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds.
  * We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.
  * This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar.
  * Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
* Voyager: An Open-Ended Embodied Agent with Large Language Models (Wang et al)
  * We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components:
    * 1) an automatic curriculum that maximizes exploration,
    * 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and
    * 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement.
  * Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning.
* Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al)* LMs exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller specialized models excel.* In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds.* We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction.* This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, a search engine, a translation system, and a calendar.* Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.* Voyager: An Open-Ended Embodied Agent with Large Language Models (Wang et al)* We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components:* 1) an automatic curriculum that maximizes exploration,* 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and* 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement.* Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. Voyager discovers new Minecraft items and skills continually by self-driven exploration, significantly outperforming the baselines.
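As a rough illustration of the skill-library idea, here is a hedged sketch in our own notation; the class and function names are illustrative, not the paper's actual code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SkillLibrary:
    """Ever-growing store of working code snippets, retrieved by similarity."""

    def __init__(self, embed):
        self.embed = embed   # text -> vector, e.g. an embeddings API call
        self.skills = []     # list of (description, code, vector)

    def add(self, description, code):
        self.skills.append((description, code, self.embed(description)))

    def retrieve(self, task, top_k=5):
        v = self.embed(task)
        ranked = sorted(self.skills, key=lambda s: cosine(v, s[2]), reverse=True)
        return [(desc, code) for desc, code, _ in ranked[:top_k]]

def voyager_step(task, library, llm, env):
    """One round of the iterative prompting loop: write code, run it, refine."""
    code = llm(task, examples=library.retrieve(task))  # GPT-4 via black-box query
    feedback = env.execute(code)                       # run in the game, collect errors
    if feedback.success:                               # self-verification passed
        library.add(task, code)                        # store the verified skill
    return feedback
```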
* Evaluating Cognitive Maps and Planning in Large Language Models with CogEval (Momennejad et al)* Recently an influx of studies claims emergent cognitive abilities in large language models (LLMs). Yet, most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions.* First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The CogEval protocol can be followed for the evaluation of various abilities.* Second, we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning, and are absent from LLM training sets.* We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and falling into loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.* Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu, Tri Dao)* Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements.* First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token.* Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).* Mamba enjoys fast inference (5x higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-1.4B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
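For intuition, here is a deliberately naive, sequential reference for the selective SSM update; the paper's contribution includes a hardware-aware parallel scan, which this sketch does not attempt. Shapes and projection layers are our own assumptions:

```python
import torch
import torch.nn.functional as F

def selective_scan(x, A, B_proj, C_proj, dt_proj):
    """Naive sequential reference for Mamba's selective SSM update.

    x: (batch, length, d) input sequence
    A: (d, n) state-transition parameters (negative; log-parameterized in the real model)
    B_proj, C_proj: Linear(d -> n); dt_proj: Linear(d -> d)
    Making B, C and the step size dt functions of the input is the "selective" part.
    """
    b, L, d = x.shape
    n = A.shape[1]
    h = torch.zeros(b, d, n, device=x.device)              # per-channel hidden state
    ys = []
    for t in range(L):
        xt = x[:, t]                                       # (b, d)
        dt = F.softplus(dt_proj(xt))                       # (b, d) input-dependent step size
        A_bar = torch.exp(dt.unsqueeze(-1) * A)            # (b, d, n) discretized transition
        B_bar = dt.unsqueeze(-1) * B_proj(xt).unsqueeze(1) # (b, d, n) discretized input matrix
        h = A_bar * h + B_bar * xt.unsqueeze(-1)           # selectively propagate or forget
        y = (h * C_proj(xt).unsqueeze(1)).sum(-1)          # (b, d) readout: y_t = C_t h_t
        ys.append(y)
    return torch.stack(ys, dim=1)                          # (batch, length, d)
```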
-
The AI-First Graphics Editor - with Suhail Doshi of Playground AI
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2023-12-20 20:31
We are running an end of year survey for our listeners! Please let us know any feedback you have, what episodes resonated with you, and guest requests for 2024! Survey link here!Listen to the end for a little surprise from Suhail.Before language models became all the rage in November 2022, image generation was the hottest space in AI (it was the subject of our first piece on Latent Space!) In our interview with Sharif Shameem from Lexica we talked through the launch of StableDiffusion and the early days of that space. At the time, the toolkit was still pretty rudimentary: Lexica made it easy to search images, you had the AUTOMATIC1111 Web UI to generate locally, some HuggingFace spaces that offered inference, and eventually DALL-E 2 through OpenAI’s platform, but not much beyond basic text-to-image workflows.Today’s guest, Suhail Doshi, is trying to solve this with Playground AI, an image editor reimagined with AI in mind. Some of the differences compared to traditional text-to-image workflows:* Real-time preview rendering using consistency models: as you change your prompt, you can see changes in real-time before doing a final rendering of it.* Style filtering: rather than having to prompt exactly how you’d like an image to look, you can pick from a whole range of filters both from Playground’s model as well as Stable Diffusion (like RealVis, Starlight XL, etc.). We talk about this at 25:46 in the podcast.* Expand prompt: similar to DALL-E 3, Playground will do some prompt tuning for you to get better results in generation. Unlike DALL-E 3, you can turn this off at any time if you are a prompting wizard.* Image editing: after generation, you have tools like a magic eraser, inpainting pencil, etc. This makes it easier to do a full workflow in Playground rather than switching to another tool like Photoshop.Outside of the product, they have also trained a new model from scratch, Playground v2, which is fully open source and open weights and allows for commercial usage. They benchmarked the model against SDXL across 1,000 prompts and found that humans preferred the Playground generation 70% of the time. They had similar results on PartiPrompts.They also created a new benchmark, MJHQ-30K, for “aesthetic quality”:We introduce a new benchmark, MJHQ-30K, for automatic evaluation of a model’s aesthetic quality. The benchmark computes FID on a high-quality dataset to gauge aesthetic quality.We curate the high-quality dataset from Midjourney with 10 common categories, each category with 3K samples. Following common practice, we use aesthetic score and CLIP score to ensure high image quality and high image-text alignment. Furthermore, we take extra care to make the data diverse within each category.Suhail was pretty open in saying that Midjourney is currently the best product for image generation out there, and that’s why they used it as the base for this benchmark: “I think it's worth comparing yourself to maybe the best thing and try to find like a really fair way of doing that. So I think more people should try to do that. I definitely don't think you should be kind of comparing yourself on like some Google model or some old SD, Stable Diffusion model and be like, look, we beat Stable Diffusion 1.5. I think users ultimately care: how close are you getting to the thing that people mostly agree with?” [00:23:47]We also talked a lot about Suhail’s founder journey from starting Mixpanel in 2009, then going through YC again with Mighty, and eventually sunsetting that to pivot into Playground.
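Since the MJHQ-30K benchmark described above boils down to computing FID against a curated reference set, evaluating a model on it is conceptually a two-liner. A hedged sketch using the clean-fid package, with placeholder directory paths:

```python
# pip install clean-fid
from cleanfid import fid

# Assumes one generated image per benchmark prompt in generated/, and the
# curated high-quality reference images in reference/ (paths are placeholders).
score = fid.compute_fid("generated/", "reference/")
print(f"FID against the aesthetic reference set: {score:.2f}")  # lower is better
```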
Enjoy!
Show Notes
* Suhail’s Twitter* “Starting my road to learn AI”* Bill Gates book trip* Playground* Playground v2 Announcement* $40M raise announcement* “Running infra dev ops for 24 A100s”* Mixpanel* Mighty* “I decided to stop working on Mighty”* Fast.ai* Civit
Timestamps
* [00:00:00] Intros* [00:02:59] Being early in ML at Mixpanel* [00:04:16] Pivoting from Mighty to Playground and focusing on generative AI* [00:07:54] How DALL-E 2 inspired Mighty* [00:09:19] Reimagining the graphics editor with AI* [00:17:34] Training the Playground V2 model from scratch to advance generative graphics* [00:21:11] Techniques used to improve Playground V2 like data filtering and model tuning* [00:25:21] Releasing the MJHQ-30K benchmark to evaluate generative models* [00:30:35] The limitations of current models for detailed image editing tasks* [00:34:06] Using post-generation user feedback to create better benchmarks* [00:38:28] Concerns over potential misuse of powerful generative models* [00:41:54] Rethinking the graphics editor user experience in the AI era* [00:45:44] Integrating consistency models into Playground using preview rendering* [00:47:23] Interacting with the Stable Diffusion LoRAs community* [00:51:35] Running DevOps on A100s* [00:53:12] Startup ideas?
Transcript
Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:15]Swyx: Hey, and today in the studio we have Suhail Doshi, welcome. [00:00:18]Suhail: Yeah, thanks. Thanks for having me. [00:00:20]Swyx: So among many things, you're a CEO and co-founder of Mixpanel, and I think about three years ago you left to start Mighty, and more recently, I think about a year ago, transitioned into Playground, and you've just announced your new round. How do you like to be introduced beyond that? [00:00:34]Suhail: Just founder of Playground is fine, yeah, prior co-founder and CEO of Mixpanel. [00:00:40]Swyx: Yeah, awesome. I'd just like to touch on Mixpanel a little bit, because it's obviously one of the more successful analytics companies (we previously had Amplitude on), and I'm curious if you had any reflections on the interaction of that amount of data that people would want to use for AI. I don't know if there's still a part of you that stays in touch with that world. [00:00:59]Suhail: Yeah, I mean, the short version is that maybe back in like 2015 or 2016, I don't really remember exactly, because it was a while ago, we had an ML team at Mixpanel, and I think this is when maybe deep learning or something really just started getting kind of exciting, and we were thinking that maybe given that we had such vast amounts of data, perhaps we could predict things. So we built two or three different features, I think we built a feature where we could predict whether users would churn from your product. We made a feature that could predict whether users would convert, we built a feature that could do anomaly detection, like if something occurred in your product, that was just very surprising, maybe a spike in traffic in a particular region, can we tell you that that happened? Because it's really hard to like know everything that's going on with your data, can we tell you something surprising about your data? And we tried all of these various features, most of it boiled down to just like, you know, using logistic regression, and it never quite seemed very groundbreaking in the end.
And so I think, you know, we had a four or five person ML team, and I think we never expanded it from there. And I did all these Fast.ai courses trying to learn about ML. And that was the- That's the first time you did Fast.ai? Yeah, that was the first time I did Fast.ai. Yeah, I think I've done it now three times, maybe. [00:02:12]Swyx: Oh, okay. [00:02:13]Suhail: I didn't know it was the third. No, no, just me reviewing it, it's maybe three times, but yeah. [00:02:16]Swyx: You mentioned prediction, but honestly, like it's also just about the feedback, right? The quality of feedback from users, I think it's useful for anyone building AI applications. [00:02:25]Suhail: Yeah. Yeah, I think I haven't spent a lot of time thinking about Mixpanel because it's been a long time, but sometimes I'm like, oh, I wonder what we could do now. And then I kind of like move on to whatever I'm working on, but things have changed significantly since. [00:02:39]Swyx: And then maybe we'll touch on Mighty a little bit. Mighty was very, very bold. My framing of it was, you will run our browsers for us because everyone has too many tabs open. I have too many tabs open and they're slowing down my machine, and you can do it better for us in a centralized data center. [00:02:51]Suhail: Yeah, we were first trying to make a browser that we would stream from a data center to your computer at extremely low latency, but the real objective wasn't trying to make a browser or anything like that. The real objective was to try to make a new kind of computer. And the thought was just that like, you know, we have these computers in front of us today and we upgrade them or they run out of RAM or they don't have enough RAM or not enough disk or, you know, there's some limitation with our computers, perhaps like data locality is a problem. Why do I need to think about upgrading my computer ever? And so, you know, we just had to kind of observe that like, well, actually it seems like a lot of applications are just now in the browser, you know, it's like how many real desktop applications do we use relative to the number of applications we use in the browser? So it's just this realization that actually like, you know, the browser was effectively becoming more or less our operating system over time. And so then that's why we kind of decided to go, hmm, maybe we can stream the browser. Unfortunately, the idea did not work for a couple of different reasons, but the objective was to try to make a new kind of computer. [00:03:50]Swyx: Yeah, very, very bold. [00:03:51]Alessio: Yeah, and I was there at YC Demo Day when you first announced it. It was, I think, the last or one of the last in-person ones, at Pier34 in Mission Bay. How do you think about that now when everybody wants to put some of these models in people's machines and some of them want to stream them in, do you think there's maybe another wave of the same problem before it was like browser apps too slow, now it's like models too slow to run on device? [00:04:16]Suhail: Yeah. I mean, I've obviously pivoted away from Mighty, but a lot of what I somewhat believed at Mighty, maybe why I'm so excited about AI and what's happening, a lot of what Mighty was about was like moving compute somewhere else, right? Right now, applications, they get limited quantities of memory, disk, networking, whatever your home network has, et cetera. You know, what if these applications could somehow, if we could shift compute, and then these applications have vastly more compute than they do today.
Right now it's just like client backend services, but you know, what if we could change the shape of how applications could interact with things? And it's changed my thinking. In some ways, AI is like a bit of a continuation of my belief that like perhaps we can really shift compute somewhere else. One of the problems with Mighty was that JavaScript is single-threaded in the browser. And what we learned, you know, the reason why we kind of abandoned Mighty was because I didn't believe we could make a new kind of computer. We could have made some kind of enterprise business, probably it could have made maybe a lot of money, but it wasn't going to be what I hoped it was going to be. And so once I realized that most of a web app is just going to be single-threaded JavaScript, then the only thing you could do, notwithstanding changing JavaScript, which is a fool's errand most likely, is make a better CPU, right? And there's like three CPU manufacturers, two of which sell, you know, big ones, you know, AMD, Intel, and then of course like Apple made the M1. And it's not like single-threaded CPU performance, single-core performance, was increasing very fast; it's plateauing rapidly. And even these different companies were not doing as good of a job, you know, sort of with the continuation of Moore's law. But what happened in AI was that you got like, if you think of the AI model as like a computer program, like just like a compiled computer program, it is literally built and designed to do massive parallel computations. And so if you could take like the universal approximation theorem to its like kind of logical complete point, you know, you're like, wow, I can make computation happen really rapidly and in parallel somewhere else, you know, so you end up with these like really amazing models that can like do anything. It just turned out like perhaps the new kind of computer would just simply be shifted, you know, into these like really amazing AI models in reality. Yeah. [00:06:30]Swyx: Like I think Andrej Karpathy has been making a lot of analogies with the LLM OS. [00:06:34]Suhail: I saw his video and I watched that, you know, maybe two weeks ago or something like that. I was like, oh man, this, I very much resonate with this like idea. [00:06:41]Swyx: Why didn't I see this three years ago? [00:06:43]Suhail: Yeah. I think, I think there still will be, you know, local models and then there'll be these very large models that have to be run in data centers. I think it just depends on kind of like the right tool for the job, like any engineer would probably care about. But I think that, you know, by and large, like if the models continue to kind of keep getting bigger, you're always going to be wondering whether you should use the big thing or the small, you know, the tiny little model. And it might just depend on like, you know, do you need 30 FPS or 60 FPS? Maybe that would be hard to do, you know, over a network. [00:07:13]Swyx: You tackled a much harder problem latency wise than the AI models actually require. Yeah. [00:07:18]Suhail: Yeah. You can do quite well. You can do quite well. We definitely did 30 FPS video streaming, and did very crazy things to make that work. So I'm actually quite bullish on the kinds of things you can do with networking. [00:07:30]Swyx: Maybe someday you'll come back to that at some point. But so for those that don't know, you're very transparent on Twitter. Very good to follow you just to learn your insights.
And you actually published a postmortem on Mighty that people can read up on if they're willing to. So there was a bit of an overlap. You started exploring the AI stuff in June 2022, which is when you started saying like, I'm taking fast AI again. Maybe, was there more context around that? [00:07:54]Suhail: Yeah. I think I was kind of like waiting for the team at Mighty to finish up, you know, something. And I was like, okay, well, what can I do? I guess I will make some kind of like address bar predictor in the browser. So we had, you know, we had forked Chrome and Chromium. And I was like, you know, one thing that's kind of lame is that like this browser should be like a lot better at predicting what I might do, where I might want to go. It struck me as really odd that, you know, Chrome had very little AI actually or ML inside this browser. For a company like Google, you'd think there's a lot. The code is actually just very, you know, it's just a bunch of if-then statements, more or less, in the address bar. So it seemed like a pretty big opportunity. And that's also where a lot of people interact with the browser. So, you know, long story short, I was like, hmm, I wonder what I could build here. So I started to take some AI courses and review the material again and get back to figuring it out. But I think that was somewhat serendipitous because right around April was, I think, a very big watershed moment in AI because that's when DALL-E 2 came out. And I think that was the first truly big viral moment for generative AI. [00:08:59]Swyx: Because of the avocado chair. [00:09:01]Suhail: Yeah, exactly. [00:09:02]Swyx: It wasn't as big for me as Stable Diffusion. [00:09:04]Suhail: Really? [00:09:05]Swyx: Yeah, I don't know. DALL-E was like, all right, that's cool. [00:09:07]Suhail: I don't know. Yeah. [00:09:09]Swyx: I mean, they had some flashy videos, but it didn't really register. [00:09:13]Suhail: That moment of images was just such a viral novel moment. I think it just blew people's mind. Yeah. [00:09:19]Swyx: I mean, it's the first time I encountered Sam Altman because they had this DALL-E 2 hackathon and they opened up the OpenAI office for developers to walk in back when it wasn't as much of a security issue as it is today. I see. Maybe take us through the journey to decide to pivot into this and also choosing images. Obviously, you were inspired by DALL-E, but there could be any number of AI companies and businesses that you could start and why this one, right? [00:09:45]Suhail: Yeah. So I think at that time, during Mighty, OpenAI was not quite as popular as it is all of a sudden now these days, but back then they had a lot more bandwidth to kind of help anybody. And so we had been talking with the team there around trying to see if we could do really fast low latency address bar prediction with GPT-3 and 3.5 and that kind of thing. And so we were sort of figuring out how could we make that low latency. I think that just being able to talk to them and kind of being involved gave me a bird's eye view into a bunch of things that started to happen. The first was the DALL-E 2 moment, but then Stable Diffusion came out and that was a big moment for me as well. And I remember just kind of like sitting up one night thinking, I was like, you know, what are the kinds of companies one could build? Like what matters right now? One thing that I observed is that I find a lot of inspiration when I'm working in a field in something and then I can identify a bunch of problems.
Like for Mixpanel, I was an intern at a company and I just noticed that they were doing all this data analysis. And so I thought, hmm, I wonder if I could make a product and then maybe they would use it. And in this case, you know, the same thing kind of occurred. It was like, okay, there are a bunch of like infrastructure companies that put a model up and then you can use their API, like Replicate is a really good example of that. There are a bunch of companies that are like helping you with training, model optimization, Mosaic at the time, and probably still, you know, was doing stuff like that. So I just started listing out like every category of everything, of every company that was doing something interesting. I started listing out like Weights and Biases. I was like, oh man, Weights and Biases is like this great company. Do I want to compete with that company? I might be really good at competing with that company because of Mixpanel because it's so much of like analysis. But I was like, no, I don't want to do anything related to that. That would, I think that would be too boring now at this point. So I started to list out all these ideas and one thing I observed was that at OpenAI, they had like a playground for GPT-3, right? All it was was just like a text box, more or less. And then there were some settings on the right, like temperature and whatever. [00:11:41]Swyx: Top K. [00:11:42]Suhail: Yeah, top K. You know, what's your end stop sequence? I mean, that was like their product before ChatGPT, you know, really difficult to use, but fun if you're like an engineer. And I just noticed that their product kind of was evolving a little bit where the interface kind of was getting a little bit more complex. They had like a way where you could like generate something in the middle of a sentence and all those kinds of things. And I just thought to myself, I was like, everything is just like this text box and you generate something and that's about it. And Stable Diffusion had kind of come out and it was all like Hugging Face and code. Nobody was really building any UI. And so I had this kind of thing where I wrote prompt dash like question mark in my notes and I didn't know what was like the product for that at the time. I mean, it seems kind of trite now, but I just like wrote prompt. What's the thing for that? Manager. Prompt manager. Do you organize them? Like, do you like have a UI that can play with them? Yeah. Like a library. What would you make? And so then, of course, then you thought about what would the modalities be given that? How would you build a UI for each kind of modality? And so there are a couple of people working on some pretty cool things. And I basically chose graphics because it seemed like the most obvious place where you could build a really powerful, complex UI. That's not just only typing a box. It would very much evolve beyond that. Like what would be the best thing for something that's visual? Probably something visual. Yeah. I think that just that progression kind of happened and it just seemed like there was a lot of effort going into language, but not a lot of effort going into graphics. And then maybe the very last thing was, I think I was talking to Aditya Ramesh, who was the co-creator of DALL-E 2, and Sam. And I just kind of went to these guys and I was just like, hey, are you going to make like a UI for this thing? Like a true UI? Are you going to go for this? Are you going to make a product? For DALL-E. Yeah. For DALL-E. Yeah.
Are you going to do anything here? Because if you are going to do it, just let me know and I will stop and I'll go do something else. But if you're not going to do anything, I'll just do it. And so we had a couple of conversations around what that would look like. And then I think ultimately they decided that they were going to focus on language primarily. And I just felt like it was going to be very underinvested in. Yes. [00:13:46]Swyx: There's that sort of underinvestment from OpenAI, but also it's a different type of customer than you're used to, presumably, you know, and Mixpanel is very good at selling to B2B and developers will figure you out or not. Yeah. Was that not a concern? [00:14:00]Suhail: Well, not so much because I think that, you know, right now I would say graphics is in this very nascent phase. Like most of the customers are just like hobbyists, right? Yeah. Like it's a little bit of like a novel toy as opposed to being this like very high utility thing. But I think ultimately, if you believe that you could make it very high utility, the probably the next customers will end up being B2B. It'll probably not be like a consumer. There will certainly be a variation of this idea that's in consumer. But if your quest is to kind of make like something that surpasses human ability for graphics, like ultimately it will end up being used for business. So I think it's maybe more of a progression. In fact, for me, it's maybe more like Mixpanel started out as SMB and then very much like ended up starting to grow up towards enterprise. So for me, I think it will be a very similar progression. But yeah, I mean, the reason why I was excited about it is because it was a creative tool. I make music and it's AI. It's like something that I know I could stay up till three o'clock in the morning doing. Those are kind of like very simple bars for me. [00:14:56]Alessio: So you mentioned DALL-E, Stable Diffusion. You just had Playground V2 come out two days ago. Yeah, two days ago. [00:15:02]Suhail: Two days ago. [00:15:03]Alessio: This is a model you train completely from scratch. So it's not a cheap fine tune on something. You open source everything, including the weights. Why did you decide to do it? I know you supported Stable Diffusion XL in Playground before, right? Yep. What made you want to come up with V2 and maybe some of the interesting, you know, technical research work you've done? [00:15:24]Suhail: Yeah. So I think that we continue to feel like graphics and these foundation models for anything really related to pixels, but also definitely images continues to be very underinvested. It feels a little like graphics is in like this GPT-2 moment, right? Like even GPT-3, even when GPT-3 came out, it was exciting, but it was like, what are you going to use this for? Yeah, we'll do some text classification and some semantic analysis and maybe it'll sometimes like make a summary of something and it'll hallucinate. But no one really had like a very significant like business application for GPT-3. And in images, we're kind of stuck in the same place. We're kind of like, okay, I write this thing in a box and I get some cool piece of artwork and the hands are kind of messed up and sometimes the eyes are a little weird. Maybe I'll use it for a blog post, you know, that kind of thing. The utility feels so limited.
And so, you know, then you sort of look at Stable Diffusion and we definitely use that model in our product and our users like it and use it and love it and enjoy it, but it hasn't gone nearly far enough. So we were kind of faced with the choice of, you know, do we wait for progress to occur or do we make that progress happen? So yeah, we kind of embarked on a plan to just decide to go train these things from scratch. And I think the community has given us so much. The community for Stable Diffusion I think is one of the most vibrant communities on the internet. It's like amazing. It feels like, I hope this is what like Homebrew Club felt like when computers like showed up because it's like amazing what that community will do and it moves so fast. I've never seen anything in my life and heard other people's stories around this where an academic research paper comes out and then like two days later, someone has sample code for it. And then two days later, there's a model. And then two days later, it's like in nine products, you know, they're all competing with each other. It's incredible to see like math symbols on an academic paper go to well-designed features in a product. So I think the community has done so much. So I think we wanted to give back to the community kind of on our way. Certainly we would train a better model than what we gave out on Tuesday, but we definitely felt like there needs to be some kind of progress in these open source models. The last kind of milestone was in July when Stable Diffusion XL came out, but there hasn't been anything really since. Right. [00:17:34]Swyx: And there's SDXL Turbo now. [00:17:35]Suhail: Well, SDXL Turbo is like this distilled model, right? So it's like lower quality, but fast. You have to decide, you know, what your trade off is there. [00:17:42]Swyx: It's also a consistency model. [00:17:43]Suhail: I don't think it's a consistency model. It's like, they did like a different thing. Yeah. I think it's like, I don't want to get quoted for this, but it's like something called, like, adversarial or something. [00:17:52]Swyx: That's exactly right. [00:17:53]Suhail: I've read something about that. Maybe it's like closer to GANs or something, but I didn't really read the full paper. But yeah, there hasn't been quite enough progress in terms of, you know, there's no multitask image model. You know, the closest thing would be something called like EmuEdit, but there's no model for that. It's just a paper that's within Meta. So we did that and we also gave out pre-trained weights, which is very rare. Usually you just get the aligned model and then you have to like see if you can do anything with it. So we actually gave out, there's like a 256 pixel pre-trained stage and a 512. And we did that for academic research because we come across people all the time in academia, they have access to like one A100 or eight at best. And so if we can give them kind of like a 512 pre-trained model, our hope is that there'll be interesting novel research that occurs from that. [00:18:38]Swyx: What research do you want to happen? [00:18:39]Suhail: I would love to see more research around things that users care about, which tend to be things like character consistency. [00:18:45]Swyx: Between frames? [00:18:46]Suhail: More like if you have like a face. Yeah, yeah. Basically between frames, but more just like, you know, you have your face and it's in one image and then you want it to be like in another.
And users are very particular and sensitive to faces changing because we know we're trained on faces as humans. Not seeing a lot of innovation, enough innovation, around multitask editing. You know, there are two things like InstructPix2Pix and then the EmuEdit paper that are maybe very interesting, but we certainly are not pushing the fold on that in that regard. All kinds of things like around that rotation, you know, being able to keep coherence across images, style transfer is still very limited. Just even reasoning around images, you know, what's going on in an image, that kind of thing. Things are still very, very underpowered, very nascent. So therefore the utility is very, very limited. [00:19:32]Alessio: On the 1K Prompt Benchmark, you are 2.5x preferred over Stable Diffusion XL. How do you get there? Is it better images in the training corpus? Can you maybe talk through the improvements in the model? [00:19:44]Suhail: I think they're still very early on in the recipe, but I think it's a lot of like little things and you know, every now and then there are some big important things like certainly your data quality is really, really important. So we spend a lot of time thinking about that. But I would say it's a lot of things that you kind of clean up along the way as you train your model. Everything from captions to the data that you align with after pre-train to how you're picking your data sets, how you filter your data sets. I feel like there's a lot of work in AI that doesn't really feel like AI. It just really feels like just data set filtering and systems engineering and just like, you know, and the recipe is all there, but it's like a lot of extra work to do that. I think we plan to do a Playground V2.1, maybe either by the end of the year or early next year. And we're just like watching what the community does with the model. And then we're just going to take a lot of the things that they're unhappy about and just like fix them. You know, so for example, like maybe the eyes of people in an image don't feel right. They feel like they're a little misshapen or they're kind of blurry feeling. That's something that we already know we want to fix. So I think in that case, it's going to be about data quality. Or maybe you want to improve the kind of the dynamic range of color. You know, we want to make sure that that's like got a good range in any image. So what technique can we use there? There's different things like offset noise, pyramid noise, zero terminal SNR, like there are all these various interesting things that you can do. So I think it's like a lot of just like tricks. Some are tricks, some are data, and some is just like cleaning. [00:21:11]
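To make one of those tricks concrete, here is a hedged sketch of offset noise, which perturbs the per-channel mean of the training noise so a diffusion model can learn very dark and very bright images. The strength value is a typical-looking choice, not Playground's actual setting:

```python
import torch

def offset_noise(latents, strength=0.1):
    """Sketch of the "offset noise" trick for diffusion training.

    Standard training noises latents with N(0, I), which makes it hard for the
    model to shift an image's overall brightness. Adding a per-channel constant
    offset to the noise widens the achievable dynamic range.
    """
    b, c, h, w = latents.shape
    noise = torch.randn_like(latents)
    # one random offset per image and channel, shared across all pixels
    noise += strength * torch.randn(b, c, 1, 1, device=latents.device)
    return noise
```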
Swyx: Specifically for faces, it's very common to use a pipeline rather than just train the base model more. Do you have a strong belief either way on like, oh, they should be separated out to different stages for like improving the eyes, improving the face or enhance or whatever? Or do you think like it can all be done in one model? [00:21:28]Suhail: I think we will make a unified model. Yeah, I think it will. I think we'll certainly in the end, ultimately make a unified model. There's not enough research about this. Maybe there is something out there that we haven't read. There are some bottlenecks, like for example, in the VAE, like the VAEs are ultimately like compressing these things. And so you don't know. And then you might have like a big information bottleneck. So maybe you would use a pixel based model, perhaps. I think we've talked to people, everyone from like Rombach to various people, Rombach trained Stable Diffusion. I think there's like a big question around the architecture of these things. It's still kind of unknown, right? Like we've got transformers and we've got like a GPT architecture model, but then there's this like weird thing that's also seemingly working with diffusion. And so, you know, are we going to use vision transformers? Are we going to move to pixel based models? Is there a different kind of architecture? We don't really, I don't think there have been enough experiments. Still? Oh my God. [00:22:21]Swyx: Yeah. [00:22:22]Suhail: That's surprising. I think it's very computationally expensive to do a pipeline model where you're like fixing the eyes and you're fixing the mouth and you're fixing the hands. [00:22:29]Swyx: That's what everyone does as far as I understand. [00:22:31]Suhail: I'm not exactly sure what you mean, but if you mean like you get an image and then you will like make another model specifically to fix a face, that's fairly computationally expensive. And I think it's probably not the right way. Yeah. And it doesn't generalize very well. Now you have to pick all these different things. [00:22:45]Swyx: Yeah. You're just kind of glomming things on together. Yeah. Like when I look at AI artists, like that's what they do. [00:22:50]Suhail: Ah, yeah, yeah, yeah. They'll do things like, you know, I think a lot of AI artists will do ControlNet tiling to do kind of generative upscaling of all these different pieces of the image. Yeah. And I think these are all just like, they're all hacks ultimately in the end. I mean, it just to me, it's like, let's go back to where we were just three years, four years ago with where deep learning was at and where language was at, you know, it's the same thing. It's like we were like, okay, well, I'll just train these very narrow models to try to do these things and kind of ensemble them or pipeline them to try to get to a best in class result. And here we are with like where the models are gigantic and like very capable of solving huge amounts of tasks when given like lots of great data. [00:23:28]Alessio: You also released a new benchmark called MJHQ-30K for automatic evaluation of a model's aesthetic quality. I have one question. The data set that you use for the benchmark is from Midjourney. Yes. You have 10 categories. How do you think about the Playground model, Midjourney, like, are you competitors? [00:23:47]Suhail: There are a lot of people, a lot of people in research, they like to compare themselves to something they know they can beat, right? Maybe this is the best reason why it can be helpful to not be a researcher, also. Sometimes, like, I'm not trained as a researcher; I don't have a PhD in anything AI related, for example. But I think if you care about products and you care about your users, then the most important thing that you want to figure out is like everyone has to acknowledge that Midjourney is very good. They are the best at this thing. I'm happy to admit that. I have no problem admitting that. Just easy. It's very visual to tell. So I think it's incumbent on us to try to compare ourselves to the thing that's best, even if we lose, even if we're not the best. At some point, if we are able to surpass Midjourney, then we only have ourselves to compare ourselves to.
But at first blush, I think it's worth comparing yourself to maybe the best thing and try to find like a really fair way of doing that. So I think more people should try to do that. I definitely don't think you should be kind of comparing yourself on like some Google model or some old SD, Stable Diffusion model and be like, look, we beat Stable Diffusion 1.5. I think users ultimately care: how close are you getting to the thing that people mostly agree with? So we put out that benchmark for no other reason to say like, this seems like a worthy thing for us to at least try, for people to try to get to. And then if we surpass it, great, we'll come up with another one. [00:25:06]Alessio: Yeah, no, that's awesome. And you killed Stable Diffusion XL and everything. In the benchmark chart, it says Playground V2 1024 pixel dash aesthetic. Do you have kind of like, yeah, style fine tunes or like what's the dash aesthetic for? [00:25:21]Suhail: We debated this, maybe we named it wrong or something, but we were like, how do we help people realize the model that's aligned versus the models that weren't? Because we gave out pre-trained models, we didn't want people to like use those. So that's why they're called base. And then the aesthetic model, yeah, we wanted people to pick up the thing that makes things pretty. Who wouldn't want the thing that's aesthetic? But if there's a better name, we're definitely open to feedback. No, no, that's cool. [00:25:46]Alessio: I was using the product. You also have the style filter and you have all these different styles. And it seems like the styles are tied to the model. So there's some like SDXL styles, there's some Playground V2 styles. Can you maybe give listeners an overview of how that works? Because in language, there's not this idea of like style, right? Versus like in vision models, there is, and you cannot get certain styles in different models. [00:26:12]Alessio: So how do styles emerge and how do you categorize them and find them? [00:26:15]Suhail: Yeah, I mean, it's so fun having a community where people are just trying a model. Like it's only been two days for Playground V2. And we actually don't know what the model's capable of and not capable of. You know, we certainly see problems with it. But we have yet to see what emergent behavior is. I mean, we've just sort of discovered that it takes about like a week before you start to see like new things. I think like a lot of that style kind of emerges after that week, where you start to see, you know, there's some styles that are very like well known to us, like maybe like pixel art is a well known style. Photorealism is like another one that's like well known to us. But there are some styles that cannot be easily named. You know, it's not as simple as like, okay, that's an anime style. It's very visual. And in the end, you end up making up the name for what that style represents. And so the community kind of shapes itself around these different things. And so if anyone that's into stable diffusion and into building anything with graphics and stuff with these models, you know, you might have heard of like ProtoVision or DreamShaper, some of these weird names, but they're just invented by these authors. But they have a sort of je ne sais quoi that, you know, appeals to users. [00:27:26]Swyx: Because it like roughly embeds to what you want. [00:27:29]Suhail: I guess so. I mean, it's like, you know, there's one of my favorite ones that's fine tuned. It's not made by us.
It's called like Starlight XL. It's just this beautiful model. It's got really great color contrast and visual elements. And the users love it. I love it. And it's so hard. I think that's like a very big open question with graphics that I'm not totally sure how we'll solve. I don't know. It's, it's like an evolving situation too, because styles get boring, right? They get fatigued. Like it's like listening to the same style of pop song. I try to relate to graphics a little bit like with music, because I think it gives you a little bit of a different shape to things. Like it's not as if we just have pop music, rap music and country music, like all of these, like the EDM genre alone has like sub genres. And I think that's very true in graphics and painting and art and anything that we're doing. There's just these sub genres, even if we can't quite always name them. But I think they are emergent from the community, which is why we're always so happy to work with the community. [00:28:26]Swyx: That is a struggle. You know, coming back to this, like B2B versus B2C thing, B2C, you're going to have a huge amount of diversity and then it's going to reduce as you get towards more sort of B2B type use cases. I'm making this up here. So like you might be optimizing for a thing that you may eventually not need. [00:28:42]Suhail: Yeah, possibly. Yeah, possibly. I think like a simple thing with startups is that I worry sometimes that by trying to be overly ambitious and like really scrutinizing what something is in its most nascent phase, you miss the most ambitious thing you could have done. Like just having like very basic curiosity with something very small can like kind of lead you to something amazing. Like Einstein definitely did that. And then he, like, you know, basically won all the prizes and got everything he wanted, and then basically kind of didn't, really. He kind of dismissed quantum and then just was still searching, you know, for the unifying theory. And he like had this quest. I think that happens a lot with like Nobel Prize people. I think there's like a term for it that I forget. I actually wanted to go after a toy almost intentionally, so long as I could see, could imagine, that it would lead to something very, very large later. Like I said, it's very hobbyist, but you need to start somewhere. You need to start with something that has a big gravitational pull, even if these hobbyists aren't likely to be the people that, you know, have a way to monetize it or whatever; they're doing it for fun. So there's something, something there that I think is really important. But I agree with you that, you know, in time we will absolutely focus on more utilitarian things, like things that are more related to editing features that are much harder. And so I think like a very simple use case is just, you know, I'm not a graphics designer. It seems like very simple that like you, if we could give you the ability to do really complex graphics without skill, wouldn't you want that? You know, like my wife the other day, you know, said, I wish Playground was better. When are you guys going to have a feature where like we could make my son, his name's Devin, smile when he was not smiling in the picture for the holiday card. Right. You know, just being able to highlight his, his mouth and just say like, make him smile.
Like why can't we do that with like high fidelity and coherence, little things like that, all the way to putting you in completely different scenarios. [00:30:35]Swyx: Is that true? Can we not do that with inpainting? [00:30:37]Suhail: You can do inpainting, but the quality is just so bad. Yeah. It's just really terrible quality. You know, it's like you'll do it five times and it'll still like kind of look crooked or just artifacted. Part of it's like, you know, the lips on the face, there's such little information there. So small that the models really struggle with it. Yeah. [00:30:55]Swyx: Make the picture smaller and you don't see it. That's my trick. I don't know. [00:30:59]Suhail: Yeah. Yeah. That's true. Or, you know, you could take that region and make it really big and then like say it's a mouth and then like shrink it. It feels like you're wrestling with it more than it's doing something that kind of surprises you. [00:31:12]Swyx: Yeah. It feels like you are very much the internal tastemaker, like you carry in your head this vision for what a good art model should look like. Do you find it hard to like communicate it to like your team and other people? Just because it's obviously it's hard to put into words like we just said. [00:31:26]Suhail: Yeah. It's very hard to explain. Images have such high bitrate compared to just words and we don't have enough words to describe these things. It's not terribly difficult. I think everyone on the team, if they don't have good kind of like judgment taste or like an eye for some of these things, they're like steadily building it because they have no choice. Right. So in that realm, I don't worry too much, actually. Like everyone is kind of like learning to get the eye is what I would call it. But I also have, you know, my own narrow taste. Like I don't represent the whole population either. [00:31:59]Swyx: When you benchmark models, you know, like this benchmark we're talking about, we use FID. Yeah. Fréchet Inception Distance. OK. That's one measure. But like it doesn't capture anything you just said about smiles. [00:32:08]Suhail: Yeah. FID is generally a bad metric. It's good up to a point and then it kind of like is irrelevant. Yeah. [00:32:14]Swyx: And then so are there any other metrics that you like apart from vibes? I'm always looking for alternatives to vibes because vibes don't scale, you know. [00:32:22]Suhail: You know, it might be fun to kind of talk about this because it's actually kind of fresh. So up till now, we haven't needed to do a ton of like benchmarking because we hadn't trained our own model and now we have. So now what? What does that mean? How do we evaluate it? And, you know, we're kind of like living with the last 48, 72 hours of going, did the way that we benchmark actually succeed? [00:32:43]Swyx: Did it deliver? [00:32:44]Suhail: Right. You know, like I think Gemini just came out. They just put out a bunch of benchmarks. But all these benchmarks are just an approximation of how you think it's going to end up with real world performance. And I think that's like very fascinating to me. So if you fake that benchmark, you'll still end up in a really bad scenario at the end of the day. And so, you know, one of the benchmarks we did was we kind of curated like a thousand prompts. And I think that's kind of what we published in our blog post, you know, of all these tasks, a lot of them curated by our team, where we know the models all suck at it.
Like my favorite prompt that no model is really capable of is a horse riding an astronaut, the inverse one. And it's really, really hard to do. [00:33:22]Swyx: Not in data. [00:33:23]Suhail: You know, another one is like a giraffe underneath a microwave. How does that work? Right. There's so many of these little funny ones. We do. We have prompts that are just like misspellings of things. Yeah. We'll figure out if the models will figure it out. [00:33:36]Swyx: They should embed to the same space. [00:33:39]Suhail: Yeah. And just like all these very interesting weirdo things. And so we have so many of these and then we kind of like evaluate whether the models are any good at it. And the reality is that they're all bad at it. And so then you're just picking the most aesthetic image. We're still at the beginning of building like the best benchmark we can that aligns most with just user happiness, I think, because we're not like putting these in papers and trying to like win, you know, I don't know, awards at ICCV or something if they have awards. You could. [00:34:05]Swyx: That's absolutely a valid strategy. [00:34:06]Suhail: Yeah, you could. But I don't think it could correlate necessarily with the impact we want to have on humanity. I think we're still evolving whatever our benchmarks are. So the first benchmark was just like very difficult tasks that we know the models are bad at. Can we come up with a thousand of these, whether they're hand rated and some of them are generated? And then can we ask the users, like, how do we do? And then we wanted to use a benchmark like PartiPrompts. We mostly did that so people in academia could measure their models against ours versus others. But yeah, I mean, FID is pretty bad. And I think in terms of vibes, it's like you put out the model and then you try to see like what users make. And I think my sense is that we're going to take all the things that we notice that the users kind of were failing at and try to find like new ways to measure that, whether that's like a smile or, you know, color contrast or lighting. One benefit of Playground is that we have users making millions of images every single day. And so we can just ask them for like post-generation feedback. Yeah, we can just ask them. We can just say, like, how good was the lighting here? How was the subject? How was the background? [00:35:06]Swyx: Like a proper form of like, it's just like you make it, you come to our site, you make [00:35:10]Suhail: an image and then we say, and then maybe randomly you just say, hey, you know, like, how was the color and contrast of this image? And you say it was not very good, just tell us. So I think I think we can get like tens of thousands of these evaluations every single day to truly measure real world performance as opposed to just like benchmark performance. I would like to publish hopefully next year. I think we will try to publish a benchmark that anyone could use, that we evaluate ourselves on and that other people can, that we think does a good job of approximating real world performance because we've tried it and done it and noticed that it did. Yeah. I think we will do that. [00:35:45]Swyx: I personally have a few like categories that I consider special. You know, you know, you have like animals, art, fashion, food. There are some categories which I consider like a different tier of image. Top among them is text in images. How do you think about that?
So one of the big wow moments for me, something I've been looking out for the entire year is just the progress of text in images. Like, can you write in an image? Yeah. And Ideogram came out recently, which had decent but not perfect text in images. DALL-E 3 had improved some, and all they said in their paper was that they just included more text in the data set and it just worked. I was like, that's just lazy. But anyway, do you care about that? Because I don't see any of that in like your sample. Yeah, yeah. [00:36:27]Suhail: The V2 model was mostly focused on image quality versus like the feature of text synthesis. [00:36:33]Swyx: Well, as a business user, I care a lot about that. [00:36:35]Suhail: Yeah. Yeah. I'm very excited about text synthesis. And yeah, I think Ideogram has done a good job of maybe the best job. DALL-E has like a hit rate. Yes. You know, like sometimes it's Egyptian letters. Yeah. I'm very excited about text synthesis. You know, I don't have much to say on it just yet. You know, you don't want just text effects. I think where this has to go is it has to be like you could like write little tiny pieces of text like on like a milk carton. That's maybe not even the focal point of a scene. I think that's like a very hard task that, you know, if you could do something like that, then there's a lot of other possibilities. Well, you don't have to zero shot it. [00:37:09]Swyx: You can just be like here and focus on this. [00:37:12]Suhail: Sure. Yeah, yeah. Definitely. Yeah. [00:37:16]Swyx: Yeah. So I think text synthesis would be very exciting. I'll also flag that Max Woolf, minimaxir, which you must have come across his work. He's done a lot of stuff about using like logo masks that then map onto food and vegetables. And it looks like text, which can be pretty fun. [00:37:29]Suhail: That's the wonderful thing about like the open source community is that you get things like ControlNet and then you see all these people do these just amazing things with ControlNet. And then you wonder, I think from our point of view, we sort of go that that's really wonderful. But how do we end up with like a unified model that can do that? What are the bottlenecks? What are the issues? The community ultimately has very limited resources. And so they need these kinds of like workaround research ideas to get there. But yeah. [00:37:55]Swyx: Are techniques like ControlNet portable to your architecture? [00:37:58]Suhail: Definitely. Yeah. We kept the Playground V2 exactly the same as SDXL. Not out of laziness, but just because we knew that the community already had tools. You know, all you have to do is maybe change a string in your code and then, you know, retrain a ControlNet for it. So it was very intentional to do that. We didn't want to fragment the community with different architectures. Yeah. [00:38:16]Swyx: So basically, I'm going to go over three more categories. One is UIs, like app UIs, like mock UIs. Second is not safe for work, and then copyrighted stuff. I don't know if you care to comment on any of those. [00:38:28]Suhail: I think the NSFW kind of like safety stuff is really important. I kind of think that one of the biggest risks kind of going into maybe the U.S. election year will probably be very interrelated with like graphics, audio, video. I think it's going to be very hard to explain, you know, to a family relative who's not kind of in our world.
And our world is like sometimes very, you know, we think it's very big, but it's very tiny compared to the rest of the world. Some people, like, there's still lots of humanity who have no idea what ChatGPT is. And I think it's going to be very hard to explain, you know, to your uncle, aunt, whoever, you know, hey, I saw President Biden say this thing on a video, you know, I can't believe, you know, he said that. I think that's going to be a very troubling thing going into the world next year, the year after. [00:39:12]Swyx: That's more like a risk thing, like deepfakes, faking, political faking. But there's a lot of studies on how for most businesses, you don't want to train on not safe for work images, except that it makes you really good at bodies. [00:39:24]Suhail: Personally, we filter out NSFW type of images in our data set so that it's, you know, so our safety filter stuff doesn't have to work as hard. [00:39:32]Swyx: But you've heard this argument that not safe for work images are very good at human anatomy, which you do want to be good at. [00:39:38]Suhail: It's not like necessarily a bad thing to train on that data. It's more about like how you go and use it. That's why I was kind of talking about safety, you know, in part, because there are very terrible things that can happen in the world. If you have an extremely powerful graphics model, you know, suddenly like you can kind of imagine, you know, now if you can like generate nudes and then there's like you could do very character consistent things with faces, like what does that lead to? Yeah. And so I tend to think more what occurs after that, right? Even if you train on, let's say, you know, nude data, if it does something to kind of help, there's nothing wrong with the human anatomy, it's very valid for a model to learn that. But then it's kind of like, how does that get used? And, you know, I won't bring up all of the very, very unsavory, terrible things that we see on a daily basis on the site, but I think it's more about what occurs. And so we, you know, we just recently did like a big sprint on safety. It's very difficult with graphics and art, right? Because there is tasteful art that has nudity, right? They're all over in museums, like, you know, there's very valid situations for that. And then there's the things that are the gray line of that, you know, what I might not find tasteful, someone might be like, that is completely tasteful, right? And then there are things that are way over the line. And then there are things that maybe you or, you know, maybe I would be okay with, but society isn't, you know? So where does that kind of end up on the spectrum of things? I think it's really hard with art. Sometimes even if you have like things that are not nude, if a child goes to your site, scrolls down some images, you know, classrooms of kids, you know, using our product, it's a really difficult problem. And it stretches across culture, society, politics, everything. [00:41:14]Alessio: Another favorite topic of our listeners is UX and AI. And I think you're probably one of the best all-inclusive editors for these things. So you don't just have the prompt, images come out, you pray, and now you do it again. First, you let people pick a seed so they can kind of have semi-repeatable generation. You also have, yeah, you can pick how many images and then you leave all of them in the canvas. And then you have kind of like this box, the generation box, and you can even cross between them and outpaint. There's all these things.
How did you get here? You know, most people are kind of like, give me text, I give you image. You know, you're like, these are all the tools for you. [00:41:54]Suhail: Even though we were trying to make a graphics foundation model, I think we're also trying to re-imagine what a graphics editor might look like given the change in technology. So, you know, I don't think we're trying to build Photoshop, but it's the only thing that we could say that people are largely familiar with. Oh, okay, there's Photoshop. What would Photoshop compare itself to pre-computer? I don't know, right? It's like, or kind of like a canvas, but you know, there are these menu options and you can use your mouse. What's a mouse? So I think that we're trying to re-imagine what a graphics editor might look like, not just for the fun of it, but because we kind of have no choice. Like, there's this idea in image generation where you can generate images. That's like a super weird thing. What is that in Photoshop, right? You have to wait right now for the time being, but the wait is worth it often for a lot of people, because they can't make that with their own skills. So I think it goes back to, you know, how we started the company, which was kind of looking at GPT-3's Playground. The reason why we're named Playground is an homage to that, actually. And, you know, it's like, shouldn't these products be more visual? These prompt boxes are like a terminal window, right? We're kind of at this weird point where it's just like MS-DOS. I remember my mom using MS-DOS and I memorized the keywords, like DIR, LS, all those things, right? It feels a little like we're there, right? Prompt engineering, parentheses to say beautiful or whatever, weights the word token more in the model or whatever. That's like super strange. I think a large portion of humanity would agree that that's not user-friendly, right? So how do we think about the products to be more user-friendly? Well, sure, you know, sure, it would be nice if I wanted to get rid of, like, the headphones on my head, you know, it'd be nice to mask it and then say, you know, can you remove the headphones? You know, if I want to grow, expand the image, you know, how can we make that feel easier without typing lots of words and being really confused? I don't even think we've nailed the UI/UX yet. Part of that is because we're still experimenting. And part of that is because the model and the technology is going to get better. And whatever felt like the right UX six months ago is going to feel very broken now. So that's a little bit of how we got there, kind of asking, does everything have to be like a prompt in a box? Or can we do things that make it very intuitive for users? [00:44:03]Alessio: How do you decide what to give access to? So you have things like an expand prompt, which DALL-E 3 just does; it doesn't let you decide whether you should use it or not. [00:44:13]Swyx: As in, like, it rewrites your prompts for you. [00:44:15]Suhail: Yeah, for that feature, I think once we get it to be cheaper, we'll probably just give it away. But we also decided something that might be a little bit different. We noticed that most of image generation is just, like, kind of casual. You know, it's in WhatsApp. It's, you know, it's in a Discord bot somewhere with Midjourney. It's in ChatGPT.
One of the differentiators I think we provide, maybe at the expense of just reaching lots of casual, mainstream consumers, is that we provide as much, like, power and tweakability and configurability as possible. So the only reason why it's a toggle is because we know that users might want to use it and might not want to use it. There are some really powerful power-user hobbyists that know what they're doing. And then there are a lot of people that just want something that looks cool, but they don't know how to prompt. And so I think a lot of Playground is more about going after that core user base that, like, has a little bit more savviness about how to use these tools. You know, the average DALL-E user is probably not going to use ControlNet. They probably don't even know what that is. And so I think that, like, as the models get more powerful, as there's more tooling, hopefully you'll imagine a new sort of AI-first graphics editor that's just as, like, powerful and configurable as Photoshop. And you might have to master a new kind of tool. [00:45:28]Swyx: There are so many things I could go bounce off of. One, you mentioned about waiting. We have to kind of somewhat address the elephant in the room. Consistency models have been blowing up the past month. How do you think about integrating that? Obviously, there's a lot of other companies also trying to beat you to that space as well. [00:45:44]Suhail: I think we were the first company to integrate it. Ah, OK. [00:45:47]Swyx: Yeah. I didn't see your demo. [00:45:49]Suhail: Oops. Yeah, yeah. Well, we integrated it in a different way. OK. There are, like, 10 companies right now that have kind of tried to do, like, interactive editing, where you can, like, draw on the left side and then you get an image on the right side. We decided to kind of, like, wait and see whether there's, like, true utility in that. We have a different feature that's, like, unique in our product that is called preview rendering. And so you go to the product and you say, you know, we're like, what is the most common use case? The most common use case is you write a prompt and then you get an image. But what's the most annoying thing about that? The most annoying thing is, like, it feels like a slot machine, right? You're like, OK, I'm going to put it in and maybe I'll get something cool. So we did something that seemed a lot simpler, but a lot more relevant to how users already use these products, which is preview rendering. You toggle it on and it will show you a render of the image. And graphics tools already have this. Like, if you use Cinema 4D or After Effects or something, it's called viewport rendering. And so we tried to take something that exists in the real world, that has familiarity, and say, OK, you're going to get a rough sense of an early preview of this thing, and then when you're ready to generate, we're going to try to be as coherent as possible with that image that you saw. That way, you're not spending so much time just pulling down the slot machine lever. I think we were the first company to actually ship a quick LCM thing. Yeah, we were very excited about it. So we shipped it very quick. Yeah. [00:47:03]Swyx: Well, the demos I've been seeing, it's not like a preview necessarily. They're almost using it to animate their generations. Like, because you can kind of move shapes. [00:47:11]Suhail: Yeah, yeah, they're like doing it. They're animating it. But they're sort of showing, like, if I move a moon, you know, can I? [00:47:17]Swyx: I don't know. To me, it unlocks video in a way. [00:47:20]Suhail: Yeah. But the video models are already so much better than that. Yeah.
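For readers who want to see the preview-then-commit flow in code: here is a minimal sketch, assuming the diffusers library and a latent consistency (LCM) LoRA adapter. The model IDs are illustrative assumptions, not Playground's actual stack; the point is that a few-step draft and a fuller render from the same seed stay coherent with each other.

```python
# Sketch: fast LCM preview, then a fuller render from the same seed so the
# final image stays coherent with the preview the user approved.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")  # LCM adapter
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

prompt = "isometric game room, warm lighting"

# Preview render: a handful of steps, low guidance, near-instant feedback.
g = torch.Generator("cuda").manual_seed(42)
preview = pipe(prompt, num_inference_steps=4, guidance_scale=1.0,
               generator=g).images[0]

# Final render: same seed, more steps, once the user commits.
g = torch.Generator("cuda").manual_seed(42)
final = pipe(prompt, num_inference_steps=8, guidance_scale=1.5,
             generator=g).images[0]
```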
[00:47:23]Swyx: There's another one, which I think is the general ecosystem of LoRAs, right? Civitai is obviously the most popular repository of LoRAs. How do you think about interacting with that ecosystem? [00:47:34]Suhail: The guy that did LoRA, not the guy that invented LoRAs, but the person that brought LoRAs to Stable Diffusion, actually works with us on some projects. His name is Simo. Shout out to Simo. And I think LoRAs are wonderful. Obviously, fine-tuning all these DreamBooth models and such, it's just so heavy. And it's obvious in our conversation around styles and vibes, it's very hard to evaluate the artistry of these things. LoRAs give people this wonderful opportunity to create sub-genres of art. And I think they're amazing. Any graphics tool, any kind of thing that's expressing art, has to provide some level of customization to its user base that goes beyond just typing Greg Rutkowski in a prompt. We have to give more than that. It's not like users want to type these real artist names. It's that they don't know how else to get an image that looks interesting. They truly want originality and uniqueness. And I think LoRAs provide that. And they provide it in a very nice, scalable way. I hope that we find something even better than LoRAs in the long term, because there are still weaknesses to LoRAs, but I think they do a good job for now. Yeah. [00:48:39]Swyx: And so you would never compete with Civitai? You would just kind of let people import? [00:48:43]Suhail: Civitai's a site where all these things get kind of hosted by the community, right? And so, yeah, we'll often pull down some of the best things there. I think when we have a significantly better model, we will certainly build something that gets closer to that. Again, I go back to saying I still think this is very nascent. Things are very underpowered, right? LoRAs are not easy to train. They're easy for an engineer. It sure would be nice if I could just pick five or six reference images, right? And they might even be five or six different reference images that are not... they're just very different. They communicate a style, but they're actually, like... it's like a mood board, right? And you have to be kind of an engineer, or go to some site and be technically savvy, at least, to train these LoRAs. It seems like it'd be much better if I could say, I love this style, here are five images, and you tell the model, like, this is what I want. And the model gives you something that's very aligned with what your style is, what you're talking about. And it's a style you couldn't even communicate, right? There's no word. You know, if you have a Tron image, it's not just Tron. It's like Tron plus like four or five different weird things. Even cyberpunk can have its own sub-genres, right? But I just think training LoRAs and doing that is very heavy. So I hope we can do better than that. [00:49:50]Alessio: We had Sharif from Lexica on the podcast before. Both of you have like a landing page with just a bunch of images where you can explore things. [00:50:01]Suhail: Yeah, we have a feed. [00:50:02]Alessio: Yeah. Is that something you see more and more often in terms of coming up with these styles? Is that why you have that as the starting point, versus a lot of other products where you just go in, you have the generation prompt, you don't see a lot of examples? [00:50:14]Suhail: Our feed is a little different than their feed. Our feed is more about community.
So we have kind of like a Reddit thing going on, where there's kind of a competition every day, a loose competition, mostly a fun competition, of making things. And there's just this wonderful community of people where they're liking each other's images and just showing their genuine interest in each other. And I think we definitely learn about styles that way. One of the funniest polls, if you go to the Midjourney polls, they'll sometimes put these polls out and they'll say, you know, what do you wish you could learn more about? And one of the things that people vote the most for is learning how to prompt, right? And so I think if you put away your research hat for a minute and you just put on your product hat for a second, you're kind of like, well, why do people want to learn how to prompt, right? It's because they want to get higher quality images. Well, what's higher quality? Composition, lighting, aesthetics, so on and so forth. And I think that the community on our feed, I think we might have the biggest community, and it gives all of the users a way to learn how to prompt, because they're just seeing this huge rising tide of all these images that are super cool and interesting, and they can kind of take each other's prompts and learn how to do that. I think that'll be short-lived, because I think the complexity of these things is going to get higher. But that's more about why we have that feed: to help each other, help teach users, and then also just celebrate people's art. Swyx: You run your own infra. Suhail: We do. [00:51:30]Swyx: Yeah, that's unusual. [00:51:31]Suhail: It's necessary. It's necessary. [00:51:35]Swyx: What have you learned running DevOps for GPUs? You had a tweet about how many A100s you have, but I feel like it's probably out of date. [00:51:42]Suhail: I mean, it just comes down to cost. These things are very expensive. So we just want to make it as affordable for everybody as possible. I find the DevOps for inference to be relatively easy. It doesn't feel that different than, you know... I think we had thousands and thousands of servers at Mixpanel just for dealing with the API. It had such huge quantities of volume that I don't find it particularly very different. I do find model optimization performance is very new to me. So I think that I find that very difficult at the moment. So that's very interesting. But scaling inference is not terrible. Scaling a training cluster is much, much harder than I perhaps anticipated. Why is that? Well, it's just like a very large distributed system where, you know, if you have a node that goes down, then your training run crashes, and then you have to somehow be resilient to that. And I would say training infra software is very early. It feels very broken. I can tell that in 10 years it'll be a lot better. [00:52:37]Swyx: Like a Mosaic or whatever. [00:52:39]Suhail: Yeah, we don't even know. We just use very basic tools, like, you know, Slurm for scheduling, and just normal PyTorch, PyTorch Lightning, that kind of thing. I think our tooling is nascent. I think I talked to a friend that's over at xAI. They just built their own scheduler, you know, and they're doing things with Kubernetes. Like, when people are building out tools because the existing open source stuff doesn't work, and everyone's doing their own bespoke thing, you know, there's a valuable company to be formed. [00:53:01]Swyx: Yeah, I think it's Mosaic. [00:53:03]Suhail: I don't know.
It might be worth, like, wondering why not everyone is going to Mosaic. Perhaps it's just that this is all still nascent, and perhaps Mosaic will come through. [00:53:12]Alessio: Just to wrap, we talked about some of the pivotal moments in your mind with, like, DALL-E and whatnot. If you were not doing this, what's the most interesting unsolved question in AI that you would try and build in? [00:53:25]Suhail: Oh man, coming up with startup ideas is very hard on the spot. You have to have them. [00:53:31]Swyx: I mean, you're a founder, you're a repeat founder. [00:53:35]Suhail: I'm very picky about my startup ideas. I don't have an idea per se as much as a curiosity. I suppose I'll pose it to you guys. Right now we sort of think that a lot of the modalities just kind of feel like they're vision, language, audio, that's roughly it. And somehow all this will turn into something, it'll be multimodal, and then we'll end up with AGI. And I just think that there are probably far more modalities than meet the eye. And it just seems hard for us to see it right now, because it's sort of like we have tunnel vision on the moment. [00:54:08]Swyx: We're just like code, image, audio, video. [00:54:11]Suhail: Yeah, I think- [00:54:11]Swyx: Very, very broad categories. [00:54:13]Suhail: I think we are lacking imagination as a species in this regard. Yeah, I see it. I don't know what company would form as a result of this, but there are some very difficult problems, like a true actual world model, not a meta world model, but an actual world model that truly maps everything that's going on in terms of, like, physics and fluids and all these various kinds of interactions. What does that kind of model look like, a true physics foundation model of sorts that represents Earth? That in and of itself seems very difficult, but we're kind of stuck on thinking that we can approximate everything with, like, a word or a token, if you will. You know, I had a dinner last night where we were kind of debating this philosophically. And I think someone said something that I also believe in, which is that at the end of the day, it doesn't really matter whether it's a token or a byte; at the end of the day, it's just some unit of information that it emits. But I do wonder if there are far more modalities than meet the eye. And if you could create that, what would that company become? What problems could you solve? So I don't know yet, so I don't have a great company for it. I don't know. [00:55:15]Alessio: Maybe you just inspire somebody to try. [00:55:17]Suhail: Yeah, hopefully. [00:55:18]Swyx: My personal response to that is I'm less interested in physics and more interested in people. Like, how do I mind upload? Because that is teleportation, that is immortality, that is everything. Yeah. [00:55:29]Suhail: Rather than trying to create consciousness, could we model our own? Even if it was lossy to some extent, yeah. We won't solve that here. [00:55:35]Swyx: If I were to take a Bill Gates book trip and had a week, what should I take with me to learn AI? [00:55:42]Suhail: Oh gosh, you shouldn't take a book. You should just go to YouTube and visit Karpathy's class. [00:55:49]Swyx: Zero to Hero. [00:55:50]Suhail: And just do it, grind through it. [00:55:52]Swyx: Was that actually the most useful thing for you? [00:55:53]Suhail: I wish it came out when I started. Wow. Back last year. I was bummed that I didn't get to take it at the beginning, but I did do a few of his classes regardless.
Every time I buy a programming book, I never read it. Or an AI book. I always find that just writing code helps cement my internal understanding. Yeah. [00:56:10]Swyx: So more generally, advice for founders who are not PhDs and are effectively self-taught like you are. Like, what should they do? What should they avoid? [00:56:18]Suhail: Same thing that I would advise if you're learning to program. Pick a project that seems very exciting to you. You know, it doesn't have to be too serious. And build it, and learn every detail of it while you do it. [00:56:27]Swyx: Should you train? Or can you go far enough not training, just fine-tuning? [00:56:32]Suhail: I would just follow your curiosity. If what you want to do is something that requires a fundamental understanding of training models, then you should learn it. You don't have to go become a five-year, whatever, PhD. But if that's necessary, I would do it. If it's not necessary, then go as far as you need to go. But I would learn, pick something that motivates you. I think most people tap out on motivation unless they're deeply curious. Cool. [00:56:51]Alessio: Thank you so much for coming out, man. [00:56:53]Suhail: Thank you for having me. Appreciate it.
-
The "Normsky" architecture for AI coding agents — with Beyang Liu + Steve Yegge of SourceGraph
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2023-12-14 18:48
We are running an end of year survey for our listeners. Let us know any feedback you have for us, what episodes resonated with you the most, and guest requests for 2024! RAG has emerged as one of the key pieces of the AI Engineer stack. Jerry from LlamaIndex called it a “hack”, Bryan from Hex compared it to “a recommendation system from LLMs”, and even LangChain started with it. RAG is crucial in any AI coding workflow. We talked about context quality for code in our Phind episode. Today’s guests, Beyang Liu and Steve Yegge from Sourcegraph, have been focused on code indexing and retrieval for over 15 years. We locked them in our new studio to record a 1.5-hour masterclass on the history of code search, retrieval interfaces for code, and how they get a SOTA 30% completion acceptance rate in their Cody product by being better at the “bin packing problem” of LLM context generation.

Google Grok → Sourcegraph → Cody

While at Google in 2008, Steve built Grok, which lives on today as Google Kythe. It allowed engineers to do code parsing and searching across different codebases and programming languages. (You might remember the infamous Google Platforms Rant from Steve’s time at Google, and his 2021 followup on GCP.) Beyang was an intern at Google at the same time, and Grok became the inspiration to start Sourcegraph in 2013. The two didn’t know each other personally until Beyang brought Steve out of retirement 9 years later to join him as VP Engineering. Fast forward 10 years, Sourcegraph has become the best code search tool out there and raised $223M along the way. Nine months ago, they open-sourced Sourcegraph Cody, their AI coding assistant. All their code indexing and search infrastructure allows them to get SOTA results by having better RAG than competitors:

* Code completions as you type that achieve an industry-best Completion Acceptance Rate (CAR) as high as 30%, using a context-enhanced open-source LLM (StarCoder)
* Context-aware chat that provides the option of using GPT-4 Turbo, Claude 2, GPT-3.5 Turbo, Mixtral 8x7B, or Claude Instant, with more model integrations planned
* Doc and unit test generation, along with AI quick fixes for common coding errors
* AI-enhanced natural language code search, powered by a hybrid dense/sparse vector search engine

There are a few pieces of infrastructure that helped Cody achieve these results:

Dense-sparse vector retrieval system

For many people, RAG = vector similarity search, but there’s a lot more that you can do to get the best possible results. From their release, "sparse vector search" is a fancy name for keyword search that potentially incorporates LLMs for things like ranking and term expansion (e.g., "k8s" expands to "Kubernetes container orchestration", possibly weighted as in SPLADE):

* Dense vector retrieval makes use of embeddings, the internal representation that LLMs use to represent text. Dense vector retrieval provides recall over a broader set of results that may have no exact keyword matches but are still semantically similar.
* Sparse vector retrieval is very fast, human-understandable, and yields high recall of results that closely match the user query.
* We've found the approaches to be complementary.

There’s a very good blog post by Pinecone on SPLADE for sparse vector search if you’re interested in diving in.
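To make the complementarity concrete, here is a minimal sketch of hybrid retrieval with reciprocal rank fusion. BM25 stands in for the SPLADE-style sparse side and a small sentence-transformer for the dense side; these library and model choices are illustrative assumptions, not Sourcegraph's actual stack.

```python
# Sketch: fuse a sparse (keyword) ranking and a dense (embedding) ranking
# with reciprocal rank fusion, so either retriever can surface a document.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "func ParseConfig(path string) (*Config, error) { ... }",
    "k8s deployment manifest for the search indexer",
    "notes on Kubernetes container orchestration",
]

bm25 = BM25Okapi([d.lower().split() for d in docs])   # sparse index
model = SentenceTransformer("all-MiniLM-L6-v2")       # dense encoder
doc_emb = model.encode(docs, convert_to_tensor=True)

def hybrid_search(query: str, k: int = 3) -> list[str]:
    # Rank documents under each retriever independently.
    sparse_rank = bm25.get_scores(query.lower().split()).argsort()[::-1]
    dense_rank = util.cos_sim(
        model.encode(query, convert_to_tensor=True), doc_emb
    )[0].argsort(descending=True).tolist()
    # Reciprocal rank fusion: reward docs ranked highly by either side.
    fused: dict[int, float] = {}
    for ranking in (list(sparse_rank), dense_rank):
        for r, doc_id in enumerate(ranking):
            i = int(doc_id)
            fused[i] = fused.get(i, 0.0) + 1.0 / (60 + r)
    return [docs[i] for i in sorted(fused, key=fused.get, reverse=True)[:k]]

print(hybrid_search("k8s orchestration"))
```

Note how the sparse side catches the literal "k8s" while the dense side catches "Kubernetes"; fusion means neither phrasing misses.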
If you’re building RAG applications in areas that have a lot of industry-specific nomenclature, acronyms, etc., this is a good approach to getting better results.

SCIP

In 2016, Microsoft announced the Language Server Protocol (LSP) and the Language Server Index Format (LSIF). This protocol makes it easy for IDEs to get all the context they need from a codebase for things like file search, references, “go to definition”, etc. Sourcegraph developed SCIP, “a better code indexing format than LSIF”:

* Simpler and More Efficient Format: SCIP utilizes Protobuf instead of JSON, which is used by LSIF. Protobuf is more space-efficient, simpler, and more suitable for systems programming.
* Better Performance and Smaller Index Sizes: SCIP indexers, such as scip-clang, show enhanced performance and reduced index file sizes compared to LSIF indexers (10%-20% smaller).
* Easier to Develop and Debug: SCIP's design, centered around human-readable string IDs for symbols, makes it faster and more straightforward to develop new language indexers.

Having more efficient indexing is key to more performant RAG on code.

Show Notes

* Sourcegraph
* Cody
* Copilot vs Cody
* Steve’s Stanford seminar on Grok
* Steve’s blog
* Grab
* Fireworks
* Peter Norvig
* Noam Chomsky
* Code search
* Kelly Norton
* Zoekt
* v0.dev

See also our past episodes on Cursor, Phind, Codeium and Codium, as well as the GitHub Copilot keynote at AI Engineer Summit.

Timestamps

* [00:00:00] Intros & Backgrounds
* [00:05:20] How Steve's work on Grok inspired Sourcegraph for Beyang
* [00:08:10] What's Cody?
* [00:11:22] Comparison of coding assistants and the capabilities of Cody
* [00:16:00] The importance of context (RAG) in AI coding tools
* [00:21:33] The debate between Chomsky and Norvig approaches in AI
* [00:30:06] Normsky: the Norvig + Chomsky models collision
* [00:36:00] The death of the DSL?
* [00:40:00] LSP, SCIP, Kythe, BFG, and all that fun stuff
* [00:53:00] The Sourcegraph internal stack
* [00:58:46] Building on open source models
* [01:02:00] Sourcegraph for engineering managers?
* [01:12:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. [00:00:16]Swyx: Hey, and today we're christening our new podcast studio in the Newton, and we have Beyang and Steve from Sourcegraph. Welcome. [00:00:25]Beyang: Hey, thanks for having us. [00:00:26]Swyx: So this has been a long time coming. I'm very excited to have you. We also are just celebrating the one-year anniversary of ChatGPT yesterday, but also we'll be talking about the GA of Cody later on today. We'll just do quick intros of both of you. Obviously, people can research you and check the show notes for more. Beyang, you worked in computer vision at Stanford and then you worked at Palantir. I did, yeah. You also interned at Google. [00:00:48]Beyang: I did, back in the day, where I got to use Steve's system, the dev tool. [00:00:53]Swyx: Right. What was it called? [00:00:55]Beyang: It was called Grok. Well, the end user thing was Google Code Search. That's what everyone called it, or just like CS. But the brains of it were really the kind of like trigram index and then Grok, which provided the reference graph. [00:01:07]Steve: Today it's called Kythe, the open source Google one. It's sort of like Grok v3. [00:01:11]Swyx: On your podcast, which you've had me on, you've interviewed a bunch of other code search developers, including the current developer of Kythe, right?
[00:01:19]Beyang: No, we didn't have any Kythe people on, although we would love to if they're up for it. We had Kelly Norton, who built a similar system at Etsy; it's an open source project called Hound. We also had Han-Wen Nienhuys, who created Zoekt, which is, I think, heavily inspired by the trigram index that powered Google's original code search and that we also now use at Sourcegraph. Yeah. [00:01:45]Swyx: So you teamed up with Quinn over 10 years ago to start Sourcegraph, and you were indexing all code on the internet. And now you're in a perfect spot to create a code intelligence startup. Yeah, yeah. [00:01:56]Beyang: I guess the backstory was, I used Google Code Search while I was an intern. And then after I left that internship and worked elsewhere, it was the single dev tool that I missed the most. I felt like my job was just a lot more tedious and much more of a hassle without it. And so when Quinn and I started working together at Palantir, he had also used various code search engines in open source over the years. And it was just a pain point that we both felt, both working on code at Palantir and also working within Palantir's clients, which were a lot of Fortune 500 companies, large financial institutions, folks like that. And if anything, the pains they felt in dealing with large, complex code bases made our pain points feel small by comparison. So that was really the impetus for starting Sourcegraph. [00:02:42]Swyx: Yeah, excellent. Steve, you famously worked at Amazon. And you've told many, many stories. I want every single listener of Latent Space to check out Steve's YouTube, because he effectively had a podcast that you didn't tell anyone about or something. You just hit record and just went on a few rants. I'm always here for your Stevie rants. And then you moved to Google, where you also had some interesting thoughts on just the overall Google culture versus Amazon. You joined Grab as head of eng for a couple of years. I'm from Singapore, so I have actually personally used a lot of Grab's features. And it was very interesting to see you talk so highly of Grab's engineering and sort of overall prospects. [00:03:21]Steve: Because as a customer, it sucked? [00:03:22]Swyx: Yeah, no, it's just like, being from a smaller country, you never see anyone from our home country being on a global stage or talked about as a startup that people admire or look up to, on the league that you, with all your legendary experience, would consider equivalent. Yeah. [00:03:41]Steve: Yeah, no, absolutely. They actually, they didn't even know that they were as good as they were, in a sense. They started hiring a bunch of people from Silicon Valley to come in and sort of like fix it. And we came in and we were like, oh, we could do a little better on operational excellence and stuff. But by and large, they're really sharp. The only thing about Grab is that they get criticized a lot for being too Westernized. Oh, by who? By Singaporeans who don't want to work there. [00:04:06]Swyx: Okay. I guess I'm biased because I'm here, but I don't see that as a problem. If anything, they've had their success because they were more Westernized than the standard Singaporean tech company. [00:04:15]Steve: I mean, they had their success because they are laser-focused. They copied Amazon. I mean, they're executing really, really, really well for a giant. I was on a Slack with 2,500 engineers. It was like this giant waterfall that you could dip your toe into. You'd never catch up.
Actually, the AI summarizers would have been really helpful there. But yeah, no, I think Grab is successful because they're just out there with their sleeves rolled up, just making it happen. [00:04:43]Swyx: And for those who don't know, it's not just like the Uber of Southeast Asia, it's also a super app. PayPal Plus. [00:04:48]Steve: Yeah. [00:04:49]Swyx: In the way that super apps don't exist in the West. It's one of the enduring mysteries of B2C, that super apps work in the East and don't work in the West. We just don't understand it. [00:04:57]Beyang: Yeah. [00:04:58]Steve: It's just kind of curious. They didn't work in India either. And it was primarily because of bandwidth reasons and smaller phones. [00:05:03]Swyx: That should change now. It should. [00:05:05]Steve: And maybe we'll see a super app here. [00:05:08]Swyx: You retired-ish? I did. You retired-ish to work on your own video game? Mm-hmm. Any fun stories about that? And that's also where you discovered some need for code search, right? Mm-hmm. [00:05:16]Steve: Sure. A need for a lot of stuff. Better programming languages, better databases. Better everything. I mean, I started in like '95, right? Where there was kind of nothing. Yeah. Yeah. [00:05:24]Beyang: I just want to say, I remember when you first went to Grab, because you wrote that blog post talking about why you were excited about it, about like the expanding Asian market. And our reaction was like, oh man, how did we miss out on stealing you? [00:05:36]Swyx: Hiring you. [00:05:37]Beyang: Yeah. [00:05:38]Steve: I was like, miss that. [00:05:39]Swyx: Tell that story. So how did this happen? Right? So you were inspired by Grok. [00:05:44]Beyang: I guess the backstory from my point of view is I had used code search and Grok while at Google, but I didn't actually know that it was connected to you, Steve. I knew you from your blog posts, which were always excellent, kind of like inside, very thoughtful takes from an engineer's perspective on some of the challenges facing tech companies and tech culture and that sort of thing. But my first introduction to you within the context of code intelligence, code understanding, was when I watched a talk that you gave, I think at Stanford, about Grok when you were first building it. And that was very eye-opening. I was like, oh, like, that guy, the guy who, you know, writes the extremely thoughtful, ranty blog posts, also built that system. And so that's how I knew, you know, you were involved in that. And then, you know, we always wanted to hire you, but never knew quite how to approach you or, you know, get that conversation started. [00:06:34]Steve: Well, we got introduced by Max, right? Yeah. It was Temporal. Yeah. Yeah. I mean, it was a no-brainer. They called me up and I had noticed when Sourcegraph had come out. Of course, when they first came out, I had this dagger of jealousy stabbed through me piercingly, which I remember, because I am not a jealous person by any means, ever. But boy, I was like... but I was kind of busy, right? And just one thing led to another. I got sucked back into the ads vortex and whatever. So thank God Sourcegraph actually kind of rescued me. [00:07:05]Swyx: Here's a chance to build dev tools. Yeah. [00:07:08]Steve: That's the best. Dev tools are the best. [00:07:10]Swyx: Cool. Well, so that's the overall intro. I guess we can get into Cody. Is there anything else that people should know about you before we get started? [00:07:18]Steve: I mean, everybody knows I'm a musician. I can juggle five balls.
[00:07:24]Swyx: Five is good. Five is good. I've only ever managed three. [00:07:27]Steve: Five is hard. Yeah. And six, a little bit. [00:07:30]Swyx: Wow. [00:07:31]Beyang: That's impressive. [00:07:32]Alessio: So yeah, to jump into Sourcegraph, this has been a company 10 years in the making. And as Sean said, now you're at the right place. Phase two. Now, exactly. You spent 10 years collecting all this code, indexing it, making it easy to surface it. Yeah. [00:07:47]Swyx: And also learning how to work with enterprises and having them trust you with their code bases. Yeah. [00:07:52]Alessio: Because initially you were only doing on-prem, right? Like a lot of VPC deployments. [00:07:55]Beyang: So in the very early days, we were cloud-only. But the first major customers we landed were all on-prem, self-hosted. And that was, I think, related to the nature of the problem that we're solving, which becomes just like a critical, unignorable pain point once you're above like 100 devs or so. [00:08:11]Alessio: Yeah. And now Cody is going to be GA by the time this releases. So congrats to your future self for launching this in two weeks. Can you give a quick overview of just what Cody is? I think everybody understands that it's an AI coding agent, but a lot of companies say they have an AI coding agent. So yeah, what does Cody do? How do people interface with it? [00:08:32]Beyang: Yeah. So how is it different from the several dozen other AI coding agents that exist in the market now? When we thought about building a coding assistant that would do things like code generation and question answering about your code base, I think we came at it from the perspective of, you know, we've spent the past decade building the world's best code understanding engine for human developers, right? So it's kind of your guide as a human dev if you want to go and dive into a large, complex code base. And so our intuition was that a lot of the context that we're providing to human developers would also be useful context for AI developers to consume. And so in terms of the feature set, Cody is very similar to a lot of other assistants. It does inline autocompletion. It does codebase-aware chat. It does specific commands that automate, you know, tasks that you might rather not want to do, like generating unit tests or adding detailed documentation. But we think the core differentiator is really the quality of the context, which is hard to describe succinctly. It's a bit like saying, you know, what's the difference between Google and AltaVista? There's not like a quick checkbox list of features that you can rattle off, but it really just comes down to all the attention and detail that we've paid to making that context work well and be high quality and fast for human devs. We're now kind of plugging that into the AI coding assistant as well. Yeah. [00:09:53]Steve: I mean, just to add my own perspective onto what Beyang just described, RAG is kind of like a consultant that the LLM has available, right, that knows about your code. RAG provides basically a bridge to a lookup system for the LLM, right? Whereas fine-tuning would be more like on-the-job training for somebody. If the LLM is a person, you know, and you send them to a new job and you do on-the-job training, that's what fine-tuning is like, right? So tuned to our specific task. You're always going to need that expert, even if you get the on-the-job training, because the expert knows your particular code base, your task, right?
That expert has to know your code. And there's a chicken-and-egg problem, because, right, you know, we're like, well, I'm going to ask the LLM about my code, but first I have to explain it, right? It's this chicken-and-egg problem. That's where RAG comes in. And we have the best consultants, right? The best assistant who knows your code. And so when you sit down with Cody, right, what Beyang said earlier about going to Google and using code search, and then starting to feel like without it his job was super tedious... Once you start using these, do you guys use coding assistants? [00:10:53]Swyx: Yeah, right. [00:10:54]Steve: I mean, like, we're getting to the point very quickly, right? Where you feel like almost like you're programming without the internet, right? Or something, you know, it's like you're programming back in the nineties without the coding assistant. Yeah. Hopefully that helps for people who have no idea about coding assistants, what they are. [00:11:09]Swyx: Yeah. [00:11:10]Alessio: I mean, going back to using them, we had a lot of them on the podcast already. We had Cursor, we had Codeium and Codium, very similar names. [00:11:18]Swyx: Yeah. Phind, and then of course there's Copilot. [00:11:22]Alessio: You had a Copilot versus Cody blog post, and I think it really shows the context improvement. So you had two examples that stuck with me. One was, what does this application do? And the Copilot answer was like, oh, it uses JavaScript and NPM and this. And it's like, but that's not what it does. You know, that's what it's built with. Versus Cody was like, oh, these are like the major functions. And like, these are the functionalities, and things like that. And then the other one was, how do I start this up? And Copilot just said npm start, even though there was no start command in the package.json. But you know, mode collapse, right? Most projects use npm start, so maybe this one does too. How do you think about open source models? Because Copilot has their own private thing. And I think you guys use StarCoder, if I remember right. Yeah, that's correct. [00:12:09]Beyang: I think Copilot uses some variant of Codex. They're kind of cagey about it. I don't think they've officially announced what model they use. [00:12:16]Swyx: And I think they use a range of models based on what you're doing. Yeah. [00:12:19]Beyang: So everyone uses a range of models. Like, no one uses the same model for inline completion versus chat, because the latency requirements are different. Oh, okay. Well, there's fill-in-the-middle. There's also what the model's trained on. So like, we actually had completions powered by Claude Instant for a while. But you had to kind of prompt-hack your way to get it to output just the code and not, like, hey, you know, here's the code you asked for, that sort of text. So everyone uses a range of models. We've kind of designed Cody to be, like, especially model... not agnostic, but pluggable. So one of our design considerations was, as the ecosystem evolves, we want to be able to integrate the best-in-class models, whether they're proprietary or open source, into Cody, because the pace of innovation in the space is just so quick. And I think that's been to our advantage. Like, today Cody uses StarCoder for inline completions. And with the benefit of the context that we provide, we actually show comparable completion acceptance rate metrics. It's kind of like the standard metric that folks use to evaluate inline completion quality. It's like, if I show you a completion, what's the chance that you actually accept the completion versus you reject it? And so we're at par with Copilot, which is at the head of that industry right now. And we've been able to do that with the StarCoder model, which is open source, and the benefit of the context fetching stuff that we provide. And of course, a lot of prompt engineering and other stuff along the way.
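Since Completion Acceptance Rate (CAR) comes up repeatedly in this episode, here is a tiny sketch of how such a metric might be computed from telemetry; the event fields are hypothetical, not Sourcegraph's schema.

```python
# Sketch: completion acceptance rate over hypothetical telemetry events.
from dataclasses import dataclass

@dataclass
class CompletionEvent:
    shown: bool      # completion was actually displayed to the user
    accepted: bool   # user kept it (e.g. pressed Tab)

def completion_acceptance_rate(events: list[CompletionEvent]) -> float:
    shown = [e for e in events if e.shown]
    if not shown:
        return 0.0
    return sum(e.accepted for e in shown) / len(shown)

events = [CompletionEvent(True, True), CompletionEvent(True, False),
          CompletionEvent(True, True), CompletionEvent(False, False)]
print(f"CAR = {completion_acceptance_rate(events):.0%}")  # CAR = 67%
```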
[00:13:40]Alessio: And Steve, you wrote a post called Cheating is All You Need about what you're building. And one of the points you made is that everybody's fighting on the same axis, which is better UI and the IDE, maybe like a better chat response. But data moats are kind of the most important thing, and you guys have like a 10-year-old moat with all the data you've been collecting. How do you kind of think about what other companies are doing wrong, right? Like, why is nobody doing this, in terms of really focusing on RAG? I feel like you see so many people, oh, we just got a new model, it's like a bit better on HumanEval. And it's like, well, but maybe that's not what we should really be doing, you know? Like, do you think most people underestimate the importance of the actual RAG in code? [00:14:21]Steve: I think that people weren't doing it much. It wasn't. It's kind of at the edges of AI. It's not in the center. I know that when ChatGPT launched, so within the last year, I've heard a lot of rumblings from inside of Google, right? Because they're undergoing a huge transformation to try to, you know, of course, get into the new world. And I heard that they told, you know, a bunch of teams to go and train their own models or fine-tune their own models, right? [00:14:43]Swyx: Both. [00:14:43]Steve: And, you know, it was a s**t show. Nobody knew how to do it. They launched two coding assistants. One was called Codey, with an E-Y. And then there was, I don't know what happened in that one. And then there's Duet, right? Google loves to compete with themselves, right? They do this all the time. And they had a paper on Duet from like a year ago. And they were doing exactly what Copilot was doing, which was just pulling in the local context, right? But fundamentally, I thought of this because we were talking about the splitting of the models. [00:15:10]Steve: In the early days, it was: the LLM did everything. And then we realized that for certain use cases, like completions, a different, smaller, faster model would be better. And that fragmentation of models, actually, we expected to continue and proliferate, right? Because we are fundamentally, we're a recommender engine right now. Yeah, we're recommending code to the LLM. We're saying, may I interest you in this code right here so that you can answer my question? [00:15:34]Swyx: Yeah? [00:15:34]Steve: And being good at recommender engines, I mean, who are the best recommenders, right? There's YouTube and Spotify and, you know, Amazon or whatever, right? Yeah. [00:15:41]Swyx: Yeah. [00:15:41]Steve: And they all have many, many, many, many, many models, right? All fine-tuned for very specific things, you know. And that's where we're heading in code, too. Absolutely. [00:15:50]Swyx: Yeah. [00:15:50]Alessio: We just did an episode, which we released on Wednesday, where we said RAG is like RecSys for LLMs. You're basically just suggesting good content. [00:15:58]Swyx: It's like what? Recommendations. [00:15:59]Beyang: Recommendations. [00:16:00]Alessio: Oh, got it. [00:16:01]Steve: Yeah, yeah, yeah.
[00:16:02]Swyx: So like the naive implementation of RAG is you embed everything, throw it in a vector database, you embed your query, and then you find the nearest neighbors, and that's your RAG. But actually, you need to rank it. And actually, you need to make sure there's sample diversity and that kind of stuff. And then you're slowly gradient descenting yourself towards rediscovering proper RecSys, which has been traditional ML for a long time, but approaching it from an LLM perspective. Yeah. [00:16:24]Beyang: I almost think of it as like a generalized search problem, because it's a lot of the same things. Like, you want your layer one to have high recall and get all the potential things that could be relevant. And then there's typically like a layer-two re-ranking mechanism that bumps up the precision and tries to get the relevant stuff to the top of the results list. [00:16:43]Swyx: Have you discovered that ranking matters a lot? Oh, yeah. So the context is that I think a lot of research shows that, one, context utilization matters based on the model. Like, GPT uses the top of the context window, and then apparently Claude uses the bottom better. And it's lossy in the middle. Yeah. So ranking matters. No, it really does. [00:17:01]Beyang: The skill with which models are able to take advantage of context is always going to be dependent on how that factors into the impact on the training loss. [00:17:10]Swyx: Right? [00:17:10]Beyang: So if you want long-context-window models to work well, then you have to have a ton of data where it's like, here's a billion lines of text, and I'm going to ask a question about something that's embedded deeply into it, and give me the right answer. And unless you have that training set, then of course you're going to have variability in terms of where it attends to. And in most kind of naturally occurring data, the thing that you're talking about right now, the thing I'm asking you about, is going to be something that we talked about recently. [00:17:36]Swyx: Yeah. [00:17:36]Steve: Did you really just say gradient descenting yourself? Actually, I love that it's entered the casual lexicon. Yeah, yeah, yeah. [00:17:44]Swyx: My favorite version of that is, you know, how we have to p-hack papers. So, you know, when you throw humans at the problem, that's called graduate student descent. That's great. It's really awesome. [00:17:54]Alessio: I think the other interesting thing that you have is this inline assist UX that, I wouldn't say async, but it works while you can also do work. So you can ask Cody to make changes on a code block and you can still edit the same file at the same time. [00:18:07]Swyx: Yeah. [00:18:07]Alessio: How do you see that in the future? Like, do you see a lot of Codys running together at the same time? Like, how do you validate also that they're not messing each other up as they make changes in the code? And maybe what are the limitations today? And what do you think about where the tech is going? [00:18:21]Steve: I want to start with a little history and then I'm going to turn it over to Beyang, all right? So we actually had this feature in the very first launch back in June. Dominic wrote it. It was called nonstop Cody. And you could have multiple, basically, LLM requests in parallel modifying your source file. [00:18:37]Steve: And he wrote a bunch of code to handle all of the diffing logic. And you could see the regions of code that the LLM was going to change, right? And he was showing me demos of it. And it just felt like it was just a little before its time, you know? But a bunch of that stuff, that scaffolding, was able to be reused for where inline assist is sitting today. How would you characterize it today? [00:18:58]Beyang: Yeah, so that interface has really evolved from a, like, hey, general-purpose, request anything inline in the code and have the code update, to really targeted features, like, you know, fix the bug that exists at this line, or request a very specific change. [00:19:13]Beyang: And the reason for that is, I think, the challenge that we ran into with inline fixes, and we do want to get to the point where you could just fire and forget and have, you know, half a dozen of these running in parallel. But I think we ran into the challenge early on that a lot of people are running into now when they're trying to construct agents, which is that the reliability of working code generation is just not quite there yet in today's language models. And so that kind of constrains you to an interaction where the human is always, like, in the inner loop, checking the output of each response. And if you want that to work in a way where you can be asynchronous, you kind of have to constrain it to a domain where today's language models can generate reliable code well enough. So, you know, generating unit tests, that's a well-constrained problem. Or fixing a bug that shows up as a compiler error or a test error, that's a well-constrained problem. But the more general, like, hey, write me this class that does X, Y, and Z using the libraries that I have, that is not quite there yet, even with the benefit of really good context. Like, it definitely moves the needle a lot, but we're not quite there yet to the point where you can just fire and forget. And I actually think that this is something that people don't broadly appreciate yet, because I think that everyone's chasing this dream of agentic execution. And if we're to really define that down, I think it implies a couple things. You have a multi-step process where each step is fully automated, we don't have to have a human in the loop every time, and there's also kind of like an LM call at each stage, or nearly every stage, in that chain. [00:20:45]Beyang: Based on all the work that we've done, you know, with the inline interactions, with kind of like general Cody features for implementing longer chains of thought, we're actually a little bit more bearish than the average, you know, AI hypefluencer out there on the feasibility of agents with purely kind of like transformer-based models. To your original question, like, the inline interactions with Cody, we actually constrained it to be more targeted, like, you know, fix the current error or make this quick fix. I think that that does differentiate us from a lot of the other tools on the market, because a lot of people are going after this, like, snazzy inline edit interaction, whereas I think where we've moved, and this is based on the user feedback that we've gotten, it's like that sort of thing, it demos well, but when you're actually coding day to day, you don't want to have, like, a long chat conversation inline with the code base. That's a waste of time.
You'd rather just have it write the right thing and then move on with your life, or not have to think about it. And that's what we're trying to work towards. [00:21:37]Steve: I mean, yeah, we're not going in the agent direction, right? I mean, I'll believe in agents when somebody shows me one that works. Yeah. Instead, we're working on, you know, sort of solidifying our strength, which is bringing the right context in. So new context sources, ways for you to plug in your own context, ways for you to control or influence the context, you know, the mixing that happens before the request goes out, etc. And there's just so much low-hanging fruit left in that space that, you know, agents seem like a little bit of a boondoggle. [00:22:03]Beyang: Just to dive into that a little bit further, like, I think, you know, at a very high level, what do people mean when they say agents? They really mean greater automation, fully automated. Like, the dream is, here's an issue, go implement that, and I don't have to think about it as a human. And I think we are working towards that. Like, that is the eventual goal. I think it's specifically the approach of, like, hey, can we have a transformer-based LM alone be the kind of, like, backbone or the orchestrator of these agentic flows, where we're a little bit more bearish today. [00:22:31]Swyx: You want the human in the loop. [00:22:32]Beyang: I mean, you kind of have to. It's just a reality of the behavior of language models that are purely, like, transformer-based. And I think that's just a reflection of reality. And I don't think people realize that yet. Because if you look at the way that a lot of other AI tools have implemented context fetching, for instance, you see this in the Copilot approach, where if you use, like, the @workspace thing that supposedly provides code-base-level context, it has, like, an agentic approach, where you kind of look at how it's behaving, and it feels like they're making multiple requests to the LM, being like, what would you do in this case? Would you search for stuff? What sort of files would you gather? Go and read those files. And it's like a multi-hop step, so it takes a long while. It's also non-deterministic, because any sort of LM invocation is like a dice roll. And then at the end of the day, the context it fetches is not that good. Whereas our approach is just like, OK, let's do some code searches that make sense, and then maybe crawl through the reference graph a little bit. That is fast. That doesn't require any sort of LM invocation at all. And we can pull in much better context, you know, very quickly. So it's faster. [00:23:37]Swyx: It's more reliable. [00:23:37]Beyang: It's deterministic. And it yields better context quality. And so that's what we think. We just don't think you should cargo-cult or naively go, like, you know, agents are the future, let's just try to implement agents on top of the LMs that exist today. I think there are a couple of other technologies or approaches that need to be refined first before we can get into these kind of multi-stage, fully automated workflows. [00:24:00]
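To illustrate the contrast Beyang draws, here is a toy sketch of deterministic context fetching: one keyword search, then one hop through a reference graph, with no LLM call anywhere in the loop. The data structures are stand-ins for illustration, not Sourcegraph's actual implementation.

```python
# Sketch: deterministic, LLM-free context fetching. A keyword search picks
# seed files, then one hop through a reference graph widens the net.
def fetch_context(
    query: str,
    files: dict[str, str],       # path -> file contents (toy stand-in)
    refs: dict[str, list[str]],  # path -> paths it references (toy stand-in)
    k: int = 5,
) -> list[str]:
    terms = set(query.lower().split())
    # Step 1: fast, deterministic keyword scoring over file contents.
    seeds = sorted(
        files, key=lambda f: -len(terms & set(files[f].lower().split()))
    )[:k]
    # Step 2: crawl one hop outward through the reference graph.
    picked: list[str] = []
    for f in seeds:
        for candidate in [f, *refs.get(f, [])]:
            if candidate in files and candidate not in picked:
                picked.append(candidate)
    return [files[f] for f in picked]
```

Run twice on the same inputs, this returns the same context both times, which is exactly the property the multi-hop, LLM-orchestrated approach gives up.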
Swyx: It makes sense. You know, we're very much focused on the developer inner loop right now. But you do see things eventually moving towards the developer outer loop. Yeah. So would you basically say that they're tackling the agents problem that you don't want to tackle? [00:24:11]Beyang: No, I would say at a high level, we are after maybe, like, the same high-level problem, which is like, hey, I want some code written, I want to develop some software, can a system automate building that software for me. I think the approaches might be different. So I think the analogy in my mind is, I think about, like, the AI chess players. Coding, in some senses, I mean, it's similar and dissimilar to chess. I think one question I ask is, like, do you think producing code is more difficult than playing chess, or less difficult than playing chess? More. [00:24:41]Swyx: I think more. [00:24:41]Beyang: Right. And if you look at the best AI chess players, like, yes, you can use an LLM to play chess. Like, people have shown demos where it's like, oh, yeah, GPT-4 is actually a pretty decent, like, chess-move suggester. Right. But you would never build a best-in-class chess player off of GPT-4 alone. [00:24:57]Swyx: Right. [00:24:57]Beyang: Like, the way that people design chess players is that you have kind of like a search space, and then you have a way to explore that search space efficiently. There's a bunch of search algorithms, essentially, doing tree search in various ways. And you can have heuristic functions, which might be powered by an LLM. [00:25:12]Swyx: Right. [00:25:12]Beyang: Like, you might use an LLM to generate proposals in that space that you can efficiently explore. But the backbone is still this kind of more formalized tree-search-based approach, rather than the LLM itself. And so I think my high-level intuition is that the way that we get to more reliable multi-step workflows that do things beyond, you know, generating unit tests, it's really going to be like a search-based approach, where you use an LLM as kind of like an advisor or a proposal function, sort of your heuristic function, like in the A* search algorithm. But it's probably not going to be the thing that is the backbone, because it's just not the right tool for that. Yeah. [00:25:50]
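As a toy illustration of the "LLM as heuristic inside a classical search" idea Beyang sketches: a best-first search where score_with_llm and propose_with_llm are stand-ins for model calls (stubbed here with trivial hypothetical logic), while the backbone remains an ordinary search loop.

```python
# Sketch: best-first search where an "LLM" supplies the heuristic and the
# proposals, but a classical search algorithm remains the backbone.
import heapq

def score_with_llm(state: str) -> float:
    # Stand-in heuristic; in practice this would prompt a model to judge
    # how promising a partial solution looks.
    return -abs(len(state) - 10)

def propose_with_llm(state: str) -> list[str]:
    # Stand-in proposal function; in practice an LLM would generate
    # candidate next steps (edits, moves, subgoals).
    return [state + "a", state + "b"]

def best_first_search(start: str, is_goal, budget: int = 100):
    frontier = [(-score_with_llm(start), start)]
    seen = {start}
    while frontier and budget > 0:
        budget -= 1
        _, state = heapq.heappop(frontier)  # pop the most promising state
        if is_goal(state):
            return state
        for nxt in propose_with_llm(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score_with_llm(nxt), nxt))
    return None

print(best_first_search("", lambda s: len(s) == 10))
```

The search structure, not the model, decides what gets explored next; the model only ranks and proposes, which matches the chess-engine analogy.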
I think Lisp was invented so that you could create, like, rules-based systems that you would call AI. As a language. Yeah. And for a long time, there was this debate, like, there are certain AI research labs that were more, you know, in the Chomsky camp and others that were more in the Norvig camp. It's a debate that rages on today. And I feel like the consensus right now is that, you know, Norvig definitely has the upper hand right now with the advent of LLMs and diffusion models and all the other recent progress in machine learning. But the Chomsky-based stuff is still really useful in my view. I mean, it's like parsers, compilers, basically a lot of the stuff that provides really good context. It provides kind of like the knowledge graph backbone that you want to explore with your AI dev tool. Like, that will come from kind of like Chomsky-based tools like compilers and parsers. It's a lot of what we've invested in in the past decade at Sourcegraph and what you built with Grok. Basically, like, these formal systems that construct these very precise knowledge graphs that are great context providers and great kind of like guardrail enforcers and safety checkers for the output of a more kind of like data-driven, fuzzier system that uses, like, the Norvig-based models. [00:28:03]Steve: Beyang was talking about this stuff like it happened in the Middle Ages. Like, okay, so when I was in college, I was in college learning Lisp and Prolog and planning and all the deterministic Chomsky approaches to AI. And I was there when Norvig basically declared it dead. I was there 3,000 years ago when Norvig and Chomsky fought on the volcano. When did he declare it dead? [00:28:26]Swyx: What do you mean he declared it dead? [00:28:27]Steve: It was like late 90s. [00:28:29]Swyx: Yeah. [00:28:29]Steve: When I went to Google, Peter Norvig was already there. He had basically, like, I forget exactly where, he's got so many famous short posts, you know, amazing. [00:28:38]Swyx: He had a famous talk, The Unreasonable Effectiveness of Data. Yeah. [00:28:41]Steve: Maybe that was it. But at some point, basically, he convinced everybody that deterministic approaches had failed and that heuristic-based, you know, data-driven, statistical, stochastic approaches were better. [00:28:52]Swyx: Yeah. [00:28:52]Steve: The primary reason, I can tell you this because I was there, was that, well, the steam-powered engine, no. The reason was that the deterministic stuff didn't scale. [00:29:06]Swyx: Yeah. Right. [00:29:06]Steve: They were using Prolog, man, constraint systems and stuff like that. Well, that was a long time ago, right? Today, actually, these Chomsky-style systems do scale. And that's, in fact, exactly what Sourcegraph has built. Yeah. And so we have a very unique, I love the framing that Beyang's made, the marriage of the Chomsky and the Norvig, you know, sort of conceptual models, because we have both of them and they're both really important. And in fact, there's this really interesting kind of overlap between them, right? Where, like, the AI or our graph or our search engine could potentially provide the right context for any given query, which is, of course, why ranking is important. But what we've really signed ourselves up for is an extraordinary amount of testing. [00:29:45]Swyx: Yeah.
[00:29:45]Steve: Because, swyx, you were saying that, you know, GPT-4 attends to the front of the context window, and maybe other LLMs to the back, and maybe some to the middle. [00:29:53]Swyx: Yeah. [00:29:53]Steve: And so that means that, you know, if we're actually, you know, verifying whether some change we've made has improved things, we're going to have to test putting it at the beginning of the window and at the end of the window, you know, and maybe make the right decision based on the LLM that you've chosen. Which, for some of our competitors, that's a problem that they don't have, but we meet you, you know, where you are. Yeah. And we're, just to finish, we're writing tens of thousands of tests. We're generating tests, you know, fill-in-the-middle type tests and things. And then using our graph to basically sort of fine-tune Cody's behavior there. [00:30:20]Swyx: Yeah. [00:30:21]Beyang: I also want to add, like, I have an internal pet name for this kind of hybrid architecture that I'm trying to make catch on. Maybe I'll just say it here. Just saying it publicly kind of makes it more real. But, like, I call the architecture that we've developed the Normsky architecture. [00:30:36]Swyx: Yeah. [00:30:36]Beyang: I mean, it's obviously a portmanteau of Norvig and Chomsky, but the acronym, it stands for non-agentic, rapid, multi-source code intelligence. So non-agentic because... Rolls right off the tongue. And Normsky. But it's non-agentic in the sense that, like, we're not trying to pitch you on kind of like agent hype, right? Like, the things it does are really just developer tools developers have been using for decades now, like parsers and really good search indexes and things like that. Rapid because we place an emphasis on speed. We don't want to sit there waiting for kind of like multiple LLM requests to return to complete a simple user request. Multi-source because we're thinking broadly about what pieces of information and knowledge are useful context. So obviously starting with things that you can search in your code base, and then you add in the reference graph, which kind of like allows you to crawl outward from those initial results. But then even beyond that, you know, sources of information, like there's a lot of knowledge that's embedded in docs, in PRDs or product specs, in your production logging system, in your chat, in your Slack channel, right? Like, there's so much context embedded there. And when you're a human developer, and you're trying to be productive in your code base, you're going to go to all these different systems to collect the context that you need to figure out what code you need to write. And I don't think the AI developer will be any different. It will need to pull context from all these different sources. So we're thinking broadly about how to integrate these into Cody. We hope through kind of like an open protocol that others can extend and implement. And this is something else that should be accessible by December 14th in kind of like a preview stage. But that's really about broadening this notion of the code graph beyond your Git repository to all the other sources where technical knowledge and valuable context can live. [00:32:21]Steve: Yeah, it becomes an artifact graph, right? It can link into your logs and your wikis and any data source, right?
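The placement testing Steve mentions could look something like this toy harness, which tries the same context snippet at the front and the back of the prompt and keeps whichever position scores better for a given model. `callModel` and `scoreAnswer` are hypothetical stubs; the real thing would run thousands of generated fill-in-the-middle tests per model.

```typescript
// Toy eval harness for the placement problem: the same context item is
// tried at the front and at the back of the prompt, and we keep whichever
// position scores better for the model under test.

type Position = "front" | "back";

function buildPrompt(context: string, question: string, pos: Position): string {
  return pos === "front" ? `${context}\n\n${question}` : `${question}\n\n${context}`;
}

function callModel(model: string, prompt: string): string {
  // Stand-in for a real completion API call.
  return `answer from ${model} for ${prompt.length} chars`;
}

function scoreAnswer(answer: string, expected: string): number {
  return answer.includes(expected) ? 1 : 0; // stand-in for a real grader
}

function bestPosition(
  model: string,
  context: string,
  cases: { q: string; a: string }[],
): Position {
  const score = (pos: Position) =>
    cases.reduce(
      (acc, c) => acc + scoreAnswer(callModel(model, buildPrompt(context, c.q, pos)), c.a),
      0,
    );
  return score("front") >= score("back") ? "front" : "back";
}

console.log(bestPosition("gpt-4", "function foo() {}", [{ q: "what is foo?", a: "answer" }]));
```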
[00:32:27]Alessio: How do you guys think about the importance of, it's almost like data pre-processing in a way, which is bring it all together, tie it together, make it ready. Any thoughts on how to actually make that good? Some of the innovation you guys have made? [00:32:40]Steve: We talk a lot about the context fetching, right? I mean, there's a lot of ways you could answer this question. But, you know, we've spent a lot of time just in this podcast here talking about context fetching. But stuffing the context into the window is, you know, the bin packing problem, right? Because the window is not big enough, and you've got more context than you can fit. You've got a ranker, maybe. But what is that context? Is it a function that was returned by an embedding or a graph call or something? Do you need the whole function? Or do you just need, you know, the top part of the function, this expression here, right? So that art, the golf game of trying to get each piece of context down into its smallest state, possibly even summarized by another model before it even goes to the LLM, that becomes the game that we're in, yeah? And so, you know, recursive summarization and all the other techniques that you've got to use to stuff things into that context window become, you know, critically important. And you have to test them across every configuration of models that you could possibly need. [00:33:32]Beyang: I think data preprocessing is probably the, like, unsexy, way underappreciated secret to a lot of the cool stuff that people are shipping today. Whether you're doing like RAG or fine-tuning or pre-training, the preprocessing step matters so much because it's basically garbage in, garbage out, right? Like, if you're feeding in garbage to the model, then it's going to output garbage. Concretely, you know, for code RAG, if you're not doing some sort of preprocessing that takes advantage of a parser and is able to extract the key components of a particular file of code, you know, separate the function signature from the body, from the doc string, what are you even doing? Like, that's table stakes. It opens up so many more possibilities with which you can kind of like tune your system to take advantage of the signals that come from those different parts of the code. Like, we've had a tool, you know, since computers were invented, that understands the structure of source code to a hundred percent precision. The compiler knows everything there is to know about the code in terms of, like, structure. Like, why would you not want to use that in a system that's trying to generate code, answer questions about code? You shouldn't throw that out the window just because now we have really good, you know, data-driven models that can do other things. [00:34:44]Steve: Yeah. When I called it a data moat, you know, in my cheating post, a lot of people were confused, you know, because data moat sort of sounds like data lake because there's data and water and stuff. I don't know. And so they thought that we were sitting on this giant mountain of data that we had collected, but that's not what our data moat is. It's really a data pre-processing engine that can very quickly and scalably, like, basically dissect your entire code base into very small, fine-grained, you know, semantic units and then serve them up. Yeah. And so it's really, it's not a data moat. It's a data pre-processing moat, I guess. [00:35:15]Beyang: Yeah.
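Here's a minimal sketch of the bin-packing game Steve describes: ranked snippets get packed into a fixed token budget, and an oversized snippet degrades to its first line (standing in for a parser-extracted signature) instead of being dropped outright. The token counter is a crude whitespace approximation, and the whole thing is an illustration rather than Cody's actual packer.

```typescript
// Pack ranked context snippets into a fixed token budget. An oversized
// snippet degrades gracefully to its "headline" (think: the function
// signature a parser would extract) rather than being dropped.

interface Ranked { text: string; rank: number }

const tokens = (s: string) => s.split(/\s+/).length; // crude token estimate

function headline(text: string): string {
  return text.split("\n")[0]; // stand-in for a parser-extracted signature
}

function packContext(snippets: Ranked[], budget: number): string[] {
  const packed: string[] = [];
  let used = 0;
  for (const s of [...snippets].sort((a, b) => b.rank - a.rank)) {
    const full = tokens(s.text);
    const head = tokens(headline(s.text));
    if (used + full <= budget) {
      packed.push(s.text); // whole snippet fits
      used += full;
    } else if (used + head <= budget) {
      packed.push(headline(s.text)); // fall back to just the signature
      used += head;
    }
  }
  return packed;
}

console.log(
  packContext(
    [{ text: "function f(a: number): number\n{ return a * 2 }", rank: 1 }],
    5,
  ),
);
```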
If anything, we're, like, hypersensitive to customer data privacy requirements. So it's not like we've taken a bunch of private data and, like, you know, trained a generally available model. In fact, exactly the opposite. A lot of our customers are choosing Cody over Copilot and other competitors because we have an explicit guarantee that we don't do any of that. And we've done that from day one. Yeah. I think that's a very real concern in today's day and age, because, like, if your proprietary IP finds its way into the training set of any model, it's very easy both to extract that knowledge from the model and also to use it to, you know, build systems that kind of work on top of the institutional knowledge that you've built up. [00:35:52]Alessio: About a year ago, I wrote a post on LLMs for developers. And one of the points I had was about the depth of the DSL. I spent most of my career writing Ruby and I love Ruby. It's so nice to use, but, you know, it's not as performant, but it's really easy to read, right? And then you look at other languages, maybe they're faster, but they're more verbose, you know? And when you think about efficiency of the context window, that actually matters. [00:36:15]Swyx: Yeah. [00:36:15]Alessio: But I haven't really seen a DSL for models, you know? I haven't seen, like, code being optimized to be easier to put in a model context. And it seems like your pre-processing is kind of doing that. Do you see, in the future, the way we think about DSLs and APIs and kind of like service interfaces being more focused on being context-friendly, where maybe it's harder to read for the human, but, like, the human is never going to write it anyway? We were talking about this on the Hacks podcast. There are some data science things that humans are never going to write again because the models can just do them very easily. Yeah, curious to hear your thoughts. [00:36:51]Steve: Well, so DSLs, they involve, you know, writing a grammar and a parser, and they're like little languages, right? We do them that way because, you know, we need them to compile and humans need to be able to read them and so on. The LLMs don't need that level of structure. You can throw any pile of crap at them, you know, more or less unstructured, and they'll deal with it. So I think that's why a DSL hasn't emerged for sort of like communicating with the LLM or packaging up the context or anything. Maybe it will at some point, right? We've got, you know, tagging of context and things like that that are sort of peeking into DSL territory, right? But your point on do users, you know, do people have to learn DSLs like regular expressions or, you know, pick your favorite, right? XPath. I think you're absolutely right that the LLMs are really, really good at that. And I think you're going to see a lot less of people having to slave away learning these things. They just have to know the broad capabilities, and the LLM will take care of the rest. [00:37:42]Swyx: Yeah, I'd agree with that. [00:37:43]Beyang: I think basically, like, the value proposition of a DSL is that it makes it easier to work with a lower-level language, but at the expense of introducing an abstraction layer. And in many cases today, you know, without the benefit of AI code generation, that's totally worth it, right? With the benefit of AI code generation, I mean, I don't think all DSLs will go away. I think there's still, you know, places where that trade-off is going to be worthwhile.
But it's kind of like, how much of source code do you think is going to be generated through natural language prompting in the future? Because in a way, like, any programming language is just a DSL on top of assembly, right? And so if people can do that, then yeah, like, maybe for a large portion of the code [00:38:21]Swyx: that's written, [00:38:21]Beyang: people don't actually have to understand the DSL that is Ruby or Python or basically any other programming language that exists. [00:38:28]Steve: I mean, seriously, do you guys ever write SQL queries now without using a model of some sort? At least a draft. [00:38:34]Swyx: Yeah, right. [00:38:36]Steve: And so we have kind of, you know, crossed that bridge, right? [00:38:39]Alessio: Yeah, I think, like, to me, the long-term thing is, like, is there ever going to be a point where you don't actually see the code, you know? It's like, hey, the basic thing is, like, hey, I need a function to sum two numbers, and that's it. I don't need you to generate the code. [00:38:53]Steve: And the following question, do you need the engineer or the paycheck? [00:38:56]Swyx: I mean, right? [00:38:58]Alessio: That's kind of the agents discussion in a way, where you can't fully automate with agents yet, but slowly you're getting more of the atomic units of the work done. I kind of think of it as, like, you know, [00:39:09]Beyang: do you need a punch card operator, to answer that for you? And so, like, I think we're still going to have people in the role of a software engineer, but the portion of time they spend on these kinds of low-level, tedious tasks versus the higher-level, more creative tasks is going to shift. [00:39:23]Steve: No, I haven't used punch cards. [00:39:25]Swyx: Yeah, I've been talking about, like, so we kind of made this podcast about the sort of rise of the AI engineer. And, like, the first step is the AI-enhanced engineer. That is that software developer that is no longer doing these routine, boilerplate-y type tasks, because they're just enhanced by tools like yours. So you mentioned OpenCodeGraph. I mean, that is a kind of DSL, maybe, and because you're releasing this as you go GA, do you hope for other people to take advantage of that? [00:39:52]Beyang: Oh yeah, so I would say OpenCodeGraph is not a DSL. It's more of a protocol. It's basically like, hey, if you want to make your system, whether it's, you know, chat or logging or whatever, accessible to an AI developer tool like Cody, here's kind of like the schema by which you can provide that context and offer hints. So I would, you know, make comparisons like LSP, which obviously did this for kind of like standard code intelligence. It's kind of like a lingua franca for providing find-references and go-to-definition. There are kind of like analogs to that. There might be also analogs to kind of like the original OpenAI plugins API. There's all this context out there that might be useful for an LLM-based system to consume. And so, at a high level, what we're trying to do is define a common language for context providers to provide context to other tools in the software development lifecycle. Yeah. Do you have any critiques of LSP, by the way, [00:40:42]Swyx: since, like, this is very much, very close to home? [00:40:45]Steve: One of the authors wrote a really good critique recently. Yeah. I don't think I saw that. Yeah, yeah. LSP could have been better. It just came out a couple of weeks ago. It was a good article. [00:40:54]Beyang: Yeah. I think LSP is great.
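For flavor, a context-provider protocol in the spirit of what Beyang describes might look like the interface below: any system (logs, docs, chat) implements one shape and can then feed hints to an AI dev tool. This is an invented illustration, not the actual OpenCodeGraph schema.

```typescript
// Hypothetical shape of a context-provider protocol: any external system
// implements one interface and can then supply context to an AI dev tool.

interface ContextItem {
  title: string;    // e.g. "Production errors touching checkout()"
  url?: string;     // deep link back to the source system
  content: string;  // the text the LLM will actually see
}

interface ContextProvider {
  id: string;
  // Given what the user is looking at, return any relevant context.
  provideContext(query: { file: string; selection?: string }): Promise<ContextItem[]>;
}

// Example: a toy provider exposing log lines for the current file.
const logProvider: ContextProvider = {
  id: "logs",
  async provideContext({ file }) {
    return [
      { title: `Recent logs touching ${file}`, content: "ERROR checkout(): timeout" },
    ];
  },
};

logProvider.provideContext({ file: "checkout.ts" }).then((items) => console.log(items));
```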
Like, for what it did for the developer ecosystem, it was absolutely fantastic. Like, nowadays, it's much easier to get code navigation up and running in a bunch of editors by speaking this protocol. I think maybe the interesting question is looking at the different design decisions comparing LSP basically with Kythe. Because Kythe has more of a... How would you describe it? [00:41:18]Steve: A storage format. [00:41:20]Beyang: I think the critique of LSP from a Kythe point of view would be, like, with LSP, you don't actually have an actual symbolic model of the code. It's not like LSP models, hey, this function calls this other function. LSP is all range-based. Like, hey, your cursor's at line 32, column 1. [00:41:35]Swyx: Yeah. [00:41:35]Beyang: And that's the thing you feed into the language server. And then it's like, okay, here's the range that you should jump to if you click on that range. So it kind of is intentionally ignorant of the fact that there's a thing called a reference underneath your cursor, and that's linked to a symbol definition. [00:41:49]Steve: Well, actually, that's the worst example you could have used. You're right. But that's the one thing that it actually did bake in, is following references. [00:41:56]Swyx: Sure. [00:41:56]Steve: But it's sort of hardwired. [00:41:58]Swyx: Yeah. [00:41:58]Steve: Whereas Kythe attempts to model [00:42:00]Beyang: like all these things explicitly. [00:42:02]Swyx: And so... [00:42:02]Steve: Well, so LSP is a protocol, right? And so Google's internal protocol is gRPC-based. And it's a different approach than LSP. It's basically, you make a heavy query to the back end, and you get a lot of data back, and then you render the whole page, you know? So we've looked at LSP, and we think that it's a little long in the tooth, right? I mean, it's a great protocol, lots and lots of support for it. But we need to push into the domain of exposing the intelligence through the protocol. Yeah. [00:42:29]Beyang: And so I would say we've developed a protocol of our own called SCIP, which is, at a very high level, trying to take some of the good ideas from LSP and from Kythe and merge them into a system that in the near term is useful for Sourcegraph, but in the long term, we hope, will be useful for the ecosystem. Okay, so here's what LSP did well. LSP, by virtue of being intentionally dumb, dumb in air quotes, because I'm not ragging on it, allowed language server developers to kind of like bypass the hard problem of modeling language semantics precisely. So if all you want to do is jump-to-definition, you don't have to come up with a universally unique naming scheme for each symbol, which is actually quite challenging, because you have to think about, like, okay, what's the top scope of this name? Is it the source code repository? Is it the package? Does it depend on, like, what package server you're fetching this from? Like, whether it's the public one or the one inside your... Anyways, like, naming is hard, right? And by just going with kind of like a location-to-location-based approach, you basically just throw that out the window. All I care about is jump-to-definition, just make that work. And you can make that work without having to deal with all the complex global naming things. The limitation of that approach is that it's harder to build on top of that to build a true knowledge graph.
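Reduced to types, the contrast Beyang draws looks roughly like this: LSP-style requests speak in positions and ranges, while a symbolic model names things and relates them directly. Both shapes here are simplified illustrations, not the real protocol definitions.

```typescript
// LSP-style: everything is a file position or a range. The protocol never
// says "this is a reference to symbol X"; it says "you are at line 32,
// column 1" and answers with a range to jump to.

interface Pos { line: number; character: number }

interface DefinitionRequest { uri: string; position: Pos }            // cursor location in
interface DefinitionResponse { uri: string; start: Pos; end: Pos }    // target range out

// A symbolic model, by contrast, names things and relates them directly:
interface SymbolEdge { from: string; to: string; kind: "calls" | "references" }

const example: SymbolEdge = { from: "pkg/auth.Login", to: "pkg/token.Check", kind: "calls" };
console.log(example);
```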
Like, if you actually want a system that says, okay, here's the web of functions and here's how they reference each other, and I want to incorporate that semantic model of how the code operates or how the code relates to each other at, like, a static level, you can't do that with LSP, because you have to deal with line ranges. And, like, concretely, the pain point that we found in using LSP for Sourcegraph is, like, in order to do a find-references [00:44:04]Swyx: and then jump-to-definition, [00:44:04]Beyang: it's like a multi-hop process, because you have to jump to the range and then you have to find the symbol at that range. And it just adds a lot of latency and complexity to these operations, whereas, as a human, you're like, well, this thing clearly references this other thing. Why can't you just jump me to that? And I think that's the thing that Kythe does well. But then I think the issue that Kythe has had with adoption is because it has a more sophisticated schema, I think. And so there's basically more things that you have to implement to get a Kythe implementation up and running. I hope I'm not, like, correct me if I'm wrong about any of this. [00:44:35]Steve: 100%, 100%. Kythe also has a problem, all these systems have the problem, even SCIP, or at least the way that we implemented the indexers, that they have to integrate with your build system in order to build that knowledge graph, right? Because you have to basically compile the code in a special mode to generate artifacts instead of binaries. And I would say, by the way, earlier I was saying that XREFs were in LSP, but actually, I was thinking of LSP plus LSIF. [00:44:58]Swyx: Yeah. That's another one. [00:45:01]Steve: Which is actually bad. We can say that it's bad, right? [00:45:04]Steve: Like SCIP or Kythe, it's supposed to be sort of a model serialization, you know, for the code graph, but it basically just does what LSP needs, the bare minimum. LSIF is basically if you took LSP [00:45:16]Beyang: and turned that into a serialization format. So, like, you build an index for language servers to kind of like quickly bootstrap from a cold start. But it's a graph model [00:45:23]Steve: with all of the inconvenience of the API without an actual graph. And so, yeah. [00:45:29]Beyang: So one of the things that we tried to do with SCIP is capture the best of both worlds. So, like, make it easy to write an indexer, make the schema simple, but also model some of the more symbolic characteristics of the code that would allow us to essentially construct this knowledge graph that we can then make useful for both the human developer, through Sourcegraph, and the AI developer, through Cody. [00:45:49]Steve: So anyway, just to finish off the graph comment, we've got a new graph, yeah, that's SCIP-based. We call it BFG internally, right? It's a beautiful something graph. A big friendly graph. [00:46:00]Swyx: A big friendly graph. [00:46:01]Beyang: It's a blazing fast... [00:46:02]Steve: Blazing fast. [00:46:03]Swyx: Blazing fast graph. [00:46:04]Steve: And it is blazing fast, actually. It's really, really interesting. I should probably do a blog post about it to walk you through exactly how they're doing it. Oh, please. But it's a very AI-like, iterative, you know, experimentation sort of approach. We're building a code graph based on all of our 10 years of knowledge about building code graphs, yeah? But we're building it quickly, with zero configuration, and it doesn't have to integrate with your build.
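To ground the SCIP discussion, here is a sketch of why a symbol-keyed index makes find-references a single hop: once an indexer has recorded every occurrence under a stable symbol name, the query is one map lookup rather than a jump-to-range-then-resolve round trip. The naming scheme is invented for illustration.

```typescript
// Symbol-keyed reference index: built ahead of time by an indexer, queried
// in one hop at read time. No range resolution, no LLM, no multi-hop.

interface Occurrence { file: string; line: number }

// symbol -> everywhere it appears in the codebase.
const references = new Map<string, Occurrence[]>([
  ["pkg/token.Check", [
    { file: "auth.ts", line: 12 },
    { file: "session.ts", line: 40 },
  ]],
]);

function findReferences(symbol: string): Occurrence[] {
  return references.get(symbol) ?? []; // a single map access
}

console.log(findReferences("pkg/token.Check").length); // 2
```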
And through some magic tricks that we have. And so it just happens when you install the plugin: it'll be there, indexing your code and providing that knowledge graph in the background, without all that build system integration. This is a bit of secret sauce that we haven't really advertised very much lately. But I am super excited about it, because what they do is they say, all right, you know, let's tackle function parameters today. Cody's not doing a very good job of completing function call arguments or function parameters in the definition, right? Yeah, we generate those thousands of tests, and then we can actually reuse those tests for the AI context as well. So fortunately, things are kind of converging on, we have, you know, half a dozen really, really good context sources, and we mix them all together. So anyway, BFG, you're going to hear more about it, probably around the holidays? [00:47:12]Beyang: I think it'll be online for December 14th. We'll probably mention it. BFG is probably not the public name we're going to go with. I think we might call it like Graph Context or something like that. [00:47:20]Steve: We're officially calling it BFG. [00:47:22]Swyx: You heard it here first. [00:47:24]Beyang: BFG is just kind of like the working name. And so the impetus for BFG was, like, if you look at current AI inline code completion tools and the errors that they make, a lot of the errors that they make, even in kind of like the easy, single-line case, are essentially type errors, right? Like, you're trying to complete a function call and it suggests a variable that you defined earlier, but that variable is the wrong type. [00:47:47]Swyx: And that's the sort of thing [00:47:47]Beyang: where it's like a first-year, like, freshman CS student would not make that error, right? So, like, why does the AI make that error? And the reason is, I mean, the AI is just suggesting things that are plausible without the context of the types or any other broader files in the code. And so the kind of intuition here is, like, why don't we just do the basic thing that any baseline intelligent human developer would do, which is click jump-to-definition, click find-references, and pull in that Graph Context into the context window, and then have it generate the completion. So that's sort of like the MVP of what BFG was. And it turns out that works really well. Like, you can eliminate a lot of the type errors that AI coding tools make just by pulling in that context. Yeah, but the graph is definitely [00:48:32]Steve: our Chomsky side. [00:48:33]Swyx: Yeah, exactly. [00:48:34]Beyang: So, like, this Chomsky-Norvig thing, I think, pops up in a bunch of different layers. And I think it's just a very useful and also kind of like nicely nerdy way to describe the system that we're trying to build. [00:48:46]Steve: By the way, I remembered the point I was trying to make earlier to your question, Alessio, about is AI going to replace programmers? And I was talking about how with compilers, they thought, oh, are compilers going to replace programming? And what it did was just change [00:48:57]Beyang: kind of what programmers [00:48:58]Steve: had to focus on. And I think AI is just going to level up the game, right? Programmers are still in the middle of it all, you know, until agents come along, but I don't believe in those yet. And so, yeah. [00:49:09]Beyang: Yeah, I mean, to be clear, again, like, with the agent stuff, at a high level, I think we will get there.
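Backing up to BFG for a moment, the intuition in miniature: before asking the model to complete, do what a developer would do, i.e. jump to the definitions of the symbols near the cursor and put those definitions in the prompt so the model can get the types right. `definitionOf` and `complete` below are stubs standing in for the code graph and the LLM; none of this is Sourcegraph's actual implementation.

```typescript
// Graph-context-augmented completion: look up definitions of the symbols
// in the snippet and prepend them to the prompt before calling the model.

const definitions: Record<string, string> = {
  getUser: "function getUser(id: string): User",
  User: "interface User { id: string; name: string }",
};

function definitionOf(symbol: string): string | undefined {
  return definitions[symbol]; // stand-in for a code-graph lookup
}

function complete(prompt: string): string {
  return "/* model completion for: " + prompt.slice(0, 40) + "... */"; // stub
}

function completeWithGraphContext(snippet: string): string {
  const symbols = snippet.match(/[A-Za-z_]\w*/g) ?? [];
  const ctx = symbols
    .map(definitionOf)
    .filter((d): d is string => d !== undefined)
    .join("\n");
  // Definitions go first so the model sees the types before the hole.
  return complete(`${ctx}\n\n${snippet}`);
}

console.log(completeWithGraphContext("const u = getUser("));
```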
[00:49:14]Swyx: I think that's still [00:49:14]Beyang: the kind of long-term target. And I think also, with Cody, it's like you can have Cody draft up an execution plan. It's just not going to be the sort of thing where you don't have to attend to what it's doing. Like, we think that with Cody, it's like, yes, Cody, like, hey, I have this bug, [00:49:30]Swyx: help me solve it. [00:49:30]Beyang: It would do a reasonable job of fetching context and saying, like, here are the files you should modify. And if you prompt it further, you can actually suggest code changes to make to those files. And that's a very nice way to resolve issues, because you're kind of like on the rails for most of the time. But then, you know, now and then you have to intervene as a human. I just think that, like, [00:49:48]Swyx: if we're trying to get [00:49:48]Beyang: to complete automation, where it's the sort of thing where a non-software engineer, someone who has no technical expertise, can just speak a non-trivial feature into existence, [00:49:59]Swyx: you know, that is still, [00:50:00]Beyang: I think, several key innovations away from happening right now. And I don't think the pure transformer-based, LLM-orchestrator model of agents that is kind of dominant today is going to get us there. Yeah. [00:50:14]Swyx: What you're talking about triggered a thread I've been working on for a little bit, which is, you know, we're very much reacting to developments in models on a month-to-month basis. We had a post about We're Going to Need a Bigger Moat, which is a great Jaws reference for those who didn't catch it. I forgot all about that. How quickly models are evolving. But I think if you kind of look out, I actually caught Sam Altman on a podcast yesterday talking about GPT-10. I know. Wow. [00:50:40]Beyang: Things are accelerating. [00:50:42]Swyx: And actually there's a pretty good cadence from GPT-2, 3, and 4 that you can project out. This is based on George Hotz's concept of 20 petaflops being a human's worth of compute. GPT-4 took about 100 human-years' worth of compute to train. So that's one living person. And every generation of GPT increases two orders of magnitude. So 5 is, you know, 100 people. And if you just project it out, 9 is every human on earth and 10 is every human ever. And he thinks he'll reach there by the end of the decade. George Hotz does? No, Sam Altman. Oh, Sam Altman. Okay. [00:51:19]Beyang: Yeah. [00:51:20]Swyx: So I just like setting those high-level dots on the line. It's like the start of the curve with Moore's Law. Gordon Moore, I think, thought it would last like 10 years. Yeah. And it just kept going for another 50. Yeah. And I think we have all these data points and we're just trying to extrapolate the curve to where this goes. All I'm saying is, the agent stuff that we doubt today might be here by 2030. And I don't know how you plan when things are not possible today and you're like, it's not worth doing. But, like, you know, I mean, we're going to be here in 2030. [00:51:50]Swyx: And what do we do then? [00:51:54]Beyang: So is the question, like, you know... There's no question. [00:51:57]Swyx: It's more me sharing a comment, just because, like, at the back of my head, anytime we hear things like, things are not practical today... Yeah. I'm just like, all right, but how do we... [00:52:06]Beyang: So here's a question, maybe, like, I get the whole scaling argument.
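Spelling out the back-of-envelope arithmetic swyx is gesturing at (all of it resting on George Hotz's 20-petaflops-per-human figure and a rough factor of 100 per generation, so treat it as napkin math, not a forecast):

```latex
% Assume GPT-4 ~ one human lifetime of compute (~100 human-years at
% 20 PFLOPs per human) and ~100x more compute per GPT generation:
\mathrm{compute}(\mathrm{GPT}\text{-}n) \approx 10^{2(n-4)}\ \text{human lifetimes}
% GPT-5 ~ 10^2 (a hundred people), GPT-9 ~ 10^{10} (roughly every human
% alive), GPT-10 ~ 10^{12} (the ballpark of every human who has ever lived).
```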
I do think that there will be something like a Moore's Law for AI inference. I mean, definitely, I think at the hardware level, like GPUs, I think it gets a little fuzzier the higher you move up in the stack. But for instance, going back to the chess analogy, right? At what point do we think that, you know, GPT-X or whatever, you know, a pure transformer-based LLM, will be state of the art or outperform the best chess-playing algorithm today? Because I think that is one milestone on... Where you completely overlap search. [00:52:41]Swyx: Yeah, exactly. [00:52:42]Beyang: Because I think that would be, I mean, just to put my cards on the table, I think that would kind of disprove the thesis that I just stated, which is, you know, kind of like the pure transformer, just-scale-the-transformer approach. That would be a proof point where it's like, hey, maybe that is the right approach, versus, oh, we actually have to take a step back and think. You get what I'm saying, right? Like, is the transformer going to be, is that the end-all-be-all of architectures, and it's just a matter of scaling that? [00:53:04]Swyx: Yeah. [00:53:04]Beyang: Or are there other algorithms, and that is going to be one piece of a system of intelligence that will have to take advantage of many other algorithms and approaches? Yeah, we shall see. [00:53:14]Swyx: Maybe John Carmack will find it. Yeah. All right. Sorry for that digression. I'm just very curious. So one thing I did actually want to check in on, because we talked a little bit about code graphs and reference graphs and all that: do you actually use a graph database? No, right? No. [00:53:29]Beyang: Why would we use a graph database? [00:53:31]Steve: We use Postgres. And yeah, I saw a paper actually right after I joined Sourcegraph. There was some joint study between IBM and some other company that basically showed that Postgres was performing as well as most of the graph databases for most graph workloads. [00:53:43]Swyx: Wow. [00:53:45]Beyang: In V0 of Sourcegraph, we were like, we're building a code graph, let's use a graph database. I won't name the database, because, I mean, it was like 10 years ago, so they're probably much better now. But we basically tried to dump a non-trivially sized dataset, but also not the whole universe of code, right? Like, it was a relatively small dataset compared to what we're indexing now, [00:54:05]Swyx: into the database. [00:54:05]Beyang: And it was just, we let it run for like a week. And I think it segfaulted or something. And we were like, okay, let's try another approach. Let's just put everything in Postgres. And these days, the graph data, I mean, it's partially in Postgres. It's partially just, I mean, you could store it as flat files. [00:54:21]Swyx: Yep. [00:54:21]Beyang: I mean, at the end of the day, all the database has to do is just get me the data I want. Like, answer the queries that I need, right? Like, if all your queries are, you know, single hops... [00:54:30]Steve: Which they will be if you denormalize for your use cases. [00:54:33]Beyang: Exactly. [00:54:34]Swyx: Interesting. [00:54:34]Beyang: So yeah. [00:54:35]Swyx: So the normal form is just a bunch of files. Yeah, yeah.
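The "dumb classic tools" point in practice: if the graph is denormalized so that the queries you serve are single hops, a plain relational table with an index answers them, and no graph database is needed. The schema below is an invented illustration, not Sourcegraph's actual tables.

```typescript
// Store code-graph edges in an ordinary relational table. With an index on
// (from_symbol, kind), every single-hop query is one indexed lookup.

const schema = `
  CREATE TABLE code_edges (
    from_symbol TEXT NOT NULL,
    to_symbol   TEXT NOT NULL,
    kind        TEXT NOT NULL  -- 'calls' | 'references' | 'defines'
  );
  CREATE INDEX code_edges_lookup ON code_edges (from_symbol, kind);
`;

// A single-hop query: "what does pkg/auth.Login call?"
const query = `
  SELECT to_symbol FROM code_edges
  WHERE from_symbol = $1 AND kind = 'calls';
`;

console.log(schema, query); // in a real system these would go to a Postgres client
```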
And I don't know, like, [00:54:40]Beyang: I feel like there's a bunch of stuff like that, where it's like, if you look past the marketing and think about the actual query load, or the traffic patterns, or the end-user use cases you need to serve, just go with the tried-and-true, kind of like dumb, classic tools over the newfangled stuff. Yeah. I mean, there's a bunch of stuff like that in the search domain, too. Especially right now with, you know, embeddings and vector search and all that. But, you know, classic search techniques still go very far. And I don't know, I think in the next year or two, maybe, as we get past the peak AI hype, we'll start to see the gap emerge, or become more obvious to more people, about how many of the newfangled techniques actually work in practice and yield a better product experience day to day. Yeah. [00:55:25]Swyx: So speaking of which, like, you know, obviously there's a bunch of other people trying to build AI tooling. What can you say about your AI stack? Obviously you build a lot proprietary in-house, but, like, what approaches, you know, like, so prompt engineering, do you have a prompt engineering management tool? You know, what approaches there do you do? Pre-processing orchestration, like, do you use Airflow? Do you use something else? Like, you know, that kind of stuff. Yeah. [00:55:46]Beyang: Ours is very, like, duct-taped together at the moment. So in terms of stack, it's essentially Go and TypeScript, and now Rust. There's the code knowledge graph that we built, which is using indexers, many of which are open source, that speak the SCIP protocol. And we have the code search backend. You know, traditionally we supported regular expression search and string literal search with, like, a trigram index. And we're also building more fuzzy search on top of that now, kind of like natural language or keyword-based search on top of that. We use a variety of open source and proprietary models. We try to be pluggable with respect to different models, so we can easily swap the latest model in and out as they come online. I'm just hunting for, like, [00:56:26]Swyx: is there anything out there that you're like, these guys are really good, everyone else should check them out? So, for example, you talked about recursive summarization, which is something that LangChain and LlamaIndex do. I presume you wrote your own. Yeah, we wrote our own. [00:56:37]Beyang: I think the stuff that LlamaIndex and LangChain are doing is, like, super interesting. I think, from our point of view, it's like, we're still in the application, like, end-user use case discovery phase. And so adopting an external infrastructure or middleware kind of tool just seems like overly constraining right now. Yeah, we need full control. Yeah, we need full control, because we need to be able to iterate rapidly up and down the stack. But maybe at some point there will be a convergence, and we can actually merge some of our stuff into theirs and turn that into a common resource. In terms of other vendors that we use, I mean, obviously nothing but good things to say about Anthropic and OpenAI, which we both kind of partner with and use. Also, a plug for Fireworks as an inference platform. Their team was kind of like ex-Meta people who basically know all the bag of tricks for making inference fast. Yeah, I met Lin. [00:57:25]Swyx: So she was, like, with Soumith. She was, like, the co-manager of PyTorch for five years. Yeah, yeah, yeah.
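The "pluggable models" point, as a sketch: hide every vendor behind one small interface so that swapping the latest model in is a one-line change at the call site. The provider objects below are stubs, not real API clients.

```typescript
// One interface for every model vendor; swapping models is choosing a
// different provider object, with no other code changes.

interface CompletionProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

const claude: CompletionProvider = {
  name: "claude",
  complete: async (p) => `claude says: ${p.length} chars`, // stub, not the real SDK
};

const starcoder: CompletionProvider = {
  name: "starcoder-on-fireworks",
  complete: async (p) => `starcoder says: ${p.length} chars`, // stub
};

async function completeCode(provider: CompletionProvider, prompt: string): Promise<string> {
  return provider.complete(prompt); // callers never see vendor-specific details
}

completeCode(starcoder, "function add(").then(console.log);
```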
[00:57:31]Beyang: But, like, is their main thing [00:57:32]Swyx: that we just do the fastest inference on earth? Is that what it is, or? I think that's the pitch. [00:57:37]Beyang: And it keeps getting faster, somehow. Like, we run StarCoder on top of Fireworks, and that's made it so that we just don't have to think about building up an inference stack. And so that's great for us, because it allows us to focus more on the kind of like data fetching, the knowledge graph, and model fine-tuning, which we've also invested a bit in. [00:57:55]Swyx: That's right. [00:57:55]Steve: We've got multiple AI workstreams in progress now, because we hired a head of AI, finally. We spent close to a year, actually. I think I talked to probably 75 candidates. And the guy we hired, Rishabh, is absolutely world-class. And he immediately started multiple workstreams, including, he's fine-tuned StarCoder already. He's got a prompt engineering workstream. He's got an embeddings workstream. He's got evaluation and experimentation. Benchmarking, wouldn't it be nice if Cody was on Hugging Face with a benchmark, where anybody could say, well, we'll run against the benchmark, or we'll make our own benchmark if we don't like yours. But we'd be forcing people into those sorts of quantitative comparisons. And that's all happening under the AI program that he's building for us. [00:58:35]Swyx: I should mention, by the way, I've heard that there's a V2 of StarCoder coming out. So you guys should talk to Hugging Face. Cool. Awesome. Great. I actually visited their offices in Paris, which is where I heard it. That's awesome. [00:58:47]Steve: Can you guys believe how amazing it is that the open source models are competitive with GPT and Anthropic? I mean, it's nuts, right? I mean, that one Googler that was predicting that open source would catch up. At least he was right for completions. [00:59:03]Beyang: Yeah. I mean, for completions, open source is state-of-the-art. [00:59:06]Swyx: You were on OpenAI, then you went to Claude, and now you've ripped it out. Yeah. Yeah, for completions. [00:59:10]Beyang: I mean, we still use Claude and GPT-4 for chat and also commands. Like, the ecosystem is going to continue to evolve. We obviously love the open source ecosystem, and, like, huge shout-out to Hugging Face. And also Meta research. We love the work that they're doing and kind of driving the ecosystem forward. [00:59:26]Swyx: Yeah, you didn't mention Code Llama. [00:59:27]Beyang: We're not using Code Llama currently. It's always kind of like a constant evaluation process. So, like, I don't want to come out and say, hey, this model is the best because we chose it. Basically, we did a bunch of tests for the sorts of contexts that we're fetching now, and given the way that our prompts are constructed now. And at the end of the day, it was a judgment call. Like, StarCoder seemed to work the best, and that's why we adopted it. But it's sort of like a continual process of revisitation. Like, if someone comes up with a neat new context fetching mechanism, and we have a couple coming online soon, then it's always like, okay, let's try that against the array of models that are available and see how this moves the needle across that set. [01:00:01]Swyx: Yeah. What do you wish someone else built? This is a request for startups. [01:00:04]Beyang: I mean, if someone could just provide a very nice, clean dataset of both naturally occurring and synthetic code data. [01:00:15]Steve: Yeah.
Could someone please give us their data moat? [01:00:17]Swyx: Well, not even the data moat. [01:00:19]Beyang: It's just, like, I feel like most models today, they still use, like, a combination of, like, The Stack and The Pile as their training corpus. But you can only stretch that so far. At some point, you need more data. And I think there's still more alpha in synthetic data. Like, we have a couple of efforts where we think fine-tuning some models on specific coding tasks will yield more kind of, like, reliable code generation, of the sort where it's reliable enough that we can fully automate it, at least, like, the one-hop thing. And synthetic data is playing a part of that. But, I mean, if there were, like, a synthetic data provider, I don't think you could construct a provider that has access to, like, some proprietary code base. Like, no company in the world would be able to, like, sell that to you. But anyone who's just providing clean datasets off of the publicly available data, that would be nice. I don't know if there's a business around that, but, like, that's something that we'd definitely, like, [01:01:09]Swyx: love to use. [01:01:09]Beyang: Oh, for sure. [01:01:10]Steve: My God. I mean, but that's also, like, the secret weapon, right, for any AI, you know, is the data that you've curated. So I doubt people are going to share it. But, oh, we'll see, you know. But we can maybe contribute, you know, if we want to have a benchmark of our own. [01:01:25]Swyx: Yeah. I would say, like, that would be the bull case for Repl.it, that, like, you want to be a coding platform where you also offer bounties. Like, then you eventually bootstrap your own proprietary set of coding data. I don't think they'll ever share it. The rumor is, this is from nobody at Repl.it that I'm hearing, but, like, they're just not leveraging that actively. Like, they're actually just betting on OpenAI to do a lot of that, which, banking on OpenAI, you know, has been a winning strategy so far. [01:01:50]Beyang: Yeah, they're definitely great at executing. [01:01:55]Steve: Executing their CEO. [01:01:56]Swyx: And then bringing him back in four days. Yeah. [01:02:01]Steve: That was a whole, like... [01:02:03]Swyx: The whole company was, like, just obsessed by the drama. Like, we were unable to work. I just walked in after it happened, and this whole room in the new office was just, like, everyone staring at their phones. [01:02:12]Beyang: Yeah, it was a bit difficult to ignore. I mean, it would have real implications for us, too, because, like, we're using them. And so there was a very real question of, like, do we have to, like, move quickly? [01:02:21]Swyx: Yeah, Microsoft. Like, you just move to Microsoft, right? [01:02:23]Beyang: Yeah, I mean, that would have been, like, the break-glass plan. If the worst case had played out, then I think we'd have a lot of customers, you know, the day after, being like, how can you guarantee the reliability of your services if the company itself isn't stable? But I'm really happy they got things sorted out and things are stable now, because they build really cool stuff and we love using their tech. [01:02:43]Swyx: Yeah, awesome. [01:02:44]Alessio: So we kind of went through everything, right? Sourcegraph, Cody, why agents don't work, why inline completion is better, all of these things. How does that bubble up to who manages the people, right? Because as an engineering manager, I didn't write much code.
I was mostly helping people write their own code, you know, so even if you have the best inline completion, it doesn't help me do my job. [01:03:08]Swyx: Yeah. [01:03:08]Alessio: What's kind of the future of Sourcegraph in the engineering org? [01:03:13]Beyang: That's a really interesting question. And I think it sort of gets at this issue, which is, basically, every AI dev tools creator or producer these days, I think us included, we're kind of focusing on the wrong problem in a way. Because the real problem of modern software development, I think, is not how quickly you can write more lines of code. It's really about managing the emergent complexity of codebases as they evolve and grow, and how to make efficient development tractable again. Because the bulk of your time becomes more about understanding how the system works and how the pieces fit together currently, so that you can update it in a way that gets you your added functionality, doesn't break anything, and doesn't introduce a lot of additional complexity that will slow you down in the future. And if anything, the inner-loop developer tools that are all about generating lines of code, yes, they help you get your feature done faster. They generate a lot of boilerplate for you. But they might make this problem of managing large, complex codebases more challenging, just because instead of having, like, a pistol, you'll have a machine gun in terms of being able to write code. And there's going to be a bunch of natural-language-prompted code that is generated in the future that was produced by someone who doesn't even have an understanding of source code. And so, how are you going to verify the quality of that and make sure it not only checks the kind of like low-level boxes, but also fits architecturally into your codebase in a way that's sensible? And so I think, as we look forward to the next year, we have a lot of ideas around how to make codebases, as they evolve, more understandable and manageable to the people who really care about the codebase as a whole. You know, tech leads, engineering leaders, folks like that. It is kind of like a return to our ultimate mission at Sourcegraph, which is to make code accessible to all. It's not really just about, you know, enabling people to write code. And if anything, the original version of Sourcegraph was a rejection of, like, hey, let's stop trying to build the next best editor, because there's already enough people doing that. The real problem that we were facing, I mean, Quinn, myself, and you, Steve, at Google, was like, how do we make sense of the code that exists, so that we can understand enough to know what code needs to be written? Mm-hmm. [01:05:25]Steve: Yeah. Well, I'll tell you what customers want, right? And what they're going to get. What they want is for Cody to have a monitor for developer productivity. And any developer who falls below a threshold, a button lights up where the admin can fire them. Or Cody will even press that button for you as time passes. But I'm kind of only half tongue-in-cheek here. We've got some prospects who are kind of, like, sniffing down that avenue. And we're like, no. But what they're going to get is a much greater whole-codebase understanding, which is actually something that Cody is, I would argue, the best at today in the coding assistant space, right? Because of our search engine and the techniques that we're using.
And that whole-codebase understanding is so important, you know, for any sort of a manager who just wants to get a feel for the architecture, or potential security vulnerabilities, or whether, you know, people are writing code that's well-tested, and et cetera, et cetera, right? And solving that problem is tricky, right? This is not the developer inner loop or outer loop. It's like the manager inner loop? [01:06:21]Swyx: No, outer loop. [01:06:21]Steve: The manager inner loop is staring at your belly button, I guess. So in any case... [01:06:27]Beyang: Waiting for the next Slack message to arrive? [01:06:29]Steve: Yes. What they really want is a batch mode for these assistants, where you can actually take the coding assistant and shove its face into your code base, you know, and six billion lines of code later, right, it's told you all the security vulnerabilities. That's what they really actually want. It's an insanely expensive proposition, right? You know, just the GPU costs, especially if you're doing it on a regular basis. So it's better to do it at the point the code enters the system. And so now we're starting to get into developer outer loop stuff. And I think that's where a lot of the... To your question, right? A lot of the admins and managers and, you know, the decision makers, anybody who just kind of isn't coding [01:07:03]Swyx: but is involved, [01:07:03]Steve: they're going to have a set of tools, right? [01:07:06]Swyx: And a set of... [01:07:06]Steve: Just like with Code Search today. Our Code Search actually serves that audience as well. The CIO types, right? Because they're just like, oh, hey, I want to see how we do, you know, SAML auth. And they use our search engine and they go find it. And AI is just going to make that so much easier for them. [01:07:20]Swyx: Yeah, this is my perfect place to put my anecdote of how I used Cody yesterday. I was actually trying to build this sort of Twitter scraper thing. And Twitter is notoriously very challenging to work with, because they don't want to work with you, with anyone. There's a repo that I wanted to inspect. It was really big, and it had the Twitter scraper thing in it. And I pulled it into Copilot, didn't work. But then I noticed that on your landing page, you had a web version. Like, I typically think of Cody as a VS Code extension, but you have a web version where you just plug in any repo in there and just talk to it. And that's what I used to figure it out. So yeah. [01:07:54]Steve: Wow, Cody web is wild. [01:07:57]Beyang: Yeah, I mean, we've done a very poor job of making the existence of that feature known. It's not easy to find. [01:08:02]Swyx: It's not easy to find. You have to go through the search thing. It's like, oh, this is old Sourcegraph. You don't want to look at old Sourcegraph. I mean, you can use Sourcegraph with all the AI stuff. Old Sourcegraph has the AI stuff, and it's Cody web. Yeah, yeah. [01:08:13]Beyang: There's a little Ask Cody button that's hidden in the upper right-hand corner. We should make that more visible. It's definitely one of those aha moments when you can ask a question of Cody. Of any repo, right? [01:08:22]Swyx: Because you already indexed it. Well, you didn't embed it, but you indexed it. Yeah. [01:08:26]Beyang: And there's actually some use cases that have emerged among power users where they kind of do... You're familiar with v0.dev? You can kind of replicate that, but for arbitrary frameworks and libraries, with Cody web.
Because there's also an equally hidden toggle, which you may not have discovered yet, where you can actually tag in multiple repositories as context. [01:08:44]Swyx: Yeah. [01:08:44]Beyang: And so you can do things like, we have a demo path where it's like, okay, let's say you want to build a stock ticker [01:08:50]Swyx: that's React-based, [01:08:50]Beyang: but uses this one tick data fetching API. It's like, you tag both repositories in, you ask it, it's like two sentences, like, build a stock ticker app that tracks the tick data of Bank of America and Wells Fargo over the past week, and then it generates the code. You can paste that in, and it just works magically. We'll probably invest in that more, just because the wow factor of that is just pretty incredible. It's like, what if you could speak apps into existence that use the frameworks and packages that you want to use? Yeah. [01:09:19]Swyx: It's not even fine-tuning. It's just taking advantage of your RAG pipeline. [01:09:22]Beyang: Yeah. It's just RAG. RAG is all you need for many things. [01:09:25]Steve: It's not just RAG. It's good RAG, right? RAG's good. Not a fallback. [01:09:33]Swyx: Yeah. [01:09:33]Beyang: But I guess getting back to the original question, I think there are a couple of things I think would be interesting for engineering leaders. One is the use case that you called out, is all the stuff that you currently don't do that you really ought to be doing, with respect to ensuring code quality, or updating dependencies, or keeping things up to date. The things that humans find toilsome and tedious and just don't want to do, but would really help up-level the quality, security, and robustness of your code base. Now we potentially have a way to do that with machines. I think there's also this other thing, and this gets back to the point of, how do you measure developer productivity? It's the perennial age-old question. Every CFO in the world would love to do it in the same way that you can measure marketing or sales or other parts of the organization. And I think, what is the actual way you would do this that is good? If you had all the time in the world, I think, as an engineering manager or an engineering leader, what you would do is you would go read through the Git log, maybe line by line, and be like, you, Sean, these are the features that you built over the past six months or a year. These are the things that you delivered that you helped drive. Here's the stuff that you did to help your teammates. Here are the reviews that you did that helped ensure that we maintain a coherent and high-quality code base. Now connect that to the things that matter to the business. What were we trying to drive with this? Was it engagement? Was it revenue? Was it adoption of some new product line? And really weave that story together. The work that you did had this impact on the metrics that moved the needle for the business and ultimately showed up in revenue or stock price or whatever it is that's at the very top of any for-profit organization. And you could, in theory, do all that today if you had all the time in the world. [01:11:22]Swyx: Yeah. [01:11:22]Beyang: But as an engineering leader... You're too busy building. Yeah, you're too busy building, you're too busy with a bunch of other stuff. Plus, it's also tedious. Reading through a Git log and trying to understand what a change does and summarizing that, it's not the most exciting work in the world.
But with the benefit of AI, I think you could conceive of a system that actually does a lot of the tedium and helps you actually tell that story. And I think that is maybe the ultimate answer to how we get at developer productivity in a way that a CFO would be like, okay, I can buy that. The work that you did impacted these core metrics, because these features were tied to those, and therefore we can afford to invest more in this part of the organization. And that's what we really want to drive towards. That's what we've been trying to build all along, in a way, with Sourcegraph. It's this kind of codebase-level understanding, and the availability of LLMs and AI now just puts that much closer within reach, I think. [01:12:14]Swyx: Yeah. [01:12:15]Steve: But I mean, we have to focus also, small company, our short-term focus is lovability, right? [01:12:21]Swyx: Yeah. [01:12:21]Steve: We absolutely have to make Cody something everybody wants, right? [01:12:25]Swyx: Absolutely. [01:12:26]Steve: Sourcegraph is all about enabling non-engineering roles, decision makers and so on, as Beyang says. I mean, I think there's just a lot of opportunity there once we've built a lovable Cody. [01:12:37]Swyx: Awesome. [01:12:37]Alessio: Do we want to jump into the lightning round? [01:12:40]Swyx: Lightning round. [01:12:40]Alessio: Okay. [01:12:41]Swyx: So we usually have three, [01:12:42]Alessio: one around acceleration, one around exploration, and then a final takeaway. So the acceleration one is: what's something that already happened in AI that is possible today that you thought would take much longer? [01:12:54]Beyang: I mean, just LLMs and how good the vision models are now. Like, I got my start... Okay. [01:13:00]Swyx: Yeah. [01:13:00]Beyang: Back in the day, I got my start in machine learning in computer vision, circa like 2009, 2010. [01:13:07]Swyx: And in those days, [01:13:07]Beyang: everything was statistically based. Neural nets had not yet made their comeback. And so nothing really worked. And so I was very bearish, after that experience, on the future of computer vision. But, man, the progress that's been made just in the past three, four years has been absolutely astounding. Came up faster than I expected it to. Yeah. [01:13:27]Steve: Multimodal in general, [01:13:28]Swyx: I think is, [01:13:28]Steve: I think there's a lot more capability there that we're not tapping into. Potentially even in the coding assistant space. You know, honestly, I think that the form factor that coding assistants have today is probably not the steady state that we'll see long-term. You'll always have completions, and you'll always have chat and commands and so on. But I think we're going to discover a lot more. And I think multimodal potentially opens up some kind of new ways to, you know, get your stuff done. So yeah, I think the capabilities are there today. And it's just shocking. I mean, I still am astonished when I sit down, you know, and I have a conversation with the LLM, with the context, and it's like I'm talking to a, you know, a senior engineer or an architect or somebody, right? I think that people have very different working models with these assistants today. You know, some people are just completion, completion, completion, that's it. And if they want some code generated, they write a comment telling it what to do, you know what I mean? But I truly think that there are other modalities that we're going to stumble across.
Steve: But I mean, we have to focus also, small company, our short-term focus is lovability, right? [01:12:21]Swyx: Yeah. [01:12:21]Steve: We absolutely have to make Cody something that everybody wants, right? [01:12:25]Swyx: Absolutely. [01:12:26]Steve: Sourcegraph is all about enabling non-engineering roles, decision makers and so on. As Beyang says, I mean, I think there's just a lot of opportunity there once we've built a lovable Cody. [01:12:37]Swyx: Awesome. [01:12:37]Alessio: We want to jump into lightning round? [01:12:40]Swyx: Lightning round. [01:12:40]Alessio: Okay. [01:12:41]Swyx: So we usually have three, [01:12:42]Alessio: one around acceleration, exploration, and then a final takeaway. So the acceleration one is: what's something that already happened in AI that is possible today that you thought would take much longer? [01:12:54]Beyang: I mean, just LLMs and how good the vision models are now. [01:13:00]Swyx: Yeah. [01:13:00]Beyang: Back in the day, I got my start in machine learning in computer vision, circa like 2009, 2010. [01:13:07]Swyx: And in those days, [01:13:07]Beyang: everything was statistical based. Neural nets had not yet made their comeback. And so nothing really worked. And so I was very bearish after that experience on the future of computer vision. But man, the progress that's been made just in the past three, four years has been absolutely astounding. It came up faster than I expected it to. Yeah. [01:13:27]Steve: Multimodal in general, [01:13:28]Swyx: I think is, [01:13:28]Steve: I think there's a lot more capability there that we're not tapping into. Potentially even in the coding assistant space. Honestly, I think that the form factor that coding assistants have today is probably not the steady state that we're seeing long-term. You'll always have completions and you'll always have chat and commands and so on, but I think we're going to discover a lot more, and I think multimodal potentially opens up some new ways to get your stuff done. So yeah, I think the capabilities are there today, and it's just shocking. I mean, I still am astonished when I sit down and have a conversation with the LLM, with the context, and it's like I'm talking to a senior engineer or an architect or somebody, right? I think that people have very different working models with these assistants today. Some people are just completion, completion, completion, that's it. And if they want some code generated, they write a comment telling it what to do. But I truly think that there are other modalities that we're going to stumble across, just kind of latently, inherently built into the LLMs today, that we just haven't found yet. They're more of a discovery than an invention, you know? [01:14:31]Swyx: Like other usage patterns? [01:14:34]Steve: Absolutely. I mean, the one that we talked about earlier, nonstop coding is one, right? Where you could just kick off a whole bunch of requests to refactor and so on. But there could be any number of others. We talk about agents, that's kind of out there. But I think there are more inner loop type ones to be found. And we haven't looked at all at multimodal yet. [01:14:52]Swyx: Yeah, for sure. There are two that come to mind, just off the top of my head. One, which is effectively architecture diagrams and entity relationship diagrams. There's probably more alpha in synthesizing them for management to see. Ooh, yeah. Which is like, you don't need AI for that. You can just use your reference graph. Yeah. But then also doing it the other way around, when someone draws stuff on a whiteboard and it actually generates code. [01:15:14]Steve: Well, you can generate the diagram and then explanations as well. [01:15:18]Swyx: Yeah. And then the other one is, there was a demo that went pretty viral two, three weeks ago about how someone just had an always-on script, just screenshotting and sending it to GPT Vision on some kind of time interval. And it would just autonomously suggest stuff. Yeah. So no trigger, just watching your screen and being a real co-pilot, rather than having you initiate with a chat. Yeah. [01:15:39]Beyang: It's like the return of Clippy, right? But actually good. [01:15:42]Swyx: The reason I know this is that we actually did a hackathon where we wrote that project, but it roasted you while you did it. So it's like, hey, you're on Twitter right now. You should be coding. Yeah. That can be a fun co-pilot thing as well. Yeah, yeah. Okay.
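That demo pattern is simple enough to sketch. This assumes the mss and openai packages, an OPENAI_API_KEY in the environment, and the vision model name of the time; the prompt is illustrative, not any particular product's setup:

```python
# Sketch of the always-on screen co-pilot demo: screenshot on a timer, send
# the frame to a vision model, print its suggestion. Assumes the mss and
# openai packages plus an OPENAI_API_KEY; model name and prompt are
# illustrative.
import base64
import time

import mss
import mss.tools
from openai import OpenAI

client = OpenAI()

while True:
    with mss.mss() as screen:
        shot = screen.grab(screen.monitors[1])          # primary monitor
        png = mss.tools.to_png(shot.rgb, shot.size)     # raw PNG bytes
    b64 = base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",                   # vision model of the era
        max_tokens=200,
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "Act as a proactive pair programmer. Suggest one "
                     "improvement for whatever is on this screen."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    print(resp.choices[0].message.content)
    time.sleep(60)                                      # no trigger, just a timer
```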
[01:16:01]Swyx: So I'll jump on. Exploration. What do you think is the most interesting unsolved question in AI? [01:16:01]Steve: I mean, I think it used to be scaling, right? With CNNs and RNNs, and Transformers solved that. Yeah. So what's the next big hurdle? It's keeping GPT-10 from emerging. [01:16:09]Beyang: I mean, do you mean that like- Oh, is this like a safetyist argument? I feel like, do you mean like the pure model, like AI layer, or- [01:16:17]Swyx: No, it doesn't have to be. [01:16:18]Beyang: For me personally, it's like, how do you get reliable, first-try working code generation? Even the single hop, like write a function that does this. Because I think if you want to get to the point where you can actually be truly agentic or multi-step automated, a necessary part of that is that the single step has to be robust and reliable. And so I think that's the problem that we're focused on solving right now. Because once you have that, it's a building block that you can then compose into longer chains. [01:16:47]Alessio: And just to wrap things up, what's one message takeaway that you want people to remember and think about? [01:16:55]Beyang: I mean, I think for me, it's just that the best dev tools in the future are going to have to leverage many different forms of intelligence. You know, calling back to that Normsky architecture, trying to make it catch on. [01:17:06]Swyx: You should have called it something cool, like S star or R star. [01:17:09]Beyang: Yes, yes, yes. [01:17:10]Swyx: Just one letter, and then just let people speculate. Yeah, yeah. What could he mean? [01:17:14]Beyang: I don't know. In terms of trying to describe what we're building, we try to be a little bit more down to earth and straightforward. And I think Normsky kind of encapsulates the two big technology areas that we're investing in that we think will be very important for producing really good dev tools. And I think it's a big differentiator that we view that Cody has right now. [01:17:35]Steve: Yeah, and mine would be, I know for a fact that not all developers today are using coding assistants. Yeah, and that's probably because they tried it and it didn't immediately write a bunch of beautiful code for them, and they were like, oh, too much effort, and they left, right? Well, my big takeaway from this talk would be: if you're one of those engineers, you better start planning another career, okay? Because this stuff is the future, and honestly, it takes some effort to actually make coding assistants work today, right? Just like talking to GPT, they'll give you the runaround, just like doing a Google search sometimes. But if you're not putting that effort in and learning the sort of footprint and the characteristics of how LLMs behave under different query conditions and so on, if you're not getting a feel for the coding assistant, then you're letting this whole train just pull out of the station and leave you behind. [01:18:26]Swyx: Yeah, absolutely. [01:18:28]Alessio: Yeah, thank you guys so much for coming on and being the first guests in the new studio. [01:18:32]Swyx: Our pleasure. [01:18:34] Get full access to Latent.Space at www.latent.space/subscribe
-
The Busy Person's Intro to Finetuning & Open Source AI - Wing Lian, Axolotl
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2023-12-08 16:16
The Latent Space crew will be at NeurIPS on Tuesday! Reach out with any parties and papers of interest. We have also been incubating a smol daily AI Newsletter and Latent Space University is making progress.Good open models like Llama 2 and Mistral 7B (whose creators have just released an 8x7B MoE model) have enabled their own sub-industry of finetuned variants for a myriad of reasons:* Ownership & Control - you take responsibility for serving the models* Privacy - not having to send data to a third party vendor* Customization - improving some attribute (censorship, multiturn chat and chain of thought, roleplaying) or benchmark performance (without cheating)Related to improving benchmark performance is the ability to use smaller (7B, 13B) models by matching the performance of larger models, which brings both cost and inference latency benefits.Core to all this work is finetuning, and the emergent finetuning library of choice has been Wing Lian’s Axolotl.AxolotlAxolotl is an LLM fine-tuner supporting SotA techniques and optimizations for a variety of common model architectures:It is used by many of the leading open source models:* Teknium: OpenHermes, Trismegistus, CollectiveCognition* OpenOrca: Mistral-OpenOrca, Mistral-SlimOrca* Nous Research: Puffin, Capybara, NousHermes* Pygmalion: Mythalion, Pygmalion* Eric Hartford: Dolphin, Samantha* DiscoResearch: DiscoLM 120B & 70B* OpenAccess AI Collective: Manticore, Minotaur, Jackalope, HippogriffAs finetuning is very formatting dependent, it also provides prompt interfaces and formatters between a range of popular model formats, from Stanford’s Alpaca and Steven Tey’s ShareGPT (which led to Vicuna) to the more NSFW Pygmalion community.Nous Research MeetupWe last talked about Nous at the DevDay Recap at the e/acc “banger rave”. We met Wing at the Nous Research meetup at the a16z offices in San Francisco, where they officially announced their company and future plans:Including Nous Forge:Show NotesWe’ve already covered the nuances of Dataset Contamination and the problems with “Open Source” in AI, so we won’t rehash those topics here, but do read/listen to those if you missed them.* Axolotl GitHub and Discord* The Flan paper and dataset* StackLlama model and blogpost* Multipack paper* Our episode with Tri Dao* Mamba state space models - Tri Dao and Albert GuTimestamps* [00:00:00] Introducing Wing* [00:02:34] SF Open Source AI Meetup* [00:04:09] What is Axolotl?* [00:08:01] What is finetuning?* [00:08:52] Open Source Model Zoo* [00:10:53] Benchmarks and Contamination* [00:14:29] The Case for Open Source AI* [00:17:34] Orca and OpenOrca* [00:23:36] DiscoLM and Model Stacking* [00:25:07] Datasets and Evals over Models* [00:29:15] Distilling from GPT4* [00:33:31] Finetuning - LoRA, QLoRA, ReLoRA, GPTQ* [00:41:55] Axolotl vs HF Transformers* [00:48:00] 20x efficiency with StackLlama and Multipack* [00:54:47] Tri Dao and Mamba* [00:59:08] Roadmap for Axolotl* [01:01:20] The Open Source AI CommunityTranscript[00:00:00] Introducing Wing Lian[00:00:00] [00:00:00] swyx: Welcome to Latent Space, a special edition with Wing Lian, but also with our new guest host, Alex. Hello, hello. Welcome, welcome. Again, needs no introduction. I think it's like your sixth time on Latent Space already. I think so, yeah. And welcome, Wing. We just met, but you've been very prolific online. Thanks for having me.[00:00:30] Yeah. So you are in town. You're not local. You're in town. You're from Minneapolis?[00:00:35] Wing Lian: Annapolis. Annapolis.
It's funny because a lot of people think it's Indianapolis, or they've got Minneapolis, but I used to live out in the San Francisco Bay Area years ago, from like 2008 to 2014, so it's fairly familiar here.[00:00:50] swyx: Yep. You're the maintainer of Axolotl now, which we'll get into. You're very, very prolific in the open source AI community, and you're also the founder of the OpenAccess AI Collective. Yeah. Cool. Awesome. Maybe we can go over a little bit of your background in tech and then coming into AI, and then we'll cover what[00:01:06] Wing Lian: happens and why you're here.[00:01:08] Yeah. So, back on tech, I started years ago, way back when I was scraping apartment websites for listings and then building SEO optimized pages and then just throwing Google AdSense on it.[00:01:24] And that got me through college, basically. Is[00:01:27] swyx: that decent money? And what year[00:01:28] Wing Lian: was this? Like 2004, 2005. Yeah, that's decent money. It's like a thousand bucks a month. But as a college student, that's like gravy. Really good money, right? And then there was just too much competition, so it just sort of died off. I was writing stuff in Perl back then, and nobody hosts anything on Perl anymore, right? I still did a little bit more computer tech support, and then software and web more professionally.[00:01:54] So I spent some time working on applications in the blood industry. I came out to San Francisco for, I was at SGN, Social Gaming Network, a startup. They started with Facebook apps, and then they pivoted into doing mobile apps. And then, from there,[00:02:14] I've been at quite a few more startups since then, and in the last few years I've been in the music space. So I was at United Masters for a while, and then the past year I've been at SoundCloud, but not doing that anymore, and now that I have a lot more time, it's just like, all right, we're going full bore on Axolotl and we're gonna crush AI. So yeah,[00:02:34] SF Open Source AI Meetup[00:02:34] swyx: totally. You, so you're here in town for the open source AI meetup that we had yesterday. Yep, yeah, that was amazing. Yeah, it was a big collection. Ollama, Nous Research, Alignment Lab. Anyone else that I missed? I mean, Jeremy Howard is his own thing.[00:02:47] Yeah.[00:02:49] And Alex, you were also there. You love to bring SF to the world. Your takes?[00:02:55] Alex Volkov: It's incredible that we recorded a ThursdAI episode after that one. And LDJ, who usually co-hosts ThursdAI, just briefly mentioned, oh yeah, I talked about it.[00:03:04] Like, I saw Karpathy, and then I talked to Jeremy Howard, and the guy from Mistral came in, and it's like, he's talking about all these titans of industry, basically, that outside of SF you just don't meet casually hanging out in the same space. You can't pull somebody. He ran into Guillaume Lample from Mistral, he ran into him while drinking water.[00:03:20] He didn't even know he was there. It's just, that type of stuff is really hard to find outside of SF. So, absolutely, absolutely great. And also, presentations from Alignment Lab, presentations from Nous Research, new stuff talked about, Forge, and some of[00:03:33] swyx: the other stuff they announced. We can say now they're officially a company.[00:03:36] I met Teknium.[00:03:37] He[00:03:37] Alex Volkov: came over here. He didn't want to get recorded.
But maybe.[00:03:41] Wing Lian: We'll wear him down at some point. Yeah, I'm excited for Forge. They've positioned it as this agentic sort of framework where you just drag and drop things and fill in text with where you want to inject different variables, and it opens up all of these potentials for data pipelines now, right?[00:03:56] And using your own local LLMs and not relying on GPT-4 or anything like that. Yeah, yeah,[00:04:02] swyx: good stuff. Okay, so let's maybe go into the Axolotl origin story, and then we have some intro or background.[00:04:09] What is Axolotl?[00:04:09] swyx: To do on like the open source model universe and also on fine tuning, but maybe just, since you're talking about your personal journey, what was your personal journey into[00:04:18] Wing Lian: Axolotl?[00:04:19] Yeah, so my personal journey started back in mid March, completely unrelated to AI and Axolotl. And it really started, I fell while skiing and torqued my knee. Grade 3 MCL sprain. And being sort of an active person that could no longer be active, because I couldn't play soccer, because that requires having working knees until it's healed,[00:04:42] I decided I needed to find something to do to take up my free time. And that became, well, let's learn how to train these language models. It was everywhere. So I was like, all right, I'm just going to sit down and learn. [00:05:00] I think I was using Alpaca-LoRA, because I think the Alpaca paper had just come out then. So I was using the Alpaca-LoRA repo and sort of learning how to use it. None of us were GPU rich back then, and most of us are still GPU poor, but I was doing, what was it, like 4-bit Alpaca-LoRA, there was like a 4-bit version where we were doing quantization, or 8, no, 8-bit quantizations. And then I think they released QLoRA a little bit later, and right before QLoRA came out, I was already starting to do fine tunes, but having this need to mix data sets together. And if you've ever looked at all the various different datasets available on HuggingFace, they all have various different prompt formats, and it's sort of a nightmare. And then I think the other piece is, if you've ever tried to fine tune, at least back then, probably the ecosystem's a little better now,[00:05:54] everybody required that you put your hyperparameters as command line arguments. And so it's always like, well, I now have to go copy and paste my previous thing and change things out. And I really wanted it to be in a YAML file because it was more portable and reproducible.[00:06:09] So I was doing that, and then the QLoRA paper came out. Tim Dettmers announced that, and then somebody looked it up for me yesterday, and it's like, between that announcement it took us seven days to get that integrated into Axolotl, right? Which is like, I wouldn't say it's really fast, but in a manner that is in a reusable framework, I think it was quite the accomplishment then.[00:06:33] And so we started picking up traction with people there. And then it's just been building models, and then just iterating on what my needs are. So, yeah.[00:06:44] swyx: Excellent. Yeah.
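To make that concrete, here is a sketch of the YAML-driven workflow; the keys below are illustrative rather than Axolotl's exact schema, so check its README for the real config format:

```python
# Sketch of the YAML-config workflow Wing describes: one portable file holds
# every hyperparameter instead of a pile of CLI flags. These keys are
# illustrative, not Axolotl's exact schema.
import yaml

CONFIG = """
base_model: mistralai/Mistral-7B-v0.1
adapter: qlora            # lora / qlora / full fine-tune
lora_r: 32
lora_alpha: 16
micro_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 2.0e-4
num_epochs: 3
sequence_len: 4096
datasets:
  - path: my/dataset
    type: alpaca          # which prompt format to apply
"""

cfg = yaml.safe_load(CONFIG)
print(cfg["base_model"], cfg["adapter"], cfg["learning_rate"])
# Reproducing a run becomes "point the trainer at this file" rather than
# reconstructing a command line from shell history.
```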
[00:06:49] Alex Volkov: I want to ask, for folks who are listening who have never heard of Axolotl, how do you describe it? How do you summarize this for folks who maybe haven't fine tuned anything, who know open source LLMs exist, who maybe know like LLaMA? What's Axolotl for somebody who doesn't know, who's never heard of dataset curation or creation before?[00:07:01] Wing Lian: We sort of have to take a step back and understand that, when you've got these language models, you have what I think most people refer to as base models, also known as foundational models, right?[00:07:15] Where some benefactor, whether it's Meta or Mistral or whoever, has gone and spent all this money to train these models on huge corpuses of text, right? And these corpuses, they're generally good across lots of different things, but they're really good at just talking on and on and on, and they're not good at following instructions or having chats or anything like that.[00:07:40] So, when you think about fine tuning, it's like saying, all right, we have this really good generalized text completion thing, and I want to turn it into something that I can talk to or have follow instructions. So, I think fine tuning is probably best defined like that.[00:07:58] swyx: Okay, got it.[00:07:59] And we actually[00:08:01] What is finetuning?[00:08:01] swyx: do want to make sure that we have an overall introduction to fine tuning for people, because again, we're trying to make sure that we bring everyone along in this journey. We already went into LoRAs and QLoRAs without explaining what[00:08:12] Wing Lian: they are. Oh yes, yes, sorry.[00:08:14] swyx: And so I will put things in my words and you can correct me. I'll be the village idiot here.[00:08:21] So, fine tuning is basically grabbing an open source model off the shelf, and then doing further training on it with a custom dataset of your own. Primarily, people use it, think about it as fine tuning for JSON output, or fine tuning for a style of response. Let's say you wanted it to tell jokes, or be funny, or be short, or whatever.[00:08:43] The open source AI community has really fine tuned in all sorts of different manners. Let's go over those things now, and then we'll talk about fine tuning methods.
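Since formatting matters so much here, a small concrete example: rendering one Alpaca-style record into the prompt string the trainer actually sees (the template wording follows the published Stanford Alpaca repo):

```python
# Rendering one Alpaca-style record into a training prompt. The instruction
# template below follows the published Stanford Alpaca repo.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

record = {
    "instruction": "Summarize the text.",
    "input": "Axolotl is a library for fine-tuning open source language models.",
    "output": "Axolotl fine-tunes open LLMs.",
}
print(ALPACA_TEMPLATE.format(**record))
# ShareGPT-style datasets use multi-turn chat structures instead, which is
# why a formatter layer between dataset and trainer matters so much.
```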
[00:08:52] Open Source Model Zoo[00:08:52] swyx: So there's a universe of people who fine tune stuff. Yesterday in your slides, you had, I'll just list some of these and then we'll maybe go through some of them, right?[00:08:59] So Teknium is personally leading OpenHermes, which is, I think, the sort of premier model out of the Nous community. There's OpenOrca, which you had a hand in. Nous, Nous Research itself also has Capybara and Puffin and all the others. There's Pygmalion, which I've never messed with.[00:09:14] Eric Hartford, I am aware of his Uncensored models and his Samantha models. DiscoResearch with DiscoLM. And then you personally have done Manticore, Minotaur, Jackalope, and Hippogriff. What should people know about all these names? Being part of AI Twitter is seeing all these things and going, dude, I'm being DDoS'ed by all these things and I don't know how different they are.[00:09:32] What should people know? Yeah, so[00:09:34] Wing Lian: I think on a lot of these models, generally, we like to think of those as sort of general models. So if you think about it, what is GPT-4, what is ChatGPT? It's a good general model, and then one of the services I think that OpenAI offers is these fine tunings where you're a business and you have very specific business use cases and you might fine tune for that use case.[00:10:00] All of these models are really just general use case models that you can then go and maybe fine tune another LoRA over for your use cases, but they tend to be good. With good being relative, it's open source. Open source AI is still sort of in its infancy. So, good is, it's pretty reasonable.[00:10:18] It's probably still better than most high schoolers at answering questions and being able to figure things out, and reasoning skills and math and those sorts of things, right?[00:10:27] swyx: And also as measured on the Hugging[00:10:29] Wing Lian: Face leaderboard. Yes, well, that's like a whole other discussion, right? There's a whole other group of people who, and I mostly agree with them, say that benchmarks are pretty bogus these days. LMSys, I think they published something recently where, even if you think the dataset's not contaminated, you can go and find contamination. And maybe we should step back and say what contamination is, right?[00:10:53] Benchmarks and Contamination[00:10:53] Wing Lian: So when you go and do these benchmarks, there's a specific data set where there are these questions, and usually it's multiple choice. And what can happen is, well, sometimes someone puts the question, maybe maliciously, maybe accidentally, into the training dataset, and now your model knows how to answer the test questions really well, but it hasn't generalized the ability to actually do that.[00:11:20] Alex Volkov: Right.[00:11:21] We've seen some folks competitively announce models that are like the best at that leaderboard, but then it's quite obvious that, in open source? Yeah, and on that leaderboard, for Hugging Face specifically, I don't know if LMSys has suffered from that, but there's been some models that seem to have been competitively trained, and some leakage happened into their[00:11:41] swyx: training data, supposedly.[00:11:43] I understand, once there's been a credible assertion, Hugging Face actually does take them down, right? Yeah, yeah,[00:11:48] Alex Volkov: which is really hard to know, right?[00:11:50] swyx: It's really hard to know. Sometimes it's like a pure accident,[00:11:52] Alex Volkov: it's oh, oops. Your data's gone through a mixer. I think a responsible acknowledgement that this kind of thing happened to you is also important.[00:11:58] I saw LDJ from Nous Research acknowledge that. Because many of these datasets are collections of other datasets. There's a bunch of people baking, basically. It's alchemy. Right. And so sometimes you don't know. Sometimes you pull an open source dataset and they announce, oh, you know what, actually, the MMLU benchmark did go into this data set, which then went into that data set.[00:12:22] So sometimes it's actually an accident and folks take it down. But I've seen some competitive folks who want to put their name out there because people are starting to notice which is the top[00:12:30] swyx: model. For those who want a fun take on this, the phi-1 model from Microsoft was accused of being contaminated.[00:12:37] And I saw this joke paper that was fantastic. It was called Pretraining on the Test Set Is All You Need. It's a super small model that just memorizes everything. It was fantastic. So yeah, contamination, I think we've actually covered it in a previous episode before. So we're good.
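The usual mechanical check for this is n-gram overlap between training rows and benchmark questions. A toy version, not any leaderboard's actual audit:

```python
# Toy contamination check: flag training rows that share a long n-gram with
# any benchmark question. Real audits are fancier, but this is the core idea.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_rows: list[str], benchmark_rows: list[str],
                 n: int = 13) -> list[str]:
    bench: set[tuple[str, ...]] = set()
    for row in benchmark_rows:
        bench |= ngrams(row, n)
    return [row for row in train_rows if ngrams(row, n) & bench]

# Any shared 13-word window is suspicious enough to inspect by hand.
```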
But again, I want to give people a map into the open source AI model universe.[00:12:57] And Alex, you can also jump in here, because you guys have spent a lot more time with them than I have. So, what should people know about Teknium? What should people know about Nous? And then we can go down the list. Yeah,[00:13:05] Wing Lian: I think so. I think if we start with Teknium: when you talk to him, I think his response is that he wants to build GPT-4 on his laptop, right?[00:13:14] So, very, very good at building general models. I think with Nous, Nous Research, they're looking at more sort of research focused things, like their YaRN models. They didn't actually train those with us, they have their own trainer for their YaRN models. So they did not use Axolotl for that one?[00:13:30] They didn't use that. But like, is that, you don't have support for it? I think we do support YaRN, I think, I'd have to double check that answer. Yeah, I'm just kind of curious what you can and cannot support. Yeah, I mean, YaRN is supportable, it's basically, I think it's just replacing the RoPE part of that, so yeah, not a big deal.[00:13:48] Yeah, it's not a big deal, it's just I haven't gotten to it, not enough people have asked. I think a lot of people have asked for other things, so it's just, squeaky wheel, right? I think at the end of the day, people are building these data sets, and I think if you sort of map things chronologically, these make more sense, because it's like, how do we incrementally improve all of these models?[00:14:07] So a lot of these models are just incremental improvements over the last thing, right? Whether it is through methods of how did we curate the data set, or how did we improve the quality of the data set. So, maybe LDJ talked about it, right, I think for Capybara and Puffin, those were very specific dataset curation techniques that he works on.
[00:14:29] The Case for Open Source AI[00:14:29] Alex Volkov: So there's, folks are doing this for dataset curation. Folks are doing this for skillset building as well. Definitely people understand that open source is very important, especially after the debacle, the OpenAI weekend that we all had. And people started noticing that even after Developer Day at OpenAI, the APIs went down.[00:14:48] And then after that, the whole leadership of the company swiftly changed, and there were worries about, you know, how can people continue building AI products based on these shaky grounds? That turned attention definitely to Teknium, at least with OpenHermes. I started seeing this more and more on Twitter, but also other models and many companies. They're gonna start with OpenAI just to get there quick, and then they think about, okay, maybe I don't want to share my knowledge. Maybe I don't want to sign up for Microsoft. Maybe they will change their terms and conditions. So what else is out there? They turned to other companies. Up until yesterday, Google was nowhere to be found. We've talked about Gemini a little bit before in a previous episode. And you can tune in[00:15:26] swyx: to[00:15:26] Alex Volkov: ThursdAI.[00:15:26] Yeah, you can tune in to ThursdAI. We covered the Gemini release a little bit. But many are turning to the open source community and seeing that Meta released, and continues to release and commit to, open source AI. Mistral came out, and the model is way smaller than LLaMA and performs significantly better.[00:15:43] People play with OpenHermes, which is currently Teknium-based, Nous Research-sourced, Axolotl-trained, I assume, right? And then they play with this and they see that, okay, this is like GPT-3.5 quality. We had ChatGPT's birthday just a week ago. A year ago, we never interacted with models of this caliber.[00:16:04] And now there's one open source, one that's on my laptop, completely offline, that I can continue improving for my use cases. So enterprises, companies are also noticing this. And the open source community folks are building the skill set, not only the data sets. They're building the actual kind of, here's how we're going to do this, with Axolotl, with these data sets.[00:16:21] The curation pieces. Now, interesting, there's like recipes of curation. The actual model training is kind of a competitive thing where people go and compete on these leaderboards that we talked about, the LMSys arena, which recently added OpenHermes and recently added OpenChat and a bunch of other stuff that's super cool.[00:16:37] The Hugging Face open source leaderboard. And so there's a competitive aspect to this. There's the open source aspect to this, like Teknium says, I want GPT-4 on my laptop. There's the, let me build a skill set that potentially turns into a company, like we saw with Nous. Nous just started organizing a bunch of people on Discord, and suddenly they're announcing their company.[00:16:54] It's happening across all these modalities, and suddenly all these people who saw these green pastures and a fairly quick way to, hey, here's a cool online community I can start doing cool stuff with. You mentioned the same in the beginning, right? Like, after your accident, what's cool, let me try this out.[00:17:08] Suddenly I start noticing that there's a significant movement of interest from enterprising companies into these areas. And this skill set, these data sets, and this community are now very, very important. Important enough to create an event which pulls in Andrej Karpathy from OpenAI to come and see what's new, and Jeremy Howard, like the event that we just talked about. People are flying over, and this is just a meetup.[00:17:28] So, definitely, the community is buzzing right now, and I think Axolotl is a big piece as well.
[00:17:34] Orca and OpenOrca[00:17:34] Wing Lian: Cool. Maybe we can talk about Orca real quick. Orca, OpenOrca rather. I think there was a lot of buzz when the first Orca paper came out. And just briefly, what is Orca? Yeah, Orca was basically having traces of chain of thought reasoning, right?[00:17:48] So they go and they distill sort of GPT-4. They take a sampling of data from the Flan dataset. Maybe we can add some show notes on the Flan dataset. Yeah, but we've covered it. Okay, cool. They use GPT-4 to say, all right, explain this with step by step reasoning, right?[00:18:06] And then you take that, they train the model, and it showed very good improvements across a lot of benchmarks. So OpenOrca was sort of the open reproduction of that, since Microsoft Research never released that particular data set. And going back to sort of the Hugging Face leaderboard thing, those models did really well.[00:18:23] And then I think the follow up to that was SlimOrca, right? Going into and building the OpenOrca dataset, we never really went in and validated the actual answers that GPT-4 gave us. So what we did was, someone from OpenChat actually cross referenced the original Flan responses, the human responses, the correct answers, with the dataset, and then I went and took both of them and sent them to GPT-4 and said, is this answer mostly correct, right?[00:18:54] Yeah. And then we were able to filter the dataset of the GPT-4 only answers from like 800,000 down to like 500,000 answers or rows, and then retrain the model, and it had the same performance as the original model to within, I think, 0.1 percent or thereabouts, with 30 percent less data.[00:19:13] So, yeah. Okay.[00:19:15] swyx: Interesting. So, I mean, there's so much there that I want to highlight, but yeah, Orca is interesting. I do want people to know about it. Putting chain of thought into the data set just makes a ton of sense.
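In code, that SlimOrca-style filter is roughly a judge loop like this sketch; the judge prompt is illustrative, not the exact one used:

```python
# Sketch of the SlimOrca-style filter: ask GPT-4 whether its own earlier
# answer agrees with the original FLAN ground truth, and keep only rows it
# judges correct. The judge prompt is illustrative.
from openai import OpenAI

client = OpenAI()

def judged_correct(question: str, gpt4_answer: str, flan_answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4",
        max_tokens=3,
        messages=[{"role": "user", "content":
            f"Question: {question}\n"
            f"Candidate answer: {gpt4_answer}\n"
            f"Reference answer: {flan_answer}\n"
            "Is the candidate answer mostly correct? Reply YES or NO."}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# kept = [row for row in rows if judged_correct(*row)]
# A pass like this filtered ~800,000 rows down to ~500,000 with no
# measurable benchmark regression.
```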
One thing I think would be helpful for people, to scope these things out, is how much data are we talking about when people are fine tuning, and then how much time or resources or money does it take to[00:19:36] Wing Lian: fine tune?[00:19:37] Yeah, so I think there's a little bit of overlap there with fine tuning techniques, but let's say Orca, and I think even Hermes, they're both relatively large data sets. So large data sets being: the original OpenOrca was 800,000 rows.[00:19:55] I believe it was somewhere in the ballpark of a gigabyte of text data. And I believe Hermes is like a quarter million rows of data; I don't know the actual byte size on that particular one. So, going and training, let's say everybody's training 7 billion Mistral right now, right?[00:20:15] So, to train, I believe to fine tune a 7 billion Mistral on, let's say, 8 A6000s, which have 48 gigabytes of VRAM, I believe it takes about 40 hours. And then, depending on where you get your compute, 40 times 6, so it's like $500 to fine tune that model. And that's assuming you get it right the first time, right?[00:20:44] So, you know.[00:20:45] swyx: Is that something that Axolotl handles, like getting it right the first[00:20:48] Wing Lian: time? If you talk to anybody, it's like you've probably tried at least three or four runs or experiments to find the right hyperparameters. And after a while you sort of have a feel for where you need your hyperparameters to be.[00:21:04] Usually you might do a partial training run, do some benchmarks. So I guess for Far El, whether you're going by his actual name or his twitter handle, he released the Dharma dataset, which is basically a subset of all the benchmarks. And Axolotl actually supports taking that subset and then just running many benchmarks across your model every time you're doing an evaluation, so you can sort of see relative numbers. It's not going to be the actual benchmark score, but you can get ideas: all right, is this benchmark improving, is this benchmark decreasing, based on, you know.[00:21:39] swyx: Wait, why don't you run the full benchmark?[00:21:42] Wing Lian: The full benchmarks take a long time. Significant, yeah, a significant amount of time.[00:21:48] swyx: Okay, so that's like mini MMLU. Yeah. Like,[00:21:49] Wing Lian: mini BigBench or whatever. Yep, exactly.[00:21:51] Alex Volkov: It's really cool. When I joined Weights & Biases just recently, one of the things that I tried to do is, hey, I'm a software engineer by trade, I don't have an MLE background, but I joined a company that does primarily MLE, and I wanted to learn from the community, because a lot of the open source community uses Weights & Biases. And the benchmark that you said that Far El did, remind me of the name, sorry. Dharma? Dharma, yeah, yeah. So Luigi showed me how Dharma shows up inside the Weights & Biases dashboard, and so you can actually kind of see the trending run, and then you can see, per each iteration or epoch, the model improving, trending. So on top of everything else,[00:22:29] Weights & Biases gives you hyperparameter tracking, which, you started with command line args and that's really hard to remember. Also the Dharma data set, like the quick mini versions of many different benchmarks, it's pretty cool to visualize them as well. And I heard that he's working on a new version of Dharma, so Dharma 2, et cetera.[00:22:47] So hopefully we'll see that soon. But definitely it's hard, right? You start this training run, it's like 40, 50 hours. Sometimes you're SSHing into this machine, you start a process, you send it off with godspeed, and you just go about your day, collecting data sets, and then you have to return.[00:23:04] And the whole process of instrumentation of this is still a little bit squeaky. But definitely, tuning performance, or grabbing performance in the middle of this, with Dharma and some other tools, is very helpful to know that you're not wasting precious resources going somewhere you shouldn't go.[00:23:21] Yeah.[00:23:22] swyx: Yeah. Very cool.
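The Dharma idea reduces to something like this sketch: score a small fixed multiple-choice subset at each evaluation and log the trend (model_answer() is a hypothetical stub for your model's scoring logic):

```python
# Sketch of mini-benchmark tracking during a fine-tune: evaluate a small fixed
# multiple-choice subset every N steps and log the trend to Weights & Biases.
# model_answer() is a hypothetical stub; swap in real constrained decoding.
import wandb

SUBSET = [
    {"question": "2 + 2 = ?", "choices": {"A": "3", "B": "4"}, "answer": "B"},
    # ... a few hundred rows sampled from each benchmark, not the full thing
]

def model_answer(question: str, choices: dict) -> str:
    return "B"  # stub: pick the choice with the highest log-likelihood here

def mini_benchmark() -> float:
    hits = sum(model_answer(r["question"], r["choices"]) == r["answer"]
               for r in SUBSET)
    return hits / len(SUBSET)

run = wandb.init(project="finetune-run")   # hyperparameters get logged here too
for step in range(0, 10_000, 500):
    # ... training happens between evaluations ...
    run.log({"mini_benchmark_acc": mini_benchmark(), "step": step})
```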
Maybe before we go into more details on fine tuning stuff, I just wanted to round out the rest of the Axolotl-verse. There's still Eric Hartford stuff. I don't know if you want to talk about Pygmalion, Disco, anything that you know about[00:23:35] Wing Lian: those things.[00:23:36] DiscoLM and Model Stacking[00:23:36] Wing Lian: Yeah, I think definitely one of the more interesting ones was the Disco 120B, right? Yeah, I know nothing about it. Yeah. So, Alpin from Pygmalion AI, right, so Pygmalion is sort of, they have their own community, a lot of it is based around roleplay models, those sorts of things. And Alpin put together, merged together Llama 2 70Bs, and I don't remember how he stacked them together, whether he merged the layers in between. There's a whole toolkit for that by Charles Goddard, where you can take a single model and stack its layers together, or merge multiple models.[00:24:18] That's like a whole other talk and a whole other tool set, but he was able to create this 120 billion parameter model out of Llama 2 70B. And then I believe Disco is a fine tune of the base 120B, which is, I believe, Goliath 120B. So, and what are the[00:24:37] swyx: headline results that people should know about[00:24:39] Wing Lian: Disco?[00:24:39] I think for the headline results, I haven't played with it personally, because it's a very large model and needs a lot of GPU, right? But from what I've heard anecdotally, it performs really well. The responses are very good. Even the base model is a lot better than Llama 70B.[00:24:57] And I think generally everybody's like, we would all love to fine tune Llama 70B, but it's just so much memory, so much compute, right?
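As a toy illustration of that layer-stacking trick (the real thing is done with Charles Goddard's mergekit over full transformer checkpoints), interleaving layer ranges from two same-shaped models produces a deeper one:

```python
# Toy frankenmerge: stack overlapping layer slices from two models with the
# same architecture to get a deeper model. Tiny MLP stacks stand in for the
# 70B checkpoints that a real merge like Goliath-120B works on.
import torch.nn as nn

def make_model(depth: int = 8) -> nn.ModuleList:
    return nn.ModuleList(nn.Linear(16, 16) for _ in range(depth))

a, b = make_model(), make_model()            # stand-ins for two 70B fine-tunes

# a's layers 0-5 followed by b's layers 2-7: 12 layers from two 8-layer models.
stacked = nn.ModuleList(list(a[0:6]) + list(b[2:8]))
print(f"{len(stacked)} layers, versus {len(a)} in each parent")
```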
[00:25:07] Datasets and Evals over Models[00:25:07] Alex Volkov: I want to touch on this point, because the interesting thing that comes out of being in this ecosphere, and being friends with open source folks, tracking week to week state of the art performance on different models: first of all, a lot of the stuff that folks did a couple of weeks ago, and then something like Mistral comes out, a lot of the stuff back then doesn't technically make sense anymore. Like the artifacts of that work, the actual artifacts, they no longer make sense. They're lower on the Hugging Face leaderboard, or lower on the LMSys leaderboard.[00:25:36] But some of the techniques that people use, definitely the datasets, the datasets keep traveling, right? So OpenHermes, for example, is the dataset Teknium cleaned up to only include open-sourceable data from what previously was just Hermes. And it was previously used to train LLaMA. And then once Mistral came out, it was used to train Mistral.[00:25:54] And then it became significantly better on the 7B base Mistral. So the data sets keep traveling, keep getting better a little bit here and there. And the techniques improve as well. It looks like both things are simultaneously true. The artifacts of a month and a half ago, the actual models themselves, it's great that Hugging Face has them, because not every company can keep up with the next week's, oh, I'll install this model instead, serve this model instead.[00:26:19] But the techniques and the datasets keep improving as we go further, and I think that's really cool. However, the outcome of this is that for a long time, for many, many people, including us, and we do this every week, we literally talk with people who release these models every week, it's really hard to know.[00:26:36] So, there's a few aspects of this. One, I think, like you said, the bigger models, the 70B models, you actually have to have somebody like Perplexity, for example, giving you access to the 70B really fast. Or you have to actually find some compute, and it's expensive, especially for the bigger models. For example, Falcon 180B came out, like the hugest open source model.[00:26:56] How do you evaluate this if you can't run it? Nobody liked it. It's really, so first of all, nobody liked it, but secondly, only the people who were able to find enough compute to run inference on this could judge it, and so that's why something like OpenHermes 7B is much easier, because you can run it on your MacBook.[00:27:14] It's much easier to evaluate. It's much easier to figure out the vibes, right? Everybody talks about the vibes as an evaluation check. If you're plugged in enough, if you follow the right people, and they say pretty much the same things all independently, then you run into the problem of whether they're repeating, stochastic parrots repeating the same thing, or whether they actually evaluated it themselves.[00:27:31] Yeah, you never know. But, you never know, but I think on a large enough scale on Twitter, you start getting the feel. And we all know that OpenHermes is one of the top performing models, on benchmarks, but also vibes. And I just wanted to highlight this vibe checks thing, because you can have the benchmarks, you can have the evaluations, they potentially have contamination in them, and they don't necessarily tell you the whole story, because some models are good on benchmarks, but then you talk to them and they're not super helpful.[00:28:00] And I think it's a combination of the benchmarks, the leaderboards, the chatbot arena, because LMSys, remember, their ranking is not only based on benchmarks, it's also people playing with their arena stuff. Actual humans get two answers. I think they completely ignore benchmarks. Yeah, and then they only do ELO.[00:28:18] Oh, they do ELO completely, right? So that, for example, is just people playing with both models and saying, hey, I prefer this one, I prefer that one. But also there's some selection bias. The type of people who will go to LMSys to play with the models, they're a little bit specific in terms of who they are.
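For reference, that arena-style ranking is plain Elo over human pairwise votes, so something like:

```python
# The Elo update behind arena-style rankings: each human "I prefer this one"
# vote adjusts both models' scores based on how surprising the result was.
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))  # winner's win prob
    return r_winner + k * (1 - expected), r_loser - k * (1 - expected)

print(elo_update(1000, 1000))  # (1016.0, 984.0): evenly rated, full surprise split
```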
It's very interesting. There's so many models. People are doing this in this way, that way. Some people are doing this for academic rigor only, to test out new ideas. Some people are actually doing this like the Intel fine tunes of Mistral: Intel wanted to come out and show that their hardware approach is possible, with Mistral, etc.[00:28:51] And it's really hard to know what to pick, what to use. And especially on the bigger models, like you said, the Llama 70B, the Falcon 180B, it's really hard because who has the compute to validate those? So I would mention that: use with caution. Go and research and see if the biggest model that just released was actually worth the tokens and the money you spend on it,[00:29:12] to try and, if you're a business, to integrate it.[00:29:15] Distilling from GPT4[00:29:15] swyx: Since you said use with caution, I'll bring in one issue that has always been in the back of my mind whenever I look at the entire universe of open source AI models, which is that 95 percent of the data is derived from GPT-4, correct?[00:29:30] Which technically you can't use for commercial licenses,[00:29:34] Wing Lian: right?[00:29:35] swyx: What is the community's stance on this kind of stuff?[00:29:40] Wing Lian: I think from the community stance, I feel like a lot of us are just experimenting. So for us, it's like, we're not going and building a product that we're trying to sell, right?[00:29:49] We're just building a product because we think it's interesting, and we want to use it in our day to day lives, whether or not we try and integrate it. Personal use, yeah. Yeah, personal use, so as long as we're not selling it, yeah, it's fine. But[00:30:01] swyx: like, I as a company cannot just take OpenHermes and start serving[00:30:05] Alex Volkov: it and make money on it.[00:30:06] OpenHermes you can, because OpenHermes, I think, is a cleanup that Teknium did after the regular Hermes. Please folks, check your licenses before you listen to podcasts. I will tell you, though, you could say the same thing about OpenAI. You could say the same thing, it kind of makes sense, where OpenAI or StabilityAI trains their diffusion model on a bunch of pictures on the internet, and then the court kind of doesn't strike it down when Sarah Silverman, I think, or somebody else, came and said, hey, this has my work in it. Because of the way it processes things, the model eventually builds this knowledge into the model, and it doesn't actually reproduce one to one what happened in the dataset.[00:30:45] You could claim the same thing for open source. Like, we're using, and by we, I mean the open source community that I happily report on, uses GPT-4 to rank, for example, which is the better answer. That's how you build one type of data set, right? For DPO or something like this, you basically generate a data set of a question and four answers, for example, and then you go to GPT-4 and say, hey, smartest model in the world right now, up to Gemini Ultra, which we should mention as well,[00:31:11] which one of those choices is better? But the choices themselves are not necessarily written with GPT-4. Some of them may be, so there are fully synthetic datasets. But there are also datasets that are just ranked with GPT-4, but actually generated with a sillier model, or like a less important model.[00:31:25] The lines are very blurry as to what type of stuff is possible or not possible. And again, when you use this model that's up on Hugging Face, the license says you can use this. OpenAI is not going to come after you, the user. If anything, OpenAI will try to say, hey, let's prevent this type of thing from happening, but I honestly don't think that they could even know. Not that it makes it okay, it's just, they also kind of did this with the internet's archive of data. And also, I think that some of it is fair use.[00:31:55] You use models to help you augment tasks, which is what GPT-4 lets you do.
[00:32:00] swyx: Yeah, the worst thing that OpenAI can do is just kick you off OpenAI. That's because it's only enforced in the terms of service.[00:32:05] Alex Volkov: Sure, but just to clarify who they're going to kick out: they could kick out Nous, for example, if Nous were abusing their service. A user of the open source, fully Apache 2 open source model, for example, they won't get kicked out if they use both, just because they use both.[00:32:22] I don't believe so. I don't think OpenAI has a claim for that.[00:32:25] swyx: Well, we're not lawyers, but I just want to mention it for people to know it's an issue.[00:32:30] Wing Lian: And one of the things, like, I talked to someone recently, and I think they're also interested in it, but also to the point of, right, if I use a model trained on GPT-4 data, but I use that model to then regenerate new data,[00:32:46] is that model, is that data okay? So you start going down this whole rabbit hole. So yeah. All right.[00:32:53] swyx: Fantastic. Cool. Well, I think that roughly highlights most of the open source universe. You also have your own models. Do you want to shout out any one of them? Yeah.[00:33:01] Wing Lian: I mean, I think, like, early on, Manticore got a lot of love. I think it was mostly popular in the roleplay communities. It tended to be pretty truthful. It tended to have relatively good answers, depending on who you ask, right? But I think for me, it was just, releasing models was a way to try and continue to build out the product, figure out what I needed to put into the product, how do I make it faster, and, if you've got to go and debug your product, you may as well have it do something useful.[00:33:29] Awesome. So, yeah.[00:33:31] Finetuning - LoRA, QLoRA, ReLoRA, GPTQ[00:33:31] swyx: Okay, and then maybe we'll talk about just fine tuning techniques. So this is going to be a little bit more technical than just talking about model names and datasets. So we started off talking about LoRA, QLoRA. I just learned from your readme there's ReLoRA, which I've never heard about.[00:33:45] Could you maybe talk about parameter efficient fine tuning, that whole journey, like, what people should know?
[00:33:50] Wing Lian: Yeah, so with parameter efficient fine tuning, I think the popular ones, again, we'll start with LoRA, right? So, usually what you do is you freeze all the layers of the base model, [00:34:08] and then you introduce another set of layers over it, and then you train those. And it is done in a way that is mathematically possible, particularly with LoRAs: when you train the model, you run your inputs through the base model, whose weights are frozen, but you also run them through the additional weights, and at the end you combine the weights to get your outputs. And when you're done training, you're left with this other set of weights, right, that are completely independent.[00:35:03] And from that, what you can do is, some person smarter than I figured out, well, they've done it in such a way that now I can merge these weights back into the original model without changing the architecture of the model, right?[00:35:27] So that tends to be the go-to. And you're training much fewer parameters, so that when you do that, yes, you still need to have all of the original weights, but you have a smaller gradient, you have a smaller optimizer state, and you're just training fewer weights, so you can tend to train those models on much smaller GPUs.[00:35:27] swyx: Yeah. And roughly, what I've seen out there is it's roughly like 1 percent the number of parameters that you're training. Yeah, that sounds about right. Which is that much cheaper. So Axolotl supports full fine tune, LoRA, QLoRA.[00:35:40] Wing Lian: Yes. So QLoRA is very similar to LoRA. The paper, if I remember correctly: traditionally, most people who did LoRAs were putting the model weights in 8-bit and then doing parameter efficient fine tuning over the LoRA weights. And then with QLoRA, they were quantizing the weights down to 4-bit, right, and I believe they were also training on all of the linear layers in the model.[00:36:15] And then ReLoRA, that was an interesting paper, and it got implemented. Some people in the community tried it out, and it showed that it didn't really have the impact that the paper indicated it would. And from what I was told recently, they re-released something for ReLoRA a few weeks ago, and it's possibly better.[00:36:44] I personally haven't had the time. What was the[00:36:46] swyx: main difference,[00:36:47] Wing Lian: apart from quantization? I don't know. Okay. What was the main difference, sorry?[00:36:49] swyx: Apart from quantization, right? Like,[00:36:50] Wing Lian: QLoRA's thing was, like, we'll just drop off some bits. With ReLoRA, what they did was, you would define some number of steps that you would train your LoRA, or your QLoRA,[00:37:01] like ReQLoRA if you really wanted to. You would train your LoRA for some number of steps, and then you would merge those weights into your base model, and then you would start over. So then, by starting over, the optimizer has to re-optimize and find what's the best direction to move in, and then do it all again, and then merge it in, do it all again. And theoretically, according to the paper, doing ReLoRA, you can do parameter efficient fine tuning but still have sort of the performance gains of doing a full fine tune, so.[00:37:38]
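A minimal version of exactly that flow with Hugging Face PEFT (the tiny model name is an illustrative stand-in for whatever you are actually tuning):

```python
# LoRA as described above: freeze the base weights, train small adapter
# layers, then merge the adapters back into the original architecture.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # illustrative
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically around 1% of all parameters

# ... run your usual training loop; only adapter weights receive gradients ...

merged = model.merge_and_unload()   # fold the adapters back into the base model
```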
swyx: Yeah, and[00:37:39] Wing Lian: GPTQ? And GPTQ, I think, is more similar to QLoRA, where it's mostly a quantization of the weights down to like 4-bit, where GPTQ is a very specific methodology, or implementation, of quantization. So. Got it.[00:37:57] Alex Volkov: Wing, for folks who use Axolotl, your users, some people who maybe want to try it out,[00:38:03] do they need to know the differences? Do they need to know the implementation details of QLoRA versus ReLoRA? Or is it okay for them to just know that Axolotl is the place that already integrated them? And if that's all they need to know, how do they choose which method to use? Yeah,[00:38:22] Wing Lian: so I think most people aren't going to be using ReLoRA. I think most people are going to be using either LoRA or QLoRA. And I think they should have an understanding of why they might want to use one over the other. Most people will say that with QLoRA, the quality of the final model is not quite as good as if you were to do a LoRA or a full fine tune, right?[00:38:44] Just because you've quantized these down, your accuracy is probably a little off, so that by the time you've done the QLoRA, you're not moving the weights how you would on a full fine tune with the full parameter weights.[00:38:56] Interesting.[00:38:57] swyx: Okay, cool. For people who are more interested, obviously, read the papers. I just wanted to give people a high level overview of what these things are. And you've done people a service by making it easy for people to try them out.
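And the QLoRA variant in the same sketch form: quantize the frozen base to 4-bit NF4, then train LoRA adapters over the linear layers (again with a small stand-in model):

```python
# QLoRA: load the frozen base in 4-bit NF4, then attach trainable LoRA
# adapters on the linear layers. Requires the bitsandbytes package and a GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # the QLoRA paper's NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m",   # illustrative
                                            quantization_config=bnb)
base = prepare_model_for_kbit_training(base)
model = get_peft_model(base, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
))
```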
swyx: I'm going to also ask a question which I know to be wrong, but I'm curious, because I get asked this all the time. What is the difference between all these kinds of fine tunes[00:39:17] Wing Lian: and RLHF? Okay, between all of these sorts of fine tunes and RLHF. So all of these sorts of fine tunes are, ideally, taking knowledge that the base model already knows, and presenting it in a way where you're having the model use what it already knows to answer in a particular way, whether you're extracting general knowledge or a particular task, right?[00:39:44] Instruct tune, chat, those sorts of things. And then generally with RLHF, so what is it? Reinforcement Learning from Human Feedback. So if we start with the human feedback part: what you're doing is, you generally have a given prompt, and then maybe you have one, maybe you have two, I think with Starling you have up to, what, seven different possible responses, and you're ranking those responses on some sort of metric, right? Whether the metric is how much I might like that answer, or, I think with Starling it's how helpful was the answer, how accurate was the answer, how toxic was the answer, those sorts of things, on some sort of scale. And then you use that feedback to go back and nudge the model in the direction of those preferences, to be able to answer questions based on those preferences.[00:40:42] swyx: Yeah. So, and is it commutative? Can you apply fine tuning after and onto an RLHF model? Or should the RLHF come in afterwards,[00:40:54] Wing Lian: after the fine tune? Um, yeah, I don't know that there's been enough research one way or another. Like, I don't know.[00:41:02] That's a question that's been asked on Discord. Yeah, I definitely would say I don't know the answer. Go and try it and report back to me and let me know, so I can answer for the next guy.[00:41:10] swyx: It's shocking how much is still unknown about all these things. Well, I mean, that's what research is for, right?[00:41:16] Wing Lian: So actually, I think I saw on the top of a leaderboard, it was a Mistral base model, and they didn't actually fine tune it. They just did an RLHF fine tune on it using, I don't recall which dataset, and it benchmarked really well.[00:41:37] But yeah, you'd have to go and look at it. But it is interesting. Going back to that, it's like, traditionally, most people will fine tune the model and then do a DPO, PPO, some sort of reinforcement learning over that, but that particular model, it seemed like they skipped the supervised fine tuning, or SFT.
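Since DPO came up: its core objective on a single preference pair is compact enough to show directly. Given summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, the loss is:

```python
# The DPO loss on one preference pair. Obtaining the four log-probs (full
# response log-likelihoods under the policy and the frozen reference model)
# is the expensive part that trainers handle for you.
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -F.logsigmoid(beta * margin)

# Here the policy already prefers the chosen answer more strongly than the
# reference does, so the loss sits below -log(0.5) ~= 0.693:
print(dpo_loss(torch.tensor(-5.0), torch.tensor(-9.0),
               torch.tensor(-6.0), torch.tensor(-8.0)))  # ~0.598
```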
swyx: That's amazing. You should write documentation that's like: this rule exists because this user, at this time, ran into this bug, and that's why we invested in it.[00:44:10] It's like a good collection[00:44:11] Wing Lian: of knowledge. Yeah, it is. And I guess if you really wanted to figure it out, you could git blame everything. But yeah, I think that's always a useful thing, because people want to experiment, but people will get frustrated when they're experimenting and it breaks and they don't know why, or they know why and they've just gone down the rabbit hole, right?[00:44:37] So I think that's one of the big features that I find important, because it prevents you from doing things you probably shouldn't have. And sometimes we will let you do those things, but we'll try and warn you that you've done that.[00:44:51] Alex Volkov: I have a follow-up question on this, actually, because yesterday we hung out at this open source event, and I stood by you a couple of times when people told you, oh, Axolotl, I use Axolotl, it's super cool, and the first thing you asked, immediately, was: what can we improve?[00:45:04] And yes, from multiple folks. And I think we talked about this a little bit, where it's a developer tool, a machine learning slash developer tool, and your purpose in this is to help people and, as much as possible, say: hey, here's the best set of things that you can use right now, versus the bare libraries, or the bare trainer, for example.[00:45:28] And also, maybe we should talk about how fast you're implementing these things. So you mentioned the first implementation took a week or so. Now there's a core maintainer group, right? Features are landing, like QLoRA, for example. NEFTune is maybe one example of something that people said was going to be cool, and then eventually it was one of those things that didn't really shake out, like, people quickly tested it.[00:45:48] So there's a ton of... Wait, NEFTune is cancelled? I don't know if it's fully cancelled, but based on vibes, I heard that it's not that great. But the whole point I'm trying to make with NEFTune is that existing in the community around Axolotl, or even following the GitHub issues or following the Discord, is a fairly good way to learn these kinds of gut feelings that you just mentioned, right?[00:46:14] Like where maybe this knob and that knob don't work together. Some of these are not written down. Some of these are tribal knowledge that passes from place to place, and Axolotl is a great collection of many of them. And so, do you get that back from the community of folks who use it? Like, how do you know who uses this?[00:46:30] I think that's still an issue, like, knowing if they trained with Axolotl, or whether they should add this to things.
Talk about how you get feedback, and how else should you get feedback?[00:46:38] Wing Lian: Yeah, I mean, most of the feedback comes from the Discord, so people come in and they don't get a training run going, or they run into obscure errors. A lot of those are things that maybe we could catch as a product, but there are a lot of things that at some point we need to go and do, and they're just on the list somewhere.[00:46:58] That's why, when people come up, I'm like: what were your pain points? Because as a developer tool, if you're not happy with it, or you come in and the first 30 minutes go by and you're still not happy, you leave the tool. And you might move on, maybe to a better tool, maybe to one with less frustration, but it may not be as good, right?[00:47:17] So I'm trying to figure out, all right, how can I reduce all this frustration? Because I use it every day, for the most part, right? And so I am blind to that. I just know: I go do this, this, and this, and it pretty much mostly works, right? So I don't have that learning curve that other people are seeing, and I don't understand their pain points.[00:47:40] Alex Volkov: Yeah, you don't have the ability to onboard yourself as a new user, completely new to the whole paradigm, who doesn't even know how to ask about a problem or an error.[00:47:53] swyx: Cool. The last few things I wanted to cover were also just the more advanced stuff that you covered yesterday.[00:48:00] 20x efficiency with StackLLaMA and Multipack[00:48:00] swyx: So I'll just caution this as, yeah, this is more advanced. But you mentioned StackLLaMA and Multipack. What are they,[00:48:06] Wing Lian: and what should people know? Yeah, so StackLLaMA, that paper came out, and StackLLaMA, I think, was two separate concepts that they announced. So the first one was... They being Hugging Face.[00:48:20] Yeah, sorry, yes, they being Hugging Face. So the first one being this idea of packing, like packing sequences together. So if we think about training data, right: to keep the math easy, let's say your training data is 500 words long, and we'll use words as the unit of measure.[00:48:39] And let's say your context length, meaning how much data your model can accept, or that you want to feed into your model, is 4,000 words. So if you're training at a 4K context and you're only using 500 of it, you're sitting with the other 3,500 words that you're not using, right? And typically that's filled with these PAD tokens. I think I made the analogy last night that it's like having a glass: you fill it up with a shot of liquor, and that's your training data, and then you just fill the rest up with water, and those are your PAD tokens. It doesn't do much, right?[00:49:27] It's still the same thing, but you still have to go through all of that to get through all your training data.
And then what StackLLaMA showed was that you could just take your training data and append the next row of training data until you've filled that entire 4K context. So in this example, with 500 words into 4K, that's 8 rows of training data.[00:49:48] But the problem with that is that a lot of these transformer models rely very heavily on attention, right? So if you now have this packed sequence of words, the model has seen all of these other words before, and then it sees another set of words, and another set of words, but it's learning everything in the context of all the words it's seen before.[00:50:13] We haven't corrected the attention for that. And just real quickly, since I said that paper was two concepts: the other one was, I believe, reinforcement learning, but that's outside the scope of this. So going from that, I implemented that early on, because I was like, oh wow, this is really great.[00:50:29] And yes, it saves you a bunch of time, but the trade-off is a little bit of accuracy, ultimately. But it still did pretty well. I think when I did Manticore, it used that concept from StackLLaMA of just appending these sequences together, right? And then the next evolution of that is Multipack, right?[00:50:51] So there was a separate paper on that, or at least it got referenced in the Orca paper, where you could properly mask those sequences out, using, I think it was a lower block-triangular attention mask. So there's that. I did try implementing that, manually recreating that mask, but then the guy from OpenChat, who was helping with OpenOrca as well, had done an implementation of Multipack where he used FlashAttention. FlashAttention was released by Tri Dao, and it was this huge performance gain.[00:51:35] Everybody uses it now; even with the Transformers library, people are taking all of these models and making them compatible with FlashAttention. But in FlashAttention, there is one particular implementation that lets you say: well, I'm sending you all of these sequences stacked together like you would in StackLLaMA, but let me also send you another set of information about where each set of sequences starts.[00:52:06] So if each one was 500 words long and you stacked them all together, you would just send it a row of information that was like 0, 500, 1000, 1500, etc., out to 4000. And it would know, all right, I need to break this up, and then run the forward pass with it. And it was much, much more performant.[00:52:29] And I think you end up seeing like 10x, 20x improvements. I mean, I think FlashAttention alone was like a 2x improvement, and then adding Multipack on top of that, depending on how much data you have, you start to see up to a 20x improvement sometimes. 20x. 20x. Wow. Yeah.[00:52:48] And I only know the 20x because, before last night, I re-ran the Alpaca fine-tune. I looked up the Alpaca paper, because I just needed a frame of reference where somebody did it, and I think they used eight A100s for three hours, and they said it cost them $100. I don't know how much eight A100s cost right now.[00:53:14] But I ended up re-running it. Usually a dollar an hour, right?
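A rough sketch of the mechanics Wing just walked through, using his numbers: the pad-token waste, packing eight 500-word samples into one 4K row, and the boundary offsets (0, 500, 1000, ... 4000) that FlashAttention's variable-length interface consumes so attention never crosses sample boundaries. This is an illustration of the idea, not Axolotl's actual Multipack code.

```python
import numpy as np

CONTEXT = 4000
examples = [np.arange(500) for _ in range(8)]   # eight 500-token samples

# Naive padding: one sample per row, 3,500 pad tokens of "water".
print(f"padded utilization: {500 / CONTEXT:.1%}")   # 12.5%

# Packing: concatenate samples until the context is full...
packed = np.concatenate(examples)
print(packed.shape)                              # (4000,), a fully used row

# ...and record cumulative boundaries so attention can be masked
# per-sample instead of letting sample 2 attend into sample 1.
cu_seqlens = np.cumsum([0] + [len(e) for e in examples])
print(cu_seqlens)                                # [0 500 1000 ... 4000]
```

Filling the row takes per-step utilization from roughly 12.5% to roughly 100% (about 8x on this data), and stacking that with FlashAttention's roughly 2x is how the 10-20x figures show up; the exact multiple depends on how short your samples are relative to the context.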
[00:53:18] Alex Volkov: Yeah, the cheapest is like a dollar, a dollar an hour for one.[00:53:20] Wing Lian: Yeah, so eight of those is still like $24, $25. But maybe if you're going on Azure, maybe it's $100 on Azure. I mean, it used to be more expensive, like, a year ago.[00:53:31] And then I re-ran it with all of the optimizations turned on, just to see what it would be. And usually Multipack is the biggest optimization, so Multipack with FlashAttention. I think I spun it up on 8 L40s, and it ran, and I didn't let it run all the way through; I just grabbed the estimated completion time, and it was like 30 minutes. So it would have cost like $4 or $5 to reproduce the Alpaca paper, right?[00:54:00] Which is crazy. It's crazy. 20x,[00:54:02] Alex Volkov: yeah. I want to ask about that: you said you turned on all the optimizations. Is that the YAML file with Axolotl? You just go and check off, like, I want this, I want that?[00:54:10] Wing Lian: Yeah, yeah, so there's one particular YAML file in there, under examples, llama-2, the FFT optimized one.[00:54:20] I think someone had created one where they just put in all of the optimizations and turned them on. And it actually does run, which is sort of surprising sometimes, because sometimes you optimize this and optimize that, and they just don't work together. But yeah, just turn the knobs on, and fine-tuning should really just be that easy, right? I just want to flip the knob and move on with my life, not figure out how to implement it.[00:54:47] Tri Dao and Mamba[00:54:47] Alex Volkov: Speaking of which, the guy behind FlashAttention came up with something new. You want to talk about this a little bit? You want to briefly cover Mamba?[00:54:53] Yeah, let's talk about Mamba. Let's talk about Mamba. So, what is Mamba?[00:54:57] Wing Lian: Oh, gosh. I mean, I have not read the paper end to end; I think you need to find someone smarter to tell you what Mamba is. But in a nutshell, it's this attention-less model architecture. I think it's using a lot of the learnings from Stanford, which did a lot of attention-less models, like Hyena several months ago, so it's an evolution of that research. And apparently, I believe, it's something like 5x faster for inference, and the memory requirements are sub-quadratic. So with models that have attention, as you scale the context length out, the memory, and the inference and training time, go up quadratically, like, squared, right?[00:55:50] Whereas this one is much closer to linear. So it's really exciting, and I think a lot of people in the community are excited about it, because, especially, I was talking with LDJ yesterday, and he was saying that with the perplexity curves, comparing a 140-million-parameter Mamba model against the Pythia 140-million-parameter model trained on the exact same dataset, the perplexity curves for Mamba were, I believe, a little bit lower than the Pythia model.[00:56:26] So yeah.
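Mamba itself uses input-dependent ("selective") state-space parameters, discretization, and a hardware-aware parallel scan, so the snippet below is only the cost intuition, not the architecture: each new token updates a fixed-size state, so time and memory grow linearly with sequence length rather than quadratically as with attention.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy linear state-space recurrence: one O(state) update per token."""
    h = np.zeros(A.shape[0])       # fixed-size hidden state
    ys = []
    for x_t in x:                  # L steps total: O(L * state), not O(L^2)
        h = A @ h + B * x_t        # state transition
        ys.append(C @ h)           # readout
    return np.array(ys)

rng = np.random.default_rng(0)
N = 16                             # state size stays constant as length grows
A = 0.9 * np.eye(N)                # toy stable transition matrix
B, C = rng.normal(size=N), rng.normal(size=N)
print(ssm_scan(rng.normal(size=1000), A, B, C).shape)   # (1000,)
```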
Yeah.[00:56:28] Alex Volkov: LDJ, by the way, was super excited to get us to talk about Mamba on ThursdAI as well. He mentioned to me that the significant improvements in performance could be like 2x at the beginning, where token counts are lower, but then as you scale to longer and longer sequences, because of the non-quadratic, almost linear scaling, the performance improvements for bigger models and longer contexts are significant, like in the 10x to maybe 20x range.[00:56:57] Yeah, I think he said 10x at the larger models. And that's where we want to go. We want to get to the bigger sizes, the longer training runs.[00:57:06] Wing Lian: Yeah, yeah. So in particular, the longer context lengths. So if you're talking 50K, 60K, or 128K context, like, what is it, GPT-4 Turbo now? 128K, yes. So, getting out to that, because it's no longer quadratic, I believe it should be just as fast generating those tokens as it is on a short prompt.[00:57:34] Alex Volkov: So, this came out just recently, and then between running to this open source event and driving here in an Uber, you already put out something that I saw you started. Wait, what? Something today? Yeah,[00:57:47] Wing Lian: what did you do? Well, I mean, Tri and, I forget who the other author is on that paper, they had released the modeling code on GitHub, but they hadn't quite made it Transformers-library native.[00:58:04] So it didn't quite drop cleanly into Axolotl so that you could fine-tune it. It was one of the things I actually wanted to try and get done before the meetup yesterday, and just demo that, because that would be awesome, right? That'd be awesome.[00:58:20] I think it dropped on Thursday and... no, what day? No, today is Thursday. I keep thinking today is Friday. So I think it dropped on, what, Tuesday? The meetup was Wednesday. I wanted to get it done for that, but I was getting it to where the loss would just go to zero and fail.[00:58:40] But yeah, right before coming here, I was working on it this morning, and I think we finally got it working. So I think Far El is training something on it, and I'm pretty sure Teknium is going to be training something on it soon.[00:58:52] Alex Volkov: Yeah. So, we'll see. But I wanted to highlight the speed, because you started with the first Alpaca implementation landing in Axolotl within a week or so, and now you're talking about three days, and that's with you flying, and that's with you presenting and talking on podcasts.[00:59:08] Roadmap for Axolotl[00:59:08] swyx: Very productive. Yeah, excellent. Well, so, we're going to start wrapping up soon, but I always wanted to give you space to also talk about what you're working on next, and what's on the roadmap for Axolotl.
[00:59:17] Wing Lian: Yeah, so the roadmap for Axolotl is really about trying to stabilize the feature set.[00:59:26] The first thing on the roadmap is to write the roadmap, and then going from there: for me, the vision is that it's a developer-first platform, right? And as a developer, more than likely you're doing this as a side hustle or side project, trying to figure out, how do I build LLMs? How do I use a trainer? That sort of thing. And then you get comfortable with the tool, and then maybe you take it to your company and you're training models for where you work, right?[01:00:03] And then, ultimately, you're saying, I want to use this because it's easy and I know how to use it. So for me, if I follow that thought through, it's like, well, companies don't want to use this if it's hard for them, given their specific use cases. They might need something specific in their workflow, and what I don't want is for them to have to fork it in a way that's hard to maintain, where if they want new features, they then have to rebase and all of that.[01:00:32] So for me, and I actually have an issue on GitHub that's about three or four months old at this point about exactly this: create a plugin system, expose hooks where companies can go in and build their own plugins, and modify hyperparameters on the fly, or modify various attributes of training.[01:00:57] swyx: Yeah, it's becoming a platform. Wing Lian: Yeah, exactly. So I need to provide a way for them to be able to use it in a reliable manner, something that they can go invest in and feel comfortable using, right?[01:01:10] swyx: Yeah, awesome. You are working independently? You left SoundCloud a few months ago, and you have a non-profit, the OpenAccess AI Collective.[01:01:20] The Open Source AI Community[01:01:20] swyx: It has a Discord people can join. How else can people support you?[01:01:22] Wing Lian: I think really, for me, the biggest thing is that I'm always looking for contributors. We have a great set of core contributors, Nanobit, Amin slash TMM1, Casper Hansen, and there are probably a few others whose names I don't have offhand, and we do see some smaller PRs trickle through. But a lot of it is, if I had somebody who could have gone and done Mamba for me, that would have made my life a hundred times easier, right?[01:01:51] I wouldn't have to be scrambling between Ubers and meetings and those sorts of things to try and get that implemented. So there's definitely this roadmap of things to do and nice-to-haves, right?
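Since that plugin system is, per Wing, still a GitHub issue rather than shipped code, any concrete API is speculative. As a sketch of the hook pattern he describes, with every name invented for illustration:

```python
# Hypothetical sketch of a trainer plugin system; hook names and the
# registration mechanism are invented, not Axolotl's actual interface.
class TrainingPlugin:
    """Base class: a company overrides only the hooks it cares about."""
    def on_config_loaded(self, cfg: dict) -> dict:
        return cfg                    # e.g. inject org-wide defaults
    def on_step_end(self, step: int, metrics: dict) -> None:
        pass                          # e.g. adjust hyperparameters, log

class WarmupBooster(TrainingPlugin):
    def on_config_loaded(self, cfg: dict) -> dict:
        cfg.setdefault("warmup_steps", 100)   # company-specific tweak
        return cfg

PLUGINS: list[TrainingPlugin] = [WarmupBooster()]

def load_config(cfg: dict) -> dict:
    # The trainer calls the hooks; plugins never fork the trainer itself.
    for plugin in PLUGINS:
        cfg = plugin.on_config_loaded(cfg)
    return cfg

print(load_config({"learning_rate": 2e-5}))
```

The point of the design is the one Wing makes: customization lives behind stable hooks, so companies stop maintaining forks they have to rebase every release.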
And Nanobit is great at being a community manager, answering questions, fueling all of that, and being technical. He's really technical and can still open PRs and fix things. And he's a graduate student in Japan, doing research, and somehow he finds time to support this community, right?[01:02:25] He's amazing, I love him, and I think everybody should show him some love. But yeah, ultimately, the biggest thing that I could ask for would be just, yeah, more core contributors.[01:02:38] swyx: Cool. All right, well, if you're interested in checking it out, check out Axolotl. Alex, anything else to[01:02:43] Alex Volkov: add? Yeah, I will say, for folks who are listening to us: open source doesn't just happen. It happens because there's a bunch of great people giving their life, basically, to these things. So, first of all, be nice in the comments. That's obvious. If you want to come in and complain about something, be productive, and do as much of the work as possible, so the person who's giving part of their life to help you will actually find it easier.[01:03:06] It usually gets to a point where a small project becomes a platform, and the platform then has rules, and that makes it hard for some people to just go in and say, hey, this thing or that thing. Remember, there are people contributing without necessarily a lot of gain from it, just because they're contributing to the community.[01:03:24] And also, come in and contribute. If you're using Axolotl, and I heard many people, commercial people, A16Z folks, come up to you saying they use Axolotl: give back. Give back to the community. So if you're listening to this, and you've used Axolotl, and it helped you, there is a way to contribute, not necessarily as a core contributor; there's sponsorship too, so reach out. But definitely talk about it and give feedback as well.[01:03:51] That's also very helpful. Sometimes people get stuck, and it's like, ah, okay, we'll do something else. No, just give feedback, talk about it. I think everybody will generally benefit from that. Excellent.[01:04:01] Wing Lian: Thank you. That's it. Yeah. Alright.[01:04:04] Alex Volkov: Cool. Thanks for coming. Everybody should try Axolotl and tell us what they think.[01:04:11] Wing Lian: Yeah. Get full access to Latent.Space at www.latent.space/subscribe
-
Notebooks = Chat++ and RAG = RecSys! — with Bryan Bischof of Hex Magic
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2023-11-29 21:10
Catch us at Modular's ModCon next week with Chris Lattner, and join our community!2024 note: Hex is now hiring AI Engineers.Due to Bryan's very wide-ranging experience in data science and AI across Blue Bottle (!), Stitch Fix, Weights & Biases, and now Hex Magic, this episode can be considered a two-parter.Notebooks = Chat++We've talked a lot about AI UX (in our meetups, writeups, and guest posts), and today we're excited to dive into a new old player in AI interfaces: notebooks! Depending on your background, you either Don't Like or you Like notebooks. They are the most popular example of Knuth's Literate Programming concept: basically a collection of cells, where each cell can execute code, display its output, and share its state with all the other cells in a notebook. Cells can also simply be Markdown cells that add commentary to the analysis. Notebooks have a long history, but most recently became popular as IPython evolved into Project Jupyter, and a wave of notebook-based startups from Observable to Deepnote and Databricks sprang up for the modern data stack.The first wave of AI applications has been very chat focused (ChatGPT, Character.ai, Perplexity, etc). Chat as a user interface has a few shortcomings, the major one being the inability to edit previous messages. We enjoyed Bryan's takes on why notebooks feel like "Chat++" and how they are building Hex Magic:* Atomic actions vs stream of consciousness: in a chat interface, you make corrections by adding more messages to a conversation (e.g. "Can you try again by doing X instead?" or "I actually meant XYZ"). The context can easily get messy and confusing for models (and humans!) to follow. Notebooks' cell structure, on the other hand, allows users to go back to any previous cell and make edits without having to add new ones at the bottom.* "Airlocks" for repeatability: one of the ideas they came up with at Hex is "airlocks", collections of cells that depend on each other and keep each other in sync. If you have a task like "Create a summary of my customers' recent purchases", there are many sub-tasks to be done (look up the data, sum the amounts, write the text, etc). Each sub-task will be in its own cell, and the airlock will keep them all in sync together.* Technical + non-technical users: previously you had to use Python / R / Julia to write notebook code, but with models like GPT-4, natural language is usually enough. Hex is also working on lowering the barrier of entry for non-technical users into notebooks, similar to how Code Interpreter is doing the same in ChatGPT. Obviously notebooks aren't new for developers (OpenAI Cookbooks are a good example), but they haven't had much adoption in less technical spheres. Some of the shortcomings of chat UIs, plus LLMs lowering the barrier of entry to creating code cells, might make them a much more popular UX going forward.RAG = RecSys!We also talked about the LLMOps landscape and why it's an "iron mine" rather than a "gold mine": I'll shamelessly steal [this] from a friend, Adam Azzam from Prefect. He says that [LLMOps] is more of like an iron mine than a gold mine in the sense of there is a lot of work to extract this precious, precious resource. Don't expect to just go down to the stream and do a little panning. There's a lot of work to be done. And frankly, the steps to go from this resource to something valuable are significant.Some of my favorite takeaways:* RAG as RecSys for LLMs: at its core, the goal of a RAG pipeline is finding the most relevant documents based on a task.
This isn't very different from traditional recommendation system products that surface things for users. How can we apply old lessons to this new problem? Bryan cites fellow AIE Summit speaker and Latent Space Paper Club host Eugene Yan in decomposing the retrieval problem into retrieval, filtering, and scoring/ranking/ordering. As AI Engineers increasingly find that long context has tradeoffs, they will also have to relearn age-old lessons: vector search is NOT all you need, and a good systems-not-models approach is essential to scalable, debuggable RAG. Good thing Bryan has just written the first O'Reilly book about modern RecSys, eh?* Narrowing down evaluation: while "hallucination" is an easy term to throw around, the reality is more nuanced. A lot of the time, model errors can be automatically fixed: is this JSON valid? If not, why? Is it just missing a closing brace? These smaller issues can be checked and fixed before returning the response to the user, which is easier than fixing the model.* Fine-tuning isn't all you need: when they first started building Magic, one of the discussions was around fine-tuning a model. In our episode with Jeremy Howard we talked about how fine-tuning leads to loss of capabilities as well. In notebooks, you are often dealing with domain-specific data (e.g. purchases, orders, wardrobe composition, household items, etc.); the fact that the model understands that "items" are probably part of an "order" is really helpful. They found that GPT-4 + 3.5-turbo were everything they needed to ship a great product, rather than having to fine-tune on notebooks specifically.Definitely recommend listening to this one if you are interested in getting a better understanding of how to think about AI, data, and how we can use traditional machine learning lessons in large language models. The AI Pivot: for more Bryan, don't miss his fireside chat at the AI Engineer Summit.Show Notes* Hex Magic* Bryan's new book: Building Recommendation Systems in Python and JAX* Bryan's whitepaper about MLOps* "Kitbashing in ML", slides from his talk on building on top of foundation models* "Bayesian Statistics The Fun Way" by Will Kurt* Bryan's Twitter* "Berkeley man determined to walk every street in his city"* People:* Adam Azzam* Graham Neubig* Eugene Yan* Even OldridgeTimestamps* [00:00:00] Bryan's background* [00:02:34] Overview of Hex and the Magic product* [00:05:57] How Magic handles the complex notebook format to integrate cleanly with Hex* [00:08:37] Discussion of whether to build vs buy models - why Hex uses GPT-4 vs fine-tuning* [00:13:06] UX design for Magic with Hex's notebook format (aka "Chat++")* [00:18:37] Expanding notebooks to less technical users* [00:23:46] The "Memex" as an exciting underexplored area - personal knowledge graph and memory augmentation* [00:27:02] What makes for good LLMOps vs MLOps* [00:34:53] Building rigorous evaluators for Magic and best practices* [00:36:52] Different types of metrics for LLM evaluation beyond just end task accuracy* [00:39:19] Evaluation strategy when you don't own the core model that's being evaluated* [00:41:49] All the places you can make improvements outside of retraining the core LLM* [00:45:00] Lightning Round. Transcript: Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence of Decibel Partners, and today I'm joined by Bryan Bischof. [00:00:15]Bryan: Hey, nice to meet you.
[00:00:17]Alessio: So Bryan has one of the most thorough and impressive backgrounds we had on the show so far. Lead software engineer at Blue Bottle Coffee, which if you live in San Francisco, you know a lot about. And maybe you'll tell us 30 seconds on what that actually means. You worked as a data scientist at Stitch Fix, which used to be one of the premier data science teams out there. [00:00:38]Bryan: It used to be. Ouch. [00:00:39]Alessio: Well, no, no. Well, you left, you know, so how good can it still be? Then head of data science at Weights and Biases. You're also a professor at Rutgers and you're just wrapping up a new O'Reilly book as well. So a lot, a lot going on. Yeah. [00:00:52]Bryan: And currently head of AI at Hex. [00:00:54]Alessio: Let's do the Blue Bottle thing because I definitely want to hear what's the, what's that like? [00:00:58]Bryan: So I was leading data at Blue Bottle. I was the first data hire. I came in to kind of get the data warehouse in order and then see what we could build on top of it. But ultimately I mostly focused on demand forecasting, a little bit of recsys, a little bit of sort of like website optimization and analytics. But ultimately anything that you could imagine sort of like a retail company needing to do with their data, we had to do. I sort of like led that team, hired a few people, expanded it out. One interesting thing was I was part of the Nestle acquisition. So there was a period of time where we were sort of preparing for that and didn't know, which was a really interesting dynamic. Being acquired is a very not necessarily fun experience for the data team. [00:01:37]Alessio: I build a lot of internal tools for sourcing at the firm and we have a small VCs and data community of like other people doing it. And I feel like if you had a data feed into like the Blue Bottle in South Park, the Blue Bottle at the Hanahaus in Palo Alto, you can get a lot of secondhand information on the state of VC funding. [00:01:54]Bryan: Oh yeah. I feel like the real source of alpha is just bugging a Blue Bottle. [00:01:58]Alessio: Exactly. And what's your latest book about? [00:02:02]Bryan: I just wrapped up a book with a coauthor Hector Yee called Building Production Recommendation Systems. I'll give you the rest of the title because it's fun. It's in Python and JAX. And so for those of you that are like eagerly awaiting the first O'Reilly book that focuses on JAX, here you go. [00:02:17]Alessio: Awesome. And we'll chat about that later on. But let's maybe talk about Hex and Magic before. I've known Hex for a while, I've used it as a notebook provider and you've been working on a lot of amazing AI enabled experiences. So maybe run us through that. [00:02:34]Bryan: So I too, before I sort of like joined Hex, saw it as this like really incredible notebook platform, sort of a great place to do data science workflows, quite complicated, quite ad hoc interactive ones. And before I joined, I thought it was the best place to do data science workflows. And so when I heard about the possibility of building AI tools on top of that platform, that seemed like a huge opportunity. In particular, I lead the product called Magic. Magic is really like a suite of sort of capabilities as opposed to its own independent product. What I mean by that is they are sort of AI enhancements to the existing product. And that's a really important difference from sort of building something totally new that just uses AI. 
It's really important to us to enhance the already incredible platform with AI capabilities. So these are things like the obvious Copilot-esque vibes, but also more interesting and dynamic ways of integrating AI into the product. And ultimately the goal is just to make people even more effective with the platform. [00:03:38]Alessio: How do you think about the evolution of the product and the AI component? You know, even if you think about 10 months ago, some of these models were not really good on very math-based tasks. Now they're getting a lot better. I'm guessing a lot of your workloads and use cases are data analysis and whatnot. [00:03:53]Bryan: When I joined, it was pre-GPT-4 and pre the new chat API and all that. But when I joined, it was already clear that GPT was pretty good at writing code. And so when I joined, they had already executed on the vision of, what if we allowed the user to ask a natural language prompt to an AI and have the AI assist them with writing code? So what that looked like when I first joined was: it had some capability of writing SQL, it had some capability of writing Python, and it had the ability to explain and describe code that was already written. Those very, what feel like now, primitive capabilities, believe it or not, were already quite cool. It's easy to look back and think of it as the Stone Age in these timelines. But to be clear, when you're building on such an incredible platform, adding a little bit of these capabilities feels really effective. And so almost immediately I started noticing how it affected my own workflow, because ultimately, as an engineering lead, a lot of my responsibility is to be doing analytics to make data-driven decisions about what products we build. And so I'm actually using Hex quite a bit in the process of iterating on our product. When I'm using Hex to do that, I'm using Magic all the time. And even in those early days, the amount that it sped me up, that it enabled me to very quickly execute, was really impressive. And so even though the models weren't that good at certain things back then, that capability was not to be underestimated. But to your point, the models have evolved between 3.5 Turbo and 4. We've actually seen quite a big enhancement in the kinds of tasks that we can ask Magic, and even more so with things like function calling and understanding a little bit more of the landscape of agent workflows, we've been able to really accelerate. [00:05:57]Alessio: You know, I tried using some of the early models in notebooks and they actually didn't like the IPyNB formatting, kind of like a JSON plus XML plus all these weird things. How have you tackled that? Do you have some magic behind the scenes to make it easier for the models? Are you still using completely off-the-shelf models? Do you have some proprietary ones? [00:06:19]Bryan: We are using, at the moment in production, 3.5 Turbo and GPT-4. I would say for a large number of our applications, GPT-4 is pretty much required. To your question about whether it understands the structure of the notebook, and all of these somewhat complicated wrappers around the content that you want to show: we do our very best to abstract that away from the model and make sure that the model doesn't have to think about what the cell wrapper code looks like. Or for our Magic charts, it doesn't have to speak the language of Vega.
These are things that we put a lot of work into on the engineering side, true to the AI engineer profile. This is the AI engineering work: getting all of that out of the way so that the model can speak in the languages that it's best at. The model is quite good at SQL, so let's ensure that it's speaking the language of SQL, and that we are doing the engineering work to get the output of that model, the generations, into our notebook format. So too for other cell types that we support, including charts, and just in general, understanding the flow of different cells, understanding what a notebook is. All of that is hard work that we've done to ensure that the model doesn't have to learn anything like that. I remember early on, people asked the question, are you going to fine-tune a model to understand Hex cells? And almost immediately, my answer was no. No, we're not. Having used fine-tuned models in 2022, I was already aware that there are some limitations to that approach, and frankly, even using GPT-3 and GPT-2 back in the day at Stitch Fix, I had already seen a lot of instances where putting more effort into pre- and post-processing can avoid some of these larger lifts. [00:08:14]Alessio: You mentioned Stitch Fix and GPT-2. How has the balance between build versus buy, so to speak, evolved? GPT-2 was a model that was not super advanced, so for a lot of use cases it was worth building your own thing. With GPT-4 and the likes, is there a reason to still build your own models for a lot of this stuff? Or should most people be fine-tuning? How do you think about that? [00:08:37]Bryan: Sometimes people ask, why are you using GPT-4, and why aren't you going down the avenue of fine-tuning today? I can get into fine-tuning specifically, but I do want to talk a little bit about the good old days of GPT-2. Shout out to Reza. Reza introduced me to GPT-2. I still remember him explaining the difference between general transformers and GPT. I remember one of the tasks that we wanted to solve with transformer-based generative models at Stitch Fix was writing descriptions of clothing. You might think, ooh, that's a multi-modal problem. The answer is, not necessarily. We actually have a lot of features about the clothes that are almost already enough to generate some reasonable text. I remember at that time, that was one of the first applications that we had considered. There was a really great team of NLP scientists at Stitch Fix who worked on a lot of applications like this. I still remember being exposed to the GPT endpoint back in the days of GPT-2. If I'm not mistaken, and feel free to fact-check this, I'm pretty sure Stitch Fix was the first OpenAI customer, like their first true enterprise application. Long story short, I ultimately think that depending on your task, using the most cutting-edge general model has some advantages. If those are advantages that you can reap, then go for it. So at Hex, why GPT-4? Why do we need such a general model for writing code, writing SQL, doing data analysis? Shouldn't a model fine-tuned just on Kaggle notebooks be good enough? I'd argue no. And ultimately, because we don't have one specific sphere of data that we need to write great data analysis workbooks for, we actually want to provide a platform for anyone to do data analysis about their business. To do that, you actually need to entertain an extremely general universe of concepts. So as an example, if you work at Hex and you want to do data analysis, our projects are called Hexes. That's relatively straightforward to teach it.
There's a concept of a notebook. These are data science notebooks, and you want to ask analytics questions about notebooks. Maybe if you trained on notebooks, you could answer those questions. But let's come back to Blue Bottle. If I'm at Blue Bottle and I have data science work to do, I have to ask it questions about coffee. I have to ask it questions about pastries, about doing demand forecasting. And so very quickly, you can see that just by serving those two customers, a model purely fine-tuned on, like, Kaggle competitions may not actually fit the bill. And so the more that you want to build a platform that is sufficiently general for your customer base, the more I think that these large general models really pack a lot of additional opportunity in. [00:11:21]Alessio: With a lot of our companies, we talked about stuff that you used to have to extract features for; now you have it out of the box. So say you're a travel company and you want to do a query like, show me all the hotels and places that are warm during spring break. It would be literally impossible to do before these models, you know? But now the model knows, okay, spring break is usually these dates, and these locations are usually warm. So you get so much out of it for free. And in terms of Magic integrating into Hex, I think AI UX is one of our favorite topics, and how do you actually make that seamless? In traditional code editors, the line of code is kind of the atomic unit, and in Hex, you have the code, but then you have the cell also. [00:12:04]Bryan: I think the first time I saw Copilot and really fell in love with Copilot, I thought, finally, fancy auto-complete. And that felt so good. It felt so elegant. It felt so right-sized for the task. But as a data scientist, a lot of the work that you do previous to the ML engineering part of the house, you're working in these cells, and these cells are atomic. They're expressing one idea. And so ultimately, if you want to make the transition from something like VS Code, where you've got a large amount of code and a large amount of files that kind of need to have awareness of one another; that's a long story, and we can talk about that. But in this atomic, somewhat linear flow through the notebook, what you ultimately want to do is reason with the agent at the level of these individual thoughts, these atomic ideas. Usually it's good practice in, say, a Jupyter notebook to not let your cells get too big. If your cell doesn't fit on one page, that's kind of a code smell: why is it so damn big? What are you doing in this cell? That also lends some hints as to what the UI should feel like. I want to ask questions about this one atomic thing. So you ask the agent: take this data frame and strip out this prefix from all the strings in this column. That's an atomic task. It's probably about two lines of pandas. I can write it, but it's actually very natural to ask Magic to do that for me. And what I promise you is that it is faster to ask Magic to do that for me. At this point, that kind of code, I never write. And so then you ask the next question, which is, what should the UI be to do chains, to do multiple cells that work together? Because ultimately a notebook is a chain of cells, and actually it's a first-class citizen for Hex. So we have a DAG, and the DAG is the execution DAG for the individual cells. This is one of the reasons that Hex is reactive and kind of dynamic in that way.
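Hex's actual reactive engine is much richer, but the core idea Bryan is pointing at, cells as nodes in an execution DAG where editing one cell re-runs only its downstream dependents, fits in a few lines. The cell names here are invented for illustration:

```python
from graphlib import TopologicalSorter

# Toy reactive notebook: each cell lists the cells it reads from.
cells = {
    "load_orders":   set(),               # SQL cell
    "clean_orders":  {"load_orders"},     # pandas cell
    "summary_stats": {"clean_orders"},
    "chart":         {"summary_stats"},
    "commentary":    set(),               # markdown cell, no dependencies
}

def downstream(edited: str) -> set[str]:
    """Everything that must re-run after `edited` changes."""
    dirty = {edited}
    grew = True
    while grew:
        grew = False
        for cell, deps in cells.items():
            if cell not in dirty and deps & dirty:
                dirty.add(cell)
                grew = True
    return dirty

dirty = downstream("clean_orders")        # the user edits one atomic cell
order = [c for c in TopologicalSorter(cells).static_order() if c in dirty]
print(order)   # ['clean_orders', 'summary_stats', 'chart']
```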
And so the very next question is, what is the AI UI for these collections of cells? And back in June and July, we thought really hard about what it feels like to ask Magic a question and get a short chain of cells back that executes on that task. And so we've thought a lot about how that breaks down into individual atomic units and how those are tied together. We introduced something which is kind of an internal name, but it's called the airlock. And the airlock is exactly a sequence of cells that refer to one another, understand one another, use things that are happening in other cells. And it gives you a chance to preview what Magic has generated for you. Then you can accept or reject it as an entire group. And that's one of the reasons we call it an airlock, because at any time you can eject the airlock and see it in the space. But to come back to your question about how the AI UX fits into this notebook: ultimately, a notebook is very conversational in its structure. I've got a series of thoughts that I'm going to express as a series of cells. And sometimes, if I'm a kind data scientist, I'll put some text in between them too, explaining what on earth I'm doing. And that feels, in my opinion, and I think this is quite shared amongst folks at Hex, like a really nice refinement of the chat UI. I've been saying for several months now: please stop building chat UIs. There is some irony, because I think what the notebook allows is like chat plus plus. [00:15:36]Alessio: Yeah, I think the first wave of everything was chat with X. So it was chat with your data, chat with your documents, and all of this. But people want to code, you know, at the end of the day. And I think that goes into the end user. I think most people that use notebooks are software engineers and data scientists. I think the cool thing about these models is that people who are not traditionally technical can do a lot of very advanced things. And that's why people like Code Interpreter and ChatGPT. How do you think about the evolution of that persona? Do you see a lot of non-technical people also now coming to Hex to collaborate with their technical folks? [00:16:13]Bryan: Yeah, I would say there might even be more enthusiasm than we're prepared for. We're obviously very excited to bring what we call the low-floor user into this world and give more people the opportunity to self-serve on their data. We wanted to start by focusing on users who are already familiar with Hex and really make Magic fantastic for them. One of our internal, I would say almost North Stars, our team's charter, is to make Hex feel more magical. That is true for all of our users, but that's easiest to do for users who are already able to use Hex in a great way. What we're hearing from some customers in particular is, I'm excited for some of my less technical stakeholders to get in there and start asking questions. And so that raises a lot of really deep questions. If you immediately enable self-service for data, which has been almost like a joke over the last maybe eight years, what challenges does that bring with it? What risks does that bring with it? And so it has given us the opportunity to think about things like governance, and to think about things like alignment with the data team, and making sure that the data team has clear visibility into what the self-service looks like.
Having been leading a data team, trying to provide answers for stakeholders and hearing that they really want to self-serve, a question that we often found ourselves asking is, what is the easiest way that we can keep them on the rails? What is the easiest way that we can set up the data warehouse and set up our tools such that they can ask and answer their own questions without coming away with false answers? Because that is such a priority for data teams, it becomes an important focus of my team, which is: okay, Magic may be an enabler. And if it is, what do we also have to respect? We recently introduced the data manager, and the data manager is an auxiliary tool on the Hex platform that allows people to write more relevant metadata about their data warehouse, to make sure that Magic has access to the best information. And there are some things coming to further that story around governance and understanding. [00:18:37]Alessio: You know, you mentioned self-serve data. And while it was kind of a joke, the whole rush to the modern data stack was something to behold. Do you think AI is in a similar space, where it's a bit of a gold rush? [00:18:51]Bryan: I have sort of two comments here. One I'll shamelessly steal from a friend, Adam Azzam from Prefect. He says that this is more of like an iron mine than a gold mine, in the sense that there is a lot of work to extract this precious, precious resource. And that's the first one: don't expect to just go down to the stream and do a little panning. There's a lot of work to be done. And frankly, the steps to go from this resource to something valuable are significant. I think people have gotten a little carried away with the old maxim of, don't go pan for gold, sell pickaxes and shovels. It's a much stronger business model. At this point, I feel like I look around and I see more pickaxe salesmen and shovel salesmen than I do prospectors, and that scares me a little bit. It's a metagame where people are starting to think about how they can build tools for people building tools for AI. And that starts to give me a little bit of pause in terms of, how confident are we that we can even extract this resource into something valuable? I got a text message from a VC earlier today, and I won't name the VC or the fund, but the question was: what are some medium or large-size companies that have integrated AI into their platform in a way that you're really impressed by? And I looked at the text message for a few minutes, and I found myself thinking and thinking, and I responded, maybe only Copilot. It's been a couple of hours now, and I don't think I've thought of another one. And I think that's where I reflect again on this iron-versus-gold thing. If it was really gold, I feel like I'd be more blown away by other AI integrations. And I'm not yet. [00:20:40]Alessio: I feel like all the people finding gold are the ones building things that traditionally we didn't focus on. So like Midjourney. I talked to a company yesterday, which I'm not going to name, but they do agents for some use case, let's call it. They are 11 months old. They're making like $8 million a month in revenue, but in a space that you wouldn't even think about selling to. If you were a shovel builder, you wouldn't even go sell to those people. And swyx talks about this a bunch, about actually trying to go application-first for some things.
Let's actually see what people want to use and what works. What do you think are maybe the most underexplored areas in AI? Is there anything that you wish people were actually trying to shovel? [00:21:23]Bryan: I've been saying for a couple of months now, if I had unlimited resources and I was truly on my own, building whatever I wanted, I think the thing that I'd be most excited about is building the personal Memex. The Memex is something that I've wanted since I was a kid. Are you familiar with the Memex? It's the memory extender. And it's this idea that human memory is quite weak, and so if we can extend it, that's a big opportunity. I think one of the things that I've always found to be one of the limiting cases here is access. How do you access that data? Even if you did build that data out, how would you quickly access it? And there's a constellation of technologies that have come together in the last couple of years that now make this quite feasible. First, information retrieval has really improved, and we have a lot more simple systems for getting started with information retrieval. Second, natural language is ultimately the interface that you'd really like these systems to work on, both in terms of structuring and preparing the data, but also on the retrieval side. So what keys off the query for retrieval? Probably, ultimately, natural language. And third, if you really want to go into the purely futuristic aspect of this, it is latent voice-to-text. And that is also something that has quite recently become possible. I did talk to a company recently called Gather, which seems to have some cool ideas in this direction, but I haven't seen yet what I really want, which is something where, every time I listen to a podcast or I watch a movie or I read a book, it has a great vector index built on top of all the information that's contained within. And then when I'm having my next conversation and I can't quite remember the name of this person who did this amazing thing, for example, if we're talking about the Memex, it'd be really nice to have Vannevar Bush pop up on my Memex display, because I always forget Vannevar Bush's name. This is one time that I didn't, but I often do. This is something that I think is only recently enabled, and maybe we're still five years out before it can be good, but I think it's one of the most exciting projects that has become possible in the last three years that generally wasn't possible before. [00:23:46]Alessio: Would you wear one of those AI pendants that record everything? [00:23:50]Bryan: I think I'm just going to do it, because I just support the idea. I'm also admittedly someone who, when Google Glass first came out, thought, that seems awesome. I know that there are a lot of challenges about the privacy aspect of it, but it is something that I did feel was a disappointment to lose some of that technology. Fun fact: one of the early Google Glass developers was this MIT computer scientist who basically built the first wearable computer while he was at MIT. And he took notes about all of his conversations in real time on his wearable, and then he would have real-time access to them. It ended up being kind of a scandal, because he wanted to use a computer during his defense and they tried to prevent him from doing it.
So, pretty interesting story. [00:24:35]Alessio: I don't know, but the future is going to be weird. I can tell you that much. Talking about pickaxes, what do you think about the pickaxes that people built before, like the whole MLOps space, which has its own startup graveyard in there? How are those products evolving? You know, you were at Weights and Biases before, which is now doing a big AI push as well. [00:24:57]Bryan: If you really want to rub my face in it, you can go look at my white paper on MLOps from 2022. It's interesting. I don't think there are many things in it that I would these days think are wrong, or even naive. But what I would say is: there are a lot of analogies between MLOps and LLMOps, but there are also a lot of key differences. Leading an engineering team at the moment, I think a lot more about good engineering practices than I do about good ML practices. That being said, it's been very convenient to be able to see around corners in a few of the ML places. One of the first things I did at Hex was work on evals. This was in February. I hadn't yet been overwhelmed by people talking about evals until about May. And the reason that I was able to be a couple of months early on that is because I've been building evals for ML systems for years. I don't know how else to build an ML system other than starting with the evals. I teach my students at Rutgers that objective framing is one of the most important steps in starting a new data science project. If you can't clearly state what your objective function is, and you can't clearly state how that relates to the problem framing, you've got no hope. And I think that is a very shared reality with LLM applications. Coming back to one thing you mentioned earlier about the applications of these LLMs: to that end, I think the pickaxes that are still very valuable are the ones for understanding systems that are inherently less predictable, that are inherently experimental. On my engineering team, we have an experimentalist. One of the AI engineers, his focus is experiments. That's something that you wouldn't normally expect to see on an engineering team, but it's important on an AI engineering team to have one person whose entire focus is just experimenting: okay, this is a hypothesis we have about how the model will behave, or this is a hypothesis we have about how we can improve the model's performance on this; and then going in, running experiments, augmenting our evals to test it, et cetera. What I really respect are pickaxes that recognize the hybrid nature of these engineering tasks. They are ultimately engineering tasks with a flavor of ML, and so when systems respect that, I tend to have a very high opinion. One thing that I was very aligned with Weights and Biases on is composability. These ML systems need to be extremely composable to make them much more iterative. If you don't build these systems in composable ways, then your integration hell is just magnified. When you're trying to iterate as fast as people need to be iterating these days, I think integration hell is a tax not worth paying. [00:27:51]Alessio: Let's talk about some of the LLM-native pickaxes, so to speak. So RAG is one. One thing is doing RAG on text data; another is doing RAG on tabular data. We're releasing tomorrow our episode with Cube, the semantic layer company. Curious to hear your thoughts on it.
How are you doing RAG, pros, cons? [00:28:11]Bryan: It became pretty obvious to me almost immediately that RAG was going to be important, because ultimately you never expect your model to have access to all of the things necessary to respond to a user's request. So as an example, Magic users would like to write SQL that's relevant to their business, and it's important then to have the right data objects that they need to query. We can't expect any LLM to understand our users' data warehouse topology. What we can expect is that we can build a RAG system that is data-warehouse aware, data-topology aware, and use that to provide really great information to the model. If you ask the model, "How are my customers trending over time?" and you ask it to write SQL to do that, what is it going to do? Ultimately, it's going to hallucinate the structure of the data warehouse it needs to write a generic query. Most likely, it's going to look in its memory of Stack Overflow responses to customer queries and say: oh, there's probably a customers table — and we're in the age of dbt, so it might even be called dim_customers or something like that. And what's interesting is — and I encourage you to try it — ChatGPT will do an okay job of hallucinating up some tables. It might even hallucinate up some columns. But what it won't do is understand the joins in that data warehouse that it needs, and it won't understand the data caveats or the WHERE clauses that need to be there. So how do you get it to understand those things? This is textbook RAG. This is the exact kind of thing that you expect RAG to be good at augmenting. But people who have done a lot of thinking about RAG for the document case think of it as chunking and MapReduce and those sorts of approaches, and I think people haven't followed this train of thought quite far enough yet. Jerry Liu was on the show and talked a little bit about thinking of this as information retrieval. I would push that even further, and I would say that ultimately RAG is just RecSys for LLMs. As I already mentioned, I'm a little bit recommendation-systems heavy, and so from the beginning, RAG has always felt like RecSys to me. It has always felt like you're building a recommendation system. And what are you trying to recommend? The best possible resources for the LLM to execute on a task. So most of my approach to RAG, and the way that we've improved Magic via retrieval, is by building a recommendation system. [00:30:49]Alessio: It's funny — you mentioned that you spent three years writing the O'Reilly book. Things must have changed as you wrote it. I don't want to bring out any nightmares from there, but what are the tips for people who want to stay on top of this stuff? Do you have any favorite newsletters, Twitter accounts you follow, communities you spend time in? [00:31:10]Bryan: I am sort of an aggressive reader of technical books. I think I'm almost never disappointed by time that I've invested in reading technical manuscripts. I find that most people write O'Reilly or similar books because they've got this itch that they need to scratch: I have some ideas, I have some understanding that was hard won, I need to tell other people. And in my experience, there's something that correlates between that itch and useful information.
As an example, one of the people on my team, Will Kurt, wrote a book called Bayesian Statistics the Fun Way. I knew some Bayesian statistics, but I read his book anyway. The reason was, I figured if someone feels motivated to write a book called Bayesian Statistics the Fun Way, they've got something to say about Bayesian statistics. I learned so much from that book. That book is technically targeted at someone with less knowledge and experience than me, and boy, did it humble me about my understanding of Bayesian statistics. So I think this is a very boring answer, but ultimately I read a lot of books, and I think they're a really valuable way to learn these things. I also, regrettably, still read a lot of Twitter. There is plenty of noise in that signal, but ultimately it is still usually one of the first places to get an instinct for what's valuable. The other comment I want to make is that we are in this age where arXiv is becoming more of an ad platform. I think that makes it a little challenging right now to use it the way that I used to, which is for higher signal. I've chatted a lot with a CMU professor, Graham Neubig, and he's been doing LLM evaluation and LLM enhancements for about five years — and no, I didn't misspeak. Talking to him has provided me a lot of directionality toward more believable sources; trying to cut through the hype. I know there are a lot of other channels I could mention, but ultimately right now I think there's almost an abundance of channels, and I'm a little bit more keen on high signal. [00:33:18]Alessio: The other side of it is, I see so many people say, "Oh, I just wrote a paper on X," and it's an article — and I'm like, an article is not a paper. But it's just funny; we were chatting before about terms being reinvented, and people that are not from this space getting into AI engineering now. [00:33:36]Bryan: I also don't want to be gatekeepy. Actually, I used to say a lot to people: don't be shy about putting your ideas down on paper. I think it's okay to just go for it. I myself have something on arXiv that is comically naive — intentionally naive. Right now I'm less concerned by more naive approaches to things than I am by the purely advertising approach to writing these short notes and articles. I think blogging still has a good place. And I remember getting feedback during my PhD that my thesis sounded more like a long blog post. I now feel like that curmudgeonly professor who's like: yeah, maybe just keep this to the blogs. That's funny. Alessio: Yeah, I think one of the things that Swyx said when he was opening the AI Engineer Summit a couple of weeks ago was: look, most people here don't know much about the space because it's so new, and being open and welcoming is one of the goals. That's why we try and keep every episode at a level where the experts can understand and learn something, but the novices can also follow along. You mentioned evals before. I think that's one of the hottest topics out there right now. What are evals? How do we know if they work? What are some of the fun learnings from building them into Hex?
[00:34:53]Bryan: I said something at the AI Engineer Summit that I think a few people have already called out, which is: if you can't get your evals to be objective, then you're not trying hard enough. I stand by that statement; I'm not going to walk it back. I know that doesn't feel super good, because people want to think that their unique snowflake of a problem is too nuanced. But I think this is actually one area where — in this dichotomy of who can do AI engineering, and the answer is kind of everybody: software engineers can become AI engineers and ML engineers can become AI engineers — the more data-science-minded folks have an advantage: we've gotten more practice in taking very vague notions and putting an objective function around them. So ultimately I would just encourage everybody who wants to build evals: work incredibly hard on codifying what is good and bad in terms of these objective metrics. As far as how you go about turning those into evals, I think it's kind of sweat equity. I told the CEO of Gantry several months ago — I think it's been about six months now — that I was looking at every single internal Hex request to Magic by hand, with my eyes, and thinking: how can I turn this into an eval? Is there a way I can take this real request, during this dogfooding, not-very-developed stage, and make it into an evaluation? That was a lot of sweat equity put in over a lot of boring evenings, but I do think it ultimately gave me a lot of understanding of the ways the model was misbehaving. Another thing is: how can you start to capture these misbehaviors as auxiliary evaluation metrics? There's not just one evaluation that you want to do for every request. It's easy to ask: did this work, did this not work, did the response satisfy the task? But there are a lot of other metrics that you can pull off these questions. Let me give you an example. If it writes SQL that doesn't reference a table in the database it's supposed to be querying against, we would think of that as a hallucination. You could separately consider "is it a hallucination" as a valuable metric, and separately consider "does it get the right answer." The right answer is the all-in-one-shot evaluation that I think people jump to, but these intermediary steps are really important. I remember hearing that GitHub had thousands of lines of post-processing code around Copilot to make sure that its responses were correct or in the right place. That kind of defensive programming against bad responses is the kind of thing you can build by looking at many different types of evaluation metrics. Because you can say: oh, the Copilot completion here is mostly right, but it doesn't close the brace — well, that's a thing you can check for. Or: this completion is quite good, but it defines a variable that was already defined in the file — that's going to be a problem, and that's an evaluation you can check separately. And so this is where I think it's easy to convince yourself that all that matters is "does it get the right answer," but the more you think about production use cases of these things, the more you find this kind of stuff.
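To ground the auxiliary-metrics idea, here is a minimal sketch of what such checks could look like for generated SQL. This is not Hex's internal harness — the `run_sql` helper, the request fields, and the regex-based table extraction are all illustrative assumptions:

```python
import re
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    answer_correct: bool                      # the all-in-one metric people jump to
    hallucinated_tables: list = field(default_factory=list)  # auxiliary metric 1
    redefines_variable: bool = False          # auxiliary metric 2

def referenced_tables(sql: str) -> set:
    """Crude FROM/JOIN scan; a real harness would use a proper SQL parser."""
    return set(re.findall(r"(?:from|join)\s+([\w.]+)", sql, flags=re.IGNORECASE))

def evaluate(sql: str, output_name: str, known_tables: set,
             names_in_scope: set, expected_rows, run_sql) -> EvalResult:
    hallucinated = sorted(referenced_tables(sql) - known_tables)
    # Only execute the query when every referenced table actually exists.
    rows = run_sql(sql) if not hallucinated else None
    return EvalResult(
        answer_correct=(rows == expected_rows),
        hallucinated_tables=hallucinated,
        redefines_variable=(output_name in names_in_scope),
    )
```

Each field is a separately trackable metric, so a regression in hallucination rate stays visible even while end-to-end correctness holds steady.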
One simple example: sometimes the model names the output of a cell a variable that's already in scope. Okay — we can just detect that, and we can just fix that. As you build these evaluations over time, you really can expand the robustness with which you trust these models. And for a company like Hex, we need to put this stuff in GA; we can't just get to demo stage, or even private beta stage. We're really hunting GA on all of these capabilities. "Did it get the right answer in some cases" is not good enough. [00:38:57]Alessio: I think the follow-up question to that is: in your past roles, you owned the model that you were evaluating against. Here, you don't actually have control over how the model evolves. How do you weigh "the model will just need to improve, or we'll use another model" versus "we can build engineering post-processing on top of it"? How do you make the choice? [00:39:19]Bryan: So I want to say two things here. One — Jerry Liu talked a little bit about this in his episode — you don't always want to retrain the weights to serve certain use cases; RAG is another tool you can use to kind of soft-tune. I think that's right. And I want to go back to my favorite analogy here, which is recommendation systems. When you build a recommendation system, you build the objective function. You think about what kind of recs you want to provide, what kind of features you're allowed to use, et cetera. But there's always another step. There's this really wonderful collection of blog posts from Eugene Yan, and then Even Oldridge iterated on that for the Merlin project, where there's this multi-stage recommender. The multi-stage recommender says: the first step is to do great retrieval. Once you've done great retrieval, you then need to do great ranking. Once you've done great ranking, you need to do a good job serving. So what's the analogy here? RAG is retrieval. You can build different embedding models to encode different features in your latent space to ensure that your ranking model has the best opportunity. Now you might say: my ranking model is something I've got a lot of capability to adjust; I've got full access to my ranking model; I'm going to retrain it. And that's great, and you should, and over time you will. But there's one more step downstream, and that's serving. Serving often sounds like "I just show the s**t to the user," but ultimately serving is things like: did I provide diverse recommendations? Going back to Stitch Fix days, I can't just recommend someone five shirts of the same silhouette and cut; I need to serve them a diversity of recommendations. Have I respected their requirements? They clicked on something that got them to this place — are the recommendations relevant to that query? Are there any hard rules? Do we maybe not have this in stock? These are all things you put downstream. And so, much like the recommendations use case, there are a lot of knobs to pull outside of retraining the model. Even in recommendation systems, when do you retrain your model for ranking? Not nearly as much as you do other s**t. And even the embedding model you might fiddle with more often than the true ranking model.
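A minimal sketch of that retrieve, rank, serve pipeline applied to RAG over warehouse schemas — the `Candidate` shape, the vector-index interface, and the scoring function are hypothetical stand-ins, not Hex's actual stack:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    table: str      # which warehouse table this schema chunk describes
    n_tokens: int   # prompt-budget cost of including it
    text: str       # the schema/docs snippet handed to the LLM

def retrieve(query_vec, index, k=100):
    """Stage 1: cheap, high-recall candidate generation (RAG's 'R')."""
    return index.search(query_vec, k)   # assumed vector-index interface

def rank(query, candidates, score):
    """Stage 2: more expensive, higher-precision scoring of the shortlist."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)

def serve(ranked, max_ctx_tokens=4000):
    """Stage 3: 'serving' rules — diversity, hard constraints, token budget."""
    chosen, seen_tables, used = [], set(), 0
    for c in ranked:
        if c.table in seen_tables:              # diversity: one chunk per table
            continue
        if used + c.n_tokens > max_ctx_tokens:  # hard rule: fit the context window
            break
        chosen.append(c)
        seen_tables.add(c.table)
        used += c.n_tokens
    return chosen
```

Retraining happens rarely (the ranker), while the retrieval and serving stages absorb most of the iteration — which is the point of the analogy.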
And so I think the only piece of the puzzle that you don't have access to in the LLM case is that middle step. That's okay — we've got plenty of other work to do. So right now I feel pretty enabled. [00:41:56]Alessio: That's great. You obviously wrote a book on RecSys. What are some of the key concepts that people who don't have a data science or ML background should keep in mind as they work in this area? [00:42:07]Bryan: It's easy to first think: these models are stochastic, they're unpredictable — oh well, what are we going to do? I think of this as almost a gas-like question: if you've got this entropy, where can you put it? Where can you let it be entropic, and where can you constrain it? So what I want to say here is: think about the cases where you need it to be really tightly constrained. Why are people so excited about function calling? Because function calling feels like a way to constrict it. Where can you let it be more gaseous? Maybe in the way it talks about what it wants to do. Maybe for planning — if you're building agents and you want to do something chain-of-thought-y, that's a place where the entropy can happily live. When you're building applications of these models, I think it's really important, as part of the problem framing, to be super clear upfront: these are the things that can be entropic, and these are the things that cannot be — the things that need to be super rigid and really, really aligned to a particular schema. We've had a lot of success in making specific the parts that need to be precise and tightly schemified, and that has really paid dividends. Another analogy from data science that I think is very valuable is human-in-the-loop, which has been around for quite a while. I have gone on record a couple of times saying that I don't really love human-in-the-loop. One of the things we can learn from human-in-the-loop is that the user is the best judge of what is good, and the user is pretty motivated to interact and give you additional nudges in the direction that you want. What I'd like to flip, though, is: instead of human-in-the-loop, I'd like it to be AI-in-the-loop. I'd rather center the user — keep the user as the core item at the center of this universe, with the AI as a tool. By switching that analogy a little bit, it allows you to think about where the places are in which the user can reach for this as a tool, execute some task with it, and then go back to their workflow. It still gets this back-and-forth between things that computers are good at and things that humans are good at, which has been valuable in the human-in-the-loop paradigm. But it allows us to be a little bit more, as the designers say, user-centered, and I think that's really powerful for AI applications. It's one of the things I've been trying really hard to do with Magic: make the AI feel like it's right there in the workflow. It's right where you're doing your work, it's ready for you anytime you need it, but ultimately you're in charge at all times, and your workflow is what we care the most about. [00:44:56]Alessio: Awesome. Let's jump into the lightning round. What's something that is not on your LinkedIn that you're passionate about — something you would give a TED talk on that is not work related?
[00:45:05]Bryan: So I walk a lot. [00:45:07]Bryan: I have walked every road in Berkeley. And I mean every part of every road, even — not just the binary question of "have you been on this road?" I have this little app I use called Wanderer, which lets me keep track of everywhere I've been. And so I'm a little bit obsessed — my wife would say a lot a bit obsessed — with what I call new roads. I'm actually more motivated by trails, even, than roads, but I'm a maximalist, so kind of everything and anything. Believe it or not, I was even in the local Berkeley paper just talking about walking every road. So yeah, that's something that I'm surprisingly passionate about. [00:45:45]Alessio: Is there a most underrated road in Berkeley? [00:45:49]Bryan: What I would say is underrated is Kensington. Kensington is a little town just a teeny bit north of Berkeley, but still in the Berkeley hills, and it is so quirky and beautiful. Don't sleep on Kensington. That being said, one of my original motivations for doing all this walking was that people always tell me Berkeley's so quirky, and I was like: how quirky is Berkeley? Turns out, it's quite, quite quirky. It's also hard to say "quirky" and "Berkeley" in the same sentence, I've learned as of now. [00:46:20]Alessio: That's a good podcast warmup for our next guests. All right, the actual lightning round. So we usually have three questions: acceleration, exploration, then a takeaway. Acceleration: what's something that's already here today that you thought would take much longer to arrive in AI and machine learning? [00:46:39]Bryan: So I invited the CEO of Hugging Face to my seminar when I worked at Stitch Fix, and his talk at the time, honestly, really annoyed me. The talk was titled something to the effect of "LLMs are going to be the technology advancement of the next decade." It's on YouTube; you can find it. I don't remember the exact title, but regardless, it was something like "LLMs for the next decade." And I was like, okay, they're one modality of model, whatever. His talk was fine — I don't think it was particularly amazing or particularly poor — but what I will say is: damn, he was right. I don't think I was quite on board during that talk; I was like, ah, maybe — there are a lot of other modalities that are moving pretty quick. I thought things like RL were going to be the real breakout success — and there's a little pun with Atari and Breakout there — but man, I was sleeping on LLMs, and I feel a little embarrassed. [00:47:44]Alessio: Yeah, no, that's a good point. We just had Jeremy Howard on the podcast, and he was saying that when he was talking about fine-tuning, everybody thought it was dumb, and then later people realized. There's something to be said about messaging, especially in technical audiences, where there's kind of a metagame: oh, these are the cool ideas people are exploring; I don't know where I want to align myself yet. So — exploration, which is kind of the opposite of that. You mentioned RL, right? That's something that was up and up and up, and now people are like, oh, I don't know.
Are there any other areas, if you weren't working on Magic, that you'd want to go work on? [00:48:25]Bryan: Well, I did mention that I think this Memex product is incredibly exciting to me, and I think it's a real opportunity — very, very feasible. But I would maybe even extend that a little bit: I don't see enough people getting really enthusiastic about hardware with advanced AI built in. You're hearing whisperings of it here and there — pun on Whisper intended — you're starting to see people putting Whisper into pieces of hardware and making that really powerful. I joked with — I can't think of her name — oh, Sasha, who I know is a friend of the pod. I joked with Sasha that I wanted to make the Big Mouth Billy Bass into a Babel fish, because at this point it's pretty easy to connect that up to Whisper, talk to it in one language, and have it talk back in the other language. And I was like, this is the kind of s**t I want people building: silly integrations between hardware and these new capabilities. As much as I'm starting to hear whisperings here and there, it's not enough. I want to see more people going down this track, because I think ultimately these things need to be in our physical space. And even though the margins are good on software, I want to see more integration into my daily life. Awesome. [00:49:47]Alessio: And then, yeah, a takeaway: what's one message or idea you want everyone to remember and think about? [00:49:54]Bryan: Even though earlier I was talking about maybe not reinventing things, and being respectful of the ML and data science ideas, I do want to say that I think everybody should be experimenting with these tools as much as they possibly can. I've heard a lot of professors, frankly, express concern about their students using GPT to do their homework. And I took the completely opposite approach: in the first 15 minutes of the first class of my semester this year, I brought up GPT on screen and we talked about what GPT was good at, and about how the students can use it. I showed them an example of it doing data analysis work quite well, and then I showed them an example of it doing quite poorly. However much you're integrating with these tools or interacting with them — and this audience is probably pretty high on that distribution — I would really encourage you to push this onto the other people in your life. My wife is very technical; she's a product manager, and she's using ChatGPT almost every day for communication, or for understanding concepts that are outside her sphere of expertise. And recently my mom and my sister have been onboarded onto the ChatGPT train. Ultimately, I think it is our duty to help other people see how much of a paradigm shift this is. We should really be preparing people for what life is going to be like when these are everywhere. [00:51:25]Alessio: Awesome. Thank you so much for coming on, Bryan. This was fun. [00:51:29]Bryan: Yeah, thanks for having me. And use Hex Magic. [00:51:31]
-
The State of Silicon and the GPU Poors - with Dylan Patel of SemiAnalysis
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2023-11-17 19:32
This episode came together at ~4 hours' notice since Dylan had just landed in SF and we had to set up quickly; you might notice some small audio issues in some segments, we apologize. We're currently building our own podcast studio for 2024! 🙏 We're ramping up our presence on Twitter and YouTube if you'd like to support us.Note: 17k people joined our emergency pod on Sam Altman's ouster today.If Charles Dickens were alive in 2024, A Tale of Two Cities might be the divide between the "GPU poor" and the "GPU rich".We mentioned these terms in some of our previous episodes; they were originally coined by Dylan Patel of SemiAnalysis in his "Gemini Eats the World" post, put on blast by Sam Altman. SemiAnalysis is one of the most in-depth research and consulting firms in the semis world, with unique insight into the design, production, and supply chain of GPUs based on their ground presence in Asia. In this episode we break down the State of Silicon: when are more GPUs coming? Are there real GPU alternatives on the way? Should Microsoft buy AMD chips just to scare Jensen? Is there a "GPU poor is beautiful" manifesto?The supply wave is comingThe GPU shortage is the talk of the town in the Bay Area, but next year looks a lot better in terms of AI accelerator capacity: * NVIDIA is forecasted to sell over 3 million GPUs next year, about 3x their 2023 sales of about 1 million H100s.* AMD is forecasting $2B of sales for their new MI300X datacenter GPU. They are also indirectly getting a boost from the work that companies like Modular and tiny corp are doing in making it easier to actually use these chips (will ROCm ever catch up?)* Google's TPUv5 supply is going to increase rapidly going into 2024* Microsoft just announced Maia 100, a new AI accelerator built "with feedback" from OpenAI. In the episode we dove deeper into what this means for each of these companies and the GPU consumers, but the TLDR (sadly) is that capacity increases, yet the FLOPs required to train the next generation of models will eclipse those of previous generations. GPT-3 took 4,000x more FLOPs to train than GPT-2. Dylan estimates GPT-4 was trained on 20,000 A100s for ~$500M all-in; how much will OpenAI spend to train GPT-5? How many GPUs will need to go brrr? In the meantime, the number of companies looking for GPUs has increased, with Meta rising as one of the de-facto top 3 AI labs in terms of capacity. The pressure to acquire more chips will not ease in 2024. We also talked about some of the companies trying to displace traditional GPU architectures: MatX, Lemurian Labs, Cerebras, etc. The different variables they are fighting on are size of SRAM vs HBM, focusing on memory bandwidth vs memory size, different math representations for kernels, etc., and how the key to this market is whether or not the transformer architecture will still be #1 in the future.Surviving in the GPU Poor laneA lot of the smaller companies (when compared to $1T+ giants, it's all relative) are trying hard to fight against the GPU rich, but they can't quite offer the same scale: * HuggingFace is trying to launch a training cluster as a service, but it seems to just be a software wrapper around NVIDIA's DGX Cloud, as they don't actually own that much GPU supply. The max option in their request form is 1,000 GPUs.* Databricks' "GPU-enabled clusters" run on AWS, and the largest one listed there is only powered by 8 NVIDIA A10Gs.
The Mosaic team is also doing research on running on AMD cards with some promising results, but they seem to be pushing up to just 128 cards, which isn't much.* Together actually has 4,424 H100s live in production, which is quite sizable but still nothing compared to the 100,000 that Meta is putting online. Take LLaMA2 as an example: the 70B model was trained on 2T tokens. Using the highest accelerator count available on HuggingFace, it'd take ~43 days to train the model from scratch and it'd cost ~$2M — and that doesn't include all the data and prep work. In the meantime, Zuck is probably burning tens of thousands of H100s to train LLaMA3, which will surely have much higher performance than whatever a GPU-poor company can train in the same time span. The good news is that there's a ton of opportunity for the GPU poors to shine, especially around fine-tuning. Most of the open source models coming out are one-size-fits-all, and there's a ton of opportunity for startups to take them and tailor them to their customers, or to specific tasks and use cases, to build vertical applications. The other area of improvement is data quality; Mistral showed how you can build a high-quality small model with fewer FLOPs by feeding it better data. The key to differentiation won't be GPUs, but tokens. Show Notes* SemiAnalysis* Google Gemini Eats The World – Gemini Smashes GPT-4 By 5X, The GPU-Poors* How Nvidia's CUDA Monopoly In Machine Learning Is Breaking - OpenAI Triton And PyTorch 2.0* AMD MI300 – Taming The Hype – AI Performance, Volume Ramp, Customers, Cost, IO, Networking, Software* @sama: incredible google got that semianalysis guy to publish their internal marketing/recruiting chart lol* Mellanox* MatX* Lemurian Labs* Cerebras* For SRAM / HBM, see our FlashAttention episode* Suggested readings:* Moore's Law: The Life of Gordon Moore, Silicon Valley's Quiet Revolutionary* Chip War by Chris MillerChapters* Introduction [00:00:00]* Importance of infrastructure for tech companies [00:01:11]* Training costs are irrelevant [00:03:06]* Worldview of GPU-poor vs GPU-rich [00:04:01]* Google's TPU infrastructure [00:08:12]* Alternative hardware like Cerebras and Graphcore [00:17:37]* Partnerships between labs and hardware companies [00:37:15]* Apple's potential in AI [00:40:56]* Concerns over China and Taiwan [00:41:02]* Feasibility of rebuilding the semiconductor supply chain in the US [00:43:22]* Foundational semiconductor readings [00:46:09]* NVIDIA's pivot to AI [00:47:40]* Dylan's writing process [00:48:17]* Using multiple data centers for distributed AI training [00:52:36]TranscriptAlessio: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners. I'm joined by my co-host Swyx, founder of Smol AI. [00:00:16]Swyx: And today we have Dylan Patel, and welcome. So you are the author of the extremely popular SemiAnalysis blog. We have both had a little bit of claim to fame in breaking details of GPT-4: George Hotz came on our pod and talked about the mixture-of-experts thing, and then you had a lot more detail. [00:00:29]Dylan: To be clear, I talked about mixture of experts in January; it's just people didn't really notice it. Yeah, I guess. [00:00:35]Swyx: I don't know. You went into a lot more detail, and I'd love to dig into some of that. [00:00:38]Dylan: Yeah, thank you so much. I've been doing consulting in the semiconductor industry since '17.
In 2021 I got bored, and in November I started writing a blog; then in 2022 it was going well and I started hiring folks for my firm. And then all of a sudden 2023 happens, and it's the perfect intersection. I used to do data science — but not AI; multivariable regression is not AI, right? But I've also been involved in the semiconductor industry for a long, long time, posting about it online since I was 12. All of a sudden this all kind of came to fruition, so it's cool to have the blog blow up in that way. [00:01:11]Swyx: I used to cover semis at Balyasny as well. For a long time, it was just the mobile cycle, and then a little bit of PCs, but not that much, and then maybe some cloud stuff — public cloud semiconductor stuff. But it really wasn't anything until this wave. And I was actually listening to you on one of the previous podcasts you've done, and it was surprising that high-performance computing also kind of didn't really take off. Like, AI is just the first form of high-performance computing that worked. [00:01:37]Dylan: One of the theses I've had for a long time — that I think people haven't really caught on to, but it's coming to fruition now — is that for the largest tech companies in the world, their software is important, but actually having and operating a very efficient infrastructure is incredibly important. People talk about: hey, Amazon is great, AWS is great, because yes, it is easy to use and they've built all these things. But behind the scenes, they've done a lot on the infrastructure that is super custom, that Microsoft Azure and Google Cloud just don't match in terms of efficiency. If you think about the cost to rent out SSD space, the cost to offer a database service on top of that, or the cost to rent out a certain level of CPU performance, Amazon has a massive advantage there. And likewise, Google spent all this time doing that in AI, with their TPUs and infrastructure and optical switches and all this sort of stuff. In the past, it wasn't immediately obvious, but with AI, especially with how scaling laws are going, infrastructure is so much more important. And when you think about software cost structure, there was always a bigger component of R&D — all these SaaS businesses all over SF did crazy well because as they grow, all of a sudden they're so freaking profitable on each incremental new customer. AI software looks like it's going to be very different, in my opinion: the R&D cost is much lower in terms of people, but the cost of goods sold — actually operating the service — I think will be much higher. And so, in that same sense, infrastructure matters a ton. [00:03:02]Swyx: And I think you wrote once that training costs effectively don't matter. [00:03:06]Dylan: Yeah. In my opinion — I think that's a little bit spicy, but yeah — training costs are irrelevant, right? GPT-4: 20,000 A100s. I know it sounds like a lot of money. The supercomputer is slightly more, but yeah, I think the 500 million is a fair enough number.
I mean, if you think about just the pre-training: three months, 20,000 A100s at, you know, a dollar an hour — that is way less than 500 million, right? But of course there's data and all this sort of stuff. [00:03:33]Alessio: So people that are watching this on YouTube can see a GPU-poor and a GPU-rich hat on the table, which is inspired by your "Google Gemini Eats the World" blog post. So, did you know that this thing was going to blow up so much? Sam Altman even tweeted about it: "incredible google got that semianalysis guy to publish their internal marketing/recruiting chart." Tell people: who are the GPU-poors, who are the GPU-rich? What's this framework they should think about? [00:04:01]Dylan: So some of this work we've been doing for a while is just on infrastructure, and I think it's a sort of competitive advantage of our firm — myself and my colleagues — that we go from software all the way through to low-level manufacturing. So when something happens, it's: oh, Google's actually ramping up TPU production massively, right? I think people in AI would be like, well, duh — but okay, who has the capability of figuring out the number? One, you could just get Google to tell you, but they won't; that's a very closely guarded secret, and most people who work at Google DeepMind don't even know that number. Two, you go through the supply chain and see what they've placed in orders. Three is sort of: well, who's actually winning from this? Oh, Celestica's building these boxes — wow, interesting; oh, Google's involved in testing for them; oh, this company's providing design IP to them. Okay. That's very valuable in a monetary sense, but you have to understand the whole technology stack. And on the flip side: well, why is Google building all these? What could they do with it? What does that mean for the world? Especially in SF — I'm sure you folks have been to parties where people just brag about how many GPUs they have. It's happened to me multiple times that I'm witnessing a conversation where somebody from Meta is bragging about how many GPUs they have versus someone from another firm, or a startup person's like: dude, can you believe it — we have 512 H100s coming online in August. And it's like, oh, cool — but going through the supply chain, it's like: dude, you realize there were 400,000 H100s manufactured last quarter and like 530,000 being sold this quarter? That's a lot of GPUs. But then, how does that compare to Google? There's one way to look at the world, which is just: scale is all you need. Obviously data matters, obviously all this stuff matters, but on any given data set, a larger model will just do better. It's going to be more expensive, but it's going to do better. Okay, there are all these GPUs going into production: NVIDIA is going to sell well over 3 million total GPUs next year, over a million H100s this year alone. There's an incredible amount of GPU capacity coming online. And well, what are people doing? What are people working on? I think it's very important to just think about what people are working on, right?
What actually are you building that's going to advance things? What is monetizable? And what also makes sense? A lot of people were doing things that I felt were counterproductive. In a world where, in less than a year, there are going to be more than 4 million high-end GPUs out there — and we can talk about the concentration of those GPUs — if you're doing really valuable work, contributing in some way, should you be focused on "well, I don't have access to any of those 4 million GPUs; I actually only have access to gaming GPUs; should I focus on being able to fine-tune a model on that?" No, it's not really that important. Or should I be focused on batch-one inference on a cloud GPU? No, that's pointless — why would you do batch-size-one inference on an H100? That's just ridiculously dumb. There's a lot of counterproductive work, and at the same time, there's a lot that people should be doing. Obviously, most people don't have resources, right? And I love the open source and I want the open source to win, and I hate the people who say, "no, we're X lab and we think this is the only way you should do it, and if people don't do it this way, they should be regulated against it," and all this kind of stuff. So I want the open source to win — companies like Mistral, what Meta is doing, Mosaic, Together, all these folks doing huge stuff with open source — I want them to succeed. But there are certain things like hyper-focusing on leaderboards on Hugging Face — no: TruthfulQA is a garbage benchmark, and some of the models that are very high on there, if you use them for five seconds, you're like, this is garbage. Those were things I wanted to say. Also, we're in a world where compute matters a lot, and Google is going to have more compute than any other company in the world, period, by a large, large factor. So frame it in that mindset: what are the counterproductive things, and what should people focus on? And the pace of acceleration is increasing: GPT-2 to GPT-4 — call it 2020 to 2022 — is less, I think, than the jump from GPT-4 in 2022, which is when it was trained, to what OpenAI and Google and Anthropic do in 2025. It's just good to think about that sort of stuff. [00:08:12]Alessio: That makes sense. And the chart that Sam mentioned is about Google's TPU v5 compute completely overtaking OpenAI by orders of magnitude. Let's talk about the TPU a bit. We had Chris Lattner on the show — he used to work on TensorFlow at Google — and he did mention that the goal of Google is to make TPUs go fast with TensorFlow, but then he also had a post about PyTorch stealing the thunder. How do you see that changing, now that a lot of the compute will be TPU-based and Google wants to offer some of it to the public, not just use it internally? [00:08:44]Dylan: I think, you know, obviously with JAX and XLA and all that kind of stuff externally, they've done a really good job. I wouldn't say TPUs through PyTorch XLA are amazing, but it's not bad, right?
Some of the numbers they've shown, some of the code they've shown, are for TPU v5e — which is not the TPU v5 that the GPU-poor post is referring to. TPU v5e is the new one, but it's mostly an inference chip, a small chip, about half the size of a TPU v5. On that chip, you can get very good Llama 70B inference performance — very, very good — when you're using PyTorch and XLA. Of course you're going to get better if you go JAX/XLA, but I think Google is doing a really good job after the restructuring of focusing on external customers too. They probably won't focus too much on TPU v5 for everyone externally, but v5e — they're also building a million of those, right? A lot of companies are using them, or will be, because it's going to be an incredibly cheap form of compute. The world of frameworks and all that is obviously something a researcher should talk about, not myself, but the stats are clear that PyTorch is way, way dominating everything. JAX is doing well, though — there are external users of JAX. It shouldn't be that the person writing PyTorch-level code, that high up the stack, should also be writing custom CUDA kernels. There should be different layers of abstraction where people hyper-optimize and make it much easier for everyone to innovate on separate parts of the stack — and then every once in a while, someone comes through and pierces through the layers of abstraction and innovates across multiple of them, or a group of people does. But I think frameworks are important; compilers are important. What Chris Lattner is doing is really cool. I don't know if it'll work, but it's super cool, and it certainly works on CPUs; we'll see about accelerators. Likewise, there's OpenAI's Triton and what they're trying to do there — everyone's really coalescing around Triton, including third-party hardware vendors. There's Pallas — I don't want to mischaracterize it, but you can write in it and it lowers your code to run on TPUs and GPUs, kind of like Triton, and I believe there's a Triton backend; I don't know exactly everything about it. I think there's a lot of innovation happening on making things go faster. How do you make it go brrr? If every single person working in ML always had to write custom CUDA kernels, it would be a travesty — that would just slow down productivity — but at the same time, you kind of have to. [00:10:53]Swyx: By the way, I like to quantify things. When you say "make things go brrr" — is there a target range of MFU that you typically talk about? [00:10:59]Dylan: Yeah, there are sort of two metrics that I like to think about a lot. In training, everyone just talks about MFU. But on inference — which I think will be bigger than training, LLM or multimodal inference, probably next year, at least in terms of GPUs deployed — the other thing to ask is: what's the bottleneck when you're running these models? The simple, stupid way to look at it: in training, there are six floating point operations you have to do for every parameter you read in. So if it's FP8, a parameter is a byte; if it's FP16, it's two bytes, whatever, right?
But on the inference side, the ratio is completely different: it's two to one — two flops per parameter that you read in, and a parameter is maybe one byte. But then when you look at the GPUs, the GPUs have a very different ratio. The H100 has 3.35 terabytes a second of memory bandwidth, and it has a thousand teraflops of FP16/BF16. So that ratio is — I'm sorry, I'm going to butcher the math here and people are going to think I'm dumb — call it 256 to one if you're doing FP16, per parameter read versus number of floating point operations. And the same applies to FP8: if you quantize further, you also get double the performance at that lower quantization. That does not fit the hardware at all. So if you're just doing LLM inference at batch one, you're always going to be underutilizing the flops; you're only paying for memory bandwidth. And the way hardware is developing, that ratio is actually only going to get worse. The H200 will come out soon enough, which will help the ratio a little bit — it improves memory bandwidth more than it improves flops, just like the A100 80GB did versus the A100 40GB. But when the B100 comes out, the flops are going to increase more than the memory bandwidth, and the same with future generations and on the AMD side, MI300 versus 400. As you move through generations, just due to fundamental semiconductor scaling, DRAM memory is not scaling as fast as logic has been, and you can do a lot of interesting things on the architecture side, so this problem is going to get worse and worse. On training, it's kind of "who cares," because my flops are still my bottleneck most of the time. I mean, memory bandwidth is obviously a bottleneck, but batch sizes are freaking crazy, right? People train with 2-million-token batch sizes; it's trivial. I think that's what Llama did — Llama 70B was a 2 million batch size. You talk to someone at one of the frontier labs and they're like: just 2 million? A 2 million token batch size — or sequences, sorry — that's crazy. But when you go to the inference side, it's impossible to do a 2 million batch size — and your latency would be horrendous if you tried anything that crazy. So you have this differing problem, where in training everyone just talks MFU, model flop utilization: how many flops — six times the number of parameters, basically, more or less — versus the quoted number for the chip. So if I have 312 teraflops out of my A100 and I was able to achieve 200, that's really good. Some people achieve higher, some lower; it's a very important metric to think about. Now you have people thinking MFU is a security risk to disclose. But on inference, MFU is not nearly as important; it's memory bandwidth utilization. At batch one, the question is: what memory bandwidth can I achieve? Because as I increase batch size from one to four to eight to even 256 — which is sort of where the crossover happens on an H100, inference-wise — it's flops limiting you more and more. But you should have very high memory bandwidth utilization. So when people talk about A100s, 60% MFU is decent; on H100s, it's more like 40, 45%, because the flops increased more than the memory bandwidth.
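To put rough numbers on the ratios Dylan is describing, a back-of-envelope sketch — the H100 figures are approximate public spec-sheet numbers, everything else is arithmetic:

```python
# H100 SXM, approximate spec-sheet numbers
HBM_BW     = 3.35e12    # memory bandwidth, bytes/s
DENSE_BF16 = 0.989e15   # dense BF16 throughput, FLOPs/s (~1,000 TFLOPS)

hw_flops_per_byte = DENSE_BF16 / HBM_BW      # ~295 FLOPs available per byte read

# Inference does ~2 FLOPs per parameter per token; BF16 weights are 2 bytes,
# so batch-1 decode presents only ~1 FLOP of work per byte read.
demand_flops_per_byte = 2 / 2

crossover_batch = hw_flops_per_byte / demand_flops_per_byte
print(f"~{crossover_batch:.0f} concurrent tokens before flops, not bandwidth, bind")
```

The exact crossover depends on precision and kernel efficiency, but it lands in the same few-hundred range as the batch-256 figure Dylan quotes.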
But people over time will probably get above 50% MFU on H100 training. On inference, it's not talked about much, but MBU — model bandwidth utilization — is the important factor. Of my 3.35 terabytes a second of memory bandwidth on my H100, can I get two? Can I get three? That's the important thing. And right now, if you look at everyone's inference stuff — I dogged on this in the GPU-poor post — Hugging Face's libraries are actually very inefficient for inference, incredibly inefficient. You get like 15% MBU on some configurations, like eight A100s and Llama 70B — 15%, which is just horrendous. Because at the end of the day, your latency is derived from what memory bandwidth you can effectively get. If you're doing Llama 70B — 70 billion parameters at int8 — that's 70 gigabytes you need to read for every single forward pass, plus the attention, but again, we're simplifying. 70 gigabytes read for every forward pass: what is an acceptable latency for a user? I would argue 30 milliseconds per token; some people would argue lower. But at the very least, you need to achieve human reading speed, and probably a little faster — because we like to skim — to have a usable model for chatbot-style applications. There are other applications, of course, but for chatbot-style applications you want human reading speed. So 30 milliseconds per token is about 33 tokens per second; 33 times 70 gigabytes is roughly 2,100 gigabytes a second — call it 2.1 terabytes a second of memory bandwidth — to achieve human reading speed on Llama 70B. So one: even if you had enough memory capacity, you could never achieve Llama 70B human reading speed on an A100. And even on an H100, you're butting up against the limits already: it's 80 gigabytes of memory versus 70 billion parameters, which is 70 gigabytes at int8 or fp8. So you end up asking: how do I achieve human reading speed? If I go with two H100s, now I have, call it, six terabytes a second of memory bandwidth. If I achieve just 30 milliseconds per token — 33 tokens per second, i.e. about 2.1 terabytes a second — then I'm only at roughly 30% bandwidth utilization. And I'm not using all my flops at batch one anyway; the flops you're using there are tremendously low. So if with two H100s I only get 30 milliseconds a token, that's a really bad result. You should be striving for upwards of 60% — and 60% is kind of low, too; I've heard of people getting 70, 80% model bandwidth utilization. Then, obviously, you can increase your batch size from there, and your model bandwidth utilization will start to fall as your flops utilization increases, but you have to pick the sweet spot for where you want to sit on the latency curve for your user. Obviously, as you increase batch size, you get more throughput per GPU, so that's more cost-effective.
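Dylan's Llama 70B arithmetic, written out so it's easy to replay with different assumptions (the 30 ms/token target and int8 weights are his; the hardware figures are approximate spec-sheet numbers):

```python
params       = 70e9     # Llama 70B
bytes_per_p  = 1        # int8/fp8 weights
ms_per_token = 30       # ~human reading speed target

bytes_per_forward = params * bytes_per_p          # 70 GB read per token generated
tokens_per_sec    = 1000 / ms_per_token           # ~33.3
needed_bw         = bytes_per_forward * tokens_per_sec  # ~2.3e12 B/s (~2.1-2.3 TB/s)

H100_BW, H100_HBM = 3.35e12, 80e9
for n_gpus in (1, 2):
    mbu_required = needed_bw / (n_gpus * H100_BW)  # model bandwidth utilization
    weights_fit  = bytes_per_forward <= n_gpus * H100_HBM  # ignores KV cache etc.
    print(f"{n_gpus}x H100: need {mbu_required:.0%} MBU, weights fit: {weights_fit}")
```

One H100 would need roughly 70% MBU with essentially no headroom left for the KV cache; two H100s need only ~35%, which is why hitting just 30 ms/token on a pair is the "really bad result" he calls out.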
There's a lot to think about there, but I think those are the two main metrics people want to think about — and there's obviously a ton with regard to networking and inter-GPU connection, because most of the useful models don't run on a single GPU. They can't run on a single GPU. [00:17:37]Swyx: Is there a TPU equivalent of Mellanox? [00:17:39]Dylan: The Google TPU is super interesting, because Google has been working with Broadcom, who's the number one networking company in the world. Mellanox was nowhere close to number one; they had a niche they were very good at, which was the network card — the card you actually put in the server — but they weren't succeeding in switches, which is what you connect all the network cards to, and then the switches to all the servers. So Mellanox was not that great. I mean, they were doing well, and NVIDIA bought them in '19, I believe, or '18. But Broadcom has been number one in networking for a decade-plus, and Google partnered with them on making the TPU. Google does a lot of the design, especially on the ML hardware side — how you pass stuff around internally on the chip — but Broadcom does a lot on the network side: specifically, how to get really high connection speed between two chips. They've done a ton there, and obviously Google works a ton there too, but this is Google's less-discussed partnership that's truly critical for them. And Google has tried to get away from them many times: their latest target to get away from Broadcom is 2027, but that's four years from now — chip design cycles are four years — and they already tried to get away in 2025, and that failed. So they have this equivalent of very high-speed networking. It works very differently than the way GPU networking does, and that's important for people who code on a lower level. [00:18:52]Swyx: I've seen this described as the ultimate rate limit on how big models can go. It's not flops, it's not memory — it's networking. It has the slowest scaling, the slowest Moore's law, and I don't know what to do about that, because no one else has any solutions. [00:19:06]Dylan: Yeah, so I think what you're referring to is that network speed has increased much slower than the other two — than flops and memory bandwidth. And yeah, that's a tremendous problem in the industry. That's why NVIDIA bought a networking company; that's why Broadcom is working on Google's chip right now, and also on Meta's internal AI chip, which they're on the second generation of. And the main interesting thing Meta's doing there is networking stuff. Multiplying tensors — a lot of people have made good matrix multiply units, right? But getting good utilization out of those, and interfacing with the memory, and interfacing with other chips really efficiently, is what makes designing these chips very hard. Most of the startups obviously have not done that really well. [00:19:46]Alessio: I think the startups point is the most interesting, right? You mentioned companies that are GPU-poor: some raised a lot of money, and there are a lot of startups out there that are GPU-poor and did not raise a lot of money. What should they do? How do you see the space dividing?
Are we just supposed to wait for the big labs to do a lot of this work with a lot of the GPUs? What's the "GPU poor is beautiful" version of the article? [00:20:12]Dylan: OpenAI — who everyone would say has more GPUs than anyone else — still has a lot fewer flops than Google, right? That was the point of the post. But not just them; it's a relative totem pole. And of course, Google doesn't use GPUs as much for training and inference — they use some, but mostly TPUs. The whole point is that everyone is GPU-poor, because we're going to continue to scale faster and faster, and compute will always be a bottleneck, just like data will always be a bottleneck. You can have the best data set in the world and you can always have a better one; you can have the biggest compute system in the world, but you'll always want a better one. Mistral trained a freaking awesome model on relatively few GPUs, and now they're scaling up higher and higher. But there's a lot the GPU-poor can do. We all have phones, we all have laptops, right? There is a world for running models on device. The Replit folks are trying to do stuff like that. Those models can't follow scaling laws, though. Why? Because there's a fundamental limit to how much memory bandwidth and capacity you can get on a laptop or a phone. The ratio of flops to bandwidth on a GPU is actually really good compared to a MacBook or a phone. To run Llama 70 billion requires about two terabytes a second of memory bandwidth — 2.1 — at human reading speed, but my phone has like 50 gigabytes a second, and your laptop, even an M1 Ultra, has what — I don't remember — a couple hundred gigabytes a second of memory bandwidth. You can't run Llama 70B just by doing the classical thing. So there's stuff like speculative decoding — Together did something really cool and put it in the open source, of course: Medusa. Things like that work at batch size one; they don't work at high batch sizes. So there's the world of cloud inference — in the cloud, it's all about what memory bandwidth and MFU I can achieve — whereas on the edge, I don't think Google is going to deploy a model that I can run on my laptop to help me with code or with X, Y, Z; they're always going to want to run it in a cloud for control. Or maybe they let it run on the device, but only on their Pixel phone — kind of a walled-garden thing. There are obviously a lot of reasons to do other things: for security, for openness, to not be at the whims of a trillion-dollar-plus company who wants my data. There's a lot of stuff to be done there. And folks like Replit open-sourced their model, right? And Together, as I just mentioned — developing Medusa didn't take much GPU at all.
That's very — well, they do have quite a few GPUs; they made a big announcement about having 4,000 H100s. That's still relatively poor, right, when we're talking about the hundreds of thousands at the big labs like OpenAI and so on, or millions of TPUs like Google. But still, they were able to develop Medusa with probably just one server with eight GPUs in it. And the usefulness of something like Medusa, something like speculative decoding, is on device, right? That's what a lot of people can focus on; people can focus on all sorts of things like that. I don't know, right? Like a new model architecture: are we only going to use transformers? I'm predisposed to think transformers are it, right? My hardware brain can only love something that loves hardware. But people should continue to try and innovate on that. Or asynchronous training, right? That kind of stuff is super, super interesting. I think it's Tim Dettmers. He had like the- Dettmers? [00:23:09]Swyx: The same guy as Kylo Ren. [00:23:10]Dylan: Yes, he had the SWARM paper and Petals. That research is super cool. The universities will never have much compute, but, hey, they can prepare to do all these sorts of things; they should try to build, you know, super large models. Look at what Tsinghua University is doing in China, actually: they open sourced their model too, I think the largest open source model by parameter count, at least. Of course, they didn't train it on much data, but you can do some cool stuff like that. I don't know. I think there's a lot that people can focus on. One, scaling out a service to many, many users. Distribution is very important, so figuring out distribution, right? Figuring out useful fine-tunes. Doing LLMs that OpenAI will never make — sorry for the crassness, a porn DALL-E 3, right? Open source is doing crazy stuff with Stable Diffusion. And there's a legitimate market: I think there are a couple of companies who make tens of millions of dollars of revenue from LLMs or diffusion models for porn, or that kind of stuff. I mean, there's a lot of stuff that people can work on that will be successful businesses, or it doesn't even have to be a business, but can advance humanity tremendously. That doesn't require crazy scale. [00:24:10]Alessio: How do you think about the depreciation of the hardware versus the models? If you buy an H100, sure, next year's is going to be better, but at least the hardware is good. If you're spending a lot of money on training a smaller model, it might be super obsolete in three months. And you've got all this compute coming online. I'm just curious if companies should actually spend the time to fine-tune them and work on them when the next generation is going to be so much better out of the box.
[00:24:37]Dylan: Unless you're fine-tuning for on-device use, I think fine-tuning current existing models, especially the smaller ones, is a useless waste of time, because the cost of inference is actually much cheaper than you think once you achieve good MBU and you batch at a decent size, which any successful business in the cloud is going to achieve. And then two, on fine-tuning, people say, oh, this 7 billion parameter model, if you fine-tune it on a data set, is almost as good as 3.5, right? Well, why don't you fine-tune 3.5 and look at your performance? There's nothing open source that is anywhere close to 3.5 yet. There will be, but people also don't quite grasp this. Falcon was supposed to be that — Falcon 180B. It's fewer parameters than 3.5, and I don't know about the exact token count, but I believe it. Do we know the parameters of 3.5? It's not 175 billion. People keep saying this. [00:25:25]Swyx: No. Because we know 3, but we don't know 3.5. [00:25:27]Dylan: 3.5. [00:25:28]Swyx: It's definitely smaller. [00:25:29]Dylan: No, it's bigger than 175. I think it's sparse, MoE. I'm pretty sure. And you can put some bounds around the size of it by looking at their inference latency. Well, what's the theoretical bandwidth if they're running it on this hardware and doing tensor parallelism in this way? So they have this much memory bandwidth, and maybe they're awesome and get 90% memory bandwidth utilization. I don't know. That's an upper bound, and you can see the latency that 3.5 gives you, especially at off-peak hours, or if you do fine-tuning and you have a private enclave, Azure will quote you latency. So you can figure out how many parameters per forward pass, which I think is somewhere in the 40 to 50 billion range, but I could be very wrong. That's just my guess based on that sort of stuff — 50-ish. And actually, I think open source will have models of that quality. I assume Mosaic or Meta will open source, and Mistral will be able to open source, models of that quality. And furthermore, if you just look at the amount of compute — obviously data is very important, and there are all these tricks and dials that you turn to be able to get good MFU and good MBU, depending on inference or training; there's a ton of tricks — but at the end of the day, there's like 10 companies that have enough compute in one single data center to be able to beat GPT-4, right? Straight up, if not today, within the next six months. 4,000 H100s — I think you need about 7,000, maybe. And with some algorithmic improvements that have happened since GPT-4, and some data quality improvements probably, you could probably get to even less than 7,000 H100s running for three months to beat GPT-4. Of course, that's going to take a really awesome team, but there's quite a few companies that are going to have that many, right? Open source will match GPT-4, but then it's like, what about GPT-4 Vision? Or what about 5 and 6 and all that, and tool use and DALL-E? That's the other thing: there's a lot of stuff on tool use that open source could also do, that GPT-4 can do. I think there are some folks doing that kind of stuff, agents and all that. I don't know. That's way over my head, the agent stuff. [00:27:24]Swyx: Yeah, it's over everyone's head.
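The bounding exercise Dylan describes above can be made concrete: if decoding is memory-bandwidth bound, then the observed per-token latency caps how many bytes of weights, and hence active parameters, a forward pass can read. A rough sketch; every concrete number below (hardware, utilization, latency) is a hypothetical placeholder chosen only to show how a ~50B estimate could fall out, not a measured figure:

```python
# Hypothetical bound: active_params <= bandwidth * MBU * latency / bytes_per_param.
# All numbers below are illustrative placeholders, not measurements.

def max_active_params_b(agg_bandwidth_tb_s: float, mbu: float,
                        latency_s_per_token: float,
                        bytes_per_param: float = 2.0) -> float:
    """Upper bound on active parameters per forward pass, in billions."""
    bytes_read = agg_bandwidth_tb_s * 1e12 * mbu * latency_s_per_token
    return bytes_read / bytes_per_param / 1e9

# E.g., suppose 8-way tensor parallelism at ~2 TB/s per GPU (16 TB/s total),
# a generous 90% memory bandwidth utilization, and ~7 ms per token observed:
print(f"<= ~{max_active_params_b(16, 0.90, 0.007):.0f}B active parameters")
```

For a sparse MoE model this bounds only the parameters *touched* per token, not the total parameter count, which is exactly why a model "bigger than 175B" can still serve tokens at 3.5-like latency.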
One more question on this sort of Gemini GPU-rich essay. We've had a very wide-ranging conversation already, so it's hard to categorize, but I tried to look for the Meena Eats the World document. Oh, it's not public. [00:27:36]Dylan: No, no, no. You've read it. Yeah, I read it. So Noam Shazeer is, I don't know, I think he's- The GOAT. The GOAT. Yeah, I think he's the GOAT. [00:27:46]Swyx: In one year, he published Switch Transformers — and Attention Is All You Need, obviously — but he also did the speculative decoding stuff. [00:27:53]Dylan: Yeah, exactly. All this stuff that we were talking about today — and obviously there are other awesome people that were helping and all that. Meena Eats the World was basically an internal document he wrote around the time Google had Meena, right? And in it he was basically predicting everything that's happening now: that large language models are going to eat the world in terms of compute. He said the total amount of deployed flops within Google data centers will be dominated by large language models. Back then, a lot of people internally at Google thought he was silly for that. But now if you look at it, it's like, oh wait, millions of TPUs. You're right. Okay, we're totally getting dominated by both Gemini training and inference, right? Total flops being dominated by LLMs is completely right. [00:28:36]Swyx: So my question was, he had a bunch of predictions in there. Do you think there were any underrated predictions that may not have come true yet? Was he wrong on anything? [00:28:44]Dylan: Meena sucked, right? If you look at the total flops — you know, parameters times tokens times six — it's a tiny, tiny fraction of GPT-3, which came out just a few months later. So he was right about everything, but maybe he knew about GPT-2. I have no clue. OpenAI clearly was way ahead of Google on LLM scaling. Even then, people didn't really recognize it back in GPT-2 days, maybe. The number of people that recognized it was maybe hundreds, tens. [00:29:10]Alessio: So we talked about transformer alternatives. The other thing is GPU alternatives. The CPU is obviously one, but there's Cerebras, there's Graphcore, there's MatX, Lemurian Labs, there's a lot of them. Thoughts on what's real, who's alive, who's kind of a zombie company walking? [00:29:27]Dylan: You know, I mentioned transformers were the architecture that won out, but the number of people who recognized that in 2020 was, as you mentioned, probably hundreds, right? For natural language processing, maybe in 2019 at least. You think about a chip design cycle — it's years, right? So it's kind of hard to bet your architecture on the type of model that develops. But what's interesting about all the first-wave AI hardware startups: there's a ratio of memory capacity, compute, and memory bandwidth, and everyone kind of made the same bet, which is, I have a lot of memory on my chip. Which is, frankly, really dumb, because the models grew way past that. Even Cerebras, right?
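The "parameters times tokens times six" rule of thumb mentioned above is the standard estimate of dense-transformer training compute: C ≈ 6·N·D floating point operations for N parameters trained on D tokens. A sketch using approximate public figures (Meena at roughly 2.6B parameters and ~61B tokens per its paper; GPT-3 at ~175B parameters and ~300B tokens); treat all of these as ballpark numbers:

```python
# C ~= 6 * N * D: total training flops for a dense model with N parameters
# trained on D tokens (covers the forward plus backward pass).

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

meena = training_flops(2.6e9, 61e9)   # ~2.6B params, ~61B tokens (approx.)
gpt3 = training_flops(175e9, 300e9)   # ~175B params, ~300B tokens (approx.)
print(f"Meena: {meena:.1e} flops")    # ~9.5e20
print(f"GPT-3: {gpt3:.1e} flops")     # ~3.2e23, roughly 300x Meena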
You know, I'm talking about, like with Graphcore, it's called SRAM — the memory on chip, much lower density but much higher speed — versus DRAM, memory off chip. And so everyone was betting on more memory on chip and less memory off chip, right? And to be clear, for image networks and models that are small enough to just fit on your chip, that works. That is a superior architecture. But scale, right — scale, scale, scale. NVIDIA was the only company that bet on the other side: more memory bandwidth and more memory capacity, external, and also the right ratio of memory bandwidth versus capacity. A lot of people, like Graphcore specifically, had a ton of memory on chip, and then they had a lot more memory off chip, but that off-chip memory was much lower bandwidth. Same applies to SambaNova. Same applies to Cerebras: they had no memory off chip, but they thought, hey, I'm going to make a chip the size of a wafer, right? Those other guys are silly — they have hundreds of megabytes, we have 40 gigabytes. And then, oh, crap, models are way bigger than 40 gigabytes, right? The ones that people deploy. Everyone bet on sort of the left side of this curve. The interesting thing is that there are new-age startups, like Lemurian, like MatX — I won't get into what they're doing — but they're making much more rational bets. I don't know, it's hard to say with a startup that it's going to work out; obviously there's tons of risk embedded. But those folks — Jay Dawani of Lemurian, and Mike and Reiner at MatX — they understand models, they understand how they work. And if transformers continue to reign supreme, whatever innovations those folks are doing on hardware are going to need to be fitted for that. Or you have to predict what the model architecture is going to look like in a few years and hit that spot correctly. So that's kind of the background on those. But now you look at today: hey, Intel bought Nervana, which was Naveen Rao's company — he went on to start MosaicML and sold it to Databricks recently, and is obviously leading LLM and AI stuff there. Intel bought Nervana from him, then shut it down and bought this other AI company, Habana. And now Habana has new chips; they're going to release a better chip than the H100 within the next quarter or so. AMD, they have a GPU, MI300, that will be better than the H100 in a quarter or so. Now, that says nothing about how hard it is to program, but at least hardware-wise, on paper, it's better. Why? Because it's a year or a year and a half later than the H100, of course — a little bit more time and all that sort of stuff. But they're at least making similar bets on memory bandwidth versus flops versus capacity, following NVIDIA's lead. The questions are: what is the correct bet for three years from now? How do you engineer that? And will those alternatives make sense? The other thing is, if you look at total manufacturing capacity for this sort of bet: you need high bandwidth memory, HBM, and you need large five nanometer dies — soon three nanometer, whatever. You need both of those components, and you need the whole supply chain to go through that. We've written a lot about it, but to simplify it, NVIDIA has a little bit more than half, and Google has like 30%, right, through Broadcom.
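A toy way to see the on-chip-memory bet described above: check whether a model's weights even fit in a chip's local memory. The capacities below are rough public figures (Graphcore's IPU carries on the order of a gigabyte of SRAM; Cerebras' wafer carries the 40 GB mentioned above; an H100 carries 80 GB of off-chip HBM), and fp16 weights are assumed:

```python
# Does a model's weight footprint fit in a chip's local memory?
# Capacities are rough public figures; weights assumed fp16 (2 bytes/param).

CAPACITY_GB = {
    "Graphcore IPU (~0.9 GB on-chip SRAM)": 0.9,
    "Cerebras wafer (~40 GB on-chip SRAM)": 40.0,
    "NVIDIA H100 (80 GB off-chip HBM)": 80.0,
}

def fits(params_b: float, capacity_gb: float) -> bool:
    return params_b * 2.0 <= capacity_gb  # fp16 weight bytes vs capacity

for chip, cap in CAPACITY_GB.items():
    ok = [f"{p}B" for p in (1, 13, 70, 175) if fits(p, cap)]
    print(f"{chip}: fits {ok or 'none of these'}")
```

Note that even a single H100 can't hold a 70B fp16 model, which is why deployed models lean on multi-chip memory capacity and bandwidth rather than on-chip SRAM: the "right side" of the curve Dylan describes.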
So the total capacity for everyone else is much lower, and they're all sharing it, right? Amazon's Trainium and Inferentia, Microsoft's in-house chip, and you go down the list: Meta's in-house chip, and also AMD. All of these companies are sharing a much smaller slice. Their chips are not as good, or if they are — you know, I mentioned Intel's and AMD's chips are better — that's only because they're throwing more money at the problem, kind of, right? NVIDIA charges crazy prices; I think everyone knows that. Their gross margins are insane. AMD and Intel and others will charge more reasonable margins, and so they're able to give you more HBM et cetera for a similar price, and that ends up letting them beat NVIDIA, if you will. But their manufacturing costs are twice that in some cases, right? In the case of AMD, the MI300's manufacturing costs are more than twice that of the H100, and it only beats the H100 by a little bit, from the performance stuff I've seen. So it's tough for anyone to bet the farm on an alternative hardware supplier. In my opinion, you should either — like, a lot of ex-Google startups are just using TPUs, right? And hey, that's Google Cloud: after moving the TPU team into the cloud infrastructure team, sort of, they're much more aggressive on external selling, and so you even see companies like Apple using TPUs for training LLMs. So either bet heavily on TPUs, because that's where the capacity is, or bet heavily on GPUs, of course, and stop worrying about it, and leverage all this amazing open source code that is optimized for NVIDIA. If you do bet on AMD or Intel or any of these startups, then you better make damn sure you're really good at low-level programming, and damn sure you also have a compelling business case, and that the hardware supplier is giving you such a good deal that it's worth it. And also, by the way, NVIDIA's releasing a new chip — they're going to announce it in March, and they're going to release it and ship it Q2, Q3 next year anyways, right? And that chip will probably be three or four times as good, and maybe it'll cost twice as much, or 50% more. I hear it's 3x the performance on an LLM, and 50% more expensive, is what I hear. So, okay, yeah, nothing is going to compete with that, even if it is 50% more expensive. And then that kicks the can down further, and NVIDIA's moving to a yearly release cycle, so it's very hard for anyone to catch up to NVIDIA, really. So investing all this in other hardware — if you're Microsoft, obviously, who cares if I spend $500 million a year on my internal chip? Who cares if I spend $500 million a year on AMD chips, right? If it lets me knock the price of NVIDIA GPUs down a little bit and puts the fear of God in Jensen Huang, then it is what it is, right? And likewise with Amazon, and so on and so forth. Of course, their hope is that their chips succeed, or that they can actually have an alternative that is much cheaper than NVIDIA. To throw a couple hundred million dollars at a company as a prod is completely reasonable. And in the case of AMD, I think it'll be more than a couple hundred million dollars, right?
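To make the perf-per-dollar logic at the end explicit — using the rumored 3x-performance-at-50%-more-cost figures from the conversation, which are hearsay rather than confirmed specs:

```python
# Relative performance per dollar: even at a 50% price premium,
# a 3x-faster chip delivers 2x the performance per dollar.
def perf_per_dollar(rel_perf: float, rel_price: float) -> float:
    return rel_perf / rel_price

print(perf_per_dollar(3.0, 1.5) / perf_per_dollar(1.0, 1.0))  # 2.0
```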
But yeah, I think alternative hardware really does hit sort of a peak hype cycle around the end of this year, early next year, because all NVIDIA has is the H100, and then the H200, which is just a better H100 — more memory bandwidth, higher memory capacity. But that doesn't beat what AMD is doing, it doesn't beat what Intel's Gaudi 3 does. Then very quickly after, NVIDIA will crush them, and those other companies are going to take two years to get to their next generation. It's just a really tough place. And the main thing about hardware is: that bet I talked about earlier is very oversimplified, right? Just memory bandwidth, flops, and memory capacity. There are a whole lot more bets — a hundred different bets that you have to make and guess correctly to get good hardware, not even to have better hardware than NVIDIA, just to get close to them. And that takes understanding models really, really well. That takes understanding so many different aspects, whether it's power delivery or cooling or design, layout, all this sort of stuff. How many companies can do everything here? I'd argue Google probably understands models better than NVIDIA — I don't think people would disagree — and NVIDIA understands hardware better than Google. And so you end up with Google's hardware being competitive. But does Amazon understand models better than NVIDIA? I don't think so. And does Amazon understand hardware better than NVIDIA? No. I also have the opinion that, for the labs, these are useful partners, convenient partners. They're not going to buddy up as close as people think, right? I even expect that in the next few years the OpenAI–Microsoft relationship probably falls apart too. I mean, they'll still continue to use GPUs and stuff there, but I think the level of closeness you see today is probably the closest they get. [00:37:15]Swyx: At some point, they become competitive if OpenAI becomes its own cloud. [00:37:18]Dylan: The level of value that they deliver to the world — if you talk to anyone there, they truly believe it'll be tens of trillions, if not hundreds of trillions, of dollars, right? In which case, obviously, weird corporate structure aside, this is the same playing field as companies like Microsoft and Google. Google also wants to deliver hundreds of trillions of dollars of value, so obviously you're competing, and Microsoft wants to do the same, and you're going to compete. In general, these lab partnerships are going to be nice, but they're probably incentivized like: hey, NVIDIA, can you design the hardware in this way? It doesn't work like that; it works like this. And they're like, oh, so this is the best compromise, right? I think OpenAI would be stupid not to do that with NVIDIA, but also with AMD, and with Microsoft's internal silicon. But it's like, how much time do I actually have? Should I spend all my super smart people's limited time — this caliber of person's time — doing that? Or should they focus on, hey, can we get asynchronous training to work? Or figure out this next multimodal thing? Or, I don't know, should I eke out 5% more MFU and work on designing the next supercomputer?
Right? These kinds of things — how much more valuable is that? So it's tough to see even OpenAI helping Microsoft enough to make Microsoft's knowledge of models that good. Microsoft's going to announce their chip soon. It's worse performance than the H100, but the cost effectiveness of it is better for Microsoft internally, just because they don't have to pay the NVIDIA tax. But again, by the time they ramp it and all these sorts of things — and, oh hey, that only works on a certain size of models, and once you exceed that, then it's actually, again, better to go NVIDIA. So it's really tough for OpenAI to say, yeah, we want to bet on Microsoft. And hey, what's the number of people they have now, like 700 people? Of which how many do low-level code? Do I want to have separate code bases for this and this and this? It's just a big headache. I think it'd be very difficult to see anyone truly pivoting to anything besides a GPU and a TPU, especially if you need that scale. And the scale that the labs require is absurd. Google says millions of TPUs, right? OpenAI will say millions of GPUs. I truly do believe they think in those numbers of next-generation GPUs. I don't know, but I bet Sam Altman would say, yeah, we're going to build a hundred billion dollar supercomputer in two or three years, right? And after GPT-5 releases, if he goes to the market and says, hey, I want to raise a hundred billion dollars at a $500 billion valuation, I'm sure the market would give it to him, right? And then they build that supercomputer. I think that's truly the path we're on. And so it's hard to imagine. Yeah, I don't know. [00:40:00]Swyx: One point that you didn't touch on — and Taiwan companies are famously very chatty about the fruit company — should we take Apple seriously at all in this game, or are they just in a different world altogether? [00:40:10]Dylan: I respect their products, but I don't think Apple will ever release a model that you can get to say really bad things. There are all these jailbreaks, but as soon as they happen, they get fed back into OpenAI's platform; being public and open is accelerating their ability to make a better and better model, right? The RLHF and all this kind of stuff. I don't see how Apple can do that structurally, as a company — the fruit company ships perfect products, or else, right? That's why everyone loves iPhones. And all these open source folks are doing exactly that, building a bigger and better model every few months, and I don't know how Apple gets on that train. But at the same time, there's no company that has more powerful distribution. [00:40:56]Swyx: Are people in Taiwan concerned that it will come to a point where China will just claim Taiwan? [00:41:02]Dylan: I think a lot of people there are not super concerned, but there are some people that are super concerned.
I think especially after, you know, the instability across the world — in Europe and in the Middle East and even Africa — if you look at any of the stuff they're building up, it seems very clear. And if you talk to a lot of people, they think China will invade Taiwan in '27 or '26, in April or in September — those are sort of the likeliest timeframes, right? A lot of people believe that's what will happen. [00:41:29]Swyx: Maybe the SemiAnalysis analyst point of view is: is it feasible to build this capacity up in the US? No. [00:41:35]Dylan: No, right? People don't understand how fragmented the semiconductor supply chain really is and how many monopolies there are. The US could absolutely shut down the Chinese semiconductor supply chain. They won't. And China could absolutely shut down the US one, actually, by the way. But more relevantly: Austria — the country of Austria, in Europe — has two companies that have super high market share in very specific technologies that are required for every single chip, period. There is no chip below seven nanometer that doesn't get touched by this one Austrian company's tool, and there is no alternative. And there's another Austrian company — likewise, everything two nanometer and beyond will be touched by their tool. And both of these companies are doing well under a billion dollars of revenue, right? So you'd think they're inconsequential. No. And there are actually three or four Japanese chemical companies — same idea. The supply chain is so fragmented. People only ever talk about where the fabs are, where chips actually get produced, but — I mean, TSMC in Arizona, right? TSMC is building a fab in Arizona. It's quite a bit smaller than the fabs in Taiwan. But even ignoring that, those fabs have to ship things back to Taiwan anyways, and they have to get what's called a mask from Taiwan, sent to Arizona. And by the way, there are these Japanese companies that make the chemicals that need to ship in, like TOK and Shin-Etsu, and, hey, it needs this tool from Austria no matter what. It's like, oh wow, wait, actually the entire supply chain is just way too fragmented. You can't re-engineer and rebuild it in a snap; it's just too complex to do that. Semiconductors are more complex than any other thing that humans do, without a doubt. There are more people working in that supply chain, with all kinds of backgrounds, and more money invested every year in R&D plus CapEx — it's by far the most complex supply chain that humanity has. And to think that we could rebuild it in a few years is absurd. [00:43:22]Swyx: In an alternate universe, the US kept Morris Chang. I mean, people, right? Like it was just one guy. Yeah. [00:43:29]Dylan: In an alternate universe, Texas Instruments communicates to Morris Chang that he would become CEO, and so he never goes to Taiwan, and, you know, blah, blah, blah. Right. Yeah. But I think the world would probably be further behind in terms of technology development if that hadn't happened. Technology proliferation is how you accelerate the pace of innovation, right?
So the dissemination — oh, wow, hey, it's not just a bunch of people in Oregon at Intel that are leading everything, right? Or a bunch of people at Samsung in Korea, or in Hsinchu, Taiwan. It's actually all three of those, plus all these tool companies across the Netherlands and Japan and the US — millions of people innovating on a disseminated technology — that's led us to get here, right? If Morris Chang hadn't gone to Taiwan, would we even be at 5 nanometer? Would we be at 7 nanometer? Probably not. So there's a lot of things that happened because of that. [00:44:22]Alessio: Let's get a quick lightning round in, a SemiAnalysis-branded one. The first one is: what are the foundational readings that people listening today should read to get up to speed on semis? [00:44:34]Dylan: I think the easiest one is the PyTorch 2.0 and Triton one that I did. There's the advanced packaging series. There's the Google infrastructure supremacy piece — I think that one's really critical, because it explains Google's infrastructure quite a bit, from networking through chips, through the history of the TPU a little bit. Maybe the AMD MI300 piece — the ones that we did on that are very good. And then obviously, probably Chip War by Chris Miller — who doesn't recommend that book, right? It's a really good book. I would say Gordon Moore's book is freaking awesome, because, you've got to think about it: LLM scaling laws are Moore's law on crack, right? Kind of, in a different sense. If you think about it, all of human productivity gains since the 70s are probably just off the base of semiconductors and technology. Of course, people across the world are getting access to oil and gas and all this sort of stuff, but at least in the Western world, since the 70s, everything has mostly been innovated because of technology. We're able to build better cars because semiconductors enable us to do that, or build better software because we're able to connect everyone, because semiconductors enabled that. That's why I think it's the most important industry in the world. But seeing the frame of mind in what Gordon Moore has written — he's got a couple of papers, books, et cetera. Only the paranoid survive, right? I think that philosophy and thought process really translates to modern times, except maybe humanity has been on an exponential S-curve, and this is another exponential S-curve on top of that. So I think those are probably good readings to do. [00:46:09]Swyx: Has there been an equivalent pivot? So Gordon — that classic tale was more of the pivot to memory. [00:46:16]Dylan: From memory to logic. Yeah. [00:46:18]Swyx: Yeah. Has there been an equivalent pivot in semis' history of that magnitude? [00:46:24]Dylan: I mean, some people would argue that Jensen — he basically only cared about gaming and 3D professional visualization and rendering and things like that, until he started to learn about AI.
And then all of a sudden he's going to universities, like, you want some GPUs? Here you go. I think there are even stories of, not so long ago, NeurIPS — when it used to have the more unfortunate name — he would go there and just give away GPUs to people. Stuff like that, very grassroots, pivoting the company. Now you look on gaming forums and everybody's like, oh, NVIDIA doesn't even care about us, they only care about AI. And it's like, yes, you're right. They mostly only care about AI, and the gaming innovations are only happening because they're putting more AI into it. But also, hey, they're doing a lot of chip design stuff with AI. I don't know if it's an equivalent pivot quite yet — the memory-to-logic shift was a pretty big innovation — but I think this is a big one too. And likewise, what did OpenAI do? How did they pivot? Most of them left the culture of Google Brain and DeepMind and decided to build this company that's crazy cool, right? It does things in a very different way and innovates in a very different way. So you could consider that a pivot, even though it's not inside Google. [00:47:40]Swyx: They were on a very different path with the Dota games and all that before they eventually found GPTs as the thing. So they started in 2015 and then really pivoted in 2019 to be like, all right, we're the GPT company. That's how I'd classify them — I'm sure there are OpenAI people who are yelling at me right now. Okay, so just a general question: I'm a fellow writer on Substack. You are obviously managing your consulting business while you're also publishing these amazing posts. What's your writing process? How do you source info? When do you sit down and go, here's the theme for the week? Do you have a pipeline going out? Just anything you can describe. [00:48:17]Dylan: I'm thankful for my teammates, because they are actually awesome, and they're much more directed, focused on working on one thing — or not one thing, but a number of things, right? Like, someone who's the expert on X and Y and Z in the semiconductor supply chain. That really helps with that side of the business. Most of the time, I only write when I'm very excited, or it's like, hey, we should work on this and we should write about this. So one of the most recent posts we did explained the manufacturing process for 3D NAND flash storage, gate-all-around transistors, and 3D DRAM and all this sort of stuff, because there's a company in Japan that's going public, Kokusai Electric. It was like, okay, well, we should do a post about this and we should explain this. And so Myron did all that work — most of the work — and it's awesome. But usually there are a few very long, in-depth, back-burner-type things like that. That one took over a month of research, and Myron knows this stuff really well already, right?
There's stuff like that that we do, and that builds up a body of work for our consulting and some of the reports that we sell that aren't newsletter posts. But a lot of times the process is also just like — well, Meena Eats the World was the culmination of reading that document, having done a lot of work on the supply chain around the TPU ramp and CoWoS and HBM capacities and all this sort of stuff, to be able to figure out how many units Google's ordering, all sorts of stuff, and then also looking at open sources. All of that culminated in: I wrote that in four hours, right? I sent it to a couple of people, and they were like, no, change this, this, this; oh, add this, because that's really going to piss off the open source community. I'm like, okay, sure. And then posted it. So there's no specific process. Unfortunately, the most viral posts, especially in the AI community, are just those kinds of pieces, rather than the really deep work — like, obviously, what was in the Gemini Eats the World post. We do deep work, and there's a lot more that is factual — not leaks, just factual research. Across the team, we go to 40-plus conferences a year, all the way from a photoresist conference to a photomask conference to a lithography conference, all the way up to AI conferences and everything in between — networking conferences — and piecing everything together across the supply chain. That's the true work. It is sometimes bad to have the infamousness of people only caring about this, or the GPT-4 leak, or the Google "no moat" leak, but that's just stuff that comes along. The real focus is understanding the supply chain and how it's pivoting, who the winners and losers are, what technologies are inflecting, things like that — where the best place is to invest resources in accelerating or capturing value, et cetera. [00:50:54]Alessio: Awesome. And to wrap: if you had a magic genie that could answer any question that would change your worldview, what question would you ask? [00:51:03]Dylan: That's a tough one. [00:51:04]Swyx: Like, you operate based on a set of facts about the world right now, but there are maybe some unknowns where you're like, man, if I really knew the answer to this one, I would do so many things differently, or I would think about things very differently. [00:51:18]Dylan: So I'm of the view, at least from everything that we've seen so far, that large-scale training has to happen in an individual data center with very high speed networking. Now, everything doesn't need to be all-to-all connected, but you need very high speed networking between all of your chips, right? I would love to know: hey, magic genie, how can we build artificial intelligence in a way that it can use multiple data centers of resources, where there is significantly lower bandwidth between pools of resources? Because one of the big bottlenecks is how much power and how many chips you can get into a single data center.
So, A, Google and OpenAI and Anthropic are working on this, and I don't know if they've solved it yet, but if they haven't, then what is the solution? Because that would accelerate the scaling that can be done not just by a factor of 10, but by orders of magnitude, because there are so many different data centers across the world, right? If I could effectively use 256 GPUs in this little data center here, together with this big cluster over here — how can you make an algorithm that can do that? I think that would be the number one thing I'd be curious to know, because it changes the world significantly in terms of how we continue to scale this amazing technology that people have invented over the last five years. Awesome. [00:52:36]Alessio: Well, thank you so much for coming on, Dylan. [00:52:38]Dylan: Thank you. Thank you.
-
AGI is Being Achieved Incrementally (DevDay Recap - cleaned audio)
From 🇺🇸 Latent Space: The AI Engineer Podcast, published at 2023-11-08 07:27
We left a high amount of background audio in the DevDay podcast, which many of you loved, but we definitely understand that some of you may have had trouble with it. Listener Klaus Breyer ran it through Auphonic with speech isolation, and we figured we'd upload it as a backdated pod for people who prefer this. Of course, it means that our speakers sound out of place, since they now sound like they are talking loudly in a quiet room. Let us know in the comments what you think!

Timestamps (the cleaned part is only Part 2):

* [00:55:09] Part II: Spot Interviews
* [00:55:59] Jim Fan (Nvidia) - High Level Takeaways
* [01:05:19] Raza Habib (Humanloop) - Foundation Model Ops
* [01:13:32] Surya Dantuluri (Stealth) - RIP Plugins
* [01:20:53] Reid Robinson (Zapier) - AI Actions for GPTs
* [01:30:45] Div Garg (MultiOn) - GPT4V for Agents
* [01:36:42] Louis Knight-Webb (Bloop.ai) - AI Code Search
* [01:48:36] Shreya Rajpal (Guardrails) - Guardrails for LLMs
* [01:59:00] Alex Volkov (Weights & Biases, ThursdAI) - "Keeping AI Open"
* [02:09:39] Rahul Sonwalkar (Julius AI) - Advice for Founders