Metaprompting is turning out to be a very, very powerful tool that everyone's using now. It kind of actually feels like coding in, you know, 1995, like the tools are not all the way there. We're, you know, in this new frontier. But personally, it also kind of feels like learning how to manage a person, where it's like, how do I actually communicate, you know, the things that they need to know in order to make a good decision? Welcome back to another episode of The Light Cone. Today we're pulling back the curtain on what is actually happening inside the best AI startups when it comes to prompt engineering.0:44
We surveyed more than a dozen companies and got their take right from the frontier of building this stuff, the practical tips. Jared, why don't we start with an example from one of your best AI startups.0:58
I managed to get an example from a company called ParaHelp. ParaHelp does AI customer support. There are a bunch of companies doing this, but ParaHelp is doing it really, really well. They're actually powering the customer support for Perplexity and Replit and Bolt and a bunch of other top AI companies now. So if you go and you email a customer support ticket into Perplexity, what's actually responding is their AI agent. The cool thing is that the ParaHelp guys very graciously agreed to show us the actual prompt that is powering this agent, and to put it on screen on YouTube for the entire world to see. It's relatively hard to get these prompts for vertical AI agents because they're kind of like the crown jewels of the IP of these companies.1:37
And so very grateful to the ParaHelp guys for agreeing to basically open source this prompt.1:41
Diana, can you walk us through this very detailed prompt? It's super interesting and it's very rare to get a chance to see this in action.1:48
So the interesting thing about this prompt is, first, it's really long and very detailed. In this document, you can see it's like six pages long, just scrolling through it. The big thing that a lot of the best prompts start with is this concept of setting up the role of the LLM: you're a manager of a customer service agent, and it breaks down into bullet points what it needs to do. Then the big thing is telling it the task, which is to approve or reject a tool call, because it's orchestrating agent calls from all these other agents.2:19
And then it gives it a bit of the high level plan. It breaks it down step by step. You see steps one, two, three, four, five. And then it gives some of the important things to keep in mind that it should not kind of go weird into calling different kinds of tools. It tells them how to structure the output because a lot of things with agents is you need them to integrate with other agents. So it's almost like gluing the API call. So it's important to specify that it's going to give certain output of accepting or rejecting and in this format.2:53
Then this is sort of the high level section. And one thing that the best prompts do, they break it down sort of in this markdown type of style formatting. So you have sort of the heading here. And then later on, it goes into more details on how to do the planning. And you see this is like a sub bullet part of it. And as part of the plan, there's actually three big sections. There's how to plan, and then how to create each of the steps in the plan, and then the high level example of the plan.3:21
One big thing about the best prompts is they outline how to reason about the task, and then a big thing is giving it an example. And this is what it does. And one thing that's interesting about this is it looks more like programming than writing English, because it has this XML-tag kind of format to specify the plan. We found that it makes it a lot easier for LLMs to follow, because a lot of LLMs were post-trained with RLHF on kind of XML-type input, and it turns out to produce better results.3:55
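A minimal sketch of the kind of structure described here, with a role, a task, a step-by-step plan, output constraints, and an XML-tagged plan example. The section names, wording, and tags are illustrative assumptions, not ParaHelp's actual prompt.

```python
# Illustrative skeleton of a "manager" style agent prompt (not ParaHelp's actual prompt).
# Section names, steps, and XML tags are assumptions for demonstration only.
MANAGER_PROMPT = """
# Role
You are a manager of a customer service agent. You review each proposed tool call
and decide whether to approve or reject it.

# Task
Given the conversation so far and the proposed tool call, output exactly one decision,
"accept" or "reject", plus a short justification.

# Plan
1. Read the customer's issue and the proposed tool call.
2. Check the tool call against the allowed tools and the policy below.
3. If anything is ambiguous or out of scope, reject and explain why.
4. Otherwise, accept.

# Important
- Never approve tools that are not in the allowed list.
- Always respond in the output format below so downstream agents can parse it.

# Output format
<decision>accept | reject</decision>
<reason>one or two sentences</reason>

# Example plan
<plan>
  <step>Identify the customer's request</step>
  <step>Match it to an allowed tool</step>
  <step>Approve or reject the proposed call</step>
</plan>
"""
```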
Yeah, one thing I'm surprised that isn't in here or maybe this is just the version that they released. What I almost expect is there to be a section where it describes a particular scenario and actually gives example output for that scenario.4:10
That's in like the next stage of the pipeline.4:13
Yeah. Oh, really? Okay.4:15
Yeah, because it's customer specific, right? Every customer has their own flavor of how to respond to these support tickets. And so their challenge, like a lot of these agent companies, is how do you build a general-purpose product when every customer, you know, has slightly different workflows and preferences. That's a really interesting thing that I see the vertical AI agent companies talking about a lot, which is: how do you have enough flexibility for special-purpose logic without turning into a consulting company where you're building a new prompt for every customer?4:46
I actually think this concept of forking and merging prompts across customers, and which part of the prompt is customer specific versus company wide, is a really interesting thing that the world is only just beginning to explore.4:58
Yeah, that's a very good point, Jared. So there's this concept of splitting things into the system prompt, then the developer prompt, and then the user prompt. What this means is the system prompt is basically almost like defining the high-level API of how your company operates. In this case, the ParaHelp example is very much a system prompt. There's nothing specific about the customer. And then as they add specific instances of that API and call it, they stuff all of that into the developer prompt, which is not shown here.5:30
And that adds all the context of, let's say, working with Perplexity. There are certain ways you handle RAG questions, whereas working with Bolt is very different, right? And then I don't think ParaHelp has a user prompt, because their product is not consumed directly by an end user, but an end user prompt could be more like Replit or v0, right? Where users type something like: generate me a site that has these buttons, this and that. That all goes in the user prompt. So that's sort of the architecture that's emerging.6:02
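A rough sketch of that three-layer split, using a chat-style messages list. The role names follow the common system/developer/user convention (some APIs fold the developer layer into the system message), and the company names and prompt contents are placeholders.

```python
# Sketch of the three-layer prompt architecture described above. The snippets are
# placeholders; "developer" follows the newer message-role convention and may be
# folded into the system message on older APIs.
SYSTEM_PROMPT = "You are a manager of a customer support agent..."       # company-wide "API"
DEVELOPER_PROMPT = (
    "Customer: Perplexity. When a ticket involves RAG/search quality, "
    "follow these escalation rules..."                                    # per-customer context
)
USER_PROMPT = "My Pro subscription was charged twice this month."         # end-user input, if any

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "developer", "content": DEVELOPER_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]
# `messages` can then be passed to whichever chat-completion API the product uses.
```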
To your point about avoiding becoming a consulting company, I think there are so many startup opportunities in building the tooling around all of this stuff. For example, anyone who's done prompt engineering knows that worked examples are really important to improving the quality of the output. And so if you take ParaHelp as an example, they really want good worked examples that are specific to each company. And you can imagine that as they scale, you almost want that done automatically. Like, in your dream world, what you want is an agent itself that can pluck out the best examples from the customer data set, and software that just ingests that straight into wherever it should belong in the pipeline, without you having to manually go out and plug it all in and ingest it yourself.6:49
That's probably a great segue into meta-prompting, which is one of the things we want to talk about, because that's a consistent theme that keeps coming up when we talk to our AI startups.6:58
Yeah, Tropier is one of the startups I'm working with in the current YC batch and they've really helped people like YC company Ducky do really in-depth understanding and debugging of the prompts and the return values from a multi-stage workflow. And one of the things they figured out is prompt folding. So, you know, basically one prompt can dynamically generate better versions of itself. So a good example of that is a classifier prompt that generates a specialized prompt based on the previous query. And so you can actually go in, take the existing prompt that you have, and actually feed it more examples where maybe the prompt failed, it didn't quite do what you wanted.7:39
And you can actually, instead of you having to go and rewrite the prompt, you just put it into the raw LLM and say, help me make this prompt better. And because it knows itself so well, strangely, meta prompting is turning out to be a very, very powerful tool that everyone's using now.7:57
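A minimal sketch of that prompt-folding loop: feed the current prompt plus concrete failure cases back to a strong model and ask it for a better prompt. The `call_llm` helper is a hypothetical stand-in for whatever model API is actually being used.

```python
# Hypothetical prompt-folding helper: ask a strong model to rewrite a prompt using
# concrete failure cases. `call_llm` is a stand-in for whatever model API is in use.
def improve_prompt(current_prompt: str, failure_cases: list, call_llm) -> str:
    failures = "\n\n".join(
        f"Input: {case['input']}\nBad output: {case['output']}\nWanted: {case['expected']}"
        for case in failure_cases
    )
    meta_prompt = (
        "You are an expert prompt engineer. Below is a prompt and cases where it failed.\n\n"
        f"<prompt>\n{current_prompt}\n</prompt>\n\n"
        f"<failures>\n{failures}\n</failures>\n\n"
        "Rewrite the prompt so it handles these cases without breaking its original intent. "
        "Return only the improved prompt."
    )
    return call_llm(meta_prompt)
```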
And the next step after you do prompt folding, if the task is very complex, is this concept of using examples. And this is what JazzBerry does. It's one of the companies I'm working with this batch. They basically build automatic bug finding in code, which is a lot harder. And the way they do it is they feed a bunch of really hard examples that only expert programmers could do. Let's say you want to find an N+1 query. It's actually hard today for even the best LLMs to find those. And the way to do it is they find parts of the code, then they add those into the prompt, and the meta prompt is like, hey, this is an example of an N+1 type of error.8:38
And then that works it out. And I think this pattern of, sometimes when it's too hard to even write prose around it, let's just give it an example, turns out to work really well, because it helps the LLM reason around complicated tasks and steers it better when you can't quite put exact parameters on it. And it's almost like unit testing in programming, in a sense; it's sort of the LLM version of test-driven development.9:05
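A sketch of the hard-examples pattern for something like N+1 query detection. The example snippet and wording are invented for illustration and are not JazzBerry's actual prompt.

```python
# Hard worked examples embedded in the prompt (illustrative, not JazzBerry's real prompt).
N_PLUS_ONE_EXAMPLE = """
<example>
  <description>N+1 query: one database query per item inside a loop instead of a single batched query.</description>
  <buggy_code>
    for order in Order.objects.all():
        print(order.customer.name)   # hits the database once per order
  </buggy_code>
  <finding>Use select_related("customer") so customers are fetched in the same query.</finding>
</example>
"""

BUG_FINDER_PROMPT = (
    "You are an expert code reviewer. Look for performance bugs like the example below.\n"
    + N_PLUS_ONE_EXAMPLE
    + "\nNow review the following diff and report any similar issues:\n"
)
```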
Yeah, another thing that Tropier talks about is that the model really wants to help you so much that if you just tell it, give me back output in this particular format, even if it doesn't quite have the information it needs, it'll just tell you what it thinks you want to hear. And it's literally a hallucination. So one thing they discovered is that you actually have to give the LLMs a real escape hatch. You need to tell it: if you do not have enough information to say yes or no or make a determination,9:37
don't just make it up; stop and ask me. And that's a very different way to think about it.9:43
That's actually something we learned in some of the internal work that we've done with agents at YC, where Jared came up with a really inventive way to give the LLM an escape hatch. Do you want to talk about that?9:55
Yeah. So the Tropier approach is one way to give the LLM an escape hatch. We came up with a different way, which is, in the response format, to give it the ability to have part of the response be essentially a complaint to you, the developer, that you have given it confusing or underspecified information and it doesn't know what to do. And then the nice thing about that is that we just run the LLM in production with real user data, and then you can go back and look at the output that it has given you in that output parameter.10:28
We call it debug info internally. So we have this debug info parameter where it's basically reporting to us things that we need to fix. It literally ends up being like a to-do list that you, the agent developer, have to do. It's really kind of mind-blowing stuff.10:43
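A sketch of what an escape-hatch response format with a debug info field might look like. The field names and JSON shape here are assumptions for illustration, not the actual internal format.

```python
# Illustrative escape-hatch response format with a "debug_info" complaint channel.
RESPONSE_FORMAT_INSTRUCTIONS = """
Respond with JSON in exactly this shape:
{
  "decision": "accept" | "reject" | "needs_more_info",
  "reason": "<short justification>",
  "debug_info": "<anything about this request that was ambiguous, underspecified,
                  or contradictory; empty string if nothing>"
}
If you do not have enough information to decide, use "needs_more_info" and explain
what is missing in debug_info. Do not guess.
"""

def collect_debug_info(responses: list) -> list:
    # After running in production, harvest the complaints as a to-do list for the developer.
    return [r["debug_info"] for r in responses if r.get("debug_info")]
```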
Yeah. I mean, even for hobbyists or people who are interested in playing around with this for personal projects, a very simple way to get started with meta-prompting is to follow the same structure: give it a role, and make the role be, you know, you're an expert prompt engineer who gives really detailed, great critiques and advice on how to improve prompts. Then give it the prompt that you had in mind, and it will spit back a much more expanded, better prompt, and you can just keep running that loop for a while.11:11
Works surprisingly well.11:12
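A minimal version of that critique loop, again with `call_llm` as a hypothetical stand-in for whatever model API you use; the role wording is just one way to phrase it.

```python
# Simple meta-prompting loop for hobbyists: have the model critique and expand a draft
# prompt a few times. `call_llm` is a hypothetical stand-in for your model API.
def expand_prompt(draft: str, call_llm, rounds: int = 2) -> str:
    critique_role = (
        "You are an expert prompt engineer who gives detailed critiques and rewrites. "
        "Improve the following prompt: make the role, task, constraints, and output "
        "format explicit. Return only the improved prompt.\n\n"
    )
    prompt = draft
    for _ in range(rounds):
        prompt = call_llm(critique_role + prompt)
    return prompt
```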
I think that's a common pattern for companies when they need to get responses from LLMs in their product a lot quicker. They do the meta-prompting with a bigger, beefier model, one of the, I don't know, hundreds-of-billions-of-parameters-plus models, like Claude 4 or 3.7, or o3. They do this meta-prompting and then they have a very good working prompt that they then use with a distilled model, for example a smaller, faster one, and it ends up working pretty well. Specifically, this comes up for voice AI agent companies, because latency is very important to get this whole Turing test to pass; if there's too much of a pause before the agent responds, humans can detect something is off.12:02
So they use a faster model, but with a bigger, better prompt that was refined from the bigger models. So it's like a common pattern as well.12:10
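A sketch of that refine-with-a-big-model, serve-with-a-fast-model pattern. The model names and the `call_llm` helper are placeholders, not any specific vendor's API.

```python
# Refine the prompt offline with a big model, then serve it with a small low-latency model
# (important for voice agents). Model names and `call_llm` are placeholders.
BIG_MODEL = "large-reasoning-model"      # frontier model used offline for meta-prompting
FAST_MODEL = "small-low-latency-model"   # distilled model used in production

def refine_offline(draft_prompt: str, call_llm) -> str:
    # Use the big model to polish the production prompt (possibly over several rounds).
    return call_llm(model=BIG_MODEL, prompt="Improve this prompt:\n\n" + draft_prompt)

def answer_in_production(refined_prompt: str, user_turn: str, call_llm) -> str:
    # Serve with the fast model so the voice agent responds without an awkward pause.
    return call_llm(model=FAST_MODEL, prompt=refined_prompt + "\nUser: " + user_turn)
```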
Another trick, maybe less sophisticated: as the prompt gets longer and longer, it becomes a large working doc. One thing I found useful is, as you're using it, if you just note down in a Google doc the things that you're seeing, whether the outputs aren't how you want them or ways you can think to improve it, you can write those in note form and then give Gemini Pro your notes plus the original prompt and ask it to suggest a bunch of edits to the prompt to incorporate them well. And it does that quite well.12:44
The other trick is, in Gemini 2.5 Pro, if you look at the thinking traces as it's parsing through the evaluation, you can actually learn a lot about all those misses as well. We've done that internally as well, right?12:59
And this is critical, because if you're just using Gemini via the API, until recently you did not get the thinking traces. And the thinking traces are the critical debugging information to understand what's wrong with your prompt. They just added it to the API, so you can now actually pipe that back into your developer tools and workflows.13:18
Yeah, I think it's an underrated consequence of Gemini Pro having such long context windows that you can effectively use it like a REPL. You go sort of one by one: put your prompt in with one example, then literally watch the reasoning trace in real time to figure out how you can steer it in the direction you want.13:36
Jared and the software team at YC have actually built various forms of workbenches that allow us to do debugging and things like that. But to your point, sometimes it's better just to use gemini.google.com directly and literally drag and drop JSON files, and you don't have to do it in some sort of special container. It seems to be something that works even directly in ChatGPT itself.14:07
Yeah, this is all stuff where I would give a shout-out to YC's head of data, Eric Bacon, who's helped us all a lot. A lot of this is meta-prompting and using Gemini 2.5 Pro effectively as a REPL.14:18
What about evals? I mean, we've talked about evals for going on a year now. What are some of the things that founders are discovering?14:27
Even though we've been saying this for a year or more now, Garry, I think it's still the case that evals are the true crown-jewel data asset for all of these companies. One reason that ParaHelp was willing to open source the prompt is they told me that they actually don't consider the prompts to be the crown jewels. The evals are the crown jewels, because without the evals, you don't know why the prompt was written the way that it was, and it's very hard to improve it.14:55
Yeah, and in the abstract, you can think about it like this: YC funds a lot of companies, especially in vertical AI and SaaS, and you can't get the evals unless you're sitting literally side by side with people who are doing X, Y, or Z knowledge work. You know, you need to sit next to the tractor sales regional manager and understand: this is how this person gets promoted, this is what they care about, this is that person's reward function. And then what you're doing is taking these in-person interactions, sitting next to someone in Nebraska, and then going back to your computer and codifying it into very specific evals. Like, this particular user wants this outcome: after this invoice comes in, we have to decide whether we're going to honor the warranty on this tractor, just to take one example.15:48
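A sketch of what codifying one of those field observations into an eval case might look like. The warranty scenario, field names, and grading rule are invented for illustration.

```python
# Illustrative eval cases for the tractor-warranty example; the scenario and fields are invented.
from dataclasses import dataclass

@dataclass
class EvalCase:
    name: str
    input: dict      # e.g. the incoming invoice / warranty claim
    expected: str    # the outcome the regional manager actually wants

TRACTOR_WARRANTY_EVALS = [
    EvalCase(
        name="honor_warranty_within_coverage",
        input={"claim": "hydraulic pump failure", "months_since_sale": 10, "coverage_months": 24},
        expected="approve",
    ),
    EvalCase(
        name="deny_warranty_after_coverage",
        input={"claim": "hydraulic pump failure", "months_since_sale": 30, "coverage_months": 24},
        expected="deny",
    ),
]

def run_evals(agent, cases) -> float:
    # `agent` is whatever callable wraps your prompt; the score is what you track as you change it.
    passed = sum(1 for case in cases if agent(case.input) == case.expected)
    return passed / len(cases)
```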
That's the true value. Everyone's really worried about, are we just wrappers? What is going to happen to startups? And I think this is literally where the rubber meets the road: if you are out there in particular places, understanding that user better than anyone else, and having the software actually work for those people.16:12
That's the moat. It's such a perfect depiction of what the core competency required of founders today is. Literally, the thing that you just said: that's your job as a founder of a company like this, to be really good at that thing and maniacally obsessed with the details of the regional tractor sales manager's workflow. Yeah.16:32
And then the wild thing is it's very hard to do. Like, you know, have you even been to Nebraska? The classic view is that the best founders in the world are really great practical engineers and technologists and just really brilliant. And then at the same time, they have to understand some part of the world that very few people understand. And then there's this little sliver that is the founder of a multi-billion-dollar startup. I think of Ryan Petersen from Flexport: a really, really great person who understands how software is built, but then also, I think he was the third-biggest importer of medical hot tubs for an entire year, like, you know, a decade ago.17:15
So, you know, the weirder that is, the more of the world that you've seen that nobody else who's a technologist has seen, the greater the opportunity, actually.17:25
I think you've put this in a really interesting way before, Garry, where you were sort of saying that every founder has become a forward deployed engineer. That's a term that traces back to Palantir. And since you were early at Palantir, maybe tell us a little bit about how forward deployed engineers became a thing at Palantir and what founders can learn from it now?17:41
I mean, I think the whole thesis of Palantir at some level was this. If you look at Meta, back then it was called Facebook, or Google, or any of the top software startups that everyone sort of knew back then, one of the key recognitions that Peter Thiel and Alex Karp and Stephen Cohen and Joe Lonsdale and Nathan Gettings, the original founders of Palantir, had was: go into anywhere in the Fortune 500, go into any government agency in the world, including the United States, and nobody who understands computer science and technology at the highest possible level would ever even be in that room.18:23
And so Palantir's really, really big idea, which they discovered very early, was that the problems those places face are actually multi-billion-dollar, sometimes trillion-dollar problems. And this is well before AI became a thing. I mean, people were sort of talking about machine learning, but back then they called it data mining. You know, the world is awash in data, these giant databases of people and things and transactions, and we have no idea what to do with it. That's what Palantir was, is, and still is: you can go and find the world's best technologists who know how to write software to actually make sense of the world.19:04
You have these petabytes of data and you don't know how to find the needle in the haystack. The wild thing is, going on something like 20, 22 years later, it's only become more true that we have more and more data and less and less of an understanding of what's going on. And it's no mistake that, now that we have LLMs, it is becoming much more tractable. And then the forward deployed engineer title was specifically: how do you sit next to literally the FBI agent who's investigating domestic terrorism? How do you sit right next to them in their actual office and see what a case coming in looks like?19:46
What are all the steps? When you actually need to go to the federal prosecutor, what are the things that they're sending? What's funny is literally it's like Word documents and Excel spreadsheets, right? And what you do as a forward deployed engineer is take these sort of file cabinet and fax machine things that people have to do and then convert it into really clean software. So the classic view is that it should be as easy to actually do uh, an investigation at a three-letter agency as going and taking a photo of your lunch on Instagram and posting it to all your friends.20:22
Like that's, you know, kind of the funniest part of it. And so, you know, I think it's no mistake today that forward deployed engineers who came up through that system at Palantir are turning out to be some of the best founders at YC, actually.20:34
Yeah. I mean, you produced an incredible number of startup founders, because the training to be a forward deployed engineer is exactly the right training to be a founder of these companies now. The other interesting thing about Palantir is other companies would send a salesperson to go and sit with the FBI agent, and Palantir sent engineers to go and do that. I think Palantir was probably the first company to really institutionalize that and scale it as a process, right?20:58
Yeah. I mean, I think what happened there, the reason why they were able to get these seven and eight and now nine-figure contracts very consistently is that instead of sending someone who's like hair and teeth and they're in there and, you know, let's go to the steakhouse, you know, it's all like relationship. And you'd have one meeting, they would really like the salesperson. And then through sheer force of personality, you try to get them to give you a seven figure contract. And like the time scales on this would be, you know, six weeks, 10 weeks, 12 weeks, like five years, I don't know.21:31
It's like, and the software would never work. Whereas if you put an engineer in there, and you give them Palantir Foundry, which is what they now call their core data viz and data mining suite, instead of the next meeting being reviewing 50 pages of sales documentation or a contract or a spec or anything like that, it's literally like, okay, we built it. And then you're getting real live feedback within days. And I mean, that's honestly the biggest opportunity for startup founders. If startup founders can do that, and that's what forward deployed engineers are used to doing, that's how you could beat a Salesforce or an Oracle or, you know, a Booz Allen or literally any company out there that has a big office and, you know, big fancy salespeople with big strong handshakes. And it's like, how does a really good engineer with a weak handshake go in22:16
there and beat them? Well, it's actually, you show them something that they've never seen before and like make them feel super heard. You have to be super empathetic about it. Like you actually have to be a great designer and product person. and then come back and you can just blow them away. The software is so powerful that the second you see something that makes you feel seen, you want to buy it on the spot.22:50
Is a good way of thinking about it that founders should think of themselves as the forward deployed engineers of their own company?22:56
Absolutely, yeah. You definitely can't farm this out. Literally, the founders themselves, they're technical. They have to be the great product people. They have to be the ethnographer. They have to be the designer. You want the person on the second meeting to see the demo you put together based on the stuff you heard, and you want them to say, wow, I've never seen anything like that and take my money.23:18
I think the incredible thing about this model, and this is why we're seeing a lot of the vertical AI agents take off, is precisely this: they can have these meetings with the end buyer and champion at these big enterprises, they take that context, and then they stuff it basically into the prompt. And then they can quickly come back to a meeting just the next day; with Palantir it maybe would have taken a bit longer and a team of engineers. Here it can be just the two founders going in, and then they close these six-, seven-figure deals, which we've seen, with large enterprises, which has never been done before.23:55
And it's just possible with this new model of forward deployed engineer plus AI; it's just accelerating.24:04
It reminds me of a company I mentioned before on the podcast, GigaML, who do customer support, and especially a lot of voice support. And it's just a classic case of two extremely talented software engineers, not natural salespeople, but they forced themselves to be essentially forward deployed engineers. And they closed a huge deal with Zepto and then a couple of other companies they can't announce yet.24:27
Do they physically go on site like the Palantir model? Yes.24:30
So they did all of that, where once they close the deal, they go on site and they sit there with all the customer support people, figuring out how to keep tuning and getting the software, or the LLM, to work even better. But before that, even to win the deal, what they found is that they can win by just having the most impressive demo. And in their case, they've innovated a bit on the RAG pipeline so that their voice responses can be both accurate and very low latency, which is sort of a technically challenging thing to do.25:00
I just feel like in the past, pre the current LLM rise, you couldn't necessarily differentiate enough in the demo phase of sales to beat out an incumbent. You couldn't really beat Salesforce by having a slightly better CRM with a better UI. But now, because the technology evolves so fast and it's so hard to get this last 5 to 10% correct, you can actually, if you're a forward deployed engineer, go in, do the first meeting, tweak it so that it works really well for that customer, go back with the demo, and just get that oh-wow-we've-not-seen-anyone-else-pull-this-off-before experience and close huge deals.25:33
And that was the exact same case with Happy Robot, who have sold seven-figure contracts to the top three largest logistics brokers in the world. They've built AI voice agents for that. They are the ones doing the forward deployed engineer model and talking to the CIOs of these companies, quickly shipping a lot of product; they have very quick turnaround, and it's been incredible to see that take off right now. And it started from six-figure deals and now they're closing seven-figure deals, which is crazy. This is just a couple of months after. So that's the kind of stuff that you can do with, I mean, unbelievably very, very smart prompt engineering, actually. Well, one of the things that's kind of interesting about26:16
each model is that they each seem to have their own personality, and one of the things founders are really realizing is that you're going to go to different people for different things, actually.26:27
One of the things that's pretty well known is that Claude is sort of the happier, more human-steerable model. And the other one, Llama 4, is one that needs a lot more steering. It's almost like talking to a developer. And part of it could be an artifact of not having done as much RLHF on top of it. So it's a bit rougher to work with, but you can actually steer it very well if you're good at doing a lot of prompting. You're almost doing a bit more RLHF yourself, but it's a bit harder to work with, actually.27:02
Well, one of the things we've been using LLMs for internally is actually helping founders figure out who they should take money from. And in that case, sometimes you need a very straightforward rubric: zero to a hundred, zero being never, ever take their money, and a hundred being take their money right away, like they actually help you so much that you'd be crazy not to take their money. Harj, we've been working on some scoring rubrics around that using prompts. What are some of the things we've learned?27:32
So it's certainly best practice to give LLMs rubrics, especially if you want to get a numerical score as the output; you want to give it a rubric to help it understand how it should think through what's an 80 versus a 90. But these rubrics are never perfect. There are almost always exceptions.27:47
And you tried it with o3 versus Gemini 2.5, and you found discrepancies.27:52
This is what we found really interesting: you can give the same rubric to two different models. And in our specific case, what we found is that o3 was very rigid, actually. It really sticks to the rubric. It heavily penalizes anything that doesn't fit the rubric you've given it, whereas Gemini 2.5 Pro was actually quite good at being flexible, in that it would apply the rubric but could also almost reason through why someone might be an exception, or why you might want to push something up more positively or negatively than the rubric might suggest. Which I just thought was really interesting, because it's just like when you're training a person: you give them a rubric, you want them to use the rubric as a guide, but there are always these sorts of edge cases where you need to think a little bit28:28
more deeply. And I just thought it was interesting that the models themselves will handle that differently, which means they sort of have different personalities, right? Like o3 felt a little bit more like the soldier: okay, I'm definitely going to check, check, check, check, check. And Gemini 2.5 Pro felt a little bit more like a high-agency sort of employee, like, oh, okay, I think this makes sense, but this might be an exception in this case. Which was really interesting to see.29:01
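A sketch of a numeric rubric prompt that leaves explicit room for reasoned exceptions, in the spirit of the scoring discussion above. The bands, wording, and output shape are invented for illustration.

```python
# Illustrative rubric prompt with explicit room for reasoned exceptions; the bands are invented.
RUBRIC_PROMPT = """
Score this investor from 0 to 100 for "should a founder take their money".

Rubric (a guide, not a straitjacket):
- 90-100: immaculate process, fast clear responses, strong track record
- 70-89:  strong track record, minor process issues (slow replies, occasional ghosting)
- 40-69:  mixed signals or thin track record
- 0-39:   serious red flags

If the investor is a genuine exception to the rubric, you may deviate, but you must say
which band the rubric alone would give and why you moved the score.

Return JSON: {"score": <int>, "rubric_band": "<band>", "exception_reasoning": "<string or null>"}
"""
```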
Yeah, it's funny to see that for investors. Sometimes you have investors like a Benchmark or a Thrive, and it's like, yeah, take their money right away. Their process is immaculate. They never ghost anyone. They answer their emails faster than most founders. It's very impressive. And then one example here might be: there are plenty of investors who are just overwhelmed, and maybe they're just not that good at managing their time. And so they might be really great investors, and their track record bears that out, but they're sort of slow to get back. They seem overwhelmed all the time.29:31
They accidentally, probably not intentionally, ghost people. And so this is legitimately exactly what an LLM is for. The debug info on some of these is very interesting to see. Maybe it's a 91 instead of an 89. We'll see. I guess one of the things that's been really surprising to me, as we ourselves are playing with it and we spend maybe 80 to 90 percent of our time with founders who are all the way out on the edge, is, you know, on the one hand, the analogy I think even we use to discuss this is that it's kind of like coding.30:04
It kind of actually feels like coding in, you know, 1995, like the tools are not all the way there. There's a lot of stuff that's unspecified. We're, you know, in this new frontier. But personally, it also kind of feels like learning how to manage a person, where it's like, how do I actually communicate the things that they need to know in order to make a good decision? And how do I make sure that they know how I'm going to evaluate and score them? And not only that, there's this aspect of kaizen, you know, this manufacturing technique that created really, really good cars for Japan in the 90s.30:43
And that principle actually says that the people who are the absolute best at improving the process are the people actually doing it. That's literally why Japanese cars got so good in the 90s. And that's meta-prompting to me. So, I don't know. It's a brave new world. We're sort of in this new moment. So with that, we're out of time, but can't wait to see what kind of prompts you guys come up with, and we'll see you next time.