_pdp_ a day ago

I started a company in this space about 2 years ago. We are doing fine. What we've learned so far is that a lot of these techniques are simply optimisations to tackle some deficiency in LLMs that is a problem "today". These are not going to be problems tomorrow because the technology will shift, as it has many times over the span of the last 2 years.

So yah, cool, caching all of that... but give it a couple of months and a better technique will come out - or more capable models.

Many years ago, when disk encryption on AWS was not an option, my team and I had to spend 3 months coming up with a way to encrypt the disks, and to do it well, because at the time there was no standard way. It was very difficult, as it required pushing encrypted images (as far as I remember). Soon after we started, AWS introduced standard disk encryption that you can turn on by clicking a button. We wasted 3 months for nothing. We should have waited!

What I've learned from this is that oftentimes it is better to do absolutely nothing.

  • siva7 a day ago

    This is the most important observation. I'm getting so many workshop invitations from my corporate colleagues about AI and agents. What most people don't get is that these clever patterns they "invented" will be obsolete next week. This nice company blog about agents - the one which went viral recently - will be obsolete next month. It's hard for my colleagues to swallow that this age isn't like when you studied the Gang of Four or a software architecture pattern book and came away with a common language - no, these days the half-life of an AI pattern is about a week. Even when you ask 10 professionals what an agent actually is, you will get 10 different answers, yet each assumes that how they use it is the common understanding.

    • Vinnl a day ago

      This is also why it's perfectly fine to wait out this AI hype and see what sticks afterward. It probably won't cost too much time to catch up, because at that point everyone who knows what they're doing only learned that a month or two ago anyway.

      • credit_guy 2 hours ago

        > It probably won't cost too much time to catch up

        That's a risky bet. It is more likely that the user interface of AI will evolve. Some things will stick, some will not. Three years from now, many things that are clunky now will be replaced by more intuitive things. But some things that already work now will still be in place. People who have been heavy users of AI between now and then will definitely have a head start on those who only start then.

    • lowbloodsugar a day ago

      Counterpoint to these two posts: a journeyman used to have to make his own tools. He could easily have bought them, or his master could have made them. Making your own tools gives you vastly greater skill when using the tools. So I know how fast AI agents and model APIs are evolving, but I’m writing them anyway. Every break in my career has been someone telling me it’s impossible and then me doing it anyway. If you use an agent framework, you really have no idea how artificially constrained you are. You’re so constrained, and yet you are oblivious to it.

      On the “wasting three months” remark (GP): if it’s a key value proposition, just do it. Don’t wait. If it’s not a key value prop, then don’t do it at all. Oftentimes what I’ve built has been better tailored to our product than what AWS built.

      • hammock 20 hours ago

        You can make your own hand plane, and you will be a better woodworker for it. Still, in a few months your competition will be using electric planes and routers.

        • DrewADesign 20 hours ago

          The cult of efficiency aims to turn craftsmanship into something that only concerns hobbyists. Everything else is optimizing money in vs money out to get as close as possible to revenue being deposited directly into shareholders' bank accounts.

    • hibikir a day ago

      Note that even many of those "long knowledge" things people learned are obsolete today, but the people who follow them just haven't figured it out yet. See how many of those object oriented design patterns look very silly the minute you use immutable data structures and have access to functional programming constructs in your language - and nowadays most languages do. Many seminal books on how to program from the early 2000s, especially those covering "pure" OO, look quite silly today.

    • AYBABTME 9 hours ago

      And yet despite being largely obsolete in the specifics, gang of four remains highly relevant and useful in the generalities. All these books continue to be absolutely great foundations if you look past their immediate advice.

      I wager the same will hold for AI agent techniques.

  • lelanthran a day ago

    > I've started a company in this space about 2 years ago. We are doing fine.

    You have a positive cash flow from sales of agents? Your revenue exceeds your operating costs?

    I've been very skeptical that it is possible to make money from agents, having seen how difficult it was for the current well-known players to do so.

    What is your secret sauce?

    • yayitswei 13 hours ago

      I'm cash flow positive on my SMS sales agent, it serves just one client and my revenue is at least 3x the cost of inference/hosting.

      Imo the key is to serve one use case really well rather than overgeneralize.

    • nvader a day ago

      Bumping for interest too. Would love to hear what you believe is correlated to success.

  • gchamonlive a day ago

    I think knowing when to do nothing is being able to evaluate if the problem the team is tackling is essential or tangential to the core focus of the project, and also whether the problem is something new or if it's been around for a while and there is still no standard way to solve it.

    • gessha a day ago

      Yeah, that will be the make-it-or-break-it moment, because if it’s too essential, it will be implemented, but if it’s not, it may become a competitive advantage.

  • ramraj07 19 hours ago

    Vehemently disagree. We implemented our own context-editing features 4 months back. Last month Claude released a very similar feature set to the one we'd had all along. We were still glad we did it because (A) it took me half a day to do that work, (B) our solution is still more powerful for our use case, and (C) our solution works on other models as well.

    It all comes down to trying to predict what your vendors' roadmap will be (or, if you're savvy, getting a peek into it) and whether the feature you want to create is fundamental to your application's behavior (I doubt encryption is unless you're a storage company).

  • nrhrjrjrjtntbt 20 hours ago

    If you wait long enough in AI, they may not need your agent at all; they'll just use OpenAI directly.

    • DrewADesign 20 hours ago

      These days it seems like training yourself into a specialty that provides steadyish income for a year before someone obliterates your professional/corporate/field’s scaffolding with AI and you have to start over is kind of a win. Doesn’t it feel like a win? Look at the efficiency!

  • wolttam 5 hours ago

    This has been my intuition with these models since close to the beginning.

    Any framework you build around the model is just behaviour that can be trained into the model itself

  • nowittyusername a day ago

    I agree with the sentiment. Things are moving so fast that waiting now is a legitimate strategy. Though it is also easy to fall into the trap of: well, if we continue along these lines, we might as well wait 4-5 years and we get AGI. Which, while still true imo, does feel off, as you aren't participating in the process.

  • an0malous a day ago

    > These are not going to be problems tomorrow because the technology will shift, as it has many times over the span of the last 2 years.

    What technology shifts have happened for LLMs in the last 2 years?

    • dcre a day ago

      One example is that there used to be a whole complex apparatus around getting models to do chain of thought reasoning, e.g., LangChain. Now that is built in as reasoning and they are heavily trained to do it. Same with structured outputs and tool calls — you used to have to do a bunch of stuff to get models to produce valid JSON in the shape you want, now it’s built in and again, they are specifically trained around it. It used to be you would have to go find all relevant context up front and give it to the model. Now agent loops can dynamically figure out what they need and make the tool calls to retrieve it. Etc etc.
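
      A minimal sketch of what "built in" means here, using the OpenAI Python SDK (the tool name and schema are made-up placeholders, not anything from a real system):

          from openai import OpenAI

          client = OpenAI()

          tools = [{
              "type": "function",
              "function": {
                  "name": "get_invoice_total",  # hypothetical tool
                  "description": "Look up the total for an invoice by id.",
                  "parameters": {
                      "type": "object",
                      "properties": {"invoice_id": {"type": "string"}},
                      "required": ["invoice_id"],
                  },
              },
          }]

          resp = client.chat.completions.create(
              model="gpt-4o",  # any tool-capable model
              messages=[{"role": "user", "content": "What's the total on invoice 42?"}],
              tools=tools,
          )

          # The model returns a structured tool call; no prompt gymnastics
          # or hand-rolled JSON parsing required.
          print(resp.choices[0].message.tool_calls)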

      • mewpmewp2 a day ago

        LangChain generally felt pointless for me to use - not a good abstraction. It keeps you away from the most important thing you need in this fast-evolving ecosystem: a direct, prompt-level (if you can even call that low-level) understanding of what is going on.

      • vinibrito 16 hours ago

        For JSON I agree, now I can just mention JSON and provide examples and the response always comes in the right format, but for tool calling and information retrieval I have never seen a system actually work, nor in my tests have these worked.

        Now, I'm open to the idea that I am just using it wrong, but I have seen several reports around the web that the most that people got in tool calling accuracy is 80%, which is unusable for any production system, also for info retrieval I have seen it lose coherence the more data is available overall.

        Is there a model that actually achieved 100% tool calling accuracy?

        So far I've built systems for that myself, surrounding the LLM, and only that way has it worked well in production.
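
        One shape such a surrounding system can take - a sketch only, with a hypothetical tool schema and an ask_model callback standing in for the actual LLM call - is to validate the model's arguments and feed errors back for a bounded number of retries:

            import json
            import jsonschema

            ARGS_SCHEMA = {  # schema for a hypothetical "create_ticket" tool
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "priority": {"enum": ["low", "high"]},
                },
                "required": ["title", "priority"],
            }

            def call_tool_with_retries(ask_model, max_attempts=3):
                """ask_model(feedback) returns the model's raw JSON argument string."""
                feedback = None
                for _ in range(max_attempts):
                    raw = ask_model(feedback)
                    try:
                        args = json.loads(raw)
                        jsonschema.validate(args, ARGS_SCHEMA)
                        return args  # only accept arguments that pass validation
                    except (json.JSONDecodeError, jsonschema.ValidationError) as e:
                        feedback = f"Invalid arguments: {e}. Please try again."
                raise RuntimeError("Tool arguments never passed validation")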

    • postalcoder a day ago

      If we expand this to 3 years, the single biggest shift that totally changed LLM development is the increase in size of context windows from 4,000 to 16,000 to 128,000 to 256,000.

      When we were at 4,000 and 16,000 context windows, a lot of effort was spent on nailing down text splitting, chunking, and reduction.

      For all intents and purposes, the size of current context windows obviates all of that work.

      What else changed?

      - Multimodal LLMs - Text extraction from PDFs was a major issue for RAG/document intelligence. A lot of time was wasted trying to figure out custom text extraction strategies for documents. Now, you can just feed the image of a PDF page into an LLM and get back a better transcription (a short sketch follows below).

      - Reduced emphasis on vector search. People have found that for most purposes, having an agent grep your documents is cheaper and better than using a more complex RAG pipeline. Boris Cherny created a stir when he talked about Claude Code doing it that way[0].

      https://news.ycombinator.com/item?id=43163011#43164253
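
      For the multimodal point above, a minimal sketch of the "just feed the page image to the model" approach using the OpenAI Python SDK (the file name is a placeholder; the page is assumed to already be rendered to a PNG, e.g. with pdftoppm or PyMuPDF):

          import base64
          from openai import OpenAI

          client = OpenAI()

          with open("page_1.png", "rb") as f:  # pre-rendered PDF page
              b64 = base64.b64encode(f.read()).decode()

          resp = client.chat.completions.create(
              model="gpt-4o",  # any vision-capable model
              messages=[{
                  "role": "user",
                  "content": [
                      {"type": "text",
                       "text": "Transcribe this page to markdown, preserving tables."},
                      {"type": "image_url",
                       "image_url": {"url": f"data:image/png;base64,{b64}"}},
                  ],
              }],
          )
          print(resp.choices[0].message.content)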

      • yesb 4 hours ago

        >For all intents and purposes, the size of current context windows obviates all of that work.

        Large context windows can make some problems easier or go away for sure. But you may still have the same issue of getting the right information to the model. If your data is much larger than e.g. 256k tokens you still need to filter it. Either way, it can still be beneficial (cost, performance, etc.) to filter out most of the irrelevant information.

        >Reduced emphasis on vector search. People have found that for most purposes, having an agent grep your documents is cheaper and better than using a more complex rag pipeline

        This has been obvious from the beginning for anyone familiar with information retrieval (R in RAG). It's very common that search queries are looking for exact matches, not just anything with similar meaning. Your linked example is code search. Exact matches/regex type of searches are generally what you are looking for there.

    • throwaway13337 a day ago

      I'm amazed at this question and the responses you're getting.

      These last few years, I've noticed that the tone around AI on HN changes quite a bit by waking time zone.

      EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.

      It's really puzzling to me. This is the first time I've noticed such a disconnect in the community about what the reality of things is.

      To answer your question personally, genAI has changed the way I code drastically about every 6 months in the last two years. The subtle capability differences change what sorts of problems I can offload. The tasks I can trust them with get larger and larger.

      It started with better autocomplete, and now, well, agents are writing new features as I write this comment.

      • GoatInGrey a day ago

        The main line of contention is how much autonomy these agents are capable of handling in a competitive environment. One side generally argues that they should be fully driven by humans (i.e. offloading tedious tasks you know the exact output of but want to save time not doing) while the other side generally argues that AI agents should handle tasks end-to-end with minimal oversight.

        Both sides have valid observations in their experiences and circumstances. And perhaps this is simply another engineering "it depends" phenomenon.

      • bdangubic a day ago

        the disconnect is quite simple: there are people who are professionals and are willing to put the time in to learn, and then there’s the vast majority of others who don’t and will bitch and moan about how it is shit etc. if you can’t get these tools to make your job easier and more productive, you ought to be looking for a different career…

        • overfeed a day ago

          You're not doing yourself any favors by labeling people who disagree with you undereducated or uninformed. There are enough over-hyped products/techniques/models/magical-thinking to warrant skepticism. At the root of this thread is an argument (paraphrasing) encouraging people to just wait until someone solves major problems instead of tackling them themselves. This is a broad statement of faith, if I've ever seen one, in a very religious sense: "Worry not, the researchers and foundation models will provide."

          My skepticism, and my intuition that AI innovations are not exponential but sigmoid, is not because I don't understand what gradient descent, transformers, RAG, CoT, or multi-head attention are. My statement of faith is: the ROI economics are going to catch up with the exuberance way before AGI/ASI is achieved; sure, you're getting improving agents for now, but that's not going to justify the 12- or 13-digit USD investments. The music will stop, and improvements will slow to a drip.

          Edit: I think at its root, the argument is between folk who think AI will follow the same curve as past technological trends, and those who believe "it's different this time".

          • bdangubic a day ago

            > labeling people who disagree with you undereducated or uninformed

            I did neither of these two things... :) I personally could not care about

            - (over)hype

            - 12/13/14/15 ... digit USD investment

            - exponential vs. sigmoid

            There are basically two groups of industry folk:

            1. those that see technology as absolutely transformational and are already doing amazeballs shit with it

            2. those that argue how it is bad/not-exponential/ROI/...

            If I was a professional (I am) I would do everything in my power to learn everything there is to learn (and then more) and join the Group #1. But it is easier to be in Group #2 as being in Group #1 requires time and effort and frustrations and throwing laptop out the window and ... :)

            • overfeed 14 hours ago

              > I did neither of those 2 things

              >> ...there are people that are professionals and are willing to put the time in to learn and then there’s vast majority of others who don’t...

              • bdangubic 9 hours ago

                being lazy and unprofessional is just a tad different than uneducated :)

            • overfeed a day ago

              A mutually exclusive group 1 & group 2 are a false dichotomy. One can have a grasp on the field and keep up to date with recent papers, have an active Claude subscription, use agents and still have a net-negative view of "AI" as a whole, considering the false promises, hucksters, charlatans and an impending economic reckoning.

              tl;dr version: having negative view of the industry is decoupled from one's familiarity with, and usage of the tools, or the willingness to learn.

              • bdangubic a day ago

                > considering the false promises, hucksters, charlatans and an impending economic reckoning.

                I hack for a living. I could hardly give two hoots about “false promises” or “hucksters” or some “impending economic reckoning…” I made a general comment that a whole lot of people simply discount technology on technical grounds (a favorite here on HN)…

                • overfeed 21 hours ago

                  > I could hardly give two hoots about “false promises” or “hucksters”

                  I suppose this is the crux of our misunderstanding: I deeply care about the long-term health and future of the field that gave me a hobby that continues to scratch a mental itch with fractal complexity/details, a career, and more money than I ever imagined.

                  > or some “impending economic reckoning…”

                  I'm not going to guess if you missed the last couple of economic downturns or rode them out, but an economic reckoning may directly impact your ability to hack for a living, that's the thing you prize.

                • __loam 21 hours ago

                  You should give a shit about how the field is perceived because that affects your ability to make a living whether you care about it or not

                  • bdangubic 33 minutes ago

                    yea the “ai” is going to ruin this amazing reputation the field had before it :)

            • gmm1990 a day ago

              If there is really amazing stuff happening with this technology, how did we have two recent major outages that were caused by embarrassing problems? I would guess that, at least in the Cloudflare instance, some of the responsible code was AI generated.

              • ctoth a day ago

                > I would guess that at least in the cloud flare instance some of the responsible code was ai generated

                Your whole point isn't supported by anything but ... a guess?

                If given the chance to work with an AI who hallucinates sometimes or a human who makes logical leaps like this

                I think I know what I'd pick.

                Seriously, just what even? "I can imagine a scenario where AI was involved, therefore I will treat my imagination as evidence."

                • gmm1990 19 hours ago

                  The whole point is that the outages happened, not that the AI code caused them. If AI is so useful/amazing, then these outages should be less common, not more. It’s obviously not rock-solid evidence. Yeah, AI could be useful and speed up or even improve a code base, but there isn’t any evidence that it’s actually improving anything; the only real studies point to imagined productivity improvements.

                • __loam 21 hours ago

                  Microsoft is saying they're generating 30% of their code now and there's clearly been a lot of stability issues with Windows 11 recently that they've publicly acknowledged. It's not hard to tell a story that involves layoffs, increased pressure to ship more code, AI tools, and software quality issues. You can make subtle jabs about your peers as much as you want but that isn't going to change public perception when you ship garbage.

              • bdangubic a day ago

                good thing before “ai” when humans coded we had many decades of no outages… phew

            • wat10000 18 hours ago

              I see the first half of group 1, but where's the second half? Don't get me wrong, there's some cool and interesting stuff in this space, but nothing I'd describe as close to "amazeballs shit."

              • bdangubic 30 minutes ago

                you should see what I’ve seen (and many other people have too). after 30 years of watching humans do it (fairly poorly, as there is an extremely small percentage of truly great SWEs), the stuff I am seeing is ridiculously amazing

            • what 15 hours ago

              Can you show me some of the “amazeballs shit” people are doing with it?

            • __loam 21 hours ago

              Amazeballs shit yet precious little actual products.

              • bdangubic 18 minutes ago

                this is a tool; you use it to create, just like you use brushes and oils to paint a masterpiece. it is not a product in and of itself…

          • juped a day ago

            They're not logistic, this is a species of nonsense claim that irks me even more than claiming "capabilities gains are exponential, singularity 2026!"; it actually includes the exponential-gains claim and then tries to tack on epicycles to preempt the lack of singularities.

            Remember, a logistic curve is an exponential (so, roughly, a process whose outputs feed its growth, the classic example being population growth, where more population makes more population) with a carrying capacity (the classic example is again population, where you need to eat to be able to reproduce).

            Singularity 2026 is open and honest, wearing its heart on its sleeve. It's a much more respectable wrong position.

        • siva7 a day ago

          It's disheartening. I have a colleague, very senior, who dislikes AI for a myriad of reasons and doesn't want to adapt if not forced by mgmt. I feel that from 2022-2024 the majority of my colleagues were in this camp - either afraid of AI or looking at it as not something a "real" developer would ever use. In 2025 it seemed to change a bit. American HN seemed to adapt more quickly, while EU companies are still lacking the foresight to see what is happening on the grand scale.

          • wat10000 18 hours ago

            I'm pretty senior and I just don't find it very useful. It is useful for certain things (deep code search, writing non-production helper scripts, etc.) and I'm happy to use it for those things, but it still seems like a long way off for it to be able to really change things. I don't foresee any of my coworkers being left behind if they don't adopt it.

            • bdangubic 25 minutes ago

              senior as well, a few years from finishing up my career. I run 8 to 12 terminals all day. it is changing existing code and writing new stuff all day, every day. 100’s of thousands of lines of changed/added/removed code in production… and a lot fewer issues than when every line was typed in by me (or another human)

            • TheMayorOfDunce 8 hours ago

              AI gives you either free expertise or free time. If you can make software above the level of Gemini or Claude output, then have it write your local tools, or have it write synthetic data for tests, or have it optimize your zshrc or bash profile. Maybe have it implement the changes your skip-level wants to see made, which you know are amateurish, unsound garbage with a revolting UI. Rather than waste your day writing ill-advised but high-quality code just to show them why it’s a bad idea, you can have AI write the code for you, to illustrate your point without spending any real work hours on it.

              Just in my office, I have seen “small tools” like Charles Proxy almost entirely disappear. Everyone writes/shares their AI-generated solutions now rather than asking cyber to approve a 3rd party envfile values autoloader to be whitelisted across the entire organization.

      • GiorgioG a day ago

        Despite the latest and greatest models… I still see glaring logic errors in the code produced in anything beyond basic CRUD apps. They still make up fields that don’t exist and assign values to variables that are nonsensical. I’ll give you an example: in the code in question, Codex assigned a required field, LoanAmount, a value from a variable called assessedFeeAmount… simply because, as far as I can tell, it had no idea how to get the correct value from the current function/class.

        • lbreakjai a day ago

          That's why I don't get people who claim to be letting an agent run for an hour on some task. LLMs tend to make so many small errors like that, which are so hard to catch if you aren't super careful.

          I wouldn't want to have to review the output of an agent going wild for an hour.

          • snoman a day ago

            Who says anyone’s reviewing anything? I’m seeing more and more influencers and YouTubers playing engineer or just buying an app from an overseas app farm. Do you think anyone in that chain gives the first shit what the code is like?

            It’s the worst kind of disposable software.

        • brabel 10 hours ago

          If the LLM can test the code, it will fix those issues automatically. That’s how it can keep going for hours and produce something useful. You obviously still need to review the code and tests afterwards.

      • what 15 hours ago

        Can you show us the features AI wrote while you wrote this comment?

      • nickphx a day ago

        ai is useless. anyone claiming otherwise is dishonest

        • la_fayette a day ago

          I use GenAI for text translation, text-to-voice and voice-to-text; there it is extremely useful. For coding I often have the feeling it is useless, but sometimes it is useful - like most tools...

          • balder1991 14 hours ago

            Exactly, it’s really weird to see all this people claiming these wonderful things about LLMs. Maybe it’s really just different levels of amazement, but I understand how LLMs work, I actually use ChatGPT quite a bit for certain things (searching, asking some stuff I know it can find online, discuss ideas or questions I have etc.).

            But all the times I’ve tried using LLMs to help me code, it performs best when I give it some sample code (more or less isolated) and ask it for a certain modification that I want.

            More often than not, it makes seemingly random mistakes and I have to look at the details to see if there’s something I didn’t catch, so the smaller the scope, the better.

            If I ask for something more complex or more broad, it’s almost certain it will make many things completely wrong.

            At some point, it’s such a hard work to detail exactly what you want with all context that it’s better to just do it yourself, cause you’re writing a wall of text to have a one time thing.

            But anyway, I guess I remain waiting. Waiting until FreeBSD catches up with Linux, because it should be easy, right? The code is there in the Linux kernel, just tell an agent to port it to FreeBSD.

            I’m waiting for the explosion of open source software that isn’t bloated and that runs optimized, because I guess agents should be able to optimize code? I’m waiting for my operating system to get better over time instead of worse.

            Instead I noticed the last move from WhatsApp was to kill the desktop app to keep a single web wrapper. I guess maintaining different codebases didn’t get cheaper with the rise of LLMs? Who knows. Now Windows releases updates that break localhost. Ever since the rise of LLMs I haven’t seen software release features any faster, or any Cambrian explosion of open source software copying old commercial leaders.

        • whattheheckheck a day ago

          What are you doing at your job that AI can't help with at all, to consider it completely useless?

        • ghurtado a day ago

          That could even be argued (with an honest interlocutor, which you clearly are not)

          The usefulness of your comment, on the other hand, is beyond any discussion.

          "Anyone who disagrees with me is dishonest" is some kindergarten level logic.

        • ulfw a day ago

          [Deleted as Hackernews is not for discussion of divergent opinions]

          • wiseowise a day ago

            > It's not useless but it's not good for humanity as a whole.

            Ridiculous statement. Is Google also not good for humanity as a whole? Is Internet not good for humanity as a whole? Wikipedia?

            • Libidinalecon 6 hours ago

              I think it is an interesting thought experiment to try to visualize 2025 without the internet ever existing because we take it completely for granted that the internet has made life better.

              It seems pretty clear to me that culture, politics and relationships are all objectively worse.

              Even with remote work, I am not completely sure I am happier than when I used to go to the office. I know I am certainly not walking as much as I did when I would go to the office.

              Amazon is vastly more efficient than any kind of shopping in the pre-internet days but I can remember shopping being far more fun. Going to a store and finding an item I didn't know I wanted because I didn't know it existed. That experience doesn't exist for me any longer.

              Information retrieval has been made vastly more efficient, so instead of spending huge amounts of time at the library, I get all that back as free time. What I would have spent my free time on before the internet, though, has largely disappeared.

              I think we want to take the internet for granted because the idea that the internet is a long term, giant mistake is unthinkable to the point of almost having a blasphemous quality.

              Childhood? Wealth inequality?

              It is hard to see how AI as an extension of the internet makes any of this better.

            • Nition a day ago

              Chlorofluorocarbons, microplastics, UX dark patterns, mass surveillance, planned obsolescence, fossil fuels, TikTok, ultra-processed food, antibiotic overuse in livestock, nuclear weapons.

              It's a defensible claim I think. Things that people want are not always good for humanity as a whole, therefore things can be useful and also not good for humanity as a whole.

              • wiseowise 12 hours ago

                You’re delusional if you think LLMs/AI are in the ballpark of these. I’ve listed things in my comment for a reason.

      • the_mitsuhiko a day ago

        > EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.

        I don't think it's because the audience is different but because the moderators are asleep when Europeans are up. There are certain topics which don't really survive on the frontpage when moderators are active.

        • jagged-chisel a day ago

          I'm unsure how you're using "moderators." We, the audience, are all 'moderators' if we have the karma. The operators of the site are pretty hands-off as far as content in general.

          This would mean it is because the audience is different.

          • the_mitsuhiko a day ago

            I’m referring to the actual moderators of this website removing posts from the front page.

            • verdverm a day ago

              that's a conspiracy theory

              The by far more common action is for the mods to restore a story which has been flagged to oblivion by a subset of the HN community, where it then lands on the front page because it already has sufficient pointage

              • the_mitsuhiko a day ago

                It's not controversial to say that submissions are being moderated, that's how this (and many other) sites work. I haven't made any claims about how often it happens, or how it relates to second-chance moderation.

                What I'm pointing out is just that moderation isn't the same at different times of the day and that this sometimes can explain what content you see during EU and US waking hours. If you're active during EU daytime hours and US morning hours, you can see the pattern yourself. Tools like hnrankings [1] make it easy to watch how many top-10 stories fall off the front page at different times of day over a few days.

                [1]: https://hnrankings.info/

                • verdverm a day ago

                  > I’m referring to the actual moderators of this website removing posts from the front page.

                  This is what you said. There has only been one until this year, so now we have two.

                  The moderation patterns you see are the community and certainly have significant time factors that play into that. The idea that someone is going into the system and making manual changes to remove content is the conspiracy theory

          • uoaei a day ago

            The people who "operate" the website are different from the people who "moderate" the website but both are paid positions.

            This fru-fru about how "we all play a part" is only serving to obscure the reality.

            • delinka a day ago

              I'm sure this site works quite differently from what you say. There's no paid team of moderators flicking stories and comments off the site because management doesn't like them.

              There's dang who I've seen edit headlines to match the site rules. Then there's the army of users upvoting and flagging stories, voting (up and down) and flagging comments. If you have some data to backup your sentiments, please do share it - we'd certainly like to evaluate it.

              • verdverm a day ago

                HN brought on a second mod (Tim, this year, iirc)

                My email exchanges with Dang, as part of the moderation that happens around here, have all been positive

                1. I've been moderated, got a slowdown timeout for a while

                2. I've emailed about specific accounts, (some egregious stuff you've probably never seen)

                3. Dang once emailed me to ask why I flagged a story that was near the top, but getting heavily flagged by many users. He sought understanding before making moderation choices

                I will defend HN moderation people & policies 'til the cows come home. There is nothing close to what we have here on HN, which is largely about us being involved in the process and HN having a unique UX and size

              • uoaei 17 hours ago

                dang announced they were moved from volunteer to paid position a few years ago. More rumblings about more mods brought on since then. What makes you say you're "so sure"?

                • fragmede 17 hours ago

                  > There's no paid team of moderators flicking stories and comments off the site because management doesn't like them.

                  Emphasis mine. The question is does the paid moderation team disappear unfavorable posts and comments, or are they merely downranked and marked dead (which can still be seen by turning on showdead in your profile).

            • throwaway13337 a day ago

              As an anonymous coward on HN for at least a decade, I'd say that's not really true.

              When paul graham was more active and respected here, I spoke negatively about how revered he was. I was upvoted.

              I also think VC-backed companies are not good for society. And have expressed as much. To positive response here.

              We shouldn't shit on one of the few bastions of the internet we have left.

              I regret my negativity around pg - he was right about a lot and seems to be a good guy.

              • TheMayorOfDunce 8 hours ago

                Yeah, I avoid telling people about HN. It’s too rare and pleasant of an anomaly to risk getting morphed into X/Bluesky/Reddit.

        • jamesblonde a day ago

          Anything sovereign-AI related, or whatever, is gone immediately when the mods wake up. Got an EU cloud article? Publish it at 11am CET; it disappears around 12:30.

    • deepdarkforest a day ago

      On the foundational level: test-time compute (reasoning), heavy RL post-training, 1M+ context lengths, etc.

      On the application layer, connecting with sandboxes/VMs is one of the biggest shifts (Cloudflare's Code Mode, etc.). Giving an LLM a sandbox unlocks on-the-fly computation, calculations, RPA - anything, really.

      MCPs, or rather standardized function calling, are another one.

      Also, local LLMs are becoming almost viable because of better and better distillation, relying on quick web search for facts, etc.

    • WA a day ago

      Not the LLMs. The APIs got more capabilities such as tool/function calling, explicit caching etc.

      • dcre a day ago

        It is the LLMs because they have to be RLed to be good at these things.

    • echelon a day ago

      We started putting them in image and video models and now image and video models are insane.

      I think the next period of high and rapid growth will be in media (image, video, sound, 3D), not text.

      It's much harder to adapt LLMs to solving business use cases with text. Each problem is niche, you have to custom tailor the solution, and the tooling is crude.

      The media use cases, by contrast, are low hanging fruit and result in 10,000x speedups and cost reductions almost immediately. The models are pure magic.

      I think more companies would be wise to ignore text for now and focus on visual domain problems.

      Nano Banana has so much more utility than agents. And there are so many low hanging fruit ways to make lots of money.

      Don't sleep on image and video. That's where the growth salient is.

      • wild_egg a day ago

        > Nano Banana has so much more utility than agents.

        I am so far removed from multimedia spaces that I truly can't imagine a universe where this could be true. Agents have done incredible things for me and Nano Banana has been a cool gimmick for making memes.

        Anyone have a use case for media models that'll expand my mind here?

        • echelon a day ago

          We now have capacity to program and automate in the optics, signals, and spatial domains.

          As someone in the film space, here's just one example: we are getting extremely close to being able to make films with only AI tools.

          Nano Banana makes it easy to create character and location consistent shots that adhere to film language and the rules of storytelling. This still isn't "one shot", and considerable effort still needs to be put in by humans. Not unlike AI assistance in IDEs requiring a human engineer pilot.

          We're entering the era of two person film studios. You'll undoubtedly start seeing AI short films next year. I had one art school professor tell me that film seems like it's turning into animation, and that "photorealism" is just style transfer or an aesthetic choice.

          The film space is hardly the only space where these models have utility. There are so many domains. News, shopping, gaming, social media, phone and teleconference, music, game NPCs, GIS, design, marketing, sales, pitching, fashion, sports, all of entertainment, consumer, CAD, navigation, industrial design, even crazy stuff like VTubing, improv, and LARPing. So much of what we do as humans is non-text based. We haven't had effective automation for any of this until this point.

          This is a huge percentage of the economy. This is actually the beating heart of it all.

          • wild_egg 21 hours ago

            Been thinking about this. Curious why you positioned it as Nano Banana having more utility than agents when it seems like the next level even would be Nano Banana with agents?

            The two are kind of orthogonal concepts.

          • yunwal a day ago

            > we are getting extremely close to being able to make films with only AI tools

            AI still can’t reliably write text on background details. It can’t get shadows right. If you ask it to shoot things from a head on perspective, for example a bookshelf, it fails to keep proportions accurate enough. The bookshelf will not have parallel shelves. The books won’t have text. If in a library, the labels will not be in Dewey decimal order.

            It still lacks a huge amount of understanding about how the world works necessary to make a film. It has its uses, but pretending like it can make a whole movie is laughable.

            • wild_egg a day ago

              I don't think they're suggesting AI could one-shot a whole movie. It would be iterative, just like programming.

              • echelon 21 hours ago

                Exactly. You can still open the generations in Photoshop.

                I'd say the image and video tools are much further along and much more useful than AI code gen (not to dunk on code autocomplete). They save so much time and are quite incredible at what they can do.

            • gabriel666smith a day ago

              I don't think equating "extremely close" with "pretending like it can" is a fair way to frame the sentiment of the comment you were replying to. Saying something is close to doing something is not the same as saying it already can.

              In terms of cinema tech, it took us arguably until the early 1940s to achieve "deep focus in artificial light". About 50 years!

              The last couple of years of development in generative video looks, to me, like the tech is improving more quickly than the tech it is mimicking did. This seems unsurprising - one was definitely a hardware problem, and the other is most likely a mixture of hardware and software problems.

              Your complaints (or analogous technical complaints) would have been acceptable issues - things one had to work around - for a good deal of cinema history.

              We've already reached people complaining about "these book spines are illegible", which feels very close to "it's difficult to shoot in focus, indoors". Will that take four or five decades to achieve, based on the last 3 - 5 years of development?

              The tech certainly isn't there yet, nor am I pretending like it is, and nor was the comment you replied to. To call it close is not laughable, though, in the historical context.

              The much more interesting question is: At what point is there an audience for the output? That's the one that will actually matter - not whether it's possible to replicate Citizen Kane.

  • sethev a day ago

    I suspect you're right, but it's a bit discouraging to consider that an alternative way of framing this is that companies like OpenAI have a huge advantage in this landscape and anything that works will end up behind their API.

  • toddmorey a day ago

    In some ways, the fact that the technology will shift is the problem, as model behavior keeps changing. It's rather maddeningly unstable ground to build on. Really hard to gauge the impact to customer experience from a new model.

    • ares623 a day ago

      For a JS dev, it’s just another Tuesday

      • cheschire a day ago

        Is JS dev really still as mercurial as it was 5 to 10 years ago? I'm not so sure. Back then, there would be a new topic daily about some new JS framework, etc.

        I still occasionally see a blip of activity but I can't say it's anything like what we witnessed in the past.

        Though I will agree that gen AI trends feel reminiscent of that period of JS dev history.

        • biztos 16 hours ago

          I’m working on a couple apps using Typescript and for me (ex-JS hacker coming back to it after some years) it’s still an insane menu of bad choices and new “better” frameworks, some of which are abandoned before you get done reading the docs. Though I get that it probably moved faster a few years ago.

          I settled on what seemed like the most “standard” set of things (marketable skills blabla) and every week I read an article about how that stack is dead, and everybody supposedly uses FancyStack now.

          Adding insult to injury, I have relearned the fine art of inline styles. I assume table layouts are next.

          To lurch back on topic: I’m doing this for AI-related stuff and yes, the AI pace of change is much worse, but they sure do make a nice feedback loop.

        • snoman a day ago

          If it is, it’s entirely self inflicted today. There’s some tentpole tech that is reliable enough to stick with and get things done. Has been for a while.

  • verdverm a day ago

    Vendor choice matters.

    You could use the likes of Amazon / Anthropic, or use Google, which has had transparent disk encryption for 10+ years, and Gemini, which already has the transparent caching discussed here built in.

    • te_chris a day ago

      If you’ve spent any time with the vertex LLM apis you wouldn’t be so enthusiastic about using Google’s platform (I say this as someone who prefers GCP to aws for compute and networking).

      • verdverm a day ago

        been using it for years, no idea what you are getting at

        I've never had the downtime or service busy situations I've heard others complain about with other vendors.

        They did pricing based on chars back in the day, but now they are token based like everyone else.

        I like that they are building custom hardware that is industry leading in terms of limiting how much my AI usage impacts the environment.

        What do you think I shouldn't be enthusiastic about?

  • jFriedensreich a day ago

    exactly what my experience is too. we focus all our energy on the parts that will not be solved by someone else in a few months.

  • exe34 a day ago

    if we wait long enough, we just end up dead, so it turns out we didn't need to do anything at all whatsoever. of course there's a balance - oftentimes starting out and growing up with the technology gives you background and experience that gives you an advantage when it hits escape velocity.

moinism a day ago

Amen. I've been seeing these agent SDKs come out left and right for a couple of years and thought it'd be a breeze to build an agent. Now I've been trying to build one for ~3 weeks, and I've tried three different SDKs and a couple of architectures.

Here's what I found:

- Claude Code SDK (now called the Agent SDK) is amazing, but I think they are still in the process of decoupling it from Claude Code, and that's why a few things are weird. e.g., you can define a subagent programmatically, but not skills. Skills have to be placed in the filesystem and then referenced in the plugin. Also, only Anthropic models are supported :(

- OpenAI's SDK's tight coupling with their platform is a plus point. i.e, you get agents and tool-use traces by default in your dashboard. Which you can later use for evaluation, distillation, or fine-tuning. But: 2. They have agent handoffs (which works in some cases), but not subagents. You can use tools as subagents, though. 1. Not easy to use a third-party model provider. Their docs provide sample codes, but it's not as easy as that.

- Google Agent Kit doesn't provide any Typescript SDK yet. So didn't try.

- Mastra, even though it looks pretty sweet, spins up a server for your agent, which you can then use via REST API. umm.. why?

- SmythOS SDK is the one I'm currently testing because it provides flexibility in terms of choosing the model provider and defining your own architecture (handoffs or subagents, etc.). It has its quirks, but I think it'll work for now.

Question: If you don't mind sharing, what is your current architecture? Agent -> SubAgents -> SubSubAgents? Linear? or a Planner-Executor?

I'll write a detailed post about my learnings from architectures (fingers crossed) soon.

  • copypaper a day ago

    Every single SDK I've used was a nightmare once you get past the basics. I ended up just using an OpenRouter client library [1] and writing agents by hand without an abstraction layer. Is it a little more boilerplatey? Yea. Does it take more LoC to write? Yea. Is it worth it? 100%. Despite writing more code, the mental model is much easier (personally) to follow and understand.

    As for the actual agent I just do the following:

    - Get metadata from initial query

    - Pass relevant metadata to agent

    - Agent is a reasoning model with tools and output

    - Agent runs in a loop (max of n times). It will reason which tool calls to use

    - If there is a tool call, execute it and continue the loop

    - Once the agent outputs content, the loop is effectively finished and you have your output

    This is effectively a ReAct agent. Thanks to the reasoning being built in, you don't need an additional evaluator step.
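
    A compressed sketch of that loop (the comment above uses Go with go-openrouter; this sketch is Python pointed at OpenRouter's OpenAI-compatible endpoint, and the model id and single tool are placeholders):

        import json
        from openai import OpenAI

        # OpenRouter exposes an OpenAI-compatible API, so the stock SDK works.
        client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

        TOOLS = [{"type": "function", "function": {
            "name": "search_docs",  # hypothetical tool
            "description": "Search internal documents.",
            "parameters": {"type": "object",
                           "properties": {"query": {"type": "string"}},
                           "required": ["query"]}}}]

        def search_docs(query: str) -> str:
            return "(search results would go here)"  # DB query, subagent, anything

        def run_agent(task: str, max_steps: int = 8) -> str:
            messages = [{"role": "user", "content": task}]
            for _ in range(max_steps):                  # loop, max of n times
                msg = client.chat.completions.create(
                    model="anthropic/claude-sonnet-4",  # placeholder model id
                    messages=messages, tools=TOOLS,
                ).choices[0].message
                messages.append(msg)
                if not msg.tool_calls:                  # content only: we're done
                    return msg.content
                for call in msg.tool_calls:             # execute tools, keep looping
                    args = json.loads(call.function.arguments)
                    messages.append({"role": "tool", "tool_call_id": call.id,
                                     "content": search_docs(**args)})
            return "step limit reached"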

    Tools can be anything. It can be a subagent with subagents, a database query, etc. Need to do an agent handoff? Just output the result of the agent into a different agent. You don't need an SDK to do a workflow.

    I've tried some other SDKs/frameworks (Eino and langchaingo), and personally found it quicker to do it manually (as described above) than fight against the framework.

    [1]: https://github.com/reVrost/go-openrouter

  • peab a day ago

    I think the term sub-agent is almost entirely useless. An agent is an LLM loop that has reasoning and access to tools.

    A "sub agent" is just a tool. It's implantation should be abstracted away from the main agent loop. Whether the tool call is deterministic, has human input, etc, is meaningless outside of the main tool contract (i.e Params in Params out, SLA, etc)

    • moinism a day ago

      I agree, technically, "sub agent" is also another tool. But I think it's important to differentiate tools with deterministic input/output from those with reasoning ability. A simple 'Tool' will take the input and try to execute, but the 'subagent' might reason that the action is unnecessary and that the required output already exists in the shared context. Or it can ask a clarifying question from the main agent before using its tools.
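
      A tiny sketch of that distinction (the names are hypothetical, and run_agent stands for a plain agent loop like the one sketched upthread): both are exposed to the parent through the same params-in/text-out contract, but the subagent reasons before acting and can decline.

          def get_invoice(invoice_id: str) -> dict:
              """Deterministic tool: always executes exactly what it's given."""
              return {"id": invoice_id, "total": 120.0}  # placeholder lookup

          def billing_subagent(request: str, shared_context: str) -> str:
              """Same tool contract, but with its own reasoning loop inside."""
              task = (
                  "You handle billing questions. If the answer is already in the "
                  "shared context, say so instead of calling any tools.\n\n"
                  f"Shared context:\n{shared_context}\n\nRequest: {request}"
              )
              return run_agent(task)  # may use its own tools, or decide not to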

    • nostrebored 20 hours ago

      Nah, when working on anything sufficiently complicated you will have many parallel subagents that need their own context window, ability to mutate shared state, sandboxing differences, durability considerations, etc.

      If you want to rewrite the behavior per instance you totally can, but there is a definite concept here that is different than “get_weather”.

      I think that existing tools don’t work very well here or leave much of this as an exercise for the user. We have tasks that can take a few days to finish (just a huge volume of data and many non deterministic paths). Most people are doing way too much or way too little. Having subagents with traits that can be vended at runtime feels really nice.

    • the_mitsuhiko a day ago

      > Its implementation should be abstracted away from the main agent loop. Whether the tool call is deterministic, has human input, etc., is meaningless outside of the main tool contract (i.e., params in, params out, SLA, etc.)

      Up to a point. You're obviously right in principle, but if that task itself has the ability to call into "adjacent" tools then the behavior changes quite a bit. You can see this a bit with how the Oracle in Amp surfaces itself to the user. The oracle as sub-agent has access to the same tools as the main agent, and the state changes (rare!) that it performs are visible to itself as well as the main agent. The tools that it invokes are displayed similarly to the main agent loop, but they are visualized as calls within the tool.

    • verdverm a day ago

      ADK differentiates between tools and subagents based on the ability to escalate or transfer control (subagents), whereas tools are more basic.

      I think this is a meaningful distinction, because it impacts control flow, regardless of what they are called. The lexicon is quite varied vendor to vendor.

      • peab a day ago

        Are there any examples of implementations of this that actually work, and/or are useful? I've seen people write about this, but I haven't seen it anywhere

        • verdverm a day ago

          I think in ADK, the most likely place to find them actually used is the Workflow agent interfaces (sequential, parallel, loop). Perhaps looping, where it looks like they suggest you have an agent that determines if the loop is done and escalates with that message to the Looper.

          https://google.github.io/adk-docs/agents/workflow-agents/

          I haven't gotten there yet, still building out the basics like showing diffs instead of blind writing and supporting rewind in a session

    • Vinnl a day ago

      What does "has reasoning" mean? Isn't that just a system prompt that says something like "make a plan" and includes that in the loop?

      • peab a day ago

        You actually probably don't need reasoning, as the old non reasoning models like 4o can do this too.

        In the past, the agent-type flows would work better if you prompted the LLM to write down a plan, or reasoning steps for how to accomplish the task with the available tools. These days, the new models are trained to do this without prompting.

    • ColinEberhardt a day ago

      Oh, so _that_ is what a sub-agent is. I have been wondering about that for a while now!

  • verdverm a day ago

    Google's ADK is pretty nice. I'm using the Go version, which is less mature than the Python one. I've been at it a bit over a week and progress is great. This weekend I'm aiming for tracking file changes in the session history to allow rewinding / forking.

    It has a ton of day 2 features, really nice abstractions, and positioned itself well in terms of the building blocks and constructing workflows.

    ADK supports working with all the vendors and local LLMs

    • dragonwriter a day ago

      I really wish ADK had a local persistent memory implementation, though.

      • verdverm a day ago

        w.r.t. Go, it's probably not that big a lift. I was looking at that yesterday, made a small change to lift the Gorm stuff a bit so the DB conn can be shared between the services

        I thought the same thing about the artifact service, which could have a nice local FS option.

        I'm pretty new to ADK, so we'll see how long the honeymoon phase lasts. Generally very optimistic that I found a solid foundation and framework

        edit: opened an issue to track it

        https://github.com/google/adk-go/issues/339

  • mountainriver a day ago

    The frameworks are all pointless, just use AI assist to create agents in python or ideally a language with concurrency.

    You will be happy you did

    • moinism a day ago

      How do you deal with the different APIs/Tooluse schema in a custom build? As other people have mentioned, it's a bigger undertaking than it sounds.

      • koakuma-chan a day ago

        You can just tell the AI which format you want the input in, in natural language.

    • moduspol 18 hours ago

      You will undoubtedly be recreating what already exists in LangGraph. And you'll probably be doing it worse.

  • otterley a day ago

    Have you tried AWS’s Strands Agents SDK? I’ve found it to be a very fluent and ergonomic API. And it doesn’t require you to use Bedrock; most major vendor native APIs are supported.

    (Disclaimer: I work for AWS, but not for any team involved. Opinions are my own.)

    • moinism a day ago

      This looks good. Even though it's only in Python, I think it's worth a try. Thanks.

  • ph4rsikal a day ago

    My favourite is Smolagents from Huggingface. You can easily mix and match their models in your agents.

    • moinism a day ago

      Dude, it looks great, but I just spent half an hour learning about its 'CodeAgents' feature. Which essentially is 'actions written as code'.

      This idea has been floating around in my head, but it wasn't refined enough to implement. It's so wild that what you're thinking of may have already been done by someone else on the internet.

      https://huggingface.co/docs/smolagents/conceptual_guides/int...

      For those who are wondering, it's kind of similar to the 'Code Mode' idea implemented by Cloudflare and now being explored by Anthropic: write code to discover and call MCPs instead of stuffing the context window with their definitions.
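
      A rough sketch of the concept (the helper names are hypothetical, and exec() here is only a stand-in for running the snippet in a real isolated sandbox): expose a few plain functions, have the model write a short script that calls them, and execute that script instead of round-tripping every tool call through the context window.

          def search_orders(customer_id: str) -> list[dict]:
              return []   # real implementation lives elsewhere

          def refund(order_id: str) -> None:
              pass        # real implementation lives elsewhere

          def run_model_snippet(snippet: str):
              """Execute model-written code with the helpers in scope.

              Convention: the snippet assigns its answer to `result`.
              In production this runs inside a sandbox/VM, never bare exec().
              """
              scope = {"search_orders": search_orders, "refund": refund, "result": None}
              exec(snippet, scope)
              return scope["result"]

          # The model might emit something like:
          #   late = [o["id"] for o in search_orders("c_42") if o.get("late")]
          #   result = late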

  • thewhitetulip a day ago

    Did you try langchain/langgraph? Or am I confusing what the OP means by agents?

mritchie712 a day ago

Some things we've[0] learned on agent design:

1. If your agent needs to write a lot of code, it's really hard to beat Claude Code (cc) / Agent SDK. We've tried many approaches and frameworks over the past 2 years (e.g. PydanticAI), but using cc is the first that has felt magic.

2. Vendor lock-in is a risk, but the bigger risk is having an agent that is less capable than what a user gets out of ChatGPT because you're hand-rolling every aspect of your agent.

3. cc is incredibly self-aware. When you ask cc how to do something in cc, it instantly nails it. If you ask cc how to do something in framework xyz, it will take much more effort.

4. Give your agent a computer to use. We use e2b.dev, but Modal is great too. When the agent has a computer, it makes many complex features feel simple.

0 - For context, Definite (https://www.definite.app/) is a data platform with agents to operate it. It's like Heroku for data with a staff of AI data engineers and analysts.

  • CuriouslyC a day ago

    Be careful about what you hand off to Claude versus another agent. Claude is a vibe project monster, but it will fail at hard things, come up with fake solutions, and then lie to you about them. To the point that it'll add random sleeps and do pointless work to cover up the fact that it's reward hacking. It's also very messy.

    For brownfield work, hard problems, or big complex codebases, you'll save yourself a lot of pain if you use Codex instead of CC.

    • wild_egg a day ago

      Claude is amazing at brownfield if you take the time to experiment with your approach.

      Codex is stronger out of the box but properly customized Claude can't be matched at the moment

      • CuriouslyC a day ago

        The issues with Claude are twofold:

        1. Poor long context performance compared to GPT5.1, so Claude gets confused about things when it has to do exploration in the middle of a task.

        2. Claude is very completion driven, and very opinionated, so if your codebase has its own opinions Claude will fight you, and if there are things that are hard to get right, rather than come back and ask for advice, Claude will try to stub/mock it ("let's try a simpler solution...") which would be fine, except that it'll report that it completed the task as written.

      • gnat a day ago

        What have you done to make Claude stronger on brownfields work? This is very interesting to me.

  • faxmeyourcode a day ago

    Point 2 is very often overlooked. Building products that are worse than the baseline chatgpt website is very common.

  • smcleod a day ago

    It's quite worrying that several times in the last few months I have had to really drive home why people should probably not be building bespoke agentic systems that essentially act as a half-baked version of an agentic coding tool, when they could just go use Claude Code and instead focus their efforts on creating value rather than instant technical debt.

    • CuriouslyC a day ago

      You can pretty much completely reprogram agents just by passing them through a smart proxy. You don't need to rewrite claude/codex, just add context engineering and tool behaviors at the proxy layer.
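
      To make that concrete, here is a rough sketch of the shape of such a proxy, assuming an OpenAI-compatible /v1/chat/completions upstream; the injected "house rules" message and env var names are made up for the example:

        # Hypothetical pass-through proxy that rewrites requests before they
        # reach the model vendor. Not any particular product's API contract.
        import os
        import requests
        from flask import Flask, request, jsonify

        app = Flask(__name__)
        UPSTREAM = os.environ.get("UPSTREAM", "https://api.openai.com")

        HOUSE_RULES = {
            "role": "system",
            "content": "Prefer small diffs. Never stub out failing tests.",
        }

        @app.post("/v1/chat/completions")
        def chat_completions():
            payload = request.get_json()
            # Context engineering happens here; the agent never knows its
            # message list was rewritten on the way through.
            payload["messages"] = [HOUSE_RULES] + payload.get("messages", [])
            upstream = requests.post(
                f"{UPSTREAM}/v1/chat/completions",
                json=payload,
                headers={"Authorization": request.headers.get("Authorization", "")},
                timeout=120,
            )
            return jsonify(upstream.json()), upstream.status_code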

  • verdverm a day ago

    yes, we should all stop experimenting and outsource our agentic workflows to our new overlords...

    this will surely end up better than where big tech has already brought our current society...

    For real though, where did the dreamers go who wanted AI / agentic systems free of the worst companies? Are we in the season of capitulation?

    My opinion... build, learn, share. The frameworks will improve, the time to custom agent will be shortened, the knowledge won't be locked in another unicorn

    anecdotally, I've come quite far in just a week with ADK and VS Code extensions, having never done extensions before, which has been a large part of the time spent

postalcoder a day ago

I've been building agent type stuff for a couple years now and the best thing I did was build my own framework and abstractions that I know like the back of my hand.

I'd stay clear of any llm abstraction. There are so many companies with open source abstractions offering the panacea of a single interface that are crumbling under their own weight due to the sheer futility of supporting every permutation of every SDK evolution, all while the same companies try to build revenue generating businesses on top of them.

  • sathish316 a day ago

    I agree with your analysis of building your own agent framework to have some level of control and fewer abstractions. Agents at their core are about programmatically talking to an LLM and performing these basic operations:

    1. Structured input and string interpolation in prompts

    2. Structured output, and unmarshalling the string response into structured output (this is getting easier now with LLMs supporting structured output)

    3. Tool registry/discovery (of MCP and function tools), tool calls, and response looping

    4. Composability of tools

    5. Some form of agent-to-agent delegation
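
    As a rough illustration of operations 1 and 2 (the rest bolts on around this), not tied to any particular framework; the Invoice fields, prompt, and model name are just placeholders:

      # Structured input via string interpolation, structured output
      # unmarshalled into a validated Pydantic model.
      import json
      from pydantic import BaseModel
      from openai import OpenAI

      class Invoice(BaseModel):
          vendor: str
          total_cents: int
          currency: str

      client = OpenAI()

      def extract_invoice(raw_text: str) -> Invoice:
          prompt = f"Extract the invoice fields as JSON.\n\n{raw_text}"   # op 1
          resp = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[{"role": "user", "content": prompt}],
              response_format={"type": "json_object"},
          )
          return Invoice(**json.loads(resp.choices[0].message.content))   # op 2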

    I’ve had good luck with using PydanticAI which does these core operations well (due to the underlying Pydantic library), but still struggles with too many MCP servers/Tools and composability.

    I've built an open-source agent framework called OpusAgents that makes the process of creating agents, subagents, and tools simpler than MCP servers, without overloading the context. Check it out here, along with tutorials/demos, to see how it's more reliable than generic agents with MCP servers in Cursor/Claude Desktop - https://github.com/sathish316/opus_agents

    It’s built on top of PydanticAI and FastMCP, so that all non-core operations of Agents are accessible when I need them later.

    • drittich a day ago

      This sounds interesting. What about the agent behavior itself? How it decides how to come at a problem, what to show the user along the way, and how it decides when to stop? Are these things you have attempted to grapple with in your framework?

      • sathish316 a day ago

        The framework has the following capabilities:

        1. A way to create function tools

        2. A way to create specialised subagents that can use their own tools or their own model. The main agent can delegate to a subagent exposed as a tool. Subagents don't get confused because they have their own context window, tools, and even models (mix and match a remote LLM with a local LLM if needed)

        3. Don’t use all tools of the MCP servers you’ve added. Filter out and select only the most relevant ones for the problem you’re trying to solve

        4. HigherOrderTool exposes a generic callMCPTool(toolName, input) for places where the agent-to-MCP interface can be better suited to the problem than what the MCP provider exposes as a generic interface - https://github.com/sathish316/opus_agents/blob/main/docs/GUI... (rough sketch at the end of this comment). This is similar to Anthropic's recent blog post on code tools being better than MCP - https://www.anthropic.com/engineering/code-execution-with-mc...

        5. MetaTool is a way to use ready made patterns like OpenAPI and not having to write a tool or add more MCP servers to solve a problem - https://github.com/sathish316/opus_agents/blob/main/docs/GUI... . This is similar to a recent HN post on Bash tools being better for context and accuracy than MCP - https://mariozechner.at/posts/2025-11-02-what-if-you-dont-ne...

        Other than AgentBuilder, CustomTool, HigherOrderTool, MetaTool, SubagentBuilder the framework does not try to control PydanticAI’s main agent behaviour. The high level approach is to use fewer, more relevant tools and let LLM orchestration and prompt tool references drive the rest. This approach has been more reliable and predictable for a given Agent based problem.
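
        To sketch the HigherOrderTool idea from point 4 in isolation (the registry contents and tool names here are hypothetical, and the real MCP client wiring is omitted):

          # One generic dispatcher tool stands in front of many MCP tools,
          # so only a single tool definition occupies the context window.
          from typing import Any, Callable

          MCP_REGISTRY: dict[str, Callable[..., Any]] = {
              "jira.create_ticket": lambda title, body: f"JIRA-123: {title}",
              "github.search_issues": lambda query: ["#42", "#57"],
          }

          def call_mcp_tool(tool_name: str, tool_input: dict[str, Any]) -> Any:
              """The only tool exposed to the agent; routes to real MCP tools."""
              if tool_name not in MCP_REGISTRY:
                  # Let the model recover instead of failing the whole run.
                  return {"error": f"unknown tool {tool_name}",
                          "available": list(MCP_REGISTRY)}
              return MCP_REGISTRY[tool_name](**tool_input)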

    • wizhi a day ago

      So you advise people to build their own framework, then advertise your own?

    • spacecadet a day ago

      I also recommend this. I have tried all of the frameworks, and still deploy some for clients - but for my personal agents, it's my own custom framework that is dead simple and very easy to spin up, extend, etc.

  • the_mitsuhiko a day ago

    Author here. I'm with you on the abstractions part. I dumped a lot of my thoughts on this into a follow-up post: https://lucumr.pocoo.org/2025/11/22/llm-apis/

    • thierrydamiba a day ago

      Excellent write up. I've been thinking a lot about caching and agents, so this was right up my alley.

      Have you experimented with using semantic cache on the chain of thought(what we get back from the providers anyways) and sending that to a dumb model for similar queries to “simulate” thinking?

  • NitpickLawyer a day ago

    Yes, this is great advice. It also applies to interfaces. When we designed a support "chat bot", we went with a different architecture than what's out there already. We designed the system with "chat rooms" instead, and the frontend just dumps messages to a chatroom (with a session id). Then on the backend we can do lots of things, incrementally adding functionality, while the front end doesn't have to keep up. We can also do things like group messages, have "system" messages that other services can read, etc. It also feels more natural, as the client can type additional info while the server is working, etc.
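
    A toy version of the idea, with an in-memory store standing in for whatever backing store you'd actually use (all names made up):

      # Chat-room backend: the frontend only ever appends messages to a
      # session; backend services poll for the message kinds they care about.
      import time
      import uuid
      from collections import defaultdict

      ROOMS: dict[str, list[dict]] = defaultdict(list)

      def post_message(session_id: str, role: str, content: str, kind: str = "chat") -> None:
          ROOMS[session_id].append({
              "id": str(uuid.uuid4()),
              "ts": time.time(),
              "role": role,      # "user", "assistant", "system", ...
              "kind": kind,      # lets services group/filter messages
              "content": content,
          })

      def messages_since(session_id: str, ts: float, kinds: set[str] | None = None) -> list[dict]:
          recent = [m for m in ROOMS[session_id] if m["ts"] > ts]
          return [m for m in recent if kinds is None or m["kind"] in kinds]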

    If you have to use some of the client side SDKs, another good idea is to have a proxy where you can also add functionality without having to change the frontend.

    • postalcoder a day ago

      Creativity is an underrated hard part of building agents. The fun part of building right now is knowing how little of the design space for building agents has been explored.

      • spacecadet a day ago

        This! I keep telling people that if tool use was not an aha moment relative to AI agents for you, then you need to be more creative...

    • verdverm a day ago

      This is not so unlike the coding agent I'm building for VS Code. One of the things I'm doing is keeping a snapshot of the current VS Code state (files open, terminal history, etc.) in the agent server. Similarly, I track the file changes without actually writing them until the user approves the diff, so there are some "filesystem"-like things that need to be carefully managed on each side.

      tl;dr, Both sides are broadcasting messages and listening for the ones they care about.

  • _pdp_ a day ago

    This is a huge undertaking though. Yes, it is quite simple to build some basic abstraction on top of openai.complete or similar, but that is like 1% of what an agent needs to do.

    My bet is that agent frameworks and platforms will become more like game engines. You can spin your own engine for sure and it is fun and rewarding. But AAA studios will most likely decide to use a ready-to-go platform with all the batteries included.

    • postalcoder a day ago

      In totality, yes. But you don't need every feature at once. You add to it once you hit boundaries. But I think the most important thing about this exercise is that you leave nothing to the imagination when building agents.

      The non-deterministic nature of LLMs already makes the performance of agents so difficult to interpret. Building agents on top of code that you cannot mentally trace through leads to so much frustration when addressing model underperformance and failure.

      It's hard to argue against the idea that, after the dust settles, companies will default to batteries-included frameworks, but right now a lot of the people I've seen regretted adopting a large framework off the bat.

eclipsetheworld a day ago

We're repeating the same overengineering cycle we saw with early LangChain/RAG stacks. Just a couple of months ago the term agent was hard to define, but I've realized the best mental model is just a standard REPL:

- Read: Gather context (user input + tool outputs).

- Eval: LLM inference (decides: do I need a tool, or am I done?).

- Print: Execute the tool (the side effect) or return the answer.

- Loop: Feed the result back into the context window.

Rolling a lightweight implementation around this concept has been significantly more robust for me than fighting with the abstractions in the heavy-weight SDKs.
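
For what it's worth, the whole loop fits in a screenful of plain Python against the OpenAI tools API; the single read_file tool here is just a stand-in:

  # Minimal Read-Eval-Print-Loop agent: one model, one tool, no framework.
  import json
  from openai import OpenAI

  client = OpenAI()

  def read_file(path: str) -> str:
      with open(path) as f:
          return f.read()

  TOOLS = [{
      "type": "function",
      "function": {
          "name": "read_file",
          "description": "Read a local file",
          "parameters": {
              "type": "object",
              "properties": {"path": {"type": "string"}},
              "required": ["path"],
          },
      },
  }]

  def run(task: str) -> str:
      messages = [{"role": "user", "content": task}]            # Read
      while True:                                               # Loop
          msg = client.chat.completions.create(
              model="gpt-4o-mini", messages=messages, tools=TOOLS,
          ).choices[0].message                                  # Eval
          if not msg.tool_calls:
              return msg.content                                # done
          messages.append(msg)
          for call in msg.tool_calls:                           # Print (side effect)
              result = read_file(**json.loads(call.function.arguments))
              messages.append({"role": "tool", "tool_call_id": call.id,
                               "content": result})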

  • throw310822 a day ago

    I don't think this has much to do with SDKs. I've developed my own agent code from scratch (starting from the simple loop) and eventually- unless your use case is really simple- you always have to deal with the need for subagents specialised for certain tasks, that share part of their data (but not all) with the main agent, with internal reasoning and reinforcement messages, etc.

    • eclipsetheworld a day ago

      Interestingly, sticking to the "Agent = REPL" mental model is actually what helped me solve those specific scaling problems (sub-agents and shared data) without the SDK bloat.

      1. Sub-agents are just stack frames. When the main loop encounters a complex task, it "pushes" a new scope (a sub-agent with a fresh, empty context). That sub-agent runs its own REPL loop, returns only the clean result without any context pollution, and is then "popped".

      2. Shared Data is the heap. Instead of stuffing "shared data" into the context window (which is expensive and confusing), I pass a shared state object by reference. Agents read/write to the heap via tools, but they only pass "pointers" in the conversation history. In the beginning this was just a Python dictionary and the "pointers" were keys.

      My issue with the heavy SDKs isn't that they try to solve these problems, but that they often abstract away the state management. I’ve found that explicitly managing the "stack" (context) and "heap" (artifacts) makes the system much easier to debug.
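
      In code the whole "stack and heap" framing stays tiny; run_repl below stands in for whatever loop you already have, and the artifact keys are arbitrary:

        # Sub-agent as a stack frame, shared data as a heap of named artifacts.
        HEAP: dict[str, object] = {}    # artifacts live here, not in the context

        def put_artifact(key: str, value: object) -> str:
            HEAP[key] = value
            return key                   # only the "pointer" enters the history

        def get_artifact(key: str) -> object:
            return HEAP[key]

        def run_subagent(task: str, run_repl) -> str:
            # "Push": fresh, empty context scoped to the sub-task.
            sub_context = [{"role": "user", "content": task}]
            result = run_repl(sub_context)
            # "Pop": only the clean result returns to the caller's context.
            return result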

      • throw310822 a day ago

        Indeed. So in addition to your chat loop, you have built a way to spawn sub-agents, and to share memory objects between them (or tools) and the main agent; also (I suppose) a standard way to define tools and their actions; to define sub-agents with their separate tools and actions and (if needed) separate memory objects; to inject ephemeral context in the chat (the current state of the UI, or the last user action); to introduce reinforcement messages when needed; etc. Maybe context packing if/when the context gets too big. Then you've probably built something for evals, so that you can run batches of tasks and score the results. Etc.

        So that's my point (and that of the article): it's not "just a loop", it quickly gets much more complicated than that. I haven't used any framework, so I can't tell if they're good or not; but for sure I ended up building my own. Calling tools in a loop is enough for a cool demo but doesn't work well enough for production.

mitchellh a day ago

This is why I use the agent I use. I won't name the company, because I don't want people to think I'm a shill for them (I've already been accused of it before, but I'm just a happy, excited customer). But it's an agentic coding company that isn't associated with any of the big model providers.

I don't want to keep up with all the new model releases. I don't want to read every model card. I don't want to feel pressured to update immediately (if it's better). I don't want to run evals. I don't want to think about when different models are better for different scenarios. I don't want to build obvious/common subagents. I don't want to manage N > 1 billing entities.

I just want to work.

Paying an agentic coding company to do this makes perfect sense for me.

  • pjm331 a day ago

    I’ve been surprised at the lack of discussion about sourcegraph’s Amp here which I’m pretty sure you’re referring to - it started a bit rough but these days I find that it’s really good

    • SatvikBeri a day ago

      So, I tried to sign up for Amp. I saw a livestream that mentioned you can sign up for their community Buildcrew on Discord and get $100 of credits. I tried signing up, and got an email that I was accepted and would soon get the credits. The Discord link did not work (it was expired) and the email was a noreply, so I tried emailing Amp support. This was last Friday (8 days ago.) As of today, no updated Discord link, no human response, no credits. If this is their norm, people probably aren't talking about it because they just haven't been able to try it.

      • sqs a day ago

        Sorry we missed that email! I don’t know what went wrong there, but I just replied and will figure it out. This is definitely not the norm (and Build Crew is a small fraction of our users).

        • SatvikBeri 21 hours ago

          (I can't edit my old post, but it turned out to be a Discord issue, not an issue with the amp link. Oops!)

ReDeiPirati a day ago

> We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here.

I'm curious about the solutions the op has tried so far here.

  • hommes-r 6 hours ago

    "Because there’s too much you need to feed into it" - what does the author mean by this? If it is the amount of data, then I would say sampling needs to be implemented. If that's the extent of the information required from the agent builder, I agree that an LLM-as-a-judge e2e eval setup is necessary.

    In general, a more generic eval setup is needed, with minimal requirements from AI engineers, if we want to move beyond vibes-based reliability engineering practices as a sector.

  • heljakka 11 hours ago

    What are the main shortcomings of the solutions you tried out?

    We believe you need to both automatically create the evaluation policies from OTEL data (data-first) and bring in rigorous LLM-judge automation from the other end (intent-first) for the truly open-ended aspects.

  • ColinEberhardt a day ago

    Likewise. I have a nasty feeling that most AI agent deployments happen with nothing more than some cursory manual testing. Going with the 'vibes' (to use an overused term in the industry).

    • radarsat1 a day ago

      A lot of "generative" work is like this. While you can come up with benchmarks galore, at the end of the day how a model "feels" only seems to come out from actual usage. Just read /r/localllama for opinions on which models are "benchmaxed" as they put it. It seems to be common knowledge in the local LLM community that many models perform well on benchmarks but that doesn't always reflect how good they actually are.

      In my case I was until recently working on TTS and this was a huge barrier for us. We used all the common signal quality and MOS-simulation models that judged so called "naturalness" and "expressiveness" etc. But we found that none of these really helped us much in deciding when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on the quality of output. This made hyperparameter tuning as well as commercial planning extremely difficult and we suffered greatly for it. (Notice my use of past tense here..)

      Having good metrics is just really key and I'm now at the point where I'd go as far as to say that if good metrics don't exist it's almost not even worth working on something. (Almost.)

  • ramraj07 19 hours ago

    It's a 2-day project at best to create your own bespoke LLM-as-judge e2e eval framework. That's what we did. Works fine. Not great. Still need someone to write the evals though.

  • verdverm a day ago

    ADK has a few pages and some API for evaluating agentic systems

    https://google.github.io/adk-docs/evaluate/

    tl;dr - challenging because different runs produce different output, also how do you pass/fail (another LLM/agent is what people do)

havkom a day ago

My tip is: don't use SDKs for agents. Use a while loop and craft your own JSON, handle context size, and handle faults yourself. You will in practice need this level of control if you are not doing something trivial.
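
The kind of control I mean looks roughly like this; the model call and token counting are left abstract on purpose, and the JSON "protocol" is whatever you decide it is:

  # Hand-rolled loop: you own retries, malformed JSON, and context trimming.
  import json

  MAX_CONTEXT_TOKENS = 50_000

  def call_model(messages: list[dict]) -> str:
      raise NotImplementedError   # your HTTP call to whichever vendor

  def estimate_tokens(messages: list[dict]) -> int:
      return sum(len(m["content"]) // 4 for m in messages)   # crude but honest

  def agent_loop(messages: list[dict], tools: dict) -> str:
      while True:
          # Handle context size yourself: drop the oldest tool chatter first.
          while estimate_tokens(messages) > MAX_CONTEXT_TOKENS and len(messages) > 2:
              del messages[1]
          raw = call_model(messages)
          try:
              action = json.loads(raw)          # your own JSON protocol
          except json.JSONDecodeError:
              messages.append({"role": "user",
                               "content": "Reply with valid JSON only."})
              continue                           # handle faults yourself: re-ask
          messages.append({"role": "assistant", "content": raw})
          if action.get("final"):
              return action["answer"]
          result = tools[action["tool"]](**action.get("args", {}))
          messages.append({"role": "user", "content": json.dumps({"result": result})})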

CuriouslyC a day ago

The 'Reinforcement in the Agent Loop' section is a big deal. I use this pattern to enable async/event-steered agents, and it's super powerful. In long contexts you can use it to re-inject key points ("reminders"), etc.
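
A bare-bones version of that reinforcement, assuming you already own the message list before each model call (the reminder text is obviously per-task):

  # Re-inject key constraints every N turns so they sit in recent context.
  REMINDER = {"role": "system",
              "content": "Reminder: stay on the refactor task; do not touch tests."}

  def with_reinforcement(messages: list[dict], turn: int, every: int = 5) -> list[dict]:
      if turn > 0 and turn % every == 0:
          return messages + [REMINDER]
      return messages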

  • pjm331 a day ago

    Yes that was the one point in here where I thought to myself oh yeah I’m going to go implement that immediately

d4rkp4ttern 6 hours ago

Relatedly, there are two issues I haven't seen discussed much in agent design, but which repeatedly come up in real-world use cases:

(1) LLM forgets to call a tool (and instead outputs plain text). Contrary to some of the comments here saying that these concerns will disappear as frontier models improve, there will always be a need for having your agent scaffolding work well with weaker LLMs (cost, privacy, etc).

(2) Determining when a task is finished. In some cases, we want the LLM to decide that (e.g search with different queries until desired info found), but in others, we want to specify deterministic task completion conditions (e.g., end the task immediately after structured info extraction, or after acting on such info, or after the LLM sees the result of that action etc).

After repeatedly running into these types of issues in production agent systems, we've added mechanisms for them in the Langroid [1] agent framework (I'm the lead dev), which has a blackboard-like loop architecture that makes them easy to incorporate.

For issue (1) we can configure an agent with a `handle_llm_no_tool` [2] set to a “nudge” that is sent back to the LLM when a non-tool response is detected (it could also be set as a lambda function to take other possible actions)

For issue (2) Langroid has a DSL[3] for specifying task termination conditions. It lets you specify patterns that trigger task termination, e.g.

- "T" to terminate immediately after a tool-call,

- "T[X]" to terminate after calling the specific tool X,

- "T,A" to terminate after a tool call, and agent handling (i.e. tool exec)

- "T,A,L" to terminate after tool call, agent handling, and LLM response to that

[1] Langroid https://github.com/langroid/langroid

[2] Handling non-tool LLM responses https://langroid.github.io/langroid/notes/handle-llm-no-tool...

[3] Task Termination in Langroid https://langroid.github.io/langroid/notes/task-termination/

cvhc a day ago

I'm frustrated that some fundamental aspects of agent development lack clear guidelines.

One example is the input/output types of function tools. Frameworks offer some flexibility, and seemingly I can use fundamental types and simple data structures (list, dict/map). But on the other hand I know all data types are eventually stringified, and this has implications.

I have recently observed two issues when my agent calls a function that simply takes some int64 numeric IDs: (1) when the IDs are presented as hexadecimal in the context, the LLM attempts to convert them to decimal itself but messes it up because it doesn't really calculate; (2) some big IDs are not passed precisely in the Google ADK framework [1], presumably because its JSON serialization failed to keep the precision. I ended up changing the function to take string args instead. I also wasn't sure if the tool should return the data as close to the original as possible in a moderately deeply nested dict, or go a step further and properly organize the output in a more human-readable text format for model ingestion.
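
The shape of that workaround, roughly (get_record and the bounds check are made up for the example): keep anything that has to survive serialization as a string in the tool schema and convert at the edge.

  # Accept IDs as strings so neither the model nor JSON serialization can
  # mangle them; validate and convert only inside the tool.
  def get_record(record_id: str) -> dict:
      # Accept both "12345" and "0x3039" verbatim from the model.
      base = 16 if record_id.lower().startswith("0x") else 10
      try:
          value = int(record_id, base)
      except ValueError:
          return {"error": f"invalid id {record_id!r}"}
      if not (0 < value < 2**63):
          return {"error": f"id out of range: {record_id!r}"}
      return {"id": str(value), "found": True}   # echo back as a string, too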

OpenAI's doc [2] writes: "A result must be a string, but the format is up to you (JSON, error codes, plain text, etc.). The model will interpret that string as needed." -- But that clearly contradicts the framework's capabilities and some official examples where dicts/numbers are returned.

[1] https://github.com/google/adk-python/issues/3592 [2] https://platform.openai.com/docs/guides/function-calling

cedws a day ago

Going to hijack this to ask a question that's been on my mind: does anybody know why there are no agentic tools that use tree-sitter to navigate code? It seems like it would be much more powerful than having the LLM grep for strings or rewrite entire files to change one line.

  • esafak a day ago
    • jaen 6 hours ago

      These links are all mostly unmaintained AI-generated random MCP servers. Could have put in a bit more effort than copy-pasting search results...

      To talk about where it's _actually_ at:

      Agentic IDEs have LSP support built in, which is better than just tree-sitter - for example Copilot in VS Code, which, contrary to what you might expect, can actually use arbitrary models and has BYOK.

      There's also OpenCode from the CLI side etc. etc.

      From the MCP side, there are large community efforts such as https://github.com/oraios/serena

  • stanleykm a day ago

    In my own measurements (I made a framework to test the number of tokens used / amount of reprompting needed to accomplish a battery of tests) I found that using an AST-type tool makes the results worse. I suspect it just fills the context with distractors. LLMs know how to search effectively, so it's better to let them do that, as far as I can tell. At this point I basically don't use MCPs. Instead I just tell it that certain tools are available to it if it wants to use them.

  • postalcoder a day ago

    This doesn't fully satisfy your question, but it comes straight from bcherny (claude code dev):

    > Claude Code doesn't use RAG currently. In our testing we found that agentic search out-performed RAG for the kinds of things people use Code for.

    source thread: https://news.ycombinator.com/item?id=43163011#43164253

    • cedws a day ago

      Thanks, I was also wondering why they don't use RAG.

      • the_mitsuhiko a day ago

        They are using RAG. Grep is also just RAG. The better question is why they don’t use a vector database and honestly the answer is that these things are incredibly hard to keep in sync. And if they are not perfectly in sync, the utility drops quickly.

        • esafak a day ago

          By RAG they mean vector search. They're calling grep "agentic search".

      • postalcoder a day ago

        I forgot to mention that Aider uses a tree-sitter approach. It's blazing fast and great for quick changes. It's not the greatest agent for doing heavy edits, but I don't think that's because it doesn't use grep.

  • dangoodmanUT a day ago

    This is how embeddings generation works; they just convert it to embeddings so it can use natural language.

  • spacecadet a day ago

    Create an agent with these tools and you will. Agent tools are just functions, but think of them like skills. The more knowledge and skills your agents (plural, I'd recommend more than one) have access to, the more useful they become.

  • the_mitsuhiko a day ago

    > does anybody know why there’s no agentic tools that use tree-sitter to navigate code?

    I would guess the biggest reason is that there is no RL happening on the base models with tree-sitter as a tool. But there is a lot of RL with bash, so they know how to grep. I did experiment with giving tree-sitter and ast-grep to agents, and in my experience the results are mixed.

lvl155 a day ago

Design is hard because models change almost on a weekly basis. Quality abruptly falls off or drastically changes without announcements. It’s like building a house without proper foundation. I don’t know how many tokens I wasted because of this. I want to say 30% of the cost and resources this year for me.

Vanclief a day ago

We have been working hard for the past two months implementing agents for different tasks. We started with Claude Code, as we really liked working with hooks. However, being vendor-locked and having usage-limit problems, we ended up implementing our own "runtime" that keeps the conversation structure while having hooks. Right now it only works with OpenAI, but it's designed to be able to incorporate Claude / Gemini down the road.

We ended up open sourcing that runtime if anyone is interested:

https://github.com/Vanclief/agent-composer

Scotrix a day ago

While working on a new startup I had exactly the same challenges and issues. As soon as it gets complex, the data gets bigger, and the number of tools increases, it becomes tremendously difficult to control agents, networks of agents, or anything LLM-related. Add specific domains like legal or finance, where most of the time there is just 0 or 1 and nothing in between (metaphorically speaking), and it becomes a nightmare of code and side effects.

So I started to actually build something to solve most of my own problems reliably and push deterministic outputs with the help of LLMs (e.g. imagine finding the right columns/sheets in massive spreadsheets to create tool execution flows, and fine-tuning the selection of a range of data sources). My idea and solution, which has helped not only me but also quite a few business folks so far to fix and test agents, is to visualize flows, connect and extract data visually, and test and deploy changes in real time, while keeping everything very close to static types and predictable output (unlike e.g. Llama Flow).

Would love to hear your thoughts about my approach: https://llm-flow-designer.com

srameshc a day ago

I still feel there is no sure-shot way to build an abstraction yet. Probably that is why Lovable decided to build on Gemini rather than giving users the option of choosing a model. On the other hand, I like the Pydantic AI framework and got myself a decent working solution where my preference is to stick with cheaper models by default and only use expensive ones in cases where the failure rate is too high.

  • _pdp_ a day ago

    This is true. Based on my experience with real customers, they really don't know what the difference is between the different models.

    What they want is to get things done. The model is simply a means to an end. As long as the task is completed, everything else is secondary.

    • morkalork a day ago

      I feel like there's always that "one guy" who has a preference and asks for an option to choose, and it's a way for developers to offload the decision. You can't be blamed for the wrong one if you put that decision on the user.

      Tbh I agree with your approach and use it when building stuff for myself. I'm so over yak shaving that topic.

elvin_d a day ago

My experience with BAML was really good. For structured outputs, JSON schema is slow and sluggish while BAML performs well. Interestingly, it doesn't get much attention on HN; I'm wondering whether that's because of the hype around other SDKs or because BAML doesn't deliver enough.

cheriot a day ago

It's a library design flaw. The agent SDKs focus on an "easy" high-level API and hard-code all of its assumptions (AI SDK, LangGraph, etc.). There are no lower-level primitives to recompose when you discover your requirements are different from what the library author thought of.

So for now the choice is, "all in one for great prototypes and better hope it has everything you need" or roll your own.

If someone knows of a library that's good for quick prototypes and malleable for power users please share.

Frannky a day ago

I use CLI terminals as agent frameworks. You have big tech and open source behind them. All the problems are solved with zero work. They take care of new problems. You don't need to remove all the stuff that becomes outdated because the latest model doesn't make the same mistakes. Free via free tiers, cheap via OpenAI-compatible open-source models like z.ai. Insanely smart and can easily talk to MCP servers.

Yokohiii a day ago

Most of it reads like too high expectation of overhyped technology.

  • xpe 19 hours ago

    What is the benefit of injecting one’s “take” on what level of hype exists when thinking about the OP’s exploration into agentic development?

    Better to step back and just look at what exists for what it is? Strive for a less biased take.

    The pieces are rapidly changing, non-standard, fragmented, and exploratory. If true, jumping in involves risks from chaos, churn, rushing, unproveness (sp?), etc

    • Yokohiii 18 hours ago

      Confirmation bias can be a problem. Positivity bias as well. The author seems to take agentic dev for granted, which it isn't. He doesn't go into much detail either. It's at best a soft complaint about why it isn't plug-and-play already. I am open to the topic. But I demand more depth.

      • xpe 8 hours ago

        Have you questioned confirmation bias on your own side as well?

        • Yokohiii 6 hours ago

          About what? Most times I check, it's there.

          There is hype around AI and agentic dev. It's not subjective. My opinion is that it's a valid factor to consider if you evaluate tech. We've seen with e.g. microservices that people followed the hype because it seemed sound, but it was eventually terrible for many. Not everything is for everyone just because it's there.

          Can people explore things they care about? Yes. Can I have an opinion about what they do if they make a public post about it. Yes.

Fiveplus a day ago

I liked reading this but got a silly question as I am a noob in these things. If explicit caching is better, does that mean the agent is just forgetting stuff unless we manually save its notes? Are these things really that forgetful? Also why is there a virtual file system? So the agent is basically just running around a tiny digital desktop looking for its files? Why can't the agent just know where the data is? I'm sorry if these are juvenile questions.

  • the_mitsuhiko a day ago

    > If explicit caching is better, does that mean the agent is just forgetting stuff unless we manually save its notes?

    Caching is unrelated to memory, it's about how to not do the same work over and over again due to the distributed nature of state. I wrote a post that goes into detail from first principles here with my current thoughts on that topic [1].

    > Are these things really that forgetful?

    No, not really. They can get side-tracked which is why most agents do a form of reinforcement in-context.

    > Why is there a virtual file system?

    So that you don't have dead-end tools. If a tool creates or manipulates state (which we represent on a virtual file system), another tool needs to be able to pick up the work.
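
    Concretely, the point is that tool outputs land somewhere addressable instead of evaporating. A toy version (the fetch is faked):

      # Toy virtual file system: one tool writes a path, another picks it up,
      # and the model only ever passes paths around, never the payload itself.
      VFS: dict[str, bytes] = {}

      def fetch_report(url: str) -> str:
          data = b"...downloaded bytes..."        # imagine a real HTTP fetch
          path = f"/downloads/{url.split('/')[-1]}"
          VFS[path] = data
          return path                              # no dead end: next tool continues

      def summarize_file(path: str) -> str:
          return f"{len(VFS[path])} bytes at {path}"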

    > Why can't the agent just know where the data is?

    And where would that data be?

    [1]: https://lucumr.pocoo.org/2025/11/22/llm-apis/

  • pjm331 a day ago

    You are maybe confusing caching and context windows. Caching is mainly about keeping inference costs down

jonmagic a day ago

Great post. Re: frameworks, I tried a number of them and then found Pocketflow and haven’t found a reason to try anything else since. It’s now been ported to half a dozen or more languages (including my port to Ruby). The simple api and mental model makes it easy for everyone on my team to jump into, extend, and compose. Highly recommend for anyone frustrated with the larger SDKs.

wrochow 8 hours ago

It seems to me that people are missing the point. AI is a tool, useful for some things, not so useful for others. The old adage applies here quite well. To a man with a hammer, everything is a nail.

Having used AI quite extensively I have come to realize that to get quality outputs, it needs quality inputs... this is especially true with general models. It seems to me that many developers today just start typing without planning. That may work to some degree for humans, but AI needs more direction. In the early 2000s, Rational Unified Process (RUP) was all the rage. It gave way to Agile approaches, but now, I often wonder if we didn't throw the baby out with the bath water. I would wager that any AI model could produce high-quality code if provided with even a light version of RUP documentation.

fudged71 a day ago

If I understand correctly, Claude Code Agent SDK can’t edit itself (to incrementally improve), can’t edit or create its own agents and skills.

I’ve found that by running Claude Code within Manus sandbox I can inspect the reasoning traces and edit the Agents/Skills with the Manus agent.

linux2647 a day ago

The tmux usage referenced at the end of the article was fascinating to watch. I’d never considered using tmux as a way of getting more insight into how an agent is working through a problem. Or to watch it debug something

bobwolf a day ago

Season 2 of the podcast "Shell Game" is about the host trying to run a startup company with only AI agents, and all the problems that causes. It's interesting and entertaining.

dangoodmanUT a day ago

I think output functions aren't necessary; you should just use the text output when the agentic loop ends, and prompt for what kind of output you want (markdown, summary of changes, etc.).

ColinEberhardt a day ago

> We find testing and evals to be the hardest problem here …

I wonder what this means for the agents that people are deploying into production? Are they tested at all? Or just manual ad-hoc testing?

Sounds risky!

  • verdverm a day ago

    I'm curious what people are doing. We're still very much in the experimentation phase

    > Sounds risky!

    One of my first attempts at building file system tools for my custom agent called `tree` and caught a few node_modules. It blew up my context and cost me $5 in 60s. Fortunately I triggered the TPM rate limit and the thing stopped.

jfghi a day ago

Technologically and mathematically this is all quite interesting. However, I have no desire to ever use an agent and I doubt there are many that will.

  • kvirani a day ago

    No desire because you don't need them to optimize anything for yourself, or ?

    • jfghi a day ago

      So far the agentic stuff seems focused on shopping. Online shopping is remarkably seamless and optimized for me so the potential productivity gain is not enticing.

      I’ve encountered a number of errors dealing with LLMs so would be wary of the results.

      I also think there’d be an incentive to enshittify by having vendors pay to get preferential prioritization from the LLM. This could result in worse products being delivered for higher prices.

      • blharr 17 hours ago

        Have you had issues with LLM agents or when chatting to them one-on-one?

      • xpe 19 hours ago

        > So far the agentic stuff seems focused on shopping.

        Where are you getting this impression?

munro a day ago

I built an agent with LangGraph about 9 months ago -- now it seems ReAct agents are in LangChain. Overall I'm pretty happy with that; I just don't use any of the dumb stuff like the embedding/search layer: just tools & state.

But I was actually playing with a few frameworks yesterday and struggling -- I want what I want without having to write it. ;) Ended up using the pydantic_ai package, since I literally just want tools w/ pydantic validation -- but out of the box it doesn't have good observability, you would have to use their proprietary SaaS; and it comes bundled with Temporal.io (I hate that project). I had to write my own observability, which was annoying, and it sucks.

If anyone has anything they've built, I would love to know, and TypeScript is an option. I want:

- ReAct agent with tools that have schema validation

- built-in REALTIME observability w/ WebUI

- customizable playground ChatUI (this is where TypeScript would shine)

- no corporate takeover tentacles

P.P.S.: I know... I purposely try to avoid hard recommendations on HN, to avoid enshittification. "reddit best X" has been gamed. And I generally skip these subtle promotional posts.

nowittyusername a day ago

If you are building an agent, start from scratch and build your own framework. This will save you a lot of headache and wasted time down the line. One of the issues when you use someone else's framework is that you miss out on learning and understanding important fundamentals about LLMs, how they work, context, etc... Also, many developers don't learn the fundamentals of running LLMs locally and thus miss crucial context (giggidy) that would have helped them better understand the whole system. It seems to me that the author here came to a similar conclusion as many of us. I do want to add my own insights though that might be of use to some.

One of the things he talked about was issues with reliable tool calling by the model. I recommend he try the following approach. Have the agent perform a self-calibration exercise that makes it use its tools at the beginning of the context. Make it perform some complex tasks. Do it many times to test for coherence and accuracy while adjusting the system prompt towards more accurate tool calls. Once the agent has performed that calibration process successfully, you "freeze" that calibration context/history by broadening the --keep n to include not just the system prompt in the rolling window but also everything up to the end of this calibration session. Then, no matter how far the context window drifts, the conditioned tokens generated by that calibration session steer the agent towards proper tool use. From then on your "system prompt" includes those turns. Note that this is probably not possible with cloud-based models, as you don't have access to the inference engine directly. A hacky way around that is to emulate the conversation turns inside the system prompt.

On the note of benchmarks: the calibration test is your benchmark from then on. When introducing new tools to the system prompt or adjusting any important variable, you must always rerun the same test to make sure the new adjustments don't negatively affect system stability.

On context engineering: that is a must, as a bloated history will decohere the whole system. So it's important to devise an automated system that compresses the context down but retains the overall essence of the history. There are about a billion ways you could do this and you will need to experiment a lot. LLMs are conditioned quite heavily by their own outputs, so having the ability to remove erroneous tool calls from the context is a big boon, as the model is then less likely to repeat its own mistakes. There are trade-offs though; like he said, caching is a no-go when going this route, but you gain a lot more control and stability within the whole system if you do this right. It's basically reliability vs cost here, and I tend to lean towards reliability.

Also, I don't recommend using the whole context size of the model. Most LLMs perform very poorly past a certain point, and I find that using a maximum of 50% of the whole context window is best for cohesion. Meaning that if, let's say, the max context window is 100k tokens, treat 50k as the max limit and start compressing the history around 35k tokens. A granular, step-wise system can be set up, where the most recent context is most detailed and uncompressed, but as it gets further from the current time it gets less and less detailed. Obviously you want to store the full uncompressed history for a subagent that uses RAG. This allows the agent to see the whole thing in detail if it finds the need to.

Ahh, also on the matter of output: I found great success with making input and output channels for my agent. There are many channels that the model is conditioned to use for specific interactions: a <think> channel for CoT and reasoning, a <message_to_user> channel for explicit messages to the user, a <call_agent> channel for calling agents and interacting with them, <call_tool> for tool use, and then a few other environment and system channels that carry input from error scripts and the environment towards the agent. This channel segmentation also allows for better management of internal automated scripts, and it focuses the model. Oh, also, one important thing is that you need at least 2 separate output layers, meaning you need to separate your LLM outputs from what is displayed to the user, and they have their own rules they follow. What that allows you to do is display information in a very human-readable way to the real human while hiding all the noise, but also retain the crucial context that's needed for the model to function appropriately.
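
A stripped-down example of routing by channel tag, using the tag names above (the parsing is deliberately naive):

  # Only <message_to_user> reaches the human; everything else stays in the
  # machine-facing layer (logs, tool runner, other agents).
  import re

  CHANNEL_RE = re.compile(
      r"<(think|message_to_user|call_tool|call_agent)>(.*?)</\1>", re.S)

  def route(output: str) -> dict[str, list[str]]:
      channels: dict[str, list[str]] = {}
      for name, body in CHANNEL_RE.findall(output):
          channels.setdefault(name, []).append(body.strip())
      return channels

  sample = "<think>check the config first</think><call_tool>{\"tool\": \"read_file\"}</call_tool>"
  print(route(sample).keys())   # dict_keys(['think', 'call_tool'])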

Bah, anyway, I have rambled for long enough. Good luck folks, hope this info helps someone.

nehalem a day ago

I am glad Vercel works on agents now. After all, Next is absolutely perfect and recommends them for greater challenges. /s

  • benatkin 21 hours ago

    From AWS wrapper to OpenAI wrapper

callamdelaney a day ago

Agent design is still boring and I’m tired of hearing about it.