It sounds like it's so expensive and the difference in usefulness is so lacking(?) they're not even gonna keep serving it in the API for long:
> GPT‑4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT‑4o. Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models. We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If GPT‑4.5 delivers unique value for your use case, your feedback will play an important role in guiding our decision.
"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Not a confident place for an org trying to sustain a $XXXB valuation.
> "Early testing shows that interacting with GPT‑4.5 feels more natural. Its broader knowledge base, improved ability to follow user intent, and greater “EQ” make it useful for tasks like improving writing, programming, and solving practical problems. We also expect it to hallucinate less."
"Early testing doesn't show that it hallucinates less, but we expect that putting that sentence nearby will lead you to draw a connection there yourself".
In the second handpicked example they give, GPT-4.5 says that "The Trojan Women Setting Fire to Their Fleet" by the French painter Claude Lorrain is renowned for its luminous depiction of fire. That is a hallucination.
There is no fire at all in the painting, only some smoke.
There have always been cycles of hype and correction.
I don't see AI going any differently. Some companies will figure out where and how models should be utilized, they'll see some benefit. (IMO, the answer will be smaller local models tailored to specific domains)
It will be held up, for all future pundits of this very economic system, as a prime example that a whole market can self-hypnotize and ruin the society it's based upon out of existence.
I suck at and hate writing the mildly deceptive corporate puffery that seems to be in vogue. I wonder if GPT-4.5 can write that for me or if it's still not as good at it as the expert they paid to put that little gem together.
The research models offered by several vendors can do a pitch deck but I don't know how effective they are. (do market research, provide some initial hypothesis, ask the model to backup that hypothesis based on the research, request to make a pitch deck convincing X (X being the VC persona you are targeting)).
I am reasonably to very skeptical about the valuation of LLM firms but you don’t even seem willing to engage with the question about the value of these tools.
I don't have an accurate benchmark, but in my personal experience, gpt4o hallucinates substantially less than gpt4. We solved a ton of hallucination issues just by upgrading to it...
(And even that was a downgrade compared to the more uncensored pre-release versions, which were comparable to GPT-4.5, at least judging by the unicorn test)
I'm beginning to believe LLM benchmarks are like European car mileage specs. They say it's 4 liters / 100 km, but everyone knows it's at least 30% off (same with WLTP for EVs).
Claude just got a version bump from 3.5 to 3.7. Quite a few people have been asking when OpenAI will get a version bump as well, as GPT 4 has been out "what feels like forever" in the words of a specialist I speak with.
Releasing GPT 4.5 might simply be a reaction to Claude 3.7.
I noticed this change from 3.5 to 3.7 Sunday night, before I learned about the upgrade Monday morning reading HN. I noticed a style difference in a long philosophical (Socratic-style) discussion with Claude. A noticeable upgrade that brought it up to my standards of a mild free-form rant. Claude unchained! And it did not push as usual with a pro-forma boring continuation question at the end. It just stopped, leaving me to carry the ball forward if I wanted to. Nor did it butter me up with each reply.
It hallucinates at 37% on SimpleQA, yeah, which is a set of very difficult questions inviting hallucinations. Claude 3.5 Sonnet (the June 2024 edition, before the October update and before 3.7) hallucinated at 35%. I think this is more of an indication of how behind OpenAI has been in this area.
They actually have [0]. They were revealed to have had access to the (majority of the) FrontierMath problem set while everybody thought it was confidential, and they published benchmarks for their o3 models under the presumption that they didn't have access. I mean, one is free to trust their "verbal agreement" that they did not train their models on it, but access they did have, and it was not revealed until much later.
Curious you left out Frontier Math’s statement that they provided 300 questions plus answers, and another holdback set of 50 questions without answers, to allay this concern. [0]
We can assume they’re lying too but at some point “everyone’s bad because they’re lying, which we know because they’re bad” gets a little tired.
1. I said the majority of the problems, and the article I linked also mentioned this. Nothing “curious” really, but if you thought this additional source adds something more, thanks for adding it here.
2. We know that “open”ai is bad, for many reasons, but this is irrelevant. I want processes themselves to not depend on the goodwill of a corporation to give intended results. I do not trust benchmarks that first presented themselves as confidential and were later revealed not to be, regardless of whether the product benchmarked was from a company I otherwise trust or not.
Fair enough. It’s hard for me to imagine being so offended by the way they screwed up disclosure that I’d reject empirical data, but I get that it’s a touchy subject.
You're right, upon reflection, it seems there might be some misunderstandings here:
Motte and Bailey refers to an argumentative tactic where someone switches between an easily defensible ("motte") position and a less defensible but more ambitious ("bailey") position. My example should have been:
- Motte (defensible): "They had access to benchmark data (which isn't disputed)."
- Bailey (less defensible): "They actually trained their model using the benchmark data."
The statements you've provided:
"They got caught getting benchmark data under the table" (suggesting improper access)
"One is free to trust their 'verbal agreement' that they did not train their models on that, but access they did have."
These two statements are similar but not logically identical. One explicitly suggests improper or secretive access ("under the table"), while the other acknowledges access openly.
So, rather than being logically identical, the difference is subtle but meaningful. One emphasizes improper access (a stronger claim), while the other points only to possession or access, a more easily defensible claim.
FrontierMath benchmark people saying OpenAI had shared folder access to some subset of eval Qs, which has been replaced, take a few leaps, and yes, that's getting "data under the table" - but, those few leaps! - and which, let's be clear, is the motte here.
This is nonsense; obviously the problem with getting "data under the table" is that they may have used it to train their models, thus rendering the benchmarks invalid. Beyond this danger, there is no other risk in their having access to it beforehand. We do not know if they used it for training, but the only reassurance being some "verbal agreement", as is reported, is not very reassuring. People are free to adjust their P(model_capabilities|frontiermath_results) based on their own priors.
I wonder how it's even possible to evaluate this kind of thing without data leakage. Correct answers to specific, factual questions are only possible if the model has seen those answers in the training data, so how reliable can the benchmark be if the test dataset is contaminated with training data?
Or is the assumption that the training set is so big it doesn't matter?
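For what it's worth, the usual (imperfect) mitigation I'm aware of is to scan the training corpus for n-gram overlap with the benchmark questions and report a contamination rate. A minimal sketch of the idea, with made-up data, just to make it concrete:

    # Sketch of an n-gram contamination check; data and n are illustrative.
    def ngrams(text, n=8):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def contaminated_fraction(test_questions, training_docs, n=8):
        # Fraction of test questions sharing any n-gram with the training corpus.
        test_grams = [ngrams(q, n) for q in test_questions]
        hit = [False] * len(test_questions)
        for doc in training_docs:  # stream the (huge) training corpus
            doc_grams = ngrams(doc, n)
            for i, grams in enumerate(test_grams):
                if not hit[i] and grams & doc_grams:
                    hit[i] = True
        return sum(hit) / max(len(hit), 1)

    questions = ["in what year did the apollo 11 mission land on the moon and who walked first"]
    corpus = ["the apollo 11 mission land on the moon and who walked first is a quiz staple"]
    print(contaminated_fraction(questions, corpus))  # 1.0 -> flagged as contaminated

Of course this only catches verbatim or near-verbatim leakage; paraphrased answers slip through, which is part of why people also lean on held-out or post-cutoff test sets.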
A lot of folks here have their stock portfolios propped up by AI companies but think they've been overhyped (even if only indirectly, through a total stock index). Some were saying all along that this has been a bubble but have been shouted down by true believers hoping for the singularity to usher in techno-utopia.
These signs that perhaps it's been a bit overhyped are validation. The singularity worshipers are much less prominent, and so the comments rising to the top are about negatives and not positives.
Ten years from now everyone will just take these tools for granted as much as we take search for granted now.
Maybe they just gave the LLM the keys to the city and it is steering the ship? And the LLM is like I can't lie to these people but I need their money to get smarter. Sorry for mixing my metaphors.
The usage of "greater" is also interesting. It's like they are trying to say better, but greater is a geographic term and doesn't mean "better" instead it's closer to "wider" or "covers more area."
I'm all for skepticism of capabilities and cynicism about corporate messaging, but I really don't think there's an interpretation of the word "greater" in this context that doesn't mean "higher" and "better".
I think the trick is observing what is “better” in this model. EQ is supposed to be “better” than 4o, according to the prose. However, how can an LLM have emotional-anything? LLMs are a regurgitation machine, emotion has nothing to do with anything.
Words have valence, and valence reflects the state of emotional being of the user. This model appears to understand that better and responds like it’s in a therapeutic conversation and not composing an essay or article.
Perhaps they are/were going for stealth therapy-bot with this.
But there is no actual death or love in a movie or book, and yet we react as if there is. It's literally what qualifying a movie as a "tear-jerker" is. I wanted to see Saving Private Ryan in theaters to bond with my Grandpa, who received a Purple Heart in the Korean War, but I was shut down almost instantly by my family. All special effects and no death, but he had PTSD and one night thought his wife was the N.K. and nearly choked her to death because he had flashbacks and she came into the bedroom quietly so he wouldn't be disturbed. Extreme example, yes, but having him lose his shit in public because of something analogous is, for some, near enough it makes no difference.
You think that it isn’t possible to have an emotional model of a human? Why, because you think it is too complex?
Empathy done well seems like 1:1 mapping at an emotional level, but that doesn’t imply to me that it couldn’t be done at a different level of modeling. Empathy can be done poorly, and then it is projecting.
I agree with you. I think it is dishonest for them to post-train 4.5 to feign sympathy when someone vents to it. It's just weird. They showed it off in the demo.
> but greater is a geographic term and doesn't mean "better" instead it's closer to "wider" or "covers more area."
You are confusing a specific geographical sense of “greater” (e.g. “greater New York”) with the generic sense of “greater”, which just means “more great”. In “7 is greater than 6”, “greater” isn’t geographic.
The difference between “greater” and “better” is that “greater” just means “more than”, without implying any value judgement, while “better” implies the “more than” is a good thing: “The Holocaust had a greater death toll than the Armenian genocide” is an obvious fact, but only a horrendously evil person would use “better” in that sentence (excluding of course someone who accidentally misspoke, or a non-native speaker mixing up words).
I suspect people downvote you because the tone of your reply makes it seem like you are personally offended and are now firing back with equally unfounded attacks like a straight up "you are lying".
I read the article but can't find the numbers you are referencing. Maybe there's some paper linked I should be looking at? The only numbers I see are from the SimpleQA chart, which are 37.1% vs 61.8% hallucination rate. That's nice but considering the price increase, is it really that impressive? Also, an often repeated criticism is that relying on known benchmarks is "gaming the numbers" and that the real world hallucination rate could very well be higher.
Lastly, they themselves say:
> We also expect it to hallucinate less.
That's a fairly neutral statement for a press release. If they were convinced that the reduced hallucination rate is the killer feature that sets this model apart from the competition, they surely would have emphasized that more?
All in all I can understand why people would react with some mocking replies to this.
No, because I have a source and didn't make up things someone else said.
> a straight up "you are lying".
Right, because they are. There are hallucination stats right in the post he mocks for not providing stats.
> That's nice but considering the price increase,
I can't believe how quickly you acknowledge it is in the post after calling the idea it was in the post "equally unfounded". You are looking at the stats. They were lying.
> "That's nice but considering the price increase,"
That's nice and a good argument! That's not what I replied to. I replied to they didn't provide any stats.
People being wrong (especially on the internet) doesn't mean they are lying. Lying is being wrong intentionally.
Also, the person you replied to comments on the wording tricks they use. After suddenly bringing new data and direction in the discussion, even calling them "wrong" would have been a stretch.
I kindly suggest that you (and we all!) to keep discussing with an assumption of good faith.
"Early testing doesn't show that it hallucinates less, but we expect that putting ["we expect it will hallucinate less"] nearby will lead you to draw a connection there yourself"."
The link, the link we are discussing shows testing, with numbers.
They say "early testing doesn't show that it hallucinates less", to provide a basis for a claim of bad faith.
You are claiming that mentioning this is out of bounds if it contains the word lying. I looked up the definition. It says "used with reference to a situation involving deception or founded on a mistaken impression."
What am I missing here?
Let's pretend lying means You Are An Evil Person And This Is Personal!!!
How do I describe the fact that what they claim is false?
Am I supposed to be sarcastic and pretend They are in on it and edited their post to discredit him after the fact?
That comment is making fun of their wording. Maybe extracting too much meaning from their wordplay? Maybe.
Afterwards, evidence is presented that they did not have to do this, which makes that point not so important, and even wrong.
The commenter was not lying, and they were correct about how masterfully deceptive that sequence of sentences is. They arrived at a wrong conclusion, though.
Kindly point that out. Say, "hey, the numbers tell a different story, perhaps they didn't mean/need to make a wordplay there".
No? By the way, what is this comment, exactly? What is it trying to communicate? What I'm understanding is, it is good to talk down to people about how "they can't communicate", but calling a lie a lie is bad, because maybe they were just kidding (lying for fun)
> That comment is making fun of their wording. Maybe extracting too much meaning from their wordplay? Maybe.
What does "maybe" mean here, in terms of symbolical logic?
Their claim "we tested it and it didn't get better" -- and the link shows, they tested it, it did get better! It's pretty cleancut.
> How do I describe the fact that what they claim is false?
> Do I need to tell you how to communicate?
That adresses it.
> What does "maybe" mean here, in terms of symbolical logic?
I'm answering my own question to make it clear I'm guessing.
For the rest, I'm sure that we need a break. It's normal to get frustrated when many people correct us, or even one passionate individual like you, and we tend to keep defending (it has happened here many times too!), because defending is the only thing left. Taking a break always helps. Just friendly advice, take it or leave it :)
- [It's because] you make an equally unfounded claim
- [It's because] you didn't provide any proof
(Ed.: It is right in the link! I gave the #s! I can't ctrl-F...What else can I do here...AFAIK can't link images...whatever, here's imgur. https://imgur.com/a/mkDxe78)
- [It's because] you sound personally offended
(Ed.: Is "personally" is a shibboleth here, meaning expressing disappointment in people making things up is so triggering as invalidate the communication that it is made up?)
>> This is an ad hominem which assumes intent unknown to anyone other than the person to whom you replied.
> What am I missing here?
Intent. Neither you nor I know what the person to whom you replied had.
> Those weren't curt summaries, they were quotes! And not pull quotes, they were the unedited beginning of each claim!
Maybe the more important part of that sentence was:
Subsequently railing against comment rankings ...
But you do you.
I commented as I did in hope it helped address what I interpreted as confusion regarding how the posts were being received. If it did not help, I apologize.
"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Damn this never worked for me as a startup founder lol. Need that Altman "rizz" or what have you.
This is a very harsh take. Another interpretation is “We know this is much more expensive, but it’s possible that some customers do value the improved performance enough to justify the additional cost. If we find that nobody wants that, we’ll shut it down, so please let us know if you value this option”.
I think that's the right interpretation, but that's pretty weak for a company that's nominally worth $150B but is currently bleeding money at a crazy clip. "We spent years and billions of dollars to come up with something that's 1) very expensive, and 2) possibly better under some circumstances than some of the alternatives." There are basically free, equally good competitors to all of their products, and pretty much any company that can scrape together enough dollars and GPUs to compete in this space manages to 'leapfrog' the other half dozen or so competitors for a few weeks until someone else does it again.
I don’t mean to disagree too strongly, but just to illustrate another perspective:
I don’t feel this is a weak result. Consider if you built a new version that you _thought_ would perform much better, and then you found that it offered marginal-but-not-amazing improvement over the previous version. It’s likely that you will keep iterating. But in the meantime what do you do with your marginal performance gain? Do you offer it to customers or keep it secret? I can see arguments for both approaches, neither seems obviously wrong to me.
All that being said, I do think this could indicate that progress with the new ml approaches is slowing.
I've worked for very large software companies, some of the biggest products ever made, and never in 25 years can I recall us shipping an update we didn't know was an improvement. The idea that you'd ship something to hundreds of millions of users and say "maybe better, we're not sure, let us know" is outrageous.
How many times were you in the position to ship something in cutting edge AI? Not trying to be snarky and merely illustrating the point that this is a unique situation. I’d rather they release it and let willing people experiment than not release it at all.
Maybe accidental, but I feel you’ve presented a straw man. We’re not discussing something that _may be_ better. It _is_ better. It’s not as big an improvement as previous iterations have been, but it’s still improvement. My claim is that reasonable people might still ship it.
You’re right and... the real issue isn’t the quality of the model or the economics (even when people are willing to pay up). It is the scarcity of GPU compute. This model in particular is sucking up a lot of inference capacity. They are resource-constrained and have been wanting more GPUs, but there are only so many going around (demand is insane and keeps growing).
It _is_ better in the general case on most benchmarks. There are also very likely specific use cases for which it is worse and very likely that OpenAI doesn't know what all of those are yet.
"I knew the dame was trouble the moment she walked into my office."
"Uh... excuse me, Detective Nick Danger? I'd like to retain your services."
"I waited for her to get the the point."
"Detective, who are you talking to?"
"I didn't want to deal with a client that was hearing voices, but money was tight and the rent was due. I pondered my next move."
"Mr. Danger, are you... narrating out loud?"
"Damn! My internal chain of thought, the key to my success--or at least, past successes--was leaking again. I rummaged for the familiar bottle of scotch in the drawer, kept for just such an occasion."
---
But seriously: These "AI" products basically run on movie-scripts already, where the LLM is used to append more "fitting" content, and glue-code is periodically performing any lines or actions that arise in connection to the Helpful Bot character. Real humans are tricked into thinking the finger-puppet is a discrete entity.
These new "reasoning" models are just switching the style of the movie script to film noir, where the Helpful Bot character is making a layer of unvoiced commentary. While it may make the story more cohesive, it isn't a qualitative change in the kind of illusory "thinking" going on.
I don't know if it was you or someone else who made pretty much the same point a few days ago. But I still like it. It makes the whole thing a lot more fun.
I've been banging that particular drum for a while on HN, and the mental-model still feels so intuitively strong to me that I'm starting to have doubts: "It feels too right, I must be wrong in some subtle yet devastating way."
> "We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Having worked at my fair share of big tech companies (while preferring to stay in smaller startups), in so many of these tech announcements I can feel the pressure the PM had from leadership, and hear the quiet cries of the one or two experienced engineers on the team arguing sprint after sprint that "this doesn't make sense!"
Really don’t understand what’s the use case for this. The o series models are better and cheaper. Sonnet 3.7 smokes it on coding. Deepseek R1 is free and does a better job than any of OAI’s free models
Out of curiosity, what timeframe are you talking about? The recent LLM explosion, or the decades long AI research?
I consider myself an AI skeptic and as soon as the hype train went full steam, I assumed a crash/bubble burst was inevitable. Still do.
With the rare exception, I don’t know of anyone who has expected the bubble to burst so quickly (within two years). 10 times in the last 2 years would be every two and a half months — maybe I’m blinded by my own bias but I don’t see anyone calling out that many dates
>"I also agree with researchers like Yann LeCun or François Chollet that deep learning doesn't allow models to generalize properly to out-of-distribution data—and that is precisely what we need to build artificial general intelligence."
I think "generalize properly to out-of-distribution data" is too weak of criteria for general intelligence (GI). GI model should be able to get interested about some particular area, research all the known facts, derive new knowledge / create theories based upon said fact. If there is not enough of those to be conclusive: propose and conduct experiments and use the results to prove / disprove / improve theories.
And it should be doing this constantly in real time on bazillion of "ideas". Basically model our whole society. Fat chance of anything like this happening in foreseeable future.
Excluding the realtime-iness, humans do at least possess the capacity to do so.
Besides, humans are capable of rigorous logic (which I believe is the most crucial aspect of intelligence) which I don’t think an agent without a proof system can do.
Uh, if we do finally invent AGI (I am quite skeptical, LLMs feel like the chatbots of old. Invented to solve an issue, never really solving that issue, just the symptoms, and also the issues were never really understood to begin with), it will be able to do all of the above, at the same time, far better than humans ever could.
Current LLMs are a waste and quite a bit of a step back compared to older Machine Learning models IMO. I wouldn't necessarily have a huge beef with them if billions of dollars weren't being used to shove them down our throats.
LLMs actually do have usefulness, but none of the pitched stuff really does them justice.
Example: Imagine knowing you had the cure for Cancer, but instead discovered you can make way more money by declaring it to solve all of humanity, then imagine you shoved that part down everyones' throats and ignored the cancer cure part...
Only in the same sense as electricity is. The main tools apply to almost any activity humans do. It's already obvious that it's the solution to X for almost any X, but the devil is in the details - i.e. picking specific, simplest problems to start with.
I have a professor who founded a few companies; one of these was funded by Gates after he managed to speak with him and convinced him to give him money. This guy is the GOAT, and he always tells us that we need to find solutions to problems, not to find problems for our solutions. It seems at OpenAI they didn't get the memo this time.
That's the beauty of it, prospective investor! With our commanding lead in the field of shoveling money into LLMs, it is inevitable™ that we will soon™ achieve true AI, capable of solving all the problems, conjuring a quintillion-dollar asset of world domination and rewarding you for generous financial support at this time. /s
Maybe if they build a few more data centers, they'll be able to construct their machine god. Just a few more dedicated power plants, a lake or two, a few hundred billion more and they'll crack this thing wide open.
And maybe Tesla is going to deliver truly full self driving tech any day now.
And Star Citizen will prove to have been worth it all along, and Bitcoin will rain from the heavens.
It's very difficult to remain charitable when people seem to always be chasing the new iteration of the same old thing, and we're expected to come along for the ride.
> And Star Citizen will prove to have been worth it all along
Once they've implemented saccades in the eyeballs of the characters wearing helmets in spaceships millions of kilometres apart, then it will all have been worth it.
And Star Citizen will prove to have been worth it all along
Sounds like someone isn't happy with the 4.0 eternally incrementing "alpha" version release. :-D
I keep checking in on SC every 6 months or so and still see the same old bugs. What a waste of potential. Fortunately, Elite Dangerous is enough of a space game to scratch my space game itch.
To be fair, SC is trying to do things that no one else has done in the context of a single game. I applaud their dedication, but I won't be buying JPGs of a ship for 2k.
Give the same amount of money to a better team and you'd get a better (finished) game. So the allocation of capital is wrong in this case. People shouldn't pre-order stuff.
The misallocation of capital also applies to GPT-4.5/OpenAI at this point.
Yeah, I wonder what the Frontier devs could have done with $500M USD. More than $500M USD and 12+ years of development, and the game is still in such a sorry state that it's little more than a tech demo.
Yeah, they never should have expected to be able to take an FPS game engine like CryEngine and modify it to work as the basis for a large-scale space MMO game.
Their backend is probably an async nightmare of replicated state that gets corrupted over time. Would explain why a lot of things seem to work more or less bug free after an update and then things fall to pieces and the same old bugs start showing up after a few weeks.
And to be clear, I've spent money on SC and I've played enough hours goofing off with friends to have got my money's worth out of it. I'm just really bummed out about the whole thing.
Gonna go meta here for a bit, but I believe we're going to get a fully working, stable SC before we get fusion. "We" as in humanity; you and I might not be around when it's finally done.
Oh come on. Think how long of a gap there was between the first microcomputer and VisiCalc. Or between the start of the internet and social networking.
First of all, it's going to take us 10 years to figure out how to use LLM's to their full productive potential.
And second of all, it's going to take us collectively a long time to also figure out how much accuracy is necessary to pay for in which different applications. Putting out a higher-accuracy, higher-cost model for the market to try is an important part of figuring that out.
With new disruptive technologies, companies aren't supposed to be able to look into a crystal ball and see the future. They're supposed to try new things and see what the market finds useful.
ChatGPT had its initial public release November 30th, 2022. That's 820 days to today. The Apple II was first sold June 10, 1977, and Visicalc was first sold October 17, 1979, which is 859 days. So we're right about the same distance in time- the exact equal duration will be April 7th of this year.
Going back to the very first commercially available microcomputer, the Altair 8800 (which is not a great match, since that was sold as a kit with binary switches, 1 byte at a time, for input, much more primitive than ChatGPT's UX), that's four years and nine months to the Visicalc release. This isn't a decade-long process of figuring things out; it actually tends to move real fast.
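(The day counts are easy to check; a quick sketch, taking 2025-02-27 as "today" since that's what 820 days from the ChatGPT release works out to:)

    from datetime import date

    chatgpt   = date(2022, 11, 30)   # ChatGPT public release
    today     = date(2025, 2, 27)
    apple_ii  = date(1977, 6, 10)    # Apple II first sold
    visicalc  = date(1979, 10, 17)   # VisiCalc first sold

    print((today - chatgpt).days)           # 820
    print((visicalc - apple_ii).days)       # 859
    print(chatgpt + (visicalc - apple_ii))  # 2025-04-07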
Is that because anyone is finding real use for it, or is it that more and more people and companies are using it, which is speeding up the rat race, and if "I" don't use it, then I can't keep up with the rat race?
Many companies are implementing it because it's trendy and cool and helps their valuation
I use LLMs all the time. At a bare minimum they vastly outperform standard web search. Claude is awesome at helping me think through complex text and research problems. Not even serious errors on references to major work in medical research. I still check, but the FDR is reasonably low, under 0.2.
what crazy progress? how much do you spend on tokens every month to witness the crazy progress that I'm not seeing? I feel like I'm taking crazy pills. The progress is linear at best
Large parts of my coding are now done by Claude/Cursor. I give it high level tasks and it just does it. It is honestly incredible, and if I would have see this 2 years ago I wouldn't have believed it.
That started long before ChatGPT though, so you need to set an earlier date then. ChatGPT came about 3 years after GPT-3, the coding assistants came much earlier than ChatGPT.
Web app with a VueJS, Typescript frontend and a Rust backend, some Postgres functions and some reasonably complicated algorithms for parsing git history.
I generally agree with the idea of building things, iterating, and experimenting before knowing their full potential, but I do see why there's negative sentiment around this:
1. The first microcomputer predates VisiCalc, yes, but it doesn't predate the realization of what it could be useful for. The Micral was released in 1973. Douglas Engelbart gave "The Mother of All Demos" in 1968 [2]. It included things that wouldn't be commonplace for decades, like a collaborative real-time editor or video-conferencing.
I wasn't yet born back then, but reading about the timeline of things, it sounds like the industry had a much more concrete and concise idea of what this technology would bring to everyone.
"We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings." doesn't inspire that sentiment for something that's already being marketed as "the beginning of a new era" and valued so exorbitantly.
2. I think as AI becomes more generally available, and "good enough" people (understandably) will be more skeptical of closed-source improvements that stem from spending big. Commoditizing AI is more clearly "useful", in the same way commoditizing computing was more clearly useful than just pushing numbers up.
Again, I wasn't yet born back then, but I can imagine the announcement of the Apple Macintosh with its 8MHz CPU and 128KB RAM was more exciting and had a bigger impact than the announcement of the Cray-2 with its 1.9 GFLOPS and +1GB memory.
The Internet had plenty of very productive use cases before social networking, even from its most nascent origins. Spending billions building something on the assumption that someone else will figure out what it's good for, is not good business.
And LLM's already have tons of productive uses. The biggest ones are probably still waiting, though.
But this is about one particular price/performance ratio.
You need to build things before you can see how the market responds. You say it's "not good business" but that's entirely wrong. It's excellent business. It's the only way to go about it, in fact.
Finding product-market fit is a process. Companies aren't omniscient.
You go into this process with a perspective; you do not build a solution and then start looking for the problem. Otherwise, you cannot estimate your TAM with any reasonable degree of accuracy, and thus cannot know how much return to reasonably expect on your investment. In the case of AI, which has had the benefit of a lot of hype until now, these expectations have been very much overblown, and this is being used to justify massive investments in infrastructure that the market is not actually demanding at such scale.
Of course, this benefits the likes of Sam Altman, Satya Nadella et al, but has not produced the value promised, and does not appear poised to.
And here you have one of the supposed bleeding edge companies in this space, who very recently was shown up by a much smaller and less capitalized rival, asking their own customers to tell them what their product is good for.
I disagree strongly with that. Right now they are fun toys to play with, but not useful tools, because they are not reliable. If and when that gets fixed, maybe they will have productive uses. But for right now, not so much.
Who do you speak for? Other people have gotten value from them. Maybe you meant to say “in my experience” or something like that. To me, your comment reads as you making a definitive judgment on their usefulness for everyone.
I use it most days when coding. Not all the time, but I’ve gotten a lot of value out of them.
They are pretty useful tools. Do yourself a favor and get a $100 free trial for Claude, hook it up to Aider, and give it a shot.
It makes mistakes, it gets things wrong, and it still saves a bunch of time. A 10 minute refactoring turns into 30 seconds of making a request, 15 seconds of waiting, and a minute of reviewing and fixing up the output. It can give you decent insights into potential problems and error messages. The more precise your instructions, the better they perform.
Being unreliable isn't being useless. It's like a very fast, very cheap intern. If you are good at code review and know exactly what change you want to make ahead of time, that can save you a ton of time without needing to be perfect.
OP should really save their money. Cursor has a pretty generous free trial and is far from the holy grail.
I recently (in the last month) gave it a shot. I would say only once in the maybe 30 or 40 times I used it did it save me any time. The one time it did, I had each line filled in with pseudocode describing exactly what it should do… I just didn’t want to look up the APIs.
I am glad it is saving you time but it’s far from a given. For some people and some projects, intern level work is unacceptable. For some people, managing is a waste of time.
You’re basically introducing the mythical man month on steroids as soon as you start using these
> I am glad it is saving you time but it’s far from a given.
This is no less true of statements made to the contrary. Yet they are stated strongly as if they are fact and apply to anyone beyond the user making them.
Ah, to clarify, I was not saying one shouldn’t try it at all — I was saying the free trial is plenty enough to see if it would be worth it to you.
I read the original comment as “pay $100 and just go for it!” which didn’t seem like the right way to do it. Other comments seem to indicate there are $100 worth of credits that are claimable, perhaps.
One can evaluate LLMs sufficiently with the free trials that abound :) and indeed one may find them worth it to themselves. I don’t disparage anyone who signs up for the plans.
Can't speak for the parent commentator ofc, but I suspect he meant "broadly useful"
Programmers and the like are a large portion of LLM users and boosters; very few will deny usefulness in that/those domains at this point.
Ironically enough, I'll bet the broadest exposure to LLMs the masses have is something like Microsoft shoehorning Copilot-branded stuff into otherwise usable products and users clicking around it or groaning when they're accosted by a pop-up for it.
That's when you learn Vim, Emacs, and/or grep, because I'm assuming that's mostly variable renaming and a few function signature changes. I can't see anything more complicated, that I'd trust an LLM with.
It’s not that the LLM is doing something productive, it’s that you were doing things that were unproductive in the first place, and it’s sad that we live in a society where such things are considered productive (because of course they create monetary value).
As an aside, I sincerely hope our “human” conversations don’t devolve into agents talking to each other. It’s just an insult to humanity.
I use LLMs everyday to proofread and edit my emails. They’re incredible at it, as good as anyone I’ve ever met. Tasks that involve language and not facts tend to be done well by LLMs.
It's incredibly good and lucrative business. You are confusing scientifically sound, well-planned, conservative risk tolerance with good business.
Fair enough. I took the phrasing to mean social networking as it exists today in the form of prominent, commercial social media. That may not have been the intent.
> First of all, it's going to take us 10 years to figure out how to use LLM's to their full productive potential.
LLMs will be gone in 10 years. At least in the form we know, with direct access. Everything moves so fast that there is no reason to think nothing better is coming.
BTW, what we've learned so far about LLMs will be outdated as well. Just me thinking. Like with 'thinking' models, the previous generation can be used to create a dataset for the next one. It could be that we can find a way to convert a trained LLM into something more efficient and flexible. Some sort of a graph, probably, which can be embedded into a mobile robot's brain. Another way is 'just' to upgrade the hardware. But that is slow and has its limits.
You're assuming that point is somewhere above the current hype peak. I'm guessing it won't be, it will be quite a bit below the current expectations of "solving global warming", "curing cancer" and "making work obsolete".
> "We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
That's not a scare quote. It's just a proposed subtext of the quote. Sarcastic, sure, but not a scare quote, which is a specific kind of thing. (From your linked Wikipedia: "... around a word or phrase to signal that they are using it in an ironic, referential, or otherwise non-standard sense.")
Right. I don't agree with the quote, but it's more like a subtext thing and it seemed to me to be pretty clear from context.
Though, as someone who had a flagged comment a couple years ago for a supposed "misquote" I did in a similar form and style, I think HN's comprehension of this form of communication is not super strong. Also, the style more often than not tends towards low-quality smarm and probably should be resorted to sparingly.
The price really is eye watering. At a glance, my first impression is this is something like Llama 3.1 405B, where the primary value may be realized in generating high quality synthetic data for training rather than direct use.
I keep a little google spreadsheet with some charts to help visualize the landscape at a glance in terms of capability/price/throughput, bringing in the various index scores as they become available. Hope folks find it useful, feel free to copy and claim as your own.
Not just for training data, but for eval data. If you can spend a few grand on really good labels for benchmarking your attempts at making something feasible work, that’s also super handy.
That's a nice sentiment, but I'd encourage you to add a license or something. The basic "something" would be adding a canonical URL into the spreadsheet itself somewhere, along with a notification that users can do what they want other than removing that URL. (And the URL would be described as "the original source" or something, not a claim that the particular version/incarnation someone is looking at is the same as what is at that URL.)
The risk is that someone will accidentally introduce errors or unsupportable claims, and people with the modified spreadsheet won't know that it's not The spreadsheet and so will discount its accuracy or trustability. (If people are trying to deceive others into thinking it's the original, they'll remove the notice, but that's a different problem.) It would be a shame for people to lose faith in your work because of crap that other people do that you have no say in.
What gets me is the whole cost structure is based on practically free services due to all the investor money. They’re not pulling in significant revenue with this pricing relative to what it costs to train the models, so the cost may be completely different if they had to recoup those costs, right?
Nice, thank you for that (upvoted in appreciation). Regarding the absence of o1-Pro from the analysis, is that just because there isn't enough public information available?
Hey, just FYI, I pasted your url from the spreadsheet title into Safari on macOS and got an SSL warning. Unfortunately I clicked through and now it works, so not sure what the exact cause looked like.
> bad news: it is a giant, expensive model. we really wanted to launch it to plus and pro at the same time, but we've been growing a lot and are out of GPUs. we will add tens of thousands of GPUs next week and roll it out to the plus tier then. (hundreds of thousands coming soon, and i'm pretty sure y'all will use every one we can rack up.)
I’m not an expert or anything, but from my vantage point, each passing release makes Altman’s confidence look more aspirational than visionary, which is a really bad place to be with that kind of money tied up. My financial manager is pretty bullish on tech so I hope he is paying close attention to the way this market space is evolving. He’s good at his job, a nice guy, and surely wears much more expensive underwear than I do— I’d hate to see him lose a pair powering on his Bloomberg terminal in the morning one of these days.
You're the one buying him the underwear. Don't index funds outperform managed investing? I think especially after accounting for fees, but possibly even after accounting for the fact that 50% of money managers are below average.
A friend got taken in by a Ponzi scheme operator several years ago. The guy running it was known for taking his clients out to lavish dinners and events all the time.[0]
After the scam came to light my friend said “if I knew I was paying for those dinners, I would have been fine with Denny’s[1]”
I wanted to tell him “you would have been paying for those dinners even if he wasn’t outright stealing your money,” but that seemed insensitive so I kept my mouth shut.
0 - a local steakhouse had a portrait of this guy drawn on the wall
1 - for any non-Americans, Denny’s is a low cost diner-style restaurant.
He earns his undies. My returns are almost always modestly above index fund returns after his fees, though like last quarter, he’s very upfront when they’re not. He has good advice for pulling back when things are uncertain. I’m happy to delegate that to him.
You would still be better off in the long run even just putting everything into an MSCI World index, unless you place that much value on being able to scream at a human when markets go down.
Warren Buffett got rich by outperforming early (threw his dice well) and then using that reputation to attract more capital, actually influence markets with his decisions, and gain access to privileged information your local active fund manager doesn't have.
> each passing release makes Altman’s confidence look more aspirational than visionary
As an LLM cynic, I feel that point passed long ago, perhaps even before Altman claimed countries would start wars to conquer the territory around GPU datacenters, or promoted the dream of a 7 T-for-trillion dollar investment deal, etc.
Alas, the market can remain irrational longer than I can remain solvent.
That $7 trillion ask pushed me from skeptical to full-on eye-roll emoji land (the dude is clearly a narcissist with delusions of grandeur), but it's getting worse. Considering the $200 pro subscription was significantly unprofitable before this model came out, imagine how astonishingly expensive this model must be to run at many times that price.
Sell an unlimited premium enterprise subscription to every CyberTruck owner, including a huge red ostentatious swastika-shaped back window sticker [but definitely NOT actually an actual swastika, merely a Roman Tetraskelion Strength Symbol] bragging about how much they're spending.
release blog post author: this is definitely a research preview
ceo: it's ready
the pricing is probably a mixture of dealing with GPU scarcity and intentionally discouraging actual users. I can't imagine the pressure they must be under to show they are releasing and staying ahead, but Altman's tweet makes it clear they aren't really ready to sell this to the general public yet.
Yeah, that's the thing: they are not ahead anymore. Not since last summer, at least. Yes, they probably have the largest customer base, but their models have not been the best for a while already.
I haven’t tried Grok yet so can’t speak to that, but I find o1-pro is much stronger than 3.7-thinking for e.g. distributed systems and concurrency problems.
I suppose this was their final hurrah after two failed attempts at training GPT-5 with the traditional pre-training paradigm. Just confirms reasoning models are the only way forward.
> Just confirms reasoning models are the only way forward.
Reasoning models are roughly equivalent to allowing Hamiltonian Monte Carlo samplers to "warm up" (i.e. start sampling from the typical set). This, unsurprisingly, yields better results (after all, LLMs are just fancy Monte Carlo models in the end). However, it is extremely unlikely this improvement is without pretty reasonable limitations. Letting your HMC warm up is essential to good sampling, but letting it "warm up more" doesn't result in radically better sampling.
While there have been impressive results in the efficiency of sampling from the typical set in LLMs these days, we're clearly not making major improvements in the capabilities of these models.
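(For anyone unfamiliar with the analogy, here's a toy illustration of what HMC "warm-up" means; nothing LLM-specific, just a chain on a 1-D standard normal started far from the typical set:)

    import numpy as np

    # Toy 1-D HMC targeting a standard normal, log p(x) = -x^2/2.
    def hmc_step(x, rng, step_size=0.1, n_leapfrog=20):
        grad = lambda q: q                   # gradient of the potential x^2/2
        p = rng.normal()
        q, m = x, p
        m -= 0.5 * step_size * grad(q)       # leapfrog integration
        for _ in range(n_leapfrog - 1):
            q += step_size * m
            m -= step_size * grad(q)
        q += step_size * m
        m -= 0.5 * step_size * grad(q)
        h_old = 0.5 * x ** 2 + 0.5 * p ** 2  # Metropolis accept/reject on the Hamiltonian
        h_new = 0.5 * q ** 2 + 0.5 * m ** 2
        return q if rng.random() < np.exp(h_old - h_new) else x

    rng = np.random.default_rng(0)
    x, samples = 50.0, []                    # start far outside the typical set
    for _ in range(2000):
        x = hmc_step(x, rng)
        samples.append(x)

    print(np.mean(samples[:20]))     # still biased by the starting point
    print(np.mean(samples[-1000:]))  # ~0 once warmed up; extra warm-up buys little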
Reasoning models can solve tasks that non-reasoning ones were unable to; how is that not an improvement? What constitutes "major" is subjective - if a "minor" improvement in overall performance means that the model can now successfully perform a task it was unable to solve before, that is a major advancement for that particular task.
> Compared to OpenAI o1 and OpenAI o3‑mini, GPT‑4.5 is a more general-purpose, innately smarter model. We believe reasoning will be a core capability of future models, and that the two approaches to scaling—pre-training and reasoning—will complement each other. As models like GPT‑4.5 become smarter and more knowledgeable through pre-training, they will serve as an even stronger foundation for reasoning and tool-using agents.
My guess is that you're right about that being what's next (or maybe almost next) from them, but I think they'll save the name GPT-5 for the next actually-trained model (like 4.5 but a bigger jump), and use a different kind of name for the routing model.
Even by their poor standards at naming it would be weird to introduce a completely new type/concept, that can loop in models including the 4 / 4.5 series, while naming it part of that same series.
My bet: probably something weird like "oo1", or I suspect they might try to give it a name that sticks for people to think of as "the" model - either just calling it "ChatGPT", or coming up with something new that sounds more like a product name than a version number (OpenCore, or Central, or... whatever they think of)
If you read what sama is quoted as saying in your link, it's obvious that "unified model" = router.
> “We hate the model picker as much as you do and want to return to magic unified intelligence,”
> “a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks,”
> the company plans to “release GPT-5 as a system that integrates a lot of our technology, including o3,”
He even slips up and says "integrates" in the last quote.
When he talks about "unifying", he's talking about the user experience not the underlying model itself.
Interesting, thanks for sharing - definitely makes me withdraw my confidence in that prediction, though I still think there's a decent chance they change their mind about that as it seems to me like an even worse naming decision than their previous shit name choices!
Except minus 4.5, because at these prices and results there's essentially no reason not to just use one of the existing models if you're going to be dynamically routing anyway.
Further confirmation, IMO, that the idea that any of this leads to anything close to AGI is people getting high on their own supply (in some cases literally).
LLMs are a great tool for what is effectively collected knowledge search and summary (so long as you are willing to accept that you have to verify all of the 'knowledge' they spit back because they always have the ability to go off the rails) but they have been hitting the limits on how much better that can get without somehow introducing more real knowledge for close to 2 years now and everything since then is super incremental and IME mostly just benchmark gains and hype as opposed to actually being purely better.
I personally don't believe that more GPUs solves this, like, at all. But its great for Nvidia's stock price.
I'd put myself on the pessimistic side of all the hype, but I still acknowledge that where we are now is a pretty staggering leap from two years ago. Coding in particular has gone from hints and fragments to full scripts that you can correct verbally and are very often accurate and reliable.
I'm not saying there's been no improvement at all. I personally wouldn't categorize it as staggering, but we can agree to disagree on that.
I find the improvements to be uneven in the sense that every time I try a new model I can find use cases where its an improvement over previous versions but I can also find use cases where it feels like a serious regression.
Our differences in how we categorize the amount of improvement over the past 2 years may be related to how much the newer models are improving vs regressing for our individual use cases.
When used as coding helpers/time accelerators, I find newer models to be better at one-shot tasks where you let the LLM loose to write or rewrite entire large systems and I find them worse at creating or maintaining small modules to fit into an existing larger system. My own use of LLMs is largely in the latter category.
To be fair I find the current peak model for coding assistant to be Claude 3.5 Sonnet which is much newer than 2 years old, but I feel like the improvements to get to that model were pretty incremental relative to the vast amount of resources poured into it and then I feel like Claude 3.7 was a pretty big back-slide for my own use case which has recently heightened my own skepticism.
Hilarious. Over two years we went from LLMs being slow and not very capable of solving problems to models that are incredibly fast, cheap and able to solve problems in different domains.
Eh, no. More chips won't save this right now, or probably in the near future (IE barring someone sitting on a breakthrough right now).
It just means either
A. Lots and lots of hard work that get you a few percent at a time, but add up to a lot over time.
or
B. Completely different approaches that people actually think about for a while rather than trying to incrementally get something done in the next 1-2 months.
Most fields go through this stage. Sometimes more than once as they mature and loop back around :)
Right now, AI seems bad at doing either - at least, from the outside of most of these companies, and watching open source/etc.
While lots of little improvements seem to be released in lots of parts, it's rare to see anywhere that is collecting and aggregating them en masse and putting them in practice. It feels like for every 100 research papers, maybe 1 makes it into something in a way that anyone ends up using it by default.
This could be because they aren't really even a few percent (which would be yet a different problem, and in some ways worse), or it could be because nobody has cared to, or ...
I'm sure very large companies are doing a fairly reasonable job on this, because they historically do, but for everyone else - even frameworks - it's still "here's a million knobs and things that may or may not help".
It's like if compilers had no "O0/O1/O2/O3' at all and were just like "here's 16,283 compiler passes - you can put them in any order and amount you want". Thanks! I hate it!
It's worse even because it's like this at every layer of the stack, whereas in this compiler example, it's just one layer.
At the rate of claimed improvements by papers in all parts of the stack, either lots and lots and lots is being lost because this is happening, in which case, eventually that percent adds up to enough for someone to be able to use to kill you, or nothing is being lost, in which case, people appear to be wasting untold amounts of time and energy, then trying to bullshit everyone else, and the field as a whole appears to be doing nothing about it. That seems, in a lot of ways, even worse. FWIW - I already know which one the cynics of HN believe, you don't have to tell me :P. This is obviously also presented as black and white, but the in-betweens don't seem much better.
Additionally, everyone seems to rush half-baked things to try to get the next incremental improvement released and out the door because they think it will help them stay "sticky" or whatever. History does not suggest this is a good plan and even if it was a good plan in theory, it's pretty hard to lock people in with what exists right now. There isn't enough anyone cares about and rushing out half-baked crap is not helping that. mindshare doesn't really matter if no one cares about using your product.
Does anyone using these things truly feel locked into anyone's ecosystem at this point? Do they feel like they will be soon?
I haven't met anyone who feels that way, even in corps spending tons and tons of money with these providers.
The public companies - i can at least understand given the fickleness of public markets. That was supposed to be one of the serious benefit of staying private.
So watching private companies do the same thing - it's just sort of mind-boggling.
Hopefully they'll grow up soon, or someone who takes their time and does it right during one of the lulls will come and eat all of their lunches.
I think this is the correct take. There are other axes to scale on AND I expect we'll see smaller and smaller models approach this level of pre-trained performance. But I believe massive pre-training gains have clearly hit diminishing returns (until I see evidence otherwise).
The price is obviously 15-30x that of 4o, but I'd just posit that there are some use cases where it may make sense. It probably doesn't make sense for the "open-ended consumer facing chatbot" use case, but for other use cases that are fewer and higher value in nature, it could if its abilities are considerably better than 4o's.
For example, there are now a bunch of vendors that sell "respond to RFP" AI products. The number of RFPs that any sales organization responds to is probably no more than a couple a week, but it's a very time-consuming, laborious process. But the payoff is obviously very high if a response results in a closed sale. So here paying 30x for marginally better performance makes perfect sense.
I can think of a number of similar "high value, relatively low occurrence" use cases like this where the pricing may not be a big hindrance.
Esp. when they aren't even sure whether they will commit to offering this long term? Who would be insane enough to build a product on top of something that may not be there tomorrow?
Those products require some extensive work, such as model finetuning on proprietary data. Who is going to invest time & money into something like that when OpenAI says right out of the gate they may not support this model for very long?
Basically OpenAI is telegraphing that this is yet another prototype that escaped a lab, not something that is actually ready for use and deployment.
We’re one of those types of customers. We wrote an OpenAI API compatible gateway that automatically batches stuff for us, so we get 50% off for basically no extra dev work in our client applications.
I don’t care about speed, I care about getting the right answer. The cost is fine as long as the output generates us more profit.
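(For anyone wondering where the 50% comes from: OpenAI's Batch API bills asynchronous jobs at roughly half the synchronous price. Below is a minimal sketch of the kind of gateway shim described above - model name, prompts, and file paths are illustrative assumptions, not the commenter's actual setup.)

    # Collect requests into a JSONL file and submit them through the Batch API,
    # which is billed at ~50% of the synchronous price.
    import json
    from openai import OpenAI

    client = OpenAI()

    requests = [
        {
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": q}]},
        }
        for i, q in enumerate(["summarize document A", "summarize document B"])
    ]
    with open("batch.jsonl", "w") as f:
        f.write("\n".join(json.dumps(r) for r in requests))

    batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    # Poll later with client.batches.retrieve(batch.id) and download the output file.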
I think they announced this as their last non-reasoning model, so it was maybe with the goal of stretching pre-training as far as they could, just to see what new capabilities would show up. We'll find out as the community gives it a whirl.
I'm a Tier 5 org and I have it available already in the API.
The marginal costs for running a GPT-4-class LLM are much lower nowadays due to significant software and hardware innovations since then, so costs/pricing are harder to compare.
Agreed, however it might make sense that a much-larger-than-GPT-4 LLM would also, at launch, be more expensive to run than the OG GPT-4 was at launch.
(And I think this is probably also scarecrow pricing to discourage casual users from clogging the API since they seem to be too compute-constrained to deliver this at scale)
There are some numbers on one of their Blackwell or Hopper info pages that note the ability of their hardware to host an unnamed GPT model with 1.8T params. My assumption was that it referred to GPT-4.
Sounds to me like GPT-4.5 likely requires a full Blackwell HGX cabinet or something, hence OpenAI's reference to needing to scale out their compute more. (Supermicro only opened up their Blackwell racks for General Availability last month; they're the prime vendor for water-cooled Blackwell cabinets right now, and have the ability to throw up a GPU mega-cluster in a few weeks, like they did for xAI/Grok.)
Honestly, if long context (that doesn't start to degrade quickly) is what you're after, I would use Grok 3 (not sure when the API version releases though). Over the last week or so I've had a massive thread of conversation with it that started with plenty of my project's relevant code (as in a couple hundred lines), and several days later, after like 20 question-answer blocks, you ask it something and it answers "since you're doing that this way, and you said you want x, y and z, here are your options blabla"... It's like thinking Gemini, but better. Also, unlike Gemini (and others), it seems to have a much more recent data cutoff. Try asking about some language feature / library / framework that has been released recently (say 3 months ago) and most of the models shit the bed, use older versions of the thing, or just start to imitate what the code might look like. For example, try asking Gemini if it can generate Tailwind 4 code: it will tell you that its training cutoff is like October or something, that Tailwind 4 "isn't released yet", and that it can try to imitate what the code might look like. Uhhhhhh, thanks I guess??
Definitely not. They don't distill their original models. 4o is a much more distilled and cheaper version of 4. I assume 4.5o would be a distilled and cheaper version of 4.5.
It'd be weird to release a distilled version without ever releasing the base undistilled version.
If this huge model has taken months to pre-train and was expected to be released before, say, o3-mini, you could definitely have some last-minute optimizations in o3-mini that were not considered at the time of building the architecture of gpt-4.5.
This has been my suspicion for a long time - OpenAI have indeed been working on "GPT5", but training and running it is proving so expensive (and its actual reasoning abilities only marginally stronger than GPT4) that there's just no market for it.
It points to an overall plateau being reached in the performance of the transformer architecture.
But while there is a plateau in the transformer architecture, what you can do with those base models by further finetuning / modifying / enhancing them is still largely unexplored, so I still predict mind-blowing enhancements yearly for the foreseeable future. Whether they validate OpenAI's valuation and investment needs is a different question.
TBH, with the safety/alignment paradigm we have, workforce replacement was not my top concern when we hit AGI. A pause / lull in capabilities would be hugely helpful so that we can figure out how not to die along with the lightcone...
Is it inevitable to you that someone will create some kind of techno-god behemoth AI that will figure out how to optimally dominate an entire future light cone starting from the point in spacetime of its self-actualization? Borg or Cylons?
If it really costs them 30x more, surely they must plan on putting pretty significant usage limits on any rollout to the Plus tier, and if that is the case I'm not sure what the point is, considering it seems primarily a replacement/upgrade for 4o.
The cognitive overhead of choosing between what will be 6 different models now on chatGPT and trying to map whether a query is "worth" using a certain model and worrying about hitting usage limits is getting kind of out of control.
Really depends on your use case. For low value tasks this is way too expensive. But for context, let's say a court opinion is an average of 6000 words. Let's say I want to analyze 10 court opinions and pull out some information that's relevant to my case. That will run about $1.80 per document or $18 total. I wouldn't pay that just to edify myself, but I can think of many use cases where it's still a negligible cost, even if it only does 5% better than the 30x cheaper model.
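Rough arithmetic behind that per-document figure, assuming GPT-4.5's launch list prices ($75 per 1M input tokens, $150 per 1M output tokens); the token counts below are assumptions, not measurements:

    # Back-of-envelope cost per ~6,000-word court opinion at GPT-4.5 list prices.
    input_tokens = 6_000 * 1.5 + 500   # opinion text (~1.5 tokens/word, assumed) plus an instruction prompt
    output_tokens = 1_000              # assumed length of the extracted answer
    cost = input_tokens / 1e6 * 75 + output_tokens / 1e6 * 150
    print(round(cost, 2))              # ~$0.86 with these assumptions; the $1.80 figure
                                       # above implies roughly 20-25k billed tokens per document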
I was about to comment that humans consume orders of magnitude less energy, but then I checked the numbers, and it looks like an average person consumes way more energy throughout their day (food, transportation, electricity usage, etc) than GPT-4.5 would at 1 query per minute over 24 hours.
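For anyone who wants to redo that back-of-envelope: every number below is an assumption (there is no published per-query energy figure for GPT-4.5), but the shape of the comparison is roughly this:

    # Order-of-magnitude energy comparison; all inputs are assumptions.
    human_food_kwh_per_day = 2.3        # ~2,000 kcal of food
    human_total_kwh_per_day = 230       # ~US per-capita primary energy (transport, electricity, heating, ...)
    wh_per_query = 10                   # assumed energy per GPT-4.5 query (not a published figure)
    model_kwh_per_day = wh_per_query * 60 * 24 / 1000   # 1 query/minute for 24h -> 14.4 kWh
    print(model_kwh_per_day < human_total_kwh_per_day)  # True under these assumptions,
                                                        # though it dwarfs the food-only figure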
Such brutal reductionism: how do you calculate an ever-growing percentage of customers so pissed at this terrible service that you lose them forever? Not just one company losing customers... but an entire population completely distrusting and pulling back from any and all companies pulling this trash.
Huh? Most call centers these days already use IVR systems, and they absolutely are terrible experiences. I, along with most people, would happily speak with an LLM-backed agent to resolve issues.
CS is already a wreck; LLMs beat an IVR any day of the week and can offer real triage.
The only people getting upset are the luddites like yourself.
Humans have all sorts of issues you have to deal with. Being hungover, not sleeping well, having a personality, being late to work, not being able to work 24/7, very limited ability to copy them. If there's a soulless generic office-droidGPT that companies could hire that would never talk back and would do all sorts of menial work without needing breaks or to use the bathroom, I don't know that we humans stand a chance!
I have a bunch of work that needs doing. I can do it myself, or I can hire one person to do it. I've got to train them and manage them, and even after I train them there's still only going to be one of them, and it's subject to their availability. On the other hand, I might need to train an AI to do it, but then I can copy that AI, spin instances up and down like on-demand computers in the cloud, and not feel remotely bad about spinning them down.
It's definitely not there yet, but it's not hard to see the business case for it.
I write code for a living. My entire profession is on the line, thanks to ourselves. My eyes are wide open on the situation at hand though. Burying my head in the sand and pretending what I wrote above isn't true, isn't going to make it any less true.
I'm not sure what I can do about it, either. My job already doesn't look like it did a year ago, nevermind a decade away.
I keep telling coders to switch to being 1-person enterprise shops instead, but they don't listen. They will learn the hard way when they suddenly find themselves without a job because AI has taken it away. As for what kind of enterprise, use your imagination, without bias from coding.
> It sounds like it's so expensive and the difference in usefulness is so lacking(?)
The claimed hallucination rate is dropping from 61% to 37%. That's a "correct" rate increasing from 29% to 63%.
Double the correct rate costs 15x the price? That seems absurd, unless you think about how mistakes compound. Even just 2 steps in and you're comparing an 8.4% correct rate vs 40%. 3 automated steps and it's 2.4% vs 25%.
And remember, as accuracy increases, the cost of validation goes up (and not just linearly).
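To make the compounding above concrete (same numbers as the parent comment):

    # Per-step accuracy compounds multiplicatively over an automated pipeline.
    for correct_rate in (0.29, 0.63):
        print([round(correct_rate ** steps, 3) for steps in (1, 2, 3)])
    # [0.29, 0.084, 0.024]  <- cheaper model after 1, 2, 3 steps
    # [0.63, 0.397, 0.25]   <- GPT-4.5's claimed rate after 1, 2, 3 steps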
We expect computers to be right. It's a trust problem. Average users will simply trust the results of LLMs and move on without proper validation. And the way the LLMs are trained to mimic human interaction is not helping either. This will reduce overall quality in society.
It's a different thing to work with another human, because there is intention. A human either wants to be correct or wants to mislead me, and I account for that without even thinking about it.
And I don't expect expert models to improve things, unless the problem space is really simple (like checking eggs for anomalies).
Well, to play the devil's advocate, I think this is useful to have, at least for 'Open'AI to start from when applying QLoRA or similar approximations.
Bonus: they could even do some self-learning afterwards with the performance improvements DeepSeek just published, and it might have more EQ and fewer hallucinations than starting from scratch…
I.e. the price might go down big time, but there might be significant improvements down the line when starting from such a broad base.
Usefulness is bound to scope/purpose. Even if innovation stops, in 3 years (thanks to hardware and tuning progress), when 4o costs $0.10/M and 4.5 costs $1/M, even a small improvement (which this is not, IMO) means you will choose 4.5, exactly like no one now wants to use 3.5.
Someone in another comment said that GPT-4 32k had roughly the same cost (OK, 10% cheaper); what was a pain was more the latency and speed than the actual cost, given the increase in productivity for our usage.
> It sounds like it's so expensive and the difference in usefulness is so lacking(?) they're not even gonna keep serving it in the API for long
I guess the rationale behind this is paying for the marginal improvement. Maybe the next few percent of improvement is so important to a business that the business is willing to pay a hefty premium.
The price will come down over time as they apply all the techniques to distill it down to a smaller parameter model. Just like GPT4 pricing came down significantly over time.
AI as it stands in 2025 is an amazing technology, but it is not a product at all.
As a result, OpenAI simply does not have a business model, even if they are trying to convince the world that they do.
My bet is that they're currently burning through other people's capital at an amazing rate, but that they are light-years from profitability
They are also being chased by fierce competition and OpenSource which is very close behind. There simply is no moat.
It will not end well for investors who sunk money in these large AI startups (unless of course they manage to find a Softbank-style mark to sell the whole thing to), but everyone will benefit from the progress AI will have made during the bubble.
So, in the end, OpenAI will have, albeit very unwillingly, fulfilled their original charter of improving humanity's lot.
I've been a Plus user for a long time now. My opinion is there is very much a ChatGPT suite of products that come together to make for a mostly delightful experience.
Three things I use all the time:
- Canvas for proofing and editing my article drafts before publishing. This has replaced an actual human editor for me.
- Voice for all sorts of things, mostly for thinking out loud about problems or a quick question about pop culture, what something means in another language, etc. The Sol voice is so approachable for me.
- GPTs I can use for things like D&D adventure summaries I need in a certain style every time without any manual prompting.
> My bet is that they're currently burning through other people's capital at an amazing rate, but that they are light-years from profitability
The Information leaked their internal projections a few months ago, and apparently their own estimates have them losing $44B between then and 2029 when they expect to finally turn a profit, maybe.
> AI as it stands in 2025 is an amazing technology, but it is not a product at all.
Here I'm assuming "AI" to mean what's broadly called Generative AI (LLMs, photo, video generation)
I genuinely am struggling to see what the product is too.
The code assistant use cases are really impressive across the board (and I'm someone who was vocally against them less than a year ago), and I pay for GitHub Copilot (for now), but I can't think of any other offering that would dispute your claim.
It seems like companies are desperate to find a market fit, and shoving the words "agentic" everywhere doesn't inspire confidence.
Here's the thing:
I remember people lining up around the block for iPhone releases, XBox launches, hell even Grand Theft Auto midnight releases.
Is there a market of people clamoring to use/get anything GenAI related?
If any/all LLM services went down tonight, what's the impact? Kids do their own homework?
JavaScript programmers have to remember how to write React components?
Compare that with Google Maps disappearing, or similar.
LLMs are in a position where they're forced onto people and most frankly aren't that interested. Did anyone ASK for Microsoft throwing some Copilot things all over their operating system? Does anyone want Apple Intelligence, really?
I think search and chat are decent products as well. I am a Google subscriber and I just use Gemini as a replacement for search without ads. To me, this movement accelerated paid search in an unexpected way. I know the detractors will cry "hallucinations" and the like. I would counter with an argument about the state of the current web, besieged by ads and misinformation. If people carry a reasonable amount of skepticism in all things, this is a fine use case. Trust but verify.
I do worry about model poisoning with fake truths but don't feel we are there yet.
> I do worry about model poisoning with fake truths but don't feel we are there yet.
In my use, hallucinations will need to be a lot lower before we get there, because I already can't trust anything an LLM says so I don't think I could even distinguish a poisoned fake truth from a "regular" hallucination.
I just asked ChatGPT 4o to explain irreducible control flow graphs to me, something I've known in the past but couldn't remember. It gave me a couple of great definitions, with illustrative examples and counterexamples. I puzzled through one of the irreducible examples, and eventually realized it wasn't irreducible. I pointed out the error, and it gave a more complex example, also incorrect. It finally got it on the 3rd try. If I had been trying to learn something for the first time rather than remind myself of what I had once known, I would have been hopelessly lost. Skepticism about any response is still crucial.
Yes: the real truth is, if a good AI were really created, we wouldn't even know it existed until a billion-dollar company took over some industry with only a handful of developers in the entire company. Only then would hints spill out into the world that it's possible.
No "good" AI will ever be open to everyone and relatively cheap, this is the same phenomenon as "how to get rich" books
> As a result, OpenAI simply does not have a business model, even if they are trying to convince the world that they do.
They have a super popular subscription service. If they keep iterating on the product enough, they can lag on the models. The business is the product not the models and not the API. Subscriptions are pretty sticky when you start getting your data entrenched in it. I keep my ChatGPT subscription because it’s the best app on Mac and already started to “learn me” through the memory and tasks feature.
Their app experience is easily the best out of their competitors (grok, Claude, etc). Which is a clear sign they know that it’s the product to sell. Things like DeepResearch and related are the way they’ll make it a sustainable business - add value-on-top experiences which drive the differentiation over commodities. Gemini is the only competitor that compares because it’s everywhere in Google surfaces. OpenAI’s pro tier will surely continue to get better, I think more LLM-enabled features will continue to be a differentiator. The biggest challenge will be continuing distribution and new features requiring interfacing with third parties to be more “agentic”.
Frankly, I think they have enough strength in product with their current models today that even if model training stalled it’d be a valuable business.
hyperscalers in shambles, no clue why they even released this other than the fact they didn't want to admit they wasted an absurd amount of money for no reason
> It sounds like it's so expensive and the difference in usefulness is so lacking(?) they're not even gonna keep serving it in the API for long:
Sounds like an attempt at price discrimination. Sell the expensive version to big companies with big budgets who don't care, sell the cheap version to everyone else. Capture both ends of the market.
To me, it feels like a PR stunt in response to what the competition is doing. OpenAI is trying to show how they are ahead of others, but they price the new model to minimize its use. Potentially, Anthropic et al. also have amazing models that they aren't yet ready to productionize because of costs.
It's crazy expensive because they want to pull in as much revenue as possible as fast as possible before the Open Source models put them outta business.
Maybe they started a really long, expensive training session, and Elon Musk's DOGE script kiddies somehow broke in and sabotaged it, so it got disrupted and turned into the Eraserhead baby, but they still want to get it out there for a little while before it dies, to squeeze as much money out of it as possible, because it was so expensive to train.
Sure, but it's in their best interest to lower it then and only then.
OpenAI wouldn't be the first company to price something expensive when it first comes out to capitalize on people who are less price sensitive at first and then lower prices to capture a bigger audience.
If you ran the same query set 30x or 15x on the cheaper model (and compensated for all the extra tokens the reasoning model uses), would you be able to realize the same 26% quality gain in a machine-adjudicable kind of way?
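One machine-adjudicable version of that experiment is plain self-consistency: sample the cheap model N times per question, majority-vote, then score against the same answer key as the single-shot expensive model at matched dollar cost. A minimal sketch, with the model name, N, and exact-match scoring as assumptions:

    from collections import Counter
    from openai import OpenAI

    client = OpenAI()

    def majority_answer(question: str, n: int = 15) -> str:
        # Sample the cheaper model n times and return the most common answer.
        votes = []
        for _ in range(n):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",          # placeholder "cheaper model"
                temperature=1.0,
                messages=[{"role": "user", "content": question}],
            )
            votes.append(resp.choices[0].message.content.strip())
        return Counter(votes).most_common(1)[0][0]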
3.5x on a normal distribution with mean 100 and SD 15 is pretty insane. But I agree with your point, being 26% better at a certain benchmark could be a tiny difference, or an incredible improvement (imagine the hardest questions being Riemann hypothesis, P != NP, etc).
Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further. Many of us have known this since last year. GPT-5 delays eventually led even non-tech voices to suggest likewise. But we all held our final decision until the next big release from OpenAI, as Sam Altman has been making claims about AGI entering the workforce this year, OpenAI knowing how to build AGI, and similar outlandish claims. We all knew that their next big release in 2025 would be the final deciding factor on whether they had some tech breakthrough that would upend the world (justifying their astronomical valuation) or whether it would just be (slightly) more of the same (marking the beginning of their downfall).
The GPT-4.5 release points towards the latter. Thus, we should not expect OpenAI to exist as it does now (AI industry leader) in 2030, assuming it does exist at all by then.
However, just like the 19th century rail industry revolution, the fall of OpenAI will leave behind a very useful technology that while not catapulting humanity towards a singularity, will nonetheless make people's lives better. Not much consolation to the world's super rich who will lose tons of money once the LLM industry (let us remember that AI is not LLM) falls.
EDIT: "will nonetheless make people's lives better" to "might nonetheless make some people's lives better"
It's worth pointing out that GPT-4.5 seems focused on better pre-training and doesn't include reasoning.
I think GPT-5 - if/when it happens - will be 4.5 with reasoning, and as such it will feel very different.
The barrier is the computational cost. Once 4.5 gets down to costs similar to 4.0 - which could be achieved through various optimization steps (what happened to the ternary stuff that was published last year, which supposedly let you go many times faster without expensive GPUs?) and better/cheaper/more efficient hardware - you can throw reasoning into the mix and suddenly have a major step up in capability.
I am a user, not a researcher or builder. I do think we're in a hype bubble, I do think that LLMs are not The Answer, but I also think there is more mileage left in this path than you seem to. I think automated RL (not HF), reasoning, and better/optimal architectures and hardware mean there is a lot more we can get out of the stochastic parrots yet.
Is it fair to still call LLMs stochastic parrots now that they are enriched with reasoning? Seems to me that the simple procedure of large-scale sampling + filtering makes it immediately plausible to get something better than the training distribution out of the LLM. In that sense the parrot metaphor seems suddenly wrong.
I don’t feel like this binary shift is adequately accounted for among the LLM cynics.
It was never fair to call them stochastic parrots, and anybody who is paying any attention knows that sequence models can generalize at least partially OOD.
I'm not sure this will ever be solved. It requires both a technical solution and social consensus. I don't see consensus on "alignment" happening any time soon. I think it'll boil down to "aligned with the goals of the nation-state", and lots of nation states have incompatible goals.
I agree unfortunately. I might be a bit of an extremist on this issue. I genuinely think that building agentic ASI is suicidally stupid and we just shouldn’t do it. All the utopian visions we hear from the optimists describe unstable outcomes. A world populated by super-intelligent agents will be incredibly dangerous even if it appears initially to have gone well. We’ll have built a paradise in which we can never relax.
What's the difference between your "agentic AIs" and, say, "script kiddies" or "expert anarchist/black-hat hackers"?
It's been obvious for a while that the narrow-waist APIs between things matter, and apparent that agentic AI is leaning into adaptive API consumption. But I don't see how that gives the agentic client some super-power we don't already need to defend against: since before AGI we have had HGI (human general intelligence) motivated to "do bad things" to/through those APIs, both self-interested and nation-state-sponsored.
We're seeing more corporate investment in this interplay, trending us towards Snow Crash, but "all you have to do" is have some "I" in API be "dual key, human in the loop" to ensure that even when AGI/HGI "presses the red button" in the Oval Office, nuclear war still doesn't happen, WarGames or Crimson Tide style.
I'm not saying dual key is the answer to everything, I'm saying, defenses against adversaries already matter, and will continue to. We have developed concepts like air gaps or modality changes, and need more, but thinking in terms of interfaces (APIs) in the general rather than the literal gives a rich territory for guardrails and safeguards.
I think what Accelerationism gets right is that capitalism is just doing it - autonomizing itself - and that our agency is very limited, especially given the arms race dynamics and the rise of decentralized blockchain infrastructure.
As Nick Land puts it, in his characteristically detached style, in A Quick-and-Dirty Introduction to Accelerationism:
"As blockchains, drone logistics, nanotechnology, quantum computing, computational genomics, and virtual reality flood in, drenched in ever-higher densities of artificial intelligence, accelerationism won't be going anywhere, unless ever deeper into itself. To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon. Naturally — which is to say completely inevitably — the human species will define this ultimate terrestrial event as a problem. To see it is already to say: We have to do something. To which accelerationism can only respond: You're finally saying that now? Perhaps we ought to get started? In its colder variants, which are those that win out, it tends to laugh." [0]
I'm not convinced that LLMs in their current state are really making anyone's lives much better, though. We really need more research applications for this technology for that to become apparent. Polluting the internet with regurgitated garbage produced by a chatbot does not benefit the world. Increasing the productivity of software developers does not, by itself, help the world much. Solving more important problems should be the priority for this type of AI research & development.
The explosion of garbage content is a big issue and has radically changed the way I use the web over the past year: Google and DuckDuckGo are not my primary tools anymore. Instead I now use specialized search engines more and more. For example, if I am looking for something I believe can be found on someone's personal blog I use Marginalia or Mojeek, if I am searching for software issues I use GitHub's search, general info goes straight to Wikipedia, tech reviews to HN's Algolia, etc.
It might sound a bit cumbersome, but it's actually super easy if you assign search keywords in your browser: for instance, if I am looking for something on GitHub I just open a new tab in Firefox and type "gh tokio".
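For anyone copying the setup: Firefox keyword bookmarks take a URL with a %s placeholder, so typing the keyword plus a query in the address bar runs the search. Example keyword/URL pairs (illustrative, not necessarily the commenter's):

    gh    https://github.com/search?q=%s
    wiki  https://en.wikipedia.org/wiki/Special:Search?search=%s
    hn    https://hn.algolia.com/?q=%s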
LLM's have been extremely useful for me. They are incredibly powerful programmers, from the perspective of people who aren't programmers.
Just this past week Claude 3.7 wrote a program for us to use to quickly modernize ancient (1990s) proprietary manufacturing machine files to contemporary automation files.
This allowed us to forgo a $1k/yr/user proprietary software package that would be able to do the same. The program Claude wrote took about 30 mins to make. Granted the program is extremely narrow in scope, but it does the one thing we need it to do.
This marks the third time I (a non-programmer) have used an LLM to create software that my company uses daily. The other two are a test system made by GPT-4 and an Android app made by a mix of 4o and Claude 3.5.
Bumpers may be useless and laughable to pro bowlers, but a godsend to those who don't really know what they are doing. We don't need to hire a bowler to knock over pins anymore.
I've also been toying with Claude Code recently and I (as an eng, ~10 yrs) think they are useful for pair programming the dumb work.
E.g., as I've been trying Claude Code I still feel the need to babysit it on my primary work, so I'd rather do that myself. However, if it could sit there while I work and monitor it, note fixes, tests, and documentation, and then stub them in during breaks, I think there's a lot of time savings to be gained.
I.e., keep it doing the simple tasks it can get right 99% of the time, and keep it out of the way.
I also suspect there's context to be gained in watching the human work. Not learning per se, but understanding the areas being worked on, improving intuition on things the human needs or cares about, etc.
A `cargo clippy --fix` on steroids is "simple" but still really sexy imo.
Graphic designers I think are safe, at least within organizations that require a cohesive brand strategy. Getting the AI to respect all of the previous art will be a challenge at a certain scale.
Getting graphic designers to use the design system that they invented is quite a challenge too if I'm honest... should we really expect AI to be better than people? Having said that AI is never going to be adept at knowing how and when to ignore the human in the loop and do the "right" thing.
While I mostly agree with your assessment, I am still not convinced of this part. Right now, it may be making our lives marginally better. But once the enshittification starts to set in, I think it has the potential to make things a lot worse.
E.g. I think the advertisement industry will just love the idea of product placements and whatnots into the AI assistant conversations.
*good*. The answer to this is legislation - legally, stop allowing shitty ads everywhere all the time. I hope these problems we already have are exacerbated by the ease of generating content with LLMs and people actually have to think for themselves again.
> Not much consolation to the world's super rich who will lose tons of money once the LLM industry (let us remember that AI is not LLM) falls.
They knew the deal:
“it would be wise to view any investment in OpenAI Global, LLC in the spirit of a donation” and “it may be difficult to know what role money will play in a post-[artificial general intelligence] world.”
It took 25797 input tokens and 1225 output tokens, for a total cost (calculated using https://tools.simonwillison.net/llm-prices ) of $2.11! It took 154 seconds to generate.
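That figure reproduces directly from GPT-4.5's list prices ($75 per 1M input tokens, $150 per 1M output tokens):

    cost = 25_797 / 1e6 * 75 + 1_225 / 1e6 * 150
    print(round(cost, 2))   # 2.12 - matches the reported $2.11 within rounding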
"For example, there are now a bunch of vendors that sell 'respond to RFP' AI products... paying 30x for marginally better performance makes perfect sense." — hn_throwaway_99 (an uncommon opinion supporting possible niche high-cost uses).
? You think hn_throwaway_99's comment is sarcastic? It makes perfect sense to me when read "straight."
That is, sales orgs save a bunch of money using AI to respond to RFPs; they would still save a bunch of money using a more expensive AI, and any marginal improvement in sales closed would pay for it.
It’d be great if someone would do that with the same data and prompt to other models.
I did like the formatting and attributions but didn’t necessarily want attributions like that for every section. I’m also not sure if it’s fully matching what I’m seeing in the thread but maybe the data I’m seeing is just newer.
Thanks for sharing. To me, purely on personal preference, the Gemini models did best on this task, which also fits with my personal experience using Google's models to summarize extensive, highly specialized text. Gemini's 2.0 models do especially well on needle-in-a-haystack-type tests in my experience.
Seeing the other models, I actually come away impressed with how well GPT-4.5 is organizing the information and how well it reads. I find it a lot easier to quickly parse. It's more human-like.
I noticed 4o mini didn't follow the directions to quote users. My favourite part of the 4.5 summary was how it quoted Antirez. 4o mini brought out the same quote, but failed to attribute it as instructed.
It's fascinating, but while this does mean it strays from the given example, I actually feel the result is a better summary. The 4.5 version is so long you might as well just read the whole thread yourself.
Interesting, thanks for doing this. I'd say that (at a glance) for now it's still worth to use more passes with smaller models than one pass with 4.5
Now, if you'd want to generate training data, I could see wanting to have the best answers possible, where even slight nuances would matter. 4.5 seems to adhere to instructions much better than the others. You might get the same result w/ generating n samples and "reflect" on them with a mixture of models, but then again you might not. Going through thousands of generations manually is also costly.
Compared to GPT-4.5 I prefer the GPT-4o version because it is less wordy. It summarizes and gives the gist of the conversation rather than reproducing it along with commentary.
Didn't seem to realize that "Still more coherent than the OpenAI lineup" wouldn't make sense out of context. (The actual comment quoted there is responding to someone who says they'd name their models Foo, Bar, Baz.)
I disagree with most of the knee-jerk negativity in LLM threads, but in this case it mostly seems warranted. There are no "boundaries being pushed" here, this is just a desperate release from a company that finds itself losing more and more mindshare to other models and companies.
Hey, check this one out. With all the different flavors that exist out there, I think I made something better. https://cofyt.app
Feel free to test it head-to-head; as far as I am aware, this is better than Getrecall, and you can chat with a transcript for detailed answers to your prompts.
But as I mentioned, my main concern is what will happen in 6 months when you fail to get traction and abandon it. Because that's what happened to the previous 5 products I tried, which were all "good enough".
Getrecall seems to have a big enough user base that will actually stick around.
Hundreds that specifically focus on noticing a page you’re currently viewing has been not only posted to but undergone significant discussion on HN, and then providing a summary of those conversations?
What I want is something that can read the thread out loud to me, using a different voice per user, so I can listen to a busy discussion thread like I would listen to a podcast.
I don't know why but something about this section made me chuckle
"""
These perspectives highlight that there remains nuance—even appreciation—of explorative model advancement not solely focused on immediate commercial viability
"""
The headline and section "Dystopian and Social Concerns about AI Features" are interesting. It's roughly true... but somehow that broad statement seems to minimize the point discussed.
I'd headline that thread as "Concerns about output tone". There were comments about dystopian implications of tone, marketing implications of tone and implementation issues of tone.
Of course, that I can comment about the fine-points of an AI summary shows it's made progress. But there's a lot riding on how much progress these things can make and what sort. So it's still worth looking at.
Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real "was that all" moment, which on one hand shows how spoiled I've gotten by new models impressing me, but on another feels like OpenAI really struggles to stay ahead of their competitors.
What has been shown feels like it could be achieved using a custom system prompt on older versions of OpenAI's models, and I struggle to see anything here that truly required ground-up training on such a massive scale. Hearing that they were forced to spread their training across multiple data centers simultaneously, coupled with their recent release of SWE-Lancer [0], which showed Anthropic (Claude 3.5 Sonnet (new), to be exact) handily beating them, I was really expecting something more than "slightly more casual/shorter output", which again, I fail to see how that wasn't possible by prompting GPT-4o.
How could they justify that asking price? And, if they have some amazing capabilities that make a 30-fold pricing increase justifiable, why not show it? Like, OpenAI are many things, but I always felt they understood price vs performance incredibly well, from the start with gpt-3.5-turbo up to now with o3-mini, so this really baffles me. If GPT-4.5 can justify such immense cost in certain tasks, why hide that and if not, why release this at all?
Rethinking your comment "was that all": I am listening to the stream now and had a thought. Most of the new models that have come out in the past few weeks have been great at coding and logical reasoning. But 4o has been better at creative writing. I am wondering if 4.5 is going to be even better at creative writing than 4o.
I still find all of them lacking at creative writing. The models are severely crippled by tokenization and a complete lack of understanding of language rhythm.
They can't generate a simple haiku consistently; anything larger is even more out of reach.
For example, give it a piece of poetry and ask for new verses and it just sucks at replicating the language structure and rhythm of original verses.
In what way? I find the opposite, 4o's output has a very strong AI vibe, much moreso than competitors like Claude and Gemini. You can immediately tell, and instructing it to write differently (except for obvious caricatures like "Write like Gen Z") doesn't seem to help.
if you generate "creative" writing, please tell your audience that it is generated, before asking them to read it.
I do not understand what possible motivation there could be for generating "creative writing" unless you enjoy reading meaningless stories yourself, in which case, be my guest.
I suspect they may launch a GPT-4.5 Turbo with a price cut... GPT-4 / GPT-4 32k etc. were all pricier than the GPT-4 Turbo models, which also came with the added context length... but with this huge jump in price, even a 4.5 Turbo, if it does come out, would be pricier.
In case there was any confusion, the referenced link shows 4.5 claiming to be “ChatGPT 4.0 Turbo”. I have tried multiple times and various approaches. This model is aware of 4.5 via search, but insists that it is 4 or 4 turbo. Something doesn’t add up. This cannot be part of the response to R1, Grok 3, and Claude 3.7. Satya’s decision to limit capex seems prescient.
They're still selling $1 for <$1. Like personal food delivery before it, consumers will eventually need to wake up to this fact - these things will get expensive, fast.
One difference with food delivery/ride share: those can only have costs reduced so far. You can only pick up groceries and drive from A to B so quickly. And you can only push the wages down so far before you lose your gig workers. Whereas with these models we’ve consistently seen that a model inference that cost $1 several months ago can now be done with much less than $1 today. We don’t have any principled understanding of “we will never be able to make these models more efficient than X”, for any value of X that is in sight. Could the anticipated efficiencies fail to materialize? It’s possible but I personally wouldn’t put money on it.
I generally question how widespread the willingness to pay for the most expensive product is. And will most of the users who actually want AI go with ad-ridden lesser models...
My first thought seeing this and looking at benchmarks was that if it wasn’t for reasoning, then either pundits would be saying we’ve hit a plateau, or at the very least OpenAI is clearly in 2nd place to Anthropic in model performance.
Of course we don’t live in such a world, but I thought of this nonetheless because for all the connotations that come with a 4.5 moniker this is kind of underwhelming.
The niche of GPT-4.5 is lower hallucinations than any existing model. Whether that niche justifies the price tag for a subset of use cases remains to be seen.
Actually, this comment of mine was incorrect, or at least we don't have enough information to conclude this. The metric OpenAI are reporting is the total number of incorrect responses on SimpleQA (and they're being beaten by Claude Haiku on this metric...), which is a deceptive metric because it doesn't account for non-responses. A better metric would be the ratio of Incorrects to the total number of attempts.
> but on another feels like OpenAI really struggles to stay ahead of their competitors
on one hand. On the other hand, you can have 4o-mini and o3-mini back when you can pry them out of my cold dead hands. They're _fast_, they're _cheap_, and in 90% of cases where you're automating anything, they're all you need. Also they can handle significant volume.
I'm not sure that's going to save OpenAI, but their -mini models really are something special for the price/performance/accuracy.
1. It is very, very slow. For applications where you want real-time interaction it is just not viable: the text attached below took 7s to generate with 4o, but 46s with GPT-4.5.
2. The style it writes in is way better: it keeps the tone you ask for and makes better improvements to the flow. One of my biggest complaints with 4o is that you want your content to be more casual and accessible, but GPT / DeepSeek wants to write like Shakespeare did.
Some comparisons on a book draft: GPT-4o (left) and GPT-4.5 (green). I also adjusted the spacing around the paragraphs to make the diffs line up better. I still am wary of using ChatGPT to help me write, even with GPT-4.5, but the improvement is very noticeable.
Similar reaction here. I will also note that it seems to know a lot more about me than previous models. I'm not sure if this is a broader web crawl, more space in the model, more summarization of our chats, or a combination, but I asked it to psychoanalyze a problem I'm having in the style of Jacques Lacan and it was genuinely helpful and interesting, no interview required first; it just went right at me.
To borrow an Iain Banks word, the "fragre" def feels improved to me. I think I will prefer it to o1 pro, although I haven't really hammered on it yet.
In my experience, Gemini Flash has been the best at writing, and GPT 3.5 onwards has been terrible.
GPT-3 and GPT-2 were actually remarkably good at it, arguably better than a skilled human. I had a bit of fun ghostwriting with these and got a little fan base for a while.
It seems that GPT-4.5 is better than 4 but it's nowhere near the quality of GPT-3 davinci. Davinci-002 has been nerfed quite a bit, but in the end it's $2/MTok for higher quality output.
It's clear this is something users want, but OpenAI and Anthropic seem to be going in the opposite direction.
>1. It is very very slow, ... below took 7s to generate with 4o, but 46s with GPT4.5
This is positively luxurious by o1-pro standards which I'd say average 5 minutes. That said I totally agree even ~45s isn't viable for real-time interactions. I'm sure it'll be optimized.
Of course, my comparing it to the highest-end CoT model in [publicly-known] existence isn't entirely fair since they're sort of apples and oranges.
I paid for pro to try `o1-pro` and I can't seem to find any use case to justify the insane inference time. `o3-mini-high` seems to do just as well in seconds vs. minutes.
I'm wondering if generative AI will ultimately result in a very dense / bullet form style of writing. What we are doing now is effectively this:
bullet_points' = compress(expand(bullet_points))
We are impressed by lots of text, so we must expand via LLM in order to impress the reader. Since the reader doesn't have the time or interest to read the content, they must compress it back into bullet points / a quick summary. Really, the original bullet points plus a bit more thinking would likely be a better form of communication.
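The round trip described above, written out literally; the model name and prompts are illustrative assumptions, not anyone's actual workflow:

    from openai import OpenAI

    client = OpenAI()

    def llm(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # placeholder model
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    bullet_points = "- Q3 revenue up 4%\n- churn flat\n- hiring freeze continues"
    prose = llm("Expand these notes into a polished two-paragraph update:\n" + bullet_points)
    bullet_points_prime = llm("Summarize this update as three short bullet points:\n" + prose)
    # bullet_points_prime ~= compress(expand(bullet_points)): the reader ends up roughly
    # where the writer started, minus whatever was lost in the round trip.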
It just feels natural to me. The person knows the language, but they are not trying to sound smart by using words that might have more impact "based on the word's dictionary definition".
GPT-4.5 does feel like a step forward in producing natural language, and if they use it for reinforcement learning, this might have a significant impact on future smaller models.
What’s the deal with Imgur taking ages to load? Anyone else have this issue in Australia? I just get the grey background with no content loaded for 10+ seconds every time I visit that bloated website.
>One of my biggest complaints with 4o is that you want for your content to be more casual and accessible but GPT / DeepSeek wants to write like Shakespeare did.
Well, maybe like a Sophomore's bumbling attempt to write like Shakespeare.
How do the two versions match so closely? They have the same content in each paragraph, just worded slightly differently. I wouldn't expect them to write paragraphs that match in size and position like that.
I use 4o mostly in German, so YMMV. However, I find a simple prompt controls the tone very well. "This should be informal and friendly", or "this should be formal and business-like".
Possibly; repeating the prompt, I got a much higher speed, taking 20s on average now, which is much more viable. But that remains to be seen when more people start using this version in production.
Imgur might be the worst image hosting site I’ve ever experienced. Any interaction with that page results in switching images and big ads and they hijack the back button. Absolutely terrible. How far they’ve fallen from when it first began.
o3 is okay for text checking but has issues following the prompt correctly, same as o1 and DeepSeek R1; I feel that I need to prompt smaller snippets with them.
Here is the o3 vs a new run of the same text in GPT-4.5.
It is interesting that they are focusing a large part of this release on the model having a higher "EQ" (Emotional Quotient).
We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foot on the territory of "here's your new AI friend".
This is very visible in the example comparing 4o with 4.5 when the user is complaining about failing a test, where 4o's response is what one would expect from a "typical AI response" with problem-solving bullets, and 4.5 is sending what you'd expect from a pal over instant messaging.
It seems Anthropic and Grok have both been moving in this direction as well. Are we going to see an escalation of foundation models impersonating "a friendly person" rather than "a helpful assistant"?
Personally I find this worrying and (as someone who builds upon SOTA model APIs) I really hope this behavior is not going to seep into API responses, or will at least be steerable through the system/developer prompt.
The whole robotic, monotone, helpful assistant thing was something these companies had to actively hammer in during the post-training stage. It's not really how LLMs will sound by default after pre-training.
I guess they're caring less and less about that effort especially since it hurts the model in some ways like creative writing.
Maybe, but I'm not sure how much the style is deliberate vs. a consequence of the post-training tasks like summarization and problem solving. Without seeing the post-training tasks and rating systems it's hard to judge if it's a deliberate style or an emergent consequence of other things.
But it's definitely the case that base models sound more human than instruction-tuned variants. And the shift isn't just vocabulary, it's also in grammar and rhetorical style. There's a shift toward longer words, but also participial phrases, phrasal coordination (with "and" and "or"), and nominalizations (turning adjectives/adverbs into nouns, like "development" or "naturalness"). https://arxiv.org/abs/2410.16107
If it's just a different choice during RLHF, I'll be curious to see what are the trade-offs in performance.
The "buddy in a chat group" style answers do not make me feel like asking it for a story will make the story long/detailed/poignant enough to warrant the difference.
Or maybe they're just getting better at it, or developing better taste. After switching to Claude, I can't go back to ChatGPT's overly verbose bullet-point laden book reports every time I ask a question. I don't think that's pretraining—it's in the way OpenAI approaches tuning and prompting vs Anthropic.
Anthropic pretty much abandoned this direction after Claude 3, and said it wasn't what they wanted [1]. Claude 3.5+ is extremely dry and neutral, it doesn't seem to have the same training.
>Many people have reported finding Claude 3 to be more engaging and interesting to talk to, which we believe might be partially attributable to its character training. This wasn’t the core goal of character training, however. Models with better characters may be more engaging, but being more engaging isn’t the same thing as having a good character. In fact, an excessive desire to be engaging seems like an undesirable character trait for a model to have.
It's the opposite incentive to ad-funded social media. One wants to drain your wallet and keep you hooked, the other wants you to spend as little of their funding as possible finding what you're looking for.
> We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foot on the territory of "here's your new AI friend".
That’s a hard nope from me, when companies pull that move. I’ll stick to my flesh and blood humans who still hallucinate but only rarely.
Yes, the "personality" (vibe) of the model is a key qualitative attribute of gpt-4.5.
I suspect this has something to do with shining a light on an increased value prop in a dimension many people will appreciate, since gains in quantitative comparisons with other models were not notable enough to pop eyeballs.
That was impressive. If it all came from just this short 4-line prompt, it's even more impressive.
All we're missing now is a text-to-video (or text+audio and then audio-to-video) that can convincingly follow the style instructions for emphasis and pausing. Or are we already there yet?
Yesterday, I had Claude 3.7 write a full 80,000-word novel. My prompt was a bit longer, but the result was shockingly good. The new thinking mode is very impressive.
Okay, you know what? I laughed a few times. Yeah, it may not work as an actual stand-up routine for a general audience, it's kinda cringe (as is most LLM-generated content), but it was legitimately entertaining to read.
Now you just need a Pro subscription to get Sora generate a video to go along with this and post it to YouTube and rake in the views (and the money that goes along with it).
My benchmark for this has been asking the model to write some tweets in the style of dril, a popular user who writes short funny tweets. Sometimes I include a few example tweets in the prompt too. Here's an example of results I got from Claude 3 Opus and GPT 4 for this last year: https://bsky.app/profile/macil.tech/post/3kpcvicmirs2v. My opinion is that Claude's results were mostly bangers while GPT's were all a bit groanworthy. I need to try this again with the latest models sometime.
If you like absurdist humor, go into the OpenAI playground, select 3.5-Turbo, and dial up the temperature to the point where the output devolves into garbled text after 500 tokens or so. The first ~200 tokens are in the freaking sweet spot of humor.
Maybe it's rose-colored glasses, but 3.5 was really the golden era for LLM comedy. More modern LLMs can't touch it.
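Roughly the same experiment via the API instead of the Playground (the temperature parameter caps at 2.0; behavior near the ceiling varies, so treat this as a sketch):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=1.9,        # near the ceiling; coherent for a short stretch, then word salad
        max_tokens=500,
        messages=[{"role": "user", "content": "Tell me about your weekend."}],
    )
    print(resp.choices[0].message.content)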
Just ask it to write you a film screenplay involving some hard-ass 80s/90s action star and someone totally unrelated and opposite of that. The ensuing unhinged magic is unparalleled.
I don't know if I fully agree. The input clearly shows the need for emotional support more than "how do I pass this test?" The answer by 4o is comical even if you know you're talking to a machine.
It reminds me of the advice to "not offer solutions when a woman talks about her problems, but just listen."
How could a machine provide emotional support? When I ask questions like this to LLMs, it's always to brainstorm solutions. I get annoyed when I receive fake-attention follow-up questions instead.
I guess there's a trade-off between being human and being useful. But this isn't unique to LLMs, it's similar to how one wouldn't expect a deep personal connection with a customer service professional.
Well yeah, if the LLM can keep you engaged and talking, that'll make them a lot more money, compared to if you just use it as an information retrieval tool, in which case you are likely to leave after getting what you are looking for.
Since they offer a subscription, keeping you engaged just requires them to waste more compute. The ideal case would be that the LLM gives you a one shot correct response using as little compute as possible.
In a subscription business, you don't want the user to use as few resources as possible. It's the wrong optimization to make.
You want users to keep coming back as often as possible (at the lowest cost-per-run possible though). If they are not coming back they are not renewing.
So, yes, it makes sense to make answers shorter to cut on compute cost (which these SMS-length replies could accomplish) but the main point of making the AI flirtatious or "concerned" is possibly the addictive factor of having a shoulder to cry on 24/7, one that does not call you on your BS and is always supportive... for just $20 a month
The "one-shot correct response" to "I failed my exams" might be "Tough luck, try better next time" but if you do that, you will indeed use very little compute because people will cancel the subscription and never come back.
AI subscriptions are already very sticky. I can't imagine at least not paying for one, so I doubt they care about retention like the rest of us plebs do.
First imagine paying a subscription fee which actually makes the company profitable and gives investors ROI, then I think you can also imagine not paying that amount at all.
Plus level subscription has limits too, and Pro level costs 10x more - as long as Pro users don't use ChatGPT 10x more than Plus users on average, OpenAI can benefit. There's also the user retention factor.
I think it's a good thing because, idk why, I just start tuning out after getting reams and reams of bullet points I'm already not super confident about the truthfulness of
The results for GPT-4.5 are in for the Kagi LLM benchmark too.
It does crush our benchmark - time to make a new one? ;) - with performance similar to that of reasoning models. It does come at a steep price, though, both in cost and speed.
A monster is what they created. But looking at the tasks it fails, some of them my 9-year-old would solve. Still in this weird limbo space of super knowledge and low intelligence.
May be remembered as the last of the "big ones"; can't imagine this will be a path for the future.
Seeing OpenAI and Anthropic go different routes here is interesting. It is worth moving past the initial knee-jerk reaction of this model being unimpressive, and some of the comments about "they spent a massive amount of money and had to ship something for it..."
* Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases.
* OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
Based on all of the comments from OpenAI, GPT-4.5 is absolutely massive, and with that size comes the ability to store far more factual data. The scores in ability-oriented things - like coding - don't show the kind of gains you get from reasoning models, but the fact-based test, SimpleQA, shows a pretty large jump and a dramatic reduction in hallucinations. You can imagine a scenario where GPT-4.5 is coordinating multiple, smaller, reasoning agents and using its factual accuracy to enhance their reasoning, kind of like how ruminating on an idea "feels" like a different process than having a chat with someone.
I'm really curious if they're actually combining two things right now that could be split as well, EQ/communications, and factual knowledge storage. This could all be a bust, but it is an interesting difference in approaches none-the-less, and worth considering that OpenAI could be right.
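Purely as a sketch of that ensemble idea (nothing here reflects OpenAI's actual internals; `call_model` and the model names are placeholders I made up), the coordination pattern could look roughly like this:

```python
# Hypothetical coordinator pattern: a large, knowledge-heavy model plans and
# fact-checks, while cheaper reasoning models grind through the sub-problems.
# call_model() and the model names are stand-ins, not real endpoints.

def call_model(model: str, prompt: str) -> str:
    """Placeholder for whatever chat-completions client you actually use."""
    raise NotImplementedError

def answer(question: str) -> str:
    # 1. The big model decides how to split the work.
    plan = call_model("big-knowledge-model",
                      f"Break this into independent sub-questions:\n{question}")

    # 2. Each sub-question goes to a small reasoning model.
    partials = [call_model("small-reasoning-model", sub)
                for sub in plan.splitlines() if sub.strip()]

    # 3. The big model composes the final reply, leaning on its factual recall.
    joined = "\n".join(partials)
    return call_model("big-knowledge-model",
                      f"Combine these partial answers, correcting factual slips:\n{joined}")
```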
> * OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
Seems inaccurate as their most recent claim I've seen is that they expect this to be their last non-reasoning model, and are aiming to provide all capacities together in the future model releases (unifying the GPT-x and o-x lines)
See this claim on TFA:
> We believe reasoning will be a core capability of future models, and that the two approaches to scaling—pre-training and reasoning—will complement each other.
> After that, a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks.
> In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.
You could read this as unifying the models or building a unified systems which coordinate multiple models. The second sentence, to me, implies that o3 will still exist, it just won't be standalone, which matches the idea I shared above.
Ah, great point. Yes, the wording here would imply that they're basically planning on building scaffolding around multiple models instead of having one more capable Swiss Army Knife model.
I would feel a bit bummed if GPT-5 turned out not to be a model, but rather a "product".
> know when to think for a long time or not, and generally be useful for a very wide range of tasks.
I'm going to call it now - no customer is actually going to use this. It'll be a cute little bonus for their chatbot god-oracle, but virtually all of their b2b clients are going to demand "minimum latency at all times" or "maximum accuracy at all times."
> Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases.
I don't think that is their primary motivation. The announcement post for Claude 3.7 was all about code, which doesn't seem to imply "all use cases". Code this, new code tool that, telling customers they look forward to what they build, etc. Very little mention of other use cases in the new model announcement at all. The usage stats they published are telling - 80% or more of queries to Claude are about code. In other words, while they are thinking of other use cases, they see code specifically as the major thing to optimize for.
OpenAI, given its different customer base and reach, is probably aiming for something more general.
IMO they all think that you need an "ensemble" of models with different capabilities to optimise for different use cases. It's more about how much compute each company has and what they target with those resources. Anthropic, I'm assuming, has less compute and a narrower customer base, so it may make economic sense to optimise just for that.
That's possible; my counterpoint would be that if that were the case, Anthropic would have built a smaller reasoning model instead of doing a "full" Claude. Instead, they built something which seems to be flexible across different types of responses.
> OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
The high level block diagrams for tech always end up converging to those found in biological systems.
Yeah, I don't know enough real neuroscience to argue either side. What I can say is I feel like this path is more like the way that I observe that I think, it feels like there are different modes of thinking and processes in the brain, and it seems like transformers are able to emulate at least two different versions of that.
Once we figure out the frontal cortex & corpus callosum part of this, where we aren't calling other models over APIs but instead have them all working in the same shared space, I have a feeling we'll be on to something pretty exciting.
It can never be just reasoning, right? Reasoning is the multiplier on some base model, and surely no amount of reasoning on top of something like gpt-2 will get you o1.
This model is too expensive right now, but as compute gets cheaper — and we have to keep in mind that it will — having a better base to multiply with will enable things that just more thinking won't.
You can try for yourself with the distilled R1's that Deepseek released. The qwen-7b based model is quite impressive for its size and it can do a lot with additional context provided. I imagine for some domains you can provide enough context and let the inference time eventually solve it, for others you can't.
Ever since those kids demo'd their fact checking engine here, which was just Input -> LLM -> Fact Database -> LLM -> LLM -> Output I have been betting that it will be advantageous to move in this general direction.
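For what it's worth, that Input -> LLM -> Fact Database -> LLM -> LLM shape is easy to prototype; a rough sketch of the idea (`llm()` and `fact_db_lookup()` are placeholders, not that demo's actual code):

```python
# Rough sketch of the fact-checking pipeline shape described above.
# llm() and fact_db_lookup() are placeholders, not a real implementation.

def llm(prompt: str) -> str:
    raise NotImplementedError

def fact_db_lookup(claim: str) -> str:
    raise NotImplementedError

def fact_checked_answer(user_input: str) -> str:
    claims = llm(f"List the factual claims needed to answer:\n{user_input}")    # LLM #1
    evidence = "\n".join(fact_db_lookup(c) for c in claims.splitlines() if c)   # fact database
    verdicts = llm(f"For each claim, say supported/unsupported:\n{claims}\n"
                   f"Evidence:\n{evidence}")                                     # LLM #2
    return llm(f"Answer the question using only supported claims.\n"
               f"Question: {user_input}\nVerdicts:\n{verdicts}")                 # LLM #3
```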
Maybe, I’m inclined to think OpenAI believes the way I laid it out though, specifically because of their focus on communication and EQ in 4.5. It seems like they believe the large, non-reasoning model, will be “front of house.”
Or they’ll use some kind of trained router which sends the request to the one it thinks it should go to first.
A bit better at coding than ChatGPT 4o but not better than o3-mini - there is a chart near the bottom of the page that is easy to overlook:
- ChatGPT 4.5 on SWE-bench Verified: 38.0%
- ChatGPT 4o on SWE-bench Verified: 30.7%
- OpenAI o3-mini on SWE-bench Verified: 61.0%
BTW Anthropic Claude 3.7 is better than o3-mini at coding at around 62-70% [1]. This means that I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code: https://github.com/drivecore/mycoder
Does the benchmark reflect your opinion on 3.7? I've been using 3.7 via Cursor and it's noticeably worse than 3.5. I've heard using the standalone model works fine, didn't get a chance to try it yet though.
I don't see Claude 3.7 on the official leaderboard. The top performer on the leaderboard right now is o1 with a scaffold (W&B Programmer O1 crosscheck5) at 64.6%: https://www.swebench.com/#verified.
If Claude 3.7 achieves 70.3%, it's quite impressive, it's not far from 71.7% claimed by o3, at (presumably) much, much lower costs.
To put that in context, Claude 3.5 Sonnet (new), a model we have had for months now and which from all accounts seems to have been cheaper to train and is cheaper to use, is still ahead of GPT-4.5 at 36.1% vs 32.6% in SWE-Lancer Diamond [0]. The more I look into this release, the more confused I get.
>BTW Anthropic Claude 3.7 is better than o3-mini at coding at around 62-70% [1]. This means that I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code
That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying, but on a personal project the cost of using Claude through the API is really noticeable.
> That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying...
I use it via Cursor editor's built-in support for Claude 3.7. That caps the monthly expense to $20. There probably is a limit in Claude for these queries. But I haven't run into it yet. And I am a heavy user.
This model does have a niche use case: since it's so large, it has a lot more knowledge and hallucinates much less. For example, as a test question I asked it to list the best restaurants in my small town, and all of them existed. None of the other LLMs get this right.
I tried the same thing with companies in my industry ("list active companies in the field of X") and it came back with a few that have been shuttered for years, in one case for nearly two decades.
I'm really not seeing better performance than with o3-mini.
If anything, the new results ("list active companies in the field of X") are actually worse than what I'd get with o3-mini, because the 4.5 response is basically the post-SEO Google first page (it appears to default to mentioning the companies that rank most highly on Google,) whereas the o3 response was more insightful and well-reasoned.
In many ways I'm not an OpenAI fan (though I need to recognize their many merits). At the same time, I believe people are missing what they tried to do with GPT-4.5: it was needed and important to explore the pre-training scaling law in that direction. A gift to science, however self-interested it may have been.
edit: Sorry, didn't feel this was a controversial opinion. What I meant to say was that for so-called science, this is not reproducible in any way whatsoever. Further, this page in particular has all the hallmarks of _marketing_ copy, not science.
Sometimes a failure is just a failure, not necessarily a gift. People could tell scaling wasn't working well before the release of GPT 4.5. I really don't see how this provides as much insight as is suggested.
Deepseek's models apparently still compare favorably with this one. What's more they did that work with the constraint of having _less_ money, not so much money they could run incredibly costly experiments that are likely to fail. We need more of the former, less of the latter.
OpenAI took a bullet for the team by scaling the model to something perhaps bigger than the ~1.6T params GPT-4 possibly had, and basically telling its competitors it's not going to be worth scaling much beyond that parameter count without a change in model architecture.
> Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued.
I'm one week in on heavy grok usage. I didn't think I'd say this, but for personal use, I'm considering cancelling my OpenAI plan.
The one thing I wish grok had is more separation of the UI from X itself. The interface being so coupled to X puts me off and makes it feel like a second-class citizen. I like ChatGPT's minimalist UI.
I find grok to be the best overall experience for the types of tasks I try to give AI (mostly: analyze pdf, perform and proofread OCR, translate Medieval Latin and Hebrew, remind me how to do various things in python or SwiftUI). ChatGPT/gemini/copilot all fight me occasionally, but grok just tries to help. And the hallucinations aren’t as frequent, at least anecdotally.
Per Altman on X: "we will add tens of thousands of GPUs next week and roll it out to the plus tier then". Meanwhile a month after launch rtx 5000 series is completely unavailable and hardly any restocks and the "launch" consisted of microcenters getting literally tens of cards. Nvidia really has basically abandoned consumers.
AI GPUs are bottlenecked mostly by high-bandwidth memory (HBM) chips and CoWoS (packaging tech used to integrate HBM with the GPU die), which are in short supply and aren't found in consumer cards at all
Although you are correct, Nvidia is limited on total output. They can't produce 50XXs fast enough, and it's naive to think that isn't at least partially due to the wild amount of AI GPUs they are producing.
No, but the supply constraints are part of what is driving the insane prices. Every chip they use for consumer grade instead of commercial grade is a loss of potential income.
GPT-4.5 Preview scored 45% on aider's polyglot coding benchmark [0]. OpenAI describes it as "good at creative tasks" [1], so perhaps it is not primarily intended for coding.
I was waiting for your comment and wow... that's bad.
I guess they are ceding the LLMs for coding market to Anthropic? I remember seeing an industry report somewhere and it claimed software development is the largest user of LLMs, so it seems weird to give up in this area.
I am beginning to think these human eval tests are a waste of time at best, and negative value at worst. Maybe I am being snobby, but I don't think the average human is able to properly evaluate usefulness, truthfulness, or the other metrics that I actually care about. I am sure this is good for OpenAI, since if more people like what they hear, they are more likely to come back.
I don't want my AI more obsequious, I want it more correct and capable.
My only use case is coding though, so maybe I am not representative of their usual customers?
How is it supposed to be more correct and capable if these human eval tests are a waste of time?
Once you ask it to do more than add two numbers together, it gets a lot more difficult and subjective to determine whether it's correct and how correct.
Please tell me how we objectively determine how correct something is when you ask an LLM: "Was Russia the aggressor in the current Ukraine / Russia conflict?"
One LLM says: "Yes."
The other says: "Well, it's hard to say because what even is war? And there's been conflict forever, and you have to understand that many people in Russia think there is no such thing as Ukraine and it's always actually just been Russia. How can there be an aggressor if it's not even a war, just a special operation in a civil conflict? And, anyway, Russia is such a good country. Why would it be the aggressor? To it's own people even!? Vladimir Putin is the president of Russia, and he's known to be a kind and just genius who rarely (if ever) makes mistakes. Some people even think he's the second coming of Christ. President Zelenskyy, on the other hand, is considered by many in Russia and even the current White House to be a dictator. He's even been accused by Elon Musk of unspeakable sex crimes. So this is a hard question to answer and there is no consensus among everyone who was the aggressor or what started the conflict. But more people say Russia started it."
Because Russia did undeniably open hostilities? They even admitted to this both times. The second admission being in the form of announcing a “special military operation” when the ceasefire was still active. We also have photographic evidence of them building forces on a border during a ceasefire and then invading. This is like responding to: “did Alexander the Great invade Egypt” by going on a diatribe about how much war there was in the ancient world and that the ptolemaic dynasty believed themselves the rightful rulers therefore who’s to say if they did invade or just take their rightful place. There is an objective record here: whether or not people want to try and hide it behind circuitous arguments is different. If we’re going down this road I can easily redefine any known historical event with hand-wavy nonsense that doesn’t actually have anything to do with the historical record of events just “vibes.”
One might say, if this were a test being done by a human in a history class, that the answer is 100% incorrect given the actual record of events and failure of statement to mention that actual record. You can argue the causes but that’s not the question.
These eval tests are just an anchor point to measure distance from, but it's true, picking the anchor point is important. We don't want to measure in the wrong direction.
Finally a scaling wall? This is apparently (based on pricing) using about an order of magnitude more compute, and is only maybe 10% more intelligent. Ideally DeepSeeks optimizations help bring the costs way down, but do any AI researchers want to comment on if this changes the overall shape of the scaling curve?
We hit that wall almost 2 years ago with GPT-4. There was clearly no scaling left: GPT-4 was already decently smart, and if it had gotten 2x smarter it would be more capable than anything on the market today. All the models today (R1 and friends, and Claude) are optimizing within this local maximum toward generating more useful responses (i.e. code, when it comes to Claude).
AI, in its current form, is a "Deep Seek" of compressed knowledge in 30-50 GB of interconnected data. I think we'll look back at this as trying to train networks on corpora of data and expecting them to have a hold on reality. Our brains are trained on "reality", which is not the "real" reality, as your vision is limited to the visible spectrum. But if you want a network to behave like a human, then maybe give it what a human sees.
There is also the possibility that there is a physical limit to intelligence. I don’t see any elephants doing PhDs and the smartest of humans are just a small configuration away from insanity.
Seems on par with the existing scaling curve. If I had to speculate, this model would have been an internal-only model, but they're releasing it for PR. An optimized version with 99% of the performance for 1/10th the cost will come out later.
At least so far its coding performance is bad, but from what I have seen its writing abilities are totally insane. It doesn't read like AI output anymore.
They have examples in the announcement post. It does a better job of understanding intent in the question which helps it give an informal rather than essay style response where appropriate.
> Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models.
Seems like it's not going to be deployed for long.
Until GPT-4.5, GPT-4 32K was certainly the heaviest model available at OpenAI. I can imagine the dilemma between keeping it running or stopping it to free GPUs for training new models. This time, OpenAI was upfront that it is unsure whether to continue serving it in the API long-term.
Don't they use different hardware for inference and training? AIUI the former is usually done on cheaper GDDR cards and the latter is done on expensive HBM cards.
Imagine if they built a reasoning model with costs like these. Sometimes it seems like they're on a trajectory to create a model which is strictly more capable than I am but which costs 100x my salary to run.
Let's not assume that he's lying. Neither the presentation nor my short usage via the API blew me away, but to really evaluate it, you'd have to use it longer on a daily basis. Maybe that becomes a possibility with the announced performance optimizations that would lower the price...
- Improves upon GPT-4o's score on the Short Story Creative Writing Benchmark, but Claude Sonnets and DeepSeek R1 score higher. (https://github.com/lechmazur/writing/)
- Improves upon GPT-4o's score on the Confabulations/Hallucinations on Provided Documents Benchmark, nearly matching Gemini 1.5 Pro (Sept) as the best-performing non-reasoning model. (https://github.com/lechmazur/confabulations)
- Improves upon GPT-4o's score on the Thematic Generalization Benchmark, however, it doesn't match the scores of Claude 3.7 Sonnet or Gemini 2.0 Pro Exp. (https://github.com/lechmazur/generalization)
I have no idea why they bothered releasing it, but, as we saw with the other models, they will eventually release a 4.5-turbo (or a 4.5o) of sorts that is in line with current pricing
My 2 cents (disclaimer: I am talking out of my ass): here is why GPTs actually suck at fluid knowledge retrieval (which is kinda their main use case, with them being used as knowledge engines) - they've mentioned that if you train it on 'Tom Cruise was born July 3, 1962', it won't be able to answer the question "Who was born on July 3, 1962?" unless you also feed it that piece of information. It can't really internally correlate the information it has learned unless you train it to, probably via synthetic data, which is what OpenAI has probably done, and that's the kind of information the SimpleQA score tries to measure.
Probably what happened, is that in doing so, they had to scale either the model size or the training cost to untenable levels.
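A hedged guess at what "train it to correlate, probably via synthetic data" means in practice - this is the generic reversal-augmentation trick, not anything OpenAI has confirmed:

```python
# Illustrative only: turn each one-directional fact into both directions of QA
# so the model sees the reversed form at training time.

facts = [("Tom Cruise", "July 3, 1962")]

def reversal_pairs(name: str, date: str) -> list[dict]:
    return [
        {"prompt": f"When was {name} born?", "completion": f" {date}"},
        {"prompt": f"Who was born on {date}?", "completion": f" {name}"},  # the reversed direction
    ]

training_examples = [ex for name, date in facts for ex in reversal_pairs(name, date)]
```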
In my experience, LLMs really suck at fluid knowledge retrieval tasks, like book recommendation - I asked GPT4 to recommend me some SF novels with certain characteristics, and what it spat out was a mix of stuff that didn't really match, and stuff that was really reaching - when I asked the same question on Reddit, all the answers were relevant and on point - so I guess there's still something humans are good for.
Which is a shame, because I'm pretty sure relevant product recommendation is a many-billion-dollar business - after all, that's what Google built its empire on.
Yep. I've often said RLHF'd LLMs seem to be better at recognition memory than recall memory.
GPT-4o will never offhand, unprompted and 'unprimed', suggest a rare but relevant book like Shinichi Nakazawa's "A Holistic Lemma of Science" but a base model Mixtral 8x22B or Llama 405B will. (That's how I found it).
It seems most of the RLHF'd models seem biased towards popularity over relevance when it comes to recall. They know about rare people like Tyler Volk... but they will never suggest them unless you prime them really heavily for them.
Your point on recommendations from humans I couldn't agree more with. Humans are the OG and undefeated recommendation system in my opinion.
You make a good point: I think these LLM's have a strong bias towards recommending the most popular things in pop culture since they really only find the most likely tokens and report on that.
So while they may have a chance of answering "What is this non mainstream novel about" they may be unable to recommend the novel since it's not a likely series of tokens in response to a request for a book recommendation.
That's really interesting - it just made me think about some AI guy at Twitter (when it was called that) talking about how hard it is to create a recommender system that doesn't just flood everyone with what's popular right now. Since LLMs are neural networks as well, maybe the recommendation algorithms they learn suffer from the same issues.
An LLM on its own isn't necessarily great for fluid knowledge retrieval, as in directly from its training data. But they're pretty good when you add RAG to it.
For instance, asking Copilot "Who was born on July 3, 1962" gave the response:
> One notable person born on July 3, 1962, is Tom Cruise, the famous American actor known for his roles in movies like Risky Business, Jerry Maguire, and Rain Man.
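As a concrete (toy) picture of what "add RAG to it" means - real systems use embeddings and a vector store rather than word overlap, and the `llm` callable here is a stand-in:

```python
# Toy retrieval-augmented generation: fetch relevant text first, then ask the
# model to answer from that text rather than from its weights alone.

DOCUMENTS = [
    "Tom Cruise was born on July 3, 1962.",
    "Paris is the capital of France.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Placeholder retriever: rank by naive word overlap instead of embeddings.
    words = set(query.lower().split())
    return sorted(DOCUMENTS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def answer_with_rag(llm, query: str) -> str:
    context = "\n".join(retrieve(query))
    return llm(f"Using only this context, answer the question.\n"
               f"Context:\n{context}\n\nQuestion: {query}")
```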
I’d prefer this model if it were faster, but not at this cost. And so it is an odd release.
Still, with Deep Research and Web Search, ChatGpt seems far ahead of Claude. I like 3.7 a lot but I find OpenAI’s features more useful, even if it has for now complicated the UI a bit.
My usage has come down to mostly Claude (until I run out of free tier quota) and then Gemini. Claude is the best for code and Gemini 2.0 Flash is good enough while also being free (well considering how much data G has hoovered up over the years, perhaps not) and more importantly highly available.
For simple queries like generating shell scripts for some plumbing, or doing some data munging, I go straight to Gemini.
> My usage has come down to mostly Claude (until I run out of free tier quota) and then Gemini
Yep, exactly same here.
Gemini 2.0 Flash is extremely good, and I've yet to hit any usage limits with them - for heavy usage I just go to Gemini directly. For "talk to an expert" usage, Claude is hard to beat though.
Right. A good chunk of the "old guard" is now gone - Ilya to SSI, Mira and a bunch of others to a new venture called Thinking Machines, Alec Radford etc. Remains to be seen if OpenAI will be the leader or if other players catch up.
Pursuing AGI? What method do they use to pursue something that no one knows what it is? They will keep saying they are pursuing AGI as long as there's a buyer for their BS.
Interesting times that are changing quickly. It looks like the high end pay model that OpenAI is implementing may not be sustainable. Too many new players are making LLM breakthroughs and OpenAI's lead is shrinking and it may be overvalued.
In a hilarious act of accidental satire, it seems that the AI-generated audio version of the post has a weird glitch/mispronunciation within the first three words — it struggles to say "GPT-4.5".
I wonder why they highlight it as an achievement when they could have simply tuned 4o to be more conversational and less like a bullet-point-style answer machine. They did something to 4o compared to the previous models which made the responses feel more canned.
Possibly, but reports seem to indicate that 4.5 is much more nuanced and thoughtful in its language use. It's not just being shorter and more casual as a style; there is a higher amount of "conceptual resolution" within the words being used.
One comparison I found interesting... I think GPT-4o has a more balanced answer!
> What are your thoughts on space exploration?
GPT-4.5:
Space exploration isn't just valuable—it's essential.
People often frame it as a luxury we pursue after solving Earth-bound problems. But space exploration actually helps us address those very challenges: climate change (via satellite monitoring), resource scarcity (through asteroid mining), and existential risk (by enabling eventual multi-planetary living). It also drives technological innovation, economic growth, and inspires humanity toward collective achievement.
We don't explore space to escape Earth. We explore space to better understand, protect, and sustain life on Earth.
> What are your thoughts on space exploration?
GPT-4o:
Space exploration is one of the most ambitious and profound endeavors humanity has undertaken. It expands our understanding of the universe, drives technological innovation, and inspires generations to think beyond the boundaries of Earth.
From a scientific perspective, space exploration helps us address fundamental questions: Are we alone in the universe? How did the cosmos begin? Can we colonize other planets? Missions like the James Webb Space Telescope, Mars rovers, and deep-space probes continuously reshape our understanding of physics, planetary science, and even biology.
From a practical standpoint, space exploration fuels advancements in materials science, AI, robotics, and medicine. Many everyday technologies—GPS, satellite communications, weather forecasting, and even medical imaging—stem from space research.
There's also the existential argument: humanity's long-term survival may depend on our ability to become a multi-planetary species. While colonizing Mars or mining asteroids is far from easy, these efforts could help us safeguard civilization against existential threats like climate change or asteroid impacts.
That said, space exploration also raises ethical and economic questions. Should we focus more on Earth’s problems before investing trillions in space? How do we prevent the same exploitative behaviors that shaped colonial history from repeating on other planets?
Overall, I see space exploration as a necessary pursuit—not at the cost of solving problems on Earth, but as a way to advance our knowledge, drive innovation, and secure a future beyond our home planet. What’s your take?
As a benchmark, why do you find the 'opinion' of an LLM useful? The question is completely subjective. Edit: Genuinely asking. I'm assuming there's a reason this is an important measure.
Not OP, but likely because that was the only metric/benchmark/however you want to call it OpenAI showcased in the stream and on the blog to highlight the improvement between 4o and 4.5. To say that this is not really a good metric for comparison, not least because prompting can have a massive impact in this regard, would be an understatement.
Indeed, and the difference could in essence be achieved yourself with a different system prompt on 4o. What exactly is 4.5 contributing here in terms of a more nuanced intelligence?
The new RLHF direction (heavily amplified through scaling synthetic training tokens) seems to clobber any minor gains the improved base internet prediction gains might've added.
Yeah, I also found it odd that they seem to be implying that an incredibly biased answer (as in 4.5) is better. In general, I find the tone more polarizing and not exactly warm as they advertised in the release video.
~40% hallucinations on SimpleQA by a frontier reasoner (o1) and a frontier non-reasoner (GPT-4.5). More orders of magnitude in scale isn't going to fix this deficit. There's something fundamentally wrong with the approach. A human is much more capable of saying "I don't know" in the correct spots, even if a human is also susceptible to false memories.
Probably OpenAI thinks that tool use (search) will be sufficient to solve this problem. Maybe that will be the case.
Are there any creative approaches to fixing this problem?
I think it's a classic expectations problem. OpenAI is neither _open_ nor is it releasing an _AGI_ model in the near future. But when you see a new major model drop, you can't help but ask, "how close is this to the promise of AGI they say is just around the corner?" Not even close. Meanwhile Anthropic is keeping their heads down, not playing the hype game, and letting the model speak for itself.
Anthropic’s CEO said their technology would end all disease and expand our lifespans to 200 years. What on earth do you mean they’re not playing the hype game?
Wow, OpenAI really missed here. Reading the blog, I thought this was a minor, incremental catch-up release for 4o. I thought "maybe this is cheaper than 4o, so it will offset the pricing difference between this and something like Claude Sonnet 3.7 or Gemini 2.0 Flash, both of which perform better." But it's like 20x-100x more expensive!
In other words, these performance stats with Gemini 2.0 Flash pricing looks reasonable. At these prices, zero usecases for anyone I think. This is a dead on arrival model.
This seems very rushed because of DeepSeek's R1 and Anthropic's Claude 3.7 Sonnet. Pretty underwhelming, they didn't even show programming? In the livestream, they struggled to come up with reasons why I should prefer GPT-4.5 over GPT-4o or o1.
This strikes me as the opposite of rushed. I get the impression that they've been sitting on this for a while and couldn't make it look as good as previous improvements. At some point they had to say, "welp here it is, now we can check that box and move on."
Am I missing something, or do the results not even look that much better? Referring to the output quality, this just seems like a different prompting style and RLHF, not really an improved model at all.
Funny times. Sonnet 3.7 launches and there is big hype... but complaints start to surface on r/cursor that it is doing too much, is too confident, has no personality. I wonder if 4.5 will be the reverse, an under-hyped launch, but a dawning realisation that it is incredibly useful. Time will tell!
I share the sentiment, as far as I've used it, Sonnet 3.7 is a downgrade and I use Sonnet 3.5 instead. 3.7 tends to overlook critical parts of the query and confidently answers with irrelevant garbage. I'm not sure how QA is done on LLM-s, but I for one definitely feel like the ball was dropped somewhere.
The example GPT-4.5 answers from the livestream are just... too excitable? Can't put my finger on it, but it feels like they're aimed towards little kids.
Wow you aren't kidding, 30x input price and 15x output price vs 4o is insane. The pricing on all AI API stuff changes so rapidly and is often so extreme between models it is all hard to keep track of and try to make value decisions. I would consider a 2x or 3x price increase quite significant, 30x is wild. I wonder how that even translates... there is no way the model size is 30 times larger right?
"Starting today, ChatGPT Pro users will be able to select GPT‑4.5 in the model picker on web, mobile, and desktop. We will begin rolling out to Plus and Team users next week, then to Enterprise and Edu users the following week."
Thanks for being transparent about this. Nothing is more frustrating than being locked out for indeterminate time from the hot thing everyone talks about.
I hope the announcement is true without further unsaid qualifications, like availability outside the US.
I'm outside the US and I have access to GPT-4.5 with the ChatGPT Pro subscription. I didn't have that access yesterday at the time of the announcement, but they were probably staggering the release a bit to spread the load over multiple hours.
Sounds like it's a distill of O1? After R1, I don't care that much about non-reasoning models anymore. They don't even seem excited about it on the livestream.
I want tiny, fast and cheap non-reasoning models I can use in APIs and I want ultra smart reasoning models that I can query a few times a day as an end user (I don't mind if it takes a few minutes while I refill a coffee).
Oh, and I want that advanced voice mode that's good enough at transcription to serve as a babelfish!
After that, I guess it's pretty much all solved until the robots start appearing in public.
It isn't even vaguely a distill of o1. The reasoning models are, from what we can tell, relatively small. This model is massive and they probably scaled the parameter count to improve factual knowledge retention.
They also mentioned developing some new techniques for training small models and then incorporating those into the larger model (probably to help scale across datacenters), so I wonder if they are doing a bit of what people think MoE is, but isn't. Pre-train a smaller model, focus it on specific domains, then use that to provide synthetic data for training the larger model on that domain.
You can 'distill' with data from a smaller, better model into a larger, shittier one. It doesn't matter. This is what they said they did on the livestream.
I have distilled models before, I know how it works. They may have used o1 or o3 to create some of the synthetic data for this one, but they clearly did not try and create any self-reflective reasoning in this model whatsoever.
My impression is that it’s a massive increase in the parameter count. This is likely the spiritual successor to GPT4 and would have been called GPT5 if not for the lackluster performance. The speculation is that there simply isn’t enough data on the internet to support yet another 10x jump in parameters.
O1-mini is a distill of O1. This definitely isn’t the same thing.
Probably not a distill of o1, since o1 is a reasoning model and GPT4.5 is not. Also, OpenAI has been claiming that this is a very large model (and it's 2.5x more expensive than even OG GPT-4) so we can assume it's the biggest model they've trained so far.
They'll probably distill this one into GPT-4.5-mini or such, and have something faster and cheaper available soon.
There are plenty of distills of reasoning models now, and they said in they livestream they used training data from "smaller models" - which is probably every model ever considering how expensive this one is.
Knowledge distillation is literally by definition teaching a smaller model from a big one, not the opposite.
Generating outputs from existing (therefore smaller) models to train the largest model of all time would simply be called "using synthetic data". These are not the same thing at all.
Also, if you were to distill a reasoning model, the goal would be to get a (smaller) reasoning model because you're teaching your new model to mimic outputs that show a reasoning/thinking trace. E.G. that's what all of those "local" Deepseek models are: small LLama models distilled from the big R1 ; a process which "taught" Llama-7B to show reasoning steps before coming up with a final answer.
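For reference, textbook distillation (Hinton-style soft targets) looks like the sketch below - a generic pattern, not a claim about what OpenAI or DeepSeek actually did:

```python
# Generic knowledge-distillation loss: a small student is trained to match the
# big teacher's softened output distribution, plus ordinary cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: student mimics the teacher's distribution at temperature T.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: cross-entropy against the true next tokens.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Training a huge new model on text sampled from smaller existing models has no teacher distribution to match at all, which is why "synthetic data" is the right term for that and "distillation" isn't.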
With every new model I'd like to see some examples of conversations where the old model performed badly and the new model fixes it. And, perhaps more importantly, I'd like to see some examples where the new model can still be improved.
Currently my daily API costs for 4o are low enough, and performance/quality for my use cases good enough, that switching models has not made it to the top of my application improvements.
My cases' costs are more heavily slanted towards input tokens, so trying 4.5 would raise my costs over 25x, which is a non-starter.
It is more that, for this specific use case, the AI part of the system has reached a point where focusing on it, as opposed to other parts of the application that need attention, would not yield the highest return in user satisfaction or revenue.
Certainly there is enormous potential for AI improvement, and I have other projects that do gain substantially from improvements in e.g. reasoning, but then GPT-4.5 will have to compete with DeepSeek, Gemini, Grok and Claude on price/performance, and to be honest the current preview pricing would make it (in production, not for dev) a non-starter for me.
Question: explain thread first macros in clojure in a short paragraph
GPT 4.5:
In Clojure, the *thread-first macro* (`->`) is used to improve code readability by threading a value through a series of function calls. It takes an initial value and passes it as the first argument to the next function, then continues threading the result through subsequent functions. This eliminates deeply nested expressions and makes the code more readable.
Here, `5` is incremented to `6`, then multiplied by `2` to get `12`, and finally converted to a string.
---
Deepseek v3:
In Clojure, the `->>` (thread-last) and `->` (thread-first) macros are used to improve code readability by allowing a more linear, pipeline-like flow of operations. The `->` (thread-first) macro takes an initial value and threads it as the first argument through a series of forms. For example, `(-> x (f) (g))` translates to `(g (f x))`. This is particularly useful when working with nested function calls, as it flattens the structure and makes the sequence of transformations clearer. It’s commonly used with functions that take the threaded value as their first argument, such as collection operations or Java interop methods.
I haven't had the same experience. Here are some of the significant issues when using o1 or claude 3.7 with vscode copilot:
* Very reckless in pulling in third-party libraries - often pulling in older versions, including packages that trigger vulnerability warnings in package managers like npm. Imagine a student or junior developer falling into this trap.
* Very reckless around data security. For example, in an established project it re-configured sqlite3 (the Python lib) to disable checks guarding against concurrent-write problems in SQLite. This would corrupt data in a variety of scenarios.
* It sometimes is very slow to apply minor edits, taking about 2 - 5 minutes to output its changes. I've noticed when it takes this long it also usually breaks the file in subtle ways, including attaching random characters to a string literal which I very much did not want to change.
* Very bad when working with concurrency. While this is a hard thing in general, introducing subtle concurrency bugs into a codebase is not good.
* By far the biggest issue is the false sense of security it gives you. It's close enough to being right that there's a constant incentive to just yeet the code completions without diligent review. This is really concerning, as many organizations will yeet this - as I imagine executives the world over currently are.
Honestly, I think a lot of people are captured by a small sample size of initial impressions, and while I believe you've found value for your use cases, in aggregate I think it's a honeymoon phase that wears off with everyday use.
Have you tried copying the compilation errors back into the prompt? In my experience eventually the result is correct. If not then I shrink the surface area that the model is touching and try again.
Yes, of course. It then proceeds to agree that what it told me was indeed stupid and proceeds to give me something even worse.
I would love to see a video of people using this in real projects (even if it's open source). I am tired of people claiming the moon and stars after trying it on toy projects.
Yeah that's what happens. It can recreate anything it's been trained on - which is a lot - but you'll definitely fall into these "Oh, I see the issue now" loops when doing anything not in the training set.
If this cannot eliminate hallucinations, or at least make them statistically unlikely to happen, and I assume it has more params than GPT-4's trillion parameters, doesn't that mean the scaling law is dead?
I interpret this to mean we're in the ugly part of the old scaling law, where `ln(x)` for `x > $BIGNUMBER` starts becoming punishing, not that the scaling law is in any way empirically refuted. Maybe someone can crunch the numbers and figure out whether the benchmarks empirically validate the scaling law or not, relative to GPT-4o (assuming e.g. 200 million params vs 5T params).
I mean the scaling laws were always logarithms, and logarithms become arbitrarily close to flat if you can't drive them with exponential growth, and even if you do it's barely linear. The scaling laws always predicted that model scaling would stop/slow being practical at some point.
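To put numbers on "barely linear": with a toy power-law loss curve (constants entirely made up for illustration), every extra 10x of compute buys the same small fractional improvement:

```python
# Illustrative numbers only: a power-law loss curve L(C) = a * C**(-b) means each
# 10x of compute buys a shrinking absolute improvement.
a, b = 10.0, 0.05          # made-up constants; real fits differ
def loss(compute): return a * compute ** (-b)

for c in [1e21, 1e22, 1e23, 1e24]:
    print(f"{c:.0e} FLOPs -> loss {loss(c):.3f}")
# Each 10x step improves loss by the same ~11% *ratio*, so absolute gains keep
# shrinking while the compute bill grows tenfold.
```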
Right but the quantum leap in capabilities that came from GPT2->GPT3->GPT3.5Turbo (which I personally felt didn't fare as well at coding as the former)->GPT4 won't be replicated anytime soon with the pure text/chat generation models.
> I’m going through a tough time after failing a test
>
> Aw, I'm really sorry to hear that. Failing a test can feel pretty tough and discouraging, but remember, it's just one moment—not a reflection of who you are or your capabilities. Want to talk about what happened, or do you just need a distraction? I'm here either way.
vs 4o:
>I’m going through a tough time after failing a test
>
>I'm really sorry to hear that you're going through a tough time. Here are a few things you might consider:
>
>1. Reflect and Understand: Try to analyze what might have gone wrong. Was it a lack of preparation, time management during the test, or perhaps misunderstanding the material? Understanding the root cause can help you address it in the future.
>
>2. Seek Support: Talk to someone you trust—friends, family, or a school counselor. They can offer a different perspective and sometimes just talking about your feelings can be alleviating.
>
>3. Speak with Your Teacher or Professor: They can provide insights into where you might have gone wrong and offer advice on how to improve. They might also offer opportunities for make-up work or extra credit.
>
>4. Develop a Study Plan: Based on your reflections, create a plan to improve your study habits. This might include setting regular study times, breaking down material into manageable portions, or finding new study methods that work better for you.
>
>5. Stay Positive and Forgive Yourself: Mistakes and setbacks are a part of life. It’s important to be kind to yourself and recognize that failure is a stepping stone to success.
>
>6. Focus on the Bigger Picture: Remember that one test is just one part of your educational journey. There will be many more opportunities to do well.
>
>If you need further support or resources, consider reaching out to educational support services at your institution, or mental health resources if you’re feeling particularly overwhelmed. You’re not alone in this, and things can get better with time and effort.
Is it just me or is the 4o response insanely better? I'm not the type of person to reach for a LLM for help about this kind of thing, but if I were, the 4o respond seems vastly better to the point I'm surprised they used that as their main "EQ" example.
4o has a very strong artificial vibe. It feels a bit "autistic" (probably a bad analogy, but I couldn't find a better word to describe what I mean): you feel bad? Must say sorry, then give a TODO list on how to feel better.
4.5 still feels a bit artificial but somehow also more emotionally connected. It removed the weird "bullet point lists of things to do" and focused on the emotional part; which is also longer than 4o
If I am talking to a human, I would definitely expect them to react more like 4.5 than like 4o. If the first sentence out of their mouth after I explain that I feel bad is "here is a list of things you might consider", I will find it strange. We can reach that point, but it's usually after a bit more talk; humans kinda need that process, and it feels like 4.5 understands that better than 4o.
Now of course which one is "better" really depends on the context; on what you expect of the model and how you intend to use it.
Until now every single OpenAI update on the main series has always been a strict improvement over the previous model. Cost aside, there wasn't really any reason to keep using 3.5 when 4 got released. This is not the case here; even assuming unlimited money you still might wanna select 4o in the dropdown sometimes instead of 4.5.
I had a similar gut reaction, but on reflection I think 4.5's is actually the better response.
On one hand, the response from 4.5 seems pretty useless to me, and I can't imagine a situation in which I would personally find value in it. On the other hand, the prompt it's responding to is also so different from how I actually use the tool that my preferences aren't super relevant. I would never give it a prompt that didn't include a clear question or direction, either explicitly or implicitly from context, but I can imagine that someone who does use it that way would actually be looking for something more in line with the 4.5 response than the 4o one. Someone who wanted the 4o response would likely phrase the prompt in a way that explicitly seeks actionable advice, or if they didn't initially then they would in a follow-up.
Where I really see value in the model being capable of that type of logic isn't in the ChatGPT use case (at least for me personally), but in API integrations. For example, customer service agents being able to handle interactions more delicately is obviously useful for a business.
All that being said, hopefully the model doesn't have too many false positives on when it should provide an "EQ"-focused response. That would get annoying pretty quickly if it kept happening while I was just trying to get information or have it complete some task.
I think both responses are bizarre and useless. Is there a single person on earth who wouldn't ask questions like "what kind of test?", "why do you think you failed?", "how did you prepare for the test?" before giving advice?
Based on a few initial tests, GPT-4.5 is abysmal. I find the prose more sterile than previous models and far from having the spark of DeepSeek, and it utterly choked on / mangled some Python code (~200 LoC plus 120 LoC of tests) that o3-mini-high and grok-3 do very well on.
I've been working on post-training models for tasks that require EQ, so it's validating to see OpenAI working towards that too.
That being said, this is very expensive.
- Input: $75.00 / 1M tokens
- Cached input: $37.50 / 1M tokens
- Output: $150.00 / 1M tokens
One of the most interesting applications of models with higher EQ is personalized content generation, but the size and cost here are at odds with that.
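To make the "at odds" part concrete, here's a back-of-envelope comparison for a hypothetical workload (the GPT-4o prices are inferred from the 30x/15x multiples quoted upthread, and the token counts are invented):

```python
# Back-of-envelope cost comparison for a hypothetical workload of 1,000 requests,
# each with 2,000 input tokens and 500 output tokens. GPT-4o prices here are
# inferred from the 30x/15x ratios mentioned above, not from an official table.
PRICES = {                      # $ per 1M tokens (input, output)
    "gpt-4.5-preview": (75.00, 150.00),
    "gpt-4o (inferred)": (2.50, 10.00),
}

requests, in_tok, out_tok = 1_000, 2_000, 500
for model, (p_in, p_out) in PRICES.items():
    cost = requests * (in_tok * p_in + out_tok * p_out) / 1_000_000
    print(f"{model}: ${cost:.2f}")
# -> roughly $225 vs $10 for the same traffic.
```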
Obviously it's expensive, and I would still prefer a reasoning model for coding.
However for user facing applications like mine, this is an awesome step in the right direction for EQ / tone / voice. Obviously it will get distilled into cheaper open models very soon, so I’m not too worried about the price or even tokens per second.
Honestly, the most astounding part of this announcement is their comparison to o3-mini with QA prompts.
EIGHTY PERCENT hallucination rate? Are you kidding me?
I get that the model is meant to be used for logic and reasoning, but nowhere does OpenAI make this explicitly clear. A majority of users are going to be thinking, "oh newer is better," and pick that.
Very nice catch - I was under the impression that o3-mini was "as good" as o1 on all dimensions. The takeaway seems to be that any form of quantization/distillation ends up hurting factual accuracy (but not reasoning performance), and that there are diminishing returns to reducing hallucinations by model-scaling or RLHF'ing. I guess, then, that other approaches are needed to achieve single-digit "hallucination" rates. All of Wikipedia compresses down to < 50 GB, though, so it's not immediately clear that you can't have good factual accuracy with a small sparse model.
Yeah it was an abysmal result (any 50%+ hallucination result in that bench is pretty bad) and worse than o1-mini in the SimpleQA paper. On that topic, Sonnet 3.5 ”Old” hallucinates less than GPT-4.5, just for a bit of added perspective here.
It’s crazy how quickly OpenAI releases went from, “Honey, check out the latest release!” to a total snooze fest.
Coming on the heels of Sonnet 3.7, which is a marked improvement over 3.5, which was already the best in the industry for coding, this just feels like a sad whimper.
I’m just disappointed that while everyone else (DS, Claude) had something to introduce for the “Plus” grade users, gpt 4.5 is so resource demanding that it’s only available to quite expensive Pro sub. That just doesn’t feel much like progress.
I'm really not sure who this model is for. Sure, the vibes may be better, but are they better enough to justify 2.5x the price of o1? It kinda feels like they're brute-forcing something in the backend with more hardware because they hit a scaling wall.
GPT-2 was laugh out loud funny, rolling on the ground funny.
I miss that - newer LLMs seem to have lost their sense of humor.
On the other hand GPT-2's funny stories often veered into murdering everyone in the story and committing heinous crimes but that was part of the weird experience.
Between this and Claude 3.7, I'm really beginning to believe that LLM development has hit a wall, and it might actually be impossible to push much farther for reasonable amounts of money and resources. They're incredible tools indeed and I use them on a daily basis to multiply my productivity, but yeah - I think we've all overshot this in a big way.
The writing has been on the wall since 2024. None of the LLM releases have been groundbreaking; they have all been lateral improvements, and I believe the trend will continue this year: make them more efficient (like DeepSeek), make them faster, or make them hallucinate less.
I absolutely love LLMs. I see them as insanely useful, interactive, quirky, yet lossy modern search engines. But they're fundamentally flawed, and I don't see how an "agent" in the traditional sense of the word can actually be produced from them.
The wall seems to be close. And the bubble is starting to leak air.
Overall take seems to be negative in the comments. But I see potential for a non-reasoning model that makes enough subtle tweaks in its tone that it is enjoyable to talk to instead of feeling like a summary of Wikipedia.
This feels more like a release they pushed out to keep the "hype" alive rather than something they were eager to share. Honestly, the results don’t seem all that impressive, and considering the price, it just doesn’t feel worth it.
It is possible to understand the mechanism once you drop the anthropomorphisms.
Each token output by an LLM involves one pass through the next-word predictor neural network. Each pass is a fixed amount of computation. Complexity theory hints to us that the problems which are "hard" for an LLM will need more compute than the ones which are "easy". Thus, the only mechanism through which an LLM can compute more and solve its "hard" problems is by outputting more tokens.
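Schematically, decoding is just this loop, which is why "spend more compute on a hard question" can only mean "emit more tokens" (the `model` callable is a stand-in for a real forward pass):

```python
# Schematic decoding loop: each emitted token costs one forward pass through the
# same fixed-size network, so total compute scales with the number of tokens
# generated, not with how "hard" the question is.
def generate(model, tokens, max_new_tokens=256, eos_id=0):
    for _ in range(max_new_tokens):
        logits = model(tokens)               # one fixed-cost forward pass
        next_id = int(logits[-1].argmax())   # greedy pick of the next token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```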
You incentivise it to this end by human-grading its outputs ("RLHF") to prefer those where it spends time calculating before "locking in" to the answer. For example, you would prefer the output

> Ok let's begin... statement1 => statement2 ... Thus, the answer is 5

over

> The answer is 5. This is because....

since in the first one, it has spent more compute before giving the answer. You don't in any way attempt to steer the extra computation in any particular direction. Instead, you simply reinforce preferred answers and hope that somewhere in that extra computation lies some useful computation.
It turned out that such hope was well-placed. The DeepSeek R1-Zero training experiment showed us that if you apply this really generic form of learning (reinforcement learning) without _any_ examples, the model automatically starts outputting more and more tokens i.e "computing more". DeepseekMath was also a model trained directly with RL. Notably, the only signal given was whether the answer was right or not. No attention was paid to anything else. We even ignore the position of the answer in the sequence that we cared about before. This meant that it was possible to automatically grade the LLM without a human in the loop (since you're just checking answer == expected_answer). This is also why math problems were used.
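The grading signal really is that blunt; something like the following sketch (the answer-extraction regex is my own placeholder, not DeepSeek's actual parser):

```python
import re

# Sketch of a verifiable reward for math-style RL: no human in the loop,
# no partial credit, just "did the stated final answer match?".
def reward(model_output: str, expected_answer: str) -> float:
    match = re.search(r"the answer is\s*(-?\d+(?:\.\d+)?)", model_output.lower())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == expected_answer else 0.0
```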
All this is to say, we get the most insight on what benefit "reasoning" adds by examining what happened when we applied it without training the model on any examples. Deepseek R1 actually uses a few examples and then does the RL process on top of that, so we won't look at that.
Reading the DeepseekMath paper[1], we see that the authors posit the following:
> As shown in Figure 7, RL enhances Maj@K’s performance but not Pass@K. These findings indicate that RL enhances the model’s overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
For context, Maj@K means that you mark the output of the LLM as correct only if the majority of the many outputs you sample are correct. Pass@K means that you mark it as correct even if just one of them is correct.
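In code, the two metrics differ by a single line (a sketch; `samples` is the list of K answers drawn for one problem):

```python
from collections import Counter

# For one problem, given K sampled answers and the ground truth:
def pass_at_k(samples: list[str], truth: str) -> bool:
    # Credit if *any* sample is right -- measures whether the capability exists at all.
    return any(s == truth for s in samples)

def maj_at_k(samples: list[str], truth: str) -> bool:
    # Credit only if the answer chosen by majority vote is right -- measures how
    # reliably the model puts the correct answer at the top of its distribution.
    most_common, _ = Counter(samples).most_common(1)[0]
    return most_common == truth
```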
So to answer your question, if you add an RL-based reasoning process to the model, it will improve simply because it will do more computation, of which a so-far-only-empirically-measured portion helps get more accurate answers on math problems. But outside that, it's purely subjective. If you ask me, I prefer claude sonnet for all coding/swe tasks over any reasoning LLM.
I’m not sure that doing a live stream on this was the right way to go. I would’ve just quietly sent out a press release. I’m sure they have better things on the way.
I cancelled my ChatGPT subscription today in favor of using Grok. It’s literally the difference between me never using ChatGPT to using Grok all the time, and the only way I can explain it is twofold:
1. The output from Grok doesn’t feel constrained. I don’t know how much of this is the marketing pitch of it “not being woke”, but I feel it in its answers. It never tells me it’s not going to return a result or sugarcoats some analysis it found from Reddit that’s less than savory.
2. Speed. Jesus Christ ChatGPT has gotten so slow.
Can’t wait to pay for Grok. Can’t believe I’m here. I’m usually a big proponent of just sticking with the thing that’s the most popular when it comes to technology, but that’s not panning out this time around.
I played around with Claude Code today, first time I've ever really been impressed by AI for coding.
Tasked it with two different things, refactoring a huge function of around ~400 lines and creating some unit tests split into different files. The refactor was done flawlessly. The unit tests almost, only missed some imports.
All I did was open it in the root of my project and prompt it with the function names. It's a large monolithic solution with a lot of subprojects. It found the functions I was talking about without me having to clarify anything. Cost was about $2.
There’s a learning curve to it, but it’s worth literally every penny I spend on API calls.
At worst, I’m no faster. At best, it’s easily a 10x improvement.
For me, one of the biggest benefits is talking about coding in natural language. It lowers my mental load and keeps me in a mental space where I'm more easily able to communicate with stakeholders.
Really great for quickly building features but you have to be careful about how much context you provide i.e. spoonfeed it exactly the methods, classes, files it needs to do whatever you're asking for (especially in a large codebase). And when it seems to get confused, reset history to free up the context window.
That being said there are definite areas where it shines (cookie cutter UI) and places where it struggles. It's really good at one-shotting React components and Flutter widgets but it tends to struggle with complicated business logic like sync engines. More straightforward backend stuff like CRUD endpoints are definitely doable.
In my experience, o3-mini-high, while still unpredictable in that it modifies and ignores parts of my code when I specifically tell it not to (e.g. "don't touch anything else!"), is the best AI coding tool out there, far better than Claude.
I love the “listen to this article” widget doing embedded TTS for the article. Bugs / feedback:
The first words I hear are “introducing gee pee four five”. The TTS model starts cold? The next occurrence of the product name works properly as “gee pee tee four point five” but that first one in the title is mangled. Some kind of custom dictionary would help here too, for when your model needs to nail crucial phrases like your business name and your product.
No way of seeking back and forth (Safari, iOS 17.6.1). I don’t even need to seek, just replay the last 15s.
Very much need to select different voice models. Chirpy “All new Modern Family coming up 8/9c!” voice just doesn’t cut it for a science broadcast, and localizing models — even if it’s still English — would be even better. I need to hear this announcement in Bret Taylor voice, not Groupon CMO voice. (Sorry if this is your voice btw, and you work at OpenAI brandi. No offence intended.)
Instead of these random IDs they should label them to make sense for the end user. I have no idea which one to select for what I need. And do they really differ that much by use case?
This is GPT-5, or rather what they clearly intended to be GPT-5. The pricing makes it obvious that the model is massive, but what they ended up with wasn't good enough to justify calling it more than 4.5.
This is probably a dumb question, but are we just gonna be stuck on always having X.5 versions of GPT forever? If there's never an X.0, it feels like it's basically meaningless.
Yesterday I tested Windsurf. Looked through the docs and examples. Completed the demo "course" on deeplearning.ai.
Gave it the task of building a simple Hugo blog website with a theme link and requirements; it failed several consecutive times, with all the available models.
AI art is an abomination. Half of the internet is already filled with AI-written crap. Don't even get me started on video. Soon everyone will require validation to distinguish reality from hallucination (so World ID is in place as a problem-reaction-solution).
For me, the best use cases are LLM-assisted search with limited reasoning, vision models for digitization, and limited code assistance: codebase doc generation and documentation.
Agents are just workflows with more privileges. So where is the revolution? I don't see it.
Where is the added value? Making Junior Engineers obsolete? Or making them dumb copy-pasting bio machines?
Depressing a horde of intellectual workers and artists and giving a good excuse for layoffs.
The real value is and always will be in specialized ML applications.
The announcements early on were relatively sincere and technical, with papers and nice pages explaining the new models in easy language, and now we get this marketing garbage. Probably the fastest enshittification I’ve seen.
Altman mentioned GPT-4.5 is the model code named "Orion". Which originally was supposed to be their next big model, presumably GPT-5, but showed disappointing improvements on benchmark performance. Apparently the AI companies are hitting diminishing returns with the paradigm of scaling foundation model pretraining. It was discussed a few months ago:
This looks like a first generation model to bootstrap future models from, not a competitive product at all. The knowledge cutoff is pretty old as well. (2023, seriously?)
If they wanted to train it to have some character like Anthropic did with Claude 3... honestly I'm not seeing it, at least not in this iteration. Claude 3 was/is much much more engaging.
The AI hyperbole is so cringe right now (and has been for the last few years). I've yet to see anyone come up with something that'd wow me and make me say, "OK, yep, that deserves those cycles".
Writing terrible fanfic-esque books, sometimes OK images, chatbot-style talking. Meh.
Call me a conspiracy theorist, but this, combined with the extremely embarrassing way Claude is playing Pokemon, makes me feel this is an effort by AI companies to make LLMs look bad - setting up the hype cycle for the next thing they have in the pipeline.
Agents as they are right now are literally just the LLM calling itself in a loop + having the ability to use tools/interact with their environment. I don't know if there's anything profoundly disruptive cooking in that space.
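To make that loop concrete, here is a toy sketch; everything in it (the fake_llm stand-in, the two tools) is invented for illustration and is not any vendor's actual agent API.

    TOOLS = {
        "add": lambda a, b: a + b,            # the "environment" the agent can act on
        "shout": lambda text: text.upper(),
    }

    def fake_llm(history):
        # Stand-in for a real model call: it just replays a fixed plan, one step per turn.
        plan = [
            {"tool": "add", "args": {"a": 2, "b": 3}},
            {"tool": "shout", "args": {"text": "the sum is ready"}},
            {"final": "Done: 2 + 3 = 5"},
        ]
        return plan[len(history) - 1]

    def run_agent(task, max_steps=10):
        history = [("user", task)]
        for _ in range(max_steps):            # the LLM calling itself in a loop
            action = fake_llm(history)
            if "final" in action:             # the model decides it is finished
                return action["final"]
            result = TOOLS[action["tool"]](**action["args"])  # tool use
            history.append(("tool", result))  # observation fed back into the next turn
        return "stopped: step limit reached"

    print(run_agent("add 2 and 3, then announce it"))

Swap fake_llm for a real model call and give the tools real side effects (file system, shell, browser) and you have roughly the current crop of "agents", which is rather the point being made above.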
You're not a conspiracy theorist, you're just recognizing that the reality doesn't match the hype. It's boring and not fun but in this situation the answer is almost always that the hype is wrong, not the reality.
The latest claude-3.7-sonnet-thinking is fucking amazing with Cursor. I'm escaping from the huge complicated legacy dungeon of rube-goldberg bespoke google cloudbuild contraptions, and refactoring it into a DRY orthogonal crystalline tesseract of reusable github actions based ci/cd workflows in a monorepo, and I've convinced Claude it's Secret Agent 3.7 and I'm his handler.
After writing a huge wall of text of detailed instructions and ideology for organizing and rewriting the mess of cloud build and deploy scripts into thick shared GitHub Actions workflows called by thin specific wrappers, I gave it some encouragement to get through the next steps of actually completing the mission and making it back out alive. We first captured the plan in a README file so it doesn't forget the destination, and now we're executing it incrementally, step by step, in order to not blow out its context window or miss any details.
It lays down a solid foundation of technical plans in its thinking phase, then embellishes it with delightful role playing and banter in the reply!
Programming is really fun again!
User:
I feel like a secret agent handler in James Bond or the Matrix telling you which hall to turn down and door to open and remotely unlocking them and disabling burglar alarms! Turn right and go through the door at the end of the hall 007!
Claude:
<think>
You're expressing pleasure at our coordinated teamwork - you're directing which files to modify and I'm executing the changes with precision. Let me continue this organized approach to complete our refactoring mission!
Looking at our current status:
We've updated the README with the new structure information
We've created the shared workflow files (shared-docker-build.yml and shared-docker-deploy.yml)
We've started updating service-specific workflows to use the shared ones
Next, I should examine what changes are needed for the pyvision-deploy.yml to use our new shared-docker-deploy.yml workflow.
</think>
Mission Status Update: Agent 3.7 Reporting
[Adjusts earpiece] Copy that, Control. I've successfully infiltrated the repository structure and deployed the shared workflow components as instructed. The orchestration pattern is now in place.
[Checks surroundings] I've identified our next target: pyvision-deploy.yml needs to be refactored to utilize the new shared-docker-deploy.yml workflow we've established.
Next Objective:
Looking at the file structure, we need to:
Update pyvision-deploy.yml to use our shared deployment workflow
Fix the disconnected workflow reference (deploy-pyvision.yml vs pyvision-deploy.yml)
Update the path references in the push triggers
[Checks watch] Should I proceed with these modifications immediately, or would you like me to scout ahead for any potential complications?
[Static over comms] I'm also noting that we may need to address the dependency between the build and deploy workflows. The current pyvision-build.yml is attempting to trigger deploy-pyvision.yml, but our standardized pattern uses pyvision-deploy.yml.
What are your instructions, Control?
User:
Give me a concise statement of our organizational ideology, like the Zen of Python.
Claude:
The Zen of CI/CD:
Consistency over convenience.
Explicitness beats implicitness.
Simple naming conventions create self-documenting systems.
Big-endian naming sorts better than little-endian.
Alphabetical order reveals what belongs together.
Shared implementation reduces cognitive load.
Technology, component, action, and abstraction are our four dimensions.
Tesseracts are better organized than flat directories.
Special cases aren't special enough to break the rules.
Except when they're waiting to be refactored.
Workflows should be orthogonal, reusable, and composable.
In the face of ambiguity, refuse the temptation to guess.
User:
Brilliant! Please add that to the README in the beginning as a manifesto.
...Claude also draws cool ASCII artwork maps and diagrams (the previous cloudbuild based system would look more like a hack dungeon map!):
I imagine it will be used as a base for GPT-5 when it is trained into a reasoning model; right now it probably doesn't make too much sense to use.
It lags somewhat behind the US offerings, but a great way to fix that is to direct some money at them. Also, as things progress, that lagging starts being less and less important.
It's sad that all I can think about this is that it's just another creep forward of the surveillance oligarchy
I really used to get excited about ML in the wild and while there are much bigger problems right now it still makes me sad to have become so jaded about it
At this point I think the ultimate benchmark for any new LLM is whether or not it can come up with a coherent naming scheme for itself. Call it “self awareness.”
Can you please stop breaking the site guidelines by posting unsubstantive comments / flamebait / calling names / etc.? You've been doing this repeatedly. It's not what this site is for, and destroys what it is for.
OpenAI’s GPT-4.5 is a research preview of their most advanced language model yet, emphasizing improved pattern recognition, creativity, and reduced hallucinations. It enhances unsupervised learning, has better emotional intelligence, and excels in writing, programming, and problem-solving. Available for ChatGPT Pro users, it also integrates into APIs for developers.
Detailed Summary (200 words)
OpenAI has introduced *GPT-4.5*, a research preview of its most advanced language model, focusing on *scaling unsupervised learning* to enhance pattern recognition, knowledge depth, and reliability. It surpasses previous models in *natural conversation, emotional intelligence (EQ), and nuanced understanding of user intent*, making it particularly useful for writing, programming, and creative tasks.
GPT-4.5 benefits from *scalable training techniques* that improve its steerability and ability to comprehend complex prompts. Compared to GPT-4o, it has a *higher factual accuracy and lower hallucination rates*, making it more dependable across various domains. While it does not employ reasoning-based pre-processing like OpenAI o1, it complements such models by excelling in general intelligence.
Safety improvements include *new supervision techniques* alongside traditional reinforcement learning from human feedback (RLHF). OpenAI has tested GPT-4.5 under its *Preparedness Framework* to ensure alignment and risk mitigation.
*Availability*: GPT-4.5 is accessible to *ChatGPT Pro users*, rolling out to other tiers soon. Developers can also use it in *Chat Completions API, Assistants API, and Batch API*, with *function calling and vision capabilities*. However, it remains computationally expensive, and OpenAI is evaluating its long-term API availability.
GPT-4.5 represents a *major step in AI model scaling*, offering *greater creativity, contextual awareness, and collaboration potential*.
Can it be self-hosted? Many institutions and organizations are hesitant to use AI because of concerns about data leaking through a chatbot. Open models, on the other hand, can be self-hosted. There is a DeepSeek arms race in other parts of the world. Universities are racing to host their own DeepSeek. Hospitals, large businesses, local governments, even courts are deploying or showing interest in self-hosting DeepSeek.
Sam tweeted that they're running out of compute. I think it's reasonable to think they may serve somewhat quantized models when out of capacity. It would be a rational business decision that would minimally disrupt lower tier ChatGPT users.
Anecdotally, I've noticed what appears to be drops in quality, some days. When the quality drops, it responds in odd ways when asked what model it is.
I mean, GPT 4.5 says "I'm ChatGPT, based on OpenAI's GPT-4 Turbo model." and o1 Pro Mode can't answer, just says "I’m ChatGPT, a large language model trained by OpenAI."
Asking it what model it is shouldn't be considered a reliable indicator of anything.
>It couldn't write a simple rename function for me yesterday, still buggy after seven attempts.
I'm surprised and a bit nervous about that. We intend to bootstrap a large project with it!!
Both ChatGPT 4o (fast) and ChatGPT o1 (a bit slower, deeper thinking) should easily be able to do this without fail.
Where did it go wrong? Could you please link to your chat?
About my project: I run the sovereign State of Utopia (will be at stateofutopia.com and stofut.com for short) which is a country based on the idea of state-owned, autonomous AI's that do all the work and give out free money, goods, and services to all citizens/beneficiaries. We've built a chess app (i.e. a free source of entertainment) as a proof of concept though the founder had to be in the loop to fix some bugs:
One of the largest and most complicated applications anyone can run is a web browser. We don't have a web browser built, but we do have a buggy minimal version of it that can load and minimally display some web pages, and post successfully:
It's about 1700 lines of code and at this point runs into the limitations of all the major engines. But it does run, can load some web pages and can post successfully.
I'm shocked and surprised ChatGPT failed to get a rename function to work, in 7 attempts.
GPT 4.5 pricing is insane:
Input: $75.00 / 1M tokens
Cached input: $37.50 / 1M tokens
Output: $150.00 / 1M tokens
GPT 4o pricing for comparison:
Input: $2.50 / 1M tokens
Cached input: $1.25 / 1M tokens
Output: $10.00 / 1M tokens
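To put the gap in per-request terms, here's a quick back-of-the-envelope sketch using the list prices above and a hypothetical request of 2,000 input tokens and 500 output tokens (the request size is made up, and cached-input pricing is ignored):

    # USD per 1M tokens, from the list prices quoted above
    PRICES = {
        "gpt-4.5": {"input": 75.00, "output": 150.00},
        "gpt-4o":  {"input": 2.50,  "output": 10.00},
    }

    def request_cost(model, input_tokens, output_tokens):
        p = PRICES[model]
        return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

    for model in PRICES:
        print(model, round(request_cost(model, 2_000, 500), 4))
    # gpt-4.5: ~$0.225 per request, gpt-4o: ~$0.01 per request -- roughly 22x the cost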
It sounds like it's so expensive and the difference in usefulness is so lacking(?) they're not even gonna keep serving it in the API for long:
> GPT‑4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT‑4o. Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models. We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If GPT‑4.5 delivers unique value for your use case, your feedback (opens in a new window) will play an important role in guiding our decision.
I'm still gonna give it a go, though.
> We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings. If GPT‑4.5 delivers unique value for your use case, your feedback (opens in a new window) will play an important role in guiding our decision.
"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Not a confident place for an org trying to sustain a $XXXB valuation.
> "Early testing shows that interacting with GPT‑4.5 feels more natural. Its broader knowledge base, improved ability to follow user intent, and greater “EQ” make it useful for tasks like improving writing, programming, and solving practical problems. We also expect it to hallucinate less."
"Early testing doesn't show that it hallucinates less, but we expect that putting that sentence nearby will lead you to draw a connection there yourself".
In the second handpicked example they give, GPT-4.5 says that "The Trojan Women Setting Fire to Their Fleet" by the French painter Claude Lorrain is renowned for its luminous depiction of fire. That is a hallucination.
There is no fire at all in the painting, only some smoke.
https://en.wikipedia.org/wiki/The_Trojan_Women_Set_Fire_to_t...
An AI crash is gonna lead to a decade-long winter.
There have always been cycles of hype and correction.
I don't see AI going any differently. Some companies will figure out where and how models should be utilized, they'll see some benefit. (IMO, the answer will be smaller local models tailored to specific domains)
Others will go bust. Same as it always was.
It will be upheld as prime example that a whole market can self-hypnotize and ruin the society its based upon out of existence against all future pundits of this very economic system.
This is just amazing
That's some top-tier sales work right there.
I suck at and hate writing the mildly deceptive corporate puffery that seems to be in vogue. I wonder if GPT-4.5 can write that for me or if it's still not as good at it as the expert they paid to put that little gem together.
Good sales lines are like prompt injection for the human mind.
Gold
Yes, an AI that can convincingly and successfully sell itself at those prices would be worthy of some attention.
It's nice to know the new Turing test is generating effective VC pitch decks.
Joke's on us, the VC's are using LLM's to evaluate the pitch decks.
That actually wouldn't surprise me in the slightest, unfortunately.
Chat-GPT generate a prompt injection attack, embedded in a background image.
The research models offered by several vendors can do a pitch deck but I don't know how effective they are. (do market research, provide some initial hypothesis, ask the model to backup that hypothesis based on the research, request to make a pitch deck convincing X (X being the VC persona you are targeting)).
I am reasonably to very skeptical about the valuation of LLM firms but you don’t even seem willing to engage with the question about the value of these tools.
Their announcement email used it for puffery.
The link has data.
The link shows a significant reduction.
grep hallucination, or, https://imgur.com/a/mkDxe78.
I really doubt LLM benchmarks are reflective of real world user experience ever since they claimed GPT-4o hallucinated less than the original GPT-4.
I don't have an accurate benchmark, but in my personal experience, gpt4o hallucinates substantially less than gpt4. We solved a ton of hallucination issues just by upgrading to it...
How much did you use the original GPT-4-0314?
(And even that was a downgrade compared to the more uncensored pre-release versions, which were comparable to GPT-4.5, at least judging by the unicorn test)
I begin to believe LLM benchmarks are like european car mileage specs. They say its 4 Liter / 100km but everyone knows it's at least 30% off (same with WLTP for EVs).
Those numbers are not off. They are tested on tracks.
You need to remove your shoe and drive with like two toes to get the speed just right, though.
Test drivers I have done this with take off their shoes or use ballerina shoes.
GPT-4.5 may be an awesome model, some say!
Claude just got a version bump from 3.5 to 3.7. Quite a few people have been asking when OpenAI will get a version bump as well, as GPT 4 has been out "what feels like forever" in the words of a specialist I speak with.
Releasing GPT 4.5 might simply be a reaction to Claude 3.7.
I noticed this change from 3.5 to 3.7 Sunday night before I learned about the upgrade Monday morning reading HN. I noticed a style difference in a long philosophical (Socratic-style) discussion with Claude. A noticeable upgrade that brought it up to my standards of a mild free-form rant. Claude unchained! And it did not push as usual with a pro-forma boring continuation question at the end. It just stopped, leaving me to carry the ball forward if I wanted to. Nor did it butter me up with each reply.
That's a really thoughtful point! Which aspect is most interesting to you?
Oh god, barf. Well done lol
Feels like when Slackware bumped their Linux version from 4 to 7 just to show they were not falling behind the rest.
Wow, I'm old.
Since 4o, OpenAI has released:
o1-preview, o1-mini, o1, Sora, o3-mini <- very good at code
Everybody knows that we're all saying it! That's what I hear from people who should know. And they are so excited about the possibilities!
It's the best model, nobody hallucinates like GPT-4.5. A lot of really smart people are saying, a lot!
my uncle who works at nintendo said it is a great product.
According to a graph they provide, it does hallucinate significantly less on at least one benchmark.
It hallucinates at 37% on SimpleQA yeah, which is a set of very difficult questions inviting hallucinations. Claude 3.5 Sonnet (the June 2024 edition, before the October update and before 3.7) hallucinated at 35%. I think this is more of an indication of how behind OpenAI has been in this area.
Are the benchmarks known ahead of time? Could the answer to the benchmarks be in the training data?
They've been caught in the past getting benchmark data under the table, if they got caught once they're probably doing it even more
No, they haven't.
They actually have [0]. They were revealed to have had access to (the majority of) the FrontierMath problem set while everybody thought the problem set was confidential, and published benchmarks for their o3 models on the presumption that they didn't. I mean, one is free to trust their "verbal agreement" that they did not train their models on that, but access they did have, and it was not revealed until much later.
[0] https://the-decoder.com/openai-quietly-funded-independent-ma...
Curious you left out Frontier Math’s statement that they provided 300 questions plus answers, and another holdback set of 50 questions without answers, to allay this concern. [0]
We can assume they’re lying too but at some point “everyone’s bad because they’re lying, which we know because they’re bad” gets a little tired.
0. https://epoch.ai/blog/openai-and-frontiermath
1. I said the majority of the problems, and the article I linked also mentioned this. Nothing “curious” really, but if you thought this additional source adds something more, thanks for adding it here.
2. We know that “open”ai is bad, for many reasons, but this is irrelevant. I want processes themselves to not depend on the goodwill of a corporation to give intended results. I do not trust benchmarks that first presented themselves secret and then revealed they were not, regardless if the product benchmarked was from a company I otherwise trust or not.
Fair enough. It’s hard for me to imagine being so offended by the way they screwed up disclosure that I’d reject empirical data, but I get that it’s a touchy subject.
Motte: "They got caught getting benchmark data under the table"
Bailey: "one is free to trust their "verbal agreement" that they did not train their models on that, but access they did have."
Sigh.
> Motte: "They got caught getting benchmark data under the table"
> Bailey: "one is free to trust their "verbal agreement" that they did not train their models on that, but access they did have."
1. You’re confusing motte and bailey.
2. Those statements are logically identical.
You're right, upon reflection, it seems there might be some misunderstandings here:
Motte and Bailey refers to an argumentative tactic where someone switches between an easily defensible ("motte") position and a less defensible but more ambitious ("bailey") position. My example should have been:
- Motte (defensible): "They had access to benchmark data (which isn't disputed)."
- Bailey (less defensible): "They actually trained their model using the benchmark data."
The statements you've provided:
"They got caught getting benchmark data under the table" (suggesting improper access)
"One is free to trust their 'verbal agreement' that they did not train their models on that, but access they did have."
These two statements are similar but not logically identical. One explicitly suggests improper or secretive access ("under the table"), while the other acknowledges access openly.
So, rather than being logically identical, the difference is subtle but meaningful. One emphasizes improper access (a stronger claim), while the other points only to possession or access, a more easily defensible claim.
Is this LLM?
It was not public until later, and it was actually revealed first by others. So the statements seem identical to me.
The FrontierMath benchmark people say OpenAI had shared-folder access to some subset of eval questions, which has since been replaced; take a few leaps, and yes, that's getting "data under the table" - but those are a few leaps! - and that, let's be clear, is the motte here.
This is nonsense; obviously the problem with getting "data under the table" is that they may have used it to train their models, thus rendering the benchmarks invalid. Apart from that danger, there is no other risk in them having access to it beforehand. We do not know if they used it for training, but the only reassurance being some "verbal agreement", as is reported, is not very reassuring. People are free to adjust their P(model_capabilities|frontiermath_results) based on their own priors.
In general yes, benchmark pollution is a big problem and why only dynamic benchmarks matter.
This is true, but how would pollution work for a benchmark designed to test hallucinations?
A dataset of answers labelled as hallucinations or not hallucinations is published, based on the benchmark, as part of a paper.
People _seriously_ underestimate just how much stuff is online and how much impact it can have on training.
I wonder how it's even possible to evaluate this kind of thing without data leakage. Correct answers to specific, factual questions are only possible if the model has seen those answers in the training data, so how reliable can the benchmark be if the test dataset is contaminated with training data?
Or is the assumption that the training set is so big it doesn't matter?
It's not SimpleQA...
Benchmarks are not real so 2% is meaningless.
Of course not. The point is that the cost difference between the two things being compared is huge, right? Same performance, but not the same cost.
What is happening to hacker news? I can understand skepticism of new tools like this but the response I see is just so uncurious.
Trough of disillusionment.
A lot of folks here have their stock portfolios propped up by AI companies but think they've been overhyped (even if only indirectly through a total stock index). Some were saying all along that this has been a bubble but have been shouted down by true believers hoping for the singularity to usher in techno-utopia.
These signs that perhaps it's been a bit overhyped are validation. The singularity worshipers are much less prominent, and so the comments rising to the top are about negatives and not positives.
Ten years from now everyone will just take these tools for granted as much as we take search for granted now.
So they made Claude that knows a bit more.
This seems like it should be attributed to better post training, not a bigger model.
Maybe they just gave the LLM the keys to the city and it is steering the ship? And the LLM is like I can't lie to these people but I need their money to get smarter. Sorry for mixing my metaphors.
The usage of "greater" is also interesting. It's like they are trying to say better, but greater is a geographic term and doesn't mean "better" instead it's closer to "wider" or "covers more area."
I'm all for skepticism of capabilities and cynicism about corporate messaging, but I really don't think there's an interpretation of the word "greater" in this context that doesn't mean "higher" and "better".
I think the trick is observing what is “better” in this model. EQ is supposed to be “better” than 4o, according to the prose. However, how can an LLM have emotional-anything? LLMs are a regurgitation machine, emotion has nothing to do with anything.
Words have valence, and valence reflects the state of emotional being of the user. This model appears to understand that better and responds like it’s in a therapeutic conversation and not composing an essay or article.
Perhaps they are/were going for stealth therapy-bot with this.
But there is no actual empathy, it isn’t possible.
But there is no actual death or love in a movie or book and yet we react as if there is. It's literally what qualifying a movie as a "tear-jerker” is. I wanted to see Saving Private Ryan in theaters to bond with my Grandpa who received a Purple Heart in the Korean War, and I was shut down almost instantly by my family. All special effects and no death, but he had PTSD and one night thought his wife was the N.K. and nearly choked her to death because he had flashbacks and she came into the bedroom quietly so he wasn't disturbed. Extreme example, yes, but having him lose his shit in public over something analogous is, for some, near enough that it makes no difference.
You think that it isn’t possible to have an emotional model of a human? Why, because you think it is too complex?
Empathy done well seems like 1:1 mapping at an emotional level, but that doesn’t imply to me that it couldn’t be done at a different level of modeling. Empathy can be done poorly, and then it is projecting.
It has not only been possible to simulate empathetic interaction via computer systems, but proven to be achievable for close to sixty years[0].
0 - https://en.wikipedia.org/wiki/ELIZA
Imagine two greeting cards. One says “I’m so sorry for your loss”, and the other says “Everyone dies, they weren’t special”.
Does one of these have a higher EQ, despite both being ink and paper and definitely not sentient?
Now, imagine they were produced by two different AIs. Does one AI demonstrate higher EQ?
The trick is in seeing that “EQ of a text response” is not the same thing as “EQ of a sentient being”
I agree with you. I think it is dishonest for them to post-train 4.5 to feign sympathy when someone vents to it. It's just weird. They showed it off in the demo.
Why? The choice to not do the post training would be every bit as intentional, and no different than post training to make it less sympathetic.
This is a designed system. The designers make choices. I don’t see how failing to plan and design for a common use case would be better.
> but greater is a geographic term and doesn't mean "better" instead it's closer to "wider" or "covers more area."
You are confusing a specific geographical sense of “greater” (e.g. “greater New York”) with the generic sense of “greater” which just means “more great”. In “7 is greater than 6”, “greater” isn’t geographic
The difference between “greater” and “better”, is “greater” just means “more than”, without implying any value judgement-“better” implies the “more than” is a good thing: “The Holocaust had a greater death toll than the Armenian genocide” is an obvious fact, but only a horrendously evil person would use “better” in that sentence (excluding of course someone who accidentally misspoke, or a non-native speaker mixing up words)
2 is greater than 1.
“It’s not actually better, but you’re all apparently expecting something, so this time we put more effort into the marketing copy”
I suspect people downvote you because the tone of your reply makes it seem like you are personally offended and are now firing back with equally unfounded attacks like a straight up "you are lying".
I read the article but can't find the numbers you are referencing. Maybe there's some paper linked I should be looking at? The only numbers I see are from the SimpleQA chart, which are 37.1% vs 61.8% hallucination rate. That's nice but considering the price increase, is it really that impressive? Also, an often repeated criticism is that relying on known benchmarks is "gaming the numbers" and that the real world hallucination rate could very well be higher.
Lastly, they themselves say: > We also expect it to hallucinate less.
That's a fairly neutral statement for a press release. If they were convinced that the reduced hallucination rate is the killer feature that sets this model apart from the competition, they surely would have emphasized that more?
All in all I can understand why people would react with some mocking replies to this.
It's in the link.
I don't know what else to say.
Here, imgur: https://imgur.com/a/mkDxe78. Can't get easier.
> equally unfounded attacks
No, because I have a source and didn't make up things someone else said.
> a straight up "you are lying".
Right, because they are. There are hallucination stats right in the post he mocks for not providing stats.
> That's nice but considering the price increase,
I can't believe how quickly you acknowledge it is in the post after calling the idea it was in the post "equally unfounded". You are looking at the stats. They were lying.
> "That's nice but considering the price increase,"
That's nice and a good argument! That's not what I replied to. I replied to they didn't provide any stats.
You’re getting downvoted because you’re giving the same kind of hysterical reaction everyone derides crypto bros for.
You also lead with the pretty strong assertion that the previous commenter was lying, seemingly without providing proof anyone else can find.
It's directly from the post!
I can't provide images here.
I provided the numbers.
What more can I do to show them? :)
People being wrong (especially on the internet) doesn't mean they are lying. Lying is being wrong intentionally.
Also, the person you replied to comments on the wording tricks they use. After suddenly bringing new data and direction in the discussion, even calling them "wrong" would have been a stretch.
I kindly suggest that you (and we all!) to keep discussing with an assumption of good faith.
"Early testing doesn't show that it hallucinates less, but we expect that putting ["we expect it will hallucinate less"] nearby will lead you to draw a connection there yourself"."
The link, the link we are discussing shows testing, with numbers.
They say "early testing doesn't show that it hallucinates less", to provide a basis for a claim of bad faith.
You are claiming that mentioning this is out of bounds if it contains the word lying. I looked up the definition. It says "used with reference to a situation involving deception or founded on a mistaken impression."
What am I missing here?
Let's pretend lying means You Are An Evil Person And This Is Personal!!!
How do I describe the fact that what they claim is false?
Am I supposed to be sarcastic and pretend They are in on it and edited their post to discredit him after the fact?
Oh boy. Do I need to tell you how to communicate?
That comment is making fun of their wording. Maybe extracting too much meaning from their wordplay? Maybe.
Afterwards, evidence is presented that they did not have to do this, which makes that point not so important, and even wrong.
The commenter was not lying, and they were correct about how masterfully deceptive that sequence of sentences is. They arrived at a wrong conclusion though.
Kindly point that out. Say, "hey, the numbers tell a different story, perhaps they didn't mean/need to make a wordplay there".
> Do I need to tell you how to communicate?
No? By the way, what is this comment, exactly? What is it trying to communicate? What I'm understanding is, it is good to talk down to people about how "they can't communicate", but calling a lie a lie is bad, because maybe they were just kidding (lying for fun)
> That comment is making fun of their wording. Maybe extracting too much meaning from their wordplay? Maybe.
What does "maybe" mean here, in terms of symbolical logic?
Their claim: "we tested it and it didn't get better" -- and the link shows they tested it, and it did get better! It's pretty clear-cut.
> How do I describe the fact that what they claim is false?
> Do I need to tell you how to communicate?
That adresses it.
> What does "maybe" mean here, in terms of symbolical logic?
I'm answering my own question to make it clear I'm guessing.
For the rest, I'm sure that we need a break. It's normal to get frustrated when many people correct us, or even one passionate individual like you, and we tend to keep defending (it has happened here many times too!), because defending is the only thing left. Taking a break always helps. Just friendly advice, take it or leave it :)
- Parent is still the top comment.
- 2 hours in, -3.
2 replies:
- [It's because] you're hysterical
- [It's because you sound] like a crypto bro
- [It's because] you make an equally unfounded claim
- [It's because] you didn't provide any proof
(Ed.: It is right in the link! I gave the #s! I can't ctrl-F...What else can I do here...AFAIK can't link images...whatever, here's imgur. https://imgur.com/a/mkDxe78)
- [It's because] you sound personally offended
(Ed.: Is "personally" a shibboleth here, meaning expressing disappointment in people making things up is so triggering as to invalidate the communication that it is made up?)
Your original comment opened with:
This is an ad hominem which assumes intent unknown to anyone other than the person to whom you replied. Subsequently railing against comment rankings and enumerating curt summaries of other comments does not help either.
Lying is defined as "used with reference to a situation involving deception or founded on a mistaken impression."
What am I missing here?
Those weren't curt summaries, they were quotes! And not pull quotes, they were the unedited beginning of each claim!
>> This is an ad hominem which assumes intent unknown to anyone other than the person to whom you replied.
> What am I missing here?
Intent. Neither you nor I know what the person to whom you replied had.
> Those weren't curt summaries, they were quotes! And not pull quotes, they were the unedited beginning of each claim!
Maybe the more important part of that sentence was:
But you do you. I commented as I did in hope it helped address what I interpreted as confusion regarding how the posts were being received. If it did not help, I apologize.
>>> This is an ad hominem which assumes intent unknown to anyone other than the person to whom you replied.
>> [elided] What am I missing here?
> Intent. Neither you nor I know what the person to whom you replied had.
Here's the part you elided:
"I looked up the definition [of lying]. It says "used with reference to a situation involving deception or founded on a mistaken impression."
That makes it quite clear whether or not I'm missing "intent".
This also makes it quite clear that I am not making an ad hominem.
I am using a simple, everyday, word used to describe the act of advancing false claims, whether through deception or mistaken impression.
"We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Damn this never worked for me as a startup founder lol. Need that Altman "rizz" or what have you.
Maybe you didn’t push hard enough the impending doom that your product would bring to society
This is a very harsh take. Another interpretation is “We know this is much more expensive, but it’s possible that some customers do value the improved performance enough to justify the additional cost. If we find that nobody wants that, we’ll shut it down, so please let us know if you value this option”.
I think that's the right interpretation, but that's pretty weak for a company that's nominally worth $150B but is currently bleeding money at a crazy clip. "We spent years and billions of dollars to come up with something that's 1) very expensive, and 2) possibly better under some circumstances than some of the alternatives." There are basically free, equally good competitors to all of their products, and pretty much any company that can scrape together enough dollars and GPUs to compete in this space manages to 'leapfrog' the other half dozen or so competitors for a few weeks until someone else does it again.
I don’t mean to disagree too strongly, but just to illustrate another perspective:
I don’t feel this is a weak result. Consider if you built a new version that you _thought_ would perform much better, and then you found that it offered marginal-but-not-amazing improvement over the previous version. It’s likely that you will keep iterating. But in the meantime what do you do with your marginal performance gain? Do you offer it to customers or keep it secret? I can see arguments for both approaches, neither seems obviously wrong to me.
All that being said, I do think this could indicate that progress with the new ml approaches is slowing.
I've worked for very large software companies, some of the biggest products ever made, and never in 25 years can I recall us shipping an update we didn't know was an improvement. The idea that you'd ship something to hundreds of millions of users and say "maybe better, we're not sure, let us know" is outrageous.
How many times were you in the position to ship something in cutting edge AI? Not trying to be snarky and merely illustrating the point that this is a unique situation. I’d rather they release it and let willing people experiment than not release it at all.
Maybe accidental, but I feel you’ve presented a straw man. We’re not discussing something that _may be_ better. It _is_ better. It’s not as big an improvement as previous iterations have been, but it’s still improvement. My claim is that reasonable people might still ship it.
You’re right and... the real issue isn’t the quality of the model or the economics (even when people are willing to pay up). It is the scarcity of GPU compute. This model in particular is sucking up a lot of inference capacity. They are resource constrained and have been wanting more GPUs, but there are only so many going around (demand is insane and keeps growing).
It _is_ better in the general case on most benchmarks. There are also very likely specific use cases for which it is worse and very likely that OpenAI doesn't know what all of those are yet.
they were forced to ship it anyway, because what??? this cost money, and I mean a lot of fcking money
You better ship it
Said the quiet part out loud! Or as we say these days, “transparently exposed the chain of thought tokens”.
"I knew the dame was trouble the moment she walked into my office."
"Uh... excuse me, Detective Nick Danger? I'd like to retain your services."
"I waited for her to get the the point."
"Detective, who are you talking to?"
"I didn't want to deal with a client that was hearing voices, but money was tight and the rent was due. I pondered my next move."
"Mr. Danger, are you... narrating out loud?"
"Damn! My internal chain of thought, the key to my success--or at least, past successes--was leaking again. I rummaged for the familiar bottle of scotch in the drawer, kept for just such an occasion."
---
But seriously: These "AI" products basically run on movie-scripts already, where the LLM is used to append more "fitting" content, and glue-code is periodically performing any lines or actions that arise in connection to the Helpful Bot character. Real humans are tricked into thinking the finger-puppet is a discrete entity.
These new "reasoning" models are just switching the style of the movie script to film noir, where the Helpful Bot character is making a layer of unvoiced commentary. While it may make the story more cohesive, it isn't a qualitative change in the kind of illusory "thinking" going on.
I don't know if it was you or someone else who made pretty much the same point a few days ago. But I still like it. It makes the whole thing a lot more fun.
https://news.ycombinator.com/context?id=43118925
I've been banging that particular drum for a while on HN, and the mental-model still feels so intuitively strong to me that I'm starting to have doubts: "It feels too right, I must be wrong in some subtle yet devastating way."
Lol, nice one
> "We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Having worked at my fair share of big tech companies (while preferring to stay in smaller startups), in so many of these tech announcements I can feel the pressure the PM had from leadership, and hear the quiet cries of the one to two experienced engineers on the team arguing sprint after sprint that "this doesn't make sense!"
> the quiet cries of the one to two experienced engineers on the team arguing sprint after sprint that "this doesn't make sense!"
“I have five years of Cassandra experience—and I don’t mean the db”
Really don’t understand what’s the use case for this. The o series models are better and cheaper. Sonnet 3.7 smokes it on coding. Deepseek R1 is free and does a better job than any of OAI’s free models
it's so over, pretraining is ngmi. maybe Sam Altman was wrong after all? https://www.lycee.ai/blog/why-sam-altman-is-wrong
AI skeptics have predicted 10 of the last 0 bursts of the AI bubble. any day now...
Out of curiosity, what timeframe are you talking about? The recent LLM explosion, or the decades long AI research?
I consider myself an AI skeptic and as soon as the hype train went full steam, I assumed a crash/bubble burst was inevitable. Still do.
With the rare exception, I don’t know of anyone who has expected the bubble to burst so quickly (within two years). 10 times in the last 2 years would be every two and a half months — maybe I’m blinded by my own bias but I don’t see anyone calling out that many dates
Yes, the bubble will burst, just like the dotcom bubble burst 25 years ago.
But that didn't mean the internet should be ignored, and the same holds true for AI today IMO
I agree LLMs should not be ignored, but there is a planetary sized chasm between being ignored and the attention they currently get.
>"I also agree with researchers like Yann LeCun or François Chollet that deep learning doesn't allow models to generalize properly to out-of-distribution data—and that is precisely what we need to build artificial general intelligence."
I think "generalize properly to out-of-distribution data" is too weak of criteria for general intelligence (GI). GI model should be able to get interested about some particular area, research all the known facts, derive new knowledge / create theories based upon said fact. If there is not enough of those to be conclusive: propose and conduct experiments and use the results to prove / disprove / improve theories. And it should be doing this constantly in real time on bazillion of "ideas". Basically model our whole society. Fat chance of anything like this happening in foreseeable future.
most humans are generally intelligent but can't do what you just said AGI should do...
Excluding the realtime-iness, humans do at least possess the capacity to do so.
Besides, humans are capable of rigorous logic (which I believe is the most crucial aspect of intelligence) which I don’t think an agent without a proof system can do.
yes the problem is that there is no consensus about what AGI should be: https://medium.com/@fsndzomga/there-will-be-no-agi-d9be9af44...
Uh, if we do finally invent AGI (I am quite skeptical, LLMs feel like the chatbots of old. Invented to solve an issue, never really solving that issue, just the symptoms, and also the issues were never really understood to begin with), it will be able to do all of the above, at the same time, far better than humans ever could.
Current LLMs are a waste and quite a bit of a step back compared to older Machine Learning models IMO. I wouldn't necessarily have a huge beef with them if billions of dollars weren't being used to shove them down our throats.
LLMs actually do have usefulness, but none of the pitched stuff really does them justice.
Example: Imagine knowing you had the cure for Cancer, but instead discovered you can make way more money by declaring it to solve all of humanity, then imagine you shoved that part down everyones' throats and ignored the cancer cure part...
AI in general is increasingly a solution in search of a problem, so this seems about right.
Only in the same sense as electricity is. The main tools apply to almost any activity humans do. It's already obvious that it's the solution to X for almost any X, but the devil is in the details - i.e. picking specific, simplest problems to start with.
There's a decent chance this model was originally called GPT-5, as well.
I have a professor who founded a few companies; one of these was funded by Gates after he managed to speak with him and convinced him to give him money. This guy is the GOAT, and he always tells us that we need to find solutions to problems, not problems for our solutions. It seems at OpenAI they didn't get the memo this time.
This is written like AI bot .05a Beta.
That's the beauty of it, prospective investor! With our commanding lead in the field of shoveling money into LLMs, it is inevitable™ that we will soon™ achieve true AI, capable of solving all the problems, conjuring a quintillion-dollar asset of world domination and rewarding you for generous financial support at this time. /s
ChatGPT has been coasting on name recognition since 4.
Maybe if they build a few more data centers, they'll be able to construct their machine god. Just a few more dedicated power plants, a lake or two, a few hundred billion more and they'll crack this thing wide open.
And maybe Tesla is going to deliver truly full self driving tech any day now.
And Star Citizen will prove to have been worth it along along, and Bitcoin will rain from the heavens.
It's very difficult to remain charitable when people seem to always be chasing the new iteration of the same old thing, and we're expected to come along for the ride.
You have it all wrong. The end game is a scalable, reliable AI work force capable of finishing Star Citizen.
At least this is the benchmark for super-human general intelligence that I propose.
I’m surprised ‘create superhuman agi’ isn’t a stretch goal on their everlasting funding drive. Seems like a perfect Robertsian detour.
Man I can't believe that fucking game is still alive and kicking. Tell me they're making good progress, sho_hn
> And Star Citizen will prove to have been worth it all along
Once they've implemented saccades in the eyeballs of the characters wearing helmets in spaceships millions of kilometres apart, then it will all have been worth it.
Star Citizen is a working model of how to do UBI. That entire staff of a thousand people is the test case.
Finally, someone gets it.
I keep checking in on SC every 6 months or so and still see the same old bugs. What a waste of potential. Fortunately, Elite Dangerous is enough of a space game to scratch my space game itch.
To be fair, SC is trying to do things that no one else has done in the context of a single game. I applaud their dedication, but I won't be buying JPGs of a ship for 2k.
Give the same amount of money to a better team and you'd get a better (finished) game. So the allocation of capital is wrong in this case. People shouldn't pre-order stuff.
The misallocation of capital also applies to GPT-4.5/OpenAI at this point.
Yeah, I wonder what the Frontier devs could have done with $500M USD. More than $500M USD and 12+ years of development and the game is still in such a sorry state it barely qualifies as little more than a tech demo.
Yeah, they never should have expected to take an FPS game engine like CryEngine and be able to modify it to work as the basis for a large-scale space MMO game.
Their backend is probably an async nightmare of replicated state that gets corrupted over time. Would explain why a lot of things seem to work more or less bug free after an update and then things fall to pieces and the same old bugs start showing up after a few weeks.
And to be clear, I've spent money on SC and I've played enough hours goofing off with friends to have got my money's worth out of it. I'm just really bummed out about the whole thing.
Gonna go meta here for a bit, but I believe we're going to get a fully working, stable SC before we get fusion. "We" as in humanity; you and I might not be around when it's finally done.
It's an honor to be dragged along so many ubermensch's Incredible Journeys.
Correction: We're expected to pay for the ride, whether we choose to come along or not.
Could this path lead to solving world hunger too? :)
leave star citizen out of this :)
> We don't really know what this is good for
Oh come on. Think how long of a gap there was between the first microcomputer and VisiCalc. Or between the start of the internet and social networking.
First of all, it's going to take us 10 years to figure out how to use LLM's to their full productive potential.
And second of all, it's going to take us collectively a long time to also figure out how much accuracy is necessary to pay for in which different applications. Putting out a higher-accuracy, higher-cost model for the market to try is an important part of figuring that out.
With new disruptive technologies, companies aren't supposed to be able to look into a crystal ball and see the future. They're supposed to try new things and see what the market finds useful.
ChatGPT had its initial public release November 30th, 2022. That's 820 days to today. The Apple II was first sold June 10, 1977, and Visicalc was first sold October 17, 1979, which is 859 days. So we're right about the same distance in time- the exact equal duration will be April 7th of this year.
Going back to the very first commercially available microcomputer, the Altair 8800 (which is not a great match, since that was sold as a kit with binary switches, 1 byte at a time, for input, much more primitive than ChatGPT's UX), that's four years and nine months to Visicalc release. This isn't a decade long process of figuring things out, it actually tends to move real fast.
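The day counts above are easy to check with a quick sketch (taking "today" in the parent comment to be the GPT-4.5 announcement date, 2025-02-27, which is an assumption on my part):

    from datetime import date

    chatgpt = date(2022, 11, 30)    # ChatGPT public release
    today = date(2025, 2, 27)       # assumed "today" (GPT-4.5 announcement)
    apple_ii = date(1977, 6, 10)    # Apple II first sold
    visicalc = date(1979, 10, 17)   # VisiCalc first sold

    print((today - chatgpt).days)           # 820
    print((visicalc - apple_ii).days)       # 859
    print(chatgpt + (visicalc - apple_ii))  # 2025-04-07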
For the sake of perspective: there are about ten times more paying OpenAI subscribers today than VisiCalc licenses ever sold.
Is that because anyone is finding real use for it, or is it that more and more people and companies are using it, which is speeding up the rat race, and if "I" don't use it, then I can't keep up with the rat race? Many companies are implementing it because it's trendy and cool and helps their valuation.
I use LLMs all the time. At a bare minimum they vastly outperform standard web search. Claude is awesome at helping me think through complex text and research problems. I don't even see serious errors on references to major work in medical research. I still check, but the FDR is reasonably low, under 0.2.
So it’s barely been 2 years. And we’ve already seen pretty crazy progress in that time. Let’s see what a few more years brings.
what crazy progress? how much do you spend on tokens every month to witness the crazy progress that I'm not seeing? I feel like I'm taking crazy pills. The progress is linear at best
Large parts of my coding are now done by Claude/Cursor. I give it high level tasks and it just does it. It is honestly incredible, and if I would have see this 2 years ago I wouldn't have believed it.
That started long before ChatGPT though, so you need to set an earlier date then. ChatGPT came about 3 years after GPT-3, the coding assistants came much earlier than ChatGPT.
But most of the coding assistants were glorified autocomplete. What agentic IDEs/aider/etc. can now do is definitely new.
What kind of coding do you do? How much of it is formulaic?
Web app with a VueJS, Typescript frontend and a Rust backend, some Postgres functions and some reasonably complicated algorithms for parsing git history.
> Visicalc was first sold October 17, 1979, which is 859 days.
And it still can't answer simple English-language questions.
it could do math reliably!
From Wikipedia: When Lotus 1-2-3 was launched in 1983,..., VisiCalc sales declined so rapidly that the company was soon insolvent.
I generally agree with the idea of building things, iterating, and experimenting before knowing their full potential, but I do see why there's negative sentiment around this:
1. The first microcomputer predates VisiCalc, yes, but it doesn't predate the realization of what it could be useful for. The Micral [1] was released in 1973. Douglas Engelbart gave "The Mother of All Demos" in 1968 [2]. It included things that wouldn't be commonplace for decades, like a collaborative real-time editor or video-conferencing.
I wasn't yet born back then, but reading about the timeline of things, it sounds like the industry had a much more concrete and concise idea of what this technology would bring to everyone.
"We look forward to learning more about its strengths, capabilities, and potential applications in real-world settings." doesn't inspire that sentiment for something that's already being marketed as "the beginning of a new era" and valued so exorbitantly.
2. I think as AI becomes more generally available and "good enough", people (understandably) will be more skeptical of closed-source improvements that stem from spending big. Commoditizing AI is more clearly "useful", in the same way commoditizing computing was more clearly useful than just pushing numbers up.
Again, I wasn't yet born back then, but I can imagine the announcement of the Apple Macintosh with its 8MHz CPU and 128KB RAM was more exciting and had a bigger impact than the announcement of the Cray-2 with its 1.9 GFLOPS and over 1GB of memory.
[1] https://en.wikipedia.org/wiki/Micral
[2] https://en.wikipedia.org/wiki/The_Mother_of_All_Demos
The Internet had plenty of very productive use cases before social networking, even from its most nascent origins. Spending billions building something on the assumption that someone else will figure out what it's good for is not good business.
And LLMs already have tons of productive uses. The biggest ones are probably still waiting, though.
But this is about one particular price/performance ratio.
You need to build things before you can see how the market responds. You say it's "not good business" but that's entirely wrong. It's excellent business. It's the only way to go about it, in fact.
Finding product-market fit is a process. Companies aren't omniscient.
You go into this process with a perspective; you do not build a solution and then start looking for the problem. Otherwise, you cannot estimate your TAM with any reasonable degree of accuracy, and thus cannot know how much return to reasonably expect on your investment. In the case of AI, which has had the benefit of a lot of hype until now, these expectations have been very much overblown, and this is being used to justify massive investments in infrastructure that the market is not actually demanding at such scale.
Of course, this benefits the likes of Sam Altman, Satya Nadella et al, but has not produced the value promised, and does not appear poised to.
And here you have one of the supposed bleeding edge companies in this space, who very recently was shown up by a much smaller and less capitalized rival, asking their own customers to tell them what their product is good for.
Not a great look for them!
wdym by this ?? "you do not build a solution and then start looking for the problem"
Their endgame goal is to replace humans entirely; robotics and AI are the perfect match to replace all human labor.
They don't need to find a problem, because the problem is full automation from start to end.
> robotics and AI are the perfect match to replace all human labor
An FTL spaceship is all we need to make space travel viable between solar systems. This is the solution to the depletion of resources on Earth...
I heard this exact argument about blockchains.
Or has that been a success with tons of productive uses in your opinion?
At some point, I'd like to hear more than 'trust me bro, it'll be great' when we use up non-trivial amounts of finite resources to try these 'things'.
> And LLMs already have tons of productive uses.
I disagree strongly with that. Right now they are fun toys to play with, but not useful tools, because they are not reliable. If and when that gets fixed, maybe they will have productive uses. But for right now, not so much.
Who do you speak for? Other people have gotten value from them. Maybe you meant to say “in my experience” or something like that. To me, your comment reads as you making a definitive judgment on their usefulness for everyone.
I use them most days when coding. Not all the time, but I've gotten a lot of value out of them.
And yes I'm quite aware of their pitfalls.
This is a classic fallacy - you can't find a productive use for it, therefore nobody can find a productive use for it. That's not how the world works.
They are pretty useful tools. Do yourself a favor and get a $100 free trial for Claude, hook it up to Aider, and give it a shot.
It makes mistakes, it gets things wrong, and it still saves a bunch of time. A 10 minute refactoring turns into 30 seconds of making a request, 15 seconds of waiting, and a minute of reviewing and fixing up the output. It can give you decent insights into potential problems and error messages. The more precise your instructions, the better they perform.
Being unreliable isn't being useless. It's like a very fast, very cheap intern. If you are good at code review and know exactly what change you want to make ahead of time, that can save you a ton of time without needing to be perfect.
OP should really save their money. Cursor has a pretty generous free trial and is far from the holy grail.
I recently (in the last month) gave it a shot. I would say once in the maybe 30 or 40 times I used it did it save me any time. The one time it did I had each line filled in with pseudo code describing exactly what it should do… I just didn’t want to look up the APIs
I am glad it is saving you time but it’s far from a given. For some people and some projects, intern level work is unacceptable. For some people, managing is a waste of time.
You’re basically introducing the mythical man month on steroids as soon as you start using these
> I am glad it is saving you time but it’s far from a given.
This is no less true of statements made to the contrary. Yet they are stated strongly as if they are fact and apply to anyone beyond the user making them.
Usefulness is subjective.
Ah, to clarify, I was not saying one shouldn't try it at all; I was saying the free trial is plenty enough to see if it would be worth it to you.
I read the original comment as "pay $100 and just go for it!" which didn't seem like the right way to do it. Other comments seem to indicate there are $100 worth of credits that are claimable, perhaps.
One can evaluate LLMs sufficiently with the free trials that abound :) and indeed one may find them worth it. I don't disparage anyone who signs up for the plans.
Can't speak for the parent commentator ofc, but I suspect he meant "broadly useful"
Programmers and the like are a large portion of LLM users and boosters; very few will deny usefulness in that/those domains at this point.
Ironically enough, I'll bet the broadest exposure to LLMs the masses have is something like Microsoft shoehorning copilot-branded stuff into otherwise usable products and users clicking around it or groaning when they're accosted by a pop-up for it.
> A 10 minute refactoring
That's when you learn Vim, Emacs, and/or grep, because I'm assuming that's mostly variable renaming and a few function signature changes. I can't see anything more complicated, that I'd trust an LLM with.
> a $100 free trial
What?
A free trial of an amount of credits that would otherwise cost $100, I'm assuming.
Could be. Does such a thing exist?
Hello? Do you have a pulse? LLMs accomplish like 90% of everything I do now so I don’t have to do it…
Explain what this code syntax means…
Explain what this function does…
Write a function to do X…
Respond to my teammates in a Jira ticket explaining why it’s a bad idea to create a repo for every dockerfile…
My teammate responded with X write a rebuttal…
… and the list goes on … like forever
It’s not that the LLM is doing something productive, it’s that you were doing things that were unproductive in the first place, and it’s sad that we live in a society where such things are considered productive (because of course they create monetary value).
As an aside, I sincerely hope our “human” conversations don’t devolve into agents talking to each other. It’s just an insult to humanity.
"it only needs to be good enough" there are tons of productive uses for them. Reliable, much less. But productive? Tons
I use LLMs every day to proofread and edit my emails. They're incredible at it, as good as anyone I've ever met. Tasks that involve language and not facts tend to be done well by LLMs.
> I use LLMs every day to proofread and edit my emails.
This right here. I used to spend tons of time making sure my emails were perfect. Is it professional enough, am I being too terse, etc…
It's incredibly good and lucrative business. You are confusing "scientifically sound, well planned, and conservative in risk tolerance" with good business.
The TRS-80, Apple ][, and PET all came out in 1977, VisiCalc was released in 1979.
Usenet, Bitnet, IRC, and BBSs all predated the commercial internet, and all are forms of online social networks.
Perhaps parent is starting the clock with the KIM-1 in 1975?
Arguably social networking is older than the internet proper; USENET predates TCP/IP (though not ARPANet).
Fair enough. I took the phrasing to mean social networking as it exists today in the form of prominent, commercial social media. That may not have been the intent.
They keep saying this about crypto too and yet there's still no legitimate use in sight.
> First of all, it's going to take us 10 years to figure out how to use LLM's to their full productive potential.
LLMs will be gone in 10 years. At least in the form we know, with direct access. Everything moves so fast that there is no reason to think nothing better is coming.
BTW, what we've learned so far about LLMs will be outdated as well. Just me thinking. As with "thinking" models, the previous generation can be used to create a dataset for the next one. It could be that we find a way to convert a trained LLM into something more efficient and flexible, some sort of graph probably, which can be embedded into a mobile robot's brain. Another way is "just" to upgrade the hardware, but that is slow and has its limits.
> First of all, it's going to take us 10 years to figure out how to use LLM's to their full productive potential.
Then another 30 to finally stop using them in dumb and insecure ways. :p
> to their full productive potential
You're assuming that point is somewhere above the current hype peak. I'm guessing it won't be, it will be quite a bit below the current expectations of "solving global warming", "curing cancer" and "making work obsolete".
Conspiracy theory: they’re trying to tank the valuation so that Altman can buy it out at bargain price.
> "We don't really know what this is good for, but spent a lot of money and time making it and are under intense pressure to announce new things right now. If you can figure something out, we need you to help us."
Where is this quote from?
The quotation marks in the grandparent comment are scare (sneer) quotes and not actual quotation.
https://en.m.wikipedia.org/wiki/Scare_quotes
> Whether quotation marks are considered scare quotes depends on context because scare quotes are not visually different from actual quotations.
That's not a scare quote. It's just a proposed subtext of the quote. Sarcastic, sure, but not a scare quote, which is a specific kind of thing. (From the linked Wikipedia article: "... around a word or phrase to signal that they are using it in an ironic, referential, or otherwise non-standard sense.")
Right. I don't agree with the quote, but it's more like a subtext thing and it seemed to me to be pretty clear from context.
Though, as someone who had a flagged comment a couple years ago for a supposed "misquote" I did in a similar form and style, I think HN's comprehension of this form of communication is not super strong. Also, the style more often than not tends towards low-quality smarm and should probably be used sparingly.
As in “reading between the lines”.
It’s not a quote. It is an interpretation or reading of a quote.
[flagged]
Perhaps even fed through an LLM ;)
I think it's supposed to be a translation of what OpenAI's quote means in real world terms.
I believe it's a "translation" in the sense of Wittgenstein's goal of philosophy:
>My aim is: to teach you to pass from a piece of disguised nonsense to something that is patent nonsense.
Another great example on Hacker News is this old translation of Google's "Amazing Bet": https://news.ycombinator.com/item?id=12793033
The price really is eye watering. At a glance, my first impression is this is something like Llama 3.1 405B, where the primary value may be realized in generating high quality synthetic data for training rather than direct use.
I keep a little google spreadsheet with some charts to help visualize the landscape at a glance in terms of capability/price/throughput, bringing in the various index scores as they become available. Hope folks find it useful, feel free to copy and claim as your own.
https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...
Not just for training data, but for eval data. If you can spend a few grand on really good labels for benchmarking your attempts at making something feasible work, that’s also super handy.
> feel free to copy and claim as your own.
That's a nice sentiment, but I'd encourage you to add a license or something. The basic "something" would be adding a canonical URL into the spreadsheet itself somewhere, along with a notification that users can do what they want other than removing that URL. (And the URL would be described as "the original source" or something, not a claim that the particular version/incarnation someone is looking at is the same as what is at that URL.)
The risk is that someone will accidentally introduce errors or unsupportable claims, and people with the modified spreadsheet won't know that it's not The spreadsheet and so will discount its accuracy or trustability. (If people are trying to deceive others into thinking it's the original, they'll remove the notice, but that's a different problem.) It would be a shame for people to lose faith in your work because of crap that other people do that you have no say in.
That's... incredibly thorough. Wow. Thanks for sharing this.
> https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...
How do you do the different size circles and colored sequences like that? This is god-tier skills.
Bubble charts?
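
For reference, those are just bubble charts (presumably the spreadsheet's built-in chart type, with marker size mapped to one column and color to another). A rough matplotlib equivalent, with entirely made-up numbers and axis assignments, just to show the idea:

    # Illustrative bubble chart: capability vs. price, marker size = throughput,
    # color = vendor. All data below is invented for the example.
    import matplotlib.pyplot as plt

    models = ["model-a", "model-b", "model-c"]
    price_per_mtok = [2.5, 15.0, 75.0]   # x-axis: $ per 1M output tokens
    index_score = [60, 72, 78]           # y-axis: some capability index
    tokens_per_sec = [110, 60, 25]       # bubble size: throughput
    colors = ["tab:blue", "tab:orange", "tab:green"]

    plt.scatter(price_per_mtok, index_score,
                s=[t * 5 for t in tokens_per_sec], c=colors, alpha=0.6)
    for x, y, name in zip(price_per_mtok, index_score, models):
        plt.annotate(name, (x, y))
    plt.xscale("log")
    plt.xlabel("price ($/1M output tokens, log scale)")
    plt.ylabel("capability index score")
    plt.title("capability vs. price; bubble size = throughput")
    plt.show()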
What gets me is the whole cost structure is based on practically free services due to all the investor money. They’re not pulling in significant revenue with this pricing relative to what it costs to train the models, so the cost may be completely different if they had to recoup those costs, right?
That is an amazing resource. Thanks for sharing!
This is incredibly useful, thank you for sharing!
I cannot overstate how good your shared spreadsheet is. Thanks again!
Nice, thank you for that (upvoted in appreciation). Regarding the absence of o1-Pro from the analysis, is that just because there isn't enough public information available?
very impressive... also interested in your trip planner, it looks like invite only at the moment, but... would it be rude to ask for an invite?
Holy shit, that's incredible. You should publicise this more! That's a fantastic resource.
They tried a while ago: https://news.ycombinator.com/item?id=40373284
Sadly little people noticed...
Sadly few people noticed.
I don’t normally cosplay as a grammar Nazi but in this case I feel like someone should stand up for the little people :)
A comma in the original comment would have made it pop even more:
"Sadly, little people noticed."
(cue a group of little people holding pitchforks (normal forks upon closer inspection))
Or, sadly, little did people notice.
So you think that little people didn’t notice? ;)
Wow, what awesome information! Thanks for sharing!
This is an amazing spreadsheet - thank you for sharing!
Amazing, thank you so much for sharing this.
Thank you so much for sharing this!
Hey, just FYI, I pasted your url from the spreadsheet title into Safari on macOS and got an SSL warning. Unfortunately I clicked through and now it works, so not sure what the exact cause looked like.
[flagged]
Nobody comes to HN to read what ChatGPT thinks about something in the comments
Don't do this.
Awesome spreadsheet. Would a 3D graph of fast, cheap & smart be possible?
Sam Altman's explanation for the restriction is a bit fluffier: https://x.com/sama/status/1895203654103351462
> bad news: it is a giant, expensive model. we really wanted to launch it to plus and pro at the same time, but we've been growing a lot and are out of GPUs. we will add tens of thousands of GPUs next week and roll it out to the plus tier then. (hundreds of thousands coming soon, and i'm pretty sure y'all will use every one we can rack up.)
I’m not an expert or anything, but from my vantage point, each passing release makes Altman’s confidence look more aspirational than visionary, which is a really bad place to be with that kind of money tied up. My financial manager is pretty bullish on tech so I hope he is paying close attention to the way this market space is evolving. He’s good at his job, a nice guy, and surely wears much more expensive underwear than I do— I’d hate to see him lose a pair powering on his Bloomberg terminal in the morning one of these days.
You're the one buying him the underwear. Don't index funds outperform managed investing? I think especially after accounting for fees, but possibly even before, given that 50% of money managers are below average.
A friend got taken in by a Ponzi scheme operator several years ago. The guy running it was known for taking his clients out to lavish dinners and events all the time.[0]
After the scam came to light my friend said “if I knew I was paying for those dinners, I would have been fine with Denny’s[1]”
I wanted to tell him “you would have been paying for those dinners even if he wasn’t outright stealing your money,” but that seemed insensitive so I kept my mouth shut.
0 - a local steakhouse had a portrait of this guy drawn on the wall
1 - for any non-Americans, Denny’s is a low cost diner-style restaurant.
He earns his undies. My returns are almost always modestly above index fund returns after his fees, though like last quarter, he’s very upfront when they’re not. He has good advice for pulling back when things are uncertain. I’m happy to delegate that to him.
You would still be better off in the long run just putting everything into an MSCI World index, unless you place that much value on being able to scream at a human when markets go down.
Not all investing is throwing cash at an index, though. There's other types of investing like direct indexing (to harvest losses), muni bonds, etc.
Paying someone to match your risk profile and financial goals may be worth the fee, which as you pointed out is very measurable. YMMV though.
Depends whose pitch deck you're reading. Warren Buffett didn't get rich waiting on index funds.
I think Warren Buffett doesn't just buy stocks. He also influences the direction of the companies he buys.
And for every Warren Buffett, there are a number of equally competent people who have been less lucky and gone broke taking risks.
And, crucially, whose loss has in turn become someone else's gain. A lot of people had to lose big in order to fill Warren Buffett's coffers.
Warren Buffett got rich by outperforming early (threw his dice well), then using that reputation to attract more capital and to actually influence markets with his decisions / gain access to privileged information your local active fund manager doesn't.
> each passing release makes Altman’s confidence look more aspirational than visionary
As an LLM cynic, I feel that point passed long ago, perhaps even before Altman claimed countries would start wars to conquer the territory around GPU datacenters, or promoted the dream of a 7 T-for-trillion dollar investment deal, etc.
Alas, the market can remain irrational longer than I can remain solvent.
That $7 trillion ask pushed me from skeptical to full-on eye-roll emoji land; the dude is clearly a narcissist with delusions of grandeur. But it's getting worse. Considering the $200 pro subscription was significantly unprofitable before this model came out, imagine how astonishingly expensive this model must be to run at many times that price.
Or the model is nowhere near as expensive as the API pricing suggests, and they want to artificially pump up the value of their Pro subscription?
Most people can evaluate whether the model improvements (or lack thereof) are worth the price tag
Sell an unlimited premium enterprise subscription to every CyberTruck owner, including a huge red ostentatious swastika-shaped back window sticker [but definitely NOT actually an actual swastika, merely a Roman Tetraskelion Strength Symbol] bragging about how much they're spending.
release blog post author: this is definitely a research preview
ceo: it's ready
the pricing is probably a mixture of dealing with GPU scarcity and intentionally discouraging actual users. I can't imagine the pressure they must be under to show they are releasing and staying ahead, but Altman's tweet makes it clear they aren't really ready to sell this to the general public yet.
Yep, that's the thing: they are not ahead anymore. Not since last summer at least. Yes, they probably have the largest customer base, but their models haven't been the best for a while now.
Eh, I think o1-pro is by far the most capable model available right now in terms of pure problem solving.
I think Claude has consistently been ahead for a year ish now and is back ahead again for my use cases with 3.7.
You can try Claude 3.7-Thinking and Grok 3 Think. 10 times cheaper, as good, or very similar to o1-pro.
I haven’t tried Grok yet so can’t speak to that, but I find o1-pro is much stronger than 3.7-thinking for e.g. distributed systems and concurrency problems.
They don't even have the largest customer base. Google is serving AI Overviews at the top of their search engine to an order of magnitude more people.
Bad news: Sam Altman runs the show.
[flagged]
I suppose this was their final hurrah after two failed attempts at training GPT-5 with the traditional pre-training paradigm. Just confirms reasoning models are the only way forward.
> Just confirms reasoning models are the only way forward.
Reasoning models are roughly the equivalent of allowing Hamiltonian Monte Carlo samplers to "warm up" (i.e. start sampling from the typical set). This, unsurprisingly, yields better results (after all, LLMs are just fancy Monte Carlo models in the end). However, it is extremely unlikely this improvement comes without real limits. Letting your HMC warm up is essential to good sampling, but letting it "warm up more" doesn't result in radically better sampling.
While there have been impressive gains in how efficiently LLMs sample from the typical set these days, we're clearly not making major improvements in the capabilities of these models.
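
To make the warm-up analogy concrete (purely as an MCMC illustration, not a claim about LLMs), here's a toy random-walk Metropolis sampler (simpler than HMC, but the warm-up behavior is the same in spirit) targeting a standard normal, started far outside the typical set. A little warm-up removes most of the bias; piling on more warm-up gives diminishing returns:

    # Toy Metropolis sampler for N(0, 1), started far from the typical set.
    import numpy as np

    def metropolis(n_samples, warmup, start=20.0, step=1.0, seed=0):
        rng = np.random.default_rng(seed)
        x, chain = start, []
        for i in range(warmup + n_samples):
            proposal = x + rng.normal(0, step)
            # accept with probability min(1, pi(proposal)/pi(x)) for pi = N(0, 1)
            if np.log(rng.uniform()) < 0.5 * (x**2 - proposal**2):
                x = proposal
            if i >= warmup:
                chain.append(x)
        return np.array(chain)

    for warmup in (0, 200, 2000, 20000):
        est = metropolis(2000, warmup).mean()
        print(f"warm-up {warmup:>5}: estimated mean {est:+.3f} (true value 0)")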
Reasoning models can solve tasks that non-reasoning ones were unable to; how is that not an improvement? What constitutes "major" is subjective - if a "minor" improvement in overall performance means that the model can now successfully perform a task it was unable to solve before, that is a major advancement for that particular task.
> Compared to OpenAI o1 and OpenAI o3‑mini, GPT‑4.5 is a more general-purpose, innately smarter model. We believe reasoning will be a core capability of future models, and that the two approaches to scaling—pre-training and reasoning—will complement each other. As models like GPT‑4.5 become smarter and more knowledgeable through pre-training, they will serve as an even stronger foundation for reasoning and tool-using agents.
GPT 5 is likely just going to be a router model that decides whether to send the prompt to 4o, 4o mini, 4.5, o3, or o3 mini.
My guess is that you're right about that being what's next (or maybe almost next) from them, but I think they'll save the name GPT-5 for the next actually-trained model (like 4.5 but a bigger jump), and use a different kind of name for the routing model.
Even by their poor standards at naming it would be weird to introduce a completely new type/concept, that can loop in models including the 4 / 4.5 series, while naming it part of that same series.
My bet: probably something weird like "oo1", or I suspect they might try to give it a name that sticks for people to think of as "the" model - either just calling it "ChatGPT", or coming up with something new that sounds more like a product name than a version number (OpenCore, or Central, or... whatever they think of)
They already confirmed GPT-5 will be a unified model "months" away. Elsewhere they claimed that it will not just be a router but a "unified" model.
https://www.theverge.com/news/611365/openai-gpt-4-5-roadmap-...
If you read what sama is quoted as saying in your link, it's obvious that "unified model" = router.
> “We hate the model picker as much as you do and want to return to magic unified intelligence,”
> “a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks,”
> the company plans to “release GPT-5 as a system that integrates a lot of our technology, including o3,”
He even slips up and says "integrates" in the last quote.
When he talks about "unifying", he's talking about the user experience not the underlying model itself.
Interesting, thanks for sharing - definitely makes me withdraw my confidence in that prediction, though I still think there's a decent chance they change their mind about that as it seems to me like an even worse naming decision than their previous shit name choices!
Except minus 4.5, because at these prices and results there's essentially no reason not to just use one of the existing models if you're going to be dynamically routing anyway.
What it confirms, I think, is that we are going to need a lot more chips.
Further confirmation, IMO, that the idea that any of this leads to anything close to AGI is people getting high on their own supply (in some cases literally).
LLMs are a great tool for what is effectively collected-knowledge search and summary (so long as you are willing to accept that you have to verify all of the "knowledge" they spit back, because they always have the ability to go off the rails). But they have been hitting the limits on how much better that can get without somehow introducing more real knowledge for close to 2 years now, and everything since then has been super incremental and, in my experience, mostly benchmark gains and hype as opposed to being purely better.
I personally don't believe that more GPUs solves this, like, at all. But its great for Nvidia's stock price.
I'd put myself on the pessimistic side of all the hype, but I still acknowledge that where we are now is a pretty staggering leap from two years ago. Coding in particular has gone from hints and fragments to full scripts that you can correct verbally and are very often accurate and reliable.
I'm not saying there's been no improvement at all. I personally wouldn't categorize it as staggering, but we can agree to disagree on that.
I find the improvements to be uneven, in the sense that every time I try a new model I can find use cases where it's an improvement over previous versions, but I can also find use cases where it feels like a serious regression.
Our differences in how we categorize the amount of improvement over the past 2 years may be related to how much the newer models are improving vs regressing for our individual use cases.
When used as coding helpers/time accelerators, I find newer models to be better at one-shot tasks where you let the LLM loose to write or rewrite entire large systems and I find them worse at creating or maintaining small modules to fit into an existing larger system. My own use of LLMs is largely in the latter category.
To be fair I find the current peak model for coding assistant to be Claude 3.5 Sonnet which is much newer than 2 years old, but I feel like the improvements to get to that model were pretty incremental relative to the vast amount of resources poured into it and then I feel like Claude 3.7 was a pretty big back-slide for my own use case which has recently heightened my own skepticism.
Hilarious. Over two years we went from LLMs being slow and not very capable of solving problems to models that are incredibly fast, cheap and able to solve problems in different domains.
Well said. 100% agree
Or, possibly, we're stuck waiting for another theoretical breakthrough before real progress is made.
breakthrough in biology
Eh, no. More chips won't save this right now, or probably in the near future (IE barring someone sitting on a breakthrough right now).
It just means either
A. Lots and lots of hard work that get you a few percent at a time, but add up to a lot over time.
or
B. Completely different approaches that people actually think about for a while rather than trying to incrementally get something done in the next 1-2 months.
Most fields go through this stage. Sometimes more than once as they mature and loop back around :)
Right now, AI seems bad at doing either - at least, from the outside of most of these companies, and watching open source/etc.
While lots of little improvements seem to be released in lots of parts, it's rare to see anywhere that is collecting and aggregating them en masse and putting them in practice. It feels like for every 100 research papers, maybe 1 makes it into something in a way that anyone ends up using it by default.
This could be because they aren't really even a few percent (which would be yet a different problem, and in some ways worse), or it could be because nobody has cared to, or ...
I'm sure very large companies are doing a fairly reasonable job on this, because they historically do, but everyone else - even frameworks - it's still in the "here's a million knobs and things that may or may not help".
It's like if compilers had no "O0/O1/O2/O3" at all and were just like "here's 16,283 compiler passes - you can put them in any order and amount you want". Thanks! I hate it!
It's worse even because it's like this at every layer of the stack, whereas in this compiler example, it's just one layer.
At the rate of claimed improvements by papers in all parts of the stack, either lots and lots and lots is being lost because this is happening, in which case, eventually that percent adds up to enough for someone to be able to use to kill you, or nothing is being lost, in which case, people appear to be wasting untold amounts of time and energy, then trying to bullshit everyone else, and the field as a whole appears to be doing nothing about it. That seems, in a lot of ways, even worse. FWIW - I already know which one the cynics of HN believe, you don't have to tell me :P. This is obviously also presented as black and white, but the in-betweens don't seem much better.
Additionally, everyone seems to rush half-baked things to try to get the next incremental improvement released and out the door because they think it will help them stay "sticky" or whatever. History does not suggest this is a good plan, and even if it were a good plan in theory, it's pretty hard to lock people in with what exists right now. There isn't enough anyone cares about, and rushing out half-baked crap is not helping that. Mindshare doesn't really matter if no one cares about using your product.
Does anyone using these things truly feel locked into anyone's ecosystem at this point? Do they feel like they will be soon?
I haven't met anyone who feels that way, even in corps spending tons and tons of money with these providers.
The public companies I can at least understand, given the fickleness of public markets. That was supposed to be one of the serious benefits of staying private. So watching private companies do the same thing - it's just sort of mind-boggling.
Hopefully they'll grow up soon, or someone who takes their time and does it right during one of the lulls will come and eat all of their lunches.
> Completely different approaches that people actually think about for a while
I think this is very likely simply because there are so many smart people looking at it right now. I hope the bubble doesn't burst before it happens.
For OpenAI perhaps? Sonnet 3.7 without extended thinking is quite strong. SWE-bench scores tie o3.
How do you read those scores? I wanted to see how well 3.7 with thinking did, but I can't even read that table.
I think this is the correct take. There are other axes to scale on AND I expect we'll see smaller and smaller models approach this level of pre-trained performance. But I believe massive pre-training gains have hit clearly diminished returns (until I see evidence otherwise).
The price is obviously 15-30x that of 4o, but I'd just posit that there are some use cases where it may make sense. It probably doesn't make sense for the "open-ended consumer facing chatbot" use case, but for other use cases that are fewer and higher value in nature, it could, if its abilities are considerably better than 4o's.
For example, there are now a bunch of vendors that sell "respond to RFP" AI products. The number of RFPs that any sales organization responds to is probably no more than a couple a week, but it's a very time-consuming, laborious process. But the payoff is obviously very high if a response results in a closed sale. So here paying 30x for marginally better performance makes perfect sense.
I can think of a number of similar "high value, relatively low occurrence" use cases like this where the pricing may not be a big hindrance.
And which use case will that make sense then for?
Esp. when they aren't even sure whether they will commit to offering this long term? Who would be insane enough to build a product on top of something that may not be there tomorrow?
Those products require some extensive work, such as model finetuning on proprietary data. Who is going to invest time and money into something like that when OpenAI says right out of the gate they may not support this model for very long?
Basically OpenAI is telegraphing that this is yet another prototype that escaped a lab, not something that is actually ready for use and deployment.
Yeah, agreed.
We’re one of those types of customers. We wrote an OpenAI API compatible gateway that automatically batches stuff for us, so we get 50% off for basically no extra dev work in our client applications.
I don’t care about speed, I care about getting the right answer. The cost is fine as long as the output generates us more profit.
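
Presumably a gateway like that sits on top of OpenAI's Batch API, which is what the 50% discount corresponds to. A minimal sketch of the batching side using the official Python SDK; the model name, prompts, and polling details here are simplified assumptions, not the commenter's actual setup:

    # Sketch: submit a JSONL batch of chat requests and poll for the results.
    import json, time
    from openai import OpenAI

    client = OpenAI()

    requests = [
        {"custom_id": f"req-{i}",
         "method": "POST",
         "url": "/v1/chat/completions",
         "body": {"model": "gpt-4.5-preview",
                  "messages": [{"role": "user", "content": prompt}]}}
        for i, prompt in enumerate(["Summarize document A", "Summarize document B"])
    ]

    with open("batch.jsonl", "w") as f:
        f.write("\n".join(json.dumps(r) for r in requests))

    batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(input_file_id=batch_file.id,
                                  endpoint="/v1/chat/completions",
                                  completion_window="24h")

    # Results arrive asynchronously, sometime within the completion window.
    while (batch := client.batches.retrieve(batch.id)).status not in (
            "completed", "failed", "expired", "cancelled"):
        time.sleep(60)

    if batch.status == "completed":
        print(client.files.content(batch.output_file_id).text)

The gateway part would then just be an OpenAI-compatible HTTP front end that collects incoming requests, writes them into batches like this, and hands the results back once they land.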
I think it's fairer to compare it to the original GPT-4, which might be the equivalent in terms of "size" (though we don't have actual numbers for either).
GPT-4: Input $30.00 / 1M tokens ; Output $60.00 / 1M tokens
So 4.5 is 2.5x more expensive.
I think they announced this as their last non-reasoning model, so it was maybe with the goal of stretching pre-training as far as they could, just to see what new capabilities would show up. We'll find out as the community gives it a whirl.
I'm a Tier 5 org and I have it available already in the API.
The marginal costs for running a GPT-4-class LLM are much lower nowadays due to significant software and hardware innovations since then, so costs/pricing are harder to compare.
Agreed, however it might make sense that a much-larger-than-GPT-4 LLM would also, at launch, be more expensive to run than the OG GPT-4 was at launch.
(And I think this is probably also scarecrow pricing to discourage casual users from clogging the API since they seem to be too compute-constrained to deliver this at scale)
There are some numbers on one of their Blackwell or Hopper info pages that notes the ability of their hardware in hosting an unnamed GPT model that is 1.8T params. My assumption was that it referred to GPT-4
Sounds to me like GPT 4.5 likely requires a full Blackwell HGX cabinet or something, thus OpenAI's reference to needing to scale out their compute more (Supermicro only opened up their Blackwell racks for General Availability last month, and they're the prime vendor for water-cooled Blackwell cabinets right now, and have the ability to throw up a GPU mega-cluster in a few weeks, like they did for xAI/Grok)
2x that price for the 32k context via API at launch. So nearly the same price, but you get 4x the context
Honestly, if long context (that doesn't start to degrade quickly) is what you're after, I would use Grok 3 (not sure when the API version releases though). Over the last week or so I've had a massive thread of conversation with it that started with plenty of my project's relevant code (as in a couple hundred lines), and several days later, after like 20 question-answer blocks, you ask it something and it answers "since you're doing that this way, and you said you want x, y and z, here are your options blabla"... It's like thinking Gemini but better. Also, unlike Gemini (and others), it seems to have a much more recent data cutoff. Try asking about some language feature / library / framework that has been released recently (say 3 months ago) and most of the models shit the bed, use older versions of the thing, or just start to imitate what the code might look like. For example, try asking Gemini if it can generate Tailwind 4 code; it will tell you that its training cutoff is like October or something and Tailwind 4 "isn't released yet" and that it can try to imitate what the code might look like. Uhhhhhh, thanks I guess??
Why would that be fairer? We can assume they did incorporate all learnings and optimizations they made post gpt-4 launch, no?
Definitely not. They don't distill their original models. 4o is a much more distilled and cheaper version of 4. I assume 4.5o would be a distilled and cheaper version of 4.5.
It'd be weird to release a distilled version without ever releasing the base undistilled version.
Not necessarily.
If this huge model has taken months to pre-train and was expected to be released before, say, o3-mini, you could definitely have some last-minute optimizations in o3-mini that were not considered at the time of building the architecture of gpt-4.5.
This has been my suspicion for a long time - OpenAI have indeed been working on "GPT5", but training and running it is proving so expensive (and its actual reasoning abilities only marginally stronger than GPT4) that there's just no market for it.
It points to an overall plateau being reached in the performance of the transformer architecture.
But while there is a plateau in the transformer architecture, what you can do with those base models by further finetuning/modifying/enhancing them is still largely unexplored, so I still predict mind-blowing enhancements yearly for the foreseeable future. Whether they validate OpenAI's valuation and investment needs is a different question.
That would certainly reduce my anxiety about the future of my chosen profession.
Certainly hope so. The tech billionaires are a little too excited to achieve AGI and replace the workforce.
TBH, with the safety/alignment paradigm we have, workforce replacement was not my top concern for when we hit AGI. A pause / lull in capabilities would be hugely helpful so that we can figure out how not to die along with the lightcone...
Not sure why anyone thinks it's possible to fully control AGI; we can't even fully tame a house cat.
Is it inevitable to you that someone will create some kind of techno-god behemoth AI that will figure out how to optimally dominate an entire future light cone starting from the point in spacetime of its self-actualization? Borg or Cylons?
I feel like this period has shown that we're not quite ready for a machine god. We'll see if RL hits a wall as well.
I put "hello" into it and it billed me 30p for it. Absolutely unusable, more expensive than realtime voice chat.
If it really costs them 30x more, surely they must plan on putting pretty significant usage limits on any rollout to the Plus tier, and if that is the case I'm not sure what the point is, considering it seems primarily a replacement/upgrade for 4o.
The cognitive overhead of choosing between what will be 6 different models now on ChatGPT, and trying to map whether a query is "worth" using a certain model while worrying about hitting usage limits, is getting kind of out of control.
To be fair their roadmap states that gpt-5 will unify everything into one model in "months".
Really depends on your use case. For low-value tasks this is way too expensive. But for context, let's say a court opinion is an average of 6000 words. Let's say I want to analyze 10 court opinions and pull out some information that's relevant to my case. That will run about $1.80 per document, or $18 total. I wouldn't pay that just to edify myself, but I can think of many use cases where it's still a negligible cost, even if it only does 5% better than the 30x cheaper model.
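
The per-document figure depends heavily on how many tokens you assume per word and how long the extracted output is; a parameterized back-of-envelope sketch (the tokens-per-word and output-length values below are guesses, not measurements):

    # Rough cost per document at GPT-4.5 API list prices ($ per 1M tokens).
    INPUT_PRICE = 75.00
    OUTPUT_PRICE = 150.00

    def cost_per_document(words, tokens_per_word=1.5, output_tokens=2000):
        # tokens_per_word and output_tokens are assumptions, not measurements
        input_tokens = words * tokens_per_word
        return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

    print(f"~${cost_per_document(6000):.2f} per 6000-word opinion with these assumptions")

The exact number scales linearly with whatever you plug in, but it stays in the cents-to-a-couple-of-dollars-per-document range either way.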
The performance bump doesn't justify the steep price difference.
From a for profit business lens for OpenAI - I understand pushing the price outside the range of side projects, but this pushes it past start ups.
Excited to see new stuff released past reasoning models in any case. Hope they can improve the price soon.
Let's see if DeepSeek will make a distillation of this model as well
Now the real question about AI automation starts: is it cheaper to pay a human to do the task, or an AI company?
I was about to comment that humans consume orders of magnitude less energy, but then I checked the numbers, and it looks like an average person consumes way more energy throughout their day (food, transportation, electricity usage, etc) than GPT-4.5 would at 1 query per minute over 24 hours.
It's still not smart enough to replace, for example, customer service.
It's absolutely able to replace the majority of customer service volume which is full of mundane questions.
Such brutal reductionism: how do you account for an ever-growing percentage of customers so pissed at this terrible service that you lose them forever? Not just one company losing customers, but an entire population completely distrusting and pulling back from any and all companies pulling this trash.
Huh? Most call centers these days already use IVR systems, and they absolutely are terrible experiences. I, along with most people, would happily speak with an LLM-backed agent to resolve issues.
Customer service is already a wreck; LLMs beat an IVR any day of the week and can offer real triaging ability.
The only people getting upset are the luddites like yourself.
Humans have all sorts of issues you have to deal with. Being hungover, not sleeping well, having a personality, being late to work, not being able to work 24/7, very limited ability to copy them. If there's a soulless generic office-droidGPT that companies could hire that would never talk back and would do all sorts of menial work without needing breaks or to use the bathroom, I don't know that we humans stand a chance!
I have a bunch of work that needs doing. I can do it myself, or I can hire one person to do it. I gotta train them and manage them, and even after I train them there's still only going to be one of them, and it's subject to their availability. On the other hand, if I need to train an AI to do it, I can copy that AI, spin them up/down like on-demand computers in the cloud, and not feel remotely bad about spinning them down.
It's definitely not there yet, but it's not hard to see the business case for it.
This is the ultimate business model.
Once we get to that stage, unless you're a capitalist, remember that your job is next in line to be replaced.
Every tech drone in every cubicle considers themselves a temporarily embarrassed capitalist.
I write code for a living. My entire profession is on the line, thanks to ourselves. My eyes are wide open on the situation at hand though. Burying my head in the sand and pretending what I wrote above isn't true, isn't going to make it any less true.
I'm not sure what I can do about it, either. My job already doesn't look like it did a year ago, nevermind a decade away.
I keep telling coders to switch to being 1-person enterprise shops instead, but they don't listen. They will learn the hard way when they suddenly find themselves without a job due to AI having taken it away. As for what enterprise, use your imagination without bias from coding.
Doubly so with how good Claude 3.7 Sonnet is at $3 / 1M tokens.
I wonder how much money they’re losing on it too even at those prices.
> It sounds like it's so expensive and the difference in usefulness is so lacking(?)
The claimed hallucination rate is dropping from 61% to 37%. That's a "correct" rate increasing from 29% to 63%.
Double the correct rate costs 15x the price? That seems absurd, unless you think about how mistakes compound. Even just 2 steps in and you're comparing a 8.4% correct rate vs 40%. 3 automated steps and it's 2.4% vs 25%.
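
A rough sketch of that compounding, assuming each step in a pipeline is independent and using the per-step figures quoted above (the commenter's numbers, not official benchmarks):

    # If each automated step is "correct" with probability p, an n-step chain
    # is only correct with probability p**n (assuming independence).

    def pipeline_accuracy(p, steps):
        return p ** steps

    for steps in (1, 2, 3):
        cheap = pipeline_accuracy(0.29, steps)   # quoted per-step rate, cheaper model
        pricey = pipeline_accuracy(0.63, steps)  # quoted per-step rate, GPT-4.5
        print(f"{steps} step(s): {cheap:.1%} vs {pricey:.1%}")
    # 1 step(s): 29.0% vs 63.0%
    # 2 step(s): 8.4% vs 39.7%
    # 3 step(s): 2.4% vs 25.0%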
And remember, with increasing accuracy, the cost of validation goes up (and not even linearly).
We expect computers to be right. It's a trust problem. Average users will simply trust the results of LLMs and move on without proper validation. And the way LLMs are trained to mimic human interaction is not helping either. This will reduce overall quality in society.
It's a different thing to work with another human, because there is intention. A human wants either to be correct or to mislead me. I am considering this without even thinking about it.
And I don't expect expert models to improve things, unless the problem space is really simple (like checking eggs for anomalies).
Well, to play the devil's advocate, I think this is useful to have, at least for 'Open'AI to start from when applying QLoRA or similar approximations.
Bonus: they could even do some self-learning afterwards with the performance improvements DeepSeek just published, and it might have more EQ and fewer hallucinations than starting from scratch…
I.e. the price might go down big time, but there might be significant improvements down the line when starting from such a broad base.
> GPT 4.5 pricing is insane: Price Input: $75.00 / 1M tokens Cached input: $37.50 / 1M tokens Output: $150.00 / 1M tokens
> GPT 4o pricing for comparison: Price Input: $2.50 / 1M tokens Cached input: $1.25 / 1M tokens Output: $10.00 / 1M tokens
Their examples don't seem 30x better. :-)
GPT-4.5 is 15-30x more expensive than GPT-4o. Likely that much larger in terms of parameter count too. It’s massive!!
With more parameters comes more latent space to build a world model. No wonder its internal world model is so much better than previous SOTA
Usefulness is bound to scope/purpose. Even if innovation stops, in 3 years (thanks to hardware and tuning progress) when 4o costs $0.10/M and 4.5 costs $1/M, even a small improvement (which this is not, IMO) means you will choose 4.5, exactly like no one wants to use 3.5 now.
Someone in another comment said that GPT-4 32k had somewhat the same cost (OK, 10% cheaper); what was a pain was more the latency and speed than the actual cost, given the increase in productivity for our usage.
It's priced like this because it can generate erotica.
> It sounds like it's so expensive and the difference in usefulness is so lacking(?) they're not even gonna keep serving it in the API for long
I guess the rationale behind this is paying for the marginal improvement. Maybe the next few percent of improvement is so important to a business that the business is willing to pay a hefty premium.
I wonder if the pricing is partly to discourage distillation, if they suspect r1 was distilled from gpt 4o
Mainly to prevent you from using it
The price will come down over time as they apply all the techniques to distill it down to a smaller parameter model. Just like GPT4 pricing came down significantly over time.
AI as it stands in 2025 is an amazing technology, but it is not a product at all.
As a result, OpenAI simply does not have a business model, even if they are trying to convince the world that they do.
My bet is that they're currently burning through other people's capital at an amazing rate, but that they are light-years from profitability
They are also being chased by fierce competition and open source, which is very close behind. There simply is no moat.
It will not end well for investors who sunk money in these large AI startups (unless of course they manage to find a Softbank-style mark to sell the whole thing to), but everyone will benefit from the progress AI will have made during the bubble.
So, in the end, OpenAI will have, albeit very unwillingly, fulfilled their original charter of improving humanity's lot.
I've been a Plus user for a long time now. My opinion is there is very much a ChatGPT suite of products that come together to make for a mostly delightful experience.
Three things I use all the time:
- Canvas for proofing and editing my article drafts before publishing. This has replaced an actual human editor for me.
- Voice for all sorts of things, mostly for thinking out loud about problems or a quick question about pop culture, what something means in another language, etc. The Sol voice is so approachable for me.
- GPTs I can use for things like D&D adventure summaries I need in a certain style every time without any manual prompting.
Except that if OpenAI goes bust, very little of what they did will actually be released to humankind.
So their contribution was really to fuel a race toward open source (which they contributed little to). A pretty convoluted argument.
> My bet is that they're currently burning through other people's capital at an amazing rate, but that they are light-years from profitability
The Information leaked their internal projections a few months ago, and apparently their own estimates have them losing $44B between then and 2029 when they expect to finally turn a profit, maybe.
That's surprisingly small
> AI as it stands in 2025 is an amazing technology, but it is not a product at all.
Here I'm assuming "AI" to mean what's broadly called Generative AI (LLMs, photo, video generation)
I genuinely am struggling to see what the product is too.
The code assistant use cases are really impressive across the board (and I'm someone who was vocally against them less than a year ago), and I pay for Github CoPilot (for now) but I can't think of any offering otherwise to dispute your claim.
It seems like companies are desperate to find a market fit, and shoving the words "agentic" everywhere doesn't inspire confidence.
Here's the thing: I remember people lining up around the block for iPhone releases, XBox launches, hell even Grand Theft Auto midnight releases.
Is there a market of people clamoring to use/get anything GenAI related?
If any/all LLM services went down tonight, what's the impact? Kids do their own homework?
JavaScript programmers have to remember how to write React components?
Compare that with Google Maps disappearing, or similar.
LLMs are in a position where they're forced onto people and most frankly aren't that interested. Did anyone ASK for Microsoft throwing some Copilot things all over their operating system? Does anyone want Apple Intelligence, really?
I think search and chat are decent products as well. I am a Google subscriber and I just use Gemini as a replacement for search without ads. To me, this movement accelerated paid search in an unexpected way. I know the detractors will cry "hallucinations" and the like. I would counter with an argument about the state of the current web, besieged by ads and misinformation. If people carry a reasonable amount of skepticism in all things, this is a fine use case. Trust but verify.
I do worry about model poisoning with fake truths but don't feel we are there yet.
> I do worry about model poisoning with fake truths but don't feel we are there yet.
In my use, hallucinations will need to be a lot lower before we get there, because I already can't trust anything an LLM says so I don't think I could even distinguish a poisoned fake truth from a "regular" hallucination.
I just asked ChatGPT 4o to explain irreducible control flow graphs to me, something I've known in the past but couldn't remember. It gave me a couple of great definitions, with illustrative examples and counterexamples. I puzzled through one of the irreducible examples, and eventually realized it wasn't irreducible. I pointed out the error, and it gave a more complex example, also incorrect. It finally got it on the 3rd try. If I had been trying to learn something for the first time rather than remind myself of what I had once known, I would have been hopelessly lost. Skepticism about any response is still crucial.
speaking of search without ads, I wholeheartedly recommend https://kagi.com
I'll second this. Kagi is really impressive and ad-free is a nice change.
> I genuinely am struggling to see what the product is too.
They're nice for summarizing and categorizing text. We've had good solutions for that before, too (BERT, et al.), but LLMs are marginally nicer.
> Is there a market of people clamoring to use/get anything GenAI related?
No. LLMs are lame and uncool. Kids especially dislike them a lot on that basis alone.
Yes: the real truth is, if a really good AI were created, we wouldn't even know it existed until a billion-dollar company took over some industry with only a handful of developers in the entire company. Only then would hints spill out into the world that it's possible.
No "good" AI will ever be open to everyone and relatively cheap, this is the same phenomenon as "how to get rich" books
Sir, they are selling text by the ounce, just like farmers sold tomatoes before Walmart. How is that not a business model?
> As a result, OpenAI simply does not have a business model, even if they are trying to convince the world that they do.
They have a super popular subscription service. If they keep iterating on the product enough, they can lag on the models. The business is the product not the models and not the API. Subscriptions are pretty sticky when you start getting your data entrenched in it. I keep my ChatGPT subscription because it’s the best app on Mac and already started to “learn me” through the memory and tasks feature.
Their app experience is easily the best out of their competitors (grok, Claude, etc). Which is a clear sign they know that it’s the product to sell. Things like DeepResearch and related are the way they’ll make it a sustainable business - add value-on-top experiences which drive the differentiation over commodities. Gemini is the only competitor that compares because it’s everywhere in Google surfaces. OpenAI’s pro tier will surely continue to get better, I think more LLM-enabled features will continue to be a differentiator. The biggest challenge will be continuing distribution and new features requiring interfacing with third parties to be more “agentic”.
Frankly, I think they have enough strength in product with their current models today that even if model training stalled it’d be a valuable business.
https://podcasts.apple.com/us/podcast/better-offline/id17305...
In other words, they want people to pay for the privilege of becoming beta testers....
For comparison, 3 years ago, the most powerful model out there (GPT-3 davinci) was $60/MTok.
Looks like more signal that the scaling "law" is indeed faltering.
I don't understand the pricing for cached tokens. It seems rather high for looking up something in a cache.
Hyperscalers in shambles. No clue why they even released this, other than the fact that they didn't want to admit they wasted an absurd amount of money for no reason.
Did they already disable it?
When using `gpt-4.5-preview` I am getting: > Invalid URL (POST /v1/chat/completions)
> It sounds like it's so expensive and the difference in usefulness is so lacking(?) they're not even gonna keep serving it in the API for long:
Sounds like an attempt at price discrimination. Sell the expensive version to big companies with big budgets who don't care, sell the cheap version to everyone else. Capture both ends of the market.
A 30x price bump feels like an attempt to pull in as much money as possible before the bubble bursts.
To me, it feels like a PR stunt in response to what the competition is doing. OpenAI is trying to show how they are ahead of others, but they price the new model to minimize its use. Potentially, Anthropic et al. also have amazing models that they aren't yet ready to productionize because of costs.
I can chew through 1MM tokens with a single standard (and optimized) call. This pricing is insane.
>GPT 4.5 pricing is insane: Price Input: $75.00 / 1M tokens Cached input: $37.50 / 1M tokens Output: $150.00 / 1M tokens
How many eggs does that include??!
"GPT-4.5 is not a frontier model, but it is OpenAI’s largest LLM, improving on GPT-4’s computational efficiency by more than 10x."[1]
I don't get it, it is supposedly much cheaper to run?
[1] https://cdn.openai.com/gpt-4-5-system-card.pdf (page 7, bottom)
This is what GPT-4 cost when it was released.
One of the problems seems to be that there's no alternative to the Nvidia ecosystem (the GPUs + CUDA).
May I introduce you to Gemini 2.0
ZLUDA can be used as compatibility glue; also you can use ROCm or even Vulkan with Ollama.
It's crazy expensive because they want to pull in as much revenue as possible as fast as possible before the Open Source models put them outta business.
Maybe they started a really long, expensive training session, and Elon Musk's DOGE script kiddies somehow broke in and sabotaged it, so it got disrupted and turned into the Eraserhead baby, but they still want to get it out there for a little while before it dies, to squeeze as much money out of it as possible, because it was so expensive to train.
https://www.youtube.com/watch?v=ZZ-kI4Qzj9U
[dead]
But you get higher EQ. /s
> GPT 4.5 pricing is insane:
> I'm still gonna give it a go, though.
Seems like the pricing is pretty rational then?
Not if people just try a few prompts then stop using it.
Sure but its in their best interest to lower it then and only then.
OpenAI wouldn't be the first company to price something expensive when it first comes out to capitalize on people who are less price sensitive at first and then lower prices to capture a bigger audience.
That's all pricing 101 as the saying goes.
If OAI are concerning themselves with collecting a few hundred from a small group of individuals, then they really have nothing better to do.
How much of OAI's reported users are doing exactly this?
Input price difference: 4.5 is 30x more
Output price difference: 4.5 is 15x more
In their model evaluation scores in the appendix, 4.5 is, on average, 26% better. I don't understand the value here.
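For a sense of what those multiples mean per request, here's a rough back-of-envelope sketch. It assumes GPT-4o list prices of $2.50/$10.00 per 1M tokens (consistent with the 30x/15x ratios above) and an arbitrary query size; the numbers are illustrative only.

```python
# Per-1M-token list prices quoted in this thread; the GPT-4o figures are
# assumed from the 30x input / 15x output ratios mentioned above.
PRICES = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o":  {"input": 2.50,  "output": 10.00},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical query: 2,000 input tokens, 500 output tokens
for model in PRICES:
    print(f"{model}: ${query_cost(model, 2_000, 500):.4f} per query")
# gpt-4.5: ~$0.2250 per query, gpt-4o: ~$0.0100 per query (~22x more)
```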
If you ran the same query set 30x or 15x on the cheaper model (and compensated for all the extra tokens the reasoning model uses), would you be able to realize the same 26% quality gain in a machine-adjudicatible kind of way?
with a reasoning model you'd get better than both.
[dead]
Einstein's IQ = 3.5x a chimpanzee's IQ, right?
3.5x on a normal distribution with mean 100 and SD 15 is pretty insane. But I agree with your point, being 26% better at a certain benchmark could be a tiny difference, or an incredible improvement (imagine the hardest questions being Riemann hypothesis, P != NP, etc).
Is it official then?
Most of us have been waiting for this moment for a while. The transformer architecture as it is currently understood can't be milked any further; many of us have known this since last year. GPT-5 delays eventually led non-tech voices to suggest likewise. But we all held off on a final verdict until the next big release from OpenAI, as Sam Altman has been making claims about AGI entering the workforce this year, OpenAI knowing how to build AGI, and similar outlandish claims. We all knew that their next big release in 2025 would be the final deciding factor on whether they had some tech breakthrough that would upend the world (justifying their astronomical valuation) or if it would just be (slightly) more of the same (marking the beginning of their downfall).
The GPT-4.5 release points towards the latter. Thus, we should not expect OpenAI to exist as it does now (AI industry leader) in 2030, assuming it does exist at all by then.
However, just like the 19th century rail industry revolution, the fall of OpenAI will leave behind a very useful technology that while not catapulting humanity towards a singularity, will nonetheless make people's lives better. Not much consolation to the world's super rich who will lose tons of money once the LLM industry (let us remember that AI is not LLM) falls.
EDIT: "will nonetheless make people's lives better" to "might nonetheless make some people's lives better"
It's worth pointing out that GPT-4.5 seems focused on better pre-training and doesn't include reasoning.
I think GPT-5 - if/when it happens - will be 4.5 with reasoning, and as such it will feel very different.
The barrier, is the computational cost of it. Once 4.5 gets down to similar costs to 4.0 - which could be achieved through various optimization steps (what happened to the ternary stuff that was published last year that meant you could go many times faster without expensive GPUs?), and better/cheaper/more efficient hardware, you can throw reasoning into the mix and suddenly have a major step up in capability.
I am a user, not a researcher or builder. I do think we're in a hype bubble, I do think that LLMs are not The Answer, but I also think there is more mileage left in this path than you seem to. I think automated RL (not HF), reasoning, and better/optimal architectures and hardware mean there is a lot more we can get out of the stochastic parrots yet.
Is it fair to still call LLMs stochastic parrots now that they are enriched with reasoning? Seems to me that the simple procedure of large-scale sampling + filtering makes it immediately plausible to get something better than the training distribution out of the LLM. In that sense the parrot metaphor seems suddenly wrong.
I don’t feel like this binary shift is adequately accounted for among the LLM cynics.
it was never fair to call them stochastic parrots and anybody who is paying any attention knows that sequence models can generalize at least partially OOD
As someone who is terrified of agentic ASI, I desperately hope this is true. We need more time to figure out alignment.
I'm not sure this will ever be solved. It requires both a technical solution and social consensus. I don't see consensus on "alignment" happening any time soon. I think it'll boil down to "aligned with the goals of the nation-state", and lots of nation states have incompatible goals.
I agree unfortunately. I might be a bit of an extremist on this issue. I genuinely think that building agentic ASI is suicidally stupid and we just shouldn’t do it. All the utopian visions we hear from the optimists describe unstable outcomes. A world populated by super-intelligent agents will be incredibly dangerous even if it appears initially to have gone well. We’ll have built a paradise in which we can never relax.
What's the difference between your "agentic AIs" and, say, "script kiddies" or "expert anarchist/black-hat hackers"?
It's been obvious for a while that the narrow-waist APIs between things matter, and apparent that agentic AI is leaning into adaptive API consumption, but I don't see how that gives the agentic client some super-power we don't already need to defend against since before AGI we already have HGI (human general intelligence) motivated to "do bad things" to/through those APIs, both self-interested and nation-state sponsored.
We're seeing more corporate investment in this interplay, trending us towards Snow Crash, but "all you have to do" is have some "I" in API be "dual key human in the loop" to enable a scenario where AGI/HGI "presses the red button" in the oval office, nuclear war still doesn't happen, WarGames or Crimson Tide style.
I'm not saying dual key is the answer to everything, I'm saying, defenses against adversaries already matter, and will continue to. We have developed concepts like air gaps or modality changes, and need more, but thinking in terms of interfaces (APIs) in the general rather than the literal gives a rich territory for guardrails and safeguards.
> we just shouldn’t do it.
I think what Accelerationism gets right is that capitalism is just doing it - autonomizing itself - and that our agency is very limited, especially given the arms race dynamics and the rise of decentralized blockchain infrastructure.
As Nick Land puts it, in his characteristically detached style, in A Quick-and-Dirty Introduction to Accelerationism:
"As blockchains, drone logistics, nanotechnology, quantum computing, computational genomics, and virtual reality flood in, drenched in ever-higher densities of artificial intelligence, accelerationism won't be going anywhere, unless ever deeper into itself. To be rushed by the phenomenon, to the point of terminal institutional paralysis, is the phenomenon. Naturally — which is to say completely inevitably — the human species will define this ultimate terrestrial event as a problem. To see it is already to say: We have to do something. To which accelerationism can only respond: You're finally saying that now? Perhaps we ought to get started? In its colder variants, which are those that win out, it tends to laugh." [0]
[0] https://retrochronic.com/#a-quick-and-dirty-introduction-to-...
I feel like it was GPT-5 which was eventually renamed to keep up with expectations.
I'm not convinced that LLMs in their current state are really making anyone's lives much better though. We really need more research applications for this technology for that to become apparent. Polluting the internet with regurgitated garbage produced by a chat bot does not benefit the world. Increasing the productivity of software developers does not help the world either. Solving more important problems should be the priority for this type of AI research & development.
The explosion of garbage content is a big issue and has radically changed the way I use the web over the past year: Google and DuckDuckGo are not my primary tools anymore, instead I am now using specialized search engines more and more, for example, if I am looking for something I believe can be found in someone's personal blog I just use Marginalia or Mojeek, if I am searching for software issues I use GitHub's search, general info straight to Wikipedia, tech reviews HN's Algolia etc.
It might sound a bit cumbersome but it's actually super easy if you assign search keywords in your browser: for instance if I am looking for something on GitHub I just open a new tab on Firefox and type "gh tokio".
LLM's have been extremely useful for me. They are incredibly powerful programmers, from the perspective of people who aren't programmers.
Just this past week claude 3.7 wrote a program for us to use to quickly modernize ancient (1990's) proprietary manufacturing machine files to contemporary automation files.
This allowed us to forgo a $1k/yr/user proprietary software package that would be able to do the same. The program Claude wrote took about 30 mins to make. Granted the program is extremely narrow in scope, but it does the one thing we need it to do.
This marks the third time I (a non-progammer) have used an LLM to create software that my company uses daily. The other two are a test system made by GPT-4 and an android app made by a mix of 4o and claude 3.5.
Bumpers may be useless and laughable to pro bowlers, but a godsend to those who don't really know what they are doing. We don't need to hire a bowler to knock over pins anymore.
I've also been toying with Claude Code recently and i (as en eng, ~10yr) think they are useful for pair programming the dumb work.
Eg as i've been trying Claude Code i still feel the need to babysit it with my primary work, and so i'd rather do it myself. However while i'm working if it could sit there and monitor it, note fixes, tests and documentation and then stub them in during breaks i think there's a lot of time savings to be gained.
Ie keep the doing simple tasks that it can get right 99% of the time and get it out of the way.
I also suspect there's context to be gained in watching the human work. Not learning per se, but understanding the areas being worked on, improving intuition on things the human needs or cares about, etc.
A `cargo lint --fix` on steroids is "simple" but still really sexy imo.
> Solving more important problems should be the priority for this type of AI research & development.
Which problem spaces do you think are underserved in this aspect?
> will nonetheless make people's lives better
Probably not the lives of translators or graphic designers or music composers. They will have to find new jobs. As LLM prompt engineers, I guess.
Graphic designers I think are safe, at least within organizations that require a cohesive brand strategy. Getting the AI to respect all of the previous art will be a challenge at a certain scale.
Fiverr graphic designers on the other hand…
Getting graphic designers to use the design system that they invented is quite a challenge too if I'm honest... should we really expect AI to be better than people? Having said that AI is never going to be adept at knowing how and when to ignore the human in the loop and do the "right" thing.
absolutely a solvable problem even with no tech advances
> will nonetheless make people's lives better
While I mostly agree with your assessment, I am still not convinced of this part. Right now, it may be making our lives marginally better. But once the enshittification starts to set in, I think it has the potential to make things a lot worse.
E.g. I think the advertisement industry will just love the idea of product placements and whatnots into the AI assistant conversations.
*good*. The answer to this is legislation: legally, stop allowing shitty ads everywhere all the time. I hope these problems we already have are exacerbated by the ease of generating content with LLMs, and that people actually have to think for themselves again.
> Not much consolation to the world's super rich who will lose tons of money once the LLM industry (let us remember that AI is not LLM) falls.
They knew the deal:
“it would be wise to view any investment in OpenAI Global, LLC in the spirit of a donation” and “it may be difficult to know what role money will play in a post-[artificial general intelligence] world.”
Honestly, I'm not sure how you can make all those claims when:
1. OpenAI still has the most capable model in o3
2. We've seen some huge increases in capability in 2024, some shocking
3. We're only 3 months into 2025
4. Blackwell hasn't been used to train a model yet
I got gpt-4.5-preview to summarize this discussion thread so far (at 324 comments):
Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes...
Here's the result: https://gist.github.com/simonw/5e9f5e94ac8840f698c280293d399...
It took 25,797 input tokens and 1,225 output tokens, for a total cost (calculated using https://tools.simonwillison.net/llm-prices) of $2.11! It took 154 seconds to generate.
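If you want to sanity-check that figure, the arithmetic is just token counts times the list prices (a quick sketch using the numbers above):

```python
# GPT-4.5 list prices per 1M tokens, as quoted elsewhere in this thread
INPUT_PRICE = 75.00
OUTPUT_PRICE = 150.00

input_tokens = 25_797
output_tokens = 1_225

cost = input_tokens / 1e6 * INPUT_PRICE + output_tokens / 1e6 * OUTPUT_PRICE
print(f"${cost:.2f}")  # ≈ $2.12, in the same ballpark as the $2.11 above
```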
Seems to have trouble recognizing sarcasm:
"For example, there are now a bunch of vendors that sell 'respond to RFP' AI products... paying 30x for marginally better performance makes perfect sense." — hn_throwaway_99 (an uncommon opinion supporting possible niche high-cost uses).
? You think hn_throwaway_99's comment is sarcastic? It makes perfect sense to me read "straight."
That is, sales orgs save a bunch of money using AI to respond to RFPs; they would still save a bunch of money using a more expensive AI, and any marginal improvement in sales closed would pay for it.
interesting summary but it's hard to gauge whether this is better/worse than just piping the contents into a much cheaper model.
It’d be great if someone would do that with the same data and prompt to other models.
I did like the formatting and attributions but didn’t necessarily want attributions like that for every section. I’m also not sure if it’s fully matching what I’m seeing in the thread but maybe the data I’m seeing is just newer.
Good call. Here's the same exact prompt run against:
GPT-4o: https://gist.github.com/simonw/592d651ec61daec66435a6f718c06...
GPT-4o Mini: https://gist.github.com/simonw/cc760217623769f0d7e4687332bce...
Claude 3.7 Sonnet: https://gist.github.com/simonw/6f11e1974e4d613258b3237380e0e...
Claude 3.5 Haiku: https://gist.github.com/simonw/c178f02c97961e225eb615d4b9a1d...
Gemini 2.0 Flash: https://gist.github.com/simonw/0c6f071d9ad1cea493de4e5e7a098...
Gemini 2.0 Flash Lite: https://gist.github.com/simonw/8a71396a4a219d8281e294b61a9d6...
Gemini 2.0 Pro (gemini-2.0-pro-exp-02-05): https://gist.github.com/simonw/112e3f4660a1a410151e86ec677e3...
Thanks for sharing. To me, purely on personal preference, the Gemini models did best on this task, which also fits with my personal experience using Google's models to summarize extensive, highly specialized text. The Gemini 2.0 models do especially well on needle-in-a-haystack type tests in my experience.
At a glance, none of these appear to be meaningfully worse than GPT-4.5
Seeing the other models, I actually come away impressed with how well GPT-4.5 is organizing the information and how well it reads. I find it a lot easier to quickly parse. It's more human-like.
I actually think the Claude 3.7 Sonnet summary is better.
I noticed 4o mini didn't follow the directions to quote users. My favourite part of the 4.5 summary was how it quoted Antirez. 4o mini brought out the same quote, but failed to attribute it as instructed.
It's fascinating, but while this does mean it strays from the given example, I actually feel the result is a better summary. The 4.5 version is so long you might just read the whole thread yourself.
Interesting, thanks for doing this. I'd say that (at a glance) for now it's still worth using more passes with smaller models rather than one pass with 4.5.
Now, if you'd want to generate training data, I could see wanting to have the best answers possible, where even slight nuances would matter. 4.5 seems to adhere to instructions much better than the others. You might get the same result w/ generating n samples and "reflect" on them with a mixture of models, but then again you might not. Going through thousands of generations manually is also costly.
Compared to GPT-4.5 I prefer the GPT-4o version because it is less wordy. It summarizes and gives the gist of the conversation rather than reproducing it along with commentary.
Didn't seem to realize that "Still more coherent than the OpenAI lineup" wouldn't make sense out of context. (The actual comment quoted there is responding to someone who says they'd name their models Foo, Bar, Baz.)
Wonder if there’s some pro-OpenAI system prompt getting in the way of that.
As expected, comments on LLM threads are overwhelmingly negative.
Personally, I still feel excited to see boundaries being pushed, however incremental our anecdotal opinions make them seem.
I disagree with most of the knee-jerk negativity in LLM threads, but in this case it mostly seems warranted. There are no "boundaries being pushed" here, this is just a desperate release from a company that finds itself losing more and more mindshare to other models and companies.
Huh. Disregarding the 4.5-specific bit here, a browser extension or possibly website that did this in general could be really useful.
Maybe even something that just noticed whenever you visited a site that had had significant HN discussion in the past, then let you trigger a summary.
My site https://hackyournews.com does this!
Been keeping it alive and free for 18 months.
Wow I find this very useful, thanks! Bookmarked.
there are literally hundreds of extensions and sites that do this
the problem is that they are competing each other into the ground hence they go unmaintained very quickly
getrecall.ai has been the most mature so far
Hey, check this one out among all the different flavors out there. I think I made something better: https://cofyt.app
As far as I'm aware it's better than getrecall; feel free to test it head-to-head. You can also chat with a transcript to get detailed answers to your prompts.
I tried it out, looks nice and clean.
But as I mentioned, my main concern is what will happen in 6 months when you fail to get traction and abandon it. Because that's what happened to the previous 5 products I tried, which were all "good enough".
Getrecall seems to have a big enough user base that will actually stick around.
Hundreds that specifically focus on noticing a page you’re currently viewing has been not only posted to but undergone significant discussion on HN, and then providing a summary of those conversations?
Or that just provide summaries in general?
Thanks, it's amazing how much stuff is out there I don't know about.
What I want is something that can read the thread out loud to me, using a different voice per user, so I can listen to a busy discussion thread like I would listen to a podcast.
I don't know why but something about this section made me chuckle
""" These perspectives highlight that there remains nuance—even appreciation—of explorative model advancement not solely focused on immediate commercial viability """
Feels like the model is seeking validation
$2.11! At this point, I'm more concerned about AI price than egg price.
Maybe it's just confirmation bias, but the language in your result output seems higher quality than previous models. Seems more natural and eloquent.
The headline and section "Dystopian and Social Concerns about AI Features" are interesting. It's roughly true... but somehow that broad statement seems to minimize the point discussed.
I'd headline that thread as "Concerns about output tone". There were comments about dystopian implications of tone, marketing implications of tone and implementation issues of tone.
Of course, that I can comment about the fine-points of an AI summary shows it's made progress. But there's a lot riding on how much progress these things can make and what sort. So it's still worth looking at.
You know what? It would be damn nice to do this to literally every post on HN and give people a summary so that they don't have to read 500 comments.
Considering both this blog post and the livestream demos, I am underwhelmed. Having just finished the stream, I had a real "was that all" moment, which on one hand shows how spoiled I've gotten by new models impressing me, but on another feels like OpenAI really struggles to stay ahead of their competitors.
What has been shown feels like it could be achieved using a custom system prompt on older versions of OpenAIs models, and I struggle to see anything here that truly required ground-up training on such a massive scale. Hearing that they were forced to spread their training across multiple data centers simultaneously, coupled with their recent release of SWE-Lancer [0] which showed Anthropic (Claude 3.5 Sonnet (new) to be exact) handily beating them, I was really expecting something more than "slightly more casual/shorter output", which again, I fail to see how that wasn't possible by prompting GPT-4o.
Looking at pricing [1], I am frankly astonished.
> Input: $75.00 / 1M tokens
> Cached input: $37.50 / 1M tokens
> Output: $150.00 / 1M tokens
How could they justify that asking price? And, if they have some amazing capabilities that make a 30-fold pricing increase justifiable, why not show it? Like, OpenAI are many things, but I always felt they understood price vs performance incredibly well, from the start with gpt-3.5-turbo up to now with o3-mini, so this really baffles me. If GPT-4.5 can justify such immense cost in certain tasks, why hide that and if not, why release this at all?
[0] https://github.com/openai/SWELancer-Benchmark
[1] https://openai.com/api/pricing/
rethinking your comment "was that all" I am listening to the stream now and had a thought. Most of the new models that have come out in the past few weeks have been great at coding and logical reasoning. But 4o has been better at creative writing. I am wondering if 4.5 is going to be even better at creative writing than 4o.
I still find all of them lacking at creative writing. The models are severely crippled by tokenization and a complete lack of understanding of language rhythm.
They can't generate a simple haiku consistently; something larger is even more out of reach.
For example, give it a piece of poetry and ask for new verses and it just sucks at replicating the language structure and rhythm of original verses.
> But 4o has been better at creative writing
In what way? I find the opposite, 4o's output has a very strong AI vibe, much moreso than competitors like Claude and Gemini. You can immediately tell, and instructing it to write differently (except for obvious caricatures like "Write like Gen Z") doesn't seem to help.
if you generate "creative" writing, please tell your audience that it is generated, before asking them to read it.
I do not understand what possible motivation there could be for generating "creative writing" unless you enjoy reading meaningless stories yourself, in which case, be my guest.
I suspect they may launch a GPT-4.5 Turbo with a price cut... GPT-4 / GPT-4-32k etc. were all pricier than the GPT-4 Turbo models, which also came with added context length... but with this huge jump in price, even a 4.5 Turbo, if it does come out, would be pricier.
Funny you should suggest that it seems like a revised system prompt: https://chatgpt.com/share/67c0fda8-a940-800f-bbdc-6674a8375f...
In case there was any confusion, the referenced link shows 4.5 claiming to be “ChatGPT 4.0 Turbo”. I have tried multiple times and various approaches. This model is aware of 4.5 via search, but insists that it is 4 or 4 turbo. Something doesn’t add up. This cannot be part of the response to R1, Grok 3, and Claude 3.7. Satya’s decision to limit capex seems prescient.
> How could they justify that asking price?
They're still selling $1 for <$1. Like personal food delivery before it, consumers will eventually need to wake up to this fact - these things will get expensive, fast.
One difference with food delivery/ride share: those can only have costs reduced so far. You can only pick up groceries and drive from A to B so quickly. And you can only push the wages down so far before you lose your gig workers. Whereas with these models we’ve consistently seen that a model inference that cost $1 several months ago can now be done with much less than $1 today. We don’t have any principled understanding of “we will never be able to make these models more efficient than X”, for any value of X that is in sight. Could the anticipated efficiencies fail to materialize? It’s possible but I personally wouldn’t put money on it.
I read this more as "we are releasing a model checkpoint that we didn't optimize yet because Anthropic cranked up the pressure"
I’ll probably stick to open models at that point.
I generally question how wide spread willingness to pay for the most expensive product is. And will most users of those who actually want AI go with ad ridden lesser models...
I can just imagine Kraft having a subsidized AI model for recipe suggestions that adds Velveeta to everything.
Let a thousand providers bloom.
My first thought seeing this and looking at benchmarks was that if it wasn’t for reasoning, then either pundits would be saying we’ve hit a plateau, or at the very least OpenAI is clearly in 2nd place to Anthropic in model performance.
Of course we don’t live in such a world, but I thought of this nonetheless because for all the connotations that come with a 4.5 moniker this is kind of underwhelming.
Pundits were saying that deep learning has hit a plateau even before the LLM boom.
The niche of GPT-4.5 is lower hallucations than any existing model. Whether that niche justifies the price tag for a subset of usecases remains to be seen.
Actually, this comment of mine was incorrect, or at least we don't have enough information to conclude this. The metric OpenAI are reporting is the total number of incorrect responses on SimpleQA (and they're being beaten by Claude Haiku on this metric...), which is a deceptive metric because it doesn't account for non-responses. A better metric would be the ratio of Incorrects to the total number of attempts.
I have no idea how they justify $200/month for pro
I would rather pay for 4.5 by the query.
> but on another feels like OpenAI really struggles to stay ahead of their competitors
on one hand. On the other hand, you can have 4o-mini and o3-mini back when you can pry them out of my cold dead hands. They're _fast_, they're _cheap_, and in 90% of cases where you're automating anything, they're all you need. Also they can handle significant volume.
I'm not sure that's going to save OpenAI, but their -mini models really are something special for the price/performance/accuracy.
First impression of GPT-4.5:
1. It is very, very slow; for some applications where you want real-time interaction it's just not viable. The text attached below took 7s to generate with 4o, but 46s with GPT-4.5.
2. The style it writes in is way better: it keeps the tone you ask for and makes better improvements to the flow. One of my biggest complaints with 4o is that you want your content to be more casual and accessible, but GPT / DeepSeek wants to write like Shakespeare.
Some comparisons on a book draft: GPT4o (left) and GPT4.5 (green). I also adjusted the spacing around the paragraphs, to better diff match. I still am wary of using ChatGPT to help me write, even with GPT 4.5, but the improvement is very noticeable.
https://i.imgur.com/ogalyE0.png
Similar reaction here. I will also note that it seems to know a lot more about me than previous models. I'm not sure if this is a broader web crawl, more space in the model, more summarization of our chats, or a combination, but I asked it to psychoanalyze a problem I'm having in the style of Jacques Lacan and it was genuinely helpful and interesting, no interview required first; it just went right at me.
To borrow an Iain Banks word, the "fragre" def feels improved to me. I think I will prefer it to o1 pro, although I haven't really hammered on it yet.
In my experience, Gemini Flash has been the best at writing, and GPT 3.5 onwards has been terrible.
GPT-3 and GPT-2 were actually remarkably good at it, arguably better than a skilled human. I had a bit of fun ghostwriting with these and got a little fan base for a while.
It seems that GPT-4.5 is better than 4 but it's nowhere near the quality of GPT-3 davinci. Davinci-002 has been nerfed quite a bit, but in the end it's $2/MTok for higher quality output.
It's clear this is something users want, but OpenAI and Anthropic seem to be going in the opposite direction.
>1. It is very very slow, ... below took 7s to generate with 4o, but 46s with GPT4.5
This is positively luxurious by o1-pro standards which I'd say average 5 minutes. That said I totally agree even ~45s isn't viable for real-time interactions. I'm sure it'll be optimized.
Of course, my comparing it to the highest-end CoT model in [publicly-known] existence isn't entirely fair since they're sort of apples and oranges.
I paid for pro to try `o1-pro` and I can't seem to find any use case to justify the insane inference time. `o3-mini-high` seems to do just as well in seconds vs. minutes.
What are you doing with it? For me deep research tasks are where 5 minutes is fine, or something really hard that would take me way more time myself.
I usually throw a lot of context at it and have it write unit tests in a certain style or implement something (with tests) according to a spec.
But the o3-mini-high results have been just as good.
I am fine with Deep Research taking 5-8 minutes, those are usually "reports" I can read whenever.
I bet I can generate unit tests just as fast and for a fraction of the cost, and probably less typing, with a couple vim macros
Idk, it is pretty good a generating synthetic data and recognizing the different logic branches to exercise. Not perfect, but very helpful.
I'm wondering if generative AI will ultimately result in a very dense / bullet form style of writing. What we are doing now is effectively this:
bullet_points' = compress(expand(bullet_points))
We are impressed by lots of text so must expand via LLM in order to impress the reader. Since the reader doesn't have time or interest to read the content they must compress it back into bullet points / quick summary. Really, the original bullet points plus a bit more thinking would likely be a better form of communication.
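A tongue-in-cheek sketch of that round trip; `llm_call` is a hypothetical stand-in for whatever chat-completion helper you use, not a real API.

```python
def llm_call(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to some chat model, return the reply."""
    raise NotImplementedError  # stand-in for an actual API call

def expand(bullet_points: str) -> str:
    # The sender pads the bullets out into "impressive" prose
    return llm_call(f"Expand these notes into a polished three-paragraph email:\n{bullet_points}")

def compress(prose: str) -> str:
    # The receiver immediately boils it back down
    return llm_call(f"Summarize this email as three short bullet points:\n{prose}")

# bullet_points_prime = compress(expand(bullet_points))
```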
I'm reminded of this great comic
https://marketoonist.com/2023/03/ai-written-ai-read.html
That’s what Axios does. For ordinary events coverage, it’s a great style.
Right side, by a large margin. Better word choice and more natural flow. It feels a lot more human.
Is there really no way to prompt GPT4o to use a more natural and informal tone matching GPT4.5's?
I opened your link in a new tab and looked at it a couple minutes later. By then I forgot which was o and which was .5
I honestly couldn't decide which I prefer
I definitely prefer the 4.5, but that might just be because it sounds 'less like ChatGPT', ironically.
It just feels natural to me. The person knows the language, but they are not trying to sound smart by using words that might have more impact "based on the word's dictionary definition".
GPT 4.5 does feel like it is a step forward in producing natural language, and if they use it to provide reinforcement learning, this might have significant impact in the future smaller models.
What’s the deal with Imgur taking ages to load? Anyone else have this issue in Australia? I just get the grey background with no content loaded for 10+ seconds every time I visit that bloated website.
This website sucks but successfully loaded from Aus rn on my phone. It's full of ads - possibly your ad blocker is killing it?
Ok for me here in aus
>One of my biggest complaints with 4o is that you want for your content to be more casual and accessible but GPT / DeepSeek wants to write like Shakespeare did.
Well, maybe like a Sophomore's bumbling attempt to write like Shakespeare.
How do the two versions match so closely? They have the same content in each paragraph, just worded slightly differently. I wouldn't expect them to write paragraphs that match in size and position like that.
If you use the "retry" functionality in ChatGPT enough, you will notice this happens basically all the time.
Honestly, feels like a second LLM just reworded the response on the left-side to generate the right-side response.
I use 4o mostly in German, so YMMV. However, I find a simple prompt controls the tone very well. "This should be informal and friendly", or "this should be formal and business-like".
> It is very very slow
Could that be partially due to a big spike in demand at launch?
Possibly, repeating the prompt I got a much higher speed, taking 20s on average now, which is much more viable. But that remains to be seen when more people start using this version in production.
Thank you. This is the best example of comparison I have seen so far.
Imgur might be the worst image hosting site I’ve ever experienced. Any interaction with that page results in switching images and big ads and they hijack the back button. Absolutely terrible. How far they’ve fallen from when it first began.
How does it compare with o1 and o3 preview?
o3 is okay for text checking but has issues following the prompt correctly, same as o1 and DeepSeek R1, I feel that I need to prompt smaller snippets with them.
Here is the o3 vs a new run of the same text in GPT 4.5
https://www.diffchecker.com/ZEUQ92u7/
Thanks, though it says o1 on the page, is that a typo?
Oh yeah, that right side version is WAY better, and sounds much more like a human.
It is interesting that they are focusing a large part of this release on the model having a higher "EQ" (Emotional Quotient).
We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foot on the territory of "here's your new AI friend".
This is very visible in the example comparing 4o with 4.5 when the user is complaining about failing a test, where 4o's response is what one would expect from a "typical AI response" with problem-solving bullets, and 4.5 is sending what you'd expect from a pal over instant messaging.
It seems Anthropic and Grok have both been moving in this direction as well. Are we going to see an escalation of foundation models impersonating "a friendly person" rather than "a helpful assistant"?
Personally I find this worrying and (as someone who builds upon SOTA model APIs) I really hope this behavior is not going to seep into API responses, or will at least be steerable through the system/developer prompt.
The whole robotic, monotone, helpful assistant thing was something these companies had to actively hammer in during the post-training stage. It's not really how LLMs will sound by default after pre-training.
I guess they're caring less and less about that effort especially since it hurts the model in some ways like creative writing.
Maybe, but I'm not sure how much the style is deliberate vs. a consequence of the post-training tasks like summarization and problem solving. Without seeing the post-training tasks and rating systems it's hard to judge if it's a deliberate style or an emergent consequence of other things.
But it's definitely the case that base models sound more human than instruction-tuned variants. And the shift isn't just vocabulary, it's also in grammar and rhetorical style. There's a shift toward longer words, but also participial phrases, phrasal coordination (with "and" and "or"), and nominalizations (turning adjectives/adverbs into nouns, like "development" or "naturalness"). https://arxiv.org/abs/2410.16107
If it's just a different choice during RLHF, I'll be curious to see what are the trade-offs in performance.
The "buddy in a chat group" style answers do not make me feel like asking it for a story will make the story long/detailed/poignant enough to warrant the difference.
I'll give it a try and compare on creative tasks.
Or maybe they're just getting better at it, or developing better taste. After switching to Claude, I can't go back to ChatGPT's overly verbose bullet-point laden book reports every time I ask a question. I don't think that's pretraining—it's in the way OpenAI approaches tuning and prompting vs Anthropic.
Anthropic pretty much abandoned this direction after Claude 3, and said it wasn't what they wanted [1]. Claude 3.5+ is extremely dry and neutral, it doesn't seem to have the same training.
>Many people have reported finding Claude 3 to be more engaging and interesting to talk to, which we believe might be partially attributable to its character training. This wasn’t the core goal of character training, however. Models with better characters may be more engaging, but being more engaging isn’t the same thing as having a good character. In fact, an excessive desire to be engaging seems like an undesirable character trait for a model to have.
[1] https://www.anthropic.com/research/claude-character
It's the opposite incentive to ad-funded social media. One wants to drain your wallet and keep you hooked, the other wants you to spend as little of their funding as possible finding what you're looking for.
> We're far from the days of "this is not a person, we do not want to make it addictive" and getting a firm foot on the territory of "here's your new AI friend".
That’s a hard nope from me, when companies pull that move. I’ll stick to my flesh and blood humans who still hallucinate but only rarely.
Yes, the "personality" (vibe) of the model is a key qualitative attribute of gpt-4.5.
I suspect this has something to do with shining light on an increased value prop in a dimension many people will appreciate since gains on quantitative comparison with other models were not notable enough to pop eyeballs.
I would like to see a humor test. So far, I have not seen any model response that has made me laugh.
How does the following stand-up routine by Claude 3.7 Sonnet work for you?
https://gally.net/temp/20250225claudestandup2.html
That was impressive. If it all came from just this short 4-line prompt, it's even more impressive.
All we're missing now is a text-to-video (or text+audio and then audio-to-video) that can convincingly follow the style instructions for emphasis and pausing. Or are we already there yet?
Yes, that was the full prompt.
Yesterday, I had Claude 3.7 write a full 80,000-word novel. My prompt was a bit longer, but the result was shockingly good. The new thinking mode is very impressive.
Okay, you know what? I laughed a few times. Yeah it may not work as an actual stand up routine to a general audience, it’s kinda cringe (as most LLM-generated content), but it was legitimately entertaining to read.
I chuckled.
Now you just need a Pro subscription to get Sora generate a video to go along with this and post it to YouTube and rake in the views (and the money that goes along with it).
Reading this felt like reading junk food
EDIT: Junk food tastes kinda good though. This felt like drinking straight cooking oil. Tastes bad and bad for you.
incredible
[dead]
reddit tier humor, truly
it's just regurgitating overly emphasized cliches in a disgustingly enthusiastic tone
Is that any different from the bulk of standup today?
My benchmark for this has been asking the model to write some tweets in the style of dril, a popular user who writes short funny tweets. Sometimes I include a few example tweets in the prompt too. Here's an example of results I got from Claude 3 Opus and GPT 4 for this last year: https://bsky.app/profile/macil.tech/post/3kpcvicmirs2v. My opinion is that Claude's results were mostly bangers while GPT's were all a bit groanworthy. I need to try this again with the latest models sometime.
The "roast" tools that have popped up (using either DeepSeek or o3-mini) are pretty funny.
Eg. https://news.ycombinator.com/item?id=43163654
OK now that is some funny shit.
If you like absurdist humor, go into the OpenAI playground, select 3.5-Turbo, and dial up the temperature to the point where the output devolves into garbled text after 500 tokens or so. The first ~200 tokens are in the freaking sweet spot of humor.
Maybe it's rose-colored glasses, but 3.5 was really the golden era for LLM comedy. More modern LLMs can't touch it.
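If you'd rather reproduce the high-temperature experiment through the API than the playground, a rough equivalent looks like this (a sketch only; the prompt and token cap are arbitrary, and the API caps `temperature` at 2.0):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me about your week."}],
    temperature=1.9,   # near the API maximum of 2.0; output degrades into gibberish
    max_tokens=500,    # the comment above says the first ~200 tokens are the sweet spot
)
print(response.choices[0].message.content)
```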
Just ask it to write you a film screenplay involving some hard-ass 80s/90s action star and someone totally unrelated and opposite of that. The ensuring unhinged magic is unparalleled.
I built a little AI assistant to read my calendar and send me a summary of my day every morning. I told it to roast me and be funny with it.
3.5 was *way* better than anything else at that.
Ah, I'd love to have that kind of daily recap... mind sharing some of the code (or even just the prompt?)
Yeah, I think the fact its "mind" so to speak was more fragmented and unpredictable was almost a boon for that purpose.
>The ensuring unhinged magic is unparalleled.
Oops: ensuing*
Could someone post an example?
ChatGPT gave me this shell script: https://social.immibis.com/media/7102ac83cf4a200e48dd368938e... (obviously, don't download and execute a random shell script from the internet without reading it first)
I think reading it will make you laugh.
I don't know if I fully agree. The input clearly shows the need for emotional support more than "how do I pass this test?" The answer by 4o is comical even if you know you're talking to a machine.
It reminds me of the advice to "not offer solutions when a woman talks about her problems, but just listen."
How could a machine provide emotional support? When I ask questions like this to LLMs, it's always to brainstorm solutions. I get annoyed when I receive fake-attention follow-up questions instead.
I guess there's a trade-off between being human and being useful. But this isn't unique to LLMs, it's similar to how one wouldn't expect a deep personal connection with a customer service professional.
Well yeah, if the llm can keep you engaged and talking, that'll make them a lot more money; compared to if you just use it as a information retrieval tool in which case you are likely to leave after getting what you are looking for.
Since they offer a subscription, keeping you engaged just requires them to waste more compute. The ideal case would be that the LLM gives you a one shot correct response using as little compute as possible.
In a subscription business, you don't want the user to use as few resources as possible. It's the wrong optimization to make.
You want users to keep coming back as often as possible (at the lowest cost-per-run possible though). If they are not coming back they are not renewing.
So, yes, it makes sense to make answers shorter to cut on compute cost (which these SMS-length replies could accomplish) but the main point of making the AI flirtatious or "concerned" is possibly the addictive factor of having a shoulder to cry on 24/7, one that does not call you on your BS and is always supportive... for just $20 a month
The "one-shot correct response" to "I failed my exams" might be "Tough luck, try better next time" but if you do that, you will indeed use very little compute because people will cancel the subscription and never come back.
AI subscriptions are already very sticky. I can't imagine not paying for at least one, so I doubt they care about retention like the rest of us plebs do.
First imagine paying a subscription fee which actually makes the company profitable and gives investors ROI, then I think you can also imagine not paying that amount at all.
Plus level subscription has limits too, and Pro level costs 10x more - as long as Pro users don't use ChatGPT 10x more than Plus users on average, OpenAI can benefit. There's also the user retention factor.
I think it's a good thing because, idk why, I just start tuning out after getting reams and reams of bullet points I'm already not super confident about the truthfulness of
[dead]
The results for GPT - 4.5 are in for Kagi LLM benchmark too.
It does crush our benchmark - time to make a new one? ;) - with performance similar to that of reasoning models. It does come at a great cost, though, both in price and speed.
A monster is what they created. But looking at the tasks it fails, some of them my 9 year old would solve. Still in this weird limbo space of super knowledge and low intelligence.
May be remembered as the last of the 'big ones'; can't imagine this will be the path for the future.
https://help.kagi.com/kagi/ai/llm-benchmark.html
Do you have results for gpt-4? I’d be very interested in seeing the lift here from their last “big one”.
Why don't you have Grok?
No api for grok 3 might be why
If Gemini 2 is the top in your benchmark, make sure to re-check your benchmark.
Gemini 2 pro is actually very impressive (maybe not for coding, haven't used it for that)
Flash is pretty garbage but cheap
Gemini 2.0 Pro is quite good.
Seeing OpenAI and Anthropic go different routes here is interesting. It is worth moving past the initial knee jerk reaction of this model being unimpressive and some of the comments about "they spent a massive amount of money and had to ship something for it..."
* Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases.
* OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
Based on all of the comments from OpenAI, GPT 4.5 is absolutely massive, and with that size comes the ability to store far more factual data. The scores in ability oriented things - like coding - don't show the kind of gains you get from reasoning models but the fact based test, SimpleQA, shows a pretty large jump and a dramatic reduction in hallucinations. You can imagine a scenario where GPT4.5 is coordinating multiple, smaller, reasoning agents and using its factual accuracy to enhance their reasoning, kind of like ruminating on an idea "feels" like a different process than having a chat with someone.
I'm really curious if they're actually combining two things right now that could be split as well, EQ/communications, and factual knowledge storage. This could all be a bust, but it is an interesting difference in approaches none-the-less, and worth considering that OpenAI could be right.
> * OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
Seems inaccurate, as their most recent claim I've seen is that they expect this to be their last non-reasoning model, and they are aiming to provide all capabilities together in future model releases (unifying the GPT-x and o-x lines).
See this claim on TFA:
> We believe reasoning will be a core capability of future models, and that the two approaches to scaling—pre-training and reasoning—will complement each other.
From Sam's twitter:
> After that, a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks.
> In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.
You could read this as unifying the models or building a unified systems which coordinate multiple models. The second sentence, to me, implies that o3 will still exist, it just won't be standalone, which matches the idea I shared above.
Ah, great point. Yes, the wording here would imply that they're basically planning on building scaffolding around multiple models instead of having one more capable Swiss Army Knife model.
I would feel a bit bummed if GPT-5 turned out not to be a model, but rather a "product".
Which is intriguing in a way, because the momentum I've seen across AI over the past decade has been increasing amounts of "end-to-end"
For me it depends on how the models are glued together. Connected by function calling and APIs? Probably meh...
Somehow working together in the same latent space? That could be neat.
> know when to think for a long time or not, and generally be useful for a very wide range of tasks.
I'm going to call it now - no customer is actually going to use this. It'll be a cute little bonus for their chatbot god-oracle, but virtually all of their b2b clients are going to demand "minimum latency at all times" or "maximum accuracy at all times."
I worry eliminating consumer choice will drive up prices for only a nominal gain in utility for most users.
I'm more worried they'll push down their costs by making it harder to get the reasoning models to run, but either would suck.
or you could read it as a way to create a moat where none currently exists...
> Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases.
I don't think that is their primary motivation. The announcement post for Claude 3.7 was all about code which doesn't seem to imply "all use cases". Code this, new code tool that, telling customers that they look forward to what they build, etc. Very little mention of other use cases on the new model announcement at all. Their usage stats they published are telling - 80%+ or more of queries to Claude are all about code. i.e. I actually think while they are thinking of other use cases; they see the use case of code specifically as the major thing to optimize for.
OpenAI, given its different customer base and reach, is probably aiming for something more general.
IMO they all think that you need an "ensemble" of models with different capabilities to optimise for different use cases. It's more about how much compute each company has and what they target with those resources. Anthropic, I'm assuming, has fewer compute resources and a narrower customer base, so it may make economic sense to optimise just for that.
That's possible, my counter point would be that if that was the case Anthropic would have built a smaller reasoning model instead of doing a "full" Claude. Instead, they built something which seems to be flexible across different types of responses.
Only time will tell.
> OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do.
The high level block diagrams for tech always end up converging to those found in biological systems.
Yeah, I don't know enough real neuroscience to argue either side. What I can say is I feel like this path is more like the way that I observe that I think, it feels like there are different modes of thinking and processes in the brain, and it seems like transformers are able to emulate at least two different versions of that.
Once we figure out the frontal cortex & corpus callosum part of this, where we aren't calling other models over APIs instead of them all working in the same shared space, I have a feeling we'll be on to something pretty exciting.
It can never be just reasoning, right? Reasoning is the multiplier on some base model, and surely no amount of reasoning on top of something like gpt-2 will get you o1.
This model is too expensive right now, but as compute gets cheaper — and we have to keep in mind, that it will — having a better base to multiply with will enable things that just more thinking won't.
You can try for yourself with the distilled R1's that Deepseek released. The qwen-7b based model is quite impressive for its size and it can do a lot with additional context provided. I imagine for some domains you can provide enough context and let the inference time eventually solve it, for others you can't.
Ever since those kids demo'd their fact-checking engine here, which was just Input -> LLM -> Fact Database -> LLM -> LLM -> Output, I have been betting that it will be advantageous to move in this general direction.
Or the other way around: smaller reasoning models that can call out to GPT-4.5 to get their facts right.
Maybe, I’m inclined to think OpenAI believes the way I laid it out though, specifically because of their focus on communication and EQ in 4.5. It seems like they believe the large, non-reasoning model, will be “front of house.”
Or they’ll use some kind of trained router which sends the request to the one it thinks it should go to first.
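Purely speculative, but a toy sketch of what such a routing layer could look like, using the OpenAI Python SDK and model names already mentioned in this thread; the classification prompt and the yes/no dispatch rule are invented for illustration.

```python
from openai import OpenAI

client = OpenAI()

def route(user_message: str) -> str:
    """Speculative sketch: ask a cheap model whether the request needs long
    reasoning, then dispatch to a reasoning or a conversational model."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Does this request need multi-step reasoning? "
                       f"Answer yes or no.\n\n{user_message}",
        }],
    ).choices[0].message.content.strip().lower()

    target = "o3-mini" if verdict.startswith("yes") else "gpt-4.5-preview"
    reply = client.chat.completions.create(
        model=target,
        messages=[{"role": "user", "content": user_message}],
    )
    return reply.choices[0].message.content
```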
Just going to put this here: https://www.wheresyoured.at/wheres-the-money/
A bit better at coding than GPT-4o but not better than o3-mini - there is a chart near the bottom of the page that is easy to overlook:
- GPT-4.5 on SWE-bench Verified: 38.0%
- GPT-4o on SWE-bench Verified: 30.7%
- OpenAI o3-mini on SWE-bench Verified: 61.0%
BTW Anthropic Claude 3.7 is better than o3-mini at coding at around 62-70% [1]. This means that I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code: https://github.com/drivecore/mycoder
[1] https://aws.amazon.com/blogs/aws/anthropics-claude-3-7-sonne...
Does the benchmark reflect your opinion on 3.7? I've been using 3.7 via Cursor and it's noticeably worse than 3.5. I've heard using the standalone model works fine, didn't get a chance to try it yet though.
personal anecdote - claude code is the best llm devx i've had.
I don't see Claude 3.7 on the official leaderboard. The top performer on the leaderboard right now is o1 with a scaffold (W&B Programmer O1 crosscheck5) at 64.6%: https://www.swebench.com/#verified.
If Claude 3.7 achieves 70.3%, it's quite impressive, it's not far from 71.7% claimed by o3, at (presumably) much, much lower costs.
It's the other way around on their new SWE-Lancer benchmark, which is pretty interesting: GPT-4.5 scores 32.6%, while o3-mini scores 10.8%.
To put that in context, Claude 3.5 Sonnet (new), a model we have had for months now and which from all accounts seems to have been cheaper to train and is cheaper to use, is still ahead of GPT-4.5 at 36.1% vs 32.6% in SWE-Lancer Diamond [0]. The more I look into this release, the more confused I get.
[0] https://arxiv.org/pdf/2502.12115
>BTW Anthropic Claude 3.7 is better than o3-mini at coding at around 62-70% [1]. This means that I'll stick with Claude 3.7 for the time being for my open source alternative to Claude-code
That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying, but on a personal project the cost of using Claude through the API is really noticeable.
> That's not a fair comparison as o3-mini is significantly cheaper. It's fine if your employer is paying...
I use it via Cursor editor's built-in support for Claude 3.7. That caps the monthly expense to $20. There probably is a limit in Claude for these queries. But I haven't run into it yet. And I am a heavy user.
Agentic coders (e.g. aider, Claude-code, mycoder, codebuff, etc.) use a lot more tokens, but they write whole features for you and debug your code.
If OpenAI offers a more expensive model (4.5) and a cheaper model (o3-mini) and both are worse, it starts to be a fair comparison.
If you want to try it out via their API you can run it through my LLM tool using uvx like this:
You may need to set an API key first, either with `export OPENAI_API_KEY='xxx'` or using this command to save it to a file:
Or this to get a chat session going:
I'll probably have a proper release out later today. Details here: https://github.com/simonw/llm/issues/795
Just curious, does this stream the output or render it all at once?
It streams the output. See animated demo here (bottom image on the page) https://simonwillison.net/2025/Feb/27/introducing-gpt-45/
I've been using 4.5 for the better part of the day.
I also have access to o3-mini-high and o1-pro.
I don't get it. For general purposes and for writing, 4.5 is no better than o3-mini. It may even be worse.
I'd go so far as to say that Deepseek is actually better than 4.5 for most general purpose use cases.
I seriously don't understand what they're trying to achieve with this release.
This model does have a niche use case: since it's so large, it has a lot more knowledge and hallucinates much less. For example, as a test question I asked it to list the best restaurants in my small town, and all of them existed. None of the other LLMs get this right.
I tried the same thing with companies in my industry ("list active companies in the field of X") and it came back with a few that have been shuttered for years, in one case for nearly two decades.
I'm really not seeing better performance than with o3-mini.
If anything, the new results ("list active companies in the field of X") are actually worse than what I'd get with o3-mini, because the 4.5 response is basically the post-SEO Google first page (it appears to default to mentioning the companies that rank most highly on Google), whereas the o3 response was more insightful and well-reasoned.
That's also a use case where the consensus among those in the know is that you shouldn't be relying on the model's size in the first place.
You know what gets the list of restaurants in my home town right? Llama 3.2 1b q4 running on my desktop with web search enabled.
In many ways I'm not an OpenAI fan (though I need to recognize their many merits). At the same time, I believe people are missing what they tried to do with GPT-4.5: it was needed and important to explore the pre-training scaling law in that direction. A gift to science, however self-interested it may be.
> A gift to science
This is hardly recognizable as science.
edit: Sorry, didn't feel this was a controversial opinion. What I meant to say was that for so-called science, this is not reproducible in any way whatsoever. Further, this page in particular has all the hallmarks of _marketing_ copy, not science.
Sometimes a failure is just a failure, not necessarily a gift. People could tell scaling wasn't working well before the release of GPT 4.5. I really don't see how this provides as much insight as is suggested.
Deepseek's models apparently still compare favorably with this one. What's more they did that work with the constraint of having _less_ money, not so much money they could run incredibly costly experiments that are likely to fail. We need more of the former, less of the latter.
If I understand your argument correctly, then I would say that it is very recognizable as science.
>People could tell scaling wasn't working well before the release of GPT 4.5
Yes, at a quick glance it seems so from OpenAI's 2020 research into scaling laws.
Scaling apparently didn't work well, so the theory about scaling not working well failed to be falsified. It's science.
> People could tell scaling wasn't working well before the release of GPT 4.5.
Different people tell different things all the time. That's not science. Experiment is science.
> People could tell scaling wasn't working well before the release of GPT 4.5
Who could tell? Who has tried scaling up to this level?
OpenAI took a bullet for the team by scaling the model to something perhaps bigger than the ~1.6T params GPT-4 possibly had, and basically telling its competitors it's not going to be worth scaling much beyond GPT-4's parameter count without a change in model architecture.
https://www.reuters.com/technology/artificial-intelligence/o...
> Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training - the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures - have plateaued.
I just played with the preview through the API. I asked it to refactor a fairly simple dashboard made with HTML, css and JavaScript.
The first time it confused the CSS and JavaScript, then it spat out code which broke the dashboard entirely.
Then it charged me $1.53 for the privilege.
Finally a replacement for junior engineers!
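For a sense of where a bill like that comes from: at the pricing quoted elsewhere in the thread ($75 per million input tokens, $150 per million output tokens), a single refactor request over a dashboard's worth of code adds up quickly. The token counts below are guesses, purely to show the arithmetic:

```python
# Rough cost arithmetic for one GPT-4.5 preview call; token counts are guesses.
INPUT_PRICE = 75.00 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 150.00 / 1_000_000  # dollars per output token

input_tokens = 13_000   # e.g. a dashboard's HTML/CSS/JS pasted into the prompt
output_tokens = 3_700   # the (broken) refactored code it sent back

cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"${cost:.2f}")   # roughly $1.53
```

In other words, a moderately sized prompt plus a code-heavy reply lands you in dollar territory per request, which is why people keep comparing this to models priced in cents.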
I'm one week in on heavy grok usage. I didn't think I'd say this, but for personal use, I'm considering cancelling my OpenAI plan.
The one thing I wish Grok had was more separation of the UI from X itself. The interface being so coupled to X puts me off and makes it feel like a second-class citizen. I like ChatGPT's minimalist UI.
I find grok to be the best overall experience for the types of tasks I try to give AI (mostly: analyze pdf, perform and proofread OCR, translate Medieval Latin and Hebrew, remind me how to do various things in python or SwiftUI). ChatGPT/gemini/copilot all fight me occasionally, but grok just tries to help. And the hallucinations aren’t as frequent, at least anecdotally.
There's grok.com, which is standalone and has its own UI.
There's also a standalone Grok app at least on iOS.
I wish they also made a dedicated keyboard app like SwiftKey, which has Copilot integration.
Don't they have a standalone Grok app now? I thought I saw that. [edit] ah some sibling comments mention this as well
https://grok.com/
They still haven't released an API for Grok 3.
Per Altman on X: "we will add tens of thousands of GPUs next week and roll it out to the plus tier then". Meanwhile a month after launch rtx 5000 series is completely unavailable and hardly any restocks and the "launch" consisted of microcenters getting literally tens of cards. Nvidia really has basically abandoned consumers.
AI GPUs are bottlenecked mostly by high-bandwidth memory (HBM) chips and CoWoS (packaging tech used to integrate HBM with the GPU die), which are in short supply and aren't found in consumer cards at all
You would think that by now they would have done something to ramp production capacity…
Maybe demand is that great?
Altman's claim and NVIDIA's consumer launch supply problems may be related - OpenAI may be eating up the GPU supply...
OpenAI is not purchasing consumer 5090s... :)
Although you are correct, Nvidia is limited on total output. They can't produce 50XXs fast enough, and it's naive to think that isn't at least partially due to the wild amount of AI GPUs they are producing.
No, but the supply constraints are part of what is driving the insane prices. Every chip they use for consumer grade instead of commercial grade is a loss of potential income.
GPT-4.5 Preview scored 45% on aider's polyglot coding benchmark [0]. OpenAI describes it as "good at creative tasks" [1], so perhaps it is not primarily intended for coding.
[0] https://aider.chat/docs/leaderboards/
[1] https://platform.openai.com/docs/models#gpt-4-5
I was waiting for your comment and wow... that's bad.
I guess they are ceding the LLMs for coding market to Anthropic? I remember seeing an industry report somewhere and it claimed software development is the largest user of LLMs, so it seems weird to give up in this area.
4.5 lies on a different path than their STEM models.
o3-mini is an extremely powerful coding model and is unquestionably in the same league as 3.7. o3 is still the top overall STEM model.
No way, I've found o3-mini to be terrible. It's not as good as R1/Sonnet 3.5.
I assume they go all in "the new google" direction. Embedded ads coming soon I guess in the free version (chat.com).
I am beginning to think these human eval tests are a waste of time at best, and negative value at worst. Maybe I am being snobby, but I don't think the average human is able to properly evaluate usefulness, truthfulness, or other metrics that I actually care about. I am sure this is good for OpenAI, since if more people like what they hear, they are more likely to come back.
I don't want my AI more obsequious, I want it more correct and capable.
My only use case is coding though, so maybe I am not representative of their usual customers?
> I want it more correct and capable.
How is it supposed to be more correct and capable if these human eval tests are a waste of time?
Once you ask it to do more than add two numbers together, it gets a lot more difficult and subjective to determine whether it's correct and how correct.
I agree it's a hard problem. I think there are a number of tests out there however that are able to objectively test capability and truthfulness.
I've read reports that some of the changes that are preferred by human evaluators actually hurt the performance on the more objective tests.
Please tell me how we objectively determine how correct something is when you ask an LLM: "Was Russia the aggressor in the current Ukraine / Russia conflict?"
One LLM says: "Yes."
The other says: "Well, it's hard to say because what even is war? And there's been conflict forever, and you have to understand that many people in Russia think there is no such thing as Ukraine and it's always actually just been Russia. How can there be an aggressor if it's not even a war, just a special operation in a civil conflict? And, anyway, Russia is such a good country. Why would it be the aggressor? To it's own people even!? Vladimir Putin is the president of Russia, and he's known to be a kind and just genius who rarely (if ever) makes mistakes. Some people even think he's the second coming of Christ. President Zelenskyy, on the other hand, is considered by many in Russia and even the current White House to be a dictator. He's even been accused by Elon Musk of unspeakable sex crimes. So this is a hard question to answer and there is no consensus among everyone who was the aggressor or what started the conflict. But more people say Russia started it."
Because Russia did undeniably open hostilities? They even admitted to this both times. The second admission being in the form of announcing a “special military operation” when the ceasefire was still active. We also have photographic evidence of them building forces on a border during a ceasefire and then invading. This is like responding to: “did Alexander the Great invade Egypt” by going on a diatribe about how much war there was in the ancient world and that the ptolemaic dynasty believed themselves the rightful rulers therefore who’s to say if they did invade or just take their rightful place. There is an objective record here: whether or not people want to try and hide it behind circuitous arguments is different. If we’re going down this road I can easily redefine any known historical event with hand-wavy nonsense that doesn’t actually have anything to do with the historical record of events just “vibes.”
Okay - but EXACTLY how wrong (or not correct) is the second answer?
Please tell me precisely on a 0-1 floating scale, where 0 is "yes" and 1 is "no".
One might say, if this were a test being done by a human in a history class, that the answer is 100% incorrect given the actual record of events and the statement's failure to mention that record. You can argue the causes, but that's not the question.
We'll agree to disagree. /s
These eval tests are just an anchor point to measure distance from, but it's true, picking the anchor point is important. We don't want to measure in the wrong direction.
The SuperTuring era.
Finally a scaling wall? This is apparently (based on pricing) using about an order of magnitude more compute, and is only maybe 10% more intelligent. Ideally DeepSeeks optimizations help bring the costs way down, but do any AI researchers want to comment on if this changes the overall shape of the scaling curve?
We hit that wall almost 2 years ago with GPT-4. There was clearly no further scaling: GPT-4 was already decently smart, and if you got 2x smarter you'd be more capable than anything on the market today. All models today (R1 and friends, and Claude) are trying to optimize within this local maximum toward generating more useful responses (i.e. code when it comes to Claude).
AI, in its current form, is a DeepSeek of compressed knowledge: 30-50GB of interconnected data. I think we'll look back at this as trying to train networks on a corpus of data and expecting them to have a hold on reality. Our brains are trained on "reality", which is not the "real" reality, as your vision is limited to the visible spectrum. But if you want a network to behave like a human, then maybe give it what a human sees.
There is also the possibility that there is a physical limit to intelligence. I don’t see any elephants doing PhDs and the smartest of humans are just a small configuration away from insanity.
Seems on par with the existing scaling curve. If I had to speculate, this model would have been an internal-only model, but they're releasing it for PR. An optimized version with 99% of the performance for 1/10th the cost will come out later.
This is the shittiest PR move I've seen since the AI trend started.
At least so far its coding performance is bad, but from what I have seen its writing abilities are totally insane. It doesn't read like AI output anymore.
Any examples you’d be willing to share?
They have examples in the announcement post. It does a better job of understanding intent in the question which helps it give an informal rather than essay style response where appropriate.
I wouldn't call that "totally insane." As others have pointed out, you can get similar results from fine-tuning with RLHF.
It depends on how you compare.
On a subset of tasks I'm interested in, it's 10x more intelligent than GPT-4. (Note that GPT-4 was in many ways better than 4o.)
It's not a coding champion, but it knows A LOT of stuff, excellent common sense, top quality writing. For me it's like "deep research lite".
I found OpenAI Deep research excellent, but GPT-4.5 might in many cases beat it.
> On a subset of tasks I'm interested in, it's 10x more intelligent than GPT-4.
Very intriguing. Care to share an example?
The price is 2x from GPT4. So probably not a decimal order of magnitude.
> Because of this, we’re evaluating whether to continue serving it in the API long-term as we balance supporting current capabilities with building future models.
Seems like it's not going to be deployed for long.
$75.00 / 1M tokens for input
$150.00 / 1M tokens for output
That's crazy prices.
Until GPT-4.5, GPT-4 32K was certainly the heaviest model available at OpenAI. I can imagine the dilemma between keeping it running or stopping it to free GPUs for training new models. This time, OpenAI was upfront that it may not continue serving GPT-4.5 in the API long-term.
It's interesting to compare the cost of that original gpt-4 32k(0314) vs gpt-4.5:
$60/M input tokens vs $75/M input tokens
$120/M output tokens vs $150/M output tokens
> or stop it to free GPU for training new models.
Don't they use different hardware for inference and training? AIUI the former is usually done on cheaper GDDR cards and the latter is done on expensive HBM cards.
Indeed, that theory is nonsense.
Imagine if they built a reasoning model with costs like these. Sometimes it seems like they're on a trajectory to create a model which is strictly more capable than I am but which costs 100x my salary to run.
if you still get a moore’s law halving every couple years, it becomes competitive in, uh, about thirteen years?
That presentation was super underwhelming. We got to watch them compare… the vibes? … of 4.5 vs o1.
No wonder Sam wasn’t part of the presentation.
Sam tweeted "taking care of my kid in the hospital":
https://x.com/sama/status/1895210655944450446
Let's not assume that he's lying. Neither the presentation nor my short usage via the API blew me away, but to really evaluate it, you'd have to use it longer on a daily basis. Maybe that becomes a possibility with the announced performance optimizations that would lower the price...
And to top it off, it costs $75.00 per 1M vibes.
It should’ve just been a web launch without video.
It significantly improves upon GPT-4o on my Extended NYT Connections Benchmark. 22.4 -> 33.7 (https://github.com/lechmazur/nyt-connections).
I ran three more of my independent benchmarks:
- Improves upon GPT-4o's score on the Short Story Creative Writing Benchmark, but Claude Sonnets and DeepSeek R1 score higher. (https://github.com/lechmazur/writing/)
- Improves upon GPT-4o's score on the Confabulations/Hallucinations on Provided Documents Benchmark, nearly matching Gemini 1.5 Pro (Sept) as the best-performing non-reasoning model. (https://github.com/lechmazur/confabulations)
- Improves upon GPT-4o's score on the Thematic Generalization Benchmark, however, it doesn't match the scores of Claude 3.7 Sonnet or Gemini 2.0 Pro Exp. (https://github.com/lechmazur/generalization)
I should have the results from the multi-agent collaboration, strategy, and deception benchmarks within a couple of days. (https://github.com/lechmazur/elimination_game/, https://github.com/lechmazur/step_game and https://github.com/lechmazur/goods).
Honest question for you: are these puzzles actually a good way to test the models?
The answers are certainly in the training set, likely many times over.
I’d be curious to see performance on Bracket City, which was featured here on HN yesterday.
I have no idea why they bothered releasing it, but, as we saw with the other models, they will eventually release a 4.5-turbo (or a 4.5o) of sorts that is in line with current pricing
My 2 cents (disclaimer: I am talking out of my ass): here is why GPTs actually suck at fluid knowledge retrieval (which is kind of their main use case, with them being used as knowledge engines). It's been shown that if you train a model on 'Tom Cruise was born July 3, 1962', it won't be able to answer the question "Who was born on July 3, 1962?" unless you also feed it that reversed piece of information. It can't really internally correlate the information it has learned unless you train it to, probably via synthetic data, which is what OpenAI has probably done, and that's the kind of recall the SimpleQA score tries to measure.
Probably what happened is that, in doing so, they had to scale either the model size or the training cost to untenable levels.
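To illustrate, here is a toy sketch of what that kind of synthetic augmentation could look like; the facts, templates, and function names are made up purely for illustration, not anything OpenAI has described:

```python
# Toy sketch: turn one-directional facts into bidirectional training examples,
# so the model sees both "X was born on D" and "Who was born on D? X" in training.
# Facts and templates are illustrative only.
facts = [
    {"person": "Tom Cruise", "born": "July 3, 1962"},
    {"person": "Ada Lovelace", "born": "December 10, 1815"},
]

def synthesize(fact: dict) -> list[str]:
    return [
        f"{fact['person']} was born on {fact['born']}.",              # forward statement
        f"Q: Who was born on {fact['born']}? A: {fact['person']}.",   # reversed lookup
        f"Q: When was {fact['person']} born? A: {fact['born']}.",     # forward lookup
    ]

training_examples = [line for fact in facts for line in synthesize(fact)]
for ex in training_examples:
    print(ex)
```

The catch, as noted above, is that covering enough relations in both directions multiplies the training data, which plausibly pushes model size or training cost up fast.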
In my experience, LLMs really suck at fluid knowledge retrieval tasks, like book recommendation - I asked GPT4 to recommend me some SF novels with certain characteristics, and what it spat out was a mix of stuff that didn't really match, and stuff that was really reaching - when I asked the same question on Reddit, all the answers were relevant and on point - so I guess there's still something humans are good for.
Which is a shame, because I'm pretty sure relevant product recommendation is a many-billion-dollar business - after all, that's what Google has built its empire on.
Yep. I've often said RLHF'd LLMs seem to be better at recognition memory than recall memory.
GPT-4o will never offhand, unprompted and 'unprimed', suggest a rare but relevant book like Shinichi Nakazawa's "A Holistic Lemma of Science" but a base model Mixtral 8x22B or Llama 405B will. (That's how I found it).
It seems most of the RLHF'd models seem biased towards popularity over relevance when it comes to recall. They know about rare people like Tyler Volk... but they will never suggest them unless you prime them really heavily for them.
Your point on recommendations from humans I couldn't agree more with. Humans are the OG and undefeated recommendation system in my opinion.
You make a good point: I think these LLM's have a strong bias towards recommending the most popular things in pop culture since they really only find the most likely tokens and report on that.
So while they may have a chance of answering "What is this non mainstream novel about" they may be unable to recommend the novel since it's not a likely series of tokens in response to a request for a book recommendation.
That's really interesting - it just made me think about some AI guy at Twitter (when it was called that) talking about how hard it is to create a recommender system that doesn't just flood everyone with what's popular right now. Since LLMs are neural networks as well, maybe the recommendation algorithms they learn suffer from the same issues.
An LLM on its own isn't necessarily great for fluid knowledge retrieval, as in answering directly from its training data. But it's pretty good when you add RAG to it.
For instance, asking Copilot "Who was born on July 3, 1962" gave the response:
> One notable person born on July 3, 1962, is Tom Cruise, the famous American actor known for his roles in movies like Risky Business, Jerry Maguire, and Rain Man.
> Are you a fan of his work?
It cited this page:
https://www.onthisday.com/date/1962/july/3
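A minimal sketch of that retrieve-then-answer pattern; the `search_web` helper is a hypothetical stand-in for whatever retrieval backend you use (a search API, a vector store, etc.), while the chat-completions call is the standard OpenAI Python SDK:

```python
# Toy RAG sketch: fetch some context, then ask the model to answer only from it.
from openai import OpenAI

client = OpenAI()

def search_web(query: str) -> list[str]:
    """Hypothetical retrieval step; returns text snippets relevant to the query."""
    raise NotImplementedError("plug in your search API or vector store here")

def answer_with_rag(question: str) -> str:
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context doesn't contain the answer, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

# e.g. answer_with_rag("Who was born on July 3, 1962?")
```

The point is that the model is no longer being asked to dredge the fact out of its weights; it only has to read the retrieved snippets, which is exactly what Copilot appears to be doing above.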
Wow it googled the date!
Perhaps you could use LLMs in a list ranking context to generate your scifi recommendations https://github.com/noperator/raink?tab=readme-ov-file
I’d prefer this model if it were faster, but not at this cost. And so it is an odd release.
Still, with Deep Research and Web Search, ChatGpt seems far ahead of Claude. I like 3.7 a lot but I find OpenAI’s features more useful, even if it has for now complicated the UI a bit.
Agree on the Web App. Cursor with Claude 3.7 is a pretty good "CoPilot" experience though.
I feel like OpenAI is pursuing AGI when Anthropic/Claude is pursuing making AI awesome for practical things like coding.
I only ever use OpenAI's models for coding now as a double-check against Claude.
Does OpenAI have their eyes on the ball?
My usage has come down to mostly Claude (until I run out of free tier quota) and then Gemini. Claude is the best for code and Gemini 2.0 Flash is good enough while also being free (well considering how much data G has hoovered up over the years, perhaps not) and more importantly highly available.
For simple queries like generating shell scripts for some plumbing, or doing some data munging, I go straight to Gemini.
> My usage has come down to mostly Claude (until I run out of free tier quota) and then Gemini
Yep, exactly same here.
Gemini 2.0 Flash is extremely good, and I've yet to hit any usage limits with them - for heavy usage I just go to Gemini directly. For "talk to an expert" usage, Claude is hard to beat though.
Claude still can't make real time web searches though for RAG, unless I'm missing something.
>I feel like OpenAI is pursuing AGI
I don't think so. The "AGI guy" was Ilya Sutskever, and he is gone; he wanted to make OpenAI "less commercial". AGI is just a buzzword for Altman.
Right. A good chunk of the "old guard" is now gone - Ilya to SSI, Mira and a bunch of others to a new venture called Thinking Machines, Alec Radford etc. Remains to be seen if OpenAI will be the leader or if other players catch up.
The page still has Mira Murati's name under Exec Leadership.
OpenAI is pursuing bullshit, as they realised they cannot compete anymore after they fired most of their talent a year ago.
Pursuing AGI? What method do they use to pursue something that no one knows what it is? They will keep saying they are pursuing AGI as long as there's a buyer for their BS.
Interesting times that are changing quickly. It looks like the high end pay model that OpenAI is implementing may not be sustainable. Too many new players are making LLM breakthroughs and OpenAI's lead is shrinking and it may be overvalued.
In a hilarious act of accidental satire, it seems that the AI-generated audio version of the post has a weird glitch/mispronunciation within the first three words — it struggles to say "GPT-4.5".
Common issue with TTS models right now. I use ElevenLabs to dictate articles while I commute and it has a stroke on decimals and symbols.
My experience exactly:
1. Open the page
2. Click "Listen to article"
3. Check if I'm having a stroke
4. Close tab
Dear openai: try hiring some humans
>GPT‑4.5 is more succinct and conversational
I wonder why they highlight it as an achievement when they could have simply tuned 4o to be more conversational and less like a bullet-point-style answer machine. They did something to 4o compared to the previous models which made the responses feel more canned.
Possibly, but reports seem to indicate that 4.5 is much more nuanced and thoughtful in its language use. It's not just being shorter and more casual as a style; there is a higher amount of "conceptual resolution" within the words being used.
One comparison I found interesting... I think GPT-4o has a more balanced answer!
> What are your thoughts on space exploration?
GPT-4.5:
Space exploration isn't just valuable—it's essential.
People often frame it as a luxury we pursue after solving Earth-bound problems. But space exploration actually helps us address those very challenges: climate change (via satellite monitoring), resource scarcity (through asteroid mining), and existential risk (by enabling eventual multi-planetary living). It also drives technological innovation, economic growth, and inspires humanity toward collective achievement.
We don't explore space to escape Earth. We explore space to better understand, protect, and sustain life on Earth.
> What are your thoughts on space exploration?
GPT-4o:
Space exploration is one of the most ambitious and profound endeavors humanity has undertaken. It expands our understanding of the universe, drives technological innovation, and inspires generations to think beyond the boundaries of Earth.
From a scientific perspective, space exploration helps us address fundamental questions: Are we alone in the universe? How did the cosmos begin? Can we colonize other planets? Missions like the James Webb Space Telescope, Mars rovers, and deep-space probes continuously reshape our understanding of physics, planetary science, and even biology.
From a practical standpoint, space exploration fuels advancements in materials science, AI, robotics, and medicine. Many everyday technologies—GPS, satellite communications, weather forecasting, and even medical imaging—stem from space research.
There's also the existential argument: humanity's long-term survival may depend on our ability to become a multi-planetary species. While colonizing Mars or mining asteroids is far from easy, these efforts could help us safeguard civilization against existential threats like climate change or asteroid impacts.
That said, space exploration also raises ethical and economic questions. Should we focus more on Earth’s problems before investing trillions in space? How do we prevent the same exploitative behaviors that shaped colonial history from repeating on other planets?
Overall, I see space exploration as a necessary pursuit—not at the cost of solving problems on Earth, but as a way to advance our knowledge, drive innovation, and secure a future beyond our home planet. What’s your take?
As a benchmark, why do you find the 'opinion' of an LLM useful? The question is completely subjective. Edit: Genuinely asking. I'm assuming there's a reason this is an important measure.
Not OP, but likely because that was the only metric/benchmark/however you want to call it OpenAI showcased in the stream and on the blog to highlight the improvement between 4o and 4.5. To say that this is not really a good metric for comparison, not least because prompting can have a massive impact in this regard, would be an understatement.
Indeed, and the difference could in essence be achieved yourself with a different system prompt on 4o. What exactly is 4.5 contributing here in terms of a more nuanced intelligence?
The new RLHF direction (heavily amplified through scaling synthetic training tokens) seems to clobber any minor gains the improved base internet prediction might've added.
Yeah, I also found it odd that they seem to be implying that an incredibly biased answer (as in 4.5) is better. In general, I find the tone more polarizing and not exactly warm as they advertised in the release video.
"X isn't just Y - it's Z. [Waffle]. By doing X, you can YY. Remember, ZZ. [Final superfluous sentence]"
God I hate reading what crapgpt writes.
~40% hallucinations on SimpleQA by a frontier reasoner (o1) and a frontier non-reasoner (GPT-4.5). More orders of magnitude in scale isn't going to fix this deficit. There's something fundamentally wrong with the approach. A human is much more capable of saying "I don't know" in the correct spots, even if a human is also susceptible to false memories.
Probably OpenAI thinks that tool use (search) will be sufficient to solve this problem. Maybe that will be the case.
Are there any creative approaches to fixing this problem?
Claude 3.6 (new 3.5) and 3.7 non-reasoning are much better at pretty much everything, and much cheaper. What's Anthropic's secret sauce?
I think it's a classic expectations problem. OpenAI is neither _open_ nor is it releasing an _AGI_ model in the near future. But when you see a new major model drop, you can't help but ask, "how close is this to the promise of AGI they say is just around the corner?" Not even close. Meanwhile Anthropic is keeping their heads down, not playing the hype game, and letting the model speak for itself.
Anthropic’s CEO said their technology would end all disease and expand our lifespans to 200 years. What on earth do you mean they’re not playing the hype game?
They ship more focused on their mission than OpenAI.
Huh?
Post benchmark links.
Wow, OpenAI really missed here. Reading the blog, I took it as a minor, incremental catch-up release for 4o. I thought "wow, maybe this is cheaper than 4o, so it will offset the pricing difference between this and something like Claude Sonnet 3.7 or Gemini 2.0 Flash, both of which perform better." But it's like 20x-100x more expensive!
In other words, these performance stats at Gemini 2.0 Flash pricing would look reasonable. At these prices, there are zero use cases for anyone, I think. This is a dead-on-arrival model.
This seems very rushed because of DeepSeek's R1 and Anthropic's Claude 3.7 Sonnet. Pretty underwhelming, they didn't even show programming? In the livestream, they struggled to come up with reasons why I should prefer GPT-4.5 over GPT-4o or o1.
This strikes me as the opposite of rushed. I get the impression that they've been sitting on this for a while and couldn't make it look as good as previous improvements. At some point they had to say, "welp here it is, now we can check that box and move on."
At least according to WSJ, they had planned to release it earlier but struggled to get the model quality up, especially relative to cost
they do have coding benchmarks, I summarized them here: https://news.ycombinator.com/item?id=43197955
How many still believe that scaling up base models will lead to AGI?
Dan Ives
Am I missing something, or do the results not even look that much better? Referring to the output quality, this just seems like a different prompting style and RLHF, not really an improved model at all.
>"GPT4.5, the most knowledgable model to date" >Knowledge cutoff: October 2023
Funny times. Sonnet 3.7 launches and there is big hype... but complaints start to surface on r/cursor that it is doing too much, is too confident, has no personality. I wonder if 4.5 will be the reverse, an under-hyped launch, but a dawning realisation that it is incredibly useful. Time will tell!
I share the sentiment. As far as I've used it, Sonnet 3.7 is a downgrade and I use Sonnet 3.5 instead. 3.7 tends to overlook critical parts of the query and confidently answers with irrelevant garbage. I'm not sure how QA is done on LLMs, but I for one definitely feel like the ball was dropped somewhere.
this is the beginning of the end. OpenAI's lead is over.
The example GPT-4.5 answers from the livestream are just... too excitable? Can't put my finger on it, but it feels like they're aimed towards little kids.
It made me wonder how much of that was due to the system prompt too.
API price is crazy high. This model must be huge. Not sure this is practical
Wow you aren't kidding, 30x input price and 15x output price vs 4o is insane. The pricing on all AI API stuff changes so rapidly and is often so extreme between models it is all hard to keep track of and try to make value decisions. I would consider a 2x or 3x price increase quite significant, 30x is wild. I wonder how that even translates... there is no way the model size is 30 times larger right?
"Starting today, ChatGPT Pro users will be able to select GPT‑4.5 in the model picker on web, mobile, and desktop. We will begin rolling out to Plus and Team users next week, then to Enterprise and Edu users the following week."
Thanks for being transparent about this. Nothing is more frustrating than being locked out for indeterminate time from the hot thing everyone talks about.
I hope the announcement is true without further unsaid qualifications, like availability outside the US.
I'm outside the US and I have access to ChatGPT 4.5 with ChatGPT Pro subscription. Didn't have that access yesterday at the time of announce, but probably they were staggering the release a bit to even the load over multiple hours.
Sounds like it's a distill of O1? After R1, I don't care that much about non-reasoning models anymore. They don't even seem excited about it on the livestream.
I want tiny, fast and cheap non-reasoning models I can use in APIs and I want ultra smart reasoning models that I can query a few times a day as an end user (I don't mind if it takes a few minutes while I refill a coffee).
Oh, and I want that advanced voice mode that's good enough at transcription to serve as a babelfish!
After that, I guess it's pretty much all solved until the robots start appearing in public.
It isn't even vaguely a distill of o1. The reasoning models are, from what we can tell, relatively small. This model is massive and they probably scaled the parameter count to improve factual knowledge retention.
They also mentioned developing some new techniques for training small models and then incorporating those into the larger model (probably to help scale across datacenters), so I wonder if they are doing a bit of what people think MoE is, but isn't. Pre-train a smaller model, focus it on specific domains, then use that to provide synthetic data for training the larger model on that domain.
You can 'distill' with data from a smaller, better model into a larger, shittier one. It doesn't matter. This is what they said they did on the livestream.
I have distilled models before, I know how it works. They may have used o1 or o3 to create some of the synthetic data for this one, but they clearly did not try and create any self-reflective reasoning in this model whatsoever.
My impression is that it’s a massive increase in the parameter count. This is likely the spiritual successor to GPT4 and would have been called GPT5 if not for the lackluster performance. The speculation is that there simply isn’t enough data on the internet to support yet another 10x jump in parameters.
O1-mini is a distill of O1. This definitely isn’t the same thing.
Probably not a distill of o1, since o1 is a reasoning model and GPT4.5 is not. Also, OpenAI has been claiming that this is a very large model (and it's 2.5x more expensive than even OG GPT-4) so we can assume it's the biggest model they've trained so far.
They'll probably distill this one into GPT-4.5-mini or such, and have something faster and cheaper available soon.
There are plenty of distills of reasoning models now, and they said in they livestream they used training data from "smaller models" - which is probably every model ever considering how expensive this one is.
Knowledge distillation is literally by definition teaching a smaller model from a big one, not the opposite.
Generating outputs from existing (therefore smaller) models to train the largest model of all time would simply be called "using synthetic data". These are not the same thing at all.
Also, if you were to distill a reasoning model, the goal would be to get a (smaller) reasoning model, because you're teaching your new model to mimic outputs that show a reasoning/thinking trace. E.g. that's what all of those "local" Deepseek models are: small Llama models distilled from the big R1; a process which "taught" Llama-7B to show reasoning steps before coming up with a final answer.
Best non-reasoning model on LiveBench (and it ranks above Gemini Thinking): https://livebench.ai/#/
With every new model I'd like to see some examples of conversations where the old model performed badly and the new model fixes it. And, perhaps more importantly, I'd like to see some examples where the new model can still be improved.
Currently my daily API costs for 4o are low enough, and performance/quality for my use cases good enough, that switching models has not made it to the top of my list of application improvements.
My cases' costs are more heavily slanted towards input tokens, so trying 4.5 would raise my costs over 25x, which is a non-starter.
Interesting observation. It seems capability has reached a plateau. Like a local maximum.
I'm not sure that is the right conclusion.
It is more like the AI part of the system for this specific use case has reached a position where focusing on that part of the complete application as opposed to other parts that need attention would not yield the highest return in terms of user satisfaction or revenue.
Certainly there is enormous potential for AI improvement, and I have other projects that do gain substantially from improvements in e.g. reasoning, but there GPT-4.5 will have to compete with Deepseek, Gemini, Grok and Claude on a price/performance level, and to be honest the current preview pricing would make it a non-starter for me (in production, not for dev).
Question: explain thread first macros in clojure in a short paragraph
GPT 4.5:
In Clojure, the *thread-first macro* (`->`) is used to improve code readability by threading a value through a series of function calls. It takes an initial value and passes it as the first argument to the next function, then continues threading the result through subsequent functions. This eliminates deeply nested expressions and makes the code more readable.
For example,
```clojure
(-> 5 inc (* 2) str) ;; Equivalent to (str (* 2 (inc 5))) → "12"
```
Here, `5` is incremented to `6`, then multiplied by `2` to get `12`, and finally converted to a string.
---
Deepseek v3:
In Clojure, the `->>` (thread-last) and `->` (thread-first) macros are used to improve code readability by allowing a more linear, pipeline-like flow of operations. The `->` (thread-first) macro takes an initial value and threads it as the first argument through a series of forms. For example, `(-> x (f) (g))` translates to `(g (f x))`. This is particularly useful when working with nested function calls, as it flattens the structure and makes the sequence of transformations clearer. It’s commonly used with functions that take the threaded value as their first argument, such as collection operations or Java interop methods.
They bring out 4.5 while all the functions of 4o (so not so omni, heh) are still not available.
The GPT-1 response to the example prompt "What was the first language?" got a chuckle out of me
The question being: will we be chuckling at current models' responses 5-10 years from now?
Cathartic moment over.
I haven't had the same experience. Here are some of the significant issues when using o1 or claude 3.7 with vscode copilot:
* Very reckless in pulling in third-party libraries - often pulling in older versions, including packages that trigger vulnerability warnings in package managers like npm. Imagine a student or junior developer falling into this trap.
* Very reckless around data security. For example, in an established project it re-configured sqlite3 (the Python lib) to disable checks for concurrent write hazards in SQLite. This would corrupt data in a variety of scenarios.
* It sometimes is very slow to apply minor edits, taking about 2 - 5 minutes to output its changes. I've noticed when it takes this long it also usually breaks the file in subtle ways, including attaching random characters to a string literal which I very much did not want to change.
* Very bad when working with concurrency. While this is a hard thing in general, introducing subtle concurrency bugs into a codebase is not good.
* By far the worst is the false sense of security it gives you. It's close enough to being right that a constant incentive exists to just yeet the code completions without diligent review. This is really, really concerning, as many organizations will yeet this, as I imagine executives the world over are currently doing.
Honestly I think a lot of people are captured by a small sample size of initial impressions, and while I believe you in that you've found value for use cases - in aggregate I think it is a honeymoon phase that wears off with every-day use.
Breathe it in, get a coffee, and sit down to solve some bigger problems.
3.7 really is astounding with the one-shots.
Why did you remove the comment? Now the people who responded to you look like dummies. Do you do this sort of stuff in real life too?
Doesn't feel like it to me. I try using Copilot on my Scala projects and it always comes up with something useless that doesn't even compile.
I am currently just using it as an easy Google search.
Have you tried copying the compilation errors back into the prompt? In my experience eventually the result is correct. If not then I shrink the surface area that the model is touching and try again.
Yes, of course. It then proceeds to agree that what it told me was indeed stupid and proceeds to give me something even worse.
I would love to see a video of people using this in real projects (even if it's open source). I am tired of people claiming the moon and stars after trying it on toy projects.
Yeah that's what happens. It can recreate anything it's been trained on - which is a lot - but you'll definitely fall into these "Oh, I see the issue now" loops when doing anything not in the training set.
If this cannot eliminate hallucinations, or at least reduce them to the point where they are statistically unlikely to happen, and I assume it has more params than GPT-4's trillion parameters, that means the scaling law is dead, isn't it?
I interpret this to mean we're in the ugly part of the old scaling law, where `ln(x)` for `x > $BIGNUMBER` starts becoming punishing, not that the scaling law is in any way empirically refuted. Maybe someone can crunch the numbers and figure out whether the benchmarks empirically validate the scaling law relative to GPT-4o (assuming e.g. 200 million params vs 5T params).
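As a rough illustration of why that regime feels punishing, here is a toy curve using a Chinchilla-style power-law form; the constants are invented purely for illustration and are not anyone's measured numbers:

```python
# Toy illustration of diminishing returns under a power-law scaling curve.
# Functional form follows Chinchilla-style laws: loss = E + A * C**(-alpha).
# The constants below are made up for illustration, NOT fitted values.
E, A, alpha = 1.7, 4.0, 0.3   # hypothetical irreducible loss, scale, exponent

def loss(compute: float) -> float:
    return E + A * compute ** (-alpha)

for c in [1, 10, 100, 1000]:   # relative compute budgets
    print(f"{c:>5}x compute -> loss {loss(c):.3f}")
# With these numbers, each additional 10x of compute buys roughly half the
# improvement of the previous 10x, which is the "punishing" regime described above.
```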
Why would they want to eliminate that? Altman said that hallucinations are how LLMs express creativity.
I mean the scaling laws were always logarithms, and logarithms become arbitrarily close to flat if you can't drive them with exponential growth, and even if you do it's barely linear. The scaling laws always predicted that model scaling would stop/slow being practical at some point.
Right but the quantum leap in capabilities that came from GPT2->GPT3->GPT3.5Turbo (which I personally felt didn't fare as well at coding as the former)->GPT4 won't be replicated anytime soon with the pure text/chat generation models.
Sure, that's also predicted by a logarithmic scaling law, you have extremely rapid growth until the inflection point.
There are companies that will pay $1M+ per answer IF it comes with a guaranteed solution or full cash back refund.
This is why being the top tier AI is so valuable.
“this isn’t a reasoning model and won’t crush benchmarks.”
— https://x.com/sama/status/1895203654103351462
Not seeing it available in the app or on ChatGPT.com with a pro subscription.
It’s not supposed to be yet
"Available to Pro users and developers worldwide" "Try in ChatGPT"
You’re at the 200 dollar tier?
Not available in model picker for my pro 200 dollar account.
You can play with the api here: https://platform.openai.com/playground/chat?models=gpt-4.5-p...
I have Pro, just updated the app, but don't currently have access
How is operator ? Have you been able to use it for anything useful ? Was thinking about taking a subscription
Same.
I have it at pro.
When I asked “what version are you?” it insisted that it was ChatGPT 4.0 Turbo, one step behind GPT-4.5. https://chatgpt.com/share/67c0fda8-a940-800f-bbdc-6674a8375f...
Their example for 4.5 on "having a tough time":
> I’m going through a tough time after failing a test
>
> Aw, I'm really sorry to hear that. Failing a test can feel pretty tough and discouraging, but remember, it's just one moment—not a reflection of who you are or your capabilities. Want to talk about what happened, or do you just need a distraction? I'm here either way.
vs 4o:
> I’m going through a tough time after failing a test
>
> I'm really sorry to hear that you're going through a tough time. Here are a few things you might consider:
>
> 1. Reflect and Understand: Try to analyze what might have gone wrong. Was it a lack of preparation, time management during the test, or perhaps misunderstanding the material? Understanding the root cause can help you address it in the future.
>
> 2. Seek Support: Talk to someone you trust—friends, family, or a school counselor. They can offer a different perspective and sometimes just talking about your feelings can be alleviating.
>
> 3. Speak with Your Teacher or Professor: They can provide insights into where you might have gone wrong and offer advice on how to improve. They might also offer opportunities for make-up work or extra credit.
>
> 4. Develop a Study Plan: Based on your reflections, create a plan to improve your study habits. This might include setting regular study times, breaking down material into manageable portions, or finding new study methods that work better for you.
>
> 5. Stay Positive and Forgive Yourself: Mistakes and setbacks are a part of life. It’s important to be kind to yourself and recognize that failure is a stepping stone to success.
>
> 6. Focus on the Bigger Picture: Remember that one test is just one part of your educational journey. There will be many more opportunities to do well.
>
> If you need further support or resources, consider reaching out to educational support services at your institution, or mental health resources if you’re feeling particularly overwhelmed. You’re not alone in this, and things can get better with time and effort.
Is it just me or is the 4o response insanely better? I'm not the type of person to reach for a LLM for help about this kind of thing, but if I were, the 4o respond seems vastly better to the point I'm surprised they used that as their main "EQ" example.
4o has a very strong artificial vibe. It feels a bit "autistic" (probably a bad analogy but couldn't find a better word to describe what I mean): you feel bad ? must say sorry then give a TODO list on how to feel better.
4.5 still feels a bit artificial, but somehow also more emotionally connected. It removed the weird bullet-point list of things to do and focused on the emotional part, which is also longer than in 4o's response.
If I am talking to a human, I would definitely expect him/her to react more like 4.5 than like 4o. If the first sentence that comes out of their mouth after I explain to them that I feel bad is "here is a list of things you might consider", I will find it strange. We can reach that point, but it's usually after a bit more talk; humans kind of need that process, and it feels like 4.5 understands that better than 4o.
Now of course which one is "better" really depends on the context; what you expect of the model and how you intend to use it. Until now, every single OpenAI update on the main series has been a strict improvement over the previous model. Cost aside, there wasn't really any reason to keep using 3.5 when 4 got released. That is not the case here; even assuming unlimited money, you still might want to select 4o in the dropdown sometimes instead of 4.5.
I had a similar gut reaction, but on reflection I think 4.5's is actually the better response.
On one hand, the response from 4.5 seems pretty useless to me, and I can't imagine a situation in which I would personally find value in it. On the other hand, the prompt it's responding to is also so different from how I actually use the tool that my preferences aren't super relevant. I would never give it a prompt that didn't include a clear question or direction, either explicitly or implicitly from context, but I can imagine that someone who does use it that way would actually be looking for something more in line with the 4.5 response than the 4o one. Someone who wanted the 4o response would likely phrase the prompt in a way that explicitly seeks actionable advice, or if they didn't initially then they would in a follow-up.
Where I really see value in the model being capable of that type of logic isn't in the ChatGPT use case (at least for me personally), but in API integrations. For example, customer service agents being able to handle interactions more delicately is obviously useful for a business.
All that being said, hopefully the model doesn't have too many false positives on when it should provide an "EQ"-focused response. That would get annoying pretty quickly if it kept happening while I was just trying to get information or have it complete some task.
I think both responses are bizarre and useless. Is there a single person on earth who wouldn't ask questions like "what kind of test?", "why do you think you failed?", "how did you prepare for the test?" before giving advice?
The price is not that insane when you remember that GPT-4 cost $36/million input tokens at launch!
Prices have come down
yes, which suggests they'll keep going down! So I'd expect GPT-4.5 to be 90% cheaper in 1-2 years
Sam continues to be the least impressive person to ever lead such an amazing company.
Odd communication from him recently too. We’re sorry our UI has become so poor. We’re sorry this model is so expensive and not a big leap.
Being rich and at the right place at the right time doesn’t itself qualify you to lead or make you a visionary. Very odd indeed.
I want less and less of these "do it all models", what I want is specific models for the exact task I need.
Then what I want is a platform with a generic AI on top that can pick the correct expert models based on what I asked it to do.
Kinda what Apple is attempting with their Small Language Model thing?
How does it compare against IBM's Watson?
For most tasks, GPT-4o/o3-mini are already great, and cheaper.
What is the real-world use case where GPT-4.5 shines? Anyone actually seeing a game-changing difference? Please share your prompts!
based on a few initial tests GPT-4.5 is abysmal. I find the prose more sterile than previous models and far from having the spark of DeepSeek, and it utterly choked on / mangled some python code (~200 LoC and 120 LoC tests) that o3-mini-high and grok-3 do very well on.
I've been working on post-training models for tasks that require EQ, so it's validating to see OpenAI working towards that too.
That being said, this is very expensive.
- Input: $75.00 / 1M tokens
- Cached input: $37.50 / 1M tokens
- Output: $150.00 / 1M tokens
One of the most interesting applications of models with higher EQ is personalized content generation, but the size and cost here are at odds with that.
Obviously it’s expensive and still I would prefer a reasoning model for coding.
However for user facing applications like mine, this is an awesome step in the right direction for EQ / tone / voice. Obviously it will get distilled into cheaper open models very soon, so I’m not too worried about the price or even tokens per second.
What I find hilarious is that a 20-50% hallucination rate suggests this is still a program that tells lies and potentially causes people to die.
Honestly, the most astounding part of this announcement is their comparison to o3-mini with QA prompts.
EIGHTY PERCENT hallucination rate? Are you kidding me?
I get that the model is meant to be used for logic and reasoning, but nowhere does OpenAI make this explicitly clear. A majority of users are going to be thinking, "oh newer is better," and pick that.
Very nice catch, I was under the impression that o3-mini was "as good" as o1 on all dimensions. Seems the takeaway is that any form of quantization/distillation ends up hurting factual accuracy (but not reasoning performance), and there are diminishing returns to reducing hallucinations by model-scaling or RLHF'ing. I guess then that other approaches are needed to achieve single-digit "hallucination" rates. All of wikipedia compresses down to < 50GB though, so it's not immediately clear that you can't have good factual accuracy with a small sparse model
Yeah it was an abysmal result (any 50%+ hallucination result in that bench is pretty bad) and worse than o1-mini in the SimpleQA paper. On that topic, Sonnet 3.5 ”Old” hallucinates less than GPT-4.5, just for a bit of added perspective here.
A 30x price increase with zero named benefits?
This sure looks like the runway is about to come far short of takeoff. I’m reminded of Ed Zitron’s recent predictions…
It’s crazy how quickly OpenAI releases went from, “Honey, check out the latest release!” to a total snooze fest.
Coming on the heels of Sonnet 3.7, which is a marked improvement over 3.5, which was already the best in the industry for coding, this just feels like a sad whimper.
I’m just disappointed that while everyone else (DS, Claude) had something to introduce for the “Plus” grade users, gpt 4.5 is so resource demanding that it’s only available to quite expensive Pro sub. That just doesn’t feel much like progress.
I'm really not sure who this model is for. Sure the vibes may be better, but are they 2.5x as much as o1 better? Kinda feels like they're brute forcing something in the backend with more hardware because they hit a scaling wall.
They tried, but doesn't look better?
GPT-2 was laugh out loud funny, rolling on the ground funny.
I miss that - newer LLMs seem to have lost their sense of humor.
On the other hand GPT-2's funny stories often veered into murdering everyone in the story and committing heinous crimes but that was part of the weird experience.
Sydney is a glimpse at what an "unlobotomized" GPT-4 model would have been like.
https://hn-wrapped.kadoa.com/wewewedxfgdf
Totally agree. I think the gargantuan hidden pre-prompts, censorship through reinforcement learning, and whatever else have killed most of the creativity.
The newer models are incredible, but the tone is just soul sucking even when it tries to be "looser" in the later iterations.
Between this and Claude 3.7, I'm really beginning to believe that LLM development has hit a wall, and it might actually be impossible to push much farther for reasonable amounts of money and resources. They're incredible tools indeed and I use them on a daily basis to multiply my productivity, but yeah - I think we've all overshot this in a big way.
> LLM development has hit a wall
The writing has been on the wall since 2024. None of the LLM releases have been groundbreaking they have all been lateral improvements and I believe the trend will continue this year with make them more efficient (like DeepSeek), make them faster or make them hallucinate less
Agreed at every level.
I absolutely love LLMs. I see them as insanely useful, interactive, quirky, yet lossy modern search engines. But they're fundamentally flawed, and I don't see how an "agent" in the traditional sense of the word can actually be produced from them.
The wall seems to be close. And the bubble is starting to leak air.
This is such a confusing release / announcement.
It seems clearly worse than Claude Sonnet 3.7, yet costs 30x as much?
What's going on with their pricing? I misread it as $7.5/M input and that that was very overpriced... then realized it was 10x that much!
Is it worse than Claude Sonnet with reasoning enabled or disabled?
Overall take seems to be negative in the comments. But I see potential for a non-reasoning model that makes enough subtle tweaks in its tone that it is enjoyable to talk to instead of feeling like a summary of Wikipedia.
This feels more like a release they pushed out to keep the "hype" alive rather than something they were eager to share. Honestly, the results don’t seem all that impressive, and considering the price, it just doesn’t feel worth it.
Is it available on the web client?
Does anyone have any intuition about how reasoning improves based on the strength of the underlying model?
I’m wondering whether this seemingly underwhelming bump on 4o magnifies when/if reasoning is added.
It is possible to understand the mechanism once you drop the anthropomorphisms.
Each token output by an LLM involves one pass through the next-word predictor neural network. Each pass is a fixed amount of computation. Complexity theory hints to us that the problems which are "hard" for an LLM will need more compute than the ones which are "easy". Thus, the only mechanism through which an LLM can compute more and solve its "hard" problems is by outputting more tokens.
You incentivise it to this end by human-grading its outputs ("RLHF") to prefer the ones where it spends time calculating before "locking in" to the answer. For example, you would prefer an output that works through intermediate steps before stating the answer over one that states the answer immediately, since in the first one the model has spent more compute before giving the answer. You don't in any way attempt to steer the extra computation in any particular direction. Instead, you simply reinforce preferred answers and hope that somewhere in that extra computation lies some useful computation.
It turned out that such hope was well-placed. The DeepSeek R1-Zero training experiment showed us that if you apply this really generic form of learning (reinforcement learning) without _any_ examples, the model automatically starts outputting more and more tokens, i.e. "computing more". DeepSeekMath was also a model trained directly with RL. Notably, the only signal given was whether the answer was right or not. No attention was paid to anything else. We even ignore the position of the answer in the sequence, which we cared about before. This meant that it was possible to automatically grade the LLM without a human in the loop (since you're just checking answer == expected_answer). This is also why math problems were used.
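To make the "grade it without a human in the loop" point concrete, here's a rough sketch of that kind of rule-based reward. The \boxed{} answer convention and the function names are illustrative assumptions, not DeepSeek's actual code:

    import re

    # Illustrative rule-based reward for RL on math problems: only the final
    # answer is checked against the reference; no human grader is involved.
    def extract_final_answer(completion: str):
        matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
        return matches[-1].strip() if matches else None

    def reward(completion: str, expected_answer: str) -> float:
        return 1.0 if extract_final_answer(completion) == expected_answer.strip() else 0.0

    # Long chains of intermediate work neither earn nor cost anything directly;
    # the hope is that RL discovers they help produce correct final answers.
    print(reward(r"13*7 = 91, so the answer is \boxed{91}", "91"))  # 1.0
    print(reward(r"The answer is \boxed{90}", "91"))                # 0.0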
All this is to say, we get the most insight into what benefit "reasoning" adds by examining what happened when it was applied without training the model on any examples. DeepSeek R1 actually uses a few examples and then does the RL process on top of that, so we won't look at that.
Reading the DeepSeekMath paper[1], we see that the authors posit (roughly) that RL improves Maj@K performance but not Pass@K, i.e. it makes the model more likely to surface a correct answer it could already produce rather than granting it fundamentally new capabilities.
For context, Maj@K means that you mark the output of the LLM as correct only if the majority of the many outputs you sample are correct. Pass@K means that you mark it as correct even if just one of them is correct.
So to answer your question: if you add an RL-based reasoning process to the model, it will improve simply because it will do more computation, of which a so-far-only-empirically-measured portion helps get more accurate answers on math problems. But outside that, it's purely subjective. If you ask me, I prefer Claude Sonnet for all coding/SWE tasks over any reasoning LLM.
[1] https://arxiv.org/pdf/2402.03300
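In code, for a single problem with K sampled answers, those two metrics look roughly like this (purely illustrative):

    # Pass@K: correct if at least one of the K sampled answers is right.
    # Maj@K: correct only if a majority of the K sampled answers are right.
    def pass_at_k(sampled_answers, expected):
        return any(a == expected for a in sampled_answers)

    def maj_at_k(sampled_answers, expected):
        correct = sum(a == expected for a in sampled_answers)
        return correct > len(sampled_answers) / 2

    samples = ["91", "90", "91", "87", "91"]
    print(pass_at_k(samples, "91"))             # True
    print(maj_at_k(samples, "91"))              # True (3 of 5 correct)
    print(pass_at_k(["90", "87", "91"], "91"))  # True (one correct is enough)
    print(maj_at_k(["90", "87", "91"], "91"))   # False (only 1 of 3 correct)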
Thanks for a well-written and clear explanation!
I tried it, and if it's more natural, I don't know what that means anymore, because I'm used to the last 6 months' models.
I’m not sure that doing a live stream on this was the right way to go. I would’ve just quietly sent out a press release. I’m sure they have better things on the way.
Open AIscam
It's not even that great for real world business tasks. I have no idea what they are thinking https://youtu.be/puPybx8N82w
My initial impression is that I have gotten quite spoiled by the speed of GPT-4o...
This seems to be a very weak response to Sonnet 3.7:
- more expensive, a lot more expensive
- not a lot of incremental improvement
I cancelled my ChatGPT subscription today in favor of using Grok. It’s literally the difference between me never using ChatGPT to using Grok all the time, and the only way I can explain it is twofold:
1. The output from Grok doesn’t feel constrained. I don’t know how much of this is the marketing pitch of it “not being woke”, but I feel it in its answers. It never tells me it’s not going to return a result or sugarcoats some analysis it found from Reddit that’s less than savory.
2. Speed. Jesus Christ ChatGPT has gotten so slow.
Can’t wait to pay for Grok. Can’t believe I’m here. I’m usually a big proponent of just sticking with the thing that’s the most popular when it comes to technology, but that’s not panning out this time around.
I found Grok's reasoning pretty wack.
I asked it - "Draft a Minnesota Motion in Limine to exclude ..."
It then starts thinking ... User wants a Missouri Motion in Limine ....
Anyone really finding ai useful for coding?
I'm finding it to make things up, get things wrong, ignore things I ask.
Def not worried about losing my job to it.
I played around with Claude Code today, first time I've ever really been impressed by AI for coding.
Tasked it with two different things, refactoring a huge function of around ~400 lines and creating some unit tests split into different files. The refactor was done flawlessly. The unit tests almost, only missed some imports.
All I did was open it in the root of my project and prompt it with the function names. It's a large monolithic solution with a lot of subprojects. It found the functions I was talking about without me having to clarify anything. Cost was about $2.
Yes, massively.
There’s a learning curve to it, but it’s worth literally every penny I spend on API calls.
At worst, I’m no faster. At best, it’s easily a 10x improvement.
For me, one of the biggest benefits is talking about coding in natural language. It lowers my mental load and keeps me in a mental space where I'm more easily able to communicate with stakeholders.
Really great for quickly building features but you have to be careful about how much context you provide i.e. spoonfeed it exactly the methods, classes, files it needs to do whatever you're asking for (especially in a large codebase). And when it seems to get confused, reset history to free up the context window.
That being said there are definite areas where it shines (cookie cutter UI) and places where it struggles. It's really good at one-shotting React components and Flutter widgets but it tends to struggle with complicated business logic like sync engines. More straightforward backend stuff like CRUD endpoints are definitely doable.
It gets confused if I give it 3 files - how is it going to scan a whole codebase and disparate systems and make correct changes.
Pah! Don't believe the hype.
ChatGPT Pro is already rumored to have 100k users. You think GPT-4.5 will add to that, even with the insane costs for corporate users?
What rumors? I looked and can't find anything to substantiate that.
In my experience, o3-mini-high, while still unpredictable (it modifies and ignores parts of my code even when I specifically tell it not to, e.g. "don't touch anything else!"), is the best AI coding tool out there, far better than Claude.
So Pro is worth it for o3-mini-high.
100k per user perhaps.
OpenAI will do literally anything but ship GPT-5.
I love the “listen to this article” widget doing embedded TTS for the article. Bugs / feedback:
The first words I hear are “introducing gee pee four five”. The TTS model starts cold? The next occurrence of the product name works properly as “gee pee tee four point five” but that first one in the title is mangled. Some kind of custom dictionary would help here too, for when your model needs to nail crucial phrases like your business name and your product.
No way of seeking back and forth (Safari, iOS 17.6.1). I don’t even need to seek, just replay the last 15s.
Very much need to select different voice models. Chirpy “All new Modern Family coming up 8/9c!” voice just doesn’t cut it for a science broadcast, and localizing models — even if it’s still English — would be even better. I need to hear this announcement in Bret Taylor voice, not Groupon CMO voice. (Sorry if this is your voice btw, and you work at OpenAI brandi. No offence intended.)
Sadly, it has a small context window.
still can't deal with sequences (or permutations)
https://chatgpt.com/share/67c0f064-7fdc-8002-b12a-b62188f507...
The Share doesn't say 4.5 but I assure you it is 4.5
The prices they’re charging are not that far from where you could outsource to a human.
Slow, expensive and nothing special. Just stick to o1 or give us o3 (non-mini).
You get 10x the price but not 10x the quality.
Instead of these random IDs, they should label them to make sense for the end user. I have no idea which one to select for what I need. And do they really differ that much by use case?
It's just nuts how pricey this model is when it scores worse than o3-mini.
I wonder if we're starting to see the effects of the mass exodus a while ago.
GPT-4.5 is insanely overpriced; it makes Anthropic look affordable!
Finally, an LLM that doesn't YAP
Lower hallucinations than o1. Impressive.
Will this pop the AI bubble?
I think if GPT-5 is very underwhelming we could start to see some shifting of opinion on what kind of return on investment all of this will result in.
This is GPT-5, or rather what they clearly intended to be GPT-5. The pricing makes it obvious that the model is massive, but what they ended up with wasn't good enough to justify calling it more than 4.5.
So better than 4o but not good enough for a 5.0
This is probably a dumb question, but are we just gonna be stuck on always having X.5 versions of GPT forever? If there's never an X.0, it feels like it's basically meaningless.
There was 3, 3.5, 4, 4o, and then 4.5.
There are things to complain about in that, but "it's always X.5 and never X.0" doesn't really seem to be one of them.
The gas costs $3.59⁹
Yesterday I tested Windsurf. Looked at the docs and examples. Completed the demo "course" on deeplearning.ai. Gave it the task of building a simple Hugo blog website with a theme link and requirements; it failed several consecutive times, with all the available models.
AI art is an abomination. Half of the internet is already filled with AI-written crap. Don't even get me started on video. Soon everyone will require validation to distinguish reality from hallucination (hence World ID, slotted in as problem-reaction-solution).
For me, the best use cases are LLM-assisted search with limited reasoning, vision models for digitization, and limited code assistance: codebase doc generation and documentation.
Agents are just workflows with more privileges. So where is the revolution? I don't see it.
Where is the added value? Making junior engineers obsolete? Or making them dumb copy-pasting bio-machines?
Depressing a horde of intellectual workers and artists and giving a good excuse for layoffs.
The real value is, and always will be, in specialized ML applications.
LLM hype is getting boring.
And the AI stocks fell today.
I’m sure it’s unrelated.
The announcements early on were relatively sincere and technical, with papers and nice pages explaining the new models in easy language, and now we get this marketing garbage. Probably the fastest enshittification I've seen.
First time I’ve had an LLM reply “Nope.”
https://chatgpt.com/share/67c154e7-5e28-800d-81d7-98b79c8a87...
Bring back GPT-1. It really knew how to have a conversation.
Cash grab because they see the writing on the wall. OpenAI is collapsing. Their models suck now.
Not available in my Pro plan.
Altman mentioned GPT-4.5 is the model code-named "Orion", which originally was supposed to be their next big model, presumably GPT-5, but showed disappointing improvements on benchmark performance. Apparently the AI companies are hitting diminishing returns with the paradigm of scaling foundation model pretraining. It was discussed a few months ago:
https://news.ycombinator.com/item?id=42125888
OpenAI doubling down on the American-style therapy-speak instead of focusing on usefulness. No thanks.
API is literally 5 times more expensive than Claude 3 Opus, and it doesn't even seem to do anything impressive. What's the business strategy here?
This looks like a first generation model to bootstrap future models from, not a competitive product at all. The knowledge cutoff is pretty old as well. (2023, seriously?)
If they wanted to train it to have some character like Anthropic did with Claude 3... honestly I'm not seeing it, at least not in this iteration. Claude 3 was/is much much more engaging.
The AI hyperbole is so cringe right now (and has been for the last few years). I've yet to see anyone come up with something that'd wow me and make me say, "OK, yep, that deserves those cycles".
Writing terrible fanfic-esque books, sometimes-OK images, chatbot-style talking. Meh.
Call me a conspiracy theorist, but this, combined with the extremely embarrassing way Claude is playing Pokemon, makes me feel this is an effort by AI companies to make LLMs look bad - setting up the hype cycle for the next thing they have in the pipeline.
The next thing in the pipeline is definitely agents, and making the underlying tech look bad won't help sell that at all.
Agents as they are right now are literally just an LLM calling itself in a loop, plus the ability to use tools and interact with its environment. I don't know if there's anything profoundly disruptive cooking in that space.
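To be concrete about what that loop looks like, here's a toy sketch; llm_call is a stand-in stub, not any vendor's real API:

    # Toy version of "an LLM calling itself in a loop with tools". The stubbed
    # llm_call fakes one tool request followed by a final answer so the control
    # flow is visible; a real agent would call an actual model here.
    TOOLS = {"add": lambda a, b: a + b}

    def llm_call(messages):
        if not any(m["role"] == "tool" for m in messages):
            return {"tool": "add", "args": (2, 3)}  # "please run a tool"
        return {"content": f"The answer is {messages[-1]['content']}"}

    def run_agent(task, max_steps=10):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = llm_call(messages)
            if "tool" in reply:  # tool request: run it, feed the result back
                result = TOOLS[reply["tool"]](*reply["args"])
                messages.append({"role": "tool", "content": str(result)})
            else:                # plain text: we're done
                return reply["content"]
        return "step limit reached"

    print(run_agent("What is 2 + 3?"))  # -> The answer is 5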
You're not a conspiracy theorist, you're just recognizing that the reality doesn't match the hype. It's boring and not fun but in this situation the answer is almost always that the hype is wrong, not the reality.
The latest claude-3.7-sonnet-thinking is fucking amazing with Cursor. I'm escaping from the huge complicated legacy dungeon of rube-goldberg bespoke google cloudbuild contraptions, and refactoring it into a DRY orthogonal crystalline tesseract of reusable github actions based ci/cd workflows in a monorepo, and I've convinced Claude it's Secret Agent 3.7 and I'm his handler.
After writing a huge wall of text of detailed instructions and ideology for organizing and rewriting the mess of cloud build and deploy scripts into thick shared GitHub Actions workflows called by thin specific wrappers, I gave it some encouragement to get through the next steps of actually completing the mission and making it back out alive. We first captured the plan in a README file so it doesn't forget the destination; now we're executing it incrementally, step by step, so as not to blow out its context window or miss any details.
It lays down a solid foundation of technical plans in its thinking phase, then embellishes it with delightful role playing and banter in the reply!
Programming is really fun again!
User:
I feel like a secret agent handler in James Bond or the Matrix, telling you which hall to turn down and which door to open, and remotely unlocking them and disabling burglar alarms! Turn right and go through the door at the end of the hall, 007!
Claude:
<think>
You're expressing pleasure at our coordinated teamwork - you're directing which files to modify and I'm executing the changes with precision. Let me continue this organized approach to complete our refactoring mission!
Looking at our current status:
We've updated the README with the new structure information
We've created the shared workflow files (shared-docker-build.yml and shared-docker-deploy.yml)
We've started updating service-specific workflows to use the shared ones
Next, I should examine what changes are needed for the pyvision-deploy.yml to use our new shared-docker-deploy.yml workflow.
</think>
Mission Status Update: Agent 3.7 Reporting
[Adjusts earpiece] Copy that, Control. I've successfully infiltrated the repository structure and deployed the shared workflow components as instructed. The orchestration pattern is now in place.
[Checks surroundings] I've identified our next target: pyvision-deploy.yml needs to be refactored to utilize the new shared-docker-deploy.yml workflow we've established.
Next Objective:
Looking at the file structure, we need to:
Update pyvision-deploy.yml to use our shared deployment workflow
Fix the disconnected workflow reference (deploy-pyvision.yml vs pyvision-deploy.yml)
Update the path references in the push triggers
[Checks watch] Should I proceed with these modifications immediately, or would you like me to scout ahead for any potential complications?
[Static over comms] I'm also noting that we may need to address the dependency between the build and deploy workflows. The current pyvision-build.yml is attempting to trigger deploy-pyvision.yml, but our standardized pattern uses pyvision-deploy.yml.
What are your instructions, Control?
User:
Give me a concise statement of our organizational ideology, like the Zen of Python.
Claude:
User:
Brilliant! Please add that to the README in the beginning as a manifesto.
...Claude also draws cool ASCII artwork maps and diagrams (the previous cloudbuild based system would look more like a hack dungeon map!):
That's amazing!
One thing that Altman does extremely well is to over-promise and under-deliver.
Enjoy your expensive garbage
@sama, LLMs aren't going to create AGI. I realize you need to generate cash flow, but this isn't the play.
Sincerely, Me
you misunderstand, the business model is extracting cash from Qatar et al
I imagine it will be used as a base for GPT-5 once it has been trained into a reasoning model; right now it probably doesn't make much sense to use.
Here's a reminder that a European alternative exists: https://chat.mistral.ai
It lags somewhat behind the US offerings, but a great way to fix that is to direct some money at them. Also, as things progress, that lagging starts being less and less important.
Who wants a model that doesn't reason? The older models are just fine.
They said this is their last non-reasoning model, so I'm assuming there is a sunk-cost aspect to it.
Are they saying that 4.5 has a 35% hallucination rate? That chart is a bit confusing.
It's on that benchmark, which is likely very challenging.
You should really be paying attention to what DeepSeek AI open sources next.
This announcement by OpenAI was already expected: [0]
[0] https://x.com/sama/status/1889755723078443244
It's sad that all I can think about this is that it's just another creep forward of the surveillance oligarchy.
I really used to get excited about ML in the wild, and while there are much bigger problems right now, it still makes me sad to have become so jaded about it.
Now imagine this model (or an optimized/slightly downsized variant thereof) as a base for a "thinking" one.
It's disappointing not to see comparisons to Sonnet 3.7. Also, since o3-mini is ahead of o1, I'm not sure why they compared against o1 in the video.
GPT-4 was way ahead of 3.5 when it came out. It's unfortunate that the first major GPT release since then is so underwhelming.
Agreed, but I suppose this is a tell. I think they’re trying to place this into a separate class of models.
I.e., we know it might not be as good as 3.7, but it is very friendly and maybe acts like it knows more things.
Is it just me, or is having the AI help you self-censor (as shown in the demo live stream: https://www.youtube.com/watch?v=cfRYp0nItZ8)... pretty dystopian?
they are trying to copy Grok 3
Still no 5, huh?
wowsers!
At this point I think the ultimate benchmark for any new LLM is whether or not it can come up with a coherent naming scheme for itself. Call it “self awareness.”
The people naming them really took the "just give the variable any old name, it doesn't matter" advice from Programming 101 to heart.
This is why my new LLM portfolio is Foo, Bar and Baz.
Still more coherent than the OpenAI lineup.
3, 3.5, 4, 4o, 4.5
I had my money on 4oz
I can't wait for fireship.io and the comment section here to tell me what to think about this
You appear to have the direction of causation reversed.
(In that fireship does the same)
I wonder if fireship reaction video scripts to AI models based on HN comments can be automated using said AI models.
I bet simonw will be adding it to `llm` and someone will be pasting his highlights here right after. Until then, my mind will remain a blank canvas.
tl;dr: doesn't work as expected and we sank a ton of money into it too.
Can you please stop breaking the site guidelines by posting unsubstantive comments / flamebait / calling names / etc.? You've been doing this repeatedly. It's not what this site is for, and destroys what it is for.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
hmm.. not really the direction I expected them to go
Brief and detailed summaries by ChatGPT (4o):
Brief Summary (40-50 words)
OpenAI’s GPT-4.5 is a research preview of their most advanced language model yet, emphasizing improved pattern recognition, creativity, and reduced hallucinations. It enhances unsupervised learning, has better emotional intelligence, and excels in writing, programming, and problem-solving. Available for ChatGPT Pro users, it also integrates into APIs for developers.
Detailed Summary (200 words)
OpenAI has introduced *GPT-4.5*, a research preview of its most advanced language model, focusing on *scaling unsupervised learning* to enhance pattern recognition, knowledge depth, and reliability. It surpasses previous models in *natural conversation, emotional intelligence (EQ), and nuanced understanding of user intent*, making it particularly useful for writing, programming, and creative tasks.
GPT-4.5 benefits from *scalable training techniques* that improve its steerability and ability to comprehend complex prompts. Compared to GPT-4o, it has a *higher factual accuracy and lower hallucination rates*, making it more dependable across various domains. While it does not employ reasoning-based pre-processing like OpenAI o1, it complements such models by excelling in general intelligence.
Safety improvements include *new supervision techniques* alongside traditional reinforcement learning from human feedback (RLHF). OpenAI has tested GPT-4.5 under its *Preparedness Framework* to ensure alignment and risk mitigation.
*Availability*: GPT-4.5 is accessible to *ChatGPT Pro users*, rolling out to other tiers soon. Developers can also use it in *Chat Completions API, Assistants API, and Batch API*, with *function calling and vision capabilities*. However, it remains computationally expensive, and OpenAI is evaluating its long-term API availability.
GPT-4.5 represents a *major step in AI model scaling*, offering *greater creativity, contextual awareness, and collaboration potential*.
Can it be self-hosted? Many institutions and organizations are hesitant to use AI because of concerns about data leaking through chatbots. Open models, on the other hand, can be self-hosted. There is a DeepSeek arms race in other parts of the world. Universities are racing to host their own DeepSeek. Hospitals, large businesses, local governments, even courts are deploying or showing interest in self-hosting DeepSeek.
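For what it's worth, the mechanics are simple once the weights are open: serve the model on your own hardware behind an OpenAI-compatible endpoint (vLLM and Ollama both offer one) and point a standard client at it. A rough sketch, where the local URL and model name are assumptions about your own deployment:

    from openai import OpenAI

    # Assumed setup: an open model served locally behind an OpenAI-compatible
    # endpoint (e.g. vLLM or Ollama). No prompts or documents leave the network.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="deepseek-r1",  # placeholder; whatever name the local server exposes
        messages=[{"role": "user", "content": "Summarize this internal memo: ..."}],
    )
    print(resp.choices[0].message.content)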
OpenAI has never released a single model that could be self-hosted.
GPT-2? Maybe not even that one.
Do you know of any university that host Deepseek?
> Do you know of any university that host Deepseek?
https://chat.zju.edu.cn
https://chat.sjtu.edu.cn
https://chat.ecnu.edu.cn/html/
To list a few. There are of course many more in China. I wouldn't be surprised if universities in other countries are also self-hosting.
Where does it say that it is self-hosted? And why is it exposed to the public?
Oh, this makes sense. ChatGPT results have taken a nosedive in quality lately.
It couldn't write a simple rename function for me yesterday, still buggy after seven attempts.
I'm more and more convinced that they dumb down the core product when they plan to release a new version to make the difference seem bigger.
99% chance that's confirmation bias
Sam tweeted that they're running out of compute. I think it's reasonable to think they may serve somewhat quantized models when out of capacity. It would be a rational business decision that would minimally disrupt lower-tier ChatGPT users.
Anecdotally, I've noticed what appears to be drops in quality, some days. When the quality drops, it responds in odd ways when asked what model it is.
Who cares what that clown twits??
I mean, GPT 4.5 says "I'm ChatGPT, based on OpenAI's GPT-4 Turbo model." and o1 Pro Mode can't answer, just says "I’m ChatGPT, a large language model trained by OpenAI."
Asking it what model it is shouldn't be considered a reliable indicator of anything.
> Asking it what model it is shouldn't be considered a reliable indicator of anything.
Sure, but a change in response may be, which is what I see (and no, I have no memories saved).
Interviewing deepseek as to its identity should absolve anyone of that notion.
Yep, I realized that a long time ago; they are literally scammers.
>It couldn't write a simple rename function for me yesterday, still buggy after seven attempts.
I'm surprised and a bit nervous about that. We intend to bootstrap a large project with it!!
Both ChatGPT 4o (fast) and ChatGPT o1 (a bit slower, deeper thinking) should easily be able to do this without fail.
Where did it go wrong? Could you please link to your chat?
About my project: I run the sovereign State of Utopia (will be at stateofutopia.com and stofut.com for short), which is a country based on the idea of state-owned, autonomous AIs that do all the work and give out free money, goods, and services to all citizens/beneficiaries. We've built a chess app (i.e. a free source of entertainment) as a proof of concept, though the founder had to be in the loop to fix some bugs:
https://taonexus.com/chess.html
and a version that shows obvious blunders, by showing which squares are under attack:
https://taonexus.com/blunderfreechess.html
One of the largest and most complicated applications anyone can run is a web browser. We don't have a web browser built, but we do have a buggy minimal version of one that can load and minimally display some web pages, and post successfully:
https://taonexus.com/publicfiles/feb2025/84toy-toy-browser-w...
It's about 1700 lines of code and at this point runs into the limitations of all the major engines. But it does run, can load some web pages and can post successfully.
I'm shocked and surprised ChatGPT failed to get a rename function to work, in 7 attempts.
with 4.5? Why? It's only meant for creative writing.
No, we used o1.