LLM Responses Could Be Better

Manuscript page from a Burgundy (France) Bible, large initial 'V', from the Met.

Oct 21, 2025

In my work as a writing coach, I often help people improve their drafts. To be honest, most people have never received this kind of attention to their writing, so they find it refreshing, useful, and surprisingly logical.

LLMs today output an immense amount of writing, but you rarely see analysis that goes beyond surface-level style. This blog post is aimed at people who are a) building LLMs and/or b) making their own AI applications, and who face the challenge of diagnosing LLM writing quality and integrating improvements. The sturdiest metrics right now are "correctness" for an LLM's responses to factual questions, and "user satisfaction" for LLM responses to everything else. I'd like to show what it looks like to bring language expertise to the table.

Here's a question I asked ChatGPT, from Oct 9, 2025:

I was looking at the definition on Wikipedia of an “ore” and it included a few non-metal categories, like diamond. But surely other precious stones like emeralds, rubies, etc would fit in this category too? When I search for “emerald ore” all I get is Minecraft references!

I also asked a follow-up, so the whole conversation has my two questions and GPT-5 Thinking's two responses.

Step 1. Identify the writer's constraints

Good writing elegantly handles the constraints of the situation. At first glance, this question should be a lay-up for ChatGPT. After all, inputs like mine are the most frequent kind: it's "non-work," which people do on ChatGPT more than work-related tasks, and it's "Asking," which is done more than "Doing" or "Expressing." In fact, according to OpenAI's recent study, this question would fall in their most frequent single subcategory, "Seeking information: Specific information," which accounts for 18% of ALL messages. Plus, ChatGPT knows a lot about mining and rocks and stuff, so this question is a good chance for it to just tell me, you would think.

In actuality, though, there are 3 big constraints that my question presents:

  1. It's not just factual, "What's the definition of ore?" I want to understand the limits of the definition, "What does and doesn't count as ore?" as well as the rationale for those limits. So to answer well, the LLM needs to also understand the social norms that shape the definition, as well as any competing definitions. As a parallel, consider how the term "kid" can have a flexible age ceiling depending on the context - sometimes it contrasts with "legal adult" and therefore is <18 in the US; sometimes it contrasts with "being experienced," as in "just a kid", and means "younger than you need to be for this activity"; sometimes it is a casual term that parents use, as in "I have 2 kids, one lives in Florida with his family…", and can be used for people in their 50s! In other words, my random inquiry into mining is a deep question that calls for significant nuance.
  2. It's clear the user (me) doesn't know exactly what they're asking. I say "non-metal" and "precious stones," which are obviously not the right terms. In fact, the whole thing is kind of awkwardly phrased: "emeralds, rubies, etc.", "this category". And I'm obviously stuck in my investigation. I'm not getting relevant results when I google for "emerald ore", and that seems like a bad sign for precious stones being an ore, but it doesn't help me know why. So to answer well, the LLM needs to recognize what I do and don't know, and help me bridge the gap. I wrote about LLMs being too smart, where they forward technical language to the user beyond what the user has set up - this is a chance to test it. The LLM also needs to read between the lines and not take my question too literally. I don't care about emeralds and rubies specifically; they were just the only other fancy stones that came to mind, and I don't know how to contrast with "metals."
  3. It's a low-stakes question, so there's a lot of flexibility in what a good response looks like. Text? Links to videos or interesting pictures? A cross-cultural history of the definition of "ore"? All of these and more are on the table. I'm clearly asking this question out of general curiosity, not because I need to type the right answer in somewhere. You would think this doesn't count as a constraint, but "open-ended" is definitely something that should shape a good response.

To summarize, for ChatGPT to be a "good" writer here, it needs to pay attention to social norms (and possible conflicts among subgroups), manage a wide knowledge gap about mining, and leverage the stylistic flexibility that comes with my open-ended question. Constraints vary from task to task, and in my career I've developed some skill at identifying them quickly and reliably. With this analysis in mind, the writing task for ChatGPT seems a bit harder.

In fact, GPT-5 Thinking fails on all three constraints.

Step 2. Take stock of the working draft

Before I offer suggestions to a client, we have to get on the same page about the status of a draft - what's working well, what isn't a good fit for the constraints we've mapped out, etc. I don't know what OpenAI thinks of ChatGPT's answers, but I wouldn't be surprised if they thought the kind of response I've received was pretty solid, with just maybe a few touch-ups needed. (This is the kind of thing that people say to me when they haven't really received any detailed feedback on their work before.)

Here's the response GPT-5 Thinking came up with:

Image: GPT-5's response, summarized in the main text.

For completeness, here's my follow-up question and the response.

First impressions:

I understand why user satisfaction is high for questions like this.

Now is the time to gently offer additional observations. Sometimes as a writing coach I do actually amp up a draft - yes, this is working even better than you thought! But with LLM designers as clients, taking stock means revising downward significantly:

  1. The opener is intriguing, and sets up 3 different contexts: geology, mining, and gemology - but then the text doesn't follow through on them. As a reader I'm expecting some kind of parallelism, "How geologists think about 'ore'", etc. There isn't any parallelism textually, and in fact we don't hear about the concrete interests of the disciplines of geology, mining, or gemology. We don't even learn the definition that each one uses. This makes the answer feel pretty muddled. We can see the user felt the same way in the moment, since they had to ask a follow-up question to gain clarity. Even now, to be honest, I think GPT's response amounts to, "it's technically true that there can be ore of several gemstones, but people wouldn't really talk about it like that." But I don't have much confidence in that interpretation.
  2. The response is also a miss on "emeralds and rubies" - the user doesn't know much about gemstones (couldn't even think of that word, in fact) and was just using them illustratively, but in this response an entire subsection was devoted to them specifically. The LLM is indeed being "too smart" here. In fact, the terminology makes it difficult to understand. "Rubies: the “ore” is usually corundum (ruby) in marble (metamorphosed limestone) or other primary/secondary deposits; the literature describes these deposits in exactly those geological terms, even if it doesn’t always say “ruby ore.”" This is one bullet point under "technically yes" there can be ruby ore. But it is muddled at a paragraph level - how does using "exactly those geological terms" mean that 'ore' is technically accurate?
  3. The response also misunderstands the purpose of mentioning the search results. The user doesn't want help screening out Minecraft references for a more extended internet search (so that section is off the mark). A better interpretation is that the user is encountering frequent informal uses of "ore" online and isn't sure whether that's a good sign for an expansive definition of ore or a bad sign. That is, there's a latent category for the user, "informal definitions of ore," which could be worth integrating.
  4. The essay format (opener, points, summary) seems like a bad fit for an open-ended question like this. The structure and visual-hierarchy elements used here are in fact somewhat stereotypical of LLM responses, so the response feels bland and formulaic even though it's dense with ideas.

Notice that in this critique, I'm not actually commenting on ChatGPT's accuracy! And I'm not bothered about "grammar" per se. All of my critiques assume that ChatGPT is roughly correct, and use the rhetorical constraints as analytical foci to identify what isn't working very well.

Step 3. Reworking the draft

We've identified several flaws with the response, but it's not yet clear what it will take to walk those back and go in a better direction. In my real coaching work, this is a highly collaborative process that involves a surprising amount of strategic thinking, especially for scientific researchers. I generally arrange my suggestions by how much work they would take, from "If you need to submit this tonight, then here's the best revising bang for your buck" all the way to "If you have the luxury of being able to rewrite it and the motivation to make it as good as it could be, here's how we might re-think the piece." Some LLM providers are so focused on peak intellectual output that I suspect they would view any fixes as a low priority for their product roadmap - fine. But for those interested in a rewrite, here's how I would suggest going about it.

The key revision observation is that it matters how you define the problem and what you tackle first; to use math language, revision is non-Abelian. If R1 is "restructure the argument" and R2 is "polish the sentences," then doing R1 before R2 yields a different draft than the reverse - polishing sentences you're about to cut is wasted work. I view the muddled answer as the most central problem. You could develop many hypotheses for why ChatGPT's response comes off as muddled, which lead to different solution paths.

The most serious hypothesis is also the one with the deepest root: an ideological or conceptual check-in, which focuses on the task of defining something itself. ChatGPT gives very mixed signals in the response about whether a technical definition of 'ore' will lead to one definition or many. It has a whole section called "What ore actually means" and many other indicators that it's seeking one unifying definition. But then the opener and other aspects that neutrally compare sources make it seem like it's okay with multiple definitions and rare/odd usages within different communities. This is a core difference, and a fundamental aspect of this kind of question. At a deep level, definitions are socially rooted; ChatGPT is mostly ignoring that! Then you end up with a muddled answer, which is only a half-step from "mush" and from there, "LLM slop." It's possible to do better than this.

Step 4. Wrap up

When I'm acting as a writing coach to humans, it's easy to stop and clarify. "Wait, are you expecting there to be one definition, or is it a hierarchy of authoritativeness, or is it kind of a free-for-all? What about when it's used generically in games like Minecraft or Settlers of Catan?" What seems like a simple clarifying question is often just the beginning, and leads to other questions that gradually reveal a concrete revision path.

In an LLM context, it takes additional effort to get a sense of ChatGPT's misses, and to translate suggestions technically into revised RL rewards, evals, workflows, post-training, etc., depending on what role you have. I did a lot of this from the "workflow designer" perspective to get good AI-completed novellas. It's hard to get better responses, and techniques are evolving quickly.
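
To make the evals piece concrete, here's a minimal sketch of what that translation might look like, assuming a simple rubric-scoring setup. Everything in it is illustrative rather than anyone's real pipeline: the criterion names are mine, and the keyword checks are crude stand-ins for what would really be an LLM-judge call or a human rating.

```python
# Minimal sketch: turning the rhetorical critique above into a rubric eval.
# The three criteria mirror the constraints from Step 1. The string checks
# are illustrative stand-ins for an LLM judge or a human rater.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    description: str
    check: Callable[[str], bool]  # True if the response satisfies it

CRITERIA = [
    Criterion(
        name="social-definition",
        description="Takes a clear stance on one definition of 'ore' vs. many",
        check=lambda r: "depends on" in r.lower() or "no single" in r.lower(),
    ),
    Criterion(
        name="knowledge-gap",
        description="Explains technical terms instead of forwarding them",
        check=lambda r: "corundum" not in r.lower() or "that is" in r.lower(),
    ),
    Criterion(
        name="open-endedness",
        description="Varies format beyond the stock opener/bullets/summary",
        check=lambda r: r.count("\n- ") < 8,  # crude proxy for bullet overuse
    ),
]

def score(response: str) -> dict[str, bool]:
    """Score a single response against the rubric."""
    return {c.name: c.check(response) for c in CRITERIA}

if __name__ == "__main__":
    draft = "Whether something counts as 'ore' depends on who's asking..."
    print(score(draft))
    # {'social-definition': True, 'knowledge-gap': True, 'open-endedness': True}
```

The point isn't the string matching; it's that each critique from Step 2 can become a named, checkable criterion, which is what makes it actionable as a reward signal or regression test.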

So this writing session is just the tip of the iceberg for humans, let alone for LLMs. We have a lot of to-dos.


Image: Late 1100s manuscript page from a Bible, large initial 'V' (outlined in green), which begins the book of Leviticus in Latin. This image was Claude's top choice from a draft of this blog post, since "These decorated letters perfectly symbolize the intersection of language and artistry, making them ideal for a post about writing quality." Personally, I think that's weak; I'd reinterpret this as "a muddled letter V." From The Met.