Summary
Reflections on six weeks of coding with Claude
Highlights
id993745636
Progress came mainly from scaling up training runs and model size, from hundreds of millions of parameters for GPT-1 to now hundreds of billions or maybe trillions of parameters in the most capable models.
Regarding model training, as one often hears, one of the main factors driving improvement was simply scale: more compute makes it possible to process a larger training corpus, which yields models with more parameters that can consequently represent a much larger universe of states.
id993745822
What this created was not a full artificial intelligence but artificial intuition. It could “answer off the top of its head,” it had a superhuman recall for facts, and it could blurt out not just sentences but entire essays. But it was still blurting out all its answers, with no ability to “think” before “speaking,” check its work, or follow an explicit procedure—not even, say, long addition. By 2022, this had become such a limitation that it was possible to dramatically improve GPT-3’s performance on mathematical reasoning problems simply by concluding the prompt with “Let’s think step-by-step,” which encouraged models to work through the problem explicitly rather than trying to blurt out an answer. Soon this approach was built into the product, in a new class of “reasoning” models, such as OpenAI’s o1, that were given the ability to “think”—that is, to talk to themselves in a scratchpad—before producing a response.
This describes the innovation in the training and calibration process that let models transition to so-called "reasoning models." In this case it supposedly amounted to little more than tuning their base prompt. That sounds plausible, but I need to verify it.
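The "Let's think step-by-step" trick mentioned in the highlight is just prompt text appended before the model is called. A minimal sketch of that zero-shot chain-of-thought prompting, with the model call itself left out since no particular API is named in the source:

```python
def with_cot(prompt: str) -> str:
    """Append the zero-shot chain-of-thought suffix to a prompt.

    The suffix nudges the model to work through the problem
    explicitly instead of blurting out an answer.
    """
    return prompt.rstrip() + "\n\nLet's think step-by-step."


question = ("A bat and a ball cost $1.10 in total. The bat costs "
            "$1.00 more than the ball. How much does the ball cost?")
print(with_cot(question))  # this string would then be sent to the model
```

In practice the returned string is passed to whatever completion API is in use; "reasoning" models later internalized this behavior, generating the scratchpad tokens on their own before the visible answer.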
id993799728
A text generator is not an agent and does not pursue goals—but it was clear from the beginning how an agent might be built from them. Just provide it with a small scratchpad and a few tools it can invoke. Then tell it a goal, and run it in a loop: given the goal, make a plan to achieve it, execute that plan, then check if the goal was achieved; if not, replan and begin again; continue until you succeed.
This describes the step from programming a language model to programming an agent... and it is nothing particularly exotic.
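The loop the highlight describes (goal in, plan, execute, check, replan) can be written down almost verbatim. In this sketch `plan`, `execute`, and `goal_achieved` are hypothetical stand-ins for LLM calls and tool invocations; only the control flow comes from the source:

```python
def run_agent(goal, plan, execute, goal_achieved, max_iters=10):
    """Generic plan-execute-check agent loop.

    plan(goal, scratchpad)          -> list of steps (an LLM call in practice)
    execute(step, scratchpad)       -> result (tool invocations in practice)
    goal_achieved(goal, scratchpad) -> bool (a verification call in practice)
    """
    scratchpad = []  # the agent's working memory
    for _ in range(max_iters):
        for step in plan(goal, scratchpad):
            scratchpad.append(execute(step, scratchpad))
        if goal_achieved(goal, scratchpad):  # done? if not, replan and retry
            return scratchpad
    return scratchpad  # gave up after max_iters rounds of replanning


# Toy run: the "goal" is just to accumulate three results.
trace = run_agent(
    goal=3,
    plan=lambda g, s: ["do one step"],
    execute=lambda step, s: "ok",
    goal_achieved=lambda g, s: len(s) >= g,
)
```

The interesting part is how little scaffolding this is: the intelligence lives entirely in the three callables, which is why the note calls the transition "nothing particularly exotic."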
id993801108
Put all of this together—the “reasoning” mode, better coding, and greater agency—and by late last year we had crossed a tipping point: Some software developers stopped writing code themselves, and started letting agents write 100% of it.
The emergence of autonomous agents that produce good code was a nonlinear transition, a confluence of several factors.
id993801244
Andrej Karpathy says the term “vibecoding” no longer does justice to what’s possible: it’s now “agentic engineering.”
On the dramatic increase in agents' capability and in the quality of their output.
id993824837
But directing coding agents is a different thing altogether: it can be done on “manager schedule” rather than “maker schedule.” Garry Tan describes it using the metaphor that it’s as if he used to be a competitive runner (i.e., engineer) who got a knee injury (went into management). But now he has a knee replacement (coding agents)—and it’s a bionic knee, better than before.
This is an important principle to share with people who are learning to get the most out of AI tools: you have to position yourself as the project's manager rather than its executor.
id993831210
Stepping back, I think a lot of progress since ~GPT-3 has been in taking the core intuitive faculty provided by statistical language models and adding layers of self-monitoring and self-control, such as reasoning and skills. I find it remarkable how much LLMs are aided by some of the same practices that help humans be more effective: working problems out on a scratchpad, planning before executing, and all of the structure and practices that human engineers, designers, and product managers put in place around software development. Elsewhere, Wilson Lin at Cursor reports on an experiment with getting a large team of agents to implement a web browser from scratch, a large undertaking (although one for which there is already a comprehensive set of formal specifications and acceptance criteria). Just getting a bunch of agents to work off of one big shared task list was too chaotic. What worked was having certain agents dedicated to planning—assessing status and figuring out what was needed next to reach the goal—while other agents acted as implementers, picking tasks off the plan and getting them done without worrying too much about the big picture. Again, systems of self-monitoring and self-control.
This is an important insight for understanding how agents work: they are basically language models plus methodologies, heuristics, and procedures that channel their language-processing power, organizing it and making it effective.
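The planner/implementer split from Wilson Lin's experiment can be sketched with a shared task queue. This is a hypothetical reconstruction, not Cursor's actual code: `planner` and `implementer` are stub roles for LLM-backed agents, and the structural point is that only the planner reasons about the goal while implementers just drain the queue:

```python
from queue import Queue
from threading import Thread


def run_team(goal, planner, implementer, n_workers=3):
    """One planner decides what is needed; implementer workers pick
    tasks off the queue without worrying about the big picture."""
    tasks: Queue = Queue()
    results: Queue = Queue()

    for task in planner(goal):   # planner assesses the goal, emits tasks
        tasks.put(task)
    for _ in range(n_workers):   # one poison pill per worker to stop them
        tasks.put(None)

    def worker():
        while (task := tasks.get()) is not None:
            results.put(implementer(task))  # sees only its own task

    threads = [Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results.get() for _ in range(results.qsize())]


# Toy run with placeholder "agents":
done = run_team(
    "render the page",
    planner=lambda goal: [f"{goal}: subtask {i}" for i in range(5)],
    implementer=lambda task: task.upper(),
)
```

Contrast this with "one big shared task list" that every agent both edits and executes, which is the chaotic setup the experiment reportedly abandoned.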
id993836891
The biggest limitation on these systems right now, it seems to me, is memory. They start each session like Leonard Shelby from Memento, with no short-term memories, needing to review all their notes to get context. This is a very limited form of learning. An LLM can’t develop intuition or taste post-training—which, as Dwarkesh pointed out, means it can’t learn on the job the way a human does. Claude’s memory file generated from our chats is about 400 words, ChatGPT’s is not much over 100; a human assistant who had talked to me as much as they have would have a much deeper understanding of me. No doubt this limitation, too, will be removed sooner or later; I agree with Ethan Mollick when he suggests that this will be transformative.
The main current limitation of language models is their poor memory and their near-zero ability to learn autonomously from interaction with the user (unless explicitly instructed to do so).
id993844190
on the current trajectory, we’re only a year or so away from whole teams of agents that work together like a complete dev shop. A client could come to the process with only a vague, high-level idea of what they need. A product manager agent would interview them to discover requirements. The PM would write a product spec, and a design agent would create UI mockups, both of which the user could review and comment on. Once the spec and design were approved, an engineering agent would produce a tech design; perhaps a second agent with fresh context would review and revise it. A planner agent would turn it into a task list, and a team of implementer agents would execute the coding tasks in parallel, with reviewer agents examining the code for bugs, weaknesses, and best practices. The app would periodically be presented back to the client for user testing and feedback, for as many rounds of iteration as needed to leave the client fully satisfied. On the whole, it would be much like the process performed by humans, but it would take orders of magnitude less money and time.
This could perfectly well be applied to developing proposals and projects in consulting. I could build a skill that orchestrates a set of agents so that they interview the user and help them develop the proposal step by step.
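The staged dev-shop workflow in the highlight (PM → design → engineering → planning → implementation → review) is, structurally, a pipeline where each agent consumes the artifacts produced so far. A rough sketch under that assumption, with every stage a hypothetical stub:

```python
def dev_shop(idea, stages):
    """Run a client idea through a pipeline of agent stages.

    `stages` is an ordered list of (name, agent) pairs; each agent
    receives the dict of all artifacts produced so far and returns
    a new artifact, which is stored under its stage name.
    """
    artifacts = {"idea": idea}
    for name, agent in stages:
        artifacts[name] = agent(artifacts)  # each stage sees prior artifacts
    return artifacts


# Toy run with placeholder stages standing in for LLM agents:
out = dev_shop(
    "expense-tracking app",
    [
        ("spec",   lambda a: f"spec for: {a['idea']}"),
        ("design", lambda a: f"mockups for: {a['spec']}"),
        ("code",   lambda a: f"implementation of: {a['design']}"),
    ],
)
```

A real version would add the human review points the highlight describes (client sign-off on the spec and design, user testing of the app) as stages that pause for feedback rather than call a model.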
id993888631
it is not hard at all to envision how this will play out in other industries whose work essentially consists of talking to people and producing documents: law, accounting, graphic design, business consulting. Virtual service shops, doing in hours what used to take weeks, for hundreds of dollars instead of tens of thousands.
Language models will end up radically reducing the cost of consulting services.