New research from MIT Sloan affiliates indicates that improvements in generative artificial intelligence (AI) performance are not driven by model upgrades alone. A large-scale experiment revealed that only half of the performance gains observed after switching to a more advanced AI model stemmed from the model itself. The remaining half came from users adapting their prompts – the written instructions given to the AI – to take advantage of the new system.
This finding underscores a crucial reality for businesses: new AI tools will not deliver their anticipated value unless employees also refine how they use them. The study suggests that prompting is a learnable skill that people can improve quickly, even without formal instruction.
David Holtz, SM ’18, PhD ’21, an assistant professor at Columbia University, a research affiliate at the MIT Initiative on the Digital Economy, and a co-author of the study, said, “People often assume that better results come mostly from better models. The fact that nearly half the improvement came from user behavior really challenges that belief.”
The experiment involved nearly 1,900 participants who were randomly assigned to one of three versions of OpenAI’s DALL-E image generation system: DALL-E 2, the more advanced DALL-E 3, or DALL-E 3 with users’ prompts automatically rewritten by the GPT-4 large language model without their knowledge. Participants were asked to recreate a reference image, such as a photo, graphic design, or piece of art, by typing instructions into the AI. They had 25 minutes to submit at least 10 prompts and were offered a bonus payment for the top 20% of performers, an incentive to test and refine their instructions.
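The article does not specify how similarity between a participant’s output and the reference image was scored. As a purely illustrative sketch, one common way to quantify that kind of similarity is to embed both images with a vision-language model such as CLIP and take the cosine similarity of the embeddings; the checkpoint name and scoring function below are assumptions for illustration, not the study’s actual metric.

```python
# Hypothetical illustration: scoring how closely a generated image matches a
# reference image with CLIP embeddings. The study's actual metric may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_similarity(generated_path: str, reference_path: str) -> float:
    """Return the cosine similarity between CLIP embeddings of two images."""
    images = [Image.open(generated_path), Image.open(reference_path)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)
    # Normalize so the dot product equals cosine similarity.
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return float((embeddings[0] @ embeddings[1]).item())

# Example: score = image_similarity("participant_attempt.png", "target.png")
```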
The researchers reported several key findings:
- Participants using the baseline version of DALL-E 3 produced images that were more similar to the target image than those generated by DALL-E 2 users.
- Participants utilizing the baseline DALL-E 3 wrote prompts that were 24% longer than those of DALL-E 2 users. These prompts also exhibited greater similarity to each other and contained a higher proportion of descriptive words.
- Approximately half of the improvement in image similarity was attributable to the better model, while the other half came from users adjusting their prompts to take advantage of its improved capabilities.
While this study focused on image generation, the researchers believe the same pattern is likely to apply to other tasks, including writing and coding.
The research demonstrated that the ability to adapt prompts over time was not exclusive to tech-savvy users. Holtz commented, “People often think that you need to be a software engineer to prompt well and benefit from AI. But our participants came from a wide range of jobs, education levels, and age groups — and even those without technical backgrounds were able to make the most of the new model’s capabilities.”
The data suggests that effective prompting is more about clear communication than coding. Holtz noted, “The best prompters weren’t software engineers. They were people who knew how to express ideas clearly in everyday language, not necessarily in code.”
This accessibility may also help reduce performance disparities among users with varying skill levels and experience. Eaman Jahani, PhD ’22, an assistant professor at the University of Maryland, a digital fellow at the MIT Initiative on the Digital Economy, and a co-author of the study, observed that generative AI has the potential to narrow performance gaps between users. “People who start off at the lower end of the [performance] scale benefited the most, which means the differences in outcomes became smaller,” Jahani said. “Model advances can actually help reduce inequality in output.”
Jahani clarified that the team’s findings are applicable to tasks with clear, measurable outcomes and an identifiable upper limit for a good result. He noted that it is not yet clear whether the same pattern would hold for more open-ended tasks without a single correct answer and with potentially significant payoffs, such as generating transformative new ideas.
One of the more unexpected findings was that automatically rewriting prompts with generative AI significantly hurt performance. The group whose DALL-E 3 prompts were rewritten behind the scenes saw a 58% degradation in performance compared with the baseline DALL-E 3 group. The researchers found that the automatic rewrites frequently introduced extraneous details or altered the intended meaning of the user’s input, causing the AI to produce an incorrect image.
Holtz explained, “[Automatic prompt rewriting] just doesn’t work well for a task like this, where the goal is to match a target image as closely as possible. More importantly, it shows how AI systems can break down when designers make assumptions about how people will use them. If you hard-code hidden instructions into the tool, they can easily conflict with what the user is actually trying to do.”
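To make the mechanism concrete, here is a minimal, hypothetical sketch of such a hidden rewriting layer, assuming the OpenAI Python SDK; the model names and the rewrite instruction are illustrative and not the study’s actual configuration. The structural point is that the user’s prompt is silently expanded by a language model before it reaches the image model, so any detail the rewriter invents can override what the user actually typed.

```python
# Hypothetical sketch of a hidden prompt-rewriting layer: the user's prompt is
# silently expanded by a language model before it reaches the image model.
from openai import OpenAI

client = OpenAI()

def generate_with_hidden_rewrite(user_prompt: str) -> str:
    """Rewrite the user's prompt with an LLM, then pass it to the image model."""
    rewrite = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's image prompt to be more detailed."},
            {"role": "user", "content": user_prompt},
        ],
    )
    rewritten_prompt = rewrite.choices[0].message.content
    # The user never sees rewritten_prompt; any details the rewriter adds can
    # drift from their intent, which is where the performance loss arises.
    image = client.images.generate(model="dall-e-3", prompt=rewritten_prompt, n=1)
    return image.data[0].url
```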
The study’s implications for businesses are clear: beyond selecting the “right” AI model, leaders must prioritize enabling effective user learning and experimentation. Jahani emphasized that prompting is not a plug-and-play skill. “Companies need to continually invest in their human resources,” he said. “People need to be caught up with these technologies and know how to use them well.”
To maximize the benefits of generative AI in real-world settings, the researchers offer several priorities for business leaders:
- Invest in training and experimentation: Technical upgrades alone are insufficient. Providing employees with the time and support to refine their interactions with AI systems is crucial for realizing full performance gains.
- Design for iteration: User interfaces that encourage testing, revision, and learning – and clearly display the results – contribute to better outcomes over time.
- Be cautious with automation: While automated prompt rewriting may seem convenient, it can hinder performance rather than improve it if it obscures or overrides user intent.
The paper was co-authored by MIT Sloan PhD students Benjamin S. Manning, SM ’24; Hong-Yi TuYe, SM ’23; and Mohammed Alsobay ’16, SM ’24; as well as Stanford University PhD student Joe Zhang, Microsoft computational social scientist Siddharth Suri, and University of Cyprus assistant professor Christos Nicolaides, SM ’11, PhD ’14.