Paper title
Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic?
Authors
Abstract
Language models are promising solutions for tackling increasingly complex problems. In software engineering, they have recently gained attention as code assistants, which generate programs from a natural language task description (prompt). They have the potential to save time and effort but remain poorly understood, which limits their optimal use. In this article, we investigate the impact of input variations on two configurations of a language model, focusing on parameters such as task description, surrounding context, model creativity, and the number of generated solutions. We design specific operators to modify these inputs and apply them to three LLM-based code assistants (Copilot, Codex, StarCoder2) and two benchmarks representing algorithmic problems (HumanEval, LeetCode). Our study examines whether these variations significantly affect program quality and how these effects generalize across models. Our results show that varying input parameters can greatly improve performance, achieving up to 79.27% success in one-shot generation compared to 22.44% for Codex and 31.1% for Copilot in default settings. Realizing this potential in practice is challenging because of the complex interplay among the parameters we study: the optimal settings for temperature, prompt, and number of generated solutions vary by problem. Reproducing our study with StarCoder2 confirms these findings, indicating they are not model-specific. We also uncover surprising behaviors (e.g., fully removing the prompt can be effective), revealing model brittleness and areas for improvement.
