Researchers from the University of Illinois at Urbana-Champaign and Tsinghua University introduced Magicoder to address the challenge of generating low-bias, high-quality coding challenges from open-source code snippets. Magicoder outperforms existing code LLMs of similar or even larger size on various coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion.
Prominent base models like CodeGen, CodeT5, StarCoder, and CODELLAMA have established the fundamental ability of LLMs in code generation and understanding. Instruction tuning has been proposed to improve pretrained LLMs by finetuning them with instruction-response pairs, and methods like SELF-INSTRUCT and Evol-Instruct have been introduced to generate synthetic data for instruction tuning. Existing code benchmarks such as HumanEval, MBPP, APPS, and CodeContests evaluate LLMs on developing single-function programs from natural language descriptions.
Magicoder is a series of fully open-source LLMs for code, trained on 75K synthetic instruction examples generated with OSS-INSTRUCT, an approach that uses open-source code snippets as inspiration for LLMs to generate high-quality instruction data for code. The method prompts an LLM to create coding problems and solutions based on seed code snippets from GitHub, which promotes diversity and real-world relevance. Evaluation uses benchmarks such as HumanEval and MBPP, focusing on the pass@1 metric. INSTRUCTOR is used to categorize OSS-INSTRUCT-generated data based on embedding similarity. Data cleaning techniques, including decontamination and prompt filtering, are applied for robustness.
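The prompting step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `[Problem]` and `[Solution]` section tags, and the helper names `build_prompt`, `parse_response`, `make_instruction_pair`, and `fake_llm` are all assumptions made for the example. The `generate` argument stands in for whatever LLM call is actually used.

```python
from typing import Callable, Dict

# Illustrative prompt wording (an assumption, not the paper's exact template).
PROMPT_TEMPLATE = (
    "Use the following code snippet as inspiration to write a self-contained "
    "programming problem and a correct solution.\n"
    "Present the output in two sections: [Problem] and [Solution].\n\n"
    "Code snippet:\n{seed}\n"
)


def build_prompt(seed_snippet: str) -> str:
    """Embed a seed code snippet into the instruction-generation prompt."""
    return PROMPT_TEMPLATE.format(seed=seed_snippet.strip())


def parse_response(text: str) -> Dict[str, str]:
    """Split a model response into its problem and solution sections."""
    problem_tag, solution_tag = "[Problem]", "[Solution]"
    p = text.find(problem_tag)
    s = text.find(solution_tag)
    if p == -1 or s == -1 or s < p:
        raise ValueError("response is missing the expected sections")
    return {
        "problem": text[p + len(problem_tag):s].strip(),
        "solution": text[s + len(solution_tag):].strip(),
    }


def make_instruction_pair(
    seed_snippet: str, generate: Callable[[str], str]
) -> Dict[str, str]:
    """Turn one seed snippet into one problem/solution training example."""
    return parse_response(generate(build_prompt(seed_snippet)))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call, used only for the demo."""
    return (
        "[Problem]\n"
        "Write a function `add` that returns the sum of two integers.\n"
        "[Solution]\n"
        "def add(a, b):\n"
        "    return a + b\n"
    )


if __name__ == "__main__":
    pair = make_instruction_pair("def add(a, b): return a + b", fake_llm)
    print(pair["problem"])
    print(pair["solution"])
```

Passing the model call in as a function keeps the sketch independent of any particular LLM provider. Decontamination and prompt filtering would run as separate steps over the collected pairs.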
Magicoder is competitive with top code models despite a modest size of no more than 7 billion parameters. Trained on 75,000 synthetic instruction examples generated with OSS-INSTRUCT, it outperforms advanced code models in Python text-to-code generation, multilingual coding, and data-science program completion. The enhanced version, MagicoderS, further improves code generation performance, surpassing other models of similar or larger sizes on various benchmarks. MagicoderS-CL-7B achieves cutting-edge results among code models, demonstrating robust code generation capabilities.
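Scores on benchmarks such as HumanEval and MBPP are usually reported as pass@k, the probability that at least one of k sampled solutions passes all unit tests. The sketch below uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for a problem with n samples of which c are correct. The function names and the `(n, c)` input format are assumptions for illustration; this is not the evaluation harness used in the study.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: samples that pass all tests,
    k: sample budget being evaluated.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over per-problem (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # One sample per problem: pass@1 reduces to the fraction of problems solved.
    single_sample_results = [(1, 1), (1, 0), (1, 1), (1, 0)]
    print(benchmark_pass_at_k(single_sample_results, k=1))
```

With a single sample per problem, pass@1 is simply the share of problems whose one generated solution passes every test.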
In conclusion, the study highlights the effectiveness of OSS-INSTRUCT, which uses LLMs to generate coding challenges from open-source code snippets. Magicoder, trained with OSS-INSTRUCT, performs better than other LLMs with more parameters on diverse coding benchmarks. When OSS-INSTRUCT is combined with Evol-Instruct, the resulting MagicoderS models show impressive performance on the HumanEval benchmarks, comparable to leading models like ChatGPT. The authors open-source the model weights, training data, and source code to support future research on LLMs for code, and point to scaling OSS-INSTRUCT to larger base models and generating higher-quality data as future work.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.