Fig.1 Interface of MuseLeader
Designing control methods for music generation systems is essential for generating music that aligns with user preferences. In particular, parameter control provides an effective means of adjusting the atmosphere of the generated music, such as its ``brightness.'' Some systems additionally allow users to specify transitions in atmospheric intensity over time. However, parameter control faces a frame problem: users can manipulate only the parameters predefined by the system. To overcome this limitation, this study proposes an approach that leverages large language models (LLMs) to let users define the meaning of parameters through text. We also introduce MuseLeader, a working music composition system equipped with a graphical user interface for customizing semantically defined time-series parameters. User studies indicate that parameters with clear semantic definitions (e.g., ``Powerful,'' ``Robotic'') can be controlled effectively according to user intent. Additionally, some users refined their expressive intentions by redefining the parameter axes. For further advancement, it is essential not only to enhance the inference capabilities of LLMs but also to explore multimodal inputs beyond text to better interpret complex and nuanced musical concepts.
The intensity of a mood is represented as a numerical value between 0.0 and 1.0 and can be set for every four measures. The higher the intensity in a given section, the more strongly that mood is expected to be reflected in the music.
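As a minimal illustrative sketch (not the actual MuseLeader data model), a semantically defined time-series parameter can be represented as a user-named axis plus one intensity value per four-measure segment; the class and method names here are hypothetical.

```python
# Hypothetical sketch of a semantically defined time-series parameter,
# assuming a piece whose mood is controlled in four-measure segments.
from dataclasses import dataclass

@dataclass
class ParameterAxis:
    name: str                 # user-defined semantic label, e.g. "Brightness"
    intensities: list[float]  # one value in [0.0, 1.0] per four-measure segment

    def intensity_for_measure(self, measure: int) -> float:
        """Return the intensity governing a given (1-indexed) measure."""
        segment = (measure - 1) // 4
        return self.intensities[segment]

# A 16-measure piece that gradually brightens, as in column (a) of the table below.
brightness = ParameterAxis("Brightness", [0.00, 0.25, 0.50, 1.00])
print(brightness.intensity_for_measure(6))  # measure 6 lies in segment 2 -> 0.25
```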
The music pieces were generated using gpt-4o-2024-11-20.
| Parameter Axis | (a) 0.00 → 0.25 → 0.50 → 1.00 | (b) 0.00 → 0.25 → 1.00 → 1.00 | (c) 1.00 → 0.75 → 0.50 → 0.00 |
|---|---|---|---|
| Strength | | | |
| Robotic | | | |
| Brightness | | | |
| Classic | | | |
| Jazziness | | | |
| Urban | | | |
| Heart-Pounding | | | |
| Emotional | | | |
Fig.2 System composition of MuseLeader
We utilize the inference capabilities of a large language model (gpt-4o-2024-08-06) to associate the specified words with editing operations along the axes designated by the user. However, the system struggles to accurately reflect user intent in time-series control. To address this issue, we refine the prompts given to the LLM as follows.
Task Decomposition: Inspired by ComposerX, our system splits the composition task among multiple LLM agents: a leader agent, a melody agent, a chord agent, and an instrument agent.
Planning of Musical Elements: The leader agent plans the musical elements in four-measure segments and documents the plan in a table. The other agents use this table to decide which sections to edit.
Use of Four-Measure Delimiters: Our system instructs the LLM to insert a newline every four measures along with a comment (e.g., % measure [start]-[end]) to clearly indicate which sections should be edited.
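To make the delimiter convention concrete, the sketch below shows a hypothetical ABC-notation tune body annotated with `% measure [start]-[end]` comments, and a small helper (not part of MuseLeader) that locates the section an agent should edit. The notes in the snippet are invented for illustration.

```python
# Illustrative sketch of the four-measure delimiter convention:
# each four-measure line of ABC notation is preceded by a
# "% measure [start]-[end]" comment so an LLM agent can find it.
import re

abc_body = """\
% measure 1-4
C2 E2 G2 E2 | C2 E2 G2 E2 | D2 F2 A2 F2 | G4 z4 |
% measure 5-8
E2 G2 c2 G2 | E2 G2 c2 G2 | F2 A2 d2 A2 | c4 z4 |
"""

def section_for_measures(abc: str, start: int, end: int) -> str:
    """Return the music line tagged with the given measure range."""
    lines = abc.splitlines()
    for i, line in enumerate(lines):
        m = re.match(r"%\s*measure\s+(\d+)-(\d+)", line)
        if m and int(m.group(1)) == start and int(m.group(2)) == end:
            return lines[i + 1]
    raise ValueError(f"no section tagged measure {start}-{end}")

print(section_for_measures(abc_body, 5, 8))
# -> E2 G2 c2 G2 | E2 G2 c2 G2 | F2 A2 d2 A2 | c4 z4 |
```

An editing agent asked to strengthen a mood only in measures 5–8 can thus rewrite a single line while leaving the rest of the score untouched.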
Fig.3 Workflow of Multi-Agent LLM-Based Music Composition and Arrangement
The following are actual prompts used for each agent, translated into English.
This paper does not investigate the optimality of these prompt engineering techniques. Establishing reliable evaluation methods and optimizing prompt design are important challenges for future research. Moreover, these techniques may become unnecessary as the inference capabilities of large language models advance.
We use the following tools to render scores in the ABC notation format.