Lorenzo Miniero's blog

Creating video previews for my Cleopatra EP

My new Cleopatra EP

I just released a new EP, which contains a 27-minute rock opera song about Cleopatra, split into 9 parts for easier listening. I wanted to try and create video previews for the different tracks to share on Instagram, so this is how I did it using the MLT Multimedia Framework.

The motivation

From time to time, I like to tinker with video editing to do cool stuff: this has been helpful for work, but also for silly music videos. The software I’ve been using for quite some time is called Shotcut, which is completely open source and quite powerful.

Now, having completed my new EP, I thought of creating a series of short video previews for each of the tracks that I could share on social networks, e.g., as Instagram or Facebook stories my friends and family could see. The first obvious thought was to do that with Shotcut, but that would have meant a manual process of creating a new project for each of them, adding the effects I wanted, the bits of music, etc., which immediately looked like too much work, especially considering the EP is made of 9 tracks. Ideally, I could have created some sort of template, since all videos needed to look more or less the same, apart from some customizations, but a quick search didn’t seem to suggest this could be done with Shotcut, or at least not easily or quickly.

As such, I started looking into how I could do this programmatically instead, and possibly in a configurable way that would allow me to quickly do the same for any new album in the future without too much effort. I soon landed on MLT as the tool for the job, for a few different reasons. MLT is a quite powerful library that’s built around FFmpeg and was conceived to allow for non-linear editing of video: this means it can be used programmatically to create multitrack videos with different effects and transitions. As a matter of fact, we use it quite extensively ourselves as part of our work at Meetecho, since it’s the tool we use to automatically mix and post-process recordings of sessions we record with our Janus WebRTC Server (e.g., for IETF meeting sessions like this). Unsurprisingly, it’s also the tool that applications like Shotcut and KDENLive use themselves under the hood to provide their non-linear editing features in a more user friendly and visual way.

Even though, as anticipated, we do use MLT at work, I had never played with it myself (someone else in the company did that part), so I decided to start looking into it, and check how complex it would be to use it to create fairly simple (and more or less static) video previews for my music. The end result was videos like the one you can see below, and I was quite happy with it, so I’ll try and explain how I got there, sharing the scripts I used.

Previews layout

As you can see from the video above, I had a fairly simple idea in mind. For each track of the EP, I wanted a vertical video (since, again, the target was Instagram stories) that needed to look like this:

  1. a short 15-second audio snippet of the track, possibly starting from a point of my choosing;
  2. the cover of the EP as background, centered vertically;
  3. the title of the current track centered on the cover, using a custom font;
  4. audio spectrum bars at the top and bottom, possibly colored;
  5. fade in and fade out for both audio and video.

This is normally relatively easy to do with visual tools like Shotcut, since you just drag stuff around, cut the parts to where you want them to start/end, and you can add audio and visual effects easily by using filters you can add dynamically. Again, though, this is not what I wanted to do here, since in this context I wanted to try and do this more programmatically, and in a way that would be easy to replicate using dynamic and customizable properties (not only for the different tracks, but also for different albums in the future). As such, I started looking into the MLT documentation to start acquainting myself with its syntax.

Let’s melt this!

One decision I took fairly early in the process was not to use MLT as a library, but to use the command line melt application instead. This application basically exposes everything the library offers, but using a command line syntax for the different features it supports. As such, it’s easy to use in a bash script, for instance, assuming you familiarize yourself well enough with its sometimes cryptic syntax. Considering my aim was relatively simple and short videos, this sounded good enough for my needs (I’d definitely study the programmatic API in more detail if I wanted to do something more complex than that).

In MLT there’s the important concept of producers and consumers. The idea is that, whatever layout you’re creating, you can either write it down to an XML file that can be processed later on (the format applications like Shotcut rely upon, for instance), preview it in real-time in a window, or render it in other ways. In my case, I decided to use the XML consumer as an intermediate target, which I could then subsequently pass to melt again for converting to the actual mp4 target file.

At this point, I needed to start figuring out how to add tracks, and how to add the different effects I wanted in the video to them. As anticipated, MLT is a non-linear video editor, which means you can have multiple audio and video tracks that may stack on top of each other, be mixed somehow, or be transitioned to/from at different times. For my video previews, I mainly needed two tracks:

  1. an audio track with the 15s snippet;
  2. a video track with the EP cover, text and audio spectrum one over the other.

In MLT, items may be in sequence or in parallel, and that has an impact on how you apply filters and transitions as well, since you have to specify the exact times they need to appear and disappear at (in terms of the frames they appear at, and as such relative to the explicit framerate of the target). Without going into too much detail on the syntax itself, this is the melt command I eventually landed on:

melt -profile vertical_hd_30 \
	-consumer xml resource="temp.mlt" \
	-audio_track snippet.mp3 \
		-attach-track volume:-70db end=0db in=0 out=29 \
		-attach-track volume:0db end=-70db in=422 out=450 \
		-transition mix in=0 out=450 \
		-filter audiospectrum type=bar bands=40 thickness=20 angle=180 reverse=1 \
			color.1=red color.2=green rect="0 0 1080 400" in=0 out=450 \
		-filter audiospectrum type=bar bands=40 thickness=20 angle=0 \
			color.1=red color.2=green rect="0 1520 1080 400" in=0 out=450 \
	-track colour:black out=29 \
		temp.jpg in=0 out=450 -mix 30 -mixer luma \
		colour:black out=29 -mix 30 -mixer luma \
	-transition composite:"0=0%/0%:100%x100%:100" out=450 a_track=1 b_track=0 sliced_composite=1

First of all, I’m specifying vertical_hd_30 as a profile: this tells MLT the video I want to create is a 30fps 1080x1920 vertical video (as those are the details that specific profile dictates). MLT comes with many profiles out of the box: to know which ones are available, you can use the melt -query profiles command.
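Under the hood, a profile is just a small text descriptor of the target resolution, framerate and aspect ratio. As a reference, the vertical_hd_30 profile shipped with MLT looks more or less like the sketch below (written from memory, so the exact values may differ slightly between MLT versions; check the profiles folder of your installation, typically /usr/share/mlt/profiles/):

```ini
description=Vertical HD 30 fps
frame_rate_num=30
frame_rate_den=1
width=1080
height=1920
progressive=1
sample_aspect_num=1
sample_aspect_den=1
display_aspect_num=9
display_aspect_den=16
colorspace=709
```

If none of the stock profiles fit, you can drop a similar file in that folder and reference it by name.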

Then, you can see that, as anticipated, I’m telling MLT I want an XML consumer (-consumer xml), and that I want the result to be saved to a local file (resource="temp.mlt").

At this point, I start adding tracks, starting from audio (-audio_track snippet.mp3). Fade-in and fade-out for audio are implemented using the -attach-track directives: the in and out properties specify when each of them starts and ends (e.g., in=0 means the very beginning, while out=29 means a second later, since it’s a 30fps video), while the volume and end properties specify how the volume changes in those transitions. The audio spectrum bars, instead, are implemented using a -filter audiospectrum with a few properties to customize how they look: we’re again specifying when they get in and when they get out (in this case, the whole video); these will obviously be added to the video track, but they’re in the audio section because it’s the audio that dictates the shape of the bars. Notice that I’m assuming the snippet is already a 15-second clip here: I’ll get back to this later.
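Since all those in/out points are frame indexes rather than timestamps, a tiny helper can do the seconds-to-frames math for you. This is just an illustrative sketch (the sec2frame name is mine, nothing melt provides), assuming the 30fps profile we picked:

```shell
#!/bin/bash
# Convert seconds to a frame count, assuming a 30fps profile. Frames are
# 0-indexed in melt, so a one-second fade spans in=0 out=29 (30 frames),
# and the full 15s clip is 450 frames.
FPS=30
sec2frame() {
	echo $(( $1 * FPS ))
}
sec2frame 1    # prints 30
sec2frame 15   # prints 450
```

With a helper like this, a script can compute the out points from a configurable clip length instead of hardcoding 450 everywhere.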

For video, the approach is slightly different, since we don’t really have a video clip to use here: we want an image to be used as background, and it will need to stay there for a while. Besides, we want fade in and out for this static image too. As such, what we do is create three different clips in sequence:

  1. a black background that will last a second (out=29);
  2. an image (temp.jpg) that needs to appear right away and disappear 15 seconds later;
  3. a black background again, that will only appear towards the end and for a second.

As such, we have overlapping clips, since the image will always be there, but we’ll have black backgrounds in the first and last seconds. The way to make them all appear at the same time is via the -mixer luma directive, and the -transition that follows is what actually implements the fades over the overlaps (so that we can seamlessly transition from black to image at the beginning, and from image to black at the end).

You may have noticed that I’m just talking of a generic image being used as a track, here, but in the previous section I explained how I wanted the video to have the album cover centered vertically as background, and for each track the title of the part overlaid on that using a custom font. There’s no mention of centering a square cover image vertically, in the command above, and definitely no font either. The reason for that is simple and, admittedly, a bit awkward… MLT is definitely capable of doing both things, but I simply couldn’t figure out how to do them properly :grin: As such, I went for the easy way: just as I pass an already clipped snippet to melt for audio, I decided to use an intermediate step where I’d use ImageMagick convert to pregenerate the right background image too, in three steps.

convert "Cleopatra_cover.jpg" -resize 1080x1080 -background black \
	-gravity center -extent 1080x1920 cover.jpg

convert -alpha on -background "#00000088" -fill white -font Dalek \
	-pointsize 96 -size 1080x200 -gravity Center caption:"I. Overture" temp.png

convert cover.jpg temp.png -gravity center -composite temp.jpg

The first convert command turns the square cover image into a 1080x1920 vertical image with a black background, where the cover is centered vertically. The second convert command generates an image out of text, using a specific font. The third, and final, convert command puts the text image we just generated over the vertical cover we created before, centered. This gives us the static image that we can eventually use in melt. The image below summarizes the process.

Creating the video background

This works because I wanted a static background, but of course it would be terrible if we wanted text moving in the video (e.g., the title coming in from the right, staying there for a while, and exiting on the left): for that we’d need MLT to do the actual job of rendering text and moving it around. But, again, this was too hard for me to do in the little time I devoted to studying the docs and experimenting with it, so convert made it much easier for me to do it quickly.

At any rate, once melt creates the XML file, all we need to do to turn that into an mp4 video file is pass it to melt again with a different consumer (avformat):

melt -consumer avformat target="temp.mp4" "temp.mlt"

Automating the process

At this point, I had all the bricks I needed to create an individual video preview from an existing snippet, using a combination of multiple convert and melt invocations. Since I wanted to create multiple video previews, one for each track, I started scripting those calls in a loop in a dedicated bash script. Of course, I also needed to somehow make the different parameters customizable, rather than hardcode them in the script itself: this included, for instance, the album cover, the font to use, the font size, and the different audio tracks to create video previews for. For each audio track, I also needed some additional details, like the title (what should appear in the video), which second the 15-second audio preview should start from, and what file to save the video preview in.

I decided to create a basic JSON object for the job, which in the case of my Cleopatra EP looks like this:

{
	"cover": "Cleopatra_cover.jpg",
	"font": "Dalek",
	"size": 96,
	"tracks": [
		{
			"title": "I. Overture",
			"audio": "/home/lminiero/ardour/Cleopatra/export/EP/01 - I. Overture.flac",
			"start": 80,
			"output": "cleopatra1"
		},
		{
			"title": "II. Cleopatra's Fanfare",
			"start": 147,
			"audio": "/home/lminiero/ardour/Cleopatra/export/EP/02 - II. Cleopatra's Fanfare.flac",
			"output": "cleopatra2"
		},
		{
			"title": "III. Antony and Cleopatra",
			"start": 0,
			"audio": "/home/lminiero/ardour/Cleopatra/export/EP/03 - III. Antony and Cleopatra.flac",
			"output": "cleopatra3"
		},
		{
			"title": "IV. The Power of Rome",
			"start": 144,
			"audio": "/home/lminiero/ardour/Cleopatra/export/EP/04 - IV. The Power of Rome.flac",
			"output": "cleopatra4"
		},
		{
			"title": "V. Octavian",
			"start": 62,
			"audio": "/home/lminiero/ardour/Cleopatra/export/EP/05 - V. Octavian.flac",
			"output": "cleopatra5"
		},
		{
			"title": "VI. The Battle of Actium",
			"start": 60,
			"audio": "/home/lminiero/ardour/Cleopatra/export/EP/06 - VI. The Battle of Actium.flac",
			"output": "cleopatra6"
		},
		{
			"title": "VII. Aftermath",
			"start": 0,
			"audio": "/home/lminiero/ardour/Cleopatra/export/EP/07 - VII. Aftermath.flac",
			"output": "cleopatra7"
		},
		{
			"title": "VIII. The Asp",
			"start": 0,
			"audio": "/home/lminiero/ardour/Cleopatra/export/EP/08 - VIII. The Asp.flac",
			"output": "cleopatra8"
		},
		{
			"title": "IX. Immortal in Time",
			"start": 47,
			"audio": "/home/lminiero/ardour/Cleopatra/export/EP/09 - IX. Immortal in Time.flac",
			"output": "cleopatra9"
		}
	]
}

As you can see, all the information we need is in the JSON file. At this point, what I need is a bash script that processes this JSON file, to then sequentially create previews for each of the tracks using the commands we’ve seen before. An easy way to process a JSON file in bash is the jq application, which is indeed what I used, so let’s have a look at the final create.sh script I wrote for creating all video previews in a single call:


#!/bin/bash
#~ set -x

ALBUM_COVER=`jq -r '.cover' album.json`
ALBUM_FONT=`jq -r '.font' album.json`
ALBUM_FONT_SIZE=`jq -r '.size' album.json`
ALBUM_TRACKS=`jq -r '.tracks | length' album.json`

convert "$ALBUM_COVER" -resize 1080x1080 -background black \
	-gravity center -extent 1080x1920 cover.jpg

for ((i = 0 ; i < $((ALBUM_TRACKS)) ; i++ )); do
	TRACK_TITLE=`jq -r --arg i $((i)) '.tracks[$i | tonumber].title' album.json`
	TRACK_AUDIO=`jq -r --arg i $((i)) '.tracks[$i | tonumber].audio' album.json`
	TRACK_START=`jq -r --arg i $((i)) '.tracks[$i | tonumber].start' album.json`
	TRACK_OUTPUT=`jq -r --arg i $((i)) '.tracks[$i | tonumber].output' album.json`

	ffmpeg -y -i "$TRACK_AUDIO" -ss "$TRACK_START" -t 15 \
		-f mp3 -map 0:a -map_metadata -1 snippet.mp3

	convert -alpha on -background "#00000088" -fill white -font "$ALBUM_FONT" \
		-pointsize "$ALBUM_FONT_SIZE" -size 1080x200 -gravity Center caption:"$TRACK_TITLE" temp.png
	convert cover.jpg temp.png -gravity center -composite temp.jpg

	# https://stackoverflow.com/questions/42079575/mlt-melt-concatenate-clips-fade-in-fade-out-audio-and-video

	melt -profile vertical_hd_30 \
		-consumer xml resource="$TRACK_OUTPUT.mlt" \
		-audio_track snippet.mp3 \
			-attach-track volume:-70db end=0db in=0 out=29 \
			-attach-track volume:0db end=-70db in=422 out=450 \
			-transition mix in=0 out=450 \
			-filter audiospectrum type=bar bands=40 thickness=20 angle=180 reverse=1 \
				color.1=red color.2=green rect="0 0 1080 400" in=0 out=450 \
			-filter audiospectrum type=bar bands=40 thickness=20 angle=0 \
				color.1=red color.2=green rect="0 1520 1080 400" in=0 out=450 \
		-track colour:black out=29 \
			temp.jpg in=0 out=450 -mix 30 -mixer luma \
			colour:black out=29 -mix 30 -mixer luma \
		-transition composite:"0=0%/0%:100%x100%:100" out=450 a_track=1 b_track=0 sliced_composite=1

	melt -consumer avformat target="$TRACK_OUTPUT.mp4" "$TRACK_OUTPUT.mlt"
done

rm temp.png temp.jpg snippet.mp3
ls -la *.mp4

The script should be relatively easy to read, but in a nutshell:

  1. we read some global info from the JSON file (cover image, font, font size, number of tracks) and store it in some variables;
  2. we create the vertically centered version of the cover image (first convert from the previous section); we only do it once, since it will be the same for all video previews;
  3. then, for each track, we read its info from the JSON file (title, audio file, starting point, target file) to other variables, and start working on the video;
    • we use ffmpeg to create a 15s snippet from the audio file, starting from the provided starting point;
    • we generate an image from the track title and overlay it on the cover (second and third convert from the previous section);
    • we use melt to first create an XML, and then an mp4, which will contain the video preview for the current track;
  4. finally, we clean up temporary files and print the list of generated mp4 files.

The end result will be something like this:

-rw-r--r--. 1 lminiero lminiero 1716067 Apr 11 18:12 cleopatra1.mp4
-rw-r--r--. 1 lminiero lminiero 1635272 Apr 11 18:12 cleopatra2.mp4
-rw-r--r--. 1 lminiero lminiero 1485961 Apr 11 18:12 cleopatra3.mp4
-rw-r--r--. 1 lminiero lminiero 1732716 Apr 11 18:12 cleopatra4.mp4
-rw-r--r--. 1 lminiero lminiero 1545258 Apr 11 18:13 cleopatra5.mp4
-rw-r--r--. 1 lminiero lminiero 1753363 Apr 11 18:13 cleopatra6.mp4
-rw-r--r--. 1 lminiero lminiero 1638812 Apr 11 18:13 cleopatra7.mp4
-rw-r--r--. 1 lminiero lminiero 1500549 Apr 11 18:13 cleopatra8.mp4
-rw-r--r--. 1 lminiero lminiero 1703103 Apr 11 18:13 cleopatra9.mp4

so a video preview for each audio track, named as specified in the JSON file.

Generating the JSON file

Now, that JSON file works, but it’s also a pain (and error prone) to generate manually. As such, especially considering that I plan to re-use this approach for future albums too, I created a short script to automatically generate the JSON file as well, starting from the audio tracks I want to create video previews for. Since we want to create a JSON file, we use jq again.


#!/bin/bash
#~ set -x

if [ "$#" -ne 5 ]; then
	echo "Usage: $0 <coverimage> <fontname> <fontsize> <baseoutput> <musicfolder>"
	exit 1
fi

jq --null-input --tab \
	--arg cover "$1" \
	--arg font "$2" \
	--arg size "$3" \
	'{"cover": $cover, "font": $font, "size": $size | tonumber, "tracks": []}' > test.json

# Counter used to number the output files (e.g., cleopatra1, cleopatra2, ...)
i=0
for x in "$5"/*.flac; do
	i=$((i+1))
	y=`basename "$x" .flac`
	z=`echo "$y" | cut -d' ' -f3-`
	cat test.json | jq --tab \
		--arg title "$z" \
		--arg audio "$x" \
		--arg output "$4$i" \
		'.tracks[.tracks | length] |= . + {"title": $title, "start": 0, "audio": $audio, "output": $output}' > test1.json
	mv test1.json test.json
done

cat test.json

The script expects us to provide some details, since it can’t “guesstimate” (stolen from Saúl! :grin: ) what you really want. As such, we’re expected to provide the path to the cover image, the name of the font we want to use for text, the font size, the base name for target files, and the folder where audio files are stored. To create the JSON file for the Cleopatra EP, for instance, this is how I invoked that script:

./album2json.sh Cleopatra_cover.jpg Dalek 96 cleopatra ~/ardour/Cleopatra/export/EP

This generates a test.json file that looks exactly like the JSON file we’ve seen before, but with all start points for the audio tracks set to 0 (since obviously the script can’t know where you want previews to start from). As such, manually editing that file is still needed if you want some tracks to have their preview start at a different point than the beginning, but it’s still a much quicker process than writing the whole file from scratch.
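For what it’s worth, even that last manual step can be done with jq itself, patching a single track’s start point instead of opening an editor. A small sketch (the inline sample JSON just stands in for the generated test.json):

```shell
#!/bin/bash
# Minimal stand-in for the test.json generated by the script above
cat > test.json << 'EOF'
{
	"tracks": [
		{ "title": "I. Overture", "start": 0 },
		{ "title": "II. Cleopatra's Fanfare", "start": 0 }
	]
}
EOF
# Set the second track's preview to start at 147 seconds (0-based index 1)
jq '.tracks[1].start = 147' test.json > test1.json && mv test1.json test.json
jq '.tracks[1].start' test.json   # prints 147
```

The same temp-file-and-rename dance is needed because jq can’t edit a file in place.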

What’s next?

Well, not much: at least for now that’s pretty much it! One thing I may change in the future is the video fade in/out, which looks cool when you watch the videos, but actually causes issues when you use them as Instagram stories. In fact, for some reason Instagram uses the very first video frame for its story preview, which means that all videos will look black there until you play them. Getting rid of the fade in would ensure the first frame Instagram uses for its story is a proper one, even though it would lack the smoother transition from story to story as you move from track to track.

Another interesting thing to do could be experimenting with different targets for previews, e.g., more square or horizontal previews rather than the vertical ones (e.g., to use in posts on Mastodon or elsewhere). In theory that shouldn’t be too much of an issue, since it will mostly involve choosing a different profile in melt, changing the way we use convert to generate the background image, and fixing the coordinates used for the audio spectrum bars accordingly (which should maybe be laid out horizontally rather than vertically in that case).

Finally, rather than N individual previews, it might also be cool to have a single video file that contains all of them in sequence (which could be easier to share as well, even if it would be a slightly longer file). That’s probably something that can be done more easily ex-post, starting from the individually created video previews and chaining them one after another (e.g., using ffmpeg or, once again, melt).
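For the ffmpeg route, the concat demuxer should be enough, since all the previews share the same codec settings and so can be chained without re-encoding. An untested sketch, using the file names as generated by create.sh above:

```shell
#!/bin/bash
# Build the list file the concat demuxer expects: one "file '...'" line
# per preview, in track order thanks to the numeric suffix.
ls cleopatra*.mp4 2>/dev/null | sed "s/.*/file '&'/" > list.txt
# Then concatenate without re-encoding (this part needs ffmpeg and the
# actual mp4 files around, so it's left commented in this sketch):
#   ffmpeg -f concat -safe 0 -i list.txt -c copy cleopatra-all.mp4
```

Note that -c copy only works because every preview comes out of the same melt profile; mixing resolutions or codecs would require a filter-based concat and a re-encode instead.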

Feedback welcome!

This was just a personal and fun experiment, but of course I’d love to hear feedback if you have any, especially if you know of similar efforts (or existing tools) I could have used instead. Please feel free to reach out on Mastodon for any question or comment related to this!