
Leveraged Dynamic Time Warping (DTW) to assess the similarity between specific audio tracks


Balajirvp/Dynamic-Time-Warping


Dynamic Time Warping


Dynamic Time Warping (DTW) measures the similarity between two temporal sequences while accommodating differences in their speeds. In this project, I use DTW to compare three audio tracks: "My Flame" by Bobby Caldwell (Song 1), "Sky's the Limit" by The Notorious B.I.G. (Song 2), and "Take Five" by Dave Brubeck (Song 3).

Song 2 samples Song 1, while Song 3 is unrelated to the other two, so a useful similarity measure should rate Songs 1 and 2 as closer to each other than either is to Song 3.

DTW is applied only to the instrumental versions of the songs. To keep the computation manageable, the analysis covers only the first 30 seconds of each track. From these segments, I compute similarity scores that quantify how closely the songs resemble one another.

1) Converting the .mp4 files into .wav files

In this section, I extract the audio from the downloaded .mp4 video files and convert it to the widely used .wav format. Each .mp4 file yields a corresponding .wav file containing only the audio, which is the input to the DTW analysis that follows.
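One way to perform this kind of conversion is to call the ffmpeg command-line tool from Python. The sketch below is an assumption for illustration, not necessarily the method used in this project: the function names and the example file name `my_flame.mp4` are hypothetical, the 22050 Hz mono output is an arbitrary choice, and ffmpeg must be installed separately.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(mp4_path, wav_path=None, sample_rate=22050):
    """Build an ffmpeg command that extracts the audio track as a .wav file.

    If wav_path is omitted, the output sits next to the input with a
    .wav extension. (Hypothetical helper, for illustration only.)
    """
    src = Path(mp4_path)
    dst = Path(wav_path) if wav_path else src.with_suffix(".wav")
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(src),          # input video file
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM, the usual .wav encoding
        "-ar", str(sample_rate), # output sample rate in Hz (assumed value)
        "-ac", "1",              # downmix to mono (assumed choice)
        str(dst),
    ]


def convert_mp4_to_wav(mp4_path, wav_path=None, sample_rate=22050):
    """Run ffmpeg and return the path of the .wav file it wrote."""
    cmd = build_ffmpeg_command(mp4_path, wav_path, sample_rate)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[-1])
```

Calling `convert_mp4_to_wav("my_flame.mp4")` would then write `my_flame.wav` alongside the input, provided ffmpeg is on the system path.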

image

image

2) Loading the .wav files and visualizing their audio frequencies

A Mel spectrogram (Mel-frequency spectrogram) represents an audio signal's frequency content over time on the Mel scale, which spaces frequencies according to how humans perceive pitch. It is widely used in speech and audio processing. I use the librosa package to visualize the three songs as log Mel spectrograms. Taking the logarithm of the Mel spectrogram approximates the roughly logarithmic way the human auditory system perceives loudness, so quieter components stay visible next to louder ones and the plot better reflects what a listener actually hears.
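The two transforms behind a log Mel spectrogram can be sketched in a few lines. The first two functions below are plain-Python illustrations (the Mel formula shown is the common HTK variant; librosa's default uses a slightly different Slaney-style scale). The third function is a hedged sketch of how the same result might be obtained with librosa's `load`, `feature.melspectrogram`, and `power_to_db`; its name, the 128 Mel bands, and the 30-second default are assumptions, not values taken from this project.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def power_to_db(power, ref=1.0, floor=1e-10):
    """Convert a power value to decibels relative to `ref`.

    `floor` avoids taking the logarithm of zero for silent frames.
    """
    return 10.0 * math.log10(max(power, floor) / ref)


def log_mel_spectrogram(wav_path, duration=30.0, n_mels=128):
    """Sketch: load a clip with librosa and return its log Mel spectrogram.

    librosa and numpy are imported here so the helpers above remain
    usable without them. Parameter values are illustrative assumptions.
    """
    import librosa
    import numpy as np

    # Load only the first `duration` seconds at the file's native rate.
    y, sr = librosa.load(wav_path, sr=None, duration=duration)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # Express power in dB relative to the loudest bin.
    return librosa.power_to_db(mel, ref=np.max), sr
```

The Mel scale is close to linear at low frequencies and compresses higher ones, while the decibel conversion compresses large power differences: a hundredfold increase in power becomes a 20 dB step.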

image

image

image

3) Performing DTW

DTW yields an alignment cost between two songs, which indicates how similar they are. Because DTW can align sequences that unfold at different speeds, it is well suited to comparing audio tracks.

To make the comparison meaningful, I focus on the normalized alignment cost: the alignment cost normalized with respect to the lengths of the sequences being compared. Longer sequences accumulate more cost simply because they contain more frames, so normalization gives a standardized measure that allows fair comparisons regardless of sequence length.

A lower normalized alignment cost indicates greater similarity between two songs; a higher normalized alignment cost indicates greater dissimilarity.
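To make the idea of an alignment cost concrete, here is a minimal textbook DTW in plain Python. It is a sketch for illustration, not the implementation used in this project: the function names are hypothetical, the default distance is the absolute difference between scalars, and dividing by the sum of the two sequence lengths is only one common normalization choice, assumed here.

```python
import math


def dtw_cost(a, b, dist=None):
    """Return the total cost of the cheapest DTW alignment of a and b.

    `dist` is the per-element distance; it defaults to the absolute
    difference, which suits sequences of plain numbers.
    """
    if dist is None:
        dist = lambda x, y: abs(x - y)
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # acc[i][j] = cheapest cost of aligning a[:i] with b[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            acc[i][j] = step + min(
                acc[i - 1][j],      # a advances while b holds
                acc[i][j - 1],      # b advances while a holds
                acc[i - 1][j - 1],  # both advance together
            )
    return acc[n][m]


def normalized_dtw_cost(a, b, dist=None):
    """Alignment cost divided by len(a) + len(b) (assumed normalization)."""
    return dtw_cost(a, b, dist) / (len(a) + len(b))


# The same shape played at half speed aligns at zero cost.
print(dtw_cost([1, 2, 3], [1, 1, 2, 2, 3, 3]))
# Two constant sequences one unit apart.
print(normalized_dtw_cost([0, 0, 0], [1, 1, 1]))
```

For audio, each element would be a spectrogram frame (a feature vector) rather than a single number, with `dist` replaced by a vector distance such as Euclidean. The table grows with the product of the two sequence lengths, which is why restricting the analysis to short clips keeps the computation tractable. librosa also offers `librosa.sequence.dtw` for this purpose.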

  • Alignment between Song 1 and 2

image

  • Alignment between Song 1 and 3

image

  • Alignment between Song 2 and 3

image

The alignment cost between Songs 1 and 2 is notably lower than the cost between Songs 1 and 3 or between Songs 2 and 3, indicating that Songs 1 and 2 are the most similar pair. This matches the expectation, since Song 2 samples Song 1, and it suggests that DTW captures the musical resemblance between the two tracks.