Java sound recognition programming: identifying songs by listening, in Java

Reposted

架构领航员 2023-08-29 11:04:49

Tags: Java sound recognition programming, python, java, artificial intelligence, machine learning. Category: Java, backend development

java tda

In this article, I introduce my approach to music identification, applying ideas from the field of Topological Data Analysis (TDA). TDA is an emerging field of data analysis, and TDA tools have found successful applications in the area of machine learning.

I will first define the research problem and then discuss my approach, dataset and the results.

Background

Have you ever wished to know the name of the song you just heard at a bar? If you’re a music lover like me, then the answer is probably yes. While humans can identify a familiar song in ~2 seconds, teaching a computer to do the same is a difficult task. Currently, audio identification is an active area of research in the Music Information Retrieval community.

Observing recent advances in understanding music through a geometric lens, an approach pioneered by Dmitri Tymoczko in his book “A Geometry of Music”, I asked myself: how can we use the geometry and topology of music to fingerprint songs?

Problem Statement

In this work, I aim to find a representation of songs as a time series of topological fingerprints, with a metric to compare pairs of time-varying shapes. I then develop and test an algorithm that can account for noise distortions of input audio clips to identify music.

Task: for a given input music clip, find its closest match in the database and return the name of the song.

Thus, we first need to find a suitable representation of musical data and extract signatures from it, then build a database of signatures labeled with the names of the songs, and finally apply the search algorithm to new noisy samples.

Theory

The main tool of TDA is persistent homology, a way of computing topological features of data at different scales. In short, the idea is to build filtrations of the data and see how certain homology invariants (Betti numbers) change at different spatial resolutions. Gary Koplik provides a nice introduction to persistent homology that is worth reading. For the purposes of this article, all you need to understand is that persistent homology can be used to measure holes in a geometric object, recording when they appear and when they die, and that this information can be used as a signature of the data.
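
To make "appear and die" concrete, here is a minimal sketch of the simplest case: 0-dimensional persistence (connected components) of the sublevel sets of a plain sequence of numbers. It is a simplification for illustration only, not the Tonnetz pipeline described in this article, and the function name `sublevel_persistence_0d` is hypothetical.

```python
def sublevel_persistence_0d(values):
    """Return sorted (birth, death) pairs for connected components of the
    sublevel sets of a sequence. Components are born at local minima and die
    when they merge into an older component (the "elder rule"). The oldest
    component never dies, so its death is infinity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent = {}   # union-find forest over the indices added so far
    birth = {}    # birth value stored at each index (read at component roots)
    pairs = []

    def find(i):
        # Find the root of i with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:                      # sweep the threshold upwards
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):         # neighbours on the line
            if j not in parent:
                continue
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # The younger component (larger birth value) dies at this level.
            old, young = (ri, rj) if birth[ri] <= birth[rj] else (rj, ri)
            if birth[young] < values[i]:  # skip zero-persistence pairs
                pairs.append((birth[young], values[i]))
            parent[young] = old
    pairs.append((min(values), float("inf")))
    return sorted(pairs)


signal = [0, 2, 1, 3, 0.5]
slowed = [v for v in signal for _ in range(2)]  # each sample held twice
print(sublevel_persistence_0d(signal))
print(sublevel_persistence_0d(slowed))
```

Both calls produce the same list of pairs, which is a small instance of the idea that stretching the time axis leaves this kind of signature unchanged.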

Image by the author.

Approach

Extracting Signatures

I will start by discussing how I extract signatures from music clips. The first step is to produce constant-Q chromagrams from clips of the songs. To produce them, I apply the constant-Q transform, which takes a time series to the frequency domain. This transform is related to the Fourier transform and is well suited to musical data because it outputs amplitude against log frequency, so its frequency bins can be aligned with musical pitches. The constant-Q transform of x[n] is defined as follows:
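
A commonly used form of this definition is sketched below. The symbols Q, N[k], W[k, n], f_k, f_min, f_s, and b are standard notation assumed here rather than taken from this article; b is the number of bins per octave (b = 12 gives one bin per semitone), f_s is the sampling rate, and W[k, n] is a window function of length N[k].

```latex
X[k] = \frac{1}{N[k]} \sum_{n=0}^{N[k]-1} W[k, n]\, x[n]\, e^{-j 2\pi Q n / N[k]},
\qquad
N[k] = \frac{Q f_s}{f_k}, \quad
f_k = f_{\min}\, 2^{k/b}, \quad
Q = \frac{1}{2^{1/b} - 1}
```

Because N[k] shrinks as f_k grows, every bin spans the same number of cycles Q of its own centre frequency, which is what gives the transform its name.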

Definition of the constant-Q transform

An example constant-Q chromagram is shown below.

Example of a constant-Q chromagram. Image by the author.
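
As a rough illustration of how one column of such a chromagram could be computed, the sketch below evaluates a naive constant-Q transform on a single frame and folds the bins into 12 pitch classes. It uses only the standard library and is deliberately slow; the function name, the Hann window, the starting pitch C2, and the four-octave range are all assumptions made for this example, not details taken from the article.

```python
import cmath
import math

C2_HZ = 65.40639132514966  # assumed lowest bin: the pitch C2


def constant_q_chroma(signal, sample_rate, f_min=C2_HZ, octaves=4):
    """Naive constant-Q magnitudes of one frame, folded into 12 pitch
    classes (index 0 = C, 9 = A). The frame should be at least as long as
    the window of the lowest bin, roughly Q * sample_rate / f_min samples."""
    bins_per_octave = 12
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    chroma = [0.0] * 12
    for k in range(octaves * bins_per_octave):
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        # Window length shrinks with frequency so each bin sees ~Q cycles.
        n_k = min(len(signal), int(round(q * sample_rate / f_k)))
        acc = 0j
        for n in range(n_k):
            window = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_k)  # Hann
            acc += window * signal[n] * cmath.exp(-2j * math.pi * f_k * n / sample_rate)
        chroma[k % 12] += abs(acc) / n_k
    return chroma


# Usage: a pure 440 Hz tone (the note A4), half a second at 8 kHz.
rate = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * n / rate) for n in range(rate // 2)]
frame = constant_q_chroma(tone, rate)
print(max(range(12), key=lambda p: frame[p]))
```

For the pure tone in the usage lines, the largest entry lands on pitch class 9 (A), with some leakage into the neighbouring semitones because adjacent constant-Q bins overlap.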

Deforming the Tonnetz

Next, we project the chromagram onto the Tonnetz and deform it by defining a height function. In music theory, the Tonnetz is a lattice diagram that represents pitch space and captures certain harmonic relationships in musical data (see the figures below). It was first described by Leonhard Euler in 1739. When studying harmony, Euler placed notes on a torus such that notes are separated by a perfect fifth in the horizontal direction, by a major third along one diagonal (from left to right), and by a minor third along the other diagonal (from right to left).
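
The interval layout described above can be sketched with a small coordinate formula. The coordinates (i, j) and the function names are assumptions for this example: a step in i moves up a perfect fifth (7 semitones) and a step in j moves up a major third (4 semitones), so the remaining diagonal is a minor third (7 - 4 = 3 semitones).

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def tonnetz_pitch_class(i, j):
    """Pitch class (0 = C) at lattice position (i, j)."""
    return (7 * i + 4 * j) % 12


def tonnetz_patch(cols=4, rows=3):
    """A cols x rows patch of the lattice as rows of note names.
    With the default 4 x 3 size every pitch class appears exactly once."""
    return [[NOTE_NAMES[tonnetz_pitch_class(i, j)] for i in range(cols)]
            for j in range(rows)]


for row in reversed(tonnetz_patch()):
    print("  ".join(f"{name:<2}" for name in row))
```

Because the formula works modulo 12, walking far enough in any direction returns to the starting note, which is why the lattice wraps around into a torus.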

Image source: Wikipedia.

The space of pitch classes is R/12Z: we identify notes that are an octave apart, which leaves 12 semitone pitch classes in total. We place each pitch class onto the Tonnetz as a vertex. We then define a height function h : V → R on the set V of vertices (each vertex is a pitch class) by assigning to each vertex a height equal to the amplitude of that pitch class in the music clip. After that, the 2-dimensional Tonnetz is deformed by taking the values of the height function as its third dimension. Below is an example of the Tonnetz deformed by a 2-second blues clip.
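
One way to hold this object in code is as a small triangulated torus whose 12 vertices are the pitch classes and whose triangles are the major and minor triads, with the height function stored as a dictionary. This is a minimal sketch under those assumptions; the function names and the normalisation by the peak amplitude are choices made for this example and are not taken from the article.

```python
def tonnetz_complex():
    """Vertices, edges and triangles of the Tonnetz as a simplicial complex.
    Triangles are the 12 major triads {p, p+4, p+7} and the 12 minor triads
    {p, p+3, p+7}; edges are the sides of those triangles."""
    vertices = list(range(12))
    triangles = set()
    for p in vertices:
        triangles.add(frozenset({p, (p + 4) % 12, (p + 7) % 12}))  # major
        triangles.add(frozenset({p, (p + 3) % 12, (p + 7) % 12}))  # minor
    edges = set()
    for tri in triangles:
        a, b, c = sorted(tri)
        edges.update({frozenset({a, b}), frozenset({a, c}), frozenset({b, c})})
    return vertices, edges, triangles


def height_function(chroma):
    """Map each pitch class to its amplitude, scaled so the peak is 1.0
    (the scaling is an assumption of this sketch)."""
    peak = max(chroma) or 1.0
    return {p: chroma[p] / peak for p in range(12)}


vertices, edges, triangles = tonnetz_complex()
# Euler characteristic V - E + F; it is 0 for a torus.
print(len(vertices) - len(edges) + len(triangles))

# A made-up chroma vector with energy on C, E and G (a C major chord).
chroma = [1.0, 0, 0, 0, 0.8, 0, 0, 0.6, 0, 0, 0, 0]
heights = height_function(chroma)
print(heights[0], heights[4], heights[7])
```

The Euler characteristic check is a quick sanity test that the triangulation really closes up into a torus rather than a flat patch.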

Deformed Tonnetz (2 seconds of blues). Image by the author.

Persistence Diagrams

After finding a way to represent musical data as a topological object, I apply the upper-star filtration to the deformed Tonnetz to produce persistence diagrams.

Persistence diagrams of three 2-second clips. Image by the author.

In general, persistent homology is useful because, for example, changing the speed of a recording does not change persistence diagrams based on the lower-star filtration very much (so we can identify DJ remixes of a song). The figure below illustrates this fact. Given a simplicial complex K and a real-valued function f defined on its vertices, the lower-star filtration is defined by K_a = {σ ∈ K : max_{v∈σ} f(v) ≤ a}.

Reparameterizing a time series does not change the original persistence diagram. Image by the author.
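
The filtration definition can be sketched directly: a simplex enters the lower-star filtration at the largest value of f among its vertices. The code below shows this on a single filled triangle; the function names are hypothetical, and the upper-star variant follows the usual convention of using the minimum and superlevel sets, which is an assumption here because the article does not spell it out.

```python
def lower_star_value(simplex, f):
    """Level at which a simplex enters the lower-star filtration."""
    return max(f[v] for v in simplex)


def upper_star_value(simplex, f):
    """Level at which a simplex enters the upper-star filtration
    (assumed convention: superlevel sets, so the minimum vertex value)."""
    return min(f[v] for v in simplex)


def lower_star_complex(simplices, f, a):
    """K_a: all simplices whose vertices all have f(v) <= a."""
    return [s for s in simplices if lower_star_value(s, f) <= a]


# A filled triangle on vertices 0, 1, 2 with made-up heights.
simplices = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
f = {0: 0.1, 1: 0.5, 2: 0.9}

for a in (0.1, 0.5, 0.9):
    print(a, lower_star_complex(simplices, f, a))
```

As the threshold a increases, the complex grows from a single vertex to an edge and finally to the full triangle; persistent homology tracks how components and holes change along that growth.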

Matching

After producing the persistence diagrams, I calculate the bottleneck distances between them to find the closest match for a given song. The bottleneck distance is a metric on the space of persistence diagrams, defined as follows:
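
In the notation of [1], for two persistence diagrams X and Y (each understood to include every point of the diagonal with infinite multiplicity), the definition can be written as below. The symbol η for the matching is the conventional choice and is assumed here.

```latex
d_B(X, Y) = \inf_{\eta : X \to Y} \; \sup_{x \in X} \; \lVert x - \eta(x) \rVert_{\infty}
```

Here η ranges over all bijections between X and Y, so the distance is the cost of the best possible matching, measured by its single worst-matched pair of points.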

Under mild assumptions on the function, the authors of [1] showed that the persistence diagram of a function on a topological space is stable: small changes in the input function lead to only small changes in the diagram, as measured by the bottleneck distance. Hence, the bottleneck distance tells us how similar two persistence diagrams are.
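
For very small diagrams the bottleneck distance can be computed by brute force, which makes the matching step easy to follow. The sketch below tries every matching, so it is only usable for a handful of points; it is not the algorithm used in the article's notebook. The names `bottleneck_distance` and `identify`, the toy database, and the simplification of one diagram per song are all assumptions made for this example.

```python
from itertools import permutations


def bottleneck_distance(diag_a, diag_b):
    """Brute-force bottleneck distance between two small diagrams given as
    lists of finite (birth, death) pairs. Each diagram is padded with
    diagonal slots (None) so that any point may be matched to the diagonal."""

    def cost(p, q):
        if p is None and q is None:
            return 0.0                       # diagonal to diagonal
        if p is None:
            return (q[1] - q[0]) / 2.0       # L-infinity distance to diagonal
        if q is None:
            return (p[1] - p[0]) / 2.0
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    left = list(diag_a) + [None] * len(diag_b)
    right = list(diag_b) + [None] * len(diag_a)
    best = float("inf")
    for perm in permutations(right):
        worst = max((cost(p, q) for p, q in zip(left, perm)), default=0.0)
        best = min(best, worst)
    return best


def identify(query, database):
    """Name of the database entry whose diagram is closest to the query."""
    return min(database, key=lambda name: bottleneck_distance(query, database[name]))


# Toy database with made-up diagrams, one per song.
database = {
    "song_a": [(0.0, 1.0), (0.2, 0.5)],
    "song_b": [(0.0, 0.3)],
}
noisy_query = [(0.05, 0.95), (0.2, 0.55)]
print(identify(noisy_query, database))
```

The factorial number of matchings tried here is one way to see why the distance computation becomes the bottleneck of the whole method; practical libraries use much faster matching algorithms.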

Dataset

The dataset consisted of fifty 30-second songs from 10 genres (5 songs per genre).

The genres are:

blues
classical
country
disco
EDM
jazz
punk rock
pop
rap
rock

Results

As mentioned above, I compiled a database of 50 songs from 10 genres, with 5 songs per genre. As input to the algorithm, I use .WAV files containing clips of songs that are 1, 2, 3, 4, or 5 seconds long. To test the algorithm, I added noise drawn from a normal distribution so that the clips have a signal-to-noise ratio (SNR) of 10, 20, or 30. I then ran the algorithm to identify music within each genre. The accuracy as a function of clip duration for each SNR is reported in the figure below. According to [2], Shazam computes ~50,000–250,000 fingerprints per song. I compute 1 persistence diagram per clip, so I produce ~10,000–20,000 signatures per song.

Accuracy as a function of clip duration for a given SNR. Image by the author.
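
The noise-injection step used in these experiments can be sketched as follows. The article does not state the unit of the SNR values, so this example assumes decibels; the function names, the fixed random seed, and the rescaling of the noise to hit the target exactly are likewise assumptions of this sketch.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Return the signal plus zero-mean Gaussian noise, with the noise
    rescaled so the result has exactly the requested SNR in decibels."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in decibels of a noisy signal relative to its clean version."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# Usage: a short synthetic tone stands in for a music clip.
clean = [math.sin(2.0 * math.pi * 5.0 * n / 1000.0) for n in range(1000)]
for snr in (10, 20, 30):
    noisy = add_gaussian_noise(clean, snr)
    print(snr, round(measured_snr_db(clean, noisy), 3))
```

Lower SNR values mean louder noise relative to the music, so under this convention the SNR = 10 setting is the hardest of the three.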

Discussion

1) Though the algorithm performs well, calculating the bottleneck distance is computationally expensive.

2) The height function we defined above is the simplest one. We could also use discrete Gaussian curvature as our height function to reflect different geometries of the Tonnetz. Furthermore, we can capture additional harmonic information using a consonance function as our height function, which would give information on tensions/resolutions of chords in a clip.

3) While in this article I only used the vertical structure of music (i.e. pitch classes), it is important to consider the horizontal structure as well, i.e. the progression of chords over time, tempo, etc. (after all, the order in which we hear the chords and notes matters). Additionally, the rhythmic part of musical pieces is important, and we could capture this additional information using an audio novelty function.

Code

The Python code is available in a Google Colab notebook here.

References

[1] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37:103–120, 2007.

[2] Avery Wang. An industrial strength audio search algorithm. 2003.

Translated from: https://towardsdatascience.com/tda-an-original-way-to-identify-music-86823cbfee85
