Scientific Description

Table of contents

  1. the hierarchical structure of scores
  2. a language-based framework
  3. transcription as an optimisation problem
  4. efficiency
  5. related papers

Transcription is the problem of converting a music performance into Common Western music notation. We consider here the symbolic case of the problem, i.e. when the performance in input is a sequence of musical events (notes) with known dates, e.g. a MIDI stream or a MIDI file. This case is important in the context of score edition (when data is entered using a MIDI keyboard) and for corpus captation for musicology.

The library qparse implements an original approach based on techniques of Weighted Tree Automata (in which transitions are assigned a weight in addition to the usual transition labels), and Dynamic Programming algorithms for finding optimal solutions.

the hierarchical structure of scores

A salient feature of music notation of time is its hierarchical structure: events are grouped into bars, tuplets and under beams, and durations are defined proportionally and composed with ties and dots. This observation gave rise in the 90’s to the Rhythm Tree representation of music notation. In such trees, branching represents spliting of time into equal intervals.

The first above tree (a) represents a triplet, i.e. a division of 1 beat into 3 time-intervals of duration 1/3 of beat. Each labels 1 expresses that there is 1 note starting at the beginning of the corresponding interval and spanning through this interval (hence their duration is 1/3 of beat).

In the tree (b), the first leaf is labelled by 2, meaning that there are 2 notes in the first interval of 1/3. The first note is a grace-note (of theoretical duration 0), followeed by a note starting at the beginning of the interval and of duration 1/3.

In (c) there is an initial time-division into 2 sub-interval of duration 1/2 of beat, and the a nested division of the second sub-interval into 2 sub-sub-intervals of duration 1/4 of beat.

In (d) the label 0 for the second leaf expresses that there is no note here, and that the corresponding time-interval is a continuation of the first note. This can be denoted by a tie or a dot.

Finally in (e), there is an additional nested division (in 2) of the last sub-sub-interval.

In our approach to transcription, we use hierarchical formal-language formalisms (such as Tree Automata) as a priori score-models. They enables to infer structured score in one pass, joining the two subproblems of rhythm quantization and score structuring. Sometimes, other transcription methods based on sequential Markov models generate linear representations of scores (e.g. a quantized MIDI file), delegating the task of score structuring to external tools, typically score editors.

a language-based framework

In addition to the performance in input, qparse must be provided a Weighted Tree Automata (WTA), or (as a particular case), a Weighted Context-Free Grammar. This structure acts as an a priori language of rhythm notation. Some weight values, in the rules of the automata (or grammar) indicate preferred rhythms (in terms of tuplets, grace notes etc.).

r1:  q0 -> q1 q2     6
r2:  q0 -> q1 q2 q2 12
r3:  q2 -> q3 q3    10
r4:  q3 -> q4 q4    11 
r5:  q0 -> 0        15
r6:  q0 -> 1         1
r7:  q0 -> 2        79 
r8:  q0 -> 3       102
r9:  q1 -> 0         2
r10: q1 -> 1         1
r11: q1 -> 2        25
r12: q1 -> 3        64
r13: q2 -> 1         2
r14: q3 -> 0         4
r15: q3 -> 1         1
r16: q4 -> 1         1

The above grammar rules contain non-terminal symbols q0-q4 and terminal symbols 0, 1. The non terminal q0 represents a time-interval of one 1/4 bar (containing 1 beat).

The rules r1, r2 define two possible divisions of such a time interval into respectively a duplet and a triplet. The numbers after the rules represent respective complexity values associated to these divisions. In our case, such value is smaller for a duplet (assume more common) than for a triplet. This is relevant in a simple time signature like 1/4. For a time signature like 6/8 the complexity values might be chosen differently.

Further binary divisions of time sub-intervals represented respectively by non terminals q2 and q3 are possible with the rules r3 and r4.

The next rules, aiming at terminal symbols (here natural numbers) indicate the number of notes in the time-intervals, following the conventions for used for the above Rhythm Tree representation: 0 for a continuation, 1 for a single note, 2 for a grace-note followed by a note, 3 for two grace-note followed by a note, etc.

Formally, the domain of weights must be a Semiring, with two operators of product for composing two values, and sum op for choosing a best value. In practice, we use Tropical Semirings in which weight values express the complexity level of rhythm (to minimize), like in the above toy example, or probability domains.

transcription as an optimisation problem

A computation of a WTA is represented by a tree labeled by rules (called parse-tree in the case of a grammar), such that the describe non-terminals in the right-hand-side of a parent node coincide with the non-terminals in the right-hand-side of its children.

The above computation trees are isomorphic to the Rhythm Tree given in the above example. Each of these trees is associated a denotational complexity value (line of blue numbers), according to the given grammar, and computed as the sum of the complexity values of all the rules labeling the tree.

Given an input performance and a language of rhythmic preferences (WTA), our transcription procedures tries to find trees
minimizes its denotational complexity wrt the WTA and maximizes input-output fitness. This is computed by a Dynamic Programming 1-best algorithm, close to Knuth generalization of Dijkstra smallest path algorithm from weighted graphs to weighted hypergraphs.

It is also possible to generate an ordered sequence of the k best solutions, with a k-best algorithm, in order to let the user chose his favorite.

efficiency

Several optimisations on the internal representation of automata (using attributes) and scores (using binary trees for encoding sequences) permitted important efficiency gains, while searching solutions in a large space.

Let us mention two of them. First, we use a compact symbolic representation of WTA where some relation between attributes are provided in intenso in production rules, and generalize the 1-best and k-best algorithms to compute optimal solutions without uncompressing the automata.

Second, we use representation of sequences of bars as binary trees (instead of tuples). This permits to process successive bars monotonically (online), in a greedy way, without having to compute an exponential number of bar sequences.

general description of the transcription framework of qparse.

corpus training weighted grammars for transcription.

description of a related former project for interactive rhythm transcription (see about page).

related models of hierarchical (tree based) models for the representation, transformation and evaluation of rhythm and music notation.