The primary way of providing real-time speech-to-text captioning for hard of hearing people is to employ expensive professional stenographers who can type as fast as natural speaking rates. Recent work has shown that a feasible alternative is to combine the partial captions of ordinary typists, each of whom is able to type only part of what they hear. In this paper, we extend the state-of-the-art fixed-window alignment algorithm [] for combining the individual captions into a final output sequence. Our method performs alignment on a sliding window of the input sequences, drastically reducing both the number of errors and the latency of the system to the end user compared with previously published approaches.
Real-time captioning provides deaf or hard of hearing people access to speech in mainstream classrooms, at public events, and on live television. Studies performed in the classroom setting show that the latency between when a word is said and when it is displayed must be under five seconds to maintain consistency between the captions being read and other visual cues []. The most common approach to real-time captioning is to recruit a trained stenographer with a special-purpose phonetic keyboard, who transcribes the speech to text with less than five seconds of latency. Unfortunately, professional captionists are quite expensive ($150 per hour), must be recruited in blocks of an hour or more, and are difficult to schedule on short notice. Automatic speech recognition (ASR) systems [], on the other hand, attempt to provide a cheap and fully automated solution to this problem. However, the accuracy of ASR quickly plummets to below 30% when used on an untrained speaker’s voice, in a new environment, or in the absence of a high-quality microphone []. The accuracy of ASR systems can be improved using the ‘respeaking’ technique, in which a person on whose voice the ASR has been trained repeats the words said by a speaker as they hear them. Simultaneously hearing and speaking, however, is not straightforward and requires some training.
An alternative approach is to combine the efforts of multiple non-expert captionists (anyone who can type), instead of relying on trained workers []. In this approach, multiple non-expert human workers transcribe an audio stream containing speech in real time. Workers type as much as they can of the input, and, while no one worker’s transcript is complete, the portions captured by various workers tend to overlap. For each input word, a timestamp is recorded, indicating when the word was typed by a worker. The partial inputs are combined to produce a final transcript (see Figure 1). This approach has been shown to dramatically outperform ASR in terms of both accuracy and Word Error Rate (WER) []. Furthermore, recall of individual words irrespective of their order approached and even exceeded that of a trained expert stenographer with seven workers contributing, suggesting that the information is present to meet the performance of a stenographer []. However, aligning these individual words in the correct sequential order remains a challenging problem.
Lasecki et al. [] addressed this alignment problem using off-the-shelf multiple sequence alignment tools, as well as an algorithm based on incrementally building a precedence graph over output words. Improved results for the alignment problem were shown using weighted A${}^{*}$ search by Naim et al. []. To speed the search for the best alignment, Naim et al. [] divided sequences into chunks of a fixed time duration, and applied the A${}^{*}$ alignment algorithm to each chunk independently. Although this method speeds the search for the best alignment, it introduces a significant number of errors to the output of the system due to inconsistency at the boundaries of the chunks. In this paper, we introduce a novel sliding window technique which avoids the errors produced by previous systems at the boundaries of the chunks used for alignment. This technique produces dramatically fewer errors for the same amount of computation time.
The problem of aligning and combining multiple transcripts can be mapped to the well-studied Multiple Sequence Alignment (MSA) problem []. Let ${S}_{1},\mathrm{\dots},{S}_{K}$, $K\ge 2$, be $K$ sequences over an alphabet $\mathrm{\Sigma}$, with lengths ${N}_{1},\mathrm{\dots},{N}_{K}$. For the caption alignment task, we treat each individual word as a symbol in our alphabet $\mathrm{\Sigma}$. The special gap symbol ‘$-$’ represents a missing word and does not belong to $\mathrm{\Sigma}$. Let $A=\left({a}_{ij}\right)$ be a $K\times {N}_{f}$ matrix, where ${a}_{ij}\in \mathrm{\Sigma}\cup \{-\}$, and the ${i}^{th}$ row has exactly $\left({N}_{f}-{N}_{i}\right)$ gaps and is identical to ${S}_{i}$ if we ignore the gaps. Every column of $A$ must have at least one non-gap symbol. Therefore, the ${j}^{th}$ column of $A$ indicates an alignment state for the ${j}^{th}$ position, where the state can have one of the ${2}^{K}-1$ possible combinations. Our goal is to find the optimum alignment matrix ${A}_{OPT}$ that minimizes the sum-of-pairs (SOP) cost function:
$$c\left(A\right)=\sum _{1\le i\le j\le K}c\left({A}_{ij}\right)$$  (1) 
where $c\left({A}_{ij}\right)$ is the cost of the pairwise alignment between ${S}_{i}$ and ${S}_{j}$ according to $A$. Formally, $c\left({A}_{ij}\right)={\sum}_{l=1}^{{N}_{f}}\mathrm{sub}\left({a}_{il},{a}_{jl}\right)$, where $\mathrm{sub}\left({a}_{il},{a}_{jl}\right)$ denotes the cost of substituting ${a}_{jl}$ for ${a}_{il}$. If ${a}_{il}$ and ${a}_{jl}$ are identical, the substitution cost is zero. The substitution cost for two words is estimated from the edit distance between them. Solving the SOP optimization problem exactly is NP-complete [], but many methods solve it approximately. Our approach is based on weighted A${}^{*}$ search for approximately solving the MSA problem [].
\begin{algorithm}[t]
\begin{algorithmic}[1]
\REQUIRE $K$ input sequences $\mathcal{S}=\{{S}_{1},\dots,{S}_{K}\}$ having length ${N}_{1},\dots,{N}_{K}$, heuristic weight $w$, beam size $b$
\REQUIRE $start\in{\mathbb{N}}^{K}$, $goal\in{\mathbb{N}}^{K}$
\ENSURE an $N\times K$ matrix of integers indicating the index into each input sequence of each position in the output sequence
\STATE $g(start)\leftarrow 0$, $f(start)\leftarrow w\times h(start)$
\STATE $Q\leftarrow\{start\}$
\WHILE{$Q\ne\varnothing$}
\STATE $n\leftarrow$ EXTRACT-MIN($Q$)
\FORALL{$s\in{\{0,1\}}^{K}\setminus\{{0}^{K}\}$}
\STATE ${n}_{i}\leftarrow n+s$
\IF{${n}_{i}=goal$}
\STATE Return the alignment matrix for the reconstructed path from $start$ to ${n}_{i}$
\ELSIF{${n}_{i}\notin Beam(b)$}
\STATE continue;
\ELSE
\STATE $g({n}_{i})\leftarrow g(n)+c(n,{n}_{i})$
\STATE $f({n}_{i})\leftarrow g({n}_{i})+w\times h({n}_{i})$
\STATE INSERT-ITEM($Q,{n}_{i},f({n}_{i})$)
\ENDIF
\ENDFOR
\ENDWHILE
\end{algorithmic}
\end{algorithm}
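The search above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the gap cost of 1.0, the length-normalized edit distance used for `sub`, and all function names are assumptions made for illustration. The beam constraint is omitted, and the search stops when the goal is extracted from the queue.

```python
import heapq
from itertools import combinations, product

GAP = "-"  # gap symbol; assumed not to occur as a word


def edit_distance(a, b):
    """Character-level Levenshtein distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sub(x, y):
    """Substitution cost. The gap cost and the normalization are assumptions."""
    if x == y:
        return 0.0
    if x == GAP or y == GAP:
        return 1.0
    return edit_distance(x, y) / max(len(x), len(y))


def pairwise_suffix_costs(s, t):
    """table[p][q] = optimal pairwise cost of aligning s[p:] with t[q:]."""
    n, m = len(s), len(t)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for p in range(n, -1, -1):
        for q in range(m, -1, -1):
            options = []
            if p < n and q < m:
                options.append(sub(s[p], t[q]) + table[p + 1][q + 1])
            if p < n:
                options.append(sub(s[p], GAP) + table[p + 1][q])
            if q < m:
                options.append(sub(GAP, t[q]) + table[p][q + 1])
            if options:
                table[p][q] = min(options)
    return table


def msa_astar(seqs, w=1.0):
    """Weighted A* over the K-dimensional lattice; returns (cost, columns)."""
    k = len(seqs)
    goal = tuple(len(s) for s in seqs)
    pairs = list(combinations(range(k), 2))
    tables = {(i, j): pairwise_suffix_costs(seqs[i], seqs[j]) for i, j in pairs}

    def h(n):  # sum of optimal pairwise suffix alignment costs
        return sum(tables[i, j][n[i]][n[j]] for i, j in pairs)

    start = (0,) * k
    best_g = {start: 0.0}
    parent = {}
    queue = [(w * h(start), start)]
    steps = [s for s in product((0, 1), repeat=k) if any(s)]  # 2^K - 1 moves
    while queue:
        _, n = heapq.heappop(queue)
        if n == goal:
            columns = []
            while n != start:  # walk back along the best path
                n, col = parent[n]
                columns.append(col)
            return best_g[goal], columns[::-1]
        for s in steps:
            nxt = tuple(a + b for a, b in zip(n, s))
            if any(x > limit for x, limit in zip(nxt, goal)):
                continue
            col = tuple(seqs[i][n[i]] if s[i] else GAP for i in range(k))
            g = best_g[n] + sum(sub(col[i], col[j]) for i, j in pairs)
            if g < best_g.get(nxt, float("inf")):
                best_g[nxt] = g
                parent[nxt] = (n, col)
                heapq.heappush(queue, (g + w * h(nxt), nxt))
    return None
```

With `w=1.0` the heuristic is a lower bound and the returned cost is the minimum SOP cost under the assumed `sub`; larger values of `w` make the search greedier.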
\begin{algorithm}[t]
\begin{algorithmic}[1]
\REQUIRE $K$ input sequences $\mathcal{S}=\{{S}_{1},\dots,{S}_{K}\}$ having length ${N}_{1},\dots,{N}_{K}$, window parameter $chunk\_length$.
\STATE $start\_time\leftarrow 0$
\WHILE{$goal\prec [{N}_{1},\dots,{N}_{K}]$}
\FORALL{$i$}
\STATE $start[i]\leftarrow closest\_word(i,start\_time)$
\ENDFOR
\STATE $end\_time\leftarrow start\_time+chunk\_length$
\FORALL{$i$}
\STATE $goal[i]\leftarrow closest\_word(i,end\_time)-1$
\ENDFOR
\STATE $alignmatrix\leftarrow$ MSA-A${}^{*}(start,goal)$
\STATE concatenate $alignmatrix$ onto end of $finalmatrix$
\STATE $start\_time\leftarrow end\_time$
\ENDWHILE
\STATE Return $finalmatrix$
\end{algorithmic}
\end{algorithm}
The problem of minimizing the SOP cost function for $K$ sequences is equivalent to estimating the shortest path between a single source node and a single sink node in a $K$-dimensional mesh graph, where each node corresponds to a distinct position in the $K$ sequences. The source node is $\left[0,\mathrm{\dots},0\right]$ and the sink node is $\left[{N}_{1},\mathrm{\dots},{N}_{K}\right]$. The total number of nodes in the lattice is $\left({N}_{1}+1\right)\times \left({N}_{2}+1\right)\times \mathrm{\cdots}\times \left({N}_{K}+1\right)$, and each node has ${2}^{K}-1$ possible successors and predecessors. The A${}^{*}$ search algorithm treats each node position $n=\left[{n}_{1},\mathrm{\dots},{n}_{K}\right]$ as a search state, and estimates the cost function $g\left(n\right)$ and the heuristic function $h\left(n\right)$ for each state. The cost function $g\left(n\right)$ represents the exact minimum SOP cost to align the $K$ sequences from the beginning to the current position. The heuristic function represents the approximate minimum cost of aligning the suffixes of the $K$ sequences, starting after the current position $n$. The commonly used heuristic function is ${h}_{pair}\left(n\right)$:
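$${h}_{pair}\left(n\right)=L\left(n\to t\right)=\sum _{1\le i<j\le K}c\left({A}_{p}^{*}\left({\sigma}_{i}^{n},{\sigma}_{j}^{n}\right)\right)$$  (2) 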
where $L(n\to t)$ denotes the lower bound on the cost of the shortest path from $n$ to destination $t$, ${A}_{p}^{*}$ is the optimal pairwise alignment, and ${\sigma}_{i}^{n}$ is the suffix of node $n$ in the $i$th sequence. The weighted A${}^{*}$ search uses a priority queue $Q$ to store the search states $n$. At each step of the A${}^{*}$ search algorithm, the node with the smallest evaluation function, $f\left(n\right)=g\left(n\right)+w{h}_{pair}\left(n\right)$ (where $w\ge 1$), is extracted from the priority queue $Q$ and expanded by one edge. The search continues until the goal node is extracted from $Q$. To further speed up the search, a beam constraint is applied to the search space using the timestamps of the individual input words. If the beam size is set to $b$ seconds, then any state that aligns two words whose timestamps differ by more than $b$ seconds is ignored. The detailed procedure is shown in Algorithm 1. After the alignment, the captions are combined via majority voting at each position of the alignment matrix. We ignore the alignment columns where the majority vote is below a certain threshold ${t}_{v}$ (typically ${t}_{v}=2$), and thus filter out spurious errors and spelling mistakes.
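The majority-voting step can be sketched as follows. This is a minimal illustration, not the authors' code: the gap marker `"-"`, the function name, and the tie-breaking rule (first word encountered wins) are assumptions.

```python
from collections import Counter

GAP = "-"  # assumed gap marker in the alignment matrix


def majority_vote(columns, threshold=2):
    """Keep the most frequent word of each column if it has >= threshold votes."""
    output = []
    for column in columns:
        counts = Counter(word for word in column if word != GAP)
        if not counts:
            continue
        word, votes = counts.most_common(1)[0]
        if votes >= threshold:  # columns below the threshold t_v are dropped
            output.append(word)
    return output
```

For example, a column containing `quick`, a gap, and the misspelling `quik` has no word with two votes, so it is filtered out at the typical threshold of 2.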
Although weighted A${}^{*}$ significantly speeds the search for the best alignment, it is still too slow for very long sequences. For this reason, Naim et al. [] divided the sequences into chunks of a fixed time duration, and applied the A${}^{*}$ alignment algorithm to each chunk independently. The chunks were concatenated to produce the final output sequence, as shown in Algorithm 2.
The fixed window based alignment has two key limitations. First, aligning disjoint chunks independently tends to introduce a large number of errors at the boundary of each chunk. This is because the chunk boundaries are defined with respect to the timestamps associated with each word in the captions, but the timestamps can vary greatly between words that should in fact be aligned. After all, if the timestamps corresponded precisely to the original time at which each word was spoken, the entire alignment problem would be trivial. The fact that the various instances of a single word in each transcription may fall on either side of a chunk boundary leads to errors where a word is either duplicated in the final output for more than one chunk, or omitted entirely. This problem also causes errors in ordering among the words remaining within one chunk, because there is less information available to constrain the ordering relations between transcriptions. Second, the fixed window alignment algorithm requires longer chunks ($\ge $ 10 seconds) to obtain reasonable accuracy, and thus introduces unsatisfactory latency.
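The boundary effect can be seen in a small sketch of fixed-window chunking. The code below only splits timestamped transcripts into disjoint chunks; the alignment of each chunk is left out. The function name and the interpretation of a chunk as the half-open time interval from `start_time` to `end_time` are assumptions.

```python
from bisect import bisect_left


def fixed_window_chunks(transcripts, chunk_length):
    """Split per-worker lists of (timestamp, word) into disjoint time chunks."""
    times = [[t for t, _ in tr] for tr in transcripts]
    last = max((ts[-1] for ts in times if ts), default=0.0)
    chunks = []
    start_time = 0.0
    while start_time <= last:
        end_time = start_time + chunk_length
        chunk = []
        for tr, ts in zip(transcripts, times):
            lo = bisect_left(ts, start_time)  # first word at or after start_time
            hi = bisect_left(ts, end_time)    # first word at or after end_time
            chunk.append([word for _, word in tr[lo:hi]])
        chunks.append(chunk)
        start_time = end_time
    return chunks
```

If one worker types `quick` at 4.8 seconds and another at 5.2 seconds, a 5-second window places the two instances in different chunks, so they can never be aligned with each other.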
In order to address the problems described above, we explore a technique based on a sliding alignment window, shown in Algorithm 3. We start by aligning a chunk of fixed size. After aligning the first chunk, we use the information derived from the alignment to determine where the next chunk should begin within each transcription. We use a single point in the aligned output as the starting point for the next chunk, and determine the corresponding starting position within each original transcription. This single point is determined by a tunable parameter $keep\mathrm{\_}length$ (line 10 of Algorithm 3). The material in the output alignment that follows this point is discarded and replaced with the output produced by aligning the next chunk starting from this point (line 8). The process continues iteratively, allowing us to avoid using the erroneous output alignments in the neighborhood of the arbitrary endpoints of each chunk.
\begin{algorithm}[t]
\begin{algorithmic}[1]
\REQUIRE $K$ input sequences $\mathcal{S}=\{{S}_{1},\dots,{S}_{K}\}$ having length ${N}_{1},\dots,{N}_{K}$, window parameters $chunk\_length$ and $keep\_length$.
\STATE $start\leftarrow {0}^{K}$, $goal\leftarrow {0}^{K}$
\WHILE{$goal\prec [{N}_{1},\dots,{N}_{K}]$}
\STATE $endtime\leftarrow chunk\_length+{\max}_{i}\,time(start[i])$
\FORALL{$i$}
\STATE $goal[i]\leftarrow closest\_word(i,endtime)$
\ENDFOR
\STATE $alignmatrix\leftarrow$ MSA-A${}^{*}(start,goal)$
\STATE concatenate first $keep\_length$ columns of $alignmatrix$ onto end of $finalmatrix$
\FORALL{$i$}
\STATE $start[i]\leftarrow alignmatrix[keep\_length][i]$
\ENDFOR
\ENDWHILE
\STATE Return $finalmatrix$
\end{algorithmic}
\end{algorithm}
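The control flow of the sliding window can be sketched in Python under several assumptions: the aligner is passed in as a callback that returns the lattice nodes visited after `start`, the keep-length is treated as a fraction of the columns of each chunk alignment, and the final chunk is kept in full. `lockstep_align` is a hypothetical stand-in for the A${}^{*}$ aligner, used only to make the sketch runnable.

```python
from bisect import bisect_left


def lockstep_align(start, goal):
    """Toy aligner (hypothetical): advance every sequence by one word per column."""
    path, node = [], start
    while node != goal:
        node = tuple(min(p + 1, g) for p, g in zip(node, goal))
        path.append(node)
    return path


def sliding_window(times, align, chunk_length, keep_fraction):
    """times[i][p] is the timestamp of word p from worker i (sorted ascending)."""
    end = tuple(len(ts) for ts in times)
    start = (0,) * len(times)
    final_path = []
    while start != end:
        # The chunk ends chunk_length seconds after the latest starting word.
        start_times = [ts[p] for ts, p in zip(times, start) if p < len(ts)]
        end_time = max(start_times) + chunk_length
        goal = tuple(max(p, bisect_left(ts, end_time))
                     for ts, p in zip(times, start))
        path = align(start, goal)
        if goal == end:
            keep = len(path)  # last chunk: nothing left to re-align
        else:
            keep = max(1, int(keep_fraction * len(path)))
        final_path.extend(path[:keep])  # discard the columns after the keep point
        start = path[keep - 1]          # next chunk starts at the keep point
    return final_path
```

Because the columns after the keep point are re-aligned as part of the next chunk, words near the end of one window get a second chance to be aligned with their counterparts.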
We evaluate our system on a dataset of four 5-minute audio clips of electrical engineering and chemistry lectures taken from MIT OpenCourseWare. This is the same dataset used by [] and []. Each audio clip is transcribed by 10 non-expert human workers in real time. We measure the accuracy in terms of Word Error Rate (WER) with respect to a reference transcription.
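For reference, WER is conventionally computed as the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The sketch below follows that standard definition; whitespace tokenization and the function name are assumptions, and the text normalization used in the experiments is not specified here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Accuracy is then reported as $1-$WER, so a hypothesis that drops one word of a four-word reference has a WER of 0.25 and an accuracy of 0.75.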


We are interested in investigating how the three key parameters of the algorithm, i.e., the chunk size ($c$), the heuristic weight ($w$), and the keep-length ($k$), affect the system latency, the search speed, and the alignment accuracy. The chunk size directly determines the latency of the system to the end user, as alignment cannot begin until an entire chunk is captured. Furthermore, the chunk size, the heuristic weight, and the keep-length let us trade off speed against accuracy. We also compare the performance of our algorithm with that of the most accurate fixed alignment window algorithm []. The performance in terms of WER for sliding and fixed alignment windows is presented in Figure 2. Of the systems in Figure 2, the first three are the sliding alignment window algorithm with different values of the keep-length parameter: (1) keep-length = 0.5; (2) keep-length = 0.67; and (3) keep-length = 0.85. The other systems are the graph-based algorithm of [], the MUSCLE algorithm of [], and the most accurate fixed alignment window algorithm of []. We set the heuristic weight parameter ($w$) to 3 and the chunk size parameter ($c$) to 5 seconds for all three sliding window systems and the fixed window system. The sliding alignment window produces better results and outperforms the other algorithms even for large values of the keep-length parameter. The sliding alignment window with keep-length 0.5 achieves 0.5679 average accuracy in terms of ($1-$WER), providing an 18.09% improvement with respect to the most accurate fixed alignment window (average accuracy 0.4857). On the same dataset, Lasecki et al. [] reported 36.6% accuracy using the Dragon Naturally Speaking ASR system (version 11.5 for Windows).
To show the tradeoff between latency and accuracy, we fix the heuristic weight ($w=3$) and plot the accuracy as a function of chunk size in Figure 3. We repeat this experiment for different values of keep-length. We observe that the sliding window approach dominates the fixed window approach across a wide range of chunk sizes. Furthermore, we can see that for smaller values of the chunk size parameter, increasing the keep-length makes the system less accurate. As the chunk size parameter increases, the performance of sliding window systems with different values of the keep-length parameter converges. Therefore, at larger chunk sizes, for which there are fewer chunk boundaries, the keep-length parameter has less impact.
Next, we show the tradeoff between computation speed and accuracy in Figure 3, as we fix the heuristic weight and vary the chunk size over the range [5, 10, 15, 20, 30] seconds. Larger chunks are more accurately aligned but require computation time that grows as ${N}^{K}$ in the chunk size $N$ in the worst case. Furthermore, larger heuristic weights allow faster alignment, but provide lower accuracy.
In this paper, we present a novel sliding-window-based text alignment algorithm for real-time crowd captioning. By effectively addressing the problem of alignment errors at chunk boundaries, our sliding window approach outperforms the existing fixed-window-based system [] in terms of word error rate, particularly when the chunk size is small, and thus achieves higher accuracy at lower latency.
Funded by NSF awards IIS-1218209 and IIS-0910611.