MEDIA MINING INDEXER

If you are looking for a high-end speech recognition product that can automate the indexing of live news feeds and multimedia archives in real time, your search ends here.



FEATURES AND BENEFITS



TECHNICAL SPECIFICATIONS

HARDWARE PREREQUISITES

Media Mining Indexer is designed to perform in real time or faster on common off-the-shelf PC hardware. We support hardware running Windows XP (32-bit version), Windows Server 2003 (32-bit version) or Linux on Intel or AMD CPUs.

The Media Mining Indexer requires a sound card to feed input other than PCM audio files.

To feed video from cable TV, video recorders or other input devices using composite or S-Video input, Media Mining Indexer can use a WDM-compatible analog TV card as input. SAIL LABS uses the following TV cards:


Configuration hints for real-time indexing:

SOFTWARE PREREQUISITES



LANGUAGE OPTIONS

Languages supported by Media Mining Indexer are:



INTEGRATION

The open architecture of the Media Mining Indexer enables easy integration with complementary technologies for diverse applications. Products that have integrated the Media Mining Indexer include Fast's Media Miner®, Oracle 9i interMedia, NOA's Dactylo, pluggd's HearHere, Mediaclipping's mediaclipping.de platform, blue order's media archive® and Virage Videologger®. The integration can be done at two levels:

The command line tool is currently available in two different formats: precompiled as a win32 binary (asrsample.exe) and in source code form together with a Microsoft® Visual Studio 2003 solution. The command line tool is network-transparent, allowing speech recognition across hardware boundaries.

The API facilitates communication with the Media Mining Indexer. It provides a set of C++ classes. These classes are defined in an API Library Header file. The API allows a client application to:



TECHNOLOGIES

The Media Mining System harnesses the synergies of some of the best speech processing and language technologies produced or currently in development. Technologies such as Automatic Speech Recognition, Speaker Identification, Named Entity Detection, Topic Classification, and Story Segmentation have been integrated in our system, and together produce comprehensively indexed text files from the media stream input.

Automatic Speech Recognition

Automatic Speech Recognition is performed in a sequence of steps: the incoming audio is first processed, then segmented into sections of speech and non-speech, and speech recognition is then applied to the segments identified as speech. This is done in real time, for large vocabularies (well over 64K entries) and for audio data sampled at 8 and 16 kHz (other sampling rates are also possible).
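The speech / non-speech segmentation step can be illustrated with a minimal sketch. It assumes a simple frame-energy threshold, which is only one of many possible approaches and is not necessarily the method used in the product; the function names, the frame length and the threshold are hypothetical.

```python
import math

def frame_energies(samples, frame_len):
    """Return the log energy (dB) of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        # Small constant avoids log(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-10))
    return energies

def segment_speech(samples, frame_len=160, threshold_db=-30.0):
    """Group consecutive frames above the threshold into speech segments.

    Returns (start_frame, end_frame) pairs, end exclusive.
    The threshold is an illustrative assumption, not a product setting.
    """
    energies = frame_energies(samples, frame_len)
    segments = []
    start = None
    for i, energy in enumerate(energies):
        if energy > threshold_db and start is None:
            start = i                      # speech begins
        elif energy <= threshold_db and start is not None:
            segments.append((start, i))    # speech ends
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments
```

Only the frames inside the returned segments would be passed on to the recognizer; everything else is treated as non-speech.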

Our speech recognition engine is language independent (apart from changes to the acoustic front end, e.g. for tonal languages) and has been run in a variety of languages, including English, French, Spanish, German, Arabic, Russian, Norwegian and Polish.

Front-End: Standard MFCC coefficients, energy + 5/3 frame regression, deltas, delta-deltas, cepstral mean subtraction, various types of normalization and active noise cancellation.
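Two of the front-end steps named above, cepstral mean subtraction and regression-based deltas, can be sketched as follows. This is an illustration of the standard textbook formulas and not the product's implementation; the function names and window sizes are assumptions, since the source mentions 5/3 frame regression without further detail.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean, computed over all frames."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

def deltas(frames, half_window=2):
    """Regression deltas over 2 * half_window + 1 frames.

    Frame indices are clamped at the edges of the utterance.
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, half_window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out
```

Delta-deltas follow by applying the same regression to the deltas, for example `deltas(deltas(frames), half_window=1)` for a 3-frame window.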

Acoustic Models: Speaker- and Gender-independent acoustic models, 3- and 5-phone context models, with Gaussian Prototypes tied at the prototype as well as mixture weight levels.

Language Models: word-based n-gram models employing a Witten-Bell-like back-off. The Language Model and Acoustic scores are combined for maximum accuracy.
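As an illustration of Witten-Bell smoothing, the sketch below implements the common interpolated form for a bigram model. The source only states that a Witten-Bell-like back-off is used, so the exact variant, the class name and the restriction to bigrams are assumptions.

```python
from collections import Counter, defaultdict

class WittenBellBigram:
    """Bigram model with interpolated Witten-Bell smoothing (illustrative)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.history = Counter()            # count of h as a bigram history
        self.followers = defaultdict(set)   # distinct words seen after h
        for words in sentences:
            self.unigrams.update(words)
            for h, w in zip(words, words[1:]):
                self.bigrams[(h, w)] += 1
                self.history[h] += 1
                self.followers[h].add(w)
        self.total = sum(self.unigrams.values())

    def unigram_prob(self, w):
        return self.unigrams[w] / self.total

    def prob(self, h, w):
        """P(w | h) = (c(h, w) + T(h) * P(w)) / (c(h) + T(h))."""
        c_h = self.history[h]
        if c_h == 0:
            # Unseen history: fall back to the unigram estimate.
            return self.unigram_prob(w)
        t_h = len(self.followers[h])
        return (self.bigrams[(h, w)] + t_h * self.unigram_prob(w)) / (c_h + t_h)
```

The more distinct words follow a history, the more probability mass is reserved for the lower-order estimate, which is the core idea of Witten-Bell smoothing.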

Decoder Search: We use a multi-pass, time-synchronous search. During processing, increasingly detailed models are used at each step. After a forward pass and a backward pass, the resulting N-best list is re-scored using the most detailed acoustic models.
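The final re-scoring of an N-best list can be sketched as below. The log-linear combination of acoustic and language model scores is standard practice, but the weight values, the word penalty and the function name are hypothetical and not taken from the product.

```python
def rescore_nbest(nbest, lm_weight=10.0, word_penalty=0.0):
    """Re-rank hypotheses by a combined acoustic and language model score.

    nbest: list of (words, acoustic_logprob, lm_logprob) tuples, where the
    acoustic score is assumed to come from the most detailed models.
    Returns the hypotheses sorted from best to worst.
    """
    def total(hypothesis):
        words, acoustic, lm = hypothesis
        return acoustic + lm_weight * lm + word_penalty * len(words)

    return sorted(nbest, key=total, reverse=True)
```

For example, a hypothesis with a slightly worse acoustic score can still win if the language model finds its word sequence much more plausible.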

Speaker ID / Clustering

The Speaker Identification (SID) system identifies speakers or the speakers' gender. The incoming audio is first split into speech and non-speech regions, and speaker turns are hypothesized on these chunks. Speaker Clustering (SC) and Speaker ID (SID) are run on the resulting chunks. Typically, about 20 to 100 speakers can be identified; for non-target speakers the gender is detected and the unknown speaker's segments are labeled accordingly. Speaker ID/Clustering is language independent.

Speaker Identification: Gaussian Mixture Models (GMM) for a number of pre-selected target speakers and an additional number of cohort speakers who serve for normalization and gender detection purposes.
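A minimal sketch of GMM-based speaker identification with cohort normalization follows. It assumes one-dimensional features and a simple "target score minus average cohort score" normalization; the real system's feature dimensionality, normalization and decision threshold are not described in this document, and all names are hypothetical.

```python
import math

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of scalar x under a GMM given as (weight, mean, variance)."""
    terms = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)
        for w, mu, var in gmm
    ]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def average_log_likelihood(frames, gmm):
    return sum(gmm_log_likelihood(x, gmm) for x in frames) / len(frames)

def identify_speaker(frames, targets, cohorts, threshold=0.0):
    """Return (best target name or None, normalized score).

    targets: dict mapping speaker name to GMM; cohorts: list of GMMs.
    None means no target speaker beat the cohort by more than the threshold.
    """
    cohort_score = sum(average_log_likelihood(frames, g) for g in cohorts) / len(cohorts)
    best_name, best_score = None, float("-inf")
    for name, gmm in targets.items():
        score = average_log_likelihood(frames, gmm) - cohort_score
        if score > best_score:
            best_name, best_score = name, score
    if best_score <= threshold:
        return None, best_score
    return best_name, best_score
```

Segments for which no target speaker is accepted would then only receive a gender label, as described above.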

Speaker Clustering: groups the pre-determined chunks (from the initial segmentation) into a number of clusters. The within-class dispersion is used as the quality measure. Segments are clustered using a variant of the generalized likelihood ratio criterion.

Speaker Change Detection

Speaker Change Detection (SCD) performs a phone-level decoding stage, which employs a set of broad phonetic classes of speech sounds as well as non-speech sounds. Using the information produced by the decoder (i.e. the "transcript"), the SCD system sequentially hypothesizes speaker turns at phoneme boundaries. A generalized likelihood ratio test is used to determine whether a change should be made.
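A generalized likelihood ratio (GLR) test between two adjacent stretches of audio can be sketched as follows. The sketch assumes scalar features modeled by a single maximum-likelihood Gaussian per stretch; the models, features and threshold actually used are not specified here, so the names and values are hypothetical.

```python
import math

def _ml_variance(values, floor=1e-6):
    """Maximum-likelihood variance with a floor to keep the log finite."""
    mean = sum(values) / len(values)
    return max(sum((v - mean) ** 2 for v in values) / len(values), floor)

def glr_distance(left, right):
    """GLR distance: log-likelihood gain from modeling the stretches separately.

    For ML Gaussians this reduces to
    0.5 * (n * log var(merged) - n_l * log var(left) - n_r * log var(right)).
    """
    merged = left + right
    return 0.5 * (
        len(merged) * math.log(_ml_variance(merged))
        - len(left) * math.log(_ml_variance(left))
        - len(right) * math.log(_ml_variance(right))
    )

def is_speaker_change(left, right, threshold=5.0):
    """Hypothesize a speaker turn when separate models fit clearly better."""
    return glr_distance(left, right) > threshold
```

In a sequential detector, `left` and `right` would be the feature windows on either side of a candidate phoneme boundary.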

Story Detection and Topic Classification

Story Detection consists of several phases. First, an episode is partitioned into homogeneous speaker turns. The speaker turns are initially classified; in a subsequent step, adjacent speaker turns are merged and re-classified.
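The classify, merge and re-classify phases can be sketched as below. The merging rule (join adjacent turns that received the same topic label) is an assumption made for illustration, and `classify` is a hypothetical callback standing in for the topic classifier.

```python
def detect_stories(turns, classify):
    """Merge adjacent speaker turns with the same topic into stories.

    turns: list of transcript strings, one per speaker turn.
    classify: function mapping text to a topic label (hypothetical).
    Returns a list of (topic, text) pairs, one per story.
    """
    # Phase 1: classify each speaker turn on its own.
    labeled = [(classify(text), text) for text in turns]

    # Phase 2: merge adjacent turns that share a label.
    merged = []
    for topic, text in labeled:
        if merged and merged[-1][0] == topic:
            merged[-1] = (topic, merged[-1][1] + " " + text)
        else:
            merged.append((topic, text))

    # Phase 3: re-classify the merged text, which offers more evidence.
    return [(classify(text), text) for _, text in merged]
```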

Topic Classification uses Support Vector Machines: one model per topic, plus one model for general language (the filler words that are not specific to any topic). Each classifier represents one topic and its topic-dependent words. At decoding time, the most likely set of topics given the recognized text is determined.
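The decoding step can be illustrated with a sketch of one-model-per-topic linear scoring. Hand-set word weights stand in for trained linear SVMs, and using the general-language model as a rejection baseline is an assumption, since the document does not say how the models are combined. All names and weights are hypothetical.

```python
def linear_score(words, weights, bias=0.0):
    """Score a bag of words with a linear model (weight per word)."""
    return bias + sum(weights.get(w, 0.0) for w in words)

def classify_topics(text, topic_models, general_model):
    """Return the topics that score above the general-language model.

    topic_models: dict mapping topic name to word weights.
    general_model: word weights for topic-independent filler words.
    Topics are returned from highest to lowest score.
    """
    words = text.lower().split()
    baseline = linear_score(words, general_model)
    scores = {
        topic: linear_score(words, weights)
        for topic, weights in topic_models.items()
    }
    accepted = [topic for topic, score in scores.items() if score > baseline]
    return sorted(accepted, key=lambda topic: scores[topic], reverse=True)
```

Text that consists only of filler words matches no topic, because no topic model outscores the general-language model.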