Hasso-Plattner-Institut
 
 

Lecture video indexing using video OCR and ASR technologies

One of the most important functions of a tele-teaching portal is search. As recording technology has become cheaper and easier to use, the amount of content produced for these portals has grown enormously, and students can hardly find the content they need without a search function. Even after finding the right lecture, a user still has to locate the required piece of information within the 90 minutes of a lecture. An automated solution for lecture video indexing is therefore highly desirable and especially useful for e-learning and tele-teaching. In our study, we have applied video OCR and ASR technologies as described below:

 

  • Video OCR: The text displayed in a lecture video is closely related to the lecture content and is therefore a valuable source for indexing and retrieving lecture videos. Textual content can be detected, extracted and analyzed automatically by video Optical Character Recognition (OCR) techniques. In this project, we have developed an approach for automated lecture video indexing based on video OCR technology. First, we developed a novel video segmenter that captures the actual slide transitions. Second, we perform text detection using a localization-verification scheme. We employ the Stroke Width Transform (SWT) not only to remove false alarms from the text detection results, but also to analyze the slide structure further. Unlike other OCR-based lecture video indexing approaches, we use the geometrical information of the detected text bounding boxes and the stroke width of the text, so that a summarized lecture outline can be extracted automatically from the OCR transcripts. Video indexing can then be performed using both the segmented slide shots and the extracted lecture outlines (cf. Fig. 1).
  • Automated Speech Recognition (ASR) for lecture videos: Speech is the most natural means of communication and the main carrier of information in nearly all lectures. It is therefore a distinct advantage if the speech information can be used for automated indexing of lecture videos. However, most existing lecture speech recognition systems achieve only low recognition accuracy: the Word Error Rates (WERs) reported in many research publications are approximately 40%–85%. Such poor recognition results limit the quality of the subsequent indexing process. Compared to English, German has a much higher lexical variety, and a German recognition vocabulary is several times larger than a corresponding English one. In addition, German lecture videos in specific domains such as computer science are more difficult to recognize than common content such as TV news, for several reasons: poor recording quality, speakers' dialects, interfering noise, and the out-of-vocabulary problem, in which many topic-related technical terms are missing from the standard dictionary of the ASR software. In this project, we have developed a solution that enables continuous improvement of the recognition rate by creating and refining new speech training data. It is important that the tasks involved can be performed efficiently and, if possible, fully automatically. For this reason, we have implemented an automated procedure for generating a phonetic dictionary and a method for splitting raw lecture audio into small pieces that are used in the speech model training process. For the manual transcription step, we have implemented a simple software tool that produces an output format compatible with Sphinx-Trainer, which can be used directly in the speech model training process.
  • Key-phrase extraction from OCR and ASR transcripts: We have developed a method for automated extraction of indexable key-phrases from OCR and ASR transcripts. The highly accurate lecture outlines extracted from the OCR text can serve as a cue for the correct speech context, which allows a more accurate refinement and extraction of key-phrases from the ASR transcript. This may solve the problem of building search indices from highly imperfect ASR transcripts. The extracted key-phrase collection can provide summarized information about a lecture video or about each video segment. In addition, the distribution of key-phrases over time within a video or a segment can be visualized to help users navigate within the lecture video.
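
To make the word error rate figures in the ASR item above concrete: WER is commonly computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The following Python sketch illustrates that standard definition; the function name and the example sentences are illustrative and are not taken from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, with a four-word reference and one misrecognized word, `word_error_rate("die vorlesung beginnt jetzt", "die vorlesung beginn jetzt")` evaluates to 0.25, i.e. a WER of 25%. Because insertions are counted, the value can exceed 1.0 for very poor hypotheses.
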
Figure 1. Visualization of the segmented slide shots and the extracted outline (lecture structure) of a lecture video
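
How bounding-box geometry and stroke width can yield an outline may be sketched as follows. This is a simplified illustration, not the project's actual algorithm: the `TextLine` record, the `extract_outline` function, the rule that the tallest line is the slide title, and the 40-pixel indentation step are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextLine:
    text: str            # OCR result for this line
    x: int               # left edge of the bounding box, in pixels
    y: int               # top edge of the bounding box, in pixels
    height: int          # bounding-box height, in pixels
    stroke_width: float  # mean stroke width of the characters


def extract_outline(
    lines: List[TextLine], indent_step: int = 40
) -> Tuple[Optional[str], List[Tuple[int, str]]]:
    """Return (title, [(level, text), ...]) for the text lines of one slide."""
    if not lines:
        return None, []

    # Assumed heuristic: the title is the tallest line; ties are broken by
    # thicker strokes, then by the position closest to the top of the slide.
    title = max(lines, key=lambda line: (line.height, line.stroke_width, -line.y))

    # Remaining lines are read top to bottom.
    body = sorted((line for line in lines if line is not title), key=lambda line: line.y)
    if not body:
        return title.text, []

    # Assumed heuristic: the nesting level follows from the horizontal offset
    # relative to the leftmost body line.
    left = min(line.x for line in body)
    items = [((line.x - left) // indent_step, line.text) for line in body]
    return title.text, items
```

Given a slide with a large heading and three smaller lines, one of them indented by 40 pixels, the function returns the heading as the title and the remaining lines with levels 0, 1 and 0.
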

An automated analysis and indexing framework for lecture video portal

We have developed a novel lecture video segmenter for slide transition detection as well as methods for video text detection and recognition. In addition, we have applied an automated speech recognition engine to extract the spoken text. The question is then how to integrate such multimedia analysis engines into a lecture video portal so that the analysis process can be handled easily and efficient indexing functionality can be offered to users. In this project, we have designed and implemented an architecture (cf. Fig. 2) comprising a set of management services, such as network video analysis management, data transmission, result storage and visualization. This architecture makes the multimedia analysis engines applicable to a lecture video portal. The implementation of our approach can be found on the teleTASK lecture video portal.
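
One way to picture such an analysis management service is a small dispatcher that runs each registered engine on a video and stores the results per video. The sketch below is purely illustrative: the class `AnalysisManager`, its methods, and the stub engines in the usage example are hypothetical and do not describe the actual teleTASK implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# An engine is assumed to take a video path and return a result dictionary.
Engine = Callable[[str], dict]


@dataclass
class AnalysisManager:
    """Hypothetical manager that dispatches videos to analysis engines."""

    engines: Dict[str, Engine] = field(default_factory=dict)
    results: Dict[str, Dict[str, dict]] = field(default_factory=dict)

    def register(self, name: str, engine: Engine) -> None:
        """Make an analysis engine (e.g. OCR or ASR) available under a name."""
        self.engines[name] = engine

    def analyze(self, video_id: str, video_path: str) -> Dict[str, dict]:
        """Run every registered engine on the video and store its result."""
        stored = self.results.setdefault(video_id, {})
        for name, engine in self.engines.items():
            stored[name] = engine(video_path)
        return stored
```

A usage example with stub engines standing in for the real OCR and ASR components:

```python
manager = AnalysisManager()
manager.register("ocr", lambda path: {"slides": 3})
manager.register("asr", lambda path: {"words": 120})
print(manager.analyze("lecture-01", "lecture-01.mp4"))
```

Keeping the engines behind a common callable interface means a portal could add or replace an engine without changing how results are stored or visualized.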

Figure 2. Framework architecture

Selected Publications:

  • Haojin Yang, Christoph Meinel, "Content Based Lecture Video Retrieval Using Speech and Video Text Information", IEEE TRANSACTIONS ON LEARNING TECHNOLOGIES (TLT), online ISSN: 1939-1382, Publisher: IEEE Computer Society and IEEE Education Society (accepted) 
  • Franka Grünewald, Haojin Yang, Elnaz Mazandarani, Matthias Bauer and Christoph Meinel, "Next Generation Tele-Teaching: Latest Recording Technology, User Engagement and Automatic Metadata Retrieval", International Conference on Human Factors in Computing and Informatics (southCHI), Lecture Notes in Computer Science (LNCS), Springer, 01–03 July 2013, Maribor, Slovenia
  • Haojin Yang, Christoph Oehlke and Christoph Meinel, "An Automated Analysis and Indexing Framework for Lecture Video Portal", 11th International Conference on Web-based Learning (ICWL 2012), 2-4 September 2012, Sinaia, Romania. Springer Lecture Notes, Volume 7558, 2012 (acceptance rate: 26%) (best student paper award)
  • Haojin Yang, Bernhard Quehl, Harald Sack, "A skeleton based binarization approach for video text recognition", 13th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2012), 23-25 May 2012, IEEE Press, Dublin, Ireland [poster]
  • Haojin Yang, Franka Gruenewald and Christoph Meinel, "Automated extraction of lecture outlines from lecture videos: a hybrid solution for lecture video indexing", 4th International Conference on Computer Supported Education (CSEDU 2012) (indexed by Thomson Reuters Conference Proceedings Citation Index (ISI) and Elsevier Index (EI)), SciTePress, April 16-18, 2012, Porto, Portugal (acceptance rate: 12%)
  • Haojin Yang, Bernhard Quehl and Harald Sack, "Text detection in video images using adaptive edge detection and stroke width verification", 19th International Conference on Systems, Signals and Image Processing (IWSSIP 2012), IEEE Press, Vienna, Austria, April 11-13, 2012
  • Haojin Yang, Maria Siebert, Patrick Lühne, Harald Sack and Christoph Meinel, "Lecture Video Indexing and Analysis Using Video OCR Technology", 7th International Conference on Signal Image Technology and Internet Based Systems (SITIS 2011), Track Internet Based Computing and Systems, IEEE Press, Dijon, France, Nov. 28 - Dec. 1, 2011
  • Haojin Yang, Maria Siebert, Patrick Lühne, Harald Sack and Christoph Meinel, "Automatic Lecture Video Indexing Using Video OCR Technology", IEEE International Symposium on Multimedia 2011 (ISM 2011), IEEE Press, Dana Point, CA, USA, Dec. 5-7, 2011
  • Haojin Yang, Christoph Oehlke and Christoph Meinel, "A Solution for German Speech Recognition for Analysis and Processing of Lecture Videos", 10th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2011), IEEE Press, Sanya, Hainan Island, China, May 2011

Contact:

Dr. Haojin Yang: haojin.yang(at)hpi.uni-potsdam.de
