KR102160117B1


Info

Publication number: KR102160117B1
Application number: KR1020190047797A
Authority: KR
Inventors: 손석연
Original assignee: 주식회사 한국스테노
Priority date: 2019-04-24
Filing date: 2019-04-24
Publication date: 2020-09-25
Anticipated expiration: 2039-04-24

Abstract

A broadcasting content production system for the disabled is disclosed. According to an embodiment of the present invention, a broadcasting content production system for the disabled comprises: a shorthand input unit for obtaining shorthand input for generating a sound source for screen commentary; an image receiving unit for receiving broadcast content from a content provider; and a screen commentary processing unit which converts the shorthand input into the sound source for screen commentary and synchronizes the sound source for screen commentary and the broadcast content.

Description

Translated from Korean

Real-time broadcast content production system for the disabled

The present invention relates to a broadcast content production system for visually impaired and hearing-impaired viewers of real-time broadcasts. In particular, it relates to a broadcast content production system that synchronizes broadcast content with a screen commentary sound source for the visually impaired, which is not supplied with the broadcast content and must be generated separately, so that viewers can follow the broadcast smoothly.

Screen commentary broadcasting is a service for visually impaired viewers who rely on sound alone to watch TV. To improve their understanding of and interest in the video, it describes the development of each scene, the facial expressions and gestures of the characters, and footage presented without dialogue.

Ideally, the sound source for screen commentary would be transmitted together with the broadcast content by the broadcasting station. In practice, because the audience that needs screen commentary is limited, a sound source for screen commentary is not produced for all broadcast content. A screen commentary broadcast for the visually impaired may therefore be produced by having a stenographer type a commentary manuscript in real time, converting the manuscript into a sound source, and transmitting the result.

However, the stenographer needs time to watch the broadcast content and type the commentary manuscript. As a result, the main content (content including audio) and the sound source for screen commentary are out of synchronization at the final output stage.

Accordingly, the following describes a method that synchronizes the sound source for screen commentary with the content in a real-time broadcasting environment where no commentary sound source has been produced in advance, so that visually impaired viewers can follow real-time broadcast content smoothly.

An object of the content production system for the visually and hearing impaired according to an embodiment of the present invention is to solve the difficulty of synchronizing a sound source for screen commentary with broadcast content in real-time broadcasting.

A broadcast content production system for the disabled according to an embodiment of the present invention includes a shorthand input unit that obtains shorthand input for generating a sound source for screen commentary, an image receiving unit that receives broadcast content from a content provider, and a screen commentary processing unit that converts the shorthand input into the sound source for screen commentary and synchronizes that sound source with the broadcast content.

The content production system for the visually and hearing impaired according to an embodiment of the present invention can resolve the mismatch between the broadcast content and the sound source for screen commentary. This mismatch is unavoidable because shorthand input can only be entered while watching the broadcast content.

In addition, the content production system for the visually and hearing impaired according to an embodiment of the present invention can generate and output both a sound source for screen commentary for the visually impaired and captions for the hearing impaired.

FIG. 1 is an overall configuration diagram of a content production system for the disabled according to an embodiment of the present invention.
FIG. 2 shows the configuration of a voice recognition unit according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of a shorthand input unit according to an embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of a screen commentary processing unit according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating the operation of a caption generation system according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method for synchronizing a sound source for screen commentary with broadcast content according to an embodiment of the present invention.

Hereinafter, specific embodiments of the present invention are described in detail with reference to the drawings. The spirit of the present invention is not limited to the following embodiments. A person skilled in the art who understands that spirit could readily propose other embodiments within the same scope by adding, changing, or deleting components, and such embodiments also fall within the scope of the inventive concept.

To make the idea of the invention easy to follow, the accompanying drawings may omit fine details when illustrating the overall structure, and may not reflect the overall structure when illustrating fine details. Elements that perform the same function are given the same name even if specifics such as installation location differ, to aid understanding. When there are several identical elements, only one is described; the same description applies to the others and is not repeated.

FIG. 1 is an overall configuration diagram of a content production system for the disabled according to an embodiment of the present invention.

As shown in FIG. 1, the content provider 20 delivers broadcast content to the content production system 10 for the disabled. Based on the broadcast content, the content production system 10 generates broadcast captions or content that includes a sound source for screen commentary, and delivers the result to the set-top box 30. The set-top box 30 encodes the received broadcast packets and outputs them to the viewer.

A broadcasting station is a typical example of the content provider 20, which may also be another business that produces and provides content, or an intermediate distributor. The content provider 20 delivers content including audio and video to the content production system 10. The broadcast data delivered by the content provider 20 generally includes neither captions for the hearing impaired nor a sound source for screen commentary for the visually impaired. A stenographer therefore needs to watch the broadcast in real time and either generate captions through shorthand input or type a screen commentary manuscript from which the sound source is generated.

The content production system 10 for the disabled generates broadcast captions and a sound source for screen commentary based on the content received from the content provider 20. Specifically, the content production system 10 may include a voice recognition unit 100, a shorthand input unit 200, an integrated caption processing unit 300, an image receiving unit 400, and a screen commentary processing unit 500.

When the content received from the content provider 20 is played, the voice recognition unit 100 automatically recognizes the speech and converts it into text. The voice recognition unit 100 may be a commonly used speech-to-text tool, for example the Google Cloud Speech API or Amazon Transcribe. The detailed configuration of the voice recognition unit 100 is described separately below.

FIG. 2 shows the configuration of the voice recognition unit 100 according to an embodiment of the present invention.

As shown in FIG. 2, the voice recognition unit 100 according to an embodiment of the present invention may include a voice receiving unit 110, a voice-to-text conversion unit 120, and an accuracy calculation unit 130.

The voice receiving unit 110 obtains a voice signal from the content. For example, the voice receiving unit 110 may be a microphone. The voice receiving unit 110 may collect all voice signals coming from the content, convert them into digital signals, and pass them to the voice-to-text conversion unit 120.

In another embodiment, the voice receiving unit 110 may receive the stenographer's voice. When it is difficult for the stenographer to enter shorthand through the shorthand keyboard in a particular situation, the voice receiving unit 110 may receive the stenographer's voice so that it can be converted into text. However, text converted from the stenographer's voice may be processed differently from text converted from the content's voice signal before it is passed to the integrated caption processing unit 300. The integrated caption processing unit 300 may treat text converted from the stenographer's voice in the same way as supplementary shorthand input and use it to generate the final captions.

The voice-to-text conversion unit 120 converts the voice signal received from the voice receiving unit 110 into text. Specifically, the voice-to-text conversion unit 120 may be a machine learning application that performs automatic speech recognition using deep learning. The voice-to-text conversion unit 120 may transcribe audio files stored in common formats such as WAV and MP3 and add a timestamp to each word.

The accuracy calculation unit 130 may calculate the accuracy of speech recognition during automatic speech recognition. Specifically, the accuracy calculation unit 130 may distinguish the human voice from noise in the voice signal and calculate the accuracy based on the loudness of the human voice, the ratio between the human voice and noise, or the speech recognition result.

In one embodiment, the accuracy calculation unit 130 may treat the accuracy as low when the human voice in the voice signal is quiet and as high when it is loud. In other words, the accuracy calculation unit 130 may calculate the accuracy in proportion to the loudness of the human voice. For example, a speaker in the content may be talking away from the microphone or speaking relatively quietly. The criterion for judging whether a voice is loud or quiet may be the loudness of human voices in typical content, and the specific value may be obtained through machine learning.

In another embodiment, the accuracy calculation unit 130 may treat the accuracy as lower the higher the noise share is in the ratio between human voice and noise. In other words, the accuracy calculation unit 130 may calculate the accuracy in inverse proportion to the noise ratio. For example, the venue shown in the content may be noisy, or non-speech sound may dominate.

In another embodiment, the accuracy calculation unit 130 may calculate the accuracy based on the recognition result. The accuracy calculation unit 130 may judge whether the text produced by recognizing the speech conforms to standard-language spelling and calculate the accuracy accordingly. For example, the conversion result may contain spelling errors, or a speaker in the content may be using a dialect.

When the speech recognition accuracy is at or below a specific value, the accuracy calculation unit 130 may separately record the timestamp of the corresponding word or section. The recorded timestamp may be passed to the integrated caption processing unit 300 or to the shorthand input unit 200.
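
The accuracy heuristics described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the score formula, the reference loudness, the threshold, and every name (`Word`, `accuracy_score`, `flag_low_accuracy`) are hypothetical, and the voice and noise levels are assumed to have been measured elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float        # timestamp of the word within the content, in seconds
    voice_level: float  # assumed measured loudness of the human voice (0..1)
    noise_level: float  # assumed measured loudness of non-voice sound (0..1)


def accuracy_score(word: Word, reference_level: float = 0.5) -> float:
    """Hypothetical score: rises with voice loudness, falls as the noise share grows."""
    # Loudness relative to a reference level for typical content, capped at 1.0.
    loudness = min(word.voice_level / reference_level, 1.0)
    total = word.voice_level + word.noise_level
    # Share of the signal that is human voice (1 minus the noise share).
    voice_share = word.voice_level / total if total > 0 else 0.0
    return loudness * voice_share


def flag_low_accuracy(words, threshold=0.5):
    """Return the timestamps of words whose score is at or below the threshold."""
    return [w.start for w in words if accuracy_score(w) <= threshold]


words = [
    Word("hello", 0.0, voice_level=0.6, noise_level=0.1),
    Word("mumble", 1.5, voice_level=0.1, noise_level=0.4),
]
flagged = flag_low_accuracy(words)  # only the quiet, noisy word is flagged
```

The flagged timestamps are what a downstream component would use to alert the stenographer or to decide which words need supplementary input.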

Returning to FIG. 1:

The shorthand input unit 200 obtains shorthand input. The shorthand input unit 200 may obtain the shorthand input from a stenographer. The detailed configuration of the shorthand input unit 200 is described separately below.

FIG. 3 is a block diagram showing the configuration of the shorthand input unit 200 according to an embodiment of the present invention.

As shown in FIG. 3, the shorthand input unit 200 according to an embodiment of the present invention may include a notification display unit 210 and a shorthand input device 220.

The notification display unit 210 presents notifications to the stenographer. The notification may indicate that the recognition accuracy in the voice recognition unit 100 is at or below a certain value. When the speech recognition accuracy is at or below that value, the speech-to-text result is likely to be inaccurate, so the stenographer can enter the caption directly to correct the automatic speech recognition result. The notification display unit 210 may be a display device or an audio device, and may give the stenographer a visual or audible notification.

The shorthand input device 220 obtains shorthand input from the stenographer. The shorthand input device 220 may obtain the shorthand input from a shorthand keyboard and convert it into text data. The shorthand input device 220 may be a commonly used shorthand keyboard, and the keyboard may be a combined English-Korean shorthand keyboard. The shorthand input device 220 may further include a display device, which may show the captions entered through the shorthand keyboard.

Returning to FIG. 1:

The integrated caption processing unit 300 generates the final captions by merging the text received from the voice recognition unit 100 and the shorthand input unit 200. Specifically, the integrated caption processing unit 300 merges the speech-to-text data received from the voice recognition unit 100 with the shorthand input data received from the shorthand input unit 200.

In one embodiment, the integrated caption processing unit 300 may use the speech-to-text data received from the voice recognition unit 100 as the base and supplement parts of it with the shorthand input data received from the shorthand input unit 200 to generate the final captions. As described above, in certain situations the recognition accuracy of the voice recognition unit 100 is low and the text conversion result may be inaccurate. In that case the inaccurate result can be supplemented with the stenographer's direct input to generate the final captions.

The integrated caption processing unit 300 may obtain from the voice recognition unit 100 the timestamp information of the words or sections to be supplemented, that is, those whose accuracy is at or below a certain value. The integrated caption processing unit 300 may then compare the timestamps of those words or sections with the start times of the shorthand input, synchronize the shorthand input data with the speech-to-text data, and generate the final captions.

In a specific embodiment, the integrated caption processing unit 300 compares the timestamp of each word or section to be supplemented with the time at which a supplementary shorthand input began, and matches the supplementary shorthand input to the word or section with the smallest difference to generate the final captions.
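
The nearest-timestamp matching described above might look like the following sketch. The data shapes (lists of `(timestamp, text)` tuples) and the function names are assumptions made for illustration; the patent does not specify them.

```python
def match_by_nearest_time(flagged, inputs):
    """Pair each supplementary shorthand input with the closest flagged word.

    flagged: list of (timestamp, word) for low-accuracy words (hypothetical shape)
    inputs:  list of (start_time, text) for supplementary shorthand inputs
    Returns a dict mapping a flagged timestamp to its replacement text.
    """
    replacements = {}
    if not flagged:
        return replacements
    for start_time, text in inputs:
        # Choose the flagged word whose timestamp differs least from the input start.
        nearest_ts, _ = min(flagged, key=lambda f: abs(f[0] - start_time))
        replacements[nearest_ts] = text
    return replacements


def apply_replacements(transcript, replacements):
    """Build the final caption words, substituting supplemented entries."""
    return [(ts, replacements.get(ts, word)) for ts, word in transcript]


transcript = [(0.0, "the"), (0.5, "gvmt"), (1.0, "said"), (4.0, "xx")]
flagged = [(0.5, "gvmt"), (4.0, "xx")]
inputs = [(2.0, "government"), (5.5, "today")]  # typed a little after each word
final = apply_replacements(transcript, match_by_nearest_time(flagged, inputs))
```

In this sketch two inputs could map to the same flagged word if they are typed close together; a fuller version would need a rule for that case.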

In another embodiment, the integrated caption processing unit 300 compares only the chronological order of the one or more words or sections to be supplemented with the chronological order of the one or more supplementary shorthand inputs, and matches them to generate the final captions. Because the number of words or sections to be supplemented is expected to equal the number of supplementary shorthand inputs, the final captions can be generated by comparing order alone and replacing each word or section, in sequence, with the corresponding supplementary shorthand input.
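
A minimal sketch of this order-only matching, assuming the counts are equal as the text expects. The function name and data shapes are hypothetical.

```python
def match_by_order(flagged_timestamps, inputs):
    """Pair the i-th flagged word with the i-th supplementary input, by time order.

    flagged_timestamps: timestamps of low-accuracy words or sections
    inputs: list of (start_time, text) supplementary shorthand inputs
    """
    flagged_sorted = sorted(flagged_timestamps)
    inputs_sorted = sorted(inputs, key=lambda item: item[0])
    if len(flagged_sorted) != len(inputs_sorted):
        # The embodiment assumes equal counts; this sketch refuses to guess otherwise.
        raise ValueError("number of flagged items and inputs differ")
    return {ts: text for ts, (_, text) in zip(flagged_sorted, inputs_sorted)}


pairs = match_by_order([4.0, 0.5], [(5.5, "today"), (2.0, "government")])
```

Unlike nearest-timestamp matching, this variant never compares the actual time differences, so it is insensitive to how long the stenographer takes to respond.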

The image receiving unit 400 receives broadcast content from the content provider 20. The image receiving unit 400 may include a broadcast receiver and an encoder. The image receiving unit 400 may receive the broadcast content from the content provider 20 over at least one of terrestrial broadcast, cable, and the Internet.

The image receiving unit 400 may pass the broadcast content to the voice recognition unit 100 for speech recognition, or decode the broadcast content and pass it to a display device so that the stenographer can watch it. The image receiving unit 400 may also pass the broadcast content to the screen commentary processing unit 500.

The screen commentary processing unit 500 generates a sound source for screen commentary and synchronizes it with the broadcast content delivered from the image receiving unit 400. In addition, the screen commentary processing unit 500 may encode the broadcast content and the sound source for screen commentary into a single piece of broadcast data. The screen commentary processing unit 500 is described in detail separately below.

The screen commentary processing unit 500 may obtain the manuscript for generating the screen commentary sound source from the shorthand input unit 200 and generate the sound source based on that manuscript.

FIG. 4 is a block diagram showing the configuration of the screen commentary processing unit according to an embodiment of the present invention.

As shown in FIG. 4, the screen commentary processing unit 500 according to an embodiment of the present invention may include a text-to-speech conversion unit 510, an image object recognition unit 520, and a synchronization unit 530.

The text-to-speech conversion unit 510 converts the screen commentary manuscript obtained from the shorthand input unit 200 into speech. The text-to-speech conversion unit 510 may be a commonly used text-to-speech tool, for example an application such as the Google TTS API, Select and Speak, or Amazon Polly.

The stenographer can type the screen commentary manuscript while watching the video. For example, the stenographer can type commentary on sounds other than dialogue, on scenes, on scene changes, or on the characters who appear, as well as any other commentary that visually impaired viewers need in order to enjoy the content.

The text-to-speech conversion unit 510 passes the generated sound source to the synchronization unit 530.

The image object recognition unit 520 recognizes image objects in the broadcast content obtained from the image receiving unit 400. Specifically, the image object recognition unit 520 recognizes objects in the video images of the broadcast content. The image object recognition unit 520 may be a commonly used image object recognition tool, for example an application such as the Cloud Vision API.

Object recognition is a computer vision technique for identifying objects in an image or video, and may be performed with deep learning and machine learning algorithms. Through object recognition, the objects contained in an image can be identified.

For example, the image object recognition unit 520 may extract and recognize a blue sky in a particular image frame. If none of the objects extracted from a particular image frame is recognized as a person, it may conclude that no person appears in that image.

The synchronization unit 530 synchronizes the sound source for screen commentary obtained from the text-to-speech conversion unit 510 with the broadcast content. The synchronization unit 530 may also encode new broadcast data from the synchronized sound source and broadcast content. Specifically, the synchronization unit 530 may perform synchronization by comparing the keywords of the sound source for screen commentary with the image objects recognized by the image object recognition unit 520.

In one embodiment, the synchronization unit 530 may extract keywords from the screen commentary manuscript using a generally known keyword extraction method. As a specific example, the synchronization unit 530 may include a keyword extraction tool based on big data or AI.

In another embodiment, the synchronization unit 530 may obtain a keyword for a specific portion of the screen commentary. For example, the stenographer may press a specific key before entering a particular word to mark that word as a keyword.

As another example, a word that the stenographer enters using abbreviation input may be treated as a keyword and passed to the synchronization unit 530. Abbreviation input is a predefined input scheme in which the stenographer enters only a specific combination of consonants, rather than the whole word, and a pre-stored word is entered in its place. Words entered through abbreviation input are generally frequently used words and are therefore likely to serve as keywords.
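
The two stenographer-driven ways of marking keywords could be sketched as below. The marker character, the abbreviation table, and the function name are all hypothetical; a real shorthand keyboard would use its own key codes and a consonant-based abbreviation dictionary.

```python
KEYWORD_MARK = "*"  # hypothetical key pressed before a word to mark it as a keyword
ABBREVIATIONS = {"sk": "sky", "mtn": "mountain"}  # hypothetical stored abbreviations


def parse_manuscript(tokens):
    """Expand abbreviations and collect keywords from a list of typed tokens.

    Returns (manuscript_text, keywords).
    """
    words, keywords = [], []
    for token in tokens:
        if token.startswith(KEYWORD_MARK) and len(token) > 1:
            # Explicitly marked keyword: strip the marker and remember the word.
            word = token[1:]
            keywords.append(word)
        elif token in ABBREVIATIONS:
            # Abbreviation input: expand it and treat the expansion as a keyword.
            word = ABBREVIATIONS[token]
            keywords.append(word)
        else:
            word = token
        words.append(word)
    return " ".join(words), keywords


text, keywords = parse_manuscript(["a", "blue", "sk", "above", "the", "*lake"])
```

The manuscript text would go to text-to-speech conversion, while the keyword list would be handed to the synchronization step.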

The synchronization unit 530 may compare the creation time of the manuscript on which the sound source for screen commentary is based with the timeline of image frames in the video, select the image frames for which the difference is at or below a specific duration, and then perform synchronization by comparing the objects recognized in the selected image frames with the keywords of the screen commentary sound source.
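
One way to picture this two-stage selection (time window first, then keyword and object comparison) is the sketch below. The frame representation, the overlap-count criterion, the tie-breaking rule, and the 5-second window are assumptions, not details from the patent.

```python
def find_sync_frame(manuscript_time, keywords, frames, max_gap=5.0):
    """Pick the frame time that a commentary segment should be aligned to.

    manuscript_time: time at which the manuscript segment was created (seconds)
    keywords: keywords of the commentary segment
    frames: list of (frame_time, set_of_recognized_object_labels), in time order
    max_gap: hypothetical limit on the time difference, in seconds
    Returns the chosen frame time, or None if no candidate shares a keyword.
    """
    wanted = set(keywords)
    # Stage 1: keep only frames whose time is close enough to the manuscript time.
    candidates = [(t, objs) for t, objs in frames if abs(manuscript_time - t) <= max_gap]
    # Stage 2: choose the candidate whose recognized objects match the most keywords.
    best_time, best_overlap = None, 0
    for t, objs in candidates:
        overlap = len(wanted & objs)
        if overlap > best_overlap:  # strict '>' keeps the earliest frame on ties
            best_time, best_overlap = t, overlap
    return best_time


frames = [
    (10.0, {"person"}),
    (12.0, {"sky", "mountain"}),
    (14.0, {"sky"}),
    (30.0, {"sky", "mountain"}),
]
sync_time = find_sync_frame(15.0, ["sky", "mountain"], frames)
```

Here the frame at 30.0 matches both keywords but is excluded by the time window, so the commentary typed at 15.0 is aligned to the frame at 12.0.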

In addition, the synchronization unit 530 may generate new broadcast content by merging the broadcast content with the synchronized sound source for screen commentary. In doing so, the synchronization unit 530 may adjust the degree of synchronization of the sound source for screen commentary so that it does not overlap with the audio of the broadcast content.
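
The overlap adjustment might be approached as in the following sketch, which delays a commentary clip until it fits in a gap between dialogue. It assumes the dialogue intervals of the broadcast audio are already known and do not overlap one another; that assumption, the shift-later policy, and the function name are illustrative only.

```python
def place_commentary(target_start, duration, speech_intervals):
    """Return a start time at or after target_start that avoids all dialogue.

    target_start: start time suggested by synchronization (seconds)
    duration: length of the commentary clip (seconds)
    speech_intervals: non-overlapping (start, end) dialogue intervals (assumed known)
    """
    start = target_start
    for seg_start, seg_end in sorted(speech_intervals):
        if start + duration <= seg_start:
            break  # the clip ends before this dialogue begins, so it fits here
        if start < seg_end:
            start = seg_end  # the clip would collide; move it past this dialogue
    return start


dialogue = [(9.0, 11.0), (12.0, 15.0)]
placed = place_commentary(10.0, 3.0, dialogue)  # gap 11.0-12.0 is too short
```

Delaying the clip trades exact alignment for intelligibility, which matches the text's notion of adjusting the degree of synchronization rather than enforcing it strictly.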

결과적으로, 본 발명의 일 실시 예에 따른 시각 청각 장애인을 위한 컨텐츠 제작 시스템은 방송 컨텐츠를 보면서 입력할 수 밖에 없는 속기 입력의 특성상 발생할 수 밖에 없는 방송 컨텐츠와 화면해설용 음원의 비동기를 해결할 수 있다.As a result, the content creation system for the visually and hearing impaired according to an embodiment of the present invention can solve the asynchrony between the broadcast content and the sound source for screen commentary, which is inevitable due to the nature of shorthand input that is forced to input while viewing broadcast content. .

The description now returns to FIG. 1.

The content creation system 10 for the visually and hearing impaired according to an embodiment of the present invention may further include an output unit (not shown). The output unit may output the generated integrated caption, the sound source for screen commentary, or broadcast content for the visually impaired with which the sound source for screen commentary has been synchronized. The output unit may tag the integrated caption and the sound source for screen commentary separately, packetize them, and output them. The set-top box 30 may, according to the user's settings, output the integrated caption, output the sound source for screen commentary, or output both at once.

FIG. 5 is a flowchart illustrating the operation of a caption generation system according to an embodiment of the present invention.

Here, the caption generation system may include only the voice recognition unit 100, the shorthand input unit 200, and the integrated caption processing unit 300.

The caption generation system converts the voice contained in a first section into text using an automatic speech recognition tool (S10). As described above, the automatic speech recognition tool may be any automatic speech recognition tool currently in use. The first section is a partial section of the entire timeline of the content subject to speech recognition.

The caption generation system obtains the accuracy of the automatic speech recognition and determines whether the accuracy is greater than or equal to a specific value (S20). The accuracy of the automatic speech recognition may be determined based on at least one of the magnitude of the speech signal, the ratio between the human voice and noise, or the result of the automatic speech recognition. The reference value used as the threshold may be an arbitrarily entered value or a value obtained through machine learning.

When the accuracy of the voice-to-text conversion of the first section is greater than or equal to the specific value, the caption generation system performs automatic speech recognition on the voice contained in another section (a second section) and converts it into text (S30).

On the other hand, when the accuracy of the voice-to-text conversion of the first section is less than or equal to the specific value, the caption generation system records the start time of the first section (S40). An automatic speech recognition tool generally records the timestamp at which each piece of speech converted into text was captured, and the caption generation system may separately record and manage timestamps for words or sections whose accuracy is less than or equal to the specific value.

When the conversion accuracy of the first section is less than or equal to the specific value, the caption generation system outputs a notification to the stenographer (S50). The caption generation system may output the notification to the stenographer visually or audibly.
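Steps S10 to S50 can be summarized in a short sketch. It assumes the speech recognition tool already yields `(start_time, text, confidence)` tuples, that `confidence` stands in for the accuracy value, and that a section exactly at the threshold is accepted (the source uses "greater than or equal to" and "less than or equal to" for the two branches, so the boundary case is a choice made here). The names and the default threshold are hypothetical.

```python
def review_sections(sections, threshold=0.8, notify=print):
    """Split recognized sections into accepted text and flagged start times.

    sections: iterable of (start_time, text, confidence) from an ASR tool.
    Returns (accepted, flagged), where accepted is a list of
    (start_time, text) and flagged is a list of start times that need
    shorthand input from the stenographer.
    """
    accepted, flagged = [], []
    for start_time, text, confidence in sections:
        if confidence >= threshold:        # S20: accuracy check
            accepted.append((start_time, text))
        else:
            flagged.append(start_time)     # S40: record the start time
            notify(f"Low-accuracy section at {start_time}s")  # S50: alert
    return accepted, flagged
```

Passing a different `notify` callable (for example, one that plays a sound) would correspond to the audible form of the notification.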

The caption generation system obtains shorthand input for the first voice (S60). The caption generation system may obtain the shorthand input for the first voice through a stenography keyboard. The caption generation system may also obtain the shorthand input for the first voice through speech recognition, in which case the speech to be recognized may be the stenographer's voice.

The caption generation system generates the final caption by merging the speech recognition result and the shorthand input result based on the recorded start time of the first section and the time at which the shorthand input began (S70).

In one embodiment, the caption generation system compares the start time of each section whose conversion accuracy is less than or equal to the specific value (hereinafter, a supplementary section) with the time at which shorthand input for a supplementary section began, and generates the final caption by matching each supplementary section with the supplementary shorthand input for which the difference is smallest. Supplementary shorthand input refers to the shorthand input data entered by the stenographer in response to the notification.
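The smallest-difference matching in this embodiment can be illustrated as below. The greedy strategy (take the globally closest unmatched pair first) is an assumption; the source only states that the pair with the smallest time difference is matched. The function name and data shapes are hypothetical.

```python
def match_by_nearest_start(section_starts, inputs):
    """Pair supplementary sections with supplementary shorthand inputs.

    section_starts: start times of the supplementary sections.
    inputs: list of (input_start_time, text) entered by the stenographer.
    Returns {section_start: text}.
    """
    # Every (difference, section, input index) combination, closest first.
    pairs = sorted(
        (abs(t_in - t_sec), t_sec, i)
        for t_sec in section_starts
        for i, (t_in, _) in enumerate(inputs)
    )
    matched, used_inputs = {}, set()
    for _, t_sec, i in pairs:
        if t_sec in matched or i in used_inputs:
            continue  # each section and each input is used at most once
        matched[t_sec] = inputs[i][1]
        used_inputs.add(i)
    return matched
```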

In another embodiment, the caption generation system generates the final caption by comparing and matching only the chronological order of the start times of one or more supplementary sections and of one or more supplementary shorthand inputs. Because the number of words or sections to be supplemented equals the number of supplementary shorthand inputs, the final caption can be generated by comparing only their order and replacing each word or section to be supplemented, in sequence, with the corresponding supplementary shorthand input.
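The order-only variant is simpler, since the k-th supplementary section just receives the k-th shorthand input. A minimal sketch, with hypothetical names and with the equal-count condition stated in the text enforced as an explicit check:

```python
def merge_in_order(asr_sections, flagged_starts, inputs):
    """Build the final caption by order-based substitution.

    asr_sections: list of (start_time, text) from speech recognition.
    flagged_starts: set of start times of supplementary sections.
    inputs: list of (input_time, text) supplementary shorthand inputs.
    """
    ordered_inputs = [text for _, text in sorted(inputs)]
    if len(ordered_inputs) != len(flagged_starts):
        raise ValueError("section and input counts must be equal")
    replacements = iter(ordered_inputs)
    caption = []
    for start, text in sorted(asr_sections):
        # A flagged section takes the next shorthand input in time order.
        caption.append(next(replacements) if start in flagged_starts else text)
    return " ".join(caption)
```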

The caption generation system delivers the generated final caption to the set-top box. The set-top box can display the received caption together with the video to output a captioned broadcast for the hearing impaired.

FIG. 6 is a flowchart of a method for synchronizing a sound source for screen commentary with broadcast content according to an embodiment of the present invention.

The broadcasting content production system for the disabled obtains shorthand input for screen commentary for the visually impaired (S110). As described above, the visually impaired need screen commentary for elements such as scenes and characters, and content providers generally do not supply such commentary. The system therefore obtains shorthand input for screen commentary and uses it to generate the sound source for screen commentary.

The broadcasting content production system for the disabled extracts keywords from the screen commentary manuscript written by the stenographer (S120). In one embodiment, the system may extract keywords from the screen commentary manuscript using a commonly used keyword extraction algorithm. In another embodiment, the system may extract as keywords specific words that were tagged by a specific key input during shorthand input. In yet another embodiment, the system may extract as keywords the words entered through abbreviation input.
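The tag-key embodiment of S120 can be sketched as follows. The marker `TAG`, and the convention that it precedes the word it tags, are assumptions for illustration; the source only says that a specific key input tags a specific word.

```python
TAG = "<TAG>"  # hypothetical marker emitted by a dedicated tagging key

def extract_tagged_keywords(tokens):
    """Return (manuscript_words, keywords) from a shorthand token stream.

    The word immediately following a TAG marker is treated as a keyword;
    the marker itself is not part of the manuscript.
    """
    words, keywords = [], []
    tag_pending = False
    for token in tokens:
        if token == TAG:
            tag_pending = True
            continue
        words.append(token)
        if tag_pending:
            keywords.append(token)
            tag_pending = False
    return words, keywords
```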

The broadcasting content production system for the disabled converts the shorthand input for screen commentary into speech (S130). The conversion uses a generally well-known text-to-speech algorithm.

The broadcasting content production system for the disabled recognizes image objects in the video frames of the broadcast content (S140). Through image object recognition, the system can determine the objects that make up an image. The recognition uses a generally well-known image object recognition algorithm.

The broadcasting content production system for the disabled synchronizes the sound source for screen commentary with the broadcast content (S150). Specifically, the system may perform synchronization based on the keywords of the sound source for screen commentary and the image object recognition results for the video included in the broadcast content.

The broadcasting content production system for the disabled may first compare the time at which the manuscript underlying the sound source for screen commentary was created with the video frame timeline, select the candidates for which the difference is at most a specific time, and then perform synchronization by comparing the keywords of the selected sound source for screen commentary with the image object recognition results of the video frames.

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium includes all types of recording devices that store data readable by a computer system. Examples of computer-readable media include an HDD (Hard Disk Drive), SSD (Solid State Disk), SDD (Silicon Disk Drive), ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage device, and also include implementations in the form of a carrier wave (for example, transmission over the Internet).

The above detailed description should not be construed as limiting in any respect and should be considered illustrative. The scope of the present invention should be determined by reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

Claims

A broadcasting content production system for the disabled, comprising:
a voice recognition unit that automatically recognizes the voice of broadcast content being played and converts it into text, and that provides a visual or audible notification to a stenographer when the accuracy of the voice recognition is less than or equal to a certain value;
a shorthand input unit that obtains shorthand input for correcting the result of the automatic speech recognition or shorthand input for generating a sound source for screen commentary;
an integrated caption processing unit that generates a final caption by merging the voice-to-text conversion data received from the voice recognition unit with the shorthand input data entered through the shorthand input unit to correct the result of the automatic speech recognition;
an image receiving unit that receives broadcast content from a content provider, decodes it, and delivers it to a display device; and
a screen commentary processing unit that converts the shorthand input entered through the shorthand input unit for screen commentary into a sound source for screen commentary and synchronizes the sound source for screen commentary with the broadcast content,
wherein the screen commentary processing unit recognizes an object from a video image included in the broadcast content and performs synchronization by comparing the recognized image object with a keyword of the sound source for screen commentary.

The broadcasting content production system for the disabled of claim 1, wherein the screen commentary processing unit compares the time at which the shorthand input was performed with the timeline of the image frames, selects the image frames for which the difference is at most a certain time, and performs synchronization by comparing the image objects recognized in the selected image frames with the keyword of the sound source for screen commentary.

The broadcasting content production system for the disabled of claim 1, wherein the screen commentary processing unit extracts keywords from a screen commentary manuscript.

The broadcasting content production system for the disabled of claim 1, wherein the screen commentary processing unit extracts, as a keyword, a specific word tagged by a specific key input during shorthand input.

The broadcasting content production system for the disabled of claim 1, wherein the screen commentary processing unit extracts, as a keyword, a word entered through abbreviation input.

(Deleted)