This is a continuation application of U.S. patent application Ser. No. 14/686,816 filed Apr. 15, 2015, which is a continuation application of U.S. patent application Ser. No. 12/651,799 filed Jan. 4, 2010 (now U.S. Pat. No. 9,049,418), which claims priority from Japanese Priority Patent Application 2009-003688 filed in the Japanese Patent Office on Jan. 9, 2009. Each of the above-referenced applications is hereby incorporated by reference in its entirety.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a data processing apparatus, a data processing method, and a program. More particularly, the present invention relates to a data processing apparatus, a data processing method, and a program capable of enabling users to grasp easily the details of contents such as television broadcast programs, for example.
2. Description of the Related Art
Digest playback is a contents playback method that allows users to grasp easily the details (outlines) of contents including images and voices, such as television broadcast programs.
According to the digest playback, contents are divided into several scenes based on the characteristic amount of images or voices contained in the contents. Then, images for digest playback such as thumbnails of representative images (e.g., the opening images of respective scenes) are generated and displayed.
Moreover, as a method for effectively extracting a dialog part contained in contents with a relatively low processing load at the time of generating images for digest playback, Japanese Unexamined Patent Application Publication No. 2008-124551, for example, discloses a method of extracting the playback periods of the voices of a dialog during the playback periods of voices being played back in a caption display period.
SUMMARY OF THE INVENTION
However, when only the thumbnails of the opening images of the respective scenes are displayed in the digest playback, it may be difficult for users to grasp the details of a scene from the thumbnail of the scene.
For example, in the case of a news program (report program), the opening images of the scenes of the news program are mainly composed of the image of a newscaster (announcer).
In this case, the thumbnails displayed by the digest playback will be mainly composed of the thumbnail image of the newscaster. Therefore, it is difficult to grasp the details of each scene only by watching the thumbnails.
It is therefore desirable to enable users to grasp easily the details of contents including images and voices.
According to an embodiment of the present invention, there is provided a data processing apparatus or a program for causing a computer to function as the data processing apparatus including text acquisition means for acquiring texts to be used as keywords which will be subject to audio retrieval, the texts being related to contents corresponding to contents data including image data and audio data; keyword acquisition means for acquiring the keywords from the texts; audio retrieval means for retrieving utterance of the keywords from the audio data of the contents data and acquiring timing information representing the timing of the utterance of the keywords of which the utterance is retrieved; and playback control means for generating, from image data around the time represented by the timing information among the image data of the playback contents, representation image data of a representation image which will be displayed together with the keywords and performing playback control of displaying the representation image corresponding to the representation image data together with the keywords which are uttered at the time represented by the timing information.
According to another embodiment of the present invention, there is provided a data processing method for enabling a data processing apparatus to perform the steps of: acquiring texts to be used as keywords which will be subject to audio retrieval, the texts being related to contents corresponding to contents data including image data and audio data; acquiring the keywords from the texts; retrieving utterance of the keywords from the audio data of the contents data and acquiring timing information representing the timing of the utterance of the keywords of which the utterance is retrieved; and generating, from image data around the time represented by the timing information among the image data of the playback contents, representation image data of a representation image which will be displayed together with the keywords and performing playback control of displaying the representation image corresponding to the representation image data together with the keywords which are uttered at the time represented by the timing information.
According to the embodiment of the present invention, texts which are related to contents corresponding to contents data including image data and audio data, and which will be used as keywords subject to audio retrieval, are acquired, and the keywords are acquired from the texts. Moreover, the utterance of the keywords is retrieved from the audio data of the contents data, and timing information representing the timing of the utterance of the keywords of which the utterance is retrieved is acquired. Furthermore, representation image data of a representation image which will be displayed together with the keywords are generated from image data around the time represented by the timing information among the image data of the playback contents. Furthermore, the representation image corresponding to the representation image data is displayed together with the keywords which are uttered at the time represented by the timing information.
The data processing apparatus may be an independent apparatus and may be an internal block included in one apparatus.
The program may be provided by being transferred via a transmission medium or being recorded in a recording medium.
According to the embodiment of the present invention, a user is able to grasp easily the details of scenes included in contents. That is to say, for contents including images and voices, for example, the timings of scenes whose details are described by predetermined words are acquired, and images around those timings are displayed together with the predetermined words. As a result, the user will be able to grasp easily the details of the scenes included in the contents.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram illustrating an exemplary configuration of a recorder according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a timing information acquisition process.
FIG. 3 is a flowchart illustrating a playback process.
FIG. 4 is a block diagram illustrating a first exemplary configuration of a text acquisition portion.
FIG. 5 is a flowchart illustrating a processing example according to the first exemplary configuration of the text acquisition portion.
FIG. 6 is a diagram illustrating a representation example of a representation image.
FIG. 7 is a diagram illustrating a representation example of the representation image.
FIG. 8 is a block diagram illustrating a second exemplary configuration of the text acquisition portion.
FIG. 9 is a flowchart illustrating a processing example according to the second exemplary configuration of the text acquisition portion.
FIG. 10 is a flowchart illustrating a specific content retrieval process.
FIG. 11 is a block diagram illustrating an exemplary configuration of an audio retrieval portion.
FIG. 12 is a flowchart illustrating an index generation process performed by the audio retrieval portion.
FIG. 13 is a block diagram illustrating a first exemplary configuration of a representation image generation portion.
FIG. 14 is a flowchart illustrating a processing example according to the first exemplary configuration of the representation image generation portion.
FIG. 15 is a block diagram illustrating a second exemplary configuration of the representation image generation portion.
FIG. 16 is a flowchart illustrating a processing example according to the second exemplary configuration of the representation image generation portion.
FIG. 17 is a flowchart illustrating another processing example according to the second exemplary configuration of the representation image generation portion.
FIG. 18 is a flowchart illustrating a list modifying process.
FIG. 19 is a block diagram illustrating an exemplary configuration of a computer according to an embodiment of the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Exemplary Configuration of Recorder According to Embodiment of Present Invention
FIG. 1 is a block diagram illustrating an exemplary configuration of a recorder according to an embodiment of the present invention.
Referring to FIG. 1, the recorder is an HD (hard disk) recorder, for example, and includes a contents acquisition portion 11, a contents holding portion 12, a timing information acquisition unit 20, and a playback control unit 30.
The contents acquisition portion 11 is configured to acquire contents data of contents (for example, images and voices) as the programs of television broadcasts, for example, and supply the acquired contents data to the contents holding portion 12.
When contents data are associated with metadata of contents corresponding to that contents data, the contents acquisition portion 11 also acquires the metadata and supplies them to the contents holding portion 12.
That is to say, the contents acquisition portion 11 is a tuner that receives broadcast data of television broadcasts such as digital broadcasts, and that is configured to acquire the contents data by receiving TS (transport stream), for example, as broadcast data which are transmitted (broadcast) from a non-illustrated broadcasting station, and supply the contents data to the contents holding portion 12.
Here, the broadcast data include contents data as data of the programs which are contents. Furthermore, the broadcast data may include EPG (electronic program guide) data and the like as the metadata of programs (metadata associated with programs (contents)) if necessary.
Moreover, the contents data as the data of programs include at least image data of programs and audio data associated with the image data. Furthermore, the contents data may sometimes include caption data such as closed caption. When caption data are included in the contents data, the contents data may further include display time information representing the display time at which the caption corresponding to the caption data is displayed.
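For illustration only, the structure of the contents data described above can be sketched roughly as follows in Python; the class and field names are hypothetical and are not part of any broadcast data format.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Caption:
        text: str          # caption data (e.g., one closed caption line)
        start_sec: float   # display time at which the caption appears
        end_sec: float     # display time at which the caption disappears

    @dataclass
    class ContentsData:
        image_data: object                         # image data of the program (e.g., a video stream)
        audio_data: object                         # audio data associated with the image data
        captions: Optional[List[Caption]] = None   # present only when caption data are included
        metadata: Optional[dict] = None            # e.g., EPG data (title, performers, outline)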
The contents acquisition portion 11 may be configured, for example, by a communication interface that performs communication via a network such as a LAN (local area network) or the Internet. In this case, the contents acquisition portion 11 acquires the contents data by receiving contents data and metadata (for example, so-called iEPG data) which can be downloaded from a server on a network.
Furthermore, the contents acquisition portion 11 may acquire the contents data by playing back the contents recorded on package media such as DVDs.
The contents holding portion 12 is configured, for example, by a large-capacity recording (storage) medium such as an HD (hard disk) and is configured to record (store or hold) therein the contents data supplied from the contents acquisition portion 11 if necessary.
When the metadata of contents (programs) such as EPG data are supplied from the contents acquisition portion 11 to the contents holding portion 12, the contents holding portion 12 records therein the metadata.
The recording of contents data in the contents holding portion 12 corresponds to video-recording (including programmed video-recording, so-called automatic video-recording, dubbing, and the like).
The timing information acquisition unit 20 functions as a data processing apparatus that acquires timing information representing the time at which keywords are uttered during the playback of the contents of which the contents data are recorded in the contents holding portion 12.
Specifically, the timing information acquisition unit 20 includes a text acquisition portion 21, a keyword acquisition portion 22, an audio data acquisition portion 23, an audio retrieval portion 24, and a timing information storage portion 25.
The text acquisition portion 21 is configured to acquire texts to be used as keywords, which will be used when the audio retrieval portion 24 performs audio retrieval, and supply the texts to the keyword acquisition portion 22.
The keyword acquisition portion 22 is configured to acquire keywords, which are character strings to be used as targets of audio retrieval, from the texts supplied from the text acquisition portion 21 and supply the keywords to the audio retrieval portion 24.
Here, the keyword acquisition portion 22 may acquire an entirety of the texts supplied from the text acquisition portion 21 as one keyword.
Moreover, the keyword acquisition portion 22 may perform natural language processing such as morphology analysis on the texts from the text acquisition portion 21 so as to decompose the texts into morphemes, thus acquiring an entirety or a part of the morphemes constituting the texts as the keywords.
Here, the keyword acquisition portion 22 may acquire reading information (phonemes) of morphemes, for example, thus acquiring, based on the reading information, words with long reading (namely, words with a predetermined number or more of phonemes) as the keywords.
Furthermore, the keyword acquisition portion 22 may acquire morphemes with a predetermined occurrence frequency or more as the keywords while acquiring only self-sufficient words excluding attached words such as auxiliary words.
Furthermore, the keyword acquisition portion 22 may acquire morphemes, of which the part of speech is a proper noun, as the keywords.
In addition to the above, the keyword acquisition portion 22 may acquire character strings, which are extracted by a so-called characteristic expression extraction technique, for example, as the keywords.
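As a rough sketch only, the keyword acquisition described above might be implemented along the following lines; the morphological analyzer is passed in as a function, and its output format as well as the thresholds are assumptions rather than part of the keyword acquisition portion 22.

    from collections import Counter

    def extract_keywords(text, analyze, min_phonemes=4, min_count=1):
        # 'analyze' is a hypothetical morphological analyzer that returns a list of
        # (surface, part_of_speech, reading) tuples, one tuple per morpheme.
        morphemes = analyze(text)

        # Keep self-sufficient words and drop attached words such as auxiliary words.
        content_words = [m for m in morphemes
                         if m[1] in ("noun", "proper_noun", "verb", "adjective")]
        counts = Counter(surface for surface, _, _ in content_words)

        keywords = []
        for surface, pos, reading in content_words:
            long_reading = len(reading) >= min_phonemes   # words with long reading
            frequent = counts[surface] >= min_count       # occurrence frequency condition
            if (pos == "proper_noun" or (long_reading and frequent)) and surface not in keywords:
                keywords.append(surface)
        return keywords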
The audio data acquisition portion 23 is configured to acquire audio data by reading the audio data of contents data of target contents among the contents, of which the contents data are recorded in the contents holding portion 12, and supply the audio data to the audio retrieval portion 24.
The audio retrieval portion 24 is configured to perform audio retrieval of retrieving an utterance of the keywords supplied from the keyword acquisition portion 22 from the audio data of the target contents supplied from the audio data acquisition portion 23. In addition, the audio retrieval portion 24 acquires timing information representing the timing of the utterance of a keyword of which the utterance is retrieved: that is to say, the time (timing) at which the keyword is uttered is acquired based on the beginning of the target contents, for example.
Here, as the timing information, time codes may be used, for example. Moreover, as the timing of the utterance of keywords, the timing of the beginning or ending of an utterance may be used, for example, and besides, any timing during the utterance may be used.
With respect to the target contents, the audio retrieval portion 24 generates a timing information list, in which keywords, of which the utterance is retrieved, and the timing information representing the timing of the utterance thereof are registered in a correlated manner, and supplies the timing information list to the timing information storage portion 25.
The timing information storage portion 25 is configured to store the timing information list of the target contents, supplied from the audio retrieval portion 24, and the target contents (or identification information thereof) in a correlated manner.
The playback control unit 30 is configured to perform playback control of controlling playback such as digest playback of playback contents, in which among the contents of which the contents data are recorded in the contents holding portion 12, contents which are designated to be played back are used as the playback contents.
Specifically, the playback control unit 30 includes a representation image generation portion 31 and a display control portion 32.
The representation image generation portion 31 is configured to acquire image data of the contents data of the playback contents from the contents holding portion 12 and also acquire the timing information list of the playback contents from the timing information storage portion 25.
Moreover, the representation image generation portion 31 generates, from image data around the time represented by the timing information registered in the timing information list among the image data of the playback contents, representation image data of a representation image which will be displayed together with keywords which are correlated with the timing information.
Here, as the representation image, so-called thumbnails which are reduced size images obtained by reducing original images may be used, for example.
The representation image generation portion 31 supplies pairs of keywords and representation image data corresponding to the timing information to the display control portion 32. That is to say, sets of keywords correlated with the timing information and representation image data generated from the image data around the time represented by the timing information are supplied to the display control portion 32.
The display control portion 32 displays the representation image corresponding to the representation image data supplied from the representation image generation portion 31 together with keywords which are paired with the representation image data on a display device 40 such as a TV (television receiver).
In the recorder having the above-described configuration, a timing information acquisition process, a playback process, and the like are performed.
The timing information acquisition process is performed by the timing information acquisition unit 20. In the timing information acquisition process, the timing information representing the timing of the utterance of keywords during the playback of contents is acquired.
The playback process is performed by the playback control unit 30. In the playback process, the digest playback or the like is performed using the timing information acquired in the timing information acquisition process.
Timing Information Acquisition Process
With reference now to FIG. 2, the timing information acquisition process performed by the timing information acquisition unit 20 of FIG. 1 will be described. In the recorder of FIG. 1, it will be assumed that the contents data of one or more contents are recorded in the contents holding portion 12. Moreover, the timing information acquisition process is performed (started) at an arbitrary timing.
In the timing information acquisition process, at step S11, the text acquisition portion 21 acquires texts and supplies the texts to the keyword acquisition portion 22, and the process flow proceeds to step S12.
At step S12, the keyword acquisition portion 22 acquires keywords, which are character strings to be subject to audio retrieval, from the texts supplied from the text acquisition portion 21 and generates a keyword list in which one or more keywords are registered.
That is to say, the keyword acquisition portion 22 extracts one or more character strings to be used as targets of audio retrieval from the texts supplied from the text acquisition portion 21 and generates a keyword list in which each character string is registered as a keyword.
Then, the process flow proceeds from step S12 to step S13, where the audio data acquisition portion 23 selects, as target contents, one of the contents which are not selected as target contents, among the contents of which the contents data are recorded in the contents holding portion 12. Furthermore, at step S13, the audio data acquisition portion 23 acquires audio data of the contents data of the target contents from the contents holding portion 12 and supplies the audio data to the audio retrieval portion 24.
Then, the process flow proceeds from step S13 to step S14, and a timing information list generation process for generating a timing information list of the target contents is performed at steps S14 to S19.
Specifically, at step S14, the audio retrieval portion 24 determines whether or not keywords are registered in the keyword list supplied from the keyword acquisition portion 22.
When it is determined at step S14 that keywords are registered in the keyword list, the process flow proceeds to step S15, where the audio retrieval portion 24 selects one of the keywords registered in the keyword list as a target keyword, and then, the process flow proceeds to step S16.
At step S16, the audio retrieval portion 24 performs audio retrieval to retrieve an utterance of the target keyword from the audio data of the target contents supplied from the audio data acquisition portion 23, and the process flow proceeds to step S17.
Here, the audio retrieval of the utterance of the target keyword from the audio data may be performed using so-called keyword spotting, for example.
Furthermore, the audio retrieval may be performed using other methods, for example, a method (hereinafter also referred to as an index-based retrieval method) of generating the phonemes of the audio data supplied from the audio data acquisition portion 23 to the audio retrieval portion 24 and the index of the positions of the phonemes, thus finding a sequence of phonemes that form the target keyword from the index. The index-based retrieval method is described, for example, in N. Kanda, et al. “Open-Vocabulary Keyword Detection from Super-Large Scale Speech Database,” IEEE Signal Processing Society 2008 International Workshop on Multimedia Signal Processing.
At step S17, the audio retrieval portion 24 determines, based on the results of the audio retrieval at step S16, whether or not the utterance of the target keyword (namely, audio data corresponding to the utterance of the target keyword) is included in the audio data of the target contents.
When it is determined at step S17 that the utterance of the target keyword was included in the audio data of the target contents, the audio retrieval portion 24 detects the timing of the utterance, and then, the process flow proceeds to step S18.
At step S18, the audio retrieval portion 24 registers (stores) the target keyword and the timing information representing the timing of the utterance of the target keyword in the timing information list of the target contents in a correlated manner, and the process flow proceeds to step S19.
On the other hand, when it is determined at step S17 that the utterance of the target keyword is not included in the audio data of the target contents, then, the process flow proceeds to step S19 while skipping step S18.
At step S19, the audio retrieval portion 24 deletes the target keyword from the keyword list, and the process flow then returns to step S14, and the same processes are repeated.
When it is determined at step S14 that keywords are not registered in the keyword list; that is to say, when the audio retrieval has been performed for an entirety of the keywords registered in the keyword list generated at step S12, the audio retrieval portion 24 supplies the timing information list of the target contents to the timing information storage portion 25, and then, the process flow ends.
As described above, in the timing information acquisition process, the text acquisition portion 21 acquires texts and the keyword acquisition portion 22 acquires keywords from the texts. Then, the audio retrieval portion 24 retrieves the utterance of the keywords from the audio data of the target contents and acquires the timing information representing the timing of the utterance of the keyword of which the utterance is retrieved.
Therefore, it is possible to acquire the scenes in which keywords are uttered during the playback of contents; that is to say, it is possible to acquire the timings (the timing information representing the timings) of scenes whose details the keywords describe.
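For illustration only, the loop of steps S14 to S19 can be sketched as follows; search_utterance() stands in for the audio retrieval of step S16, and its behavior is an assumption rather than the actual implementation.

    def generate_timing_information_list(keyword_list, audio_data, search_utterance):
        # search_utterance(keyword, audio_data) is assumed to return a list of times,
        # in seconds from the beginning of the target contents, at which the keyword
        # is uttered, or an empty list when the utterance is not found.
        timing_information_list = {}
        keyword_list = list(keyword_list)          # keyword list generated at step S12
        while keyword_list:                        # step S14: any keywords left?
            target_keyword = keyword_list[0]       # step S15: select the target keyword
            times = search_utterance(target_keyword, audio_data)   # step S16
            if times:                              # step S17: utterance included?
                # step S18: register the keyword and its timing information, correlated
                timing_information_list[target_keyword] = times
            keyword_list.pop(0)                    # step S19: delete the target keyword
        return timing_information_list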
Playback Process
With reference now to FIG. 3, the playback process performed by the playback control unit 30 of FIG. 1 will be described.
In the recorder of FIG. 1, it will be assumed that the timing information acquisition process of FIG. 2 has been performed, and the timing information storage portion 25 has stored therein the timing information list of an entirety of the contents of which the contents data are recorded in the contents holding portion 12.
For example, when a user operates a non-illustrated operation unit to designate contents to be used for digest playback among the contents of which the contents data are recorded in the contents holding portion 12, the representation image generation portion 31 selects at step S31 the contents designated by the user as playback contents, and the process flow then proceeds to step S32.
At step S32, the representation image generation portion 31 acquires image data of the playback contents from the contents holding portion 12 and also acquires the timing information list of the playback contents from the timing information storage portion 25, and then, the process flow proceeds to step S33.
At step S33, the representation image generation portion 31 acquires image data around the time represented by the timing information registered in the timing information list among the image data of the playback contents and generates representation image data from the image data.
Specifically, the representation image generation portion 31 generates, as the representation image data, thumbnail image data from image data of a frame (field) corresponding to the time represented by the timing information registered in the timing information list, for example.
The representation image generation portion 31 generates representation image data with respect to an entirety of the timing information registered in the timing information list and supplies the respective representation image data and keywords corresponding to the representation image data to the display control portion 32 in a paired manner: that is to say, the keywords correlated with the timing information are paired with the representation image data generated from image data around the time represented by the timing information.
Then, the process flow proceeds from step S33 to step S34, where the display control portion 32 displays a list of representation images corresponding to the representation image data supplied from the representation image generation portion 31 together with corresponding keywords on the display device 40, and the process flow ends.
In this way, on the display device 40, the representation images are displayed together with the keywords which are paired with the representation image data, the keywords being descriptive of the details of a scene including the representation images.
Therefore, the user is able to grasp easily the details of the scenes of the playback contents.
That is to say, even when the playback contents are of a news program in which the representation images are mainly composed of the image of a newscaster, the user is able to grasp easily the details of the scene including the representation image by reading the keywords being displayed together with the respective representation image.
When a list of representation images is displayed, the representation images are displayed sequentially based on the display time of the frames of image data used for generating the representation images.
Although in this example, the thumbnails of the frames corresponding to the time represented by the timing information are used as the representation images, the representation images may be short video clips (including those with a reduced size) including images corresponding to the time represented by the timing information, for example.
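A minimal sketch of the playback control of steps S32 to S34, assuming the timing information list has the keyword-to-times form used in the previous sketch; make_thumbnail() and show_list() are placeholders for the representation image generation portion 31 and the display control portion 32, not their actual implementations.

    def digest_playback(timing_information_list, make_thumbnail, show_list):
        # timing_information_list: {keyword: [utterance times in seconds], ...}
        # make_thumbnail(t): representation image data generated from image data
        #                    around time t of the playback contents (placeholder)
        # show_list(pairs):  displays each representation image with its keyword
        pairs = []
        for keyword, times in timing_information_list.items():
            for t in times:
                pairs.append((t, keyword, make_thumbnail(t)))
        pairs.sort(key=lambda p: p[0])   # display sequentially in order of display time
        show_list([(keyword, image) for _, keyword, image in pairs])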
First Exemplary Configuration of Text Acquisition Portion 21
With reference now to FIG. 4, a first exemplary configuration of the text acquisition portion 21 of FIG. 1 is illustrated.
In FIG. 4, the text acquisition portion 21 is configured as a related text acquisition unit 50.
The related text acquisition unit 50 is configured to acquire texts (hereinafter also referred to as related texts) that are related to the contents of which the contents data are recorded in the contents holding portion 12 and supply the texts to the keyword acquisition portion 22.
Specifically, in FIG. 4, the related text acquisition unit 50 includes a metadata acquisition portion 51 and a caption data acquisition portion 52.
When the metadata of the target contents are recorded in the contents holding portion 12, the metadata acquisition portion 51 acquires the metadata as the related texts by reading them out of the contents holding portion 12 and supplies the related texts to the keyword acquisition portion 22.
Specifically, when the target contents are television broadcast programs, for example, and the EPG data as the metadata of the television broadcast programs are recorded in the contents holding portion 12, the metadata acquisition portion 51 extracts related texts, such as titles of the programs as the target contents, actors' names, or brief summaries (outlines), from the EPG data and supplies the related texts to the keyword acquisition portion 22.
The metadata acquisition portion 51 may acquire the metadata of the target contents from websites on a network such as the Internet, in addition to acquiring the metadata which are recorded in the contents holding portion 12.
Specifically, the metadata acquisition portion 51 may acquire the metadata of the target contents from websites (webpages) providing information on programs, such as websites on the Internet providing iEPG or websites of broadcasting stations presenting the programs, for example.
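For illustration only, related texts might be pulled out of EPG-like metadata as in the sketch below; the dictionary keys are hypothetical, and actual EPG or iEPG formats differ.

    def related_texts_from_metadata(epg_entry):
        # epg_entry is assumed to be a dict such as
        # {"title": "...", "performers": ["...", ...], "summary": "..."}.
        texts = []
        if epg_entry.get("title"):
            texts.append(epg_entry["title"])           # program title
        texts.extend(epg_entry.get("performers", []))  # actors' names
        if epg_entry.get("summary"):
            texts.append(epg_entry["summary"])         # brief summary (outline)
        return texts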
When the contents data of the target contents includes caption data in addition to image data and audio data, the caption data acquisition portion 52 acquires the caption data as the related texts by reading them out of the contents holding portion 12 and supplies the related texts to the keyword acquisition portion 22.
The caption data acquisition portion 52 may acquire display time information representing the display time of a caption corresponding to the caption data from the contents holding portion 12, in addition to acquiring the caption data from the contents holding portion 12. Then, the caption data acquisition portion 52 supplies the display time information to the audio retrieval portion 24.
In this case, the audio retrieval portion 24 may perform the audio retrieval of the utterance of the keywords acquired from the caption data as the related texts with respect only to audio data around the display time represented by the display time information of the caption data. That is to say, the audio retrieval may be performed with respect only to audio data corresponding to a predetermined display time interval of the caption corresponding to the caption data, where the display time interval is extended by a predetermined period at the beginning and ending thereof.
By performing the audio retrieval of the utterance of keywords with respect only to the audio data around the display time represented by the display time information rather than an entirety of the audio data of the target contents, it is possible to improve the accuracy of the audio retrieval, reduce the amount of processing necessary for the retrieval, and accelerate the retrieval processing. As a result, the timing information acquisition process can be performed effectively.
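A minimal sketch of restricting the retrieval to the neighborhood of a caption's display time; the margin value and the segment-level search helper are assumptions, not part of the audio retrieval portion 24.

    def search_around_caption(keyword, audio_data, caption_start_sec, caption_end_sec,
                              search_segment, margin_sec=2.0):
        # Search only the audio between (caption_start_sec - margin_sec) and
        # (caption_end_sec + margin_sec), instead of the entire audio data of the
        # target contents.  search_segment(keyword, audio_data, start, end) is assumed
        # to return the utterance times found within [start, end].
        start = max(0.0, caption_start_sec - margin_sec)
        end = caption_end_sec + margin_sec
        return search_segment(keyword, audio_data, start, end)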
When the caption is superimposed on the image of contents in the form of a telop (ticker) or the like, rather than being included in the contents data as the caption data, the caption data acquisition portion 52 may extract the telop by image processing and convert the telop into text caption data by character recognition so that the telop can be processed in the same manner as the case where the caption is included in the contents data as the caption data.
Processing Example According to First Exemplary Configuration of Text Acquisition Portion 21
With reference now to FIG. 5, the processing example according to the first exemplary configuration of the text acquisition portion 21 of FIG. 4 (that is, the process of step S11 in the timing information acquisition process of FIG. 2) will be described.
At step S41, the metadata acquisition portion 51 determines whether or not the metadata of the target contents are present in the contents holding portion 12 or on the Internet websites.
When it is determined at step S41 that the metadata of the target contents are present in the contents holding portion 12 or on the Internet websites, the process flow proceeds to step S42, where the metadata acquisition portion 51 acquires the metadata of the target contents from the contents holding portion 12 or the Internet websites as the related texts. Moreover, the metadata acquisition portion 51 supplies the metadata as the related texts to the keyword acquisition portion 22, and the process flow proceeds from step S42 to step S43.
When it is determined at step S41 that the metadata of the target contents are not present in the contents holding portion 12 or on the Internet websites, the process flow then proceeds to step S43 while skipping step S42.
At step S43, the caption data acquisition portion 52 determines whether or not the caption data of the target contents are present in the contents holding portion 12.
When it is determined at step S43 that the caption data of the target contents are present in the contents holding portion 12, the process flow proceeds to step S44, where the caption data acquisition portion 52 acquires the caption data of the target contents from the contents holding portion 12 as the related texts and also acquires the display time information of the caption data. Then, the caption data acquisition portion 52 supplies the caption data as the related texts to the keyword acquisition portion 22 and supplies the display time information to the audio retrieval portion 24, and the process flow then proceeds from step S44 to step S45.
At step S45, the keyword acquisition portion 22 determines whether or not the related texts have been supplied from at least one of the metadata acquisition portion 51 and the caption data acquisition portion 52.
When it is determined at step S45 that the keyword acquisition portion 22 has not received the related texts from either the metadata acquisition portion 51 or the caption data acquisition portion 52, the timing information acquisition process ends because, in such a case, no keywords can be acquired.
When it is determined at step S45 that the keyword acquisition portion 22 has received the related texts from at least one of the metadata acquisition portion 51 and the caption data acquisition portion 52, then, the process flow proceeds to step S12 of FIG. 2, and the above-described processes are performed.
Representation Example of Representation Image
With reference now to FIG. 6, representation examples of the representation image which is displayed by the playback process of FIG. 3 are illustrated.
Specifically, FIG. 6 illustrates the representation examples of representation images in which the timing information acquisition process described in FIGS. 2 and 5 is performed with a news program as the contents being used as target contents, and the news program is selected as the playback contents in the playback process of FIG. 3.
Referring to FIG. 6, four thumbnail images of a newscaster of the news program as the playback contents are displayed as the representation images sequentially from the left in the order of display time.
All of the four thumbnails in FIG. 6 show the newscaster; it is difficult to grasp the details of the news program only by watching the thumbnails.
However, in FIG. 6, keywords corresponding to the representation images as thumbnails are displayed together with the respective thumbnails.
Specifically, in FIG. 6, among the four thumbnail images of the newscaster, a keyword “Subprime Loan” is displayed on the lower part of the first thumbnail (from the left), and a keyword “Nikkei Average Stock Price” is displayed on the lower part of the second thumbnail. Moreover, a keyword “Antiterrorism Special Measures Law” is displayed on the lower part of the third thumbnail, and a keyword “The National High School Baseball Championship” is displayed on the lower part of the fourth thumbnail.
Therefore, the user is able to grasp easily the details of the news program by reading the keywords.
Here, when the contents are divided into several scenes, the keywords can be said to function as titles of the scenes.
Although in FIG. 6, the thumbnails of images corresponding to the time at which the keywords are uttered are displayed as the representation image, thumbnails of the other images of the contents may be displayed as the representation image.
Specifically, images around the time at which the keywords are uttered among the images of the contents may be used as candidates (hereinafter also referred to as thumbnail candidate images) for images to be converted into thumbnails, and the thumbnails of the thumbnail candidate images may be displayed as the representation images rather than displaying the thumbnails of the images corresponding to the time at which the keywords are uttered.
Here, as the thumbnail candidate images, the opening images of scenes when contents are divided based on the characteristic amount of images or voices, for example, among the images around the time at which the keywords are uttered, may be used. Moreover, as the thumbnail candidate images, images of which the characteristic amount of images or voices is greatly different from that of the surrounding images, for example, among the images around the time at which the keywords are uttered, may be used.
That is to say, the thumbnails of the thumbnail candidate images, which are images other than the images corresponding to the time at which the keywords are uttered, are allowed to be displayed as the representation images. Therefore, there is a high possibility that thumbnails of images of various scenes are displayed as the representation images, rather than thumbnails of similar scenes such as the images of the newscaster illustrated in FIG. 6.
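One possible way of choosing a thumbnail candidate image near an utterance time is sketched below: within a window around that time, the frame whose characteristic amount (here, a grayscale histogram) differs most from the preceding sampled frame is taken as a likely scene-change image. OpenCV is assumed to be available, and the window and step sizes are arbitrary choices, not values from the embodiment.

    import cv2

    def thumbnail_candidate(video_path, utterance_sec, window_sec=5.0, step_sec=0.5):
        # Sample frames in [utterance_sec - window_sec, utterance_sec + window_sec] and
        # return the frame whose histogram differs most from the previous sampled frame.
        cap = cv2.VideoCapture(video_path)
        best_frame, best_diff, prev_hist = None, -1.0, None
        t = max(0.0, utterance_sec - window_sec)
        while t <= utterance_sec + window_sec:
            cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
            hist = cv2.normalize(hist, hist).flatten()
            if prev_hist is not None:
                diff = cv2.norm(prev_hist, hist, cv2.NORM_L2)  # dissimilarity to previous frame
                if diff > best_diff:
                    best_diff, best_frame = diff, frame
            prev_hist = hist
            t += step_sec
        cap.release()
        return best_frame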
With reference now to FIG. 7, representation examples of the representation image are illustrated in which various thumbnail images are displayed as the representation images.
In FIG. 7, in place of the images corresponding to the time at which the keywords are uttered, the thumbnails of the thumbnail candidate images around the time are displayed as the four representation images together with the keywords illustrated in FIG. 6.
Specifically, in FIG. 7, the first thumbnail of the thumbnail candidate image shows a house which is put up for auction due to the subprime loan crisis, for example, and is displayed together with the keyword “Subprime Loan.”
The second thumbnail of the thumbnail candidate image shows the Market Center in the TSE (Tokyo Stock Exchange) Arrows, for example, and is displayed together with the keyword “Nikkei Average Stock Price.”
The third thumbnail of the thumbnail candidate image shows the inside view of The National Diet of Japan, for example, and is displayed together with the keyword “Antiterrorism Special Measures Law.”
The fourth thumbnail of the thumbnail candidate image shows a scene of a high school baseball match, for example, and is displayed together with the keyword “The National High School Baseball Championship.”
The representation images of FIG. 7 enable users to grasp better the details of the contents than the representation images of FIG. 6.
However, in the case of the third thumbnail of the thumbnail candidate image showing the inside view of The National Diet of Japan, although it is possible to grasp roughly that the contents are about political issues, it is difficult to grasp further details thereof.
However, from the keyword “Antiterrorism Special Measures Law” being displayed together with the thumbnail, it is possible to grasp easily that the contents are about the antiterrorism special measures law.
Referring to FIGS. 6 and 7, although the keywords are displayed on the lower parts of the representation images, the positions at which the keywords are displayed are not particularly limited. Moreover, the keywords may be displayed to be superimposed on a part of the representation images.
According to the above-described technique disclosed in Japanese Unexamined Patent Application Publication No. 2008-124551, since the playback periods of the voices of a dialog are extracted, it is possible to perform digest playback in which images corresponding to the playback periods are sequentially played back. However, a list of thumbnails as the representation images is not displayed.
Moreover, even when the technique disclosed in Japanese Unexamined Patent Application Publication No. 2008-124551 is modified to display the thumbnails of the opening images corresponding to the playback periods of the voices of the dialog, the keywords are not displayed as illustrated in FIGS. 6 and 7. Therefore, when the thumbnails of similar images are displayed, it will be difficult to grasp the details of the contents.
Second Exemplary Configuration of Text Acquisition Portion 21
With reference now to FIG. 8, a second exemplary configuration of the text acquisition portion 21 of FIG. 1 is illustrated.
In FIG. 8, the text acquisition portion 21 is configured as a user-input acquisition portion 61.
The user-input acquisition portion 61 is configured to acquire inputs from a user as texts and supply the texts to the keyword acquisition portion 22.
That is to say, the user-input acquisition portion 61 acquires inputs of character strings, which are supplied from a non-illustrated keyboard when a user operates the keyboard, for example, as the texts. Moreover, the user-input acquisition portion 61 performs speech recognition on inputs of utterance (speech) of a user to acquire character strings obtained as the results of the speech recognition as the texts.
Processing Example of Second Exemplary Configuration of Text Acquisition Portion 21
With reference now to FIG. 9, the processing example according to the second exemplary configuration of the text acquisition portion 21 of FIG. 8 (that is, the process of step S11 in the timing information acquisition process of FIG. 2) will be described.
At step S51, the user-input acquisition portion 61 determines whether or not texts were input in response to the user operating a keyboard or uttering words. When it is determined at step S51 that the texts were not input, then, the process flow returns to step S51.
When it is determined at step S51 that the texts were input, then, the process flow proceeds to step S52, where the user-input acquisition portion 61 acquires the texts and supplies them to the keyword acquisition portion 22. Then, the process flow proceeds to step S12 of FIG. 2, and the above-described processes are performed.
Here, the keyword acquisition portion 22 may acquire an entirety of the texts supplied from the text acquisition portion 21 as one keyword as described above in FIG. 1.
When the keyword acquisition portion 22 acquires an entirety of the texts supplied from the text acquisition portion 21 as one keyword, the texts themselves input by the user are used as the keywords. Therefore, it can be said that the user is able to input the keywords.
Specific Content Retrieval Process
When the inputs from a user are acquired as texts, and keywords are acquired from the texts (including the case where the texts themselves input from the user are used as the keywords), in addition to the timing information acquisition process described in FIG. 2, where the timing information list is generated in which the keywords and the timing information of the keywords are registered in a correlated manner, a specific content retrieval process may be performed so as to retrieve contents containing utterance of keywords acquired from the inputs from the user.
With reference now to FIG. 10, a specific content retrieval process which can be performed by the recorder of FIG. 1 will be described.
The specific content retrieval process can be performed by using the timing information acquisition process of FIG. 2 and the playback process of FIG. 3.
That is to say, in the specific content retrieval process, at step S61, the text acquisition portion 21 acquires texts in the same manner as described in FIG. 9 and supplies the texts to the keyword acquisition portion 22.
Specifically, when a user inputs the names of actors that the user is interested in, or words representing a genre, the text acquisition portion 21 (specifically, the user-input acquisition portion 61 of FIG. 8) acquires the user's inputs as texts and supplies the texts to the keyword acquisition portion 22.
Then, the process flow proceeds from step S61 to step S62, where the keyword acquisition portion 22 acquires keywords from the texts supplied from the text acquisition portion 21 and generates a keyword list registering the keywords therein in the same manner as step S12 of FIG. 2. Then, the keyword acquisition portion 22 supplies the keyword list to the audio retrieval portion 24, and the process flow proceeds from step S62 to step S63.
In this case, in the keyword list, the names of actors that the user is interested in, or the words representing a genre are registered as keywords.
At step S63, the audio data acquisition portion 23 determines whether or not contents which are not selected as target contents remain in the contents of which the contents data are recorded in the contents holding portion 12.
When it is determined at step S63 that the contents which are not selected as target contents remain in the contents of which the contents data are recorded in the contents holding portion 12, then, the process flow proceeds to step S64, where the audio data acquisition portion 23 selects, as target contents, one of the contents which are not selected as target contents, among the contents of which the contents data are recorded in the contents holding portion 12.
Furthermore, at step S64, the audio data acquisition portion 23 acquires audio data of the contents data of the target contents from the contents holding portion 12 and supplies the audio data to the audio retrieval portion 24.
Then, the process flow proceeds from step S64 to step S65, where the audio retrieval portion 24 performs the timing information list generation process for generating a timing information list of the target contents; that is to say, the same processes as steps S14 to S19 of FIG. 2 are performed.
At step S65, the timing information list generation process is performed whereby the timing information list of the target contents is generated and stored in the timing information storage portion 25. Then, the process flow proceeds to step S66, and at steps S66 to S68, the playback control unit 30 performs the same processes as the respective steps S32 to S34 in the playback process of FIG. 3 while using the target contents as the playback contents.
Specifically, at step S66, the representation image generation portion 31 of the playback control unit 30 acquires image data of the target contents from the contents holding portion 12 and also acquires the timing information list of the target contents from the timing information storage portion 25, and then, the process flow proceeds to step S67.
At step S67, the representation image generation portion 31 acquires image data around the time represented by the timing information registered in the timing information list among the image data of the target contents and generates representation image data from the image data.
Specifically, the representation image generation portion 31 generates, as the representation image data, thumbnail image data from image data of a frame corresponding to the time represented by the timing information registered in the timing information list, for example.
The representation image generation portion 31 generates representation image data with respect to an entirety of the timing information registered in the timing information list and supplies the respective representation image data and keywords corresponding to the representation image data to the display control portion 32 in a paired manner.
Then, the process flow proceeds from step S67 to step S68, where the display control portion 32 displays a list of representation images corresponding to the representation image data supplied from the representation image generation portion 31 together with corresponding keywords on the display device 40.
In this way, on the display device 40, the representation images are displayed together with the keywords which are paired with the representation image data, the keywords being descriptive of the details of a scene (consecutive frames) including the representation images.
Then, the process flow returns to step S63 from step S68, and the same processes are repeated.
When it is determined at step S63 that the contents which are not selected as target contents do not remain in the contents of which the contents data are recorded in the contents holding portion 12; that is to say, when the processes of steps S63 to S68 are performed using an entirety of the contents of which the contents data are recorded in the contents holding portion 12 as the target contents, then, the process flow ends.
In this case, the names of actors that the user is interested in, or the words representing a genre, are used as keywords. Therefore, when the target contents contain many utterances of the names of actors that the user is interested in, or of the words representing a genre, a number of thumbnails are displayed together with the keywords.
On the other hand, when the target contents contain few utterances of the names of actors that the user is interested in, or of the words representing a genre, few thumbnails are displayed; in an extreme case where the target contents contain no such utterance, no thumbnails are displayed as the representation images.
Therefore, the user is able to grasp easily that the contents for which a number of thumbnails are displayed together with keywords are contents related to the actors that the user is interested in or contents related to the genres that the user is interested in.
In the specific content retrieval process of FIG. 10, it is necessary to perform the timing information list generation process of step S65 (corresponding to steps S14 to S19 of FIG. 2) while using an entirety of the contents of which the contents data are recorded in the contents holding portion 12 as the target contents rather than using the contents designated by the user.
Therefore, it is particularly desirable to accelerate the audio retrieval for retrieving the utterance of keywords from the audio data, which is part of the timing information list generation process.
As a method of accelerating the audio retrieval, the above-described index-based retrieval method can be used, for example, in which the phonemes of audio data and the index of the positions of the phonemes are generated, thus finding a sequence of phonemes that form the target keyword from the index.
Therefore, when the specific content retrieval process of FIG. 10 is performed, it is particularly desirable to configure the audio retrieval portion 24 of FIG. 1 so as to perform the audio retrieval using the index-based retrieval method.
Exemplary Configuration of Audio Retrieval Portion 24 Performing Audio Retrieval Using Index-Based Retrieval Method
With reference now to FIG. 11, an exemplary configuration of the audio retrieval portion 24 performing audio retrieval using an index-based retrieval method is illustrated.
Referring to FIG. 11, the audio retrieval portion 24 includes an index generation portion 71, an index storage portion 72, and a keyword retrieval portion 73.
The index generation portion 71 is configured to receive the audio data of the target contents from the audio data acquisition portion 23.
The index generation portion 71 generates phonemes (phoneme string) in the audio data of the target contents supplied from the audio data acquisition portion 23 and the index of the positions (timings) of the phonemes and supplies the phonemes and the index to the index storage portion 72.
The index storage portion 72 is configured to temporarily store the index supplied from the index generation portion 71.
The keyword retrieval portion 73 is configured to receive the keywords from the keyword acquisition portion 22.
The keyword retrieval portion 73 retrieves a sequence of phonemes that form the keywords supplied from the keyword acquisition portion 22 from the index stored in the index storage portion 72.
When the sequence of phonemes of the keyword can be retrieved from the index stored in the index storage portion 72, the keyword retrieval portion 73 determines that the utterance of the keyword has been retrieved and acquires, from the index stored in the index storage portion 72, the timing information representing the timing (the position of the sequence of phonemes). Then, the keyword retrieval portion 73 generates a timing information list in which the keywords and the timing information are registered in a correlated manner and supplies the timing information list to the timing information storage portion 25.
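A very small sketch of the index-based retrieval idea: an index mapping each phoneme to the positions at which it occurs is built once per content, and the phoneme sequence of a keyword is then looked up through that index. The phoneme recognizer and the keyword-to-phoneme conversion are assumptions and appear here only as precomputed inputs.

    from collections import defaultdict

    def build_phoneme_index(recognized_phonemes):
        # recognized_phonemes: list of (phoneme, time_sec) pairs assumed to come from
        # a phoneme recognizer run on the audio data of the target contents.
        index = defaultdict(list)
        for position, (phoneme, time_sec) in enumerate(recognized_phonemes):
            index[phoneme].append((position, time_sec))
        return index

    def find_keyword(keyword_phonemes, recognized_phonemes, index):
        # Return the times at which the phoneme sequence of the keyword appears;
        # candidate start positions are taken from the index of the first phoneme.
        hits = []
        length = len(keyword_phonemes)
        for position, time_sec in index.get(keyword_phonemes[0], []):
            candidate = [p for p, _ in recognized_phonemes[position:position + length]]
            if candidate == list(keyword_phonemes):
                hits.append(time_sec)   # timing information of the retrieved utterance
        return hits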
Processing Example of Audio Retrieval Portion 24 Performing Audio Retrieval Using Index-Based Retrieval Method
When the audio retrieval portion 24 is configured as illustrated in FIG. 11 so as to perform the audio retrieval using the index-based retrieval method, the audio retrieval portion 24 performs an index generation process for generating an index at step S64 in FIG. 10 prior to the timing information list generation process of step S65 upon receiving the audio data of the target contents from the audio data acquisition portion 23.
With reference now to FIG. 12, the index generation process performed by the audio retrieval portion 24 of FIG. 11 will be described.
At step S71, the index generation portion 71 generates phonemes in the audio data of the target contents supplied from the audio data acquisition portion 23 and the index of the positions of the phonemes and supplies the phonemes and the index to the index storage portion 72, and then, the process flow proceeds to step S72.
At step S72, the index storage portion 72 temporarily stores the index supplied from the index generation portion 71, and the process flow ends.
After the index generation process is completed, the timing information list generation process of step S65 in FIG. 10 is performed. Specifically, the keyword retrieval portion 73 performs audio-based keyword retrieval (corresponding to step S16 of FIG. 2) for retrieving the sequence of phonemes that form the keyword supplied from the keyword acquisition portion 22, from the index stored in the index storage portion 72.
First Exemplary Configuration of Representation Image Generation Portion 31
With reference now to FIG. 13, a first exemplary configuration of the representation image generation portion 31 of FIG. 1 is illustrated.
Referring to FIG. 13, the representation image generation portion 31 includes an image data acquisition portion 81 and a thumbnail generation portion 82.
The image data acquisition portion 81 is configured to acquire image data of the target contents (or the playback contents) from the contents holding portion 12 and supply the image data to the thumbnail generation portion 82.
The thumbnail generation portion 82 is configured to receive the timing information list of the target contents (or the playback contents) from the timing information storage portion 25 in addition to receiving the image data of the target contents from the image data acquisition portion 81.
Based on the timing information registered in the timing information list supplied from the timing information storage portion 25, the thumbnail generation portion 82 generates thumbnail image data from the image data corresponding to the time represented by the timing information among the image data supplied from the image data acquisition portion 81 as representation image data.
Then, the thumbnail generation portion 82 supplies the keywords correlated with the timing information and the thumbnail image data as the representation image data generated based on the timing information to the display control portion 32 in a paired manner.
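Frame extraction and reduction for the thumbnail image data could be done, for instance, with OpenCV as sketched below; the output size is arbitrary and the function is only an illustration, not the thumbnail generation portion 82 itself.

    import cv2

    def make_thumbnail(video_path, time_sec, size=(160, 90)):
        # Seek to the frame corresponding to the time represented by the timing
        # information and reduce it to thumbnail image data.
        cap = cv2.VideoCapture(video_path)
        cap.set(cv2.CAP_PROP_POS_MSEC, time_sec * 1000.0)
        ok, frame = cap.read()
        cap.release()
        if not ok:
            return None
        return cv2.resize(frame, size)   # reduced-size representation image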
Processing Example of First Exemplary Configuration of Representation Image Generation Portion 31
With reference now to FIG. 14, the processing example of the first exemplary configuration of the representation image generation portion 31 of FIG. 13 (that is, the processes of steps S32 and S33 in the playback process of FIG. 3) will be described.
The same processes are performed at steps S66 and S67 of FIG. 10. Specifically, at step S81, the thumbnail generation portion 82 acquires the timing information list of the playback contents from the timing information storage portion 25, and the process flow proceeds to step S82.
At step S82, the image data acquisition portion 81 acquires the image data of the playback contents from the contents holding portion 12 and supplies the image data to the thumbnail generation portion 82, and then, the process flow proceeds to step S83.
Here, the above-described processes of steps S81 and S82 are performed at step S32 of FIG. 3 (step S66 of FIG. 10). Moreover, the later-described processes of steps S83 and S84 are performed at step S33 of FIG. 3 (step S67 of FIG. 10).
Specifically, at step S83, based on the timing information registered in the timing information list supplied from the timing information storage portion 25, the thumbnail generation portion 82 acquires the image data corresponding to the time represented by the timing information among the image data supplied from the image data acquisition portion 81.
Then, the process flow proceeds from step S83 to step S84, where the thumbnail generation portion 82 generates thumbnail image data from the image data corresponding to the time represented by the timing information as the representation image data.
At step S84, the thumbnail generation portion 82 supplies the keywords correlated with the timing information in the timing information list and the thumbnail image data as the representation image data generated based on the timing information to the display control portion 32 in a paired manner, and then, the process flow proceeds to step S34 of FIG. 3 (step S68 of FIG. 10).
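As an illustration of steps S81 to S84, the following is a minimal sketch of generating thumbnail image data from the frame at the time represented by each piece of timing information and pairing it with the correlated keyword. It assumes OpenCV for decoding the playback contents and reuses the hypothetical TimingInformationList from the earlier sketch; neither is prescribed by the embodiment.

import cv2  # assumption: OpenCV is available for decoding the playback contents

def generate_thumbnails(video_path, timing_list, size=(160, 90)):
    """For each (keyword, timing) pair, generate thumbnail image data from the
    image data corresponding to the time represented by the timing information."""
    capture = cv2.VideoCapture(video_path)
    pairs = []  # (keyword, thumbnail) pairs to be supplied to the display control side
    for keyword, timings in timing_list.entries.items():
        for timing in timings:
            capture.set(cv2.CAP_PROP_POS_MSEC, timing * 1000.0)
            ok, frame = capture.read()
            if not ok:
                continue  # no image data at that time
            thumbnail = cv2.resize(frame, size)
            pairs.append((keyword, thumbnail))
    capture.release()
    return pairs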
Second Exemplary Configuration of Representation Image Generation Portion 31
As described above, the audio retrieval portion 24 performs the audio retrieval of retrieving the utterance of the target keyword supplied from the keyword acquisition portion 22, from the audio data of the target contents supplied from the audio data acquisition portion 23, and acquires the timing information of the target keyword of which the utterance is retrieved.
That is to say, when the utterance of the target keyword is retrieved from the audio data of the target contents, the audio retrieval portion 24 acquires the timing information of the target keyword of which the utterance is retrieved.
Therefore, when the target keyword is uttered a plurality of times in the target contents, the audio retrieval portion 24 acquires the timing information of the target keyword with respect to each of the plurality of times of utterance.
As described above, when the timing information of the target keyword is acquired with respect to the plurality of times of utterance, that is to say, when a plurality of pieces of timing information is acquired with respect to the target keyword, the target keyword and the plurality of pieces of timing information are registered in the timing information list in a correlated manner.
Moreover, when a keyword and a plurality of pieces of timing information are registered in the timing information list in a correlated manner, a plurality of representation images generated from the image data corresponding to the times represented by the respective pieces of timing information is displayed together with the same keyword in the playback process of FIG. 3.
However, from the perspective of attracting the user's attention, it is desirable that the plurality of representation images displayed together with the keywords registered in the timing information list be composed of images which differ from each other as much as possible, rather than images which are similar to each other, such as similar images of a newscaster.
With reference now to FIG. 15, a second exemplary configuration of the representation image generation portion 31 of FIG. 1 is illustrated.
In the drawing, portions the same as or similar to those illustrated in FIG. 13 are denoted by the same reference numerals, and description thereof will be appropriately omitted.
The representation image generation portion 31 of FIG. 15 is similar to the case of FIG. 13 in that it includes the image data acquisition portion 81 and the thumbnail generation portion 82.
However, the representation image generation portion 31 of FIG. 15 is different from the case of FIG. 13 in that it further includes a similarity calculation portion 83 and a selecting portion 84.
The representation image generation portion 31 of FIG. 15 is configured to calculate the degree of similarity representing the similarity between an image corresponding to the image data around the time represented by the timing information registered in the timing information list and an image corresponding to the image data around the time represented by other timing information. Furthermore, based on the degree of similarity, the representation image generation portion 31 selects, as final timing information representing the timing of the image data which will be used as the representation image data, timing information representing the time at which the representation image is not similar to other representation images, among the timing information registered in the timing information list. Then, the representation image generation portion 31 generates representation image data from the image data around the time represented by the final timing information.
That is to say, in FIG. 15, the similarity calculation portion 83 is configured to receive the image data of the target contents (or the playback contents) from the image data acquisition portion 81. Furthermore, the similarity calculation portion 83 is configured to receive the timing information list of the target contents (or the playback contents) from the timing information storage portion 25.
The similarity calculation portion 83 sequentially sets the keywords registered in the timing information list supplied from the timing information storage portion 25 as a target keyword, and acquires the timing information correlated with the target keyword as candidate timing information representing the candidates for the timing of an image which will be used as the representation image.
When one piece of candidate timing information is acquired for the target keyword, the similarity calculation portion 83 supplies that piece of candidate timing information to the selecting portion 84 together with the target keyword.
Moreover, when a plurality of pieces of candidate timing information is acquired for the target keyword, the similarity calculation portion 83 sets the images corresponding to the image data corresponding to the times represented by the respective pieces of candidate timing information of the target keyword as candidate images which will be used as the candidates for a representation image, and calculates the degree of similarity between each candidate image and each image corresponding to the image data corresponding to the time represented by the timing information correlated with other keywords.
That is to say, using the image data supplied from the image data acquisition portion 81, the similarity calculation portion 83 calculates the degree of similarity between each of the plurality of candidate images corresponding to the times represented by the plurality of pieces of candidate timing information of the target keyword and each image corresponding to the time represented by the timing information in the timing information list excluding the plurality of pieces of candidate timing information (that is, the timing information correlated with keywords (other keywords) other than the target keyword).
Then, the similarity calculation portion 83 supplies, to the selecting portion 84 together with the target keyword, the candidate timing information and the degree of similarity calculated between each of the plurality of candidate images (hereinafter also referred to as the candidate images of the candidate timing information) corresponding to the times represented by the plurality of pieces of candidate timing information of the target keyword and each image (hereinafter also referred to as a similarity calculation target image) corresponding to the time represented by the timing information correlated with the other keywords.
When one piece of candidate timing information is supplied from the similarity calculation portion 83 with respect to the target keyword, the selecting portion 84 selects that piece of candidate timing information as the final timing information representing the timing of the image data which will be used as the representation image data and supplies the candidate timing information to the thumbnail generation portion 82 together with the target keyword supplied from the similarity calculation portion 83.
When a plurality of pieces of candidate timing information is supplied from the similarity calculation portion 83 with respect to the target keyword, the selecting portion 84 selects, as the final timing information, the candidate timing information of the candidate image which is the least similar to the similarity calculation target images, among the plurality of candidate images of the plurality of pieces of candidate timing information, based on the degree of similarity supplied from the similarity calculation portion 83.
Then, the selecting portion 84 supplies the final timing information to the thumbnail generation portion 82 together with the target keyword supplied from the similarity calculation portion 83.
As described above, in FIG. 15, the thumbnail generation portion 82 receives the final timing information and the target keyword from the selecting portion 84. Furthermore, the thumbnail generation portion 82 receives the image data of the target contents from the image data acquisition portion 81.
Based on the final timing information supplied from the selecting portion 84, the thumbnail generation portion 82 generates, as representation image data, thumbnail image data from the image data corresponding to the time represented by the final timing information among the image data supplied from the image data acquisition portion 81.
Then, the thumbnail generation portion 82 supplies the target keyword supplied from the selecting portion 84, namely the keyword correlated with the final timing information, and the thumbnail image data as the representation image data generated based on the final timing information to the display control portion 32 in a paired manner.
Here, as the degree of similarity between images calculated in the similarity calculation portion 83 (that is, the degree of similarity between the candidate image and the similarity calculation target image), a distance (metric) between images calculated from a color image histogram (e.g., an RGB color histogram) may be used. A method of calculating such a distance from a color image histogram is described, for example, in Y. Rubner, et al., "The Earth Mover's Distance as a Metric for Image Retrieval," International Journal of Computer Vision 40(2), pp. 99-121 (2000).
Furthermore, the degree of similarity may be calculated using the image data per se of the contents and may be calculated using the reduced image data of the image data of the contents. When the degree of similarity is calculated using the reduced image data of the image data of the contents, it is possible to decrease the amount of processing necessary for calculating the degree of similarity.
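As one possible realization of the degree of similarity mentioned above, the following is a minimal sketch that compares RGB color histograms using a simple histogram-intersection score rather than the Earth Mover's Distance cited above; NumPy is assumed, the images are assumed to be RGB arrays, and reduced image data may be passed in to lower the processing load. The function names are hypothetical.

import numpy as np

def rgb_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Compute a normalized RGB color histogram of an image (H x W x 3, uint8)."""
    hist, _ = np.histogramdd(
        image.reshape(-1, 3), bins=(bins, bins, bins), range=((0, 256),) * 3
    )
    return hist / hist.sum()

def similarity(image_a: np.ndarray, image_b: np.ndarray) -> float:
    """Degree of similarity between two images: 1.0 means identical histograms,
    values near 0.0 mean the images differ greatly (histogram intersection)."""
    return float(np.minimum(rgb_histogram(image_a), rgb_histogram(image_b)).sum())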
Processing Example of Second Exemplary Configuration of Representation Image Generation Portion 31
With reference now to FIG. 16, the processing example of the second exemplary configuration of the representation image generation portion 31 of FIG. 15, that is, the processes of steps S32 and S33 in the playback process of FIG. 3 (and steps S66 and S67 of FIG. 10), will be described.
At step S101, the similarity calculation portion 83 acquires the timing information list of the playback contents from the timing information storage portion 25, and then, the process flow proceeds to step S102.
At step S102, the image data acquisition portion 81 acquires the image data of the playback contents from the contents holding portion 12 and supplies the image data to the similarity calculation portion 83 and the thumbnail generation portion 82, and then, the process flow proceeds to step S103.
Here, the above-described processes of steps S101 and S102 are performed at step S32 of FIG. 3 (step S66 of FIG. 10). Moreover, the later-described processes of steps S103 to S111 are performed at step S33 of FIG. 3 (step S67 of FIG. 10).
At step S103, the similarity calculation portion 83 selects, as the target keyword, one keyword which has not yet been selected as a target keyword from among the keywords registered in the timing information list supplied from the timing information storage portion 25, and then, the process flow proceeds to step S104.
At step S104, the similarity calculation portion 83 acquires the timing information correlated with the target keyword from the timing information list supplied from the timing information storage portion 25 as candidate timing information, and then, the process flow proceeds to step S105.
At step S105, the similarity calculation portion 83 determines whether or not a plurality of pieces of candidate timing information has been acquired with respect to the target keyword.
When it is determined at step S105 that a plurality of pieces of candidate timing information has not been acquired with respect to the target keyword, that is, when one piece of candidate timing information has been acquired with respect to the target keyword, the similarity calculation portion 83 supplies that piece of candidate timing information to the selecting portion 84 together with the target keyword.
Then, the process flow proceeds from step S105 to step S106, and the selecting portion 84 selects the piece of candidate timing information supplied from the similarity calculation portion 83 as the final timing information. Furthermore, at step S106, the selecting portion 84 supplies the final timing information to the thumbnail generation portion 82 together with the target keyword supplied from the similarity calculation portion 83, and then, the process flow proceeds to step S109.
When it is determined at step S105 that a plurality of pieces of candidate timing information has been acquired with respect to the target keyword, the process flow proceeds to step S107, where the similarity calculation portion 83 sets the images corresponding to the image data corresponding to the times represented by the respective pieces of candidate timing information of the target keyword as candidate images and calculates the degree of similarity between each of the plurality of candidate images and each image (similarity calculation target image) corresponding to the image data corresponding to the time represented by the timing information correlated with other keywords.
That is to say, using the image data supplied from the image data acquisition portion 81, the similarity calculation portion 83 calculates the degree of similarity between each of the plurality of candidate images corresponding to the times represented by the plurality of pieces of candidate timing information of the target keyword and the similarity calculation target images, which are the images corresponding to the times represented by the timing information correlated with keywords (other keywords) other than the target keyword in the timing information list.
Then, the similarity calculation portion 83 supplies the degree of similarity calculated between each of the plurality of candidate images of the plurality of pieces of candidate timing information of the target keyword and the similarity calculation target images to the selecting portion 84 together with the target keyword.
Then, the process flow proceeds from step S107 to step S108, and the selecting portion 84 selects, as the final timing information, the candidate timing information of the candidate image which is the least similar to the similarity calculation target images, among the plurality of candidate images supplied from the similarity calculation portion 83, based on the degree of similarity supplied from the similarity calculation portion 83 with respect to the target keyword.
That is to say, if degrees of similarity having smaller values represent lower similarity, the selecting portion 84 detects, for each of the plurality of candidate images, the minimum value (or maximum value) of the degree of similarity between that candidate image and the similarity calculation target images. Furthermore, the selecting portion 84 sets the candidate image of which the detected minimum value (or maximum value) of the degree of similarity is the lowest (or highest) as the candidate image which is the least similar to the similarity calculation target images and selects the candidate timing information of that candidate image as the final timing information.
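The selection at step S108 can be illustrated with a minimal sketch, assuming the convention in which smaller degrees of similarity represent lower similarity; the function below is hypothetical, and the similarity() helper is the one sketched earlier.

def select_final_timing(candidates, target_images, similarity):
    """candidates: list of (candidate_timing, candidate_image) pairs for the target keyword.
    target_images: images at the times of timing information correlated with other keywords.
    Returns the candidate timing whose image is the least similar to the target images."""
    best_timing, best_score = None, float("inf")
    for timing, image in candidates:
        # Minimum degree of similarity between this candidate image and the target images.
        score = min(similarity(image, other) for other in target_images)
        if score < best_score:  # the lowest minimum -> the least similar candidate image
            best_timing, best_score = timing, score
    return best_timing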
Then, the selecting portion 84 supplies the final timing information to the thumbnail generation portion 82 together with the target keyword supplied from the similarity calculation portion 83, and the process flow proceeds to step S109.
At step S109, the thumbnail generation portion 82 acquires the image data corresponding to the time represented by the final timing information supplied from the selecting portion 84 from the image data of the target contents supplied from the image data acquisition portion 81, and then, the process flow proceeds to step S110.
At step S110, the thumbnail generation portion 82 generates the thumbnail image data from the image data corresponding to the time represented by the final timing information as the representation image data.
Furthermore, at step S110, the thumbnail generation portion 82 supplies the target keyword supplied from the selecting portion 84 and the thumbnail image data as the representation image data generated based on the final timing information supplied from the selecting portion 84 to the display control portion 32 in a paired manner.
Then, the process flow proceeds from step S110 to step S111, and the similarity calculation portion 83 determines whether or not all of the keywords registered in the timing information list supplied from the timing information storage portion 25 have been processed.
When it is determined at step S111 that all of the keywords registered in the timing information list have not yet been processed, that is, when there is a keyword that has not yet been used as the target keyword among the keywords registered in the timing information list, the process flow returns to step S103. Then, at step S103, one of the keywords that have not yet been used as the target keyword is selected as a new target keyword from the keywords registered in the timing information list, and the same processes are repeated.
When it is determined at step S111 that all of the keywords registered in the timing information list have been processed, the process flow proceeds to step S34 of FIG. 3 (step S68 of FIG. 10).
As described above, when a plurality of pieces of timing information is correlated with a target keyword in the timing information list, the degree of similarity between the similarity calculation target images and the candidate image of each piece of candidate timing information is calculated using the plurality of pieces of timing information as the candidate timing information. Then, based on the degree of similarity, the candidate timing information of the candidate image which is the least similar to the similarity calculation target images, among the plurality of candidate images, is selected as the final timing information. As a result, the plurality of representation images displayed on the display device 40 together with the keywords registered in the timing information list is composed of images which differ from each other as much as possible.
Therefore, it is possible to attract the user's attention more effectively than when similar images, such as images of a newscaster, are displayed as the representation images.
Another Processing Example of Second Exemplary Configuration of Representation Image Generation Portion 31
With reference now to FIG. 17, another processing example of the second exemplary configuration of the representation image generation portion 31 of FIG. 15, that is, the processes of steps S32 and S33 in the playback process of FIG. 3 (steps S66 and S67 of FIG. 10), will be described.
Referring to FIG. 17, the same processes as steps S101 to S111 of FIG. 16 are performed at steps S121 and S122 and steps S124 to S132.
However, in FIG. 17, at step S123 between steps S122 and S124, the similarity calculation portion 83 performs a list modifying process of modifying the timing information list acquired from the timing information storage portion 25.
List Modifying Process
With reference now to FIG. 18, the list modifying process performed by the similarity calculation portion 83 of FIG. 15 will be described.
At step S141, the similarity calculation portion 83 selects, as the target keyword, one keyword which has not yet been selected as a target keyword from among the keywords registered in the timing information list supplied from the timing information storage portion 25, and then, the process flow proceeds to step S142.
At step S142, the similarity calculation portion 83 selects, as target timing information, one piece of timing information which has not yet been selected as the target timing information from among the timing information correlated with the target keyword in the timing information list supplied from the timing information storage portion 25, and then, the process flow proceeds to step S143.
At step S143, the similarity calculation portion 83 selects one or more timings around the time represented by the target timing information, among the timings of the image data of the target contents supplied from the image data acquisition portion 81, as the candidates for additional timing which will be additionally correlated with the target keyword.
That is to say, the similarity calculation portion 83 selects, as the candidates for additional timing, timings other than the timing represented by the target timing information, among the timings that divide a predetermined time interval around the time represented by the target timing information into a predetermined number of brief time intervals. Here, the length of the predetermined time interval and the number of brief time intervals dividing the predetermined time interval may have fixed values or may have variable values determined by random numbers, for example.
Then, the process flow proceeds from step S143 to step S144, and the similarity calculation portion 83 calculates the degree of similarity between each image corresponding to the one or more candidates for additional timing and each image corresponding to the other timings.
Here, among the images corresponding to one or more candidates for additional timing, an image for which the degree of similarity is calculated will be regarded as a target image.
The "images corresponding to the other timings," for which the degree of similarity with the target image is calculated at step S144, refer to the images, excluding the target image, corresponding to the one or more candidates for additional timing, and the image corresponding to the time represented by the target timing information.
Then, the process flow proceeds from step S144 to step S145, where, based on the degree of similarity calculated at step S144, the similarity calculation portion 83 determines the timings (the candidates for additional timing) of the images which are not similar to the images corresponding to the other timings, among the images of the one or more candidates for additional timing, as additional timing.
That is to say, for example, if degrees of similarity having larger values represent higher similarity, the similarity calculation portion 83 selects, as the images which are not similar to the images corresponding to the other timings, images of which the degree of similarity with the images of the other timings is not more than a threshold value (such as the minimum value of those degrees of similarity), or images of which the rank of the degree of similarity is within N (N>1) places from the lowest rank, among the images corresponding to the one or more candidates for additional timing, and determines the timings (the candidates for additional timing) of those images as the additional timing.
Furthermore, at step S145, the similarity calculation portion 83 registers the timing information representing the additional timing in the timing information list in the form of additionally correlating the timing information with the target keyword, and then, the process flow proceeds to step S146.
At step S146, the similarity calculation portion 83 determines whether or not all of the timing information correlated with the target keyword has been processed.
When it is determined at step S146 that all of the timing information correlated with the target keyword has not yet been processed, that is, when there is timing information which has not been selected as the target timing information among the timing information correlated with the target keyword, the process flow returns to step S142.
Then, the processes of steps S142 to S146 are repeated.
According to the processes of steps S142 to S146, among one or more timings around the time represented by the target timing information correlated with the target keyword, timing information representing the timings of images which are not similar to each other (images which are not similar to the images corresponding to the time represented by the target timing information) is additionally correlated with the target keyword.
When it is determined at step S146 that all of the timing information correlated with the target keyword has been processed, the process flow proceeds to step S147, where the similarity calculation portion 83 determines whether or not all of the keywords registered in the timing information list have been processed.
When it is determined at step S147 that all of the keywords registered in the timing information list have not yet been processed, that is, when there is a keyword that has not been selected as the target keyword among the keywords registered in the timing information list, the process flow returns to step S141. Then, the processes of steps S141 to S147 are repeated.
When it is determined at step S147 that all of the keywords registered in the timing information list have been processed, the process flow returns to the main routine.
As described above, in the list modifying process, the timings of images which differ from each other as much as possible, among the one or more timings (the candidates for additional timing) around the times represented by the timing information registered in the timing information list, are determined as the additional timing. Then, the timing information representing the additional timing is added to the timing information list, thus modifying the timing information list.
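To illustrate the list modifying process as a whole, the following is a minimal sketch that, for each registered timing, generates candidate additional timings by dividing a surrounding interval into brief intervals and keeps only those whose images are sufficiently dissimilar. The interval length, number of divisions, threshold, the frame_at() helper, and the reuse of the earlier hypothetical structures are all assumptions made for illustration, not part of the embodiment.

def modify_timing_list(timing_list, frame_at, similarity,
                       interval=10.0, divisions=5, threshold=0.5):
    """For each (keyword, timing) pair, additionally correlate with the keyword the
    timings around it whose images are not similar to the images of the other timings."""
    for _keyword, timings in timing_list.entries.items():
        additional = []
        for target_timing in list(timings):
            step = interval / divisions
            # Candidates for additional timing: timings dividing the interval around
            # the target timing, excluding the target timing itself.
            candidates = [target_timing - interval / 2 + k * step
                          for k in range(divisions + 1)]
            candidates = [t for t in candidates if t >= 0 and t != target_timing]
            kept_images = [frame_at(target_timing)]
            for t in candidates:
                image = frame_at(t)
                # Keep the candidate only if its image is not similar to the images
                # already kept (degree of similarity not more than a threshold).
                if all(similarity(image, other) <= threshold for other in kept_images):
                    additional.append(t)
                    kept_images.append(image)
        timings.extend(additional)
    return timing_list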
Thereafter, in FIG. 17, using the modified timing information list, the same processes as steps S103 to S111 of FIG. 16 are performed at steps S124 to S132.
Therefore, according to the processing of FIG. 17, thumbnails of images which are not similar to each other are displayed as the representation images together with the keywords.
As a result, since the processing of FIG. 17 is performed in the specific content retrieval process of FIG. 10, thumbnails of scenes which are not similar to each other are displayed with respect to the contents containing utterance of the keywords acquired from the inputs from the user. Therefore, the user is able to grasp the details of the contents at a glance and can find the contents that the user is interested in more easily than in the case of displaying the thumbnails of similar scenes.
Computer Implementing the Present Invention
The above-described processing series can be executed not only by hardware but also by software. When the processing series is executed by software, a program included in the software is installed in a general-purpose computer.
With reference now to FIG. 19, an exemplary configuration of a computer according to an embodiment of the present invention, in which the program for executing the above-described processing series is installed, will be described.
The program may first be recorded in a hard disk 105 or a ROM 103 serving as a recording medium installed in the computer.
Alternatively, the program may be stored (recorded) in a removable storage medium 111. The removable storage medium 111 may be provided as so-called package software. Here, the removable storage medium 111 may be a flexible disk, a CD-ROM (compact disc read only memory), an MO (magneto-optical) disc, a DVD (digital versatile disc), a magnetic disc, or a semiconductor memory.
The program may be installed in the internal hard disk 105 by downloading the program via a communication network or a broadcasting network, in addition to being installed in the computer from the removable storage medium 111 as described above. That is to say, the program may be transferred wirelessly from a download site to the computer via a digital broadcasting satellite or may be transferred through wires to the computer via a network such as a LAN (local area network) or the Internet.
The computer has incorporated therein a CPU (central processing unit) 102, and an input/output interface 110 is connected to the CPU 102 via a bus 101.
The CPU 102 executes a program stored in the ROM (read only memory) 103 in response to, and in accordance with, commands which are input via the input/output interface 110 by a user operating an input unit 107 or the like. Alternatively, the CPU 102 executes a program stored in the hard disk 105 by loading the program into a RAM (random access memory) 104.
In this way, the CPU 102 executes the processing corresponding to the above-described flowcharts or the processing performed by the configuration illustrated in the block diagrams. Then, the CPU 102 outputs, transmits, or records the processing results through an output unit 106, through a communication unit 108, or in the hard disk 105, for example, via the input/output interface 110 as necessary.
The input unit 107 includes a keyboard, a mouse, a microphone, and the like. The output unit 106 includes an LCD (liquid crystal display), a speaker, and the like.
Here, in this specification, the processing that the computer executes in accordance with the program may not be executed in a time-sequential manner in the order described in the flowcharts. That is to say, the processing that the computer executes in accordance with the program includes processing that is executed in parallel and/or separately (for example, parallel processing or object-based processing).
Moreover, the program may be executed by a single computer (processor) or may be executed by a plurality of computers in a distributed manner. Furthermore, the program may be executed by being transferred to a computer at a remote location.
The present application contains subject matter related to that disclosed in Japanese Priority Patent Application JP 2009-003688 filed in the Japan Patent Office on Jan. 9, 2009, the entire content of which is hereby incorporated by reference.
The embodiments of the present invention are not limited to the above-described embodiments, but various modifications can be made in a range not departing from the gist of the present invention.
For example, the text acquisition portion 21 may be configured by the related text acquisition unit 50 of FIG. 4 and the user-input acquisition portion 61 of FIG. 8.