CN100585663C - language learning system - Google Patents

language learning system

Info

Publication number
CN100585663C
CN100585663C, CN200510132618A
Authority
CN
China
Prior art keywords
sound
mentioned
voice
learner
database
Prior art date
Legal status
Expired - Fee Related
Application number
CN200510132618A
Other languages
Chinese (zh)
Other versions
CN1794315A (en)
Inventor
江本直博
Current Assignee
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date
Filing date
Publication date
Application filed by Yamaha Corp
Publication of CN1794315A
Application granted
Publication of CN100585663C
Legal status: Expired - Fee Related (current)
Anticipated expiration


Abstract

A learning method is provided that uses a model voice similar to the learner's own. A database stores, for each speaker, a feature quantity extracted from the speaker's voice in association with one or more pieces of voice data produced by that speaker for language learning. The learner's voice is acquired, and a feature quantity of the learner's voice is extracted. The extracted feature quantity is compared with the feature quantities of the multiple speakers stored in the database, and the voice data of one speaker is selected from the database according to the comparison result. A voice for language learning is reproduced according to the selected voice data.

Description

Translated from Chinese
Language learning system

Technical Field

The present invention relates to a language learning system that assists language learning.

Background Art

In language learning, whether of a foreign language or one's mother tongue, and especially in self-study of pronunciation or reading aloud, a widely used method is to play back a model voice recorded on a medium such as a CD (Compact Disc) and to practice pronunciation or reading aloud by imitating it. The aim is to acquire correct pronunciation by imitating the model voice. For such learning to be effective, the learner must be able to evaluate the difference between the model voice and his or her own voice. In the great majority of cases, however, the model voice recorded on the CD is that of a particular announcer or native speaker. For most learners, the model voice is therefore uttered in a voice whose characteristics differ completely from their own, and it is difficult to judge to what extent their own pronunciation matches the model.

Techniques for solving this problem are described, for example, in Patent Documents 1 and 2. The technique of Patent Document 1 reflects parameters such as the user's intonation, speech rate, and voice quality in a model voice, converting the model voice into a voice similar to the user's. The technique of Patent Document 2 allows the learner to select any one of a plurality of model voices.

Patent Document 1: JP-A-2002-244547

Patent Document 2: JP-A-2004-133409

Summary of the Invention

However, although the technique of Patent Document 1 can correct intonation, it has difficulty correcting pronunciations that differ markedly, such as "r and l" or "s and th" in English. Moreover, because the sound waveform itself must be modified, the processing is complicated. The technique of Patent Document 2, in which a model voice is selected, has the drawback that the learner must select the model voice himself, which is cumbersome.

The present invention has been made in view of the above problems, and its object is to provide a language learning system and method that, with simpler processing, allow learning with a model voice similar to the learner's own.

To solve the above problems, the present invention provides a language learning system comprising: a database that stores, for each speaker, a feature quantity extracted from the speaker's voice in association with one or more pieces of voice data of that speaker; a voice acquisition unit that acquires the learner's voice; a feature extraction unit that extracts a feature quantity of the learner's voice from the voice acquired by the voice acquisition unit; a voice data selection unit that compares the learner's feature quantity extracted by the feature extraction unit with the feature quantities of a plurality of speakers recorded in the database and, based on the comparison, selects the voice data of one speaker from the database; and a reproduction unit that outputs sound according to the one piece of voice data selected by the voice data selection unit.

In a preferred embodiment, the voice data selection unit includes a similarity calculation unit that calculates, for each speaker, a similarity index representing the difference between the speaker's feature quantity recorded in the database and the learner's feature quantity extracted by the feature extraction unit, and then, based on the similarity indices calculated by the similarity calculation unit, selects from the database one piece of voice data associated with the feature quantity of the one speaker that satisfies a predetermined condition. In this case, the predetermined condition may be that the voice data of the one speaker associated with the similarity index indicating the highest similarity is selected.

In another preferred embodiment, the language learning system may further include a speech rate conversion unit that converts the speech rate of the voice data selected by the voice data selection unit, and the reproduction unit outputs sound according to the voice data whose speech rate has been converted by the speech rate conversion unit.

In another preferred embodiment, the language learning system may further include: a storage unit that stores a model voice; a comparison unit that compares the model voice with the learner's voice acquired by the voice acquisition unit and generates information representing the similarity between the two; and a database update unit that, when the similarity represented by the information generated by the comparison unit satisfies a predetermined condition, adds the learner's voice acquired by the voice acquisition unit to the database in association with the feature quantity extracted by the feature extraction unit.

According to the present invention, the voice of a speaker whose voice characteristics are similar to the learner's can be reproduced as the voice of the model text being studied. The learner can therefore recognize more accurately the pronunciation to be imitated (the target pronunciation), which improves learning efficiency.

Brief Description of the Drawings

FIG. 1 is a block diagram showing the functional configuration of a language learning system 1 according to a first embodiment of the present invention.

FIG. 2 is a diagram illustrating the contents of the database DB1.

FIG. 3 is a block diagram showing the hardware configuration of the language learning system 1.

FIG. 4 is a flowchart showing the operation of the language learning system 1.

FIG. 5 is a flowchart showing the update operation of the database DB1 in the language learning system 1.

FIG. 6 is a diagram illustrating the spectral envelopes of a model voice (top) and a user voice (bottom).

Detailed Description

Specific embodiments of the present invention will be described below with reference to the accompanying drawings.

<1. Structure>

FIG. 1 is a block diagram showing the functional configuration of a language learning system 1 according to the first embodiment of the present invention. A storage unit 11 stores a database DB1 in which a feature quantity extracted from a speaker's voice is stored in association with voice data of that speaker's voice. An input unit 12 acquires the learner's (user's) voice and outputs it as user voice data. A feature extraction unit 13 extracts a feature quantity from the learner's voice. A voice data selection unit 14 compares the feature quantity extracted by the feature extraction unit 13 with the feature quantities recorded in the database DB1, identifies the feature quantity of one speaker that satisfies a predetermined condition, and selects from the database DB1 the voice data associated with that feature quantity. A reproduction unit 15 reproduces the voice data selected by the voice data selection unit 14 and outputs audible sound through a speaker, headphones, or the like.

The details of the database DB1 will be described later; the language learning system 1 also has the following components for updating the database DB1. A storage unit 16 stores a model voice database DB2 in which model voice data serving as language learning samples is stored in association with the text data of the model voice. A comparison unit 17 compares the user voice data acquired by the input unit 12 with the model voice data stored in the storage unit 16. If, as a result of the comparison, the user's voice satisfies a predetermined condition, a DB update unit 18 adds the user voice data to the database DB1.

FIG. 2 is a diagram illustrating the contents of the database DB1. The database DB1 records a speaker ID ("ID001" in FIG. 2), an identifier that identifies a speaker, together with the feature quantity extracted from that speaker's voice data. The database DB1 also records, in association with one another, a model text ID (an identifier that identifies a model text), the voice data of that model text, and the pronunciation level of that voice data (described later). The database DB1 holds a plurality of data sets each consisting of a model text ID, voice data, and a pronunciation level, and each data set is stored in association with the speaker ID of the speaker who produced the voice data. In other words, the database DB1 holds voice data of a plurality of model texts obtained from a plurality of speakers, recorded per speaker through the speaker ID and feature quantity.
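
To make this layout concrete, the sketch below shows one way DB1 could be represented. It is a minimal illustration, not a structure given in the patent: the class names (Recording, SpeakerEntry), the dict keyed by speaker ID, and the three-formant feature vector are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Recording:
    text_id: str                 # model text ID, e.g. "T001" (hypothetical value)
    voice_data: bytes            # recorded waveform of the model text
    pronunciation_level: float   # higher = closer to the model voice

@dataclass
class SpeakerEntry:
    speaker_id: str                       # e.g. "ID001"
    features: Tuple[float, float, float]  # 1st-3rd formant frequencies in Hz
    recordings: List[Recording] = field(default_factory=list)

# DB1 holds one entry per speaker, keyed by speaker ID.
DB1: Dict[str, SpeakerEntry] = {}
```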

FIG. 3 is a block diagram showing the hardware configuration of the language learning system 1. A CPU (Central Processing Unit) 101 reads and executes programs stored in a ROM (Read Only Memory) 103 or an HDD (Hard Disk Drive) 104, using a RAM (Random Access Memory) 102 as a work area. The HDD 104 is a storage device that stores various application programs and data; it also stores the database DB1 and the model voice database DB2. A display 105 is a display device such as a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display) that displays text and images under the control of the CPU 101. A microphone 106 is a sound collection device for acquiring the user's voice and outputs a sound signal corresponding to the voice uttered by the user. A sound processing unit 107 has the function of converting the analog sound signal output by the microphone 106 into digital voice data, and the function of converting voice data stored in the HDD 104 into a sound signal output to a speaker 108. The user can input instructions to the language learning system 1 by operating a keyboard 109. The components described above are connected to one another by a bus 110. The language learning system 1 can also communicate with other devices through an I/F (interface) 111.

<2. Operation>

Next, the operation of the language learning system 1 according to the present embodiment will be described. The operation of reproducing the voice of a model text is described first, followed by the operation of updating the contents of the database DB1. In the language learning system 1, the CPU 101 executes a language learning program stored in the HDD 104, thereby providing the functions shown in FIG. 1. When the language learning program starts, the learner (user) operates the keyboard 109 to input a user ID, an identifier that identifies the learner. The CPU 101 stores the input user ID in the RAM 102 as the user ID of the learner currently using the system.

<2-1. Reproducing Sound>

FIG. 4 is a flowchart showing the operation of the language learning system 1. When the language learning program is executed, the CPU 101 of the language learning system 1 searches the model voice database DB2 and builds a list of available model texts. Based on the list, the CPU 101 displays on the display 105 a message prompting the user to select a model text. Following the message displayed on the display 105, the user selects one model text from the list. The CPU 101 reproduces the voice of the selected model text (step S101). Specifically, the CPU 101 reads the model voice data of the model text from the model voice database DB2 and outputs it to the sound processing unit 107. The sound processing unit 107 performs digital-to-analog conversion on the input model voice data and outputs it to the speaker 108 as an analog sound signal. The model voice is thus reproduced from the speaker 108.

After hearing the reproduced model voice from the speaker 108, the user reads the model text aloud into the microphone, imitating the model voice. That is, the user's voice is input (step S102). Specifically, this proceeds as follows. When reproduction of the model voice ends, the CPU 101 displays on the display 105 a message prompting the user to read the model text aloud, such as "Now it's your turn. Please read the model text aloud." The CPU 101 then displays on the display 105 a message indicating the operation for voice input, such as "Press the space bar to start reading; when you have finished, press the space bar again." The user operates the keyboard 109 according to the messages displayed on the display 105 and inputs his or her voice. That is, the user presses the space bar of the keyboard 109, reads the model text aloud into the microphone, and presses the space bar again when finished.

The user's voice is converted into an electrical signal by the microphone 106, which outputs a user voice signal. The user voice signal is converted into digital voice data by the sound processing unit 107 and recorded in the HDD 104 as user voice data. After reproduction of the model voice is complete, the CPU 101 starts recording the user voice data when the space bar is pressed and ends the recording when the space bar is pressed again. In other words, the user's voice from the first press of the space bar to the second is recorded in the HDD 104.

Next, the CPU 101 performs feature extraction processing on the obtained user voice data (step S103). Specifically, this proceeds as follows. The CPU 101 divides the voice data into predetermined time segments (frames). The CPU 101 takes the logarithm of the amplitude spectrum obtained by Fourier-transforming each frame of the waveform, then applies an inverse Fourier transform to obtain the spectral envelope of each frame. From the spectral envelope thus obtained, the CPU 101 extracts the formant frequencies of the first and second formants. In general, vowels are characterized by the distribution of the first and second formants. Starting from the beginning of the voice data, the CPU 101 matches the formant frequency distribution obtained from each frame against the formant frequency distribution of a predetermined vowel (for example, "a"). When a frame is judged by this matching to correspond to the vowel "a", the CPU 101 computes the formant frequencies of predetermined formants of that frame (for example, the first, second, and third formants). The CPU 101 stores the computed formant frequencies in the RAM 102 as the feature quantity P of the user's voice.
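
As a rough illustration of the envelope extraction described above (log amplitude spectrum followed by an inverse Fourier transform, i.e. cepstral smoothing), here is a minimal NumPy sketch. The window, the lifter cutoff, and the simple peak picking used to estimate formants are assumptions; the patent does not specify them.

```python
import numpy as np

def spectral_envelope(frame, lifter_cutoff=30):
    """Cepstral smoothing: log amplitude spectrum -> inverse FFT -> lifter -> FFT."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-12)    # log of the amplitude spectrum
    cepstrum = np.fft.irfft(log_mag)              # inverse Fourier transform
    cepstrum[lifter_cutoff:-lifter_cutoff] = 0.0  # keep low quefrencies = envelope
    return np.fft.rfft(cepstrum).real             # smoothed log spectral envelope

def formant_frequencies(envelope, sample_rate, n_formants=3):
    """Estimate formants as the first few local maxima of the envelope."""
    peaks = [i for i in range(1, len(envelope) - 1)
             if envelope[i - 1] < envelope[i] > envelope[i + 1]]
    bin_hz = (sample_rate / 2) / (len(envelope) - 1)
    return [p * bin_hz for p in peaks[:n_formants]]
```

Under these assumptions, the feature quantity P would be the tuple of formant frequencies taken from frames matched to the chosen vowel.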

Then, the CPU 101 selects from the database DB1 the voice data associated with the feature quantity similar to the feature quantity P of the user's voice (step S104). Specifically, the extracted feature quantity P is compared with the feature quantities recorded in the database DB1 to determine the feature quantity closest to P. In the comparison, for example, the differences between the first to third formant frequencies of the feature quantity P and of each entry in the database DB1 are computed, and the sum of the absolute values of the differences of the three formant frequencies is computed as a similarity index representing the similarity between the two. The CPU 101 determines, in the database DB1, the feature quantity with the smallest computed similarity index, that is, the feature quantity closest to P. The CPU 101 then extracts the voice data associated with the determined feature quantity and stores it in the RAM 102.
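
A minimal sketch of this lookup, reusing the SpeakerEntry layout assumed earlier: the sum of absolute differences of the first three formant frequencies serves as the similarity index, and the entry with the smallest index wins.

```python
def select_voice_data(user_features, db):
    """Pick the recordings of the speaker whose formants are closest to the user's.

    user_features: (F1, F2, F3) formant frequencies in Hz.
    db: dict of speaker_id -> SpeakerEntry (see the earlier sketch).
    """
    def similarity_index(entry):
        # Smaller = more similar: sum of absolute formant-frequency differences.
        return sum(abs(u - s) for u, s in zip(user_features, entry.features))

    best = min(db.values(), key=similarity_index)
    return best.recordings
```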

Then, the CPU 101 reproduces the voice data (step S105). Specifically, the CPU 101 outputs the voice data to the sound processing unit 107. The sound processing unit 107 performs digital-to-analog conversion on the input voice data and outputs it to the speaker 108 as a sound signal. The selected voice data is thus reproduced from the speaker 108 as sound. Because the voice data is selected by matching feature quantities, the reproduced voice resembles the learner's own voice. Therefore, even for model texts that are hard to imitate when heard only in the voice of a speaker with completely different voice characteristics (an announcer, a native speaker, etc.), the learner hears them uttered in a voice very similar to his or her own, can understand the pronunciation to be imitated more accurately, and learns more efficiently.

<2-2. Database Update>

Next, the update operation of the database DB1 will be described.

FIG. 5 is a flowchart showing the update operation of the database DB1 in the language learning system 1. First, the model voice is reproduced and the user's voice is input by the processing of steps S101-S102 described above. The CPU 101 then compares the model voice with the user's voice (step S201). Specifically, the CPU 101 divides the waveform representing the model voice data into predetermined time segments (frames), and likewise divides the waveform representing the user voice data into frames. The CPU 101 takes the logarithm of the amplitude spectrum obtained by Fourier-transforming each frame of the two waveforms, then applies an inverse Fourier transform to obtain the spectral envelope of each frame.

FIG. 6 is a diagram illustrating the spectral envelopes of a model voice (top) and a user voice (bottom). The spectral envelopes shown in FIG. 6 consist of three frames, frame I to frame III. For each frame, the CPU 101 compares the obtained spectral envelopes and quantifies their similarity. The similarity can be quantified (that is, the similarity index computed), for example, as follows. The CPU 101 may compute, over the entire voice data, the sum of the distances between corresponding points when the frequencies and spectral densities of the characteristic formants are plotted on a spectral density-frequency diagram, as the similarity index. Alternatively, it may compute, over the entire voice data, the value obtained by integrating the differences in spectral density at given frequencies, as the similarity index. Since the model voice and the user voice generally differ in length (duration), it is preferable to align their lengths before the above processing.
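
As one reading of the second variant (integrating spectral-density differences), here is a minimal sketch. It assumes both voices have already been framed, length-aligned, and reduced to log spectral envelopes as in the earlier sketch; averaging over frames is my assumption.

```python
import numpy as np

def envelope_similarity_index(model_envelopes, user_envelopes):
    """Integrate the spectral-density difference between model and user voice.

    Both arguments are arrays of shape (n_frames, n_bins) holding log spectral
    envelopes of length-aligned voices. Smaller values mean closer voices.
    """
    model = np.asarray(model_envelopes)
    user = np.asarray(user_envelopes)
    # Per-frame integral (sum over frequency bins) of the absolute difference,
    # averaged over all frames of the utterance.
    return float(np.mean(np.sum(np.abs(model - user), axis=1)))
```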

Referring again to FIG. 5: based on the computed similarity index, the CPU 101 judges whether to update the database DB1 (step S202). Specifically, the HDD 104 stores in advance the condition for additionally registering acquired voice data in the database DB1. The CPU 101 judges whether the similarity index computed in step S201 satisfies this registration condition. If the registration condition is satisfied (step S202: Yes), the CPU 101 proceeds to step S203 described below. If it is not satisfied (step S202: No), the CPU 101 ends the processing.

If the registration condition is satisfied, the CPU 101 performs database update processing (step S203). Specifically, the CPU 101 assigns to the voice data satisfying the registration condition the user ID that identifies the learner (user) who is the speaker of that voice data. The CPU 101 searches the database DB1 for the same user ID and additionally registers the voice data in the database DB1 in association with that user ID. If the user ID extracted from the update request is not yet registered in the database DB1, the CPU 101 additionally registers the user ID and registers the voice data in association with it. In this way, the learner's voice data is additionally registered in the database DB1, updating it.
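
A minimal sketch of this registration step under the same assumed layout. The threshold and the mapping from similarity index to pronunciation level are illustrative; the patent only says the index must satisfy a stored condition.

```python
def update_database(db, user_id, user_features, voice_data, text_id,
                    similarity_index, threshold=0.5):
    """Append the learner's recording to DB1 if it is close enough to the model."""
    if similarity_index > threshold:   # registration condition not satisfied
        return False
    entry = db.get(user_id)
    if entry is None:                  # first registration for this learner
        entry = SpeakerEntry(speaker_id=user_id, features=user_features)
        db[user_id] = entry
    entry.recordings.append(Recording(
        text_id=text_id,
        voice_data=voice_data,
        # Illustrative mapping: smaller index -> higher pronunciation level.
        pronunciation_level=1.0 / (1.0 + similarity_index),
    ))
    return True
```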

The database update operation described above may be performed simultaneously with the sound reproduction operation described earlier, or after the reproduction operation is complete. By successively adding learners' voice data to the database DB1 in this way, voice data from many speakers accumulate in it. Consequently, the more the language learning system 1 is used, the more speakers' voice data are registered in the database DB1, and the higher the probability that a new learner using the language learning system 1 will be played a voice with characteristics similar to his or her own.

<3. Modifications>

The present invention is not limited to the above embodiment, and various modifications are possible.

<3-1. Modification 1>

In the above embodiment, after the voice data selected in step S104 is stored in the RAM 102, the CPU 101 may apply speech rate conversion to the voice data. Specifically, the RAM 102 stores in advance a variable a that specifies the ratio of the speech rate before and after conversion. The CPU 101 processes the selected voice data so that its duration (the time required to reproduce the voice data from beginning to end) becomes a times the original. When a > 1, the speech rate conversion lengthens the sound, that is, the speech becomes slower. Conversely, when a < 1, the conversion shortens the sound, that is, the speech becomes faster. In this modification, the initial value of the variable a is set to a value greater than 1. Therefore, after the model voice is reproduced and the user's voice is input, the model text reproduced in a voice similar to the user's is played back more slowly than the model voice, so the learner can recognize the pronunciation to be imitated (the target pronunciation) even more clearly.
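
One way to sketch this conversion is a phase-vocoder time stretch, for example with librosa (my choice; the patent does not name an algorithm). Note that librosa.effects.time_stretch takes a speed-up factor, so a duration ratio of a corresponds to rate = 1/a.

```python
import librosa

def convert_speech_rate(samples, a=1.25):
    """Stretch a mono waveform so its duration becomes a times the original.

    a > 1 slows the speech down; a < 1 speeds it up.
    """
    return librosa.effects.time_stretch(samples, rate=1.0 / a)

# Usage sketch: slow the selected voice data down by 25%.
# y, sr = librosa.load("selected_voice.wav", sr=None)   # hypothetical file
# slower = convert_speech_rate(y, a=1.25)
```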

<3-2. Modification 2>

In the above embodiment, step S104 selects the voice data associated with the feature quantity closest to the feature quantity extracted from the learner's (user's) voice, but the selection condition is not limited to closeness to the learner's feature quantity. For example, a pronunciation level may be recorded in the database DB1 in advance in association with the voice data of each model text (an index of its similarity to the model voice; the higher the pronunciation level, the closer to the model voice), and the pronunciation level may be added to the conditions for selecting voice data. As a concrete condition, for example, the feature quantity closest to the learner's may be selected from among voice data whose pronunciation level is at or above a certain level. Alternatively, the voice data with the highest pronunciation level may be selected from among those whose feature similarity is at or above a certain value. The pronunciation level can be computed in the same way as the similarity index in step S201, for example.
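
A minimal sketch of the first concrete condition (closest formants among recordings whose pronunciation level meets a floor), again reusing the assumed SpeakerEntry/Recording layout; min_level is an illustrative threshold.

```python
def select_with_level(user_features, db, min_level=0.8):
    """Among speakers with at least one recording at or above min_level,
    pick the one whose formants are closest to the learner's."""
    def formant_distance(entry):
        return sum(abs(u - s) for u, s in zip(user_features, entry.features))

    candidates = [e for e in db.values()
                  if any(r.pronunciation_level >= min_level for r in e.recordings)]
    if not candidates:
        return None
    best = min(candidates, key=formant_distance)
    return [r for r in best.recordings if r.pronunciation_level >= min_level]
```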

<3-3. Modification 3>

The configuration of the system is not limited to that described in the above embodiment. The language learning system 1 may be connected to a server device through a network, with the server taking over part of the functions of the language learning system.

In the above embodiment, the CPU 101 realizes the functions of the language learning system in software by executing the language learning program. However, the system may also be realized in hardware, using electronic circuits or the like corresponding to the functional components shown in FIG. 1.

<3-4. Modification 4>

In the above embodiment, the formant frequencies of the first to third formants are used as the speaker's voice feature quantity, but the feature quantity is not limited to formant frequencies. It may also be a feature quantity computed by another sound analysis method, such as a spectrogram.

Claims (6)

CN200510132618A (priority 2004-12-24, filed 2005-12-23) | Language learning system | Expired - Fee Related | granted as CN100585663C (en)

Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
JP2004373815 | 2004-12-24 | |
JP2004373815A (published as JP2006178334A) | 2004-12-24 | 2004-12-24 | Language learning system

Publications (2)

Publication Number | Publication Date
CN1794315A (en) | 2006-06-28
CN100585663C | 2010-01-27

Family

ID=36732492

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN200510132618A (Expired - Fee Related; granted as CN100585663C) | Language learning system | 2004-12-24 | 2005-12-23

Country Status (3)

Country | Link
JP (1) | JP2006178334A (en)
KR (1) | KR100659212B1 (en)
CN (1) | CN100585663C (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
JPS6449081A (en) * | 1987-08-19 | 1989-02-23 | Chuo Hatsujo KK | Pronunciation training apparatus
JP2844817B2 * | 1990-03-22 | 1999-01-13 | NEC Corp | Speech synthesis method for utterance practice
JP3931442B2 * | 1998-08-10 | 2007-06-13 | Yamaha Corp | Karaoke equipment
JP2001051580A * | 1999-08-06 | 2001-02-23 | Nyuuton KK | Voice learning device
JP2002244547A * | 2001-02-19 | 2002-08-30 | Nippon Hoso Kyokai (NHK) | Computer program for utterance learning system and server device cooperating with this program
JP2004093915A * | 2002-08-30 | 2004-03-25 | Casio Computer Co Ltd | Server device, information terminal device, learning support device, and program
JP3842746B2 * | 2003-03-03 | 2006-11-08 | Fujitsu Ltd | Teaching material providing program, teaching material providing system, and teaching material providing method

Also Published As

Publication number | Publication date
JP2006178334A (en) | 2006-07-06
KR20060073502A | 2006-06-28
CN1794315A | 2006-06-28
KR100659212B1 | 2006-12-20

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
C14 | Grant of patent or utility model
GR01 | Patent grant
CF01 | Termination of patent right due to non-payment of annual fee (granted publication date: 2010-01-27; termination date: 2015-12-23)
EXPY | Termination of patent right or utility model
