SYSTEM AND METHOD FOR AUDIO ANNOTATION
Technical Field
[0001] The present invention relates to a system and method for audio annotation on a document.
Background of Invention
[0002] Annotation or note-taking while reading a digital document is a useful tool for document management, comprehension and the like. Annotation in a desktop device environment has its challenges and is made more difficult on mobile devices such as mobile phones and tablets.
[0003] On mobile devices, annotation is difficult when the user is away from their desk or without a stylus, for example. Further, in existing annotation systems, the user may have to carefully touch or select the first word to be highlighted, drag to select the rest of the passage, and then use a virtual keyboard to type the annotation associated with the selected text.
[0004] While existing systems such as Dynomite™, Evernote™ and NiCEBook™ facilitate the structuring, organising and sharing of textual annotations, and others like Sononcent™, OneNote™ and Notability™ enable the reader to make annotations on digital documents by using voice as input, in each case the user is required to select text segments by hand or with a mouse and then anchor a voice annotation to the selection.
[0005] A problem with these types of arrangements, whether desktop or mobile, is that the user must manually append the text passages with the recorded voice annotations. The action of preparing to take the annotation draws the reader's mind away from the annotation they wish to make and instead results in them focusing on the mechanism by which to record the annotation. This has the undesirable effect of adversely affecting their ability to comprehend the reading material.
[0006] Other difficulties associated with existing annotation taking systems are the different contexts of use, as well as the ways in which users hold the device - which can act to further hinder the annotation taking task.
[0007] It would be desirable to provide a system which ameliorates or at least alleviates one or more of the above-mentioned problems or provides a useful alternative.
[0008] A reference herein to a patent document or other matter which is given as prior art is not to be taken as an admission that that document or matter was known or that the information it contains was part of the common general knowledge as at the priority date of any of the claims.
Summary of Invention
[0009] According to a first aspect, the present invention provides a system for anchoring an audio annotation to a passage within an electronic document, the system including: a controller having a processor; a recording component operable by the controller, the recording component including a microphone and an eye-tracker component to capture the gaze of the user, wherein the processor carries out the steps of: in response to an audio input to the microphone, while the audio input is being received, evaluating via the eye-tracker component the user's gaze, thereby determining the passage in the document that the user's gaze is directed to; and mapping the audio input to the passage in the document.
[0010] Advantageously, the present invention provides an interactive gaze assisted audio annotation taking system which enables one or more users to implicitly annotate (which may include voice annotations or voice to text annotations) passages on a document in a seamless manner. The passage in the document may be text, or may be non-text such as a figure, table, chart, picture or the like, or the text may refer to the text associated with a table or graphic (such as a label). Implicit annotation refers to tagging or annotating passages based on natural user behaviours. In the present invention, the behaviour may correspond to the user's gaze activity and the temporal order in which they read the document. Advantageously, the ability to implicitly tag or annotate does not require the user to deliberately perform any action to associate the annotation with the relevant passage.
[0011] The present invention may take the form of a PDF viewer application in which the present invention is embedded, with an eye-tracker component to evaluate a user's gaze and facilitate the annotation of audio with reference to passages, thereby enabling the user to make an annotation at a place in a document accurately and without conscious effort.
[0012] In a further advantage, audio input is hands-free, making it suitable for interaction with mobile devices and users can also make the annotation without taking their attention away from the document.
[0013] Audio input may take any suitable form by way of a microphone and may include natural language processing to make translation between speech and text fast and accurate.
[0014] The present invention allows the task of audio annotation to be seamless, making it easy and convenient for users to interact with digital text on their device by utilising the user's gaze as a resource for anchoring the audio annotation to passages.
[0015] In an embodiment, the evaluation includes determining position data associated with a slider on the electronic document. The position data may include the position of the slider in the document and the page number of the document at the time the audio input was received.
[0016] In an embodiment, the evaluation includes determining fixation gaze data, the fixation gaze data being data that is observed during a window spanning the audio annotation. The fixation gaze data may include one or more gaze points observed in the window and grouped into fixations according to a dispersion and/or duration threshold.
[0017] In an embodiment, the evaluation includes a machine learning component trained on one or more of gaze and/or temporal features of one or more users that reflect the reading and annotation-taking patterns of the user. Any suitable temporal feature may be recorded, and may include, for example, the duration the user has spent reading a passage and the temporal order within which the passage has been read before recording an annotation, or the like.
[0018] In an embodiment, indicia may be displayed for the audio mapped to the passage in the document. A highlight may be provided in the relevant passage of the document upon user engagement with the indicia.
[0019] According to a second aspect, the present invention provides a method for anchoring an audio annotation to a passage within an electronic document, the method including: receiving an audio input to a microphone, and while the audio input is being received, evaluating via an eye-tracker component the user's gaze, thereby determining the passage in the document that the user's gaze is directed to, and mapping the audio input to the passage in the document.
Brief Description of Drawings
[0020] The invention will now be described in further detail by reference to the accompanying drawings. It is to be understood that the particularity of the drawings does not supersede the generality of the preceding description of the invention.
[0021] Figure 1 is a schematic diagram of an example network that can be utilised to give effect to the system according to an embodiment of the invention;
[0022] Figure 2 is a diagram illustrating devices that may be utilised with the system and method of the present invention;
[0023] Figure 3 is a schematic diagram illustrating operation of the system and method of the present invention;
[0024] Figure 4 is a schematic diagram illustrating operation of the system and method of the present invention in use by a user; and
[0025] Figure 5 is a flow diagram illustrating the process steps adopted by the system and method of the present invention.
Detailed Description
[0026] Referring to Figure 1, there is shown a system 100 for automatically and implicitly mapping audio recordings to specific passages on a digital document, with devices making up the system, in accordance with an exemplary embodiment of the present invention. The system 100 includes one or more servers 120 which include one or more databases 125, and one or more devices 110a, 110b, 110c (associated with a user for example) which may be communicatively coupled to a cloud computing environment 130, “the cloud”, and interconnected via a network 115 such as the internet or a mobile communications network. It will also be appreciated that the system and method may reside on the one or more devices 110a, 110b, 110c. Devices 110a, 110b, 110c may take any suitable form and may include for example smartphones, tablets, laptop computers, desktop computers, server computers, among other forms of computer systems. Each of the devices 110a, 110b, 110c includes a microphone and an eye-tracker component, which will be further described with reference to Figure 2.
[0027] Although “cloud” has many connotations, according to embodiments described herein, the term includes a set of network services that are capable of being used remotely over a network, and the method described herein may be implemented as a set of instructions stored in a memory and executed by a cloud computing platform. The software application may provide a service to one or more servers 120, or support other software applications provided by third party servers. Examples of services include a website, a database, software as a service, or other web services.
[0028] The transfer of information and/or data over the network 115 can be achieved using wired communications means or wireless communications means. It will be appreciated that embodiments of the invention may be realised over different networks, such as a MAN (metropolitan area network), WAN (wide area network) or LAN (local area network). Also, embodiments need not take place over a network, and the method steps could occur entirely on a client or server processing system.
[0029] Figure 2 illustrates the devices 110A, 110B and 110C that are utilised with the system and method of the present invention. Whether on a desktop computer 110A, tablet 110B or mobile device 110C, each of the devices includes a microphone 210A, 210B or 210C and an eye-tracker component 215A, 215B or 215C respectively. The eye-tracker component 215A, 215B or 215C captures the user's gaze on the document that is displayed on the device 110A, 110B or 110C. The microphone 210A, 210B or 210C is provided to capture audio input from the user in order to anchor their audio annotation to a particular passage within a document. The passage in the document may be text, or may be non-text such as a figure, table, chart, picture or the like, or the text may refer to the text associated with a table or graphic (such as a label). In an embodiment, a camera associated with the device and the eye-tracker component 215A, 215B or 215C may be one and the same unit where possible (to save on space in the device or on the display of the device). Any suitable eye-tracker component may be provided, such as a Tobii 4C, although it will be appreciated that any particular eye-tracker may be utilised. Any suitable frequency or frame rate may be captured by the eye-tracker component dependent on system resources, and may be, for example, a frequency of 90 Hz or higher. A lower frequency or frame rate may also be possible (i.e. in the order of 30 Hz). The microphone 210A, 210B or 210C may be an internal microphone associated with the device or may be an external microphone.
[0030] Figure 3 is a schematic diagram illustrating the system 300 of the present invention in operation. A user 305 is associated with an input component 310 which may consist of, for example, a device 110C. The device 110C includes a microphone 210C for receiving audio from the user 305 as well as an eye-tracker component 215C for tracking the gaze of the user 305. Also provided is a gaze feature generation component 315 and machine learning predictive model component 320, a text extraction component 325 and an annotated document 330 which is being viewed by the user on the device 110C.
[0031] In operation, a user 305 can open the digital document 330, which is typically a PDF file (but need not be), on their device 110C. Upon loading the digital document 330, passages from the PDF are extracted via computer vision or the like.
[0032] Extraction of passages from the PDF is carried out by converting the PDF pages to images. The extracted images may then be passed to an optical character recognition engine, for example PyTessBaseAPI, which is a library that provides functions to segment images into text components. The paragraphs are then extracted from the images by requesting paragraph-level components from the engine. The extracted paragraphs are then saved as bounding boxes for use by the system and method of the present invention in anchoring the user annotation to the appropriate passage in the document.
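By way of illustration only, the following is a minimal sketch of how paragraph bounding boxes might be extracted with PyTessBaseAPI; the use of pdf2image for page rendering, the 150 dpi setting and the function name extract_passages are assumptions for the sketch rather than features of the described embodiment.

```python
from pdf2image import convert_from_path          # assumed PDF-to-image step
from tesserocr import PyTessBaseAPI, RIL

def extract_passages(pdf_path, dpi=150):
    """Return paragraph bounding boxes for every page of the PDF.

    Each entry is a dict with 'page' plus the 'x', 'y', 'w', 'h' keys
    returned by the OCR engine at paragraph level (RIL.PARA), in image pixels.
    """
    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per PDF page
    passages = []
    with PyTessBaseAPI() as api:
        for page_number, image in enumerate(pages):
            api.SetImage(image)
            # text_only=True skips non-text regions such as figures
            components = api.GetComponentImages(RIL.PARA, True)
            for _, box, _, _ in components:
                passages.append({"page": page_number, **box})
    return passages
```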
[0033] While the user 305 reads the document, the gaze coordinates are continuously recorded by the eye-tracker component 215C. The user 305 would typically press a recording button provided on an interface associated with the device 110C, or may start the recording by way of a voice command. Upon completion of the audio annotation recording, the gaze coordinates are mapped to the page coordinates to keep track of where the user 305 was looking on the page while reading and making the annotations. The page coordinates are allocated to the various extracted passages and region-based gaze and temporal features are calculated.
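A minimal sketch of the mapping from screen-space gaze samples to the extracted passage boxes is given below; the viewer-state parameters (vertical scroll offset and zoom factor) and the function names are illustrative assumptions, not details taken from the described embodiment.

```python
def screen_to_page(gaze_x, gaze_y, scroll_y, zoom):
    """Convert a screen-space gaze sample to page coordinates,
    assuming the viewer exposes its vertical scroll offset and zoom factor."""
    return gaze_x / zoom, (gaze_y + scroll_y) / zoom

def passage_at(page_x, page_y, page_number, passages):
    """Return the extracted passage whose bounding box contains the gaze point,
    using the boxes saved during passage extraction ('x', 'y', 'w', 'h' per page)."""
    for p in passages:
        if (p["page"] == page_number
                and p["x"] <= page_x <= p["x"] + p["w"]
                and p["y"] <= page_y <= p["y"] + p["h"]):
            return p
    return None  # gaze fell outside all passages (e.g. in the margins)
```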
[0034] It will be appreciated that the audio input from the user may be raw audio data which may be processed to extract and provide the audio annotation associated with the document. For example, raw audio data from the microphone may be filtered to remove noise and/or frequency smoothing may be applied. An audio signal threshold may further be applied (for example a threshold of 26 dB) in order to remove silent segments in the audio input. In addition, audio segments within the audio input having a duration below a threshold amount of time may be discarded.
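The following is a minimal sketch of such pre-processing, assuming the 26 dB figure is interpreted as a threshold below the peak signal level and using librosa for segmentation; the 0.5 second minimum duration is an illustrative value only.

```python
import librosa

def non_silent_segments(audio_path, top_db=26, min_duration=0.5):
    """Return (start, end) times in seconds of audio segments that survive
    the silence threshold and the minimum-duration filter."""
    samples, sample_rate = librosa.load(audio_path, sr=None)
    # split() keeps intervals whose level is within top_db of the peak,
    # i.e. it drops the silent portions of the recording
    intervals = librosa.effects.split(samples, top_db=top_db)
    segments = []
    for start, end in intervals:
        duration = (end - start) / sample_rate
        if duration >= min_duration:             # discard very short segments
            segments.append((start / sample_rate, end / sample_rate))
    return segments
```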
[0035] In an embodiment, for each audio input, a Region-of-Analysis (ROA) may be defined. In some situations, gaze patterns while a user records audio may indicate that the audio is not directly related to the passages where the user was looking while they spoke, but may instead be related to the passages the user had read before recording the annotation. The ROA may therefore extend from the end time of a particular audio input to the end time of the successive audio input under consideration. For example, the ROA for each audio input may include the period of silence from the end of the previous audio input until the end of the present audio input.
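A sketch of how ROAs could be derived from the detected audio segments follows; the assumption that the first ROA begins when the document is opened is an illustrative choice, not stated in the description above.

```python
def regions_of_analysis(segments, document_open_time=0.0):
    """Compute a Region-of-Analysis for each audio annotation.

    segments: chronologically ordered (start, end) times of the audio inputs.
    Each ROA runs from the end of the previous audio input (or the time the
    document was opened, for the first annotation) to the end of the present
    audio input, so it covers the silence before the annotation as well as
    the annotation itself.
    """
    roas = []
    previous_end = document_open_time
    for _, end in segments:
        roas.append((previous_end, end))
        previous_end = end
    return roas
```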
[0036] As noted above, computer vision may be utilised to parse what the user is looking at. This may be combined with ROA allowing the image data frames to be fetched within the ROA in order to better map the gaze data to pixels in the document.
[0037] The machine learning component 320 is preferably pre-trained on a feature vector consisting of gaze features extracted from one or more participants while they read and make audio annotations. This assists the system in predicting the text regions to which a user's audio annotations refer.
[0038] The pre-training of the machine learning component may be via data from a number of users, in which each user reads one or more documents and makes audio annotations while their gaze is recorded. In this case, the users explicitly highlight the passages to which each audio annotation corresponds. The features extracted from the gaze data serve as the input features and the passage highlighted by the user may serve as the ground truth for the machine learning model.
[0039] The text is extracted at component 325 and the predicted passages are highlighted in the document 330, with indicia such as a sound icon being anchored to the top of the predicted passage. The user 305 can then retrieve the recorded audio annotation and visualise the anchored passage by tapping on the indicia displayed beside the passage. Tapping on the indicia plays the recorded audio annotation as well as highlighting the relevant passage.
[0040] It will be appreciated that the annotation may be provided in audio format and/or may be converted into text such that the user 305 has the option of listening to the audio annotation or viewing a speech-to-text translation of the audio annotation highlighted at the relevant point in the document. Advantageously, the audio annotations are anchored to a passage implicitly, in that there is no requirement for the user to manually highlight a passage for annotation.
[0041] The machine learning component 320 predicts the passages which the audio annotation is associated with. The goal of the machine learning component 320 is to map a feature vector of the gaze and temporal features computed from the passages to classify whether an audio annotation was related to a specific passage or not.
[0042] For each extracted passage, the classifier may either predict “not annotated” (i.e. the audio annotation was not made with reference to that particular passage) or “annotated” (i.e. audio annotation was made with reference to that particular passage). To solve this binary classification problem for each passage, the classifier may be trained on the whole dataset.
[0043] As annotation-taking behaviour is an individual activity that might differ from user to user, a generic classifier is preferable (i.e. to avoid overfitting the classifier to the audio annotation behaviour of a particular user). The classifier may be trained by way of a “leave-one-user-out” cross-validation.
[0044] The classifier may take any suitable form; for example, a classifier in the form of a random forest classifier may be provided. The classifier is generally used for predicting a category of a new observation based on the data observed during the training phase, with the goal of the classifier being to classify whether an audio annotation was related to a specific passage or not. This can be done in a number of ways, but for example, by first training the classifier on data from a number of users (for example, 32 users) who are taking annotations while reading a document. A set of gaze and temporal features is extracted from the user's reading and annotation-taking behaviour that are indicative of whether a read passage is to be mapped to an audio annotation. Once the classifier is trained and performing reasonably, it may be utilised to make a prediction for an audio annotation recorded by any user.
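A minimal training sketch consistent with the above, assuming scikit-learn, a random forest classifier and "leave-one-user-out" cross-validation implemented as leave-one-group-out; the file names, feature choices and hyperparameters are placeholders rather than details from the described embodiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# X: one row per (passage, audio annotation) pair, containing gaze and
#    temporal features such as fixation count, total fixation duration,
#    reading time and the temporal order in which the passage was read.
# y: 1 if the user highlighted that passage as the target of the annotation
#    (ground truth), 0 otherwise.
# groups: the user identifier for each row, so that cross-validation never
#    trains and tests on the same person.
X = np.load("gaze_temporal_features.npy")
y = np.load("annotated_labels.npy")
groups = np.load("user_ids.npy")

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
auc_scores = cross_val_score(classifier, X, y, groups=groups,
                             cv=LeaveOneGroupOut(), scoring="roc_auc")
print("mean AUC across held-out users:", auc_scores.mean())
```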
[0045] The mapping of audio annotations to passages within a document, without any additional information from the user, is carried out by way of analysing gaze behaviour during audio annotation. For example, it has been found that mapping is not as straightforward as merely selecting where the user is looking when speaking, but the machine learning component 320 may be provided to overcome this challenge based on a collected data set, such that anchoring audio annotations to the correct passage may be achieved with a reasonable level of performance, for example an AUC (Area Under the Curve) equal to 0.89.
[0046] In an alternative embodiment, the machine learning component 320 may be replaced by a slider component 320 which, while not as effective, can also provide a suitable result. The slider component 320 maps a user's recorded audio annotations to the reference text by way of a slider value on the electronic document which they are viewing.
[0047] For example, a vertical slider value associated with the document may be retrieved together with a page number of the document at the time that the user started to record the audio annotations. The slider value and the page number can then be used to retrieve the image frame which was being displayed on the display when the user started to record the annotations. The audio annotation is then mapped to the paragraph which was displayed at the top of the image frame.
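A sketch of this slider arrangement is given below; it assumes the slider value has been normalised to the range 0 (top of page) to 1 (bottom of page) and re-uses the per-page passage bounding boxes produced at extraction time, so the normalisation and the field names are assumptions for illustration only.

```python
def passage_for_slider(slider_value, page_number, page_height, passages):
    """Map an annotation to the paragraph displayed at the top of the frame
    shown when recording started (slider arrangement).

    slider_value: vertical slider position, normalised to 0..1.
    passages: dicts with 'page', 'y' and 'h' fields in page coordinates.
    """
    top_of_frame = slider_value * page_height
    # passages on the recorded page whose box is at least partly below
    # the top edge of the displayed frame
    visible = [p for p in passages
               if p["page"] == page_number and p["y"] + p["h"] > top_of_frame]
    # the paragraph closest to the top of the displayed frame
    return min(visible, key=lambda p: p["y"], default=None)
```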
[0048] In a further embodiment, the machine learning component 320 may be replaced with a gaze data component 320, which uses gaze data observed and analysed during a window spanning the audio annotation. The gaze points observed in this window may be clustered into "fixation" groups according to a dispersion threshold (by setting the dispersion and duration threshold parameters to, for example, 200 and 100 respectively). The fixation points may be assigned to the nearest passage. The fixation count may be counted as one gaze feature for each passage. In this way the audio annotation is primarily mapped to the passage at which the user has looked the most whilst speaking.
[0049] Fixations may be generated by way of a Dispersion-Threshold Identification algorithm or the like. The Dispersion-Threshold Identification algorithm produces accurate results using only two parameters, a dispersion threshold and a duration threshold, which may be set to 20 and 100 respectively. It will be appreciated that not all fixations will fall within passages, due to calibration offsets and tracking errors. In that regard, each fixation outside the passages may be assigned to the nearest extracted passage by using hierarchical clustering.
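A minimal sketch of Dispersion-Threshold Identification follows, using the 20 and 100 parameter values mentioned above; the assumption that dispersion is measured in pixels and duration in milliseconds is illustrative.

```python
def idt_fixations(samples, dispersion_threshold=20, duration_threshold=100):
    """Dispersion-Threshold Identification (I-DT) over chronologically
    ordered gaze samples given as (t, x, y) tuples.

    A candidate window must span at least duration_threshold; if its
    dispersion (max x - min x) + (max y - min y) stays within
    dispersion_threshold, the window is grown until the threshold is
    exceeded and its centroid is recorded as a fixation."""
    def dispersion(window):
        xs = [p[1] for p in window]
        ys = [p[2] for p in window]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    fixations = []
    i = 0
    while i < len(samples):
        # initial window covering the minimum fixation duration
        j = i
        while j < len(samples) and samples[j][0] - samples[i][0] < duration_threshold:
            j += 1
        if j >= len(samples):
            break
        if dispersion(samples[i:j + 1]) <= dispersion_threshold:
            # grow the window while dispersion stays within the threshold
            while j + 1 < len(samples) and dispersion(samples[i:j + 2]) <= dispersion_threshold:
                j += 1
            window = samples[i:j + 1]
            centroid_x = sum(p[1] for p in window) / len(window)
            centroid_y = sum(p[2] for p in window) / len(window)
            fixations.append((window[0][0], window[-1][0], centroid_x, centroid_y))
            i = j + 1
        else:
            i += 1
    return fixations
```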
[0050] A comparison between the three approaches, a slider arrangement, a fixation arrangement and a machine learning arrangement, is provided in Table 1 below, whereby the machine learning arrangement achieves a better result than the slider arrangement and the fixation arrangement.
[0051] The slider and fixation arrangements are limited but useful arrangements. For the slider arrangement, a user must explicitly position the slider at the appropriate point in the document (which would require non-passive effort from the user). The fixation arrangement requires the user to explicitly look at and fixate on (again requiring non-passive effort from the user) the passage which is to be mapped with the recorded audio annotation. The machine learning arrangement, when trained correctly on features that capture the broad reading and annotation taking patterns of users, allows more efficient mapping of audio annotations to passages.
[0052] Depending on the application, the slider arrangement or fixation arrangement may also be provided where, for example, a more simplified arrangement is desired. A slider would be an option when the eye-tracker component is not available or offline. The fixation arrangement may be useful if the user does not want the algorithm to decide and the user wants to direct the system to map an audio annotation to a passage.
[0053] Figure 4 is a schematic diagram illustrating the steps carried out in operation from the point of view of a user using the system and method of the present invention. In operation, software associated with the system 400 is running on the device 110C. At 405 the user clicks the start button on the display of the device 110C, via a keyboard, or alternatively via speech to text commands, on the document on the screen of the device 110C. The user speaks and the system maps audio annotations to the passages in the document, which in this case is a PDF file. The audio annotation from the user, based on the prediction of the machine learning component, is then anchored to the reference passages which are inferred from their gaze behaviour.
[0054] As shown in Figure 4, the system 400 offers two features, notably recording and retrieval. For example, to record an audio annotation while reading the document, the user either presses the recording button at the left side of the document viewer or, in a possible embodiment, issues a speech to text command, and speaks out loud their annotations in relation to the passage that they are looking at. A prediction is then made by the machine learning component regarding the reference passage based on the user's gaze activity. The audio annotation is saved in a memory and an indicia, which may take the form of a sound icon, may then be provided and displayed beside the reference passage to provide an indication that an annotation has been made, as shown in 415. Clicking on the audio annotation indicia plays the recorded audio annotation and preferably also highlights the referenced passages while playback is occurring. As shown in 410, the reader presses the stop button to finish making a recording, or they may do so by a voice to text command. As shown in 420, the relevant text portion is highlighted as the audio annotation is playing.
[0055] Figure 5 is a flow diagram illustrating a method for anchoring an audio annotation to a passage within an electronic document. The method 500 starts at step 505 where a user (say 305) associated with, for example, a device (i.e. 110C), which may be a mobile phone, tablet, desktop computer or the like, has a document open and a microphone and eye-tracker component in operation associated with the device. At step 505, audio input is received from the user 305 by a microphone 210C associated with the device while they are looking at the device. Control then moves to step 510 in which, while the audio input is being received, the user's gaze is evaluated via the eye-tracker component. Control then moves to step 515 in which the passage in the document that the user's gaze is directed to is determined. Control then moves to step 520 where the audio input is mapped to the passage in the document that the user's gaze is directed to, as determined in step 515.
[0056] The evaluation may occur in a number of different ways. In an embodiment, the evaluation may occur by way of the position of a slider associated with the electronic document, the slider having a value ranging from 0, where 0 is the start of the page, to MAX, where MAX is the end of the page, and this value being used to determine the location of the user's gaze.
[0057] In an alternative, the evaluation may occur by way of fixation gaze data, which is gaze data observed and analysed during a window spanning the audio annotation. The gaze points observed in this window may be clustered into "fixations" according to a dispersion threshold (by setting the dispersion and duration threshold parameters to, for example, 200 and 100 respectively). The fixation points may be assigned to the nearest passage. The fixation count may be counted as one gaze feature for each passage. In this way the audio annotation is primarily mapped to the passage at which the user has looked the most whilst speaking. For the fixation-based approach, a classifier may be utilised (which may be, for example, a logistic regression classifier), but the feature vector in this embodiment consists of the fixation count feature for each passage. Therefore, the system and method of the present invention may predict the passage at which the user has looked the most while recording the annotation, as indicated by the fixation count feature.
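A sketch of this fixation-based alternative is shown below, building on the fixation and passage structures from the earlier sketches; the nearest-passage assignment by centre distance stands in for the hierarchical clustering described above, and the scikit-learn logistic regression and placeholder file names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fixation_counts(fixations, passages):
    """Count fixations assigned to each passage within the annotation window.

    fixations: (start_t, end_t, x, y) tuples in page coordinates.
    passages: dicts with 'x', 'y', 'w', 'h' bounding-box fields.
    Each fixation is assigned to the nearest passage by centre distance
    (a simple stand-in for hierarchical clustering)."""
    counts = np.zeros(len(passages))
    for _, _, fx, fy in fixations:
        nearest = min(range(len(passages)),
                      key=lambda k: (passages[k]["x"] + passages[k]["w"] / 2 - fx) ** 2
                                  + (passages[k]["y"] + passages[k]["h"] / 2 - fy) ** 2)
        counts[nearest] += 1
    return counts

# One row per passage with a single fixation-count feature; label 1 if the
# annotation referred to that passage (placeholder training data).
X = np.load("fixation_count_feature.npy").reshape(-1, 1)
y = np.load("annotated_labels.npy")
model = LogisticRegression().fit(X, y)
```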
[0058] Preferably, the evaluation occurs by way of a machine learning component trained on one or more of gaze and/or temporal features that reflect the reading and annotation-taking patterns of the user. In operation, the classifier is trained and then fed a feature vector as described with reference to Figure 3.
[0059] The method may further include the step of providing an indicia, associated with the audio annotation, for display on the document; preferably the indicia is an audio icon. The method may further include the step of determining via the audio input the start and stop of an audio annotation. For example, this may be by way of the user dictating a voice command or the microphone sensing a voice command when the software is being used. The method may further include the step of providing playback of the audio annotation to the user and highlighting the relevant passage while the audio annotation is being played.
[0060] While the invention has been described in conjunction with a limited number of embodiments, it will be appreciated by those skilled in the art that many alternatives, modifications and variations in light of the foregoing description are possible. Accordingly, the present invention is intended to embrace all such alternatives, modifications and variations as may fall within the spirit and scope of the invention as disclosed.
[0061] The present application may be used as a basis or priority in respect of one or more future applications and the claims of any such future application may be directed to any one feature or combination of features that are described in the present application. Any such future application may include one or more of the following claims, which are given by way of example and are non-limiting in regard to what may be claimed in any future application.