This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.
Metadata is provided in transcripts.csv. This file consists of one record per line, delimited by the pipe character (0x7c). The fields are:
Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
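
As a rough sketch of how the metadata and audio formats described above might be read, the following uses only Python's standard library. The function names `read_records` and `wav_matches_spec` are illustrative, and UTF-8 encoding for the metadata file is an assumption. No field names or field count are assumed: each record is returned as a plain list of strings.

```python
import csv
import wave


def read_records(path):
    """Yield each line of the pipe-delimited metadata file as a list of fields."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps literal quote characters in a transcription intact
        # instead of treating them as CSV quoting.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


def wav_matches_spec(path):
    """Return True if the file is mono, 16-bit PCM, sampled at 22050 Hz."""
    with wave.open(path, "rb") as w:
        return (
            w.getnchannels() == 1        # single channel
            and w.getsampwidth() == 2    # 2 bytes per sample = 16 bits
            and w.getframerate() == 22050
        )


def wav_duration_seconds(path):
    """Return the clip length in seconds, computed from the WAV header."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()
```

Checking every clip with `wav_matches_spec` before training is a cheap way to catch files that were resampled or converted along the way.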
| Statistic | Value |
|---|---|
| Total Clips | 13,100 |
| Total Words | 225,715 |
| Total Characters | 1,308,678 |
| Total Duration | 23:55:17 |
| Mean Clip Duration | 6.57 sec |
| Min Clip Duration | 1.11 sec |
| Max Clip Duration | 10.10 sec |
| Mean Words per Clip | 17.23 |
| Distinct Words | 13,821 |

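
The two mean values in the statistics above can be cross-checked from the totals. A small sketch, using only numbers from the table (the variable names are illustrative):

```python
# Totals copied from the statistics table.
total_clips = 13_100
total_words = 225_715
hours, minutes, seconds = 23, 55, 17  # total duration 23:55:17

# Convert the total duration to seconds, then divide by the clip count.
total_seconds = hours * 3600 + minutes * 60 + seconds
mean_clip_duration = total_seconds / total_clips
mean_words_per_clip = total_words / total_clips

print(f"{mean_clip_duration:.2f} sec")
print(f"{mean_words_per_clip:.2f} words")
```

Both results agree with the reported 6.57 sec and 17.23 words per clip.
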
| Abbreviation | Expansion |
|---|---|
| Mr. | Mister |
| Mrs. | Misess (*) |
| Dr. | Doctor |
| No. | Number |
| St. | Saint |
| Co. | Company |
| Jr. | Junior |
| Maj. | Major |
| Gen. | General |
| Drs. | Doctors |
| Rev. | Reverend |
| Lt. | Lieutenant |
| Hon. | Honorable |
| Sgt. | Sergeant |
| Capt. | Captain |
| Esq. | Esquire |
| Ltd. | Limited |
| Col. | Colonel |
| Ft. | Fort |
(*) there's no standard expansion for "Mrs."
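
One way to apply the expansions in the table above is a single regular-expression pass. This is a minimal sketch, not the method used to prepare the dataset: the function name `expand_abbreviations` is hypothetical, matching is case-sensitive, and an abbreviation that happens to end a sentence (for example "No.") would also be expanded.

```python
import re

# Expansions copied from the table; "Misess" is the table's own choice for "Mrs.".
ABBREVIATIONS = {
    "Mr.": "Mister", "Mrs.": "Misess", "Dr.": "Doctor", "No.": "Number",
    "St.": "Saint", "Co.": "Company", "Jr.": "Junior", "Maj.": "Major",
    "Gen.": "General", "Drs.": "Doctors", "Rev.": "Reverend",
    "Lt.": "Lieutenant", "Hon.": "Honorable", "Sgt.": "Sergeant",
    "Capt.": "Captain", "Esq.": "Esquire", "Ltd.": "Limited",
    "Col.": "Colonel", "Ft.": "Fort",
}

# Match any abbreviation stem at a word boundary, followed by its period.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key[:-1]) for key in ABBREVIATIONS) + r")\."
)


def expand_abbreviations(text):
    """Replace each abbreviation from the table with its expansion."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1) + "."], text)
```

For example, `expand_abbreviations("Mr. Smith met Dr. Jones at No. 5")` returns `"Mister Smith met Doctor Jones at Number 5"`.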
This dataset is in the public domain in the US (and most likely other countries as well). There are no restrictions on its use. For more information, please see: librivox.org/pages/public-domain.
This dataset consists of excerpts from the following works:
Recordings by Linda Johnson from LibriVox. Alignment and annotation by Keith Ito. All text, audio, and annotations are in the public domain. We request that you use this dataset for good and not evil.
As this work is in the public domain, you may use it without attribution. However, if you'd like to cite it in a publication, please do so by linking to this page or using the following: