Effect of utterance style and speaker on the minimum sample size for estimating speech production rate

Pablo Arantes

Abstract

We investigated how speaking style and individual speaker affect the minimum sample size required for a stable estimate of speaking rate. The speaking styles compared are semi-spontaneous interview and sentence reading. We analyzed 20 speech samples, 10 in each style, from 5 male and 5 female speakers. Stabilization time is defined as the point along the time series of successive cumulative speaking rate values at which variability is reduced. Two criteria for defining stability are presented and compared, one based on statistical change point analysis and one on a perceptual threshold. We also tested the effect of progressively increasing the duration of the sample submitted to stability analysis, from 30 seconds up to 300 seconds. The results show that average stabilization times depend on the criterion used for detection but are generally longer for the semi-spontaneous style: 60 to 70 seconds for reading and 80 to 110 seconds for semi-spontaneous speech. Stabilization times tend to increase with sample duration. Speaker sex has no significant effect on stabilization times. Stabilization time estimates vary between speakers almost as much as they vary within speakers. The results are relevant to forensic phonetics because they indicate, on the basis of an explicit and reproducible methodology, the minimum duration a speech sample must have for speech production rate to be estimated from it for speaker comparison purposes.
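The idea of a cumulative speaking rate series and a threshold-based stabilization time can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function names, the choice of counting unit (any rate unit with a positive duration, such as a syllable), the 5% relative threshold, and the use of the final cumulative value as the reference are all assumptions made for illustration. The change point criterion is not shown.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_rate(unit_durations: Sequence[float]) -> List[float]:
    """Return the cumulative rate (units per second) after each successive unit.

    Assumes every duration is positive and expressed in seconds.
    """
    end_times = list(accumulate(unit_durations))
    return [count / t for count, t in enumerate(end_times, start=1)]


def stabilization_time(unit_durations: Sequence[float],
                       threshold: float = 0.05) -> float:
    """Return the earliest time (s) from which the cumulative rate stays
    within +/- threshold (relative) of its final value.

    Both the threshold and the final-value reference are illustrative
    assumptions, not values taken from the study.
    """
    if not unit_durations:
        raise ValueError("at least one unit duration is required")
    end_times = list(accumulate(unit_durations))
    rates = cumulative_rate(unit_durations)
    final_rate = rates[-1]
    stable_from = 0
    for i, rate in enumerate(rates):
        # Each value outside the band pushes the stable region to start
        # at the following unit; the last value is always inside the band.
        if abs(rate - final_rate) / final_rate > threshold:
            stable_from = i + 1
    return end_times[stable_from]
```

Calling `stabilization_time(durations)` on a list of unit durations from one sample gives a single stabilization estimate; repeating the call on progressively longer prefixes of the same list would mimic the effect of increasing the sample size.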
