The present study has two main goals. The first is to describe the effects of three speaking styles (spontaneous interview, sentence reading and word list reading) on statistical estimators of fundamental frequency (f_{0}) variability (mean, standard deviation, skewness and kurtosis) in five female and five male speakers of Brazilian Portuguese (BP). Most f_{0} contours of word reading are bimodal. Analysis of their time-normalized contours suggests this is caused by the time-compressed realization of fast transitions from low to high or high to low tones aligned with stressed syllables. Considering only unimodal distributions, results show that there are no statistically significant effects in the male data for any of the four variability estimators. Effects show up in female data: spontaneous style has significantly higher mean, SD and skewness than read speech. Findings in the previous literature indicate the reverse pattern, though, for languages other than BP. The second goal of the study is to characterize the statistical properties of f_{0} distributions beyond mean and SD. Results confirm previous observations that most f_{0} distributions have positive skewness, are right-tailed and have kurtosis values that deviate significantly from the normal because of large deviations from the central or modal value. A distribution fitting procedure tested six distributions. The asymmetric Burr type XII distribution emerges as the one that best fits the data in the corpus. Results show that two of the parameters that determine its shape correlate well with the empirical f_{0} distribution values of SD and skewness. Important effects of speaking style on f_{0} seen in female speakers can be reproduced by combinations of the Burr distribution’s parameters.

This study has two main goals. The first is to describe the effect of three speaking styles (spontaneous interview, sentence reading and word reading) on estimators of f_{0} variability and also on the shape of the f_{0} histograms of five female and five male speakers of Brazilian Portuguese (BP). Most word reading contours are bimodal. Analysis of time-normalized contours suggests that bimodality is caused by fast transitions between high and low tones. Results for the unimodal distributions show that speaking style only causes significant effects in the women's data: the spontaneous style has higher mean, standard deviation and skewness values than the sentence reading style. The second goal of the study is to characterize the statistical properties of f_{0} beyond mean and standard deviation. Our results confirm previous observations by showing that f_{0} distributions in general have positive skewness and kurtosis values that exceed what would be expected for the normal distribution. A distribution fitting procedure tested six theoretical probability distributions. The asymmetric Burr type XII distribution had the best fit to the data studied. The two parameters that describe the shape of the distribution correlate well with the standard deviation and skewness values of the empirical distributions. The effects of speaking style on the women's f_{0} distributions can be reproduced by combinations of the parameters of the Burr distribution.

In this article we deal with two lines of research that do not cross paths regularly, at least not as we investigate them here. The first is the study of the effects of speaking style on f_{0} and the second is the statistical description and modeling of f_{0} distributions.

Research on speaking styles revolves around the task of describing or characterizing the many landmarks in a continuum that goes from what can be called spontaneous speech to speech read from previously prepared texts in laboratory conditions. In between these two, it is possible to identify styles that are defined by the content and function of what is spoken, such as news broadcasts, sports narration, theatrical speech and many others. Different dimensions of language and speech are investigated in relation to speaking styles: linguistic stress, speaking rate, vowel reduction, content vs. function words and voice quality, to name a few; see especially Llisterri (1992) for a systematic review of research strategies and results that have accrued around the subject.

Prosodic correlates are consistently researched in regard to the effects of speaking styles; see Llisterri (1992, p. 13–14) for a comprehensive list of suprasegmental acoustic correlates that have already been investigated. Here, we concentrate on the effects of speaking style on overall f_{0} variability. The most common research strategy has been to determine how spontaneous and read speech styles affect a number of statistical descriptors of f_{0} distribution, mainly f_{0} mean and standard deviation. At least six articles systematically review results on this theme (ESKÉNAZI, 1993; HOLLIEN; HOLLIEN; DE JONG, 1997; JESSEN, 2009; KARLSSON […] f_{0} mean than spontaneous speech, although not all studies use the same definition for the latter style. A great number of results show that the two styles do not differ in f_{0} standard deviation; studies that show a difference are divided in almost equal numbers between those pointing to spontaneous speech having greater standard deviation and those showing the reverse. Only a few studies report results of inferential statistical tests, and most numerical averages presented for both mean and standard deviation are close; so, even if differences are statistically significant, the effect sizes are likely to be small. In terms of language diversity, the reviewed studies are dominated by English; other languages present in the reviewed papers, in smaller numbers, are Dutch, French, German and Swedish.

Besides reviewing previous work, Arantes and Linhares (2017) present original results from a study that includes seven languages (Brazilian Portuguese, English, Estonian, French, German, Italian and Swedish). The same data collection and analysis procedures were used for all languages. For these data, spontaneous speech was elicited in the form of a semi-directed interview (which the authors classify under the “spontaneous speech” label) and read speech consisted of sentences taken from written transcripts of each participant’s interview and later read by them. Results agree with previous findings […] f_{0} value; when it comes to f_{0} standard deviation, the results are not more definitive than previously found, although the significant difference in favor of spontaneous speech observed when all language data are collapsed gives an indication that, in larger samples, the small-sized effects observed may yield significance.

The second line of research previously mentioned also has a tradition of its own. Its chief purpose is to establish the main statistical characteristics of f_{0} distributions in general. There are theoretical and applied motivations for this line of inquiry. On the theoretical side, one is interested in knowing how to best describe and model f_{0} from a statistical point of view and to relate that to the physiology of the voice production mechanism and to linguistic factors that may affect it (FUJISAKI, 1988). On the applied side, there is the development of normalization strategies (JASSEM; KUDELA-DOBROGOWSKA, 1980; MAIDMENT; LECUMBERRI, 1996; ROSE, 1987; 1991) that allow the generation of f_{0} contours that abstract away from between-speaker variability and emphasize linguistically motivated contour movements.

Another main practical reason for the interest in statistical properties of f_{0} distributions comes from the potential use of f_{0} as an acoustic parameter in forensic speaker comparison. Eriksson (2011, p. 49) mentions that f_{0} mean and standard deviation are often suggested as “descriptors of individual differences”. Surveys of common practices in the field (GOLD; FRENCH, 2011; 2019) recognize that f_{0} is widely considered by expert practitioners as being useful in speaker comparison tasks. Despite highlighting limitations to the indexical properties of f_{0}, Kinoshita and colleagues (2009) suggest that statistical parameters other than f_{0} mean and standard deviation may be added as features in voice comparison procedures in order to make f_{0} more resistant to within-speaker variability and non-linguistic factors that may affect it and, in turn, make f_{0} a more robust factor in forensic speaker comparison. The authors make this claim based on the observation that f_{0} histograms generated from audio samples by the same speaker recorded on different occasions “show striking similarities in their shapes” (KINOSHITA; ISHIHARA; ROSE, 2009, p. 93). In the cited article, this observation is corroborated by presenting a selected number of histogram pairs. There is, however, no mention of a systematic study of this behavior to attest its consistency across several speakers.

Research on the general statistical properties of f_{0} distributions goes back at least to the 1930s (COWAN, 1936) and review papers (FITCH; HOLBROOK, 1970; HOLLIEN; PAUL, 1969; TRAUNMÜLLER; ERIKSSON, [S.d.]) list numerous studies with similar goals. Most of these studies try to characterize the f_{0} distribution by means of f_{0} mean and f_{0} standard deviation. A few of them try to correlate differences in those estimators with different speaking styles and, in some cases, with speaker sex and physical traits such as body height and weight (HOLLIEN; PAUL, 1969). It is much rarer for statistical descriptors beyond mean and standard deviation to be reported, maybe because it is assumed that f_{0} data can be modeled as a normal distribution and as such could be wholly characterized by the distribution mean and standard deviation. Cowan (1936 […] f_{0} contours of “stage speech” are “more or less normally distributed”, although no further technical details are provided about this statement. More recent studies have not corroborated this claim (JASSEM; STEFFEN-BATÓG; CZAJKA, 1973; JASSEM, 1971). The studies reported by the authors applied χ^{2} goodness-of-fit tests to f_{0} distributions of samples of read speech about one minute in duration. They concluded that about 90% of them differ significantly from a normal distribution (the existence of one bimodal distribution is also reported). The authors attribute non-normality to deviations in skewness and kurtosis. Measuring skewness by means of Walker’s measure and using a table of critical values taken from a collection of statistical tables, they conclude that, in most cases, the skewness values in their samples differ significantly from what would be expected for a normal distribution. Positive skewness (mass concentrated to the left of the central value, with a longer right tail) is the typical case, with some cases of negative or no skewness.
Later studies (HORII, 1975; 1982) have corroborated the finding that f_{0} distributions are characterized by positive skewness deviation with reference to the normal, both for read and spontaneous speech, although no information is provided about tests performed to identify statistical significance. Positive deviation seems the most common occurrence, although evidence of typical negative skewness values is also found (ZEMLIN, 1968

Given the fact that no deviation in skewness and kurtosis is the exception rather than the rule, the authors point out that this “may be helpful in the classification of voices for purposes of identification” (JASSEM; STEFFEN-BATÓG; CZAJKA, 1973, p. 219). This observation ties in well with the suggestions made by Kinoshita and colleagues (2009) mentioned earlier in this section that estimators that contribute to more fine-grained detail in f_{0} histograms may strengthen its usefulness in speaker comparison tasks.

Evidence of significant deviations in skewness and kurtosis being the typical case in f_{0} is mostly based on data collected in English, with the exception of Mikheev (1971 […] f_{0}) of Russian speech to be positively skewed. It would be desirable that a more diverse language pool be studied to claim that this finding is truly cross-linguistic.

At the start, we stated that one of the goals in this article is to bring together the two lines of research we described. This is done by reanalyzing a speech […] f_{0}. In that study, the authors found the effects by looking at differences in a set of measures of central tendency and variability of f_{0}. Here we advance the analysis by also looking for differences in skewness and kurtosis.

Motivated by previous evidence concerning the presence of many cases of bimodality in f_{0} histograms in the […] f_{0} and perform a distribution fitting analysis including distributions other than the normal. Lastly, we try to model differences due to speaking style as changes in the parameter values of the distribution that was found to be the best one to model f_{0} data.

In this section we present the speech material analyzed (1.1); the phonetic methods used to extract f_{0} contours (1.2); and the procedures used to analyze the f_{0} distributions and to fit statistical distributions to f_{0} data (1.3). In order to conform to open science principles, data files and R scripts used to analyze the data are made available at

The speech material analyzed here is the Brazilian Portuguese subset of a database of recordings called “A typology for word stress and speech rhythm based on acoustic and perceptual considerations”,

In the first step of the phonetic analysis, audio samples were segmented into units defined as a function of speaking style. Interview samples were segmented into phonetic utterances, defined by Kendall (2013) as stretches of speech delimited by silent pauses. Each complete sentence sample was segmented into individual sentences and each complete word reading sample was segmented into individual words. The net duration of all units combined, the number of units and the mean duration of units are shown in table 1 for each sample in the corpus. The table also shows frame duration (in milliseconds) and the number of voiced frames per audio sample. The reciprocal of frame duration indicates the rate at which the f_{0} extraction algorithm tries to estimate values in the voiced portions of the speech sample (more information on the f_{0} extraction procedure ahead). Number of frames per sample corresponds to the number of f_{0} observations taken from each audio sample.

Source: the author.

Mean net duration of participant speech per audio sample (in seconds, standard deviation in parentheses) is 565 (166) for interviews, 191 (40.4) for sentence readings and 37.8 (9.42) for word readings. Median number of units (median absolute deviation in parentheses) is 274 (54.1) for interviews, 67 (26.7) for sentence readings and 46 (1.48) for word readings. Mean unit duration (in seconds, SD in parentheses) is 2.06 (0.49) for interviews, 2.9 (0.77) for sentence readings and 0.8 (0.19) for word readings. Frame duration (in milliseconds, SD in parentheses) has median values of 6.22 (1.8) for the female group and 9.52 (1.22) for the male group.

Before the f_{0} extraction phase, stretches of audio files that contained the speech of the experimenter, overlap between speaker and experimenter, and non-speech events were silenced to minimize f_{0} extraction errors. Extraction of f_{0} contours was done with the help of a Praat script (ARANTES, 2019) that implements a heuristic suggested by Hirst (2011) to optimize the values passed to the […] Q_{1} and 1.5·Q_{3}, where Q_{1} and Q_{3} are the first and third quartiles of the voiced samples in the first

Each f_{0} contour obtained through the script was then checked individually and remaining f_{0} extraction errors were hand-corrected by two analysts trained to perform the task. The errors most commonly detected by this procedure were octave halving or doubling and incorrect voicing detection, usually in fricatives or transient noise in plosive releases. Cases such as incorrect devoicing of frames, which can occur during glottalized or creaky phonation, had to be found by the analyst by comparing the f_{0} contour with both the respective oscillogram and spectrogram.

In order to characterize the f_{0} contours in our sample in terms of their statistical properties, a series of statistical estimators was calculated for each of them.

Measures of central tendency: arithmetic mean;

Measure of dispersion: standard deviation;

Measure of asymmetry: Pearson's moment coefficient of skewness;

Measure of kurtosis: Pearson’s kurtosis.
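As an illustration of the four estimators listed above, the sketch below computes them from a list of f_{0} values. It is written in Python with made-up values and is not the R code used in the study; the function name `describe` and the choice of divisors (n − 1 for the standard deviation, n for the central moments) are assumptions made here.

```python
import math

def describe(values):
    """Mean, SD, moment skewness and Pearson's kurtosis of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4, using n as the divisor.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "sd": math.sqrt(m2 * n / (n - 1)),  # sample SD (divisor n - 1)
        "skewness": m3 / m2 ** 1.5,         # 0 for a symmetric sample
        "kurtosis": m4 / m2 ** 2,           # not excess kurtosis: normal = 3
    }

# Hypothetical f0 values in Hz, chosen only to give a long right tail.
f0_hz = [180.0, 195.0, 200.0, 205.0, 210.0, 220.0, 240.0, 310.0]
stats_hz = describe(f0_hz)
stats_log = describe([math.log(v) for v in f0_hz])  # same values in logHz
```

For these illustrative values the skewness is positive on both scales and smaller after the log transformation.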

All statistical estimators were calculated for values in Hertz and log-Hertz (hereafter referred to as logHz). Log-transformation was applied to f_{0} data for two separate reasons. From a purely statistical point of view, it is well known that it can reduce the skewness of many types of data and help them become more normal-like (LIMPERT; STAHEL; ABBT, 2001). Besides that, log-transformation can be justified on physiological and linguistic grounds, as pointed out by Fujisaki and colleagues (FUJISAKI, 1988; FUJISAKI; HIROSE, 1984; FUJISAKI; OHNO; GU, 2004) and explained below.

Fujisaki advocates log-transforming f_{0} values based on the observation that the relation between f_{0} and vocal fold elongation can be described linearly if f_{0} is measured on the log scale. From a linguistic point of view, Fujisaki puts forward the idea that the surface f_{0} contour of an utterance can be conceived as the result of the superposition of two separate components: local and relatively fast rise-fall movements, and a global and relatively slow declining baseline. The first of the two components roughly corresponds to pitch accents connected to prosodic words and the second corresponds to larger units, such as clauses, phrases, or sentences. If the f_{0} contour is expressed on a log scale, it is possible to treat the superposition of both components mathematically as an addition operation, simplifying the formulation of a model of the interaction between the accent and phrase components, such as Fujisaki's namesake model. The fact that there are two independent laryngeal mechanisms to elongate the vocal folds (rotation of the thyroid cartilage around the cricothyroid joint, associated with the accent component; and forward translation of the thyroid cartilage, associated with the phrase component) further justifies treating their combined action as additive in nature.
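The additive formulation can be written compactly as below. This is a schematic rendering of the argument rather than the full model: the symbols $F_b$ (baseline value), $P(t)$ (phrase component) and $A(t)$ (accent component) are labels assumed here for illustration.

```latex
\ln f_0(t) = \ln F_b + P(t) + A(t)
\quad\Longleftrightarrow\quad
f_0(t) = F_b \, e^{P(t)} \, e^{A(t)}
```

On the Hertz scale the same components combine multiplicatively, which is why working on the log scale simplifies the formulation.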

D’Agostino test of skewness (D’AGOSTINO, 1970) and Anscombe-Glynn test of kurtosis (ANSCOMBE; GLYNN, 1983), available through the […], were used to test whether the skewness and kurtosis of each f_{0} sample in the corpus differ significantly from what would be expected for the Gaussian or normal distribution. The normal distribution is symmetric, meaning that data are about equally distributed around the central value, with skewness equal to zero. A sample with negative skewness is said to be left-skewed or left-tailed, meaning that the mass of the distribution is concentrated to the right of the center; whereas one with positive skewness is said to be right-skewed or right-tailed, meaning that the mass of the distribution is concentrated to the left of the center. The normal distribution has a kurtosis value of 3. Kurtosis is traditionally presented as a peak feature, describing the shape of the center of a distribution: either more flat-topped (platykurtic) or more pointed (leptokurtic) relative to the normal distribution. A number of authors (see Westfall (2014) and references therein) argue that the correct interpretation of kurtosis is to consider it a measure of the propensity of a distribution to be heavy-tailed, that is, to generate extreme values, or values far from the central tendency.

Given prior experience with the data set we analyze in the present study (ARANTES; ERIKSSON, 2019; ARANTES; LINHARES, 2017), we knew that there were cases of distributions that showed evidence of bimodality, i.e., histograms with more than one modal value. In order to identify these cases in a more objective way, Hartigan’s dip test […] was applied to the f_{0} distributions (both in Hz and logHz) in the corpus; the test yields significance when distributions are non-unimodal. Since in bimodal distributions the mean value of the overall distribution is likely not to be representative of any of the distributions that can be assumed to be mixed together, bimodal cases were identified and treated separately according to the procedures described in section 1.3.2. Unimodal samples were subject to a distribution fitting procedure described in the following section.

An α level of 5% was adopted for all statistical analyses conducted in the present study, all of which were carried out using the R statistical computing environment (R CORE TEAM, 2020).

With the aim of establishing which univariate parametric distributions best describe the unimodal f_{0} samples in our corpus and whether speaking styles have an effect on this, we used the R package called

Symmetric distributions:

Normal or Gaussian;

Logistic.

Asymmetric or skewed distributions:

Burr type XII, also known as Singh–Maddala or generalized log-logistic distribution;

Gumbel or Generalized Extreme Value distribution Type-I;

Gamma;

Weibull.

As will be reported in the Results section, most f_{0} distributions are right-skewed, and for this reason we tested more asymmetric than symmetric distributions. Of the tested distributions, Burr type XII is a heavy-tail distribution, meaning that it has “a larger probability of getting very large values” (WOLFRAM RESEARCH, 2020). The others are considered thin-tail distributions, meaning that “the PDF [probability density function] decreases exponentially for large values” (WOLFRAM RESEARCH, 2020) of the variable. Weibull can have both kinds of tails depending on the values of its parameters. Distributions included in the fitting analysis were chosen consulting a compendium of probability distributions (MCLAUGHLIN, 2016) and considering their availability within R, either as part of the

All six distributions listed above were tested against each f_{0} sample in our corpus, and the best fit was the distribution that yielded the smallest Anderson-Darling goodness-of-fit statistic (abbreviated A^{2}).
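The ranking step can be sketched as follows: compute the Anderson-Darling statistic of the sample against each fitted candidate and keep the smallest. The sketch is in Python rather than R and covers only three of the six candidates. The parameters are set by the method of moments, which is an assumption made here for brevity and not necessarily the estimation method used in the study; the sample and all function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def anderson_darling(sample, cdf):
    """Anderson-Darling A^2 statistic of a sample against a given CDF."""
    x = sorted(sample)
    n = len(x)
    total = 0.0
    # Pair the i-th smallest value with the i-th largest one.
    for i, (low, high) in enumerate(zip(x, reversed(x)), start=1):
        total += (2 * i - 1) * (math.log(cdf(low)) + math.log(1.0 - cdf(high)))
    return -n - total / n

def candidate_cdfs(sample):
    """CDFs of three candidates, with parameters set by the method of moments."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
    logistic_scale = sd * math.sqrt(3.0) / math.pi
    gumbel_scale = sd * math.sqrt(6.0) / math.pi
    gumbel_loc = mean - EULER_GAMMA * gumbel_scale
    return {
        "normal": lambda v: 0.5 * (1.0 + math.erf((v - mean) / (sd * math.sqrt(2.0)))),
        "logistic": lambda v: 1.0 / (1.0 + math.exp(-(v - mean) / logistic_scale)),
        "gumbel": lambda v: math.exp(-math.exp(-(v - gumbel_loc) / gumbel_scale)),
    }

# Hypothetical right-skewed sample: 50 evenly spaced quantiles of a Gumbel
# distribution with location 200 and scale 20 (illustrative values only).
sample = [200.0 - 20.0 * math.log(-math.log((i - 0.5) / 50)) for i in range(1, 51)]
scores = {name: anderson_darling(sample, cdf)
          for name, cdf in candidate_cdfs(sample).items()}
best_fit = min(scores, key=scores.get)  # candidate with the smallest A^2
```

Because A^{2} gives extra weight to the tails, it is sensitive to the kind of departure from symmetry discussed in this section.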

Following the identification of bimodality with the dip test, we did a visual inspection of the histograms that yielded significant […] f_{0} distributions, two cases of false positives and three of false negatives were found. In all cases, non-unimodal distributions were bimodal, i.e., when more than one mode was detected, only two were present in the histogram. Distributions identified as non-unimodal were further analyzed in two ways: first by submitting them to a distribution mixture analysis and then to time-normalization.

We now discuss the procedure to identify bimodality. The smoothed histogram shown in figure 1 shows evidence of two distinct modes, almost one octave apart, suggesting that there is not just one source of systematic f_{0} variability in play, but possibly two. The distribution shown in figure 1 could be, then, the result of the superposition of two more basic (unimodal) distributions. In situations like this, a technique called distribution mixture analysis (BISHOP, 2006) can be used to estimate the parameters of the putative underlying distributions. One of the simplest and most tractable cases is when component distributions assumed to underlie a given distribution are normal or Gaussian and this is referred to as a Gaussian Mixture Model or GMM. In order to carry out a GMM analysis, it is necessary to give estimates for the mean, the standard deviation and a parameter called λ, which corresponds to the relative weight of each normal distribution in the joint distribution. We used the

Source: the author.
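The mixture analysis described above can be sketched with a bare-bones expectation-maximization loop for two Gaussian components. This is an illustrative Python sketch, not the routine used in the study: the function names, the starting values and the data (two made-up clusters almost one octave apart) are all hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_gaussians(data, mu, sigma, lam, iterations=100):
    """EM estimates of means, SDs and weights (lambda) of a 2-component GMM."""
    mu, sigma, lam = list(mu), list(sigma), list(lam)
    n = len(data)
    for _ in range(iterations):
        # E-step: posterior probability that each value belongs to each component.
        resp = []
        for x in data:
            w = [lam[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
            resp.append((w[0] / (w[0] + w[1]), w[1] / (w[0] + w[1])))
        # M-step: update the parameters using those probabilities as weights.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = math.sqrt(var)
            lam[k] = nk / n
    return mu, sigma, lam

# Hypothetical bimodal sample: a low mode near 110 Hz and a high mode near
# 200 Hz, with the high mode contributing twice as many values.
offsets = [-12, -8, -5, -3, -1, 0, 1, 3, 5, 8, 12]
data = [110.0 + d for d in offsets] + [200.0 + d for d in offsets] * 2

# Initial guesses for the means, SDs and relative weights (lambda).
mu, sigma, lam = fit_two_gaussians(data, mu=(100.0, 220.0),
                                   sigma=(20.0, 20.0), lam=(0.5, 0.5))
```

The initial guesses play the role of the starting estimates for the mean, standard deviation and λ mentioned above; with well-separated modes the result depends little on them.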

Word reading was the style that yielded the most cases of bimodal distributions (more details in the Results section). In order to better understand what might be driving this behavior, we submitted the word reading samples to a time-normalization analysis. Time-normalization allows the comparison on the same temporal scale of the f_{0} contour applied to different words with the same size and repetitions of the same word within word reading samples across different speakers. See Arantes (2015) for a more detailed explanation of the time-normalization technique. The interval over which time-normalization was done was the duration of each word in word reading samples. Start and end points for each word were marked in […] f_{0} samples were taken from each marked interval. The f_{0} contour was smoothed (the

This section presents descriptive statistics and distribution fitting analysis for unimodal and bimodal f_{0} distributions. As explained in section 1.3, we decided to report unimodal and bimodal cases separately, based on the reasoning that values of statistical estimators taken from the overall sample in bimodal distributions are likely not to be representative of any of the distributions that are mixed together in the ensemble.

By applying the procedure described in section 1.3 to identify bimodal distributions, it was possible to determine that 21 out of the 30 distributions in the study sample are unimodal and are going to be analyzed in more detail in sections 2.1 and 2.2. The other nine are bimodal and a complete analysis of those distributions is presented in section 2.3.

To provide the reader with an overview of the variability encountered in our corpus, figure 2 presents smoothed histograms of all the 30 f_{0} distributions that make up this study’s corpus. The smoothed histograms in the figure were generated using the procedure described in 2.3.2 that uses kernel density estimates to obtain a continuous density curve out of raw histograms.

Source: the author.
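A smoothed histogram of the kind mentioned above can be obtained with a Gaussian kernel density estimate. The Python sketch below is only illustrative: the data are made up, the function name is hypothetical, and the normal-reference rule of thumb used for the bandwidth is an assumption, not necessarily the setting used to draw the figure.

```python
import math

def gaussian_kde(sample, grid, bandwidth=None):
    """Gaussian kernel density estimate of `sample` evaluated at `grid`."""
    n = len(sample)
    if bandwidth is None:
        mean = sum(sample) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
        bandwidth = 1.06 * sd * n ** (-0.2)  # normal-reference rule of thumb
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)
    # Each observation contributes one Gaussian bump centered on its value.
    return [
        sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in sample) / norm
        for g in grid
    ]

# Hypothetical f0 values in Hz and an evaluation grid with 1 Hz steps.
f0_hz = [180.0, 195.0, 200.0, 205.0, 210.0, 220.0, 240.0, 310.0]
grid = [float(g) for g in range(0, 601)]
density = gaussian_kde(f0_hz, grid)
area = sum(density) * 1.0                   # Riemann sum with 1 Hz steps
peak_hz = grid[density.index(max(density))] # location of the modal density
```

The bandwidth controls how much detail survives smoothing, so the number of visible modes should be read with that setting in mind.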

This section presents descriptive and inferential statistics regarding unimodal f_{0} distributions. The breakdown of the 21 unimodal distributions by speaking style (and speaker sex) is: interview (5 female, 4 male), sentence reading (5 female, 4 male), and word reading (1 female, 2 male). Values of the four statistical estimators mentioned in section 1.3 for each of the 21 unimodal distributions are shown in Figures 3 (Hertz scale) and 4 (logHz scale).

Source: the author.

Source: the author.

Summary statistics for the four estimators will be presented below, in both scales, followed by statistical comparisons. Since the number of cases in the word reading style is low for both female and male speakers, paired

Source: the author.

Speaker sex | Interview | Sentences | Words |
---|---|---|---|
Female | 223 (12.6) | 208 (11.3) | 207 |
Male | 140 (29.5) | 138 (26.9) | 145 (13.1) |

Source: the author.

Speaker sex | Interview | Sentences | Words |
---|---|---|---|
Female | 5.39 (0.06) | 5.33 (0.06) | 5.33 |
Male | 4.9 (0.12) | 4.91 (0.01) | 4.95 (0.01) |

For female speakers, the interview style has a greater mean f_{0} in both scales; effect sizes are large [Hz:

Source: the author.

Speaker sex | Interview | Sentences | Words |
---|---|---|---|
Female | 34.5 (5.38) | 18.3 (2.54) | 12.1 |
Male | 28.1 (19.8) | 21.7 (7.57) | 28.3 (0.36) |

Source: the author.

Speaker sex | Interview | Sentences | Words |
---|---|---|---|
Female | 0.14 (0.023) | 0.087 (0.016) | 0.058 |
Male | 0.17 (0.078) | 0.15 (0.03) | 0.18 (0.012) |

For female speakers, the interview style has greater f_{0} standard deviation values in both scales; effect sizes are large [Hz:

Similar results regarding mean and standard deviation were reported in Arantes and Nascimento (2017) for the same BP speech material, although there the authors did not separate unimodal and bimodal distributions. Here the analysis is taken further by also looking at skewness and kurtosis.

Regarding skewness, the test used to detect it returned significant results for all 21 unimodal distributions, regardless of measurement scale. Skewness is positive for all distributions measured in the Hz scale and ranges from 0.32 to 2.53. When measured in the log scale, 18 distributions have positive values and 3 negative, ranging from -0.22 to 1.40. These findings point to a strong tendency towards positive skewness, confirming previous observations found in the literature that f_{0} distributions tend to be right-skewed. Tables 6 and 7 present mean skewness values as a function of speaking style and speaker sex in the Hertz and log scales, respectively. As expected, log-transformation has the effect of bringing skewness values down, sometimes close to zero, although, even in this case, statistical tests for skewness turned out to be significant in all cases.

Source: the author.

Speaker sex | Interview | Sentences | Words |
---|---|---|---|
Female | 1.61 (0.66) | 0.59 (0.23) | 0.82 |
Male | 1.6 (0.23) | 1.07 (0.84) | 1.04 (0.012) |

Source: the author.

Speaker sex | Interview | Sentences | Words |
---|---|---|---|
Female | 0.79 (0.48) | 0.18 (0.25) | 0.64 |
Male | 0.84 (0.25) | 0.36 (0.59) | 0.50 (0.19) |

For female speakers, the interview style has greater skewness values in both scales [Hz:

Regarding kurtosis, the statistical test to detect excess kurtosis yielded significant values for all distributions, regardless of measurement scale: all values are greater than 3 for f_{0} values in Hz (range: 3.15-16.3); for values in logHz, 19 out of 21 values are greater than 3 (range: 2.69-7.13). As with skewness, conversion to log-scale lowers excess kurtosis, although not enough to bring it to the level of the normal distribution. The fact that most samples have kurtosis values that deviate from what would be expected were they normally distributed can thus be seen as evidence that f_{0} distributions are best characterized as heavy-tail distributions. The results seen for skewness corroborate this and point to the fact that the heaviness tends to concentrate on the right tails, that is, towards higher f_{0} values.

Tables 8 and 9 present mean kurtosis values as a function of speaking style and speaker sex in the Hertz and log scales, respectively. For female speakers, the interview style has a higher kurtosis level than sentence reading; effect sizes are large and moderate for the Hz and logHz scales, respectively; the difference is significant in the Hz scale but not in logHz. [Hz:

Source: the author.

Speaker sex | Interview | Sentences | Words |
---|---|---|---|
Female | 8.69 (3.61) | 4.60 (0.87) | 3.75 |
Male | 7.78 (2.01) | 7.46 (6.09) | 4.16 (0.76) |

Source: the author.

Speaker sex | Interview | Sentences | Words |
---|---|---|---|
Female | 5.09 (1.46) | 4.26 (0.63) | 3.34 |
Male | 4.60 (1.18) | 4.39 (1.72) | 3.24 (0.78) |

Summing up the results of style effect on distribution parameters, the takeaway is that, for female speakers, sentence reading has a lower mean level, less variability (SD), less asymmetry (skewness) and fewer extreme values (kurtosis) compared to the interview style. This is true regardless of scale. For male speakers, there is no significant effect of style on any of the parameters, regardless of scale. For both sexes, we do not include word reading in the statistical comparisons because of the small number of unimodal samples for this style: for female speakers, four out of five word reading distributions are bimodal and for male speakers, three out of five.

In this section we report the results of the distribution fitting procedure described in section 1.3.1. Overall, the theoretical distributions that best describe the f_{0} samples are: Burr 76% (N = 16), Gumbel 14% (N = 3), Gamma 5% (N = 1) and Logistic 5% (N = 1). Breaking the results as a function of style, we have:

First, we can note that asymmetric or skewed distributions are those that best fit f_{0} data, except for one sentence reading sample, for which the Logistic distribution is best. This could be expected, given that the results presented in section 2.1 show that, regardless of speaking style, most distributions have significant positive skewness and significant excess kurtosis.

Numerically, Burr type XII distribution is the best fit for interview and sentence reading styles. The Gumbel distribution, which is similar to Burr’s in the sense of being right-skewed, gets first place in the unimodal cases for two of the word reading style samples. As mentioned in section 1.3.1, the best fit was determined by ranking the candidates by their goodness-of-fit values as estimated by the Anderson-Darling statistic. Since Burr type XII is so prevalent, we decided to investigate by how much it lost the first place in the goodness-of-fit procedure, especially in cases where Gumbel was the best candidate, as it is similar to Burr type XII.

In all cases where Gumbel is considered the best fit (f1 words, m2 interview and m5 words), Burr is a close second, with the A^{2} statistic for Burr being 1.2 to 2.5 times bigger than Gumbel’s. In the one case where the symmetric Logistic distribution is the best fit (m5 sentences), Burr is second with an A^{2} value that is 1.3 times greater than the one for the best fit. In the case where Gamma is the best fit (f4 sentences), Normal is the second best (A^{2} 1.3 times bigger), followed by Burr (A^{2} 1.9 times bigger). Considering all five samples, the candidates ranked third or lower have an average A^{2} value that is 17.8 times bigger than the best fit, with A^{2} values ranging from 2 to 107 times larger than the best fit. The results suggest that even when Burr type XII is not the best fit, it is a close second (or third in one single case), even when the best fit is a symmetric distribution. Thus, the data derived from our corpus of samples seem to support the proposal that Burr type XII is a reasonable candidate for the purpose of modeling most unimodal f_{0} distributions.
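The ranking logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample is synthetic (drawn from a Gumbel distribution), only three candidates are compared, and their parameters are estimated by the method of moments, which is not necessarily the estimation method used in section 1.3.1. The names `anderson_darling` and `moment_fits` are hypothetical helpers.

```python
import math
import random

def anderson_darling(sample, cdf):
    """A^2 = -n - (1/n) * sum((2i - 1) * [ln F(x_i) + ln(1 - F(x_{n+1-i}))])."""
    xs = sorted(sample)
    n = len(xs)
    eps = 1e-12  # clamp probabilities to avoid log(0) in the far tails
    total = 0.0
    for i in range(1, n + 1):
        f_low = min(max(cdf(xs[i - 1]), eps), 1.0 - eps)
        f_high = min(max(cdf(xs[n - i]), eps), 1.0 - eps)
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n

def moment_fits(sample):
    """Candidate CDFs with parameters matched to the sample mean and SD."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    g_scale = s * math.sqrt(6.0) / math.pi
    g_loc = m - 0.5772156649 * g_scale  # Euler-Mascheroni constant
    l_scale = s * math.sqrt(3.0) / math.pi
    return {
        "normal": lambda x: 0.5 * math.erfc(-(x - m) / (s * math.sqrt(2.0))),
        "gumbel": lambda x: math.exp(-math.exp(-(x - g_loc) / g_scale)),
        "logistic": lambda x: 1.0 / (1.0 + math.exp(-(x - m) / l_scale)),
    }

# Synthetic right-skewed sample (inverse-transform sampling from a Gumbel).
random.seed(1)
sample = [5.3 - 0.1 * math.log(-math.log(random.random())) for _ in range(2000)]

scores = {name: anderson_darling(sample, cdf) for name, cdf in moment_fits(sample).items()}
ranked = sorted(scores.items(), key=lambda item: item[1])  # smaller A^2 = better fit
best_a2 = ranked[0][1]
ratios = {name: a2 / best_a2 for name, a2 in ranked}
for name, a2 in ranked:
    print(name, round(a2, 2), "ratio to best:", round(ratios[name], 1))
```

The `ratios` dictionary plays the role of the "times bigger than the best fit" figures quoted above: the best candidate has ratio 1 and a close second has a ratio only slightly above it.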

Burr type XII is a distribution that is defined by three parameters: shape 1, shape 2, and rate, all being positive real numbers. Figure 5 shows how the shape of a Burr type XII distribution changes when we keep two parameters fixed and change the values of the third (values in the figure were chosen in a range plausible for f_{0} data). As we can see in panel (a), lowering the value of shape 1 makes the distribution more right-skewed and heavy-tailed; higher values make the distribution more symmetric. Panel (b) shows that shape 2 controls the height of modal density and tail weight; increasing the value heightens the modal density and thins the distribution’s tails. Panel (c) shows that the rate parameter (or location, the reciprocal of rate) locates the distribution’s center on the horizontal axis.

Source: the author.
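To make the role of the three parameters concrete, here is a minimal Python sketch of a Burr type XII distribution. It assumes the common parameterization with CDF F(x) = 1 - [1 + (x · rate)^{shape 2}]^{-shape 1}; whether this matches the software used in the study exactly is an assumption, and the parameter values below are hypothetical, chosen only so that the distribution sits near 5.3 on a logHz-like axis.

```python
import math

def burr12_cdf(x, shape1, shape2, rate):
    t = (x * rate) ** shape2
    return 1.0 - (1.0 + t) ** (-shape1)

def burr12_pdf(x, shape1, shape2, rate):
    t = (x * rate) ** shape2
    return shape1 * shape2 * t / (x * (1.0 + t) ** (shape1 + 1.0))

def burr12_quantile(p, shape1, shape2, rate):
    # Invert the CDF: solve 1 - (1 + t)^(-shape1) = p for t, then for x.
    t = (1.0 - p) ** (-1.0 / shape1) - 1.0
    return t ** (1.0 / shape2) / rate

def right_tail_ratio(shape1, shape2, rate):
    """Length of the upper tail (q99 - q50) relative to the lower tail (q50 - q01)."""
    q01, q50, q99 = (burr12_quantile(p, shape1, shape2, rate) for p in (0.01, 0.5, 0.99))
    return (q99 - q50) / (q50 - q01)

SHAPE2, RATE = 40.0, 1.0 / 5.3  # hypothetical values, not estimates from the corpus
for shape1 in (0.5, 1.0, 3.0):
    median = burr12_quantile(0.5, shape1, SHAPE2, RATE)
    print(shape1, round(median, 3), round(right_tail_ratio(shape1, SHAPE2, RATE), 2))
```

With these values, lowering shape 1 lengthens the upper tail relative to the lower one, and halving the rate doubles every quantile, which is the sense in which the reciprocal of rate positions the distribution on the horizontal axis.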

We used the f_{0} samples in our corpus. Then, for each of the 21 unimodal f_{0} contours we correlated the estimated values for the three parameters to the statistical descriptors (mean, SD, skewness and kurtosis) reviewed in section 2.1. Applying the Pearson product-moment correlation test, we found three significant correlations: skewness and shape 1: -0.82 [...] f_{0} distributions in a way that other distributions do not. The role of the sample mean in locating the distribution on the [...]

In order to uncover a possible effect of speaking style on the shape of the f_{0} data as described by a Burr type XII distribution, we calculated the mean values of the three distribution parameters as a function of the three speaking styles. Figure 6 shows the probability density functions of the three speaking styles for female (panel a) and male speakers (panel b), generated from mean parameter values.


There is a noticeable effect of style on the distribution shape for female speakers. The modal value shifts towards higher values in the following order: word reading < sentence reading < interview. Values are 5.3, 5.325 and 5.35 or approximately 204, 205 and 210 Hz, converting the values back to the Hertz scale. The most visible change is seen in the width of the distribution, which increases in the same order as the modal value. All three distributions are right-skewed. Sentence reading and interview have much heavier tails on both sides. These observations corroborate the significant differences in standard deviation and skewness presented in section 2.1 for female speakers. No relevant difference in distribution shape motivated by speaking style is seen for male speakers, which is consistent with the lack of statistically significant differences in standard deviation and skewness reported in section 2.1.

It seems unlikely that the lack of effect of speaking style on male speakers could be explained on physiological grounds. One possible explanation for the lack of effect in male data could be that it is due to the relatively small sample size in each group. Since the effect sizes reported in female data were larger than the ones seen in the male data, significant results were likelier to be found in that group even in a small sample. Future work with larger sample sizes may help understand if the sex difference is reliable or an artifact caused by an insufficient sample size.

The pattern seen in the female data suggests that the interview style generates more asymmetric distributions, with larger f_{0} excursions and overall greater variability than the other two styles. This is compatible with a view that spontaneous or semi-spontaneous speaking styles are livelier, as speakers tend to be more engaged and involved in what they are saying than in read speech, which has a lesser degree of f_{0} modulation and therefore could be perceived as having a relatively more level intonation pattern. Arantes and Eriksson (2019) present data that may be interpreted as an indication that this is not a fixed pattern, but something that can, at least in part, be language-specific or culture-specific. In their study, the authors studied the multilingual corpus that includes the BP data analyzed here and developed a methodology to measure similarity between f_{0} contours. Applying the methodology to investigate inter-speaker variation in contour similarity, the authors report that for a group of languages consisting of English, French, Italian, Brazilian Portuguese and Swedish, spontaneous interview was the style which yielded the highest levels of variation. A closer inspection of the pairs of contours with the largest dissimilarity index values revealed that most of them included rare cases of interview samples with a more level f_{0} profile. The few atypical interview contours, then, generated most of the inter-speaker variation. For Estonian and German, on the other hand, sentence reading was the style yielding more inter-speaker variation. The crucial difference, though, was that most sentence reading contours showed a good deal of f_{0} modulation and extended ranges and the pairs with the largest dissimilarity values included, in most cases, the few atypical cases of sentence reading contours with less modulation and narrower ranges. One conclusion that could be drawn from the results is that speaking styles differ in terms of f_{0} variation, but language communities may vary in terms of which style has a more lively profile, with larger excursions and overall more modulation, and which ones show a more level profile.

Going back to the results in the present study, it seems that male speakers adhere to the pattern shown in Arantes and Eriksson (2019) to a lesser extent than the female speakers. It is possible to see that, for the sentence reading style, males and females present similar SD and skewness values. For the interview style, on the other hand, female speakers show an increase in SD and skewness that is not present to the same degree in the male sample (cf. tables 5 and 7). This could result in the impression that females sound livelier or more expressive than males in the interview style, although this hypothesis should be confirmed by future experiments. If it is shown that female spontaneous contours are reliably perceived as livelier than male contours, that may be a sociophonetic effect that results from the different expectations regarding the behavior of females and males in social interactions.

The coarse-grained dynamics revealed by the distinct statistical characteristics of spontaneous and read speech in the female group is illustrated in figure 7. It shows stretches of two contours taken from the central 20 seconds of speaker [...] f_{0} excursions and an expanded f_{0} range in the case of the highly positive-skewed green contour and the lack of extreme f_{0} excursions and a more symmetrical variation around the median f_{0} value in the blue contour. Follow-up studies should look in more detail at how the differences play out in the small scale of individual sentences.


In this section we present the results of the analysis that was carried out on bimodal distributions as described in section 1.3.2. First, we review the effect of speaking style and speaker sex on the occurrence of f_{0} bimodality. Then we present the results of the Gaussian Mixture Models analysis. Last, we apply time-normalization to a set of f_{0} contours to better understand what might be motivating the emergence of bimodality.

Hartigan’s dip test turned out significant for nine out of the 30 f_{0} distributions (30% of the sample), indicating that they are not unimodal, according to the procedures described in section 1.3 and irrespective of scale (Hz or logHz). Speaking style has a major effect on the occurrence of bimodality: almost 78% of those occurrences (7 out of 9) correspond to the word reading style; the other two are the sentence reading and the interview samples of the same male speaker (labeled m3). Out of all word reading samples (10), 70% are bimodal. Both sexes are roughly as likely to generate bimodal f_{0} distributions: four out of five female speakers did it, all of them in the word reading style; among male speakers, four out of five also did it, three when reading words, and one when reading sentences and in the interview.

As noted in the Material and Methods section, all distributions classified as non-unimodal are bimodal, i.e., there are two prominent peaks in their histograms. The values obtained by the GMM analysis for the three parameters (mean, standard deviation and λ) that characterize the two component distributions present in each of the nine bimodal samples in the corpus are shown in Figure 8 as a function of speaker sex.


In terms of the mean values of the two component distributions, the mean difference is 0.36 logHz for the female speakers and 0.41 logHz for the male speakers, indicating that the lower component distribution is centered around values that are roughly half an octave (around 50% of an octave) below the center of the distribution with a higher mean value. For 4 out of the 9 samples (2 females and 2 males), the lower component mean is 70% to 90% of an octave below the higher component mean, suggesting that at least part of this lower distribution can be in the non-modal voicing register.
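For readers who want to relate logHz differences to musical intervals, the short Python sketch below converts a difference between two means into octaves and into a ratio in Hz. It assumes that logHz is the natural logarithm of the value in Hz; the function names are hypothetical.

```python
import math

def lognat_diff_to_octaves(diff_loghz):
    """A difference d in natural-log Hz spans d / ln(2) octaves."""
    return diff_loghz / math.log(2.0)

def lower_to_higher_ratio(diff_loghz):
    """Ratio (in Hz) of the lower mean to the higher mean."""
    return math.exp(-diff_loghz)

for label, diff in (("female", 0.36), ("male", 0.41)):
    print(label,
          "octaves:", round(lognat_diff_to_octaves(diff), 2),
          "Hz ratio:", round(lower_to_higher_ratio(diff), 2))
```

Under this assumption, a difference of ln(2), about 0.69 logHz, corresponds to exactly one octave, that is, to the lower mean being half the higher mean in Hz.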

There is a negative correlation of -0.74 for the difference between the second and first components’ mean and the λ of the first component distribution, meaning that the further apart the means of the two components are, the less the lower component contributes to the overall distribution. The histogram in figure 1 is an example of this pattern: the mean of the first component’s distribution is located at a point almost one octave below the mean of the second component’s distribution and it accounts for around 20% of the overall distribution.

The coefficients of variation of the mean parameter estimated by the GMM analysis for the two components are 1.72% and 2.13% for the female speakers and 1.56% and 2.94% for the male speakers. The component with the lower mean also tends to have a lower standard deviation and this pattern tends to be slightly stronger for the male speakers.
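The three parameters discussed in this section (mean, standard deviation and λ of each component) are the quantities a two-component Gaussian mixture estimates. The following is a minimal expectation-maximization sketch in plain Python, run on synthetic data; it is not the GMM implementation used in the study, and the function name `fit_two_gaussians`, the initialization strategy and the simulated values are assumptions made for illustration.

```python
import math
import random

def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_gaussians(xs, iterations=100):
    """Fit a two-component 1-D Gaussian mixture by EM; returns (means, SDs, weights)."""
    n = len(xs)
    mean_all = sum(xs) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    mu = [min(xs), max(xs)]   # start the components at the two extremes
    sd = [sd_all, sd_all]
    lam = [0.5, 0.5]          # mixing weights (lambda)
    for _ in range(iterations):
        # E-step: probability that each point belongs to the lower component.
        resp = []
        for x in xs:
            w_low = lam[0] * normal_pdf(x, mu[0], sd[0])
            w_high = lam[1] * normal_pdf(x, mu[1], sd[1])
            resp.append(w_low / (w_low + w_high))
        # M-step: weighted updates of mean, SD and weight for each component.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - v for v in resp]
            n_k = sum(r)
            mu[k] = sum(ri * x for ri, x in zip(r, xs)) / n_k
            sd[k] = math.sqrt(sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / n_k)
            lam[k] = n_k / n
    return mu, sd, lam

# Synthetic bimodal sample in logHz: 20% low component, 80% high component.
random.seed(0)
sample = ([random.gauss(4.7, 0.1) for _ in range(200)]
          + [random.gauss(5.4, 0.1) for _ in range(800)])
mu, sd, lam = fit_two_gaussians(sample)
print("means:", [round(m, 2) for m in mu])
print("SDs:", [round(s, 2) for s in sd])
print("lambdas:", [round(w, 2) for w in lam])
```

The difference `mu[1] - mu[0]` and the weight `lam[0]` of the lower component are the two quantities whose relationship is examined in the correlation reported above.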

Time-normalized contours of all words in word reading style for three speakers, m4, f5 and f1, are shown in figures 9, 10 and 11. They were chosen because they illustrate three different patterns, detailed below.


Speaker m4’s overall f_{0} distribution is bimodal, but the mean values of the two component distributions are not far apart (4.62 and 4.95 logHz, approximately half an octave apart) and both account more or less equally for the overall distribution (53% and 47%). The range spanned by word contours is around 1.2 octaves. Blue dashed lines indicate the locations of the mean values of the two component distributions. A visual examination of the contours indicates that the two means approximately coincide with the rises and falls of the contours and that there is a balance in terms of the contribution of rises and falls to each word contour.


Speaker f5’s overall f_{0} distribution is bimodal, but the component distributions’ means are located farther apart (4.74 and 5.38 logHz, approximately 0.9 octave) and the lower mean component distribution accounts for approximately 21% of the overall f_{0} distribution. The range spanned by the word contours is around 1.8 octaves. Blue dashed lines indicate the location of the two component distributions’ means. For this speaker, we can see that the overall median is very close to the mean of one of the component distributions. The other component has a lower mean value that is at the same level of most falls in the word contours; most falls reach deeper levels when compared to speaker m4 and the contour stays there for less time; this could explain why the lower component distribution contributes less to the overall f_{0} distribution.


As a comparison, speaker f1’s overall f_{0} distribution has a unimodal histogram and it is apparent from the word contours in figure 11 that there is not a great deal of intonation modulation – word contours span around 0.3 octave.

In terms of contour dynamics, word reading seems to be characterized by a good amount of f_{0} movement in a relatively short time interval – mean word duration is 800 ms (SD 186), all speakers pooled; see the mean word duration as a function of speaker in table 1.

In a language like BP, in which word stress is not fixed and there can be lexical contrasts defined by stress position within a word, the patterns observed seem to suggest that speakers tend to single out the stressed syllable against the background of unstressed syllables by aligning an upward or downward f_{0} movement with it.

In general terms, the time-compressed nature of the contours in isolated word reading seems to be what causes the bimodality in f_{0} histograms: contour rises contribute to an f_{0} distribution with a relatively higher mean value and contour falls generate a distribution with a lower mean value; the ensemble distribution that results, then, is bimodal.
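The mechanism proposed here can be illustrated with a toy simulation. The Python sketch below builds an idealized time-normalized contour that moves from a low to a high level through a sigmoid transition and counts how many samples fall between the two levels. All values (levels, transition widths, number of samples) are hypothetical and chosen only to contrast a fast and a slow transition; this is not a model fitted to the corpus.

```python
import math

def step_contour(n_samples, low, high, transition_width):
    """Sigmoid rise from `low` to `high` centred at mid-word (time normalized to 0..1)."""
    contour = []
    for i in range(n_samples):
        t = i / (n_samples - 1)
        contour.append(low + (high - low) / (1.0 + math.exp(-(t - 0.5) / transition_width)))
    return contour

def share_between_levels(contour, low, high):
    """Fraction of samples that fall in the middle third of the low-high range."""
    lower_edge = low + (high - low) / 3.0
    upper_edge = high - (high - low) / 3.0
    return sum(lower_edge <= v <= upper_edge for v in contour) / len(contour)

LOW, HIGH = 4.7, 5.4  # hypothetical low and high levels in logHz
fast = step_contour(101, LOW, HIGH, transition_width=0.02)
slow = step_contour(101, LOW, HIGH, transition_width=0.3)
print("fast transition:", round(share_between_levels(fast, LOW, HIGH), 3))
print("slow transition:", round(share_between_levels(slow, LOW, HIGH), 3))
```

With a fast transition only a handful of samples lie between the two levels, so a histogram of the contour shows two separate peaks; with a slow transition a large share of the samples fills the region in between.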

Taken as a whole, the results presented here corroborate tendencies already suggested in the previous literature about speaking styles with the added effort to be as explicit and complete as possible in the description of the phonetic and statistical analysis procedures to ensure an adequate level of reproducibility. Besides contributing to a well-established stream of previous studies on speaking styles, we also introduced something new by going a step further in the statistical modeling of f_{0} distributions by trying a distribution fitting analysis that included a number of theoretical probability distributions other than the Normal or Gaussian.

We will start the general discussion by presenting the most important findings regarding the effect of speaking style on the shape of f_{0} distributions. First, word reading style strongly favors the emergence of bimodal distributions regardless of speaker sex. In contrast, the other two styles seldom generate bimodal distributions. The mixture analysis shows that the two components that make up the bimodal distribution can be either closer or farther apart. When they are farther apart, the lower component is usually low enough that it is consistent with being in the non-modal register. Visual inspection of time-normalized f_{0} contours coming from bimodal distribution cases suggests that the two components in the mixture can be associated with the time-compressed instantiation of intonational rises and falls that are linked with the signaling of stressed syllables. Recognizing the existence of a nontrivial number of bimodal f_{0} distributions and their prevalence in a particular speaking style, and giving them a proper treatment, is a new approach in the literature. Previous mentions of bimodality are limited to: (1) Jassem (1971), which reports the occurrence of one case of bimodal distribution (1 out of 10 speakers); Jassem [...] f_{0} samples by the same speaker that suggest bimodality; and (2) Kinoshita [...] f_{0}. In other cases, bimodality seems to arise from the fact that the typical contours in isolated word reading are characterized by lower and higher levels and brief transitions between them. Because of the time-compressed nature of the contours, the transition is so brief that the two peaks in the histogram are associated with the lower and higher levels with only a few data points between these levels. In some cases, the lower level is in non-modal phonation territory, but not in all of them. Regarding the occurrence of non-modal phonation in the corpus, we refer the reader to Silveira and Arantes (2017). In that study, the authors report the results of an auditory analysis of non-modal phonation occurrences in the same corpus analyzed here. The authors conducted auditory analysis of stretches of very low f_{0} regions in contours with bimodal histograms and found that most instances of very low f_{0} resulted in the perception of non-modal phonation, although there was no attempt to quantify this association. This finding corroborates the frequent observation made in the voice quality literature that lowered f_{0} levels are usually associated with the perception of laryngealized phonation (EDMONDSON; ESLING, 2006; ESLING; HARRIS, 2005; GORDON; LADEFOGED, 2001; HANSON; CHUANG, 2001; LAVER, 1980; REDI; SHATTUCK-HUFNAGEL, 2001).

Results concerning the effects caused by sentence reading and interview styles presented here indicate they are not uniform among male and female speakers. Female speakers show significant differences between styles: interview tends to have higher mean, standard deviation, skewness and kurtosis than sentence reading. Differences are statistically significant with large effect sizes for all estimators, except for kurtosis in log-scale (non-significant, moderate effect size). Male speakers show differences in the same direction (interview > sentence reading) for all four statistical estimators, although none of them is statistically significant. Effect sizes are smaller than the ones observed for female speakers for mean, standard deviation and kurtosis, but moderate for skewness. The present results are in line with what was reported by Arantes and Nascimento (2017) for the same data. In the earlier study, the difference in mean value in favor of the interview style in comparison to the reading style is greater in female speakers (1 semitone) than in male speakers (0.15 st), although no significance test is reported. For standard deviation, previous results showed that males present greater values than females (3 st vs. 2 st) irrespective of style; in terms of styles, interview presents a greater value than reading (2.92 st vs. 2.1 st), but values as a function of both sex and style are not presented. The results of the present study also show this tendency of male speakers to have larger standard deviation values than female speakers in both styles. Considering the results in the perspective of the broader literature on speaking style, they seem to confirm that, while for the majority of other languages studied read speech presents larger f_{0} mean and standard deviation than spontaneous speech, the opposite is the case for Brazilian Portuguese.

Now we turn our attention to the statistical characterization of f_{0} distributions. The statistical tests reported in section 2.1 show that most distributions have skewness and kurtosis values that significantly deviate from what would be expected from normally distributed samples, regardless of measurement scale. The current results corroborate what the scarce previous literature (HORII, 1975; 1982; JASSEM; STEFFEN-BATÓG; CZAJKA, 1973; JASSEM, 1971) has reported: regardless of speaker sex or speaking style, f_{0} distributions tend to have positive skewness and kurtosis values above 3. These results point to the fact that these distributions are asymmetric and heavy-tailed, strongly hinting that the normal distribution is not the best theoretical statistical distribution to model empirical f_{0} distributions. Eriksson (2011, p. 49–50) offers one possible explanation for why positive skewness arises in f_{0} data: “positive skewing occurs primarily because there is much more room for fundamental frequency variation upwards than downwards”. Downward movement range is limited, according to this explanation, because going lower than a certain threshold “will normally result in creak which speakers tend to avoid” (

Given the empirical results on statistical estimators of f_{0} variability and possible explanations for this behavior, there is a clear gap in the literature. Until now, the available studies (JASSEM; STEFFEN-BATÓG; CZAJKA, 1973) only demonstrate that empirical f_{0} data deviate significantly from normal distributions. No previous study that we are aware of tried to go beyond that and test the fitness of other theoretical distributions to f_{0} data. The results of the distribution fitting analysis reported in section 2.2 try to fill this gap. As expected, none of the distributions in our corpus can be adequately modeled by a normal distribution. Even in the one case of an empirical distribution that was best modeled by a symmetric distribution, this distribution was the Logistic, not the normal. For the other 95% of unimodal empirical f_{0} distributions, the best fits were right-skewed theoretical distributions, especially Burr type XII, a three-parameter heavy-tailed distribution, and in second place a two-parameter thin-tailed distribution called Gumbel. Both are used to model real-world phenomena such as survival data, insurance losses and income distribution that are characterized by the presence of events with extreme deviations from a central value. Given Eriksson’s (2011) account of why f_{0} data has pervasive positive skewness, one could associate the presence of relatively unbounded upwards excursions as a source of extreme values. The results presented in section 2.2.1 provide evidence that the three Burr distribution parameters estimated from the empirical f_{0} distributions have a good amount of correlation with the empirical distribution’s mean, standard deviation and skewness (and kurtosis indirectly, given its significant positive correlation with skewness). This result shows that the Burr distribution does a good job of capturing important information provided by three (four, indirectly) estimators that define the shape of a unimodal f_{0} distribution.

Bringing together both the speaking style and the distribution fitting themes, section 2.2.2 shows that the effect of speaking styles on the shape of f_{0} distributions can be represented by different combinations of the three Burr distribution parameters. The probability density plots in figure 6 generated from the parameter combinations for female and male speakers and the three styles show that they are able to represent the lack of a significant style effect in male speakers and adequately capture the larger effects seen in the female speakers, especially the fact that the interview style has greater mean, standard deviation and skewness than the sentence reading style. This result gives additional evidence for the usefulness of the Burr distribution to model the effect of a relevant paralinguistic phenomenon on the shape of f_{0} distributions.

The results concerning the application of the so-called “extreme value” probability distributions to f_{0} data are encouraging since they are able to capture important features documented in empirical f_{0} samples. Burr type XII stands out as the distribution that best fits the data in the speech material analyzed here. Future work should improve the present results by enlarging the number of speakers and including other languages as well to test if language has an effect on what distribution comes out as the best fit. Further suggestions for future studies include exploring how assuming that f_{0} data follows an underlying distribution such as Burr type XII can be useful both in explaining linguistic phenomena (as we did here with speaking styles) and in practical applications such as speaker comparison in forensic contexts. In the latter case, we suggest checking if the distribution parameters are useful for capturing possible invariant features in f_{0} distributions coming from the same speaker in non-contemporaneous recordings such as those that Kinoshita and colleagues (2009; 2010) noted in unsystematic observations. Follow-up studies with a larger number of participants could test if individual speakers can be identified in a big pool on the basis of the values of the parameters that describe their f_{0} distribution following the methodology used by Kinoshita and colleagues. A larger number of speakers can help estimate the degree of between-speaker variation in these parameters. These studies should also obtain the distribution parameter values for different speech samples by the same speaker to estimate the within-speaker variability. If the between-speaker variability is larger than within-speaker variability, then the parameter values are useful in speaker comparison tasks (NOLAN, 1993; ROSE, 2002).

A final suggestion for future work is to explore ways of adequately modeling bimodal f_{0} distributions. Here we used Gaussian mixture models to analyze the data as a first approximation, but, given the evidence against f_{0} being normally distributed, other mixture possibilities should be explored. Two important questions to be answered are whether the individual component distributions in bimodal cases have the same statistical characteristics as unimodal ones and whether the two underlying component distributions in bimodal cases are of the same type, especially in the case where one of them is mostly comprised of creaky phonation.

On the whole, the results reported in this article show how speaking styles have an important role in determining the overall shape of f_{0} distributions. The patterns observed suggest a robust tendency for word list reading to generate bimodal distributions. Read sentences and the interview style are associated with unimodal but right-skewed distributions. The distribution fitting analysis corroborates previous suggestions in the literature that the normal distribution is not the best theoretical distribution to model f_{0} distributions. We report results of an initial analysis showing that Burr type XII, an extreme value distribution, is the best statistical distribution to model empirical right-skewed f_{0} distributions.

The author would like to thank Professor Anders Eriksson (Stockholm University) for generously granting access to the sound files of his project “A typology for word stress and speech rhythm based on acoustic and perceptual considerations”. The author acknowledges the work of Maria Érica Linhares (FAPESP grant 2014/21161-5), Suska Gutzeit (PIBIC-CNPq/UFSCar grant 2014-2015) and Isabela Silveira (FAPESP grant 2016/16544-8) in the processing of the f_{0} contours analyzed here, done as part of undergraduate research projects carried out under the author’s supervision. The author also thanks the reviewers for their careful reading and for their comments, which greatly improved the paper. Finally, the author thanks Julia Arantes and Leonardo Oliveira for proofreading the manuscript.

The authors used ANOVA followed by pairwise tests to investigate differences between groups and speakers.

See

The natural (base