Modeling linguistic variables with regression models: Addressing non-gaussian distributions, non-independent observations, and non-linear predictors with random effects and generalized additive models for location, scale, and shape

Coupé, Christophe

Links for fulltext

(May Require Subscription)

Publisher Website: 10.3389/fpsyg.2018.00513
Scopus: eid_2-s2.0-85045691106
WOS: WOS:000430124900001

Supplementary

Citations:
- Scopus: 0
- Web of Science: 0
Appears in Collections:
- Linguistics: Journal/Magazine Articles

Article: Modeling linguistic variables with regression models: Addressing non-gaussian distributions, non-independent observations, and non-linear predictors with random effects and generalized additive models for location, scale, and shape

Title	Modeling linguistic variables with regression models: Addressing non-gaussian distributions, non-independent observations, and non-linear predictors with random effects and generalized additive models for location, scale, and shape
Authors	Coupé, Christophe
Keywords	Phonemic inventory size; Smooth terms; Generalized additive models; Box-Cox t distribution; Delaporte distribution; Generalized linear models; GAMLSS; Mixed-effects models
Issue Date	2018
Citation	Frontiers in Psychology, 2018, v. 9, n. APR. DOI: http://dx.doi.org/10.3389/fpsyg.2018.00513
Abstract	© 2018 Coupé. As statistical approaches are getting increasingly used in linguistics, attention must be paid to the choice of methods and algorithms used. This is especially true since they require assumptions to be satisfied to provide valid results, and because scientific articles still often fall short of reporting whether such assumptions are met. Progress is being, however, made in various directions, one of them being the introduction of techniques able to model data that cannot be properly analyzed with simpler linear regression models. We report recent advances in statistical modeling in linguistics. We first describe linear mixed-effects regression models (LMM), which address grouping of observations, and generalized linear mixed-effects models (GLMM), which offer a family of distributions for the dependent variable. Generalized additive models (GAM) are then introduced, which allow modeling non-linear parametric or non-parametric relationships between the dependent variable and the predictors. We then highlight the possibilities offered by generalized additive models for location, scale, and shape (GAMLSS). We explain how they make it possible to go beyond common distributions, such as Gaussian or Poisson, and offer the appropriate inferential framework to account for 'difficult' variables such as count data with strong overdispersion. We also demonstrate how they offer interesting perspectives on data when not only the mean of the dependent variable is modeled, but also its variance, skewness, and kurtosis. As an illustration, the case of phonemic inventory size is analyzed throughout the article. For over 1,500 languages, we consider as predictors the number of speakers, the distance from Africa, an estimation of the intensity of language contact, and linguistic relationships. We discuss the use of random effects to account for genealogical relationships, the choice of appropriate distributions to model count data, and non-linear relationships. Relying on GAMLSS, we assess a range of candidate distributions, including the Sichel, Delaporte, Box-Cox Green and Cole, and Box-Cox t distributions. We find that the Box-Cox t distribution, with appropriate modeling of its parameters, best fits the conditional distribution of phonemic inventory size. We finally discuss the specificities of phoneme counts, weak effects, and how GAMLSS should be considered for other linguistic variables.
Persistent Identifier	http://hdl.handle.net/10722/262791
ISI Accession Number ID	WOS:000430124900001

DC Field	Value	Language
dc.contributor.author	Coupé, Christophe	-
dc.date.accessioned	2018-10-08T02:47:03Z	-
dc.date.available	2018-10-08T02:47:03Z	-
dc.date.issued	2018	-
dc.identifier.citation	Frontiers in Psychology, 2018, v. 9, n. APR	-
dc.identifier.uri	http://hdl.handle.net/10722/262791	-
dc.language	eng	-
dc.relation.ispartof	Frontiers in Psychology	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject	Phonemic inventory size	-
dc.subject	Smooth terms	-
dc.subject	Generalized additive models	-
dc.subject	Box-Cox t distribution	-
dc.subject	Delaporte distribution	-
dc.subject	Generalized linear models	-
dc.subject	GAMLSS	-
dc.subject	Mixed-effects models	-
dc.title	Modeling linguistic variables with regression models: Addressing non-gaussian distributions, non-independent observations, and non-linear predictors with random effects and generalized additive models for location, scale, and shape	-
dc.type	Article	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.3389/fpsyg.2018.00513	-
dc.identifier.scopus	eid_2-s2.0-85045691106	-
dc.identifier.volume	9	-
dc.identifier.issue	APR	-
dc.identifier.eissn	1664-1078	-
dc.identifier.isi	WOS:000430124900001	-
dc.identifier.issnl	1664-1078	-
