Computational Modeling of Syntax Acquisition with Cognitive Constraints.

Lifeng Jin

Syntactic structures are unobserved theoretical constructs that are useful in explaining a wide range of linguistic and psychological phenomena. Language acquisition research studies how human learners acquire such latent structures through a variety of hypothesized learning mechanisms and apparatuses, which may be genetically endowed or of general cognitive use. Through computational modeling, this thesis aims to understand how such latent structures can be learned in a bottom-up fashion, starting from a position with the fewest possible assumptions about what learners know in advance to facilitate learning. The learning technique used in all models is distributional learning, in which statistical regularities in surface forms (words, characters and images) are used to induce clusters of words and phrases, as well as hierarchical structures over such clusters that generate the observed linear linguistic sequences with maximum likelihood. The central question these models try to answer is how much of syntax can be learned with distributional learning alone. This thesis proposes novel models of syntax acquisition, ranging from Bayesian grammar induction models to grammar induction models with neural networks; from models without any constraints to models with psycholinguistically inspired constraints; and from models with words as input to models with distributed representations of words, characters and images as input. These models have achieved high consistency between induced latent structures and syntactic structures from linguistic theories.
Through evaluation and comparison of the proposed models and models from previous work on unsupervised parsing and grammar induction, the results presented in this thesis first paint a relatively complete picture of state-of-the-art grammar induction performance on a large set of languages with different typological features, supporting the generality of the proposed algorithms and providing crosslinguistic performance data for analyzing the interaction between distributional learning and linguistic typology. These models also provide valuable insights into properties of language and cognition, offering evidence of the degree to which statistical information about words and characters can guide syntax learning. Many things considered essential to syntax acquisition, such as categories, head directionality, case and verb valency, are shown to be inducible using distributional learning with computational models. The incorporation of memory constraints into grammar induction models as well as supervised neural left-corner parsers strengthens the claim that performance interacts with and constrains the formation of linguistic competence. Multilingual induction shows how high-frequency markers guide grammar induction models in the induction process. Character sequences provide useful information when grammatical relations are expressed through affixes, and images provide information for languages in which high-frequency markers are not enough for the formation of syntactic categories. Analyses of results from these models also present evidence of things that are not easily learned through distributional learning, such as prepositional phrase attachment, tense and grammatical categories marked by affixes.