
PhD Program in Computer and Data Science @Unimore


Research profile: Quantitative modelling of cultural diversity of language and cognition


Our goal is to train researchers to become experts in quantitative modeling of abstract linguistic data, able to work out dedicated algorithms to extract historical signals from syntactic parameters, to master computer-assisted strategies and tools for the study of language transmission and diversification through time, space, and society, and to develop computational models of the structure of human grammars.

Students will be trained in the following research areas:
A) Implementing tools and methods of cognitive and quantitative sciences for the investigation of the structure of human history and cultural diversity

B) Modeling history through language: quantitative phylogenetics and comparative methods (Bayesian techniques and statistical algorithms)

C) Modeling language structure and diversity: networks and structural implications through Machine Learning

D) Treating cognitive data through computational tools


1. Who can apply

1) Students with a background in linguistics (e.g., computational linguistics, formal linguistics, historical linguistics, psycholinguistics, neurolinguistics, language acquisition) who want to specialize in computational and quantitative treatment and analysis of language diversity.

2) Students with a background in computer science (e.g., Machine Learning, algorithms, deep learning, AI, …) who want to take on previously unaddressed challenges in applying computational skills and tools to problems raised by formal and historical investigation in linguistics.


2. Hints for potential research projects


Realizing parametric phylogenetic analyses of families or subfamilies of languages

Keywords: language family, comparative methods, parametric comparison, computational phylogenies, data collection and analysis


The goal is to perform parametric comparison within historically established language families through the implementation of the formal and quantitative tools developed by the Parametric Comparison Method (PCM, www.parametricomparison.unimore.it), in order to explore their phylogenetic and historical structure and discover deeper historical relationships with other families. This investigation will be conducted using the comparison procedure designed and implemented by the PCM, which consists of the following toolkit:

(1) a set of binary syntactic parameters which define crosslinguistic variation in nominal structures across the world’s languages: languages are represented as strings of binary values (+/-);

(2) a set of implicational formulas defining parameter interdependencies: one value (though not the other) of a given parameter p1 may entail the irrelevance of another parameter p2, whose manifestations then become predictable;

(3) a set of co-varying surface manifestations for each parameter;

(4) a parameter setting procedure: each parameter is associated with a set of questions of the type “Does the (set of) structure(s)/interpretation(s) α occur in language L?”; only answers YES, which are based on positive evidence, set parameter values;

(5) a set of distance-based and character-based quantitative algorithms and statistical procedures to extract historical signals from parameter values and generate language phylogenies.
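As an illustration of the distance-based side of this toolkit, the sketch below compares languages encoded as parameter strings. It rests on several assumptions that are not stated in the text above: the language names and parameter strings are invented toy data, the symbol "0" is used here to mark a parameter made irrelevant by an implication, and the measure is a plain normalized Hamming distance over the parameters that are relevant in both languages. It is not necessarily the measure adopted by the PCM.

```python
from itertools import combinations

# Hypothetical toy data: '+' and '-' are set values, '0' marks a parameter
# that an implication has made irrelevant (an assumed convention).
LANGUAGES = {
    "A": "++-0+",
    "B": "+--0-",
    "C": "-+0-+",
}


def parametric_distance(x, y):
    """Share of differing values among parameters relevant in both languages."""
    if len(x) != len(y):
        raise ValueError("parameter strings must have the same length")
    # Keep only the positions where neither language has a neutralized value.
    comparable = [(a, b) for a, b in zip(x, y) if a != "0" and b != "0"]
    if not comparable:
        raise ValueError("no comparable parameters")
    differences = sum(a != b for a, b in comparable)
    return differences / len(comparable)


def distance_matrix(languages):
    """Pairwise distances, keyed by alphabetically ordered language pairs."""
    return {
        (p, q): parametric_distance(languages[p], languages[q])
        for p, q in combinations(sorted(languages), 2)
    }


if __name__ == "__main__":
    for pair, dist in distance_matrix(LANGUAGES).items():
        print(pair, round(dist, 3))
```

A matrix of this kind is the usual input for distance-based tree-building algorithms; character-based methods would instead work directly on the strings.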



Creating a Bayesian phylogenetic algorithm for evolutionary modeling of natural language syntax

Keywords: quantitative phylogenetics, Bayesian analysis, syntactic parameters, natural language, reconstruction


The goal is to work out dedicated algorithms to extract historical signals from abstract natural language data (syntactic parameters), to reconstruct ancestral states, and to explain language transmission and diversification across time. The novelty of the project lies in applying Bayesian computational tools to a type of abstract language structure (syntactic parameters) whose formal properties (for example, their multilayered deductive structure, www.parametricomparison.unimore.it) and peculiarities are not observed in the linguistic data traditionally used in the field to automatically generate hypotheses of historical relatedness across languages. These peculiarities partially undermine the robustness of automatically generated phylogenetic hypotheses, because the quantitative tools traditionally used in linguistics cannot deal with the peculiarities of syntactic characters and, consequently, cannot fully process the information provided by parametric data. Novel or refined quantitative tools are therefore required, and this project is aimed precisely at implementing them.



Modeling structural interdependencies and their effects in constraining language diversity

Keywords: networks, binary strings, implicational structure, machine learning, language transmission and change


The goal is to explore the multi-layered deductive structure of natural language grammars and its effects in constraining language diversity. In formal cognitive approaches to language structure and diversity, crosslinguistic structural variation is assumed to be universally constrained by a finite set of abstract points of variation (syntactic parameters) whose reciprocal interactions generate the attested surface diversity across languages. This diversity then turns out to be predictable once the subset of variable abstract structures and their reciprocal interactions are fully defined and formalized. The purpose of this project is to explore and identify different possible types of parameter interdependencies and their effects on language acquisition and change. It will do so by applying machine learning techniques and data-driven approaches able to disclose parameter dependencies which cannot be identified through manual data analysis, to formalize implicational rules among parameters, and to compare them with those produced by linguists’ expert judgement.

These procedures will be aimed at: 

(a) discovering parameters that can be entirely dispensable as fully deducible from combinations of the others; 

(b) detecting further possible partial dependencies among parameters; 

(c) identifying the largest set of parameter values which are identically set in all the languages of a given family; 

(d) finding the least restrictive sets of parameter values that are sufficient to distinguish a family from all others.
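Aims (c) and (d) can be illustrated with a small brute-force sketch. Everything in it is an assumption made for illustration: the family names, language names and parameter strings are invented, and the exhaustive search over subsets is only practical for small parameter sets. It does not stand in for the machine learning techniques the project is meant to develop. Neutralized values are not modeled here.

```python
from itertools import combinations

# Hypothetical toy data: each family maps language names to parameter strings.
FAMILIES = {
    "Fam1": {"a": "++-+", "b": "++--"},
    "Fam2": {"c": "+-+-", "d": "-+++"},
}


def shared_values(family):
    """Aim (c): parameter positions set identically in every language."""
    strings = list(family.values())
    return {
        index: values[0]
        for index, values in enumerate(zip(*strings))
        if len(set(values)) == 1
    }


def minimal_signatures(families, name):
    """Aim (d): smallest subsets of shared values that no outside language matches."""
    shared = shared_values(families[name])
    outsiders = [
        string
        for family_name, languages in families.items()
        if family_name != name
        for string in languages.values()
    ]
    # Try subsets of increasing size and stop at the first size that works.
    for size in range(1, len(shared) + 1):
        hits = [
            {index: shared[index] for index in combo}
            for combo in combinations(sorted(shared), size)
            if not any(
                all(string[index] == shared[index] for index in combo)
                for string in outsiders
            )
        ]
        if hits:
            return hits
    return []


if __name__ == "__main__":
    print(shared_values(FAMILIES["Fam1"]))
    print(minimal_signatures(FAMILIES, "Fam1"))
```

In this toy data set a single parameter value is already enough to tell each family apart; with realistic data the search space grows quickly, which is one reason data-driven methods are needed.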


3. Collaborations

Research within this profile will be conducted in collaboration with the Center for Language History and Diversity, of which the Dipartimento di Comunicazione ed Economia is a member, and the Department of Language and Linguistic Science at York, with which the Dipartimento di Comunicazione ed Economia has an academic cooperation agreement. Students will be asked to work in close collaboration with the University of York, and to spend at least one long research stay there.


4. Teaching

Besides Specialized Courses, students will attend a selection of the General Courses offered by the Program.

Students will be asked to attend at least one summer/winter school per year, chosen according to what is on offer and to the student’s needs.

Students will be asked to spend two periods of study abroad (for a minimum of 6 months per year), one of which must be spent at the University of York.

If required, students will be asked to attend introductory courses for MA students offered at DCA/FIM/York in the disciplines they are not yet familiar with and whose foundations are needed to attend more specialized PhD courses.


Specialized Courses 

Unveiling Historical Relations Across Human Languages with Data Science: Computational Phylogenetics in Linguistics

12 hrs, 3 ECTS

LECTURERS: Andrea Ceolin, Cristina Guardiano

SYLLABUS: By relying on computational techniques, algorithms and data analysis methods typical of data science, quantitative phylogenetics has developed automatic tools to build taxonomies and hierarchical trees of biological entities, to explore their phylogenetic relations, discover common ancestors, date their splits, and reconstruct unattested stages. Over the past 20 years, these methods have been applied to linguistic classifications as well, to automatically compare various types of language data, generate hypotheses of historical relation across languages and language families, aid in automatic cognate identification, explore dynamics of language evolution, reconstruct migration patterns, and compare linguistic, genetic and cultural evolution of human populations. This course discusses the application of such computational phylogenetic tools and taxonomic algorithms to the investigation of historical relations across human languages.


Quantitative and formal modeling of historical sciences

12 hrs, 3 ECTS

LECTURERS: Cristina Guardiano, Giuseppe Longobardi

SYLLABUS: The implementation of quantitative models, computational tools and automatic algorithms of data collection and analysis has brought into human sciences models, idealizations, and explanatory standards typical of natural sciences. This course explores how these tools are extended and applied to those human sciences that specifically deal with history and cultural transmission. Focusing on the historical investigation of language diversity, the course will show how the application of computational techniques for data processing and analysis, in combination with the adoption of abstract cognitive language structures as taxonomic characters for phylogenetic reconstruction (Irimia et al. 2022), brings about the possibility to address large-scope genealogical issues, to look for general principles on possible historical evolution which cannot be revealed by archeology or demography alone, and ultimately to contribute some of the required heuristics and tools for a deeper investigation of human history.


Coding Syntactic Diversity

20 hrs, 5 ECTS

LECTURERS: Paola Crisma, Cristina Guardiano

SYLLABUS: The application of computational techniques to code, annotate and parse linguistic data has benefited from the increasing availability of digital corpora but also, crucially, from the refinement of formal models of human language structure and diversity. This course presents one such model, recently developed to encode the syntactic diversity attested in the world’s languages, and its application to the analysis of syntactically annotated corpora of linguistic data. In the parametric framework of cognitive biolinguistics, human grammars are represented as finite strings of binary values/states (1/0, or +/-). In this approach, the label “parameters” refers to a set of open choices between binary values, generated by our invariant universal language faculty, and closed by each language learner on the basis of the linguistic evidence s/he is exposed to. Parameter systems exhibit two layers of deductive structure: (a) each parameter is responsible for a set of different co-varying surface linguistic patterns (manifestations), and (b) parameters form a network of partial implications: one value (though not the other) of a given parameter p1 may entail the irrelevance of another parameter p2, whose manifestations would then become predictable. The parameter setting algorithm presented in this course is based on all such properties, and consists of the following components: (i) a list of binary parameters; (ii) a list of formulas which define cross-parametric implications in this set; (iii) for each parameter, the list of surface manifestations it generates; (iv) a list of YES/NO questions associated with each manifestation, which are used to collect the data required to set the value of each parameter in a given language (only YES answers set the value 1).
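The interplay of components (i), (ii) and (iv) can be pictured with a minimal sketch. All names in it (P1, P2, P3, q1a and so on) are hypothetical, and it simplifies in three ways that should be read as assumptions: each implication is reduced to a single condition on one earlier parameter, whereas the actual formulas may be more complex; a single YES among a parameter's questions is taken to set the value "+"; and the absence of a YES is taken to default to "-". The symbol "0" is used here for a parameter made irrelevant by an implication.

```python
# (i) Hypothetical parameter list, ordered so that conditions come first.
PARAMETERS = ["P1", "P2", "P3"]

# (ii) Simplified implications: a parameter is relevant only if the named
# parameter has the named value (assumed single-condition form).
IMPLICATIONS = {
    "P2": ("P1", "+"),
    "P3": ("P2", "-"),
}

# (iv) Hypothetical YES/NO question identifiers attached to each parameter.
QUESTIONS = {
    "P1": ["q1a", "q1b"],
    "P2": ["q2"],
    "P3": ["q3"],
}


def set_parameters(yes_answers):
    """Return a value ('+', '-' or '0') for each parameter.

    yes_answers is the set of question identifiers answered YES.
    """
    values = {}
    for parameter in PARAMETERS:
        condition = IMPLICATIONS.get(parameter)
        if condition is not None and values[condition[0]] != condition[1]:
            # The implication is not satisfied: the parameter is irrelevant.
            values[parameter] = "0"
        elif any(question in yes_answers for question in QUESTIONS[parameter]):
            # Only positive evidence (a YES answer) sets the value '+'.
            values[parameter] = "+"
        else:
            values[parameter] = "-"
    return values


if __name__ == "__main__":
    print(set_parameters({"q1b", "q3"}))
    print(set_parameters({"q2"}))
```

In the second call the YES answer to q2 has no effect, because P1 is set to "-" and P2 is therefore treated as irrelevant; this is the sense in which a parameter's manifestations become predictable.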


[Official Website @Unimore.it]