PhD Program in Computer and Data Science @Unimore
Research profile: Quantitative modelling of cultural diversity of language and cognition
Our goal is to train researchers in the analysis and quantitative modeling of deep language data, able to work out dedicated algorithms to extract historical signals from syntactic parameters, to master computer-assisted strategies and tools for the study of language transmission and diversification through time, space, and society, and to develop computational models of the structure of human grammars.
Goals
(a) work out dedicated algorithms to extract historical signals from syntactic parameters
(b) investigate the structure of parameter diversity and change through computer-assisted strategies and quantitative tools
(c) develop data-driven computational models of the structure of human grammars
Research areas
A) Implement tools and methods of cognitive and quantitative sciences for the investigation of the structure of human history and cultural diversity
B) Investigate and apply quantitative algorithms and computer-aided techniques to the explanation of language diversity and change: quantitative phylogenetics and comparative methods (Bayesian techniques and statistical algorithms)
C) Explore machine learning and data-driven algorithms to model the structure of human language and its variation through the processing of cognitive data
D) Treat cognitive data through computational tools
Who can apply
1) Students with background in linguistics (e.g., computational linguistics, formal linguistics, historical linguistics, psycholinguistics, neurolinguistics, language acquisition) who want to specialize in computational and quantitative treatment and analysis of language diversity.
2) Students with background in computer science (e.g., machine learning, algorithms, deep learning, AI) who want to face previously unaddressed challenges in the application of computational skills and tools to problems addressed by formal and historical investigation in linguistics.
Research Theses
Phylogenetic algorithms for evolutionary modeling of natural language syntax
Keywords: quantitative phylogenetics, Bayesian analysis, syntactic parameters, natural language, historical reconstruction
Research objectives. Work out dedicated phylogenetic algorithms to extract a historical signal from abstract natural language data (syntactic parameters), to reconstruct their ancestral states, and to explain their transmission and diversification across time. The novelty of the project lies in applying computational phylogenetic algorithms to a type of abstract structure (syntactic parameters) that is not traditionally employed in quantitative phylogenetics, and whose formal properties (a multilayered deductive structure, www.parametricomparison.unimore.it) require purpose-built tools (Ceolin et al. 2021).
Connections with research groups, companies, universities. Research on this project will be conducted in collaboration with the Center for Language history and diversity, of which the Dipartimento di Comunicazione ed Economia is a member, and the Department of Linguistics at York, with whom the DCE has an academic cooperation agreement.
Supervisor: Prof. Cristina Guardiano
Modeling structural interdependencies and their effects in constraining language diversity
Keywords: syntactic parameters, networks, binary strings, implicational structure, machine learning, language transmission and change
Research objectives. Explore the multi-layered deductive structure of natural language grammars and its effects in constraining language diversity. In formal cognitive approaches to language structure and diversity, crosslinguistic structural variation is assumed to be universally constrained by a finite set of abstract points of variation (syntactic parameters, www.parametricomparison.unimore.it) whose reciprocal interactions generate the observed surface diversity across languages. The latter turns out to be entirely predictable once the subset of variable abstract structures and their reciprocal interactions are fully defined and formalized (Bortolussi et al. 2011). The purpose of this project is to investigate parameter interdependencies and their effects on language acquisition and change, using machine learning techniques and data-driven approaches that can disclose dependencies which cannot be identified through manual data analysis, formalize implicational rules among parameters, and compare them with those produced by linguists' expert judgement. These procedures will be aimed at: (a) discovering parameters that are entirely dispensable because they are fully deducible from other parameters; (b) detecting partial dependencies among parameters; (c) observing the distribution of parameter states and implications within selected language families and across families.
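As a rough illustration of aim (b), the sketch below scans a toy table of binary parameter values for rules of the form "whenever parameter p has value v, parameter q has value w". It is a plain exhaustive check, not the machine learning approach the project envisages, and the function name find_implications, the min_support threshold, and the toy data are all hypothetical.

```python
from itertools import permutations

def find_implications(table, min_support=2):
    """Return rules (p, v, q, w): in every language where parameter p
    has value v, parameter q has value w.

    `table` maps a language name to a list of 0/1 parameter values.
    `min_support` (an assumption of this sketch) is the minimum number
    of languages that must show p == v before a rule is reported.
    """
    n_params = len(next(iter(table.values())))
    rules = []
    for p, q in permutations(range(n_params), 2):
        for v in (0, 1):
            # Values of q in all languages where p is set to v.
            matching = [row[q] for row in table.values() if row[p] == v]
            if len(matching) >= min_support and len(set(matching)) == 1:
                rules.append((p, v, q, matching[0]))
    return rules

# Hypothetical toy data: five invented languages, three parameters.
toy_table = {
    "L1": [1, 1, 0],
    "L2": [1, 1, 1],
    "L3": [0, 0, 1],
    "L4": [0, 1, 0],
    "L5": [0, 0, 0],
}

rules = find_implications(toy_table)
for p, v, q, w in rules:
    print(f"p{p} = {v}  ->  p{q} = {w}")
```

On this toy table the check reports one rule and its contrapositive. On real data a small sample will also produce accidental regularities, so any rule found this way would still need to be compared with the implications formulated by linguists.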
Possible connections with research groups, companies, universities: The envisaged research is part of the project Parameter theory on historical corpora: Measuring the power of parameter setting theory on historical corpora (MUR PRIN 2022 20224XEE9P PARTHICO). Research will also be conducted in collaboration with the Center for Language history and diversity, of which the Dipartimento di Comunicazione ed Economia is a member, and the Department of Linguistics at York, with whom the DCE has an academic cooperation agreement.
Supervisor: Prof. Cristina Guardiano
Parametric phylogenetic analyses of language families/subfamilies
Keywords: Parameter theory, Language family, Comparative methods, Parametric comparison, Quantitative phylogenetics, Synchronic and Diachronic variation, Parameter change, Historical corpora
Research objectives. Apply the Parametric Comparison Method (PCM, www.parametricomparison.unimore.it) to the analysis of the phylogenetic structure of a selected historical language family and of its relations with other families, through the investigation of contemporary and/or historical varieties. For currently spoken languages, data can (and will) be gathered from native speakers; by contrast, the investigation of historical varieties requires parsing closed collections of linguistic material attesting the relevant diachronic stages. The tools implemented by the PCM to extract parameter values from surface language data will be tested and refined to deal with both scenarios and to unveil patterns of parameter change across time and space.
Research tools. (1) a set of binary syntactic parameters defining syntactic variation in nominal structures across the world's languages; (2) a set of implicational formulas defining parameter interdependencies; (3) a set of co-varying surface manifestations for each parameter; (4) a formal parameter setting procedure that converts the observed surface patterns to a string of binary values representing the deep structure of each language; (5) a set of computer-based procedures to extract a historical signal from parameter values and automatically generate language phylogenies.
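A first step in tool (5) could be sketched as follows: compare the parameter strings of two languages and compute the share of differing values among the parameters that are set in both. This is only a minimal sketch under stated assumptions. The source does not specify the distance measure; the use of None for a parameter made irrelevant by an implication, the function names, and the toy languages are all hypothetical.

```python
from itertools import combinations

def parameter_distance(a, b):
    """Share of differing values among parameters set in both languages.

    Each language is a list of 1, 0, or None. None (an assumption of
    this sketch) marks a parameter that an implication has made
    irrelevant, so it is excluded from the comparison.
    """
    comparable = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
    if not comparable:
        raise ValueError("no parameter is set in both languages")
    differences = sum(1 for x, y in comparable if x != y)
    return differences / len(comparable)

def distance_matrix(languages):
    """Pairwise distances, keyed by alphabetically ordered name pairs."""
    names = sorted(languages)
    return {(m, n): parameter_distance(languages[m], languages[n])
            for m, n in combinations(names, 2)}

# Hypothetical toy data: three invented languages, five parameters.
languages = {
    "LangA": [1, 1, 0, None, 1],
    "LangB": [1, 0, 0, 1, 1],
    "LangC": [0, 0, None, 1, 0],
}

matrix = distance_matrix(languages)
for pair, dist in matrix.items():
    print(pair, dist)
```

A matrix of this kind is the usual input to distance-based tree-building procedures; character-based and Bayesian methods would instead work on the parameter strings directly.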
Possible connections with research groups, companies, universities: The envisaged research is part of the project Parameter theory on historical corpora: Measuring the power of parameter setting theory on historical corpora (MUR PRIN 2022 20224XEE9P PARTHICO). Research will also be conducted in collaboration with the Center for Language history and diversity, of which the Dipartimento di Comunicazione ed Economia is a member, and the Department of Linguistics at York, with whom the DCE has an academic cooperation agreement.
Supervisor: Prof. Cristina Guardiano
Teaching
Besides Specialized Courses, students will attend a selection of the General Courses offered by the Program.
Students will be asked to attend at least one summer/winter school per year, chosen according to what is on offer and to the students' needs.
Students will be asked to spend two periods of study abroad (for a minimum of 6 months per year), one of which must be spent at the University of York.
If required, students will be asked to attend introductory courses for MA students offered at DCA/FIM/York in disciplines they are not yet familiar with and whose foundations are needed to follow the more specialized PhD courses.
Specialized Courses (academic year 2024-2025)
Quantitative and formal modeling of historical sciences: an introduction to the Parametric Comparison Method
12 hrs, 6 ECTS
LECTURER: Cristina Guardiano
SYLLABUS: The need to reach progressively more profound levels of chronological depth in the investigation of the human past is a requirement for any discipline with ambitions of historical reconstruction. In contemporary times, the achievements of historical sciences (e.g., population genetics) in the search for long-persistence patterns able to reveal deep-time relations were possible thanks to two radical paradigm shifts: the adoption of quantitative modeling and automatic procedures, to process and measure large amounts of data and extract statistically supported generalizations, and a qualitative change in the type of taxonomic data, thanks to the discovery that abstract entities, not directly observable but responsible for several variable surface traits, retain historical information better than observable patterns do. In linguistics, the development of the historical paradigm in the XIX century prompted extraordinary progress in our knowledge of human history by revealing relations among languages/populations which could not have been discovered by archaeology or demography alone, thanks to the identification of abstract patterns of language transmission and change. In the past 30 years, thanks to the development of Quantitative Phylogenetics, historical investigation in linguistics has benefited from the adoption of computer-based techniques, taxonomic algorithms and methods typical of data science, leading to the implementation of a wide array of automatic tools to generate computer-based taxonomies, explore dynamics of language evolution, reconstruct ancestral states and migration patterns, compare linguistic, genetic, and cultural evolution, model language contact, and reconstruct character by character the evolution of a family from the assumed shared ancestor.
These tools yield excellent results in performing accurate, objective reconstructions but have also revealed important limits in attaining the chronological depth required for long-range investigation, demonstrating that the goal of discovering deep-time relations using languages can only be pursued by combining quantitative modelling with a radical qualitative change in the level of linguistic characters employed for taxonomic reconstruction. The Parametric Comparison Method (PCM) implements a comparative model based precisely on these tenets. One of its major goals is to work out computable tools for assessing historical relatedness between languages against chance when etymological evidence is missing. To this end, the PCM exploits cognitive parametric theories to measure grammatical diversity and its distribution, and demonstrates that abstract cognitive entities retain a significant historical signal able to reveal unknown historical crosslinguistic connections.
Coding Syntactic Diversity
12 hrs, 3 ECTS
LECTURER: Cristina Guardiano
SYLLABUS: The application of computational techniques to code, annotate and parse linguistic data has benefited from the increasing availability of digital corpora but also, crucially, from the refinement of formal models of human language structure and diversity. This course presents one such model, recently developed to encode the syntactic diversity attested in the world's languages, and its application to the analysis of syntactically annotated corpora of linguistic data. In the parametric framework of cognitive biolinguistics, human grammars are represented as finite strings of binary values/states (1/0, or +/-). In this approach, the label "parameters" refers to a set of open choices between binary values, generated by our invariant universal language faculty, and closed by each language learner based on the linguistic evidence s/he is exposed to. Parameter systems exhibit two layers of deductive structure: (a) each parameter is responsible for a set of different co-varying surface linguistic patterns (manifestations), and (b) parameters form a network of partial implications: one value (though not the other) of a given parameter p1 may entail the irrelevance of another parameter p2, whose manifestations would then become predictable. The parameter setting algorithm presented in this course is based on all such properties, and consists of the following components: (i) a list of binary parameters; (ii) a list of formulas which define cross-parametric implications in this set; (iii) for each parameter, the list of surface manifestations it generates; (iv) a list of YES/NO questions associated with each manifestation, which are used to collect the data required to set the value of each parameter in a given language (only YES answers set the value 1).
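The interplay of components (i)-(iv) can be illustrated with a small sketch. It is not the algorithm taught in the course: the data layout, the rule that a single YES answer is enough to set a parameter to 1, the use of None for a parameter made irrelevant by an implication, and all names (set_parameters, p1, q1a, and so on) are assumptions made for the example.

```python
def set_parameters(parameters, implications, questions, answers):
    """Set each parameter to 1, 0, or None (irrelevant).

    parameters   : ordered list of names; a parameter must come after
                   the parameters its implication refers to
    implications : name -> list of (other_name, required_value) pairs
                   that must all hold for the parameter to be relevant
    questions    : name -> list of YES/NO question ids (one or more
                   per surface manifestation)
    answers      : question id -> "YES" or "NO" for one language
    """
    values = {}
    for name in parameters:
        conditions = implications.get(name, [])
        relevant = all(values.get(other) == required
                       for other, required in conditions)
        if not relevant:
            # Implied by other parameters: its manifestations are predictable.
            values[name] = None
            continue
        # Assumption of this sketch: any YES answer sets the value 1.
        has_yes = any(answers.get(q) == "YES" for q in questions[name])
        values[name] = 1 if has_yes else 0
    return values

# Hypothetical toy system: p2 is relevant only if p1 = 1,
# and p3 is relevant only if p2 = 1.
parameters = ["p1", "p2", "p3"]
implications = {"p2": [("p1", 1)], "p3": [("p2", 1)]}
questions = {"p1": ["q1a", "q1b"], "p2": ["q2a"], "p3": ["q3a"]}

answers_x = {"q1a": "NO", "q1b": "YES", "q2a": "NO", "q3a": "YES"}
answers_y = {"q1a": "NO", "q1b": "NO", "q2a": "YES", "q3a": "YES"}

result_x = set_parameters(parameters, implications, questions, answers_x)
result_y = set_parameters(parameters, implications, questions, answers_y)
print(result_x)
print(result_y)
```

In the second toy language p1 is set to 0, so p2 and p3 are never consulted even though their questions received YES answers: this is the sense in which one value of a parameter can make another parameter irrelevant.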
[Official Website @Unimore.it]