PhD Program in Computer, Language and Data Science – Academic Year 2026/2027 – Parametric Comparison Method

In the Academic Year 2026/2027, three scholarships will be assigned as part of the project

DHisGram – Syntax out of Africa: Deep History through Human Grammars

The DHisGram project applies the Parametric Comparison Method (PCM) to investigate the deep linguistic history of two highly controversial macro-areas: Southeast Asia and the Pacific, and the Americas.
This research theme can be developed into several subprojects, defined according to the candidate’s expertise.

The projects require expertise in at least one of the following areas:

theoretical and comparative linguistics, with particular emphasis on historical comparison
computational and mathematical modeling of language diversity

Keywords: Language History, Syntactic Modeling, Historical Comparison, Neuro-Symbolic Computing, Low-resource NLP, Graph Neural Networks, Latent Space Inference.

Candidates with advanced expertise in one or both areas, and an interest in their interdisciplinary integration, may refer to the two subprojects outlined below.

Subproject 1
Modeling syntactic diversity through the Parametric Comparison Method

Keywords: Language History, Syntactic Modeling, Historical Comparison

Overview

The project is situated in the field of comparative and historical linguistics, with a specific interest in cross-linguistic comparison based on formal grammatical properties. It is intended for applicants who wish to investigate linguistic diversity and historical relations among languages through a theoretically informed and methodologically explicit comparative framework.
Proposals may focus on one or more language groups located in either of the two macro-areas under analysis. The selection and justification of the languages to be investigated must be consistent with the research goals of the project.
Some suggestions follow; the lists are indicative, not exhaustive.

Southeast Asia: Sino-Tibetan (Sinitic: e.g. Mandarin, Cantonese; Tibeto-Burman); Austroasiatic (e.g. Vietnamese, Khmer, Mon, Wa)
Pacific and Oceania: Austronesian (Oceanic: e.g. Tongan, Samoan, Māori, Hawaiian, Äiwoo; Formosan; Malayo-Polynesian: e.g. Malay, Tagalog, Javanese, Malagasy; West New Guinea; Papuan)
Australia: Pama-Nyungan, Non-Pama-Nyungan
The Americas: Eskimo-Aleut (Inuktitut), Na-Dene (e.g. Navajo), Salishan (e.g. Straits Salish), Muskogean (e.g. Chickasaw), Quechuan (e.g. Quechua/Quichua), Uto-Aztecan (e.g. Pima, Papago, Nahuatl), Mayan (e.g. Tzotzil), Arawakan (e.g. Garifuna), Tupian / Tupi-Guaraní (e.g. Munduruku, Paraguayan Guaraní), Waikuruan (e.g. Kadiweu), Macro-Jê (e.g. Kaingang), Cariban (e.g. Kuikuro)

Research focus

(1) Analysis of the internal structure of the nominal domain in a selected set of languages, compared on the basis of a system of nominal parameters, which will be further enriched and expanded to accommodate typologically diverse languages
(2) Analysis of the distribution of syntactic diversity across the selected languages, through the adoption of statistical methodologies and computer-assisted taxonomic techniques
(3) Analysis of the phylogenetic structure of the group under investigation, to test existing phylogenetic hypotheses, especially for languages or areas whose historical interpretation remains controversial or insufficiently explored
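
As an illustration of the kind of taxonomic technique referred to in points (2) and (3), the sketch below builds a tree from pairwise language distances with average-linkage clustering (UPGMA). It is a minimal example under stated assumptions: the language labels and distance values are invented, and UPGMA is only one of several possible algorithms, not necessarily the one adopted in the project.

```python
def upgma(names, dist):
    """Average-linkage agglomerative clustering (UPGMA).

    `dist` maps frozenset({a, b}) -> distance for every pair of names.
    Returns the tree as nested tuples, e.g. (("A", "B"), "C").
    """
    # Each cluster label (a name or a nested tuple) maps to its leaf members.
    clusters = {name: [name] for name in names}

    def average(a, b):
        # Mean distance over all leaf pairs drawn from the two clusters.
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        labels = list(clusters)
        # Pick the closest pair of clusters (ties broken by position).
        _, i, j = min(
            (average(labels[i], labels[j]), i, j)
            for i in range(len(labels))
            for j in range(i + 1, len(labels))
        )
        a, b = labels[i], labels[j]
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))


# Invented toy distances between four hypothetical languages.
toy = {
    frozenset({"A", "B"}): 0.1,
    frozenset({"C", "D"}): 0.2,
    frozenset({"A", "C"}): 0.8,
    frozenset({"A", "D"}): 0.8,
    frozenset({"B", "C"}): 0.8,
    frozenset({"B", "D"}): 0.8,
}
tree = upgma(["A", "B", "C", "D"], toy)
print(tree)
```

The resulting grouping can then be compared with existing classifications of the languages involved.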

Research tasks

Data collection and documentation. Gathering of the relevant data from elicitation sessions with language experts and native speakers, with the possibility of consulting grammars, corpora, and databases, depending on the available resources. Data collection and elicitation are guided by the PCM questionnaire.
Syntactic analysis. Comparative analysis of selected structures within the nominal domain. Formulation of novel theoretical hypotheses, if needed, and refinement of the parameter system.
Comparative evaluation. Measurement and interpretation of similarities and differences across the selected languages, including formulation and testing of historical or phylogenetic hypotheses.
Critical dialogue with the literature. Comparison of the results with existing classifications, reconstructions, or competing accounts in the relevant literature.
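
To make the comparative evaluation step more concrete, the sketch below computes a pairwise distance between languages from their parameter values. It assumes, for illustration only, that each language is a string of states where "+" and "-" are set values and "0" marks a parameter that cannot be compared, and that the distance is the number of differences divided by the number of comparable parameters. The data are invented and the exact coding and metric used in the project may differ.

```python
from itertools import combinations


def pcm_distance(a: str, b: str) -> float:
    """Differences / (identities + differences), skipping pairs with a '0'."""
    if len(a) != len(b):
        raise ValueError("parameter strings must have equal length")
    identities = differences = 0
    for x, y in zip(a, b):
        if x == "0" or y == "0":
            continue  # parameter not comparable for this language pair
        if x == y:
            identities += 1
        else:
            differences += 1
    comparable = identities + differences
    if comparable == 0:
        raise ValueError("no comparable parameters")
    return differences / comparable


# Invented toy data, not real PCM values.
languages = {
    "LangA": "++-0+",
    "LangB": "+--0-",
    "LangC": "-+0++",
}

distances = {
    (p, q): pcm_distance(languages[p], languages[q])
    for p, q in combinations(languages, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

Normalizing by the number of comparable parameters keeps language pairs with many non-comparable values on the same scale as fully specified pairs.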

Proposed activities

Literature review on the theoretical framework, the languages under investigation, and the relevant comparative and historical debates.
Training in comparative methodology, formal analysis, and, where relevant, quantitative or computational tools.
Data collection and organization, including the construction of a structured comparative dataset within the PCM database.
Analytical work on selected nominal subdomain(s), with discussion of the emerging results in relation to the broader research question.
Systematic collaboration and interaction with specialists on the languages/areas involved.

Supervision: Cristina Guardiano, one or more members of the International Teaching Staff with expertise in the relevant fields.

Subproject 2
Scaling the Parametric Comparison Method via Large Language Models and Graph Analytics:
A Computational Framework for Syntactic Inference

Keywords: Neuro-Symbolic Computing, Low-resource NLP, Graph Neural Networks, Latent Space Inference.

Background

The PCM framework has so far relied on a system of rules developed within classical formal linguistics, without the support of automated techniques. While effective for medium-scale comparison, this approach has not yet been tested on a systematic, global scale. The project addresses this gap by integrating formal and comparative linguistic theory with State-of-the-Art (SotA) Machine Learning (ML) models, specifically designed to handle structured symbolic knowledge and sparse data.

Research focus

The core objective is to design and implement a model for automatic data extraction and analysis that is fully compatible with the PCM framework. The model will support linguists in identifying the underlying syntactic structures that generate observable patterns across languages. The work will be based on targeted datasets: rather than large corpora, the PCM relies on carefully selected data from controlled syntactic configurations, interpreted through speaker judgments. This poses specific challenges for Informed ML: developing algorithms that can learn from “Small Data” by incorporating linguistic constraints as inductive biases, especially for under-documented languages.

Research tasks

The project will focus on three tightly connected directions:

Data construction and representation: Multi-Task NLP (Natural Language Processing) & LLMs (Large Language Models)
Develop methodologies to collect and structure data that reliably encode the observable syntactic patterns (e.g. word orders and their interpretations) generated by the underlying rules to be compared. Implement Zero-Shot and Few-Shot Information Extraction (IE) pipelines to map descriptive grammatical texts into PCM-compliant binary vectors.
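
The last stage of such an extraction pipeline, turning a model's textual answer into a parameter vector, might look like the sketch below. Everything here is hypothetical: the parameter labels, the "NAME: value" output format, and the convention of encoding "+" as 1, "-" as 0, and unknown values as None. No specific LLM or API is assumed; the raw output is a hand-written stand-in.

```python
import re

PARAMETERS = ["P1", "P2", "P3", "P4"]  # hypothetical parameter labels
LINE = re.compile(r"^\s*(\w+)\s*:\s*([+\-?])\s*$")
VALUES = {"+": 1, "-": 0, "?": None}


def parse_extraction(raw: str) -> list:
    """Turn 'NAME: value' lines into a vector aligned with PARAMETERS.

    '+' -> 1, '-' -> 0, '?' or absent -> None (unknown).
    Lines that do not match the format, or that name a parameter
    outside PARAMETERS, are ignored.
    """
    values = {}
    for line in raw.splitlines():
        match = LINE.match(line)
        if match and match.group(1) in PARAMETERS:
            values[match.group(1)] = VALUES[match.group(2)]
    return [values.get(name) for name in PARAMETERS]


# Hand-written stand-in for a model response about one language.
raw_output = """P1: +
P3: -
P4: ?
Note: the grammar does not discuss P2.
P9: +"""

vector = parse_extraction(raw_output)
print(vector)
```

Keeping unknown values explicit, rather than defaulting them to 0, leaves room for later inference of the missing entries.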

Pattern detection and correlation: Graph-based Dependency Discovery
Identify co-occurrence patterns among surface structures, with the goal of uncovering non-obvious dependencies linked to shared underlying rules. Model the PCM parameters as a Knowledge Graph. Apply Link Prediction and Relation Extraction to uncover these dependencies (hidden rules), and perform Matrix Completion to infer missing parameter values in languages where data are incomplete or noisy.
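
As a much simpler stand-in for the graph-based methods named above, the sketch below searches a small language-by-parameter table for candidate implications of the form "whenever parameter a is set to 1, parameter b is also 1". It is a plain co-occurrence count, not link prediction or matrix completion; the table, the parameter labels, and the `min_support` threshold are assumptions made for illustration.

```python
from itertools import permutations


def implications(table, params, min_support=2):
    """Return (a, b) pairs where a == 1 always co-occurs with b == 1.

    Rows where b is missing (None) are skipped, and a pair is reported
    only if at least `min_support` rows support it.
    """
    found = []
    for i, j in permutations(range(len(params)), 2):
        rows = [r for r in table.values() if r[i] == 1 and r[j] is not None]
        if len(rows) >= min_support and all(r[j] == 1 for r in rows):
            found.append((params[i], params[j]))
    return found


# Invented toy table: 1 = "+", 0 = "-", None = missing.
params = ["P1", "P2", "P3"]
table = {
    "LangA": [1, 1, 0],
    "LangB": [1, 1, 1],
    "LangC": [0, 1, None],
    "LangD": [0, 0, 1],
}

rules = implications(table, params)
print(rules)
```

Each candidate pair can be read as a directed edge in a parameter graph, to be checked against linguistic theory before being treated as a genuine dependency.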

Cross-linguistic modeling
Analyze how abstract syntactic rules are distributed across languages and map them at a macro-area scale. Develop Generative Latent Variable Models (such as Variational Autoencoders, VAEs) to project syntactic structures into a continuous manifold. This allows Unsupervised Clustering and Manifold Learning to be applied to reconstruct phylogenetic trajectories and measure syntactic divergence in a mathematically rigorous latent space.

Proposed activities

Critical review of the relevant literature on computational methods in formal linguistics, historical comparison, and SotA data extraction (Transformers, Graph Neural Networks or GNNs).
Evaluation and adaptation of existing analytical and extraction techniques. Implementation of a hybrid architecture that combines the logical consistency of PCM (Symbolic) with the generalization power of Neural Networks (Deep Learning).
Design and implementation of novel models tailored to PCM requirements. Integration of Bayesian Neural Networks or Conformal Prediction to provide calibrated uncertainty estimates (confidence intervals or prediction sets) for the inferred syntactic rules, which are crucial for historical reconstruction.
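
To show what Conformal Prediction could contribute, the sketch below applies split conformal prediction to a binary parameter value. It assumes a hypothetical model that outputs a probability for "+" and "-", plus a small held-out calibration set with known values; all numbers are invented. The output is a prediction set that may contain one value, both, or none, rather than a single guess.

```python
import math


def two_way(p_plus):
    """Hypothetical model output: probabilities for '+' and '-'."""
    return {"+": p_plus, "-": 1.0 - p_plus}


def conformal_threshold(calibration, alpha=0.25):
    """Split-conformal threshold from scores 1 - p(true value).

    `calibration` is a list of (probabilities, true_value) pairs.
    """
    scores = sorted(1.0 - probs[true] for probs, true in calibration)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank of the quantile
    if rank > n:
        return 1.0  # too few calibration points: keep every value
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All values whose score 1 - p does not exceed the threshold."""
    return {value for value, p in probs.items() if 1.0 - p <= threshold}


# Invented calibration data: (model probability of '+', attested value).
calibration = [
    (two_way(0.90), "+"), (two_way(0.20), "-"), (two_way(0.95), "+"),
    (two_way(0.45), "+"), (two_way(0.30), "-"), (two_way(0.85), "+"),
    (two_way(0.70), "-"),
]

threshold = conformal_threshold(calibration, alpha=0.25)
confident = prediction_set(two_way(0.90), threshold)
uncertain = prediction_set(two_way(0.50), threshold)
print(threshold, confident, uncertain)
```

A set containing both values signals that the inferred parameter should not yet be used as evidence in a historical reconstruction.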

Supervision: Marko Bertogna, Giorgia Franchini, Cristina Guardiano
