Martin Haspelmath (Max Planck Institute for Evolutionary Anthropology)
Language parameters and construction parameters in the CrossGram database collection
Replicability of research results minimally relies on data accessibility, but the data should ideally be FAIR: Findable, Accessible, Interoperable, and Reusable. For technical interoperability, the CLDF standard (Forkel et al. 2018) could be used by typologists, though uptake seems to have been slow so far. In this presentation, I discuss the design of the CrossGram database collection, which is built specifically for typological datasets and has been public since the summer of 2024 (https://crossgram.clld.org/). I describe how CrossGram enhances findability and reusability, and I highlight the two different data types that it supports (language parameters and construction parameters).
CrossGram makes typological data more findable in that it “brings to light” what is often “hidden away” in supplementary spreadsheet files (or even tables in PDF files, though this is becoming rare). Research papers typically limit themselves to summary tables or graphs and a few small maps, but ideally we want to access all typological datasets with the amenities known from CLLD applications such as WALS Online (Dryer & Haspelmath 2013, wals.info) or Grambank (Skirgård et al. 2023, grambank.clld.org). These provide not only easy exporting in CLDF format, but also easy searching and sorting in data tables, as well as map visualization, and links to references and Glottolog language information.
In addition, CrossGram provides glossed example sentences in tabular form, similar to the thousands of examples in the APiCS database (Michaelis et al. 2013). These are a particularly striking case of increased transparency, because it is not uncommon for example sentences to be hidden in PDF supplements (for example, Bugaeva 2022 has a supplement of 80 pages of annotated examples). Interlinear glossed text has a range of applications even independently of the typological claims that the examples illustrate, so this is another obvious improvement in reusability.
CrossGram supports two types of typological data: language parameters that classify entire languages (i.e. parameters of the type known from the maps of WALS and Grambank), and construction parameters that classify constructions. Many grammatical meanings or functions can be rendered by multiple constructions in a given language, and if we only consider language parameters, such a language must be classified as “mixed” (or a minor construction must be ignored). For example, Kashmiri has both correlative relative clauses and postnominal relative clauses, so both of these strategies could be included in the database and their properties recorded. Stereotypically, typology consists in classifying languages into types, but in reality, languages often have multiple types coexisting with each other, so the addition of constructions and construction types as a data type makes typological databases more fine-grained.
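The difference between the two data types can be illustrated with a minimal sketch. The class names, field names, and the parameter label "relative clause strategy" are assumptions made for illustration only; they are not CrossGram's or CLDF's actual schema. Only the Kashmiri facts come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageValue:
    """One value of a language parameter: it classifies a whole language."""
    language: str
    parameter: str
    value: str


@dataclass(frozen=True)
class ConstructionValue:
    """One value of a construction parameter: it classifies one construction."""
    language: str
    construction: str
    parameter: str
    value: str


# Hypothetical parameter label, used only in this sketch.
PARAM = "relative clause strategy"

# With a language parameter alone, Kashmiri can only be recorded as "mixed".
language_values = [LanguageValue("Kashmiri", PARAM, "mixed")]

# With construction parameters, each strategy gets its own record.
construction_values = [
    ConstructionValue("Kashmiri", "correlative relative clause", PARAM, "correlative"),
    ConstructionValue("Kashmiri", "postnominal relative clause", PARAM, "postnominal"),
]


def strategies(language: str, parameter: str) -> list[str]:
    """Return the distinct construction-level values recorded for a language."""
    return sorted(
        {
            cv.value
            for cv in construction_values
            if cv.language == language and cv.parameter == parameter
        }
    )


print(strategies("Kashmiri", PARAM))
```

In this sketch the language-level record collapses the two strategies into a single "mixed" value, whereas the construction-level records keep them apart, so that further properties could be attached to each construction separately.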
Finally, CrossGram parameters (both language parameters and construction parameters) are explained succinctly and clearly, and a sophisticated keyword annotation allows users to easily find grammatical information on a wide range of topics. For the future, integration with the envisaged “Grammaticon” reference catalogue is planned (see Haspelmath 2022). This will enhance findability and accessibility even further, and it will facilitate replication and (more generally) cumulative science.