GraDiAn - the Grammatical Distribution Analyser

Keywords: #nlp #gradian

Introduction

The Grammatical Distribution Analyser (GraDiAn) is a Python library for analysing grammatical distributions, particularly those of NLP datasets. The library was originally created as part of my Bachelor’s degree at the University of York. Drift in grammatical distribution is often disregarded, so the intention of the project was to analyse state-of-the-art NLP datasets for undocumented grammatical distribution drift. Once such drift has been detected, the next stage would be to investigate the impact it can have on machine learning models.

See the original report here

For the purposes of the report and GraDiAn, grammatical distribution is defined as a measure of the frequencies of various grammatical properties over a text or series of texts. One potential use for grammatical distribution could be outlining a particular author’s writing style. Drift in grammatical distribution represents the idea that two datasets possess statistically different grammatical distributions. This matters because machine learning datasets and benchmarks often contain multiple splits, importantly a ‘train’ split which the model learns from and a ‘test’ split which the model is evaluated against; if the two splits drift apart, the model is evaluated on text whose grammar differs from the text it learned from.
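To make the definition concrete, here is a minimal sketch of a grammatical distribution computed as the relative frequency of part-of-speech tags over a series of texts. It does not use GraDiAn itself: the function name `grammatical_distribution` and the hand-tagged example sentences are assumptions made purely for illustration, and in practice the tags would come from a parser.

```python
from collections import Counter


def grammatical_distribution(tagged_texts):
    """Relative frequency of each tag over a series of tagged texts.

    `tagged_texts` is a list of texts, each a list of (token, tag) pairs.
    This is a hypothetical helper, not part of GraDiAn's API.
    """
    counts = Counter(tag for text in tagged_texts for _, tag in text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hand-tagged toy data; a real pipeline would obtain tags from a parser.
texts = [
    [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("A", "DET"), ("dog", "NOUN"), ("barks", "VERB"), ("loudly", "ADV")],
]

dist = grammatical_distribution(texts)
for tag, freq in sorted(dist.items()):
    print(f"{tag}: {freq:.3f}")
```

The same idea applies to any grammatical property, not just POS tags: swap the tag for a dependency label or any other attribute and the distribution is computed the same way.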

Data Types

SentTree

The first of the two abstract data types which GraDiAn provides is the SentTree structure. The SentTree class allows a user to categorise sentences based on parse-trees and the appearances of different linguistic properties of tokens on that tree. For example, where a standard parse-tree may only display the tokens and the syntactic dependencies between them, a SentTree can be expanded to include other attributes such as part-of-speech (POS) tags and sentiment.
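The idea of categorising sentences by an attribute-annotated parse-tree can be sketched roughly as follows. This is not GraDiAn's SentTree implementation: `TreeNode`, `signature` and the hand-built trees are hypothetical names and data used only to illustrate the concept. Each node carries a dependency label and a POS tag, and the signature ignores the tokens themselves, so sentences with the same structure fall into the same category.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    """One token on a parse-tree (hypothetical, not GraDiAn's SentTree)."""
    token: str
    dep: str   # syntactic dependency label
    pos: str   # part-of-speech tag
    children: list = field(default_factory=list)


def signature(node, attrs=("dep",)):
    """Describe the tree using only the chosen attributes, not the tokens."""
    label = "/".join(getattr(node, attr) for attr in attrs)
    if not node.children:
        return label
    inner = ",".join(signature(child, attrs) for child in node.children)
    return f"{label}({inner})"


# "The cat sleeps" and "A dog barks" share a structure; "He runs" does not.
s1 = TreeNode("sleeps", "ROOT", "VERB",
              [TreeNode("cat", "nsubj", "NOUN", [TreeNode("The", "det", "DET")])])
s2 = TreeNode("barks", "ROOT", "VERB",
              [TreeNode("dog", "nsubj", "NOUN", [TreeNode("A", "det", "DET")])])
s3 = TreeNode("runs", "ROOT", "VERB", [TreeNode("He", "nsubj", "PRON")])

print(signature(s1))                         # dependencies only
print(signature(s1, attrs=("dep", "pos")))   # dependencies plus POS tags
print(signature(s1) == signature(s2))        # same category
print(signature(s1) == signature(s3))        # different category
```

Choosing which attributes go into the signature controls how fine-grained the categories are: adding POS tags or sentiment splits sentences that share a bare dependency structure into smaller groups.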

Syntactic Dependency Counter (SDC)

The Syntactic Dependency Counter is a much simpler structure than the SentTree. The SDC class was designed as a deliberate contrast to popular word-embedding techniques. Where traditional word-embedding techniques discard explicit grammatical properties of the text, the SDC reduces text down to just its syntactic dependencies.
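Reducing a sentence to just its syntactic dependencies might look something like the sketch below. It is an illustration of the idea rather than GraDiAn's SDC class: the function `count_dependencies` is a hypothetical name, and the dependency labels are written by hand here where a real pipeline would take them from a dependency parser.

```python
from collections import Counter


def count_dependencies(parsed_sentence):
    """Count dependency labels, discarding the tokens themselves.

    `parsed_sentence` is a list of (token, dependency_label) pairs.
    Hypothetical helper for illustration, not GraDiAn's API.
    """
    return Counter(dep for _, dep in parsed_sentence)


# Hand-labelled toy sentence: "The quick fox jumps over the dog"
sentence = [
    ("The", "det"), ("quick", "amod"), ("fox", "nsubj"), ("jumps", "ROOT"),
    ("over", "prep"), ("the", "det"), ("dog", "pobj"),
]

sdc = count_dependencies(sentence)
print(sdc.most_common())
```

Because the words are thrown away, two sentences about entirely different topics produce the same counter whenever they use the same mix of dependencies, which is the opposite trade-off to a word embedding.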

Example Usage

In the report mentioned above, both SentTrees and SDCs were used to represent attributes of different data points in state-of-the-art NLP datasets. Each attribute was then evaluated for distribution drift between train and test splits. Further work could include investigating the effect this drift can have on the performance of machine learning models.
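As a rough illustration of what evaluating drift between splits can involve, the sketch below computes a Pearson chi-squared statistic over the dependency counts of two splits. This is an assumed choice of test for demonstration only; it is not necessarily the method used in the report, and `chi_squared_statistic` and the toy counts are hypothetical. Turning the statistic into a p-value needs the chi-squared distribution, which a statistics package would supply.

```python
from collections import Counter


def chi_squared_statistic(train_counts, test_counts):
    """Pearson chi-squared statistic for homogeneity of two count tables.

    Returns (statistic, degrees_of_freedom). A larger statistic means the
    two splits differ more. Hypothetical helper, not GraDiAn's API.
    """
    n_train = sum(train_counts.values())
    n_test = sum(test_counts.values())
    if n_train == 0 or n_test == 0:
        raise ValueError("both splits must contain at least one observation")
    total = n_train + n_test

    # Only keep categories that were actually observed in either split.
    categories = [c for c in sorted(set(train_counts) | set(test_counts))
                  if train_counts.get(c, 0) + test_counts.get(c, 0) > 0]

    statistic = 0.0
    for category in categories:
        observed = (train_counts.get(category, 0), test_counts.get(category, 0))
        column_total = sum(observed)
        for obs, row_total in zip(observed, (n_train, n_test)):
            expected = row_total * column_total / total
            statistic += (obs - expected) ** 2 / expected
    return statistic, len(categories) - 1


# Toy dependency counts for two splits with clearly different mixes.
train = Counter(nsubj=30, dobj=10)
test = Counter(nsubj=10, dobj=30)

stat, dof = chi_squared_statistic(train, test)
print(f"chi-squared = {stat:.2f} with {dof} degree(s) of freedom")
```

Two splits with identical proportions give a statistic of zero regardless of their sizes, so the measure reflects differences in grammatical make-up rather than differences in how much data each split holds.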

