Junheng Hao, Chuan Lei, et al.
KDD 2021
Consider a dataset with features such as {SEX, INCOME, RACE, EDUCATION}. A user may want to know where in the feature space observations are concentrated and where they are sparse or absent. An interpretable region is a “hypercube”, such as {RACE in {Black, White}} & {10 <= EDUCATION <= 13}, which contains all observations satisfying the constraints; typically, such regions are defined by a small number of features, say three or fewer. To quantify the density around each multivariate observation, we compute Gower distances between observations, which handle both numeric and categorical features, and pass them to OPTICS. We then recursively partition the dataset with regression trees into regions that reflect different average levels of density and can be ranked. These regions can be useful on their own for manual data exploration or as input to another application; for instance, an ML model may perform worse in sparse data regions, so the partition may help predict the model’s performance for particular feature values. We believe that density-based partitioning of mixed-type data into interpretable regions is novel and theoretically interesting. Results are shown on toy data that can be visualized.
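As a rough illustration of the distance step described in the abstract, the sketch below computes Gower distance on mixed numeric and categorical records, then derives a simple per-observation sparsity score. It is not the authors' implementation. The function names, the toy records, and the use of "distance to the k-th nearest neighbor" as a simplified stand-in for the density signal that OPTICS would provide are all assumptions made for this example.

```python
from typing import List, Optional, Sequence, Union

Value = Union[int, float, str]


def gower_distance(a: Sequence[Value], b: Sequence[Value],
                   numeric_ranges: Sequence[Optional[float]]) -> float:
    """Gower distance between two mixed-type records.

    numeric_ranges[j] holds (max - min) of feature j when it is numeric,
    or None when the feature is categorical (hypothetical convention).
    """
    total = 0.0
    for x, y, r in zip(a, b, numeric_ranges):
        if r is None:
            # Categorical feature: 0 if equal, 1 otherwise.
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # Numeric feature: absolute difference scaled by the range.
            total += abs(x - y) / r
        # A constant numeric feature (r == 0) contributes nothing.
    return total / len(numeric_ranges)


def kth_neighbor_distances(rows: Sequence[Sequence[Value]],
                           numeric_ranges: Sequence[Optional[float]],
                           k: int) -> List[float]:
    """Distance from each row to its k-th nearest other row.

    Smaller values suggest denser neighborhoods. This is only a
    simplified proxy, not the OPTICS algorithm itself.
    """
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(gower_distance(a, b, numeric_ranges)
                       for j, b in enumerate(rows) if j != i)
        scores.append(dists[k - 1])
    return scores


# Toy records with features (RACE, EDUCATION); values are made up.
rows = [("Black", 12), ("White", 12), ("White", 13), ("Asian", 20)]
ranges = [None, 8.0]  # RACE is categorical; EDUCATION spans 20 - 12 = 8

sparsity = kth_neighbor_distances(rows, ranges, k=1)
```

In this toy data the last record differs from every other record in both features, so it receives the largest score. Scores of this kind could then serve as the target of a regression tree whose leaves correspond to hypercube regions, in the spirit of the partitioning step the abstract describes.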
Bobak Pezeshki, Radu Marinescu, et al.
UAI 2022
Eliran Roffe, Samuel Ackerman, et al.
AAAI 2022
Nandana Mihindukulasooriya, Sarthak Dash, et al.
ISWC 2023