Events
DMS Statistics and Data Science Seminar |
Time: Sep 24, 2020 (02:00 PM) |
Location: ZOOM |
Details: Speaker: Artür Manukyan (University of Massachusetts Medical School & the Broad Institute of MIT and Harvard University)
Title: Graph-based Learning for Class Cover Problem and Adaptive Clustering Algorithms Using Statistical Tests of Spatial Data Analysis
Abstract: In statistical learning, numerous methods are based on graphs. One type of graph, the proximity graph, offers solutions to many challenges in supervised and unsupervised learning. Class cover catch digraphs (CCCDs) are proximity digraphs that were introduced to investigate the class cover problem (CCP). The goal of the CCP is to find a set of hyperballs whose union encapsulates, or covers, a subset of the training data from the class of interest. The CCP is closely related to statistical classification, and CCCDs achieve relatively good performance in many classification and clustering tasks, such as imbalanced learning and hot spot detection. We discuss the advantages of CCCDs in statistical learning, but we primarily focus on clustering algorithms based on recently developed unsupervised adaptations of CCCDs, called cluster catch digraphs (CCDs). These digraphs are used to devise clustering methods that are hybrids of density-based and graph-based methods. CCDs are appealing for partitioning and clustering data sets because they estimate the number of clusters without validation indices; however, CCDs, like density-based methods in general, require a parameter representing the spatial intensity of the clusters assumed to exist in the data set. We offer parameter-free versions of the CCD algorithm that do not require specifying the spatial intensity parameter, whose choice is often critical to finding an optimal partitioning of the data set. We approach the problem of estimating the number of clusters by borrowing a tool from spatial data analysis, namely Ripley's K function. We call our new digraphs, which are based on the K function, R-CCDs. We show that the domination number of R-CCDs locates and separates the clusters from the noise clusters in data sets and, hence, allows the estimation of the true number of clusters.
Our clustering algorithms are composed of methods that estimate both the number of clusters and the spatial intensity parameter, making them completely parameter-free. We conduct Monte Carlo simulations and experiments on real-life data to compare R-CCDs with some commonly used density-based and prototype-based clustering methods.
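For readers unfamiliar with Ripley's K function mentioned in the abstract, the following is a minimal sketch of the textbook estimator in two dimensions, without edge correction. It is not the speaker's R-CCD algorithm; the function name `ripley_k`, the unit-square window, and the simulated point sets are assumptions made only for illustration. Under complete spatial randomness, K(t) is approximately pi * t^2, so values far above that suggest clustering at scale t.

```python
import math
import random


def ripley_k(points, t, area):
    """Naive estimate of Ripley's K at distance t (no edge correction).

    points: list of (x, y) tuples observed in a window of the given area.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Count ordered pairs (i, j), i != j, that lie within distance t.
    close_pairs = 0
    for i in range(n):
        xi, yi = points[i]
        for j in range(i + 1, n):
            xj, yj = points[j]
            if math.hypot(xi - xj, yi - yj) <= t:
                close_pairs += 2  # counts both (i, j) and (j, i)
    # K(t) = area / (n * (n - 1)) * number of close ordered pairs
    return area * close_pairs / (n * (n - 1))


# Hypothetical data: uniform noise versus one tight Gaussian cluster,
# both treated as observations in the unit square (area assumed to be 1).
rng = random.Random(0)
uniform = [(rng.random(), rng.random()) for _ in range(400)]
clustered = [(0.5 + rng.gauss(0, 0.02), 0.5 + rng.gauss(0, 0.02))
             for _ in range(400)]

t = 0.05
k_uniform = ripley_k(uniform, t, area=1.0)
k_clustered = ripley_k(clustered, t, area=1.0)
csr_reference = math.pi * t ** 2  # expected K(t) under spatial randomness

print(f"uniform:   {k_uniform:.4f}")
print(f"clustered: {k_clustered:.4f}")
print(f"pi * t^2:  {csr_reference:.4f}")
```

Because no edge correction is applied, the uniform estimate tends to fall slightly below pi * t^2; practical analyses usually add a correction term for points near the window boundary.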
This is joint work with Elvan Ceyhan (Mathematics and Statistics, Auburn University).
Seminar website: http://webhome.auburn.edu/~ezc0066/stat-datasci-seminar.html
Join from PC, Mac, Linux, iOS or Android: https://auburn.zoom.us/j/93758346031
If you're a new participant, we have a quick start guide here: https://aub.ie/zoomquickstart
|