Published May 31, 2006 | Submitted
Report Open

Data complexity in machine learning

Abstract

We investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by the data set. It is closely related to several existing principles used in machine learning such as Occam's razor, the minimum description length, and the Bayesian approach. The data complexity can also be defined based on a learning model, which is more realistic for applications. We demonstrate the application of the data complexity in two learning problems: data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets, which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and propose methods for estimating the complexity contribution. Since in practice we have to approximate the ideal data complexity measures, we also discuss the impact of such approximations.
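
To make the notion of Pareto optimality with respect to complexity and set size concrete, the following is a minimal sketch and not the report's method. It assumes that a larger subset and a lower complexity are both preferred, and that a complexity estimate for each candidate subset is already available. The function name `pareto_optimal`, the candidate labels, and all numeric values are hypothetical placeholders chosen for illustration.

```python
from typing import List, Tuple

# (label, subset size, complexity estimate); all values below are made up.
Candidate = Tuple[str, int, float]


def pareto_optimal(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates.

    Assumption: candidate X dominates candidate Y when X is at least as
    large and at most as complex as Y, and strictly better in one of the two.
    """
    front = []
    for label, size, complexity in candidates:
        dominated = any(
            (other_size >= size and other_complexity <= complexity)
            and (other_size > size or other_complexity < complexity)
            for _, other_size, other_complexity in candidates
        )
        if not dominated:
            front.append((label, size, complexity))
    return front


# Hypothetical candidate subsets of a data set.
candidates = [
    ("A", 10, 3.0),
    ("B", 10, 5.0),  # same size as A but more complex, so dominated by A
    ("C", 20, 6.0),
    ("D", 15, 7.0),  # smaller and more complex than C, so dominated by C
    ("E", 25, 9.0),
]

front = pareto_optimal(candidates)
print([label for label, _, _ in front])
```

With these placeholder values, subsets A, C, and E remain on the front: each trades a larger size for a higher complexity, and none is beaten on both criteria at once.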

Attached Files

Submitted - dcomplex.pdf

Files

dcomplex.pdf (1.6 MB)
md5:7b08c6f6eb70866146a8787a458fdd97

Additional details

Created: August 19, 2023
Modified: January 13, 2024