TY - JOUR

T1 - Minimum Description Length Principle in Supervised Learning with Application to Lasso

AU - Kawakita, Masanori

AU - Takeuchi, Jun'ichi

N1 - Funding Information:
Manuscript received May 16, 2016; revised April 19, 2020; accepted May 4, 2020. Date of publication May 29, 2020; date of current version June 18, 2020. This work was supported in part by JSPS KAKENHI Grant Numbers 25870503 and 18H03291 and in part by the Okawa Foundation for Information and Telecommunications. This article was presented in part at the 33rd International Conference on Machine Learning. (Corresponding author: Masanori Kawakita.) Masanori Kawakita was with the Graduate School of Informatics, Nagoya University, Nagoya 464-8601, Japan. He is now with Mie Toyopet Corporation, Tsu city 514-0821, Japan (e-mail: m.kawakita@mietoyopet.co.jp).
Publisher Copyright:
© 1963-2012 IEEE.

PY - 2020/7

Y1 - 2020/7

N2 - The minimum description length (MDL) principle is extended to supervised learning. The MDL principle is the philosophy that the shortest description of given data leads to the best hypothesis about the data source. One of the key theories for the MDL principle is Barron and Cover's theory (BC theory), which mathematically justifies the MDL principle based on two-stage codes in density estimation (unsupervised learning). Though the codelength of two-stage codes looks similar to the target function of penalized likelihood methods, parameter optimization in penalized likelihood methods is done without quantization of the parameter space. Recently, Chatterjee and Barron provided theoretical tools to extend BC theory to penalized likelihood methods by overcoming this difference. Indeed, applying their tools, they showed that the famous penalized likelihood method 'lasso' can be interpreted as an MDL estimator and enjoys a performance guarantee by BC theory. Importantly, their results assume a fixed design setting, which is essentially the same as unsupervised learning. The fixed design is natural if we use lasso for compressed sensing. If we use lasso for supervised learning, however, the fixed design is unsatisfactory; only the random design is acceptable. However, it is inherently difficult to extend BC theory to the random design, regardless of whether the parameter space is quantized. In this paper, a novel theoretical tool for extending BC theory to supervised learning (the random design setting without quantization of the parameter space) is provided. Applying this tool, when the covariates follow a Gaussian distribution, it is proved that lasso in the random design setting can also be interpreted as an MDL estimator, and that it enjoys the risk bound of BC theory. The risk/regret bounds obtained have several advantages inherited from BC theory. First, the bounds require remarkably few assumptions. Second, the bounds hold for any finite sample size n and any finite feature number p, even if n ≪ p. The behavior of the regret bound is investigated by numerical simulations. We believe that this is the first extension of BC theory to supervised learning (random design).

UR - http://www.scopus.com/inward/record.url?scp=85087176519&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85087176519&partnerID=8YFLogxK

U2 - 10.1109/TIT.2020.2998577

DO - 10.1109/TIT.2020.2998577

M3 - Article

AN - SCOPUS:85087176519

VL - 66

SP - 4245

EP - 4269

JO - IEEE Transactions on Information Theory

JF - IEEE Transactions on Information Theory

SN - 0018-9448

IS - 7

M1 - 9103589

ER -