Published November 13, 2017 | Supplemental Material + Published
Journal Article Open

A non-linear data mining parameter selection algorithm for continuous variables

Abstract

In this article, we propose a new data mining algorithm that both captures the non-linearity in data and finds the best subset model. To produce an enhanced subset of the original variables, a preferred selection method should be able to add a supplementary level of regression analysis that captures complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method that we present here has the potential to produce an optimal subset of variables, rendering the overall process of model selection more efficient. The algorithm introduces interpretable parameters by transforming the original inputs, while also providing a faithful fit to the data. The core objective of this paper is to introduce a new estimation technique for the classical least squares regression framework. This new automatic variable transformation and model selection method could offer an optimal and stable model that minimizes the mean square error and variability, while combining the all-possible-subsets selection methodology with the inclusion of variable transformations and interactions. Moreover, this method controls multicollinearity, leading to an optimal set of explanatory variables.
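
The general idea described in the abstract (transform the predictors, add interaction terms, then search all subsets under a least squares fit while guarding against multicollinearity) can be illustrated with a minimal sketch. This is not the authors' algorithm: the transformation set (square and log), the hold-out validation split, the pairwise-correlation threshold, and the names `build_candidates`, `best_subset`, `max_terms`, and `max_abs_corr` are all assumptions chosen for illustration.

```python
import itertools
import numpy as np


def build_candidates(X, names):
    """Expand raw predictors with simple transformations and pairwise interactions."""
    cols, labels = [], []
    for j, name in enumerate(names):
        x = X[:, j]
        cols += [x, x ** 2, np.log(np.abs(x) + 1.0)]
        labels += [name, f"{name}^2", f"log(|{name}|+1)"]
    for i, j in itertools.combinations(range(len(names)), 2):
        cols.append(X[:, i] * X[:, j])
        labels.append(f"{names[i]}*{names[j]}")
    return np.column_stack(cols), labels


def fit_ols(Z, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(Z)), Z])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def mse(Z, y, beta):
    """Mean square error of a fitted model on (Z, y)."""
    A = np.column_stack([np.ones(len(Z)), Z])
    return float(np.mean((y - A @ beta) ** 2))


def best_subset(Z_train, y_train, Z_val, y_val, max_terms=2, max_abs_corr=0.95):
    """Exhaustive search over subsets of up to max_terms candidate columns.

    Subsets containing a highly correlated pair are skipped (a crude,
    assumed stand-in for multicollinearity control). The winner is the
    subset with the lowest validation MSE.
    """
    corr = np.abs(np.corrcoef(Z_train, rowvar=False))
    best_err, best_idx = np.inf, ()
    for k in range(1, max_terms + 1):
        for subset in itertools.combinations(range(Z_train.shape[1]), k):
            if any(corr[i, j] > max_abs_corr
                   for i, j in itertools.combinations(subset, 2)):
                continue
            idx = list(subset)
            beta = fit_ols(Z_train[:, idx], y_train)
            err = mse(Z_val[:, idx], y_val, beta)
            if err < best_err:
                best_err, best_idx = err, subset
    return best_err, best_idx


# Synthetic data (illustrative only): y depends on x0^2 and on x1*x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] ** 2 + 2.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=200)

Z, labels = build_candidates(X, ["x0", "x1", "x2"])
err, subset = best_subset(Z[:100], y[:100], Z[100:], y[100:])
selected = [labels[i] for i in subset]
print(selected, err)
```

On this synthetic example the search should recover the two generating terms, since every other candidate subset leaves a large part of the signal unexplained. Exhaustive search grows combinatorially with the number of candidate columns, so a sketch like this is only practical for small problems.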

Additional Information

© 2017 Tavallali et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Received: May 23, 2017; Accepted: October 24, 2017; Published: November 13, 2017.

Data Availability: All synthetic data generated or analyzed during this study are included in the Supporting Information. The human data used in this study come from the Framingham Heart Study. These data are publicly available to qualified investigators. De-identified data can be provided to investigators of approved research proposals. Data can be requested by submitting a research application to one of the following: the Framingham Heart Study directly (https://www.framinghamheartstudy.org/), BioLINCC (https://biolincc.nhlbi.nih.gov/home/), or dbGaP (https://www.ncbi.nlm.nih.gov/gap). Data sets used in this study can be found using the following links:
1. Gen3 cohort: https://biolincc.nhlbi.nih.gov/studies/gen3/?q=framingham
2. Original Cohort: https://biolincc.nhlbi.nih.gov/studies/framcohort/?=framingham
3. Offspring Cohort: https://biolincc.nhlbi.nih.gov/studies/framoffspring/?q=framingham

Funding: The research leading to this manuscript was not funded. The author Sean Brady (S.B.), affiliated with Principium Consulting, LLC, has not financially contributed to this research. This author participated in the original idea of the study through discussions with the first author, Peyman Tavallali (P.T.). S.B. helped draft the manuscript and revised it critically for important intellectual content. S.B.'s contribution to this study has been solely individual, non-profit, scientific, and unfunded. Neither S.B. nor Principium Consulting, LLC provided financial support in any form for this study. No funder provided support in the form of salaries for authors, and no funder had any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the 'author contributions' section.

Competing interests: We declare no competing interest. Principium Consulting, LLC is not doing research or business in the field of statistical learning. There are no marketed products, employment, consultancy, patents, or products in development relating to the material of this manuscript. The collaboration with S.B. does not alter our adherence to PLOS ONE policies on sharing data and materials. There are no restrictions on sharing of data and/or materials regarding the manuscript.

Acknowledgments: We would like to thank Dr. Niema M. Pahlevan and Prof. Morteza Gharib for giving us permission to use the Framingham Heart Study data in this paper. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or conclusions of the Framingham Heart Study or the NHLBI.

Author Contributions: Conceptualization: Peyman Tavallali, Sean Brady. Formal analysis: Peyman Tavallali, Marianne Razavi. Investigation: Peyman Tavallali, Marianne Razavi. Methodology: Peyman Tavallali. Software: Peyman Tavallali. Supervision: Peyman Tavallali. Validation: Peyman Tavallali, Marianne Razavi. Writing – original draft: Peyman Tavallali. Writing – review & editing: Peyman Tavallali, Marianne Razavi, Sean Brady.

Attached Files

Published - journal.pone.0187676.pdf

Supplemental Material - S1Dataset.zip

Files (7.8 MB)

md5:00f8297956200d4c384f6adbdf0f7fec (70.3 kB)
md5:b82a56a6765b529b89e1b83dd97e053e (7.7 MB)

Additional details

Created: August 19, 2023
Modified: October 17, 2023