Effect of clustering data in improving machine learning model accuracy

Samih M. Mostafa, Hirofumi Amano

Research output: Contribution to journal › Article

Abstract

Supervised machine learning algorithms consider the relationship between the dependent and independent variables rather than the relationships among instances. They learn the mapping from input to output from historical data in order to make accurate predictions about unseen future data. Conventional prediction algorithms are usually based on a single model trained on the historical data. However, the instances in the historical data may vary in their characteristics, and this variation may arise because some cases are more pertinent to certain cases than to others. The problem with such algorithms is that they treat the whole dataset uniformly, without considering this variation. This paper presents a novel technique for improving the prediction accuracy of the trained model. The proposed method clusters the data using the K-means clustering algorithm and then applies the prediction algorithm to each cluster. The value of K that gives the highest accuracy is selected. The authors performed a comparative study of the proposed technique against popular prediction methods, namely Linear Regression, Ridge, Lasso, and Elastic Net. An analysis of five datasets of different sizes, with different numbers of clusters, showed that the proposed technique achieves better accuracy in terms of Root Mean Square Error (RMSE) and coefficient of determination (R²).
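The cluster-then-predict idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic one-feature dataset, the quantile-based K-means initialisation, the held-out test split, the plain least-squares line used as the prediction algorithm, and all function names are assumptions made here for illustration. The sketch clusters the training features for several values of K, fits one model per cluster, and keeps the K with the lowest RMSE.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate cluster: fall back to predicting the mean
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def nearest(x, cents):
    """Index of the centroid closest to x."""
    return min(range(len(cents)), key=lambda j: abs(x - cents[j]))


def kmeans_1d(xs, k, iters=50):
    """Lloyd's algorithm on scalars; centroids start at evenly spaced quantiles."""
    srt = sorted(xs)
    cents = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(x, cents)].append(x)
        # An empty cluster keeps its previous centroid.
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, cents)]
        if new == cents:
            break
        cents = new
    return cents


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def evaluate_k(k, train, test):
    """Cluster the training features, fit one line per cluster, score on test."""
    cents = kmeans_1d([x for x, _ in train], k)
    overall = fit_line([x for x, _ in train], [y for _, y in train])
    models = []
    for j in range(k):
        members = [(x, y) for x, y in train if nearest(x, cents) == j]
        models.append(fit_line(*zip(*members)) if members else overall)
    preds = []
    for x, _ in test:
        a, b = models[nearest(x, cents)]  # route each test point to its cluster's model
        preds.append(a * x + b)
    y_true = [y for _, y in test]
    return rmse(y_true, preds), r2(y_true, preds)


# Synthetic data (an assumption): two regimes with different linear trends.
random.seed(0)
data = []
for _ in range(200):
    x = random.uniform(0.0, 1.0)
    data.append((x, 2.0 * x + random.gauss(0.0, 0.05)))
    x = random.uniform(5.0, 6.0)
    data.append((x, 20.0 - 3.0 * x + random.gauss(0.0, 0.05)))
random.shuffle(data)
train, test = data[:300], data[300:]

# K = 1 is the conventional single model; larger K uses one model per cluster.
scores = {k: evaluate_k(k, train, test) for k in range(1, 6)}
best_k = min(scores, key=lambda k: scores[k][0])
for k, (err, det) in scores.items():
    print(f"K={k}: RMSE={err:.3f}, R2={det:.3f}")
print("selected K:", best_k)
```

Here K = 1 stands in for the conventional approach of training on the whole dataset, so comparing its scores with those of the selected K mirrors the comparison the abstract describes. For real multi-feature data, a library K-means and the regressors named in the abstract would take the place of the hand-written helpers.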

Original language: English
Pages (from-to): 2973-2981
Number of pages: 9
Journal: Journal of Theoretical and Applied Information Technology
Volume: 97
Issue number: 21
Publication status: Published - Nov 15, 2019

Fingerprint

Data Clustering
Learning systems
Historical Data
Machine Learning
Learning algorithms
Learning Algorithm
Prediction
Coefficient of Determination
Lasso
K-means Algorithm
K-means Clustering
Supervised Learning
Number of Clusters
Ridge
Linear regression
Clustering algorithms
Mean square error
Model
Comparative Study
Clustering Algorithm

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Effect of clustering data in improving machine learning model accuracy. / Mostafa, Samih M.; Amano, Hirofumi.

In: Journal of Theoretical and Applied Information Technology, Vol. 97, No. 21, 15.11.2019, p. 2973-2981.

Research output: Contribution to journal › Article

@article{ce10548119334fb687c5f6ae326434c6,
title = "Effect of clustering data in improving machine learning model accuracy",
abstract = "Supervised machine learning algorithms consider the relationship between the dependent and independent variables rather than the relationships among instances. They learn the mapping from input to output from historical data in order to make accurate predictions about unseen future data. Conventional prediction algorithms are usually based on a single model trained on the historical data. However, the instances in the historical data may vary in their characteristics, and this variation may arise because some cases are more pertinent to certain cases than to others. The problem with such algorithms is that they treat the whole dataset uniformly, without considering this variation. This paper presents a novel technique for improving the prediction accuracy of the trained model. The proposed method clusters the data using the K-means clustering algorithm and then applies the prediction algorithm to each cluster. The value of K that gives the highest accuracy is selected. The authors performed a comparative study of the proposed technique against popular prediction methods, namely Linear Regression, Ridge, Lasso, and Elastic Net. An analysis of five datasets of different sizes, with different numbers of clusters, showed that the proposed technique achieves better accuracy in terms of Root Mean Square Error (RMSE) and coefficient of determination ($R^{2}$).",
author = "Mostafa, Samih M. and Amano, Hirofumi",
year = "2019",
month = "11",
day = "15",
language = "English",
volume = "97",
pages = "2973--2981",
journal = "Journal of Theoretical and Applied Information Technology",
issn = "1992-8645",
publisher = "Asian Research Publishing Network (ARPN)",
number = "21"
}

TY - JOUR

T1 - Effect of clustering data in improving machine learning model accuracy

AU - Mostafa, Samih M.

AU - Amano, Hirofumi

PY - 2019/11/15

Y1 - 2019/11/15

N2 - Supervised machine learning algorithms consider the relationship between the dependent and independent variables rather than the relationships among instances. They learn the mapping from input to output from historical data in order to make accurate predictions about unseen future data. Conventional prediction algorithms are usually based on a single model trained on the historical data. However, the instances in the historical data may vary in their characteristics, and this variation may arise because some cases are more pertinent to certain cases than to others. The problem with such algorithms is that they treat the whole dataset uniformly, without considering this variation. This paper presents a novel technique for improving the prediction accuracy of the trained model. The proposed method clusters the data using the K-means clustering algorithm and then applies the prediction algorithm to each cluster. The value of K that gives the highest accuracy is selected. The authors performed a comparative study of the proposed technique against popular prediction methods, namely Linear Regression, Ridge, Lasso, and Elastic Net. An analysis of five datasets of different sizes, with different numbers of clusters, showed that the proposed technique achieves better accuracy in terms of Root Mean Square Error (RMSE) and coefficient of determination (R²).

AB - Supervised machine learning algorithms consider the relationship between the dependent and independent variables rather than the relationships among instances. They learn the mapping from input to output from historical data in order to make accurate predictions about unseen future data. Conventional prediction algorithms are usually based on a single model trained on the historical data. However, the instances in the historical data may vary in their characteristics, and this variation may arise because some cases are more pertinent to certain cases than to others. The problem with such algorithms is that they treat the whole dataset uniformly, without considering this variation. This paper presents a novel technique for improving the prediction accuracy of the trained model. The proposed method clusters the data using the K-means clustering algorithm and then applies the prediction algorithm to each cluster. The value of K that gives the highest accuracy is selected. The authors performed a comparative study of the proposed technique against popular prediction methods, namely Linear Regression, Ridge, Lasso, and Elastic Net. An analysis of five datasets of different sizes, with different numbers of clusters, showed that the proposed technique achieves better accuracy in terms of Root Mean Square Error (RMSE) and coefficient of determination (R²).

UR - http://www.scopus.com/inward/record.url?scp=85075496297&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85075496297&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:85075496297

VL - 97

SP - 2973

EP - 2981

JO - Journal of Theoretical and Applied Information Technology

JF - Journal of Theoretical and Applied Information Technology

SN - 1992-8645

IS - 21

ER -