In the era of big data, profitable opportunities are becoming available for many applications. As the amount of data keeps increasing, machine learning becomes an attractive tool for analyzing the information acquired. However, harnessing meaningful data remains a challenge. The machine learning tools employed in many applications use all of the training data without considering how relevant each sample is. In this paper, we propose a data selection strategy for the training step of neural networks that retains the most informative data and improves algorithm performance during training. The strategy applies to both classification and regression problems, leading to computational savings and a reduction in classification error. Examples based on open datasets, including a deep neural network case, corroborate the effectiveness of the proposed approach.
Introduction
Over the last few years, machine learning methods have grown steadily in popularity. The quantity of information generated in the world is soaring, raising the basic question of whether all stored data is useful. The current trend of creating data at an increasing rate is overwhelming our capacity to store and analyze it, and to make proper use of learning algorithms. In part, this phenomenon originates from the proliferation of sensors, human-computer interactions, the Internet of Things, medical data, and machine-to-machine and mobile communications, to name a few data generators.