ABSTRACT

Spam email messages are a serious problem for both users and Internet service providers. Such messages may carry viruses and harmful content, and they occupy a large amount of mailbox space. Email classification is therefore an important problem to analyze and discuss. This research work aims to classify email messages as either spam or non-spam. The dataset of email messages can be represented in matrix form: the rows represent the instances (messages), and the columns represent the features of those instances. Two classifiers, K-Nearest Neighbor (KNN) and Naive Bayes (NB), are used to classify the email messages. The proposed approach is based on partitioning the dataset into segments, and it is compared with the adopted approach. Moreover, feature selection methods are adopted to choose the significant features and eliminate the others, which avoids processing overhead. The choice of relevant features plays an important role in classification accuracy. In this work, several feature selection methods are adopted, analyzed, and applied, and their performance is compared. A feature selection method is also proposed and discussed, and its performance is compared with that of the adopted methods. The experiments are carried out on a dataset taken from the Internet. The dataset contains about four thousand messages with fifty-eight features, and it includes a target feature representing the class labels. The practical experiments show that the proposed method performs better than the adopted ones. The proposed method is also expected to be applicable to datasets from other application domains.

Keywords: Spam Messages, Classification Algorithms, Feature Selection Methods, Text Representation, Performance Evaluation
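
The matrix representation and the KNN classifier mentioned in the abstract can be illustrated with a minimal sketch. It is not the implementation used in this work: the toy values, the function names (`euclidean`, `knn_predict`, `select_columns`), the choice of Euclidean distance, and k = 3 are all assumptions made for illustration. The column-selection helper only shows what dropping features looks like once some feature selection method has chosen them; it does not reproduce the proposed method.

```python
import math
from collections import Counter

# Toy dataset (hypothetical values): each row is one message (instance),
# each column is one feature of that message.
X = [
    [0.9, 0.8, 0.1],  # spam-like rows
    [0.8, 0.9, 0.2],
    [0.7, 0.7, 0.0],
    [0.1, 0.0, 0.9],  # non-spam-like rows
    [0.2, 0.1, 0.8],
    [0.0, 0.2, 0.7],
]
# Target feature holding the class labels: 1 = spam, 0 = non-spam.
y = [1, 1, 1, 0, 0, 0]


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(X_train, y_train, query, k=3):
    """Label a query vector by majority vote among its k nearest training rows."""
    neighbours = sorted(
        zip(X_train, y_train),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def select_columns(rows, keep):
    """Keep only the listed column indices (stand-in for a feature selection result)."""
    return [[row[j] for j in keep] for row in rows]


if __name__ == "__main__":
    print(knn_predict(X, y, [0.85, 0.75, 0.1]))  # classify a spam-like message
    reduced = select_columns(X, [0, 2])           # drop the middle feature
    print(knn_predict(reduced, y, [0.1, 0.9]))    # classify in the reduced feature space
```

Classifying in the reduced feature space uses the same `knn_predict` function, which is why eliminating features lowers the cost of every distance computation.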