DATA MINING OF MEDICAL DATASETS WITH MISSING ATTRIBUTES FROM DIFFERENT SOURCES

Abstract

Two major problems in data mining are (1) dealing with missing values in the datasets used for knowledge discovery, and (2) using one dataset as a predictor of other datasets. We explore this problem using four different datasets from the UCI Machine Learning Repository, collected from four different sources and containing different amounts of missing values. Each dataset contains 13 attributes and one class attribute which denotes the presence or absence of heart disease. Missing values were replaced in a number of ways: first by using the standard mean and mode method, second by removing the attributes that contain missing values, and third by removing the records in which more than 50 percent of the values are missing and filling in the remaining missing values. We also experimented with different classification techniques, including Decision Tree, Naive Bayes, and Multilayer Perceptron, using the RapidMiner and Weka tools. The consistency of the datasets was evaluated by combining the datasets and comparing the results with the classification error of the individual datasets. The results show that if only a few missing values are present, the standard mean and mode method works well. If a larger number of missing values are present, the third method of removing records, along with the other preprocessing steps, works better, and using one dataset as a predictor of another dataset produced moderate accuracy.

Acknowledgments

TABLE OF CONTENTS

Abstract

Acknowledgments

1 Introduction

Background

Problems

Data Sets

2 LITERATURE REVIEW

2.1 RapidMiner

2.2 Decision Trees

2.3 Naive Bayes

2.4 Neural Networks

3 METHODOLOGY

3.1 Preprocessing the Data

3.2 Building the Model

4 Results

5 Conclusion

6 References

LIST OF TABLES

Table 1. All four datasets

Table 2. Percentage of missing values in each dataset

Table 3. Accuracy of the Cleveland dataset using classification algorithms

Table 4. Accuracy obtained after the model was tested

Table 5. Information gain weights of attributes

Table 6. Accuracy obtained after information gain weighting of attributes

Table 7. Accuracy of the Hungary dataset using the wrapper method

Table 8. Accuracy of the VA Long Beach dataset after replacing the missing values

Table 9. Feature selection accuracy of VA Long Beach

1 Introduction

1.1 Background

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information [1]. The steps involved in mining data are data integration, data selection, data cleaning, data transformation, and data mining itself. The first step is data selection; that is, selecting only the data that is useful for mining. The second step is data cleaning; the collected data may have errors, missing values, and inconsistencies which must be corrected. Even after cleaning, the data is not yet ready for mining, so the next step is data transformation: aggregation, smoothing, normalization, discretization, and so on. The final step is the data mining itself; that is, finding interesting patterns within the data. Data mining techniques include classification, regression, clustering, association, and others. [1]


A central data mining task is classification. Given a collection of records containing a set of attributes, where one of the attributes is the class attribute, the goal is to find a model for the class attribute as a function of the values of the other attributes. A training set is used to build the model and a test set is used to classify data according to that model. To predict the performance of a classifier on new data, we need to measure its error rate on a dataset that played no part in the training of the classifier. This independent dataset is called a test set [2].

1.2 Problems

The main focus of this paper is to find patterns in the datasets related to coronary artery disease from the UCI heart disease datasets. According to a 2007 report, about 16 million Americans have coronary artery disease (CAD). In the U.S., coronary artery disease is the leading killer of both men and women. Each year, about 500,000 people die because of CAD.

Normally a medical dataset reflects a number of tests conducted to diagnose a disease. However, most medical datasets have large numbers of missing values because of tests that are not conducted; many useful attribute values will be missing due to the expense of performing tests, attributes that could not be recorded when the data was collected, or attributes withheld because of privacy concerns. Further complicating the problem from a data mining point of view is that different groups of physicians collect different data; that is, different medical datasets often contain different attributes, making it difficult to use one dataset as a predictor of another.

This paper focuses on two different issues. The first is preprocessing the data, chiefly dealing with missing attributes, to measure the improved accuracy of the dataset after the preprocessing steps and to compare the accuracy using different classification algorithms such as Decision Tree, Naive Bayes, and Neural Networks.

The second focus is to train models on the data using classification techniques and to test them on different datasets, to verify the results obtained from the trained model. The Cleveland database, collected from the Cleveland Clinic Foundation, was used as the training set. The Switzerland, Hungary, and VA Long Beach datasets were used as test sets. All the datasets contain 13 attributes and one class attribute. The Cleveland dataset used for training contains only 6 missing values, while the three test datasets contain up to about 90% missing values in some attributes.

1.3 Datasets

The heart disease data is collected from the UCI repository. Each dataset consists of 13 attributes, of which 6 are numerical and the rest categorical, plus one class attribute.

The following are the four heart disease datasets:

1. Cleveland

2. Hungary

3. Switzerland

4. VA Long Beach

The sources and creators of the datasets are:

1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.

2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.

3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.

4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

These datasets originally contain 76 attributes, but only 14 are used in most research. The presence of heart disease is indicated by a value in the range 1 to 4, and its absence by the value 0.

Table 1. All four datasets

Data Set          Number of Cases
Cleveland         303
Hungary           294
Switzerland       123
VA Long Beach     200

The 14 attributes that are used are:

1. Age in years

2. Sex (1 = male; 0 = female)

3. Cp - chest pain type

Value 1: typical angina

Value 2: atypical angina

Value 3: non-anginal pain

Value 4: asymptomatic

4. Trestbps - resting blood pressure (in mm Hg on admission to the hospital)

5. Chol - serum cholesterol in mg/dl

6. Fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

7. Restecg - resting electrocardiographic results

Value 0: normal

Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

8. Thalach - maximum heart rate achieved

9. Exang - exercise induced angina (1 = yes; 0 = no)

10. Oldpeak - ST depression induced by exercise relative to rest

11. Slope - the slope of the peak exercise ST segment

Value 1: upsloping

Value 2: flat

Value 3: downsloping

12. Ca - number of major vessels (0-3) colored by fluoroscopy

13. Thal - 3 = normal; 6 = fixed defect; 7 = reversible defect

14. The predicted attribute: diagnosis of heart disease (angiographic disease status)

Value 0: < 50% diameter narrowing

Values 1, 2, 3, 4: > 50% diameter narrowing

The attributes with the largest amounts of missing values are the slope of the peak exercise ST segment (slope), the number of major vessels (0-3) colored by fluoroscopy (ca), the thallium test result (normal / fixed defect / reversible defect), represented by thal, and serum cholesterol in mg/dl (chol).

2 Literature Review

2.1 RapidMiner

RapidMiner, formerly known as YALE (Yet Another Learning Environment), is software widely used for machine learning, knowledge discovery, and data mining. RapidMiner is used both in research and in practical data mining settings.

RapidMiner is written in the Java programming language, which means it can run on any operating system. It can handle many input formats such as CSV, ARFF, SPSS, XRFF, and database example sources, with attributes described in an XML file. Its operators cover input, output, data preprocessing, and visualization.

RapidMiner contains more than 500 operators. Nested operators can be described through XML files created with the RapidMiner graphical user interface, and individual RapidMiner functions can also be called directly from the command line. It makes it easy to define analytical steps and to generate graphs. It provides a large collection of data mining algorithms for classification, and many visualization tools such as overlapping histograms, 3D scatter plots, and tree charts.

RapidMiner can handle many types of tasks, such as classification, clustering, validation, visualization, preprocessing, and post-processing. It also supports many kinds of preprocessing steps, including discretization, outlier detection and removal, filtering, selection, weighting, and normalization. All modeling and attribute evaluation methods from Weka are available within RapidMiner.

RapidMiner provides two views, the Design view and the Results view. The Design view is used to build and run a process; the Results view is used to display the results.

Figure 1. RapidMiner Interface

Figure 2. RapidMiner Design View

2.2 Decision Trees

Decision trees are a supervised learning technique commonly used for tasks such as classification, clustering, and regression. They are widely applied in finance, engineering, marketing, and medicine. Decision trees can handle nominal, numeric, and text data, and they focus on the relationships among the attributes. The input to a decision tree is a set of objects described by a set of properties, and the output is a yes/no decision or one of several different classifications.

Figure 3. Decision tree created by RapidMiner

Since decision trees can be represented graphically as tree-like structures, they are easy for humans to understand. The root node is the start of the tree, and each internal node evaluates an attribute. At each node, the value of that attribute for the given instance determines which branch to follow to a child node. An instance is classified by starting from the root node and continuing until a leaf node is reached.

Decision tree creation involves splitting the training data into root and leaf node divisions until the entire dataset has been analyzed [3]. The data is split until the instances at a node all have the same class, or no further splitting is possible due to a lack of remaining attributes.

An efficient decision tree is one in which the root node divides the data effectively and hence requires fewer nodes. An important step is to select the attribute that best splits the data into individual classes. The splitting is done based on the information gain of each attribute. Information gain is based on the concept of entropy, which measures the information required for a decision in bits. Entropy is calculated as

Entropy(p1, p2, ..., pn) = -Σi pi log2(pi)

The entropy is calculated for each attribute; it indicates how important an attribute is through the information it provides. First the entropy of the whole dataset is calculated, and then the entropy resulting from splitting on each attribute. The attribute that best splits the data is the one with the highest information gain, and the split is made on that attribute. The same procedure is repeated until the leaf nodes are reached.
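
As an illustration of this procedure, the following Python sketch (not the RapidMiner implementation; the DataFrame and column names such as "num" and "cp" are assumptions based on the attribute list in Section 1.3) computes the entropy of the class labels and the information gain of a single attribute.

```python
# A minimal sketch of entropy and information gain for one categorical attribute.
import numpy as np
import pandas as pd

def entropy(labels):
    """Entropy(p1..pn) = -sum(pi * log2(pi)) over the class proportions."""
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df, attribute, target="num"):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    total = entropy(df[target])
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(attribute)
    )
    return total - weighted

# Example (hypothetical file and column names following the UCI attribute list):
# df = pd.read_csv("cleveland.csv")
# print(information_gain(df, "cp"))
```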

A decision tree is built using a training dataset, and the tree can then be used to classify examples in a test dataset. Decision trees can also be used to explicitly describe data and to support decision making, as they produce rules that are easy to understand and can be read by any user.

Sometimes decision tree learning produces a tree that is too large. If the tree is too large, it generalizes poorly to new samples. Pruning is one of the important steps in decision tree learning that addresses this problem. The size of a decision tree can be reduced by pruning (that is, removing) the irrelevant branches whose removal does not reduce the accuracy of the tree. Pruning improves the accuracy of the tree on future instances. The problems of overfitting and noisy data are also reduced by pruning, since the irrelevant branches they create are ignored.

2.3 Naive Bayes

Another classifier is Naive Bayes. Naive Bayes operates in two phases, training and testing. It is computationally cheap and can handle around 10,000 attributes. It is also a fast and highly scalable model.

Naive Bayes treats attributes as independent of each other in terms of their contribution to the class attribute. For example, a fruit may be considered an apple if it is round, red, and 4 inches in diameter. Although these characteristics depend on each other, Naive Bayes considers them independently when classifying the fruit as an apple.

One of the advantages of Naive Bayes is that it does not require a large number of instances for every possible combination of attributes. Naive Bayes can be used for both binary and multiclass classification problems. It can only handle discrete or discretized attributes, so it requires binning. Several discretization methods exist, both supervised and unsupervised. Supervised discretization uses the class information of the training set to select the discretization cut points; unsupervised discretization does not use the class information. [4]

The entire training dataset is used for classification and discretization. Unsupervised discretization methods include equal-width, equal-frequency, and fixed-frequency discretization. Error-based and entropy-based methods are supervised discretization methods [5]. Entropy-based discretization uses class information: the entropy is calculated based on the class label, and the best split is the one that makes the bins as pure as possible, that is, the majority of values in a bin share the same class label. The split is made based on the maximum information gain. [6]
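
For concreteness, here is a small sketch of the unsupervised equal-width and equal-frequency binning described above, using scikit-learn only for illustration (the paper itself relies on RapidMiner/Weka operators).

```python
# A minimal sketch of unsupervised discretization before applying Naive Bayes.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[120.], [140.], [160.], [110.], [180.], [135.]])  # e.g. trestbps values

# Equal-width binning: each bin spans the same range of values.
equal_width = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
# Equal-frequency binning: each bin holds roughly the same number of instances.
equal_freq = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")

print(equal_width.fit_transform(X).ravel())
print(equal_freq.fit_transform(X).ravel())
```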

2.4 Neural Networks

The human brain serves as the model for neural networks. Artificial neurons were first proposed in 1943 by Warren McCulloch, a neurophysiologist, and Walter Pitts, an MIT logician [7]. Neural networks are useful for data mining and decision support applications. They are also useful for pattern recognition and data classification through a learning process [8].

A neural network is built from neurons and weights. The strength of the network depends on the interaction between these building blocks. The Multilayer Perceptron (MLP) is the most widely used neural network model, with networks consisting of three layers: input, hidden, and output. The values of the input layer come from the values in a dataset. The input neurons send data via synapses to the hidden layer, and on through further synapses to the output layer.

The MLP uses a supervised technique called backpropagation for training. Each layer is fully connected to the succeeding layer. Each neuron receives signals from the previous layer, and each signal is multiplied by a different weight value. The weighted inputs are summed and passed through a limiting (activation) function, which scales the output to a fixed range of values. The output is then sent to all the neurons in the next layer. The error at each output is then "back propagated" to the hidden and input layers, changing the weights based on the derivative of the error with respect to the weights.

MLP training involves adjusting parameters such as the number of neurons in the hidden layer. If too few neurons are used, complex data cannot be modeled and the results will be poor. If too many neurons are used, training may take a long time and the network may overfit the data: it may perform well on the training set but give poor results on the test set and on future instances.

3 Methodology

The Cleveland, Switzerland, Hungary, and VA Long Beach datasets are the four datasets collected from the UCI Machine Learning Repository for this project. The Cleveland dataset is used for training, and the Switzerland, Hungary, and VA datasets are used for testing. One important problem is to use the Cleveland training dataset to classify the three test datasets. The other important problem is to deal with the missing values in each dataset. All four datasets contain 13 attributes and one class attribute.

First, all four datasets are collected in .txt format. The datasets are loaded into RapidMiner using the IO Example Source operator.
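
Outside RapidMiner, the same datasets can be loaded with a few lines of Python; this sketch assumes the UCI "processed" files, which are comma-separated with "?" marking missing values, and uses the 14 attribute names from Section 1.3.

```python
# A minimal loading sketch for the UCI heart disease files.
import pandas as pd

columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

def load_heart(path):
    # na_values="?" turns the repository's missing-value marker into NaN
    return pd.read_csv(path, header=None, names=columns, na_values="?")

cleveland = load_heart("processed.cleveland.data")
hungary = load_heart("processed.hungarian.data")
print(cleveland.isna().sum())  # per-attribute missing value counts
```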

3.1. Preprocessing the Data

The first step is to preprocess the Cleveland dataset used for training. This mainly involves dealing with missing values, outliers, and feature selection. The preprocessing steps used are:

Filling in the missing values

Dealing with outliers

Attribute selection, numerical data discretization, normalization, etc.

Once preprocessing of the Cleveland data is done, the dataset is used for training with the different algorithms. The same kinds of preprocessing steps are then applied to the three test datasets.

Missing values in a dataset can represent a number of different things. They may be due to a test that was not conducted or to data that is not available. Missing values in RapidMiner and Weka are usually represented by "?". Missing values can reduce the quality of the classification, so filling them in plays an important role in data mining. Different methods are used for dealing with missing values. The most frequently used method is replacing categorical values with the mode and numerical values with the mean. The second method is removing the attributes that contain around 90% missing data; the attributes removed are ca, thal, slope, and chol. No change was observed in the classification error when these attributes were removed. The third method is to remove the instances that contain 6 or more missing values out of the 13 attribute values, since instances missing too many values are not good to use given the consistency of the data; the remaining missing values are then filled based on the most frequent value within each class.
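
A rough Python sketch of these three strategies is shown below; it assumes a pandas DataFrame laid out as in the loading sketch above, with "num" as the class attribute, and is only an approximation of the RapidMiner operators used in the paper.

```python
# Method 1: mode for categorical attributes, mean for numerical attributes.
def fill_mean_mode(df, categorical):
    out = df.copy()
    for col in out.columns.drop("num"):
        if col in categorical:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:
            out[col] = out[col].fillna(out[col].mean())
    return out

# Method 2: drop attributes that are almost entirely missing (e.g. >= 90%).
def drop_sparse_attributes(df, threshold=0.9):
    keep = df.columns[df.isna().mean() < threshold]
    return df[keep]

# Method 3: drop records with 6 or more missing values, then fill the rest
# with the most frequent value among records sharing the same class label.
def drop_and_fill_by_class(df, max_missing=6):
    out = df[df.isna().sum(axis=1) < max_missing].copy()
    for col in out.columns.drop("num"):
        out[col] = out.groupby("num")[col].transform(
            lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s)
    return out
```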

Outliers are observations that deviate from the rest of the dataset, that is, instances that lie at abnormal distances from the other instances in the data. Sometimes they occur due to common errors in data transmission. In some cases outliers play a significant role in learning from the data. Common methods used to identify outliers are density-based outlier detection, distance-based outlier detection, and LOF (local outlier factor) detection.

Distance-based outlier detection using a k-nearest neighbor algorithm is used here to identify outliers. Density-based outlier detection uses distance functions such as squared distance, Euclidean distance, angle, cosine distance, and inverted cosine distance, while LOF outlier detection identifies outliers using minimum upper and lower bounds with a density function.
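
The following sketch approximates the distance-based variant with scikit-learn (it is not RapidMiner's operator): instances whose mean Euclidean distance to their k nearest neighbours is largest are flagged as outliers.

```python
# A sketch of distance-based outlier detection with k-nearest neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outliers(X, k=5, n_outliers=10):
    """Flag the points whose mean distance to their k nearest neighbours is largest."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(X)
    distances, _ = nn.kneighbors(X)          # column 0 is the point itself
    scores = distances[:, 1:].mean(axis=1)   # mean distance to the k neighbours
    return np.argsort(scores)[-n_outliers:]  # indices of the most isolated points

# Usage (numeric attributes only, e.g. after mean/mode imputation):
# outlier_idx = knn_outliers(cleveland[["age", "trestbps", "chol",
#                                       "thalach", "oldpeak"]].values)
```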

Feature selection is mainly used to find the features that play an important role in classification and to remove features with little or no predictive information. It is mainly used on datasets with many features. The feature selection methods used here are filter methods and wrapper methods. A filter method selects features independently of the classifier, while a wrapper method makes use of a classifier for feature selection. Filter methods select features based on general characteristics of the data, so they are much faster than wrapper methods. A wrapper method uses an induction algorithm as the evaluation function to select feature subsets.

The four datasets are collected in .txt format. First the Cleveland dataset is read into RapidMiner using the attribute editor, and the .dat and .aml files are created for the Cleveland dataset. The mean and mode of each attribute are obtained, and the missing values are identified. The Cleveland dataset contains only 6 missing values. Removing them would not affect the dataset, since less than 2% of the data is missing, so the instances with missing values are removed from the dataset.

The next step is to deal with the outliers. Distance-based outlier detection is used, with k-nearest neighbors and Euclidean distance. Depending on the meaning of an outlier, outliers can play a determining role in a medical dataset, and depending on that role they are either removed or kept [reference].

The next step is to select the features that play important roles in the data. Feature selection generally improves the accuracy of classification by basing it on the most relevant features. Information gain weighting (a filter method) and a wrapper method with forward and backward selection are used. First the information gain of each attribute is obtained. Attribute subsets are then selected by taking the top 4 attributes, the top 5 attributes, and so on, and also by taking 50% and 70% of the attributes. Using these different methods we identify the top 10 features that play important roles in classification, and these are selected to build the model. A wrapper method selects features depending on the learning algorithm, and the features selected by one algorithm may differ from those selected by another. Forward and backward selection are the two methods used to build a set of features. Forward selection starts with a single attribute and additional attributes are added until there is no further performance gain. Backward selection is the opposite: it starts with the complete attribute set and attributes are removed from it as long as performance improves. Decision Tree and Multilayer Perceptron are used as the algorithms for selecting features with forward and backward selection. The preprocessed Cleveland dataset is then saved as a new file.
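
The sketch below illustrates both ideas in Python, with mutual information standing in for information gain weighting and scikit-learn's SequentialFeatureSelector (available in recent versions) providing the forward-selection wrapper; the "cleveland" DataFrame and "num" class column are assumptions carried over from the loading sketch.

```python
# A sketch of filter-based and wrapper-based feature selection.
from sklearn.feature_selection import mutual_info_classif, SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X = cleveland.drop(columns="num")            # attributes from the loading sketch
y = (cleveland["num"] > 0).astype(int)       # presence (1-4) vs absence (0)
X_filled = X.fillna(X.mean())                # the selectors need complete numeric data

# Filter: rank attributes by mutual information with the class, keep the top 10.
weights = dict(zip(X.columns, mutual_info_classif(X_filled, y)))
top10 = sorted(weights, key=weights.get, reverse=True)[:10]

# Wrapper: forward selection driven by the classifier's cross-validated accuracy.
forward = SequentialFeatureSelector(
    DecisionTreeClassifier(criterion="entropy"),
    n_features_to_select=10, direction="forward", cv=5)
forward.fit(X_filled, y)

print(top10)
print(list(X.columns[forward.get_support()]))
```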

Once the Cleveland dataset used for training has been preprocessed, it is ready to be tested against the three test sets. Before testing, the three test datasets are also preprocessed. Three classification algorithms, Decision Tree, Naive Bayes, and Multilayer Perceptron, are used to build classification models from the Cleveland dataset.

The next important step is preprocessing the test datasets. One important step here is dealing with the missing values, which are much more prevalent in the test datasets. In the Switzerland, Hungarian, and VA Long Beach datasets, 50% to 90% of the instances have missing values. Because of this, it is not possible to simply remove the instances with missing values.

The following missing value methods are used to fill in the missing data. We first replaced categorical values with the mode and numerical values with the mean. As the cholesterol attribute has about 99% missing values in one dataset, it was instead replaced by a normal value based on age and gender: for females below age 40 the level is 183 mg/dL, from age 40 to 49 it is 119 mg/dL, and from age 50 or above it is 219 mg/dL; for males below age 40 the level is 185 mg/dL, from age 40 to 49 it is 205 mg/dL, and above age 50 it is 208 mg/dL.
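
A sketch of this replacement rule is given below; the column names follow the earlier loading sketch, and it additionally treats a cholesterol value of 0 as missing, since that is how the Switzerland file records it.

```python
# Age- and sex-based replacement for missing cholesterol (sex: 1 = male, 0 = female).
def normal_chol(sex, age):
    if sex == 0:                        # female
        if age < 40:  return 183.0
        if age < 50:  return 119.0      # value as quoted in the text
        return 219.0
    if age < 40:  return 185.0          # male
    if age < 50:  return 205.0
    return 208.0

def fill_chol(df):
    out = df.copy()
    # treat NaN and 0 as missing (the Switzerland file records chol as 0)
    missing = out["chol"].isna() | (out["chol"] == 0)
    out.loc[missing, "chol"] = [normal_chol(s, a) for s, a
                                in zip(out.loc[missing, "sex"], out.loc[missing, "age"])]
    return out
```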

Replacing the missing values in all the datasets with this method gave a low classification error for the Hungary dataset, while the other two datasets still produced a high classification error.

Different methods are therefore used to deal with missing values in the two datasets that produced the highest classification error. We tested removing the attributes that contain around 90% missing data (ca, thal, slope, and chol). Using this method did not improve the number of correct predictions, so the third method is used to deal with the missing values.

As with the Cleveland dataset, instances were removed from a dataset if they contained 6 or more missing values out of the 13 attribute values. Since the ca attribute contains about 99% missing values, it is removed from the datasets; as ca is a redundant attribute here, removing it does not affect the dataset. The remaining missing values are then filled based on the most frequent value within each class. This method did not produce any change in the results on the Switzerland dataset, but there was an increase in the number of correct predictions for the Hungary dataset.

3.2 Building the Model

In order to compare the effectiveness of different classification algorithms, Decision Tree, Naive Bayes, and Multilayer Perceptron are used. First the Cleveland dataset is used to build a model with the Decision Tree algorithm in RapidMiner. Different criteria can be used to build the decision tree: the Gini index, gain ratio, and information gain. Of these, information gain produced the better decision tree, so information gain is used as the criterion both for attribute selection and for numerical splits when building the tree.
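
As a rough analogue, the sketch below compares the Gini and entropy (information gain) criteria with scikit-learn's decision tree, evaluated by 10-fold cross-validation; X_train and y_train stand for the preprocessed Cleveland attributes and class labels and are assumed names (gain ratio is not offered there).

```python
# Comparing split criteria for a decision tree on the training data.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    score = cross_val_score(tree, X_train, y_train, cv=10).mean()
    print(criterion, round(score, 3))
```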

Simple accuracy is not always the best way to evaluate a classifier, so sensitivity and specificity are used as well. Sensitivity is the accuracy on the positive instances:

Sensitivity = True Positives / (True Positives + False Negatives)

Specificity is the accuracy on the negative instances:

Specificity = True Negatives / (True Negatives + False Positives)

Accuracy = (True Positives + True Negatives) / (True Positives + False Negatives + True Negatives + False Positives)
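
These three measures can be computed directly from the confusion matrix counts; the sketch below uses placeholder counts rather than results from the paper.

```python
# Sensitivity, specificity and accuracy from confusion matrix counts.
def evaluate(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

print(evaluate(tp=120, fn=18, tn=130, fp=29))  # placeholder counts
```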

The Multilayer Perceptron is used with one or more hidden layers to find the accuracy. One important step with an MLP is choosing the number of hidden layers, which RapidMiner allows to be set. The number of hidden layers was varied, starting from a minimum of one, and 3 hidden layers are used in the final configuration. In this way the training set is used to classify the other three datasets, tested with the missing values replaced by each of the three different methods.
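
A comparable setup in Python might look like the following sketch, with scikit-learn's MLPClassifier standing in for RapidMiner's operator; the three hidden layer sizes and the X_train/y_train and X_test/y_test names are assumptions.

```python
# A sketch of a multilayer perceptron with three hidden layers.
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

mlp = make_pipeline(
    StandardScaler(),                                 # MLPs are sensitive to attribute scale
    MLPClassifier(hidden_layer_sizes=(20, 10, 5),     # three hidden layers; sizes assumed
                  max_iter=2000, random_state=0))
mlp.fit(X_train, y_train)                             # train on the Cleveland dataset
print(mlp.score(X_test, y_test))                      # accuracy on e.g. the Hungary dataset
```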

4 Results

Different experiments are conducted to test how a dataset collected from one source acts as a predictor of datasets collected from other sources, and to compare the accuracy of the different algorithms. Since the datasets contain missing values, three different methods are used to fill them in, and the accuracy is then measured to identify which method works best. Different preprocessing steps are also applied, and the features that play an important role are identified.

The first step is to find the percentage of missing values in each dataset. Table 2 below shows that the attribute ca contains 90% to 99% missing values in all the test datasets (these values are replaced using the normal-value method), and that one attribute in the Switzerland dataset is 100% missing.

Table 2. Percentage of missing values in each dataset

Attribute    Cleveland    Hungary    Switzerland    VA Long Beach
Age          -            -          -              -
Sex          -            -          -              -
Cp           -            -          -              -
Trestbps     -            -          2%             28%
Chol         -            8%         100%           4%
Fbs          -            3%         61%            4%
Restecg      -            -          1%             -
Thalach      -            -          1%             27%
Exang        -            -          1%             27%
Oldpeak      -            -          5%             28%
Slope        -            65%        14%            51%
Ca           -            99%        96%            99%
Thal         -            90%        42%            83%

(Cells marked "-" have no reported missing values.)

The main goal is to use one dataset as a predictor of another. Since the Cleveland dataset contains less than 2% missing values, we build the model using the Cleveland dataset. Models were built from the preprocessed dataset using the Decision Tree, Naive Bayes, and Multilayer Perceptron (MLP) algorithms. Table 3 gives the accuracy obtained with the three algorithms after building the model, in terms of the percentage of correct predictions. It can be observed that the Multilayer Perceptron worked best for building the model, with an accuracy of 91.75%, and the Decision Tree was second with an accuracy of 81.19%.

Table 3. Accuracy of the Cleveland dataset using classification algorithms

Algorithm                 Accuracy
Decision Tree             81.19%
Naive Bayes               63.97%
Multilayer Perceptron     91.75%

Each model built from the Cleveland dataset is used to test the preprocessed Hungary, Switzerland, and VA Long Beach datasets. Table 4 below shows that the highest accuracy is obtained on the Hungary dataset. As seen in Table 2, the number of missing values in the Hungary dataset is small compared to Switzerland and VA Long Beach, which indicates that the standard mean and mode method works better for the Hungary dataset.

Table 4. Accuracy obtained after the model was tested

Training and Test Datasets       Decision Tree    Naive Bayes    MLP
Cleveland and Switzerland        14.75%           31.15%         22.13%
Cleveland and VA Long Beach      28.00%           33.50%         30.00%
Cleveland and Hungary            64.97%           65.65%         64.71%

The next important step is using feature selection to choose the features that play an important role in classification. Information gain weighting is used to select the relevant features, and the top 10 attributes are selected to build the model. Table 5 gives the information gain weights of the attributes.

Table 5. Information gain weights of attributes

Attribute    Weight
Thal         1.000
Cp           0.873
Ca           0.790
Thalach      0.635
Oldpeak      0.558
Exang        0.526
Slope        0.506
Age          0.249
Sex          0.192
Restecg      0.110
Trestbps     0.011
Chol         0.004
Fbs          0.000

After selecting the attributes using information gain weighting, the top 10 attributes are used to build the model, and the resulting model is used to test the three datasets to see whether the number of correct predictions improves. Table 6 shows the increases and decreases in the number of correct predictions.

Table 6. Accuracy obtained after information gain weighting of attributes

Data Set          Decision Tree    Naive Bayes    MLP
Switzerland       25.89%           24.11%         22.32%
VA Long Beach     33.16%           38.42%         33.16%
Hungary           64.29%           65.74%         68.03%

Feature selection using the wrapper method was also tested. The Decision Tree and MLP algorithms are used to select the attributes, with both the forward selection and backward selection methods. Table 7 gives the accuracy obtained using forward and backward selection for all three algorithms.

Table 7. Accuracy of the Hungary dataset using the wrapper method

Selection Method       Decision Tree    Naive Bayes    MLP
Forward Selection      63.38%           65.49%         64.08%
Backward Selection     65.49%           65.49%         64.08%

From Tables 4 and 6 it can be seen that the model built from the Cleveland dataset was better able to classify the Hungary dataset than the Switzerland and VA Long Beach datasets. The Switzerland and VA Long Beach results were far worse, with less than 30% of instances correctly predicted.

In order to improve this, instances with 6 or more missing attributes were removed and the remaining missing values were filled based on the instances present in the Switzerland and VA Long Beach datasets, and the same algorithms and preprocessing steps were used for learning. Table 8 shows that the accuracy is improved compared to the first method used to fill the missing values; the number of correct predictions increased.

Table 8. Accuracy of the VA Long Beach dataset after replacing the missing values

Algorithm                 Accuracy
Decision Tree             56.73%
Naive Bayes               54.44%
Multilayer Perceptron     46.15%

The next step is to select the features that play an important role in the dataset, based on the different algorithms. Table 9 shows that the number of correct predictions improved in some cases, while in other cases it remained about the same.

Table 9. Feature selection accuracy of VA Long Beach

Feature Selection Method        Decision Tree    Naive Bayes    MLP
Information gain weighting      53.45%           55.25%         62.03%
Forward Selection               70.69%           69.32%         59.71%
Backward Selection              62.80%           60.25%         59.75%

5 Conclusion

This study has shown that selecting the appropriate features in a dataset plays an important role in data mining. If the relevant features are removed from the dataset, the accuracy of the classification is reduced. The standard mean and mode method worked better when only a few values were missing. However, when a large number of non-random missing values are present, removing the records that contain 60% or more missing values per patient and replacing the remaining missing values based on the remaining instances worked better than the standard method.

The datasets collected from different sources gave moderate accuracy when few missing values were present in the data. When a large number of missing values are present, the data collected from one source gave very poor results in classifying the data from another source.
