options in response to -h, as in
java weka.filters.AttributeFilter -h
Table 8.5 lists the filters implemented in Weka, along with their principal
options.
The first, AddFilter, inserts an attribute at the given position. For all
instances in the dataset, the new attribute's value is declared to be missing.
If a list of comma-separated nominal values is given using the -L option,
the new attribute will be a nominal one, otherwise it will be numeric. The
attribute's name can be set with -N.
AttributeSelectionFilter allows you to select a set of attributes using
different methods: since it is rather complex we will leave it to last.
AttributeFilter has already been used above. However, there is a
further option: if -V is used the matching set is inverted; that is, only
attributes not included in the -R specification are deleted.
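
The effect of inverting the matching set can be illustrated with a small
Java sketch. It is not Weka's code: the class and method names are
hypothetical, a row is reduced to an array of strings, and columns are
numbered from zero purely for convenience.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AttributeDeletionSketch {

    /**
     * Returns a copy of row without the deleted columns.
     * selected stands in for the -R specification (zero-based here, an assumption);
     * invert stands in for -V: when true, the columns NOT selected are deleted.
     */
    static String[] deleteColumns(String[] row, Set<Integer> selected, boolean invert) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < row.length; i++) {
            boolean delete = selected.contains(i) != invert; // flip the match when inverted
            if (!delete) {
                kept.add(row[i]);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] row = {"sunny", "85", "high", "no"};
        Set<Integer> selected = new HashSet<>(Arrays.asList(0, 2));
        System.out.println(Arrays.toString(deleteColumns(row, selected, false)));
        System.out.println(Arrays.toString(deleteColumns(row, selected, true)));
    }
}
```

With the flag off, the selected columns disappear; with it on, they are the
only ones that remain.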
An important filter for practical applications is DiscretizeFilter. It
contains an unsupervised and a supervised discretization method, both
discussed in Section 7.2. The former implements equal-width binning, and
the number of bins can be set manually using -B. However, if -O is present,
the number of bins will be chosen automatically using a cross-validation
procedure that maximizes the estimated likelihood of the data. In that case,
-B gives an upper bound to the possible number of bins. If the index of a
class attribute is specified using -c, supervised discretization will be
performed using the MDL method of Fayyad and Irani (1993). Usually,
discretization loses the ordering implied by the original numeric attribute
when it is transformed into a nominal one. However, this information is
preserved if the discretized attribute with k values is transformed into k - 1
binary attributes. The -D option does this automatically by producing one
binary attribute for each split point (described in Section 7.2 [p. 239]).
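
As a minimal sketch of two of the ideas in this paragraph, the Java fragment
below performs equal-width binning with a fixed number of bins and then
encodes a bin index as k - 1 binary attributes, one per split point. It does
not reproduce DiscretizeFilter: the names are hypothetical, the automatic
choice of bin count and the MDL method are omitted, and the convention that a
binary attribute is 1 when the value lies above its split point is an
assumption.

```java
import java.util.Arrays;

class DiscretizationSketch {

    /** Equal-width binning: maps each value to a bin index in 0..bins-1. */
    static int[] equalWidthBins(double[] values, int bins) {
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double width = (max - min) / bins;
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant attribute has width 0; everything goes into bin 0.
            int b = width == 0 ? 0 : (int) ((values[i] - min) / width);
            result[i] = Math.min(b, bins - 1); // the maximum belongs to the last bin
        }
        return result;
    }

    /**
     * Encodes a bin index (out of k bins) as k-1 binary attributes.
     * Attribute j is 1 if the bin lies above split point j, so the
     * ordering of the bins is still visible in the encoding.
     */
    static int[] orderedBinary(int bin, int k) {
        int[] out = new int[k - 1];
        for (int j = 0; j < k - 1; j++) {
            out[j] = bin > j ? 1 : 0;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] values = {0, 2.5, 5, 7.5, 10};
        int[] bins = equalWidthBins(values, 4);
        System.out.println(Arrays.toString(bins));
        for (int b : bins) {
            System.out.println(b + " -> " + Arrays.toString(orderedBinary(b, 4)));
        }
    }
}
```

Two neighbouring bins differ in exactly one binary attribute, which is how
the encoding keeps the order that a plain nominal attribute would lose.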
MakeIndicatorFilter is used to convert a nominal attribute into a binary
indicator attribute and can be used to transform a multiclass dataset into
several two-class ones. The filter substitutes a binary attribute for the
chosen nominal one, setting the corresponding value for each instance to 1
if a particular original value was present and to 0 otherwise. Both the
attribute to be transformed and the original nominal value are set by the
user. By default the new attribute is declared to be numeric, but if -N is
given it will be nominal.
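
The substitution itself is simple enough to sketch in a few lines of Java.
The class and method names are hypothetical, and a single attribute is
represented as an array of strings rather than as Weka instances.

```java
import java.util.Arrays;

class IndicatorSketch {

    /** Returns 1 where the column holds the chosen value and 0 elsewhere. */
    static int[] makeIndicator(String[] column, String chosenValue) {
        int[] out = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = chosenValue.equals(column[i]) ? 1 : 0;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] outlook = {"sunny", "rainy", "sunny", "overcast"};
        System.out.println(Arrays.toString(makeIndicator(outlook, "sunny")));
    }
}
```

Applying the same function once per class value would turn a multiclass
problem into several two-class ones.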
Suppose you want to merge two values of a nominal attribute into a
single category. This is done by MergeAttributeValuesFilter. The name of
the new value is a concatenation of the two original ones, and every
occurrence of either of the original values is replaced by the new one. The
index of the new value is the smaller of the original indexes.
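
The bookkeeping involved in a merge can be sketched as follows. This is an
illustration under stated assumptions, not the filter's implementation: the
names are hypothetical, nominal values are stored as indexes into a label
list, the two indexes are assumed to differ, and the underscore used to join
the two names is a guess, since the text does not say how they are
concatenated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeValuesSketch {

    /** Returns the label list after merging the values at indexes a and b (a != b). */
    static List<String> mergeLabels(List<String> labels, int a, int b) {
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        List<String> out = new ArrayList<>(labels);
        out.set(lo, labels.get(lo) + "_" + labels.get(hi)); // separator is an assumption
        out.remove(hi); // removes by position, shifting later labels down by one
        return out;
    }

    /** Maps an instance's old value index to its index in the merged label list. */
    static int remap(int value, int a, int b) {
        int lo = Math.min(a, b);
        int hi = Math.max(a, b);
        if (value == hi) {
            return lo; // the merged value takes the smaller index
        }
        return value > hi ? value - 1 : value;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("sunny", "overcast", "rainy", "foggy");
        System.out.println(mergeLabels(labels, 1, 2));
        System.out.println(remap(3, 1, 2));
    }
}
```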
Some learning schemes, like support vector machines, can handle only
binary attributes. The advantage of binary attributes is that they can be
treated as either nominal or numeric. NominalToBinaryFilter transforms all
multivalued nominal attributes in a dataset into binary ones, replacing each
attribute with k values by k - 1 binary attributes. If a class is specified using
the -c option, it is left untouched. The transformation used for the other
attributes depends on whether the class is numeric. If the class is numeric,
the M5' transformation method is employed for each attribute; otherwise a
simple one-per-value encoding is used. If the -N option is used, all new
attributes will be nominal, otherwise they will be numeric.
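
For readers unfamiliar with the term, the fragment below shows the most
literal reading of a one-per-value encoding: one indicator for each label of
the nominal attribute. It is a sketch with hypothetical names and is not
taken from NominalToBinaryFilter; in particular it does not attempt the
transformation used when the class is numeric.

```java
import java.util.Arrays;
import java.util.List;

class OnePerValueSketch {

    /** Returns an array with a 1 at the position of value in labels and 0 elsewhere. */
    static int[] encode(String value, List<String> labels) {
        int[] out = new int[labels.size()];
        int index = labels.indexOf(value);
        if (index >= 0) {
            out[index] = 1; // an unknown value leaves every indicator at 0
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("sunny", "overcast", "rainy");
        System.out.println(Arrays.toString(encode("overcast", labels)));
    }
}
```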
One way of dealing with missing values is to replace them globally
before applying a learning scheme. ReplaceMissingValuesFilter substitutes
the mean, for numeric attributes, or the mode, for nominal ones, for each
occurrence of a missing value.
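
A minimal Java sketch of this kind of global replacement is given below. The
names are hypothetical, and the representation of missing values (NaN for
numeric attributes, null for nominal ones) is an assumption made for the
example rather than a description of how Weka stores them.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ReplaceMissingSketch {

    /** Replaces NaN entries with the mean of the observed values. */
    static double[] fillWithMean(double[] column) {
        double sum = 0;
        int n = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                n++;
            }
        }
        double mean = n == 0 ? Double.NaN : sum / n; // nothing observed: leave as missing
        double[] out = column.clone();
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }

    /** Replaces null entries with the most frequent observed label. */
    static String[] fillWithMode(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        String mode = null;
        int best = 0;
        for (String v : column) {
            if (v == null) {
                continue;
            }
            int c = counts.merge(v, 1, Integer::sum);
            if (c > best) { // ties go to the label that reached the count first
                best = c;
                mode = v;
            }
        }
        String[] out = column.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] == null) {
                out[i] = mode;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fillWithMean(new double[]{1, Double.NaN, 3})));
        System.out.println(Arrays.toString(fillWithMode(new String[]{"a", null, "b", "a"})));
    }
}
```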
To remove from a dataset all instances that have certain values for
nominal attributes, or numeric values above or below a certain threshold,
use InstanceFilter. By default all instances are deleted that exhibit one of
a given set of nominal attribute values (if the specified attribute is
nominal), or a numeric value below a given threshold (if it is numeric).
However, the matching criterion can be inverted using -V.
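
The numeric case, including the inverted criterion, can be sketched in Java
as follows. The names are hypothetical and instances are reduced to single
numbers; treating a value exactly equal to the threshold as "not below" is an
assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InstanceRemovalSketch {

    /**
     * Deletes values below the threshold and returns the rest.
     * With invert set (the role played by -V), the values that are
     * not below the threshold are deleted instead.
     */
    static List<Double> removeBelow(List<Double> values, double threshold, boolean invert) {
        List<Double> kept = new ArrayList<>();
        for (double v : values) {
            boolean matches = v < threshold;
            boolean delete = matches != invert;
            if (!delete) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(1.0, 5.0, 10.0);
        System.out.println(removeBelow(values, 5.0, false));
        System.out.println(removeBelow(values, 5.0, true));
    }
}
```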
The SwapAttributeValuesFilter is a simple one: all it does is swap the
positions of two values of a nominal attribute. Of course, this could also be
accomplished by editing the ARFF file in a word processor. The order of
attribute values is entirely cosmetic: it does not affect machine learning at
all. If the selected attribute is the class, changing the order affects the