Data Preprocessing Tools

1. Load libraries

Here we read the data with R's built-in read.csv function and split it with the caTools package.

# install.packages("caTools")
library(caTools)

2. Load the data

The data is stored in Data.csv, a CSV file, and is read with the read.csv function.

dataset = read.csv('Data.csv')
dataset
A data.frame: 10 × 4
Country Age Salary Purchased
<chr> <int> <int> <chr>
France 44 72000 No
Spain 27 48000 Yes
Germany 30 54000 No
Spain 38 61000 No
Germany 40 NA Yes
France 35 58000 Yes
Spain NA 52000 No
France 48 79000 Yes
Germany 50 83000 No
France 37 67000 Yes
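
Before handling missing values it helps to count them per column. The sketch below is only an illustration: it rebuilds the table printed above from an inline string (the real Data.csv may mark missing entries differently, for example with empty fields), and the names csv_text and missing_per_column are hypothetical.

```r
# Rebuild the printed table from text so the example runs without Data.csv.
# Assumption: missing entries appear as the literal string NA.
csv_text <- "Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,NA,Yes
France,35,58000,Yes
Spain,NA,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes"

dataset <- read.csv(text = csv_text)

# Show column types, then count the missing values in each column.
str(dataset)
missing_per_column <- colSums(is.na(dataset))
print(missing_per_column)
```

Columns with a non-zero count (here Age and Salary) are the ones that need imputation.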

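3. Handle missing values

Here we handle the missing values with a small custom expression that applies an anonymous function to each column.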

# Replace each NA with the column mean; keep all other values unchanged
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)

Here the column mean is used as the fill value for missing entries.

dataset
A data.frame: 10 × 4
Country Age Salary Purchased
<chr> <dbl> <dbl> <chr>
France 44.00000 72000.00 No
Spain 27.00000 48000.00 Yes
Germany 30.00000 54000.00 No
Spain 38.00000 61000.00 No
Germany 40.00000 63777.78 Yes
France 35.00000 58000.00 Yes
Spain 38.77778 52000.00 No
France 48.00000 79000.00 Yes
Germany 50.00000 83000.00 No
France 37.00000 67000.00 Yes

Inspecting the data shows that the missing values have been filled with the column mean.
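
The same mean imputation can be written as a reusable function. This is a minimal sketch, not the approach used above: impute_mean is a hypothetical helper, and the age vector is copied from the Age column of the printed table.

```r
# Hypothetical helper: replace NA entries of a numeric vector with the mean
# of its observed (non-NA) values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Age column from the table above, including its single missing value.
age <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)
age_filled <- impute_mean(age)
print(age_filled)

# Applying the helper to every numeric column of a data frame.
df <- data.frame(Age = age,
                 Salary = c(72000, 48000, 54000, 61000, NA,
                            58000, 52000, 79000, 83000, 67000))
numeric_cols <- sapply(df, is.numeric)
df[numeric_cols] <- lapply(df[numeric_cols], impute_mean)
```

A helper like this avoids repeating the ifelse expression for each column.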

4. Encode categorical variables

4.1 Recode the variables

Convert the Country and Purchased columns to factors with numeric labels.

# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
dataset
A data.frame: 10 × 4
Country Age Salary Purchased
<fct> <dbl> <dbl> <fct>
1 44.00000 72000.00 0
2 27.00000 48000.00 1
3 30.00000 54000.00 0
2 38.00000 61000.00 0
3 40.00000 63777.78 1
1 35.00000 58000.00 1
2 38.77778 52000.00 0
1 48.00000 79000.00 1
3 50.00000 83000.00 0
1 37.00000 67000.00 1
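
One detail worth keeping in mind: the labels of a factor are stored as character strings, so the 0/1 labels are not numbers. The sketch below, using a few illustrative values, shows the difference between the internal integer codes and the labels converted back to numbers.

```r
# A few illustrative Purchased values, encoded as in the step above.
purchased <- factor(c("No", "Yes", "No"),
                    levels = c("No", "Yes"),
                    labels = c(0, 1))

# Labels are kept as character strings.
label_strings <- levels(purchased)

# as.integer() returns the internal codes (1-based position of each level).
internal_codes <- as.integer(purchased)

# Converting through character recovers the numeric labels themselves.
numeric_labels <- as.numeric(as.character(purchased))

print(label_strings)
print(internal_codes)
print(numeric_labels)
```

If a later step needs numeric 0/1 values, converting through as.character first avoids silently getting 1/2 instead.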

5. Split the data into a training set and a test set

Use 80% of the data as the training set and the remaining 20% as the test set.

library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

View the training set

training_set
A data.frame: 8 × 4
Country Age Salary Purchased
<fct> <dbl> <dbl> <fct>
1 1 44.00000 72000.00 0
2 2 27.00000 48000.00 1
3 3 30.00000 54000.00 0
4 2 38.00000 61000.00 0
5 3 40.00000 63777.78 1
7 2 38.77778 52000.00 0
8 1 48.00000 79000.00 1
10 1 37.00000 67000.00 1

View the test set

test_set
A data.frame: 2 × 4
Country Age Salary Purchased
<fct> <dbl> <dbl> <fct>
6 1 35 58000 1
9 3 50 83000 0
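
In the output above, each class of Purchased contributes one row to the test set, which suggests the split keeps the class proportions. The idea can be sketched in base R as follows. This is not caTools' implementation: stratified_split is a hypothetical helper, and rounding the per-class training count is an assumption made for this example.

```r
# Hypothetical helper: return a logical mask that marks roughly `ratio`
# of the rows of each class as training rows.
stratified_split <- function(y, ratio = 0.8) {
  y <- as.character(y)
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    # sample.int avoids the surprise of sample() when idx has length 1.
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(123)
# Purchased labels from the encoded table above.
y <- factor(c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1))
mask <- stratified_split(y, ratio = 0.8)

# Each class contributes four rows to training and one to testing.
print(table(class = y, train = mask))
```

Which rows end up in each set depends on the random seed, so the selected indices will generally differ from the caTools result shown above.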

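6. Standardize the data

Because the columns of X (the features) are measured in different units, they must be standardized before they can be compared with each other or used for modeling.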

training_set[,2:3] = scale(training_set[,2:3])
training_set
A data.frame: 8 × 4
Country Age Salary Purchased
<fct> <dbl> <dbl> <fct>
1 1 0.90101716 0.9392746 0
2 2 -1.58847494 -1.3371160 1
3 3 -1.14915281 -0.7680183 0
4 2 0.02237289 -0.1040711 0
5 3 0.31525431 0.1594000 1
7 2 0.13627122 -0.9577176 0
8 1 1.48678000 1.6032218 1
10 1 -0.12406783 0.4650265 1

Inspecting the training set of X shows that the data has been standardized.

Standardize the test data of X in the same way.

test_set[,2:3] = scale(test_set[,2:3])
test_set
A data.frame: 2 × 4
Country Age Salary Purchased
<fct> <dbl> <dbl> <fct>
6 1 -0.7071068 -0.7071068 1
9 3 0.7071068 0.7071068 0
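
The call above standardizes the test set with its own mean and standard deviation, which with only two rows always yields about ±0.707. A common alternative is to reuse the center and scale learned from the training set, so both sets are expressed on the same reference. The following is a minimal sketch of that alternative, not the method used above; the values are copied (rounded) from the training and test tables printed earlier.

```r
# Age and Salary of the eight training rows and two test rows shown above.
train <- data.frame(
  Age    = c(44, 27, 30, 38, 40, 38.77778, 48, 37),
  Salary = c(72000, 48000, 54000, 61000, 63777.78, 52000, 79000, 67000)
)
test <- data.frame(Age = c(35, 50), Salary = c(58000, 83000))

# scale() stores the column means and standard deviations as attributes.
scaled_train <- scale(train)
train_center <- attr(scaled_train, "scaled:center")
train_scale  <- attr(scaled_train, "scaled:scale")

# Apply the training parameters to the test rows.
scaled_test <- scale(test, center = train_center, scale = train_scale)
print(scaled_test)
```

With this variant the test values are no longer forced to ±0.707, and a model fitted on the scaled training data sees test inputs on the same scale.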

This completes the data preprocessing.