1. Load libraries
Here we read the data with R's built-in read.csv function and split it with the caTools package.
# install.packages("caTools")
library(caTools)
2. Load the data
The data is stored in Data.csv, a CSV file, and is read with the read.csv function.
dataset = read.csv('Data.csv')
dataset
| Country | Age | Salary | Purchased |
|---|---|---|---|
| <chr> | <int> | <int> | <chr> |
| France | 44 | 72000 | No |
| Spain | 27 | 48000 | Yes |
| Germany | 30 | 54000 | No |
| Spain | 38 | 61000 | No |
| Germany | 40 | NA | Yes |
| France | 35 | 58000 | Yes |
| Spain | NA | 52000 | No |
| France | 48 | 79000 | Yes |
| Germany | 50 | 83000 | No |
| France | 37 | 67000 | Yes |
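Before imputing, it can help to count the missing values in each column. The following is a minimal sketch: it rebuilds the table above inline (an assumption, so that it runs without Data.csv) and uses base R's is.na() and colSums().

```r
# Rebuild the example data inline so the snippet runs without Data.csv
dataset <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"),
  Age       = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary    = c(72000, 48000, 54000, 61000, NA,
                58000, 52000, 79000, 83000, 67000),
  Purchased = c("No", "Yes", "No", "No", "Yes",
                "Yes", "No", "Yes", "No", "Yes")
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(dataset))
print(na_counts)
```

In this data, Age and Salary each contain one missing value, which is what the next step fills in.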
3. Handle missing values
Here, missing values are handled with ifelse() combined with ave().
# Replace each NA with the column mean; is.na() flags the missing entries
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
Here, the column mean is used as the fill value for missing entries.
dataset
| Country | Age | Salary | Purchased |
|---|---|---|---|
| <chr> | <dbl> | <dbl> | <chr> |
| France | 44.00000 | 72000.00 | No |
| Spain | 27.00000 | 48000.00 | Yes |
| Germany | 30.00000 | 54000.00 | No |
| Spain | 38.00000 | 61000.00 | No |
| Germany | 40.00000 | 63777.78 | Yes |
| France | 35.00000 | 58000.00 | Yes |
| Spain | 38.77778 | 52000.00 | No |
| France | 48.00000 | 79000.00 | Yes |
| Germany | 50.00000 | 83000.00 | No |
| France | 37.00000 | 67000.00 | Yes |
Inspecting the data shows that the missing values have been filled with the column means.
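The same mean imputation can be written as a small reusable helper. This is only a sketch; the function name impute_mean is hypothetical and not part of the original tutorial, and the vectors repeat the Age and Salary columns shown earlier.

```r
# Hypothetical helper: replace NA entries of a numeric vector with its mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Age and Salary columns from the example data
age    <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)
salary <- c(72000, 48000, 54000, 61000, NA,
            58000, 52000, 79000, 83000, 67000)

age_filled    <- impute_mean(age)
salary_filled <- impute_mean(salary)

print(age_filled)
print(salary_filled)
```

Applied to a data frame, the helper could be used column by column, for example `dataset$Age <- impute_mean(dataset$Age)`.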
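4. Recode the categorical variables
4.1 Recoding the variables
Convert the categorical columns Country and Purchased to factors with numeric labels.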
# Encoding categorical data
dataset$Country = factor(dataset$Country,
levels = c('France', 'Spain', 'Germany'),
labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes'),
labels = c(0, 1))
dataset
| Country | Age | Salary | Purchased |
|---|---|---|---|
| <fct> | <dbl> | <dbl> | <fct> |
| 1 | 44.00000 | 72000.00 | 0 |
| 2 | 27.00000 | 48000.00 | 1 |
| 3 | 30.00000 | 54000.00 | 0 |
| 2 | 38.00000 | 61000.00 | 0 |
| 3 | 40.00000 | 63777.78 | 1 |
| 1 | 35.00000 | 58000.00 | 1 |
| 2 | 38.77778 | 52000.00 | 0 |
| 1 | 48.00000 | 79000.00 | 1 |
| 3 | 50.00000 | 83000.00 | 0 |
| 1 | 37.00000 | 67000.00 | 1 |
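One detail worth checking is that factor() stores the labels as character level names, not as numbers. The sketch below, using a shortened country vector as an assumption for illustration, shows this and one common way to get plain numeric codes back.

```r
# Shortened example vector (illustrative only)
country <- c("France", "Spain", "Germany", "Spain")

# Same encoding as in the tutorial: levels mapped to the labels 1, 2, 3
country_f <- factor(country,
                    levels = c("France", "Spain", "Germany"),
                    labels = c(1, 2, 3))

# The labels are stored as character strings
print(levels(country_f))

# Convert via character to recover the label values as numbers
country_num <- as.numeric(as.character(country_f))
print(country_num)
```

Going through as.character() matters when the labels are not simply 1..k: as.numeric() applied directly to a factor returns the internal level index, not the label.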
5. Split the data into training and test sets
Use 80% of the data as the training set and the remaining 20% as the test set.
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
View the training set
training_set
| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | <fct> | <dbl> | <dbl> | <fct> |
| 1 | 1 | 44.00000 | 72000.00 | 0 |
| 2 | 2 | 27.00000 | 48000.00 | 1 |
| 3 | 3 | 30.00000 | 54000.00 | 0 |
| 4 | 2 | 38.00000 | 61000.00 | 0 |
| 5 | 3 | 40.00000 | 63777.78 | 1 |
| 7 | 2 | 38.77778 | 52000.00 | 0 |
| 8 | 1 | 48.00000 | 79000.00 | 1 |
| 10 | 1 | 37.00000 | 67000.00 | 1 |
View the test set
test_set
| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | <fct> | <dbl> | <dbl> | <fct> |
| 6 | 1 | 35 | 58000 | 1 |
| 9 | 3 | 50 | 83000 | 0 |
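The split above keeps the class proportions of Purchased roughly equal in both sets. The idea can be sketched in base R as follows. This is an illustration of stratified sampling in general, not caTools' actual algorithm, and the function name stratified_split is hypothetical.

```r
# Hypothetical helper: sample `ratio` of the indices within each class of y
stratified_split <- function(y, ratio, seed = 123) {
  set.seed(seed)
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    # sample.int() avoids the surprise of sample() on a length-one vector
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

# Purchased column from the example data
purchased <- c("No", "Yes", "No", "No", "Yes",
               "Yes", "No", "Yes", "No", "Yes")

in_train <- stratified_split(purchased, ratio = 0.8)

# Cross-tabulate class against training membership
print(table(purchased, in_train))
```

With five rows per class and a ratio of 0.8, each class contributes four rows to the training set and one to the test set, which matches the 8/2 split shown above. Which particular rows are chosen depends on the random seed and will generally differ from caTools' selection.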
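6. Standardize the data
Because the feature columns of X are measured in different units and on different scales, they must be standardized before they can be compared with one another and used for modeling.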
training_set[,2:3] = scale(training_set[,2:3])
training_set
| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | <fct> | <dbl> | <dbl> | <fct> |
| 1 | 1 | 0.90101716 | 0.9392746 | 0 |
| 2 | 2 | -1.58847494 | -1.3371160 | 1 |
| 3 | 3 | -1.14915281 | -0.7680183 | 0 |
| 4 | 2 | 0.02237289 | -0.1040711 | 0 |
| 5 | 3 | 0.31525431 | 0.1594000 | 1 |
| 7 | 2 | 0.13627122 | -0.9577176 | 0 |
| 8 | 1 | 1.48678000 | 1.6032218 | 1 |
| 10 | 1 | -0.12406783 | 0.4650265 | 1 |
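Inspecting the training features shows that they have been standardized.
The test-set features are standardized in the same way. Note that scale() here uses the test set's own mean and standard deviation.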
test_set[,2:3] = scale(test_set[,2:3])
test_set
| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| | <fct> | <dbl> | <dbl> | <fct> |
| 6 | 1 | -0.7071068 | -0.7071068 | 1 |
| 9 | 3 | 0.7071068 | 0.7071068 | 0 |
With that, the data preprocessing is complete.
















