Data Preprocessing Tools

1. Load the libraries

Here we read the data with R's built-in read.csv function and split it with the caTools package.

# install.packages("caTools")
library(caTools)

2. Load the data

The data file is Data.csv, a CSV file, which we read with read.csv.

dataset = read.csv('Data.csv')
dataset

A data.frame: 10 × 4

Country  Age    Salary  Purchased
<chr>    <int>  <int>   <chr>
France   44     72000   No
Spain    27     48000   Yes
Germany  30     54000   No
Spain    38     61000   No
Germany  40     NA      Yes
France   35     58000   Yes
Spain    NA     52000   No
France   48     79000   Yes
Germany  50     83000   No
France   37     67000   Yes
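
Before filling anything in, it can help to count the missing values per column. The sketch below is an illustration only: it reads an inline copy of the ten rows shown above (the variable csv_text is an assumption standing in for Data.csv, with the two missing cells written as empty fields) so that it runs without the file.

```r
# Inline copy of the rows shown above; empty fields stand for the missing cells.
# csv_text is a stand-in for Data.csv so the example is self-contained.
csv_text <- "Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes"

dataset <- read.csv(text = csv_text)

# Number of NA values in each column
na_counts <- colSums(is.na(dataset))
print(na_counts)
```

With these rows, Age and Salary each contain one missing value and the other two columns contain none.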

3. Handle missing values

Here we handle the missing values with a short hand-written expression.

# Replace each NA with the mean of the non-missing values in the same column
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)

The column mean is used as the fill value for the missing entries.

dataset

A data.frame: 10 × 4

Country  Age       Salary    Purchased
<chr>    <dbl>     <dbl>     <chr>
France   44.00000  72000.00  No
Spain    27.00000  48000.00  Yes
Germany  30.00000  54000.00  No
Spain    38.00000  61000.00  No
Germany  40.00000  63777.78  Yes
France   35.00000  58000.00  Yes
Spain    38.77778  52000.00  No
France   48.00000  79000.00  Yes
Germany  50.00000  83000.00  No
France   37.00000  67000.00  Yes

Looking at the data again, the missing values have been filled with the column means.
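
The same mean imputation can be written as a small reusable function. This is a minimal sketch, not the code used above: impute_mean is a hypothetical helper name, and the age vector is copied from the Age column of the table.

```r
# Hypothetical helper (not part of the original code):
# replace every NA in a numeric vector with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Age column from the table, with the one missing value in position 7
age <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)
age_filled <- impute_mean(age)

# The filled value is 349 / 9, which prints as 38.77778
print(round(age_filled[7], 5))
```

The nine observed ages sum to 349, so the filled value matches the 38.77778 shown in the table.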

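4. Recode the categorical variables

4.1 Recode the variables

We convert the two categorical columns, Country and Purchased, to factors with numeric labels.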

# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
dataset

A data.frame: 10 × 4

Country  Age       Salary    Purchased
<fct>    <dbl>     <dbl>     <fct>
1        44.00000  72000.00  0
2        27.00000  48000.00  1
3        30.00000  54000.00  0
2        38.00000  61000.00  0
3        40.00000  63777.78  1
1        35.00000  58000.00  1
2        38.77778  52000.00  0
1        48.00000  79000.00  1
3        50.00000  83000.00  0
1        37.00000  67000.00  1
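
One detail worth noting is that the labels passed to factor become factor levels stored as text, not numbers. The short sketch below uses a four-element country vector made up for illustration; it shows the levels and the underlying integer codes.

```r
# Small illustrative vector (not the full dataset)
country <- c('France', 'Spain', 'Germany', 'Spain')

encoded <- factor(country,
                  levels = c('France', 'Spain', 'Germany'),
                  labels = c(1, 2, 3))

# The labels are stored as character levels
print(levels(encoded))      # "1" "2" "3"

# The internal integer codes follow the order given in levels
print(as.integer(encoded))  # 1 2 3 2
```

Because the result is a factor, the values 1, 2, 3 identify categories and carry no numeric ordering between countries.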

5. Split the data into training and test sets

We use 80% of the data for training and the remaining 20% for testing.

library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
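
sample.split keeps the class proportions of the label roughly equal in both parts. To make that idea concrete, here is a base-R sketch of a stratified 80/20 split. It is an approximation written for illustration, not the caTools implementation, and the names purchased, train_idx, and in_train are assumptions.

```r
set.seed(123)

# Purchased column after encoding: five 0s and five 1s
purchased <- factor(c(0, 1, 0, 0, 1, 1, 0, 1, 0, 1))

# For each class, draw round(0.8 * n) row indices for the training set
train_idx <- unlist(lapply(
  split(seq_along(purchased), purchased),
  function(idx) idx[sample.int(length(idx), round(0.8 * length(idx)))]
))

# Logical mask playing the same role as `split` in the code above
in_train <- seq_along(purchased) %in% train_idx

# Class counts in each part
print(table(purchased[in_train]))
print(table(purchased[!in_train]))
```

With five rows per class, each class contributes four rows to the training part and one to the test part, which matches the 8 and 2 row sets produced in this section.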

View the training data:

training_set

A data.frame: 8 × 4

    Country  Age       Salary    Purchased
    <fct>    <dbl>     <dbl>     <fct>
1   1        44.00000  72000.00  0
2   2        27.00000  48000.00  1
3   3        30.00000  54000.00  0
4   2        38.00000  61000.00  0
5   3        40.00000  63777.78  1
7   2        38.77778  52000.00  0
8   1        48.00000  79000.00  1
10  1        37.00000  67000.00  1

View the test data:

test_set

A data.frame: 2 × 4

   Country  Age    Salary  Purchased
   <fct>    <dbl>  <dbl>   <fct>
6  1        35     58000   1
9  3        50     83000   0

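6. Standardize the data

The feature columns are measured in different units and on different scales (Age in years, Salary in currency), so they need to be standardized before they can be compared with each other or used in a model.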

training_set[,2:3] = scale(training_set[,2:3])
training_set

A data.frame: 8 × 4

    Country  Age          Salary      Purchased
    <fct>    <dbl>        <dbl>       <fct>
1   1         0.90101716   0.9392746  0
2   2        -1.58847494  -1.3371160  1
3   3        -1.14915281  -0.7680183  0
4   2         0.02237289  -0.1040711  0
5   3         0.31525431   0.1594000  1
7   2         0.13627122  -0.9577176  0
8   1         1.48678000   1.6032218  1
10  1        -0.12406783   0.4650265  1

Looking at the training set, the Age and Salary columns have now been standardized.
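
With its default arguments, scale subtracts the column mean and divides by the column's sample standard deviation. The short check below uses five ages chosen for illustration and compares scale with the same calculation written out by hand.

```r
# Illustrative values (a subset of the Age column)
age <- c(44, 27, 30, 38, 40)

# Manual z-score: subtract the mean, divide by the sample standard deviation
manual <- (age - mean(age)) / sd(age)

# scale() returns a one-column matrix, so convert it to a plain vector
scaled <- as.numeric(scale(age))

print(all.equal(scaled, manual))  # TRUE
```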

The test set is standardized in the same way:

test_set[,2:3] = scale(test_set[,2:3])
test_set

A data.frame: 2 × 4

   Country  Age         Salary      Purchased
   <fct>    <dbl>       <dbl>       <fct>
6  1        -0.7071068  -0.7071068  1
9  3         0.7071068   0.7071068  0
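
Scaling the test set on its own uses the test set's own mean and standard deviation. With only two rows, every column then necessarily becomes -0.7071068 and 0.7071068, as the output above shows. A common alternative, sketched here as an assumption rather than as the method used in this walkthrough, is to reuse the center and scale computed on the training set. The small train and test data frames below are illustrative values taken from the tables.

```r
# Illustrative subsets of the training and test rows
train <- data.frame(Age = c(44, 27, 30, 38),
                    Salary = c(72000, 48000, 54000, 61000))
test <- data.frame(Age = c(35, 50),
                   Salary = c(58000, 83000))

# Standardize the training columns; scale() records the parameters it used
train_scaled <- scale(train)
center <- attr(train_scaled, "scaled:center")  # training means
spread <- attr(train_scaled, "scaled:scale")   # training standard deviations

# Apply the training parameters to the test columns
test_scaled <- scale(test, center = center, scale = spread)
print(test_scaled)
```

This keeps the training and test features on the same scale, so a model fitted on the training set sees test values expressed in the units it was trained on.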

With that, the data preprocessing is complete.