lesson5.rmd from the course lectures
Lesson 5
========================================================
### Lesson 4 covered the relationship between two variables; this lesson focuses on relationships among three or more variables.
### Multivariate Data
Notes:
***
### Moira Perceived Audience Size Colored by Age
Notes:
***
### Third Qualitative Variable
Notes:
```{r Third Qualitative Variable}
library(ggplot2)
library(dplyr)
# library(scales)
# library(memisc)
# library(reshape)
library(gridExtra)
ggplot(aes(x = gender,y=age),
data = subset(pf, !is.na(gender))) + geom_boxplot()+
stat_summary(fun.y = mean, geom='point',shape=4)
ggplot(aes(x = age,y=friend_count),
data = subset(pf, !is.na(gender))) +
geom_line(aes(color=gender),stat='summary',fun.y=median)
```
***
### Plotting Conditional Summaries
Notes:
# Write code to create a new data frame,
# called 'pf.fc_by_age_gender', that contains
# information on each age AND gender group.
# The data frame should contain the following variables:
# mean_friend_count,
# median_friend_count,
# n (the number of users in each age and gender grouping)
# Here is an example of the structure of your data frame. Your
# data values will be different. Note that if you are grouping by
# more than one variable, you will probably need to call the
# ungroup() function.
# age gender mean_friend_count median_friend_count n
# 1 13 female 247.2953 150 207
# 2 13 male 184.2342 61 265
# 3 14 female 329.1938 245 834
# 4 14 male 157.1204 88 1201
# See the Instructor Note for two hints.
# DO NOT DELETE THESE NEXT TWO LINES OF CODE
# ==============================================================
pf <- read.delim('/datasets/ud651/pseudo_facebook.tsv')
suppressMessages(library(dplyr))
# ENTER YOUR CODE BELOW THIS LINE.
# ==============================================================
```{r 3_practice}
library(dplyr)
age_gender_groups <- group_by(pf, age, gender)
pf.fc_by_age_gender <- summarise(age_gender_groups,
friend_count_mean = mean(friend_count),
friend_count_median = median(friend_count),
n = n())
pf.fc_by_age_gender <- arrange(pf.fc_by_age_gender, age, gender)
head(pf.fc_by_age_gender)
```
```{r 3_practice another way}
library(dplyr)
# %.% is the old dplyr chain operator; current dplyr uses %>%
pf.fc_by_age_gender2 <- pf %>%
  filter(!is.na(gender)) %>%
  group_by(age, gender) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  ungroup() %>%
  arrange(age)
head(pf.fc_by_age_gender2)
```
```{r plotting conditional summaries}
# Build the grouped, filtered data frame first, then plot directly from it
ggplot(aes(x=age,y=friend_count_median),
data=subset(pf.fc_by_age_gender,!is.na(gender))) +
geom_line(aes(color=gender))
```
***
### 5.Thinking in Ratios
Notes:
***
### 6.Wide and Long Format
Notes: The tidyr package can be used to reshape data. Both tidyr and reshape can do this job, and tidyr is a little easier to use.
```{r}
install.packages('tidyr')
library(tidyr)
spread(subset(pf.fc_by_age_gender, select = c('gender', 'age', 'friend_count_median')), gender, friend_count_median)
```
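The wide/long conversion can be tried on a tiny data frame first. The sketch below assumes tidyr is installed; the four rows reuse the medians from the example table in the exercise prompt above, and the column name `friend_count_median` follows this notebook's own naming.

```r
library(tidyr)

# Long format: one row per (age, gender) combination
long <- data.frame(
  age = c(13, 13, 14, 14),
  gender = c("female", "male", "female", "male"),
  friend_count_median = c(150, 61, 245, 88)
)

# Long -> wide: one row per age, one column per gender
wide <- spread(long, gender, friend_count_median)
print(wide)

# Wide -> long again: gather() reverses spread()
long_again <- gather(wide, key = "gender", value = "friend_count_median",
                     female, male)
print(long_again)
```

Once the data are wide, a per-age ratio such as `female / male` is a plain column operation.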
***
### 7.Reshaping Data
Notes:
```{r}
install.packages('reshape2')
library(reshape2)
pf.fc_by_age_gender.wide <-
subset(pf.fc_by_age_gender[c('age','gender','friend_count_median')],
!is.na(gender)) %>%
spread(gender,friend_count_median) %>%
mutate(ratio=male/female)
head(pf.fc_by_age_gender.wide)
```
***
### 8.Ratio Plot
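Notes: Plot the female-to-male ratio of median friend counts to see how many times more friends the typical female user has than the typical male user.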
```{r Ratio Plot}
ggplot(aes(x=age,y=female/male),
data= pf.fc_by_age_gender.wide) +
geom_line() +
geom_hline(yintercept = 1, alpha=0.3, linetype = 2)
```
***
### 9.Third Quantitative Variable
Notes:
# Create a variable called year_joined
# in the pf data frame using the variable
# tenure and 2014 as the reference year.
# The variable year joined should contain the year
# that a user joined facebook.
# See the Instructor Notes for three hints if you get
# stuck. Scroll down slowly to see one hint at a time
# if you would like some guidance.
# This programming exercise WILL BE automatically graded.
# DO NOT ALTER THE CODE BELOW THIS LINE
# ========================================================
pf <- read.delim('/datasets/ud651/pseudo_facebook.tsv')
# ENTER YOUR CODE BELOW THIS LINE.
# ========================================================
```{r Third Quantitative Variable}
# Add a new variable: the year the user joined Facebook
pf$year_joined <- floor(2014 - pf$tenure/365)
```
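To see what the formula does, it helps to run it on a few hand-picked tenures. The values below are hypothetical and chosen only to show where the rounding falls; they are not taken from the data set.

```r
# Hypothetical tenure values in days
tenure <- c(0, 200, 365, 366, 1000, 3650)

# Same formula as above: convert tenure to years, subtract from 2014, round down
year_joined <- floor(2014 - tenure / 365)

print(data.frame(tenure, year_joined))
```

Because of `floor()`, any tenure between 1 and 365 days maps to 2013, and only a tenure of 0 maps to 2014.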
***
### 10.Cut a Variable
Notes:
# Create a new variable in the data frame
# called year_joined.bucket by using
# the cut function on the variable year_joined.
# You need to create the following buckets for the
# new variable, year_joined.bucket
# (2004, 2009]
# (2009, 2011]
# (2011, 2012]
# (2012, 2014]
# Note that a parenthesis means exclude the year and a
# bracket means include the year.
# Look up the documentation for cut or try the link
# in the Instructor Notes to accomplish this task.
# DO NOT DELETE THE TWO LINES OF CODE BELOW THIS LINE
# ========================================================================
pf <- read.delim('/datasets/ud651/pseudo_facebook.tsv')
pf$year_joined <- floor(2014 - pf$tenure / 365)
# ENTER YOUR CODE BELOW THIS LINE
# ========================================================================
```{r Cut a Variable}
summary(pf$year_joined)
table(pf$year_joined)
pf$year_joined.bucket <- cut(pf$year_joined,
c(2004,2009,2011,2012,2014))
```
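The interval notation is easiest to check on a small vector. This sketch uses made-up years (2004 through 2014, one of each) and the same breaks as the chunk above. By default `cut()` builds right-closed intervals, so 2004 falls outside the first bucket and becomes `NA`.

```r
# Hypothetical year_joined values, one per year
years <- 2004:2014

bucket <- cut(years, breaks = c(2004, 2009, 2011, 2012, 2014))

# Show which bucket each year lands in
print(data.frame(years, bucket))

# Count per bucket, including the NA produced by the excluded year 2004
print(table(bucket, useNA = "ifany"))
```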
***
### 11.Plotting it All Together
Notes:
# Create a line graph of friend_count vs. age
# so that each year_joined.bucket is a line
# tracking the median user friend_count across
# age. This means you should have four different
# lines on your plot.
# You should subset the data to exclude the users
# whose year_joined.bucket is NA.
# If you need a hint, see the Instructor Notes.
# This assignment is not graded and
# will be marked as correct when you submit.
# ENTER YOUR CODE BELOW THIS LINE
# ===================================================
```{r Plotting it All Together}
table(pf$year_joined.bucket,useNA = 'ifany')
ggplot(aes(x = age,y=friend_count),
data = subset(pf, !is.na(year_joined.bucket))) +
geom_line(aes(color=year_joined.bucket),stat='summary',fun.y=median)
```
***
### 12.Plot the Grand Mean
Notes:
# Write code to do the following:
# (1) Add another geom_line to code below
# to plot the grand mean of the friend count vs age.
# (2) Exclude any users whose year_joined.bucket is NA.
# (3) Use a different line type for the grand mean.
# As a reminder, the parameter linetype can take the values 0-6:
# 0 = blank, 1 = solid, 2 = dashed
# 3 = dotted, 4 = dotdash, 5 = longdash
# 6 = twodash
# This assignment is not graded and
# will be marked as correct when you submit.
# The code from the last programming exercise should
# be your starter code!
# ENTER YOUR CODE BELOW THIS LINE
# ==================================================================
```{r Plot the Grand Mean}
ggplot(aes(x = age,y=friend_count),
data = subset(pf, !is.na(year_joined.bucket))) +
geom_line(aes(color=year_joined.bucket),stat='summary',fun.y=mean) + # use mean instead of median
geom_line(stat='summary',fun.y=mean,linetype=3) # add one more layer: the grand mean over all buckets (dotted line)
```
***
### 13.Friending Rate
Notes:
What is the median friend rate?
0.2205
What is the maximum friend rate?
417.00
```{r Friending Rate}
with(subset(pf,tenure>=1),summary(friend_count/tenure)) # summary (quartiles and mean) of friends per day of tenure, for users with at least one day of tenure
```
***
### 14.Friendships Initiated
Notes: Plot friendships initiated per day against tenure. Variables needed: age, tenure, friendships_initiated, and year_joined.bucket.
# Create a line graph of mean of friendships_initiated per day (of tenure)
# vs. tenure colored by year_joined.bucket.
# You need to make use of the variables tenure,
# friendships_initiated, and year_joined.bucket.
# You also need to subset the data to only consider user with at least
# one day of tenure.
# This assignment is not graded and
# will be marked as correct when you submit.
# ENTER YOUR CODE BELOW THIS LINE
# ========================================================================
Result: the longer a user's tenure, the fewer new friendships they initiate.
```{r Friendships Initiated}
ggplot(aes(x=tenure,y=friendships_initiated/tenure),
data=subset(pf,tenure >= 1)) +
geom_line(aes(color=year_joined.bucket),stat='summary',fun.y=mean)
```
***
### 15.Bias-Variance Tradeoff Revisited
Notes:
# Instead of geom_line(), use geom_smooth() to add a smoother to the plot.
# You can use the defaults for geom_smooth() but do color the line
# by year_joined.bucket
# ALTER THE CODE BELOW THIS LINE
# ==============================================================================
ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
As the bin width increases, the curves become progressively smoother.
```{r Bias-Variance Tradeoff Revisited}
ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined.bucket),
stat = 'summary',
fun.y = mean)
ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
```
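The effect of the bin widths used above (1, 7, 30, and 90 days) can be checked without ggplot2. The sketch below runs on synthetic data, an assumption made purely for illustration: one noisy observation per day of tenure. The helper `binned_mean()` is a hypothetical name. It mirrors the `w * round(tenure / w)` expression from the plots.

```r
set.seed(1)

# Synthetic stand-in for friendships_initiated / tenure (not the Facebook data)
tenure <- 1:720
rate <- 1 / sqrt(tenure) + rnorm(length(tenure), sd = 0.05)

# Mean rate per bin for a given bin width, as in x = w * round(tenure / w)
binned_mean <- function(width) {
  bin <- width * round(tenure / width)
  tapply(rate, bin, mean)
}

for (w in c(1, 7, 30, 90)) {
  m <- binned_mean(w)
  cat(sprintf("width %2d: %3d plotted points, about %.1f observations per point\n",
              w, length(m), length(tenure) / length(m)))
}
```

Wider bins average more observations into each plotted point, which lowers the variance of the line, but they leave fewer points to show detail, which is where the bias comes from.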
***
### 16.Sean's NFL Fan Sentiment Study (NFL: National Football League)
Notes:
Seven-day moving average: averages the series over each seven-day window to smooth it,
Bias-variance tradeoff:
smooth transitions
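A seven-day moving average can be sketched in base R with `stats::filter()`. The daily series here is simulated, as an assumption for illustration, and is not the sentiment data from the lecture.

```r
set.seed(42)

# Hypothetical daily series: 28 days of a noisy score
daily <- sin(seq(0, 2 * pi, length.out = 28)) + rnorm(28, sd = 0.5)

# Centered seven-day moving average: each value is the mean of the day itself,
# the three days before it, and the three days after it.
# The first and last three days have no full window and come out as NA.
ma7 <- stats::filter(daily, rep(1 / 7, 7), sides = 2)

print(data.frame(day = 1:28,
                 daily = round(daily, 2),
                 ma7 = round(as.numeric(ma7), 2)))
```

A longer window gives a smoother line but reacts more slowly to real changes, which is the bias-variance tradeoff again.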
***
### 17.Introducing the Yogurt Data Set
Notes:
***
### 18.Histograms Revisited
Notes:
```{r Histograms Revisited}
yo <- read.csv('yogurt.csv')
str(yo)
# change the id from an int to a factor
yo$id <- factor(yo$id)
str(yo)
qplot(data=yo,x=price,fill=I('#F79420'))
```
***
### 19.Number of Purchases
Notes:
# Create a new variable called all.purchases,
# which gives the total counts of yogurt for
# each observation or household.
# One way to do this is using the transform
# function. You can look up the function transform
# and run the examples of code at the bottom of the
# documentation to figure out what it does.
# The transform function produces a data frame
# so if you use it then save the result to 'yo'!
# OR you can figure out another way to create the
# variable.
# DO NOT ALTER THE CODE BELOW THIS LINE
# ========================================================
yo <- read.csv('yogurt.csv')
# ENTER YOUR CODE BELOW THIS LINE
# ========================================================
```{r Number of Purchases}
summary(yo)
unique(yo$price)
length(unique(yo$price))
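table(yo$price) # counts how often each distinct value occurs
# New variable all.purchases: the total number of yogurts bought in each observation (household purchase), i.e. the sum of the counts for all flavors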
yo <- transform(yo,all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry)
summary(yo$all.purchases)
```
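The behaviour of `transform()` is easier to see on a three-row data frame. The counts below are hypothetical and only three flavor columns are used. `rowSums()` is shown as one alternative way to get the same total.

```r
# Hypothetical purchase counts for three observations (not from yogurt.csv)
toy <- data.frame(strawberry = c(0, 1, 2),
                  blueberry  = c(1, 0, 0),
                  plain      = c(0, 0, 3))

# transform() returns a new data frame, so the result must be assigned back
toy <- transform(toy, all.purchases = strawberry + blueberry + plain)

# Alternative without transform(): sum across the flavor columns
toy$all.purchases2 <- rowSums(toy[, c("strawberry", "blueberry", "plain")])

print(toy)
```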
***
### 20.Prices over Time
Notes:
# Create a scatterplot of price vs time.
# This will be an example of a time series plot.
# Resolve overplotting issues by using
# techniques you learned in Lesson 4.
# What are some things that you notice?
# ENTER YOUR CODE BELOW THIS LINE
# ================================================
```{r Prices over Time}
ggplot(aes(x=time,y=price),data=yo) +
geom_point()
# The version below reduces overplotting by adding jitter and transparency
ggplot(aes(x=time,y=price),data=yo) +
geom_jitter(alpha=1/4,shape=21,fill=I('#F79420'))
```
***
### 21.Sampling Observations
Notes:
***
### 22.Looking at Samples of Households
```{r Looking at Sample of Households}
# set the seed so the sample is reproducible
set.seed(4230)
# sample 16 households
sample.ids <- sample(levels(yo$id), 16)
ggplot(aes(x=time,y=price),
data = subset(yo, id %in% sample.ids)) +
facet_wrap(~ id) +
geom_line() +
geom_point(aes(size = all.purchases),pch=1)
```
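The sampling step relies on two things: resetting the seed reproduces the same draw, and `%in%` keeps every row belonging to a sampled id. A small sketch with hypothetical household ids (not the ones in yogurt.csv):

```r
# Hypothetical household ids stored as a factor, as yo$id is after factor()
ids <- factor(c(101, 102, 103, 104, 105, 106, 107, 108))

set.seed(4230)
first_draw <- sample(levels(ids), 3)

set.seed(4230)  # resetting the seed reproduces the same draw
second_draw <- sample(levels(ids), 3)

print(first_draw)
print(identical(first_draw, second_draw))

# Two rows per household; %in% keeps all rows of the sampled households
purchases <- data.frame(id = rep(ids, each = 2), time = rep(1:2, times = 8))
print(subset(purchases, id %in% first_draw))
```

Changing the seed and rerunning gives a different set of households, which is a quick way to look at more of the data.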
***
### 23.The Limits of Cross Sectional Data
Notes:
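The Facebook data is cross-sectional. Its limitation is that it has no time-series data for any individual user, so users cannot be tracked over time.
The yogurt data set, by contrast, is longitudinal: it records each household's yogurt purchases over a period of time.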
***
### 24.Many Variables
Notes:
***
### 25.Scatterplot Matrix
Notes:
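A scatterplot matrix shows a grid of scatterplots, one for each pair of variables. Scatterplots are not well suited to categorical variables, so ggpairs draws other plot types for those pairs.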
```{r Scatterplot Matrices}
install.packages('GGally')
library(GGally)
theme_set(theme_minimal(20)) # set the theme
set.seed(1834)
pf_subset <- pf[,c(2:15)]
names(pf_subset)
ggpairs(pf_subset[sample.int(nrow(pf_subset),1000),])
```
***
### 26.Even More Variables
Notes:
***
### 27.Heat Maps
Notes:
```{r}
nci <- read.table("nci.tsv")
colnames(nci) <- c(1:64)
```
```{r}
install.packages("reshape2")
library(reshape2)
nci.long.samp <- melt(as.matrix(nci[1:200,]))
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)
ggplot(aes(y = gene, x = case, fill = value),
data = nci.long.samp) +
geom_tile() +
scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))
```
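What `melt()` does to a matrix is clearer on a 2 x 3 example. The sketch assumes reshape2 is installed; the gene labels `g1` and `g2` and the values are made up.

```r
library(reshape2)

# Hypothetical 2 genes x 3 cases expression matrix
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("g1", "g2"), c("1", "2", "3")))
print(m)

# melt() turns the matrix into long format: one row per (row, column) cell
long <- melt(m)
names(long) <- c("gene", "case", "value")
print(long)
```

Each row of `long` corresponds to one tile in the heat map: `case` gives the x position, `gene` the y position, and `value` the fill color.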
***
### Analyzing Three or More Variables
Reflection:
***
Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!
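Notes: conclusion:
We started with simple extensions of the scatterplot and plotted the conditional summaries used in Lesson 4, for example adding summaries for multiple groups.
We then tried some techniques for examining a large number of variables at once, such as scatterplot matrices and heat maps.
We reshaped data: moving from wide data with one row per case to aggregated data with one row per combination of variables, and moving data back and forth between long and wide formats.
Outline of the upcoming Lesson 6:
Write code to scrape and format a new data set, and use the results of exploratory data analysis to interpret, guide, and study statistical models of diamond prices.
Problem set practice_lesson5.R (not done yet)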
# lesson 8: Explore Many Variables
# ch1: Price histograms with facets and color
# Create a histogram of diamond prices.
# Facet the histogram by diamond color
# and use cut to color the histogram bars.
# The plot should look something like this.
# http://i.imgur.com/b5xyrOu.jpg
# Note: In the link, a color palette of type
# 'qual' was used to color the histogram using
# scale_fill_brewer(type = 'qual')
# This assignment is not graded and
# will be marked as correct when you submit.
# ENTER YOUR CODE BELOW THIS LINE
# ===========================================
# ch2: Price vs. table, colored by cut
# Create a scatterplot of diamond price vs.
# table and color the points by the cut of
# the diamond.
# The plot should look something like this.
# http://i.imgur.com/rQF9jQr.jpg
# Note: In the link, a color palette of type
# 'qual' was used to color the scatterplot using
# scale_color_brewer(type = 'qual')
# This assignment is not graded and
# will be marked as correct when you submit.
# ENTER YOUR CODE BELOW THIS LINE
# ===========================================
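# ch3: Typical table value
# What is the typical table range for the majority of diamonds of ideal cut?
# What is the typical table range for the majority of diamonds of premium cut?
# ch4: Price vs. volume and diamond clarity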
# Create a scatterplot of diamond price vs.
# volume (x * y * z) and color the points by
# the clarity of diamonds. Use scale on the y-axis
# to take the log10 of price. You should also
# omit the top 1% of diamond volumes from the plot.
# Note: Volume is a very rough approximation of
# a diamond's actual volume.
# The plot should look something like this.
# http://i.imgur.com/excUpea.jpg
# Note: In the link, a color palette of type
# 'div' was used to color the scatterplot using
# scale_color_brewer(type = 'div')
# This assignment is not graded and
# will be marked as correct when you submit.
# ENTER YOUR CODE BELOW THIS LINE
# ===========================================
# ch5: Proportion of friendships initiated
# Many interesting variables are derived from two or more others.
# For example, we might wonder how much of a person's network on
# a service like Facebook the user actively initiated. Two users
# with the same degree (or number of friends) might be very
# different if one initiated most of those connections on the
# service, while the other initiated very few. So it could be
# useful to consider this proportion of existing friendships that
# the user initiated. This might be a good predictor of how active
# a user is compared with their peers, or other traits, such as
# personality (i.e., is this person an extrovert?).
# Your task is to create a new variable called 'prop_initiated'
# in the Pseudo-Facebook data set. The variable should contain
# the proportion of friendships that the user initiated.
# This programming assignment WILL BE automatically graded.
# DO NOT DELETE THIS NEXT LINE OF CODE
# ========================================================================
pf <- read.delim('/datasets/ud651/pseudo_facebook.tsv')
# ENTER YOUR CODE BELOW THIS LINE
# ========================================================================
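One possible reading of the task, sketched on three made-up rows: divide `friendships_initiated` by `friend_count`. Treating users with no friends as `NA` is an assumption of this sketch, and the automatic grader may expect a different convention. `toy_pf` is a hypothetical stand-in for `pf`.

```r
# Hypothetical rows with the two columns the exercise relies on
toy_pf <- data.frame(friend_count = c(10, 4, 0),
                     friendships_initiated = c(5, 4, 0))

# Initiated friendships divided by all friendships.
# A user with no friends would give 0 / 0 = NaN, so NA is assigned explicitly.
toy_pf$prop_initiated <- with(toy_pf,
  ifelse(friend_count > 0, friendships_initiated / friend_count, NA))

print(toy_pf)
```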
# ch6: prop_initiated vs. tenure
# Create a line graph of the median proportion of
# friendships initiated ('prop_initiated') vs.
# tenure and color the line segment by
# year_joined.bucket.
# Recall, we created year_joined.bucket in Lesson 5
# by first creating year_joined from the variable tenure.
# Then, we used the cut function on year_joined to create
# four bins or cohorts of users.
# (2004, 2009]
# (2009, 2011]
# (2011, 2012]
# (2012, 2014]
# The plot should look something like this.
# http://i.imgur.com/vNjPtDh.jpg
# OR this
# http://i.imgur.com/IBN1ufQ.jpg
# This assignment is not graded and
# will be marked as correct when you submit.
# ENTER YOUR CODE BELOW THIS LINE
# ===========================================================
# ch7: Smoothing prop_initiated vs. tenure
# Smooth the last plot you created
# of prop_initiated vs tenure colored by
# year_joined.bucket. You can bin together ranges
# of tenure or add a smoother to the plot.
# There won't be a solution image for this exercise.
# You will answer some questions about your plot in
# the next two exercises.
# This assignment is not graded and
# will be marked as correct when you submit.
# ENTER YOUR CODE BELOW THIS LINE
# ====================================================
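# ch8: The group with the largest prop_initiated
# On average, which group initiated the greatest proportion of its Facebook friendships? The plot with the smoother that you created in the last exercise can help you answer this question.
# ○ People who joined before 2009
# ○ People who joined between 2009 and 2011
# ○ People who joined between 2011 and 2012
# ○ People who joined after 2012
# ch9: Largest group mean prop_initiated
# For the group with the largest proportion of friendships initiated, what is the group's average (mean) proportion of friendships initiated?
# Why do you think this group's proportion of friendships initiated is higher than the others?
# ch10: Price/carat binned, faceted, and colored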
# Create a scatter plot of the price/carat ratio
# of diamonds. The variable x should be
# assigned to cut. The points should be colored
# by diamond color, and the plot should be
# faceted by clarity.
# The plot should look something like this.
# http://i.imgur.com/YzbWkHT.jpg.
# Note: In the link, a color palette of type
# 'div' was used to color the histogram using
# scale_color_brewer(type = 'div')
# This assignment is not graded and
# will be marked as correct when you submit.
# ENTER YOUR CODE BELOW THIS LINE
# ===========================================
# ch11: Gapminder multivariate analysis
# The Gapminder website contains over 500 data sets with information about
# the world's population. Your task is to continue the investigation you did at the
# end of Problem Set 4 or you can start fresh and choose a different
# data set from Gapminder.
# If you’re feeling adventurous or want to try some data munging see if you can
# find a data set or scrape one from the web.
# In your investigation, examine 3 or more variables and create 2-5 plots that make
# use of the techniques from Lesson 5.
# You can find a link to the Gapminder website in the Instructor Notes.
# Once you've completed your investigation, create a post in the discussions that includes:
# 1. the variable(s) you investigated, your observations, and any summary statistics
# 2. snippets of code that created the plots
# 3. links to the images of your plots
# Copy and paste all of the code that you used for
# your investigation, and submit it when you are ready.
# ============================================================================================