Spark Ecosystem: Learning spark-csv (1) — Installation and Simple Examples
Original post by 51CTO blog author KeepLearningAI.
1. Installation:
(1) Spark-shell:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
Running this command downloads the package and drops you straight into the shell with spark-csv available.
(2) Eclipse project:
Import the three jar files downloaded in step (1) into the project; the jars are cached under /home/hadoop/.ivy2.
2. Usage:
Since spark-shell is inconvenient for debugging, I did not dig into it much; see reference [1] for details. The examples below are all run from Eclipse.
(0) Download the data:
wget https://github.com/databricks/spark-csv/raw/master/src/test/resources/cars.csv
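For reference, this test file has five columns (year, make, model, comment, blank). Reconstructed from the console output shown further below, its contents look roughly like this (the long Ford comment is truncated in the console output, so it appears truncated here as well; the exact quoting in the repository file may differ):

```
year,make,model,comment,blank
2012,Tesla,S,No comment,
1997,Ford,E350,"Go get one now th...",
2015,Chevy,Volt
```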
(1) readCsvBySparkSQLLoad
Example:
/**
 * @author xubo
 * sparkCSV learning
 * @time 20160419
 * reference https://github.com/databricks/spark-csv
 */
package com.apache.spark.sparkCSV.learning

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object readCsvBySparkSQLLoad {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkLearning:SparkCSV").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    // Load the CSV through the spark-csv data source;
    // "header" -> "true" takes column names from the first line of the file
    val df = sqlContext.load("com.databricks.spark.csv",
      Map("path" -> "file/data/sparkCSV/input/cars.csv", "header" -> "true"))

    // Select two columns and write them back out as CSV
    df.select("year", "model").save("file/data/sparkCSV/output/newcars.csv", "com.databricks.spark.csv")
    df.show
    sc.stop
  }
}
Output:
+----+-----+-----+--------------------+-----+
|year| make|model| comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla| S| No comment| |
|1997| Ford| E350|Go get one now th...| |
|2015|Chevy| Volt| null| null|
+----+-----+-----+--------------------+-----+
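To make it clearer what `df.select("year", "model")` is doing, here is a minimal plain-Scala sketch of the semantics (this is not spark-csv's or Spark's actual implementation — just an illustration that picks named columns from rows keyed by the header; all names here are made up for the example):

```scala
object SelectSketch {
  // header maps column names to positions; select picks those positions in each row
  def select(header: Seq[String], rows: Seq[Seq[String]], cols: Seq[String]): Seq[Seq[String]] = {
    val idx = cols.map(header.indexOf)
    rows.map(r => idx.map(r))
  }

  def main(args: Array[String]) {
    val header = Seq("year", "make", "model")
    val rows   = Seq(Seq("2012", "Tesla", "S"), Seq("2015", "Chevy", "Volt"))
    // Keep only the "year" and "model" columns, like df.select("year", "model")
    println(select(header, rows, Seq("year", "model")))
  }
}
```

In the real DataFrame API the result is of course still a DataFrame (lazy, distributed), not an in-memory Seq, but the column-projection idea is the same.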
With my own file (from the Tianchi competition: Cainiao Network):
/**
 * @author xubo
 * sparkCSV learning
 * @time 20160419
 * reference https://github.com/databricks/spark-csv
 */
package com.apache.spark.sparkCSV.learning

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object readCsvBySparkSQLLoad2 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkLearning:SparkCSV").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    val file1 = "file/data/sparkCSV/input/sample_submission.csv"
    println(file1)

    // "header" -> "false": the file has no header line, so spark-csv
    // auto-generates column names C0, C1, C2, ...
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> file1, "header" -> "false"))
    // df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")
    df.show
    sc.stop
  }
}
Output:
file/data/input/sample_submission.csv
2016-04-19 00:19:32 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: 100.78.140.148, but we couldn't find any external IP address!
+-----+---+---+
| C0| C1| C2|
+-----+---+---+
| 535|all| 1|
| 727|all| 1|
| 1765|all| 1|
| 8230|all| 1|
| 9574|all| 1|
| 9595|all| 1|
| 9754|all| 1|
| 9964|all| 1|
|11068|all| 1|
|12223|all| 1|
|12940|all| 1|
|13282|all| 1|
|14920|all| 1|
|17392|all| 1|
|17731|all| 1|
|18958|all| 1|
|19966|all| 1|
|22108|all| 1|
|22282|all| 1|
|23671|all| 1|
+-----+---+---+
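The `C0`, `C1`, `C2` column names above come from `"header" -> "false"`. The naming rule can be sketched in a few lines of plain Scala (again just an illustration of the behavior, not spark-csv's actual code):

```scala
object HeaderOptionSketch {
  // header = true: column names come from the first line of the file;
  // header = false: names C0, C1, ... are generated, as in the output above
  def columns(firstLine: String, header: Boolean): Seq[String] = {
    val fields = firstLine.split(",").toSeq
    if (header) fields else fields.indices.map(i => s"C$i")
  }

  def main(args: Array[String]) {
    println(columns("535,all,1", header = false).mkString(","))      // C0,C1,C2
    println(columns("year,make,model", header = true).mkString(","))  // year,make,model
  }
}
```

So for files like sample_submission.csv that carry no header line, `"header" -> "false"` is the right choice; with `"header" -> "true"` the first data row would be swallowed as column names.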
(2) For other Spark versions, and for Python, R, Java, etc., see reference [1].
More Spark learning code is available at: https://github.com/xubo245/SparkLearning
References:
[1] https://github.com/databricks/spark-csv
[2] http://www.iteblog.com/archives/1380