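The figure above is a typical city heat map. How is a map like this drawn?
In practice, each region has its own latitude and longitude and its own range of IP addresses, so parsing the IPs in web access logs lets us estimate how much traffic comes from each region.
This article shows how to parse access logs, look up the coordinates of the most active regions, and insert the aggregated counts into a MySQL table.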
Data preparation
Two datasets are needed:
- Log data: 20090121000132.394251.http.format
Link: https://pan.baidu.com/s/1luckcRUOpCDVmivLJ03XOQ
Extraction code: kroh
- City IP range data: ip.txt
Link: https://pan.baidu.com/s/1cOJhlCrfmC1SWXTZXMwovg
Extraction code: ydrv
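Requirements analysis
- Load the city IP range data and extract each range's numeric start, numeric end, longitude, and latitude
- Load the log data, extract each IP address, convert it to a number, and compare it against the IP ranges
- Use binary search for the comparison to find the matching longitude and latitude (the code assumes the ranges in ip.txt are sorted by start and do not overlap)
- Count the occurrences of each (longitude, latitude) pair, in the same way as a word count
- Insert the results into a MySQL table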
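The steps above can be sketched on plain Scala collections, without Spark or MySQL, to make the conversion, lookup, and counting logic easier to follow. This is only a minimal sketch: the object name `IpLookupSketch`, the helper names, and the three IP ranges with their coordinates are made up for illustration and are not taken from ip.txt.

```scala
object IpLookupSketch {
  // Convert a dotted IPv4 string to a number, folding in one octet (8 bits) at a time.
  def ip2Long(ip: String): Long =
    ip.split("\\.").foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // Hypothetical ranges (start, end, longitude, latitude), sorted by start and non-overlapping.
  val ranges: Array[(Long, Long, String, String)] = Array(
    (ip2Long("10.0.0.0"), ip2Long("10.0.0.255"), "116.40", "39.90"),
    (ip2Long("10.0.1.0"), ip2Long("10.0.1.255"), "121.47", "31.23"),
    (ip2Long("10.0.3.0"), ip2Long("10.0.3.255"), "113.26", "23.13")
  )

  // Binary search for the range containing ipNum; None when no range matches.
  def locate(ipNum: Long): Option[(String, String)] = {
    var lo = 0
    var hi = ranges.length - 1
    while (lo <= hi) {
      val mid = lo + (hi - lo) / 2
      val (rangeStart, rangeEnd, lon, lat) = ranges(mid)
      if (ipNum < rangeStart) hi = mid - 1
      else if (ipNum > rangeEnd) lo = mid + 1
      else return Some((lon, lat))
    }
    None
  }

  // Count hits per (longitude, latitude), dropping IPs that match no range.
  def countHits(ips: Seq[String]): Map[(String, String), Int] =
    ips.flatMap(ip => locate(ip2Long(ip)))
      .groupBy(identity)
      .map { case (coord, hits) => coord -> hits.size }

  def main(args: Array[String]): Unit = {
    // 192 * 2^24 + 168 * 2^16 + 10 * 2^8 + 11
    println(ip2Long("192.168.10.11")) // prints 3232238091
    println(countHits(Seq("10.0.0.7", "10.0.0.9", "10.0.1.1", "10.0.2.5")))
  }
}
```

In this sketch, 10.0.2.5 falls in the gap between the second and third ranges, so it is skipped rather than counted.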
Code implementation
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Iplocation {
  // Convert a dotted IPv4 address such as 192.168.10.11 into a Long
  def ip2Long(ip: String): Long = {
    val ips: Array[String] = ip.split("\\.")
    var ipNum: Long = 0L
    for (i <- ips) {
      // Shift the accumulated value left by one octet, then OR in the next octet
      ipNum = (ipNum << 8) | i.toLong
    }
    ipNum
  }
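  // Binary search over the IP ranges, which must be sorted by start.
  // Returns the index of the range containing ipNum, or -1 if no range contains it.
  def binarySearch(ipNum: Long, city_ip_array: Array[(String, String, String, String)]): Int = {
    // Lower bound of the search window
    var start = 0
    // Upper bound of the search window
    var end = city_ip_array.length - 1
    while (start <= end) {
      // Midpoint of the window, written this way to avoid Int overflow
      val middle = start + (end - start) / 2
      val rangeStart = city_ip_array(middle)._1.toLong
      val rangeEnd = city_ip_array(middle)._2.toLong
      if (ipNum < rangeStart) {
        end = middle - 1
      } else if (ipNum > rangeEnd) {
        start = middle + 1
      } else {
        return middle
      }
    }
    // No range matched; callers must check for -1 before indexing the array
    -1
  }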
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("Iplocation").setMaster("local[2]")
    // 2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    // 3. Load the city IP ranges as (start number, end number, longitude, latitude)
    val city_ip_RDD: RDD[(String, String, String, String)] = sc.textFile("./data/ip.txt")
      .map(x => x.split("\\|"))
      .map(x => (x(2), x(3), x(x.length - 2), x(x.length - 1)))
    // 4. Broadcast the ranges so that each executor holds one read-only copy
    val cityIpBroadCast: Broadcast[Array[(String, String, String, String)]] = sc.broadcast(city_ip_RDD.collect())
    // 5. Read the carrier log data and keep only the IP field (index 1)
    val userIpsRDD: RDD[String] = sc.textFile("./data/20090121000132.394251.http.format").map(x => x.split("\\|")(1))
    // 6. Match every user IP against the broadcast IP ranges
    val resultRDD: RDD[((String, String), Int)] = userIpsRDD.mapPartitions(
      iter => {
        // 6.1 Read the broadcast value once per partition
        val city_ip_array: Array[(String, String, String, String)] = cityIpBroadCast.value
        // 6.2 Process each IP address
        iter
          // Convert the IP to a number and find the index of its range in the array
          .map(ip => binarySearch(ip2Long(ip), city_ip_array))
          // Skip IPs outside every range; indexing the array with -1 would throw
          .filter(index => index >= 0)
          .map(index => {
            // Fetch the range at that index
            val result: (String, String, String, String) = city_ip_array(index)
            // Key by (longitude, latitude) with an initial count of 1
            ((result._3, result._4), 1)
          })
      }
    )
    // 7. Sum the counts for identical coordinates
    val finalResult: RDD[((String, String), Int)] = resultRDD.reduceByKey(_ + _)
    // 8. Write the results to the MySQL table
    finalResult.foreachPartition(
      iter => {
        // 8.1 Open one MySQL connection per partition
        var connection: Connection = null
        var ps: PreparedStatement = null
        try {
          connection = DriverManager.getConnection("jdbc:mysql://node03:3306/spark", "root", "123456")
          // 8.2 Define the insert statement
          val sql = "insert into city_hot_places (longitude, latitude, hot) values (?, ?, ?)"
          // 8.3 Prepare the statement
          ps = connection.prepareStatement(sql)
          // 8.4 Bind the parameters for each row
          iter.foreach(t => {
            ps.setString(1, t._1._1)
            ps.setString(2, t._1._2)
            ps.setInt(3, t._2)
            // Queue the row in the current batch
            ps.addBatch()
          })
          // Execute the whole batch
          ps.executeBatch()
        } catch {
          case e: Exception => println(e.getMessage)
        } finally {
          // Close the statement first, then the connection
          if (ps != null) {
            ps.close()
          }
          if (connection != null) {
            connection.close()
          }
        }
      }
    )
    // 9. Stop the SparkContext and release resources
    sc.stop()
  }
}
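The job above keeps the range bounds as strings and calls `.toLong` on every comparison. One possible variation, shown here only as a sketch, is to parse the bounds into `Long` once on the driver and let `java.util.Arrays.binarySearch` do the search over the sorted start values. The names `TypedRangeSketch`, `IpRange`, and `RangeIndex` are hypothetical, and the parser assumes the same column positions the job reads (fields 2 and 3 for the numeric bounds, the last two fields for longitude and latitude).

```scala
import java.util.Arrays

object TypedRangeSketch {
  // One parsed row of the range table; the name and fields are illustrative.
  final case class IpRange(startNum: Long, endNum: Long, longitude: String, latitude: String)

  // Parse a pipe-delimited line using the column positions assumed above.
  def parse(line: String): IpRange = {
    val f = line.split("\\|")
    IpRange(f(2).toLong, f(3).toLong, f(f.length - 2), f(f.length - 1))
  }

  // Index built once; rows are sorted by start here so the search precondition holds.
  final class RangeIndex(rows: Seq[IpRange]) extends Serializable {
    private val sorted: Array[IpRange] = rows.sortBy(_.startNum).toArray
    private val starts: Array[Long] = sorted.map(_.startNum)

    def lookup(ipNum: Long): Option[IpRange] = {
      val i = Arrays.binarySearch(starts, ipNum)
      // A negative result encodes -(insertionPoint) - 1, so the candidate
      // range is the one just before the insertion point.
      val candidate = if (i >= 0) i else -i - 2
      if (candidate >= 0 && ipNum <= sorted(candidate).endNum) Some(sorted(candidate))
      else None
    }
  }
}
```

A `RangeIndex` built this way could be broadcast in place of the raw array, and returning `Option` makes the "no matching range" case explicit at the call site instead of relying on a `-1` sentinel.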
Viewing the results
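Summary
- Broadcast variables improve efficiency: the IP range table is shipped to each executor once instead of with every task
- Binary search lowers the cost of each lookup from O(n) to O(log n)
- foreachPartition reduces the number of database connections to one per partition rather than one per record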
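To make the last point concrete, the sketch below counts how many connections are opened under each approach. It does not use Spark or JDBC: `FakeConnection` is a hypothetical stand-in that only records that it was opened, and the nested `Seq` plays the role of an RDD's partitions.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ConnectionCountSketch {
  type Row = ((String, String), Int)

  // Hypothetical stand-in for a JDBC connection: it only records that it was opened.
  final class FakeConnection(opened: AtomicInteger) {
    opened.incrementAndGet()
    def insert(row: Row): Unit = () // a real connection would run the INSERT here
    def close(): Unit = ()
  }

  // One connection per record, as a naive per-record write would do.
  def openedPerRecord(partitions: Seq[Seq[Row]]): Int = {
    val opened = new AtomicInteger(0)
    for (partition <- partitions; row <- partition) {
      val connection = new FakeConnection(opened)
      try connection.insert(row) finally connection.close()
    }
    opened.get()
  }

  // One connection per partition, shared by every row in that partition.
  def openedPerPartition(partitions: Seq[Seq[Row]]): Int = {
    val opened = new AtomicInteger(0)
    for (partition <- partitions) {
      val connection = new FakeConnection(opened)
      try partition.foreach(connection.insert) finally connection.close()
    }
    opened.get()
  }
}
```

With two partitions holding three and two rows, the per-record version opens five connections and the per-partition version opens two.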