Method 1: the Thrift interface

Going through the Thrift interface is the simplest option, but access is slow and the Thrift socket has a timeout.
Reference: Operating HBase from Python with HBase-Thrift. Using it while traversing an RDD runs into trouble once the RDD gets very large.

Enhancing the Thrift interface with happybase
Installing happybase failed; the attempted fix on CentOS 7 was:
yum install python-devel
Installing happybase still failed after that, so it looks like the only option left is the native Thrift interface.

Reference: Operating HBase from Python with happybase
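For completeness, here is a minimal happybase sketch of what this approach looks like once the package installs; the Thrift host, table name and column family are placeholders rather than values from the setup above:

```
import happybase

# Connect to the HBase Thrift server (placeholder host; 9090 is the default Thrift port).
connection = happybase.Connection('thrift-host', port=9090, timeout=60000)
table = connection.table('test')

# Write one row, then scan it back; columns are addressed as 'family:qualifier'.
table.put(b'row-1', {b'result:code': b'000001'})
for key, data in table.scan(row_prefix=b'row-'):
    print(key, data)

connection.close()
```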

Method 2: newAPIHadoopRDD

Using the newAPIHadoopRDD interface. I tried this several times without getting it to work; after some more digging it finally did (see the reference link).
To fix the errors on a distributed cluster, the spark-examples jar has to be copied to every node and the configuration adjusted on each of them. Copying it with scp is the simplest way:
scp /var/lib/spark/jars/hbase/spark-examples_2.11-1.6.0-typesafe-001.jar root@192.168.100.13:/var/lib/spark/jars/hbase/spark-examples_2.11-1.6.0-typesafe-001.jar
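For reference, reading an HBase table through newAPIHadoopRDD then looks roughly like the following; the table name is a placeholder, and the two converter classes live in the spark-examples jar copied above:

```
host = "cdh-192-168-100-11,cdh-192-168-100-12,cdh-192-168-100-13"
conf = {
    "hbase.zookeeper.quorum": host,
    "hbase.mapreduce.inputtable": "test",   # placeholder table name
}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

# Each record comes back as a (row key, converter-encoded cell string) pair.
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)
print(hbase_rdd.count())
```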

Method 3: the SHC connector (shc-core)

After a lot of fiddling, this approach finally worked as well.
Adjust the launch parameters (resolving the --packages dependency takes a long time, and this only needs to be done on the node where the Spark shell is started):

pyspark2 --conf spark.kryoserializer.buffer.max=1024m --conf spark.driver.maxResultSize=20G --conf spark.driver.memory=20G --total-executor-cores=100 --executor-memory=10G --executor-cores=2 --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/

Then, inside the shell, load the source data:

```
import datetime
from pyspark import SparkConf, SparkContext
# The driver is the machine that connects to the Spark cluster, so the host
# configuration should point at the machine where this code is being written.
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import udf
from pyspark.sql.types import *

# `spark`, `sc` and `sqlContext` are pre-created by the pyspark2 shell.
conf = SparkConf()
conf.set("dfs.socket.timeout", 300000)
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

# Read the parquet data that will later be written into HBase.
root = "/user/XieHongjun/"
file = root + "all.parquet"
df = spark.read.parquet(file)
df.head(10)
```
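In the interactive shell the `spark` and `sqlContext` objects are created before this code runs, so the SparkConf above has limited effect; if the same job were submitted as a standalone script, the settings could be applied when the session is built. A minimal sketch under that assumption (the app name is made up):

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-to-hbase")                          # hypothetical app name
         .config("spark.sql.shuffle.partitions", "400")
         .config("spark.hadoop.dfs.socket.timeout", "300000")  # hand the HDFS timeout to Hadoop via spark.hadoop.*
         .getOrCreate())
```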

The SHC catalog maps each DataFrame column to an HBase column family and qualifier; define it and write the DataFrame out:

```
# The "rowkey" entry designates the HBase row key; everything else goes into
# the "result" column family.
catalog = ''.join("""{
  "table":{"namespace":"default", "name":"test"},
  "rowkey":"key",
  "columns":{
  "rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
  "code":{"cf":"result", "col":"code", "type":"string"},
  "date":{"cf":"result", "col":"date", "type":"string"},
  "time":{"cf":"result", "col":"time", "type":"string"},
  "price":{"cf":"result", "col":"price", "type":"float"},
  "ratio":{"cf":"result", "col":"ratio", "type":"float"},
  "bigratio":{"cf":"result", "col":"bigratio", "type":"float"},
  "timestamp":{"cf":"result", "col":"timestamp", "type":"string"}
  }
  }""".split())
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

# df1 is the DataFrame to be written (prepared from df above).
df1.write.options(catalog=catalog) \
        .mode('overwrite') \
        .format(data_source_format) \
        .option("zookeeper.znode.parent", "/hbase-unsecure") \
        .option("hbase.zookeeper.quorum", "cdh-192-168-100-11,cdh-192-168-100-12,cdh-192-168-100-13") \
        .option("hbase.zookeeper.property.clientPort", "2181") \
        .option("newTable", "5") \
        .option("hbase.cluster.distributed", True) \
        .save()
```

The HBase table has to be created with the hbase shell before running the write; see the example right below.
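A hedged example, assuming the table name and column family from the catalog above ('test' with the single family 'result'):

```
# in the hbase shell
create 'test', 'result'
```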

With that in place, small writes work fine, but when the data volume is huge the write to HBase fails:
```
WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn...
...ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
```

Tuning HBase performance along the lines of the references did not fix the RegionServer processes dying:

```
19/06/12 15:05:18 INFO client.AsyncProcess: #4, table=test, attempt=25/35 failed=3833ops, last exception: java.net.ConnectException: Connection refused on cdh-192-168-100-16,60020,1560260036428, tracking started null, retrying after=10032ms, replay=3833ops
19/06/12 15:05:18 INFO client.AsyncProcess: #3, table=test, attempt=25/35 failed=3833ops, last exception: java.net.ConnectException: Connection refused on cdh-192-168-100-16,60020,1560260036428, tracking
```

After countless rounds of searching turned up nothing, I finally checked the RegionServer logs and found an OOM, along with the JVM options the RegionServer had been started with:

  • export 'HBASE_REGIONSERVER_OPTS=-Xms52428800 -Xmx52428800 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:ReservedCodeCacheSize=256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hbase_hbase-REGIONSERVER-00ccac38cdc5566a0bbb251eb51faae5_pid20975.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh'
  • HBASE_REGIONSERVER_OPTS='-Xms52428800 -Xmx52428800 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:ReservedCodeCacheSize=256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hbase_hbase-REGIONSERVER-00ccac38cdc5566a0bbb251eb51faae5_pid20975.hprof -XX:OnOutOfMemoryError=/usr/lib64/cmf/service/common/killparent.sh'

The heap configured there is only 50 MB (-Xms/-Xmx 52428800 bytes), which explains the OOM, so I adjusted the RegionServer JVM settings:

[screenshot of the adjusted RegionServer JVM heap settings]


That finally solved the problem.
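The screenshot is not reproduced here. Purely as an illustration (the actual values are site-specific), raising the RegionServer heap from the ~50 MB above to something sized for the node would look like this in hbase-env.sh or the cluster manager's configuration:

```
# illustrative only: give the RegionServer a multi-GB heap instead of ~50 MB
export HBASE_REGIONSERVER_OPTS="-Xms8g -Xmx8g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled"
```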

Reading from HBase back into a DataFrame

```
# Read the table back through SHC using the same catalog.
df = spark.read.options(catalog=catalog) \
        .format("org.apache.spark.sql.execution.datasources.hbase") \
        .option("zookeeper.znode.parent", "/hbase-unsecure") \
        .option("hbase.zookeeper.quorum", "cdh-192-168-100-11,cdh-192-168-100-12,cdh-192-168-100-13") \
        .option("hbase.zookeeper.property.clientPort", "2181") \
        .option("hbase.cluster.distributed", True) \
        .load()
df.show()
```

This fails with:

```
Py4JJavaError: An error occurred while calling o225.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, cdh-192-168-100-14, executor 3): java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setCaching(I)Lorg/apache/hadoop/hbase/client/Scan;
```

(A NoSuchMethodError like this usually indicates mismatched HBase client jars on the executor classpath: Scan.setCaching only returns a Scan in HBase 1.x and later.)

A small end-to-end test of the connector:

```
catalog = ''.join("""{
  "table":{"namespace":"test", "name":"test_table"},
  "rowkey":"key",
  "columns":{
  "col0":{"cf":"rowkey", "col":"key", "type":"string"},
  "col1":{"cf":"result", "col":"class", "type":"string"}
  }
  }""".split())

data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

# Build a tiny DataFrame and write it out through the connector.
df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])
df.show()
df.write.options(catalog=catalog, newTable="5").format(data_source_format).save()

# Read the same table back.
df_read = spark.read.options(catalog=catalog).format(data_source_format).load()
df_read.show()
```
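As a quick sanity check (not part of the original test), the rows written above map their keys back to col0, so filtering on that column should return the matching row:

```
# hypothetical follow-up: filter on the column backed by the HBase row key
df_read.filter(df_read.col0 == 'a').show()
```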