Introduction
Ever since Kerberos went live on the Hadoop cluster it has been nothing but pain; ops and dev complain about it constantly. The original reason for enabling Kerberos was to guard against careless operations in Hive and Impala, but Kerberos only provides authentication. Authorization in Hive and Impala is handled through their own command lines, and in both cases it is coarse-grained. Is there a fine-grained alternative? That is exactly what Sentry was built for.
Sentry provides a unified platform for keeping data secure and relies on the cluster's existing Hadoop Kerberos setup for authentication (this article leaves the Kerberos component out). Authorization itself is managed through commands. Sentry controls access to every schema object in the Hive Metastore through a set of privileges such as SELECT and INSERT. Schema objects are the usual entities of data management, such as SERVER, DATABASE, TABLE, COLUMN, and URI (a file location in HDFS). Cloudera Search has its own set of privileges (such as QUERY) and objects (such as COLLECTION).
Sentry is normally configured to deny access to services and data by default, so until users are placed into groups that have been assigned the appropriate roles, they have only limited access to the system.
Sentry provides:
a hierarchy of objects, where permissions are automatically inherited from parent objects;
rules that bundle together multiple object/privilege pairs;
groups that can be granted one or more roles;
users that can be assigned to one or more groups (a short sketch of how these pieces chain together follows).
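A minimal sketch of that chain, using hypothetical names (analyst_role, an analysts OS group, a reports database) and assuming HiveServer2 is reachable on hadoop5 as planned below:
# privileges attach to a role, the role is granted to an OS group,
# and users pick up the privileges through their group membership
beeline -u "jdbc:hive2://hadoop5:10000" -n hive -p hive -e "CREATE ROLE analyst_role;"
beeline -u "jdbc:hive2://hadoop5:10000" -n hive -p hive -e "GRANT SELECT ON DATABASE reports TO ROLE analyst_role;"
beeline -u "jdbc:hive2://hadoop5:10000" -n hive -p hive -e "GRANT ROLE analyst_role TO GROUP analysts;"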
Planning
Host | hive | Sentry | DB |
hadoop1 | | | |
hadoop2 | | | |
hadoop3 | | | |
hadoop4 | | | |
hadoop5 | hive-server2 | | |
hadoop6 | | sentry-store | mysql |
Deployment
Install Sentry
yum install sentry sentry-store sentry-hdfs-plugin -y
Configure Sentry
/etc/sentry/conf/sentry-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>sentry.service.security.mode</name>
<value>none</value>
</property>
<property>
<name>sentry.service.server.principal</name>
<value></value>
</property>
<property>
<name>sentry.service.server.keytab</name>
<value></value>
</property>
<property>
<name>sentry.service.admin.group</name>
<value>hive,impala,hue,admin</value>
</property>
<property>
<name>sentry.service.reporting</name>
<value>JMX</value>
</property>
<property>
<name>sentry.service.web.enable</name>
<value>true</value>
</property>
<property>
<name>sentry.service.web.port</name>
<value>51000</value>
</property>
<property>
<name>sentry.service.web.authentication.type</name>
<value>NONE</value>
</property>
<property>
<name>sentry.service.allow.connect</name>
<value>impala,hive,solr,hue</value>
</property>
<property>
<name>sentry.verify.schema.version</name>
<value>true</value>
</property>
<property>
<name>sentry.service.server.rpc-address</name>
<value>hadoop6</value>
</property>
<property>
<name>sentry.service.server.rpc-port</name>
<value>8038</value>
</property>
<property>
<name>sentry.store.jdbc.url</name>
<value>jdbc:mysql://172.31.217.156:3306/sentry</value>
</property>
<property>
<name>sentry.store.jdbc.user</name>
<value>root</value>
</property>
<property>
<name>sentry.store.jdbc.password</name>
<value>123456</value>
</property>
<property>
<name>sentry.store.jdbc.driver</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>sentry.hive.server</name>
<value>hadoop5</value>
</property>
<property>
<name>sentry.store.group.mapping</name>
<value>org.apache.sentry.provider.common.HadoopGroupMappingService</value>
</property>
</configuration>
Note: the hive, impala, and hue groups are added to sentry.service.admin.group in sentry-site.xml. If an end user belongs to one of these admin groups, that user has administrative rights on the Sentry server.
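Because sentry.store.group.mapping points at HadoopGroupMappingService, group membership is resolved through Hadoop's group mapping, which by default means the local OS groups, so the groups referenced above must actually exist on the hosts. A hedged sketch for the admin group (hive, impala, and hue are normally created by the package installs; the user name is a placeholder):
groupadd admin                # create the admin group if it does not already exist
usermod -aG admin someadmin   # 'someadmin' stands in for a real administrative user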
Initialize Sentry
Because MySQL is used as the backing database, Sentry needs the MySQL JDBC connector; copy Hive's mysql-connector-java.jar into Sentry's lib directory:
cp /usr/lib/hive/lib/mysql-connector-java.jar /usr/lib/sentry/lib/
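The JDBC URL in sentry-site.xml points at a sentry database on 172.31.217.156, which must exist before the schema tool below is run. A minimal sketch, assuming the root/123456 credentials from the configuration:
mysql -h 172.31.217.156 -uroot -p123456 -e "CREATE DATABASE IF NOT EXISTS sentry DEFAULT CHARACTER SET utf8;"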
Initialize the DB
cd /etc/sentry/conf
sentry --command schema-tool --conffile ./sentry-site.xml --dbType mysql --initSchema
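If the schema tool succeeds, the Sentry tables should now be present in the sentry database; a quick check:
mysql -h 172.31.217.156 -uroot -p123456 -e "SHOW TABLES IN sentry;"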
Start Sentry
/etc/init.d/sentry-store restart
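To confirm the store actually came up, check that it is listening on the RPC port (8038) and web port (51000) configured above; the log location below is an assumption and may differ with your packaging:
netstat -tlnp | grep -E '8038|51000'
tail -n 50 /var/log/sentry/*.log   # assumed log location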
Test the Sentry web UI
http://172.31.217.156:51000/
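The same check can be scripted from a shell; an HTTP 200 here means the Sentry web UI is answering:
curl -s -o /dev/null -w '%{http_code}\n' http://172.31.217.156:51000/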
Adjust the Hive configuration
/etc/hive/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://172.31.217.156:3306/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>datanucleus.readOnlyDatastore</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>false</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateColumns</name>
<value>true</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop1:23125,hadoop6:23125</value>
</property>
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.warehouse.subdir.inherit.perms</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop5:9083</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>36000</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>hadoop2:2181,hadoop3:2181,hadoop4:2181</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>hive.server2.thrift.min.worker.threads</name>
<value>2</value>
</property>
<property>
<name>hive.server2.thrift.max.worker.threads</name>
<value>10</value>
</property>
<property>
<name>hive.metastore.authorization.storage.checks</name>
<value>true</value>
</property>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<property>
<name>hive.enable.spark.execution.engine</name>
<value>true</value>
</property>
<property>
<name>spark.home</name>
<value>/opt/programs/spark_1.6.0</value>
</property>
<!-- Sentry Hiveserver2 config-->
<property>
<name>hive.sentry.conf.url</name>
<value>file:///etc/hive/conf/sentry-site.xml</value>
</property>
<property>
<name>hive.server2.session.hook</name>
<value>org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook</value>
</property>
<property>
<name>hive.security.authorization.task.factory</name>
<value>org.apache.sentry.binding.hive.SentryHiveAuthorizationTaskFactoryImpl</value>
</property>
<!-- Sentry hivemeastore config -->
<property>
<name>hive.metastore.filter.hook</name>
<value>org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook</value>
</property>
<property>
<name>hive.metastore.pre.event.listeners</name>
<value>org.apache.sentry.binding.metastore.MetastoreAuthzBinding</value>
</property>
<property>
<name>hive.metastore.event.listeners</name>
<value>org.apache.sentry.binding.metastore.SentryMetastorePostEventListener</value>
</property>
</configuration>
Note: pay attention to the Sentry-related settings above.
/etc/hive/conf/sentry-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>hive.sentry.server</name>
<value>hadoop6</value>
</property>
<property>
<name>sentry.service.security.mode</name>
<value>none</value>
</property>
<property>
<name>sentry.hive.provider.backend</name>
<value>org.apache.sentry.provider.db.SimpleDBProviderBackend</value>
</property>
<property>
<name>sentry.service.client.server.rpc-address</name>
<value>hadoop6</value>
</property>
<property>
<name>sentry.service.client.server.rpc-connection-timeout</name>
<value>200000</value>
</property>
<property>
<name>sentry.service.client.server.rpc-port</name>
<value>8038</value>
</property>
<property>
<name>hive.sentry.provider</name>
<value>org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider</value>
</property>
<property>
<name>hive.sentry.failure.hooks</name>
<value>com.cloudera.navigator.audit.hive.HiveSentryOnFailureHook</value>
</property>
<property>
<name>sentry.hive.server</name>
<value>hadoop5</value>
</property>
<property>
<name>sentry.hive.testing.mode</name>
<value>true</value>
</property>
</configuration>
Note: this sentry-site.xml is not the Sentry service configuration on hadoop6; it is a separate Sentry client configuration for Hive on hadoop5.
Add the lib jars
Copy the sentry*.jar and shiro-*.jar files from /usr/lib/sentry/lib/ on hadoop6 into /usr/lib/hive/lib/ on hadoop5.
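A hedged way to do the copy, assuming hadoop5 has SSH access to hadoop6 (the globs are quoted so they expand on the remote side):
# run on hadoop5
scp "hadoop6:/usr/lib/sentry/lib/sentry*.jar" /usr/lib/hive/lib/
scp "hadoop6:/usr/lib/sentry/lib/shiro-*.jar" /usr/lib/hive/lib/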
Attention!!!
The official documentation states that with the Sentry service enabled, the Hive warehouse directory must have 771 permissions, i.e.:
hadoop fs -chmod -R 771 /user/hive/warehouse
hadoop fs -chown -R hive:hive /user/hive/warehouse
Note: for reasons specific to this setup, that step was not needed here; the directories were left with their existing 1777 permissions.
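Either way, it is worth confirming what the warehouse directory currently has before restarting the Hive services:
hadoop fs -ls -d /user/hive/warehouse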
Restart the Hive services
Restart the metastore
/etc/init.d/hive-metastore restart
Restart hiveserver2
/etc/init.d/hive-server2 restart
Testing Hive privileges
Note: you must log in as the hive user.
View the current privileges
beeline
show current roles;
show roles;
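The bare beeline call above still needs a connection string; a hedged one-liner that connects to HiveServer2 as the hive user (with no authentication configured, the password is arbitrary) and runs the same check:
beeline -u "jdbc:hive2://172.31.217.155:10000" -n hive -p hive -e "show current roles;"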
Create the admin role
CREATE ROLE admin_role;
GRANT ROLE admin_role TO GROUP admin;
GRANT ALL ON server hadoop5 to role admin_role;
Grant privileges to the hive group
GRANT ROLE admin_role TO GROUP hive;
Note: this is needed because users connecting through Sentry have no privileges at all by default.
Prepare the test data
/tmp/sample.csv
10.1.2.3,US,android,createNote
10.200.88.99,FR,windows,updateNote
10.1.2.3,US,android,updateNote
10.200.88.77,FR,ios,createNote
10.1.4.5,US,windows,updateTag
Load the test data into sensitive
create database sensitive;
create table sensitive.events (ip STRING, country STRING, client STRING, action STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
load data local inpath '/tmp/sample.csv' overwrite into table sensitive.events;
Load the test data into filtered
create database filtered;
create table filtered.events as select country, client, action from sensitive.events;
create table filtered.events_usonly as select * from filtered.events where country = 'US';
0: jdbc:hive2://172.31.217.155:10000> create database filtered;
INFO : Compiling command(queryId=hive_20170314200303_b3042205-49f1-475b-972c-9255e0faf9c0): create database filtered
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20170314200303_b3042205-49f1-475b-972c-9255e0faf9c0); Time taken: 0.083 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20170314200303_b3042205-49f1-475b-972c-9255e0faf9c0): create database filtered
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20170314200303_b3042205-49f1-475b-972c-9255e0faf9c0); Time taken: 0.029 seconds
INFO : OK
No rows affected (0.124 seconds)
0: jdbc:hive2://172.31.217.155:10000> create table filtered.events as select country, client, action from sensitive.events;
INFO : Compiling command(queryId=hive_20170314200303_4fc58355-18bb-4f2b-87de-a894cf8688ec): create table filtered.events as select country, client, action from sensitive.events
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:country, type:string, comment:null), FieldSchema(name:client, type:string, comment:null), FieldSchema(name:action, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20170314200303_4fc58355-18bb-4f2b-87de-a894cf8688ec); Time taken: 0.137 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20170314200303_4fc58355-18bb-4f2b-87de-a894cf8688ec): create table filtered.events as select country, client, action from sensitive.events
INFO : Query ID = hive_20170314200303_4fc58355-18bb-4f2b-87de-a894cf8688ec
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : Starting Spark Job = ccae0717-542b-4827-a411-800bc660910b
INFO : =====Spark Job[ccae0717-542b-4827-a411-800bc660910b] statistics=====
INFO : HIVE
INFO : RECORDS_OUT_1_filtered.events: 5
INFO : CREATED_FILES: 1
INFO : RECORDS_IN: 5
INFO : DESERIALIZE_ERRORS: 0
INFO : Spark Job[ccae0717-542b-4827-a411-800bc660910b] Metrics
INFO : ExecutorDeserializeTime: 1558
INFO : ExecutorRunTime: 1149
INFO : ResultSize: 2077
INFO : JvmGCTime: 181
INFO : ResultSerializationTime: 1
INFO : MemoryBytesSpilled: 0
INFO : DiskBytesSpilled: 0
INFO : BytesRead: 3259
INFO : Execution completed successfully
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Moving data to: hdfs://dev-dalu:8020/user/hive/warehouse/filtered.db/events from hdfs://dev-dalu:8020/user/hive/warehouse/filtered.db/.hive-staging_hive_2017-03-14_20-03-52_904_3142690968931776036-2/-ext-10001
INFO : Starting task [Stage-3:DDL] in serial mode
INFO : Starting task [Stage-2:STATS] in serial mode
INFO : Table filtered.events stats: [numFiles=1, numRows=5, totalSize=105, rawDataSize=100]
INFO : Completed executing command(queryId=hive_20170314200303_4fc58355-18bb-4f2b-87de-a894cf8688ec); Time taken: 19.663 seconds
INFO : OK
No rows affected (19.809 seconds)
0: jdbc:hive2://172.31.217.155:10000> create table filtered.events_usonly as select * from filtered.events where country = 'US';
INFO : Compiling command(queryId=hive_20170314200404_31d0b74a-9bd8-45e7-b24e-ebd5b19c10cf): create table filtered.events_usonly as select * from filtered.events where country = 'US'
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:events.country, type:string, comment:null), FieldSchema(name:events.client, type:string, comment:null), FieldSchema(name:events.action, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20170314200404_31d0b74a-9bd8-45e7-b24e-ebd5b19c10cf); Time taken: 0.133 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20170314200404_31d0b74a-9bd8-45e7-b24e-ebd5b19c10cf): create table filtered.events_usonly as select * from filtered.events where country = 'US'
INFO : Query ID = hive_20170314200404_31d0b74a-9bd8-45e7-b24e-ebd5b19c10cf
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : Starting Spark Job = c14fb371-f3ff-43a0-bd47-28c369edd92c
INFO : =====Spark Job[c14fb371-f3ff-43a0-bd47-28c369edd92c] statistics=====
INFO : HIVE
INFO : CREATED_FILES: 1
INFO : RECORDS_OUT_1_filtered.events_usonly: 3
INFO : RECORDS_IN: 5
INFO : DESERIALIZE_ERRORS: 0
INFO : Spark Job[c14fb371-f3ff-43a0-bd47-28c369edd92c] Metrics
INFO : ExecutorDeserializeTime: 1589
INFO : ExecutorRunTime: 1488
INFO : ResultSize: 2095
INFO : JvmGCTime: 194
INFO : ResultSerializationTime: 1
INFO : MemoryBytesSpilled: 0
INFO : DiskBytesSpilled: 0
INFO : BytesRead: 3480
INFO : Execution completed successfully
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Moving data to: hdfs://dev-dalu:8020/user/hive/warehouse/filtered.db/events_usonly from hdfs://dev-dalu:8020/user/hive/warehouse/filtered.db/.hive-staging_hive_2017-03-14_20-04-21_023_473130083580993490-2/-ext-10001
INFO : Starting task [Stage-3:DDL] in serial mode
INFO : Starting task [Stage-2:STATS] in serial mode
INFO : Table filtered.events_usonly stats: [numFiles=1, numRows=3, totalSize=65, rawDataSize=62]
INFO : Completed executing command(queryId=hive_20170314200404_31d0b74a-9bd8-45e7-b24e-ebd5b19c10cf); Time taken: 4.362 seconds
INFO : OK
No rows affected (4.507 seconds)
Note: why does loading data into filtered produce so much output? Hive here runs on the Spark execution engine, and the data is copied over from sensitive, so the SQL is translated into a Spark job.
Create the test role
create role test_role;
GRANT ALL ON DATABASE filtered TO ROLE test_role;
use sensitive;
GRANT SELECT(ip) on TABLE sensitive.events TO ROLE test_role;
GRANT ROLE test_role TO GROUP test;
Privileges granted to the test role
Note:
admin_role has administrator privileges, can read and write all databases, and is granted to the admin and hive groups (these correspond to the OS groups);
test_role can read and write only the filtered database, can read only the ip column of the events table in the sensitive database, and is granted to the test group.
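A hedged way to double-check what was actually granted, from an admin session (SHOW GRANT ROLE and SHOW ROLE GRANT GROUP are standard Sentry statements):
beeline -u "jdbc:hive2://172.31.217.155:10000" -n admin -p admin -e "SHOW GRANT ROLE test_role;"
beeline -u "jdbc:hive2://172.31.217.155:10000" -n admin -p admin -e "SHOW ROLE GRANT GROUP test;"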
Check as the admin user
Log in
beeline -u "jdbc:hive2://172.31.217.155:10000" -n admin -p admin
Current roles
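Inside that session, the check is the same as before:
show current roles;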
Check as the test user
Log in
beeline -u "jdbc:hive2://172.31.217.155:10000" -n test -p test
Current roles
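Again, inside the session:
show current roles;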
Queries
admin queries filtered
beeline -u "jdbc:hive2://172.31.217.155:10000" -n admin -p admin
>use filtered;
>select * from events;
test queries sensitive (select * should be rejected by Sentry, while select ip succeeds, because test_role only holds SELECT on the ip column)
beeline -u "jdbc:hive2://172.31.217.155:10000" -n test -p test
>use sensitive;
>select * from events;
>select ip from events;
The test user tries to grant itself privileges (this should be rejected, since the test group is not in sentry.service.admin.group)
GRANT ROLE admin_role TO GROUP test;