kafka -> spark -> hive

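This post walks through consuming data from Kafka and writing it into Hive.

Installing and configuring all the components

The order of the steps matters.

Every stage below also needs environment variables. They are not listed step by step; they are collected at the end of the post, so don't forget to add them.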
  • kafka
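Installing and starting Kafka is fairly simple; just follow the quickstart in the official documentation:
1. Download
2. Start ZooKeeper (Kafka relies on ZooKeeper for coordination)
If you hit a permission error, run sudo chown -R paul:paul kafka_2.11-2.1.0
3. Start the Kafka server (this is what gives you a broker)
4. Create a topic
5. Produce data to the topic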
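The quickstart steps above can be sketched as shell functions. This is only a sketch, assuming the Kafka 2.1.0 archive has been unpacked and the functions are called from inside the kafka_2.11-2.1.0 directory. The topic name test, the single partition and the localhost ports are the quickstart defaults for that version, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch of the Kafka 2.1.0 quickstart; call these from inside kafka_2.11-2.1.0.
# KAFKA_TOPIC is an illustrative variable; override it as needed.
KAFKA_TOPIC="${KAFKA_TOPIC:-test}"

# Step 2: start ZooKeeper (keep it running in its own terminal).
start_zookeeper() {
  bin/zookeeper-server-start.sh config/zookeeper.properties
}

# Step 3: start the Kafka server, which provides the broker.
start_kafka() {
  bin/kafka-server-start.sh config/server.properties
}

# Step 4: create a topic with one partition and no replication.
create_topic() {
  bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic "$KAFKA_TOPIC"
}

# Step 5: lines typed on stdin are produced to the topic as messages.
produce_messages() {
  bin/kafka-console-producer.sh --broker-list localhost:9092 --topic "$KAFKA_TOPIC"
}
```

Run start_zookeeper and start_kafka in separate terminals, since both stay in the foreground, then call create_topic and produce_messages from a third.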
  • spark
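This part is simple too: unpack the archive and set the environment variables. Later, once Hive is configured, its configuration file has to be copied into Spark's conf directory.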
  • hive
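Hive has a very long dependency chain. One complete path through it is shown here; swapping Derby for MySQL afterwards is also possible.
Hive's metastore is important. For details, see: Hive学习之Metastore及其配置管理 (Learning Hive: the Metastore and its configuration management)
The embedded mode, that is, using Derby, cannot run several Hive sessions in parallel, and I could not find a fix for the version-mismatch error that appears when Spark uses it. The metastore is therefore stored in a local MySQL instance instead.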
    • Hadoop
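Hive needs HDFS, so install Hadoop first. For the installation, follow the official Getting Started guide.
On Ubuntu, also change the address of the hostname entry in /etc/hosts to 127.0.0.1
(Spark will then report a loopback-address error when running network stream processing; to fix that, set SPARK_LOCAL_IP to the machine's fixed IP in spark-env.sh)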
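The Spark side of this workaround might look like the following line in $SPARK_HOME/conf/spark-env.sh. The address 192.168.1.100 is a placeholder assumption; substitute the machine's actual fixed IP.

```shell
# $SPARK_HOME/conf/spark-env.sh (copy it from spark-env.sh.template if it does not exist)
# Bind Spark to the machine's fixed IP instead of the loopback address.
# 192.168.1.100 is a placeholder; replace it with your own address.
export SPARK_LOCAL_IP=192.168.1.100
```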
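Note that practically every step in the official guide has to be done, with none skipped.
These include:
1 Install ssh

2 Pseudo-Distributed Operation

3 YARN on a Single Node

After completing the steps above, make sure sbin/start-dfs.sh and sbin/start-yarn.sh have been run and the daemons are up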
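The start-up sequence at the end of those steps can be sketched as follows. It assumes HADOOP_HOME is set as in the environment variables at the end of the post and that the official guide's configuration files are already in place; the function names are illustrative only.

```shell
#!/bin/sh
# Sketch: bring up a single-node (pseudo-distributed) Hadoop.
# Assumes HADOOP_HOME points at the Hadoop installation directory.

# Run once only: formatting an existing NameNode wipes the HDFS metadata.
format_namenode() {
  "$HADOOP_HOME/bin/hdfs" namenode -format
}

# Start the HDFS daemons first, then the YARN daemons.
start_hadoop() {
  "$HADOOP_HOME/sbin/start-dfs.sh" && "$HADOOP_HOME/sbin/start-yarn.sh"
}

# List running JVMs; NameNode, DataNode, ResourceManager and NodeManager
# should be among them if both scripts succeeded.
check_hadoop() {
  jps
}
```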
If you run into permission problems, first create a group
sudo groupadd hadoop
and then
sudo chown -R paul:hadoop hadoop-XXX
    • MySQL
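For how to manage the metastore with MySQL, see: configure-mysql-metastore-for-hive
Notes:
1. Do not run mysql> SOURCE $HIVE_HOME/scripts/metastore/upgrade/mysql/hive-schema-0.14.0.mysql.sql; inside mysql.
Just complete the rest of the MySQL setup instead.
2. After configuring hive-site.xml, use bin/schematool to initialize the schema.
The initialization output later on does show Initialization script hive-schema-2.3.0.mysql.sql, so schematool applies the 2.3.0 schema script on its own.
3. Spark needs the MySQL JDBC driver later as well: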
ln -s /usr/share/java/mysql-connector-java.jar $SPARK_HOME/jars/mysql-connector-java.jar
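The "rest of the MySQL setup" could look roughly like this. It is a sketch under assumptions: the database name metastore, the user hiveuser and the password hivepassword are taken from the hive-site.xml shown later in this post, while the exact statements, the localhost-only grant and the function names are my own choices and may differ from the referenced guide.

```shell
#!/bin/sh
# Sketch: create the metastore database and the Hive user in MySQL.
# Database name, user and password match the hive-site.xml used in this post.
metastore_sql() {
  cat <<'SQL'
CREATE DATABASE IF NOT EXISTS metastore;
CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword';
GRANT ALL PRIVILEGES ON metastore.* TO 'hiveuser'@'localhost';
FLUSH PRIVILEGES;
SQL
}

# Pipe the statements into the mysql client as root (prompts for the password).
create_metastore_db() {
  metastore_sql | mysql -u root -p
}
```

Calling metastore_sql on its own prints the statements, which is a cheap way to review them before running create_metastore_db.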
    • Derby
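In principle Hive can be installed as soon as Hadoop is in place, but the setup below uses Derby for the JDBC connection to the metastore, so Derby can be installed first.
Installing Derby is simple: unpack it and set the environment variables. There are no extra configuration files and no service to start.
Just remember to create the data directory.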
    • Hive
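For the first part of the Hive installation, follow the official GettingStarted guide.
For choosing the Hive version, see: Hive+on+Spark:Getting+Started
After creating the directories with hdfs dfs and running chmod on them, there is a crucial step of configuring hive-env.sh and hive-site.xml; see: apache-hive-installation-on-ubuntu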
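The directory step can be sketched as below. It assumes HDFS is already running and the hdfs command is on the PATH. The paths /tmp and /user/hive/warehouse follow the official Hive guide and the hive.metastore.warehouse.dir value used in this post; the function name is illustrative.

```shell
#!/bin/sh
# Sketch: create the HDFS directories Hive expects and make them group-writable.
# Assumes HDFS is running and hdfs is on the PATH.
prepare_hive_dirs() {
  for dir in /tmp /user/hive/warehouse; do
    # -p creates missing parent directories and tolerates existing ones.
    hdfs dfs -mkdir -p "$dir" && hdfs dfs -chmod g+w "$dir" || return 1
  done
}
```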
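Since the setup has switched to MySQL, Derby is no longer used below.
Next, initialize the metastore schema. With Derby the command was:
bin/schematool -initSchema -dbType derby
With MySQL, use this instead:
bin/schematool -initSchema -dbType mysql
Then start hive and try the later sections of the official guide:
 DDL operations, DML operations, SQL operations
At this point you can create databases and tables, which is enough for a first taste of Hive. One important piece is still missing, though: reading and writing Hive from Spark.

The whole process above also leaned heavily on tutorialspoint.com/hive/hive_installation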
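    • Extra configuration for Spark to read and write Hive
A hive-site.xml configuration was referenced above, but there is a problem: when Spark reads or writes Hive, the schema version in the Derby metastore (2.3.0) does not match the version Spark expects (1.2.0), which produces ERROR ObjectStore: Version information found in metastore differs 2.3.0 from expected schema version 1.2.0. Schema verififcation is disabled hive.metastore.schema.verification so setting version.
The idea for a fix was to change the hive-site.xml configuration.
At that point hive-site.xml was: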
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:;databaseName=/usr/local/hive/apache-hive-2.3.4-bin/metastore_db;create=true</value>
        <description>
            JDBC connect string for a JDBC metastore.
            To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
            For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
        </description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master:9083</value>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.apache.derby.jdbc.EmbeddedDriver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.PersistenceManagerFactoryClass</name>
        <value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
        <description>class implementing the jdo persistence</description>
    </property>
</configuration>
Finally, start the service: hive --service metastore
This approach did not actually solve the problem, so the setup falls back to MySQL. The hive-site.xml should therefore be:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
        <description>
            JDBC connect string for a JDBC metastore.
            To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
            For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
        </description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value/>
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>MySQL JDBC driver class</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hiveuser</value>
        <description>user name for connecting to mysql server</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hivepassword</value>
        <description>password for connecting to mysql server</description>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
        <description>
            Enforce metastore schema version consistency.
            True: Verify that version information stored in metastore matches with one from Hive jars.  Also disable automatic
                  schema migration attempt. Users are required to manully migrate schema after Hive upgrade which ensures
                  proper metastore schema migration. (Default)
            False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
        </description>
    </property>
</configuration>

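The MySQL settings alone would not be enough: the Hive version built into the prebuilt Spark distribution differs from the Hive version actually installed, so one more property is needed to stop the schema in the metastore from being checked or modified. It is the last property in the configuration above, repeated here for reference.
Reference: 【过往记忆】Spark连接Hive的metastore异常 (an article on metastore exceptions when Spark connects to Hive)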
<property> 
   <name>hive.metastore.schema.verification</name> 
   <value>false</value> 
    <description> 
    Enforce metastore schema version consistency. 
    True: Verify that version information stored in metastore matches with one from Hive jars.  Also disable automatic 
          schema migration attempt. Users are required to manully migrate schema after Hive upgrade which ensures 
          proper metastore schema migration. (Default) 
    False: Warn if the version information stored in metastore doesn't match with one from in Hive jars. 
    </description> 
 </property>
    • Copy the configuration files into Spark's conf
The configuration files that need to be copied into Spark are:
From Hadoop: core-site.xml, hdfs-site.xml and yarn-site.xml; from Hive: hive-site.xml
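The copy can be sketched as a small function. It assumes HADOOP_CONF_DIR, HIVE_CONF_DIR and SPARK_HOME are set as in the environment variables at the end of the post; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: copy the Hadoop and Hive client configs into Spark's conf directory.
# Assumes HADOOP_CONF_DIR, HIVE_CONF_DIR and SPARK_HOME are already exported.
copy_confs_to_spark() {
  for f in core-site.xml hdfs-site.xml yarn-site.xml; do
    cp "$HADOOP_CONF_DIR/$f" "$SPARK_HOME/conf/" || return 1
  done
  cp "$HIVE_CONF_DIR/hive-site.xml" "$SPARK_HOME/conf/"
}
```

Copying, rather than symlinking, means the files must be copied again whenever the Hadoop or Hive configuration changes.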
    • Reading Hive from spark-shell
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
val spark = SparkSession.builder().appName("Spark Hive Example").config("spark.sql.warehouse.dir", "/user/hive/warehouse").enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
sql("show tables").show()
    • Appendix: environment variables
Finally, here are all the environment variables:
JAVA_HOME=/usr/lib/jvm/java-8-oracle

PATH="$PATH:$HOME/bin"
export HADOOP_HOME=/usr/local/hadoop/hadoopln
export HADOOP_MAPRED_HOME=/usr/local/hadoop/hadoopln
export HADOOP_COMMON_HOME=/usr/local/hadoop/hadoopln
export HADOOP_HDFS_HOME=/usr/local/hadoop/hadoopln
export YARN_HOME=/usr/local/hadoop/hadoopln
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

export SPARK_HOME=/usr/local/spark/sparkln
export HIVE_HOME=/usr/local/hive/hiveln
export HIVE_CONF_DIR=$HIVE_HOME/conf
export DERBY_HOME=/usr/local/derby/derbyln
export DERBY_INSTALL=/usr/local/derby/derbyln
export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin:$HIVE_HOME/bin:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*:.
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar:.

=================
The END







