HDFS and Hive Migration

Source: Hadoop 2.7.5
Target: Hadoop 2.6.5

hadoop distcp

DistCp Version 2 Guide
Large inter/intra-cluster copying: DistCp Version 2 (distributed copy) is a tool used for large inter/intra-cluster copying.
Distributed, fault-tolerant MapReduce: It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
Note: watch out for file permissions, which can cause unnecessary access exceptions; this article ignores them.
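Both clusters here run Hadoop 2.x, so distcp can address each side over plain hdfs://. When the two versions are far enough apart to be RPC-incompatible (e.g. 1.x to 2.x), the guide recommends running the job on the destination cluster and reading the source over webhdfs:// instead; a sketch with placeholder hosts:

$ hadoop distcp webhdfs://<source-namenode>:50070/src/path hdfs://<target-namenode>:9000/dst/path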

$ hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                Reuse existing data in target files and append new
                        data to them if possible
 -async                 Should distcp execution be blocking
 -atomic                Commit all changes or none
 -bandwidth <arg>       Specify bandwidth per map in MB
 -delete                Delete from target, files missing in source
 -diff <arg>            Use snapshot diff report to identify the
                        difference between source and target
 -f <arg>               List of files that need to be copied
 -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
 -i                     Ignore failures during copy
 -log <arg>             Folder on DFS where distcp execution logs are
                        saved
 -m <arg>               Max number of concurrent maps to use for copy
 -mapredSslConf <arg>   Configuration for ssl config file, to use with
                        hftps://. Must be in the classpath.
 -overwrite             Choose to overwrite target files unconditionally,
                        even if they exist.
 -p <arg>               preserve status (rbugpcaxt)(replication,
                        block-size, user, group, permission,
                        checksum-type, ACL, XATTR, timestamps). If -p is
                        specified with no <arg>, then preserves
                        replication, block size, user, group, permission,
                        checksum type and timestamps. raw.* xattrs are
                        preserved when both the source and destination
                        paths are in the /.reserved/raw hierarchy (HDFS
                        only). raw.* xattr preservation is independent of
                        the -p flag. Refer to the DistCp documentation for
                        more details.
 -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
                        bytes
 -skipcrccheck          Whether to skip CRC checks between source and
                        target paths.
 -strategy <arg>        Copy strategy to use. Default is dividing work
                        based on file sizes
 -tmp <arg>             Intermediate work path to be used for atomic
                        commit
 -update                Update target, copying only missing files or
                        directories

Basic usage

DistCp Version 2 Guide
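From the guide, the most basic invocation expands one namespace tree into a file list and distributes the copy among map tasks; flags from the listing above, such as -update, -p and -m, can be layered on. A sketch (hosts and paths are placeholders):

$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
$ hadoop distcp -update -p ugp -m 20 hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo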

HDFS migration example

$ hadoop fs -ls hdfs://127.0.0.1:9000/hdfsapi/test
Found 1 items
-rw-r--r-- 1 shaozhipeng supergroup 538 2018-05-08 20:19 hdfs://127.0.0.1:9000/hdfsapi/test/readme.txt


$ hadoop fs -ls hdfs://192.168.99.100:9000/test
Found 1 items
-rw-r--r-- 2 hadoop supergroup 48 2016-12-22 11:24 hdfs://192.168.99.100:9000/test/hdfs.test


$ hadoop distcp hdfs://127.0.0.1:9000/hdfsapi/test hdfs://192.168.99.100:9000/test

$ hadoop fs -ls hdfs://192.168.99.100:9000/test
Found 2 items
-rw-r--r-- 2 hadoop supergroup 48 2016-12-22 11:24 hdfs://192.168.99.100:9000/test/hdfs.test
drwxr-xr-x - shaozhipeng supergroup 0 2018-05-08 20:26 hdfs://192.168.99.100:9000/test/test
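Note that the copy landed at /test/test: when the target directory already exists, distcp creates the source directory under it rather than merging the contents. Per the DistCp guide, -update (or -overwrite) changes this and copies the source's contents into the target; a sketch:

$ hadoop distcp -update hdfs://127.0.0.1:9000/hdfsapi/test hdfs://192.168.99.100:9000/test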

Hive table migration example

Approach: re-run the create-table DDL on the target, then add the partitions (dynamic partitioning also works; see the sketch after the session below).
If the warehouse layout is exactly the same on both clusters, you can instead copy the old metastore data directly.
Running hive insert into table … prints: WARNING: Hive-on-MR is deprecated in Hive 2 … Consider using a different execution engine (i.e. spark, tez).
Insert two rows into the partitioned table:

$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-2.3.2-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/usr/local/apache-hive-2.3.2-bin/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
Time taken: 4.675 seconds, Fetched: 1 row(s)
hive> show tables;
OK
src
test
Time taken: 0.061 seconds, Fetched: 2 row(s)
hive> select * from src limit 10;
OK
238 val_238
86 val_86
311 val_311
27 val_27
165 val_165
409 val_409
255 val_255
278 val_278
98 val_98
484 val_484
Time taken: 1.932 seconds, Fetched: 10 row(s)
hive> desc test;
OK
daystr string ????
datestr string

# Partition Information
# col_name data_type comment

datestr string
Time taken: 0.384 seconds, Fetched: 7 row(s)
hive> insert into table test partition(datestr='2018-05-07') select '2018-05-07' as daystr from src limit 1;
hive> insert into table test partition(datestr='2018-05-08') select '2018-05-08' as daystr from src limit 1;
hive> select * from test;
OK
2018-05-07 2018-05-07
2018-05-08 2018-05-08
Time taken: 0.168 seconds, Fetched: 2 row(s)
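The two inserts above use static partition specs. With dynamic partitioning, the partition value can come from the select itself; a sketch following the same pattern (nonstrict mode is required when no static partition value is given, and the date value here is hypothetical):

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into table test partition(datestr) select '2018-05-09' as daystr, '2018-05-09' as datestr from src limit 1;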

The goal is to copy the data under the old warehouse into the new warehouse directory.
Since the target directory already contains files, be very careful NOT to pass -overwrite; the default behavior is what you want.

hadoop distcp hdfs://127.0.0.1:9000/user/hive/warehouse/test hdfs://192.168.99.100:9000/usr/hive/warehouse/
hadoop distcp hdfs://127.0.0.1:9000/user/hive/warehouse/src hdfs://192.168.99.100:9000/usr/hive/warehouse/

$ hadoop fs -ls hdfs://192.168.99.100:9000/usr/hive/warehouse/test
Found 2 items
drwxr-xr-x - shaozhipeng supergroup 0 2018-05-08 21:26 hdfs://192.168.99.100:9000/usr/hive/warehouse/test/datestr=2018-05-07
drwxr-xr-x - shaozhipeng supergroup 0 2018-05-08 21:26 hdfs://192.168.99.100:9000/usr/hive/warehouse/test/datestr=2018-05-08

$ hadoop fs -ls hdfs://192.168.99.100:9000/usr/hive/warehouse/src
Found 2 items
-rw-r--r-- 1 shaozhipeng supergroup 5812 2018-05-08 21:28 hdfs://192.168.99.100:9000/usr/hive/warehouse/src/kv1.txt
-rw-r--r-- 1 shaozhipeng supergroup 5811 2018-05-08 21:28 hdfs://192.168.99.100:9000/usr/hive/warehouse/src/kv1_copy_1.txt
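If the copy needs to be re-run later (say, after new partitions land on the source), -update copies only the files missing from the target. Note that with -update the contents of the source directory are copied into the target rather than the directory itself, so name the table directory explicitly; a sketch:

$ hadoop distcp -update hdfs://127.0.0.1:9000/user/hive/warehouse/test hdfs://192.168.99.100:9000/usr/hive/warehouse/test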

Log in to 192.168.99.100 and access the data with Hive:

hive> select * from src limit 10;
FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'src'

This fails because distcp only moved the HDFS files; the target metastore knows nothing about them, so the tables must be re-created.
Create src and test, add the partitions, query the data, and the migration is complete:

hive> create table src (key int , value string);
OK
Time taken: 1.457 seconds
hive> select * from src limit 10;
OK
238 val_238
86 val_86
311 val_311
27 val_27
165 val_165
409 val_409
255 val_255
278 val_278
98 val_98
484 val_484
Time taken: 1.427 seconds, Fetched: 10 row(s)
hive> create table test (daystr string) partitioned by (datestr string);
OK
Time taken: 0.063 seconds
hive> select * from test;
OK
Time taken: 0.178 seconds
hive> show partitions test;
OK
Time taken: 0.138 seconds
hive> alter table test add partition (datestr='2018-05-07');
OK
Time taken: 0.217 seconds
hive> alter table test add partition (datestr='2018-05-08');
OK
Time taken: 0.098 seconds
hive> show partitions test;
OK
datestr=2018-05-07
datestr=2018-05-08
Time taken: 0.094 seconds, Fetched: 2 row(s)
hive> select * from test;
OK
2018-05-07 2018-05-07
2018-05-08 2018-05-08
Time taken: 0.12 seconds, Fetched: 2 row(s)
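Instead of adding every partition by hand with alter table … add partition, Hive can also discover the partition directories that distcp copied in; a sketch using MSCK REPAIR TABLE on the same table:

hive> msck repair table test;
hive> show partitions test;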