The four file formats when importing into Hadoop with Sqoop
* Until now, when moving data from Oracle into Hadoop, I used Flume because the data had to be loaded in real time; batch-style jobs never really needed Flume.
On my current project I used Sqoop to load the data into Hadoop.
Sqoop posed no real difficulty: you transfer data with a single command from the shell, which I personally found very convenient. I just don't know yet whether it can be customized the way Flume can.
The source is an ordinary heap table in Oracle.
After creating a test table, I inserted 1,000 rows of random sample data:
CREATE TABLE HDFS_4
(
ID VARCHAR(100),
NUM NUMBER(10),
TEST VARCHAR(100),
REG_DT DATE DEFAULT SYSDATE
);
INSERT INTO HDFS_4
SELECT 'USR_'||Dbms_Random.string('A',5),
Trunc(Dbms_Random.Value(10000,90000)),
Dbms_Random.string('A',100),
SYSDATE + To_Number(Dbms_Random.Value(1,30))
FROM DUAL
CONNECT BY LEVEL <= 1000;
1. Plain text
Command executed:
sqoop import --target-dir=/dev/test2_text --query='select *from HDFS_4 where $CONDITIONS' -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
Execution result:
[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_text --query='select *from HDFS_4 where $CONDITIONS' -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
16/08/10 09:09:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258
16/08/10 09:09:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/08/10 09:09:28 INFO manager.SqlManager: Using default fetchSize of 1000
16/08/10 09:09:28 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/08/10 09:09:29 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:09:29 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where (1 = 0)
16/08/10 09:09:29 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where (1 = 0)
16/08/10 09:09:30 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce
Note: /tmp/sqoop-hdfs/compile/284d77af5917fa2113d961ae72341cc4/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/10 09:09:31 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/284d77af5917fa2113d961ae72341cc4/QueryResult.jar
16/08/10 09:09:31 INFO mapreduce.ImportJobBase: Beginning query import.
16/08/10 09:09:33 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/
16/08/10 09:09:33 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050
16/08/10 09:09:35 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/10 09:09:35 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM (select *from HDFS_4 where (1 = 1) ) t1
16/08/10 09:09:35 WARN db.TextSplitter: Generating splits for a textual index column.
16/08/10 09:09:35 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/08/10 09:09:35 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/08/10 09:09:35 INFO mapreduce.JobSubmitter: number of splits:4
16/08/10 09:09:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0007
16/08/10 09:09:35 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0007
16/08/10 09:09:35 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0007/
16/08/10 09:09:35 INFO mapreduce.Job: Running job: job_1470728284233_0007
16/08/10 09:09:45 INFO mapreduce.Job: Job job_1470728284233_0007 running in uber mode : false
16/08/10 09:09:45 INFO mapreduce.Job: map 0% reduce 0%
16/08/10 09:09:53 INFO mapreduce.Job: map 25% reduce 0%
16/08/10 09:09:55 INFO mapreduce.Job: map 75% reduce 0%
16/08/10 09:09:56 INFO mapreduce.Job: map 100% reduce 0%
16/08/10 09:09:56 INFO mapreduce.Job: Job job_1470728284233_0007 completed successfully
16/08/10 09:09:56 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=606092
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=489
HDFS: Number of bytes written=139000
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=57580
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=28790
Total vcore-seconds taken by all map tasks=28790
Total megabyte-seconds taken by all map tasks=44221440
Map-Reduce Framework
Map input records=1000
Map output records=1000
Input split bytes=489
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=2674
CPU time spent (ms)=38860
Physical memory (bytes) snapshot=1292296192
Virtual memory (bytes) snapshot=13235789824
Total committed heap usage (bytes)=689438720
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=139000
16/08/10 09:09:56 INFO mapreduce.ImportJobBase: Transferred 135.7422 KB in 23.6533 seconds (5.7388 KB/sec)
16/08/10 09:09:56 INFO mapreduce.ImportJobBase: Retrieved 1000 records.
[hdfs@amb2 ~]$
Viewing the output with cat, the contents are fully readable.
Columns are separated by commas; the data is stored in the same form as a CSV file.
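To spot-check the import from the shell, you can list the output directory and cat one of the part files (a minimal sketch; the part file name assumes Sqoop's default naming):
# list the files Sqoop wrote, then peek at the first rows
hdfs dfs -ls /dev/test2_text
hdfs dfs -cat /dev/test2_text/part-m-00000 | head -5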
2. Avro
Command executed:
sqoop import --target-dir=/dev/test2_avro --query='select *from HDFS_4 where $CONDITIONS' --as-avrodatafile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
Execution result:
[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_avro --query='select *from HDFS_4 where $CONDITIONS' --as-avrodatafile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
16/08/10 09:15:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258
16/08/10 09:15:04 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/08/10 09:15:04 INFO manager.SqlManager: Using default fetchSize of 1000
16/08/10 09:15:04 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/08/10 09:15:05 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:15:05 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where (1 = 0)
16/08/10 09:15:05 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where (1 = 0)
16/08/10 09:15:05 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce
Note: /tmp/sqoop-hdfs/compile/837dbbb2e304900b1151d2fa6186b0b7/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/10 09:15:07 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/837dbbb2e304900b1151d2fa6186b0b7/QueryResult.jar
16/08/10 09:15:07 INFO mapreduce.ImportJobBase: Beginning query import.
16/08/10 09:15:08 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:15:08 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where (1 = 0)
16/08/10 09:15:08 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where (1 = 0)
16/08/10 09:15:08 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-hdfs/compile/837dbbb2e304900b1151d2fa6186b0b7/QueryResult.avsc
16/08/10 09:15:08 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/
16/08/10 09:15:08 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050
16/08/10 09:15:10 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/10 09:15:10 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM (select *from HDFS_4 where (1 = 1) ) t1
16/08/10 09:15:10 WARN db.TextSplitter: Generating splits for a textual index column.
16/08/10 09:15:10 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/08/10 09:15:10 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/08/10 09:15:10 INFO mapreduce.JobSubmitter: number of splits:4
16/08/10 09:15:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0008
16/08/10 09:15:11 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0008
16/08/10 09:15:11 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0008/
16/08/10 09:15:11 INFO mapreduce.Job: Running job: job_1470728284233_0008
16/08/10 09:15:18 INFO mapreduce.Job: Job job_1470728284233_0008 running in uber mode : false
16/08/10 09:15:18 INFO mapreduce.Job: map 0% reduce 0%
16/08/10 09:15:26 INFO mapreduce.Job: map 25% reduce 0%
16/08/10 09:15:27 INFO mapreduce.Job: map 100% reduce 0%
16/08/10 09:15:28 INFO mapreduce.Job: Job job_1470728284233_0008 completed successfully
16/08/10 09:15:28 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=607148
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=489
HDFS: Number of bytes written=130230
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=49984
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=24992
Total vcore-seconds taken by all map tasks=24992
Total megabyte-seconds taken by all map tasks=38387712
Map-Reduce Framework
Map input records=1000
Map output records=1000
Input split bytes=489
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=1696
CPU time spent (ms)=38110
Physical memory (bytes) snapshot=1327677440
Virtual memory (bytes) snapshot=13322395648
Total committed heap usage (bytes)=721420288
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=130230
16/08/10 09:15:28 INFO mapreduce.ImportJobBase: Transferred 127.1777 KB in 20.4259 seconds (6.2263 KB/sec)
16/08/10 09:15:28 INFO mapreduce.ImportJobBase: Retrieved 1000 records.
[hdfs@amb2 ~]$
Viewing the output with cat, the field boundaries look slightly garbled: Avro data files are a binary container format, so plain cat cannot render them cleanly.
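To read the records back as JSON, the avro-tools jar can be used after copying a part file locally (a hedged sketch; the avro-tools version and the part file name are assumptions, not from the original test):
# fetch one Avro part file, then dump its records as JSON
hdfs dfs -get /dev/test2_avro/part-m-00000.avro .
java -jar avro-tools-1.7.7.jar tojson part-m-00000.avro | head -5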
3. SequenceFile
Command executed:
sqoop import --target-dir=/dev/test2_seq --query='select *from HDFS_4 where $CONDITIONS' --as-sequencefile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
Execution result:
[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_seq --query='select *from HDFS_4 where $CONDITIONS' --as-sequencefile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
16/08/10 09:22:31 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258
16/08/10 09:22:31 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/08/10 09:22:31 INFO manager.SqlManager: Using default fetchSize of 1000
16/08/10 09:22:31 INFO tool.CodeGenTool: Beginning code generation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/08/10 09:22:32 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:22:32 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where (1 = 0)
16/08/10 09:22:32 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where (1 = 0)
16/08/10 09:22:32 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce
Note: /tmp/sqoop-hdfs/compile/3a0d3133347de2dccea51b1c74f948bd/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/10 09:22:34 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/3a0d3133347de2dccea51b1c74f948bd/QueryResult.jar
16/08/10 09:22:34 INFO mapreduce.ImportJobBase: Beginning query import.
16/08/10 09:22:36 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/
16/08/10 09:22:36 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050
16/08/10 09:22:38 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/10 09:22:38 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM (select *from HDFS_4 where (1 = 1) ) t1
16/08/10 09:22:38 WARN db.TextSplitter: Generating splits for a textual index column.
16/08/10 09:22:38 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/08/10 09:22:38 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/08/10 09:22:38 INFO mapreduce.JobSubmitter: number of splits:4
16/08/10 09:22:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0009
16/08/10 09:22:38 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0009
16/08/10 09:22:39 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0009/
16/08/10 09:22:39 INFO mapreduce.Job: Running job: job_1470728284233_0009
16/08/10 09:22:47 INFO mapreduce.Job: Job job_1470728284233_0009 running in uber mode : false
16/08/10 09:22:47 INFO mapreduce.Job: map 0% reduce 0%
16/08/10 09:22:54 INFO mapreduce.Job: map 25% reduce 0%
16/08/10 09:22:55 INFO mapreduce.Job: map 100% reduce 0%
16/08/10 09:22:56 INFO mapreduce.Job: Job job_1470728284233_0009 completed successfully
16/08/10 09:22:56 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=605572
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=489
HDFS: Number of bytes written=157788
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=8
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=48586
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=24293
Total vcore-seconds taken by all map tasks=24293
Total megabyte-seconds taken by all map tasks=37314048
Map-Reduce Framework
Map input records=1000
Map output records=1000
Input split bytes=489
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=765
CPU time spent (ms)=27050
Physical memory (bytes) snapshot=1312333824
Virtual memory (bytes) snapshot=13253160960
Total committed heap usage (bytes)=707788800
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=157788
16/08/10 09:22:56 INFO mapreduce.ImportJobBase: Transferred 154.0898 KB in 20.6491 seconds (7.4623 KB/sec)
16/08/10 09:22:56 INFO mapreduce.ImportJobBase: Retrieved 1000 records.
[hdfs@amb2 ~]$
Viewing the output with cat, the file announces at the very top that it is a SequenceFile (the SEQ header), followed by what looks like slightly garbled characters.
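Instead of cat, hdfs dfs -text understands SequenceFiles and decodes the records to readable text (a minimal sketch, assuming the default part file name):
# -text auto-detects the SequenceFile format and prints records as text
hdfs dfs -text /dev/test2_seq/part-m-00000 | head -5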
4. Parquet (par-ket? par-kay?)
Command executed:
* The Sqoop version under test is 1.4.6. Notice that the command differs from the ones above: the previous imports all used --query, but this one specifies --table instead.
I changed it because of a known bug; searching at test time turned up a patch file, but since I installed the whole stack through Ambari, hunting down and editing the files one by one is a hassle, so I will just wait for version 1.4.7. Heh;
sqoop import --target-dir=/dev/test2_pq --table HDFS_4 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
Below is the error from running with a query: a Java NullPointerException is thrown.
[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_pq --query='select *from HDFS_4 where $CONDITIONS' --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
16/08/10 09:29:29 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258
16/08/10 09:29:29 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/08/10 09:29:29 INFO manager.SqlManager: Using default fetchSize of 1000
16/08/10 09:29:29 INFO tool.CodeGenTool: Beginning code generation
16/08/10 09:29:29 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:97)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
at org.apache.sqoop.Sqoop.main(Sqoop.java:244)
[hdfs@amb2 ~]$
Execution result:
Running the table import fails partway through with the error below.
Damn~ at a glance it looks like a data type problem (the DATE value does not fit the Avro union ["null","long"]), so I will create a view over the original table that converts just that data type and import the view instead.
16/08/10 09:31:25 INFO mapreduce.Job: Task Id : attempt_1470728284233_0010_m_000002_1, Status : FAILED
Error: org.apache.avro.UnresolvedUnionException: Not in union ["null","long"]: 2016-08-29 21:09:28.0
at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:561)
at org.apache.avro.generic.GenericData.deepCopy(GenericData.java:941)
at org.apache.avro.generic.GenericData.deepCopy(GenericData.java:922)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.copy(DatasetKeyOutputFormat.java:327)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.write(DatasetKeyOutputFormat.java:321)
at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.write(DatasetKeyOutputFormat.java:300)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:658)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.sqoop.mapreduce.ParquetImportMapper.map(ParquetImportMapper.java:70)
at org.apache.sqoop.mapreduce.ParquetImportMapper.map(ParquetImportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
View created to work around the error above, then re-run:
CREATE OR REPLACE VIEW VW_HDFS_4
AS
SELECT ID,NUM,TEST,To_Char(REG_DT,'YYYY-MM-DD HH24:MI:SS') REG_DT
FROM HDFS_4;
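As an alternative to creating a view, Sqoop's --map-column-java option overrides the Java type generated for a column; mapping the DATE column to String should sidestep the same Avro union error. This is an assumption I did not test here, sketched below:
sqoop import --target-dir=/dev/test2_pq_alt --table HDFS_4 --as-parquetfile --map-column-java REG_DT=String --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID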
Execution result: of the two runs in the transcript below, the first errors out and the second completes normally.
The error happens because the earlier failed attempt had already created the target directory and its .metadata descriptor in Hadoop, so Sqoop reports that the dataset already exists. In the transcript I simply switched to a new target directory (/dev/test2_pq2); removing the leftover directory first would have let me reuse the original path, as sketched below.
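A minimal cleanup sketch (make sure the path really is disposable before deleting):
# remove the partial output, including the hidden .metadata descriptor
hdfs dfs -rm -r /dev/test2_pq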
[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_pq --table VW_HDFS_4 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
16/08/10 09:36:33 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258
16/08/10 09:36:33 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/08/10 09:36:33 INFO manager.SqlManager: Using default fetchSize of 1000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/08/10 09:36:34 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop will not process this sqoop connection, as "FLASHONE"."VW_HDFS_4" is not an Oracle table, it's a VIEW.
16/08/10 09:36:34 INFO tool.CodeGenTool: Beginning code generation
16/08/10 09:36:34 INFO tool.CodeGenTool: Will generate java class as codegen_VW_HDFS_4
16/08/10 09:36:35 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:36:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0
16/08/10 09:36:35 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce
Note: /tmp/sqoop-hdfs/compile/95f85eece904f411d74c356ac450d5b5/codegen_VW_HDFS_4.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/10 09:36:36 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/95f85eece904f411d74c356ac450d5b5/codegen_VW_HDFS_4.jar
16/08/10 09:36:36 INFO mapreduce.ImportJobBase: Beginning import of VW_HDFS_4
16/08/10 09:36:37 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:36:38 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:36:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0
16/08/10 09:36:38 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetExistsException: Descriptor directory already exists: hdfs://amb2.local:8020/dev/test2_pq/.metadata
org.kitesdk.data.DatasetExistsException: Descriptor directory already exists: hdfs://amb2.local:8020/dev/test2_pq/.metadata
at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.create(FileSystemMetadataProvider.java:219)
at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:137)
at org.kitesdk.data.Datasets.create(Datasets.java:239)
at org.kitesdk.data.Datasets.create(Datasets.java:307)
at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:107)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:89)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:108)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
at org.apache.sqoop.manager.OracleManager.importTable(OracleManager.java:445)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
at org.apache.sqoop.Sqoop.main(Sqoop.java:244)
[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_pq2 --table VW_HDFS_4 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID
16/08/10 09:36:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258
16/08/10 09:36:56 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/08/10 09:36:56 INFO manager.SqlManager: Using default fetchSize of 1000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/08/10 09:36:57 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop will not process this sqoop connection, as "FLASHONE"."VW_HDFS_4" is not an Oracle table, it's a VIEW.
16/08/10 09:36:57 INFO tool.CodeGenTool: Beginning code generation
16/08/10 09:36:57 INFO tool.CodeGenTool: Will generate java class as codegen_VW_HDFS_4
16/08/10 09:36:57 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:36:57 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0
16/08/10 09:36:57 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce
Note: /tmp/sqoop-hdfs/compile/586ec1c10f04fc4c1fdd5986c9fae96b/codegen_VW_HDFS_4.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/08/10 09:36:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/586ec1c10f04fc4c1fdd5986c9fae96b/codegen_VW_HDFS_4.jar
16/08/10 09:36:59 INFO mapreduce.ImportJobBase: Beginning import of VW_HDFS_4
16/08/10 09:36:59 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:37:00 INFO manager.OracleManager: Time zone has been set to GMT
16/08/10 09:37:00 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0
16/08/10 09:37:02 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/
16/08/10 09:37:02 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050
16/08/10 09:37:04 INFO db.DBInputFormat: Using read commited transaction isolation
16/08/10 09:37:04 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN("ID"), MAX("ID") FROM "VW_HDFS_4"
16/08/10 09:37:04 WARN db.TextSplitter: Generating splits for a textual index column.
16/08/10 09:37:04 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/08/10 09:37:04 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/08/10 09:37:04 INFO mapreduce.JobSubmitter: number of splits:4
16/08/10 09:37:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0011
16/08/10 09:37:04 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0011
16/08/10 09:37:04 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0011/
16/08/10 09:37:04 INFO mapreduce.Job: Running job: job_1470728284233_0011
16/08/10 09:37:11 INFO mapreduce.Job: Job job_1470728284233_0011 running in uber mode : false
16/08/10 09:37:11 INFO mapreduce.Job: map 0% reduce 0%
16/08/10 09:37:20 INFO mapreduce.Job: map 25% reduce 0%
16/08/10 09:37:21 INFO mapreduce.Job: map 100% reduce 0%
16/08/10 09:37:22 INFO mapreduce.Job: Job job_1470728284233_0011 completed successfully
16/08/10 09:37:22 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=610560
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=28037
HDFS: Number of bytes written=137608
HDFS: Number of read operations=200
HDFS: Number of large read operations=0
HDFS: Number of write operations=36
Job Counters
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=60614
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=30307
Total vcore-seconds taken by all map tasks=30307
Total megabyte-seconds taken by all map tasks=46551552
Map-Reduce Framework
Map input records=1000
Map output records=1000
Input split bytes=505
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=1345
CPU time spent (ms)=41390
Physical memory (bytes) snapshot=1502236672
Virtual memory (bytes) snapshot=13242277888
Total committed heap usage (bytes)=828375040
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/08/10 09:37:22 INFO mapreduce.ImportJobBase: Transferred 134.3828 KB in 21.319 seconds (6.3034 KB/sec)
16/08/10 09:37:22 INFO mapreduce.ImportJobBase: Retrieved 1000 records.
[hdfs@amb2 ~]$
Viewing the file with cat: the first screenshot shows the start of the file and the second shows the end.
The column type information is also written there, and looking a bit further, text reading ~avro.schema appears.
As the search results suggested, Parquet appears to store its schema in Avro form.
For this project we plan to use text (CSV) and Parquet.
Now I have to check whether the other tools can read the data back properly once it is stored as Parquet. ㅠ_ㅜ;
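To verify readability outside Sqoop, the parquet-tools jar can print the embedded schema and the first records (a hedged sketch; the jar version and the <part-file> placeholder are assumptions, not part of this test):
# print the schema stored in the footer, then the first few records
hadoop jar parquet-tools-1.6.0.jar schema /dev/test2_pq2/<part-file>.parquet
hadoop jar parquet-tools-1.6.0.jar head -n 5 /dev/test2_pq2/<part-file>.parquet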