본문 바로가기

OpenSource(Bigdata&distribution&Network)/Sqoop

sqopp import format 별로 저장

sqoop 으로 hadoop 으로 넣을경우 4가지 파일 포멧 리스트 


* Oracle 에 있는데이터를 Hadoop 으로 옮겨넣을때 그동안은 실시간으로 넣어야해서 flume을 썼는데 

배치성으로 도는 작업등은 flume까진 필요없었다. 

지금 들어와있는 프로젝트에서는 sqoop 을 사용 해서 데이터를 hadoop으로 넣는 작업을 진행했다. 


sqoop 은 크게 어려움은 없었으며 쉘상에서 명령어의 사용을 통해서 데이터를 전송해서 사실 개인적으로 사용하기 많이 편했다. 단지 플럼처럼 커스터마이징이 될지는 아직 모르는 상태.


원본은 

ORACLE 상에 일반 HEAP TABLE 이다. 

테스트용 테이블을 만들고나서 임시로 1,000 건의 데이터를 넣었다.


CREATE TABLE HDFS_4 

 (

  ID VARCHAR(100),

  NUM NUMBER(10),

  TEST VARCHAR(100),

  REG_DT DATE DEFAULT SYSDATE 

);


INSERT INTO HDFS_4

SELECT 'USR_'||Dbms_Random.string('A',5),

       Trunc(Dbms_Random.Value(10000,90000)),

       Dbms_Random.string('A',100),

       SYSDATE + To_Number(Dbms_Random.Value(1,30)) 

FROM DUAL

CONNECT BY LEVEL <= 1000;








1. text plain 


실행한 명령어

sqoop import --target-dir=/dev/test2_text --query='select *from HDFS_4 where $CONDITIONS' -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID


실행 결과

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_text --query='select *from HDFS_4 where $CONDITIONS' -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:09:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:09:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:09:28 INFO manager.SqlManager: Using default fetchSize of 1000

16/08/10 09:09:28 INFO tool.CodeGenTool: Beginning code generation

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:09:29 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:09:29 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:09:29 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:09:30 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/284d77af5917fa2113d961ae72341cc4/QueryResult.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:09:31 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/284d77af5917fa2113d961ae72341cc4/QueryResult.jar

16/08/10 09:09:31 INFO mapreduce.ImportJobBase: Beginning query import.

16/08/10 09:09:33 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 09:09:33 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 09:09:35 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 09:09:35 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM (select *from HDFS_4 where  (1 = 1) ) t1

16/08/10 09:09:35 WARN db.TextSplitter: Generating splits for a textual index column.

16/08/10 09:09:35 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

16/08/10 09:09:35 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

16/08/10 09:09:35 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 09:09:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0007

16/08/10 09:09:35 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0007

16/08/10 09:09:35 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0007/

16/08/10 09:09:35 INFO mapreduce.Job: Running job: job_1470728284233_0007

16/08/10 09:09:45 INFO mapreduce.Job: Job job_1470728284233_0007 running in uber mode : false

16/08/10 09:09:45 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 09:09:53 INFO mapreduce.Job:  map 25% reduce 0%

16/08/10 09:09:55 INFO mapreduce.Job:  map 75% reduce 0%

16/08/10 09:09:56 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 09:09:56 INFO mapreduce.Job: Job job_1470728284233_0007 completed successfully

16/08/10 09:09:56 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=606092

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=489

HDFS: Number of bytes written=139000

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=57580

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=28790

Total vcore-seconds taken by all map tasks=28790

Total megabyte-seconds taken by all map tasks=44221440

Map-Reduce Framework

Map input records=1000

Map output records=1000

Input split bytes=489

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=2674

CPU time spent (ms)=38860

Physical memory (bytes) snapshot=1292296192

Virtual memory (bytes) snapshot=13235789824

Total committed heap usage (bytes)=689438720

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=139000

16/08/10 09:09:56 INFO mapreduce.ImportJobBase: Transferred 135.7422 KB in 23.6533 seconds (5.7388 KB/sec)

16/08/10 09:09:56 INFO mapreduce.ImportJobBase: Retrieved 1000 records.

[hdfs@amb2 ~]$ 



cat 으로 보면 내용이 다 보인다 

컬럼간의 구분이 쉼표로 되어있는것이 보인다.

CSV랑 같은 형태로 저장



2. AVRO 


실행한 명령어

sqoop import --target-dir=/dev/test2_avro --query='select *from HDFS_4 where $CONDITIONS' --as-avrodatafile  -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID


실행 결과

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_avro --query='select *from HDFS_4 where $CONDITIONS' --as-avrodatafile  -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:15:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:15:04 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:15:04 INFO manager.SqlManager: Using default fetchSize of 1000

16/08/10 09:15:04 INFO tool.CodeGenTool: Beginning code generation

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:15:05 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:15:05 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:15:05 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:15:05 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/837dbbb2e304900b1151d2fa6186b0b7/QueryResult.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:15:07 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/837dbbb2e304900b1151d2fa6186b0b7/QueryResult.jar

16/08/10 09:15:07 INFO mapreduce.ImportJobBase: Beginning query import.

16/08/10 09:15:08 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:15:08 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:15:08 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:15:08 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-hdfs/compile/837dbbb2e304900b1151d2fa6186b0b7/QueryResult.avsc

16/08/10 09:15:08 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 09:15:08 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 09:15:10 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 09:15:10 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM (select *from HDFS_4 where  (1 = 1) ) t1

16/08/10 09:15:10 WARN db.TextSplitter: Generating splits for a textual index column.

16/08/10 09:15:10 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

16/08/10 09:15:10 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

16/08/10 09:15:10 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 09:15:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0008

16/08/10 09:15:11 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0008

16/08/10 09:15:11 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0008/

16/08/10 09:15:11 INFO mapreduce.Job: Running job: job_1470728284233_0008

16/08/10 09:15:18 INFO mapreduce.Job: Job job_1470728284233_0008 running in uber mode : false

16/08/10 09:15:18 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 09:15:26 INFO mapreduce.Job:  map 25% reduce 0%

16/08/10 09:15:27 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 09:15:28 INFO mapreduce.Job: Job job_1470728284233_0008 completed successfully

16/08/10 09:15:28 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=607148

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=489

HDFS: Number of bytes written=130230

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=49984

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=24992

Total vcore-seconds taken by all map tasks=24992

Total megabyte-seconds taken by all map tasks=38387712

Map-Reduce Framework

Map input records=1000

Map output records=1000

Input split bytes=489

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=1696

CPU time spent (ms)=38110

Physical memory (bytes) snapshot=1327677440

Virtual memory (bytes) snapshot=13322395648

Total committed heap usage (bytes)=721420288

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=130230

16/08/10 09:15:28 INFO mapreduce.ImportJobBase: Transferred 127.1777 KB in 20.4259 seconds (6.2263 KB/sec)

16/08/10 09:15:28 INFO mapreduce.ImportJobBase: Retrieved 1000 records.

[hdfs@amb2 ~]$ 



cat 으로보면 구분자쪽이 약간 깨진것처럼 보인다. 



3. 시퀀스파일 


실행한 명령어

sqoop import --target-dir=/dev/test2_seq --query='select *from HDFS_4 where $CONDITIONS' --as-sequencefile  -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID


실행 결과

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_seq --query='select *from HDFS_4 where $CONDITIONS' --as-sequencefile  -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:22:31 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:22:31 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:22:31 INFO manager.SqlManager: Using default fetchSize of 1000

16/08/10 09:22:31 INFO tool.CodeGenTool: Beginning code generation

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:22:32 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:22:32 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:22:32 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:22:32 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/3a0d3133347de2dccea51b1c74f948bd/QueryResult.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:22:34 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/3a0d3133347de2dccea51b1c74f948bd/QueryResult.jar

16/08/10 09:22:34 INFO mapreduce.ImportJobBase: Beginning query import.

16/08/10 09:22:36 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 09:22:36 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 09:22:38 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 09:22:38 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM (select *from HDFS_4 where  (1 = 1) ) t1

16/08/10 09:22:38 WARN db.TextSplitter: Generating splits for a textual index column.

16/08/10 09:22:38 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

16/08/10 09:22:38 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

16/08/10 09:22:38 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 09:22:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0009

16/08/10 09:22:38 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0009

16/08/10 09:22:39 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0009/

16/08/10 09:22:39 INFO mapreduce.Job: Running job: job_1470728284233_0009

16/08/10 09:22:47 INFO mapreduce.Job: Job job_1470728284233_0009 running in uber mode : false

16/08/10 09:22:47 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 09:22:54 INFO mapreduce.Job:  map 25% reduce 0%

16/08/10 09:22:55 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 09:22:56 INFO mapreduce.Job: Job job_1470728284233_0009 completed successfully

16/08/10 09:22:56 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=605572

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=489

HDFS: Number of bytes written=157788

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=48586

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=24293

Total vcore-seconds taken by all map tasks=24293

Total megabyte-seconds taken by all map tasks=37314048

Map-Reduce Framework

Map input records=1000

Map output records=1000

Input split bytes=489

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=765

CPU time spent (ms)=27050

Physical memory (bytes) snapshot=1312333824

Virtual memory (bytes) snapshot=13253160960

Total committed heap usage (bytes)=707788800

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=157788

16/08/10 09:22:56 INFO mapreduce.ImportJobBase: Transferred 154.0898 KB in 20.6491 seconds (7.4623 KB/sec)

16/08/10 09:22:56 INFO mapreduce.ImportJobBase: Retrieved 1000 records.

[hdfs@amb2 ~]$ 


cat 으로보면 맨위에 시퀀스파일이다! 라고 선언하고 약간깨진듯한문자로 이어진 형태를 보인다.



4. 파켓? 파큇? 파일 


실행한 명령어

* sqoop 은 현재 테스트버전은 1.4.6 이다. 명령어를 보면 위와 좀 다르다.

위에 명령어는 전부 query를 날린건데 이건 TABLE지정으로 바꿨다. 

현재 버그가 있는 관계로 아래처럼 바꿨으며 테스트 시점에 검색을 해보니 패치파일이 있었다. 

그러나 난 ambari를 통해서 아예 통으로 설치해서 일일 찾아서 수정하기 귀찮으므로 1.4.7버전을 기다린다 그냥. ㅎㅎ;

sqoop import --target-dir=/dev/test2_pq --table HDFS_4 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID


아래는 쿼리를 실행했을경우 나는 오류로 JAVA null pointer 예외 가 발생한다.

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_pq --query='select *from HDFS_4 where $CONDITIONS' --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:29:29 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:29:29 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:29:29 INFO manager.SqlManager: Using default fetchSize of 1000

16/08/10 09:29:29 INFO tool.CodeGenTool: Beginning code generation

16/08/10 09:29:29 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException

java.lang.NullPointerException

at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:97)

at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)

at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)

at org.apache.sqoop.Sqoop.run(Sqoop.java:148)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)

at org.apache.sqoop.Sqoop.main(Sqoop.java:244)

[hdfs@amb2 ~]$


실행 결과 

TABLE 을 가져오는 명령으로 실행하면 아래와 같이 실행중에 에러가난다.  

젠장~ 내용을 대충 보자면~ 데이터 타입으로 보여져서 기존table에서 데이터 타입만 변경한 뷰를 만들어서 그 뷰를 가져와보기로 한다.

16/08/10 09:31:25 INFO mapreduce.Job: Task Id : attempt_1470728284233_0010_m_000002_1, Status : FAILED

Error: org.apache.avro.UnresolvedUnionException: Not in union ["null","long"]: 2016-08-29 21:09:28.0

at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:561)

at org.apache.avro.generic.GenericData.deepCopy(GenericData.java:941)

at org.apache.avro.generic.GenericData.deepCopy(GenericData.java:922)

at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.copy(DatasetKeyOutputFormat.java:327)

at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.write(DatasetKeyOutputFormat.java:321)

at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.write(DatasetKeyOutputFormat.java:300)

at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:658)

at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)

at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)

at org.apache.sqoop.mapreduce.ParquetImportMapper.map(ParquetImportMapper.java:70)

at org.apache.sqoop.mapreduce.ParquetImportMapper.map(ParquetImportMapper.java:39)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)

at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)


Container killed by the ApplicationMaster.

Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143


위에러로 인한 뷰생성후 재실행 

CREATE OR REPLACE VIEW VW_HDFS_4

AS

SELECT ID,NUM,TEST,To_Char(REG_DT,'YYYY-MM-DD HH24:MI:SS') REG_DT

  FROM HDFS_4;


실행결과 - 아래 박스에서 검은글씨부분은 에러난것 , 컬러가 있는것은 정상실행부분 

에러는 이미 실행전에 hadoop에 디렉토리를 만들고 meta파일을 생성해서 이미 존재한다고 한 에러.

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_pq --table VW_HDFS_4 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:36:33 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:36:33 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:36:33 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:36:34 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop will not process this sqoop connection, as "FLASHONE"."VW_HDFS_4" is not an Oracle table, it's a VIEW.

16/08/10 09:36:34 INFO tool.CodeGenTool: Beginning code generation

16/08/10 09:36:34 INFO tool.CodeGenTool: Will generate java class as codegen_VW_HDFS_4

16/08/10 09:36:35 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:36:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0

16/08/10 09:36:35 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/95f85eece904f411d74c356ac450d5b5/codegen_VW_HDFS_4.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:36:36 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/95f85eece904f411d74c356ac450d5b5/codegen_VW_HDFS_4.jar

16/08/10 09:36:36 INFO mapreduce.ImportJobBase: Beginning import of VW_HDFS_4

16/08/10 09:36:37 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:36:38 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:36:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0

16/08/10 09:36:38 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetExistsException: Descriptor directory already exists: hdfs://amb2.local:8020/dev/test2_pq/.metadata

org.kitesdk.data.DatasetExistsException: Descriptor directory already exists: hdfs://amb2.local:8020/dev/test2_pq/.metadata

at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.create(FileSystemMetadataProvider.java:219)

at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:137)

at org.kitesdk.data.Datasets.create(Datasets.java:239)

at org.kitesdk.data.Datasets.create(Datasets.java:307)

at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:107)

at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:89)

at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:108)

at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)

at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)

at org.apache.sqoop.manager.OracleManager.importTable(OracleManager.java:445)

at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)

at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)

at org.apache.sqoop.Sqoop.run(Sqoop.java:148)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)

at org.apache.sqoop.Sqoop.main(Sqoop.java:244)



[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_pq2 --table VW_HDFS_4 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:36:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:36:56 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:36:56 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:36:57 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop will not process this sqoop connection, as "FLASHONE"."VW_HDFS_4" is not an Oracle table, it's a VIEW.

16/08/10 09:36:57 INFO tool.CodeGenTool: Beginning code generation

16/08/10 09:36:57 INFO tool.CodeGenTool: Will generate java class as codegen_VW_HDFS_4

16/08/10 09:36:57 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:36:57 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0

16/08/10 09:36:57 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/586ec1c10f04fc4c1fdd5986c9fae96b/codegen_VW_HDFS_4.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:36:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/586ec1c10f04fc4c1fdd5986c9fae96b/codegen_VW_HDFS_4.jar

16/08/10 09:36:59 INFO mapreduce.ImportJobBase: Beginning import of VW_HDFS_4

16/08/10 09:36:59 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:37:00 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:37:00 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0

16/08/10 09:37:02 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 09:37:02 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 09:37:04 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 09:37:04 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN("ID"), MAX("ID") FROM "VW_HDFS_4"

16/08/10 09:37:04 WARN db.TextSplitter: Generating splits for a textual index column.

16/08/10 09:37:04 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

16/08/10 09:37:04 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

16/08/10 09:37:04 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 09:37:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0011

16/08/10 09:37:04 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0011

16/08/10 09:37:04 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0011/

16/08/10 09:37:04 INFO mapreduce.Job: Running job: job_1470728284233_0011

16/08/10 09:37:11 INFO mapreduce.Job: Job job_1470728284233_0011 running in uber mode : false

16/08/10 09:37:11 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 09:37:20 INFO mapreduce.Job:  map 25% reduce 0%

16/08/10 09:37:21 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 09:37:22 INFO mapreduce.Job: Job job_1470728284233_0011 completed successfully

16/08/10 09:37:22 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=610560

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=28037

HDFS: Number of bytes written=137608

HDFS: Number of read operations=200

HDFS: Number of large read operations=0

HDFS: Number of write operations=36

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=60614

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=30307

Total vcore-seconds taken by all map tasks=30307

Total megabyte-seconds taken by all map tasks=46551552

Map-Reduce Framework

Map input records=1000

Map output records=1000

Input split bytes=505

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=1345

CPU time spent (ms)=41390

Physical memory (bytes) snapshot=1502236672

Virtual memory (bytes) snapshot=13242277888

Total committed heap usage (bytes)=828375040

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=0

16/08/10 09:37:22 INFO mapreduce.ImportJobBase: Transferred 134.3828 KB in 21.319 seconds (6.3034 KB/sec)

16/08/10 09:37:22 INFO mapreduce.ImportJobBase: Retrieved 1000 records.

[hdfs@amb2 ~]$


cat으로 보면 첫번째 스크린샷은 시작지점이고 두번째 스크린샷은 마지막부분이다. 

타입에대한 내용도 기술되어있다. 좀더 보면 ~avro.schema라는 글도 보인다. 

검색대로 스키마를 avro를 사용한것으로 보인다.



이번프로젝트에서는 text(CSV) 와 parquet을 사용하력 한다.

이제는 parquet으로 저장했을경우 데이터를 다른놈들이 잘가져오는지를 봐야한다. ㅠ_ㅜ;