sqoop parquet snappy test (2016.08.10)

On my current project I had to choose a compression codec and a storage format for SQOOP, so I ran a quick test.


The test covers importing Oracle data into Hadoop through sqoop with:

snappy compression / no compression

text plain / parquet format

Combining these two dimensions gives a total of four test cases.
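For reference, the four cases map directly onto the sqoop options used in the commands later in this post; this is only a summary of those flag combinations, not a separate command set.

# text,    uncompressed : no extra format/compression flags
# text,    snappy       : --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec
# parquet, uncompressed : --as-parquetfile
# parquet, snappy       : --as-parquetfile --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec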




Brief specs of the test machines


Host machine:

CPU: Xeon E5-2620 v3 × 2, HT (total 24 cores)

RAM: 256GB

Storage: PCI-E (NVMe) for the VM OS images, SATA for the hadoop and oracle data




Guest OS specs


HADOOP ECOSYSTEM


vm node spec

cores: 16 (4 sockets × 4 cores)

ram: 16GB

layout: 1 name node, 4 data nodes, 1 ambari-util server



ORACLE

vm node spec

cores: 8 (4 sockets × 2 cores)

ram: 8GB

1 single node: Oracle DB 12c (Enterprise, single node)



Source data (SOURCE) description

As the earlier failed test shows, filling the columns with random text gave a compression ratio of 0%. The evidence: the total transferred volume was identical with and without compression. That test result is included below.


Because of that problem, I decided to build more ordinary data, i.e. data that contains actual sentences and numbers, and generated the test data from the comments stored in the oracle dictionary.


Table structure: an ordinary heap table, not a partitioned table.


CREATE TABLE HDFS_TEST5
(
    TABLE_NAME VARCHAR2(128),
    LVL        NUMBER,
    COMMENTS   VARCHAR(4000),
    REG_DT     VARCHAR2(19),
    SEQ        NUMBER
);


Looking at the loaded data in bulk, it actually looks fairly data-like, haha.


ROW COUNT: 11,620,800 rows
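The post does not include the exact script used to build this data, but a sketch along the following lines would produce rows of this shape; using ALL_TAB_COMMENTS as the comment source and the row multiplier are assumptions for illustration only.

-- Hypothetical sketch: generate sentence-like test data from the Oracle dictionary comments.
-- ALL_TAB_COMMENTS exposes TABLE_NAME and COMMENTS; the CONNECT BY multiplier is illustrative.
INSERT INTO HDFS_TEST5 (TABLE_NAME, LVL, COMMENTS, REG_DT, SEQ)
SELECT c.TABLE_NAME,
       m.LVL,
       c.COMMENTS,
       TO_CHAR(SYSDATE, 'YYYY-MM-DD HH24:MI:SS'),   -- 19-character timestamp string for REG_DT
       ROWNUM                                       -- simple running sequence for SEQ
FROM   ALL_TAB_COMMENTS c
       CROSS JOIN (SELECT LEVEL AS LVL FROM DUAL CONNECT BY LEVEL <= 100) m
WHERE  c.COMMENTS IS NOT NULL;
COMMIT;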




Test 1.

========================= Test showing a 0% compression ratio ============================


CASE 1. File format: text plain, uncompressed


Command

sqoop import --target-dir=/dev/test/data_nc_txt --table HDFS_3_SUB -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1


Result

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test/data_nc_txt --table HDFS_3_SUB -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1

16/08/10 17:46:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 17:46:04 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 17:46:04 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 17:46:06 INFO oracle.OraOopManagerFactory: 

**************************************************

*** Using Data Connector for Oracle and Hadoop ***

**************************************************

16/08/10 17:46:06 INFO oracle.OraOopManagerFactory: Oracle Database version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

16/08/10 17:46:06 INFO oracle.OraOopManagerFactory: This Oracle database is not a RAC.

16/08/10 17:46:06 INFO tool.CodeGenTool: Beginning code generation

16/08/10 17:46:06 INFO manager.SqlManager: Executing SQL statement: SELECT "ACOL1","ACOL2","ACOL3","ACOL4","ACOL5","ACOL6","ACOL7","ACOL8","ACOL9","ACOL10" FROM HDFS_3_SUB WHERE 0=1

16/08/10 17:46:06 INFO manager.SqlManager: Executing SQL statement: SELECT "ACOL1","ACOL2","ACOL3","ACOL4","ACOL5","ACOL6","ACOL7","ACOL8","ACOL9","ACOL10" FROM "HDFS_3_SUB" WHERE 1=0

16/08/10 17:46:06 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/7e534716bf2036f166e1b14257055d00/HDFS_3_SUB.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 17:46:09 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/7e534716bf2036f166e1b14257055d00/HDFS_3_SUB.jar

16/08/10 17:46:09 INFO mapreduce.ImportJobBase: Beginning import of HDFS_3_SUB

16/08/10 17:46:11 INFO manager.SqlManager: Executing SQL statement: SELECT "ACOL1","ACOL2","ACOL3","ACOL4","ACOL5","ACOL6","ACOL7","ACOL8","ACOL9","ACOL10" FROM "HDFS_3_SUB" WHERE 1=0

16/08/10 17:46:12 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 17:46:12 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 17:46:15 WARN oracle.OraOopUtilities: System property java.security.egd is not set to file:///dev/urandom - Oracle connections may time out.

16/08/10 17:46:15 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 17:46:15 INFO oracle.OraOopOracleQueries: Session Time Zone set to GMT

16/08/10 17:46:15 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL :

begin 

  dbms_application_info.set_module(module_name => 'Data Connector for Oracle and Hadoop', action_name => 'import 20160810174606KST'); 

end;

16/08/10 17:46:15 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session disable parallel query

16/08/10 17:46:15 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set "_serial_direct_read"=true

16/08/10 17:46:15 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set tracefile_identifier=oraoop

16/08/10 17:46:17 INFO oracle.OraOopDataDrivenDBInputFormat: The table being imported by sqoop has 72704 blocks that have been divided into 128 chunks which will be processed in 4 splits. The chunks will be allocated to the splits using the method : ROUNDROBIN

16/08/10 17:46:17 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 17:46:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470815003334_0001

16/08/10 17:46:18 INFO impl.YarnClientImpl: Submitted application application_1470815003334_0001

16/08/10 17:46:18 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470815003334_0001/

16/08/10 17:46:18 INFO mapreduce.Job: Running job: job_1470815003334_0001

16/08/10 17:46:26 INFO mapreduce.Job: Job job_1470815003334_0001 running in uber mode : false

16/08/10 17:46:26 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 17:46:40 INFO mapreduce.Job:  map 7% reduce 0%

16/08/10 17:46:41 INFO mapreduce.Job:  map 32% reduce 0%

16/08/10 17:46:42 INFO mapreduce.Job:  map 57% reduce 0%

16/08/10 17:46:43 INFO mapreduce.Job:  map 80% reduce 0%

16/08/10 17:46:44 INFO mapreduce.Job:  map 87% reduce 0%

16/08/10 17:46:46 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 17:46:46 INFO mapreduce.Job: Job job_1470815003334_0001 completed successfully

16/08/10 17:46:46 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=614156

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=11629

HDFS: Number of bytes written=505000000

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=116770

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=58385

Total vcore-seconds taken by all map tasks=58385

Total megabyte-seconds taken by all map tasks=89679360

Map-Reduce Framework

Map input records=500000

Map output records=500000

Input split bytes=11629

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=1801

CPU time spent (ms)=75560

Physical memory (bytes) snapshot=1357881344

Virtual memory (bytes) snapshot=13260427264

Total committed heap usage (bytes)=709361664

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=505000000

16/08/10 17:46:46 INFO mapreduce.ImportJobBase: Transferred 481.6055 MB in 34.0606 seconds (14.1397 MB/sec)

16/08/10 17:46:46 INFO mapreduce.ImportJobBase: Retrieved 500000 records.

[hdfs@amb2 ~]$ 



CASE 2. File format: text plain, compression: snappy


Command

sqoop import --target-dir=/dev/test/data_sn_txt --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --table HDFS_3_SUB -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1


Result

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test/data_sn_txt --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --table HDFS_3_SUB -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1

16/08/10 17:50:13 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 17:50:13 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 17:50:13 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 17:50:14 INFO oracle.OraOopManagerFactory: 

**************************************************

*** Using Data Connector for Oracle and Hadoop ***

**************************************************

16/08/10 17:50:14 INFO oracle.OraOopManagerFactory: Oracle Database version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

16/08/10 17:50:14 INFO oracle.OraOopManagerFactory: This Oracle database is not a RAC.

16/08/10 17:50:14 INFO tool.CodeGenTool: Beginning code generation

16/08/10 17:50:14 INFO manager.SqlManager: Executing SQL statement: SELECT "ACOL1","ACOL2","ACOL3","ACOL4","ACOL5","ACOL6","ACOL7","ACOL8","ACOL9","ACOL10" FROM HDFS_3_SUB WHERE 0=1

16/08/10 17:50:14 INFO manager.SqlManager: Executing SQL statement: SELECT "ACOL1","ACOL2","ACOL3","ACOL4","ACOL5","ACOL6","ACOL7","ACOL8","ACOL9","ACOL10" FROM "HDFS_3_SUB" WHERE 1=0

16/08/10 17:50:14 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/175537b93a775793afb75735d65b176f/HDFS_3_SUB.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 17:50:16 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/175537b93a775793afb75735d65b176f/HDFS_3_SUB.jar

16/08/10 17:50:16 INFO mapreduce.ImportJobBase: Beginning import of HDFS_3_SUB

16/08/10 17:50:17 INFO manager.SqlManager: Executing SQL statement: SELECT "ACOL1","ACOL2","ACOL3","ACOL4","ACOL5","ACOL6","ACOL7","ACOL8","ACOL9","ACOL10" FROM "HDFS_3_SUB" WHERE 1=0

16/08/10 17:50:18 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 17:50:18 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 17:50:21 WARN oracle.OraOopUtilities: System property java.security.egd is not set to file:///dev/urandom - Oracle connections may time out.

16/08/10 17:50:21 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 17:50:21 INFO oracle.OraOopOracleQueries: Session Time Zone set to GMT

16/08/10 17:50:21 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL :

begin 

  dbms_application_info.set_module(module_name => 'Data Connector for Oracle and Hadoop', action_name => 'import 20160810175014KST'); 

end;

16/08/10 17:50:21 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session disable parallel query

16/08/10 17:50:21 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set "_serial_direct_read"=true

16/08/10 17:50:21 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set tracefile_identifier=oraoop

16/08/10 17:50:22 INFO oracle.OraOopDataDrivenDBInputFormat: The table being imported by sqoop has 72704 blocks that have been divided into 128 chunks which will be processed in 4 splits. The chunks will be allocated to the splits using the method : ROUNDROBIN

16/08/10 17:50:22 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 17:50:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470815003334_0002

16/08/10 17:50:23 INFO impl.YarnClientImpl: Submitted application application_1470815003334_0002

16/08/10 17:50:23 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470815003334_0002/

16/08/10 17:50:23 INFO mapreduce.Job: Running job: job_1470815003334_0002

16/08/10 17:50:31 INFO mapreduce.Job: Job job_1470815003334_0002 running in uber mode : false

16/08/10 17:50:31 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 17:50:41 INFO mapreduce.Job:  map 25% reduce 0%

16/08/10 17:50:42 INFO mapreduce.Job:  map 50% reduce 0%

16/08/10 17:50:44 INFO mapreduce.Job:  map 80% reduce 0%

16/08/10 17:50:46 INFO mapreduce.Job:  map 89% reduce 0%

16/08/10 17:50:47 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 17:50:47 INFO mapreduce.Job: Job job_1470815003334_0002 completed successfully

16/08/10 17:50:47 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=614136

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=11629

HDFS: Number of bytes written=505057648

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=86936

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=43468

Total vcore-seconds taken by all map tasks=43468

Total megabyte-seconds taken by all map tasks=66766848

Map-Reduce Framework

Map input records=500000

Map output records=500000

Input split bytes=11629

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=1239

CPU time spent (ms)=67760

Physical memory (bytes) snapshot=1410674688

Virtual memory (bytes) snapshot=13316395008

Total committed heap usage (bytes)=737673216

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=505057648

16/08/10 17:50:47 INFO mapreduce.ImportJobBase: Transferred 481.6605 MB in 29.1766 seconds (16.5084 MB/sec)

16/08/10 17:50:47 INFO mapreduce.ImportJobBase: Retrieved 500000 records.

[hdfs@amb2 ~]$





The real tests start below.


========================= The four test cases ============================




CASE 1. Text, uncompressed


Command

sqoop import --target-dir=/dev/test2/data_nc_txt --table HDFS_TEST5 -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1


Result

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2/data_nc_txt --table HDFS_TEST5 -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1

16/08/10 18:09:15 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 18:09:15 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 18:09:15 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 18:09:17 INFO oracle.OraOopManagerFactory: 

**************************************************

*** Using Data Connector for Oracle and Hadoop ***

**************************************************

16/08/10 18:09:17 INFO oracle.OraOopManagerFactory: Oracle Database version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

16/08/10 18:09:17 INFO oracle.OraOopManagerFactory: This Oracle database is not a RAC.

16/08/10 18:09:17 INFO tool.CodeGenTool: Beginning code generation

16/08/10 18:09:17 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM HDFS_TEST5 WHERE 0=1

16/08/10 18:09:17 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:09:17 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/83d82e240522d651aa49f619fb1c723b/HDFS_TEST5.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 18:09:20 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/83d82e240522d651aa49f619fb1c723b/HDFS_TEST5.jar

16/08/10 18:09:20 INFO mapreduce.ImportJobBase: Beginning import of HDFS_TEST5

16/08/10 18:09:22 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:09:23 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 18:09:23 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 18:09:26 WARN oracle.OraOopUtilities: System property java.security.egd is not set to file:///dev/urandom - Oracle connections may time out.

16/08/10 18:09:26 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 18:09:26 INFO oracle.OraOopOracleQueries: Session Time Zone set to GMT

16/08/10 18:09:26 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL :

begin 

  dbms_application_info.set_module(module_name => 'Data Connector for Oracle and Hadoop', action_name => 'import 20160810180917KST'); 

end;

16/08/10 18:09:26 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session disable parallel query

16/08/10 18:09:26 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set "_serial_direct_read"=true

16/08/10 18:09:26 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set tracefile_identifier=oraoop

16/08/10 18:09:27 INFO oracle.OraOopDataDrivenDBInputFormat: The table being imported by sqoop has 180224 blocks that have been divided into 185 chunks which will be processed in 4 splits. The chunks will be allocated to the splits using the method : ROUNDROBIN

16/08/10 18:09:27 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 18:09:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470815003334_0004

16/08/10 18:09:28 INFO impl.YarnClientImpl: Submitted application application_1470815003334_0004

16/08/10 18:09:28 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470815003334_0004/

16/08/10 18:09:28 INFO mapreduce.Job: Running job: job_1470815003334_0004

16/08/10 18:09:36 INFO mapreduce.Job: Job job_1470815003334_0004 running in uber mode : false

16/08/10 18:09:36 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 18:09:47 INFO mapreduce.Job:  map 3% reduce 0%

16/08/10 18:09:49 INFO mapreduce.Job:  map 10% reduce 0%

16/08/10 18:09:50 INFO mapreduce.Job:  map 13% reduce 0%

16/08/10 18:09:52 INFO mapreduce.Job:  map 23% reduce 0%

16/08/10 18:09:54 INFO mapreduce.Job:  map 27% reduce 0%

16/08/10 18:09:55 INFO mapreduce.Job:  map 37% reduce 0%

16/08/10 18:09:57 INFO mapreduce.Job:  map 44% reduce 0%

16/08/10 18:09:58 INFO mapreduce.Job:  map 59% reduce 0%

16/08/10 18:10:00 INFO mapreduce.Job:  map 60% reduce 0%

16/08/10 18:10:01 INFO mapreduce.Job:  map 73% reduce 0%

16/08/10 18:10:03 INFO mapreduce.Job:  map 77% reduce 0%

16/08/10 18:10:04 INFO mapreduce.Job:  map 82% reduce 0%

16/08/10 18:10:06 INFO mapreduce.Job:  map 83% reduce 0%

16/08/10 18:10:08 INFO mapreduce.Job:  map 91% reduce 0%

16/08/10 18:10:10 INFO mapreduce.Job:  map 95% reduce 0%

16/08/10 18:10:11 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 18:10:12 INFO mapreduce.Job: Job job_1470815003334_0004 completed successfully

16/08/10 18:10:12 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=614016

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=16707

HDFS: Number of bytes written=1422465312

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=241392

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=120696

Total vcore-seconds taken by all map tasks=120696

Total megabyte-seconds taken by all map tasks=185389056

Map-Reduce Framework

Map input records=11620800

Map output records=11620800

Input split bytes=16707

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=2308

CPU time spent (ms)=178250

Physical memory (bytes) snapshot=1499193344

Virtual memory (bytes) snapshot=13390106624

Total committed heap usage (bytes)=746586112

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=1422465312

16/08/10 18:10:12 INFO mapreduce.ImportJobBase: Transferred 1.3248 GB in 48.8625 seconds (27.763 MB/sec)

16/08/10 18:10:12 INFO mapreduce.ImportJobBase: Retrieved 11620800 records.

[hdfs@amb2 ~]$ 




CASE 2. Text, snappy compression


Command

sqoop import --target-dir=/dev/test2/data_sn_txt --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --table HDFS_TEST5 -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1


Result

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2/data_sn_txt --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --table HDFS_TEST5 -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1

16/08/10 18:12:53 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 18:12:53 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 18:12:53 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 18:12:54 INFO oracle.OraOopManagerFactory: 

**************************************************

*** Using Data Connector for Oracle and Hadoop ***

**************************************************

16/08/10 18:12:54 INFO oracle.OraOopManagerFactory: Oracle Database version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

16/08/10 18:12:54 INFO oracle.OraOopManagerFactory: This Oracle database is not a RAC.

16/08/10 18:12:54 INFO tool.CodeGenTool: Beginning code generation

16/08/10 18:12:54 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM HDFS_TEST5 WHERE 0=1

16/08/10 18:12:54 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:12:54 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/a40bc11d71b280f6f6f0be86d8987524/HDFS_TEST5.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 18:12:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/a40bc11d71b280f6f6f0be86d8987524/HDFS_TEST5.jar

16/08/10 18:12:56 INFO mapreduce.ImportJobBase: Beginning import of HDFS_TEST5

16/08/10 18:12:57 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:12:58 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 18:12:58 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 18:13:01 WARN oracle.OraOopUtilities: System property java.security.egd is not set to file:///dev/urandom - Oracle connections may time out.

16/08/10 18:13:01 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 18:13:01 INFO oracle.OraOopOracleQueries: Session Time Zone set to GMT

16/08/10 18:13:01 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL :

begin 

  dbms_application_info.set_module(module_name => 'Data Connector for Oracle and Hadoop', action_name => 'import 20160810181254KST'); 

end;

16/08/10 18:13:01 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session disable parallel query

16/08/10 18:13:01 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set "_serial_direct_read"=true

16/08/10 18:13:01 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set tracefile_identifier=oraoop

16/08/10 18:13:02 INFO oracle.OraOopDataDrivenDBInputFormat: The table being imported by sqoop has 180224 blocks that have been divided into 185 chunks which will be processed in 4 splits. The chunks will be allocated to the splits using the method : ROUNDROBIN

16/08/10 18:13:02 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 18:13:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470815003334_0005

16/08/10 18:13:02 INFO impl.YarnClientImpl: Submitted application application_1470815003334_0005

16/08/10 18:13:02 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470815003334_0005/

16/08/10 18:13:02 INFO mapreduce.Job: Running job: job_1470815003334_0005

16/08/10 18:13:10 INFO mapreduce.Job: Job job_1470815003334_0005 running in uber mode : false

16/08/10 18:13:10 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 18:13:22 INFO mapreduce.Job:  map 3% reduce 0%

16/08/10 18:13:23 INFO mapreduce.Job:  map 10% reduce 0%

16/08/10 18:13:25 INFO mapreduce.Job:  map 15% reduce 0%

16/08/10 18:13:26 INFO mapreduce.Job:  map 24% reduce 0%

16/08/10 18:13:28 INFO mapreduce.Job:  map 27% reduce 0%

16/08/10 18:13:29 INFO mapreduce.Job:  map 36% reduce 0%

16/08/10 18:13:31 INFO mapreduce.Job:  map 43% reduce 0%

16/08/10 18:13:32 INFO mapreduce.Job:  map 55% reduce 0%

16/08/10 18:13:34 INFO mapreduce.Job:  map 56% reduce 0%

16/08/10 18:13:35 INFO mapreduce.Job:  map 67% reduce 0%

16/08/10 18:13:37 INFO mapreduce.Job:  map 70% reduce 0%

16/08/10 18:13:38 INFO mapreduce.Job:  map 81% reduce 0%

16/08/10 18:13:40 INFO mapreduce.Job:  map 85% reduce 0%

16/08/10 18:13:41 INFO mapreduce.Job:  map 90% reduce 0%

16/08/10 18:13:43 INFO mapreduce.Job:  map 94% reduce 0%

16/08/10 18:13:44 INFO mapreduce.Job:  map 97% reduce 0%

16/08/10 18:13:47 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 18:13:47 INFO mapreduce.Job: Job job_1470815003334_0005 completed successfully

16/08/10 18:13:47 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=613996

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=16707

HDFS: Number of bytes written=809177318

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=244686

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=122343

Total vcore-seconds taken by all map tasks=122343

Total megabyte-seconds taken by all map tasks=187918848

Map-Reduce Framework

Map input records=11620800

Map output records=11620800

Input split bytes=16707

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=2268

CPU time spent (ms)=166060

Physical memory (bytes) snapshot=1489506304

Virtual memory (bytes) snapshot=13366169600

Total committed heap usage (bytes)=789577728

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=809177318

16/08/10 18:13:47 INFO mapreduce.ImportJobBase: Transferred 771.6916 MB in 49.3399 seconds (15.6403 MB/sec)

16/08/10 18:13:47 INFO mapreduce.ImportJobBase: Retrieved 11620800 records.

[hdfs@amb2 ~]$ 





CASE 3. Parquet format, uncompressed


Command

sqoop import --target-dir=/dev/test2/data_nc_pq --table HDFS_TEST5 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1


Result

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2/data_nc_pq --table HDFS_TEST5 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1

16/08/10 18:16:27 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 18:16:27 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 18:16:27 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 18:16:28 INFO oracle.OraOopManagerFactory: 

**************************************************

*** Using Data Connector for Oracle and Hadoop ***

**************************************************

16/08/10 18:16:28 INFO oracle.OraOopManagerFactory: Oracle Database version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

16/08/10 18:16:28 INFO oracle.OraOopManagerFactory: This Oracle database is not a RAC.

16/08/10 18:16:28 INFO tool.CodeGenTool: Beginning code generation

16/08/10 18:16:28 INFO tool.CodeGenTool: Will generate java class as codegen_HDFS_TEST5

16/08/10 18:16:28 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM HDFS_TEST5 WHERE 0=1

16/08/10 18:16:28 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:16:28 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/cc39417a2380979acedad21e69766d18/codegen_HDFS_TEST5.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 18:16:31 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/cc39417a2380979acedad21e69766d18/codegen_HDFS_TEST5.jar

16/08/10 18:16:31 INFO mapreduce.ImportJobBase: Beginning import of HDFS_TEST5

16/08/10 18:16:31 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:16:32 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:16:34 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 18:16:34 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 18:16:36 WARN oracle.OraOopUtilities: System property java.security.egd is not set to file:///dev/urandom - Oracle connections may time out.

16/08/10 18:16:36 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 18:16:36 INFO oracle.OraOopOracleQueries: Session Time Zone set to GMT

16/08/10 18:16:36 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL :

begin 

  dbms_application_info.set_module(module_name => 'Data Connector for Oracle and Hadoop', action_name => 'import 20160810181628KST'); 

end;

16/08/10 18:16:36 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session disable parallel query

16/08/10 18:16:36 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set "_serial_direct_read"=true

16/08/10 18:16:36 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set tracefile_identifier=oraoop

16/08/10 18:16:37 INFO oracle.OraOopDataDrivenDBInputFormat: The table being imported by sqoop has 180224 blocks that have been divided into 185 chunks which will be processed in 4 splits. The chunks will be allocated to the splits using the method : ROUNDROBIN

16/08/10 18:16:37 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 18:16:37 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470815003334_0006

16/08/10 18:16:38 INFO impl.YarnClientImpl: Submitted application application_1470815003334_0006

16/08/10 18:16:38 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470815003334_0006/

16/08/10 18:16:38 INFO mapreduce.Job: Running job: job_1470815003334_0006

16/08/10 18:16:46 INFO mapreduce.Job: Job job_1470815003334_0006 running in uber mode : false

16/08/10 18:16:46 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 18:17:10 INFO mapreduce.Job:  map 2% reduce 0%

16/08/10 18:17:22 INFO mapreduce.Job:  map 9% reduce 0%

16/08/10 18:17:25 INFO mapreduce.Job:  map 16% reduce 0%

16/08/10 18:17:28 INFO mapreduce.Job:  map 21% reduce 0%

16/08/10 18:17:31 INFO mapreduce.Job:  map 28% reduce 0%

16/08/10 18:17:34 INFO mapreduce.Job:  map 36% reduce 0%

16/08/10 18:17:37 INFO mapreduce.Job:  map 40% reduce 0%

16/08/10 18:17:40 INFO mapreduce.Job:  map 49% reduce 0%

16/08/10 18:17:43 INFO mapreduce.Job:  map 60% reduce 0%

16/08/10 18:17:46 INFO mapreduce.Job:  map 65% reduce 0%

16/08/10 18:17:49 INFO mapreduce.Job:  map 72% reduce 0%

16/08/10 18:17:52 INFO mapreduce.Job:  map 78% reduce 0%

16/08/10 18:17:55 INFO mapreduce.Job:  map 82% reduce 0%

16/08/10 18:17:58 INFO mapreduce.Job:  map 87% reduce 0%

16/08/10 18:18:01 INFO mapreduce.Job:  map 93% reduce 0%

16/08/10 18:18:03 INFO mapreduce.Job:  map 94% reduce 0%

16/08/10 18:18:04 INFO mapreduce.Job:  map 97% reduce 0%

16/08/10 18:18:07 INFO mapreduce.Job:  map 99% reduce 0%

16/08/10 18:18:10 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 18:18:10 INFO mapreduce.Job: Job job_1470815003334_0006 completed successfully

16/08/10 18:18:10 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=618184

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=49495

HDFS: Number of bytes written=456758842

HDFS: Number of read operations=200

HDFS: Number of large read operations=0

HDFS: Number of write operations=36

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=612860

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=306430

Total vcore-seconds taken by all map tasks=306430

Total megabyte-seconds taken by all map tasks=470676480

Map-Reduce Framework

Map input records=11620800

Map output records=11620800

Input split bytes=16707

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=31375

CPU time spent (ms)=363220

Physical memory (bytes) snapshot=2344189952

Virtual memory (bytes) snapshot=13274750976

Total committed heap usage (bytes)=1465909248

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=0

16/08/10 18:18:10 INFO mapreduce.ImportJobBase: Transferred 435.5992 MB in 97.1124 seconds (4.4855 MB/sec)

16/08/10 18:18:10 INFO mapreduce.ImportJobBase: Retrieved 11620800 records.

[hdfs@amb2 ~]$ 




CASE 4. Parquet format, snappy compression


Command

sqoop import --target-dir=/dev/test2/data_sn_pq --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --table HDFS_TEST5 -direct --as-parquetfile --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1


Result

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2/data_sn_pq --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --table HDFS_TEST5 -direct --as-parquetfile --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ACOL1

16/08/10 18:21:21 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 18:21:21 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 18:21:21 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 18:21:22 INFO oracle.OraOopManagerFactory: 

**************************************************

*** Using Data Connector for Oracle and Hadoop ***

**************************************************

16/08/10 18:21:22 INFO oracle.OraOopManagerFactory: Oracle Database version: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

16/08/10 18:21:22 INFO oracle.OraOopManagerFactory: This Oracle database is not a RAC.

16/08/10 18:21:22 INFO tool.CodeGenTool: Beginning code generation

16/08/10 18:21:22 INFO tool.CodeGenTool: Will generate java class as codegen_HDFS_TEST5

16/08/10 18:21:22 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM HDFS_TEST5 WHERE 0=1

16/08/10 18:21:22 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:21:22 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/ffc7dd24a3b45d0a3b7dad6697d1826d/codegen_HDFS_TEST5.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 18:21:24 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/ffc7dd24a3b45d0a3b7dad6697d1826d/codegen_HDFS_TEST5.jar

16/08/10 18:21:24 INFO mapreduce.ImportJobBase: Beginning import of HDFS_TEST5

16/08/10 18:21:25 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:21:27 INFO manager.SqlManager: Executing SQL statement: SELECT "TABLE_NAME","LVL","COMMENTS","REG_DT","SEQ" FROM "HDFS_TEST5" WHERE 1=0

16/08/10 18:21:29 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 18:21:29 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 18:21:32 WARN oracle.OraOopUtilities: System property java.security.egd is not set to file:///dev/urandom - Oracle connections may time out.

16/08/10 18:21:32 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 18:21:32 INFO oracle.OraOopOracleQueries: Session Time Zone set to GMT

16/08/10 18:21:32 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL :

begin 

  dbms_application_info.set_module(module_name => 'Data Connector for Oracle and Hadoop', action_name => 'import 20160810182122KST'); 

end;

16/08/10 18:21:32 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session disable parallel query

16/08/10 18:21:32 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set "_serial_direct_read"=true

16/08/10 18:21:32 INFO oracle.OracleConnectionFactory: Initializing Oracle session with SQL : alter session set tracefile_identifier=oraoop

16/08/10 18:21:32 INFO oracle.OraOopDataDrivenDBInputFormat: The table being imported by sqoop has 180224 blocks that have been divided into 185 chunks which will be processed in 4 splits. The chunks will be allocated to the splits using the method : ROUNDROBIN

16/08/10 18:21:33 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 18:21:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470815003334_0007

16/08/10 18:21:33 INFO impl.YarnClientImpl: Submitted application application_1470815003334_0007

16/08/10 18:21:33 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470815003334_0007/

16/08/10 18:21:33 INFO mapreduce.Job: Running job: job_1470815003334_0007

16/08/10 18:21:41 INFO mapreduce.Job: Job job_1470815003334_0007 running in uber mode : false

16/08/10 18:21:41 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 18:21:57 INFO mapreduce.Job:  map 2% reduce 0%

16/08/10 18:21:58 INFO mapreduce.Job:  map 8% reduce 0%

16/08/10 18:22:00 INFO mapreduce.Job:  map 9% reduce 0%

16/08/10 18:22:01 INFO mapreduce.Job:  map 14% reduce 0%

16/08/10 18:22:03 INFO mapreduce.Job:  map 15% reduce 0%

16/08/10 18:22:04 INFO mapreduce.Job:  map 20% reduce 0%

16/08/10 18:22:06 INFO mapreduce.Job:  map 22% reduce 0%

16/08/10 18:22:07 INFO mapreduce.Job:  map 27% reduce 0%

16/08/10 18:22:09 INFO mapreduce.Job:  map 28% reduce 0%

16/08/10 18:22:10 INFO mapreduce.Job:  map 33% reduce 0%

16/08/10 18:22:12 INFO mapreduce.Job:  map 34% reduce 0%

16/08/10 18:22:13 INFO mapreduce.Job:  map 40% reduce 0%

16/08/10 18:22:15 INFO mapreduce.Job:  map 44% reduce 0%

16/08/10 18:22:16 INFO mapreduce.Job:  map 48% reduce 0%

16/08/10 18:22:19 INFO mapreduce.Job:  map 54% reduce 0%

16/08/10 18:22:20 INFO mapreduce.Job:  map 59% reduce 0%

16/08/10 18:22:22 INFO mapreduce.Job:  map 65% reduce 0%

16/08/10 18:22:24 INFO mapreduce.Job:  map 71% reduce 0%

16/08/10 18:22:25 INFO mapreduce.Job:  map 76% reduce 0%

16/08/10 18:22:26 INFO mapreduce.Job:  map 77% reduce 0%

16/08/10 18:22:29 INFO mapreduce.Job:  map 78% reduce 0%

16/08/10 18:22:32 INFO mapreduce.Job:  map 81% reduce 0%

16/08/10 18:22:33 INFO mapreduce.Job:  map 82% reduce 0%

16/08/10 18:22:34 INFO mapreduce.Job:  map 83% reduce 0%

16/08/10 18:22:35 INFO mapreduce.Job:  map 86% reduce 0%

16/08/10 18:22:36 INFO mapreduce.Job:  map 87% reduce 0%

16/08/10 18:22:38 INFO mapreduce.Job:  map 90% reduce 0%

16/08/10 18:22:39 INFO mapreduce.Job:  map 92% reduce 0%

16/08/10 18:22:40 INFO mapreduce.Job:  map 93% reduce 0%

16/08/10 18:22:41 INFO mapreduce.Job:  map 96% reduce 0%

16/08/10 18:22:42 INFO mapreduce.Job:  map 97% reduce 0%

16/08/10 18:22:44 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 18:22:46 INFO mapreduce.Job: Job job_1470815003334_0007 completed successfully

16/08/10 18:22:46 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=618704

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=49495

HDFS: Number of bytes written=456758940

HDFS: Number of read operations=200

HDFS: Number of large read operations=0

HDFS: Number of write operations=36

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=466256

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=233128

Total vcore-seconds taken by all map tasks=233128

Total megabyte-seconds taken by all map tasks=358084608

Map-Reduce Framework

Map input records=11620800

Map output records=11620800

Input split bytes=16707

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=9158

CPU time spent (ms)=325900

Physical memory (bytes) snapshot=2222469120

Virtual memory (bytes) snapshot=13310775296

Total committed heap usage (bytes)=1477443584

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=0

16/08/10 18:22:46 INFO mapreduce.ImportJobBase: Transferred 435.5993 MB in 78.3224 seconds (5.5616 MB/sec)

16/08/10 18:22:46 INFO mapreduce.ImportJobBase: Retrieved 11620800 records.

[hdfs@amb2 ~]$ 




Results of the tests above.

                                   Text, uncompressed   Text, snappy   Parquet, uncompressed   Parquet, snappy
Total job & transfer time (sec)         48.8625            49.3399           97.1124               78.3224
Total transferred                       1.3248 GB          771.6916 MB       435.5992 MB           435.5993 MB
Throughput (MB/sec)                     27.763             15.6403           4.4855                5.5616
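The Transferred figures above come from the sqoop/MapReduce counters; a quick way to cross-check what actually landed on HDFS is to compare the size of each target directory (the paths are the --target-dir values used in the commands above).

# on-HDFS size per import directory, human readable
hdfs dfs -du -s -h /dev/test2/data_nc_txt /dev/test2/data_sn_txt /dev/test2/data_nc_pq /dev/test2/data_sn_pq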





Posted by ORACLE,DBA,BIG,DATA,JAVA 흑풍전설

The four file formats available when loading data into hadoop with sqoop


* When moving data from Oracle into Hadoop I have so far used flume, because the data had to be loaded in real time, but batch-style jobs do not really need flume.

On the project I am on now, the data was loaded into hadoop using sqoop.


sqoop was not particularly difficult: the data is transferred with commands from the shell, which I personally found very convenient. The only open question is whether it can be customized the way flume can.


The source is an ordinary HEAP TABLE in ORACLE.

After creating a test table, I inserted 1,000 rows of temporary data.


CREATE TABLE HDFS_4
(
    ID     VARCHAR(100),
    NUM    NUMBER(10),
    TEST   VARCHAR(100),
    REG_DT DATE DEFAULT SYSDATE
);


INSERT INTO HDFS_4
SELECT 'USR_' || Dbms_Random.string('A', 5),
       Trunc(Dbms_Random.Value(10000, 90000)),
       Dbms_Random.string('A', 100),
       SYSDATE + To_Number(Dbms_Random.Value(1, 30))
FROM   DUAL
CONNECT BY LEVEL <= 1000;








1. text plain 


Command

sqoop import --target-dir=/dev/test2_text --query='select *from HDFS_4 where $CONDITIONS' -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID


Result

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_text --query='select *from HDFS_4 where $CONDITIONS' -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:09:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:09:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:09:28 INFO manager.SqlManager: Using default fetchSize of 1000

16/08/10 09:09:28 INFO tool.CodeGenTool: Beginning code generation

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:09:29 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:09:29 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:09:29 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:09:30 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/284d77af5917fa2113d961ae72341cc4/QueryResult.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:09:31 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/284d77af5917fa2113d961ae72341cc4/QueryResult.jar

16/08/10 09:09:31 INFO mapreduce.ImportJobBase: Beginning query import.

16/08/10 09:09:33 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 09:09:33 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 09:09:35 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 09:09:35 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM (select *from HDFS_4 where  (1 = 1) ) t1

16/08/10 09:09:35 WARN db.TextSplitter: Generating splits for a textual index column.

16/08/10 09:09:35 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

16/08/10 09:09:35 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

16/08/10 09:09:35 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 09:09:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0007

16/08/10 09:09:35 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0007

16/08/10 09:09:35 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0007/

16/08/10 09:09:35 INFO mapreduce.Job: Running job: job_1470728284233_0007

16/08/10 09:09:45 INFO mapreduce.Job: Job job_1470728284233_0007 running in uber mode : false

16/08/10 09:09:45 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 09:09:53 INFO mapreduce.Job:  map 25% reduce 0%

16/08/10 09:09:55 INFO mapreduce.Job:  map 75% reduce 0%

16/08/10 09:09:56 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 09:09:56 INFO mapreduce.Job: Job job_1470728284233_0007 completed successfully

16/08/10 09:09:56 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=606092

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=489

HDFS: Number of bytes written=139000

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=57580

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=28790

Total vcore-seconds taken by all map tasks=28790

Total megabyte-seconds taken by all map tasks=44221440

Map-Reduce Framework

Map input records=1000

Map output records=1000

Input split bytes=489

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=2674

CPU time spent (ms)=38860

Physical memory (bytes) snapshot=1292296192

Virtual memory (bytes) snapshot=13235789824

Total committed heap usage (bytes)=689438720

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=139000

16/08/10 09:09:56 INFO mapreduce.ImportJobBase: Transferred 135.7422 KB in 23.6533 seconds (5.7388 KB/sec)

16/08/10 09:09:56 INFO mapreduce.ImportJobBase: Retrieved 1000 records.

[hdfs@amb2 ~]$ 



Viewing the output with cat, the whole content is readable, and you can see that the columns are separated by commas.

It is stored in the same form as CSV.
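To check this quickly, just list the import directory and cat one of the part files (the part-m-00000 name below is the usual map-only output naming and is only illustrative).

# peek at the text-format import
hdfs dfs -ls /dev/test2_text
hdfs dfs -cat /dev/test2_text/part-m-00000 | head -n 3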



2. AVRO 


Command

sqoop import --target-dir=/dev/test2_avro --query='select *from HDFS_4 where $CONDITIONS' --as-avrodatafile  -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID


Result

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_avro --query='select *from HDFS_4 where $CONDITIONS' --as-avrodatafile  -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:15:04 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:15:04 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:15:04 INFO manager.SqlManager: Using default fetchSize of 1000

16/08/10 09:15:04 INFO tool.CodeGenTool: Beginning code generation

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:15:05 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:15:05 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:15:05 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:15:05 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/837dbbb2e304900b1151d2fa6186b0b7/QueryResult.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:15:07 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/837dbbb2e304900b1151d2fa6186b0b7/QueryResult.jar

16/08/10 09:15:07 INFO mapreduce.ImportJobBase: Beginning query import.

16/08/10 09:15:08 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:15:08 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:15:08 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:15:08 INFO mapreduce.DataDrivenImportJob: Writing Avro schema file: /tmp/sqoop-hdfs/compile/837dbbb2e304900b1151d2fa6186b0b7/QueryResult.avsc

16/08/10 09:15:08 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 09:15:08 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 09:15:10 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 09:15:10 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM (select *from HDFS_4 where  (1 = 1) ) t1

16/08/10 09:15:10 WARN db.TextSplitter: Generating splits for a textual index column.

16/08/10 09:15:10 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

16/08/10 09:15:10 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

16/08/10 09:15:10 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 09:15:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0008

16/08/10 09:15:11 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0008

16/08/10 09:15:11 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0008/

16/08/10 09:15:11 INFO mapreduce.Job: Running job: job_1470728284233_0008

16/08/10 09:15:18 INFO mapreduce.Job: Job job_1470728284233_0008 running in uber mode : false

16/08/10 09:15:18 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 09:15:26 INFO mapreduce.Job:  map 25% reduce 0%

16/08/10 09:15:27 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 09:15:28 INFO mapreduce.Job: Job job_1470728284233_0008 completed successfully

16/08/10 09:15:28 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=607148

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=489

HDFS: Number of bytes written=130230

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=49984

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=24992

Total vcore-seconds taken by all map tasks=24992

Total megabyte-seconds taken by all map tasks=38387712

Map-Reduce Framework

Map input records=1000

Map output records=1000

Input split bytes=489

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=1696

CPU time spent (ms)=38110

Physical memory (bytes) snapshot=1327677440

Virtual memory (bytes) snapshot=13322395648

Total committed heap usage (bytes)=721420288

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=130230

16/08/10 09:15:28 INFO mapreduce.ImportJobBase: Transferred 127.1777 KB in 20.4259 seconds (6.2263 KB/sec)

16/08/10 09:15:28 INFO mapreduce.ImportJobBase: Retrieved 1000 records.

[hdfs@amb2 ~]$ 



Viewed with cat, the field delimiters look slightly garbled. 



3. Sequence file


Command executed

sqoop import --target-dir=/dev/test2_seq --query='select *from HDFS_4 where $CONDITIONS' --as-sequencefile  -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID


Execution result

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_seq --query='select *from HDFS_4 where $CONDITIONS' --as-sequencefile  -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:22:31 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:22:31 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:22:31 INFO manager.SqlManager: Using default fetchSize of 1000

16/08/10 09:22:31 INFO tool.CodeGenTool: Beginning code generation

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:22:32 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:22:32 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:22:32 INFO manager.SqlManager: Executing SQL statement: select *from HDFS_4 where  (1 = 0) 

16/08/10 09:22:32 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/3a0d3133347de2dccea51b1c74f948bd/QueryResult.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:22:34 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/3a0d3133347de2dccea51b1c74f948bd/QueryResult.jar

16/08/10 09:22:34 INFO mapreduce.ImportJobBase: Beginning query import.

16/08/10 09:22:36 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 09:22:36 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 09:22:38 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 09:22:38 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM (select *from HDFS_4 where  (1 = 1) ) t1

16/08/10 09:22:38 WARN db.TextSplitter: Generating splits for a textual index column.

16/08/10 09:22:38 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

16/08/10 09:22:38 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

16/08/10 09:22:38 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 09:22:38 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0009

16/08/10 09:22:38 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0009

16/08/10 09:22:39 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0009/

16/08/10 09:22:39 INFO mapreduce.Job: Running job: job_1470728284233_0009

16/08/10 09:22:47 INFO mapreduce.Job: Job job_1470728284233_0009 running in uber mode : false

16/08/10 09:22:47 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 09:22:54 INFO mapreduce.Job:  map 25% reduce 0%

16/08/10 09:22:55 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 09:22:56 INFO mapreduce.Job: Job job_1470728284233_0009 completed successfully

16/08/10 09:22:56 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=605572

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=489

HDFS: Number of bytes written=157788

HDFS: Number of read operations=16

HDFS: Number of large read operations=0

HDFS: Number of write operations=8

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=48586

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=24293

Total vcore-seconds taken by all map tasks=24293

Total megabyte-seconds taken by all map tasks=37314048

Map-Reduce Framework

Map input records=1000

Map output records=1000

Input split bytes=489

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=765

CPU time spent (ms)=27050

Physical memory (bytes) snapshot=1312333824

Virtual memory (bytes) snapshot=13253160960

Total committed heap usage (bytes)=707788800

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=157788

16/08/10 09:22:56 INFO mapreduce.ImportJobBase: Transferred 154.0898 KB in 20.6491 seconds (7.4623 KB/sec)

16/08/10 09:22:56 INFO mapreduce.ImportJobBase: Retrieved 1000 records.

[hdfs@amb2 ~]$ 


Viewed with cat, the file declares "this is a sequence file!" at the top, followed by what looks like slightly garbled characters.



4. Parquet file


Command executed

* The Sqoop version under test is 1.4.6. Note that this command differs a little from the ones above.

The earlier runs all used a free-form query; this one specifies a table instead.

There is a bug at the moment, so I changed it as shown below; searching around at the time of testing, a patch file did exist.

But since I installed the whole stack in one go through Ambari and can't be bothered to hunt things down and patch them one by one, I'll just wait for 1.4.7.

sqoop import --target-dir=/dev/test2_pq --table HDFS_4 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID


Below is the error you get when running it with a query: a Java NullPointerException.

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_pq --query='select *from HDFS_4 where $CONDITIONS' --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:29:29 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:29:29 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:29:29 INFO manager.SqlManager: Using default fetchSize of 1000

16/08/10 09:29:29 INFO tool.CodeGenTool: Beginning code generation

16/08/10 09:29:29 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.NullPointerException

java.lang.NullPointerException

at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:97)

at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)

at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)

at org.apache.sqoop.Sqoop.run(Sqoop.java:148)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)

at org.apache.sqoop.Sqoop.main(Sqoop.java:244)

[hdfs@amb2 ~]$


Execution result 

When run with the table-import form, it fails mid-run with the error below.  

Skimming the message, it looks like a data-type problem, so I decided to create a view over the existing table that only converts the data type, and import that view instead.

16/08/10 09:31:25 INFO mapreduce.Job: Task Id : attempt_1470728284233_0010_m_000002_1, Status : FAILED

Error: org.apache.avro.UnresolvedUnionException: Not in union ["null","long"]: 2016-08-29 21:09:28.0

at org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:561)

at org.apache.avro.generic.GenericData.deepCopy(GenericData.java:941)

at org.apache.avro.generic.GenericData.deepCopy(GenericData.java:922)

at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.copy(DatasetKeyOutputFormat.java:327)

at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.write(DatasetKeyOutputFormat.java:321)

at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$DatasetRecordWriter.write(DatasetKeyOutputFormat.java:300)

at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:658)

at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)

at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)

at org.apache.sqoop.mapreduce.ParquetImportMapper.map(ParquetImportMapper.java:70)

at org.apache.sqoop.mapreduce.ParquetImportMapper.map(ParquetImportMapper.java:39)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)

at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)


Container killed by the ApplicationMaster.

Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143


View created because of the error above, then re-run 

CREATE OR REPLACE VIEW VW_HDFS_4

AS

SELECT ID,NUM,TEST,To_Char(REG_DT,'YYYY-MM-DD HH24:MI:SS') REG_DT

  FROM HDFS_4;


Execution result - in the box below, the black text is the failed run and the colored text is the successful one.

The error occurs because the earlier run had already created the directory and the metadata file in Hadoop, so they already exist.
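If you want to re-run into the same target directory, the Kite dataset created by the previous run (the data files plus the .metadata descriptor) has to be removed first. A minimal sketch, using the paths from this test:

hadoop fs -rm -r /dev/test2_pq

Here I simply pointed the second run at a new directory instead, as the log below shows.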

[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_pq --table VW_HDFS_4 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:36:33 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:36:33 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:36:33 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:36:34 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop will not process this sqoop connection, as "FLASHONE"."VW_HDFS_4" is not an Oracle table, it's a VIEW.

16/08/10 09:36:34 INFO tool.CodeGenTool: Beginning code generation

16/08/10 09:36:34 INFO tool.CodeGenTool: Will generate java class as codegen_VW_HDFS_4

16/08/10 09:36:35 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:36:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0

16/08/10 09:36:35 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/95f85eece904f411d74c356ac450d5b5/codegen_VW_HDFS_4.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:36:36 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/95f85eece904f411d74c356ac450d5b5/codegen_VW_HDFS_4.jar

16/08/10 09:36:36 INFO mapreduce.ImportJobBase: Beginning import of VW_HDFS_4

16/08/10 09:36:37 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:36:38 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:36:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0

16/08/10 09:36:38 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetExistsException: Descriptor directory already exists: hdfs://amb2.local:8020/dev/test2_pq/.metadata

org.kitesdk.data.DatasetExistsException: Descriptor directory already exists: hdfs://amb2.local:8020/dev/test2_pq/.metadata

at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.create(FileSystemMetadataProvider.java:219)

at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:137)

at org.kitesdk.data.Datasets.create(Datasets.java:239)

at org.kitesdk.data.Datasets.create(Datasets.java:307)

at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:107)

at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:89)

at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:108)

at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)

at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)

at org.apache.sqoop.manager.OracleManager.importTable(OracleManager.java:445)

at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)

at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)

at org.apache.sqoop.Sqoop.run(Sqoop.java:148)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)

at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)

at org.apache.sqoop.Sqoop.main(Sqoop.java:244)



[hdfs@amb2 ~]$ sqoop import --target-dir=/dev/test2_pq2 --table VW_HDFS_4 --as-parquetfile -direct --connect jdbc:oracle:thin:@192.168.0.117:1521:ORCL --username flashone --password 1234 --split-by ID

16/08/10 09:36:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258

16/08/10 09:36:56 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

16/08/10 09:36:56 INFO manager.SqlManager: Using default fetchSize of 1000

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.0-258/accumulo/lib/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/08/10 09:36:57 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop will not process this sqoop connection, as "FLASHONE"."VW_HDFS_4" is not an Oracle table, it's a VIEW.

16/08/10 09:36:57 INFO tool.CodeGenTool: Beginning code generation

16/08/10 09:36:57 INFO tool.CodeGenTool: Will generate java class as codegen_VW_HDFS_4

16/08/10 09:36:57 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:36:57 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0

16/08/10 09:36:57 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce

Note: /tmp/sqoop-hdfs/compile/586ec1c10f04fc4c1fdd5986c9fae96b/codegen_VW_HDFS_4.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

16/08/10 09:36:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/586ec1c10f04fc4c1fdd5986c9fae96b/codegen_VW_HDFS_4.jar

16/08/10 09:36:59 INFO mapreduce.ImportJobBase: Beginning import of VW_HDFS_4

16/08/10 09:36:59 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:37:00 INFO manager.OracleManager: Time zone has been set to GMT

16/08/10 09:37:00 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "VW_HDFS_4" t WHERE 1=0

16/08/10 09:37:02 INFO impl.TimelineClientImpl: Timeline service address: http://amb3.local:8188/ws/v1/timeline/

16/08/10 09:37:02 INFO client.RMProxy: Connecting to ResourceManager at amb3.local/192.168.0.143:8050

16/08/10 09:37:04 INFO db.DBInputFormat: Using read commited transaction isolation

16/08/10 09:37:04 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN("ID"), MAX("ID") FROM "VW_HDFS_4"

16/08/10 09:37:04 WARN db.TextSplitter: Generating splits for a textual index column.

16/08/10 09:37:04 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.

16/08/10 09:37:04 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

16/08/10 09:37:04 INFO mapreduce.JobSubmitter: number of splits:4

16/08/10 09:37:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1470728284233_0011

16/08/10 09:37:04 INFO impl.YarnClientImpl: Submitted application application_1470728284233_0011

16/08/10 09:37:04 INFO mapreduce.Job: The url to track the job: http://amb3.local:8088/proxy/application_1470728284233_0011/

16/08/10 09:37:04 INFO mapreduce.Job: Running job: job_1470728284233_0011

16/08/10 09:37:11 INFO mapreduce.Job: Job job_1470728284233_0011 running in uber mode : false

16/08/10 09:37:11 INFO mapreduce.Job:  map 0% reduce 0%

16/08/10 09:37:20 INFO mapreduce.Job:  map 25% reduce 0%

16/08/10 09:37:21 INFO mapreduce.Job:  map 100% reduce 0%

16/08/10 09:37:22 INFO mapreduce.Job: Job job_1470728284233_0011 completed successfully

16/08/10 09:37:22 INFO mapreduce.Job: Counters: 30

File System Counters

FILE: Number of bytes read=0

FILE: Number of bytes written=610560

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=28037

HDFS: Number of bytes written=137608

HDFS: Number of read operations=200

HDFS: Number of large read operations=0

HDFS: Number of write operations=36

Job Counters 

Launched map tasks=4

Other local map tasks=4

Total time spent by all maps in occupied slots (ms)=60614

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=30307

Total vcore-seconds taken by all map tasks=30307

Total megabyte-seconds taken by all map tasks=46551552

Map-Reduce Framework

Map input records=1000

Map output records=1000

Input split bytes=505

Spilled Records=0

Failed Shuffles=0

Merged Map outputs=0

GC time elapsed (ms)=1345

CPU time spent (ms)=41390

Physical memory (bytes) snapshot=1502236672

Virtual memory (bytes) snapshot=13242277888

Total committed heap usage (bytes)=828375040

File Input Format Counters 

Bytes Read=0

File Output Format Counters 

Bytes Written=0

16/08/10 09:37:22 INFO mapreduce.ImportJobBase: Transferred 134.3828 KB in 21.319 seconds (6.3034 KB/sec)

16/08/10 09:37:22 INFO mapreduce.ImportJobBase: Retrieved 1000 records.

[hdfs@amb2 ~]$


Viewed with cat, the first screenshot is the beginning of the file and the second screenshot is the end. 

The type information is written out as well, and looking a bit further you can see the text ~avro.schema. 

As the search results suggested, the schema appears to be handled with Avro.



For this project I intend to use text (CSV) and Parquet.

Now I need to check whether other tools can read the data back properly once it is stored as Parquet.
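One quick way to sanity-check the Parquet output is to read it back, for example from Spark. This is only a sketch (not something run as part of this test), assuming a Spark 1.6 pyspark shell and the target directory used above:

>>> df = sqlContext.read.parquet("hdfs://amb2.local:8020/dev/test2_pq2")
>>> df.printSchema()
>>> df.count()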

Posted by ORACLE,DBA,BIG,DATA,JAVA 흑풍전설

Introduction.

To pull data out of HBase in SQL form, the first thing I used was Phoenix ~

===> http://phoenix.apache.org/

Anyone interested can go take a look,

but I decided to switch.

While digging around for alternatives, I found Spark.

(Older readers may remember there used to be a Korean adult magazine called SPARK; I once bought it at a bookstore with a perfectly straight face. Whether it still exists I haven't checked and don't care.)


Anyway, Spark supports Java as well as Python, which I'd been meaning to study anyway, so it caught my interest,

and seeing research teams and developers working hard in this field rave about it, I started installing it, wondering what the fuss was about.


For personal reasons I only got as far as setting it up; now I'm finally getting around to printing a HelloWorld.


Main part 1

Prerequisite: Hadoop ~

For reference, mine is 2.7.2 (1 master node + 5 data nodes).


Spark, of course, can be downloaded from the Spark site.

At the time of writing, 1.6.1 was out (when I first looked, it was 1.5~).

Anyway ~ 

just download it and unpack it.

Since I'm going to hook it straight up to Hadoop, I downloaded the build pre-compiled for Hadoop.


1. Download

Spark site: http://spark.apache.org 

Go to the download menu ~


You'll see something like the screenshot below,


and in item 2 I chose the build for Hadoop as shown below (the Hadoop I installed is 2.7.2; another new version came out while I was writing this, so I upgraded right away).



Main part 2

1. Printing a HelloWorld.

If this were application development I'd just print something to the console,

but since this is about data, I'll put one sample file into Hadoop, check its row count, and call it done.


2. Create any text file and put it into Hadoop.

Just in case, I'll record how to put a file into Hadoop as well, since I'm forgetful.


$vi testman.txt 

blah blah ~ 

blah blah ~


After saving:

$hadoop fs -mkdir /test <-- create the directory

$hadoop fs -put testman.txt /test <-- push the local OS file into Hadoop (local source first, HDFS target directory second)



$cd <spark home directory>

$./bin/pyspark 

Python 2.7.5 (default, Nov 20 2015, 02:00:19) 

[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2

Type "help", "copyright", "credits" or "license" for more information.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

16/03/15 22:20:55 INFO SparkContext: Running Spark version 1.6.1

16/03/15 22:20:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

16/03/15 22:20:57 INFO SecurityManager: Changing view acls to: hadoop

16/03/15 22:20:57 INFO SecurityManager: Changing modify acls to: hadoop

16/03/15 22:20:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)

16/03/15 22:20:58 INFO Utils: Successfully started service 'sparkDriver' on port 49115.

16/03/15 22:20:59 INFO Slf4jLogger: Slf4jLogger started

16/03/15 22:20:59 INFO Remoting: Starting remoting

16/03/15 22:21:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.0.111:41015]

16/03/15 22:21:00 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 41015.

16/03/15 22:21:00 INFO SparkEnv: Registering MapOutputTracker

16/03/15 22:21:00 INFO SparkEnv: Registering BlockManagerMaster

16/03/15 22:21:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b918f3eb-05d7-4f85-a0cb-678f3327e8b6

16/03/15 22:21:00 INFO MemoryStore: MemoryStore started with capacity 511.5 MB

16/03/15 22:21:00 INFO SparkEnv: Registering OutputCommitCoordinator

16/03/15 22:21:01 INFO Utils: Successfully started service 'SparkUI' on port 4040.

16/03/15 22:21:01 INFO SparkUI: Started SparkUI at http://192.168.0.111:4040

16/03/15 22:21:01 INFO Executor: Starting executor ID driver on host localhost

16/03/15 22:21:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37043.

16/03/15 22:21:01 INFO NettyBlockTransferService: Server created on 37043

16/03/15 22:21:01 INFO BlockManagerMaster: Trying to register BlockManager

16/03/15 22:21:01 INFO BlockManagerMasterEndpoint: Registering block manager localhost:37043 with 511.5 MB RAM, BlockManagerId(driver, localhost, 37043)

16/03/15 22:21:01 INFO BlockManagerMaster: Registered BlockManager

Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1

      /_/


Using Python version 2.7.5 (default, Nov 20 2015 02:00:19)

SparkContext available as sc, HiveContext available as sqlContext.

>>> xx = sc.hadoopFile("hdfs://192.168.0.111:9000/test/testman.txt","org.apache.hadoop.mapred.TextInputFormat","org.apache.hadoop.io.Text","org.apache.hadoop.io.LongWritable") <-- the documentation says to write it this way, so for now I just followed it as-is

16/03/15 22:30:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB)

16/03/15 22:30:19 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.5 KB)

16/03/15 22:30:19 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37043 (size: 13.9 KB, free: 511.5 MB)

16/03/15 22:30:19 INFO SparkContext: Created broadcast 0 from hadoopFile at PythonRDD.scala:613

16/03/15 22:30:20 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 166.7 KB, free 334.2 KB)

16/03/15 22:30:20 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 13.0 KB, free 347.2 KB)

16/03/15 22:30:20 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:37043 (size: 13.0 KB, free: 511.5 MB)

16/03/15 22:30:20 INFO SparkContext: Created broadcast 1 from broadcast at PythonRDD.scala:570

16/03/15 22:30:21 INFO FileInputFormat: Total input paths to process : 1

16/03/15 22:30:21 INFO SparkContext: Starting job: take at SerDeUtil.scala:201

16/03/15 22:30:21 INFO DAGScheduler: Got job 0 (take at SerDeUtil.scala:201) with 1 output partitions

16/03/15 22:30:21 INFO DAGScheduler: Final stage: ResultStage 0 (take at SerDeUtil.scala:201)

16/03/15 22:30:21 INFO DAGScheduler: Parents of final stage: List()

16/03/15 22:30:21 INFO DAGScheduler: Missing parents: List()

16/03/15 22:30:21 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at PythonHadoopUtil.scala:181), which has no missing parents

16/03/15 22:30:21 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.3 KB, free 350.5 KB)

16/03/15 22:30:21 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1980.0 B, free 352.5 KB)

16/03/15 22:30:21 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:37043 (size: 1980.0 B, free: 511.5 MB)

16/03/15 22:30:21 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006

16/03/15 22:30:21 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at PythonHadoopUtil.scala:181)

16/03/15 22:30:21 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks

16/03/15 22:30:21 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,ANY, 2144 bytes)

16/03/15 22:30:21 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

16/03/15 22:30:21 INFO HadoopRDD: Input split: hdfs://192.168.0.111:9000/test/testman.txt:0+60

16/03/15 22:30:21 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

16/03/15 22:30:21 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

16/03/15 22:30:21 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap

16/03/15 22:30:21 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition

16/03/15 22:30:21 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id

16/03/15 22:30:22 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2193 bytes result sent to driver

16/03/15 22:30:22 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 502 ms on localhost (1/1)

16/03/15 22:30:22 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 

16/03/15 22:30:22 INFO DAGScheduler: ResultStage 0 (take at SerDeUtil.scala:201) finished in 0.544 s

16/03/15 22:30:22 INFO DAGScheduler: Job 0 finished: take at SerDeUtil.scala:201, took 0.755415 s

>>> 16/03/15 22:30:45 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:37043 in memory (size: 1980.0 B, free: 511.5 MB)

16/03/15 22:30:45 INFO ContextCleaner: Cleaned accumulator 2


>>> xx.count() <-- print the count of what was just read

16/03/15 22:32:49 INFO SparkContext: Starting job: count at <stdin>:1

16/03/15 22:32:49 INFO DAGScheduler: Got job 1 (count at <stdin>:1) with 2 output partitions

16/03/15 22:32:49 INFO DAGScheduler: Final stage: ResultStage 1 (count at <stdin>:1)

16/03/15 22:32:49 INFO DAGScheduler: Parents of final stage: List()

16/03/15 22:32:49 INFO DAGScheduler: Missing parents: List()

16/03/15 22:32:49 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[3] at count at <stdin>:1), which has no missing parents

16/03/15 22:32:49 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 6.0 KB, free 353.2 KB)

16/03/15 22:32:49 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 3.7 KB, free 356.9 KB)

16/03/15 22:32:49 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:37043 (size: 3.7 KB, free: 511.5 MB)

16/03/15 22:32:49 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

16/03/15 22:32:49 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (PythonRDD[3] at count at <stdin>:1)

16/03/15 22:32:49 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks

16/03/15 22:32:49 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,ANY, 2144 bytes)

16/03/15 22:32:49 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, localhost, partition 1,ANY, 2144 bytes)

16/03/15 22:32:49 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)

16/03/15 22:32:49 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)

16/03/15 22:32:49 INFO HadoopRDD: Input split: hdfs://192.168.0.111:9000/test/testman.txt:0+60

16/03/15 22:32:49 INFO HadoopRDD: Input split: hdfs://192.168.0.111:9000/test/testman.txt:60+61

16/03/15 22:32:50 INFO PythonRunner: Times: total = 661, boot = 627, init = 33, finish = 1

16/03/15 22:32:50 INFO PythonRunner: Times: total = 633, boot = 616, init = 15, finish = 2

16/03/15 22:32:50 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 2179 bytes result sent to driver

16/03/15 22:32:50 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 734 ms on localhost (1/2)

16/03/15 22:32:50 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2179 bytes result sent to driver

16/03/15 22:32:50 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 750 ms on localhost (2/2)

16/03/15 22:32:50 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 

16/03/15 22:32:50 INFO DAGScheduler: ResultStage 1 (count at <stdin>:1) finished in 0.752 s

16/03/15 22:32:50 INFO DAGScheduler: Job 1 finished: count at <stdin>:1, took 0.778757 s

27  <--- the result is printed.

>>>


Conclusion 

HelloWorld done ~
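For reference, the same count can be done more simply with sc.textFile, which picks the InputFormat and key/value types for you. A minimal sketch, assuming the same file path as above; it should return the same count:

>>> lines = sc.textFile("hdfs://192.168.0.111:9000/test/testman.txt")
>>> lines.count()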


Note: 

In my case Spark sits on the Hadoop master, so the addresses are the same.

When pyspark starts up, you can see it mention a web admin page at http://192.168.0.111:4040/stages.

It's labeled SparkUI, and it's a kind of monitoring tool.


In the clean UI below, the job that just read the file from Hadoop shows up. 

The picture below also shows the count job ~ 

Oh, nice!




The details of the count command I ran are also shown; it even draws a pretty diagram of what you did and what happened.

Public benchmarks floating around the internet, the ones that say things like "reading ten billion rows on N nodes took X minutes", come out looking like the view below.

The design isn't bad for dropping into a report either.



Posted by ORACLE,DBA,BIG,DATA,JAVA 흑풍전설

In any case, this is not something the DB can do alone.
Once a well-designed data structure sits on a good DB (not just an expensive one, but the right database in the right place),
a well-built program still has to act as a good interface before good data gets created.

If only the DB is great, it's hard even to get data in;
if only the application is great, it's hard to get data out and make use of it.

Development and the DB have to work in harmony; they shouldn't be at odds.

Anyway.
For data insert and select, I'm going to run a data-manipulation test using Hive.

Prerequisites:
1. The libraries inside hive-0.8.1-bin.tar.gz
2. A development tool (Eclipse in my case)
3. Any WAS (Tomcat in my case - let's just call it a WAS)

 

1. Setup (adding the related libraries)


(At first I added everything except libfb and the two slf4j jars; nothing ran, no useful error showed up, and I couldn't tell why. Be sure to add them.)

Libraries that need to be added in Eclipse:

hive-jdbc-<version>.jar

hive-exec-<version>.jar

hive-metastore-<version>.jar

hive-service-<version>.jar

hadoop-core-<version>.jar

commons-logging-<version>.jar

log4j-<version>.jar

libfb<version>.jar

slf4j-api-<version>.jar

slf4j-log4j12-<version>.jar

2. Running the Hive server

After finishing the work of connecting Hive to MySQL, now start the Hive service. 
(HIVE_HOME is already set, the bin directory is on the PATH, and the metastore is backed by MySQL.) 

[hadoop@master1 ~]$hive --service hiveserver &
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Hive history file=/home/hadoop/hive/log/tansactionhist/hive_job_log_hadoop_201208041919_1622721562.txt

The command below is just to watch what gets written to the log file.

[hadoop@master1 ~]$tail -f  /home/hadoop/hive/log/tansactionhist/hive_job_log_hadoop_201208041919_1622721562.txt

 

3. Test (the test follows the content at https://cwiki.apache.org/Hive/hiveclient.html)

The source used for the test is on apache.org, so you can take it as-is.

Below is the source from that site; only the settings related to my server have been changed. 

 

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
  private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

  /**
* @param args
* @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
      try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
      System.exit(1);
    }
    Connection con = DriverManager.getConnection("jdbc:hive://192.168.0.141:10000/default", "", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.executeQuery("drop table " + tableName);
    ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
    // describe table
    sql = "describe " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }

    // load data into table
    // NOTE: filepath has to be local to the hive server
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/home/hadoop/test.txt"; // <---- note this path; explanation continues below.
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);

    // select * query
    sql = "select * from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }
}

}

The file at the path and filename above must be created on the machine where the Hive server is running.
Even if you run Tomcat locally, the file does not go on your local machine; the path is on the Hive server.
Also, when I simply typed the contents from apache.org into vi, it wasn't recognized; run the shell commands exactly as given in the example.

 

 

[hadoop@master1 ~]$echo -e '1\x01foo' > /home/hadoop/test.txt
[hadoop@master1 ~]$echo -e '2\x01bar' >> /home/hadoop/test.txt
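To double-check the separator (optional), cat -v prints the Ctrl-A character as ^A, so the file should look something like this:

[hadoop@master1 ~]$cat -v /home/hadoop/test.txt
1^Afoo
2^Abar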

Anyway, after creating the file as above and running the program:

 

4. Results

* Below is what was written to the log file 

Hive history file=/home/hadoop/hive/log/tansactionhist/hive_job_log_hadoop_201208041919_859275775.txt
OK
OK
OK
OK
Copying data from file:/home/hadoop/test.txt
Copying file: file:/home/hadoop/test.txt
Loading data to table default.testhivedrivertable
OK
OK

 

 

* Below is what was printed to the console

Running: show tables 'testHiveDriverTable'
testhivedrivertable
Running: describe testHiveDriverTable
key int
value string
Running: load data local inpath '/home/hadoop/test.txt' into table testHiveDriverTable
Running: select * from testHiveDriverTable
1 foo
2 bar 

 

 

* Result of going into hive and running a select

[hadoop@master1 ~]$ hive
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in file:/usr/local/hive/conf/hive-log4j.properties
Hive history file=/home/hadoop/hive/log/tansactionhist/hive_job_log_hadoop_201208042146_1805362014.txt
hive> select *From testHiveDriverTable;
OK
1       foo
2       bar
Time taken: 3.097 seconds
hive>

 

 

* The Hadoop admin UI (showing the test file added)

 

Posted by ORACLE,DBA,BIG,DATA,JAVA 흑풍전설

 

While poking around with HBase, I came across Pig and Hive.
Pig and Hive are actually words I had already seen while looking at Hadoop.

Of the two: Hive!
From what I found here and there, it was developed at Facebook and is now carried on as an open-source project (whether that's accurate, I don't know).
After basic setup, running Hive feels somewhat like a DB.
It's similar to the SQL I work with every day (except that INSERT is some kind of load thing... what is that), so
I decided to do this and that with Hive and set it up.

Looking at the Apache Hive documentation:

https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration

Download the tarball, unpack it, set the paths, and it is recognized right away — but there was a problem.

First, install from the tarball and test it.

The requirements are a machine with Hadoop installed, plus the tarball itself.

A machine with Hadoop installed already has Hadoop and the JDK, so testing should go without trouble.
And the Hive tarball [source: http://apache.tt.co.kr/hive/ ] (only now do I notice that tt.co.kr is the hosting company Today & Tomorrow... so they mirror this kind of thing too).

1. Unpack Hive and set the paths

You could probably figure this part out on your own...
Without HIVE_HOME, a stream of Java errors pours out.

HIVE_HOME=/usr/local/hive
PATH=$HIVE_HOME/bin
export HIVE_HOME PATH

 

2. Run it.

As you can see, it's dead easy.

[hadoop@master1 ~]$ hive
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in file:/usr/local/hive/conf/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201207311732_843316443.txt
hive>

3. Create a table, check it, and hit a problem!

 

Below is an example of creating a table and then checking it.

hive> show tables;
OK
Time taken: 6.199 seconds
hive> create table first_tbl (col1 int , col2 string);
OK
Time taken: 0.299 seconds
hive> show tables;
OK
first_tbl
Time taken: 0.084 seconds
hive> describe first_tbl;
OK
col1    int
col2    string
Time taken: 0.147 seconds
hive>

Then I kept going as below.

As you can see, a metastore_db directory gets created under whatever directory you ran hive from.
Not knowing that, I was completely baffled about why my tables kept disappearing, until I spotted it.

The data I created is actually still in Hadoop, but when you check with show tables in Hive it isn't visible.
So the metastore is presumably the place where the basic metadata, such as the table list, is stored.
(There's surely a configuration angle to this, but I might as well keep the metadata in a proper DB anyway...)

Based on that, I found that this meta information can be kept in a real DB, so I'll put it in MySQL.
As someone in love with Oracle, MySQL isn't exactly welcome;
I just didn't like the CLOBs that get created in Oracle.
(The docs say that on Oracle the metastore uses the LONG type,
which, as you know, is no longer used, and the patch that followed
switches it to CLOB; since this is just metadata and I want
snappy connection response, I decided on MySQL.)

hive> quit;
[hadoop@master1 ~]$ ls
derby.log            hive            lost+found    test2.txt   TEST.java  URLCAT2.class  URLCAT.class  work
hadoop-1.0.3.tar.gz  hive_save1.txt  metastore_db  TEST.class  test.txt   URLCAT2.java   URLCAT.java   zookeeper
[hadoop@master1 ~]$ cd /usr/local/hive
[hadoop@master1 hive]$ hive
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in file:/usr/local/hive/conf/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201207311736_308605943.txt
hive> show tables;
OK
Time taken: 6.093 seconds
hive>

 

4. Installing MySQL and creating the user.

The latest source version on mysql.com at this point is 5.5.25a (wow, they've bumped the version a lot).
Having installed and used MySQL since 3.0, I can see how much has changed.
Memories of installing this and that extra module on Linux during setup flashed through my head... makes me want a drink.

I decided to install it on the hbase1 server (192.168.0.144). HBase itself is... on hold for now.

[root@hbase1 mysql-5.5.25a]# groupadd mysql
[root@hbase1 mysql-5.5.25a]# useradd -r -g mysql mysql 
[root@hbase1 Desktop]# cd ..
[root@hbase1 Desktop]# mv mysql-5.5.25a/ /usr/local/mysql
[root@hbase1 Desktop]# cd /usr/local/mysql
[root@hbase1 mysql]# chown -R mysql .
[root@hbase1 mysql]# chgrp -R mysql .
[root@hbase1 mysql]# ./scripts/mysql_install_db --user=mysql
Installing MySQL system tables...
OK
Filling help tables...
OK

To start mysqld at boot time you have to copy
support-files/mysql.server to the right place for your system

PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER !
To do so, start the server, then issue the following commands:

./bin/mysqladmin -u root password 'new-password'
./bin/mysqladmin -u root -h hbase1 password 'new-password'

Alternatively you can run:
./bin/mysql_secure_installation

which will also give you the option of removing the test
databases and anonymous user created by default.  This is
strongly recommended for production servers.

See the manual for more instructions.

You can start the MySQL daemon with:
cd . ; ./bin/mysqld_safe &

You can test the MySQL daemon with mysql-test-run.pl
cd ./mysql-test ; perl mysql-test-run.pl

Please report any problems with the ./bin/mysqlbug script!

[root@hbase1 mysql]#

[root@hbase1 mysql]# chown -R root .
[root@hbase1 mysql]# chown -R mysql data
[root@hbase1 mysql]# cp support-files/my-medium.cnf /etc/my.cnf
[root@hbase1 mysql]#
[root@hbase1 mysql]# ./bin/mysqld_safe --user=mysql &
[1] 6038
[root@hbase1 mysql]# 120731 18:21:29 mysqld_safe Logging to '/usr/local/mysql/data/hbase1.err'.
120731 18:21:29 mysqld_safe Starting mysqld daemon with databases from /usr/local/mysql/data

[root@hbase1 mysql]#

[root@hbase1 mysql]# ./bin/mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.5.25a-log MySQL Community Server (GPL)

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>
mysql> create user 'hivemeta'@'<hostname>' identified by '1234';
Query OK, 0 rows affected (0.00 sec)

mysql> grant all privileges on *.* to 'hivemeta'@'<hostname>';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql>  

 

There is a hive-default.xml in Hive's conf directory, but the log says that file isn't used.
A quick search suggests copying it to hive-site.xml should do (a sketch follows below).
After that, edit the entries below.
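For example, assuming the install path used above:

[hadoop@master1 ~]$ cp /usr/local/hive/conf/hive-default.xml /usr/local/hive/conf/hive-site.xml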

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hbase1:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hivemeta</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>1234</value>
  <description>password to use against metastore database</description>
</property>
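One step that is easy to miss (and not shown above): for com.mysql.jdbc.Driver to be found, the MySQL Connector/J jar has to be on Hive's classpath, typically by dropping it into Hive's lib directory. A sketch only; the exact jar name depends on the Connector/J version you download:

[hadoop@master1 ~]$ cp mysql-connector-java-<version>-bin.jar /usr/local/hive/lib/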

 

5. Test

[hadoop@master1 conf]$ hive
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in file:/usr/local/hive/conf/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201207312003_1038190772.txt
hive> create table babo (col1 int , col2 string);
OK
Time taken: 3.012 seconds
hive>

 

Checking from MySQL:

mysql> select *From TBLS;
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
|      6 |  1343732091 |     1 |                0 | hadoop |         0 |     6 | test_tbl | MANAGED_TABLE | NULL               | NULL               |
|     11 |  1343732650 |     1 |                0 | hadoop |         0 |    11 | babo     | MANAGED_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
2 rows in set (0.00 sec)

 mysql>

mysql> select *from COLUMNS_V2;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
|     6 | NULL    | col1        | int       |           0 |
|     6 | NULL    | col2        | string    |           1 |
|    11 | NULL    | col1        | int       |           0 |
|    11 | NULL    | col2        | string    |           1 |
+-------+---------+-------------+-----------+-------------+
4 rows in set (0.00 sec)
mysql>
Posted by ORACLE,DBA,BIG,DATA,JAVA 흑풍전설

After installing Hadoop and uploading a few files, there isn't much to do.
Beyond MapReduce tests and tests from a Java console, there wasn't much to do right away.

Thinking I should build a DB on top of Hadoop and actually do something with it, I looked around and found HBase.

People say Google uses this sort of thing, but that's not my concern.
To try it out and run various tests, I need to install it first.

I'll put HBase on my high-performance(?) server alongside the previously installed Hadoop.

As far as I can tell, the setup looks like the diagram below; if not, someone will surely leave a comment.

Think of it as putting HBase on top of Hadoop's HDFS, with one caveat:
the distribution between the HBase nodes is handled by ZooKeeper.

 

As you can see above, built out as real servers that means six machines.
Of course it could all be installed locally, but since it could be laid out like that, I'll try it that way.
I actually wanted to split ZooKeeper onto separate servers too, but something kept erroring out, so that's for later... if it doesn't work, so be it.

Prepare three more VMware images with the spec below. (Originally they were for a Cloudera HA test, but I'm not going to run a cloud service and wasn't that interested, so I dropped it; I'm reusing the images prepared back then.)

 

With three more configured like that on top of the existing Hadoop, I end up running six images as shown below...

The screenshot above just shows how much everything will lag once it's all installed...
Now let's get on with the actual installation.

 

1. VMware configuration
CPU : 2 cores
RAM : 2GB
HDD : 20GB * 2
Prepare three Linux VMware images configured as above.

 

2. IP assignment.
The existing Hadoop (HDFS) was installed earlier, so just confirm its IPs.
hadoop (master) : 192.168.0.141
hadoop (slave1) : 192.168.0.142
hadoop (slave2) : 192.168.0.143

New IPs for this HBase setup.
The value in parentheses next to each IP is the hostname.
hbase (master) : 192.168.0.144 (hbase1)
hbase (slave1) : 192.168.0.145 (hbase2)
hbase (slave2) : 192.168.0.146 (hbase3)

ZooKeeper, which appeared briefly in the diagram above, seems to be the system that lets HBase be managed in a distributed fashion.
In other words, it communicates with each of the servers.
Judging by the install docs, you don't strictly need SSH configuration between the servers.
They presumably talk to each other over specific ports, but for ease of installation, setting up SSH anyway can't hurt, can it?

 

3. Getting the sources (for each of the three servers, of course)
Everything is on http://www.apache.org; there's nothing it doesn't have.
The versions I downloaded are below.

hbase : http://apache.tt.co.kr/hbase/ => hbase-0.94.0.tar.gz
zookeeper : http://apache.tt.co.kr/zookeeper/ => zookeeper-3.3.5.tar.gz
java : http://java.oracle.com => 1.7 (the JDK used when installing Hadoop)

 

4. Installing ZooKeeper.
As with Oracle 11g, the basic pattern is the same:
before the DB goes up, you build the cluster layer first and then put the DB on top of it.

The owner stays the hadoop account, continuing from before; use whatever account you like.

 

a. Unpack ZooKeeper and move it to /usr/local/zookeeper.

 

b. Copy zoo_sample.cfg in ZooKeeper's conf directory to zoo.cfg in the same location.
What I modified or added compared to the defaults are dataDir, server.1, server.2, and server.3.
* Something I only realized later: nothing particularly large accumulates in dataDir, but the log files end up bigger than expected, so a situation may come where a new disk has to be allocated for them.
* I got errors when starting it up after installation, and while digging around I found that some values were oddly being parsed together with their comments, so I deleted all the comments.
* If you look at the start script you'll see it keeps picking up the variable together with the comment text, which annoyed me enough to strip the comments entirely.
* The server.1 entries below are the list of ensemble servers; 2888 and 3888 are the ports they use.

 

[hadoop@hbase2 conf]$ cat zoo.cfg
tickTime=2000
initLimit=5
syncLimit=3
dataDir=/home/hadoop/zookeeper/data
clientPort=2181
server.1=hbase1:2888:3888
server.2=hbase2:2888:3888
server.3=hbase3:2888:3888

 

c. Create the /home/hadoop/zookeeper/data directory configured above and create a single file named myid in it.
Put a number inside the file: the number from the corresponding server.N entry in the config file above (for server.1, the number 1). See the example commands below.
* Leaving the number out produced a Java error. It really has to be a number.

On the 192.168.0.144 (hbase1) server, put 1 in /home/hadoop/zookeeper/data/myid,
on the 192.168.0.145 (hbase2) server, put 2,
and on the 192.168.0.146 (hbase3) server, put 3.
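For example, something like this on each server, with the number matching its own server.N entry:

[hadoop@hbase1 ~]$ mkdir -p /home/hadoop/zookeeper/data
[hadoop@hbase1 ~]$ echo 1 > /home/hadoop/zookeeper/data/myid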

 

d. Startup
Configuration is done; now start it up.
* Since the PATH is set, I start it as in the box below; I don't bother typing the full path.
/usr/local/zookeeper/bin/zkServer.sh start
* Like Hadoop, I assumed that starting it on the master would start everything, and kept wondering why nothing came up (you have to start it on every server individually).
I was chasing errors in the log and .out files and they kept coming; starting each node manually made the errors go away...

Server 1
[hadoop@hbase1 ~]$ zkServer.sh start
JMX enabled by default
Using config: /usr/local/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED

Server 2
[hadoop@hbase2 ~]$ zkServer.sh start
JMX enabled by default
Using config: /usr/local/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED

Server 3
[hadoop@hbase3 ~]$ zkServer.sh start
JMX enabled by default
Using config: /usr/local/zookeeper/conf/zoo.cfg
Starting zookeeper ... STARTED


If QuorumPeerMain shows up as below (4622 is just the PID, ignore it), everything is normal.
[hadoop@hbase3 ~]$ jps
5199 Jps
4622 QuorumPeerMain
[hadoop@hbase3 ~]$
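As an extra check (optional), zkServer.sh status on each node should report one leader and two followers once the ensemble has formed:

[hadoop@hbase1 ~]$ zkServer.sh status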

 

e. Installing HBase

Nothing special; unpack it and move it as well.
My path is /usr/local/hbase.

 

f. HBase configuration

1. hbase-env.sh 

Honestly, I didn't change much below.
All I did, just in case, was uncomment export JAVA_HOME and point it at my Java path.
* Digging through the shell scripts, I confirmed that the environment settings are loaded before HBase starts, so if
JAVA_HOME is set correctly on the server itself there's no problem.

A few sites I checked say you have to set it; presumably in those cases it wasn't being loaded, or there was some other issue...

Edit /usr/local/hbase/conf/hbase-env.sh (uncomment the lines and change the values).

export JAVA_HOME="<java path>"
export HBASE_CLASSPATH="<conf path>"
export HBASE_MANAGES_ZK=false

2. hbase-site.xml
Really, this is the only file I actually modified.
At first it is empty (apart from comments),

so I decided to fill it in one entry at a time.
I added all of the properties below.

<configuration>

<!-- Hadoop (HDFS) location -->
<property>
        <name>hbase.rootdir</name>
        <value>hdfs://master1:9000/hbase</value>
</property>

<!-- HBase master port; I'm not sure of its exact purpose -->
<property>
        <name>hbase.master</name>
        <value>hbase1:60000</value>
</property>

<!-- Tells HBase it will run in distributed mode -->
<property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
</property>
<!-- Tells it how many replicas there will be -->

<property>
        <name>dfs.replication</name>
        <value>3</value>
</property>

<!-- 
ZooKeeper setting - this was the troublesome one. Some documents (unofficial ones) say ZooKeeper can run on separate servers, but when I put it elsewhere and listed those hostnames in the value below, start-hbase.sh went looking for HBase on those servers.
It obviously isn't there, so it errored out, which is why ZooKeeper ended up co-located with HBase. 
-->
<property>
        <name>hbase.zookeeper.quorum</name>
        <value>hbase1,hbase2,hbase3</value>
</property>

<!-- Port setting 1 -->
<property>
        <name>hbase.zookeeper.peerport</name>
        <value>2888</value>
</property>

<!-- Port setting 2 -->
<property>
        <name>hbase.zookeeper.leaderport</name>
        <value>3888</value>
</property>
</configuration>

 

g. Start

/usr/local/hbase/bin/start-hbase.sh   # start
hbase2: starting zookeeper, logging to /usr/local/hbase/bin/../logs/hbase-hadoop-zookeeper-hbase2.out
hbase1: starting zookeeper, logging to /usr/local/hbase/bin/../logs/hbase-hadoop-zookeeper-hbase1.out
hbase3: starting zookeeper, logging to /usr/local/hbase/bin/../logs/hbase-hadoop-zookeeper-hbase3.out
starting master, logging to /usr/local/hbase/logs/hbase-hadoop-master-hbase1.out
hbase3: starting regionserver, logging to /usr/local/hbase/bin/../logs/hbase-hadoop-regionserver-hbase3.out
hbase2: starting regionserver, logging to /usr/local/hbase/bin/../logs/hbase-hadoop-regionserver-hbase2.out

* After repeatedly starting and stopping, a Java error appeared.
* It still worked normally, though. 
* The error that popped up was "Address already in use"; my guess is that HBase is also trying to manage ZooKeeper itself...
* In any case, it works.

/usr/local/hbase/bin/stop-hbase.sh   # stop 
stopping hbase.........
hbase2: no zookeeper to stop because kill -0 of pid 4768 failed with status 1
hbase3: no zookeeper to stop because kill -0 of pid 4701 failed with status 1
hbase1: no zookeeper to stop because kill -0 of pid 5520 failed with status 1

 

Check
Server master
[hadoop@hbase1 ~]$ jps
5591 HMaster
6263 Jps
5318 QuorumPeerMain

Server slave1
[hadoop@hbase2 data]$ jps
5722 Jps
5600 HRegionServer
4683 QuorumPeerMain
[hadoop@hbase2 data]$

 

h. Test

[hadoop@hbase1 bin]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.0, r1332822, Tue May  1 21:43:54 UTC 2012

hbase(main):001:0>

I followed the test from another site. ( http://confluence.openflamingo.org/pages/viewpage.action?pageId=754451 )

hbase(main):001:0> create 'testtbl' , {name=>'col1'},{name=>'col2'},{name=>'col3'}
NameError: undefined local variable or method `name' for #<Object:0x70faf7c7>

hbase(main):002:0> create 'testtbl', {NAME=>'COL1'},{NAME=>'COL2'},{NAME=>'COL3'}
0 row(s) in 1.7030 seconds

hbase(main):003:0> put 'testtbl','10001','COL1','10001',20090929
0 row(s) in 0.1530 seconds

hbase(main):004:0> put 'testtbl','10001','COL2','30',20090929
0 row(s) in 0.0290 seconds

hbase(main):005:0> put 'testtbl','10001','COL3','TESTMAN',20090929
0 row(s) in 0.0240 seconds

hbase(main):007:0> get 'testtbl' ,'10001'
COLUMN                           CELL                                                                                      
 COL1:                           timestamp=20090929, value=10001                                                           
 COL2:                           timestamp=20090929, value=30                                                              
 COL3:                           timestamp=20090929, value=TESTMAN                                                         
3 row(s) in 0.0290 seconds

hbase(main):008:0> list
TABLE                                                                                                                      
testtbl                                                                                                                    
1 row(s) in 0.0370 seconds

hbase(main):009:0>

 

i. Done.
The DB layer is complete; HBase is now running on top of HDFS.
So, can you actually see it?
Looking at the admin screen provided by Hadoop, hbase is visible.

See it above? hbase and the testtbl created during the test are both visible.
So Hadoop ends up managing these files.
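The same thing can be confirmed from the command line as well (a quick sketch):

[hadoop@master1 ~]$ hadoop fs -ls /hbase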

Now that the DB is set up, it's time to get on with some actual development.
It's been a while since I last did any development; let's see how that goes...

Posted by ORACLE,DBA,BIG,DATA,JAVA 흑풍전설

HADOOP installation.

On a single server-grade PC, I decided to start by installing Hadoop, following the documents scattered around the web.
To test flexible integration with an RDB and to run stress tests, installation comes first.

1. Preparing the VMware OS
CPU : 2
RAM : 2GB
HDD : 20GB
Prepare three machines with the above settings.

2. Basic installation prerequisites (all the latest versions as of 2012-06-18).
OS : Linux (CentOS 5.7)
apache-hadoop : 1.0.3
jdk : 1.7.0_05 ( 64bit )

 


3. Before installing (this is the layout I intend to build).

As when installing Oracle RAC, first set up SSH so each of the three machines can log in to the others without a password,
then connect them in a 1:N structure.


IP : 192.168.0.141 ( master )
IP : 192.168.0.142 ( slave1 )
IP : 192.168.0.143 ( slave2 )
The master/slave names will be used as the hostnames (that seems to be what everyone else does...).

Both the install files and the memory footprint of a basic installation are smaller than I expected.
There's no GUI like Oracle provides, but I rather like the somewhat hardcore, open-source style of installation. 

 

4. Common base installation (do the parts that apply to every OS image first, then clone).

a. Unpack JDK 1.7 into the desired location. ( /usr/local/java/j2se ) 
b. Unpack the Hadoop tarball into the desired location. ( /usr/local/hadoop )
c. Edit the configuration files.

- Editing the hosts file

[hadoop@master conf]$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1                 localhost.localdomain localhost
192.168.0.141           master
192.168.0.142           slave1
192.168.0.143           slave2

    - profile, or the relevant user's profile (I'll just apply it globally)
    Add the following to /etc/profile

    HADOOP_HOME=/usr/local/hadoop
    JAVA_HOME=/usr/local/java/j2se
    CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib/ext
    PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
    export HADOOP_HOME_WARN_SUPPRESS=" " # added after searching the web because an error came up during the format step.
    export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC JAVA_HOME CLASSPATH HADOOP_HOME
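    After editing, a quick way to confirm the new variables are picked up (a sketch):

    $>source /etc/profile
    $>java -version
    $>hadoop version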

 

 

d. Creating directories
   
 -  Create the namenode, work, and data directories. (Strictly speaking the namenode only needs name and the data nodes only need data, but out of laziness I created them all in one place and just cloned the VM image.)

     $>mkdir -p /home/hadoop/work/mapred/system
    $>mkdir -p /home/hadoop/work/data
    $>mkdir -p /home/hadoop/work/name
    $>chmod 755 /home/hadoop/work/data 
    # Why the chmod above? 
    
    The namenode wouldn't come up and kept throwing errors; after some searching I fixed the permissions.
    If you just run Hadoop without the step above,

    you get log4j errors like the ones below. If you ever see them, make sure to do this.
     WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
     WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/home/hadoop/work/mapred/system/jobtracker.info" - Aborting...
     WARN org.apache.hadoop.mapred.JobTracker: Writing to file hdfs://192.168.0.141:9000/home/hadoop/work/mapred/system/jobtracker.info failed!
     WARN org.apache.hadoop.mapred.JobTracker: FileSystem is not ready yet!
     WARN org.apache.hadoop.mapred.JobTracker: Failed to initialize recovery manager.
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /home/hadoop/work/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

        at org.apache.hadoop.ipc.Client.call(Client.java:1070)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
        at $Proxy5.addBlock(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy5.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3510)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3373)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2589)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2829)

 

5. Per-server configuration after booting each OS image (note: everything here is done as the hadoop user)

a. SSH setup 
     Set things up so you can move between the servers without a password, and so hadoop can reach each node without authentication prompts.

     - Do the steps below on only one of the three machines (it will all be copied anyway). 
     - Log in as the hadoop user (skip this if you already are).
     $> cd [Enter] ( move to the home directory )
    $> ssh-keygen -t rsa [Enter]
     $> [Enter][Enter][Enter][Enter][Enter][Enter] ....
     * Once the prompt comes back, check with
     $> ls -al [Enter]
     * You should see a hidden directory named .ssh.
     $> cd .ssh [Enter]
    $> cp id_rsa.pub authorized_keys [Enter]

     * Now connect to another server (if you did this on master, connect to slave1 or slave2). 
     $> ssh slave1 
     It will ask [something something] yes/no?  
     It asks because you have never connected there before; it's a bit like a "Do you want to save this password?" prompt... (it does not actually save a password!)
     - Answer yes, type the password when asked, and log in.
     - Once the login works, log back out.
     $>exit [Enter]
     Copy the contents of that .ssh directory to the other servers.

     * On the machines other than the one above, run ssh-keygen -t rsa and hit Enter through the prompts just to create the .ssh directory, then:

     $>scp * hadoop@slave1:.ssh [Enter]
     * It will ask for the password once and then copy everything.

     * After that, hop back and forth between the three servers with ssh master, ssh <ip address>, and so on:
     it should ask yes/no? once the very first time and never ask for a password again. (That's how it's supposed to be;;;)
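
     Put together in one place, the whole passwordless-SSH setup comes down to roughly this (a sketch, assuming the hadoop user exists on all three hosts and .ssh has already been created on the slaves):

     $> ssh-keygen -t rsa                      # accept the defaults, empty passphrase
     $> cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
     $> scp ~/.ssh/* hadoop@slave1:.ssh        # asks for the password one last time
     $> scp ~/.ssh/* hadoop@slave2:.ssh
     $> ssh slave1 hostname                    # should answer without a password prompt
     $> ssh slave2 hostname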

 

  b. Hadoop configuration ( my Hadoop location : /usr/local/hadoop )

      - Edit a few of the configuration files in the conf directory.
      - hadoop-env.sh 
        # set JAVA_HOME and uncomment it.
        # uncomment HADOOP_HEAPSIZE. 
        # uncomment HADOOP_SSH_OPTS. [the datanodes wouldn't connect, and while digging around I ended up uncommenting this one as well;;]


- core-site.xml
 <configuration>
 <property>
        <name>fs.default.name</name>
        <value>hdfs://192.168.0.141:9000</value>
 </property>
 </configuration>

 

- hdfs-site.xml
  <configuration>
        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>
        <property>
                <name>dfs.name.dir</name>
                <value>/home/hadoop/work/name</value>
        </property>
        <property>
                <name>dfs.data.dir</name>
                <value>/home/hadoop/work/data</value>
        </property>
        <property>
                <name>dfs.support.append</name>
                <value>true</value>
        </property>
</configuration>

 

- mapred-site.xml
  <configuration>
    <property>
        <name>mapred.map.child.java.opts</name>
        <value>-Xmx200m </value>
    </property>
    <property>
        <name>mapred.reduce.child.java.opts</name>
        <value>-Xmx200m </value>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>hdfs://192.168.0.141:9001</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/home/hadoop/work/mapred/system</value>
    </property>
</configuration>


 

- slaves
  192.168.0.142
  192.168.0.143

These four config files need to be applied identically on every server as well; a way to push them out is sketched below. (The only reason this note ended up here is..... I was writing without thinking;;; and here it is;)
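
A minimal sketch of pushing the four files out to the slaves instead of editing each copy by hand (it assumes the same /usr/local/hadoop path everywhere and that the hadoop user can write to conf on the slaves):

$> for h in slave1 slave2; do scp /usr/local/hadoop/conf/{hadoop-env.sh,core-site.xml,hdfs-site.xml,mapred-site.xml,slaves} hadoop@$h:/usr/local/hadoop/conf/; done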

 

6. Bringing up hadoop [done]

a. Format before starting
     If you added hadoop's bin directory to the PATH earlier:
     $> hadoop namenode -format [Enter]

     Honestly, the first time I ran this command it just reported that the format had failed. Figuring it might be a first-run quirk, I tried again and it succeeded immediately;;; 

12/06/19 12:29:40 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = master/192.168.0.141
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012
************************************************************/
Re-format filesystem in /home/hadoop/work/name ? (Y or N) Y # trace of the retry haha;
12/06/19 12:29:42 INFO util.GSet: VM type       = 64-bit
12/06/19 12:29:42 INFO util.GSet: 2% max memory = 17.77875 MB
12/06/19 12:29:42 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/06/19 12:29:42 INFO util.GSet: recommended=2097152, actual=2097152
12/06/19 12:29:43 INFO namenode.FSNamesystem: fsOwner=hadoop
12/06/19 12:29:43 INFO namenode.FSNamesystem: supergroup=supergroup
12/06/19 12:29:43 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/06/19 12:29:43 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/06/19 12:29:43 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/06/19 12:29:43 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/06/19 12:29:43 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/06/19 12:29:43 INFO common.Storage: Storage directory /home/hadoop/work/name has been successfully formatted.
12/06/19 12:29:43 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/192.168.0.141
************************************************************/

 

 

b.  Start [done]

$>start-all.sh [Enter]
starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-namenode-master.out
192.168.0.142: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-datanode-slave1.out
192.168.0.143: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-datanode-slave2.out
192.168.0.141: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-secondarynamenode-master.out
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-jobtracker-master.out
192.168.0.142: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-slave1.out
192.168.0.143: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-hadoop-tasktracker-slave2.out

Judging by the messages above, it seems to bring up every connected node. [Going in with no prior knowledge, I'm only guessing.]

* Opening start-all.sh shows that it actually just runs two other scripts. [start-all.sh is shown below]
     
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
  . "$bin"/../libexec/hadoop-config.sh
else
  . "$bin/hadoop-config.sh"
fi

# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR

# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR

      One of them starts the distributed file system daemons, and the other brings up the MapReduce side, whatever exactly that is;;;;
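
      One way to confirm which daemons actually came up on which machine is jps, which ships with the JDK. A minimal sketch (the full jps path is used over ssh because /etc/profile isn't read in non-interactive sessions):

      $> jps                                       # on master: NameNode, SecondaryNameNode, JobTracker
      $> ssh slave1 /usr/local/java/j2se/bin/jps   # on a slave: DataNode, TaskTracker
      $> ssh slave2 /usr/local/java/j2se/bin/jps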

 

 

c. Verification

     This also comes with something like an admin page. [Not an EM like Oracle or MSSQL has; more of a status view??? a dashboard?]

 http://192.168.0.141:50070

 

 

 

http://192.168.0.141:50030 

- It shows things like two nodes being live [which seems to mean just the two datanodes]
- and some capacity figures [seeing "heap" there, it probably refers to how much memory is being used for files held in memory...]
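
The same information (live datanodes, configured and used capacity) can also be pulled from the command line, and a tiny put/cat round trip makes a decent first smoke test. A minimal sketch (the /test path is just an example):

$> hadoop dfsadmin -report                # live nodes plus configured/remaining capacity per datanode
$> echo "hello hadoop" > /tmp/hello.txt
$> hadoop fs -mkdir /test
$> hadoop fs -put /tmp/hello.txt /test/hello.txt
$> hadoop fs -cat /test/hello.txt         # should print "hello hadoop"
$> hadoop fs -ls /test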

 

* From here on I need to dig into this with a book and try things out.

* Seeing that everything runs on Java, access from Java should be quite flexible; personally I'm not a big fan of Java, so that worries me a bit.

* I've only ever touched Java when a project forced me to.... damn Java.... anyway, for the Oracle and hadoop tests I'll inevitably have to do it again.... (-.ㅜ)

 
