
Spark 2.1 Hive Auth Custom Test


- As of 2017: I'm currently operating a certain big data project.

- Requirement: we use Spark, and we want it to enforce Hive's permission grants.


Spark version: 2.1

The problem:

For all of Spark's powerful features, I couldn't find any authorization hook for Spark that ties into the Hortonworks big data platform we currently run.

Since we use the Hive Metastore, I decided to customize the source and recompile it, and the test went fine.

I'm recording where the change goes so I don't forget it later.


- The Spark source is written in Scala.

- After getting a Scala Hello World going on my own, I added the feature below.

Spark source location:

sql/hive/src/main/scala/org/apache/spark/sql/hive

File: TableReader.scala

The trait is referenced in two places, so be sure to modify both: the source handles plain (heap) tables and partitioned tables separately, and both paths need the change (see the helper sketch after the snippet below).

Since the use case is fixed anyway, I decided to reuse the privileges already stored in Hive.

Below is part of what I modified (this much should keep me from forgetting):

def makeRDDForTable(
      hiveTable: HiveTable,
      deserializerClass: Class[_ <: Deserializer],
      filterOpt: Option[PathFilter]): RDD[InternalRow] = {

The modification starts from this point. (java.sql.{Connection, DriverManager, PreparedStatement, ResultSet} needs to be imported at the top of TableReader.scala; hive_user_nm, the requesting user, is set elsewhere in my patch.)

    val hive_table_nm = hiveTable.getTableName()
    val driver = "com.mysql.jdbc.Driver"
    val url = "jdbc:mysql://<my-metastore-path>/hive?characterEncoding=utf8"
    val username = "flashone"
    val password = "1234"
    // does the user hold any privilege on this table in the metastore?
    val check_query = "SELECT count(*) CNT FROM TBL_PRIVS A, TBLS B " +
      "WHERE A.TBL_ID = B.TBL_ID AND B.TBL_NAME = ? AND A.GRANTOR = ?"
    var connection: Connection = null
    var resultSet: ResultSet = null
    var stat: PreparedStatement = null
    try {
      Class.forName(driver)
      connection = DriverManager.getConnection(url, username, password)
      stat = connection.prepareStatement(check_query)
      stat.setString(1, hive_table_nm)
      stat.setString(2, hive_user_nm)
      resultSet = stat.executeQuery()
      while (resultSet.next()) {
        if (resultSet.getString("CNT") == "0") {
          // no privilege row: repoint the table at an empty HDFS path so the
          // read returns nothing, then signal the denial
          val Npath = new Path("hdfs path")
          logInfo(hiveTable.getDataLocation().toString())
          hiveTable.setDataLocation(Npath)
          throw new Exception("Access Denied")
        }
      }
    } catch {
      // the exception is swallowed here, so the denial that actually bites is
      // the relocated (empty) data path set above
      case _: Exception =>
    } finally {
      // close in finally, with null checks, so a failed connection can't NPE
      if (resultSet != null) resultSet.close()
      if (stat != null) stat.close()
      if (connection != null) connection.close()
    }
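Since the identical check has to go into both makeRDDForTable and makeRDDForPartitionedTable, one way to keep the two call sites in sync is to pull it into a private helper on the reader. A sketch only, under my assumptions; the helper name is made up and not part of the actual patch:

private def checkHivePrivilege(tableName: String, userName: String): Unit = {
  var connection: Connection = null
  try {
    Class.forName("com.mysql.jdbc.Driver")
    connection = DriverManager.getConnection(
      "jdbc:mysql://<my-metastore-path>/hive?characterEncoding=utf8", "flashone", "1234")
    val stat = connection.prepareStatement(
      "SELECT count(*) CNT FROM TBL_PRIVS A, TBLS B " +
        "WHERE A.TBL_ID = B.TBL_ID AND B.TBL_NAME = ? AND A.GRANTOR = ?")
    stat.setString(1, tableName)
    stat.setString(2, userName)
    val rs = stat.executeQuery()
    // no privilege row in the metastore means no access
    if (rs.next() && rs.getInt("CNT") == 0) {
      throw new Exception("Access Denied: " + userName + " on " + tableName)
    }
  } finally {
    if (connection != null) connection.close()
  }
}

Each of the two makeRDDFor... methods would then start with checkHivePrivilege(hiveTable.getTableName(), hive_user_nm).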


https://github.com/ocmpromaster/sparkcustom21

This is where the work is happening, heh.



Authorization kicking in, as seen from RStudio (screenshot omitted):

With the custom build in place, auditing poses no problem either.




Damn Cassandra -- the integration is a pain, heh

I tried to get HBase integration going first, failed, got fed up, then tried Cassandra integration instead; after succeeding, I'm leaving a record.


Spark version: 1.6.1 (I'm on Hadoop 2.7.2, so I used the build compiled for Hadoop)

Cassandra version: 3.4

Cassandra Spark Java connector version: 1.5

Node count: 6

Installing Cassandra is easy, so I'll only record the information I need.

I'm running my ongoing tests on six machines~~~


For reference, all the nodes have the spec below (VMware specs, heh; screenshot omitted).


Cassandra configuration work.

Using the VMs above, each node got its own IP; I'm only recording the settings that matter.

Files to touch for Cassandra's distribution and replication setup:


cassandra.yaml

* What to change (pulled together in the sketch below):

- listen_address: that node's IP

- rpc_address: that node's IP

- the seeds entry under seed_provider: the IPs of the member nodes, like "111.111.111.111,222.222.222.222"
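Put together, the cassandra.yaml entries for one node look roughly like this (a sketch using this cluster's IPs; SimpleSeedProvider is the stock provider the seeds list lives under):

listen_address: 192.168.0.111
rpc_address: 192.168.0.111
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # the same comma-separated seed list goes on every node
          - seeds: "192.168.0.111,192.168.0.112"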


cassandra-env.sh

* What to change:

- the hostname part of JVM_OPTS: that node's IP

With just these changes, starting Cassandra gave me six nodes in one datacenter with one rack.


The two files below set how many datacenters and how many racks you get (see the sketch after this list):


cassandra-rackdc.properties


cassandra-topology.properties
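For the record, with the single-datacenter, single-rack layout used here, cassandra-rackdc.properties reduces to something like this sketch (it's the file read when the GossipingPropertyFileSnitch is in use):

dc=datacenter1
rack=rack1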


Checking the connected state of the Cassandra nodes:

[hadoop@os1 bin]$ ./nodetool status

Datacenter: datacenter1

=======================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack

UN  192.168.0.111  201.03 KB  256          49.5%             e13656b7-1436-4011-bd1d-01042a2d75fc  rack1

UN  192.168.0.112  205.89 KB  256          47.7%             09ef262e-9013-4292-9ea8-98a32ebdadc1  rack1

UN  192.168.0.113  253.59 KB  256          47.7%             bad6d57a-3135-4dff-b546-21887d77aca6  rack1

UN  192.168.0.114  198.71 KB  256          49.3%             c2e5c7bc-b9a1-4751-9e6b-6179b20fc76f  rack1

UN  192.168.0.115  214 KB      256          53.8%             9f9d28e9-7e03-4653-a329-7aba65aaf5f0  rack1

UN  192.168.0.116  219.55 KB  256          52.1%             cffba9e0-19b3-4070-93ba-ecb4e6263e47  rack1


[hadoop@os1 bin]$ 


For reference, when a node dies its status flips to DN.


Now, inserting some basic data into Cassandra.

* Since the config files use real IPs rather than localhost, plain cqlsh can't find the server; pass the node's IP.

* I skimmed just the bold print of the manual and ran the following, heh.


[hadoop@os1 bin]$ ./cqlsh 192.168.0.111

Connected to CASDR at 192.168.0.111:9042.

[cqlsh 5.0.1 | Cassandra 3.4 | CQL spec 3.4.0 | Native protocol v4]

Use HELP for help.

cqlsh> create keyspace keytest with replication = {'class' :'SimpleStrategy','replication_factor' : 3};

cqlsh>

cqlsh> use keytest;

cqlsh:keytest> create table test_table ( id varchar,name varchar,primary key(id));          

cqlsh:keytest> insert into test_table (id,name) values('2','fasdlkffasdffajfkls');

cqlsh:keytest> select *from test_table;


 id | name

----+-------------------------------

  3 | fasdlkfasdfasdfaffasdffajfkls

  2 |           fasdlkffasdffajfkls

  1 |     fasdlkfjafkljfsdklfajfkls


(3 rows)

cqlsh:keytest> 



* Current layout: nodes 1 through 6, with Spark installed on node 1.

* I stitched together source floating around the internet and fixed a few things.

* Below is the row-counting source for the Cassandra table, written at the console.

* You'll see both the 1.6 and 1.5 connectors in the source because I was testing both; either one works fine on its own, so don't mind it.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.CassandraRow;

public class test3 {

    public static void main(String[] args)
    {
        SparkConf conf = new SparkConf(true)
            // Cassandra contact point and the standalone master, both on node 1
            .set("spark.cassandra.connection.host", "192.168.0.111")
            .setMaster("spark://192.168.0.111:7077")
            .setAppName("casr")
            // ship the connector jars to the executors
            .setJars(new String[]{
                "/apps/spark/lib/spark-cassandra-connector-java_2.10-1.6.0-M1.jar",
                "/apps/spark/lib/spark-cassandra-connector_2.10-1.6.0-M1.jar"})
            .setSparkHome("/apps/spark");

        JavaSparkContext sc = new JavaSparkContext(conf);

        // the Cassandra driver and its Guava dependency have to reach the executors too
        sc.addJar("/home/hadoop/classes/cassandra-driver-core-3.0.0.jar");
        sc.addJar("/home/hadoop/classes/guava-19.0.jar");

        // read keytest.test_table as an RDD of rows, selecting only the "name" column
        JavaRDD<CassandraRow> cassandraRdd = CassandraJavaUtil.javaFunctions(sc)
            .cassandraTable("keytest", "test_table")
            .select("name");

        System.out.println("row count:" + cassandraRdd.count()); // once it is in an RDD, we can use RDD operations
    }
}
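To build and launch it outside an IDE, something along these lines works (a sketch; the wildcard classpath entries assume the jar locations used in the source above):

$ javac -cp "/apps/spark/lib/*:/home/hadoop/classes/*" test3.java
$ java -cp ".:/apps/spark/lib/*:/home/hadoop/classes/*" test3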


Compile the above and run it, and it executes like this:

[hadoop@os1 ~]$ java test3

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/home/hadoop/classes/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/home/hadoop/classes/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

16/04/08 00:53:19 INFO SparkContext: Running Spark version 1.6.1

16/04/08 00:53:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

16/04/08 00:53:20 INFO SecurityManager: Changing view acls to: hadoop

16/04/08 00:53:20 INFO SecurityManager: Changing modify acls to: hadoop

16/04/08 00:53:20 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)

16/04/08 00:53:21 INFO Utils: Successfully started service 'sparkDriver' on port 54330.

16/04/08 00:53:22 INFO Slf4jLogger: Slf4jLogger started

16/04/08 00:53:22 INFO Remoting: Starting remoting

16/04/08 00:53:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.0.111:34441]

16/04/08 00:53:22 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 34441.

16/04/08 00:53:22 INFO SparkEnv: Registering MapOutputTracker

16/04/08 00:53:22 INFO SparkEnv: Registering BlockManagerMaster

16/04/08 00:53:22 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e91fb4d5-8f1e-45b5-9631-2af967164aff

16/04/08 00:53:22 INFO MemoryStore: MemoryStore started with capacity 1077.8 MB

16/04/08 00:53:22 INFO SparkEnv: Registering OutputCommitCoordinator

16/04/08 00:53:23 INFO Utils: Successfully started service 'SparkUI' on port 4040.

16/04/08 00:53:23 INFO SparkUI: Started SparkUI at http://192.168.0.111:4040

16/04/08 00:53:23 INFO HttpFileServer: HTTP File server directory is /tmp/spark-d02efafe-ae00-4eaa-9cfc-0700e5848dd3/httpd-c8f30120-f92a-4ec9-922f-c201e94998af

16/04/08 00:53:23 INFO HttpServer: Starting HTTP Server

16/04/08 00:53:23 INFO Utils: Successfully started service 'HTTP file server' on port 48452.

16/04/08 00:53:23 INFO SparkContext: Added JAR /apps/spark/lib/spark-cassandra-connector-java_2.10-1.6.0-M1.jar at http://192.168.0.111:48452/jars/spark-cassandra-connector-java_2.10-1.6.0-M1.jar with timestamp 1460044403426

16/04/08 00:53:23 INFO SparkContext: Added JAR /apps/spark/lib/spark-cassandra-connector_2.10-1.6.0-M1.jar at http://192.168.0.111:48452/jars/spark-cassandra-connector_2.10-1.6.0-M1.jar with timestamp 1460044403435

16/04/08 00:53:23 INFO AppClient$ClientEndpoint: Connecting to master spark://192.168.0.111:7077...

16/04/08 00:53:23 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160408005323-0013

16/04/08 00:53:23 INFO AppClient$ClientEndpoint: Executor added: app-20160408005323-0013/0 on worker-20160407175148-192.168.0.111-57563 (192.168.0.111:57563) with 8 cores

16/04/08 00:53:23 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160408005323-0013/0 on hostPort 192.168.0.111:57563 with 8 cores, 1024.0 MB RAM

16/04/08 00:53:23 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44940.

16/04/08 00:53:23 INFO NettyBlockTransferService: Server created on 44940

16/04/08 00:53:23 INFO BlockManagerMaster: Trying to register BlockManager

16/04/08 00:53:23 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.111:44940 with 1077.8 MB RAM, BlockManagerId(driver, 192.168.0.111, 44940)

16/04/08 00:53:23 INFO BlockManagerMaster: Registered BlockManager

16/04/08 00:53:23 INFO AppClient$ClientEndpoint: Executor updated: app-20160408005323-0013/0 is now RUNNING

16/04/08 00:53:24 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0

16/04/08 00:53:24 INFO SparkContext: Added JAR /home/hadoop/classes/cassandra-driver-core-3.0.0.jar at http://192.168.0.111:48452/jars/cassandra-driver-core-3.0.0.jar with timestamp 1460044404233

16/04/08 00:53:24 INFO SparkContext: Added JAR /home/hadoop/classes/guava-19.0.jar at http://192.168.0.111:48452/jars/guava-19.0.jar with timestamp 1460044404244

16/04/08 00:53:24 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it

16/04/08 00:53:25 INFO Cluster: New Cassandra host /192.168.0.111:9042 added

16/04/08 00:53:25 INFO Cluster: New Cassandra host /192.168.0.112:9042 added

16/04/08 00:53:25 INFO LocalNodeFirstLoadBalancingPolicy: Added host 192.168.0.112 (datacenter1)

16/04/08 00:53:25 INFO Cluster: New Cassandra host /192.168.0.113:9042 added

16/04/08 00:53:25 INFO LocalNodeFirstLoadBalancingPolicy: Added host 192.168.0.113 (datacenter1)

16/04/08 00:53:25 INFO Cluster: New Cassandra host /192.168.0.114:9042 added

16/04/08 00:53:25 INFO LocalNodeFirstLoadBalancingPolicy: Added host 192.168.0.114 (datacenter1)

16/04/08 00:53:25 INFO Cluster: New Cassandra host /192.168.0.115:9042 added

16/04/08 00:53:25 INFO LocalNodeFirstLoadBalancingPolicy: Added host 192.168.0.115 (datacenter1)

16/04/08 00:53:25 INFO Cluster: New Cassandra host /192.168.0.116:9042 added

16/04/08 00:53:25 INFO LocalNodeFirstLoadBalancingPolicy: Added host 192.168.0.116 (datacenter1)

16/04/08 00:53:25 INFO CassandraConnector: Connected to Cassandra cluster: CASDR

16/04/08 00:53:26 INFO SparkContext: Starting job: count at test3.java:44

16/04/08 00:53:26 INFO DAGScheduler: Got job 0 (count at test3.java:44) with 7 output partitions

16/04/08 00:53:26 INFO DAGScheduler: Final stage: ResultStage 0 (count at test3.java:44)

16/04/08 00:53:26 INFO DAGScheduler: Parents of final stage: List()

16/04/08 00:53:26 INFO DAGScheduler: Missing parents: List()

16/04/08 00:53:26 INFO DAGScheduler: Submitting ResultStage 0 (CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15), which has no missing parents

16/04/08 00:53:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.3 KB, free 7.3 KB)

16/04/08 00:53:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.8 KB, free 11.1 KB)

16/04/08 00:53:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.111:44940 (size: 3.8 KB, free: 1077.7 MB)

16/04/08 00:53:26 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006

16/04/08 00:53:26 INFO DAGScheduler: Submitting 7 missing tasks from ResultStage 0 (CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15)

16/04/08 00:53:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks

16/04/08 00:53:27 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (os1.local:56503) with ID 0

16/04/08 00:53:27 INFO BlockManagerMasterEndpoint: Registering block manager os1.local:53103 with 511.1 MB RAM, BlockManagerId(0, os1.local, 53103)

16/04/08 00:53:27 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, os1.local, partition 0,NODE_LOCAL, 26312 bytes)

16/04/08 00:53:27 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, os1.local, partition 1,NODE_LOCAL, 26222 bytes)

16/04/08 00:53:27 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 2, os1.local, partition 5,NODE_LOCAL, 26352 bytes)

16/04/08 00:53:28 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on os1.local:53103 (size: 3.8 KB, free: 511.1 MB)

16/04/08 00:53:31 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 3, os1.local, partition 2,ANY, 26312 bytes)

16/04/08 00:53:31 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 4, os1.local, partition 3,ANY, 21141 bytes)

16/04/08 00:53:31 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 5, os1.local, partition 4,ANY, 23882 bytes)

16/04/08 00:53:31 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, os1.local, partition 6,ANY, 10818 bytes)

16/04/08 00:53:32 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 1106 ms on os1.local (1/7)

16/04/08 00:53:33 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 5530 ms on os1.local (2/7)

16/04/08 00:53:33 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 4) in 1945 ms on os1.local (3/7)

16/04/08 00:53:33 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6015 ms on os1.local (4/7)

16/04/08 00:53:33 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 2) in 5959 ms on os1.local (5/7)

16/04/08 00:53:33 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 5) in 2150 ms on os1.local (6/7)

16/04/08 00:53:33 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 3) in 2286 ms on os1.local (7/7)

16/04/08 00:53:33 INFO DAGScheduler: ResultStage 0 (count at test3.java:44) finished in 7.049 s

16/04/08 00:53:33 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 

16/04/08 00:53:33 INFO DAGScheduler: Job 0 finished: count at test3.java:44, took 7.435243 s

16/04/08 00:53:34 INFO CassandraConnector: Disconnected from Cassandra cluster: CASDR

16/04/08 00:53:35 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.0.111:44940 in memory (size: 3.8 KB, free: 1077.8 MB)

row count:3 <--------- there it is, the answer, heh

16/04/08 00:53:35 INFO BlockManagerInfo: Removed broadcast_0_piece0 on os1.local:53103 in memory (size: 3.8 KB, free: 511.1 MB)

16/04/08 00:53:35 INFO SparkContext: Invoking stop() from shutdown hook

16/04/08 00:53:35 INFO ContextCleaner: Cleaned accumulator 1

16/04/08 00:53:35 INFO SparkUI: Stopped Spark web UI at http://192.168.0.111:4040

16/04/08 00:53:35 INFO SparkDeploySchedulerBackend: Shutting down all executors

16/04/08 00:53:35 INFO SparkDeploySchedulerBackend: Asking each executor to shut down

16/04/08 00:53:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

16/04/08 00:53:35 INFO MemoryStore: MemoryStore cleared

16/04/08 00:53:35 INFO BlockManager: BlockManager stopped

16/04/08 00:53:35 INFO BlockManagerMaster: BlockManagerMaster stopped

16/04/08 00:53:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

16/04/08 00:53:35 INFO SparkContext: Successfully stopped SparkContext

16/04/08 00:53:35 INFO ShutdownHookManager: Shutdown hook called

16/04/08 00:53:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-d02efafe-ae00-4eaa-9cfc-0700e5848dd3/httpd-c8f30120-f92a-4ec9-922f-c201e94998af

16/04/08 00:53:35 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

16/04/08 00:53:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-d02efafe-ae00-4eaa-9cfc-0700e5848dd3

16/04/08 00:53:35 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

[hadoop@os1 ~]$


Caveats

I tried everything under the sun to get that row count line (the red text above) to print, heh.

The Spark and Cassandra used inside this class open fresh connections to both systems and appear to run according to the configuration set in the Java code.

In the Java source I added the jar file locations via setJars.

Once I did that, the long, long slog finally ended in success.

These are the things you still keep forgetting no matter how long you've been running servers, heh.



An error that came up while developing and testing in Eclipse (it dies right at SparkConf the moment you run).

I dug through every jar and even decompiled things trying to find that method, and nearly exploded from frustration.


Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;

at org.apache.spark.util.Utils$.getSystemProperties(Utils.scala:1546)

at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)

at spark1.test1.main(test1.java:20)


Fix: make the versions match, heh.

For reference: spark-1.6.1 with scala-library 2.10.4.
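If you're not sure which Scala line a Spark build expects, the spark-shell welcome banner says it outright; it prints a line like this (version digits will vary with the build):

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)

Whatever version shows up there is the artifact suffix to use (_2.10 here).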

Maven versions were matched to that suffix as below.



Versions in use when the error occurred:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.11</artifactId>
  <version>1.6.1</version>
</dependency>


After switching to these, the error disappeared:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
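One way to keep the suffix and version consistent across all four artifacts is to factor them into Maven properties; a sketch (the property names are the common convention, not from my original pom):

<properties>
  <scala.binary.version>2.10</scala.binary.version>
  <spark.version>1.6.1</spark.version>
</properties>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.binary.version}</artifactId>
  <version>${spark.version}</version>
</dependency>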



Preamble.

The first thing I used to pull data out of HBase in SQL form was Phoenix ~

===> http://phoenix.apache.org/

Anyone interested can go take a look, but I decided to switch.

Digging around for alternatives, I discovered SPARK.

Readers of a certain age will know there used to be a Korean adult magazine called SPARK.

Hmm~ I did once put on a brazen face and buy a copy openly at a bookstore.

Naturally I've felt no need, and no curiosity, to check whether it still exists.

Anyway, Spark supports Java, plus Python, which I've been meaning to study on the side, so I was watching it with interest; then I saw the researchers and developers grinding away in this field give it rave reviews, went "what is this thing???", and started installing.

For personal reasons I only got as far as standing it up; now I'm finally getting around to printing Hello World.


Part 1

Prerequisite: Hadoop ~ heh

For reference, I'm on 2.7.2 (1 master + 5 data nodes).

Spark itself you obviously download from the Spark site.

As of this writing, 1.6.1 is out (1.5 was current when I first looked).

Anyway~

Download it and unpack it; that's all.

I'm hooking it straight up to Hadoop, so I grabbed the build precompiled for Hadoop.


1. Download

Spark site: http://spark.apache.org

Go to the download menu~

You'll see a page like the one below (screenshot omitted),

and in item 2 I picked the Hadoop build. (My Hadoop is 2.7.2; a newer release had just been posted as I wrote this, so I upgraded on the spot.)


Part 2

1. Printing Hello World.

For a dev tool you'd print to a console, but this is a data tool, so: drop a sample file into Hadoop, count its rows, and call it done.

2. Make any old text file and push it into Hadoop.

Just in case, I'll record the Hadoop file commands too. My head's a sieve, heh.


$ vi testman.txt

blah blah~

blah blah~

Save it, then:

$ hadoop fs -mkdir /test            <-- create the directory

$ hadoop fs -put testman.txt /test  <-- push the OS file into Hadoop
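To double-check that the file actually landed (optional):

$ hadoop fs -ls /test

$ hadoop fs -cat /test/testman.txt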



$ cd <spark home directory>

$ ./bin/pyspark

Python 2.7.5 (default, Nov 20 2015, 02:00:19) 

[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2

Type "help", "copyright", "credits" or "license" for more information.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

16/03/15 22:20:55 INFO SparkContext: Running Spark version 1.6.1

16/03/15 22:20:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

16/03/15 22:20:57 INFO SecurityManager: Changing view acls to: hadoop

16/03/15 22:20:57 INFO SecurityManager: Changing modify acls to: hadoop

16/03/15 22:20:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)

16/03/15 22:20:58 INFO Utils: Successfully started service 'sparkDriver' on port 49115.

16/03/15 22:20:59 INFO Slf4jLogger: Slf4jLogger started

16/03/15 22:20:59 INFO Remoting: Starting remoting

16/03/15 22:21:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.0.111:41015]

16/03/15 22:21:00 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 41015.

16/03/15 22:21:00 INFO SparkEnv: Registering MapOutputTracker

16/03/15 22:21:00 INFO SparkEnv: Registering BlockManagerMaster

16/03/15 22:21:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b918f3eb-05d7-4f85-a0cb-678f3327e8b6

16/03/15 22:21:00 INFO MemoryStore: MemoryStore started with capacity 511.5 MB

16/03/15 22:21:00 INFO SparkEnv: Registering OutputCommitCoordinator

16/03/15 22:21:01 INFO Utils: Successfully started service 'SparkUI' on port 4040.

16/03/15 22:21:01 INFO SparkUI: Started SparkUI at http://192.168.0.111:4040

16/03/15 22:21:01 INFO Executor: Starting executor ID driver on host localhost

16/03/15 22:21:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37043.

16/03/15 22:21:01 INFO NettyBlockTransferService: Server created on 37043

16/03/15 22:21:01 INFO BlockManagerMaster: Trying to register BlockManager

16/03/15 22:21:01 INFO BlockManagerMasterEndpoint: Registering block manager localhost:37043 with 511.5 MB RAM, BlockManagerId(driver, localhost, 37043)

16/03/15 22:21:01 INFO BlockManagerMaster: Registered BlockManager

Welcome to

      ____              __

     / __/__  ___ _____/ /__

    _\ \/ _ \/ _ `/ __/  '_/

   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1

      /_/


Using Python version 2.7.5 (default, Nov 20 2015 02:00:19)

SparkContext available as sc, HiveContext available as sqlContext.

>>> xx = sc.hadoopFile("hdfs://192.168.0.111:9000/test/testman.txt","org.apache.hadoop.mapred.TextInputFormat","org.apache.hadoop.io.Text","org.apache.hadoop.io.LongWritable") <-- the docs say to write it this way, so for now I followed them verbatim (note: TextInputFormat actually produces a LongWritable key and a Text value, so the two class arguments here are swapped; count() doesn't care)

16/03/15 22:30:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB)

16/03/15 22:30:19 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.5 KB)

16/03/15 22:30:19 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37043 (size: 13.9 KB, free: 511.5 MB)

16/03/15 22:30:19 INFO SparkContext: Created broadcast 0 from hadoopFile at PythonRDD.scala:613

16/03/15 22:30:20 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 166.7 KB, free 334.2 KB)

16/03/15 22:30:20 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 13.0 KB, free 347.2 KB)

16/03/15 22:30:20 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:37043 (size: 13.0 KB, free: 511.5 MB)

16/03/15 22:30:20 INFO SparkContext: Created broadcast 1 from broadcast at PythonRDD.scala:570

16/03/15 22:30:21 INFO FileInputFormat: Total input paths to process : 1

16/03/15 22:30:21 INFO SparkContext: Starting job: take at SerDeUtil.scala:201

16/03/15 22:30:21 INFO DAGScheduler: Got job 0 (take at SerDeUtil.scala:201) with 1 output partitions

16/03/15 22:30:21 INFO DAGScheduler: Final stage: ResultStage 0 (take at SerDeUtil.scala:201)

16/03/15 22:30:21 INFO DAGScheduler: Parents of final stage: List()

16/03/15 22:30:21 INFO DAGScheduler: Missing parents: List()

16/03/15 22:30:21 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at PythonHadoopUtil.scala:181), which has no missing parents

16/03/15 22:30:21 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.3 KB, free 350.5 KB)

16/03/15 22:30:21 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1980.0 B, free 352.5 KB)

16/03/15 22:30:21 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:37043 (size: 1980.0 B, free: 511.5 MB)

16/03/15 22:30:21 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006

16/03/15 22:30:21 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at PythonHadoopUtil.scala:181)

16/03/15 22:30:21 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks

16/03/15 22:30:21 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,ANY, 2144 bytes)

16/03/15 22:30:21 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)

16/03/15 22:30:21 INFO HadoopRDD: Input split: hdfs://192.168.0.111:9000/test/testman.txt:0+60

16/03/15 22:30:21 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id

16/03/15 22:30:21 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

16/03/15 22:30:21 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap

16/03/15 22:30:21 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition

16/03/15 22:30:21 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id

16/03/15 22:30:22 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2193 bytes result sent to driver

16/03/15 22:30:22 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 502 ms on localhost (1/1)

16/03/15 22:30:22 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 

16/03/15 22:30:22 INFO DAGScheduler: ResultStage 0 (take at SerDeUtil.scala:201) finished in 0.544 s

16/03/15 22:30:22 INFO DAGScheduler: Job 0 finished: take at SerDeUtil.scala:201, took 0.755415 s

>>> 16/03/15 22:30:45 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:37043 in memory (size: 1980.0 B, free: 511.5 MB)

16/03/15 22:30:45 INFO ContextCleaner: Cleaned accumulator 2


>>> xx.count() <-- count what we just read

16/03/15 22:32:49 INFO SparkContext: Starting job: count at <stdin>:1

16/03/15 22:32:49 INFO DAGScheduler: Got job 1 (count at <stdin>:1) with 2 output partitions

16/03/15 22:32:49 INFO DAGScheduler: Final stage: ResultStage 1 (count at <stdin>:1)

16/03/15 22:32:49 INFO DAGScheduler: Parents of final stage: List()

16/03/15 22:32:49 INFO DAGScheduler: Missing parents: List()

16/03/15 22:32:49 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[3] at count at <stdin>:1), which has no missing parents

16/03/15 22:32:49 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 6.0 KB, free 353.2 KB)

16/03/15 22:32:49 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 3.7 KB, free 356.9 KB)

16/03/15 22:32:49 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:37043 (size: 3.7 KB, free: 511.5 MB)

16/03/15 22:32:49 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006

16/03/15 22:32:49 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (PythonRDD[3] at count at <stdin>:1)

16/03/15 22:32:49 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks

16/03/15 22:32:49 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,ANY, 2144 bytes)

16/03/15 22:32:49 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, localhost, partition 1,ANY, 2144 bytes)

16/03/15 22:32:49 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)

16/03/15 22:32:49 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)

16/03/15 22:32:49 INFO HadoopRDD: Input split: hdfs://192.168.0.111:9000/test/testman.txt:0+60

16/03/15 22:32:49 INFO HadoopRDD: Input split: hdfs://192.168.0.111:9000/test/testman.txt:60+61

16/03/15 22:32:50 INFO PythonRunner: Times: total = 661, boot = 627, init = 33, finish = 1

16/03/15 22:32:50 INFO PythonRunner: Times: total = 633, boot = 616, init = 15, finish = 2

16/03/15 22:32:50 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 2179 bytes result sent to driver

16/03/15 22:32:50 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 734 ms on localhost (1/2)

16/03/15 22:32:50 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2179 bytes result sent to driver

16/03/15 22:32:50 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 750 ms on localhost (2/2)

16/03/15 22:32:50 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 

16/03/15 22:32:50 INFO DAGScheduler: ResultStage 1 (count at <stdin>:1) finished in 0.752 s

16/03/15 22:32:50 INFO DAGScheduler: Job 1 finished: count at <stdin>:1, took 0.778757 s

27  <--- the result is printed.

>>>


Conclusion

HelloWorld done~ heh


Note:

In my case Spark lives on the Hadoop master, so the addresses are the same.

When you launch pyspark, you can see it chattering about a web admin page? at http://192.168.0.111:4040/stages.

It's called the SparkUI, and it's a monitoring tool of sorts.

Its clean UI shows the read we just did against the file in Hadoop, including the count we ran (screenshots omitted).

Oh!!!! Slick~ heh

The details of my count command also show up, drawn as a tidy picture of you-did-this and this-is-what-happened.

The public write-ups floating around, the ones reporting things like reading tens of billions of rows on N nodes in so-many minutes, present their results in this same view.

Not a bad design for dropping into a report either, heh.


