[Ubuntu] Installing CDH5 and Getting a Sample Job Running
Install CDH, the Hadoop distribution provided by Cloudera, on Ubuntu. I followed this guide for the installation.
What is CDH?
CDH is a distribution that bundles Apache Hadoop and all of its related projects, is functionally verified, and is the most widely deployed Hadoop distribution in the world. It is a 100% open-source product under the Apache license, and it is the only Hadoop solution that offers batch processing, interactive SQL, interactive search, role-based access control, and more. It is downloaded and used by more enterprise users than any other distribution.
Environment
Ubuntu 14.04 (Trusty) on Vagrant, judging from the hostnames and package versions in the logs below.
1. Download the package
Following the steps under "On Ubuntu and other Debian systems, do the following" in *1.
$ wget http://archive.cloudera.com/cdh5/one-click-install/trusty/amd64/cdh5-repository_1.0_all.deb
2. Install the package
$ sudo dpkg -i cdh5-repository_1.0_all.deb
Selecting previously unselected package cdh5-repository.
(Reading database ... 100266 files and directories currently installed.)
Preparing to unpack cdh5-repository_1.0_all.deb ...
Unpacking cdh5-repository (1.0) ...
Setting up cdh5-repository (1.0) ...
gpg: keyring `/etc/apt/secring.gpg' created
gpg: keyring `/etc/apt/trusted.gpg.d/cloudera-cdh5.gpg' created
gpg: /etc/apt/trustdb.gpg: trustdb created
gpg: key 02A818DD: public key "Cloudera Apt Repository" imported
gpg: Total number processed: 1
gpg:               imported: 1
Register the repository.
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
OK
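To double-check that the key was registered, listing the trusted keys should show the Cloudera entry (a quick sanity check; the grep pattern is my own):

# Sanity check (optional): the Cloudera key should appear in the trusted keyring.
$ sudo apt-key list | grep -i cloudera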
$ sudo apt-get update
$ sudo apt-get install hadoop-0.20-conf-pseudo
...
/usr/lib/hadoop-0.20-mapreduce/hadoop-core-mr1.jar
 * Failed to start Hadoop jobtracker. Return value: 1
invoke-rc.d: initscript hadoop-0.20-mapreduce-jobtracker, action "start" failed.
Setting up hadoop-0.20-mapreduce-tasktracker (2.6.0+cdh5.4.2+567-1.cdh5.4.2.p0.4~trusty-cdh5.4.2) ...
/usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.6.0-mr1-cdh5.4.2.jar
/usr/lib/hadoop-0.20-mapreduce/hadoop-core-mr1.jar
starting tasktracker, logging to /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-tasktracker-vagrant-ubuntu-trusty.out
/usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.6.0-mr1-cdh5.4.2.jar
/usr/lib/hadoop-0.20-mapreduce/hadoop-core-mr1.jar
 * Failed to start Hadoop tasktracker. Return value: 1
invoke-rc.d: initscript hadoop-0.20-mapreduce-tasktracker, action "start" failed.
Processing triggers for ureadahead (0.100.0-16) ...
Setting up hadoop-0.20-conf-pseudo (2.6.0+cdh5.4.2+567-1.cdh5.4.2.p0.4~trusty-cdh5.4.2) ...
update-alternatives: using /etc/hadoop/conf.pseudo.mr1 to provide /etc/hadoop/conf (hadoop-conf) in auto mode
Processing triggers for libc-bin (2.19-0ubuntu6.6) ...
Got errors... Ignoring them for now and moving on.
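(The failures are presumably because HDFS hasn't been formatted or started yet at this point, so the MRv1 daemons can't come up. If you want to dig in, the install output above names the log directory; a rough way to peek, with the exact file names being an assumption:)

# Inspect the MRv1 daemon logs for the startup failure (file names may differ).
$ sudo tail -n 50 /var/log/hadoop-0.20-mapreduce/*.log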
3. Format HDFS
$ sudo -u hdfs hdfs namenode -format
4. Start HDFS
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
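To confirm the daemons actually came up, jps run as the hdfs user should list them (a sketch; in CDH's pseudo-distributed setup the HDFS daemons run as the hdfs user):

# Expect to see NameNode, DataNode and SecondaryNameNode.
$ sudo -u hdfs jps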
Open the web UI and check that the page loads.
http://localhost:50070/
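On a headless box (this is a Vagrant VM), a curl probe works just as well; my assumption is that the NameNode UI answers 200 on /:

# Optional: check the NameNode web UI without a browser; expect an HTTP 200.
$ curl -sI http://localhost:50070/ | head -n 1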
Create the required directories.
$ sudo /usr/lib/hadoop/libexec/init-hdfs.sh
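To see what init-hdfs.sh created (the standard /tmp, /user, /var trees), listing the HDFS root as the hdfs superuser works:

# Verify the directory tree created by init-hdfs.sh.
$ sudo -u hdfs hadoop fs -ls -R / | head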
Start MapReduce.
$ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done
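This time the daemons should start cleanly; their status can be checked through the same init scripts (a sketch, using the package names seen during the install):

# Check that the MRv1 daemons are now running.
$ sudo service hadoop-0.20-mapreduce-jobtracker status
$ sudo service hadoop-0.20-mapreduce-tasktracker status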
Access http://localhost:50030/ and check that the JobTracker page is displayed.
Create a user
$ sudo groupadd map
$ sudo useradd -d /home/map -g map -m map
$ sudo passwd map
$ sudo -u hdfs hadoop fs -mkdir -p /user/map
$ sudo -u hdfs hadoop fs -chown map /user/map
Create an input directory, copy the configuration files into it, and verify that they were copied.
map$ hadoop fs -mkdir input
map$ hadoop fs -put /etc/hadoop/conf/*.xml input
map$ hadoop fs -ls input
Found 4 items
-rw-r--r--   1 map map 2133 2015-05-27 13:20 input/core-site.xml
-rw-r--r--   1 map map 3032 2015-05-27 13:20 input/fair-scheduler.xml
-rw-r--r--   1 map map 1875 2015-05-27 13:20 input/hdfs-site.xml
-rw-r--r--   1 map map  582 2015-05-27 13:20 input/mapred-site.xml
Run the sample program. Out-of-memory error. Ugh.
map$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
15/05/28 08:47:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/05/28 08:47:33 INFO mapred.FileInputFormat: Total input paths to process : 4
15/05/28 08:47:33 INFO mapred.JobClient: Running job: job_201505280830_0003
15/05/28 08:47:35 INFO mapred.JobClient:  map 0% reduce 0%
15/05/28 08:48:01 INFO mapred.JobClient:  map 25% reduce 0%
15/05/28 08:48:02 INFO mapred.JobClient:  map 50% reduce 0%
15/05/28 08:48:34 INFO mapred.JobClient:  map 50% reduce 17%
15/05/28 08:48:48 INFO mapred.JobClient: Task Id : attempt_201505280830_0003_m_000002_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
attempt_201505280830_0003_m_000002_0: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000f3b3a000, 32268288, 0) failed; error='Cannot allocate memory' (errno=12)
attempt_201505280830_0003_m_000002_0: #
attempt_201505280830_0003_m_000002_0: # There is insufficient memory for the Java Runtime Environment to continue.
attempt_201505280830_0003_m_000002_0: # Native memory allocation (malloc) failed to allocate 32268288 bytes for committing reserved memory.
attempt_201505280830_0003_m_000002_0: # An error report file with more information is saved as:
attempt_201505280830_0003_m_000002_0: # /var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/map/jobcache/job_201505280830_0003/attempt_201505280830_0003_m_000002_0/work/hs_err_pid3532.log
How much memory is there in the first place?
$ cat /proc/meminfo
MemTotal:         501492 kB
MemFree:          154976 kB
MemAvailable:     227880 kB
SwapTotal:        522236 kB
SwapFree:         225972 kB
$ free -m
             total       used       free     shared    buffers     cached
Mem:           489        345        144          0          5         66
-/+ buffers/cache:        273        216
Swap:          509        284        225
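Only about 500 MB of RAM, so the task JVMs simply don't fit. Instead of (or before) adding swap, another option would be shrinking the task JVM heap. A sketch of the standard MRv1 knob in /etc/hadoop/conf/mapred-site.xml; the -Xmx value here is a guess for a VM this size, not a tested setting:

<!-- Sketch: cap each map/reduce child JVM heap (value is an assumption). -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx128m</value>
</property>

This walkthrough takes the swap route instead.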
Add 1 GB of swap. I referred to this guide.
$ sudo dd if=/dev/zero of=/swapfile1 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 6.55761 s, 164 MB/s
Check the swap file.
$ ll /swapfile1
-rw-r--r-- 1 root root 1073741824 Jun  3 04:10 /swapfile1
Change the permissions.
$ sudo chmod 600 /swapfile1
Format it.
$ sudo mkswap /swapfile1
Setting up swapspace version 1, size = 1048572 KiB
no label, UUID=0314552f-ebe3-4998-92b7-0e9e492c2142
Check the current swap configuration.
$ swapon -s
Filename      Type        Size    Used    Priority
/dev/sda5     partition   522236  136616  -1
Enable it.
$ sudo swapon /swapfile1
It increased.
$ swapon -s
Filename      Type        Size     Used    Priority
/dev/sda5     partition   522236   135552  -1
/swapfile1    file        1048572  0       -2
Confirm.
$ cat /proc/meminfo | grep Swap
SwapCached:        18392 kB
SwapTotal:       1570808 kB
SwapFree:        1438384 kB
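Note that swapon only lasts until reboot. To make the swap file permanent, the usual approach is an /etc/fstab entry (not part of the original steps here):

# Persist the swap file across reboots.
$ echo '/swapfile1 none swap sw 0 0' | sudo tee -a /etc/fstab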
Log in as the map user and start the daemons again.
$ su - map
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
$ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done
Run the sample again. This time, an error that the output directory still exists.
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
...
15/06/03 04:22:46 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:8020/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/map/.staging/job_201506030345_0002
15/06/03 04:22:46 WARN security.UserGroupInformation: PriviledgedActionException as:map (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:8020/user/map/output already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:8020/user/map/output already exists
Delete it with the following.
$ sudo -u hdfs hadoop fs -rmdir --ignore-fail-on-non-empty /user/map/output
Check whether it was deleted.
$ sudo -u hdfs hadoop fs -ls -R /
It wasn't deleted: -rmdir only removes empty directories, and --ignore-fail-on-non-empty merely suppresses the error instead of forcing the delete. Probably this known issue. The following worked.
$ sudo -u hdfs hdfs dfs -rm -r -skipTrash /user/map/output
Deleted /user/map/output
This time it succeeded. (The grep example runs two MapReduce jobs, which is why two jobs appear below: one searches, one sorts the results.)
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
15/06/03 05:55:45 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/06/03 05:55:47 INFO mapred.FileInputFormat: Total input paths to process : 4
15/06/03 05:55:50 INFO mapred.JobClient: Running job: job_201506030555_0001
15/06/03 05:55:51 INFO mapred.JobClient:  map 0% reduce 0%
15/06/03 05:56:33 INFO mapred.JobClient:  map 50% reduce 0%
15/06/03 05:57:05 INFO mapred.JobClient:  map 100% reduce 0%
15/06/03 05:57:09 INFO mapred.JobClient:  map 100% reduce 67%
15/06/03 05:57:13 INFO mapred.JobClient:  map 100% reduce 100%
15/06/03 05:57:21 INFO mapred.JobClient: Job complete: job_201506030555_0001
15/06/03 05:57:22 INFO mapred.JobClient: Counters: 33
15/06/03 05:57:22 INFO mapred.JobClient:   File System Counters
15/06/03 05:57:22 INFO mapred.JobClient:     FILE: Number of bytes read=204
15/06/03 05:57:22 INFO mapred.JobClient:     FILE: Number of bytes written=1256351
15/06/03 05:57:22 INFO mapred.JobClient:     FILE: Number of read operations=0
15/06/03 05:57:22 INFO mapred.JobClient:     FILE: Number of large read operations=0
15/06/03 05:57:22 INFO mapred.JobClient:     FILE: Number of write operations=0
15/06/03 05:57:22 INFO mapred.JobClient:     HDFS: Number of bytes read=8041
15/06/03 05:57:22 INFO mapred.JobClient:     HDFS: Number of bytes written=320
15/06/03 05:57:22 INFO mapred.JobClient:     HDFS: Number of read operations=10
15/06/03 05:57:22 INFO mapred.JobClient:     HDFS: Number of large read operations=0
15/06/03 05:57:22 INFO mapred.JobClient:     HDFS: Number of write operations=2
15/06/03 05:57:22 INFO mapred.JobClient:   Job Counters
15/06/03 05:57:22 INFO mapred.JobClient:     Launched map tasks=4
15/06/03 05:57:22 INFO mapred.JobClient:     Launched reduce tasks=1
15/06/03 05:57:22 INFO mapred.JobClient:     Data-local map tasks=4
15/06/03 05:57:22 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=135539
15/06/03 05:57:22 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=38986
15/06/03 05:57:22 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/03 05:57:22 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/03 05:57:22 INFO mapred.JobClient:   Map-Reduce Framework
15/06/03 05:57:22 INFO mapred.JobClient:     Map input records=218
15/06/03 05:57:22 INFO mapred.JobClient:     Map output records=6
15/06/03 05:57:22 INFO mapred.JobClient:     Map output bytes=186
15/06/03 05:57:22 INFO mapred.JobClient:     Input split bytes=419
15/06/03 05:57:22 INFO mapred.JobClient:     Combine input records=6
15/06/03 05:57:22 INFO mapred.JobClient:     Combine output records=6
15/06/03 05:57:22 INFO mapred.JobClient:     Reduce input groups=6
15/06/03 05:57:22 INFO mapred.JobClient:     Reduce shuffle bytes=222
15/06/03 05:57:22 INFO mapred.JobClient:     Reduce input records=6
15/06/03 05:57:22 INFO mapred.JobClient:     Reduce output records=6
15/06/03 05:57:22 INFO mapred.JobClient:     Spilled Records=12
15/06/03 05:57:22 INFO mapred.JobClient:     CPU time spent (ms)=8640
15/06/03 05:57:22 INFO mapred.JobClient:     Physical memory (bytes) snapshot=735318016
15/06/03 05:57:22 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3910836224
15/06/03 05:57:22 INFO mapred.JobClient:     Total committed heap usage (bytes)=822448128
15/06/03 05:57:22 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
15/06/03 05:57:22 INFO mapred.JobClient:     BYTES_READ=7622
15/06/03 05:57:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/06/03 05:57:22 INFO mapred.FileInputFormat: Total input paths to process : 1
15/06/03 05:57:23 INFO mapred.JobClient: Running job: job_201506030555_0002
15/06/03 05:57:24 INFO mapred.JobClient:  map 0% reduce 0%
15/06/03 05:57:41 INFO mapred.JobClient:  map 100% reduce 0%
15/06/03 05:57:51 INFO mapred.JobClient:  map 100% reduce 100%
15/06/03 05:57:57 INFO mapred.JobClient: Job complete: job_201506030555_0002
15/06/03 05:57:57 INFO mapred.JobClient: Counters: 33
15/06/03 05:57:57 INFO mapred.JobClient:   File System Counters
15/06/03 05:57:57 INFO mapred.JobClient:     FILE: Number of bytes read=204
15/06/03 05:57:57 INFO mapred.JobClient:     FILE: Number of bytes written=496351
15/06/03 05:57:57 INFO mapred.JobClient:     FILE: Number of read operations=0
15/06/03 05:57:57 INFO mapred.JobClient:     FILE: Number of large read operations=0
15/06/03 05:57:57 INFO mapred.JobClient:     FILE: Number of write operations=0
15/06/03 05:57:57 INFO mapred.JobClient:     HDFS: Number of bytes read=434
15/06/03 05:57:57 INFO mapred.JobClient:     HDFS: Number of bytes written=150
15/06/03 05:57:57 INFO mapred.JobClient:     HDFS: Number of read operations=4
15/06/03 05:57:57 INFO mapred.JobClient:     HDFS: Number of large read operations=0
15/06/03 05:57:57 INFO mapred.JobClient:     HDFS: Number of write operations=2
15/06/03 05:57:57 INFO mapred.JobClient:   Job Counters
15/06/03 05:57:57 INFO mapred.JobClient:     Launched map tasks=1
15/06/03 05:57:57 INFO mapred.JobClient:     Launched reduce tasks=1
15/06/03 05:57:57 INFO mapred.JobClient:     Data-local map tasks=1
15/06/03 05:57:57 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=20861
15/06/03 05:57:57 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=9178
15/06/03 05:57:57 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/03 05:57:57 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/03 05:57:57 INFO mapred.JobClient:   Map-Reduce Framework
15/06/03 05:57:57 INFO mapred.JobClient:     Map input records=6
15/06/03 05:57:57 INFO mapred.JobClient:     Map output records=6
15/06/03 05:57:57 INFO mapred.JobClient:     Map output bytes=186
15/06/03 05:57:57 INFO mapred.JobClient:     Input split bytes=114
15/06/03 05:57:57 INFO mapred.JobClient:     Combine input records=0
15/06/03 05:57:57 INFO mapred.JobClient:     Combine output records=0
15/06/03 05:57:57 INFO mapred.JobClient:     Reduce input groups=1
15/06/03 05:57:57 INFO mapred.JobClient:     Reduce shuffle bytes=204
15/06/03 05:57:57 INFO mapred.JobClient:     Reduce input records=6
15/06/03 05:57:57 INFO mapred.JobClient:     Reduce output records=6
15/06/03 05:57:57 INFO mapred.JobClient:     Spilled Records=12
15/06/03 05:57:57 INFO mapred.JobClient:     CPU time spent (ms)=1550
15/06/03 05:57:57 INFO mapred.JobClient:     Physical memory (bytes) snapshot=294739968
15/06/03 05:57:57 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1568473088
15/06/03 05:57:57 INFO mapred.JobClient:     Total committed heap usage (bytes)=211836928
15/06/03 05:57:57 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
15/06/03 05:57:57 INFO mapred.JobClient:     BYTES_READ=234
Nothing shows up unless I spell out /user/map. I thought something had gotten off somewhere, but presumably it's just that relative HDFS paths resolve against the invoking user's home directory (/user/<user>), and this command runs as root, which has no /user/root.
$ sudo hadoop fs -ls /user/map/output
Found 3 items
-rw-r--r--   1 map supergroup     0 2015-06-03 05:57 /user/map/output/_SUCCESS
drwxr-xr-x   - map supergroup     0 2015-06-03 05:57 /user/map/output/_logs
-rw-r--r--   1 map supergroup   150 2015-06-03 05:57 /user/map/output/part-00000
The output shows the occurrence count of each match of dfs[a-z.]+ across the files in the input folder.
$ sudo hadoop fs -cat /user/map/output/part-00000 | head
1 dfs.datanode.data.dir
1 dfs.namenode.checkpoint.dir
1 dfs.namenode.name.dir
1 dfs.replication
1 dfs.safemode.extension
1 dfs.safemode.min.datanodes
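As a sanity check, roughly the same counts can be reproduced locally with plain grep over the same config files (assuming they haven't changed since the -put):

# Local equivalent of the Hadoop grep example: count occurrences per matched string.
$ grep -ohE 'dfs[a-z.]+' /etc/hadoop/conf/*.xml | sort | uniq -c | sort -rn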
I still don't really get when to use the hdfs user, a regular user, or su. I should study that properly.
Stop all the Hadoop services.
$ for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done