The version I installed is Hadoop 3.2.1, running on CentOS Linux 8; all of the operations below were performed on this platform.
For installation instructions, see: How to install Hadoop on CentOS Linux 8? Detailed steps for installing and configuring Hadoop.

Hadoop exposes three main user-facing modules: HDFS, YARN, and MapReduce. We will test each of them in turn:

HDFS Usage Test

HDFS is Hadoop's distributed file system; you can think of it as one big hard disk. When we need to store a file we copy it in, and when we need it back we copy it out of HDFS.
List all files under the directory "/":

hdfs dfs -ls /

Output:

Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 13:13 /hbase
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 13:15 /tmp

List all files under the directory "/hbase":

hdfs dfs -ls /hbase

Output:

Found 12 items
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 13:13 /hbase/.hbck
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 13:14 /hbase/.tmp
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 15:29 /hbase/MasterProcWALs
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 13:14 /hbase/WALs
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 13:13 /hbase/archive
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 13:13 /hbase/corrupt
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 13:14 /hbase/data
-rw-r--r--   3 hadoop supergroup         42 2019-12-31 13:13 /hbase/hbase.id
-rw-r--r--   3 hadoop supergroup          7 2019-12-31 13:13 /hbase/hbase.version
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 13:13 /hbase/mobdir
drwxr-xr-x   - hadoop supergroup          0 2019-12-31 15:29 /hbase/oldWALs
drwx--x--x   - hadoop supergroup          0 2019-12-31 13:13 /hbase/staging

The "hdfs dfs -ls" command works much like the Linux "ls" command; some similarly familiar commands are shown below:

hdfs dfs -mkdir /input          #create the /input directory
hdfs dfs -rm -r /input          #delete the /input directory
hdfs dfs -put a.log /input      #upload a.log into /input
hdfs dfs -get /input/a.log      #download a.log to the local machine
hdfs dfs -cat /input/a.log      #view the contents of a.log

With the commands above we can already perform basic big data file operations.
The hdfs command is very powerful; run "hdfs dfs -help" to explore and learn the full set of options.
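
A few more everyday commands worth knowing (these are standard HDFS commands, listed here as a quick reference rather than taken from the steps above):

hdfs dfs -du -h /              #show the size of everything under /
hdfs dfs -df -h /              #show overall HDFS capacity and free space
hdfs dfsadmin -report          #report overall filesystem status and the state of each DataNode
hdfs dfs -help ls              #show detailed help for a single command (here: ls)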

YARN Usage Test

YARN is Hadoop's cluster resource management and scheduling system. It has two main components: the ResourceManager, which allocates and schedules resources for computations, and the NodeManagers, which launch the containers that actually run the applications.
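
Before submitting a job, it can be reassuring to check that every NodeManager has registered with the ResourceManager. These are standard YARN CLI commands (not part of the original test, just a quick sanity check):

yarn node -list                #list the NodeManagers known to the ResourceManager
yarn application -list         #list applications that are currently submitted or running
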
Let's use one of the bundled examples to see how YARN works:

hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 10 15

This program estimates the value of pi. The trailing 10 and 15 are the number of map tasks and the number of samples per map, respectively; the larger the values, the more precise the result.
My test results:
Dell R620, Xeon E5-2603 v2, 64 GB of RAM, 8 disks hosting 8 virtual machines: 2 NameNodes, 6 DataNodes, 6 NodeManagers:

1000 1000: took 1504 s (about 25 minutes), result: 3.141552
10000 10000: took 14684 s (about 4 hours 5 minutes), result: 3.14159256

With larger settings, the computed result becomes even more precise.
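
For reference, the two benchmark runs above correspond to commands of the following form (same example jar as before):

hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 1000 1000
hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 10000 10000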

Another classic example is counting words. Let's create an arbitrary text file and upload it to HDFS:

vi Hello.txt

with the following content:

Hello 9Tristone. Hello everyone. Happy everyday.

Upload the file to HDFS and check whether the upload succeeded:

hdfs dfs -put Hello.txt /input
hdfs dfs -ls /input

Output:

Found 1 items
-rw-r--r--   3 hadoop supergroup         49 2019-12-31 16:00 /input/Hello.txt

Run the word count example to count how many times each word appears:

hdfs dfs -rm -r /output         #if you have run this before, delete the previous results first
hadoop jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /input /output 

The output looks like this:

2019-12-31 16:03:34,191 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2019-12-31 16:03:36,895 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1577768579177_0003
2019-12-31 16:03:38,134 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-12-31 16:03:41,322 INFO input.FileInputFormat: Total input files to process : 1
2019-12-31 16:03:41,825 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-12-31 16:03:43,460 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-12-31 16:03:44,205 INFO mapreduce.JobSubmitter: number of splits:1
2019-12-31 16:03:45,404 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2019-12-31 16:03:46,937 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1577768579177_0003
2019-12-31 16:03:46,937 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-12-31 16:03:48,414 INFO conf.Configuration: resource-types.xml not found
2019-12-31 16:03:48,415 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-12-31 16:03:50,123 INFO impl.YarnClientImpl: Submitted application application_1577768579177_0003
2019-12-31 16:03:50,241 INFO mapreduce.Job: The url to track the job: http://hadoop227:8088/proxy/application_1577768579177_0003/
2019-12-31 16:03:50,242 INFO mapreduce.Job: Running job: job_1577768579177_0003
2019-12-31 16:10:17,730 INFO mapreduce.Job:  map 0% reduce 0%
2019-12-31 16:10:25,922 INFO mapreduce.Job:  map 100% reduce 0%
2019-12-31 16:10:33,999 INFO mapreduce.Job:  map 100% reduce 100%
2019-12-31 16:10:34,017 INFO mapreduce.Job: Job job_1577768579177_0003 completed successfully
2019-12-31 16:10:34,190 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=79
                FILE: Number of bytes written=463881
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=143
                HDFS: Number of bytes written=53
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
                HDFS: Number of bytes read erasure-coded=0
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=6248
                Total time spent by all reduces in occupied slots (ms)=5397
                Total time spent by all map tasks (ms)=6248
                Total time spent by all reduce tasks (ms)=5397
                Total vcore-milliseconds taken by all map tasks=6248
                Total vcore-milliseconds taken by all reduce tasks=5397
                Total megabyte-milliseconds taken by all map tasks=6397952
                Total megabyte-milliseconds taken by all reduce tasks=5526528
        Map-Reduce Framework
                Map input records=1
                Map output records=6
                Map output bytes=73
                Map output materialized bytes=79
                Input split bytes=94
                Combine input records=6
                Combine output records=5
                Reduce input groups=5
                Reduce shuffle bytes=79
                Reduce input records=5
                Reduce output records=5
                Spilled Records=10
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=227
                CPU time spent (ms)=2780
                Physical memory (bytes) snapshot=575057920
                Virtual memory (bytes) snapshot=5171691520
                Total committed heap usage (bytes)=453509120
                Peak Map Physical memory (bytes)=336764928
                Peak Map Virtual memory (bytes)=2582700032
                Peak Reduce Physical memory (bytes)=238292992
                Peak Reduce Virtual memory (bytes)=2588991488
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=49
        File Output Format Counters 
                Bytes Written=53

View the results of the run:

hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000

Output:

9Tristone.      1
Happy   1
Hello   2
everyday.       1
everyone.       1

By default the output is sorted by word in ascending (lexicographic) order.
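
If you prefer to see the most frequent words first, you can post-process the result with the ordinary shell sort (this is plain shell, not a Hadoop feature):

hdfs dfs -cat /output/part-r-00000 | sort -k2 -nr    #sort by the count column, descending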

MapReduce External Script (Hadoop Streaming) Usage Test

We will use PHP scripts for this test.
Step 1: write the PHP mapper code

mkdir -p /wwwroot/hadoop
vi /wwwroot/hadoop/mapper.php

with the following content:

#!/usr/bin/php
<?php
ini_set('memory_limit', '-1'); // no memory limit; let the OS worry about memory
$word2count = array();
// STDIN (standard input) = fopen("php://stdin", "r");
// read the input line by line
while (($line = fgets(STDIN)) !== false)
{
    // trim leading/trailing whitespace and convert to lowercase
    $line = strtolower(trim($line));
    // \W matches any non-word character,
    // so this splits the line into words on non-word characters;
    // PREG_SPLIT_NO_EMPTY keeps only the non-empty pieces
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increment the count for each word
    foreach ($words as $word)
    {
        if (!isset($word2count[$word]))
        {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}
// write the results to STDOUT (standard output)
// this output is the input of the reduce phase,
// i.e. the input of reducer.php
foreach ($word2count as $word => $count)
{
    // PHP_EOL: the platform end-of-line (\n on Unix, \r\n on Windows, \r on classic Mac)
    // chr(9): tab character, so the output is tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}

Roughly, this code pulls the words out of each line of input and prints them in a form like:
zoo 1
hello 3
world 5
that is, one word and its count per line.
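
Because the mapper only reads STDIN and writes STDOUT, you can sanity-check it locally before involving Hadoop at all. A minimal check, assuming the PHP CLI is installed (as the #!/usr/bin/php shebang implies):

echo 'Hello 9Tristone. Hello everyone. Happy everyday.' | php /wwwroot/hadoop/mapper.php

It should print tab-separated word/count pairs, with hello counted twice.
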
Step 2: write the PHP reducer code

vi /wwwroot/hadoop/reducer.php

with the following content:

#!/usr/bin/php
<?php
ini_set('memory_limit', '-1'); // no memory limit; let the OS worry about memory
$word2count = array();
// STDIN (standard input) = fopen("php://stdin", "r");
// read the input line by line
while (($line = fgets(STDIN)) !== false) {
    // trim leading/trailing whitespace
    $line = trim($line);
    // parse the input we receive from mapper.php
    list($word, $count) = explode(chr(9), $line);
    // convert count (currently a string) to an int
    $count = intval($count);
    // add up the counts for each word
    if ($count > 0)
    {
        $word2count[$word] = ($word2count[$word] ?? 0) + $count;
    }
}
// sort the words lexicographically
// this sort is not strictly required; we only do it so that
// the final output looks like the official Hadoop word count example
// ksort() sorts an associative array by key in ascending order
ksort($word2count);
// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count)
{
    echo $word, chr(9), $count, PHP_EOL;
}

Roughly, this code aggregates the counts produced by the mappers and works out how many times each word occurred in total, then prints the sorted result in a form like:
hello 2
world 1
zoo 5
It accepts input in the form "hello 1", which is exactly what mapper.php emits.
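
You can also simulate the whole job on a single machine: pipe the mapper output through sort (standing in for Hadoop's shuffle phase) and into the reducer. Again this assumes the PHP CLI is available locally:

echo 'Hello world! Hello everyone.' | php /wwwroot/hadoop/mapper.php | sort | php /wwwroot/hadoop/reducer.php

It should print everyone, hello and world with counts 1, 2 and 1, tab-separated.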

Step 3: grant permissions | sync the files | run the code.
Make the scripts executable:

chown -R hadoop:hadoop /wwwroot/hadoop
chmod +x /wwwroot/hadoop/mapper.php /wwwroot/hadoop/reducer.php

Use rsync to synchronize the PHP scripts to every server in the cluster (a rough sketch follows). Once you have confirmed that every machine has them, run the subsequent commands as the hadoop user:
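
The exact rsync invocation isn't given here; as a hedged sketch, with hadoop228 standing in for one of your worker hostnames (not a name from this article; repeat for each node):

rsync -av /wwwroot/hadoop/ hadoop228:/wwwroot/hadoop/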

su hadoop
mkdir ~/test
cd ~/test
echo 'Hello world! Hello everyone.' > tt.log
hdfs dfs -mkdir /input         #create the /input directory on HDFS
hdfs dfs -put tt.log /input    #upload the file we want to count to /input on HDFS
hdfs dfs -ls /input            #after the upload, list the files to confirm it worked

hadoop jar /home/hadoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar -mapper /wwwroot/hadoop/mapper.php -reducer /wwwroot/hadoop/reducer.php -input /input/* -output /output
hdfs dfs -ls /output           #list the result files
hdfs dfs -cat /output/success  #view a result file; replace "success" with the actual file name you see
hdfs dfs -get /output/success  #download a result file to the local machine; again replace "success" with the real name