Scala, the "vanguard" of the Spark distributed big-data processing engine, feels a lot like working at a Linux command line, which makes it very comfortable for ops engineers. The test data is simple, just one English sentence ("In the face of the Committee's threatened contempt vote."); change the path to wherever your own file lives and the test program is ready to run.
package com.scala.practice

import scala.io.Source

object WordCount {

  def main(args: Array[String]): Unit = {

    /**
      * Read the file and collect its lines into a list (one element per line).
      * The sentence in the file starts with a literal double quote, and List.toString
      * does not quote strings, which is why the output below looks unbalanced:
      * List("In the face of the Committee's threatened contempt vote.)
      */
    val source = Source.fromFile("/path/news.txt")
    val lines = source.getLines().toList
    source.close() // close the file handle once the lines are in memory

    /**
      * Split each line on spaces to get the individual words, flattened into one list:
      * List("In, the, face, of, the, Committee's, threatened, contempt, vote.)
      */
    val wordList = lines.flatMap(_.split(" "))
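
    // Note: split(" ") only handles single spaces; split("\\s+") would also absorb
    // tabs and runs of whitespace, but a plain space is enough for this test sentence.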

    /**
      * Pair each word with an initial count of 1:
      * List(("In,1), (the,1), (face,1), (of,1), (the,1), (Committee's,1), (threatened,1), (contempt,1), (vote.,1))
      */
    val countList = wordList.map((_, 1))

    /**
      * Group the pairs by word, so identical words land in the same list:
      * Map(Committee's -> List((Committee's,1)), threatened -> List((threatened,1)), contempt -> List((contempt,1)),
      * "In -> List(("In,1)), vote. -> List((vote.,1)), face -> List((face,1)), of -> List((of,1)), the -> List((the,1), (the,1)))
      */
    val grouping = countList.groupBy(_._1)
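
    // grouping has type Map[String, List[(String, Int)]]; a Map gives no guaranteed
    // iteration order, so the entries may print in a different order on your machine.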

    /**
      * Aggregate within each group by summing the counts, and use replace to strip
      * the stray punctuation (the period and the double quote) from each word:
      * Map(Committee's -> 1, threatened -> 1, In -> 1, contempt -> 1, face -> 1, vote -> 1, of -> 1, the -> 2)
      */
    val aggregation = grouping.map(x => (x._1.replace(".", "").replace("\"", ""), x._2.map(_._2).sum))
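
    // Caveat: cleaning the keys after grouping means two raw tokens (say, vote. and vote)
    // can collapse to the same cleaned word, and mapping over a Map keeps only one of the
    // colliding entries rather than summing them. Stripping punctuation before the groupBy
    // avoids this; it makes no difference for this one-sentence input, though.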

    /**
      * Sort by count, in descending order:
      * List((the,2), (Committee's,1), (threatened,1), (In,1), (contempt,1), (face,1), (vote,1), (of,1))
      */
    val sorting = aggregation.toList.sortBy(-_._2) // sort the results by count, descending

    sorting.foreach(println)

    /**
      * Condensed one-liners: the count can come from length (the size of each group)
      * or from sum (adding up the 1s), whichever suits the situation.
      */
    lines.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map(x => (x._1, x._2.length)).toList.sortBy(-_._2).foreach(println)
    lines.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList.sortBy(-_._2).foreach(println)
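
    // On Scala 2.13 or later (an assumption about your version), groupMapReduce fuses
    // the groupBy and map steps of the one-liners above into a single call:
    lines.flatMap(_.split(" ")).groupMapReduce(identity)(_ => 1)(_ + _).toList.sortBy(-_._2).foreach(println)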

  }

}
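
Since the point of practicing Scala here is Spark, it is worth seeing how directly the pipeline carries over to Spark's RDD API. Below is a minimal sketch, assuming a local SparkContext and the same news.txt path (both assumptions, not from the original program); reduceByKey replaces the groupBy-plus-sum step.

package com.scala.practice

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {

  def main(args: Array[String]): Unit = {
    // Run locally on all cores; on a real cluster the master comes from spark-submit
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile("/path/news.txt")  // distributed counterpart of Source.fromFile
      .flatMap(_.split(" "))       // split lines into words
      .map((_, 1))                 // pair each word with a count of 1
      .reduceByKey(_ + _)          // sum counts per word: groupBy + sum in one step
      .sortBy(-_._2)               // descending by count
      .collect()                   // the result is tiny, so bring it back to the driver
      .foreach(println)

    sc.stop()
  }
}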
