[MapReduce] Filter Pattern

Lesley dude · Published 2015-07-27 18:49:49

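This is the exercise from Lesson 4 of the Udacity course Intro to Hadoop and MapReduce.

The exercise filters a large set of forum posts down to the posts that consist of a single sentence. The full problem statement and the Python code follow.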

#!/usr/bin/python
import sys
import csv

# To run this code on the actual data, please download the additional dataset.
# You can find instructions in the course materials (wiki) and in the instructor notes.
# There are some things in this data file that are different from what you saw
# in Lesson 3. The dataset is more complicated and closer to what you might
# see in the real world. It was generated by exporting data from a SQL database.
# 
# The data in at least one of the fields (the body field) can include newline
# characters, and all the fields are enclosed in double quotes. Therefore, we
# will need to process the data file in a way other than using split(","). To do this, 
# we have provided sample code for using the csv module of Python. Each 'line'
# will be a list that contains each field in sequential order.
# 
# In this exercise, we are interested in the field 'body' (which is the 5th field, 
# line[4]). The objective is to count the number of forum nodes where 'body' either 
# contains none of the three punctuation marks: period ('.'), exclamation point ('!'), 
# question mark ('?'), or else 'body' contains exactly one such punctuation mark as the 
# last character. There is no need to parse the HTML inside 'body'. Also, do not pay
# special attention to newline characters.

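# Sentence-ending punctuation marks. Note: this list is not used below;
# one_sentence() defines its own copy.
punctualSet = ['.', '!', '?']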

def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

    for line in reader:
        # Emit the row unchanged when its body (5th field) is a single sentence.
        if one_sentence(line[4]):
            writer.writerow(line)
            
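def one_sentence(body):
    """Return True if body has no '.', '!' or '?', or exactly one as its last character."""
    tokens = [".", "!", "?"]
    # Guard against an empty body, where body[-1] would raise IndexError.
    if body and body[-1] in tokens:
        body = body[:-1]
    # Any mark left after dropping a trailing one means more than one sentence.
    if containsAny(body, tokens):
        return False
    return True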

def containsAny(body, tokens):
    """Check whether 'body' contains ANY of the strings in 'tokens'"""
    return any(c in body for c in tokens)

test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"This is one sentence\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Also one sentence!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Hey!\nTwo sentences!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One. Two! Three?\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One Period. Two Sentences\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Three\nlines, one sentence\n\"\t\"\"
"""

# This function allows you to test the mapper with the provided test string
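def main():
    # StringIO lives in different modules in Python 2 and Python 3.
    try:
        from StringIO import StringIO  # Python 2
    except ImportError:
        from io import StringIO  # Python 3
    sys.stdin = StringIO(test_text)
    mapper()
    sys.stdin = sys.__stdin__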

if __name__ == "__main__":
    main()
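
As a quick check of the filter rule, here is a minimal standalone sketch, assuming Python 3 (the script above targets Python 2). The names `SENTENCE_ENDINGS`, `is_one_sentence` and `filter_rows` are hypothetical and not part of the course template. It counts the punctuation marks instead of stripping the last character, which should give the same result as `one_sentence` and also accepts an empty body. It also shows `csv.reader` keeping a quoted body with an embedded newline as one field.

```python
import csv
import io

# Hypothetical name: the three sentence-ending marks from the problem statement.
SENTENCE_ENDINGS = ".!?"


def is_one_sentence(body):
    """True if body has no ending mark, or exactly one as its last character."""
    count = sum(body.count(mark) for mark in SENTENCE_ENDINGS)
    if count == 0:
        return True
    return count == 1 and body[-1] in SENTENCE_ENDINGS


def filter_rows(stream):
    """Parse tab-delimited, double-quoted rows and keep one-sentence bodies."""
    reader = csv.reader(stream, delimiter="\t")
    # Skip rows that are too short to have a body field (index 4).
    return [row for row in reader if len(row) > 4 and is_one_sentence(row[4])]


# Four rows in the same layout as the test string above; the third body
# contains a newline inside the quotes.
sample = (
    '""\t""\t""\t""\t"This is one sentence"\t""\n'
    '""\t""\t""\t""\t"Also one sentence!"\t""\n'
    '""\t""\t""\t""\t"Hey!\nTwo sentences!"\t""\n'
    '""\t""\t""\t""\t"One Period. Two Sentences"\t""\n'
)

kept = [row[4] for row in filter_rows(io.StringIO(sample))]
print(kept)
```

Only the first two bodies pass the filter. The third row is still read as a single record even though its body spans two lines, which is why the exercise uses the csv module rather than split.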