[MapReduce]Filter Pattern
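This is an exercise from Lesson 4 of the Udacity course Intro to Hadoop and MapReduce.
The exercise filters a large collection of forum posts down to the posts that consist of a single sentence. The problem statement and the Python code follow.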
#!/usr/bin/python
import sys
import csv
# To run this code on the actual data, please download the additional dataset.
# You can find instructions in the course materials (wiki) and in the instructor notes.
# There are some things in this data file that are different from what you saw
# in Lesson 3. The dataset is more complicated and closer to what you might
# see in the real world. It was generated by exporting data from a SQL database.
#
# The data in at least one of the fields (the body field) can include newline
# characters, and all the fields are enclosed in double quotes. Therefore, we
# will need to process the data file in a way other than using split(","). To do this,
# we have provided sample code for using the csv module of Python. Each 'line'
# will be a list that contains each field in sequential order.
#
# In this exercise, we are interested in the field 'body' (which is the 5th field,
# line[4]). The objective is to count the number of forum nodes where 'body' either
# contains none of the three punctuation marks: period ('.'), exclamation point ('!'),
# question mark ('?'), or else 'body' contains exactly one such punctuation mark as the
# last character. There is no need to parse the HTML inside 'body'. Also, do not pay
# special attention to newline characters.
punctualSet = ['.', '!', '?']
def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
    for line in reader:
        # Emit only the posts whose body (5th field) is a single sentence.
        if one_sentence(line[4]):
            writer.writerow(line)
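def one_sentence(body):
    tokens = [".", "!", "?"]
    # Ignore one trailing punctuation mark; the 'body and' guard avoids
    # an IndexError on an empty body.
    if body and body[-1] in tokens:
        body = body[:-1]
    # Any punctuation mark that remains means more than one sentence.
    if containsAny(body, tokens):
        return False
    # for token in tokens:
    #     if len(body.strip().split(token)) > 1:
    #         return False
    return True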
def containsAny(body, tokens):
    """Check whether 'body' contains ANY of the chars in 'tokens'"""
    return any(c in body for c in tokens)
test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"This is one sentence\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Also one sentence!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Hey!\nTwo sentences!\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One. Two! Three?\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"One Period. Two Sentences\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"Three\nlines, one sentence\n\"\t\"\"
"""
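# This function allows you to test the mapper with the provided test string
def main():
    # StringIO lives in the io module on Python 3; support both versions.
    try:
        from StringIO import StringIO
    except ImportError:
        from io import StringIO
    sys.stdin = StringIO(test_text)
    mapper()
    sys.stdin = sys.__stdin__

if __name__ == "__main__":
    main()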
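As a cross-check on the rule described in the problem statement, here is a minimal Python 3 sketch that restates the filter as a count of punctuation marks and runs it on the same six test records. The names `is_one_sentence`, `parse_rows`, and `sample` are my own illustrative choices and are not part of the course code, and the counting formulation is an alternative to the strip-then-scan approach used above, not the course's reference solution. It also shows why the `csv` module is used instead of splitting on line breaks: the quoted body field can span several physical lines.

```python
import csv
import io

SENTENCE_ENDERS = (".", "!", "?")


def is_one_sentence(body):
    """True if body has no sentence-ending mark, or exactly one as its last character."""
    count = sum(body.count(mark) for mark in SENTENCE_ENDERS)
    return count == 0 or (count == 1 and body.endswith(SENTENCE_ENDERS))


def parse_rows(text):
    """Parse tab-delimited, double-quoted records; quoted fields may contain newlines."""
    # newline="" leaves line endings untouched so the csv module can handle them.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


# The same six records as test_text above, written as a plain Python 3 string.
sample = (
    '""\t""\t""\t""\t"This is one sentence"\t""\n'
    '""\t""\t""\t""\t"Also one sentence!"\t""\n'
    '""\t""\t""\t""\t"Hey!\nTwo sentences!"\t""\n'
    '""\t""\t""\t""\t"One. Two! Three?"\t""\n'
    '""\t""\t""\t""\t"One Period. Two Sentences"\t""\n'
    '""\t""\t""\t""\t"Three\nlines, one sentence\n"\t""\n'
)

rows = parse_rows(sample)

# Splitting on line breaks sees 9 physical lines, while csv sees 6 records.
print(len(sample.splitlines()), len(rows))  # 9 6

# The body is the 5th field (index 4), as in the exercise.
kept = [row[4] for row in rows if len(row) > 4 and is_one_sentence(row[4])]
print(kept)
# ['This is one sentence', 'Also one sentence!', 'Three\nlines, one sentence\n']
```

Three of the six records pass: the one with no punctuation, the one ending in a single exclamation point, and the three-line body without any sentence-ending mark. Writing the rule as a pure function on a string keeps it easy to test separately from the stdin/stdout plumbing of the mapper.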