Machine Learning: Hadoop + Python
Word count with Hadoop Streaming
Reference: http://blog.csdn.net/yaoyepeng/article/details/5929457
If HDFS is stuck in safe mode, leave it with: hadoop dfsadmin -safemode leave
Or check the filesystem first with: hadoop fsck /
Mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # split the line into words; split() with no argument
    # already strips whitespace and drops empty strings
    words = line.split()
    # emit "<word>\t1" for each word; the actual counting
    # happens in the reducer
    for word in words:
        print('%s\t%s' % (word, 1))
Reducer.py
#!/usr/bin/env python
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN (the output of Mapper.py)
for line in sys.stdin:
    # parse the "<word>\t<count>" pairs produced by Mapper.py
    try:
        word, count = line.strip().split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # line was malformed or count was not a number,
        # so silently ignore/discard it
        continue
    word2count[word] = word2count.get(word, 0) + count

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
test.txt
This is a book!
That is a ruler.
I'm a boy.
hadoop fs -mkdir py_input
hadoop fs -put test.txt py_input
hadoop jar hadoop-streaming.jar -input py_input -output py_output -mapper ./py/Mapper.py -reducer ./py/Reducer.py -file ./py/Mapper.py -file ./py/Reducer.py
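Because Streaming just wires the two scripts together through stdin/stdout, with a sort on the mapper output in between, the job logic can be sanity-checked locally before touching the cluster. A minimal sketch of that map -> sort -> reduce flow in plain Python (no Hadoop required; the tiny inline sample text is an arbitrary stand-in for the job input):

```python
# Simulate the Streaming pipeline locally: map -> shuffle/sort -> reduce.
text = "hello world\nhello hadoop"

# Map phase: emit (word, 1) pairs, like Mapper.py prints "word\t1".
pairs = [(word, 1) for line in text.splitlines() for word in line.split()]

# Shuffle/sort phase: Hadoop sorts mapper output by key before reducing.
pairs.sort(key=lambda kv: kv[0])

# Reduce phase: aggregate counts per word, like Reducer.py does.
word2count = {}
for word, count in pairs:
    word2count[word] = word2count.get(word, 0) + count

for word, count in sorted(word2count.items()):
    print('%s\t%s' % (word, count))
```

The same check can be done with the real scripts via a shell pipe (cat test.txt | ./py/Mapper.py | sort | ./py/Reducer.py), since that pipe is exactly what Streaming distributes across the cluster.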