Word count with Hadoop Streaming

Reference: http://blog.csdn.net/yaoyepeng/article/details/5929457

If HDFS is stuck in safe mode, leave it with: hadoop dfsadmin -safemode leave

or check the filesystem health with: hadoop fsck /

Mapper.py

#!/usr/bin/env python3
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
	# remove leading and trailing whitespace
	line = line.strip()
	# split on whitespace; split() already drops empty strings
	words = line.split()
	# emit a (word, 1) pair for every word
	for word in words:
		print('%s\t%s' % (word, 1))
Reducer.py
#!/usr/bin/env python3
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
	# remove leading and trailing whitespace
	line = line.strip()

	# parse the tab-separated input we got from Mapper.py
	word, count = line.split('\t', 1)
	# convert count (currently a string) to int
	try:
		count = int(count)
		word2count[word] = word2count.get(word, 0) + count
	except ValueError:
		# count was not a number, so silently
		# ignore/discard this line
		pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
	print('%s\t%s' % (word, count))
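Hadoop Streaming guarantees that the reducer sees its input sorted by key, so identical words always arrive on adjacent lines. A reducer can exploit this with itertools.groupby instead of keeping every word in a dict; the sketch below is one possible variant (not from the original post), with reduce_counts as a hypothetical helper name.

```python
import itertools

def reduce_counts(lines):
	"""Sum per-word counts from sorted "word\t<n>" lines, one word at a time."""
	# parse each line into a (word, count) pair
	pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
	# groupby relies on the sorted order the shuffle phase provides
	for word, group in itertools.groupby(pairs, key=lambda p: p[0]):
		yield word, sum(int(count) for _, count in group)

# Example: mapper output for "a b a", already sorted as the shuffle would do.
print(list(reduce_counts(["a\t1", "a\t1", "b\t1"])))  # [('a', 2), ('b', 1)]
```

Because groupby only ever holds one key's records, this version streams in constant memory per word, which matters for large vocabularies.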

test.txt

This is a book!
That is a rular.
I'm a boy.
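With these three lines, the whole map-sort-reduce flow can be traced locally before touching Hadoop. The snippet below is a minimal dry run of the same logic, where the sort call stands in for Hadoop's shuffle phase:

```python
# The three lines of test.txt, inlined for a self-contained dry run.
lines = ["This is a book!", "That is a rular.", "I'm a boy."]

# Map: emit one "word\t1" record per word, as Mapper.py does.
mapped = ["%s\t1" % word for line in lines for word in line.split()]

# Shuffle: Hadoop sorts records by key before they reach the reducer.
mapped.sort()

# Reduce: identical keys are now adjacent, so a running sum suffices.
counts = {}
for record in mapped:
	word, count = record.split("\t", 1)
	counts[word] = counts.get(word, 0) + int(count)

for word in sorted(counts):
	print("%s\t%s" % (word, counts[word]))  # e.g. "a\t3", "is\t2"
```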

hadoop fs -mkdir py_input
hadoop fs -put test.txt py_input

hadoop jar hadoop-streaming.jar -input py_input -output py_output \
	-mapper ./py/Mapper.py -reducer ./py/Reducer.py \
	-file ./py/Mapper.py -file ./py/Reducer.py





