MapReduce notes

I. Example 1

https://www.cnblogs.com/kaituorensheng/p/3826114.html

Writing a MapReduce program in Python with Hadoop Streaming

1. First, write map.py

#!/usr/bin/env python
# coding: utf8

import sys

for line in sys.stdin:
   line = line.strip()
   film_d = line.split(";")
   print(film_d[0])

2. Then write red.py:

#!/usr/bin/env python
# coding: utf8

import sys

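cur_film = "惊天魔盗团2"  # film title to count; must match the input data
cur_count = 0
for line in sys.stdin:
    # substring test: every line containing the title is counted
    if cur_film in line:
        cur_count += 1
print('%s appeared %s times in total.' % (cur_film, cur_count))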
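
The logic of the two scripts can be checked locally before a job is submitted. The following is a minimal sketch, not part of the original post. It assumes each log record is a semicolon-separated line whose first field is the film title; the sample records and the names `map_lines` and `count_film` are made up for illustration.

```python
# Hypothetical local check of the map.py / red.py logic; no Hadoop involved.

def map_lines(lines):
    """Mimic map.py: emit the first ';'-separated field of each line."""
    for line in lines:
        yield line.strip().split(";")[0]


def count_film(lines, film):
    """Mimic red.py: count the lines that contain the film title."""
    return sum(1 for line in lines if film in line)


# Made-up sample records; the real layout of dat0203_1.log is not shown here.
sample = [
    "惊天魔盗团2;2016-06-24;cinema_a\n",
    "other_film;2016-06-24;cinema_b\n",
    "惊天魔盗团2;2016-06-25;cinema_a\n",
]

titles = list(map_lines(sample))
total = count_film(titles, "惊天魔盗团2")
print(total)
```

Because red.py uses a substring test (`in`), a title that merely contains the search string is counted as well; comparing with `==` after stripping the line would count exact matches only.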

3. Run the program:

1) Upload the data
• hdfs dfs -put ~/dat0203_1.log input/

2) Change the file permissions
• chmod 777 map.py
• chmod 777 red.py

3) Start the job
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
    -file ~/map.py -mapper ~/map.py \
    -file ~/red.py -reducer ~/red.py \
    -input input -output output

II. Example 2

1. mapper.py

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
   # remove leading and trailing whitespace
   line = line.strip()
   # split the line into words
   words = line.split()
   # increase counters
   for word in words:
       # write the results to STDOUT (standard output);
       # what we output here will be the input for the
       # Reduce step, i.e. the input for reducer.py
       #
       # tab-delimited; the trivial word count is 1
       print('%s\t%s' % (word, 1))

2. reducer.py

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
   # remove leading and trailing whitespace
   line = line.strip()

   # parse the input we got from mapper.py
   word, count = line.split('\t', 1)

   # convert count (currently a string) to int
   try:
       count = int(count)
   except ValueError:
       # count was not a number, so silently
       # ignore/discard this line
       continue

   # this IF-switch only works because Hadoop sorts map output
   # by key (here: word) before it is passed to the reducer
   if current_word == word:
       current_count += count
   else:
       if current_word:
           # write result to STDOUT
           print('%s\t%s' % (current_word, current_count))
       current_count = count
       current_word = word

# do not forget to output the last word if needed!
if current_word == word:
   print('%s\t%s' % (current_word, current_count))
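
The comparison of `current_word` with `word` above relies on all lines for one key arriving consecutively. The same idea can be sketched with `itertools.groupby` from the standard library, which likewise groups only adjacent equal keys. This is an illustrative alternative, not the script used in these notes; the names `parse` and `reduce_counts` are made up, and counts are assumed to be non-negative integers.

```python
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Yield (word, count) pairs from tab-separated lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.strip().partition("\t")
        if sep and count.isdigit():
            yield word, int(count)


def reduce_counts(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Input in the order a reducer would see it: sorted by key.
sorted_input = ["hello\t1\n", "hello\t1\n", "hi\t1\n", "world\t1\n", "world\t1\n"]
result = list(reduce_counts(sorted_input))
for word, total in result:
    print("%s\t%s" % (word, total))
```

If the input is not sorted, the same word is reported more than once: `["a\t1\n", "b\t1\n", "a\t1\n"]` yields three separate groups. The hand-written loop in reducer.py behaves the same way.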

3. test.txt

hello
hello world
world hi
hi world

4. Run the program

chmod 777 mapper.py
chmod 777 reducer.py

ls /

root@hadoop1:~# ls
mapper.py  reducer.py  test.txt

hdfs dfs -put test.txt /

hadoop jar /opt/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -file mapper.py -mapper mapper.py \
    -file reducer.py -reducer reducer.py \
    -input /test.txt -output /output

hdfs dfs -cat /output/*
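
The expected content of /output for test.txt can be worked out without a cluster. The following is a rough in-process simulation of the map, sort and reduce phases, assuming a single reducer. It is not how Hadoop executes the job, and the function names `mapper` and `reducer` are illustrative.

```python
TEST_TXT = """hello
hello world
world hi
hi world
"""


def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated word, like mapper.py."""
    for line in lines:
        for word in line.split():
            yield "%s\t%s" % (word, 1)


def reducer(sorted_lines):
    """Sum the counts of consecutive equal words, like reducer.py."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s\t%s" % (current_word, current_count)
            current_word, current_count = word, int(count)
    # Emit the last word, if any input was seen.
    if current_word is not None:
        yield "%s\t%s" % (current_word, current_count)


# sorted() stands in for Hadoop's sort-by-key between the two phases.
output = list(reducer(sorted(mapper(TEST_TXT.splitlines()))))
print("\n".join(output))
# prints three tab-separated lines: hello 2, hi 2, world 3
```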
