Hadoop Streaming is a utility that ships with the Hadoop distribution. It lets you create and run Map/Reduce jobs with any executable or script as the Mapper and/or the Reducer. The input and output of these jobs are streams of text, which makes them easy to combine with standard Unix tools like awk, sed, grep, etc.
Hadoop Streaming works by passing data between the Hadoop framework and your scripts through the standard streams: the framework feeds each task's input to the script on the standard input stream (stdin) and collects the script's results from the standard output stream (stdout).
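For instance, here is a minimal sketch of a streaming job that uses ordinary Unix tools as the Mapper and Reducer (the HDFS paths /user/me/input and /user/me/output are hypothetical placeholders):
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-mapper /bin/cat \
-reducer /usr/bin/wc \
-input /user/me/input \
-output /user/me/output
Here cat passes every input line through unchanged, and wc counts the lines, words, and bytes it receives.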
The streaming process is divided into two phases: the Map phase and the Reduce phase. During the Map phase, each Mapper task pipes its slice of the input data to the Mapper script, which reads lines from the standard input stream (stdin), processes them, and writes tab-separated key/value pairs to the standard output stream (stdout).
During the Reduce phase, Hadoop sorts the Mapper output by key and pipes it to the Reducer script, which likewise reads from stdin, aggregates the values for each key, and writes the results to stdout.
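Concretely, the contract between the two phases is a stream of tab-separated key/value lines. Given the input line "the quick the", a word-count Mapper such as the one shown below would emit:
the	1
quick	1
the	1
and, because Hadoop sorts the map output by key before the Reduce phase, the Reducer would receive:
quick	1
the	1
the	1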
Here is an example of how to use Hadoop Streaming in Python to count the frequency of words in a text file:
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-file mapper.py \
-mapper 'python mapper.py' \
-file reducer.py \
-reducer 'python reducer.py' \
-input input.txt \
-output output
Here, the hadoop command runs the Hadoop Streaming job. The jar option gives the location of the Hadoop Streaming jar file, the -file options ship the Python scripts for the Mapper and the Reducer to the cluster nodes, the -mapper and -reducer options give the commands that run those scripts, the -input option names the input file, and the -output option names the output directory, which must not already exist.
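On newer Hadoop releases the streaming-specific -file option is deprecated in favor of the generic -files option, so the same job can also be submitted like this (a sketch; consult the documentation for your Hadoop version):
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files mapper.py,reducer.py \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-input input.txt \
-output output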
Here is an example Python Mapper script:
#!/usr/bin/env python
import sys

# input comes from standard input
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit a (word, 1) pair for every word
    for word in words:
        # write the results to standard output
        print('%s\t%s' % (word, 1))
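You can sanity-check the Mapper on its own before submitting a job; the sample text here is just an illustration:
$ echo "foo foo bar" | python mapper.py
foo	1
foo	1
bar	1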
And here is an example Python Reducer script:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from standard input
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop
    # sorts map output by key (here: word) before
    # it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
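Because both scripts are plain filters, the whole job can be rehearsed locally, with sort standing in for Hadoop's shuffle phase:
$ echo "foo foo bar" | python mapper.py | sort -k1,1 | python reducer.py
bar	1
foo	2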
Both scripts follow the same simple pattern: read lines from the standard input stream (stdin), process them, and write tab-separated results to the standard output stream (stdout). That is all Hadoop Streaming requires of a Mapper or Reducer.
Hadoop Streaming is a powerful tool for processing large amounts of data with custom Map/Reduce scripts. Because any executable or script can serve as the Mapper and/or the Reducer, it is both flexible and easy to pick up. Overall, Hadoop Streaming is a great addition to the Hadoop ecosystem and well worth exploring if you are working with big data.