
Hadoop Streaming

Hadoop Streaming is a utility that comes with the Hadoop distribution. It allows you to create and run Map/Reduce jobs with any executable or script as the Mapper and/or the Reducer. These jobs read and write streams of plain text, which makes it easy to integrate standard Unix tools such as awk, sed, and grep.
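For instance, a streaming job can use stock Unix utilities directly as the Mapper and Reducer. A minimal sketch, where myInputDirs and myOutputDir are placeholders you replace with real HDFS paths:

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc

Here /bin/cat passes each input line through unchanged, and /usr/bin/wc reduces its share of the data to a line/word/character count.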

How it works

Hadoop Streaming works by passing data between the Hadoop framework and your scripts through standard input and output streams. For each task, the framework launches your executable as a separate process, feeds it input records on its standard input stream (stdin), and collects whatever it writes to its standard output stream (stdout). By default, each output line is treated as a key/value pair, split at the first tab character.

The streaming process is divided into two phases: the Map phase and the Reduce phase. During the Map phase, the framework runs your Mapper script in each map task. The script reads input records from the standard input stream (stdin), processes them, and writes intermediate key/value pairs to the standard output stream (stdout).

Between the two phases, Hadoop sorts and groups the Mapper output by key (the shuffle). During the Reduce phase, each Reducer script reads those sorted key/value pairs from stdin, processes them, and writes the final results to stdout, so all lines sharing a key are guaranteed to arrive consecutively.
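For example (hypothetical data), a Mapper counting words might emit these tab-separated pairs:

foo	1
bar	1
foo	1

After the shuffle, the Reducer reads the same pairs sorted by key, with every line for a given word adjacent:

bar	1
foo	1
foo	1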

Example

Here is an example of how to use Hadoop Streaming in Python to count the frequency of words in a text file:

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-file mapper.py \
-mapper 'python3 mapper.py' \
-file reducer.py \
-reducer 'python3 reducer.py' \
-input input.txt \
-output output

Here, the hadoop jar command runs the Hadoop Streaming jar. Each -file option ships a local script to every node in the cluster (recent Hadoop releases prefer the equivalent generic -files option). The -mapper and -reducer options give the commands each task executes, -input specifies the input file or directory on HDFS, and -output specifies the HDFS output directory, which must not exist before the job runs.
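When the job completes, the results land in the output directory on HDFS and can be inspected with the HDFS shell. A quick check, assuming the default single reducer (the exact part-file name may vary with your Hadoop version):

$ hadoop fs -cat output/part-00000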

Here is an example Python Mapper script:

#!/usr/bin/env python3

import sys

# input comes from standard input
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit a count of 1 for every word
    for word in words:
        # write the results, tab-separated, to standard output
        print('%s\t%s' % (word, 1))
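Because the Mapper is just a script reading stdin, you can sanity-check it locally with an ordinary shell pipe, no cluster required:

$ echo 'foo foo bar' | python3 mapper.py
foo	1
foo	1
bar	1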

And here is an example Python Reducer script:

#!/usr/bin/env python3

import sys

current_word = None
current_count = 0
word = None

# input comes from standard input
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop
    # sorts map output by key (here: word) before
    # it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Together, the two scripts implement word counting end to end: the Mapper turns raw text into tab-separated word/count pairs on stdout, and the Reducer sums the counts for each word from its sorted stdin. Because both scripts use nothing but stdin and stdout, the whole pipeline can also be rehearsed locally, as shown below.
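Before submitting the job, a common smoke test is a plain shell pipeline in which sort stands in for Hadoop's shuffle phase (a local sketch, assuming input.txt is an ordinary local file):

$ cat input.txt | python3 mapper.py | sort | python3 reducer.py

If this prints the expected word counts, the same scripts will behave the same way under Hadoop Streaming, since the framework feeds them data through exactly the same streams.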

Conclusion

Hadoop Streaming is a powerful tool for processing large amounts of data with custom Map/Reduce scripts. Because any executable or script can serve as the Mapper or the Reducer, you can write jobs in whatever language you already know. Overall, Hadoop Streaming is a valuable part of the Hadoop ecosystem and well worth exploring if you work with big data.