p4-mapreduce

heapq

This tutorial will show you how to use the heapq.merge() function. It’s useful for implementing the Reduce Stage in MapReduce.

Example

The heapq.merge() function is part of the Python standard library that implements merge functionality similar to merge sort. It merges multiple sorted inputs into a single sorted output. We have provided an example use case below.

Input

Given two sorted files:

sample1.txt

a 1
a 1
b 2
c 3

sample2.txt

a 2
c 2
d 3

Program

We can merge sample1.txt and sample2.txt into a single sorted output like this sample.py.

import heapq

# Make a list of open file handles
files = []
files.append(open("sample1.txt"))
files.append(open("sample2.txt"))

# Merge
# heapq.merge(*files) is equivalent to heapq.merge(files[0], files[1])
for line in heapq.merge(*files):
    print(line)

The sample code is not pylint-clean because open should be used inside a context manager (with). The best way to handle multiple open files is with contextlib.ExitStack.

Output

Run our program and notice that the output is sorted.

$ python3 sample.py
a 1
a 1
a 2
b 2
c 2
c 3
d 3

Acknowledgments

Original document written by Zachary Papanastasopoulos and Andrew DeOrio awdeorio@umich.edu.

This document is licensed under a Creative Commons Attribution-NonCommercial 4.0 License. You’re free to copy and share this document, but not to sell it. You may not share source code provided with this document.