p5-search-engine

EECS 485 Project 5: Search Engine

Due 11:59pm ET December 5, 2021. This is a group project to be completed in groups of two or three.

Change Log

Initial Release for F21

2021-11-16 Simplify MapReduce pipeline example

2021-11-16 Add template reducer code to the Hadoop Streaming tutorial.

2021-11-16 Improve starter file pipeline.sh. The old one was misleading. Download a new copy.

2021-11-17 Changed starter_files/hadoop/inverted_index/output_sample/ to be sorted in alphabetical order.

2021-11-22 Clarify that the exit status code of ./bin/search status depends on the Search server, not the Index server.

2021-11-29 Clarify that doc IDs in the inverted index should be sorted with string comparison, not numeric comparison. For example, doc ID 1234 < doc ID 99. Pro-tip: The MapReduce grouper can do this for you.

Introduction

Build a scalable search engine similar to Google or Bing.

The learning goals of this project are information retrieval concepts like PageRank and tf-idf, parallel data processing with MapReduce, and writing an end-to-end search engine.

Create a segmented inverted index of web pages using a pipeline of MapReduce programs. Then, wrap the inverted index in an Index server REST API service, which is segmented by document. Finally, provide a front-end user interface that communicates with the collection of Index segment servers and returns ranked search results.

Setup

Group registration

Register your group on the Autograder.

Project folder

Create a folder for this project (instructions). Your folder location might be different.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine

Version control

Set up version control using the Version control tutorial.

Be sure to check out the Version control for a team tutorial.

After you’re done, you should have a local repository with a “clean” status and your local repository should be connected to a remote GitLab repository.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ git status
On branch main
Your branch is up-to-date with 'origin/main'.

nothing to commit, working tree clean
$ git remote -v
origin	https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (fetch)
origin	https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (push)

You should have a .gitignore file (instructions).

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ head .gitignore
This is a sample .gitignore file that's useful for EECS 485 projects.
...

Python virtual environment

Create a Python virtual environment using the Python Virtual Environment Tutorial.

Check that you have a Python virtual environment, and that it’s activated (remember source env/bin/activate).

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ ls -d env
env
$ echo $VIRTUAL_ENV
/Users/awdeorio/src/eecs485/p5-search-engine/env

Starter files

Download and unpack the starter files.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ wget https://eecs485staff.github.io/p5-search-engine/starter_files.tar.gz
$ tar -xvzf starter_files.tar.gz

Move the starter files to your project directory and remove the original starter_files/ directory and tarball.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ mv starter_files/* .
$ rm -rf starter_files starter_files.tar.gz

This project involves very few starter files. You have learned everything you need to build this project from (almost) scratch. Each major part of the spec will walk you through the expected directory structure. It’s up to you to move the starter files into the right places.

At the end of this project, your directory structure should look something like this:

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ tree -I 'env|__pycache__'
.
├── bin
│   ├── index
│   ├── indexdb
│   ├── install
│   └── search
├── hadoop
│   ├── hadoop-streaming-2.7.2.jar
│   ├── inverted_index
│   │   ├── example_input
│   │   │   └── input.csv
│   │   ├── example_output
│   │   │   ├── part-00000
│   │   │   ├── part-00001
│   │   │   └── part-00002
│   │   ├── input
│   │   │   └── input.csv
│   │   ├── map0.py
│   │   ├── map1.py
│   │   ├── ...
│   │   ├── output_sample
│   │   │   ├── part-00000
│   │   │   ├── part-00001
│   │   │   └── part-00002
│   │   ├── pipeline.sh
│   │   ├── reduce0.py
│   │   ├── reduce1.py
│   │   ├── ...
│   │   └── stopwords.txt
│   └── word_count
│       ├── input
│       │   ├── input01.txt
│       │   └── input02.txt
│       ├── map.py
│       └── reduce.py
├── index
│   ├── index
│   │   ├── __init__.py
│   │   ├── api
│   │   │   ├── __init__.py
│   │   │   └── *.py
│   │   ├── inverted_index
│   │   │   ├── inverted_index_0.txt
│   │   │   ├── inverted_index_1.txt
│   │   │   └── inverted_index_2.txt
│   │   ├── pagerank.out
│   │   └── stopwords.txt
│   ├── requirements.txt
│   └── setup.py
├── search
│   ├── requirements.txt
│   ├── search
│   │   ├── __init__.py
│   │   ├── config.py
│   │   ├── model.py
│   │   ├── sql
│   │   │   └── index.sql
│   │   ├── static
│   │   │   └── css
│   │   │       └── style.css
│   │   ├── templates
│   │   │   └── *.html
│   │   ├── var
│   │   │   └── index.sqlite3
│   │   └── views
│   │       ├── __init__.py
│   │       └── *.py
│   └── setup.py
└── tests
    └── ...

Install Hadoop

Hadoop is a MapReduce framework, similar to your Project 4 implementation. You will be using Hadoop to execute the MapReduce programs in your Inverted Index Pipeline.

Hadoop does not work on all machines. For this project, we will use a lightweight Python implementation that imitates the Hadoop Streaming Interface. The source code is in tests/utils/hadoop.py.

$ pwd # Make sure you are in the root directory
/Users/awdeorio/src/eecs485/p5-search-engine
$ echo $VIRTUAL_ENV # Check that you have a virtual environment
/Users/awdeorio/src/eecs485/p5-search-engine/env
...
$ pushd $VIRTUAL_ENV/bin
$ ln -sf ../../tests/utils/hadoop.py hadoop # Link to provided hadoop implementation
$ popd
$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ which hadoop
/Users/awdeorio/src/eecs485/p5-search-engine/env/bin/hadoop
$ hadoop --version
Fake Hadoop 1.0 by Andrew DeOrio <awdeorio@umich.edu>

If you want to play around with the real Hadoop, Appendix C lists the installation steps. However, for this project we will use our Python implementation of Hadoop. You do not need the real Hadoop for this project.

Install script

Installing the whole toolchain requires a lot of steps! Follow the Project 3 Tutorial and write a bash script bin/install to install your app. However, you will need to make some small changes compared to your previous install script.

Unlike Project 3, this project does not have a client-side dynamic portion. As a result, you do not need to install nodeenv, chromedriver or any other client-side dynamic tools.

This project has a different folder structure, so the server installation instructions need to change. You will also have two backend servers running for this project, so you will need to install two different Python packages.

If you run the install script on CAEN, pip install might throw PermissionDenied errors when updating packages. To allow installation and updates of packages using pip, add the following lines before the first pip install command.

You will also need to install the Python implementation of Hadoop. Add the symlink commands from the Install Hadoop section above to the end of your install script.

MapReduce Example

Next, we will run a MapReduce word count program. See the Hadoop Streaming Tutorial for a detailed explanation. Your inverted index pipeline will consist of multiple MapReduce programs much like this example.

The example MapReduce program is in hadoop/word_count/. The code is in map.py and reduce.py.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/word_count
$ tree
.
├── input
│   ├── input01.txt
│   └── input02.txt
├── map.py
└── reduce.py

Take a look at the input files.

$ head input/input*
==> input/input01.txt <==
Hello World
Bye World

==> input/input02.txt <==
Hello Hadoop
Goodbye Hadoop

Run the MapReduce job. By default, there will be one mapper for each input file.

$ hadoop \
  jar ../hadoop-streaming-2.7.2.jar \
  -input input \
  -output output \
  -mapper ./map.py \
  -reducer ./reduce.py
Starting map stage
+ ./map.py < output/hadooptmp/mapper-input/part-00000 > output/hadooptmp/mapper-output/part-00000
+ ./map.py < output/hadooptmp/mapper-input/part-00001 > output/hadooptmp/mapper-output/part-00001
Starting group stage
+ cat output/hadooptmp/mapper-output/* | sort > output/hadooptmp/grouper-output/sorted.out
Starting reduce stage
+ reduce.py < output/hadooptmp/grouper-output/part-00000 > output/hadooptmp/reducer-output/part-00000
+ reduce.py < output/hadooptmp/grouper-output/part-00001 > output/hadooptmp/reducer-output/part-00001
+ reduce.py < output/hadooptmp/grouper-output/part-00002 > output/hadooptmp/reducer-output/part-00002
+ reduce.py < output/hadooptmp/grouper-output/part-00003 > output/hadooptmp/reducer-output/part-00003
Output directory: output

After the MapReduce job finishes, the output will be in the output directory. Each reducer produces one part-XXXXX file.

$ ls output/
part-00000 part-00001 part-00002 part-00003
$ head output/part-*
==> output/part-00000 <==
Bye	1
World	2

==> output/part-00001 <==
Goodbye	1

==> output/part-00002 <==
Hadoop	2

==> output/part-00003 <==
Hello	2
$ cat output/part-*
Bye	1
World	2
Goodbye	1
Hadoop	2
Hello	2

Inverted Index Pipeline

Build an inverted index using a pipeline of MapReduce programs. Write each map and reduce program as a stand-alone Python program compatible with the Hadoop Streaming Interface. We’ll provide a lightweight Hadoop streaming work-alike that runs on your local development machine.

The input is a copy of Wikipedia pages (documents). The output is an inverted index containing inverse document frequency, term frequency, and document normalization factor. The output is several files segmented by document.

Before continuing, read the Hadoop Streaming Tutorial.

Directory structure

We’ve provided most of the files to get you started.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop
$ tree -I 'env|__pycache__'
.
├── hadoop-streaming-2.7.2.jar
├── inverted_index
│   ├── example_input
│   │   └── input.csv
│   ├── example_output
│   │   ├── part-00000
│   │   ├── part-00001
│   │   └── part-00002
│   ├── input
│   │   └── input.csv
│   ├── map0.py  # Write this
│   ├── map1.py  # Write this
│   ├── ...
│   ├── output_sample
│   │   ├── part-00000
│   │   ├── part-00001
│   │   └── part-00002
│   ├── pipeline.sh # Edit this
│   ├── reduce0.py  # Write this
│   ├── reduce1.py  # Write this
│   ├── ...
│   └── stopwords.txt
└── word_count
    ├── input
    │   ├── input01.txt
    │   └── input02.txt
    ├── map.py
    └── reduce.py

Within the inverted_index directory:

Input

The input is a CSV file, hadoop/inverted_index/input/input.csv, representing a list of documents. Each line is one document, and has 3 comma-separated values: "<doc_id>","<doc_title>","<doc_body>".

For example, the wiki article about Gold Mining in Alaska.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
$ grep 'Gold mining in Alaska' input/input.csv
"15322302","Gold mining in Alaska","Gold mining in Alaska, a state of the States, has been a major industry ...

Pro-tip: Use the Python csv library and add the line csv.field_size_limit(sys.maxsize) (doc_body is very large for some documents).

Pro-tip: Start with the small example input example_input/input.csv.
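For example, a map program might begin by reading the CSV input from stdin like this. This is only a sketch; what the mapper emits is up to you.

#!/usr/bin/env python3
"""Sketch of reading input.csv inside a map program."""
import csv
import sys

csv.field_size_limit(sys.maxsize)  # doc_body is very large for some documents

for doc_id, doc_title, doc_body in csv.reader(sys.stdin):
    # Each row is "<doc_id>","<doc_title>","<doc_body>"
    ...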

Cleaning

Follow these cleaning steps for one document; a minimal sketch follows the list below. When you parse a query in the Index server, you’ll use the same procedure.

  1. Combine both document title and document body by concatenating them, separated by a space.

  2. Remove non-alphanumeric characters (that also aren’t spaces) like this:
    import re
    text = re.sub(r"[^a-zA-Z0-9 ]+", "", text)
    
  3. The inverted index should be case insensitive. Convert upper case characters to lower case using casefold().

  4. Split the text into whitespace-delimited terms.

  5. Remove stop words. We have provided a list of stop words in hadoop/inverted_index/stopwords.txt.
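Here is a minimal sketch of steps 2-5 as a helper function. Step 1 happens in the caller, which concatenates the title and body; the stopwords argument is assumed to be a set loaded from stopwords.txt, and the function name is illustrative.

import re

def clean(text, stopwords):
    """Clean document or query text into a list of terms."""
    # Remove non-alphanumeric characters that aren't spaces
    text = re.sub(r"[^a-zA-Z0-9 ]+", "", text)
    # Case insensitive
    text = text.casefold()
    # Split into whitespace-delimited terms and remove stop words
    return [term for term in text.split() if term not in stopwords]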

Output

The output is an inverted index, with one term on each line. For example, the term smelting is in the provided sample (the sample is described in more detail in the Testing section).

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
$ grep smelting output_sample/part-00000
smelting 1.5648920412154652 11835570 1 5434.523099893447 11936580 1 1170.6006515649055 ...

The elements in one line of the inverted index are separated by a space. The first item is the term, then the inverse document frequency. After that, items come in groups of 3, where each group represents the occurrence of the term in one document. For example:

smelting 1.564 11835570 1 5434.523 11936580 1 1170.600

Term \(t_{k}\): smelting
Inverse document frequency \(idf_{k}\): 1.564
First document in which \(t_{k}\) appears: doc ID \(d_{i}\) = 11835570, term frequency \(tf_{ik}\) = 1, normalization factor \(\vert d_{i} \vert\) = 5434.523
Second document in which \(t_{k}\) appears: doc ID \(d_{j}\) = 11936580, term frequency \(tf_{jk}\) = 1, normalization factor \(\vert d_{j} \vert\) = 1170.600

One normalization factor applies to one document \(d_{i}\). The value stored in the inverted index omits the square root. See the tf-idf calculation section for more details.

Terms must be in ascending sorted order, which means that the terms should be in alphabetical order.

Doc IDs must be in ascending sorted order. When a term appears in multiple documents, multiple doc IDs may appear on one line of an inverted index segment. Sort these doc IDs using string comparison, not numeric comparison, for example doc ID 1234 < doc ID 99. Pro-tip: The MapReduce grouper can do this for you.
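For reference, one line in this format could be parsed like this. This is a sketch with a hypothetical helper name, not part of the starter files.

def parse_line(line):
    """Parse one inverted index line into its term, idf, and postings."""
    parts = line.split()
    term, idf = parts[0], float(parts[1])
    postings = []
    # After the idf, items come in groups of 3: doc ID, term frequency, norm factor
    for i in range(2, len(parts), 3):
        postings.append((parts[i], int(parts[i + 1]), float(parts[i + 2])))
    return term, idf, postings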

Segments

The output inverted index is divided into several files, segmented by document. Specifically, 3 files partitioned by doc_id % 3. For example, all documents where doc_id % 3 == 0 should be in the same file.

The same term may appear in multiple segments because the term appears in different documents. This example shows only the first document on each line (cut -d' ' -f1-5 shows columns 1-5). Notice that 11835570 % 3 == 0 is in part-00000; 12687106 % 3 == 1 is in part-00001; and 1097432 % 3 == 2 is in part-00002.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
$ grep '^smelting ' output/part-* | cut -d' ' -f1-5
output/part-00000:smelting 1.5648920412154652 11835570 1 5434.523099893447
output/part-00001:smelting 1.5648920412154652 12687106 1 27712.408476722147
output/part-00002:smelting 1.5648920412154652 1097432 4 120180.83720916476

A small input might not have any occurrences of a term in a doc such that doc_id % 3 == 0, for example. That means your final MapReduce output may have fewer than 3 files for small inputs.

tf-idf calculation

Here’s a copy of the term frequency and inverse document frequency definitions.

\(w_{ik} = tf_{ik}*idf_{k}\): tf-idf score for term \(t_{k}\) in document \(d_{i}\)
\(t_{k}\): Term \(k\)
\(d_{i}\): Document \(i\)
\(tf_{ik}\): Frequency of term \(t_{k}\) in document \(d_{i}\)
\(idf_{k} = log(N/n_{k})\): Inverse document frequency of term \(t_{k}\) in collection \(C\)
\(N\): Number of documents in collection \(C\)
\(n_{k}\): Number of documents that contain term \(t_{k}\)
\(\vert d_{i} \vert = \sqrt{\sum_{k=1}^t w_{ik}^2} = \sqrt{\sum_{k=1}^t \left( tf_{ik} * idf_{k} \right)^2}\): Normalization factor for one document \(d_{i}\), computed over every term \(t_{k}\) in that document*

*The normalization factor stored in the inverted index omits the square root.

Use log base 10 in your calculations.

Your numbers may be slightly different due to floating point arithmetic. We compare values with a tolerance of 5%.
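For instance, the two quantities stored in the inverted index translate into Python roughly like this (a sketch; the function names are illustrative):

import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency: log base 10 of N / n_k."""
    return math.log10(total_docs / docs_with_term)

def normalization_factor(tf_idf_weights):
    """Normalization factor for one document as stored in the index (square root omitted)."""
    return sum(weight * weight for weight in tf_idf_weights)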

MapReduce pipeline

A MapReduce pipeline executes several MapReduce programs. The output of one MapReduce program is the input to the next MapReduce program.

We have provided hadoop/inverted_index/pipeline.sh, which has helpful comments for getting started.

Job 0

The first MapReduce job counts the total number of documents in the collection. It should read input from input/ and write output to output0.

The mapper and reducer executables should be named map0.py and reduce0.py. The output should be a single integer.

In pipeline.sh, copy the output (e.g., output0/part-00000) to total_document_count.txt. In later pipeline jobs, read the document count from total_document_count.txt, not output0/part-00000.

For example:

$ cat example_input/input.csv | ./map0.py | sort | ./reduce0.py
3
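For reference, here is a minimal sketch of this job. It assumes a throwaway key and a single reducer; your key choice and output handling may differ.

#!/usr/bin/env python3
"""map0.py sketch: emit one count per document."""
import csv
import sys

csv.field_size_limit(sys.maxsize)

for _ in csv.reader(sys.stdin):
    print("documents\t1")

#!/usr/bin/env python3
"""reduce0.py sketch: sum the counts into a single integer."""
import sys

total = 0
for line in sys.stdin:
    _key, _, count = line.partition("\t")
    total += int(count)
print(total)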

Jobs 1-N pipeline

MapReduce job 1 should read input from input/ and write output to output1/. MapReduce job 2 should read input from output1/ and write output to output2/. Etc.

Your pipeline may contain up to 9 MapReduce jobs, but the inverted index can be produced in fewer. Name your map and reduce executables in pipeline order: map0.py and reduce0.py, map1.py and reduce1.py, etc.

Pro-tip: Use our template reducer code to get started on each reducer.

Job N

The last job in the pipeline should produce 3 files, one for each segment. We recommend using doc_id % 3 as the mapper output key in the last job. Then, specify -numReduceTasks 3 in pipeline.sh for the last job.
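As a sketch, the final map program might key each line by its segment number. This assumes the previous job's output lines start with the doc ID followed by a tab; adapt it to your own intermediate format.

#!/usr/bin/env python3
"""Sketch of the final map program: key each line by doc_id % 3."""
import sys

for line in sys.stdin:
    doc_id, _, rest = line.partition("\t")
    print(f"{int(doc_id) % 3}\t{doc_id}\t{rest}", end="")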

Example

We’ve provided a small example input with correct output. Run the pipeline and make sure it generated 3 output files.

$ ./pipeline.sh example_input
...
Output directory: output
$ ls output
part-00000  part-00001  part-00002

Compare your output files to the solution output files. Take a look at our diff tips and tricks for help narrowing down a mismatch.

$ diff output/part-00000 example_output/part-00000
$ diff output/part-00001 example_output/part-00001
$ diff output/part-00002 example_output/part-00002

Pro-tip: Write a quick shell script to run your pipeline with the example input and check the output. You could save this to example_run.sh.

#!/bin/bash
set -Eeuxo pipefail

./pipeline.sh example_input
diff example_output/part-00000 output/part-00000
diff example_output/part-00001 output/part-00001
diff example_output/part-00002 output/part-00002

Quick link to the Shell script pitfalls

Pro-tip: Debug a single map or reduce program at the command line. For example, here’s how to debug the job 2 map program. The example assumes that you’ve already created total_document_count.txt.

$ cat example_input/input.csv | ./map1.py | sort | ./reduce1.py | ./map2.py

Still having trouble? Here’s how to use a debugger in a map or reduce program.

Testing

Once you have implemented your MapReduce pipeline to create the inverted index, you should be able to pass all tests in test_pipeline_public.py. These tests have small inputs.

$ pytest -v tests/test_pipeline_public.py

Take a look at Appendix B: MapReduce Pipeline Testing and Debugging for lots of tips and tricks.

Index server

The Index server and Search server are separate services. The Index server is a REST API app that returns search results in JSON format. You’ll run several Index servers, one for each inverted index segment. The Search server is a server-side dynamic pages app that returns search results in HTML format. It makes REST API calls to each Index server segment and combines the results.

GET /api/v1/

Return a list of services available. The output should look exactly like this example.

$ http "localhost:9000/api/v1/"
HTTP/1.0 200 OK
...
{
    "hits": "/api/v1/hits/",
    "url": "/api/v1/"
}
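A minimal Flask sketch of this route, assuming the Flask app object lives in the index package (as in the snippets later in this spec):

"""Sketch of the services list route."""
import flask
import index

@index.app.route("/api/v1/")
def get_services():
    """Return the list of services available."""
    return flask.jsonify({
        "hits": "/api/v1/hits/",
        "url": "/api/v1/",
    })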

GET /api/v1/hits/

Return a list of hits with doc ID and score. The query is specified by the parameter q=<query>.

This example searches for “university michigan” using one Index segment server running on port 9000.

$ http "localhost:9000/api/v1/hits/?q=university+michigan"
HTTP/1.0 200 OK
...
{
    "hits": [
        {
            "docid": 218076,
            "score": 0.09465820929053777
        },
        {
            "docid": 19224105,
            "score": 0.09111997207247934
        },
        ...
    ]
}

PageRank weight

The weight of PageRank in the score is specified by the optional parameter w=<weight>. If w=<weight> is not specified, use the default value of 0.5.

This example searches for “university michigan” with a PageRank weight of 1.0 using one Index segment server running on port 9000. Notice that the results are different.

$ http "localhost:9000/api/v1/hits/?q=university+michigan&w=1.0"
HTTP/1.0 200 OK
...
{
    "hits": [
        {
            "docid": 21648,
            "score": 0.00133111
        },
        {
            "docid": 7567080,
            "score": 0.000369752
        },
        ...
    ]
}
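Inside your hits route, the query parameters might be read with Flask's request object like this (a sketch; the helper name is hypothetical):

import flask

def get_query_params():
    """Read the query and the PageRank weight from the request URL."""
    query = flask.request.args.get("q", default="")
    weight = flask.request.args.get("w", default=0.5, type=float)
    return query, weight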

Directory structure

Here’s what your final Index server directory structure should look like.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ tree index -I '__pycache__|*.egg-info'
index
├── index
│   ├── __init__.py
│   ├── api
│   │   ├── __init__.py
│   │   ├── ...
│   │   └── main.py
│   ├── inverted_index
│   │   ├── inverted_index_0.txt
│   │   ├── inverted_index_1.txt
│   │   └── inverted_index_2.txt
│   ├── pagerank.out
│   └── stopwords.txt
├── requirements.txt
└── setup.py

The inverted index segments are copied from the MapReduce pipeline output. For example, copy hadoop/inverted_index/output/part-00000 to index/index/inverted_index/inverted_index_0.txt.

The stopwords.txt file is copied from hadoop/inverted_index.

Install and run

Install the Index server app.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ pip install -r index/requirements.txt
$ pip install -e index

Run the Index server. The Flask app reads the name of an inverted index segment file from the INDEX_PATH environment variable.

$ FLASK_APP=index FLASK_ENV=development INDEX_PATH=inverted_index_0.txt flask run --host 0.0.0.0 --port 9000
...
 * Running on http://127.0.0.1:9000/ (Press CTRL+C to quit)

Load the inverted index

The Index server loads the inverted index file, the PageRank file, and the stopwords file into memory. It should load the files into memory only once, at the time of the first request.

To load files into memory at the time of the first request, use Flask’s @index.app.before_first_request. In this snippet from the instructor solution, the startup() function will be automatically called by Flask exactly once.

@index.app.before_first_request
def startup():
    """Load inverted index, pagerank, and stopwords into memory."""
    index_dir = pathlib.Path(__file__).parent.parent
    read_stopwords(index_dir)
    read_pagerank(index_dir)
    read_inverted_index(index_dir)

Make sure that your Index server reads stopwords.txt and pagerank.out from the index/index/ directory, which is index_dir in the example code above.
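As a sketch, read_inverted_index() from the snippet above might look roughly like this. The module-level dictionary, the in-memory format, and the use of app.config are assumptions; organize the data however suits your query code.

import index  # The Flask app object lives in the index package

INVERTED_INDEX = {}

def read_inverted_index(index_dir):
    """Load one inverted index segment into memory, keyed by term."""
    path = index_dir/"inverted_index"/index.app.config["INDEX_PATH"]
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            parts = line.split()
            # parts[0] is the term, parts[1] is the idf, then groups of
            # (doc ID, term frequency, normalization factor)
            INVERTED_INDEX[parts[0]] = parts[1:]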

Inverted index segments

Different instances of the Index server will serve different segments of the inverted index. One Index server should load one segment from the file specified by the environment variable INDEX_PATH. Here’s a snippet from our instructor solution index/index/__init__.py:

app.config["INDEX_PATH"] = os.getenv("INDEX_PATH")

Recall that when you run the Index server, you set the environment variable, for example INDEX_PATH=inverted_index_0.txt.

$ FLASK_APP=index FLASK_ENV=development INDEX_PATH=inverted_index_0.txt flask run --host 0.0.0.0 --port 9000

Example with two segments

We’ll start two Index servers on two ports, then make REST API requests for the query “cat” to both servers.

Start an Index server for segment 0 on port 9000.

$ FLASK_APP=index FLASK_ENV=development INDEX_PATH=inverted_index_0.txt flask run --host 0.0.0.0 --port 9000
...
 * Running on http://0.0.0.0:9000/ (Press CTRL+C to quit)

Start an Index server for segment 1 on port 9001. Do this in a second terminal window.

$ FLASK_APP=index FLASK_ENV=development INDEX_PATH=inverted_index_1.txt flask run --host 0.0.0.0 --port 9001
...
 * Running on http://0.0.0.0:9001/ (Press CTRL+C to quit)

Make two requests for the query “cat”, one to each Index server. Do this in a third terminal window. Notice that the two Index segment servers return different doc IDs for the same query.

$ http "localhost:9000/api/v1/hits/?q=cat"
{
    "hits": [
        {
            "docid": 33205287,
            "score": 0.02167029858461507
        },
...
$ http "localhost:9001/api/v1/hits/?q=cat"
HTTP/1.0 200 OK
Content-Length: 1070
Content-Type: application/json
Date: Wed, 27 Oct 2021 15:34:55 GMT
Server: Werkzeug/1.0.1 Python/3.9.7

{
    "hits": [
        {
            "docid": 488581,
            "score": 0.018105284309218938
        },
...

Query processing

Clean the query using the same procedure as in inverted index construction cleaning.

Queries containing multiple words are treated as AND (AKA non-phrase) queries. That is, assume the words in a query do not have to appear consecutively in a document. Select documents from the in-memory inverted index data structure that contain every word in the cleaned query.

Return search results sorted in order of relevance score, with the most relevant documents at the beginning. If multiple documents have the same relevance score, sort by ascending doc ID, with the smallest doc ID showing first.
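For illustration, here is a sketch of the AND-query selection. It assumes the in-memory index maps each cleaned term to a collection of postings keyed by doc ID; the names are hypothetical.

def matching_docs(query_terms, inverted_index):
    """Return the set of doc IDs that contain every term in the cleaned query."""
    if not query_terms:
        return set()
    doc_sets = []
    for term in query_terms:
        if term not in inverted_index:
            # A term missing from this segment means no document can match
            return set()
        doc_sets.append(set(inverted_index[term]))
    return set.intersection(*doc_sets)

After scoring the matching documents, a sort key like lambda hit: (-hit["score"], hit["docid"]) produces descending score with ties broken by ascending doc ID.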

PageRank integration

PageRank expresses the relative importance of a website based on links. Sites that are more important will have a higher PageRank score, so they should appear higher in the search results.

Rank documents based on the weighted linear combination of two factors: the query-dependent tf-idf score, and the query-independent PageRank score. The formula for the score of a query \(q\) on a single document \(d\) is:

\[Score(q, d, w) = w * PageRank(d) + (1 - w) * tfIdf(q, d)\]

\(w\) is the weight of a document’s PageRank score, and takes on a value in \([0,1]\). The weight of a document’s cosine similarity score is \(1 - w\). The value \(w\) is provided as a URL query parameter. Don’t confuse \(w\) with \(w_{ik}\) given by the formula in the tf-idf calculation.

\(PageRank(d)\) is the PageRank score of document \(d\) obtained using the provided pagerank.out.

\(tfIdf(q, d)\) is the cosine similarity between the query \(q\) and the document \(d\) tf-idf weight vectors. Each weight \(w_{ik}\) in a tf-idf weight vector is calculated via the formula in the tf-idf calculation.

We provide a calculation walk-through example in Appendix A.

We have provided PageRank scores in pagerank.out. Load this file into memory once, creating a data structure that maps doc ID to PageRank score. See Load the inverted index for tips.
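For example, read_pagerank() from the earlier startup() snippet might look roughly like this (a sketch; the module-level dictionary is an assumption):

PAGERANK = {}

def read_pagerank(index_dir):
    """Load pagerank.out into a dict mapping doc ID to PageRank score."""
    with open(index_dir/"pagerank.out", encoding="utf-8") as infile:
        for line in infile:
            doc_id, _, score = line.strip().partition(",")
            PAGERANK[doc_id] = float(score)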

Init script

Write a shell script to launch the Index servers. The examples below show output with the commands your script should run, but we won’t check your output, just the functionality.

Example: start index. The command flask run ... & will start up your Flask server in the background. The >> var/log/index.log 2>&1 appends stdout and stderr to a log file.

$ ./bin/index start
starting index server ...
+ mkdir -p var/log
+ rm -f var/log/index.log
+ FLASK_APP=index INDEX_PATH="inverted_index_0.txt" flask run --host 0.0.0.0 --port 9000 >> var/log/index.log 2>&1 &
+ FLASK_APP=index INDEX_PATH="inverted_index_1.txt" flask run --host 0.0.0.0 --port 9001 >> var/log/index.log 2>&1 &
+ FLASK_APP=index INDEX_PATH="inverted_index_2.txt" flask run --host 0.0.0.0 --port 9002 >> var/log/index.log 2>&1 &

Watch the log file produced by the Index servers. Type Control-C to stop watching the log.

$ tail -f var/log/index.log
 * Serving Flask app "index"
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:9000/ (Press CTRL+C to quit)
 * Running on http://0.0.0.0:9001/ (Press CTRL+C to quit)
 * Running on http://0.0.0.0:9002/ (Press CTRL+C to quit)

Example: start Index server when database does not exist. Exit non-zero.

$ ./bin/index start
Error: can't find search database search/search/var/index.sqlite3
Try: ./bin/indexdb create

Example: start Index server when one or more Index servers is already running. To check whether the first Index server is already running, you can use pgrep -f "flask run --host 0.0.0.0 --port 9000". Don’t forget, you’ll need to do this 3 times, once for each server.

$ ./bin/index start
Error: index server is already running

Example: stop index.

$ ./bin/index stop
stopping index server ...
+ pkill -f "flask run --host 0.0.0.0 --port 9000" || true
+ pkill -f "flask run --host 0.0.0.0 --port 9001" || true
+ pkill -f "flask run --host 0.0.0.0 --port 9002" || true

Example: restart index.

$ ./bin/index restart
stopping index server ...
+ pkill -f "flask run --host 0.0.0.0 --port 9000" || true
+ pkill -f "flask run --host 0.0.0.0 --port 9001" || true
+ pkill -f "flask run --host 0.0.0.0 --port 9002" || true
starting index server ...
+ mkdir -p var/log
+ rm -f var/log/index.log
+ FLASK_APP=index INDEX_PATH="inverted_index_0.txt" flask run --host 0.0.0.0 --port 9000 >> var/log/index.log 2>&1 &
+ FLASK_APP=index INDEX_PATH="inverted_index_1.txt" flask run --host 0.0.0.0 --port 9001 >> var/log/index.log 2>&1 &
+ FLASK_APP=index INDEX_PATH="inverted_index_2.txt" flask run --host 0.0.0.0 --port 9002 >> var/log/index.log 2>&1 &

Example: index status. The script should exit 0 if the Index server is running, and exit non-zero otherwise.

$ ./bin/index status
index server stopped
$ ./bin/index start
starting index server ...
$ ./bin/index status
index server running
$ pkill -f 'flask run --host 0.0.0.0 --port 9001'
$ ./bin/index status
index server error: found 2 processes, expected 3
$ ./bin/index stop
stopping index server ...
$ ./bin/index status
index server stopped

The code for index status is tricky, so we’ll just give it to you.

set +o pipefail
NPROCS=$(pgrep -f "flask run --host 0.0.0.0 --port 900[0-2]" | wc -l)
set -o pipefail
if [ "$NPROCS" -eq 3 ]; then
  echo "index server running"
  exit
elif [ "$NPROCS" -eq 0 ]; then
  echo "index server stopped"
  exit 1
else
  echo "index server error: found ${NPROCS} processes, expected 3"
  exit 2
fi

Testing

Once you have implemented your Index server, you should be able to pass all tests in test_index_server_public.py.

$ pytest -v tests/test_index_server_public.py

Search server

The Search server is a user interface implemented with server-side dynamic pages. A user enters a query and the Search server returns a page of search results, just like Google or Bing.

The Search server backend makes REST API requests to each Index server and combines the results from each inverted index segment. It should make these requests in parallel threads. The Search server then displays the top 10 results to the client.

Directory structure

Here’s what your final Search server directory structure should look like.

$ tree search -I '__pycache__|*.egg-info|tmp'
search
├── requirements.txt
├── search
│   ├── __init__.py
│   ├── config.py
│   ├── model.py
│   ├── sql
│   │   └── index.sql
|   ├── static
│   │   └── css
│   │       └── style.css
│   ├── templates
│   │   └── *.html
│   ├── var
│   │   └── index.sqlite3
│   └── views
│       ├── __init__.py
│       └── *.py
└── setup.py

Configuration

The Search server configuration in config.py should have a configuration parameter called SEARCH_INDEX_SEGMENT_API_URLS. The unit tests will change this variable when the ports of the Index segment servers on the autograder change.

You can hardcode your API URLs like we did in our instructor solution search/search/config.py:

SEARCH_INDEX_SEGMENT_API_URLS = [
    "http://localhost:9000/api/v1/hits/",
    "http://localhost:9001/api/v1/hits/",
    "http://localhost:9002/api/v1/hits/",
]

Database

We’ve provided a database containing details about each docid in search/search/sql/index.sql. It has the following schema.

Database management script

Write a database script for the Index server. The examples below show output with the commands your script should run, but we won’t check your output, just the functionality.

Example create.

$ ./bin/indexdb create
+ mkdir -p search/search/var/
+ sqlite3 search/search/var/index.sqlite3 < search/search/sql/index.sql

Example create when database already exists.

$ ./bin/indexdb create
Error: database already exists

Example destroy.

$ ./bin/indexdb destroy
+ rm -f search/search/var/index.sqlite3

Example reset.

$ ./bin/indexdb reset
+ rm -f search/search/var/index.sqlite3
+ mkdir -p search/search/var/
+ sqlite3 search/search/var/index.sqlite3 < search/search/sql/index.sql

GUI /

The graphical user interface (GUI) allows users to input a query and a weight, then view a ranked list of relevant docs. Serve this page as the main route (/).

A user enters a query into a text input box, and specifies a PageRank weight using a slider. Execute the query when the user presses the submit button.

Input

The GUI shows a text input and a slider input.

The text input box should set the q GET query parameter in the URL, and will be a string. When displaying the results, the text input box should show the query terms.

The slider should set the w GET query parameter in the URL, and will be a decimal value between 0 and 1, inclusive. You should set the range slider’s step value to 0.01. When displaying the results, the slider should have a value equal to the query weight.

Use this HTML form code for the search form:

<!-- DO NOT CHANGE THIS (aside from where we say 'FIXME') -->
<form action="/" method="GET">
  <div><input type="text" name="q"<FIXME add other attributes such as value>/></div>
  <div><input type="range" name="w" <FIXME add other attributes such as min and max> /></div>
  <input type="submit" value="Search"/>
</form>

Output

When the user clicks the submit button, make a GET request to the main route (/) with query parameters q and w. The server should return a webpage with at most 10 search results in ranked order. Each result is the title of a document, verbatim from the database.

Enclose each search result URL in <a class="doc_url">. If the url for a document is not in the database, show “No url available”.

Enclose each search result title in <div class="doc_title">.

Enclose each search result summary in <div class="doc_summary">. If the summary for a document is not in the database, show “No summary available”.

If there are no search results, display <div class="no_results">No search results found!</div>.

Connect to Index servers

To produce the final list of search results, the Search server makes several REST API requests to the Index servers. Make concurrent requests to each API URL in the Flask configuration variable SEARCH_INDEX_SEGMENT_API_URLS.

Use the Requests library to make HTTP requests to the Index servers. Use threads to make concurrent requests. Consider using the heapq.merge() function to combine the results of several Index servers.

Pitfall: An Index server URL looks like localhost:9000/api/v1/hits/, not /api/v1/hits/. Remember that the Index servers run on different ports, and the Search server runs on yet another port.
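Here is a sketch of one way to fan out the requests and merge the results. The function names are hypothetical, error handling is omitted, and tie-breaking by doc ID is left out for brevity.

import heapq
import threading

import flask
import requests

def fetch_hits(url, query, weight, results, i):
    """Request hits from one Index segment server and store them in results[i]."""
    response = requests.get(url, params={"q": query, "w": weight})
    results[i] = response.json()["hits"]

def get_all_hits(query, weight):
    """Query every Index segment concurrently and merge the ranked results."""
    urls = flask.current_app.config["SEARCH_INDEX_SEGMENT_API_URLS"]
    results = [None] * len(urls)
    threads = [
        threading.Thread(target=fetch_hits, args=(url, query, weight, results, i))
        for i, url in enumerate(urls)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Each segment already returns hits sorted by descending score
    return list(heapq.merge(*results, key=lambda hit: -hit["score"]))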

Optional CSS

The starter files include the CSS style sheet we used in our instructor solution (style.css). Here’s how to use it.

Make sure style.css is in the right place.

$ ls search/search/static/css/style.css
search/search/static/css/style.css

Include the CSS in your header. Our template is called search/search/templates/index.html, but yours might be different.

<head>
...
  <link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='css/style.css') }}">
  <link href="https://fonts.googleapis.com/css?family=Montserrat" rel="stylesheet">
  <link href="https://fonts.googleapis.com/css?family=Roboto" rel="stylesheet">
...
</head>

Enclose the entire body in <div class="feed">.

Enclose the search bar in <div class="search_bar">.

Enclose the “ask485” title in <div class="ask485">.

Display the label on the slider like this: <div class="pagerank_weight">Pagerank Weight</div>.

Enclose the list of docs in <div class="docs">.

Within the list of docs, continue to use classes doc_url, doc_title, and doc_summary as described above.

Install and run

Install the Search server app.

$ pip install -r search/requirements.txt
$ pip install -e search

Make sure you’ve generated your Index server database.

$ ls search/search/var/index.sqlite3
search/search/var/index.sqlite3

If you need to create the Index server database:

$ ./bin/indexdb create

Make sure the Index servers are running.

$ pgrep -lf flask  # macOS
$ pgrep -af flask  # Linux/WSL
75608 ... flask run --host 0.0.0.0 --port 9000
75609 ... flask run --host 0.0.0.0 --port 9001
75610 ... flask run --host 0.0.0.0 --port 9002

If you need to start the Index servers in the background (notice the & symbol at the end of the command):

$ FLASK_APP=index FLASK_ENV=development INDEX_PATH=inverted_index_0.txt flask run --host 0.0.0.0 --port 9000 &
$ FLASK_APP=index FLASK_ENV=development INDEX_PATH=inverted_index_1.txt flask run --host 0.0.0.0 --port 9001 &
$ FLASK_APP=index FLASK_ENV=development INDEX_PATH=inverted_index_2.txt flask run --host 0.0.0.0 --port 9002 &

Run the Search server.

$ FLASK_APP=search FLASK_ENV=development flask run --host 0.0.0.0 --port 8000

Kill all Search and Index servers.

$ pkill -lf flask  # macOS
$ pkill -af flask  # Linux/WSL

Init script

Write a shell script to launch the Search server. The examples below show output with the commands your script should run, but we won’t check your output, just the functionality.

Example: start search. The command flask run ... & will start up your flask server in the background. The &> var/log/search.log writes stdout and stderr to a log file.

$ ./bin/search start
starting search server ...
+ mkdir -p var/log
+ rm -f var/log/search.log
+ FLASK_APP=search flask run --host 0.0.0.0 --port 8000 &> var/log/search.log &

Watch the log file produced by the Search server. Type Control-C to stop watching the log.

$ tail -f var/log/search.log
 * Serving Flask app "search"
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8000/ (Press CTRL+C to quit)

Example: start Search server when Index server is stopped. Exit non-zero. Pro-tip: use ./bin/index status to check if the Index server is running.

$ ./bin/search start
Error: index server is not running
Try ./bin/index start

Example: start Search server when it’s already running. Exit non-zero.

$ ./bin/search start
Error: search server is already running

Example: stop search.

$ ./bin/search stop
stopping search server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8000' || true

Example: restart search.

$ ./bin/search restart
stopping search server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8000' || true
starting search server ...
+ mkdir -p var/log
+ rm -f var/log/search.log
+ FLASK_APP=search flask run --host 0.0.0.0 --port 8000 &> var/log/search.log &

Example: search status. The script should exit 0 if the Search server is running, and exit non-zero otherwise. Pro-tip: you can check the Search server process with pgrep -f "flask run --host 0.0.0.0 --port 8000".

$ ./bin/search status
search server stopped
$ ./bin/index start
starting index server ...
$ ./bin/search start
starting search server ...
$ ./bin/search status
search server running
$ ./bin/search stop
stopping search server ...
$ ./bin/search status
search server stopped
$ ./bin/index stop
stopping index server ...

Testing

Once you have implemented your Search and Index servers, you should be able to pass all tests in test_search_server_public.py.

$ pytest -v --log-cli-level=INFO tests/test_search_server_public.py

Run all published unit tests.

$ pytest -v

Code style

As in previous projects, all your Python code for the Flask servers is expected to be pycodestyle-, pydocstyle-, and pylint-compliant. You don’t need to lint the mapper/reducer executables, although it may be a good idea to do it anyway.

It is fine to disable wrong-import-position pylint checks in __init__.py files for your Flask apps. For example, this line is in the instructor solution index/index/__init__.py:

import index.api # noqa: E402 pylint: disable=wrong-import-position

Run pylint like this:

$ pylint --disable=cyclic-import --unsafe-load-any-extension=y index/index search/search

Use only the third party dependencies in requirements.txt and setup.py.

$ pytest -v tests/test_style.py

Submitting and grading

One team member should register your group on the autograder using the create new invitation feature.

Submit a tarball to the autograder, which is linked from https://eecs485.org. Include the --disable-copyfile flag only on macOS.

$ tar \
  -cJvf submit.tar.xz \
  --disable-copyfile \
  --exclude '*__pycache__*' \
  --exclude '*.out' \
  --exclude '*.sql' \
  --exclude '*.sqlite3' \
  --exclude '*.jar' \
  --exclude '*.egg-info' \
  --exclude '*var*' \
  --exclude '*tmp*' \
  --exclude '*/part-*' \
  bin \
  hadoop/inverted_index/{map*.py,reduce*.py,pipeline.sh} \
  index \
  search

Note that the submission command uses the -J flag: This tells tar to use the xz compression algorithm. xz is slower than gzip (which we used in past projects), but generally results in better compression and smaller file sizes. Make sure your submission tarball is less than 10MB.

$ du -h submit.tar.xz
6.1M	submit.tar.xz

Rubric

This is an approximate rubric.

Deliverable Value
Public tests 50%
Hidden unit tests run after the deadline 40%
Public style tests 10%

FAQ

Appendix A: TFIDF Calculation Walkthrough

You have a query \(q\). Let’s make a query vector for it.

\[\bar{q} = \left[ tf_{q1} * idf_1, tf_{q2} * idf_2, \ldots \right]\]

You have a document \(d_{i}\). Let’s make a document vector for it.

\[\bar{d_{i}} = \left[ tf_{i1} * idf_1, tf_{i2} * idf_2, \ldots \right]\]

The document normalization factor is the square root of the sum of squares of the tf-idf values in the document vector.

\[\vert \bar{d_{i}} \vert = \sqrt{\sum_{k=1}^t \left( tf_{ik} * idf_{k} \right)^2}\]

The query normalization factor is the square root of the sum of squares of the tf-idf values in the query vector.

\[\vert \bar{q} \vert = \sqrt{\sum_{k=1}^t \left( tf_{qk} * idf_{k} \right)^2}\]

The inverse document frequencies used to compute both normalization factors come from the inverted index. The normalization factor for the document can be copied from your inverted index, whereas the query normalization factor needs to be computed on the fly. Don’t forget that the document normalization factor from the inverted index omits the square root, so you’ll have to include it.

If you have defined your query and document vectors this way, then the final score will be

\[\text{tf-idf score}(\bar{q}, \bar{d}) = \frac{\bar{q} \cdot \bar{d}}{\vert \bar{q} \vert \cdot \vert \bar{d} \vert}\]

Example

Let’s walk through an example for the query ?w=0.3&q=michigan+wolverine. The query is michigan wolverine. The weight of PageRank is 0.3 and the weight of tf-idf is 0.7.

Read inverted index and PageRank

Search all inverted index segments for michigan and then for wolverine.

$ egrep '^michigan[[:space:]]' ./index/index/inverted_index/inverted_index_*.txt
./index/index/inverted_index/inverted_index_0.txt:michigan 1.5099606740777352 102024 1 168022.4004130407 ...
./index/index/inverted_index/inverted_index_1.txt:michigan 1.5099606740777352 10033609 1 1028.806106016676
./index/index/inverted_index/inverted_index_2.txt:michigan 1.5099606740777352 10883747 14 9233.633290712713
$ egrep '^wolverine[[:space:]]' ./index/index/inverted_index/inverted_index_*.txt
./index/index/inverted_index/inverted_index_0.txt:wolverine 2.669184007846121 14444976 1 7229.416362853553 3479688 2 15000.812615991854 5791422 1 67772.33722992068
./index/index/inverted_index/inverted_index_1.txt:wolverine 2.669184007846121 2632792 1 1201.6239419318742 32965192 2 49040.95308989391 868657 1 127181.05498938105
./index/index/inverted_index/inverted_index_2.txt:wolverine 2.669184007846121 8943134 1 1369.2650730488642

We see that the idf for michigan is 1.509 and the idf for wolverine is 2.669. When we look at the docids for michigan and wolverine, we see that there is one document (docid 868657) that contains both terms.

$ egrep '^(michigan|wolverine)[[:space:]]' ./index/index/inverted_index/inverted_index_*.txt | grep 868657
./index/index/inverted_index/inverted_index_1.txt:michigan ... 868657 46 127181.05498938105 ...
./index/index/inverted_index/inverted_index_1.txt:wolverine ... 868657 1 127181.05498938105 ...

The output above tells us that in document 868657, michigan appears 46 times, and wolverine occurs 1 time.

If you’re curious, document 868657 is the wiki page about Michigan’s Thumb.

$ grep 868657 hadoop/inverted_index/input/input.csv | head -c 100
"868657","The Thumb","thumb|320px|Much of the Thumb is characterized by rolling farmland such as th

We’ll also read the value of PageRank for document 868657 from pagerank.out.

$ egrep 868657 ./index/index/pagerank.out
868657,5.41822e-06

Query vector and document vector

The Query vector has one position for each term (["michigan", "wolverine"]). Each position is a number calculated as <term frequency in query> * <idf>.

q = [1*1.50996, 1*2.66918]
  = [1.50996, 2.66918]

The Document vector similarly has one position for each term (["michigan", "wolverine"]). Each position is a number calculated as <term frequency in document> * idf.

d = [46*1.50996, 1*2.66918]
  = [69.45816, 2.66918]

tf-idf

Compute the dot product of q and d.

q * d = 112.004

Compute the normalization factor for the query. The normalization factor is the length of the query vector: the square root of the sum of squares of its elements.

q = [1.50996, 2.66918]
norm_q = sqrt(1.50996^2 + 2.66918^2) = 3.067

Compute the normalization factor for the document, which is the square root of the normalization factor read from the inverted Index.

norm_d_squared = 127181.05498
norm_d = sqrt(127181.05498)
       = 356.6245

Compute tf-idf.

tfidf = q * d / (norm_q * norm_d)
      = 112.004 / (3.067 * 356.6245)
      = 0.1024

Weighted score

Finally, we combine tf-idf and PageRank using the weight.

weighted_score = w * PageRank + (1-w) * tfidf
               = (0.3 * 5.418e-06) + (1-0.3) * (0.1024)
               = 0.07168
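If you want to sanity-check these numbers, the whole walkthrough fits in a few lines of Python (a sketch using the values read above):

import math

idf = {"michigan": 1.5099606740777352, "wolverine": 2.669184007846121}
tf_query = {"michigan": 1, "wolverine": 1}
tf_doc = {"michigan": 46, "wolverine": 1}

query_vec = [tf_query[term] * idf[term] for term in idf]
doc_vec = [tf_doc[term] * idf[term] for term in idf]

dot = sum(q * d for q, d in zip(query_vec, doc_vec))
norm_q = math.sqrt(sum(q * q for q in query_vec))
norm_d = math.sqrt(127181.05498938105)  # the stored value omits the square root

tfidf = dot / (norm_q * norm_d)
score = 0.3 * 5.41822e-06 + 0.7 * tfidf
print(round(tfidf, 4), round(score, 4))  # approximately 0.1024 and 0.0717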

Appendix B: MapReduce Pipeline Testing and Debugging

This section will show how to sanity check your final inverted index, and provide some debugging tips.

Testing large inverted index

Your MapReduce pipeline’s final output should match the samples provided in hadoop/inverted_index/output_sample/. Use it to sanity check the large final output. Each file contains a few terms from the instructor solution inverted index segments. These commands assume that you’ve already run your pipeline.sh script with the full size input directory.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
$ ls output
part-00000  part-00001  part-00002
$ ls output_sample/
part-00000  part-00001  part-00002
$ cat output_sample/part-00000 | awk '{print $1}'  # show first word
smelting
itselfbetteshanger
$ grep '^smelting ' output/part-00000
smelting 1.5648920412154652 ...
$ grep '^smelting ' output_sample/part-00000
smelting 1.5648920412154652 ...

Your numbers may be slightly different within a tolerance of 5%.

Debugging

Autograder inputs

Most of the pipeline tests run your pipeline with a different input. For example, take a look at test_pipeline_public.py::test_simple, find the input directory, and run it yourself.

$ ./pipeline.sh ../../tests/testdata/test_pipeline03/input
...
Output directory: output

Compare your output to the correct output file. Notice that we need to concatenate and sort the pipeline output for this particular test. You can find the path to expected.txt in the test.

$ sort output/part-* > output/sorted.txt
$ diff output/sorted.txt ../../tests/testdata/test_pipeline03/expected.txt

Some pipeline tests compare each inverted index segment, like test_segments. You can compare each output file to the expected output.

$ ./pipeline.sh ../../tests/testdata/test_pipeline14/input_multi
...
Output directory: output
$ diff output/part-00000 ../../tests/testdata/test_pipeline14/expected/part-0.txt
$ diff output/part-00001 ../../tests/testdata/test_pipeline14/expected/part-1.txt
$ diff output/part-00002 ../../tests/testdata/test_pipeline14/expected/part-2.txt

Diff tips and tricks

Comparing large files is tricky. Here are some tips. We’ll use the example input and output for the following examples.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
$ ./pipeline.sh example_input/

Like colors? Try piping into colordiff.

$ diff example_output/part-00000 output/part-00000 | colordiff

Want to see the files side-by-side? Try diff --side-by-side.

$ diff --side-by-side example_output/part-00000 output/part-00000
...
forgetting 0.47712125471966244 3 1 2.048802225347385      forgetting 0.47712125471966244 3 1 2.048802225347385
hear 0.47712125471966244 3 1 2.048802225347385          | bear 0.47712125471966244 3 1 2.048802225347385
heard 0.47712125471966244 3 1 2.048802225347385           heard 0.47712125471966244 3 1 2.048802225347385
...

Too much on one line? Try comparing just one column. The -d' ' breaks up a line on spaces, and -f1 shows the first column. You can also see multiple columns, e.g., -f1-5 or -f1,3.

$ diff --side-by-side <(cut -d' ' -f1 output/part-00000) <(cut -d' ' -f1 example_output/part-00000) | colordiff 
...
forgetting              forgetting
bear                  | hear
beard                   heard
...

If your output is mostly correct and you’re having trouble narrowing down the difference between two similar versions of a line, try wdiff.

$ wdiff example_output/part-00000 output/part-00000 | colordiff
...
forgetting [-0.47712125471966244-] {+0.66712125471966244+} 3 1 2.048802225347385
hear [-0.47712125471966244-] {+0.66712125471966244+} 3 1 2.048802225347385
heard [-0.47712125471966244-] {+0.66712125471966244+} 3 1 2.048802225347385
...

Do those lines REALLY look the same? Maybe it’s a whitespace problem. Ignore whitespace and see if the problem goes away.

$ diff --ignore-all-space example_output/part-00000 output/part-00000

Emacs user? Try Ediff, invoked with M-x ediff.

Common problems

You should be able to run your map and reduce scripts as standalone executables. Make sure your map and reduce scripts have the following shebang at the top of the file:

#!/usr/bin/env python3

Your map and reduce programs should use relative paths when opening an input file. For example, to access stopwords.txt:

with open("stopwords.txt", "r") as stopwords:
    for line in stopwords:
        stopword = line.strip()  # Do something with each stop word

Other common problems include:

Appendix C: Installing Real Hadoop

Although not needed for this project, below are the steps to install Hadoop on OSX, Ubuntu Linux, and Windows WSL if you want to play around with it.

If you are on OSX, install Hadoop with Homebrew.

$ brew cask install java
$ brew install hadoop

If you are on Ubuntu Linux, an Ubuntu Linux VM, or Windows 10 WSL, you will have to install Hadoop manually. OSX users should skip the following steps and resume below, where specified.

Install Java

$ sudo -s                            # Become root
$ sudo apt-get update
$ sudo apt-get install default-jdk   # Install Java
$ java -version                      # Check Java version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

Download Hadoop and unpack.

$ cd /opt
$ wget http://apache.osuosl.org/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
$ wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz.mds
$ shasum -a 256 hadoop-2.8.5.tar.gz
e8bf9a53337b1dca3b152b0a5b5e277dc734e76520543e525c301a050bb27eae  hadoop-2.8.5.tar.gz
$ grep SHA256 hadoop-2.8.5.tar.gz.mds
SHA256 = E8BF9A53 337B1DCA 3B152B0A 5B5E277D C734E765 20543E52 5C301A05 0BB27EAE
$ tar -xvzf hadoop-2.8.5.tar.gz
$ rm hadoop-2.8.5.tar.gz hadoop-2.8.5.tar.gz.mds

Locate the path to your Java interpreter.

$ which java | xargs readlink -f | sed 's:bin/java::'
/usr/lib/jvm/java-8-openjdk-amd64/jre/

Edit /opt/hadoop-2.8.5/etc/hadoop/hadoop-env.sh to change the JAVA_HOME.

#export JAVA_HOME=${JAVA_HOME}  # remove/comment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

Write a launch script, /usr/local/bin/hadoop

#!/bin/bash
exec "/opt/hadoop-2.8.5/bin/hadoop" "$@"

Make the script executable and check that it’s working.

$ chmod +x /usr/local/bin/hadoop
$ which hadoop
/usr/local/bin/hadoop
$ hadoop  -h
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
...
$ exit  # drop root privileges

OSX Java Home Error

Note to OSX users: If you run Hadoop and Hadoop is unable to find JAVA_HOME, please refer to the fix below.

  1. Find hadoop-env.sh. Your location might be different.
    $ find / -name hadoop-env.sh
    /usr/local/Cellar/hadoop/2.8.1/libexec/etc/hadoop/hadoop-env.sh
    
  2. Uncomment and edit the export JAVA_HOME line in hadoop-env.sh:
    # The java implementation to use.
    export JAVA_HOME="$(/usr/libexec/java_home)"
    

Acknowledgments

Original project written by Andrew DeOrio awdeorio@umich.edu.