p5-search-engine

EECS 485 Project 5: Wikipedia Search Engine

Due: 8pm EST December 4, 2020. This is a group project to be completed in groups of two or three.

Change Log

Initial Release for F20

Introduction

Build a scalable search engine that is similar to a commercial search engine. The search engine in this assignment combines a MapReduce-built inverted index, tf-idf and PageRank ranking, an index server, and a server-side dynamic search interface.

The learning goals of this project are information retrieval concepts like PageRank and tf-idf, parallel data processing with MapReduce, and writing an end-to-end search engine.


Setup

Group registration

Register your group on the Autograder.

Project folder

Create a folder for this project (instructions). Your folder location might be different.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine

Version control

Set up version control using the Version control tutorial.

Be sure to check out the Version control for a team tutorial.

After you’re done, you should have a local repository with a “clean” status and your local repository should be connected to a remote GitLab repository.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.

nothing to commit, working tree clean
$ git remote -v
origin	https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (fetch)
origin	https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (push)

You should have a .gitignore file (instructions).

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ head .gitignore
This is a sample .gitignore file that's useful for EECS 485 projects.
...

Python virtual environment

Create a Python virtual environment using the Python Virtual Environment Tutorial.

Check that you have a Python virtual environment, and that it’s activated (remember source env/bin/activate).

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ ls -d env
env
$ echo $VIRTUAL_ENV
/Users/awdeorio/src/eecs485/p5-search-engine/env

Starter files

Download and unpack the starter files.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ wget https://eecs485staff.github.io/p5-search-engine/starter_files.tar.gz
$ tar -xvzf starter_files.tar.gz

Move the starter files to your project directory and remove the original starter_files/ directory and tarball.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ mv starter_files/* .
$ rm -rf starter_files starter_files.tar.gz

This project involves very few starter files. You have learned everything you need to build this project from (almost) scratch. You are responsible for converting the starter files structure into the final structure. Each part of the spec will walk you through what we expect in terms of structure.

Both the index server and the search server are separate Flask apps. The search server serves a server-side dynamic webpage. The index server is styled after the Project 3 REST API. As in previous projects, both are Python packages and should each have a setup.py.

At the end of this project your directory structure should look something like this:

$ tree -I 'env|__pycache__'
.
├── bin
│   ├── index
│   ├── indexdb
│   ├── install
│   └── search
├── hadoop
│   ├── hadoop-streaming-2.7.2.jar
│   ├── inverted_index
│   │   ├── input
│   │   │   └── input.csv
│   │   ├── map0.py
│   │   ├── map1.py
│   │   ├── ...
│   │   ├── output_sample.txt
│   │   ├── pipeline.sh
│   │   ├── reduce0.py
│   │   ├── reduce1.py
│   │   ├── ...
│   │   └── stopwords.txt
│   └── word_count
│       ├── input
│       │   ├── file01
│       │   └── file02
│       ├── map.py
│       └── reduce.py
├── index
│   ├── index
│   │   ├── __init__.py
│   │   ├── api
│   │   │   └── *.py
│   │   ├── inverted_index.txt
│   │   ├── pagerank.out
│   │   └── stopwords.txt
│   ├── requirements.txt
│   └── setup.py
├── search
│   ├── search
│   │   ├── __init__.py
│   │   ├── config.py
│   │   ├── sql
│   │   │   └── wikipedia.sql
│   │   ├── templates
│   │   │   └── *.html
│   │   ├── var
│   │   │   └── wikipedia.sqlite3
│   │   └── views
│   │       └── *.py
│   ├── requirements.txt
│   └── setup.py
└── tests

Install Hadoop

Hadoop does not work on all machines. For this project, we will use a lightweight Python implementation of Hadoop. The source code is in tests/utils/hadoop.py.

$ pwd # Make sure you are in the root directory
/Users/awdeorio/src/eecs485/p5-search-engine
$ echo $VIRTUAL_ENV # Check that you have a virtual environment
/Users/awdeorio/src/eecs485/p5-search-engine/env
...
$ pushd $VIRTUAL_ENV/bin
$ ln -sf ../../tests/utils/hadoop.py hadoop # Link to provided hadoop implementation
$ popd
$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ which hadoop
/Users/awdeorio/src/eecs485/p5-search-engine/env/bin/hadoop
$ hadoop -h # hadoop command usage
usage: hadoop [-h] -input INPUT -output OUTPUT -mapper MAPPER
              -reducer REDUCER

Lightweight Hadoop work-alike.

optional arguments:
  -h, --help        show this help message and exit

required arguments:
  -input INPUT
  -output OUTPUT
  -mapper MAPPER
  -reducer REDUCER

If you want to play around with the real Hadoop, Appendix B lists the installation steps. However, for this project, we will be using our Python implementation of Hadoop. You do not need to use the real Hadoop for this project.

Install script

Installing the whole toolchain requires a lot of steps! Follow the Project 3 Tutorial and write a bash script bin/install to install your app. However, you will need to make some small changes relative to your previous install script.

Unlike Project 3, this project does not have a client-side dynamic portion. As a result, you do not need to install nodeenv, chromedriver or any other client-side dynamic tools.

This project has a different folder structure, so the server installation instructions need to change. You will also have two backend servers running for this project, so you will need to install two different Python packages.

If running the install script on CAEN, pip install might throw PermissionDenied errors when updating packages. To allow installation and updates of packages using pip, add the following lines before the first pip install command.

You will also need to install the Python implementation of Hadoop. Do this by adding the symlink commands from the Install Hadoop section above (the pushd / ln -sf / popd sequence) to the end of your install script.

“Hello World” with Hadoop

Next we will run a sample mapreduce word count program with Hadoop. This example will run on both real Hadoop and the provided lightweight Python implementation.

The example MapReduce program lives in hadoop/word_count/. Change directory.

$ cd hadoop/word_count/
$ ls
input  map.py  reduce.py

The inputs live in hadoop/word_count/input/.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/word_count
$ tree
.
├── input
│   ├── file01
│   └── file02
├── map.py
└── reduce.py
$ cat input/file*
hadoop map reduce file map
map streaming file reduce
map reduce is cool
hadoop file system
google file system

Run the Hadoop mapreduce job. Hadoop automatically splits large input files into smaller ones. There will be one mapper for each input split.

$ hadoop \
  jar ../hadoop-streaming-2.7.2.jar \
  -input input \
  -output output \
  -mapper ./map.py \
  -reducer ./reduce.py
Starting map stage
+ ./map.py < output/hadooptmp/mapper-input/part-00000 > output/hadooptmp/mapper-output/part-00000
+ ./map.py < output/hadooptmp/mapper-input/part-00001 > output/hadooptmp/mapper-output/part-00001
Starting group stage
+ cat output/hadooptmp/mapper-output/* | sort > output/hadooptmp/grouper-output/sorted.out
Starting reduce stage
+ reduce.py < output/hadooptmp/grouper-output/part-00000 > output/hadooptmp/reducer-output/part-00000
+ reduce.py < output/hadooptmp/grouper-output/part-00001 > output/hadooptmp/reducer-output/part-00001
+ reduce.py < output/hadooptmp/grouper-output/part-00002 > output/hadooptmp/reducer-output/part-00002
+ reduce.py < output/hadooptmp/grouper-output/part-00003 > output/hadooptmp/reducer-output/part-00003
Output directory: output

You will see an output directory created. This is where the output of the MapReduce job lives. You are interested in the part-XXXXX files.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/word_count
$ ls output/
hadooptmp  part-00000 part-00001 part-00002 part-00003
$ cat output/part-*
cool 1
is 1
system 2
file 4
map 4
google 1
reduce 3
hadoop 2
streaming 1

Inverted Index with MapReduce

You will build and use a MapReduce-based indexer to process the large dataset for this project. You will use Hadoop's command line streaming interface, which lets you write your indexer in the language of your choice instead of Java (the language Hadoop is written in). However, you are limited to Python 3 so that the course staff can provide better support to students. In addition, all of your mappers must output both a key and a value (not just a key), and keys and values must BOTH be non-empty. If you do not need a key or a value, common practice is to emit a "1".

There is one key difference between the MapReduce discussed in class and the Hadoop streaming interface: in the Java interface (and in lecture), one instance of the reduce function is called for each intermediate key. In the streaming interface, one instance of the reduce function may receive multiple keys. However, each reduce function will receive all the values for any key it receives as input.
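For example, a reducer that sums the values for each key might use itertools.groupby to handle multiple keys inside one reducer instance. This is only a sketch, assuming tab-separated key/value lines on stdin; the helper names are ours, not required.

#!/usr/bin/env python3
"""Reducer sketch: sum the values for each key, one group at a time."""
import itertools
import sys


def keyfunc(line):
    """Return the key part of a tab-separated key/value line."""
    return line.partition("\t")[0]


def main():
    """Group stdin lines by key and reduce each group."""
    for key, group in itertools.groupby(sys.stdin, keyfunc):
        total = sum(int(line.partition("\t")[2]) for line in group)
        print(f"{key}\t{total}")


if __name__ == "__main__":
    main()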

You will not actually run your program on hundreds of nodes; it is possible to run a MapReduce program on just one machine: your local one. However, a good MapReduce program that can run on a single node will run fine on clusters of any size. In principle, we could take your MapReduce program and use it to build an index on 100B web pages.

For this project you will create an inverted index for the documents in hadoop/inverted_index/input/input.csv through a series of MapReduce jobs. This inverted index will contain the idf, term frequency, and document normalization factor, as specified in the information retrieval lecture slides. Your inverted index must follow this format, with each data element separated by a space:

<term> <idf> <doc_id_x> <occurrences in doc_id_x> <doc_id_x normalization factor before sqrt> <doc_id_y> <occurrences in doc_id_y> <doc_id_y normalization factor before sqrt> ...

A sample of the correctly formatted output for the inverted index can be found in hadoop/inverted_index/output_sample.txt. You can also use this sample output to check the accuracy of your inverted index.

For your reference, here are the definitions of term frequency and idf: \(tf_{ik}\) is the number of occurrences of term \(k\) in document \(i\), and with \(N\) the total number of documents and \(n_k\) the number of documents containing term \(k\),

\[idf_k = \log_{10} \left( N / n_k \right) \qquad\qquad w_{ik} = tf_{ik} \cdot idf_k\]

And here is the definition of the normalization factor for a document:

\[\sqrt{\sum_{k=1}^t \left( tf_{ik} \right)^2 \left[ \log \left( N / n_k \right) \right]^2}\]

Hadoop pipeline script

This script will execute several MapReduce jobs. The first one is responsible for counting the number of documents in the input data. Name its mapper and reducer map0.py and reduce0.py, respectively. The next N jobs form a "pipeline", meaning that you will chain multiple jobs together such that the input of a job is the output of the previous job.

We have provided hadoop/inverted_index/pipeline.sh, which has helpful comments for getting started.

Input

The input is a CSV file, hadoop/inverted_index/input/input.csv, representing a list of documents. Each line is one document, and has 3 comma-separated columns: "<doc_id>","<doc_title>","<doc_body>".

Pro-tip: Use the Python csv library and add the line csv.field_size_limit(sys.maxsize) (doc_body is very large for some documents).
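For instance, a mapper might read the CSV from standard input like this sketch (the variable names are ours):

#!/usr/bin/env python3
"""Mapper sketch: parse one document per CSV line from stdin."""
import csv
import sys

# Some doc_body fields are larger than the default csv field size limit.
csv.field_size_limit(sys.maxsize)

for row in csv.reader(sys.stdin):
    doc_id, doc_title, doc_body = row
    # Emit key/value pairs for this document here.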

Job 0

The first MapReduce job that you will create counts the total number of documents. You need to ensure this is run with only one reducer. Hint: remember all key/value pairs with the same key will be sent to the same reducer. The mapper and reducer executables should be named map0.py and reduce0.py, respectively. The reducer should save the total number of documents in a file called total_document_count.txt. The only data in this file will be an integer representing the total document count. The total_document_count.txt file should be created in whichever directory the pipeline is executed from.
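One possible sketch is below, and it is only one way to do it (the file names map0.py and reduce0.py are required; everything else is our assumption). The mapper emits a constant key so that, per the hint, every pair ends up at the same reducer; the reducer skips empty partitions so that a reducer with no input does not clobber total_document_count.txt.

A map0.py sketch:

#!/usr/bin/env python3
"""map0.py sketch: emit one ("documents", 1) pair per input document."""
import csv
import sys

csv.field_size_limit(sys.maxsize)
for _ in csv.reader(sys.stdin):
    print("documents\t1")

A reduce0.py sketch:

#!/usr/bin/env python3
"""reduce0.py sketch: sum the counts and save the total to a file."""
import sys

total = sum(int(line.partition("\t")[2]) for line in sys.stdin)
if total:
    print(f"documents\t{total}")
    with open("total_document_count.txt", "w") as outfile:
        outfile.write(f"{total}\n")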

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
$ ./pipeline.sh
...
Output directory: output
$ cat total_document_count.txt
3268

MapReduce pipeline

You will be going from large datasets to an inverted index, which involves calculating quite a few values for each word. As a result, you will need to write several MapReduce jobs and run them in a pipeline to be able to generate the inverted index. The first MapReduce job in this pipeline will get input from input/ and write its output to output/. The starter file pipeline.sh shows how to pipe the output from one MapReduce job into the next one.

To test your MapReduce program, we recommend that you make a new test file, with only 10-20 of the documents from the original large file. Once you know that your program works with a single input file, try breaking the input into more files, thus using more mappers.

You may use a maximum of 9 MapReduce jobs in this pipeline (though the inverted index can be produced in fewer). The first job in the pipeline (the document counter) must have mapper and reducer executables named map0.py and reduce0.py, respectively; the second job should use map1.py and reduce1.py, and so on.

The inverted index must follow this format, with each data element separated by a space: word -> idf -> doc_id -> number of occurrences in doc_id -> doc_id's normalization factor (before sqrt)

A sample of correctly formatted output can be found in hadoop/inverted_index/output_sample.txt. The order of words in the inverted index does not matter. If a word appears in more than one doc, it does not matter which doc_id comes first in the inverted index. Note that the log for idf is computed with base 10 and that some of your decimals may be slightly off due to rounding errors. In general, you should be within 5% of the sample output decimals.

When building the inverted index file, you should use a list of predefined stopwords to remove words that are so common that they do not add anything to search quality (“and”, “the”, etc). We have given you a list of stopwords to use in hadoop/inverted_index/stopwords.txt.

When creating your index you should treat capital and lowercase letters as the same (case insensitive). You should also only include alphanumeric words in your index and ignore any other symbols. If a word contains a symbol, simply remove it from the string. Do this with the following code snippet:

import re
word = re.sub(r'[^a-zA-Z0-9]+', '', word)

You should remove non-alphanumeric characters before you remove stopwords. Your inverted index should include both doc_title and doc_body for each document.
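Putting those rules together, a term-cleaning helper might look like the following sketch. The function name is ours, and the stopwords argument is assumed to be a set loaded from stopwords.txt.

import re


def clean_terms(text, stopwords):
    """Split text into lowercase alphanumeric terms, dropping stopwords."""
    terms = []
    for word in text.split():
        # Remove non-alphanumeric characters first, then lowercase.
        word = re.sub(r'[^a-zA-Z0-9]+', '', word).lower()
        # Drop empty strings and stopwords after cleaning.
        if word and word not in stopwords:
            terms.append(word)
    return terms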

To construct the inverted index file used by the index server, concatenate all the files in the output directory from the final MapReduce job into a new file. Make sure to copy any updated inverted index file to your index Flask app (index/index). This can either be done at the end of pipeline.sh or run manually after the output files are created.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index
$ cat output/* > inverted_index.txt

MapReduce specifications

There are a few things you must ensure to get your mappers and reducers playing nicely with both Hadoop and our autograder; in particular, follow the mapN.py/reduceN.py naming scheme, make each script a standalone executable with the proper shebang (see Common problems below), and always emit non-empty keys and values.

Sample input.

"1","The Document: A","This document is about Mike Bostock. He made d3.js and he's really cool"
"2","The Document: B","Another flaw in the human character is that everybody wants to build and nobody wants to do maintenance. - Kurt Vonnegut"
"3","Document C:","Originality is the fine art of remembering what you hear but forgetting where you heard it. - Laurence Peter"

Sample output.

character 0.47712125471966244 2 1 1.593512841936855
maintenance 0.47712125471966244 2 1 1.593512841936855
mike 0.47712125471966244 1 1 1.138223458526325
kurt 0.47712125471966244 2 1 1.593512841936855
peter 0.47712125471966244 3 1 2.048802225347385
flaw 0.47712125471966244 2 1 1.593512841936855
heard 0.47712125471966244 3 1 2.048802225347385
cool 0.47712125471966244 1 1 1.138223458526325
remembering 0.47712125471966244 3 1 2.048802225347385
laurence 0.47712125471966244 3 1 2.048802225347385
d3js 0.47712125471966244 1 1 1.138223458526325
made 0.47712125471966244 1 1 1.138223458526325
build 0.47712125471966244 2 1 1.593512841936855
document 0.0 2 1 1.593512841936855 3 1 2.048802225347385 1 2 1.138223458526325
originality 0.47712125471966244 3 1 2.048802225347385
bostock 0.47712125471966244 1 1 1.138223458526325
forgetting 0.47712125471966244 3 1 2.048802225347385
hear 0.47712125471966244 3 1 2.048802225347385
art 0.47712125471966244 3 1 2.048802225347385
human 0.47712125471966244 2 1 1.593512841936855
fine 0.47712125471966244 3 1 2.048802225347385
vonnegut 0.47712125471966244 2 1 1.593512841936855

Common problems

You should be able to run your map and reduce scripts as standalone executables. Make sure your map and reduce scripts have the following shebang at the top of the file:

#!/usr/bin/env python3

Your map and reduce programs should use relative paths when opening an input file. For example, to access stopwords.txt:

with open("stopwords.txt", "r") as stopwords:
    for line in stopwords:
        # Do something with line

Testing

Once you have implemented your MapReduce pipeline to create the inverted index, you should be able to pass all tests in test_doc_counts_public.py and test_pipeline_public.py.

$ pytest -v tests/test_doc_counts_public.py tests/test_pipeline_public.py

Index server

The index server is a separate service from the search interface that handles search queries and returns a list of relevant results. Your index server is a RESTful API that returns search results in JSON format, which the search server processes to display results to the user.

Directory structure

In this project, since you have two servers (the index server and the search server), you will need separate directories for each server application. Your index server code goes inside the directory index/index and its setup.py goes inside the directory index. This lets you have Python packages for two different servers in the same project root directory. Note that setup.py always lives one directory above the directory containing your application's code. Below is an example of what your final index directory structure should look like.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ tree index -I '__pycache__|*.egg-info'
index/
├── index
│   ├── __init__.py
│   ├── api
│   │   ├── __init__.py
│   │   └── *.py
│   ├── inverted_index.txt
│   ├── pagerank.out
│   └── stopwords.txt
├── requirements.txt
└── setup.py

Your index and index/api directories need to be Python packages.

PageRank integration

In lecture, you learned about PageRank, an algorithm used to determine the relative importance of a website based on the sites that link to that site, and the links to other sites that appear on that website. Sites that are more important will have a higher PageRank score, so they should be returned closer to the top of the search results.

In this project, you are given a set of pages and their PageRank scores in pagerank.out, so you do not need to implement PageRank. However, it is still important to understand how the PageRank algorithm works!

Your search engine will rank documents based on both the query-dependent tf-idf score, as well as the query-independent PageRank score. The formula for the score of a query \(q\) on a single document \(d\) should be:

\[Score(q, d, w) = w * PageRank(d) + (1 - w) * tfIdf(q, d)\]

where \(w\) is a decimal between 0 and 1, inclusive. The value \(w\) will be a URL parameter. The \(w\) parameter represents how much weight we want to give the document's PageRank versus its cosine similarity with the query. This is a completely different variable from the \(w_{ik}\) given by the formula in Inverted Index with MapReduce. The final score contains two parts: one from PageRank and the other from a tf-idf based cosine similarity score. \(PageRank(d)\) is the PageRank score of document \(d\), and \(tfIdf(q, d)\) is the cosine similarity between the query and the document tf-idf weight vectors. Each weight in a tf-idf weight vector is calculated via the formula in Inverted Index with MapReduce for \(w_{ik}\). Treat query \(q\) as a simple AND, non-phrase query (that is, assume the words in a query do not have to appear consecutively and in sequence).

Integrating PageRank scores will require creating a second index that maps each doc_id to its precomputed PageRank score, given to you in pagerank.out. You can build it where you read the inverted index. This index should be accessed at query time by your index server.
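pagerank.out is a small doc_id,score CSV (see Appendix A), so one way to build this second index is a plain dictionary loaded once at startup. This is a sketch; see "Common problems" below for locating the file relative to the package.

"""Sketch: load PageRank scores into a dict keyed by doc_id."""
pagerank = {}
with open("pagerank.out") as infile:
    for line in infile:
        doc_id, _, score = line.strip().partition(",")
        pagerank[int(doc_id)] = float(score)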

If you are still confused by these calculations, please refer to Appendix A.

Returning hits

When the Index server is run, it will load the inverted index file, pagerank file, and stopwords file into memory and wait for queries. It should only load the inverted index and pagerank into memory once, when the index server is first started.

Every time it receives an appropriate request, the index server will process the user's query and use the inverted index already loaded into memory to return a list of all of the "hit" doc_ids. A hit is a document that is similar to the user's query. When finding similar documents, only include documents that contain every word from the query. The index server should not load the inverted index or pagerank into memory every time it receives a query!
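For example, with an in-memory inverted index that maps each term to its postings (a data structure of our choosing, not mandated by the spec), the AND requirement amounts to a set intersection:

def find_hits(query_terms, inverted_index):
    """Return the doc_ids that contain every term in query_terms.

    Assumes inverted_index maps term -> {doc_id: postings info}.
    """
    doc_sets = []
    for term in query_terms:
        if term not in inverted_index:
            return set()  # A missing term means no document can match.
        doc_sets.append(set(inverted_index[term]))
    return set.intersection(*doc_sets) if doc_sets else set()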

Index server endpoints

Your index server will have two endpoints. The first is a directory route, /api/v1/, and should return the response below.

{
    "hits": "/api/v1/hits/",
    "url": "/api/v1/"
}

The second route is http://{host}:{port}/api/v1/hits/?w=<w>&q=<query>, which receives the pagerank weight and query as URL parameters w and q, and returns the search results, including the relevance score for each result (calculated according to the previous section). For example, the endpoint /api/v1/hits/?w=0.3&q=michigan+wolverine would correspond to a pagerank weight of 0.3 and a query of “michigan wolverine”.

Your index server should return a JSON object in the following format. A full walkthrough of the example is shown in Appendix A.

{
  "hits": [
    {
      "docid": 868657,
      "score": 0.071872572435374
    }
  ]
}
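A skeleton of the hits route might look like the sketch below. Only the URL, the q and w parameter names, the response format, and the sort order are fixed by the spec; the index.app object (Project 3 style), the compute_hits() helper, and the default weight are assumptions of this sketch.

"""Sketch of an index server route in index/index/api/."""
import flask
import index


def compute_hits(query, weight):
    """Placeholder: clean the query, find matching docs, and score them."""
    return []


@index.app.route("/api/v1/hits/")
def get_hits():
    """Return ranked hits for query q with PageRank weight w."""
    query = flask.request.args.get("q", default="")
    weight = flask.request.args.get("w", default=0.5, type=float)
    hits = compute_hits(query, weight)
    # Most relevant first; ties broken by smaller docid first.
    hits.sort(key=lambda hit: (-hit["score"], hit["docid"]))
    return flask.jsonify({"hits": hits})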

The documents in the hits array must be sorted in order of relevance score, with the most relevant documents at the beginning, and the least relevant documents at the end. If multiple documents have the same relevance score, sort them in order of doc_id (with the smallest doc_id first).

When you get a user's query, make sure you remove all stopwords. Since we removed them from our inverted index, we need to remove them from the query as well. Also clean the user's query of any non-alphanumeric characters, as you did with the inverted index.

Running the index server

Install your index server app by running the commands below in your project root directory:

$ pip install -r index/requirements.txt
$ pip install -e index

To run the index server, you will use environment variables to put the development server in debug mode, specify the name of your app's Python module, and locate an additional config file. (These commands assume you are using the bash shell.)

$ export FLASK_DEBUG=True
$ export FLASK_APP=index
$ export INDEX_SETTINGS=config.py
$ flask run --host 0.0.0.0 --port 8001
 * Serving Flask app "index"
 * Forcing debug mode on
 * Running on http://127.0.0.1:8001/ (Press CTRL+C to quit)

Common problems

Make sure that your index server reads stopwords.txt and pagerank.out from the index/index/ directory.

Pro-tip: use an absolute path and Python’s __file__ variable to compute the path to the stopwords file.

"""index/index/api/views.py"""
index_package_dir = pathlib.Path(__file__).parent.parent
stopwords_filename = index_package_dir/"stopwords.txt"

Testing

Once you have implemented your Index Server, you should be able to pass all tests in test_index_server_public.py.

$ pytest -v tests/test_index_server_public.py

Search interface

The third component of the project is a search engine user interface written using server-side dynamic pages. The search interface app provides a GUI for the user to enter a query and specify a PageRank weight, and then sends a request to your index server. When it receives the search results from the index server, it displays them on the webpage.

Directory structure

As mentioned earlier, since this project has two separate servers (the index server and the search server), you will need a separate directory for your search server. Your search interface server code goes inside the directory search/search and its setup.py goes inside the directory search. Your search server must be a Python package. Below is an example of what your final search directory structure should look like.

$ tree search -I '__pycache__|*.egg-info|tmp'
search
├── requirements.txt
├── search
│   ├── __init__.py
│   ├── config.py
│   ├── model.py
│   ├── sql
│   │   └── wikipedia.sql
│   ├── templates
│   │   └── *.html
│   ├── var
│   │   └── wikipedia.sqlite3
│   └── views
│       └── *.py
└── setup.py

Your search and views directories need to be Python packages.

Configuration

The search server configuration in config.py should have a configuration parameter called INDEX_API_URL. The unit tests will change this variable when the port of the index server on the autograder changes.

Here’s an excerpt from the instructor solution search/search/config.py:

# URL of the index server
INDEX_API_URL = "http://localhost:8001/api/v1/hits/"
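On the search server side, a view can forward the user's query to this URL. Here is a sketch, assuming the requests library is available in search/requirements.txt and that your package exposes search.app:

"""Sketch: ask the index server for hits."""
import requests
import search


def get_index_hits(query, weight):
    """Return the list of hits from the index server for (query, weight)."""
    url = search.app.config["INDEX_API_URL"]
    response = requests.get(url, params={"q": query, "w": weight})
    return response.json()["hits"]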

Database

You will need to create a new database for Project 5 with a table called Documents.

The SQL to create this table and load the necessary data into it is provided in search/search/sql/wikipedia.sql inside the starter files.
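Once the index server returns doc_ids, the search server looks up titles and summaries in this database. Below is a sketch of that lookup; the column names (docid, title, summary) are assumptions, so check wikipedia.sql for the real schema.

"""Sketch: fetch title and summary rows for a list of doc_ids."""
import sqlite3


def get_documents(db_path, docids):
    """Return one dict per docid found in the Documents table."""
    connection = sqlite3.connect(db_path)
    connection.row_factory = sqlite3.Row
    docs = []
    for docid in docids:
        cur = connection.execute(
            "SELECT title, summary FROM Documents WHERE docid = ?",
            (docid,)
        )
        row = cur.fetchone()
        if row is not None:
            docs.append({"docid": docid, "title": row["title"],
                         "summary": row["summary"]})
    connection.close()
    return docs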

GUI

Make a simple search interface for your database of Wikipedia articles which allows users to input a query and a w value, and view a ranked list of relevant docs. This is the main route (http://{host}:{port}/).

Users can enter their query into a text input box, and specify the weight value using a slider. The query is executed when a user presses the submit button. You can assume anyone can use your search engine; you do not need to worry about access rights or sessions like past projects. Your engine should receive a simple AND, non-phrase query (that is, assume the words in a query do not have to appear consecutively and in-sequence), and return the ranked list of docs. Your search results will be titles of Wikipedia documents, displayed exactly as they appear in the database. Search results will show at most 10 documents. Any other content should appear below.

The slider should set the w GET query parameter in the URL, and will be a decimal value between 0 and 1, inclusive. You should set the range slider's step value to be 0.01.

The text input box should set the q GET query parameter in the URL, and will be a string.

Use this HTML form code for the search form:

<!-- DO NOT CHANGE THIS (aside from where we say 'FIXME') -->
<form action="/" method="GET">
  <p><input type="text" name="q"/></p>
  <p><input type="range" name="w" <FIXME add other attributes such as min and max> /></p>
  <input type="submit" value="Search"/>
</form>

When the user clicks the submit button of the search GUI, the page should make a GET request to the index URL (/) with query parameters q and w. The server should return a webpage displaying the appropriate titles. For each title returned, display the title in a <p> tag with a class="doc_title" (<p class="doc_title">) and the summary in a <p> tag with a class="doc_summary" (<p class="doc_summary">). If the summary for a document is not in the database, put the text “No summary available” in the <p> tag (with class="doc_summary").

If there are no search results, display a <p> tag with a class="no_results" (<p class="no_results">).

Running the search server

Install your search server app by running the commands below in your project root directory:

$ pip install -r search/requirements.txt
$ pip install -e search

To run the search server, you will use environment variables to put the development server in debug mode, specify the name of your app's Python module, and locate an additional config file. (These commands assume you are using the bash shell.)

$ export FLASK_DEBUG=True
$ export FLASK_APP=search
$ export SEARCH_SETTINGS=config.py
$ flask run --host 0.0.0.0 --port 8000
 * Serving Flask app "search"
 * Forcing debug mode on
 * Running on http://127.0.0.1:8000/ (Press CTRL+C to quit)

In order to search using the search server, you must have your index server up and running, since the search server uses the index server's response to display the search results.

Shell scripts to launch both servers

You will be responsible for writing shell scripts to launch both the index server and the search server. Hint: these will be somewhat similar to last project’s bin/mapreduce script. The name of these scripts must be bin/index and bin/search.

Example: start search.

$ ./bin/search start
starting search server ...
+ export FLASK_APP=search
+ export SEARCH_SETTINGS=config.py
+ flask run --host 0.0.0.0 --port 8000 &> /dev/null &

Example: stop search.

$ ./bin/search stop
stopping search server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8000'

Example: restart search. Your PIDs may be different.

$ ./bin/search restart
stopping search server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8000'
starting search server ...
+ export FLASK_APP=search
+ export SEARCH_SETTINGS=config.py
+ flask run --host 0.0.0.0 --port 8000 &> /dev/null &

Example: start search when something is already running on port 8000

$ ./bin/search start
Error: a process is already using port 8000

Example: start index.

$ ./bin/index start
starting index server ...
+ export FLASK_APP=index
+ flask run --host 0.0.0.0 --port 8001 &> /dev/null &

Example: stop index.

$ ./bin/index stop
stopping index server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8001'

Example: restart index.

$ ./bin/index restart
stopping index server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8001'
starting index server ...
+ export FLASK_APP=index
+ flask run --host 0.0.0.0 --port 8001 &> /dev/null &

Example: start index when something is already running on port 8001

$ ./bin/index start
Error: a process is already using port 8001

flask run --host 0.0.0.0 --port 8000 &> /dev/null & will start up your flask server in the background. The &> /dev/null will prevent your Flask app from outputting text to the terminal.

Index database management script

You will also write a database management script, bin/indexdb, for the wikipedia.sqlite3 database used by the search server.

Example create.

$ ./bin/indexdb create
+ mkdir -p search/search/var/
+ sqlite3 search/search/var/wikipedia.sqlite3 < search/search/sql/wikipedia.sql

Example create when database already exists.

$ ./bin/indexdb create
Error: database already exists

Example destroy.

$ ./bin/indexdb destroy
+ rm -f search/search/var/wikipedia.sqlite3

Example reset.

$ ./bin/indexdb reset
+ rm -f search/search/var/wikipedia.sqlite3
+ mkdir -p search/search/var/
+ sqlite3 search/search/var/wikipedia.sqlite3 < search/search/sql/wikipedia.sql

Testing

Once you have implemented your Search and Index servers, you should be able to pass all tests in test_search_interface_public.py.

$ pytest -v --log-cli-level=INFO tests/test_search_interface_public.py

Run all published unit tests.

$ pwd
/Users/awdeorio/src/eecs485/p5-search-engine
$ pytest -v

Code style

As in previous projects, all of your Python code for the Flask servers is expected to be pycodestyle, pydocstyle, and pylint compliant. You don't need to lint the mapper/reducer executables (although it may still be a good idea to do so!).

Also as usual, you may not use any external dependencies aside from what is provided in requirements.txt and setup.py.

Submitting and grading

One team member should register your group on the autograder using the create new invitation feature.

Submit a tarball to the autograder, which is linked from https://eecs485.org. Include the --disable-copyfile flag only on macOS.

$ tar \
  -cJvf submit.tar.xz \
  --disable-copyfile \
  --exclude '*input.csv' \
  --exclude '*/input/*' \
  --exclude '*/input_multi/*' \
  --exclude '*output*' \
  --exclude '*__pycache__*' \
  --exclude '*stopwords.txt' \
  --exclude '*.out' \
  --exclude '*.sql' \
  --exclude '*.sqlite3' \
  --exclude '*.jar' \
  --exclude '*.egg-info' \
  --exclude '*var*' \
  --exclude '*tmp*' \
  --exclude 'hadoop/inverted_index/inverted_index.txt' \
  bin \
  hadoop \
  index \
  search

Note that the submission command uses the -J flag: This tells tar to use the xz compression algorithm. xz is slower than gzip (which we used in past projects), but generally results in better compression and smaller file sizes. Make sure your submission tarball is less than 10MB.

$ du -h submit.tar.xz
6.1M	submit.tar.xz

Rubric

This is an approximate rubric.

Deliverable                                  Value
Public tests                                 50%
Hidden unit tests run after the deadline     40%
Public style tests                           10%

FAQ

API Request URLs

If you want to send a request to your index server API from your search interface, you will want to include the following host:port pair in the url you are attempting to send a request to: http://localhost:8001

Appendix A: TFIDF Calculation Walkthrough

You have a query, let’s make a query vector for that

\[\bar{q} = \left[ tf_{q1} * idf_1, tf_{q2} * idf_2, \ldots \right]\]

You have a document, let’s make a document vector for that

\[\bar{d} = \left[ tf_{d1} * idf_1, tf_{d2} * idf_2, \ldots \right]\]

Let’s define a norm factor for our document

\[norm_d = \sqrt{\sum_k{\left(tf_k\right)^2 \left(idf_k\right)^2}}\]

Let’s define a norm factor for our query

\[norm_q = \sqrt{\sum_k{\left(tf_k\right)^2 \left(idf_k\right)^2}}\]

Note that the idfs for both norm factors come from the inverted index. The norm factor for the document can be lifted from your inverted index (but you have to take its square root), whereas the query norm factor needs to be computed on the fly.

If you have defined your query and document vectors this way, then the final score will be

\[\text{tf-idf score}(\bar{q}, \bar{d}) = \frac{\bar{q} \cdot \bar{d}}{norm_q \cdot norm_d}\]

Example

Let’s walk through an example for the query ?w=0.3&q=michigan+wolverine. The query is michigan wolverine. The weight of PageRank is 0.3 and the weight of tf-idf is 0.7.

Read inverted index and PageRank

Search the inverted index for michigan.

$ egrep '^michigan[[:space:]]' ./index/index/inverted_index.txt
michigan	1.509960674077735 10883747 14 8852.693606436345 ...
$ egrep '^wolverine[[:space:]]' ./index/index/inverted_index.txt
wolverine	2.669184007846121 5791422 1 66539.45074152622 ...

We see that the idf for michigan is 1.509 and the idf for wolverine is 2.669. When we look at the doc_ids for michigan and wolverine, we see that there is one document (docid 868657) that contains both terms.

$ egrep '^(michigan|wolverine)[[:space:]]' ./index/index/inverted_index.txt | grep 868657
wolverine	... 868657 1 126536.87039603827 ...
michigan	... 868657 46 126536.87039603827 ...

In document 868657, michigan appears 46 times, and wolverine occurs 1 time.

We’ll also read the value of PageRank for document 868657 from pagerank.out.

$ egrep 868657 ./index/index/pagerank.out
868657,5.41822e-06

Query vector and document vector

The Query vector has one position for each term (["michigan", "wolverine"]). Each position is a number calculated as <term frequency in query> * <idf>.

q = [1*1.50996, 1*2.66918]
  = [1.50996, 2.66918]

The Document vector similarly has one position for each term (["michigan", "wolverine"]). Each position is a number calculated as <term frequency in document> * idf.

d = [46*1.50996, 1*2.66918]
  = [69.45816, 2.66918]

tf-idf

Compute the dot product of q and d.

q * d = 112.004

Compute the normalization factor for the query. The normalization factor is the length of the query vector: the square root of the sum of squares of its elements.

q = [1.50996, 2.66918]
norm_q = sqrt(1.50996^2 + 2.66918^2) = 3.067

Compute the normalization factor for the document, which is the square root of the normalization factor read from the inverted index.

norm_d_squared = 126536.87040
norm_d = sqrt(126536.87040)
       = 355.720

Compute tf-idf.

tfidf = q * d / (norm_q * norm_d)
      = 112.004 / (3.067 * 355.720)
      = 0.103

Weighted score

Finally, we combine tf-idf and PageRank using the weight.

weighted_score = w * PageRank + (1-w) * tfidf
               = (0.3 * 5.418e-06) + (1-0.3) * (0.103)
               = 0.072
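The same arithmetic in Python, in case you want to check your own numbers (all constants come from the worked example above):

import math

idf = {"michigan": 1.50996, "wolverine": 2.66918}
tf_query = {"michigan": 1, "wolverine": 1}
tf_doc = {"michigan": 46, "wolverine": 1}
norm_d_squared = 126536.87040   # document normalization factor, before sqrt
pagerank = 5.41822e-06
w = 0.3

q = [tf_query[t] * idf[t] for t in idf]        # [1.50996, 2.66918]
d = [tf_doc[t] * idf[t] for t in idf]          # [69.45816, 2.66918]
dot = sum(qi * di for qi, di in zip(q, d))     # ~112.004
norm_q = math.sqrt(sum(qi ** 2 for qi in q))   # ~3.067
norm_d = math.sqrt(norm_d_squared)             # ~355.720
tfidf = dot / (norm_q * norm_d)                # ~0.103
print(w * pagerank + (1 - w) * tfidf)          # ~0.072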

Appendix B: Installing Real Hadoop

Although not needed for this project, below are the steps to install Hadoop on OSX and WSL if you want to play around with it.

If you are on OSX, install Hadoop with Homebrew

$ brew cask install java
$ brew install hadoop

If you are on Ubuntu Linux, an Ubuntu Linux VM, or Windows 10 WSL, you will have to install Hadoop manually. OSX users can skip the following steps.

Install Java

$ sudo -s                            # Become root
$ sudo apt-get update
$ sudo apt-get install default-jdk   # Install Java
$ java -version                      # Check Java version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

Download Hadoop and unpack.

$ cd /opt
$ wget http://apache.osuosl.org/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
$ wget https://dist.apache.org/repos/dist/release/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz.mds
$ shasum -a 256 hadoop-2.8.5.tar.gz
e8bf9a53337b1dca3b152b0a5b5e277dc734e76520543e525c301a050bb27eae  hadoop-2.8.5.tar.gz
$ grep SHA256 hadoop-2.8.5.tar.gz.mds
SHA256 = E8BF9A53 337B1DCA 3B152B0A 5B5E277D C734E765 20543E52 5C301A05 0BB27EAE
$ tar -xvzf hadoop-2.8.5.tar.gz
$ rm hadoop-2.8.5.tar.gz hadoop-2.8.5.tar.gz.mds

Locate the path to your Java interpreter.

$ which java | xargs readlink -f | sed 's:bin/java::'
/usr/lib/jvm/java-8-openjdk-amd64/jre/

Edit /opt/hadoop-2.8.5/etc/hadoop/hadoop-env.sh to change the JAVA_HOME.

#export JAVA_HOME=${JAVA_HOME}  # remove/comment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

Write a launch script, /usr/local/bin/hadoop

#!/bin/bash
exec "/opt/hadoop-2.8.5/bin/hadoop" "$@"

Make the script executable and check that it’s working.

$ chmod +x /usr/local/bin/hadoop
$ which hadoop
/usr/local/bin/hadoop
$ hadoop  -h
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
...
$ exit  # drop root privileges

OSX Java Home Error

Note to OSX users: If you run Hadoop and Hadoop is unable to find JAVA_HOME, please refer to the fix below.

  1. Find hadoop-env.sh. Your location might be different.
    $ find / -name hadoop-env.sh
    /usr/local/Cellar/hadoop/2.8.1/libexec/etc/hadoop/hadoop-env.sh
    
  2. Uncomment and edit the export JAVA_HOME line in hadoop-env.sh:
    # The java implementation to use.
    export JAVA_HOME="$(/usr/libexec/java_home)"