Maps
Let's go out to eat! Show me places I would like By learning my tastes.
Introduction
In this project, you will create a visualization of restaurant scores using machine learning and the Yelp academic dataset. In this visualization, Berkeley is segmented into regions, where each region is shaded by the predicted score of the closest restaurant (yellow is 5 stars, blue is 1 star). Specifically, the visualization you will be constructing is a Voronoi diagram.
In the map above, each dot represents a restaurant. The color of the dot is determined by the restaurant's location. For example, downtown restaurants are colored green. The user that generated this map has a strong preference for Southside restaurants, and so the southern regions are colored yellow.
This project uses concepts from Sections 2.1, 2.2, 2.3, and 2.4.3 of Composing Programs. It also introduces techniques and concepts from machine learning, a growing field at the intersection of computer science and statistics that analyzes data to find patterns and make predictions.
Logistics
This is a project. You may work with one other partner. You should not share your code with students who are not your partner or copy from anyone else's solutions. In the end, you will submit one project for both partners. We strongly encourage you to use pair programming to work on all parts of the project together rather than splitting up the work. Whoever is not coding should contribute by looking at the code and providing comments on a direction to go and catching bugs.
The project is worth 20 points:
- The mid-project checkpoint is worth 2 points and is due Tuesday 03/04
- The final project submission is worth 18 points and is due Tuesday 03/18
You can earn 1 point of extra credit for submitting by Monday 03/17.
You will turn in the following files:
utils.py
abstractions.py
recommend.py
You do not need to modify or turn in any other files to complete the project. To submit the project, submit the required files to the appropriate Gradescope assignment.
You may not use artificial intelligence tools to help you with this project or reference solutions found on the internet.
For the functions that we ask you to complete, there may be some initial code that we provide. If you would rather not use that code, feel free to delete it and start from scratch. You may also add new function definitions as you see fit.
However, please do not modify any other functions or edit any files not listed above. Doing so may result in your code failing our autograder tests. Also, please do not change any function signatures (names, argument order, or number of arguments).
Throughout this project, you should be testing the correctness of your code. It is good practice to test often, so that it is easy to isolate any problems. However, you should not be testing too often, to allow yourself time to think through problems.
We have provided an autograder called ok
to help you
with testing your code and tracking your progress. The first time you run the
autograder, you will be asked to log in with your Ok account using your web
browser. Please do so. Each time you run ok
, it will back up
your work and progress on our servers.
The primary purpose of ok
is to test your implementations.
If you want to test your code interactively, you can run
python3 ok -q [question number] -iwith the appropriate question number (e.g.
01
) inserted.
This will run the tests for that question until the first one you failed,
then give you a chance to test the functions you wrote interactively.
You can also use the debugging print feature in OK by writing
print("DEBUG:", x)which will produce an output in your terminal without causing OK tests to fail with extra output.
Project Files
The maps.zip archive contains all the starter code and data sets.
The project uses several files, but all of your changes will be made to
utils.py
, abstractions.py
, and recommend.py
.
abstractions.py
: Data abstractions used in the projectrecommend.py
: Machine learning algorithms and data processingutils.py
: Utility functions for data processingucb.py
: Utility functions for miscellaneous and debuggingdata
: A directory of Yelp users, restaurants, and reviewsok
: The autogradermaps.ok
: Theok
configuration filetests
: A directory of tests used byok
users
: A directory of user filesvisualize
: A directory of tools for drawing the final visualization
Conceptual Review
The following concepts will be important to understand to do this project. Feel free to review them below, or skip if you feel comfortable.
lambda <arguments>: <expression using arguments>
Below are a couple of examples of lambda expressions:
>>> square = lambda x: x * x
>>> square(4)
16
>>> plus = lambda x,y: x + y
>>> plus(2, 3)
5
>>> (lambda x: x[0])([1, 2, 3])
1
Note that as with other functions, a lambda function's lexical parent is the frame in which it was defined.
>>> d = {2: 4, 'two': ['four'], (1, 1): 4}
>>> d[2]
4
>>> d['two']
['four']
>>> d[(1, 1)]
4
The sequence of keys or values or key-value pairs can be accessed using
.keys()
or .values()
or .items()
.
>>> for k in d.keys():
... print(k)
...
2
two
(1, 1)
>>> for v in d.values():
... print(v)
...
4
['four']
4
>>> for k, v in d.items():
... print(k, v)
...
2 4
two ['four']
(1, 1) 4
By default, iterating through a dictionary iterates through its keys.
>>> for x in d:
... print(x)
...
2
two
(1, 1)
You can check whether a dictionary contains a key using in
:
>>> 'two' in d
True
>>> 4 in d
False
A dictionary comprehension is an expression that evaluates to a new dictionary.
>>> {3*x: 3*x + 1 for x in range(2, 5)}
{6: 7, 9: 10, 12: 13}
Phase 0: Utilities
All changes in this phase will be made to
utils.py
.
Problem 0 (1 pt)
Before starting the core project, familiarize yourself with some Python features
by completing utils.py
. Each function described below can be implemented in
one line. The functions you implement here can optionally be used in future phases of the project but are not strictly needed.
Problem 0.1: map_and_filter
Recall that a list comprehension constructs a new list from an existing sequence by first filtering the given sequence, and then computing an element of the result for each remaining element that is not filtered out. A list comprehension has the following syntax:
[<map expression> for <name> in <sequence expression> if <filter expression>]
For example, if we wanted to square every even integer in range(10)
, we could
write:
>>> [x * x for x in range(10) if x % 2 == 0]
[0, 4, 16, 36, 64]
In utils.py
, implement map_and_filter
. This function takes in a sequence
s
, a one-argument function map_fn
, and a one-argument function filter_fn
.
It returns a new list containing the result of calling map_fn
on each element
of s
for which filter_fn
returns a true value. Make sure your solution is
only one line and uses a list comprehension.
Problem 0.2: key_of_min_value
The built-in min
function takes a sequence (such as a list or a dictionary)
and returns the sequence's smallest element. The min
function can also take a
keyword argument called key
, which must be a one-argument function. The key
function is called with each element of the list, and the return values are used
for comparison. For example:
>>> words = ['bat', 'attack', 'do', 'cat']
>>> # no key argument; returns "smallest" input which is first word in alphabetical order
>>> min(words)
'attack'
>>> # provided key argument; returns input with shortest length
>>> min(words, key=lambda w: len(w))
'do'
In utils.py
, implement key_of_min_value
, which takes in a dictionary d
and
returns the key that corresponds to the minimum value in d
. This behavior
differs from just calling min
on a dictionary, which would return the smallest
key. Make sure your solution is only one line and uses the min
function.
Problem 0.3: enumerate
The zip
function defined in utils.py
takes multiple sequences as arguments
and returns a list of lists, where the i-th list contains the i-th element
of each original list. For example:
>>> zip([1, 2, 3], [4, 5, 6])
[[1, 4], [2, 5], [3, 6]]
>>> for triple in zip(['a', 'b', 'c'], [1, 2, 3], ['do', 're', 'mi']):
... print(triple)
['a', 1, 'do']
['b', 2, 're']
['c', 3, 'mi']
In utils.py
, use the zip
function to implement enumerate
, which takes a
sequence s
and a starting index start
. It returns a list of pairs, in which
the i-th element is i+start
paired with the i-th element of s
. Make sure
your solution is only one line and uses the zip
function and a range
.
Note: zip
and enumerate
are also built-in Python functions, but their
behavior is slightly different than the versions provided in utils.py
.
Common Mistakes
RecursionError: maximum recursion depth exceeded
- make sure your given zip function code isreturn list(map(list, _zip(*sequences)))
and notreturn list(map(list, zip(*sequences)))
(note the underscore)
Run Ok for Problem 0
As you work through Problem 0, you can unlock the test cases for these
exercises and check your solutions by running ok
:
python3 ok -q 00 -u
python3 ok -q 00
Phase 1: Data Abstraction
All changes in this phase will be made to
abstraction.py
.
Complete the data abstractions in abstractions.py
. Two of the data
abstractions have already been completed for you: the review
data abstraction
and the user
data abstraction. Make sure that you understand how they work.
Data Abstraction Descriptions
Review Data Abstraction
variable | type | description |
---|---|---|
restaurant_name |
string | restaurant name of the review |
review_score |
float between 1 and 5 (inclusive) | number of stars given by the review |
User Data Abstraction
variable | type | description |
---|---|---|
name |
string | name of user |
reviews |
dictionary from restaurant names to review data abstractions |
reviews that user has written |
Restaurant Data Abstraction
variable | type | description |
---|---|---|
name |
string | name of restaurant |
location |
list containing two floats: latitude and longitude | location of restaurant |
categories |
list of strings | categories that restaurant belongs to |
price |
integer | price of restaurant |
scores |
list of scores (floats between 1 to 5, inclusive) | list of scores based on restaurant reviews |
Problem 1 (0.5 pt)
Complete the implementations of the constructor and selectors for the
restaurant
data abstraction. You can use any implementation you choose, but
the constructor and selectors must be defined together to satisfy the following
description. A starter implementation using a dictionary is provided, but you
may modify it.
make_restaurant
: return a restaurant constructed from five arguments:name
(a string)location
(a list containing latitude and longitude)categories
(a list of strings)price
(a number)reviews
(a list of review data abstractions created bymake_review
)
restaurant_name
: return the name of arestaurant
restaurant_location
: return the location of arestaurant
restaurant_categories
: return the categories of arestaurant
restaurant_price
: return the price of arestaurant
restaurant_scores
: return a list of scores for the restaurant
Use Ok to unlock and test your code:
python3 ok -q 01 -u
python3 ok -q 01
Problem 2 (0.5 pt)
Implement the restaurant_num_scores
and restaurant_mean_score
functions,
without assuming any particular implementation of a restaurant.
Be sure not to violate abstraction barriers! Violating an abstraction barrier means writing code outside of the ADT that makes assumption(s) about how the ADT is implemented. On the other hand, calling functions that are part of the ADT does not make assumptions about the underlying implementation.
Note that if these two functions were part of the restaurant ADT, there would be no abstraction barrier to break. For this problem, you must assume the two functions are not part of the ADT, so you must take care not to violate the abstraction barrier.
Hint: Use the
mean
function fromutils.py
to compute the average value of a sequence of numbers.
Test your implementation before moving on:
python3 ok -q 02 -u
python3 ok -q 02
Optional: Visualization
When you finish, you should be able to generate a visualization of all
restaurants rated by a user. Use -u
to select a user from the users
directory (a user is represented by a .dat
file). You can even create your own.
python3 recommend.py
python3 recommend.py -u one_cluster
Notes:
- You may have to refresh your browser to update the visualization.
- You may see an error that says "This page could not load Google Maps correctly." This is fine, just click Ok. As long as you can see the voronoi diagram and each restaurant, it's working.
- If you aren't able to get the visualization to work, that is also completely fine. You are only graded on the code you write and submit to Gradescope, not whether the visualization shows up on your computer (which could be caused by another issue).
Checkpoint Submission
When you have finished with Phase 0 and Phase 1, you must submit to the Maps Checkpoint on Gradescope.
If you're working with a partner, you must add your partner to your Gradescope submission. Review Gradescope's instructions on adding partners.
Phase 2: Unsupervised Learning
All changes in this phase will be made to
recommend.py
.
Restaurants tend to appear in clusters. In this phase, we will devise a way to group together restaurants that are close to each other.
The k-means algorithm is a method for discovering the centers of clusters. It is called an unsupervised learning method because the algorithm is not told what the correct clusters are; it must infer the clusters from the data alone.
The k-means algorithm finds k
centroids within a dataset that each
correspond to a cluster of inputs. To do so, k-means begins by choosing k
centroids at random, then alternates between the following two steps:
- Group the restaurants into clusters, where each cluster contains all restaurants that are closest to the same centroid.
- Compute a new centroid (average position) for each new cluster.
This video is a good way to understand and visualize the algorithm, as well as one application of k-means which is image segmentation.
Glossary
As you complete the remaining questions, you will encounter the following terminology. Be sure to refer back here if you're ever confused about what a question is asking.
- location: A pair containing latitude and longitude
- centroid: A location (see above) that represents the center of a cluster
- restaurant: A restaurant data abstraction, as defined in
abstractions.py
- cluster: A list of restaurants
- user: A user data abstraction, as defined in
abstractions.py
- review: A review data abstraction, as defined in
abstractions.py
- feature function: A single-argument function that takes a restaurant and returns a number representing a feature of the restaurant, such as its mean score or price
Problem 3 (2 pt)
Implement find_closest
, which takes a location
and a sequence of centroids
(locations). It returns the element of centroids
closest to location
.
You should use the distance
function from utils.py
to measure distance
between locations. The distance
function calculates the Euclidean
distance between two locations.
If two centroids are equally close, return the one that occurs first in the
sequence of centroids
.
Hint: Use the
min
function.
Use Ok to unlock and test your code:
python3 ok -q 03 -u
python3 ok -q 03
Problem 4 (2 pt)
Implement group_by_centroid
, which takes a sequence of restaurants
and a
sequence of centroids
(locations) and returns a list of clusters. Each cluster of the result is a list of restaurants that are closer to a specific centroid in centroids
than any other centroid. The order of the list of clusters returned does not matter.
If a restaurant is equally close to two centroids, it is associated with the
centroid that appears first in the sequence of centroids
.
The example below is a visualization of the doctest of group_by_centroid
.Restaurant r1
is in a group by itself because it is closest to centroid c1
, while restaurants r2
and r3
are grouped together because they are closer to centroid c2
.
Hint: Use the provided
group_by_key
function to group together all values for the same key in a list of[key, value]
pairs. You can look at the doctests to see how to use it.Hint: What is a function you've already implemented that helps you find the closest centroid for a given location? How can you use it here?
Be sure not violate abstraction barriers! Test your implementation before moving on:
python3 ok -q 04 -u
python3 ok -q 04
Common Mistakes
Key Error - the key does not exist in the dictionary
Problem 5 (2 pt)
Implement find_centroid
, which finds the centroid of a cluster
(a list of
restaurants) based on the locations of the restaurants. The centroid latitude is
computed by averaging the latitudes of the restaurant locations. The centroid
longitude is computed by averaging the longitudes.
The example below is a visualization of the doctest of find_centroid
. The centroid has been computed based on the locations of restaurants r1
, r2
, and r3
.
Hint: Use the
mean
function fromutils.py
to compute the average value of a sequence of numbers.
Be sure not violate abstraction barriers! Test your implementation before moving on:
python3 ok -q 05 -u
python3 ok -q 05
Problem 6 (2 pt)
Complete the implementation of k_means
. In each iteration of the while
statement,
- Group
restaurants
into clusters, where each cluster contains all restaurants closest to the same centroid. (Hint: Usegroup_by_centroid
) - Bind
centroids
to a new list of the centroids of all the clusters. (Hint: Usefind_centroid
)
Use Ok to unlock and test your code:
python3 ok -q 06 -u
python3 ok -q 06
Optional: Visualization with k-means clustering
Your visualization can now indicate which restaurants are close to each other (e.g.
Southside restaurants, Northside restaurants). Dots that have the same color
on your map belong to the same cluster of restaurants. You can get more
fine-grained groupings by increasing the number of clusters with the -k
option.
python3 recommend.py -k 2
python3 recommend.py -u likes_everything -k 3
Congratulations! You've now implemented an unsupervised learning algorithm.
Phase 3: Supervised Learning
All changes in this phase will be made to
recommend.py
.
In this phase, you will predict what score a user would give for a restaurant. You will implement a supervised learning algorithm that attempts to generalize from examples for which the correct score is known, which are all of the restaurants that the user has already scored. By analyzing a user's past scores, we can then try to predict what score the user might give to a new restaurant. When you complete this phase, your visualization will include all restaurants, not just the restaurants that were scored by a user.
To predict scores, you will implement simple least-squares linear regression, a widely used statistical method that approximates a relationship between some input feature (such as price) and an output value (the score) with a line. The algorithm takes a sequence of input-output pairs and computes the slope and intercept of the line that minimizes the mean of the squared difference between the line and the outputs.
Problem 7 (3 pt)
Implement the find_predictor
function, which takes in a user
, a sequence of
restaurants
, and a feature function called feature_fn
.
find_predictor
returns two values: a predictor
function and an r_squared
value.
Use least-squares linear regression to compute the predictor
and r_squared
.
This method, described below, computes the coefficients a
and b
for the
predictor
line y = a + bx
. The r_squared
value measures how accurately
this line describes the original data.
One method of computing these values is by calculating the sums of squares,
S_xx
, S_yy
, and S_xy
:
- Sxx = Σi (xi - mean(x))2
- Syy = Σi (yi - mean(y))2
- Sxy = Σi (xi - mean(x)) (yi - mean(y))
After calculating the sums of squares, the regression coefficients (a
and b
)
and r_squared
are defined as follows:
- b = Sxy / Sxx
- a = mean(y) - b * mean(x)
- R2 = Sxy2 / (Sxx Syy)
Hint: The
mean
function can be helpful here.
Use Ok to unlock and test your code:
python3 ok -q 07 -u
python3 ok -q 07
Problem 8 (2 pt)
Implement best_predictor
, which takes a user
, a list of restaurants
,
and a sequence of feature_fns
. It uses each feature function to compute a
predictor function, then returns the predictor that has the highest r_squared
value. All predictors are learned from the subset of restaurants
reviewed by
the user (called reviewed
in the starter implementation).
Hint: The
max
function can also take akey
argument, just likemin
.Hint: What is a function you've already implemented that can compute a predictor function? What does the function you implemented return?
Use Ok to unlock and test your code:
python3 ok -q 08 -u
python3 ok -q 08
Problem 9 (2 pt)
Implement rate_all
, which takes a user
and list of restaurants
. It
returns a dictionary where the keys are the names of each restaurant in
restaurants
. Its values are scores (numbers).
If a restaurant was already scored by the user, rate_all
will assign the
restaurant the user's score. Otherwise, rate_all
will assign the restaurant
the score computed by the best predictor for the user. The best predictor is
chosen using a sequence of feature_fns
.
Hint: You may find the
user_score
function inabstractions.py
useful.Hint: You may find it useful to remember that
reviewed
is a list of restaurant ADTs. If you're having a hard time passing test cases, look closely at your function's logic.
Be sure not violate abstraction barriers! Test your implementation before moving on:
python3 ok -q 09 -u
python3 ok -q 09
Common mistakes
TypeError: unhashable type: 'dict' - Check the type of your dictionary key.
Problem 10 (1 pt)
To focus the visualization on a particular restaurant category, implement
search
. The search
function takes a category query
and a sequence of
restaurants. It returns all restaurants that have query
as a category.
Hint: you might find a list comprehension useful here.
Be sure not violate abstraction barriers! Test your implementation:
python3 ok -q 10 -u
python3 ok -q 10
Optional: Visualization with category filtering
Congratulations, you've completed the project! The -q
option allows you to
filter based on a category. For example, the following command visualizes all
sandwich restaurants and their predicted scores for the user who
likes_expensive
restaurants:
python3 recommend.py -u likes_expensive -k 2 -p -q Sandwiches
Final Submission
Once you're done with the entire project, you must submit to the Maps Project assignment on Gradescope.
If you're working with a partner, you must add your partner to your Gradescope submission. Review Gradescope's instructions on adding partners.
Optional: Predicting your own scores
Once you're done, you can use your project to predict your own scores too! Here's how:
- In the
users
directory, you'll see a couple of.dat
files. Copy one of them and rename the new file toyourname.dat
(for example,michael.dat
). In the new file (e.g.
michael.dat
), you'll see something like the following:make_user( 'Michael Box', # name [ # reviews make_review('Jasmine Thai', 4.0), ... ] )
Replace the second line with your name (as a string).
Replace the existing reviews with reviews of your own! You can get a list of Berkeley restaurants with the following command:
python3 recommend.py -r
Rate a couple of your favorite (or least favorite) restaurants.
Use
recommend.py
to predict scores for you:python3 recommend.py -u michael -k 2 -p -q Sandwiches
(Replace
michael
with your name.) Play around with the number of clusters (the-k
option) and try different queries (with the-q
option)!
How accurate is your predictor?