CS 1101 Computer Science I
Spring 2016

Computer Science Department
The Morrissey College of Arts and Sciences
Boston College

Problem Set 6: Movies

Assigned: Monday March 14, 2016
Due: Thursday March 24, 2016, midnight
Points: 16

This is a pair project. Find a partner using the usual tools. You can cut-and-paste the following code for creating the Film/Actors dictionary:

# build and return the dictionary whose keys are movie titles
# and values are lists of the movie casts as in the file 'mdb.txt'

# Note:  mdb.txt was produced by combining lists of 500 top-grossing
# movies with 250 'top-rated' movies, and movies nominated for
# the Best Picture Oscar between 1984 and 2014.  Some movies appear
# more than once, so the condition 'if actor not in titledict[title]'
# prevents the same actor from being added twice to a cast list.
#
def makeFilmActorsDictionary():
  titledict = {}
  # read every line of the file, stripping surrounding whitespace
  with open("mdb.txt", "r") as infile:
    masterlist = [nextline.strip() for nextline in infile]
  # lines alternate: a film title, then one actor in that film
  j = 0
  while j + 1 < len(masterlist):
    title = masterlist[j]
    actor = masterlist[j + 1]
    if title in titledict:
      if actor not in titledict[title]:
        titledict[title].append(actor)
    else:
      titledict[title] = [actor]
    j = j + 2
  return titledict

This assignment uses dictionaries to explore a database of movies and actors. The earlier, simpler, parts of the assignment ask you to extract some basic information from the dictionary, such as how many films in the database Bradley Cooper appeared in. In the later sections you'll build the collaboration graph -- a structure that links together every pair of actors who appeared together in a film -- and determine things like the shortest path in this graph connecting two actors.
At each step, you have some questions to answer. I expect that everyone should be able to complete the assignment through Step 3. Step 4 is harder, and Step 5 is a challenge.

Background. In the 1960s, a psychologist named Stanley Milgram conducted an experiment in which he gave subjects in Omaha, Nebraska, and Wichita, Kansas, the name and address of a person in Boston. The subjects were instructed to send this information by mail to someone they knew personally who might be closer to the target individual in Boston. The recipients of these letters were asked to do the same. The goal was to see how many links of acquaintanceship connected the original subjects to the Boston target.

This research penetrated into popular culture in a play by John Guare (later a movie) entitled Six Degrees of Separation. One character says that she read somewhere that there are "six degrees of separation" between any two individuals.

It also showed up in two other places: (a) in serious applied mathematics research into the structure of both man-made and natural networks, where a paper by Duncan Watts and Steven Strogatz showed that, under certain hypotheses, such networks exhibit a "small world property" in which every node of the network is connected to every other node by a very short path; and (b) in a trivia game invented by some college students, who, after watching Footloose, decided that Kevin Bacon was the center of the movie universe. In the game, the challenger names an actor, and the player who is challenged has to connect the actor to Kevin Bacon in as few steps as possible. To see how this works, go to Google and type in "Kevin Bacon number" followed by the name of an actor. If you complete all the steps of this assignment, your program will be able to play "Six Degrees of Kevin Bacon", although with a much smaller database of movies.

The Database. The database is in a text file called "mdb.txt". The information was extracted from web pages returned by the Internet Movie Database. It consists of all the films nominated for the Academy Award for Best Picture from 1984 until 2014, all films in the 500 top-grossing films listed on IMDB, and the 250 top-rated films (based, I think, on ratings submitted by users of the database), all together with their complete casts. A selection of typical lines in the file is:

12 Years a Slave
Chiwetel Ejiofor
12 Years a Slave
Dwight Henry
12 Years a Slave
Dickie Gravois
12 Years a Slave
Bryan Batt
12 Years a Slave
Ashley Dyke
The entire file is formatted this way, beginning with the title of a film on one line, and an actor in the film on the next. This file is posted here. Download it. By the way, since the list was created by combining several different lists, the same film, with its entire cast list, may appear several times in the database. This problem is discussed briefly below.

By the way, neither "Footloose" nor "Six Degrees of Separation" is in the database! Kevin Bacon is there, as well as Donald Sutherland and Will Smith, two of the stars of "Six Degrees".

Step 1. The title-cast dictionary.

The harness code includes code for making the Film/Cast map which gives the cast list for every movie in the database. You should be able to understand the code and adapt it in later parts of the assignment. However, in this part you only have to use it to answer some questions. A typical dictionary entry looks like:
      "The Crying Game" : ["Forest Whitaker"; "Miranda Richardson";
      "Stephen Rea"; "Adrian Dunbar"; "Breffni McKenna"; "Joe Savino";
      "Birdy Sweeney"; "Jaye Davidson"; "Andr\xc3\xa9e Bernard"; 
      "Jim Broadbent"; "Ralph Brown"; "Tony Slattery"; "Jack Carr";
      "Josephine White"; "Shar Campbell"; "Bryan Coleman"; "Ray De-Haan";
      "David Crionelly"]
The complete cast lists for films tend to be long. You'll note that one of the actors' names appeared in this entry with nonprinting characters, which show up in the hex encoding. A more graphical representation of a small part of this dictionary is shown below:
Use the dictionary to answer the following questions. Your submission should show the code that you wrote to find each answer as well as the answer itself. As with Assignment 5, these questions can often be answered using a single line of code if you take advantage of list comprehensions.
  1. What is the length of the longest cast list?

  2. What is the film or films with the largest cast?

  3. How many films are in the database?

  4. Answer questions 1 and 2 above for "smallest cast".

  5. List all the movies in which Owen Wilson appears. (Solve this using the title-cast dictionary from this step, called d here. In the next step you'll see a simpler solution using a different dictionary.)

  6. List all the actors who appeared in both "Silver Linings Playbook" and "American Hustle".
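
To give a feel for the one-line style, here is a small sketch on a made-up dictionary. The titles and names below are invented for illustration and are not in mdb.txt; the variable name d simply matches the name used in question 5.

```python
# Toy title-cast dictionary; every title and name here is made up.
d = {
    "Film A": ["Ann", "Bob", "Cy"],
    "Film B": ["Bob", "Dee"],
    "Film C": ["Ann", "Bob", "Eve", "Fay"],
}

# Length of the longest cast list.
longest = max(len(cast) for cast in d.values())

# Every film whose cast has that length (there may be ties).
biggest_films = [title for title in d if len(d[title]) == longest]

# Every film in which a given actor appears, sorted for a stable order.
films_with_ann = sorted(title for title in d if "Ann" in d[title])

# Actors who appear in both of two films.
in_both = [actor for actor in d["Film A"] if actor in d["Film C"]]
```

The same patterns should carry over to the real dictionary once you swap in real titles and names.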

Step 2. Build the actor-title dictionary

Write a function
makeActorFilmsMap : string -> Map.t
The makeActorFilmsMap function returns the reversal of the dictionary in Step 1: here the keys are actors, and the value corresponding to a key is the list of films in which the actor appeared. Here is one entry.

      
      "Meryl Streep", ["The Deer Hunter"; "Into the Woods"; "The
      Hours"; "It's Complicated"; "Lemony Snicket's A Series of
      Unfortunate Events"; "The Devil Wears Prada"; "Out of Africa";
      "Mamma Mia!"]

A graphical representation would look just like the diagram above, but with the arrows pointing in the opposite direction. There are two different ways to go about it. You can write makeActorFilmsMap with a single argument corresponding to the original title dictionary, and have it build the actor dictionary from the title dictionary. This uses more or less generic code for reversing a dictionary. It is a bit time-consuming, but it should finish running in under a minute. Alternatively, you can write it so that it builds the actor dictionary directly from the text file, just as we built the title dictionary. Either way, you will need to be careful about repeated entries.
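
The first approach, generic dictionary reversal, might look something like the sketch below. The function name reverse_dictionary is my own placeholder, not part of the harness code, and the sketch assumes the values of the input dictionary are lists of strings.

```python
def reverse_dictionary(title_dict):
    """Return a dictionary mapping each actor to the list of films they are in.

    Assumes title_dict maps a film title to a list of actor names.
    """
    actor_dict = {}
    for title, cast in title_dict.items():
        for actor in cast:
            # Create the actor's film list on first sight, then reuse it.
            films = actor_dict.setdefault(actor, [])
            # Guard against repeated entries.
            if title not in films:
                films.append(title)
    return actor_dict
```

For example, reversing {"Film A": ["Ann", "Bob"], "Film B": ["Bob"]} gives a dictionary in which "Bob" maps to both films and "Ann" maps to "Film A" only.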

Use the actor dictionary to answer the following questions:

  1. List (again) all the movies in which Owen Wilson appears.

  2. How many actors are in the database?

  3. Which actor (or actors) has been in the largest number of films in the database? [By the way, because of the large number of animated films in the database, answers to questions like these tend to include hard-working voice actors whom you probably never heard of.]

  4. What are all the films in which either Clint Eastwood or Morgan Freeman appeared?

Step 3. Build the collaboration graph

Create a dictionary whose keys are actors, and in which the value corresponding to a key A is the list of all the actors who have appeared in a movie with A. To do this, you need both the dictionaries created in the previous steps. In outline, the procedure is this:

      create a new empty dictionary g
      for every actor A in the actor-title dictionary a:
        for every film M in a[A]:
          for every actor B in d[M]:
            if B is different from A and not already in g, set g[B]=[A]
            otherwise if B is different from A and A is not already in g[B], append A to g[B].
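
The outline above can be sketched in Python roughly as follows. The function name make_collaboration_graph is a placeholder of mine; d and a are the title-cast and actor-title dictionaries from Steps 1 and 2. The check before appending is there so that two actors who share several films are recorded as collaborators only once.

```python
def make_collaboration_graph(d, a):
    """Build g, where g[B] lists every actor who shared a film with B.

    d maps film title -> list of actors; a maps actor -> list of film titles.
    """
    g = {}
    for actor in a:                      # every actor A
        for film in a[actor]:            # every film M in a[A]
            for other in d[film]:        # every actor B in d[M]
                if other == actor:
                    continue
                collaborators = g.setdefault(other, [])
                if actor not in collaborators:
                    collaborators.append(actor)
    return g
```

Testing membership in a list is slow when the lists get long. Storing each value as a set instead of a list is one possible variation if the running time bothers you.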

The value portions of the key-value pairs in this dictionary tend to be very large: Owen Wilson has over 600 collaborators. Here is a graphical representation of a very small portion of the graph.

(There is a fancier way to go about this in which the dictionary stores not only the list of collaborators, but the films on which the two actors worked together. This is discussed in Step 5. It's usually a good idea to NOT try to do the fancy stuff at first, and then build up to it, so you should not skip Steps 3 and 4.) Use g to answer the following questions:

  1. Who is the actor with the most collaborators? (It's not Kevin Bacon)

  2. Is Kate Winslet a collaborator of Cate Blanchett? (I always get those two confused.)

  3. Is Kate Winslet a collaborator of a collaborator of Cate Blanchett? There are several ways to go about this; the simplest probably involves writing a pair of statements, one to get the list of all of Kate's collaborators, and a second to filter out all of the elements of this list who are not also collaborators of Cate.
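
Here is a sketch of the two-statement idea on a made-up graph. The single-letter names and both helper function names are invented for illustration; only the shape of g (actor mapped to a list of collaborators) comes from the assignment.

```python
# Toy collaboration graph: P worked with Q, and Q worked with R.
g = {
    "P": ["Q"],
    "Q": ["P", "R"],
    "R": ["Q"],
}

def are_collaborators(g, x, y):
    """True if y is listed among x's collaborators."""
    return y in g.get(x, [])

def common_collaborators(g, x, y):
    """Collaborators of x who are also collaborators of y."""
    of_x = g.get(x, [])                            # first statement
    return [z for z in of_x if z in g.get(y, [])]  # second statement: filter

direct = are_collaborators(g, "P", "R")
linked_through = common_collaborators(g, "P", "R")
```

In the toy graph, P and R never worked together, but the filter leaves Q, so R is a collaborator of a collaborator of P.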

Step 4: Build the Breadth-First Search Collaborator Tree

Write a function make_bfs_tree(g,root) that takes as a parameter the collaboration graph g from Step 3 and an actor root and returns a dictionary having the following structure. Let's say the root is "Cate Blanchett". So one of the items in the dictionary is
      ("Cate Blanchett" : "")
That is, "Cate Blanchett" is the key and "" is the value. Let's call this Level 0. We then take each collaborator of Cate Blanchett, say for example, Leonardo DiCaprio, and add to the dictionary the key-value pair
      ("Leonardo DiCaprio" : "Cate Blanchett")
These pairs constitute Level 1 of the dictionary. To construct Level 2, we take all the collaborators of collaborators of Cate Blanchett who are not themselves already keys in an earlier level, and add new key-value pairs for them, for instance,
      ("Jonah Hill" : "Leonardo DiCaprio")
The way to do this is to keep track, at each step, of the list of actors on the current level. At the start, this list consists of just the root (Cate Blanchett in the example). We then repeatedly create the list of actors for the next level: we look at each collaborator of each actor in the current level, and if that collaborator is not already in the dictionary, we add the collaborator to the new level's list and add the pair collaborator:actor to the dictionary.

Here is a little piece of the dictionary shown schematically, so you can see why it is called a tree.

In pseudocode:

      current level = [root]
      while current level is not empty:
        new level = []
        for every actor in the current level
          for every collaborator of the actor
            if the collaborator is not already in the tree
              add collaborator:actor to the tree
              add collaborator to the new level
        set current level = new level
This algorithm is called breadth-first search: it finds all the actors at a distance of 1 from the root, then at a distance of 2, etc. For any actor in the tree we can follow the path from this actor back to the root, and this will be a shortest such path in the collaboration graph (there may be other paths of the same length). For example, using Dustin Hoffman as the root of the tree, I got:
      >>> person = "Cate Blanchett"
      >>> while person != "":
        print(person)
        person = tree[person]

      Cate Blanchett
      Kelly Macdonald
      Dustin Hoffman

      >>> person = "Leonardo DiCaprio"
      >>> while person != "":
        print(person)
        person = tree[person]

      Leonardo DiCaprio
      Kate Winslet
      Dustin Hoffman
This means that there is a path of length 2 from Cate Blanchett to Dustin Hoffman (and no shorter path) and also a path of length 2 from Leonardo DiCaprio to Dustin Hoffman. This gives us a path of length 4 from Cate Blanchett to Leonardo DiCaprio in the collaboration graph, but this is not the shortest such path.
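
As a rough sketch, the pseudocode and the path-following loop above might be written as below. The helper path_to_root is a name I made up; make_bfs_tree follows the signature given at the start of this step. The sketch assumes g maps each actor to a list of collaborators.

```python
def make_bfs_tree(g, root):
    """Return a dictionary mapping each reachable actor to its parent.

    The root's parent is the empty string, as described above.
    """
    tree = {root: ""}
    current_level = [root]
    while current_level:
        new_level = []
        for actor in current_level:
            for collaborator in g.get(actor, []):
                if collaborator not in tree:
                    tree[collaborator] = actor
                    new_level.append(collaborator)
        current_level = new_level
    return tree

def path_to_root(tree, person):
    """Follow parent links from person back to the root."""
    path = []
    while person != "":
        path.append(person)
        person = tree[person]
    return path
```

On a toy graph such as {"A": ["B"], "B": ["A", "C"], "C": ["B"], "Z": []}, the tree rooted at "A" contains A, B and C but not Z, and path_to_root(tree, "C") walks C, B, A.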

Build the breadth-first search tree rooted at one of the actors mentioned above. Use it to answer the following questions:

  1. Pick several other actors and print out the path from that actor to the root actor.

  2. What percentage of the total number of actors in the database are in this tree? (The answer is quite striking.)

  3. What is the length of the longest path in the tree? (To find this, you may need to revise your tree-making function ever so slightly so that it counts the number of levels, and prints out how many distinct levels it found.)

  4. Find an instance of a longest path in the tree. For instance, if you get the result that everyone on the tree is connected to Sandra Bullock by a path of at most 5, find an actor whose shortest path to Sandra Bullock has length 5, and print out the path.
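
Revising the tree-making function to count levels is one route. Another possibility, sketched here on a made-up tree, is to measure depths after the fact by following parent links. The tree, the total of 8 actors, and the function name depth are all invented for illustration.

```python
# Toy BFS tree in the collaborator:parent format, rooted at "A".
tree = {"A": "", "B": "A", "C": "A", "D": "B", "E": "D"}
total_actors = 8  # hypothetical size of the whole actor dictionary

def depth(tree, person):
    """Number of links between person and the root."""
    steps = 0
    while tree[person] != "":
        person = tree[person]
        steps += 1
    return steps

percent_in_tree = 100.0 * len(tree) / total_actors
longest = max(depth(tree, person) for person in tree)
farthest = [person for person in tree if depth(tree, person) == longest]
```

Recomputing the depth for every actor repeats a lot of work, so on the real database counting levels while the tree is built is likely to be faster.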

Step 5. Provide more information in the collaboration graph and tree

One defect in the approach outlined above is that although the collaboration graph contains the information that Amy Adams was in a film with Christian Bale, it doesn't include the information that she was in "American Hustle" and "The Fighter" with Christian Bale. The idea here is to include this somehow in the collaboration graph. Here you have some decisions to make. Should we include all the films the two actors have in common in the graph, so that an item might look like
      ("Amy Adams" : ("Christian Bale", ["The Fighter"; "American Hustle"])),
         
or should we just include the first one we find, so that the item might be
      ("Amy Adams" : ("Christian Bale", "The Fighter"))
         
I recommend the latter solution, of keeping only one film in common in the graph. Similarly, we might like to include this information in the breadth-first-search tree. A little piece of the enhanced collaboration graph is shown at the start of the assignment.

Write an improved version of the function that creates the neighbor graph so that it incorporates this information. Then, instead of writing a new version of the function that creates the tree rooted at a particular actor, write a function that takes as parameters the names of two actors and prints out a shortest path between them, with the names of the appropriate films. For example, calling this function with parameters "Meryl Streep" and "Don Cheadle" yields the following output:

      Don Cheadle
      Traffic
      Viola Davis
      The Help
      Allison Janney
      The Hours
      Meryl Streep
         
You will still use the breadth-first-search tree approach, but you do not need to build the whole tree. Just build it until you reach the destination actor.
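
One possible shape for this step is sketched below, under a few assumptions of mine: the enhanced graph is stored as a dictionary of dictionaries (actor mapped to collaborator mapped to one shared film), and the function names make_film_graph and shortest_path are placeholders. The search stops after the level on which the destination actor first appears, and the function returns the path as a list rather than printing it.

```python
def make_film_graph(d):
    """Map each actor to {collaborator: one film they share}.

    d maps film title -> list of actors. Only the first shared film
    found is kept, as recommended above.
    """
    g = {}
    for film, cast in d.items():
        for actor in cast:
            neighbors = g.setdefault(actor, {})
            for other in cast:
                if other != actor and other not in neighbors:
                    neighbors[other] = film
    return g

def shortest_path(g, root, target):
    """Return [target, film, actor, ..., film, root], or None if unreachable."""
    if root not in g or target not in g:
        return None
    parent = {root: None}  # actor -> (previous actor, shared film)
    current_level = [root]
    # Stop as soon as a level containing the target has been processed.
    while current_level and target not in parent:
        new_level = []
        for actor in current_level:
            for other, film in g[actor].items():
                if other not in parent:
                    parent[other] = (actor, film)
                    new_level.append(other)
        current_level = new_level
    if target not in parent:
        return None
    path = [target]
    person = target
    while parent[person] is not None:
        person, film = parent[person]
        path.extend([film, person])
    return path
```

To reproduce the output format shown above, print each element of the returned list on its own line.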

What to submit

You will hand in the functions that you write to create each of the dictionaries, the brief fragments of code to answer the questions above, and the answers to the questions. You should bundle this into a single function that calls the basic dictionary-creating functions and prints clearly labeled answers to the questions. This master function should have a single parameter s of type string for the name of the database file. (That way the graders can run your program on computers where the path name of the database file is different from what it is on your computer.)
Created on 01-19-2016 23:09.