Problem Set 6: Movies
Assigned: Monday March 14, 2016
Due: Thursday March 24, 2016, midnight
Points: 16

This is a pair project. Find a partner using the usual tools.

You can cut-and-paste the following code for creating the Film/Actors dictionary:
# build and return the dictionary whose keys are movie titles
# and values are lists of the movie casts as in the file 'mdb.txt'
# Note: mdb.txt was produced by combining lists of 500 top-grossing
# movies with 250 'top-rated' movies, and movies nominated for
# the Best Picture Oscar between 1984 and 2014. Some movies appear
# more than once, and
# the condition 'if not actor in titledict[title]' was placed to
# prevent the same actor being added twice to the cast list.
#
def makeFilmActorsDictionary():
    titledict = {}
    infile = open("mdb.txt", "r")
    masterlist = [nextline.strip() for nextline in infile]
    infile.close()
    j = 0
    while j < len(masterlist):
        title = masterlist[j]
        actor = masterlist[j + 1]
        if title in titledict:
            if not actor in titledict[title]:
                titledict[title].append(actor)
        else:
            titledict[title] = [actor]
        j = j + 2
    return titledict

This assignment uses dictionaries to explore a database
of movies and actors. The earlier, simpler, parts of the
assignment ask you to extract some basic information from the
dictionary, such as how many films in the database Bradley
Cooper appeared in. In the later sections you'll build the
collaboration graph -- a structure that links together every pair
of actors who appeared together in a film -- and determine things
like the shortest path in this graph connecting two actors.

Background. In the 1960s, a psychologist named Stanley Milgram conducted an experiment in which he gave subjects in Omaha, Nebraska, and Wichita, Kansas, the name and address of a person in Boston. The subjects were instructed to send this information by mail to someone they knew personally who might be closer to the target individual in Boston. The recipients of these letters were asked to do the same. The goal was to see how many links of acquaintanceship connected the original subjects to the Boston target.

This research penetrated into popular culture in a play by John Guare (later a movie) entitled Six Degrees of Separation. One character says that she read somewhere that there are "six degrees of separation" between any two individuals. It also showed up in two other places:

(a) in serious applied mathematics research into the structure of both man-made and natural networks. A paper by Duncan Watts and Steven Strogatz showed that, under certain hypotheses, natural and man-made networks exhibit a "small world property" in which every node of the network is connected to every other node by a very short path;

(b) in a trivia game invented by some college students who, after watching Footloose, decided that Kevin Bacon was the center of the movie universe. In the game, the challenger names an actor, and the player who is challenged has to connect the actor to Kevin Bacon in as few steps as possible.

To see how this works, go to Google and type in "Kevin Bacon number" followed by the name of an actor. If you complete all the steps of this assignment, your program will be able to play "Six Degrees of Kevin Bacon", although with a much smaller database of movies.

The Database. The database is in a text file called "mdb.txt". The information was extracted from web pages returned by the Internet Movie Database.
It consists of all the films nominated for the Academy Award for Best Picture from 1984 until 2014, all films in the 500 top-grossing films listed on IMDB, and the 250 top-rated films (based, I think, on ratings submitted by users of the database), all together with their complete cast. A selection of typical lines in the file is:

12 Years a Slave
Chiwetel Ejiofor
12 Years a Slave
Dwight Henry
12 Years a Slave
Dickie Gravois
12 Years a Slave
Bryan Batt
12 Years a Slave
Ashley Dyke

The entire file is formatted this way, with the title of a film on one line and an actor in the film on the next. This file is posted here. Download it.

Since the list was created by combining several different lists, the same film, with its entire cast list, may appear several times in the database. This problem is discussed briefly below. By the way, neither "Footloose" nor "Six Degrees of Separation" is in the database! Kevin Bacon is there, as well as Donald Sutherland and Will Smith, two of the stars of "Six Degrees".

Step 1. The title-cast dictionary.

The harness code includes code for making the Film/Cast map, which gives the cast list for every movie in the database. You should be able to understand the code and adapt it in later parts of the assignment. However, in this part you only have to use it to answer some questions. A typical dictionary entry looks like:

"The Crying Game" : ["Forest Whitaker"; "Miranda Richardson"; "Stephen Rea"; "Adrian Dunbar"; "Breffni McKenna"; "Joe Savino"; "Birdy Sweeney"; "Jaye Davidson"; "Andr\xc3\xa9e Bernard"; "Jim Broadbent"; "Ralph Brown"; "Tony Slattery"; "Jack Carr"; "Josephine White"; "Shar Campbell"; "Bryan Coleman"; "Ray De-Haan"; "David Crionelly"]

The complete cast lists for films tend to be long. You'll note that one of the actors' names in this entry contains a non-ASCII character, which shows up as escaped hex bytes: "\xc3\xa9" is the UTF-8 encoding of the "é" in "Andrée".
A more graphical representation of a small part of this dictionary is shown below:

Then answer the following questions:

Your submission should show the code that you wrote to find the answer as well as the answer itself. As with Assignment 5, these questions can often be answered using a single line of code if you take advantage of list comprehension.
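
As an illustration of the one-line list-comprehension style mentioned above, here is a minimal sketch. The dictionary below is a made-up stand-in for the one returned by makeFilmActorsDictionary(), and the helper name films_with is hypothetical, not part of the harness code.

```python
# Toy stand-in for the title -> cast dictionary; these titles and
# names are invented for illustration and are not from mdb.txt.
titledict = {
    "Film A": ["Actor One", "Actor Two"],
    "Film B": ["Actor Two", "Actor Three"],
    "Film C": ["Actor Two"],
}

def films_with(titledict, actor):
    """Return the sorted list of titles whose cast includes the actor."""
    return sorted([title for title in titledict if actor in titledict[title]])

# "How many films did X appear in?" is then the length of that list.
print(films_with(titledict, "Actor Two"))
print(len(films_with(titledict, "Actor Three")))
```

With the real database you would pass the dictionary built from "mdb.txt" instead of the toy one.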

Step 2. Build the actor-title dictionary

Write a function makeActorFilmsMap : string -> Map.t

The makeActorFilmsMap function returns the reversal of the dictionary in Step 1: here the keys are actors, and the value corresponding to a key is the list of films in which the actor appeared. Here is one entry.

"Meryl Streep", ["The Deer Hunter"; "Into the Woods"; "The Hours"; "It's Complicated"; "Lemony Snicket's A Series of Unfortunate Events"; "The Devil Wears Prada"; "Out of Africa"; "Mamma Mia!"]

A graphical representation would look just like the diagram above, but with the arrows pointing in the opposite direction.

There are two different ways to go about it:

You can write build_actor_dict with a single argument, the original title dictionary, and have it build the actor dictionary from the title dictionary. This uses more or less generic code for reversing a dictionary. It is a bit time-consuming, but it should finish running in under a minute.

Alternatively, you can write it with no arguments and build the actor dictionary directly from the text file, just as we built the title dictionary. Again, you will need to be careful about repeated entries.

Use the actor dictionary to answer the following questions:

Step 3. Build the collaboration graph

Create a dictionary whose keys are actors, and in which the value corresponding to a key A is the list of all the actors who have appeared in a movie with A. To do this, you need both the dictionaries created in the previous steps: the title-cast dictionary d from Step 1 and the actor-title dictionary a from Step 2. In outline, the procedure is this:

create a new empty dictionary g
for every actor A in the actor-title dictionary a:
    for every film M in a[A]:
        for every actor B in d[M]:
            if B is different from A and not already in g, set g[B] = [A]
            otherwise, if B is different from A and already in g, append A to g[B] (unless A is already in that list)

The value portions of the key-value pairs in this dictionary
tend to be very large: Owen Wilson has over 600 collaborators.
Here is a graphical representation of a very small portion of
the graph.
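
A minimal Python sketch of the Step 3 outline follows. It departs from the outline in two labeled ways: it collects each actor's collaborators in a set to avoid duplicates, rather than appending to lists, and it gives an actor with no co-stars an empty list instead of leaving that actor out of the graph. The function name and the toy data are assumptions for illustration.

```python
def build_collaboration_graph(titledict, actordict):
    """Map each actor to the sorted list of actors they shared a film with."""
    graph = {}
    for actor in actordict:
        collaborators = set()          # a set ignores repeated co-stars
        for film in actordict[actor]:
            for other in titledict[film]:
                if other != actor:
                    collaborators.add(other)
        graph[actor] = sorted(collaborators)
    return graph

# Toy dictionaries, not taken from mdb.txt.
toy_titles = {"Film A": ["X", "Y"], "Film B": ["Y", "Z"], "Film C": ["X", "Y"]}
toy_actors = {"X": ["Film A", "Film C"], "Y": ["Film A", "Film B", "Film C"],
              "Z": ["Film B"]}
toy_graph = build_collaboration_graph(toy_titles, toy_actors)
print(toy_graph["Y"])
```

Note that X and Y share two films in the toy data, yet each appears only once in the other's list.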

Step 4: Build the Breadth-First Search Collaborator Tree

Write a function make_bfs_tree(g,root) that takes as parameters the collaboration graph g from Step 3 and an actor root, and returns a dictionary having the following structure. Let's say the root is "Cate Blanchett". So one of the items in the dictionary is

("Cate Blanchett" : "")

That is, "Cate Blanchett" is the key and "" is the value. Let's call this Level 0. We then take each collaborator of Cate Blanchett, say for example, Leonardo DiCaprio, and add to the dictionary the key-value pair

("Leonardo DiCaprio" : "Cate Blanchett")

These pairs constitute Level 1 of the dictionary. To construct Level 2, we take all the collaborators of collaborators of Cate Blanchett who are not themselves already keys in an earlier level, and add new key-value pairs for them, for instance,

("Jonah Hill" : "Leonardo DiCaprio")

The way to do this is to keep track, at each step, of the list of actors on the current level. At the start, this list just consists of the root (Cate Blanchett in the example). We then repeatedly create the list of actors for the next generation: we look at each collaborator of each actor in the current level, and if that collaborator is not already in the dictionary, we add the collaborator to our new generation list and add the pair collaborator:actor to the dictionary.
Here is a little piece of the dictionary shown schematically, so
you can see why it is called a tree.

current level = [root]
while current level is not empty:
    new level = []
    for every actor in the current level
        for every collaborator of the actor
            if the collaborator is not already in the tree
                add collaborator:actor to the tree
                add collaborator to the new level
    set current level = new level

This algorithm is called breadth-first search: it finds all the actors at a distance of 1 from the root, then at a distance of 2, etc. For any actor in the tree we can follow the path from this actor back to the root, and this will be a shortest such path in the collaboration graph (although there may be other paths with the same length). For example, using Dustin Hoffman as the root of the tree, I got:

>>> person = "Cate Blanchett"
>>> while person != "":
...     print(person)
...     person = tree[person]
...
Cate Blanchett
Kelly Macdonald
Dustin Hoffman
>>> person = "Leonardo DiCaprio"
>>> while person != "":
...     print(person)
...     person = tree[person]
...
Leonardo DiCaprio
Kate Winslet
Dustin Hoffman

This means that there is a path of length 2 from Cate Blanchett to Dustin Hoffman (and no shorter path) and also a path of length 2 from Leonardo DiCaprio to Dustin Hoffman. This gives us a path of length 4 from Cate Blanchett to Leonardo DiCaprio in the collaboration graph, but this is not the shortest such path.

Build the breadth-first search tree rooted at one of the actors mentioned above. Use it to answer the following questions:
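
The level-by-level procedure described in this step translates fairly directly into Python. The sketch below assumes g maps every actor to a list of collaborators, as in Step 3. The helper path_to_root is a hypothetical convenience, not something the assignment requires, and the toy graph is invented.

```python
def make_bfs_tree(g, root):
    """Return a dict mapping each reachable actor to their parent in the tree."""
    tree = {root: ""}                  # Level 0: the root has no parent
    current_level = [root]
    while current_level:
        new_level = []
        for actor in current_level:
            for collaborator in g[actor]:
                if collaborator not in tree:
                    tree[collaborator] = actor
                    new_level.append(collaborator)
        current_level = new_level
    return tree

def path_to_root(tree, person):
    """Follow parent links from person back to the root."""
    path = []
    while person != "":
        path.append(person)
        person = tree[person]
    return path

# Toy graph: a chain A - B - C - D, plus an isolated actor E.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": []}
toy_tree = make_bfs_tree(toy_graph, "A")
print(path_to_root(toy_tree, "D"))
```

Actors who cannot be reached from the root (E in the toy graph) never become keys of the tree, which is one way to detect that no path exists.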

Step 5. Provide more information in the collaboration graph and tree

One defect in the approach outlined above is that although the collaboration graph contains the information that Amy Adams was in a film with Christian Bale, it doesn't include the information that she was in "American Hustle" and "The Fighter" with Christian Bale. The idea here is to include this somehow in the collaboration graph. Here you have some decisions to make. Should we include all the films the two actors have in common in the graph, so that an item might look like

("Amy Adams" : ("Christian Bale", ["The Fighter"; "American Hustle"])),

or should we just include the first one we find, so that the item might be

("Amy Adams" : ("Christian Bale", "The Fighter"))

I recommend the latter solution, of keeping only one film in common in the graph. Similarly, we might like to include this information in the breadth-first-search tree. A little piece of the enhanced collaboration graph is shown at the start of the assignment.

Write an improved version of the function that creates the neighbor graph so that it incorporates this information. Then, instead of writing a new version of the function that creates the tree rooted at a particular actor, write a function that takes as parameters the names of two actors and prints out a shortest path between them, with the names of the appropriate films. For example, calling this function with parameters "Meryl Streep" and "Don Cheadle" yields the following output:

Don Cheadle
Traffic
Viola Davis
The Help
Allison Janney
The Hours
Meryl Streep

You will still use the breadth-first-search tree approach, but you do not need to build the whole tree; just build it until you reach the destination actor.

What to submit

You will hand in the functions that you write to create each of the dictionaries, the brief fragments of code to answer the questions above, and the answers to the questions.
You should bundle this into a single function that calls the basic dictionary-creating functions and prints clearly-labeled answers to the questions. This master function should have a single parameter s of type string for the name of the database file. (That way the graders can run your program on computers where the path name of the database file is different from what it is on your computer.)
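
For Step 5, one possible sketch of the film-labeled graph and the early-stopping search is given below. Several choices here are assumptions rather than requirements: each actor's neighbors are stored as a dictionary from collaborator to one shared film (instead of the tuple notation shown in Step 5), the graph is built directly from the title dictionary, and the names build_labeled_graph and shortest_path are hypothetical. The toy data is invented.

```python
def build_labeled_graph(titledict):
    """Map each actor to {collaborator: one film the two share}."""
    graph = {}
    for film in titledict:
        cast = titledict[film]
        for actor in cast:
            links = graph.setdefault(actor, {})
            for other in cast:
                if other != actor and other not in links:
                    links[other] = film    # keep only the first film found
    return graph

def shortest_path(graph, source, target):
    """Return [target, film, actor, film, ..., source], or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    tree = {source: ("", "")}              # actor -> (parent, shared film)
    current_level = [source]
    # Stop expanding as soon as the level containing target is complete.
    while current_level and target not in tree:
        new_level = []
        for actor in current_level:
            for other in graph[actor]:
                if other not in tree:
                    tree[other] = (actor, graph[actor][other])
                    new_level.append(other)
        current_level = new_level
    if target not in tree:
        return None
    path = []
    person = target
    while person != source:
        parent, film = tree[person]
        path.extend([person, film])
        person = parent
    path.append(source)
    return path

# Toy data, not taken from mdb.txt.
toy_titles = {"Film A": ["P", "Q"], "Film B": ["Q", "R"],
              "Film C": ["R", "S"], "Film D": ["T"]}
toy_graph = build_labeled_graph(toy_titles)
for item in shortest_path(toy_graph, "P", "S"):
    print(item)
```

As in the Meryl Streep and Don Cheadle example, the search is rooted at the first actor, so the printed path starts at the second actor and ends at the first.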