Project Resources


The official deadline is the last day of classes, Thursday, May 4, at midnight.  I am willing to accept completed projects up through Tuesday, May 9, at noon, but my hope is that your work will be completed by the time classes have ended, to leave you time to study for exams in your other courses.  

Your submission should include both code and a write-up that addresses the following points:

I have not done all of these projects myself (although I am trying out a few more). 

1. Monoalphabetic Substitution Cipher

Here is a  handout I prepared a few years ago, outlining an approach to the problem.  Note that this requires you to build a fairly large corpus of 3- and 4-gram statistics to use as the basis of your scoring model. You should try to solve cryptograms in two different styles:  cryptograms in which the original spacing between words is present , and those in which it is absent.  Obviously the latter is harder.  The measure of success here is how long the ciphertexts need to be in order for the cryptogram to be solved successfully.

I will also give you some challenge cryptograms to solve, but of course you should generate your own while you are developing and testing the program.

2. Two- and three-time pad

If you have 20 or so ciphertexts all encrypted with the same key, then you can attack it the way we attacked the Vigenère cipher.  The trick is to decrypt when you have, say, no more than four texts.  If the texts are long enough, in principle this can be done with just two ciphertexts, but first try it with four or five.  I will provide you with sets of ciphertexts as challenges, but of course you can produce these on your own.  As in Project 1, you will need to build a  language model, only now it should be based on the complete ASCII text, with all printable characters.  Here you are not so much concerned with the probability that a given character follows a different prefix as with the overall frequency of the n-gram.

As to how to approach the problem:  Here are slides of a presentation given by Eric Filiol at the Black Hat conference in 2010 ( he reports that he developed the method in 1994).  Only Section 4 is relevant, and the attack is outlined in some detail.  In addition, here is a paper describing a more sophisticated approach to the same problem. An older paper describes a more ad hoc approach (and uses only the 26 letters and spaces).

4. Differential and Linear Cryptanalysis

I find that there is much more out there to read about differential cryptanalysis.  I will provide you with pages from two textbooks, and you can also look at this primer, which covers differential cryptanalysis in detail in the second part.  You will notice that all of this material works with a toy cipher; in the case of both the Heys paper and the Katz-Lindell book, it is a 4-round substitution-permutation network with a 16-bit block, and thus 5 round keys of 16 bits each.  You will need to implement such a toy example to work with.  (If I have time, I may implement it myself, and provide you with a web-based version where you can obtain the encryptions of chosen plaintexts without knowledge of the key.)

You also might want to check out this web-based tutorial.

5. GCM Mode

Start with the Wikipedia page and proceed to the linked documents that describe how this encryption mode is used. (Note the warning about a typo in one of the linked documents!) You'll notice that the encryption part is just the same old counter mode that we've already studied, so the real meat is understanding how the authentication tag is computed and verified.  There are two parts to the core project.  First, you have to understand enough about the structure of the field GF(2128) to explain (and implement) the multiplication by the hash key.  Then you should produce implementations, using only AES in ECB mode, of encryption/signing and decryption/verification.  Your results should match those you get from actual implementations (like pycrypto or pycryptodome, or openssl).

6. Bleichenbacher Padding Oracle Attack on RSA

Here is Bleichenbacher's paper.  The attack is nicely broken down into two phases on the cryptopals challenge website, which tells you what part of the paper to try to understand first.

https://cryptopals.com/sets/6/challenges/47

https://cryptopals.com/sets/6/challenges/48

 It's a sound accomplishment to get through the first challenge.  You can create your own `oracles', although it's less fun because you already know the secret key. As with the CBC padding oracle attack, I will provide an implementation of a padding oracle (and give you the public key) which you can query.  There will be two versions, one using a 256-bit key and the other a 512-bit key.

7. Algorithms for the Discrete Log

See the original description of this project.  The source material that I will provide you is in photocopied pages from several textbooks.  I will also give you a variety of challenges.  Each challenge will consist of a prime p, a primitive element g mod p, and a value ga mod p.  Your job, of course, is to determine a.

A real achievement here would be the implementation of the Index Calculus algorithm, but don't try to tackle that first.  Start out smaller, implementing the Pohlig-Hellman algorithm (which only works when p-1 factors into small prime factors) and either the Baby-Step, Giant-Step algorithm, or the small-space birthday attack.  How large a prime are you able to attack by these methods?

If you are successful in implementing the Index Calculus algorithm, you ought to be able to attack the discrete log problem for quite large primes.

8. Elliptic Curve Cryptography

See the original description of this project.  The source material that I will provide to you consists of photocopied pages from several textbooks. When you've completed the project, you should be able to explain what an elliptic curve is, what are the properties that make it suitable for cryptography, and how (for example) the Diffie-Hellman key exchange protocol is implemented with elliptic curves.  Provide example computations with artificially small parameters, and implement in code the cryptographic primitives for one of the standard elliptic curves used in cryptographic applications.

The underlying theorems that make this work--the fact that the points on an elliptic curve form a group, and that there is an adequate supply of points on an elliptic curve---are fairly difficult to prove, and you will notice that the proofs are left out of most crypto books.  However, in your writeup, you should state these theorems carefully and explain their significance.  Examples always help.

Your code should give the same results as canned elliptic curve primitives in openssl.  (I have not tried this out myself, but I think it should work.)

9. Meaningful Collisions in Hash Functions

There are two versions of this (the second is more interesting):  First suppose you are able to create collisions for two messages that begin with the same prefix, which you choose.  That is, you provide any P of your choosing, and the collision-finder finds two distinct suffixes S1 and S2 such that
H(P||S1)=H(P||S2),

and moreover, S1 and S2 have the same length.  There's then a little trick you can do where P is a fragment of, say, Python code ending with the line

suffix = "

You can then extend the two messages by appending a chunk of code K that determines which of the two suffixes was provided, and prints a different message in the two cases.  If it's all done correctly, the result is two Python programs P||S1||K and P||S2||K that hash to the same value but have different behavior.

The more interesting case is a stronger 'chosen-prefix' attack in which you get to specify two different prefixes P1, P2 and the collision-finder produces two suffixes S1, S2 such that
H(P1||S1)=H(P2||S2).

This can be used for all sorts of tricks, like producing two different JPEG files with the same hash, or producing documents predicting all possible outcomes of the 2017 American League pennant race, with all of the documents having the same hash value.  You can start here and follow the many links therein.

http://www.win.tue.nl/hashclash/Nostradamus/


But just where are you going to get the code for producing collisions? 

If you want collisions in real hash functions, I found some C programs on the Web that will generate the first (weaker) kind of chosen-prefix collisions in both MD4 and MD5, and I will point you to these.  (I first want to test them to make sure that they really do what they claim to be doing.) Code implementing the stronger kind of chosen-prefix collision in MD5 is available, but I believe it requires GPU resources to generate the collisions quickly enough for you to experiment with.  The author of this blog post

https://natmchugh.blogspot.com/2015/02/create-your-own-md5-collisions.html


created something on Amazon Web Services that you can use.  However, a reasonable alternative, as a proof of concept, is to use something small enough (like top 40 bits of SHA-1, as in our last homework assignment) to mount the attack, which is still interesting.


10. Rainbow Tables

I have some textbook pages describing the original method of Hellman---the rainbow tables themselves represent an improvement to the algorithm.  Start with the Wikipedia page, and then check out a few blog posts

https://stichintime.wordpress.com/2009/04/09/rainbow-tables-part-4-reduction-functions/

http://kestas.kuliukas.com/RainbowTables/

A real rainbow table is a huge file, but try creating a small proof-of-concept version, where the space of possible passwords consists of all sequences of 6 lower-case letters. (Maybe you can push this farther.)  One challenge appears to be the selection of the reduction functions.  Pay attention to these important parameters: What is the effort required to create the table? How big is the table?  How many hashes are performed during a typical search, and are the searches successful?  Keep in mind what these parameters are in two simple cases:  If we create a table containing the hashes of all N passwords, then the search time is constant, but the storage requirement is N entries, each consisting of the password and its hash value.  If  just do a brute-force search for the hash value, then we require almost no storage but perform about N/2  hashes.  So the rainbow-table approach represents a tradeoff between these two extremes.