The official deadline is the last day of classes, Thursday, May 4,
at midnight. I am willing to accept completed projects up
through Tuesday, May 9, at noon, but my hope is that your work
will be completed by the time classes have ended, to leave you
time to study for exams in your other courses.
Your submission should include both code and a write-up that
addresses the following points:
Here is a handout I
prepared a few years ago, outlining an approach to the
problem. Note that this requires you to build a fairly large
corpus of 3- and 4-gram statistics to use as the basis of your
scoring model. You should try to solve cryptograms in two
different styles: cryptograms in which the original spacing
between words is present , and those in which it is absent.
Obviously the latter is harder. The measure of success here
is how long the ciphertexts need to be in order for the cryptogram
to be solved successfully.
I will also give you some challenge cryptograms to solve, but of
course you should generate your own while you are developing and
testing the program.
If you have 20 or so ciphertexts all encrypted with the same key,
then you can attack it the way we attacked the Vigenère
cipher. The trick is to decrypt when you have, say, no more
than four texts. If the texts are long enough, in principle
this can be done with just two ciphertexts, but first try it with
four or five. I will provide you with sets of ciphertexts as
challenges, but of course you can produce these on your own.
As in Project 1, you will need to build a language model,
only now it should be based on the complete ASCII text, with all
printable characters. Here you are not so much concerned
with the probability that a given character follows a different
prefix as with the overall frequency of the n-gram.
As to how to approach the problem: Here are slides of a
presentation given by Eric Filiol at the Black Hat
conference in 2010 ( he reports that he developed the method in
1994). Only Section 4 is relevant,
and the attack is outlined in some detail. In addition, here is a paper describing a more
sophisticated approach to the same problem. An older paper describes a more ad hoc
approach (and uses only the 26 letters and spaces).
I find that there is much more out there to read about
differential cryptanalysis. I will provide you with pages
from two textbooks, and you can also look at
this primer, which covers differential cryptanalysis in
detail in the second part. You will notice that all of this
material works with a toy cipher; in the case of both the Heys
paper and the Katz-Lindell book, it is a 4-round
substitution-permutation network with a 16-bit block, and thus 5
round keys of 16 bits each. You will need to implement such
a toy example to work with. (If I have time, I may implement
it myself, and provide you with a web-based version where you can
obtain the encryptions of chosen plaintexts without knowledge of
the key.)
You also might want to check out this
web-based tutorial.
Start with the Wikipedia page and proceed to the linked documents
that describe how this encryption mode is used. (Note the warning
about a typo in one of the linked documents!) You'll notice that
the encryption part is just the same old counter mode that we've
already studied, so the real meat is understanding how the
authentication tag is computed and verified. There are two
parts to the core project. First, you have to understand
enough about the structure of the field GF(2128) to
explain (and implement) the multiplication by the hash key.
Then you should produce implementations, using only AES in ECB
mode, of encryption/signing and decryption/verification.
Your results should match those you get from actual
implementations (like pycrypto or pycryptodome, or openssl).
Here is Bleichenbacher's paper.
The attack is nicely broken down into two phases on the cryptopals
challenge website, which tells you what part of the paper to try
to understand first.
https://cryptopals.com/sets/6/challenges/47
https://cryptopals.com/sets/6/challenges/48
It's a sound accomplishment to get through the first
challenge. You can create your own `oracles', although it's
less fun because you already know the secret key. As with the CBC
padding oracle attack, I will provide an implementation of a
padding oracle (and give you the public key) which you can
query. There will be two versions, one using a 256-bit key
and the other a 512-bit key.
See the original description of this project. The source
material that I will provide you is in photocopied pages from
several textbooks. I will also give you a variety of
challenges. Each challenge will consist of a prime p,
a primitive element g mod p, and a value ga
mod p. Your job, of course, is to determine a.
A real achievement here would be the implementation of the Index
Calculus algorithm, but don't try to tackle that first.
Start out smaller, implementing the Pohlig-Hellman algorithm
(which only works when p-1 factors into small prime factors) and
either the Baby-Step, Giant-Step algorithm, or the small-space
birthday attack. How large a prime are you able to attack by
these methods?
If you are successful in implementing the Index Calculus
algorithm, you ought to be able to attack the discrete log problem
for quite large primes.
See the original description of this project. The source
material that I will provide to you consists of photocopied pages
from several textbooks. When you've completed the project, you
should be able to explain what an elliptic curve is, what are the
properties that make it suitable for cryptography, and how (for
example) the Diffie-Hellman key exchange protocol is implemented
with elliptic curves. Provide example computations with
artificially small parameters, and implement in code the
cryptographic primitives for one of the standard elliptic curves
used in cryptographic applications.
The underlying theorems that make this work--the fact that the
points on an elliptic curve form a group, and that there is an
adequate supply of points on an elliptic curve---are fairly
difficult to prove, and you will notice that the proofs are left
out of most crypto books. However, in your writeup, you
should state these theorems carefully and explain their
significance. Examples always help.
Your code should give the same results as canned elliptic curve
primitives in openssl. (I have not tried this out myself,
but I think it should work.)
There are two versions of this (the second is more
interesting): First suppose you are able to create
collisions for two messages that begin with the same prefix, which
you choose. That is, you provide any P of your choosing, and
the collision-finder finds two distinct suffixes S1 and S2 such
that
H(P||S1)=H(P||S2),
and moreover, S1 and S2 have the same length. There's then
a little trick you can do where P is a fragment of, say, Python
code ending with the line
suffix = "
You can then extend the two messages by appending a chunk of code
K that determines which of the two suffixes was provided, and
prints a different message in the two cases. If it's all
done correctly, the result is two Python programs P||S1||K and
P||S2||K that hash to the same value but have different behavior.
The more interesting case is a stronger 'chosen-prefix' attack in
which you get to specify two different prefixes P1, P2 and
the collision-finder produces two suffixes S1, S2 such that
H(P1||S1)=H(P2||S2).
This can be used for all sorts of tricks, like producing two
different JPEG files with the same hash, or producing documents
predicting all possible outcomes of the 2017 American League
pennant race, with all of the documents having the same hash
value. You can start here and follow the many links therein.
http://www.win.tue.nl/hashclash/Nostradamus/
I have some textbook pages describing the original method of
Hellman---the rainbow tables themselves represent an improvement
to the algorithm. Start with the Wikipedia page, and then
check out a few blog posts
https://stichintime.wordpress.com/2009/04/09/rainbow-tables-part-4-reduction-functions/
http://kestas.kuliukas.com/RainbowTables/
A real rainbow table is a huge file, but try creating a small
proof-of-concept version, where the space of possible passwords
consists of all sequences of 6 lower-case letters. (Maybe you can
push this farther.) One challenge appears to be the
selection of the reduction functions. Pay attention to these
important parameters: What is the effort required to create the
table? How big is the table? How many hashes are performed
during a typical search, and are the searches successful?
Keep in mind what these parameters are in two simple cases:
If we create a table containing the hashes of all N
passwords, then the search time is constant, but the storage
requirement is N entries, each consisting of the password
and its hash value. If just do a brute-force search
for the hash value, then we require almost no storage but perform
about N/2 hashes. So the rainbow-table
approach represents a tradeoff between these two extremes.