I've been thinking a lot about encryption recently and what makes it easy to crack (or not).
One thing which can make it easy to crack is knowing part of the "plaintext" - the thing that was being encrypted. If one knows a bit of the plaintext and one has the encrypted text, it may be possible to work out how to get from "plain" to "encrypted" - i.e., work out the key. And once you have the key, you can decrypt the rest of the message.
Mostly, this isn't that easy, as a modern cipher will operate on blocks of plaintext in such a way that you can't point at a single byte and say "this has been changed to this other byte, therefore the key must be XXX". But any bits you know of the plaintext might give you clues, reducing the number of possible keys from billions to thousands and thus helping a "brute force attack" immensely.
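As a toy illustration of known plaintext handing over the key (using a deliberately weak single-byte XOR "cipher" of my own, not any real algorithm):

```python
def xor_cipher(data: bytes, key: int) -> bytes:
    """'Encrypt' by XOR-ing every byte with a single-byte key."""
    return bytes(b ^ key for b in data)

secret_key = 0x5A                       # unknown to the attacker
ciphertext = xor_cipher(b"GET / HTTP/1.1", secret_key)

# The attacker guesses (correctly) that the request starts with "G".
# One known plaintext byte is enough to recover the whole key:
recovered_key = ciphertext[0] ^ ord("G")
assert recovered_key == secret_key

# ...and with the key, they can decrypt everything else.
assert xor_cipher(ciphertext, recovered_key) == b"GET / HTTP/1.1"
```

A real cipher doesn't fall over anywhere near this easily, of course, but the principle - known plaintext narrowing down the key - is the same.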
So if there are artefacts in the plaintext, it can potentially be broken more easily. Why does this even matter? Because data transmitted by computer has all kinds of artefacts in it.
Artefacts: The enemy of encryption
When you visit Google in your web browser, your computer makes a request and Google responds. As the request is transmitted from your computer, through the myriad devices of the internet, to eventually reach a Google server, it has various things added to it to tell Google's servers many things that they may find useful when responding: your browser type, whether your browser supports encryption, the kind of character set to use. And the first line of the request (and of the response) (almost) always contains "HTTP/1.1".
So all of this data in the request is predictable, if not completely static. When Google replies, the HTML it uses is also very static. If it is a public website to which the attacker has access, most of the page that they see will be identical to what you see. It might only be a few of the numbers on the page that are actually different between their visit to the site and yours. As much as 90% of the plaintext being encrypted and sent to you is therefore known to the attacker. That's a whole lot of artefacts.
Finding artefacts is the basis of the attack on W.E.P. (Wired Equivalent Privacy): it was found that a wireless access point encrypting with WEP always responded with a similar message when a particular kind of duff request was presented to it. A carefully written "attack program" would fire millions of messages at an access point in a minute or so and analyse the results. Analysing those results, and knowing what the plaintext was, the "attack program" was able to very quickly present you with the WEP key. Hey presto, broken wireless.
Digital TV Pay-per-view is encrypted. The Wikipedia article makes some interesting points on how to brute-force attack the key, and also how to use "rainbow tables" to speed up the brute force:
So, in the case of an HTML document, if we assume that most of the plaintext is known, how can we protect the unknown, sensitive bits - and our encryption key?
This is where some simple file scrambling can come into its own: an attacker may know a lot of your insensitive plaintext and will have captured the encrypted text, but if they don't know where in the file or network packet each is, this knowledge can no longer help them recover the key or the sensitive parts of the data.
Scrambling the data to remove artefacts
What do we mean by scramble?
Very simply, to randomly move characters around so that a predictably laid out file with predictable contents appears to be a random stream of bytes.
Let's roughly scramble some short plaintext:
So this hasn't been encrypted, it's just been randomised. We need to be able to un-randomise it to get back to our plaintext, so we need to know how the letters have moved. We'll put some numbers on the front to specify where each letter came from and is now.
There are 7 letters, so we'll need 7 numbers on the front. The position of each number refers to the position of the letter we moved. So the first number refers to the first letter, the second to the second, and so on. The actual number itself refers to where that letter has moved to. So if the first number is 3, it means the first letter is now in position 3.
Let's see how that looks with the above:
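As a rough sketch, here's how that scramble-and-record scheme might look in Python. It uses a 0-based list of positions rather than digits stuck on the front of a string, and "EMPEROR" is just a stand-in seven-letter plaintext of my own:

```python
import random

def scramble(plaintext: str) -> tuple[list[int], str]:
    """Move each character to a random new position; return where each
    character went, plus the shuffled text."""
    n = len(plaintext)
    destinations = random.sample(range(n), n)   # destinations[i] = new home of char i
    shuffled = [""] * n
    for i, ch in enumerate(plaintext):
        shuffled[destinations[i]] = ch
    return destinations, "".join(shuffled)

def unscramble(destinations: list[int], shuffled: str) -> str:
    """Undo the shuffle using the recorded positions."""
    return "".join(shuffled[d] for d in destinations)

destinations, mixed = scramble("EMPEROR")       # stand-in 7-letter plaintext
assert unscramble(destinations, mixed) == "EMPEROR"
```

The destinations list plays the part of the numbers on the front: it's random, so it carries no artefact of its own, but it's everything you need to undo the shuffle.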
So we have some random numbers, which can't be known, sandwiched onto the front of some plaintext in which one or more of the letters may have been previously guessed.
But now our attacker doesn't know where those guessed letters are. Let's encrypt. Just for demonstration purposes, we'll use a Caesar Cipher. Here's the translation table:
So our new ciphertext, based on our scrambled plaintext, is:
Let's assume the attacker somehow knows the first and last letters of the plaintext are "E" and "R", respectively.
Trying to use this knowledge, we fit the plaintext E to the ciphertext G and the R to the O. This would give us a key of 2 in the first case and -3 in the second, which can't be right! So the attacker is denied using knowledge of the plaintext to find the key.
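To make that concrete, here's a toy version in Python. The plaintext ("EMPEROR"), the shuffle, and the shift of 2 are stand-ins of my own, but the effect is the same: the "keys" inferred from the known first and last letters contradict each other:

```python
def caesar(text: str, shift: int) -> str:
    """Shift each letter A-Z forward by `shift`, wrapping around."""
    return "".join(chr((ord(c) - ord("A") + shift) % 26 + ord("A")) for c in text)

scrambled_plain = "RPMOEER"              # "EMPEROR" after one possible shuffle
ciphertext = caesar(scrambled_plain, 2)  # real key is a shift of 2

# The attacker knows the plaintext starts with "E" and ends with "R",
# and infers a candidate key from each position:
first_shift = (ord(ciphertext[0]) - ord("E")) % 26
last_shift = (ord(ciphertext[-1]) - ord("R")) % 26
assert first_shift != last_shift         # contradictory "keys" - the attack fails
```

(A Caesar cipher this weak would still fall to frequency analysis, of course - the point here is only that positional known-plaintext knowledge has been neutralised.)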
Earlier, we talked of HTML. If an HTML document had been scrambled and then encrypted, it would not be possible to say that the first four letters of the document before encryption were "HTTP". The attacker may know up to 90% of the plaintext, but from their point of view, what was actually given to the encryption algorithm is effectively gibberish.
The artefacts are gone.
A potential file format for artefact-free scrambling
How would we scramble a real-world file without leaving any artefacts for an attacker to use? We surely can't go adding a byte for each byte in the file! The result would be huge and would need compressing, and compression leaves artefacts in the file which can be used to recover the key - the same kind of artefacts we are trying to be rid of by scrambling.
Rather than working on bytes, it should operate on blocks. The blocks should be a random size - the size is picked at scramble time. To pick a rule at random, we'll make the block size 1% of the file size plus (randomly) up to a further 2%. That puts the block size between 1% and 3% of the file, giving us between about 33 and 100 blocks. The block size will need to be recorded in our header. That's ok, because it is a random number, unknown to our attacker, so it doesn't create an artefact.
A file is unlikely to be exactly the right number of bytes in length to divide into an equal number of blocks, so the last bit of the file will be padded with random garbage. Rather than recording the original file size, which may be known to an attacker and would thus give us an artefact, we record the number of bytes that were added as padding. This follows from the random block size we picked earlier, so it is again unknown to our attacker, but it gives us enough information with which to reconstruct the original file.
Then, all the random "this block moved to X" numbers are generated. Each number will be stored in 2 or 4 or 8 bytes, depending on how large the file (or, indeed, network packet) is. The numbers are placed on the front, followed by the blocks in their scrambled order.
So the final scrambled file would look like this:
(Block size)(Padding size)[list of random numbers][blocks in random number order...]
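As a sketch of that format in Python (my own untested interpretation, with fixed 4-byte numbers rather than the 2/4/8-byte variants described above):

```python
import os
import random
import struct

def scramble_file_bytes(data: bytes) -> bytes:
    """Emit (block size)(padding size)[destination list][blocks, shuffled]."""
    pct = 0.01 + random.random() * 0.02           # 1% + up to a further 2%
    block_size = max(1, int(len(data) * pct))
    padding = (-len(data)) % block_size
    data += os.urandom(padding)                    # pad tail with random garbage
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    dest = random.sample(range(len(blocks)), len(blocks))  # dest[i] = where block i goes
    shuffled = [b""] * len(blocks)
    for i, block in enumerate(blocks):
        shuffled[dest[i]] = block
    header = struct.pack("<II", block_size, padding)
    header += b"".join(struct.pack("<I", d) for d in dest)
    return header + b"".join(shuffled)

def unscramble_file_bytes(scrambled: bytes) -> bytes:
    """Read the header, restore block order, strip the padding."""
    block_size, padding = struct.unpack_from("<II", scrambled)
    n_blocks = (len(scrambled) - 8) // (4 + block_size)
    dest = struct.unpack_from(f"<{n_blocks}I", scrambled, 8)
    body = scrambled[8 + 4 * n_blocks:]
    blocks = [body[i:i + block_size] for i in range(0, len(body), block_size)]
    data = b"".join(blocks[d] for d in dest)
    return data[:len(data) - padding] if padding else data
```

The output of scramble_file_bytes() is what would then be handed to the real encryption algorithm; decryption would run unscramble_file_bytes() on the result.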
None of the numbers on the front can be predicted and none of the data later on is in the right place, so, if this were then encrypted by any number of (hopefully) secure encryption algorithms an attacker now doesn't know which bits of cipher-text to compare with which bits of plain-text.
Not only is the key far harder to recover now, but brute-forcing becomes a more expensive operation: every time a key is tested out of the (say) 50 billion possible keys/passwords, the result has to be tested by unscrambling. That means extra clock cycles, slowing the whole thing down. If testing a single password/key would normally take, say, 10ms, and unscrambling takes, say, 1ms, that is an extra 10% effort. If testing all possible keys takes 10 days, that's an extra day.
One day I'll write a proof-of-concept of this scrambling technique, and maybe even prove that scrambling before encryption is a worthy thing to do. If it is worthy, this simple technique could be transparently built into many commonly used consumer encryption technologies: SSL, GPG, IPsec, etc. could all scramble before encrypting, boosting the average user's protection on the net a million-fold.
Here's a good graphical example of how different encryption algorithms work, and how, unless you're careful, the encrypted result can still show major details of what was encrypted: http://pthree.org/2012/02/17/ecb-vs-cbc-encryption/
If the image files had first been scrambled before encryption, both algorithms demonstrated would have output complete gibberish.