Monday, October 3, 2016

Analysis and conversion of binary files with the use of PHP CLI

Last time I gave myself a mission to translate some game to Polish language. These aren't ordinary text files but instead custom scripts binary coded. At the bottom of file there is a lot of letters in a row. You can't practically tell which part of the text will be displayed in the exact moment in the game. By deep analysis of such files You can come to some conclusions. By viewing one of them in hexadecimal editor You can notice that some characters repeat over and over again. You can notice that they occur every 4 bytes, so these are most likely binary coded 32-bit variables (long int). However, as it is a script that hold a lot of numbers, then You can imply, that the big part of them will be of no use for You. Also by carefully looking at it You can notice repeating [FF] bytes one next to another. So these will be most likely numeric variables represented as signed type (positive & negative), because I doubt that the software the company used to edit these scripts required anyone to type 4 billion-like numbers, these were most likely negative numbers close to zero like -1 or -3.

Why would You need to know all of this though?
Such detailed knowledge about these files is needed to have full control over the length of text You wish to modify. As I previously wrote, such text doesn't have a visible split points, in which part of the game which part of the text is shown. So something else must used for game to tell the beginning at which to start loading the data and it's length.
Let's start from the beginning of file. You can see a DCPB tag there which identifies file type used by game. Next You can try to decode 4 bytes after that, because it may be used to represent an offset or certain data length in this file. So [BC] [BA] [01] [00] is: 188 + 186 * 256 + 1 * 256^2 + 0 * 256^3 (little endian) = 113 340. Now You can check the file size, it is 117 150 bytes. So exactly how I wrote, it must represent a length of some data in this file. So now You can check if something interesting is located at the decimal address of 113 348 in this file. [64] [00] [00] [00] is located there, in other words the value of 100. But what it actually means? An offset? A value? Something else? If this is an offset then for sure at the distance of +/- 100 bytes nothing interesting exists. Let's go to 4 bytes ahead then. [20] [03] [00] [00] is located there, in other words the value of 800. Let's see if something interesting exists 800 bytes ahead of this place. Bull's eye! By adding 800 to current offset, from which we read this value, first text character exists. So You can be almost 100% sure that this is the offset of first part of the text which will be displayed somewhere in the game. Now You can go 800 bytes back and read next 4 bytes as signed long int. [12] [00] [00] [00] is located there, the value of 18. Is it maybe 800 + 18 as a some kind of offset? Hmm. Let's check another 4 bytes ahead, [32] [03] [00] [00], the value of 818. It looks like that in exactly THIS PLACE the offset really is coded. So what exactly means the value of 18 we previously decoded? 800 + 18 = 818! If You assume that 800 and 818 were offsets, then what occurs after ever of them (in this case the value of 18) is most likely the length (the number of bytes) of the text that will be read by the game. Now You can count how many occurrences of such 8 byte ranges exist before You reach the place where the text resides. It looks like there is 100 occurrences of them, so 4 bytes coded before the first offset ([64] [00] [00] [00] = 100) most likely indicates the amount of lines of the text to translate.

Let's begin?
After getting more or less accustomed to file structure, which You are interested in, You can translating the text. But let's think for a bit. If You have 100 occurrences, then every of them needs to have offset and length properly aligned. Won't there be a problem if the first part of the text gets a bit longer than the original? It is known that other languages than the original, the game was transcribed in, may result in bigger amount of translation data. You would then need to manually alter every next offset! Huge amount of work and a lot of time required. Is there maybe some other way to do this?

PHP CLI scripts as a helpful tool to convert files to more user friendly version.
Easiest way is to "convert" the file in such a way that You will have the text exactly split line by line, You won't be bothered by some kind of offsets, and You can skip directly to translating the text. You must remember, though, that such file must be properly converted back to its previous format, including the changes to binary data that points the offsets and lengths of parts of the text/lines. You can attain much more when programming in C, for example by directly reading the file and dynamically creating text input boxes for text inputting/altering basing on original game script. However, it can take more time to do it, it all depends if You have parts of code to use from Your other projects, or not.

Analysis and conversion of files with the use of PHP scripts.
You can write PHP scripts by checking the manual and using plain notepad. First You would need to create an algorithm or have the ability to do it on-fly in Your head. You must have in mind the analysis You did before. Let's try to more or less describe how the file structure looks like:
String: "DCPB",
4 bytes of signed long int: data length, which we You won't modify,
data, which You won't modify,
4 bytes of signed long int: amount of text occurrences,
{4 bytes of signed long int: nth text offset, 4 bytes of signed long int: nth text length} * amount of occurrences,
text as a whole.

By looking at the above You would for sure need a loop/function to read 4 bytes and translate them to long int. For sure You will also need to read a parts of data which You won't modify, and as such, they would need to be coded in a way that it will be safe to text edit such file. I advice to use base64_encode, and to apply deflate compression before it:
This way You will achieve a file to edit which will be much smaller that the original and it won't hold a lot of numbers and letters You are not interested in at the beinning (base64).
When You will be in a place to read offsets and lengths of the text, then for sure You will need dynamic arrays, which will hold these offsets and lengths in 2 arrays (or 2-dimensional array under one variable, which one You prefer most). In PHP it's relatively easy, in C You would need to malloc/realloc a lot, and to remember to NULL byte at the end of array of characters etc. Dynamic arrays are best created here by loop repeating [text occurence amount] times by reading 4 bytes two times, decoding them and putting them as long ints in 2 array variables, for example:
for($x=0;$x<$amount;++$x) {
  $buf = fread($file1,4);
  $pos[] = (ord($buf[3]) << 24) + (ord($buf[2]) << 16) + (ord($buf[1]) << 8) + ord($buf[0]) + 12 + $rawdatalength;
  $buf = fread($file1,4);
  $len[] = (ord($buf[3]) << 24) + (ord($buf[2]) << 16) + (ord($buf[1]) << 8) + ord($buf[0]);
Of course You can read whole 4 bytes and use proper bitshifting or other methods. Whichever method suits You best, most important is to do have conversion working without errors. Code performance in usually required in production environment or when a lot of data is concerned.
After the loop finishes You need to start another one, which will use data from previously created arrays, properly read the amount of characters and output it in separate lines into text editing friendly file. So to be sure it's best that You use fseek for offsets and fread to read amount of characters, and as last fwrite. Example:
for($x=0;$x<$amount;++$x) {
  $text = fread($file1,$len[$x]);
  fwrite($file2,$text . PHP_EOL);
This is more or less everything You would need to do to create user friendly text editing file. You can of course put an amount of data You got from compressing and base64-coding the data. This way You will avoid the fgets error of 1024 characters limit if You will utilize fgets to read Your text files. In case of the data that You will base64 decode and decompress, You will use fread($file,$amount_of_bytes_to_read).

The converting back to binary script needs to be done in reverse order. For sure You would need to recreate "DCPB" tag and write 4 bytes that You will read from the length of Your decoded, decompressed data, write this data, recreate proper offset and lengths of lines which Your script will read from text file You used to safely translate the game text. After You finish writing the coding part You need to test the original coded file against Your translated one, if the differences between the two start after first offset pointing to the text and if all offsets and lengths properly point to the parts of translated text. If You can confirm that it's exactly as it should be then You succeeded in creating coder/decoder of script files which will make Your translation alteration much easier.

File used as an example:

No comments: