How do computers store, process and interpret data? In this post we will explore, through simple programs, how this happens.
Disclaimers and license
Moodle™ is a registered trademark of ‘Martin Dougiamas’ – moodle.com/trademarks.
Ubuntu® is a registered trademark of Canonical Ltd – www.ubuntu.com/legal/terms-and-policies/intellectual-property-policy
Other names and logos may be trademarks of their respective owners. Please review their websites for details.
I am independent of the organisations listed above and am in no way writing for, or endorsed by, them.
All code presented was written by me. Please feel free to copy and adapt it for ‘educational’ purposes only.
Before we start
To understand how data can be processed and interpreted by a computer, we really need to understand, at a basic level, what a CPU (Central Processing Unit) actually does (or at least one following the Von Neumann architecture – en.wikipedia.org/wiki/Von_Neumann_architecture). When running, it repeats the ‘instruction cycle’ over and over again – en.wikipedia.org/wiki/Instruction_cycle. In effect it fetches data and instructions (what to do with the data) from memory, processes them and puts the result back in memory.
Software written in a programming language is those ‘instructions’, and may contain data as well as instructions on where the data can be found or how it is inputted by the user. Because CPU instructions are primitive and, from a CPU’s point of view, just numbers, they are difficult for humans to understand and time-consuming to create. Thus there are different ‘levels’ of programming language, called ‘generations’. Generally, the higher the generation, the more powerful the individual syntax elements are and the easier they are for humans to understand.
In this post I will be using ‘C’, which is a third generation language – en.wikipedia.org/wiki/Third-generation_programming_language. This is translated by a ‘compiler’ into ‘assembler’ for the CPU you are using. Assembler is a second generation language that is assembled into a first generation language (in effect, binary numbers) by an ‘assembler’ to create ‘machine code’ – en.wikipedia.org/wiki/Machine_code. It is these binary numbers that are the ‘switch states’ that set up the CPU to perform a given action – an instruction – and this set-up is undertaken by ‘microcode’ – en.wikipedia.org/wiki/Microcode.
Thus instructions created by a software engineer to solve a given problem are performed by a CPU so that you, the user, get the job you want ‘done’.
‘C’ is powerful enough in its functionality and syntax that it is possible to illustrate stored data being interpreted in different ways, which is why I have chosen it for this post. It allows a given variable to be explicitly treated as different types at the same time, and lets us find out where it is stored in memory, along with its exact size in bytes.
‘C’ is also about as close to ‘assembler’ as you can get while still not being too hard to understand.
Imagine computer memory as boxes that are arranged in a given way, each storing one byte and each having a unique identifier – an address – just like the dwellings we live in.
With a ‘union’ in C, we can look at the same memory in two different ways: here, either as a number (an integer) or as a single character. The size of an integer can vary between CPUs and the operating systems running on them. Here I’ve shown it as four bytes (green boxes), capable of storing a number between −2147483648 and 2147483647, or between 0 and 4294967295 if unsigned (no indication of positive or negative). The yellow box is a single byte; for us it will be an ASCII character (UTF-8 characters can be from one to four bytes). Both the green and yellow boxes – the individual bytes – ‘map’ to the real memory. Each memory box, being a byte here, has an address. This address is represented as a number, from the CPU’s perspective again in binary, though for humans it is usually stated in hexadecimal (base 16).
Please see en.wikipedia.org/wiki/Memory_address, en.wikipedia.org/wiki/Physical_address and en.wikipedia.org/wiki/Bus_(computing)#Address_bus. This is why you see ‘32 bit’ and ‘64 bit’ in advertising for both CPUs and operating systems; one factor is the amount of memory the CPU can address – up to 4 gigabytes for 32 bit and up to 16 exabytes for 64 bit.
The first program ‘fab’ is:
It defines the ‘union’ I have described. The first part, such as ‘unsigned int’, is the ‘data type’ (en.wikipedia.org/wiki/C_data_types#Main_types), which states how the variable, in this case called ‘b’, should be represented. It is ‘unsigned’, so it will always be a non-negative number.
Then the ‘main’ function places the number 97 into the ‘integer’ representation of the memory. Ninety-seven is the ASCII (en.wikipedia.org/wiki/ASCII) decimal code for the character ‘a’. Then we loop round, printing the ‘char’ (character) representation of the memory with the function ‘printf’ and incrementing the integer representation by one each time with ‘fab.b++’. This should output ‘abc’.
On line 14, ‘printf("\n");’ says to output a new line.
From lines 16 to 21, we have another loop that places the numbers 102, 97 and 98 sequentially into the integer representation and, in the same way as before, prints the character representation. This should output ‘fab’.
Finally, on line 23, ‘return 0;’ exits the program with a return value (exit code) of ‘0’, which tells the operating system that the program ran correctly.
To show this, we first need to ‘compile’ our program, using a compiler to transform the human-readable ‘source code’ above into the ‘machine code’ that the CPU needs. For this I’m using the compiler ‘gcc’ (gcc.gnu.org) on my Raspberry Pi B+ called ‘Matilda’:
The first line, ‘gcc fab.c -o fab’, says to compile the source code file ‘fab.c’ into the runnable (executable) file ‘fab’. Then we have the rest as expected.
Our second program ‘fab2’ is:
where on lines 9 to 14 we find out how big the representations are and where they are stored in memory – their addresses.
From lines 16 to 28, it is as before. Lines 30 to 35 place a character in the memory address ‘pointed to’ by ‘bp’ and then increment ‘bp’ by one byte (it is an ‘unsigned char’ pointer, so each step is one byte). ‘unsigned char *bp = &fab.c;’ means that ‘bp’ is assigned the memory address of the variable ‘fab.c’. The actual character is determined by adding the loop variable ‘bpi’ to the number ninety-seven; therefore, if ‘b’ is 4 bytes, we will have ‘abcd’ in memory. The ‘printf’ function should tell us this on line 33, where ‘printf("%p has %c\n", bp, *bp);’ states that at address ‘bp’ we have the character pointed to by ‘bp’, thus:
Then, when the pointer ‘bp’ is incremented with ‘bp++’, we see the character ‘b’ at the next memory location, put there by ‘*bp = bpi + 97;’. This means: place at the memory address held in ‘bp’ the value of ‘bpi’ (being ‘1’ at this point) plus ‘97’, so ‘98’, which is the character ‘b’ in ASCII.
Don’t worry if this is complex, C pointers and their arithmetic can be quite difficult!
So, we then get the output:
We can see that our ‘union’ called ‘fab’ is as big as the biggest data type it contains, the variable ‘b’, and that both ‘b’ and ‘c’ are at the same memory location. Finally, we can see the ‘abcd’ in memory that I’ve already described.
If we now compile and run the same program on a 64-bit PC running ‘Ubuntu 19.10’ (a Debian-based Linux operating system, like ‘Raspbian’ on the Raspberry Pi), then we get:
So, different memory addresses but the same size of integer for the data type ‘unsigned int’.
Our third and last program ‘fab3.c’ is only slightly different to ‘fab2.c’:
In that, on line 4, the ‘unsigned int’ has changed to an ‘unsigned long’. This means that it could be bigger; however, on the Raspberry Pi I’m using it is not – it’s the same size as the ‘unsigned int’:
and yet on the Ubuntu 19.10 machine it is double the size:
and looking at en.wikipedia.org/wiki/C_data_types#Main_types again, we see that both sizes are permitted: the C standard only specifies minimum sizes, so ‘unsigned long’ can legitimately differ between platforms. This is one of the ‘headaches’ for software engineers: needing to support different CPUs / operating systems and still have reliable code. You will also notice that the program was already written in such a way as to cope with this without change, as ‘b’ now holds ‘abcdefgh’.
So, by one small change to a simple program, we can see the importance of accuracy and understanding in code to produce the intended result. It is perhaps the human condition – we are dynamic, adaptable and creative, with the skill to think and make software, and yet we do make mistakes – that explains why ‘bugs’ in software happen.
You will see with all three programs that data is being read, processed and stored; there is input and output. The ‘instruction cycle’ is happening between the CPU and memory. A CPU has hardware (the address and data lines) that selects memory addresses and sends the actual data to and from memory.
How computers process data is controlled by software. This gives us the ability to have a machine store all data as one thing – binary – and yet process it in many different ways according to the purpose it is needed for. This is why we have different file types, such as text (txt), image (jpg, png…), sound (mp3, flac…) and video (mp4, avi…): the data is all binary, but how it is processed is the key to making it useful.
But what about humans, how do we look at and perceive data? Can we go back and look at something from the past and look at it in a different way from a changed perspective?
When you’re writing, using or conducting a Moodle course, how can you look at the ‘data’ differently? Does that view allow you to make things better, or could it even make things worse?