At the conclusion of the course the student will be able to:
Install the tools necessary to compile and run assembly language code.
Identify the three sections of an assembly program.
Identify the five logical instructions that assembly recognizes.
Identify what recursion is.
Understand the basics of memory management.
Assembly, or asm, is a catch-all term for low-level programming languages that map closely to hardware machine code instructions. An assembly language is specific to a particular computer architecture and operating system. Assembly is generally not portable: it does not share the ease with which high-level programming languages can be moved across architectures thanks to interpreters and compilers.
Assembly is made into something that can be executed with a tool known as an assembler. We will be using nasm for the purposes of this instructional material.
Assembly language requires an assembler. This is an application that turns your assembly code into something that can be executed. We will be using nasm because it is easy to install, readily available, and runs on Linux.
Check whether nasm is already installed with
$ whereis nasm
We can also run
$ which nasm
If whereis prints only
nasm:
with no path after it, then you need to install it using the command
$ sudo pacman -S nasm
or the equivalent installation command for your distribution.
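The same check can be scripted. The sketch below assumes Python 3 is available and uses shutil.which from the standard library, which does a PATH lookup similar to the which command. The helper name tool_path is made up for this illustration.

```python
import shutil


def tool_path(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


# Report on the assembler and the linker used later in this material
for tool in ("nasm", "ld"):
    path = tool_path(tool)
    if path is None:
        print(f"{tool}: not found, install it with your package manager")
    else:
        print(f"{tool}: {path}")
```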
Once you have installed nasm we can begin writing and building executables from assembly.
Assembly programs are generally divided into three sections. These three sections consist of the data section, the bss section, and the text section.
The data section holds your constants and the initialized data that you have declared. This data has a fixed size and a known initial value when the program starts, which makes it useful for holding things like strings.
You declare the data section using the following syntax.
section .data
    msg db 'Hello World!', 0xa   ;string
    len equ $ - msg              ;length of the string
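The expression $ - msg can look opaque at first. As a rough illustration in Python (not something the assembler runs), the sketch below reproduces the arithmetic: $ is the current position after the string, msg is where the string starts, and the difference is the string's length. The variable names msg_start and current_pos are invented for this example.

```python
# Bytes that "msg db 'Hello World!', 0xa" places in the data section
msg = b"Hello World!" + bytes([0x0A])

msg_start = 0                        # hypothetical offset of the label msg
current_pos = msg_start + len(msg)   # what $ evaluates to on the following line

length = current_pos - msg_start     # len equ $ - msg
print(length)  # 13
```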
The bss section has no content. It is uninitialized data. This contains the information necessary for the loader to preallocate memory space when starting a program. At execution this will normally contain all 0s and is devoid of useful information until data is written to those variables. You can use a debugger to review the contents of this memory as the system runs the program.
Think of it like taking cardboard, folding a box, putting a name on the box, and declaring ‘this box holds 4 glass jars’. You are preparing the object to hold things, but it is empty until something is put in it.
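The box analogy can be sketched in Python. This is only an analogy, assuming a reservation of 4 one-byte slots (in NASM this kind of reservation is typically written with a directive such as resb): the space exists and reads as zeros, but holds nothing useful until the program writes to it. The name jars is made up for the illustration.

```python
# Reserve room for 4 one-byte items; nothing meaningful is stored yet
jars = bytearray(4)
print(list(jars))  # [0, 0, 0, 0]

# Only at runtime does the program put something in the box
jars[0] = 0x41
print(list(jars))  # [65, 0, 0, 0]
```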
The text section will house your actual code. This section must begin with the declaration global _start, which exports the _start label so the linker can mark it as the place where program execution begins. This is similar to the main() function found in languages like C or C++.
section .text
    global _start
_start:
Assembly supports inline and standalone comments. A comment supports all printable characters and also allows for blank lines. We use comments to bring attention to code segments when the code itself is not self-explanatory. There are many schools of thought on how comments should be used. I, the author, believe that code should be easy to read and indicative of what it is doing, and that comments should be used sparingly. Others believe that comments should be used liberally. You as a coder should make that decision, but writing well-formed and self-documenting code should be your priority regardless of the methodology chosen.
;
;
; This is an example comment.
;
;
msg db 'Hello World!', 0xa ; This is also an inline comment.
asm Hello World
section .text
    global _start       ;must be declared for linker (ld)

_start:                 ;tell linker entry point
    mov edx,len         ;message length
    mov ecx,msg         ;message to write
    mov ebx,1           ;file descriptor (stdout)
    mov eax,4           ;system call number (sys_write)
    int 0x80            ;call kernel

    mov eax,1           ;system call number (sys_exit)
    int 0x80            ;call kernel

section .data
    msg db 'Hello, world!',0xa  ;our dear string
    len equ $ - msg             ;length of our dear string
$ nasm -f elf64 hello.asm
$ ld -m elf_x86_64 -s hello.o -o hello
$ ./hello
        .global _start

        .text
_start:
        # write(1, message, 13)
        mov     $4, %eax            # system call 4 is write
        mov     $1, %ebx            # file handle 1 is stdout
        mov     $message, %ecx      # address of string to output
        mov     $13, %edx           # number of bytes to write
        int     $0x80               # invoke operating system code

        # exit(0)
        mov     $1, %eax            # system call 1 is exit
        xor     %ebx, %ebx          # we want return code 0
        int     $0x80               # invoke operating system code
message:
        .ascii  "Hello, World\n"
$ gcc -c hello.s
$ ld hello.o -o hello
$ ./hello
- 32 bit General Registers: EAX,EBX,ECX,EDX
- int is an interrupt. int 0x80 traps into the kernel, which performs the action whose number is held in eax. The action is known as a ‘system call’.
- System call 1 (sys_exit) makes a program exit
- System call 4 (sys_write) writes data to a file descriptor, which is how a program prints
- 64 bit General Registers: RAX, RDI, RSI, RDX
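For comparison, high-level languages reach the same kernel service through library wrappers. The sketch below assumes a Unix-like system and uses Python's os.write, which passes a file descriptor and a buffer to the kernel's write call and returns the number of bytes written. The wrapper name sys_write is made up here to mirror the comments in the assembly listing.

```python
import os

STDOUT = 1  # file descriptor 1 is stdout, as in "mov ebx,1"


def sys_write(fd, data):
    """Ask the kernel to write bytes to a file descriptor; return bytes written."""
    return os.write(fd, data)


message = b"Hello, world!\n"
written = sys_write(STDOUT, message)  # the length plays the role of edx
```

The return value corresponds to what the kernel hands back to the assembly program in eax after the interrupt.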
Intel syntax: mov eax, 1 (Instruction, Destination, Source)
AT&T syntax: mov $4, %eax (Instruction, Source, Destination)
Intel syntax uses inference to decide the amount of data that is moved, and the addressing mode comes from the operands themselves. AT&T syntax supports suffixes at the end of the instruction mnemonic to signify the size of the data. This is not mandatory. The real explicitness of AT&T syntax comes from the use of the $ and % prefixes. $ means immediate addressing: an operand written as $1 is the value 1 itself. Without the $ it would fetch the value found at memory address 1. The % means use the register, and makes sure the assembler does not treat the name as a symbol (a labeled memory location). The size suffixes are:
- l - long 32 bits
- w - word 16 bits
- q - quad-word 64 bits
- b - single byte 8 bits
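To make the suffix sizes concrete, the following Python sketch (an illustration, not assembler output) shows how much of a 64-bit value fits in an operand of each size by keeping only the low 8, 16, 32, or 64 bits. The names SUFFIX_BITS and truncate are invented for this example.

```python
# Operand widths in bits for the AT&T size suffixes
SUFFIX_BITS = {"b": 8, "w": 16, "l": 32, "q": 64}


def truncate(value, suffix):
    """Keep only the low bits that fit an operand of the given suffix."""
    bits = SUFFIX_BITS[suffix]
    return value & ((1 << bits) - 1)


value = 0x1122334455667788
for suffix in "bwlq":
    print(suffix, hex(truncate(value, suffix)))
```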
Data must be stored in memory and accessed as necessary by the processor. Reading data from and storing data inside memory is a slow process, relatively speaking. The data has to be moved across the bus into memory and then traverse the same bus in the opposite direction when it is needed again. Moving information between RAM and the CPU is slow.
Registers exist to speed up processor operations by making memory storage allocation available within the cpu itself. Registers store data elements for processing without ever having to traverse the bus and access memory. This means we have a limited number of spots where we can store information within the cpu itself.
64-bit architecture enjoys the benefits of a large number of registers. However, it is wise to remember that IA-64 assembly language (Intel's Itanium architecture, which is distinct from x86-64) was deliberately designed with the intention that compilers would write the majority of code and that humans would interact with assembly little, if at all. This is important to remember. In modern systems, compilers and high-level languages do an excellent job of taking advantage of hardware and can easily perform a great deal of optimization without the user needing to get involved.
AND is used for supporting logical expressions by performing bitwise AND operations. This operation will return 1 if the matching bits from both operands are 1, else it returns 0. Does this sound similar to a logic gate? It should as they operate the same.
0 AND 0 = 0
0 AND 1 = 0
1 AND 0 = 0
1 AND 1 = 1
OR is used for setting one or more bits. The bitwise OR operator will return 1 if the matching bits from either or both operands are one. It returns 0 if both bits are zero.
0 OR 0 = 0
1 OR 0 = 1
0 OR 1 = 1
1 OR 1 = 1
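XOR returns 1 when exactly one of the two matching bits is 1 (that is, when the number of true inputs is odd) and 0 when the bits are equal. This can also be used to clear a register.
0 XOR 0 = 0
0 XOR 1 = 1
1 XOR 0 = 1
1 XOR 1 = 0
Clearing a register would look like
xor eax, eax
where eax stands for any register, since any value XORed with itself is 0.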
TEST performs the same bitwise AND as the AND instruction, but it discards the result and only updates the flags, so neither operand changes. This allows you to find out if a number is even or odd (by testing its lowest bit) without changing the original number.
NOT will reverse the bits of an operand.
NOT 0 = 1
NOT 1 = 0
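Python's bitwise operators behave like these five instructions, which makes it easy to experiment before writing assembly. The sketch below assumes 8-bit values (hence the mask applied after NOT, because Python integers are not fixed width). The helper low_bit_set is a made-up name that imitates how TEST is used for an even/odd check.

```python
MASK = 0xFF  # pretend we are working with 8-bit values

a = 0b1100
b = 0b1010

print(format(a & b, "04b"))       # AND: 1000
print(format(a | b, "04b"))       # OR:  1110
print(format(a ^ b, "04b"))       # XOR: 0110
print(format(a ^ a, "04b"))       # XOR with itself clears: 0000
print(format(~a & MASK, "08b"))   # NOT: 11110011


def low_bit_set(n):
    """TEST-style check: AND with 1 and look at the result, leaving n untouched."""
    return (n & 1) != 0


print(low_bit_set(7), low_bit_set(10))  # True False (7 is odd, 10 is even)
```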
Recursion exists in two forms. Direct recursion occurs when a function calls itself. Indirect recursion occurs when a function calls a second function that in turn calls the first. Python performs direct recursion very elegantly. Consider the following bit of code for finding the factorial of a number.
def fact(a):
    # Base case: 0! and 1! are both 1, which also stops the recursion
    if a <= 1:
        return 1
    else:
        return a * fact(a - 1)
If you are not familiar with finding the factorial of a number it works like this.
print(fact(4))  # 4 * 3 * 2 * 1 = 24
Recursion is an elegant and simple method of allowing you to conduct repeated operations with specifically defined rules.
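The factorial function is an example of direct recursion. For the indirect form, a common teaching example is a pair of functions that call each other. The sketch below is one such illustration, assuming non-negative integers as input. The function names is_even and is_odd are chosen for this example only.

```python
def is_even(n):
    """Return True if the non-negative integer n is even."""
    if n == 0:
        return True
    return is_odd(n - 1)   # hands the problem to the other function


def is_odd(n):
    """Return True if the non-negative integer n is odd."""
    if n == 0:
        return False
    return is_even(n - 1)  # which hands it back again


print(is_even(10))  # True
print(is_odd(7))    # True
```

Neither function calls itself directly, yet each call chain keeps returning to the function that started it until n reaches 0.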
Pointers, Virtual Memory, and Physical Addresses are all important concepts to begin mastering. In assembly language, we allocate space and then fill it with a string. You can see the comparison between assembly and something like python very easily in regards to handling a string and printing it.
Assembly - hello.asm
; Define variables in the data section
section .data
hello:     db 'Hello world!',10
helloLen:  equ $-hello

; Code goes in the text section
section .text
    global _start

_start:
    mov eax,4            ; 'write' system call = 4
    mov ebx,1            ; file descriptor 1 = STDOUT
    mov ecx,hello        ; string to write
    mov edx,helloLen     ; length of string to write
    int 80h              ; call the kernel

    ; Terminate program
    mov eax,1            ; 'exit' system call
    mov ebx,0            ; exit with error code 0
    int 80h              ; call the kernel
You must then assemble and link.
$ nasm -f elf64 hello.asm
$ ld hello.o -o hello
$ ./hello
Python - hello.py
hello = 'Hello World!' print(hello)
Running is trivial.
$ python hello.py
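Python normally hides the allocate-then-fill step, but the standard ctypes module can expose it. The sketch below is only an illustration of the idea, not how print works internally: it reserves a zero-filled buffer, reports its address (a virtual address inside this process), copies the string into it, and reads it back through that address. The variable names are made up for the example.

```python
import ctypes

text = b"Hello world!\n"

# Allocate space first: a zero-filled buffer big enough for the string
buf = ctypes.create_string_buffer(len(text))
address = ctypes.addressof(buf)   # where the buffer lives in this process
print(hex(address), buf.raw)

# Then fill it with the string
ctypes.memmove(buf, text, len(text))

# Read the bytes back through the address, as a pointer would
print(ctypes.string_at(address, len(text)))
```

The printed address will differ from run to run, which is a small reminder that programs deal in virtual addresses handed out by the operating system.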
GDB, the GNU Project Debugger, allows you to find out what is happening inside a program as it executes. GDB is generally used for catching issues or bugs in software. GDB is capable of doing four main things:
Start a program and specify anything that could change the behavior of that program.
Make a program stop on demand or conditionally.
Examine what happened to a program when it has stopped.
Change things in the program, which allows experimentation.
GDB supports a plethora of languages including:
- and more …
We can install GDB using
$ sudo pacman -S gdb
or use the appropriate package manager for your distribution. The AUR package python-dbg must also be installed if you plan to use gdb in conjunction with Python. This will replace your Python with one that includes the debugging hooks. Sometimes this breaks things. Don’t hesitate to use a virtual environment to help with this.
Python developers can also use pdb.
NASM, the Netwide Assembler, can be installed in Arch derivatives with
$ sudo pacman -S nasm
An assembly program consists of the data, bss, and text sections. Assembly also supports comments.
AND, OR, XOR, TEST, and NOT are the five logical instructions that assembly recognizes.
Recursion is the ability for a procedure to call itself. There are two forms of recursion and they are direct and indirect.
Assembly is very low level. You must allocate and deallocate memory as appropriate.
You cannot begin the process of disassembly and review of software if you do not understand how software functions. You must master the basics of computing if you plan to move forward. Assembly language is extremely low level and gives the user an excellent idea of exactly what is occurring within the computer.
Understanding how assembly language works will give you the knowledge necessary to begin learning concepts such as coding, reverse engineering, and malware analysis.
Assembly programming is not necessary to write code. There are several reasons for understanding assembly anyway. It forces you to gain a greater understanding of your hardware and architecture, and it can help you work out what an application is doing when it appears to be coded correctly but still misbehaves. You as a developer or researcher would be well served by taking the time to familiarize yourself enough to be able to read the Intel Processor Manual.
- Use Linux.
- Understand your hardware and software.
- Don’t be afraid to get your hands dirty.
- Read the book Programming from the Ground Up.