Introduction To Assembly Language

Performance Objective

At the conclusion of the course the student will be able to:

Install the tools necessary to compile and run assembly language code.
Identify the three sections of an assembly program.
Identify the five logical instructions that assembly recognizes.
Identify what recursion is.
Understand the basics of memory management.

Introduction

Assembly, or asm, is a catch-all term for low-level programming languages that map closely to hardware machine code instructions. An assembly language is specific to a particular computer architecture, and assembly programs are usually tied to a particular operating system as well. Assembly is generally not portable: it lacks the ease with which high-level languages can be moved across architectures by interpreters and compilers.

Assembly is turned into something executable by a tool known as an assembler. We will be using nasm for the purposes of this instructional material.

Installation

Assembly language requires an assembler. This is an application that turns your assembly code into something that can be executed. We will be using nasm because it is easy to install, readily available, and runs on Linux.

Check whether nasm is already installed with

$ whereis nasm

We can also run

$ which nasm

If whereis prints only nasm: with no path after it (or which finds no path), nasm is not installed. Install it with sudo pacman -S nasm or the equivalent installation command for your distribution.
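The same check can be scripted. The following is a minimal sketch in Python, assuming only the standard library; the helper name find_tool is hypothetical and chosen for illustration.

```python
import shutil

def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)

path = find_tool("nasm")
if path is None:
    print("nasm not found; install it with your package manager")
else:
    print("nasm found at", path)
```

shutil.which performs the same PATH lookup that the which command does, so a None result corresponds to which printing no path.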

Once you have installed nasm we can begin writing and building executables from assembly.

Basic Syntax

Assembly programs are generally divided into three sections. These three sections consist of the data section, the bss section, and the text section.

data section

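The data section holds the constants and initialized data that you have declared. The values are fixed in the executable at assembly time, which makes the section useful for holding things like strings. (The section itself is writable, so a program can still modify these values while it runs.)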

You declare the data section using the following syntax.

section .data
msg db 'Hello World!', 0xa  ;string
len equ $ - msg     ;length of the string
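To see what len equ $ - msg computes, here is a rough Python model of the two declarations. It is only a sketch: the start address of 0 is an assumption made for illustration, since the real address is chosen by the linker.

```python
# Model of: msg db 'Hello World!', 0xa
msg = b"Hello World!" + bytes([0x0A])

# Model of: len equ $ - msg
# '$' is the assembler's current address, i.e. the address just past msg.
start = 0                     # assumed address of msg (illustrative only)
current = start + len(msg)    # value of '$' after msg has been emitted
length = current - start      # number of bytes occupied by msg

print(length)  # 13
```

The subtraction gives the byte count of the string including the trailing newline, so the length never has to be counted by hand.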

bss section

The bss section reserves space for uninitialized data and stores no content in the executable itself. It contains only the information necessary for the loader to preallocate memory space when starting a program. At execution this memory will normally contain all 0s and is devoid of useful information until data is written to those variables. You can use a debugger to review the contents of this memory as the system runs the program.

Think of it like taking cardboard, folding a box, putting a name on the box, and declaring ’this box holds 4 glass jars’. You are preparing the object to hold things, but it is empty until something is put in it.
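The box analogy can be sketched in Python with a zero-filled bytearray. This is an analogy only; the name jars is hypothetical, and the nasm equivalent would be a reservation such as jars resb 4 inside section .bss.

```python
# Reserve 4 bytes, all zero, much like 'jars resb 4' in a .bss section.
jars = bytearray(4)
initial = bytes(jars)      # snapshot of the freshly reserved space
print(list(initial))       # [0, 0, 0, 0]

# The space holds nothing useful until the program writes to it.
jars[0] = 0x41
print(list(jars))          # [65, 0, 0, 0]
```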

text section

The text section houses your actual code. It must begin with the declaration global _start, which exports the _start label so the linker can mark it as the point where program execution begins. This is similar to the main() function found in languages like C or C++.

section .text
  global _start
_start:

Comments

Assembly supports both inline comments and comments on a line of their own. In nasm a comment begins with a semicolon and runs to the end of the line. A comment may contain any printable characters, and blank lines are also allowed. We use comments to bring attention to code segments when the code itself is not self-explanatory. There are many thoughts on how comments should be used. I, the author, believe that code should be easy to read and indicative of what it is doing, and that comments should be used sparingly. Others believe that comments should be used liberally. You as a coder should make that decision, but writing well-formed, self-documenting code should be your priority regardless of the methodology chosen.

;
;
; This is an example comment.
;
;

msg db 'Hello World!', 0xa     ; This is also an inline comment.

asm Hello World

Intel Syntax

section     .text
global      _start                              ;must be declared for linker (ld)

_start:                                         ;tell linker entry point
    mov     edx,len                             ;message length
    mov     ecx,msg                             ;message to write
    mov     ebx,1                               ;file descriptor (stdout)
    mov     eax,4                               ;system call number (sys_write)
    int     0x80                                ;call kernel

    mov     ebx,0                               ;exit status 0 (ebx still held 1 from the write call)
    mov     eax,1                               ;system call number (sys_exit)
    int     0x80                                ;call kernel

section     .data
msg     db  'Hello, world!',0xa                 ;our dear string
len     equ $ - msg                             ;length of our dear string

$ nasm -f elf64 hello.asm
$ ld -m elf_x86_64 -s hello.o -o hello
$ ./hello

AT&T Syntax

.global _start

.text
_start:
        # write(1, message, 13)
        mov     $4, %eax                # system call 4 is write
        mov     $1, %ebx                # file handle 1 is stdout
        mov     $message, %ecx          # address of string to output
        mov     $13, %edx               # number of bytes to write
        int     $0x80                   # invoke operating system code
        
        # exit(0)
        mov     $1, %eax                # system call 1 is exit
        xor     %ebx, %ebx              # we want return code 0
        int     $0x80                   # invoke operating system code
message:
        .ascii  "Hello, World\n"

$ gcc -c hello.s
$ ld hello.o -o hello
$ ./hello

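32-bit general registers: EAX, EBX, ECX, EDX
int raises a software interrupt. Interrupt 0x80 hands control to the kernel, which performs the action whose number is held in eax. The action is known as a ‘system call’.
System call 1 (sys_exit) makes the program exit.
System call 4 (sys_write) writes bytes to a file descriptor, which is how the program prints.
64-bit general registers: RAX, RBX, RCX, RDX, RSI, RDI, and others. The 64-bit Linux syscall instruction takes the call number in RAX and its first arguments in RDI, RSI, and RDX.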
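For comparison, Python exposes the same write system call through os.write, which takes a file descriptor and a buffer much as sys_write takes ebx and ecx/edx. The sketch below writes into a pipe rather than stdout so the result can be inspected; the pipe is an assumption made for the demonstration, not part of the assembly program.

```python
import os

message = b"Hello, world!\n"

# Create a pipe so the written bytes can be read back and checked.
read_end, write_end = os.pipe()

# Roughly: eax = write, ebx = write_end, ecx = message, edx = len(message)
written = os.write(write_end, message)
os.close(write_end)

echoed = os.read(read_end, written)
os.close(read_end)

print(written, echoed)
```

Passing file descriptor 1 instead of write_end would send the bytes to stdout, which is what the assembly example does.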

Intel Syntax: mov eax, 1 (instruction, destination, source)

AT&T Syntax: mov $4, %eax (instruction, source, destination)

Intel syntax infers the amount of data to move, and the addressing mode, from the operands themselves. AT&T syntax supports suffixes at the end of the instruction mnemonic to signify the size of the data. These are not mandatory when a register operand already implies the size. The real explicitness of AT&T syntax comes from the use of the $ and % symbols. $ marks an immediate value; without the $, mov 1, %eax would fetch the value found at memory address 1. % marks a register name, so the assembler does not mistake it for a symbol (a labeled memory address).
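The operand-order and prefix rules can be sketched as a tiny translator. This is an illustrative toy, not a real converter: the function names and the REGISTERS set are hypothetical, and it assumes two-operand instructions whose operands are only registers, immediates, or bare symbol addresses (no memory operands).

```python
# Registers this sketch recognises; a real assembler knows many more.
REGISTERS = {"eax", "ebx", "ecx", "edx"}

def att_operand(op):
    """Prefix a register with % and anything else (immediate/symbol) with $."""
    op = op.strip()
    if op.lower() in REGISTERS:
        return "%" + op.lower()
    return "$" + op

def intel_to_att(line):
    """Turn 'mov dest, src' (Intel) into 'mov src, dest' (AT&T)."""
    mnemonic, operands = line.split(None, 1)
    dest, src = operands.split(",")
    return f"{mnemonic} {att_operand(src)}, {att_operand(dest)}"

print(intel_to_att("mov eax, 1"))    # mov $1, %eax
print(intel_to_att("mov ecx,msg"))   # mov $msg, %ecx
```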

Suffixes

b - byte 8 bits
w - word 16 bits
l - long 32 bits
q - quad-word 64 bits
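One way to picture what a size suffix means is to truncate a value to that many bits. The sketch below is an assumption-laden model in Python; SUFFIX_BITS and truncate are hypothetical names, and real instructions have further rules (for example about which upper register bits are preserved) that this ignores.

```python
# Operand width in bits for each AT&T size suffix.
SUFFIX_BITS = {"b": 8, "w": 16, "l": 32, "q": 64}

def truncate(value, suffix):
    """Keep only the low-order bits that fit in the suffix's width."""
    bits = SUFFIX_BITS[suffix]
    return value & ((1 << bits) - 1)

print(hex(truncate(0x1122334455667788, "l")))  # 0x55667788
print(hex(truncate(0x1234, "b")))              # 0x34
```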

Registers

Data must be stored in memory and accessed as necessary by the processor. Reading data from and storing data in memory is a slow process, relatively speaking. The data must move across the bus and into memory, and then traverse the same bus in the opposite direction when it is needed again. Moving information between RAM and the CPU is slow.

Registers exist to speed up processor operations by making storage available within the CPU itself. Registers hold data elements for processing without ever having to traverse the bus and access memory. The trade-off is that we have only a limited number of spots where we can store information within the CPU itself.

The 64-bit x86 architecture (x86-64) enjoys the benefit of a larger number of registers: sixteen general-purpose registers instead of the eight available in 32-bit mode. Do not confuse it with IA-64, the separate Itanium architecture, whose assembly language was deliberately designed with the intention that compilers would write the majority of code and that humans would interact with assembly little, if at all. The broader point is important to remember. In modern systems, compilers and high-level languages do an excellent job of taking advantage of hardware and can perform a great deal of optimization without the user needing to get involved.

Logical Instructions

AND

AND is used for supporting logical expressions by performing bitwise AND operations. This operation will return 1 if the matching bits from both operands are 1, else it returns 0. Does this sound similar to a logic gate? It should as they operate the same.

0 AND 0 = 0
0 AND 1 = 0
1 AND 0 = 0
1 AND 1 = 1
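The truth table applies to every bit position at once. A short Python sketch, using Python's & operator as a stand-in for the AND instruction:

```python
a = 0b1100
b = 0b1010

# Each result bit is 1 only where both inputs have a 1.
print(format(a & b, "04b"))      # 1000

# A common use: mask off everything except the low four bits.
print(hex(0xAB & 0x0F))          # 0xb
```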

OR

OR is used for setting one or more bits. The bitwise OR operator will return 1 if the matching bits from either or both operands are one. It returns 0 if both bits are zero.

0 OR 0 = 0
1 OR 0 = 1
0 OR 1 = 1
1 OR 1 = 1
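A brief Python sketch of the same table, using the | operator as a stand-in for the OR instruction; the variable name flags is hypothetical.

```python
a = 0b1100
b = 0b1010

# Each result bit is 1 where either input has a 1.
print(format(a | b, "04b"))      # 1110

# Setting a single bit (bit 2) without disturbing the others.
flags = 0b0001
flags = flags | (1 << 2)
print(format(flags, "04b"))      # 0101
```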

XOR

XOR is a logic gate that gives a 1 output when the number of true inputs is odd. This can also be used to clear a register.

0 XOR 0 = 0
0 XOR 1 = 1
1 XOR 0 = 1
1 XOR 1 = 0

Clearing a register would look like xor eax, eax, since any value XORed with itself is 0.
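The register-clearing trick follows directly from the truth table. A minimal Python sketch, with ^ standing in for the XOR instruction and the variable eax standing in for the register:

```python
a = 0b1100
b = 0b1010

# Each result bit is 1 where the inputs differ.
print(format(a ^ b, "04b"))      # 0110

# XORing a value with itself always yields zero, whatever it held before.
eax = 0xDEADBEEF
eax = eax ^ eax
print(eax)                       # 0
```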

TEST

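The TEST instruction performs the same bitwise AND as the AND instruction, but it discards the result and only updates the flags, leaving both operands unchanged. This allows you to find out if a number is even or odd, by testing its lowest bit, without changing the original number.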
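The even/odd check can be modelled in Python. This is a sketch of the idea only: lowest_bit_clear is a hypothetical helper, and its return value plays the role of the zero flag that TEST would set.

```python
def lowest_bit_clear(n):
    """Model of 'test n, 1': AND with 1, keep only the zero-flag outcome."""
    zero_flag = (n & 1) == 0     # the AND result itself is thrown away
    return zero_flag

number = 42
print(lowest_bit_clear(number))  # True  -> even
print(lowest_bit_clear(7))       # False -> odd
print(number)                    # 42, the operand is unchanged
```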

NOT

NOT will reverse the bits of an operand.

NOT 0 = 1
NOT 1 = 0
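Python integers have no fixed width, so its ~ operator gives a negative number rather than a flipped register. The sketch below masks the result to 8 bits to imitate NOT on a byte-sized operand; not8 is a hypothetical helper name.

```python
def not8(x):
    """Flip all bits of x, keeping the result within 8 bits."""
    return ~x & 0xFF

print(format(not8(0b00001111), "08b"))   # 11110000
print(format(not8(0b00000000), "08b"))   # 11111111
```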

Recursion

Recursion exists in two forms. Direct recursion occurs when a function calls itself. Indirect recursion occurs when a function calls a second function, which in turn calls the first. Python expresses direct recursion very elegantly. Consider the following bit of code for finding the factorial of a number.

def fact(a):
  # 0! and 1! are both 1; this base case stops the recursion
  if a <= 1:
    return 1
  else:
    return a * fact(a-1)

If you are not familiar with finding the factorial of a number it works like this.

print(fact(4))   # 4 * 3 * 2 * 1 = 24

Recursion is an elegant and simple method of allowing you to conduct repeated operations with specifically defined rules.
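Indirect recursion can be illustrated the same way. In this minimal sketch, two hypothetical functions call each other until the argument reaches zero; it assumes a non-negative integer input.

```python
def is_even(n):
    """Return True if n is even, by asking is_odd about n - 1."""
    if n == 0:
        return True
    return is_odd(n - 1)

def is_odd(n):
    """Return True if n is odd, by asking is_even about n - 1."""
    if n == 0:
        return False
    return is_even(n - 1)

print(is_even(10))   # True
print(is_odd(7))     # True
```

Neither function calls itself directly, yet each call chain eventually returns to the function that started it.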

Memory

Pointers, virtual memory, and physical addresses are all important concepts to begin mastering. In assembly language, we allocate space and then fill it with a string. Comparing assembly with something like Python shows the difference very easily when it comes to handling a string and printing it.
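Python normally hides this allocate-then-fill step, but the standard ctypes module can make it visible. The following is a sketch for illustration; the buffer size of 13 is an assumption chosen to fit the 12-character string plus a terminating zero byte.

```python
import ctypes

# Allocate 13 zeroed bytes of raw memory.
buf = ctypes.create_string_buffer(13)
print(buf.raw)                       # thirteen zero bytes

# Fill the allocated space with a string.
buf.value = b"Hello world!"

# A pointer is just the (virtual) address where that space lives.
address = ctypes.addressof(buf)
print(hex(address), buf.value)

# Reading 5 bytes starting at that address gives back the start of the string.
print(ctypes.string_at(address, 5))
```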

Assembly - hello.asm

; Define variables in the data section (nasm section names are case-sensitive)
SECTION .data
	hello:     db 'Hello world!',10
	helloLen:  equ $-hello

; Code goes in the text section (lowercase .text is the standard executable section)
SECTION .text
	GLOBAL _start

_start:
	mov eax,4            ; 'write' system call = 4
	mov ebx,1            ; file descriptor 1 = STDOUT
	mov ecx,hello        ; string to write
	mov edx,helloLen     ; length of string to write
	int 80h              ; call the kernel

	; Terminate program
	mov eax,1            ; 'exit' system call
	mov ebx,0            ; exit with error code 0
	int 80h              ; call the kernel

You then must assemble and link.

$ nasm -f elf64 hello.asm
$ ld hello.o -o hello
$ ./hello

Python - hello.py

hello = 'Hello World!'
print(hello)

Running is trivial.

$ python hello.py

GDB

GDB, the GNU Project Debugger, allows you to find out what is happening with a program as it executes. GDB is generally used for catching issues or bugs in software. GDB is capable of doing four main things:

Start a program and specify anything that could change the behavior of that program.
Make a program stop on demand or conditionally.
Examine what happened to a program when it has stopped.
Change things in the program, which allows experimentation.

GDB supports a plethora of languages including:

C
C++
D
Fortran
Go
Rust
Python
and more …

We can install GDB using sudo pacman -S gdb or the appropriate package manager for your distribution. The python-dbg package from the AUR must also be installed if you plan to use gdb in conjunction with Python. This will replace your Python with one that includes the debugging hooks. Sometimes this breaks things. Don’t hesitate to use a virtual environment to help with this.

Python developers can also use pdb.

Answers

NASM - the Netwide Assembler can be installed in Arch derivatives with sudo pacman -S nasm.
An assembly program consists of the data, bss, and text sections. Assembly also supports comments.
AND, OR, XOR, TEST, and NOT are the five logical instructions that assembly recognizes.
Recursion is the ability for a procedure to call itself. There are two forms of recursion and they are direct and indirect.
Assembly is very low level. You must allocate and deallocate memory as appropriate.

Conclusion

You cannot begin the process of disassembly and review of software if you do not understand how software functions. You must master the basics of computing if you plan to move forward. Assembly language is extremely low level and gives the user an excellent idea of exactly what is occurring within the computer.

Understanding how assembly language works will give you the knowledge necessary to begin learning concepts such as coding, reverse engineering, and malware analysis.

You do not need assembly programming to write code. The reasons for understanding assembly are severalfold. It forces you to gain a greater understanding of your hardware and architecture, and it can help you work out what an application is doing when it appears to be coded correctly but still misbehaves. As a developer or researcher, you would be well served by taking the time to familiarize yourself enough to be able to read the Intel processor manuals.

Final Recommendations

Use Linux.
Understand your hardware and software.
Don’t be afraid to get your hands dirty.
Experiment.
Practice.
Read the book Programming from the Ground Up.