Automata and Language Theory

String (computer science)

In computer programming and some branches of mathematics, a string is an ordered sequence of symbols. These symbols are chosen from a predetermined set or alphabet.

In computer programming, a string is generally understood as a data type storing a sequence of data values, usually bytes, in which elements usually stand for characters according to a character encoding, which differentiates it from the more general array data type. In this context, the terms binary string and byte string are used to suggest strings in which the stored data does not (necessarily) represent text.

A variable declared to have a string data type usually causes storage to be allocated in memory that is capable of holding some predetermined number of symbols. When a string appears literally in source code, it is known as a string literal and has a representation that denotes it as such.

Formal theory

Let Σ be an alphabet, a non-empty finite set. Elements of Σ are called symbols or characters. A string (or word) over Σ is any finite sequence of characters from Σ. For example, if Σ = {0, 1}, then 0101 is a string over Σ.

The length of a string is the number of characters in the string (the length of the sequence) and can be any non-negative integer. The empty string is the unique string over Σ of length 0, and is denoted ε or λ.

The set of all strings over Σ of length n is denoted Σn. For example, if Σ = {0, 1}, then Σ2 = {00, 01, 10, 11}. Note that Σ0 = {ε} for any alphabet Σ.

The set of all strings over Σ of any length is the Kleene closure of Σ and is denoted Σ*. In terms of Σn,

For example, if Σ = {0, 1}, Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, …}. Although Σ* itself is countably infinite, all elements of Σ* have finite length.

A set of strings over Σ (i.e. any subset of Σ*) is called a formal language over Σ. For example, if Σ = {0, 1}, the set of strings with an even number of zeros ({ε, 1, 00, 11, 001, 010, 100, 111, 0000, 0011, 0101, 0110, 1001, 1010, 1100, 1111, …}) is a formal language over Σ.

Concatenation and substrings

Concatenation is an important binary operation on Σ*. For any two strings s and t in Σ*, their concatenation is defined as the sequence of characters in s followed by the sequence of characters in t, and is denoted st. For example, if Σ = {a, b, …, z}, s = bear, and t = hug, then st = bearhug and ts = hugbear.

String concatenation is an associative, but non-commutative operation. The empty string serves as the identity element; for any string s, εs = sε = s. Therefore, the set Σ* and the concatenation operation form a monoid, the free monoid generated by Σ. In addition, the length function defines a monoid homomorphism from Σ* to the non-negative integers.

A string s is said to be a substring or factor of t if there exist (possibly empty) strings u and v such that t = usv. The relation “is a substring of” defines a partial order on Σ*, the least element of which is the empty string.

Lexicographical ordering

It is often necessary to define an ordering on the set of strings. If the alphabet Σ has a total order (cf. alphabetical order) one can define a total order on Σ* called lexicographical order. Note that since Σ is finite, it is always possible to define a well ordering on Σ and thus on Σ*. For example, if Σ = {0, 1} and 0 < 1, then the lexicographical ordering of Σ* is ε < 0 < 00 < 000 < … < 011 < 0110 < … < 01111 < … < 1 < 10 < 100 < … < 101 < … < 111 …

String operations

A number of additional operations on strings commonly occur in the formal theory. These are given in the article on string operations.

Topology

Strings admit the following interpretation as nodes on a graph:

  • Fixed length strings can be viewed as nodes on a hypercube;
  • Variable length strings (of finite length) can be viewed as nodes on the k-ary tree, where k is the number of symbols in Σ;
  • Infinite strings can be viewed as infinite paths on the k-ary tree.

The natural topology on the set of fixed length strings or variable length strings is the discrete topology, but the natural topology on the set of infinite strings is the limit topology, viewing the set of infinite strings as the inverse limit of the sets of finite strings. This is the construction used for the p-adic numbers and some constructions of the Cantor set, and yields the same topology.

String datatypes

A string datatype is a datatype modeled on the idea of a formal string. Strings are such an important and useful datatype that they are implemented in nearly every programming language. In some languages they are available as primitive types and in others as composite types. The syntax of most high-level programming languages allows for a string, usually quoted in some way, to represent an instance of a string datatype; such a meta-string is called a literal or string literal.

String length

Although formal strings can have an arbitrary (but finite) length, the length of strings in real languages is often constrained to an artificial maximum. In general, there are two types of string datatypes: fixed length strings which have a fixed maximum length and which use the same amount of memory whether this maximum is reached or not, and variable length strings whose length is not arbitrarily fixed and which use varying amounts of memory depending on their actual size. Most strings in modern programming languages are variable length strings. Despite the name, even variable length strings are limited in length; although, generally, the limit depends only on the amount of memory available.

Character encoding

Historically, string datatypes allocated one byte per character, and although the exact character set varied by region, character encodings were similar enough that programmers could generally get away with ignoring this — groups of character sets used by the same system in different regions usually either had a character in the same place, or did not have it at all. These character sets were typically based on ASCII or EBCDIC.

Logographic languages such as Chinese, Japanese, and Korean (known collectively as CJK) need far more than 256 characters (the limit of a one 8-bit byte per-character encoding) for reasonable representation. The normal solutions involved keeping single-byte representations for ASCII and using two-byte representations for CJK ideographs. Use of these with existing code led to problems with matching and cutting of strings, the severity of which depended on how the character encoding was designed. Some encodings such as the EUC family guarantee that a byte value in the ASCII range will only represent that ASCII character, making the encoding safe for systems that use those characters as field separators. Other encodings such as ISO-2022 and Shift-JIS do not make such guarantees, making matching on byte codes unsafe. Another issue is that if the beginning of a string is deleted, important instructions for the decoder or information on position in a multibyte sequence may be lost. Another is that if strings are joined together (especially after having their ends truncated by code not aware of the encoding), the first string may not leave the encoder in a state suitable for dealing with the second string.

Unicode has simplified the picture somewhat. Most languages have a datatype for Unicode strings (usually UTF-16 as it was usually added before Unicode supplemental planes were introduced). Converting between Unicode and local encodings requires an understanding of the local encoding, which may be problematic for existing systems where strings of various encodings are being transmitted together with no real marking as to what encoding they are in.

Implementations

Some languages like C++ implement strings as templates that can be used with any datatype, but this is the exception, not the rule.

If an object-oriented language represents strings as objects, they are called mutable if the value can change at runtime and immutable if the value is frozen after creation. For example, Ruby has mutable strings, while Python’s strings are immutable.

Other languages, most notably Prolog and Erlang, avoid implementing a string datatype, instead adopting the convention of representing strings as lists of character codes.

Representations

Representations of strings depend heavily on the choice of character repertoire and the method of character encoding. Older string implementations were designed to work with repertoire and encoding defined by ASCII, or more recent extensions like the ISO 8859 series. Modern implementations often use the extensive repertoire defined by Unicode along with a variety of complex encodings such as UTF-8 and UTF-16.

Most string implementations are very similar to variable-length arrays with the entries storing the character codes of corresponding characters. The principal difference is that, with certain encodings, a single logical character may take up more than one entry in the array. This happens for example with UTF-8, where single characters can take anywhere from one to four bytes. In these cases, the logical length of the string differs from the logical length of the array.

The length of a string can be stored implicitly by using a special terminating character; often this is the null character having value zero, a convention used and perpetuated by the popular C programming language[1]. Hence, this representation is commonly referred to as C string. The length of a string can also be stored explicitly, for example by prefixing the string with the length as a byte value — a convention used in Pascal; consequently some people call it a P-string.

In terminated strings, the terminating code is not an allowable character in any string.

The term bytestring usually indicates a general-purpose string of bytes — rather than strings of only (readable) characters, strings of bits, or such. Byte strings often imply that bytes can take any value and any data can be stored as-is, meaning that there should be no value interpreted as a termination value.

Here is an example of a null-terminated string stored in a 10-byte buffer, along with its ASCII representation:

F

R

A

N

K

NUL

k

e

f

w

46

52

41

4E

4B

00

6B

66

66

77

The length of a string in the above example is 5 characters, but it occupies 6 bytes. Characters after the terminator do not form part of the representation; they may be either part of another string or just garbage. (Strings of this form are sometimes called ASCIZ strings, after the original assembly language directive used to declare them.)

Here is the equivalent (old style) Pascal string stored in a 10-byte buffer, along with its ASCII representation:

length

F

R

A

N

K

k

e

f

w

05

46

52

41

4E

4B

6B

66

66

77

Both character termination and length codes limit strings: for example, C character arrays that contain Nul characters cannot be handled directly by C string library functions: strings using a length code are limited to the maximum value of the length code.

Both of these limitations can be overcome by clever programming, of course, but such workarounds are by definition not standard.

Historically, rough equivalents of the C termination method appear in both hardware and software. For example “data processing” machines like the IBM 1401 used a special word mark bit to delimit strings at the left, where the operation would start at the right. This meant that while the IBM 1401 had a seven-bit word in “reality”, almost no-one ever thought to use this as a feature, and override the assignment of the seventh bit to (for example) handle ASCII codes.

It is possible to create data structures and functions that manipulate them that do not have the problems associated with character termination and can in principle overcome length code bounds. It is also possible to optimize the string represented using techniques from run length encoding (replacing repeated characters by the character value and a length) and Hamming encoding.

While these representations are common, others are possible. Using ropes makes certain string operations, such as insertions, deletions, and concatenations more efficient.

Vectors

While character strings are very common uses of strings, a string in computer science may refer generically to any vector of homogenously typed data. A string of bits or bytes, for example, may be used to represent data retrieved from a communications medium. This data may or may not be represented by a string-specific datatype, depending on the needs of the application, the desire of the programmer, and the capabilities of the programming language being used.

String processing algorithms

There are many algorithms for processing strings, each with various trade-offs. Some categories of algorithms include

Advanced string algorithms often employ complex mechanisms and data structures, among them suffix trees and finite state machines.

Character string oriented languages and utilities

Character strings are such a useful datatype that several languages have been designed in order to make string processing applications easy to write. Examples include the following languages:

Many UNIX utilities perform simple string manipulations and can be used to easily program some powerful string processing algorithms. Files and finite streams may be viewed as strings.

Some Application Programming Interfaces like Multimedia Control Interface, embedded SQL or printf use strings to hold commands that will be interpreted.

Recent scripting programming languages, including Perl, Python, Ruby, and Tcl employ regular expressions to facilitate text operations.

Some languages such as Perl and Ruby support string interpolation, which permits arbitrary expressions to be evaluated and included in string literals.

Character string functions

String functions are used to manipulate a string or change or edit the contents of a string. They also are used to query information about a string. They are usually used within the context of a computer programming language.

The most basic example of a string function is the length(string) function, which returns the length of a string (not counting any terminator characters or any of the string’s internal structural information) and does not modify the string. For example, length("hello world") returns 11.

There are many string functions which exist in other languages with similar or exactly the same syntax or parameters. For example in many languages the length function is usually represented as len(string). Even though string functions are very useful to a computer programmer, a computer programmer using these functions should be mindful that a string function in one language could in another language behave differently or have a similar or completely different function name, parameters, syntax, and results.

Programming Language

 

Language

syntax, grammar, and symbols or words used to give instructions to a computer

 

Development of Low-Level Languages

All computers operate by following machine language programs, a long sequence of instructions called machine code that is addressed to the hardware of the computer and is written in binary notation, which uses only the digits 1 and 0. First-generation languages, called machine languages, required the writing of long strings of binary numbers to represent such operations as “add,” “subtract,” “and compare.” Later improvements allowed octal, decimal, or hexadecimal representation of the binary strings.

Because writing programs in machine language is impractical (it is tedious and error prone), symbolic, or assembly, languages—second-generation languages—were introduced in the early 1950s. They use simple mnemonics such as A for “add” or M for “multiply,” which are translated into machine language by a computer program called an assembler. The assembler then turns that program into a machine language program. An extension of such a language is the macro instruction, a mnemonic (such as “READ”) for which the assembler substitutes a series of simpler mnemonics. The resulting machine language programs, however, are specific to one type of computer and will usually not run on a computer with a different type of central processing unit (CPU).

Evolution of High-Level Languages

The lack of portability between different computers led to the development of high-level languages—so called because they permitted a programmer to ignore many low-level details of the computer’s hardware. Further, it was recognized that the closer the syntax, rules, and mnemonics of the programming language could be to “natural language” the less likely it became that the programmer would inadvertently introduce errors (called “bugs”) into the program. Hence, in the mid-1950s a third generation of languages came into use. These algorithmic, or procedural, languages are designed for solving a particular type of problem. Unlike machine or symbolic languages, they vary little between computers. They must be translated into machine code by a program called a compiler or interpreter.

Early computers were used almost exclusively by scientists, and the first high-level language, Fortran [Formula translation], was developed (1953–57) for scientific and engineering applications by John Backus at the IBM Corp. A program that handled recursive algorithms better, LISP [LISt Processing], was developed by John McCarthy at the Massachusetts Institute of Technology in the early 1950s; implemented in 1959, it has become the standard language for the artificial intelligence community. COBOL [COmmon Business Oriented Language], the first language intended for commercial applications, is still widely used; it was developed by a committee of computer manufacturers and users under the leadership of Grace Hopper, a U.S. Navy programmer, in 1959. ALGOL [ALGOrithmic Language], developed in Europe about 1958, is used primarily in mathematics and science, as is APL [A Programming Language], published in the United States in 1962 by Kenneth Iverson. PL/1 [Programming Language 1], developed in the late 1960s by the IBM Corp., and ADA [for Ada Augusta, countess of Lovelace, biographer of Charles Babbage, (1972-1871.English mathematician) developed in 1981 by the U.S. Dept. of Defense, are designed for both business and scientific use.

BASIC [Beginner's All-purpose Symbolic Instruction Code] was developed by two Dartmouth College professors, John Kemeny and Thomas Kurtz, as a teaching tool for undergraduates (1966); it subsequently became the primary language of the personal computer revolution. In 1971, Swiss professor Nicholas Wirth developed a more structured language for teaching that he named Pascal (for French mathematician Blaise Pascal, who built the first successful mechanical calculator). Modula 2, a Pascallike language for commercial and mathematical applications, was introduced by Wirth in 1982. Ten years before that, to implement the UNIX operating system, Dennis Ritchie of Bell Laboratories produced a language that he called C; along with its extensions, called C++, developed by Bjarne Stroustrup of Bell Laboratories, it has perhaps become the most widely used general-purpose language among professional programmers because of its ability to deal with the rigors of object-oriented programming. Java is an object-oriented language similar to C++ but simplified to eliminate features that are prone to programming errors. Java was developed specifically as a network-oriented language, for writing programs that can be safely downloaded through the Internet  and immediately run without fear of computer viruses. Using small Java programs called applets, World Wide Web pages can be developed that include a full range of multimedia functions.

Fourth-generation languages are nonprocedural—they specify what is to be accomplished without describing how. The first one, FORTH, developed in 1970 by American astronomer Charles Moore, is used in scientific and industrial control applications. Most fourth-generation languages are written for specific purposes. Fifth-generation languages, which are still in their infancy, are an outgrowth of artificial intelligence research. PROLOG [PROgramming LOGic], developed by French computer scientist Alain Colmerauer and logician Philippe Roussel in the early 1970s, is useful for programming logical processes and making deductions automatically.

Many other languages have been designed to meet specialized needs. GPSS [General Purpose System Simulator] is used for modeling physical and environmental events, and SNOBOL [String-Oriented Symbolic Language] is designed for pattern matching and list processing. LOGO, a version of LISP, was developed in the 1960s to help children learn about computers. PILOT [Programmed Instruction Learning, Or Testing] is used in writing instructional software, and Occam is a nonsequential language that optimizes the execution of a program’s instructions in parallel-processing systems.

There are also procedural languages that operate solely within a larger program to customize it to a user’s particular needs. These include the programming languages of several database and statistical programs, the scripting languages of communications programs, and the macro languages of word-processing programs.

Compilers and Interpreters

Once the program is written and has had any errors repaired (a process called debugging), it may be executed in one of two ways, depending on the language. With some languages, such as C or Pascal, the program is turned into a separate machine language program by a compiler, which functions much as an assembler does. Other languages, such as LISP, do not have compilers but use an interpreter to read and interpret the program a line at a time and convert it into machine code. A few languages, such as BASIC, have both compilers and interpreters. Source code, the form in which a program is written in a high-level language, can easily be transferred from one type of computer to another, and a compiler or interpreter specific to the machine configuration can convert the source code to object, or machine, code.

Programming Language

Language in which a computer programmer writes instructions for a computer to execute. Some languages, such as COBOL, FORTRAN, Pascal, and C, are known as procedural languages because they use a sequence of commands to specify how the machine is to solve a problem. Others, such as LISP, are functional, in that programming is done by invoking procedures (sections of code executed within a program). Languages that support object-oriented programming take the data to be manipulated as their point of departure. Programming languages can also be classified as high-level or low-level. Low-level languages address the computer in a way that it can understand directly, but they are very far from human language. High-level languages deal in concepts that humans devise and can understand, but they must be translated by means of a compiler into language the computer understands.

Programming Languages

A language used to write instructions for the computer. It lets the programmer express data processing in a symbolic manner without regard to machine-specific details.

From Source Code to Machine Language
The statements that are written by the programmer are called “source language,” and they are translated into the computer’s “machine language” by programs called “assemblers,” “compilers” and “interpreters.” For example, when a programmer writes MULTIPLY HOURS TIMES RATE, the verb MULTIPLY must be turned into a code that means multiply, and the nouns HOURS and RATE must be turned into memory locations where those items of data are actually located.
Grammar and Syntax
Like human languages, each programming language has its own grammar and syntax. There are many dialects of the same language, and each dialect requires its own translation system. Standards have been set by ANSI for many programming languages, and ANSI-standard languages are dialect free. However, it can take years for new features to be included in ANSI standards, and new dialects inevitably spring up as a result.

Low Level and High Level
Programming languages fall into two categories: low-level assembly languages and high-level languages. Assembly languages are available for each CPU family, and each assembly instruction is translated into one machine instruction by the assembler program. With high-level languages, a programming statement may be translated into one or several machine instructions by the compiler.

Following is a brief summary of the major high-level languages. Look up each one for more details. For a list of high-level programming languages designed for client/server development, see client/server development system.

ActionScript
Programming language for Flash programs. See Flash and ActionScript.

Ada
Comprehensive, Pascal-based language used by the Department of Defense. See Ada.

ALGOL
International language for expressing algorithms. See ALGOL.

APL
Used for statistics and mathematical matrices. Requires special keyboard symbols. See APL.

BASIC
Developed as a timesharing language in the 1960s. It has been widely used in microcomputer programming in the past, and various dialects of BASIC have been incorporated into many different applications. Microsoft’s Visual Basic is widely used. See BASIC and Visual Basic.

C
Developed in the 1980s at AT&T. Widely used to develop commercial applications. Unix is written in C. See
C.

C++
Object-oriented version of C that is popular because it combines object-oriented capability with traditional C programming syntax. See C++.
C#
Pronounced “C-sharp.” A Microsoft .NET language based on C++ with elements from Visual Basic and Java. See
.NET.

COBOL
Developed in the 1960s. Widely used for mini and mainframe programming. See COBOL.

dBASE
Used to be widely used in business applications, but FoxPro (Microsoft’s dBASE) has survived the longest. See Visual FoxPro, FoxBase, Clipper and Quicksilver.

F#
Pronounced “F-sharp.” A Microsoft .NET scripting language based on ML. See F#.

FORTH
Developed in the 1960s, FORTH has been used in process control and game applications. See FORTH.

FORTRAN
Developed in 1954 by IBM, it was the first major scientific programming language and continues to be widely used. Some commercial applications have been developed in FORTRAN. See FORTRAN.

Java
The programming language developed by Sun and repositioned for Web use. It is widely used on the server side, although client applications are increasingly used. See Java.

JavaScript
A scripting language widely used on the Web. JavaScript is embedded into many HTML pages. See JavaScript.

LISP
Developed in 1960. Used for AI applications. Its syntax is very different than other languages. See LISP.

Logo
Developed in the 1960s, it was noted for its ease of use and “turtle graphics” drawing functions. See Logo.

M
Originally MUMPS (Massachusetts Utility MultiProgramming System), it includes its own database. It is widely used in medical applications. See M.

Modula-2
Enhanced version of Pascal introduced in 1979. See Modula-2.

Pascal
Originally an academic language developed in the 1970s. Borland commercialized it with its Turbo Pascal. See Pascal.

Perl
A scripting language widely used on the Web to write CGI scripts. See Perl.

Prolog
Developed in France in 1973. Used throughout Europe and Japan for AI applications. See Prolog.

Python
A scripting language used for system utilities and Internet scripts. Developed in Amsterdam by Guido van Rossum. See Python.

REXX
Runs on IBM mainframes and OS/2. Used as a general purpose macro language. See REXX.

VBScript
Subset of Visual Basic used on the Web similar to JavaScript. See VBScript.

Visual Basic
Version of BASIC for Windows programming from Microsoft that has been widely used. See Visual Basic.

Web Languages
Languages such as JavaScript, Jscript, Perl and CGI are used to automate Web pages as well as link them to other applications running in servers.

Millions of Languages!

Programmers must use standard names for the instruction verbs (add, compare, etc.) in the language they use. In addition, a company generally uses standardized names for the data elements in its databases. However, programmers typically “make up” names for all the functions (subroutines) in the program. Since programmers are loathe to document their code, the readability of the names chosen for these routines is critical.

In a single program, the programmer could make up hundreds of function names as well as names for data structures that hold fixed sums, predefined tables and display messages.

 

Set in mathematics

Set in mathematics, collection of entities, called elements of the set, that may be real objects or conceptual entities. Set theory not only is involved in many areas of mathematics but has important applications in other fields as well, e.g., computer technology and atomic and nuclear physics.

Definition of Sets

A set must be well defined; i.e., for any given object, it must be unambiguous whether or not the object is an element of the set. For example, if a set contains all the chairs in a designated room, then any chair can be determined either to be in or not in the set. If there were no chairs in the room, the set would be called the empty, or null, set, i.e., one containing no elements. A set is usually designated by a capital letter. If A is the set of even numbers between 1 and 9, then A={2, 4, 6, 8}. The braces, {}, are commonly used to enclose the listed elements of a set. The elements of a set may be described without actually being listed. If B is the set of real numbers that are solutions of the equation x2=9, then the set can be written as B={x:x2=9} or B={x|x2=9}, both of which are read: B is the set of all x such that x2=9; hence B is the set {3,−3}.

Membership in a set is indicated by the symbol and nonmembership by ; thus, xA means that element x is a member of the set A (read simply as “x is a member of A“) and yA means y is not a member of A. The symbols and are used to indicate that one set A is contained within or contains another set B; AB means that A is contained within, or is a subset of, B; and AB means that A contains, or is a superset of, B.

Operations on Sets

There are three basic set operations: intersection, union, and complementation. The intersection of two sets is the set containing the elements common to the two sets and is denoted by the symbol ∩. The union of two sets is the set containing all elements belonging to either one of the sets or to both, denoted by the symbol . Thus, if C={1, 2, 3, 4} and D={3, 4, 5}, then CD={3, 4} and CD={1, 2, 3, 4, 5}. These two operations each obey the associative law and the commutative law, and together they obey the distributive law.

In any discussion the set of all elements under consideration must be specified, and it is called the universal set. If the universal set is U={1, 2, 3, 4, 5} and A={1, 2, 3}, then the complement of A (written A′) is the set of all elements in the universal set that are not in A, or A′={4, 5}. The intersection of a set and its complement is the empty set (denoted by ), or AA′=; the union of a set and its complement is the universal set, or AA′=U. symbolic logic.

Set

In mathematics and logic, any collection of objects (elements), which may be mathematical (e.g., numbers, functions) or not. The intuitive idea of a set is probably even older than that of number. Members of a herd of animals, for example, could be matched with stones in a sack without members of either set actually being counted. The notion extends into the infinite. For example, the set of integers from 1 to 100 is finite, whereas the set of all integers is infinite. A set is commonly represented as a list of all its members enclosed in braces. A set with no members is called an empty, or null, set, and is denoted . Because an infinite set cannot be listed, it is usually represented by a formula that generates its elements when applied to the elements of the set of counting numbers. Thus, {2x | x = 1,2,3,…} represents the set of positive even numbers (the vertical bar means “such that”).

SET

(Secure Electronic Transaction)

A standard protocol from MasterCard and Visa for securing online credit card payments via the Internet. It is a three-way transaction: the user, merchant and bank must use the SET protocols.

Credit card data and a digital certificate (for authentication) is stored in a plug-in to the user’s Web browser. The order is received by a SET-enabled merchant server that passes encrypted payment information to the bank. Approval is electronically sent to the merchant.

set

1. Maths logic

a. a collection of numbers, objects, etc., that is treated as an entity: {3, the moon} is the set the two members of which are the number 3 and the moon

b. (in some formulations) a class that can itself be a member of other classes

2. any apparatus that receives or transmits television or radio signals

3. Tennis squash badminton one of the units of a match, in tennis one in which one player or pair of players must win at least six games

4. 

a. the number of couples required for a formation dance

b. a series of figures that make up a formation dance

5. 

a. a band’s or performer’s concert repertoire on a given occasion

b. a continuous performance

 

 

 

 

 

 

 

1.

(security)

SET – Secure Electronic Transaction.

 

2.

(electronics)

SETSingle Electron Tunneling.

 

3.

(standard)

SETStandard d’Echange et de Transfert.

 

4.

 

set – A collection of objects, known as the elements of the set, specified in such a way that we can tell in principle whether or not a given object belongs to it. E.g. the set of all prime numbers, the set of zeros of the cosine function.

For each set there is a predicate (or property) which is true for (possessed by) exactly those objects which are elements of the set. The predicate may be defined by the set or vice versa. Order and repetition of elements within the set are irrelevant so, for example, 1, 2, 3 = 3, 2, 1 = 1, 3, 1, 2, 2.

Some common set of numbers are given the following names:

N = the natural numbers 0, 1, 2, …

Z = the integers …, -2, -1, 0, 1, 2, …

Q = the rational numbers p/q where p, q are in Z and q /= 0.

R = the real numbers

C = the complex numbers.

The empty set is the set with no elements. The intersection of two sets X and Y is the set containing all the elements x such that x is in X and x is in Y. The union of two sets is the set containing all the elements x such that x is in X or x is in Y.

 

 

RELATION(MATHEMATICS)

 

In mathematics, the concept of a relation or relationship is a generalization of 2-place relations, such as the relation of equality, denoted by the sign “=” in a statement like “5 + 7 = 12,” or the relation of order, denoted by the sign “<” in a statement like “5 < 12″. Relations that involve two places or roles are called binary relations by some and dyadic relations by others, the latter being historically prior but also useful when necessary to avoid confusion with binary (base 2) numerals.

The next step up is to consider relations that can involve more than two places or roles, but still a finite number of them. These are called finite place or finitary relations. A finitary relation that involves k places is variously called a k-ary, a k-adic, or a k-dimensional relation. The number k is then called the arity, the adicity, or the dimension of the relation, respectively. See the formal definitions below.

Informal introduction

The definition of relation given in the next Section formally captures a concept that is actually quite familiar from everyday life. For example, consider the relationship, involving three roles that people might play, expressed in a statement of the form “X thinks that Y likes Z ”. The facts of a concrete situation could be organized in a Table like the following:

Relation S : X thinks that Y likes Z

Person X

Person Y

Person Z

Alice

Bob

Denise

Charles

Alice

Bob

Charles

Charles

Alice

Denise

Denise

Denise

Each row of the Table records a fact or makes an assertion of the form “X thinks that Y likes Z ”. For instance, the first row says, in effect, “Alice thinks that Bob likes Denise”. The Table represents a relation S over the set P of people under discussion:

P = {Alice, Bob, Charles, Denise}.

The data of the Table are equivalent to the following set of ordered triples:

S = {(Alice, Bob, Denise), (Charles, Alice, Bob), (Charles, Charles, Alice), (Denise, Denise, Denise)}.

By a slight overuse of notation, it is usual to write S(Alice, Bob, Denise) to say the same thing as the first row of the Table. The relation S is a ternary relation, since there are three items involved in each row. The relation itself is a mathematical object defined in terms of concepts from set theory (i.e., the relation is a subset of the Cartesian product on {Person X, Person Y, Person Z}), that carries all of the information from the Table in one neat package. Mathematically, then, a relation is simply a “set“.

The Table for relation S is an extremely simple example of a relational database. The theoretical aspects of databases are the specialty of one branch of computer science, while their practical impacts have become all too familiar in our everyday lives. Computer scientists, logicians, and mathematicians, however, tend to see different things when they look at these concrete examples and samples of the more general concept of a relation.

For one thing, databases are designed to deal with empirical data, and experience is always finite, whereas mathematics is nothing if not concerned with infinity, at the very least, potential infinity. This difference in perspective brings up a number of ideas that are usefully introduced at this point, if by no means covered in depth.

Example: divisibility

A more typical example of a 2-place relation in mathematics is the relation of divisibility between two positive integers n and m that is expressed in statements like “n divides m” or “n goes into m.” This is a relation that comes up so often that a special symbol “|” is reserved to express it, allowing one to write “n|m” for “n divides m.”

To express the binary relation of divisibility in terms of sets, we have the set P of positive integers, P = {1, 2, 3, …}, and we have the binary relation D on P such that the ordered pair (n, m) is in the relation D just in case n|m. In other turns of phrase that are frequently used, one says that the number n is related by D to the number m just in case n is a factor of m, that is, just in case n divides m with no remainder. The relation D, regarded as a set of ordered pairs, consists of all pairs of numbers (n, m) such that n divides m.

For example, 2 is a factor of 4, and 6 is a factor of 72, which can be written either as 2|4 and 6|72 or as D(2, 4) and D(6, 72).

Formal definitions

There are two definitions of k-place relations that are commonly encountered in mathematics. In order of simplicity, the first of these definitions is as follows:

Definition 1. A relation L over the sets X1, …, Xk is a subset of their cartesian product, written L X1 × … × Xk. Under this definition, then, a k-ary relation is simply a set of k-tuples.

The second definition makes use of an idiom that is common in mathematics, stipulating that “such and such is an n-tuple” in order to ensure that such and such a mathematical object is determined by the specification of n component mathematical objects. In the case of a relation L over k sets, there are k + 1 things to specify, namely, the k sets plus a subset of their cartesian product. In the idiom, this is expressed by saying that L is a (k+1)-tuple.

Definition 2. A relation L over the sets X1, …, Xk is a (k+1)-tuple L = (X1, …, Xk, G(L)), where G(L) is a subset of the cartesian product X1 × … × Xk. G(L) is called the graph of L.

Elements of a relation are more briefly denoted by using boldface characters, for example, the constant element = (a1, …, ak) or the variable element = (x1, …, xk).

A statement of the form “ is in the relation L ” is taken to mean that is in L under the first definition and that is in G(L) under the second definition.

The following considerations apply under either definition:

* The sets Xj for j = 1 to k are called the domains of the relation. In the case of the first definition, the relation itself does not uniquely determine a given sequence of domains.

 

* If all of the domains Xj are the same set X, then L is more simply referred to as a k-ary relation over X.

 

* If any of the domains Xj is empty, then the cartesian product is empty, and the only relation over such a sequence of domains is the empty relation L = . As a result, naturally occurring applications of the relation concept typically involve a stipulation that all of the domains be nonempty.

As a rule, whatever definition best fits the application at hand will be chosen for that purpose, and anything that falls under it will be called a ‘relation’ for the duration of that discussion. If it becomes necessary to distinguish the two alternatives, the latter type of object can be referred to as an embedded or included relation.

If L is a relation over the domains X1, …, Xk, it is conventional to consider a sequence of terms called variables, x1, …, xk, that are said to range over the respective domains.

A boolean domain B is a generic 2-element set, say, B = {0, 1}, whose elements are interpreted as logical values, typically 0 = false and 1 = true.

The characteristic function of the relation L, written fL or χ(L), is the boolean-valued function fL : X1 × … × Xk → B, defined in such a way that fL( ) = 1 just in case the k-tuple is in the relation L. The characteristic function of a relation may also be called its indicator function, especially in probability and statistics.

It is conventional in applied mathematics, computer science, and statistics to refer to a boolean-valued function like fL as a k-place predicate. From the more abstract viewpoints of formal logic and model theory, the relation L is seen as constituting a logical model or a relational structure that serves as one of many possible interpretations of a corresponding k-place predicate symbol, as that term is used in predicate calculus.

Due to the convergence of many different styles of study on the same areas of interest, the reader will find much variation in usage here. The variation presented in this article treats a relation as the set-theoretic extension of a relational concept or term. Another variation reserves the term ‘relation’ to the corresponding logical entity, either the logical comprehension, which is the totality of intensions or abstract properties that all of the elements of the relation in extension have in common, or else the symbols that are taken to denote these elements and intensions. Further, but hardly finally, some writers of the latter persuasion introduce terms with more concrete connotations, like ‘relational structure’, for the set-theoretic extension of a given relational concept.

Example: coplanarity

For lines L in three-dimensional space, there is a ternary relation picking out the triples of lines that are coplanar. This does not reduce to the binary symmetric relation of coplanarity of pairs of lines.

In other words, writing P(L, M, N) when the lines L, M, and N lie in a plane, and Q(L, M) for the binary relation, it is not true that Q(L, M), Q(M, N) and Q(N, L) together imply P(L, M, N); although the converse is certainly true (any pair out of three coplanar lines is coplanar, a fortiori). There are two geometrical reasons for this.

In one case, for example taking the x-axis, y-axis and z-axis, the three lines are concurrent, i.e. intersect at a single point. In another case, L, M, and N can be three edges of an infinite triangular prism.

What is true is that if each pair of lines intersects, and the points of intersection are distinct, then pairwise coplanarity implies coplanarity of the triple.

 

Relations are classified according to the number of sets in the cartesian product, in other words the number of terms in the expression:

* Unary relation or property: L(u)

* Binary relation: L(u, v) or u L v

* Ternary relation: L(u, v, w)

* Quaternary relation: L(u, v, w, x)

Relations with more than four terms are usually referred to as k-ary, for example, “a 5-ary relation”.

 

String operation

 

In computer science, in the area of formal language theory, frequent use is made of a variety of string functions; however, the notation used is different from that used on computer programming, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.

Alphabet of a string

The alphabet of a string is a list of all of the letters that occur in a particular string. If s is a string, its alphabet is denoted by

 

String substitution

Let L be a language, and let be its alphabet. A string substitution or simply a substitution is a mapping f that maps letters in to languages (possibly in a different alphabet). Thus, for example, given a letter , one has where is some language whose alphabet is . This mapping may be extended to strings as

 

for the empty string , and

 

for string . String substitution may be extended to the entire language as

 

 

An example of string substitution occurs in regular languages, which are closed under string substitution. That is, if the letters of a regular language are substituted by other regular languages, the result is still a regular language.

String homomorphism

A string homomorphism is a string substitution such that each letter is replaced by a single string. That is, , where s is a string, for each letter a. String homomorphisms are homomorphisms, preserving the binary operation of string concatenation. Given a language L, the set is called the homomorphic image of L. The inverse homomorphic image of a string s is defined as

 

while the inverse homomorphic image of a language L is defined as

 

Note that, in general, , while one does have

 

and

 

for any language L. Simple single-letter substitution ciphers are examples of string homomorphisms.

String projection

If s is a string, and is an alphabet, the string projection of s is the string that results by removing all letters which are not in . It is written as . It is formally defined by removal of letters from the right hand side:

 

Here denotes the empty string. The projection of a string is essentially the same as a projection in relational algebra.

String projection may be promoted to the projection of a language. Given a formal language L, its projection is given by

 

Right quotient

The right quotient of a letter a from a string s is the truncation of the letter a in the string s, from the right hand side. It is denoted as . If the string does not have a on the right hand side, the result is the empty string. Thus:

 

The quotient of the empty string may be taken:

 

Similarly, given a subset of a monoid , one may define the quotient subset as

 

Left quotients may be defined similarly, with operations taking place on the left of a string.

Syntactic relation

The right quotient of a subset of a monoid defines an equivalence relation, called the right syntactic relation of S. It is given by

 

The relation is clearly of finite index (has a finite number of equivalence classes) if and only if the family right quotients is finite; that is, if

 

is finite. In this case, S is a recognizable language, that is, a language that can be recognized by a finite state automaton. This is discussed in greater detail in the article on syntactic monoids.

Right cancellation

The right cancellation of a letter a from a string s is the removal of the first occurrence of the letter a in the string s, starting from the right hand side. It is denoted as and is recursively defined as

 

The empty string is always cancellable:

 

Clearly, right cancellation and projection commute:

 

Prefixes

The prefixes of a string is the set of all prefixes to a string, with respect to a given language:

 

The prefix closure of a language is

 

A language is called prefix closed if . Clearly, the prefix closure operator is idempotent:

 

The prefix relation is a binary relation such that if and only if .

Prefix grammars generate languages that are prefix-closed (with respect to the grammar!).

 

LANGUGE OPERATION

 

In computer science, an operation is called an instruction

In computer science, an instruction is a single operation of a processor defined by an instruction set architecture. In a broader sense, an “instruction” may be any representation of an element of an executable program, such as a bytecode.

On traditional architectures, an instruction includes an opcode specifying the operation to be performed, such as “add contents of memory to register”, and zero or more operand specifiers, which may specify registers, memory locations, or literal data. The operand specifiers may have addressing modes determining their meaning or may be in fixed fields.

In very long instruction word (VLIW) architectures, which include many microcode architectures, multiple simultaneous operations and operands are specified in a single instruction.

The size or length of an instruction varies widely, from as little as four bits in some microcontrollers to many hundreds of bits in some VLIW systems. Most modern processors used in personal computers, mainframes, and supercomputers have instruction sizes between 16 and 64 bits. In some architectures, notably most Reduced Instruction Set Computers, instructions are a fixed length, typically corresponding with that architecture’s word size. In other architectures, instructions have variable length, typically integral multiples of a byte or a halfword.

The instructions constituting a program are rarely specified using their internal, numeric form; they may be specified by programmers using an assembly language or, more commonly, may be generated by compilers.

 

 

 

 

 

 

 

 

 

 

 

 

Finite State Machine (FMS)

 

A finite state machine (FSM) or finite state automaton (plural: automata) or simply a state machine is a model of behavior composed of a finite number of states, transitions between those states, and actions.

Concepts and vocabulary

A state stores information about the past, i.e. it reflects the input changes from the system start to the present moment. A transition indicates a state change and is described by a condition that would need to be fulfilled to enable the transition. An action is a description of an activity that is to be performed at a given moment. There are several action types:

Entry action: which is performed when entering the state

Exit action: which is performed when exiting the state

Input action: which is performed depending on present state and input conditions

Transition action: which is performed when performing a certain transition

A FSM can be represented using a state diagram (or state transition diagram) as in figure 1 above. Besides this, several state transition table types are used. The most common representation is shown below: the combination of current state (B) and condition (Y) shows the next state (C). The complete actions information can be added only using footnotes. An FSM definition including the full actions information is possible using state tables (see also VFSM).

State transition table

   Current State ->
Condition

State A

State B

State C

Condition X

Condition Y

State C

Condition Z

In addition to their use in modeling reactive systems presented here, finite state automata are significant in many different areas, including electrical engineering, linguistics, computer science, philosophy, biology, mathematics, and logic. A complete survey of their applications is outside the scope of this article. Finite state machines are a class of automata studied in automata theory and the theory of computation. In computer science, finite state machines are widely used in modeling of application behavior, design of hardware digital systems, software engineering, compilers, network protocols, and the study of computation and languages.

Classification

There are two different groups: Acceptors/Recognizers and Transducers.

 

 

 

Acceptors and recognizers

Fig. 2 Acceptor FSM: parsing the word “nice”

Acceptors and recognizers (also sequence detectors) produce a binary output, saying either yes or no to answer whether the input is accepted by the machine or not. All states of the FSM are said to be either accepting or not accepting. At the time when all input is processed, if the current state is an accepting state, the input is accepted; otherwise it is rejected. As a rule the input are symbols (characters); actions are not used. The example in figure 2 shows a finite state machine which accepts the word “nice”, in this FSM the only accepting state is number 7.

The machine can also be described as defining a language, which would contain every word accepted by the machine but none of the rejected ones; we say then that the language is accepted by the machine. By definition, the languages accepted by FSMs are the regular languages – that is, a language is regular if there is some FSM that accepts it (cf. Kleene’s Theorem).

Start state

The start state is usually shown drawn with an arrow “pointing at it from nowhere” (Sipser (2006) p.34)

Accept state

An accept state (sometimes referred to as an accepting state) is a state at which the machine has successfully performed its procedure. It is usually represented by a double circle.

An example of an accepting state appears on the left in this diagram of a deterministic finite automaton which determines if the binary input contains an even number of 0s:

S1 (which is also the start state) indicates the state at which an even number of 0s has been input and is therefore defined as an accepting state.

 

 

 

 

 

Transducers

Transducers generate output based on a given input and/or a state using actions. They are used for control applications. Here two types are distinguished:

Fig. 3 Transducer FSM: Moore model example

Moore machine: The FSM uses only entry actions, i.e. output depends only on the state. The advantage of the Moore model is a simplification of the behaviour. The example in figure 3 shows a Moore FSM of an elevator door. The state machine recognizes two commands: “command_open” and “command_close” which trigger state changes. The entry action (E:) in state “Opening” starts a motor opening the door, the entry action in state “Closing” starts a motor in the other direction closing the door. States “Opened” and “Closed” don’t perform any actions. They signal to the outside world (e.g. to other state machines) the situation: “door is open” or “door is closed”.

 

Fig. 4 Transducer FSM: Mealy model example

Mealy machine: The FSM uses only input actions, i.e. output depends on input and state. The use of a Mealy FSM leads often to a reduction of the number of states. The example in figure 4 shows a Mealy FSM implementing the same behaviour as in the Moore example (the behaviour depends on the implemented FSM execution model and will work e.g. for virtual FSM but not for event driven FSM). There are two input actions (I:): “start motor to close the door if command_close arrives” and “start motor in the other direction to open the door if command_open arrives”.

In practice mixed models are often used.

More details about the differences and usage of Moore and Mealy models, including an executable example, can be found in the external technical note “Moore or Mealy model?”

A further distinction is between deterministic (DFA) and non-deterministic (NDFA, GNFA) automata. In deterministic automata, for each state there is exactly one transition for each possible input. In non-deterministic automata, there can be none or more than one transition from a given state for a given possible input. This distinction is relevant in practice, but not in theory, as there exists an algorithm which can transform any NDFA into an equivalent DFA, although this transformation typically significantly increases the complexity of the automaton.

The FSM with only one state is called a combinatorial FSM and uses only input actions. This concept is useful in cases where a number of FSM are required to work together, and where it is convenient to consider a purely combinatorial part as a form of FSM to suit the design tools.

FSM logic

Fig. 5 FSM Logic

The next state and output of an FSM is a function of the input and of the current state. The FSM logic is shown in Figure 5.

Mathematical model

Depending on the type there are several definitions. An acceptor finite state machine is a quintuple , where:

  • is the input alphabet (a finite non empty set of symbols).
  • is a finite non empty set of states.
  • is an initial state, an element of . In a Nondeterministic finite state machine, is a set of initial states.
  • is the state transition function: .
  • is the set of final states, a (possibly empty) subset of .

 

 

A transducer finite state machine is a six tuple , where:

  • is the input alphabet (a finite non empty set of symbols).
  • is the output alphabet (a finite non empty set of symbols).
  • is a finite non empty set of states.
  • is the initial state, an element of . In a Nondeterministic finite state machine, is a set of initial states.
  • is the state transition function: .
  • is the output function.

If the output function is a function of a state and input alphabet ( ) that definition corresponds to the Mealy model, and can be modelled as a Mealy machine. If the output function depends only on a state ( ) that definition corresponds to the Moore model, and can be modelled as a Moore machine. A finite state machine with no output function at all is known as a semiautomaton or transition system.

Optimization

Optimizing an FSM means finding the machine with the minimum number of states that performs the same function. One possibility is by using an Implication table or the Moore reduction procedure. Another possibility is bottom-up algorithm for Acyclic FSAs.

 

Software applications

The following concepts are commonly used to build software applications with finite state machines:

History

Starting in the 1970s, Leslie Lamport, an early leader within the distributed systems research community, used finite state machines as the basis for an algorithm he called state machine replication. In this approach, a deterministic computer program or service is replaced with a set of replicas that use some form of atomic broadcast to perform operations in a manner tolerant of failures.

 

 

 

 

 

 

Deterministic finite-state machine

 

In the theory of computation, a deterministic finite state machine or deterministic finite automaton (DFA) is a finite state machine where for each pair of state and input symbol there is one and only one transition to a next state. DFAs recognize the set of regular languages and no other languages.

A DFA will take in a string of input symbols. For each input symbol it will then transition to a state given by following a transition function. When the last input symbol has been received it will either accept or reject the string depending on whether the DFA is in an accepting state or a non-accepting state.

 

Formal definition

A DFA is a 5-tuple, (S, Σ, T, s, A), consisting of

  • a finite set of states (S)
  • a finite set called the alphabet (Σ)
  • a transition function (T : S × Σ → S)
  • a start state (s S)
  • a set of accept states (A S)

Let M be a DFA such that M = (S, Σ, T, s, A), and X = x0x1 … xn be a string over the alphabet Σ. M accepts the string X if a sequence of states, r0,r1, …, rn, exists in S with the following conditions:

  1. r0 = s
  2. ri+1 = T(ri, xi), for i = 0, …, n-1
  3. rn A.

As shown in the first condition, the machine starts in the start state s. The second condition says that given each character of string X, the machine will transition from state to state as ruled by the transition function T. The last condition says that the machine accepts if the last input of X causes the machine to be in one of the accepting states. Otherwise, it is said to reject the string. The set of strings it accepts form a language, which is the language the DFA recognises.

A DFA without a list of accept states and without a designated starting state is known as a transition system or semiautomaton.

Example

The following example is of a DFA M, with a binary alphabet, which requires that the input contains an even number of 0s.

M = (S, Σ, T, s, A) where

 

0

1

S1

S2

S1

S2

S1

S2

Simply put, the state S1 represents that there has been an even number of 0s in the input so far, while S2 signifies an odd number. A 1 in the input does not change the state of the automaton. When the input ends, the state will show whether the input contained an even number of 0s or not.

The language of M is the regular language given by the regular expression (1*(0(1)*0)*)*

Advantages and disadvantages

DFAs are one of the most practical models of computation, since there is a trivial linear time, constant-space, online algorithm to simulate a DFA on a stream of input. Given two DFAs there are efficient algorithms to find a DFA recognizing the union, intersection, and complements of the languages they recognize. There are also efficient algorithms to determine whether a DFA accepts any strings, whether a DFA accepts all strings, whether two DFAs recognize the same language, and to find the DFA with a minimum number of states for a particular regular language.

DFAs are equivalent in computing power to nondeterministic finite automata.

On the other hand, DFAs are of strictly limited power in the languages they can recognize — many simple languages, including any problem that requires more than constant space to solve, cannot be recognized by a DFA. The classical example of a simply described language that no DFA can recognize is bracket language, that is language that consists of properly paired brackets, such as (()()). More formally the language consisting of strings of the form anbn — some finite number of a’s, followed by an equal number of b’s. It can be shown that no DFA can have enough states to recognize such a language.

 

Nondeterministic Finite Automata

In the theory of computation, a nondeterministic finite state machine or nondeterministic finite automaton (NFA) is a finite state machine where for each pair of state and input symbol there may be several possible next states. This distinguishes it from the deterministic finite automaton (DFA), where the next possible state is uniquely determined. Although the DFA and NFA have distinct definitions, it may be shown in the formal theory that they are equivalent, in that, for any given NFA, one may construct an equivalent DFA, and vice-versa: this is the powerset construction. Both types of automata recognize only regular languages. Non-deterministic finite state machines are sometimes studied by the name subshifts of finite type. Non-deterministic finite state machines are generalized by probabilistic automata, which assign a probability to each state transition.

Nondeterministic finite automata were introduced by Michael O. Rabin and Dana Scott in 1959[1], who showed their equivalence to deterministic automata.

Intuitive introduction

An NFA, just as a DFA, consumes a string of input symbols. For each input symbol it transforms to a new state until all input symbols have been consumed.

Unlike a DFA, it is non-deterministic in that, for any input symbol, its next state is not uniquely determined, but may be any one of several possible states. Thus, in the formal definition, it is common to talk of a subset of the power set of states: the transition is not so much to a single state, but to some subset of all possible states, considered at once.

An extension of the NFA is the NFA-lambda (also known as NFA-epsilon or the NFA with epsilon moves), which allows a transform to a new state without consuming any input symbols. For example, if it is in state 1, with the next input symbol an a, it can move to state 2 without consuming any input symbols, and thus one has an ambiguity: is the system in state 1, or state 2, before consuming the letter a? Because of this ambiguity, it is more convenient to talk of the set of possible states the system may be in. Thus, before consuming letter a, the NFA-epsilon may be in any one of the states out of the set {1,2}. Equivalently, one may imagine that the NFA is in state 1 and 2 ‘at the same time’: and this gives an informal hint of the powerset construction: the DFA equivalent to an NFA is defined as the one that is in the state q={1,2}. Transformations to new states without consuming an input symbol are called lambda transitions or epsilon transitions. They are usually labelled with the Greek letter λ or ε.

The notion of accepting an input is similar to that for the DFA. When the last input symbol is consumed, the NFA accepts if and only if there is some set of transitions that will take it to an accepting state. Equivalently, it rejects, if, no matter what transitions are applied, it would not end in an accepting state.

Formal definition

Two similar types of NFA’s are commonly defined: the NFA and the NFA with ε-moves. The ordinary NFA is defined as a 5-tuple, (Q, Σ, T, q0, F), consisting of

  • a finite set of states Q
  • a finite set of input symbols Σ
  • a transition function T : Q × Σ → P(Q).
  • an initial (or start) state q0 such that q0 Q
  • a set of states F distinguished as accepting (or final) states F Q.

Here, P(Q) denotes the power set of Q. The NFA with ε-moves (also sometimes called NFA-epsilon or NFA-lambda) replaces the transition function with one that allows the empty string ε as a possible input, so that one has instead

T : Q × (Σ {ε}) → P(Q).

 

It can be shown that ordinary NFA and NFA with epsilon moves are equivalent, in that, given either one, one can construct the other, which recognizes the same language.

Properties

The machine starts in an arbitrary initial state and reads in a string of symbols from its alphabet. The automaton uses the state transition function T to determine the next state using the current state, and the symbol just read or the empty string. However, “the next state of an NFA depends not only on the current input event, but also on an arbitrary number of subsequent input events. Until these subsequent events occur it is not possible to determine which state the machine is in” [2]. If, when the automaton has finished reading, it is in an accepting state, the NFA is said to accept the string, otherwise it is said to reject the string.

The set of all strings accepted by an NFA is the language the NFA accepts. This language is a regular language.

For every NFA a deterministic finite state machine (DFA) can be found that accepts the same language. Therefore it is possible to convert an existing NFA into a DFA for the purpose of implementing a (perhaps) simpler machine. This can be performed using the powerset construction, which may lead to an exponential rise in the number of necessary states. A formal proof of the powerset construction is given here.

For non-deterministic automata with a countably infinite set of states, the powerset construction gives a deterministic automaton with a continuum of states (since the powerset of the countable infinity is the continuum: ). In such a case, the set of states must be endowed with a topology in order for the state transitions to be meaningful. Such systems are studied as topological automata.

Properties of NFA-ε

For all one writes if and only if can be reached from by going along zero or more arrows. In other words, if and only if there exists where such that

.

For any , the set of states that can be reached from p is called the epsilon-closure or ε-closure of p, and is written as

.

For any subset , define the ε-closure of P as

.

The epsilon-transitions are transitive, in that it may be shown that, for all and , if and , then .

Similarly, if and then

Let x be a string over the alphabet Σ. An NFA-ε M accepts the string x if there exist both a representation of x of the form x1x2 … xn, where xi {ε}), and a sequence of states p0,p1, …, pn, where pi Q, meeting the following conditions:

  1. p0 E({q0})
  2. pi E(T(pi-1, xi )) for i = 1, …, n
  3. pn F.

In particular, note that some of the letters xi can be {ε}; they are chosen not from Σ alone, but from Σ {ε}.

Implementation

There are many ways to implement a NFA:

  • Convert to the equivalent DFA. In some cases this may cause exponential blowup in the size of the automaton and thus auxiliary space proportional to the number of states in the NFA (as storage of the state value requires at most one bit for every state in the NFA)
  • Keep a set data structure of all states which the machine might currently be in. On the consumption of the last input symbol, if one of these states is a final state, the machine accepts the string. In the worst case, this may require auxiliary space proportional to the number of states in the NFA; if the set structure uses one bit per NFA state, then this solution is exactly equivalent to the above.
  • Create multiple copies. For each n way decision, the NFA creates up to copies of the machine. Each will enter a separate state. If, upon consuming the last input symbol, at least one copy of the NFA is in the accepting state, the NFA will accept. (This, too, requires linear storage with respect to the number of NFA states, as there can be one machine for every NFA state.)
  • Explicitly propagate tokens through the transition structure of the NFA and match whenever a token reaches the final state. This is sometimes useful when the NFA should encode additional context about the events that triggered the transition. (For an implementation that uses this technique to keep track of object references have a look at Tracematches.)

Example

The following example explains a NFA M, with a binary alphabet, which determines if the input contains an even number of 0s or an even number of 1s. Let M = (Q, Σ, T, s0, F) where

  • Σ = {0, 1},
  • Q = {s0, s1, s2, s3, s4},
  • E({s0}) = { s0, s1, s3 }
  • F = {s1, s3}, and
  • The transition function T can be defined by this state transition table:

 

0

1

ε

S0

{}

{}

{S1, S3}

S1

{S2}

{S1}

{}

S2

{S1}

{S2}

{}

S3

{S3}

{S4}

{}

S4

{S4}

{S3}

{}

 

 

 

 

The state diagram for M is:

M can be viewed as the union of two DFAs: one with states {S2, S1} and the other with states {S3, S4}.

The language of M can be described by the regular language given by this regular expression:

 

Application of NFA-ε

NFA’s and DFA’s are equivalent in that if a language is recognized by an NFA, it is also recognized by a DFA and vice versa. The establishment of such equivalence is important and useful. It is useful because constructing an NFA to recognize a given language is sometimes much easier than constructing a DFA for that language. It is important because NFA’s can be used to reduce the complexity of the mathematical work required to establish many important properties in the theory of computation. For example, it is much easier to prove the following properties using NFA’s than DFA’s:

(i) The union of two regular languages is regular. ( See formal proof. )

(ii) The concatenation of two regular languages is regular. ( See formal proof. )

(iii) The Kleene Closure of a regular language is regular. ( See formal proof. )

 

 

In computing, a regular expression is a string that is used to describe or match a set of strings, according to certain syntax rules.

Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. For example, Perl and Tcl have a powerful regular expression engine built directly into their syntax. Several utilities provided by Unix distributions—including the editor ed and the filter grep—were the first to popularize the concept of regular expressions. “Regular expression” is often shortened to regex or regexp (singular), or regexes, regexps, or regexen (plural). Some authors distinguish between regular expression and regexp, restricting the former to true regular expressions, which describe regular languages, while using the latter for any regular expression-like pattern, including those that describe languages that are not regular. As only some authors observe this distinction, it is not safe to rely upon it.

As an example of the syntax, the regular expression \bex can be used to search for all instances of the string “ex” that occur at word boundaries (signified by the \b). Thus in the string, “Texts for experts,” \bex matches the “ex” in “experts,” but not in “Texts” (because the “ex” occurs inside the word there and not immediately after a word boundary).

Many modern computing systems provide wildcard characters in matching filenames from a file system. This is a core capability of many command-line shells and is also known as globbing. Wildcards differ from regular expressions in that they generally only express very restrictive forms of alternation.

 

Basic concepts

A regular expression, often called a pattern, is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements. For example, the set containing the three strings “Handel“, “Händel“, and “Haendel” can be described by the pattern H(ä|ae?)ndel (or alternatively, it is said that the pattern matches each of the three strings). In most formalisms, if there is any regex that matches a particular set then there is an infinite number of such expressions. Most formalisms provide the following operations to construct regular expressions.

Alternation

A vertical bar separates alternatives. For example, gray|grey can match “gray” or “grey“.

 

Grouping

Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of “gray” and “grey“.

 

Quantification

A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are ?, *, and +.

?

The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both “color” and “colour“.

*

The asterisk indicates there are zero or more of the preceding element. For example, ab*c matches “ac“, “abc“, “abbc“, “abbbc“, and so on.

+

The plus sign indicates that there is one or more of the preceding element. For example, ab+c matches “abc“, “abbc“, “abbbc“, and so on, but not “ac“.

These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations +, -, *, and /. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.

The precise syntax for regular expressions varies among tools and with context; more detail is given in the Syntax section.

 

History

The origins of regular expressions lie in automata theory and formal language theory, both of which are part of theoretical computer science. These fields study models of computation (automata) and ways to describe and classify formal languages. In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets. Ken Thompson built this notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep’s use of regular expressions (“grep” is a word derived from the command for regular expression searching in the ed editor: g/re/p where re stands for regular expression). Since that time, many variations of Thompson’s original adaptation of regular expressions have been widely used in Unix and Unix-like utilities including expr, AWK, Emacs, vi, and lex.

Perl and Tcl regular expressions were derived from a regex library written by Henry Spencer, though Perl later expanded on Spencer’s library to add many new features.[1] Philip Hazel developed PCRE (Perl Compatible Regular Expressions), which attempts to closely mimic Perl’s regular expression functionality, and is used by many modern tools including PHP and Apache HTTP Server. Part of the effort in the design of Perl 6 is to improve Perl’s regular expression integration, and to increase their scope and capabilities to allow the definition of parsing expression grammars.[2] The result is a mini-language called Perl 6 rules, which are used to define Perl 6 grammar as well as provide a tool to programmers in the language. These rules maintain existing features of Perl 5.x regular expressions, but also allow BNF-style definition of a recursive descent parser via sub-rules.

The use of regular expressions in structured information standards for document and database modeling started in the 1960s and expanded in the 1980s when industry standards like ISO SGML (precursored by ANSI “GCA 101-1983″) consolidated. The kernel of the structure specification language standards are regular expressions. Simple use is evident in the DTD element group syntax.

Formal language theory

Regular expressions can be expressed in terms of formal language theory. Regular expressions consist of constants and operators that denote sets of strings and operations over these sets, respectively. Given a finite alphabet Σ the following constants are defined:

·         (empty set) denoting the set

·         (empty string) ε denoting the set {ε}

·         (literal character) a in Σ denoting the set {a}

 

 

The following operations are defined:

·         (concatenation) RS denoting the set { αβ | α in R and β in S }. For example {“ab”|”c”}{“d”|”ef”} = {“abd”, “abef”, “cd”, “cef”}.

·         (alternation) R|S denoting the set union of R and S.

·         (Kleene star) R* denoting the smallest superset of R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating zero or more strings in R. For example, {“ab”, “c”}* = {ε, “ab”, “c”, “abab”, “abc”, “cab”, “cc”, “ababab”, … }.

The above constants and operators form a Kleene algebra.

Many textbooks use the symbols , +, or for alternation instead of the vertical bar.

To avoid brackets it is assumed that the Kleene star has the highest priority, then concatenation and then set union. If there is no ambiguity then brackets may be omitted. For example, (ab)c can be written as abc, and a|(b(c*)) can be written as a|bc*.

Examples:

·         a|b* denotes {ε, a, b, bb, bbb, …}

·         (a|b)* denotes the set of all strings consisting of any number of a and b symbols, including the empty string

·         ab*(c|ε) denotes the set of strings starting with a, then zero or more bs and finally optionally a c.

The formal definition of regular expressions is purposely parsimonious and avoids defining the redundant quantifiers ? and +, which can be expressed as follows: a+ = aa*, and a? = (a|ε). Sometimes the complement operator ~ is added; ~R denotes the set of all strings over Σ* that are not in R. The complement operator is redundant, as it can always be expressed by using the other operators (although the process for computing such a representation is complex, and the result may be exponentially larger).

Regular expressions in this sense can express the regular languages, exactly the class of languages accepted by finite state automata. There is, however, a significant difference in compactness. Some classes of regular languages can only be described by automata that grow exponentially in size, while the length of the required regular expressions only grow linearly. Regular expressions correspond to the type-3 grammars of the Chomsky hierarchy. On the other hand, there is a simple mapping between regular expressions and nondeterministic finite automata (NFAs) that does not lead to such a blowup in size; for this reason NFAs are often used as alternative representations of regular expressions.

We can also study expressive power within the formalism. As the examples show, different regular expressions can express the same language: the formalism is redundant.

It is possible to write an algorithm which for two given regular expressions decides whether the described languages are essentially equal, reduces each expression to a minimal deterministic finite state machine, and determines whether they are isomorphic (equivalent).

To what extent can this redundancy be eliminated? Can we find an interesting subset of regular expressions that is still fully expressive? Kleene star and set union are obviously required, but perhaps we can restrict their use. This turns out to be a surprisingly difficult problem. As simple as the regular expressions are, it turns out there is no method to systematically rewrite them to some normal form. The lack of axiomatization in the past led to the star height problem. Recently, Dexter Kozen axiomatized regular expressions with Kleene algebra.

It is worth noting that many real-world “regular expression” engines implement features that cannot be expressed in the regular expression algebra; see below for more on this.

 

 

Syntax

Traditional Unix regular expressions

The basic Unix regular expression syntax is now defined as obsolete by POSIX, but is still widely used for backwards compatibility. Many regular-expression-aware Unix utilities including grep and sed use it by default while providing support for extended regular expressions with command line arguments.

In the basic syntax, most characters are treated as literals — they match only themselves (i.e., a matches “a“). The exceptions, listed below, are called metacharacters.

.

Matches any single character except newline. Within square brackets, the dot character matches a literal dot. For example, a.c matches “abc“, etc., but [a.c] matches only “a“, “.“, or “c“.

[ ]

Matches a single character that is contained within the brackets. For example, [abc] matches “a“, “b“, or “c“. [a-z] specifies a range which matches any lowercase letter from “a” to “z“. These forms can be mixed: [abcx-z] matches “a“, “b“, “c“, “x“, “y“, and “z“, as does [a-cx-z]. The - character is treated as a literal character if it is the last or the first character within the brackets, or if it is escaped with a backslash: [abc-], [-abc], or [a\-bc]. The [ character can be included anywhere within the brackets. To match the ] character, the easiest way is to escape it with a backslash, e.g., [\]]. Some tools allow you to avoid the backslash if the closing bracket is first in the enclosing square brackets, e.g., [][ab] matches “]“, “[", "a", or "b".

[^ ]

Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". As above, literal characters and ranges can be mixed.

^

Matches the starting position within the string. In multiline mode, it matches the starting position of any line.

$

Matches the ending position of the string or the position just before a string-terminating newline. In multiline mode, it matches the ending position of any line.

( )

Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n. A marked subexpression is also called a block or capturing group. This feature is not found in all instances of regular expressions, and in many Unix utilities including sed and vi, a backslash must precede the open and close parentheses for them to be interpreted with special meaning.

\n

Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is theoretically irregular and was not adopted in the POSIX extended regular expression (ERE) syntax. Some tools allow referencing more than nine capturing groups.

*

Matches the preceding element zero or more times. For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. (ab)* matches "", "ab", "abab", "ababab", and so on.

{m,n}

Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only "aaa", "aaaa", and "aaaaa". This is not found in a few, older instances of regular expressions.

Note that particular implementations of regular expressions interpret backslash differently in front of some of the metacharacters. For example, egrep and Perl interpret non-backslashed parentheses, vertical bars, and curly brackets as metacharacters, reserving the backslashed versions to mean the literal characters themselves. Old versions of grep did not support the alternation operator |.

Examples:

·         .at matches any three-character string ending with "at", including "hat", "cat", and "bat".

·         [hc]at matches "hat" and "cat".

·         [^b]at matches all strings matched by .at except "bat".

·         ^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.

·         [hc]at$ matches "hat" and "cat", but only at the end of the string or line.

POSIX extended regular expressions

POSIX extended regular expressions (ERE), as defined by IEEE POSIX standard 1003.2, are similar in syntax to the traditional Unix regular expressions, with some exceptions. The following metacharacters are added:

?

Matches the preceding element zero or one time. For example, ba? matches "b" or "ba".

+

Matches the preceding element one or more times. For example, ba+ matches "ba", "baa", "baaa", and so on.

|

The choice (aka alternation or set union) operator matches either the expression before or the expression after the operator. For example, abc|def matches "abc" or "def".

Additionally, some backslashes are removed: \{...\} becomes {...}, and \(...\) becomes (...).

Examples:

·         [hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at".

·         [hc]?at matches "hat", "cat", and "at".

·         cat|dog matches "cat" or "dog".

Since the characters (, ), [, ], ., *, ?, +, ^, |, and $ are used as special symbols, they have to be escaped if they are meant literally. This is done by preceding them with \, which therefore also has to be escaped this way if meant literally.

Extended POSIX regular expressions can often be used with modern Unix utilities by including the command line flag -E.

POSIX bracket expressions

Since many ranges of characters depend on the chosen locale setting (i.e., in some settings letters are organized as abc...zABC...Z, while in some others as aAbBcC...zZ), the POSIX standard defines some classes or categories of characters as shown in the following table:

POSIX

ASCII

Description

[:alnum:]

[A-Za-z0-9]

Alphanumeric characters

[:alpha:]

[A-Za-z]

Alphabetic characters

[:blank:]

[ \t]

Space and tab

[:cntrl:]

[\x00-\x1F\x7F]

Control characters

[:digit:]

[0-9]

Digits

[:graph:]

[\x21-\x7E]

Visible characters

[:lower:]

[a-z]

Lowercase letters

[:print:]

[\x20-\x7E]

Visible characters and spaces

[:punct:]

[!"#$%&'()*+,-./:;?@[\\\]_`{|}~]

Punctuation characters

[:space:]

[ \t\r\n\v\f]

Whitespace characters

[:upper:]

[A-Z]

Uppercase letters

[:xdigit:]

[A-Fa-f0-9]

Hexadecimal digits

For example, [[:upper:]ab] matches the uppercase letters and lowercase "a" and "b".

In Perl regular expressions, [:print:] matches [:graph:] union [:space:]. An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers. The editor Vim further distinguishes word and word-head classes (using the notation \w and \h) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions.

Perl-derivative regular expressions

Perl has a richer and more predictable syntax than the POSIX basic (BRE) and extended (ERE) regular expression standards. An example of its predictability is that \ always escapes a non-alphanumeric character. An example of functionality possible through Perl but not POSIX-compliant regular expressions is the concept of lazy quantification (see the next section).

Due largely to its expressive power, many other utilities have adopted syntaxes very similar to Perl's — for example, Java, JavaScript, PCRE, Python, Ruby, and Microsoft's .NET Framework all use regular expression syntax similar to Perl's. Some languages and tools such as PHP support multiple regular expression engine types. Perl-derivative regular expression implementations are not identical, and many implement only a subset of Perl's features. With Perl 5.9.x (development track for Perl 5.10), this process has come full circle with Perl incorporating syntax extensions originally from Python, the .NET Framework, and Java.

Lazy quantification

The standard quantifiers in regular expressions are greedy, meaning they match as much as they can, only giving back as necessary to match the remainder of the regex. For example, someone new to regexes wishing to find the first instance of an item in double brackets in this example of MediaWiki syntax:

Another whale explosion occurred on January 26, 2004.

...would very likely come up with the pattern \[\[.*]], or similar (note that the opening square brackets must be escaped with backslashes). However, this pattern will actually return "January 26, 2004" instead of the "January 26" which might be expected, because the * quantifier is greedy — it will consume as many characters as possible from the input, and "January 26]], 2004" has more characters than "January 26".

Though this problem can be avoided in a number of ways (e.g., by specifying the text that is not to be matched: \[\^\*), modern regular expression tools allow a quantifier to be specified as non-greedy (also known as lazy, reluctant, or minimal) by putting a question mark after the quantifier (e.g., \[\[.*?]]). By using a lazy quantifier, the expression tries the minimal match first. Though in the previous example lazy matching is used to select one of many matching results, in some cases it can also be used to improve performance when greedy matching would require more backtracking.

Patterns for non-regular languages

Many patterns provide an expressive power that far exceeds the regular languages. For example, the ability to group subexpressions with parentheses and recall the value they match in the same expression means that a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory. The pattern for these strings is (.*)\1. However, the language of squares is not regular, nor is it context-free. Pattern matching with an unbounded number of back references, as supported by numerous modern tools, is NP-hard.

However, many tools, libraries, and engines that provide such constructions still use the term regular expression for their patterns. This has led to a nomenclature where the term regular expression has different meanings in formal language theory and pattern matching. For this reason, some people have taken to using the term regex or simply pattern to describe the latter. Larry Wall (author of Perl) writes in Apocalypse 5:

 

'Regular expressions' [...] are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).

 

Implementations and running times

There are at least two fundamentally different algorithms that decide if and how a given regular expression matches a string.

The oldest and fastest relies on a result in formal language theory that allows every nondeterministic finite state machine (NFA) to be transformed into a deterministic finite state machine (DFA). The algorithm performs or simulates this transformation and then runs the resulting DFA on the input string, one symbol at a time. The latter process takes time linear to the length of the input string. More precisely, an input string of size n can be tested against a regular expression of size m in time O(n+2m) or O(nm), depending on the details of the implementation. This algorithm is often referred to as DFA. It is fast, but can only be used for matching and not for recalling grouped subexpressions, lazy quantification, and several other features commonly found in modern regular expression libraries.

The other algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponential number of sub-cases. More complex implementations will often identify and speed up or abort common cases where they would otherwise run slowly.

Although backtracking implementations give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power. For example, any implementation which allows the use of backreferences, or implements the various improvements introduced by Perl, must use a backtracking implementation.

Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.

Regular expressions and Unicode

Regular expressions were originally used with ASCII characters. Many regular expression engines can now handle Unicode. In most respects it makes no difference what the character set is, but some issues do arise when extending regular expressions to support Unicode.

One issue is which Unicode format is supported. Some regular expression libraries expect the UTF-8 encoding, while others might expect UTF-16, obsolete UCS-2, or UTF-32.

A second issue is whether the full Unicode range is supported. Many regular expression engines support only the Basic Multilingual Plane, that is, the characters which can be encoded with only 16 bits. Currently, only a few regular expression engines can handle the full 21-bit Unicode range.

A third issue is variation in how ASCII-oriented constructs are extended to Unicode. For example, in ASCII-based implementations, character ranges of the form [x-y] are valid wherever x and y are codepoints in the range [0x00,0x7F] and codepoint(x) ≤ codepoint(y). The natural extension of such character ranges to Unicode would simply change the requirement that the endpoints lie in [0x00,0x7F] to the requirement that they lie in [0,0x10FFFF]. However, in practice this is often not the case. Some implementations, such as that of gawk, do not allow character ranges to cross Unicode blocks. A range like [0x61,0x7F] is valid since both endpoints fall within the Basic Latin block, as is [0x0530,0x0560] since both endpoints fall within the Armenian block, but a range like [0x0061,0x0532] is invalid since it includes multiple Unicode blocks. Other engines, such as that of the Vim editor, allow block-crossing but limit the number of characters in a range to 128.

Another area in which there is variation is in the interpretation of case-insensitive flags. Some such flags affect only the ASCII characters. Other flags affect all characters. Some engines have two different flags, one for ASCII, the other for Unicode. Exactly which characters belong to the POSIX classes also varies.

Another response to Unicode has been the introduction of character classes for Unicode blocks and Unicode general character properties. In Perl and the java.util.regex library, classes of the form \p{InX} match characters in block X and \P{InX} match the opposite. Similarly, \p{Armenian} matches any character in the Armenian block, and \p{X} matches any character with the general character property X. For example, \p{Lu} matches any upper-case letter.

 

Uses of regular expressions

Regular expressions are useful in the production of code completion systems and syntax highlighting in integrated development environments (IDEs), data validation and many other tasks.

While regular expressions would be useful on search engines such as Google or Live Search, processing them across the entire database could consume massive amounts of resources depending on the complexity and design of the regex. Although in many cases system administrators can run regex-based queries internally, most search engines do not offer regex support to the public. A notable exception is Google Code Search.

 

In formal language theory, a context-free grammar (CFG) is a grammar in which every production rule is of the form

Vw

where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (possibly empty). The term "context-free" expresses the fact that nonterminals can be rewritten without regard to the context in which they occur. A formal language is context-free if some context-free grammar generates it.

Context-free grammars play a central role in the description and design of programming languages and compilers. They are also used for analyzing the syntax of natural languages.

Background

            Since the time of Pāini, at least, linguists have described the grammars of languages in terms of their block structure, and described how sentences are recursively built up from smaller phrases, and eventually individual words or word elements.

Context-Free Grammar

 

 

The context-free grammar (or "phrase-structure grammar") as Chomsky called it) formalism developed by Noam Chomsky,[0] in the mid-1950s, took the manner in which linguistics had described this grammatical structure, and then turned it into rigorous mathematics. A context-free grammar provides a simple and precise mechanism for describing the methods by which phrases in some natural language are built from smaller blocks, capturing the "block structure" of sentences in a natural way. Its simplicity makes the formalism amenable to rigorous mathematical study, but it comes at a price: important features of natural language syntax such as agreement and reference cannot be expressed in a natural way, or not at all.

Block structure was introduced into computer programming languages by the Algol project, which, as a consequence, also featured a context-free grammar to describe the resulting Algol syntax. This became a standard feature of computer languages, and the notation for grammars used in concrete descriptions of computer languages came to be known as Backus-Naur Form, after two members of the Algol language design committee.

The "block structure" aspect that context-free grammars capture is so fundamental to grammar that the terms syntax and grammar are often identified with context-free grammar rules, especially in computer science. Formal constraints not captured by the grammar are then considered to be part of the "semantics" of the language.

Context-free grammars are simple enough to allow the construction of efficient parsing algorithms which, for a given string, determine whether and how it can be generated from the grammar. An Earley parser is an example of such an algorithm, while the widely used LR and LL parsers are more efficient algorithms that only deal with more restrictive subsets of context-free grammars.

Formal definitions

         A context-free grammar G is a 4-tuple:

where

1. is a finite set of non-terminal characters or variables. They represent different types of phrase or clause in the sentence.

2. is a finite set of terminals, disjoint with , which make up the actual content of the sentence.

3. is the start variable, used to represent the whole sentence (or program). It must be an element of .

4. is a relation from to such that .

In addition, is a finite set. The members of are called the rules or productions of the grammar. The asterisk represents the Kleene star operation.

Additional Definition 1

         For any strings , we say yields , written as , if such that and . Thus, is the result of applying the rule to .

Additional Definition 2

For any (or in some textbooks) if such that

Additional Definition 3

The language of a grammar is the set

Additional Definition 4

A language is said to be a context-free language (CFL) if there exists a CFG, such that .

Examples

Example 1

A simple context-free grammar is given as:

S → aSb | ab;

where | is used to separate multiple options for the same non-terminal; so this is the same as

S → aSb

S → ab

The terminals here are a and b, while the only non-terminal is S. This grammar generates the language which is not regular.

The special character ε stands for the empty string. By changing the above grammar to :S → aSb | ε we obtain a grammar generating the language instead. This differs only in that it contains the empty string while the original grammar did not.

Example 2

Here is a context-free grammar for syntactically correct infix algebraic expressions in the variables x, y and z:

S → x | y | z | S + S | S - S | S * S | S/S | (S)

This grammar can, for example, generate the string "( x + y ) * x - z * y / ( x + x )" as follows: "S" is the initial string. "S - S" is the result of applying the fifth transformation [S → S - S] to the nonterminal S. "S * S - S / S" is the result of applying the sixth transform to the first S and the seventh one to the second S. "( S ) * S - S / ( S )" is the result of applying the final transform to certain of the nonterminals. "( S + S ) * S - S * S / ( S + S )" is the result of the fourth and fifth transforms to certain nonterminals. "( x + y ) * x - z * y / ( x + x )" is the final result, obtained by using the first three transformations to turn the S nonterminals into the terminals x, y, and z.

This grammar is ambiguous, meaning that one can generate the same string with more than one parse tree. For example, "x + y * z" might have either the + or the * parsed first; presumably these will produce different results.

 

Example 3

A context-free grammar for the language consisting of all strings over {a,b} for which the number of a's and b's are different is

S → U | V

U → TaU | TaT

V → TbV | TbT

T → aTbT | bTaT | ε

Here, the nonterminal T can generate all strings with the same number of a's as b's, the nonterminal U generates all strings with more a's than b's and the nonterminal V generates all strings with fewer a's than b's.

Example 4

Another example of a context-free language is . This is not a regular language, but it is context free as it can be generated by the following context-free grammar:

S → bSbb | A

A → aA | ε

Other examples

Context-free grammars are not limited in application to mathematical ("formal") languages. For example, it has been suggested that a class of Tamil poetry called Venpa is governed by a context-free grammar.[2]

Derivations and syntax trees

There are two common ways to describe how a given string can be derived from the start symbol of a given grammar. The simplest way is to list the consecutive strings of symbols, beginning with the start symbol and ending with the string, and the rules that have been applied. If we introduce a strategy such as "always replace the left-most nonterminal first" then for context-free grammars the list of applied grammar rules is by itself sufficient. This is called the leftmost derivation of a string. For example, if we take the following grammar:

(1) S → S + S

(2) S → 1

(3) S → a

and the string "1 + 1 + a" then a left derivation of this string is the list [ (1), (1), (2), (2), (3) ]. Analogously the rightmost derivation is defined as the list that we get if we always replace the rightmost nonterminal first. In this case this could be the list [ (1), (3), (1), (2), (2)].

The distinction between leftmost derivation and rightmost derivation is important because in most parsers the transformation of the input is defined by giving a piece of code for every grammar rule that is executed whenever the rule is applied. Therefore it is important to know whether the parser determines a leftmost or a rightmost derivation because this determines the order in which the pieces of code will be executed. See for an example LL parsers and LR parsers.

A derivation also imposes in some sense a hierarchical structure on the string that is derived. For example, if the string "1 + 1 + a" is derived according to the leftmost derivation:

S→S+S (1)

  →S+S+S (1)

  →1+S+S (2)

  →1+1+S (2)

  →1+1+a (3)

the structure of the string would be:

{ { { 1 }S + { 1 }S }S + { a }S }S

where { ... }S indicates a substring recognized as belonging to S. This hierarchy can also be seen as a tree:

S /| / | / | S '+' S /|\ | / | \ | S '+' S 'a' | | '1' '1'

This tree is called a concrete syntax tree (see also abstract syntax tree) of the string. In this case the presented leftmost and the rightmost derivations define the same syntax tree; however, there is another (leftmost) derivation of the same string

S→ S + S (1)

  → 1 + S (2)

  → 1 + S + S (1)

  → 1 + 1 + S (2)

  → 1 + 1 + a (3)

and this defines the following syntax tree:

S /| / | / | S '+' S | /| | / | '1' S '+' S | | '1' 'a'

If, for certain strings in the language of the grammar, there is more than one parsing tree, then the grammar is said to be an ambiguous grammar. Such grammars are usually hard to parse because the parser cannot always decide which grammar rule it has to apply.

Normal forms

Every context-free grammar that does not generate the empty string can be transformed into one in which no rule has the empty string as a product [a rule with ε as a product is called an ε-production]. If it does generate the empty string, it will be necessary to include the rule , but there need be no other ε-rule. Every context-free grammar with no ε-production has an equivalent grammar in Chomsky normal form or Greibach normal form. "Equivalent" here means that the two grammars generate the same language.

Because of the especially simple form of production rules in Chomsky Normal Form grammars, this normal form has both theoretical and practical implications. For instance, given a context-free grammar, one can use the Chomsky Normal Form to construct a polynomial-time algorithm which decides whether a given string is in the language represented by that grammar or not (the CYK algorithm).

Undecidable problems

Although some operations on context-free grammars are decidable due to their limited power, CFGs do have interesting undecidable problems. One of the simplest and most cited is the problem of deciding whether a CFG accepts the language of all strings. A reduction can be demonstrated to this problem from the well-known undecidable problem of determining whether a Turing machine accepts a particular input. The reduction uses the concept of a computation history, a string describing an entire computation of a Turing machine. We can construct a CFG that generates all strings that are not accepting computation histories for a particular Turing machine on a particular input, and thus it will accept all strings only if the machine does not accept that input.

As a consequence of this, it is also undecidable whether two CFGs describe the same language, since we can't even decide whether a CFG is equivalent to the trivial CFG deciding the language of all strings.

Another point worth mentioning is that the problem of determining if a context-sensitive grammar describes a context-free language is undecidable.

Extensions

An obvious way to extend the context-free grammar formalism is to allow nonterminals to have arguments, the values of which are passed along within the rules. This allows natural language features such as agreement and reference, and programming language analogons such as the correct use and definition of identifiers, to be expressed in a natural way. E.g. we can now easily express that in English sentences, the subject and verb must agree in number.

In computer science, examples of this approach include affix grammars, attribute grammars, indexed grammars, and Van Wijngaarden two-level grammars.

Similar extensions exist in linguistics.

Another extension is to allow additional symbols to appear at the left hand side of rules, constraining their application. This produces the formalism of context-sensitive grammars.

Linguistic applications

Chomsky himself initially hoped to overcome the limitations of context-free grammars by adding transformation rules.[0]

Such rules are another standard device in traditional linguistics; e.g. passivization in English. However, arbitrary transformations must be disallowed, since they are much too powerful (Turing complete). Much of generative grammar has been devoted to finding ways of refining the descriptive mechanisms of phrase-structure grammar and transformation rules such that exactly the kinds of things can be expressed that natural language actually allows.

His general position regarding the non-context-freeness of natural language has held up since then[3], although his specific examples regarding the inadequacy of CFGs in terms of their weak generative capacity were later disproved.[4] Gerald Gazdar and Geoffrey Pullum have argued that despite a few non-context-free constructions in natural language (such as cross-serial dependencies in Swiss German[3] and reduplication in Bambara[5]), the vast majority of forms in natural language are indeed context-free.[4]

 

Pushdown Automaton

In automata theory, a pushdown automaton (PDA) is a finite automaton that can make use of a stack containing data.

 

Operation

 

Pushdown automata differ from normal finite state machines in two ways:

  1. They can use the top of the stack to decide which transition to take.
  2. They can manipulate the stack as part of performing a transition.

Pushdown automata choose a transition by indexing a table by input signal, current state, and the top of the stack. Normal finite state machines just look by input signal and current state: they have no stack to work with. Pushdown automata add the stack as a parameter for choice. Given an input signal, current state, and a given symbol at the top of the stack, a transition path is chosen.

Pushdown automata can also manipulate the stack, as part of performing a transition. Normal finite state machines choose a new state, the result of following the transition. The manipulation can be to push a particular symbol to the top of the stack, or to pop off the top of the stack. The automaton can alternatively ignore the stack, and leave it as it is. The choice of manipulation (or no manipulation) is determined by the transition table.

Put together: Given an input signal, current state, and stack symbol, the automaton can follow a transition to another state, and optionally manipulate (push or pop) the stack.

The (underlying) finite automaton is specifically a nondeterministic finite state machine, giving what is technically known as a "nondeterministic pushdown automaton" (NPDA). If a deterministic finite state machine is used, then the result is a deterministic pushdown automaton (DPDA), a strictly weaker device. Nondeterminism means that there may be more than just one transition available to follow, given an input signal, state, and stack symbol.

If we allow a finite automaton access to two stacks instead of just one, we obtain a more powerful device, equivalent in power to a Turing machine. A linear bounded automaton is a device which is more powerful than a pushdown automaton but less so than a Turing machine.

Pushdown automata are equivalent to context-free grammars: for every context-free grammar, there exists a pushdown automaton such that the language generated by the grammar is identical with the language generated by the automaton, and for every pushdown automaton there exists a context-free grammar such that the language generated by the automaton is identical with the language generated by the grammar.

Formal Definition

A PDA is formally defined as a 6-tuple:

where

  • is a finite set of states
  • is a finite set of the input alphabet
  • is a finite set of the stack alphabet
  • : is the transition function
  • is the start state
  • is the set of accept states

Computation Definition 1

For any PDA, , a computation path is an ordered

(n+1)-tuple, , where Q, n 0, which satisfies the following conditions:

(i) for i = 0, 1, 2,......, n-1,

where

(ii) such that

 

Intuitively, a PDA, at any point in the computation process, faces a multiple possibilities of whether to read a symbol from the top of the stack and replace it with another symbol, or read a symbol from the top of the stack and remove it without replacement, or not read any symbol from the stack but only push another symbol to it, or does nothing. All these are governed by the simutaneous equations and . is called the stack content immediately before the transitional move and is the symbol to be removed from the top of the stack. is the stack content immediately after the transitional move and is the symbol to be added to the stack during the transitional move.

Both and can be .

If and , the PDA reads a symbol from the stack and replace it with another one.

If and , the PDA reads a symbol from the stack and remove it without replacement.

If and , the PDA simply adds a symbol to the stack.

If and , the PDA leaves the stack unchanged.

Note that when n=0, the computation path is just the singleton (q0).

Computation Definition 2

For any input w = w1w2...wm, wi , m 0, M accepts w if a computation path and a finite sequence r0, r1, r2,...rm Q, m n, such that

(i) For each i = 0, 1, 2,...m, ri is on the computation path. That is,

f(i) where i f(i) n such that ri = qf(i)

(ii) (qf(i)+1, bf(i)+1) (ri, wi+1, af(i)+1) for each i = 0, 1, 2,...m-1.

where af(i)+1 and bf(i)+1 are defined as in Computation Definition 1.

(iii) (qj+1, bj+1) (qj, , aj+1) if qj {r0, r1,...rm}

where aj+1 and bj+1 are defined as in Computation Definition 1.

(iv) rm = qn and rm F

Note that the above definitions do not provide a mechanism to test for an empty stack. To do so, one would need to write a special symbol on the stack at the beginning of every computation so that the PDA would recognize that the stack is effectively empty whenever it detects the special symbol. Formally, this is done by introducing the transition (q0, = {(q1, $)} where $ is the special symbol.

Example

The following is the formal description of the PDA which recognizes the language :

Q = {q1, q2, q3, q4}

= {0, 1}

= {0, $}

F = {q1, q4}

(q1, , ) = {(q2, $), (q1, )}

(q2, 0, ) = {(q2, 0)}

(q2, 1, 0) = {(q3, )}

(q3, 1, 0) = {(q3, )}

(q3, , $) = {(q4, )}

(q, w, a) = for any other values of state, input and stack symbols.

Understanding the computation process

The following illustrates how the above PDA computes on different input strings.

(a) Input string = 0011

(i) write (q1, , ) (q2, $) to represent (q2, $) (q1, , )

 

: s0 = , s1 = $, t = , a = , b = $

 

: set r0 = q2

 

(ii) (r0, 0, ) = (q2, 0, ) (q2, 0)

 

: s1 = $, a = , t = $, b = 0, s2 = 0$

 

: set r1 = q2

 

(iii) (r1, 0, ) = (q2, 0, ) (q2, 0)

 

: s2 = 0$, a = , t = 0$, b = 0, s3 = 00$

 

: set r2 = q2

 

(iv) (r2, 1, 0) = (q2, 1, 0) (q3, )

 

: s3 = 00$, a = 0, t = 0$, b = , s4 = 0$

 

: set r3 = q3

 

(v) (r3, 1, 0) = (q3, 1, 0) (q3, )

 

: s4 = 0$, a = 0, t = $, b = , s5 = $

 

(vi) (q3, , $) (q4, )

 

: s5 = $, a = $, t = , b = , s6 =

 

: set r4 = q4

 

Since q4 is an accept state, 0011 is accepted.

 

In summary, computation path = (q1, q2, q2, q2, q3, q3, q4)

 

and (r0, r1, r2, r3, r4) = (q2, q2, q2, q3, q4)

(b) Input string = 001

Computation moves (i), (ii), (iii), (iv) would have to be the same as in case (a), otherwise, the PDA would have come to a dead end before reaching (v).

 

(v) (r3, , a) = (q3, , a)

 

: Since s4 = 0$, either a = or a = 0

 

: In either case, (q3, , a) =

 

: Therefore, computation comes to an end at r3 = q3 which is not an accept state. Therefore, 001 is rejected.

(c) Input string =

Set r0 = q1, r1 = q1

 

(r0, , ) (q1, )

 

Since q1 is an accept state, is accepted.

Generalized Pushdown Automaton (GPDA)

A GPDA is a PDA which writes an entire string to the stack or removes an entire string from the stack in one step.

A GPDA is formally defined as a 6-tuple

where Q, , , q0 and F are defined the same way as a PDA.

 

: is the transition function.

Computation rules for a GPDA are the same as a PDA except that the ai+1's and bi+1's are now strings intead of symbols.

GPDA's and PDA's are equivalent in that if a language is recognized by a PDA, it is also recognized by a GPDA and vice versa.

One can formulate an analytic proof for the equivalence of GPDA's and PDA's using the following simulation:

Let (q1, w, x1x2...xm) (q2, y1y2...yn) be a transition of the GPDA

where q1, q2 Q, w , x1x2...xm , m 0, y1y2...yn , n 0.

Construct the following transitions for the PDA:

: (q1, w, x1) (p1, )

 

: (p1, , x2) (p2, )

 

:::

 

: (pm-1, , xm) (pm, )

 

: (pm, , ) (pm+1, yn)

 

: (pm+1, , ) (pm+2, yn-1)

 

:::

 

: (pm+n-1, , ) (q2, y1)

 

For the test of artificial intelligence, see Turing test. For the instrumental rock band, see Turing Machine (band).

Turing Machine

Turing machines are extremely basic abstract symbol-manipulating devices which, despite their simplicity, can be adapted to simulate the logic of any computer that could possibly be constructed. They were described in 1936 by Alan Turing. Though they were intended to be technically feasible, Turing machines were not meant to be a practical computing technology, but a thought experiment about the limits of mechanical computation; thus they were not actually constructed. Studying their abstract properties yields many insights into computer science and complexity theory.

A Turing machine that is able to simulate any other Turing machine is called a Universal Turing machine (UTM, or simply a universal machine). A more mathematically-oriented definition with a similar "universal" nature was introduced by Alonzo Church, whose work on lambda calculus intertwined with Turing's in a formal theory of computation known as the Church-Turing thesis. The thesis states that Turing machines indeed capture the informal notion of effective method in logic and mathematics, and provide a precise definition of an algorithm or 'mechanical procedure'.

Informal description

For visualizations of Turing machines, see Turing machine gallery.

The concept of the Turing machine is based on the idea of a person executing a well-defined procedure by changing the contents of an unlimited paper tape, which is divided into squares that can contain one of a finite set of symbols. The person needs to remember one of a finite set of states and the procedure is formulated in very basic steps in the form of "If your state is 42 and the symbol you see is a '0' then replace this with a '1', move one symbol to the right, and assume state 17 as your new state."

 

 

 

 

In some models the tape moves and the unused tape is truly blank. The instruction to be performed (q4) is shown over the scanned square. (Drawing after Kleene (1952) p.375.)

 

In some models the head moves along the stationary tape. The instruction to be performed (q1) is shown inside the head. In this model the "blank" tape is all 0's. The shaded squares, including the blank scanned by the head, and the squares marked 1, 1, B, and the head symbol, constitute the system state. (Drawing after Minsky (1967) p. 121).

More precisely, a Turing machine consists of:

  1. A TAPE which is divided into cells, one next to the other. Each cell contains a symbol from some finite alphabet. The alphabet contains a special blank symbol (here written as 'B') and one or more other symbols. The tape is assumed to be arbitrarily extendable to the left and to the right, i.e., the Turing machine is always supplied with as much tape as it needs for its computation. Cells that have not been written to before are assumed to be filled with the blank symbol. In some models the tape has a left end marked with a special symbol; the tape extends or is indefinitely extensible to the right. The symbols are sometimes referred to as colors.
  2. A HEAD that can read and write symbols on the tape and move the tape left and right one (and only one) cell at a time. In some models the head moves and the tape is stationary.
  3. A TABLE ("action table", or transition function) of instructions (usually quintuples or 5-tuples but sometimes 4-tuples) that, given the state the machine is currently in and the symbol it is reading on the tape tells the machine to do the following in sequence (for the 5-tuple models): (i) either erase or write a symbol, and then (ii) move the head ('L' for one step left or 'R' for one step right), and then (iii) assume the same or a new state as prescribed. In the 4-tuple models the TABLE tells the machine to (ia) erase or to write a symbol or (ib) move the head left or right, and then (ii) assume the same or a new state as prescribed, but not both actions (ia) and (ib) in the same instruction. In some models, if there is no entry in the table for the current combination of symbol and state then the machine will halt; other models require all entries to be filled.
  4. A state register that stores the state of the Turing table. The number of different states is always finite and there is one special start state with which the state register is initialized. Turing defined this as a "note of instructions" to preserve the computation of the "computer" (a person) who is working in a "desultory manner":
  5. :"This note is the counterpart of the 'state of mind'." (Undecidable, p. 139)

Note that every part of the machine—its state- and symbol-collections—and its actions—printing, erasing and tape motion—is finite, discrete and distinguishable; it is the potentially unlimited amount of tape that gives it an unbounded amount of storage space.

Examples of Turing machines

To see examples of the following models, see Turing machine examples:

  1. Turing's very first machine
  2. Copy routine
  3. 3-state busy beaver

Formal definition of single-tape Turing machine

A nutshell formal description of a "Turing machine":

"A Turing machine is a finite-state machine associated with an external storage or memory medium." (Minsky (1967), p. 117)

"A Turing machine is essentially a finite-state sequential machine that has the ability to communicate with an external store of information." (Booth (1967), p. 354)

The finite state machine is represented by the state-table together with its state register. The "external storage medium" is the tape. The input to the state machine is the scanned symbol on the tape. The output of the state machine is a symbol to print or the erase command and tape motion-command left or right.

Hopcroft and Ullman (1979, p. 148) formally define a (one-tape) Turing machine as a 7-tuple where

  • is a finite set of states
  • is a finite set of the tape alphabet/symbols
  • is the blank symbol (the only symbol allowed to occur on the tape infinitely often at any step during the computation)
  • , a subset of not including b is the set of input symbols
  • is a partial function called the transition function, where L is left shift, R is right shift.
  • is the initial state
  • is the set of final or accepting states

The 7-tuple for the 3-state busy beaver looks like this (see more about this busy beaver at Turing machine examples):

Q = { A, B, C, HALT }

Γ = { 0, 1 }

b = 0 = "blank"

Σ = { }, empty set

δ = see state-table below

q0 = A = initial state

F = the one element set of final states {HALT}

 

State table for 3 state, 2 symbol busy beaver

Tape symbol

Current state A

Current state B

Current state C

 

Write symbol

Move tape

Next state

Write symbol

Move tape

Next state

Write symbol

Move tape

Next state

 

0

1

R

B

1

L

A

1

L

B

 

1

1

L

C

1

R

B

1

N

HALT

 

This specification is insufficient, however. van Emde Boas (1990) observes that:

"The set-theoretical object [his formal seven-tuple description similar to the above] provides only partial information on how the machine will behave and what its computations will look like" (p. 6).

 

Examples readily demonstrate that, as well as the behavior of the components of the informal description, the form of the input and output "parameters", and the position of the head at the start must be specified as well. For example, Turing (1936) places his "figures" "1" and "0" on alternate squares. Other models form the input as tight-packed unary 1s with blanks between the various strings, yet others place strings of 0's, each spaced by 1, on truly blank tape, etc.; the output may or may not appear in a similar manner. Thus to fully "describe" a "computation" of a Turing machine Stone (1972) states it is necessary to state the following:

  1. The tape alphabet
  2. The form in which the parameters are presented on the tape
  3. The initial state of the Turing machine
  4. The form in which answers will be represented on the tape when the Turing machine halts
  5. The machine program" (Stone, p. 10)

Turing instructions—quintuples (5-tuples)

Definitions in literature sometimes differ slightly, to make arguments or proofs easier or clearer, but this is always done in such a way that the resulting machine has the same computational power. For example, changing the set to , where N ("None" or "No-operation) would allow the machine to stay on the same tape cell instead of moving left or right, does not increase the machine's computational power.

The most common convention represents each "Turing instruction" in a "Turing table" by one of nine 5-tuples, per the convention of Turing/Davis (Turing (1936 in Undecidable, p. 126-127 and Davis (2000) p. 152):

(definition 1): (qi, Sj, Sk/E/N, L/R/N, qm)

: ( current state qi , symbol scanned Sj , print symbol Sk PSk/erase E/none N , move_tape_one_ square left L/right R/None N , new state qm )

 (definition 2): (qi, Sj, qm, Sk/E/N, L/R/N)

: ( current state qi, symbol scanned Sj , new state qm , print symbol Sk PSk/erase E/none N , move_tape_one_square left L/none N/right R )

For remainder of this article we will use "definition 1" i.e. the Turing/Davis convention

Example: state table for the 3-state 2-symbol busy beaver reduced to 5-tuples

Current state

Scanned symbol

Print symbol

Move tape

Final (i.e. next) state

5-tuples

A

0

1

R

B

(A, 0, 1, R, B)

A

1

1

L

C

(A, 1, 1, L, C)

B

0

1

L

A

(B, 0, 1, L, A)

B

1

1

R

B

(B, 1, 1, R, B)

C

0

1

L

B

(C, 0, 1, L, B)

C

1

1

N

H

(C, 1, 1, N, H)

In the following table, Turing's original model allowed only the first three lines that he called N1, N2, N3 (cf Turing in Undecidable, p. 126). He allowed for erasure of the "scanned square" by naming a 0th symbol S0 = "erase" or "blank", etc. However, he did not allow for non-printing, so every instruction-line includes "print symbol Sk" or "erase" (cf footnote 12 in Post (1947), Undecidable p. 300). The abbreviations are Turing's (Undecidable p.119). Subsequent to Turing's original paper in 1936–1937, machine-models have allowed all nine possible five-tuples:

 

 

Current m-configuration (Turing state)

Tape symbol

Print-operation

Tape-motion

Final m-configuration (Turing state)

5-tuple

5-tuple comments

 

4-tuple

N1

qi

Sj

Print(Sk)

Left L

qm

(qi, Sj, Sk, L, qm)

"blank" = S0, 1=S1, etc.

 

 

N2

qi

Sj

Print(Sk)

Right R

qm

(qi, Sj, Sk, R, qm)

"blank" = S0, 1=S1, etc.

 

 

N3

qi

Sj

Print(Sk)

None N

qm

(qi, Sj, Sk, N, qm)

"blank" = S0, 1=S1, etc.

 

(qi, Sj, Sk, qm)

 

 

 

 

 

 

 

 

 

 

4

qi

Sj

None N

Left L

qm

(qi, Sj, N, L, qm)

 

 

(qi, Sj, L, qm)

5

qi

Sj

None N

Right R

qm

(qi, Sj, N, R, qm)

 

 

(qi, Sj, R, qm)

6

qi

Sj

None N

None N

qm

(qi, Sj, N, N, qm)

Direct "jump"

 

(qi, Sj, N, qm)

7

qi

Sj

Erase

Left L

qm

(qi, Sj, E, L, qm)

 

 

 

8

qi

Sj

Erase

Right R

qm

(qi, Sj, E, R, qm)

 

 

 

9

qi

Sj

Erase

None N

qm

(qi, Sj, E, N, qm)

 

 

(qi, Sj, E, qm)

We can construct any Turing table (list of instructions) from the above nine 5-tuples. For technical reasons usually we can dispense with the three non-printing or "N" instructions (4, 5, 6). For examples see Turing machine examples.

Less frequently we encounter the use of 4-tuples—these represent a further atomization of the Turing instructions (cf Post (1947), Boolos & Jeffrey (1974, 1999), Davis-Sigal-Weyuker (1994)); also see more at Post-Turing machine.

The "state"

The word "state" used in context of Turing machines can be a source of confusion. Most commentators after Turing have used "state" to mean the name/designator of the current instruction to be performed—i.e. the contents of the state register. But Turing (1936) made a strong distinction between a record of what he called the machine's "m-configuration", and the machine's (or person's) "state of progress" through the computation. What Turing called "the state formula" includes both the current instruction and all the symbols on the tape:

"Thus the state of progress of the computation at any stage is completely determined by the note of instructions and the symbols on the tape. That is, the state of the system may be described by a single expression (sequence of symbols) consisting of the symbols on the tape followed by Δ (which we suppose not to appear elsewhere) and then by the note of instructions. This expression is called the 'state formula' (Undecidable, p.139–140, emphasis added)

Earlier in his paper Turing carried this even further: he gives an example where he places a symbol of the current "m-configuration"—the instruction's label—beneath the scanned square, together with all the symbols on the tape (Undecidable, p.121); this he calls "the complete configuration" (Undecidable, p. 118). To print the "complete configuration" on one line he places the state-label/m-configuration to the left of the scanned symbol.

We see a variant of this in Kleene (1952) where Kleene shows how to write the Gödel number of a machine's "situation": he places the "m-configuration" symbol q4 over the scanned square in roughly the center of the 6 non-blank squares on the tape (see the Turing-tape figure in this article) and puts it to the right of the scanned square. But Kleene refers to "q4" itself as "the machine state" (Kleene, p. 374-375). Hopcroft and Ullman call this composite the "instantaneous description" and follow the Turing convention of putting the "current state" (instruction-label, m-configuration) to the left of the scanned symbol (p.149).

Example: total state of 3-state 2-symbol busy beaver after 3 "moves" (taken from example "run" in the figure below):

: 1A1

This means: after three moves the tape has … 000110000 … on it, the head is scanning the right-most 1, and the state is A. Blanks (in this case represented by "0"s) can be part of the total state as shown here: B01 ; the tape has a single 1 on it, but the head is scanning the 0 ("blank") to its left and the state is B.

Thus when we speak of "state" in the context of Turing machines we should clarify whether we are describing: (i) the current instruction, or (ii) the list of symbols on the tape together with the current instruction, or (iii) the list of symbols on the tape together with the current instruction placed to the left of the scanned symbol or to the right of the scanned symbol.

Hodges—Turing's biographer—has observed this confusion, and his commentary on it appears in a footnote (cf Hodges (1983) p. 107). Turing machine "state" diagrams

 

The table for the 3-state busy beaver ("P" = print/write a "1")

Tape symbol

Current state A

Current state B

Current state C

 

Write symbol

Move tape

Next state

Write symbol

Move tape

Next state

Write symbol

Move tape

Next state

0

P

R

B

P

L

A

P

L

B

1

P

L

C

P

R

B

P

N

HALT

 

The "3-state busy beaver" Turing Machine in a finite state representation. Each circle represents a "state" of the TABLE—an "m-configuration" or "instruction". "Direction" of a state transition is shown by an arrow. The label (e.g.. 0/P,R) near the outgoing state (at the "tail" of the arrow) specifies the scanned symbol that causes a particular transition (e.g. 0) followed by a slash /, followed by the subsequent "behaviors" of the machine, e.g. "P Print" then move tape "R Right". No general accepted format exists. The convention shown is after McClusky (1965), Booth (1965), Hill and Peterson (1974).

To the right: the above TABLE as expressed as a "state transition" diagram".

Usually large TABLES are better left as tables (Booth, p. 74). They are more readily simulated by computer in tabular form (Booth, p. 74). However, certain concepts—e.g. machines with "reset" states and machines with repeating patterns (cf Hill and Peterson p. 244ff)—can be more readily seen when viewed as a drawing.

Whether a drawing represents an improvement on its TABLE must be decided by the reader for the particular context. See Finite state machine for more.

The evolution of the busy-beaver's computation starts at the top and proceeds to the bottom.

The reader should again be cautioned that such diagrams represent a snapshot of their TABLE frozen in time, not the course ("trajectory") of a computation through time and/or space. While every time the busy beaver machine "runs" it will always follow the same state-trajectory, this is not true for the "copy" machine that can be provided with variable input "parameters".

The diagram "Progress of the computation" shows the 3-state busy beaver's "state" (instruction) progress through its computation from start to finish. On the far right is the Turing "complete configuration" (Kleene "situation", Hopcroft-Ullman "instantaneous description") at each step. If we were to stop the machine and clear to blank both the "state register" and entire tape, these "configurations" could be used to rekindle a computation anywhere in its progress (cf Turing (1936) Undecidable pp. 139–140).

Models equivalent to the Turing machine model

Many machines that might be thought to have more computational capability than a simple universal Turing machine can be shown to have no more power (Hopcroft and Ullman p. 159, cf Minsky (1967)). They might compute faster, perhaps, or use less memory, or their instruction set might be smaller, but they cannot compute more powerfully (i.e. more mathematical functions). (Recall that the Church-Turing thesis hypothesizes this to be true: that anything that can be “computed” can be computed by some Turing machine.)

A Turing machine is equivalent to a pushdown automaton made more powerful by relaxing the last-in-first-out requirement of its stack. (Interestingly, this seemingly minor relaxation enables the Turing machine to perform such a wide variety of computations that it can serve as a model for the computational capabilities of all modern computer software.)

At the other extreme, some very simple models turn out to be Turing-equivalent, i.e. to have the same computational power as the Turing machine model.

Common equivalent models are the multi-tape Turing machine, machines with input and output, and the non-deterministic Turing machine (NDTM) as opposed to the deterministic Turing machine (DTM) for which the action table has at most one entry for each combination of symbol and state.

For more on this topic see Turing machine equivalents, in particular Register machine and Post-Turing machine.

Read only right moving Turing Machines are equivalent to NDFA's (as well as DFA's by conversion using the NDFA to DFA conversion algorithm).

For practical and didactical intentions the equivalent register machine can be used as a usual assembly programming language.

Choice c-machines, Oracle o-machines

Early in his paper (1936) Turing makes a distinction between an "automatic machine"—its "motion … completely determined by the configuration" and a "choice machine":

"…whose motion is only partially determined by the configuration … When such a machine reaches one of these ambiguous configurations, it cannot go on until some arbitrary choice has been made by an external operator. This would be the case if we were using machines to deal with axiomatic systems" (p. 118 Undecidable)

Turing (1936) does not elaborate further excepting a footnote in which he describes how to use an a-machine to "find all the provable formulae of the [Hilbert] calculus" rather than use a choice machine. He "suppose[s] that the choices are always between two possibilities 0 and 1. Each proof will then be determined by a sequence of choices i1, i2, …, in (i1 = 0 or 1, i2 = 0 or 1, …, in = 0 or 1), and hence the number 2n + i12n-1 + i22n-2 + … +in completely determines the proof. The automatic machine carries out successively proof 1, proof 2, proof 3, …"(Footnote ‡ The Undecidable:138)

This is indeed the technique by which a deterministic (i.e. a-) Turing machine can be used to mimic the action of a nondeterministic Turing machine; Turing solved the matter in a footnote and appears to dismiss it from further consideration.

An oracle machine or o-machine is a Turing a-machine that pauses its computation at state "o" while, to complete its calculation, it "awaits the decision" of "the oracle"—an unspecified entity "apart from saying that it cannot be a machine" (Turing (1939), Undecidable p. 166–168). The concept is now actively used by mathematicians.

Universal Turing machines

Main article: Universal Turing machine

As Turing wrote in Undecidable, p. 128:

"It is possible to invent a single machine which can be used to compute any computable sequence. If this machine U is supplied with the tape on the beginning of which is written the string of quintuples separated by semicolons of some computing machine M, then U will compute the same sequence as M." (italics added)

We now take this remarkable finding for granted. But at the time (1936) it was astonishing. The model of computation that Turing called his "universal machine"—"U" for short—is considered by some (cf Davis (2000)) to have been the fundamental theoretical breakthrough that led to the notion of the stored program computer.

"Turing's paper … contains, in essence, the invention of the modern computer and some of the programming techniques that accompanied it" (Minsky (1967), p. 104).

As part of his New Kind of Science, Stephen Wolfram announced his discovery of a 2-state 5-symbol universal Turing machine. Wolfram's example is the smallest universal Turing machine so far since it has the smallest product (2,5)=10 of any known universal Turing machine.

On May 14th, 2007, Wolfram announced a US$25,000 prize for the proof or disproof of the conjecture that an even simpler, 2-state 3-symbol Turing machine is universal [1]. If a proof is found, this would be the smallest possible universal Turing Machine.

Comparison with real machines

It is often said that Turing machines, unlike simpler automata, are as powerful as real machines, and are able to execute any operation that a real program can. What is missed in this statement is that almost any particular program running on a particular machine is in fact nothing but a deterministic finite automaton, since the machine it runs on can only be in finitely many configurations. Turing machines would actually only be equivalent to a machine that had an unlimited amount of storage space. We might ask, then, why Turing machines are useful models of real computers. There are a number of ways to answer this:

  1. Anything a real computer can compute, a Turing machine can also compute. For example: "A Turing machine can simulate any type of subroutine found in programming languages, including recursive procedures and any of the known parameter-passing mechanisms" (Hopcroft and Ullman p. 157). Thus, a statement about the limitations of Turing machines will also apply to real computers.
  2. The difference lies only with the ability of a Turing machine to manipulate an unbounded amount of data. However, given a finite amount of time, a Turing machine (like a real machine) can only manipulate a finite amount of data.
  3. Like a Turing machine, a real machine can have its storage space enlarged as needed, by acquiring more disks or other storage media. If the supply of these runs short, the Turing machine may become less useful as a model. But the fact is that neither Turing machines nor real machines need astronomical amounts of storage space in order to perform useful computation. The processing time required is usually much more of a problem.
  4. Real machines are much more complex than a Turing machine. For example, a Turing machine describing an algorithm may have a few hundred states, while the equivalent deterministic finite automaton on a given real machine has quadrillions. This makes the DFA representation infeasible to analyze.
  5. Turing machines describe algorithms independent of how much memory they utilize. There is a limit to the memory possessed by any current machine, but this limit can rise arbitrarily in time. Turing machines allow us to make statements about algorithms which will (theoretically) hold forever, regardless of advances in conventional computing machine architecture.
  6. Turing machines simplify the statement of algorithms. Algorithms running on Turing-equivalent abstract machines are usually more general than their counterparts running on real machines, because they have arbitrary-precision data types available and never have to deal with unexpected conditions (including, but not limited to, running out of memory).

One way in which Turing machines are a poor model for programs is that many real programs, such as operating systems and word processors, are written to receive unbounded input over time, and therefore do not halt. Turing machines do not model such ongoing computation well (but can still model portions of it, such as individual procedures).

Limitations of Turing machines in computational complexity theory

Further information: Computational complexity theory

A limitation of Turing Machines is that they do not model the strengths of a particular arrangement well. For instance, modern stored-program computers are actually instances of a more specific form of abstract machine known as the random access stored program machine or RASP machine model. Like the Universal Turing machine the RASP stores its "program" in "memory" external to its finite-state machine's "instructions". Unlike the Universal Turing Machine, the RASP has an infinite number of distinguishable, numbered but unbounded "registers"—memory "cells" that can contain any integer (cf Elgot and Robinson (1964), Hartmanis (1971), and in particular Cook-Rechow (1973); references at random access machine). The RASP's finite-state machine is equipped with the capability for indirect addressing (e.g. the contents of one register can "point to" the address of any other, arbitrary register); thus the RASP's "program" can address any register in the register-sequence. The upshot of this distinction is that there are computational optimizations that can be performed based on the memory indices, which are not possible in a general Turing Machine; thus when Turing Machines are used as the basis for bounding running times, a 'false lower bound' can be proven on certain algorithms' running times (due to the false simplifying assumption of a Turing Machine). An example of this is binary search, which violates the Turing machine's linear Ω(n) lower bound on searching an ordered list.

Historical background: computational machinery

Robin Gandy—a student of Alan Turing (1912–1954) and his life-long friend—traces the lineage of the notion of "calculating machine" back to Babbage (circa 1834) and actually proposes "Babbage's Thesis":

"That the whole of development and operations of analysis are now capable of being executed by machinery" (italics in Babbage as cited by Gandy, p. 54)

Gandy's analysis of Babbage's Analytical Engine describes the following five operations (cf p. 52–53):

  1. The arithmetic functions +, -, x where - indicates "proper" subtraction x - y = 0 if y >= x
  2. Any sequence of operations is an operation
  3. Iteration of an operation (repeating n times an operation P)
  4. Conditional iteration (repeating n times an operation P conditional on the "success" of test T)
  5. Conditional transfer (i.e. conditional "goto")

Gandy states that "the functions which can be calculated by (1), (2), and (4) are precisely those which are Turing computable." (p. 53). He cites other proposals for "universal calculating machines" included those of Percy Ludgate (1909), Leonardo Torres y Quevedo (1914), M. d'Ocagne (1922), Louis Couffignal (1933), Vannevar Bush (1936), Howard Aiken (1937). However:

"... the emphasis is on programming a fixed iterable sequence of arithmetical operations. The fundamental importance of conditional iteration and conditional transfer for a general theory of calculating machines is not recognized ..."(Gandy p. 55)

The Entscheidungsproblem (the "decision problem"): Hilbert's tenth question of 1900

With regards to Hilbert's problems posed by the famous mathematician David Hilbert in 1900, an aspect of problem #10 had been floating about for almost 30 years before it was framed precisely. Hilbert's original expression for #10 is as follows:

"10. Determination of the solvability of a Diophantine equation. Given a Diophantine equation with any number of unknown quantities and with rational integral coefficients: To devise a process according to which it can be determined in a finite number of operations whether the equation is solvable in rational integers.

 

"The Entscheidungsproblem [decision problem for first-order logic] is solved when we know a procedure that allows for any given logical expression to decide by finitely many operations its validity or satisfiability ... The Entscheidungsproblem must be considered the main problem of mathematical logic. ..." (quoted, with this translation and the original German, in Nachum Dershowitz and Yuri Gurevich, "A Natural Axiomatization of Church's Thesis", 2007:1 [2])

By 1922, this notion of "Entscheidungsproblem" had developed a bit, and H. Behmann stated that

"... most general form of the Entscheidungsproblem [is] as follows:

:"A quite definite generally applicable prescription is required which will allow one to decide in a finite number of steps the truth or falsity of a given purely logical assertion ..." (Gandy p. 57 quoting Behmann)

 

"Behmann remarks that ... the general problem is equivalent to the problem of deciding which mathematical propositions are true." (Gandy p. 57)

If one were able to solve the Entscheidungsproblem then one would have a "procedure for solving many (or even all) mathematical problems" (Gandy p. 92).

By the 1928 international congress of mathematicians Hilbert "made his questions quite precise. First, was mathematics complete ... Second, was mathematics consistent ... And thirdly, was mathematics decidable?" (Hodges p. 91, Hawking p. 1121). The first two questions were answered in 1930 by Kurt Gödel at the very same meeting where Hilbert delivered his retirement speech (much to the chagrin of Hilbert); the third — the Entscheidungsproblem — had to wait until the mid-1930's.

The problem was: An answer first required a precise definition of "definite general applicable prescription", what Alonzo Church would come to call "effective calculability", and in 1928 no such definition existed. But over the next 6–7 years Emil Post developed his definition of a worker moving from room to room writing and erasing marks per a list of instructions (Post 1936), as did Princeton professor Church and his two students Stephen Kleene and J. B. Rosser by use of Church's λ-calculus and Gödel's recursion theory (1934). Church's paper (published 15 April 1936) showed that the Entscheidungsproblem was indeed "undecidable" and beat Turing to the punch by almost a year (Turing's paper submitted 28 May 1936, published January 1937). In the meantime, Emil Post submittted a brief paper in the fall of 1936 so Turing at least had priority over Post. While Church refereed Turing's paper, Turing had time to study Church's paper and add an Appendix where he sketched a proof that Church's λ-calculus and his machines would compute the same functions.

"But what Church had done was something rather different, and in a certain sense weaker . . . the Turing construction was more direct, and provided an argument from first principles, closing the gap in Church's demonstration." (Hodges p. 112).

And Post had only proposed a definition of calculability and criticized Church's "definition", but had proved nothing.

Alan Turing's a- (automatic-)machine

In the spring of 1935 Alan M. Turing, the young Master's student at King's College Cambridge UK took on the challenge; he had been stimulated by the lectures of the logician M. H. A. Newman "and learned from them of Gödel's work and the Entscheidungsproblem ... Newman used the word 'mechanical' ... In his obituary of Turing 1955 Newman writes:

"To the question 'what is a "mechanical" process?' Turing returned the characteristic answer 'Something that can be done by a machine' and he embarked on the highly congenial task of analysing the general notion of a computing machine." (Gandy p. 74)

Gandy states that:

"I suppose, but do not know, that Turing, right from the start of his work, had as his goal a proof of the undecidability of the Entscheidungsproblem. He told me that the 'main idea' of the paper came to him when he was lying in Grantchester meadows in the summer of 1935. The 'main idea' might have either been his analysis of computation or his realization that there was a universal machine, and so a diagonal argument to prove unsolvability." (p. 76)

While Gandy believed that Newman's statement above is "misleading", this opinion is not shared by all. Turing had had a life-long interest in machines: "Alan had dreamt of inventing typewriters as a boy; [his mother] Mrs. Turing had a typewriter; and he could well have begun by asking himself what was meant by calling a typewriter 'mechanical'" (Hodges p. 96). While at Princeton pursuing his PhD, Turing built a Boolean-logic multiplier (see below). His PhD thesis, titled "Systems of Logic Based on Ordinals", contains the following definition of "a computable function":

"It was stated above that 'a function is effectively calculable if its values can be found by some purely mechanical process'. We may take this statement literally, understanding by a purely mechanical process one which could be carried out by a machine. It is possible to give a mathematical description, in a certain normal form, of the structures of these machines. The development of these ideas leads to the author's definition of a computable function, and to an identification of computability with effective calculabiity. It is not difficult, though somewhat laborious, to prove that these three definitions [the 3rd is the λ-calculus] are equivalent ..." (Turing (1939) in The Undecidable p. 160)

When Turing returned to the UK he ultimately became jointly responsible for breaking the German secret codes created by encryption machines called "The Enigma"; he also became involved in the design of the ACE (Automatic Computing Machine), "[Turing's] ACE proposal was effectively self-contained, and its roots lay not in the EDVAC [the USA's initiative], but in his own universal machine" (Hodges p. 318). Arguments still continue concerning the origin and nature of what has been named by Kleene (1952) Turing's Thesis. But what Turing did prove with his computational-machine model appears in his paper On Computable Numbers, With an Application to the Entscheidungsproblem (1937):

"[that] the Hilbert Entscheidungsproblem can have no solution ... I propose, therefore to show that there can be no general process for determining whether a given formula U of the functional calculus K is provable, i.e. that there can be no machine which, supplied with any one U of these formulae, will eventually say whether U is provable." (from Turing's paper reprinted in The Undecidable, p. 145)

Turing's example (his second proof): If one is to ask for a general procedure to tell us: "Does this machine ever print 0", the question is "undecidable."

1937–1970: The "digital computer", the birth of "Computer science"

In 1937, while at Princeton working on his PhD thesis, Turing built a digital (Boolean-logic) multiplier from scratch, making his own electromechanical relays (Hodges p. 138). "Alan's task was to embody the logical design of a Turing machine in a network of relay-operated switches . . ." (Hodges p. 138). While Turing might have been just curious and experimenting, quite-earnest work in the same direction was going in Germany ( Konrad Zuse (1938)), and in the United States (Howard Aiken) and George Stibitz (1937); the fruits of their labors were used by the Axis and Allied military in World War II (cf Hodges p. 298–299). In the early- to mid-1950's Hao Wang and Marvin Minsky reduced the Turing machine to a simpler form (a precursor to the Post-Turing machine of Martin Davis ); simultaneously European researchers were reducing the new-fangled electronic computer to a computer-like theoretical object equivalent to what was now being called a "Turing machine" (Turing by this time had killed himself by eating a poisoned apple). In the late 1950's and early 1960's, the coincidentally-parallel developments of Melzak and Lambek (1961), Minsky (1961), and Shepherdson and Sturgis (1961) carried the European work further and reduced the Turing machine to a more friendly, computer-like abstract model called the counter machine; Elgot and Robinson (1964), Hartmanis (1971), Cook and Reckhow (1973) carried this work even further with the register machine and random access machine models —but basically all are just multi-tape Turing machines with an arithmetic-like instruction set.

1970–present: The Turing machine as a model of computation

Today the counter, register and random-access machines and their sire the Turing machine continue to be the models of choice for theorists investigating questions in the theory of computation. In particular, computational complexity theory makes use of the Turing machine:

"Depending on the objects one likes to manipulate in the computations (numbers like nonnegative integers or aphanumeric strings), two models have obtained a domiant position in machine-based complexity theory:

:"the off-line multitape Turing machine. . . , which represents the standard model for string-oriented computation, and

:"the random access machine (RAM) as introduced by Cook and Reckhow . . . , which models the idealized Von Neumann style computer." (van Emde Boas 1990:4)

"Only in the related area of analysis of algorithms this role is taken over by the RAM model"(van Emde Boas 1990:16)

Moore machine

In the theory of computation, a Moore machine is a finite state automaton where the outputs are determined by the current state alone (and do not depend directly on the input). The state diagram for a Moore machine will include an output signal for each state. Compare with a Mealy machine, which maps transitions in the machine to outputs.

The name Moore machine comes from that of its promoter, Edward F. Moore, a state-machine pioneer who wrote "Gedanken-experiments on Sequential Machines".[1]

Most digital electronic systems are designed as clocked sequential systems. Clocked sequential systems are a restricted form of Moore machine where the state changes only when the global clock signal changes. Typically the current state is stored in flip-flops, and a global clock signal is connected to the "clock" input of the flip-flops. Clocked sequential systems are one way to solve metastability problems.A typical electronic Moore machine includes a combinatorial logic chain to decode the current state into the outputs (lambda). The instant the current state changes, those changes ripple through that chain, and almost instantaneously the outputs change (or don't change). There are design techniques to ensure that no glitches occur on the outputs during that brief period while those changes are rippling through the chain, but most systems are designed so that glitches during that brief transition time are ignored or are irrelevant. The outputs then stay the same indefinitely (LEDs stay bright, power stays connected to the motors, solenoids stay energized, etc.), until the Moore machine changes state again.

Formal definition

A Moore machine can be defined as a 6-tuple { S, S0, Σ, Λ, T, G } consisting of the following:

  • a finite set of states ( S )
  • a start state (also called initial state) S0 which is an element of (S)
  • a finite set called the input alphabet ( Σ )
  • a finite set called the output alphabet ( Λ )
  • a transition function (T : S × Σ → S) mapping a state and an input to the next state
  • an output function (G : S → Λ) mapping each state to the output alphabet

The number of states in a Moore machine will be greater than or equal to the number of states in the corresponding Mealy machine.

 

 

Gedanken-experiments

In Moore's paper "Gedanken-experiments on Sequential Machines" [1], the (n;m;p) automata (or machines) S are defined as having n states, m input symbols and p output symbols. Nine theorems are proved about the structure of S, and experiments with S. Later, S machines became known as Moore machines.

At the end of the paper, in Section Further problems, the following task is stated: Another directly following problem is the improvement of the bounds given at the theorems 8 and 9.

Moore's Theorem 8 is formulated as: ''Given an arbitrary (n;m;p) machine S, such that every two its states are distinguishable from one another. Then there exists an experiment of length which determines the state of S at the end of the experiment.

In 1957 A.A. Karatsuba proved the following two theorems, which completely solved Moore's problem on the improvement of the bounds of the experiment length of his Theorem 8.

Theorem A. If S is an (n;m;p) machine, such that every two of its states are distinguishable from one another, then there exists a branched experiment of length at most through which one may determine the state of S at the end of the experiment.

Theorem B. There exists an (n;m;p) machine, every two states of which are distinguishable from one another, such that the length of the shortest experiments establishing the state of the machine at the end of the experiment is equal to .

Theorems A and B were used for the basis of the course work of a student of the fourth year, A.A. Karatsuba, "On a problem from the automata theory" which was distinguished by testimonial reference at the competition of student works of the faculty of mechanics and mathematics of Moscow Lomonosow State University in 1958. The paper by A.A. Karatsuba was given to the Journal Uspekhi Mat. Nauk on 17 December 1958 and was published there in June 1960 [2] .

Until the present day (2007), Karatsuba's result on the length of experiments is the only exact nonlinear result, both in automata theory, and in similar problems of computational complexity theory.

 

 

Mealy machine

 

In the theory of computation, a Mealy machine is a finite state machine (and more accurately, a finite state transducer) that generates an output based on its current state and an input. This means that the state diagram will include both an input and output signal for each transition edge. In contrast, the output of a Moore finite state machine depends only on the machine's current state; transitions have no input attached. However, for each Mealy machine there is an equivalent Moore machine whose states are the union of the Mealy machine's states and the Cartesian product of the Mealy machine's states and the input alphabet.

The name Mealy machine comes from that of the concept's promoter, G. H. Mealy, a state-machine pioneer who wrote "A Method for Synthesizing Sequential Circuits" in 1955.[1]

Mealy machines provide a rudimentary mathematical model for cipher machines. Considering the input and output alphabet the Latin alphabet, for example, then a Mealy machine can be designed that given a string of letters (a sequence of inputs) can process it into a ciphered string (a sequence of outputs). However, although you could probably use a Mealy model to describe Enigma, the state diagram would be too complex to provide feasible means of designing complex ciphering machines.

Formal definition

A Mealy machine is a 6-tuple, (S, S0, Σ, Λ, T, G), consisting of the following:

  • a finite set of states (S)
  • a start state (also called initial state) S0 which is an element of (S)
  • a finite set called the input alphabet (Σ)
  • a finite set called the output alphabet (Λ)
  • a transition function (T : S × Σ → S)
  • an output function (G : S × Σ → Λ)

 

PROGRAMMING LANGUAGE

C is a general-purpose computer programming language originally developed in 1972 by Dennis Ritchie at the Bell Telephone Laboratories to implement the Unix operating system.

Although C was designed for writing architecturally independent system software it is also widely used for developing application software.

Worldwide, C is the first or second most popular language in terms of number of developer positions or publicly available code. It is widely used on many different software platforms, and there are few computer architectures for which a C compiler does not exist. C has greatly influenced many other popular programming languages, most notably C++, which originally began as an extension to C, and Java and C# which borrow C lexical conventions and operators

History

Early developments

The initial development of C occurred at AT&T Bell Labs between 1969 and 1973; according to Ritchie, the most creative period occurred in 1972. It was named “C” because many of its features were derived from an earlier language called “B“, which according to Ken Thompson was a stripped-down version of the BCPL programming language.

The origin of C is closely tied to the development of the Unix operating system, originally implemented in assembly language on a PDP-7 by Ritchie and Thompson, incorporating several ideas from colleagues. Eventually they decided to port the operating system to a PDP-11. B’s lack of functionality to take advantage of some of the PDP-11’s features, notably byte addressability, led to the development of an early version of the C programming language.

The original PDP-11 version of the Unix system was developed in assembly language. By 1973, with the addition of struct types, the C language had become powerful enough that most of the Unix kernel was rewritten in C. This was one of the first operating system kernels implemented in a language other than assembly. (Earlier instances include the Multics system (written in PL/I), and MCP (Master Control Program) for the Burroughs B5000 written in ALGOL in 1961.)

&R C

In 1978, Brian Kernighan and Dennis Ritchie published the first edition of The C Programming Language. This book, known to C programmers as “K&R”, served for many years as an informal specification of the language. The version of C that it describes is commonly referred to as “K&R C”. The second edition of the book covers the later ANSI C standard.

K&R introduced several language features:

  • standard I/O library
  • long int data type
  • unsigned int data type
  • compound assignment operators =op were changed to op= to remove the semantic ambiguity created by the construct i=-10, which had been interpreted as i =- 10 instead of the possibly intended i = -10

Even after the publication of the 1989 C standard, for many years K&R C was still considered the “lowest common denominator” to which C programmers restricted themselves when maximum portability was desired, since many older compilers were still in use, and because carefully written K&R C code can be legal Standard C as well.

In early versions of C, only functions that returned a non-integer value needed to be declared if used before the function definition; a function used without any previous declaration was assumed to return an integer, if its value was used.

For example:

long int SomeFunction();
/* int OtherFunction(); */
 
/* int */ CallingFunction()
{
    long int test1;
    register /* int */ test2;
 
    test1 = SomeFunction();
    if (test1 > 0) 
          test2 = 0;
    else 
          test2 = OtherFunction();
 
    return test2;
}

All the above commented-out int declarations could be omitted in K&R C.

Since K&R function declarations did not include any information about function arguments, function parameter type checks were not performed, although some compilers would issue a warning message if a local function was called with the wrong number of arguments, or if multiple calls to an external function used different numbers or types of arguments. Separate tools such as Unix’s lint utility were developed that (among other things) could check for consistency of function use across multiple source files.

In the years following the publication of K&R C, several unofficial features were added to the language, supported by compilers from AT&T and some other vendors. These included:

The large number of extensions and lack of agreement on a standard library, together with the language popularity and the fact that not even the Unix compilers precisely implemented the K&R specification, led to the necessity of standardization.

ANSI C and ISO C

Main article: ANSI C

During the late 1970s and 1980s, versions of C were implemented for a wide variety of mainframe computers, minicomputers, and microcomputers, including the IBM PC, as its popularity began to increase significantly.

In 1983, the American National Standards Institute (ANSI) formed a committee, X3J11, to establish a standard specification of C. In 1989, the standard was ratified as ANSI X3.159-1989 “Programming Language C.” This version of the language is often referred to as ANSI C, Standard C, or sometimes C89.

In 1990, the ANSI C standard (with formatting changes) was adopted by the International Organization for Standardization (ISO) as ISO/IEC 9899:1990, which is sometimes called C90. Therefore, the terms “C89″ and “C90″ refer to the same programming language.

ANSI, like other national standards bodies, no longer develops the C standard independently, but defers to the ISO C standard. National adoption of updates to the international standard typically occurs within a year of ISO publication.

One of the aims of the C standardization process was to produce a superset of K&R C, incorporating many of the unofficial features subsequently introduced. The standards committee also included several additional features such as function prototypes (borrowed from C++), void pointers, support for international character sets and locales, and preprocessor enhancements. The syntax for parameter declarations was also augmented to include the style used in C++, although the K&R interface continued to be permitted, for compatibility with existing source code.

C89 is supported by current C compilers, and most C code being written nowadays is based on it. Any program written only in Standard C and without any hardware-dependent assumptions will run correctly on any platform with a conforming C implementation, within its resource limits. Without such precautions, programs may compile only on a certain platform or with a particular compiler, due, for example, to the use of non-standard libraries, such as GUI libraries, or to a reliance on compiler- or platform-specific attributes such as the exact size of data types and byte endianness.

In cases where code must be compilable by either standard-conforming or K&R C-based compilers, the __STDC__ macro can be used to split the code into Standard and K&R sections to take advantage of features available only in Standard C.

C99

Main article: C99

After the ANSI/ISO standardization process, the C language specification remained relatively static for some time, whereas C++ continued to evolve, largely during its own standardization effort. In 1995 Normative Amendment 1 to the 1990 C standard was published, to correct some details and to add more extensive support for international character sets. The C standard was further revised in the late 1990s, leading to the publication of ISO/IEC 9899:1999 in 1999, which is commonly referred to as “C99.” It has since been amended three times by Technical Corrigenda. The international C standard is maintained by the working group ISO/IEC JTC1/SC22/WG14.

C99 introduced several new features, including inline functions, several new data types (including long long int and a complex type to represent complex numbers), variable-length arrays, support for variadic macros (macros of variable arity) and support for one-line comments beginning with //, as in BCPL or C++. Many of these had already been implemented as extensions in several C compilers.

C99 is for the most part backward compatible with C90, but is stricter in some ways; in particular, a declaration that lacks a type specifier no longer has int implicitly assumed. A standard macro __STDC_VERSION__ is defined with value 199901L to indicate that C99 support is available. GCC, Sun Studio and other C compilers now support many or all of the new features of C99.

As of 2007, work has begun in anticipation of another revision of the C standard, informally called “C1x“. The C standards committee has adopted guidelines to limit the adoption of new features that have not been tested by existing implementations.

IMPORTANCE

Like most imperative languages in the ALGOL tradition, C has facilities for structured programming and allows lexical variable scope and recursion, while a static type system prevents many unintended operations. In C, all executable code is contained within functions. Function parameters are always passed by value. Pass-by-reference is achieved in C by explicitly passing pointer values. Heterogeneous aggregate data types (struct) allow related data elements to be combined and manipulated as a unit. C program source text is free-format, using the semicolon as a statement terminator (not a delimiter).

C also exhibits the following more specific characteristics:

  • non-nestable function definitions
  • variables may be hidden in nested blocks
  • partially weak typing; for instance, characters can be used as integers
  • low-level access to computer memory by converting machine addresses to typed pointers
  • function and data pointers supporting ad hoc run-time polymorphism
  • array indexing as a secondary notion, defined in terms of pointer arithmetic
  • a preprocessor for macro definition, source code file inclusion, and conditional compilation
  • complex functionality such as I/O, string manipulation, and mathematical functions consistently delegated to library routines
  • A relatively small set of reserved keywords (originally 32, now 37 in C99)
  • A lexical structure that resembles B more than ALGOL, for example
    • { ... } rather than ALGOL’s begin ... end
    • the equal-sign is for assignment (copying), much like Fortran
    • two consecutive equal-signs are to test for equality (compare to .EQ. in Fortran or the equal-sign in BASIC)
    • && and || in place of ALGOL’s and and or (these are semantically distinct from the bit-wise operators & and | because they will never evaluate the right operand if the result can be determined from the left alone (short-circuit evaluation)).
    • a large number of compound operators, such as +=, ++, ……

Absent features

The relatively low-level nature of the language affords the programmer close control over what the computer does, while allowing special tailoring and aggressive optimization for a particular platform. This allows the code to run efficiently on very limited hardware, such as embedded systems.

C does not have some features that are available in some other programming languages:

A number of these features are available as extensions in some compilers, or can be supplied by third-party libraries, or can be simulated by adopting certain coding disciplines.

Undefined behavior

Many operations in C that have undefined behavior are not required to be diagnosed at compile time. In the case of C, “undefined behavior” means that the exact behavior which arises is not specified by the standard, and exactly what will happen does not have to be documented by the C implementation. A famous, although misleading, expression in the newsgroups comp.std.c and comp.lang.c is that the program could cause “demons to fly out of your nose”.[7] Sometimes in practice what happens for an instance of undefined behavior is a bug that is hard to track down and which may corrupt the contents of memory. Sometimes a particular compiler generates reasonable and well-behaved actions that are completely different from those that would be obtained using a different C compiler. The reason some behavior has been left undefined is to allow compilers for a wide variety of instruction set architectures to generate more efficient executable code for well-defined behavior, which was deemed important for C’s primary role as a systems implementation language; thus C makes it the programmer’s responsibility to avoid undefined behavior. Examples of undefined behavior are:

  • accessing outside the bounds of an array
  • overflowing a signed integer
  • reaching the end of a non-void function without finding a return statement, when the return value is used
  • reading the value of a variable before initializing it

These operations are all programming errors that could occur using many programming languages; C draws criticism because its standard explicitly identifies numerous cases of undefined behavior, including some where the behavior could have been made well defined, and does not specify any run-time error handling mechanism.

Invoking fflush() on a stream opened for input is an example of a different kind of undefined behavior, not necessarily a programming error but a case for which some conforming implementations may provide well-defined, useful semantics (in this example, presumably discarding input through the next new-line) as an allowed extension. Use of such nonstandard extensions generally limits software portability

 

 

 

Uses

C’s primary use is for “system programming“, including implementing operating systems and embedded system applications, due to a combination of desirable characteristics such as code portability and efficiency, ability to access specific hardware addresses, ability to “pun” types to match externally imposed data access requirements, and low runtime demand on system resources.

One consequence of C’s wide acceptance and efficiency is that C is used for other programs that are not directly used by end-users (compilers, libraries, interpreters, etc).

C is sometimes used as an intermediate language by implementations of other languages. This approach may be used for convenience (by using C as an intermediate language, it is not necessary to develop machine-specific code generators; ). Some compilers which use C this way are BitCC, Gambit, the Glasgow Haskell Compiler, Squeak, and valac.

Unfortunately, C was designed as a programming language, not as a compiler target language, and is thus less than ideal for use as an intermediate language. This has led to development of C-based intermediate languages such as C–.

C has also been widely used to implement end-user applications. Although as applications became larger, much of that development shifted to other languages.

Syntax

Unlike languages such as FORTRAN 77, C source code is free-form which allows arbitrary use of whitespace to format code, rather than column-based or text-line-based restrictions. Comments may appear either between the delimiters /* and */, or (in C99) following // until the end of the line.

Each source file contains declarations and function definitions. Function definitions, in turn, contain declarations and statements. Declarations either define new types using keywords such as struct, union, and enum, or assign types to and perhaps reserve storage for new variables, usually by writing the type followed by the variable name. Keywords such as char and int specify built-in types. Sections of code are enclosed in braces ({ and }, sometimes called “curly brackets”) to limit the scope of declarations and to act as a single statement for control structures.

As an imperative language, C uses statements to specify actions. The most common statement is an expression statement, consisting of an expression to be evaluated, followed by a semicolon; as a side effect of the evaluation, functions may be called and variables may be assigned new values. To modify the normal sequential execution of statements, C provides several control-flow statements identified by reserved keywords. Structured programming is supported by if(-else) conditional execution and by do-while, while, and for iterative execution (looping). The for statement has separate initialization, testing, and reinitialization expressions, any or all of which can be omitted. break and continue can be used to leave the innermost enclosing loop statement or skip to its reinitialization. There is also a non-structured goto statement which branches directly to the designated label within the function. switch selects a case to be executed based on the value of an integer expression.

Expressions can use a variety of built-in operators (see below) and may contain function calls. The order in which operands to most operators, as well as the arguments to functions, are evaluated is unspecified; the evaluations may even be interleaved. However, all side effects (including storage to variables) will occur before the next “sequence point“; sequence points include the end of each expression statement and the entry to and return from each function call. This permits a high degree of object code optimization by the compiler, but requires C programmers to exert more care to obtain reliable results than is needed for other programming languages.

Although mimicked by many languages because of its widespread familiarity, C’s syntax has often been criticized. For example, Kernighan and Ritchie say in the second edition of The C Programming Language, “C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better.”

Some specific problems worth noting are:

  • Not checking number and types of arguments when the function declaration has an empty parameter list. (This provides backward compatibility with K&R C, which lacked prototypes.)
  • Some questionable choices of operator precedence, as mentioned by Kernighan and Ritchie above, such as == binding more tightly than & and | in expressions like x & 1 == 0.
  • The use of the = operator, used in mathematics for equality, to indicate assignment, following the precedent of Fortran, PL/I, and BASIC, but unlike ALGOL and its derivatives. Ritchie made this syntax design decision consciously, based primarily on the argument that assignment occurs more often than comparison.
  • Similarity of the assignment and equality operators (= and ==), making it easy to accidentally substitute one for the other. C’s weak type system permits each to be used in the context of the other without a compilation error (although some compilers produce warnings). For example, the conditional expression in if (a=b) is only true if a is not zero after the assignment.[8]
  • A lack of infix operators for complex objects, particularly for string operations, making programs which rely heavily on these operations (implemented as functions instead) somewhat difficult to read.
  • A declaration syntax that some find unintuitive, particularly for function pointers. (Ritchie’s idea was to declare identifiers in contexts resembling their use: “declaration reflects use“.)

Operators

C supports a rich set of operators, which are symbols used within an expression to specify the manipulations to be performed while evaluating that expression. C has operators for:

C has a formal grammar, specified by the C standard.

["Hello, world" example

The "hello, world" example which appeared in the first edition of K&R has become the model for an introductory program in most programming textbooks, regardless of programming language. The program prints "hello, world" to the standard output, which is usually a terminal or screen display. A standard-conforming "hello, world" program is:[9]

#include <stdio.h>
 
int main(void)
{
    printf("hello, world\n");
    return 0;
}

The first line of the program contains a preprocessing directive, indicated by #include. This causes the preprocessor — the first tool to examine source code as it is compiled — to substitute the line with the entire text of the stdio.h standard header, which contains declarations for standard input and output functions such as printf. The angle brackets surrounding stdio.h indicate that stdio.h is located using a search strategy that prefers standard headers to other headers having the same name. Double quotes may also be used to include local or project-specific header files.

The next line indicates that a function named main is being defined. The main function serves a special purpose in C programs: The run-time environment calls the main function to begin program execution. The type specifier int indicates that the return value, the value that is returned to the invoker (in this case the run-time environment) as a result of evaluating the main function, is an integer. The keyword void as a parameter list indicates that the main function takes no arguments.[10]

The opening curly brace indicates the beginning of the definition of the main function.

The next line calls (executes the code for) a function named printf, which was declared in stdio.h and is supplied from a system library. In this call, the printf function is passed (provided with) a single argument, the address of the first character in the string literal "hello, world\n". The string literal is an unnamed array with elements of type char, set up automatically by the compiler with a final 0-valued character to mark the end of the array (printf needs to know this). The \n is an escape sequence that C translates to a newline character, which on output signifies the end of the current line. The return value of the printf function is of type int, but it is silently discarded since it is not used. (A more careful program might test the return value to determine whether or not the printf function succeeded.) The semicolon ; terminates the statement.

The return statement terminates the execution of the main function and causes it to return the integer value 0, which is interpreted by the run-time system as an exit code indicating successful execution.

The closing curly brace indicates the end of the code for the main function.

Data structures

C has a static weak typing type system that shares some similarities with that of other ALGOL descendants such as Pascal. There are built-in types for integers of various sizes, both signed and unsigned, floating-point numbers, characters, and enumerated types (enum). C99 added a boolean datatype. There are also derived types including arrays, pointers, records (struct), and untagged unions (union).

C is often used in low-level systems programming where escapes from the type system may be necessary. The compiler attempts to ensure type correctness of most expressions, but the programmer can override the checks in various ways, either by using a type cast to explicitly convert a value from one type to another, or by using pointers or unions to reinterpret the underlying bits of a value in some other way.

Pointers

C supports the use of pointers, a very simple type of reference that records, in effect, the address or location of an object or function in memory. Pointers can be dereferenced to access data stored at the address pointed to, or to invoke a pointed-to function. Pointers can be manipulated using assignment and also pointer arithmetic. The run-time representation of a pointer value is typically a raw memory address (perhaps augmented by an offset-within-word field), but since a pointer’s type includes the type of the thing pointed to, expressions including pointers can be type-checked at compile time. Pointer arithmetic is automatically scaled by the size of the pointed-to data type. (See Array-pointer interchangeability below.) Pointers are used for many different purposes in C. Text strings are commonly manipulated using pointers into arrays of characters. Dynamic memory allocation, which is described below, is performed using pointers. Many data types, such as trees, are commonly implemented as dynamically allocated struct objects linked together using pointers. Pointers to functions are useful for callbacks from event handlers.

A null pointer is a pointer value that points to no valid location (it is often represented by address zero). Dereferencing a null pointer is therefore meaningless, typically resulting in a run-time error. Null pointers are useful for indicating special cases such as no next pointer in the final node of a linked list, or as an error indication from functions returning pointers.

Void pointers (void *) point to objects of unknown type, and can therefore be used as “generic” data pointers. Since the size and type of the pointed-to object is not known, void pointers cannot be dereferenced, nor is pointer arithmetic on them allowed, although they can easily be (and in many contexts implicitly are) converted to and from any other object pointer type.

Careless use of pointers is potentially dangerous. Because they are typically unchecked, a pointer variable can be made to point to any arbitrary location, which can cause undesirable effects. Although properly-used pointers point to safe places, they can be made to point to unsafe places by using invalid pointer arithmetic; the objects they point to may be deallocated and reused (dangling pointers); they may be used without having been initialized (wild pointers); or they may be directly assigned an unsafe value using a cast, union, or through another corrupt pointer. In general, C is permissive in allowing manipulation of and conversion between pointer types, although compilers typically provide options for various levels of checking. Some other programming languages address these problems by using more restrictive reference types.

Arrays

Array types in C are traditionally, of a fixed, static size specified at compile time. (The more recent C99 standard also allows a form of variable-length arrays.) However, it is also possible to allocate a block of memory (of arbitrary size) at run-time, using the standard library’s malloc function, and treat it as an array. C’s unification of arrays and pointers (see below) means that true arrays and these dynamically-allocated, simulated arrays are virtually interchangeable. Since arrays are always accessed (in effect) via pointers, array accesses are typically not checked against the underlying array size, although the compiler may provide bounds checking as an option. Array bounds violations are therefore possible and rather common in carelessly written code, and can lead to various repercussions, including illegal memory accesses, corruption of data, buffer overruns, and run-time exceptions.

C does not have a special provision for declaring multidimensional arrays, but rather relies on recursion within the type system to declare arrays of arrays, which effectively accomplishes the same thing. The index values of the resulting “multidimensional array” can be thought of as increasing in row-major order.

Although C supports static arrays, it is not required that array indices be validated (bounds checking). For example, one can try to write to the sixth element of an array with five elements, generally yielding undesirable results. This type of bug, called a buffer overflow or buffer overrun, is notorious for causing a number of security problems. On the other hand, since bounds checking elimination technology was largely nonexistent when C was defined, bounds checking came with a severe performance penalty, particularly in numerical computation. A few years earlier, some Fortran compilers had a switch to toggle bounds checking on or off; however, this would have been much less useful for C, where array arguments are passed as simple pointers.

Multidimensional arrays are commonly used in numerical algorithms (mainly from applied linear algebra) to store matrices. The structure of the C array is well suited to this particular task. However, since arrays are passed merely as pointers, the bounds of the array must be known fixed values or else explicitly passed to any subroutine that requires them, and dynamically sized arrays of arrays cannot be accessed using double indexing. (A workaround for this is to allocate the array with an additional “row vector” of pointers to the columns.)

C99 introduced “variable-length arrays” which address some, but not all, of the issues with ordinary C arrays.

See also: C string

Array-pointer interchangeability

A distinctive (but potentially confusing) feature of C is its treatment of arrays and pointers. The array-subscript notation x[i] can also be used when x is a pointer; the interpretation (using pointer arithmetic) is to access the (i+1)th of several adjacent data objects pointed to by x, counting the object that x points to (which is x[0]) as the first element of the array.

Formally, x[i] is equivalent to *(x + i). Since the type of the pointer involved is known to the compiler at compile time, the address that x + i points to is not the address pointed to by x incremented by i bytes, but rather incremented by i multiplied by the size of an element that x points to. The size of these elements can be determined with the operator sizeof by applying it to any dereferenced element of x, as in n = sizeof *x or n = sizeof x[0].

Furthermore, in most expression contexts (a notable exception is sizeof array), the name of an array is automatically converted to a pointer to the array’s first element; this implies that an array is never copied as a whole when named as an argument to a function, but rather only the address of its first element is passed. Therefore, although C’s function calls use pass-by-value semantics, arrays are in effect passed by reference.

The number of elements in a declared array a can be determined as sizeof a / sizeof a[0].

An interesting demonstration of the interchangeability of pointers and arrays is shown below. The four assignments are equivalent and each is valid C code. Note how the last line contains the strange code i[x] = 1;, which has the index variable i apparently interchanged with the array variable x. This last line might be found in obfuscated C code.

/* x designates an array */
x[i] = 1;
*(x + i) = 1;
*(i + x) = 1;
i[x] = 1; /* strange, but correct: i[x] is equivalent to *(i + x) */

However, there is a distinction to be made between arrays and pointer variables. Even though the name of an array is in most expression contexts converted to a pointer (to its first element), this pointer does not itself occupy any storage. Consequently, you cannot change what an array “points to”, and it is impossible to assign to an array. (Arrays may however be copied using the memcpy function, for example.)

Memory management

One of the most important functions of a programming language is to provide facilities for managing memory and the objects that are stored in memory. C provides three distinct ways to allocate memory for objects:

  • Static memory allocation: space for the object is provided in the binary at compile-time; these objects have an extent (or lifetime) as long as the binary which contains them is loaded into memory
  • Automatic memory allocation: temporary objects can be stored on the stack, and this space is automatically freed and reusable after the block in which they are declared is exited
  • Dynamic memory allocation: blocks of memory of arbitrary size can be requested at run-time using library functions such as malloc from a region of memory called the heap; these blocks persist until subsequently freed for reuse by calling the library function free

These three approaches are appropriate in different situations and have various tradeoffs. For example, static memory allocation has no allocation overhead, automatic allocation may involve a small amount of overhead, and dynamic memory allocation can potentially have a great deal of overhead for both allocation and deallocation. On the other hand, stack space is typically much more limited and transient than either static memory or heap space, and dynamic memory allocation allows allocation of objects whose size is known only at run-time. Most C programs make extensive use of all three.

Where possible, automatic or static allocation is usually preferred because the storage is managed by the compiler, freeing the programmer of the potentially error-prone chore of manually allocating and releasing storage. However, many data structures can grow in size at runtime, and since static allocations (and automatic allocations in C89 and C90) must have a fixed size at compile-time, there are many situations in which dynamic allocation must be used. Prior to the C99 standard, variable-sized arrays were a common example of this (see “malloc” for an example of dynamically allocated arrays).

Automatically and dynamically allocated objects are only initialized if an initialized is explicitly specified; otherwise they initially have indeterminate values (typically, whatever bit pattern happens to be present in the storage, which might not even represent a valid value for that type). If the program attempts to access an uninitialized value, the results are undefined. Many modern compilers try to detect and warn about this problem, but both false positives and false negatives occur.

Another issue is that heap memory allocation has to be manually synchronized with its actual usage in any program in order for it to be reused as much as possible. For example, if the only pointer to a heap memory allocation goes out of scope or has its value overwritten before free() has been called, then that memory cannot be recovered for later reuse and is essentially lost to the program, a phenomenon known as a memory leak. Conversely, it is possible to release memory too soon and continue to access it; however, since the allocation system can re-allocate or itself use the freed memory, unpredictable behavior is likely to occur when the multiple users corrupt each other’s data. Typically, the symptoms will appear in a portion of the program far removed from the actual error. Such issues are ameliorated in languages with automatic garbage collection or RAII.

Libraries

The C programming language uses libraries as its primary method of extension. In C, a library is a set of functions contained within a single “archive” file. Each library typically has a header file, which contains the prototypes of the functions contained within the library that may be used by a program, and declarations of special data types and macro symbols used with these functions. In order for a program to use a library, it must include the library’s header file, and the library must be linked with the program, which in many cases requires compiler flags (e.g., -lm, shorthand for “math library”).

The most common C library is the C standard library, which is specified by the ISO and ANSI C standards and comes with every C implementation. (“Freestanding” [embedded] C implementations may provide only a subset of the standard library.) This library supports stream input and output, memory allocation, mathematics, character strings, and time values.

Another common set of C library functions are those used by applications specifically targeted for Unix and Unix-like systems, especially functions which provide an interface to the kernel. These functions are detailed in various standards such as POSIX and the Single UNIX Specification.

Since many programs have been written in C, there are a wide variety of other libraries available. Libraries are often written in C because C compilers generate efficient object code; programmers then create interfaces to the library so that the routines can be used from higher-level languages like Java, Perl, and Python.

Deficiencies

Although the C language is extremely concise, C is subtle, and expert competency in C is not common—taking more than ten years to achieve.[11] C programs are also notorious for security vulnerabilities due to the unconstrained direct access to memory of many of the standard C library function calls.

In spite of its popularity and elegance, real-world C programs commonly suffer from instability and memory leaks, to the extent that any appreciable C programming project will have to adopt specialized practices and tools to mitigate spiraling damage. Indeed, an entire industry has been born merely out of the need to stabilize large source-code bases.

Although C was developed for Unix, Microsoft adopted C as the core language of its operating systems. Although all standard C library calls are supported by Windows, there is only ad-hoc support for Unix functionality side-by-side with an inordinate number of inconstant Windows-specific API calls.

It is inevitable that C did not choose limit the size or endianness of its types—for example, each compiler is free to choose the size of an int type as any anything over 16 bits according to what is efficient on the current platform. Many programmers work based on size and endianness assumptions, leading to code that is not portable.

Also inevitable is that the C standard defines only a very limited gamut of functionality, excluding anything related to network communications, user interaction, or process/thread creation. Its parent document, the POSIX standard, includes such a wide array of functionality that no operating system appears to support it exactly, and only Unix systems have even attempted to support substantial parts of it.

Therefore the kinds of programs that can be portably written are extremely restricted, unless specialized programming practices are adopted.

Language tools

Tools have been created to help C programmers avoid some of the problems inherent in the language, such as statements with undefined behavior or statements that are not a good practice because they are more likely to result in unintended behavior or run-time errors.

Automated source code checking and auditing are beneficial in any language, and for C many such tools exist, such as Lint. A common practice is to use Lint to detect questionable code when a program is first written. Once a program passes Lint, it is then compiled using the C compiler. Also, many compilers can optionally warn about syntactically valid constructs that are likely to actually be errors. MISRA C is a proprietary set of guidelines to avoid such questionable code, developed for embedded systems.

There are also compilers, libraries and operating system level mechanisms for performing array bounds checking, buffer overflow detection, serialization and automatic garbage collection, that are not a standard part of C.

Tools such as Purify, Valgrind, and linking with libraries containing special versions of the memory allocation functions can help uncover runtime memory errors.

Related languages

C has directly or indirectly influenced many later languages such as Java, C#, Perl, PHP, JavaScript, LPC, and Unix’s C Shell. The most pervasive influence has been syntactical: all of the languages mentioned combine the statement and (more or less recognizably) expression syntax of C with type systems, data models and/or large-scale program structures that differ from those of C, sometimes radically.

When object-oriented languages became popular, C++ and Objective-C were two different extensions of C that provided object-oriented capabilities. Both languages were originally implemented as source-to-source compilers — source code was translated into C, and then compiled with a C compiler.

Bjarne Stroustrup devised the C++ programming language as one approach to providing object-oriented functionality with C-like syntax. C++ adds greater typing strength, scoping and other tools useful in object-oriented programming and permits generic programming via templates. Nearly a superset of C, C++ now supports most of C, with a few exceptions (see Compatibility of C and C++ for an exhaustive list of differences).

Unlike C++, which maintains nearly complete backwards compatibility with C, the D language makes a clean break with C while maintaining the same general syntax. It abandons a number of features of C which Walter Bright (the designer of D) considered undesirable, including the C preprocessor and trigraphs. Some, but not all, of D’s extensions to C overlap with those of C++.

Objective-C was originally a very “thin” layer on top of, and remains a strict superset of, C that permits object-oriented programming using a hybrid dynamic/static typing paradigm. Objective-C derives its syntax from both C and Smalltalk: syntax that involves preprocessing, expressions, function declarations and function calls is inherited from C, while the syntax for object-oriented features was originally taken from Smalltalk.

Limbo is a language developed by the same team at Bell Labs that was responsible for C and Unix, and while retaining some of the syntax and the general style, introduced garbage collection, CSP based concurrency and other major innovations.