In math: A string is a finite sequence of symbols that are chosen from a set called an alphabet.
In CS: A string is a sequence of characters.
Strings are the way computers represent textual information.
Most string problems involve searching for a string, or a list of strings/patterns, in another string.
In computer memory, strings are nothing more than binary data. To interpret what a value such as 0x48 (the first 8 bits) means in terms of characters, we need an encoding.
The problem of encoding predates computers. Morse code, for example, is an encoding from sequences of short and long signals to Latin script.
ASCII was originally a 7-bit encoding.
ASCII characters are representable as numbers. This means that we can do arithmetic on them.
#include <stdio.h>

int main(void) {
    /* The original 7-bit ASCII range (codes 0-127) */
    for (int i = 0; i < 128; i++)
        printf("%c ", i);
    printf("\n");
    /* Codes 128-255: the extended, locale-dependent range */
    for (int i = 128; i < 256; i++)
        printf("%c ", i);
    printf("\n");
    return 0;
}
To accommodate different languages with characters in non-Latin scripts, or special characters (e.g. ø, ß, ç) in Latin-based languages, ASCII was extended with an 8th bit.
This allowed operating systems in most of the world (non-logographic languages) to use the American string representation as a basis and only customize encodings when the extra characters were required.
Q: What problems did this create?
A file encoded in ISO-8859-7 (Greek) could not be read correctly in a computer configured with ISO-8859-1 (Latin-1).
The world has around 7,000 languages and about 3,800 writing systems. Unicode is an effort to specify a single encoding scheme for all of them. The latest version contains a repertoire of 136,755 characters covering 139 modern and historic scripts, as well as multiple symbol sets (e.g. music, math, emoji).
Unicode defines a lookup table between a unique character ID (code point) and a character name/description.
The word hello is represented in Unicode as:
U+0068 U+0065 U+006C U+006C U+006F
Q: How would you store that on disk?
One way would be to store the bytes: 00 68 00 65 00 6C 00 6C 00 6F
However, this assumes that all Unicode symbols are equally likely to be used, and it limits us to \(2^{16}\) characters. The latter problem can be solved if we use 32 bits (4 bytes) for each character.
To solve space issues we can use variable-length encodings. The most common encoding format for Unicode data is UTF-8, developed by Ken Thompson and Rob Pike.
In UTF-8, the most common characters are encoded with fewer bits than rarer ones.
For the character U+262D (☭), various Unicode encodings produce the following results
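We can inspect these byte sequences with Python's built-in codecs (a quick sketch; the hex output is shown in comments):

```python
# Byte sequences produced by different Unicode encodings for U+262D.
c = "\u262d"  # ☭
print(c.encode("utf-8").hex())      # e298ad   (3 bytes)
print(c.encode("utf-16-be").hex())  # 262d     (2 bytes)
print(c.encode("utf-32-be").hex())  # 0000262d (4 bytes)
```

Note how UTF-8 needs 3 bytes for this rare character, while an ASCII letter would need only 1.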
One thing to remember about strings in computers:
Characters and bytes are 2 different things
Given a text \(T\) and a pattern \(P\), find all occurrences of \(P\) within \(T\).
\(T = AGCATGCTGCAGTCATGCTTAGGCTA\) \(P = GCT\)
Q: How many times does \(P\) occur in \(T\)?
The answer is 3. But how did you search for it?
Say that we are searching for \(P=\)ABC in string \(T=\)ABABABACCABC.
ABABABACCABC
1 ABX (first two characters match, last does not)
2 X (first character doesn't match)
3 ABX (first two characters match, last does not)
4 X (first character doesn't match)
5 ABX (first two characters match, last does not)
6 X (first character doesn't match)
7 AX (first character matches, second doesn't)
8 X (first character doesn't match)
9 X (first character doesn't match)
10 ABC (match found)
The brute force approach is \(O(mn)\), where \(m = len(P)\) and \(n = len(T)\).
At step 2, we restart the search at a character we already know won't match. We should instead continue from step 4.
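The brute-force scan can be sketched as follows (the function name brute_force_search is ours):

```python
def brute_force_search(T, P):
    """Slide P over T one position at a time; O(mn) worst case."""
    matches = []
    for i in range(len(T) - len(P) + 1):
        if T[i:i + len(P)] == P:
            matches.append(i)
    return matches

# The DNA example from above: GCT occurs 3 times.
print(brute_force_search("AGCATGCTGCAGTCATGCTTAGGCTA", "GCT"))  # [5, 16, 22]
```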
Idea: How about we build a lookup table of the lengths of prefixes in \(P\) that are also suffixes? Then, we can use that to jump ahead in our search if we get a mismatch!
| pattern | a | b | a | b | a | c |
|---|---|---|---|---|---|---|
| \(j\) | 0 | 1 | 2 | 3 | 4 | 5 |
| Prefix \(P[0..j]\) | a | ab | aba | abab | ababa | ababac |
| Longest prefix = suffix | \(\emptyset\) | \(\emptyset\) | a | ab | aba | \(\emptyset\) |
| \(p[j]\) | 0 | 0 | 1 | 2 | 3 | 0 |
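The table above can be computed directly from its definition; here is a naive \(O(m^2)\) sketch (the function name is ours):

```python
def naive_prefix_table(P):
    # For each prefix P[:j+1], find the length of the longest
    # proper prefix that is also a suffix of it.
    table = []
    for j in range(1, len(P) + 1):
        prefix = P[:j]
        best = 0
        for k in range(1, j):
            if prefix[:k] == prefix[-k:]:
                best = k
        table.append(best)
    return table

print(naive_prefix_table("ababac"))  # [0, 0, 1, 2, 3, 0]
```

The jump_table function shown later computes the same values in \(O(m)\).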
This table is used to calculate where to jump to in \(T\) in case of a mismatch at character position \(i\) in \(P\). The new position should be \(cur + i - jump[i]\).
Q: Let T = ABC ABCDAB ABCDABCDABDE and P = ABCDABD. What is the jump table?
j = [-1, 0, 0, 0, 0, 1, 2, 0], where \(j[k]\) is the length of the longest proper prefix of the first \(k\) characters of \(P\) that is also a suffix of them (\(j[0] = -1\) is a sentinel).
T: ABC ABCDAB ABCDABCDABDE
P: ABCDABD
      |
  mismatch
Step 1: We matched \(k = 3\) letters so far and \(j[k] = 0\). So we can jump by \(k - j[k] = 3\) letters ahead:
T: ABC ABCDAB ABCDABCDABDE
P:    ABCDABD
      |
  mismatch
Step 2: We matched \(k = 0\) letters, \(j[0] = -1\). So we can jump by \(k - j[k] = 1\) letter
T: ABC ABCDAB ABCDABCDABDE
P:     ABCDABD
             |
         mismatch
Step 3: We matched \(k = 6\) letters, \(j[6] = 2\). So we can jump by \(k - j[k] = 4\) letters
T: ABC ABCDAB ABCDABCDABDE
P:         ABCDABD
             |
         mismatch
Step 4: We matched \(k = 2\) letters, \(j[2] = 0\). So we can jump by \(k - j[k] = 2\) letters
T: ABC ABCDAB ABCDABCDABDE
P:           ABCDABD
             |
         mismatch
Step 5: We matched \(k = 0\) letters, \(j[0] = -1\). So we can jump by \(k - j[k] = 1\) letter
T: ABC ABCDAB ABCDABCDABDE
P:            ABCDABD
                    |
                mismatch
Step 6: We matched \(k = 6\) letters, \(j[6] = 2\). So we can jump by \(k - j[k] = 4\) letters
T: ABC ABCDAB ABCDABCDABDE
P:                ABCDABD
success!
The example above was adapted from Jaehyun Park’s String Algorithms lecture.
def jump_table(pattern):
    result = [None]
    for i in range(0, len(pattern)):
        j = i
        while True:
            if j == 0:
                result.append(0)
                break
            if pattern[result[j]] == pattern[i]:
                result.append(result[j] + 1)
                break
            j = result[j]
    return result
The complexity of the jump table calculator is \(O(m)\), where \(m\) is the length of the pattern.
def kmp(P, T):
    jump = jump_table(P)
    index = 0
    match = 0
    while index + match < len(T):
        if T[index + match] == P[match]:
            match = match + 1
            if match == len(P):
                return index
        else:
            if match == 0:
                index = index + 1
            else:
                index = index + match - jump[match]
                match = jump[match]
    return None
The complexity of KMP is \(O(n)\), where \(n\) is the length of the text; the overall complexity of applying KMP is \(O(m) + O(n)\), which is much better than brute force’s \(O(mn)\).
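As a self-contained check, restating the two functions above and running them on the worked example:

```python
def jump_table(pattern):
    # result[k]: length of the longest proper prefix of pattern[:k]
    # that is also its suffix (result[0] is a sentinel).
    result = [None]
    for i in range(0, len(pattern)):
        j = i
        while True:
            if j == 0:
                result.append(0)
                break
            if pattern[result[j]] == pattern[i]:
                result.append(result[j] + 1)
                break
            j = result[j]
    return result

def kmp(P, T):
    # Return the index of the first occurrence of P in T, or None.
    jump = jump_table(P)
    index, match = 0, 0
    while index + match < len(T):
        if T[index + match] == P[match]:
            match = match + 1
            if match == len(P):
                return index
        else:
            if match == 0:
                index = index + 1
            else:
                index = index + match - jump[match]
                match = jump[match]
    return None

print(jump_table("ABCDABD"))  # [None, 0, 0, 0, 0, 1, 2, 0]
print(kmp("ABCDABD", "ABC ABCDAB ABCDABCDABDE"))  # 15
```

The match at index 15 is exactly where the step-by-step trace above ended.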
A generalization of text search is pattern identification. Instead of using exact strings as search terms, we use patterns that define repetitions, groupings, and ranges of characters to match. This way, we can make our searches more generic.
The most common way to express search patterns are regular expressions.
| Pattern | Meaning |
|---|---|
| `*` | Match the previous pattern 0 or more times |
| `F-M` or `e-f` | Match a range of characters |
Regular expressions are converted to state machines, through which the text passes. For example, the state machine for ((e|a)b*)*
is:
/^#?([a-f0-9]+)$/
Matches hexadecimal numbers, perhaps starting with a #
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
Matches (a good subset of) valid email addresses. Here is a regexp that will match everything.
/^([0-9]{1,3}\.){3}[0-9]{1,3}$/
Matches IPv4 addresses (loosely: it also accepts out-of-range octets such as 999)
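Patterns like these can be tried out with Python's re module (the sample inputs, and the simplified IPv4 pattern, are ours):

```python
import re

# Hexadecimal numbers, optionally prefixed with '#'.
hex_re = re.compile(r'^#?([a-f0-9]+)$')
# A loose IPv4 pattern: 4 dot-separated groups of 1-3 digits.
ip_re = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}$')

print(bool(hex_re.match('#ff00aa')))     # True
print(bool(hex_re.match('zz')))          # False
print(bool(ip_re.match('192.168.0.1')))  # True
print(bool(ip_re.match('192.168.0')))    # False
```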
Approximate string matching (or fuzzy search) is a technique of finding strings that match a pattern approximately (rather than exactly).
Given a pattern string \(P\) and a string \(T\), find a sub-string \(T'\) in the set of all substrings of \(T\), so that the distance to \(P\) is minimal (for some definition of distance).
Q: Can you think of an application of fuzzy searching?
Some applications include spellcheckers, Google’s “Did you mean…”, DNA matching, etc.
The distance between two strings \(a,b\) (of length \(|a|\) and \(|b|\)) is:
\[ \qquad\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{ if } \min(i,j)=0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{ otherwise.} \end{cases} \]
\(lev(i,j)\) is the distance at characters \(a_i\) and \(b_j\).
The Levenshtein distance measures the minimal number of deletions, insertions or replacements to make the strings equal.
def levenshtein(s, t):
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    if s[-1] == t[-1]:
        cost = 0
    else:
        cost = 1
    res = min([levenshtein(s[:-1], t) + 1,
               levenshtein(s, t[:-1]) + 1,
               levenshtein(s[:-1], t[:-1]) + cost])
    return res
Q: What is the complexity of the simple implementation?
\(O(3^{m+n-1})\), where \(m\) and \(n\) are the lengths of s and t. See an explanation here
memo = {}

def levenshtein2(s, t):
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    cost = 0 if s[-1] == t[-1] else 1
    i1 = (s[:-1], t)
    if i1 not in memo:
        memo[i1] = levenshtein2(*i1)
    i2 = (s, t[:-1])
    if i2 not in memo:
        memo[i2] = levenshtein2(*i2)
    i3 = (s[:-1], t[:-1])
    if i3 not in memo:
        memo[i3] = levenshtein2(*i3)
    res = min([memo[i1] + 1, memo[i2] + 1, memo[i3] + cost])
    return res
With memoization, we store intermediate results in a shared dictionary, whose key is the function arguments. This takes space of \(O(mn)\), but the algorithm now becomes \(O(mn)\).
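The same memoization can also be obtained with functools.lru_cache from the standard library (an alternative sketch, not the implementation above):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lev(s, t):
    # Memoized Levenshtein distance; each (s, t) pair is computed once.
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    cost = 0 if s[-1] == t[-1] else 1
    return min(lev(s[:-1], t) + 1,
               lev(s, t[:-1]) + 1,
               lev(s[:-1], t[:-1]) + cost)

print(lev("kitten", "sitting"))  # 3
```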
Levenshtein distance does not tell us which steps we need to take to make the two strings equal. To obtain the steps, we need to solve the longest common subsequence (LCS) problem.
In LCS, we compare our \(O\)riginal string with a \(N\)ew one and obtain a sequence of all the items that are common in \(O\) and \(N\). By comparing this sequence to both \(O\) and \(N\), we can obtain a set of operations (or edit script) that when applied to \(O\) will give us \(N\).
O    1 a 2 3 4 r t
N    c 1 b 2 3 x d 4
LCS  1 2 3 4
DIFF c 1 a b 2 3 x d 4 r t
OP   + = - + = = + + = - -
def diff(xs, ys):
    cs = lcs(xs, ys)
    edit_script(xs, ys, cs)

def lcs(xstr, ystr):
    if not xstr or not ystr:
        return ""
    x, xs, y, ys = xstr[0], xstr[1:], ystr[0], ystr[1:]
    if x == y:
        return x + lcs(xs, ys)
    else:
        return max(lcs(xstr, ys), lcs(xs, ystr), key=len)
The complexity of the naive LCS is \(O(2^{m+n})\). Similar to Levenshtein distance, we can use memoization to convert LCS to \(O(mn)\).
What LCS returns are the common characters in xstr and ystr, in the order they appear in xstr.
def edit_script(xs, ys, cs):
    if len(cs) == 0:
        for x in xs:
            print("-%s" % x)
        for y in ys:
            print("+%s" % y)
        return
    x, y, c = xs[0], ys[0], cs[0]
    if c != x:
        print("-%s" % x)
        edit_script(xs[1:], ys, cs)
    elif c != y:
        print("+%s" % y)
        edit_script(xs, ys[1:], cs)
    else:
        print("=%s" % c)
        edit_script(xs[1:], ys[1:], cs[1:])
Given the two strings and the LCS, the edit_script function will return a sequence of additions (+) and removals (-) that, when applied to xs, will give ys.
The diff utility

The ability to compare strings and produce edit scripts has been generalized in the diff Unix tool, where differences are only considered at the line level.
The output of diff is the basis for all version control systems, like Git and Subversion.
This work is (c) 2017 - onwards by TU Delft, Georgios Gousios and his co-authors and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.