Erlang Thursday – ETS data matching

Today’s Erlang Thursday moves on from the introduction to ETS, and starts using it to store some data and do some retrieval of data in ETS.

First we need some data to have in ETS, so we will fall back to one of my goto problems, Markov Chains.

For those unfamiliar with what a Markov Chain is, it is a state machine that transitions to the next state based off a probability instead of a specific input. The common example that people are familiar with in “everyday use” is predictive typing on smart phones, where the next word or letter is offered up as a prediction, and the predicted words are chosen by the historical likelihood that the words predicted follows the current word.

The first thing we will do is to create a module that given a string of text, will return a list of tuples representing a word and the word that follows.

-module(markov_words).

-export([create_word_pairs/1]).

-spec create_word_pairs(string()) -> list({string(), string()}).
create_word_pairs(Text) ->
  Words = string:tokens(Text, " \t\n"),
  create_word_pairs([], Words).

create_word_pairs(WordPairs, [_Word|[]]) ->
    WordPairs;
create_word_pairs(WordPairs, [Word|Words]) ->
    [Following|_] = Words,
    UpdatedWordPairs = [{Word, Following} | WordPairs],
    create_word_pairs(UpdatedWordPairs, Words).

The above code takes a string of text and splits that text into “words” based off using the space, tab, and newline characters as a word boundary. With that list of “words”, we then create a list of word to following word tuples, which is what we will be inserting into our ETS table.

Time to fire up the Erlang shell and start experimenting.

First we need to compile our module, and then we will create a variable to hold our text we want to use to prime our Markov Chain.

c(markov_words).
% {ok,markov_words}
ToTC = "It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way--in short,
the period was so far like the present period,
that some of its noisiest authorities insisted on its
being received, for good or for evil, in the superlative
degree of comparison only.

There were a king with a large jaw and a queen with a
plain face, on the throne of England; there were a king
with a large jaw and a queen with a fair face,
on the throne of France. In both countries it was
clearer than crystal to the lords of the State preserves
of loaves and fishes, that things in general were
settled for ever.".

We will create a new process to give our ETS table away to, just in case we bomb out the shell.

Fun = fun() -> receive after infinity -> ok end end.
% #Fun<erl_eval.20.54118792>
SomeProcess = spawn(Fun).
% <0.60.0>

And we now create an ETS table that will store data for us to use as part of our Markov Chain generation.

WordPairs = ets:new(word_pairs, [public, duplicate_bag]).
% 20498
ets:give_away(WordPairs, SomeProcess, []).
% true

We make the table public in this case, since we want our shell process, which is no longer the owner, to be able to add items to the table, and we make the table type a duplicate bag.

The reason for the duplicate_bag, is that for demonstration reasons, we want to be able to have multiple entries with the same key, as we are likely to see any word multiple times, and that some sets of word pairs are more common, so we want to be able to capture (and retain) those “duplicates”.

And for ease of population from inside the shell, we will use a list comprehension to add each word pair tuple we create from the text into our ETS table by calling ets:insert/2.

[[ ets:insert(WordPairs, WordPair) || WordPair <- markov_words:create_word_pairs(ToTC)]].
% [[true,true,true,true,true,true,true,true,true,true,true,
%   true,true,true,true,true,true,true,true,true,true,true,true,
%   true,true,true,true,true|...]]

Now we should have some data in our ETS table, it is time to see how we can access our data. For this we turn to ets:match/2. ets:match/2 takes a table to query, and a Pattern.

The pattern is made up an Erlang term to be matched against; _, which matches anything and doesn’t bind; or pattern variables, which take the form of $N where N is any positive integer. The return result of ets:match/2 is a list containing the list of values in the pattern variables in order of variable name order.

So with this knowledge, we can try to find the word pairs to find the words that follow "of". If we were doing a pattern match it would look like {"of", Following}, but using ETS, we need to use a pattern variable which would make our spec {"of", '$1'}.

So lets run that against our ETS table.

ets:match(WordPairs, {"of", '$1'}).
% [["loaves"],
%  ["the"],
%  ["France."],
%  ["England;"],
%  ["comparison"],
%  ["its"],
%  ["despair,"],
%  ["hope,"],
%  ["Darkness,"],
%  ["Light,"],
%  ["incredulity,"],
%  ["belief,"],
%  ["foolishness,"],
%  ["wisdom,"],
%  ["times,"],
%  ["times,"]]

And there we go, we can see the results is a list of the list of variable matches, in this case, just what '$1' matched to.

For fun and exploration, let’s confirm what we would get if we look for the words that follow "it" in our Tale of Two Cities intro text.

ets:match(WordPairs, {"it", '$1'}).
% [["was"],
%  ["was"],
%  ["was"],
%  ["was"],
%  ["was"],
%  ["was"],
%  ["was"],
%  ["was"],
%  ["was"],
%  ["was"]]

Just a bunch of "was" which is right for the first two paragraphs of the book.

And we will double check and look for any words that follow "Scrooge".

ets:match(WordPairs, {"Scrooge", '$1'}).
% []

And if we wanted to get the whole tuple back, we could use ets:match_object/2, which will return the whole object that satisfies the match

ets:match_object(WordPairs, {"of", '$1'}).
% [{"of","loaves"},
%  {"of","the"},
%  {"of","France."},
%  {"of","England;"},
%  {"of","comparison"},
%  {"of","its"},
%  {"of","despair,"},
%  {"of","hope,"},
%  {"of","Darkness,"},
%  {"of","Light,"},
%  {"of","incredulity,"},
%  {"of","belief,"},
%  {"of","foolishness,"},
%  {"of","wisdom,"},
%  {"of","times,"},
%  {"of","times,"}]

or, in this case we could use ets:lookup/2 which returns all of the items for which the key matches.

ets:lookup(WordPairs, "of").
% [{"of","loaves"},
%  {"of","the"},
%  {"of","France."},
%  {"of","England;"},
%  {"of","comparison"},
%  {"of","its"},
%  {"of","despair,"},
%  {"of","hope,"},
%  {"of","Darkness,"},
%  {"of","Light,"},
%  {"of","incredulity,"},
%  {"of","belief,"},
%  {"of","foolishness,"},
%  {"of","wisdom,"},
%  {"of","times,"},
%  {"of","times,"}]

So to take a brief detour away from the Markov Chain example, why might we want to use either ets:lookup/2 or ets:match_object/2 versus the other? To answer that with an example, let’s add another entry into our WordPairs table, that is a three-tuple.

To start with, we will insert 100_000 items into our ETS tables and see what the resulting memory size becomes. We will insert a new tuple of {X, X}, for all numbers from 1 to 100_000.

ets:insert(WordPairs, {"of", "times,", "it"}).
% true

If we do a ets:lookup/2 we get all items with the specified key.

ets:lookup(WordPairs, "of").
[{"of","loaves"},
 {"of","the"},
 {"of","France."},
 {"of","England;"},
 {"of","comparison"},
 {"of","its"},
 {"of","despair,"},
 {"of","hope,"},
 {"of","Darkness,"},
 {"of","Light,"},
 {"of","incredulity,"},
 {"of","belief,"},
 {"of","foolishness,"},
 {"of","wisdom,"},
 {"of","times,"},
 {"of","times,"},
 {"of","times,","it"}]

But if we use ets:match_object/2, and use a two-tuple because we only want the word pairs, we don’t get the item that is a three-tuple in the results.

ets:match_object(WordPairs, {"of", '_'}).
[{"of","loaves"},
 {"of","the"},
 {"of","France."},
 {"of","England;"},
 {"of","comparison"},
 {"of","its"},
 {"of","despair,"},
 {"of","hope,"},
 {"of","Darkness,"},
 {"of","Light,"},
 {"of","incredulity,"},
 {"of","belief,"},
 {"of","foolishness,"},
 {"of","wisdom,"},
 {"of","times,"},
 {"of","times,"}]

Back to the Markov Chain scenario, we can start to see how we can get some text that follows the Markov Chain rules.

We get the match of potential words to choose from for a given word, and we pick a uniformly random result from the list of following words.

PotentialChoices = ets:match(WordPairs, {"of", '$1'}).
[NextWord] = lists:nth(random:uniform(length(PotentialChoices)), PotentialChoices).

We can write a function that will repeat these steps until we get to our termination case. Some examples of a termination state could be a word that doesn’t have a word that follows it; we get to a certain number of words picked to make up our text; or we get to a certain total length, say to fit in a SMS or Tweet.

With that, we have started to scratch the surface of putting some “real” data into ETS, and doing matching against the data for some given pattern. Next week we will continue looking at this example with other ways to get data out of our ETS tables into something that might be nicer to consume.

–Proctor