Lessoned Code, a blog by Ivo Danihelka

<h2>Results from Sum-Product Networks (2011-12-30)</h2>
<p
>I'm now experimenting with <a href="http://lessoned.blogspot.com/2011/10/intro-to-sum-product-networks.html"
>sum-product networks</a
>. I will show some of the results.</p
><div id="predicting-a-half-of-the-image"
><h3
>Predicting a half of the image</h3
><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-7J4AIdHvpd0/Tv4HmLXmmEI/AAAAAAAAB9k/D5BG1DNBhyw/s1600/left_800.png" imageanchor="1" style=""><img border="0" height="59" width="400" src="http://3.bp.blogspot.com/-7J4AIdHvpd0/Tv4HmLXmmEI/AAAAAAAAB9k/D5BG1DNBhyw/s400/left_800.png" /></a></div>
<p
>The above picture shows:</p
><ol style="list-style-type: decimal;"
><li
>The original images in the first row.</li
><li
>The visible part of the input in the second row.</li
><li
>The expected pixel intensities predicted by the network.</li
></ol
><p
>The images shown are from a validation set. The sum-product network was trained on a separate set of 800 images from the <a href="http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html"
>notMNIST dataset</a
>.</p
><p
>The trained sum-product network is a probabilistic model. Given some evidence, it can compute conditional probabilities and expectations. Here, the network is asked to compute the expected intensity of each pixel, with the right half of the image given as evidence.</p
><p
>Some of the expected pixel intensities look good and some look bad. If multiple values are possible for a pixel, the expected intensity is a weighted average of the possibilities.</p
></div
><div id="predicting-every-second-column"
><h3
>Predicting every second column</h3
><div class="separator" style="clear: both; text-align: center;"><a href="http://1.bp.blogspot.com/-_ZF90-kurRU/Tv4H0ucGwnI/AAAAAAAAB9w/Y_HM6pWpxLs/s1600/even_800.png" imageanchor="1" style=""><img border="0" height="59" width="400" src="http://1.bp.blogspot.com/-_ZF90-kurRU/Tv4H0ucGwnI/AAAAAAAAB9w/Y_HM6pWpxLs/s400/even_800.png" /></a></div>
<p
>The expected pixel intensities look better when every second column is given as evidence. The number of pixels to predict is still one half of the image.</p
><p
>The better result can be explained by:</p
><ol style="list-style-type: decimal;"
><li
><p
>Showing pixels from all parts of the image prunes the set of probable whole images more effectively. Each expected pixel intensity is then an average over fewer possibilities.</p
></li
><li
><p
>The network structure used here knows about pixel locality. Nearby image regions are connected at the bottom layers of the network.</p
></li
></ol
><div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-B_zNIC-96lk/Tv4zmKqvFCI/AAAAAAAAB98/IV-59Kl90Ok/s1600/patterns.png" imageanchor="1" style=""><img border="0" height="161" width="400" src="http://4.bp.blogspot.com/-B_zNIC-96lk/Tv4zmKqvFCI/AAAAAAAAB98/IV-59Kl90Ok/s400/patterns.png" /></a></div>
<p
>The whole network gives high probability to some patterns and lower probability to others. The top sum node has multiple possible patterns as children. The children of a sum node are product nodes. The product nodes split the pattern into smaller sub-patterns.</p
><p
>If two nearby sub-patterns occur together, it is easy to connect them by a product node. The new pattern means: sub_pattern1 AND sub_pattern2. The new pattern will occur often if the sub-patterns often occur together. Such patterns are discovered when training the network.</p
></div
>
<h2>The Big Picture (2011-10-24)</h2>
<p>It is helpful to view all machine-learning methods as approximations of Bayesian inference. This view makes it possible to devise new approximations, or to make existing approximations more precise.
</p>
<p>The following two videos show the unified view. They explain it better than I would.
</p>
<ul>
<li><a href="http://www.youtube.com/watch?v=frbX2JH-_Aw&list=PLD0F06AA0D2E8FFBA&index=19">The Big Picture (part 1)</a></li>
<li><a href="http://www.youtube.com/watch?v=Ih4R42qPRWo&list=PLD0F06AA0D2E8FFBA&index=20">The Big Picture (part 2)</a></li>
</ul>
<iframe width="560" height="315" src="http://www.youtube.com/embed/frbX2JH-_Aw" frameborder="0" allowfullscreen></iframe>
<p>
Note that the goal is to minimize the expected loss. The expectation is over all possible examples. Modeling P(X, Y) can help achieve a small loss on unseen examples.
</p>
<p>
<b>Footnote:</b><br/>
Even <a href="http://www.icml-2011.org/papers/386_icmlpaper.pdf">SVM can be viewed as a probabilistic model</a>.
</p>
<h2>Intro to Sum-Product Networks (2011-10-02)</h2>
<p
>In 2011, a new probabilistic model was proposed. It is fast, it can represent many probability functions, and its learning prefers simple explanations. I will show you what the model looks like. You can then read the original paper for details: <a href="http://alchemy.cs.washington.edu/spn/poon11.pdf"
>Sum-Product Networks: A New Deep Architecture</a
>.</p>
<div id="computing-probabilities-quickly"
>
<h2
>Computing Probabilities Quickly</h2>
<p
><b>Sum-product networks</b> make it possible to compute the probability of an event quickly. They were first invented as a data structure for this quick computation.</p>
<div id="network-with-one-variable"
>
<h3
>Network with one variable</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-Wmsra9VM3zw/Togqpzfp4lI/AAAAAAAAB8k/Kj8OqbNXfXU/s1600/onevar.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="90" width="119" src="http://3.bp.blogspot.com/-Wmsra9VM3zw/Togqpzfp4lI/AAAAAAAAB8k/Kj8OqbNXfXU/s320/onevar.png" /></a></div>
<p
>A sum-product network is a directed acyclic graph. It has alternating layers of sum and product nodes. Only the edges below a sum node have weights. The weights are normalized to sum to 1.</p>
<p
>To compute the probability of some evidence, the network is evaluated from the bottom up.</p>
<pre
><code
>P(X1=1) = net.eval_with(x1=1, not_x1=0)
= P(X1=1) * 1 + P(X1=0) * 0
</code
></pre>
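<p>The pseudo-code above can be made executable. Below is a minimal sketch; the function and argument names are invented here, mirroring the pseudo-code's net.eval_with:</p>

```python
def eval_net(p_x1, x1, not_x1):
    """Bottom-up evaluation of the one-variable network.

    The two leaves are the indicators x1 and not_x1; the top sum node
    weights them by P(X1=1) and P(X1=0).
    """
    return p_x1 * x1 + (1 - p_x1) * not_x1

# Evidence X1=1: the matching indicator is 1, the other is 0.
print(eval_net(0.3, x1=1, not_x1=0))  # 0.3, i.e. P(X1=1)
```

<p>Setting both indicators to 1 makes the sum node add up both cases, which is used for marginalization below.</p>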
</div>
<div id="network-with-two-independent-variables"
>
<h3
>Network with two independent variables</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-1gfNDuzTs2o/Togqp6RNRuI/AAAAAAAAB8s/WspO5P6OR00/s1600/independent.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="130" width="259" src="http://3.bp.blogspot.com/-1gfNDuzTs2o/Togqp6RNRuI/AAAAAAAAB8s/WspO5P6OR00/s320/independent.png" /></a></div>
<p
>When two variables are independent, their joint probability can be represented concisely. We don't need to store the probabilities of all their combinations. It is enough to store the factors: P(X1=1), P(X2=1).</p>
<p
>The <b>joint probability</b> is calculated by evaluating the network:</p>
<pre
><code
>P(X1=1, X2=1) = net.eval_with(
x1=1, not_x1=0,
x2=1, not_x2=0)
= (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 0)
= P(X1=1) * P(X2=1)
</code
></pre>
<p
>It is also possible to calculate the <b>probability of a subset</b> of variables. Naively, we would evaluate the network multiple times:</p>
<pre
><code
>P(X1=1) = P(X1=1, X2=0) + P(X1=1, X2=1)
</code
></pre>
<p
>It can be done faster. We want the network to sum all the possible branches. We can do that by setting both x2 and not_x2 to 1. It is then enough to evaluate the network just once.</p>
<pre
><code
>P(X1=1) = net.eval_with(
x1=1, not_x1=0,
x2=1, not_x2=1)
= (P(X1=1) * 1 + P(X1=0) * 0) * (P(X2=1) * 1 + P(X2=0) * 1)
= P(X1=1) * 1.0
</code
></pre>
<p
>The computation is done in O(num_edges) steps.</p>
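<p>The same marginalization trick in executable form (again a sketch with invented names):</p>

```python
def eval_net(p1, p2, x1, not_x1, x2, not_x2):
    # One sum node per variable, joined by a product node on top.
    sum1 = p1 * x1 + (1 - p1) * not_x1
    sum2 = p2 * x2 + (1 - p2) * not_x2
    return sum1 * sum2

# Joint probability: one indicator per variable is set.
joint = eval_net(0.25, 0.5, x1=1, not_x1=0, x2=1, not_x2=0)
# Marginal P(X1=1): both X2 indicators are 1, so the network
# sums over both values of X2 in a single evaluation.
marginal = eval_net(0.25, 0.5, x1=1, not_x1=0, x2=1, not_x2=1)
print(joint, marginal)  # 0.125 0.25
```

<p>One evaluation replaces the sum over all completions of the missing variables.</p>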
</div>
<div id="network-with-conditional-independence"
>
<h3
>Network with conditional independence</h3>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-U_dJyyYutCc/Toh9LRFQY1I/AAAAAAAAB9E/DM8rSGincG8/s1600/conditional.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="184" width="400" src="http://4.bp.blogspot.com/-U_dJyyYutCc/Toh9LRFQY1I/AAAAAAAAB9E/DM8rSGincG8/s400/conditional.png" /></a></div>
<p
>Conditional independence also helps to make the network concise. In the above network, variables X1 and X2 are independent when given X3. The value of X3 switches between the branches of the network.</p>
<pre
><code
>P(X1=1, X2=1, X3=1) = net.eval_with(
x1=1, not_x1=0,
x2=1, not_x2=0,
x3=1, not_x3=0)
= (P(X1=1|X3=1) * P(X2=1|X3=1) * 1) * P(X3=1)
</code
></pre>
<p
>This network can represent the same probabilities as a Naive Bayes model.</p>
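<p>An executable sketch of this network (invented names; p1_given_x3[v] stands for P(X1=1|X3=v)):</p>

```python
def eval_net(p3, p1_given_x3, p2_given_x3,
             x1, not_x1, x2, not_x2, x3, not_x3):
    def branch(v):
        # Product node over the two conditional sum nodes for X3=v.
        s1 = p1_given_x3[v] * x1 + (1 - p1_given_x3[v]) * not_x1
        s2 = p2_given_x3[v] * x2 + (1 - p2_given_x3[v]) * not_x2
        return s1 * s2

    # Top sum node: the X3 indicators switch between the branches.
    return p3 * branch(1) * x3 + (1 - p3) * branch(0) * not_x3

p = eval_net(0.5, {0: 0.25, 1: 0.75}, {0: 0.5, 1: 0.5},
             x1=1, not_x1=0, x2=1, not_x2=0, x3=1, not_x3=0)
print(p)  # P(X1=1|X3=1) * P(X2=1|X3=1) * P(X3=1) = 0.75 * 0.5 * 0.5 = 0.1875
```

<p>Setting both x3 and not_x3 to 1 would marginalize X3 out, just as in the previous section.</p>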
</div>
<div>
<h2>Expressive Power</h2>
<p>Any <a href="http://en.wikipedia.org/wiki/Bayesian_network">Bayesian network</a> can be converted to a sum-product network. The resulting sum-product network may have many edges. In the worst case, the number of edges in the sum-product network will be proportional to the time complexity of computing probabilities in the Bayesian network (i.e., O(num_vars * 2**<a href="http://en.wikipedia.org/wiki/Treewidth">treewidth</a>)).
</p>
<p>Some probability functions can be represented more concisely by a sum-product network. For example, the sum-product network can omit edges with zero weights. And a sum-product network can reuse nodes. The network does not need to be a tree. A node can be reused by multiple parents.</p>
<p>The concise sum-product network can be learned directly from training examples.</p>
</div>
</div>
<div id="learning"
>
<h2
>Learning</h2>
<p
>Learning is done by maximizing likelihood P(training_examples|net). The learned network should give high probability to each seen example.</p>
<div id="requirements-for-network-structure"
>
<h3
>Requirements for network structure</h3>
<p
>At the start, we don't know the network structure and the edge weights. We have some requirements for the network structure:</p>
<ul
><li
><p
>The number of edges should be small. It forces the network to find general explanations for the seen examples. That should improve generalization. The network will also have a small number of weights, so a small number of examples will be enough to estimate them.</p>
<p
>The small number of edges will also make the network evaluation fast.</p>
</li>
<li
><p
>The network should produce valid probabilities. The probability of an event should be equal to the sum of the probabilities of all the included outcomes.</p>
</li>
</ul>
<p
>Fortunately, the validity of the produced probabilities can be ensured by two simple constraints:</p>
<ol style="list-style-type: decimal;"
><li
><p
>All children of a sum node must use the same variables. No variable can be missing and no variable can be added. Only a product node can split the set of used variables into subsets.</p>
</li>
<li
><p
>A product node must not multiply xi with not_xi.</p>
</li>
</ol>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-R4SsnJXaOWM/TohAHHDzRgI/AAAAAAAAB88/K8kkzoxh89A/s1600/invalid.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="114" width="254" src="http://2.bp.blogspot.com/-R4SsnJXaOWM/TohAHHDzRgI/AAAAAAAAB88/K8kkzoxh89A/s320/invalid.png" /></a></div>
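<p>The two constraints can be checked mechanically. The sketch below uses an invented tuple encoding: ('leaf', variable), ('sum', [(weight, child), ...]), ('product', [child, ...]). A leaf records which variable it uses (the indicators xi and not_xi both use variable i). Constraint 2 is enforced through the slightly stronger standard condition that the children of a product node use disjoint sets of variables:</p>

```python
def scope(node):
    """Return the set of variables a node uses, checking validity."""
    kind = node[0]
    if kind == 'leaf':
        return {node[1]}
    if kind == 'sum':
        scopes = [scope(child) for _, child in node[1]]
        assert all(s == scopes[0] for s in scopes), \
            'children of a sum node must use the same variables'
        return scopes[0]
    # 'product': children must use disjoint variables, so a product
    # can never multiply xi with not_xi (both carry variable i).
    seen = set()
    for child in node[1]:
        child_scope = scope(child)
        assert not (seen & child_scope), \
            'a product node must not repeat a variable'
        seen |= child_scope
    return seen

valid = ('product', [('sum', [(0.3, ('leaf', 'x1')), (0.7, ('leaf', 'x1'))]),
                     ('leaf', 'x2')])
print(scope(valid))  # passes and returns the full scope
```

<p>An invalid network, such as a product of two leaves over the same variable, fails the assertion.</p>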
</div>
<div id="learning-steps"
>
<h3
>Learning steps</h3>
<p
>A valid network can be learned by:</p>
<ol style="list-style-type: decimal;"
><li
><p
>Define a valid structure of the network. The structure can have many layers and many edges. The initial weights will be zero.</p>
</li>
<li
><p
>Increment the probability of each training example by incrementing some weights.</p>
</li>
<li
><p
>Retain only the edges with non-zero weights in the final network.</p>
</li>
</ol>
<p
>Finding the weights to increment is also fast; it is similar to evaluating the network. The most active edges should have their weights incremented.</p
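>
<p>One concrete scheme matching this description can be sketched as follows. This is a hedged, winner-take-all sketch with invented classes, not the exact algorithm of the SPN code: evaluate the network bottom-up, then walk down from the root and increment the weight of the most active child edge of each sum node.</p>

```python
class Leaf:
    def __init__(self, var):
        self.var = var
    def value(self, evidence):
        return evidence[self.var]   # indicator value, 0 or 1
    def increment(self, evidence):
        pass                        # leaves have no weights

class Product:
    def __init__(self, children):
        self.children = children
    def value(self, evidence):
        result = 1.0
        for child in self.children:
            result *= child.value(evidence)
        return result
    def increment(self, evidence):
        # all children of a product participate in the explanation
        for child in self.children:
            child.increment(evidence)

class Sum:
    def __init__(self, children):
        self.children = children
        self.counts = [0] * len(children)   # weights start at zero
    def value(self, evidence):
        total = sum(self.counts)
        if total == 0:
            # nothing learned yet; fall back to the best child
            # (an arbitrary choice made for this sketch)
            return max(c.value(evidence) for c in self.children)
        return sum((n / total) * c.value(evidence)
                   for n, c in zip(self.counts, self.children))
    def increment(self, evidence):
        # the most active edge gets its weight incremented
        best = max(range(len(self.children)),
                   key=lambda i: self.children[i].value(evidence))
        self.counts[best] += 1
        self.children[best].increment(evidence)

net = Sum([Leaf('x1'), Leaf('not_x1')])
for _ in range(8):
    net.increment({'x1': 1, 'not_x1': 0})
for _ in range(2):
    net.increment({'x1': 0, 'not_x1': 1})
print(net.value({'x1': 1, 'not_x1': 0}))  # 0.8
```

<p>After training, the normalized counts approach the observed frequencies of the training examples.</p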
<p
>The details can be seen in the <a href="http://alchemy.cs.washington.edu/spn/"
>SPN source code</a
>. The learning stops at a local maximum. Ask me for an explanation, if you are interested.</p>
</div>
</div>
<h2>Decision Trees without Pruning (2011-09-07)</h2>
<p
>Do you want to improve the generalization of a decision tree? Don't prune. Try full Bayesian weighting instead. A prediction can still be done in O(tree_depth) steps.</p
><p
>Pruning is an approximation of Bayesian weighting. It gives weight 1.0 to one short tree and weight 0 to all other trees.</p
><p
>It can be done better. We can give weights to all tree depths. Predictions from shorter trees would get bigger weights. <a href="http://lessoned.blogspot.com/2011/06/bayesian-sequence-predictor.html"
>Context Tree Weighting</a
> allows that.</p
><div id="context"
><h2
>Context</h2
><p
>Context Tree Weighting is normally used to predict the next bit of a string. For that, it uses the suffix of the string as a context. Different predictions are done from different context lengths. The final prediction is a weighted sum of the predictions. The shorter context lengths get bigger weights.</p
><p
>The context does not need to be a suffix. We can use any list of bits as a context.</p
><pre
><code
>context = get_context(model, training_example)
</code
></pre
></div
><div id="training"
><h2
>Training</h2
><p
>First, a fully grown binary decision tree is constructed. Information gain or some other greedy heuristic can be used for that.</p
><p
>The decision tree can now be used to select contexts for the Context Tree Weighting model. Different decision tree depths should serve as different context lengths. We use the path from the tree root to a tree leaf as the context.</p
><pre
><code
>def get_context(model, training_example):
path = []
node = model.tree.root
while node is not None:
child_index = node.choose_child(training_example)
path.append(child_index)
node = node.children[child_index]
path.reverse()
return path
</code
></pre
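>
<p>A quick check of <code>get_context</code> on a tiny hypothetical tree. The node, tree and example encodings are invented for this sketch, and <code>get_context</code> is repeated to keep it self-contained:</p>

```python
class Node:
    def __init__(self, feature, children=(None, None)):
        self.feature = feature
        self.children = children   # (left, right); (None, None) at leaves
    def choose_child(self, example):
        return example[self.feature]   # branch on one binary feature

class Tree:
    def __init__(self, root):
        self.root = root

class Model:
    def __init__(self, tree):
        self.tree = tree

def get_context(model, training_example):
    path = []
    node = model.tree.root
    while node is not None:
        child_index = node.choose_child(training_example)
        path.append(child_index)
        node = node.children[child_index]
    path.reverse()
    return path

model = Model(Tree(Node(0, (Node(1), Node(1)))))
# example with features [1, 0]: go right at the root, then left
print(get_context(model, [1, 0]))  # [0, 1]
```

<p>The deepest decision ends up on the left of the context and the root decision on the right.</p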
><p
>The path is reversed so that the most important bit is on the right, just as it is in a suffix.</p
><p
>Each training example would be used to train the Context Tree Weighting model. The training precomputes values in the model. The precomputed values can be stored in the decision tree nodes. Each example is processed in O(tree_depth) steps.</p
></div
><div id="prediction"
><h2
>Prediction</h2
><p
>The decision tree is used as a normal Context Tree Weighting model. A context of the test example is obtained. And a prediction is made based on the context.</p
></div
><div id="resources"
><h2
>Resources</h2
><p
>The possibility of non-suffix contexts was mentioned in chapter 6 of <a href="http://jveness.info/publications/veness_phd_thesis_final.pdf"
>Approximate Universal Artificial Intelligence</a
>.</p
><p
>The weighting of different decision tree depths is simple, so I guess somebody else has suggested it before. I haven't found a paper mentioning it, so I share the trick here.</p
></div
>
<h2>To the edge of our galaxy (2011-08-07)</h2>
<p
>It is possible to travel 500+ light years in 80 years. And you don't need to travel faster than light. Stephen Hawking mentioned the possibility at the end of <a href="http://www.dailymail.co.uk/home/moslive/article-1269288/STEPHEN-HAWKING-How-build-time-machine.html"
>his article</a
>.</p
><div id="heres-how"
><h2
>Here's how</h2
><p
>Use a giant rocket. It needs to carry rocket fuel for 6 years of accelerating. After the 6 years, you would travel at 99% of the speed of light. You would be moving very fast in space. You would be <a href="http://www.reddit.com/r/askscience/comments/fjwkh/why_exactly_can_nothing_go_faster_than_the_speed/c1gh4x7"
>moving less in time</a
>. Your clock would go slower than clocks on Earth.</p
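>
<p>A rough back-of-the-envelope check, ignoring the years spent accelerating and decelerating:</p>

```python
import math

v = 0.99                           # speed as a fraction of c
gamma = 1 / math.sqrt(1 - v * v)   # time dilation factor, about 7.09

distance_ly = 500
earth_years = distance_ly / v      # about 505 years pass on Earth
ship_years = earth_years / gamma   # but only about 71 years on the ship
print(round(ship_years, 1))
```

<p>So a 500 light-year trip at 99% of the speed of light fits into roughly 71 ship years.</p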
></div
>
<h2>A Bayesian Sequence Predictor (2011-06-26)</h2>
<p
>It is possible to use Bayesian inference for sequence prediction. I will show how to predict a continuation of a binary sequence.</p
><p
>For example, we want to know what the next bit will be after seeing "01101".</p
><div id="considered-generators"
><h2
>Considered Generators</h2
><p
>I will not consider all possible sequence generators; that would mean all possible computer programs. I will consider a smaller set: all generators where the next generated bit depends only on a suffix of the sequence.</p
><p
>Each suffix will assign a different probability to the next bit:</p
><pre
><code
>import random

class OnSuffixGenerator:
    def __init__(self, suffix):
        self.suffix = suffix
        self.next_bit_probability = unknown_function(suffix)

    def generate_next_bit(self, seq):
        assert seq.endswith(self.suffix)
        probability = self.next_bit_probability
        return "1" if probability > random.random() else "0"
</code
></pre
><p
>Multiple OnSuffixGenerators can be used to generate a sequence. They cooperate to cover different suffixes.</p
><p
>We don't know the used OnSuffixGenerators. We see only the beginning of a generated sequence. And we assume that it was generated by a suffix-based generator.</p
></div
><div id="brute-force-approach"
><h2
>Brute Force Approach</h2
><p
>If we want to predict the next bit of the sequence, we want to know the probability of the next bit being one:</p
><pre
><code
>P(NextBit=1|seq) = P(seq + "1") / P(seq)
</code
></pre
><p
>For that, we need to calculate P(seq + "1") and P(seq). The probability of a sequence is its marginal probability:</p
><pre
><code
>P(seq) = sum(P(seq|generator) * P(generator) for
generator in ALL_GENERATORS)
</code
></pre
><p
>A brute force approach would calculate the expression directly. The number of all possible suffix-based generators is very big. Each generator is a cooperation of one or more OnSuffixGenerators.</p
><p
>For example, OnSuffixGenerators for "00", "10" and "1" suffixes can cooperate as:</p
><pre
><code
>def get_next_bit_generator(seq):
if seq.endswith("00"):
return OnSuffixGenerator("00")
if seq.endswith("10"):
return OnSuffixGenerator("10")
if seq.endswith("1"):
return OnSuffixGenerator("1")
</code
></pre
><p
>The number of all possible suffix-based generators is bigger than 2**len(seq). The brute force approach would need to iterate over all of them.</p
></div
><div id="context-tree-weighting-approach"
><h2
>Context-Tree Weighting Approach</h2
><p
>Fortunately, we can calculate the sequence probability in a faster way. We can consider one OnSuffixGenerator in many cooperations at the same time.</p
><div id="weighting"
><h4
>Weighting</h4
><p
>If we have only two possible generators, the probability of a sequence is:</p
><pre
><code
>P(seq) = P(seq|generator1) * P(generator1) +
P(seq|generator2) * P(generator2)
</code
></pre
><p
>We will give them equal prior probabilities. You can incorporate a different prior belief, if you have it.</p
><pre
><code
>P(seq) = P(seq|generator1) * 0.5 +
P(seq|generator2) * 0.5
</code
></pre
></div
><div id="notation"
><h4
>Notation</h4
><p
>We now need to express P(seq|generator1) for our possible sequence generators. A considered OnSuffixGenerator generates only the bits after the given suffix. A different generator is used to generate the remaining bits. I will need a notation for the probability of just the bits after the given suffix. I will use the <b><code>P(seq on s)</code></b> syntax for the probability that the bits after the suffix <code>s</code> were generated by an OnSuffixGenerator(s).</p
><pre
><code
>P(seq on s) = P(seq|
bits_with_different_suffix(seq, s),
Generator=OnSuffixGenerator(s))
P(seq on not_present_suffix) = P(seq|seq) = 1
</code
></pre
><p
>I will use <b><code>*s</code></b> syntax to denote all suffixes ending with the suffix <code>s</code>.</p
></div
><div id="putting-it-together"
><h4
>Putting it together</h4
><p
>Let's consider the probability of a sequence. For the bits after a suffix, we have only two possible generators:</p
><ol style="list-style-type: decimal;"
><li
>The bits can be generated by a generator with the given suffix.</li
><li
>Or the bits can be generated by a cooperation of different generators with some longer suffixes.</li
></ol
><pre><code>P(seq on *s) = P(seq on s) * 0.5 +
P(seq on *0s) * P(seq on *1s) * P(seq on STARTs) * 0.5
</code></pre>
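<p>The recursion can be turned into short working code. Below is a minimal sketch; the class and function names are my own, and the start of the sequence is handled by zero-padding the context, which is a simplification of the STARTs suffixes:</p>

```python
class Node:
    """One context (suffix) in the context tree."""
    def __init__(self):
        self.counts = [0, 0]   # zeros and ones seen in this context
        self.children = {}     # preceding bit -> child Node
        self.p_kt = 1.0        # estimate of P(seq on s)
        self.p_w = 1.0         # weighted P(seq on *s)

def update(node, context, bit):
    """Record one bit that followed context; context[-1] is the nearest bit."""
    if context:
        child = node.children.setdefault(context[-1], Node())
        update(child, context[:-1], bit)
    # Krichevsky-Trofimov estimator for the bits seen in this context.
    total = node.counts[0] + node.counts[1]
    node.p_kt *= (node.counts[bit] + 0.5) / (total + 1)
    node.counts[bit] += 1
    if context:
        p_split = 1.0
        for child in node.children.values():
            p_split *= child.p_w   # a missing child contributes 1
        # P(seq on *s) = P(seq on s) * 0.5 + product of longer suffixes * 0.5
        node.p_w = 0.5 * node.p_kt + 0.5 * p_split
    else:
        node.p_w = node.p_kt   # maximum context depth reached

def sequence_probability(seq, depth=3):
    root = Node()
    for i, ch in enumerate(seq):
        # Pad the start with zeros instead of tracking STARTs suffixes.
        context = seq[max(0, i - depth):i].rjust(depth, '0')
        update(root, context, int(ch))
    return root.p_w   # P(seq) = P(seq on *)

p_seq = sequence_probability("01101")
p_next_1 = sequence_probability("011011") / p_seq   # P(NextBit=1|seq)
```

<p>The predictor is consistent: the probabilities of the two possible continuations of a sequence add up to the probability of the sequence itself.</p>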
<p
>The "STARTs" suffix matches only at the start of the sequence. It is used to cover the bits uncovered by the longer suffixes.</p
><p
>The probability of the whole sequence is the probability of all its bits:</p
><pre
><code
>P(seq) = P(seq on *)
</code
></pre
><p
>The obtained sequence predictor isn't just theoretically nice. Context-Tree Weighting (CTW) variants are doing well on <a href="http://mattmahoney.net/dc/text.html"
>text compression benchmarks</a
>.</p
></div
></div
><div id="implementation"
><h2
>Implementation</h2
><p
>The recursive P(seq on *s) can reuse previous computations. It can then be computed in O(len(seq)) steps. The computation starts at the greatest depth and carries the partial results forward.</p
><p
>I implemented the algorithm in Python. You can <a href="https://github.com/fidlej/continue"
>play with continuation of binary sequences</a
>.</p
></div
><div id="resources"
><h2
>Resources</h2
><ol style="list-style-type: decimal;"
><li
><a href="http://www.sps.ele.tue.nl/members/F.M.J.Willems/RESEARCH_files/CTW/ctw1.ps"
>The Context-Tree Weighting Method: Basic Properties</a
>: This paper introduced the CTW method.</li
><li
><a href="http://www.sps.ele.tue.nl/members/F.M.J.Willems/RESEARCH_files/CTW/reflections.ps"
>Reflections on "The Context-Tree Weighting Method: Basic Properties"</a
>: It complements the original paper in explanation.</li
><li
><a href="http://jveness.info/software/default.html"
>A Monte-Carlo AIXI Approximation</a
>: Joel Veness's C++ implementation of CTW shows clearly how to implement the algorithm efficiently.</li
><li
><a href="http://jveness.info/publications/veness_phd_thesis_final.pdf"
>Approximate Universal Artificial Intelligence</a
>: Chapter 3 in Joel Veness's PhD thesis explains how the used P(generator) prior is related to the generator complexity. Interesting future ideas are proposed in chapter 6.</li
></ol
>
</div
>
<h2>You don't see everything (2010-08-14)</h2>
<p>Your perception is suppressed before a movement of your eyes. You cannot see your eyes moving when looking into a mirror. Look at your left eye, then at your right eye. Only <a href="http://en.wikipedia.org/wiki/Saccadic_masking">an external observer will see the movement of your eyes</a>.
</p>
<p>
Your perception is resumed when the image stops being blurred.
That is, when the velocity of the eye matches the relative velocity of the object.
That also allows you to see non-blurred details through the window of a moving train.
</p>
<h2>Problem Simplification (2010-06-30)</h2>
<p>
I will mention how different AI problems are related. I present them from the most general to the most specific. Each more specific problem also includes the assumptions from the previous simplifications. </p>
<p>
The problems: </p>
<h4>
1) General Intelligence</h4>
<p>
The goal is to choose an action to maximize the future total reward: </p>
<pre>best_y(observations) = argmax_action future_total_reward(
observations, action)
</pre>
<p>
This problem is considered by <a href="http://www.hutter1.net/ai/uaibook.htm">AIXI</a>. </p>
<h4>
2) Reinforcement Learning</h4>
<p>
Assumptions: </p>
<ol>
<li>The environment is stationary. I.e., P(Trajectory) is a fixed probability distribution. </li>
</ol>
<p>
Implications: </p>
<ul>
<li>We can talk about the <a href="http://en.wikipedia.org/wiki/Expected_value">expected value</a> of a function with respect to the probability distribution.</li>
<li>A fixed policy is enough for the fixed environment.</li>
</ul>
<pre>best_policy = argmax_policy E[
total_reward(Trajectory)|policy]
best_y(observations) = a draw from best_policy(Y|observations)
</pre>
<p>
Note that it is not necessary to compute the expected total reward. Its <a href="http://www.scholarpedia.org/article/Policy_gradient_methods">gradient is enough</a>. </p>
<h4>
3) <a href="http://hunch.net/?p=1389">Contextual Bandits</a></h4>
<p>
Assumptions: </p>
<ol>
<li>An action does not affect the future observations. The sequence of seen contexts is already assigned.</li>
<li>The reward depends only on the used action and the context.</li>
</ol>
<p>
Implications: </p>
<ul>
<li>There is no delayed future reward. Only the immediate reward is caused by the action. There is no confusion about what caused it. That simplifies training.</li>
<li>A stochastic policy is not needed. It cannot save us from being stuck, because we cannot affect the observations. The chosen decision can be deterministic.</li>
</ul>
<pre>best_y = argmax_y E[total_reward(Contexts, y)]
</pre>
<h4>
4) Supervised Regression</h4>
<p>
The agent is presented with (x, target) examples. </p>
<p>
Assumptions: </p>
<ol>
<li>The seen examples are independent. Their probability does not depend on the already seen examples.</li>
<li>The examples are identically distributed. They share a P(Target, X) probability distribution.</li>
<li>The reward function is known. It is possible to compute all possible rewards after seeing a target.</li>
</ol>
<p>
Implications: </p>
<ul>
<li>The maximum of the expected total reward is at the same point as the maximum of the expected reward from a single example.</li>
<li>No exploration is needed. We cannot affect the future observations. And we can compare all possible rewards.</li>
</ul>
<pre>best_y = argmax_y E[reward(Target, y(X))]
</pre>
<p>
The distribution P(X,Target) is still unknown. </p>
<h4>
5) Squared Loss</h4>
<p>
Assumptions: </p>
<ol>
<li>The reward function is known to be:</li>
</ol>
<pre>reward(target, y(x)) = - constant * (target - y(x))**2
</pre>
<p>
Implications: </p>
<ul>
<li>We can find the argmax_y explicitly. The derivative of the expected reward is a linear function of y. It leads to the following solution:</li>
</ul>
<pre>best_y(x) = E[Target|x]
</pre>
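<p>A quick numeric illustration with made-up targets: among many constant predictions, the mean gives the smallest average squared loss.</p>

```python
targets = [1.0, 2.0, 2.0, 3.0, 7.0]
mean = sum(targets) / len(targets)   # 3.0

def avg_squared_loss(y):
    return sum((t - y) ** 2 for t in targets) / len(targets)

candidates = [i / 10.0 for i in range(101)]   # 0.0, 0.1, ..., 10.0
best = min(candidates, key=avg_squared_loss)
print(best)  # 3.0, the mean wins
```

<p>This is the derivative argument checked by brute force: the loss is a parabola in y with its minimum at the mean.</p>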
<p>
Other simplifications could go in different directions. For example, planning assumes a known deterministic environment. </p>
<h3>
Used Offline Resources</h3>
<p>The supervised learning problem is described in chapter 1 of Bishop's <a href="http://research.microsoft.com/en-us/um/people/cmbishop/prml/">Pattern Recognition and Machine Learning</a>.</p>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-30602176277884235882010-06-24T11:31:00.005+02:002010-06-24T11:56:09.108+02:00Statistical Analysis Community<p>Do you want to teach yourself machine-learning? The following links may guide you:
</p>
<ul>
<li><p><a href="http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html">Learning about Machine Learning</a></p></li>
<li><p><a href="http://measuringmeasures.com/blog/2010/4/19/7-tips-for-successful-self-learning.html">7 Tips for Successful Self-Learning</a></p></li>
</ul>
<p>
The above articles were written by the <a href="http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data">brain behind FlightCaster</a>. That gives them some credibility.
</p>
<p>There is also a proposal to create a <a href="http://area51.stackexchange.com/proposals/33/statistical-analysis?referrer=eX9GfpYWpMs1">Q&A site for statistics, data analysis, data mining and data visualization</a>. Join the community if you plan to ask and answer statistical questions. We need more people to start the beta phase. Programmers already have their Q&A site at <a href="http://stackoverflow.com/">Stack Overflow</a>.</p>
<h2>Common steps in machine learning (2010-05-30)</h2>
<p>
I will summarize one way to solve a machine learning problem.
These abstract steps fit many problems.
</p>
<ol>
<li><p><b>Understand the task. See how to measure the performance.</b></p>
<p>
A computer program is learning if its performance improves with more experience.
We are going to design such a program.</p>
</li>
<li><p><b>Choose the source of training experience.</b></p>
</li>
<li><p><b>Decide what will be input and output.</b></p>
</li>
<li><p><b>Choose a set of models to approximate the output function.</b></p><p>
For example, the set could be a set of linear functions
or a set of neural networks with N hidden neurons.
We should use our knowledge of the problem to restrict the set.
The set only needs to contain the true output function
or its good approximation.
</p>
</li>
<li><p><b>Choose a <a href="http://hunch.net/?p=224">learning algorithm</a>.</b></p><p>
The algorithm will select one model from the given set of models.
The model is selected based on high
probability under the given training data (i.e., P(model|data)).
</p>
<p>
We may also select multiple models and use a weighted vote from them.
The models should be mutually exclusive.
Every model should have just one vote.
</p>
</li>
</ol>
<h3>Resources</h3>
<p>
Chapters 1, 2 and 6 of <a href="http://www.cs.cmu.edu/~tom/mlbook.html">Mitchell's Machine Learning</a>.
</p>
<h2>Viewing programming as constraint satisfaction (2010-04-10)</h2>
<p>
Programming could be viewed as solving a <a href="http://aima.cs.berkeley.edu/newchap05.pdf">constraint satisfaction problem</a>.
We start with some empty disk space, and the goal is to have a program there.
Only the goal state is important. We are not searching for a path.
We are searching for a goal state that meets the constraints.
</p>
<p>
I could view my programming as an attempt
to solve the problem efficiently. Modifying existing source code is like a mutation in a local search. Hitting an unexpected constraint and choosing a different path is like backtracking. And rewriting allows me to get rid of old constraints.
</p>
<p>
A super-intelligent machine may laugh. It will see the inefficiency of my primitive attempts. I would seem like a 19-month-old baby.
</p>
<object width="480" height="385"><param name="movie" value="http://www.youtube.com/v/Zybl598sK24&hl=en_US&fs=1&rel=0"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/Zybl598sK24&hl=en_US&fs=1&rel=0" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"></embed></object>
<h2>Unlimited "cd -" history (2010-04-02)</h2>
<p>
I use bash and I use "cd -" to go to the previous directory.
To be able to go back more than one step,
I had to replace the original cd command with a function:
</p>
<pre>
# cd with automatic pushd
function cd() {
    if test "x$1" = "x-" ; then
        popd >/dev/null
    else
        pushd . >/dev/null
        builtin cd "$@"
    fi
}
</pre>
<p>
Put that into your ~/.bashrc and start a new bash.
</p>
<h3>Example usage</h3>
<pre>
/home$ cd /opt
/opt$ cd /var/log
/var/log$ cd -
/opt$ cd -
/home$
</pre>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-73974429426464295562010-03-06T09:07:00.020+01:002010-03-10T18:08:12.407+01:00Learning from history<p>
It is amazing that learning from historical data is, in theory, a solved problem.
For example, we can calculate the probability that all crows are black
after N black crows have been seen:
</p>
<pre>
P(all_crows_are_black|seen_N_black_crows)
= P(seen_N_black_crows|all_crows_are_black) * P(all_crows_are_black)
* 1/P(seen_N_black_crows)
= 1 * P(all_crows_are_black) * 1/P(seen_N_black_crows)
</pre>
<p>
A different model could predict that 90% of crows are black.
Its probability after seeing N black crows would be:
</p>
<pre>
P(90%_of_crows_are_black|seen_N_black_crows)
= 0.9**N * P(90%_of_crows_are_black) * 1/P(seen_N_black_crows)
</pre>
<p>
The <code>1/P(seen_N_black_crows)</code> constant is not known.
We could interpret it as a normalization constant.
It ensures that the sum of probabilities of all possible models is 1.
Or we could ignore it if just comparing the probabilities of different models.
</p>
<p>
Many models could have non-zero probability when given a small history.
We should use them all when making a prediction.
A prediction is just the probability of unseen data based on the seen data.
That is calculated by:
</p>
<pre>
P(data|old_data)
= sum(P(data|h,old_data) * P(h|old_data) for h in ALL_MODELS)
</pre>
<p>
This approach is completely general. It could be used for non-independent samples, time series, everything. We would then work with models that predict such non-independent data or time series.
</p>
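<p>The crow example above fits in a few lines of Python. The 0.5 priors are arbitrary assumptions; only the likelihoods come from the two models:</p>

```python
def posteriors(n, prior_all=0.5, prior_90=0.5):
    # Likelihood of seeing n black crows under each model.
    like_all = 1.0 ** n  # all crows are black
    like_90 = 0.9 ** n   # 90% of crows are black
    unnorm = [like_all * prior_all, like_90 * prior_90]
    z = sum(unnorm)  # the P(seen_N_black_crows) normalization constant
    return [u / z for u in unnorm]

def p_next_black(n):
    # P(data|old_data) = sum of P(data|h) * P(h|old_data) over models.
    p_all, p_90 = posteriors(n)
    return 1.0 * p_all + 0.9 * p_90
```

<p>With no crows seen, both models keep their priors; after many black crows, the "all crows are black" model dominates the prediction.</p>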
<h3>Additional Resources</h3>
<ul>
<li><a href="http://aima.cs.berkeley.edu/">AI: A Modern Approach</a> chapters 13, 14 and 20 give an introduction to probabilities and Bayesian learning.</li>
<li><a href="http://www.hutter1.net/ai/uspx.pdf">On Universal Prediction and Bayesian Confirmation</a> by Marcus Hutter. It hints how to estimate the P(model|no_data_yet) probabilities. Simpler models are preferred.</li>
</ul>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-45646829840049401312010-01-23T13:06:00.002+01:002010-01-23T13:13:58.186+01:00Many layers are needed<p>
I have read an interesting paper on limitations of machine learning models:
<a href="http://yann.lecun.com/exdb/publis/pdf/bengio-lecun-07.pdf">Scaling Learning Algorithms towards AI</a>.
It mentions limitations of two-layer neural networks and
other two-layer models (SVMs).
These shallow models are unable to learn some functions
without an exponential number of components.
For example, to learn the parity function over N input bits,
they would need 2<sup>N</sup> hidden neurons.
</p>
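<p>To make the gap concrete, here is a toy Python sketch (plain functions, not actual neural networks): parity as a chain of two-input XOR units, one per input bit, versus a flat lookup table that needs 2<sup>N</sup> entries.</p>

```python
def parity_deep(bits):
    # A chain of two-input XOR units: one unit (one "layer") per bit.
    acc = 0
    for b in bits:
        acc ^= b
    return acc

def parity_shallow(n):
    # A single flat layer: one table entry per possible input,
    # so the size grows as 2**n.
    return {i: bin(i).count("1") % 2 for i in range(2 ** n)}
```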
<p>
On the other hand, a deep model with N layers
could compute the parity with just N components.
</p>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-79926693920173734422010-01-19T20:07:00.008+01:002010-05-23T17:42:47.706+02:00Humane utility function<p>
It will be hard to design a utility function for a strong AI.
The utility function should express what the AI should maximize.
Humans still cannot decide what weight to assign to lives, especially
when deciding between lives and the restoration of order in a society.
</p>
<object width="480" height="295"><param name="movie" value="http://www.youtube.com/v/kBdfcR-8hEY&hl=en_US&fs=1&"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/kBdfcR-8hEY&hl=en_US&fs=1&" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="295"></embed></object>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-19098982502574329602010-01-02T14:12:00.003+01:002010-01-02T14:15:22.560+01:00AI for real life: Understanding autonomy<p>
There are <a href="http://www.ted.com/talks/dan_pink_on_motivation.html">talks about autonomy and other forms of motivations</a>.
But I realized the meaning of autonomy only after reading about it
in an <a href="http://www.pearsonhighered.com/educator/academic/product/0,3110,0136042597,00.html">AI book</a>.
</p>
<p>
Let's first define how we want an agent or an employee to behave.
We want him to try his best to maximize the score we assign to him.
When doing "his best", he can only use his prior knowledge or his senses.
We should not blame a deaf person for not running in reaction to the sound "fire".
We should also not blame a person without the prior knowledge of the English word "fire".
They are acting rationally under the given conditions.
</p>
<p>
And autonomy is the ability to enhance or correct the prior knowledge.
An autonomous agent does not need to follow the rules defined in the prior knowledge. He could override them if it makes sense to him.
For example, he does not have to run immediately after hearing "fire".
He can grab the deaf person's hand first.
</p>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-5209424263023347502009-12-06T22:14:00.010+01:002009-12-06T22:43:06.931+01:00Solving Problems by Searching<p>
If you are new to AI, you will want to read this sample chapter:
<a href="http://www.pearsonhighered.com/assets/hip/us/hip_us_pearsonhighered/samplechapter/0137903952.pdf">Solving Problems by Searching</a>.<br/>
It explains breadth-first search, A*, heuristic estimates, and more.
</p>
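<p>To give a taste, breadth-first search fits in a few lines of Python; the graph below is a made-up example:</p>

```python
from collections import deque

# Breadth-first search: explores states level by level, so the first
# path that reaches the goal is a shortest one (in number of steps).
def bfs(graph, start, goal):
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # the goal is unreachable

# A toy directed graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
```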
<p>
It is from the newly updated
<a href="http://www.pearsonhighered.com/educator/academic/product/0,3110,0136042597,00.html">AI: A Modern Approach 3rd Edition</a>.
</p>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-36926994292256630612009-11-30T21:21:00.011+01:002010-01-02T14:20:40.927+01:00AI for real life: The perfect is the enemy of the good<p>
I have seen that searching for a perfect
solution is much harder than searching for a good one.
My experience comes from AI planning, where you have to find
a path from a start state to a goal. A perfect solution would
find the shortest path.
</p><p>
If you need the shortest path, you have to examine all promising paths.
You cannot skip a path that could possibly be shorter than the currently
best known solution.
</p><p>
If a good solution is enough, you have more options for finding one.
For example, you could stop the search when no remaining promising
path could be X times better than the best known solution.
You could also be more realistic about the promising paths.
It is OK to over-estimate the length of a path. That postpones
the path for later examination, and a good enough solution may be found in the meantime.
</p><p>
The hardness of the search can be seen at the <a href="http://ipc.informatik.uni-freiburg.de/HomePage">International Planning Competition 2008</a>. It has two tracks:
</p>
<ul>
<li>The "satisficing track" is for planners searching for a good solution.</li>
<li>The "optimization track" is for cost-optimal planners. They are searching only for the perfect solution.</li>
</ul>
<p>
The competition uses much larger problems for the satisficing planners. The cost-optimal planners would not be able to solve the same problems in the given time.
</p><p>
It is also interesting that <a href="http://ipc.informatik.uni-freiburg.de/Presentations?action=AttachFile&do=get&target=domain-poster.pdf">no cost-optimal planner was better than a basic breadth-first search</a>. The breadth-first search was used as the "baseline" for the "optimization track". A planner would need prior knowledge to estimate the lengths of the paths carefully. Otherwise it is no faster than a blind walk.
</p>
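<p>The over-estimation idea above corresponds to weighted A*: multiplying the heuristic by some w > 1 makes promising-looking paths jump the queue, trading the optimality guarantee for speed. A sketch on a made-up grid:</p>

```python
import heapq

def weighted_astar(start, goal, neighbors, h, w=2.0):
    # w > 1 deliberately over-estimates the remaining path length.
    frontier = [(w * h(start, goal), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(
                    frontier, (g2 + w * h(nxt, goal), g2, nxt, path + [nxt]))
    return None

# A toy 3x3 grid with unit step costs and a Manhattan-distance heuristic.
def grid_neighbors(p):
    x, y = p
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < 3 and 0 <= ny < 3:
            yield (nx, ny), 1

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])
```

<p>On this open grid the inflated heuristic still happens to return a shortest path; in general, weighted A* only bounds the solution to within a factor w of optimal.</p>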
<h3>Additional reading</h3>
<p>
The prior knowledge could be, for example, knowledge of solutions to easier problems. You then know that the harder problem will take at least the same number of steps:
<a href="http://www.cs.ualberta.ca/~holte/Publications/tr-95-18.pdf">Hierarchical A*: Searching Abstraction Hierarchies Efficiently</a>
</p>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-22466152828508489022009-11-28T15:39:00.009+01:002009-11-29T09:31:17.413+01:00Face detection that just works<p>
I may be the last man who noticed this.
My new <a href="http://www.panasonic.net/avc/lumix/compact/fs7_fs6/ia.html#face">camera does face detection</a> when looking at persons.
And it just works. You need AI knowledge only to appreciate it.
</p>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-70893078976262096252009-11-27T17:49:00.009+01:002010-02-11T09:24:50.099+01:00AI for real life: Optimize team utility<p>
I will write some articles about Artificial Intelligence (AI), mostly about what I have learned from it. The first one is about teamwork.
</p><p>
Work in a team is different from solo work. They are two different problems. When you are working solo, your goal is to maximize the amount of work done by you. In AI, they would say that the agent optimizes its utility function. It is like a score you receive from a finished game.
</p><p>
Inside a team, the goal of the problem is different. The goal is to maximize the amount of work done by the team. You should maximize the sum of the work done by you and your coworkers.
</p><p>
It is a much harder problem to optimize the sum of the utilities. If you are choosing an action, you would like to be able to predict its effects. A good start is to create a model of the internal state of your coworker. We humans call that empathy.
</p><p>
To understand AI:<br/>
<a href="http://www.vetta.org/documents/UniversalIntelligence.pdf">Universal Intelligence: A Definition of Machine Intelligence</a>
</p><p>
To understand other people:<br/>
<a href="http://www.amazon.com/How-Win-Friends-Influence-People/dp/0671723650">How to Win Friends & Influence People</a>
</p>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com1tag:blogger.com,1999:blog-1367197553634210691.post-74362079444263817762008-07-24T10:49:00.006+02:002009-11-27T10:20:37.760+01:00Python string concatenation performance<p>
A small test revealed that <code>"".join(listOfStrings)</code> is
not faster than plain +=. <s>The .join() is slower.</s> Using .append() and .join() is slower.
</p>
Time with .join():
<pre>real 0m2.908s
</pre>
Time with +=:
<pre>real 0m1.742s
</pre>
The test:
<pre>
#!/usr/bin/env python

def combine(inc, count):
    text = ""
    for i in xrange(count):
        text += inc
    return len(text)

def combineByJoin(inc, count):
    text = []
    for i in xrange(count):
        text.append(inc)
    text = "".join(text)
    return len(text)

def main():
    inc = "a" * 10
    print combine(inc, 10000000)
    #print combineByJoin(inc, 10000000)

main()
</pre>
Tested on Python 2.5.2.Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com2tag:blogger.com,1999:blog-1367197553634210691.post-8455454092021088602008-07-03T08:17:00.004+02:002009-12-28T21:24:52.951+01:00Stop disk clickingIf your disk clicks when idle, it could be because its power management is set too aggressively. Each click increases the Load_Cycle_Count. Check it with:
<pre>
$ sudo smartctl -A /dev/sda | grep Load
</pre>
To stop it, use hdparm:
<pre>
$ sudo hdparm -B 254 /dev/sda
</pre>
See how to force the hdparm setting on boot and resume:
<a href="https://wiki.ubuntu.com/DanielHahler/Bug59695">https://wiki.ubuntu.com/DanielHahler/Bug59695</a>Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0tag:blogger.com,1999:blog-1367197553634210691.post-32517141711674777262008-06-28T15:06:00.004+02:002008-07-24T10:54:38.359+02:00Form submit via jQueryWith <a href="http://jquery.com/">jQuery</a> it is easy to implement form submission via Ajax.
This is how I do it:
<pre>
$("#filter").submit(function(event) {
    event.preventDefault();
    event.stopPropagation();
    var params = {};
    $(":input", this).each(function() {
        params[this.name] = this.value;
    });
    $.get("/ajax/filter", params, aCallback);
});
</pre>
A working example is the filter form on <a href="http://jakybyt.cz/mapa">Jaký Byt: Mapa</a>.Ivo Danihelkahttps://plus.google.com/115596297739985932709noreply@blogger.com0