The Way of the Software Engineer

Generative simulation patent?

Posted by admin on August 22nd, 2007


NewScientist is reporting on predictive software systems being patented. Doesn’t anyone bother looking for prior art anymore? A simple Google search would show you that wake-sleep learning and Hierarchical Temporal Memory (HTM) have been around for years now. I’ve even seen papers discussing their use in predicting human behavior relating to music sales.
I’m working on another post that discusses predictive behavior, so I’ll save my ranting for later.

Multiple Experts

Posted by admin on July 3rd, 2007

There exists a model of using several ‘experts’ or classifiers in order to increase the relevancy of a result. In the fish example from Duda, Hart, Stork, a group of experts is asked to determine whether a fish is diseased. Nine of them say it’s not, but one disagrees. How do you ask a computer system to choose between the majority and minority opinions? That one dissenter may have specialized knowledge that gives him an advantage and thus has the correct answer. In a human situation, those 10 experts would argue with each other until perhaps others are convinced.

Can one classifier learn from another? That may be difficult in practice. If the majority classifiers could generate fantasy problems and ask the minority dissenter to solve them, we could determine whether the minority opinion should be given greater weight. If the minority expert does better with this type of problem, his ‘opinion’ should be given greater weight; perhaps enough weight to offset the majority. The majority group could then learn from this new data and adjust their weights accordingly.
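
To make the weighting idea concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the function names, the `boost` factor and the way accuracy is turned into a weight are all made up for this example, not taken from Duda, Hart, Stork.

```python
from collections import defaultdict


def weighted_vote(opinions, weights):
    """Return the label whose supporters carry the largest total weight.

    opinions: dict mapping expert name -> label
    weights:  dict mapping expert name -> weight (defaults to 1.0)
    """
    totals = defaultdict(float)
    for expert, label in opinions.items():
        totals[label] += weights.get(expert, 1.0)
    return max(totals, key=totals.get)


def reweight(weights, expert, probe_results, boost=10.0):
    """Return a copy of weights with one expert scaled by probe accuracy.

    probe_results: list of booleans, True where the expert solved a
    'fantasy problem' posed by the other experts. boost is an arbitrary
    scaling factor chosen for this sketch.
    """
    updated = dict(weights)
    if not probe_results:
        return updated  # no evidence, no change
    accuracy = sum(probe_results) / len(probe_results)
    updated[expert] = weights.get(expert, 1.0) * (1.0 + boost * accuracy)
    return updated


# Nine experts say the fish is healthy, one dissents.
opinions = {f"expert{i}": "healthy" for i in range(9)}
opinions["dissenter"] = "diseased"
weights = {name: 1.0 for name in opinions}

before = weighted_vote(opinions, weights)

# The dissenter solves all ten probe problems, so his weight grows.
weights = reweight(weights, "dissenter", [True] * 10)
after = weighted_vote(opinions, weights)
```

With equal weights the majority wins; once the dissenter has proven himself on the probe problems his single vote outweighs the other nine. How large `boost` should be is exactly the open question.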

Semantics IV: Applications

Posted by admin on June 17th, 2007

Web companies - ad revenue based media companies in particular - have amassed great amounts of data on their users. However, they’re data rich and problem poor. Access and error logs are everywhere, but generally they’re not mined for anything more than simple metrics. I’ve used several log analyzers and web trend trackers like HBX and they all produce pretty graphs but these just bring up more questions than they answer. In part this is because people haven’t fully formed the questions they want answered, but mostly because they just graph data instead of actually analyzing the data.

Semantics III: The Core

Posted by admin on June 11th, 2007

At its heart, a targeting engine must make a decision about what ad to send to a website visitor. This can be done at random, by looking at surrounding content, or by looking at history. Most “site ads” are chosen randomly from a short list that’s manually administered by the site owner. The ad rotation allows for the site to stay fresh - users seeing the same ad too frequently will subconsciously learn to ignore it - while still delivering the target number of impressions agreed upon by the site owner and their customers. The goal is to have customers with relevant ads and place them to achieve the greatest click-through rate. It’s an optimization problem that’s run by hand.

If, however, the ad network spans a great number of websites, choosing at random won’t bring the greatest revenue or click-through rate. Optimizing a large network can’t be done cost effectively by hand. Systems like Google’s AdSense or Yahoo! Ads will scrape the content of a website and through keyword searches determine which ads are most relevant and will therefore garner the greatest click-through rate. The market shows us that this is an effective approach, but it’s not solving the same optimization problem that the single site owner does by hand.

So here’s the goal: Build a system that looks at the history of all ad campaigns on all sites and selects an ad with the greatest likelihood of being clicked on. Now the content of the site becomes unimportant. When a user clicks on an ad we can view that as a success and train our targeting engine accordingly. In machine learning terms, this is a feed forward classifier (more on this another time).
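
As a rough illustration of "select the ad most likely to be clicked", here is a small Python sketch. To be clear, this is not the feed forward classifier mentioned above; it swaps in a much simpler technique, a smoothed click-through-rate estimate per ad, and the class and method names are hypothetical.

```python
class AdSelector:
    """Pick the ad with the highest estimated click-through rate.

    Keeps a running count of impressions and clicks per ad and uses
    add-one (Laplace) smoothing so ads with no history still get a
    reasonable starting estimate.
    """

    def __init__(self):
        self.impressions = {}
        self.clicks = {}

    def record(self, ad_id, clicked):
        """Log one impression and whether it resulted in a click."""
        self.impressions[ad_id] = self.impressions.get(ad_id, 0) + 1
        if clicked:
            self.clicks[ad_id] = self.clicks.get(ad_id, 0) + 1

    def estimated_ctr(self, ad_id):
        """Smoothed estimate: (clicks + 1) / (impressions + 2)."""
        clicks = self.clicks.get(ad_id, 0)
        shown = self.impressions.get(ad_id, 0)
        return (clicks + 1) / (shown + 2)

    def select(self, ad_ids):
        """Return the candidate ad with the best estimate."""
        return max(ad_ids, key=self.estimated_ctr)


selector = AdSelector()
for i in range(100):
    selector.record("ad_a", clicked=(i < 1))   # 1 click in 100
    selector.record("ad_b", clicked=(i < 5))   # 5 clicks in 100

best = selector.select(["ad_a", "ad_b"])
```

Every click is treated as a success and nudges the estimate upward, which is the training loop described above in its most basic form. A real engine would also need to keep showing less-proven ads occasionally, otherwise it never learns anything new about them.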

Semantics II: Advertising

Posted by admin on June 8th, 2007

This is part two of a multi-part series. Part one can be found here.

Online advertising is a multi-billion dollar business. Google has become one of the world’s most recognized brands by selling ads on its search result pages and putting context ads on websites. Yahoo and Microsoft are working hard, and spending billions, to compete with Google in this same space. Most online ads are targeted to the content of their residing website, assuming that the ad is being viewed by someone who has an interest in a related subject. For example, a blog article on Ruby on Rails is going to have ads relating to development tools and books on the Ruby language. The click-through rates for these kinds of ads are pretty low, less than a few percent. This means that most of the ads displayed on the site are being seen by people who are not interested in the products advertised. This isn’t much better than standard broadcast radio or television ads.

Years ago, some websites started collecting demographic information on their users through registration forms and mailing lists. This allows them to target ads to your demographic and charge more money for those ads (and you thought it was because they like you?). Click-through rates are higher for these ads, but forcing users to register alienates them quickly and reduces your inventory. The more web savvy internet users become, the less likely they are to give out personal information, so this model will eventually fail.

This is why Yahoo! and Microsoft are pumping billions into behavioral targeting engines. Since ad networks can plant a cookie on a user’s desktop and they have ads on thousands of sites throughout the net, they can track a user’s movement through any of those sites. By building a profile of these movements and using pattern recognition systems to put them into groups, they can watch which groups are most likely to click on a particular ad. These targeting engines can serve up the ad that a user is most likely to click on, even if that ad does not relate to the content on that site.

The big technology here is not in the ad serving software, but in the profiling of users. Ad networks have such a fantastic amount of data and it’s so diverse that normal data mining techniques break down or take longer to complete than can be useful. The number of methods applied to this problem is extensive: neural networks, factor analysis, Markov chain Monte Carlo, logical methods and many combinations. These methods have been used by scientists for years to model aspects of our world and (partly through the field of bioinformatics) these techniques are going mainstream.

Stay tuned for part three

Semantics I: Search

Posted by admin on June 7th, 2007

This is part one of a multi-part series. The series is continued here.

The Semantic web has been touted as Web 3.0 and the next evolutionary step in internet communications. A Scientific American article from 2001 told us what we should expect of a semantic internet. This is how devices should communicate and how information should be passed to people and devices. Somehow everything we own should be able to understand its surroundings and make meaningful decisions based on current events. The method described is a brute force approach where software that wants to participate in the semantic web will have to add fantastic amounts of metadata so the ‘agents’ will have a basis for making these decisions.

There’s the problem. No one is going to re-categorize the entire internet, so there needs to be an automated way of doing all of this. Organizing information has been a field of study for hundreds (if not thousands) of years. The term ontology was coined in the early 17th century and has been applied to this problem. Ontology languages facilitate connections between words to form ideas. If you know that a ‘cup’ and a ‘mug’ are both ‘containers’ and that ‘milk’ is a ‘liquid’, you should be able to determine the context of a statement such as “A cup of milk”.
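
The cup-and-milk example can be sketched in a few lines of Python. This is a toy, not a real ontology language: the `IS_A` table and the function names are invented for illustration, and every relation in it had to be typed in by hand.

```python
# Hand-entered "is a" relations, exactly the facts from the example above.
IS_A = {
    "cup": "container",
    "mug": "container",
    "milk": "liquid",
}


def categories(term):
    """Follow the is-a chain upward and return every category reached."""
    found = []
    while term in IS_A and IS_A[term] not in found:
        term = IS_A[term]
        found.append(term)
    return found


def describe(phrase):
    """Map each known word in a phrase to its categories."""
    words = phrase.lower().split()
    return {word: categories(word) for word in words if word in IS_A}


context = describe("A cup of milk")
```

Here `context` tells us the phrase pairs a container with a liquid, which is enough to guess that "cup" is a vessel and not, say, a trophy. The words "a" and "of" are simply skipped because the table knows nothing about them.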

Ontologies help to simplify the problem, but they don’t solve it. It’s still a brute force approach where much of this information needs to be input by hand. Additional information is still needed to overcome words having double meanings or where the meaning is determined by other context. Trying to solve natural language problems with keywords alone is impossible.

Newer research is focusing on pattern recognition to find relationships between documents. Pattern recognition systems have been applied to text before, but mostly in an academic setting. Internet entrepreneurs are finding that this is a marketable field. Several new search engines are trying to apply these algorithms to organize search results and make finding what you want easier. In practice, the results you find through these engines aren’t much better than using Google or Yahoo.

Dr. Riza C. Berkan, founder of Hakia.com, is a pioneer in this field and admits that this search engine system is in its infancy and works poorly with short queries (which are much more common). Hakia categorizes results and through the user clicking on links produces a longer query with more precise results. With the short queries most people present to search engines, there isn’t enough context to discover a useful meaning to many of the terms. As far as I can tell, this is still a supervised learning method. There is, however, another field where these same techniques are being employed.

Click here to continue on to part II

Good Ideas

Posted by admin on June 6th, 2007

Recently it seems like every time I have an idea for a useful web tool someone has built it already.

With OS X Leopard coming out soon I thought using the new text-to-speech system to make a web page reader would be cool. Done!

Wouldn’t it be great if you could convert a WordPerfect document to something actually useful without having to install that stupid MS Office file conversion pack? Done! (and Zamzar does a lot more)

::sigh:: Will there ever come a time when a web programmer’s job will boil down to:
1) Read requirements document
2) Determine correct keywords to enter into Google
3) Download open source package that completes the requirements

Done!

Portable Office

Posted by admin on May 25th, 2007

While I don’t normally pull things from the unwashed masses, I thought the study on ceiling height affecting thought patterns was funny. Clearly the study is dubious at best, but it’s a good excuse to work outside. If the ceiling is causing problems, let’s just avoid it entirely.




Coffee, laptop, iPod, wireless net and a VPN token are all I really need to do my job. With the weather so nice, I even convinced my boss to join me outside one morning.

Maker Faire!

Posted by admin on May 21st, 2007

So, this year’s Maker Faire has ended and I’m just as impressed as last year. The event falls near my daughter’s birthday, and after hearing me talk about last year’s event she wanted to go for her birthday this year. I see this as proof that I’m raising a good little geek.
I really like The Crucible and appreciate what they do and stand for. I have yet to take a class from them, but their volunteering barter system looks fair and well maintained. Their exploitation of the term “Fire Truck” is among my favorite events at the Faire and I hope to see them back next year.
They were pouring pewter tiles with letters on one face. I really wanted to see them actually pour one, but each time we walked by they were scratching the firebricks and getting ready for the next pour, or had just finished and were cooling the batch of tiles.
The big disappointment for me was the ‘craft’ side of the faire. It’s as though the craft people this year didn’t get the point of the whole show. On one side we see a large group of Makers each showing off their trade and trying to teach people how to do things and perhaps increase the interest in a particular skill, and on the craft side there was just a bunch of tables with neo-hippies sitting on their asses peddling garbage they’ve made into more interesting garbage.
Alternative energy sources were a big theme at this Faire. This is not unexpected considering the nation’s technology push in this direction, but I think the groups that have been crying “electric” for years and are finally getting some recognition are redoubling their efforts. Seeing some of these toys up close, it’s clear which gadgets will make it to the mainstream and which will be left in niche markets. The Tango definitely falls in the latter category. The car itself is beautiful, and the idea behind it noble, but it’s completely impractical. No one is going to buy an enclosed motorcycle for $108k when they can have a Tesla. If they ever get around to building their $18k version of the lane-splitter, then they’ll actually have a product worth looking into.
There were some plug-in hybrids parked around the grounds. I didn’t see anyone around talking about them, but adding a power inverter and plug to the side of a hybrid is an obvious next step. I’m not jumping on the hybrid bandwagon - in fact I think they’re just next to useless in terms of energy savings - but the socio-political impact of the hybrid and bringing environmental concerns to the masses is worth its weight in carbon.
There are many more cool things about the faire to talk about, so I’ll leave some for another time. Perhaps when I get the pictures up somewhere useful I’ll post a bit more.

IP Geo-location

Posted by admin on May 9th, 2007

I’m building a system that maps a user’s IP to a physical location. This kind of stuff has been around for a while, and it’s so easy I don’t understand why it’s used so infrequently. It’s as though it’s seen as a tool for the mega-corp ad networks like Yahoo and Google/DoubleClick, but there are free geo-IP databases available for everyone to use. SourceForge uses the free MaxMind GeoLite database and I’ve found it to be accurate enough to be useful.
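
For the curious, the core lookup is little more than a sorted table of address ranges and a binary search. Below is a minimal Python sketch of that idea using only the standard library. The rows are hypothetical sample data built on the reserved documentation ranges, not real GeoLite records, and the function names are my own; a real system would load the ranges from a geo-IP database instead. It also assumes IPv4 only.

```python
import bisect
import ipaddress

# Hypothetical sample rows (network, location). These are reserved
# documentation ranges with made-up places, not real geo-IP data.
SAMPLE_ROWS = [
    ("192.0.2.0/24", "Example City, CA, US"),
    ("198.51.100.0/24", "Sample Town, NY, US"),
    ("203.0.113.0/24", "Testville, ON, CA"),
]


def build_index(rows):
    """Turn (CIDR, location) rows into sorted integer ranges."""
    ranges = []
    for cidr, location in rows:
        net = ipaddress.ip_network(cidr)
        ranges.append(
            (int(net.network_address), int(net.broadcast_address), location)
        )
    ranges.sort()
    starts = [start for start, _end, _location in ranges]
    return starts, ranges


def locate(ip, index):
    """Return the location for an IPv4 address, or None if unknown."""
    starts, ranges = index
    value = int(ipaddress.ip_address(ip))
    # Find the last range that starts at or before this address.
    pos = bisect.bisect_right(starts, value) - 1
    if pos >= 0:
        start, end, location = ranges[pos]
        if start <= value <= end:
            return location
    return None


index = build_index(SAMPLE_ROWS)
known = locate("198.51.100.7", index)
unknown = locate("10.0.0.1", index)
```

An address outside every range comes back as `None`, which is the cue to fall back to asking the user.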

It’s no secret that this is possible and there are projects available to make this process very easy, but I just don’t see it used very often. Why? Is there some sort of unspoken net neutrality rule that makes geotracking an IP an internet faux pas?

I hate having to type in my city, state, zip, country, timezone, etc. when registering for accounts online. The easier a company can make it for me, the more likely I’ll do business with them. Back to work…