<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>CodeBudo</title>
	<atom:link href="http://www.codebudo.com/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.codebudo.com</link>
	<description>The Way of the Software Engineer</description>
	<pubDate>Sun, 07 Jun 2009 05:23:55 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Applying JOONE to Real-World Data</title>
		<link>http://www.codebudo.com/?p=67</link>
		<comments>http://www.codebudo.com/?p=67#comments</comments>
		<pubDate>Sun, 07 Jun 2009 05:23:13 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Articles]]></category>

		<category><![CDATA[Pattern Classification]]></category>

		<category><![CDATA[java]]></category>

		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/?p=67</guid>
		<description><![CDATA[JOONE is a toolset used to build and run neural networks in Java.  To demonstrate its capability, I&#8217;ve built a simple supervised network and trained it on a common data set used for other machine learning projects.  By using a common data set, comparisons can be made between the different approaches.
The data set was published [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.jooneworld.com/" target="_blank">JOONE</a> is a toolset used to build and run neural networks in Java.  To demonstrate its capability, I&#8217;ve built a simple supervised network and trained it on a common data set used for other machine learning projects.  By using a common data set, comparisons can be made between the different approaches.</p>
<p>The <a href="http://archive.ics.uci.edu/ml/datasets/Mushroom" target="_blank">data set</a> was drawn from the Audubon Society Field Guide and describes the characteristics of mushrooms found in North America. <span id="more-67"></span> The version I&#8217;m using was compiled by the <a href="http://archive.ics.uci.edu/ml/datasets.html" target="_blank">UCI Machine Learning Repository</a>.  It contains 8124 records, one per line.  Each record is a comma-separated list of single-character values: the first value gives the classification (poisonous or edible), and the remaining 22 values describe the mushroom&#8217;s characteristics.</p>
<blockquote><p>p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u<br />
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g</p></blockquote>
<p>JOONE requires semicolon-separated numerical values for input, so I replaced each character value with its alphabetical position (a = 1, b = 2, and so on) and changed the commas to semicolons.  Missing values were given the value 27.</p>
<blockquote><p>16;24;19;14;20;16;6;3;14;11;5;5;19;19;23;23;16;23;15;16;11;19;21<br />
5;24;19;25;20;1;6;3;2;11;5;3;19;19;23;23;16;23;15;16;14;14;7</p></blockquote>
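<p>The conversion described above can be sketched in a few lines of Python.  This is an illustrative sketch rather than the script used for this post: the function names are made up for the example, and it assumes that any value that is not a lowercase letter (such as a &#8216;?&#8217; placeholder) marks a missing value.</p>
<pre><code class="language-python">import string

MISSING_VALUE = 27  # the value the post assigns to missing attributes


def encode_symbol(symbol):
    """Map 'a'..'z' to 1..26; anything else is treated as missing (assumption)."""
    if len(symbol) == 1 and symbol in string.ascii_lowercase:
        return string.ascii_lowercase.index(symbol) + 1
    return MISSING_VALUE


def encode_record(line):
    """Turn one comma-separated record into semicolon-separated numbers."""
    symbols = line.strip().split(",")
    return ";".join(str(encode_symbol(symbol)) for symbol in symbols)


# Encode the first sample record quoted in the post.
print(encode_record("p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u"))
</code></pre>
<p>Run on the first sample record, this reproduces the first encoded line shown above.</p>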
<p>The network has three layers: 22 input nodes, 10 hidden nodes, and a single output node.  If the output node is 16 (p), the mushroom is classified as poisonous.  If this node is 5 (e), it is classified as edible.  The hidden nodes and output node have a sigmoidal activation function.  The network is trained on the first 3000 records of the data set using JOONE&#8217;s built-in back propagation functions and a Root Mean Squared Error (RMSE) function.  The remaining 5124 records can be used to verify the application.  Running in training batches of 10,000 iterations (epochs) and storing a serialized representation of the network to disk every 100 iterations allowed fine-grained monitoring of the application&#8217;s progress and ensured the trained network could be recovered in the event of a crash.</p>
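<p>To make the 22-10-1 layout concrete, here is a minimal Python sketch of a forward pass through such a network, together with an RMSE calculation.  It does not use JOONE and does not implement back propagation.  The random initial weights, the scaling of inputs by 27, and the 1.0 target for &#8220;poisonous&#8221; are assumptions made for the example, not details taken from the JOONE code.</p>
<pre><code class="language-python">import math
import random


def sigmoid(x):
    """Logistic activation, squashing any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_inputs, n_nodes, rng):
    """Random weights for one layer; the last weight of each node is its bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_nodes)]


def forward(layer, inputs):
    """Feed inputs through one fully connected sigmoid layer."""
    # zip stops at the shorter sequence, so the trailing bias is added separately.
    return [sigmoid(sum(w * x for w, x in zip(node, inputs)) + node[-1])
            for node in layer]


def classify(hidden, output, features):
    """Run one record through the network; the result lies in (0, 1)."""
    scaled = [value / 27.0 for value in features]  # assumption: scale to [0, 1]
    return forward(output, forward(hidden, scaled))[0]


def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(squared / len(targets))


rng = random.Random(0)
hidden = make_layer(22, 10, rng)   # 22 inputs feeding 10 hidden nodes
output = make_layer(10, 1, rng)    # 10 hidden nodes feeding 1 output node

# The 22 characteristics of the first encoded record (class label removed).
features = [24, 19, 14, 20, 16, 6, 3, 14, 11, 5, 5,
            19, 19, 23, 23, 16, 23, 15, 16, 11, 19, 21]
prediction = classify(hidden, output, features)
print(prediction, rmse([prediction], [1.0]))  # assumed target: 1.0 = poisonous
</code></pre>
<p>Training would repeatedly adjust the weights to reduce the RMSE over the training records; that is the part JOONE&#8217;s back propagation handles.</p>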
<p>Serialization is a mechanism by which an object in memory is converted into a portable form (XML in this case) so it can later be retrieved and restored to memory exactly as it once was.  In this case, we are using Java&#8217;s &#8216;Serializable&#8217; interface to store a neural network that contains the network diagram, weighted synapses, and trainer (error).</p>
<p>The error after the first 100 iterations was ~5%, and decreased to 4.25% after 50,000 iterations.  While this is rather slow, the error is still decreasing and could be within acceptable levels with a few million iterations.</p>
<p>Further research should be done on the design of the network and its training.  Adding another layer or changing the number of hidden nodes could help the network converge more quickly.  The serialization mechanism could provide an easy way to distribute and parallelize the training.  If the current RMSE of the network were stored along with the serialized net, a node could determine whether its error is less than the current &#8220;best&#8221; for a group of nodes.  The node with the lowest error would write a new serialized net and global error file, and nodes with greater error would use the least-error net to continue training.</p>
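<p>The &#8220;best net&#8221; exchange proposed above could look roughly like the following Python sketch.  It is only an illustration of the idea: Python&#8217;s pickle stands in for Java serialization, the file names and the <code>sync_with_best</code> function are hypothetical, and there is no file locking, so two workers writing at the same moment could still collide.</p>
<pre><code class="language-python">import json
import os
import pickle
import tempfile


def sync_with_best(shared_dir, net, error):
    """Publish (net, error) if it beats the shared best, else adopt the best.

    Returns the (net, error) pair the worker should continue training from.
    """
    error_path = os.path.join(shared_dir, "best_error.json")  # hypothetical name
    net_path = os.path.join(shared_dir, "best_net.pkl")       # hypothetical name

    best_error = float("inf")
    if os.path.exists(error_path) and os.path.exists(net_path):
        with open(error_path) as handle:
            best_error = json.load(handle)["rmse"]

    if best_error == min(best_error, error) and os.path.exists(net_path):
        # The shared net is at least as good: continue training from it.
        with open(net_path, "rb") as handle:
            return pickle.load(handle), best_error

    # This worker holds the new best: write the net, then the error file.
    with open(net_path, "wb") as handle:
        pickle.dump(net, handle)
    with open(error_path, "w") as handle:
        json.dump({"rmse": error}, handle)
    return net, error


shared = tempfile.mkdtemp()
# The first worker publishes its net; the second has a higher error and adopts it.
print(sync_with_best(shared, {"weights": [0.1, 0.2]}, 0.0425))
print(sync_with_best(shared, {"weights": [0.3, 0.4]}, 0.0500))
</code></pre>
<p>Each worker would call this between training batches, so a lucky run on one machine quickly becomes the starting point for all the others.</p>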
<p>Here are the files required to continue developing this network:</p>
<p><a href="/download/mushroom_numerical.data">MushroomFFNN.java</a></p>
<p><a href="/download/mushroom_numerical.data">mushroom_numerical.data</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=67</wfw:commentRss>
		</item>
		<item>
		<title>AvantGo is AvantGone</title>
		<link>http://www.codebudo.com/?p=63</link>
		<comments>http://www.codebudo.com/?p=63#comments</comments>
		<pubDate>Tue, 02 Jun 2009 21:21:40 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[advertising]]></category>

		<category><![CDATA[mobile]]></category>

		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/?p=63</guid>
		<description><![CDATA[AvantGo, the once ubiquitous application for all PDAs, is shutting down its web sync service.  Users of the service have just begun to see banners stating, &#8220;Starting June 30, Avantgo will no longer offer mobile web content.&#8221; With modern wireless networks and browsers built into new smartphones, the on-device browser just couldn&#8217;t meet [...]]]></description>
			<content:encoded><![CDATA[<p>AvantGo, the once ubiquitous application for all PDAs, is shutting down its web sync service.  Users of the service have just begun to see banners stating, <strong>&#8220;Starting June 30, Avantgo will no longer offer mobile web content.&#8221;</strong> With modern wireless networks and browsers built into new smartphones, the on-device browser just couldn&#8217;t meet the demands of modern consumers.  While there are no direct competitors to this service in the consumer market, there are a few companies that meet the needs of some consumers.  AvantGo is suggesting mysnacs.com as an alternative.  Some users have received this message:</p>
<blockquote><p>After June 30, 2009, AvantGo will no longer be providing mobile Web content for sync or online access, and you will not be able to access or update your AvantGo content or account.  Your account information will continue to be protected by our privacy policy, and we plan to delete any personally identifiable information you’ve provided (e.g., your e-mail address) as soon as reasonably possible.</p>
<p>If you are an 8MB account subscriber you may be entitled to a refund for a portion of your subscription fees that are unused. To request a refund, please click here and submit the refund request form. You will need to reference your AvantGo User ID (included in this email).</p>
<p>To continue receiving news and information from your favorite content providers, you should visit that content provider’s channel before June 30 for details on how to obtain their content other than through AvantGo.  Also, AvantGo recommends the Snac mobile widget application - a new, fun way to get your favorite content on your mobile device. You can find out more about Snac at: http://www.mysnacs.com/landing?token=avantgo0609</p>
<p>Best wishes,<br />
The AvantGo Team</p></blockquote>
<p><span id="more-63"></span></p>
<p>The original AvantGo sync service was a pioneering effort.  It brought the web to your handheld device and allowed users to take any piece of the web they wished along for the ride.  Web applications were supported with an offline form submission system, and advanced JavaScript hooks allowed mobile web developers to interact with their users in a useful way while they were offline.  The &#8220;HandheldFriendly&#8221; meta tag was originally developed by AvantGo to distinguish websites designed for mobile viewing from those designed for the desktop, and is now used by many browsers including Opera Mini, Opera Mobile and some BlackBerry devices.</p>
<p>AvantGo taught its approximately 10 million recorded users what to expect of the mobile web and defined the market that later mobile device browsers followed.  Aspects of the service&#8217;s design are sure to remain in the mobile computing field.  Many former AvantGo engineers went on to help develop Yahoo! Go, Google Gears, and other mobile content services.</p>
<p>If HTML 5 ever gets adopted, most of the offline browsing features will be available on a large number of browsers.  Safari 4 beta already supports the site manifest file aspects of HTML 5 for pre-downloading content (a feature that has caused some controversy over wasted bandwidth), which allows offline viewing of pages.</p>
<p>According to parent company Sybase, AvantGo will transition from a mobile web service to an <a href="AvantGo Mobile Advertising Services" target="_blank">SMS advertising</a> and content delivery system.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=63</wfw:commentRss>
		</item>
		<item>
		<title>A First Look at JOONE - Java Object Oriented Neural Engine</title>
		<link>http://www.codebudo.com/?p=62</link>
		<comments>http://www.codebudo.com/?p=62#comments</comments>
		<pubDate>Tue, 05 May 2009 18:04:07 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Pattern Classification]]></category>

		<category><![CDATA[machine learning]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/?p=62</guid>
		<description><![CDATA[Recently I&#8217;ve been playing with a tool set called JOONE.  The goal of the JOONE project is to produce a fast prototyping environment for Neural Nets and a series of libraries to train these networks.  I have so far ignored the prototyping environment, but I do find the libraries quite useful.
Two years ago I began [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I&#8217;ve been playing with a tool set called <a href="http://www.jooneworld.com/" target="_blank">JOONE</a>.  The goal of the JOONE project is to produce a fast prototyping environment for Neural Nets and a series of libraries to train these networks.  I have so far ignored the prototyping environment, but I do find the libraries quite useful.</p>
<p>Two years ago I began building a series of base classes in Java that could be used to create neural networks.  I managed to get it to a somewhat usable state, but it needed a lot of cleaning up before I could release it or expand it to apply to a larger set of problems.  I found JOONE while writing my libraries, but shelved it because my goal was to learn more about machine learning.  Just using an existing toolset hides most of the important &#8220;educational&#8221; bits.  I returned to JOONE a few months ago, and discovered it shared many of the design elements I had built into my own library along with a bunch of really great features I hadn&#8217;t even thought of.</p>
<p>It comes pre-packaged with useful example code to get you started, and an extensive PDF manual (which could use some copy-editing).  I ordered a <a href="http://www.amazon.com/Introduction-Neural-Networks-Java-Heaton/dp/097732060X/ref=sr_1_1?ie=UTF8&amp;qid=1241545730&amp;sr=8-1" target="_blank">book</a> (that&#8217;s also available <a href="http://www.heatonresearch.com/articles/series/1/" target="_blank">online</a>) that discusses the basics of neural nets in the context of JOONE.  While I&#8217;ve found the book useful, I think the examples are written for a slightly older version of JOONE.  Some of the method calls suggested in the book are listed as @deprecated in the actual JOONE source.  Fortunately, the examples included with the JOONE source code make it easy enough to modify the book&#8217;s samples and use the more modern methods.</p>
<p>I like the example-based format of the book.  There are a series of problems to solve and the JOONE way of solving them is presented.  The pace is good and the problems increase in complexity as more advanced topics are covered.  There is one element I dislike about the book: some of the topics covered are not in a JOONE context.  Heaton Research, which published the book, seems to have its own basic NN library and some sections of the book use this library instead of JOONE.  The lack of consistency could be a problem for someone trying to use JOONE as an engine and apply it to an actual problem.  Purely as a learning experience, the subject matter is well described no matter which library is used.</p>
<p>As I continue playing with JOONE, I may post my example code or describe the process of getting JOONE to play nice with Eclipse.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=62</wfw:commentRss>
		</item>
		<item>
		<title>The iPhone SDK is Calling Collect</title>
		<link>http://www.codebudo.com/?p=61</link>
		<comments>http://www.codebudo.com/?p=61#comments</comments>
		<pubDate>Thu, 24 Jul 2008 16:48:45 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/?p=61</guid>
		<description><![CDATA[I recently borrowed an iPod Touch from a friend and co-worker to play with the new 2.0 firmware version and App Store.  There&#8217;s a lot of potential with this new platform.  The very attractive device, capable hardware and ridiculous hype associated with the iPhone make it a great platform for developers to get wide exposure for [...]]]></description>
			<content:encoded><![CDATA[<p>I recently borrowed an iPod Touch from a friend and co-worker to play with the new 2.0 firmware version and App Store.  There&#8217;s a lot of potential with this new platform.  The very attractive device, capable hardware and ridiculous hype associated with the iPhone make it a great platform for developers to get wide exposure for simple applications (just google iBeer).  The development tools - particularly Interface Builder - are top-notch.  I was looking for blogs and forums that had iPhone development tutorials to get the &#8220;tips and tricks&#8221; view of these tools, but I came up dry.  I did come across a few articles that have begun to address this lack of useful information.  Basically, the licensing agreement for the SDK is a gag order preventing developers from releasing their code or writing about the SDK.</p>
<p><span id="more-61"></span></p>
<p>When developing an app on a new platform I take one of my existing apps and port it to the platform.  Since I already know the bulk of the code, I can learn the new features of the platform while not balking at the task of a totally new project.  A good number of my apps are open source or use libraries that are protected by various open source licenses.  The GPL is by far the most restrictive, but these licenses all generally require (or at least encourage) the user to release extensions to these libraries back to the public.  If Apple&#8217;s iPhone SDK requires me to be silent, and the open source license requires me to publish my changes, I&#8217;m stuck and can&#8217;t produce anything.</p>
<p>The obvious answer is to just use the tools that Apple has offered and build from there.  As great as that sounds, it&#8217;s really not going to fly for an open source developer.  If I write an app for the iPhone and list it for free in the App Store, I still can&#8217;t release my code.  This is the very core of &#8220;Free as in Freedom, not Free as in Beer&#8221;.  Developing for the iPhone means Apple has control of my source code.</p>
<p>I know that the jailbroken iPhone community has solved this and boasts some ridiculous penetration rates in the iPhone market.  This is not a sustainable solution.  Eventually Apple and AT&amp;T will learn to hate each other and we&#8217;ll have iPhones on other networks.  I predict this will happen in 4 years.  This will remove the major reason for jailbreaking the iPhone for the average user and jailbroken iPhones will be as useless as iPod Linux.</p>
<p>I really like the iPhone and iPod Touch.  Like most Apple hardware, they&#8217;re really well designed products.  I&#8217;m not sure if this was just an oversight on the part of Apple&#8217;s upper management or just an overzealous general counsel, but until I can have control over my own code I won&#8217;t be developing for Apple.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=61</wfw:commentRss>
		</item>
		<item>
		<title>Behavioural Location Tracking</title>
		<link>http://www.codebudo.com/?p=47</link>
		<comments>http://www.codebudo.com/?p=47#comments</comments>
		<pubDate>Tue, 27 May 2008 03:24:27 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<category><![CDATA[geoip]]></category>

		<category><![CDATA[tracking]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/?p=47</guid>
		<description><![CDATA[SkyHook Wireless is sitting on the greatest behavioral corpus known to man.  The software that powers the location based services for the Apple iPhone and iPod Touch is a self-learning map of wireless access points to GPS locations.  They seeded this database over the past several years by war driving across the US [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.skyhookwireless.com/">SkyHook Wireless</a> is sitting on the greatest behavioral corpus known to man.  The software that powers the location based services for the Apple iPhone and iPod Touch is a self-learning map of wireless access points to GPS locations.  They seeded this database over the past several years by war driving across the US and recording the ESSIDs, MAC addresses, and GPS locations of all the exposed wireless access points they found.  <span id="more-47"></span>iPhone and iPod Touch users can find their current location by finding all the wifi networks they currently have in range and sending that information to SkyHook.  SkyHook can then look up their current location based on networks it knows about, and record information about new networks it may not have seen before.</p>
<p>This last point is important: users are actively scanning local airwaves and sending this information to SkyHook.  Cellular network operators have long been able to track movements of individuals, but they only have visibility of the people connected to their own network.  SkyHook is retaining information from an active scan.  Now you not only get information about a user&#8217;s current location, but also the location of adjacent users not connected to your network.</p>
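<p>As a toy illustration of that look-up-and-learn loop, the Python sketch below estimates a position as the plain average of the known access points in a scan and then records any unseen access points at that estimate.  SkyHook&#8217;s real algorithm is not described here, so the averaging, the data layout, and the <code>locate</code> function are all assumptions made for the example.</p>
<pre><code class="language-python">def locate(scan, known_aps):
    """Estimate a position from a wifi scan and learn any new access points.

    scan:      list of access point MAC addresses currently in range
    known_aps: dict mapping MAC address to a (latitude, longitude) pair
    Returns the estimated (latitude, longitude), or None if nothing is known.
    """
    fixes = [known_aps[mac] for mac in scan if mac in known_aps]
    if not fixes:
        return None

    # Assumption: a plain average of the known access point positions.
    latitude = sum(fix[0] for fix in fixes) / len(fixes)
    longitude = sum(fix[1] for fix in fixes) / len(fixes)
    estimate = (latitude, longitude)

    # Record access points seen for the first time at the estimated position.
    for mac in scan:
        known_aps.setdefault(mac, estimate)
    return estimate


database = {
    "00:11:22:33:44:55": (10.0, 20.0),
    "66:77:88:99:aa:bb": (12.0, 22.0),
}
scan = ["00:11:22:33:44:55", "66:77:88:99:aa:bb", "cc:dd:ee:ff:00:11"]
print(locate(scan, database))
print(database["cc:dd:ee:ff:00:11"])
</code></pre>
<p>Note that the third radio in the scan never asked to be located, yet it ends up in the database anyway.</p>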
<p>It may not be obvious how this data can be used for behavioral analysis, so I&#8217;ll present this scenario.  Sally takes the train to work every day and has no knowledge of the SkyHook service.  She reads her email on the platform by picking up a local wifi hotspot.  This same train platform is periodically populated by tourists that are using the SkyHook service to map out their day.  SkyHook is now receiving active scans of this train station and seeing Sally&#8217;s wifi radio talking to the hotspot. When Sally goes to get coffee later in the afternoon, her phone is detected by another SkyHook user and her location again entered into the SkyHook database.  It&#8217;s now possible for SkyHook to see that Sally stops at the same train platform at around 7:30am each day and gets coffee on 4th St. on Tuesdays, even though Sally has never used or heard of the SkyHook service.</p>
<p>If we presume that connecting to a network and using a service constitutes an agreement between user and provider, then the provider has some reason to know your location.  If the person sitting next to me on the train belongs to a different network, my network doesn&#8217;t know anything about and has no agreement with that individual.  If my phone is running an active scan and finds the person next to me and sends this information to my network operator, who owns this information?  Do I have the right to send someone else&#8217;s location to a third party?  Is my location information public domain?</p>
<p>I see SkyHook as a potentially disruptive technology that has yet to be challenged in court.  How long will it take before this information is considered valuable in a court case and SkyHook is subpoenaed for probable suspects?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=47</wfw:commentRss>
		</item>
		<item>
		<title>Yahoo SearchMonkey</title>
		<link>http://www.codebudo.com/?p=60</link>
		<comments>http://www.codebudo.com/?p=60#comments</comments>
		<pubDate>Fri, 23 May 2008 22:42:47 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/?p=60</guid>
		<description><![CDATA[Yahoo has published another tool in their series of open technologies called SearchMonkey.  The principle is that website owners can create a &#8220;data service&#8221; that runs an XSLT on their site output and then pipe it to a &#8220;presentation application&#8221; that produces a custom L&#38;F for search results relating to your site.


There are better examples [...]]]></description>
			<content:encoded><![CDATA[<p>Yahoo has published another tool in their series of open technologies called <a href="http://developer.yahoo.com/searchmonkey/" target="_blank">SearchMonkey</a>.  The principle is that website owners can create a &#8220;data service&#8221; that runs an XSLT on their site output and then pipe it to a &#8220;presentation application&#8221; that produces a custom L&amp;F for search results relating to your site.</p>
<p><a href="http://www.codebudo.com/wp-content/uploads/2008/05/picture-1.png"><img class="aligncenter size-medium wp-image-58" title="CodeMonkey Enhanced Search Results" src="http://www.codebudo.com/wp-content/uploads/2008/05/picture-1-300x193.png" alt="CodeMonkey Enhanced Search Results" width="300" height="193" /></a></p>
<p><span id="more-60"></span></p>
<p>There are better examples elsewhere, but the idea is simple.  If your result looks better than the next guy&#8217;s, users are more likely to choose your link to click on.  Altering the presentation of a search result seems like a massive security hole at first.  I certainly wouldn&#8217;t want to give the world access to change how my search results appear.  However, these enhancements are only available on an opt-in basis so a user can only change the presentation of their own search results.  The Presentation Applications can be selected and enabled in their <a href="http://search.yahoo.com/preferences/preferences" target="_blank">preferences</a>.</p>
<p>I&#8217;m happy with the developer tools and speed with which a new presentation can be created, but only enabling these apps on an opt-in basis means that the vast majority of users won&#8217;t ever see them.  As a small site owner, it would be nice if Yahoo would allow the modified search results to be the default based on my preferences (after proper authentication with Site Explorer or similar tool).  Perhaps a simple flag similar to SafeSearch called something like &#8220;Use Enhanced Search Results&#8221; could be turned on in a user&#8217;s preferences to enable these features.  No one is going to install a presentation application for my tiny little blog.  (Besides, most of my traffic comes from Google.)</p>
<p>I&#8217;m worried that Yahoo is ignoring the <a href="http://en.wikipedia.org/wiki/Power_Law" target="_self">Long Tail</a> with SearchMonkey and only catering to large web properties.  Given their new &#8220;open&#8221; motto, this seems rather out of character.  The project would make a good way to separate them from Google and Live search, although it does share some usability features with <a href="http://ask.com/" target="_blank">Ask</a>&#8217;s Site Preview.</p>
<p>I plan on participating in their developer feedback program, so perhaps this is still possible with SearchMonkey.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=60</wfw:commentRss>
		</item>
		<item>
		<title>OCR with Tesseract</title>
		<link>http://www.codebudo.com/?p=56</link>
		<comments>http://www.codebudo.com/?p=56#comments</comments>
		<pubDate>Tue, 13 May 2008 21:45:35 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<category><![CDATA[how to]]></category>

		<category><![CDATA[web]]></category>

		<category><![CDATA[javascript]]></category>

		<category><![CDATA[ocr]]></category>

		<category><![CDATA[php]]></category>

		<category><![CDATA[tesseract]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/?p=56</guid>
		<description><![CDATA[Optical Character Recognition is one of those technologies that has been around for a long time and never quite met customer demands.  This is a common AI application, but I thought I&#8217;d see what&#8217;s currently available publicly instead of trying to write my own from scratch.  The primary options I found were PHPOCR, [...]]]></description>
			<content:encoded><![CDATA[<p>Optical Character Recognition is one of those technologies that has been around for a long time and never quite met customer demands.  This is a common AI application, but I thought I&#8217;d see what&#8217;s currently available publicly instead of trying to write my own from scratch.  The primary options I found were PHPOCR, GOCR, and Tesseract.  PHPOCR is a system written by a developer in Ukraine as a platform for further OCR research.  The examples were very easy to get working, but it&#8217;s not the quickest solution for my project.  <a href="http://jocr.sourceforge.net/" target="_blank">GOCR</a> is generally considered the market winner.  It installs easily via the OS X MacPorts tree, and works quickly on the command line.  I also downloaded <a href="http://code.google.com/p/tesseract-ocr/" target="_blank">Tesseract</a>, but with GOCR&#8217;s ease of use and prevalence in the open source community, I thought I&#8217;d try that first.<span id="more-56"></span></p>
<p>A friend of mine made a really cool <a href="http://www.bookmarklets.com/" target="_blank">bookmarklet</a> that lets the user select a bounding box on an image, then sends an AJAX call to his server which downloads the image, crops it to the bounding box, and pipes it through <a href="http://jocr.sourceforge.net/" target="_blank">GOCR</a> (GNU Optical Character Recognition). The result is then dropped in a div exactly positioned over the originally selected bounding box, approximating text size and color. The goal is to make it possible to copy and paste text out of an image. It works quite well and its clean interface makes it a beautiful thing to watch.</p>
<p>Given the complete failure by the web development community to accurately populate image alt attributes, I thought it would be slick to grab all images on a page, get any embedded text and automatically populate the alt text in much the same way as my friend&#8217;s bookmarklet draws a div. I ran a few tests by piping web comics through GOCR with horrible results. Accuracy couldn&#8217;t have been above 10%. I told my friend about my results hoping he could give me some insight as to why GOCR was failing me so badly. As I probably should have expected, the simple bounding box he uses is really important. GOCR doesn&#8217;t have any layout detection.</p>
<p>Google sponsors a big open source OCR package called <a href="http://code.google.com/p/tesseract-ocr/" target="_blank">Tesseract</a>, originally developed at HP and shelved in the mid-1990s, and is (I believe) using it to scan books and make them available on the net. This is one of Google&#8217;s many efforts to make all the world&#8217;s information available. I was hoping its touted increased accuracy would help overcome the lack of a bounding box in my application.  It compiles and installs without errors and it runs just fine on the test images provided, but producing images it can read is a challenge.  I tried saving GIF files as TIFF from Pixelmator and Tesseract gave me errors.  I tried using ImageMagick and a <a href="http://sourceforge.net/forum/forum.php?thread_id=1568751&amp;forum_id=534361" target="_blank">little bash script I found</a> with the same results.  Tesseract complains about minute differences in TIFF header information (datetime format, bpp info, etc.), so some care is needed.  I&#8217;m not sure if this has something to do with my version of libtiff (v3.6.1) that Tesseract is using, or if there&#8217;s some parsing code in Tesseract that isn&#8217;t happy.</p>
<p>I finally did manage to get things to work by creating a very basic bash script and using the simplest settings for ImageMagick.  The goal was to have the bash script work the same way GOCR&#8217;s command line utility works.</p>
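<blockquote><p>#!/bin/bash<br />
# Usage: pass an image file as $1; the recognized text is printed to stdout.</p>
<p>tmpid=$$<br />
# Convert the input to an uncompressed TIFF that Tesseract can read<br />
convert -compress none &quot;$1&quot; /tmp/img.${tmpid}.tif</p>
<p># Tesseract appends .txt to the output base name<br />
tesseract /tmp/img.${tmpid}.tif /tmp/tout.${tmpid} 2&gt; /dev/null<br />
# Strip stray pipe characters and ^~R sequences from the recognized text<br />
perl -pe &#39;s/\|//g; s/\^~R//g&#39; /tmp/tout.${tmpid}.txt<br />
rm /tmp/tout.${tmpid}.*<br />
rm /tmp/img.${tmpid}.tif</p></blockquote>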
<p>Now that I have two tools whose interface is the same, I can write a wrapper around them to use in PHP and compare their performance.  I&#8217;ll handle that in another post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=56</wfw:commentRss>
		</item>
		<item>
		<title>Open Calais</title>
		<link>http://www.codebudo.com/?p=54</link>
		<comments>http://www.codebudo.com/?p=54#comments</comments>
		<pubDate>Thu, 08 May 2008 18:08:30 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[semantic web]]></category>

		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/?p=54</guid>
		<description><![CDATA[In February, the OpenCalais project was slashdotted because it had opened a bounty on a WordPress plug-in to produce an RDF-formatted version of blog posts.  The project sounded interesting, and my previous semantic web project had just stalled.  The specification was very loose, so I assumed that they were expecting a simple alpha that could [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;">In February, the <a href="http://opencalais.mashery.com/">OpenCalais</a> project was <a href="http://slashdot.org/article.pl?sid=08/02/10/2041235">slashdotted</a> because it had opened a <a href="http://opencalais.mashery.com/forum/read/13599">bounty</a> on a WordPress plug-in to produce an RDF-formatted version of blog posts.  The project sounded interesting, and my previous semantic web project had just stalled.  The specification was very loose, so I assumed that they were expecting a simple alpha that could be expanded on.  So, I signed up for the project and began hacking away at a simple plugin.  I spent a couple hours on it and stopped.  I realized I wouldn&#8217;t be able to work on it for several days and by that time my entry would be lost.  I was expecting OpenCalais to be inundated with so many entries from the slashdot community they would close the submissions in 48 hours.  Boy was I wrong.</p>
<p style="text-align: left;"><span id="more-54"></span></p>
<p style="text-align: left;">I checked back a few weeks later, expecting to see the completed entry available for download.  I wanted to see how the finished version compared with my approach.  Instead I found this:</p>
<blockquote>
<p style="text-align: left;">Unfortunately - and unexpectedly - we haven&#8217;t seen any reasonable applications for the bounty process so we&#8217;ll most likely be contracting for the development of the WordPress plugin.</p>
</blockquote>
<p style="text-align: left;">They hadn&#8217;t seen any reasonable applications?  Was I wrong about them getting hundreds of applications from the unwashed masses of Slashdot?  Did I still have a chance at getting the prize?  The deadline for submissions was the next day (Saturday) and all I had was a little bit of very raw code.  I spent my Saturday morning building my submission and writing up my proposal for what I would do with my entry.  A demo plugin was not mentioned in the proposal request, but I included my basic plugin anyway.  I thought it might help me stand out from the crowd a little.  So, I submitted a proposal to the OpenCalais bounty project a few hours before the deadline.</p>
<p style="text-align: left;">The winner was to be announced in 10 days, so I waited.  And waited.  There was nothing on the boards and not even an email response confirming my submission had been received.  Finally, I emailed them and they replied with a link to a new post on their boards.</p>
<blockquote>
<p style="text-align: left;">Bounty Update</p>
<p>Over the last week or so we&#8217;ve been thinking hard about what to do with the WordPress bounty. Here&#8217;s the situation:</p>
<p>We received a number of proposals of varying quality at the very last minute - from three sentences long to reasonably well articulated. We&#8217;ve read each one carefully and evaluated them for how innovative they were and how experienced the proposer was in developing production strength WordPress plugins. While we appreciate the effort the individuals made in putting these proposals together, the fact of the matter is that none of them had the combination of great ideas and great experience that we were looking for.</p>
<p>So, we&#8217;ve decided to to down another path. We feel badly about it - but we feel strongly that its the right thing to do.</p>
<p>So, our apologies if you had your hopes up. Our thanks for taking the time to apply. There will be contest opportunities in the near future that will be contests - not bounties.</p>
<p>Comments, criticism and suggestions welcome,</p>
<p>Regards</p>
</blockquote>
<p style="text-align: left;">I was horribly disappointed.  Not because my proposal wasn&#8217;t chosen, but because they had just shot themselves in the foot.  The point of a bounty is to draw attention to your project and get people interested in using your service and building add-ons.  The initial proposal is not the finished product.  All the people who submitted proposals could have been an army of developers combining their ideas into a fantastic product in true open source fashion, but instead they&#8217;ve been alienated.  All OpenCalais needed to do was choose the person with the most extensible architecture and allow them to wrangle the rest of the developers as the benevolent dictator.  In a few months, they&#8217;d have had a product they could be proud of.</p>
<p style="text-align: left;">Now, I have a bit of code without a home and I really hate that.  There isn&#8217;t a project somewhere that I can check out and maybe add some of my ideas to.  I&#8217;ve found some of the other OpenCalais submitters on the web, but without a single point of contact or the endorsement of the parent project I doubt they&#8217;ll go very far.</p>
<p style="text-align: left;">OpenCalais just joined Duke Nukem Forever.  Cheers!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=54</wfw:commentRss>
		</item>
		<item>
		<title>Google &#8220;We&#8217;re Sorry&#8230;&#8221;</title>
		<link>http://www.codebudo.com/?p=51</link>
		<comments>http://www.codebudo.com/?p=51#comments</comments>
		<pubDate>Tue, 01 Apr 2008 23:49:21 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/2008/04/01/google-were-sorry/</guid>
		<description><![CDATA[Apparently Google has finally gotten sick of spyware authors taking advantage of them and has devised a system for preventing automated searches.  Unfortunately, they&#8217;re just blocking entire IP blocks from using their service.  This means that large corporate networks whose outbound traffic goes through a small number of IPs get entirely blocked from [...]]]></description>
			<content:encoded><![CDATA[<p>Apparently Google has finally gotten sick of spyware authors taking advantage of them and has devised a system for preventing automated searches.  Unfortunately, they&#8217;re just blocking entire IP blocks from using their service.  This means that large corporate networks whose outbound traffic goes through a small number of IPs get entirely blocked from Google access.</p>
<blockquote><p> &#8220;We&#8217;re Sorry&#8230; &#8230; but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can&#8217;t process your request right now.</p>
<p>We&#8217;ll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software.</p>
<p>If you&#8217;re continually receiving this error, you may be able to resolve the problem by deleting your Google cookie and revisiting Google. For browser-specific instructions, please consult your browser&#8217;s online support center.</p>
<p>If your entire network is affected, more information is available in the Google Web Search Help Center.</p>
<p>We apologize for the inconvenience, and hope we&#8217;ll see you again on Google.<br />
To continue searching, please type the characters you see below:&#8221;</p></blockquote>
<p><a href="http://www.codebudo.com/2008/04/01/google-were-sorry/google-is-sorry/" rel="attachment wp-att-53" title="Google is sorry"><img src="http://www.codebudo.com/wp-content/uploads/2008/04/google_sorry1.gif" alt="Google is sorry" /></a></p>
<p>Of course, considering the date, this could just be a well-crafted joke.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=51</wfw:commentRss>
		</item>
		<item>
		<title>Conditional Independence</title>
		<link>http://www.codebudo.com/?p=48</link>
		<comments>http://www.codebudo.com/?p=48#comments</comments>
		<pubDate>Thu, 28 Feb 2008 16:57:01 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Software]]></category>

		<category><![CDATA[probability]]></category>

		<guid isPermaLink="false">http://www.codebudo.com/2008/02/28/conditional-independence/</guid>
		<description><![CDATA[An honest man is flipping a fair coin.  After flipping &#8216;heads&#8217; seven times in a row, a banker walks up and offers the honest man a bet of $50 if the next flip is &#8216;heads&#8217;.  The banker hasn&#8217;t seen the previous seven flips.  Should the honest man take the bet?
More importantly, do [...]]]></description>
			<content:encoded><![CDATA[<p>An honest man is flipping a fair coin.  After flipping &#8216;heads&#8217; seven times in a row, a banker walks up and offers the honest man a bet of $50 if the next flip is &#8216;heads&#8217;.  The banker hasn&#8217;t seen the previous seven flips.  Should the honest man take the bet?</p>
<p>More importantly, do the previous flips affect the outcome of the next?</p>
<p><span id="more-48"></span></p>
<p>This is the concept of conditional independence.  Clearly, the probability of any single coin flip coming up heads is 1/2, but the probability of eight flips in a row all coming up heads is 1/256 (0.5^8).  So, if the future depends on the past, can the future be determined within some finite probability?</p>
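<p>To make those two numbers concrete, here is a minimal Java sketch that enumerates every sequence of eight flips, assuming a fair coin and independent flips.  The class and method names (<code>CoinFlips</code>, <code>outcomes</code>, <code>countStartingWithHeads</code>) are my own, made up for illustration.  It counts how many sequences are all heads, and how many of the sequences that begin with seven heads also end in heads.</p>

```java
public class CoinFlips {
    // Number of equally likely outcomes for the given number of fair flips (2^flips).
    static int outcomes(int flips) {
        int total = 1;
        for (int i = 0; i != flips; i++) {
            total *= 2;
        }
        return total;
    }

    // Counts the sequences of `flips` flips whose first `run` flips are all heads.
    // Each sequence is encoded as an integer whose binary digits are the flips (1 = heads).
    static int countStartingWithHeads(int flips, int run) {
        int total = outcomes(flips);
        int block = outcomes(run);
        int count = 0;
        for (int s = 0; s != total; s++) {
            // The lowest `run` binary digits are all 1 exactly when s % block == block - 1.
            if (s % block == block - 1) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int total = outcomes(8);
        int sevenHeads = countStartingWithHeads(8, 7);
        int eightHeads = countStartingWithHeads(8, 8);

        // Joint probability: all eight flips are heads.
        double joint = (double) eightHeads / total;
        // Conditional probability: the eighth flip is heads, given seven heads so far.
        double conditional = (double) eightHeads / sevenHeads;

        System.out.println("P(eight heads in a row) = " + joint);
        System.out.println("P(eighth is heads, given seven heads) = " + conditional);
    }
}
```

<p>Under those assumptions, only one of the 256 sequences is all heads, which gives the 1/256 figure.  Exactly two sequences start with seven heads, and one of them ends in heads, so the conditional figure stays at 1/2.</p>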
<p>To take the gambling metaphor a bit further, slot machines are perfect examples of discrete random number generators.  If you watch someone else play a slot machine for a while, the more they play and don&#8217;t win, the higher the probability that their next turn will be a jackpot.  Ideally, a winning slot player should observe the other players in the casino and try to take over a &#8216;cold&#8217; machine that has been played frequently but never produced a jackpot.  Despite the social aspects of a slot machine or table &#8216;going cold&#8217;, these machines actually have the highest probability of making you rich.</p>
<p>Being able to predict the future under these very controlled situations doesn&#8217;t sound like a very useful superpower, but it has many real-world applications.  Physicists use Markov chain Monte Carlo (MCMC) models in determining the movement of subatomic particles, and it&#8217;s becoming popular with information theorists for behavioral analytics of internet click-streams.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.codebudo.com/?feed=rss2&amp;p=48</wfw:commentRss>
		</item>
	</channel>
</rss>
