<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://community.research.microsoft.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>TechFest Live! : Center</title><link>http://community.research.microsoft.com/blogs/techfestlive/archive/tags/Center/default.aspx</link><description>Tags: Center</description><dc:language>en</dc:language><generator>CommunityServer 2008.5 SP1 (Build: 31106.3070)</generator><item><title>Language-Agnostic Search</title><link>http://community.research.microsoft.com/blogs/techfestlive/archive/2009/02/26/language-agnostic-search.aspx</link><pubDate>Thu, 26 Feb 2009 22:56:00 GMT</pubDate><guid isPermaLink="false">eaca9afb-5ccf-4c08-b3f3-369c7e6f1a06:4708</guid><dc:creator>robk</dc:creator><slash:comments>0</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://community.research.microsoft.com/blogs/techfestlive/rsscomments.aspx?PostID=4708</wfw:commentRss><wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://community.research.microsoft.com/blogs/techfestlive/commentapi.aspx?PostID=4708</wfw:comment><comments>http://community.research.microsoft.com/blogs/techfestlive/archive/2009/02/26/language-agnostic-search.aspx#comments</comments><description>&lt;p&gt;Some of the demos featured in TechFest 2009 were submitted by the &lt;a href="http://www.microsoft.com/middleeast/Egypt/CMIC/default.aspx"&gt;Cairo Microsoft Innovation Center&lt;/a&gt;, and I got a chance to speak with a couple of Cairo researchers, Kareem Darwish and Motaz El-Saban, about&amp;nbsp;their work.&lt;/p&gt;
&lt;p&gt;&amp;quot;We&amp;#39;re trying to enable multilingual search,&amp;quot; Darwish said, &amp;quot;in the space of text documents and in the space of printed documents. In the case of printed documents, this is the OCRLess.&amp;quot;&lt;/p&gt;
&lt;p&gt;Then El-Saban took over.&lt;/p&gt;
&lt;p&gt;&amp;quot;OCRLess is about language-independent technology,&amp;quot; he said,&amp;nbsp;&amp;quot;that allows you to search within scanned document images without the use of OCR (optical character recognition). Traditionally, if you have a document image, you would need to convert it into text using OCR, and then you can search. What we do as an alternative approach is&amp;nbsp;take the text query and&amp;nbsp;transform it into an image for rendering. Then we match it against the image document. It&amp;#39;s based on image matching and indexing, and what we&amp;#39;re showing here is five languages--English, Arabic, Chinese, Hebrew, and hieroglyphics.&amp;quot; &lt;/p&gt;
&lt;p&gt;&lt;a href="http://community.research.microsoft.com/cfs-file.ashx/__key/CommunityServer.Blogs.Components.WeblogFiles/techfestlive/Minority-Languages_2D00_small.jpg"&gt;&lt;img border="0" width="448" src="http://community.research.microsoft.com/resized-image.ashx/__size/550x0/__key/CommunityServer.Blogs.Components.WeblogFiles/techfestlive/Minority-Languages_2D00_small.jpg" alt="Hieroglyphics ready to be searched." height="320" style="border:0;" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-size:xx-small;"&gt;Hieroglyphics ready to be searched.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;It&amp;#39;s an ingenious approach, upon which El-Saban expanded.&lt;/p&gt;
&lt;p&gt;&amp;quot;The first step is to segment it,&amp;quot; he explained. &amp;quot;The segment can be a character, a part of a character, or a word. In English, it&amp;#39;s a character. We take all these segments and&amp;nbsp;cluster similar shapes together in a completely unsupervised manner,&amp;nbsp;then assign an ID to each cluster. Now, every page is presented, instead of&amp;nbsp;by the characters, by a set of IDs, and we index this set of&amp;nbsp;IDs into a regular search engine. With a query, we do&amp;nbsp;the same thing. We render it into an image using the font of the book, then we segment it into pieces, and we actually assign, for each piece, the closest cluster ID. The query is a string of IDs, basically, and we go search for the string in the book.&amp;nbsp;This makes it effective,&amp;nbsp;because we&amp;#39;re using image matching, and efficient, because we&amp;#39;re using an underlying basic search engine. When we actually match the image&amp;nbsp;query to an image inside the book, we&amp;#39;re using&amp;nbsp;template matching, an array of pixels.&amp;quot;&lt;/p&gt;
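&lt;p&gt;The pipeline El-Saban describes can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the OCRLess implementation: the glyph segments are hand-made 3x3 pixel templates, the clustering is a simple leader scheme with a pixel-difference tolerance, and a linear scan stands in for the underlying search engine. All names (hamming, nearest, cluster, build_index, search, PAGES) are hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
# Toy sketch: segments are tiny pixel templates, given as tuples of row strings.
A = ("010", "111", "101")
B = ("110", "111", "110")
C = ("111", "100", "111")
A_NOISY = ("010", "111", "100")  # A with one flipped pixel, as in a noisy scan

# Hypothetical scanned book: each page is a sequence of segmented glyphs.
PAGES = {"page1": [A, B, A_NOISY, C], "page2": [C, C, B]}


def hamming(a, b):
    """Template matching on raw pixels: count positions where they differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def nearest(templates, glyph):
    """Return (cluster_id, distance) for the closest cluster template."""
    scored = ((i, hamming(t, glyph)) for i, t in enumerate(templates))
    return min(scored, key=lambda pair: pair[1])


def cluster(glyphs, tolerance=1):
    """Unsupervised leader clustering: join a cluster if close enough."""
    templates, ids = [], []
    for glyph in glyphs:
        if templates:
            cid, dist = nearest(templates, glyph)
            if dist in range(tolerance + 1):  # distance is within the tolerance
                ids.append(cid)
                continue
        templates.append(glyph)  # start a new cluster with a fresh ID
        ids.append(len(templates) - 1)
    return templates, ids


def build_index(pages, tolerance=1):
    """Represent every page as a list of cluster IDs instead of characters."""
    all_glyphs = [g for glyphs in pages.values() for g in glyphs]
    templates, ids = cluster(all_glyphs, tolerance)
    index, start = {}, 0
    for name, glyphs in pages.items():
        index[name] = ids[start:start + len(glyphs)]
        start += len(glyphs)
    return templates, index


def search(index, templates, query_glyphs):
    """Map the rendered query to cluster IDs and look for that ID string."""
    query = [nearest(templates, g)[0] for g in query_glyphs]
    n = len(query)
    return [name for name, ids in index.items()
            if any(ids[i:i + n] == query for i in range(len(ids) - n + 1))]


templates, index = build_index(PAGES)
print(index)                             # {'page1': [0, 1, 0, 2], 'page2': [2, 2, 1]}
print(search(index, templates, [B, A]))  # ['page1']
```
&lt;/pre&gt;
&lt;p&gt;In this sketch the noisy copy of A lands in the same cluster as the clean one, so a query rendered from clean glyphs still finds it. A real system would presumably replace the linear scan with an inverted index over the ID strings.&lt;/p&gt;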
&lt;p&gt;So where might this work lead?&lt;/p&gt;
&lt;p&gt;&amp;quot;There are a number of possibilities,&amp;quot; El-Saban said. &amp;quot;Libraries could use&amp;nbsp;something like this, or whatever entities sit on large volumes of documents, possibly written in many languages, for which you don&amp;#39;t have an OCR. Another area of potential is handwriting search.&amp;nbsp;You write your own notes by hand, and then, without even having to recognize your handwriting,&amp;nbsp;you can still search them. I&amp;#39;m&amp;nbsp;trying to sit&amp;nbsp;with&amp;nbsp;different contacts&amp;nbsp;in Microsoft product groups to see if there is an interest to take this project in a specific direction.&amp;quot;&lt;/p&gt;
&lt;p&gt;Back to Darwish, whose project is called Trans-Bulletization.&lt;/p&gt;
&lt;p&gt;&amp;quot;We&amp;#39;re trying to enable people to search using English queries against documents in many, many different languages,&amp;quot; he said, &amp;quot;and then present the results, not in the original language of the documents, but in a bulletized list, a summary that removes superfluous words from the English translation and puts the information into bulletized form. The user then can very quickly learn what the document is about and consequently make a decision whether they want to invest more time reading the full translation or the original document.&amp;quot;&lt;/p&gt;
&lt;p&gt;He went on to discuss how entire documents are boiled down to bulleted lists.&lt;/p&gt;
&lt;p&gt;&amp;quot;The key technology is all the documents we are going to search, we translate into English first, and then deploy the bulletized technique,&amp;quot; Darwish said. &amp;quot;Actually, it&amp;#39;s a sentence-reduction technique. To reduce the sentences, we use a dependency parser that&amp;nbsp;recognizes the main verb, the subject of this verb, and the object. Given that these are the core components of the sentence, then we find all the other pieces--prepositional phrases, modifiers, and so forth--then&amp;nbsp;make a judgment about the information content of these pieces. If they have low information content, then they&amp;#39;re candidates for removal, but before we remove them, we have to make sure that they don&amp;#39;t break anything in the sentence.&amp;nbsp;We won&amp;#39;t remove a noun phrase unless that, if we remove it, it won&amp;#39;t break the flow as measured by&amp;nbsp;language models.&amp;quot;&lt;/p&gt;
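&lt;p&gt;The reduction step Darwish outlines can be illustrated with a small Python sketch. It is a toy under loudly stated assumptions, not the Trans-Bulletization system: the split into core and optional pieces is written by hand where a dependency parser would supply it, a hypothetical word list stands in for the information-content judgment, and a one-entry bigram set stands in for the language models. Every name here (Piece, LOW_INFO_WORDS, KNOWN_BIGRAMS, reduce_sentence) is invented for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
from typing import NamedTuple


class Piece(NamedTuple):
    text: str
    core: bool  # True for the main verb, its subject, and its object


# Hypothetical stand-ins: a real system would take the core/optional split
# from a dependency parser and both checks below from statistical models.
LOW_INFO_WORDS = {"of", "course", "in", "the", "end", "at", "this", "point"}
KNOWN_BIGRAMS = {("committee", "approved")}


def is_low_information(piece):
    """Treat a piece as low-information if all of its words are on the list."""
    return all(w in LOW_INFO_WORDS for w in piece.text.lower().split())


def keeps_flow(left, right):
    """Toy language-model check on the two words a removal would join."""
    if left is None or right is None:  # removal at a sentence edge
        return True
    joined = (left.text.lower().split()[-1], right.text.lower().split()[0])
    return joined in KNOWN_BIGRAMS


def reduce_sentence(pieces):
    """Drop optional, low-information pieces whose removal keeps the flow."""
    kept = []
    for i, piece in enumerate(pieces):
        following = pieces[i + 1] if i + 1 != len(pieces) else None
        previous = kept[-1] if kept else None
        removable = (not piece.core
                     and is_low_information(piece)
                     and keeps_flow(previous, following))
        if not removable:
            kept.append(piece)
    return " ".join(p.text for p in kept)


SENTENCE = [
    Piece("The committee", True),
    Piece("of course", False),
    Piece("approved", True),
    Piece("the budget", True),
    Piece("at this point", False),
    Piece("for the coastal region", False),
    Piece("in the end", False),
]

print(reduce_sentence(SENTENCE))
# The committee approved the budget at this point for the coastal region
```
&lt;/pre&gt;
&lt;p&gt;Two pieces are dropped, one is kept because it carries information, and one low-information piece is kept because the toy flow check does not recognize the words its removal would join.&lt;/p&gt;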
&lt;p&gt;Such help could be a boon to many in this information-saturated age.&lt;/p&gt;
&lt;p&gt;&amp;quot;For people who work in an organization that requires sifting through a lot of documents in many, many languages, this would be really useful,&amp;quot; Darwish concluded. &amp;quot;They&amp;#39;re getting all the information content in a shortened version, so they can scan lots of documents very, very quickly. A typical user might be a reporter who wants to see how people look at a particular issue&amp;nbsp;across different countries. As the person enters a query across the countries he&amp;#39;s interested in, he gets articles from Japan, the Middle East, from China and Europe, and so forth, and then he can see all the different views at the same time, in bulleted lists. He can do this very, very efficiently and very, very quickly.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://community.research.microsoft.com/cfs-file.ashx/__key/CommunityServer.Blogs.Components.WeblogFiles/techfestlive/cairo.jpg"&gt;&lt;img border="0" width="448" src="http://community.research.microsoft.com/resized-image.ashx/__size/550x0/__key/CommunityServer.Blogs.Components.WeblogFiles/techfestlive/cairo.jpg" alt="Motaz El-Saban and Kareem Darwish" height="336" style="border:0;" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-size:xx-small;"&gt;Motaz El-Saban (&lt;em&gt;left&lt;/em&gt;) and Kareem Darwish in their TechFest booth.&lt;/span&gt;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://community.research.microsoft.com/aggbug.aspx?PostID=4708" width="1" height="1"&gt;</description><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/Research/default.aspx">Research</category><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/TechFest/default.aspx">TechFest</category><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/Microsoft/default.aspx">Microsoft</category><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/2009/default.aspx">2009</category><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/Cairo/default.aspx">Cairo</category><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/Trans-Bulletization/default.aspx">Trans-Bulletization</category><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/Motaz+El-Saban/default.aspx">Motaz El-Saban</category><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/Center/default.aspx">Center</category><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/Kareen+Darwish/default.aspx">Kareen Darwish</category><category 
domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/OCRLess/default.aspx">OCRLess</category><category domain="http://community.research.microsoft.com/blogs/techfestlive/archive/tags/Innovation/default.aspx">Innovation</category></item></channel></rss>