Join IKANOW for a meetup discussing the economic value of unstructured data
Join us, on April 5th, to discuss the business value of unstructured data in analytics and benefits of open source technologies in providing robust, big data solutions.
Welcome to the IKANOW Knowledge Discovery Blog. Here you will find our Technical and Analytic rants.
We learned about an interesting open source player called Ikanow. One of my colleagues pronounced it, “I can know.” Sounded good to me. You can get information about the firm’s solutions for “agile intelligence.”
Great post on the importance of open data for citizens and government!
Allows the easy seeding of URLs from MongoDB into Nutch. This is similar in nature to the DmozParser that comes with Nutch, and it provides a way to bootstrap and seed Nutch with data coming directly from MongoDB. The injector adds URLs from a specified MongoDB database to the crawldb of your choice. - CM
Allows indexing of Nutch crawl data directly into MongoDB. This is similar in nature to the SolrIndexer that comes with Nutch, which lets you index directly into Solr. This provides a way to index data coming from Nutch directly into MongoDB. - CM
Allows indexing of Nutch crawl data directly into elasticsearch. This is similar in nature to the SolrIndexer that comes with Nutch, which lets you index directly into Solr. This provides a way to index data coming from Nutch directly into elasticsearch. - CM
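To make the seeding idea concrete, here is a minimal sketch in Python. It is not the injector plugin itself, which works against the crawldb directly. Instead it writes a plain-text seed list (one URL per line), the format Nutch's standard inject step reads. The function name `write_seed_file` and the `url` field name are assumptions for illustration, and plain dictionaries stand in for documents returned by a MongoDB query.

```python
from pathlib import Path
from typing import Iterable, Mapping


def write_seed_file(docs: Iterable[Mapping], seed_path: str, url_field: str = "url") -> int:
    """Write one URL per line to a seed file and return how many were written.

    `docs` is any iterable of dict-like records (hypothetical stand-ins for
    MongoDB documents); `url_field` is an assumed field name.
    """
    seen = set()
    urls = []
    for doc in docs:
        url = doc.get(url_field)
        if not isinstance(url, str):
            continue  # skip documents without a usable URL field
        url = url.strip()
        if not url or url in seen:
            continue  # skip blanks and duplicates
        seen.add(url)
        urls.append(url)

    path = Path(seed_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(u + "\n" for u in urls), encoding="utf-8")
    return len(urls)
```

The resulting directory can then be handed to Nutch's normal inject step. Writing to a file first keeps the sketch independent of any particular Nutch or MongoDB driver version.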
Before we really start to get into this whitepost, it is important to ground the discussion in a set of definitions that will be used throughout.
Unstructured data = Information that does not have a predefined data model; it is typically text-heavy but may also contain dates, numbers and facts.
Structured data = Data that conforms to a data model, which determines a predefined structure for the data; typically associated with database models.
Semi-structured data = A form of structured data that does not conform to the formal structure of tables and data models typically associated with relational databases.
Definition citations are from Wikipedia.
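As a rough illustration of the three definitions, the sketch below represents one made-up incident in each form. All field names and values are hypothetical and chosen only to show the difference in shape; they are not taken from any real data set.

```python
import json
from dataclasses import dataclass


# Structured: every record has the same predefined fields and types,
# like a row in a database table.
@dataclass
class IncidentRow:
    incident_id: int
    offense: str
    report_date: str


structured = IncidentRow(incident_id=1, offense="THEFT", report_date="2011-03-01")

# Semi-structured: self-describing keys, but no fixed table schema;
# fields can nest or be absent from one record to the next.
semi_structured = json.loads(
    '{"id": 1, "offense": "THEFT", "tags": ["auto"], "notes": {"source": "tip line"}}'
)

# Unstructured: free text with no predefined model; the date and the
# offense are in there, but must be extracted before they can be queried.
unstructured = "Police said a car was stolen downtown on March 1."
```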
As we know, information is growing at an enormous rate with no real end in sight. An IDC Digital Universe report, underwritten by EMC, estimates that the Digital Universe (i.e., every electronically stored piece of information) will reach 1.2 million petabytes, or 1.2 zettabytes, this year. To imagine this, John Gantz and David Reinsel, authors of the IDC report, “picture a stack of DVDs, reaching from the earth to the moon and back” (that is about 240,000 miles each way, or driving across the United States 80 times).
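The figures above can be sanity-checked with a few lines of arithmetic. This sketch assumes decimal (SI) prefixes, so one petabyte is 10^15 bytes and one zettabyte is 10^21 bytes; the variable names are ours, not the report's.

```python
# Assumed decimal (SI) unit sizes, in bytes.
PETABYTE = 10**15
ZETTABYTE = 10**21

# 1.2 million petabytes expressed in bytes, then in zettabytes.
total_bytes = 1_200_000 * PETABYTE
zettabytes = total_bytes / ZETTABYTE

# The one-way earth-to-moon distance quoted above, split into 80 drives
# across the United States, gives the implied length of one crossing.
miles_per_crossing = 240_000 / 80

print(zettabytes)          # 1.2
print(miles_per_crossing)  # 3000.0
```

So the two unit figures agree with each other, and the driving comparison implies a crossing of roughly 3,000 miles.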
Recently we decided to take a look at performing meaningful analysis on structured law enforcement data (crime reports) and fusing it with unstructured data from news media and social networks.
Critical to finding actionable intelligence is determining the appropriate data necessary to realize the use case. First, we started by looking at publicly available data from the DC Data Catalog. Specifically, we ingested the available Crime Incident Reports and several of the available geo-spatial layers to illustrate how the Infinit.e Structured Analysis toolset can handle the structured report data. Second, we fused this with unstructured data from various social media, blogs and news sources, plus some synthetic data to simulate intelligence reporting activities. This provided a foundation to stress both the unstructured and the structured data harvesting capabilities.