1. Join IKANOW for a meetup discussing the economic value of unstructured data →

    Join us on April 5th to discuss the business value of unstructured data in analytics and the benefits of open source technologies in providing robust big data solutions.

  2. Ikanow: Creating Pathways through Information →

    We learned about an interesting open source player called Ikanow. One of my colleagues pronounced it, “I can know.” Sounded good to me. You can get information about the firm’s solutions for “agile intelligence.” 

  3. Data for the Public Good →

    Great post on the importance of open data for citizens and government!

  4. Nature Editorial: If you want reproducible science, the software needs to be open source →

  5. MongoDB Parser for Nutch (seed Nutch with URLs from MongoDB) →

    Allows easy seeding of URLs from MongoDB into Nutch. This is similar in nature to the DmozParser that comes with Nutch. It provides a way to bootstrap and seed Nutch with data coming directly from MongoDB. The injector adds URLs from a specified MongoDB to the crawldb of your choice. - CM
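
    As a rough illustration of the seeding idea, and not the parser described above, the sketch below writes URLs into a plain-text seed file with one URL per line, which is the format the stock Nutch injector reads. The function name `write_seed_file`, the `url` field name, and the sample documents are assumptions made for this example; in practice the documents would come from a MongoDB query rather than a hard-coded list.

```python
import os
import tempfile
from typing import Iterable, Mapping


def write_seed_file(docs: Iterable[Mapping], seed_dir: str, field: str = "url") -> str:
    """Write one URL per line to <seed_dir>/seed.txt and return the file path.

    `docs` is any iterable of dict-like records (hypothetical shape: {"url": ...}).
    Records without a usable URL and duplicate URLs are skipped.
    """
    os.makedirs(seed_dir, exist_ok=True)
    path = os.path.join(seed_dir, "seed.txt")
    seen = set()
    with open(path, "w", encoding="utf-8") as fh:
        for doc in docs:
            url = str(doc.get(field) or "").strip()
            if url and url not in seen:
                seen.add(url)
                fh.write(url + "\n")
    return path


# Stand-in for documents a MongoDB query would return (made-up data).
sample_docs = [
    {"url": "http://example.com/"},
    {"url": "http://example.org/news"},
    {"url": "http://example.com/"},  # duplicate, skipped
    {"title": "no url field"},       # no URL, skipped
]

seed_path = write_seed_file(sample_docs, tempfile.mkdtemp())
with open(seed_path, encoding="utf-8") as fh:
    seed_lines = fh.read().splitlines()
```

    The directory holding the resulting file could then be handed to the Nutch inject step; the plugin described above avoids this intermediate file by reading from MongoDB directly.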

  6. Integrating MongoDB and Nutch →

    Allows indexing of Nutch crawl data directly into MongoDB. This is similar in nature to the SolrIndexer that comes with Nutch, which lets you index directly into Solr. It provides a way to index data coming from Nutch directly into MongoDB. - CM

  7. Integrating Elastic Search and Nutch →

    Allows indexing of Nutch crawl data directly into Elasticsearch. This is similar in nature to the SolrIndexer that comes with Nutch, which lets you index directly into Solr. It provides a way to index data coming from Nutch directly into Elasticsearch. - CM

  8. 5 Steps to scaling mongodb (or any db) in 5 to 8 minutes →

  9. Enabling reasoning from unstructured and structured data

    Before we really start to get into this whitepost, it is important to ground the information with a set of definitions that will be used throughout.

    Unstructured data = Information that does not have a predefined model and is typically text-heavy but may contain dates, numbers and facts.

    Structured data = Data typically associated with a data model that determines a predefined structure, usually a database model.

    Semi-structured data = A form of structured data that does not conform to the formal structure of tables and data models typically associated with relational databases.

    Definition citations are from Wikipedia.
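
    To make the three definitions concrete, here is a minimal sketch of the same made-up incident in each form. All names and values (`structured_columns`, the report ID, the JSON fields) are hypothetical and chosen only for illustration.

```python
import json

# Unstructured: free text with no predefined model, though it contains a date and facts.
unstructured = "On 5 April, officers responded to a burglary reported near Dupont Circle."

# Structured: values that fit a fixed, predefined schema, like a row in a database table.
structured_columns = ("report_id", "offense", "report_date")
structured_row = (1001, "BURGLARY", "2012-04-05")
record = dict(zip(structured_columns, structured_row))

# Semi-structured: self-describing fields, but nested and optional ones that
# do not map cleanly onto a single relational table.
semi_structured = json.loads(
    '{"offense": "BURGLARY", "tags": ["night"], "notes": {"witnesses": 2}}'
)
```

    The structured row can be queried by column name right away, while the free-text sentence would need entity extraction before the date or location could be used the same way.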

    As we know, information is growing at an enormous rate with no real end in sight. An IDC Digital Universe report, underwritten by EMC, estimates that the Digital Universe (i.e., every electronically stored piece of information) will reach 1.2 million petabytes, or 1.2 zettabytes, this year. To imagine this, John Gantz and David Reinsel, authors of the IDC report, “picture a stack of DVDs, reaching from the earth to the moon and back” (that is about 240,000 miles each way, or driving across the United States 80 times).
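
    The unit conversion and the distance comparison can be checked with a few lines of arithmetic. This sketch assumes decimal (SI) units and reads "80 times" as applying to the one-way distance; both are assumptions, not statements from the IDC report.

```python
# Decimal SI units: 1 PB = 10**15 bytes, 1 ZB = 10**21 bytes.
PETABYTES_PER_ZETTABYTE = 10**6

digital_universe_pb = 1.2e6  # 1.2 million petabytes, as quoted above
digital_universe_zb = digital_universe_pb / PETABYTES_PER_ZETTABYTE

miles_each_way = 240_000  # approximate earth-to-moon distance quoted above
crossings = 80
miles_per_crossing = miles_each_way / crossings

print(f"{digital_universe_zb} ZB; {miles_per_crossing:.0f} miles per crossing")
```

    Under those assumptions each crossing works out to 3,000 miles, which is roughly the width of the United States.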

    Read More

  10. Monitoring crime data for patterns and trends

    Recently we decided to take a look at performing meaningful analysis on structured law enforcement data (crime reports) and fusing it with unstructured data from news media and social networks.

    Critical to finding actionable intelligence is determining the appropriate data necessary to realize the use case. First, we started by looking at publicly available data from the DC Data Catalog. Specifically, we ingested the available Crime Incident Reports and several of the available geo-spatial layers to illustrate how the Infinit.e Structured Analysis toolset can handle the structured report data. Second, we fused this with unstructured data from various social media, blogs and news sources, plus some synthetic data to simulate intelligence reporting activities. This provided a foundation to stress the unstructured and structured data harvesting capabilities.

    Read More