Archive for the ‘S4’ Category

Getting Started with S4, The Self-Service Semantic Suite

Tuesday, September 16th, 2014

Getting Started with S4, The Self-Service Semantic Suite by Marin Dimitrov.

From the post:

Here’s how S4 developers can get started with The Self-Service Semantic Suite. This post provides you with practical information on the following topics:

  • Registering a developer account and generating API keys
  • RESTful services & free tier quotas
  • Practical examples of using S4 for text analytics and Linked Data querying

Ontotext is up front about the limitations on the “free” service:

  • 250 MB of text processed monthly (via the text analytics services)
  • 5,000 SPARQL queries monthly (via the LOD SPARQL service)

The number of pages in a megabyte of text varies depends on text content but assuming a working average of one (1) megabyte = five hundred (500) pages of text, you can analyze up to one hundred and twenty-five thousand (125,000) pages of text a month. Chump change for serious NLP but it is a free account.

The post goes on to detail two scenarios:

  • Annotate a news document via the News analytics service
  • Send a simple SPARQL query to the Linked Data service

Learn how effective entity recognition and SPARQL are with data of interest to you, at a minimum of investment.

I first saw this in a tweet by Tony Agresta.

Zooming Through Historical Data…

Monday, January 20th, 2014

Zooming Through Historical Data with Streaming Micro Queries by Alex Woodie.

From the post:

Stream processing engines, such as Storm and S4, are commonly used to analyze real-time data as it flows into an organization. But did you know you can use this technology to analyze historical data too? A company called ZoomData recently showed how.

In a recent YouTube presentation, Zoomdata Justin Langseth demonstrated his company’s technology, which combines open source stream processing engines like Apache with data connection and visualization libraries based on D3.js.

“We’re doing data analytics and visualization a little differently than it’s traditionally done,” Langseth says in the video. “Legacy BI tools will generate a big SQL statement, run it against Oracle or Teradata, then wait for two to 20 to 200 seconds before showing it to the user. We use a different approach based on the Storm stream processing engine.”

Once hooked up to a data source–such as Cloudera Impala or Amazon Redshift–data is then fed into the Zoomdata platform, which performs calculations against the data as it flows in, “kind of like continues event processing but geared more toward analytics,” Langseth says.

From the video description:

In this hands-on webcast you’ll learn how LivePerson and Zoomdata perform stream processing and visualization on mobile devices of structured site traffic and unstructured chat data in real-time for business decision making. Technologies include Kafka, Storm, and d3.js for visualization on mobile devices. Byron Ellis, Data Scientist for LivePerson will join Justin Langseth of Zoomdata to discuss and demonstrate the solution.

After watching the video, what do you think the concept of “micro queries?”

I ask because I don’t know of any technical reason why a “large” query could not stream out interim results and display those as more results were arriving.

Visualization isn’t usually done that way but that brings me to my next question: Assuming we have interim results visualized, how useful are interim results? Being actionable on interim results really depends on the domain.

I rather like Zoomdata’s emphasis on historical data and the video is impressive.

You can download a VM at Zoomdata.

If you can think of upsides/downsides to the interim results issue, please give a shout!

MOA Massively Online Analysis

Saturday, December 1st, 2012

MOA Massively Online Analysis : Real Time Analytics for Data Streams

From the homepage:

What is MOA?

MOA is an open source framework for data stream mining. It includes a collection of machine learning algorithms (classification, regression, and clustering) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.

What can MOA do for you?

MOA performs BIG DATA stream mining in real time, and large scale machine learning. MOA can be easily used with Hadoop, S4 or Storm, and extended with new mining algorithms, and new stream generators or evaluation measures. The goal is to provide a benchmark suite for the stream mining community. Details.

Short tutorials and a manual are available. Enough to get started but you will need additional resources on machine learning if it isn’t already familiar.

A small niggle about documentation. Many projects have files named “tutorial” or in this case “Tutorial1,” or “Manual.” Those files are easier to discover/save, if the project name, version(?), is prepended to tutorial or manual. Thus “Moa-2012-08-tutorial1” or “Moa-2012-08-manual.”

If data streams are in your present or future, definitely worth a look.


Friday, December 3rd, 2010


From the website:

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

Just in case you were wondering if topic maps are limited to being bounded objects composed of syntax. No.


  1. Specify three sources of unbounded streams of data. (3 pages, citations)
  2. What subjects would you want to identify and on what basis in any one of them? (3-5 pages, citations)
  3. What other information about those subjects would you want to bind to the information in #2? What subject identity tests are used for those subjects in other sources? (5-10 pages, citations)