Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 17, 2011

Building blocks of a scalable web crawler

Filed under: Indexing,NoSQL,Search Engines,Searching,SQL — Patrick Durusau @ 7:29 pm

Building blocks of a scalable web crawler, thesis by Marc Seeger (2010).

Abstract:

The purpose of this thesis was the investigation and implementation of a good architecture for collecting, analysing and managing website data on a scale of millions of domains. The final project is able to automatically collect data about websites and analyse the content management system they are using.

To be able to do this efficiently, different possible storage back-ends were examined and a system was implemented that is able to gather and store data at a fast pace while still keeping it searchable.

This thesis is a collection of the lessons learned while working on the project combined with the necessary knowledge that went into architectural decisions. It presents an overview of the different infrastructure possibilities and general approaches and as well as explaining the choices that have been made for the implemented system.

From the conclusion:

The implemented architecture has been recorded processing up to 100 domains per second on a single server. At the end of the project the system gathered information about approximately 100 million domains. The collected data can be searched instantly and the automated generation of statistics is visualized in the internal web interface.

Most of your clients have lesser information demands but the lessons here will stand you in good stead with their systems too.
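To put the two quoted figures side by side, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the peak rate of 100 domains per second could be sustained for the whole crawl; the thesis reports that rate only as a recorded maximum, so the result is a lower bound on crawl time and not a measured duration.

```python
# Back-of-the-envelope check using the two figures quoted from the conclusion.
DOMAINS_TOTAL = 100_000_000   # "approximately 100 million domains"
PEAK_RATE_PER_SECOND = 100    # "up to 100 domains per second"
SECONDS_PER_DAY = 24 * 60 * 60


def days_at_peak_rate(total: int, rate_per_second: int) -> float:
    """Lower bound on crawl time, assuming the peak rate never drops."""
    return total / rate_per_second / SECONDS_PER_DAY


crawl_days = days_at_peak_rate(DOMAINS_TOTAL, PEAK_RATE_PER_SECOND)
print(f"{crawl_days:.1f} days")  # 11.6 days
```

Even at the peak rate, a single server needs well over a week for a pass of that size.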

Hadoop & Startups: Where Open Source Meets Business Data

Filed under: Hadoop,Marketing,Subject Identity — Patrick Durusau @ 7:28 pm

Hadoop & Startups: Where Open Source Meets Business Data

From the post:

A decade ago, the open-source LAMP (Linux, Apache, MySQL, PHP/Python) stack began to transform web startup economics. As new open-source webservers, databases, and web-friendly programming languages liberated developers from proprietary software and big iron hardware, startup costs plummeted. This lowered the barrier to entry, changed the startup funding game, and led to the emergence of the current Angel/Seed funding ecosystem. In addition, of course, to enabling a generation of webapps we all use everyday.

This same process is now unfolding in the Big Data space, with an open-source ecosystem centered around Hadoop displacing the expensive, proprietary solutions. Startups are creating more intelligent businesses and more intelligent products as a result. And perhaps even more importantly, this technological movement has the potential to blur the sharp line between traditional business and traditional web startups, dramatically expanding the playing field for innovation.

So, how do we create an open-source subject identity ecosystem?

Note that I said “subject identity ecosystem” and not URLs pointing at arbitrary resources. Those are useful, but subject identity, to be re-usable, requires more than that.

Highly Scalable Erlang Web Apps

Filed under: Erlang,Marketing,Software — Patrick Durusau @ 7:26 pm

Highly Scalable Erlang Web Apps by Yurii Rashkovskii.

From the post:

Erlang is not well known for it’s ability for writing Web applications on the front-end; however, it can be incredibly powerful for writing scalable and highly scalable. Yurii Rashkovskii, creator of Beam.js and Erlagner.org is helping to change that with a laundry list of Erlang open source projects and libraries which make writing powerful and scalable Web applications back possible in Erlang. Yurii Rashkovskii recently presented on some of the powerful frameworks he has presented at the Erlang Factory in London and shares some of his projects and their powerful abilities.

In addition to useful information about Erlang web apps, Yurii says:

If one would look at my current list of open source Erlang projects, they might seem like a pile of unrelated stuff, but there’s actually a very basic idea behind most (if not all) of these projects. The idea is that if we want to make Erlang a much more attractive platform for other developers, we should act more on befriending adjacent communities, instead of directly competing with them. (emphasis added)

Is that a useful way to think about topic map applications?

RDF data in Neo4J: the Tinkerpop story

Filed under: RDF,Sail,TinkerPop — Patrick Durusau @ 7:25 pm

RDF data in Neo4J: the Tinkerpop story

From the post:

As mentioned in my previous blog post, I recently got asked to implement a storage and querying platform for biological RDF (Resource Description Framework) data. Traditional RDF stores are not really an option as my solution should also provide the ability to calculate shortest paths between random subjects. Calculating shortest path is however one of the strong selling points of Graph Databases and more specifically Neo4J. Unfortunately, the neo-rdf-sail component, which suits my requirements perfectly, is no longer under active development. Tinkerpop’s Sail implementation however, fills the void with an even better alternative!

Interesting if you are an RDF or biologicals fan, or even if you are not!

Social Media in Strategic Communication (SMISC)

Filed under: Funding,Marketing,Social Networks — Patrick Durusau @ 7:25 pm

Social Media in Strategic Communication (SMISC)

From the Synopsis:

DARPA is soliciting innovative research proposals in the area of social media in strategic communication. Proposed research should investigate innovative approaches that enable revolutionary advances in science, devices, or systems. Specifically excluded is research that primarily results in evolutionary improvements to the existing state of practice. See the full DARPA-BAA-11-64 document attached.

Important Dates
Posting Date: see announcement at www.fbo.gov
Proposal Due Date
Initial Closing: August 30, 2011, 12:00 noon (ET)
Final Closing: October 11, 2011, 12:00 noon (ET)
Industry Day: Tuesday, August 2, 2011

Contracting Office Address:
3701 North Fairfax Drive
Arlington, Virginia 22203-1714
Primary Point of Contact.:
Dr. Rand Waltzman
DARPA-BAA-11-64@darpa.mil

From the Funding Opportunity Description:

DARPA is soliciting innovative research proposals in the area of social media in strategic communication. Proposed research should investigate innovative approaches that enable revolutionary advances in science, devices, or systems. Specifically excluded is research that primarily results in evolutionary improvements to the existing state of practice. (emphasis added)

I think topic maps could be part of an approach that is revolutionary, not evolutionary.

I don’t have the infrastructure to field an application but if you do and have need for a wooly-pated consultant on such a project, give me a call.

PS: I first saw this in a tweet from Tim O’Reilly.

July 16, 2011

bulbflow

Filed under: Blueprints,Graphs,Gremlin,OrientDB — Patrick Durusau @ 5:43 pm

bulbflow: a Python framework for the graph era

From the Overview:

Bulbs is an open-source Python persistence framework for graph databases and the first piece of a larger Web-development toolkit that will be released in the upcoming weeks.

It’s like an ORM for graphs, but instead of SQL, you use the graph-traversal language Gremlin to query the database.

You can use it to connect to any Blueprints-enabled database, including TinkerGraph, Neo4j, OrientDB, Dex, and OpenRDF (and there is an InfiniteGraph implementation in development).

This means your code is portable because you can plug into different graph database backends without worrying about vendor lock-in.

Bulbs was developed in the process of building Whybase, a startup that will open for preview this fall. Whybase needed a persistence layer to model its complex relationships, and Bulbs is an open-source version of that framework.

You can use Bulbs from within any Python Web-development framework, including Flask, Pyramid, and Django.

Will be watching for future developments!

Python for brain mining:…

Filed under: Machine Learning,Parallel Programming,Parallelism,Python,Visualization — Patrick Durusau @ 5:42 pm

Python for brain mining: (neuro)science with state of the art machine learning and data visualization by Gaël Varoquaux.

Brief slide deck on three tools:

Mayavi: For 3-D visualizations.

scikit-learn, which we reported on at: scikits.learn machine learning in Python.

Joblib: running Python functions as pipeline jobs.

All three look useful, although I suspect Joblib may be the one of most immediate interest.

Depends on your interests. Comments?

IPSN 2012

Filed under: Conferences,Machine Learning — Patrick Durusau @ 5:42 pm

IPSN 2012 : The 11th ACM/IEEE Conference on Information Processing in Sensor Networks Call for Papers.

April 2012 Beijing, China (conference dates not confirmed)

Important Dates:

Abstract deadline: Friday, October 07, 2011
Full papers due: Friday, October 14, 2011
Author notification: Friday, January 20, 2012
Camera ready due: Friday, March 01, 2012

Scope:

The International Conference on Information Processing in Sensor Networks (IPSN) is a leading, single-track, annual forum on research in wireless embedded sensing systems. IPSN brings together researchers from academia, industry, and government to present and discuss recent advances in both theoretical and experimental research. Its scope includes signal and image processing, information and coding theory, databases and information management, distributed algorithms, networks and protocols, wireless communications, collaborative objects and the Internet of Things, machine learning, and embedded systems design.

If you are designing a topic map for sensor network input this looks like a good conference to attend.

I haven’t had the good fortune to visit Beijing but April is reported to be a good month (Spring) to visit.

Relational Theory Workshop – 17 July 2011 (tomorrow)

Filed under: Conferences — Patrick Durusau @ 5:41 pm

Relational Theory Workshop – 17 July 2011 (tomorrow)

By way of Jack Park, an announcement from John Kineman of a workshop on relational theory.

Announcement reads:

Judith [Rosen] and I are in Hull at the ISSS annual meeting [International Society for the Systems Sciences 2011]. We will do a workshop on Relational theory and my recent synthesis (R-theory) tomorrow (Sunday) from 10:00 to 4:30 GMT [6 AM East Coast US until 12:30 PM East Coast US].

A workshop and SIG website has been established by the graces of Jeff Prideaux. At this site you can access relevant information about the workshop and related materials. In particular you will find the key papers for the workshop posted there under the “Papers” tab. The URL is: www.relationalscience.org

If we can work out the technical details there will be a livestream of the workshop and those viewing may communicate via the relational science website or the livestream chat function. The URL for the live stream, if it happens, will be: http://www.livestream.com/relationalscience

Feel free to join for part or all and/or to let others know who may want to join in remotely

Your mileage may vary.

The older I get, the more I feel that “observers” are only participants with an additional name.

TempleScript cloud control

Filed under: Cloud Computing,Topic Map Software — Patrick Durusau @ 5:40 pm

TempleScript cloud control

Robert Barta’s efforts at weather control. No, wait, that’s not right, must mean that other “cloud.” 😉

July 15, 2011

Cloudant Search

Filed under: Search Engines,Search Interface,Searching — Patrick Durusau @ 6:49 pm

Cloudant Search

Tim Anglade writes:

I’ve always strongly felt that using NOSQL wasn’t so much a choice as a necessity. That most successful NOSQL deployments start with the intimate knowledge that your set of requirements — from speed & availability to operational considerations and budget — cannot be met with a relational database, coupled with a deep understanding of the tradeoffs you are making. Among those, perhaps no tradeoff has been felt more deeply by NOSQL users worldwide, than the eponymous loss of a natural, instantaneous way of accessing your data through a structured query language. We all came up with our own remedies; more often than not, that substitute was based on MapReduce: Google’s novel, elegant way of explicitly parallelizing computation over distributed, unstructured data. But as the joke goes, it’s always been a non-starter for the more novice users out there, and where suits & ties are involved.

CouchDB Views (as our brand of MapReduce is called) come with additional concerns, as they are pre-computed and written to disk. While this is fine — and actually, extremely useful — for the use-cases and small scales a lot of Apache CouchDB deployments reside at (single instances working off a limited dataset), this behavior is somewhere North of nagging and South of suicidal for the data sizes & use-cases most Cloudant customers have to deal with. Part of the promise of our industry is — or should be, anyway — to make your life & business easier, no matter how much data you have. And so, while CouchDB Views have been, and will undoubtedly remain, an essential tool to index, filter & transform your data, once you know what to do with it; and while its various weaknesses (explicitly parallelized syntax, lengthy computation, heavy disk usage) are also the source of its most meaningful strengths (distributed processing, high performance on repeated queries, persistent transformations), we at Cloudant saw a clear opportunity to offer a novel, complementary way to interact with your data.

A way that would allow you to interact with your data instantaneously; wouldn’t force you to mess around with MapReduce jobs or complex languages; a way that would not require you to set up a third-party, financially or operationally expensive solution.

We call this way Cloudant Search. And today, we’re proud to announce its immediate availability, as a public beta.

Well, there goes the weekend!

Data, Graphs, and Combinatorics…Oh My!

Filed under: Combinatorics,Conferences,Data,Graphs — Patrick Durusau @ 6:48 pm

DATA, GRAPHS, and COMBINATORICS in BIO-INFORMATICS, FINANCE, LINGUISTICS, and NATIONAL SECURITY

From the workshops page:

26-27 July 2011
College of Staten Island
City University of New York

This two-day workshop is designed to address current topics in Data Intensive Computing. Topics covered will include computational statistical, graph theoretic and combinatoric approaches in bio-informatics, financial data analytics, linguistics, and national security.

Technology is allowing researchers, government agencies, and companies to acquire huge amounts of information in the sciences and on human behavior. The challenge confronting researchers is how to discern meaningful information and relationships from this plethora of data. The workshop will focus on new algorithmic techniques and their computational implementation, including:

  • How Watson won Jeopardy!
  • Computational graph theory and combinatorics in bio-informatics, on Wall Street, and in National Security applications.
  • The role of high performance computing, graph theory and combinatorics in data intensive computing.

Workshop speakers include noted representatives from academe, government research laboratories, and industry. A list of speakers is attached.

The workshop will be held on Tuesday and Wednesday, July 26-27, 2011 from 8:15 AM to 4:45 PM in the Recital Hall of the Center for the Arts on the campus of the College of Staten Island, 2800 Victory Boulevard, Staten Island, New York 10314.

A continental breakfast, lunch, and refreshments will be provided each day. There is an attendance fee of $85 per person ($50 for students). Advanced registration is required.

Workshop information:
http://www.csi.cuny.edu/cunyhpc/workshops.php

Registration page:
http://www.csi.cuny.edu/cunyhpc/registration.php

Directions page:
http://www.csi.cuny.edu/cunyhpc/directions.html

For information, send email to:
hpcworkshops@csi.cuny.edu

If you will be on Staten Island, July 26-27, 2011, register and attend!

If you can get to Staten Island, July 26-27, 2011, register and attend!

This looks quite exciting!

Cineasts

Filed under: Neo4j,Spring Data — Patrick Durusau @ 6:47 pm

Cineasts

A demo based on:

  • Springsource
  • Springdata
  • Springdatagraph
  • Neo4j

Seems to have some rough edges. Rating movies by selecting “heads” keeps jumping to the top of the page. Not obvious, even after looking around, how you would add friends, etc. Underlying technology is sound but the UI could use some work.

RDFaCE WYSIWYM RDFa Content Editor

Filed under: RDFa,Semantic Web — Patrick Durusau @ 6:47 pm

RDFaCE WYSIWYM RDFa Content Editor

From the announcement:

RDFaCE is an online RDFa content editor based on TinyMCE. In addition to two classical views for text authoring (WYSIWYG and HTML source code), RDFaCE supports two novel views for semantic content authoring namely WYSIWYM (What You See Is What You Mean), which highlights semantic annotations directly inline and a triple view (aka. fact view). Further features are:

  • use of different Web APIs (Prefix.cc, Sindice, Swoogle) to facilitate the semantic content authoring process.
  • combining of results from multiple NLP APIs (Alchemy, Extractive, Ontos, Evri, OpenCalais) for obtaining rich automatic semantic annotations that can be modified and extended later on.

This is very clever and a step forward for the Semantic Web.

OrientDB 1.0rc3 – Graph(Ed)

Filed under: Blueprints,Gremlin,OrientDB,Pipes — Patrick Durusau @ 6:47 pm

OrientDB 1.0rc3 – Graph(Ed)

From the webpage:

This is a special edition of OrientDB with these TinkerPop technologies in bundle:

  • Blueprints provides a collection of interfaces and implementations to common, complex data structures. In short, Blueprints provides a one stop shop for implemented interfaces to help developers create software without being tied to particular underlying data management systems.
  • Gremlin is a Turing-complete, graph-based programming language designed for key/value-pair multi-relational graphs. Gremlin makes use of an XPath-like syntax to support complex graph traversals. This language has application in the areas of graph query, analysis, and manipulation.
  • Pipes is a graph-based data flow framework for Java 1.6+. A process graph is composed of a set of process vertices connected to one another by a set of communication edges. Pipes supports the splitting, merging, and transformation of data from input to output.
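The splitting, merging, and transformation steps in the Pipes description can be sketched in a few lines. This is not the Pipes API (Pipes is a Java framework); the function names `transform`, `split`, and `merge` are hypothetical and only illustrate the data flow idea.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, List, Tuple


def transform(fn: Callable, source: Iterable) -> Iterator:
    """Apply fn to each item as it flows through (lazy)."""
    for item in source:
        yield fn(item)


def split(predicate: Callable, source: Iterable) -> Tuple[List, List]:
    """Route each item to one of two outputs, depending on the predicate."""
    matched, rest = [], []
    for item in source:
        (matched if predicate(item) else rest).append(item)
    return matched, rest


def merge(*sources: Iterable) -> Iterator:
    """Join several streams back into one, in the order given."""
    return chain(*sources)


numbers = range(1, 7)
evens, odds = split(lambda n: n % 2 == 0, numbers)
result = list(
    merge(transform(lambda n: n * 10, evens), transform(lambda n: -n, odds))
)
print(result)  # [20, 40, 60, -1, -3, -5]
```

Each function plays the role of a process vertex, and passing one function's output to the next plays the role of a communication edge.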

The graph community just keeps getting stronger.

July 14, 2011

Graphs, Brains, and Gremlin

Filed under: Graphs,Gremlin,Neo4j — Patrick Durusau @ 4:14 pm

Graphs, Brains, and Gremlin

From the post:

What do graphs and brains have in common? First, they both share a relatively similar structure: Vertices/neurons are connected to each other by edges/axons. Second, they both share a similar process: traversers/action potentials propagate to effect some computation that is a function of the topology of the structure. If there exists a mapping between two domains, then its possible to apply the processes of one domain (the brain) to the structure of the other (the graph). The purpose of this post is to explore the application of neural algorithms to graph systems.

As only Marko could answer the question: “What do graphs and brains have in common?”

Highly recommended.

I am particularly interested in the use of spreading activation for subject recognition. How do we capture such a recognition and/or communicate it to others?
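For readers new to spreading activation, a minimal sketch follows. It is not the Gremlin code from Marko's post. The graph, the `decay` factor, the number of `steps`, and the rule of dividing energy evenly among neighbors are all assumptions chosen for illustration.

```python
from collections import defaultdict


def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate energy from seed nodes along outgoing edges.

    graph: dict mapping node -> list of neighbor nodes
    seeds: dict mapping node -> initial energy
    At each step a node passes (energy * decay) on, split evenly
    among its neighbors. Returns the accumulated activation per node.
    """
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            neighbors = graph.get(node, [])
            if not neighbors:
                continue  # dead end: nothing to pass on
            share = energy * decay / len(neighbors)
            for neighbor in neighbors:
                next_frontier[neighbor] += share
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


# Hypothetical toy graph of related subjects.
graph = {
    "hbase": ["hadoop", "facebook"],
    "hadoop": ["mapreduce"],
    "facebook": [],
    "mapreduce": [],
}
activated = spread_activation(graph, {"hbase": 1.0})
print(activated)
# {'hbase': 1.0, 'hadoop': 0.25, 'facebook': 0.25, 'mapreduce': 0.125}
```

Nodes that end up above some threshold would be the candidates for "recognized" subjects; where to set that threshold is exactly the kind of decision that would need to be captured and communicated.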

…20 Billion Events Per Day

Filed under: Analytics,HBase — Patrick Durusau @ 4:13 pm

Facebook’s New Realtime Analytics System: HBase to Process 20 Billion Events Per Day

The post covers the use of HBase with pointers to additional comments. Some of the additional analysis caught my eye:

Facebook’s Social Plugins are Roman Empire Management 101. You don’t have to conquer everyone to build an empire. You just have control everyone with the threat they could be conquered while making them realize, oh by the way, there’s lots of money to be made being friendly with Rome. This strategy worked for quite a while as I recall.

You’ve no doubt seen Social Plugins on websites out the wild. A social plugin lets you see what your friends have liked, commented on or shared on sites across the web. The idea is putting social plugins on a site makes the content more engaging. Your friends can see what you are liking and in turn websites can see what everyone is liking. Content that is engaging gives you more clicks, more likes, and more comments. For a business or brand, or even an individual, the more engaging the content is, the more people see it, the more it pops up in news feeds, the more it drives traffic to a site.

The formerly lone-wolf web, where content hunters stalked web sites silently and singly, has been turned into a charming little village, where everyone knows your name. That’s the power of social.

Turning content hunters into villagers is quite attractive.

I checked out the reference on Like buttons. You can use the Open Graph protocol but:

When your Web page represents a real-world entity, things like movies, sports teams, celebrities, and restaurants, use the Open Graph protocol to specify information about the entity.

Isn’t a web page at the wrong level of granularity?

This page has already talked about social plugins, Facebook, web pages, Like buttons, HBase, the Roman Empire and several other “entities.”

But:

og:url – The canonical, permanent URL of the page representing the entity. When you use Open Graph tags, the Like button posts a link to the og:url instead of the URL in the Like button code.

Oops. I have to either choose one entity or have the same URL for the Roman Empire as I do for Facebook.

That doesn’t sound like a good solution.

Does it to you?
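The granularity problem can be made concrete with a small sketch. The page URL and the fragment-style identifiers below are hypothetical. The per-subject scheme is only an illustration of finer granularity, not something the Open Graph protocol provides.

```python
# Hypothetical page URL and subject list, for illustration only.
PAGE_URL = "http://example.org/20-billion-events"
SUBJECTS = ["Facebook", "HBase", "Roman Empire"]

# One og:url per page: every subject on the page collapses onto the page URL.
by_page = {subject: PAGE_URL for subject in SUBJECTS}

# Alternative (an assumption, not part of Open Graph): one identifier per subject.
by_subject = {
    subject: f"{PAGE_URL}#{subject.lower().replace(' ', '-')}"
    for subject in SUBJECTS
}

print(len(set(by_page.values())))     # 1
print(len(set(by_subject.values())))  # 3
```

With one identifier per page, three subjects are indistinguishable. With one per subject, each can be liked, linked, or merged on its own.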

Computer learns language by playing games

Filed under: Artificial Intelligence,Language — Patrick Durusau @ 4:12 pm

Computer learns language by playing games

From the post:

Computers are great at treating words as data: Word-processing programs let you rearrange and format text however you like, and search engines can quickly find a word anywhere on the Web. But what would it mean for a computer to actually understand the meaning of a sentence written in ordinary English — or French, or Urdu, or Mandarin?

One test might be whether the computer could analyze and follow a set of instructions for an unfamiliar task. And indeed, in the last few years, researchers at MIT’s Computer Science and Artificial Intelligence Lab have begun designing machine-learning systems that do exactly that, with surprisingly good results.

The original paper, Learning to Win by Reading Manuals in a Monte-Carlo Framework by S.R.K. Branavan, David Silver, and Regina Barzilay, reports:

Abstract:

This paper presents a novel approach for leveraging automatically extracted textual knowledge to improve the performance of control applications such as games. Our ultimate goal is to enrich a stochastic player with high level guidance expressed in text. Our model jointly learns to identify text that is relevant to a given game state in addition to learning game strategies guided by the selected text. Our method operates in the Monte-Carlo search framework, and learns both text analysis and game strategies based only on environment feedback. We apply our approach to the complex strategy game Civilization II using the official game manual as the text guide. Our results show that a linguistically-informed game-playing agent significantly outperforms its language-unaware counterpart, yielding a 27% absolute improvement and winning over 78% of games when playing against the built-in AI of Civilization II.

Deeply interesting work, particularly as assistive authoring for topic maps.

Think about the number of recipes, manuals, IETMs, etc., that are in electronic format. Identifying common steps, despite differences in descriptions, could be quite useful.

Come to think of it, most regulations and laws are written that way. Imagine the difference between a strictly textual search of legal resources and a semantically aware search of legal resources? Not today or tomorrow, but given the rate of progression, that sort of killer app may not be far off.

Do you want to be selling it or buying it?


Sci-fi fans take note:

Games are won by gaining control of the entire world map.

MapReduce with MongoDB and Clojure

Filed under: Clojure,MapReduce,MongoDB — Patrick Durusau @ 4:12 pm

MapReduce with MongoDB and Clojure

From the post:

A few days ago, we decided to create a dashboard in order to better visualize some statistics of our production systems. One important function is to plot the average latency as a time-series graph, so we can see the trend over time. Since MongoDB implemented MapReduce, and we store our logs in MongoDB, MapReduce seems a natural fit for log analysis.

One issue with MongoDB’s implementation of MapReduce is that no matter what language you use, you have to pass JavaScript code as strings to MongoDB. Storing code written in another language as strings in a program is … inelegant, to say the least.

Fortunately, Clojure being a homoiconic language, it is relatively easy to transform Clojure forms into code snippets of other languages using Clojure itself in the same program. In other words, it is possible to embed JavaScript programs in a Clojure program without actually seeing any JavaScript syntax. There are already a number of libraries, with different level of maturity, that allow you to transform Clojure forms to JavaScript. I haven’t done an extensive survey, but ClojureJS is good enough for our purpose.

Emphasis on homoiconic nature of Clojure.
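The log analysis the post describes (average latency as a time series) has a simple map/reduce shape, sketched here in plain Python. This is not MongoDB's MapReduce API and not the post's Clojure code. The field names `timestamp` and `latency_ms`, the sample entries, and the 60-second bucket are assumptions for illustration.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket width for the time series

# Hypothetical log entries; a real deployment would read these from the log store.
log_entries = [
    {"timestamp": 0, "latency_ms": 100},
    {"timestamp": 30, "latency_ms": 300},
    {"timestamp": 60, "latency_ms": 50},
    {"timestamp": 90, "latency_ms": 150},
    {"timestamp": 119, "latency_ms": 100},
]


def map_entry(entry):
    """Emit (bucket start, (latency sum, count)) for one log entry."""
    bucket = entry["timestamp"] // BUCKET_SECONDS * BUCKET_SECONDS
    return bucket, (entry["latency_ms"], 1)


def reduce_values(values):
    """Combine partial (sum, count) pairs; safe to apply repeatedly."""
    total = sum(value[0] for value in values)
    count = sum(value[1] for value in values)
    return total, count


def average_latency(entries):
    """Group mapped values by bucket, reduce each group, then average."""
    grouped = defaultdict(list)
    for entry in entries:
        key, value = map_entry(entry)
        grouped[key].append(value)
    series = {}
    for key in sorted(grouped):
        total, count = reduce_values(grouped[key])
        series[key] = total / count
    return series


latency_series = average_latency(log_entries)
print(latency_series)  # {0: 200.0, 60: 100.0}
```

Carrying (sum, count) pairs through the reduce step, and dividing only at the end, keeps the reduce function correct when it is applied to partial results.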

Elasticsearch, Kettle and the CTools

Filed under: ElasticSearch,Search Engines,Searching — Patrick Durusau @ 4:11 pm

Elasticsearch, Kettle and the CTools

From the post:

I’m not much into the sql vs nosql discussion. I have enough years of BI to know that the important thing is to choose the right tool for the job. And that requires a lot of tools!

Here’s one more for our set: ElasticSearch. ElasticSearch is an Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene.

Adds Elasticsearch to Kettle for BI.


Updated 14 May 2012 (forgot the URL for the link, now fixed)

Introduction to Graph Databases

Filed under: GraphDB,Graphs,Neo4j — Patrick Durusau @ 4:10 pm

Introduction to Graph Databases

From the description:

Neo Technology CEO Emil Eifrem provides a fast paced introduction to NOSQL, graph databases, and Neo4j, the world’s leading graph database.

I managed to catch this Webinar and it is a good introduction to graph databases.

July 13, 2011

GeoCommons Enterprise Features – Free!

Filed under: Geo Analytics,Geographic Data,Geographic Information Retrieval — Patrick Durusau @ 7:30 pm

GeoCommons Enterprise Features – Free!

From the email announcement:

  • Analytics: Easy-to-use, advanced spatial analytics that users and groups can utilize to answer mission-critical questions. Select among numerous analyses such as filtering, buffers, spatial aggregation and predictive analysis.
  • Private Data Support: Keep proprietary data private and unsearchable by others. Now you can upload proprietary data, analyze it with other data and create compelling maps, charts and graphs all within a secure interface.
  • Groups and Permissions: Allow others in your group or organization to access and collaborate with you. Enable permissions at various levels to limit or expand data sharing. See a step-by-step guide of how to create groups and make your data private here from @seangorman.

For groups and private data, see: Private Data and Groups for GeoCommons!!

GeoCommons has 70,000 datasets.

If you look around you might find something you like.

Topic mappers should ask themselves: Why does this work? (more on that anon)

RecordBreaker: Automatic structure for your text-formatted data

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 7:30 pm

RecordBreaker: Automatic structure for your text-formatted data

From the post:

This post was contributed by Michael Cafarella, an assistant professor of computer science at the University of Michigan. Mike’s research interests focus on databases, in particular managing Web data. Before becoming a professor, he was one of the founders of the Nutch and Hadoop projects with Doug Cutting. This first version of RecordBreaker was developed by Mike in conjunction with Cloudera.

RecordBreaker is a project that automatically turns your text-formatted data (logs, sensor readings, etc) into structured data, without any need to write parsers or extractors. In particular, RecordBreaker targets Avro as its output format. The project’s goal is to dramatically reduce the time spent preparing data for analysis, enabling more time for the analysis itself.

Not quite “automatic” but a step in that direction and a useful one.

Think of “automatic” identification of subjects and associations in such files.

Like the files from campaign financing authorities.
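To give a feel for what "automatic structure" means, here is a deliberately naive sketch. It is not RecordBreaker's algorithm and it does not produce Avro. It assumes whitespace-delimited lines, guesses a type per column, and uses made-up field names (`field_0`, `field_1`, ...) and made-up sample lines.

```python
def infer_type(token):
    """Return the narrowest of int, float, str that can parse the token."""
    for caster in (int, float):
        try:
            caster(token)
            return caster
        except ValueError:
            continue
    return str


def structure(lines):
    """Turn whitespace-delimited text lines into typed records (dicts)."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return []
    width = len(rows[0])
    # Keep only rows that match the first row's shape.
    rows = [row for row in rows if len(row) == width]
    schema = []
    for column in zip(*rows):
        types = {infer_type(token) for token in column}
        if types == {int}:
            schema.append(int)
        elif types <= {int, float}:
            schema.append(float)
        else:
            schema.append(str)
    return [
        {f"field_{i}": cast(token) for i, (cast, token) in enumerate(zip(schema, row))}
        for row in rows
    ]


# Hypothetical sensor readings.
sample = ["2011-07-13 sensor-a 21.5 3", "2011-07-13 sensor-b 19 7"]
records = structure(sample)
print(records[0])
# {'field_0': '2011-07-13', 'field_1': 'sensor-a', 'field_2': 21.5, 'field_3': 3}
```

The sketch recovers types but not meaning: nothing tells it that `field_1` names a sensor. That gap is where identification of subjects and associations would come in.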

Scala for the Curious Erlang Programmer

Filed under: Erlang,Scala — Patrick Durusau @ 7:28 pm

Dean Wampler – Scala for the Curious Erlang Programmer

From the description:

Scala is a statically-typed, hybrid functional and object-oriented language for the JVM. The Scala standard library includes an Erlang-inspired Actors library. In this talk, I’ll discuss how Scala compares and contrasts to Erlang, highlighting the advantages and disadvantages of each language for particular needs. For example, we’ll discuss the pros and cons of a rich type system and static typing in Scala. We’ll discuss ways that Scala is perhaps more general purpose than Erlang, but not as powerful in the areas where Erlang excels.

Always useful to choose the right tool for a task. Including semantics as understood by users.

You may also enjoy Dean’s Polyglotprogramming site, with links to his presentations and blog.

Unstructured data ‘out of control’: survey

Filed under: Data,Data Mining — Patrick Durusau @ 7:28 pm

Unstructured data ‘out of control’: survey

Joe McKendrick writes:

Many organizations are becoming overwhelmed with the volumes of unstructured information — audio, video, graphics, social media messages — that falls outside the purview of their “traditional” databases. Organizations that do get their arms around this data will gain significant competitive edge.

As part of my work with Unisphere Research, a division of Information Today, Inc., I helped conduct a new survey that finds unstructured data is growing at a faster clip than relational data — driving the “Big Data” explosion. Thirty-five percent of respondents say unstructured information has already surpassed or will surpass the volume of traditional relational data in the next 36 months. Sixty-two percent say this is inevitable within the next decade. The survey gathered input from 446 data managers and professionals who are readers of Database Trends and Applications magazine, and was underwritten by MarkLogic.

A majority of survey respondents acknowledge that unstructured information is growing out of control and is driving the big data explosion – 91% say unstructured information already lives in their organizations, but many aren’t sure what to do about it.

I mention this survey because there are few contenders for the attribution, discovery, and extraction of semantics from unstructured data, so topic maps may find less competition from traditional solutions.

Storing and querying RDF data in Neo4J through Sail

Filed under: Neo4j,RDF,Sail — Patrick Durusau @ 7:27 pm

Storing and querying RDF data in Neo4J through Sail

Davy Suvee walks through importing RDF triple files into Neo4j.

Update (17 July 2011): “RDF data in Neo4J: the Tinkerpop story” advises that neo-rdf-sail is no longer under active development.

July 12, 2011

DataSift

Filed under: Marketing,Semantic Web,Semantics — Patrick Durusau @ 7:11 pm

DataSift

Congratulations to DataSift for the $6 Million in funding!

But there is another gem in the story about their funding.

Instead Mr. Halstead looked at how companies like Amazon had disrupted server rental and came up with a plan to do the same to data analysis. “For me the technology isn’t the game changer. For me it is approaching data processing in a democratized way. There are any number of companies that will sell you data, but they will typically charge you a five-figure sum minimum.

Note the line: “For me the technology isn’t the game changer. For me it is approaching data processing in a democratized way.”

That’s the trick, isn’t it? To approach “…semantics in a democratized way.”

Precisely what topic maps have to offer. Topic maps can capture your semantics. If and when you become interested in the semantics of others, you can map your semantics to theirs, preserving the integrity of both.

Disruptive to the top-down ontology approach, you ask? I suppose it is. But then democratization is always disruptive of authoritarian schemes and patterns.

The real question is: Whose semantics would you rather have? Your own or those of someone else?

Scaling Scala at Twitter by Marius Eriksen

Filed under: Geographic Information Retrieval,Scala — Patrick Durusau @ 7:10 pm

Scaling Scala at Twitter by Marius Eriksen

From the description:

Rockdove is the backend service that powers the geospatial features on Twitter.com and the Twitter API (“Twitter Places”). It provides a datastore for places and a geospatial search engine to find them. To throw out some buzzwords, it is:

  • a distributed system
  • realtime (immediately indexes updates and changes)
  • horizontally scalable
  • fault tolerant

Rockdove is written entirely in Scala and was developed by 2 engineers with no prior Scala experience (nor with Java or the JVM). We think the geospatial search engine provides an interesting case study as it presents a mix of algorithm problems and “classic” scaling and optimization issues. We will report on our experience using Scala, focusing especially on:

  • “functional” systems design
  • concurrency and parallelism
  • using a “research language” in practice
  • when, where and why we turned the “functional dial”
  • avoiding mutable state

Not to mention that it is a well-done presentation!

EMC Forum 2011

Filed under: Conferences — Patrick Durusau @ 7:08 pm

EMC Forum 2011

From the website:

Cloud Meets BigData

Join us for a free one-day event to learn how cloud computing is transforming IT and discover how you can accelerate the journey to your cloud.

It is a vendor-sponsored forum, but also an opportunity to get out of the office and interact with others at high bandwidth. 😉

MADlib goes beta!

Filed under: Data Analysis,SQL,Statistics — Patrick Durusau @ 7:08 pm

MADlib goes beta! Serious in-database analytics

From the post:

MADlib is an open-source statistical analytics package for SQL that I kicked off last year with friends at EMC-Greenplum. Last Friday we saw it graduate from alpha, to the first beta release version, 0.20beta. Hats off to the MADlib team!

Forget your previous associations with low-tech SQL analytics, including so-called “business intelligence”, “olap”, “data cubes” and the like. This is the real deal: statistical and machine learning methods running at scale within the database, massively parallel, close to the data. Much of the code is written in SQL (a language that doesn’t get enough credit as a basis for parallel statistics), with key extensions in C/C++ for performance, and the occasional Python glue code. The suite of methods in the beta includes:

  • standard statistical methods like multi-variate linear and logistic regressions,
  • supervised learning methods including support-vector machines, naive Bayes, and decision trees
  • unsupervised methods including k-means clustering, association rules and Latent Dirichlet Allocation
  • descriptive statistics and data profiling, including one-pass Flajolet-Martin and CountMin sketch methods (my personal contributions to the library) to compute distinct counts, range-counts, quantiles, various types of histograms, and frequent-value identification
  • statistical support routines including an efficient sparse vector library and array operations, and conjugate gradient optimization.

Kudos to EMC:

And hats off to EMC-Greenplum for putting significant development resources behind this open-source effort. I started this project at Greenplum before they were acquired, and have been happy to see EMC embrace it and push it further.

Not every acquisition has that happy result.
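
For readers who have not met the one-pass sketch methods named in the list above, here is a minimal Count-Min sketch in Python. It is only an illustration of the general technique, not MADlib's code (which runs inside the database). The class name, the `width` and `depth` defaults, and the salted SHA-256 hashing are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that reads each item once (hypothetical helper)."""

    def __init__(self, width=2000, depth=5):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by hashing the item with the row number as salt.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the smallest cell is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for word in ["neo4j", "hadoop", "neo4j", "scala", "neo4j"]:
    sketch.add(word)

print(sketch.estimate("neo4j"))
```

The estimate never falls below the true count and may exceed it when items collide. That trade of exactness for fixed memory and a single pass is what makes such methods attractive for running close to the data.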
