Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 12, 2014

Building a tweet ranking web app using Neo4j

Filed under: Graphs,MongoDB,Neo4j,node-js,Python,Tweets — Patrick Durusau @ 7:28 pm

Building a tweet ranking web app using Neo4j by William Lyon.

From the post:

twizzard

I spent this past weekend hunkered down in the basement of the local Elk’s club, working on a project for a hackathon. The project was a tweet ranking web application. The idea was to build a web app that would allow users to login with their Twitter account and view a modified version of their Twitter timeline that shows them tweets ranked by importance. Spending hours every day scrolling through your timeline to keep up with what’s happening in your Twitter network? No more, with Twizzard!

The project uses the following components:

  • Node.js web application (using Express framework)
  • MongoDB database for storing basic user data
  • Integration with Twitter API, allowing for Twitter authentication
  • Python script for fetching Twitter data from Twitter API
  • Neo4j graph database for storing Twitter network data
  • Neo4j unmanaged server extension, providing additional REST endpoint for querying / retrieving ranked timelines per user

Looks like a great project and good practice as well!

Curious what you think of the ranking of tweets:

How can we score Tweets to show users their most important Tweets? Users are more likely to be interested in tweets from users they are more similar to and from users they interact with the most. We can calculate metrics to represent these relationships between users, adding an inverse time decay function to ensure that the content at the top of their timeline stays fresh.

That’s one measure of “importance.” Being able to assign a rank would be useful as well, say for the British Library.

Do take notice of the Jaccard similarity index.
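
The post does not spell out a formula, so here is a minimal sketch in Node.js-flavored JavaScript of how such a score might be computed. The weights, field names, and the shape of the decay are my assumptions, not Twizzard's actual code.

    // Sketch only: Jaccard similarity between two users' friend sets,
    // combined with interaction volume and an inverse time decay.
    // Weights, field names, and the decay constant are assumptions.
    function jaccard(a, b) {
      const setA = new Set(a);
      const setB = new Set(b);
      const intersection = [...setA].filter(x => setB.has(x)).length;
      const union = new Set([...setA, ...setB]).size;
      return union === 0 ? 0 : intersection / union;
    }

    function scoreTweet(tweet, viewer, author, interactionCount, now = Date.now()) {
      const similarity = jaccard(viewer.friendIds, author.friendIds);
      const ageHours = (now - tweet.createdAt) / 36e5;
      const decay = 1 / (1 + ageHours); // inverse time decay keeps the top fresh
      return (similarity + Math.log1p(interactionCount)) * decay;
    }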

Would you say that possessing at least one identical string (id, subject identifier, subject indicator) is a form of similarity measure?

What other types of similarity measures do you think would be useful for topic maps?

I first saw this in a tweet by GraphemeDB.

July 6, 2013

Norch – a search engine for node.js

Filed under: Indexing,node-js,Search Engines — Patrick Durusau @ 4:30 pm

Norch – a search engine for node.js by Fergus McDowall.

From the post:

Norch is a search engine written for Node.js. Norch uses the Node search-index module, which is in turn written using the super-fast LevelDB library that Google open-sourced in 2011.

The aim of Norch is to make a simple, fast search server that requires minimal configuration to set up. Norch sacrifices complex functionality for a limited, robust feature set that can be used to set up a free-text search engine for most enterprise scenarios.

Currently Norch features:

  • Full text search
  • Stopword removal
  • Faceting
  • Filtering
  • Relevance weighting (tf-idf)
  • Field weighting
  • Paging (offset and resultset length)

Norch can index any data that is marked up in the appropriate JSON format.

Download the first release of Norch (0.2.1) here

See: https://github.com/fergiemcdowall/norch for various details and instructions.
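
To give a feel for the indexing side, here is a hedged sketch (Node 18+ for the global fetch). The endpoint path, port, and document shape below are my guesses, not Norch's documented API; the README above has the real format.

    // Hypothetical sketch: POST JSON documents to a Norch server for
    // indexing. The /indexer path, port 3030, and field names are
    // guesses, not the documented API -- check the README.
    const docs = [
      { id: '1', title: 'Annual report', body: 'Revenue grew strongly in 2012...' },
      { id: '2', title: 'Press release', body: 'A new office opened in Oslo...' }
    ];

    fetch('http://localhost:3030/indexer', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(docs)
    }).then(res => console.log('indexed:', res.status))
      .catch(console.error);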

Interesting, but I am curious: what advantages does Norch offer over Solr or Elasticsearch, for example?

May 31, 2013

Freeing Information From Its PDF Chains

Filed under: node-js,PDF — Patrick Durusau @ 3:58 pm

Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js by Gary Sieling.

From the post:

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided to do a proof of concept with new Javascript tools. This runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse PDFs. A full-text index is also built, the beginning of a larger ingestion process.
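
A rough sketch of that pipeline as it might look in current Node.js; the promise-based pdfjs-dist API and the lunr 2.x builder shown here postdate the post, and the file name and query are placeholders.

    // Sketch: extract text from a PDF with pdfjs-dist, then build a
    // lunr full-text index. Current APIs; the 2013-era ones differed.
    const fs = require('fs');
    const lunr = require('lunr');
    const pdfjs = require('pdfjs-dist/legacy/build/pdf.js'); // v2/v3-era CJS path

    async function extractText(path) {
      const data = new Uint8Array(fs.readFileSync(path));
      const doc = await pdfjs.getDocument({ data }).promise;
      let text = '';
      for (let i = 1; i <= doc.numPages; i++) {
        const page = await doc.getPage(i);
        const content = await page.getTextContent();
        text += content.items.map(item => item.str).join(' ') + '\n';
      }
      return text;
    }

    async function main() {
      const text = await extractText('sample.pdf');      // placeholder file
      const idx = lunr(function () {
        this.ref('file');
        this.field('text');
        this.add({ file: 'sample.pdf', text });
      });
      console.log(idx.search('contract'));               // placeholder query
    }

    main().catch(console.error);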

I like the phrase “[m]uch information is trapped inside PDFs….”

Despite window-dressing executive orders, information is going to stay trapped inside PDFs.

What information do you want to free from its PDF chains?

I first saw this at DZone.

April 15, 2013

Node.js integrates with M:…

Filed under: MUMPS,node-js — Patrick Durusau @ 1:08 pm

Node.js integrates with M: The NoSQL Hierarchical database by Luis Ibanez.

From the post:

We have talked recently about the significance of integrating the Node.js language with the powerful M database, particularly in the space of healthcare applications.

The efficiency of Node.js, combined with the high performance of M, provides an unparalleled fresh approach for building healthcare applications.

Healthcare needs help from Node.js, but other areas do as well!

I first saw this at: Node.js Integrates With M: The NoSQL Hierarchical Database by Alex Popescu.

December 20, 2012

Node.js, Neo4j, and usefulness of hacking ugly code [Normalization as Presentation?]

Filed under: Graphs,Neo4j,Networks,node-js — Patrick Durusau @ 8:16 pm

Node.js, Neo4j, and usefulness of hacking ugly code by Justin Mandzik.

From the post:

My primary application has a ton of data, even in its infancy. Hundreds of millions of distinct entities (and growing fast), each with many properties, and many relationships. Numbers in the billions start to be really easy to hit, and that's still not accounting for organic growth. Most of the data is hierarchical for now, but there's a need in the near term for arbitrary relationships and the quick traversing thereof. Vanilla MySQL in particular is annoying to work with when it comes to hierarchical data. Moving to Oracle gets us some nicer toys to play with (CONNECT_BY_ROOT and such), but ultimately the need for a complementary database solution emerges.

NOSQL bake-off

While my non-relational db experience is limited to MongoDB (which I love dearly), a graph db seemed to be the better theoretical fit. Requirements: Manage dense, interconnected data that has to be traversed fast, a query language that supports a root cause analysis use case, and some kind of H.A. plan of attack. Signals of Neo4j, OrientDB, and Titan started emerging from the noise. Randomly, I started in with Neo4j with the intent of repeating the test cases on the other contenders assuming any of the 3 met the requirements (in theory, at least). Neo4j has a GREAT “2 minutes to get up and running” experience. Untar, bin/neo4j start, and go to localhost:7474 and you’re off and running. A decent interface waits for you and you can dive right in.

Proof of concept code for testing Neo4j with project data.
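
If you want to poke at Neo4j from Node the way the post does, here is a minimal sketch against the legacy REST cypher endpoint that 1.8/2.x-era servers exposed (modern Neo4j uses the Bolt driver instead); the query, index name, and parameters are invented, and the global fetch assumes Node 18+.

    // Sketch: run a Cypher query over Neo4j's legacy REST endpoint
    // (1.8/2.x era; removed in later versions in favor of Bolt).
    // The query, index name, and parameters are invented examples.
    async function runCypher(query, params = {}) {
      const res = await fetch('http://localhost:7474/db/data/cypher', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ query, params })
      });
      return res.json();
    }

    runCypher(
      'START n=node:entities(name={name}) MATCH n-[:DEPENDS_ON*1..3]->m RETURN m',
      { name: 'server-42' }
    ).then(result => console.log(result.data))
     .catch(console.error);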

The presumption of normalization in Neo4j continues to nag at me.

The broader the reach for data, the less likely normalization is going to be possible, or affordable if possible in some theoretical sense.

It may be that normalization is a presentation aspect of results. I will have to think about that over the holidays.

September 15, 2012

Wrapping Up TimesOpen: Sockets and Streams

Filed under: Data Analysis,Data Streams,node-js,Stream Analytics — Patrick Durusau @ 10:41 am

Wrapping Up TimesOpen: Sockets and Streams by Joe Fiore.

From the post:

This past Wednesday night, more than 80 developers came to the Times building for the second TimesOpen event of 2012, “Sockets and Streams.”

If you were one of the 80 developers, good for you! The rest of us will have to wait for the videos.

Links to the slides are given, but a somewhat larger helping of explanation would be useful.

Data streams have semantic diversity, just like static data, only with less time to deal with it.

Ups the semantic integration bar.
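
As a toy illustration of what handling semantic diversity on the fly might look like, here is a small Node.js Transform stream that maps source-specific field names onto a single vocabulary as records pass through; the mapping table is invented.

    // Toy sketch: normalize field names from diverse sources as records
    // stream past, with no time to batch. The mapping table is invented.
    const { Transform } = require('stream');

    const FIELD_MAP = { zip: 'postal_code', postcode: 'postal_code' };

    const normalize = new Transform({
      objectMode: true,
      transform(record, _enc, done) {
        const out = {};
        for (const [key, value] of Object.entries(record)) {
          out[FIELD_MAP[key] || key] = value;
        }
        done(null, out);
      }
    });

    normalize.on('data', r => console.log(r));
    normalize.write({ name: 'Ada', zip: '98101' });
    normalize.write({ name: 'Alan', postcode: 'SW1A' });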

Are you ready?

August 31, 2012

Data mining local radio with Node.js

Filed under: Data Mining,node-js — Patrick Durusau @ 3:19 pm

Data mining local radio with Node.js by Evan Muehlhausen.

From the post:

More harpsichord?!

Seattle is lucky to have KINGFM, a local radio station dedicated to 100% classical music. As one of the few classical music fans in my twenties, I listen often enough. Over the past few years, I’ve noticed that when I tune to the station, I always seem to hear the plinky sound of a harpsichord.

Before I sent KINGFM an email admonishing them for playing so much of an instrument I dislike, I wanted to investigate whether my ears were deceiving me. Perhaps my own distaste for the harpsichord increased its impact in my memory.

This article outlines the details of this investigation and especially the process of collecting the data.
….

Another data collecting/mining post.
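
If you wanted to run a similar experiment, the collection step might look like the hypothetical sketch below (Node 18+ for global fetch); the now-playing URL and JSON fields are placeholders, not KINGFM's actual feed.

    // Hypothetical sketch: poll a now-playing JSON feed once a minute
    // and tally composers. The URL and field names are placeholders.
    const counts = new Map();

    async function poll() {
      const res = await fetch('https://example.com/now-playing.json');
      const { composer, title } = await res.json();
      counts.set(composer, (counts.get(composer) || 0) + 1);
      console.log(`${composer}: ${title}`);
    }

    setInterval(() => poll().catch(console.error), 60 * 1000);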

If you were collecting this data, how would you reliably share it with others?

In that regard, you might want to consider distinguishing members of the Bach family as a practice run.

I first saw this at DZone.

Parsing Wikipedia Articles with Node.js and jQuery

Filed under: JQuery,node-js,Wikipedia — Patrick Durusau @ 1:32 pm

Parsing Wikipedia Articles with Node.js and jQuery by Ben Coe.

From the post:

For some NLP research I’m currently doing, I was interested in parsing structured information from Wikipedia articles. I did not want to use a full-featured MediaWiki parser. WikiFetch crawls a Wikipedia article using Node.js and jQuery and returns a structured JSON representation of the page.

Harvesting of content (unless you are authoring all of it) is a major part of any topic map project.

Does this work for you?

Other small utilities or scripts you would recommend?
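
As one possibility, the same idea is easy to sketch with cheerio, which gives Node a jQuery-like API (Node 18+ for global fetch); the selectors below assume Wikipedia's familiar markup, and the output shape is mine, not WikiFetch's.

    // Sketch: fetch a Wikipedia article and pull out structured bits
    // with cheerio. Selectors assume Wikipedia's markup; the output
    // shape is invented, not WikiFetch's.
    const cheerio = require('cheerio');

    async function parseArticle(title) {
      const url = `https://en.wikipedia.org/wiki/${encodeURIComponent(title)}`;
      const $ = cheerio.load(await (await fetch(url)).text());
      return {
        title: $('#firstHeading').text().trim(),
        sections: $('h2 .mw-headline').map((i, el) => $(el).text()).get(),
        links: $('#bodyContent a[href^="/wiki/"]')
          .map((i, el) => $(el).attr('href')).get().slice(0, 20)
      };
    }

    parseArticle('Topic_map').then(doc => console.log(JSON.stringify(doc, null, 2)));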

I first saw this at DZone.

August 16, 2012

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js

Filed under: Hadoop,MongoDB,node-js,Pig — Patrick Durusau @ 6:53 pm

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js by Russell Jurney.

From the post:

Series Introduction

Apache Pig is a dataflow-oriented scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-node-mongo and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.

Introduction

In this post we’ll be using Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. We do so to illustrate Pig’s ability to act as glue between distributed systems, and to show how easy it is to publish data from Hadoop to the web.
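
The last hop of that pipeline, serving the Mongo-resident records over HTTP from Node, might look like the sketch below; the database, collection, and field names are my guesses at the Enron example, not the post's exact code.

    // Sketch of the Node.js end: serve documents that Pig wrote into
    // MongoDB. Database, collection, and field names are guesses,
    // not the exact ones from the post's Enron example.
    const express = require('express');
    const { MongoClient } = require('mongodb');

    const app = express();
    const client = new MongoClient('mongodb://localhost:27017');

    app.get('/email/:id', async (req, res) => {
      const emails = client.db('enron').collection('emails');
      const email = await emails.findOne({ message_id: req.params.id });
      res.json(email || { error: 'not found' });
    });

    client.connect().then(() => {
      app.listen(3000, () => console.log('listening on :3000'));
    });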

I was tempted to add ‘duct tape’ as a category. But there could only be one entry. 😉

Take an early weekend and have some fun with this tomorrow. August will be over sooner than you think.

March 8, 2012

Stash

Filed under: Cache Invalidation,node-js,Redis — Patrick Durusau @ 8:49 pm

Stash by Nate Kohari.

From the post:

Stash is a graph-based cache for Node.js powered by Redis.

Warning! Stash is just a mental exercise at this point. Feedback is very much appreciated, but using it in production may cause you to contract ebola or result in global thermonuclear war.

Overview

“There are only two hard things in computer science: cache invalidation and naming things.”
— Phil Karlton

One of the most difficult parts about caching is managing dependencies between cache entries. In order to reap the benefits of caching, you typically have to denormalize the data that’s stored in the cache. Since data from child items is then stored within parent items, it can be challenging to figure out what entries to invalidate in the cache in response to changes in data.
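
To make the dependency problem concrete, here is a tiny in-memory sketch of the idea (my illustration, not Stash's API): each cache entry records the entries it was denormalized from, and invalidating a child cascades to every parent.

    // Tiny in-memory sketch of dependency-aware invalidation (not
    // Stash's actual API). Parents record the children they were
    // built from; invalidating a child cascades upward. Ignores cycles.
    const entries = new Map();   // key -> cached value
    const parents = new Map();   // child key -> Set of parent keys

    function put(key, value, dependsOn = []) {
      entries.set(key, value);
      for (const dep of dependsOn) {
        if (!parents.has(dep)) parents.set(dep, new Set());
        parents.get(dep).add(key);
      }
    }

    function invalidate(key) {
      entries.delete(key);
      for (const parent of parents.get(key) || []) invalidate(parent);
    }

    put('user:1', { name: 'Ada' });
    put('post:7', { author: 'user:1', html: '...' }, ['user:1']);
    invalidate('user:1');                 // also evicts the denormalized post
    console.log(entries.has('post:7'));   // false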

As Nate says, a thought experiment but an interesting one.

From a topic map perspective, I don’t know that I would consider cache invalidation and naming things to be two distinct problems; rather, they are the same problem under different constraints.

If you don’t think “cache invalidation” is related to naming, what sort of problem is it when a person’s name changes upon marriage? Isn’t a stored record “cached”? It may not be a cache in the sense of an online service or a chip, but those are the special cases, aren’t they?

December 24, 2011

Natural

Filed under: Lexical Analyzer,Natural Language Processing,node-js — Patrick Durusau @ 4:44 pm

Natural

From the webpage:

“Natural” is a general natural language facility for nodejs. Tokenizing, stemming, classification, phonetics, tf-idf, WordNet, and some inflection are currently supported.

It’s still in the early stages, and I am very interested in bug reports, contributions and the like.

Note that many algorithms from Rob Ellis’s node-nltools are being merged in to this project and will be maintained here going forward.

At the moment most algorithms are English-specific but long-term some diversity is in order.

Aside from this README, the only current documentation is here on my blog.

Just in case you are looking for natural language processing capabilities with nodejs.
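
A few one-liners based on natural's documented API as I understand it in current versions (details may have drifted since 2011):

    // Small tour of natural's API (current versions; it may have
    // drifted since this 2011 post).
    const natural = require('natural');

    // Tokenizing and stemming
    const tokenizer = new natural.WordTokenizer();
    console.log(tokenizer.tokenize('Topic maps merge subjects.'));
    console.log(natural.PorterStemmer.stem('merging'));    // 'merg'

    // Naive Bayes classification
    const classifier = new natural.BayesClassifier();
    classifier.addDocument('cheap pills, buy now', 'spam');
    classifier.addDocument('meeting notes attached', 'ham');
    classifier.train();
    console.log(classifier.classify('buy cheap pills'));   // 'spam'

    // tf-idf
    const tfidf = new natural.TfIdf();
    tfidf.addDocument('node is a javascript runtime');
    tfidf.addDocument('natural adds nlp to node');
    tfidf.tfidfs('node', (i, measure) => console.log(i, measure));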

December 1, 2011

Node.js Tutorial

Filed under: Javascript,node-js — Patrick Durusau @ 7:41 pm

Node.js Tutorial by Alex Young.

A tutorial on building a notepad web app called NodePad using node.js.

I happened upon lesson #23, which made it hard to get my bearings, but Alex included the link you see above, which gives you an ordered listing of the lessons.

I haven’t gone back to the beginning, but judging from the comments, it would not be a bad thing to do!

September 27, 2011

The Node Beginner Book

Filed under: Javascript,node-js — Patrick Durusau @ 6:46 pm

The Node Beginner Book by Manuel Kiessling.

I ran across this the other day when I was posting Node.js at Scale and just forgot to post it.

Nice introduction to Node.js.

You may also be interested in:

How to Node

Node.js

There are lots of other resources but that should be enough to get you started.

BTW, would “lite” servers built with Node.js answer the question of who gets what data that we saw in Linked Data Semantic Issues (same for topic maps?)? Some people might get very little “extra” information, if any at all. Others could get quite a bit extra. There would be no need to build a monolithic server with every capability.

September 26, 2011

Node.js at Scale

Filed under: Javascript,node-js — Patrick Durusau @ 7:01 pm

Node.js at Scale by Tom Hughes-Croucher (Joyent).

ABSTRACT

When we talk about performance, what do we mean? There are many metrics that matter in different scenarios, but it’s difficult to measure them all. Tom Hughes-Croucher looks at what performance is achievable with Node today, which metrics matter, and how to pick the ones that matter most to you. Most importantly, he looks at why metrics don’t matter as much as you think and at the critical decision-making involved in picking a programming language, a framework, or even just the way you write code.

BIOGRAPHY

Tom Hughes-Croucher is the Chief Evangelist at Joyent, sponsors of the Node.js project. Tom mostly spends his days helping companies build really exciting projects with Node and seeing just how far it will scale. Tom is also the author of the O’Reilly book “Up and running with Node.js”. Tom has worked for many well-known organizations, including Yahoo, NASA and Tesco.

I thought the discussion of metrics was going to be the best part. It is worth your time, but I stayed around for the Node.js demonstration, and it was impressive!

December 6, 2010

GT.M High end TP database engine

Filed under: Data Structures,GT.M,node-js,NoSQL,Software — Patrick Durusau @ 4:55 am

GT.M High end TP database engine (Sourceforge)

Description from the commercial version:

The GT.M data model is a hierarchical associative memory (i.e., multi-dimensional array) that imposes no restrictions on the data types of the indexes and the content – the application logic can impose any schema, dictionary or data organization suited to its problem domain. GT.M’s compiler for the standard M (also known as MUMPS) scripting language implements full support for ACID (Atomic, Consistent, Isolated, Durable) transactions, using optimistic concurrency control and software transactional memory (STM) that resolves the common mismatch between databases and programming languages. Its unique ability to create and deploy logical multi-site configurations of applications provides unrivaled continuity of business in the face of not just unplanned events, but also planned events, including planned events that include changes to application logic and schema.

There are clients for node-js:

http://github.com/robtweed/node-mdbm
http://github.com/robtweed/node-mwire
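
To give a flavor of that data model from JavaScript, here is an illustration only; it is not the node-mdbm or node-mwire API (see their READMEs), just a sketch of hierarchical globals with mixed-type subscripts.

    // Illustration only, NOT the node-mdbm/node-mwire API: M globals
    // behave like a multi-dimensional array with mixed-type subscripts
    // and no imposed schema.
    const globals = new Map();
    const key = subs => JSON.stringify(subs);

    const set = (subs, value) => globals.set(key(subs), value);
    const get = subs => globals.get(key(subs));

    // ^patient(123,"name")="Smith, John" in M terms:
    set(['patient', 123, 'name'], 'Smith, John');
    set(['patient', 123, 'visit', '2010-12-06', 'bp'], '120/80');

    console.log(get(['patient', 123, 'name']));   // 'Smith, John'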

Local topic map software is interesting and useful, but for scaling to the enterprise level, something different is going to be required.

Reports of implementing the TMDM or other topic map legends with a GT.M based system are welcome.
