Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 25, 2012

Using Oracle Full-Text Search in Entity Framework

Filed under: Full-Text Search,Oracle,Searching,Text Mining — Patrick Durusau @ 4:05 pm

Using Oracle Full-Text Search in Entity Framework

From the post:

Oracle database supports an advanced functionality of full-text search (FTS) called Oracle Text, which is described comprehensively in the documentation:

We decided to meet the needs of our users willing to take advantage of the full-text search in Entity Framework and implemented the basic Oracle Text functionality in our Devart dotConnect for Oracle ADO.NET Entity Framework provider.

Just in case you run across a client using Oracle to store text data. 😉

I first saw this at Beyond Search. (As Stephen implies, it is not a resource for casual data miners.)

Using Luke the Lucene Index Browser to develop Search Queries

Filed under: Lucene,Luke — Patrick Durusau @ 3:27 pm

Using Luke the Lucene Index Browser to develop Search Queries

From the post:

Luke is a GUI tool written in Java that allows you to browse the contents of a Lucene index, examine individual documents, and run queries over the index. Whether you’re developing with PyLucene, Lucene.NET, or Lucene Core, Luke is your friend.

Which also covers:


Downloading, running Luke ….

Exploring Document Indexing ….

Exploring Search ….

Using the Lucene XML Query Parser ….

Nothing surprising, but a well-written introduction to Luke.

Thinking about the HDFS vs. Other Storage Technologies

Filed under: Hadoop,HDFS,Hortonworks — Patrick Durusau @ 3:11 pm

Thinking about the HDFS vs. Other Storage Technologies by Eric Baldeschwieler.

Just to whet your interest (see Eric’s post for the details):

As Apache Hadoop has risen in visibility and ubiquity we’ve seen a lot of other technologies and vendors put forth as replacements for some or all of the Hadoop stack. Recently, GigaOM listed eight technologies that can be used to replace HDFS (Hadoop Distributed File System) in some use cases. HDFS is not without flaws, but I predict a rosy future for HDFS. Here is why…

To compare HDFS to other technologies one must first ask the question, what is HDFS good at:

  • Extreme low cost per byte….
  • Very high bandwidth to support MapReduce workloads….
  • Rock solid data reliability….

A lively storage competition is a good thing.

A good opportunity to experiment with different storage strategies.

Searching the WWW for found things, a bad solution

Filed under: Algorithms,Graphs,Networks — Patrick Durusau @ 1:53 pm

Searching for something you have found once before is a very bad solution.

It is like catching a fish, throwing it back into the ocean, and then attempting to find that same fish again.

Here is a real example that happened to me this week. I had mentioned this research in a post but did not include a link to it, and I didn’t remember the names of the researchers or their location.

From the news bulletin:

Whether sudoku, a map of Germany or solid bodies – in all of these cases, it’s all about counting possibilities. In the sudoku, it is the permitted solutions; in the solid body, it is the possible arrangements of atoms. In the map, the question is how many ways the map can be colored so that adjacent countries are always shown in a different color. Scientists depict these counting problems as a network of lines and nodes. Consequently, they need to answer just one question: How many different ways are there to color in the nodes with a certain number of colors? The only condition: nodes joined by a line may not have the same color. Depending on the application, the color of a node is given a completely new significance. In the case of the map, “color” actually means color; with sudoku the “colors” represent different figures.

“The existing algorithm copies the whole network for each stage of the calculation and only changes one aspect of it each time,” explains Frank van Bussel of the Max Planck Institute for Dynamics and Self-Organization (MPIDS). Increasing the number of nodes dramatically increases the calculation time. For a square lattice the size of a chess board, this is estimated to be many billions of years. The new algorithm developed by the Göttingen-based scientists is significantly faster. “Our calculation for the chess board lattice only takes seven seconds,” explains Denny Fliegner from MPIDS.

Without naming the search engine, would you believe that:

network +color +node

results in 24 “hits,” none of which are the research in question.

Remembering some of the terms in the actual scholarly article, I searched using:

"chromatic polynomials" +network

Which resulted in three (3) scholarly articles and one “hit,” none of which were the research in question.

As you may suspect, variations on these searches resulted in similar “non-helpful” results.

I had not imagined the research in question but searching was unable to recover the reference.

Well, searching with a search engine was unable to recover the reference.

Knowing that I had bookmarked the site, I had to scan a large bookmark file for the likely entry.

I found it, and so that I don’t have to repeat this non-productive behavior, what follows are the citations and some text from each to help with finding it next time.

The general news article:

A new kind of counting

How many different sudokus are there? How many different ways are there to color in the countries on a map? And how do atoms behave in a solid? Researchers at the Max Planck Institute for Dynamics and Self-Organization in Göttingen and at Cornell University (Ithaca, USA) have now developed a new method that quickly provides an answer to these questions. In principle, there has always been a way to solve them. However, computers were unable to find the solution as the calculations took too long. With the new method, the scientists look at separate sections of the problem and work through them one at a time. Up to now, each stage of the calculation has involved the whole map or the whole sudoku. The answers to many problems in physics, mathematics and computer science can be provided in this way for the first time. (New Journal of Physics, February 4, 2009)

The New Journal of Physics article:

Counting complex disordered states by efficient pattern matching: chromatic polynomials and Potts partition functions by Marc Timme, Frank van Bussel, Denny Fliegner, and Sebastian Stolzenberg.

Abstract:

Counting problems, determining the number of possible states of a large system under certain constraints, play an important role in many areas of science. They naturally arise for complex disordered systems in physics and chemistry, in mathematical graph theory, and in computer science. Counting problems, however, are among the hardest problems to access computationally. Here, we suggest a novel method to access a benchmark counting problem, finding chromatic polynomials of graphs. We develop a vertex-oriented symbolic pattern matching algorithm that exploits the equivalence between the chromatic polynomial and the zero-temperature partition function of the Potts antiferromagnet on the same graph. Implementing this bottom-up algorithm using appropriate computer algebra, the new method outperforms standard top-down methods by several orders of magnitude, already for moderately sized graphs. As a first application, we compute chromatic polynomials of samples of the simple cubic lattice, for the first time computationally accessing three-dimensional lattices of physical relevance. The method offers straightforward generalizations to several other counting problems.

GENERAL SCIENTIFIC SUMMARY

Introduction and background. The number of accessible states of a complex physical system fundamentally impacts its static and dynamic properties. For instance, antiferromagnets often exhibit an exponential number of energetically equivalent ground states and thus positive entropy at zero temperature – an exception to the third law of thermodynamics. However, counting the number of ground states, such as for the Potts model antiferromagnet, is computationally very hard (so-called sharp-P hard), i.e. the computation time generally increases exponentially with the size of the system. Standard computational counting methods that use theorems of graph theory are therefore mostly restricted to very simple or very small lattices.

Main results. Here we present a novel general-purpose method for counting. It relies on a symbolic algorithm that is based on the original physical representation of a Potts partition function and is implemented in the computer algebra language FORM that was successfully used before in precision high-energy physics.

Wider implications. The bottom-up nature of the algorithm, together with the purely symbolic implementation make the new method many orders of magnitude faster than standard methods. It now enables exact solutions of various systems that have been thus far computationally inaccessible, including lattices in three dimensions. Through the relation of the Potts partition functions to universal functions in graph theory, this new method may also help to access related counting problems in communication theory, graph theory and computer science.

The language used was FORM.
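If you want a feel for why these counting problems blow up, here is a minimal Python sketch of the classical deletion-contraction recursion for counting proper colorings. To be clear, this is my own toy illustration, not the FORM-based Potts algorithm the authors describe; their whole point is that this kind of brute-force recursion is hopeless for lattices of physical size.

def chromatic_count(vertices, edges, k):
    # Number of proper k-colorings of a simple graph, computed with the
    # classical deletion-contraction recursion: P(G) = P(G - e) - P(G / e).
    def rec(vs, es):
        if not es:
            return k ** len(vs)          # no edges: every assignment is proper
        e = next(iter(es))
        u, v = tuple(e)
        deleted = rec(vs, es - {e})      # delete the edge e
        contracted_es = set()            # contract e: fold v into u
        for f in es - {e}:
            h = frozenset(u if x == v else x for x in f)
            if len(h) == 2:
                contracted_es.add(h)
        contracted = rec(vs - {v}, frozenset(contracted_es))
        return deleted - contracted

    return rec(frozenset(vertices), frozenset(frozenset(e) for e in edges))

# A 4-cycle has (k - 1)**4 + (k - 1) proper k-colorings; with k = 3 that is 18.
print(chromatic_count({1, 2, 3, 4}, [(1, 2), (2, 3), (3, 4), (4, 1)], 3))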

This search reminded me that maps are composed of found and identified things.

Which explains the difference between searching the WWW and consulting a map of found things.

July 24, 2012

SPARQL 1.1 Query Language [Last Call – 21 August 2012]

Filed under: RDF,SPARQL — Patrick Durusau @ 7:25 pm

SPARQL 1.1 Query Language

From the W3C News page:

The SPARQL Working Group has published a Last Call Working Draft of SPARQL 1.1 Query Language. RDF is a directed, labeled graph data format for representing information in the Web. This specification defines the syntax and semantics of the SPARQL query language for RDF. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports aggregation, subqueries, negation, creating values by expressions, extensible value testing, and constraining queries by source RDF graph. The results of SPARQL queries can be result sets or RDF graphs. Comments are welcome through 21 August.
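If you want to kick the tires on the new features before the Last Call comment period closes, aggregation is an easy place to start. A minimal sketch using Python’s rdflib, assuming you have a version with SPARQL 1.1 support installed:

from rdflib import Graph

# Toy data in Turtle: who works on which project.
g = Graph()
g.parse(data="""
    @prefix ex: <http://example.org/> .
    ex:alice ex:worksOn ex:projectA , ex:projectB .
    ex:bob   ex:worksOn ex:projectA .
""", format="turtle")

# SPARQL 1.1 aggregation: count projects per person.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?person (COUNT(?project) AS ?projects)
    WHERE { ?person ex:worksOn ?project }
    GROUP BY ?person
""")

for person, projects in results:
    print(person, projects)

Subqueries and negation can be exercised the same way against the same toy graph.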

Datomic Free Edition

Filed under: Datomic — Patrick Durusau @ 7:17 pm

Datomic Free Edition

From the post:

We’re happy to announce today the release of Datomic Free Edition. This edition is oriented around making Datomic easier to get, and use, for open source and smaller production deployments.

  • Datomic Free Edition is … free!
  • The system supports transactor-local storage
  • The peer library includes a memory database and Datomic Datalog
  • The Free transactor and peers are freely redistributable
  • The transactor supports 2 simultaneous peers

Of particular note here is that Datomic Free Edition comes with a redistributable license, and does not require a personal/business-specific license from us. That means you can download Datomic Free, build e.g. an open source application with it, and ship/include Datomic Free binaries with your software. You can also put the Datomic Free bits into public repositories and package managers (as long as you retain the licenses and copyright notices).

There is a ton of capability included in the Free Edition, including the Datomic in-process memory database (great for testing), and the Datomic datalog engine, which works on both Datomic databases and in-memory collections. That’s right, free datalog for everyone.

You can use Datomic Free Edition in production, and you can use it in commercial applications.

Get Datomic!

I first saw this at Alex Popescu’s myNoSQL.

How To Choose ‘Advanced’ Data Visualization Tools

Filed under: Graphics,R,Visualization — Patrick Durusau @ 7:05 pm

How To Choose ‘Advanced’ Data Visualization Tools by Doug Henschen.

From the post:

How do you separate the “advanced” visualization products from the also rans? In a new report, Forrester analysts Boris Evelson and Noel Yuhanna identify six traits that separate advanced data visualization from static graphs: dynamic data, visual querying, linked multi-dimensional visualization, animation, personalization, and actionable alerts. Dynamic data is the ability to update visualizations as data changes in sources such as databases. With visual querying you can change the query by selecting or clicking on a portion of the graph or chart (to drill down, for example). With multi-dimensional linking, selections made in one chart are reflected as you navigate into other charts. With personalization you can give power users an in-depth view and newbies a simpler view, and you can also control access to data based on user- and role-based access privileges. Visualizations can illuminate important trends and conditions, but what if you don’t see the visualization? Alerting is there as a safeguard, so you can set thresholds and parameters that trigger messages whether you’re interacting with reports or not.

Forrester’s report, “The Forrester Wave: Advanced Data Visualization Platforms, Q3 2012,” is available online from the SAS Web site. (The report was not sponsored upfront by any vendor, but SAS fared well in the research and purchased download rights for the report, as it often does with Gartner Magic Quadrant reports.) So that’s what sets advanced products apart, but how do you pick the product that’s right for your organization. Forrester’s Wave report puts IBM, Information Builders, SAP, SAS, Tableau, Tibco, and Oracle in the advanced data visualization “leaders” wave. That’s a pretty long list if you ask me, but the report includes a scorecard with individual 0 (weak) to 5 (strong) grades detailing more than 16 product attributes. Tableau, IBM, and SAP score highest on “geospatial integration,” for example, whereas SAS, Tableau, and Tibco Spotfire score highest on visualization “animation,” a technique used, for example, to show changes over time, in relationship to pricing changes, or other variables. Vendors in the “strong performers” wave include Microsoft, MicroStrategy, Actuate, QlikTech, SpagoBI, and Panorama Software.

I like Forrester Wave reports because the scoring and the weighting of the scores is spelled out in detail, so you can tweak the scoring formula to your own liking. For example, Forrester weighted 50% of its overall score of its assessment of current products and 50% on “Strategy.” Within strategy, 40% of the score was based on “commitment” and 45% was based on “product direction” whereas only 10% was based on “pricing and licensing” and 5% on “transparency.” Personally, I would make the strategy scores account for about 40% of the overall score, and I would raise the weighting of “pricing and licensing,” as I’m guessing customers will care much more about that than “commitment,” whatever that means.

What’s missing from this evaluation of data visualization tools, advanced or not?

As far as I can tell, Forrester never considers the skill of users with the tools or, more importantly, their insight into the data.

If you have ever seen a good presentation using PowerPoint and then remembered all the other “death by PowerPoint” presentations you have suffered through in your career, you know what I am talking about.

A tool is no better or worse than the user attempting to use it.

A strong tool will not compensate for weak users. Buy solutions accordingly.
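If you do download the Wave report, the re-weighting exercise Henschen describes takes about a dozen lines. A toy sketch, with made-up scores and weights rather than Forrester’s actual numbers:

# Hypothetical 0-5 attribute scores for one vendor; not Forrester's data.
scores = {"current_offering": 4.2, "strategy": 3.5, "market_presence": 4.0}

# Report-style weights versus your own priorities; edit to taste.
report_weights = {"current_offering": 0.50, "strategy": 0.50, "market_presence": 0.00}
my_weights     = {"current_offering": 0.55, "strategy": 0.40, "market_presence": 0.05}

def weighted_score(scores, weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(scores[k] * weights[k] for k in weights)

print("Report weighting:", weighted_score(scores, report_weights))
print("My weighting:    ", weighted_score(scores, my_weights))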

Cambridge Advanced Modeller (CAM)

Filed under: Cambridge Advanced Modeler (CAM),Modeling — Patrick Durusau @ 6:46 pm

Cambridge Advanced Modeller (CAM)

From the webpage:

Cambridge Advanced Modeller is a software tool for modelling and analysing the dependencies and flows in complex systems – such as products, processes and organisations. It provides a diagrammer, a simulation tool, and a DSM tool.

CAM is free for research, teaching and evaluation. We only require that you cite our work if you use CAM in support of published work. Commercial evaluation is allowed. Commercial use is subject to non-onerous conditions.

Toolboxes provide several modelling notations and analysis methods. CAM can be configured to develop new modelling notations by specifying the types of element and connection allowed. A modular architecture allows new functionality, such as simulation codes, to be added.

One of the research toolboxes is topic maps! Cool!

Have you used CAM?

ActionGenerator, Part Two

Filed under: ActionGenerator,ElasticSearch,Solr — Patrick Durusau @ 4:13 pm

Rafał Kuć returns in ActionGenerator, Part Two to cover action generators for ElasticSearch and Solr.

Just in case you are interested. 😉

Both include indexing and query action generators, just in case you want to stress your deployment before a big opening day. (Zero day crashes don’t encourage your user/customer base.)

Future plans include action generators for SenseiDB.

On the Pattern of Primes

Filed under: Graphics,Visualization — Patrick Durusau @ 3:32 pm

On the Pattern of Primes by Jason Davies.

From the webpage:

For each natural number n, we draw a periodic curve starting from the origin, intersecting the x-axis at n and its multiples. The prime numbers are those that have been intersected by only two curves: the prime number itself and one.

Below the currently highlighted number, we also show its sum of divisors σ(n), and its aliquot sum s(n) = σ(n) – n, which indicate whether the number is prime, deficient, perfect or abundant.

A very compelling pattern visualization.

What other number or semantic patterns would you visualize?
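If you want to poke at the number theory behind the visualization, the quantities Davies displays are easy to compute. A quick Python sketch (my own illustration, not his code):

def sigma(n):
    # Sum of all divisors of n, including n itself.
    return sum(d for d in range(1, n + 1) if n % d == 0)

def classify(n):
    # Classify n by its aliquot sum s(n) = sigma(n) - n.
    s = sigma(n) - n
    if n > 1 and s == 1:
        return "prime"
    if s < n:
        return "deficient"
    if s == n:
        return "perfect"
    return "abundant"

for n in (7, 8, 12, 28):
    print(n, sigma(n), sigma(n) - n, classify(n))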

I first saw this at Information Aesthetics.

Oracle closes Fortress language down for good

Filed under: Fortress Language,Parallel Programming — Patrick Durusau @ 3:14 pm

Oracle closes Fortress language down for good by Chris Mayer.

From the post:

Oracle is to cease all production on the long-running Fortress language project, seeking to cast aside any language that isn’t cutting the mustard financially.

Guy Steele, creator of Fortress and also involved in Java’s development under Sun jurisdiction, wrote on his blog: “After working nearly a decade on the design, development, and implementation of the Fortress programming language, the Oracle Labs Programming Language Research Group is now winding down the Fortress project.”

He added: “Ten years is a remarkably long run for an industrial research project (one to three years is much more typical), but we feel that our extended effort has been worthwhile.”

Guy’s post has commentary on points of pride from the Fortress project:

  • Generators and reducers
  • Implicit parallelism supported by work-stealing
  • Nested atomic blocks supported by transactional memory
  • Parametrically polymorphic types that are not erased
  • Symmetric multimethod dispatch and parametrically polymorphic methods
  • Multiple inheritance, inheritance symmetry, and type exclusion
  • Mathematical syntax
  • Components and APIs
  • Dimensions and units
  • Explicit descriptions of data distribution and processor assignment
  • Conditional inheritance and conditional method definition

Respectable output for a project? Yes?

To avoid saying something in anger, I did research Oracle’s Support for Open Source and Open Standards:

  • Berkeley DB
  • Eclipse
  • GlassFish
  • Hudson
  • InnoDB
  • Java
  • Java Platform, Micro Edition (Java ME)
  • Linux
  • MySQL
  • NetBeans
  • OpenJDK
  • PHP
  • VirtualBox
  • Xen
  • Free and Open Source Software

Hard for me to say which one of those projects I would trade for Fortress or even ODF/OpenOffice.

But that was Oracle’s call, not mine.

On the other hand, former Oracle support doesn’t bar anyone else from stepping up. So maybe it is your call now?

Parallel processors are here, now, in abundance. Can’t say the same for programming paradigms to take full advantage of them.

Topic maps may help you avoid re-inventing Fortress concepts and mechanisms, if you learn from the past instead of repeating it.

July 23, 2012

Wrinkling Time

Filed under: Modeling,Time,Timelines,Topic Maps — Patrick Durusau @ 6:25 pm

The post by Dan Brickley that I mentioned earlier today, Dilbert schematics, made me start thinking about more complex time scenarios than serial assignment of cubicles.

Like Hermione Granger and Harry Potter’s adventure in the Prisoner of Azkaban.

For those of you who are vague on the story, Hermione uses a “Time-Turner” to go back in time several hours. As a result, she and Harry must avoid being seen by themselves (and others). Works quite well in the story but what if I wanted to model that narrative in a topic map?

Some issues/questions that occurred to me:

Harry and Hermione are the same subjects they were during the prior time interval. Or are they?

Does a linear notion of time mean they are different subjects?

How would I model their interactions with others? Such as Buckbeak? Who interacted with both versions (for lack of a better term) of Harry?

Is there a time line running parallel to the “original” time line?

Just curious, what happens if the Time-Turner fails and Harry and Hermione don’t return to the present, ever? That is, their “current” present is forever 3 hours behind their “real” present.

What other time issues, either in literature or elsewhere seem difficult to model to you?

XLConnect 0.2-0

Filed under: Data Mining,Excel,R — Patrick Durusau @ 5:59 pm

XLConnect 0.2-0

From the post:

Mirai Solutions GmbH (http://www.mirai-solutions.com) is very pleased to announce the release of XLConnect 0.2-0, which can be found at CRAN.

As one of the updates, XLConnect has moved to the newest release of Apache POI: 3.8. Also, the lazy evaluation issues with S4 generics are now fixed: generic methods now fully expand the argument list in order to have the arguments immediately evaluated.

Furthermore, we have added an XLConnect.R script file to the top level library directory, which contains all code examples presented in the vignette, so that it’s easier to reuse the code.

From an earlier description of XLConnect:

XLConnect is a comprehensive and platform-independent R package for manipulating Microsoft Excel files from within R. XLConnect differs from other related R packages in that it is completely cross-platform and as such runs under Windows, Unix/Linux and Mac (32- and 64-bit). Moreover, it does not require any installation of Microsoft Excel or any other special drivers to be able to read & write Excel files. The only requirement is a recent version of a Java Runtime Environment (JRE). Also, XLConnect can deal with the old *.xls (BIFF) and the new *.xlsx (Office Open XML) file formats. Under the hood, XLConnect uses Apache POI (http://poi.apache.org) – a Java API to manipulate Microsoft Office documents. (From XLConnect – A platform-independent interface to Excel.)

If you work with data in a business environment, you are going to encounter Excel files. (Assuming you are not in a barter economy counting animal skins and dried fish.)

And customers are going to want you to return Excel files to them. (Yes, yes, topic maps would be a much better delivery format. But if Excel files mean you get paid and topic map files mean you don’t, which one would you do? That’s what I thought.)

A package to consider if you need to manipulate Excel files from within R.

Statistics Thingy?

Filed under: Humor — Patrick Durusau @ 3:48 pm

From Simply Statistics, a link titled “We used, you know, that statistics thingy,” which in the original read “We really don’t care what statistical method you used,” both of which pointed to an abstract in BMC Systems Biology 2011, 5(Suppl 3):S4 that contains:

(insert statistical method here)

It happens, even with proofreading by authors, copy editors, and publishers.

But proofreading greatly reduces the error rate.

MongoDB-as-a-service for private rolled out by ScaleGrid, in MongoDirector

Filed under: MongoDB,NoSQL — Patrick Durusau @ 3:16 pm

MongoDB-as-a-service for private rolled out by ScaleGrid, in MongoDirector by Chris Mayer.

From the post:

Of all the NoSQL databases emerging at the moment, there appears to be one constant discussion taking place – are you using MongoDB?

It appears to be the open source, document-oriented NoSQL database solution of choice, mainly due to its high performance nature, its dynamism and its similarities to the JSON data structure (in BSON). Despite being written in C++, it is attracting attention from developers of different creeds. Its enterprise level features have helped a fair bit in its charge up the rankings to leading NoSQL database, with it being the ideal datastore for highly scalable environments. Just a look at the latest in-demand skills on Indeed.com shows you that 10gen’s flagship product has infiltrated the enterprise well and truly.

Quite often, an enterprise can find the switch from SQL to NoSQL daunting and needs a helping hand. Due to this, many MongoDB-related products are arriving just as quickly as MongoDB converts. The latest of which to launch as a public beta is MongoDirector from Seattle start-up ScaleGrid. MongoDirector offers an end-to-end lifecycle manager for MongoDB to guide newcomers along.

I don’t have anything negative to say about MongoDB but I’m not sure the discussion of NoSQL solutions is quite as one-sided as Chris seems to think.

The Indeed.com site is a fun one to play around with but I would not take the numbers all that seriously. For one thing, it doesn’t appear to control for duplicate job ads posted in different sources. But that’s a nitpicking objection.

A more serious objection arises when you start to explore the site and discover the top three job titles for IT.

Care to guess what they are? Would you believe they don’t have anything to do with databases or MongoDB?

At least as of today, and I am sure it changes over time, Graphic Designer, Technical Writer, and Project Manager all rank higher than Data Analyst, which is where you would hope to find some MongoDB jobs. (Information Technology Industry – 23 July 2012)

BTW, for your amusement, when I was looking for information on database employment, I encountered Database Administrators, from the Bureau of Labor Statistics in the United States. The data is available for download as XLS files.

The site says blanks on the maps are from lack of data. I suspect the truth is there are no database administrators in Wyoming. 😉 Or at least I could point to the graphic as some evidence for my claim.

I think you need to consider the range of database options, from very traditional SQL vendors to bleeding edge No/New/Maybe/SQL solutions, including MongoDB. The question is which one meets your requirements, whether it is the flavor of the month or not.

Dilbert schematics

Filed under: Associations,RDF,Scope,Topic Maps — Patrick Durusau @ 2:17 pm

Dilbert schematics

In November of 2011, Dan Brickley wrote:

How can we package, manage, mix and merge graph datasets that come from different contexts, without getting our data into a terrible mess?

During the last W3C RDF Working Group meeting, we were discussing approaches to packaging up ‘graphs’ of data into useful chunks that can be organized and combined. A related question, one always lurking in the background, was also discussed: how do we deal with data that goes out of date? Sometimes it is better to talk about events rather than changeable characteristics of something. So you might know my date of birth, and that is useful forever; with a bit of math and knowledge of today’s date, you can figure out my current age, whenever needed. So ‘date of birth’ on this measure has an attractive characteristic that isn’t shared by ‘age in years’.

At any point in time, I have at most one ‘age in years’ property; however, you can take two descriptions of me that were at some time true, and merge them to form a messy, self-contradictory description. With this in mind, how far should we be advocating that people model using time-invariant idioms, versus working on better packaging for our data so it is clearer when it was supposed to be true, or which parts might be more volatile?

Interesting to read as an issue for RDF modeling.

Not difficult to solve using scopes on associations in a topic map.
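To make the scope point concrete, here is a toy Python sketch. It is nothing like a conformant topic map engine, just the bare idea that a time-varying assertion such as “age in years” can carry the period in which it was true, so merging two descriptions never produces a self-contradiction. All values below are invented for illustration.

from collections import defaultdict

# Each assertion is (subject, property, value) plus the scope in which it held.
# All values are invented for illustration; nothing here is TMDM-conformant.
assertions = [
    ("danbri", "age_in_years", 39, "2011"),
    ("danbri", "age_in_years", 40, "2012"),
    ("danbri", "date_of_birth", "1972-01-01", "unconstrained"),  # made-up date
]

merged = defaultdict(dict)
for subject, prop, value, scope in assertions:
    merged[(subject, prop)][scope] = value

# Merging the two descriptions is no longer self-contradictory:
# the conflicting "age in years" values live in different scopes.
for (subject, prop), by_scope in merged.items():
    print(subject, prop, by_scope)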

Question: What difficulties do time-invariant idioms introduce for modeling? What difficulties do non-time-invariant idioms introduce for processing?*

These are different concerns, and it isn’t enough to have an answer to a modeling issue without understanding the implications of that answer.

*Hint: As I read the post, it assumes a shared, “objective” notion of time. Perhaps that works for the cartoon world, but what about elsewhere?

Aurora – Illegal Weapons [Big Data to Small Data]

Filed under: BigData,Marketing,Security — Patrick Durusau @ 1:53 pm

Tuan C. Nguyen writes in Inside the secret online marketplace for illegal weapons that:

With just a few clicks, anyone with an internet connection can obtain some of the deadliest weapons known to man, an investigation by tech blog Gizmodo has revealed.

These include AK-47s, Bushmaster military rifles and even grenades — all of which can be sold, bought, sent and delivered on the Armory, a hidden website that functions as an online black market for illegal firearms. It’s there that Gizmodo writer Sam Biddle, who went undercover as an anonymous buyer, discovered a transaction process that uses an elaborate scheme that involves identity-concealing data encryption, an alternative electronic currency and a delivery method that allows both buyers and sellers to bypass the authorities without raising even the hint of suspicion.

Concerns over the ease of obtaining guns and other lethal weapons has gripped the nation in the aftermath of one of the deadliest massacre’s in recent memory when a heavily-armed lone gunman killed 12 people and injured 58 during a midnight movie screening just outside Denver. Shortly after, a paper trail revealed that the suspect built his arsenal through purchases made via a host of unregulated web sites, the Associated press reports. The existence of such portals is alarming in that not only can they arm a single deranged individual with enough ballistics to carry out a massacre, but also supply a group of terrorist rebels with enough artillery to lay siege to embassies and government offices, according to the report.

The post goes on to make much of the use of TOR (The Onion Router), which was developed by the U.S. Navy.

The TOR site relates in its overview:

Using Tor protects you against a common form of Internet surveillance known as “traffic analysis.” Traffic analysis can be used to infer who is talking to whom over a public network. Knowing the source and destination of your Internet traffic allows others to track your behavior and interests. This can impact your checkbook if, for example, an e-commerce site uses price discrimination based on your country or institution of origin. It can even threaten your job and physical safety by revealing who and where you are. For example, if you’re travelling abroad and you connect to your employer’s computers to check or send mail, you can inadvertently reveal your national origin and professional affiliation to anyone observing the network, even if the connection is encrypted.

I recommend that you take a look at the TOR site and its documentation. Quite a clever piece of work.

Tuan sees this in part as a “big data” problem. Sure, given all the network traffic being exchanged at any one time, TOR can easily defeat any “traffic analysis” process. (Or at least let’s take that as a given for purposes of this discussion. Users are assuming there are no “backdoors” built into the encryption, but that’s another story.)

What if we look at this as a “big data” problem being reduced to a “small data” problem?

Assume local law enforcement has access to the local Internet “connection.” (It is more complicated than this but I am trying to illustrate something, not write a manual for it.)

My first step is to filter encrypted traffic from non-encrypted traffic, passing my current location. Since locations are fed by routers, I can just walk the chain of routers, filtering non-encrypted traffic as I go. I don’t have to worry about the content or even tracking the IP addresses of the sender. Eventually I have tracked the senders of encrypted messages down to the nearest router to the origin of the traffic.

My second step is to start using a topic map to combine other information known to the local police about an area and its residents. A person or group ordering heavy weapons, explosives, etc., is going to have other “tells” besides encrypted Internet traffic.

A topic map can help combine all those “tells” into a map of probable locations and actors, using a variety of information sources, TOR or other technologies not withstanding.

Rather than a “big data,” you now have a “small data” problem and one that can be addressed by the local police.
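The combination step itself is not exotic. A toy sketch, with made-up sources and field names, of what merging “tells” about a common subject looks like once the problem is small:

from collections import defaultdict

# Made-up observations from different sources, each keyed however that
# source keys them; "location" is the shared subject we merge on.
observations = [
    {"source": "network",        "location": "location A", "tell": "persistent encrypted traffic"},
    {"source": "public_records", "location": "location A", "tell": "other corroborating tell"},
    {"source": "public_records", "location": "location B", "tell": "isolated tell"},
]

# The topic map step, reduced to a dictionary: merge tells per subject.
by_location = defaultdict(list)
for obs in observations:
    by_location[obs["location"]].append((obs["source"], obs["tell"]))

# Subjects with tells from more than one source float to the top.
for location, tells in sorted(by_location.items(), key=lambda kv: -len(kv[1])):
    print(location, tells)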

Everything Still Looks Like A Graph (but graphs look like maps)

Filed under: Graphs,Interface Research/Design,Visualization — Patrick Durusau @ 9:28 am

Everything Still Looks Like A Graph (but graphs look like maps) by Dan Brickley.

From the post:

Last October I posted a writeup of some experiments that illustrate item-to-item similarities from Apache Mahout using Gephi for visualization. This was under a heading that quotes Ben Fry, “Everything looks like a graph” (but almost nothing should ever be drawn as one). There was also some followup discussion on the Gephi project blog

The entry quoting Ben Fry is entitled Linked Literature, Linked TV – Everything Looks like a Graph and is a great read, both for the experiments he reports on visualizing linked data and for the visualizations that are part of the posts.

Near the end of the “Everything Still Looks Like A Graph…” post, Dan remarks:

There’s no single ‘correct’ view of the bibliographic landscape; what makes sense for a phd researcher, a job seeker or a schoolkid will naturally vary. This is true also of similarity measures in general, i.e. for see-also lists in plain HTML as well as fancy graph or landscape-based visualizations. There are more than metaphorical comparisons to be drawn with the kind of compositing tools we see in systems like Blender, and plenty of opportunities for putting control into end-user rather than engineering hands.

What do you make of:

There’s no single ‘correct’ view…of similarity measures in general, i.e., for see-also lists in plain HTML…

and

…plenty of opportunities for putting control into end-user rather than engineering hands.

???

Is it the case that most semantic solutions offer users “similarity measures” as applied by the authors of the semantic solutions?

That may or may not be the same as “similarity measures” as applied by users.

Is that why users continue to use Google? That for all of its crudeness, it does offer users the freedom to make their own judgements about similarity?

So how do we create an interface that:

  • Enables users to use their own judgements of similarity, and
  • Enables users to capture those judgements of similarity for use by others, and
  • Enables users to explain/disclose their judgements of similarity (to enable other users to agree/not-agree), and
  • Does so with only a little more effort than like/dislike?

Suggestions/comments/proposals?

Hypergraphs and Colored Maps

Filed under: Graphs,Hyperedges,Hypergraphs — Patrick Durusau @ 6:42 am

Hypergraphs and Colored Maps by James Mallos.

From the post:

A graph, in general terms, is a set of vertices connected by edges. Finding good colorings for the vertices (or edges) of a graph may seem like a hobby interest, but, in fact, graphs bearing certain rule-based colorings represent mathematical objects that are more general than graphs themselves. By bearing colors, graphs let us see objects that could not be quite so easily drawn.

I discovered James’ site, Weave Anything, while searching for blogs on hypergraphs.

I recommend it as an entertaining way to learn more about graphs, hypergraphs, hypermaps and similar structures.

July 22, 2012

Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT available for download

Filed under: Lucene,Solr — Patrick Durusau @ 6:34 pm

Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT available for download

More good news!

I am very excited to announce the availability of Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT. The Realtime NRT implementation now supports both RankingAlgorithm and Lucene. Realtime NRT is a high performance and more granular NRT implementation as to soft commit. The update performance is about 70,000 documents / sec*. You can also scale up to 2 billion documents* in a single core, and query half a billion documents index in ms**.

RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or boolean queries and is compatible with the new Lucene 4.0-ALPHA api.

You can get more information about Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 Realtime performance from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 from here: http://solr-ra.tgels.org

Please download and give the new version a try.

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

Apache Lucene 3.6.1 and Apache Solr 3.6.1 available

Filed under: Indexing,Lucene,Solr — Patrick Durusau @ 6:21 pm

Lucene/Solr news on 22 July 2012:

The Lucene PMC is pleased to announce the availability of Apache Lucene 3.6.1 and Apache Solr 3.6.1.

This release is a bug fix release for version 3.6.0. It contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below.

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-3x-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-3x-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Lucene 3.6.1 Release Highlights:

  • The concurrency of MMapIndexInput.clone() was improved, which caused a performance regression in comparison to Lucene 3.5.0.
  • MappingCharFilter was fixed to return correct final token positions.
  • QueryParser now supports +/- operators with any amount of whitespace.
  • DisjunctionMaxScorer now implements visitSubScorers().
  • Changed the visibility of Scorer#visitSubScorers() to public, otherwise it’s impossible to implement Scorers outside the Lucene package. This is a small backwards break, affecting a few users who implemented custom Scorers.
  • Various analyzer bugs were fixed: Kuromoji to not produce invalid token graph due to UNK with punctuation being decompounded, invalid position length in SynonymFilter, loading of Hunspell dictionaries that use aliasing, be consistent with closing streams when loading Hunspell affix files.
  • Various bugs in FST components were fixed: Offline sorter minimum buffer size, integer overflow in sorter, FSTCompletionLookup missed to close its sorter.
  • Fixed a synchronization bug in handling taxonomies in facet module.
  • Various minor bugs were fixed: BytesRef/CharsRef copy methods with nonzero offsets and subSequence off-by-one, TieredMergePolicy returned wrong-scaled floor segment setting.

Solr 3.6.1 Release Highlights:

  • The concurrency of MMapDirectory was improved, which caused a performance regression in comparison to Solr 3.5.0. This affected users with 64bit platforms (Linux, Solaris, Windows) or those explicitly using MMapDirectoryFactory.
  • ReplicationHandler “maxNumberOfBackups” was fixed to work if backups are triggered on commit.
  • Charset problems were fixed with HttpSolrServer, caused by an upgrade to a new Commons HttpClient version in 3.6.0.
  • Grouping was fixed to return correct count when not all shards are queried in the second pass. Solr no longer throws Exception when using result grouping with main=true and using wt=javabin.
  • Config file replication was made less error prone.
  • Data Import Handler threading fixes.
  • Various minor bugs were fixed.

What a nice way to start the week!

Thanks to the Lucene PMC!

C++11 regex cheatsheet

Filed under: Regex,Regexes — Patrick Durusau @ 6:01 pm

C++11 regex cheatsheet

A one page C++11 regex cheatsheet that you may find useful.

Curious though, how useful do you find colors on cheatsheets?

Or are there cheatsheets where you find colors useful and others not?

If so, what seems to be the difference?

Not an entirely idle query. I want to author a cheatsheet or two, but want them to be useful to others.

At one level, I see cheatsheets as being extremely minimalistic, no commentary, just short reminders of the correct syntax.

A step up from that level, perhaps for rarely used commands, a bit more than bare syntax.

Suggestions? Pointers to cheatsheets you have found useful?

Stardog Quick Start

Filed under: RDF,Stardog — Patrick Durusau @ 5:44 pm

After downloading Stardog and a license key, I turned to the “Stardog Quick Start” page at: {install directory}/stardog-1.0.2/docs/manual/quick-start/index.html.

What follows is an annotated version of that page that reports “dumb” and perhaps not so “dumb” mistakes I made to get the server running and the first database loaded. (I have also included what output to expect if you are successful at each step.)

First, tell Stardog where its home directory (where databases and other files will be stored) is. (If you’re using some weird Unix shell that doesn’t create environment variables in this way, adjust accordingly. Stardog requires STARDOG_HOME to be defined.)

$ export STARDOG_HOME=/data/stardog

I edited my .bashrc file to insert this statement.

Be sure to remember to open another shell so that the variable gets set for that window. (Yes, I forgot the first time around.)

Second, copy the stardog-license-key.bin into place. (You’ll get this either with an evaluation copy of Stardog or with a licensed copy.)

$ cp stardog-license-key.bin $STARDOG_HOME

Of course stardog-license-key.bin has to be readable by the Stardog process.

Mine defaulted to 664 but it is a good idea to check.

Third, start the Stardog server. By default the server will expose SNARL and HTTP interfaces—on ports 5820 and 5822, respectively.

$ ./stardog-admin server start

If successful, you will see:

Starting Stardog server in background, see /home/patrick/working/stardog-1.0.2/stardog.log for more information.

************************************************************
This copy of Stardog is licensed to Patrick Durusau (patrick@durusau.net), Patrick Durusau
This is a Community license
This license does not expire.
************************************************************

                                                             :;   
                                      ;;                   `;`:   
  `'+',    ::                        `++                    `;:`  
 +###++,  ,#+                        `++                    .     
 ##+.,',  '#+                         ++                     +    
,##      ####++  ####+:   ##,++` .###+++   .####+    ####++++#    
`##+     ####+'  ##+#++   ###++``###'+++  `###'+++  ###`,++,:     
 ####+    ##+        ++.  ##:   ###  `++  ###  `++` ##`  ++:      
  ###++,  ##+        ++,  ##`   ##;  `++  ##:   ++; ##,  ++:      
    ;+++  ##+    ####++,  ##`   ##:  `++  ##:   ++' ;##'#++       
     ;++  ##+   ###  ++,  ##`   ##'  `++  ##;   ++:  ####+        
,.   +++  ##+   ##:  ++,  ##`   ###  `++  ###  .++  '#;           
,####++'  +##++ ###+#+++` ##`   :####+++  `####++'  ;####++`      
`####+;    ##++  ###+,++` ##`    ;###:++   `###+;   `###++++      
                                                    ##   `++      
                                                   .##   ;++      
                                                    #####++`      
                                                     `;;;.        

************************************************************
Stardog server 1.0.2 started on Sun Jul 22 16:54:01 EDT 2012.

SNARL server running on snarl://localhost:5820/
HTTP server running on http://localhost:5822/.
Stardog documentation accessible at http://localhost:5822/docs
SNARL & HTTP servers listening on all interfaces

STARDOG_HOME=/home/patrick/working/stardog-1.0.2 

Fourth, create a database with an input file; use the --server parameter to specify which server:

$ ./stardog-admin create -n myDB -t D -u admin -p admin --server snarl://localhost:5820/ examples/data/University0_0.owl

Gotcha! Would you believe that UniversityO_O.owl has two 0 digits in the name?

Violent disagreement notwithstanding, it is always bad practice to use easily confused letters and digits in file names. Always.

If you are successful you will see:

Bulk loading data to new database.
Data load complete. Loaded 8,521 triples in 00:00:01 @ 8.1K triples/sec.
Successfully created database 'myDB'.

Fifth, optionally, admire the pure RDF bulk loading power…woof!

OK. 😉

Sixth, query the database:

$ ./stardog query -c http://localhost:5822/myDB -q "SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 10"

If successful, you will see:

Executing Query:

SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 10

+--------------------------------------------------------+
|                           s                            |
+--------------------------------------------------------+
| http://api.stardog.com                                 |
| http://www.University0.edu                             |
| http://www.Department0.University0.edu                 |
| http://www.Department0.University0.edu/FullProfessor0  |
| http://www.Department0.University0.edu/Course0         |
| http://www.Department0.University0.edu/GraduateCourse0 |
| http://www.Department0.University0.edu/GraduateCourse1 |
| http://www.University84.edu                            |
| http://www.University875.edu                           |
| http://www.University241.edu                           |
+--------------------------------------------------------+

Query returned 10 results in 00:00:00.093

If you happen to make any mistakes, you may want to be aware of:

./stardog-admin drop -n myDB

😉

Your performance will vary but these notes may save you a few minutes and some annoyance in getting Stardog up and running.

Snarl [Protocol]

Filed under: Snarl [Protocol] — Patrick Durusau @ 1:56 pm

Snarl

I encountered the Snarl protocol while configuring the latest release of Stardog.

Looking for more information, I found its homepage and this tag line:

What you need to know – when you need to know it

I rather like that.

I can think of a number of notifications that could be sent to a user from a topic map application.

Or as input into a topic map application.

Windows Azure Active Directory Graph

Filed under: Graphs,Microsoft — Patrick Durusau @ 5:23 am

Windows Azure Active Directory Graph

Pre-release documentation, subject to change before release, blah, blah, but very interesting nonetheless.

When I look at the application scenario, Creating Enterprise Applications by Using Windows Azure AD Graph, which is described as:

In this scenario you have purchased an Office 365 subscription. As part of the subscription you have purchased the capability to manage users using Windows Azure AD, which is part of Windows Azure. You want to build an application that can access users’ information such as user names and group membership.

OK, so I can access “user names and group membership,” which is a good thing, but a better (read: more useful) thing would be to manage other user identifications for access to enterprise applications.

Or to put that differently, to map user identifications together for any single user, so the appropriate identification is used for any particular system. (Thinking of long term legacy systems and applications. Almost everyone has them.)
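The mapping I have in mind is not complicated. A toy Python sketch with made-up systems and identifiers, just to show the shape of it:

# One canonical user, many system-specific identifiers (all values made up).
identity_map = {
    "jdoe": {
        "azure_ad":  "jdoe@example.onmicrosoft.com",
        "legacy_hr": "EMP004217",
        "mainframe": "JDOE01",
    },
}

def identifier_for(user, system):
    # Return the identifier the given system expects for this user.
    return identity_map[user][system]

print(identifier_for("jdoe", "legacy_hr"))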

Certainly worth your attention as this develops towards release.

July 21, 2012

Update on Assimilation Project: Neo4j Server Schema

Filed under: Neo4j,Networks — Patrick Durusau @ 8:10 pm

Update on Assimilation Project: Neo4j Server Schema

From the post:

In his latest blog post, Alan Robertson makes a bold move by building a Neo4j – a schemaless graph database – schema server for his Assimilation Monitoring Project which is a comprehensive monitoring of systems and services for networks of potentially unlimited size.

For the schema, Robertson outlines the basic entities: Servers, NICs and IP addresses which indexes of servers names, MAC addresses and IP addresses. The power in using a graph is the ability to define the relationships between the various entities such as NIC owner, IP Owner, IP Host and Primary IP.

Correct me if I’m wrong but doesn’t that sound a lot like the topic map that Robert Barta was running when he was down under?

I don’t have the reference within easy reach so if you can post it as a comment I would appreciate it.

Mapping Public Opinion: A Tutorial

Filed under: Mapping,Maps,R — Patrick Durusau @ 8:00 pm

Mapping Public Opinion: A Tutorial by David Sparks.

From the post:

At the upcoming 2012 summer meeting of the Society of Political Methodology, I will be presenting a poster on Isarithmic Maps of Public Opinion. Since last posting on the topic, I have made major improvements to the code and robustness of the modeling approach, and written a tutorial that illustrates the production of such maps.

This tutorial, in a very rough draft form, can be downloaded here [PDF]. I would welcome any and all comments on clarity, readability, and the method itself. Please feel free to use this code for your own projects, but I would be very interested in seeing any results, and hope you would be willing to share them.

An interesting mapping exercise, even though I find political opinion mapping just a tad tedious. Hasn’t changed significantly in years, which explains “safe” seats for both Republicans and Democrats in the United States.

Still, the techniques are valid and can be useful in other contexts.

The Amazing Mean Shift Algorithm

Filed under: Algorithms,Merging — Patrick Durusau @ 7:47 pm

The Amazing Mean Shift Algorithm by Larry Wasserman.

From the post:

The mean shift algorithm is a mode-based clustering method due to Fukunaga and Hostetler (1975) that is commonly used in computer vision but seems less well known in statistics.

The steps are: (1) estimate the density, (2) find the modes of the density, (3) associate each data point to one mode.

If you are puzzling over why I cited this post, it might help if you read “(3)” as:

(3) merge data points associated with one mode.

The notion that topics can only be merged on the basis of URLs (actually, discrete values of any sort) is one way to think about merging. Your data may or may not admit of robust processing on that basis.

Those are all very good ways to merge topics, if and only if that works for your data.

If not, then you need to find ways that work with your data.
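To see what mode-based “merging” looks like in practice, here is a bare-bones one-dimensional mean shift sketch (Gaussian kernel). My own illustration, not Wasserman’s code, and nothing you would use on real data, but it shows points that climb to the same mode ending up in the same cluster:

import math

def mean_shift_1d(points, bandwidth=1.0, steps=50, tol=1e-2):
    # (1) implicit Gaussian kernel density estimate,
    # (2) climb each point uphill to a mode,
    # (3) group ("merge") points whose modes agree within tol.
    def shift(x):
        weights = [math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2)) for p in points]
        return sum(w * p for w, p in zip(weights, points)) / sum(weights)

    clusters = []   # each entry: {"mode": m, "points": [...]}
    for start in points:
        x = start
        for _ in range(steps):
            x = shift(x)
        for c in clusters:
            if abs(c["mode"] - x) < tol:
                c["points"].append(start)
                break
        else:
            clusters.append({"mode": x, "points": [start]})
    return clusters

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
for c in mean_shift_1d(data):
    print(round(c["mode"], 2), c["points"])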

Stardog 1.0.2 Released

Filed under: Graphs,Linked Data,Stardog — Patrick Durusau @ 7:35 pm

Stardog 1.0.2 Released: NoSQL Graph Database Leading Innovation in Semantic Technologies

From the post:

C&P LLC, the company behind Stardog, today announced the release of Stardog 1.0.2. Stardog is a NoSQL graph database based on W3C semantic web standards: SPARQL, RDF, and OWL. Stardog is a key component in Linked Data-based information integration at Fortune 500 enterprises and governments around the world.

The new release follows closely on last month’s launch of Stardog 1.0. The 1.0.2 release includes Stardog Community, a free version of Stardog for community use in academia, non-profit, and related sectors. Stardog is being used by customers in the areas of government, aerospace, financial, intelligence, defense, and at consumer-oriented startups.

“We are pleased with the technical progress made on Stardog since the 1.0 launch,” said Dr Evren Sirin, CTO, C&P LLC. “Today’s release begins our support for SPARQL 1.1, a crucial standard in enterprise information integration and semantic technologies. It also introduces Stardog Community to a new user base in linked open data, open government data, and related fields.”

I’m not really sure I would want to tie Linked Data or SPARQL to the tail of my product.

Still, there is a fair amount of it, the sheer inertia of government systems will result in more of it, and it will always be around as a legacy format. So there isn’t any harm in supporting it, so long as you don’t get tunnel vision from it.

Efficient Core Maintenance in Large Dynamic Graphs

Filed under: Graphs,Merging,Topic Map Software — Patrick Durusau @ 7:26 pm

Efficient Core Maintenance in Large Dynamic Graphs by Rong-Hua Li and Jeffrey Xu Yu.

Abstract:

The $k$-core decomposition in a graph is a fundamental problem for social network analysis. The problem of $k$-core decomposition is to calculate the core number for every node in a graph. Previous studies mainly focus on $k$-core decomposition in a static graph. There exists a linear time algorithm for $k$-core decomposition in a static graph. However, in many real-world applications such as online social networks and the Internet, the graph typically evolves over time. Under such applications, a key issue is to maintain the core number of nodes given the graph changes over time. A simple implementation is to perform the linear time algorithm to recompute the core number for every node after the graph is updated. Such simple implementation is expensive when the graph is very large. In this paper, we propose a new efficient algorithm to maintain the core number for every node in a dynamic graph. Our main result is that only certain nodes need to update their core number given the graph is changed by inserting/deleting an edge. We devise an efficient algorithm to identify and recompute the core number of such nodes. The complexity of our algorithm is independent of the graph size. In addition, to further accelerate the algorithm, we develop two pruning strategies by exploiting the lower and upper bounds of the core number. Finally, we conduct extensive experiments over both real-world and synthetic datasets, and the results demonstrate the efficiency of the proposed algorithm.
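For readers who have not met core numbers before, here is a minimal Python sketch of the static “peeling” algorithm the abstract refers to. This is my own illustration of the standard linear-time idea, not the authors’ incremental algorithm:

def core_numbers(adjacency):
    # Repeatedly peel off a minimum-degree node; its core number is the
    # largest minimum degree seen so far.
    degree = {v: len(ns) for v, ns in adjacency.items()}
    remaining = set(adjacency)
    core, current = {}, 0
    while remaining:
        v = min(remaining, key=degree.get)
        current = max(current, degree[v])
        core[v] = current
        remaining.remove(v)
        for u in adjacency[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# A triangle with one pendant node hanging off it.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(core_numbers(graph))   # the triangle nodes get 2, the pendant gets 1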

Maintenance of topic maps in the face of incoming information is an important issue.

I am intrigued by the idea that only certain nodes require updating when edges are added to or deleted from a graph. That is certainly true with topic maps, and my question is whether this work can be adapted to work outside of core number updates.

Or perhaps more clearly, can it be adapted to work as a basis for merging topic maps? Or should core numbers be adapted for processing topic maps?

Questions I would be exploring if I had a topic maps lab. Maybe I should work up a proposal for one at an investment site.
