Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 19, 2013

Knowledge representation and processing with formal concept analysis

Filed under: Formal Concept Analysis (FCA) — Patrick Durusau @ 2:55 pm

Knowledge representation and processing with formal concept analysis by Sergei O. Kuznetsov and Jonas Poelmans. (Kuznetsov, S. O. and Poelmans, J. (2013), Knowledge representation and processing with formal concept analysis. WIREs Data Mining Knowl Discov, 3: 200–215. doi: 10.1002/widm.1088)

Abstract:

During the last three decades, formal concept analysis (FCA) became a well-known formalism in data analysis and knowledge discovery because of its usefulness in important domains of knowledge discovery in databases (KDD) such as ontology engineering, association rule mining, machine learning, as well as relation to other established theories for representing knowledge processing, like description logics, conceptual graphs, and rough sets. In early days, FCA was sometimes misconceived as a static crisp hardly scalable formalism for binary data tables. In this paper, we will try to show that FCA actually provides support for processing large dynamical complex (may be uncertain) data augmented with additional knowledge.

It isn’t entirely clear who the authors consider responsible for formal concept analysis being “…sometimes misconceived as a static crisp hardly scalable formalism for binary data tables.”

I suspect opinions differ on that point. 😉

Whatever your position on that issue, the paper is a handy review/survey of formal concept analysis (FCA) up to its present state.

Working through this paper and its many references will give you a lot of practice at analysis of concepts.

I should say good practice because many of the same questions will occur when you analyze subjects for representation in a topic map.
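
If you want to experiment while you read, the core operation of FCA, deriving formal concepts from a binary object/attribute table, fits in a few lines of Python. This is my own toy sketch, not code from the paper, and the brute-force enumeration is exactly the sort of thing the authors' scalability discussion is about:

```python
from itertools import combinations

# A tiny formal context: objects mapped to the attributes they have.
context = {
    "duck":    {"swims", "flies", "lays_eggs"},
    "goose":   {"swims", "flies", "lays_eggs"},
    "ostrich": {"lays_eggs"},
    "otter":   {"swims"},
}
attributes = set().union(*context.values())

def extent(attrs):
    """Objects that have every attribute in attrs."""
    return {obj for obj, has in context.items() if attrs <= has}

def intent(objs):
    """Attributes shared by every object in objs."""
    if not objs:
        return set(attributes)
    return set.intersection(*(context[o] for o in objs))

# Brute-force enumeration of formal concepts: (extent, intent) pairs closed
# under the two derivation operators. Fine for toy contexts, hopeless for
# large ones, which is where the survey's scalability discussion comes in.
concepts = set()
for r in range(len(attributes) + 1):
    for combo in combinations(sorted(attributes), r):
        e = extent(set(combo))
        i = intent(e)
        concepts.add((frozenset(e), frozenset(i)))

for e, i in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(e), "<->", sorted(i))
```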

Preliminary evaluation of the CellFinder literature…

Filed under: Curation,Data,Entity Resolution,Named Entity Mining — Patrick Durusau @ 2:18 pm

Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts by Mariana Neves, Alexander Damaschun, Nancy Mah, Fritz Lekschas, Stefanie Seltmann, Harald Stachelscheid, Jean-Fred Fontaine, Andreas Kurtz, and Ulf Leser. (Database (2013) 2013 : bat020 doi: 10.1093/database/bat020)

Abstract:

Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially, in triage of relevant documents and extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cell or anatomical part have been extracted. Validation of half of this data resulted in a precision of ∼50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction.

Database URL: http://www.cellfinder.org/.

Another extremely useful data curation project.

Do you get the impression that curation projects will continue to be outrun by data production?

And that will be the case, even with machine assistance?

Is there an alternative to falling further and further behind?

Such as abandoning some content (CNN?) to simply go uncurated forever? Or doing the same for government documents/reports?

I am sure we all have different suggestions for what data to dump alongside the road to make room for the “important” stuff.

Suggestions on solutions other than simply dumping data?

Analyzing Data with Hue and Hive

Filed under: Hadoop,Hive,Hue — Patrick Durusau @ 2:06 pm

Analyzing Data with Hue and Hive by Romain Rigaux.

From the post:

In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).

The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!

I think the demo would be more effective if a city known for good food, New Orleans, for example, had been chosen for the challenge.

But given the complexity of the cuisine, that would be a stress test for human experts.

What chance would Apache Hadoop have? 😉
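
If you would rather poke at the Yelp data outside of Hue, the same “coolest restaurants in Phoenix” question can be asked in a few lines of Python. This is a sketch of mine, assuming the business file has one JSON object per line with stars, review_count, categories, city and name fields; check those assumptions against the files you actually download:

```python
import json

# Rank Phoenix restaurants by average stars, requiring a minimum number of
# reviews so a single enthusiastic review doesn't win.
top = []
with open("yelp_academic_dataset_business.json") as f:
    for line in f:
        biz = json.loads(line)
        if biz.get("city") != "Phoenix":
            continue
        if "Restaurants" not in biz.get("categories", []):
            continue
        if biz.get("review_count", 0) < 50:   # arbitrary popularity floor
            continue
        top.append((biz["stars"], biz["review_count"], biz["name"]))

for stars, reviews, name in sorted(top, reverse=True)[:10]:
    print(f"{stars:.1f} stars ({reviews} reviews)  {name}")
```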

Yahoo! Cloud Serving Benchmark

Filed under: Benchmarks,YCSB — Patrick Durusau @ 1:45 pm

Yahoo! Cloud Serving Benchmark

From the webpage:

With the many new serving databases available including Sherpa, BigTable, Azure and many more, it can be difficult to decide which system is right for your application, partially because the features differ between systems, and partially because there is not an easy way to compare the performance of one system versus another.

The goal of the Yahoo! Cloud Serving Benchmark (YCSB) project is to develop a framework and common set of workloads for evaluating the performance of different “key-value” and “cloud” serving stores.

The project comprises two areas:

  • The YCSB Client, an extensible workload generator
  • The Core workloads, a set of workload scenarios to be executed by the generator

Although the core workloads provide a well-rounded picture of a system’s performance, the Client is extensible so that you can define new and different workloads to examine system aspects, or application scenarios, not adequately covered by the core workload. Similarly, the Client is extensible to support benchmarking different databases. Although we include sample code for benchmarking HBase and Cassandra, it is straightforward to write a new interface layer to benchmark your favorite database.

A common use of the tool is to benchmark multiple systems and compare them. For example, you can install multiple systems on the same hardware configuration, and run the same workloads against each system. Then you can plot the performance of each system (for example, as latency versus throughput curves) to see when one system does better than another.

The Yahoo! Cloud Serving Benchmark (YCSB) doesn’t get discussed in the video and is covered only briefly in the paper: How to Compare NoSQL Databases.

YCSB source code and Benchmarking Cloud Serving Systems with YCSB may be helpful.

Database performance depends upon your point of view, the benchmarks and how they are applied, and no doubt other factors as well.

It would make an interesting topic map project to compare the metrics from different benchmarks and to attempt to create a crosswalk between them.

That would require a very deep and explicit definition of commonalities and differences between the benchmarks and their application to various database architectures.
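
If you do run the same YCSB workloads against several stores, the latency versus throughput comparison mentioned above is easy to chart once you have the numbers. A minimal matplotlib sketch of mine, with made-up placeholder results (substitute your own measurements):

```python
import matplotlib.pyplot as plt

# Throughput (ops/sec) vs. average read latency (ms) for two hypothetical
# systems, as collected from a series of YCSB runs at increasing target
# throughputs. Replace with your own benchmark output.
results = {
    "system-A": [(1000, 1.1), (5000, 1.4), (10000, 2.3), (20000, 6.8)],
    "system-B": [(1000, 0.9), (5000, 1.8), (10000, 4.5), (20000, 12.0)],
}

for name, points in results.items():
    xs, ys = zip(*points)
    plt.plot(xs, ys, marker="o", label=name)

plt.xlabel("Throughput (ops/sec)")
plt.ylabel("Average read latency (ms)")
plt.title("YCSB workload: latency vs. throughput")
plt.legend()
plt.show()
```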

Aerospike

Filed under: Aerospike,NoSQL,Performance — Patrick Durusau @ 1:01 pm

Aerospike

From the architecture overview:

Aerospike is a fast Key Value Store or Distributed Hash Table architected to be a flexible NoSQL platform for today’s high scale Apps. Designed to meet the reliability or ACID requirements of traditional databases, there is no single point of failure (SPOF) and data is never lost. Aerospike can be used as an in-memory database and is uniquely optimized to take advantage of the dramatic cost benefits of flash storage. Written in C, Aerospike runs on Linux.

Based on our own experiences developing mission-critical applications with high scale databases and our interactions with customers, we’ve developed a general philosophy of operational efficiency that guides product development. Three principles drive Aerospike architecture: NoSQL flexibility, traditional database reliability, and operational efficiency.

Technical details were first published in the Proceedings of the VLDB (Very Large Data Bases): Citrusleaf: A Real-Time NoSQL DB which Preserves ACID by V. Srinivasan and Brian Bulkowski.

You can guess why they changed the name. 😉

There is a free community edition, along with an SDK and documentation.

Relies on RAM and SSDs.

Timo Elliott was speculating about entirely RAM-based computing in: In-Memory Computing.

Imagine losing all the special coding tricks to get performance despite disk storage.

Simpler code and fewer operations should result in higher speed.

How to Compare NoSQL Databases

Filed under: Aerospike,Benchmarks,Cassandra,Couchbase,Database,MongoDB,NoSQL — Patrick Durusau @ 12:45 pm

How to Compare NoSQL Databases by Ben Engber. (video)

From the description:

Ben Engber, CEO and founder of Thumbtack Technology, will discuss how to perform tuned benchmarking across a number of NoSQL solutions (Couchbase, Aerospike, MongoDB, Cassandra, HBase, others) and to do so in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results.

We also discuss how to extend benchmarking tests to simulate different kinds of failure scenarios to help evaluate the maintainability and recoverability of different systems. This requires carefully constructed tests and significant knowledge of the underlying databases — the talk will help evaluators overcome the common pitfalls and time sinks involved in trying to measure this.

Lastly we discuss the YCSB benchmarking tool, its significant limitations, and the significant extensions and supplementary tools Thumbtack has created to provide distributed load generation and failure simulation.

Ben makes a very good case for understanding the details of your use case versus the characteristics of particular NoSQL solutions.

Where you will find “better” performance depends on non-obvious details.

Watch the use of terms like “consistency” in this presentation.

The paper Ben refers to: Ultra-High Performance NoSQL Benchmarking: Analyzing Durability and Performance Tradeoffs.

Forty-three pages of analysis and charts.

Slow but interesting reading.

If you are into the details of performance and NoSQL databases.

April 18, 2013

KDD Cup 2013 – Author-Paper Identification Challenge

Filed under: Challenges,Contest,KDD — Patrick Durusau @ 6:41 pm

KDD Cup 2013 – Author-Paper Identification Challenge

Started: 3:47 am, Thursday 18 April 2013 UTC
Ends: 12:00 am, Wednesday 12 June 2013 UTC (54 total days)

From the post:

The ability to search literature and collect/aggregate metrics around publications is a central tool for modern research. Both academic and industry researchers across hundreds of scientific disciplines, from astronomy to zoology, increasingly rely on search to understand what has been published and by whom.

Microsoft Academic Search is an open platform that provides a variety of metrics and experiences for the research community, in addition to literature search. It covers more than 50 million publications and over 19 million authors across a variety of domains, with updates added each week. One of the main challenges of providing this service is caused by author-name ambiguity. On one hand, there are many authors who publish under several variations of their own name. On the other hand, different authors might share a similar or even the same name.

As a result, the profile of an author with an ambiguous name tends to contain noise, resulting in papers that are incorrectly assigned to him or her. This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.

$7,500 and bragging rights.

Is there going to be a topic map entry this year?
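
A topic map entry would have to start from exactly the name-ambiguity problem the task describes. As a crude illustration (mine, and nowhere near a competitive solution), you can score each candidate paper by its best name-variant match plus coauthor overlap:

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Rough string similarity between two author name variants."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_paper(author_profile, paper):
    """
    Score how likely `paper` really belongs to `author_profile`.
    author_profile: {"name": ..., "coauthors": set of names}
    paper: {"authors": list of names on the paper}
    Combines the best name-variant match with coauthor overlap.
    """
    best_name = max(name_similarity(author_profile["name"], a)
                    for a in paper["authors"])
    overlap = len(author_profile["coauthors"] & set(paper["authors"]))
    return best_name + 0.1 * overlap   # weights are arbitrary

profile = {"name": "J. A. Smith",
           "coauthors": {"M. Jones", "L. Chen"}}
papers = [
    {"id": 1, "authors": ["John A. Smith", "M. Jones"]},
    {"id": 2, "authors": ["Jane Smithers", "P. Gupta"]},
]
for p in sorted(papers, key=lambda p: score_paper(profile, p), reverse=True):
    print(p["id"], round(score_paper(profile, p), 3))
```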

A survey of fuzzy web mining

Filed under: Fuzzing,Fuzzy Logic,Fuzzy Matching,Fuzzy Sets — Patrick Durusau @ 6:33 pm

A survey of fuzzy web mining by Chun-Wei Lin and Tzung-Pei Hong. (Lin, C.-W. and Hong, T.-P. (2013), A survey of fuzzy web mining. WIREs Data Mining Knowl Discov, 3: 190–199. doi: 10.1002/widm.1091)

Abstract:

The Internet has become an unlimited resource of knowledge, and is thus widely used in many applications. Web mining plays an important role in discovering such knowledge. This mining can be roughly divided into three categories, including Web usage mining, Web content mining, and Web structure mining. Data and knowledge on the Web may, however, consist of imprecise, incomplete, and uncertain data. Because fuzzy-set theory is often used to handle such data, several fuzzy Web-mining techniques have been proposed to reveal fuzzy and linguistic knowledge. This paper reviews these techniques according to the three Web-mining categories above—fuzzy Web usage mining, fuzzy Web content mining, and fuzzy Web structure mining. Some representative approaches in each category are introduced and compared.

Written to cover fuzzy web mining but generally useful for data mining and organization as well.

Fuzzy techniques are probably closer to our mental processes than the precision of description logic.

Being mindful that mathematical and logical proofs are justifications for conclusions we already hold.

They are not the paths by which we arrived at those conclusions.
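
If fuzzy sets are new to you, the core idea is graded membership rather than in/out membership. A small sketch of mine, using a web-usage flavored example (the thresholds are arbitrary):

```python
def mu_long_visit(seconds):
    """Fuzzy membership in 'long visit': 0 below 30s, 1 above 180s,
    linear in between."""
    if seconds <= 30:
        return 0.0
    if seconds >= 180:
        return 1.0
    return (seconds - 30) / 150.0

def mu_frequent(visits_per_week):
    """Fuzzy membership in 'frequent visitor'."""
    if visits_per_week <= 1:
        return 0.0
    if visits_per_week >= 10:
        return 1.0
    return (visits_per_week - 1) / 9.0

# Classic fuzzy connectives: AND = min, OR = max, NOT = 1 - x.
def fuzzy_and(a, b): return min(a, b)
def fuzzy_or(a, b):  return max(a, b)

# "Engaged user" = long visits AND frequent visits, to a degree.
for secs, freq in [(45, 2), (120, 7), (200, 12)]:
    degree = fuzzy_and(mu_long_visit(secs), mu_frequent(freq))
    print(f"{secs:>3}s, {freq:>2}/week -> engaged to degree {degree:.2f}")
```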

Fast Collaborative Graph Exploration

Filed under: Collaboration,Graph Analytics,Graphs,Networks — Patrick Durusau @ 2:26 pm

Fast Collaborative Graph Exploration by Dariusz Dereniowski, Yann Disser, Adrian Kosowski, Dominik Pajak, Przemyslaw Uznanski.

Abstract:

We study the following scenario of online graph exploration. A team of $k$ agents is initially located at a distinguished vertex $r$ of an undirected graph. At every time step, each agent can traverse an edge of the graph. All vertices have unique identifiers, and upon entering a vertex, an agent obtains the list of identifiers of all its neighbors. We ask how many time steps are required to complete exploration, i.e., to make sure that every vertex has been visited by some agent. We consider two communication models: one in which all agents have global knowledge of the state of the exploration, and one in which agents may only exchange information when simultaneously located at the same vertex. As our main result, we provide the first strategy which performs exploration of a graph with $n$ vertices at a distance of at most $D$ from $r$ in time $O(D)$, using a team of agents of polynomial size $k = D n^{1+ \epsilon} < n^{2+\epsilon}$, for any $\epsilon > 0$. Our strategy works in the local communication model, without knowledge of global parameters such as $n$ or $D$. We also obtain almost-tight bounds on the asymptotic relation between exploration time and team size, for large $k$. For any constant $c>1$, we show that in the global communication model, a team of $k = D n^c$ agents can always complete exploration in $D(1+ \frac{1}{c-1} +o(1))$ time steps, whereas at least $D(1+ \frac{1}{c} -o(1))$ steps are sometimes required. In the local communication model, $D(1+ \frac{2}{c-1} +o(1))$ steps always suffice to complete exploration, and at least $D(1+ \frac{2}{c} -o(1))$ steps are sometimes required. This shows a clear separation between the global and local communication models.

Heavy going but seems important for graph exploration performance.

See also the special case of exploring trees under related work.

Another possibility for exploring overlapping markup. Each agent has an independent view of one part of the markup trees.
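
To get a feel for the problem, here is a toy simulation of mine in the global-communication model (a naive greedy baseline, not the authors' strategy): every round each agent moves one edge toward the nearest still-unvisited vertex, and we count rounds until the whole graph has been seen.

```python
from collections import deque

def next_step_toward_unvisited(adj, start, visited):
    """BFS from `start`; return the first edge of a shortest path to the
    nearest unvisited vertex, or None if none remain."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v not in visited:
            # walk back to the neighbor of `start` on this path
            while parent[v] != start:
                v = parent[v]
            return v
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return None

def explore(adj, root, k):
    """
    Toy global-communication exploration: k agents start at `root`; each
    time step every agent moves one edge toward the nearest vertex no agent
    has visited yet. Returns the number of steps to visit every vertex.
    """
    agents = [root] * k
    visited = {root}
    steps = 0
    while len(visited) < len(adj):
        steps += 1
        for i, pos in enumerate(agents):
            step_to = next_step_toward_unvisited(adj, pos, visited)
            if step_to is not None:
                agents[i] = step_to
                visited.add(step_to)  # entering a vertex counts as visiting it
    return steps

# A small example: a cycle of 12 vertices explored by 1 vs. 4 agents.
n = 12
cycle = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
print("1 agent :", explore(cycle, 0, 1), "steps")
print("4 agents:", explore(cycle, 0, 4), "steps")
```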

Distributed Dual Decomposition (DDD) in GraphLab

Filed under: GraphLab,Operations Research,Relaxation — Patrick Durusau @ 1:38 pm

Distributed Dual Decomposition (DDD) in GraphLab by Danny Bickson.

From the post:

Our collaborator Dhruv Batra, from Virginia Tech, has kindly contributed DDD code for GraphLab. Here is some explanation of the method and how to deploy it.

The full documentation is found here.

Distributed Dual Decomposition

Dual Decomposition (DD), also called Lagrangian Relaxation, is a powerful technique with a rich history in Operations Research. DD solves a relaxation of difficult optimization problems by decomposing them into simpler subproblems, solving these simpler subproblems independently and then combining these solutions into an approximate global solution.

More details about DD for solving Maximum A Posteriori (MAP) inference problems in Markov Random Fields (MRFs) can be found in the following:

D. Sontag, A. Globerson, T. Jaakkola. Introduction to Dual Decomposition for Inference. Optimization for Machine Learning, editors S. Sra, S. Nowozin, and S. J. Wright: MIT Press, 2011.
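
To make the decomposition idea concrete outside of GraphLab and MRFs, here is a toy example of my own (not Dhruv Batra's code): two easy subproblems coupled by a consensus constraint, solved independently each iteration while a Lagrange multiplier pushes them into agreement via subgradient updates.

```python
# Toy dual decomposition / Lagrangian relaxation:
#   minimize f1(x1) + f2(x2)  subject to  x1 = x2
# with f1(x) = (x - 3)^2 and f2(x) = (x + 1)^2.
# Relax the constraint with multiplier lam; each subproblem is then solved
# independently (here in closed form) and lam is updated by subgradient
# ascent on the dual. The agreed value converges to x = 1.

def solve_sub1(lam):
    # argmin_x (x - 3)^2 + lam * x   ->  2(x - 3) + lam = 0
    return 3.0 - lam / 2.0

def solve_sub2(lam):
    # argmin_x (x + 1)^2 - lam * x   ->  2(x + 1) - lam = 0
    return -1.0 + lam / 2.0

lam, step = 0.0, 0.5
for it in range(20):
    x1, x2 = solve_sub1(lam), solve_sub2(lam)
    lam += step * (x1 - x2)          # subgradient of the dual
    print(f"iter {it:2d}: x1={x1:6.3f}  x2={x2:6.3f}  lam={lam:6.3f}")
```

The multiplier settles at 4 and both copies agree on x = 1, which is the constrained optimum; the real systems differ mainly in how the subproblems are defined and distributed.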

Always bear in mind that string matching is only one form of subject identification.

Parallella: The $99 Linux supercomputer

Filed under: Linux OS,Parallel Programming,Parallela,Parallelism — Patrick Durusau @ 1:23 pm

Parallella: The $99 Linux supercomputer by Steven J. Vaughan-Nichols.

From the post:

What Adapteva has done is create a credit-card sized parallel-processing board. This comes with a dual-core ARM A9 processor and a 64-core Epiphany Multicore Accelerator chip, along with 1GB of RAM, a microSD card, two USB 2.0 ports, 10/100/1000 Ethernet, and an HDMI connection. If all goes well, by itself, this board should deliver about 90 GFLOPS of performance, or — in terms PC users understand — about the same horse-power as a 45GHz CPU.

This board will use Ubuntu Linux 12.04 for its operating system. To put all this to work, the platform reference design and drivers are now available.

From Adapteva.

I wonder which will come first:

A really kick-ass 12 dimensional version of Asteroids?

or

New approaches to graph processing?

What do you think?

A Survey of Stochastic and Gazetteer Based Approaches for Named Entity Recognition

Filed under: Entity Extraction,Named Entity Mining,Natural Language Processing — Patrick Durusau @ 10:57 am

A Survey of Stochastic and Gazetteer Based Approaches for Named Entity Recognition – Part 2 by Benjamin Bengfort.

From the post:

Generally speaking, the most effective named entity recognition systems can be categorized as rule-based, gazetteer and machine learning approaches. Within each of these approaches are a myriad of sub-approaches that combine to varying degrees each of these top-level categorizations. However, because of the research challenge posed by each approach, typically one or the other is focused on in the literature.

Rule-based systems utilize pattern-matching techniques in text as well as heuristics derived either from the morphology or the semantics of the input sequence. They are generally used as classifiers in machine-learning approaches, or as candidate taggers in gazetteers. Some applications can also make effective use of stand-alone rule-based systems, but they are prone to both overreach and skipping over named entities. Rule-based approaches are discussed in (10), (12), (13), and (14).

Gazetteer approaches make use of some external knowledge source to match chunks of the text via some dynamically constructed lexicon or gazette to the names and entities. Gazetteers also further provide a non-local model for resolving multiple names to the same entity. This approach requires either the hand crafting of name lexicons or some dynamic approach to obtaining a gazette from the corpus or another external source. However, gazette based approaches achieve better results for specific domains. Most of the research on this topic focuses on the expansion of the gazetteer to more dynamic lexicons, e.g. the use of Wikipedia or Twitter to construct the gazette. Gazette based approaches are discussed in (15), (16), and (17).

Stochastic approaches fare better across domains, and can perform predictive analysis on entities that are unknown in a gazette. These systems use statistical models and some form of feature identification to make predictions about named entities in text. They can further be supplemented with smoothing for universal coverage. Unfortunately these approaches require large amounts of annotated training data in order to be effective, and they don’t naturally provide a non-local model for entity resolution. Systems implemented with this approach are discussed in (7), (8), (4), (9), and (6).

Benjamin continues his excellent survey of named entity recognition techniques.

All of these techniques may prove to be useful in constructing topic maps from source materials.
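
To see how modest a gazetteer-based tagger can be (my sketch, not Benjamin's code), consider a hand-built gazette applied with longest-match-first lookup:

```python
import re

# A toy gazette: surface forms mapped to entity types. Real systems build
# this dynamically (e.g., from Wikipedia) rather than by hand.
gazette = {
    "Mary Shapiro": "PER",
    "S.E.C.": "ORG",
    "Securities and Exchange Commission": "ORG",
    "Washington": "LOC",
}

def tag(text):
    """Return (surface form, type, start offset) for gazette hits,
    longest names first so longer entries win over overlapping shorter ones."""
    hits = []
    taken = set()
    for name in sorted(gazette, key=len, reverse=True):
        for m in re.finditer(re.escape(name), text):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):
                taken.update(span)
                hits.append((name, gazette[name], m.start()))
    return sorted(hits, key=lambda h: h[2])

sentence = "S.E.C. chief Mary Shapiro to leave Washington in December."
for name, etype, start in tag(sentence):
    print(f"{etype:3}  {name}")
```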

Hadoop: The Lay of the Land

Filed under: Hadoop,MapReduce — Patrick Durusau @ 10:47 am

Hadoop: The Lay of the Land by Tom White.

From the post:

The core map-reduce framework for big data consists of several interlocking technologies. This first installment of our tutorial explains what Hadoop does and how the pieces fit together.

Big Data is in the news these days, and Apache Hadoop is one of the most popular platforms for working with Big Data. Hadoop itself is undergoing tremendous growth as new features and components are added, and for this reason alone, it can be difficult to know how to start working with it. In this three-part series, I explain what Hadoop is and how to use it, presenting a simple, hands-on examples that you can try yourself. First, though, let’s look at the problem that Hadoop was designed to solve.

Much later:

Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He is an engineer at Cloudera, a company set up to offer Hadoop tools, support, and training. He is the author of the best-selling O’Reilly book, Hadoop: The Definitive Guide.

If you are getting started with Hadoop or need a good explanation for others, start here.

I first saw this at: Learn How To Hadoop from Tom White in Dr. Dobb’s by Justin Kestelyn.
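
If you want something hands-on while reading Tom's series, the traditional first MapReduce job is a word count. With Hadoop Streaming the mapper and reducer can be two small Python scripts (my sketch, assuming a working Hadoop installation; the streaming jar location varies by distribution):

```python
#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every token on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t1")
```

```python
#!/usr/bin/env python
# reducer.py: sum counts per word (Hadoop delivers input sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))
```

You can test the pair locally with cat input.txt | python mapper.py | sort | python reducer.py before submitting it through the streaming jar.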

Four decades of US terror attacks listed and detailed

Filed under: Data — Patrick Durusau @ 4:49 am

Four decades of US terror attacks listed and detailed by Simon Rogers.

I was disappointed to read:

The horrors of the Boston Marathon explosions have focussed attention on terror attacks in the United States. But how common are they?

The Global Terrorism Database has recorded terror attacks across the world – with data from 1970 covering up to the end of 2011. It’s a huge dataset: over 104,000 attacks, including around 2,600 in the US – and its collection is funded by an agency of the US government: the Science and Technology Directorate of the US Department of Homeland Security through a Center of Excellence program based at the University of Maryland.

There’s a lot of methodology detailed on the site and several definitions of what is terrorism. At its root, the GTD says that terrorism is:

The threatened or actual use of illegal force and violence by a non-state actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation

I thought from the headlines that there would be a listing of four decades of US terror attacks against other peoples, countries and other groups.

A ponderous list that the US has labored long and hard to compile over the past several decades.

A data set that contrasts “terror” attacks in the US with US terrorist attacks against others would be more useful.

Starting just after WWII.

Data: Continuous vs. Categorical

Filed under: Data,Data Types — Patrick Durusau @ 4:29 am

Data: Continuous vs. Categorical by Robert Kosara.

From the post:

Data comes in a number of different types, which determine what kinds of mapping can be used for them. The most basic distinction is that between continuous (or quantitative) and categorical data, which has a profound impact on the types of visualizations that can be used.

The main distinction is quite simple, but it has a lot of important consequences. Quantitative data is data where the values can change continuously, and you cannot count the number of different values. Examples include weight, price, profits, counts, etc. Basically, anything you can measure or count is quantitative.

Categorical data, in contrast, is for those aspects of your data where you make a distinction between different groups, and where you typically can list a small number of categories. This includes product type, gender, age group, etc.

Both quantitative and categorical data have some finer distinctions, but I will ignore those for this posting. What is more important, is: why do those make a difference for visualization?

I like the use of visualization to reinforce the notion of difference between continuous and categorical data.

Makes me wonder about using visualization to explore the use of different data types for detecting subject sameness.

It may seem trivial to use the TMDM’s sameness of subject identifiers (simple string matching) to say two or more topics represent the same subject.

But what if subject identifiers match but other properties, say gender (modeled as an occurrence), do not?

Illustrating a mistake in the use of a subject identifier but also a weakness in reliance on a subject identifier (data type URI) for subject identity.

That data type relies only on string matching for identification purposes. Which may or may not agree with your subject sameness requirements.
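
A small illustration in Python (my sketch, not a TMDM implementation) of how a check on additional properties can flag exactly that situation:

```python
def same_subject(t1, t2, checked_properties=("gender",)):
    """
    Naive merge test: TMDM-style string equality of subject identifiers,
    plus a sanity check on selected occurrences. Returns (merge?, warnings).
    """
    shared = set(t1["subject_identifiers"]) & set(t2["subject_identifiers"])
    if not shared:
        return False, []
    warnings = [
        f"identifier match but '{p}' differs: {t1[p]!r} vs {t2[p]!r}"
        for p in checked_properties
        if p in t1 and p in t2 and t1[p] != t2[p]
    ]
    return True, warnings

topic_a = {"subject_identifiers": ["http://example.org/id/jordan"],
           "gender": "female"}
topic_b = {"subject_identifiers": ["http://example.org/id/jordan"],
           "gender": "male"}

merge, warnings = same_subject(topic_a, topic_b)
print("merge per string matching of identifiers:", merge)
for w in warnings:
    print("WARNING:", w)
```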

How Hadoop Works? HDFS case study

Filed under: Hadoop,HDFS — Patrick Durusau @ 4:14 am

How Hadoop Works? HDFS case study by Dane Dennis.

From the post:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. The Hadoop library contains two major components HDFS and MapReduce, in this post we will go inside each HDFS part and discover how it works internally.

Knowing how to use Hadoop is one level of expertise.

Knowing how Hadoop works takes you to the next level.

One where you can better adapt Hadoop to your needs.

Understanding HDFS is a step in that direction.

April 17, 2013

The TAO of Topic Maps in Spanish

Filed under: TMDM,Topic Maps — Patrick Durusau @ 7:24 pm

Steve Pepper sends word that The TAO of Topic Maps has been translated into Spanish!

I am very grateful to Maria Ramos of WebHostingHub.com for translating The TAO of Topic Maps into Spanish: http://www.webhostinghub.com/support/es/misc/mapas-tematicos.

Since the article contains a lot of technical terminology, it might be a good idea if some Spanish-speaking Topic Maps experts were to proof-read the translation. Please send any comments directly to Maria at mariar@webhostinghub.com with a cc: to me.

Other translations to note?

List of Machine Learning APIs

Filed under: Machine Learning,Programming — Patrick Durusau @ 3:39 pm

List of Machine Learning APIs

From the post:

Wikipedia defines Machine Learning as “a branch of artificial intelligence that deals with the construction and study of systems that can learn from data.”

Below is a compilation of APIs that have benefited from Machine Learning in one way or another, we truly are living in the future so strap into your rocketship and prepare for blastoff.

Interesting collection.

Worth reviewing for “assists” to human curators working with data.

And a rich hunting ground for head to head competitions against human curated data.

I first saw this at Alex Popescu’s List of Machine Learning APIs.

Tachyon

Filed under: HDFS,Storage,Tachyon — Patrick Durusau @ 3:26 pm

Tachyon by UC Berkeley AMP Lab.

From the webpage:

Tachyon is a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It offers up to 300 times higher throughput than HDFS, by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.

Since we haven’t quite arrived at in-memory computing just yet, you may want to review Tachyon.

The numbers are very impressive.

A New Perspective on Vertex Connectivity

Filed under: Graphs,Networks — Patrick Durusau @ 3:12 pm

A New Perspective on Vertex Connectivity by Keren Censor-Hillel, Mohsen Ghaffari, Fabian Kuhn.

Abstract:

Edge connectivity and vertex connectivity are two fundamental concepts in graph theory. Although by now there is a good understanding of the structure of graphs based on their edge connectivity, our knowledge in the case of vertex connectivity is much more limited. An essential tool in capturing edge connectivity are edge-disjoint spanning trees. The famous results of Tutte and Nash-Williams show that a graph with edge connectivity $\lambda$ contains $\floor{\lambda/2}$ edge-disjoint spanning trees. We present connected dominating set (CDS) partition and packing as tools that are analogous to edge-disjoint spanning trees and that help us to better grasp the structure of graphs based on their vertex connectivity. The objective of the CDS partition problem is to partition the nodes of a graph into as many connected dominating sets as possible. The CDS packing problem is the corresponding fractional relaxation, where CDSs are allowed to overlap as long as this is compensated by assigning appropriate weights. CDS partition and CDS packing can be viewed as the counterparts of the well-studied edge-disjoint spanning trees, focusing on vertex disjointedness rather than edge disjointness.

We constructively show that every $k$-vertex-connected graph with $n$ nodes has a CDS packing of size $\Omega(k/\log n)$ and a CDS partition of size $\Omega(k/\log^5 n)$. We prove that the $\Omega(k/\log n)$ CDS packing bound is existentially optimal.

Using CDS packing, we show that if vertices of a $k$-vertex-connected graph are independently sampled with probability $p$, then the graph induced by the sampled vertices has vertex connectivity $\tilde{\Omega}(kp^2)$. Moreover, using our $\Omega(k/\log n)$ CDS packing, we get a store-and-forward broadcast algorithm with optimal throughput in the networking model where in each round, each node can send one bounded-size message to all its neighbors.

Just in case you are interested in cutting edge (sorry) graph research.

Users can assure each other they are using the most popular graph software, or they can use the most powerful graph software.

I know which one I would choose.

How about you?

HowTo: Develop Your First Google Glass App [Glassware]

Filed under: Marketing,Topic Maps — Patrick Durusau @ 2:50 pm

HowTo: Develop Your First Google Glass App [Glassware] by Tarandeep Singh.

From the post:

Google has raised curtains off it’s Glass revealing detailed Tech Specs. Along with the specs came the much awaited Mirror API – The API for Glass apps.

So you had that killer app idea for Google Glass? Now its time for you to put those ideas into code!

The race is on to produce the first topic map based Google Glass App!

A response to a request can be a machine generated guess or a human curated answer.

Which one do you think users would prefer?

Practical tools for exploring data and models

Filed under: Data,Data Mining,Data Models,Exploratory Data Analysis — Patrick Durusau @ 2:37 pm

Practical tools for exploring data and models by Hadley Alexander Wickham. (PDF)

From the introduction:

This thesis describes three families of tools for exploring data and models. It is organised in roughly the same way that you perform a data analysis. First, you get the data in a form that you can work with; Section 1.1 introduces the reshape framework for restructuring data, described fully in Chapter 2. Second, you plot the data to get a feel for what is going on; Section 1.2 introduces the layered grammar of graphics, described in Chapter 3. Third, you iterate between graphics and models to build a succinct quantitative summary of the data; Section 1.3 introduces strategies for visualising models, discussed in Chapter 4. Finally, you look back at what you have done, and contemplate what tools you need to do better in the future; Chapter 5 summarises the impact of my work and my plans for the future.

The tools developed in this thesis are firmly based in the philosophy of exploratory data analysis (Tukey, 1977). With every view of the data, we strive to be both curious and sceptical. We keep an open mind towards alternative explanations, never believing we have found the best model. Due to space limitations, the following papers only give a glimpse at this philosophy of data analysis, but it underlies all of the tools and strategies that are developed. A fuller data analysis, using many of the tools developed in this thesis, is available in Hobbs et al. (To appear).

Has a focus on R tools, including ggplot2, which builds on Wilkinson’s The Grammar of Graphics.

The “…never believing we have found the best model” approach works for me!

You?
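
The reshape framework in the thesis is an R package, but the wide-to-long restructuring it formalizes has analogues elsewhere. A pandas sketch of mine (not Wickham's code), for readers who live in Python:

```python
import pandas as pd

# A "wide" table: one row per subject, one column per measurement.
wide = pd.DataFrame({
    "subject": ["a", "b", "c"],
    "weight_before": [82.0, 67.5, 90.1],
    "weight_after":  [79.4, 66.0, 88.3],
})

# Melt to "long" form: one row per (subject, variable, value) triple,
# the shape most plotting and modelling tools prefer.
long = wide.melt(id_vars="subject", var_name="measurement", value_name="kg")
print(long)

# And back again, pivoting long to wide.
print(long.pivot(index="subject", columns="measurement", values="kg"))
```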

I first saw this at Data Scholars.

Requirements and Brown M&M’s Clauses

Filed under: Requirements — Patrick Durusau @ 2:10 pm

Use a No Brown M&M’s Clause by Jim Harris.

From the post:

There is a popular story about David Lee Roth exemplifying the insane demands of a power-mad celebrity by insisting that Van Halen’s contracts with concert promoters contain a clause that a bowl of M&M’s has to be provided backstage with every single brown candy removed, upon pain of forfeiture of the show, with full compensation to the band.

At least once, Van Halen followed through, peremptorily canceling a show in Colorado when Roth found some brown M&M’s in his dressing room – a clear violation of the No Brown M&M’s Clause.

However, in his book The Checklist Manifesto: How to Get Things Right, Atul Gawande recounted the explanation that Roth provided in his memoir Crazy from the Heat. “Van Halen was the first band to take huge productions into tertiary, third-level markets. We’d pull up with nine eighteen-wheeler trucks, full of gear, where the standard was three trucks, max. And there were many, many technical errors – whether it was the girders couldn’t support the weight, or the flooring would sink in, or the doors weren’t big enough to move the gear through.”

Therefore, because there was so much equipment, requiring so much coordination to make their concerts function smoothly and safely, Van Halen’s contracts were massive. So, just as a little test to see if the contract had actually been read by the concert promoters, buried somewhere in the middle would be article 126: the infamous No Brown M&M’s Clause.

I would not use the same clause, since IT consultants will simply scan for the M&M’s clause and delegate someone to handle it.

But it would be a good idea to insert some similar requirement into large requirements documents, covering meetings, report binding, etc.

I don’t know where David Lee Roth got the idea but dictionary publishers do something similar.

Lists of words and their definitions cannot be copyrighted, for obvious reasons. We don’t want one dictionary to have a monopoly on the definition of “one meter,” for example.

But dictionary publishers make up words and definitions for those words and include them in their dictionaries. Being original works, those made-up entries are subject to copyright.

How much you need in terms of requirements will vary.

What won’t vary is your need to know the consultants have at least read your requirements.

Use a Brown M&M’s clause; you won’t regret it.

An Introduction to Named Entity Recognition…

Filed under: Entity Extraction,Named Entity Mining,Natural Language Processing — Patrick Durusau @ 1:46 pm

An Introduction to Named Entity Recognition in Natural Language Processing – Part 1 by Benjamin Bengfort.

From the post:

Abstract:

The task of identifying proper names of people, organizations, locations, or other entities is a subtask of information extraction from natural language documents. This paper presents a survey of techniques and methodologies that are currently being explored to solve this difficult subtask. After a brief review of the challenges of the task, as well as a look at previous conventional approaches, the focus will shift to a comparison of stochastic and gazetteer based approaches. Several machine-learning approaches are identified and explored, as well as a discussion of knowledge acquisition relevant to recognition. This two-part white paper will show that applications that require named entity recognition will be served best by some combination of knowledge-based and non-deterministic approaches.

Introduction:

In school we were taught that a proper noun was “a specific person, place, or thing,” thus extending our definition from a concrete noun. Unfortunately, this seemingly simple mnemonic masks an extremely complex computational linguistic task—the extraction of named entities, e.g. persons, organizations, or locations from corpora (1). More formally, the task of Named Entity Recognition and Classification can be described as the identification of named entities in computer readable text via annotation with categorization tags for information extraction.

Not only is named entity recognition a subtask of information extraction, but it also plays a vital role in reference resolution, other types of disambiguation, and meaning representation in other natural language processing applications. Semantic parsers, part of speech taggers, and thematic meaning representations could all be extended with this type of tagging to provide better results. Other, NER-specific, applications abound including question and answer systems, automatic forwarding, textual entailment, and document and news searching. Even at a surface level, an understanding of the named entities involved in a document provides much richer analytical frameworks and cross-referencing.

Named entities have three top-level categorizations according to DARPA’s Message Understanding Conference: entity names, temporal expressions, and number expressions (2). Because the entity names category describes the unique identifiers of people, locations, geopolitical bodies, events, and organizations, these are usually referred to as named entities and as such, much of the literature discussed in this paper focuses solely on this categorization, although it is easy to imagine extending the proposed systems to cover the full MUC-7 task. Further, the CoNLL-2003 Shared Task, upon which the standard of evaluation for such systems is based, only evaluates the categorization of organizations, persons, locations, and miscellaneous named entities. For example:

(ORG S.E.C.) chief (PER Mary Shapiro) to leave (LOC Washington) in December.

This sentence contains three named entities that demonstrate many of the complications associated with named entity recognition. First, S.E.C. is an acronym for the Securities and Exchange Commission, which is an organization. The two words “Mary Shapiro” indicate a single person, and Washington, in this case, is a location and not a name. Note also that the token “chief” is not included in the person tag, although it very well could be. In this scenario, it is ambiguous if “S.E.C. chief Mary Shapiro” is a single named entity, or if multiple, nested tags would be required.
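
As a concrete aside, sentences like that one are commonly encoded as token/tag pairs using the IOB (inside-outside-begin) convention, roughly as below. This is my illustration of a common representation, not necessarily the exact format used in the MUC-7 or CoNLL-2003 evaluations:

```python
# The S.E.C. example as IOB-tagged tokens: B- marks the first token of an
# entity, I- a continuation, O a token outside any entity.
tagged = [
    ("S.E.C.",     "B-ORG"),
    ("chief",      "O"),
    ("Mary",       "B-PER"),
    ("Shapiro",    "I-PER"),
    ("to",         "O"),
    ("leave",      "O"),
    ("Washington", "B-LOC"),
    ("in",         "O"),
    ("December",   "O"),
    (".",          "O"),
]
for token, label in tagged:
    print(f"{token:12} {label}")
```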

Nice introduction to the area and ends with a great set of references.

Looking forward to part 2!

In-Memory Computing

Filed under: Computation,Computer Science,Programming — Patrick Durusau @ 1:23 pm

Why In-Memory Computing Is Cheaper And Changes Everything by Timo Elliott.

From the post:

What is the difference? Database engines today do I/O. So if they want to get a record, they read. If they want to write a record, they write, update, delete, etc. The application, which in this case is a DBMS, thinks that it’s always writing to disk. If that record that they’re reading and writing happens to be in flash, it will certainly be faster, but it’s still reading and writing. Even if I’ve cached it in DRAM, it’s the same thing: I’m still reading and writing.

What we’re talking about here is the actual database is physically in in-memory. I’m doing a fetch to get data and not a read. So the logic of the database changes. That’s what in-memory is about as opposed to the traditional types of computing.

Why is it time for in-memory computing?

Why now? The most important thing is this: DRAM costs are dropping about 32% every 12 months. Things are getting bigger, and costs are getting lower. If you looked at the price of a Dell server with a terabyte of memory three years ago, it was almost $100,000 on their internet site. Today, a server with more cores — sixteen instead of twelve — and a terabyte of DRAM, costs less than $40,000.

In-memory results in lower total cost of ownership

So the costs of this stuff is not outrageous. For those of you who don’t understand storage, I always get into this argument: the total cost of acquisition of an in-memory system is likely higher than a storage system. There’s no question. But the total cost of TCO is lower – because you don’t need storage people to manage memory. There are no LUNs [logical unit numbers]: all the things your storage technicians do goes away.

People cost more than hardware and software – a lot more. So the TCO is lower. And also, by the way, power: one study IBM did showed that memory is 99% less power than spinning disks. So unless you happen to be an electric company, that’s going to mean a lot to you. Cooling is lower, everything is lower.
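
The arithmetic in that quote holds up: a 32% annual decline compounds quickly, as a two-line check shows (my numbers plugged into Timo's claim, not his calculation):

```python
# If DRAM prices fall 32% every 12 months, a $100,000 configuration from
# three years ago should now cost roughly:
price, annual_decline, years = 100_000, 0.32, 3
projected = price * (1 - annual_decline) ** years
print(f"${projected:,.0f}")   # about $31,000, consistent with "less than $40,000"
```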

Timo makes a good case for in-memory computing but I have a slightly different question.

If both data and program are stored in memory, where is the distinction between program and data?

Or in topic map terms, can’t we then speak about subject identities in the program and even in data at particular points in the program?

That could be a very powerful tool for controlling program behavior and re-purposing data at different stages of processing.

Categories, What’s the Point?

Filed under: Category Theory,Mathematics — Patrick Durusau @ 12:28 pm

Categories, What’s the Point? by Jeremy Kun.

From the post:

Perhaps primarily due to the prominence of monads in the Haskell programming language, programmers are often curious about category theory. Proponents of Haskell and other functional languages can put category-theoretic concepts on a pedestal or in a mexican restaurant, and their benefits can seem as mysterious as they are magical. For instance, the most common use of a monad in Haskell is to simulate the mutation of immutable data. Others include suspending and backtracking computations, and even untying tangled rope.

Category theory is often mocked (or praised) as the be-all and end-all of mathematical abstraction, and as such (and for other reasons I’ve explored on this blog) many have found it difficult to digest and impossible to master. However, in truth category theory arose from a need for organizing mathematical ideas based on their shared structure. In this post, I want to give a brief overview of what purpose category theory serves within mathematics, and emphasize what direction we’ll be going with it on this blog.

We should also note (writing this after the fact), that this article is meant to be a motivation to our future series on category theory. It is very difficult to explain what category theory is without going into very specific details, but we can explain by analogy what category theory achieves for us. That is the goal of this post.

I rather like the line:

However, in truth category theory arose from a need for organizing mathematical ideas based on their shared structure.

Which is one of the reasons why I keep trying to master category theory, for the ability to organize subject identifications based on their shared structures.

Not necessary for all subject identity cases but if topic maps extend into subjects like math, physics, etc., then it may be quite useful.

April 16, 2013

Marathon Attendees Responsible for Not Stopping Bombing

Filed under: Security — Patrick Durusau @ 7:33 pm

National security begins with you! by Phrantceena Halres.

From the post:

The Boston Marathon incident is an unfortunate, tragic reminder that, as citizens, we must always remain aware, alert and diligent! Individuals must hone a “sixth sense” that helps them detect, anticipate and plan for danger in advance of it happening. People innately know when something is wrong, dangerous, or just “off,” and in those situations it is imperative to take action rather than just brushing it off. 

As a society we’ve been de-sensitized to our personal responsibility to secure our own surroundings. We need to teach every citizen that national security begins with you! Each of us has to take responsibility for what happens in our community and nation at large, and not just ask “what can I do better” — but take tactical action. This act of terrorism was meticulously planned and the culprits had the audacity and arrogance to commit their crime in the middle of day. I assert that somewhere along the line, someone saw something that could have been reported, which may very well have prevented this tragedy. 

With each act of terrorism, whether originating from foreign lands or home-grown, we tend to point the finger over who is not doing their job to keep us safe — but then complain about why we can’t carry more than three ounces of sunscreen in our carry-on bags catching a flight for vacation.

Given the location of these explosives, it is clear that Boston should have been on higher alert and helping residents and visitors do the same, to include educating them about how to stay S.A.F.E.: smart, aware, focused and equipped.

We’ve become a reactionary society…it shouldn’t come down to the honorable and excellent first responders responding to crime, it should be about understanding how pro-active security training (for both individuals and industry professionals) helps prevent a crime from being committed in the first place.

When I read:

I assert that somewhere along the line, someone saw something that could have been reported, which may very well have prevented this tragedy.

And that the attendees weren’t:

S.A.F.E.: smart, aware, focused and equipped.

It sounds to me like the Boston marathon attendees fell down on their job to prevent the attack.

Complete and utter paranoid nonsense.

You have no obligation to be a free and voluntary force of Stasi informants.

Nor does the government have the ability to prevent every possible mishap.

What can be done is to care for the injured and assist, if possible, in finding those responsible.

Even more importantly, to live free of fear and suspicion of others.

Otherwise, the terrorists and their counterparts like Phrantceena Halres will have won.

Let’s disappoint them.

The Costs and Profits of Poor Data Quality

Filed under: Data,Data Quality — Patrick Durusau @ 7:10 pm

The Costs and Profits of Poor Data Quality by Jim Harris.

From the post:

Continuing the theme of my two previous posts, which discussed when it’s okay to call data quality as good as it needs to get and when perfect data quality is necessary, in this post I want to briefly discuss the costs — and profits — of poor data quality.

Loraine Lawson interviewed Ted Friedman of Gartner Research about How to Measure the Cost of Data Quality Problems, such as the costs associated with reduced productivity, redundancies, business processes breaking down because of data quality issues, regulatory compliance risks, and lost business opportunities. David Loshin blogged about the challenge of estimating the cost of poor data quality, noting that many estimates, upon close examination, seem to rely exclusively on anecdotal evidence.

As usual, Jim does a very good job of illustrating costs and profits from poor data quality.

I have a slightly different question:

What would you need to know about data in order to spot that it is of poor quality?

It is one thing to find out after a spaceship crashes that poor data quality was responsible, but it would be better to spot the error beforehand. As in before the launch.

The answer is probably data-specific, but are there any general types of information that would help you spot poor-quality data?

Before you are 1,000 meters off the lunar surface. 😉
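
My own first cut at “general types of information”: completeness, validity against declared ranges, and duplication can be checked on almost any tabular data before launch. A minimal sketch, with hypothetical field names:

```python
def profile(rows, ranges=None):
    """
    Generic pre-flight checks for a list of dict records:
    missing values, out-of-range numbers, and duplicate records.
    `ranges` maps a field name to an allowed (low, high) interval.
    """
    ranges = ranges or {}
    report = {"missing": {}, "out_of_range": {}, "duplicates": 0}
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        for field, value in row.items():
            if value in (None, ""):
                report["missing"][field] = report["missing"].get(field, 0) + 1
            lo_hi = ranges.get(field)
            if lo_hi and isinstance(value, (int, float)) and not (lo_hi[0] <= value <= lo_hi[1]):
                report["out_of_range"][field] = report["out_of_range"].get(field, 0) + 1
    return report

data = [
    {"altitude_m": 1200, "fuel_kg": 350},
    {"altitude_m": -50,  "fuel_kg": None},     # bad altitude, missing fuel
    {"altitude_m": 1200, "fuel_kg": 350},      # duplicate
]
print(profile(data, ranges={"altitude_m": (0, 20000)}))
```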

Anniversary! Microsoft Open Technologies, Inc. (MS Open Tech)

Filed under: Microsoft,Open Source — Patrick Durusau @ 6:58 pm

You’re invited to help us celebrate an unlikely pairing in open source by Gianugo Rabellino.

From the post:

We are just days away from reaching a significant milestone for our team and the open source and open standards communities: the first anniversary of Microsoft Open Technologies, Inc. (MS Open Tech) — a wholly owned subsidiary of Microsoft.

We can’t think of anyone better to celebrate with than YOU, the members of the open source and open standards community and technology industry who have helped us along on our adventure over the past year.

We’d like to extend an open (pun intended!) invitation to celebrate with us on April 25, and share your burning questions on the future of the subsidiary, open source at-large and how MS Open Tech can better connect with the developer community to present even more choice and freedom.

I’ll be proud to share the stage with our amazing MS Open Tech leadership team: Jean Paoli, President; Kamaljit Bath, Engineering team leader; and Paul Cotton, Standards team leader and Co-Chair of the W3C HTML Working Group.

You have three choices:

  1. You can be a hard ass and stay home to “punish” MS for real and imagined slights and sins over the years. (You won’t be missed.)
  2. You can be obnoxious and attend, doing your best to not have a good time and trying to keep others from having a good time. (Better to stay home.)
  3. You can attend, have a good time, ask good questions, encourage more innovation and support by Microsoft for the open source and open standards communities.

Microsoft is going to be a major player in whatever solution to semantic interoperability catches on.

If that is topic maps, then Microsoft will be into topic maps.

I would prefer that be under the open source/open standards banner.

Distance prevents me from attending but I will be there in spirit!

Happy Anniversary to Microsoft Open Technologies, Inc.!

Hacking Secret Ciphers with Python

Filed under: Cryptography,Python — Patrick Durusau @ 6:40 pm

“Hacking Secret Ciphers with Python” Released by Al Sweigart.

From the post:

My third book, Hacking Secret Ciphers with Python, is finished. It is free to download under a Creative Commons license, and available for purchase as a physical book on Amazon for $25 (which qualifies it for free shipping). This book is aimed at people who have no experience programming or with cryptography. The book goes through writing Python programs that not only implement several ciphers but also can hack these ciphers.

100% of the proceeds from the book sales will be donated to the Electronic Frontier Foundation, Creative Commons, and The Tor Project.

This looks like fun!
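
In the spirit of the book (this snippet is mine, not Sweigart's), a Caesar cipher and the brute-force attack on it both fit in a few lines:

```python
import string

ALPHABET = string.ascii_uppercase

def caesar(text, key):
    """Shift each letter by `key` positions; leave everything else alone."""
    shifted = ALPHABET[key:] + ALPHABET[:key]
    return text.upper().translate(str.maketrans(ALPHABET, shifted))

secret = caesar("MEET ME AT THE LIBRARY", 13)
print("ciphertext:", secret)

# "Hacking" a Caesar cipher is just trying all 26 keys and eyeballing
# (or scoring) the output, which is the book's larger point about why
# simple ciphers fail.
for key in range(26):
    print(key, caesar(secret, -key % 26))
```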

Unlike the secrecy cultists in cybersecurity, I think new ideas and insights into cryptography can come from anyone who spends time working on it.

To paraphrase Buffalo Springfield, “…increase the government’s paranoia like looking in a mirror and seeing the public working on cryptography….”

I never claimed to be a song writer. 😉

PS: Download a copy and buy a hard copy to give to someone.

Or donate the hard copy to your local library!
