Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 10, 2013

Mapping the census…

Filed under: Census Data,Mapping,Maps,R,Transparency — Patrick Durusau @ 4:20 pm

Mapping the census: how one man produced a library for all by Simon Rogers.

From the post:

The census is an amazing resource – so full of data it’s hard to know where to begin. And increasingly where to begin is by putting together web-based interactives – like this one on language and this on transport patterns that we produced this month.

But one academic is taking everything back to basics – using some pretty sophisticated techniques. Alex Singleton, a lecturer in geographic information science (GIS) at Liverpool University has used R to create the open atlas project.

Singleton has basically produced a detailed mapping report – as a PDF and vectored images – on every one of the local authorities of England & Wales. He automated the process and has provided the code for readers to correct and do something with. In each report there are 391 pages, each with a map. That means, for the 326 local authorities in England & Wales, he has produced 127,466 maps.

Check out Simon’s post to see why Singleton has undertaken such a task.

Question: Was the 2011 census more “transparent,” or “useful” after Singleton’s work or before?

I would say more “transparent” after Singleton’s work.

You?

Basic planning algorithm

Filed under: Constraint Programming,Graphs,Networks,Searching — Patrick Durusau @ 4:02 pm

Basic planning algorithm by Ricky Ho.

From the post:

Planning can be thought of as a graph search problem, where each node in the graph represents a possible “state” of reality. A directed edge from nodeA to nodeB represents an “action” that is available to transition stateA to stateB.

Planning can be thought of as another form of constraint optimization problem, quite different from the one I described in my last blog post. In the planning case, the constraint is the goal state we want to achieve, and a sequence of actions needs to be found to meet that constraint. The sequence of actions will incur cost, and our objective is to minimize the cost associated with our chosen actions.
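To make the graph-search framing concrete, here is a minimal uniform-cost search over a toy state graph. The states, actions and costs are invented purely for illustration; Ricky's post covers more sophisticated planners.

import heapq

def plan(start, goal, actions):
    """Uniform-cost search: cheapest action sequence from start to goal, or None."""
    frontier = [(0, start, [])]              # (cost so far, state, actions taken)
    best = {start: 0}
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return cost, path
        for action, nxt, step_cost in actions.get(state, []):
            new_cost = cost + step_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt, path + [action]))
    return None

# Toy state graph: get a package from A to C.
actions = {
    "at-A": [("truck-A-to-B", "at-B", 4), ("plane-A-to-C", "at-C", 10)],
    "at-B": [("truck-B-to-C", "at-C", 3)],
}
print(plan("at-A", "at-C", actions))         # (7, ['truck-A-to-B', 'truck-B-to-C'])

The merging-cost idea below has the same shape: treat each merge as an action with a cost and let the engine search for the cheapest acceptable answer.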

Makes me curious about topic maps that perform merging based on the “cost” of the merge.

That is, upon a query, an engine might respond with a merger of the topics found on one node, without requesting data from remote nodes.

I am thinking in particular of the network performance issues we all experience, waiting for ads to download, for example.

Depending upon my requirements, I should be able to evaluate those costs and avoid them.

I may not have the most complete information but that may not be a requirement for some use cases.

Storyboarding in the Software Design Process

Filed under: Design,Interface Research/Design,Storyboarding — Patrick Durusau @ 3:44 pm

Storyboarding in the Software Design Process by Ambrose Little

From the post:

Using storyboards in software design can be difficult because of some common challenges and drawbacks to the tools we have. The good news is that there’s a new, free tool that tries to address many of these issues. But before I get into that, let’s revisit the value of using storyboards (and stories in general) in software design.

Using stories in some form or another is a well-established practice in software design, so much so that there are many meanings of the term “stories.” For instance, in agile processes, there is a concept of “user stories,” which are very basic units of expressing functional requirements: “As a user, I want to receive notifications when new applications are submitted.”

In user experience design, these stories take on more life through the incorporation of richer user and usage contexts and personas: real people in real places doing real things, not just some abstract, feature-oriented description of functionality that clothes itself in a generic “user.”

In their book, Storytelling for User Experience, Whitney Quesenbery and Kevin Brooks offer these benefits of using stories in software design:

  • They help us gather and share information about users, tasks, and goals.
  • They put a human face on analytic data.
  • They can spark new design concepts and encourage collaboration and innovation.
  • They are a way to share ideas and create a sense of shared history and purpose.
  • They help us understand the world by giving us insight into people who are not just like us.
  • They can even persuade others of the value of our contribution.

Whatever they’re called, stories are an effective and inexpensive way to capture, relate, and explore experiences in the design process.

The benefit of:

They help us understand the world by giving us insight into people who are not just like us.

was particularly interesting.

The storyboarding post was written for UI design but constructing a topic map could benefit from insight into “people who are not just like us.”

I lean toward the physical/hard-copy version of storyboarding, mostly so that technology, and its inevitable limitations, doesn’t get in the way.

What I want to capture are the insights of the users, not the insights of the users as limited by the software.

Or better yet, the insights of the users unlimited by my skill with the software.

Not to neglect the use of storyboarding for software UI/UX purposes as well.

The Power of Semantic Diversity

Filed under: Bioinformatics,Biology,Contest,Crowd Sourcing — Patrick Durusau @ 3:10 pm

Prize-based contests can provide solutions to computational biology problems by Karim R Lakhani, et al. (Nature Biotechnology 31, 108–111 (2013) doi:10.1038/nbt.2495)

From the article:

Advances in biotechnology have fueled the generation of unprecedented quantities of data across the life sciences. However, finding analysts who can address such ‘big data’ problems effectively has become a significant research bottleneck. Historically, prize-based contests have had striking success in attracting unconventional individuals who can overcome difficult challenges. To determine whether this approach could solve a real big-data biologic algorithm problem, we used a complex immunogenomics problem as the basis for a two-week online contest broadcast to participants outside academia and biomedical disciplines. Participants in our contest produced over 600 submissions containing 89 novel computational approaches to the problem. Thirty submissions exceeded the benchmark performance of the US National Institutes of Health’s MegaBLAST. The best achieved both greater accuracy and speed (1,000 times greater). Here we show the potential of using online prize-based contests to access individuals without domain-specific backgrounds to address big-data challenges in the life sciences.

….

Over the last ten years, online prize-based contest platforms have emerged to solve specific scientific and computational problems for the commercial sector. These platforms, with solvers in the range of tens to hundreds of thousands, have achieved considerable success by exposing thousands of problems to larger numbers of heterogeneous problem-solvers and by appealing to a wide range of motivations to exert effort and create innovative solutions [18, 19]. The large number of entrants in prize-based contests increases the probability that an ‘extreme-value’ (or maximally performing) solution can be found through multiple independent trials; this is also known as a parallel-search process [19]. In contrast to traditional approaches, in which experts are predefined and preselected, contest participants self-select to address problems and typically have diverse knowledge, skills and experience that would be virtually impossible to duplicate locally [18]. Thus, the contest sponsor can identify an appropriate solution by allowing many individuals to participate and observing the best performance. This is particularly useful for highly uncertain innovation problems in which prediction of the best solver or approach may be difficult and the best person to solve one problem may be unsuitable for another [19].

An article that merits wider reading than it is likely to get from behind a pay-wall.

A semantically diverse universe of potential solvers is more effective than a semantically monotone group of selected experts.

An indicator of what to expect from the monotone logic of the Semantic Web.

Good for scheduling tennis matches with Tim Berners-Lee.

For more complex tasks, rely on semantically diverse groups of humans.

I first saw this at: Solving Big-Data Bottleneck: Scientists Team With Business Innovators to Tackle Research Hurdles.

Lex Machina

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 2:44 pm

Lex Machina: IP Litigation and analytics

From the about page:

Every day, Lex Machina’s crawler extracts data and documents from PACER, all 94 District Court sites, ITC’s EDIS site and the PTO site.

The crawler automatically captures every docket event and downloads key District Court case documents and every ITC document. It converts the documents by optical character recognition (OCR) to searchable text and stores each one as a PDF file.

When the crawler encounters an asserted or cited patent, it fetches information about that patent from the PTO site.

Next, the crawler invokes Lex Machina’s state-of-the-art natural language processing (NLP) technology, which includes Lexpressions™, a proprietary legal text classification engine. The NLP technology classifies cases and dockets and resolves entity names. Attorney review of docket and case classification, patents and outcomes ensures high-quality data. The structured text indexer then orders all the data and stores it for search.

Lex Machina’s web-based application enables users to run search queries that deliver easy access to the relevant docket entries and documents. It also generates lists that can be downloaded as PDF files or spreadsheet-ready CSV files.

Finally, the system generates a daily patent litigation update email, which provides links to all new patent cases and filings.
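None of Lex Machina's code is public, but the description above is a conventional document pipeline: crawl, convert, classify, index. A toy sketch of that shape, with every name, record and classification rule invented for illustration:

from dataclasses import dataclass

@dataclass
class Document:
    docket_id: str
    raw_text: str          # stands in for OCR output
    case_type: str = ""

def crawl(sources):
    """Stand-in crawler: pull new filings from each (fake) source."""
    for source in sources:
        yield from source

def classify(doc):
    """Toy classifier: tag a filing as 'patent' if it cites a patent number."""
    doc.case_type = "patent" if "U.S. Patent" in doc.raw_text else "other"
    return doc

def index(docs):
    """Toy structured index: case type -> list of docket ids."""
    idx = {}
    for doc in docs:
        idx.setdefault(doc.case_type, []).append(doc.docket_id)
    return idx

pacer = [Document("2:13-cv-00001", "Complaint citing U.S. Patent 7,654,321")]
itc   = [Document("337-TA-900",   "Notice of institution of investigation")]

print(index(classify(d) for d in crawl([pacer, itc])))
# {'patent': ['2:13-cv-00001'], 'other': ['337-TA-900']}

The interesting engineering is in the attorney review and entity resolution steps the quote mentions, not in the skeleton above.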

Lex Machina does not:

  • Index the World Wide Web
  • Index legal cases around the world in every language
  • Index all legal cases in the United States
  • Index all state courts in the United States
  • Index all federal court cases in the United States

Instead, Lex Machina chose a finite legal domain, patents, that has a finite vocabulary and range of data sources.

Working in that finite domain, Lex Machina has produced a high quality data product of interest to legal professionals and lay persons alike.

I intend to leave conquering world hunger, ignorance and poor color coordination of clothing to Bill Gates.

You?

I first saw this at Natural Language Processing in patent litigation: Lex Machina by Junling Hu.

Setting up Java GraphChi development environment…

Filed under: GraphChi,Graphs,Java,Machine Learning — Patrick Durusau @ 2:17 pm

Setting up Java GraphChi development environment – and running sample ALS by Danny Bickson.

From the post:

As you may know, our GraphChi collaborative filtering toolkit in C is becoming more and more popular. Recently, Aapo Kyrola did a great effort for porting GraphChi C into Java and implementing more methods on top of it.

In this blog post I explain how to setup GraphChi Java development environment in Eclipse and run alternating least squares algorithm (ALS) on a small subset of Netflix data.

Based on the level of user feedback I am going to receive for this blog post, we will consider porting more methods to Java. So email me if you are interested in trying it out.
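If ALS is new to you, the core of the algorithm is short: hold the item factors fixed and solve a regularized least-squares problem for each user, then swap roles, and repeat. A dense NumPy sketch of that loop (GraphChi's version is out-of-core and far more elaborate; the toy ratings matrix, rank and regularization below are arbitrary):

import numpy as np

def als(R, mask, k=2, lam=0.1, iters=20):
    """Factor R ~= U @ V.T by alternating regularized least squares."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    U, V = rng.normal(size=(m, k)), rng.normal(size=(n, k))
    for _ in range(iters):
        for i in range(m):                    # re-solve each user's factors
            obs = mask[i] == 1
            A = V[obs].T @ V[obs] + lam * np.eye(k)
            U[i] = np.linalg.solve(A, V[obs].T @ R[i, obs])
        for j in range(n):                    # re-solve each item's factors
            obs = mask[:, j] == 1
            A = U[obs].T @ U[obs] + lam * np.eye(k)
            V[j] = np.linalg.solve(A, U[obs].T @ R[obs, j])
    return U, V

# Tiny fake ratings matrix: 4 users x 3 items, 0 means "not rated".
R = np.array([[5, 3, 0], [4, 0, 1], [0, 2, 5], [1, 0, 4]], dtype=float)
U, V = als(R, (R > 0).astype(int))
print(np.round(U @ V.T, 1))    # predicted ratings; observed entries roughly recovered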

If you are interested in more machine learning methods in Java, here’s your chance!

Not to mention your interest in graph based solutions.

Why Most BI Programs Under-Deliver Value

Filed under: Business Intelligence,Data Integration,Data Management,Integration,Semantics — Patrick Durusau @ 1:52 pm

Why Most BI Programs Under-Deliver Value by Steve Dine.

From the post:

Business intelligence initiatives have been undertaken by organizations across the globe for more than 25 years, yet according to industry experts between 60 and 65 percent of BI projects and programs fail to deliver on the requirements of their customers.

The impact of this failure reaches far beyond the project investment, from unrealized revenue to increased operating costs. While the exact reasons for failure are often debated, most agree that a lack of business involvement, long delivery cycles and poor data quality lead the list. After all this time, why do organizations continue to struggle with delivering successful BI? The answer lies in the fact that they do a poor job at defining value to the customer and how that value will be delivered given the resource constraints and political complexities in nearly all organizations.

BI is widely considered an umbrella term for data integration, data warehousing, performance management, reporting and analytics. For the vast majority of BI projects, the road to value definition starts with a program or project charter, which is a document that defines the high level requirements and capital justification for the endeavor. In most cases, the capital justification centers on cost savings rather than value generation. This is due to the level of effort required to gather and integrate data across disparate source systems and user developed data stores.

As organizations mature, the number of applications that collect and store data increase. These systems usually contain few common unique identifiers to help identify related records and are often referred to as data silos. They also can capture overlapping data attributes for common organizational entities, such as product and customer. In addition, the data models of these systems are usually highly normalized, which can make them challenging to understand and difficult for data extraction. These factors make cost savings, in the form of reduced labor for data collection, easy targets. Unfortunately, most organizations don’t eliminate employees when a BI solution is implemented; they simply work on different, hopefully more value added, activities. From the start, the road to value is based on a flawed assumption and is destined to under deliver on its proposition.

This post merits a close read, several times.

In particular I like the focus on delivery of value to the customer.

Err, that would be the person paying you to do the work.

Steve promises a follow-up on “lean BI” that focuses on delivering more value than it costs to deliver.

I am inherently suspicious of “lean” or “agile” approaches. I sat on a committee that was assured by three programmers they had improved upon IBM’s programming methodology but declined to share the details.

Their requirements document for a content management system, to be constructed on top of subversion, was a paragraph in an email.

Fortunately the committee prevailed upon management to tank the project. The programmers persist, management being unable or unwilling to correct past mistakes.

I am sure there are many agile/lean programming projects that deliver well documented, high quality results.

But I don’t start with the assumption that agile/lean or other methodology projects are well documented.

That is a question of fact. One that can be answered.

Refusal to answer, citing time or resource constraints, is a very bad sign.

I first saw this in a top ten tweets list from KDNuggets.

Call for KDD Cup Competition Proposals

Filed under: Contest,Data Mining,Dataset,Knowledge Discovery — Patrick Durusau @ 1:17 pm

Call for KDD Cup Competition Proposals

From the post:

Please let us know if you are interested in being considered for the 2013 KDD Cup Competition by filling out the form below.

This is the official call for proposals for the KDD Cup 2013 competition. The KDD Cup is the well known data mining competition of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD-2013 conference will be held in Chicago from August 11 – 14, 2013. The competition will last between 6 and 8 weeks and the winners should be notified by end-June. The winners will be announced in the KDD-2013 conference and we are planning to run a workshop as well.

A good competition task is one that is practically useful, scientifically or technically challenging, can be done without extensive application domain knowledge, and can be evaluated objectively. Of particular interest are non-traditional tasks/data that require novel techniques and/or thoughtful feature construction.

Proposals should involve data and a problem whose successful completion will result in a contribution of some lasting value to a field or discipline. You may assume that Kaggle will provide the technical support for running the contest. The data needs to be available no later than mid-March.

If you have initial questions about the suitability of your data/problem feel free to reach out to claudia.perlich [at] gmail.com.

Do you have:

non-traditional tasks/data that require[s] novel techniques and/or thoughtful feature construction?

Is collocation of information on the basis of multi-dimensional subject identity a non-traditional task?

Does extraction of multiple dimensions of a subject identity from users require novel techniques?

If so, what data sets would you suggest using in this challenge?

I first saw this at: 19th ACM SIGKDD Knowledge Discovery and Data Mining Conference.

How Neo4j beat Oracle Database

Filed under: Graphs,Neo4j,Networks,Oracle — Patrick Durusau @ 11:56 am

Neo Technology execs: How Neo4j beat Oracle Database by Paul Krill.

From the post:

Neo Technology, which was formed in 2007, offers Neo4J, a Java-based open source NoSQL graph database. With a graph database, which can search social network data, connections between data are explored. Neo4j can solve problems that require repeated network probing (the database is filled with nodes, which are then linked), and the company stresses Neo4j’s high performance. InfoWorld Editor at Large Paul Krill recently talked with Neo CEO Emil Eifrem and Philip Rathle, Neo senior director of products, about the importance of graph database technology as well as Neo4j’s potential in the mobile space. Eifrem also stressed his confidence in Java, despite recent security issues affecting the platform.

InfoWorld: Graph database technology is not the same as NoSQL, is it?

Eifrem: NoSQL is actually four different types of databases: There’s key value stores, like Amazon DynamoDB, for example. There’s column-family stores like Cassandra. There’s document databases like MongoDB. And then there’s graph databases like Neo4j. There are actually four pillars of NoSQL, and graph databases is one of them. Cisco is building a master data management system based on Neo4j, and this is actually our first Fortune 500 customer. They found us about two years ago when they tried to build this big, complex hierarchy inside of Oracle RAC. In Oracle RAC, they had response time in minutes, and then when they replaced it [with] Neo4j, they had response times in milliseconds. (emphasis added)
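The quote doesn't say what Cisco's hierarchy looked like, but the general pattern is easy to illustrate. A variable-depth hierarchy pushes a relational engine into recursive queries or one self-join per level, while a graph store simply walks adjacent nodes. A toy traversal over an invented parts hierarchy:

from collections import deque

# Invented parent -> children links; a graph store keeps exactly these adjacencies.
children = {
    "router": ["chassis", "line-card-1", "line-card-2"],
    "line-card-1": ["port-1", "port-2"],
    "line-card-2": ["port-3"],
}

def descendants(root):
    """Walk the hierarchy to any depth: one hop per edge, no joins."""
    found, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            found.append(child)
            queue.append(child)
    return found

print(descendants("router"))
# ['chassis', 'line-card-1', 'line-card-2', 'port-1', 'port-2', 'port-3']

That difference in access pattern, repeated over millions of rows on disk, is the plausible source of the minutes-versus-milliseconds gap claimed above.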

It is a great story and one I would repeat if I were marketing Neo4j (which I like a lot).

However, there are a couple of bits missing from the story that would make it more informative.

Such as what “…big, complex hierarchy…” was Cisco trying to build? Details please.

There are things that relational databases don’t do well.

Not realizing that up front is a design failure, not one of software or of relational databases.

Another question I would ask: What percentage of Cisco databases are relational vs. graph?

Fewer claims/stories and more data would go a long way towards informed IT decision making.

February 9, 2013

…the most hated man in America [circa 2003]

Filed under: Design,Interface Research/Design,Usability,Users — Patrick Durusau @ 8:22 pm

John E. Karlin, Who Led the Way to All-Digit Dialing, Dies at 94

The New York Times obituary for John E. Karlin, the father of the arrangement of numbers on push-button phones and a host of other inventions, is deeply moving.

Karlin did not have a series of lucky guesses but researched the capabilities and limitations of people to arrive at product design decisions.

Read the article to learn why one person said Karlin was “…the most hated man in America.”

I first saw this at Human Factors by Ed Lazowska.

…no Hitchhiker’s Guide…

Filed under: Computation,Computer Science,Mathematical Reasoning,Mathematics — Patrick Durusau @ 8:22 pm

Why there is no Hitchhiker’s Guide to Mathematics for Programmers by Jeremy Kun.

From the post:

Do you really want to get better at mathematics?

Remember when you first learned how to program? I do. I spent two years experimenting with Java programs on my own in high school. Those two years collectively contain the worst and most embarrassing code I have ever written. My programs absolutely reeked of programming no-nos. Hundred-line functions and even thousand-line classes, magic numbers, unreachable blocks of code, ridiculous code comments, a complete disregard for sensible object orientation, negligence of nearly all logic, and type-coercion that would make your skin crawl. I committed every naive mistake in the book, and for all my obvious shortcomings I considered myself a hot-shot programmer! At least I was learning a lot, and I was a hot-shot programmer in a crowd of high-school students interested in game programming.

Even after my first exposure and my commitment to get a programming degree in college, it was another year before I knew what a stack frame or a register was, two more before I was anywhere near competent with a terminal, three more before I fully appreciated functional programming, and to this day I still have an irrational fear of networking and systems programming (the first time I manually edited the call stack I couldn’t stop shivering with apprehension and disgust at what I was doing).

A must read post if you want to be on the cutting edge of programming.

RDSTK: An R wrapper for the Data Science Toolkit API

Filed under: Data Science,R — Patrick Durusau @ 8:22 pm

RDSTK: An R wrapper for the Data Science Toolkit API

From the webpage:

This package provides an R interface to Pete Warden’s Data Science Toolkit. See www.datasciencetoolkit.org for more information. The source code for this package can be found at github.com/rtelmore/RDSTK. Happy hacking!

If you don’t know the Data Science Toolkit, you should.

I first saw this at Pete Warden’s Five short links, February 8, 2013.

Distributed resilience with functional programming

Filed under: Erlang,Functional Programming — Patrick Durusau @ 8:22 pm

Distributed resilience with functional programming by Simon St. Laurent.

From the post:

Functional programming has a long and distinguished heritage of great work — that was only used by a small group of programmers. In a world dominated by individual computers running single processors, the extra cost of thinking functionally limited its appeal. Lately, as more projects require distributed systems that must always be available, functional programming approaches suddenly look a lot more appealing.

Steve Vinoski, an architect at Basho Technologies, has been working with distributed systems and complex projects for a long time, first as a tentative explorer and then leaping across to Erlang when it seemed right. Seventeen years as a columnist on C, C++, and functional languages have given him a unique viewpoint on how developers and companies are deciding whether and how to take the plunge.

Simon gives highlights from his interview of Steve Vinoski but I would start at the beginning, go to the end, then stop.

You do know that Simon has written an Erlang book? Introducing Erlang.

Haven’t seen it (yet) but knowing Simon you won’t be disappointed.

The Perfect Case for Social Network Analysis [Maybe yes. Maybe no.]

Filed under: Graphs,Networks,Security,Social Networks — Patrick Durusau @ 8:21 pm

New Jersey-based Fraud Ring Charged this Week: The Perfect Case for Social Network Analysis by Mike Betron.

When I first saw the headline, I thought the New Jersey legislature had gotten busted. 😉

No such luck, although with real transparency on contributions, relationships and state contracts, prison building would become a growth industry in New Jersey and elsewhere.

From the post:

As reported by MSN Money this week, eighteen members of a fraud ring have just been charged in what may be one of the largest international credit card scams in history. The New Jersey-based fraud ring is reported to have stolen at least $200 million, fooling credit card agencies by creating thousands of fake identities to create accounts.

What They Did

The FBI claims the members of the ring began their activity as early as 2007, and over time, used more than 7,000 fake identities to get more than 25,000 credit cards, using more than 1,800 addresses. Once they obtained credit cards, ring members started out by making small purchases and paying them off quickly to build up good credit scores. The next step was to send false reports to credit agencies to show that the account holders had paid off debts – and soon, their fake account holders had glowing credit ratings and high spending limits. Once the limits were raised, the fraudsters would “bust out,” taking out cash loans or maxing out the cards with no intention of paying them back.

But here’s the catch: The criminals in this case created synthetic identities with fake identity information (social security numbers, names, addresses, phone numbers, etc.). Addresses for the account holders were used multiple times on multiple accounts, and the members created at least 80 fake businesses which accepted credit card payments from the ring members.

This is exactly the kind of situation that would be caught by Social Network Analysis (SNA) software. Unfortunately, the credit card companies in this case didn’t have it.

Well, yes and no.

Yes, if Social Network Analysis (SNA) software were looking for the right relationships, then it could catch the fraud in question.

No, if Social Network Analysis (SNA) software were looking at the wrong relationships, then it would not catch the fraud in question.
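For this particular fraud, the “right relationship” is easy to name after the fact: many accounts sharing the same address (or phone number, or employer). A minimal sketch of that check, with invented records and an arbitrary threshold:

from collections import defaultdict

accounts = [   # invented records; a real feed would come from the card issuer
    {"id": "A1", "ssn": "111-11-1111", "address": "12 Elm St"},
    {"id": "A2", "ssn": "222-22-2222", "address": "12 Elm St"},
    {"id": "A3", "ssn": "333-33-3333", "address": "12 Elm St"},
    {"id": "A4", "ssn": "444-44-4444", "address": "9 Oak Ave"},
]

def shared_attribute_clusters(accounts, key, threshold=3):
    """Flag attribute values shared by at least `threshold` distinct accounts."""
    by_value = defaultdict(list)
    for acct in accounts:
        by_value[acct[key]].append(acct["id"])
    return {value: ids for value, ids in by_value.items() if len(ids) >= threshold}

print(shared_attribute_clusters(accounts, "address"))
# {'12 Elm St': ['A1', 'A2', 'A3']}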

Analysis isn’t a question of technology.

For example, what one policy change would do more to prevent future 9/11-type incidents than all the $billions spent since 9/11/2001?

Would you believe: Don’t open the cockpit door for hijackers. (full stop)

The 9/11 hijackers took advantage of the “Common Strategy” flaw in U.S. hijacking protocols.

One of the FAA officials most involved with the Common Strategy in the period leading up to 9/11 described it as an approach dating back to the early 1980s, developed in consultation with the industry and the FBI, and based on the historical record of hijackings. The point of the strategy was to “optimize actions taken by a flight crew to resolve hijackings peacefully” through systematic delay and, if necessary, accommodation of the hijackers. The record had shown that the longer a hijacking persisted, the more likely it was to have a peaceful resolution. The strategy operated on the fundamental assumptions that hijackers issue negotiable demands, most often for asylum or the release of prisoners, and that “suicide wasn’t in the game plan” of hijackers.

Hijackers may blow up a plane, kill or torture passengers, but not opening the cockpit door prevents a future 9/11 type event.

But before 9/11, there was no historical experience with hijacking a plane to use as a weapon.

Historical experience is just as important for detecting fraud.

Once a pattern is identified for fraud, SNA or topic maps or several other technologies can spot it.

But it has to be identified that first time.

Production-Ready Hadoop 2 Distribution

Filed under: Hadoop,MapReduce,Marketing — Patrick Durusau @ 8:21 pm

WANdisco Launches Production-Ready Hadoop 2 Distribution

From the post:

WANdisco today announced it has made its WANdisco Distro (WDD) available for free download.

WDD is a production-ready version powered by Apache Hadoop 2 based on the most recent release, including the latest fixes. These certified Apache Hadoop binaries undergo the same quality assurance process as WANdisco’s enterprise software solutions.

The WDD team is led by Dr. Konstantin Boudnik, who is one of the original Hadoop developers, has been an Apache Hadoop committer since 2009 and served as a Hadoop architect with Yahoo! This team of Hadoop development, QA and support professionals is focused on software quality. WANdisco’s Apache Hadoop developers have been involved in the open source project since its inception and have the authority within the Apache Hadoop community to make changes to the code base, for fast fixes and enhancements.

By adding its active-active replication technology to WDD, WANdisco is able to eliminate the single points of failure (SPOFs) and performance bottlenecks inherent in Hadoop. With this technology, the same data is simultaneously readable and writable on every server, and every server is actively supporting user requests. There are no passive or standby servers with complex administration procedures required for failover and recovery.

WANdisco (Somehow the quoted post failed to include the link.)

Download WANdisco Distro (WDD)

Two versions for download:

64-bit WDD v3.1.0 for RHEL 6.1 and above

64-bit WDD v3.1.0 for CentOS 6.1 and above

You do have to register and are emailed a download link.

I know marketing people have a formula that if you pester 100 people you will make N sales.

I suppose but if your product is compelling enough, people are going to be calling you.

When was the last time you heard of a drug dealer making cold calls to sell dope?

‘What’s in the NIDDK CDR?’…

Filed under: Bioinformatics,Biomedical,Search Interface,Searching — Patrick Durusau @ 8:21 pm

‘What’s in the NIDDK CDR?’—public query tools for the NIDDK central data repository by Nauqin Pan, et al., (Database (2013) 2013 : bas058 doi: 10.1093/database/bas058)

Abstract:

The National Institute of Diabetes and Digestive Disease (NIDDK) Central Data Repository (CDR) is a web-enabled resource available to researchers and the general public. The CDR warehouses clinical data and study documentation from NIDDK funded research, including such landmark studies as The Diabetes Control and Complications Trial (DCCT, 1983–93) and the Epidemiology of Diabetes Interventions and Complications (EDIC, 1994–present) follow-up study which has been ongoing for more than 20 years. The CDR also houses data from over 7 million biospecimens representing 2 million subjects. To help users explore the vast amount of data stored in the NIDDK CDR, we developed a suite of search mechanisms called the public query tools (PQTs). Five individual tools are available to search data from multiple perspectives: study search, basic search, ontology search, variable summary and sample by condition. PQT enables users to search for information across studies. Users can search for data such as number of subjects, types of biospecimens and disease outcome variables without prior knowledge of the individual studies. This suite of tools will increase the use and maximize the value of the NIDDK data and biospecimen repositories as important resources for the research community.

Database URL: https://www.niddkrepository.org/niddk/home.do

I would like to tell you more about this research, since “[t]he National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) is part of the National Institutes of Health (NIH) and the U.S. Department of Health and Human Services” (that’s a direct quote) and so doesn’t claim copyright on its publications.

Unfortunately, the NIDDK published this paper in the Oxford journal Database, which does believe in restricting access to publicly funded research.

Do visit the search interface to see what you think about it.

Not quite the same as curated content but an improvement over raw string matching.

The Times Digital Archives

Filed under: History,News — Patrick Durusau @ 8:21 pm

The Times Digital Archives

From the webpage:

Read by both world leaders and the general public, The Times has offered readers in-depth, award-winning and objective coverage of world events since its creation in 1785 and is the oldest daily newspaper in continuous publication.

The Times Digital Archive is an online, full-text facsimile of more than 200 years of The Times, one of the most highly regarded resources for the 19th – 20th Century history detailing every complete page of every issue from 1785. This historical newspaper archive allows researchers an unparalleled opportunity to search and view the best-known and most cited newspaper in the world online in its original published context.

Covers the time period 1785-2006.

Unfortunately, the publisher of this collection, GALE, has limited access to individuals at institutions with subscriptions.

Still, if you have access, this is a great resource for recent event topic maps.

February 8, 2013

Saving the “Semantic” Web (part 1)

Filed under: Semantic Web,Semantics — Patrick Durusau @ 5:17 pm

Semantics: Who You Gonna Call?

I quote “semantic” in ‘Semantic Web’ to emphasize the web had semantics long before puff pieces in Scientific American.

As a matter of fact, people traffic in semantics every day, in a variety of media. The “Web,” for all of its navel gazing, is just one.

At your next business or technical meeting, if a colleague uses a term you don’t know, here are some options:

  1. Search Cyc.
  2. Query WordNet.
  3. Call Pat Hayes.
  4. Ask the speaker what they meant.

Take a minute to think about it and put your answer in a comment below.

Other than Tim Berners-Lee, I suspect the vast majority of us will pick #4.

Here’s another quiz.

If asked, will the speaker respond with:

  1. Repeating the term over again, perhaps more loudly? (There is an American belief that English spoken loudly is more understandable to non-English speakers. The same is true for technical terms.)
  2. Restating the term in Common Logic syntax?
  3. Singing a “cool” URI?
  4. Expanding the term by offering other properties that may be more familiar to you?

Again, other than Tim Berners-Lee, I suspect the vast majority of us will pick #4.

To summarize up to this point:

  1. We all have experience with semantics and encountering unknown semantics.
  2. We all (most of us) ask the speaker of unknown semantics to explain.
  3. We all (most of us) expect an explanation to offer additional information to clue us into the unknown semantic.

My answer to the question of “Semantics: Who You Gonna Call?” is the author of the data/information.

Do you have a compelling reason for asking someone else?


OneMusicAPI Simplifies Music Metadata Collection

Filed under: Dataset,Music,Music Retrieval — Patrick Durusau @ 5:16 pm

OneMusicAPI Simplifies Music Metadata Collection by Eric Carter.

From the post:

Elsten software, digital music organizer, has announced OneMusicAPI. Proclaimed to be “OneMusicAPI to rule them all,” the API acts as a music metadata aggregator that pulls from multiple sources across the web through a single interface. Elsten founder and OneMusicAPI creator, Dan Gravell, found keeping pace with constant changes from individual sources became too tedious a process to adequately organize music.

Currently covers over three million albums but only returns cover art.

Other data will be added but when and to what degree isn’t clear.

When launched, pricing plans will be available.

A lesson that will need to be reinforced from time to time.

Collation of data/information consumes time and resources.

To encourage collation, collators need to be paid.

If you need an example of what happens without paid collators, search your favorite search engine for the term “collator.”

Depending on how you count “sameness,” I get eight or nine different notions of collator from mine.

VOPlot and Astrostat

Filed under: Astroinformatics,BigData — Patrick Durusau @ 5:16 pm

VOPlot and Astrostat Releases from VO-India.

From the post:

The VO-India team announces the release of VOPlot (v1.8). VOPlot is a tool for visualizing astronomical data. A full list of release features is available in the change log. The team has also released AstroStat (v1.0 Beta). AstroStat allows astronomers to use both simple and sophisticated statistical routines on large datasets. VO-India welcomes your suggestions and comments about the product at voindia@iucaa.ernet.in.

It’s the rainy season where I live so “virtual” astronomy is more common than “outside” astronomy. 😉

If you don’t do either one, the software is relevant to big data and its processing.

See What I Mean: How to Use Comics to Communicate Ideas

Filed under: Communication,Graphics — Patrick Durusau @ 5:16 pm

Win This Book! See What I Mean: How to Use Comics to Communicate Ideas by Josh Tyson.

From the post:

Here, Cheng talks about the parallels between comics and UX, the joys of drawing, and the power of comics in getting your point across. You can enter to win a copy of the book below.

Comics and UX share the common struggle of having had to work extra hard at being taken seriously. Do you hear similar complaints from folks working in both fields?

I do. Obviously, UX design has become much more mainstream in the past five years and doesn’t have the same struggles it used to. The discipline is taken seriously now but perhaps still misunderstood.

Comics have long existed in more serious contexts—in film as storyboards, for instance—but I’m seeing more and more examples of comics being used in other industries, though many of these examples also feel misunderstood. The medium is often treated as a solution for reaching younger audiences or making something more light-hearted, but it doesn’t have to be constrained to that.

Go to the original post for a chance to win a copy of See What I Mean: How to Use Comics to Communicate Ideas.

Topic map comics anyone? 😉

PyPLN: a Distributed Platform for Natural Language Processing

Filed under: Linguistics,Natural Language Processing,Python — Patrick Durusau @ 5:16 pm

PyPLN: a Distributed Platform for Natural Language Processing by Flávio Codeço Coelho, Renato Rocha Souza, Álvaro Justen, Flávio Amieiro, Heliana Mello.

Abstract:

This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations: from a single server to a cluster of linux servers. PyPLN is developed using Python 2.7.3 but makes it very easy to incorporate other softwares for specific tasks as long as a linux version is available. PyPLN facilitates analyses both at document and corpus level, simplifying management and publication of corpora and analytical results through an easy to use web interface. In the current (beta) release, it supports English and Portuguese languages with support to other languages planned for future releases. To support the Portuguese language PyPLN uses the PALAVRAS parser\citep{Bick2000}. Currently PyPLN offers the following features: Text extraction with encoding normalization (to UTF-8), part-of-speech tagging, token frequency, semantic annotation, n-gram extraction, word and sentence repertoire, and full-text search across corpora. The platform is licensed as GPL-v3.

Demo: http://demo.pypln.org

Source code: http://pypln.org.
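Two of the listed features, token frequency and n-gram extraction, show how little code basic corpus analysis takes these days. This is plain Python, not PyPLN's API:

from collections import Counter

text = "to be or not to be that is the question"
tokens = text.split()

print(Counter(tokens).most_common(2))        # token frequency: [('to', 2), ('be', 2)]

bigrams = list(zip(tokens, tokens[1:]))      # n-gram (here bigram) extraction
print(Counter(bigrams).most_common(1))       # [(('to', 'be'), 2)]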

Have you noticed that tools for analysis are getting easier, not harder to use?

Is there a lesson there for tools to create topic map content?

Netflix: Solving Big Problems with Reactive Extensions (Rx)

Filed under: ActorFx,JavaRx,Microsoft,Rx — Patrick Durusau @ 5:15 pm

Netflix: Solving Big Problems with Reactive Extensions (Rx) by Claudio Caldato.

From the post:

More good news for Reactive Extensions (Rx).

Just yesterday, we told you about improvements we’ve made to two Microsoft Open Technologies, Inc., releases: Rx and ActorFx, and mentioned that Netflix was already reaping the benefits of Rx.

To top it off, on the same day, Netflix announced a Java implementation of Rx, RxJava, was now available in the Netflix Github repository. That’s great news to hear, especially given how Ben Christensen and Jafar Husain outlined on the Netflix Tech blog that their goal is to “stay close to the original Rx.NET implementation” and that “all contracts of Rx should be the same.”

Netflix also contributed a great series of interactive exercises for learning Microsoft’s Reactive Extensions (Rx) Library for JavaScript as well as some fundamentals for functional programming techniques.

Rx as implemented in RxJava is part of the solution Netflix has developed for improving the processing of 2+ billion incoming requests a day for millions of customers around the world.
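If Rx is new to you, the core idea is compact: instead of pulling values from an iterator, you compose operators over a stream and register callbacks that are invoked as items arrive. The toy Python class below is only meant to convey that shape; it is not Rx's or RxJava's actual API:

class Observable:
    """Toy push-based stream: subscribers are called as items arrive."""
    def __init__(self, source):
        self._source = source                 # any iterable, standing in for async events

    def map(self, fn):
        return Observable(fn(item) for item in self._source)

    def filter(self, pred):
        return Observable(item for item in self._source if pred(item))

    def subscribe(self, on_next):
        for item in self._source:             # a real Rx pushes asynchronously
            on_next(item)

# Pretend these are incoming API requests; keep the slow ones and log their paths.
requests = [{"path": "/titles", "ms": 12}, {"path": "/ratings", "ms": 250}]
(Observable(requests)
    .filter(lambda r: r["ms"] > 100)
    .map(lambda r: r["path"])
    .subscribe(print))                        # prints: /ratings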

Do you have 2+ billion requests coming into your topic map every day?

Assuming the lesser includes the greater, you may want to take a look at Rx or RxJava.

Be sure to visit the interactive exercises!

Rx 2.1 and ActorFx V0.2

Filed under: ActorFx,Data Integration,Rx — Patrick Durusau @ 5:15 pm

Rx 2.1 and ActorFx V0.2 by Claudio Caldato.

From the post:

Today Microsoft Open Technologies, Inc., is releasing updates to improve two cloud programming projects from our MS Open Tech Hub: Rx and ActorFx .

Reactive Extension (Rx) is a programming model that allows developers to use a common interface for writing applications that interact with diverse data sources, like stock quotes, Tweets, computer events, and Web service requests. Since Rx was open-sourced by MS Open Tech in November, 2012, it has become an important under-the-hood component of several high-availability multi-platform applications, including NetFlix and GitHub.

Rx 2.1 is available now via the Rx CodePlex project and includes support for Windows Phone 8, various bug fixes and contributions from the community.

ActorFx provides a non-prescriptive, language-independent model of dynamic distributed objects for highly available data structures and other logical entities via a standardized framework and infrastructure. ActorFx is based on the idea of the mathematical Actor Model, which was adapted by Microsoft’s Eric Meijer for cloud data management.

ActorFx V0.2 is available now at the CodePlex ActorFx project, originally open sourced in December 2012. The most significant new feature in our early prototype is Actor-to-Actor communication.

The Hub engineering program has been a great place to collaborate on these projects, as these assignments give us the agility and resources to work with the community. Stay tuned for more updates soon!
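The actor model referenced here is also easy to state in miniature: an actor owns private state, has a mailbox, and interacts with other actors only by sending messages. A toy, single-threaded Python sketch (ActorFx itself is distributed and language-independent; none of the names below are its API):

from collections import deque

class Actor:
    """Minimal actor: private state, a mailbox, and message-only interaction."""
    def __init__(self, name, behavior):
        self.name, self.behavior = name, behavior
        self.mailbox = deque()
        self.state = {}

    def send(self, message):
        self.mailbox.append(message)

    def step(self):
        while self.mailbox:
            self.behavior(self, self.mailbox.popleft())

def counter(actor, msg):
    if msg["op"] == "add":
        actor.state["total"] = actor.state.get("total", 0) + msg["n"]
    elif msg["op"] == "report":
        msg["to"].send({"op": "print", "text": f"{actor.name} total = {actor.state.get('total', 0)}"})

def printer(actor, msg):
    print(msg["text"])

logger = Actor("logger", printer)
tally = Actor("tally", counter)

tally.send({"op": "add", "n": 3})
tally.send({"op": "add", "n": 4})
tally.send({"op": "report", "to": logger})   # actor-to-actor communication
tally.step()
logger.step()                                # prints: tally total = 7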

With each step towards better access to diverse data sources, the semantic impedance between data systems becomes more evident.

To say nothing of the semantics of the data you obtain.

The question to ask is:

Will new data make sense when combined with data I already have?

If you don’t know or if the answer is no, you may need a topic map.

History SPOT

Filed under: Data Preservation,Database,History,Natural Language Processing — Patrick Durusau @ 5:15 pm

History SPOT

I discovered this site via a post entitled: Text Mining for Historians: Natural Language Processing.

From the webpage:

Welcome to History SPOT. This is a subsite of the IHR [Institute of Historical Research] website dedicated to our online research training provision. On this page you will find the latest updates regarding our seminar podcasts, online training courses and History SPOT blog posts.

Currently offered online training courses (free registration required):

  • Designing Databases for Historians
  • Podcasting for Historians
  • Sources for British History on the Internet
  • Data Preservation
  • Digital Tools
  • InScribe Palaeography

Not to mention over 300 pod casts!

Two thoughts:

First, a good way to learn about the tools and expectations that historians have of their digital tools. That should help you prepare an answer to: “What do topic maps have to offer over X technology?”

Second, I rather like the site and its module orientation. A possible template for topic map training online?

Shard-Query

Filed under: MySQL,Shard-Query — Patrick Durusau @ 5:15 pm

Shard-Query: Open Source MPP database engine

From the webpage:

What is Shard-Query

Shard-Query is a high performance MySQL query engine which offers increased parallelism compared to stand-alone MySQL. This increased parallelism is achieved by taking advantage of MySQL partitioning, sharding, common query features, or some combination thereof (see more below).

The primary goal of Shard-Query is to enable low-latency query access to extremely large volumes of data utilizing commodity hardware and open source database software. Shard-Query is a federated query engine which is designed to perform as much work in parallel as possible.

What kind of queries are supported?

  • You can run just about all SQL queries over your dataset:
  • For SELECT queries:
    • All aggregate functions are supported.
      • SUM,COUNT,MIN,MAX and AVG are the fastest aggregate operations
      • SUM/COUNT(DISTINCT ..) are supported, but are slower
      • STD/VAR/etc are supported but aggregation is not pushed down at all (slowest)
      • Custom aggregate functions are now also supported.
        • PERCENTILE(expr, N) – take a percentile, for example percentile(score,90)
  • JOINs are supported (no self joins, or joins of tables sharded by different keys)
  • ORDER BY, GROUP BY, HAVING, WITH ROLLUP, and LIMIT are supported
  • Also supports INSERT, UPDATE, DELETE
  • Also supports DDL such as CREATE TABLE, ALTER TABLE and DROP TABLE
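The reason SUM, COUNT, MIN, MAX and AVG are the fast cases is that they decompose cleanly: each shard returns a small partial result and the coordinator combines them, so no raw rows cross the network. A sketch of that combine step, with in-memory lists standing in for shards:

shards = [          # each list stands in for one shard's values for a numeric column
    [3.0, 5.0, 7.0],
    [1.0, 9.0],
    [4.0, 4.0, 4.0],
]

# Push down: each shard computes and returns only (sum, count, min, max).
partials = [(sum(rows), len(rows), min(rows), max(rows)) for rows in shards]

# Combine: the coordinator merges the partials without ever seeing a row.
total_sum = sum(p[0] for p in partials)
total_cnt = sum(p[1] for p in partials)
print("SUM   =", total_sum)                 # 37.0
print("COUNT =", total_cnt)                 # 8
print("MIN   =", min(p[2] for p in partials))
print("MAX   =", max(p[3] for p in partials))
print("AVG   =", total_sum / total_cnt)     # combined SUM / combined COUNT

Exact DISTINCT counts, and STD/VAR in Shard-Query's current implementation, do not get pushed down this way, which is why the list above marks them as slower.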

The numbers on a 24 core server are impressive. Worth a closer look.

I first saw this at Justin Swanhart’s webinar announcement: Building a highly scaleable distributed… [Webinar, MySQL/Shard-Query]

Building a highly scaleable distributed… [Webinar, MySQL/Shard-Query]

Filed under: Distributed Systems,MySQL — Patrick Durusau @ 5:15 pm

Webinar: Building a highly scaleable distributed row, document or column store with MySQL and Shard-Query by Justin Swanhart.

From the post:

On Friday, February 15, 2013 10:00am Pacific Standard Time, I will be delivering a webinar entitled “Building a highly scaleable distributed row, document or column store with MySQL and Shard-Query”

The first part of this webinar will focus on why distributed databases are needed, and on the techniques employed by Shard-Query to implement a distributed MySQL database. The focus will then proceed to the types of distributed (massively parallel processing) database applications which can be deployed with Shard-Query and the performance aspects of each.

The following types of implementations will be described:

  • Distributed row store using XtraDB cluster
  • Distributed append-only column store using Infobright Community Edition
  • Distributed “document store” using XtraDB cluster and Flexviews

If you are using (or planning on using) MySQL as a topic map backend, this could be the webinar for you!

February 7, 2013

The Semantic Web Is Failing — But Why? (Part 5)

Filed under: Identity,OWL,RDF,Semantic Web — Patrick Durusau @ 4:30 pm

Impoverished Identification by URI

There is one final piece of the puzzle of the Semantic Web's failure to explore before we can talk about a solution.

In owl:sameAs and Linked Data: An Empircal Study, Ding, Shinavier, Finin and McGuinness write:

Our experimental results have led us to identify several issues involving the owl:sameAs property as it is used in practice in a linked data context. These include how best to manage owl:sameAs assertions from “third parties”, problems in merging assertions from sources with different contexts, and the need to explore an operational semantics distinct from the strict logical meaning provided by OWL.

To resolve varying usages of owl:sameAs, the authors go beyond identifications provided by a URI to look to other properties. For example:

Many owl:sameAs statements are asserted due to the equivalence of the primary feature of resource description, e.g. the URIs of FOAF profiles of a person may be linked just because they refer to the same person even if the URIs refer the person at different ages. The odd mashup on job-title in previous section is a good example for why the URIs in different FOAF profiles are not fully equivalent. Therefore, the empirical usage of owl:sameAs only captures the equivalence semantics on the projection of the URI on social entity dimension (removing the time and space dimensions). In this way, owl:sameAs is used to indicate partial equivalence between two different URIs, which should not be considered as full equivalence.

Knowing the dimensions covered by a URI and the dimensions covered by a property, it is possible to conduct better data integration using owl:sameAs. For example, since we know a URI of a person provides a temporal-spatial identity, descriptions using time-sensitive properties, e.g. age, height and workplace, should not be aggregated, while time-insensitive properties, such as eye color and social security number, may be aggregated in most cases.

When an identification is insufficient based on a single URI, additional properties can be considered.

My question then is why do ordinary users have to wait for experts to decide their identifications are insufficient? Why can’t we empower users to declare multiple properties, including URIs, as a means of identification?

It could be something as simple as JSON key/value pairs with a notation of “+” for must match, “-” for must not match, and “?” for optional to match.
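To make that concrete, here is a hypothetical declaration and matching rule. The property names and the +/-/? convention are mine, purely for illustration:

import json

# "+" properties must match, "-" properties must not match, "?" are optional.
declaration = json.loads("""
{
  "+homepage":  "http://example.org/~jdoe",
  "+name":      "Jane Doe",
  "-age":       "25",
  "?workplace": "Example Corp"
}
""")

def matches(declaration, candidate):
    """Decide whether `candidate` (a plain property->value dict) is the same subject."""
    for key, value in declaration.items():
        flag, prop = key[0], key[1:]
        if flag == "+" and candidate.get(prop) != value:
            return False
        if flag == "-" and candidate.get(prop) == value:
            return False
        # "?" properties never veto a match; they could be used for ranking.
    return True

candidate = {"homepage": "http://example.org/~jdoe", "name": "Jane Doe", "age": "43"}
print(matches(declaration, candidate))   # True

The syntax is beside the point; what matters is that an ordinary author can write the declaration without knowing OWL or waiting on an expert.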

A declaration of identity by users about the subjects in their documents. Who better to ask?

Not to mention that the more information one supplies for an identification, the more likely one is to communicate successfully with other users.

URIs may be Tim Berners-Lee’s nails, but they are insufficient to support the scaffolding required for robust communication.


The next series starts with Saving the “Semantic” Web (Part 1)

The Semantic Web Is Failing — But Why? (Part 4)

Filed under: Interface Research/Design,RDF,Semantic Web — Patrick Durusau @ 4:30 pm

Who Authors The Semantic Web?

With the explosion of data, “big data” to use the oft-abused terminology, authoring semantics cannot be solely the province of a smallish band of experts.

Ordinary users must be enabled to author semantics on subjects of importance to them, without expert supervision.

The Semantic Web is designed for the semantic equivalent of:

[Image: F-16 cockpit]

An F16 cockpit has an interface some people can use, but hardly the average user.

[Image: VW Beetle dashboard]

The VW “Beetle” has an interface used by a large number of average users.

Using a VW interface, users still have accidents, disobey rules of the road, lock their keys inside and make other mistakes. But the number of users who can use the VW interface is several orders of magnitude greater than the number who can use the F-16/RDF interface.

Designing a solution that only experts can use, if participation by average users is a goal, is a path to failure.


The next series starts with Saving the “Semantic” Web (Part 1)

The Semantic Web Is Failing — But Why? (Part 3)

Filed under: Linked Data,RDF,Semantic Web — Patrick Durusau @ 4:30 pm

Is Linked Data the Answer?

Leaving the failure of users to understand RDF semantics to one side, there is also the issue of the complexity of its various representations.

Consider Kingsley Idehen’s “simple” example Turtle document, which he posted in: Simple Linked Data Deployment via Turtle Docs using various Storage Services:

##### Starts Here #####
# Note: the hash is a comment character in Turtle
# Content start
# You can save this to a local file. In my case I use Local File Name: kingsley.ttl .
# Actual Content:

# prefix declarations that enable the use of compact identifiers instead of fully expanded 
# HTTP URIs.

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix wdrs: <http://www.w3.org/2007/05/powder-s#> .
@prefix opl: <http://www.openlinksw.com/schema/attribution#> .
@prefix cert: <http://www.w3.org/ns/auth/cert#> .
@prefix : <#> .

# Profile Doc Stuff

<> a foaf:Document . 
<> rdfs:label "DIY Linked Data Doc About: kidehen" .
<> rdfs:comment "Simple Turtle File That Describes Entity: kidehen " .

# Entity Me Stuff

<> foaf:primaryTopic :this .
<> foaf:maker :this .
:this a foaf:Person . 
:this wdrs:describedby <> . 
:this foaf:name "Kingsley Uyi Idehen" .
:this foaf:firstName "Kingsley" .
:this foaf:familyName "Idehen" .
:this foaf:nick "kidehen" .
:this owl:sameAs  .
:this owl:sameAs  .
:this owl:sameAs  .
:this owl:sameAs  .
:this foaf:page  .
:this foaf:page  .
:this foaf:page  .
:this foaf:page  . 
:this foaf:knows , , , , ,  .

# Entity Me: Identity & WebID Stuff 

#:this cert:key :pubKey .
#:pubKey a cert:RSAPublicKey;
# Public Key Exponent
# :pubkey cert:exponent "65537" ^^ xsd:integer;
# Public Key Modulus
# :pubkey cert:modulus "d5d64dfe93ab7a95b29b1ebe21f3cd8a6651816c9c39b87ec51bf393e4177e6fc
2ee712d92caf9d9f1423f5e65f127274529a2e6cc53f1e452c6736e8db8732f919c4160eaa9b6f327c8617c
40036301b547abfc4c5de610780461b269e3d8f8e427237da6152ac2047d88ff837cddae793d15427fa7ce
067467834663737332be467eb353be678bffa7141e78ce3052597eae3523c6a2c414c2ae9f8d7be807bb3
fc0d516b8ecd2fafee4f20ff3550919601a0ad5d29126fb687c2e8c156f04918a92c4fc09f136473f3303814e1
83185edf0046e124e856ca7ada027345e614f8d665f5d7172d880497005ff4626c2b0f2206f7dce717e4f279
dd2a0ddf04b" ^^ xsd:hexBinary .

# :this opl:hasCertificate :cert .
# :cert opl:fingerprint "640F9DD4CFB6DD6361CBAD12C408601E2479CC4A" ^^ xsd:hexBinary;
#:cert opl:hasPublicKey "d5d64dfe93ab7a95b29b1ebe21f3cd8a6651816c9c39b87ec51bf393e4177e6fc2
ee712d92caf9d9f1423f5e65f127274529a2e6cc53f1e452c6736e8db8732f919c4160eaa9b6f327c8617c400
36301b547abfc4c5de610780461b269e3d8f8e427237da6152ac2047d88ff837cddae793d15427fa7ce06746
7834663737332be467eb353be678bffa7141e78ce3052597eae3523c6a2c414c2ae9f8d7be807bb3fc0d516b
8ecd2fafee4f20ff3550919601a0ad5d29126fb687c2e8c156f04918a92c4fc09f136473f3303814e183185edf00
46e124e856ca7ada027345e614f8d665f5d7172d880497005ff4626c2b0f2206f7dce717e4f279dd2a0ddf04b" 
^^ xsd:hexBinary .

### Ends or Here###

Try handing that “simple” example and Idehen’s article to some non-technical person in your office to gauge its “simplicity.”

For that matter, hand it to some of your technical but non-Semantic Web folks as well.

Your experience with that exercise will speak louder than anything I can say.


The next series starts with Saving the “Semantic” Web (Part 1)
