Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 15, 2014

Why and How to Start Your SICP Trek

Filed under: Lisp,Programming — Patrick Durusau @ 8:35 am

Why and How to Start Your SICP Trek by Kai Wu.

From the post:

This post was first envisioned for those at Hacker Retreat – or thinking of attending – before it became more general. It’s meant to be a standing answer to the question, “How can I best improve as a coder?”

Because I hear that question from people committed to coding – i.e. professionally for the long haul – the short answer I always give is, “Do SICP!” *

Since that never seems to be convincing enough, here’s the long answer. 🙂 I’ll give a short overview of SICP’s benefits, then use arguments from (justified) authority and argument by analogy to convince you that working through SICP is worth the time and effort. Then I’ll share some practical tips to help you on your SICP trek.

* Where SICP = The Structure and Interpretation of Computer Programs by Hal Abelson and Gerald Sussman of MIT, aka the Wizard book.

BTW, excuse my enthusiasm for SICP if it comes across at times as monolingual theistic fanaticism. I’m aware that there are many interesting developments in CS and software engineering outside of the Wizard book – and no single book can cover everything. Nevertheless, SICP has been enormously influential as an enduring text on the nature and fundamentals of computing – and tends to pay very solid dividends on your investments of attention.

A great post with lots of suggestions on how to work your way through SICP.

What it can’t supply is the discipline to actually make your way through SICP.

I was at a Unicode Conference some years ago and met Don Knuth. I said something in the course of the conversation about reading some part of TAOCP and Don said rather wistfully that he wished he had met someone who had read it all.

It seems sad that so many of us have dipped into it here or there but not really taken the time to explore it completely. Rather like reading Romeo and Juliet for the sexy parts and ignoring the rest.

Do you have a reading plan for TAOCP after you finish SICP?

I first saw this in a tweet by Computer Science.

Erik Meijer – Haskell – MOOC

Filed under: Functional Programming,Haskell — Patrick Durusau @ 7:46 am

Erik Meijer tweeted:

Your opportunity to influence my upcoming #Haskell MOOC on EdX. Submit pull requests on the contents here: https://github.com/fptudelft/MOOC-Course-Description

What are your suggestions?

April 14, 2014

SIGBOVIK 2014

Filed under: Humor — Patrick Durusau @ 10:14 am

SIGBOVIK 2014 (pdf)

From the cover page:

The Association for Computational Heresy

presents

A record of the Proceeding of

SIGBOVIK 2014

The eighth annual intercalary robot dance in celebration of workshop on symposium about Harry Q. Bovik’s 26th birthday.

Just in case news on computer security is as grim this week as last, something to brighten your spirits.

Enjoy!

I first saw this in a tweet by John Regehr.

tagtog: interactive and text-mining-assisted annotation…

Filed under: Annotation,Biomedical,Genomics,Text Mining — Patrick Durusau @ 8:55 am

tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles by Juan Miguel Cejuela, et al.

Abstract:

The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.

Database URL: www.tagtog.net, www.flybase.org.

Encouraging because the “tagging” is not wholly automated nor is it wholly hand-authored. Rather the goal is to create an interface that draws on the strengths of automated processing as moderated by human expertise.

Annotation remains at a document level, which consigns subsequent users to mining full text but this is definitely a step in the right direction.

April 13, 2014

3 Common Time Wasters at Work

Filed under: Business Intelligence,Marketing,Topic Maps — Patrick Durusau @ 4:32 pm

3 Common Time Wasters at Work by Randy Krum.

See Randy’s post for the graphic but #2 was:

Non-work related Internet Surfing

It occurred to me that “Non-work related Internet Surfing” is indistinguishable from… search. At least at arm’s length or better.

And so many people search poorly that a lack of useful results is easy to explain.

Yes?

So, what is the strategy to get the rank and file to use more efficient information systems than search?

Their non-use or non-effective use of your system can torpedo a sale just as quickly as any other cause.

Suggestions?

Clojure and Storm

Filed under: Clojure,Storm — Patrick Durusau @ 4:12 pm

Two recent posts by Charles Ditzel, here and here, offer pointers to resources on Clojure and Storm.

For your reading convenience:

Storm Apache site.

Building an Activity Feed Stream with Storm Recipe from the Clojure Cookbook.

Storm: distributed and fault-tolerant realtime computation by Nathan Marz. (Presentation, 2012)

Storm Topologies

StormScreenCast2 Storm in use at Twitter (2011)

Enjoy!

Cross-Scheme Management in VocBench 2.1

Filed under: Mapping,SKOS,Software,Vocabularies,VocBench — Patrick Durusau @ 1:54 pm

Cross-Scheme Management in VocBench 2.1 by Armando Stellato.

From the post:

One of the main features of the forthcoming VB2.1 will be SKOS Cross-Scheme Management

I started drafting some notes about cross-scheme management here: https://art-uniroma2.atlassian.net/wiki/display/OWLART/SKOS+Cross-Scheme+Management

I think it is important to have all the integrity checks related to this aspect clear for humans, and not only have them sealed deep in the code. These notes will help users get acquainted with this feature in advance. Once completed, these will be included also in the manual of VB.

For the moment I’ve only written the introduction, some notes about data integrity and then described the checks carried upon the most dangerous operation: removing a concept from a scheme. Together with the VB development group, we will add more information in the next days. However, if you have some questions about this feature, you may post them here, as usual (or you may use the vocbench user/developer user groups).

A consistent set of operations and integrity checks for cross-scheme are already in place for this 2.1, which will be released in the next days.

VB2.2 will focus on other aspects (multi-project management), while we foresee a second wave of facilities for cross-scheme management (such as mass-move/add/remove actions, fixing utilities, analysis of dangling concepts, corrective actions etc..) for VB2.3

I agree that:

I think it is important to have all the integrity checks related to this aspect clear for humans, and not only have them sealed deep in the code.

But I am less certain that following the integrity checks of SKOS is useful in all mappings between schemes.
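
To make one of those checks concrete: finding “dangling” concepts (concepts left in no scheme after a careless removal) is a few lines of SPARQL over a SKOS export. A sketch with rdflib, assuming your thesaurus is available as a Turtle file (the file name is hypothetical; VocBench performs this kind of check internally):

```python
from rdflib import Graph

g = Graph()
g.parse("thesaurus.ttl", format="turtle")   # hypothetical SKOS export of your thesaurus

# Concepts attached to no scheme at all: "dangling" after a careless removal.
dangling = g.query("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept WHERE {
        ?concept a skos:Concept .
        FILTER NOT EXISTS { ?concept skos:inScheme ?scheme }
        FILTER NOT EXISTS { ?other skos:hasTopConcept ?concept }
    }
""")

for (concept,) in dangling:
    print("dangling concept:", concept)
```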

If you are interested in such constraints, see Armando’s notes.

Online Python Tutor (update)

Filed under: Programming,Python,Visualization — Patrick Durusau @ 1:27 pm

Online Python Tutor by Philip Guo.

From the webpage:

Online Python Tutor is a free educational tool created by Philip Guo that helps students overcome a fundamental barrier to learning programming: understanding what happens as the computer executes each line of a program’s source code. Using this tool, a teacher or student can write a Python program in the Web browser and visualize what the computer is doing step-by-step as it executes the program.

As of Dec 2013, over 500,000 people in over 165 countries have used Online Python Tutor to understand and debug their programs, often as a supplement to textbooks, lecture notes, and online programming tutorials. Over 6,000 pieces of Python code are executed and visualized every day.

Users include self-directed learners, students taking online courses from Coursera, edX, and Udacity, and professors in dozens of universities such as MIT, UC Berkeley, and the University of Washington.

If you believe in crowd wisdom, 500,000 users is a vote of confidence in the Online Python Tutor.

I first mentioned the Online Python Tutor in LEARN programming by visualizing code execution

Philip points to similar online tutors for Java, Ruby and Javascript.

Enjoy!

The Heartbleed Hit List:…

Filed under: Cybersecurity,Security — Patrick Durusau @ 12:41 pm

The Heartbleed Hit List: The Passwords You Need to Change Right Now (Mashable)

An incomplete “hit list” for the Heartbleed bug.

Useful if you are worried about changing passwords, and also for seeing who is or was vulnerable and who isn’t.

The financial sector comes up as not vulnerable now and not vulnerable in the past. No exceptions.

Mashable points out banks have been warned to update their systems.

Makes me curious: is the financial sector so far behind the technology curve that it hasn’t reached OpenSSL yet, or do regulators of the financial sector know that little about the software that underpins it? Some of both? Some other explanation?

I first saw this in a tweet by MariaJesusV.

Will Computers Take Your Job?

Filed under: Data Analysis,Humor,Semantics — Patrick Durusau @ 10:42 am

Probability that computers will take away your job posted by Jure Leskovec.

[Chart: jobs taken by computers]

For your further amusement, I recommend the full study, “The Future of Employment: How Susceptible are Jobs to Computerisation?” by C. Frey and M. Osborne (2013).

The lower the number, the less likely for computer replacement:

  • Logisticians – #55, more replaceable than Rehabilitation Counselors at #47.
  • Computer and Information Research Scientists – #69, more replaceable than Public Relations and Fundraising Managers at #67. (Sorry Don.)
  • Astronomers – #128, more replaceable than Credit Counselors at #126.
  • Dancers – #179? I’m not sure the authors have even seen Paula Abdul dance.
  • Computer Programmers – #293, more replaceable than Historians at #283.
  • Bartenders – #422. Have you ever told a sad story to a coin-operated vending machine?
  • Barbers – #439. Admittedly I only see barbers at a distance but if I wanted one, I would prefer a human one.
  • Technical Writers – #526. The #1 reason why technical documentation is so poor. Technical writers are underappreciated and treated like crap. Good technical writing should be less replaceable by computers than Lodging Managers at #12.
  • Tax Examiners and Collectors, and Revenue Agents – #586. Stop cheering so loudly. You are frightening other cube dwellers.
  • Umpires, Referees, and Other Sports Officials – #684. Now cheer loudly! 😉

If the results strike you as odd, consider this partial description of the approach taken to determine if a job could be taken over by a computer:

First, together with a group of ML researchers, we subjectively hand-labelled 70 occupations, assigning 1 if automatable, and 0 if not. For our subjective assessments, we draw upon a workshop held at the Oxford University Engineering Sciences Department, examining the automatability of a wide range of tasks. Our label assignments were based on eyeballing the O∗NET tasks and job description of each occupation. This information is particular to each occupation, as opposed to standardised across different jobs. The hand-labelling of the occupations was made by answering the question “Can the tasks of this job be sufficiently specified, conditional on the availability of big data, to be performed by state of the art computer-controlled equipment”. Thus, we only assigned a 1 to fully automatable occupations, where we considered all tasks to be automatable. To the best of our knowledge, we considered the possibility of task simplification, possibly allowing some currently non-automatable tasks to be automated. Labels were assigned only to the occupations about which we were most confident. (at page 30)

Not to mention that occupations were considered for automation on the basis of nine (9) variables.
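
In code terms the pipeline is roughly: hand-label a small set of occupations, fit a probabilistic classifier on nine O*NET-derived features, then score every other occupation. A schematic sketch with scikit-learn and made-up numbers (the authors use a Gaussian process classifier of their own; nothing below reproduces their data or results):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

rng = np.random.default_rng(42)

# 70 hand-labelled occupations x 9 O*NET-derived variables (all values invented)
X_labelled = rng.uniform(0, 100, size=(70, 9))
y_labelled = rng.integers(0, 2, size=70)         # 1 = fully automatable, 0 = not

# the remaining ~630 occupations to be scored
X_rest = rng.uniform(0, 100, size=(632, 9))

clf = GaussianProcessClassifier().fit(X_labelled, y_labelled)
p_automatable = clf.predict_proba(X_rest)[:, 1]  # the headline "probability of computerisation"

ranking = np.argsort(p_automatable)              # low probability = least replaceable
print("least replaceable occupation (index):", int(ranking[0]))
```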

Would you believe that semantics isn’t mentioned once in this paper? So now you know why I have issues with its methodology and conclusions. What do you think?

April 12, 2014

Testing Lucene’s index durability after crash or power loss

Filed under: Indexing,Lucene — Patrick Durusau @ 8:08 pm

Testing Lucene’s index durability after crash or power loss by Mike McCandless.

From the post:

One of Lucene’s useful transactional features is index durability which ensures that, once you successfully call IndexWriter.commit, even if the OS or JVM crashes or power is lost, or you kill -KILL your JVM process, after rebooting, the index will be intact (not corrupt) and will reflect the last successful commit before the crash.

If anyone at your startup is writing an indexing engine, be sure to pass this post from Mike along.

Ask them for a demonstration of equal durability of the index before using their work instead of Lucene.

You have enough work to do without replicating (poorly) work that already has enterprise level reliability.
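
Mike’s harness is Lucene- and JVM-specific, but the idea is easy to state: crash the writer process at a random moment and verify that everything up to the last successful commit is intact. A toy Python version of that test shape, with a plain append-only file and fsync standing in for IndexWriter.commit (POSIX only, purely illustrative):

```python
import os, random, signal, struct, sys, tempfile, time

RECORD = struct.Struct("<I")                    # each record: one 4-byte integer

def writer(path):
    """Append records; a "commit" is flush + fsync, then log the committed count."""
    with open(path, "ab") as data, open(path + ".commits", "a") as log:
        for i in range(1_000_000):
            data.write(RECORD.pack(i))
            if i % 100 == 0:                    # commit every 100 records
                data.flush()
                os.fsync(data.fileno())
                log.write(f"{i + 1}\n")
                log.flush()
                os.fsync(log.fileno())

def verify(path):
    """Everything up to the last logged commit must be intact and in order."""
    with open(path + ".commits") as log:
        commits = [int(line) for line in log if line.strip()]
    last = commits[-1] if commits else 0
    with open(path, "rb") as data:
        blob = data.read(RECORD.size * last)
    assert len(blob) == RECORD.size * last, "committed data is truncated"
    for i in range(last):
        (value,) = RECORD.unpack_from(blob, i * RECORD.size)
        assert value == i, f"record {i} is corrupted"
    print(f"OK: all {last} committed records survived the crash")

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy.idx")
    pid = os.fork()                             # crude stand-in for kill -KILL of a JVM
    if pid == 0:
        writer(path)
        sys.exit(0)
    time.sleep(random.uniform(0.05, 0.5))       # let it write for a random interval
    os.kill(pid, signal.SIGKILL)                # crash the writer mid-flight
    os.waitpid(pid, 0)
    verify(path)
```

Run it a handful of times; whatever the kill timing, the committed prefix should check out.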

Read Access on Google Production Servers

Filed under: Cybersecurity,Security — Patrick Durusau @ 7:26 pm

How we got read access on Google’s production servers

From the post:

To stay on top on the latest security alerts we often spend time on bug bounties and CTF’s. When we were discussing the challenge for the weekend, Mathias got an interesting idea: What target can we use against itself?

Of course. The Google search engine!

What would be better than to scan Google for bugs other than by using the search engine itself? What kind of software tend to contain the most vulnerabilities?

  • Old and deprecated software
  • Unknown and hardly accessible software
  • Proprietary software that only a few people have access to
  • Alpha/Beta releases and otherwise new technologies (software in early stages of it’s lifetime)

I read recently that computer security defense is 10 years behind computer security offense.

Do you think that’s in part due to the difference in sharing of information between the two communities?

Computer offense aggressively sharing and computer defense aggressively hoarding.

Yes?

If you are interested in a less folklorish way of gathering computer security information (such as all the software versions that are known to have the Heartbleed OpenSSL issue), think about using topic maps.

Reasoning that the pattern that led to the Heartbleed memory leak was not unique.

As you build a list of Heartbleed-susceptible software, you have a suspect list for similar issues. Find another leak and you can associate it with all those packages, subject to verification.

BTW, a good starting point for your research, the detectify blog.

Facebook Gets Smarter with Graph Engine Optimization

Filed under: Facebook,Giraph,Graphs — Patrick Durusau @ 7:07 pm

Facebook Gets Smarter with Graph Engine Optimization by Alex Woodie.

From the post:

Last fall, the folks in Facebook’s engineering team talked about how they employed the Apache Giraph engine to build a graph on its Hadoop platform that can host more than a trillion edges. While the Graph Search engine is capable of massive graphing tasks, there were some workloads that remained outside the company’s technical capabilities–until now.

Facebook turned to the Giraph engine to power its new Graph Search offering, which it unveiled in January 2013 as a way to let users perform searches on other users to determine, for example, what kind of music their Facebook friends like, what kinds of food they’re into, or what activities they’ve done recently. An API for Graph Search also provides advertisers with a new revenue source for Facebook. It’s likely the world’s largest graph implementation, and a showcase of what graph engines can do.

The company picked Giraph because it worked on their existing Hadoop implementation, including HDFS and its MapReduce infrastructure stack (known as Corona). Compared to running the computation workload on Hive, an internal Facebook test of a 400-billion edge graph ran 126x faster on Giraph, and had a 26x performance advantage, as we explained in a Datanami story last year.

When Facebook scaled its internal test graph up to 1 trillion edges, they were able to keep the processing of each iteration of the graph under four minutes on a 200-server cluster. That amazing feat was done without any optimization, the company claimed. “We didn’t cheat,” Facebook developer Avery Ching declared in a video. “This is a random hashing algorithm, so we’re randomly assigning the vertices to different machines in the system. Obviously, if we do some separation and locality optimization, we can get this number down quite a bit.”

High level view with technical references on how Facebook is optimizing its Apache Giraph engine.

If you are interested in graphs, this is much more of a real world scenario than building “big” graphs out of uniform time slices.
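
The “random hashing” Ching mentions is nothing exotic; as a sketch, vertex placement is just a hash of the vertex id modulo the number of workers (a deterministic hash here, for illustration only):

```python
import zlib

def partition(vertex_id: str, num_workers: int) -> int:
    # Giraph-style default placement: hash the vertex, mod the worker count. No locality.
    return zlib.crc32(vertex_id.encode("utf-8")) % num_workers

# spread a few vertices across a 200-machine cluster
print({v: partition(v, 200) for v in ["alice", "bob", "carol"]})
```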

PyCon US 2014 – Videos (Tutorials)

Filed under: Conferences,Programming,Python — Patrick Durusau @ 2:04 pm

The tutorial videos from PyCon US 2014 are online! Talks to follow.

Tutorials arranged by author for your finding convenience:

  • Blomo, Jim mrjob: Snakes on a Hadoop
    This tutorial will take participants through basic usage of mrjob by writing analytics jobs over Yelp data. mrjob lets you easily write, run, and test distributed batch jobs in Python, on top of Hadoop. Hadoop is a MapReduce platform for processing big data but requires a fair amount of Java boilerplate. mrjob is an open source Python library written by Yelp used to process TBs of data every day. (A minimal mrjob word-count sketch follows after this list.)
  • Clifford, Williams, G. 0 to 00111100 with web2py
    This tutorial teaches basic web development for people who have some experience with HTML. No experience with CSS or JavaScript is required. We will build a basic web application using AJAX, web forms, and a local SQL database.
  • Grisel, Olivier; Jake, Vanderplas Exploring Machine Learning with Scikit-learn
    This tutorial will offer an introduction to the core concepts of machine learning, and how they can be easily applied in Python using Scikit-learn. We will use the scikit-learn API to introduce and explore the basic categories of machine learning problems, related topics such as feature selection and model validation, and the application of these tools to real-world data sets.
  • Love, Kenneth Getting Started with Django, a crash course
    Getting Started With Django is a well-established series of videos teaching best practices and common approaches for building web apps to people new to Django. This tutorial combines the first few lessons into a single lesson. Attendees will follow along as I start and build an entire simple web app and, network permitting, deploy it to Heroku.
  • Ma, Eric How to formulate a (science) problem and analyze it using Python code
    Are you interested in doing analysis but don’t know where to start? This tutorial is for you. Python packages & tools (IPython, scikit-learn, NetworkX) are powerful for performing data analysis. However, little is said about formulating the questions and tying these tools together to provide a holistic view of the data. This tutorial will provide you with an introduction on how this can be done.
  • Müller, Mike Descriptors and Metaclasses – Understanding and Using Python's More Advanced Features
    Descriptors and metaclasses are advanced Python features. While it is possible to write Python programs without active knowledge of them, knowing how they work provides a deeper understanding of the language. Using examples, you will learn how they work and when to use them as well as when better not to use them. Use cases provide working code that can serve as a base for your own solutions.
  • Vanderplas, Jake; Olivier Grisel Exploring Machine Learning with Scikit-learn
    This tutorial will offer an introduction to the core concepts of machine learning, and how they can be easily applied in Python using Scikit-learn. We will use the scikit-learn API to introduce and explore the basic categories of machine learning problems, related topics such as feature selection and model validation, and the application of these tools to real-world data sets.

Tutorials or talks with multiple authors are listed under each author. (I don’t know which one you will remember.)
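
As promised above, here is the canonical mrjob word count, the sort of job Jim Blomo’s tutorial builds on (my sketch, not the tutorial’s code):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    """Classic word count: mappers emit (word, 1), reducers sum the counts."""

    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run it locally with "python word_count.py reviews.txt" (mrjob’s inline runner), or pass "-r hadoop" to ship the same job, unchanged, to a cluster.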

I am going to spin up the page for the talks so that when the videos appear, all I need to do is insert the video links.

Enjoy!

Lost Boolean Operator?

Filed under: Boolean Operators,Logic — Patrick Durusau @ 1:15 pm

Binary Boolean Operator: The Lost Levels

From the post:

The most widely known of these four siblings is operator number 11. This operator is called the “material conditional”. It is used to test if a statement fits the logical pattern “P implies Q”. It is equivalent to !P || Q by the material implication.

I only know one language that implements this operation: VBScript.

The post has a good example of why material conditional is useful.
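
If your language lacks the operator, the post’s !P || Q translation is a one-liner. A quick Python illustration, truth table included:

```python
def implies(p: bool, q: bool) -> bool:
    # the material conditional: false only when p is true and q is false
    return (not p) or q

for p in (False, True):
    for q in (False, True):
        print(f"{p!s:>5} -> {q!s:>5} : {implies(p, q)}")
```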

Will your next language have a material conditional operator?

I first saw this in Pete Warden’s Five short links for April 3, 2014.

Prescription vs. Description

Filed under: Data Science,Ontology,Topic Maps — Patrick Durusau @ 10:59 am

Kurt Cagle posted this image on Facebook:

[Image: engineers vs. physicists]

with this comment:

The difference between INTJs and INTPs in a nutshell. Most engineers, and many programmers, are INTJs. Theoretical scientists (and I’m increasingly putting data scientists in that category) are far more INTPs – they are observers trying to understand why systems of things work, rather than people who use that knowledge to build, control or constrain those systems.

I would rephrase the distinction to be one of prescription (engineers) versus description (scientists) but that too is a false dichotomy.

You have to have some real or imagined description of a system to start prescribing for it and any method for exploring a system has some prescriptive aspects.

The better course is to recognize exploring or building systems has some aspects of both. Making that recognition, may (or may not) make it easier to discuss assumptions of either perspective that aren’t often voiced.

Being more from the descriptive side of the house, I enjoy pointing out that behind most prescriptive approaches are software and services to help you implement those prescriptions. Hardly seems like an unbiased starting point to me. 😉

To be fair, however, the descriptive side of the house often has trouble distinguishing between important things to describe and describing everything it can to system capacity, for fear of missing some edge case. The “edge” cases may be larger than the system but if they lack business justification, pragmatics should reign over purity.

Or to put it another way: Prescription alone is too brittle and description alone is too endless.

Effective semantic modeling/integration needs to consist of varying portions of prescription and description depending upon the requirements of the project and projected ROI.

PS: The “ROI” of a project not in your domain, that doesn’t use your data, your staff, etc. is not a measure of the potential “ROI” for your project. Crediting such reports is “ROI” for the marketing department that created the news. Very important to distinguish “your ROI” from “vendor’s ROI.” Not the same thing. If you need help with that distinction, you know where to find me.

April 11, 2014

Hemingway App

Filed under: Authoring Topic Maps,Editor,Writing — Patrick Durusau @ 7:10 pm

Hemingway App

We are a long way from something equivalent to Hemingway App for topic maps or other semantic technologies but it struck me that may not always be true.

Take it for a spin and see what you think.

What modifications would be necessary to make this concept work for a semantic technology?

Definitions Extractions from the Code of Federal Regulations

Filed under: Extraction,Law,Law - Sources — Patrick Durusau @ 7:03 pm

Definitions Extractions from the Code of Federal Regulations by Mohammad M. AL Asswad, Deepthi Rajagopalan, and Neha Kulkarni. (poster)

From a description of the project:

Imagine you’re opening a new business that uses water in the production cycle. If you want to know what federal regulations apply to you, you might do a Google search that leads to the Code of Federal Regulations. But that’s where it gets complicated, because the law contains hundreds of regulations involving water that are difficult to narrow down. (The CFR alone contains 13898 references to water.) For example, water may be defined one way when referring to a drinkable liquid and another when defined as an emission from a manufacturing facility. If the regulation says your water must maintain a certain level of purity, to which water are they referring? Definitions are the building blocks of the law, and yet poring through them to find what applies to you is frustrating to an average business owner. Computer automation might help, but how can a computer understand exactly what kind of water you’re looking for? We at the Legal Information Institute think this is a pretty important challenge, and apparently Google does too.

Looking forward to learning more about this project!

BTW, this is the same Code of Federal Regulations that some members of Congress don’t think needs to be indexed.

Knowing what legal definitions apply is a big step towards making legal material more accessible.
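
To see why the problem is hard, here is a deliberately naive sketch of definition spotting with a regular expression (my toy example, not the project’s approach; real CFR text defeats simple patterns quickly, which is exactly why the research matters):

```python
import re

# CFR definitions are often phrased 'the term "X" means ...' or '"X" means ...'.
# Real extraction needs much more than a regex; this only shows the shape of the problem.
DEFINITION = re.compile(r'(?:the term\s+)?"([^"]+)"\s+means\s+([^.;]+)', re.IGNORECASE)

sample = (
    'For purposes of this part, the term "waters of the United States" means '
    'all waters which are currently used in interstate commerce; '
    '"Administrator" means the Administrator of the EPA.'
)

for term, definition in DEFINITION.findall(sample):
    print(f"{term!r} -> {definition.strip()!r}")
```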

Google Top 10 Search Tips

Filed under: Search Engines,Searching — Patrick Durusau @ 6:47 pm

Google Top 10 Search Tips by Karen Blakeman.

From the post:

These are the top 10 tips from the participants of a recent workshop on Google, organised by UKeiG and held on 9th April 2014. The edited slides from the day can be found on authorSTREAM at http://www.authorstream.com/Presentation/karenblakeman-2121264-making-google-behave-techniques-better-results/ and on Slideshare at http://www.slideshare.net/KarenBlakeman/making-google-behave-techniques-for-better-results

Ten search tips from the trenches. Makes a very nice cheat sheet.

Transcribing Piano Rolls…

Filed under: Music,Python — Patrick Durusau @ 6:14 pm

Transcribing Piano Rolls, the Pythonic Way by Zulko.

From the post:

Piano rolls are these rolls of perforated paper that you put in the saloon’s mechanical piano. They were very popular until the 1950s, and the piano roll repertory counts thousands of arrangements (some by the greatest names of jazz) which have never been published in any other form.

NSA news isn’t going to subside anytime soon so I am including this post as one way to relax over the weekend. 😉

I’m not a musicologist but I think transcribing music from an image of a roll being played is quite fascinating.
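
The heart of the trick in Zulko’s post is deciding, row by row of the scanned image, which note tracks have a perforation. A toy numpy version with fabricated data (not the post’s code) shows the idea:

```python
import numpy as np

# Fake a scanned roll: rows are time, columns are note tracks, dark pixels are holes.
rng = np.random.default_rng(0)
scan = rng.uniform(0.6, 1.0, size=(200, 88))    # mostly light paper
scan[40:80, 60] = 0.1                           # a long note on track 60
scan[100:110, 64] = 0.05                        # a short note on track 64

holes = scan < 0.3                              # dark enough to count as a perforation

# Turn runs of consecutive "hole" rows in each column into (track, start, end) notes.
notes = []
for track in range(holes.shape[1]):
    col = holes[:, track].astype(int)
    changes = np.flatnonzero(np.diff(np.concatenate(([0], col, [0]))))
    for start, end in zip(changes[::2], changes[1::2]):
        notes.append((track, int(start), int(end)))

print(notes)                                    # [(60, 40, 80), (64, 100, 110)]
```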

I first saw this in a tweet from Lars Marius Garshol.

NSA: Not Your Friend or Mine

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:11 pm

NSA Said to Exploit Heartbleed Bug for Intelligence for Years by Michael Riley.

From the post:

The U.S. National Security Agency knew for at least two years about a flaw in the way that many websites send sensitive information, now dubbed the Heartbleed bug, and regularly used it to gather critical intelligence, two people familiar with the matter said.

The NSA’s decision to keep the bug secret in pursuit of national security interests threatens to renew the rancorous debate over the role of the government’s top computer experts.

Heartbleed appears to be one of the biggest glitches in the Internet’s history, a flaw in the basic security of as many as two-thirds of the world’s websites. Its discovery and the creation of a fix by researchers five days ago prompted consumers to change their passwords, the Canadian government to suspend electronic tax filing and computer companies including Cisco Systems Inc. to Juniper Networks Inc. to provide patches for their systems.

It was just earlier today in NSA … *ucked Up …TCP/IP that I pointed out:

If you aren’t worried about privacy, human rights, etc., let’s make it a matter of dollars and cents.

Think about the economic losses and expenses of your enterprise from an insecure Internet or the profits you could be making with a secure Internet.

The NSA has been at war against your commercial interests for as long as the Internet has existed. If you are serious about the Internet and information, then it is time to rid everyone of the #1 drag on ecommerce, the NSA.

There is no doubt the NSA has damaged the United States computer industry. Now we find the NSA endangering all commerce over the Internet.

The NSA is the largest threat that the United States, its citizens, and its businesses have faced to date.

Let’s end the NSA.

Free Heartbleed-Checkers

Filed under: Cybersecurity,Security — Patrick Durusau @ 2:10 pm

Free Heartbleed-Checker Released for Firefox Browser by Kelly Jackson Higgins.

From the post:

A developer today released a free add-on for Mozilla Firefox that checks websites for vulnerability to the massive Heartbleed flaw. Tom Brennan, founder of ProactiveRISK, says he wrote the tool after getting an overwhelming number of requests from family and friends about how to protect themselves from websites that are vulnerable to Heartbleed. “They just wanted their browser to tell them, like a radar detector,” Brennan says.

A similar tool for Chrome was released yesterday by developer Jamie Hoyle. The Chromebleed Checker add-in for the Chrome browser also warns users of Heartbleed-vulnerable sites.

If you discover a Heartbleed-susceptible site, tweet #heartbleed (URL of site).

BTW, don’t be discouraged by pieces like Open source software is more secure, right? So what happened with OpenSSL? by Barb Darrow.

What Barb misses is that with closed source software, most of the security industry and almost all customers would never have known what went wrong in the source code. With open source, future code reviews can look for similar errors.

Another difference is that proprietary systems are always “just good enough to ship,” and open-source software has the opportunity (not a certainty) to get better over time.

I think those are two large differences.

You?

How-to: Process Data using Morphlines (in Kite SDK)

Filed under: Cloudera,ETL,Flume,MapReduce,Morphlines — Patrick Durusau @ 1:48 pm

How-to: Process Data using Morphlines (in Kite SDK) by Janos Matyas.

From the post:

SequenceIQ has an Apache Hadoop-based platform and API that consume and ingest various types of data from different sources to offer predictive analytics and actionable insights. Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation.

These datasets come from different sources (industry-standard and proprietary adapters, Apache Flume, MQTT, iBeacon, and so on), so we need a flexible, embeddable framework to support our ETL process chain. Hello, Morphlines! (As you may know, originally the Morphlines library was developed as part of Cloudera Search; eventually, it graduated into the Kite SDK as a general-purpose framework.)

To define a Morphline transformation chain, you need to describe the steps in a configuration file, and the framework will then turn it into an in-memory container for transformation commands. Commands perform tasks such as transforming, loading, parsing, and processing records, and they can be linked in a processing chain.

In this blog post, I’ll demonstrate such an ETL process chain containing custom Morphlines commands (defined via config file and Java), and use the framework within MapReduce jobs and Flume. For the sample ETL with Morphlines use case, we have picked a publicly available “million song” dataset from Last.fm. The raw data consist of one JSON file/entry for each track; the dictionary contains the following keywords:

A welcome demonstration of Morphlines but I do wonder about the statement:

Our datasets are structured, unstructured, log files, and communication records, and they require constant refining, cleaning, and transformation. (Emphasis added.)

If you don’t have experience with S3 and this pipeline, it is a good starting point for your investigations.
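
Morphlines commands are Java classes wired together by a HOCON config, but the underlying idea is just a declared chain of small record-transforming steps. A purely illustrative Python sketch of that idea (not the Kite SDK API):

```python
import json

# Each "command" is just a function: record in, record out (None drops the record).
def read_json(record):
    record.update(json.loads(record.pop("raw")))
    return record

def drop_short_tracks(record):
    return record if record.get("duration", 0) >= 30 else None

def lowercase_artist(record):
    record["artist"] = record.get("artist", "").lower()
    return record

PIPELINE = [read_json, drop_short_tracks, lowercase_artist]

def run(records, pipeline=PIPELINE):
    for record in records:
        for command in pipeline:
            record = command(record)
            if record is None:
                break
        else:
            yield record

raw = [{"raw": '{"artist": "Miles Davis", "title": "So What", "duration": 545}'},
       {"raw": '{"artist": "Unknown", "title": "blip", "duration": 3}'}]
print(list(run(raw)))   # the short track is dropped, the other is cleaned up
```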

Navigating the WARC File Format

Filed under: Common Crawl,WWW — Patrick Durusau @ 1:22 pm

Navigating the WARC File Format by Stephen Merity.

From the post:

Recently CommonCrawl has switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of CommonCrawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.

This document aims to give you an introduction to working with the new format, specifically the difference between:

  • WARC files which store the raw crawl data
  • WAT files which store computed metadata for the data stored in the WARC
  • WET files which store extracted plaintext from the data stored in the WARC

If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.

If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

If you aren’t already using Common Crawl data, you should be.

Fresh Data Available:

The latest dataset is from March 2014, contains approximately 2.8 billion webpages and is located in Amazon Public Data Sets at /common-crawl/crawl-data/CC-MAIN-2014-10.

What are you going to look for in 2.8 billion webpages?
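
If you want to start poking at those pages yourself, WARC records are simple enough to read by hand. A bare-bones Python sketch that walks the records of one (hypothetically named) Common Crawl WARC file and counts the responses; the ISO spec and the warc/warcio libraries cover the corner cases this skips:

```python
import gzip

def warc_records(path):
    """Yield (headers, payload) for each record in a gzipped WARC file.
    Bare-bones: no header continuation lines, no streaming of huge payloads."""
    with gzip.open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                return                              # end of file
            if not line.strip():
                continue                            # blank lines between records
            assert line.startswith(b"WARC/"), line  # version line starts a record
            headers = {}
            while True:
                line = f.readline().strip()
                if not line:
                    break                           # blank line ends the header block
                name, _, value = line.partition(b":")
                headers[name.decode().strip()] = value.decode().strip()
            payload = f.read(int(headers["Content-Length"]))
            yield headers, payload

if __name__ == "__main__":
    responses = 0
    # file name is hypothetical; any segment under /common-crawl/crawl-data/ will do
    for headers, _ in warc_records("CC-MAIN-example.warc.gz"):
        if headers.get("WARC-Type") == "response":
            responses += 1
    print(responses, "response records")
```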

Placement of Citations [Discontinuity and Users]

Filed under: Interface Research/Design,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 12:53 pm

If the Judge Will Be Reading My Brief on a Screen, Where Should I Place My Citations? by Peter W. Martin.

From the post:

In a prior post I explored how the transformation of case law to linked electronic data undercut Brian Garner’s longstanding argument that judges should place their citations in footnotes. As that post promised, I’ll now turn to Garner’s position as it applies to writing that lawyers prepare for judicial readers.

brief page

Implicitly, Garner’s position assumes a printed page, with footnote calls embedded in the text and the related notes placed at the bottom. In print that entirety is visible at once. The eyes must move, but both call and footnote remain within a single field of vision. Secondly, when the citation sits inert on a printed page and the cited source is online, the decision to inspect that source and when to do so is inevitably influenced by the significant discontinuity that transaction will entail. In print, citation placement contributes little to that discontinuity. The situation is altered – significantly, it seems to me – when a brief or memorandum is submitted electronically and will most likely be read from a screen. In 2014 that is the case with a great deal of litigation.

This is NOT a discussion of interest only to lawyers and judges.

While Peter has framed the issue in terms of contrasting styles of citation, as he also points out, there is a question of “discontinuity” and, I would argue, of comprehension for the reader in these styles.

At first blush, being a regular hypertext maven, you may think that inline citations are “the way to go” on this citation issue.

To some degree I would agree with you, but leaving the current display to consult a citation or other material that could appear in a footnote introduces another form of discontinuity.

You are no longer reading a brief prepared by someone familiar with the law and facts at hand but someone who is relying on different facts and perhaps even a different legal context for their statements.

If you are a regular reader of hypertexts, try writing down the opinion of one author on a note card, follow a hyperlink in that post to another resource, record the second author’s opinion on the same subject on a second note card and then follow a link from the second resource to a third and repeat the note card opinion recording. Set all three cards aside, with no marks to associate them with a particular author.

After two (2) days return to the cards and see if you can distinguish the card you made for the first author from the next two.

Yes, after a very short while you are unable to identify the exact source of information that you were trying to remember. Now imagine that in a legal context where facts and/or law are in dispute. Exactly how much “other” content do you want to display with your inline reference?

The same issue comes up for topic map interfaces. Do you really want to display all the information on a subject or do you want to present the user with a quick overview and enable them to choose greater depth?

Personally I would use citations with pop-ups that contain a summary of the cited authority, with a link to the fuller resource. So a judge could quickly confirm their understanding of a case without waiting for resources to load, etc.

But in any event, how much visual or cognitive discontinuity your interface is inflicting on users is an important issue.

NSA … *ucked Up …TCP/IP

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 10:42 am

CERF: Classified NSA Work Mucked Up Security For Early TCP/IP by Paul Roberts.

From the post:

Did the National Security Agency, way back in the 1970s, allow its own priorities to stand in the way of technology that might have given rise to a more secure Internet? You wouldn’t be crazy to reach that conclusion after hearing an interview with Google Vice President and Internet Evangelist Vint Cerf on Wednesday.

As a graduate student in Stanford in the 1970s, Cerf had a hand in the creation of ARPANet, the world’s first packet-switched network. He later went on to work as a program manager at DARPA, where he funded research into packet network interconnection protocols that led to the creation of the TCP/IP protocol that is the foundation of the modern Internet.

Cerf is a living legend who has received just about every honor a technologist can: including the National Medal of Technology, the Turing Award and the Presidential Medal of Freedom. But he made clear in the Google Hangout with host Leo Laporte that the work he has been decorated for – TCP/IP, the Internet’s lingua franca – was at best intended as a proof of concept, and that only now – with the adoption of IPv6 – is it mature (and secure) enough for what Cerf called “production use.”

Specifically, Cerf said that given the chance to do it over again he would have designed earlier versions of TCP/IP to look and work like IPV6, the latest version of the IP protocol with its integrated network-layer security and massive 128 bit address space. IPv6 is only now beginning to replace the exhausted IPV4 protocol globally.

Paul later points out that we can’t know what impact then-available security would have had on the creation and adoption of the Internet.

Fair point.

And there isn’t any use in crying over spilled milk.

However, after decades of lying, law breaking and trying to disadvantage the population it is alleged to serve, why isn’t Congress defunding the NSA now?

If an agency has a proven track record of law-breaking and lying to Congress, what reason is there to credit any report, any statement or any information the NSA claims to have gathered?

You know the saying: Fool me once, shame on you. Fool me twice, shame on me?

The entire interview: [embedded video]

If you aren’t worried about privacy, human rights, etc., let’s make it a matter of dollars and cents.

Think about the economic losses and expenses of your enterprise from an insecure Internet or the profits you could be making with a secure Internet.

The NSA has been at war against your commercial interests for as long as the Internet has existed. If you are serious about the Internet and information, then it is time to rid everyone of the #1 drag on ecommerce, the NSA.

Free Recommendation Engine!

Filed under: Mortar,Recommendation — Patrick Durusau @ 7:39 am

Giving Away Our Recommendation Engine for Free by Doug Daniels.

From the post:

What’s better than a recommendation engine that’s free? A recommendation engine that is both awesome and free.

Today, we’re announcing General Availability for the Mortar Recommendation Engine. Designed by Mortar’s engineers and top data science advisors, it produces personalized recommendations at scale for companies like MTV, Comedy Central, StubHub, and the Associated Press. Today, we’re giving it away for free, and it is awesome.

Cool!

But before the FOSS folks get all weepy eyed, let’s remember that in order to make use of a recommendation engine, you need:

  • Data, lots of data
  • Understanding of the data
  • Processing of the data
  • Debugging your recommendations
  • Someone to make recommendations to
  • Someone to pay you for your recommendations

And those are just the points that came to mind while writing this post.
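
None of which is a knock on the engine. To see how small a share of the work the core algorithm is, here is a toy item-to-item co-occurrence recommender in a few lines of Python (illustrative only; Mortar’s engine is a far more serious Pig/Hadoop pipeline):

```python
from collections import Counter, defaultdict

# user -> items they interacted with (made-up data)
history = {
    "ann":  {"dune", "neuromancer", "foundation"},
    "bob":  {"dune", "foundation", "hyperion"},
    "carl": {"neuromancer", "snow crash"},
}

# item -> counts of items seen alongside it
co_counts = defaultdict(Counter)
for items in history.values():
    for a in items:
        for b in items:
            if a != b:
                co_counts[a][b] += 1

def recommend(user, k=3):
    seen = history[user]
    scores = Counter()
    for item in seen:
        scores.update(co_counts[item])
    for item in seen:
        scores.pop(item, None)      # never recommend what they already have
    return [item for item, _ in scores.most_common(k)]

print(recommend("carl"))            # ['dune', 'foundation'] in some order
```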

You can learn a lot from the Mortar Recommendation Engine but it’s not a threat to Mortar’s core business.

Any more than Oracle handing out shrink wrap copies of Oracle 36DD would impact their licensing and consulting business.

When you want to wield big iron, you need professional grade training and supplies.

April 10, 2014

Spotting Guide to Bad Science

Filed under: Science — Patrick Durusau @ 7:06 pm

[Infographic: a spotting guide to bad science]

I am seriously considering writing up something similar for technical standards.

Suggestions?

I saw this in a tweet by Kyle Dennis.

The X’s Are In Town

Filed under: HyTime,W3C,XML,XPath,XQuery — Patrick Durusau @ 6:53 pm

XQuery 3.0, XPath 3.0, XQueryX 3.0, XDM 3.0, Serialization 3.0, Functions and Operators 3.0 are now W3C Recommendations

From the post:

The XML Query Working Group published XQuery 3.0: An XML Query Language, along with XQueryX, an XML representation for XQuery, both as W3C Recommendations, as well as the XQuery 3.0 Use Cases and Requirements as final Working Group Notes. XQuery extends the XPath language to provide efficient search and manipulation of information represented as trees from a variety of sources.

The XML Query Working Group and XSLT Working Group also jointly published W3C Recommendations of XML Path Language (XPath) 3.0, a widely-used language for searching and pointing into tree-based structures, together with XQuery and XPath Data Model 3.0 which defines those structures, XPath and XQuery Functions and Operators 3.0 which provides facilities for use in XPath, XQuery, XSLT and a number of other languages, and finally the XSLT and XQuery Serialization 3.0 specification giving a way to turn values and XDM instances into text, HTML or XML.

Read about the XML Activity.

I was wondering what I was going to have to read this coming weekend. 😉

It may just be me but the “…provide efficient search and manipulation of information represented as trees from a variety of sources…” sounds a lot like groves to me.

You?

Using Datomic as a Graph Database

Filed under: Clojure,Datomic,Functional Programming,Graphs — Patrick Durusau @ 6:32 pm

Using Datomic as a Graph Database by Joshua Davey.

From the post:

Datomic is a database that changes the way that you think about databases. It also happens to be effective at modeling graph data and was a great fit for performing graph traversal in a recent project I built.

I started out building kevinbacon.us using Neo4j, a popular open-source graph database. It worked very well for actors that were a few hops away, but finding paths between actors with more than 5 hops proved problematic. The cypher query language gave me little visibility into the graph algorithms actually being executed. I wanted more.

Despite not being explicitly labeled as such, Datomic proved to be an effective graph database. Its ability to arbitrarily traverse datoms, when paired with the appropriate graph searching algorithm, solved my problem elegantly. This technique ended up being fast as well.

Quick aside: this post assumes a cursory understanding of Datomic. I won’t cover the basics, but the official tutorial will help you get started.
….

If you are interested in Datomic, Clojure, functional programming, or graphs, this is a must read for you.

Not to spoil any surprises but Joshua ends up with excellent performance.
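
The pairing Joshua describes is worth dwelling on: the database only has to hand you edges cheaply, and the path finding itself is a plain breadth-first search. A language-agnostic sketch in Python (Joshua’s version is Clojure running against Datomic’s indexes):

```python
from collections import deque

def shortest_path(neighbors, start, goal):
    """Plain breadth-first search; `neighbors(node)` returns adjacent nodes.
    In Joshua's setup that callback would pull actor/movie datoms out of Datomic."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbors(node):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

# tiny actor-movie graph, just for the sketch
graph = {
    "Kevin Bacon": ["Apollo 13"],
    "Apollo 13":   ["Kevin Bacon", "Tom Hanks"],
    "Tom Hanks":   ["Apollo 13", "Cloud Atlas"],
    "Cloud Atlas": ["Tom Hanks", "Halle Berry"],
    "Halle Berry": ["Cloud Atlas"],
}
print(shortest_path(graph.get, "Halle Berry", "Kevin Bacon"))
```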

I first saw this in a tweet by Atabey Kaygun.

