Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 12, 2014

Web Annotation Data Model [First Public Working Draft]

Filed under: Annotation — Patrick Durusau @ 10:48 am

Web Annotation Data Model [First Public Working Draft]

Web Annotation Principles

The Web Annotation Data Model is defined using the following basic principles:

  • An Annotation is a resource that represents the link between resources, or a selection within a resource.
  • There are two types of participating resources, Bodies and Targets.
  • Annotations have 0..n Bodies.
  • Annotations have 1..n Targets.
  • The content of the Body resources is related to, and typically “about”, the content of the Target resources.
  • Annotations, Bodies and Targets may have their own properties and relationships, typically including provenance information and descriptive metadata.
  • The intent behind the creation of an Annotation is an important property, and is identified by a Motivation resource.

The following principles describe additional distinctions needed regarding the exact nature of Target and Body:

  • The Target or Body resource may be more specific than the entity identified by the resource’s URI alone.
  • In particular,
    • The Target or Body resource may be a specific segment of a resource.
    • The Target or Body resource may be a resource with a specific style.
    • The Target or Body resource may be a resource in a specific context or container.
    • The Target or Body resource may be any combination of the above.
  • The identity of the specific resource is separate from the description of how to obtain the specific resource.
  • The specific resource is derived from a resource identified by a URI.

The following principles describe additional semantics regarding multiple resources:

  • A resource may be a choice between multiple resources.
  • A resource may be a unique, unordered set of resources.
  • A resource may be an ordered list of resources.
  • These resources may be used anywhere a resource may be used.

Take the time to read and comment!

If you wish to make comments regarding this document, please send them to public-annotation@w3.org (subscribe, archives), or use the specification annotation tool by selecting some text and activating the sidebar. (emphasis added)

That’s right! You can annotate the annotation draft itself. It’s unfortunate that more standards organizations don’t offer that type of commenting facility by default.

Although transclusion would be a better solution, annotations may offer a way to finally break through the document wall that conceals information. For example, if you reference page 33 of the Senate Report on CIA torture, I have to look up that document, locate page 33, and then pair your comment with that page. (Easier than manual retrieval but still less than optimal.)

Say you wanted to comment on:

After the July 2002 meetings, the CIA’s (deleted) CTC Legal, (deleted) , drafted a letter to Attorney General John Ashcroft asking the Department of Justice for “a formal declination of prosecution, in advance, for any employees of the United States, as well as any other personnel acting on behalf of the United States, who may employ methods in the interrogation of Abu Zubaydah that otherwise might subject those individuals to prosecution.”*

You might want to supply the information that has been deleted in that sentence and then share that annotation with others, so that when they view the document, the information appears as annotations on the deleted portions of text.

Or you might want to annotate the various lies being told by former Vice President Cheney and others with pointers to the Senate CIA torture report.
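To make the first of those concrete, here is a minimal sketch of what such an annotation might look like, written as a Python dictionary in the JSON-LD style the draft uses. The property names only approximate the draft’s vocabulary and the URIs are hypothetical placeholders, so check both against the specification before relying on them.

import json

# A hypothetical annotation: one Body supplying the redacted name, one Target
# selecting the "(deleted)" span in the report by its surrounding text.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",   # assumed context URI
    "type": "Annotation",
    "motivation": "identifying",
    "body": {
        "type": "TextualBody",
        "value": "Name of the CTC Legal official redacted from the report."
    },
    "target": {
        "source": "http://example.org/senate-cia-torture-report",  # placeholder URI
        "selector": {
            "type": "TextQuoteSelector",
            "prefix": "the CIA's (deleted) CTC Legal, ",
            "exact": "(deleted)",
            "suffix": " , drafted a letter"
        }
    }
}

print(json.dumps(annotation, indent=2))

Shared with others, an annotation along these lines would let their viewer overlay the supplied name directly on the redacted span.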

The power of annotation breaks the barrier that documents pose to composing sub-document portions of information with other information.

If you thought we needed librarians when organizing information at the document level, just imagine how badly we will need them when organization of information is at the sub-document level.

* So much for torture being the solution when a bomb is ticking in a school. Anyone would use whatever means necessary to stop such a bomb and accept the consequences of their actions. Requests for legal immunity demonstrate that those involved in U.S.-sponsored torture were not only ineffectual but cowards as well.

December 11, 2014

Bringing biodiversity information to life

Filed under: Biodiversity,Challenges — Patrick Durusau @ 2:37 pm

Bringing biodiversity information to life

From the post:

The inaugural GBIF Ebbe Nielsen Challenge aims to inspire scientists, informaticians, data modelers, cartographers and other experts to create innovative applications of open-access biodiversity data.

Background

For the past 12 years, GBIF has awarded the Ebbe Nielsen Prize to recognize outstanding contributions to biodiversity informatics while honouring the legacy of Ebbe Nielsen, one of the principal founders of GBIF, who tragically died just before it came into being.

The Science Committee, working with the Secretariat, has revamped the award for 2015 as the GBIF Ebbe Nielsen Challenge. This open incentive competition seeks to encourage innovative uses of the more than half a billion species occurrence records mobilized through GBIF’s international network. These creative applications of GBIF-mediated data may come in a wide variety of forms and formats—new analytical research, richer policy-relevant visualizations, web and mobile applications, improvements to processes around data digitization, quality and access, or something else entirely. Judges will evaluate submissions on their innovation, functionality and applicability.

As a simple point of departure, participants may wish to review the visual analyses of trends in mobilizing species occurrence data at global and national scales recently unveiled on GBIF.org. Challenge submissions may build on such creations and propose uses or extensions that make GBIF-mediated data even more useful to researchers, policymakers, educators, students and citizens alike.

A jury composed of experts from the biodiversity informatics community will judge the Round One entries collected through this ChallengePost website on their innovation, functionality and applicability, before selecting three to six finalists to compete for a €20,000 First Prize later in 2015.

You can’t argue with the judging criteria:

Innovation

How novel is the submission? A significant portion of the submission should be developed for the challenge. A submission based largely (or entirely) on work published or developed prior to the challenge start date will not be eligible for submission.

Functionality

Does the submission work and show or do something useful?

Applicability

Can the GBIF and biodiversity informatics communities use and/or build on the submission?

Deadline: Tuesday, 3 March 2015 at 5pm CET.

An obvious opportunity to introduce the biodiversity community to topic maps!

Oh, there is a €20,000 first prize and €5,000 second prize. Just something to pique your interest. 😉

Wall Street Journal Retraction? (Michael V. Hayden)

Filed under: Government,NSA,Politics — Patrick Durusau @ 1:45 pm

You may have missed NSA Reform That Only ISIS Could Love, which appeared in the Wall Street Journal as an opinion piece on November 17, 2014, less than a month before the release of the executive summary of the Senate Report on CIA Torture.

As a long-time reliable source of information for the financial community, the Wall Street Journal should disavow that opinion piece as purposefully misleading the very readers it claims to serve.

Why? Consider the excellent summary in Hayden’s testimony vs. the Senate report by the Washington Post that compares Hayden’s recorded testimony to the Senate Select Committee on Intelligence on April 12, 2007, to the written executive summary of the Senate Report on Torture.

There you will find a consistent pattern of lies and deception that makes any statement by Michael V. Hayden unworthy of belief. Moreover, since his pattern of lying has not changed over the years, it injects known falsehoods into a vital national debate.

To make amends for perpetuating this liar’s spew, the Wall Street Journal should disavow NSA Reform That Only ISIS Could Love, denounce Michael V. Hayden as a public liar and call for the release of the full and unedited version of the Senate Report on CIA Torture.

While I may not always agree with the Wall Street Journal editorial line, it has always been faithful to the business community that it serves. The WSJ has done a disservice to that community with the Michael V. Hayden opinion piece and should now make amends.


While the Wall Street Journal considers its perpetuation of lies by Michael V. Hayden, other organizations should reconsider their relationships with Michael V. Hayden.

George Mason University, School of Policy, Government and International Affairs, for example, where Michael V. Hayden is a Distinguished Visiting Professor. Unless they are offering a graduate degree in lying to the American public.

Motorola Solutions has Michael V. Hayden on its board of directors. I wonder how the shareholders of Motorola Solutions, which is 312th on the Fortune 500 list for 2014, feel about having a torture concealer and advocate on their board of directors?

Which reminds me, what is the statute of limitations on lying to Congress? All I could find readily was Statutes of Limitation in Federal Criminal Cases: An Overview by Charles Doyle (2012). The general rule appears to be a five-year statute of limitations and, since lying to Congress doesn’t appear to have a separate limit, it may be five years. That’s not legal advice! Check with a lawyer before you make statements to Congress and better yet, why not tell the truth?

When Do Natural Language Metaphors Influence Reasoning?…

Filed under: Language,Metaphors — Patrick Durusau @ 11:23 am

When Do Natural Language Metaphors Influence Reasoning? A Follow-Up Study to Thibodeau and Boroditsky (2013) by Gerard J. Steen, W. Gudrun Reijnierse, and Christian Burgers.

Abstract:

In this article, we offer a critical view of Thibodeau and Boroditsky who report an effect of metaphorical framing on readers’ preference for political measures after exposure to a short text on the increase of crime in a fictitious town: when crime was metaphorically presented as a beast, readers became more enforcement-oriented than when crime was metaphorically framed as a virus. We argue that the design of the study has left room for alternative explanations. We report four experiments comprising a follow-up study, remedying several shortcomings in the original design while collecting more encompassing sets of data. Our experiments include three additions to the original studies: (1) a non-metaphorical control condition, which is contrasted to the two metaphorical framing conditions used by Thibodeau and Boroditsky, (2) text versions that do not have the other, potentially supporting metaphors of the original stimulus texts, (3) a pre-exposure measure of political preference (Experiments 1–2). We do not find a metaphorical framing effect but instead show that there is another process at play across the board which presumably has to do with simple exposure to textual information. Reading about crime increases people’s preference for enforcement irrespective of metaphorical frame or metaphorical support of the frame. These findings suggest the existence of boundary conditions under which metaphors can have differential effects on reasoning. Thus, our four experiments provide converging evidence raising questions about when metaphors do and do not influence reasoning.

The influence of metaphors on reasoning raises an interesting question for those attempting to duplicate the human brain in silicon: Can a previously recorded metaphor influence the outcome of AI reasoning?

Or can hearing the same information multiple times from different sources influence an AI’s perception of the validity of that information? (In a non-AI context, a relevant question for the Michael Brown grand jury discussion.)

On its own merits, a very good read and recommended to anyone who enjoys language issues.

EMNLP 2014: Conference on Empirical Methods in Natural Language Processing

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 11:06 am

EMNLP 2014: Conference on Empirical Methods in Natural Language Processing

I rather quickly sorted these tutorials into order by the first author’s last name:

The links will take you to the conference site and descriptions with links to videos and other materials.

You can download the complete conference proceedings: EMNLP 2014 The 2014 Conference on Empirical Methods In Natural Language Processing Proceedings of the Conference, which at two thousand one hundred and ninety-one (2191) pages, should keep you busy through the holiday season. 😉

Or if you are interested in a particular paper, see the Main Conference Program, which has links to individual papers and videos of the presentations in many cases.

A real wealth of materials here! I must say the conference servers are the most responsive I have ever seen.

I first saw this in a tweet by Jason Baldridge.

Semantic Parsing with Combinatory Categorial Grammars (Videos)

Filed under: Grammar,Linguistics,Parsing,Semantics — Patrick Durusau @ 10:45 am

Semantic Parsing with Combinatory Categorial Grammars by Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer. (Tutorial)

Abstract:

Semantic parsers map natural language sentences to formal representations of their underlying meaning. Building accurate semantic parsers without prohibitive engineering costs is a long-standing, open research problem.

The tutorial will describe general principles for building semantic parsers. The presentation will be divided into two main parts: modeling and learning. The modeling section will include best practices for grammar design and choice of semantic representation. The discussion will be guided by examples from several domains. To illustrate the choices to be made and show how they can be approached within a real-life representation language, we will use λ-calculus meaning representations. In the learning part, we will describe a unified approach for learning Combinatory Categorial Grammar (CCG) semantic parsers, that induces both a CCG lexicon and the parameters of a parsing model. The approach learns from data with labeled meaning representations, as well as from more easily gathered weak supervision. It also enables grounded learning where the semantic parser is used in an interactive environment, for example to read and execute instructions.

The ideas we will discuss are widely applicable. The semantic modeling approach, while implemented in λ-calculus, could be applied to many other formal languages. Similarly, the algorithms for inducing CCGs focus on tasks that are formalism independent, learning the meaning of words and estimating parsing parameters. No prior knowledge of CCGs is required. The tutorial will be backed by implementation and experiments in the University of Washington Semantic Parsing Framework (UW SPF).
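For readers new to the formalism, a toy illustration (my own example, not taken from the tutorial): a semantic parser of this kind might map the sentence “show me flights to Boston” to a λ-calculus meaning representation along the lines of

λx. flight(x) ∧ to(x, BOSTON)

that is, the set of entities x such that x is a flight and its destination is Boston.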

I previously linked to the complete slide set for this tutorial.

This page offers short videos (twelve (12) currently) and links into the slide set. More videos are forthcoming.

The goal of the project is “recover complete meaning representation” where complete meaning = “Complete meaning is sufficient to complete the task.” (from video 1).

That definition of “complete meaning” dodges a lot of philosophical as well as practical issues with semantic parsing.

Take the time to watch the videos, Yoav is a good presenter.

Enjoy!

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

Filed under: Classification,Classifier,Machine Learning — Patrick Durusau @ 10:07 am

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? by Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. (Journal of Machine Learning Research 15 (2014) 3133-3181)

Abstract:

We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest-neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today. We use 121 data sets, which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behavior, not dependent on the data set collection. The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

Keywords: classification, UCI data base, random forest, support vector machine, neural networks, decision trees, ensembles, rule-based classifiers, discriminant analysis, Bayesian classifiers, generalized linear models, partial least squares and principal component regression, multiple adaptive regression splines, nearest-neighbors, logistic and multinomial regression

Deeply impressive work but I can hear in the distance the girding of loins and sharpening of tools of scholarly disagreement. 😉

If you are looking for a very comprehensive reference of current classifiers, this is the paper for you.

For the practicing data scientist I think the lesson is to learn a small number of the better classifiers and not to fret overmuch about the lesser ones. If a major breakthrough in classification techniques does happen, it will arrive in the major tools with great fanfare.
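If you want to get a feel for two of the paper’s top-ranked families without leaving your desk, here is a minimal sketch (mine, not the paper’s evaluation protocol) comparing a random forest and a Gaussian-kernel SVM with scikit-learn on a small built-in data set:

# Compare two of the paper's top-ranked families on a toy data set.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = [
    ("random forest", RandomForestClassifier(n_estimators=500, random_state=0)),
    ("SVM, Gaussian kernel", SVC(kernel="rbf", gamma="scale")),
]

for name, clf in models:
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")

On most everyday problems the difference between the two will be small compared to the effort of tuning the long tail of 179 alternatives, which is the paper’s point.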

I first saw this in a tweet by Jason Baldridge.

The Impala Cookbook

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 9:31 am

The Impala Cookbook by Justin Kestelyn.

From the post:

Impala, the open source MPP analytic database for Apache Hadoop, is now firmly entrenched in the Big Data mainstream. How do we know this? For one, Impala is now the standard against which alternatives measure themselves, based on a proliferation of new benchmark testing. Furthermore, Impala has been adopted by multiple vendors as their solution for letting customers do exploratory analysis on Big Data, natively and in place (without the need for redundant architecture or ETL). Also significant, we’re seeing the emergence of best practices and patterns out of customer experiences.

As an effort to streamline deployments and shorten the path to success, Cloudera’s Impala team has compiled a “cookbook” based on those experiences, covering:

  • Physical and Schema Design
  • Memory Usage
  • Cluster Sizing and Hardware Recommendations
  • Benchmarking
  • Multi-tenancy Best Practices
  • Query Tuning Basics
  • Interaction with Apache Hive, Apache Sentry, and Apache Parquet

By using these recommendations, Impala users will be assured of proper configuration, sizing, management, and measurement practices to provide an optimal experience. Happy cooking!

I must confess to some confusion when I first read Justin’s post. I thought the slide set was a rather long description of the cookbook and not the cookbook itself. I was searching for the cookbook and kept finding the slides. 😉

Oh, the slides are very much worth your time but I would reserve the term “cookbook” for something a bit more substantive.

Although O’Reilly thinks a few more than 800 responses constitute a “survey” of data scientists, those survey results are free of any mention of Impala. Another reason to use that “survey” with caution.

Wouldn’t it be fun to build your own Google?

Wouldn’t it be fun to build your own Google? by Martin Kleppmann.

Martin writes:

Imagine you had your own copy of the entire web, and you could do with it whatever you want. (Yes, it would be very expensive, but we’ll get to that later.) You could do automated analyses and surface the results to users. For example, you could collate the “best” articles (by some definition) written on many different subjects, no matter where on the web they are published. You could then create a tool which, whenever a user is reading something about one of those subjects, suggests further reading: perhaps deeper background information, or a contrasting viewpoint, or an argument on why the thing you’re reading is full of shit.

Unfortunately, at the moment, only Google and a small number of other companies that have crawled the web have the resources to perform such analyses and build such products. Much as I believe Google try their best to be neutral, a pluralistic society requires a diversity of voices, not a filter bubble controlled by one organization. Surely there are people outside of Google who want to work on this kind of thing. Many a start-up could be founded on the basis of doing useful things with data extracted from a web crawl.

He goes on to discuss current search efforts such as Common Crawl and Wayfinder before hitting full stride with his suggestion for a distributed web search engine. Painting in the broadest of strokes, Martin makes it sound almost plausible to contemplate such an effort.

While conceding the technological issues would be many, Martin contends that the payoff would be immense, in ways we won’t know until it is available. I suspect he is right, but if so, then we should be able to see a similar impact from Common Crawl. Yes?

Not to rain on a parade I would like to join, but extracting value from a web crawl like Common Crawl is not a guaranteed thing. A more complete crawl of the web only multiplies those problems; it doesn’t make them easier to solve.

On the whole I think a distributed crawl of the web is a great idea, but while that develops, we had best hone our skills at extracting value from the partial crawls that already exist.
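As a small example of extracting value from the crawls that already exist, here is a hedged sketch of walking one Common Crawl WARC file with the warcio package. The file name is a placeholder; you would first download an actual segment listed in the Common Crawl index.

# Iterate over one WARC file and hand each fetched page off to your own analysis.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:   # hypothetical file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":                # only fetched pages
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            # ...index, classify, or link url and html here...
            print(url, len(html))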

2014 Data Science Salary Survey [R + Python?]

Filed under: Data Science,Python,R — Patrick Durusau @ 7:27 am

2014 Data Science Salary Survey: Tools, Trends, What Pays (and What Doesn’t) for Data Professionals by John King and Roger Magoulas.

From the webpage:

For the second year, O’Reilly Media conducted an anonymous survey to expose the tools successful data analysts and engineers use, and how those tool choices might relate to their salary. We heard from over 800 respondents who work in and around the data space, and from a variety of industries across 53 countries and 41 U.S. states.

Findings from the survey include:

  • Average number of tools and median income for all respondents
  • Distribution of responses by a variety of factors, including age, location, industry, position, and cloud computing
  • Detailed analysis of tool use, including tool clusters
  • Correlation of tool usage and salary

Gain insight from these potentially career-changing findings—download this free report to learn the details, and plug your own variables into the regression model to find out where you fit into the data space.

The best take on this publication can be found in O’Reilly Data Scientist Salary and Tools Survey, November 2014 by David Smith where he notes:

The big surprise for me was the low ranking of NumPy and SciPy, two toolkits that are essential for doing statistical analysis with Python. In this survey and others, Python and R are often similarly ranked for data science applications, but this result suggests that Python is used about 90% for data science tasks other than statistical analysis and predictive analytics (my guess: mainly data munging). From these survey results, it seems that much of the “deep data science” is done by R.

My initial observation is that “more than 800 respondents” is too small a data sample to draw any useful conclusions about the tools used by data scientists. Especially when the #1 tool listed in that survey was Windows.

Why a majority of “data scientists” would rank an OS alongside data processing tools like SQL or Excel, both of which ranked higher than Python or R, is unknown, but it casts further doubt on the data sample.

My suggestion would be to have a primary tool or language (other than an OS) whether it is R or Python but to be familiar with the strengths of other approaches. Religious bigotry about approaches is a poor substitute for useful results.

Book of Proof

Filed under: Mathematical Reasoning,Mathematics — Patrick Durusau @ 6:52 am

Book of Proof by Richard Hammack.

From the webpage:

This book is an introduction to the standard methods of proving mathematical theorems. It has been approved by the American Institute of Mathematics’ Open Textbook Initiative. Also see the Mathematical Association of America Math DL review (of the 1st edition), and the Amazon reviews.

The second edition is identical to the first edition, except some mistakes have been corrected, new exercises have been added, and Chapter 13 has been extended. (The Cantor-Bernstein-Schröder theorem has been added.) The two editions can be used interchangeably, except for the last few pages of Chapter 13. (But you can download them here.)

Order a copy from Amazon or Barnes & Noble for $13.75 or download a pdf for free here. Click here for a pdf copy of the entire book, or get the chapters individually below.

From the Introduction:

This book will initiate you into an esoteric world. You will learn and apply the methods of thought that mathematicians use to verify theorems, explore mathematical truth and create new mathematical theories. This will prepare you for advanced mathematics courses, for you will be better able to understand proofs, write your own proofs and think critically and inquisitively about mathematics.

For a 300+ page book, almost a steal at Amazon for $13.75. A stocking stuffer for Math/CS types on your holiday list. For yourself, grab the pdf version. 😉

Big data projects are raising the bar for being able to think critically about data and the mathematics that underlie its processing.

Big data is by definition too large for human inspection. So you had better be able to think critically about the nature of the data en masse and the methods to be used to process it.

Or to put it another way, if you don’t understand the impact of the data on processing, or assumptions built into the processing methods, how are you going to evaluate big data results?

Just accept them as ground-level truth? Ignore them if they contradict your “gut”? Use a Magic 8-Ball? A Ouija board?

I would recommend none of the above and working on your data and math critical evaluation skills.

You?

I first saw this in a tweet by David Higginbotham.

December 10, 2014

FoundationDB 3.0

Filed under: FoundationDB — Patrick Durusau @ 7:29 pm

Failing at Scaling by Dave Rosenthal.

Dave writes a great post, but you may want to cut to what screams “Try FoundationDB!”

Without further ado:

FoundationDB performance

I hope you agree that this is an incredible result. And it’s made even more impressive because we are hitting this number on a fully-ordered, fully-transactional database with 100% multi-key cross-node transactions. We haven’t heard of a database that even comes close to these performance numbers with those guarantees. Oh, and in the public cloud, with all its usual communications and noisy-neighbor challenges.

Let’s put 14.4 MHz in context:

It’s gratifying for the whole team here to hit our ambitious initial goal after five hard years of theory, simulation, and engineering!

Yep, that is 14,400,000 random writes per second. (I know, Dave calls that number 14.4 MHz. Control of abuse of language isn’t my department.)

I’m sure you have other questions so see Dave’s post and while you are there, grab a copy of FoundationDB 3.0!

Student Data Sets

Filed under: Climate Data,Dataset — Patrick Durusau @ 5:47 pm

Christopher Lortie tweeted today that his second year ecology students have posted 415 datasets this year!

Which is a great example for others!

However, how do other people find these and similar datasets?

Not a criticism of the students or their datasets but a reminder that findability remains an unsolved issue.

Look-behind regex

Filed under: Regex,Regexes — Patrick Durusau @ 5:26 pm

Look-behind regex by John D. Cook.

From the post:

Look-behind is one of those advanced/obscure regular expression features that I don’t use frequently enough to remember the syntax, but just frequently enough that I wish I could remember it.

Look-behind can be positive or negative. Look-behind says “match this position only if the preceding text matches (does not match) the following pattern.”

I wish I had read this post before writing regular expressions to clean up over 4K of scanning results recently. I can think of several cases where this could have been helpful.
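For anyone in the same boat, a quick sketch of both flavors using Python’s re module:

import re

text = "price: $120, discount: 15%, cost: $90"

# Positive look-behind: match digits only if preceded by a dollar sign.
print(re.findall(r"(?<=\$)\d+", text))      # ['120', '90']

# Negative look-behind: match whole numbers NOT preceded by a dollar sign.
print(re.findall(r"(?<!\$)\b\d+\b", text))  # ['15']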

If you want to practice your regex writing skills, visit Stack Overflow and try your hand at recent regex questions. Or stroll through some of the older questions for tips/techniques.

A Latent Source Model for Online Collaborative Filtering

Filed under: Collaboration,Filters,Recommendation — Patrick Durusau @ 5:06 pm

A Latent Source Model for Online Collaborative Filtering by Guy Bresler, George H. Chen, and Devavrat Shah.

Abstract:

Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the “online” setting, where items are recommended to users over time. We address this theoretical gap by introducing a model for online recommendation systems, cast item recommendation under the model as a learning problem, and analyze the performance of a cosine-similarity collaborative filtering method. In our model, each of n users either likes or dislikes each of m items. We assume there to be k types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again. The goal is to maximize the number of likable items recommended to users over time. Our main result establishes that after nearly log(km) initial learning time steps, a simple collaborative filtering algorithm achieves essentially optimal performance without knowing k. The algorithm has an exploitation step that uses cosine similarity and two types of exploration steps, one to explore the space of items (standard in the literature) and the other to explore similarity between users (novel to this work).

The similarity between users makes me wonder whether merging results from a topic map could or should be returned on the basis of user similarity, on the assumption that at some point of similarity distinct users share views about subject identity.
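To make the exploitation step a bit more concrete, here is a minimal sketch (mine, not the paper’s algorithm) of cosine similarity over like/dislike histories, restricted to the items both users have consumed:

import numpy as np

def user_similarity(u, v):
    """u, v: +1 = liked, -1 = disliked, 0 = not yet consumed."""
    both = (u != 0) & (v != 0)          # compare only co-consumed items
    if not both.any():
        return 0.0
    a, b = u[both], v[both]
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

alice = np.array([+1, -1, 0, +1, 0])
bob   = np.array([+1, -1, +1, 0, -1])
print(user_similarity(alice, bob))      # 1.0: they agree on every shared item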

Yet More “Hive” Confusion

Filed under: Crowd Sourcing,Open Source,Semantic Diversity — Patrick Durusau @ 4:28 pm

The New York Times R&D Lab releases Hive, an open-source crowdsourcing tool by Justin Ellis.

From the post:

A few months ago we told you about a new tool from The New York Times that allowed readers to help identify ads inside the paper’s massive archive. Madison, as it was called, was the first iteration on a new crowdsourcing tool from The New York Times R&D Lab that would make it easier to break down specific tasks and get users to help an organization get at the data they need.

Today the R&D Lab is opening up the platform that powers the whole thing. Hive is an open-source framework that lets anyone build their own crowdsourcing project. The code responsible for Hive is now available on GitHub. With Hive, a developer can create assignments for users, define what they need to do, and keep track of their progress in helping to solve problems.

Not all that long ago, I penned: Avoiding “Hive” Confusion, which pointed out the possible confusion between Apache Hive and High-performance Integrated Virtual Environment (HIVE), in mid to late October, 2014. Now, barely two months later we have another “Hive” in the information technology field.

I have no idea how many “hives” there are inside or outside of IT but as of today, I can name at least three (3).

Have you ever thought that semantic confusion is part and parcel of the human condition? It can be allowed for, it can be compensated for, but it can never be eliminated.

ArrayFire: A Portable Open-Source Accelerated Computing Library

Filed under: GPU,Parallelism — Patrick Durusau @ 4:14 pm

ArrayFire: A Portable Open-Source Accelerated Computing Library by Pavan Yalamanchilli.

From the post:

The ArrayFire library is a high-performance software library with a focus on portability and productivity. It supports highly tuned, GPU-accelerated algorithms using an easy-to-use API. ArrayFire wraps GPU memory into a simple “array” object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.

ArrayFire Capabilities

ArrayFire has a range of functionality.

ArrayFire has three back ends to enable portability across many platforms: CUDA, OpenCL and CPU. It even works on embedded platforms like NVIDIA’s Jetson TK1.

In a past post about ArrayFire we demonstrated the ArrayFire capabilities and how you can increase your productivity by using ArrayFire. In this post I will tell you how you can use ArrayFire to exploit various kind of parallelism on NVIDIA GPUs.

Just in case you get a box full of GPUs during the holidays and/or need better performance from ones you already have!

Enjoy!

Timeline of sentences from the CIA Torture Report

Filed under: Government,Government Data — Patrick Durusau @ 3:59 pm

Chris R. Albon has created a timeline of sentences from the CIA torture report!

Example:

year,statement
1997,”The FBI information included that al-Mairi’s brother “”traveled to Afghanistan in 1997-1998 to train in Bin – Ladencamps.”””
1997,”The FBI information included that al-Mairi’s brother “”traveled to Afghanistan in 1997-1998 to train in Bin – Ladencamps.”””
1997,”For example, on October 12, 2004, another CIA detainee explained how he met al-Kuwaiti at a guesthouse that was operated by Ibn Shaykh al-Libi and Abu Zubaydah in 1997.”

Cleanly imports into Apache OpenOffice Calc and is 6163 rows (after subtracting the header).
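A minimal sketch for working with the file programmatically, assuming you have saved it locally as cia-torture-timeline.csv (a hypothetical file name): load it with pandas and count sentences per year.

import pandas as pd

df = pd.read_csv("cia-torture-timeline.csv")
print(len(df))                                  # expect 6163 rows
print(df["year"].value_counts().sort_index())   # sentences per year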

Please acknowledge Chris if you use this data.

What other data would you pull from the executive summary?

What other data do you think would convince Senator Udall to release the entire 6,000 page report?

A Tranche of Climate Data

Filed under: Climate Data,Government,Government Data — Patrick Durusau @ 3:35 pm

FACT SHEET: Harnessing Climate Data to Boost Ecosystem & Water Resilience

From the document:

Today, the Administration is making a new tranche of data about ecosystems and water resilience available as part of the Climate Data Initiative—including key datasets related water quality, streamflow, land cover, soils, and biodiversity.

In addition to the datasets being added today to climate.data.gov, the Department of Interior (DOI) is launching a suite of geospatial mapping tools on ecosystems.data.gov that will enable users to visualize and overlay datasets related to ecosystems, land use, water, and wildlife. Together, the data and tools unleashed today will help natural-resource managers, decision makers, and communities on the front lines of climate change build resilience to climate impacts and better plan for the future. (emphasis added)

I had to look up “tranche.” Google offers: “a portion of something, especially money.”

Assume that your contacts and interactions with both sites are monitored and recorded.

Warning: Nazi Protectors Succeed In Part

Filed under: Government,Politics — Patrick Durusau @ 2:53 pm

You may remember my post on the non-release of the Senate Intelligence Report on CIA torture: Would You Protect Nazi Torturers And Their Superiors?

An executive summary of that report has been released, Committee Study of the Central Intelligence Agency’s Detention and Interrogation Program, which at five hundred and twenty-five (525) pages, should raise even louder demands for the release of the entire report.

It is essential that you contact Senator Mark Udall to request that he release the full 6,000 page report before he leaves office at the end of the year.

http://www.markudall.senate.gov/

Senator Mark Udall
Hart Office Building Suite SH-730
Washington, D.C. 20510

P: 202-224-5941
F: 202-224-6471

The partial report gives you some idea how the Nazi protectors in the CIA and elsewhere have succeeded in avoiding accountability for their actions. So far.

Don’t allow a soon-to-be Republican-controlled Senate to conceal the truth about CIA torture from the very people who elected them.

Odd that the responsibility-preaching Republicans seem so eager to help the CIA and its masters avoid responsibility for their actions.

Call Senator Udall today!

Michael Brown Grand Jury – Presenting Evidence Before Knowing the Law

Filed under: Ferguson,Skepticism — Patrick Durusau @ 2:41 pm

News coverage of the Michael Brown grand jury has proceeded like the prosecution in the case. It has been “look at this,” “now look at that,” with no rhyme or reason to the presentation. Big mistakes were made but in context, a pattern emerges that does not appear to be the result of chance or incompetence.

That pattern includes the absence of things that are expected in any grand jury proceeding.

For example, did you know the grand jurors were never told what laws might apply to this case until the very end? And even there we don’t know what was said to the jurors.

4 GRAND JUROR: So you are going to give us
5 those guidelines for us?
6     MS. WHIRLEY: Right.
7     MS. ALIZADEH: We're not going to give you
8 the facts and say if he did this and then this, if
9 you believe this, then this. But we're going to
10 give you what the law says when a law officer can
11 use force to affect an arrest and when that force
12 can be deadly. And then also when a person can use
13 force to defend themselves and when that force can
14 be deadly.
15     There is all kind of things about whether
16 or not the person is an initial aggressor, you know.
17 And under the law, a law enforcement officer can be
18 an initial aggressor, unless his arrest is unlawful.

This exchange happens in Volume 24, page 108, lines 4-18. Problem is, we don’t know what “laws” were actually given to the grand jury or in what form. More missing “evidence.”

Notice that the prosecutors deviated from the normal pattern of grand jury proceedings.

When the Grand Jury meets, the district attorney or an assistant district attorney designated by the district attorney will either read or explain the proposed Indictment (sometimes referred to as a Bill of Indictment) to the Grand Jury and will acquaint them with the witnesses who will testify. This is done to allow the Grand Jurors to familiarize themselves with the parties involved in case one or more members are disqualified to serve (see p. 21 and 22). (Grand Jury Handbook, page 25) (To the same effect but in federal grand juries, Antitrust Division Grand Jury Practice Manual page IV-2)

Outlining the law to a grand jury sets a context in which the jurors place evidence and separate the important from the trivial or even irrelevant. You can scour all twenty-four volumes, volume one in particular, and you will find no such assistance for the grand jury in this case.

I will outline the laws that should have been given to the grand jury at the outset of this investigation and then the consequences of not having those laws all along will be more evident.

Please shout if I fail to give specific references and/or hyperlinks to resources that I cite. I am less interested in your hearing my summary than I am in providing you with the ability to see the primary materials for yourself. (Another characteristic of a well authored topic map.)

There are two possible charges that could have been given to the grand jury (well, a properly assisted grand jury in this case): first and second degree murder. Let’s look at the laws in both cases.

First Degree Murder

First degree murder, penalty–person under sixteen years of age not to receive death penalty.

565.020. 1. A person commits the crime of murder in the first degree if he knowingly causes the death of another person after deliberation upon the matter.

The elements of first degree murder are:

  • person commits
  • knowingly causes
  • death of another person
  • after deliberation on the matter

You may have heard the term “premeditated” murder before. Essentially someone who plans to murder another person and then carries it out. There’s no specific time limit required for deliberation.

As a tactical matter, a prosecutor would not give the grand jury a first degree murder indictment in this case because there is no evidence of deliberation. The only reason for giving it in this case is to get the grand jury accustomed to the idea of not returning a true bill on any charge.

For the Michael Brown grand jury, absent some evidence that Darren Wilson knew and had some plan to murder Michael Brown, I would leave this one out.

Second Degree Murder

Until December 31, 2016–Second degree murder, penalty.

565.021. 1. A person commits the crime of murder in the second degree if he:

(1) Knowingly causes the death of another person or, with the purpose of causing serious physical injury to another person, causes the death of another person; or

(omitted language on murder in the course of commission of a felony as irrelevant)

The elements of second degree murder are:

  • person commits
  • knowingly causes
  • death of another person
  • or with purpose of serious injury
  • causes the death of another person

This illustrates the reason for instructing the grand jury on the law before they start hearing evidence. It enables them to sort out useful from non-useful testimony and evidence.

For example, do you see anything in the elements of second degree murder that allows killing of another person if the other person has been smoking marijuana? Or does it permit killing of another person for jaywalking? Or if a jaywalker runs away? What if you “tussle” with a police officer? Fair game? No, it doesn’t say any of those things.

Think about reading the grand jury transcripts and marking out witnesses and evidence that isn’t relevant to the elements:

  • person commits
  • knowingly causes
  • death of another person
  • or with purpose of serious injury
  • causes the death of another person

Not today but I will be annotating that list with points in the transcript that provide “probable cause” for each of those points.

You will have noticed from the quoted portion of the transcript that defense counsel ALIZADEH gives the jury instructions on use of force in self-defense and by a police officer.

That’s not a typo, I really mean defense counsel ALIZADEH. Why? I have appended the full statute provisions at the end of this post but in part:

Self-defense

Use of force in defense of persons provides in part:

5. The defendant shall have the burden of injecting the issue of justification under this section.

Who raised it? Defense counsel ALIZADEH.

Force by a police officer

Until December 31, 2016–Law enforcement officer’s use of force in making an arrest provides in part:

4. The defendant shall have the burden of injecting the issue of justification under this section.

Who raised it? Defense counsel ALIZADEH.

Voluntary Manslaughter

I suspect the jury was also instructed on voluntary manslaughter, which was also inappropriate because like the other statutes, Until December 31, 2016–Voluntary manslaughter, penalty–under influence of sudden passion, defendant’s burden to inject provides that:

2. The defendant shall have the burden of injecting the issue of influence of sudden passion arising from adequate cause under subdivision (1) of subsection 1 of this section.

“Sudden passion from adequate cause” under Missouri law is a defense to second degree murder. What that means is that if you are charged with second degree murder, the trial jury (not the grand jury) can find you guilty of voluntary manslaughter as a responsive verdict. See: Until December 31, 2016–Lesser degree offenses of first and second degree murder–instruction on lesser offenses, when. (And for your convenience, below.)

Again, must be raised by and probably was raised by Defense counsel ALIZADEH.

Summary:

The only facts that the grand jury had to find probable cause for in its hearings and deliberations were:

  • person commits (Darren Wilson)
  • knowingly causes (not accidental, on purpose)
  • death of another person (Michael Brown’s death)
  • or with purpose of serious injury (multiple wounds)
  • causes the death of another person (Michael Brown’s death)

That’s it in a nutshell.

The trial jury or judge alone reaches decisions on self-defense, force by a police officer, “sudden passion from adequate cause,” and other issues. Not a grand jury.

Knowing the law, review the transcripts to say whether there was probable cause or not.

PS: Sorry, almost forgot:

The best-known definition of probable cause is “a reasonable belief that a person has committed a crime.”

From Probable Cause at Princeton University.

If you think shooting an unarmed person eight times leads to a reasonable belief a crime has been committed, then you would return a true bill for second degree murder.

Supplemental Missouri statutes

Use of force in defense of persons (563.031), Law enforcement officer’s use of force in making an arrest (563.046), Voluntary manslaughter (565.023), and Lesser degree offenses of first and second degree murder (565.025), below.

Use of force in defense of persons.

563.031. 1. A person may, subject to the provisions of subsection 2 of this section, use physical force upon another person when and to the extent he or she reasonably believes such force to be necessary to defend himself or herself or a third person from what he or she reasonably believes to be the use or imminent use of unlawful force by such other person, unless:

(1) The actor was the initial aggressor; except that in such case his or her use of force is nevertheless justifiable provided:

(a) He or she has withdrawn from the encounter and effectively communicated such withdrawal to such other person but the latter persists in continuing the incident by the use or threatened use of unlawful force; or

(b) He or she is a law enforcement officer and as such is an aggressor pursuant to section 563.046; or

(c) The aggressor is justified under some other provision of this chapter or other provision of law;

(2) Under the circumstances as the actor reasonably believes them to be, the person whom he or she seeks to protect would not be justified in using such protective force;

(3) The actor was attempting to commit, committing, or escaping after the commission of a forcible felony.

2. A person may not use deadly force upon another person under the circumstances specified in subsection 1 of this section unless:

(1) He or she reasonably believes that such deadly force is necessary to protect himself, or herself or her unborn child, or another against death, serious physical injury, or any forcible felony;

(2) Such force is used against a person who unlawfully enters, remains after unlawfully entering, or attempts to unlawfully enter a dwelling, residence, or vehicle lawfully occupied by such person; or

(3) Such force is used against a person who unlawfully enters, remains after unlawfully entering, or attempts to unlawfully enter private property that is owned or leased by an individual claiming a justification of using protective force under this section.

3. A person does not have a duty to retreat from a dwelling, residence, or vehicle where the person is not unlawfully entering or unlawfully remaining. A person does not have a duty to retreat from private property that is owned or leased by such individual.

4. The justification afforded by this section extends to the use of physical restraint as protective force provided that the actor takes all reasonable measures to terminate the restraint as soon as it is reasonable to do so.

5. The defendant shall have the burden of injecting the issue of justification under this section. If a defendant asserts that his or her use of force is described under subdivision (2) of subsection 2 of this section, the burden shall then be on the state to prove beyond a reasonable doubt that the defendant did not reasonably believe that the use of such force was necessary to defend against what he or she reasonably believed was the use or imminent use of unlawful force.

Until December 31, 2016–Law enforcement officer’s use of force in making an arrest.

563.046. 1. A law enforcement officer need not retreat or desist from efforts to effect the arrest, or from efforts to prevent the escape from custody, of a person he reasonably believes to have committed an offense because of resistance or threatened resistance of the arrestee. In addition to the use of physical force authorized under other sections of this chapter, he is, subject to the provisions of subsections 2 and 3, justified in the use of such physical force as he reasonably believes is immediately necessary to effect the arrest or to prevent the escape from custody.

2. The use of any physical force in making an arrest is not justified under this section unless the arrest is lawful or the law enforcement officer reasonably believes the arrest is lawful.

3. A law enforcement officer in effecting an arrest or in preventing an escape from custody is justified in using deadly force only

(1) When such is authorized under other sections of this chapter; or

(2) When he reasonably believes that such use of deadly force is immediately necessary to effect the arrest and also reasonably believes that the person to be arrested

(a) Has committed or attempted to commit a felony; or

(b) Is attempting to escape by use of a deadly weapon; or

(c) May otherwise endanger life or inflict serious physical injury unless arrested without delay.

4. The defendant shall have the burden of injecting the issue of justification under this section.

Until December 31, 2016–Voluntary manslaughter, penalty–under influence of sudden passion, defendant’s burden to inject.

565.023. 1. A person commits the crime of voluntary manslaughter if he:

(1) Causes the death of another person under circumstances that would constitute murder in the second degree under subdivision (1) of subsection 1 of section 565.021, except that he caused the death under the influence of sudden passion arising from adequate cause; or

(2) Knowingly assists another in the commission of self-murder.

2. The defendant shall have the burden of injecting the issue of influence of sudden passion arising from adequate cause under subdivision (1) of subsection 1 of this section.

Until December 31, 2016–Lesser degree offenses of first and second degree murder–instruction on lesser offenses, when.

565.025. 1. With the exceptions provided in subsection 3 of this section and subsection 3 of section 565.021, section 556.046 shall be used for the purpose of consideration of lesser offenses by the trier in all homicide cases.

2. The following lists shall comprise, in the order listed, the lesser degree offenses:

(1) The lesser degree offenses of murder in the first degree are:

(a) Murder in the second degree under subdivisions (1) and (2) of subsection 1 of section 565.021;

(b) Voluntary manslaughter under subdivision (1) of subsection 1 of section 565.023; and

(c) Involuntary manslaughter under subdivision (1) of subsection 1 of section 565.024;

(2) The lesser degree offenses of murder in the second degree are:

(a) Voluntary manslaughter under subdivision (1) of subsection 1 of section 565.023; and

(b) Involuntary manslaughter under subdivision (1) of subsection 1 of section 565.024.

3. No instruction on a lesser included offense shall be submitted unless requested by one of the parties or the court.

December 9, 2014

New recordings, documents released in Michael Brown case [LA Times Asks If There’s More?] Yes!

Filed under: Ferguson,Skepticism — Patrick Durusau @ 8:39 pm

Ferguson, Mo.: New recordings, documents released in Michael Brown case

James Queally and Maria L. La Ganga write for the Los Angeles Times:


It remains unclear whether all of the documents and transcripts connected to the grand jury investigation have been made public. Emails and phone calls to the St. Louis County Prosecutor’s Office late Monday were not immediately returned. Grand jury proceedings are usually secret, but McCulloch had pledged to release the evidence if Wilson was not indicted. (emphasis added)

I can answer that question without asking the St. Louis County Prosecutor’s Office.

NO!

For example and only as an example:

Read Grand Jury Volume 24, at page 69, lines 21-25:

21 now you've completed your police report in this
22 case; is that right?
23 A I have.
24 Q How many pages is your police report?
25 A I don't know exactly, 1,100, 1,200

So, where are the 1,100 to 1,200 pages of the police report by the crime scene detective who testified three (3) times before the grand jury?

Not present in the documents released thus far.

There are other documents missing, some of them even more critical than this report, but I will cover those in other posts.

A Quick Spin Around the Big Dipper

Filed under: Astroinformatics,Graphs — Patrick Durusau @ 8:10 pm

A Quick Spin Around the Big Dipper by Summer Ash.

From the post:

From our perspective here on Earth, constellations appear to be fixed groups of stars, immobile on the sky. But what if we could change that perspective?

In reality, it’d be close to impossible. We would have to travel tens to hundreds of light-years away from Earth for any change in the constellations to even begin to be noticeable. As of this moment, the farthest we (or any object we’ve made) have traveled is less than one five-hundredth of a light-year.

Just for fun, let’s say we could. What would our familiar patterns look like then? The stars that comprise them are all at different distances from us, traveling around the galaxy at different speeds, and living vastly different lives. Very few of them are even gravitationally bound to each other. Viewed from the side, they break apart into unrecognizable landscapes, their stories of gods and goddesses, ploughs and ladles, exposed as pure human fantasy. We are reminded that we live in a very big place.

Great visualizations.

Summer’s post reminded me of Caleb Jones’ Stellar Navigation Using Network Analysis and how he created 3-D visualizations out to various distances.

Rotating Caleb’s 3-D graphs would put more stars in the way of your vision, but it might also be more realistic.

Just as a thought experiment for the moment, what if you postulated a planet around a distant star and the transparency of the atmosphere for observing distant stars? What new constellations would you see from such a distant world?

Other than speed of travel, what would be the complexities of travel and governance across a sphere of influence of say 1,000 light years? Any natural groupings that might have similar interests?

Enjoy!

What does the NSA think of academic cryptographers? Recently-declassified document provides clues

Filed under: Cryptography,NSA — Patrick Durusau @ 7:55 pm

What does the NSA think of academic cryptographers? Recently-declassified document provides clues by Scott Aaronson.

From the post:

Brighten Godfrey was one of my officemates when we were grad students at Berkeley. He’s now a highly-successful computer networking professor at the University of Illinois Urbana-Champaign, where he studies the wonderful question of how we could get the latency of the Internet down to the physical limit imposed by the finiteness of the speed of light. (Right now, we’re away from that limit by a factor of about 50.)

Last week, Brighten brought to my attention a remarkable document: a 1994 issue of CryptoLog, an NSA internal newsletter, which was recently declassified with a few redactions. The most interesting thing in the newsletter is a trip report (pages 12-19 in the newsletter, 15-22 in the PDF file) by an unnamed NSA cryptographer, who attended the 1992 EuroCrypt conference, and who details his opinions on just about every talk. If you’re interested in crypto, you really need to read this thing all the way through, but here’s a small sampling of the zingers:

Are there any leaked copies of more recent issues of CryptoLog?

I ask because of the recent outcry about secure encryption of cell phones by default. The government should not be able to argue both ways: that non-government cryptography work is valueless, while at the same time depriving the average citizen of some modicum of privacy. Which is it?

I know the FBI wants us to return to physical phone lines and junction boxes so they can use their existing supply of wire tapping gear but that’s just not going to happen. Promise.

Parable of the Polygons

Filed under: Politics,Simulations,Social Networks,Socioeconomic Data — Patrick Durusau @ 7:26 pm

Parable of the Polygons – A Playable Post on the Shape of Society by VI Hart and Nicky Case.

From the post:

This is a story of how harmless choices can make a harmful world.

A must play post!

A deeply impressive simulation of how segregation comes into being. Moreover, of how small choices may not create the society you are trying to achieve.
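If you want to poke at the mechanism outside the browser, here is a minimal Python sketch of a Schelling-style neighborhood model, the kind of dynamic the playable post illustrates. The grid size, empty fraction, and “happiness” threshold are arbitrary choices of mine, not values from the post.

```python
import random

SIZE, EMPTY_FRACTION, THRESHOLD, STEPS = 20, 0.2, 0.34, 50
random.seed(42)

# 0 = empty lot, 1 and 2 = the two groups
cells = [1, 2] * int(SIZE * SIZE * (1 - EMPTY_FRACTION) / 2)
cells += [0] * (SIZE * SIZE - len(cells))
random.shuffle(cells)
grid = [cells[i * SIZE:(i + 1) * SIZE] for i in range(SIZE)]

def neighbor_stats(r, c, group):
    """Count occupied neighbors and how many share `group` (edges wrap around)."""
    same = total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr or dc:
                val = grid[(r + dr) % SIZE][(c + dc) % SIZE]
                if val:
                    total += 1
                    same += val == group
    return same, total

def unhappy(r, c):
    """A resident wants to move if too few occupied neighbors share its group."""
    if grid[r][c] == 0:
        return False
    same, total = neighbor_stats(r, c, grid[r][c])
    return total > 0 and same / total < THRESHOLD

for _ in range(STEPS):
    movers = [(r, c) for r in range(SIZE) for c in range(SIZE) if unhappy(r, c)]
    if not movers:
        break
    empties = [(r, c) for r in range(SIZE) for c in range(SIZE) if grid[r][c] == 0]
    random.shuffle(empties)
    for (r, c), (er, ec) in zip(movers, empties):
        grid[er][ec], grid[r][c] = grid[r][c], 0

# Crude segregation measure: average fraction of same-group neighbors.
scores = []
for r in range(SIZE):
    for c in range(SIZE):
        if grid[r][c]:
            same, total = neighbor_stats(r, c, grid[r][c])
            if total:
                scores.append(same / total)
print("average same-group neighbor fraction: %.2f" % (sum(scores) / len(scores)))
```

Even with a modest threshold (each resident content with only a third of like neighbors), the neighborhood tends to end up far more sorted than any individual asked for, which is the post’s point.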

Bear in mind that these simulations, despite being very instructive, are orders of magnitude less complex than the social aspects of de jure segregation I grew up under as a child.

That complexity is one of the reasons the ham-handed social engineering projects of governments, be they domestic or foreign, rarely reach happy results. Some people profit, mostly the architects of such programs, while for the people they intended to help, decades later things haven’t changed all that much.

If you think you have the magic touch to engineer a group, locality, nation or the world, please try your hand at these simulations first, bearing in mind that we have no working simulations of society that support social engineering on the scale attempted by various nation states that come to mind.

Highly recommended!

PS: Creating alternatives to show the impacts of variations in data analysis would be quite instructive as well.

Finding clusters of CRAN packages using igraph

Filed under: Graphs,R — Patrick Durusau @ 6:56 pm

Finding clusters of CRAN packages using igraph by Andrie de Vries.

From the post:

In a previous post I demonstrated how to use the igraph package to create a network diagram of CRAN packages and compute the page rank.

Now I extend this analysis and try to find clusters of packages that are close to one another.
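For readers who want to try the clustering step themselves, here is a minimal sketch using python-igraph rather than the R igraph package Andrie uses. The toy graph (Zachary’s karate club) is only a stand-in for the CRAN dependency graph, and the choice of the walktrap algorithm is mine, not necessarily his.

```python
import igraph as ig

# Stand-in for the CRAN package dependency graph.
g = ig.Graph.Famous("Zachary")

# Random-walk based community detection; cut the dendrogram at its best level.
clusters = g.community_walktrap().as_clustering()

print("found %d clusters, modularity %.3f" % (len(clusters), clusters.modularity))
for i, members in enumerate(clusters):
    print("cluster %d: %s" % (i, members))
```

The same few lines apply once you have built the real graph from package dependency edges; only the graph construction changes.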

Andrie assigns labels to the resulting groups and then worries:

With clusters this large, it’s quite brazen (and possibly just wrong) to try and interpret the clusters for meaning.

Not at all!

Without grouping and labeling, there is no opportunity to discover how others might group and label the same items. We may all stare at the same items but if no one groups or labels them, we can walk away with private and very different understandings of how items should be grouped.

I remember a scifi novel where one character observes, “sheep are different from each other,” to which another character adds, “but only to other sheep.” Our use of different groupings isn’t all that is important. The reasons we see/give for creating different groupings are important as well.

The Path Forward (Titan 1.0 and TinkerPop 3.0)

Filed under: Graphs,TinkerPop,Titan — Patrick Durusau @ 5:40 pm

The Path Forward by Marko Rodriguez.

A good overview of Titan 1.0 and TinkerPop 3.0. Marko always makes great slides.

I appreciate mythology as an example but it would be nice to see an example of Titan/TinkerPop used in anger.

With the limitation that the data be legally accessible (sorry), what would you suggest as a great example of using Titan/TinkerPop?

Since everyone likes mobile phone apps, I would suggest one that displays a street map and, as you pass street addresses, lights up each address as blue or red depending on its occupants’ political contributions. Brighter colors for larger donations.

I think that would prove to be very popular.

Would that be a good example for Titan/TinkerPop?

What’s yours?

Data Science with Hadoop: Predicting Airline Delays – Part 2

Filed under: Hadoop,Hortonworks,Machine Learning,Python,R,Spark — Patrick Durusau @ 5:25 pm

Using machine learning algorithms, Spark and Scala – Part 2 by Ofer Mendelevitch and Beau Plath.

From the post:

In this 2nd part of the blog post and its accompanying IPython Notebook in our series on Data Science and Apache Hadoop, we continue to demonstrate how to build a predictive model with Apache Hadoop, using existing modeling tools. And this time we’ll use Apache Spark and ML-Lib.

Apache Spark is a relatively new entrant to the Hadoop ecosystem. Now running natively on Apache Hadoop YARN, the architectural center of Hadoop, Apache Spark is an in-memory data processing API and execution engine that is effective for machine learning and data science use cases. And with Spark on YARN, data workers can simultaneously use Spark for data science workloads alongside other data access engines–all accessing the same shared dataset on the same cluster.
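As a rough sketch of the kind of Spark + MLlib step the post walks through (training a classifier on an already engineered feature matrix), something like the following works with the pyspark MLlib API. The HDFS path, column layout, and choice of logistic regression are my assumptions for illustration, not details lifted from the Hortonworks notebook.

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext(appName="airline-delays-sketch")

def parse(line):
    # assumed layout: label (1 = delayed 15+ minutes) followed by numeric features
    parts = [float(x) for x in line.split(",")]
    return LabeledPoint(parts[0], parts[1:])

data = sc.textFile("hdfs:///tmp/airline_feature_matrix.csv").map(parse)
train, test = data.randomSplit([0.7, 0.3], seed=17)

model = LogisticRegressionWithLBFGS.train(train)

# evaluate on the held-out split
labels_and_preds = test.map(lambda p: (p.label, model.predict(p.features)))
accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())
print("held-out accuracy: %.3f" % accuracy)

sc.stop()
```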


The next installment in this series continues the analysis with the same dataset, but this time with R!

The bar for user introductions to technology is getting higher even as we speak!

Data Science with Apache Hadoop: Predicting Airline Delays (Part 1)

Filed under: Hadoop,Hortonworks,Machine Learning,Python,R,Spark — Patrick Durusau @ 5:06 pm

Using machine learning algorithms, Pig and Python – Part 1 by Ofer Mendelevitch.

From the post:

With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN as the architectural center of Modern Data Architecture (MDA) allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.


It is a common misconception that the way data scientists apply predictive learning algorithms like Linear Regression, Random Forest or Neural Networks to large datasets requires a dramatic change in approach, in tooling, or in usage of siloed clusters. Not so: no dramatic change; no dedicated clusters; using existing modeling tools will suffice.

In fact, the big change is in what is known as “feature engineering”—the process by which very large raw data is transformed into a “feature matrix.” Enabled by Apache Hadoop with YARN as an ideal platform, this transformation of large raw datasets (terabytes or petabytes) into a feature matrix is now scalable and not limited by RAM or compute power of a single node.

Since the output of the feature engineering step (the “feature matrix”) tends to be relatively small in size (typically in the MB or GB scale), a common choice is to run the learning algorithm on a single machine (often with multiple cores and high amount of RAM), allowing us to utilize a plethora of existing robust tools and algorithms from R packages, Python’s Scikit-learn, or SAS.

In this multi-part blog post and its accompanying IPython Notebook, we will demonstrate an example step-by-step solution to a supervised learning problem. We will show how to solve this problem with various tools and libraries and how they integrate with Hadoop. In part I we focus on Apache PIG, Python, and Scikit-learn, while in subsequent parts, we will explore and examine other alternatives such as R or Spark/ML-Lib.
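To make the “single machine is enough” point concrete, here is a minimal sketch of the modeling step once feature engineering has produced a modest-sized feature matrix. The file name, the `delayed` label column, and the choice of a random forest are my assumptions for illustration, not details from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# assumed output of the feature-engineering step (small enough for one node)
matrix = pd.read_csv("airline_feature_matrix.csv")
X = matrix.drop("delayed", axis=1)   # numeric features
y = matrix["delayed"]                # 1 = delayed, 0 = on time

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17)

model = RandomForestClassifier(n_estimators=100, random_state=17)
model.fit(X_train, y_train)

print("held-out accuracy: %.3f" % accuracy_score(y_test, model.predict(X_test)))
```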

With the IPython notebook, this becomes a great example of how to provide potential users hands-on experience with a technology.

An example that Solr, among others, might well want to imitate.

PS: When I was traveling, a simpler way to predict flight delays was to just ping me for my travel plans. 😉 You?

The Coming Era of Egocentric Video Analysis

Filed under: Identifiers,Identity,Image Processing,Privacy — Patrick Durusau @ 3:58 pm

The Coming Era of Egocentric Video Analysis

From the post:

Head-mounted cameras are becoming de rigueur for certain groups—extreme sportsters, cyclists, law enforcement officers, and so on. It’s not hard to find content generated in this way on the Web.

So it doesn’t take a crystal ball to predict that egocentric recording is set to become ubiquitous as devices such as Go-Pros and Google Glass become more popular. An obvious corollary to this will be an explosion of software for distilling the huge volumes of data this kind of device generates into interesting and relevant content.

Today, Yedid Hoshen and Shmuel Peleg at the Hebrew University of Jerusalem in Israel reveal one of the first applications. Their goal: to identify the filmmaker from biometric signatures in egocentric videos.

A tidbit that I was unaware of:

Some of these are unique, such as the gait of the filmmaker as he or she walks, which researchers have long known is a remarkably robust biometric indicator. “Although usually a nuisance, we show that this information can be useful for biometric feature extraction and consequently for identifying the user,” say Hoshen and Peleg.

Makes me wonder if I should wear a prosthetic device to alter my gait when I do appear in range of cameras. 😉

Works great with topic maps. All you may know about an actor is that they have some gait with X characteristics. And a penchant for not getting caught planting explosive devices. With a topic map we can keep their gait as a subject identifier and record all the other information we have on such an individual.

If we ever match the gait to a known individual, then the information from both records, the anonymous gait owner and the known individual, will be merged together.

It works with other characteristics as well, which enables you to work from “I was attacked…” to more granular information that narrows the pool of suspects down to a manageable size.
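As a toy sketch of the merging described above (this is not any particular topic map engine’s API; the record layout and the “any shared identifier means same subject” rule are simplifications of mine):

```python
def merge_records(records):
    """Merge records that share at least one subject identifier (toy version)."""
    merged = []
    for record in records:
        ids = set(record["identifiers"])
        match = next((m for m in merged if m["identifiers"] & ids), None)
        if match:
            match["identifiers"] |= ids
            match["facts"].extend(record["facts"])
        else:
            merged.append({"identifiers": ids, "facts": list(record["facts"])})
    return merged

# Record built from camera footage: all we know is a gait signature.
anonymous = {"identifiers": {"gait:signature-X"},
             "facts": ["seen near device placement, 2014-11-02"]}

# Record about a known individual, later tied to the same gait signature.
known = {"identifiers": {"person:john-doe", "gait:signature-X"},
         "facts": ["resident of district 9", "prior arrest 2012"]}

for subject in merge_records([anonymous, known]):
    print(sorted(subject["identifiers"]), subject["facts"])
```

Once the gait signature is tied to a named individual, the sightings recorded against the anonymous gait owner travel with the merge, which is the payoff described above.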

Traditionally that has been the job of veterans on the police force, who know their communities and who the usual suspects are, but a topic map enhances their value by capturing their observations for use by the department long after a veteran’s retirement.

From arXiv: Egocentric Video Biometrics

Abstract:

Egocentric cameras are being worn by an increasing number of users, among them many security forces worldwide. GoPro cameras already penetrated the mass market, and Google Glass may follow soon. As head-worn cameras do not capture the face and body of the wearer, it may seem that the anonymity of the wearer can be preserved even when the video is publicly distributed.
We show that motion features in egocentric video provide biometric information, and the identity of the user can be determined quite reliably from a few seconds of video. Biometrics are extracted by training Convolutional Neural Network (CNN) architectures on coarse optical flow.

Egocentric video biometrics can prevent theft of wearable cameras by locking the camera when worn by people other than the owner. In video sharing services, this Biometric measure can help to locate automatically all videos shot by the same user. An important message in this paper is that people should be aware that sharing egocentric video will compromise their anonymity.

Now if we could just get members of Congress to always carry their cellphones and wear body cameras.

