Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 10, 2018

Source Community Call | January 11, 2018 | Thursday @ 12pm ET / 5pm GMT / 9am PST

Filed under: Journalism,News,Reporting — Patrick Durusau @ 12:56 pm

A resource sponsored by OpenNews, which self-describes as:

At OpenNews, we believe that a community of peers working, learning and solving problems together can create a stronger, more representative, and ascendant journalism. We organize events and community supports to strengthen and sustain this ecosystem.

  • In collaboration with writers and developers in newsrooms around the world, we publish Source, a community site focused on open technology projects and process in journalism. From features that explore the context behind the code to targeted job listings that help the community expand, Source presents the people, projects, and insights behind journalism code.

    We also hold biweekly Source community calls where newsroom data and apps teams can share their work, announce job openings, and find collaborators.

On the agenda for tomorrow:

  • Reporting on police shootings – Allison McCann
  • Accessibility on the web – Joanna Kao

Call Details for Jan. 11, 2018.

Archive of prior calls

Mark your calendars! Every other Thursday @ 12pm ET / 5pm GMT / 9am PST

Email Spam from Congress

Filed under: Government,Journalism,News — Patrick Durusau @ 10:40 am

Receive an Email when a Member of Congress has a New Remark Printed in the Congressional Record by Robert Brammer.

From the post:

Congress.gov alerts are emails sent to you when a measure (bill or resolution), nomination, or member profile has been updated with new information. You can also receive an email after a Member has new remarks printed in the Congressional Record. Here are instructions on how to get an email after a Member has new remarks printed in the Congressional Record….

My blog title is unfair to Brammer, who isn’t responsible for the lack of meaningful content in Member remarks printed in the Congressional Record.

Local news outlets reprint such remarks, as does the national media, whether those remarks are grounded in any shared reality or not. Secondary education classes on current events, reporting, and government, where such remarks are treated as meaningful, are likely to find this alert service useful.

Another use, assuming you mine prior remarks from the Congressional Record, would be teaching NLP techniques. You are highly unlikely to discover anything new, but it will be “new to you” and the result of your own efforts.
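As one classroom sketch in Python (the remarks/ directory layout, file naming and the deliberately crude scoring are my assumptions, not part of the Congress.gov service):

# Minimal NLP exercise: most distinctive words per member,
# assuming each member's remarks are saved as remarks/<member>.txt (hypothetical layout).
import glob, os, re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "that", "is", "for", "on", "we", "this"}

def tokens(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

counts = {}                                  # member -> word frequencies
for path in glob.glob("remarks/*.txt"):
    member = os.path.splitext(os.path.basename(path))[0]
    with open(path, encoding="utf-8") as f:
        counts[member] = Counter(tokens(f.read()))

overall = sum(counts.values(), Counter())    # corpus-wide frequencies
for member, c in counts.items():
    # crude "distinctiveness": member frequency relative to corpus frequency
    scored = sorted(c, key=lambda w: c[w] / overall[w], reverse=True)
    print(member, scored[:10])

Swapping in bigrams, tf-idf or a topic model gives a class a natural progression from there.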

January 9, 2018

Top 5 Cloudera Engineering Blogs of 2017

Filed under: Cloudera,Impala,Kafka,Spark — Patrick Durusau @ 8:22 pm

Top 5 Cloudera Engineering Blogs of 2017

From the post:

1. Working with UDFs in Apache Spark

2. Offset Management For Apache Kafka With Apache Spark Streaming

3. Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem

4. Up and running with Apache Spark on Apache Kudu

5. Apache Impala Leads Traditional Analytic Database

Kudos to Cloudera for a useful list of “top” blog posts for 2017.

We might disagree on the top five but it’s a manageable number of posts and represents the quality of Cloudera postings all year long.

Enjoy!

Sessions for XML Prague 2018 – January 10th, Early Bird Deadline!

Filed under: Conferences,XML,XQuery,XSLT — Patrick Durusau @ 8:03 pm

List of sessions for XML Prague 2018

The range of great presentations is no surprise.

That early registration is still open, with this list of presentations, well, that is a surprise!

January 10, 2018 is the deadline for early birds!

From the post:

Unconference day

Schematron Users Meetup
XSL-FO, CSS and Paged Output – hosted by Antenna House
Introduction to CSS for Paged Media
XSpec Users Meetup
oXygen Users Meetup
Creating beautiful documents with the speedata Publisher
eXist-db Community Meetup
XML with Emacs workshop

Friday and Saturday sessions

Bert Willems: Assisted Structured Authoring using Conditional Random Fields
Christophe Marchand and Matthieu Ricaud-Dussarget: Using Maven with XML Projects
Elli Bleeker, Bram Buitendijk, Ronald Haentjens Dekker and Astrid Kulsdom: Including XML Markup in the Automated Collation of Literary Texts
Erik Siegel: Multi-layered content modelling to the rescue
Francis Cave: Does the world need more XML standards?
Gerrit Imsieke: tokenized-to-tree – An XProc/XSLT Library For Patching Back Tokenization/Analysis Results Into Marked-up Text
Hans-Juergen Rennau: Combining graph and tree: writing SHAX, obtaining SHACL, XSD and more
James Fuller: Diff with XQuery
Jean-François Larvoire: SML – A simpler and shorter representation of XML
Johannes Kolbe and Manuel Montero: XML periodic table, XML repository and XSLT checker
Michael Kay: XML Tree Models for Efficient Copy Operations
O’Neil Delpratt and Debbie Lockett: Implementing XForms using interactive XSLT 3.0
Pieter Masereeuw: Can we create a real world rich Internet application using Saxon-JS?
Radu Coravu: A short story about XML encoding and opening very large documents in an XML editing application
Steven Higgs: XML Success Story: Creating and Integrating Collaboration Solutions to Improve the Documentation Process
Steven Pemberton: Form, and Content
Tejas Barhate and Nigel Whitaker: Varieties of XML Merge: Concurrent versus Sequential
Tony Graham: Life, the Universe, and CSS Tests
Vasu Chakkera: Effective XSLT Documentation and its separation from XSLT code
Zachary Dean: xqerl: XQuery 3.1 Implementation in Erlang

I’m expecting lots of tweets and posts about these presentations!

January 8, 2018

Are LaTeX Users Script Kiddies?

Filed under: Cybersecurity,Security,TeX/LaTeX — Patrick Durusau @ 5:15 pm

NO! Despite most LaTeX users not writing their own LaTeX engines or many of the packages they use, they are not script kiddies.

LaTeX users are experts in mathematics, statistics and probability, physics, computer science, astronomy and astrophysics, (François Brischoux and Pierre Legagneux 2009), as well as being skilled LaTeX authors.

There’s no shame in using LaTeX, despite not implementing a LaTeX engine. LaTeX makes high quality typesetting available to hundreds of thousands of users around the globe.

Contrast that view of LaTeX with making the use of cyber vulnerabilities more widely available, which is dismissed as empowering “script kiddies.”

Every cyber vulnerability is a step towards transparency. Governments and corporations fear cyber vulnerabilities, fearing their use will uncover evidence of their crimes and favoritism.

Given that fear of public exposure, it’s no surprise that governments prohibit the use of cyber vulnerabilities. The same governments that also finance and support rape, torture, murder, etc., in pursuit of national policy.

The question for you is:

Do you want to assist such governments and corporations to continue hiding their secrets?

Your answer to that question should determine your position on the discovery, use and spread of cyber vulnerabilities.

16K+ Hidden Web Services (CSV file)

Filed under: Dark Web — Patrick Durusau @ 5:00 pm

I subscribe to the Justin at Hunchly Dark Web report. The current issue (daily) and archive are on Dropbox.

The daily issues are archived in .xlsx format. (Bleech!)

Yesterday I grabbed the archive, converted the files to CSV format, catted them together and cleaned up the extra headers, which resulted in a file with 16,814 links: HiddenServices-2017-07-13-2018-01-05.zip.
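If you want to reproduce or update that merge yourself, here is a minimal Python sketch, assuming the daily .xlsx files sit in an archive/ directory and that openpyxl is installed (the directory and output names are mine):

# Sketch: merge the Hunchly daily .xlsx exports into one CSV of hidden service links.
# Keeps the first header row it sees and drops the repeated headers from later files.
import csv, glob
from openpyxl import load_workbook

rows, header = [], None
for path in sorted(glob.glob("archive/*.xlsx")):
    ws = load_workbook(path, read_only=True).active
    for i, row in enumerate(ws.iter_rows(values_only=True)):
        if i == 0:
            header = header or row          # keep the first header, skip repeats
            continue
        if any(cell is not None for cell in row):
            rows.append(row)

with open("HiddenServices-combined.csv", "w", newline="", encoding="utf-8") as out:
    w = csv.writer(out)
    w.writerow(header)
    w.writerows(rows)

print(len(rows), "links written")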

A number of uses come to mind: a seed list for a search engine, browsing by title, sub-setting for more specialized dark web lists, testing the presence/absence of sites on sub-lists, etc.

I’m not affiliated with Hunch.ly but you should give their Inspector Hunchly a look. From the webpage:

Inspector Hunchly toils in the background of your web browser to track, analyze and store web pages while you perform online investigations.

Forgets nothing, keeps everything.
… (emphasis in original)

When using Inspector Hunchly, be mindful that anything you record can and will be discovered.

PS: The archive I downloaded, separate files for every day, 272.3 MB. My one file, 363.8 KB. Value added?

forall x : …Introduction to Formal Logic (Smearing “true” across formal validity and factual truth)

Filed under: Logic,Ontology — Patrick Durusau @ 4:47 pm

forall x: Calgary Remix, An Introduction to Formal Logic by P. D. Magnus and Tim Button, with additions by J. Robert Loftis, remixed and revised by Aaron Thomas-Bolduc and Richard Zach.

From the introduction:

As the title indicates, this is a textbook on formal logic. Formal logic concerns the study of a certain kind of language which, like any language, can serve to express states of affairs. It is a formal language, i.e., its expressions (such as sentences) are defined formally. This makes it a very useful language for being very precise about the states of affairs its sentences describe. In particular, in formal logic it is impossible to be ambiguous. The study of these languages centres on the relationship of entailment between sentences, i.e., which sentences follow from which other sentences. Entailment is central because by understanding it better we can tell when some states of affairs must obtain provided some other states of affairs obtain. But entailment is not the only important notion. We will also consider the relationship of being consistent, i.e., of not being mutually contradictory. These notions can be defined semantically, using precise definitions of entailment based on interpretations of the language—or proof-theoretically, using formal systems of deduction.

Formal logic is of course a central sub-discipline of philosophy, where the logical relationship of assumptions to conclusions reached from them is important. Philosophers investigate the consequences of definitions and assumptions and evaluate these definitions and assumptions on the basis of their consequences. It is also important in mathematics and computer science. In mathematics, formal languages are used to describe not “everyday” states of affairs, but mathematical states of affairs. Mathematicians are also interested in the consequences of definitions and assumptions, and for them it is equally important to establish these consequences (which they call “theorems”) using completely precise and rigorous methods. Formal logic provides such methods. In computer science, formal logic is applied to describe the state and behaviours of computational systems, e.g., circuits, programs, databases, etc. Methods of formal logic can likewise be used to establish consequences of such descriptions, such as whether a circuit is error-free, whether a program does what it’s intended to do, whether a database is consistent or if something is true of the data in it….

Unfortunately, formal logic uses “true” for a conclusion that is valid upon a set of premises.

That smearing of “true” across formal validity and factual truth enables ontologists to make implicit claims about factual truth, ever ready to retreat into “…all I meant was formal validity.”

Premises, within and without ontologies, are known carriers of discrimination and prejudice. Don’t be distracted by “formal validity” arguments. Keep a laser focus on claimed premises.

Bait Avoidance, Congress, Kaspersky Lab

Filed under: Cybersecurity,Government,Politics,Security — Patrick Durusau @ 2:56 pm

Should you use that USB key you found? by Jeffrey Esposito.

Here is a scenario for you: You are walking around, catching Pokémon, getting fresh air, people-watching, taking Fido out to do his business, when something catches your eye. It’s a USB stick, and it’s just sitting there in the middle of the sidewalk.

Jackpot! Christmas morning! (A very small) lottery win! So, now the question is, what is on the device? Spring Break photos? Evil plans to rule the world? Some college kid’s homework? You can’t know unless…

Esposito details an experiment of leaving USB keys around the University of Illinois campus, which resulted in 48% of them being plugged into computers.

Reports like this from Kaspersky Lab, given the interest in Kaspersky by Congress, could lead to what the pest control industry calls “bait avoidance.”

Imagine members of Congress or their staffs not stuffing random USB keys into their computers. This warning from Kaspersky could poison the well for everyone.

For what it’s worth, salting the halls and offices of Congress with new-release music and movies on USB keys may help develop and maintain insecure USB practices. Countering bait avoidance is everyone’s responsibility.

January 6, 2018

21 Recipes for Mining Twitter Data with rtweet

Filed under: R,Social Media,Tweets,Twitter — Patrick Durusau @ 5:26 pm

21 Recipes for Mining Twitter Data with rtweet by Bob Rudis.

From the preface:

I’m using this as way to familiarize myself with bookdown so I don’t make as many mistakes with my web scraping field guide book.

It’s based on Matthew R. Russell’s book. That book is out of distribution and much of the content is in Matthew’s “Mining the Social Web” book. There will be many similarities between his “21 Recipes” book and this book on purpose. I am not claiming originality in this work, just making an R-centric version of the cookbook.

As he states in his tome, “this intentionally terse recipe collection provides you with 21 easily adaptable Twitter mining recipes”.

Rudis has posted about this editing project at: A bookdown “Hello World” : Twenty-one (minus two) Recipes for Mining Twitter with rtweet, which you should consult if you want to contribute to this project.

Working through 21 Recipes for Mining Twitter Data with rtweet will give you experience proofing a text and if you type in the examples (no cut-n-paste), you’ll develop rtweet muscle memory.

Enjoy!

January 5, 2018

…Anyone With Less Technical Knowledge…

Filed under: Cybersecurity,Security — Patrick Durusau @ 5:17 pm

The headline came from Critical “Same Origin Policy” Bypass Flaw Found in Samsung Android Browser by Mohit Kumar, the last paragraph of which reads:


Since the Metasploit exploit code for the SOP bypass vulnerability in the Samsung Internet Browser is now publicly available, anyone with less technical knowledge can use and exploit the flaw on a large number of Samsung devices, most of which are still using the old Android Stock browser.
… (emphasis added)

Kumar tosses off the … anyone with less technical knowledge … line like that’s a bad thing.

I wonder if Kumar can:

  1. Design and create a CPU chip?
  2. Design and create a memory chip?
  3. Design and create from scratch a digital computer?
  4. Design and implement an operating system?
  5. Design and create a programming language?
  6. Design and create a compiler for creation of binaries?
  7. Design and create the application he now uses for editing?

I’m guessing that Kumar strikes out on one or more of those questions, making him one of those anyone with less technical knowledge types.

I don’t doubt Kumar has a wide range of deep technical skills but lacking some particular technical skill doesn’t diminish your value as a person or even as a technical geek.

Moreover, security failures should be made as easy to use as possible.

No corporation or government is going to voluntarily engage in behavior changing transparency. The NSA was outed for illegal surveillance, Congress then passed a law making that illegal surveillance retroactively legal, and when that authorization expired, the NSA continued its originally illegal surveillance.

Every security vulnerability is one potential step towards behavior changing transparency. People with “…less technical knowledge…” aren’t going to find those but with assistance, they can make the best use of the ones that are found.

Security researchers should take pride in their work. But there’s no reflected glory in dissing people who are good at other things.

Transparency, behavior changing transparency, will only result from discovery and widespread use of security flaws. (Voluntary transparency being a contradiction in terms.)

January 4, 2018

Helping Google Achieve Transparency – Wage Discrimination

Filed under: sexism,Transparency — Patrick Durusau @ 8:36 pm

Google faces new discrimination charge: paying female teachers less than men by Sam Levin.

From the post:

Google, which has been accused of systematically underpaying female engineers and other workers, is now facing allegations that it discriminated against women who taught employees’ children at the company’s childcare center.

A former employee, Heidi Lamar, is alleging in a complaint that female teachers were paid lower salaries than men with fewer qualifications doing the same job.

Lamar, who worked at Google for four years before quitting in 2017, alleged that the technology company employed roughly 147 women and three men as pre-school teachers, but that two of those men were granted higher starting salaries than nearly all of the women.

Google did not respond to the Guardian’s request for data on its hiring practices of teachers.

As Levin reports, Google is beside itself with denials and other fact free claims for which it offers no data.

If there was no wage discrimination, Google could release all of its payroll and related data and silence all of its critics at once.

Google has chosen to not silence its critics with facts known only to Google.

Google needs help seeing the value of transparency to answer charges of wage discrimination.

Will you be the one that helps Google realize the value of transparency?

Who’s on everyone’s 2017 “hit list”?

Filed under: R,Web Server — Patrick Durusau @ 8:10 pm

Who’s on everyone’s 2017 “hit list”? by Suzan Baert.

From the post:

At the end of the year, everyone is making lists. And radio stations are no exceptions.
Many of our radio stations have a weekly “people’s choice” music chart. Throughout the week, people submit their top 3 recent songs, and every week those votes turn into a music chart. At the end of the year, they collapse all those weekly charts into a larger one covering the entire year.

I find this one quite interesting: it’s not dependent on what music people buy, it’s determined by what the audience of that station wants to hear. So what are the differences between these stations? And do they match up with what I would expect?

What was also quite intriguing: in Dutch we call it a hit lijst and if you translate that word for word you get: hit list. Which at least one radio station seems to do…

Personally, when I hear the word hit list, music is not really what comes to mind, but hey, let’s roll with it: which artists are on everyone’s ‘hit list’?

A delightful scraping of four (4) radio station “hit lists,” which uses rOpenSci robotstxt, rvest, xml2, dplyr, tidyr, ggplot2, phantomJS, and collates the results.

Music doesn’t come to mind for me when I hear “hit list.”

For me, “hit list” means what Google wants you to know about subject N.

You?

So You Want to Play God? Intel Delivers – FUCKWIT Inside

Filed under: Cybersecurity,Security — Patrick Durusau @ 2:16 pm

Kernel-memory-leaking Intel processor design flaw forces Linux, Windows redesign by John Leyden and Chris Williams.

From the post:


It is understood the bug is present in modern Intel processors produced in the past decade. It allows normal user programs – from database applications to JavaScript in web browsers – to discern to some extent the layout or contents of protected kernel memory areas.

The fix is to separate the kernel’s memory completely from user processes using what’s called Kernel Page Table Isolation, or KPTI. At one point, Forcefully Unmap Complete Kernel With Interrupt Trampolines, aka FUCKWIT, was mulled by the Linux kernel team, giving you an idea of how annoying this has been for the developers.

Think of the kernel as God sitting on a cloud, looking down on Earth. It’s there, and no normal being can see it, yet they can pray to it.

Patches are forthcoming, to make your Intel machine 5% to 30% slower.

Cloud providers are upgrading but there’s a decade of Intel chips not in the cloud that await exploitation.

Show of hands. How many of you will slow your machines down by 5% to 30% to defeat this bug?

Next question: How long will it take to cycle out of service the most recent decade of Intel chips?

You’ll have to make your own sticker for your laptop/desktop/server.

BTW, for FUCKWIT and another deep chip flaw, see: Researchers Discover Two Major Flaws in the World’s Computers.

These fundamental flaws should alter your cybersecurity conversations. But will they?

January 3, 2018

How-To Defeat Facebook “Next or Next Page” Links

Filed under: Bash,Facebook — Patrick Durusau @ 10:09 pm

Not you but friends of yours are lured in by “click-bait” images to Facebook pages with “Next or Next Page” links. Like this one:

60 Groovy Photos You Probably Haven’t Seen Before

You can, depending on the speed of your connection and browser, follow each link. That’s tiresome and leaves you awash in ads for every page.

Here’s a start on a simple method to defeat such links.

First, if you follow the first link (they vary from site to site), you find:

http://groovyhistory.com/60-groovy-photos-you-probably-havent-seen-before/2

So we know from that URL that we need to increment the 2, up to and including 60, to access all the relevant pages.

If we do view source (CTRL-U), we find:

<div class='gallery-image'>
<img src='http://cdn.groovyhistory.com/content/50466/90669810f510ad0de494a9b55c1f67d2.jpg'
class='img-responsive' alt='' /></div>

We need to extract each image whose parent div has class='gallery-image' and write it to a file suitable for display.

I hacked out this quick one liner to do the deed:

echo "<html><head></head><body>" > pics.html;for i in `seq 2 60`;do wget -U Mozilla -q "http://groovyhistory.com/60-groovy-photos-you-probably-havent-seen-before/$i" -O - | grep gallery >> pics.html;done;echo "</body></html>" >> pics.html

Breaking the one-liner into steps:

  1. echo "<html><head></head><body>" > pics.html.

    Creates the HTML file pics.html and inserts markup down to the open body element.

  2. for i in `seq 2 60`.

    Creates the loop and the variable i, which is used in the next step to build each page URL.

  3. do wget -U Mozilla -q "http://groovyhistory.com/60-groovy-photos-you-probably-havent-seen-before/$i" -O - .

    Begins the do loop, invokes wget, identifies itself as Mozilla (-U Mozilla), suppresses messages (-q), gives the URL with the $i variable, and writes each retrieved page to standard output (-O -).

  4. | grep gallery >> pics.html.

    The pipe (|) sends each fetched page to grep, which searches for gallery; every matching line is appended (>>) to pics.html. This repeats until page 60 has been fetched.

  5. done

    Closes the do loop once the last page has been processed.

  6. echo "</body></html>" >> pics.html.

    After the loop exits, the closing body and html elements are appended to the pics.html file and the script is done.

Each step in the one-liner is separated from the others with a semi-colon (;).

I converted the entities back to markup and it ran, except that it didn’t pick up the first image, which lives at the base URL without an appended number.

To avoid hand editing the script:

  • Pass URL at command line
  • Pass number of images on command line
  • Text to grep changes with host, so create switch statement that keys on host
  • Output file name as command line option

The next time you encounter “50 Famous Photo-Bombs,” “30 Celebs Now,” or “45 Unseen Beatles Pics,” a minute or two of editing even the crude version of this script will save you the time and tedium of loading advertisements.
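As a sketch of those improvements, here is a hypothetical Python version that takes the base URL, page count, match text and output file on the command line (the option names and defaults are mine, and the match text still has to suit the host):

#!/usr/bin/env python3
# Hypothetical generalization of the one-liner: fetch numbered gallery pages,
# keep only the lines matching a host-specific marker, and wrap them in HTML.
import argparse
from urllib.request import Request, urlopen

parser = argparse.ArgumentParser(description="Collect gallery images from 'Next Page' sites")
parser.add_argument("url", help="base URL; page numbers are appended as /2, /3, ...")
parser.add_argument("--pages", type=int, default=60, help="highest page number")
parser.add_argument("--match", default="gallery", help="text identifying image lines")
parser.add_argument("--out", default="pics.html", help="output file")
args = parser.parse_args()

lines = ["<html><head></head><body>"]
for page in range(2, args.pages + 1):
    req = Request(f"{args.url}/{page}", headers={"User-Agent": "Mozilla/5.0"})
    try:
        html = urlopen(req).read().decode("utf-8", errors="ignore")
    except OSError:
        continue                         # skip pages that fail to load
    lines.extend(l for l in html.splitlines() if args.match in l)
lines.append("</body></html>")

with open(args.out, "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

Run it as, for example: python pics.py http://groovyhistory.com/60-groovy-photos-you-probably-havent-seen-before --pages 60 --match gallery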

Enjoy!

December 28, 2017

Twitter Taking Sides – Censorship-Wise

Filed under: Censorship,Free Speech,Twitter — Patrick Durusau @ 10:16 pm

@wikileaks pointed out that Twitter’s censorship policies are taking sides:

Accounts that affiliate with organizations that use or promote violence against civilians to further their causes. Groups included in this policy will be those that identify as such or engage in activity — both on and off the platform — that promotes violence. This policy does not apply to military or government entities and we will consider exceptions for groups that are currently engaging in (or have engaged in) peaceful resolution.
… (emphasis added)

Does Twitter need a new logo? Birds with government insignia dropping bombs on civilians?

December 27, 2017

The Coolest Hacks of 2017 [Inspirational Reading for 2018]

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:36 pm

The Coolest Hacks of 2017 by Kelly Jackson Higgins.

From the post:

You’d think by now with the pervasiveness of inherently insecure Internet of Things things that creative hacking would be a thing of the past for security researchers. It’s gotten too easy to find security holes and ways to abuse IoT devices; they’re such easy marks.

But our annual look at the coolest hacks we covered this year on Dark Reading shows that, alas, innovation is not dead. Security researchers found intriguing and scary security flaws that can be abused to bend the will of everything from robots to voting machines, and even the wind. They weaponized seemingly benign systems such as back-end servers and machine learning tools in 2017, exposing a potential dark side to these systems.

So grab a cold one from your WiFi-connected smart fridge and take a look at seven of the coolest hacks of the year.

“Dark side” language brings a sense of intrigue and naughtiness. But the “dark side(s)” of any system is just a side that meets different requirements. Such as access without authorization. May not be your requirement but it may be mine, or your government’s.

Let’s drop the dodging and posing as though there is a common interest in cybersecurity. There is no such common interest nor has there ever been one. Governments want backdoors; privacy advocates, black marketeers and spies want none. Users want effortless security, while security experts know security ads are just short of actionable fraud.

Cybersecurity marketeers may resist, but detail your specific requirements, in writing and appended to your contract.

Streaming SQL for Apache Kafka

Filed under: Kafka,SQL,Stream Analytics,Streams — Patrick Durusau @ 11:27 am

Streaming SQL for Apache Kafka by Hojjat Jafarpour.

From the post:

We are very excited to announce the December release of KSQL, the streaming SQL engine for Apache Kafka! As we announced in the November release blog, we are releasing KSQL on a monthly basis to make it even easier for you to get up and running with the latest and greatest functionality of KSQL to solve your own business problems.

The December release, KSQL 0.3, includes both new features that have been requested by our community as well as under-the-hood improvements for better robustness and resource utilization. If you have already been using KSQL, we encourage you to upgrade to this latest version to take advantage of the new functionality and improvements.

The KSQL Github page links to:

  • KSQL Quick Start: Demonstrates a simple workflow using KSQL to write streaming queries against data in Kafka.
  • Clickstream Analysis Demo: Shows how to build an application that performs real-time user analytics.

These are just quick start materials, but are your ETL projects ever as simple as USERID to USERID? Or have such semantically transparent fields? Or what I take to be semantically transparent fields (they may not be).

As I pointed out in Where Do We Write Down Subject Identifications? earlier today, where do I record what I know about what appears in those fields? Including on what basis to merge them with other data?

If you see where KSQL is offering that ability, please ping me because I’m missing it entirely. Thanks!

Where Do We Write Down Subject Identifications?

Filed under: Subject Identifiers,Subject Identity,Topic Maps — Patrick Durusau @ 11:23 am

Modern Data Integration Paradigms by Matthew D. Sarrel, The Bloor Group.

Introduction:

Businesses of all sizes and industries are rapidly transforming to make smarter, data-driven decisions. To accomplish this transformation to digital business, organizations are capturing, storing, and analyzing massive amounts of structured, semi-structured, and unstructured data from a large variety of sources. The rapid explosion in data types and data volume has left many IT and data science/business analyst leaders reeling.

Digital transformation requires a radical shift in how a business marries technology and processes. This isn’t merely improving existing processes, but rather redesigning them from the ground up and tightly integrating technology. The end result can be a powerful combination of greater efficiency, insight and scale that may even lead to disrupting existing markets. The shift towards reliance on data-driven decisions requires coupling digital information with powerful analytics and business intelligence tools in order to yield well-informed reasoning and business decisions. The greatest value of this data can be realized when it is analyzed rapidly to provide timely business insights. Any process can only be as timely as the underlying technology allows it to be.

Even data produced on a daily basis can exceed the capacity and capabilities of many pre-existing database management systems. This data can be structured or unstructured, static or streaming, and can undergo rapid, often unanticipated, change. It may require real-time or near-real-time transformation to be read into business intelligence (BI) systems. For these reasons, data integration platforms must be flexible and extensible to accommodate business’s types and usage patterns of the data.

There’s the usual homage to the benefits of data integration:


IT leaders should therefore try to integrate data across systems in a way that exposes them using standard and commonly implemented technologies such as SQL and REST. Integrating data, exposing it to applications, analytics and reporting improves productivity, simplifies maintenance, and decreases the amount of time and effort required to make data-driven decisions.

The paper covers, lightly, Operational Data Store (ODS) / Enterprise Data Hub (EDH), Enterprise Data Warehouse (EDW), Logical Data Warehouse (LDW), and Data Lake as data integration options.

Having found existing systems deficient in one or more ways, the report goes on to recommend replacement with Voracity.

To be fair, as described, all four systems plus Voracity are all deficient in the same way. The hard part of data integration, the rub that lies at the heart of the task, is passed over as ETL.

Efficient and correct ETL performance requires knowledge of what column headers identify. From the Enron spreadsheets, for instance, can you specify the transformation of the data in the following columns? “A, B, C, D, E, F…” from andrea_ring_15_IFERCnov.xlsx, or “A, B, C, D, E,…” from andy_zipper__129__Success-TradeLog.xlsx?

With enough effort, no doubt you could go through spreadsheets of interest and create a mapping sufficient to transform data of interest, but where are you going to write down the facts you established for each column that underlie your transformation?

In topic maps, we made the mistake of mystifying the facts for each column by claiming to talk about subject identity, which has heavy ontological overtones.

What we should have said was that we want to talk about where to write down subject identifications.

Thus:

  1. What do you want to talk about?
  2. Data in column F in andrea_ring_15_IFERCnov.xlsx
  3. Do you want to talk about each entry separately?
  4. What subject is each entry? (date written month/day (no year))
  5. What calendar system was used for the date?
  6. Who created that date entry? (If want to talk about them as well, create a separate topic and an association to the spreadsheet.)
  7. The date is the date of … ?
  8. Conversion rules for dates in column F, such as supplying year.
  9. Merging rules for #2? (date comparison)
  10. Do you want relationship between #2 and the other data in each row? (more associations)

With simple questions, we have documented column F of a particular spreadsheet for any present or future ETL operation. No magic, no logical conundrums, no special query language, just asking what an author or ETL specialist knew but didn’t write down.
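One minimal sketch of actually writing those answers down, as a machine-readable record kept alongside the spreadsheet (the key names are illustrative only, not any topic map standard, and the still-open questions stay marked as such):

# A place to write down what we know about column F, so the next ETL run
# does not have to rediscover it. Key names and values are illustrative.
import json

column_f = {
    "subject-locator": "andrea_ring_15_IFERCnov.xlsx#column=F",
    "what-it-is": "date, written month/day with no year",
    "calendar": "Gregorian",
    "recorded-by": "unknown (see spreadsheet author)",
    "date-is-date-of": "to be determined (question 7 above)",
    "conversion-rule": "supply the missing year before loading (year to be confirmed)",
    "merging-rule": "compare as ISO 8601 dates after conversion",
    "related-columns": ["A", "B", "C"],          # same-row associations
}

with open("andrea_ring_15_IFERCnov.column-F.json", "w") as f:
    json.dump(column_f, f, indent=2)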

There are subtleties, such as distinguishing between subject identifiers (identifying a subject, like a wiki page) and subject locators (pointing to the subject we want to talk about, like a particular spreadsheet), but identifying what you want to talk about (subject identifications and where to write them down) is more familiar than our prior obscurities.

Once those identifications are written down, you can search them to discover the same subjects identified differently, or with properties in one identification and not another. Think of it as capturing the human knowledge that resides in the brains of your staff and ETL experts.

The ETL assumed by Bloor Group should be written: ETLD – Extract, Transform, Load, Dump (knowledge). That seems remarkably inefficient and costly to me. You?

Tutorial on Deep Generative Models (slides and video)

Filed under: Artificial Intelligence,Deep Learning,Machine Learning — Patrick Durusau @ 10:55 am

Slides for: Tutorial on Deep Generative Models by Shakir Mohamed and Danilo Rezende.

Abstract:

This tutorial will be a review of recent advances in deep generative models. Generative models have a long history at UAI and recent methods have combined the generality of probabilistic reasoning with the scalability of deep learning to develop learning algorithms that have been applied to a wide variety of problems giving state-of-the-art results in image generation, text-to-speech synthesis, and image captioning, amongst many others. Advances in deep generative models are at the forefront of deep learning research because of the promise they offer for allowing data-efficient learning, and for model-based reinforcement learning. At the end of this tutorial, audience members will have a full understanding of the latest advances in generative modelling covering three of the active types of models: Markov models, latent variable models and implicit models, and how these models can be scaled to high dimensional data. The tutorial will expose many questions that remain in this area, and for which there remains a great deal of opportunity from members of the UAI community.

Deep sledding on the latest developments in deep generative models (August 2017 presentation) that ends with a bibliography starting on slide 84 of 96.

Depending on how much time has passed since the tutorial, try searching the topics as they are covered, keep a bibliography of your finds and compare it to that of the authors.

No Peer Review at FiveThirtyEight

Filed under: Humanities,Peer Review,Researchers,Science — Patrick Durusau @ 10:47 am

Politics Moves Fast. Peer Review Moves Slow. What’s A Political Scientist To Do? by Maggie Koerth-Baker

From the post:

Politics has a funny way of turning arcane academic debates into something much messier. We’re living in a time when so much in the news cycle feels absurdly urgent and partisan forces are likely to pounce on any piece of empirical data they can find, either to champion it or tear it apart, depending on whether they like the result. That has major implications for many of the ways knowledge enters the public sphere — including how academics publicize their research.

That process has long been dominated by peer review, which is when academic journals put their submissions in front of a panel of researchers to vet the work before publication. But the flaws and limitations of peer review have become more apparent over the past decade or so, and researchers are increasingly publishing their work before other scientists have had a chance to critique it. That’s a shift that matters a lot to scientists, and the public stakes of the debate go way up when the research subject is the 2016 election. There’s a risk, scientists told me, that preliminary research results could end up shaping the very things that research is trying to understand.

The legend of peer review catching and correcting flaws has a long history. A legend much tarnished by the Top 10 Retractions of 2017 and similar reports. Retractions are self admissions of the failure of peer review. By the hundreds.

Withdrawal of papers isn’t the only debunking of peer review. The reports, papers, etc., on the failure of peer review include: “Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals,” Anaesthesia, Carlisle 2017, DOI: 10.1111/anae.13962; “The peer review drugs don’t work” by Richard Smith; “One in 25 papers contains inappropriately duplicated images, screen finds” by Cat Ferguson.

Koerth-Baker’s quoting of Justin Esarey to support peer review is an example of no or failed peer review at FiveThirtyEight.


But, on aggregate, 100 studies that have been peer-reviewed are going to produce higher-quality results than 100 that haven’t been, said Justin Esarey, a political science professor at Rice University who has studied the effects of peer review on social science research. That’s simply because of the standards that are supposed to go along with peer review – clearly reporting a study’s methodology, for instance – and because extra sets of eyes might spot errors the author of a paper overlooked.

Koerth-Baker acknowledges the failures of peer review but since the article is premised upon peer review insulating the public from “bad science,” she runs in Justin Esarey, “…who has studied the effects of peer review on social science research.” One assumes his “studies” are mentioned to imbue his statements with an aura of authority.

Debunking Esarey’s authority to comment on the “…effects of peer review on social science research” doesn’t require much effort. If you scan his list of publications you will find Does Peer Review Identify the Best Papers?, which bears the sub-title, A Simulation Study of Editors, Reviewers, and the Social Science Publication Process.

Esarey’s comments on the effectiveness of peer review are not based on fact but on simulations of peer review systems. Useful work no doubt but hardly the confessing witness needed to exonerate peer review in view of its long history of failure.

To save you chasing the Esarey link, the abstract reads:

How does the structure of the peer review process, which can vary from journal to journal, influence the quality of papers published in that journal? In this paper, I study multiple systems of peer review using computational simulation. I find that, under any system I study, a majority of accepted papers will be evaluated by the average reader as not meeting the standards of the journal. Moreover, all systems allow random chance to play a strong role in the acceptance decision. Heterogeneous reviewer and reader standards for scientific quality drive both results. A peer review system with an active editor (who uses desk rejection before review and does not rely strictly on reviewer votes to make decisions) can mitigate some of these effects.

If there were peer reviewers, editors, etc., at FiveThirtyEight, shouldn’t at least one of them have looked beyond the title Does Peer Review Identify the Best Papers? to ask Koerth-Baker what evidence Esarey has for his support of peer review? Or is agreement with Koerth-Baker sufficient?

Peer review persists for a number of unsavory reasons: prestige, professional advancement, enforcement of discipline ideology, and the pretension of higher quality publications. Let’s not add a false claim of serving the public.

Game of Thrones DVDs for Christmas?

Filed under: R,Text Mining — Patrick Durusau @ 10:40 am

Mining Game of Thrones Scripts with R by Gokhan Ciflikli

If you are serious about defeating all comers to Game of Thrones trivia, then you need to know the scripts cold. (sorry)

Ciflikli introduces you to the quanteda and analysis of the Game of Thrones scripts in a single post saying:

I meant to showcase the quanteda package in my previous post on the Weinstein Effect but had to switch to tidytext at the last minute. Today I will make good on that promise. quanteda is developed by Ken Benoit and maintained by Kohei Watanabe – go LSE! On that note, the first 2018 LondonR meeting will be taking place at the LSE on January 16, so do drop by if you happen to be around. quanteda v1.0 will be unveiled there as well.

Given that I have already used the data I had in mind, I have been trying to identify another interesting (and hopefully less depressing) dataset for this particular calling. Then it snowed in London, and the dire consequences of this supernatural phenomenon were covered extensively by the r/CasualUK/. One thing led to another, and before you know it I was analysing Game of Thrones scripts:

2018, with its mid-term congressional elections, will be a big year for leaked emails and documents, in addition to the usual follies of government.

Text mining/analysis skills you gain with the Game of Thrones scripts will be in high demand by partisans, investigators, prosecutors, just about anyone you can name.

From the quanteda documentation site:


quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what the documents and features are. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined “thesaurus”, and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
… (emphasis in original)
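The corpus-to-document-feature-matrix idea is not tied to R. A rough Python analogue with scikit-learn (not quanteda’s API, and the toy documents are mine) shows the shape of the pipeline:

# Rough analogue of the corpus -> document-feature-matrix step, using
# scikit-learn (1.0+) rather than quanteda; documents here are toy strings.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Winter is coming to the North.",
    "The North remembers, winter or not.",
]

vectorizer = TfidfVectorizer(stop_words="english")   # stopword removal + tf-idf weighting
dfm = vectorizer.fit_transform(docs)                  # documents x features matrix

print(vectorizer.get_feature_names_out())
print(dfm.toarray())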

Once you follow the analysis of the Game of Thrones scripts, what other texts or features of quanteda will catch your eye?

Enjoy!

From the Valley of Disinformation Rode the 770 – Opportunity Knocks

Filed under: Cybersecurity,Environment,Government,Government Data,Journalism,Reporting — Patrick Durusau @ 10:32 am

More than 700 employees have left the EPA since Scott Pruitt took over by Natasha Geiling.

From the post:

Since Environmental Protection Agency Administrator Scott Pruitt took over the top job at the agency in March, more than 700 employees have either retired, taken voluntary buyouts, or quit, signaling the second-highest exodus of employees from the agency in nearly a decade.

According to agency documents and federal employment statistics, 770 EPA employees departed the agency between April and December, leaving employment levels close to Reagan-era levels of staffing. According to the EPA’s contingency shutdown plan for December, the agency currently has 14,449 employees on board — a marked change from the April contingency plan, which showed a staff of 15,219.

These departures offer journalists a rare opportunity to bleed the government like a stuck pig. From untimely rescission of login credentials to acceptance of spear phishing emails, opportunities abound.

Not for “reach it to me” journalists who use sources as shields from potential criminal liability. While their colleagues are imprisoned for the simple act of publication or murdered (as of today in 2017, 42).

Governments have not, are not and will not act in the public interest. Laws that criminalize acquisition of data or documents are a continuation of their failure to act in the public interest.

Journalists who serve the public interest, by exposing the government’s failure to do so, should use any means at their disposal to obtain data and documents that evidence government failure and misconduct.

Are you a journalist serving the public interest or a “reach it to me” journalist, serving the public interest when there’s no threat to you?

December 26, 2017

xsd2json – XML Schema to JSON Schema Transform

Filed under: JSON,XML,XML Schema — Patrick Durusau @ 9:12 pm

xsd2json by Loren Cahlander.

From the webpage:

XML Schema to JSON Schema Transform – Development and Test Environment

The options that are supported are:

'keepNamespaces' – set to true if keeping prefixes in the property names is required; otherwise prefixes are eliminated.

'schemaId' – the name of the schema

#xs:short
{ "type": "integer", "xsdType": "xs:short", "minimum": -32768, "maximum": 32767, "exclusiveMinimum": false, "exclusiveMaximum": false }

To be honest, I can’t imagine straying from Relax-NG, much less converting an XSD schema into a JSON schema.

But, it’s not possible to predict all needs and futures (hint to AI fearests). It will be easier to find xsd2json here than with adware burdened “modern” search engines, should the need arise.

Geocomputation with R – Open Book in Progress – Contribute

Filed under: Geographic Data,Geography,Geospatial Data,R — Patrick Durusau @ 8:57 pm

Geocomputation with R by Robin Lovelace, Jakub Nowosad, Jannes Muenchow.

Welcome to the online home of Geocomputation with R, a forthcoming book with CRC Press.

Development

Inspired by bookdown and other open source projects we are developing this book in the open. Why? To encourage contributions, ensure reproducibility and provide access to the material as it evolves.

The book’s development can be divided into four main phases:

  1. Foundations
  2. Basic applications
  3. Geocomputation methods
  4. Advanced applications

Currently the focus is on Part 2, which we aim to be complete by December. New chapters will be added to this website as the project progresses, hosted at geocompr.robinlovelace.net and kept up-to-date thanks to Travis….

Speaking of R and geocomputation, I’ve been trying to remember to post about Geocomputation with R since I encountered it a week or more ago. Not what I expect from CRC Press. That got my attention right away!

Part II, Basic Applications, has two chapters: 7 Location analysis and 8 Transport applications.

Layering the display of data from different sources should be included under Basic Applications. For example, relying on but not displaying topographic data to calculate line of sight between positions. Perhaps the base display is a high-resolution image overlaid with GPS coordinates at intervals, with the computed lines of sight colored onto the structures.

Other “basic applications” you would suggest?

Looking forward to progress on this volume!

All targets have spatial-temporal locations.

Filed under: Geographic Data,Geography,Geophysical,Geospatial Data,R,Spatial Data — Patrick Durusau @ 5:29 pm

r-spatial

From the about page:

r-spatial.org is a website and blog for those interested in using R to analyse spatial or spatio-temporal data.

Posts from the last six months should whet your appetite for this blog.

The budget of a government for spatial-temporal software is no indicator of skill with spatial and spatial-temporal data.

How are yours?

December 24, 2017

Deep Learning for NLP, advancements and trends in 2017

Filed under: Artificial Intelligence,Deep Learning,Natural Language Processing — Patrick Durusau @ 5:57 pm

Deep Learning for NLP, advancements and trends in 2017 by Javier Couto.

If you didn’t get enough books as presents, Couto solves your reading shortage rather nicely:

Over the past few years, Deep Learning (DL) architectures and algorithms have made impressive advances in fields such as image recognition and speech processing.

Their application to Natural Language Processing (NLP) was less impressive at first, but has now proven to make significant contributions, yielding state-of-the-art results for some common NLP tasks. Named entity recognition (NER), part of speech (POS) tagging or sentiment analysis are some of the problems where neural network models have outperformed traditional approaches. The progress in machine translation is perhaps the most remarkable among all.

In this article I will go through some advancements for NLP in 2017 that rely on DL techniques. I do not pretend to be exhaustive: it would simply be impossible given the vast amount of scientific papers, frameworks and tools available. I just want to share with you some of the works that I liked the most this year. I think 2017 has been a great year for our field. The use of DL in NLP keeps widening, yielding amazing results in some cases, and all signs point to the fact that this trend will not stop.

After skimming this post, I suggest you make a fresh pot of coffee before starting to read and chase the references. It will take several days/pots to finish so it’s best to begin now.

Adversarial Learning Market Opportunity

The Pentagon’s New Artificial Intelligence Is Already Hunting Terrorists by Marcus Weisgerber.

From the post:

Earlier this month at an undisclosed location in the Middle East, computers using special algorithms helped intelligence analysts identify objects in a video feed from a small ScanEagle drone over the battlefield.

A few days into the trials, the computer identified objects – people, cars, types of building – correctly about 60 percent of the time. Just over a week on the job – and a handful of on-the-fly software updates later – the machine’s accuracy improved to around 80 percent. Next month, when its creators send the technology back to war with more software and hardware updates, they believe it will become even more accurate.

It’s an early win for a small team of just 12 people who started working on the project in April. Over the next year, they plan to expand the project to help automate the analysis of video feeds coming from large drones – and that’s just the beginning.

“What we’re setting the stage for is a future of human-machine teaming,” said Air Force Lt. Gen. John N.T. “Jack” Shanahan, director for defense intelligence for warfighter support, the Pentagon general who is overseeing the effort. Shanahan believes the concept will revolutionize the way the military fights.

So you will recognize Air Force Lt. Gen. John N.T. “Jack” Shanahan, from the Nvidia GPU Technology Conference coverage:

Don’t change the culture. Unleash the culture.

That was the message one young officer gave Lt. General John “Jack” Shanahan — the Pentagon’s director for defense for warfighter support — who is hustling to put artificial intelligence and machine learning to work for the U.S. Defense Department.

Highlighting the growing role AI is playing in security, intelligence and defense, Shanahan spoke Wednesday during a keynote address about his team’s use of GPU-driven deep learning at our GPU Technology Conference in Washington.

Shanahan leads Project Maven, an effort launched in April to put machine learning and AI to work, starting with efforts to turn the countless hours of aerial video surveillance collected by the U.S. military into actionable intelligence.

There are at least two market opportunities for adversarial learning. The most obvious one is testing a competitor’s algorithm so it performs less well than yours on “… people, cars, types of building….”

The less obvious market requires US sales of AI-enabled weapon systems to its client states. Client states have an interest in verifying the quality of AI-enabled weapon systems, not to mention non-client states who will be interested in defeating such systems.

For any of those markets, weaponizing adversarial learning and developing a reputation for the same can’t start too soon. Is your anti-AI research department hiring?

Ichano AtHome IP Cameras – Free Vulnerabilities from Amazon

Filed under: Cybersecurity,Security — Patrick Durusau @ 5:36 pm

SSD Advisory – Ichano AtHome IP Cameras Multiple Vulnerabilities

Catalin Cimpanu @campuscodi pointed to these free vulnerabilities:

AtHome Camera is “a remote video surveillance app which turns your personal computer, smart TV/set-top box, smart phone, and tablet into a professional video monitoring system in a minute.”

The vulnerabilities found are:

  • Hard-coded username and password – telnet
  • Hard-coded username and password – Web server
  • Unauthenticated Remote Code Execution

Did you know the AtHome Camera – Remote video surveillance, Home security, Monitoring, IP Camera by iChano is a free download at Amazon?

That’s right! You can get all three of these vulnerabilities for free! Ranked “#270 in Apps & Games > Utilities,” as of 24 December 2017.

Context Sensitive English Glosses and Interlinears – Greek New Testament

Filed under: Bible,Greek — Patrick Durusau @ 3:56 pm

Context Sensitive English Glosses and Interlinears by Jonathan Robie.

From the post:

I am working on making the greeksyntax package for Jupyter more user-friendly in various ways, and one of the obvious ways to do that is to provide English glosses.

Contextual glosses in English are now available in the Nestle 1904 Lowfat trees. These glosses have been available in the Nestle1904 repository, where they were extracted from the Berean Interlinear Bible with their generous permission. I merged them into the Nestle 1904 Lowfat treebank using this query. And now they are available whenever you use this treebank.

Another improvement in the resources available to non-professionals who study the Greek New Testament.

Nestle 1904 isn’t the latest work but then the Greek New Testament isn’t the hotbed of revision it once was. 😉

If you are curious why the latest editions of the Greek New Testament aren’t freely available to the public, you will have to ask the scholars who publish them.

My explanation for hoarding of the biblical text isn’t a generous one.

Sleuth Kit – Checking Your Footprints (if any)

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:39 pm

Open Source File System Digital Forensics: The Sleuth Kit

From the webpage:

The Sleuth Kit is an open source forensic toolkit for analyzing Microsoft and UNIX file systems and disks. The Sleuth Kit enables investigators to identify and recover evidence from images acquired during incident response or from live systems. The Sleuth Kit is open source, which allows investigators to verify the actions of the tool or customize it to specific needs.

The Sleuth Kit uses code from the file system analysis tools of The Coroner’s Toolkit (TCT) by Wietse Venema and Dan Farmer. The TCT code was modified for platform independence. In addition, support was added for the NTFS and FAT file systems. Previously, The Sleuth Kit was called The @stake Sleuth Kit (TASK). The Sleuth Kit is now independent of any commercial or academic organizations.

It is recommended that these command line tools can be used with the Autopsy Forensic Browser. Autopsy is a graphical interface to the tools of The Sleuth Kit and automates many of the procedures and provides features such as image searching and MD5 image integrity checks.

As with any investigation tool, any results found with The Sleuth Kit should be recreated with a second tool to verify the data.

The Sleuth Kit allows one to analyze a disk or file system image created by ‘dd’, or a similar application that creates a raw image. These tools are low-level and each performs a single task. When used together, they can perform a full analysis.
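A sketch of that chaining, driving the command-line tools from Python; the image name, partition offset and inode number below are hypothetical and come from reading the mmls/fls output for your own image:

# Sketch: chain Sleuth Kit command-line tools against a raw 'dd' image.
# disk.dd, the offset and the inode number are hypothetical placeholders.
import subprocess

IMAGE = "disk.dd"

# 1. Partition layout: note the starting sector of the filesystem of interest.
subprocess.run(["mmls", IMAGE], check=True)

OFFSET = "2048"            # starting sector reported by mmls (hypothetical)

# 2. Recursively list files and directories in that filesystem.
subprocess.run(["fls", "-r", "-o", OFFSET, IMAGE], check=True)

# 3. Extract the contents of one file by its inode number (from the fls listing).
INODE = "12345"            # hypothetical
with open("recovered.bin", "wb") as out:
    subprocess.run(["icat", "-o", OFFSET, IMAGE, INODE], check=True, stdout=out)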

Question: Who should find your footprints first? You or someone investigating an incident?

Test your penetration techniques for footprints before someone else does. Yes?

BTW, pick up a copy of the Autopsy Forensic Browser.
