Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 3, 2016

Phishing John Podesta

Filed under: Cybersecurity,Government,Humor — Patrick Durusau @ 2:41 pm

Reconstructions of the phishing email and Google login were posted to Twitter. I make no warranty as to their accuracy and/or resemblance to what may or may not have been seen by John Podesta.

podesta-google-phishing-email

podesta-google-login

Recognition Warning: A phishing email sent to you will have your name, not John Podesta, in the email. The alleged Google login page will have your image and your name.

If you get the fake Google password page for John Podesta, that is a poor phishing attempt.

If you would fall for this phishing attempt addressed to you (or John Podesta):

Turn off your computer. Unplug your computer (to avoid accidentally starting it).

Inform your employer you require a position that does not involve computers.

Thanks for making the internet safer for your employer and everyone else!

Stanford CoreNLP v3.7.0 beta is out! [Time is short, comments, bug reports, now!]

Stanford CoreNLP v3.7.0 beta

The tweets I saw from Stanford NLP Group read:

Stanford CoreNLP v3.7.0 beta is out—improved coreference, dep parsing—KBP relation annotator—Arabic pipeline #NLProc

We’re doing an official CoreNLP beta release this time, so bugs, comments, and fixes especially appreciated over the next two weeks!

OK, so, what are you waiting for? 😉

Oh, the standard blurb for your boss on why Stanford CoreNLP should be taking up your time:

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

Choose Stanford CoreNLP if you need:

  • An integrated toolkit with a good range of grammatical analysis tools
  • Fast, reliable analysis of arbitrary texts
  • The overall highest quality text analytics
  • Support for a number of major (human) languages
  • Interfaces available for various major modern programming languages
  • Ability to run as a simple web service

Stanford CoreNLP is an integrated framework. Its goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A CoreNLP tool pipeline can be run on a piece of plain text with just two lines of code. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Using the standard blurb about the Stanford CoreNLP has these advantages:

  • It’s copy-n-paste, you didn’t have to write it
  • It’s an appeal to authority (Stanford)
  • It’s truthful

The truthful point is a throw-away these days but I thought I should mention it. 😉

Encyclopedia of Distances

Filed under: Distance,Edit Distance,Mathematics,Metric Spaces — Patrick Durusau @ 9:37 am

Encyclopedia of Distances (4th edition) by Michel Marie Deza and Elena Deza.

Springer description:

This 4-th edition of the leading reference volume on distance metrics is characterized by updated and rewritten sections on some items suggested by experts and readers, as well a general streamlining of content and the addition of essential new topics. Though the structure remains unchanged, the new edition also explores recent advances in the use of distances and metrics for e.g. generalized distances, probability theory, graph theory, coding theory, data analysis.

New topics in the purely mathematical sections include e.g. the Vitanyi multiset-metric, algebraic point-conic distance, triangular ratio metric, Rossi-Hamming metric, Taneja distance, spectral semimetric between graphs, channel metrization, and Maryland bridge distance. The multidisciplinary sections have also been supplemented with new topics, including: dynamic time wrapping distance, memory distance, allometry, atmospheric depth, elliptic orbit distance, VLBI distance measurements, the astronomical system of units, and walkability distance.

Leaving aside the practical questions that arise during the selection of a ‘good’ distance function, this work focuses on providing the research community with an invaluable comprehensive listing of the main available distances.

As well as providing standalone introductions and definitions, the encyclopedia facilitates swift cross-referencing with easily navigable bold-faced textual links to core entries. In addition to distances themselves, the authors have collated numerous fascinating curiosities in their Who’s Who of metrics, including distance-related notions and paradigms that enable applied mathematicians in other sectors to deploy research tools that non-specialists justly view as arcane. In expanding access to these techniques, and in many cases enriching the context of distances themselves, this peerless volume is certain to stimulate fresh research.

Ransomed for $149 (US) per digital copy, this remarkable work should have a broad readership.

From the introduction to the 2009 edition:


Distance metrics and distances have now become an essential tool in many areas of Mathematics and its applications including Geometry, Probability, Statistics, Coding/Graph Theory, Clustering, Data Analysis, Pattern Recognition, Networks, Engineering, Computer Graphics/Vision, Astronomy, Cosmology, Molecular Biology, and many other areas of science. Devising the most suitable distance metrics and similarities, to quantify the proximity between objects, has become a standard task for many researchers. Especially intense ongoing search for such distances occurs, for example, in Computational Biology, Image Analysis, Speech Recognition, and Information Retrieval.

Often the same distance metric appears independently in several different areas; for example, the edit distance between words, the evolutionary distance in Biology, the Levenstein distance in Coding Theory, and the Hamming+Gap or shuffle-Hamming distance.

(emphasis added)

I highlighted that last sentence to emphasize that Encyclopedia of Distances is a static and undisclosed topic map.

While readers familiar with the concepts:

edit distance between words, the evolutionary distance in Biology, the Levenstein distance in Coding Theory, and the Hamming+Gap or shuffle-Hamming distance.

could enumerate why those merit being spoken of as being “the same distance metric,” no indexing program can accomplish the same feat.

If each of those concepts had enumerated properties, which could be compared by an indexing program, readers could not only discover those “same distance metrics” but could also discover new rediscoveries of that same metric.

As it stands, readers must rely upon the undisclosed judgments of the Dezas and hope they continue to revise and extend this work.

When they cease to do so, successive editors will be forced to re-acquire the basis for adding new/re-discovered metrics to it.

PS: Suggestions of similar titles that deal with non-metric distances? I’m familiar with works that impose metrics on non-metric distances but that’s not what I have in mind. That’s an arbitrary and opaque mapping from non-metric to metric.

November 2, 2016

Wild Maths – explore, imagine, experiment, create!

Filed under: Mathematical Reasoning,Mathematics — Patrick Durusau @ 8:13 pm

Wild Maths – explore, imagine, experiment, create!

From the webpage:

Mathematics is a creative subject. It involves spotting patterns, making connections, and finding new ways of looking at things. Creative mathematicians play with ideas, draw pictures, have the courage to experiment and ask good questions.

Wild Maths is a collection of mathematical games, activities and stories, encouraging you to think creatively. We’ve picked out some of our favourites below – have a go at anything that catches your eye. If you want to explore games, challenges and investigations linked by some shared mathematical areas, click on the Pathways link in the top menu.

The line:

It involves spotting patterns, making connections, and finding new ways of looking at things.

is true of data science as well.

I’m going to print out Can you traverse it?, to keep myself honest, if nothing else. 😉

Enjoy!

How To Use Twitter to Learn Data Science (or anything)

Filed under: Data Science,Twitter — Patrick Durusau @ 7:55 pm

How To Use Twitter to Learn Data Science (or anything) by Data Science Renee.

Judging from the date on the post (May 2016), Renee’s enthusiasm for Twitter came before her recently breaking 10,000 followers on Twitter. (Congratulations!)

The one thing I don’t see Renee mentioning is the use of your own Twitter account to gain experience with a whole range of data mining tools.

Your Twitter feed will quickly out-strip your ability to “keep up,” so how do you propose to deal with that problem?

Renee suggests limiting examination of your timeline (in part), but have you considered using machine learning to assist you?

Or visualizing your areas of interest or the people that you follow?

Indexing resources pointed to in tweets?

NLP processing of tweets?

Every tool of data science that you will be using for clients is relevant to your own Twitter feed.

What better way to learn tools than using them on content that interests you?

Enjoy!

BTW, follow Data Science Renee for a broad range of data science tools and topics!

ggplot2 cheatsheet updated – other R cheatsheets

Filed under: Data Mining,Ggplot2,R — Patrick Durusau @ 7:32 pm

RStudio Cheat Sheets

I saw a tweet that the ggplot2 cheatsheet has been updated.

Here’s a list of all the cheatsheets available at RStudio:

  • R Markdown Cheat Sheet
  • RStudio IDE Cheat Sheet
  • Shiny Cheat Sheet
  • Data Visualization Cheat Sheet
  • Package Development Cheat Sheet
  • Data Wrangling Cheat Sheet
  • R Markdown Reference Guide

Contributed Cheatsheets

  • Base R
  • Advanced R
  • Regular Expressions
  • How big is your graph? (base R graphics)

I have deliberately omitted links as when cheat sheets are updated, the links will break and/or you will get outdated information.

Use and reference the RStudio Cheat Sheets page.

Enjoy!

Wikileaks Podesta Docs Proven To Be False

Filed under: Government,Wikileaks — Patrick Durusau @ 7:01 pm

Glenn Greenwald tweeted this list of all the Podesta Docs from Wikileaks that have proven to be false:

https://t.co/3QAb3LLxn0

Journalists should keep that in mind when judging contested facts between Wikileaks and government sources.

Yes?

Don’t cyber-mess with Britain, warns UK Chancellor (I’m So Scared!)

Filed under: Cybersecurity,Security — Patrick Durusau @ 3:00 pm

Don’t cyber-mess with Britain, warns UK Chancellor by John E Dunn.

From the post:


“We will continue to invest in our offensive cyber-capabilities, because the ability to detect, trace and retaliate in kind is likely to be the best deterrent.”

The use of the word “retaliate” is key. According to Hammond, without the ability to go on the offensive in cyberspace the UK would be left with no way to respond except by either “turning the cheek” or resorting to old-fashioned military force, which means the risk of people being killed.

Enemies must understand this. Anyone thinking of attacking the UK in cyberspace was risking getting the same back.

Before hackers start wailing in despair, burning their computers, abandoning the internet, seeking asylum with the Amish, remember that Hammond and company would have to blame someone first.

On the issue of blame, check the latest pronouncements from the then U.S. President or one of their sycophants for the cyber-villain-of-the-day.

For example, today, November 2, 2016, if your hacker moniker isn’t Fancy Bear, you’re safe from retaliation.

Governmental cyber attribution is a politically colored game of buying a pig in a poke.

Let the buyer and public beware!

PS: I would not be overly fearful of British efforts. The British government has for years been unable to find child molesters in its own midst. There may be reasons other than incompetence for that failure.

Does Verification Matter? Clinton/Podesta Emails Update

Filed under: Data Mining,Hillary Clinton,News,Reporting — Patrick Durusau @ 12:51 pm

As of today, 10,357 DKIM Verified Clinton/Podesta Emails (of 43,526 total). That’s releases 1-26.

I ask “Does Verification Matter?” in the title to this post because of the seeming lack of interest in verification of emails in the media. Not that it would ever be a lead, but some mention of the verified/not status of an email seems warranted.

Every Clinton/Podesta story mentions Anthony Weiner’s interest in sharing his sexual insecurities and nary a peep about the false Clinton/Obama/Clapper claims that emails have been altered. Easy enough to check. But no specifics are given or requested by the press.

Thanks to the Clinton/Podesta drops by Michael Best, @NatSecGeek, I have now uploaded:

DKIM-verified-podesta-1-26.txt.gz is a sub-set of 10,357 emails that have been verified by their DKIM keys.

The statements in or data attached to those emails may still be false. DKIM verification only validates the email being the same as when it left the email server, nothing more.

DKIM-complete-podesta-1-26.txt.gz is the full set of Podesta emails to date, some 43,526, with their DKIM results of either True or False.

Both files have these fields:

ID – 1| Verified – 2| Date – 3| From – 4| To – 5| Subject – 6| Message-Id – 7
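If you want to load either file for your own analysis, here is a minimal Python sketch. It assumes (I have not restated the exact layout above) that each record is a single line, that the seven fields are separated by "|", and that the Verified field holds the literal text True or False. The names read_records and read_gzip are mine, purely for illustration.

```python
import gzip
from typing import Dict, Iterable, Iterator

# Field order as listed above; the key names themselves are illustrative.
FIELDS = ["id", "verified", "date", "from", "to", "subject", "message_id"]


def read_records(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per pipe-delimited line, skipping malformed lines."""
    for line in lines:
        parts = [part.strip() for part in line.rstrip("\n").split("|")]
        if len(parts) != len(FIELDS):
            # Assumption: a subject containing "|" would land here and be skipped.
            continue
        record: Dict[str, object] = dict(zip(FIELDS, parts))
        record["verified"] = record["verified"] == "True"
        yield record


def read_gzip(path: str) -> Iterator[Dict[str, object]]:
    """Read records straight from a .txt.gz file such as the ones named above."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from read_records(handle)
```

Counting the verified emails then becomes sum(1 for r in read_gzip("DKIM-complete-podesta-1-26.txt.gz") if r["verified"]), provided the assumptions about the layout hold.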

Enjoy!

PS: Perhaps verification doesn’t matter when the media repeats false and/or delusional statements of DNI Clapper in hopes of…, I don’t know what they are hoping for but I am hoping they are dishonest, not merely stupid.

Is Google Fancy Bear? Or is Microsoft? Factions of Fancy Bear?

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 10:13 am

Fancy Bear: Russia-linked hackers blamed for exploiting Windows zero-day flaw.

From the post:

MICROSOFT IS USING a new tactic to get people to upgrade to Windows 10 by warning that those who don’t could fall victim to Russian hackers.

The company said in a security advisory that a hacking group previously linked to the Russian government and US political hacks has exploited a newly discovered Windows zero-day flaw that was outed by Google earlier this week.

Microsoft claimed that the hacking group ‘Strontium’, more commonly known as ‘Fancy Bear’, had carried out a small number of attacks using spear phishing techniques.

Too much of a coincidence Google drops a zero-day flaw the same week it shows up in the wild from Fancy Bear?

Too much of a coincidence Windows 10 is the magic solution to an “all Windows/all the time” vulnerability?

Could Google and Microsoft be rival factions of Fancy Bear?

The super-hackers in North Korea should be offended by the obsession with Fancy Bear. Double ditto for the Chinese warlord class hackers.

For months, years in internet time, it’s Fancy Bear this and Fancy Bear that. Your toaster on the blink, must be Fancy Bear. Your printer is jammed, must be Fancy Bear. Worried about hacking paper ballots? Must be Fancy Bear.

Despite DNI James Clapper’s paranoid and Hillary Clinton-serving fantasies, there is more to attribution than saying a catchy name.

November 1, 2016

What’s Your NSA Number?

Filed under: Cybersecurity,NSA — Patrick Durusau @ 6:03 pm

You have heard of Erdös numbers, which are based on collaboration of mathematicians with Paul Erdös. The Erdös Number Project

The publication of (alleged) NSA hacked sites may give rise to your NSA Number. (New leak may show if you were hacked by the NSA by Dan Goodin.)

With two assumptions:

  1. The 290 IP addresses are indeed valid.
  2. The NSA did in fact hack those sites.

The top NSA Number would be 290. (I combined, sorted and deduped the IP addresses. Other counts are out there but I don’t know how they were made.)
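For anyone who wants to repeat the combine, sort and dedupe step, here is one way it could look in Python. This is a sketch, not a record of what I ran: the function name combine_ips is hypothetical, and it assumes each source is an iterable of strings with one address per entry.

```python
import ipaddress
from typing import Iterable, List


def combine_ips(*sources: Iterable[str]) -> List[str]:
    """Merge address lists, drop duplicates and junk, sort numerically."""
    unique = set()
    for source in sources:
        for raw in source:
            text = raw.strip()
            if not text:
                continue
            try:
                unique.add(ipaddress.ip_address(text))
            except ValueError:
                # Not a valid IPv4/IPv6 address; skip it rather than guess.
                continue
    # Sort by (version, integer value) so IPv4 and IPv6 never get compared directly.
    ordered = sorted(unique, key=lambda ip: (ip.version, int(ip)))
    return [str(ip) for ip in ordered]
```

Sorting on the numeric value puts 9.255.255.255 ahead of 10.0.0.1, which a plain text sort would get wrong.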

As a first step, I ran ping on the 290 and 74 reported as “up.”
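A rough Python sketch of that ping pass follows. The flags "-c 1 -W 2" are the Linux iputils form (one echo request, two second wait) and are an assumption about your platform; other ping implementations use different options. The helper names are hypothetical. Keep in mind that a host that ignores ICMP will show as "down" even when it is running.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple


def ping_command(address: str) -> List[str]:
    """Build a single-probe ping command (Linux iputils syntax assumed)."""
    return ["ping", "-c", "1", "-W", "2", address]


def is_up(address: str, run: Callable = subprocess.run) -> bool:
    """Return True when ping exits with status 0 for this address."""
    result = run(
        ping_command(address),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def split_by_status(
    addresses: Iterable[str], check: Callable[[str], bool] = is_up
) -> Tuple[List[str], List[str]]:
    """Partition addresses into (up, down) lists, preserving input order."""
    up: List[str] = []
    down: List[str] = []
    for address in addresses:
        (up if check(address) else down).append(address)
    return up, down
```

Calling split_by_status on the deduped list gives the "up" and "down" groups in one pass. The run and check parameters exist only so the logic can be exercised without touching the network.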

My results on the 290.

Many other avenues of server detection remain to pursue but a common list is a good start.

Edits/changes to my list?

Thanks!

Copyright Office Opens Up 512 Safe Harbor ($105 Fee Reduced To $6)

Filed under: Electronic Frontier Foundation,Intellectual Property (IP) — Patrick Durusau @ 4:26 pm

After reading the Copyright Office explanation for the changes Elliot Harmon complains of in Copyright Office Sets Trap for Unwary Website Owners, I see the Copyright Office as opening up the 512 safe harbor to more people.

In his rush to criticize the Copyright Office for not taking EFF advice, Elliot forgets to mention:


Transitioning to the electronic system has allowed the Office to substantially reduce the fee to designate an agent with the Office, from $105 (plus an additional fee of $35 for each group of one to ten alternate names used by the service provider) to $6 (with no additional fee for alternate names).

Copyright Office Announces Electronic System for Designating Agents under DMCA

Wow! Government fees going down?

Going from $105 (plus $35 for alternate names) to $6, with no additional fee for alternate names, opens up the 512 safe harbor to small owners/sites.

True enough, the new rule requires you to renew every three years but given the plethora of renewals we all face, what’s one more? Especially an important one.

The Copyright Office has prepared videos (with transcripts) to guide you to the new system.

A starting point for further reading: Copyright Office Reviews Section 512 Safe Harbor for Online User-Generated Content – The Differing Perceptions of Musicians and Other Copyright Holders and Online Service Providers on the Notice and Take-Down Process by David Oxenford. Just a starting point.

If you have or suspect you have copyright issues, consult an attorney. Law isn’t a safe place for self-exploration.

PS: I understand that EFF must write for its base, but closer attention to the facts of rules and changes would be appreciated.

Andrew Ng – Machine Learning – Lecture Notes

Filed under: Machine Learning — Patrick Durusau @ 3:25 pm

CS 229 Machine Learning Course Materials.

If your handwriting is as bad as mine, lecture notes are a great read-along with the video lectures or to use for review.

As you might expect, these notes are of exceptional quality.

Enjoy!

How To DeDupe Clinton/Weiner/Abedin Emails….By Tomorrow

Filed under: FBI,Hillary Clinton,Politics — Patrick Durusau @ 1:43 pm

The report by Halimah Abdullah, FBI Working to Winnow Through Emails From Anthony Weiner’s Laptop, casts serious doubt on the technical prowess of the FBI when it says:


Officials have been combing through the emails since Sunday night — using a program designed to find only the emails to and from Abedin within the time when Clinton was secretary of state. Agents will compare the latest batch of messages with those that have already been investigated to determine whether any classified information was sent from Clinton’s server.

This process will take some time, but officials tell NBC News that they hope that they will wrap up the winnowing process this week.

Since Sunday night?

Here’s how the FBI, using standard Unix tools, could have finished the “winnowing” in time for the Monday evening news cycle:

  1. Transform (if not already) all the emails into .eml format (to give you separate files for each email).
  2. Grep the resulting file set for emails that contain the Clinton email server by name or address.
  3. Save the result of #2 to a file and copy all those messages to a separate directory.
  4. Extract the digital signature from each of the copied messages (see below) and save each signature, plus the name of the file where it was found, to the Abedin file.
  5. Extract the digital signatures from previously reviewed Clinton email server emails, save digital signatures only to the prior-Clinton-review file.
  6. Search for each digital signature in the Abedin file in the prior-Clinton-review file. If found, reviewed. If not found, new email.

The digital signatures are unique to each email and can therefore be used to dedupe or in this case, identify previously reviewed emails.

Here’s a DKIM example signature:

How can I read the DKIM header?

Here is an example DKIM signature (recorded as an RFC2822 header field) for the signed message:

DKIM-Signature: a=rsa-sha1; q=dns;
    d=example.com;
    i=user@eng.example.com;
    s=jun2005.eng; c=relaxed/simple;
    t=1117574938; x=1118006938;
    h=from:to:subject:date;
    b=dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb
      av+yuU4zGeeruD00lszZVoG4ZHRNiYzR

Let’s take this piece by piece to see what it means. Each “tag” is associated with a value.

  • b = the actual digital signature of the contents (headers and body) of the mail message
  • bh = the body hash
  • d = the signing domain
  • s = the selector
  • v = the version
  • a = the signing algorithm
  • c = the canonicalization algorithm(s) for header and body
  • q = the default query method
  • l = the length of the canonicalized part of the body that has been signed
  • t = the signature timestamp
  • x = the expire time
  • h = the list of signed header fields, repeated for fields that occur multiple times

We can see from this email that:

  • The digital signature is dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSbav+yuU4zGeeruD00lszZVoG4ZHRNiYzR.
    This signature is matched with the one stored at the sender’s domain.
  • The body hash is not listed.
  • The signing domain is example.com.
    This is the domain that sent (and signed) the message.
  • The selector is jun2005.eng.
  • The version is not listed.
  • The signing algorithm is rsa-sha1.
    This is the algorithm used to generate the signature.
  • The canonicalization algorithm(s) for header and body are relaxed/simple.
  • The default query method is DNS.
    This is the method used to look up the key on the signing domain.
  • The length of the canonicalized part of the body that has been signed is not listed.
    The signing domain can generate a key based on the entire body or only some portion of it. That portion would be listed here.
  • The signature timestamp is 1117574938.
    This is when it was signed.
  • The expire time is 1118006938.
    Because an already signed email can be reused to “fake” the signature, signatures are set to expire.
  • The list of signed header fields includes from:to:subject:date.
    This is the list of fields that have been “signed” to verify that they have not been modified.

From: What is DKIM? Everything You Need to Know About Digital Signatures by Geoff Phillips.
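The tag-by-tag reading above is easy to automate. Here is a small Python sketch that splits a DKIM-Signature header value into its tags, using the example signature quoted above as input. The function name parse_dkim_tags is mine and the parser is deliberately naive: it assumes tags are separated by ";" and that only the b and bh values need their folding whitespace removed.

```python
import re
from typing import Dict


def parse_dkim_tags(header_value: str) -> Dict[str, str]:
    """Split a DKIM-Signature header value into a {tag: value} dict."""
    tags: Dict[str, str] = {}
    for part in header_value.split(";"):
        if "=" not in part:
            continue
        # Split on the first "=" only, since base64 values may end in "=" padding.
        name, value = part.split("=", 1)
        name = name.strip()
        value = value.strip()
        if name in ("b", "bh"):
            # Signature data may be folded across lines; rejoin it.
            value = re.sub(r"\s+", "", value)
        tags[name] = value
    return tags


# The example header value quoted above, with its line folding preserved.
EXAMPLE = (
    "a=rsa-sha1; q=dns;\n"
    "    d=example.com;\n"
    "    i=user@eng.example.com;\n"
    "    s=jun2005.eng; c=relaxed/simple;\n"
    "    t=1117574938; x=1118006938;\n"
    "    h=from:to:subject:date;\n"
    "    b=dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb\n"
    "      av+yuU4zGeeruD00lszZVoG4ZHRNiYzR"
)

if __name__ == "__main__":
    for tag, value in parse_dkim_tags(EXAMPLE).items():
        print(tag, "=", value)
```

Tags that are "not listed" in the walk-through (bh, v, l) are simply absent from the resulting dict.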

Altogether now, to eliminate previously reviewed emails we need only compare:

dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSbav+yuU4zGeeruD00lszZVoG4ZHRNiYzR (example, use digital signatures from Abedin file)

to the digital signatures in the prior-Clinton-review file.

Those that don’t match are new files to review.
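The comparison itself fits in a few lines. The Python sketch below is a stand-in for the grep-style steps above, not a description of any tool the FBI has. It assumes each message is available as raw .eml text, looks only at the first DKIM-Signature header, and treats a message with no signature as needing review. The names signature_of and new_messages are hypothetical.

```python
import re
from email import message_from_string
from typing import Dict, List, Optional, Set


def signature_of(raw_message: str) -> Optional[str]:
    """Return the b= value of the first DKIM-Signature header, or None."""
    header = message_from_string(raw_message).get("DKIM-Signature")
    if header is None:
        return None
    # Match "b=" at the start or after ";" so the "bh=" tag is not picked up.
    match = re.search(r"(?:^|;)\s*b=([^;]*)", str(header))
    if match is None:
        return None
    # Remove folding whitespace so the signature compares as one string.
    return re.sub(r"\s+", "", match.group(1))


def new_messages(candidates: Dict[str, str], reviewed: Set[str]) -> List[str]:
    """Return file names whose signature is not in the reviewed set."""
    fresh = []
    for name, raw in candidates.items():
        signature = signature_of(raw)
        if signature is None or signature not in reviewed:
            fresh.append(name)
    return sorted(fresh)
```

Here candidates maps file names to message text (the Abedin file) and reviewed is the set of signatures from the prior-Clinton-review file. Set membership makes each lookup constant time, so even hundreds of thousands of emails compare in seconds.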

Why the news media hasn’t pressed the FBI on its extremely poor data processing performance is a mystery to me.

You?

PS: FBI field agents with data mining questions, I do off-your-books freelance consulting. Apologies but on-my-books for the tax man. If they don’t tell, neither will I.
