Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 2, 2018

Archives for the Dark Web: A Field Guide for Study

Filed under: Archives,Dark Web,Ethics,Journalism,Tor — Patrick Durusau @ 4:48 pm

Archives for the Dark Web: A Field Guide for Study by Robert A. Gehl.

Abstract:

This chapter provides a field guide for other digital humanists who want to study the Dark Web. In order to focus the chapter, I emphasize my belief that, in order to study the cultures of Dark Web sites and users, the digital humanist must engage with these systems’ technical infrastructures. I will provide specific reasons why I believe that understanding the technical details of Freenet, Tor, and I2P will benefit any researchers who study these systems, even if they focus on end users, aesthetics, or Dark Web cultures. To this end, I offer a catalog of archives and resources researchers could draw on and a discussion of why researchers should build their own archives. I conclude with some remarks about ethics of Dark Web research.

Highly recommended read, but it falls short on practical archiving advice for beginning researchers and journalists.

Digital resources, Dark Web or not, can be ephemeral. Archiving produces the only reliable and persistent record of resources as you encountered them.
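As a down payment on the practical advice I find missing, here is a minimal Python sketch of one-off page archiving. It is my illustration, not Gehl’s: fetch a resource, store the raw bytes, and record a timestamp and SHA-256 hash so you can later show exactly what you saw and when. For .onion resources you would route the request through Tor (e.g., a SOCKS proxy), which this sketch omits.

    import hashlib
    import json
    import time
    from pathlib import Path

    import requests  # third-party: pip install requests

    def archive(url, outdir="archive"):
        """Fetch url, save raw bytes plus a timestamp/hash manifest."""
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        digest = hashlib.sha256(resp.content).hexdigest()
        stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
        out = Path(outdir)
        out.mkdir(exist_ok=True)
        name = f"{stamp}-{digest[:12]}"
        # The resource exactly as encountered, plus a provenance manifest.
        (out / f"{name}.bin").write_bytes(resp.content)
        (out / f"{name}.json").write_text(json.dumps({
            "url": url, "retrieved": stamp, "sha256": digest,
            "status": resp.status_code}, indent=2))

    archive("https://example.com/")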

I am untroubled by Gehl’s concern for research ethics. Research ethics can disarm and distract scholars in the face of amoral enemies. Governments and their contractors, to name only two such enemies, exhibit no ethical code other than self-advantage.

Those who harm innocents rely on my non-contractual ethics at their own peril.

eXist-db 5.0.0 RC 3 [Prepping for Assange Data Tsunami]

Filed under: .Net,eXist,XML,XML Database,XQuery — Patrick Durusau @ 10:40 am

eXist-db 5.0.0 RC 3

One new feature and several bug fixes over RC 2, but I thought I should mention it for Assange Data Tsunami preppers.

I have deliberately avoided contact with any such preppers, but you can read my advice at: username: 4julian password: $etJulianFree!2Day.

The gist is that sysadmins should, with appropriate cautions, create accounts with “username: 4julian password: $etJulianFree!2Day,” in the event that Julian Assange is taken into custody (a likely event).
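For the curious, a sketch of what that might look like against eXist-db, via its REST interface and the securitymanager module. This is my construction, not anything from the release notes: verify the REST query envelope and the sm:create-account signature against your eXist version, and weigh the “appropriate cautions” before running anything like it.

    import requests  # third-party: pip install requests

    # XQuery to create the suggested account. sm:create-account must be
    # run by a dba user; check the function's arity for your release.
    XQUERY = """
    import module namespace sm = "http://exist-db.org/xquery/securitymanager";
    sm:create-account("4julian", "$etJulianFree!2Day", "guest")
    """

    # eXist's REST extended query envelope.
    ENVELOPE = f"""<query xmlns="http://exist.sourceforge.net/NS/exist">
      <text><![CDATA[{XQUERY}]]></text>
    </query>"""

    resp = requests.post(
        "http://localhost:8080/exist/rest/db",   # default REST endpoint
        data=ENVELOPE.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        auth=("admin", "admin-password"),        # placeholder credentials
    )
    print(resp.status_code, resp.text)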

If one truth teller (no WikiLeaks release has ever been proven false or modified) disturbs the world, then a tsunami of secret, classified, restricted, and proprietary data may shock it to its senses.

Start prepping for the Assange Data Tsunami today!

PS: Yes, there are a variety of social media events, broadcasts, etc. being planned. I wish them all well, but governments respond to bleeding more than pleading. In this case, bleeding data seems appropriate.

Learning Math for Machine Learning [for building products/conducting academic research]

Filed under: Machine Learning,Mathematics — Patrick Durusau @ 10:09 am

Learning Math for Machine Learning by Vincent Chen.

From the post:

It’s not entirely clear what level of mathematics is necessary to get started in machine learning, especially for those who didn’t study math or statistics in school.

In this piece, my goal is to suggest the mathematical background necessary to build products or conduct academic research in machine learning. These suggestions are derived from conversations with machine learning engineers, researchers, and educators, as well as my own experiences in both machine learning research and industry roles.

To frame the math prerequisites, I first propose different mindsets and strategies for approaching your math education outside of traditional classroom settings. Then, I outline the specific backgrounds necessary for different kinds of machine learning work, as these subjects range from high school-level statistics and calculus to the latest developments in probabilistic graphical models (PGMs). By the end of the post, my hope is that you’ll have a sense of the math education you’ll need to be effective in your machine learning work, whatever that may be!

I headlined:

…my goal is to suggest the mathematical background necessary to build products or conduct academic research in machine learning.

because the amount of math you need for machine learning depends on your use of machine learning tools.

If you intend to “build products or conduct academic research in machine learning,” then Chen’s post is as good a place to start as any. And knowing more math is always a good thing, if for no other reason than to challenge the “machine learning” others try to foist off on you.

However, there are existing machine learning tools which come with their own documentation and lore about their use in a wide variety of situations.

I always applaud deeper understanding of vulnerabilities or code, but it isn’t necessary to rewrite every tool, most tools, or even some tools from scratch to be effective in using machine learning.

While learning the math of machine learning at your own pace, I suggest the following (sketched in code below):

  1. Define the goal of your machine learning. Recommendation? Recognition?
  2. Define the subject area and likely inputs for your goal.
  3. Search for the use of your tool (if you already have one) and experience reports.
  4. Test and compare your results to industry reports in the same area.

My list assumes you already understand the goals of your client. Except in rare cases, machine learning is a means to reach those goals, not a goal itself.
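To make the list concrete, here is a minimal Python sketch of steps 1 through 4 using scikit-learn. The toolkit and the toy dataset are my assumptions, not Chen’s; the point is that an off-the-shelf tool plus an honest held-out comparison goes a long way.

    from sklearn.datasets import load_digits                # stand-in inputs (step 2)
    from sklearn.ensemble import RandomForestClassifier     # existing tool (step 3)
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Step 1: the goal here is recognition (classifying handwritten digits).
    X, y = load_digits(return_X_y=True)

    # Hold out a test set so the comparison in step 4 is honest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Step 4: compare this number against published results for the same task.
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))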

August 1, 2018

Developing SGML DTDs: From Text To Model To Markup

Filed under: XML,XPath — Patrick Durusau @ 8:06 pm

Developing SGML DTDs: From Text To Model To Markup by Eve Maler and Jeanne El Andaloussi.

Maler and El Andaloussi summarize (1.2.4) the benefits of SGML this way:

To summarize, SGML markup is unique in that it combines several design strengths:

  • It is declarative, which helps document producers “write once, use many”—putting the same document data to multiple uses, such as delivery of documents in a variety of online and paper formats and interchange with others who wish to use the documents in different ways.
  • It is generic across systems and has a nonproprietary design, which helps make documents vendor and platform independent and “future-proof”—protecting them against changes in computer hardware and software.
  • It is contextual, which heightens the quality and completeness of processing by allowing documents to be structurally validated and by enabling logical collections of data to be manipulated intelligently.

The characteristics of being declarative, generic, nonproprietary, and contextual make the Standard Generalized Markup Language “standard” and “generalized.”

A truly remarkable work that is as relevant today as it was twenty-three years ago.

Most important lesson: Understanding your document comes before designing markup. Every time.

Printable Guns – When Censorship Fails

Filed under: 3D Printing,Government,Politics — Patrick Durusau @ 7:24 pm

It’s always nice when censorship fails. If you think about it for a minute, there were several places this morning where printable gun designs could be downloaded.

In anticipation that you will find unlooked-for places with 3D printable gun designs, these may be useful resources:

20 Best 3D Printing Software Tools of 2018 (All Are Free)

20 Best Free STL File Viewer Tools of 2018

Before you try firing a printed gun, be sure to read 2018 3D Printed Gun Report – All You Need to Know very carefully.

There are reasons why no known military force uses 3D printed guns. Failure of the weapon and injury to its operator are two of them.

Interest in 3D printed guns has the potential to drive the market for better and cheaper 3D printers, as well as faster development of the technology.

All in all, not a bad result.

Trucks and beer (Music)

Filed under: Music,Text Analytics,Text Mining — Patrick Durusau @ 6:13 pm

Trucks and beer by John W. Miller.

From the post:

Inspired by a post on Big-ish Data, I’ve started working on a textual analysis of popular country music.

More specifically, I scraped Ranker.com for a list of the top female and male country artists of the last 100 years and used my python wrapper for the Genius API to download the lyrics to each song by every artist on the list. After my script ran for about six hours I was left with the lyrics to 12,446 songs by 83 artists stored in a 105 MB JSON file. As a bit of an outsider to the world of country music, I was curious whether some of the preconceived notions I had about the genre were true.

Some pertinent questions:

  • Which artist mentions trucks in their songs most often?
  • Does an artist’s affinity for trucks predict any other features? Their gender for example? Or their favorite drink?
  • How has the genre’s vocabulary changed over time?
  • Of all the artists, whose language is most diverse? Whose is most repetitive?

You can find my code for this project on GitHub.

Miller focuses on popular country music but the lesson here could be applied to any collection of lyrics.
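As a toy illustration of the kind of questions Miller asks, here is a Python sketch that counts truck mentions and computes a crude lexical-diversity score per artist. The JSON layout (artist mapped to a list of lyric strings) is my stand-in assumption; adapt it to whatever structure your own scrape produces.

    import json
    import re
    from collections import Counter

    with open("lyrics.json") as f:   # hypothetical file name
        corpus = json.load(f)        # assumed layout: {artist: [lyrics, ...]}

    truck_counts = Counter()
    diversity = {}
    for artist, songs in corpus.items():
        words = re.findall(r"[a-z']+", " ".join(songs).lower())
        truck_counts[artist] = sum(w.startswith("truck") for w in words)
        # Type-token ratio: a crude proxy for lexical diversity.
        diversity[artist] = len(set(words)) / max(len(words), 1)

    print(truck_counts.most_common(5))
    print(sorted(diversity.items(), key=lambda kv: kv[1], reverse=True)[:5])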

What’s your favorite genre or group?

Here’s a history/data question: Does popular (for some definition of popular) music change before revolutions? If so, in what way?

While you are at Miller’s site, browse around. There are a number of interesting posts in addition to this one.

Harnessing the Power of the Web via R Clients for Web APIs

Filed under: Data Mining,R,Web Applications — Patrick Durusau @ 3:35 pm

Harnessing the Power of the Web via R Clients for Web APIs by Lucy D’Agostino McGowan.

Abstract:

We often want to harness the power of the internet in our daily data practices, i.e., collect data from the internet, share data on the internet, let a dataset evolve on the internet and analyze it periodically, put products up on the internet, etc. While many of these goals can be achieved in a browser via mouse clicks, these practices aren’t very reproducible and they don’t scale, as they are difficult to capture and replicate. Most of what can be done in a browser can also be implemented with code. Web application programming interfaces (APIs) are one tool for facilitating this communication in a reproducible and scriptable way. In this talk we will discuss the general framework of common R clients for web APIs, as well as dive into specific examples. We will focus primarily on the googledrive package, a package that allows the user to control their Google Drive from the comfort of their R console, as well as other common R clients for web APIs, while discussing best practices for efficient and reproducible coding.

The ability to document and replicate the acquisition of data is a “best practice,” until you have acquired data you would prefer not be attributed to you. 😉

For cases where the “best practice” obtains, consult McGowan’s slides.
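McGowan’s examples are in R; for a language-agnostic taste of her point (an API call can be captured and replayed in a way mouse clicks cannot), here is a small Python sketch against the public GitHub REST API.

    import requests  # third-party: pip install requests

    def repo_info(owner, repo):
        """Fetch repository metadata; every run is identical and scriptable."""
        resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}",
                            timeout=30)
        resp.raise_for_status()
        return resp.json()

    info = repo_info("eXist-db", "exist")
    print(info["full_name"], info["stargazers_count"])

Commit a script like that and your data acquisition is documented and repeatable, for better or worse.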

