5 common biases in big data


Today, businesses are aware that a huge part of their decision-making is shaped by big data. The wide availability of data does not guarantee its relevance, and neither does its analysis by data scientists and analysts, since human judgment can be flawed. Moreover, several factors may affect data, either positively or negatively, so data may fluctuate over time. That is why it is crucial for data teams to know how to draw the right inferences from big data. This is only possible when data analysts and scientists are aware of the existing biases and the solutions to them.

Special thanks to Nate DW for the link to this article. The best of the five is “Simpson’s Paradox”. No, not the one where Homer smashed his little boy’s piggy bank and is wondering what he’s done. It’s when a trend shows up in every group of your data but, when you combine the groups and look at the totals, the trend disappears or even reverses. This is an excellent read for those of you who are labeling yourselves “Data Scientist”. I’m just a “Data Tinkerer”.
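
To make Simpson’s Paradox concrete, here is a minimal Python sketch. The counts are illustrative numbers patterned after the often-quoted kidney stone treatment example, not figures from the linked article, and the names (`data`, `group_rates`, `overall_rates`) are made up for this sketch. Treatment A wins inside each group, yet B looks better once the groups are lumped together, because A was tried mostly on the harder cases.

```python
# Illustrative (successes, trials) counts per treatment and per group.
# These numbers are an assumption for demonstration, not from the article.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def group_rates(data):
    """Success rate for each treatment within each group."""
    return {
        treatment: {group: s / n for group, (s, n) in groups.items()}
        for treatment, groups in data.items()
    }


def overall_rates(data):
    """Success rate for each treatment with all groups combined."""
    rates = {}
    for treatment, groups in data.items():
        successes = sum(s for s, _ in groups.values())
        trials = sum(n for _, n in groups.values())
        rates[treatment] = successes / trials
    return rates


per_group = group_rates(data)
overall = overall_rates(data)

for group in ("small", "large"):
    a, b = per_group["A"][group], per_group["B"][group]
    print(f"{group:>5}: A={a:.1%}  B={b:.1%}")
print(f"  all: A={overall['A']:.1%}  B={overall['B']:.1%}")
```

The takeaway for a Data Tinkerer: before trusting a combined total, check whether the groups behind it are very different sizes.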


Via: 5 common biases in big data

Data Visualization Basics for Data Scientists

“A picture is worth a thousand words”, the old saying goes, and in some cases, a picture is worth even more than that. The human eye is composed of some 30 or more discrete components, and along with the optic nerves and the brain functions that process sight, can take in a contrast ratio of around 100,000:1 (over time) and can distinguish about 10 million colors. That sight-brain pathway is a pattern-matching wonder and has “regions of interest” that the eye/brain connection focuses on (http://www.cambridgeincolour.com/tutorials/cameras-vs-human-eye.htm).

Making up one of our primary senses, sight is immeasurably important to conveying information, and it’s vital to the Data Scientist to understand how to best use various visualizations to display and discuss data.

There is a book reference in this article from 2013 that is still a must-read for anyone attempting data visualization at any level. The best lesson is to look through other people’s eyes to appreciate how the information must “look”. I go by a simple rule: “If my wife, who is not technical or a data scientist, can’t understand the visual, it probably needs more work.”

Via: Microsoft Developer, Buck Woody

API Gateways, the Rosetta Stone for data

Services in a microservices architecture share some common requirements regarding authentication and transportation when they need to be accessible by external clients. API Gateways provide a shared layer to handle differences between service protocols and fulfill the requirements of specific clients like desktop browsers, mobile devices, and legacy systems.

API Gateways are the middleman in the Application-Data relationship. They serve as a community hall where folks go to meet and talk to one another. This community hall has a universal translator, like on Star Trek, that makes data understood by all the people in the room. Developers don’t worry about XML/JSON because the gateway understands them both. DBAs don’t worry about formatting the data because the gateway loves to format stuff.
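
As a rough illustration of the “universal translator” idea, here is a minimal Python sketch that accepts either JSON or XML and hands back the same plain dictionary. The function name `normalize_payload`, the sample order, and the flat one-level XML layout are all assumptions made up for this example. A real gateway also handles authentication, routing, nested data and much more, and the linked article builds its gateway in Node.js, not Python.

```python
import json
import xml.etree.ElementTree as ET


def normalize_payload(body: str, content_type: str) -> dict:
    """Turn a JSON or flat XML request body into a plain dict.

    Hypothetical helper: assumes XML bodies have a single root element
    whose children contain only text (no nesting, no attributes).
    """
    if content_type == "application/json":
        return json.loads(body)
    if content_type == "application/xml":
        root = ET.fromstring(body)
        return {child.tag: child.text for child in root}
    raise ValueError(f"Unsupported content type: {content_type}")


# The same made-up order, sent by two clients in two formats.
json_order = normalize_payload(
    '{"id": "42", "item": "widget"}', "application/json"
)
xml_order = normalize_payload(
    "<order><id>42</id><item>widget</item></order>", "application/xml"
)

# Downstream services see one shape regardless of what the client sent.
print(json_order == xml_order)
```

The design point is that the format check lives in one place, so the services behind the gateway only ever deal with one representation.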

Have you ever been frustrated with Siri, OK Google or Alexa? Gateway quality varies from one vendor to another. Writing your own in Node may be an alternative, I don’t know. Let’s talk.

via Building an API Gateway using Node.js — RisingStack Engineering

Presto, Magico open source distributed SQL Engine

In today’s blog, I will be introducing you to a new open-source distributed SQL query engine, Presto. It is designed for running SQL queries over Big Data (petabytes of data). It was designed by the people at Facebook. Quoting its formal definition:

“Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.”

The folks at Facebook are at it again. They built a SQL engine especially for analytical work; this is not an online transaction processing (OLTP) engine. It’s an engine for ad-hoc queries across SQL/NoSQL databases distributed all over the place.

They use connectors for MySQL, Hadoop/Hive, MongoDB, Postgres and more. Missing are some of the standards, like Microsoft SQL Server and Teradata. However, this won’t be the story for long.

Presto is still new as an open source project, but you should take a look at the documentation to really appreciate the power of this new thing.

via An Introduction to Presto — DZone Big Data Zone

GeekMustHave is now on YouTube

Hello and welcome to the Geek Must Have YouTube channel. Every once in a while you come across someone who, you just know, is going to try to build something cool… that’s me. This channel is going to be the Geek’s musings about technologies like Electronic Gadgets that blink, buzz, talk, listen and keep you awake at night. Computers of all sizes, from the tiny single-chip ones all the way up to big servers in the cloud. Coding from COBOL to JavaScript and everything in between. Databases starting with the humble flat file, tried and true relational and Big Data, NoSQL, REST and JSON. Communications with wireless, GSM, Bluetooth and amateur radio. Lots of cheap bits, bobs and doodads from China. This channel plans to have mailbags, project builds, deconstructions, reviews, recommendations, tools, and tips.

Check out the companion blog at HTTP://GeekMustHave.COM

This is a new channel and the Geek needs your help. Please click on the subscribe button,
watch the videos, click on the like button, and leave comments and questions.

The Geek is busy learning and building stuff, so don’t be upset if the response isn’t immediate.

Thank you, and now… “Let’s build something…”

Big Data in Healthcare Made Simple — DZone Big Data Zone

Knowing how to use big data to improve patient care is beneficial for those working in the healthcare industry. Big data is valuable to the healthcare industry in dozens of ways. Physicians can use specific data about their patients taking a type of medication and their reaction to the medicine. Data can also be used to determine high-risk groups based upon common factors.

Read on to learn more.

via Big Data in Healthcare Made Simple — DZone Big Data Zone

Let’s not talk about the elephant (Hadoop) in the room


Hadoop and HealthCare is a pairing that can help patient outcomes become much more positive. Every health care provider, from nurse aides, doctors and pharmacists to large corporate medical providers, all the way up to State and Federal governments, could be using this, but many are not. Big data sounds impressive and is the “Bright Shiny” thing. The Hadoop elephant is slowly plodding through the ranks of these providers, and it scares them. The lack of “Structure” in the data makes some think it is not very usable, highly inaccurate and less intuitive to consume. The IT departments say “It’s Not SQL, it’s not relational”. Others think it’s necessary to convert all the structured SQL-based databases to Hadoop Document databases. Some others worry about how to combine two different beasts. This article from Richard Proctor outlines just some of the ways the elephant in the room should be a new tool for innovation in health care; it is a multi-part article and a worthy read.

3D Graphics JS Library WhiteStorm is coming


Imagine writing a JavaScript application with full 3D graphics capability. Sounds simple, but when you include the physics of an object, shadows and where the light is coming from, it starts to get overwhelming. This article by Alexander introduces the WhiteStorm JS library. After going to the WhiteStorm website and trying some of the examples there, I can see this as an extension to the D3.js charts library to create business intelligence charts that have three dimensions and that you can look at from 360 degrees just by dragging the mouse about. WhiteStorm with some other components is going to be a significant improvement to complex analytics. Yes, I know they are touting WhiteStorm as a gaming library, but think a little outside the box. I have added this to my review and test list. News at 11PM.


Databases: Too many choices, oh what shall I do?

I suggest reading a book before you dive headlong into the next shiny bright thing in databases. This book appears to be a good read; I’ve downloaded it to my Kindle and started it. So far it is very interesting and easy to read. If you don’t have any background in databases I would suggest reading an entry-level book first. This review by I Programmer is a much better review than I would ever write; give it a quick 2-minute read. We are truly in the “Next Generation” of database evolution. You need to pay close attention to what is going on right now, or be stuck in the dBase/Paradox past again.


Master Big Data, Master Data Science first


This is a very good article to read. It is academically based but still very relevant to business. Data Science was a term coined in 1974, and it was one of the courses I took in college. Now it is back again, to define some skills you should consider learning to help manage and use any “Big Data” you may have.