The gist of this article is that data modeling is just as important with big data and NoSQL technologies as it is with the more traditional technologies based on relational algebra.
This conclusion came after a series of experiments pitting Cloudera’s Hadoop distribution against an unidentified ‘major relational database’. A suite of 5 business questions was distilled into either SQL for the relational database, or HQL for execution against Hadoop stacked with Hive. For each of the queries, for each data store, 5 experimental scenarios were explored:
- Flat Schema vs. Star Schema
- Using compressed data vs. uncompressed in Hadoop
- Indexing appropriate columns
- Partitioning the data by date
- Hive/HQL/MapReduce vs. Cloudera Impala
Details of the experiment and intermediate results can be found in the article. At a macro level the results were mixed, with one clear exception: a flat, un-modeled schema is not something one should use and expect good performance from. As the article points out, the question is not whether one should model or not, but rather how and when.
Tamara Dull, Director of Emerging Technologies, SAS Best Practices, wrote the following at SAS’ “The Data Roundtable” blog:
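To make the first scenario concrete, here is a minimal sketch of what "flat" versus "star" means in practice. It is not the article's test setup: it uses Python's built-in SQLite as a stand-in for both Hive and the relational database, and the table names, columns and sample rows (`sales_flat`, `fact_sales`, `dim_product`) are all hypothetical. It only shows that the same business question can be answered from either shape, and it says nothing about performance.

```python
import sqlite3


def build_and_query():
    """Load the same toy sales data as a flat table and as a star schema,
    then answer one question (revenue per category) from each."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Flat schema: every row repeats the descriptive product attributes.
    cur.execute(
        "CREATE TABLE sales_flat "
        "(sale_date TEXT, product_name TEXT, category TEXT, amount REAL)"
    )
    rows = [
        ("2014-01-01", "Widget", "Hardware", 10.0),
        ("2014-01-01", "Gadget", "Hardware", 15.0),
        ("2014-01-02", "Manual", "Books", 5.0),
        ("2014-01-02", "Widget", "Hardware", 10.0),
    ]
    cur.executemany("INSERT INTO sales_flat VALUES (?, ?, ?, ?)", rows)

    # Star schema: a narrow fact table plus a product dimension.
    cur.execute(
        "CREATE TABLE dim_product "
        "(product_id INTEGER PRIMARY KEY, product_name TEXT UNIQUE, category TEXT)"
    )
    cur.execute(
        "CREATE TABLE fact_sales "
        "(sale_date TEXT, product_id INTEGER, amount REAL)"
    )
    cur.execute(
        "INSERT INTO dim_product (product_name, category) "
        "SELECT DISTINCT product_name, category FROM sales_flat"
    )
    cur.execute(
        "INSERT INTO fact_sales "
        "SELECT f.sale_date, d.product_id, f.amount "
        "FROM sales_flat f JOIN dim_product d ON d.product_name = f.product_name"
    )

    # The same business question, asked of each model.
    flat = cur.execute(
        "SELECT category, SUM(amount) FROM sales_flat "
        "GROUP BY category ORDER BY category"
    ).fetchall()
    star = cur.execute(
        "SELECT d.category, SUM(s.amount) "
        "FROM fact_sales s JOIN dim_product d ON d.product_id = s.product_id "
        "GROUP BY d.category ORDER BY d.category"
    ).fetchall()
    conn.close()
    return flat, star
```

Calling `build_and_query()` returns the two result sets so they can be compared side by side. The star version costs a join at query time but stores each product's category once instead of on every sale row, which is the kind of trade-off the experiments were probing.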
It was refreshing to see that the RDBMS skills some of us have developed over the years still apply with these new big data technologies. And while discussions of late binding (i.e., applying structure to the data at query time, not load time) work their way down our corporate hallways, we are reminded once again that “it depends” is a far more honest and accurate answer than it’s been given credit for in the past.
To model or not to model is no longer the question. How and when to model is.
The Internet of Things (IoT) is what those in the know have come to call the collection of sensors, cameras and every other sort of device that has been connected to the internet. These things have been connected so that they can communicate with others of their kind to coordinate activities, as well as send their collected information back to central collection points for status monitoring, analysis and everything else one might be able to think of.
There’s tremendous opportunity for good to be done by connecting these devices together and back to central control points. Smart power grids that allow solar and wind power to be dynamically brought online and offline as needed are one of the many potential opportunities that get talked about. Intelligent traffic light switching to manage the delicate balance of public safety with fuel saving and traffic flow is another talked-about use for the IoT.
There’s also the potential for an ugly downside. Many of these devices are little more than a sensor, such as a thermometer, hooked up to as little hardware as possible (to keep down the cost of deployment and operations) to make them able to report their information out over the internet. Security has been an afterthought, if a thought at all. There have already been reports of a ‘smart’ refrigerator being hacked to generate and deliver spam. There is the real possibility that an industrious hacker figures out a way to take control of every net-connected toaster in the country, leading to the unending morning drama of burnt toast one day and slightly warm bread the next.
Bill Franks at SmartData Collective writes:
The way most devices that are part of the IOT communicate is by connecting to the Internet. Within most homes today, there are a handful of computers, tablets, and smart phones that have been registered on the network and allowed access to the Internet. In a few years, there could be dozens or hundreds of items per household connected. For hackers, this presents an unprecedented opportunity.
Phil Kemelor is Senior Manager, Enterprise Intelligence Digital Analytics of Ernst & Young. Phil was one of the earliest adopters and advocates for the use of analytics and has 16 years of experience in the field as a practitioner, industry analyst and consultant.
Phil recently wrote a couple of articles that were published at CMSWire. Both were on the broad topic of presentation effectiveness, with one focusing on how not to be boring when presenting analytic information and the other on how to present analytic information to a ‘C’-level executive. I’ve distilled his bullet points to give you a 30,000 ft fly-by of his advice. Please read the entire articles (links are embedded in several places) to get the full benefit of Phil’s wisdom.
[…] I’m going to share six reasons I see presentations fall off the rails and self-destruct. If you think analytics is more art than science, presenting analytics data is even more so. […]
Too Much Data
[…] Your audience wants to hear your recommendations as soon as possible. They trust that you know what you’re doing. Pages of data will put them to sleep — or make them antsy and irritated. […]
Passive Slide Titles
[…] Your audience doesn’t want to have to read through the slide to unearth the key takeaway. They don’t care so much about the data analysis as the perspective and smarts that you bring to the discussion — that’s why you have your job in the first place. […]
Metrics That Don’t Add Up
[…] Your audience doesn’t understand how [what you observed in your analysis] helps them figure out the business problem that they gave you to solve. It might help you figure out the answer, but for the audience, it’s like watching sausage get made. […] Only put the key data points and visuals that will support the headline into the rest of the slide.
[…] Your audience sees a slide with more text than they can read. It blends together. Their eyes glaze over. They are interested in being told what to do with the data. They want to get to the bottom line. […]
Speaking in a “Foreign Language”
[…] Your audience doesn’t necessarily live in the digital world. Maybe they live in the world of finance, regulation, operations and legal affairs. Digital is new to them and they don’t understand it quite as you do. They want to understand, but they also want the intersection of your world and their world to be clear and easy to grasp. […] Think about all of the terms that you might want to use to explain your data. Remove any of the industry language that you’ve grown accustomed to. […]
Overconfidence in One’s Ability as an Awesome Analyst
You know the data better than anyone. Your recommendations are freakin’ brilliant. Your insights will make your audience swoon. You’re charming and witty to boot. Naturally your presentation will be a stunning success.
But do you know this for sure? Have you shown your deck to anyone? Has it been proofed, edited? Have you done a run through? Is your timing going to be razor sharp?
[…] I’m going to share four techniques that will help you reach more people, more effectively than you are today with analytics data.
Know the difference between tactical and strategic data
[…] What do you present to executives? I think we’re all on board that you don’t present the same data that you would to the marketing team, the granularity isn’t of interest and pretty much a waste of time to someone who is running an organization, business unit or an entire program.
Understand what drives your organization’s strategy
[It surprises me that] […] many analytics managers and analysts do not know their organization’s strategic goals and objectives. If they do, they have not figured out how to tie the data that they collect to these goals. […]
[…] Roll up campaign data into one number on the revenue being driven by all digital campaigns, or […] focus on the success of expansion into new markets by highlighting site registrations from a specific geolocation or visitor segment.
Understand executive concerns
Figuring out organizational goals and objectives are requisite items to building relevant executive level reports. […] I find three items of high interest to the C-Suite:
- Competition — Understand how your organization’s performance stacks up to the marketplace and whether you’re ahead or behind.
- Voice of Customer — Bringing the voice of customer to life in your reporting through surveys or social media commentary adds an element of humanity that executives value […]
- Risk — Is there data that you are collecting that suggests negative market or product changes? Is negative social media or mass media having an impact on digital channel activity? These are among the risks that keep C-level execs up at night.
Understand the person
[…] You have an audience of one, and the research should reflect that. It’s more personal. Do you know the person you’re presenting to? What type of presentations they like and don’t? Do they want you to provide a point of view and recommendations? Or do they want to come up with the conclusion themselves?
You also want to know how they are going to use what you present for their own presentations. No matter where you are in the food chain of business, you have to figure out the best way to communicate “up.” […]
Where to go from here
Communicating data is a lot more complex than most organizations would like to think. As a “soft skill” it doesn’t get nearly the same attention as technology or data. If you want to have a successful analytics program, you will need to spend more time on this part of delivery and roll out than you are likely spending today.
The etcML (Easy Text Classification with Machine Learning) website lets users upload their own data and run its pre-trained machine learning classifiers against that data to tag the text by sentiment (positive, negative or neutral), by topic (such as politics, sports or business) or with the user’s own classifiers. In addition to uploading the data to be classified, users can upload pre-labeled training data and train a classifier to predict tags for the uploaded raw data.
This is truly an amazing system. A tutorial on how to upload your own data and training data sets as well as create and train your own classifier is also available.
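For readers who have not worked with a text classifier before, the train-then-tag workflow can be sketched in a few lines. This is not etcML's algorithm, which isn't described here. It is a small naive Bayes classifier swapped in purely for illustration, and the function names (`train`, `predict`) and the sample sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """Build word counts per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_texts:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Prior: how common this label was in the training data.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical pre-labeled training data, like what a user would upload.
training_data = [
    ("great product love it", "positive"),
    ("what a wonderful day", "positive"),
    ("terrible service hate it", "negative"),
    ("what an awful experience", "negative"),
]
model = train(training_data)
```

With the model trained, `predict(model, "love this wonderful product")` tags the new text as positive and `predict(model, "awful terrible service")` tags it as negative. A real service would use far more training data and more capable models, but the two steps are the same: learn from labeled examples, then tag unlabeled text.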
Daniel Gutierrez, Managing Editor of insideBigData reports:
Have you ever wondered whether a certain TV network has a specific political bias? Is your favorite news source fair and balanced? A group of Stanford computer scientists have created a website with the ability to answer such questions for free using machine learning technology.
The newly launched website is called etcML, short for Easy Text Classification with Machine Learning.
Machine learning is a field of computer science that gives computers the ability to acquire new understanding of data content in a more human-like way. The etcML website is based on machine-learning techniques that were developed to analyze the meaning embodied in text, then perform sentiment analysis – to gauge the text’s overall positive or negative sentiment.
Another bit of Big Data humor, courtesy of Daniel Gutierrez, Managing Editor of insideBigData. This time, the topic is the (ir)rational fear of statistics: