What is Big Data?
In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
As of 2012, limits on the size of data sets that were feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 quintillion (2.5×10^18) bytes of data were created every day. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.
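As a quick sanity check on these figures (the daily byte count is quoted from the text above; the annualized growth rate is derived from the 40-month doubling, not quoted):

```python
# Per-capita storage capacity roughly doubles every 40 months.
# Annualized growth factor: 2 ** (12 / 40), i.e. roughly 23% per year.
annual_growth = 2 ** (12 / 40)

# Daily data creation cited for 2012: 2.5 quintillion bytes = 2.5 exabytes.
bytes_per_day_2012 = 2.5e18
exabytes_per_day = bytes_per_day_2012 / 1e18

print(round(annual_growth, 2))  # ~1.23
print(exabytes_per_day)         # 2.5
```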
Big data is difficult to work with using relational databases and desktop statistics and visualization packages, requiring instead “massively parallel software running on tens, hundreds, or even thousands of servers”. What is considered “big data” varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. In response to this difficulty, a new platform of “big data” tools has arisen to handle sensemaking over large quantities of data, as in the Apache Hadoop Big Data Platform.
In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this “3Vs” model for describing big data. In 2012, Gartner updated its definition as follows: “Big data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
Big data is a popular term used to describe the exponential growth, availability and use of information, both structured and unstructured. Much has been written on the big data trend and how it can serve as the basis for innovation, differentiation and growth.
According to IDC, it is imperative that organizations and IT leaders focus on the ever-increasing volume, variety and velocity of information that forms big data.
- Volume. Many factors contribute to the increase in data volume – transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, etc. In the past, excessive data volume created a storage issue. But with today’s decreasing storage costs, other issues emerge, including how to determine relevance amidst the large volumes of data and how to create value from data that is relevant.
- Variety. Data today comes in all types of formats – from traditional databases to hierarchical data stores created by end users and OLAP systems, to text documents, email, meter-collected data, video, audio, stock ticker data and financial transactions. By some estimates, 80 percent of an organization’s data is not numeric! But it still must be included in analyses and decision making.
- Velocity. According to Gartner, velocity “means both how fast data is being produced and how fast the data must be processed to meet demand.” RFID tags and smart metering are driving an increasing need to deal with torrents of data in near-real time. Reacting quickly enough to deal with velocity is a challenge to most organizations.
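To make the velocity point concrete, here is a minimal sketch of a sliding-window throughput monitor of the kind a near-real-time pipeline might use to track how fast data is arriving; the class and its interface are illustrative, not taken from any particular product:

```python
from collections import deque
import time

class ThroughputMonitor:
    """Track how many events arrived in the last `window` seconds."""

    def __init__(self, window=1.0):
        self.window = window
        self.events = deque()  # timestamps of recent events

    def record(self, now=None):
        now = time.monotonic() if now is None else now
        self.events.append(now)
        self._evict(now)

    def rate(self, now=None):
        """Events per second over the trailing window."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        return len(self.events) / self.window

    def _evict(self, now):
        # Drop timestamps that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()

mon = ThroughputMonitor(window=1.0)
for t in [0.0, 0.2, 0.4, 0.9, 1.05]:
    mon.record(now=t)
print(mon.rate(now=1.1))  # 4.0 -- the event at t=0.0 has aged out
```

A real system would act on this rate, e.g. shedding load or scaling out when the torrent exceeds what downstream processing can absorb.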
Consider two other dimensions when thinking about big data:
- Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something big trending in the social media? Perhaps there is a high-profile IPO looming. Maybe swimming with pigs in the Bahamas is suddenly the must-do vacation activity. Daily, seasonal and event-triggered peak data loads can be challenging to manage – especially with social media involved.
- Complexity. When you deal with huge volumes of data, it comes from multiple sources. It is quite an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control. Data governance can help you determine how disparate data relates to common definitions and how to systematically integrate structured and unstructured data assets to produce high-quality information that is useful, appropriate and up-to-date.
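As a toy illustration of the link-match-cleanse step described above (the field names and the email-based matching rule here are hypothetical):

```python
def normalize(record):
    """Cleanse a customer record into a comparable form."""
    return {
        "name": record["name"].strip().lower(),
        "email": record["email"].strip().lower(),
    }

def link(records_a, records_b):
    """Match records across two systems on normalized email address."""
    index = {normalize(r)["email"]: r for r in records_a}
    matches = []
    for r in records_b:
        key = normalize(r)["email"]
        if key in index:
            matches.append((index[key], r))
    return matches

# The same customer, recorded inconsistently in two source systems.
crm = [{"name": "Ada Lovelace ", "email": "ADA@example.com"}]
billing = [{"name": "A. Lovelace", "email": " ada@example.com "}]
print(len(link(crm, billing)))  # 1 matched pair
```

Real record linkage must also handle fuzzy matches, hierarchies and conflicting values, which is where formal data governance earns its keep.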
Examples of big data
- RFID (radio frequency ID) systems generate up to 1,000 times the data of conventional bar code systems.
- 10,000 payment card transactions are made every second around the world.
- Walmart handles more than 1 million customer transactions an hour.
- 340 million tweets are sent per day. That’s nearly 4,000 tweets per second.
- Facebook has more than 901 million active users generating social interaction data.
- More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones.
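The tweet figure above is straightforward to verify:

```python
tweets_per_day = 340_000_000
seconds_per_day = 24 * 60 * 60  # 86,400
tweets_per_second = tweets_per_day / seconds_per_day
print(round(tweets_per_second))  # 3935, i.e. "nearly 4,000"
```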
Uses for big data
So the real issue is not that you are acquiring large amounts of data (because we are clearly already in the era of big data). It’s what you do with your big data that matters. The hopeful vision for big data is that organizations will be able to harness relevant data and use it to make the best decisions.
Technologies today not only support the collection and storage of large amounts of data, they provide the ability to understand and take advantage of its full value, which helps organizations run more efficiently and profitably. For instance, with big data and big data analytics, it is possible to:
- Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.
- Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
- Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
- Quickly identify customers who matter the most.
- Generate retail coupons at the point of sale based on the customer’s current and past purchases, ensuring a higher redemption rate.
- Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
- Analyze data from social media to detect new market trends and changes in demand.
- Use clickstream analysis and data mining to detect fraudulent behavior.
- Determine root causes of failures, issues and defects by investigating user sessions, network logs and machine sensors.
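As one sketch of the clickstream fraud-detection idea in the list above, a simple z-score over per-session click counts can flag outlying sessions; this is a deliberately minimal stand-in, and real fraud models are far more sophisticated:

```python
import statistics

def anomaly_scores(events_per_session):
    """Score each session by how far its click count deviates from the norm."""
    mean = statistics.mean(events_per_session)
    stdev = statistics.stdev(events_per_session)
    return [(x - mean) / stdev for x in events_per_session]

sessions = [12, 15, 11, 14, 13, 240]  # the last session clicks implausibly fast
scores = anomaly_scores(sessions)
suspicious = [i for i, z in enumerate(scores) if z > 2]
print(suspicious)  # [5]
```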
High-performance analytics, coupled with the ability to score every record and feed it into the system electronically, can identify fraud faster and more accurately.
Many organizations are concerned that the amount of amassed data is becoming so large that it is difficult to find the most valuable pieces of information.
- What if your data volume gets so large and varied you don’t know how to deal with it?
- Do you store all your data?
- Do you analyze it all?
- How can you find out which data points are really important?
- How can you use it to your best advantage?
Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. What is the point of collecting and storing terabytes of data if you can’t analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data.
You now have two choices:
- Incorporate massive data volumes in analysis. If the answers you are seeking will be better provided by analyzing all of your data, go for it. The game-changing technologies that extract true value from big data – all of it – are here today. One approach is to apply high-performance analytics to analyze the massive amounts of data using technologies such as grid computing, in-database processing and in-memory analytics.
- Determine upfront which big data is relevant. Traditionally, the trend has been to store everything (some call it data hoarding) and only when you query the data do you discover what is relevant. We now have the ability to apply analytics on the front end to determine data relevance based on context. This analysis can be used to determine which data should be included in analytical processes and which can be placed in low-cost storage for later availability if needed.
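A minimal sketch of the front-end relevance triage described in the second option, assuming hypothetical record fields and a context rule:

```python
def relevant(record, context):
    """Hypothetical relevance test: keep records matching the analysis context."""
    return (record["region"] == context["region"]
            and record["amount"] >= context["min_amount"])

def triage(records, context):
    """Route records to fast analytics (hot) or low-cost storage (cold)."""
    hot, cold = [], []
    for r in records:
        (hot if relevant(r, context) else cold).append(r)
    return hot, cold

records = [
    {"region": "EU", "amount": 120.0},
    {"region": "US", "amount": 80.0},
    {"region": "EU", "amount": 5.0},
]
hot, cold = triage(records, {"region": "EU", "min_amount": 50.0})
print(len(hot), len(cold))  # 1 2
```

The cold records are not discarded; they are parked cheaply and remain available if a later question makes them relevant.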
Now you can run hundreds and thousands of models at the product level – at the SKU level – because you have the big data and analytics to support those models at that level.
A number of recent technology advancements are enabling organizations to make the most of big data and big data analytics:
- Cheap, abundant storage and server processing capacity.
- Faster processors.
- Affordable large-memory capabilities and open source, distributed big data platforms, such as Hadoop.
- New storage and processing technologies designed specifically for large data volumes, including unstructured data.
- Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughputs.
- Cloud computing and other flexible resource allocation arrangements.
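A miniature of the parallel-processing idea above, using Python’s standard multiprocessing pool to aggregate data partitions concurrently; it is a toy stand-in for a real grid or MPP environment, where the partitions would live on separate machines:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Work unit: aggregate one partition of the data."""
    return sum(chunk)

def parallel_total(data, workers=4):
    """Split the data into chunks and aggregate them in parallel."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    data = list(range(1_000_000))
    print(parallel_total(data))  # 499999500000
```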
Big data technologies not only support the ability to collect large amounts of data, they provide the ability to understand it and take advantage of its value. The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for optimized decision making.
It is very important to understand that not all of your data will be relevant or useful. But how can you find the data points that matter most? It is a problem that is widely acknowledged. “Most businesses have made slow progress in extracting value from big data. And some companies attempt to use traditional data management practices on big data, only to learn that the old rules no longer apply,” says Dan Briody in a 2011 Economist Intelligence Unit report.
Role of analytics in big data
High-performance analytics from SAS enables you to tackle complex problems using big data and provides the timely insights needed to make decisions in an ever-shrinking processing window. Successful organizations can’t wait days or weeks to look at what’s next. Decisions need to be made in minutes or hours, not days or weeks. High-performance analytics also makes it possible to analyze all available data (not just a subset of it) to get precise answers for hard-to-solve problems, uncover new growth opportunities and manage unknown risks – all while using IT resources more effectively.
Whether you need to analyze millions of SKUs to determine optimal price points, recalculate entire risk portfolios in minutes, identify well-defined segments to pursue customers that matter most or make targeted offers to customers in near-real time, high-performance analytics from SAS forms the backbone of your analytic endeavors.
To ensure that you have the right combination of high-performance technologies to meet the demands of your business, we offer several processing options. These options enable you to make the best use of your IT resources while achieving performance gains you never would have thought possible.
Accelerated processing of huge data sets is made possible by four primary technologies:
• Grid computing. A centrally managed grid infrastructure provides dynamic workload balancing, high availability and parallel processing for data management, analytics and reporting. Multiple applications and users can share a grid environment for efficient use of hardware capacity and faster performance, while IT can incrementally add resources as needed.
• In-database processing. Moving relevant data management, analytics and reporting tasks to where the data resides improves speed to insight, reduces data movement and promotes better data governance. Using the scalable architecture offered by third-party databases, in-database processing reduces the time needed to prepare data and build, deploy and update analytical models.
• In-memory analytics. Quickly solve complex problems using big data and sophisticated analytics in an unfettered manner. Use concurrent, in-memory, multiuse access to data and rapidly run new scenarios or complex analytical computations. Instantly explore and visualize data. Quickly create and deploy analytical models. Solve dedicated, industry-specific business challenges by processing detailed data in-memory within a distributed environment, rather than on disk.
• Support for Hadoop. You can bring the power of SAS Analytics to the Hadoop framework (which stores and processes large volumes of data on commodity hardware). SAS provides seamless and transparent data access to Hadoop as just another data source, where Hive-based tables appear native to SAS. You can develop data management processes or analytics using SAS tools – while optimizing run-time execution using Hadoop distributed processing capability or SAS environments. With SAS Information Management, you can effectively manage data and processing in the Hadoop environment.
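The Hadoop framework mentioned above popularized the map-shuffle-reduce pattern for processing large data sets on commodity hardware. Here is a generic single-process sketch of that pattern (plain Python for illustration, not SAS or Hadoop code):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit (word, 1) pairs for one input split."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts per key, as Hadoop does after the shuffle."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big insight", "data in data out"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(pairs)
print(counts)  # {'big': 2, 'data': 3, 'insight': 1, 'in': 1, 'out': 1}
```

In a real cluster the map calls run in parallel across data nodes and the framework shuffles intermediate pairs to reducers; the logic per record is the same.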
In addition, a new product from SAS provides a Web-based solution that leverages SAS high-performance analytics technologies to explore huge volumes of data in mere seconds. Using SAS Visual Analytics, you can very quickly see correlations and patterns in big data, identify opportunities for further analysis and easily publish reports and information to an iPad®. Because it’s not just the fact that you have big data, it’s what you can do with the data to improve decision making that will result in organizational gains. SAS can cut through the complexities of big data and identify the most valuable insights so decision makers can solve complex problems faster than ever before.
High-performance analytics from SAS is optimized to address new business requirements and overcome technical constraints. In addition, SAS is leading the way in empowering organizations to transform their structured and unstructured data assets into business value using multiple deployment options.