Why Graph Databases Are so Effective in Big Data Analytics
We live in an era of data. Information is everywhere and it can be accessed in different ways. Information is also collected in vast quantities. You can’t do much in the modern world without it being noted down and stored in a database.
Big data analytics and graph databases are buzzwords you’ve most likely encountered. It’s likely you’ve been told to start using graph databases in your big data analytics to boost your organizational efficiency.
But why? Let’s look at the concepts and the reasons graph databases are so effective in big data.
WHAT’S BIG DATA?
Unless you have been living under a rock, you will have heard the term big data thrown around. In fact, you’ve probably heard it mentioned in so many contexts and described in so many ways that it can be hard to know what the term actually means and why it matters.
So, I’ll try to explain the term concisely and let you in on the definite reasons it matters.
The definition of big data
If you search for the definition of big data on Google, you’ll receive well over 10 million results. The dictionary defines big data as:
“extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.”
Many fancy words, right? The definition might not quite open up the idea and purpose behind the concept. You could put it more simply and describe big data as a large collection of data, gathered from traditional and digital sources. The data can be collected within a specific organization and its different channels, or outside the organization. Furthermore, big data is not just collected, but used to discover new things and to analyze existing patterns and processes.
The key point to understand with big data is that the datasets collected are huge – you aren’t talking about a few phone numbers here, but vast amounts of different types of data. In addition, the data tends to be mostly in digital format, although you shouldn’t exclude traditional datasets. Financial records, for example, also count as part of big data. Furthermore, big data typically mixes multi-structured and unstructured data. What does that mean? Big data can use:
- Unstructured data, which is information that cannot be easily organized or interpreted by traditional databases and models.
- Multi-structured data, which is different types and formats of data, derived from the interactions between people and machines.
So, what does the above look like in practice? An example of big data would be how Wal-Mart collected data on its customers and the weather. By combining these different datasets and points of information, the company noticed that when storms are heading towards a location, customers buy more flashlights (understandable!) and Pop-Tarts (interesting and somewhat surprising).
Why does big data matter?
But what does the above mean for an organization? Why does it matter whether you collect and use big data? Well, the Wal-Mart example shows the two main reasons for utilizing big data:
- It reveals hidden information – You don’t need to know in advance that weather patterns and customer consumption of certain goods are linked. Big data helps reveal this information, so you don’t need to know what you are looking for in order to find a connection. Wal-Mart didn’t set out to find a specific food item whose sales increase before a storm, but it was able to find this connection because of big data analytics.
- It extracts value – The information you gain helps you better understand the connections between actions and behaviors. This, in turn, helps you extract more value, whether by making or saving more money or by improving efficiency. In the Wal-Mart example, the company could use the information to promote Pop-Tarts when storms are heading in or to make flashlights more easily accessible.
WHAT ARE GRAPH DATABASES?
But what about the other concept we are connecting with big data analytics? In order to understand the benefits of using graph databases in connection with big data, you need to understand what they are and why they matter.
The definition of a graph database
The definitions of a graph database also come in different complexities. The computing definition of the concept says a graph database is:
“a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data”.
If you are not a computer expert or used to technology jargon, the above probably went way over your head. Don’t worry, there’s a more down-to-earth way of looking at the concept. A graph, in this sense, is a way of representing information as things and the connections between them, and a database is simply a set of information grouped together. Graph databases have two defining elements:
- A node, which represents an entity. This can be a person, a place, a thing and so on.
- A relationship, which is the connection between two separate nodes.
Essentially then, graph databases are datasets that focus on the connections between different pieces of information and represent these connections in a simple, graphical manner.
You can think of it through an example such as Twitter, which can itself be seen as one huge graph. The users are the nodes, and the relationships between them are represented by ‘follows’. These relationships have a direction: node 1 could follow node 2 without node 2 following node 1, or nodes 1 and 2 could both follow each other, and so on. All of the different users (nodes) and the relationships they have with other nodes can be represented in a massive graph database.
Source: Visual.ly website, published by Neo Technology
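To make the Twitter example concrete, here is a minimal sketch of users as nodes and ‘follows’ as directed relationships. It is plain Python rather than a real graph database, and the class name `FollowGraph` and the user names are made up for illustration.

```python
from collections import defaultdict


class FollowGraph:
    """A tiny directed graph: users are nodes, 'follows' are relationships."""

    def __init__(self):
        # Maps each user to the set of users they follow.
        self._follows = defaultdict(set)

    def follow(self, follower, followee):
        # Add a directed relationship: follower -> followee.
        self._follows[follower].add(followee)

    def is_following(self, follower, followee):
        return followee in self._follows.get(follower, set())

    def is_mutual(self, user_a, user_b):
        # True only when the relationship exists in both directions.
        return self.is_following(user_a, user_b) and self.is_following(user_b, user_a)


graph = FollowGraph()
graph.follow("node1", "node2")  # node 1 follows node 2, but not the reverse
graph.follow("node2", "node3")
graph.follow("node3", "node2")  # nodes 2 and 3 follow each other

print(graph.is_following("node1", "node2"))  # True
print(graph.is_following("node2", "node1"))  # False
print(graph.is_mutual("node2", "node3"))     # True
```

The point of the sketch is that the relationship itself, including its direction, is stored as data, which is what lets you ask questions about connections directly.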
Why do graph databases matter?
But why does a graph database make analyzing and understanding information easier? What is the purpose of representing Twitter users and their relationships with a graph database? An organization can benefit from using graph databases in three different ways, with the database helping to:
- Boost performance – Every organization has data, and its datasets will continue to grow. As the datasets grow, so do the connections between them. Graph databases are specifically designed to store and traverse the relationships between different pieces of data, so query performance tends to hold up as the number of relationships grows.
- Provide flexibility – Graph databases are also flexible, as the database can change at the same speed as your organization changes. The structure of the model can be adapted to many different needs and requirements.
- Improve agility – The graph database also supports agility, which is crucial in a test-driven development environment. As your business requirements change, the database can change with them.
The interconnected world of today means different pieces of information are connected with each other in a number of unique ways. The use of graph databases means you don’t just understand the importance of the information and data, but also the relationships between them.
The acquired understanding of relationships can boost your organization in terms of efficiency and value creation – just as we saw with big data. Better information ultimately leads to better service and enhanced value – both for you and the customer.
WHY DO GRAPH DATABASES WORK IN BIG DATA ANALYTICS?
So, what do you get when you implement graph databases in big data analytics? An effective and powerful tool to create connections and utilize your data. But why is that?
Data analytics has traditionally relied on the Structured Query Language (SQL) to communicate with a database. It’s the language of relational database management systems, which are databases built around tables made up of rows and columns of attributes.
The communication between the different tables and rows can be slow and difficult when huge and irregular datasets are involved. Essentially, as data keeps growing and evolving, the traditional SQL model can become insufficient for understanding the relationships between these different datasets.
How are these issues being solved? Well, graph databases are one part of the solution. They belong to the so-called ‘Not Only SQL’ movement, or NoSQL. Instead of structuring data in the traditional table and row model, NoSQL allows the database design to be built around the requirements at hand. This can mean the data is structured and defined by:
- Key-value stores
- Graph databases
The graph database model focuses on the relationships between the different nodes, or data-points. So, instead of looking only at the value of a data-point (which is what a SQL database would do), the graph database organizes and analyzes the messy data-points according to their relationships. A graph database adds another layer of structuring and analyzing your data – increasing the effectiveness of your big data analytics. You simply open more doors for your organization.
But what is the importance of the node-relationship model in big data? Why is it so effective in adding to the way you analyze data? Quite simply, it makes interconnected data easier to understand. Instead of just understanding the value of specific data, you understand the value of the relationships between data. If you think of the example of Wal-Mart’s findings, a graph database would help you notice the relationship between the storm, the shopping decisions, and the customers who bought flashlights and Pop-Tarts.
An organization doesn’t just rely on isolated data when it comes to decision-making. If you want to increase sales in your bookshop, data on the books being sold is not enough. You need to understand how the customers connect to books – for example, which books tend to be bought by the same person and what the buyers of a specific book have in common. If you figure out these relationships, you can drive up sales much more easily. Perhaps you find a connection where people who read J.K. Rowling also tend to buy books by Terry Pratchett, and you can use the information in marketing or in the positioning of the books. You therefore enhance the way you interpret and use data. You don’t just focus on the specific value, but on the value of the relationship. For any organization, the relationships between data-points are important and will continue to grow in importance.
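The bookshop question, “which books tend to be bought by the same person?”, can be sketched in a few lines. The purchase data, the customer names and the helper `also_bought` below are invented for illustration; a real graph database would answer the same question with a query over customer-to-book relationships.

```python
from collections import Counter

# Made-up purchase data: each customer (a node) is linked to the books (nodes) they bought.
purchases = {
    "alice": {"Harry Potter", "Guards! Guards!", "Mort"},
    "bob": {"Harry Potter", "Mort"},
    "carol": {"Harry Potter", "Dune"},
    "dave": {"Dune"},
}


def also_bought(book, purchases):
    """Count how often other books were bought by customers who bought `book`."""
    counts = Counter()
    for books in purchases.values():
        if book in books:
            # Follow the relationships: book -> its buyers -> their other books.
            counts.update(books - {book})
    return counts


related = also_bought("Harry Potter", purchases)
print(related.most_common(1))  # [('Mort', 2)]
```

Here the interesting result is not any single sale but the relationship between sales: two of the three Harry Potter buyers also bought Mort.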
Another example of the above could be to understand why the transport of books from the warehouse takes a long time. With the help of a graph database, you can map the relationships between the warehouse, retailer, delivery company and the customer, and find which connections take the longest or whether you could get the product there faster by using different relationships, i.e. different delivery processes. You can solve the problems your business has in different ways because you are able to look at the data differently from the traditional model and find connections you might not notice with the SQL model. You end up creating more value for both the organization and the customer. You solve an issue that might prevent a customer from shopping with you again, and you create a more efficient service that could increase the value you are able to draw from the services you provide.
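As a rough illustration of the delivery example, the sketch below treats the warehouse, couriers, retailer and customer as nodes and the handovers between them as weighted relationships, then finds the fastest chain with Dijkstra’s algorithm. The network, the hours and the function name `fastest_route` are all assumptions made up for this example.

```python
import heapq

# Hypothetical delivery network: each relationship carries a handover time in hours.
delivery_network = {
    "warehouse": {"courier_a": 30, "courier_b": 12},
    "courier_a": {"customer": 6},
    "courier_b": {"retailer": 10},
    "retailer": {"customer": 8},
    "customer": {},
}


def fastest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total_hours, path), or (None, []) if unreachable."""
    queue = [(0, start, [start])]  # (hours so far, current node, path taken)
    visited = set()
    while queue:
        hours, node, path = heapq.heappop(queue)
        if node == goal:
            return hours, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, cost in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (hours + cost, neighbour, path + [neighbour]))
    return None, []


hours, path = fastest_route(delivery_network, "warehouse", "customer")
print(hours, path)  # 30 ['warehouse', 'courier_b', 'retailer', 'customer']
```

In this made-up network the direct courier looks simpler but takes 36 hours, while the route through the retailer takes 30, which is the kind of finding the paragraph above describes.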
Furthermore, the graph database model can be much more efficient at finding these data connections. A SQL database finds connections by comparing data-points in one table against data-points in another. Suppose you have data-point A and you want to find what is connected to it. In a traditional database system, A may have to be checked against B, C, D, E, and so on. In a graph database, on the other hand, the relationships are stored in their own right, so the connections of A can be followed directly. This cuts the processing time and gives you quicker access to information.
Ryan Boyd, head of developer relations North America for Neo4j, gave an example of the different technique and processing model in a TechRepublic interview. Boyd said,
“with a graph database, you find a logical starting point and you branch out from there and identify the relationships. For instance, you might write a query that asks, ‘Find all of the friends of the friends of John’. Instead of having to JOIN many different indexes, the graph database uses pointer arithmetic that is in-memory or in cache and performs the operation.”
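The ‘friends of the friends of John’ query can be sketched in plain Python to show the difference in approach. This is a simplification under stated assumptions: the names and data are made up, the scanning version stands in for a join without indexes (real relational databases do use indexes), and a Python dictionary lookup stands in for the direct node-to-node pointers Boyd describes.

```python
# Made-up friendship data, stored two ways.

# 1) Table style: one row per friendship, as in a relational join table.
friendship_rows = [
    ("John", "Mary"), ("John", "Ahmed"),
    ("Mary", "Li"), ("Ahmed", "Li"), ("Ahmed", "Sara"),
    ("Sara", "Tom"),
]

# 2) Graph style: each person points directly at their friends.
friends_of = {}
for a, b in friendship_rows:
    friends_of.setdefault(a, set()).add(b)
    friends_of.setdefault(b, set()).add(a)


def fof_by_scanning(rows, person):
    """Scan the whole table for every lookup, roughly like an unindexed self-join."""
    def friends(p):
        return {b for a, b in rows if a == p} | {a for a, b in rows if b == p}

    direct = friends(person)
    result = set()
    for friend in direct:
        result |= friends(friend)
    return result - direct - {person}


def fof_by_traversal(adjacency, person):
    """Start at one node and follow its relationships outwards for two hops."""
    direct = adjacency.get(person, set())
    result = set()
    for friend in direct:
        result |= adjacency.get(friend, set())
    return result - direct - {person}


print(sorted(fof_by_scanning(friendship_rows, "John")))  # ['Li', 'Sara']
print(sorted(fof_by_traversal(friends_of, "John")))      # ['Li', 'Sara']
```

Both functions give the same answer, but the scanning version touches every row for each person it looks up, while the traversal only touches John, his friends and their friends. That difference grows with the size of the dataset.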
An example of the effective use of graph databases in big data analysis in the light of the above is eBay and how it provides fast and efficient service to its customers. The shopping platform utilizes graph databases to connect buyers with local sellers, creating localized door-to-door delivery connections. The company has observed that queries powered by a graph database take around 1/50th of a second to run, while the traditional database queries took around 15 minutes. The example highlights how this efficiency saves the organization time and resources and provides better value for customers.
USING GRAPH DATABASES
Graph databases provide plenty of opportunities for organizations. The benefits discussed above have already been noted by a number of industries, including:
- Financial services – example uses include monitoring and preventing internal and external fraud and its risks.
- Retail – can be used for understanding purchase decisions and to provide recommendations to customers based on how different products link with each other.
- Logistics – an example in the industry would be the use of a graph database for planning routes.
- Networking and IT – performing root cause analysis to identify and understand the source of problems.
As you start implementing graph databases in your organization, you should be aware of a few things. First, graph database technologies differ in two key properties:
- Graph storage – Some storage options are specifically designed for storing and managing graphs, while others use relational or object-oriented databases. The latter options tend to be slower.
- Graph processing engine – Native, or graph-specific, processing is the most efficient way of processing data within a graph. Non-native processing engines use other means to handle operations such as creating, reading or deleting data.
Finding the right technology to use will depend on your specific business needs and requirements. There are quite a few different graph database technologies available, with the most widely used graph database being Neo4j. The open-source system is a native graph database, both in terms of storage and processing. Development of the database began in 2003, and it became publicly available in 2007.
The graph database is used by a number of organizations and companies around the world, representing a large number of industries. The system is used in scientific research, project management and matchmaking. Its users include established organizations such as Wal-Mart and Lufthansa, as well as start-ups like FiftyThree and CrunchBase.
|Neo4j in a nutshell|
|What is it?||An ACID-compliant transactional database, which doesn’t require a schema or data typing to operate.|
|What are the features?||Neo4j features include:|
|How is it available?||The database is an open-source project, licensed under the GNU General Public License (GPL) v3.0. You can also get a supported, commercial version of the graph database. It is provided by Neo Technology under the GNU AGPL v3.0 and commercial licensing.|
|What is it suited for?||Neo4j can be used for a number of different purposes. It’s an especially good graph database for social networking, the classification of specific data, and creating communities of interest or practice.|
You can also check out videos on YouTube on how to start using Neo4j and make the most out of it.
Other examples of graph database technologies and systems include:
- Titan, which is a backend-agnostic graph database.
- Stardog, a scalable, Java semantic graph database.
- InfiniteGraph, a commercial, cloud-enabled graph database.
- Apache Accumulo, a generic database that can also store graphs at scale.
With the help of graph databases, your organization could start answering more questions and getting more value out of the data you have at hand. The model is perfect for finding the answer to those abstract questions your organization might have. How do you get Thing A to go to Thing B? What can you recommend to someone who used Service X?
Using these databases is not that difficult, as the example of Max de Marzi shows. He built his own version of Facebook Graph Search over a single weekend by using a graph database (the system has since been shut down by Facebook). With the system he was able to find things like ‘who contacted me’, ‘who likes a specific type of post’, and so on.
THE BOTTOM LINE
Big data is unavoidable for any modern organization. Nor should you run away when the concept is mentioned, as big data analytics can help companies create better value for their customers and themselves. But relying on traditional data management and analytics tools is not enough, as data-points continue to develop and grow in the interconnected world. To make the most out of the interconnected nature of things, you should consider using graph databases.
These databases are based on relationships and their value, rather than on the value of the data itself. Graph databases are effective because they provide an alternative way of looking at the connections between data-points and of solving problems. They are fast because they are built to find these different relationships and to represent them in a simple, graphical model. So, in a world of interconnected data, understanding the value of data alone is not enough. You also need to look for the different ways things connect with each other.