Big Data Analysis for Customer Behaviour

Big data is a field that treats ways to analyse, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data- processing application software

In this technological era of large scale data, businesses need to rethink on the modern approaches to better understand the customers to gain a competitive edge in the market. Data is worthless if it cannot be analysed, interpreted and applied in context . In this work, we have used the Walmarts sales data to create business value by understanding customer intent (sentiment analysis) and business analytics. A picture speaks a thousand words and business analytics would help paint a picture through visualization of data to give the retailers insights on their business. With these insights the businesses can make relevant changes to their strategy for the future to maximize profits and success.

The big data application enables retailers to use historical dataset to better observe the supply chain, then a clear picture can be obtained about a particular store whether they are making profit or are under loss. When data is properly analysed, we will start to see the patterns, insights and the big picture of the company. Then the required suitable actions can be applied accordingly. This will help optimize operations and maximize sales and profit.

Hardware and Software

The tools and techniques used for this work includes the collection of Huge Walmart sales datasets stored in CSV format. This paper Apache Spark with a build version of Hadoop leveraging HDFS as a data storage option. Apache Spark is a framework capable of handling both batch and stream processing on the same application at the same time. Our development tools include InteliJ Idea Community Edition and iPython Notebook. InteliJ Idea was integrated with Spark instead of using the traditional Spark shell.

After we configured our environment, our first task was to load the files as spark dataframes. Dataframe is a distributed collection of data organized into named columns which is equivalent to tables in RDMS . The spark dataframe API was designed to make big data processing simple for a wider audience and also it supports distributed data processing in general purpose programing languages like Scala, Python and Java.


Their strategy included the collection of huge Sales data and transferred on HDFS and performed Map Reduce which later due to enormous data size, proved difficult to draw conclusion. Thus Hive processing was done to calculate average sales feature for all 45 stores and 99 departments. Machine learning algorithm, R programming was used for statistic computing. Henceforth, Holt Winters was used for training dataset provided by Walmart and then sales prediction was done. Subsequently the predicted sales were given graphical representation using Tableau interactive


Wal-Mart is the number one retailer in the USA and it also operates in many other countries all around the world and is moving into new countries as years pass by. There, are other companies who are constantly rising as well and would give Walmart a tough competition in the future if Walmart does not stay to the top of their game.

In order to do so, they will need to understand their business trends, the customer needs and manage the resources wisely. In this era when the technologies are reaching out to new levels, Big Data is taking over the traditional method of managing and analyzing data. These technologies are constantly used to understand complex datasets in a matter of time with beautiful visual representations.