Data Analysis-101(WNS)

This analysis is for finding a click event of the user, the details are as :

Zbay is an E-commerce website which sells a variety of products at its online platform. Zbay records user behaviour of its customers and stores it as a log. However, most of the times, users do not buy the products instantly and there is a time gap during which the customer might surf the internet and maybe visit competitor websites.

Now, to improve sales of products, Zbay has hired Adiza, an Adtech company which built a system such that ads are being shown for Zbay’s products on its partner websites.

If a user comes to Zbay’s website and searches for a product, and then visits these partner websites or apps, his/her previously viewed items or their similar items are shown on as an ad. If the user clicks this ad, he/she will be redirected to the Zbay’s website and might buy the product.

Load the basic libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder

The Train and Test samples look like:

dftrain=pd.read_csv('train.csv')
dftrainlog=pd.read_csv('view_log.csv')
dftrainitem=pd.read_csv('item_data.csv')
dftest=pd.read_csv('test.csv')
dftrain.head(3)

Output

dftrain.shape ————–>>>> (237609, 7)

impression_idimpression_timeuser_idapp_codeos_versionis_4Gis_click
0c4ca4238a0b923820dcc509a6f75849b2018-11-15 00:00:0087862422old00
145c48cce2e2d7fbdea1afc51c7c6ad262018-11-15 00:01:0063410467latest11
270efdf2ec9b086079795c442636b55fb2018-11-15 00:02:0071748259intermediate10

dftrainlog.head(2)

This table is having user-item mapping and its many to many.

server_timedevice_typesession_iduser_iditem_id
02018-10-15 08:58:00android112333455732970
12018-10-15 08:58:00android503590747887640

dftrainitem.head(3)

This table is having item price mapping and its one to one

item_iditem_pricecategory_1category_2category_3product_type
02688046021135203040
15493935131257856822
2403838251782791619

The data volume is too high and we have to check the ratio of click and not clicked, and then make it in a balance like this,

df1=(dftrain[dftrain['is_click']==1])
df123=(dftrain[dftrain['is_click']==0])
df2=df123.iloc[:12000,:]
frames=[df1,df2]
dftrain=pd.concat(frames)
dftrain.shape ----->>>> Output
(22862, 7)
          Total numer of rows      Ratio
is_click                                
0                       12000  52.488846
1                       10862  47.511154

Code for generating a pie chart:

def ShowPieChart(DataFrame,column1,column2):
proportion=DataFrame.groupby(column1).count()[column2]
col1=(((proportion)).sort_values(axis=0,ascending=False))
total=proportion.sum()
lstProportion=proportion/total
col2=(((lstProportion*100)).sort_values(axis=0,ascending=False))
plt.pie(lstProportion, autopct='%1.1f%%', shadow=False, startangle=140)
plt.axis('equal')
plt.show()
dftemp=pd.DataFrame({'Total numer of rows':col1,'Ratio':col2})
print(dftemp)
Now, we have to merge all the three tables to give a better meaning,before this we need to check and convert the data types into same data types. for this use following code:

For Table1, ie, dftrain

convert_dict = {'impression_id': str, 
'impression_time':str,
'user_id':str,
'app_code':str,
'os_version':str,
'is_4G':str,
'is_click':str
}

dftrain = dftrain.astype(convert_dict)

for table 2, i.e dftrainlog,

convert_dict = {'server_time': str, 
'device_type':str,
'session_id':str,
'user_id':str,
'item_id':str
}
dftrainlog = dftrainlog.astype(convert_dict)

for table 3, dftrainitem

convert_dict = {'item_id': str,
'item_price':str,
'category_1':int,
'category_2':int,
'category_3':int,
'product_type':int
}

dftrainitem = dftrainitem.astype(convert_dict)

Now, let’s check what we can do with a grouping of tables and check how many items users check.

dftrainloggroup=dftrainlog.groupby('user_id')['item_id'].agg([('count', 'count'), ('item_id', ', '.join)])

dftrainloggroup.reset_index('user_id',inplace=True)

dftrainloggroup.head(3)
user_idcountitem_id
004244459, 44459, 44459, 44459, 44459, 44459, 4022…
11864835, 27574, 22411, 91179, 126431, 67259, 409…
2109288834, 8241, 59255, 82124, 98271, 26950, 11329…

Now Merge the tables,

dftrainMerged=pd.merge(dftrain,dftrainloggroup[['user_id','count','item_id']],on='user_id',how='left')

dftrainMerged.shape

dftrainMerged.head(4)
impression_idimpression_timeuser_idapp_codeos_versionis_4Gis_clickcountitem_id
045c48cce2e2d7fbdea1afc51c7c6ad262018-11-15 00:01:0063410467latest111243209, 66370, 43209, 43209, 71877, 43209, 4320…
143ec517d68b6edd3015b3edc9a11367b2018-11-15 00:06:0074339481old0161100847, 117590, 6131, 78766, 129770, 13924, 11…
242998cf32d552343bc8e460416382dca2018-11-15 00:42:0081926249latest0113884085, 44104, 18653, 44104, 131433, 4907, 7615…
3a9a1d5317a33ae8cef33961c34144f842018-11-15 00:57:008779242latest112949782, 49782, 45958, 90576, 90576, 107829, 991…

Now, calculate the price of each item and minimum,max and average price that the user checks, for this we have used a function.

def getPriceItem(items):
print(items)
item_prices=[]
exception_item=[]
for item in items:
try:
price=int(dftrainitem[dftrainitem['item_id'] == item]['item_price'])
item_prices.append(price)
except:
item_prices.append(0)
print("Exception occured at ",item)
exception_item.append(item)
continue
print('item_prices',item_prices)
avg=np.mean(item_prices)
minimum=min(item_prices)
print('minimum',minimum)
maximum=max(item_prices)
return avg,minimum,maximum
lstavg=[]
lstminimum=[]
lstmaximum=[]
counter=0
for index,rows in dftrainMerged.iloc[:,:].iterrows():
arrSplit=rows['item_id'].split(',')
items=[str(i.strip()) for i in arrSplit]
avg,minimum,maximum=getPriceItem(items)
lstavg.append(avg)
lstminimum.append(minimum)
lstmaximum.append(maximum)
print(counter)
counter+=1

Then add the columns into the table,

dftrainMerged['AverageCost']=lstavg
dftrainMerged['MinCost']=lstminimum
dftrainMerged['MaxCost']=lstmaximum

Now, make a check Point and save the file,

dftrainMerged.to_csv('Merged.csv')

Now, load the file again and process the things further:

dftrain=pd.read_csv('Merged.csv')

Now, check the time of day click happened and day of month…

arrTimeOfDay=[]
for i in dftrainMerged['impression_time']:
tim=i
hour=int((tim[11:]).split(':')[0])
print(hour)
if(hour<6):
arrTimeOfDay.append(1)
elif(hour<12):
arrTimeOfDay.append(2)
elif(hour<18):
arrTimeOfDay.append(3)
else:
arrTimeOfDay.append(4)
dftrain['timeofDay']=arrTimeOfDay
arrDay=[]
for i in dftrainMerged['impression_time']:
tim=i
day=int((tim[8:10]).split(':')[0])
print(day)
arrDay.append(day)

dftrain['day']=arrDay

Now, check the data,

plt.scatter(dftrain['AverageCost'],dftrain['is_click'])

plt.scatter(dftrain[‘day’],dftrain[‘is_click’])

We can see that between 15 and 20 the click not happened..

We can also add the potential user, who checks the website gaian and again..

arrpotentialuser=dftrain[dftrain['user_id'].duplicated()]['user_id']

lstpotential=[]
for i,j in dftrain.iterrows():
#print(user)
if(j['user_id'] in arrpotentialuser):
lstpotential.append(1)
else:
lstpotential.append(0)
dftrain['potentialUser']=lstpotential
potuser = pd.crosstab(dftrain['potentialUser'], dftrain['is_click'])
print(potuser)

potuser.plot(kind='bar', stacked=True, color=['green','yellow'], grid=False)
temp3 = pd.crosstab([dftrain['timeofDay'],dftrain['day'],dftrain["potentialUser"]],dftrain['is_click'])
temp3.plot(kind="bar",stacked=True)
plt.show()

convert required columns to numeric using label encoder,

var_mod = ['os_version']
le = LabelEncoder()
for i in var_mod:
dftrain[i] = le.fit_transform(dftrain[i])
dftrain.dtypes
#Rearrange the column positions
cols=['impression_id', 'impression_time', 'user_id', 'app_code','os_version', 'is_4G', 'count', 'item_id', 'AverageCost','MinCost', 'MaxCost', 'timeofDay', 'day', 'potentialUser','is_click']

dftrain=dftrain[cols]

print("===============================Correlation of categorical values=====================")
plt.figure(figsize=[10,10])
corr=dftrain.corr()
sns.heatmap(corr, annot=True, fmt=".2f",cmap="YlGnBu")
plt.show()
=========Correlation of categorical values============

Comments 1

  • Hi there! This post couldn’t be written any better! Reading through this post reminds me of my previous room mate! He always kept talking about this. I will forward this article to him. Pretty sure he will have a good read. Thank you for sharing!

Leave a Reply

Your email address will not be published. Required fields are marked *