Sap hana ml python apis sequential pattern mining algorithm spm

Image gallery for: Sap hana ml python apis sequential pattern mining algorithm spm

SAP HANA ML Python APIs: Sequential Pattern Mining Algorithm (SPM)

Welcome to HANA ML Python API for sequential pattern mining aka (SPM) method.I explained first four methods of association analysis in my previous blog post. Note: Make Sure your python environment with HANA ML is up and running ,if not please follow the steps mentioned in previous blog post. Introduction and Algorithm description: The sequential pattern mining algorithm (SPM) searches for frequent patterns in sequence databases. A sequence database consists of ordered elements or events. For example, a customer first buys bread, then eggs and cheese, and then milk. This forms a sequence consisting of three ordered events. We consider an event or a subsequent event is frequent if its support, which is the number of sequences that contain this event or sub-sequence, is greater than a certain value. This algorithm finds patterns in input sequences satisfying user defined minimum support. Understand Sequence Pattern Mining before going into practice : ◈ Find all subsets of items that occur with a specific sequence in all other transactions: e.g {Playing cricket -> high ECG -> Sweating}. ◈ Find all rules that correlate the order of one set of items after that another set of items in the transaction database: e.g 72% of users who perform a web search then make a long eye gaze over the ads follow that by a successful add-click. Prerequisites: ◈ The input data does not contain null value. ◈ There are no duplicated items in each transaction Let’s Start: Import Packages ##First, import packages needed in the data loading. from hana_ml import dataframe from data_load_utils import DataSets, Settings Setup Connection In our case, the data is loaded into a table called “PAL_APRIORI_TRANS_TBL” in HANA from a csv file “apriori_item_data.csv”. To do that, a connection to HANA is created and then passed to the data loader. To create a such connection, a config file, config/e2edata.ini is used to control the connection parameters. A sample section in the config file is shown below which includes HANA url, port, user and password information. *************************** [hana] url=host-url user=username passwd=userpassword port=3xx15 *************************** Maintain the login information in one config file & have it ready in your root folder url, port, user, pwd = Settings.load_config("../config/e2edata.ini") connection_context = dataframe.ConnectionContext(url, port,user,pwd) print(connection_context.connection.isconnected()) If connection is successful – “True” Data-Set: we will analyze the store data for frequent pattern mining ,this is the sample data which is available on SAP’s help webpage. Attribute Information: 1. CUSTID – Customer ID 2. TRANSID – Transaction ID 3. ITEMS – Item of Transaction Load Data for SPM: Check if table already exists in your schema Select * from PAL_SPM_DATA_TBL data_tbl = DataSets.load_spm_data(connection_context) The function DataSets.load_spm_data() is used to decide load or reload the data from scratch. If it is the first time to load data, an example of return message is shown below: ERROR:hana_ml.dataframe:Failed to get row count for the current Dataframe, (259, ‘invalid table name: Could not find table/view PAL_SPM_DATA_TBL in schema DM_PAL: line 1 col 37 (at pos 36)’) Table PAL_SPM_DATA_TBL doesn’t exist in schema DM_PAL Creating table PAL_SPM_DATA_TBL in schema DM_PAL …. Drop unsuccessful Creating table DM_PAL.PAL_SPM_DATA_TBL Data Loaded:100% ##################### if data is already loaded into HANA: data_tbl = DataSets.load_spm_data(connection_context) print("Table Name is: " +str(data_tbl)) import pandas as pd Table PAL_SPM_DATA_TBL exists and data exists Table Name is: PAL_SPM_DATA_TBL Create dataframes using Pandas Dataframes for data load from SAP HANA ##Create a dataframe df from PAL_SPM_TRANS_TBL for the following steps. df_spm = pd.DataFrame(columns=['CUSTID' , 'TRANSID' , 'ITEMS']) df_spm = dataframe.create_dataframe_from_pandas(connection_context=connection_context, pandas_df=df_spm, table_name=data_tbl, force=False, replace=True) Re-execute the below command and check what message it returns data_tbl = DataSets.load_spm_data(connection_context) print("Table Name is: " +str(data_tbl)) Creating table DM_PAL.PAL_SPM_DATA_TBL Data Loaded:100% Table Name is: PAL_SPM_DATA_TBL df = df_spm Display dataframe records df.collect().head(100) ##Display Data df.dropna() ##Drop NAN if any of the blank record is present in your dataset print("Toal Number of Records : " + str(df.count())) print("Columns:") df.columns Data Operations: Get insights of dataset using the following few operations , you can play more by applying few more functions to get more insights. Filter Data: Fetch all customers who has CUSTID = ‘A’ df.filter("CUSTID = 'A'").head(10).collect() Fetch all customers having transaction id = 1 df.filter('TRANSID = 1').head(10).collect() Fetch all customers & transaction having item id = ‘Apple’ df.filter("ITEMS = 'Apple'").head(10).collect() Group by on Data: GroupBy ITEMS df.agg([('count' , 'ITEMS' , 'TOTAL TRANSACTIONS')] , group_by='ITEMS').head(100).collect() GroupBy CUSTID df.agg([('count' , 'CUSTID', 'TOTAL TRANSACTIONS')] , group_by='CUSTID').head(100).collect() GroupBy TRANSID df.agg([('count' , 'TRANSID', 'TOTAL TRANSACTIONS')] , group_by='TRANSID').head(100).collect() Display the most popular items: Import matplotlib library to display the plot having the most popular items in our dataset import matplotlib.pyplot as plt from wordcloud import WordCloud plt.rcParams['figure.figsize'] = (10, 10) wordcloud = WordCloud(background_color = 'white', width = 500, height = 500, max_words = 120).generate(str(df_spm.head(100).collect())) plt.imshow(wordcloud) plt.axis('off') plt.title('Most Popular Items',fontsize = 10) plt.show() The most popular items from plot are Apple & Blueberry and if we try to find the count of transactions for both of them : Count of Apple – df.filter(“ITEMS = ‘Apple'”).head(100).count() = 10 Count of Blueberry – df.filter(“ITEMS = ‘Blueberry'”).head(100).count() = 10 Import SPM Method from HANA ML Library: from hana_ml.algorithms.pal.association import SPM Setup SPM instance: sp = SPM(conn_context=connection_context, min_support=0.5, relational=False, ubiquitous=1.0, max_len=10, min_len=1, calc_lift=True) Method fit(data, customer=None, transaction=None, item=None, item_restrict=None, min_gap=None) Sequential pattern mining from input data.Execute the method Training of model with dataset. sp.fit(data=df_spm, customer='CUSTID', transaction='TRANSID', item='ITEMS') Collect the result sp.result_.collect() Result Analysis: ◈ Item-set Apple has support 1.0 indicates the frequency of the item in all the transactions , most frequent item – confidence & lift is 0 for all the single items which states there is no antecedent & consequent item of them ◈ Consider (Apple , Blueberry): Support is .88 (Frequency of these items together is 88%) , Confidence is 88% means if someone is buying Apple then 88% chances they will also have blueberry in their bucket , lift is .89 close to 1 indicates high Association among items purchase. ◈ Result displays most of the items are having support , Confidence & Lift more than 70% indicates the mining of highly frequent items & can be considered strong association rules.
Advertisement
HTML input types

math&it
programación
Developer
computer geek
Advertisement
Microsoft
Страницы планировщика
codigo
programação
SQL Interview Cheat Sheet

data analytics
Remove surname in excel 💯

work
Work and School
Advertisement
Advertisement
Advertisement
Python and SQL