Actian Analytics Platform – Express Hadoop SQL Edition 2.0 Tutorial
AH-2-TU-05
This Documentation is for the end user's informational purposes only and may be subject to change or withdrawal by Actian Corporation ("Actian") at any time. This Documentation is the proprietary information of Actian and is protected by the copyright laws of the United States and international treaties. It is not distributed under a GPL license. You may make printed or electronic copies of this Documentation provided that such copies are for your own internal use and all Actian copyright notices and legends are affixed to each reproduced copy. You may publish or distribute this document, in whole or in part, so long as the document remains unchanged and is disseminated with the applicable Actian software. Any such publication or distribution must be in the same manner and medium as that used by Actian, e.g., electronic download via website with the software or on a CDROM. Any other use, such as any dissemination of printed copies or use of this documentation, in whole or in part, in another publication, requires the prior written consent from an authorized representative of Actian. To the extent permitted by applicable law, ACTIAN PROVIDES THIS DOCUMENTATION "AS IS" WITHOUT WARRANTY OF ANY KIND, INCLUDING WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT. IN NO EVENT WILL ACTIAN BE LIABLE TO THE END USER OR ANY THIRD PARTY FOR ANY LOSS OR DAMAGE, DIRECT OR INDIRECT, FROM THE USER OF THIS DOCUMENTATION, INCLUDING WITHOUT LIMITATION, LOST PROFITS, BUSINESS INTERRUPTION, GOODWILL, OR LOST DATA, EVEN IF ACTIAN IS EXPRESSLY ADVISED OF SUCH LOSS OR DAMAGE. The manufacturer of this Documentation is Actian Corporation. For government users, the Documentation is delivered with "Restricted Rights" as set forth in 48 C.F.R. Section 12.212, 48 C.F.R. Sections 52.227-19(c)(1) and (2) or DFARS Section 252.227-7013 or applicable successor provisions. Copyright © 2014 Actian Corporation. All Rights Reserved. Actian, Actian Analytics Platform, Actian DataFlow, Actian Analytics Database – Vector Edition, Actian Director, Cloud Action Platform, Cloud Action Server, Action Server, Ingres, Vectorwise, OpenROAD, Enterprise Access, and EDBC are trademarks or registered trademarks of Actian Corporation. All other trademarks, trade names, service marks, and logos referenced herein belong to their respective companies.
Contents

Setup 7
    Components of Actian Analytics Platform - Express Hadoop SQL Edition 7
        Actian DataFlow (KNIME) 7
        Actian Vector - Hadoop Edition 8
        Actian Director 8
    What You Will Learn 8
    Requirements 9
    Access the Actian Vector Instance 9
    Sample Database 10
    Data Files Used in the Tutorial 10
    How to Run SQL in this Tutorial 10
    Create Tables in the Sample Database 13

Lesson 1—Create a Simple ETL Workflow 15
    Workflows 15
    The Sales Data Input File 16
    sales_fact Table 17
    Lesson 1 Step 1: Create a New Workflow Project 18
    Lesson 1 Step 2: Add a Text Reader Node 18
        Add a Text Reader Node 19
        Test the Node 20
        Reset Node Status 21
    Lesson 1 Step 3: Add a Load Actian Vector On Hadoop Node 21
        Add a Load Actian Vector On Hadoop Node 22
        Configure the Load Actian Vector On Hadoop Node 23
    Lesson 1 Step 4: Run the Workflow 24
        Execute the Workflow 24
        Save the Workflow 24
        Optimize the Database 25
    Lesson 1 Step 5: Query the Data 25
        Verify Data Loaded into sales_fact Table 26
        Determine the Profit at Each Store for Each Product 26

Lesson 2—Load Multiple Data Files 29
    Lesson 2 Step 1: Use Wildcards in the Text Reader Node 29
        Open the ETL Workflow Project 30
        Configure Text Reader to Load Multiple Files 30
        Run the Amended Workflow 31
    Lesson 2 Step 2: Query the Data 31
        Verify Data Loaded into sales_fact Table 31
        Determine the Profit at Each Store for Each Product 32

Lesson 3—Derive a Field and Add Data Lookup 33
    sales_city_fact Table 34
    Lesson 3 Step 1: Add a Derive Fields Node 35
        Remove Connection Between the Text Reader and Load Actian Vector On Hadoop Nodes 35
        Add a Derive Fields Node 36
    Lesson 3 Step 2: Add Text Reader Node 37
        Add a Text Reader Node 38
    Lesson 3 Step 3: Add a Join Node 39
        Add a Join Node 39
        Configure the Join Node 40
    Lesson 3 Step 4: Reconfigure the Load Actian Vector On Hadoop Node 41
        Reconfigure the Load Actian Vector On Hadoop Node 41
    Lesson 3 Step 5: Run the Workflow 43
        Run the Workflow 43
        Save the Workflow 44
    Lesson 3 Step 6: Query the Data 44
        Verify Data Loaded into sales_city_fact Table 44
        Answer the BI Questions 45

Lesson 4—Workflow Deployment 47
    DataFlow Executor 47
    Execute in Batch Mode 49
        Understanding Batch Mode Command Parameters 49
        Running the SIMPLE ETL Workflow in Batch Mode 51
    KNIME Server 51
    Scheduling 52

Lesson 5—Connect to Actian Vector through ODBC 53
    Install Actian ODBC Driver 54
    Create an Actian Vector DSN 55
    Use Excel to Query Data 57

Lesson 6—Connect to Actian Vector through JDBC 59
    JDBC Connection URL 59
    Run the Example Java Program 60
    View JDBC Connection URL in Actian Director 61

FAQ 63

Appendix 65
    Installing an X Server 65
    Connect to the Host as Actian User 65
    Installing Actian Director on Windows 66
Setup

This chapter describes the requirements for working through the tutorial, including:
Required software
How to run SQL
How to create the required tables in the database
Components of Actian Analytics Platform - Express Hadoop SQL Edition

Actian Analytics Platform - Express Hadoop SQL Edition consists of the following components:
Actian DataFlow (KNIME)
Actian Vector – Hadoop Edition
Actian Director
Actian DataFlow (KNIME)

Actian DataFlow (KNIME) is a user-friendly graphical workbench that can be used for creating ETL (Extraction, Transformation, and Loading) workflows. Express Hadoop SQL Edition contains the full set of Actian DataFlow operators for use within KNIME. These operators let you create nodes (modules) for:
Extracting data from files and databases
Executing SQL statements
Controlling flow
Deriving values
Performing predictive analysis
Visualizing
Reporting
The Actian DataFlow engine provides scalability by using all the server cores and cluster nodes that are available at runtime. With Actian DataFlow, a workflow created on a development server can be deployed for execution onto a production server without any change. The Actian DataFlow operators will automatically use the extra resources (such as cores and memory) available on the production server.
Actian Vector - Hadoop Edition

Actian Vector - Hadoop Edition scales out the raw single machine performance of Actian Analytics Database - Vector Edition by leveraging Hadoop Distributed File System (HDFS) for storage. Actian Vector is a database management system for analytical database applications such as data warehousing, data mining, and reporting. Vector is optimized to work with both memory- and disk-resident datasets, allowing it to efficiently process large amounts of data. Its innovative technology allows analytical queries to run fast. The Hadoop Edition provides Vector performance on every node, SQL access with regular tools, and full analytical capabilities.
Actian Director

Actian Director is an easy-to-use graphical interface that lets you interact with Actian Vector installations. Using Director, you can:
Manage databases, tables, servers, and their components
Administer security (users, groups, roles, and profiles)
Create, store, and execute queries
What You Will Learn

In this tutorial, you will learn how to use Actian DataFlow operators to create a simple ETL workflow that loads sales details into an Actian Vector database. After the simple ETL workflow has been executed, you will be able to answer the following BI questions:
How much profit is generated by each product at each store?
What is the total profit generated by all the stores in a city?
What is the total profit generated per country?
Specifically, you will learn how to use Actian DataFlow operators to:
Extract the sales information from multiple CSV files.
Load the sales details into an Actian Vector database.
As part of the workflow, derive a value for the profit for each product, because this value is not included in the CSV file.
As part of the workflow, add the city and country where each store is located.
The simple ETL workflow will be created using the following Actian DataFlow operators:
Delimited Text Reader
Load Actian Vector On Hadoop
Derive Fields
Join
Requirements

This tutorial is intended for users familiar with database fundamentals but new to Actian DataFlow and Actian Vector. The following must be available to run the tutorial:
Actian Analytics Platform - Express Hadoop SQL Edition 2.0 (For installation instructions, see the readme)
X11 forwarding over SSH enabled
If accessing the remote Linux node from your desktop, you must have an X11 server and an SSH client to work with Actian DataFlow (KNIME) and Actian Director from the remote instance. Instructions for installing an X server and SSH client on a Windows PC can be found at Installing an X Server (see page 65). Actian also provides a Windows native client package for Actian Director that can be installed and used for this tutorial. If you want to install Actian Director on your Windows desktop and use it to connect to your remote Linux node, follow the installation instructions for Actian Director on Windows (see page 66). You still need to have an X server running on your desktop to run Actian DataFlow (KNIME) if you are connecting remotely to a Linux node using SSH.
Access the Actian Vector Instance

During installation of Express Hadoop SQL Edition, an environment file named .ingAHsh or .ingAHcsh is written to the home directory ($HOME) of the "actian" user, which was created during installation. To access your instance, you must source this environment file as user actian.

To source the environment file

Issue the following command:

source ~actian/.ingAHsh
Sample Database

The simple ETL workflow will load data into a table in a database. The database for this tutorial was created as part of the Express Hadoop SQL Edition install process. The database is named sample.
Data Files Used in the Tutorial

The three data files used in the tutorial are included in the distribution:
sample_sales_data_1.csv
sample_sales_data_2.csv
store_city.csv
After installing Express Hadoop SQL Edition, you can find these files in this location: /opt/Actian/AnalyticsPlatformAH/ingres/tutorial
How to Run SQL in this Tutorial

You will need to run SQL several times during the tutorial. Actian Director is an easy-to-use graphical interface that lets you interact with Actian Vector installations and execute SQL statements. There are two ways to work with Actian Director: you can either start Director on the Linux node where you installed Express (and have the Director user interface display on your desktop) or connect to the remote node with a Director installed natively on your Windows desktop.

To start Actian Director and connect to the instance

If you want to start Director from the Linux node:

1. Log in (or switch user) as the actian user. For specifics on connecting based on your scenario, see Connect to the Host as Actian User (see page 65).
2. Start Actian Director by entering the following command: director
Director is started. The Actian Vector Hadoop Edition AH instance is displayed in the Instance Explorer.
3. Right-click the Actian Vector Hadoop Edition AH instance in the Instance Explorer, and then select Connect. The Connect to Instance dialog opens.
4. Enter the following credentials, and then click Connect:
   Authentication: Authenticated User
   Login: demo
   Password: hsedemo
   You are connected to the instance.

If you want to start Director on your Windows PC and use it to connect to the Linux node:

1. If you have not yet installed Actian Director on your Windows desktop, follow the installation instructions (see page 66).
2. Start Actian Director from the Start Menu.
3. Click "Connect to an instance" in the Start Page, or click Connection, Connect on the toolbar. The Connect to instance dialog is displayed.
4. Enter the following credentials, and then click Connect:
   Instance: Name of the Linux remote host where you have installed Express.
   Authentication: Authenticated User
   Login: demo
   Password: hsedemo
   You are connected to the instance.

To run SQL from Director

1. Expand the Actian Vector Hadoop Edition AH instance in the Instance Explorer. Select Databases, sample. On the menu ribbon, select Query, New. A Query tab is opened in the right pane.
2. Type in your SQL statements, and then press Execute on the query toolbar. The SQL statements are executed. (Alternatively, you can select Query, Open to select a file containing SQL.)
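If you want a quick smoke test that the connection and the query tools work, a trivial query such as the following can be run first. This is only a suggested extra step, and it assumes the standard Ingres/Vector catalog view iidbconstants is available:

SELECT user_name, dba_name FROM iidbconstants;
COMMIT;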
Create Tables in the Sample Database

Two tables must be created in the sample database:
sales_fact (used in Lesson 1)
sales_city_fact (used in Lesson 3)
SQL scripts are provided that create the tables for you.

To create the sales_fact and the sales_city_fact tables

1. Connect to the sample database using Actian Director. For details, see How to Run SQL in this Tutorial (see page 10).
2. Select Query, Open on the menu ribbon. In the Browse dialog, navigate to /opt/Actian/AnalyticsPlatformAH/ingres/tutorial, and then select the following file: create_sales_fact_table.txt
   The SQL to create the table is displayed in a new tab in the right pane.
3. Select the sample database from the drop-down list on the query toolbar.
4. Click Execute on the toolbar. The table is created.
5. Repeat Steps 1 through 4, except this time for Step 2 select the following file: create_sales_city_fact_table.txt
6. (Optional) Select Home, Refresh on the menu ribbon. The newly created tables are listed in the Instance Explorer under: Actian Vector Hadoop Edition AH, Databases, sample, Tables.
Lesson 1—Create a Simple ETL Workflow

In this lesson you will create a simple ETL workflow that extracts data from a flat file and loads the contents of the file into the sales_fact table in Actian Vector. After the simple ETL workflow has been executed, you will be able to answer the following BI question: How much profit is generated by each product at each store?
Workflows

A workflow consists of a set of nodes (also known as operators) that operate on a data set. The input data set for the workflow is created either by loading a flat file (using the nodes Text Reader, Log Reader, etc.) or reading data from a table (using the nodes Database Reader, HBase Reader, etc.). Nodes are connected together so that the output from a node is the input of subsequent nodes. By connecting nodes together a workflow is created that transforms the input data. Once the transformation has been completed, the data set can be saved as a flat file (Text Writer node) or loaded into a database table (Load Actian Vector On Hadoop node).

Note: KNIME refers to operators as nodes.
Data sets can be joined together to create a larger data set. Joining data sets is covered in Lesson 3 (see page 33). Also, the output from one node can be used as the input to multiple nodes.
The Sales Data Input File

For this tutorial, the source input data file is sample_sales_data_1.csv, found in the subdirectory named tutorial. This is a comma separated file containing sales information with the following fields:
product_id
time_id
customer_id
promotion_id
store_id
store_sales
store_cost
unit_sales
These are the first four rows in the file:

product_id,time_id,customer_id,promotion_id,store_id,store_sales,store_cost,unit_sales
337,371,6280,0,2,1.5,0.51,2
1512,371,6280,0,2,1.62,0.632,3
963,371,4018,0,2,2.4,0.72,1
Notes:
The source file has a header row.
The Text Reader node analyzes the contents of the input source file to determine the data types for each field. Additionally, if the file has a header row, these field names will be used as the column names in the data set created by the Text Reader node. The data set column names are then available for use by the Load Actian Vector On Hadoop node when mapping the data set to the columns in the destination table.
The input file has no value for profit. You will initially use SQL to derive a value for profit, but in Lesson 3 (see page 33) you will amend the workflow to include a node to derive the profit for each product.
If the source file does not have a header row, then the Text Reader node will create generic column names. You can either inject column names after the file has been read using the Insert Column Header node, or map the generic column names when configuring the Load Actian Vector On Hadoop node.
sales_fact Table

The sales data from the input source files will be loaded into the sales_fact table in the sample database. If you did not create the sales_fact table earlier (as described in Create Tables in the Sample Database (see page 13)), then use the following SQL to create it:

CREATE TABLE sales_fact(
    product_id INTEGER NOT NULL,
    time_id INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    promotion_id INTEGER NOT NULL,
    store_id INTEGER NOT NULL,
    store_sales DECIMAL(10,4) NOT NULL,
    store_cost DECIMAL(10,4) NOT NULL,
    unit_sales INTEGER NOT NULL
) WITH PARTITION = (HASH ON product_id numparts PARTITIONS);
COMMIT;
Note: In the WITH PARTITION clause, numparts is the number of partitions to be created for the data. This number should be a multiple of the total number of physical cores across all nodes where Vector Hadoop Edition is installed.

Note: If you want other users to access the tables in this Tutorial, you may have to grant privileges to those users. You can do so by using Actian Director.
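If you prefer to grant access with SQL instead of through Director, statements like the following are a minimal sketch, run as the demo user who owns the tables; the user name analyst_user is only an illustration and is not part of the tutorial setup:

GRANT SELECT ON sales_fact TO analyst_user;
GRANT SELECT ON sales_city_fact TO analyst_user;
COMMIT;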
Lesson 1 Step 1: Create a New Workflow Project

The first step in creating a new workflow is to create a new Workflow project. The project includes the Workflow Credentials, Workflow Variables, Work Preferences, and workflow canvas onto which the workflow nodes will be placed that will transform the data.

To create a new workflow project

1. Connect as user actian to the Linux master node on which you installed Express. For specifics on connecting based on your scenario, see Connect to the Host as Actian User (see page 65).
2. Start Actian DataFlow (KNIME) by entering knime.
3. On the File menu, select New. The New dialog is displayed, which lets you select a wizard.
4. Select Actian Dataflow Workflow, and then click Next. The Create a New Actian Dataflow Workflow dialog is displayed.
5. Change the default workflow name to SimpleETL, leave the Workflow Group set to the default value, leave the profile set as Development, and then click Finish. A new tab titled SimpleETL is created.

Note: Several different templates can be used when creating a new workflow project. The Actian DataFlow Workflow template must be used if the workflow will include any Actian DataFlow nodes. Using the Actian DataFlow Workflow template allows for streaming execution of the nodes, which allows for much faster execution of workflows. The data is streamed from node to node without having to be staged to disk. Also, the workflow nodes can be run in parallel on a single machine or even on a Hadoop cluster. For more details, see the Actian DataFlow Online Help pages (http://help.pervasive.com/display/RA2/Enabling+Workflows+to+Execute+with+DataRush).
Lesson 1 Step 2: Add a Text Reader Node

In this step, you will add and configure the Actian DataFlow Text Reader node to the workflow. The Text Reader node extracts data from a file and creates a data set ready to be used by subsequent nodes in the workflow.
Add a Text Reader Node

To configure a Text Reader you specify the location of the file, the field separator, and whether the file has a header row.

To add a new Text Reader node to the workflow

1. In the Node Repository expand the following nodes by clicking on their expand arrows: Actian Dataflow, I/O, Read
2. Select the Delimited Text Reader node and drag it onto the workflow canvas.
3. Right-click the Delimited Text Reader node and select Configure from the context menu (or simply double-click the node). The Reader Properties tab for the Delimited Text Reader is displayed.
4. Click Browse. The Open dialog box is displayed.
5. Locate the file sample_sales_data_1.csv in the following directory, and then click Open: /opt/Actian/AnalyticsPlatformAH/ingres/tutorial
   The file name is then displayed in the Location field on the Reader Properties tab.
6. Leave the Field separator set to the default value of comma.
7. Select the "Has header row" check box, and then click OK.

The Actian DataFlow Text Reader node analyzes the text file to determine the data type for each field. Because the input text file has a header row, the field names from this row will be used as the column names in the data set created by the Text Reader.

While the Text Reader node has several advanced features (including Operator Settings, Flow Variables, Job Manager Selection, Memory Policy, Read a file from a cluster), only the Job Manager Selection is covered in this tutorial.
Node Status Indicator

Under the Text Reader node there is a traffic light status indicator.
This status indicator is present on all nodes and signifies the following:

(Red) The node has not been configured or execution of the node failed.
(Amber) The node has been configured, but not run. This status is also set if this or a preceding node in the workflow has been reset and this node needs to be rerun.
(Green) The node has been configured and execution completed successfully. Within a workflow, a node with a status of green will not be rerun because nothing has changed. To force the execution of a node with a status of green requires the node to be RESET.
Test the Node

To test that the Text Reader node has been configured successfully

Click Execute All on the toolbar (or select Node, Execute All). Execution of the workflow will start. If the Text Reader node has been successfully configured, then the status indicator will show green.
Reset Node Status

After a node has been run successfully, the node will not be rerun unless the node is reconfigured or a preceding node in the workflow is changed. To force the execution of a node that has been run successfully, the node status must be reset.

To reset the node status

Right-click the Delimited Text Reader node and select Reset from the context menu. The node is reset.
Lesson 1 Step 3: Add a Load Actian Vector On Hadoop Node

In this step, you will add and configure the Actian DataFlow Load Actian Vector On Hadoop node to the workflow. The Load Actian Vector On Hadoop node loads data into a database table using the data set created by a preceding node in the workflow.
The process for setting up the Load Actian Vector On Hadoop node is summarized as follows:

1. Create a Database Connection to Actian Vector. Note: A connection named Actian_Vector_AH was created during installation.
2. Connect the input of the Load Actian Vector On Hadoop node to the output from a preceding node. This must be done before you can configure the Load Actian Vector On Hadoop node.
3. Configure the Load Actian Vector On Hadoop node with details of the database connection to use, the destination table name, and then map the input data set columns to the destination table columns.
4. Provide the following SQL:
   Initialization SQL – This will be executed before the data is loaded to the table. For example, delete existing rows in the table before loading in new data.
   Finalization SQL – This will be executed after the data has been loaded to the table.

There are two nodes that can be used to load data into a Vector Hadoop Edition database:

Load Actian Vector On Hadoop (used in this tutorial)
This node formats the input data for loading using the cluster-enabled vwload utility. If executing the node from a remote client, the client must have access to the HDFS instance on which Vector Hadoop Edition is running.

Database Writer
This node uses JDBC to load the data into the database table. This option is slower than using the Load Actian Vector On Hadoop node.
Add a Load Actian Vector On Hadoop Node

A Load Actian Vector On Hadoop node will load a data set into a database table.

To add the Load Actian Vector On Hadoop node to the workflow

1. In the Node Repository expand the following tree nodes by clicking on their expand arrows: Actian Dataflow, I/O, Write
2. Select the Load Actian Vector On Hadoop node and drag it onto the workflow canvas.
3. Connect the output from the Text Reader node to the input of the Load Actian Vector On Hadoop node:
   a. Click and hold the output arrow on the Text Reader node.
   b. Drag the mouse to the input arrow on Load Actian Vector On Hadoop.
Configure the Load Actian Vector On Hadoop Node

The final step in adding the Load Actian Vector On Hadoop node is to configure it.

To configure the Load Actian Vector On Hadoop node

1. Right-click the Load Actian Vector On Hadoop node and select Configure.
2. On the Load Actian Vector On Hadoop dialog:
   a. For Connection, select the Actian_Vector_AH database connection.
   b. For Table Name, select sales_fact.
   c. For Temporary Directory, select hdfs://localhost:8020/Actian/tmp.
      Note: You must provide a temporary directory in the HDFS instance on which Vector Hadoop Edition is installed. The actual HDFS URL may be different depending on your Hadoop distribution and configuration. To determine the correct Temporary Directory value, run the following command to determine the default URL for your Hadoop instance, and then append '/Actian/tmp' to the result: hdfs getconf -confKey fs.defaultFS
   d. Click Map Fields. The Source to Target Mapper dialog is displayed.
   e. Click Map by name, and then click OK.
3. Click the Initialization SQL tab and enter the following SQL (the tables were created for the "demo" user): MODIFY demo.sales_fact TO TRUNCATED
   Any existing data will be deleted from the sales_fact table. This is required because you will execute the Load Actian Vector On Hadoop node many times as part of this tutorial.
4. Click OK. The Load Actian Vector On Hadoop dialog is closed.
Lesson 1 Step 4: Run the Workflow

You have created a simple ETL workflow that extracts data from a flat file and loads the data into an Actian Vector table.
Execute the Workflow

To verify that the nodes have been correctly configured, execute the workflow.

To execute the workflow

Click Execute All on the toolbar (or select Node, Execute All). Execution of the workflow starts. If the nodes have been successfully configured then the status indicator for both nodes will show green.
Save the Workflow

In a later lesson in this tutorial you will amend this ETL workflow. Therefore, you may wish to save the workflow before moving on to the next lesson.

To save the simple ETL workflow

On the File menu, select Save. The workflow is saved.
Optimize the Database

After loading data, you should optimize the database. Optimizing the database generates statistics that tell the query optimizer what the data looks like. The query optimizer uses the statistics to generate a query execution plan (QEP) that shows how your query is executed. The QEP can be reused to execute the same query.

To optimize the tables in the sample database

1. Start Actian Director and connect to the Actian Vector Hadoop Edition AH instance. (For instructions, see How to Run SQL in this Tutorial (see page 10).)
2. Select the sample database from the Instance Explorer.
3. Select Database, Generate Statistics on the menu ribbon. The Generate Statistics dialog is displayed.
4. Accept the default settings, and then click OK. The optimizedb utility generates the statistics and issues the message: "Execution of the utility has completed."
5. Click Close.

Note: While this step is not repeated in the tutorial, it is a good practice to optimize the database after each data load.
Lesson 1 Step 5: Query the Data

You have created and executed a simple ETL workflow that extracts data from a flat file and loads the data into an Actian Vector database table. In this step, you will verify that the data has been loaded and run SQL to identify the profitability of each product at all stores.
Verify Data Loaded into sales_fact Table

Running a query will verify that all the data from the flat file has been successfully loaded into the sales_fact table.

To verify that the data was successfully loaded

1. Connect to the Actian Vector Hadoop Edition AH instance using Actian Director.
2. Select the sample database from the Instance Explorer.
3. Select Query, New on the menu ribbon. A query tab is opened.
4. Type the following SQL into the query document:
   SELECT COUNT(*) FROM sales_fact;
   COMMIT;
5. Click Execute on the query toolbar. The query should return a count of 55,063, the number of rows loaded.
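If you also want to eyeball a few of the loaded rows, a quick spot check such as the following works. This is only a suggested extra and assumes the Ingres/Vector FIRST clause for limiting the result:

SELECT FIRST 5 * FROM sales_fact;
COMMIT;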
Determine the Profit at Each Store for Each Product

To determine the profitability of each product at each store, we must calculate the profit of each product because this value is not included in the input data file.

Note: Instead of calculating the profit for each product using SQL, this value could have been derived as part of the ETL workflow. How to add a derived field into a workflow is covered in Lesson 3–Derive a Field and Add Data Lookup (see page 33).

To determine the profitability of each product

Run the following SQL:

SELECT
    store_id,
    product_id,
    SUM(store_sales - (store_cost * unit_sales)) AS total_profit,
    SUM(unit_sales) AS total_units_sold
FROM sales_fact
GROUP BY store_id, product_id
ORDER BY store_id, total_profit DESC;
COMMIT;
Note: For instructions, see How to Run SQL in this Tutorial (see page 10).
Lesson 2—Load Multiple Data Files

In Lesson 1 (see page 15) you created a simple ETL workflow that loaded the contents of the sample_sales_data_1.csv file into a data table. But what if you need to load multiple files, all with the same file structure? What changes need to be made to the workflow? In this lesson you will change the Delimited Text Reader node to process multiple files (with the same structure) by using wildcards. The files are named as follows:
sample_sales_data_1.csv This file contains 55,064 rows, with one row being a header row.
sample_sales_data_2.csv This file contains 31,775 rows, with one row being a header row.
In total you will load 86,837 rows into the sales_fact table. Note: If each file had a different structure you would have to:
Use a Text Reader node for each file.
Use a Join to merge the data sets together. In Lesson 3 (see page 33) you will learn how to merge data from two files using a Join node.
Lesson 2 Step 1: Use Wildcards in the Text Reader Node

By using a wildcard in the file name, the Text Reader node will extract data from all the matching file names.
Open the ETL Workflow Project

If you saved the workflow created in Lesson 1 (see page 15), you need to open the simple ETL workflow.

To open the simple ETL workflow that you created in Lesson 1

1. On the KNIME Explorer panel, expand the LOCAL (Local Workspace) tree node.
2. Double-click the simple ETL workflow named SimpleETL. The workflow is opened.
Configure Text Reader to Load Multiple Files

The Text Reader allows for wildcards to be used in the file name. If a wildcard is used then the Text Reader node will extract data from all the matching file names before passing the data set to subsequent nodes.

To use wildcards with the Text Reader node

1. Right-click the Delimited Text Reader node and select Configure.
2. Click the Location entry field and navigate to the end of the file name.
3. Add a wildcard to the file name by changing sample_sales_data_1.csv to sample_sales_data_*.csv.
4. Tab out of the Location field. The contents of the files are analyzed.
5. Click OK. The Confirm reset message is displayed.
6. Click OK. The status indicator on the Text Reader node is reset to amber to signify that there has been a change and execution of the node is required. The status of the Load Actian Vector On Hadoop node is also reset to amber because this node needs to be rerun as its preceding node has been reset.
Run the Amended Workflow

To test that the Text Reader node will load multiple files, run the amended workflow.

To run the amended workflow

Click Execute All on the toolbar (or select Node, Execute All).

By adding a wildcard to the file name, you have loaded the content of multiple files. A total of 86,837 rows are loaded into the sales_fact table.
Lesson 2 Step 2: Query the Data

You have changed the simple ETL workflow to extract data from multiple flat files and loaded the data into an Actian Vector table. In this step, you will verify that the data has been loaded successfully.
Verify Data Loaded into sales_fact Table

Running a query will verify that all the data from the flat files has been successfully loaded into the sales_fact table.

To verify that the data was successfully loaded

Run the following SQL:

SELECT COUNT(*) FROM sales_fact;
COMMIT;

The query should return a count of 86,837.

Note: For instructions, see How to Run SQL in this Tutorial (see page 10).
Determine the Profit at Each Store for Each Product

To identify the profitability of each product at each store we must derive a value for the profit of each product because this value is not included in the data file.

Note: Instead of deriving the profit for each product as part of the SQL, this value can be derived as part of the ETL workflow by including a node that calculates the profit value and injects it into the data set before it is loaded into the database table. How to add a derived field into a workflow is covered in Lesson 3–Derive a Field and Add Data Lookup (see page 33).

To determine the profit for each product

Run the following SQL (the same SQL as Lesson 1):

SELECT
    store_id,
    product_id,
    SUM(store_sales - (store_cost * unit_sales)) AS total_profit,
    SUM(unit_sales) AS total_units_sold
FROM sales_fact
GROUP BY store_id, product_id
ORDER BY store_id, total_profit DESC;
COMMIT;
Lesson 3—Derive a Field and Add Data Lookup

In Lesson 2 (see page 29) you created a simple ETL workflow that loaded the contents of multiple files into a data table. But we had to use SQL to calculate the profit for each product because the value for profit is not included in the input data files. As part of an ETL workflow, a transformation of the data may be required before the data is loaded into a database. In this lesson, you add nodes to:
Calculate a derived field (profit for a product at a store)
Perform a lookup transformation by adding the city and country where the store is located
After these nodes have been added and the workflow successfully run, you will be able to answer the following BI questions:
What is the total profit generated by all the stores in a city?
What is the total profit generated per country?
sales_city_fact Table

Although the input files for this lesson remain the same, adding a derived field and a lookup transformation means that we must use the sales_city_fact table, rather than the sales_fact table, for this lesson. If you did not create the sales_city_fact table earlier (as described in Create Tables in the Sample Database (see page 13)), then use the following SQL to create it:

CREATE TABLE sales_city_fact(
    product_id INTEGER NOT NULL,
    time_id INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    promotion_id INTEGER NOT NULL,
    store_id INTEGER NOT NULL,
    store_sales DECIMAL(10,4) NOT NULL,
    store_cost DECIMAL(10,4) NOT NULL,
    unit_sales INTEGER NOT NULL,
    profit DECIMAL(10,4) NOT NULL,
    city VARCHAR(25) NOT NULL,
    country VARCHAR(25) NOT NULL
) WITH PARTITION = (HASH ON product_id numparts PARTITIONS);
COMMIT;
Note: In the WITH PARTITION clause, numparts is the number of partitions to be created for the data. This number should be a multiple of the total number of physical cores across all nodes where Vector Hadoop Edition is installed.

The sales_city_fact table has these additional columns:
profit, which will hold the derived profit value
city, which will hold the city where the store is located
country, which will hold the country where the store is located
Lesson 3 Step 1: Add a Derive Fields Node

In this step you will amend the Simple ETL workflow that you created as part of Lesson 2 to include a node to derive a field.
Remove Connection Between the Text Reader and Load Actian Vector On Hadoop Nodes

Because new nodes will be added into the workflow after the sales data files have been loaded, the connection between the output of the Text Reader and the input of the Load Actian Vector On Hadoop nodes needs to be removed.

To remove the connection

1. Right-click the link (black line) between the two nodes and select Delete. The Confirm dialog is displayed.
2. Click OK. The connection is deleted and the status of the Load Actian Vector On Hadoop node changes to red.
Add a Derive Fields Node

To add a new Derive Fields node to the workflow

1. In the Node Repository expand the following tree nodes by clicking on their expand arrows: Actian Dataflow, Transformations, Manipulation
2. Select the Derive Fields node and drag it onto the workflow canvas.
3. Connect the output from the Text Reader node to the input of the Derive Fields node:
   a. Click and hold the output arrow on the Text Reader node.
   b. Drag the mouse to the input arrow on the Derive Fields node.
   Note: The Text Reader and Derive Fields nodes must be connected together before the Derive Fields node can be configured because the column names used by the Derive Fields node are taken from the input data set.
4. Right-click the Derive Fields node and select Configure (or simply double-click the node).
5. Click Add.
6. Double-click the Output Field column and replace the default value of field0 with profit.
7. Double-click the Expression column and replace the default value of 0 with store_sales - (store_cost * unit_sales).
8. Click OK. The formula is saved.

If the expression entered has an error, a red warning triangle is shown next to the expression. You can amend an existing derivation by clicking Edit on the Derived Outputs panel. This also allows you to check the syntax of the expression.
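To make the derivation concrete, the expression adds to each row the same value that the following SQL would compute if you derived profit at query time. This is purely an illustration of what the node produces, not a step in the workflow:

SELECT
    store_id,
    product_id,
    store_sales - (store_cost * unit_sales) AS profit
FROM sales_fact;
COMMIT;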
Lesson 3 Step 2: Add Text Reader Node

In this step you will amend the ETL workflow to include a node to add details for the city and country for each store. The city and country for each store are in the store_city.csv file. This is a comma separated file with columns of store_id, city, country.

These are the first five rows in this file:

store_id, city, country
2,Leeds, United Kingdom
3,Slough, United Kingdom
6,Suresnes,France
7,Islandia, USA
As in the sample_sales_data_1.csv file used in Lesson 1 (see page 15) the input source file has a header row.
Add a Text Reader Node

In Lesson 1 Step 2 you added a Text Reader node to extract data from a file and import it into a data set for processing by subsequent nodes.

To add a Text Reader node onto the workflow to read the store_city.csv file

Repeat the steps from Lesson 1 Step 2, but this time specify the store_city.csv file as the Source.
In your workflow, you have now loaded the contents of three flat files (remember that you have loaded two sales data files and one data lookup file). The files have been loaded into separate data sets, which need to be joined together. In the next step in this lesson you will add a Join node to perform an INNER join between the two data sets to add the city and country for each store.
Lesson 3 Step 3: Add a Join Node

In this step you will join the data sets from the two Text Reader nodes together using a Join node.
Add a Join Node

A Join node has two inputs (left and right) that can be joined either as INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER join. For the simple ETL workflow you need an INNER join.

To add a new Join node to the workflow

1. In the Node Repository expand the following tree nodes by clicking on their expand arrows: Actian Dataflow, Transformations, Aggregate
2. Select the Join node and drag it onto the workflow canvas.
3. Connect the output from the Derive Fields node to the top input on the Join node.
4. Connect the output from the Text Reader node (used to read the store_city.csv file) to the bottom input on the Join node.
Configure the Join Node

You need to configure the Join node to specify the join type and which fields to use for the join.

To configure the Join node

1. Right-click the Join node and select Configure (or simply double-click the node).
2. For Join Type select INNER.
3. Select Merge Key Fields. Selecting Merge Key Fields ensures that the output data set from the Join will not contain two store_id columns.
4. Click Add to create a new join row in the Join Source table grid.
5. Select store_id as the Left Key.
6. Select store_id as the Right Key.
7. Click OK.

You can specify additional predicates to be used when performing the join, but these are not needed for the simple ETL workflow.
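Conceptually, the Derive Fields node plus this INNER join produce a data set equivalent to what the following SQL would return. It is shown only as an illustration of the join result (with the store_id key merged into a single column); store_city here is a hypothetical table standing in for the store_city.csv lookup file, which is never actually loaded into the database by the workflow:

SELECT
    s.product_id, s.time_id, s.customer_id, s.promotion_id, s.store_id,
    s.store_sales, s.store_cost, s.unit_sales,
    s.store_sales - (s.store_cost * s.unit_sales) AS profit,
    c.city, c.country
FROM sales_fact s
INNER JOIN store_city c ON s.store_id = c.store_id;
COMMIT;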
Lesson 3 Step 4: Reconfigure the Load Actian Vector On Hadoop Node

In this step you will reconfigure the Load Actian Vector On Hadoop node to load data into the sales_city_fact table.
Reconfigure the Load Actian Vector On Hadoop Node

To configure the Load Actian Vector On Hadoop node to load into the sales_city_fact table

1. Connect the output from the Join node to the input on the Load Actian Vector On Hadoop node.
2. Right-click the Load Actian Vector On Hadoop node and select Configure (or simply double-click the node). The Load Actian Vector On Hadoop dialog is displayed.
   a. For Connection, select the Actian_Vector_AH database connection.
   b. For Table Name, select sales_city_fact.
   c. For Temporary Directory, select hdfs://localhost:8020/Actian/tmp.
      Note: The actual HDFS URL may be different depending on your Hadoop distribution and configuration. To determine the correct Temporary Directory value, run the following command to determine the default URL for your Hadoop instance, and then append '/Actian/tmp' to the result: hdfs getconf -confKey fs.defaultFS
3. Click Map Fields. The Source to Target Mapper dialog is displayed.
4. Click Map by name, and then click OK.
5. Click the Initialization SQL tab and enter the following SQL: MODIFY demo.sales_city_fact TO TRUNCATED
   Data will be deleted from the sales_city_fact table.
   Note: When deleting all data in a large table, MODIFY...TO TRUNCATED is preferred to DELETE FROM because MODIFY creates only one entry in the log file, regardless of the number of rows, while DELETE FROM creates one entry per row. A DELETE FROM operation will take much longer to complete due to logging.
6. Click OK. The Load Actian Vector On Hadoop dialog is closed.
Lesson 3 Step 5: Run the Workflow

You have modified the simple ETL workflow to include transformations to derive a profit field and perform a Data Lookup to add the city and country for each store.
Run the Workflow

To test that the nodes in the workflow have been configured successfully, we execute the workflow.

To run the workflow

Click Execute All on the toolbar (or select Node, Execute All). Execution of the workflow starts. The status indicator for all nodes will show green if the Text Reader and Load Actian Vector On Hadoop nodes have been successfully configured.

If the execution of the workflow fails, look at the error messages in the Console. If the error mentions the store_id_2 column, then on the Join node you did not select the Merge Key Fields check box.
Save the Workflow

To save the simple ETL workflow

On the File menu, select Save. The workflow is saved.
Lesson 3 Step 6: Query the Data

You have modified the simple ETL workflow to include transformations that derive a profit field and perform a Data Lookup to add the city and country for each store. In this step, you will verify that the data has been loaded and answer the following BI questions:
What is the total profit generated by all the stores in a city?
What is the total profit generated per country?
Verify Data Loaded into sales_city_fact Table

Running a query will verify that all the data from the flat files has been successfully loaded into the sales_city_fact table.

To verify that the data was successfully loaded

Run the following SQL:

SELECT COUNT(*) FROM sales_city_fact;
COMMIT;

The query should return a count of 86,837.

Note: For instructions, see How to Run SQL in this Tutorial (see page 10).
Answer the BI Questions

To answer the two BI questions

Run the following SQL:

SELECT
    city,
    SUM(profit) AS total_profit,
    SUM(unit_sales) AS total_units_sold
FROM sales_city_fact
GROUP BY city
ORDER BY total_profit DESC;
COMMIT;

SELECT
    country,
    SUM(profit) AS total_profit,
    SUM(unit_sales) AS total_units_sold
FROM sales_city_fact
GROUP BY country
ORDER BY total_profit DESC;
COMMIT;
The results of these two queries are displayed in the Results 1 and Results 2 tabs in Actian Director.
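If you would rather see both rollups in a single result, a query along the following lines (not part of the original tutorial) groups by country and city together:

SELECT
    country,
    city,
    SUM(profit) AS total_profit,
    SUM(unit_sales) AS total_units_sold
FROM sales_city_fact
GROUP BY country, city
ORDER BY country, total_profit DESC;
COMMIT;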
Lesson 4—Workflow Deployment

The simple ETL workflow you created in Lesson 3 (see page 33) to load store sales data into the Actian Vector data table was executed interactively inside the KNIME workbench. In this lesson you will learn how to execute a workflow outside the KNIME workbench. The options available to run a workflow non-interactively outside the KNIME workbench are:
Execute workflow with Dataflow
Execute in headless batch mode
Execute using the KNIME Server
DataFlow Executor

The Actian DataFlow nodes are designed for scalability and performance. Scalability can be achieved by adding more resources that are available for use. When adding additional resources, scaling can be achieved through:
Scaling up – Adding more CPUs, more disks, or more memory
Scaling out – Adding additional machines. This is the typical distributed cluster model, enlarging the cluster size to increase performance.
Actian DataFlow is designed to handle scaling in either direction. Applications built using DataFlow will scale equally well on "fat" SMP (symmetric multiprocessor) machines and on clusters of "skinny" commodity PCs. Concurrency and parallelism help to reduce execution times and enable an application to scale as the data volume to process increases. The Actian DataFlow nodes that you will include as part of the Simple ETL workflow can be configured to use the DataFlow scalability by changing the Job Manager settings to use the DataFlow Executor instead of the KNIME engine.
For example, to configure a Text Reader node to use the DataFlow Executor

1. Right-click the Delimited Text Reader node and select Configure.
2. Select the Job Manager Selection tab.
3. Select DataFlow Executor as the job manager.
4. Select development for the Profile.
Note: The SimpleETL workflow was created using the Actian "development" profile. You can set up different Actian profiles (such as production or test) that define your environment, including level of parallelism and whether to execute in a cluster. To do so, in KNIME select File, Preferences, Actian.

The following diagram illustrates the time difference between a workflow executing using the default KNIME job manager (sequential execution of nodes) and the DataFlow engine (parallel execution of nodes).
Details on performance and scalability can be found on the DataFlow General Concepts help page (http://help.pervasive.com/display/DR62/General+Concepts). Details on how to configure a workflow to execute with DataFlow are on the Enabling Workflows to Execute with DataFlow help page (http://help.pervasive.com/display/RA2/Enabling+Workflows+to+Execute+with+DataRush).
Execute in Batch Mode

KNIME can be run in headless batch mode by specifying parameters and their values on the command line. To see usage information on valid input parameters, CD (change directory) to the directory where KNIME is installed, and run the following command:

./knime -nosplash -application org.knime.product.KNIME_BATCH_APPLICATION
For more details on running KNIME in batch mode see this KNIME FAQ on tech.knime.org (http://tech.knime.org/faq#q12).
Understanding Batch Mode Command Parameters

To execute a workflow in batch mode you need to know the following:
Full path to the knime executable
Full path to the workflow subdirectory (see page 49) in the workspace directory
Full path and name of the exported preferences file
Workflow Directory

When using the KNIME workbench, all workflows are shown in a repository view. All workflows are stored in a subdirectory inside the KNIME workspace directory. For example, the Simple ETL workflow you created in Lesson 3 (see page 33) will be workflow/SimpleETL.

Note: The KNIME workspace directory was created when you started KNIME for the first time on your machine.
When running a workflow in batch mode, there are two options to define the location of the workflow:

-workflowDir
Specify the workflow subdirectory inside the workspace directory.

-workflowFile
Export the workflow as a ZIP file, and then specify the location of the ZIP file as a parameter.

To export a workflow, follow these steps:

1. On the File menu, select either Export Actian DataFlow Project (if the workflow contains any Actian DataFlow nodes) or Export KNIME Workflow.
2. Select the workflow to be exported.
3. Enter the export file name.
4. Click Finish.

An exported workflow can be imported into another installation of KNIME. For more details on exporting or importing workflows, see this help article on tech.knime.org (https://tech.knime.org/workbench#import_export).
Preferences
When creating the Simple ETL workflow, you specified the Actian_Vector_AH connection as part of your Preferences. These preferences are also required when executing a workflow in batch mode. The preferences to use when executing a workflow in batch mode are defined in a file with an EPF file extension. An EPF file is created by exporting the preferences in the same manner as exporting a workflow.
To export the preferences and create the EPF file:
1. On the File menu, select Export Preferences.
2. Enter the export file name: /opt/Actian/AnalyticsPlatformAH/ingres/tutorial/preferences.epf
3. Click OK.
For more details on preferences see this help article on tech.knime.org (https://tech.knime.org/workbench#preferences).
Running the Simple ETL Workflow in Batch Mode
To execute the Simple ETL workflow in batch mode, issue the following command:
/opt/Actian/AnalyticsPlatformAH/knime_2.9.4/knime -nosplash -reset -nosave -workflowDir=/opt/Actian/AnalyticsPlatformAH/knime_2.9.4/workspace/SimpleETL -preferences=/opt/Actian/AnalyticsPlatformAH/ingres/tutorial/preferences.epf -application org.knime.product.KNIME_BATCH_APPLICATION
where:
/opt/Actian/AnalyticsPlatformAH/knime_2.9.4/
Specifies the full path to the knime executable.
-reset
Resets all workflow nodes so that they will be executed.
-nosave
Does not save the workflow after execution is finished.
-workflowDir
Specifies the directory that contains the workflow to be executed.
-preferences
Specifies the path to the preferences file.
-application
(Required) Starts KNIME in batch mode. The application name must be specified in uppercase.
KNIME Server
The KNIME Server is a business solution from www.knime.org (http://www.knime.org) that allows for central management of workflows. KNIME Server includes User Access Rights to control how data is grouped for projects, workgroups, or departments. Workflow execution can be automated and scheduled to run at set times. For more information, see KNIME Server (http://www.knime.org/knimeserver).
Scheduling
Out of the box, KNIME has no scheduling abilities. If you want to schedule the execution of a workflow, your options are:
Use Actian DataConnect (http://www.actian.com/products/dataconnect/).
Use Cron (if on Linux).
Use Task Scheduler (if on Windows).
Use the KNIME Server.
Use a custom application that invokes a KNIME workflow.
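For the last option, the following is a minimal, hypothetical sketch of a custom application that launches the batch mode command shown earlier in this lesson with java.lang.ProcessBuilder. The class name is invented and the paths are the ones assumed throughout this tutorial; adjust both for your environment.

public class RunSimpleEtl {
    public static void main(String[] args) throws Exception {
        // Same command as in "Running the Simple ETL Workflow in Batch Mode".
        ProcessBuilder pb = new ProcessBuilder(
                "/opt/Actian/AnalyticsPlatformAH/knime_2.9.4/knime",
                "-nosplash", "-reset", "-nosave",
                "-workflowDir=/opt/Actian/AnalyticsPlatformAH/knime_2.9.4/workspace/SimpleETL",
                "-preferences=/opt/Actian/AnalyticsPlatformAH/ingres/tutorial/preferences.epf",
                "-application", "org.knime.product.KNIME_BATCH_APPLICATION");
        pb.inheritIO();                    // forward KNIME's console output
        Process knime = pb.start();
        int exitCode = knime.waitFor();    // 0 indicates the batch run completed successfully
        System.out.println("KNIME batch run finished with exit code " + exitCode);
    }
}

A scheduler (cron, Task Scheduler, or your own service) could then run this class at the desired times.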
Lesson 5—Connect to Actian Vector through ODBC
Note: You must do this lesson from a remote machine running Microsoft Windows that is connected to the Express Hadoop SQL Edition instance.
In this lesson, you will learn how to query the data using Excel through an ODBC connection to Actian Vector. The Ingres ODBC Driver is available in 64-bit and 32-bit versions. Depending on the bitness of the Excel you are using, you must download either the 64-bit or the 32-bit ODBC driver. For example, if the client (Excel 2013) is 32-bit, you need to download and install the 32-bit ODBC driver; this is still the case if you are running 64-bit Windows 7. This lesson assumes that you are using a Windows 7 machine. The ODBC driver can be installed as part of the Client Runtime package, as described in Install Actian ODBC Driver (see page 54).
To see if you have 32-bit or 64-bit Excel:
Excel 2010: Select File, Help. Information about Excel is displayed in the right pane.
Excel 2013: Select File, Account, About Excel.
Install Actian ODBC Driver
In this step you will install the Actian Vector ODBC driver that is included as part of the Client Runtime package. You must download either the 64-bit or 32-bit Client Runtime, depending on the bitness of the Excel you are using.
To download and install the Actian Vector ODBC driver:
1. Go to the Actian Electronic Software Distribution Client Runtime page: http://esd.actian.com/product/drivers/Client_Runtime
2. For Platform, select either Windows 32-bit or Windows 64-bit, depending on the bitness of the Excel you are using.
3. Click Client Runtime. The Client Runtime downloads are displayed.
4. Download the Client Runtime with configuration defaulted for Vector (for example: clientruntime-vw-3.5.1-417-win-x86.zip) by either FTP or HTTP.
5. Open the ZIP file and extract the contents.
6. Start the installation process by executing the setup.exe file. The Actian Client Runtime Package - Installation Wizard starts.
7. Click Next. The License Agreement panel is displayed.
8. Read and accept the license agreement, and then click Next. The Destination Folder panel is displayed.
9. Click Next to accept the default destination folder. The Locale panel is displayed.
10. Select your time zone from the drop-down list, leave the character set at the default value of UTF8, and then click Next. The message "Do you want to configure a remote connection now?" is displayed.
11. Select No. The Ready to Install the Program panel is displayed.
12. Click Install. The installation starts. The Actian Client VT Configuration panel displays the installation progress.
13. Click Next. The Client Runtime Installation Summary is displayed.
14. Click Finish. The Client Runtime installation is complete.
As part of the install, the following components were added under Actian, Client Runtime on the Start menu:
Ingres Service Manager
Ingres Network Utility
nn-bit ODBC Data Source Administrator
Further information about the ODBC drivers can be found on the Actian Online Documentation Portal (http://docs.actian.com/).
Create an Actian Vector DSN
In this step you will create a User Data Source Name (DSN) that can be used to connect to Actian Vector. Depending on the security settings on your Windows machine, you could create a System DSN instead.
To create the User DSN:
1. Start the ODBC Data Source Administrator that was installed with the Client Runtime (under Start, All Programs, Actian, Client Runtime).
2. Select the User DSN tab and click Add. The Create New Data Source dialog is displayed.
3. Select Ingres VT as the driver, and then click Finish. The Ingres ODBC Administrator is displayed.
Note: If other installations of Client Runtime exist on your machine, the instance ID may be different, such as VU or VV.
4. Enter the following values on the Simple tab:
Data Source: Vector AH
Description: Actian Vector DB
Host Name: IP address or machine name where Express is installed
Listen Address: AH
Database: Select sample from the drop-down list.
Note: Before you can select the database, you must enter the following credentials on the DriverConnect dialog: User ID: demo, Password: hsedemo
5. Click Apply. The Test button is enabled.
6. Click Test. A "Successful connection" message is displayed if the details were entered correctly.
7. Click OK. The Ingres ODBC Administrator is closed.
8. Click OK. The ODBC Data Source Administrator is closed.
Use Excel to Query Data
In this step you will use the User DSN created in the previous step to query the data in the sales_city_fact table. The data in the table will be queried using the Excel Pivot tool.
To use Excel to query data in the sales_city_fact table:
1. Start Excel.
2. Select from the ribbon: Data, Get External Data, From Other Sources, From Data Connection Wizard. The Welcome page of the Data Connection Wizard is displayed.
3. Select ODBC DSN, and then click Next. The Connect to ODBC Data Source page is displayed.
4. Select Vector AH (as created earlier), and then click Next. The DriverConnect dialog appears.
5. Enter the following credentials: User ID: demo, Password: hsedemo. The Select Database and Table page is displayed.
6. Select sales_city_fact, and then click Next. The Save Data Connection File and Finish page is displayed.
7. Select Finish. The Data Connection Wizard dialog is closed and the Import Data dialog is displayed.
8. Select PivotTable Report, and then click OK. The DriverConnect dialog appears.
9. Enter the following credentials: User ID: demo, Password: hsedemo
10. Use the PivotTable Field List to add fields to the report:
Add Country and Profit to see the total profit generated by the stores in each country. Add City to break down the profit by city.
Lesson 6—Connect to Actian Vector through JDBC
This simple example shows how to connect to Actian Vector using the Actian JDBC Driver.
JDBC Connection URL
The JDBC connection URL has the following format:
jdbc:ingres://<hostname>:<port>/<database>;UID=<user>;PWD=<password>
where:
jdbc:ingres://
Connects using the Ingres JDBC driver.
hostname
localhost or the name of the computer you are connecting to.
port
The port to connect to (three characters: the Vector-H instance ID and the digit 7, for example: AH7).
database
The name of the database you want to access.
user
The user name to connect as.
password
The password.
For example, the JDBC connection string to access the installed Vector AH instance is:
jdbc:ingres://localhost:AH7/sample;UID=demo;PWD=hsedemo
Note: If accessing the installed instance remotely, replace localhost with the host name of the master node. For details see DriverManager.getConnection method in the Connectivity Guide.
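The following minimal Java sketch (not the tutorial's example program) simply builds the connection URL from its parts and opens a connection with DriverManager.getConnection. It assumes iijdbc.jar is on the CLASSPATH; if the driver does not register itself automatically, load it first with Class.forName("com.ingres.jdbc.IngresDriver").

import java.sql.Connection;
import java.sql.DriverManager;

public class ConnectSketch {
    public static void main(String[] args) throws Exception {
        String host = "localhost";     // or the host name of the master node
        String port = "AH7";           // instance ID "AH" plus the digit 7
        String database = "sample";
        String url = "jdbc:ingres://" + host + ":" + port + "/" + database
                + ";UID=demo;PWD=hsedemo";

        // Open and immediately close a connection to verify the URL works.
        Connection conn = DriverManager.getConnection(url);
        System.out.println("Connected successfully");
        conn.close();
    }
}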
Run the Example Java Program
The example program demonstrates how to connect to the sample database in the Analytics Platform AH instance as user "demo" with password "hsedemo" and select a list of user tables.
Note: The program returns the list of tables from the iitables system catalog, and if any tables are partitioned, the partitions are returned along with the table names; therefore, the list of tables returned may be larger than expected.
The program is in the tutorial directory: jdbc_connect_example.java. JDK 1.7 (64-bit) is required to compile the program.
To configure your environment:
1. Log in (or switch user) as the actian user.
2. Open a terminal window.
3. Ensure that the javac executable is in the PATH. For example:
export PATH=/usr/bin/java/default:$PATH
4. Add iijdbc.jar to CLASSPATH. For example: export CLASSPATH=.:$II_SYSTEM/ingres/lib/iijdbc.jar:$CLASSPATH
or export CLASSPATH=.:/opt/Actian/AnalyticsPlatformAH/ingres/lib/iijdbc.jar:$CLASSPATH
To compile and run the program 1. From the command prompt, configured as in the previous steps, change directory to the tutorial directory: cd /opt/Actian/AnalyticsPlatformAH/ingres/tutorial
2. Compile the program: javac jdbc_connect_example.java
3. Run the program: java jdbc_connect_example
The output will be similar to this:
Demo test: connecting ...
Demo test: connected successfully
Demo test: Retrieving Result Set
Row 1: Column 1:
tbl_churn_quick_start
Demo test: done. Press ENTER key to exit...
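For reference, the listing below is a rough sketch of the kind of query such a program might run; it is not the shipped jdbc_connect_example.java. The filter on system_use = 'U' is an assumption based on the standard Ingres catalog layout and restricts the list to user tables.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ListUserTables {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:ingres://localhost:AH7/sample;UID=demo;PWD=hsedemo";
        Connection conn = DriverManager.getConnection(url);
        Statement stmt = conn.createStatement();
        // iitables also lists the partitions of partitioned tables, so the
        // result can be longer than the number of tables you created.
        ResultSet rs = stmt.executeQuery(
                "SELECT table_name FROM iitables WHERE system_use = 'U'");
        int row = 0;
        while (rs.next()) {
            System.out.println("Row " + (++row) + ": " + rs.getString(1).trim());
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}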
View JDBC Connection URL in Actian Director
Actian Director displays the JDBC connection string for connecting to Actian Vector.
To view the JDBC connection string in Actian Director:
1. Connect to the Actian Vector Hadoop Edition AH instance.
2. In the Instance Explorer, select Management, Servers, Data Access Servers, default.
3. Right-click the default server and select Properties. The connection string is displayed. For example:
jdbc:ingres://localhost:AH7/;UID=;PWD=
FAQ
What is causing the error "Cannot create workspace"?
This normally occurs when you are attempting to start a second instance of KNIME. To remove the message, close down the other instances of KNIME.
Why does the "tips and tricks" dialog appear each time I start KNIME?
The simple answer is that KNIME is open source and the "tips and tricks" dialog reiterates this message. For further details, see the KNIME forums topic Stopping "Tips and Tricks" from Displaying (https://tech.knime.org/forum/KNIME-users/stopping-tips-tricks-from-displaying).
Where can I get help with Actian Vector - Hadoop Edition?
Online documentation can be found at: docs.actian.com (http://docs.actian.com/).
Where can I get help with Actian DataFlow?
Online help can be found at: Actian DataFlow (http://help.pervasive.com/display/RA2/RushAnalytics+for+KNIME+Help).
Where can I get help with ODBC and DSN (Data Source Name)?
See the following links on the Actian Community Wiki:
ODBC Setup (http://community.actian.com/wiki/How_to_Create_an_Ingres_ODBC_Dat a_Source_Definition_(Windows))
DSN Setup (http://community.actian.com/wiki/ODBC_DSN_Configuration_Table)
Configure Data Source (docs.actian.com) (http://bit.ly/1oUnjp9)
What causes the error "Error enabling dataflow update site:null"?
This error occurs when a workflow is run in batch mode. In this mode, the Actian DataFlow remote repository is not loaded, so the error can be ignored.
Appendix
Installing an X Server
If you have installed Express on a remote Linux node, you need to have X11 forwarding over SSH enabled to run the graphical tools used in this guide. To enable this, you need to have an X server and an SSH client installed. For Windows PCs we recommend installing MobaXterm. MobaXterm bundles an X server and an SSH client in one package and is simple to install and use. Download and install MobaXterm (http://mobaxterm.mobatek.net/download.html).
Connect to the Host as Actian User
If you have installed Express on a remote Linux node, use one of the following methods.
If you have installed MobaXterm (see page 65) on your Windows PC:
1. Start MobaXterm from the Start menu.
2. Click the Session button, and then SSH. In the Remote host field, type the name of the host where you ran the installer and click OK.
3. Log in as actian and enter the password for the actian user (the password you provided during installation).
If you have your own setup of X11 port forwarding over SSH:
1. Start your SSH client with X11 port forwarding enabled.
2. Log in to the host where you ran the Express installer as user actian, using the password that you set for user actian during the Express installation.
If you are locally logged in to the Linux box where you have installed Express:
1. As root, make sure that user actian is allowed to connect to the X server:
xhost +SI:localuser:actian
2. Switch user to actian: su actian
Installing Actian Director on Windows
To install Actian Director 64-bit:
1. Download the installation package (http://esd.actian.com/Express/director-client-1.3.0.407-win-x86_64.zip).
2. Unzip and run the Setup program to install.
3. Follow the instructions to install Actian Director and click Finish to complete the installation.