Visualize Univariate Distribution of a Dataset (using Plotly)

Plotly library makes interactive graphs. Using this library a function ddist has been written for visualization of data distribution of each variable within a dataset. This function may become quite handy during the exploration of any dataset.

Since this function uses plotly library, therefore you must install and load this library before calling the ddist function.


library(plotly)

ddist function takes a dataset (of type data.frame) as an input parameter and returns a list containing plotly plot object for each variable. See below:


#Function for generating plot objects for each variable within the dataset (using plotly library). For numeric and integer variables, a histogram plot is generated, while for others a barplot is generated.
#param: data.frame  
#returns: list
ddist <- function(dataset) {
  
  #create a list for holding the plot objects
  plots <- list(length(dataset))
  
  #iterate through each variable
  for(i in 1:length(dataset)) {
    
    #for numeric and integer variables plot histogram  
    if(is.numeric(dataset[,i]) || is.integer(dataset[,i])) {            
      plots[[i]] <- plot_ly(x=dataset[,i]) %>% add_histogram(name=names(dataset)[i])  
    } 
    #for remaining plot barplot
    else {
      tbl = table(dataset[,i])
      plots[[i]] <- plot_ly(x=names(tbl), y=tbl, name=names(dataset)[i], type='bar')
    }
  } 
  #return list of plots
  return(plots)
}

Now by using the above function we can easily explore the data distribution of each variable within any dataset. For example, the below code passes irisdataset to ddist function and then calls subplot function (of plotly library) to display the resulting plots on two rows:

#generate plotly plot objects
plots <- ddist(iris)

#display plots on two rows
subplot(plots, nrows=2)


Click to view interactive plots of univariate distributions of iris dataset

Similarly passing the diamonds dataset to ddist function results in:


Click to view interactive plots of univariate distributions of diamonds dataset

Advertisements
Posted in Data Analysis, R | Tagged | Leave a comment

Oracle Materialized View Complete Refresh over Database Link (DBLink)


Since Oracle 10g, materialized view can be created in LOGGING mode (default) and NOLOGGING mode. NOLOGGING mode does not provides transactional safety to the refresh process. Therefore NOLOGGING mode is only suitable where the user process itself ensures the data integrity.

Under LOGGING mode, the complete refresh of materialized is a three step process:

  • delete the existing records
  • insert the records from data source
  • if above steps are successful then commit else rollback

This provides two key benefits:

  • during refresh cycle, the materialized view will keep entertaining the user queries with existing data
  • new data will only be available after commit (i.e. when refresh cycle completed with no error)

Ideally, the above approach (LOGGING mode) will be sufficient in most of the scenarios. However, in cases where the materialized view is

  • fully refreshed over DBLink
  • and refresh time is more than couple of minutes
  • and lastly the network over which it is refreshed is not 100% fault tolerant

then in such cases, a network glitch can stuck your materialized view process indefinitely with the following message

SQL*Net more data from dblink

In such scenarios, there are at least two solutions (although you may use both to further optimize and increase the resiliency of refresh process):

Solution 1: Create a job, that kills the sessions of materialized view refresh process that are stuck with ‘SQL*Net more data from dblink’ message. Read more on this here.

Solution 2: Try to reduce the refresh time. With the decrease in refresh time, the chances of refresh process facing a network glitch also reduces. Assuming that your data source query is already optimized, you may further reduce the refresh time as described below:

Step 1) Create materialized view in NOLOGGING mode (with no indexes and no auto refresh schedule) on machine where original materialized view exists

CREATE MATERIALIZED VIEW MV_EMPLOYEE_TEMP NOLOGGING AS SELECT * FROM EMP@SCOTT_AT_SERVER1;

Step 2) Recreate the original materialized view in LOGGING mode with the above materialized view as the local data source (with required indexes and no auto refresh schedule)

CREATE MATERIALIZED VIEW SCOTT.MV_EMPLOYEE AS SELECT * FROM MV_EMPLOYEE_TEMP;

Step 3) Create a refresh procedure that performs the following tasks:

1- refresh the materialized view created in NOLOGGING mode

2- verify the data integrity between the the local and remote data source

3- if verified then do the complete refresh of the materialized view (created in LOGGED mode)

CREATE OR REPLACE PROCEDURE PRC_REFRESH_EMPLOYEE_DATA AS
iTotalLocalRecords Integer;
iTotalRemoteRecords Integer;
BEGIN
--refresh mv in nologging mode
DBMS_MVIEW.REFRESH('MV_EMPLOYEE_TEMP','C');

--verify record length
SELECT COUNT(*) INTO iTotalLocalRecords FROM MV_EMPLOYEE_TEMP;
SELECT COUNT(*) INTO iTotalRemoteRecords FROM EMP@SCOTT_AT_SERVER1;

--refresh only if count matches
IF iTotalLocalRecords = iTotalRemoteRecords THEN
DBMS_MVIEW.REFRESH('MV_EMPLOYEE','C');
END IF;

END;

Step 4) Schedule the procedure execution at your desired frequency

In our organizational setting, one of the materialized view (created in LOGGING mode), the refresh process required more than 25 minutes for complete refresh over dblink (with more than 0.5 millions records and a number of indexes). At least every once a month this refresh process got stuck indefinitely with ‘SQL*Net more data from dblink’ message.
After applying the above fix, the complete refresh time reduced to around 10 minutes (of which only less than 20% network time was required):

  • 1 to 2 minutes for materialized view refresh (created in NOLOGGING mode over DBLink)
  • 8 to 10 minutes for materialized view refresh (created in LOGGING mode with a number of indexes)

The cost of this solution is that our system now has to keep two copies of the same data. However we further optimized the above solution by replacing the materialized view created in NOLOGGING mode with a table created in NOLOGGING mode. At the end of our refresh procedure we truncates the table to save this space.

 

Posted in Oracle, Replication | Tagged , , , , , , , , , , , , | Leave a comment

How to install a (portable) Oracle SQL Developer in Windows


Setting up Oracle SQL Developer on any machine may require less than 5 minutes provided you have access to high speed internet along with administrative rights of your machine. However if you are one of myself, working in an organization having restricted access, the portable version may be quite useful.

After googling around, i managed to create a portable version of Oracle SQL Developer 4.1 for windows platform. Below is the summary of steps (URL references) you may find useful in creating a portable version:

Step 1: Download Java (e.g. jdk-8u45-windows-i586.exe for 32-bit windows)

Step 2: Create a portable version of Java
Details of this step are nicely defined at ‘How to install a (portable) JDK in Windows without admin rights‘.

Step 3: Download Oracle SQL Developer (e.g sqldeveloper-4.1.0.19.07-no-jre.zip)

Step 4: Unzip the downloaded sqldeveloper-xxx.zip file to say ‘d:\sqldev-portable\’ folder

Step 5: Create a ‘jdk’ folder within ‘d:\sqldev-portable\sqldeveloper\’ folder

Step 6: Copy portable version of Java to the ‘jdk’ folder (as created in previous step)

Step 7: Define user directory. This is the directory where sqldeveloper will save user specific settings (on first time use). On windows the default directory for this is

c:\users\myuser\AppData\Roaming\SQL Developer\

However since we are creating a portable version therefore it would be better that these settings are also saved relative to our ‘d:\sqldev-portable\’ folder. Now in order to override the defaults, we need to add the following line to the SQL Developer configuration file (d:\sqldev-portable\sqldeveloper\bin\sqldeveloper.conf) (reference)

AddVMOption -Dide.user.dir=../../.sqldeveloper

Step 8: Now take a backup of ‘d:\sqldev-portable\’ folder before executing the sqldevelper.exe file (within d:\sqldev-portable\sqldeveloper\’ folder).

Posted in Oracle, SQL Developer | Tagged | 5 Comments

Lessons learned from the Book: Passionate Programmer


A few days back i came across this book and found it to be an excellent resource for individuals who not only enjoys doing programming, but some how don’t know how to pursue their passion that can make a real difference or at least attain happiness from their professional work.

A summary of my learning from this book are:

1) If programming (software development) isn’t your passion then its better to switch your career (or at least plan to do so)

Why:

    • Very hard to achieve personal satisfaction (happiness) from your work.

In today’s competitive world, around 50% of your wake-up time is spent at workplace, so if that time doesn’t gives you happiness then chances of getting happiness from remaining time is very low.

    • Difficult to keep yourself up-to-date.

With 50% of your time spent at workplace, only your passion can keep you motivated for additional investment of your time in continuous learning activities.

    • Chances of being innovative at your work reduces.

There is an inherent risk of failure associated with every innovation. Failure helps human to clear his/her illusions and learn what not to do. So without passion the ability to face failure and convert failure into a learning becomes very difficult.

    • Life is very precious but time bounded.

Once a time has passed by, it’s gone forever. It would be unwise to spent your time and effort in activities that doesn’t touches you or becomes a source of happiness. For a moment assume that you are now 60 years old and you are sitting on a chair and recalling your past. How would you feel about your past? Does it gives you satisfaction or makes you unsatisfied about your achievements. Associating yourself with your passion seems to be the best bet one can employ for securing his/her future.

2) Understand demand & supply spectrum of software languages, platforms or technologies so as to know where you stands. Certain languages or platforms are in more demand today then others. This demand opens up job opportunities but also attract others to develop this skill. Demanding languages/technologies also offers an incentive for outsourcing companies to offer such services. Hence such demanding languages are more likely to get outsourced. So what this means is that you should also pursue less demanding technologies (for e.g. new technologies) so as to get yourself on top of market demands.

3) Never completely commit yourself to a particular technology/language. This is equivalent to the common saying that don’t put all your eggs in one basket. If that company goes down or stops promoting the specific technology, then this can badly hit your career.

4) Broaden your knowledge spectrum from both ends. This means that you should not only try to develop expertise in specific area (vertical growth) but should also try to develop your knowledge in other relevant areas (horizontal growth). What this means for example to a java expert is that besides putting efforts in knowing the java internals; he/she must equally put efforts in developing  skills in other relevant areas. This may include ability to setup development/test environments, performing software testing, database tuning, etc.

5) Know your true potential by trying to associate yourself with individuals or teams that are better than you. If you cope to survive with them you will at least know that your true potential is no less then them. Contrary to that if you don’t do this then you will remain unaware of your true potential.

6) Get mentoring. Good mentors can really help you accelerate your pace of learning. A mentor can be your team lead, an associate, a book (for me this book really deserves this status).

7) Become mentor.

8) Train yourself for daily plan/execute cycle. This will not only improve your planning/execution skills but will also help you in tracking your progress.

9) Know that your cost of ownership for the company is at least twice your pay. Calculate this for yourself:

Hourly Cost = 2 x (monthly pay) / 160

By knowing how much an hour you cost to the company, you will better be able to evaluate whether the work you delivers really matches the cost.
Making this a habit will help you in maintaining your focus on your planned
goals.

10) Few Misconceptions about Programming

– Since i am only working on java platform (and not looking at other technologies/platforms) so i am a java expert

Why: Expertise require an in-depth knowledge.

To become an eye specialist an individual must first complete his/her education in general medicine and then invest additional time/money in pursuing the in-depth knowledge in the area of ophthalmology. Our field is no exception; same logic should follow. Therefore for a person to be called a java expert he/she must have knowledge in the areas like:

– internal working of Java VM (virtual machine)
– in what scenarios Java VM will under perform or will crash
– internal working of Java application server
– how Java VM differ from other virtual machines
and so on….

– Programmer’s don’t require practice

Why: Humans by nature, develops their skills by practice. If we look around, we finds that in sports, players have regular practice sessions. Similarly in music, musicians, singers do regular practice.

Since programming is also an art, therefore this art can only
thrive with regular practice. By 20/80 rule, around 20% of language features are used during an implementation of a typical project. So typically an average programmer focuses on these 20% features. But what about the remaining 80% features. Are they unimportant. I think these features are what gives distinctive advantage to one language over the other. Learning these feature demands extra practice. So for a passionate programmer, practice is a key element that differentiates him over the others.

11) Perception matters. Accept this reality. It doesn’t matters how good you are at coding or making designs/architectures unless it helps your manager or customer in developing a positive perception about yourself. What this means is that you need to put extra efforts in marketing your work. This can be a presentation, a training, a white paper or even an email communication (whatever applies).

12) Always be ready for elevator speech. This means that whenever your boss, manager or CEO meets you and asks about your current activity, you should be able to describe your work in terms of what impact this will have on business. Inability to describe your work in term of business impact makes you and your work irrelevant and hence develops a negative perception.

13) Improve your written communication skill. In face to face communication you are not fully dependent on your words. Your facial expression and body language also assists you in communicating your message. However in many cases you have no other option then to communicate your message in written form: emails, letters, reports, etc. Inability to communicate effectively in written form can become a source of distraction.

14) Work habits a passionate programmer should possess are:

Do
– Invest your time/resources in developing your passion but never
use workplace for this purpose
– Develop habit of practicing programming at leisure time

Don’t
– Don’t consider workplace as a practicing ground
– Don’t use workplace for your personal development
– Don’t check your personal emails, website or blogs at workplace
(purpose is to keep focus on your work)

Posted in Book Review, Passion | Tagged , , , , , | Leave a comment

Run Oracle SQL*Loader from SQL or PL/SQL


In continuation of my previous blog entry where a generic framework for invoking Operating System commands from SQL or PL/SQL was described [read], we will now try to demonstrate how this service can be used to invoke SQL*Loader utility.

Business Case: Suppose you want to load a data file in ‘.csv‘ format containing list of employees to ‘emp‘ table in ‘hr‘ schema from SQL. A step by step instructions for the implementation of this business case is described as follows:

Step 1: Create a folder named  myprojects in c drive

Step 2: Download ‘osservice.py‘ file and copy it to ‘c:/myprojects‘ folder

Step 3: Download SQL*Loader control file with name ‘emp.ctl‘ and copy it to ‘c:/myprojects‘ folder. Contents of this file are shown below            

load data
infile ”
append
into table employee
fields terminated by “,”
( code, name, salary )

Step 4: Download a sample data file with name ‘emp.csv‘ and copy it to ‘c;/myprojects‘ folder. Contents of this file is shown below              

code,name,salary
E1,Akbar,20000
E2,John,30000
E3,Sam,15000

Step 5: Create emp table             

create table hr.emp
(
       code varchar2(30),
       name varchar2(100),
       salary integer
);

Step 6: Open command window and first change the directory to ‘c:/myprojects‘ and then run the following command

python c:/myprojects/osservice.py

Step 7: Open SQLPlus and connect to user ‘sys as sysdba‘ and then run the following query

select utl_http.request(utl_url.escape(‘http://localhost:9090/run?command=sqlldr -userid=hr/hr control=./emp.ctl data=./emp.csv log=./emp.log‘)) from dual;

Incase you want to execute the above query from another database user (for example hr) then you need to perform the following:

–  grant execute on utl_http to hr;
–  grant execute on utl_url to hr;
–  configure the Network ACL so that hr user can invoke the web-service

Step 8: Verify the results by issuing the following query            

select count(*) from hr.emp;

Posted in File system, Operating System, Oracle, Python, Web Service | Tagged , , , , , , | 3 Comments

Run Operating System (OS) Command in Oracle PL/SQL: A Web Service Approach


Three years back when i was working on a project in which there was a requirement for execution of OS commands from Oracle 11g, i found the following options:

Option 1: Create Java Stored Procedure and access it through PL/SQL wrapper
Option 2: Use DBMS_SCHEDULER package to schedule an OS command
Option 3: Create a native shared library in C

Technical implementation details of the above options can be found here [1], [2], [3]

After weighing the pros and cons of each approach, i went with the first option. However recently when i was exploring python network library, i found that this simple command can share my files/folder over HTTP interface

python -m SimpleHTTPServer 80

This made me to think that can we use the HTTP interface for invocation of Operating system commands like:

  • running a shell/batch script from Oracle PL/SQL
  • getting list of running processes
  • getting list of files in a folder
  • getting contents of log file

A simple service architectural diagram is shown below:

os_service_architecture

Now for implementing a proof of concept, i am using python’s subprocess.Popen class (as discussed in my previous blog) for interacting with operating system and cherrypy — a python’s minimalist web framework. A step by step instruction for creating such service is given below:

Step 1: Setup Cherrypy

a) Dowonload cherrypy and unzip to a folder say “c:/cherrypy”

b) Change your working directory to “c:/cherrypy” and run the following command

python setup.py install

Step 2: Create a python script named ‘osservice.py’ and copy the following code into it

import cherrypy, subprocess

#overide the default cherrypy port from 8080 to say 9090
cherrypy.config.update({'server.socket_port': 9090,})

#define set of commands that you want to allow to be executed over this interface
command_set = ['ls','ps','dir','head','tnsping','sqlldr']

class ServiceManager:

   def run(self, command, working_directory=None, environment=None):

      #check whether the given command is allowed over this interface or not
      if command.split()[0] in command_set:

         #run the command
         p = subprocess.Popen(command, shell=False, universal_newlines=True, cwd=working_directory, env=environment, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

         #read the stdout and stderror messages
         output = p.stdout.read()
         error = p.stderr.read()

         #if no error reported then format the stdout message
         if len(error) == 0:
            str = self.format(output)
         else:
            str = self.format(error)

      else:
         str = 'Service Error: This command is not allowed over this interface!'

      #return the message
      return str

   run.exposed = True

   def format(self,str):
      lst = str.split('\n')
      result = "
"      for row in lst:         result += "
"      result += "</pre>
<table>
<tbody>
<tr>
<td>"
 result += row
 result += "</td>
</tr>
</tbody>
</table>
<pre>"
      print result
      return result

cherrypy.quickstart(ServiceManager())

Step 3: Start the service by issuing the following command

python c:/myprojects/osservice.py

Step 4: Test the service in web browser. Note the following url should display the contents of the folder from where the python command was issued in step3

http://localhost:9090/run?command=ls

Step 5: Test the service from sqlplus (connect as sysdba).

select utl_http.request(utl_url.escape(‘http://localhost:9090/run?command=ls -l’)) from dual;

select utl_http.request(utl_url.escape(‘http://localhost:9090/run?command=head -5 c:/myprojects/osservice.py’)) from dual;

select utl_http.request(utl_url.escape(‘http://localhost:9090/run?command=tnsping my_remotedb’))

Please note that the current implementation of osservice.py formats all the output as html, however this can be modified as per your the requirement by refining the code of format method in ServiceManager class.

When comparing this option with other 3 options(as described in the starting) in terms of security, this interface can pose a higher security risk. However such risks can be mitigated with the following strategies:

  • only allow this interface to be accessible from local machine
  • limit the type of commands that may be executed over this interface

I think with proper risk mitigation strategies, this interface can open up numerous opportunities for performing operating system level operations from within sql or plsql in minimum amount of time.

Posted in File system, Oracle | Tagged , , , , , , , , , , , , , , , , , | 3 Comments

Run Oracle Commands From Python (via SQL Plus)

Often one may want to run a query or execute stored procedure from python without installing any specific modules like cx_Oracle. In such cases it is often more convenient to use SQL Plus (assuming sqlplus is already installed).

In python, Popen class (of subprocess module) can be used to invoke a sqlplus session.

session = Popen([‘sqlplus’,’-S’,’hr/hr’], stdin=PIPE, stdout=PIPE, stderr=PIPE)

The first parameter of Popen is the actual command given as a list of words. This means that if we want to run the command ‘sqlplus -S hr/hr’ then we must pass it as a list of words in the command i.e. [‘sqlplus’, ‘-S’, ‘hr/hr’]. Note that by setting -S flag when invoking sqlplus will prevent it from writing the startup and termination messages to stdout output interface

Once a session is created, we can write our command(s) to stdin input interface (similar to as we would be writing on command window).

session.stdin.write(‘select sysdate from dual;’)
session.stdin.write(‘select first_name from employees;’)

Note that just writing to stdin input interface will not communicate the results to stdout. For this we need to call communicate function.

stdout, stderr = session.communicate()

This function returns two values (a tuple): stdout holds the results of our queries written on stdin input interface while stderr holds the error messages.

The following code creates a wrapper function that takes the sqlCommand and connectString as inputs and returns the queryResult and errorMessage (if any):


from subprocess import Popen, PIPE

#function that takes the sqlCommand and connectString and returns the queryReslut and errorMessage (if any)
def runSqlQuery(sqlCommand, connectString):
   session = Popen(['sqlplus', '-S', connectString], stdin=PIPE, stdout=PIPE, stderr=PIPE)
   session.stdin.write(sqlCommand)
   return session.communicate()

With the above function, running sql queries, calling stored procedures or executing a sql script file becomes quite easy.

#example 1: run a query that returns a numeric value
connectString = 'hr/hr'
sqlCommand = 'select count(*) from jobs;'
queryResult, errorMessage = runSqlQuery(sqlCommand, connectString)


#example 2: run a query that returns a next value of sequence
connectString = 'hr/hr'
sqlCommand = 'select employees_seq.nextval from jobs;'
queryResult, errorMessage = runSqlQuery(sqlCommand, connectString)

In the above examples, the queryResult variable will hold a string  (i.e.the formatted output displayed by sqlplus). For example in example 1, queryResult will hold a string like shown below:

‘\r\n  COUNT(*)\r\n———-\r\n\t19\r\n\r\n’

where the number 19 in this string resembles the total count of records in jobs table


#example 3: call stored procedure secure_dml
connectString = 'hr/hr'
sqlCommand = 'execute  secure_dml;'
queryResult, errorMessage = runSqlQuery(sqlCommand, connectString)


#example 4: run a sql commands defined in .sql file (assuming that ‘test.sql’ file is on same path from where the python session has been invoked)
connectString = ‘hr/hr’
sqlCommand = ‘@test.sql’
queryResult, errorMessage = runSqlQuery(sqlCommand, connectString)


#example 5: run a sql commands defined in .sql file (similar to example 4) but with some parameter like table name
#----- assuming the content of test.sql file is: select count(*) from &1;
connectString = ‘hr/hr’
sqlCommand = ‘@test.sql employees’
queryResult, errorMessage = runSqlQuery(sqlCommand, connectString)

Please note that the above examples will work on Python Version 2.x. In order to run it on Python Version 3.x, the only change required is to pass the sqlCommand as a bytes object (not as a string object). In python a string can be converted to bytes object by prefixing it with letter b, for example

sqlCommand = b’select count(*) from jobs;’


With this exception, rest of code will run as described above. If you are interested in further details then go through the pages 402-405 of “Python Essential Reference” by David M. Beazley

Posted in Oracle, Python, SQL*Plus | Tagged , , , , , , , , , | 12 Comments