Wednesday, April 8, 2015

Web Scraping using R


#================================Example USPS Web Data===============================
#Scrape all the information about USPS post office locations from the web#
#Read URL lines
#One can change the "=06484" part to get information about other ZIP areas
r.page2 <- readLines("...")  #URL truncated in the original post; it was the USPS locator query for ZIP 06484

#Regular Expressions of different info needed
LName_pattern <- 'locationName=([^&]*)'
lon_pattern <- 'longitude=([^&]*)'
lat_pattern <- 'latitude=([^&]*)' 
add_pattern <- 'address1=([^&]*)' 
city_pattern <- 'city=([^&]*)' 
state_pattern <- 'state=([^&]*)' 
zip_pattern <- 'zip5=([^&]*)' 

#See what the dataline looks like
grep(lon_pattern, r.page2, value = TRUE)

#Define two helper functions
getexpr <- function(s, g) substring(s, g, g + attr(g, 'match.length') - 1)

WebScrape <- function(mypattern, webpage){
  datalines <- grep(mypattern, webpage, value = TRUE)
  gg <- gregexpr(mypattern, datalines)
  matches <- mapply(getexpr, datalines, gg)
  result <- gsub(mypattern, '\\1', matches)
  names(result) <- NULL
  result
}
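As a quick sanity check, the two helpers can be exercised on a hypothetical query-string line; the location values below are made up for illustration, not real USPS data:

```r
#Self-contained sketch: the two helpers plus a made-up sample page
getexpr <- function(s, g) substring(s, g, g + attr(g, 'match.length') - 1)

WebScrape <- function(mypattern, webpage){
  datalines <- grep(mypattern, webpage, value = TRUE)
  gg <- gregexpr(mypattern, datalines)
  matches <- mapply(getexpr, datalines, gg)
  result <- gsub(mypattern, '\\1', matches)
  names(result) <- NULL
  result
}

#Hypothetical lines in the style of the USPS response
page <- c("some other markup",
          "locationName=MAIN OFFICE&latitude=41.31&longitude=-73.13")

WebScrape('locationName=([^&]*)', page)   #"MAIN OFFICE"
WebScrape('latitude=([^&]*)', page)       #"41.31"
```

The `([^&]*)` capture group grabs everything up to the next `&` separator, and the `'\\1'` backreference in gsub() strips the `key=` prefix off each match.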

#Save the results into variables
LName <- WebScrape(LName_pattern, r.page2)
lat <- WebScrape(lat_pattern, r.page2)
lon <- WebScrape(lon_pattern, r.page2)
add <- WebScrape(add_pattern, r.page2)
city <- WebScrape(city_pattern, r.page2)
state <- WebScrape(state_pattern, r.page2)
zip <- WebScrape(zip_pattern, r.page2)

#Output the overall results

USPS_Location <- cbind(LName, lat, lon, add, city, state, zip)
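Note that cbind() on character vectors yields a character matrix. For downstream analysis (mapping, distance calculations) a data frame with numeric coordinates is usually handier; a sketch with hypothetical placeholder values, not scraped results:

```r
#Hypothetical placeholder values standing in for the scraped vectors
LName <- c("MAIN OFFICE", "DOWNTOWN")
lat   <- c("41.31", "41.29")
lon   <- c("-73.13", "-73.10")

#Build a data frame, converting the coordinate strings to numeric
USPS_Location <- data.frame(LName = LName,
                            lat = as.numeric(lat),
                            lon = as.numeric(lon),
                            stringsAsFactors = FALSE)
```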

Thursday, January 29, 2015

An Integrated-Documentation Tool for Oracle Databases

If you have ever had to analyze the data in a large Oracle database with no documentation, this integrated-documentation tool is for you. Check out this PL/SQL-generated HTML database dictionary. 

Download the code from Google Drive.


Tuesday, January 13, 2015

Creating customer journeys in R

Attached is a sample of creating customer journeys in R; this example used a SQL Server database as a backend.

#==========================EM Clustering on the text===============================#

#Word frequency (m is assumed to be a term-document matrix: rows = terms, columns = documents)
wordFreq <- sort(rowSums(m), decreasing = TRUE)

#String-count function: for each element of x, count how many of its
#split-apart tokens match pattern
strcount <- function(x, pattern, split){
  sapply(strsplit(x, split),
         function(z) length(grep(pattern, z)))
}
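To see what strcount() computes, here is a self-contained version run on two made-up sentences:

```r
#Count how many whitespace-separated tokens of each element match a pattern
strcount <- function(x, pattern, split){
  sapply(strsplit(x, split),
         function(z) length(grep(pattern, z)))
}

strcount(c("the cat sat", "a dog barked"), "cat", " ")   #1 0
```

Each sentence is split into tokens, and grep() reports which tokens match, so the result is one count per input sentence.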

#Find counts of each word for each record
string <- names(wordFreq[wordFreq > 40])

#Null data frame for storing the counts
count.string <- matrix(0, ncol = length(string), nrow = nrow(MN1))
count.string <- data.frame(count.string)
for (i in 1:length(string)){
  count.string[, i] <- strcount(tolower(MN1[, 1]), string[i], " ")
}
colnames(count.string) <- string

#Combine the data frame with the original data records
MN2 <- cbind(MN1[ ,1], count.string)

#EM clustering with Mclust (requires the mclust package), fixing 3 clusters
library(mclust)
mc <- Mclust(MN2[, 2:ncol(MN2)], G = 3)
plot(mc, what = c('classification'),
     dimens = c(3, 4))

#Keep the records assigned to cluster 2
MN2.output <- MN2[mc$classification == 2, ]