I was wondering if you could help me. I've looked through the documentation but can't find a clear answer.
Is it possible to use the window_partition_clause [i.e. OVER( PARTITION BY blah )] with transform functions? And how do I do this with R?
More precisely:
I've been trying to run the k-means clustering algorithm as a polymorphic transform function, like the example in the documentation. My code is here:
Code: Select all
# k-means polymorphic algorithm
# Input: A dataframe consisting of one column of labels and then n metrics
# Output: A dataframe with the label column and a column stating which cluster each data point belongs to
kmeans_clusterPoly <- function(x, y)
{
  # load required packages
  # library(cluster)
  # Parameter check: number of clusters to be made, k.
  if (!is.null(y[['k']])) {
    k <- as.numeric(y[['k']])
  } else {
    stop("Expected parameter k. Syntax '... USING PARAMETERS k=3)'")
  }
  # Get the number of columns in the input dataframe
  cols <- ncol(x)
  # Run the k-means algorithm on the metric columns (everything after the label column)
  cl <- kmeans(x[, 2:cols], k)
  # Extract the clustering vector
  Result <- cl$cluster
  # Return the labels and cluster assignments to Vertica
  Result <- data.frame(x[, 1], Result)
  Result
}
kmeans_clusterPolyFactory <- function()
{
  list(
    name = kmeans_clusterPoly,        # function that does the processing
    udxtype = c("transform"),         # type of the function
    intype = c("any"),                # input types
    outtype = c("int", "int"),        # output types
    parametertypecallback = kmeans_clusterPolyParameters
  )
}
kmeans_clusterPolyParameters <- function()
{
  # One row per parameter: datatype, length, scale, name
  params <- data.frame(datatype = rep(NA, 1), length = rep(NA, 1), scale = rep(NA, 1), name = rep(NA, 1))
  params[1, 1] <- "int"
  params[1, 4] <- "k"
  params
}
I can run the code fine using the SELECT statement below:
Code: Select all
SELECT
kmeans_clusterPoly( c2, c3 USING PARAMETERS k = 4) over( )
FROM
t1
Code: Select all
SELECT
kmeans_clusterPoly( c2, c3 USING PARAMETERS k = 4) over( partition by c1 )
FROM
t1
But when I add PARTITION BY c1, as in the second query above, I get the following error:
ERROR 3399: Failure in UDx RPC call InvokeProcessPartition(): Error calling processPartition() in User Defined Object [kmeans_clusterPoly] at [/scratch_a/release/vbuild/vertica/UDxFence/RInterface.cpp:1342], error code: 0, message: Exception in processPartitionForR: [cannot take a sample larger than the population when 'replace = FALSE']
What do I need to do? Is there a way to write the 'Partition' in a similar way to the Factory or Parameters? Where can I find documentation for this?
Or do I need to write the partitioning in the R code directly? Is it the case that transform functions can only be applied as one big 'function' acting on the whole 'dataframe'?
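One possible clue about the error above, which I have not confirmed on the Vertica side: plain R should give the same message when kmeans() is asked for more clusters than there are rows. If, as the processPartition() in the error suggests, the R function is called once per partition, then a value of c1 with fewer than 4 rows would trigger it. Below is a minimal sketch of that assumption in plain R. The toy data and the safe_kmeans_labels helper are hypothetical, and giving each row its own cluster is just one possible fallback for a small partition.

```r
# Toy data standing in for one small partition: 3 rows, 2 metrics (hypothetical values).
small_partition <- data.frame(c2 = c(1, 2, 3), c3 = c(10, 20, 30))

# Ask for more clusters (4) than there are rows (3) and capture the error message.
msg <- tryCatch(
  {
    kmeans(small_partition, 4)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical guard: only call kmeans() when the partition has more rows than k.
# Otherwise fall back to putting each row in its own cluster.
safe_kmeans_labels <- function(metrics, k) {
  metrics <- as.matrix(metrics)
  n <- nrow(metrics)
  if (n <= k) {
    return(seq_len(n))
  }
  kmeans(metrics, k)$cluster
}

print(safe_kmeans_labels(small_partition, 4))
```

Inside kmeans_clusterPoly, the same check could replace the direct kmeans() call, so that small partitions return a result instead of aborting the whole query.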
Thanks in advance for any help