GSoC Part 1 - Explore the Mechanism in MariaDB
Overview
My task is to implement regression functions in MariaDB, which was mentioned in MDEV-17467. This feature has been implemented by many MariaDB competitors, like PostgreSQL, Oracle, and so on.
You can see my pr here
Tasks
The functions I need to implement are following:
REGR_SLOPE: Returns the slope, equal toREGR_SXY/REGR_SYYREGR_INTERCEPT: Returns the y-intercept of the regression line, which is equal toREGR_AVGX-REGR_SLOPE*REGR_AVGYREGR_COUNT: Returns the number of non-empty pairs of numbers used to fill the regression lineREGR_R2: The return value is equal to (REGR_SXY*REGR_SXY) / (REGR_SYY*REGR_SXX)REGR_AVGX: calculate the mean of the independent variable (expr2) of the regression line, after removing the null pair (expr1, expr2), is equal toAVG(expr2)REGR_AVGY: Calculate the mean of the strain variable (expr1) of the regression line, after removing the null pair (expr1, expr2), which is equal toAVG(expr1)REGR_SXX: The return value is equal toREGR_COUNT*VAR_POP(expr2)REGR_SYY: The return value is equal toREGR_COUNT*VAR_POP(expr1)REGR_SXY: The return value is equal toREGR_COUNT*COVAR_POP(expr1, expr2)
Mechanism
How to define a function in MariaDB
For native functions, it is easy to define. All native functions are defined in sql/item_create.cc. You can add function definitions there, parser can successfully recognize them.
In terms of non-native functions, this process is different. You can follow these steps, if you add a new aggregate functions:
sql\lex.h: list of function symbolssql\sql_yacc.yy: contains the definition of the SQL language subset understood by MySQLsql\item_sum.cc: list all of aggregate functions
Basically, you could only modify these three files to create a new non-native function.
How does the aggregate function work?
The typical aggregate function has the following member:
add(): iterate each row and update the resultval_real(): return the final resultreset_field(): before iterating each row, reset the class membersupdate_field(): update and store The processes of querys without and withGROUP BYare different.
If you use the query without GROUP BY, like SELECT SUM(value) FROM T1, the parser will find SUM and invoke Item_sum_sum(thd, $3) constructor. Then the parser goes through each row, and update the result by functions above. But if using GROUP BY in the query, the class will call constructor Item_sum_sum(THD *thd, Item_sum_sum *item):. The result will be provided by Item_sum_field class.
How is the value stored?
I used the class Regression_result to store all the results related to regression functions. The main components are below
class Regression_result
{
// store results
Stddev sxx, syy;
double sxy;
double sx, sy;
ulonglong N;
public:
// iterate over the results
void recurrence_next(double nr_y, double nr_x);
// for window function
void remove(double nr_y, double nr_x);
};
How could we iterate the value instead of calculating at the end of the process?
The difficult point is to calculate sxy, syy and sxx. Because they are equivalent to variance and covariance, I adopted Welford’s online algorithm to calculate them. I defined recurrence_next to implement it.
void Regression_result::recurrence_next(double nr_y, double nr_x)
{
if (N++)
{
sx+= nr_x;
sy+= nr_y;
double diff = nr_x - sxx.get_m();
sxx.recurrence_next(nr_x);
syy.recurrence_next(nr_y);
sxy= sxy + diff * (nr_y - syy.get_m());
}
else
{
sx= nr_x;
sy= nr_y;
sxx.recurrence_next(nr_x);
syy.recurrence_next(nr_y);
}
}
In this way, the result can be calculated iteratively.
On the other hand, if we only want to efficiently calculate the result in window function , like
REGR_AVGX(age, weight) OVER(ORDER BY id rows between 2 preceding and 2 following) as avgx
the result need to remove the first one in order to calculate the next. The implementation is similar to the recurrence_next
void Regression_result::remove(double nr_y, double nr_x)
{
if (N > 0)
{
sx-= nr_x;
sy-= nr_y;
sxx.remove(nr_x);
double diff = nr_x - sxx.get_m();
sxy= sxy - diff * (nr_y - syy.get_m());
syy.remove(nr_y);
N--;
}
}
What should I do next step?
- Figure out memory management in aggregate functions
- Complete all functions with and without
GROUP BY.