GSoC Part 1 - Explore the Mechanism in MariaDB
Overview
My task is to implement regression functions in MariaDB, which was mentioned in MDEV-17467. This feature has been implemented by many MariaDB competitors, like PostgreSQL, Oracle, and so on.
You can see my pr here
Tasks
The functions I need to implement are following:
REGR_SLOPE
: Returns the slope, equal toREGR_SXY
/REGR_SYY
REGR_INTERCEPT
: Returns the y-intercept of the regression line, which is equal toREGR_AVGX
-REGR_SLOPE
*REGR_AVGY
REGR_COUNT
: Returns the number of non-empty pairs of numbers used to fill the regression lineREGR_R2
: The return value is equal to (REGR_SXY
*REGR_SXY
) / (REGR_SYY
*REGR_SXX
)REGR_AVGX
: calculate the mean of the independent variable (expr2) of the regression line, after removing the null pair (expr1, expr2), is equal toAVG(expr2)
REGR_AVGY
: Calculate the mean of the strain variable (expr1) of the regression line, after removing the null pair (expr1, expr2), which is equal toAVG(expr1)
REGR_SXX
: The return value is equal toREGR_COUNT
*VAR_POP(expr2)
REGR_SYY
: The return value is equal toREGR_COUNT
*VAR_POP(expr1)
REGR_SXY
: The return value is equal toREGR_COUNT
*COVAR_POP(expr1, expr2)
Mechanism
How to define a function in MariaDB
For native functions, it is easy to define. All native functions are defined in sql/item_create.cc
. You can add function definitions there, parser can successfully recognize them.
In terms of non-native functions, this process is different. You can follow these steps, if you add a new aggregate functions:
sql\lex.h
: list of function symbolssql\sql_yacc.yy
: contains the definition of the SQL language subset understood by MySQLsql\item_sum.cc
: list all of aggregate functions
Basically, you could only modify these three files to create a new non-native function.
How does the aggregate function work?
The typical aggregate function has the following member:
add()
: iterate each row and update the resultval_real()
: return the final resultreset_field()
: before iterating each row, reset the class membersupdate_field()
: update and store The processes of querys without and withGROUP BY
are different.
If you use the query without GROUP BY
, like SELECT SUM(value) FROM T1
, the parser will find SUM
and invoke Item_sum_sum(thd, $3)
constructor. Then the parser goes through each row, and update the result by functions above. But if using GROUP BY
in the query, the class will call constructor Item_sum_sum(THD *thd, Item_sum_sum *item):
. The result will be provided by Item_sum_field
class.
How is the value stored?
I used the class Regression_result
to store all the results related to regression functions. The main components are below
class Regression_result
{
// store results
Stddev sxx, syy;
double sxy;
double sx, sy;
ulonglong N;
public:
// iterate over the results
void recurrence_next(double nr_y, double nr_x);
// for window function
void remove(double nr_y, double nr_x);
};
How could we iterate the value instead of calculating at the end of the process?
The difficult point is to calculate sxy
, syy
and sxx
. Because they are equivalent to variance and covariance, I adopted Welford’s online algorithm to calculate them. I defined recurrence_next
to implement it.
void Regression_result::recurrence_next(double nr_y, double nr_x)
{
if (N++)
{
sx+= nr_x;
sy+= nr_y;
double diff = nr_x - sxx.get_m();
sxx.recurrence_next(nr_x);
syy.recurrence_next(nr_y);
sxy= sxy + diff * (nr_y - syy.get_m());
}
else
{
sx= nr_x;
sy= nr_y;
sxx.recurrence_next(nr_x);
syy.recurrence_next(nr_y);
}
}
In this way, the result can be calculated iteratively.
On the other hand, if we only want to efficiently calculate the result in window function , like
REGR_AVGX(age, weight) OVER(ORDER BY id rows between 2 preceding and 2 following) as avgx
the result need to remove the first one in order to calculate the next. The implementation is similar to the recurrence_next
void Regression_result::remove(double nr_y, double nr_x)
{
if (N > 0)
{
sx-= nr_x;
sy-= nr_y;
sxx.remove(nr_x);
double diff = nr_x - sxx.get_m();
sxy= sxy - diff * (nr_y - syy.get_m());
syy.remove(nr_y);
N--;
}
}
What should I do next step?
- Figure out memory management in aggregate functions
- Complete all functions with and without
GROUP BY
.