Efficient handling of a huge number of columns
I have a "table" containing millions of rows and thousands of columns. The columns are essentially booleans; on average, each row has around 10-20 columns that are True/1, but this varies widely, and the columns are partially correlated. So this is basically a large, sparse matrix. The database server needs to handle hundreds of queries per second, so retrieval has to be efficient in MySQL.
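For concreteness, the data and the query can be modeled as a set of True-column ids per row, with each query being a subset test; a toy sketch (this is just the logical model, not a storage proposal):

```python
# Toy model of the sparse boolean matrix: each row is represented by
# the set of column ids that are True for that row.
rows = {
    1: {123, 479, 899},
    2: {123, 889, 1573},
    3: {123, 889, 899, 1573},
}

def match(rows, wanted):
    """Return row ids whose True columns include all wanted columns."""
    wanted = set(wanted)
    return sorted(r for r, cols in rows.items() if wanted <= cols)

print(match(rows, [123, 889, 1573]))  # [2, 3]
```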
Ignoring for the moment questions about how to physically implement the "table", if we assume the "table" is actually a single table with millions of rows and thousands of columns, then my queries are of the following nature:
select
mytable.mykey
from
mytable
where
mytable.colX = 1
and mytable.colY = 1
and mytable.colZ = 1
The number and identity of the columns vary from query to query.
Writes to the table are infrequent, and can be eliminated if required.
My problem: how to implement this efficiently in MySQL, considering both storage requirements and query retrieval speed.
A table with millions of rows and thousands of columns is too big, to say nothing of indexing all those columns. And since several columns are involved in each query, multi-column indexes would be required for maximum retrieval speed.
The next obvious technique is a 2 column table:
create table pairwise (
row_id int(11) not null,
column_id mediumint(8) not null,
key column_id (column_id)
)
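One detail I'm unsure about (an assumption, not something I have benchmarked on MySQL): an index on column_id alone still forces a lookup into the base row to fetch row_id, whereas a composite index on (column_id, row_id) can satisfy the lookup from the index alone. A quick sketch using SQLite as a stand-in:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table pairwise ("
            "row_id integer not null, column_id integer not null)")
# Composite index: a lookup by column_id can return row_id straight
# from the index (a "covering" index), without touching base rows.
con.execute("create index idx_col_row on pairwise (column_id, row_id)")
con.executemany("insert into pairwise values (?, ?)",
                [(1, 123), (1, 479), (2, 123), (2, 889), (3, 889)])

rows = con.execute("select row_id from pairwise "
                   "where column_id = 123 order by row_id").fetchall()
print([r[0] for r in rows])  # [1, 2]
```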
This stores the sparse table reasonably efficiently on disk (in my case a few hundred million rows), but retrieval is cumbersome:
select
p1.row_id
from
pairwise as p1,
pairwise as p2,
pairwise as p3
where
p1.row_id = p2.row_id
and p1.row_id = p3.row_id
and p1.column_id = X
and p2.column_id = Y
and p3.column_id = Z
Is this query efficient in MySQL? MySQL seems to pick the least frequent column and then brute-force check the other columns, but I am not sure.
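An equivalent formulation that avoids the N-way self-join is to count matches per row with GROUP BY / HAVING; I have not tested this on MySQL at scale, but the idea is easy to verify with SQLite as a stand-in:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table pairwise (row_id integer, column_id integer)")
con.executemany("insert into pairwise values (?, ?)",
                [(1, 10), (1, 20), (2, 10), (2, 20), (2, 30), (3, 30)])

# Rows having ALL of columns 10, 20, 30: a row qualifies exactly when
# it matches as many distinct wanted columns as were requested.
wanted = (10, 20, 30)
sql = ("select row_id from pairwise where column_id in (?, ?, ?) "
       "group by row_id having count(distinct column_id) = ?")
rows = [r[0] for r in con.execute(sql, wanted + (len(wanted),))]
print(rows)  # [2]
```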
Another thought is to use the boolean fulltext capabilities:
create table rows_as_text (
row_id int(11) not null,
cols_as_text text,
primary key (row_id),
fulltext cols_as_text (cols_as_text)
)
Each row would have cols_as_text = 'c123 c479 c899 c1024 c1573' or similar (a letter followed by digits is considered a "word" by the fulltext indexer).
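Building the cols_as_text value from a row's set of True columns is straightforward; a minimal sketch of the encoding described above (the 'c' prefix makes each id an indexable "word"):

```python
def cols_as_text(column_ids):
    """Encode a row's True column ids as fulltext-indexable words.

    Each id becomes a token like 'c123'; a letter followed by digits
    is treated as a word by the fulltext indexer.
    """
    return " ".join("c%d" % c for c in sorted(column_ids))

print(cols_as_text({1573, 123, 479, 899, 1024}))
# c123 c479 c899 c1024 c1573
```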
Queries would be of the form:
select
row_id
from
rows_as_text
where
match (cols_as_text) against ('c123 c889 c1573' in boolean mode)
Compared to using table pairwise, this approach:
- uses more disk space
- takes an eternity to build the fulltext index
- yields a more compact query
- shows highly variable query response times
Does anybody have any better ways of handling this problem? Any tips or advice would be appreciated.
Stephen