MySQL Forums
Forum List  »  Performance

How can I improve this table and query
Posted by: Adam Copley
Date: February 17, 2016 04:49AM

I have inherited this table of POS transaction level data. In my opinion it is terrible. There are 3.5 million rows, which span ~4 years, with a distribution of anywhere between 1 result and 100k results for things like `sku`, `brand`, `nationality` and `destination`.

There is no primary key, and I am struggling to figure how I can implement one.

here is the `CREATE TABLE`

CREATE TABLE `transaction_data` (
`txn_id` varchar(32) DEFAULT NULL,
`nationality_id` int(3) DEFAULT NULL,
`transaction date` int(8) DEFAULT NULL,
`YrQtr` varchar(6) DEFAULT NULL,
`Yr` int(4) DEFAULT NULL,
`mth` int(2) DEFAULT NULL,
`dy` int(2) DEFAULT NULL,
`destination` varchar(80) DEFAULT NULL,
`sku` int(6) DEFAULT NULL,
`sales` decimal(8,2) DEFAULT NULL,
`units` int(8) DEFAULT NULL,
KEY `txnid` (`txn_id`),
KEY `cov` (`sku`,`nationality_id`,`Yr`,`YrQtr`),
KEY `period` (`Yr`,`YrQtr`,`mth`),
KEY `natperiod` (`nationality_id`,`Yr`,`YrQtr`),
KEY `desperiod` (`destination`,`Yr`,`YrQtr`)

Here are 4 example rows of data:

**txn_id** - is a non unique weird concatenate of field values, which represent which **basket** the **sku** was purchased in.

In some **transactions**, the **sku** is not unique, as POS reductions are recorded by another row with the same `txn_id` and `sku` but with negative values in sales and units to reflect those.

INSERT INTO `2_sar`.`transaction_data`
`transaction date`,
('1_20120113_EZY 1832_3_08:58:00', '143', '20120113', '2012Q1', '2012', '1', '13', 'United Kingdom - Manchstr UK', '30312', '52.00', '1'
), ('1_20120121_OR 367_1_11:21:00', '143', '20120121', '2012Q1', '2012', '1', '21', 'Neth Antills - Curacao', '710028', '3.95', '1'
), ('1_20120203_PC 672_2_13:07:00', '200', '20120203', '2012Q1', '2012', '2', '3', 'Turkey - Istanbul', '1840', '27.95', '1'
), ('1_20120203_PC 672_2_13:07:00', '200', '20120203', '2012Q1', '2012', '2', '3', 'Turkey - Istanbul', '2156', '12.95', '1'

Here is a query which I run on the data.

This query finds the top 10 brands that are purchased with `<Brand>` (not including `<brand>`) ordered by total number of transactions.

This query took 11.344 seconds to run directly from mysql workbench. The server is a t2.micro AWS RDS Instance, and I usually run the queries from a separate EC2 web server. which I believe is (1GHz, 1GB).

The wildcards in this query represent `<select>` filters on the PHP page which the query is run from. They may or may not be selected.

SELECT `products`.`brand`, COUNT(DISTINCT transaction_data.txn_id) AS numOrders
FROM `products`
INNER JOIN `transaction_data` ON `transaction_data`.sku=`products`.sku
INNER JOIN `nationalities` ON `transaction_data`.`nationality_id`=`nationalities`.`nationality_id`
INNER JOIN (SELECT DISTINCT `transaction_data`.`txn_id`
FROM `transaction_data`
INNER JOIN `products` USING (sku)
INNER JOIN `nationalities` USING (nationality_id)
WHERE brand = <Brand>) AS tmp_txns ON tmp_txns.txn_id=`transaction_data`.txn_id
WHERE brand <> <Brand> AND
nationality LIKE '%' AND
Yr = '2014' AND
YrQtr LIKE '%'
GROUP BY brand
ORDER BY numOrders DESC, brand ASC

And the explain:

1 PRIMARY <derived2> ALL 1295294 Using where; Using temporary; Using filesort
1 PRIMARY transaction_data ref txnid,cov,period,natperiod txnid 99 tmp_txns.txn_id 1 Using where
1 PRIMARY nationalities eq_ref PRIMARY,natv PRIMARY 4 2_sar.transaction_data.nationality_id 1 Using where
1 PRIMARY products ref skubrand skubrand 5 2_sar.transaction_data.sku 1 Using where; Using index
2 DERIVED products index skubrand skubrand 128 2669 Using where; Using index; Using temporary
2 DERIVED transaction_data ref txnid,cov,natperiod cov 5 2_sar.products.sku 647 Using index condition
2 DERIVED nationalities eq_ref PRIMARY,natv PRIMARY 4 2_sar.transaction_data.nationality_id 1 Using index

Here are create table statements for the reference tables:

CREATE TABLE `nationalities` (
`nationality_id` int(3) NOT NULL,
`nationality` varchar(14) DEFAULT NULL,
PRIMARY KEY (`nationality_id`),
KEY `natv` (`nationality_id`,`nationality`)

CREATE TABLE `products` (
`sku` int(6) DEFAULT NULL,
`category` varchar(40) DEFAULT NULL,
`subcategory` varchar(40) DEFAULT NULL,
`brand` varchar(40) DEFAULT NULL,
`product name` varchar(50) DEFAULT NULL,
`pack size` decimal(8,2) DEFAULT NULL,
`rk_group` varchar(20) DEFAULT NULL,
`rk_category` varchar(30) DEFAULT NULL,
KEY `skubrand` (`sku`,`brand`)

**My Thoughts So Far**

The `txn_id` is almost guaranteed to bring poor performance, not only due to the fact it's non-unique, but also because it's a string rather than an int. But I need some sort of non-unique identifier for the basket.

I think my `indexes` are okay. All my queries use them, and obviously some `indexes` are specific to certain queries where a single composite `index` won't serve the different `where` clauses in queries.

Options: ReplyQuote

Written By
How can I improve this table and query
February 17, 2016 04:49AM

Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.