Query to find duplicates across multiple columns & time out issues
Hi all,
At first, I was trying to run this query against a MySQL table to find duplicates in 1 column. The table has about 1,200,000 records:
SELECT * FROM SomeTable WHERE Address IN (SELECT Address FROM SomeTable GROUP BY Address HAVING count(Address) > 1 ) ORDER BY Address
However, this query was causing timeouts with not results returned. The Address column is indexed.
So, I left out the "ORDER BY" clause, and it works...but returns over 1 million records. So, I need to add additional columns to check for dups.
The results from an EXPLAIN:
Generated by: phpMyAdmin 3.4.5 / MySQL 5.1.56
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY SomeTable ALL NULL NULL NULL NULL 1212743 Using where; Using filesort
2 DEPENDENT SUBQUERY SomeTable index NULL Address 42 NULL 3 Using index
The results from a SHOW CREATE TABLE SomeTable:
SomeTable
CREATE TABLE `SomeTable` (
`Column1` varchar(10) NOT NULL DEFAULT '',
`Column2` varchar(30) NOT NULL DEFAULT '',
`Column3` varchar(50) NOT NULL DEFAULT '',
`Column4` bigint(20) NOT NULL DEFAULT '0',
`Column5` varchar(22) DEFAULT NULL,
`Column6` enum('false','true') NOT NULL DEFAULT 'false',
`Column7` tinyint(1) NOT NULL DEFAULT '0',
`Column8` decimal(7,2) NOT NULL DEFAULT '0.00',
`Address` varchar(40) NOT NULL DEFAULT '',
`Column10` varchar(30) NOT NULL DEFAULT '',
`Column11` char(2) NOT NULL DEFAULT '',
`Column12` mediumint(5) NOT NULL DEFAULT '0',
`Column13` tinyint(2) NOT NULL DEFAULT '0',
`Column14` decimal(3,2) NOT NULL DEFAULT '0.00',
`Column15` tinyint(1) NOT NULL DEFAULT '0',
`Column16` tinyint(1) DEFAULT NULL,
`Column17` tinyint(1) DEFAULT NULL,
`Column18` set('0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19') DEFAULT NULL,
`Column19` set('0','1','2','3','4','5','6','7','8','9','10') DEFAULT NULL,
`Column20` set('0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17') DEFAULT NULL,
`Column21` varchar(40) NOT NULL DEFAULT '',
`Column22` text,
`Column23` varchar(26) DEFAULT NULL,
`Column24` varchar(27) DEFAULT NULL,
`Column25` varchar(200) DEFAULT NULL,
`Column26` enum('true','false') NOT NULL DEFAULT 'false',
`Column27` enum('Y','N') NOT NULL DEFAULT 'Y',
`Date_Created` int(8) NOT NULL DEFAULT '0',
`Last_Mod_Stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`Column1`),
KEY `Address` (`Address`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
This works (but too many rows returned):
SELECT * FROM SomeTable WHERE Address IN (SELECT Address FROM SomeTable GROUP BY Address HAVING count(Address) > 1 )
This is the query I've got, but it's giving a syntax error near "Column 8 IN":
SELECT * FROM SomeTable WHERE Address,Column8 IN (SELECT Address, Column 8 FROM SomeTable GROUP BY Address, Column8 HAVING count(Address, Column8) > 1 )
At minimum, I'd like to add that Column 8 check for duplicates, at best, I'd like to add dup checks for Columns 18,19 and 20 as well if possible.
Also, I'd like to keep the "SELECT *" because I'd like to see the rest of the columns as well.
Ideas on adding more columns to the dup check?