MySQL Forums
Forum List  »  Newbie

Query to find duplicates across multiple columns & time out issues
Posted by: We5 InElGr
Date: June 30, 2012 03:39PM

Hi all,

At first, I was trying to run this query against a MySQL table to find duplicates in 1 column. The table has about 1,200,000 records:

SELECT * FROM SomeTable WHERE Address IN (SELECT Address FROM SomeTable GROUP BY Address HAVING count(Address) > 1 ) ORDER BY Address

However, this query was causing timeouts with not results returned. The Address column is indexed.

So, I left out the "ORDER BY" clause, and it works...but returns over 1 million records. So, I need to add additional columns to check for dups.

The results from an EXPLAIN:
Generated by: phpMyAdmin 3.4.5 / MySQL 5.1.56

id select_type table type possible_keys key key_len ref rows Extra

1 PRIMARY SomeTable ALL NULL NULL NULL NULL 1212743 Using where; Using filesort

2 DEPENDENT SUBQUERY SomeTable index NULL Address 42 NULL 3 Using index


The results from a SHOW CREATE TABLE SomeTable:

SomeTable

CREATE TABLE `SomeTable` (
`Column1` varchar(10) NOT NULL DEFAULT '',
`Column2` varchar(30) NOT NULL DEFAULT '',
`Column3` varchar(50) NOT NULL DEFAULT '',
`Column4` bigint(20) NOT NULL DEFAULT '0',
`Column5` varchar(22) DEFAULT NULL,
`Column6` enum('false','true') NOT NULL DEFAULT 'false',
`Column7` tinyint(1) NOT NULL DEFAULT '0',
`Column8` decimal(7,2) NOT NULL DEFAULT '0.00',
`Address` varchar(40) NOT NULL DEFAULT '',
`Column10` varchar(30) NOT NULL DEFAULT '',
`Column11` char(2) NOT NULL DEFAULT '',
`Column12` mediumint(5) NOT NULL DEFAULT '0',
`Column13` tinyint(2) NOT NULL DEFAULT '0',
`Column14` decimal(3,2) NOT NULL DEFAULT '0.00',
`Column15` tinyint(1) NOT NULL DEFAULT '0',
`Column16` tinyint(1) DEFAULT NULL,
`Column17` tinyint(1) DEFAULT NULL,
`Column18` set('0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19') DEFAULT NULL,
`Column19` set('0','1','2','3','4','5','6','7','8','9','10') DEFAULT NULL,
`Column20` set('0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17') DEFAULT NULL,
`Column21` varchar(40) NOT NULL DEFAULT '',
`Column22` text,
`Column23` varchar(26) DEFAULT NULL,
`Column24` varchar(27) DEFAULT NULL,
`Column25` varchar(200) DEFAULT NULL,
`Column26` enum('true','false') NOT NULL DEFAULT 'false',
`Column27` enum('Y','N') NOT NULL DEFAULT 'Y',
`Date_Created` int(8) NOT NULL DEFAULT '0',
`Last_Mod_Stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`Column1`),
KEY `Address` (`Address`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

This works (but too many rows returned):
SELECT * FROM SomeTable WHERE Address IN (SELECT Address FROM SomeTable GROUP BY Address HAVING count(Address) > 1 )

This is the query I've got, but it's giving a syntax error near "Column 8 IN":

SELECT * FROM SomeTable WHERE Address,Column8 IN (SELECT Address, Column 8 FROM SomeTable GROUP BY Address, Column8 HAVING count(Address, Column8) > 1 )

At minimum, I'd like to add that Column 8 check for duplicates, at best, I'd like to add dup checks for Columns 18,19 and 20 as well if possible.

Also, I'd like to keep the "SELECT *" because I'd like to see the rest of the columns as well.

Ideas on adding more columns to the dup check?

Options: ReplyQuote


Subject
Written By
Posted
Query to find duplicates across multiple columns & time out issues
June 30, 2012 03:39PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.