Re: Performance Problems for large inserts into INNODB table via LOAD DATA
> PRIMARY KEY (TKT_ID), UNIQUE KEY iux_TicketsTKT_ID(TKT_ID),
You said: A PRIMARY KEY is a UNIQUE KEY. So the second is totally redundant; recommend DROPing it. Having it significantly slows down the load.
My response: I am confused about this one. Are you saying that MySQL created two unique indexes on TKT_ID: one for the UNIQUE KEY and one for the PRIMARY KEY?
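As I understand the point, a PRIMARY KEY is itself a unique index, so the extra UNIQUE KEY adds no new constraint, only a second index to maintain on every insert. A minimal demonstration, using SQLite from the Python standard library as a stand-in for InnoDB:

```python
import sqlite3

# A PRIMARY KEY alone already rejects duplicates; a separate UNIQUE KEY
# on the same column would only duplicate that work.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Tickets (TKT_ID INTEGER PRIMARY KEY)")
con.execute("INSERT INTO Tickets VALUES (1)")
try:
    con.execute("INSERT INTO Tickets VALUES (1)")  # duplicate TKT_ID
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # the PK enforced uniqueness by itself
```

This is only an illustration of the constraint semantics; the performance cost of the redundant index during LOAD DATA is specific to InnoDB.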
You said: TicketAttachments has no _explicit_ PRIMARY KEY. It is usually better to have one, even if it is just INT UNSIGNED NOT NULL AUTO_INCREMENT.
My response: There is a primary key defined on it, look again:
FILE_ID INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
You said: How big is the table?
My response: Actually I have multiple tables, one for each year, but the attachments table is for all of them.
I am not using foreign keys because my application enforces referential integrity.
I have attached the SHOW TABLE STATUS output as you requested. It is a CSV file.
Tickets2010
Tickets2011
TicketAttachments
> innodb_io_capacity=2000
You said: Are you using SSDs? Hardware RAID? If not, this value may be excessive. An ordinary drive cannot handle more than about 200.
My response: I am using software RAID level 1.
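For conventional disks behind software RAID 1, a lower setting is the usual starting point. An illustrative my.cnf fragment (not tuned for this particular system):

```ini
# Software RAID 1 on spinning disks delivers roughly the random-IOPS of a
# single drive, so a value near 200 is a common starting point.
[mysqld]
innodb_io_capacity = 200
```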
> I do an explicit commit every 90 seconds.
You said: Does this mean that your Python script generates a LOAD DATA file of a few thousand rows, then closes it and issues a COMMIT?
My response: I have thousands of LOAD DATA files (a previous program created them in MySQL format).
There is one file for the tickets table, and it can have 1-20 rows.
There is one file for attachments, which can only have one row.
So, every 90 seconds I issue a COMMIT; by that point, LOAD DATA has been issued for hundreds of files since the previous commit.
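The loop described above could be sketched roughly as follows. This is only a sketch: it assumes a DB-API cursor, a single target table name `Tickets2011` (the real script loads several tables), and that the driver accepts a placeholder for the LOAD DATA file path, which varies by connector.

```python
import time

def load_files(cursor, paths, commit, interval=90.0, clock=time.monotonic):
    """Issue one LOAD DATA per file, committing every `interval` seconds.

    `commit` is the connection's commit callable; `clock` is injectable
    so the batching logic can be tested without waiting in real time.
    """
    last_commit = clock()
    for path in paths:
        # Table name and placeholder usage are assumptions for this sketch.
        cursor.execute(
            "LOAD DATA LOCAL INFILE %s INTO TABLE Tickets2011", (path,)
        )
        if clock() - last_commit >= interval:
            commit()
            last_commit = clock()
    commit()  # final commit for the last partial batch
```

Grouping many small LOAD DATA statements under one commit amortizes the transaction overhead, which matters when each file holds only 1-20 rows.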
You said:
> create_date VARCHAR(40) NULL,
> mod_date VARCHAR(40) NULL,
> read_date VARCHAR(40) NULL,
There is a much more compact, and usable, DATE datatype.
My response: Actually, the data for these columns comes in various formats, so I treat them as unformatted strings.
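If the set of formats were ever pinned down, the strings could be normalized to proper DATETIME values at load time with something along these lines. The format list here is purely illustrative, since the actual formats are not shown in this thread:

```python
from datetime import datetime

# Hypothetical format list; substitute the formats actually present
# in the source data.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%m/%d/%Y", "%d-%b-%Y"]

def normalize_date(raw):
    """Try each known format; return an ISO-style string, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime(
                "%Y-%m-%d %H:%M:%S"
            )
        except ValueError:
            continue
    return None  # unparseable values stay NULL rather than guessed
```

Storing a real DATETIME (or DATE) instead of VARCHAR(40) would make the columns both smaller and sortable, per the suggestion quoted above.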
You said: What is your conclusion on Tablespaces, etc?
My response: I like being able to control where this data resides, and it reduces the impact on ibdata.
I do not see any measurable performance difference whether I use separate tablespaces or not.
There is one problem, though: INFORMATION_SCHEMA does not track these tablespaces in its TABLESPACES or FILES tables.
I filed a separate bug report for this.
You said: A significant issue in the performance of loading Tickets is whether TKT_ID values are ordered or random.
My response: The data is ordered. TKT_ID acts like an auto-increment value, but it is managed by my application.
You said: Do you find that performance starts out fast, and later slows down? Some other pattern?
My response: Yes, performance starts out fast (1,000-5,000 rows inserted per second), then slows down after about an hour (100-500 rows per second).
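To pin down exactly when the slowdown begins, a simple windowed throughput log could be added to the loader. A sketch, with an illustrative class name and window size; the clock is injectable for testing:

```python
import time

class ThroughputLog:
    """Record rows/second over fixed windows to locate the slowdown point."""

    def __init__(self, window=60.0, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.start = self.clock()
        self.rows = 0
        self.history = []  # list of (window_seconds, rows_per_second)

    def add(self, nrows):
        """Call after each LOAD DATA with the number of rows it inserted."""
        self.rows += nrows
        elapsed = self.clock() - self.start
        if elapsed >= self.window:
            self.history.append((elapsed, self.rows / elapsed))
            self.start = self.clock()
            self.rows = 0
```

Plotting `history` against wall-clock time would show whether the drop is gradual (e.g. buffer pool filling up) or abrupt at a specific table size.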