MySQL Forums
Forum List  »  MySQL & Storage

MySQL + NAS(NFS or CIFS) + Huge Table Insert Problem (Long)
Posted by: JinNo Kim
Date: September 07, 2005 08:29AM

OK, here's what I've been fighting for the last couple of weeks...

Dell 2650 with dual HT XEON 3.0/3.2 (Played with a mix of 2650's 2~4G of RAM), dual GigE NICS, Gentoo Linux fully updated, (MySQL 4.0.25 and 4.1.14). 1 GigE link to network, 1 GigE(9000 MTU) Xover cable to Network Appliance FAS940 with (2) 2port FC controllers and 6 fully populated DS-14FC disk shelves with redundant FC connections to controllers. No network queries, all run locally from the Dell.

I can start 6 `dd if=/dev/zero of=[netapps mount point]/file[A-F] bs=8900` threads and peg my CPUs, the GigE link, and the NetApps Filer CPU - he doesn't really cry for mercy, just gets to a point where writes are delayed for a minute+. I can run these for a while (15 minutes to an hour) and then have to start worrying about disk space. There are really NO complaints about the NetApps - it performs insanely well OS file copy to disk. I can also run 3 of the same dd's writing while I run 3 dd's reading files on the netapps and piping the output back to the NetApps - the result is basically the same with marginally higher IOwait on the Dell.

In this application, I need to insert roughly 15-40 Million rows a day (in hourly batched inserts) and _occasionally_ query on a single keyed field. There is _only_ a single process EVER inserting data, and the queries all use indexes and long queries are not really a problem - we can queue them in our application and run them forever. My table looks like:

CREATE TABLE LARGE_TABLE (
A_key bigint(20) unsigned NOT NULL auto_increment,
B smallint(5) unsigned NOT NULL default '0',
C datetime NOT NULL default '0000-00-00 00:00:00',
D int(10) unsigned NOT NULL default '0',
E bigint(20) unsigned NOT NULL default '0',
F tinyint(3) unsigned NOT NULL default '0',
G smallint(5) unsigned NOT NULL default '0',
H bigint(20) unsigned NOT NULL default '0',
I smallint(5) unsigned NOT NULL default '0',
J bigint(20) unsigned NOT NULL default '0',
K int(10) unsigned NOT NULL default '0',
L bigint(20) unsigned NOT NULL default '0',
M datetime NOT NULL default '0000-00-00 00:00:00',
PRIMARY KEY (`A_key`),
KEY C (`C`),
KEY D (`D`),
KEY G (`G`),
KEY I (`I`),
KEY K (`K`)
) TYPE [Tried Inno And MyISAM while troubleshooting sometimes with MAX_ROWS=4294967295 ]

The table is heavily normalized, but joins don't occur until the data is selected out to a temp working table and none of my problems occured during that process.

My issue is with the MySQL server locking up on inserts. When I started the collection of data, all was fine. This installation ran for almost 2 months inserting into a single InnoDB table without issue... Queries against indexed fields into a local temp table were reasonably fast against a 1.7 Billion row table in like ~30 seconds (if the key was buffered, a few minutes if it had to be read). Then, the Dell started having issues (hence the variety of hardware message above). The first symptoms were random lockups of MySQL that ate our production server to the point where console logins no longer responded. I moved this MySQL database and the collection process to a separate server for troubleshooting at this point, so I could troubleshoot without impacting our other applications (users were starting to complain about the lockups in our other apps hosted on the same server and I got tired of being called at 3 A.M.).

A little research got me to the NetApps doc on NetApps+Linux and optimizations I could/should try - all of which were futile. My initial NFS client settings achieved basically the maximum my hardware can deliver (v3, UDP, r/wsize@32768 with OS tuning to buffer and support the connections).

Basically, MySQL would lock completely and become _unkillable_ - can't stop the server, will not accept connections and waiting did nothing to resolve the issue. The NFS server was still responding and I could `ls -la` the mount point, create and delete files, etc (from the Dell). The network link utilization was low and most of the time CPU utilization on the Dell and the NetApps filer were non-existant. IOWait was low, bi/bo in vmstat were at acceptable levels (rarely more than 2K and usually far lower), but context switches would skyrocket occasionally (30 second interval?) (10-16K/sec). nfsstat -c showed very few (no) errors and the ethernet interface was also reporting no (or like 1 per billion packets) errors.

If I established a client connection from localhost to the MySQL server before a lockup, it would remain connected and I could get information from `show processlist;` - I moved my hourly batched inserts back to single row inserts and they would lock up at random rows with no apparent cause. I created MyISAM tables in the same DB and redirected inserts there - no real help (15 or so hours of happy operation, followed by the lockup). If I reboot the Dell server, myisamchk the table, restart MySQL, that batch and a few immediately following it would insert, followed by another lockup.

During lockup, show processlist would display something similar to:

Id | User | Host | db | Command | Time | State | Info
6 | inserter | localhost | work | Query | 1688 | ???? | INSERT INTO LARGE_TABLE {valid type=data list}

I don't recall the "State" at lockup, I believe it dealt with some lock issue, but cannot be positive.

If I run the same importer against a MyISAM table on localhost it runs fine, albeit slower. MySQL inserts _never_ consume a whole lot more than 350Mbps of the GigE link (even on batched inserts) and the CPU on both the filer and the Dell are bored. IOwait during inserts is low and I never swap. I can barely catch rpciod on top.

I'm reasonably certain my issue is somewhere between the Linux NFS client and MySQL. After MySQL locks up, no amount of patience or cajoleing will get it to restart (even when I do not mount the point with the NFS "hard" and "noinrt" flags). I've tried many combinations of NFS flags, v234 clients, UDP and TCP, literally _every applicable_ OS performance tuning tip in "Performance Tuning for Linux Servers", and screwing extensively with my.cnf. My my.cnf at this point consumes 650M of RAM on a 2G server, but MySQL is literally the only purpose of this box.

I'm in the incredibly slow process of splitting our tables from initial production and subsequent testing into _daily_ tables of 16~30 million rows, but would prefer to use a single or few huge tables. I basically have diskspace to burn for this application and want to use it. When things went South, I was using a single InnoDB data table (IBData at approx 500G, with a +100M growth). The data portion of the file was still under 300G, with indexes eating the rest. Starting MySQL pointed at this table virtually assures a crash (not on startup, but upon execution of the first real query - even if it's not an insert).

/var/lib/mysql/mysql.err:

Number of processes running now: 1
mysqld process hanging, pid 27186 - killed
050905 08:12:57 mysqld restarted
050905 8:12:57 [ERROR] Can't start server: Bind on TCP/IP port: Address already in use
050905 8:12:57 [ERROR] Do you already have another mysqld server running on port: 3306 ?
050905 8:12:57 [ERROR] Aborting

050905 8:12:57 [Note] /usr/sbin/mysqld: Shutdown complete

050905 08:12:57 mysqld ended

While /var/log/mysql/mysqld.err shows only the below for the same times:

Version: '4.1.13' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Gentoo Linux mysql-4.1.13-r1
/usr/sbin/mysqld: ready for connections.
Version: '4.1.13' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Gentoo Linux mysql-4.1.13-r1
/usr/sbin/mysqld: ready for connections.
Version: '4.1.13' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Gentoo Linux mysql-4.1.13-r1
/usr/sbin/mysqld: ready for connections.
Version: '4.1.13' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Gentoo Linux mysql-4.1.13-r1


Anything I'm missing, other than trying the whole merge table with day-to-day MyISAM tables thing?

Is MySQL+Linux+dedicated NAS just not the way to go (I didn't procure the hardware - it was handed to me for this purpose)?

Ultimately, I guess this thread is just for comments - I think I have a somewhat workable solution by playing the file rotation game.

I just wanted to document my frustration - while not truly "Enterprise Class", I feel like this is not underpowered hardware. Given the fact that I can saturate a GigE tg3 Xover link with jumbo frames and peg CPU's on all ends I feel like MySQL shouldn't be the weak link in this chain, though the interaction between MySQL and the Linux NFS client (or lockd) appears to be. I'd love to get this working, but am not really sure where (other than chopping tables) to go.

Anyone have any ideas?

-KJN



Edited 2 time(s). Last edit at 09/23/2005 06:55PM by JinNo Kim.

Options: ReplyQuote


Subject
Written By
Posted
MySQL + NAS(NFS or CIFS) + Huge Table Insert Problem (Long)
September 07, 2005 08:29AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.