Monday, January 14, 2008

DO NOT USE O_DIRECT with EXT3

Under high load, O_DIRECT causes the following issues.

This is a symptom, as seen in the kernel log:

Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interrupts
rip __do_softirq+0x4d/0xd0
ttyS1: 1 input overrun(s)
ttyS1: 1 input overrun(s)
ttyS1: 1 input overrun(s)
ttyS1: 1 input overrun(s)
ttyS1: 1 input overrun(s)



BUT THE REAL PROBLEM is that it locks up the partition that the ibdata file is on.




Systems where the server locked up (uname and free output):

2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64 x86_64 GNU/Linux

             total       used       free     shared    buffers     cached
Mem:      16412760   16389976      22784          0      82440     686368
-/+ buffers/cache:   15621168     791592
Swap:      8393952        144    8393808
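
As a quick sanity check on these numbers, the "-/+ buffers/cache" row can be recomputed from the "Mem:" row. The sketch below assumes the values are in KiB (the default unit for free); the function names are mine, not part of any tool. It shows that about 95% of RAM is held by applications rather than by the page cache.

```python
# Values copied from the "Mem:" row of the free output above (assumed KiB).
TOTAL, USED, FREE, BUFFERS, CACHED = 16412760, 16389976, 22784, 82440, 686368


def app_used(used, buffers, cached):
    """Memory held by applications once buffers and page cache are subtracted."""
    return used - buffers - cached


def effectively_free(free, buffers, cached):
    """Memory that is free or could be reclaimed from buffers and page cache."""
    return free + buffers + cached


print(app_used(USED, BUFFERS, CACHED))          # 15621168, the "used" column of -/+ buffers/cache
print(effectively_free(FREE, BUFFERS, CACHED))  # 791592, the "free" column of -/+ buffers/cache
print(round(100 * app_used(USED, BUFFERS, CACHED) / TOTAL))  # 95 (percent of RAM)
```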


I had O_DIRECT running in production for over a month on some pretty loaded servers, but once I put it on some older servers, all hell broke loose.

"Pretty loaded" is defined as:

1000 qps, mainly selects mixed with large range queries, at a high concurrency of 30 threads.

CPU WIO (wait on I/O) is around 10-15%, which is within acceptable thresholds.


If you insist on running O_DIRECT, I recommend:

1. Test O_DIRECT on every OS version in your farm.
2. Test O_DIRECT by producing so much load that it's unrealistic.
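
One rough way to start on the second recommendation, before involving MySQL at all, is to hammer the target partition with concurrent O_DIRECT writes. The sketch below is only an illustration under stated assumptions: it is Linux-only, it assumes 4096-byte alignment is sufficient for the device, and the names direct_write and stress are hypothetical. It does not reproduce InnoDB's I/O pattern (MySQL enables O_DIRECT through innodb_flush_method = O_DIRECT), so a realistic test should still replay production queries against a real server.

```python
import mmap
import os
import threading

BLOCK = 4096  # assumed alignment; O_DIRECT needs aligned buffers, offsets and sizes


def direct_write(path, blocks, direct=True):
    """Write `blocks` aligned blocks to `path` and return the bytes written."""
    flags = os.O_WRONLY | os.O_CREAT
    if direct:
        flags |= os.O_DIRECT  # bypass the page cache (Linux only)
    buf = mmap.mmap(-1, BLOCK)  # anonymous mmap gives a page-aligned buffer
    buf.write(b"x" * BLOCK)
    fd = os.open(path, flags, 0o600)
    written = 0
    try:
        for _ in range(blocks):
            written += os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
    return written


def stress(directory, threads=30, blocks=256, direct=True):
    """Run `threads` concurrent writers in `directory`; return total bytes written."""
    totals = [0] * threads

    def worker(i):
        path = os.path.join(directory, f"odirect_{i}.dat")
        totals[i] = direct_write(path, blocks, direct)

    pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return sum(totals)
```

The default of 30 threads mirrors the concurrency quoted above. Point stress() at a scratch directory on the same partition and filesystem as the ibdata file, raise threads and blocks well past production levels, and watch the kernel log for the lost-ticks and overrun messages while it runs. Passing direct=False gives a buffered baseline for comparison.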

7 comments:

Anonymous said...

Can you provide any details around this? What is a "pretty loaded server?" What is the hardware configuration? What kind of application workload was this database? Since we run a lot of MySQL databases (using InnoDB) on ext3 file systems, I'm concerned but without anything to go on, I can't take this very seriously.

Dathan Pattishall said...

I am still tracking down why O_DIRECT does not work with EXT3 on my 2.6 RHEL version. All I have to go on is when I turned it off the partition never froze again.

Unknown said...

IM me, it turns out you do have support.
-Brian

Anonymous said...

We're using O_DIRECT on innodb for > 60 days now. No problems with locked partitions.

Anonymous said...

I have the same load here (1000 qps) with ext3 and O_DIRECT (mysql 5.0.44sp1, ubuntu dapper server), no issues for more than 6 months. Looks like a platform-specific problem to me.

Dathan Pattishall said...

Thank you for your input. We still have no clue as to why it's not working for us.

Maybe a conflict with MegaRaid?

Anonymous said...

I think it's pretty irresponsible to immediately assume ext3 was culpable in your issue. Did you remount with ext2 or xfs for example, repeat the situation that caused the crash, and NOT have it crash?

I'm trying to understand where ext3 fits into your hypothesis.