Thanks for the response, Rick.
Okay, let me provide some further details on the use case.
Background:
Currently, the J2EE application accesses the data from these CSV files, which contain the fields below, with the exception of ID and URI, which don't exist in the CSV files today. ID would be an auto-increment INT field that is part of the PK, and URI would be the CSV file name, which will appear in the WHERE clause of a SELECT statement. For example:
SELECT START_CM, END_CM, SCORE FROM BOOK_TBL WHERE URI='MC001'
START_CM = INT --> signifies the starting position of a given text.
END_CM = INT --> signifies the ending position of a given text.
SCORE = FLOAT --> signifies the percentage value.
URI = VARCHAR(~50) --> the name of the CSV file. It contains an alphanumeric value.
URI_CM1 = VARCHAR(~25) --> signifies a text (Text_1)
URI_CM2 = VARCHAR(~25) --> signifies a text (Text_2)
TEXT_1 = VARCHAR(~500 to ~1000, maybe more; I need to analyze) --> holds Arabic text.
TEXT_2 = VARCHAR(~500 to ~1000, maybe more; I need to analyze) --> holds Arabic text.
But TEXT_1 and TEXT_2 aren't going to be part of my initial SELECT. So, here's how the application behaves:
1. Issue a SELECT to get the set of records for a given CM (between 5 and 1,000 records).
2. Perform some actions.
3. While that data is in memory, the application issues another SELECT that returns a UNIQUE record to get TEXT_1 and TEXT_2, e.g.
SELECT TEXT_1, TEXT_2 FROM BOOK_TBL WHERE URI_CM1='ABxxxxx' AND URI_CM2='EFxxxxx' AND URI='MCxxxxx'
This query will return exactly ONE record.
Now my concern is: should I include TEXT_1 and TEXT_2 in my initial SELECT, or should I issue a second SELECT that returns a unique record?
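To make the trade-off concrete, here are the two options side by side (column names as described above; 'MC001' and the 'xxxxx' values are just the sample placeholders from earlier):

```sql
-- Option A: one round trip; every row carries the (large) Arabic text columns.
SELECT START_CM, END_CM, SCORE, TEXT_1, TEXT_2
FROM BOOK_TBL
WHERE URI = 'MC001';

-- Option B: two round trips; the first result set stays small, and the text
-- is fetched only for the one row that needs it.
SELECT START_CM, END_CM, SCORE FROM BOOK_TBL WHERE URI = 'MC001';

SELECT TEXT_1, TEXT_2
FROM BOOK_TBL
WHERE URI_CM1 = 'ABxxxxx' AND URI_CM2 = 'EFxxxxx' AND URI = 'MCxxxxx';
```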
There's no single field that can be unique in these CSV files (not by itself); a combination of a few fields can be, however. That's why I have to create an auto-increment INT to serve as -- part of -- the PK. The combination of URI_CM1, URI_CM2, and URI can uniquely identify a row.
**I realized I forgot to mention URI_CM1 and URI_CM2 in my initial post.
| ID | SCORE | URI_CM1 | URI_CM2 | URI | START_CM | END_CM | TEXT_1 | TEXT_2 |
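Putting the columns above together, a rough DDL sketch might look like the following. This is only a sketch, assuming MySQL-flavored syntax; the lengths are placeholders pending my analysis, and I used BIGINT for ID per your note that INT will not be big enough:

```sql
-- Sketch only; exact types and lengths still need analysis.
-- A Unicode character set will also be needed for the Arabic text columns.
CREATE TABLE BOOK_TBL (
    ID       BIGINT NOT NULL AUTO_INCREMENT,  -- INT too small for ~9 billion rows
    SCORE    FLOAT,
    URI_CM1  VARCHAR(25),
    URI_CM2  VARCHAR(25),
    URI      VARCHAR(50),      -- CSV file name, used in WHERE clauses
    START_CM INT,
    END_CM   INT,
    TEXT_1   VARCHAR(1000),    -- Arabic text; length TBD
    TEXT_2   VARCHAR(1000),    -- Arabic text; length TBD
    PRIMARY KEY (ID),
    UNIQUE KEY uq_cm (URI_CM1, URI_CM2, URI)  -- the natural unique combination
);
```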
Another pattern worth mentioning is that the entire 9 billion records can be categorized (or identified) in 3 main chunks:
1. ~5 million CSV files
2. ~2 million CSV files
3. ~2 million CSV files
These files typically start with a common set of characters, i.e. all the files start with either 1) ABxxxxxx, 2) EFxxxxxx, or 3) MCxxxxxx. So I was thinking, if it helps, of having 3 tables as opposed to 1. That way, at least we don't deal with one table of 9 billion records, but three -- rather -- smaller tables of 5, 2, and 2 billion records.
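To illustrate the 3-table idea (table names here are hypothetical, and the syntax assumes MySQL; it presumes a BOOK_TBL definition already exists to copy):

```sql
-- One table per file-name prefix instead of a single 9-billion-row table.
CREATE TABLE BOOK_TBL_AB LIKE BOOK_TBL;  -- URIs starting with 'AB', ~5 billion rows
CREATE TABLE BOOK_TBL_EF LIKE BOOK_TBL;  -- URIs starting with 'EF', ~2 billion rows
CREATE TABLE BOOK_TBL_MC LIKE BOOK_TBL;  -- URIs starting with 'MC', ~2 billion rows

-- The application would pick the table from the URI's first two characters:
SELECT START_CM, END_CM, SCORE FROM BOOK_TBL_MC WHERE URI = 'MC001';
```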
Some questions:
>>>"If there will be 5 billion rows, INT will not be big enough."
- What's the largest value INT can hold?
- Do I have another option -- besides INT -- to use as my auto-increment PK?