Re: Partitioning
Posted by: Ahmad Sakhi
Date: May 09, 2016 09:18AM

Thanks for the response Rick.

Okay, let me provide some further details on the use case.

Background:
Currently, the J2EE application access the data from these CSV files, which contains the fields below, With the exception of ID and URI which don't exist in the CSV files now. ID would be an auto-increment INT field as part of PK key and URI would be the CSV file name which will be part of WHERE clause in a SELECT statement. i.e. if we look at the following SELECT statement.

SELECT START_CM, END_CM, SCORE FROM BOOK_TBL WHERE URI='MC001'

START_CM = INT --> signifies the starting position of a given text.
END_CM = INT --> signifies the ending position of a given text.
SCORE = float --> signifies the percentage value.
URI = VARCHAR(~50) --> This would be the name of the CSV file. IT contains alpha-numeric value.
URI_CM1 = VARCHAR (~25) --> signifies a text (Text_1)
URI_CM2 = VARCHAR (~25) --> signifies a text (Text_2)
TEXT_1 VARCHCAR (~500 to ~1000 or maybe more I need to analyze) --> holds Arabic text.
TEXT_2 VARCHCAR (~500 to ~1000 or maybe more I need to analyze) --> holds Arabic text.

But, TEXT_1 and _2 aren't going to be part of my initial select. So, here's how the application behaves.


1. Issue a SELECT to get the set of records for a given CM (between 5 to 1000).
2. perform some actions.
3. While I have the data in memory, the application will issue another select that will return a UNIQUE record to get TEXT_1 and _2. i.e.

SELECT TEXT_1, TEXT_2 FROM BOOK_TBL WHERE URI_CM1='ABxxxxx' AND URI_CM2='EFxxxxx' AND URI='MCxxxxx'

This query will return uniquely ONE record.


Now my concern is should I include TEXT_1 and _2 as part of my initial SELECT or should I issue another SELECT which will return a unique record?


There's no other field that can be unique in these CSV files (not by itself). combination of few fields can however. That's why I have to create an auto-increment INT to make it -- as part of -- the PK. So, combination of URI_CM1, URI_CM2, URI can uniquely identify a row.

**I realized I forgot to mention URI_CM1 and URI_CM2 in my initial post.

| ID | SCORE | URI_CM1 | URI_CM2 | URI | START_CM | END_CM | TEXT_1 | TEXT_2 |


Another pattern that I think is worth mentioning is that the entire 9 Billion records can be categorized (or identified) in 3 main chunks.

1. ~5 million CSV files
2. ~2 million CSV files
3. ~2 million CSV files

These files typically start with a common set of characters. i.e. all the files starts with either 1) ABxxxxxx, 2)EFxxxxxx or 3)MCxxxxxx. So, I was thinking, if that helps, to have 3 tables, as opposed to 1. This way, at least we don't deal with 1 table worth of 9 billion records but 3 -- rather -- smaller tables of 5, 2 and 2 billion records.


Some questions:

>>>"If there will be 5 billion rows, INT will not be big enough."
- What's the largest value INT can hold?
- Do I have another option -- besides INT -- to be my PK auto-increment?

Options: ReplyQuote


Subject
Written By
Posted
April 19, 2016 12:29PM
April 20, 2016 11:34PM
April 21, 2016 07:47AM
April 22, 2016 06:41PM
April 22, 2016 06:59PM
April 23, 2016 05:11PM
April 25, 2016 09:27AM
April 30, 2016 05:08PM
May 04, 2016 11:41PM
Re: Partitioning
May 09, 2016 09:18AM
May 10, 2016 10:15AM
May 11, 2016 08:11AM
May 17, 2016 07:49AM
May 23, 2016 03:21PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.