MySQL Forums
Forum List  »  Data Warehouse

Re: Large Database Question
Posted by: John McGiles
Date: February 12, 2010 06:23PM

Currently, I am inserting using large group inserts with the LOAD DATA INFILE command and that has worked very well.

I have not looked at using infobright with the open source ETL tool, I will definitely look into this. Thank you for the tip.

And now for the example:

There are thousands of libraries, each individual library containing books of extremely related subject matter. I need to store all of the book information in a database by page. This information is collected and inserted with a program I made which I will call "the collector".

Some information about the books:
Some books are not complete (ie missing pages).
Some books are a part of a volume of books (ie the lord of the rings has 3 books)

So logically, I create a table for each library, since there are thousands of libraries and they already have books grouped by subject matter.

Below is the create table statement.


CREATE TABLE `LIBRARY_DATABASE`.`PHYSICS_LIBRARY` (
`ID` int(11) NOT NULL,
`BOOK_HASH` varchar(200) DEFAULT NULL,
`FULL_TITLE` varchar(1000) DEFAULT NULL,
`INDIVIDUAL_TITLE` varchar(767) DEFAULT NULL,
`VOLUME_TITLE` varchar(767) DEFAULT NULL,
`VOLUME_TITLE_CORRECT` tinyint(1) DEFAULT NULL,
`BOOK_TYPE` varchar(10) DEFAULT NULL,
`DATE_OF_PUBLISH` datetime DEFAULT NULL,
`RETRIEVE_DATE` datetime DEFAULT NULL,
`AUTHOR` varchar(300) DEFAULT NULL,
`SIZE` int(11) DEFAULT NULL,
`PAGE_NUMBER` int(11) DEFAULT NULL,
`TOTAL_PAGES` int(11) DEFAULT NULL,
`BEGIN_BOOK` tinyint(1) DEFAULT '0',
`BOOK_PAGES_AVAILABLE` mediumint(9) DEFAULT '0',
`BOOK_COMPLETE` tinyint(1) DEFAULT '0',
`IS_VOLUME` tinyint(1) DEFAULT '0',
`IS_BEGIN_VOLUME` tinyint(1) DEFAULT '0',
`NUMBOOKS` mediumint(9) DEFAULT '0',
`VOLUME_TOTAL_PAGES` mediumint(9) DEFAULT '0',
`volume_PAGES_AVAILABLE` mediumint(9) DEFAULT '0',
`VOLUME_SIZE` int(11) DEFAULT '0',
`TYPES_IN_VOLUME` varchar(300) DEFAULT NULL,
`VOLUME_COMPLETE` tinyint(1) DEFAULT '0',
PRIMARY KEY (`ID`),
KEY `INDEX_FULL_TITLE` (`FULL_TITLE`),
KEY `INDEX_INDIVIDUAL_TITLE` (`INDIVIDUAL_TITLE`),
KEY `INDEX_VOLUME_TITLE` (`VOLUME_TITLE`),
KEY `INDEX_BOOK_TYPE` (`BOOK_TYPE`),
KEY `INDEX_AUTHOR` (`AUTHOR`),
KEY `INDEX_DATE_OF_PUBLISH` (`DATE_OF_PUBLISH`),
KEY `INDEX_PAGE_NUMBER` (`PAGE_NUMBER`),
KEY `INDEX_TOTAL_PAGES` (`TOTAL_PAGES`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=2;

id - This is a unique id per page of a book per library that the collector provides, it does not need to be an autoincrement it's already unique.
book_hash - genearted by the collector, it's needed but doesn't have to be searched on, joined on, or used to do lookups.
full_title - the full title of the book (ie "The lord of the Rings the return of the king")
individual_title - the title of the book without the volume information (ie "the return of the king" instead of "the lord of the rings the return of the king"). This is also determined by the collector.
volume title - example: the lord of the rings
volume_title_correct - ocassionally some of the volume information is incorrect, I need to note this with a boolean
book type - ie biography, autobio, fantasy fict,...
date_of_publish - self explanatory
retrieve_date - date I insert it into the database
author - self explanatory. In my case there are authors with very long names.
size - size of the book in bytes on disk
page_number - self explanatory
total_pages - total pages that a book should have
BEGIN_BOOK - true if this is the first row for a given book
book_pages_available - how many rows there are in the database for a given book
book_complete - true if total_pages equals number of rows in the database for a given book
is_volume - true if the book is part of a volume
is_begin_volume - true if the row is the first entry for a given volume in the database
numbooks - number of books in a given volume
volume_total_pages - number of rows there should be in the database for a given volume (ie 1 volume that consists of three 100 page books would have a value of 300)
volume_pages_available - how many rows there are in the database for a given volume_title and author
volume_size - volume size on disk (in bytes)
types in volume - comma seperated string of different types of books in a volume (ie 2 autobiography, 1 bio)
volume_complete - boolean to represent if all pages for a volume are present


I have select statements that I use to determine a lot of this information for example, book_complete,book_pages_available, volume_complete, etc.. but let's discuss the initial table design before we dive into the select statements.

Options: ReplyQuote


Subject
Views
Written By
Posted
8253
February 10, 2010 09:30PM
4752
February 11, 2010 10:27PM
3852
February 12, 2010 08:39AM
Re: Large Database Question
3989
February 12, 2010 06:23PM
3831
February 12, 2010 06:33PM
4056
February 13, 2010 01:45AM
3927
February 13, 2010 10:40AM
3725
February 13, 2010 03:38PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.