Alvin's Big Data Notebook : --Split-by Usage in Sqoop

When we tried to --split-by a String column, some rows are missing.
The column used for 'split-by' may be also missing.

The reason here is you chose a non-numeric column here for your -split-by column. This would usually be an ID column.

--split-by is used to specify the column of the table used to generate splits for imports. This means that it specifies which column will be used to create the split while importing the data into your cluster. It can be used to enhance the import performance by achieving greater parallelism. Sqoop creates splits based on values in a particular column of the table which is specified by --split-by by the user through the import command. If it is not available, the primary key of the input table is used to create the splits.

By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument.

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.

Reason to use : Sometimes the primary key doesn't have an even distribution of values between the min and max values(which is used to create the splits if --split-by is not available). In such a situation you can specify some other column which has proper distribution of data to create splits for efficient imports.

2.

There is a db.TextSplitter for this in sqoop. If there is no primary key to the table, you could in fact use a string column to determine the splits. If there is duplication in that column, unlike a primary key, you may not get proper splits and unbalanced tasks. Also the warning about if your database sorts in an case insensitive fasion (e.g. MySQL), then you can get a partial import also applies here. As stated below, you are encouraged to use a numeric split column, if you have one.
13/11/30 01:12:57 WARN db.TextSplitter: Generating splits for a textual index column. 13/11/30 01:12:57 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records. 13/11/30 01:12:57 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.

Reference:

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.sqoop/sqoop/1.4.4-cdh5.0.0/org/apache/sqoop/mapreduce/db/TextSplitter.java/

Alvin's Big Data Notebook

Friday, 1 August 2014

--Split-by Usage in Sqoop

No comments:

Post a Comment