Finding Skewed Tables in Teradata



       Skewed Tables in Teradata                                                         
       Teradata distributes its data for a given table based on its Primary Index and Primary Index alone .If this Primary Index is not selected appropriately it can cause performance bottle necks. Following are general parameters we should consider while creating Primary Indexes
      1:Access Path
      2:Volatility
      3: Data Distribution
  •        Volatility is usually not an issue in a well designed database .That is to say we do not expect update clauses updating the primary index itself. This usually leaves us with data distribution and Access Path.
  •       Access Path implies particular column ( set of columns ) are always used in join conditions .Advantage of using these columns as PI is that it will avoid redistribution of data during join .( One of the most expensive operation for teradata ) and therefore can reduce usage of Spool files.
  •        Data distribution implies, Data is evenly distributed amongst all amps of teradata. This will ensure that all amps would be doing same amount of work. In case of Skewed distribution one amp will end up doing most of the work this can increase execution time considerable.
Following Queries can be used to determine table skeweness.

SELECT
DatabaseName
,TableName
,SUM(CurrentPerm) AS CurrentPerm,
(
SUM(PeakPerm) AS PeakPerm

,100 - (AVG(CurrentPerm)/MAX(CurrentPerm)*100)) AS SkewFactor
FROM DBC.TableSize
WHERE TABLENAME =:TABLENAME
GROUP BY 1,2;

However tables which have very few rows (esp. lesser than amps) are inherently skewed and cannot be changed. So we should take into account size of table also while considering skeweness.
Query can be further customized to have better details


SELECT
DATABASENAME
, TABLENAME
,SKEW_CATEGORY
FROM
(
,
SELECT

DATABASENAME
,TABLENAME
,(MAX(CURRENTPERM)-AVG(CURRENTPERM))/MAX(CURRENTPERM) *100 SKEW_FACTOR,SUM(CURRENTPERM) /(1024*1024*1024) ACTUAL_SPACE
 CASE WHEN SKEW_FACTOR >90 THEN 'EXTREMELY SKEWED,T0 BE RECTIFIED IMMEDIATELY'
WHEN SKEW_FACTOR >70 THEN 'HIGHLY SKEWED'
WHEN SKEW_FACTOR >50 THEN 'SKEWED TABLE'
WHEN SKEW_FACTOR >40 THEN 'SLIGHTLY SKEWED'
WHEN SKEW_FACTOR >30 AND ACTUAL_SPACE >.5 THEN 'SLIGHTLY SKEWED'
ELSE 'ACCEPTABLE'
END as SKEW_CATEGORY
FROM DBC.TABLESIZE
WHERE DATABASENAME in ('DATABASENAME')
GROUP BY 1,2
)x
ORDER BY SKEW_FACTOR DESC;
To find distribution of data amongst amps, we can use hash functions as follows

SELHASHAMP(HASHBUCKET(HASHROW( Column )))
 ,COUNT(*)
FROM DATABASENAME.TABLENAME
GROUP BY 1;

This factor can further be exploited to find distribution for primary index as follows

SEL 'ColumnName ' ,
(MAX(CN)-AVG(CN))/MAX(CN)*100
FROM
(
SEL

VPROC
,COUNT(*) CN
FROM DATABASENAME.TABLENAME
RIGHT OUTER JOIN
(SEL
VPROC
FROM
DBC.TABLESIZE
GROUP BY 1
)X ON VPROC=HASHAMP(HASHBUCKET(HASHROW( ColumnName )))
 GROUP BY 1
)
C
GROUP BY 1 ;
This query can also be used to check which singular column can be best fitted for primary index on sole criteria of data distribution. Following query is a generic query which would create SQL statements to check distribution for that column amongst amps.

SEL
'
SEL ' || '27'xc || COLUMNNAME || '27'xc || ' , (MAX(CN)-AVG(CN))/MAX(CN)*100
FROM
(

 SEL VPROC ,COUNT(*) CN
FROM
'
' || DATABASENAME || '.' ||TABLENAME ||

RIGHT OUTER
JOIN
(SEL VPROC
 FROM DBC.TABLESIZE
GROUP BY 1
)X ON VPROC=HASHAMP(HASHBUCKET(HASHROW( '|| COLUMNNAME || ' )))
 GROUP BY 1)
C
GROUP BY 1

 UNION ALL
' Title
FROM DBC.COLUMNS
WHERE TABLENAME='
AND DATABASENAME='
;
tablename'databasename'

Comments

  1. The query
    select VPROC, count(*) from testdb.test

    produces an error
    [5628] Column VPROC not found in tesdb.test.

    ReplyDelete
  2. VPROC is not supposed to come out from test database ,
    Its a field from DBC.TABLESIZE

    ReplyDelete

Post a Comment

Popular posts from this blog

The Secret behind A B I N I T I O.

Macro to generate Mload Script