
Finding and Deleting Duplicate Records in Teradata

Requirement:
Find duplicate records and remove them while retaining the original record in Teradata.

Suppose I work in an office and my boss has asked me to record the details of every person who enters the office. I have the table structure below. The table is created as MULTISET because a Teradata SET table does not store duplicate rows.
CREATE MULTISET TABLE DUP_EXAMPLE
(
PERSON_NAME VARCHAR(50),
PERSON_AGE INTEGER,
ADDRS VARCHAR(150),
PURPOSE VARCHAR(250),
ENTERED_DATE DATE
)

If a person enters more than once, I have to insert his details once for each visit.
The first time, I inserted the records below.

INSERT INTO DUP_EXAMPLE VALUES('Krishna reddy',25,'BANGALORE','GENERAL',TO_DATE('01-JAN-2014','DD-MON-YYYY'));
INSERT INTO DUP_EXAMPLE VALUES('Anirudh Allika',25,'HYDERABAD','GENERAL',TO_DATE('01-JAN-2014','DD-MON-YYYY'));
INSERT INTO DUP_EXAMPLE VALUES('Ashok Vunnam',25,'CHENNAI','INTERVIEW',TO_DATE('01-JAN-2014','DD-MON-YYYY'));

On the same day, the person named Ashok came to the office again, so I entered his details into the table once more.
INSERT INTO DUP_EXAMPLE VALUES('Ashok Vunnam',25,'CHENNAI','INTERVIEW',TO_DATE('01-JAN-2014','DD-MON-YYYY'));

Now the table contains the data below.
SELECT * FROM DUP_EXAMPLE

PERSON_NAME      PERSON_AGE   ADDRS       PURPOSE     ENTERED_DATE
Krishna reddy    25           BANGALORE   GENERAL     01-JAN-2014
Anirudh Allika   25           HYDERABAD   GENERAL     01-JAN-2014
Ashok Vunnam     25           CHENNAI     INTERVIEW   01-JAN-2014
Ashok Vunnam     25           CHENNAI     INTERVIEW   01-JAN-2014


I need the details of every person who entered more than once in a day.
The query can be written in two ways.
1) First Option:

SELECT
PERSON_NAME,
PERSON_AGE,
ADDRS,
PURPOSE,
ENTERED_DATE,
COUNT(*)
FROM DUP_EXAMPLE
GROUP BY 1,2,3,4,5
HAVING COUNT(*)>1
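
As a small variation on the first option, the sketch below names the grouping columns explicitly instead of using positions and gives the count an alias. DUPLICATE_COUNT is only an illustrative name, not something from the original query.

```sql
-- Sketch: same grouping as the first option, written with explicit column names.
-- DUPLICATE_COUNT is an illustrative alias chosen for this example.
SELECT
    PERSON_NAME,
    PERSON_AGE,
    ADDRS,
    PURPOSE,
    ENTERED_DATE,
    COUNT(*) AS DUPLICATE_COUNT   -- how many copies of this record exist
FROM DUP_EXAMPLE
GROUP BY PERSON_NAME, PERSON_AGE, ADDRS, PURPOSE, ENTERED_DATE
HAVING COUNT(*) > 1;              -- keep only records stored more than once
```

With the four sample rows above, this should return a single row for Ashok Vunnam with a DUPLICATE_COUNT of 2.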

2) Second Option:

SELECT
PERSON_NAME,
PERSON_AGE,
ADDRS,
PURPOSE,
ENTERED_DATE,
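ROW_NUMBER() OVER(PARTITION BY PERSON_NAME,PERSON_AGE,ADDRS,PURPOSE,ENTERED_DATE ORDER BY PERSON_NAME,PERSON_AGE,ADDRS,PURPOSE,ENTERED_DATE) AS RECORD_NUMBER
FROM DUP_EXAMPLE
-- QUALIFY filters on the result of the ordered analytical function; WHERE cannot be used for this
QUALIFY RECORD_NUMBER > 1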
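
An equivalent form that does not filter on the ordered analytical function at the same query level is to compute the row number in a derived table and apply the filter in the outer query. The sketch below assumes the DUP_EXAMPLE table from this post. The derived-table alias T is just a name picked for the example, and the ORDER BY column inside the window is arbitrary because the rows in each partition are identical.

```sql
-- Sketch: number the copies of each record in a derived table,
-- then keep only the second and later copies in the outer query.
SELECT
    T.PERSON_NAME,
    T.PERSON_AGE,
    T.ADDRS,
    T.PURPOSE,
    T.ENTERED_DATE,
    T.RECORD_NUMBER
FROM (
    SELECT
        PERSON_NAME,
        PERSON_AGE,
        ADDRS,
        PURPOSE,
        ENTERED_DATE,
        ROW_NUMBER() OVER (
            PARTITION BY PERSON_NAME, PERSON_AGE, ADDRS, PURPOSE, ENTERED_DATE
            ORDER BY ENTERED_DATE   -- any column works; rows in a partition are identical
        ) AS RECORD_NUMBER
    FROM DUP_EXAMPLE
) AS T                              -- T is an illustrative alias
WHERE T.RECORD_NUMBER > 1;          -- a plain column of T, so WHERE is allowed here
```

With the sample data this should return one row, the second copy of the Ashok Vunnam record, with RECORD_NUMBER 2. Unlike the first option, it returns one row per surplus copy rather than one row per duplicated record.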

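Teradata does not allow an ordered analytical function such as ROW_NUMBER() in the WHERE clause of a DELETE, and because the duplicate rows here are identical in every column, no predicate can single out just one copy. One way to remove the duplicates while keeping one copy of each record is to rebuild the table from its distinct rows (DUP_EXAMPLE_DEDUP is only a working-table name chosen for this example).

CREATE MULTISET TABLE DUP_EXAMPLE_DEDUP AS (SELECT DISTINCT * FROM DUP_EXAMPLE) WITH DATA;
DELETE FROM DUP_EXAMPLE;
INSERT INTO DUP_EXAMPLE SELECT * FROM DUP_EXAMPLE_DEDUP;
DROP TABLE DUP_EXAMPLE_DEDUP;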
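
After the duplicates have been removed, it may be worth confirming the result. The two checks below are a minimal sketch that assumes the DUP_EXAMPLE table and the four sample rows from this post.

```sql
-- Check 1: should return no rows once every record is stored only once.
SELECT
    PERSON_NAME,
    PERSON_AGE,
    ADDRS,
    PURPOSE,
    ENTERED_DATE,
    COUNT(*)
FROM DUP_EXAMPLE
GROUP BY 1,2,3,4,5
HAVING COUNT(*) > 1;

-- Check 2: with the sample data, 3 rows should remain (4 inserted, 1 surplus copy removed).
SELECT COUNT(*) FROM DUP_EXAMPLE;
```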

Note: This is a very common interview question: how to find duplicate records, and how to delete them while retaining the original record.

Comments

  1. Ordered analytical functions are not allowed in WHERE Clause anymore in teradata
