Data Manipulation 1.3

Data Manipulation [3-18]

Removing Duplicate Observations

Duplicate observations can distort your analysis results, for example by inflating counts and totals.

You can remove them by using the NODUPKEY option in Proc Sort.

Example

Data Supermarket;
Infile datalines dsd;
Input Product : $25. Price DemandPerWeek ;
Datalines;
Campbell Soup,3.99,150
Lay's Chip,2.99,300
Kinder Chocolate,5.99,50
Nestle Ice cream,6.99,80
Maxwell Coffee,5.99,90
Coca cola,5.99,300
Pringles,2.99,200
Pringles,2.99,200
Lipton Milk Tea,3.99,150
Flamingo Fried Chicken,8.99,60
Dempster's Bread,1.99,450
;
Run;

The SUPERMARKET data set contains 3 variables:

Product
Price
DemandPerWeek

Pringles is a hot seller. However, its observation appears twice in the data set.
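
One way to confirm the duplication before removing anything is to count how often each product appears. The following is a minimal sketch, not part of the original lesson. The data set name ProductCounts is an assumption chosen for illustration, and the check looks at the Product variable only.

```sas
/* Count observations per product and keep only products that appear more than once. */
Proc Freq Data=Supermarket NoPrint;
Tables Product / Out=ProductCounts (Where=(Count > 1));
Run;

/* ProductCounts is a hypothetical name; it lists the repeated products. */
Proc Print Data=ProductCounts;
Run;
```

With the SUPERMARKET data above, only Pringles should be listed, with a Count of 2.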

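We can use the NODUPKEY option in Proc Sort to remove the duplicate observation. NODUPKEY compares only the variables listed in the BY statement, so the example lists all three variables to make sure that only fully identical observations are removed.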

Example

Proc Sort Data=Supermarket Out=Supermarket2 NODUPKEY;
By Product Price DemandPerWeek;
Run;

The output data set, SUPERMARKET2, is sorted by the BY variables, with the duplicate Pringles observation removed.

The Log window also shows a note about the removal of the duplicated observation.
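
Because NODUPKEY looks only at the BY variables, the choice of BY list decides what counts as a duplicate. The sketch below illustrates this under stated assumptions: the output names Supermarket_All and Supermarket_ByPrice are hypothetical, and the second step is a deliberate counter-example rather than something to do in practice.

```sas
/* _ALL_ places every variable in the BY statement,
   so only fully identical observations are treated as duplicates. */
Proc Sort Data=Supermarket Out=Supermarket_All NODUPKEY;
By _ALL_;
Run;

/* Counter-example: with Price as the only key, one observation
   per distinct price is kept and different products are dropped. */
Proc Sort Data=Supermarket Out=Supermarket_ByPrice NODUPKEY;
By Price;
Run;
```

The first step should give the same 10 observations as the example above. The second keeps only 6 observations, one for each distinct price, which shows how a short BY list can silently delete valid data.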

Note: the NODUPKEY option should be used with caution. Triple-check the duplicates before you remove them from the data set.
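
One way to check what was removed is the DUPOUT= option of Proc Sort, which writes the discarded observations to a separate data set. This is a minimal sketch. The name Supermarket_Dups is an assumption chosen for illustration.

```sas
/* DUPOUT= names the data set that receives the observations NODUPKEY removes. */
Proc Sort Data=Supermarket Out=Supermarket2 DupOut=Supermarket_Dups NODUPKEY;
By Product Price DemandPerWeek;
Run;

/* Review the removed observations before relying on Supermarket2. */
Proc Print Data=Supermarket_Dups;
Run;
```

For the SUPERMARKET data, Supermarket_Dups should contain a single observation: the second Pringles record.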

Exercise

Copy and run the code below to create the INCOME data set.

Data Income;
Input HouseholdID $ NumMembers HomeOwner $ Income $;
Datalines;
HID1001 4 Yes >120000
HID1002 3 No  <120000
hid1003 2 yes >120000
HID1004 6 Yes >120000
HID1004 6 Yes >120000
HID1005 6 No  <120000
hid1006 4 yes >120000
HID1007 5 Yes <120000
hid1008 3 yes >120000
HID1009 5 Yes <120000
hid1010 4 no  >120000
;
Run;

The INCOME data set contains 4 variables:

HouseholdID
NumMembers
HomeOwner
Income

Remove any duplicate observation(s) from the INCOME data set.

Need some help?

HINT:
It is highly recommended to create a new data set when removing duplicates, for example with the OUT= option. Keep the original data set intact in case you need to go back to the source data.

SOLUTION:
Proc Sort Data=Income Out=Income2 NODUPKEY;
By HouseholdID NumMembers HomeOwner Income;
Run;
