Data Manipulation 2.1 (temp)

Search the site...

Sentry Page Protection

Data Manipulation [11-18]

(In=) Option

The (in=) option is a special technique that allows you to identify observations when merging data sets.

Again, let's take a look at the DS1 and DS2 data sets.

Data DS1;
Input ID Gender $;
Datalines;
9001 Male
9002 Female
9003 Male
9004 Male
9005 Female
;
Run;

Data DS2;
Input ID Income;
Datalines;
9003 50000
9004 60000
9005 75000
9006 65000
9007 45000
;
Run;

Data set: DS1

Data set: DS2

Again, we are going to merge the DS1 and DS2 data sets.

Example

proc sort data=ds1;
by id;
run;

proc sort data=ds2;
by id;
run;

data ds;
merge ds1 (in=a) ds2 (in=b);
by id;
run;

(In=a) Option

The (in=a) option creates a variable called "a".

The variable "a" has an interesting characteristics:

It does not exist in the data set.

You cannot see it in the data set.

However, it exists in the data step.

That is, you can refer to this variable in your program.

Is there a way to show "a" in the data set?

No, the variable "a" is not created in the data set.

However, you can create a carbon-copy of "a" in the data set.

Example

data ds;
merge ds1 (in=a) ds2 (in=b);
by id;
var1 = a;
run;

The variable VAR1 is a carbon-copy of the variable "a".

It contains exactly the same set of data as "a".

By looking at VAR1, we can tell the values of "a".

Default values of "a"

By default, the variable "a" contains either 1's or 0's (binary).

It is 1's if the observation is contributed by DS1 (the data set "a" is associated with).

The ID's (9001, 9002, 9003, 9004, 9005) are all contributed by DS1.

They are all flagged as "1".

On the other hand, the observations that are NOT contributed by DS1 (i.e. 9006 and 9007) are flagged as "0".

Identifying observations

Now that we know the values of "a" (and "b"), identifying the matching and unmatched observations becomes much easier.

Example

data ds;
merge ds1 (in=a) ds2 (in=b);
by id;
if a=1 and b=1 then i=1;
else if a=1 then i=2;
else if b=1 then i=3;
run;

When both "a" and "b" are 1's, the observations are contributed by both DS1 and DS2.

These observations are flagged as 1 (i=1).

When the observations are not matching, and they are contributed by only DS1, the observations are flagged as 2 (i=2).

Similarly, when the observations are contributed by only DS2, they are flagged as 3 (i=3).

Overall, the observations are flagged into 3 groups:

Matched observations (i=1)
Unmatching observations from DS1 only (i=2)
Unmatching observations from DS2 only (i=3)

How come ... the code looks different than before?

Please note the code below essentially achieves the exact same results:

data ds;
merge ds1 (in=a) ds2 (in=b);
by id;
if a and b then i=1;
else if a then i=2;
else if b then i=3;
run;

The (if a and b) statement is the same as (if a=1 and b=1).

Exercise

Copy and run the ALLPOP and DROPOUT data sets from the yellow box below:

Data Allpop;
Input PID $ Gender $ Age;
Datalines;
PT70001 M 39
PT70002 F 45
PT70003 M 38
PT70004 F 23
PT70005 F 25
PT70006 M 41
PT70007 M 36
PT70008 F 56
PT70009 F 42
PT70010 F 29
;
Run;

Data Dropout;
Input PID $ DDate;
Format DDate YYMMDD10.;
Datalines;
PT70003 19500
PT70004 19502
PT70007 19603
;
Run;

The ALLPOP data set contains a list of 10 patients enrolled in a clinical study.

While the clinical study has been ongoing, some patients dropped out of the study due to various reasons.

The list of dropped out patients are stored in the DROPOUT data set.

Task

Create a new variable that flags the patient:

1 for current patient
2 for dropped out patient

Create any additional data set or variable if needed.

Need some help?

Fill out my online form.