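With the basic functions out of the way, it's time to start tinkering with models in SAS. In other words, time to start writing PROCs. Honestly, the more SAS I learn, the more it reminds me of Stata, both in the style of its output and in its syntax. Calling a model without any () still feels strange. If there is a difference between SAS and Stata, it is probably just that Stata leans toward econometric models while SAS serves most statistical models in general.
PROC basics: CONTENTS
First, the most basic PROC: CONTENTS, which displays the main attributes of a data set. For example: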
LIBNAME tropical 'c:\MySASLib';
PROC CONTENTS DATA = tropical.banana;
RUN;
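Two statements matter most here: TITLE and FOOTNOTE. The former puts a title at the top of the output, the latter adds a footnote at the bottom. Usage is fairly direct: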
TITLE "Here's another title";
TITLE 'Here''s another title';
FOOTNOTE3 'This is the third footnote';
Finally, there is the LABEL statement, which looks a lot like Stata's:
LABEL ReceiveDate = 'Date order was received'
      ShipDate = 'Date merchandise was shipped';
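This attaches a descriptive label to a variable. In R, labelling variables is quite a hassle: only a handful of packages can do it, and it is hardly worth the effort. I usually make variable names a bit longer so they explain themselves, or else build a separate table that stores the variable names and their labels.
SAS PROC subsetting: WHERE
To take a subset inside a PROC, you can call WHERE directly. The idea feels very close to SQL. Usage is simple too (nothing in SAS is very complicated to use, apart from certain models):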
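To see TITLE, FOOTNOTE and LABEL working together, here is a minimal self-contained sketch. The data set `orders` and its values are made up for illustration; only the three statements themselves come from the text above. It assumes the labels are assigned in the DATA step so that they are stored with the data set.

```sas
* Hypothetical data set, invented for this illustration;
DATA orders;
   INPUT OrderID ReceiveDate :MMDDYY10. ShipDate :MMDDYY10.;
   * A LABEL statement in a DATA step is stored with the data set;
   LABEL ReceiveDate = 'Date order was received'
         ShipDate    = 'Date merchandise was shipped';
   FORMAT ReceiveDate ShipDate DATE9.;
   DATALINES;
101 03/21/2008 03/24/2008
102 03/22/2008 03/25/2008
;
RUN;

TITLE 'Order dates';
FOOTNOTE 'Labels were assigned in the DATA step';

* The variable table in the PROC CONTENTS output includes a Label column;
PROC CONTENTS DATA = orders;
RUN;

* The LABEL option makes PROC PRINT use labels as column headings;
PROC PRINT DATA = orders LABEL;
RUN;

* Null statements cancel the title and footnote for later steps;
TITLE;
FOOTNOTE;
```

Titles and footnotes stay in effect until they are replaced or cancelled, which is why the sketch ends with the two null statements.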
PROC PRINT DATA = 'c:\MySASLib\style';
   WHERE Genre = 'Impressionism';
   TITLE 'Major Impressionist Painters';
   FOOTNOTE 'F = France N = Netherlands U = US';
RUN;
The final output is:
Major Impressionist Painters                          1

Obs    Name                     Genre            Origin
  1    Mary Cassatt             Impressionism      U
  3    Edgar Degas              Impressionism      F
  5    Claude Monet             Impressionism      F
  6    Pierre Auguste Renoir    Impressionism      F

F = France N = Netherlands U = US
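The same idea can be tried without an external library. The sketch below uses a small invented data set `artists` (the rows are illustrative, not the original file) and shows two variations that are standard SAS but not taken from the article: combining conditions with AND and IN, and the WHERE= data set option.

```sas
* Hypothetical painter records, invented for this illustration;
DATA artists;
   INFILE DATALINES DLM = ',';
   INPUT Name :$21. Genre :$18. Origin $;
   DATALINES;
Mary Cassatt,Impressionism,U
Paul Cezanne,Post-impressionism,F
Edgar Degas,Impressionism,F
Vincent van Gogh,Post-impressionism,N
;
RUN;

* WHERE filters rows before the procedure sees them;
PROC PRINT DATA = artists;
   WHERE Genre = 'Impressionism' AND Origin IN ('F', 'U');
   TITLE 'Impressionists from France or the US';
RUN;

* The WHERE= data set option does the same job inline;
PROC PRINT DATA = artists (WHERE = (Origin = 'N'));
   TITLE 'Painters from the Netherlands';
RUN;
```

As in the output above, the Obs column keeps the original observation numbers, so the first report lists observations 1 and 3 rather than renumbering them.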
SAS PROC sorting data: SORT
Sorting is even simpler: PROC SORT does it directly.
DATA marine;
   INFILE 'c:\MyRawData\Lengths.dat';
   INPUT Name $ Family $ Length @@;
RUN;
* Sort the data;
PROC SORT DATA = marine OUT = seasort NODUPKEY;
   BY Family DESCENDING Length;
PROC PRINT DATA = seasort;
   TITLE 'Whales and Sharks';
RUN;
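The data are now sorted by Family and, within each family, by Length in descending order. OUT= writes the sorted result to seasort and leaves marine unchanged, while NODUPKEY drops any observation whose BY values duplicate an earlier one. Missing values sort first, which is why the humpback, whose Family is blank, leads the list.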
Whales and Sharks                     1

Obs    Name        Family    Length
  1    humpback                50.0
  2    whale       shark       40.0
  3    basking     shark       30.0
  4    mako        shark       12.0
  5    dwarf       shark        0.5
  6    blue        whale      100.0
  7    sperm       whale       60.0
  8    gray        whale       50.0
  9    killer      whale       30.0
 10    beluga      whale       15.0
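The effect of NODUPKEY and of missing values is easiest to see on a tiny inline sample. The rows below are a hand-picked subset with one deliberate duplicate, typed in for illustration rather than read from the original Lengths.dat file.

```sas
* Six records on two lines, and the beluga appears twice on purpose;
DATA marine;
   INPUT Name $ Family $ Length @@;
   DATALINES;
beluga whale 15 whale shark 40 beluga whale 15
dwarf shark 0.5 humpback . 50 sperm whale 60
;
RUN;

* OUT= writes the sorted copy to seasort and leaves marine untouched;
* NODUPKEY keeps only the first row for each Family and Length pair;
PROC SORT DATA = marine OUT = seasort NODUPKEY;
   BY Family DESCENDING Length;
RUN;

PROC PRINT DATA = seasort;
   TITLE 'Sorted, with duplicate keys removed';
RUN;
```

Six observations go in and five come out: the second beluga is dropped, and the humpback with its missing Family sorts ahead of the sharks and whales.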
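SAS PROC printing data: PRINT
The simplest way to output data is probably PRINT which, as the name suggests, just prints the data. You can choose which variables to show (VAR) and also request totals (SUM):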
DATA sales;
   INFILE 'c:\MyRawData\Candy.dat';
   INPUT Name $ 1-11 Class @15 DateReturned MMDDYY10. CandyType $
         Quantity;
   Profit = Quantity * 1.25;
PROC SORT DATA = sales;
   BY Class;
PROC PRINT DATA = sales;
   BY Class;
   SUM Profit;
   VAR Name DateReturned CandyType Profit;
   TITLE 'Candy Sales for Field Trip by Class';
RUN;
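The result is as follows (DateReturned prints as a raw SAS date value, the number of days since 1 January 1960, because no format has been applied to it):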
          Candy Sales for Field Trip by Class           1

---------------------- Class=14 ----------------------

                      Date      Candy
  Obs    Name       Returned    Type     Profit
    1    Nathan       17612      CD       23.75
    2    Matthew      17612      CD       17.50
    3    Claire       17613      CD       13.75
    4    Chris        17616      CD        7.50
    5    Stephen      17616      CD       12.50
-----                                    ------
Class                                     75.00

---------------------- Class=21 ----------------------

                      Date      Candy
  Obs    Name       Returned    Type     Profit
    6    Adriana      17612      MP        8.75
    7    Caitlin      17615      CD       11.25
    8    Ian          17615      MP       22.50
    9    Anthony      17616      MP       16.25
   10    Erika        17616      MP       21.25
-----                                    ------
Class                                     80.00
                                         ======
                                         155.00
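A cut-down version that runs without Candy.dat may help when experimenting with BY, VAR and SUM. The four rows below reuse names, classes and quantities that appear in this article's outputs, but the inline layout, the NOOBS option and the DOLLAR8.2 format are choices made for this sketch, not part of the original program.

```sas
* Inline sample so the step runs without an external file;
DATA sales;
   INPUT Name $ Class Quantity;
   Profit = Quantity * 1.25;
   DATALINES;
Adriana 21 7
Nathan 14 19
Matthew 14 14
Ian 21 18
;
RUN;

* BY-group processing requires the data to be sorted by the BY variable;
PROC SORT DATA = sales;
   BY Class;
RUN;

* NOOBS drops the Obs column and SUM adds subtotals and a grand total;
PROC PRINT DATA = sales NOOBS;
   BY Class;
   VAR Name Quantity Profit;
   SUM Profit;
   FORMAT Profit DOLLAR8.2;
   TITLE 'Profit by class';
RUN;
```

With these four rows the subtotals come to 41.25 for class 14 and 31.25 for class 21, for a grand total of 72.50.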
Changing output formats in a SAS PROC: FORMAT
Basically a FORMAT statement is all it takes; formats can also be applied in a PUT statement.
DATA sales;
   INFILE 'c:\MyRawData\Candy.dat';
   INPUT Name $ 1-11 Class @15 DateReturned MMDDYY10. CandyType $
         Quantity;
   Profit = Quantity * 1.25;
PROC PRINT DATA = sales;
   VAR Name DateReturned CandyType Profit;
   FORMAT DateReturned DATE9. Profit DOLLAR6.2;
   TITLE 'Candy Sale Data Using Formats';
RUN;
The output is:
       Candy Sale Data Using Formats            1

                     Date       Candy
Obs    Name        Returned     Type     Profit
  1    Adriana     21MAR2008     MP       $8.75
  2    Nathan      21MAR2008     CD      $23.75
  3    Matthew     21MAR2008     CD      $17.50
  4    Claire      22MAR2008     CD      $13.75
  5    Caitlin     24MAR2008     CD      $11.25
  6    Ian         24MAR2008     MP      $22.50
  7    Chris       25MAR2008     CD       $7.50
  8    Anthony     25MAR2008     MP      $16.25
  9    Stephen     25MAR2008     CD      $12.50
 10    Erika       25MAR2008     MP      $21.25
Commonly used formats include:
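- Character: $HEXw. (hexadecimal representation of the characters) and $w. (standard character).
- Date and time: DATEw. (ddMMMyy or ddMMMyyyy, e.g. 21MAR2008), DATETIMEw.d (ddMMMyy:hh:mm:ss), DAYw. (day of the month), EURDFDDw. (European style, dd.mm.yy), JULIANw. (Julian date, yyddd or yyyyddd), MMDDYYw. (mm/dd/yy or mm/dd/yyyy), TIMEw.d (hh:mm:ss), WEEKDATEw. (day of the week plus the date in words), WORDDATEw. (date in words).
- Numeric: BESTw. (SAS picks the best notation), COMMAw.d (commas between thousands), DOLLARw.d (currency), Ew. (scientific notation), PDw.d (packed decimal), w.d (standard decimal).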
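Sample output is shown below.
Of course, PROC FORMAT also lets you define your own output formats for factor-type (categorical) variables, for example: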
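A quick way to check what a format produces is to PUT one value through several of them and read the log. The sketch below does that for a single date and a single number; the widths are choices made here for illustration, and the values in the comments are what these particular formats write for these inputs (numeric results are right-aligned within the width).

```sas
* Writes to the SAS log and creates no data set;
DATA _NULL_;
   d = '21MAR2008'D;   /* SAS date constant */
   x = 1234.5;

   PUT 'DATE9.      ' d DATE9.;       /* 21MAR2008 */
   PUT 'MMDDYY10.   ' d MMDDYY10.;    /* 03/21/2008 */
   PUT 'DAY2.       ' d DAY2.;        /* 21 */
   PUT 'JULIAN7.    ' d JULIAN7.;     /* 2008081 */
   PUT 'WEEKDATE29. ' d WEEKDATE29.;  /* weekday and date in words */
   PUT 'WORDDATE18. ' d WORDDATE18.;  /* date in words */

   PUT 'BEST8.      ' x BEST8.;       /* 1234.5 */
   PUT '8.2         ' x 8.2;          /* 1234.50 */
   PUT 'COMMA8.2    ' x COMMA8.2;     /* 1,234.50 */
   PUT 'DOLLAR9.2   ' x DOLLAR9.2;    /* $1,234.50 */
   PUT 'E10.        ' x E10.;         /* scientific notation */
RUN;
```

If the width is too small for the value, SAS adjusts or truncates the result, so it is worth trying a few widths before settling on one for a report.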
DATA carsurvey;
   INFILE 'c:\MyRawData\Cars.dat';
   INPUT Age Sex Income Color $;
PROC FORMAT;
   VALUE gender 1 = 'Male'
                2 = 'Female';
   VALUE agegroup 13 -< 20 = 'Teen'
                  20 -< 65 = 'Adult'
                  65 - HIGH = 'Senior';
   VALUE $col 'W' = 'Moon White'
              'B' = 'Sky Blue'
              'Y' = 'Sunburst Yellow'
              'G' = 'Rain Cloud Gray';
* Print data using user-defined and standard (DOLLAR8.) formats;
PROC PRINT DATA = carsurvey;
   FORMAT Sex gender. Age agegroup. Color $col. Income DOLLAR8.;
   TITLE 'Survey Results Printed with User-Defined Formats';
RUN;
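This converts the numeric codes 1 and 2 into the corresponding text Male and Female, and it can also discretize a continuous variable into groups. The output is: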
Survey Results Printed with User-Defined Formats        1

Obs    Age       Sex        Income    Color
  1    Teen      Male      $14,000    Sunburst Yellow
  2    Adult     Male      $65,000    Rain Cloud Gray
  3    Senior    Female    $35,000    Sky Blue
  4    Adult     Male      $44,000    Sunburst Yellow
  5    Adult     Female    $83,000    Moon White
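The range notation is the part that tends to trip people up, so here is a small sketch that probes the boundaries. It reuses the agegroup ranges from above; the OTHER category, the `ages` data set and the use of the PUT function to store the formatted value in a new variable are additions made for this illustration.

```sas
* Same ranges as above plus a catch-all OTHER category;
PROC FORMAT;
   VALUE agegroup 13 -< 20  = 'Teen'
                  20 -< 65  = 'Adult'
                  65 - HIGH = 'Senior'
                  OTHER     = 'Out of range';
RUN;

* Hypothetical ages chosen to sit on or near the range boundaries;
DATA ages;
   INPUT Age @@;
   * The PUT function returns the formatted value as a character string;
   Group = PUT(Age, agegroup.);
   DATALINES;
12 13 19.9 20 64 65
;
RUN;

PROC PRINT DATA = ages;
   TITLE 'Boundary behaviour of the agegroup format';
RUN;
```

Because `-<` includes the lower bound and excludes the upper one, 13 and 19.9 come out as Teen, 20 and 64 as Adult, and 65 as Senior, while 12 falls through to OTHER. Without an OTHER line, unmatched values are simply printed unformatted.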
Custom output can also go as far as simple text assembly, for example:
* Write a report with FILE and PUT statements;
DATA _NULL_;
   INFILE 'c:\MyRawData\Candy.dat';
   INPUT Name $ 1-11 Class @15 DateReturned MMDDYY10.
         CandyType $ Quantity;
   Profit = Quantity * 1.25;
   FILE 'c:\MyRawData\Student.txt' PRINT;
   TITLE;
   PUT @5 'Candy sales report for ' Name 'from classroom ' Class
     // @5 'Congratulations! You sold ' Quantity 'boxes of candy'
     / @5 'and earned ' Profit DOLLAR6.2 ' for our field trip.';
   PUT _PAGE_;
RUN;
This produces a series of consecutive reports, one per student (note that DATA _NULL_; does not create any SAS data set):
    Candy sales report for Adriana from classroom 21

    Congratulations! You sold 7 boxes of candy
    and earned $8.75 for our field trip.
------------------
    Candy sales report for Nathan from classroom 14

    Congratulations! You sold 19 boxes of candy
    and earned $23.75 for our field trip.
------------------
    Candy sales report for Matthew from classroom 14

    Congratulations! You sold 14 boxes of candy
    and earned $17.50 for our field trip.
------------------
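The pointer controls in that PUT statement can be tried out without any files. The sketch below is a simplified variant that assumes inline data and FILE PRINT (which sends the report to the regular output destination) instead of writing to Student.txt; the wording of the report lines is shortened for illustration.

```sas
* No data set is created and each input row becomes one report page;
DATA _NULL_;
   INPUT Name $ Class Quantity;
   Profit = Quantity * 1.25;
   * FILE PRINT routes PUT output to the procedure output instead of the log;
   FILE PRINT;
   TITLE;
   * @5 moves to column 5 and two slashes leave one blank line;
   PUT @5 'Candy sales report for ' Name 'from classroom ' Class
     // @5 'You sold ' Quantity 'boxes and earned ' Profit DOLLAR6.2 '.';
   * _PAGE_ starts a new page for the next student;
   PUT _PAGE_;
   DATALINES;
Adriana 21 7
Nathan 14 19
;
RUN;
```

Variables written with list-style output, such as Name and Quantity here, are followed by a single blank, which is why the quoted strings after them do not start with a space.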
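Originally published on the WeChat official account PPV课数据科学社区: [Learning] SAS in Seven Days (3): Calling Basic Modules (Formats, Counts, Summary Statistics, Sorting, etc.) (Part 1)
Original article by ppvke. If you repost it, please credit the source: http://www.ppvke.com/archives/31024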