写在前面的话
这篇文章为我提交的 dplyr issue ,关于连接数据库以及解决中文乱码问题。所以这篇文章主要有三个点:
dplyr 调用数据库
R连接数据库 RSQLite
readr 方便快捷读入写出数据
这些都是基本的处理数据的技能,dplyr包能够处理GB级别的数据,再加上SQLite本地轻型数据库,可以实现数据存储、调用、清洗、处理等一个系列问题。时间有限,背景说明就少说一点,直接上干货。
请对照参考我提交的两个issues:https://github.com/hadley/dplyr/issues/1561 , https://github.com/rstats-db/RSQLite/issues/108 .
1. 可重复使用的例子
SQL代码生成需要用的tables
在你的sqlite里面运行下面代码,就能得到例子中用到的两个表。关于搭建sqlite下次有时间再写详细教程,其实非常简单,根本不需要部署,安装了就可以用,可自行百度去。
-- ----------------------------
-- Table structure for flights
-- ----------------------------
DROP TABLE IF EXISTS "main" . "flights" ;
CREATE TABLE flights (
year INT ,
month INT ,
day INT ,
dep_time INT ,
dep_delay REAL ,
arr_time INT ,
arr_delay REAL ,
carrier TEXT ,
tailnum TEXT
);
-- ----------------------------
-- Records of flights
-- ----------------------------
INSERT INTO "main" . "flights" VALUES ( 2013 , 1 , 1 , 517 , 2 . 0 , 830 , 11 . 0 , 'UA' , 'N14228' );
INSERT INTO "main" . "flights" VALUES ( 2013 , 1 , 1 , 533 , 4 . 0 , 850 , 20 . 0 , 'UA' , 'N24211' );
INSERT INTO "main" . "flights" VALUES ( 2013 , 1 , 1 , 542 , 2 . 0 , 923 , 33 . 0 , 'AA' , 'N619AA' );
INSERT INTO "main" . "flights" VALUES ( 2013 , 1 , 1 , 544 , - 1 . 0 , 1004 , - 18 . 0 , 'B6' , 'N804JB' );
INSERT INTO "main" . "flights" VALUES ( 2013 , 1 , 1 , 554 , - 6 . 0 , 812 , - 25 . 0 , 'DL' , 'N668DN' );
INSERT INTO "main" . "flights" VALUES ( 2013 , 1 , 1 , 554 , - 4 . 0 , 740 , 12 . 0 , 'UA' , 'N39463' );
-- ----------------------------
-- Table structure for cor
-- ----------------------------
DROP TABLE IF EXISTS "main" . "cor" ;
CREATE TABLE cor (
XXDM TEXT ,
XBDM TEXT ,
NJDM TEXT ,
"身体形态" INT ,
"50米跑等级" INT ,
"肺活量等级" INT
);
-- ----------------------------
-- Records of cor
-- ----------------------------
INSERT INTO "main" . "cor" VALUES ( '0015' , 2 , 11 , 3 , 3 , 4 );
INSERT INTO "main" . "cor" VALUES ( '0015' , 1 , 11 , 2 , 2 , 4 );
INSERT INTO "main" . "cor" VALUES ( '0015' , 1 , 11 , 2 , 2 , 4 );
INSERT INTO "main" . "cor" VALUES ( '0015' , 1 , 11 , 2 , 2 , 2 );
INSERT INTO "main" . "cor" VALUES ( '0015' , 1 , 11 , 2 , 2 , 3 );
INSERT INTO "main" . "cor" VALUES ( '0007' , 1 , 13 , 2 , 3 , 4 );
关于表的一些说明:flights 这个表是 {nycflights13} package 里面的,描述的是 On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
cor 这个表是我的sqlite数据库中的表。为了篇幅,只取了前6行。
2. dplyr连接数据库
> library ( dplyr )
> mydb <- src_sqlite ( "E:/SQLite/database/yourdbname.db" , create = F )
> mydb
src : sqlite [ E :/ SQLite / database / yourdbname.db ]
tbls : cor , flights
> tbl ( mydb , "flights" )
Source : sqlite [ E :/ SQLite / database / yourdbname.db ]
From : flights [ 336 , 776 x 16 ]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum
( int ) ( int ) ( int ) ( int ) ( dbl ) ( int ) ( dbl ) ( chr ) ( chr )
1 2013 1 1 517 2 830 11 UA N14228
2 2013 1 1 533 4 850 20 UA N24211
3 2013 1 1 542 2 923 33 AA N619AA
4 2013 1 1 544 -1 1004 -18 B6 N804JB
5 2013 1 1 554 -6 812 -25 DL N668DN
6 2013 1 1 554 -4 740 12 UA N39463
> tbl ( mydb , "cor" )
Error in gsub ( quote , paste0 ( quote , quote ), x , fixed = TRUE ) :
input string 1 is invalid in this locale
> tbl ( mydb , sql ( "SELECT * from flights" ))
> tbl ( mydb , sql ( "SELECT * from cor" ))
Error in gsub ( quote , paste0 ( quote , quote ), x , fixed = TRUE ) :
input string 1 is invalid in this locale
发现如果表名称中包含中文,dplyr会出现[]报错]issue1 。强烈建议数据库中的表中变量名字尽量不用中文,哪怕用拼音代替也行
3. RSQLite连接数据库
虽然dplyr
包会报错,但是用RSQLite
包是没有问题的,只不过多一步转化步骤。详细请看我提交的RSQLite issue
> library ( DBI )
> library ( RSQLite )
> con <- dbConnect ( SQLite (), "E:/SQLite/database/你的数据名字.db" )
> query <- "select * from cor;"
> cd <- tbl_df ( dbGetQuery ( con , query ))
> cd
Source : local data frame [ 130 , 789 x 9 ]
XXDM XBDM NJDM 韬 綋褰 ㈡€\ x81 50 绫宠窇绛夌骇 鑲烘椿閲忕瓑绾 \ xa7
( chr ) ( chr ) ( chr ) ( int ) ( int ) ( int )
1 0015 2 11 3 3 4
2 0015 1 11 2 2 4
3 0015 1 11 2 2 4
> names ( cd ) <- iconv ( names ( cd ), "UTF-8" , "GBK" )
> cd
Source : local data frame [ 130 , 789 x 9 ]
XXDM XBDM NJDM 身体形态 50 米跑等级 肺活量等级
( chr ) ( chr ) ( chr ) ( int ) ( int ) ( int )
1 0015 2 11 3 3 4
2 0015 1 11 2 2 4
3 0015 1 11 2 2 4
RSQLite能够调入数据,但是标题乱码。其原因我个人认为是 sqlite的默认编码是 UTF-8, 而本地R环境是GBK编码,R没有自动将这两个代码转化,故乱码,所以需要加一句函数 iconv(names(cd),"UTF-8","GBK")
。
请仔细看,我的这个小数据库有13W行,才200M多,很快就能处理完。前天我没有用数据库直接是用dplyr包轻松处理了256W行的数据。一开始还担心会不会爆内存,看来是我多虑了。
4. 解决中文乱码问题
中文乱码这个问题困扰我多时,从生成中文PDF -> 中文beamer ->中文分词 ->现在 一直伴随着我,最近摸索了一个套经验,还想了一下原因,仅供参考:
I think I figure out the reason why Chinese Characters garbled in my R environment. It‘s because the encoding of the data is UTF-8
while my local environment is GBK
[(Simplified)_China.936)] , R don’t automatically convert the UTF-8
encoding to GBK
. So ,I use iconv(char,"UTF-8","GBK")
to convert the encoding. 装逼,请走开。
> library ( stringi )
> stri_enc_get () # 用于检测本地是什么编码环境
> stri_enc_detect2 ( char ) # 用于检测字符串是什么编码
> stri_conv ( char , "UTF-8" , "GBK" ) # 直接转码
> iconv ( char , "UTF-8" , "GBK" ) # 与{base} package 效果一样
To solve the problem from the source, I recommend the {readr} package
为了从源头解决这个问题,我推荐使用 readr 这个包。下面的例子仅供参考:
> library ( readr )
> guess_encoding ( "filename.csv" ) # guess the encoding of the file
encoding confidence
1 GB18030 1
> read_csv ( "filename.csv" , col_names = T , col_types = cols (
xxdm = "c" ))
Source : local data frame [ 247 x 6 ]
xxdm nj economic chi mat eng
( chr ) ( int ) ( chr ) ( dbl ) ( dbl ) ( dbl )
1 0003 14 < U +00 B8 >< df > 89.93566 85.29368 80.90276
2 0003 15 < U +00 B8 >< df > 83.33544 83.82542 88.48453
3 0004 14 < U +00 B8 >< df > 84.26736 84.38928 75.06619
4 0004 15 < U +00 B8 >< df > 81.33385 84.25302 83.96041
5 0005 14 < d6 >< d0 > 80.89536 81.41362 81.64121
6 0005 15 < d6 >< d0 > 82.39663 82.67030 87.31950
you may find that the `economic` is in the form of UTF -8 encoding.
> read_csv ( "filename.csv" , col_names = T , col_types = cols (
xxdm = "c" ),
locale = locale ( encoding = "GB18030" ) # important)
Source : local data frame [ 247 x 6 ]
xxdm nj economic chi mat eng
( chr ) ( int ) ( chr ) ( dbl ) ( dbl ) ( dbl )
1 0003 14 高 89.93566 85.29368 80.90276
2 0003 15 高 83.33544 83.82542 88.48453
3 0004 14 高 84.26736 84.38928 75.06619
4 0004 15 高 81.33385 84.25302 83.96041
5 0005 14 中 80.89536 81.41362 81.64121
6 0005 15 中 82.39663 82.67030 87.31950
附上我的R环境
Session info -----------------------------------------------------------------
setting value
version R version 3.2.2 ( 2015-08-14 )
system x86_64 , mingw32
ui RStudio ( 0.99.486 )
language ( EN )
collate Chinese ( Simplified ) _ China.936
tz Asia / Taipei
date 2015-12-01
Packages -----------------------------------------------------------------
package * version date source
assertthat 0.1 2013-12-06 CRAN ( R 3.2.1 )
DBI * 0.3.1.9008 2015-11-05 Github ( rstats - db / DBI @ 7 a0ad76 )
devtools 1.9.1 2015-09-11 CRAN ( R 3.2.2 )
digest 0.6.8 2014-12-31 CRAN ( R 3.2.1 )
dplyr * 0.4.3 2015-09-01 CRAN ( R 3.2.2 )
magrittr 1.5 2014-11-22 CRAN ( R 3.2.1 )
memoise 0.2.1 2014-04-22 CRAN ( R 3.2.1 )
nycflights13 * 0.1 2014-07-22 CRAN ( R 3.2.2 )
R6 2.1.1 2015-08-19 CRAN ( R 3.2.2 )
Rcpp 0.12.2 2015-11-15 CRAN ( R 3.2.2 )
RSQLite * 1.0.9000 2015-11-05 Github ( rstats - db / RSQLite @ 1187 d0c )
5. 最后
改手机界面确实好难,一点都不懂CSS+html的人,完全无从下手。上个周末改版失败,还没有这个好看,所以暂时就这样了。所以为了更好的观看体验,请使用网页端。
Bruce Zhao
/
2015-12-17
/ - views
Published under(CC) BY-NC-SA 3.0 CN .
//评论数
//参与数