Efficiently Removing Duplicate Rows from a GreenPlum Table
2018-06-17 19:49:56
1. There are basically three methods for removing duplicate rows from a PostgreSQL table; they were found online and are listed in Appendix 1. None of them, however, works for Greenplum.
2. A Greenplum table is distributed across segments. Within a single segment, ctid is unique, but rows on different segments can share the same ctid, so deduplication in Greenplum must also involve gp_segment_id.
3. A relatively convoluted method found online is given in Appendix 2.
4. The final method is the single statement below (a quick end-to-end check follows it):
DELETE FROM test
WHERE (gp_segment_id, ctid) NOT IN (
    SELECT gp_segment_id, min(ctid)
    FROM test
    GROUP BY x, gp_segment_id
);
Verified. Note that grouping by gp_segment_id removes duplicates within each segment; this is a complete deduplication when the table is distributed on x (or on a subset of the duplicate-detection columns), because all duplicates of a given x then hash to the same segment.
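As a sanity check, the whole flow can be replayed on a throwaway table. This is a minimal sketch: the table name test, the column x, and the sample values are assumptions for illustration, and DISTRIBUTED BY (x) provides the condition noted above, namely that every duplicate of a given x lands on the same segment.

DROP TABLE IF EXISTS test;
-- Hypothetical scratch table; DISTRIBUTED BY (x) guarantees that
-- all rows sharing the same x are stored on the same segment.
CREATE TABLE test (x int) DISTRIBUTED BY (x);
INSERT INTO test VALUES (1), (1), (2), (2), (2), (3);

-- Inspect the physical row addresses: rows on different segments
-- may report the same ctid, which is why ctid alone is not enough.
SELECT gp_segment_id, ctid, x FROM test ORDER BY x;

-- Keep the first physical copy of each x on each segment, delete the rest.
DELETE FROM test
WHERE (gp_segment_id, ctid) NOT IN (
    SELECT gp_segment_id, min(ctid)
    FROM test
    GROUP BY x, gp_segment_id
);

-- Expect exactly one row per distinct x.
SELECT x, count(*) FROM test GROUP BY x ORDER BY x;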
Appendix 1: Three methods for deduplicating a PostgreSQL table
Quoted from: http://my.oschina.net/swuly302/blog/144933
Using the example from the official PostgreSQL 9.2 documentation:
CREATE TABLE weather (
    city    varchar(80),
    temp_lo int,   -- low temperature
    temp_hi int,   -- high temperature
    prcp    real,  -- precipitation
    date    date
);

INSERT INTO weather VALUES
    ('San Francisco', 46, 50, 0.25, '1994-11-27'),
    ('San Francisco', 43, 57, 0, '1994-11-29'),
    ('Hayward', 37, 54, NULL, '1994-11-29'),
    ('Hayward', 37, 54, NULL, '1994-11-29');  -- duplicated row
There are three methods:
Method 1: replace the table.

-- Copy the distinct rows into a new table weather_temp
SELECT DISTINCT city, temp_lo, temp_hi, prcp, date
INTO weather_temp
FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;

Or:

-- Create weather_temp with the same structure as weather
CREATE TABLE weather_temp (LIKE weather INCLUDING CONSTRAINTS);
-- Fill weather_temp with the de-duplicated rows
INSERT INTO weather_temp SELECT DISTINCT * FROM weather;
-- Drop the original table
DROP TABLE weather;
-- Rename the new table to weather
ALTER TABLE weather_temp RENAME TO weather;

Easy to understand, but it relies on destructive operations such as DROP, and for a large table it costs both time and disk space. Not recommended.

Method 2: add a surrogate column.

-- Add a new column of type SERIAL
ALTER TABLE weather ADD COLUMN id SERIAL;
-- Delete the duplicate rows
DELETE FROM weather
WHERE id NOT IN (
    SELECT max(id)
    FROM weather
    GROUP BY city, temp_lo, temp_hi, prcp, date
);
-- Drop the added column
ALTER TABLE weather DROP COLUMN id;

This needs an extra column. (The original author was unsure whether Postgres appends the new field to the rows in place, possibly splitting pages (blocks are typically 8 KB), or copies the whole table into a new one. In fact, adding a SERIAL column forces a full table rewrite, because its nextval() default must be evaluated for every existing row.) Not great.

Method 3: system columns (see "System Columns" in the documentation).

DELETE FROM weather
WHERE ctid NOT IN (
    SELECT max(ctid)
    FROM weather
    GROUP BY city, temp_lo, temp_hi, prcp, date
);

Postgres-specific, but simple.
However, a Greenplum table is split across segments, so ctid alone cannot be used to remove duplicates.
Appendix 2:
https://discuss.pivotal.io/hc/zh-cn/community/posts/206428018-What-is-the-most-efficient-way-of-deleting-duplicate-records-from-a-table-
What is the most efficient way of deleting duplicate records from a table?
Currently we use Primary Keys to avoid loading duplicate data into our tables, but PKs bring many restrictions. Since we can’t easily identify or prevent duplicates arriving from the variety of 3rd-party upstream systems, we wanted to investigate the ‘load everything, remove duplicates afterwards’ approach.
In Postgres, you can use an efficient method such as:
DELETE FROM test
WHERE ctid NOT IN (
    SELECT min(ctid)
    FROM test
    GROUP BY x
);
(where 'x' is the unique column list)
However, in Greenplum ‘ctid’ is only unique per segment.
One approach would be:
DELETE FROM test
USING (
    SELECT gp_segment_id, ctid
    FROM (
        SELECT gp_segment_id, ctid,
               rank() OVER (PARTITION BY x ORDER BY gp_segment_id, ctid) AS rk
        FROM test
    ) foo
    WHERE rk <> 1
) rows_to_delete
WHERE test.gp_segment_id = rows_to_delete.gp_segment_id
  AND test.ctid = rows_to_delete.ctid;
But the use of window functions, subqueries etc. feels pretty inefficient.
Is there a better form?
Note that in our use case our unique column list varies up to ~10 columns so we don’t have a single unique key field – hence the RANK in the example. I suppose adding a sequence column could be used, but how much overhead does this add when doing bulk data loading?
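As a rough sketch of that sequence-column idea (the staging table test, its column x, and the helper column name row_id are all hypothetical here): a cluster-wide unique id removes the need for the window function, at the cost of one nextval() per loaded row, which in Greenplum involves the master's sequence server and is precisely the bulk-load overhead being asked about.

DROP TABLE IF EXISTS test;
-- Hypothetical staging table: load everything first, deduplicate afterwards.
-- The bigserial column draws from a sequence, so row_id is unique
-- across the whole cluster, not just within one segment.
CREATE TABLE test (x int, row_id bigserial) DISTRIBUTED BY (x);
INSERT INTO test (x) VALUES (1), (1), (2), (3), (3);

-- With a cluster-wide unique id, a plain GROUP BY replaces the window function:
-- keep the lowest row_id per unique column list, delete the rest.
DELETE FROM test
WHERE row_id NOT IN (
    SELECT min(row_id)
    FROM test
    GROUP BY x
);

-- Drop the helper column once deduplication is done.
ALTER TABLE test DROP COLUMN row_id;

Whether the per-row sequence calls are acceptable during bulk loading is the open question from the post; for very large loads the windowed delete above may still be cheaper.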