📅 Last modified: 2020-11-12 00:49:51 🧑 Author: Mango
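In this section, we will learn different ways to delete duplicate rows in MySQL and Oracle. When a SQL table contains duplicate rows, we usually need to remove the extra copies.
The following script creates a table named contacts.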
DROP TABLE IF EXISTS contacts;
CREATE TABLE contacts (
id INT PRIMARY KEY AUTO_INCREMENT,
first_name VARCHAR(30) NOT NULL,
last_name VARCHAR(25) NOT NULL,
email VARCHAR(210) NOT NULL,
age VARCHAR(22) NOT NULL
);
We insert the following data into the table.
INSERT INTO contacts (first_name,last_name,email,age)
VALUES ('Kavin','Peterson','kavin.peterson@verizon.net','21'),
('Nick','Jonas','nick.jonas@me.com','18'),
('Peter','Heaven','peter.heaven@google.com','23'),
('Michal','Jackson','michal.jackson@aol.com','22'),
('Sean','Bean','sean.bean@yahoo.com','23'),
('Tom','Baker','tom.baker@aol.com','20'),
('Ben','Barnes','ben.barnes@comcast.net','17'),
('Mischa','Barton','mischa.barton@att.net','18'),
('Sean','Bean','sean.bean@yahoo.com','16'),
('Eliza','Bennett','eliza.bennett@yahoo.com','25'),
('Michal','Krane','michal.Krane@me.com','25'),
('Peter','Heaven','peter.heaven@google.com','20'),
('Brian','Blessed','brian.blessed@yahoo.com','20'),
('Kavin','Peterson','kavin.peterson@verizon.net','30');
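After executing a DELETE statement, we re-run this script to recreate the test data.
The following query returns the data from the contacts table: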
SELECT * FROM contacts
ORDER BY email;
id | first_name | last_name | email | age |
7 | Ben | Barnes | ben.barnes@comcast.net | 17 |
13 | Brian | Blessed | brian.blessed@yahoo.com | 20 |
10 | Eliza | Bennett | eliza.bennett@yahoo.com | 25 |
1 | Kavin | Peterson | kavin.peterson@verizon.net | 21 |
14 | Kavin | Peterson | kavin.peterson@verizon.net | 30 |
4 | Michal | Jackson | michal.jackson@aol.com | 22 |
11 | Michal | Krane | michal.Krane@me.com | 25 |
8 | Mischa | Barton | mischa.barton@att.net | 18 |
2 | Nick | Jonas | nick.jonas@me.com | 18 |
3 | Peter | Heaven | peter.heaven@google.com | 23 |
12 | Peter | Heaven | peter.heaven@google.com | 20 |
5 | Sean | Bean | sean.bean@yahoo.com | 23 |
9 | Sean | Bean | sean.bean@yahoo.com | 16 |
6 | Tom | Baker | tom.baker@aol.com | 20 |
The following SQL query returns the duplicate emails in the contacts table:
SELECT
email, COUNT(email)
FROM
contacts
GROUP BY
email
HAVING
COUNT(email) > 1;
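email | COUNT(email) |
kavin.peterson@verizon.net | 2 |
peter.heaven@google.com | 2 |
sean.bean@yahoo.com | 2 |
Three emails are duplicated, each appearing in two rows. The following DELETE JOIN statement removes the duplicates, keeping the row with the highest id for each email: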
DELETE t1 FROM contacts t1
INNER JOIN contacts t2
WHERE
t1.id < t2.id AND
t1.email = t2.email;
Output:
Query OK, 3 rows affected (0.10 sec)
Three rows have been deleted. We execute the query below to check whether any duplicate emails remain:
SELECT
email,
COUNT(email)
FROM
contacts
GROUP BY
email
HAVING
COUNT(email) > 1;
The query returns an empty set. To verify the data in the contacts table, execute the following SQL query:
SELECT * FROM contacts;
id | first_name | last_name | email | age |
2 | Nick | Jonas | nick.jonas@me.com | 18 |
4 | Michal | Jackson | michal.jackson@aol.com | 22 |
6 | Tom | Baker | tom.baker@aol.com | 20 |
7 | Ben | Barnes | ben.barnes@comcast.net | 17 |
8 | Mischa | Barton | mischa.barton@att.net | 18 |
9 | Sean | Bean | sean.bean@yahoo.com | 16 |
10 | Eliza | Bennett | eliza.bennett@yahoo.com | 25 |
11 | Michal | Krane | michal.Krane@me.com | 25 |
12 | Peter | Heaven | peter.heaven@google.com | 20 |
13 | Brian | Blessed | brian.blessed@yahoo.com | 20 |
14 | Kavin | Peterson | kavin.peterson@verizon.net | 30 |
The rows with IDs 1, 3, and 5 have been deleted: because of the condition t1.id < t2.id, the statement keeps only the row with the highest id for each duplicated email.
To keep the row with the lowest id instead, first re-run the script that recreates the contacts table, then execute the following statement:
DELETE c1 FROM contacts c1
INNER JOIN contacts c2
WHERE
c1.id > c2.id AND
c1.email = c2.email;
After this statement, the rows with IDs 9, 12, and 14 are gone and the contacts table contains the following rows:
id | first_name | last_name | email | age |
1 | Kavin | Peterson | kavin.peterson@verizon.net | 21 |
2 | Nick | Jonas | nick.jonas@me.com | 18 |
3 | Peter | Heaven | peter.heaven@google.com | 23 |
4 | Michal | Jackson | michal.jackson@aol.com | 22 |
5 | Sean | Bean | sean.bean@yahoo.com | 23 |
6 | Tom | Baker | tom.baker@aol.com | 20 |
7 | Ben | Barnes | ben.barnes@comcast.net | 17 |
8 | Mischa | Barton | mischa.barton@att.net | 18 |
10 | Eliza | Bennett | eliza.bennett@yahoo.com | 25 |
11 | Michal | Krane | michal.Krane@me.com | 25 |
13 | Brian | Blessed | brian.blessed@yahoo.com | 20 |
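Because DELETE is destructive, it can help to preview the affected rows first. The following is a minimal sketch, assuming the freshly loaded sample data: it rewrites the keep-the-lowest-id DELETE JOIN (the variant with c1.id > c2.id) as a SELECT with the same join conditions.

```sql
-- Preview only: list the rows that the DELETE JOIN with c1.id > c2.id would remove.
-- A row qualifies when another row with the same email has a smaller id.
SELECT DISTINCT
    c1.id,
    c1.email
FROM contacts c1
INNER JOIN contacts c2
    ON c1.email = c2.email
   AND c1.id > c2.id
ORDER BY c1.id;
```

On the sample data this lists the ids 9, 12, and 14. Swapping the comparison to c1.id < c2.id previews the keep-the-highest-id variant instead.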
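To delete duplicate rows using an intermediate table, follow these steps:
Step 1. Create a new table with the same structure as the original table: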
CREATE TABLE source_copy LIKE source;
Step 2. Insert the distinct rows from the original table into the new table:
INSERT INTO source_copy
SELECT * FROM source
GROUP BY col;
Step 3. Drop the original table and rename the intermediate table to the original name:
DROP TABLE source;
ALTER TABLE source_copy RENAME TO source;
For example, the following statements remove the rows with duplicate emails from the contacts table:
-- step 1
CREATE TABLE contacts_temp
LIKE contacts;
-- step 2
INSERT INTO contacts_temp
SELECT * FROM contacts
GROUP BY email;
-- step 3
DROP TABLE contacts;
ALTER TABLE contacts_temp
RENAME TO contacts;
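One caveat about step 2: when the ONLY_FULL_GROUP_BY SQL mode is enabled (it is part of the default sql_mode since MySQL 5.7.5), SELECT * combined with GROUP BY email is rejected, because the other columns are neither grouped nor aggregated. A possible alternative for step 2, sketched here under the assumption that we want to keep the row with the lowest id for each email, selects the surviving rows explicitly. The alias keep_id is only an illustrative name.

```sql
-- step 2 (alternative sketch): copy only the lowest-id row for each email
INSERT INTO contacts_temp
SELECT c.*
FROM contacts c
INNER JOIN (
    SELECT MIN(id) AS keep_id   -- keep_id: illustrative alias for the surviving row's id
    FROM contacts
    GROUP BY email
) k ON c.id = k.keep_id;
```

Using MAX(id) in the subquery would keep the most recently inserted row for each email instead.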
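Note: the ROW_NUMBER() function is supported since MySQL 8.0.2, so we should check the MySQL version before using it.
The following statement uses ROW_NUMBER() to assign a sequential integer to each row within a group of identical emails. If an email is duplicated, the row number of the extra rows is greater than one.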
SELECT id, email, ROW_NUMBER()
OVER (PARTITION BY email
ORDER BY email
) AS row_num
FROM contacts;
The following SQL query returns the list of IDs of the duplicate rows:
SELECT id
FROM (SELECT id,
ROW_NUMBER() OVER (
PARTITION BY email ORDER BY email) AS row_num
FROM
contacts
) t
WHERE
row_num > 1;
Output:
id |
9 |
12 |
14 |
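The id list above can be fed to a DELETE statement to remove the duplicates. This is a sketch rather than the only way to do it; it assumes MySQL 8.0.2 or later and orders each partition by id (instead of by email) so that the surviving row is deterministically the one with the lowest id.

```sql
-- Delete every row whose row number within its email group is greater than one.
-- The inner derived table t is materialized, so the subquery can read from
-- the same table that the DELETE modifies.
DELETE FROM contacts
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY email
                   ORDER BY id
               ) AS row_num
        FROM contacts
    ) t
    WHERE row_num > 1
);
```

Ordering the window by id DESC would keep the row with the highest id for each email instead.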
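When we find duplicate records in a table, we have to remove the unwanted copies to keep the data clean and unique. If a table has duplicate rows, we can remove them with the DELETE statement.
In this case, the table has a column that is not part of the group of columns used to identify duplicate records.
Consider the table below: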
VEGETABLE_ID | VEGETABLE_NAME | COLOR |
01 | Potato | Brown |
02 | Potato | Brown |
03 | Onion | Red |
04 | Onion | Red |
05 | Onion | Red |
06 | Pumpkin | Green |
07 | Pumpkin | Yellow |
-- create the vegetable table
CREATE TABLE vegetables (
VEGETABLE_ID NUMBER GENERATED BY DEFAULT AS IDENTITY,
VEGETABLE_NAME VARCHAR2(100),
color VARCHAR2(20),
PRIMARY KEY (VEGETABLE_ID)
);
-- insert sample rows
INSERT INTO vegetables (VEGETABLE_NAME,color) VALUES('Potato','Brown');
INSERT INTO vegetables (VEGETABLE_NAME,color) VALUES('Potato','Brown');
INSERT INTO vegetables (VEGETABLE_NAME,color) VALUES('Onion','Red');
INSERT INTO vegetables (VEGETABLE_NAME,color) VALUES('Onion','Red');
INSERT INTO vegetables (VEGETABLE_NAME,color) VALUES('Onion','Red');
INSERT INTO vegetables (VEGETABLE_NAME,color) VALUES('Pumpkin','Green');
INSERT INTO vegetables (VEGETABLE_NAME,color) VALUES('Pumpkin','Yellow');
-- query data from the vegetable table
SELECT * FROM vegetables;
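Before deleting anything, we may want to see which groups actually contain duplicates. The query below is a small sketch that applies the GROUP BY and HAVING pattern to the vegetables table; the column alias copies is an illustrative name.

```sql
-- List each (VEGETABLE_NAME, color) combination that occurs more than once,
-- together with the number of rows in that group.
SELECT
    VEGETABLE_NAME,
    color,
    COUNT(*) AS copies   -- copies: illustrative alias
FROM
    vegetables
GROUP BY
    VEGETABLE_NAME,
    color
HAVING
    COUNT(*) > 1;
```

On the sample rows this reports Potato/Brown with 2 copies and Onion/Red with 3 copies; the two Pumpkin rows differ in color, so they are not duplicates.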
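Suppose we want to keep the row with the highest VEGETABLE_ID in each group and delete all other copies. The following query returns the highest VEGETABLE_ID for each group: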
SELECT
MAX(VEGETABLE_ID)
FROM
vegetables
GROUP BY
VEGETABLE_NAME,
color
ORDER BY
MAX(VEGETABLE_ID);
MAX(VEGETABLE_ID) |
2 |
5 |
6 |
7 |
We use the DELETE statement to remove the rows whose VEGETABLE_ID is not the highest value in its group.
DELETE FROM
vegetables
WHERE
VEGETABLE_ID NOT IN
(
SELECT
MAX(VEGETABLE_ID)
FROM
vegetables
GROUP BY
VEGETABLE_NAME,
color
);
Three rows have been deleted.
SELECT * FROM vegetables;
VEGETABLE_ID | VEGETABLE_NAME | COLOR |
02 | Potato | Brown |
05 | Onion | Red |
06 | Pumpkin | Green |
07 | Pumpkin | Yellow |
If we want to keep the row with the lowest ID instead, we use the MIN() function in place of the MAX() function.
DELETE FROM
vegetables
WHERE
VEGETABLE_ID NOT IN
(
SELECT
MIN(VEGETABLE_ID)
FROM
vegetables
GROUP BY
VEGETABLE_NAME,
color
);
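The approach above works when the table has a column that is not part of the group used to identify duplicates. If the values in every column are duplicated, we cannot use the VEGETABLE_ID column to tell the copies apart.
Let's drop the vegetables table and recreate it with a new structure.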
DROP TABLE vegetables;
CREATE TABLE vegetables (
VEGETABLE_ID NUMBER,
VEGETABLE_NAME VARCHAR2(100),
Color VARCHAR2(20)
);
INSERT INTO vegetables (VEGETABLE_ID,VEGETABLE_NAME,color) VALUES(1,'Potato','Brown');
INSERT INTO vegetables (VEGETABLE_ID,VEGETABLE_NAME,color) VALUES(1,'Potato','Brown');
INSERT INTO vegetables (VEGETABLE_ID,VEGETABLE_NAME,color) VALUES(2,'Onion','Red');
INSERT INTO vegetables (VEGETABLE_ID,VEGETABLE_NAME,color) VALUES(2,'Onion','Red');
INSERT INTO vegetables (VEGETABLE_ID,VEGETABLE_NAME,color) VALUES(2,'Onion','Red');
INSERT INTO vegetables (VEGETABLE_ID,VEGETABLE_NAME,color) VALUES(3,'Pumpkin','Green');
INSERT INTO vegetables (VEGETABLE_ID,VEGETABLE_NAME,color) VALUES(4,'Pumpkin','Yellow');
SELECT * FROM vegetables;
VEGETABLE_ID | VEGETABLE_NAME | COLOR |
01 | Potato | Brown |
01 | Potato | Brown |
02 | Onion | Red |
02 | Onion | Red |
02 | Onion | Red |
03 | Pumpkin | Green |
04 | Pumpkin | Yellow |
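In this vegetables table, the values in all three columns (VEGETABLE_ID, VEGETABLE_NAME, and color) are duplicated.
We can use rowid, a locator that specifies where Oracle stores the row. Because rowid is unique for each row, we can use it to delete the duplicate rows.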
DELETE
FROM
Vegetables
WHERE
rowid NOT IN
(
SELECT
MIN(rowid)
FROM
vegetables
GROUP BY
VEGETABLE_ID,
VEGETABLE_NAME,
color
);
The following query verifies the delete operation:
SELECT * FROM vegetables;
VEGETABLE_ID | VEGETABLE_NAME | COLOR |
01 | Potato | Brown |
02 | Onion | Red |
03 | Pumpkin | Green |
04 | Pumpkin | Yellow |
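The same result can also be reached by combining rowid with the ROW_NUMBER() analytic function. The statement below is an alternative sketch, not the method used above; it assumes the fully duplicated vegetables table has been recreated, and the alias rid is an illustrative name.

```sql
-- Number the rows inside each group of identical values, then delete every
-- row whose number is greater than one, identifying the rows by rowid.
DELETE FROM vegetables
WHERE rowid IN (
    SELECT rid
    FROM (
        SELECT rowid AS rid,   -- rid: illustrative alias for the row locator
               ROW_NUMBER() OVER (
                   PARTITION BY VEGETABLE_ID, VEGETABLE_NAME, color
                   ORDER BY rowid
               ) AS row_num
        FROM vegetables
    )
    WHERE row_num > 1
);
```

Like the MIN(rowid) version, this keeps one row per group; the window's ORDER BY clause controls which copy survives.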