Prune Objects --keep-threads option (#350)
author: ilja <akkoma.dev@ilja.space>
Mon, 9 Jan 2023 22:15:41 +0000 (22:15 +0000)
committer: floatingghost <hannah@coffee-and-dreams.uk>
Mon, 9 Jan 2023 22:15:41 +0000 (22:15 +0000)
This adds an option to the `prune_objects` mix task.
The original behaviour deleted all non-local public posts older than the configured retention period.
Here we add a different query, which you select with the option `--keep-threads`.

We query the activities table for all context IDs where
    1. the newest activity with this context is still old
    2. none of the activities with this context is local
    3. none of the activities with this context is bookmarked
and delete all objects with these contexts.
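
The three conditions above can be illustrated in plain Elixir over in-memory maps. This is only a sketch under assumptions: the `ContextSketch` module and the `context`, `local`, `bookmarked` and `updated_at` keys are made up for illustration, and the actual task expresses the same rule as a single grouped SQL query rather than in application code.

```elixir
defmodule ContextSketch do
  # Hypothetical helper, not part of the patch: returns the contexts whose
  # objects would be pruned, given a list of activity-like maps.
  def deletable_contexts(activities, deadline) do
    activities
    |> Enum.group_by(& &1.context)
    |> Enum.filter(fn {_context, group} ->
      # 1. the newest activity in the context is older than the deadline
      newest = group |> Enum.map(& &1.updated_at) |> Enum.max(NaiveDateTime)

      # 2. no activity is local, 3. no activity is bookmarked
      NaiveDateTime.compare(newest, deadline) == :lt and
        not Enum.any?(group, & &1.local) and
        not Enum.any?(group, & &1.bookmarked)
    end)
    |> Enum.map(fn {context, _group} -> context end)
    |> Enum.sort()
  end
end

deadline = ~N[2023-01-01 00:00:00]

activities = [
  # "a": old, remote only, no bookmark -> pruned
  %{context: "a", local: false, bookmarked: false, updated_at: ~N[2022-06-01 00:00:00]},
  %{context: "a", local: false, bookmarked: false, updated_at: ~N[2022-06-02 00:00:00]},
  # "b": old post with a recent remote reply -> kept (rule 1)
  %{context: "b", local: false, bookmarked: false, updated_at: ~N[2022-06-01 00:00:00]},
  %{context: "b", local: false, bookmarked: false, updated_at: ~N[2023-01-05 00:00:00]},
  # "c": old remote post with a local like -> kept (rule 2)
  %{context: "c", local: false, bookmarked: false, updated_at: ~N[2022-06-01 00:00:00]},
  %{context: "c", local: true, bookmarked: false, updated_at: ~N[2022-06-03 00:00:00]},
  # "d": old remote post bookmarked by a local user -> kept (rule 3)
  %{context: "d", local: false, bookmarked: true, updated_at: ~N[2022-06-01 00:00:00]}
]

IO.inspect(ContextSketch.deletable_contexts(activities, deadline))
```

Only context `"a"` passes all three rules, so only its objects would be deleted; one recent, local, or bookmarked activity is enough to keep a whole thread.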

The idea is that posts with local activities (posts, replies, likes, repeats...) may be interesting to keep.
Besides that, a post lives in a certain context (the thread), so we keep the whole thread as well.

Caveats:
* ~~Quotes have a different context. Therefore, when someone quotes a post, it's possible the quoted post will still be deleted.~~ fixed in https://akkoma.dev/AkkomaGang/akkoma/pulls/379
* Although undocumented (in docs/docs/administration/CLI_tasks/database.md/#prune-old-remote-posts-from-the-database), the 'normal' delete action still kept old remote non-public posts. I added an option to keep this behaviour, but this also means that you now have to explicitly provide that option. **This could be considered a breaking change!**
* ~~Note that this removes from the objects table, but not from the activities.~~ See https://akkoma.dev/AkkomaGang/akkoma/pulls/427 for that.
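
Regarding the second caveat: "public" here means what the task's own check means, namely that the ActivityStreams public collection appears in `to` or `cc`. A minimal sketch in plain Elixir, assuming the post data is an already-decoded map; the `ScopeSketch` module is hypothetical, and the constant is inlined here instead of being read from `Pleroma.Constants.as_public()`:

```elixir
defmodule ScopeSketch do
  # Same value as Pleroma.Constants.as_public(), inlined for a standalone example
  @as_public "https://www.w3.org/ns/activitystreams#Public"

  # A post counts as public when the public collection is addressed in
  # either "to" or "cc". List.wrap/1 tolerates missing keys and bare strings.
  def public?(data) do
    @as_public in List.wrap(data["to"]) or @as_public in List.wrap(data["cc"])
  end
end

public = %{"to" => ["https://www.w3.org/ns/activitystreams#Public"], "cc" => []}
unlisted = %{"to" => [], "cc" => ["https://www.w3.org/ns/activitystreams#Public"]}
followers_only = %{"to" => ["https://remote.example/users/a/followers"], "cc" => []}

IO.inspect(Enum.map([public, unlisted, followers_only], &ScopeSketch.public?/1))
```

With `--keep-non-public`, only posts for which this check is true are candidates for pruning. Note that unlisted posts carry the public collection in `cc`, so they count as public under this check.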

Some statistics from `EXPLAIN ANALYZE` (times in milliseconds):
(cost=1402845.92..1933782.00 rows=3810907 width=62) (actual time=2562455.486..2562455.495 rows=0 loops=1)
 Planning Time: 505.327 ms
 Trigger for constraint chat_message_references_object_id_fkey: time=651939.797 calls=921740
 Trigger for constraint deliveries_object_id_fkey: time=52036.009 calls=921740
 Trigger for constraint hashtags_objects_object_id_fkey: time=20665.778 calls=921740
 Execution Time: 3287933.902 ms
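
To put these numbers in perspective, the snippet below converts the total to minutes and works out each trigger's share of the runtime and its cost per call. It is plain arithmetic on the figures quoted above, nothing more; the variable names are made up for the example.

```elixir
# Figures copied from the EXPLAIN ANALYZE output above (milliseconds)
execution_ms = 3_287_933.902
calls = 921_740

triggers = [
  {"chat_message_references_object_id_fkey", 651_939.797},
  {"deliveries_object_id_fkey", 52_036.009},
  {"hashtags_objects_object_id_fkey", 20_665.778}
]

total_minutes = Float.round(execution_ms / 60_000, 1)
IO.puts("total: #{total_minutes} minutes")

for {name, ms} <- triggers do
  share = Float.round(ms / execution_ms * 100, 1)
  per_call = Float.round(ms / calls, 3)
  IO.puts("#{name}: #{share}% of runtime, #{per_call} ms per call")
end
```

The whole run took roughly 55 minutes, and the `chat_message_references` foreign-key trigger alone accounts for about a fifth of that.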

***
**TODO**
1. [x] **Question:** Is it OK to keep it like this with regard to quote posts? If not (i.e. posts quoted by local users should also be kept), should we give quotes the same context as the post they are quoting? (If we don't want to give them the same context, I'll have to see how/if I can do it without being too costly)
    * See https://akkoma.dev/AkkomaGang/akkoma/pulls/379
2. [x] **Question:** the "original" query only deletes public posts (this is undocumented, but you can check the code). This new one doesn't care about scope. From the docs I gather that the idea is that posts can be refetched when needed. But I have it from a trusted source that Pleroma can't refetch non-public posts. I assume that's the reason why they are kept here. I see different options to deal with this:
    1. ~~We keep it as currently implemented and just don't care about scope with this option~~
    2. ~~We add logic to not delete non-public posts either (I'll have to see how costly that becomes)~~
    3. We add an extra --keep-non-public parameter. This is technically speaking breakage (you didn't have to provide a param before for this, now you do), but I'm inclined to not care much because it wasn't documented nor tested in the first place.
3. [x] See if we can do the query using Elixir
4. [x] Test on a bigger DB to see that we don't run into a timeout
5. [x] Add docs

Co-authored-by: ilja <git@ilja.space>
Reviewed-on: https://akkoma.dev/AkkomaGang/akkoma/pulls/350
Co-authored-by: ilja <akkoma.dev@ilja.space>
Co-committed-by: ilja <akkoma.dev@ilja.space>
CHANGELOG.md
docs/docs/administration/CLI_tasks/database.md
lib/mix/tasks/pleroma/database.ex
test/mix/tasks/pleroma/database_test.exs

index 8e638bdd8f6655464e8b89be741de1036b634143..c3e88f071628d18a8d7fb41dfee6ceed30523a97 100644 (file)
@@ -27,6 +27,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
   - Admin scopes will be dropped on create
 - Rich media will now backoff for 20 minutes after a failure
 - Quote posts are now considered as part of the same thread as the post they are quoting
+- Extend the mix task `prune_objects` with options to keep more relevant posts
 - Simplified HTTP signature processing
 - Rich media will now hard-exit after 5 seconds, to prevent timeline hangs
 - HTTP Content Security Policy is now far more strict to prevent any potential XSS/CSS leakages
index 73419dc816a4b493c8493fa445f923cc6b33d23d..915139cf7db0e5829551fba1ee78b140556dcb8b 100644 (file)
@@ -27,7 +27,7 @@ Replaces embedded objects with references to them in the `objects` table. Only n
 
 ## Prune old remote posts from the database
 
-This will prune remote posts older than 90 days (configurable with [`config :pleroma, :instance, remote_post_retention_days`](../../configuration/cheatsheet.md#instance)) from the database, they will be refetched from source when accessed.
+This will prune remote posts older than 90 days (configurable with [`config :pleroma, :instance, remote_post_retention_days`](../../configuration/cheatsheet.md#instance)) from the database. Pruned posts may be refetched in some cases.
 
 !!! danger
     The disk space will only be reclaimed after `VACUUM FULL`. You may run out of disk space during the execution of the task or vacuuming if you don't have about 1/3rds of the database size free.
@@ -45,6 +45,9 @@ This will prune remote posts older than 90 days (configurable with [`config :ple
     ```
 
 ### Options
+
+- `--keep-threads` - don't prune posts when they are part of a thread where at least one post has seen local interaction (e.g. one of the posts is a local post, or is favourited by a local user, or has been repeated by a local user...)
+- `--keep-non-public` - keep non-public posts like DM's and followers-only, even if they are remote
 - `--vacuum` - run `VACUUM FULL` after the objects are pruned
 
 ## Create a conversation for all existing DMs
@@ -178,4 +181,4 @@ to the current day.
 
     ```sh
     mix pleroma.database prune_task
-    ```
\ No newline at end of file
+    ```
index 272c9e3e536884af38007a0841850b2660b00057..be59e2271e6b748aacb2da44d71138a07ee2eb00 100644 (file)
@@ -67,33 +67,92 @@ defmodule Mix.Tasks.Pleroma.Database do
       OptionParser.parse(
         args,
         strict: [
-          vacuum: :boolean
+          vacuum: :boolean,
+          keep_threads: :boolean,
+          keep_non_public: :boolean
         ]
       )
 
     start_pleroma()
 
     deadline = Pleroma.Config.get([:instance, :remote_post_retention_days])
+    time_deadline = NaiveDateTime.utc_now() |> NaiveDateTime.add(-(deadline * 86_400))
 
-    Logger.info("Pruning objects older than #{deadline} days")
+    log_message = "Pruning objects older than #{deadline} days"
 
-    time_deadline =
-      NaiveDateTime.utc_now()
-      |> NaiveDateTime.add(-(deadline * 86_400))
+    log_message =
+      if Keyword.get(options, :keep_non_public) do
+        log_message <> ", keeping non public posts"
+      else
+        log_message
+      end
 
-    from(o in Object,
-      where:
-        fragment(
-          "?->'to' \\? ? OR ?->'cc' \\? ?",
-          o.data,
-          ^Pleroma.Constants.as_public(),
-          o.data,
-          ^Pleroma.Constants.as_public()
-        ),
-      where: o.inserted_at < ^time_deadline,
-      where:
+    log_message =
+      if Keyword.get(options, :keep_threads) do
+        log_message <> ", keeping threads intact"
+      else
+        log_message
+      end
+
+    Logger.info(log_message)
+
+    if Keyword.get(options, :keep_threads) do
+      # We want to delete objects from threads where
+      # 1. the newest post is still old
+      # 2. none of the activities is local
+      # 3. none of the activities is bookmarked
+      # 4. optionally none of the posts is non-public
+      deletable_context =
+        if Keyword.get(options, :keep_non_public) do
+          Pleroma.Activity
+          |> join(:left, [a], b in Pleroma.Bookmark, on: a.id == b.activity_id)
+          |> group_by([a], fragment("? ->> 'context'::text", a.data))
+          |> having(
+            [a],
+            not fragment(
+              # Posts (checked on Create Activity) is non-public
+              "bool_or((not(?->'to' \\? ? OR ?->'cc' \\? ?)) and ? ->> 'type' = 'Create')",
+              a.data,
+              ^Pleroma.Constants.as_public(),
+              a.data,
+              ^Pleroma.Constants.as_public(),
+              a.data
+            )
+          )
+        else
+          Pleroma.Activity
+          |> join(:left, [a], b in Pleroma.Bookmark, on: a.id == b.activity_id)
+          |> group_by([a], fragment("? ->> 'context'::text", a.data))
+        end
+        |> having([a], max(a.updated_at) < ^time_deadline)
+        |> having([a], not fragment("bool_or(?)", a.local))
+        |> having([_, b], fragment("max(?::text) is null", b.id))
+        |> select([a], fragment("? ->> 'context'::text", a.data))
+
+      Pleroma.Object
+      |> where([o], fragment("? ->> 'context'::text", o.data) in subquery(deletable_context))
+    else
+      if Keyword.get(options, :keep_non_public) do
+        Pleroma.Object
+        |> where(
+          [o],
+          fragment(
+            "?->'to' \\? ? OR ?->'cc' \\? ?",
+            o.data,
+            ^Pleroma.Constants.as_public(),
+            o.data,
+            ^Pleroma.Constants.as_public()
+          )
+        )
+      else
+        Pleroma.Object
+      end
+      |> where([o], o.updated_at < ^time_deadline)
+      |> where(
+        [o],
         fragment("split_part(?->>'actor', '/', 3) != ?", o.data, ^Pleroma.Web.Endpoint.host())
-    )
+      )
+    end
     |> Repo.delete_all(timeout: :infinity)
 
     prune_hashtags_query = """
index 7a1a759da2fae14b351cb78f23106644fb2d35cf..447a4404e26620061098885467a4e8ba1bd29765 100644 (file)
@@ -46,7 +46,44 @@ defmodule Mix.Tasks.Pleroma.DatabaseTest do
 
   describe "prune_objects" do
     test "it prunes old objects from the database" do
+      deadline = Pleroma.Config.get([:instance, :remote_post_retention_days]) + 1
+
+      date =
+        Timex.now()
+        |> Timex.shift(days: -deadline)
+        |> Timex.to_naive_datetime()
+        |> NaiveDateTime.truncate(:second)
+
       insert(:note)
+
+      %{id: note_remote_public_id} =
+        :note
+        |> insert()
+        |> Ecto.Changeset.change(%{updated_at: date})
+        |> Repo.update!()
+
+      note_remote_non_public =
+        %{id: note_remote_non_public_id, data: note_remote_non_public_data} =
+        :note
+        |> insert()
+
+      note_remote_non_public
+      |> Ecto.Changeset.change(%{
+        updated_at: date,
+        data: note_remote_non_public_data |> update_in(["to"], fn _ -> [] end)
+      })
+      |> Repo.update!()
+
+      assert length(Repo.all(Object)) == 3
+
+      Mix.Tasks.Pleroma.Database.run(["prune_objects"])
+
+      assert length(Repo.all(Object)) == 1
+      refute Object.get_by_id(note_remote_public_id)
+      refute Object.get_by_id(note_remote_non_public_id)
+    end
+
+    test "with the --keep-non-public option it still keeps non-public posts even if they are not local" do
       deadline = Pleroma.Config.get([:instance, :remote_post_retention_days]) + 1
 
       date =
@@ -55,18 +92,266 @@ defmodule Mix.Tasks.Pleroma.DatabaseTest do
         |> Timex.to_naive_datetime()
         |> NaiveDateTime.truncate(:second)
 
-      %{id: id} =
+      insert(:note)
+
+      %{id: note_remote_id} =
         :note
         |> insert()
-        |> Ecto.Changeset.change(%{inserted_at: date})
+        |> Ecto.Changeset.change(%{updated_at: date})
         |> Repo.update!()
 
+      note_remote_non_public =
+        %{data: note_remote_non_public_data} =
+        :note
+        |> insert()
+
+      note_remote_non_public
+      |> Ecto.Changeset.change(%{
+        updated_at: date,
+        data: note_remote_non_public_data |> update_in(["to"], fn _ -> [] end)
+      })
+      |> Repo.update!()
+
+      assert length(Repo.all(Object)) == 3
+
+      Mix.Tasks.Pleroma.Database.run(["prune_objects", "--keep-non-public"])
+
       assert length(Repo.all(Object)) == 2
+      refute Object.get_by_id(note_remote_id)
+    end
 
-      Mix.Tasks.Pleroma.Database.run(["prune_objects"])
+    test "with the --keep-threads and --keep-non-public option it keeps old threads with non-public replies even if the interaction is not local" do
+      # For non-public we only check Create Activities because only these are relevant for threads
+      # Flags are always non-public, Announces from relays can be non-public...
+      deadline = Pleroma.Config.get([:instance, :remote_post_retention_days]) + 1
+
+      old_insert_date =
+        Timex.now()
+        |> Timex.shift(days: -deadline)
+        |> Timex.to_naive_datetime()
+        |> NaiveDateTime.truncate(:second)
+
+      remote_user1 = insert(:user, local: false)
+      remote_user2 = insert(:user, local: false)
+
+      # Old remote non-public reply (should be kept)
+      {:ok, old_remote_post1_activity} =
+        CommonAPI.post(remote_user1, %{status: "some thing", local: false})
+
+      old_remote_post1_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      {:ok, old_remote_non_public_reply_activity} =
+        CommonAPI.post(remote_user2, %{
+          status: "some reply",
+          in_reply_to_status_id: old_remote_post1_activity.id
+        })
+
+      old_remote_non_public_reply_activity
+      |> Ecto.Changeset.change(%{
+        local: false,
+        updated_at: old_insert_date,
+        data: old_remote_non_public_reply_activity.data |> update_in(["to"], fn _ -> [] end)
+      })
+      |> Repo.update!()
+
+      # Old remote non-public Announce (should be removed)
+      {:ok, old_remote_post2_activity = %{data: %{"object" => old_remote_post2_id}}} =
+        CommonAPI.post(remote_user1, %{status: "some thing", local: false})
+
+      old_remote_post2_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      {:ok, old_remote_non_public_repeat_activity} =
+        CommonAPI.repeat(old_remote_post2_activity.id, remote_user2)
+
+      old_remote_non_public_repeat_activity
+      |> Ecto.Changeset.change(%{
+        local: false,
+        updated_at: old_insert_date,
+        data: old_remote_non_public_repeat_activity.data |> update_in(["to"], fn _ -> [] end)
+      })
+      |> Repo.update!()
+
+      assert length(Repo.all(Object)) == 3
+
+      Mix.Tasks.Pleroma.Database.run(["prune_objects", "--keep-threads", "--keep-non-public"])
+
+      Repo.all(Pleroma.Activity)
+      assert length(Repo.all(Object)) == 2
+      refute Object.get_by_ap_id(old_remote_post2_id)
+    end
+
+    test "with the --keep-threads option it still keeps non-old threads even with no local interactions" do
+      remote_user = insert(:user, local: false)
+      remote_user2 = insert(:user, local: false)
+
+      {:ok, remote_post_activity} =
+        CommonAPI.post(remote_user, %{status: "some thing", local: false})
+
+      {:ok, remote_post_reply_activity} =
+        CommonAPI.post(remote_user2, %{
+          status: "some reply",
+          in_reply_to_status_id: remote_post_activity.id
+        })
+
+      remote_post_activity
+      |> Ecto.Changeset.change(%{local: false})
+      |> Repo.update!()
+
+      remote_post_reply_activity
+      |> Ecto.Changeset.change(%{local: false})
+      |> Repo.update!()
+
+      assert length(Repo.all(Object)) == 2
+
+      Mix.Tasks.Pleroma.Database.run(["prune_objects", "--keep-threads"])
+
+      assert length(Repo.all(Object)) == 2
+    end
+
+    test "with the --keep-threads option it deletes old threads with no local interaction" do
+      deadline = Pleroma.Config.get([:instance, :remote_post_retention_days]) + 1
+
+      old_insert_date =
+        Timex.now()
+        |> Timex.shift(days: -deadline)
+        |> Timex.to_naive_datetime()
+        |> NaiveDateTime.truncate(:second)
+
+      remote_user = insert(:user, local: false)
+      remote_user2 = insert(:user, local: false)
+
+      {:ok, old_remote_post_activity} =
+        CommonAPI.post(remote_user, %{status: "some thing", local: false})
+
+      old_remote_post_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      {:ok, old_remote_post_reply_activity} =
+        CommonAPI.post(remote_user2, %{
+          status: "some reply",
+          in_reply_to_status_id: old_remote_post_activity.id
+        })
+
+      old_remote_post_reply_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      {:ok, old_favourite_activity} =
+        CommonAPI.favorite(remote_user2, old_remote_post_activity.id)
+
+      old_favourite_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      {:ok, old_repeat_activity} = CommonAPI.repeat(old_remote_post_activity.id, remote_user2)
+
+      old_repeat_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      assert length(Repo.all(Object)) == 2
+
+      Mix.Tasks.Pleroma.Database.run(["prune_objects", "--keep-threads"])
+
+      assert length(Repo.all(Object)) == 0
+    end
+
+    test "with the --keep-threads option it keeps old threads with local interaction" do
+      deadline = Pleroma.Config.get([:instance, :remote_post_retention_days]) + 1
+
+      old_insert_date =
+        Timex.now()
+        |> Timex.shift(days: -deadline)
+        |> Timex.to_naive_datetime()
+        |> NaiveDateTime.truncate(:second)
+
+      remote_user = insert(:user, local: false)
+      local_user = insert(:user, local: true)
+
+      # local reply
+      {:ok, old_remote_post1_activity} =
+        CommonAPI.post(remote_user, %{status: "some thing", local: false})
+
+      old_remote_post1_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      {:ok, old_local_post2_reply_activity} =
+        CommonAPI.post(local_user, %{
+          status: "some reply",
+          in_reply_to_status_id: old_remote_post1_activity.id
+        })
+
+      old_local_post2_reply_activity
+      |> Ecto.Changeset.change(%{local: true, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      # local Like
+      {:ok, old_remote_post3_activity} =
+        CommonAPI.post(remote_user, %{status: "some thing", local: false})
+
+      old_remote_post3_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      {:ok, old_favourite_activity} = CommonAPI.favorite(local_user, old_remote_post3_activity.id)
+
+      old_favourite_activity
+      |> Ecto.Changeset.change(%{local: true, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      # local Announce
+      {:ok, old_remote_post4_activity} =
+        CommonAPI.post(remote_user, %{status: "some thing", local: false})
+
+      old_remote_post4_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      {:ok, old_repeat_activity} = CommonAPI.repeat(old_remote_post4_activity.id, local_user)
+
+      old_repeat_activity
+      |> Ecto.Changeset.change(%{local: true, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      assert length(Repo.all(Object)) == 4
+
+      Mix.Tasks.Pleroma.Database.run(["prune_objects", "--keep-threads"])
+
+      assert length(Repo.all(Object)) == 4
+    end
+
+    test "with the --keep-threads option it keeps old threads with bookmarked posts" do
+      deadline = Pleroma.Config.get([:instance, :remote_post_retention_days]) + 1
+
+      old_insert_date =
+        Timex.now()
+        |> Timex.shift(days: -deadline)
+        |> Timex.to_naive_datetime()
+        |> NaiveDateTime.truncate(:second)
+
+      remote_user = insert(:user, local: false)
+      local_user = insert(:user, local: true)
+
+      {:ok, old_remote_post_activity} =
+        CommonAPI.post(remote_user, %{status: "some thing", local: false})
+
+      old_remote_post_activity
+      |> Ecto.Changeset.change(%{local: false, updated_at: old_insert_date})
+      |> Repo.update!()
+
+      Pleroma.Bookmark.create(local_user.id, old_remote_post_activity.id)
+
+      assert length(Repo.all(Object)) == 1
+
+      Mix.Tasks.Pleroma.Database.run(["prune_objects", "--keep-threads"])
 
       assert length(Repo.all(Object)) == 1
-      refute Object.get_by_id(id)
     end
   end