Allow passing directory as input
Summary:
**Why?**
The fbsource cxx skycastle indexing workflow takes about 20 hours to complete with ownership enabled, and the bottleneck is writing.
https://www.internalfb.com/sandcastle/workflow/1747396655424399432
In that workflow we index targets in batches of size 3072 and merge each batch in chunks of 1024. We could merge larger chunks to deduplicate more and improve writing speed, but a big chunk makes the merge itself slow, so the total time does not improve.

Another way to merge more is to run a second merge pass over all the already-merged chunks. This merges the whole batch while keeping per-step performance similar.

That second pass is implemented in the next diff, but as a prerequisite `glean merge` needs to accept directories as input so they can be passed from the bxl script, which is implemented here.
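The directory-expansion behavior added in this diff can be sketched outside the CLI. This is a minimal, self-contained approximation using only the standard `directory` and `filepath` packages (the real change uses `listFiles` from the `extra` package); the module and function names here are illustrative, not the actual Glean code:

```haskell
module ExpandPath (expandPath) where

import Control.Monad (filterM, unless)
import System.Directory
  ( createDirectoryIfMissing
  , doesDirectoryExist
  , doesFileExist
  , listDirectory
  )
import System.FilePath ((</>))

-- Expand a PATH argument the way the new merge option does:
-- a directory yields the regular files directly inside it
-- (non-recursive), while any other path yields just itself.
expandPath :: FilePath -> IO [FilePath]
expandPath path = do
  isDir <- doesDirectoryExist path
  if isDir
    then do
      entries <- listDirectory path             -- names relative to the directory
      filterM doesFileExist (map (path </>) entries)  -- drop subdirectories
    else return [path]
```

With this shape, the CLI can keep treating every expanded entry as an ordinary fact file: expanding each argument and concatenating the results recovers a flat `[FilePath]` list, which is exactly what the existing merge pipeline consumes.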

Reviewed By: malanka

Differential Revision: D64594579

fbshipit-source-id: 2cf49576864a44fe702e3aef4eec657bf2510c78
iamirzhan authored and facebook-github-bot committed Oct 21, 2024
1 parent f4b12a1 commit 2abceaa
Showing 1 changed file with 11 additions and 3 deletions.
14 changes: 11 additions & 3 deletions glean/tools/gleancli/GleanCLI/Merge.hs
@@ -39,6 +39,7 @@ import GleanCLI.Types
import GleanCLI.Common (dbOpts, fileFormatOpt, FileFormat (..))
import Glean.Write (fileToBatches)
import Glean.Write.JSON (buildJsonBatch)
+import System.Directory.Extra (listFiles)

data MergeCommand = MergeCommand
{ mergeFiles :: [FilePath]
@@ -58,8 +59,8 @@ inventoryOpt = strOption $
instance Plugin MergeCommand where
parseCommand = commandParser "merge" (progDesc "Merge fact files") $ do
mergeFiles <- many $ strArgument (
-      metavar "FILE" <>
-      help ("File of facts, either in json or binary format. "
+      metavar "PATH" <>
+      help ("File or directory of facts, either in json or binary format. "
<> "For json format specify the database"))
mergeFileSize <- option auto $
long "max-file-size" <>
@@ -88,12 +89,19 @@ instance Plugin MergeCommand where
createDirectoryIfMissing True mergeOutDir
hSetBuffering stderr LineBuffering
outputs <- newIORef []
-      stream 1 (merge fileFormat inventory dbSchema mergeFiles)
+      expandedMergeFiles <- mapM expandFile mergeFiles
+      stream 1 (merge fileFormat inventory dbSchema $ concat expandedMergeFiles)
(writeToFile outputs)
-- stream overlaps writing with reading
files <- readIORef outputs
L.putStrLn (Aeson.encode (Aeson.toJSON files))
where
+      expandFile :: FilePath -> IO [FilePath]
+      expandFile file = do
+        isDirectory <- doesDirectoryExist file
+        if isDirectory
+          then listFiles file
+          else return [file]
factSetSize :: FactSet -> IO Int
factSetSize f = do
c <- FactSet.factCount f
