WikiMovimentoBrasil/arquivonacional


Arquivo Nacional

This is a collection of tools for scraping SIAN, the website of the National Archives of Brazil: https://sian.an.gov.br/. The website hosts a range of national documents, organized into dossiers according to their origin. All files are saved as they are provided by the source, most of them as PDF files.

Running the scripts

fundo.php

This script generates two lists:

  1. a list of all pages of the specified fund on SIAN's website
  2. a list of links to all files on the specified page, each with its respective dossier link.

You must edit the file in two places before running it.

First, edit your login credentials:

//Login
$params2 = [
		"login" 		=> "YOUR_LOGIN",
		"senha" 		=> "YOUR_PASSWORD",

Then, on line 82, replace ENDEREÇO with the URL where the script is hosted. If the server is running locally, use localhost.

	header("Location: https://ENDEREÇO/fundo.php?colecao=".$_GET["colecao"]."&pag=".$_GET["pag"]."&time=".time());
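As an illustration of what this line produces, the following is a minimal sketch that builds the same kind of redirect URL from a single configurable value. It is not part of fundo.php: the BASE_URL constant and the buildRedirectUrl() helper are hypothetical names introduced here, with BASE_URL standing in for the ENDEREÇO placeholder.

```php
<?php
// Hypothetical helper, not part of fundo.php.
// BASE_URL is an assumed constant standing in for the ENDEREÇO placeholder.
const BASE_URL = 'https://localhost';

function buildRedirectUrl(string $script, array $query): string
{
    // Append a timestamp, as the original header() call does with time().
    $query['time'] = time();

    // http_build_query() URL-encodes each value and joins the pairs with "&".
    return BASE_URL . '/' . $script . '?' . http_build_query($query);
}

// Example values; in the real script they come from $_GET.
$url = buildRedirectUrl('fundo.php', ['colecao' => '123', 'pag' => '2']);
echo $url, PHP_EOL;
```

Keeping the host in one place means only that value changes when the script moves to another server. Unlike plain string concatenation, http_build_query() also encodes the parameter values.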

To run the script, open it in your browser and replace ID with the ID of the fund you want to download:

https://localhost/fundo.php?colecao=ID

For performance reasons, we recommend using the script together with IDM's Grabber tool or another download manager of your choice.

dossie.php

This script generates local HTML files with the full content of the pages of a dossier listed on SIAN's website. You must edit the file in two places before running it.

First, edit your login credentials:

//Login
$params2 = [
		"login" 		=> "YOUR_LOGIN",
		"senha" 		=> "YOUR_PASSWORD",

Then, on line 66, replace ENDEREÇO with the URL where the script is hosted. If the server is running locally, use localhost.

	header("Location: https://ENDEREÇO/dossie.php?id=".$_GET["id"]."&time=".time());

To run the script, open it in your browser and replace ID with the ID of the file you want to download:

https://localhost/dossie.php?id=ID

tabelador.php

This script converts the HTML files generated by dossie.php into a CSV file. The resulting file contains information such as the code of the PDF related to that dossier or fund, the PDF's title, and the date on which it was created. It may require some minor manual editing.

To run the script, place it in the same folder as the dossier files you wish to convert, together with the file Html2Text.php, which can be found here: https://github.com/mtibben/html2text/blob/master/src/Html2Text.php

Then, use the following command:

php tabelador.php
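To give a rough idea of this kind of conversion, here is a minimal sketch that reduces an HTML fragment to plain text and writes rows to CSV. It is not the actual logic of tabelador.php: strip_tags() is used as a simple stand-in for Html2Text, the helper names are hypothetical, and the sample row is invented. The three columns follow the fields described above (code, title, date).

```php
<?php
// Hypothetical sketch, not the actual tabelador.php logic.
// strip_tags() stands in for Html2Text; the sample values are invented.

function htmlToPlainText(string $html): string
{
    // Drop tags, decode HTML entities, and collapse runs of whitespace.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('/\s+/u', ' ', $text));
}

function rowsToCsv(array $rows): string
{
    // Write to an in-memory stream so fputcsv() handles quoting for us.
    $handle = fopen('php://memory', 'w+');
    foreach ($rows as $row) {
        fputcsv($handle, $row, ',', '"', '\\');
    }
    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);
    return $csv;
}

$rows = [
    ['code', 'title', 'date'],
    ['ABC_001', htmlToPlainText('<b>Sample &amp; title</b>'), '1970'],
];
echo rowsToCsv($rows);
```

Using fputcsv() rather than joining fields by hand keeps titles that contain commas or quotes in a single column, which reduces the manual clean-up needed afterwards.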

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

GNU General Public License v3.0

Credits

This application was developed by the Wiki Movimento Brasil User Group, supported by the University of São Paulo and the University of São Paulo Support Foundation.
